MySQL datatypes
We all might have gone through the confusion of choosing among the different string related datatypes for data storage, and so have I, many a times.Thus here I have compiled all the info that I had, got and could gather into a single post so as to give a better understanding of things which need to be factored while much such critical decisions.
As we all are aware that the following are the string data-types that are supported by MySQL
Just for your reference I have put hyperlinks for the corresponding mysql manual doc page.
I have divided the post into 4 sets, each discussing about the two datatypes, and how to choose between them
CHAR/VARCHARBINARY/VARBINARY
BLOB/TEXT
ENUM/SET
and in section with BLOB and TEXT we discuss about the things should be considered when deciding between all the the first 6 dataypes.
We will start with discussing about their basic propeties and a little bit about their mysql internals.
CHAR
- Trailing white space is always removed.
- Every record takes up a fixed amount of space - 19 characters in your case. Your 17 character records will waste two bytes of storage. ( Internally - values are right-padded with spaces to the specified length and when retrieved, trailing spaces are removed.)(also assuming latin1 charset)
- Faster SELECTs when searching this column.(although you wont gain any speed by using a CHAR in a table that also contains a VARCHAR.) - Max length 255
VARCHAR
- Any trailing whitespace is preserved. (MySQL 5.0.3+ only.)
- Minimal storage overhead. A 17 character record takes up only 17(+ prefix in this case 1 byte) characters of space, even if the length is specified as 19(assuming latin1 charset)
- 1 byte prefix till length 255, then 2 bytes prefix afterwards
- Slower SELECTs when searching this column.
- Max length is subject to the maximum row size of 65,535 bytes
Value | CHAR(4) | Storage Required | VARCHAR(4) | Storage Required |
---|---|---|---|---|
'' | ' ' | 4 bytes | '' | 1 byte |
'ab' | 'ab ' | 4 bytes | 'ab' | 3 bytes |
'abcd' | 'abcd' | 4 bytes | 'abcd' | 5 bytes |
'abcdefgh' | 'abcd' | 4 bytes | 'abcd' | 5 bytes |
Important -
- To calculate the number of bytes used to store a particular
CHAR
,VARCHAR
, orTEXT
column value, you must take into account the character set used for that column and whether the value contains multi-byte characters. In particular, when using theutf8
Unicode character set, you must keep in mind that not all characters use the same number of bytes and can require up to three bytes per character. - The effective maximum number of bytes that can be stored in a VARCHAR or VARBINARY column is subject to the maximum row size of 65,535 bytes, which is shared among all columns(of the row)
- For a VARCHAR column that stores multi-byte characters, the effective maximum number of characters is less. Like, utf8 characters can require up to three bytes per character, so a VARCHAR column that uses the utf8 character set can be declared to be a maximum of 21,844 characters
I am considering that you have got a fair idea about both the data types by now. Now how to decide, when to use CHAR and when VARCHAR?
Well there is no silver bullet for this, we will have to deal with this on case to case basis.
The most important questions you need to ask while deciding among these two is
1. How many bytes am I gonna save per record on an average if I use VARCHAR instead of CHAR?
- If on an average we are able to save upto say 5 percent of total number of bytes(length) per record then CHAR would be the best option, although if the saving increases VARCHAR becomes a more favourable candidate.(do consider this - these days disk space is cheaper than cola)
2. What type of operations are you gonna run on this column ?
- If there are a lot of deletes and updates in your table then there is definitely gonna be fragmantation issues, and if in this case VARCHAR is used then the fragmantations will occur more frequently cos on every update(with smaller data) mysql will not utilise the freed-up space(resulting in fragmantation issues). Thus in these kind of scenarios CHAR is recommended atleast there would be less fragmantation related issues.
3. What type of queries will you be running on this table?
- If there are gonna be primary lookups for one row at a time, then it will probably take your HDD almost the same amount of time to look up a 250 byte(CHAR) row compared to a 40 byte (VARCHAR) row, as the largest expense will be disk seek time/number of disk seeks, not I/O throughput.
No formal benchmarks are run so I have not guessed numbers, but these factors should be factored in.
So in most cases I would be using CHAR over VARCHAR even though I would be wasting a few bites per record. But if I have no idea whatsoever on the lengths of the user input, I would go with varchar(255) as the best option.
Lets move on to the other datatypes
BINARY
- Similar to CHAR only they contain binary strings rather than nonbinary strings
- Maxlength same as CHAR, except here we count the bytes occupied by the character instead of the character count
- Values are right-padded with
0x00
(instead of space in CHAR datatype , although in older MySQL version its space padded)
- For a
BINARY(3)
column, 'a '
becomes 'a \0'
when inserted. 'a\0'
becomes 'a\0\0'
when inserted. Both inserted values remain unchanged when selected(Unlike in CHAR where the padded space is removed while select)
VARBINARY
- Similar to VARCHAR only they contain binary strings rather than nonbinary strings
- Maxlength same as VARCHAR, except here we count the bytes occupied by the character instead of the character count
- As these two dataypes are binary they dont have any charset
- and sorting and comparison are based on the numeric values of the bytes in the values
- not to be confused with
CHAR BINARY
andVARCHAR BINARY
data types,CHAR(5) BINARY
is treated asCHAR(5) CHARACTER SET latin1 COLLATE latin1_bin
- when BINARY or VARBINARY values are stored the standard character set of the operating system is used to translate each character (Or byte values for multi-byte character sets.)
Well now when deciding between these two we ask the same questions, as we did in CHAR/VARCHAR
Although if you thinking on using BINARY/VARBINARY over CHAR/VARCHAR consider thes points
- CHAR/VARCHAR/TEXT compare characters using a "collation". BINARY/VARBINARY/BLOB compare bytes.
- Most collations are "case insensitive", so upper case and lower case are considered equal, WHERE, ORDER BY, etc, use collations to order text strings in more complicated ways.
- It is generally best to use CHAR (etc) for text, and BINARY (etc) for streams of bytes (eg, bytes of an image jpeg) and other non-text stuff.
Lets now move on to BLOB and TEXT
- treated as binary strings (byte strings) thus no character set, and sorting and comparison are based on the numeric values of the bytes in column values
- in most respects, a BLOB column is lika VARBINARY(expect a few differences we will take below)
- prefix of 1,2,3 or 4 bytes depending on length
TEXT
- treated as nonbinary strings (character strings) and have a character set, and values are sorted and compared based on the collation of the character set(similar to CHAR and VARCHAR)
- in most respects, a TEXT column is like VARCHAR(expect a few differences we will take below)
- prefix of 1,2,3 or 4 bytes depending on length
- If you use the BINARY attribute with a TEXT data type, the column is assigned the binary collation of the column character set(as is the case with CHAR BINARY, VARCHAR BINARY)
The BLOB and TEXT types are devided into 4 types
- TINY - prefix 1 byte, maximum length 28
- TEXT/BLOB - prefix 2 bytes, maximum length 216
- MEDIUM - prefix 3 bytes, maximum length 224
- LONG - prefix 4 bytes, maximum length 232
BLOB and TEXT have the following differences from/relations with VARBINARY and VARCHAR
- Strictly all the trailing spaces are retained in BLOB and TEXT in all MySQL version. While for VARBINARY and VARCHAR depends on the version
- while defining indexes on BLOB, TEXT you need to define the index length, contrary to optional index length in VARBINARY, VARCHAR
- BLOB, TEXT cant have default values
- LONG and LONG VARCHAR map to the MEDIUMTEXT data type.
- MySQL Connector/ODBC defines BLOB values as LONGVARBINARY and TEXT values as LONGVARCHAR
- VARCHAR up to 255 and TINYTEXT are virtually identical. Each take 1 byte for length, plus the actual length.
- TEXT is better to use in the case that you are going to be using some type of search function on your website that uses FULL-TEXT searching (and works much better than searching using the LIKE statement).
- Blob can have it's uses when you store HTML pages inside of a database and you are not using a search feature or you don't really need a more efficient way of storing data, and as stated, is more friendly sometimes with encoding.
- Each BLOB or TEXT value is represented internally by a separately allocated object. This is in contrast to all other data types, for which storage is allocated once per column when the table is opened, so there can be small overhead on reads/writes. Thus avoid using
SELECT * which selects all columns(including TEXT/BLOB).
- VARCHAR 256..65536 (if you have a newer version) and TEXT use a 2-byte length. You pay 1 byte, but get a huge increase in the max allowed(216).
- Strictly all the trailing spaces are retained in BLOB and TEXT in all MySQL versions - so if you want to store binary values such as results from an encryption or compression function that might contain arbitrary byte values, use a BLOB column rather than a CHAR or VARCHAR column, to avoid potential problems with trailing space removal that would change data values.
- Because BLOB and TEXT values can be extremely long, you might encounter some constraints in using them like
- Only the first
max_sort_length
bytes of the column are used when sorting.
ENUM and SET columns provide an efficient way to define columns that can contain only a given set of values, thus they should strictly be used in such scenarios only(and you will never need to store additional related info).Now the ENUM and SET datatypes
Because ALTER TABLE is a lot "harder" --on sensibility, maintenance, portability, and resources-- than an INSERT and a million joins.
ENUM
- String object with a value chosen from a list of permitted values
- Always have a default value. If you specify no default value, then it is NULL for columns that can have NULL, otherwise it is the first enumeration value in the column definition.
- Incorrect values are stored as index 1 and returned as space
- Can have a maximum of 65,535 elements.
- Trailing spaces are automotically deleted
- Character set and collations can be assigned
For example, a column specified as
ENUM('one', 'two', 'three')
Value | Index |
---|---|
NULL | NULL |
'' | 0 |
'one' | 1 |
'two' | 2 |
'three' | 3 |
SET
- String object that can have zero or more values, each of which must be chosen from a list of permitted values
- SET member value cannot have a comma in it, as the members are specified separated by commas
- Can have a maximum of 64 different members.
- Trailing spaces are automatically deleted
- Character set and collations can be assigned
- SET values are stored numerically, with the low-order bit of the stored value corresponding to the first set member(as it can have zero or more values)
For a column specified as SET (
'a','b','c','d')
SET Member | Decimal Value | Binary Value |
---|---|---|
'a' | 1 | 0001 |
'b' | 2 | 0010 |
'c' | 4 | 0100 |
'd' | 8 | 1000 |
9
to this column, that is 1001
in binary, so the first and fourth SET value members 'a'
and 'd'
are selected and the resulting value is 'a,d'
.
Ok so now when to use ENUM and when to use SET, consider these
- ENUM can have many more values than SET
- ENUM - only one of the choices, SET - zero to all of the choices
- Well apart from these I coundnt find anymore, if someone finds something then please put in comment, I will incorportate them in my post
I would love to hear from you, suggestions welcome
Reference sites - dev.mysql.com, forums.mysql.com, stackoverflow.com
So Long,
Sandeep Rajoria
Comments