Skip to main content

MySQL - What should I use varchar, char, text or blob and when?


MySQL datatypes

We all might have gone through the confusion of choosing among the different string related datatypes for data storage, and so have I, many a times.
Thus here I have compiled all the info that I had, got and could gather into a single post so as to give a better understanding of things which need to be factored while much such critical decisions.

As we all are aware that the following are the string data-types that are supported by MySQL
  1. CHAR 
  2. VARCHAR 
  3. BINARY 
  4. VARBINARY 
  5. BLOB 
  6. TEXT 
  7. ENUM 
  8. and SET.
 Just for your reference I have put hyperlinks for the corresponding  mysql manual doc page.

I have divided the post into 4 sets, each discussing about the two datatypes, and how to choose between them
CHAR/VARCHAR
BINARY/VARBINARY
BLOB/TEXT
ENUM/SET
and in section with BLOB and TEXT we discuss about the things should be considered when deciding between all the the first 6 dataypes.

We will start with discussing about their basic propeties and a little bit about their mysql internals.


CHAR
- Trailing white space is always removed.
- Every record takes up a fixed amount of space - 19 characters in your case. Your 17 character records will waste two bytes of storage. ( Internally - values are right-padded with spaces to the specified length and when retrieved, trailing spaces are removed.)(also assuming latin1 charset)
- Faster SELECTs when searching this column.(although you wont gain any speed by using a CHAR in a table that also contains a VARCHAR.) - Max length 255

VARCHAR
- Any trailing whitespace is preserved. (MySQL 5.0.3+ only.)
- Minimal storage overhead. A 17 character record takes up only 17(+ prefix in this case 1 byte) characters of space, even if the length is specified as 19(assuming latin1 charset)
- 1 byte prefix till length 255, then 2 bytes prefix afterwards
- Slower SELECTs when searching this column.
- Max length is subject to the maximum row size of 65,535 bytes


ValueCHAR(4)Storage RequiredVARCHAR(4)Storage Required
'''    '4 bytes''1 byte
'ab''ab  '4 bytes'ab'3 bytes
'abcd''abcd'4 bytes'abcd'5 bytes
'abcdefgh''abcd'4 bytes'abcd'5 bytes
source mysql reference manual (assuming latin1 charset)

Important -  
  • To calculate the number of bytes used to store a particular CHAR, VARCHAR, or TEXT column value, you must take into account the character set used for that column and whether the value contains multi-byte characters. In particular, when using the utf8 Unicode character set, you must keep in mind that not all characters use the same number of bytes and can require up to three bytes per character.
  • The effective maximum number of bytes that can be stored in a VARCHAR or VARBINARY column is subject to the maximum row size of 65,535 bytes, which is shared among all columns(of the row) 
  • For a VARCHAR column that stores multi-byte characters, the effective maximum number of characters is less. Like, utf8 characters can require up to three bytes per character, so a VARCHAR column that uses the utf8 character set can be declared to be a maximum of 21,844 characters  
I am considering that you have got a fair idea about both the data types by now. Now how to decide, when to use CHAR and when VARCHAR?
Well there is no silver bullet for this, we will have to deal with this on case to case basis.

The most important questions you need to ask while deciding among these two is
1. How many bytes am I gonna save per record on an average if I use VARCHAR instead of CHAR?
  • If on an average we are able to save upto say 5 percent of total number of bytes(length) per record then CHAR would be the best option, although if the saving increases VARCHAR becomes a more favourable candidate.(do consider this - these days disk space is cheaper than cola)
2. What type of operations are you gonna run on this column ?
  • If there are a lot of deletes and updates in your table then there is definitely gonna be fragmantation issues, and if in this case VARCHAR is used then the fragmantations will occur more frequently cos on every update(with smaller data) mysql will not utilise the freed-up space(resulting in fragmantation issues). Thus in these kind of scenarios CHAR is recommended atleast there would be less fragmantation related issues.
3. What type of queries will you be running on this table?
  • If there are gonna be primary lookups for one row at a time, then it will probably take your HDD almost the same amount of time to look up a 250 byte(CHAR) row compared to a 40 byte (VARCHAR) row, as the largest expense will be disk seek time/number of disk seeks, not I/O throughput.
No formal benchmarks are run so I have not guessed numbers, but these factors should be factored in.
So in most cases I would  be using CHAR over VARCHAR even though I would be wasting a few bites per record. But if I have no idea whatsoever on the lengths of the user input, I would go with varchar(255) as the best option.


Lets move on to the other datatypes

BINARY
- Similar to CHAR only they contain binary strings rather than nonbinary strings
- Maxlength same as CHAR, except here we count the bytes occupied by the character instead of the character count
- Values are right-padded with 0x00(instead of space in CHAR datatype , although in older MySQL version its space padded)
- For a BINARY(3) column, 'a ' becomes 'a \0' when inserted. 'a\0' becomes 'a\0\0' when inserted. Both inserted values remain unchanged when selected(Unlike in CHAR where the padded space is removed while select)

VARBINARY
- Similar to VARCHAR only they contain binary strings rather than nonbinary strings
- Maxlength same as VARCHAR, except here we count the bytes occupied by the character instead of the character count
  • As these two dataypes are binary they dont have any charset
  • and sorting and comparison are based on the numeric values of the bytes in the values
  • not to be confused with CHAR BINARY and VARCHAR BINARY data types, CHAR(5) BINARY is treated as CHAR(5) CHARACTER SET latin1 COLLATE latin1_bin
  • when BINARY or VARBINARY values are stored the standard character set of the operating system is used to translate each character (Or byte values for multi-byte character sets.)
The above 4 points hold true for BLOB datatype as well

Well now when deciding between these two we ask the same questions, as we did in CHAR/VARCHAR
Although if you thinking on using BINARY/VARBINARY over CHAR/VARCHAR consider thes points
  • CHAR/VARCHAR/TEXT compare characters using a "collation". BINARY/VARBINARY/BLOB compare bytes.
  • Most collations are "case insensitive", so upper case and lower case are considered equal, WHERE, ORDER BY, etc, use collations to order text strings in more complicated ways.
  • It is generally best to use CHAR (etc) for text, and BINARY (etc) for streams of bytes (eg, bytes of an image jpeg) and other non-text stuff. 

Lets now move on to BLOB and TEXT

BLOB
- treated as binary strings (byte strings) thus no character set, and sorting and comparison are based on the numeric values of the bytes in column values
- in most respects, a BLOB column is lika VARBINARY(expect a few differences we will take below)
- prefix of 1,2,3 or 4 bytes depending on length
TEXT
- treated as nonbinary strings (character strings) and have a character set, and values are sorted and compared based on the collation of the character set(similar to CHAR and VARCHAR)
- in most respects, a TEXT column is like VARCHAR(expect a few differences we will take below)
- prefix of 1,2,3 or 4 bytes depending on length
- If you use the BINARY attribute with a TEXT data type, the column is assigned the binary collation of the column character set(as is the case with CHAR BINARY, VARCHAR BINARY)
The BLOB and TEXT types are devided into 4 types
  1. TINY - prefix 1 byte, maximum length 28
  2. TEXT/BLOB - prefix 2 bytes, maximum length 216
  3. MEDIUM - prefix 3 bytes, maximum length 224
  4. LONG - prefix 4 bytes, maximum length 232
 BLOB and TEXT have the following differences from/relations with VARBINARY and VARCHAR  
  • Strictly all the trailing spaces are retained in BLOB and TEXT in all MySQL version. While for VARBINARY and VARCHAR depends on the version
  • while defining indexes on BLOB, TEXT you need to define the index length, contrary to optional index length in VARBINARY, VARCHAR 
  • BLOB, TEXT cant have default values
  • LONG and LONG VARCHAR map to the MEDIUMTEXT data type.
  • MySQL Connector/ODBC defines BLOB values as LONGVARBINARY and TEXT values as LONGVARCHAR
  • VARCHAR up to 255 and TINYTEXT are virtually identical. Each take 1 byte for length, plus the actual length.  
Well deciding between BLOBs and TEXTs is exactly like deciding between BINARY and CHAR, but what is more important is deciding between BLOBs and BINARY/VARBINARY, TEXTs and CHAR/VARCHAR, but still these can be considered
  • TEXT is better to use in the case that you are going to be using some type of search function on your website that uses FULL-TEXT searching (and works much better than searching using the LIKE statement).
  • Blob can have it's uses when you store HTML pages inside of a database and you are not using a search feature or you don't really need a more efficient way of storing data, and as stated, is more friendly sometimes with encoding.
As we have already discussed how to decide among BINARY and VARBINARY or CHAR and VARCHAR, we now should discuss either to prefer BLOB over the first two or TEXT over the latter two. The following things should be considered
  • Each BLOB or TEXT  value is represented internally by a separately allocated object. This is in contrast to all other data types, for which storage is allocated once per column when the table is opened, so there can be small overhead on reads/writes. Thus avoid using SELECT * which selects all columns(including TEXT/BLOB).
  • VARCHAR 256..65536 (if you have a newer version) and TEXT use a 2-byte length. You pay 1 byte, but get a huge increase in the max allowed(216).
  • Strictly all the trailing spaces are retained in BLOB and TEXT in all MySQL versions - so if you want to store binary values such as results from an encryption or compression function that might contain arbitrary byte values, use a BLOB column rather than a CHAR or VARCHAR column, to avoid potential problems with trailing space removal that would change data values. 
  • Because BLOB and TEXT values can be extremely long, you might encounter some constraints in using them like
    • Only the first max_sort_length bytes of the column are used when sorting.

 Now the ENUM and SET  datatypes

ENUM and SET columns provide an efficient way to define columns that can contain only a given set of values, thus they should strictly be used in such scenarios only(and you will never need to store additional related info).
Because ALTER TABLE is a lot "harder" --on sensibility, maintenance, portability, and resources-- than an INSERT and a million joins.

ENUM
- String object with a value chosen from a list of permitted values 
- Always have a default value. If you specify no default value, then it is NULL for columns that can have NULL, otherwise it is the first enumeration value in the column definition.
- Incorrect values are stored as index 1 and returned as space
- Can have a maximum of 65,535 elements.
- Trailing spaces are automotically deleted
- Character set and collations can be assigned

For example, a column specified as ENUM('one', 'two', 'three')
ValueIndex
NULLNULL
''0
'one'1
'two'2
'three'3


SET
- String object that can have zero or more values, each of which must be chosen from a list of permitted values
- SET member value cannot have a comma in it, as the members are specified separated by commas
- Can have a maximum of 64 different members.
- Trailing spaces are automatically deleted
- Character set and collations can be assigned
- SET values are stored numerically, with the low-order bit of the stored value corresponding to the first set member(as it can have zero or more values)

For a column specified as SET ('a','b','c','d')
SET MemberDecimal ValueBinary Value
'a'10001
'b'20010
'c'40100
'd'81000
If you assign a value of 9 to this column, that is 1001 in binary, so the first and fourth SET value members 'a' and 'd' are selected and the resulting value is 'a,d'.

 
Ok so now when to use ENUM and when to use SET, consider these
  • ENUM can have many more values than SET
  • ENUM - only one of the choices, SET - zero to all of the choices
  • Well apart from these I coundnt find anymore, if someone finds something then please put in comment, I will incorportate them in my post
I hope I have made things clearer.
I would love to hear from you, suggestions welcome

Reference sites - dev.mysql.com, forums.mysql.com, stackoverflow.com

So Long,
Sandeep Rajoria

Comments

Popular posts from this blog

Multi Tenancy with Codeigniter

In this post I will show you how I converted my normal Codeigniter application into a multi-tenant system.(first step to a SaaS implementation) Note - This implementation is a separate DB multi-tenancy implementation. Lets say we have an up and running CI application name ci_app , with the directory structure like this ./application ./application/config ./application/...so many other important directories ./asset ./asset/js ./asset/images ./asset/css ./system ./index.php which is accessed through browser like http://localhost/ci_app So to implement the multi-tenant arch we are most concerned about the following files, and we will be editing them ./index.php ./application/config/config.php ./application/config/database.php And also we need to create a few new ones Create a tenant folder in your www root directory, lets say tenant_1 Cut the ./index.php from ci_app and paste it in tenant_1 directory  Create a blank file config.php in tenant_1 directory Crea

Profiling and checking PHP error_reporting in a Codeigniter App, without editing the config!!

Hi all, You must have definitely used the Profiling in Codeigniter and error_reporting many a times in Development and Testing environment, but I am sure you must have missed it on a real Production environment. As there are scenarios, where you want to quickly debug the Production application and find out what PHP errors is the application throwing, check the page profile, that too without putting the time and effort in replicating the whole production environment on your local machine, or perhaps a testing server. This small piece of code(we could perhaps call it a hack), which I have used in almost all of my CI applications, will make your life very easy, without losing anything on the security of the system. Following points, essentially sum up what exactly it does - Check for the dev(or root or admin, whichever name you use for the su access), if it is logged in, as we don't want others to see all the Profile data and other errors. Check for a specific query str

How I solved design problems by using various design patterns in my Laravel Project

Hey guys, Lately I have been working on a Virtual marketplace application using Laravel and PostgreSQL. So, when I was asked to build this huge application, the biggest challenge I faced was the design. Having a fair bit of prior experience in Laravel and upon following the current community trend, I decided to go with Laravel. And I hoped and expected that this, somewhat opinionated framework, would take care of my design to a large extent. When I actually started designing it, I realized that for a small/medium application Laravel already has things in place, you, as a developer just need to follow the guidelines set in place by the framework and use the features its providing out-of-box. However, for a larger application, with lot of interdependent modules and complex business flows, you need to make your own design decisions as well along with the existing features. This gave me an opportunity to take a look into the various existing design patterns to solve my design probl