Primary key: To GUID or not to GUID?
The primary key of a table on an RDBMS uniquely identifies each record in the table. It can either be a normal attribute that is guaranteed to be unique or it can be generated by the RDBMS (such as a globally unique identifier). Primary keys may consist of a single attribute (single key) or multiple attributes in combination and in that case we speak about composite keys.
I am sure that you have asked yourself the question in the title of this post several times in your professional life: every time you create a new database table. For sure, this is one of those recurrent questions that now and then appears and make you think over and over again.
The short answer to it is: it depends (as usually, isn’t it?)
Before thinking about data types we should consider and decide what is the Primary Key going to represent. This can be a Social Security Number, ISIN number of a book, or simply a progressive sequence of numbers that then would represent the row. So, every table has its own kind of problems, and for sure the data type has to respect the original need and represent that accordingly.
As my primary database I am used to work with Microsoft SQL Server, and therefore the examples here below are all around that particular database system, but for sure the same (or similar) rules would apply to most of the other RDMBS.
Generally good candidates for primary keys are (we will see later why).
- Numbers (tiny, smallint, int, bigint)
- Unique Identifiers (GUID)
- In some very special cases: strings
Bad candidates are:
- Strings: generally speaking)
- Dates: can a date be a Primary Key? I see very limited usage here, and usually is wrong to choose a date.
- floats, decimals
- xml: too complex and this usually could be solved with previously mentioned composite keys.
What are the consideration to undertake?
There are certainly many aspects (or questions we should ask) in order to decide the data type to use, here are just some:
- Expected table growth: Will the table grow in the future so much that we would need to change the datatype.
- Indexing: What kind of indexing is needed on a certain column.
- Storage Space: Something that definitively needs consideration, as this could.
- Processing speed: Bigger data types usually require more CPU processing.
- Latency: Usually a database is running on a central server, big data means more traffic…
- Functionality (operations): Should we order, compare, sum, etc the primary key…
- I am sure there is more, and I hope I listed the most important ones.
Let’s give some numbers
The following table depicts some common data types and their possible values. The table is sorted in a growing order. The last column shows how much storage space would be needed just for the primary key in case the table has 1 billion rows.
|Data type||From||To||Storage||Storage needed for 1bio rows|
|tinyint||0||255||1 byte||Not possible|
|2 bytes||Not possible|
|4 bytes||3.72 GB|
|8 bytes||7.45 GB|
5,316,911,983,139,663,491,615,228,241,121,400,000 possible combinations
|16 bytes||14 GB|
This table shows clearly that the GUID are the heaviest data type to use, but at the same time is the most capable when it comes to the number of combinations (I am not sure even how to pronounce this number)
Microsoft SQL Server offers a possiblity for GUID (uniqueidentifier) to act as a primary key. GUID can be compared. The only operations that are allowed against a uniqueidentifier value are comparisons (=, <>, <, >, <=, >=) and checking for NULL (IS NULL and IS NOT NULL). No other arithmetic operators are allowed. All column constraints and properties except IDENTITY are allowed on the uniqueidentifier data type. It is possible to have sequential uniqueidentifiers as a by assigning NewSequentialID() as the default value of the column
Benefits of using uniqueidentifier as Primary key:
- GUID would guarantee unique keys between multiple servers, in case we would like to split the database
- Anticipate having more rows in a table than could be supported by a smaller datatype.
- Everything depends on the application, but a GUID offers the substantial benefit of being unique at creation time, as compared for instance to a sequence of integers.
- GUID could be generated outside the database and be unique without blocking.
- Synchronization between different stores. Theoretically there is no way for two items to overlap.
- When merging data we don’t have to worry about duplicates
- Processing speed impacted, storage space needed is significant. Usually GUID performs very badly as the PK.
- Readability affected: Easier to say, “return the row with the id=5″or , “return a row with the id = 6F9619FF-8B86-D011-B42D-00C04FC964FF”
- Debugging more difficult for the same reason of readability.
- Index fragmentation: Please check this article for a much better explanation that I could give Index fragmentation finding – the basics
- GUID theoretically are unique but practically (in very few cases , could overlap).
As you have seen, there are positive and negative points in using Guids to be considered
I will leave to you the final answer to the “Guid or not Guid” question, citing that I personally would avoid using Guid as the primary key if really not strictly necessary, and in cases other data types cannot achieve the same functionality.
The approach I am usually following (obviously, whenever possible) is that every table has it’s own sequential (identity) primary key which is an Int , which is usually more than enough given the limit of 2,147,483,647, and in the future is always possible to switch to the BIGINT if necessary (not without some glitches).
If you read this post unfortunately you are going to have a completely different opinion nhibernate-poid-generators-revealed.aspx