Possible ways to generate unique IDs for your application
Why should you use an ID field as the primary/foreign key?
- A join on an integer key is efficient: integer comparisons are cheaper than comparisons on strings and most other data types.
- Other column values, such as your name, may change. We need an identifier that never changes, and a unique ID for every row provides exactly that.
- The identifier must be unique, so that no two rows can be confused with each other.
- An auto-generated incremental ID is easy to track because it preserves insertion order.
- Queries run faster when a unique ID is available, because it can be indexed.
Now you might object that you log in to a social network with an email address, for example vivek.sinless@gmail.com.
In the backend, however, this email address is mapped to a unique ID (for example, 1928202862). That ID is then used to look up rows in other tables, such as address (to get the user's address) and posts (to get the user's posts). Comparing an integer ID is cheaper than comparing an email string. You might also change your email address in the future, whereas the unique ID always stays the same.
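The mapping described above can be sketched with Python's built-in sqlite3 module. The table names, column names, and example.com addresses are hypothetical and chosen only for illustration; a real social network's schema will look different.

```python
import sqlite3

# In-memory database; the schema below is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,   -- stable surrogate key
        email TEXT UNIQUE NOT NULL   -- login name, may change later
    );
    CREATE TABLE posts (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        body    TEXT NOT NULL
    );
""")

cur = conn.execute("INSERT INTO users (email) VALUES (?)", ("old@example.com",))
user_id = cur.lastrowid
conn.execute("INSERT INTO posts (user_id, body) VALUES (?, ?)", (user_id, "hello"))

# The user changes their email; the id and every row referencing it are untouched.
conn.execute("UPDATE users SET email = ? WHERE id = ?", ("new@example.com", user_id))


def posts_for(email):
    """Resolve the email to a user row, then join to posts on the integer key."""
    rows = conn.execute(
        "SELECT p.body FROM users u JOIN posts p ON p.user_id = u.id "
        "WHERE u.email = ?",
        (email,),
    ).fetchall()
    return [body for (body,) in rows]
```

Only the users table ever stores the email, so changing it is a single-row update; the posts table keeps pointing at the same integer.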
Different ways to create a unique ID:
Serialized Identity integer column
Serial IDs take less space than GUIDs and are easier to index. They make a great clustered index, with less fragmentation, because new records are appended in order. Joins are simple, and with an auto-generated incremental ID provided by the database, lookups are fast because the sort order is maintained. However, this approach is not well suited to distributed systems: two databases can independently generate the same ID, so collisions become a problem.
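The collision problem is easy to reproduce. In this minimal sketch, two in-memory SQLite databases stand in for two nodes of a distributed system; the orders table and the helper names are made up for the example.

```python
import sqlite3


def make_shard():
    """Create an independent database with its own auto-increment counter."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT)"
    )
    return db


def insert_order(db, item):
    """Insert a row and return the id the database assigned to it."""
    cur = db.execute("INSERT INTO orders (item) VALUES (?)", (item,))
    return cur.lastrowid


shard_a, shard_b = make_shard(), make_shard()
id_a = insert_order(shard_a, "book")
id_b = insert_order(shard_b, "lamp")

# Each shard starts counting from 1, so two different rows share one id.
print(id_a, id_b)
```

Both shards hand out 1 for their first row, so the ID alone no longer identifies a row once the data from the two nodes is merged.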
GUIDs (Globally Unique Identifiers)
GUIDs are very useful in a distributed system (for example, replicated databases), where a non-trivial amount of work would otherwise go into a key generation mechanism that avoids collisions between the parts of the system.
How unique a GUID is depends on the algorithm used to generate it.
Do GUIDs ever repeat?
A GUID is a 128-bit integer (16 bytes) that can be used across all computers and networks wherever a unique identifier is required.
For our purposes GUID and UUID are the same thing: GUID is Microsoft’s name for its implementation of UUID.
The drawback of a UUID/GUID is its size. At 16 bytes it is larger than an integer key, so indexes grow bigger and query performance takes a hit.
Read more: http://guid.one/guid
The 128-bit space holds 2^128 (about 3.4 × 10^38) values, roughly 800 billion for every nanosecond of the last 13.8 billion years, so running out is not a practical concern. Repeats are possible in theory, but for randomly generated GUIDs a collision only becomes likely after on the order of 10^18 values have been generated, far more than a typical application will ever create.
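To put a number on the collision risk, here is a small sketch using Python's standard uuid module. It assumes random (version 4) UUIDs, which carry 122 random bits, and uses the usual birthday-bound approximation; the function name collision_probability is made up for this example.

```python
import uuid

guid = uuid.uuid4()   # random (version 4) UUID
raw = guid.bytes      # the 16-byte (128-bit) binary form

# A version 4 UUID spends 4 bits on the version and 2 on the variant,
# which leaves 122 random bits.
RANDOM_BITS = 122


def collision_probability(n, bits=RANDOM_BITS):
    """Birthday-bound approximation p ~ n^2 / (2 * 2^bits); valid while p is small."""
    return n * n / (2 * 2**bits)


# Chance of at least one repeat after generating a billion random GUIDs.
p_billion = collision_probability(10**9)
print(guid, len(raw), p_billion)
```

Under these assumptions, a billion GUIDs give a collision chance below one in a quintillion. The approximation stops being meaningful once the result approaches 1.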
MongoDB’s ObjectId
ObjectIds are 12 bytes long and are made up of:
- a 4-byte epoch timestamp in seconds,
- a 3-byte machine identifier,
- a 2-byte process id, and
- a 3-byte counter, starting with a random value.
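At 12 bytes this is smaller than a 16-byte GUID, but still larger than the 4- or 8-byte auto-increment integer of a SQL database. Newer MongoDB versions replace the machine identifier and process id with a single 5-byte random value; the total length is unchanged.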
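The layout listed above can be packed by hand to see how the 12 bytes fit together. This is a sketch of the byte layout only, not MongoDB's actual driver code: the function new_object_id and the choice to derive the machine identifier from a hash of the hostname are assumptions made for illustration.

```python
import hashlib
import os
import socket
import struct
import time

# 3-byte machine identifier: here a hash of the hostname (an assumption).
_MACHINE = hashlib.sha256(socket.gethostname().encode()).digest()[:3]
# 3-byte counter, starting from a random value.
_counter = int.from_bytes(os.urandom(3), "big")


def new_object_id(now=None):
    """Build a 12-byte id: timestamp(4) + machine(3) + process(2) + counter(3)."""
    global _counter
    timestamp = int(time.time() if now is None else now)
    _counter = (_counter + 1) % (1 << 24)            # wrap at 3 bytes
    return (
        struct.pack(">I", timestamp)                 # 4-byte epoch seconds
        + _MACHINE                                   # 3-byte machine identifier
        + struct.pack(">H", os.getpid() & 0xFFFF)    # 2-byte process id
        + _counter.to_bytes(3, "big")                # 3-byte counter
    )


oid = new_object_id()
print(oid.hex())  # 24 hexadecimal characters
```

Because the timestamp is stored big-endian in the leading bytes, ids created in different seconds sort by creation time when compared as raw bytes.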
Centralized Database
This approach uses an additional database whose primary purpose is to create unique IDs. Say we are working with a SQL database, but we want it sharded across several nodes. Left to themselves, the shards can generate the same IDs. Instead, a centralized database can be made responsible for ID creation: the shards do not auto-increment on their own, but call the central database whenever they need a new ID. The drawbacks are that we have an extra database to run, every insert pays additional network latency, and the central database becomes a single point of failure.
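A minimal sketch of the idea follows, with an in-memory SQLite database playing the central ID database and plain Python objects playing the shards. The class names CentralIdServer and Shard are hypothetical, and in a real deployment next_id would be a network call rather than a local method.

```python
import sqlite3


class CentralIdServer:
    """Stand-in for the dedicated ID database; its only job is to hand out ids."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE tickets (id INTEGER PRIMARY KEY AUTOINCREMENT)"
        )

    def next_id(self):
        # Each insert consumes one auto-increment value, which becomes the new id.
        cur = self._db.execute("INSERT INTO tickets DEFAULT VALUES")
        return cur.lastrowid


class Shard:
    """A data node that never generates ids itself."""

    def __init__(self, id_server):
        self._id_server = id_server
        self.rows = {}

    def insert(self, value):
        row_id = self._id_server.next_id()   # the extra round trip
        self.rows[row_id] = value
        return row_id


server = CentralIdServer()
shard_a, shard_b = Shard(server), Shard(server)
issued = [shard_a.insert("x"), shard_b.insert("y"), shard_a.insert("z")]
print(issued)
```

The ids stay unique across shards because a single counter issues all of them, which is also why every insert now depends on the central server being reachable.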
Twitter Snowflake
Snowflake is a network service that generates unique ID numbers at high scale with some simple guarantees. The IDs are 64 bits long.
An ID is composed of:
- time: 41 bits (millisecond precision with a custom epoch, which gives about 69 years)
- configured machine id: 10 bits, which allows up to 1024 machines
- sequence number: 12 bits, which allows 4096 IDs per machine per millisecond (with protection against rolling over within the same millisecond)
These fields add up to 63 bits. The remaining top bit is left unused, so the value stays positive in a signed 64-bit integer.
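The bit layout above can be sketched as follows. This is an illustrative re-implementation of the idea, not Twitter's code: the class name, the injectable clock parameter, and the custom epoch value are all assumptions.

```python
import threading
import time

CUSTOM_EPOCH_MS = 1_600_000_000_000   # hypothetical custom epoch, not Twitter's
MACHINE_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1   # 4095


class SnowflakeGenerator:
    def __init__(self, machine_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id must fit in 10 bits")
        self.machine_id = machine_id
        self._clock = clock            # returns the current time in milliseconds
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # 4096 ids already issued this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - CUSTOM_EPOCH_MS) << (MACHINE_BITS + SEQUENCE_BITS))
                | (self.machine_id << SEQUENCE_BITS)
                | self._sequence
            )


def decode(snowflake_id):
    """Split an id back into (timestamp_ms, machine_id, sequence)."""
    sequence = snowflake_id & MAX_SEQUENCE
    machine_id = (snowflake_id >> SEQUENCE_BITS) & ((1 << MACHINE_BITS) - 1)
    timestamp_ms = (snowflake_id >> (MACHINE_BITS + SEQUENCE_BITS)) + CUSTOM_EPOCH_MS
    return timestamp_ms, machine_id, sequence
```

Putting the timestamp in the most significant bits is the design choice that matters here: ids from one generator are strictly increasing, and ids from different machines still sort by millisecond.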
The IDs are compact: 8 bytes, compared with 16 for a GUID and 12 for an ObjectId.
The IDs are sortable by creation time, because the timestamp occupies the most significant bits.
https://github.com/twitter-archive/snowflake/tree/snowflake-2010
In short, Snowflake IDs take little space, sort by creation time, and can be generated in a distributed environment as long as each machine has its own machine id. That makes this approach an efficient choice among the options above.
Conclusion
The first thing to figure out when you need a unique identifier is the kind of database architecture you want: distributed/sharded or a single database.
Please reach out to vivek.sinless@gmail.com for any queries. Happy learning! Cheers.