Designing a URL shortener is one of the most popular interview questions.
The idea of a URL shortener is simple: when the user provides a long URL, we return a short URL, and when the user provides a short URL, we return the original long URL.
One of the simplest solutions is to create a Map which stores key-value pairs. But this solution is neither scalable nor distributed.
Hence we have to come up with a solution that is scalable and distributed.
The solution can be divided into 3 parts.
1. Memory Consumption on load
2. API that can be used
3. Application Layer
1. Memory Consumption on load
Before designing our system, we shall see what storage space will be required.
Assume that a service like Twitter has 300 million users per month. If our service gets around 30 million shortening requests per month, that means about 1 million per day.
And assume the length of our shortURL is 7 characters.
So in our DB we need to save at minimum the following fields:
LongURL -> The max length can be 2048 characters, i.e. 2 KB
shortURL -> The max length can be 17 characters, i.e. 17 bytes [including domain name]
CreatedAt -> Date -> 7 bytes
The total will be around 2 KB of data per shortURL entry.
So for 30M new entries per month we generate about 60 GB/month. That is about 0.7 TB/year, or 3.6 TB in 5 years.
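The arithmetic above can be checked with a short script. This is only a rough sketch: the constant names are made up for illustration, and it assumes 1 character = 1 byte and decimal units (1 GB = 10^9 bytes). Because it uses the exact field sizes instead of rounding to 2 KB, the results come out slightly higher than the rounded figures: about 62 GB per month, about 0.75 TB per year and about 3.7 TB over five years.

```python
# Back-of-the-envelope storage estimate using the field sizes from the text.
# Assumption: 1 character = 1 byte, and decimal units (1 GB = 10**9 bytes).
LONG_URL_BYTES = 2048
SHORT_URL_BYTES = 17
CREATED_AT_BYTES = 7
URLS_PER_MONTH = 30_000_000

bytes_per_entry = LONG_URL_BYTES + SHORT_URL_BYTES + CREATED_AT_BYTES
gb_per_month = bytes_per_entry * URLS_PER_MONTH / 10**9
tb_per_year = gb_per_month * 12 / 1000
tb_per_5_years = tb_per_year * 5

print(f"{bytes_per_entry} bytes per entry")
print(f"{gb_per_month:.1f} GB per month")
print(f"{tb_per_year:.2f} TB per year")
print(f"{tb_per_5_years:.2f} TB in 5 years")
```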
2. API that can be used
Here we create 2 simple APIs.
“createTiny(longURL)” creates a shortURL from the long URL.
“getLong(shortURL)” gets the long URL back from the shortURL.
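As a minimal sketch of this interface, the two calls can be backed by an in-memory dictionary. This is the simple Map version mentioned at the start, so it is neither scalable nor distributed. The class name `TinyUrlService`, the domain `https://tiny.example/` and the zero-padded counter used as a key are all assumptions made for illustration, and the method names are written in Python style (`create_tiny`, `get_long`).

```python
import itertools


class TinyUrlService:
    """Toy single-process version of the two API calls."""

    def __init__(self, domain="https://tiny.example/"):  # hypothetical domain
        self._domain = domain
        self._short_to_long = {}          # key: shortURL, value: longURL
        self._ids = itertools.count(1)    # placeholder key generator

    def create_tiny(self, long_url):
        """Store long_url and return a new shortURL for it."""
        short_url = f"{self._domain}{next(self._ids):07d}"
        self._short_to_long[short_url] = long_url
        return short_url

    def get_long(self, short_url):
        """Return the longURL for short_url, or None if it is unknown."""
        return self._short_to_long.get(short_url)


service = TinyUrlService()
short = service.create_tiny("https://example.com/some/very/long/path")
print(short)                    # https://tiny.example/0000001
print(service.get_long(short))  # https://example.com/some/very/long/path
```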
3. Application Layer
In this part we shall discuss different methods to change a longURL into a shortURL and vice versa, while keeping the shortURL as unique as possible.
Let us understand how a user will use the service. Consider the flow below:
A user makes an API request using REST over HTTP or any other protocol. The request first reaches a load balancer, which distributes the traffic evenly across multiple application servers.
Then the application server takes the longURL, converts it into a shortURL and stores it in the DB. When the user sends a shortURL request, the application server takes the shortURL, gets the longURL from the DB and returns it to the client.
We can also have a cache server to store the popular URLs. It can be Memcached, Redis or any other cache server available in the market.
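One common way to wire the cache into the read path is a cache-aside lookup: check the cache first, fall back to the DB on a miss, and then fill the cache. The text does not say which caching strategy is used, so treat this as one possible sketch. The class name `CachedResolver` is hypothetical, and plain dictionaries stand in for the DB and for Memcached or Redis.

```python
class CachedResolver:
    """Cache-aside lookup: try the cache first, then fall back to the DB."""

    def __init__(self, db, cache):
        self._db = db        # dict standing in for the key-value DB
        self._cache = cache  # dict standing in for Memcached or Redis

    def get_long(self, short_url):
        long_url = self._cache.get(short_url)
        if long_url is None:
            # Cache miss: read from the DB and remember the result.
            long_url = self._db.get(short_url)
            if long_url is not None:
                self._cache[short_url] = long_url
        return long_url


db = {"abc1234": "https://example.com/popular"}
cache = {}
resolver = CachedResolver(db, cache)
print(resolver.get_long("abc1234"))  # first call reads the DB
print(resolver.get_long("abc1234"))  # second call is served from the cache
```

A real cache would also need an eviction policy and expiry times, which are left out here.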
Now that we have understood the flow, we shall look at how to create a shortURL from the given longURL.
We shall discuss several methods for achieving this.
Some of the assumptions that we have made are listed below:
Below are the characters that are allowed in our shortURL.
“a to z”
“A to Z”
“0 to 9”
So we have 26 + 26 + 10 = 62 characters.
Our shortURL will be 7 characters long, so we get 62^7 (about 3.5 trillion) possible shortURLs. At 30 million new shortURLs per month, it would take thousands of years to use up all the values.
The DB schema will be a key-value pair, where the key is the “shortURL” and the value is the “longURL”.
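To put a number on "years", the size of the key space can be compared with the creation rate. This is a quick sketch that assumes the 30 million new shortURLs per month from the storage estimate and assumes that every 7-character combination is usable.

```python
ALPHABET_SIZE = 26 + 26 + 10      # a to z, A to Z, 0 to 9
SHORT_URL_LENGTH = 7
URLS_PER_MONTH = 30_000_000       # assumed creation rate from the estimate

total_keys = ALPHABET_SIZE ** SHORT_URL_LENGTH
years_to_exhaust = total_keys / (URLS_PER_MONTH * 12)

print(total_keys)                 # 3521614606208
print(f"about {years_to_exhaust:.0f} years to use every key")
```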
Method 1:
Generate a random shortURL and check
In this method, we take a longURL and generate a shortURL using some random method. Once we generate the “shortURL”, one of the following 3 approaches is possible.
1. Check the DB for the “shortURL” using a “get” method. If it is not present, “put” the key-value pair.
But this approach has a flaw. Suppose server_1 checks whether the random shortURL has already been inserted and finds that it has not, so it calls the “put” method to insert the key-value pair. If another server checks for the same random shortURL at the same time and also inserts it, the same shortURL ends up pointing to 2 different longURLs.
2. Use a “put if absent” operation: the DB checks for the shortURL and, if it is absent, inserts it in a single step. This avoids the race above, but it needs a DB that supports such an operation.
3. Insert the shortURL along with the longURL into the DB. Then get the shortURL back and check whether the stored longURL is the same as the original value. If it is, we are done. Otherwise generate another shortURL, insert it and check again. Repeat this process until you get a unique value.
All 3 approaches do at least one get, that is, one read of the DB, to check whether the shortURL is taken. Hence we shall move to the second method.
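The second approach can be sketched as a retry loop around an atomic put-if-absent. In this sketch `InMemoryStore` is a hypothetical stand-in for the key-value DB, with a lock simulating the atomicity that a real DB would have to provide, and `max_attempts` is an arbitrary limit chosen for illustration.

```python
import secrets
import string
import threading

ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits
SHORT_URL_LENGTH = 7


class InMemoryStore:
    """Stand-in for a key-value DB with an atomic put-if-absent."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, value):
        """Insert and return True if key is free, otherwise return False."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def get(self, key):
        return self._data.get(key)


def random_key():
    """Pick 7 random characters from the 62-character alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(SHORT_URL_LENGTH))


def create_tiny(store, long_url, max_attempts=10):
    """Retry with a fresh random key until the insert succeeds."""
    for _ in range(max_attempts):
        key = random_key()
        if store.put_if_absent(key, long_url):
            return key
    raise RuntimeError("could not find a free shortURL")


store = InMemoryStore()
key = create_tiny(store, "https://example.com/some/long/path")
print(key, "->", store.get(key))
```

Because the check and the insert happen under one lock, two servers can never both succeed with the same key, which is the race that the first approach suffers from.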
Method 2: MD5 method.
In this method we use the MD5 algorithm, a hashing function that generates a 128-bit hash. We take the MD5 value of the longURL and use the first 43 bits of the result to build the shortURL. Again there is a probability of collision, so again we need a get method to know whether the shortURL is already taken.
The only advantage is that MD5 gives the same result for the same input. Hence, if 2 users try to generate a shortURL for the same link, we can check our DB and give the same result instead of 2 random shortURLs, which saves space.
So how do we convert a 43-bit hash into a shortURL?
Once you have the 43-bit binary number, take its decimal value.
For example, suppose converting the 43 bits into decimal gives 1362849. Then convert that number into base 62.
Once you convert the number to base 62, you get a sequence of digits, each from 0 to 61.
Example (for 1362849):
5, 44, 33, 27
Then all you need to do is map each digit to the 62 characters we listed in the earlier part [A to Z, a to z, 0 to 9].
So
0 maps to A
1 maps to B
2 maps to C
.
.
.
61 maps to 9
This way you can generate the shortURL.
But this method also uses at least one get method to check whether the shortURL is already present.
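The conversion steps can be sketched as follows. The function names are illustrative, and the sketch assumes the character order A to Z, a to z, 0 to 9 with digit 0 mapping to A, since base-62 digits run from 0 to 61. One caveat: 2^43 is larger than 62^7, so a 43-bit value can need 8 characters. Taking 41 bits instead would always fit in 7.

```python
import hashlib

# Assumed order: A to Z, a to z, 0 to 9, with digit 0 mapping to "A".
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"


def to_base62_digits(number):
    """Return the base-62 digits of a non-negative int, most significant first."""
    if number == 0:
        return [0]
    digits = []
    while number > 0:
        number, remainder = divmod(number, 62)
        digits.append(remainder)
    return digits[::-1]


def encode_base62(number):
    """Map each base-62 digit to its character in ALPHABET."""
    return "".join(ALPHABET[d] for d in to_base62_digits(number))


def md5_short_key(long_url, bits=43):
    """Take the leading `bits` bits of the MD5 hash and base-62 encode them."""
    digest = hashlib.md5(long_url.encode("utf-8")).digest()  # 128 bits
    as_int = int.from_bytes(digest, "big")
    leading_bits = as_int >> (128 - bits)
    return encode_base62(leading_bits)


print(to_base62_digits(1362849))  # [5, 44, 33, 27]
print(encode_base62(1362849))     # Fshb
print(md5_short_key("https://example.com/some/long/path"))
```

The same longURL always produces the same key, so a collision check against the DB can tell a repeated link apart from a true collision by comparing the stored longURL.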
Method 3: Counter based approach
In this method we can guarantee that there will be no collision. Hence we do not need a get method. In the counter based approach, there are 2 different ways to achieve this.
They are:
1. Single Host
2. Range based approach
1. Single host approach:
In this approach there is a single host, and all the application servers connect to it whenever they receive a shortURL request. The app server gets a number from the host, and the host then increments the number. Hence the app server can generate a unique shortURL based on the number. The drawbacks are a single point of failure [when the host is down, it affects all the app servers] and a bottleneck [when the number of requests is high, it might take time to process all of them]. The single host can be a database or ZooKeeper.
2. Range based approach:
In this approach, we divide the counter into ranges. Those ranges are stored on a coordination server, for example ZooKeeper, which assigns each range to a particular app server.
For example, we divide the first 5000 numbers into 5 parts.
1 – 1000
1001 – 2000
2001 – 3000
3001 – 4000
4001 – 5000
Here the first app server comes and takes the first range, and ZooKeeper reserves that range for app server 1. Similarly, app server 2 takes the next range. Assuming there are only 2 app servers, they work on those two ranges. Each app server increments its own counter within its reserved range, and ZooKeeper moves on to the next free range every time an app server asks for one.
Suppose server 1 has exhausted its range. It then contacts ZooKeeper for another one, and ZooKeeper gives it the next free range and reserves it.
Thus this model is highly scalable and guarantees unique shortURLs.
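The range based approach can be sketched in a single process as follows. `RangeCoordinator` is a hypothetical stand-in for ZooKeeper and does not use any real ZooKeeper API. `AppServer` and the range size of 1000 are likewise assumptions for illustration. Counter values are turned into keys with a base-62 encoding in the order A to Z, a to z, 0 to 9.

```python
import threading

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"


def encode_base62(number):
    """Encode a non-negative int using the 62-character alphabet."""
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number > 0:
        number, remainder = divmod(number, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


class RangeCoordinator:
    """Stand-in for ZooKeeper: hands out disjoint counter ranges."""

    def __init__(self, range_size=1000, start=1):
        self._range_size = range_size
        self._next_start = start
        self._lock = threading.Lock()

    def reserve_range(self):
        """Return the next unreserved range as (first, last), inclusive."""
        with self._lock:
            first = self._next_start
            self._next_start += self._range_size
            return first, first + self._range_size - 1


class AppServer:
    """Generates keys from its own range and asks for a new one when empty."""

    def __init__(self, coordinator):
        self._coordinator = coordinator
        self._lock = threading.Lock()
        self._current, self._last = coordinator.reserve_range()

    def next_short_key(self):
        with self._lock:
            if self._current > self._last:  # range exhausted
                self._current, self._last = self._coordinator.reserve_range()
            value = self._current
            self._current += 1
        return encode_base62(value)


coordinator = RangeCoordinator(range_size=1000)
server_1 = AppServer(coordinator)  # reserves 1 to 1000
server_2 = AppServer(coordinator)  # reserves 1001 to 2000
print(server_1.next_short_key(), server_2.next_short_key())
```

The coordinator is contacted only once per range rather than once per shortURL, which is what removes the bottleneck of the single host approach. Early counter values produce keys shorter than 7 characters. A real system might start the counter at a larger offset or pad the key.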
Some of the tools used here are:
Load balancer
REST API
ZooKeeper
NoSQL DB
CDN
MD5
Memcached