Designing Twitter Clone in Redis

On the official Redis site http://redis.io you can find this post http://redis.io/topics/twitter-clone/ about building a Twitter clone in Redis. I based my design partially on it, but I would like to go deeper into building the timeline and posts.

Quick review

I used a similar approach to store followers and following, so I will just go quickly through the keys and design.

  twtr:<user_id>:following -> set of ids this user follows
  twtr:<user_id>:followers -> set of ids that follow this user

What happens when I click “follow”?

example
redis 127.0.0.1:6379> SADD twtr:kuba:following amelia
(integer) 1
redis 127.0.0.1:6379> SADD twtr:amelia:followers kuba
(integer) 1
redis 127.0.0.1:6379> SADD twtr:kuba:following dan
(integer) 1
redis 127.0.0.1:6379> SADD twtr:dan:followers kuba
(integer) 1
redis 127.0.0.1:6379> SADD twtr:kuba:following ben
(integer) 1
redis 127.0.0.1:6379> SADD twtr:ben:followers kuba
(integer) 1
redis 127.0.0.1:6379> SMEMBERS twtr:kuba:following
1) "ben"
2) "amelia"
3) "dan"

So now I follow amelia, ben and dan.

What happens when amelia clicks “followers”?

example
redis 127.0.0.1:6379> SMEMBERS twtr:amelia:followers
1) "kuba"

This way, for each user, we can see the set of people who follow him and the set of people he follows. That's all we need, just like in the tutorial.
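For completeness, here is roughly how the “follow” click could look in the application layer. This is a minimal sketch using the redis-py client; the follow helper and the connection setup are my own assumptions, not something from the original tutorial.

import redis

# assumes a local Redis instance; decode_responses gives plain strings back
r = redis.Redis(decode_responses=True)

def follow(follower_id, followee_id):
    # mirrors the two SADD calls from the CLI example above
    r.sadd("twtr:%s:following" % follower_id, followee_id)
    r.sadd("twtr:%s:followers" % followee_id, follower_id)

follow("kuba", "amelia")
print(r.smembers("twtr:kuba:following"))  # -> {'amelia'}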

Post

So I think it is not a waste if we decide to keep a post in the form of separate keys:

  twtr:<user_id>:post:<post_id> -> content text of post
  twtr:<user_id>:post:<post_id>:created_at -> creation time of post
  twtr:post_owner:<post_id> -> id of post creator.

Why this approach and not compacting everything into a single pipe-separated key? Both solutions seem to be OK; this one just leaves a little bit more flexibility. I know that it will generate twice as many lookups to Redis, so you can consider doing

  twtr:<user_id>:post:<post_id> -> (timestamp|text)

Both solutions have their pros and cons: the first one requires 2x the lookups and the second one requires parsing data in the application layer. Still, I prefer the first one.
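To make it concrete, writing and reading a post under the separate-keys layout could look roughly like this (a sketch; create_post and get_post are hypothetical helper names, and the post id is assumed to already be generated, which is covered in the next section):

import time
import redis

r = redis.Redis(decode_responses=True)

def create_post(user_id, post_id, text):
    base = "twtr:%s:post:%s" % (user_id, post_id)
    # one key for the text, one for the timestamp, plus the owner lookup
    r.set(base, text)
    r.set(base + ":created_at", int(time.time()))
    r.set("twtr:post_owner:%s" % post_id, user_id)

def get_post(post_id):
    user_id = r.get("twtr:post_owner:%s" % post_id)
    base = "twtr:%s:post:%s" % (user_id, post_id)
    return {"user_id": user_id,
            "text": r.get(base),
            "created_at": r.get(base + ":created_at")}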

Post id

This is a hard topic, because in the future we would want to scale (lol). Generating a post id is not an easy task in this case. We could just use an auto incrementing counter like this

  twtr:post_id:next -> auto incr

But to have something more flexible you should look at something like Snowflake http://engineering.twitter.com/2010/06/announcing-snowflake.html.
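A minimal sketch of the counter-based approach; Snowflake or a similar scheme would later replace just this one function:

import redis

r = redis.Redis(decode_responses=True)

def next_post_id():
    # INCR is atomic, so two concurrent writers never get the same id
    return r.incr("twtr:post_id:next")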

Post list

For each user we will store a list of posts he wrote. Initially I thought that we could just pump everything into one list, but this is not optimal. It is a single key which will grow like crazy, and we will not be able to decide how and when to archive the parts that are not used. E.g. posts made by a user 8 months ago are not really relevant today: if we assume that on average a person posts a few times a week, an 8 month old entry will be long forgotten. We want to archive it, and it is also healthier for memory to store short lists.

I see here two scenarios:

  • user looks at his last few posts (< 100)
  • user is infinite scrolling through all posts.

So for these scenarios it seems reasonable to have a list of lists, in which we will keep ordered post list ids. If we only use LPUSH to add post lists to this list, we will be able to do an easy LRANGE 0 <n> to get the newest lists.

  twtr:<user_id>:lists -> list of list ids; only LPUSH id and LRANGE 0 <number>.
  twtr:<user_id>:list_next -> auto incr counter for list ids
  twtr:<user_id>:list:<list_id> -> list with 100 posts

So how do we get the most recent posts? We just LRANGE 0 1 to get the two most recent list ids and then merge them, first + second. Both are LPUSH'ed so the result should be roughly ordered (we don't really care about exact order).
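In code, fetching recent posts could look roughly like this (a sketch; recent_post_ids is a hypothetical helper and the “merge” is just concatenation, since exact order does not matter):

import redis

r = redis.Redis(decode_responses=True)

def recent_post_ids(user_id, lists=2):
    # newest list ids come first, because list ids are only ever LPUSH'ed
    list_ids = r.lrange("twtr:%s:lists" % user_id, 0, lists - 1)
    post_ids = []
    for list_id in list_ids:
        post_ids += r.lrange("twtr:%s:list:%s" % (user_id, list_id), 0, -1)
    return post_ids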

Adding a post is a bit tricky. We need to LRANGE <key> 0 0 to get the current list id, then LLEN the current list to check its size. If it is < SIZE (100 for our example) we just LPUSH <list key> <post id> and the job is done. If the size has reached 100 we need to INCR the list counter, LPUSH its result onto the list of lists, and then LPUSH <post id> onto the new list.
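Here is a sketch of that append logic (SIZE of 100 and the add_to_post_lists name are my assumptions; note the check-then-push is not atomic, so under concurrent writers a list may overshoot 100 slightly):

import redis

r = redis.Redis(decode_responses=True)
SIZE = 100  # posts per list

def add_to_post_lists(user_id, post_id):
    lists_key = "twtr:%s:lists" % user_id
    current = r.lrange(lists_key, 0, 0)  # newest list id, if any
    if current and r.llen("twtr:%s:list:%s" % (user_id, current[0])) < SIZE:
        list_id = current[0]
    else:
        # current list is full (or there is none yet): allocate a fresh one
        list_id = r.incr("twtr:%s:list_next" % user_id)
        r.lpush(lists_key, list_id)
    r.lpush("twtr:%s:list:%s" % (user_id, list_id), post_id)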

All of this lives in the application layer, and it is the hardest bit to do. It may seem complicated, and if it seems not optimal you can add one more list

  twtr:<user_id>:list:current -> list of current 100 posts

This list just holds the most recent posts of a particular user. How does this list work? The algorithm is simple:

  • LPUSH new post ids
  • RPOP if length > SIZE (100)

This could be useful to reduce the number of hits against Redis.
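A sketch of maintaining that cache: LPUSH, then RPOP back down whenever we go over the cap (Redis' LTRIM could do the same in one call, but this mirrors the algorithm above):

import redis

r = redis.Redis(decode_responses=True)
SIZE = 100

def push_to_current(user_id, post_id):
    key = "twtr:%s:list:current" % user_id
    r.lpush(key, post_id)
    # drop the oldest entries so the cache never grows past SIZE
    while r.llen(key) > SIZE:
        r.rpop(key)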

Time line

Now the timeline. The timeline is exactly the same as the user post list; we will just change one “bit” about adding posts.

The algorithm here is: when you add a post, you have to pick all ids of people who follow you (using the example from the top, if you are adding a post as amelia):

SMEMBERS twtr:<user_id>:followers

And you need to push your post id onto their timeline post lists. That's all. Of course, one thing that we still need to add are the keys for the timeline:

  twtr:<user_id>:timeline:lists -> list of list ids; only LPUSH id and LRANGE 0 <number>.
  twtr:<user_id>:timeline:list_next -> auto incr counter for list ids
  twtr:<user_id>:timeline:list:<list_id> -> list with 100 posts
  twtr:<user_id>:timeline:current -> current list cache (LPUSH, RPOP when size > SIZE)
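The fan-out itself could look roughly like this (a sketch; fan_out is a hypothetical helper, and the timeline:lists / timeline:list:<list_id> structure would be appended to in the same way as the post lists sketch above):

import redis

r = redis.Redis(decode_responses=True)

def fan_out(author_id, post_id):
    # push the new post id onto every follower's timeline cache
    for follower_id in r.smembers("twtr:%s:followers" % author_id):
        r.lpush("twtr:%s:timeline:current" % follower_id, post_id)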

Summary

This is how I would approach building a Twitter-like clone. Things like old lists can easily be archived into MySQL, PostgreSQL or something else and EXPIREd from Redis. One thing that is common in my design is that I put <user_id> into a lot of keys; this could be skipped, but in my opinion it is not bad. If you use a <user_id> in the form of the md5 of the user's email, you can use it directly to access that user's Gravatar.
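For example (a quick sketch; Gravatar hashes the trimmed, lowercased email with md5, so the same value can double as <user_id>):

import hashlib

def gravatar_user_id(email):
    # md5 of the trimmed, lowercased email works as both user_id and avatar hash
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()

print("https://www.gravatar.com/avatar/" + gravatar_user_id("kuba@example.com"))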

On average you will need to do around 10-30 hits to Redis to get the data; if you plan to do it in a “lazy way” you can minimize the number of hits to around 10.

If you see a problem with my design, comment, I want to know about it! The core of this design is that each user's post data is stored on one Redis instance. This is important because of access and race condition stories if you have many Redis instances. But achieving “sharding” in the application layer is not hard. The only thing I would care about is the post id generator; it is a single point of failure, because I have a strong assertion that post_id is unique in the whole system.

Cheers!
