At Vinted, we use Redis, a data structure server, for many things, including Resque, the news feed, the application itself, and more. Until now, we could not restart or upgrade Redis instances without downtime. High availability is critical for us, so we decided to evaluate Redis Sentinel and Redis Cluster.

The first thing we did was test Redis Cluster. However, we decided not to go with it because of the lack of client-side software. Redis Cluster itself is stable, but its client side is very basic and lacks advanced functionality, such as pipelining, which we use.

Once we had finished testing Redis Cluster, we moved on to Redis Sentinel. Redis Sentinel monitors the master and its slaves and, if the quorum agrees that the master is down, promotes a slave to be the new master. In our case, we tested it with 3 nodes (quorum=2). It is not worth going into detail about the Sentinel configuration, as it is very simple.
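
To make the quorum setting concrete, here is a minimal Python sketch of the two checks involved. The function names are hypothetical and this is not Sentinel's actual code: it only illustrates that the quorum decides when a master is considered down, while the failover itself additionally needs a majority of all Sentinels.

```python
def is_objectively_down(down_votes: int, quorum: int) -> bool:
    """Return True when at least `quorum` Sentinels report the master as down."""
    return down_votes >= quorum


def has_majority(reachable_sentinels: int, total_sentinels: int) -> bool:
    """Return True when more than half of all Sentinels can take part in the vote."""
    return reachable_sentinels > total_sentinels // 2


# Our test setup: 3 Sentinels, quorum=2.
print(is_objectively_down(down_votes=2, quorum=2))
print(has_majority(reachable_sentinels=2, total_sentinels=3))
```

With 3 nodes and quorum=2, losing a single Sentinel still leaves enough votes for both checks, while losing two does not.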

We run multiple mini clusters, each formed by one master and two slaves. This allows us to run as many instances as we need on one server, since each instance listens on a different port.

If we need to launch another cluster, we simply add the role redis-shards-<country> and Chef will automatically spawn what is needed.

The most interesting thing about Sentinel is that it writes its state into the configuration file. As a result, this file must not be overwritten, so Chef generates these files only if they do not exist.
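
The "generate only if missing" behaviour can be sketched in a few lines of Python. This is an illustration of the idea, not our Chef code: the helper name and the file content are made up, and the semantics are similar in spirit to Chef's create_if_missing action.

```python
def write_if_missing(path: str, content: str) -> bool:
    """Create `path` with `content` only if it does not exist yet.

    Returns True if the file was written, False if it was left untouched.
    """
    try:
        # Mode "x" raises FileExistsError instead of truncating an existing file,
        # so state that Sentinel has written into the file is never lost.
        with open(path, "x") as config:
            config.write(content)
        return True
    except FileExistsError:
        return False
```

The first run creates the file; every later run is a no-op, which leaves whatever Sentinel has appended to the file intact.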

Technical details

Failover

Every time Sentinel starts a failover, it calls sentinelStartFailover(). Sentinels exchange hello messages using Pub/Sub and update the last_pub_time variable as they do so.

So, let’s dig deeper into this. Here is the SystemTap snippet we used to probe user space:

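# Fires each time redis-server enters sentinelStartFailover().
probe process("/usr/local/bin/redis-server").function("sentinelStartFailover")
{
        # Milliseconds elapsed since the master's last_pub_time was updated.
        elapsed = gettimeofday_ms() - $master->last_pub_time;
        # Print as seconds with millisecond precision.
        printf("%d.%03ds\n", (elapsed / 1000), (elapsed % 1000));
}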

A manual failover using redis-cli took 0.835s, while a failover triggered by the configured timeout took 5.843s.
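
For readers unfamiliar with the printf arithmetic in the probe, the following Python sketch mirrors it: an elapsed time in milliseconds is split into whole seconds and a zero-padded millisecond remainder. The function name is ours, chosen for illustration.

```python
def format_elapsed(elapsed_ms: int) -> str:
    """Format milliseconds the same way as the probe's printf("%d.%03ds")."""
    seconds, millis = divmod(elapsed_ms, 1000)
    return "%d.%03ds" % (seconds, millis)


print(format_elapsed(835))   # 0.835s
print(format_elapsed(5843))  # 5.843s
```

The %03d padding matters: without it, an elapsed time of 1005 ms would print as 1.5s rather than 1.005s.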

Measuring how quickly a manual failover converges was crucial for us, as we care about latency. Failing fast is just as important, so these timers need to be tuned to determine whether manual failovers are enough for maintenance or whether it is preferable to rely on the configured timeouts.

Migration process

  • Stop all Sentinel instances to avoid electing a new master;
  • Make sure every Redis instance is a master;
  • Make the Sentinel master node replicate from the origin instance;
  • Make the Sentinel slaves replicate from the Sentinel master;
  • Once everything is in sync, stop replicating the master from the origin and start the Sentinel instances.
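
The replication part of the steps above can be sketched as a small Python helper that only builds the redis-cli commands in order, without running them. The helper names and the addresses in the usage example are hypothetical. Stopping and starting Sentinel is left out because it depends on how the services are managed, and the check that everything is in sync still has to be done before the final command.

```python
from typing import List, Tuple

Addr = Tuple[str, int]  # (host, port)


def cli(target: Addr, *args: str) -> str:
    """Build a redis-cli invocation against `target`."""
    host, port = target
    return " ".join(["redis-cli", "-h", host, "-p", str(port), *args])


def migration_plan(origin: Addr, master: Addr, slaves: List[Addr]) -> List[str]:
    """Return the replication commands for the migration, in execution order."""
    steps = []
    # Make sure every instance of the new setup starts out as a master.
    for node in [master, *slaves]:
        steps.append(cli(node, "SLAVEOF", "NO", "ONE"))
    # The future Sentinel master replicates from the origin instance.
    steps.append(cli(master, "SLAVEOF", origin[0], str(origin[1])))
    # The future Sentinel slaves replicate from that master.
    for node in slaves:
        steps.append(cli(node, "SLAVEOF", master[0], str(master[1])))
    # Run only after everything is in sync: detach the master from the origin.
    steps.append(cli(master, "SLAVEOF", "NO", "ONE"))
    return steps


if __name__ == "__main__":
    for step in migration_plan(("10.0.0.1", 6379), ("10.0.0.2", 6380),
                               [("10.0.0.3", 6380), ("10.0.0.4", 6380)]):
        print(step)
```

Generating the commands first makes it easy to review the order before touching a production instance.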

Monitoring

We monitor Redis instances using Redistop.rb.

We don’t use the built-in monitoring command (redis-cli -p <port> monitor) because it is more intrusive (~12%) than our own tool. In addition, our own tool lets us see how many requests per second each instance handles, sort by latency or by count, and see the most used keys and commands.

~$ ruby redistop.rb -R
Probing...Type CTRL+C to stop probing.

PID   REQ/S
1794  2345
22463 1025
2068  785
53680 757
1747  519
1841  462
53633 204

Total:  6116 req/s

~$ ruby redistop.rb -F
Probing...Type CTRL+C to stop probing.

PID   COUNT LATENCY     CMD
1794  925   <0.000023>  zrangebyscore
2068  324   <0.000032>  zrangebyscore
22463 293   <0.000033>  get
53680 255   <0.000014>  get
53680 252   <0.000017>  hget
1794  249   <0.000015>  get
1794  248   <0.000018>  hget
22463 230   <0.000039>  hget
1747  225   <0.000053>  zrangebyscore
2068  179   <0.000018>  hget

~$ ruby redistop.rb -K
Probing...Type CTRL+C to stop probing.

COUNT KEY
1320  get
1107  hget
966   zrangebyscore
486   fr:ab_test_ids
462   pl:ab_test_ids
442   de_babies:ab_test_ids
308   cz:ab_test_ids
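
Output like the -F view above is easy to post-process. As a minimal sketch, assuming the four-column layout shown in the sample (PID, COUNT, LATENCY, CMD) stays stable, the following Python snippet sums the request counts per command across instances. It is not part of redistop.rb, and the sample rows are copied from the first three lines of the -F output.

```python
from collections import Counter

SAMPLE = """\
PID   COUNT LATENCY     CMD
1794  925   <0.000023>  zrangebyscore
2068  324   <0.000032>  zrangebyscore
22463 293   <0.000033>  get
"""


def count_by_command(text: str) -> Counter:
    """Sum the COUNT column per command, skipping headers and malformed lines."""
    totals = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly four fields and start with a numeric PID.
        if len(fields) != 4 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        _pid, count, _latency, cmd = fields
        totals[cmd] += int(count)
    return totals


if __name__ == "__main__":
    for cmd, total in count_by_command(SAMPLE).most_common():
        print(cmd, total)
```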

Lessons Learned

  • Redis Cluster is a very cool service, but due to the immaturity of its client-side software we decided to postpone using it.
  • Redis Sentinel failover works as expected. A manual failover completes in under a second (0.835s in our test).
  • Migration from a standalone instance to Redis Sentinel is very simple.
  • Monitoring Redis instances became very easy for us, as we can inspect requests per second, latency, and the most used keys and commands.