饿了么 Redis 平台建设

阅读 1476
收藏 23
2018-11-18
原文链接:zhuanlan.zhihu.com

Redis Cluster

Redis作为一个支持多种数据结构的键值对内存数据库, 被广泛用于饿了么的各类业务场景中。其官方集群方案Redis Cluster因其支持分片,高可用和平滑扩缩容,被我们选为最主要的分布式Redis方案。

我们的主要工作是给Redis Cluster做自动化运维和管理工具。

Redis Cluster 简介

Redis Cluster由多个分片组成, 每个分片包含一个master和多个slave来实现高可用, 每个master和slave都是单线程。Redis每个请求和数据都对应一个key, Redis Cluster将请求的所有key通过crc16哈希到16384(2的14次方)个slot中, 并将这16384个slot分配到集群中的所有分片上. 这样每个分片就只负责一部分请求和数据。Redis Cluster的扩缩容就是在分片间迁移slot和增加剔除分片节点.

自动化运维平台

Redis Cluster功能完善, 但创建和变更集群都要按一套复杂流程调集群接口来实现.

例如扩容集群要走以下流程:

  1. 发送CLUSTER MEET命令添加新节点
  2. 对迁移slot的两个节点使用CLUSTER SETSLOT标记迁移状态
  3. 循环发送MIGRATE命令迁移数据
  4. 再次使用CLUSTER SETSLOT命令变更slot所有权

而官方对此提供了一个Ruby脚本来简化运维工作。使用这个脚本意味着必须登上机器操作, 这对于需要运维成百上千个集群的用户来说是完全不够的。

因此早期大规模使用Redis Cluster的用户实现自动化运维的方案一般就是重新实现这个脚本并在此基础上实现一个运维平台, 也就是我们的初期方案。

这样对应到平台上面的功能就是:

  1. 挑选机器部署redis节点
  2. 指定redis节点添加到集群
  3. 迁移slot到新节点

集群结构规范化

为什么以上三个步骤不合并为一个扩容操作呢? 因为当时并没有想明白每个集群真正需要做的变更操作, 拆分出来的小功能点会用于多处. 例如部署新节点可能用于扩容, 也可能用于补充因宕机缺失的从节点. 而各个集群的结构千差万别, 有每个分片一个slave, 有每个分片多个slave, 也有因人为疏忽缺失slave节点的集群, 针对不同结构的集群做变更和修复需要结合多个小功能点.

而无法确定需要哪些操作的原因是集群结构不一. 集群结构也就是哪些机器上部署节点, master-slave怎么分布.

然而我们一开始人肉管理集群结构导致线上集群的结构越来越不同, 问题越来越多, 需要运维工具提供的操作越来越复杂.

问题有:

  • 缺少slave节点
  • master和slave节点分布在同一台机器上, 机器宕机后该分片没有其它slave节点就会不可用
  • 过半数master节点都在某一台机器上, 该机器宕机导致整个集群不可用
  • 宕机后master节点切换导致少数机器网卡被打爆
  • slot不平均
  • 每台机器压力不平均

这时我们发现无论实现运维工具还是做运维操作都非常困难, 运维工具开始需要使用复杂的图算法, 运维人员在面板上需要做大量的小操作才能完成任务, 而且线上事故频发.

Redis Cluster自身不是多线程, 无法在一个集群上面提供租户功能, 我们不得不将各个业务方对应的集群混合部署在多台机器上. 对于这种单台机器有多个master slave节点的结构, 集群结构决定了稳定性. 只有让程序管理集群才能保证良好的集群结构, 才能减少运维人员操作的复杂度. 而让程序管理集群结构必须建立一套集群结构的规则.

运维集群和实现运维工具的核心是规范化集群.

规范与Chunk算法

集群结构规范需要满足以下规则:

  1. master节点和slave节点不能在一台机器上
  2. 不能有过半数(包括半数)master节点在同一台机器上(否则宕机后集群无法选举)
  3. 集群在挂一台机器的情况下, 压力应尽可能平均分流到其它机器
  4. master节点和slave节点在各台机器上的分布应当平均

有一些公司采取的是根据机器各项指标(CPU, 内存, 网卡使用率)分配单个节点的方案, 并且像迁移容器一样动态迁移单个redis节点来实现机器的压力平衡, 配以以上规则来约束节点分配和迁移.

而我们希望用更简单的分配方式和避免动态迁移节点来减少复杂度.

我们的方案是首先将每4个redis节点分为一组, 并且满足以下结构, 平均分到两台机器上, 两两互为master-slave.

所有集群都由一个个chunk组合而成, 这样master-slave必不在同一台机器上, 且每台机器上的master-slave比例永远是1:1, 只要每台机器分配的节点数是平均的, master slave节点在每台机器的数量也是平均的, 规则(1)(4)就满足了.

而我们设计了一个分配算法让分配出来的集群可以满足(2)(3)。详细说明和证明有兴趣可以看文末附件.

这种结构还能让增删节点(增删chunk)后集群还能满足上述的规则.

规范了集群的结构后, 只要集群不满足初始的结构(例如宕机或master-slave切换), 就把集群修复到原来的结构.

这样集群一直处于健康的结构, 并且由于结构是固定并且有规律的, 无需提供纷繁的小操作, 运维人员常规做的只有:

  • 创建集群
  • 替换机器(也可以用于处理宕机)
  • 迁移集群(用于扩缩容)

Redis Cluster的坑

  • 运维Redis Cluster两年期间我们发现了不少集群相关的bug, 如
  • epoch解决冲突机制失效导致同一分片同时存在两个master节点
  • 选举有时候需要花几分钟而不是常规的30秒内
  • 节点间握手可能会一直不成功
  • 集群元信息不一致但是不能自动修复 (这个问题官方最近开始考虑修复了)
  • 迁移slot失败(例如迁移中迁出节点宕机)后部分迁移中的slot会无人管.

遇上这种问题往往很难定位集群中哪些状态是不正确的, 而操作不健康的集群又容易踩进其它坑里面, 通过当场修复不正确的状态往往需要耗费大量. 早年多起相关问题的事故都教训了我们不要浪费时间去修复集群. 对于这类集群损坏的问题我们目前全部都是通过切换到一个新集群来解决.

而正常的变更也不容易做好. 从很多工具(包括官方脚本)都没有处理好集群中各种隐含的状态和边界条件. 比如迁移slot最后的CLUSTER SETSLOT如果操作顺序不对会引发大量请求重定向. Redis github issue上也可以看到不少用官方脚本操作后集群完全损坏的问题.

而扩容中, 我们发现有时MIGRATE命令会返回busy loading的错误, 这是由于节点会有一个很短暂的初始化阶段, 这个阶段的节点会拒绝服务几乎所有类型的请求. 此外, 还需要考虑实现迁移slot的并行化来加速集群的扩缩容. 我们在官方的slot迁移上做了很多功夫, 最后发现可靠的方案逻辑复杂, 速度又慢. 最后还是不得不抛弃原生的扩缩容, 使用一致性差但是稳定性更好的建新集群加导数据的方案.

后续的工作

目前我们正在做集群的快速扩缩和新的Redis Cluster Proxy, 欢迎有兴趣的同学加入我们, 发简历到tianwen.zhang@ele.me

chunk分配算法及证明

by 杨瑒

The node allocation algorithm aims to distribute chunks of nodes to achieve
the maximum of balance, aka. trying best to spread the failover of slaves
on the lost host most widely across the whole cluster.

Note:
  1. In this proof, we'll use "=" as assignment, and "==" as mathematical equal
     to conform to python's notations.
  2. It is hard to write mathematical symbols and notations in plain text,
     and therefore we adopt some of python's functions to help explanation.

Glossary:
1. chunk:
  A group of nodes in 4 across 2 hosts where 1 master node
  and 1 slave node each.
  Either HOST 1 or HOST 2 fails, redis node A and B
  should work fine after failing-over to the other host.
  -------------------------------
             one chunk
  | HOST   1 |      | HOST   2 |
  | master A | <--> | slave  A |
  | slave  B | <--> | master B |
  -------------------------------

2. link(ed) host:
  One chunk consists of 2 hosts, and we call them linked since the slave
  of the first host's master is on the second host, and vice versa.

Algorithm Description:
  Assume there're n hosts, whose corresponding available node number
  is host[i], where i in range(0, n) (aka. len(host) == n)

  The available node number on each host should meet the premise of:
  1. sum(host) % 4 == 0
  2. host[i] is integer for i in range(0, n)
  3. host[i] % 2 == 0 for i in range(0, n)
  4. host[m] <= sum(<host[i]>, i in range(0, n) && i != m),
     where m = max_index(host) (the index of the max value in host)

  Build up a link table (lt) which tracks the number of links
  from host[i] to host[j], where i, j in range(0, n):

  |        | HOST 0   | HOST 1   | HOST 2   | HOST 3   | HOST 4   |
  |--------|----------|--------- |----------|--------- |----------|
  | HOST 0 | 0        | lt[0, 1] | lt[0, 2] | lt[0, 3] | lt[0, 4] |
  | HOST 1 | lt[0, 1] | 0        | lt[1, 2] | lt[1, 3] | lt[1, 4] |
  | HOST 2 | lt[0, 2] | lt[1, 2] | 0        | lt[2, 3] | lt[2, 4] |
  | HOST 3 | lt[0, 3] | lt[1, 3] | lt[2, 3] | 0        | lt[3, 4] |
  | HOST 4 | lt[0, 4] | lt[1, 4] | lt[2, 4] | lt[3, 4] | 0        |
                  (A link table of hosts of 5)


  Steps:
  1. Check the input, if host does not meet up with the premise, reject.
  2. Initialize the lt (link table) to zeros.
  3. while any(host > 0) do following:
    3.1. m = max_index(host) (the index of the max value in host)
    3.2. llh = find_least_linked_host_index(m)
         (search the link table at row m, and find the minimum excluding self)

         i.e.
         |        | HOST 0   | HOST 1   | HOST 2   | HOST 3   | HOST 4   |
         |--------|----------|--------- |----------|--------- |----------|
         | HOST 0 | 0        | 2        | 4        | 4        | 4        |

         In such case, llh is 1 (HOST 1) (remember to exclude self, which is 0)
    3.3. establish a link between m and llh and create a chunk
    3.4  lt[m, llh] = lt[m, llh] + 1
    3.5. host[m]    = host[m]    - 2
    3.6. host[llh]  = host[llh]  - 2

Proof of convergence: (Prove that sum(host) == 0 finally)

  Lemma 1. For each iteration,
           host[m] <= sum(host[i], i in range(0, n) && i != m),
           where m = max_index(host)
           is always true.

  Proof of Lemma 1:
    According to premise 4, the initial state is true.

    Let host_next list be the next state of host after one iteration,
        p be the index of host linked with m in the new chunk.

    Therefore, we have:
      1. host_next[m] = host[m] - 2
      2. host_next[p] = host[p] - 2

    The next iteration state diverges into 2 following conditions:
    1) Assume m is still the max_index of host_next, which is:

               m == max_index(host_next)

       Left of inequation:

               host_next[m]
            == host[m] - 2

       Right of inequation:

               sum(host_next[i], i in range(0, n) && i != m)
            == sum(<host[i]>, i in range(0, n) && i != m) - 2

       Premise 4:

               host[m] <= sum(<host[i]>, i in range(0, n) && i != m)

       Therefore:

               host[m] - 2 <= sum(<host[i]>, i in range(0, n) && i != m) - 2

               host_next[m] <= sum(<host_next[i]>, i in range(0, n) && i != m)

    2) Assume q = max_index(host_next), and q != m


       Suppose q == p, then, host_next[q] == host_next[p] == host[p] - 2

       Since,
         1. host_next[m] == host[m] - 2
         2. host[m] > host[p] (Use condition 1 for host[m] == host[p])

       Then

         host_next[m] > host_next[p] == host_next[q]

       which is in contradiction to q being the max_index of host_next, and
       therefore

               q != p


       With q != m and q != p, we have
          1. host_next[q] == host[q] > host_next[m]
          2. host_next[m] == host[m] - 2
         ==> host[q] > host[m] - 2

       We also know that

               host[i] % 2 == 0 for i in range(0, n) ... Premise 3
               host[m] >= host[q]                    ... Premise 4

       We apply 3 conditions mentioned above together:
       1. host[i] % 2 == 0 for i in range(0, n)
       2. host[q] > host[m] - 2
       3. host[q] <= host[m]

       Then,
       1. host[m] - 2 < host[q] <= host[m]
       2. host[q] % 2 == 0

       Therefore,

               host[q] == host[m]

       Left of inequation:

               host_next[q]
            == host[q]

       Right of inequation:

               sum(<host_next[i]>, i in range(0, n) && i != q)
            == sum(<host[j]>) + host[m] - 2 - 2
            == sum(<host[j]>) + host[q] - 4 (since host[q] == host[m])
            , where j in range(0, n) && j != q && j != m

       Therefore, we need to prove that

               sum(<host[j]>) - 4 >= 0
            , where j in range(0, n) && j != q && j != m

        According to premise 2, 3, sum of host[j] results in
        3 following conditions:

        2.1) sum(<host[j]>) == 0

          Since
          1. q != m && q != p
          2. j != q && j != m
          3. host[p] >= 2 (host_next[p] = host[p] - 2 >= 0)

          Then,

                 sum(<host[j]>) >= host[p] >= 2

          which is in contradiction to sum(<host[j]>) == 0, and therefore
          condition 2.1 is logically impossible.

        2.2) sum(<host[j]>) == 2

          According to premise 3, non-zero values in host are
          {host[m], host[q], 2}

          Apply premise 1

               sum(host) % 4 ==0
           ==> host[m] + host[q] + 2 == 4*x
           ==> host[m] + host[q] == 4*x - 2
           ==> 2 * host[q] == 4*x - 2 (since host[q] == host[m])
           ==> host[q] == 2*x - 1

          host[q] == 2*x - 1 is in contradiction to premise 3, and therefore
          condition 2.2 is logically impossible.

        2.3) sum(<host[j]>) >= 4, conforms.

        Therefore,

               sum(<host[j]>) - 4 >= 0
           ==> host[q] <= host[q] + sum(<host[j]>) - 4
           ==> host[q] <= host[m] + sum(<host[j]>) - 4
           ==> host_next[q] <= host_next[m] + sum(<host_next[j]>)
           ==> host_next[q] <= sum(<host_next[i]>)
           , where i in range(0, n) && i != q,
                   j in range(0, n) && j != q && j != m

      Since condition 1) and 2) are respectively proved, lemma 2 is proved.

  For each iteration, sum(host_next) == sum(host) - 2 - 2
  Apply Lemma 1. host[m] <= sum(<host[i]>, i in range(0, n) && i != m),
  aka. 2*host[m] <= sum(host) is always true.

  Since sum(host) is monotonically decreasing, and 2*host[m] <= sum(host),
  host[m], aka. the maximum value of host inclines to zero.

  The algorithm converges.

  Q.E.D.


作者简介

黄光星,16年加入饿了么, 在饿了么主要负责Redis Cluster.

另外感谢2018开源数据库论坛(ODF)中国Redis用户组(CRUG)给我们颁奖! 希望各位Redis用户能加入CRUG贡献自己的经验!

评论