A Pitfall When Deploying a ZooKeeper Cluster with Docker Compose: Nodes Unreachable in Cluster Mode

Background

Because we needed to stand up a Kafka cluster, we decided to run ZooKeeper on Docker, starting with a docker-compose deployment of the ZK cluster. Along the way, a mismatch between the ZK version and the configuration format produced a bug where the client port of every node in the cluster was unreachable.

Steps to Reproduce

1. Write the docker-compose.yml

version: '3.1'

services:
  zoo1:
    image: zookeeper
    restart: always
    hostname: zoo1
    ports:
      - 2181:2181
    environment:
      ZOO_MY_ID: 1
      ZOO_SERVERS: server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888

  zoo2:
    image: zookeeper
    restart: always
    hostname: zoo2
    ports:
      - 2182:2181
    environment:
      ZOO_MY_ID: 2
      ZOO_SERVERS: server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888

  zoo3:
    image: zookeeper
    restart: always
    hostname: zoo3
    ports:
      - 2183:2181
    environment:
      ZOO_MY_ID: 3
      ZOO_SERVERS: server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888

A quick walkthrough: we define three ZK containers named zoo1, zoo2, and zoo3, mapped to host ports 2181, 2182, and 2183 respectively. The only environment variables we set are each node's ZOO_MY_ID and the cluster membership list ZOO_SERVERS. Networking is left to Docker, which automatically creates a bridge network for the project.
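You can inspect that bridge network directly (a small sketch; the network name follows Compose's <project>_default convention, so it depends on your project directory name):

# List networks; Compose names the default bridge <project>_default.
docker network ls
# Confirm all three zoo containers are attached to it
# (replace zookeeper_cluster_test with your own project name).
docker network inspect zookeeper_cluster_test_default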

2. Start the containers with docker-compose up

zoo3_1  | 2020-06-06 01:27:10,210 [myid:3] - INFO  [QuorumPeer[myid=3](plain=disabled)(secure=disabled):QuorumPeer@863] - Peer state changed: leading - broadcast

The cluster is up: node zoo3_1 has entered the leading state, which confirms that leader election completed.
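To check which role each node ended up with, the combined logs can be grepped (a quick sketch; the exact wording of the state-change lines varies between ZK versions):

# Expect one node reporting "leading" and the other two "following".
docker-compose logs | grep -i -E "leading|following"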

3. Test

# docker exec -it zookeeper_cluster_test_zoo1_1 zkCli.sh
...
2020-06-06 01:35:05,000 [myid:localhost:2181] - WARN  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:342)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1262)

Trying to connect to the host's port 2181 shows the port is not reachable. At this point our working theory was: the ZK cluster has started, but the client ports were never opened to the outside.
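A quick probe of all three mapped ports from the host confirms this (assuming nc/netcat is installed; ports as mapped in the compose file above):

# -z: only test the connection, -v: verbose output.
# While the bug is present, all three report "Connection refused".
nc -zv localhost 2181
nc -zv localhost 2182
nc -zv localhost 2183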

The Fix

We first wondered whether Docker itself was to blame, but found no reports of a similar problem online.

Next we tried non-cluster mode, i.e. commenting out ZOO_SERVERS: server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888 in the configuration (without this variable, ZK automatically runs in standalone mode).
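For reference, the standalone variant we tested looks like this (a single-node sketch; same image, with ZOO_SERVERS left out so ZK falls back to standalone mode):

version: '3.1'

services:
  zoo1:
    image: zookeeper
    restart: always
    hostname: zoo1
    ports:
      - 2181:2181
    environment:
      ZOO_MY_ID: 1
      # ZOO_SERVERS deliberately omitted: without it, ZK runs standalone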

In standalone mode, connections were perfectly stable.

That narrowed the problem down to the state of ZK itself.

root@zoo1:/apache-zookeeper-3.6.0-bin# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /conf/zoo.cfg
Client port not found in static config file. Looking in dynamic config file.
grep: : No such file or directory
Client port not found. Terminating.

Logging into the Docker container and checking the ZK node's status shows that no client port is configured; the problem is essentially pinpointed.
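You can confirm this by dumping the generated config inside the container (path as shown in the zkServer.sh output above):

# With the 3.4-style ZOO_SERVERS the generated config contains the
# server.N lines but no client port anywhere: no clientPort= line and
# no ";2181" suffix on the server entries, so nothing gets bound.
docker exec -it zookeeper_cluster_test_zoo1_1 cat /conf/zoo.cfg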

Checking the description of the zookeeper image on hub.docker.com, it mentions exactly this version issue:

This will start Zookeeper 3.5 in replicated mode. Please note, that Zookeeper 3.4 has slightly different ZOO_SERVERS format. Run docker stack deploy -c stack.yml zookeeper (or docker-compose -f stack.yml up) and wait for it to initialize completely. Ports 2181-2183 will be exposed. Please be aware that setting up multiple servers on a single machine will not create any redundancy. If something were to happen which caused the machine to die, all of the zookeeper servers would be offline. Full redundancy requires that each server have its own machine. It must be a completely separate physical server. Multiple virtual machines on the same physical host are still vulnerable to the complete failure of that host. Consider using Docker Swarm when running Zookeeper in replicated mode.

So the official docs already warn that the ZOO_SERVERS format differs between ZK 3.4 and 3.5:

ZK 3.4:  ZOO_SERVERS: server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888
ZK 3.5+: ZOO_SERVERS: server.1=zoo1:2888:3888;2181 server.2=zoo2:2888:3888;2181 server.3=0.0.0.0:2888:3888;2181

In 3.5+ every server entry must carry the client port after a semicolon; without it, ZK never binds a client port at all. (The 0.0.0.0 in the last entry stands for the node's own address, i.e. the line above is zoo3's configuration; this is a common convention so the local server binds on all interfaces, and using the hostname zoo3 also works inside the Docker bridge.)
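Applied to the compose file, the fixed service definition looks like this (only zoo1 shown, zoo2/zoo3 are analogous; pinning the image tag is our own suggestion, since an unpinned zookeeper image is what tripped us up in the first place):

  zoo1:
    image: zookeeper:3.6    # pin the version instead of relying on latest
    restart: always
    hostname: zoo1
    ports:
      - 2181:2181
    environment:
      ZOO_MY_ID: 1
      ZOO_SERVERS: server.1=zoo1:2888:3888;2181 server.2=zoo2:2888:3888;2181 server.3=zoo3:2888:3888;2181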

After correcting the format and restarting, everything verified fine.
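One way to verify all three nodes at once (container names follow Compose's <project>_<service>_1 pattern seen earlier; expect "Mode: follower" on two nodes and "Mode: leader" on one):

for i in 1 2 3; do
  docker exec zookeeper_cluster_test_zoo${i}_1 zkServer.sh status
done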

References

Official image: docker/zookeeper (https://hub.docker.com/_/zookeeper)