Background
We needed to deploy a Kafka cluster, and decided to run ZooKeeper under Docker, so the first step was to stand up a ZK cluster with docker-compose. Along the way, a mismatch between the ZK version and its configuration produced a bug where the client port was unreachable on every node in the cluster.
Steps to reproduce
1. Write docker-compose.yml
version: '3.1'
services:
  zoo1:
    image: zookeeper
    restart: always
    hostname: zoo1
    ports:
      - 2181:2181
    environment:
      ZOO_MY_ID: 1
      ZOO_SERVERS: server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888
  zoo2:
    image: zookeeper
    restart: always
    hostname: zoo2
    ports:
      - 2182:2181
    environment:
      ZOO_MY_ID: 2
      ZOO_SERVERS: server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888
  zoo3:
    image: zookeeper
    restart: always
    hostname: zoo3
    ports:
      - 2183:2181
    environment:
      ZOO_MY_ID: 3
      ZOO_SERVERS: server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888
A quick explanation: we define three ZK containers, named zoo1, zoo2 and zoo3, mapped to host ports 2181, 2182 and 2183 respectively. The only environment variables each node needs are its ZOO_MY_ID and the cluster membership in ZOO_SERVERS. For networking, we let Docker build a bridge network automatically.
2. Start the containers with docker-compose up
zoo3_1 | 2020-06-06 01:27:10,210 [myid:3] - INFO [QuorumPeer[myid=3](plain=disabled)(secure=disabled):QuorumPeer@863] - Peer state changed: leading - broadcast
The cluster came up and the zoo3_1 node entered the leading state, confirming that leader election completed.
Testing
# docker exec -it zookeeper_cluster_test_zoo1_1 zkCli.sh
...
2020-06-06 01:35:05,000 [myid:localhost:2181] - WARN [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:342)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1262)
Trying to connect to the host's port 2181 showed that the port was not reachable. At this point the working theory was that the ZK cluster itself had started, but the client port was not being exposed.
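The same check can be scripted instead of going through zkCli.sh each time. A minimal sketch of such a probe (port_open is a hypothetical helper written for this post, not part of any ZK or Docker tooling; the host and ports match the compose file above):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timeout, unresolvable host, ...
        return False

# Probe the three host ports mapped in the compose file above.
for port in (2181, 2182, 2183):
    print(f"localhost:{port} open: {port_open('localhost', port)}")
```

In the broken state described here, all three probes come back False even though the containers are running.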
The fix
At first I wondered whether Docker itself was at fault, but there were no similar reports online.
Next I tried a non-clustered setup: commenting out ZOO_SERVERS: server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888 from the config (without this setting, ZK automatically runs in standalone mode).
In standalone mode, connections were stable.
So the problem was narrowed down to ZK's own state.
root@zoo1:/apache-zookeeper-3.6.0-bin# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /conf/zoo.cfg
Client port not found in static config file. Looking in dynamic config file.
grep: : No such file or directory
Client port not found. Terminating.
Logging into the Docker container and checking the node's status showed that no client port was configured — the problem was essentially located.
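The output above makes sense once you know how zkServer.sh status works: it greps the static zoo.cfg for a clientPort setting and, failing that, looks in the dynamic config file — and with the 3.4-style ZOO_SERVERS, neither contains one. A rough Python illustration of that lookup (find_client_port and the config strings are illustrative assumptions for this post, not the exact files the image generates):

```python
def find_client_port(zoo_cfg_text):
    """Look for a clientPort=... line, as zkServer.sh does in the static config."""
    for line in zoo_cfg_text.splitlines():
        if line.strip().startswith("clientPort="):
            return int(line.split("=", 1)[1])
    return None  # zkServer.sh then falls back to the dynamic config file

# 3.4-style server entries carry no client port, so the lookup finds nothing:
broken_cfg = "dataDir=/data\nserver.1=zoo1:2888:3888\n"
working_cfg = "dataDir=/data\nclientPort=2181\nserver.1=zoo1:2888:3888\n"
print(find_client_port(broken_cfg), find_client_port(working_cfg))  # None 2181
```

With no client port found anywhere, the server never binds 2181 — which matches the "Client port not found. Terminating." message.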
Checking the image's description on hub.docker.com, there is a note about the ZK version issue:
This will start Zookeeper 3.5 in replicated mode. Please note, that Zookeeper 3.4 has slightly different ZOO_SERVERS format. Run docker stack deploy -c stack.yml zookeeper (or docker-compose -f stack.yml up) and wait for it to initialize completely. Ports 2181-2183 will be exposed. Please be aware that setting up multiple servers on a single machine will not create any redundancy. If something were to happen which caused the machine to die, all of the zookeeper servers would be offline. Full redundancy requires that each server have its own machine. It must be a completely separate physical server. Multiple virtual machines on the same physical host are still vulnerable to the complete failure of that host. Consider using Docker Swarm when running Zookeeper in replicated mode.
So the official docs already warn that the ZOO_SERVERS format differs between ZK 3.4 and 3.5.
ZK 3.4: ZOO_SERVERS: server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888
ZK 3.5+: ZOO_SERVERS: server.1=zoo1:2888:3888;2181 server.2=zoo2:2888:3888;2181 server.3=0.0.0.0:2888:3888;2181
In 3.5+ the client port is appended to each server entry after a semicolon; the 0.0.0.0 in the last entry follows the Docker Hub example, where a node may list its own address as the wildcard.
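The two formats differ only in the ;clientPort suffix on each entry. A small parser sketch to make the difference concrete (parse_server_entry is an illustrative helper, not a ZK API):

```python
def parse_server_entry(entry):
    """Parse one 'server.N=host:quorumPort:electionPort[;clientPort]' entry.
    ZK 3.5+ appends the client port after a semicolon; 3.4 has no such field."""
    key, _, value = entry.partition("=")
    sid = int(key.split(".")[1])
    addr, _, client = value.partition(";")
    host, quorum, election = addr.split(":")
    return {
        "id": sid,
        "host": host,
        "quorum_port": int(quorum),
        "election_port": int(election),
        "client_port": int(client) if client else None,  # None => 3.4-style entry
    }

old = parse_server_entry("server.1=zoo1:2888:3888")       # 3.4 style: no client port
new = parse_server_entry("server.1=zoo1:2888:3888;2181")  # 3.5+ style: client port 2181
print(old["client_port"], new["client_port"])  # None 2181
```

The missing client_port in the 3.4-style entry is exactly what left zkServer.sh unable to find a client port on a 3.5+ server.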
After correcting the configuration and restarting, everything verified fine.
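For reference, here is a sketch of what the corrected zoo1 service looks like, using hostnames for every entry on the assumption that they resolve inside the compose network (the Docker Hub stack.yml instead lists each node's own address as 0.0.0.0; zoo2 and zoo3 change the same way, with their own ZOO_MY_ID and host port):

```yaml
zoo1:
  image: zookeeper
  restart: always
  hostname: zoo1
  ports:
    - 2181:2181
  environment:
    ZOO_MY_ID: 1
    ZOO_SERVERS: server.1=zoo1:2888:3888;2181 server.2=zoo2:2888:3888;2181 server.3=zoo3:2888:3888;2181
```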