Like MMM and MHA for MySQL or replica sets for MongoDB, Redis has its own HA feature called Sentinel.
Sentinel provides the following features:
- Monitoring : continuously monitors whether replication is working properly
- Automatic Failover : when the Redis master goes down, promotes a replica to master and reconfigures the other replicas to follow the new master
- Notification : when a monitored instance fails over, notifies the administrator by SMS or email through a shell script
sentinel topology
- Master - 172.17.0.3 : 6001
- Slave1 - 172.17.0.4 : 6001
- Slave2 - 172.17.0.5 : 6001
- Sentinel1 - 172.17.0.3 : 5001
- Sentinel2 - 172.17.0.4 : 5001
- Sentinel3 - 172.17.0.5 : 5001
sentinel conf
- port 5001 : the default is 26379 (set to 5001 here)
- sentinel monitor mymaster 172.17.0.3 6001 2
  Name, IP, and port of the master to monitor, followed by the quorum.
  A quorum of 2 means that, with three sentinels, the failover process starts once at least two sentinels agree that the monitored Redis master is down.
  If the auth-pass setting comes before this line, it is not recognized.
- sentinel down-after-milliseconds mymaster 3000
  Time in milliseconds after which an unresponsive master is considered down.
Set other options such as auth-pass, dir, the logfile name, and daemonize as needed.
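To make the quorum setting more concrete, here is a minimal Python sketch. It assumes the usual Sentinel behavior in which the quorum only decides when a master is flagged as objectively down (odown), while the failover itself must be authorized by a majority of all known sentinels. The function names are hypothetical and are not part of any Redis API.

```python
def is_odown(sdown_reports, quorum):
    """The master is objectively down once at least `quorum` sentinels report it down."""
    return sdown_reports >= quorum


def can_failover(votes, total_sentinels, quorum):
    """The would-be leader needs a majority of all sentinels and at least `quorum` votes."""
    majority = total_sentinels // 2 + 1
    return votes >= max(majority, quorum)


# With three sentinels and quorum 2, as in this setup:
print(is_odown(sdown_reports=2, quorum=2))                 # two sentinels see the master down
print(can_failover(votes=1, total_sentinels=3, quorum=2))  # a lone sentinel cannot fail over
```

One consequence of this model is that lowering the quorum to 1 makes odown detection faster but does not let a single isolated sentinel run a failover, because the majority requirement still applies.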
starting sentinel
On all nodes
[root@8ce6101da595 redis-stable]# redis-sentinel ./sentinel.conf
280:X 30 Nov 2019 08:02:59.373 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
280:X 30 Nov 2019 08:02:59.373 # Redis version=5.0.7, bits=64, commit=00000000, modified=0, pid=280, just started
280:X 30 Nov 2019 08:02:59.373 # Configuration loaded
281:X 30 Nov 2019 08:02:59.376 * Running mode=sentinel, port=5001.
281:X 30 Nov 2019 08:02:59.377 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
281:X 30 Nov 2019 08:02:59.379 # Sentinel ID is 4eabe8a748855566d1786c87c3440310a06f28ef
281:X 30 Nov 2019 08:02:59.380 # +monitor master mymaster 172.17.0.3 6001 quorum 2
281:X 30 Nov 2019 08:03:00.000 * +sentinel sentinel 3f2c41fdeb6d4806db1c05a1f932069032155c26 172.17.0.5 5001 @ mymaster 172.17.0.3 6001
281:X 30 Nov 2019 08:03:00.358 * +sentinel sentinel d804939c0984dfe5fab504d3faa85f5b2ad4582b 172.17.0.4 5001 @ mymaster 172.17.0.3 6001
sentinel status
[root@8ce6101da595 data]# redis-cli -p 5001
127.0.0.1:5001> info sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=mymaster,status=ok,address=172.17.0.3:6001,slaves=2,sentinels=3
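- One master is being monitored
- Three sentinels in total are monitoring it together
- If the sentinels count is not what you expect, check whether the same sentinel id has been configured on more than one node, whether auth-pass comes before the monitor setting, and so on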
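The check described above can be automated. The following is a small sketch that parses the `master0:` line from `info sentinel` and compares the counts against what the topology should have. The helper names and the expected counts are assumptions for illustration; the input line is the one shown in the output above.

```python
def parse_master_line(line):
    """Parse 'master0:name=mymaster,status=ok,...' into a dict of its fields."""
    _, _, fields = line.partition(":")  # drop the 'master0' prefix
    result = {}
    for pair in fields.split(","):
        key, _, value = pair.partition("=")
        result[key] = value
    return result


def check_master(line, expected_sentinels, expected_replicas):
    """Return a list of human-readable problems; an empty list means all good."""
    info = parse_master_line(line)
    problems = []
    if info.get("status") != "ok":
        problems.append("status is %s" % info.get("status"))
    if int(info.get("sentinels", 0)) != expected_sentinels:
        problems.append("sentinels=%s, expected %d" % (info.get("sentinels"), expected_sentinels))
    if int(info.get("slaves", 0)) != expected_replicas:
        problems.append("slaves=%s, expected %d" % (info.get("slaves"), expected_replicas))
    return problems


line = "master0:name=mymaster,status=ok,address=172.17.0.3:6001,slaves=2,sentinels=3"
print(check_master(line, expected_sentinels=3, expected_replicas=2))
```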
127.0.0.1:5001> sentinel sentinels mymaster
1)  1) "name"
    2) "d804939c0984dfe5fab504d3faa85f5b2ad4582b"
    3) "ip"
    4) "172.17.0.4"
    5) "port"
    6) "5001"
    7) "runid"
    8) "d804939c0984dfe5fab504d3faa85f5b2ad4582b"
    9) "flags"
   10) "sentinel"
    .
    .
2)  1) "name"
    2) "3f2c41fdeb6d4806db1c05a1f932069032155c26"
    3) "ip"
    4) "172.17.0.5"
    5) "port"
    6) "5001"
    7) "runid"
    8) "3f2c41fdeb6d4806db1c05a1f932069032155c26"
    9) "flags"
   10) "sentinel"
    .
    .
=> This is how to check which sentinel has not been recognized
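As a rough illustration of that check, the sketch below takes the `sentinel sentinels mymaster` reply as a list of flat field/value lists (the shape a client library would typically hand back for this kind of reply, which is an assumption here) and reports which expected sentinel IPs are missing. The function names and the `expected` list are hypothetical. Note that the command lists only the other sentinels, so the local one is added by hand.

```python
def pairs_to_dict(flat):
    """Turn ['name', 'x', 'ip', 'y', ...] into {'name': 'x', 'ip': 'y', ...}."""
    return dict(zip(flat[0::2], flat[1::2]))


def missing_sentinels(reply, expected_ips, local_ip):
    """Return the expected sentinel IPs that the local sentinel does not know about."""
    seen = {pairs_to_dict(entry)["ip"] for entry in reply}
    seen.add(local_ip)  # the reply never includes the sentinel being queried
    return sorted(set(expected_ips) - seen)


reply = [
    ["name", "d804939c0984dfe5fab504d3faa85f5b2ad4582b", "ip", "172.17.0.4", "port", "5001"],
    ["name", "3f2c41fdeb6d4806db1c05a1f932069032155c26", "ip", "172.17.0.5", "port", "5001"],
]
expected = ["172.17.0.3", "172.17.0.4", "172.17.0.5"]
print(missing_sentinels(reply, expected, local_ip="172.17.0.3"))
```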
failover test
master down
127.0.0.1:6001> role
1) "master"
2) (integer) 427287
3) 1) 1) "172.17.0.5"
      2) "6001"
      3) "427287"
   2) 1) "172.17.0.4"
      2) "6001"
      3) "427287"
127.0.0.1:6001> debug sleep 60
sentinel log on the server of the promoted replica
170:X 30 Nov 2019 08:10:59.936 # +sdown master mymaster 172.17.0.3 6001
170:X 30 Nov 2019 08:11:00.037 # +odown master mymaster 172.17.0.3 6001 #quorum 2/2
170:X 30 Nov 2019 08:11:00.037 # +new-epoch 8
170:X 30 Nov 2019 08:11:00.037 # +try-failover master mymaster 172.17.0.3 6001
170:X 30 Nov 2019 08:11:00.041 # +vote-for-leader d804939c0984dfe5fab504d3faa85f5b2ad4582b 8
170:X 30 Nov 2019 08:11:00.047 # 3f2c41fdeb6d4806db1c05a1f932069032155c26 voted for d804939c0984dfe5fab504d3faa85f5b2ad4582b 8
170:X 30 Nov 2019 08:11:00.047 # 4eabe8a748855566d1786c87c3440310a06f28ef voted for d804939c0984dfe5fab504d3faa85f5b2ad4582b 8
170:X 30 Nov 2019 08:11:00.143 # +elected-leader master mymaster 172.17.0.3 6001
170:X 30 Nov 2019 08:11:00.143 # +failover-state-select-slave master mymaster 172.17.0.3 6001
170:X 30 Nov 2019 08:11:00.199 # +selected-slave slave 172.17.0.4:6001 172.17.0.4 6001 @ mymaster 172.17.0.3 6001
170:X 30 Nov 2019 08:11:00.199 * +failover-state-send-slaveof-noone slave 172.17.0.4:6001 172.17.0.4 6001 @ mymaster 172.17.0.3 6001
170:X 30 Nov 2019 08:11:00.283 * +failover-state-wait-promotion slave 172.17.0.4:6001 172.17.0.4 6001 @ mymaster 172.17.0.3 6001
170:X 30 Nov 2019 08:11:01.103 # +promoted-slave slave 172.17.0.4:6001 172.17.0.4 6001 @ mymaster 172.17.0.3 6001
170:X 30 Nov 2019 08:11:01.103 # +failover-state-reconf-slaves master mymaster 172.17.0.3 6001
170:X 30 Nov 2019 08:11:01.189 * +slave-reconf-sent slave 172.17.0.5:6001 172.17.0.5 6001 @ mymaster 172.17.0.3 6001
170:X 30 Nov 2019 08:11:02.164 * +slave-reconf-inprog slave 172.17.0.5:6001 172.17.0.5 6001 @ mymaster 172.17.0.3 6001
170:X 30 Nov 2019 08:11:02.164 * +slave-reconf-done slave 172.17.0.5:6001 172.17.0.5 6001 @ mymaster 172.17.0.3 6001
170:X 30 Nov 2019 08:11:02.220 # -odown master mymaster 172.17.0.3 6001
170:X 30 Nov 2019 08:11:02.220 # +failover-end master mymaster 172.17.0.3 6001
170:X 30 Nov 2019 08:11:02.221 # +switch-master mymaster 172.17.0.3 6001 172.17.0.4 6001
170:X 30 Nov 2019 08:11:02.222 * +slave slave 172.17.0.5:6001 172.17.0.5 6001 @ mymaster 172.17.0.4 6001
170:X 30 Nov 2019 08:11:02.222 * +slave slave 172.17.0.3:6001 172.17.0.3 6001 @ mymaster 172.17.0.4 6001
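- The sentinels that detected sdown satisfied the configured quorum of 2, so odown was raised
- Failover then proceeds: the vote elects sentinel d804939c0984dfe5fab504d3faa85f5b2ad4582b (172.17.0.4:5001) as leader, and the leader selects 172.17.0.4:6001 as the new master
- replicaof no one is run on 172.17.0.4:6001, replicaof <new master ip> <port> is run on the other servers, and a config rewrite is performed as well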
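The event names in the sentinel log are regular enough to parse. The sketch below extracts the timestamp and event from each line and measures the time from the first `+sdown` to `+switch-master`. The regular expression is an assumption based only on the log lines shown in this post, and the helper names are hypothetical.

```python
import re
from datetime import datetime

# pid:role, timestamp, log-level mark, then an event such as +sdown or -odown.
EVENT_RE = re.compile(
    r"^(?P<pid>\d+):(?P<role>\w) "
    r"(?P<ts>\d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"[#*.-] (?P<event>[+-][\w-]+)(?: (?P<detail>.*))?$"
)


def parse_event(line):
    """Return (timestamp, event, detail), or None for lines that are not events."""
    m = EVENT_RE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(m["ts"], "%d %b %Y %H:%M:%S.%f")
    return when, m["event"], m["detail"] or ""


def failover_seconds(lines):
    """Seconds between the first +sdown and the last +switch-master, if both exist."""
    start = end = None
    for line in lines:
        parsed = parse_event(line)
        if parsed is None:
            continue
        when, event, _ = parsed
        if event == "+sdown" and start is None:
            start = when
        elif event == "+switch-master":
            end = when
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


log = [
    "170:X 30 Nov 2019 08:10:59.936 # +sdown master mymaster 172.17.0.3 6001",
    "170:X 30 Nov 2019 08:11:00.037 # +odown master mymaster 172.17.0.3 6001 #quorum 2/2",
    "170:X 30 Nov 2019 08:11:02.221 # +switch-master mymaster 172.17.0.3 6001 172.17.0.4 6001",
]
print(failover_seconds(log))
```

Keep in mind that this only measures the failover itself; the 3000 ms of down-after-milliseconds has already elapsed before the first `+sdown` appears.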
log of another sentinel server
281:X 30 Nov 2019 08:10:59.841 # +sdown master mymaster 172.17.0.3 6001
281:X 30 Nov 2019 08:11:00.045 # +new-epoch 8
281:X 30 Nov 2019 08:11:00.047 # +vote-for-leader d804939c0984dfe5fab504d3faa85f5b2ad4582b 8
281:X 30 Nov 2019 08:11:00.947 # +odown master mymaster 172.17.0.3 6001 #quorum 3/2
281:X 30 Nov 2019 08:11:00.947 # Next failover delay: I will not start a failover before Sat Nov 30 08:17:00 2019
281:X 30 Nov 2019 08:11:01.191 # +config-update-from sentinel d804939c0984dfe5fab504d3faa85f5b2ad4582b 172.17.0.4 5001 @ mymaster 172.17.0.3 6001
281:X 30 Nov 2019 08:11:01.192 # +switch-master mymaster 172.17.0.3 6001 172.17.0.4 6001
281:X 30 Nov 2019 08:11:01.193 * +slave slave 172.17.0.5:6001 172.17.0.5 6001 @ mymaster 172.17.0.4 6001
281:X 30 Nov 2019 08:11:01.194 * +slave slave 172.17.0.3:6001 172.17.0.3 6001 @ mymaster 172.17.0.4 6001
sentinel log on the old master ( after its redis came back to normal )
281:X 30 Nov 2019 08:10:59.841 # +sdown master mymaster 172.17.0.3 6001
281:X 30 Nov 2019 08:11:00.045 # +new-epoch 8
281:X 30 Nov 2019 08:11:00.047 # +vote-for-leader d804939c0984dfe5fab504d3faa85f5b2ad4582b 8
281:X 30 Nov 2019 08:11:00.947 # +odown master mymaster 172.17.0.3 6001 #quorum 3/2
281:X 30 Nov 2019 08:11:00.947 # Next failover delay: I will not start a failover before Sat Nov 30 08:17:00 2019
281:X 30 Nov 2019 08:11:01.191 # +config-update-from sentinel d804939c0984dfe5fab504d3faa85f5b2ad4582b 172.17.0.4 5001 @ mymaster 172.17.0.3 6001
281:X 30 Nov 2019 08:11:01.192 # +switch-master mymaster 172.17.0.3 6001 172.17.0.4 6001
281:X 30 Nov 2019 08:11:01.193 * +slave slave 172.17.0.5:6001 172.17.0.5 6001 @ mymaster 172.17.0.4 6001
281:X 30 Nov 2019 08:11:01.194 * +slave slave 172.17.0.3:6001 172.17.0.3 6001 @ mymaster 172.17.0.4 6001
281:X 30 Nov 2019 08:11:39.534 * +convert-to-slave slave 172.17.0.3:6001 172.17.0.3 6001 @ mymaster 172.17.0.4 6001
=> Sentinel puts the old master back into service as a replica
redis log of the promoted replica
33:M 30 Nov 2019 08:11:00.284 # Setting secondary replication ID to 057a128e63d977ea467b1a3e2515725ba7eec3ef, valid up to offset: 868315. New replication ID is 605ef3fef49cfa88e04e216cc804fdec80f4cf96
33:M 30 Nov 2019 08:11:00.284 # Connection with master lost.
33:M 30 Nov 2019 08:11:00.284 * Caching the disconnected master state.
33:M 30 Nov 2019 08:11:00.284 * Discarding previously cached master state.
33:M 30 Nov 2019 08:11:00.284 * MASTER MODE enabled (user request from 'id=369 addr=172.17.0.4:52207 fd=10 name=sentinel-d804939c-cmd age=927 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=274 qbuf-free=32494 obl=36 oll=0 omem=0 events=r cmd=exec')
33:M 30 Nov 2019 08:11:00.285 # CONFIG REWRITE executed with success.
33:M 30 Nov 2019 08:11:01.210 * Replica 172.17.0.5:6001 asks for synchronization
33:M 30 Nov 2019 08:11:01.210 * Partial resynchronization request from 172.17.0.5:6001 accepted. Sending 425 bytes of backlog starting from offset 868315.
33:M 30 Nov 2019 08:11:40.060 * Replica 172.17.0.3:6001 asks for synchronization
33:M 30 Nov 2019 08:11:40.060 * Partial resynchronization not accepted: Requested offset for second ID was 951529, but I can reply up to 868315
33:M 30 Nov 2019 08:11:40.060 * Starting BGSAVE for SYNC with target: disk
33:M 30 Nov 2019 08:11:40.060 * Background saving started by pid 178
178:C 30 Nov 2019 08:11:40.065 * DB saved on disk
178:C 30 Nov 2019 08:11:40.065 * RDB: 0 MB of memory used by copy-on-write
33:M 30 Nov 2019 08:11:40.073 * Background saving terminated with success
33:M 30 Nov 2019 08:11:40.073 * Synchronization with replica 172.17.0.3:6001 succeeded
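=> The replicas send sync requests to the new master and are synchronized by partial or full resynchronization. Here 172.17.0.5 got a partial resync, while the old master 172.17.0.3 needed a full resync because its requested offset (951529) was beyond the offset the new master could serve for the old replication ID (868315)
verifying replication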
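The partial versus full decision in the log above can be modeled roughly as follows. This is a simplified sketch, not the actual Redis implementation: it only looks at the replication IDs and the offset up to which the secondary ID is valid, and it ignores the additional check that the requested offset must still be inside the replication backlog. The function name is hypothetical, and the assumption that 172.17.0.5 asked with the old replication ID is inferred from the log, not shown in it.

```python
def resync_mode(req_replid, req_offset, replid, replid2, second_replid_offset):
    """Decide 'partial' or 'full' from replication IDs and offsets (simplified model)."""
    if req_replid == replid:
        return "partial"  # same history as the current master
    if req_replid == replid2 and req_offset <= second_replid_offset:
        return "partial"  # old history, but only up to the switch point
    return "full"


NEW_ID = "605ef3fef49cfa88e04e216cc804fdec80f4cf96"
OLD_ID = "057a128e63d977ea467b1a3e2515725ba7eec3ef"

# 172.17.0.5 asks from offset 868315, which is still valid for the old ID.
print(resync_mode(OLD_ID, 868315, NEW_ID, OLD_ID, 868315))
# The old master 172.17.0.3 asks from offset 951529, past the switch point.
print(resync_mode(OLD_ID, 951529, NEW_ID, OLD_ID, 868315))
```

The second case is the typical outcome for an old master that kept accepting writes, or simply kept advancing its offset, after the replicas had already moved on.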
127.0.0.1:6001> info replication
# Replication
role:slave
master_host:172.17.0.4
master_port:6001
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
slave_repl_offset:1019638
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:605ef3fef49cfa88e04e216cc804fdec80f4cf96
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:1019638
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:876434
repl_backlog_histlen:143205
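If you want to script this final check, a minimal sketch could look like the following. It parses the text of `info replication` and verifies that the node is a replica of the expected master with the link up and no sync in progress. The helper names and the choice of fields to check are assumptions for illustration.

```python
def parse_info(text):
    """Parse 'key:value' lines from INFO output, skipping blanks and '# Section' lines."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def replica_is_healthy(info, expected_master):
    """True if the node replicates from expected_master ('ip:port') and the link is up."""
    master = "%s:%s" % (info.get("master_host"), info.get("master_port"))
    return (
        info.get("role") == "slave"
        and master == expected_master
        and info.get("master_link_status") == "up"
        and info.get("master_sync_in_progress") == "0"
    )


sample = """# Replication
role:slave
master_host:172.17.0.4
master_port:6001
master_link_status:up
master_sync_in_progress:0
"""
print(replica_is_healthy(parse_info(sample), "172.17.0.4:6001"))
```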