Today I was configuring Redis Sentinel to monitor some Redis servers, and along the way I wondered what would happen if two Redis instances were each made a slave of the other. Here is the experiment.

I set up two Redis instances: redis1 configured as the master, and redis2 configured as its slave with "slaveof redis1". I started both. Once redis2 had successfully become a slave of redis1, I connected to redis1 with redis-cli and ran the command that makes redis1 a slave of redis2:

    slaveof [redis2 IP] [redis2 port]

The result: both instances kept logging SYNC failures over and over. Clearly, two Redis instances cannot be slaves of each other.

Log of redis1 after running slaveof:

    [14793] 06 Sep 17:36:20.426 * SLAVEOF 10.18.129.49:9778 enabled (user request)
    [14793] 06 Sep 17:36:20.636 - Accepted 10.18.129.49:44277
    [14793] 06 Sep 17:36:20.637 - Client closed connection
    [14793] 06 Sep 17:36:20.804 * Connecting to MASTER...
    [14793] 06 Sep 17:36:20.804 * MASTER <-> SLAVE sync started
    [14793] 06 Sep 17:36:20.804 * Non blocking connect for SYNC fired the event.
    [14793] 06 Sep 17:36:20.804 * Master replied to PING, replication can continue...
    [14793] 06 Sep 17:36:20.804 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
    [14793] 06 Sep 17:36:21.636 - Accepted 10.18.129.49:44279
    [14793] 06 Sep 17:36:21.637 - Client closed connection
    [14793] 06 Sep 17:36:21.804 * Connecting to MASTER...
    [14793] 06 Sep 17:36:21.804 * MASTER <-> SLAVE sync started
    [14793] 06 Sep 17:36:21.804 * Non blocking connect for SYNC fired the event.
    [14793] 06 Sep 17:36:21.804 * Master replied to PING, replication can continue...
    [14793] 06 Sep 17:36:21.804 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
    [14793] 06 Sep 17:36:22.636 - Accepted 10.18.129.49:44281
    [14793] 06 Sep 17:36:22.637 - Client closed connection
    [14793] 06 Sep 17:36:22.804 * Connecting to MASTER...
    [14793] 06 Sep 17:36:22.804 * MASTER <-> SLAVE sync started
    [14793] 06 Sep 17:36:22.804 * Non blocking connect for SYNC fired the event.
    [14793] 06 Sep 17:36:22.804 * Master replied to PING, replication can continue..

Log of redis2:

    [14796] 06 Sep 17:36:20.426 - Client closed connection
    [14796] 06 Sep 17:36:20.636 * Connecting to MASTER...
    [14796] 06 Sep 17:36:20.636 * MASTER <-> SLAVE sync started
    [14796] 06 Sep 17:36:20.636 * Non blocking connect for SYNC fired the event.
    [14796] 06 Sep 17:36:20.636 * Master replied to PING, replication can continue...
    [14796] 06 Sep 17:36:20.636 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
    [14796] 06 Sep 17:36:20.804 - Accepted 10.18.129.49:51034
    [14796] 06 Sep 17:36:20.805 - Client closed connection
    [14796] 06 Sep 17:36:21.636 * Connecting to MASTER...
    [14796] 06 Sep 17:36:21.636 * MASTER <-> SLAVE sync started
    [14796] 06 Sep 17:36:21.636 * Non blocking connect for SYNC fired the event.
    [14796] 06 Sep 17:36:21.636 * Master replied to PING, replication can continue...
    [14796] 06 Sep 17:36:21.637 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
    [14796] 06 Sep 17:36:21.804 - Accepted 10.18.129.49:51036
    [14796] 06 Sep 17:36:21.805 - Client closed connection
    [14796] 06 Sep 17:36:22.636 - DB 0: 20 keys (0 volatile) in 32 slots HT.
    [14796] 06 Sep 17:36:22.636 - 0 clients connected (0 slaves), 801176 bytes in use
    [14796] 06 Sep 17:36:22.636 * Connecting to MASTER...
    [14796] 06 Sep 17:36:22.636 * MASTER <-> SLAVE sync started
    [14796] 06 Sep 17:36:22.636 * Non blocking connect for SYNC fired the event.
    [14796] 06 Sep 17:36:22.636 * Master replied to PING, replication can continue..

So both instances end up stuck in an endless loop of failed SYNC attempts.

My first question was: why does redis2, the original slave, issue SYNC again at all? The first line of the redis2 log above shows that the original master-slave connection was closed. I looked at replication.c, the source file that implements the master-slave setup. Below is the code that redis1 runs for the slaveof command. Partway through it calls disconnectSlaves(), which is what drops the existing master-slave connection:

    void slaveofCommand(redisClient *c) {
        if (!strcasecmp(c->argv[1]->ptr,"no") &&
            !strcasecmp(c->argv[2]->ptr,"one")) {
            // omitted
        } else {
            // omitted

            /* There was no previous master or the user specified a different one,
             * we can continue. */
            sdsfree(server.masterhost);
            server.masterhost = sdsdup(c->argv[1]->ptr);
            server.masterport = port;
            if (server.master) freeClient(server.master);
            disconnectSlaves(); /* Force our slaves to resync with us as well. */
            cancelReplicationHandshake();
            server.repl_state = REDIS_REPL_CONNECT;
            redisLog(REDIS_NOTICE,"SLAVEOF %s:%d enabled (user request)",
                server.masterhost, server.masterport);
        }
        addReply(c,shared.ok);
    }

The comment next to disconnectSlaves() reads "Force our slaves to resync with us as well." In other words: I (redis1) disconnect you (redis2) first, and once I have finished syncing from my own master, you can come back and sync from me. That is why redis2 lost its connection to redis1. And because redis2 was a slave to begin with, once it loses its master it keeps trying to reconnect and issue SYNC until it succeeds.

With that explained, the second question is why SYNC keeps failing on both sides. The reason is closely related to the first. Both logs show the same error: "ERR Can't SYNC while not connected with my master". In the code, that message comes from here:

    void syncCommand(redisClient *c) {
        /* ignore SYNC if already slave or in monitor mode */
        if (c->flags & REDIS_SLAVE) return;

        /* Refuse SYNC requests if we are a slave but the link with our master
         * is not ok... */
        if (server.masterhost && server.repl_state != REDIS_REPL_CONNECTED) {
            addReplyError(c,"Can't SYNC while not connected with my master");
            return;
        }

        /* SYNC can't be issued when the server has pending data to send to
         * the client about already issued commands. We need a fresh reply
         * buffer registering the differences between the BGSAVE and the current
         * dataset, so that we can copy to other slaves if needed. */
        if (listLength(c->reply) != 0) {
            addReplyError(c,"SYNC is invalid with pending input");
            return;
        }
        // omitted
    }

syncCommand is the handler a Redis instance runs, in its role as master, when a slave sends it the SYNC command. Note the comment "Refuse SYNC requests if we are a slave but the link with our master is not ok...". When redis1, acting as master, receives SYNC from its slave, it runs syncCommand and reaches the check if (server.masterhost && server.repl_state != REDIS_REPL_CONNECTED). At that moment redis1 has just been made a slave of another master (redis2) but has not finished syncing with it. To finish, redis1 must send SYNC to redis2 and get a successful reply, but redis2 can only serve redis1's SYNC once redis1 has served redis2's SYNC. That circular dependency is a deadlock. So the condition evaluates to true and redis1 replies immediately with the error "Can't SYNC while not connected with my master". The same happens on redis2, so both sides stay stuck in the "Can't SYNC while not connected with my master" state.
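To make the circular dependency concrete, here is a minimal, self-contained C sketch that models only the two pieces of logic discussed above: the state reset done by slaveof (including the effect of disconnectSlaves()) and the guard at the top of syncCommand. This is not Redis code. The Node struct, the ReplState enum, and the functions accepts_sync, slaveof, try_sync, simulate_mutual_slaveof, and simulate_one_way are all hypothetical names invented for this illustration, and networking, PING, and the actual data transfer are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for Redis' replication states. */
typedef enum { REPL_NONE, REPL_CONNECT, REPL_CONNECTED } ReplState;

/* Hypothetical model of one instance. `master` plays the role of
 * server.masterhost: NULL means "I am not a slave of anyone". */
typedef struct Node {
    struct Node *master;
    ReplState repl_state;
} Node;

/* Models the guard in syncCommand(): an instance that is itself a slave
 * and whose link to its own master is not up refuses incoming SYNC. */
bool accepts_sync(const Node *n) {
    assert(n != NULL);
    return !(n->master != NULL && n->repl_state != REPL_CONNECTED);
}

/* Models SLAVEOF: point `self` at a new master, go back to the CONNECT
 * state, and drop the existing slave (the disconnectSlaves() effect)
 * so that it has to resync as well. */
void slaveof(Node *self, Node *new_master, Node *existing_slave) {
    self->master = new_master;
    self->repl_state = REPL_CONNECT;
    if (existing_slave != NULL)
        existing_slave->repl_state = REPL_CONNECT;
}

/* One reconnect attempt by a slave. Returns true if the link is up
 * after the attempt, false if the master refused SYNC. */
bool try_sync(Node *slave) {
    if (slave->master == NULL) return false;            /* not a slave */
    if (slave->repl_state == REPL_CONNECTED) return true;
    if (!accepts_sync(slave->master)) return false;     /* "Can't SYNC..." */
    slave->repl_state = REPL_CONNECTED;
    return true;
}

/* Reproduces the experiment: redis2 is a healthy slave of redis1, then
 * redis1 is told SLAVEOF redis2. Returns how many SYNC attempts succeed
 * over `rounds` retry rounds on both sides. */
int simulate_mutual_slaveof(int rounds) {
    Node redis1 = { NULL, REPL_NONE };
    Node redis2 = { &redis1, REPL_CONNECTED };
    int successes = 0;

    slaveof(&redis1, &redis2, &redis2);
    for (int i = 0; i < rounds; i++) {
        if (try_sync(&redis1)) successes++;
        if (try_sync(&redis2)) successes++;
    }
    return successes;
}

/* Control case: an ordinary one-way master/slave pair syncs fine. */
bool simulate_one_way(void) {
    Node master = { NULL, REPL_NONE };
    Node slave = { &master, REPL_CONNECT };
    return try_sync(&slave);
}
```

In this model, simulate_mutual_slaveof returns 0 for any number of rounds, because each node refuses the other's SYNC while waiting for its own, whereas simulate_one_way returns true. The retry loop never makes progress on its own, which matches the repeating pattern in both logs.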