😇 All I need to know about 6.5840
Full code on:
https://github.com/humbornjo/MIT-6.5840
Lab1
💡 Although this lab does not touch it, GFS is the real soul of MapReduce; it is also the fundamental reason why many MR implementations stay confined to a single machine. In a distributed setting, the exchange of intermediate data is the actual crux.
Difficulties & solution
The coordinator is the scheduler: even if every one of its fields is thread-safe on its own, it still needs one overall lock. (This only applies to my design and implementation.)
type Coordinator struct {
	// Your definitions here.
	nMap     int
	nReduce  int
	stage    int
	currTask chan Task
	runnTask map[Task]chan int
	lock     sync.RWMutex
}
In my implementation, the stage transition (for example from MAP to REDUCE) may only fire when both currTask and runnTask are empty. If each is locked separately, the following interleaving becomes possible (a sketch of the check under one global lock follows the list):
- currTask is checked and found empty
- a task in runnTask times out, is removed, and the unfinished task is re-enqueued into currTask
- runnTask is then checked and is also empty
- the stage transition fires anyway, and the MR job fails
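A minimal sketch of that point, using the Coordinator fields above (the helper maybeAdvanceStage is hypothetical, not the post's exact code): the two emptiness checks are only meaningful as a single atomic observation, so they must sit in the same critical section.

```go
// maybeAdvanceStage checks both queues under one global lock, so a timed-out
// task cannot be re-queued into currTask between the two checks.
func (c *Coordinator) maybeAdvanceStage() {
	c.lock.Lock()
	defer c.lock.Unlock()
	if len(c.currTask) == 0 && len(c.runnTask) == 0 {
		c.stage++ // e.g. MAP -> REDUCE -> DONE
	}
}
```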
RPC design and details
/*
type Task struct {
	Tid   int
	Fname string
}
*/
type AskForTaskArgs struct {
	Wid int
}
type AskForTaskReply struct {
	Stage int
	Task  Task
	Nout  int
}
type ReportForTaskArgs struct {
	Task Task
}
type ReportForTaskReply struct {
	Foo bool
}
Overall there are two RPCs, Ask and Report. The main subtlety is when to save the temporary files relative to the Report RPC and how the state changes around it; the cases are analyzed below.
- For the MAP-phase Report: should the temporary files be saved after the RPC returns true, or saved first with the RPC called afterwards?
- If the files are saved after the RPC reply and the result is true (in my implementation, reply.Foo == true), something scary can happen: the coordinator has already switched state and begun handing out REDUCE tasks while the MAP phase's intermediate files are not yet saved. If a worker happens to pick up such a task right then and races ahead, the target file may not even exist, and the REDUCE result ends up wrong. If the RPC result is false, the impact is small: even if the file were saved, it would be overwritten by another worker's write.
- If the files are saved before the RPC reply and the result is true, there is no problem at all, which resolves the issue above. If the result is false, the impact is again small: even if the file were saved, it would be overwritten by another worker's write.
- In summary, save the temporary files before the RPC reply (a sketch of this ordering follows the code below).
- For the REDUCE-phase Report there is not much to say: the save has to happen before the RPC, since the implementation borrows the mrsequential code. As far as I can tell, if it were switched to a blocking save after the RPC, the only risk would be a temporary worker crash. (Do not forget to delete the temporary files at the end.)
if !ReportForTask(
	&ReportForTaskArgs{Task: reply.Task},
	&ReportForTaskReply{}) {
	os.Remove(oname)
} else {
	for _, filename := range filenames {
		os.Remove(filename)
	}
}
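A sketch of the MAP-phase ordering chosen above (save first, then report). The helper name saveThenReport and the temp-file naming are illustrative; ReportForTask and the RPC structs are the ones defined earlier, and KeyValue is assumed to be the lab's key/value pair type.

```go
// saveThenReport publishes the intermediate file with an atomic rename and only
// then reports the finished task to the coordinator.
func saveThenReport(intermediate []KeyValue, oname string, task Task) {
	tmp, err := os.CreateTemp(".", "mr-tmp-*")
	if err != nil {
		log.Fatal(err)
	}
	enc := json.NewEncoder(tmp)
	for _, kv := range intermediate {
		if err := enc.Encode(&kv); err != nil {
			log.Fatal(err)
		}
	}
	tmp.Close()
	// The rename is atomic, so the file is either fully visible or not there at all.
	if err := os.Rename(tmp.Name(), oname); err != nil {
		log.Fatal(err)
	}
	// Only now tell the coordinator that the task is done.
	ReportForTask(&ReportForTaskArgs{Task: task}, &ReportForTaskReply{})
}
```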
- PASS LOG
MIT-6.5840/src/main via 🐹 v1.20.6
> bash test-mr.sh
*** Starting wc test.
--- wc test: PASS
*** Starting indexer test.
--- indexer test: PASS
*** Starting map parallelism test.
--- map parallelism test: PASS
*** Starting reduce parallelism test.
--- reduce parallelism test: PASS
*** Starting job count test.
--- job count test: PASS
*** Starting early exit test.
--- early exit test: PASS
*** Starting crash test.
--- crash test: PASS
*** PASSED ALL TESTS
Lab2
💡 What you should understand is that even Raft can go wrong when delays are high and commands flood in.
Core Idea
- Raft RPCs are idempotent
- Randomization buys simplicity
- Leader takes it all
- Logical timing instead of physical timing
$$ \text{broadcastTime} \ll \text{electionTimeout} \ll \text{MTBF} $$
Mind Trail
Difficulties & solution
2A
- Only a server with a strictly smaller term will grant its vote to a server with a larger term, but the following situation arises easily (ignore the third column): servers S1 and S2 start elections at almost the same time (even with randomized timeouts), so no leader can be elected for many rounds.
Debug log
S0 C0 ask for vote, T: 8
S0 not leader, election timeout...
S0 convert to candidate, calling election T: 7
S2 C0 ask for vote, T: 8
S1 C0 ask for vote, T: 8
S0 not leader, election timeout...
S0 convert to candidate, calling election T: 8
S2 C0 ask for vote, T: 9
S1 not leader, election timeout...
S1 convert to candidate, calling election T: 8
S2 C0 ask for vote, T: 9
S1 C0 ask for vote, T: 9
S0 C0 ask for vote, T: 9
S0 not leader, election timeout...
S0 convert to candidate, calling election T: 9
S2 C0 ask for vote, T: 10
S1 C0 ask for vote, T: 10
S1 not leader, election timeout...
S1 convert to candidate, calling election T: 9
S2 C0 ask for vote, T: 10
S0 C0 ask for vote, T: 10
S0 -> S1 vote false, VT: 0
S0 not leader, election timeout...
S0 convert to candidate, calling election T: 10
Identified issues
1. The server updates its own term inside the RequestVote RPC handler, so a lagging server instantly catches up with the one that pulled ahead thanks to its random timeout.
2. The timeout range is [150, 300] ms.
Possible solutions
1. ~~Do not update the term inside the RequestVote RPC handler.~~ After updating the term, set isTimeout to false so that one round of election self-increment is skipped.
2. Enlarge the timeout range.
Solution
Adopted approach 1 and changed the timeout range to [200, 500] ms, which solves the problem for now (a sketch of the randomized timeout follows).
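A sketch of the randomized election timeout described above; the helper name is illustrative, not the post's exact code.

```go
// randomElectionTimeout picks a timeout uniformly in [200, 500) ms so that two
// servers rarely time out in the same instant and keep splitting the vote.
func randomElectionTimeout() time.Duration {
	return time.Duration(200+rand.Intn(300)) * time.Millisecond
}
```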
- Looking at the second and the last columns, S1 and S6 have split-brain here, and it never gets resolved. The reason is that both hold the same term at this point, so each rejects the other's heartbeat log and neither converts its state.
Debug log
S1 multi leaders(2) T: 7
S6 multi leaders(2) T: 7
S6 -> S5 T: 7, send {PLI: 0, PLT: 0}
S1 -> S6 T: 7, send {PLI: 0, PLT: 0}
S2 convert to candidate, calling election T: 7
S6 C0 ask for vote, T: 8
S6 -> S0 T: 7, send {PLI: 0, PLT: 0}
S6 -> S1 T: 7, send {PLI: 0, PLT: 0}
S6 -> S2 T: 7, send {PLI: 0, PLT: 0}
S6 -> S3 T: 7, send {PLI: 0, PLT: 0}
S6 -> S4 T: 7, send {PLI: 0, PLT: 0}
S1 -> S0 T: 7, send {PLI: 0, PLT: 0}
S1 -> S2 T: 7, send {PLI: 0, PLT: 0}
S1 -> S3 T: 7, send {PLI: 0, PLT: 0}
S1 -> S4 T: 7, send {PLI: 0, PLT: 0}
S1 -> S5 T: 7, send {PLI: 0, PLT: 0}
Identified issues
1. Both sides hold the same term and face off as leaders at the same time; because each rejects the other's heartbeat log, neither can change state.
Possible solutions
1. Fundamentally, the server fails to check its own killed() when disconnected and convert to follower in time. The paper does not spell out how entries (including heartbeats) are handled when the terms are equal.
Solution
1. Check killed() in every heartbeat loop and change state promptly.
2. While running this, another small problem turned up; the offending code is analyzed below. The vote-granting logic splits into three cases: currTerm > args.Term, currTerm == args.Term, and currTerm < args.Term. The first case needs no comment. In the second case, if the server has not yet voted (votedFor == -1) it is certainly not the leader, so it may grant the vote; if it has already voted, rejecting preserves the one-vote-per-term rule. In the third case, note that the server may be in any of the three states, so its state must be reset to Follower to uphold the rule that the larger term wins.
if rf.currTerm > args.Term {
	reply.VoteGranted = false
	DebugLog(dVote, "S%d -> S%d vote false, T: %d, CT: %d\n", rf.me, args.CandidateId, rf.currTerm, args.Term)
	return
} else if rf.currTerm == args.Term {
	if rf.votedFor == -1 {
		rf.votedFor = args.CandidateId
		reply.VoteGranted = true
		DebugLog(dVote, "S%d -> S%d vote true, same term\n", rf.me, args.CandidateId)
		return
	} else {
		reply.VoteGranted = false
		DebugLog(dVote, "S%d -> S%d vote false, VT: %d\n", rf.me, args.CandidateId, rf.votedFor)
		return
	}
} else {
	rf.currTerm = args.Term
	rf.isTimeout = false
	rf.state = Follower // !!!!!!!!!!!!!! IMPORTANT !!!!!!!!!!!!!!
	rf.votedFor = args.CandidateId
	reply.VoteGranted = true
	DebugLog(dVote, "S%d -> S%d vote true, from T: %d to CT: %d\n", rf.me, args.CandidateId, reply.Term, rf.currTerm)
	return
}
- PASS LOG
humborn@shin MIT-6.5840/src/raft on master [!] [!] 09:36:06
➜ go test -run 2A
Test (2A): initial election ...
... Passed -- 3.0 3 58 15440 0
Test (2A): election after network failure ...
... Passed -- 4.4 3 140 27904 0
Test (2A): multiple elections ...
... Passed -- 6.2 7 686 129664 0
PASS
ok 6.824/raft 13.599s
2B
- The servers append successfully, yet somehow the leader never updates commitIndex
Debug log
S0 not leader, election timeout...
S1 -> S2 T: 1, send {PLI: 0, PLT: 0, CI: 0}
S1 -> S0 T: 1, send {PLI: 0, PLT: 0, CI: 0}
S0 -> S1 append ok T: 1, LENLOG: 2
S2 -> S1 append ok T: 1, LENLOG: 2
S2 not leader, election timeout...
S1 -> S2 T: 1, send {PLI: 1, PLT: 1, CI: 0}
S1 -> S0 T: 1, send {PLI: 1, PLT: 1, CI: 0}
S0 -> S1 append ok T: 1, LENLOG: 2
S2 -> S1 append ok T: 1, LENLOG: 2
S1 -> S2 T: 1, send {PLI: 0, PLT: 0, CI: 0}
S1 -> S0 T: 1, send {PLI: 0, PLT: 0, CI: 0}
S0 -> S1 append ok T: 1, LENLOG: 2
S2 -> S1 append ok T: 1, LENLOG: 2
FAIL
FAIL 6.824/raft 2.313s
Identified issues
1. The index used after the sort is wrong: with matchIndex sorted ascending I took len(rf.peers)/2-1, which is 0 when there are 3 servers. Obviously wrong.
2. Mixed up another index: AppendEntries may carry an empty Entries slice, and to satisfy the last implementation rule of the AppendEntries RPC uniformly, the index must come from the last entry of the server's own logEntries.
3. When creating a new logEntry, index and term were swapped. Damn!
Solution
1. Change it to (len(rf.peers)-1)/2: with the values sorted ascending, the element at that position is replicated on a majority of servers (a quick sanity check follows).
2. Take the last entry's index from rf.logEntries instead of args.Entries.
3. Damn!
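A quick standalone sanity check of the fixed majority index (not lab code, just arithmetic):

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	matchIndex := []int{2, 5, 3} // 3 servers: leader at 5, followers at 2 and 3
	sort.Ints(matchIndex)        // [2 3 5]
	n := matchIndex[(len(matchIndex)-1)/2]
	fmt.Println(n) // 3 -> entries up to index 3 are on 2 of 3 servers, safe to commit
}
```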
Task log
➜ go test -run BasicAgree
Test (2B): basic agreement ...
... Passed -- 0.5 3 16 4258 3
PASS
ok 6.824/raft 0.494s
- If a server goes offline, it keeps calling StartElection() and incrementing its term, so its term is huge by the time it reconnects. This leads to the following situation: the current leader's AppendEntries RPCs get rejected, and the huge term in the reply knocks the leader back to follower while updating its term. It is best to give the leader a privilege so that it times out quickly and becomes leader again; the election called by the reconnected server will be rejected by the other servers anyway (its LastLogIndex is too small) while still making them update their terms, so no privileged path is needed there. In the log below, however, S0 apply msg: {true 101 1 false [] 0 0} appears repeatedly, and after the reconnection Append is not retried continuously (after the first rejection), so S0 is never brought up to date.
Debug log
S0 reconnect...
S2 recv command, LENLOG: 6, T: 1
S2 send to S1, ENTRIES: [{106 1 6}]
S2 send to S0, ENTRIES: [{102 1 2} {103 1 3} {104 1 4} {105 1 5} {106 1 6}]
S1 recv append from S2, T: 1, ENTRIES: [{106 1 6}]
S1 -> S2 append ok T: 1, LENLOG: 7
S0 recv append from S2, T: 1, ENTRIES: [{102 1 2} {103 1 3} {104 1 4} {105 1 5} {106 1 6}]
S0 append fail, T: 6 large term
S2 -> S1 T: 1, send {PLI: 5, PLT: 1, CI: 5, BEGINLOGIDX: 6, ENDLOGIDX: 6, LOG: [{106 1 6}]}
S2 update commitIndex CI: 6
S2 apply msg: {true 106 6 false [] 0 0}
S2 -> S0 T: 1, send {PLI: 1, PLT: 1, CI: 5, BEGINLOGIDX: 2, ENDLOGIDX: 6, LOG: [{102 1 2} {103 1 3} {104 1 4} {105 1 5} {106 1 6}]}
S2 convert to candidate, calling election T: 6
S1 C6 ask for vote, T: 7
S0 C6 ask for vote, T: 7
S1 -> S2 vote true, from T: 1 to CT: 7
S0 -> S2 vote true, from T: 6 to CT: 7
S2 convert to leader at T: 7
S1 recv append from S2, T: 7, ENTRIES: []
S1 -> S2 recv heartbeat
S1 apply msg: {true 106 6 false [] 0 0}
S2 -> S1 T: 7, send heartbeat
S0 recv append from S2, T: 7, ENTRIES: []
S0 append fail, diff term
S2 -> S0 T: 7, send heartbeat
S2 -> S0 update failed, NEXT: 6
S1 recv append from S2, T: 7, ENTRIES: []
S1 -> S2 recv heartbeat
S0 recv append from S2, T: 7, ENTRIES: []
S0 append fail, diff term
S2 -> S1 T: 7, send heartbeat
S2 -> S0 T: 7, send heartbeat
S2 -> S0 update failed, NEXT: 5
S0 recv append from S2, T: 7, ENTRIES: []
S0 append fail, diff term
S1 recv append from S2, T: 7, ENTRIES: []
S1 -> S2 recv heartbeat
S2 -> S0 T: 7, send heartbeat
S2 -> S0 update failed, NEXT: 4
S2 -> S1 T: 7, send heartbeat
S2 -> S0 T: 7, fail send {PLI: 1, PLT: 1, CI: 4}
S0 recv append from S2, T: 7, ENTRIES: []
S0 append fail, diff term
S1 recv append from S2, T: 7, ENTRIES: []
S1 -> S2 recv heartbeat
S2 -> S1 T: 7, send heartbeat
S2 -> S0 T: 7, send heartbeat
S2 -> S0 update failed, NEXT: 3
S1 recv append from S2, T: 7, ENTRIES: []
S1 -> S2 recv heartbeat
S0 recv append from S2, T: 7, ENTRIES: []
S0 append fail, diff term
S2 -> S1 T: 7, send heartbeat
S2 -> S0 T: 7, send heartbeat
S2 -> S0 update failed, NEXT: 2
S1 recv append from S2, T: 7, ENTRIES: []
S1 -> S2 recv heartbeat
S0 recv append from S2, T: 7, ENTRIES: []
S0 -> S2 recv heartbeat
S0 apply msg: {true 101 1 false [] 0 0}
S2 -> S1 T: 7, send heartbeat
S2 -> S0 T: 7, send heartbeat
S2 -> S0 T: 7, fail send {PLI: 1, PLT: 1, CI: 3}
S1 recv append from S2, T: 7, ENTRIES: []
S1 -> S2 recv heartbeat
S2 -> S1 T: 7, send heartbeat
S0 recv append from S2, T: 7, ENTRIES: []
S0 -> S2 recv heartbeat
S2 -> S0 T: 7, send heartbeat
Identified issues
1. Whenever leaderCommit is larger than the server's own commitIndex, the apply signal is raised.
2. After the reconnection a new leader is elected, which means the newly elected leader re-initializes its nextIndex. Naturally, the first append attempt to the reconnected server is rejected because the indices do not match, and it never catches up.
Solution
1. Stop treating heartbeats specially; that is, let heartbeats and broadcasts compete equally to bring each server up to date (a rough sketch follows).
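A rough sketch of the "no special heartbeat" idea. Type and helper names such as LogEntry, AppendEntriesArgs and sendAppendEntries follow the usual lab skeleton and are assumptions here; the field names come from this post. A heartbeat is then simply an AppendEntries whose Entries slice happens to be empty.

```go
// Caller is assumed to hold rf.mu. Every periodic broadcast sends whatever
// nextIndex dictates, so no separate heartbeat path exists.
for id := range rf.peers {
	if id == rf.me {
		continue
	}
	next := rf.nextIndex[id]
	args := AppendEntriesArgs{
		Term:         rf.currTerm,
		LeaderId:     rf.me,
		PrevLogIndex: next - 1,
		PrevLogTerm:  rf.logEntries[next-1].Term,
		Entries:      append([]LogEntry(nil), rf.logEntries[next:]...), // empty when up to date
		LeaderCommit: rf.commitIndex,
	}
	go rf.sendAppendEntries(id, &args, &AppendEntriesReply{})
}
```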
- When applying messages, they come out in the wrong order
Debug log
// log
S0 commitIndex: 6
S0 -> S2 append ok T: 9, LENLOG: 7
S2 -> S1 T: 9, send heartbeat
S2 -> S1 update success, NEXT: [1 7 7], MATCH: [0 6 0]
S2 get N: 6
S0 apply msg: {true 106 6 false [] 0 0}
S0 ######## CI: 6, cfg.log: [map[0:init server 1:101] map[0:init server 1:101 2:102 3:103 4:104 5:105 6:106] map[0:init server 1:101 2:102 3:103 4:104 5:105 6:106]]
S2 -> S0 T: 7, send heartbeat
// related code
func (cfg *config) checkLogs(i int, m ApplyMsg) (string, bool) {
	err_msg := ""
	v := m.Command
	for j := 0; j < len(cfg.logs); j++ {
		if old, oldok := cfg.logs[j][m.CommandIndex]; oldok && old != v {
			log.Printf("%v: log %v; server %v\n", i, cfg.logs[i], cfg.logs[j])
			// some server has already committed a different value for this entry!
			err_msg = fmt.Sprintf("commit index=%v server=%v %v != server=%v %v",
				m.CommandIndex, i, m.Command, j, old)
		}
	}
	DebugLog(dDrop, "S%d ######## CI: %d, cfg.log: %v\n", i, m.CommandIndex, cfg.logs)
	_, prevok := cfg.logs[i][m.CommandIndex-1]
	cfg.logs[i][m.CommandIndex] = v
	if m.CommandIndex > cfg.maxIndex {
		cfg.maxIndex = m.CommandIndex
	}
	return err_msg, prevok
}
Identified issues
1. First, a chan is a FIFO. The tester inspects each ApplyMsg popped from this FIFO, extracts its contents, and stores them in cfg.log. After a server reconnects, pushing only the latest ApplyMsg therefore causes an index out of range error.
Solution
1. In other words, ApplyMsg must be pushed into the channel one by one, in order, rather than only the newest one (a rough applier sketch follows).
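A rough sketch of such an in-order applier, using the applyCond / applyCh fields that appear in the later snippets of this post: committed entries are delivered to applyCh one by one, strictly in index order, instead of only the newest one.

```go
// applier delivers committed entries in index order; this is a sketch, not the
// post's exact code (snapshot offsets are ignored here).
func (rf *Raft) applier() {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	for !rf.killed() {
		for rf.lastApplied >= rf.commitIndex {
			rf.applyCond.Wait()
		}
		for rf.lastApplied < rf.commitIndex {
			rf.lastApplied++
			msg := ApplyMsg{
				CommandValid: true,
				Command:      rf.logEntries[rf.lastApplied].Command,
				CommandIndex: rf.lastApplied,
			}
			rf.mu.Unlock()
			rf.applyCh <- msg // deliver outside the lock (see the 2D deadlock note later)
			rf.mu.Lock()
		}
	}
}
```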
- PASS LOG
➜ go test -run 2B
Test (2B): basic agreement ...
... Passed -- 0.5 3 16 4290 3
Test (2B): RPC byte count ...
... Passed -- 1.3 3 48 113614 11
Test (2B): agreement after follower reconnects ...
... Passed -- 5.3 3 192 50800 8
Test (2B): no agreement if too many followers disconnect ...
... Passed -- 3.4 5 264 51340 3
Test (2B): concurrent Start()s ...
... Passed -- 0.6 3 24 6790 6
Test (2B): rejoin of partitioned leader ...
... Passed -- 3.9 3 146 35480 4
Test (2B): leader backs up quickly over incorrect follower logs ...
... Passed -- 14.0 5 2244 1728755 102
Test (2B): RPC counts aren't too high ...
... Passed -- 2.0 3 62 18665 12
PASS
ok 6.824/raft 31.039s
2C
- As soon as the unreliable tests run, an out-of-index error pops up. Thinking back, this error occasionally showed up in 2B as well; time to pay off that debt.
Debug log
➜ go test -run UnreliableChurn
Test (2C): unreliable churn ...
panic: runtime error: index out of range [8] with length 8
goroutine 22 [running]:
6.824/raft.(*Raft).Apply(0xc0001b2000)
/home/humborn/Code/go/MIT-6.5840/src/raft/raft.go:596 +0x214
created by 6.824/raft.Make
/home/humborn/Code/go/MIT-6.5840/src/raft/raft.go:757 +0x3d6
exit status 2
FAIL 6.824/raft 0.300s
Identified issues
1. Strange: sometimes it passes, sometimes it does not. The unreliable tests in 2C magnify the system's weak spots; what is certain is that the heartbeat-interval setting is related to this error. It later turned out that, because of delays, a follower accepted a message carrying older indices and overwrote newer entries.
S0 reconnect...
S1 reconnect...
S2 reconnect...
S3 reconnect...
S3 -> S4 append ok T: 26, LENLOG: 207
S4 -> S2 LogInfo {IDX: 192, LENLOG: 207, LOG: {426083054347800940 26 206}}
S4 LogInfo {IDX: 192, LENLOG: 207, LOG: {426083054347800940 26 206}}
S4 send to S2, ENTRIES: {426083054347800940 26 206}
S4 -> S3 update success, NEXT: [193 205 193 207 193], MATCH: [0 204 0 206 0]
S4 get N: 204
S4, LENLOG: 207, LA: 204 apply msg {CI: 205}
S4 -> S3 LogInfo {IDX: 206, LENLOG: 207, LOG: {426083054347800940 26 206}}
S4 LogInfo {IDX: 206, LENLOG: 207, LOG: {426083054347800940 26 206}}
S4 send to S3, ENTRIES: {426083054347800940 26 206}
S1 -> S4 append ok T: 26, LENLOG: 207
S2 append fail, T: 27 large term
S4 -> S1 update success, NEXT: [193 207 193 207 193], MATCH: [0 206 0 206 0]
S4 get N: 206
S4 update commitIndex CI: 206
S4, LENLOG: 207, LA: 205 apply msg {CI: 206}
S4 -> S2 LogInfo {IDX: 192, LENLOG: 207, LOG: {426083054347800940 26 206}}
S4 LogInfo {IDX: 192, LENLOG: 207, LOG: {426083054347800940 26 206}}
S4 send to S2, ENTRIES: {426083054347800940 26 206}
S3 commit {CI: 205, LENLOG: 207}
S2 append fail, T: 27 large term
S4 reconnect...
S1 -> S4 append ok T: 26, LENLOG: 206
S4 -> S1 LogInfo {IDX: 206, LENLOG: 207, LOG: {426083054347800940 26 206}}
S4 LogInfo {IDX: 206, LENLOG: 207, LOG: {426083054347800940 26 206}}
S4 send to S1, ENTRIES: {426083054347800940 26 206}
S3 -> S4 append ok T: 26, LENLOG: 205
- Sometimes an index evaluates to -1; the root cause is a wrong condition in the append path. Besides, once a server has detached, messages keep being sent to the other servers and failed updates keep coming back, so nextIndex naturally keeps shrinking until it reaches 0, gets decremented to -1, and the index goes out of bounds.
Solution
1. Change the entry-update rule to follow the up-to-date principle from the RequestVote RPC.
if len(args.Entries) == 0 {
	DebugLog(dLog, "S%d -> S%d recv heartbeat\n", rf.me, args.LeaderId, rf.currTerm, len(rf.logEntries))
} else {
	rfLastLogIndex, rfLastLogTerm := rf.LogInfoByIndex(len(rf.logEntries) - 1)
	argLastLogIndex, argLastLogTerm := args.Entries[len(args.Entries)-1].Index, args.Entries[len(args.Entries)-1].Term
	if rfLastLogTerm < argLastLogTerm || (rfLastLogTerm == argLastLogTerm && rfLastLogIndex < argLastLogIndex) {
		rf.logEntries = rf.logEntries[:args.PrevLogIndex+1]
		rf.logEntries = append(rf.logEntries, args.Entries...) // persist
		rf.persist()
	}
	DebugLog(dLog2, "S%d -> S%d append ok T: %d, LENLOG: %d\n", rf.me, args.LeaderId, rf.currTerm, len(rf.logEntries))
}
At the same time, remember to adjust the reply-handling code, otherwise nextIndex and matchIndex get updated to stale indices.
if reply.Success {
	if rf.nextIndex[id] < endOfIndex {
		rf.nextIndex[id] = endOfIndex
		rf.matchIndex[id] = endOfIndex - 1
		DebugLog(dError, "S%d -> S%d update success, NEXT: %v, MATCH: %v\n", rf.me, id, rf.nextIndex, rf.matchIndex)
		go rf.updateCommitIndex()
	}
	// reasonably, every successful update should try to advance commitIndex
} else {
	rf.nextIndex[id] /= 2
	DebugLog(dError, "S%d -> S%d update failed, NEXT: %d\n", rf.me, id, rf.nextIndex[id])
}
- The buggy code, for reference:
~~if args.PrevLogIndex >= len(rf.logEntries) || rf.logEntries[args.PrevLogIndex].Term != args.PrevLogTerm {~~
if args.PrevLogIndex > len(rf.logEntries)-1 || rf.logEntries[args.PrevLogIndex].Term != args.PrevLogTerm {
	reply.Success = false
	DebugLog(dVote, "S%d append fail, diff term\n", rf.me)
	return
}
On top of that, I added a patch that sets nextIndex to max(1, nextIndex/2); I am not sure whether this patch can be made any prettier.
- A problem with a very small probability: ApplyMsg gets delivered out of order.
Debug log
Test (2C): Figure 8 (unreliable) ...
3: log map[1:2296 2:812 3:2449 4:9494 5:4661 6:4352 7:549 8:7084 9:6810 10:2131 11:5797 12:3999 13:6910 14:5209 15:4842 16:8485 17:8166 18:6914 19:6957 20:2262 21:4452 22:1043 23:1784 24:9148 25:759 26:2552 27:7916 28:6243 29:3173 30:749 31:3392 32:2467 33:1591 34:2306 35:2107 36:9868 37:9973 38:1755 39:8718 40:6107 41:7839 42:2187 43:1553 44:629 45:6596 46:4295 47:1687 48:9196 49:2304 50:54 51:6167]; server map[1:2296 2:812 3:2449 4:9494 5:4661 6:4352 7:549 8:7084 9:6810 10:2131 11:5797 12:3999 13:6910 14:5209 15:4842 16:8485 17:8166 18:6914 19:6957 20:2262 21:4452 22:1043 23:1784 24:9148 25:759 26:2552 27:7916 28:6243 29:3173 30:749 31:3392 32:2467 33:1591 34:2306 35:2107 36:9868 37:9973 38:1755 39:8718 40:6107 41:7839 42:2187 43:1553 44:629 45:6596 46:4295 47:1687 48:9196 49:2304 50:54 51:6167 52:9070 53:9855 54:4432 55:1244 56:1773 57:5898 58:2104 59:6905 60:3111 61:6309 62:8721 63:1023 64:4093 65:7848 66:7256 67:6673 68:4167 69:7717 70:9556]
3: log map[1:2296 2:812 3:2449 4:9494 5:4661 6:4352 7:549 8:7084 9:6810 10:2131 11:5797 12:3999 13:6910 14:5209 15:4842 16:8485 17:8166 18:6914 19:6957 20:2262 21:4452 22:1043 23:1784 24:9148 25:759 26:2552 27:7916 28:6243 29:3173 30:749 31:3392 32:2467 33:1591 34:2306 35:2107 36:9868 37:9973 38:1755 39:8718 40:6107 41:7839 42:2187 43:1553 44:629 45:6596 46:4295 47:1687 48:9196 49:2304 50:54 51:6167]; server map[1:2296 2:812 3:2449 4:9494 5:4661 6:4352 7:549 8:7084 9:6810 10:2131 11:5797 12:3999 13:6910 14:5209 15:4842 16:8485 17:8166 18:6914 19:6957 20:2262 21:4452 22:1043 23:1784 24:9148 25:759 26:2552 27:7916 28:6243 29:3173 30:749 31:3392 32:2467 33:1591 34:2306 35:2107 36:9868 37:9973 38:1755 39:8718 40:6107 41:7839 42:2187 43:1553 44:629 45:6596 46:4295 47:1687 48:9196 49:2304 50:54 51:6167 52:9070 53:9855 54:4432 55:1244 56:1773 57:5898 58:2104 59:6905 60:3111 61:6309 62:8721 63:1023 64:4093 65:7848 66:7256 67:6673 68:4167 69:7717 70:9556]
3: log map[1:2296 2:812 3:2449 4:9494 5:4661 6:4352 7:549 8:7084 9:6810 10:2131 11:5797 12:3999 13:6910 14:5209 15:4842 16:8485 17:8166 18:6914 19:6957 20:2262 21:4452 22:1043 23:1784 24:9148 25:759 26:2552 27:7916 28:6243 29:3173 30:749 31:3392 32:2467 33:1591 34:2306 35:2107 36:9868 37:9973 38:1755 39:8718 40:6107 41:7839 42:2187 43:1553 44:629 45:6596 46:4295 47:1687 48:9196 49:2304 50:54 51:6167]; server map[1:2296 2:812 3:2449 4:9494 5:4661 6:4352 7:549 8:7084 9:6810 10:2131 11:5797 12:3999 13:6910 14:5209 15:4842 16:8485 17:8166 18:6914 19:6957 20:2262 21:4452 22:1043 23:1784 24:9148 25:759 26:2552 27:7916 28:6243 29:3173 30:749 31:3392 32:2467 33:1591 34:2306 35:2107 36:9868 37:9973 38:1755 39:8718 40:6107 41:7839 42:2187 43:1553 44:629 45:6596 46:4295 47:1687 48:9196 49:2304 50:54 51:6167 52:9070 53:9855 54:4432 55:1244 56:1773 57:5898 58:2104 59:6905 60:3111 61:6309 62:8721 63:1023 64:4093 65:7848 66:7256 67:6673 68:4167 69:7717 70:9556]
apply error: commit index=52 server=3 2540 != server=4 9070
Identified issues
1. Could not pin it down.
Solution
1. Changed `defer rf.applyCh.Signal()` in AppendEntries to `rf.applyCh.Signal()`.
- PASS LOG
➜ go test -run 2C
Test (2C): basic persistence ...
... Passed -- 3.1 3 78 18954 6
Test (2C): more persistence ...
... Passed -- 14.4 5 1036 200240 16
Test (2C): partitioned leader and one follower crash, leader restarts ...
... Passed -- 1.3 3 36 8539 4
Test (2C): Figure 8 ...
... Passed -- 27.0 5 1180 256006 58
Test (2C): unreliable agreement ...
... Passed -- 1.7 5 1048 345471 246
Test (2C): Figure 8 (unreliable) ...
... Passed -- 33.6 5 19228 18476272 138
Test (2C): churn ...
... Passed -- 16.4 5 6992 9763562 1432
Test (2C): unreliable churn ...
... Passed -- 16.1 5 4236 5664994 384
PASS
ok 6.824/raft 113.623s
2D
A crucial point in 2D: slices are backed by arrays, so for the GC to reach the now-useless memory discarded by a snapshot, you must build a new slice and drop the reference to the old one (see the sketch below).
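A minimal sketch of that idea; the generic helper is illustrative, not the post's code. Copying the kept suffix into a brand-new slice makes the old backing array, which still holds the snapshotted entries, unreachable so the GC can collect it.

```go
// trimLog returns a fresh slice containing log[firstKept:], dropping the old
// backing array; e.g. rf.logEntries = trimLog(rf.logEntries, offset).
func trimLog[E any](log []E, firstKept int) []E {
	trimmed := make([]E, len(log)-firstKept)
	copy(trimmed, log[firstKept:])
	return trimmed
}
```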
- An occasional problem: an out-of-index error while updating commitIndex.
Debug log
Test (2C): unreliable churn ...
panic: runtime error: index out of range [1656] with length 1656
goroutine 106315 [running]:
6.824/raft.(*Raft).updateCommitIndex(0xc000136200)
/home/humborn/Code/go/MIT-6.5840/src/raft/raft.go:647 +0x345
created by 6.824/raft.(*Raft).broadcastAppendEntries.func1
/home/humborn/Code/go/MIT-6.5840/src/raft/raft.go:622 +0x796
exit status 2
Identified issues
1. updateCommitIndex was originally run in its own goroutine, so with a tiny probability the whole sequence of a leader change, a re-broadcast, and an update to a newer commitIndex finishes before that goroutine runs, and the original update goroutine then indexes out of bounds.
Solution
1. Run updateCommitIndex synchronously, remove the locking inside the function, and let the calling goroutine manage the lock.
func (rf *Raft) updateCommitIndex() {
	// rf.mu.Lock()
	// defer rf.mu.Unlock()
	var dupMatchIndex []int
	dupMatchIndex = append(dupMatchIndex, rf.matchIndex...)
	dupMatchIndex[rf.me] = rf.logEntries[len(rf.logEntries)-1].Index
	sort.Ints(dupMatchIndex)
	DebugLog(dError, "S%d get N: %d\n", rf.me, dupMatchIndex[(len(rf.peers)-1)/2])
	var N int = dupMatchIndex[(len(rf.peers)-1)/2]
	if N > rf.commitIndex && rf.logEntries[N].Term == rf.currTerm {
		rf.commitIndex = N
		defer rf.applyCond.Signal()
		DebugLog(dError, "S%d update commitIndex CI: %d\n", rf.me, rf.commitIndex)
	}
}
- A deadlock: the Snapshot function gets stuck acquiring the lock as soon as it starts.
Debug log
S2 LogInfo {IDX: 13, LENLOG: 14, LOG: {8815208668614808923 1 13}}
S2 -> S1 append ok T: 1, LENLOG: 14
S1 start create snapshot LII: 9, T: 1
S1 start create snapshot, before lock
S1, LENLOG: 14, LA: 1 apply msg {CI: 10}
S0 convert to candidate, calling election T: 1
S0 LogInfo {IDX: 1, LENLOG: 2, LOG: {7859782014890577642 1 1}}
S2 C1 ask for vote, T: 2
S1 C1 ask for vote, T: 2
S2 LogInfo {IDX: 13, LENLOG: 14, LOG: {8815208668614808923 1 13}}
S2 -> S0 vote false, old log
S2 convert to candidate, calling election T: 2
S2 LogInfo {IDX: 13, LENLOG: 14, LOG: {8815208668614808923 1 13}}
S1 C1 ask for vote, T: 3
S0 C1 ask for vote, T: 3
S0 LogInfo {IDX: 1, LENLOG: 2, LOG: {7859782014890577642 1 1}}
S0 -> S2 vote true, from T: 2 to CT: 3
S2 convert to leader at T: 3
S2 -> S1 LogInfo {IDX: 13, LENLOG: 14, LOG: {8815208668614808923 1 13}}
S2 LogInfo {IDX: 13, LENLOG: 14, LOG: {8815208668614808923 1 13}}
Identified issues
1. A snapshot is taken every 10 commands, so a node must not hold the rf.mu mutex while it is sending already-committed commands to applyCh for execution: while a command is being delivered to applyCh, the test harness calls Snapshot, and my Snapshot implementation also needs rf.mu. The node then deadlocks: one side cannot acquire rf.mu to take the snapshot, while the other side has to wait for the snapshot to finish before it can deliver the next command and carry on.
Solution
1. Make the following change in the Apply function:
++ rf.mu.Unlock()
rf.applyCh <- msg
++ rf.mu.Lock()
- A negative log index shows up in AppendEntries
Identified issues
1. Originally it was enough to check that args.PrevLogIndex is no larger than the index of the last log entry. After introducing snapshots, args.PrevLogIndex must also be no smaller than lastIncludedIndex, because stale AppendEntries RPCs can arrive late.
if args.PrevLogIndex > rf.logEntries[len(rf.logEntries)-1].Index ||
	rf.logEntries[args.PrevLogIndex-rf.snapshot.LastIncludedIndex].Term != args.PrevLogTerm { // ATTENTION: this can go out of bounds; subtract lastIncludedIndex
	reply.Success = false
	DebugLog(dVote, "S%d append fail, diff term\n", rf.me)
	return
}
Solution
1. In this case, simply set reply.Success to true and return: the handling of a successful reply already accounts for delayed messages (see 2C).
if rf.snapshot.LastIncludedIndex > args.PrevLogIndex {
	reply.Success = true
	DebugLog(dVote, "S%d append success, snapshot interrupt...\n", rf.me)
	return
}
- A negative log index shows up in broadcast
Debug log
nextIndex[0] == snapshot.lastIncludedIndex
S1 -> S0 LogInfo {IDX: 108, LENLOG: 8, LOG: {5178380306894727311 1 116}}, NEXT: [109 1 117], SS: {109 1}
S2, LENLOG: 18, LA: 108 apply msg {CI: 109}
S2, LENLOG: 18, LA: 108 apply msg {CI: 110}
S2 -> S1 recv heartbeat
#################################################
panic: runtime error: index out of range [-1]
goroutine 890 [running]:
6.824/raft.(*Raft).LogInfoByIndex(0xc000134100, 0xffffffffffffffff)
/home/humborn/Code/go/MIT-6.5840/src/raft/util.go:101 +0xbd
6.824/raft.(*Raft).broadcastAppendEntries.func1(0x0)
/home/humborn/Code/go/MIT-6.5840/src/raft/raft.go:716 +0x84c
created by 6.824/raft.(*Raft).broadcastAppendEntries
/home/humborn/Code/go/MIT-6.5840/src/raft/raft.go:669 +0x15f
exit status 2
Identified issues
1. Same pattern: the broadcast spawns goroutines that do not run immediately, and a snapshot updates lastIncludedIndex before a goroutine gets to run. Looking closely, the failing log indices are all congruent to 9 mod 10, which matches the snapshot cadence and confirms the guess.
if args.PrevLogIndex > rf.logEntries[len(rf.logEntries)-1].Index ||
	rf.logEntries[args.PrevLogIndex-rf.snapshot.LastIncludedIndex].Term != args.PrevLogTerm { // ATTENTION: this can go out of bounds; subtract lastIncludedIndex
	reply.Success = false
	DebugLog(dVote, "S%d append fail, diff term\n", rf.me)
	return
}
Solution
1. In the broadcast, change the condition that triggers sending a snapshot from < to ≤.
go func(id int) {
	rf.mu.Lock()
	// Note: rf.nextIndex[id] == rf.logEntries[len(rf.logEntries)-1].Index+1 corresponds to sending a heartbeat
	if rf.nextIndex[id] > rf.logEntries[len(rf.logEntries)-1].Index+1 {
		rf.mu.Unlock()
		return
	}
	~~if rf.nextIndex[id] < rf.snapshot.LastIncludedIndex {~~
	if rf.nextIndex[id] <= rf.snapshot.LastIncludedIndex {
- ApplyMsg out of order, round two; see the second issue in 2C for the background.
Debug log
Test (2D): snapshots basic ...
apply error: server 0 apply out of order, expected index 100, got 101
exit status 1
FAIL 6.824/raft 1.269s
Identified issues
1. In the original Apply implementation, the loop variable cid was initialized only once; here, however, snapshots keep being generated as records are inserted during the loop, invalidating the value the initialization was based on.
for cid := max(rf.lastApplied, rf.snapshot.LastIncludedIndex) + 1; cid <= rf.commitIndex; cid++ {
	DebugLog(dError, "S%d, LENLOG: %d, LA: %d apply msg {CI: %d} \n", rf.me, len(rf.logEntries), rf.lastApplied, cid)
	msg := ApplyMsg{
		CommandValid: true,
		Command:      rf.logEntries[cid-rf.snapshot.LastIncludedIndex].Command,
		CommandIndex: cid,
	}
	rf.mu.Unlock()
	rf.applyCh <- msg
	rf.mu.Lock()
}
Solution
1. Re-initialize it on every iteration.
cid := max(rf.lastApplied, rf.snapshot.LastIncludedIndex) + 1
for cid <= rf.commitIndex {
	DebugLog(dError, "S%d, LENLOG: %d, LA: %d apply msg {CI: %d} \n",
		rf.me, len(rf.logEntries), rf.lastApplied, cid)
	msg := ApplyMsg{
		CommandValid: true,
		Command:      rf.logEntries[cid-rf.snapshot.LastIncludedIndex].Command,
		CommandIndex: cid,
	}
	rf.mu.Unlock()
	rf.applyCh <- msg
	rf.mu.Lock()
	rf.lastApplied = cid
	cid = max(rf.lastApplied, rf.snapshot.LastIncludedIndex) + 1
}
- First, note that apply-out-of-order comes in two shapes: receiving a delayed message, and receiving a message "from the future". Taking the InstallSnapshot RPC as an example, delayed messages can be filtered with the code below. And since Raft's Term, commitIndex, lastApplied and the other indices are all kept complete, a message "from the future" should in principle never occur.
if rf.commitIndex < args.LastIncludedIndex {
	rf.commitIndex = args.LastIncludedIndex
}
if rf.lastApplied < args.LastIncludedIndex {
	rf.lastApplied = args.LastIncludedIndex
}
Debug log
apply error: server 0 apply out of order, expected index 230, got 231
exit status 1
FAIL 6.824/raft 1.852s
Identified issues
1. In the test code, once a snapshot is received from applyCh, the service server refreshes its own state from the snapshot's lastIncludedIndex, which causes a "version rollback".
Solution
1. In InstallSnapshot, compare against lastApplied to filter out the "version rollback".
if args.LastIncludedIndex <= rf.snapshot.LastIncludedIndex ||
	args.LastIncludedIndex <= rf.lastApplied {
	rf.mu.Unlock()
	return
}
- A baffling error gets reported
Debug log
Caused by the Index and the LII in the first column not matching
S0 ingestSnap: snapshot doesn't match
m.SnapshotIndex, INDEX: 39, LII: 29
S0 C46 install snapshot, IDX: 39 T: 1,
start from 39 to 46
#############################################################################
apply error: server 0 snapshot doesn't match m.SnapshotIndex
exit status 1
FAIL 6.824/raft 1.041s
Identified issues
1. Since the earlier debugging had essentially eliminated the out-of-order cases, my guess is that SaveStateAndSnapshot must execute in a particular order; with goroutine delays, the save no longer intercepts out-of-order messages the way it should.
Solution
1. Move every rf.mu.Lock() to after the SaveStateAndSnapshot call to guarantee a linear order.
- PASS LOG
➜ go test -run 2D
Test (2D): snapshots basic ...
... Passed -- 1.8 3 506 164008 235
Test (2D): install snapshots (disconnect) ...
... Passed -- 32.9 3 2444 1245742 316
Test (2D): install snapshots (disconnect+unreliable) ...
... Passed -- 38.7 3 3604 1300893 327
Test (2D): install snapshots (crash) ...
... Passed -- 24.5 3 1148 879888 334
Test (2D): install snapshots (unreliable+crash) ...
... Passed -- 33.0 3 1432 692990 322
Test (2D): crash and restart all servers ...
... Passed -- 6.7 3 244 64312 54
PASS
ok 6.824/raft 137.622s