Notes on Debugging a Kernel Panic

"A cup of tea, a pack of cigarettes, a whole day on a single bug" — epigraph

I have recently been working on the Buffer cache assignment in MIT 6.S081's Lab Locks. The task, roughly, is to implement an LRU cache for the file system's block cache that stays thread-safe while achieving as much throughput as possible.

The starter code implements the LRU queue as a doubly linked list, with a single spinlock protecting both the list and every element in it, so contention on that one lock is predictably fierce. The lab offers plenty of hints: turn the cache into a hash table, resolving collisions with chaining; give each bucket its own lock, so operations on different buckets can run concurrently; and record access times with timestamps to avoid traversing the list backwards.

A request falls into one of the following cases, handled in increasing order:

  1. The buffer is already cached: take the current bucket's lock, walk its list, and return the buffer once found;
  2. The buffer is not cached, but the current bucket has a free buffer: walk the current bucket's list, pick a free buffer by the LRU rule, assign it to this request, and return it;
  3. The buffer is not cached and the current bucket has no free buffer, but another bucket does: scan the whole hash table bucket by bucket, "steal" a free buffer from another bucket, assign it to this request, and return it;
  4. The buffer is not cached and no bucket in the entire table has a free buffer: a real OS would allocate extra buffers to serve the request, but xv6 is a teaching OS, so in this case it simply kernel panics and dies on you.

The code itself was not too hard to write, and once it was debugged that should have been the end of it. But under the full load generated by usertests' manywrites, the kernel crashed:

test manywrites: 
panic: sched locks

Let's look at the code that raises this panic:

// Switch to scheduler.  Must hold only p->lock
// and have changed proc->state. Saves and restores
// intena because intena is a property of this
// kernel thread, not this CPU. It should
// be proc->intena and proc->noff, but that would
// break in the few places where a lock is held but
// there's no process.
void
sched(void)
{
  int intena;
  struct proc *p = myproc();

  if(!holding(&p->lock))
    panic("sched p->lock");
  if(mycpu()->noff != 1)
    panic("sched locks");
  if(p->state == RUNNING)
    panic("sched running");
  if(intr_get())
    panic("sched interruptible");

  intena = mycpu()->intena;
  swtch(&p->context, &mycpu()->context);
  mycpu()->intena = intena;
}

sched is a C function that checks a series of invariants and then calls the assembly routine swtch to perform the context switch. A spinlock, as the name suggests, is a synchronization primitive based on busy waiting. Spinlocks must be designed, implemented, and used with great care: handling an interrupt or scheduling a process while holding one can easily deadlock. As the xv6 book explains:

Some xv6 spinlocks protect data that is used by both threads and interrupt handlers. For example, the clockintr timer interrupt handler might increment ticks (kernel/trap.c:163) at about the same time that a kernel thread reads ticks in sys_sleep (kernel/sysproc.c:64). The lock tickslock serializes the two accesses.

The interaction of spinlocks and interrupts raises a potential danger. Suppose sys_sleep holds tickslock, and its CPU is interrupted by a timer interrupt. clockintr would try to acquire tickslock, see it was held, and wait for it to be released. In this situation, tickslock will never be released: only sys_sleep can release it, but sys_sleep will not continue running until clockintr returns. So the CPU will deadlock, and any code that needs either lock will also freeze.

xv6 therefore takes the blunt approach: while a CPU holds any spinlock, interrupts on that CPU are disabled.

To avoid this situation, if a spinlock is used by an interrupt handler, a CPU must never hold that lock with interrupts enabled. Xv6 is more conservative: when a CPU acquires any lock, xv6 always disables interrupts on that CPU. Interrupts may still occur on other CPUs, so an interrupt’s acquire can wait for a thread to release a spinlock; just not on the same CPU.

With that in mind, back to the code. mycpu()->noff is the current nesting depth of spinlocks on this CPU: each acquire increments it, each release decrements it, and when noff drops back to 0 the previous interrupt state is restored.

xv6 is a multiprocessor kernel: schedulers run on every CPU at once, so the process control block (p) must itself be protected by a spinlock, p->lock. It is acquired before sched and released by the scheduler that swtch switches to. After acquiring p->lock, noff should be exactly 1, so panic: sched locks means some extra spinlock was still held at that point. The whole debugging effort boils down to finding it.

Round 1 —— Printf

Since an extra lock is being held, printing the lock's name on every acquire and release should make the lock that never gets released easy to spot. For debugging convenience xv6 gives every spinlock a name and provides a kernel-mode printf, so the instrumentation is trivial:

void
acquire(struct spinlock *lk)
{
  printf("Acquire %s\n",lk->name);
  ....
}

void
release(struct spinlock *lk)
{
  printf("Release %s\n",lk->name);
  ....
}

Rebuild and run with make clean && make qemu... and the familiar shell never shows up. One glance at printf explains why: it turns right around and calls acquire itself, so acquire → printf → acquire recurses until the stack overflows and the kernel dies in a page fault. So much for the mighty printf.

void
printf(char *fmt, ...)
{
  va_list ap;
  int i, c, locking;
  char *s;

  locking = pr.locking;
  if(locking)
    acquire(&pr.lock);
   ....

Round 2 —— MyPanic

With printf ruled out, another route was needed. Perhaps panic itself could print the locks the current CPU holds? But panic's signature leaves no room for that kind of extension, and panic is already called all over the kernel, so changing its signature would ripple through everything.

void
panic(char *s)
{
  pr.locking = 0;
  printf("panic: ");
  printf(s);
  printf("\n");
  panicked = 1; // freeze uart output from other CPUs
  for(;;)
    ;
}

So, a change of tack: write a MyPanic that takes the held lock names as a stack argument, prints them all, and finally calls panic.

void
MyPanic(char *s, char *stack[], int top)
{
  printf("locks: \n");
  for(int i = 0; i < top; i++){
    printf("%s\n", stack[i]);
  }
  panic(s);
}

Skipping the assorted pitfalls around defining and using the globals and wiring up the Makefile: this failed too. The reasons:

  1. the lock names pushed by acquire on the current CPU get clobbered by release calls for other locks on other CPUs, so the name on top of the stack is meaningless;
  2. the stack itself needs a lock to protect it... which loops right back to where we started.

Round 3 —— GDB

You can always count on GDB, and with nowhere left to turn, here I am.

Set a breakpoint with b panic, then bt for the backtrace. Wait, why is a sleeplock involved?

#0  panic (s=s@entry=0x80008240 "sched locks") at kernel/printf.c:124
#1  0x0000000080001632 in sched () at kernel/proc.c:483
#2  0x00000000800016cc in sleep (chan=chan@entry=0x800171e0 <bcache+32736>, 
    lk=lk@entry=0x800171e8 <bcache+32744>) at kernel/proc.c:547
#3  0x0000000080003c16 in acquiresleep (lk=lk@entry=0x800171e0 <bcache+32736>)
    at kernel/sleeplock.c:26
#4  0x0000000080002618 in bget (blockno=<optimized out>, dev=2147578312) at kernel/bio.c:134
#5  bread (dev=dev@entry=1, blockno=<optimized out>) at kernel/bio.c:212
#6  0x0000000080002a54 in balloc (dev=<optimized out>) at kernel/fs.c:72
#7  0x0000000080002b9e in bmap (ip=ip@entry=0x8001b298 <itable+320>, bn=bn@entry=3)
    at kernel/fs.c:386
#8  0x000000008000336c in writei (ip=0x8001b298 <itable+320>, user_src=user_src@entry=1, 
    src=src@entry=51776, off=3072, n=n@entry=3072) at kernel/fs.c:499
#9  0x000000008000406e in filewrite (f=0x8001cf58 <ftable+112>, addr=48704, n=<optimized out>)
    at kernel/file.c:164

sleeplock is the other in-kernel synchronization primitive xv6 implements, similar to a mutex. The biggest difference between a sleeplock and a spinlock is that a sleeplock yields the CPU while waiting. Writes to a block are very slow, so to avoid wasting cycles xv6 guards each block cache buffer with a sleeplock. acquiresleep calls sleep, which marks the current process as sleeping and then invokes the scheduler via sched to switch contexts, and the kernel panic happens during exactly that switch. It all fits.

struct buf {
  int valid;   // has data been read from disk?
  int disk;    // does disk "own" buf?
  uint dev;
  uint blockno;
  struct sleeplock lock;
  uint refcnt;
  ...
};

bget returns a locked buf from the LRU queue:

static struct buf*
bget(uint dev, uint blockno)
{
  ...
  acquire(&bcache.lock);
  // Is the block already cached?
  for(b = bcache.hashtable[no].next; b != &bcache.hashtable[no]; b = b->next){
    if(b->dev == dev && b->blockno == blockno){
      b->refcnt++;
      acquiresleep(&b->lock);
      release(&bcache.hashlock[no]);
      return b;
    }
  }

Sharp-eyed readers may have spotted it already: at the point of the acquiresleep call, the current CPU is still holding two spinlocks, bcache.lock and bcache.hashlock[no]. Is that the root cause?

Comparing with the stock code, the order is indeed flipped:

  release(&bcache.lock);
  acquiresleep(&b->lock);

Moving acquiresleep after release passes the whole test, so it was the ordering all along!

Re-reading section 6.4 of the book, Deadlock and lock ordering: sure enough, the importance of lock ordering has been stressed more than once.

If a code path through the kernel must hold several locks at the same time, it is important that all code paths acquire those locks in the same order. If they don’t, there is a risk of deadlock….
To avoid such deadlocks, all code paths must acquire locks in the same order. The need for a global lock acquisition order means that locks are effectively part of each function’s specification: callers must invoke functions in a way that causes locks to be acquired in the agreed-on order
Honoring a global deadlock-avoiding order can be surprisingly difficult. The danger of deadlock is often a constraint on how fine-grained one can make a locking scheme, since more locks often means more opportunity for deadlock…
The need to avoid deadlock is often a major factor in kernel implementation.

Conclusions & Lessons

  1. Lock ordering is a key technique for avoiding deadlock inside a kernel.
  2. Designing a global lock acquisition order is genuinely difficult.
  3. Lock ordering demands the programmer's explicit attention; one slip and you deadlock.
  4. GDB beats printf, every time.

To wrap up, a pop quiz: explain how each of the following outputs came about.

  • A programmer had a problem. two now problems he has he decided, to solve it with threads.
  • A programmer had a problem. He decided to solve it with threads. Now he has two problems.