Debugging QEMU/KVM Setup If Something Goes Wrong (e.g., Support for 1GB Pages)
01 Nov 2020
In our previous projects, we always did all development on real hardware. For example, LXDs and LVDs required bare-metal speed of the cache-coherence protocol and support for nested virtualization (both systems use hardware-supported virtualization), so development under QEMU looked unrealistic. Well, maybe we needed to explore more: KVM supports nested virtualization, but we also needed support for features like extended page table (EPT) switching with VMFUNC, so in the end we were reluctant to take this approach.
But our most recent project, RedLeaf, is a new operating system implemented from scratch in Rust. This was the first time we used the QEMU/KVM pair for development, and we found it extremely effective. Developing OS kernels under QEMU/KVM gives a much quicker development cycle and a ton of debugging opportunities (e.g., attaching GDB, dumping page tables from QEMU, understanding triple faults, etc.). Plus, it removes the extremely annoying, long reboot cycle.
We will describe what we’ve learned in a collection of posts, and hopefully our lessons will be useful to others. It took us some time to debug several things that did not work as expected when run on top of QEMU/KVM. Here, we describe our experience of debugging 1GB page support with KVM.
Spoiler: our bug is trivial; we just did not pass the correct CPU model as an argument. So if you simply want to get it running, scroll to the bottom. Our goal with this post is to share the tricks that allowed us to debug similar issues with the QEMU setup.
The problem
We started our development with a three-level page table using 2MiB huge pages. Later, we wanted more memory and decided to switch to 1GiB huge pages (a two-level page table).
Here is our page-table setup, which direct-maps the first 32GiB of memory (a small user-space sketch of the resulting entries follows the listing):
setup_huge_page_tables:
; map first P4 entry to P3 table
mov rax, hp3_table
or rax, 0b11 ; present + writable
mov [hp4_table], rax
;map each P3 entry to a huge 1GiB page
mov ecx, 0 ; counter variable
.map_hp3_table:
; map ecx-th P3 entry to a huge page that starts at address 1GiB*ecx
mov rax, 1 << 30 ; 1GiB
mul ecx ; start address of ecx-th page
shl rdx, 32
or rax, rdx
or rax, 0b10000011 ; present + writable + huge
mov [hp3_table + ecx * 8], rax ; map ecx-th entry
inc ecx ; increase counter
cmp ecx, 0x20 ; if counter == 32, all 32 entries in the P3 table are mapped
jne .map_hp3_table ; else map the next entry
; Apic regions would belong in the first few gigabytes
ret
section .bss
hp4_table:
resb 4096
hp3_table:
resb 4096
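To double-check what this loop builds, here is a hypothetical user-space sketch (not part of the boot code) that computes the same 32 P3 entries: entry i maps a 1GiB page at physical address i * 1GiB with the present, writable, and huge bits (0x83) set.
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Mirror of the assembly loop above: 32 direct-mapped 1GiB entries. */
    uint64_t hp3[512] = {0};

    for (uint64_t i = 0; i < 32; i++)
        hp3[i] = (i << 30) | 0x83;  /* present + writable + huge */

    printf("hp3[0]  = %#llx\n", (unsigned long long)hp3[0]);   /* 0x83 */
    printf("hp3[1]  = %#llx\n", (unsigned long long)hp3[1]);   /* 0x40000083 */
    printf("hp3[31] = %#llx\n", (unsigned long long)hp3[31]);  /* 0x7c0000083 */
    return 0;
}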
With this boot-time page table, everything was good when we ran it on bare metal, but things started to break under QEMU/KVM. All we had access to was an internal error from KVM and a register dump.
KVM internal error. Suberror: 1
emulation failure
EAX=80000011 EBX=00000000 ECX=c0000080 EDX=00000000
ESI=00000000 EDI=00000000 EBP=00000000 ESP=01bfa000
EIP=00133025 EFL=00010086 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0018 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
CS =0010 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0018 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0018 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
FS =0018 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
GS =0018 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS64-busy
GDT= 0000000000100018 0000000f
IDT= 0000000000000000 00000000
CR0=80000011 CR2=0000000000000000 CR3=0000000000bf8000 CR4=00000020
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000500
The program counter was pointing to an instruction which accesses the VGA buffer:
133025: 66 c7 05 00 80 0b 00 48 02 movw $0x248,0xb8000(%rip)
At this point, what can we do to debug the crash? We were kind of lost: why the VGA buffer? What went wrong?
Step 1: Enable KVM tracing
We start debugging by enabling KVM tracing with Linux tracepoints. [Linux tracepoints](https://www.kernel.org/doc/html/latest/trace/tracepoints.html) are a lightweight instrumentation facility embedded in the Linux kernel. One can dynamically enable these tracepoints by registering a function that is called when the tracepoint is executed.
The KVM code has a lot of tracepoints for instrumenting various events. The list of tracepoints can be obtained by running perf list, as shown below.
$ sudo perf list | grep kvm
...
kvm:kvm_emulate_insn [Tracepoint event]
kvm:kvm_enter_smm [Tracepoint event]
kvm:kvm_entry [Tracepoint event]
kvm:kvm_eoi [Tracepoint event]
kvm:kvm_exit [Tracepoint event]
kvm:kvm_fast_mmio [Tracepoint event]
kvm:kvm_fpu [Tracepoint event]
kvm:kvm_halt_poll_ns [Tracepoint event]
...
trace-cmd offers a set of tools to trace and collect these events.
Let’s run with all kvm tracepoints enabled:
sudo trace-cmd record -b 20000 -e kvm
From the dumped report,
sudo trace-cmd report > trace-cmd.txt
we have some more details:
qemu-system-x86-31218 [000] 159269.806542: kvm_exit: reason EPT_MISCONFIG rip 0x133025 info 0 0
qemu-system-x86-31218 [000] 159269.806546: kvm_emulate_insn: 0:133025: ec
qemu-system-x86-31218 [000] 159269.806547: kvm_emulate_insn: 0:133025: ec FAIL
qemu-system-x86-31218 [000] 159269.806548: kvm_userspace_exit: reason KVM_EXIT_INTERNAL_ERROR (17)
qemu-system-x86-31218 [000] 159269.806548: kvm_fpu: unload
qemu-system-x86-31215 [007] 159325.605844: kvm_hv_stimer_cleanup: vcpu_id 0 timer 0
qemu-system-x86-31215 [007] 159325.605852: kvm_hv_stimer_cleanup: vcpu_id 0 timer 1
qemu-system-x86-31215 [007] 159325.605852: kvm_hv_stimer_cleanup: vcpu_id 0 timer 2
qemu-system-x86-31215 [007] 159325.605853: kvm_hv_stimer_cleanup: vcpu_id 0 timer 3
Well, grepping for KVM_EXIT_INTERNAL_ERROR did not give us much information. But what caught our eye was that the instruction dump at rip was not what we see in objdump. So, we looked at the previous kvm_emulate_insn entries in the trace report to see how the instruction is handled. Below is the previous instance of the kvm_emulate_insn trace.
qemu-system-x86-31218 [000] 159269.805554: kvm_exit: reason IO_INSTRUCTION rip 0xa962 info 1770008 0
qemu-system-x86-31218 [000] 159269.805555: kvm_emulate_insn: f0000:a962: ec
qemu-system-x86-31218 [000] 159269.805555: kvm_userspace_exit: reason KVM_EXIT_IO (2)
From the above, it looks like KVM failed to fetch the instruction at our crashed rip (0x133025); note that the emulator shows the same byte (ec) for both the earlier IO instruction and our faulting address. The kvm_emulate_insn trace is located in arch/x86/kvm/x86.c.
The exit reason is an EPT misconfiguration,
qemu-system-x86-31218 [000] 159269.806542: kvm_exit: reason EPT_MISCONFIG rip 0x133025 info 0 0
and the request for emulation originates from handle_ept_misconfig, which matches our trace log:
static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
{
...
ret = handle_mmio_page_fault(vcpu, gpa, true);
if (likely(ret == RET_MMIO_PF_EMULATE))
return x86_emulate_instruction(vcpu, gpa, 0, NULL, 0) ==
EMULATE_DONE;
...
From arch/x86/kvm/x86.c
int x86_emulate_instruction(struct kvm_vcpu *vcpu,
unsigned long cr2,
int emulation_type,
void *insn,
int insn_len)
{
struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
...
if (!(emulation_type & EMULTYPE_NO_DECODE)) {
init_emulate_ctxt(vcpu);
...
r = x86_decode_insn(ctxt, insn, insn_len);
trace_kvm_emulate_insn_start(vcpu);
Step 2: Taking a closer look at function arguments with SystemTap
We want to understand what goes wrong in decoding the instruction at our faulting rip. For example, we may want to peek into the arguments of the x86_decode_insn function to observe insn, or look at its return value.
One way is to modify the kernel sources and add more debugging information to these functions. But that is quite an invasive change. Instead, we use SystemTap, a non-invasive way to attach probes at the call and return sites of various kernel functions without modifying the kernel sources.
SystemTap offers a nice command-line interface and a scripting language with which one can attach call/return probes to kernel functions. Additionally, it offers a guru mode, where you can place embedded C blocks that use kernel data structures and functions.
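As a tiny, hypothetical example of the interface (not part of our debugging session), the following one-liner prints every call into the kvm module's kvm_mmu_page_fault function together with its arguments:
# Hypothetical example: print each call to kvm_mmu_page_fault and its arguments.
sudo stap -e 'probe module("kvm").function("kvm_mmu_page_fault") {
    printf("%s(%s)\n", ppfunc(), $$parms)
}'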
The code flow of x86_decode_insn in our scenario is, roughly: x86_decode_insn → (instruction fetch) → kvm_fetch_guest_virt → gva_to_gpa → kvm_vcpu_read_guest_page. Here, gva_to_gpa is a function pointer which changes dynamically based on the context we run the guest in. Instead of trying to dig through the code to resolve it, we can use SystemTap to figure out at runtime which function the gva_to_gpa pointer is mapped to.
Also, to understand whether address translation happens correctly, we monitored both of these calls: kvm_fetch_guest_virt and kvm_vcpu_read_guest_page.
Here is the SystemTap script we use to attach a probe at the call sites of these functions (print_fp is a guru-mode helper that extracts the gva_to_gpa pointer; a sketch of it follows the script).
global count = 0
probe module("kvm").function("kvm_fetch_guest_virt") {
printf("%s eip 0x%08x, addr: 0x%08x, bytes: %d\n", ppfunc(), $ctxt->eip, $addr, $bytes);
if (count == 0) {
printf("fp_gva_to_gpa: %x\n", print_fp($ctxt));
count++;
}
// We want to know the gva_to_gpa when our guest is at this rip
if ($addr == 0x133025) {
printf("fp_gva_to_gpa: %x\n", print_fp($ctxt));
}
}
probe module("kvm").function("kvm_vcpu_read_guest_page") {
printf(" -%s => gfn 0x%08x\n", ppfunc(), $gfn);
}
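We have not reproduced our exact print_fp helper above, but the following guru-mode sketch (run stap with -g) shows roughly what it needs to do, assuming the emulation context is embedded in struct kvm_vcpu as arch.emulate_ctxt, as in the x86_emulate_instruction snippet earlier; header paths and struct layouts vary across kernel versions.
%{
#include <linux/kvm_host.h>   /* struct kvm_vcpu, struct kvm_mmu */
%}

/* Recover the vcpu from the emulation context and return the current
 * walk_mmu->gva_to_gpa function pointer. */
function print_fp:long (ctxt:long) %{
    struct x86_emulate_ctxt *c = (struct x86_emulate_ctxt *)(uintptr_t)STAP_ARG_ctxt;
    struct kvm_vcpu *vcpu = container_of(c, struct kvm_vcpu, arch.emulate_ctxt);

    STAP_RETVALUE = (long)vcpu->arch.walk_mmu->gva_to_gpa;
%}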
We run this, logging the output to a file:
sudo stap -v ./kvm.stp -o kvm_stap.log
Here is what we have:
kvm_fetch_guest_virt eip 0x0000cfa6, addr: 0x000fcfa6, bytes: 15
fp_gva_to_gpa: ffffffffa04416e0 // This is what gva_to_gpa points to at the beginning
-kvm_vcpu_read_guest_page => gfn 0x000000fc
...
<snip>
kvm_fetch_guest_virt eip 0x00133025, addr: 0x00133025, bytes: 15
fp_gva_to_gpa: ffffffffa0448660 // This is what gva_to_gpa points to when we hit our bug
The log shows that, for the offending address, we see the record for kvm_fetch_guest_virt, but we do not see any for kvm_vcpu_read_guest_page, which means that gva_to_gpa likely returned an error.
/* used for instruction fetching */
static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
gva_t addr, void *val, unsigned int bytes,
struct x86_exception *exception)
{
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
unsigned offset;
int ret;
/* Inline kvm_read_guest_virt_helper for speed. */
gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, access|PFERR_FETCH_MASK,
exception);
if (unlikely(gpa == UNMAPPED_GVA))
return X86EMUL_PROPAGATE_FAULT;
offset = addr & (PAGE_SIZE-1);
if (WARN_ON(offset + bytes > PAGE_SIZE))
bytes = (unsigned)PAGE_SIZE - offset;
ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
offset, bytes);
if (unlikely(ret < 0))
return X86EMUL_IO_NEEDED;
return X86EMUL_CONTINUE;
}
From kallsyms, we can get the actual functions that the gva_to_gpa function pointers point to.
$ sudo grep -e ffffffffa04416e0 -e ffffffffa0448660 /proc/kallsyms
ffffffffa04416e0 t nonpaging_gva_to_gpa [kvm]
ffffffffa0448660 t paging64_gva_to_gpa [kvm]
During initialization, paging is turned off, so the function pointer points to nonpaging_gva_to_gpa; when paging is turned on, it points to paging64_gva_to_gpa. So, at our offending address, paging64_gva_to_gpa likely returns a failure.
All gva_to_gpa helpers are templatized in the file paging_tmpl.h: paging64_gva_to_gpa is nothing but a wrapper around paging64_walk_addr, which in turn is a wrapper around paging64_walk_addr_generic, which walks the relevant page table (see the abridged template snippet below).
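For reference, the templatization is done with token pasting: the file is included once per paging mode with PTTYPE set, and FNAME() stamps out paging64_*, paging32_*, and ept_* variants of the same walker functions. Abridged from arch/x86/kvm/paging_tmpl.h (the exact set of defines varies across kernel versions):
#if PTTYPE == 64
    #define pt_element_t u64
    #define guest_walker guest_walker64
    #define FNAME(name) paging##64_##name
    ...
#elif PTTYPE == 32
    #define pt_element_t u32
    #define guest_walker guest_walker32
    #define FNAME(name) paging##32_##name
    ...
#endif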
Adding a return probe at this function in the SystemTap script gives us the return value.
global ret_active = 0
probe module("kvm").function("paging64_walk_addr_generic") {
if ($addr == 0x00133025) {
printf("Walking: %s\n", $$parms);
ret_active = 1;
}
}
probe module("kvm").function("paging64_walk_addr_generic").return {
if (ret_active > 0) {
printf("return: %s\n", $$return);
}
}
This gives us:
Walking: walker=0xffff88140cf07a50 vcpu=0xffff882820b48000 mmu=0xffff882820b48300 addr=0x133025 access=0x10
return: return=0x0
Step 3: Enable even more KVM tracing
The function paging64_walk_addr_generic returns 1 if it has found a valid mapping and 0 otherwise (in case of an error). From the SystemTap log, we see a failure in the address walk, but we do not yet know what happened.
Fortunately, there are more tracepoints in the page walk code.
Also, from perf list, we have a bunch of kvmmmu tracepoints.
$ sudo perf list | grep kvmmmu
...
kvmmmu:kvm_mmu_set_dirty_bit [Tracepoint event]
kvmmmu:kvm_mmu_set_spte [Tracepoint event]
kvmmmu:kvm_mmu_spte_requested [Tracepoint event]
kvmmmu:kvm_mmu_sync_page [Tracepoint event]
kvmmmu:kvm_mmu_unsync_page [Tracepoint event]
kvmmmu:kvm_mmu_walker_error [Tracepoint event]
...
By enabling the tracepoints in paging64_walk_addr_generic,
sudo trace-cmd record -b 20000 -e kvm -e kvm_mmu_pagetable_walk -e kvm_mmu_paging_element -e kvm_mmu_walker_error
we have more information about the error:
qemu-system-x86-31218 [000] 159269.806542: kvm_exit: reason EPT_MISCONFIG rip 0x133025 info 0 0
qemu-system-x86-31218 [000] 159269.806544: kvm_mmu_pagetable_walk: addr 133025 pferr 10 F
qemu-system-x86-31218 [000] 159269.806545: kvm_mmu_paging_element: pte bf9023 level 4
qemu-system-x86-31218 [000] 159269.806545: kvm_mmu_paging_element: pte a3 level 3
qemu-system-x86-31218 [000] 159269.806546: kvm_mmu_walker_error: pferr 9 P|RSVD
qemu-system-x86-31218 [000] 159269.806546: kvm_emulate_insn: 0:133025: ec
qemu-system-x86-31218 [000] 159269.806547: kvm_emulate_insn: 0:133025: ec FAIL
qemu-system-x86-31218 [000] 159269.806548: kvm_userspace_exit: reason KVM_EXIT_INTERNAL_ERROR (17)
Pagetable organization
Before decoding the error, let’s take a quick look at our page-table organization. We have just two levels (level 4 and level 3).
From the log, the walk was successful for the level-4 table, and while walking level 3 we get an error which hints that a reserved bit is set in the level-3 entry. The level-3 entry in the trace is 0xa3: the 0x83 (present + writable + huge) we wrote, plus the accessed bit; bit 7 is the one we set to get a 1GiB page.
static int FNAME(walk_addr_generic)(struct guest_walker *walker,
struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
gva_t addr, u32 access)
{
...
trace_kvm_mmu_pagetable_walk(addr, access);
do {
...
if (unlikely(is_rsvd_bits_set(mmu, pte, walker->level))) {
errcode = PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
goto error;
}
...
} while (!is_last_gpte(mmu, walker->level, pte));
error:
...
trace_kvm_mmu_walker_error(walker->fault.error_code);
return 0;
}
The check for reserved bits happens here:
static bool
__is_rsvd_bits_set(struct rsvd_bits_validate *rsvd_check, u64 pte, int level)
{
int bit7 = (pte >> 7) & 1, low6 = pte & 0x3f;
return (pte & rsvd_check->rsvd_bits_mask[bit7][level-1]) |
((rsvd_check->bad_mt_xwr & (1ull << low6)) != 0);
}
In our page-table entry, we set bit 7 of the level-3 entry to enable 1GiB pages. With those inputs (bit7 = 1, level = 3), the check above tests the pte against rsvd_bits_mask[1][2]; if that mask contains bit 7, the walk fails with exactly the P|RSVD error we saw. The mask is set in __reset_rsvds_bits_mask.
From arch/x86/kvm/mmu.c
static void
__reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
struct rsvd_bits_validate *rsvd_check,
int maxphyaddr, int level, bool nx, bool gbpages,
bool pse, bool amd)
{
...
if (!gbpages)
gbpages_bit_rsvd = rsvd_bits(7, 7);
...
switch (level) {
...
case PT64_ROOT_LEVEL:
...
rsvd_check->rsvd_bits_mask[1][2] = exb_bit_rsvd |
gbpages_bit_rsvd | rsvd_bits(maxphyaddr, 51) |
rsvd_bits(13, 29);
break;
...
}
static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
struct kvm_mmu *context)
{
__reset_rsvds_bits_mask(vcpu, &context->guest_rsvd_check,
cpuid_maxphyaddr(vcpu), context->root_level,
context->nx, guest_cpuid_has_gbpages(vcpu),
is_pse(vcpu), guest_cpuid_is_amd(vcpu));
}
So, whether bit 7 is treated as reserved depends on whether the guest CPU has the gbpages capability, which comes from CPUID leaf 0x8000_0001, EDX[26].
Running cpuid on the host gives,
$ cpuid -1
...
extended feature flags (0x80000001/edx):
SYSCALL and SYSRET instructions = true
execution disable = true
1-GB large page support = true
...
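To check programmatically what a given (v)CPU advertises, the same bit can be read directly. Here is a minimal user-space sketch using GCC's cpuid.h (runnable on the host, or on any Linux guest booted with the -cpu settings in question):
#include <stdio.h>
#include <cpuid.h>

/* 1GiB page support is CPUID leaf 0x80000001, EDX bit 26 ("pdpe1gb"). */
int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
        return 1;

    printf("1GiB pages: %s\n",
           (edx & (1u << 26)) ? "supported" : "not supported");
    return 0;
}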
Root-cause: CPUID
The immediate conclusion is: our host supports 1GiB pages, but our guest CPU does not. Digging through the QEMU sources shows that only a few server-class CPU models enable this feature by default.
static X86CPUDefinition builtin_x86_defs[] = {
...
{
.name = "Skylake-Server",
.level = 0xd,
.vendor = CPUID_VENDOR_INTEL,
.features[FEAT_8000_0001_EDX] =
CPUID_EXT2_LM | CPUID_EXT2_PDPE1GB | CPUID_EXT2_RDTSCP |
CPUID_EXT2_NX | CPUID_EXT2_SYSCALL,
...
Also, from the QEMU documentation:
pdpe1gb
Recommended to allow guest OS to use 1GB size pages.
Not included by default in any Intel CPU model.
Should be explicitly turned on for all Intel CPU models.
Note that not all CPU hardware will support this feature.
Solution: passing the CPU model that supports 1GB pages
The conclusion is: on the QEMU command line, one should specify a CPU model that supports this feature (according to the QEMU sources), or explicitly pass the pdpe1gb flag to enable 1GiB large-page support.
qemu-system-x86_64 -cpu Haswell,pdpe1gb ...
or
qemu-system-x86_64 -cpu Skylake-Server ...
Having this in our QEMU command line means bit 7 is no longer treated as reserved, which solves the bug we encountered at the beginning of this post.