Skip to content

test(sandbox): fix flaky arm64 procfs binary_path tests#881

Merged
pimlock merged 2 commits intomainfrom
fix/procfs-flaky-arm64-tests
Apr 18, 2026
Merged

test(sandbox): fix flaky arm64 procfs binary_path tests#881
pimlock merged 2 commits intomainfrom
fix/procfs-flaky-arm64-tests

Conversation

@pimlock
Copy link
Copy Markdown
Collaborator

@pimlock pimlock commented Apr 17, 2026

Summary

Failing tests:

Three procfs::tests tests flake on arm64 CI runners. Two independent test-side races, each needing its own mitigation:

  1. Fork/exec racebinary_path_strips_deleted_suffix and binary_path_preserves_live_deleted_basename read /proc/<child_pid>/exe immediately after Command::spawn() returns. spawn() returns once the child is scheduled, not once it has completed exec(). Before execve lands, the child inherits the parent's mm->exe_file, so /proc/<child>/exe readlinks to the test harness binary. The target-path assertion then fails with the harness path on the left (left: \"/__w/.../openshell_sandbox-f9dbf131daaba99d\", right: \"/tmp/.../sleepy (deleted)\").
  2. ETXTBSY on spawn — all three tests copy /bin/sleep to a temp path and then spawn it. The kernel rejects execve when inode->i_writecount > 0, and the release of that counter after the writer fd is closed isn't synchronous with close(2) under contention, so the very-next-instruction execve can still race it. This is independent of how the binary is written — CI has surfaced ETXTBSY in every test, not just the one that originally used OpenOptions.

Changes

crates/openshell-sandbox/src/procfs.rs:

  • New wait_for_child_exec(pid, target) helper (test-only, `#[cfg(target_os = "linux")]`). Polls `/proc//exe` on a 10 ms interval with a 2 s deadline until the readlink's byte prefix matches `target`. Byte-level `starts_with` tolerates the kernel's `" (deleted)"` suffix on unlinked binaries and non-UTF-8 filenames.
  • New `spawn_retrying_on_etxtbsy(cmd)` helper (test-only, `#[cfg(target_os = "linux")]`). Retries `Command::spawn()` up to 20 times with 50 ms backoff (~1 s budget) when `err.kind() == ErrorKind::ExecutableFileBusy`. Any other error surfaces immediately.
  • Both helpers applied at every `spawn()` site in all three affected tests.
  • `binary_path_strips_suffix_for_non_utf8_filename` also switches from the `OpenOptions + sync_all` pattern to `std::fs::copy` so all three tests set up their target binary the same way. With the retry in place the write path no longer matters for correctness — this is purely for consistency.

Production `binary_path` behaviour is unchanged.

Related Issue

Blocking CI on #867.

Testing

  • `cargo test -p openshell-sandbox --lib` green locally (macOS; Linux-only tests gated out).
  • `cargo check -p openshell-sandbox --tests` clean.
  • `cargo fmt --all -- --check` clean.
  • arm64 CI green on this PR (verifies the flake is actually addressed, not just papered over locally).

Checklist

  • Conforms to Conventional Commits
  • No new secrets or credentials
  • Scope limited to the test harness

@pimlock pimlock requested a review from a team as a code owner April 17, 2026 23:23
Three `procfs::tests` tests flake on arm64 CI runners. Two root causes,
both test-side and both fixable with smaller changes than a retry loop:

1. Fork/exec race (`binary_path_strips_deleted_suffix`,
   `binary_path_preserves_live_deleted_basename`): `Command::spawn` returns
   once the child is scheduled, not once it has completed `exec()`. On a
   contended runner the immediately-following `/proc/<pid>/exe` readlink
   still returns the parent (test-harness) binary, and the target-path
   assertion fails with the harness path on the left.

2. ETXTBSY on spawn (`binary_path_strips_suffix_for_non_utf8_filename`):
   the test wrote the target binary via `OpenOptions::write(true) + sync_all
   + scope drop` before spawning it. Under load the kernel's release of the
   inode write lock raced `execveat`.

Changes:

- Add a small `wait_for_child_exec(pid, target)` helper that polls
  `/proc/<pid>/exe` for up to 2s until the readlink's byte prefix matches
  `target`. Byte-level `starts_with` tolerates the kernel's `" (deleted)"`
  suffix on unlinked binaries. Applied after every `spawn()` that is
  followed by a `/proc/<pid>/exe` readlink.

- Replace the `OpenOptions + sync_all` dance in
  `binary_path_strips_suffix_for_non_utf8_filename` with `std::fs::copy`,
  matching the other two tests. `std::fs::copy` handles non-UTF-8
  filenames correctly on Unix and doesn't leave a writer fd held across
  the exec boundary, which was the ETXTBSY trigger.

Production `binary_path` behaviour is unchanged.
@pimlock pimlock force-pushed the fix/procfs-flaky-arm64-tests branch from 40cbb57 to db80442 Compare April 17, 2026 23:33
…ests

The ETXTBSY race isn't specific to the `OpenOptions + sync_all` pattern
— it's the kernel's `inode->i_writecount` release not being synchronous
with `close(2)`, so any "write then exec immediately" sequence can race
`execve`. CI has now surfaced the flake in
`binary_path_strips_deleted_suffix` and
`binary_path_preserves_live_deleted_basename` as well, both of which
write via `std::fs::copy`.

Add `spawn_retrying_on_etxtbsy` (20 attempts × 50 ms backoff, matches on
`ErrorKind::ExecutableFileBusy`, any other error panics immediately) and
apply it at every `spawn()` site. `wait_for_child_exec` still covers the
separate post-spawn fork/exec race.

Production `binary_path` is unchanged.
@pimlock pimlock merged commit e39bb38 into main Apr 18, 2026
13 checks passed
@pimlock pimlock deleted the fix/procfs-flaky-arm64-tests branch April 18, 2026 00:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants