From now on, we will aim to ensure that the test driver
gets tested by OfBorg using all our available tests.
This commit adds the driver timeout test to the driver.
Since the debut of the test driver, we have not had
a timer racing against test execution to ensure that tests don't run beyond
a certain amount of time.
This is particularly important when you run into hanging tests
which cannot be detected by current facilities (detecting them would require more pvpanic wiring, QMP
API work, etc.).
Two easy examples:
- Some QEMU tests may get stuck in some situations and run for more than 24 hours → we default to a 1 hour maximum.
- Some QEMU tests may panic in the wrong place, e.g. UEFI firmware or worse → end users can set a "reasonable" amount of time.
And then, we should let the retry logic retest them until they succeed and adjust
their global timeouts.
Of course, this does not help with the fact that the timeout may need to be
a function of the actual busyness of the machine running the tests.
This is only one step towards increased reliability.
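
As a rough illustration of racing a timer against test execution, here is a minimal sketch. It is not the driver's actual code: `run_with_global_timeout` and its parameters are hypothetical names, and only the 1 hour default comes from the text above.

```python
import threading


def run_with_global_timeout(test_fn, timeout_seconds=3600):
    """Run test_fn in a worker thread and give up after timeout_seconds.

    Hypothetical helper: the name and signature are assumptions made for
    this sketch; only the 1 hour default mirrors the commit message.
    """
    finished = threading.Event()
    errors = []

    def worker():
        try:
            test_fn()
        except BaseException as exc:  # keep the failure to re-raise it below
            errors.append(exc)
        finally:
            finished.set()

    # Daemon thread: a hanging test must not keep the process alive.
    threading.Thread(target=worker, daemon=True).start()

    # Event.wait returns False if the timeout elapsed before the test ended.
    if not finished.wait(timeout_seconds):
        raise TimeoutError(f"test did not finish within {timeout_seconds}s")
    if errors:
        raise errors[0]
```

A hanging test then surfaces as a `TimeoutError` that retry logic can act on, instead of occupying a builder indefinitely.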
Now that we have a QMP client, we can wire it up in the test driver.
For now, it is almost completely useless because it needs a constantly running "event loop", especially
for event listening.
In the next commits, we will gradually enable more and more use cases.
When listening on unix sockets, it doesn't make sense to specify a port
for nginx's listen directive.
Since nginx defaults to port 80 when the port isn't specified (but the
address is), we can change the default for the option to null as well
without changing any behaviour.
This also makes configuration available if you just run those tools locally.
Also use ruff instead of pylint because it's faster and more
comprehensive.
Since 008f9f0cd4
("nixos/test-driver: actually use the backdoor message to wait for backdoor"),
while boot is still in progress, we can get tons of empty strings in response from the shell.
These are not useful to print and waste disk space on any CI system that logs them.
We stop logging chunks whenever they are empty.
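
The filtering amounts to something like the following sketch. `log_nonempty_chunks` and the `log` callback are hypothetical names used for illustration, not the driver's API.

```python
def log_nonempty_chunks(chunks, log):
    """Forward each chunk to log, skipping empty ones.

    Hypothetical helper: `chunks` is any iterable of strings read from
    the shell and `log` is any callable taking one string.
    """
    for chunk in chunks:
        if not chunk:
            # Empty reads carry no information, so do not log them.
            continue
        log(chunk)
```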
While working on #192270, I noticed that only some wait_for_* helper
functions make the timeout configurable. I think we should be able to
customize it in all cases.
New EDK2 sets up the backdoor port as a serial console, which feeds the test driver
a bunch of boot logs it can safely ignore. Do so by waiting for the message the
backdoor shell prints before doing anything else.
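
Conceptually, waiting for the backdoor message could look like this sketch. `skip_until_marker` is a hypothetical helper, and the marker string is passed in by the caller because the exact message is not quoted here.

```python
def skip_until_marker(stream, marker):
    """Read lines from stream until one contains marker.

    Returns the lines that were skipped (e.g. firmware boot logs).
    Hypothetical helper: the name and the marker text are assumptions.
    """
    skipped = []
    while True:
        line = stream.readline()
        if not line:
            # End of stream reached without ever seeing the marker.
            raise EOFError(f"stream closed before {marker!r} was seen")
        if marker in line:
            return skipped
        skipped.append(line)
```

After the function returns, the stream is positioned just past the marker line, so whatever is read next belongs to the shell session itself.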
By some miracle, it was previously possible to reconnect to `node1` without
doing any particular dance.
But now that we are direct booting (¿), it seems like we need to do things properly.
This introduces a `check_output` flag for `execute` because we do not want to steal the
messages from the backdoor service, as we might execute the kexec too early relative
to when we reconnect.
Therefore, we leave the message in the pipe if needed.
- `wait_until_fails` was not passing through its `timeout` argument to
the internal `retry` function, hence was always using 900 seconds (the
default timeout for `retry`) rather than the user-specified value.
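
A simplified sketch of the fix follows. These are stand-ins rather than the driver's real signatures: the `interval` parameter and the boolean-returning callables are assumptions, and only the 900 second default comes from the text above.

```python
import time


def retry(fn, timeout=900, interval=1.0):
    """Call fn until it returns True or timeout seconds have elapsed.

    Simplified stand-in for the driver's `retry`; `interval` is an
    assumption added for this sketch.
    """
    deadline = time.monotonic() + timeout
    while True:
        if fn():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"action timed out after {timeout} seconds")
        time.sleep(interval)


def wait_until_fails(command_succeeds, timeout=900, interval=1.0):
    """Wait until command_succeeds() returns False.

    The point of the fix: `timeout` is forwarded to `retry` instead of
    silently falling back to retry's own default.
    """
    retry(lambda: not command_succeeds(), timeout=timeout, interval=interval)
```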
Previously, `wait_for_console_text` would block indefinitely until there were lines
shown in the buffer.
This is highly annoying when testing for things that can just hang for some reason.
This introduces a classical timeout mechanism via non-blocking get on the Queue.
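
The mechanism can be sketched as below, assuming console output arrives as strings on a `queue.Queue`. `wait_for_text` and `poll_interval` are hypothetical names, not the driver's actual method.

```python
import queue
import re
import time


def wait_for_text(lines, pattern, timeout, poll_interval=0.05):
    """Consume console output from lines until pattern matches.

    Uses a non-blocking get so that an idle console cannot block
    forever. Hypothetical helper written for illustration.
    """
    matcher = re.compile(pattern)
    deadline = time.monotonic() + timeout
    seen = ""
    while time.monotonic() < deadline:
        try:
            seen += lines.get_nowait()
        except queue.Empty:
            # Nothing buffered yet: pause briefly, then re-check the deadline.
            time.sleep(poll_interval)
            continue
        if matcher.search(seen):
            return seen
    raise TimeoutError(f"{pattern!r} not seen within {timeout}s")
```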
This is useful whenever you want to diagnose the current state of UEFI
variables, to assert that bootloaders or boot programs (systemd-stub)
did their job correctly and set their variables accordingly.
In the future, it could also enable inspecting Secure Boot keys.
This warning was added a year and a half ago, but still no test in
NixOS directly instantiates the machine class, presumably because it's
not actually possible for a test to do so without losing
functionality. For example, there's no way for a NixOS test to access
the output directory that create_machine passes to the Machine
constructor.
This warning is therefore just contributing to alert fatigue for
users, who are unable to follow its advice. Once it's actually
possible to do what it suggests, the warning can be reintroduced.
What the code was trying to do was helpfully add a directory and
extension if none were specified, but it did this by checking whether
the filename was composed of a very limited character set that didn't
even include dashes.
With this change, the intention of the code is clearer, and I can put
dashes in my screenshot names.
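
The intended behaviour can be sketched roughly as follows. `resolve_screenshot_path`, the `.png` extension, and the `out_dir` parameter are assumptions made for illustration.

```python
import os


def resolve_screenshot_path(filename, out_dir):
    """Add a default extension and directory only when they are missing.

    Hypothetical helper: ".png" and `out_dir` are assumptions. No
    character-set check is involved, so dashes are fine.
    """
    if "." not in filename:
        filename += ".png"
    if "/" not in filename:
        filename = os.path.join(out_dir, filename)
    return filename
```

Checking for a dot and a slash directly expresses the intent, which is to leave names that already carry an extension or a directory untouched.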