The /run/wrapper directory is a tmpfs. Unfortunately, it's mounted with
its root directory has the standard (for tmpfs) mode: 1777 (world writeable,
sticky -- the standard mode of shared temporary directories). This means that
every user can create new files and subdirectories there, but can't
move/delete/rename files that belong to other users.
`rngd` seems to be the root cause for slow boot issues, and its functionality is
redundant since kernel v3.17 (2014), which introduced a `krngd` task (in kernel
space) that takes care of pulling in data from hardware RNGs:
> commit be4000bc4644d027c519b6361f5ae3bbfc52c347
> Author: Torsten Duwe <duwe@lst.de>
> Date: Sat Jun 14 23:46:03 2014 -0400
>
> hwrng: create filler thread
>
> This can be viewed as the in-kernel equivalent of hwrngd;
> like FUSE it is a good thing to have a mechanism in user land,
> but for some reasons (simplicity, secrecy, integrity, speed)
> it may be better to have it in kernel space.
>
> This patch creates a thread once a hwrng registers, and uses
> the previously established add_hwgenerator_randomness() to feed
> its data to the input pool as long as needed. A derating factor
> is used to bias the entropy estimation and to disable this
> mechanism entirely when set to zero.
Closes: #96067
systemd-confinement's automatic package extraction does not work correctly
if ExecStarts ExecReload etc are lists.
Add an extra flatten to make things smooth.
Fixes#96840.
Attempting to reuse keys on a basis different to the cert (AKA,
storing the key in a directory with a hashed name different to
the cert it is associated with) was ineffective since when
"lego run" is used it will ALWAYS generate a new key. This causes
issues when you revert changes since your "reused" key will not
be the one associated with the old cert. As such, I tore out the
whole keyDir implementation.
As for the race condition, checking the mtime of the cert file
was not sufficient to detect changes. In testing, selfsigned
and full certs could be generated/installed within 1 second of
each other. cmp is now used instead.
Also, I removed the nginx/httpd reload waiters in favour of
simple retry logic for the curl-based tests
- Use an acme user and group, allow group override only
- Use hashes to determine when certs actually need to regenerate
- Avoid running lego more than necessary
- Harden permissions
- Support "systemctl clean" for cert regeneration
- Support reuse of keys between some configuration changes
- Permissions fix services solves for previously root owned certs
- Add a note about multiple account creation and emails
- Migrate extraDomains to a list
- Deprecate user option
- Use minica for self-signed certs
- Rewrite all tests
I thought of a few more cases where things may go wrong,
and added tests to cover them. In particular, the web server
reload services were depending on the target - which stays alive,
meaning that the renewal timer wouldn't be triggering a reload
and old certs would stay on the web servers.
I encountered some problems ensuring that the reload took place
without accidently triggering it as part of the test. The sync
commands I added ended up being essential and I'm not sure why,
it seems like either node.succeed ends too early or there's an
oddity of the vm's filesystem I'm not aware of.
- Fix duplicate systemd rules on reload services
Since useACMEHost is not unique to every vhost, if one cert
was reused many times it would create duplicate entries in
${server}-config-reload.service for wants, before and
ConditionPathExists
If the config does not exist, then apparmor_parser will throw a warning.
To avoid that and make the parser configurable, we now add a new option
to it.
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
Since systemd 243, docs were already steering users towards using
`journal`:
eedaf7f322
systemd 246 will go one step further, it shows warnings for these units
during bootup, and will [automatically convert these occurences to
`journal`](f3dc6af20f):
> [ 6.955976] systemd[1]: /nix/store/hwyfgbwg804vmr92fxc1vkmqfq2k9s17-unit-display-manager.service/display-manager.service:27: Standard output type syslog is obsolete, automatically updating to journal. Please update│······················
your unit file, and consider removing the setting altogether.
So there's no point of keeping `syslog` here, and it's probably a better
idea to just not set it, due to:
> This setting defaults to the value set with DefaultStandardOutput= in
> systemd-system.conf(5), which defaults to journal.
In /etc/sudoers, the last-matched rule will override all
previously-matched rules. Thus, make the default rule show up first (but
still allow some wiggle room for a user to `mkBefore` it), before any
user-defined rules.
This patch was done by curro:
The generated /etc/pam.d/* service files invoke the pam_systemd.so
session module before pam_mount.so, if both are enabled (e.g. via
security.pam.services.foo.startSession and
security.pam.services.foo.pamMount respectively).
This doesn't work in the most common scenario where the user's home
directory is stored in a pam-mounted encrypted volume (because systemd
will fail to access the user's systemd configuration).
nixos/modules/config/nsswitch.nix uses `passwdArray` for both `passwd`
and `group`, but when moving this into the google-oslogin module in
4b71b6f8fa, it didn't get split
appropriately.
In /etc/doas.conf, the last-matched rule will override all
previously-matched rules. Thus, make the default rule show up first (but
still allow some wiggle room for a user to `mkBefore` it), before any
user-defined rules.
Systemd ProtectSystem is incompatible with the chroot we make
for confinement. The options is redundant with what we do anyway
so warn if it had been set and advise to disable it.
Merges: https://github.com/NixOS/nixpkgs/pull/87420
`doas` is a lighter alternative to `sudo` that "provide[s] 95% of the
features of `sudo` with a fraction of the codebase" [1]. I prefer it to
`sudo`, so I figured I would add a NixOS module in order for it to be
easier to use. The module is based off of the existing `sudo` module.
[1] https://github.com/Duncaen/OpenDoas
This reverts commit 5532065d06.
As far as I can tell setting RemainAfterExit=true here completely breaks
certificate renewal, which is really bad!
the sytemd timer will activate the service unit every OnCalendar=,
however with RemainAfterExit=true the service is already active! So the
timer doesn't rerun the service!
The commit also broke the actual tests, (As it broke activation too)
but this was fixed later in https://github.com/NixOS/nixpkgs/pull/76052
I wrongly assumed that PR fixed renewal too, which it didn't!
testing renewals is hard, as we need to sleep in tests.
This allows to have multiple certificates with the same common name.
Lego uses in its internal directory the common name to name the certificate.
fixes#84409
Previously, the NixOS ACME module defaulted to using P-384 for
TLS certificates. I believe that this is a mistake, and that we
should use P-256 instead, despite it being theoretically
cryptographically weaker.
The security margin of a 256-bit elliptic curve cipher is substantial;
beyond a certain level, more bits in the key serve more to slow things
down than add meaningful protection. It's much more likely that ECDSA
will be broken entirely, or some fatal flaw will be found in the NIST
curves that makes them all insecure, than that the security margin
will be reduced enough to put P-256 at risk but not P-384. It's also
inconsistent to target a curve with a 192-bit security margin when our
recommended nginx TLS configuration allows 128-bit AES. [This Stack
Exchange answer][pornin] by cryptographer Thomas Pornin conveys the
general attitude among experts:
> Use P-256 to minimize trouble. If you feel that your manhood is
> threatened by using a 256-bit curve where a 384-bit curve is
> available, then use P-384: it will increases your computational and
> network costs (a factor of about 3 for CPU, a few extra dozen bytes
> on the network) but this is likely to be negligible in practice (in a
> SSL-powered Web server, the heavy cost is in "Web", not "SSL").
[pornin]: https://security.stackexchange.com/a/78624
While the NIST curves have many flaws (see [SafeCurves][safecurves]),
P-256 and P-384 are no different in this respect; SafeCurves gives
them the same rating. The only NIST curve Bernstein [thinks better of,
P-521][bernstein] (see "Other standard primes"), isn't usable for Web
PKI (it's [not supported by BoringSSL by default][boringssl] and hence
[doesn't work in Chromium/Chrome][chromium], and Let's Encrypt [don't
support it either][letsencrypt]).
[safecurves]: https://safecurves.cr.yp.to/
[bernstein]: https://blog.cr.yp.to/20140323-ecdsa.html
[boringssl]: https://boringssl.googlesource.com/boringssl/+/e9fc3e547e557492316932b62881c3386973ceb2
[chromium]: https://bugs.chromium.org/p/chromium/issues/detail?id=478225
[letsencrypt]: https://letsencrypt.org/docs/integration-guide/#supported-key-algorithms
So there's no real benefit to using P-384; what's the cost? In the
Stack Exchange answer I linked, Pornin estimates a factor of 3×
CPU usage, which wouldn't be so bad; unfortunately, this is wildly
optimistic in practice, as P-256 is much more common and therefore
much better optimized. [This GitHub comment][openssl] measures the
performance differential for raw Diffie-Hellman operations with OpenSSL
1.1.1 at a whopping 14× (even P-521 fares better!); [Caddy disables
P-384 by default][caddy] due to Go's [lack of accelerated assembly
implementations][crypto/elliptic] for it, and the difference there seems
even more extreme: [this golang-nuts post][golang-nuts] measures the key
generation performance differential at 275×. It's unlikely to be the
bottleneck for anyone, but I still feel kind of bad for anyone having
lego generate hundreds of certificates and sign challenges with them
with performance like that...
[openssl]: https://github.com/mozilla/server-side-tls/issues/190#issuecomment-421831599
[caddy]: 2cab475ba5/modules/caddytls/values.go (L113-L124)
[crypto/elliptic]: 2910c5b4a0/src/crypto/elliptic
[golang-nuts]: https://groups.google.com/forum/#!topic/golang-nuts/nlnJkBMMyzk
In conclusion, there's no real reason to use P-384 in general: if you
don't care about Web PKI compatibility and want to use a nicer curve,
then Ed25519 or P-521 are better options; if you're a NIST-fearing
paranoiac, you should use good old RSA; but if you're a normal person
running a web server, then you're best served by just using P-256. Right
now, NixOS makes an arbitrary decision between two equally-mediocre
curves that just so happens to slow down ECDH key agreement for every
TLS connection by over an order of magnitude; this commit fixes that.
Unfortunately, it seems like existing P-384 certificates won't get
migrated automatically on renewal without manual intervention, but
that's a more general problem with the existing ACME module (see #81634;
I know @yegortimoshenko is working on this). To migrate your
certificates manually, run:
$ sudo find /var/lib/acme/.lego/certificates -type f -delete
$ sudo find /var/lib/acme -name '*.pem' -delete
$ sudo systemctl restart 'acme-*.service' nginx.service
(No warranty. If it breaks, you get to keep both pieces. But it worked
for me.)
The current weekly setting causes every NixOS server to try to renew
its certificate at midnight on the dot on Monday. This contributes to
the general problem of periodic load spikes for Let's Encrypt; NixOS
is probably not a major contributor to that problem, but we can lead by
example by picking good defaults here.
The values here were chosen after consulting with @yuriks, an SRE at
Let's Encrypt:
* Randomize the time certificates are renewed within a 24 hour period.
* Check for renewal every 24 hours, to ensure the certificate is always
renewed before an expiry notice is sent out.
* Increase the AccuracySec (thus lowering the accuracy(!)), so that
systemd can coalesce the renewal with other timers being run.
(You might be worried that this would defeat the purpose of the time
skewing, but systemd is documented as avoiding this by picking a
random time.)
lego already bundles the chain with the certificate,[1] so the current
code, designed for simp_le, was resulting in duplicate certificate
chains, manifesting as "Chain issues: Incorrect order, Extra certs" on
the Qualys SSL Server Test.
cert.pem stays around as a symlink for backwards compatibility.
[1] 5cdc0002e9/acme/api/certificate.go (L40-L44)
Lego allows users to use the DNS-01 challenge to validate their
certificates. It is mostly backwards compatible, with a few
caveats.
- extraDomains can no longer have different webroots to the
main webroot for the cert.
- An email address is now mandatory for account creation
The following other changes were required:
- Deprecate security.acme.certs.<name>.plugins, as this was
specific to simp-le
- Rename security.acme.validMin to validMinDays, to avoid
confusion and errors. Lego requires the TTL to be specified in
days
- Add options to cover DNS challenge (dnsProvider,
credentialsFile, dnsPropagationCheck)
- A shared state directory is now used (/var/lib/acme/.lego)
to avoid account creation rate limits and share credentials
between certs
In 5532065d06, acme was changed to be
RemainAfterExit=true, but `postRun` commands are implemented as
`ExecStopPost`. Systemd now considers the service to be still running
after simp_le is finished, so won't run these commands (e.g. to reload
certificates in a webserver). Change `postRun` to use `ExecStartPost` to
ensure the commands are run in a timely manner.
A centralized list for these renames is not good because:
- It breaks disabledModules for modules that have a rename defined
- Adding/removing renames for a module means having to find them in the
central file
- Merge conflicts due to multiple people editing the central file
Fixes https://github.com/NixOS/nixpkgs/issues/75075.
To summarize the report in the aforementioned issue, at a glance,
it's a different default than what upstream polkit has. Apparently
for 8+ years polkit defaults admin identities as members of
the wheel group [0]. This assumption would be appropriate on NixOS, where
every member of group 'wheel' is necessarily privileged.
[0]: 763faf434b
Change order of pam_mount.conf.xml so that users can override the preset configs.
My use case is to mount a gocryptfs (a fuse program) volume. I can not do that in current order.
Because even if I change the `<fusermount>` and `<fuserumount>` by add below to extraVolumes
```
<fusemount>${pkgs.fuse}/bin/mount.fuse %(VOLUME) %(MNTPT) "%(before=\"-o \" OPTIONS)"</fusemount>
<fuseumount>${pkgs.fuse}/bin/fusermount -u %(MNTPT)</fuseumount>
```
mount.fuse still does not work because it can not find `fusermount`. pam_mount will told stat /bin/fusermount failed.
Fine, I can add a `<path>` section to extraVolumes
```
<path>${pkgs.fuse}/bin:${pkgs.coreutils}/bin:${pkgs.utillinux}/bin</path>
```
but then the `<path>` section is overridden by the hardcoded `<path>${pkgs.utillinux}/bin</path>` below. So it still does not work.
Add a new option permitting to point certbot to an ACME Directory
Resource URI other than Let's Encrypt production/staging one.
In the meantime, we are deprecating the now useless Let's Encrypt
production flag.
Previously setting `allowKeysForGroup = true; group = "foo"` would not
apply the group permission change of the certificates until the service
gets restarted. This commit fixes this by making systemd restart the
service every time it changes.
Note that applying this commit to a system with an already running acme
systemd service doesn't fix this immediately and you still need to wait
for the next refresh (or call `systemctl restart acme-<domain>`). Once
everybody's service has restarted once this should be a problem of the
past.
Let's encrypt bumped ACME to V2. We need to update our nixos test to
be compatible with this new protocol version.
We decided to drop the Boulder ACME server in favor of the more
integration test friendly Pebble.
- overriding cacert not necessary
- this avoids rebuilding lots of packages needlessly
- nixos/tests/acme: use pebble's ca for client tests
- pebble always generates its own ca which has to be fetched
TODO: write proper commit msg :)
Updating:
- nixos module to use the new `account_reg.json` file.
- use nixpkgs pebble for integration tests.
Co-authored-by: Florian Klink <flokli@flokli.de>
Replace certbot-embedded pebble
In #68792 it was discovered that /dev/fuse doesn't have
wordl-read-writeable permissions anymore. The cause of this is that the
tmpfiles examples in systemd were reorganized and split into more files.
We thus lost some of the configuration we were depending on.
In this commit some of the new tmpfiles configuration that are
applicable to us are added which also makes wtmp/lastlog in the pam
module not necessary anymore.
Rationale for the new tmpfile configs:
- `journal-nowcow.conf`: Contains chattr +C for journald logs which
makes sense on copy-on-write filesystems like Btrfs. Other filesystems
shouldn't do anything funny when that flag is set.
- `static-nodes-permissions.conf`: Contains some permission overrides
for some device nodes like audio, loop, tun, fuse and kvm.
- `systemd-nspawn.conf`: Makes sure `/var/lib/machines` exists and old
snapshots are properly removed.
- `systemd-tmp.conf`: Removes systemd services related private tmp
folders and temporary coredump files.
- `var.conf`: Creates some useful directories in `/var` which we would
create anyway at some point. Also includes
`/var/log/{wtmp,btmp,lastlog}`.
Fixes#68792.
* nixos/acme: Fix ordering of cert requests
When subsequent certificates would be added, they would
not wake up nginx correctly due to target units only being triggered
once. We now added more fine-grained systemd dependencies to make sure
nginx always is aware of new certificates and doesn't restart too early
resulting in a crash.
Furthermore, the acme module has been refactored. Mostly to get
rid of the deprecated PermissionStartOnly systemd options which were
deprecated. Below is a summary of changes made.
* Use SERVICE_RESULT to determine status
This was added in systemd v232. we don't have to keep track
of the EXITCODE ourselves anymore.
* Add regression test for requesting mutliple domains
* Deprecate 'directory' option
We now use systemd's StateDirectory option to manage
create and permissions of the acme state directory.
* The webroot is created using a systemd.tmpfiles.rules rule
instead of the preStart script.
* Depend on certs directly
By getting rid of the target units, we make sure ordering
is correct in the case that you add new certs after already
having deployed some.
Reason it broke before: acme-certificates.target would
be in active state, and if you then add a new cert, it
would still be active and hence nginx would restart
without even requesting a new cert. Not good! We
make the dependencies more fine-grained now. this should fix that
* Remove activationDelay option
It complicated the code a lot, and is rather arbitrary. What if
your activation script takes more than activationDelay seconds?
Instead, one should use systemd dependencies to make sure some
action happens before setting the certificate live.
e.g. If you want to wait until your cert is published in DNS DANE /
TLSA, you could create a unit that blocks until it appears in DNS:
```
RequiredBy=acme-${cert}.service
After=acme-${cert}.service
ExecStart=publish-wait-for-dns-script
```
Currently if you want to properly chroot a systemd service, you could do
it using BindReadOnlyPaths=/nix/store or use a separate derivation which
gathers the runtime closure of the service you want to chroot. The
former is the easier method and there is also a method directly offered
by systemd, called ProtectSystem, which still leaves the whole store
accessible. The latter however is a bit more involved, because you need
to bind-mount each store path of the runtime closure of the service you
want to chroot.
This can be achieved using pkgs.closureInfo and a small derivation that
packs everything into a systemd unit, which later can be added to
systemd.packages.
However, this process is a bit tedious, so the changes here implement
this in a more generic way.
Now if you want to chroot a systemd service, all you need to do is:
{
systemd.services.myservice = {
description = "My Shiny Service";
wantedBy = [ "multi-user.target" ];
confinement.enable = true;
serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice";
};
}
If more than the dependencies for the ExecStart* and ExecStop* (which
btw. also includes script and {pre,post}Start) need to be in the chroot,
it can be specified using the confinement.packages option. By default
(which uses the full-apivfs confinement mode), a user namespace is set
up as well and /proc, /sys and /dev are mounted appropriately.
In addition - and by default - a /bin/sh executable is provided, which
is useful for most programs that use the system() C library call to
execute commands via shell.
Unfortunately, there are a few limitations at the moment. The first
being that DynamicUser doesn't work in conjunction with tmpfs, because
systemd seems to ignore the TemporaryFileSystem option if DynamicUser is
enabled. I started implementing a workaround to do this, but I decided
to not include it as part of this pull request, because it needs a lot
more testing to ensure it's consistent with the behaviour without
DynamicUser.
The second limitation/issue is that RootDirectoryStartOnly doesn't work
right now, because it only affects the RootDirectory option and doesn't
include/exclude the individual bind mounts or the tmpfs.
A quirk we do have right now is that systemd tries to create a /usr
directory within the chroot, which subsequently fails. Fortunately, this
is just an ugly error and not a hard failure.
The changes also come with a changelog entry for NixOS 19.03, which is
why I asked for a vote of the NixOS 19.03 stable maintainers whether to
include it (I admit it's a bit late a few days before official release,
sorry for that):
@samueldr:
Via pull request comment[1]:
+1 for backporting as this only enhances the feature set of nixos,
and does not (at a glance) change existing behaviours.
Via IRC:
new feature: -1, tests +1, we're at zero, self-contained, with no
global effects without actively using it, +1, I think it's good
@lheckemann:
Via pull request comment[2]:
I'm neutral on backporting. On the one hand, as @samueldr says,
this doesn't change any existing functionality. On the other hand,
it's a new feature and we're well past the feature freeze, which
AFAIU is intended so that new, potentially buggy features aren't
introduced in the "stabilisation period". It is a cool feature
though? :)
A few other people on IRC didn't have opposition either against late
inclusion into NixOS 19.03:
@edolstra: "I'm not against it"
@Infinisil: "+1 from me as well"
@grahamc: "IMO its up to the RMs"
So that makes +1 from @samueldr, 0 from @lheckemann, 0 from @edolstra
and +1 from @Infinisil (even though he's not a release manager) and no
opposition from anyone, which is the reason why I'm merging this right
now.
I also would like to thank @Infinisil, @edolstra and @danbst for their
reviews.
[1]: https://github.com/NixOS/nixpkgs/pull/57519#issuecomment-477322127
[2]: https://github.com/NixOS/nixpkgs/pull/57519#issuecomment-477548395
So far we had MountFlags = "private", but as @Infinisil has correctly
noticed, there is a dedicated PrivateMounts option, which does exactly
that and is better integrated than providing raw mount flags.
When checking for the reason why I used MountFlags instead of
PrivateMounts, I found that at the time I wrote the initial version of
this module (Mar 12 06:15:58 2018 +0100) the PrivateMounts option didn't
exist yet and has been added to systemd in Jun 13 08:20:18 2018 +0200.
Signed-off-by: aszlig <aszlig@nix.build>
Noted by @Infinisil on IRC:
infinisil: Question regarding the confinement PR
infinisil: On line 136 you do different things depending on
RootDirectoryStartOnly
infinisil: But on line 157 you have an assertion that disallows that
option being true
infinisil: Is there a reason behind this or am I missing something
I originally left this in so that once systemd supports that, we can
just flip a switch and remove the assertion and thus support
RootDirectoryStartOnly for our confinement module.
However, this doesn't seem to be on the roadmap for systemd in the
foreseeable future, so I'll just remove this, especially because it's
very easy to add it again, once it is supported.
Signed-off-by: aszlig <aszlig@nix.build>
seems that this got broken when the config option was made to use enums. "secure" got replaced with "enum", which isn't a valid option for the failure mode.
My implementation was relying on PrivateDevices, PrivateTmp,
PrivateUsers and others to be false by default if chroot-only mode is
used.
However there is an ongoing effort[1] to change these defaults, which
then will actually increase the attack surface in chroot-only mode,
because it is expected that there is no /dev, /sys or /proc.
If for example PrivateDevices is enabled by default, there suddenly will
be a mounted /dev in the chroot and we wouldn't detect it.
Fortunately, our tests cover that, but I'm preparing for this anyway so
that we have a smoother transition without the need to fix our
implementation again.
Thanks to @Infinisil for the heads-up.
[1]: https://github.com/NixOS/nixpkgs/issues/14645
Signed-off-by: aszlig <aszlig@nix.build>
From @edolstra at [1]:
BTW we probably should take the closure of the whole unit rather than
just the exec commands, to handle things like Environment variables.
With this commit, there is now a "fullUnit" option, which can be enabled
to include the full closure of the service unit into the chroot.
However, I did not enable this by default, because I do disagree here
and *especially* things like environment variables or environment files
shouldn't be in the closure of the chroot.
For example if you have something like:
{ pkgs, ... }:
{
systemd.services.foobar = {
serviceConfig.EnvironmentFile = ${pkgs.writeText "secrets" ''
user=admin
password=abcdefg
'';
};
}
We really do not want the *file* to end up in the chroot, but rather
just the environment variables to be exported.
Another thing is that this makes it less predictable what actually will
end up in the chroot, because we have a "globalEnvironment" option that
will get merged in as well, so users adding stuff to that option will
also make it available in confined units.
I also added a big fat warning about that in the description of the
fullUnit option.
[1]: https://github.com/NixOS/nixpkgs/pull/57519#issuecomment-472855704
Signed-off-by: aszlig <aszlig@nix.build>
Another thing requested by @edolstra in [1]:
We should not provide a different /bin/sh in the chroot, that's just
asking for confusion and random shell script breakage. It should be
the same shell (i.e. bash) as in a regular environment.
While I personally would even go as far to even have a very restricted
shell that is not even a shell and basically *only* allows "/bin/sh -c"
with only *very* minimal parsing of shell syntax, I do agree that people
expect /bin/sh to be bash (or the one configured by environment.binsh)
on NixOS.
So this should make both others and me happy in that I could just use
confinement.binSh = "${pkgs.dash}/bin/dash" for the services I confine.
[1]: https://github.com/NixOS/nixpkgs/pull/57519#issuecomment-472855704
Signed-off-by: aszlig <aszlig@nix.build>
Quoting @edolstra from [1]:
I don't really like the name "chroot", something like "confine[ment]"
or "restrict" seems better. Conceptually we're not providing a
completely different filesystem tree but a restricted view of the same
tree.
I already used "confinement" as a sub-option and I do agree that
"chroot" sounds a bit too specific (especially because not *only* chroot
is involved).
So this changes the module name and its option to use "confinement"
instead of "chroot" and also renames the "chroot.confinement" to
"confinement.mode".
[1]: https://github.com/NixOS/nixpkgs/pull/57519#issuecomment-472855704
Signed-off-by: aszlig <aszlig@nix.build>
Currently, if you want to properly chroot a systemd service, you could
do it using BindReadOnlyPaths=/nix/store (which is not what I'd call
"properly", because the whole store is still accessible) or use a
separate derivation that gathers the runtime closure of the service you
want to chroot. The former is the easier method and there is also a
method directly offered by systemd, called ProtectSystem, which still
leaves the whole store accessible. The latter however is a bit more
involved, because you need to bind-mount each store path of the runtime
closure of the service you want to chroot.
This can be achieved using pkgs.closureInfo and a small derivation that
packs everything into a systemd unit, which later can be added to
systemd.packages. That's also what I did several times[1][2] in the
past.
However, this process got a bit tedious, so I decided that it would be
generally useful for NixOS, so this very implementation was born.
Now if you want to chroot a systemd service, all you need to do is:
{
systemd.services.yourservice = {
description = "My Shiny Service";
wantedBy = [ "multi-user.target" ];
chroot.enable = true;
serviceConfig.ExecStart = "${pkgs.myservice}/bin/myservice";
};
}
If more than the dependencies for the ExecStart* and ExecStop* (which
btw. also includes "script" and {pre,post}Start) need to be in the
chroot, it can be specified using the chroot.packages option. By
default (which uses the "full-apivfs"[3] confinement mode), a user
namespace is set up as well and /proc, /sys and /dev are mounted
appropriately.
In addition - and by default - a /bin/sh executable is provided as well,
which is useful for most programs that use the system() C library call
to execute commands via shell. The shell providing /bin/sh is dash
instead of the default in NixOS (which is bash), because it's way more
lightweight and after all we're chrooting because we want to lower the
attack surface and it should be only used for "/bin/sh -c something".
Prior to submitting this here, I did a first implementation of this
outside[4] of nixpkgs, which duplicated the "pathSafeName" functionality
from systemd-lib.nix, just because it's only a single line.
However, I decided to just re-use the one from systemd here and
subsequently made it available when importing systemd-lib.nix, so that
the systemd-chroot implementation also benefits from fixes to that
functionality (which is now a proper function).
Unfortunately, we do have a few limitations as well. The first being
that DynamicUser doesn't work in conjunction with tmpfs, because it
already sets up a tmpfs in a different path and simply ignores the one
we define. We could probably solve this by detecting it and try to
bind-mount our paths to that different path whenever DynamicUser is
enabled.
The second limitation/issue is that RootDirectoryStartOnly doesn't work
right now, because it only affects the RootDirectory option and not the
individual bind mounts or our tmpfs. It would be helpful if systemd
would have a way to disable specific bind mounts as well or at least
have some way to ignore failures for the bind mounts/tmpfs setup.
Another quirk we do have right now is that systemd tries to create a
/usr directory within the chroot, which subsequently fails. Fortunately,
this is just an ugly error and not a hard failure.
[1]: https://github.com/headcounter/shabitica/blob/3bb01728a0237ad5e7/default.nix#L43-L62
[2]: https://github.com/aszlig/avonc/blob/dedf29e092481a33dc/nextcloud.nix#L103-L124
[3]: The reason this is called "full-apivfs" instead of just "full" is
to make room for a *real* "full" confinement mode, which is more
restrictive even.
[4]: https://github.com/aszlig/avonc/blob/92a20bece4df54625e/systemd-chroot.nix
Signed-off-by: aszlig <aszlig@nix.build>
For the hardened profile disable symmetric multi threading. There seems to be
no *proven* method of exploiting cache sharing between threads on the same CPU
core, so this may be considered quite paranoid, considering the perf cost.
SMT can be controlled at runtime, however. This is in keeping with OpenBSD
defaults.
TODO: since SMT is left to be controlled at runtime, changing the option
definition should take effect on system activation. Write to
/sys/devices/system/cpu/smt/control