sysctl drop_caches is write-only since Linux 5.5

TL;DR

terry@n54l:~$ cat /proc/sys/vm/drop_caches
cat: /proc/sys/vm/drop_caches: Permission denied

terry@n54l:~$ sudo -i
root@n54l:~# cat /proc/sys/vm/drop_caches
cat: /proc/sys/vm/drop_caches: Permission denied
root@n54l:~# whoami
root

# WTF?

Since Linux 5.5, drop_caches has been write-only (mode bits 0200), so even root cannot read it. The change was made to avoid confusion when operating at scale.

From Kernel Newbies Linux 5.5: make drop_caches sysctl write-only commit


From the git commit: kernel: sysctl: make drop_caches write-only

The drop_caches mode bits changed from 0644 to 0200, which means write-only for the owner (root).
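
To make the two mode values concrete, here is a minimal sketch that decodes the owner bits of an octal mode. The helper name owner_perms is my own invention for illustration, not part of any tool.

```shell
# Decode the owner permission bits of an octal file mode.
# owner_perms is a hypothetical helper name used only for this sketch.
owner_perms() {
    # Strip one leading zero (if any), then re-add it so the shell
    # parses the value as octal: "0644" and "644" both become 0644.
    mode=$(( 0${1#0} ))
    # The owner bits are the third octal digit from the right.
    owner=$(( (mode >> 6) & 7 ))
    perms=""
    if [ $(( owner & 4 )) -ne 0 ]; then perms="${perms}r"; else perms="${perms}-"; fi
    if [ $(( owner & 2 )) -ne 0 ]; then perms="${perms}w"; else perms="${perms}-"; fi
    if [ $(( owner & 1 )) -ne 0 ]; then perms="${perms}x"; else perms="${perms}-"; fi
    printf '%s\n' "$perms"
}

owner_perms 0644   # old mode: prints rw-
owner_perms 0200   # new mode: prints -w-

# On a live system, GNU stat can show the real mode without reading the file:
#   stat -c '%a %A' /proc/sys/vm/drop_caches
```

With 0200 the read bit is gone for everyone, which matches the "Permission denied" that root gets in the transcript above.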

Justification

Currently, the drop_caches proc file and sysctl read back the last value written, suggesting this is somehow a stateful setting instead of a one-time command. Make it write-only, like e.g compact_memory.

It makes sense: drop_caches is a one-off, stateless command. A value that reads back really is confusing when operating at scale; I’ve been in that boat before (many times).

The author explained a bit more, drawing on real-world experience:

While mitigating a VM problem at scale in our fleet, there was confusion about whether writing to this file will permanently switch the kernel into a non-caching mode.

This influences the decision making in a tense situation, where tens of people are trying to fix tens of thousands of affected machines: Do we need a rollback strategy? What are the performance implications of operating in a non-caching state for several days?

It also caused confusion when the kernel team said we may need to write the file several times to make sure it's effective ("But it already reads back 3?").
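
Treating the file as a one-shot trigger can be sketched as a small wrapper. This is only an illustration: drop_caches_once is a hypothetical name, and the optional second argument is my own addition so the function can be pointed at a scratch file instead of the real sysctl. Per the kernel's vm sysctl documentation, 1 frees the page cache, 2 frees reclaimable slab objects (dentries and inodes), and 3 frees both. The wrapper only accepts those three values, which is a choice made for the sketch.

```shell
# drop_caches_once is a hypothetical wrapper, not an existing command.
# $1: 1 = page cache, 2 = dentries and inodes, 3 = both
# $2: optional target file (an assumption of this sketch, handy for dry runs)
drop_caches_once() {
    level=$1
    target=${2:-/proc/sys/vm/drop_caches}
    case $level in
        1|2|3) ;;
        *) echo "usage: drop_caches_once 1|2|3 [file]" >&2; return 2 ;;
    esac
    # Dirty pages cannot be dropped, so flush them to disk first.
    sync
    # A single write triggers the drop; nothing is stored, nothing to roll back.
    printf '%s\n' "$level" > "$target"
}

# Real use needs root, e.g.:
#   sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```

Running it again later simply drops whatever has been cached since the previous write, which is why repeating the write can be useful.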

Another sysctl syscall fun fact

I came across this in the Linux 5.5 (https://kernelnewbies.org/Linux_5.5) changelog:

Remove the sysctl system call (deprecated a long time ago) commit

This system call has been deprecated almost since it was introduced.

In a survey of the linux distributions I can no longer find any of them that enable CONFIG_SYSCTL_SYSCALL. The only indication that I can find that anyone might care is that a few of the defconfigs in the kernel enable CONFIG_SYSCTL_SYSCALL. However this appears in only 31 of 414 defconfigs in the kernel, so I suspect this symbols presence is simply because it is harmless to include rather than because it is necessary.

As there appear to be no users of the sysctl system call, remove the code. As this removes one of the few uses of the internal kernel mount of proc I hope this allows for even more simplifications of the proc filesystem.

I decided to validate this on the distributions I use daily. As you can see below, Arch Linux and Fedora were fine, but Ubuntu, hmm… ;-)

# fedora 31
PRETTY_NAME="Fedora 31 (Thirty One)"
root@n54l:/boot# grep CONFIG_SYSCTL_SYSCALL config-$(uname -r)

# arch
terry@netbook:~$ grep PRETTY_NAME /etc/os-release
PRETTY_NAME="Arch Linux"
terry@netbook:~$ zcat /proc/config.gz | grep CONFIG_SYSCTL_SYSCALL

# ubuntu 18.04
$ grep PRETTY_NAME /etc/os-release
PRETTY_NAME="Ubuntu 18.04.4 LTS"
$ grep CONFIG_SYSCTL_SYSCALL /boot/config-$(uname -r)
CONFIG_SYSCTL_SYSCALL=y
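
The per-distro checks above can be folded into one helper. A rough sketch, assuming the config lives either in /boot/config-$(uname -r) or in /proc/config.gz, as it did on the machines above; kconfig_lookup is a made-up name, and the optional second argument (an explicit config file) is my own addition.

```shell
# kconfig_lookup is a hypothetical helper name used only for this sketch.
# $1: config symbol, e.g. CONFIG_SYSCTL_SYSCALL
# $2: optional path to a config file (defaults to the running kernel's)
kconfig_lookup() {
    symbol=$1
    config=${2:-/boot/config-$(uname -r)}
    # Match "SYMBOL=..." and "# SYMBOL is not set", but not longer
    # symbols that merely start with the same name.
    pattern="^(# )?${symbol}( |=)"
    if [ -r "$config" ]; then
        grep -E "$pattern" "$config"
    elif [ -r /proc/config.gz ]; then
        zcat /proc/config.gz | grep -E "$pattern"
    else
        echo "kconfig_lookup: no readable kernel config found" >&2
        return 2
    fi
}

# Example call:
#   kconfig_lookup CONFIG_SYSCTL_SYSCALL
```

No output and exit status 1 means the symbol does not appear in the config at all, which is what the Fedora and Arch checks above showed.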

EOF