Last week at the office, somebody came up to me to ask if I could figure out why it was no longer possible to log into one of the servers. This server has a history of flakiness: there's probably a bad memory module on the board, and sometimes it becomes unresponsive. So, my co-worker, upon realizing that he couldn't log in, had rebooted the computer. However, even after the reboot, he still couldn't log in, either as his regular user through SSH, or as root on the console.
The first step, before getting out of my chair, was to telnet to port 22 on the box. I got a "connected" message, and a text string indicating that I was attached to an SSH daemon. This told me that the kernel was alive: it was accepting new connections and passing them to the appropriate processes, which were themselves able to make forward progress. So, the box wasn't wedged. I went to the console, and tried to log in through the getty running on the text login screen. I entered 'root' at the username prompt, and got a password prompt. When I entered the password and pressed ENTER, the getty process froze, and did not present me with a shell.
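The same "is anything answering on port 22?" check can be scripted. What follows is only a minimal sketch, not what I ran that day: the function name read_banner and its defaults are made up for illustration. It connects, reads whatever greeting the server volunteers, and gives up after a timeout.

```python
import socket


def read_banner(host, port=22, timeout=5.0):
    """Connect to host:port and return the first line the server sends.

    Returns None if the connection fails, times out, or the server
    closes the connection without sending anything.
    """
    try:
        # create_connection applies the timeout to the connect and to recv.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            data = sock.recv(256)
    except OSError:
        # Refused, unreachable, or timed out: the box may really be wedged.
        return None
    if not data:
        return None
    first_line = data.split(b"\n", 1)[0]
    return first_line.decode("ascii", errors="replace").strip()
```

A greeting that starts with "SSH-" means the kernel accepted the connection and the daemon got far enough to talk, which is all the telnet test established. It says nothing about whether a login will succeed afterwards.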
So, we had two very different authentication paths that were both failing to allow logins: SSH wouldn't let a regular user in, and the console wouldn't let root in. Something seemed to be interfering with the general activity of authentication. The first thought is that this might be a PAM problem, but it would be a strange one. We didn't get authentication failure messages; we got a hang after authentication. Root's credentials were stored on the local drive, so it wasn't an LDAP issue, and in any case, the machine was on the network, and there weren't LDAP problems anywhere else in the office.
When multiple independent programs fail together, the next thought is that there's probably a full disk somewhere. If you fill up /tmp, your system can start to behave very strangely. The login problems were a symptom of something, as yet unknown. So, the next thing to do was to check the hard drive to see if we had any full partitions. Because I didn't know what else might be misbehaving, I wanted to avoid all of the startup jobs, so I rebooted the machine with an appended kernel parameter, "init=/bin/bash". Instead of running the usual /sbin/init, and all of the various scripts under /etc/init.d, the computer would start up the kernel and then drop immediately to a root shell. No logins, no passwords, no startup scripts. I could then run 'df' at the prompt, and confirm that there were no partitions within 5% of full (remember that a default ext2 format will reserve 5% of the blocks for root, so a disk that's 96% full could actually be entirely full for ordinary users). Checking with 'df -i' showed that we had not run out of inodes either.
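The difference between "full for root" and "full for everybody else" is visible in the numbers the kernel reports. Here is a rough sketch of the same check using os.statvfs; the function name usage_report and the dictionary keys are my own invention. It keeps the two views of free space apart and includes the inode count that 'df -i' shows.

```python
import os


def usage_report(path):
    """Return block and inode usage percentages for the filesystem holding path."""
    st = os.statvfs(path)
    used = st.f_blocks - st.f_bfree
    # f_bfree counts every free block, including the root reserve.
    # f_bavail counts only the blocks an unprivileged user may allocate.
    user_total = used + st.f_bavail
    inodes_used = st.f_files - st.f_ffree
    return {
        "pct_used_for_users": 100.0 * used / user_total if user_total else 0.0,
        "pct_used_for_root": 100.0 * used / st.f_blocks if st.f_blocks else 0.0,
        # Some filesystems report zero total inodes; treat that as "not applicable".
        "pct_inodes_used": 100.0 * inodes_used / st.f_files if st.f_files else 0.0,
    }
```

Calling usage_report("/tmp") on a partition with a root reserve should give a higher figure for ordinary users than for root, which is why a disk that doesn't look quite full can still refuse writes.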
So, what's next? I decided to reboot the machine into single-user mode so that I could easily modify files on the disk but still get onto the computer without a password. This is done by appending the parameter "S" on the kernel boot line. Again, I get a shell, but this time the disks are mounted read-write, and various services have started up. So, I modify the inittab, replacing the getty on tty1 with /bin/bash. That means that when the computer is rebooted into multi-user mode, tty1 has a root shell while the other ttys are still running their gettys.
Reboot into the usual multi-user mode. I have a root shell on tty1. I run "ps ax", and find the PID of the getty on tty2. Then, I run the command
strace -f -p <PID>
at the shell prompt of tty1. Changing virtual consoles to tty2 with the usual command, CTRL-ALT-F2, I am presented with a login prompt. I enter the username 'root', and enter the password. The program hangs. So, I change back to tty1 to see what strace has to say about the program. The last things the program did are on the screen. It opened a device called /dev/audit, did some things with it, then issued an ioctl() on the file descriptor. That ioctl was not returning to the caller, so the program was blocked, waiting for a response from something associated with /dev/audit.
None of us had heard of /dev/audit, so it was time to do a bit of research. It turned out to belong to a package that was included in the RHEL distribution installed on that computer. The device communicates with a daemon. That daemon keeps logs, so I went to its logging directory to see what was there. I found 4 GB of data there. Apparently that had reached some sort of internal limit, and the daemon responded by forbidding further auditable actions until some of the logs were removed by the administrator. Logins, being auditable actions, were blocked.
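Had anything been watching the size of that directory, we would have had some warning. As a hedged sketch only: the function below totals the regular files under a directory. The name directory_size is hypothetical, and the path to pass in is whatever your audit daemon is configured to use; I am not assuming a particular location or a particular limit here.

```python
import os
import stat


def directory_size(path):
    """Total size in bytes of the regular files under path (symlinks not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                st = os.lstat(os.path.join(root, name))
            except OSError:
                # The file vanished (for example, it was rotated) mid-walk.
                continue
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total
```

Run from cron against the log directory and compared with a threshold of your choosing, something like this would flag the problem long before logins stopped working.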
So, I deleted all of the logs in the directory and rebooted the computer. Everything returned to normal.
Now, a logging function like this is very useful for some users. There are some people who must know exactly who logged into the machine, what database entries they accessed or modified, and so on. We are not such people. A service we never knew about, enabled for all because it is useful to some, wound up locking us out of our own machine.
It's a security feature that logins are forbidden until the logs have been inspected and removed. If you're going to design a function like this, then this is the correct way to go about it. Of course, it was very easy for me to overcome this security feature with access to the console, but that's generally true. I probably would have set it up so that gettys are permitted to log in as root even when an audit failure occurs, but that level of flexibility may not be available, if the behaviour is driven by a special PAM module or a patched glibc.