Soon my users started having trouble logging in via SSH. One of two things would happen:
- User would immediately get the error "ssh_exchange_identification: Connection closed by remote host"
- User would have to wait ~20 seconds to get a password prompt. After entering a password, another ~20 seconds would pass before getting the shell prompt.
I tried a lot of different things to find the problem. Rebooting the system resolved the issue, but only temporarily. Complicating matters, only a couple of systems were affected, not all of them, so at first I ruled out a problem with the system configuration. Instead I chased a red herring: a certain application we were running on *only* those systems.
Another symptom of the issue was that "su" would also pause for ~20 seconds before presenting the password prompt, so I knew it wasn't limited to sshd.
Eventually I found the problem by doing the following:
    # strace su -
    [...]
    ### The output would pause here:
    connect(3, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
    sendto(3, "<86>Jul 19 16:36:59 su: pam_unix"..., 96, MSG_NOSIGNAL, NULL, 0) =

/dev/log is a UNIX domain socket, used by the syslog (and rsyslog) daemons. The system was trying to write to the socket, and it was taking a long time.
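The same delay can be measured without strace. Here is a minimal sketch in Python, assuming the standard `syslog` module, which goes through the C library's syslog(3) and so normally writes to /dev/log. The helper name `timed_call` and the test message are my own, not part of any tool.

```python
import syslog
import time


def timed_call(fn, *args):
    """Return how many seconds fn(*args) takes to complete."""
    start = time.monotonic()
    fn(*args)
    return time.monotonic() - start


if __name__ == "__main__":
    # On a healthy system this should return almost instantly. A result of
    # many seconds would point at the logging path rather than at sshd or su.
    elapsed = timed_call(syslog.syslog, syslog.LOG_INFO, "syslog latency test")
    print(f"syslog write took {elapsed:.3f}s")
```

Running this on an affected and an unaffected system side by side should make the difference obvious, if the logging path really is the bottleneck.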
When I looked at the config file (/etc/rsyslog.conf), I found that it had been set up to log to a remote server on our production network. Of course, the lab did not have access to the production network! So rsyslog was trying to log to a system that, as far as the lab was concerned, did not exist.
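A quick way to spot such entries is to scan the config for forwarding actions. The sketch below assumes the legacy rsyslog syntax, where `@host` forwards over UDP and `@@host` over TCP, with 514 as the default port. It does not handle the newer `action(type="omfwd" ...)` form, IPv6 literals, or files pulled in via include directives. The function name `find_forward_targets` is hypothetical.

```python
import re

# Legacy rsyslog forwarding action: "selector @host[:port]" (UDP)
# or "selector @@host[:port]" (TCP).
FORWARD_RE = re.compile(r"^\s*\S+\s+(@{1,2})([A-Za-z0-9._-]+)(?::(\d+))?\s*$")


def find_forward_targets(conf_text):
    """Return a list of (protocol, host, port) for legacy forwarding lines."""
    targets = []
    for line in conf_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip commented-out lines
        match = FORWARD_RE.match(line)
        if match:
            proto = "tcp" if match.group(1) == "@@" else "udp"
            port = int(match.group(3)) if match.group(3) else 514
            targets.append((proto, match.group(2), port))
    return targets
```

Feeding it the contents of /etc/rsyslog.conf gives a short list of remote targets to check for reachability from the machine in question.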
The fix was simply to remove the unreachable remote syslog server from rsyslog.conf and restart the rsyslog daemon.
Why didn't *all* the lab systems have the same problem? I believe this is because the "problem" systems generated far more syslog messages than the other systems, due to the application we were running on them. So the problem was only indirectly related to the application!
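One plausible mechanism behind that guess, which I did not verify on the affected systems, is that a log socket whose reader has stopped draining it can only absorb a limited number of messages before writers start to block. The Python sketch below illustrates this with a throwaway UNIX datagram socket standing in for /dev/log. The socket path, payload size, and function name are all made up for the demonstration, and it assumes Linux behaviour, where a non-blocking send to a full queue fails with EAGAIN.

```python
import os
import socket
import tempfile


def sends_before_blocking(max_sends=100_000, payload=b"x" * 64):
    """Count datagrams accepted by a UNIX socket whose reader never reads.

    Returns None if the socket never reported a full queue.
    """
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "fake-log.sock")
    reader = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    writer = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        reader.bind(path)          # plays the stalled syslog daemon
        writer.connect(path)       # plays su, sshd, or the application
        writer.setblocking(False)  # fail instead of hanging when full
        for sent in range(max_sends):
            try:
                writer.send(payload)
            except BlockingIOError:
                return sent        # a blocking writer would stall here
        return None
    finally:
        writer.close()
        reader.close()
        if os.path.exists(path):
            os.unlink(path)
        os.rmdir(tmpdir)


if __name__ == "__main__":
    print("messages accepted before the writer would block:",
          sends_before_blocking())
```

If this is what happened, a chatty system would hit that limit quickly and keep hitting it, while a quiet system might rarely log enough to notice.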