Friday, July 19, 2013

non-existent remote syslog server causes SSH and su to pause/hang

I recently set up a VMware lab for a project at work. As a quick start, I used some of our existing kickstart deployments (deployed via Cobbler), which included a bunch of operational configuration items.

Soon my users started having trouble logging in via SSH. One of the following would happen:

  • User would immediately get the error "ssh_exchange_identification: Connection closed by remote host"
  • User would have to wait ~20 seconds to get a password prompt. After entering a password, another ~20 seconds would pass before getting the shell prompt.

I tried a lot of different things to find the problem. Rebooting the system resolved the issue temporarily. Complicating matters, only a couple of systems were experiencing the problem -- not all of them. So at first I ruled out the system configuration as the cause, and chased the red herring of a certain application we were running on *only* those systems.

Another symptom of the issue was that "su" would also pause for ~20 seconds before presenting the password prompt, so I knew it wasn't limited to sshd.

Eventually I found the problem by doing the following:

    # strace su -
    [...]
    ###  The output would pause here:
    connect(3, {sa_family=AF_FILE, path="/dev/log"}, 110) = 0
    sendto(3, "<86>Jul 19 16:36:59 su: pam_unix"..., 96, MSG_NOSIGNAL, NULL, 0) = 

/dev/log is a UNIX domain socket, used by the syslog (and rsyslog) daemons. The system was trying to write to the socket, and it was taking a long time.
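
A quicker check for this kind of stall, without reading strace output, is to time a single syslog write. Here is a minimal sketch, assuming the standard logger utility is installed and writes to /dev/log; the elapsed_seconds helper and the syslog-test tag are hypothetical names made up for illustration.

```shell
# Print the wall-clock time, in whole seconds, that a command takes to run.
# (elapsed_seconds is a hypothetical helper name, not a standard tool.)
elapsed_seconds() {
    local start end
    start=$(date +%s)
    "$@" >/dev/null 2>&1      # run the command, discarding its output
    end=$(date +%s)
    echo $(( end - start ))
}

# Usage on the affected host: time one message written through syslog.
#   elapsed_seconds logger -t syslog-test "timing test"
```

On a healthy system this should print 0 or 1. A value close to the ~20 second delay described above would suggest the stall is in the syslog daemon rather than in sshd or su.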

When I looked at the config file (/etc/rsyslog.conf), I found that it had been set up to log to a remote server on our production network. Of course, the lab did not have access to the production network! So rsyslog was trying to log to a non-existent system.
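
In the legacy rsyslog.conf syntax, remote forwarding is written as a selector followed by @host (UDP) or @@host (TCP). Below is a rough sketch for spotting such lines, assuming that legacy syntax is in use; it would not match the newer action(type="omfwd" ...) form. The function name is hypothetical.

```shell
# List the uncommented lines in an rsyslog config whose action starts with
# @ or @@, i.e. forwarding to a remote host. Prints "lineno:line" for each.
# (find_remote_targets is a hypothetical helper name.)
find_remote_targets() {
    grep -nE '^[^#]*[[:space:]]@@?[^[:space:]]' "${1:-/etc/rsyslog.conf}"
}

# Usage:
#   find_remote_targets                      # checks /etc/rsyslog.conf
#   find_remote_targets /path/to/other.conf
```

Any host it reports that the machine cannot actually reach is a candidate for the kind of delay described here.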

The fix was simply to remove the non-existent remote syslog server entry from rsyslog.conf and restart rsyslog.
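
As an illustration of that fix, here is a sketch that comments out the forwarding lines instead of deleting them, so the change is easy to revert. It assumes the legacy "@host" / "@@host" syntax and a sed that supports -E and -i with a backup suffix; the function name is hypothetical, and the restart command depends on your init system.

```shell
# Comment out every uncommented line that forwards to a remote host.
# The original file is kept alongside it as <file>.bak.
# (disable_remote_targets is a hypothetical helper name.)
disable_remote_targets() {
    local conf="${1:-/etc/rsyslog.conf}"
    sed -i.bak -E '/^[^#]*[[:space:]]@@?[^[:space:]]/ s/^/#/' "$conf"
}

# Usage (as root), then restart the daemon so the change takes effect:
#   disable_remote_targets /etc/rsyslog.conf
#   service rsyslog restart
```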

Why didn't *all* the lab systems have the same problem? I believe this is because the "problem" systems generated many more syslog messages than the other systems -- due to the application we were running on those systems. So it was only indirectly related to the application!