syslogd hangs the system

From: Yinglin Sun <yinglin.s_at_gmail.com>
Date: Mon, 15 Nov 2010 18:27:24 -0800

Hi folks,

Recently I have been struggling with a problem where syslogd hangs the
system. syslogd is running on our system, and we configured a couple of
remote log servers in /etc/syslogd.conf. We found that syslogd gets stuck
in gethostbyname when our DNS servers are unreachable, which blocks all
processes writing logs to syslogd.

After digging a little, I found that for each remote log host,
gethostbyname takes 20 seconds to time out when one DNS server is
unreachable, and 40 seconds when two DNS servers are unreachable.
Since we have many lines doing remote logging in /etc/syslogd.conf,
syslogd spends a lot of time in gethostbyname and hangs the system.

From searching the Internet, this problem looks very common; many people
have run into it. However, I cannot find a solution. Is there already a
fix that addresses this problem?

Looking at the 1.5 code, I found two problems.
1. f->f_time is not updated in the case F_FORW_UNKN at line 1820.
This makes syslogd call gethostbyname 10 times in a row if log
messages arrive at a high rate. Say 3 DNS servers are unreachable and
5 lines in syslogd.conf use a remote server. Then syslogd spends
3 * 20 * 5 * 10 = 3000 seconds = 50 minutes in gethostbyname. It gets
even worse with more remote servers and DNS servers.

If f->f_time is updated every time we hit case F_FORW_UNKN, the lookups
are spread out, one every INET_SUSPEND_TIME (3 minutes). The system
still hangs for a while each time, but that is much better than hanging
for 50 minutes in a row.
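
The change proposed in item 1 could be sketched roughly as follows. This is
a simplified stand-alone model, not the actual syslogd source: struct
filed_sketch, try_resolve_unknown(), the resolver callback and the stub
resolver are hypothetical names made up for illustration. Only f_time,
F_FORW_UNKN and INET_SUSPEND_TIME come from the code discussed above, and
the value 180 is assumed from the "3 minutes" mentioned.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define INET_SUSPEND_TIME 180   /* seconds; assumed from the 3 minutes above */

enum f_type_sketch { F_FORW_UNKN, F_FORW };   /* hypothetical subset of states */

/* Hypothetical, reduced stand-in for syslogd's per-line entry. */
struct filed_sketch {
    enum f_type_sketch f_type;
    time_t f_time;          /* time of the last lookup attempt */
    const char *f_hname;    /* remote host name from the config line */
};

/* Resolver callback: returns 0 on success, nonzero on failure.
 * In syslogd this would be the blocking gethostbyname() call. */
typedef int (*resolver_fn)(const char *hname);

/* Stub resolver for experiments: counts calls and always fails,
 * imitating unreachable DNS servers without touching the network. */
int stub_lookup_calls = 0;
int stub_resolver_fail(const char *hname)
{
    (void)hname;
    stub_lookup_calls++;
    return -1;
}

/*
 * Handle one message for an entry in state F_FORW_UNKN.
 * Returns 1 if a lookup was attempted, 0 if it was skipped because
 * INET_SUSPEND_TIME has not elapsed since the previous attempt.
 */
int try_resolve_unknown(struct filed_sketch *f, time_t now, resolver_fn resolve)
{
    assert(f != NULL && resolve != NULL);

    if (now - f->f_time < INET_SUSPEND_TIME)
        return 0;               /* still suspended: do not block on DNS */

    /* The proposed fix: record the attempt time whether or not the lookup
     * succeeds, so the next message does not trigger another lookup. */
    f->f_time = now;

    if (resolve(f->f_hname) == 0)
        f->f_type = F_FORW;     /* host is known now, start forwarding */
    return 1;
}
```

With the timestamp updated on every attempt, a burst of messages arriving
within the same INET_SUSPEND_TIME window causes at most one blocking lookup
per config line instead of one per message.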

2. The same remote host is resolved again for every line.
When the same remote log servers appear in multiple lines of
syslogd.conf, syslogd resolves them again for each line, since it
treats every line separately. If each server were resolved only once
per period such as INET_SUSPEND_TIME, and the result reused for the
following attempts, that would save a lot of time in gethostbyname.
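
The idea in item 2 could look something like the sketch below: a small
table keyed by host name that remembers when a name was last looked up and
what the outcome was. Everything here (cached_resolve(), the table layout,
CACHE_SLOTS, HNAME_MAX, the stub resolver) is hypothetical and not taken
from syslogd; the value 180 for INET_SUSPEND_TIME is assumed from the
"3 minutes" mentioned earlier. A real version would also have to store the
resolved address, not just success or failure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define INET_SUSPEND_TIME 180   /* seconds; assumed from the 3 minutes above */
#define CACHE_SLOTS 16          /* hypothetical table size */
#define HNAME_MAX 256           /* hypothetical host name limit */

struct resolve_cache_entry {
    char hname[HNAME_MAX];
    time_t when;    /* time of the last real lookup */
    int result;     /* 0 = resolved, nonzero = lookup failed */
    int used;
};

static struct resolve_cache_entry cache[CACHE_SLOTS];

/* Resolver callback: returns 0 on success, nonzero on failure.
 * In syslogd this would wrap the blocking gethostbyname() call. */
typedef int (*resolver_fn)(const char *hname);

/* Stub resolver for experiments: counts calls and always fails. */
int stub_lookup_calls = 0;
int stub_resolver_fail(const char *hname)
{
    (void)hname;
    stub_lookup_calls++;
    return -1;
}

/* Resolve hname, reusing a result younger than INET_SUSPEND_TIME. */
int cached_resolve(const char *hname, time_t now, resolver_fn resolve)
{
    struct resolve_cache_entry *slot = NULL;
    int i, result;

    assert(hname != NULL && resolve != NULL);

    for (i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].used && strcmp(cache[i].hname, hname) == 0) {
            if (now - cache[i].when < INET_SUSPEND_TIME)
                return cache[i].result;   /* fresh entry: no lookup */
            slot = &cache[i];             /* stale entry: refresh in place */
            break;
        }
        if (!cache[i].used && slot == NULL)
            slot = &cache[i];             /* remember the first free slot */
    }

    result = resolve(hname);              /* possibly slow, blocking call */

    /* Store the outcome if there is room; otherwise just skip caching. */
    if (slot != NULL && strlen(hname) < HNAME_MAX) {
        strcpy(slot->hname, hname);
        slot->when = now;
        slot->result = result;
        slot->used = 1;
    }
    return result;
}
```

In the attached configuration, the two remote hosts each appear on five
lines, so a shared table like this would reduce the lookups per
INET_SUSPEND_TIME window from ten to two.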

I don't know whether a fix for this critical problem already exists. Any
information would be helpful.

I attach our syslogd.conf.

Thanks!

Yinglin

----------------------------------------------------------------------------------------------------------------------------------------------------------------
*.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none /ddr/var/log/messages
*.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none @abcdefgsldkjf.com
*.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none @kdsljfdjff.com

*.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none /ddr/var/log/debug/messages.support

*.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none /ddr/var/log/debug/messages.engineering

local1.notice /ddr/var/log/messages
local1.notice @abcdefgsldkjf.com
local1.notice @kdsljfdjff.com

local1.notice;local3.notice /ddr/var/log/debug/messages.support

local1.notice;local3.notice;local4.* /ddr/var/log/debug/messages.engineering

authpriv.* /ddr/var/log/debug/secure.log

mail.* /var/log/maillog

cron.* /var/log/cron

*.alert;local3.none;local4.none *
*.alert;local3.none;local4.none @abcdefgsldkjf.com
*.alert;local3.none;local4.none @kdsljfdjff.com

uucp,news.crit /var/log/spooler

*.alert;local3.none;local4.none |/ddr/dev/ems_pipe

kern.alert |/ddr/dev/kmsg_pipe

local2.notice /dev/console

kern.* /ddr/var/log/debug/platform/kern.info
kern.* @abcdefgsldkjf.com
kern.* @kdsljfdjff.com

kern.error /ddr/var/log/debug/platform/kern.error
kern.error @abcdefgsldkjf.com
kern.error @kdsljfdjff.com

local6.* /ddr/var/log/debug/cifs/cifs.log
Received on Tue Nov 16 2010 - 03:27:24 CET
