Re: syslogd hangs the system

From: Yinglin Sun <yinglin.s_at_gmail.com>
Date: Tue, 16 Nov 2010 10:48:10 -0800

Thanks a lot for the suggestions. See my comments inline.

On Mon, Nov 15, 2010 at 6:53 PM, guy keren <choo_at_actcom.co.il> wrote:
>
> as a work around, you may:
>
> 1. disable name lookups completely using the '-x' flag of syslogd.
>

Disabling name lookups completely is not allowed on customer
production systems. Also, there is no '-x' flag in 1.5. Was it added
after 1.5?

> or
>
> 2. use a local caching name server (for example: nscd).

That should work as a workaround. However, nscd has problems of its
own, see http://sourceware.org/bugzilla/show_bug.cgi?id=4428

So I worry that introducing nscd will cause other problems.

So we don't have a fix for this issue, right?

Thanks.

Yinglin

>
> --guy
>
> Yinglin Sun wrote:
>>
>> Hi folks,
>>
>> Recently I have been struggling with a problem where syslogd hangs the
>> system. syslogd is running on our system, and we configured a couple of
>> remote log servers in /etc/syslogd.conf. We found that syslogd gets
>> stuck in gethostbyname when our DNS servers are not reachable, which
>> blocks all processes writing logs to syslogd.
>>
>> After digging a little, I found that for each remote log host,
>> gethostbyname takes 20 seconds to time out when one DNS server is
>> unreachable, and 40 seconds when two DNS servers are unreachable.
>> Since we have many lines doing remote logging in /etc/syslogd.conf,
>> syslogd spends a lot of time in gethostbyname and hangs the system.
>>
>> Searching the Internet, this problem looks very common. Many people
>> have run into it. However, I cannot find a solution for it. I'm
>> wondering if there is already a fix that addresses this problem?
>>
>> Looking at the 1.5 code, I found two problems.
>> 1. f->f_time is not updated in the case F_FORW_UNKN at line 1820.
>> This makes it call gethostbyname 10 times consecutively if log
>> messages come in at a high rate. Let's say 3 DNS servers are not
>> reachable and 5 lines in syslogd.conf use a remote server. Then syslogd
>> will spend 3 * 20 * 5 * 10 = 3000 seconds (50 minutes) in
>> gethostbyname. Things get even worse if more remote servers and DNS
>> servers are used.
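
To spell out that estimate: the product is in seconds, and 3000 seconds
is 50 minutes. A small C sketch of the calculation follows. The function
and parameter names are mine and do not come from the syslogd source; the
20-second timeout and the 10 retries are the figures observed above.

```c
/* Worst-case time (in seconds) that syslogd spends blocked in
 * gethostbyname() during one burst of messages.
 * Names are illustrative only, not taken from the syslogd source. */
long worst_case_block_seconds(int dns_servers, int timeout_per_server,
                              int forward_lines, int retries_per_line)
{
    return (long)dns_servers * timeout_per_server
         * forward_lines * retries_per_line;
}
```

With 3 DNS servers, a 20-second timeout, 5 forwarding lines and 10
retries, this returns 3000.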
>>
>> If f->f_time is updated every time we hit case F_FORW_UNKN, the
>> lookups are spread out, one every INET_SUSPEND_TIME (3 minutes).
>> Although the system still hangs for a while, that is much better than
>> hanging for 50 minutes consecutively.
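
A minimal sketch of the throttling I have in mind. This is not the actual
1.5 code: struct filed_sketch and should_retry_lookup() are hypothetical
stand-ins, and I am assuming INET_SUSPEND_TIME is 180 seconds (the 3
minutes mentioned above).

```c
#include <time.h>

#define INET_SUSPEND_TIME 180   /* assumed: 3 minutes, in seconds */

/* Hypothetical, trimmed-down stand-in for syslogd's per-line state. */
struct filed_sketch {
    time_t f_time;              /* time of the last lookup attempt, 0 = never */
};

/* Returns 1 if a lookup may be attempted now, 0 if it is too soon.
 * When it returns 1 it also records the attempt, so the next lookup
 * is at least INET_SUSPEND_TIME seconds away. */
int should_retry_lookup(struct filed_sketch *f, time_t now)
{
    if (f->f_time != 0 && now - f->f_time < INET_SUSPEND_TIME)
        return 0;
    f->f_time = now;            /* the update missing in case F_FORW_UNKN */
    return 1;
}
```

The caller would only call gethostbyname when this returns 1, so a burst
of messages triggers one lookup per line instead of ten in a row.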
>>
>> 2. The same remote host is resolved every time
>> When we use the same remote log servers in multiple lines of
>> syslogd.conf, syslogd resolves the same servers again for each line,
>> since it treats every line separately. If we resolve each server only
>> once per period such as INET_SUSPEND_TIME and reuse the result in the
>> following attempts, that will save a lot of time in gethostbyname.
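
Roughly what I mean by reusing the result, as a sketch only. Nothing here
exists in syslogd: the cache structure, cached_resolve() and the stub
resolver are all hypothetical, the address is a placeholder unsigned long,
and INET_SUSPEND_TIME is again assumed to be 180 seconds. Failed lookups
are cached as well, since those are the expensive ones in our case.

```c
#include <string.h>
#include <time.h>

#define INET_SUSPEND_TIME 180   /* assumed: 3 minutes, in seconds */
#define CACHE_SLOTS 8           /* arbitrary size for the sketch */

/* Hypothetical cache entry; not a structure from the syslogd source. */
struct host_cache_entry {
    char          name[256];
    unsigned long addr;         /* placeholder for the resolved address */
    int           ok;           /* 1 = addr is valid, 0 = last lookup failed */
    time_t        stamp;        /* time of the lookup, 0 = slot unused */
};

static struct host_cache_entry host_cache[CACHE_SLOTS];

/* Signature of the real lookup, e.g. a wrapper around gethostbyname(). */
typedef int (*resolver_fn)(const char *name, unsigned long *addr);

/* Resolve name at most once per INET_SUSPEND_TIME (now must be nonzero).
 * Returns 1 and fills *addr on success, 0 on failure. */
int cached_resolve(const char *name, time_t now, resolver_fn resolve,
                   unsigned long *addr)
{
    struct host_cache_entry *slot = &host_cache[0];
    int i;

    for (i = 0; i < CACHE_SLOTS; i++) {
        struct host_cache_entry *e = &host_cache[i];
        if (e->stamp != 0 && strcmp(e->name, name) == 0) {
            if (now - e->stamp < INET_SUSPEND_TIME) {
                if (e->ok)
                    *addr = e->addr;
                return e->ok;   /* fresh entry: no lookup needed */
            }
            slot = e;           /* stale entry: refresh it in place */
            break;
        }
        if (e->stamp < slot->stamp)
            slot = e;           /* else reuse an unused or the oldest slot */
    }

    strncpy(slot->name, name, sizeof(slot->name) - 1);
    slot->name[sizeof(slot->name) - 1] = '\0';
    slot->ok = resolve(name, &slot->addr) ? 1 : 0;
    slot->stamp = now;
    if (slot->ok)
        *addr = slot->addr;
    return slot->ok;
}

/* Stub standing in for a lookup that times out; counts how often it runs. */
int stub_calls;
int stub_resolver(const char *name, unsigned long *addr)
{
    (void)name;
    (void)addr;
    stub_calls++;
    return 0;
}
```

With something like this, the two remote hosts in our syslogd.conf would
each cost one timeout per INET_SUSPEND_TIME, no matter how many lines
refer to them.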
>>
>> I don't know if there is already a fix for this critical problem. Any
>> information would be helpful.
>>
>> I attach our syslogd.conf.
>>
>> Thanks!
>>
>> Yinglin
>>
>>
>> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none
>>  /ddr/var/log/messages
>>
>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none
>>  @abcdefgsldkjf.com
>>
>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none
>>  @kdsljfdjff.com
>>
>>
>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none
>>  /ddr/var/log/debug/messages.support
>>
>>
>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none
>>  /ddr/var/log/debug/messages.engineering
>>
>> local1.notice
>>  /ddr/var/log/messages
>> local1.notice
>>  @abcdefgsldkjf.com
>> local1.notice
>>  @kdsljfdjff.com
>>
>> local1.notice;local3.notice
>>  /ddr/var/log/debug/messages.support
>>
>> local1.notice;local3.notice;local4.*
>>  /ddr/var/log/debug/messages.engineering
>>
>> authpriv.*
>>  /ddr/var/log/debug/secure.log
>>
>> mail.*
>>  /var/log/maillog
>>
>> cron.*
>>  /var/log/cron
>>
>> *.alert;local3.none;local4.none                                         *
>> *.alert;local3.none;local4.none
>>  @abcdefgsldkjf.com
>> *.alert;local3.none;local4.none
>>  @kdsljfdjff.com
>>
>> uucp,news.crit
>>  /var/log/spooler
>>
>> *.alert;local3.none;local4.none
>>  |/ddr/dev/ems_pipe
>>
>> kern.alert
>>  |/ddr/dev/kmsg_pipe
>>
>> local2.notice
>>  /dev/console
>>
>> kern.*
>>  /ddr/var/log/debug/platform/kern.info
>> kern.*
>>  @abcdefgsldkjf.com
>> kern.*
>>  @kdsljfdjff.com
>>
>> kern.error
>>  /ddr/var/log/debug/platform/kern.error
>> kern.error
>>  @abcdefgsldkjf.com
>> kern.error
>>  @kdsljfdjff.com
>>
>> local6.*
>>  /ddr/var/log/debug/cifs/cifs.log
>>
>>
>
>
Received on Tue Nov 16 2010 - 19:48:10 CET
