Re: syslogd hangs the system

From: guy keren <choo_at_actcom.co.il>
Date: Wed, 17 Nov 2010 00:14:55 +0200

Yinglin Sun wrote:
> Thanks a lot for the suggestions. See my comment inline.
>
> On Mon, Nov 15, 2010 at 6:53 PM, guy keren <choo_at_actcom.co.il> wrote:
>> as a work around, you may:
>>
>> 1. disable name lookups completely using the '-x' flag of syslogd.
>>
>
> We are not allowed to disable name lookups completely on customer
> production systems. Also, there is no '-x' flag in 1.5. Was it
> added after 1.5?

i don't know - i simply searched for sysklogd's current docs and saw
this flag documented. if you don't have it there - then i guess it
doesn't help you.

>
>> or
>>
>> 2. use a local caching name server (for example: nscd).
>
> It should work as a workaround. However, nscd has problems of its
> own, see http://sourceware.org/bugzilla/show_bug.cgi?id=4428
>
> So I worry that introducing nscd will cause other problems.
>
> So we don't have a fix for this issue, right?

looks like it. it appears there is a drop-in replacement for nscd, named
unscd, that doesn't suffer from this problem. you can check if it's good
enough for you - but perhaps fixing syslogd will be better.
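
for what it's worth, here is a minimal sketch of the caching idea from your
point 2, in python rather than syslogd's C, just to show the shape of it.
HostCache and its parameters are made-up names for illustration, not anything
in sysklogd; the 180-second window is the INET_SUSPEND_TIME value you quoted.

```python
import socket
import time

INET_SUSPEND_TIME = 180  # seconds; the 3-minute value quoted in the original mail


class HostCache:
    """Resolve each distinct host at most once per suspend window (hypothetical helper)."""

    def __init__(self, resolver=socket.gethostbyname, clock=time.monotonic,
                 suspend=INET_SUSPEND_TIME):
        self._resolver = resolver
        self._clock = clock
        self._suspend = suspend
        self._entries = {}  # host -> (address or None, time the last attempt finished)

    def lookup(self, host):
        entry = self._entries.get(host)
        if entry is not None and self._clock() - entry[1] < self._suspend:
            return entry[0]  # reuse the cached result, including a cached failure
        try:
            addr = self._resolver(host)
        except OSError:
            addr = None  # remember the failure so other config lines do not retry
        # Start the window when the attempt finishes, so a slow timeout
        # does not eat into the suspend period.
        self._entries[host] = (addr, self._clock())
        return addr
```

all config lines that point at the same host would share one HostCache, so an
unreachable dns server costs one timeout per host per window instead of one
per line.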

note, however, that this dns problem implies that you will lose log
records every now and then - is that acceptable for your setup?
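
to get a rough feel for how long syslogd can be stuck in the worst case, the
arithmetic from your original mail can be written down as below. it assumes,
as your measurements suggest, that the timeouts simply add up (20 s per
unreachable dns server per lookup); the function name is made up, and the
"2 distinct hosts" figure is just what i count in the config you attached.

```python
def worst_case_stall_seconds(dns_servers, timeout_per_server, lookups):
    """Time spent blocked if every lookup waits for every DNS server to time out."""
    return dns_servers * timeout_per_server * lookups


# Numbers from the original mail: 3 unreachable DNS servers, 20 s each,
# 5 remote lines in the config, 10 consecutive attempts per line.
per_line_retries = worst_case_stall_seconds(3, 20, 5 * 10)

# Assumption: one attempt per distinct host per suspend window, with the
# 2 distinct remote hosts that appear in the attached config.
per_host_once = worst_case_stall_seconds(3, 20, 2)

print(per_line_retries, "s =", per_line_retries // 60, "min")
print(per_host_once, "s =", per_host_once // 60, "min")
```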

--guy

>
> Thanks.
>
> Yinglin
>
>> --guy
>>
>> Yinglin Sun wrote:
>>> Hi folks,
>>>
>>> I have recently been struggling with a problem where syslogd hangs
>>> the system. syslogd is running on our system, and we configured a
>>> couple of remote log servers in /etc/syslogd.conf. We found that
>>> syslogd gets stuck in gethostbyname when our DNS servers are not
>>> reachable, which blocks all processes writing logs to syslogd.
>>>
>>> After digging a little, I found that for each remote log host,
>>> gethostbyname takes 20 seconds to time out when one DNS server is
>>> unreachable, and 40 seconds when two DNS servers cannot be reached.
>>> Since many lines in /etc/syslogd.conf do remote logging, syslogd
>>> spends a long time in gethostbyname and hangs the system.
>>>
>>> From searching the Internet, this problem looks very common; many
>>> people have run into it. However, I cannot find a solution for it.
>>> Is there already a fix that addresses this problem?
>>>
>>> Looking at the 1.5 code, I found two problems.
>>> 1. f->f_time is not updated in the case F_FORW_UNKN at line 1820.
>>> This makes syslogd call gethostbyname 10 times in a row if log
>>> messages come in at a high rate. Let's say 3 DNS servers are not
>>> reachable and 5 lines in syslogd.conf use a remote server. Then
>>> syslogd will spend 3 * 20 * 5 * 10 = 3000 seconds = 50 minutes in
>>> gethostbyname. It gets even worse if more remote servers and DNS
>>> servers are used.
>>>
>>> If f->f_time is updated every time we hit case F_FORW_UNKN, the
>>> lookups are spread out to one attempt per line every
>>> INET_SUSPEND_TIME (3 minutes). Although the system still hangs for a
>>> while, that is much better than hanging for 50 minutes in a row.
>>>
>>> 2. The same remote host is resolved again for every line.
>>> When the same remote log server appears in multiple lines of
>>> syslogd.conf, syslogd resolves it again for each one, since it
>>> treats every line separately. If we resolve each server only once
>>> per period such as INET_SUSPEND_TIME and reuse the result in the
>>> following attempts, that will save a lot of time in gethostbyname.
>>>
>>> I don't know whether a fix for this critical problem already
>>> exists. Any information would be helpful.
>>>
>>> I attach our syslogd.conf.
>>>
>>> Thanks!
>>>
>>> Yinglin
>>>
>>>
>>> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none
>>> /ddr/var/log/messages
>>>
>>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none
>>> @abcdefgsldkjf.com
>>>
>>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none
>>> @kdsljfdjff.com
>>>
>>>
>>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none
>>> /ddr/var/log/debug/messages.support
>>>
>>>
>>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none
>>> /ddr/var/log/debug/messages.engineering
>>>
>>> local1.notice
>>> /ddr/var/log/messages
>>> local1.notice
>>> @abcdefgsldkjf.com
>>> local1.notice
>>> @kdsljfdjff.com
>>>
>>> local1.notice;local3.notice
>>> /ddr/var/log/debug/messages.support
>>>
>>> local1.notice;local3.notice;local4.*
>>> /ddr/var/log/debug/messages.engineering
>>>
>>> authpriv.*
>>> /ddr/var/log/debug/secure.log
>>>
>>> mail.*
>>> /var/log/maillog
>>>
>>> cron.*
>>> /var/log/cron
>>>
>>> *.alert;local3.none;local4.none *
>>> *.alert;local3.none;local4.none
>>> @abcdefgsldkjf.com
>>> *.alert;local3.none;local4.none
>>> @kdsljfdjff.com
>>>
>>> uucp,news.crit
>>> /var/log/spooler
>>>
>>> *.alert;local3.none;local4.none
>>> |/ddr/dev/ems_pipe
>>>
>>> kern.alert
>>> |/ddr/dev/kmsg_pipe
>>>
>>> local2.notice
>>> /dev/console
>>>
>>> kern.*
>>> /ddr/var/log/debug/platform/kern.info
>>> kern.*
>>> @abcdefgsldkjf.com
>>> kern.*
>>> @kdsljfdjff.com
>>>
>>> kern.error
>>> /ddr/var/log/debug/platform/kern.error
>>> kern.error
>>> @abcdefgsldkjf.com
>>> kern.error
>>> @kdsljfdjff.com
>>>
>>> local6.*
>>> /ddr/var/log/debug/cifs/cifs.log
>>>
>>>
>>
Received on Tue Nov 16 2010 - 23:14:55 CET
