Re: syslogd hangs the system

From: Yinglin Sun <yinglin.s_at_gmail.com>
Date: Thu, 18 Nov 2010 20:46:00 -0800

On Tue, Nov 16, 2010 at 2:14 PM, guy keren <choo_at_actcom.co.il> wrote:
> Yinglin Sun wrote:
>>
>>> 2. use a local caching name server (for example: nscd).
>>
>> It should work as a workaround. However, nscd raises other problems
>> of its own, see http://sourceware.org/bugzilla/show_bug.cgi?id=4428
>>
>> So I worry that introducing nscd will cause other problems.
>>
>> So we don't have a fix for this issue, right?
>
> looks like it. it appears there is a drop-in replacement for nscd, named
> unscd, that doesn't suffer from this problem. you can check if it's good
> enough for you - but perhaps fixing syslogd will be better.
>
> note, however, that this dns problem implies that you will lose log records
> every now and then - is that acceptable for your setup?
>

We have configured syslogd so that every log message goes to both a
local log file and a remote log server.

Yinglin

> --guy
>
>>
>> Thanks.
>>
>> Yinglin
>>
>>> --guy
>>>
>>> Yinglin Sun wrote:
>>>>
>>>> Hi folks,
>>>>
>>>> Recently I have been struggling with a problem where syslogd hangs
>>>> the system. syslogd is running on our system, and we configured a
>>>> couple of remote log servers in /etc/syslogd.conf. We found that
>>>> syslogd gets stuck in gethostbyname when our DNS servers are not
>>>> reachable, which blocks all processes writing logs to syslogd.
>>>>
>>>> After digging a little, I found that for each remote log host,
>>>> gethostbyname takes 20 seconds to time out when one DNS server is
>>>> unreachable, and 40 seconds when two DNS servers cannot be reached.
>>>> Since many lines in /etc/syslogd.conf do remote logging, syslogd
>>>> spends a lot of time in gethostbyname and hangs the system.
>>>>
>>>> From searching the Internet, this problem looks very common; many
>>>> people have run into it. However, I cannot find a solution for it.
>>>> Is there already a fix that addresses this problem?
>>>>
>>>> Looking at the 1.5 code, I found two problems.
>>>> 1. f->f_time is not updated in the case F_FORW_UNKN at line 1820.
>>>> This makes syslogd call gethostbyname 10 times consecutively if log
>>>> messages arrive at a high rate. Say 3 DNS servers are not
>>>> reachable and 5 lines in syslogd.conf use a remote server. Then
>>>> syslogd spends 3 * 20 * 5 * 10 = 3000 seconds = 50 minutes in
>>>> gethostbyname. It gets even worse if more remote servers and DNS
>>>> servers are used.
>>>>
>>>> If f->f_time is updated every time we hit case F_FORW_UNKN, the
>>>> lookups are spread out, one every INET_SUSPEND_TIME (3 minutes).
>>>> Although the system still hangs for a while, that is much better
>>>> than hanging for 50 consecutive minutes.
>>>>
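
To make the idea in point 1 concrete, here is a minimal sketch of the
throttling. It is not the sysklogd 1.5 source: struct fwd_target,
should_retry_lookup and SUSPEND_SECS are hypothetical names, with
last_attempt standing in for f->f_time and SUSPEND_SECS assumed to
match INET_SUSPEND_TIME (3 minutes).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Assumed to mirror INET_SUSPEND_TIME (3 minutes, per the post). */
#define SUSPEND_SECS 180

/* Hypothetical, simplified stand-in for one forwarding line. */
struct fwd_target {
    time_t last_attempt;  /* plays the role of f->f_time; 0 = never tried */
    bool   resolved;      /* false corresponds to the F_FORW_UNKN state */
};

/*
 * Decide whether a (potentially blocking) lookup may be attempted now.
 * The attempt time is recorded whenever a lookup is allowed, before its
 * outcome is known, so a failing host is retried at most once per
 * SUSPEND_SECS instead of once per incoming message.
 */
bool should_retry_lookup(struct fwd_target *f, time_t now)
{
    assert(f != NULL);
    if (f->resolved)
        return false;               /* nothing left to look up */
    if (f->last_attempt != 0 && now - f->last_attempt < SUSPEND_SECS)
        return false;               /* still inside the suspend window */
    f->last_attempt = now;          /* the update the post says is missing */
    return true;
}
```

A caller would invoke gethostbyname only when should_retry_lookup
returns true, so an unresolvable host costs at most one blocking lookup
per suspend window, however fast messages arrive.
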
>>>> 2. The same remote host is resolved every time.
>>>> When the same remote log servers appear in multiple lines of
>>>> syslogd.conf, syslogd resolves them again for every line, since it
>>>> treats every line separately. If each server were resolved only
>>>> once per period such as INET_SUSPEND_TIME, and the result reused in
>>>> subsequent attempts, that would save a lot of time in gethostbyname.
>>>>
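
Point 2 could be sketched as a small per-hostname result cache shared
by all lines. Everything here is an assumption for illustration, not
sysklogd code: the names cache_lookup and cache_store, the table size,
and a lifetime equal to INET_SUSPEND_TIME. Failed lookups are cached
too, since those are the ones that block. A real version would also
store the resolved address.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>

#define CACHE_SLOTS 16    /* arbitrary size for this sketch */
#define CACHE_TTL   180   /* seconds; assumed equal to INET_SUSPEND_TIME */
#define HOST_MAX    256

struct host_entry {
    char   name[HOST_MAX];
    time_t checked_at;    /* when the lookup result was recorded */
    bool   ok;            /* whether that lookup succeeded */
    bool   used;
};

static struct host_entry cache[CACHE_SLOTS];

static struct host_entry *find_entry(const char *host)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].used && strcmp(cache[i].name, host) == 0)
            return &cache[i];
    return NULL;
}

/* Returns true and fills *ok if a result younger than CACHE_TTL exists. */
bool cache_lookup(const char *host, time_t now, bool *ok)
{
    assert(host != NULL && ok != NULL);
    struct host_entry *e = find_entry(host);
    if (e == NULL || now - e->checked_at >= CACHE_TTL)
        return false;
    *ok = e->ok;
    return true;
}

/* Records a lookup result, successful or not, so other lines can reuse it. */
void cache_store(const char *host, time_t now, bool ok)
{
    assert(host != NULL);
    if (strlen(host) >= HOST_MAX)
        return;                     /* too long to cache; skip */
    struct host_entry *e = find_entry(host);
    for (int i = 0; e == NULL && i < CACHE_SLOTS; i++)
        if (!cache[i].used)
            e = &cache[i];          /* take the first free slot */
    if (e == NULL)
        return;                     /* cache full; skip */
    strcpy(e->name, host);
    e->checked_at = now;
    e->ok = ok;
    e->used = true;
}
```

Each forwarding line would call cache_lookup first and fall back to
gethostbyname followed by cache_store only on a miss, so several lines
pointing at the same host share one lookup per period.
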
>>>> I don't know whether a fix for this critical problem already
>>>> exists. Any information would be helpful.
>>>>
>>>> I have attached our syslogd.conf.
>>>>
>>>> Thanks!
>>>>
>>>> Yinglin
>>>>
>>>>
>>>>
>>>> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>>
>>>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none    /ddr/var/log/messages
>>>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none    @abcdefgsldkjf.com
>>>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none    @kdsljfdjff.com
>>>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none    /ddr/var/log/debug/messages.support
>>>> *.notice;auth.info,mail.none,news.none,authpriv.none,cron.none,kern.none,local1.none,local3.none,local4.none    /ddr/var/log/debug/messages.engineering
>>>>
>>>> local1.notice    /ddr/var/log/messages
>>>> local1.notice    @abcdefgsldkjf.com
>>>> local1.notice    @kdsljfdjff.com
>>>>
>>>> local1.notice;local3.notice    /ddr/var/log/debug/messages.support
>>>>
>>>> local1.notice;local3.notice;local4.*    /ddr/var/log/debug/messages.engineering
>>>>
>>>> authpriv.*    /ddr/var/log/debug/secure.log
>>>>
>>>> mail.*    /var/log/maillog
>>>>
>>>> cron.*    /var/log/cron
>>>>
>>>> *.alert;local3.none;local4.none    *
>>>> *.alert;local3.none;local4.none    @abcdefgsldkjf.com
>>>> *.alert;local3.none;local4.none    @kdsljfdjff.com
>>>>
>>>> uucp,news.crit    /var/log/spooler
>>>>
>>>> *.alert;local3.none;local4.none    |/ddr/dev/ems_pipe
>>>>
>>>> kern.alert    |/ddr/dev/kmsg_pipe
>>>>
>>>> local2.notice    /dev/console
>>>>
>>>> kern.*    /ddr/var/log/debug/platform/kern.info
>>>> kern.*    @abcdefgsldkjf.com
>>>> kern.*    @kdsljfdjff.com
>>>>
>>>> kern.error    /ddr/var/log/debug/platform/kern.error
>>>> kern.error    @abcdefgsldkjf.com
>>>> kern.error    @kdsljfdjff.com
>>>>
>>>> local6.*    /ddr/var/log/debug/cifs/cifs.log
>>>>
>>>>
>>>
>
>
Received on Fri Nov 19 2010 - 05:46:00 CET
