Encoding problem with rss2email from Joey Schulze on 2007-05-18 (Infodrom Oldenburg Miscellaneous Mails)

From: Joey Schulze <joey_at_infodrom.org>
Date: Fri, 18 May 2007 16:47:39 +0200

Hi Aaron,

I'm using rss2email to retrieve information from several online
resources. Recently, one site totally got on my nerves because German
umlauts were not displayed properly.

This afternoon, I took the time to finally track down the problem.
Actually, there are two of them, which you may be able to fix (or may
not, I'm not entirely sure). Even if not, I believe you should know
about them, so they can be mentioned in the documentation or on your
website if you like (well, or not, if you prefer, but that's up to
you).

The RSS file in question is

<http://german-bash.org/latest-quotes.xml>

(Its content is similar to that of <http://qdb.us/>, just with a
German origin and stuff.)

The symptom was that umlauts were not displayed correctly.

Debugging this symptom I found two problems:

1. Some items in the file mentioned above were detected to have the
encoding ISO-8859-1 while they are indeed UTF-8 encoded. An
example for this is quote #103518.

This is handled in rss2email.py around line 110

for body_charset in 'US-ASCII', 'ISO-8859-1', 'UTF-8':

    For some reason .encode() doesn't fail for ISO-8859-1 even though
    it is the wrong charset. I assume that this is a bug in Python's
    encoding detection or, if no such detection exists, a limitation
    of UTF-8 itself, since some characters cannot be identified as
    belonging to UTF-8 only.

    Whatever the cause is, leaving out ISO-8859-1 in the above list
    fixes this particular problem. Not sure if it would be feasible
    to implement this in the official version.

    However: Looking at the XML file that is parsed, its encoding is
    included and should be trusted. Maybe this string could be used
    instead of trying to determine the actual encoding?

    Unfortunately, feedparser.parse() does not seem to pass the
    encoding into rss2email. At least a quick look didn't provide the
    proper one.
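
A minimal sketch of the two observations in this item, written in
present-day Python 3 rather than the Python 2 of rss2email 2.60. The
names pick_charset and declared_encoding and the sample strings are
hypothetical and not taken from rss2email. It shows one way the charset
loop can settle on ISO-8859-1 (every character below U+0100 encodes
cleanly, including the characters produced when UTF-8 bytes were
decoded with the wrong charset earlier on), and how the encoding named
in the XML declaration could be read directly:

```python
import re


def pick_charset(body, candidates=("US-ASCII", "ISO-8859-1", "UTF-8")):
    """Return the first candidate charset that can encode body."""
    for charset in candidates:
        try:
            body.encode(charset)
        except UnicodeEncodeError:
            continue
        return charset
    return None


def declared_encoding(xml_bytes):
    """Return the encoding named in the XML declaration, or None."""
    match = re.match(
        rb"<\?xml[^>]*encoding=[\"']([A-Za-z0-9._-]+)[\"']", xml_bytes
    )
    return match.group(1).decode("ascii") if match else None


correct = "f\u00fcr"  # "für" as a proper text string
# Assumption: the UTF-8 bytes were decoded as ISO-8859-1 somewhere upstream.
mangled = correct.encode("utf-8").decode("iso-8859-1")  # 'fÃ¼r'

# Both strings contain only characters below U+0100, so the loop stops
# at ISO-8859-1 in either case and never reaches UTF-8.
charset_for_correct = pick_charset(correct)
charset_for_mangled = pick_charset(mangled)

# Hypothetical feed header, used only to exercise declared_encoding().
sample = b'<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"/>'
feed_encoding = declared_encoding(sample)
```

Under these assumptions, .encode() succeeding says nothing about
whether the text was decoded correctly in the first place, which is
why trusting the declared encoding looks attractive.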

2. With all umlauts not displayed I wanted to know why Mutt wasn't
able to display them at all. The mails contain:

Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

    This looks fine in general, but Mutt still displayed the raw
    UTF-8 sequences instead of properly converted umlauts. Debugging
    this led me to an interesting mail created by email.MIMEText.
    The content is encoded into UTF-8 twice, it seems. Maybe the
    encoding step in rss2email.py around line 142

msg = MIMEText(body.encode(body_charset), contenttype, body_charset)

is somewhat overzealous?
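
The suspected double encoding can be reproduced with a small sketch.
It assumes present-day Python 3, where MIMEText takes a text string and
encodes it itself, and it assumes the second pass happens because the
UTF-8 bytes get treated as ISO-8859-1 text; the sample word and the
helper undo_double_encoding are hypothetical and not part of rss2email:

```python
from email.mime.text import MIMEText

text = "Gr\u00fc\u00dfe"  # "Grüße"

# Encoded once: what a UTF-8 mail body should contain.
once = text.encode("utf-8")

# Encoded twice: the UTF-8 bytes are mistaken for ISO-8859-1 text and
# then encoded to UTF-8 again (assumed mechanism, see above).
twice = once.decode("iso-8859-1").encode("utf-8")


def undo_double_encoding(data):
    """Reverse one surplus UTF-8 pass under the assumption above."""
    return data.decode("utf-8").encode("iso-8859-1")


# Letting MIMEText do the encoding: pass text, not pre-encoded bytes.
msg = MIMEText(text, "plain", "utf-8")
payload = msg.get_payload(decode=True)  # bytes after base64 decoding
```

Here payload equals once, so a mail reader that honours the declared
charset sees the umlauts, while twice is the kind of content that shows
up as two odd characters per umlaut.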

This is version 2.60 of rss2email, as packaged by Debian.

I'd be glad to try out new code for you in case you come up with a
good solution to these problems.

Lacking a better solution myself I now use the hammer:

:0 E
* ^From:.*german-bash.*org
* ^Subject:.*Zitat
* ^Content-Transfer-Encoding:\ base64
* ^User-Agent: rss2email
{
:0 fbw
| base64-decode | iconv -f utf-8 -t latin1

:0 fhw
| formail -I Content-Transfer-Encoding:
}

-- 
Those who don't understand Unix are condemned to reinvent it, poorly.

Received on Fri May 18 2007 - 16:57:15 CEST
