Hi Aaron,
I'm using rss2email to retrieve information from several online
resources. Recently, one site totally got on my nerves because German
umlauts were not displayed properly.
This afternoon, I took the time to finally track down the problem.
Actually, there are two of them, which you may be able to fix (or may
not, I'm not entirely sure). Even if not, I believe you should know
about them, so they can be mentioned in the documentation or on your
website if you like (well, or not, if you prefer, but that's up to
you).
The RSS file in question is
<http://german-bash.org/latest-quotes.xml>
(Its content is similar to that of <http://qdb.us/>, just of German
origin.)
The symptom was that umlauts were not displayed correctly.
Debugging this symptom I found two problems:
1. Some items in the file mentioned above were detected to have the
encoding ISO-8859-1 while they are indeed UTF-8 encoded. An
example for this is quote #103518.
This is handled in rss2email.py around line 110:
    for body_charset in 'US-ASCII', 'ISO-8859-1', 'UTF-8':
For some reason .encode() doesn't fail for ISO-8859-1 even though
that is the wrong charset. I assume this is either a bug in Python's
encoding detection or, if there is no such detection, a limitation
of UTF-8 itself, since some characters cannot be identified as
belonging to UTF-8 only.
Whatever the cause is, leaving out ISO-8859-1 in the above list
fixes this particular problem. I'm not sure whether it would be
feasible to implement this in the official version.
However: Looking at the XML file that is parsed, its encoding is
included and should be trusted. Maybe this string could be used
instead of trying to determine the actual encoding?
Unfortunately, feedparser.parse() does not seem to pass the
encoding into rss2email. At least a quick look didn't provide the
proper one.
2. With all umlauts not displayed I wanted to know why Mutt wasn't
able to display them at all. The mails contain:
    Content-Type: text/plain; charset="utf-8"
    Content-Transfer-Encoding: base64
This looks fine in general, but Mutt still displayed raw UTF-8
byte sequences instead of properly converted umlauts. Debugging
this led to an interesting mail created by email.MIMEText.
The content is encoded into UTF-8 twice, it seems. Maybe the
encoding step in rss2email.py around line 142
    msg = MIMEText(body.encode(body_charset), contenttype, body_charset)
is somewhat overzealous?
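
To illustrate point 1, here is a minimal sketch of such a
first-match charset loop. It is not the rss2email code: the function
name pick_charset is hypothetical, and it is written for Python 3,
while rss2email 2.60 runs on Python 2. As far as I can tell, .encode()
does no detection at all. It succeeds whenever every character of the
string exists in the target charset, so a string that was already
decoded wrongly (UTF-8 bytes read as Latin-1) still passes the
ISO-8859-1 test.

```python
def pick_charset(body, candidates=("US-ASCII", "ISO-8859-1", "UTF-8")):
    """Return the first candidate charset that can represent `body`.

    This only checks representability; it cannot tell whether the
    string was decoded correctly in the first place.
    """
    for charset in candidates:
        try:
            body.encode(charset)
        except UnicodeError:
            continue  # some character is missing from this charset
        return charset
    raise ValueError("no candidate charset can represent the text")


if __name__ == "__main__":
    correct = "für"            # properly decoded text
    mojibake = "fÃ¼r"          # UTF-8 bytes of "für" misread as Latin-1
    for sample in ("Zitat", correct, mojibake, "„Zitat“"):
        print(repr(sample), "->", pick_charset(sample))
```

Both "für" and the garbled "fÃ¼r" end up as ISO-8859-1 here, which
would explain why dropping ISO-8859-1 from the list merely hides the
symptom.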
This is version 2.60 of rss2email as packaged by Debian.
I'd be glad to try out new code for you in case you come up with a
good solution to these problems.
Lacking a better solution myself I now use the hammer:
:0 E
* ^From:.*german-bash.*org
* ^Subject:.*Zitat
* ^Content-Transfer-Encoding:\ base64
* ^User-Agent: rss2email
{
    :0 fbw
    | base64-decode | iconv -f utf-8 -t latin1

    :0 fhw
    | formail -I Content-Transfer-Encoding:
}
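
The iconv step in the recipe above strips one layer of UTF-8 encoding.
The following sketch shows what I assume happens to the body, and why
converting "from UTF-8 to Latin-1" repairs it. The function names are
hypothetical and the code is Python 3, so it only models the suspected
double encoding and is not taken from rss2email.

```python
def double_encode(text):
    """Encode text as UTF-8, then mistakenly encode the result again."""
    once = text.encode("utf-8")
    # The mistake: the UTF-8 bytes are treated as Latin-1 characters
    # and encoded to UTF-8 a second time.
    return once.decode("iso-8859-1").encode("utf-8")


def strip_one_layer(data):
    """Undo one UTF-8 layer, like `iconv -f utf-8 -t latin1`."""
    return data.decode("utf-8").encode("iso-8859-1")


if __name__ == "__main__":
    broken = double_encode("für")
    print(broken)                                   # doubly encoded bytes
    print(strip_one_layer(broken).decode("utf-8"))  # readable again
```

After the conversion the body is plain, singly encoded UTF-8, which
matches the charset="utf-8" header the mail already carries.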
-- 
Those who don't understand Unix are condemned to reinvent it, poorly.

Received on Fri May 18 2007 - 16:57:15 CEST