guile-email discussion
* [guile-email] parse-email-headers returns just “fields”
@ 2020-04-21 12:24 Ricardo Wurmus
  2020-04-23  1:26 ` Arun Isaac
  0 siblings, 1 reply; 9+ messages in thread
From: Ricardo Wurmus @ 2020-04-21 12:24 UTC
  To: guile-email

[-- Attachment #1: Type: text/plain, Size: 896 bytes --]

Hi,

I’m currently trying to parse the debbugs bug log files directly.  They
contain emails (and other information), so I cut out the email text and
feed it to parse-email.  In some cases the emails don’t seem to have a
content-transfer-encoding header, which leads to an error when trying to
decode the body.

So instead of using parse-email directly I use

  (email->headers+body
   (string->bytevector content "utf-8"))

and run parse-email-headers over the first value, add a dummy
content-transfer-encoding header with value 'binary if it’s missing and
then parse the body.

Now I noticed two odd things:

* I sometimes need to discard the first two lines of the raw email to
  get the headers to be fully parsed
* in some cases the result of parse-email-headers is a literal “fields”,
  not an alist.

Attached is one of these emails.

-- 
Ricardo



[-- Attachment #2: mail.txt --]
[-- Type: text/plain, Size: 4531 bytes --]

From drew.adams@oracle.com Sat Sep 13 09:41:10 2008
X-Spam-Checker-Version: SpamAssassin 3.2.3-bugs.debian.org_2005_01_02
	(2007-08-08) on rzlab.ucr.edu
X-Spam-Level: 
X-Spam-Status: No, score=-6.7 required=4.0 tests=AWL,BAYES_00,
	RCVD_IN_DNSWL_MED,UNPARSEABLE_RELAY autolearn=ham
	version=3.2.3-bugs.debian.org_2005_01_02
Received: (at submit) by emacsbugs.donarmstrong.com; 13 Sep 2008 16:41:10 +0000
Received: from fencepost.gnu.org (fencepost.gnu.org [140.186.70.10])
	by rzlab.ucr.edu (8.13.8/8.13.8/Debian-3) with ESMTP id m8DGf5S7011950
	for <submit@emacsbugs.donarmstrong.com>; Sat, 13 Sep 2008 09:41:07 -0700
Received: from mail.gnu.org ([199.232.76.166]:57008 helo=mx10.gnu.org)
	by fencepost.gnu.org with esmtp (Exim 4.67)
	(envelope-from <drew.adams@oracle.com>)
	id 1KeY9S-0000Yg-Bg
	for emacs-pretest-bug@gnu.org; Sat, 13 Sep 2008 12:39:14 -0400
Received: from Debian-exim by monty-python.gnu.org with spam-scanned (Exim 4.60)
	(envelope-from <drew.adams@oracle.com>)
	id 1KeYBB-0004EU-3I
	for emacs-pretest-bug@gnu.org; Sat, 13 Sep 2008 12:41:04 -0400
Received: from agminet01.oracle.com ([141.146.126.228]:17202)
	by monty-python.gnu.org with esmtps (TLS-1.0:DHE_RSA_AES_256_CBC_SHA1:32)
	(Exim 4.60)
	(envelope-from <drew.adams@oracle.com>)
	id 1KeYBA-0004Cs-4Y
	for emacs-pretest-bug@gnu.org; Sat, 13 Sep 2008 12:41:00 -0400
Received: from agmgw1.us.oracle.com (agmgw1.us.oracle.com [152.68.180.212])
	by agminet01.oracle.com (Switch-3.2.4/Switch-3.1.7) with ESMTP id m8DGeoWf024917
	for <emacs-pretest-bug@gnu.org>; Sat, 13 Sep 2008 11:40:50 -0500
Received: from acsmt701.oracle.com (acsmt701.oracle.com [141.146.40.71])
	by agmgw1.us.oracle.com (Switch-3.2.0/Switch-3.2.0) with ESMTP id m8DGensx018593
	for <emacs-pretest-bug@gnu.org>; Sat, 13 Sep 2008 10:40:50 -0600
Received: from dradamslap1 (/24.23.165.218)
	by default (Oracle Beehive Gateway v4.0)
	with ESMTP ; Sat, 13 Sep 2008 09:40:49 -0700
From: "Drew Adams" <drew.adams@oracle.com>
To: <emacs-pretest-bug@gnu.org>
Subject: 23.0.60; incorrect code for filesets-get-filelist
Date: Sat, 13 Sep 2008 09:40:59 -0700
Message-ID: <002901c915bf$811df210$0200a8c0@us.oracle.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: 7bit
X-Mailer: Microsoft Office Outlook 11
Thread-Index: AckVv4Ci1hEH5okgSQ2EUXtBuurWgg==
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3350
X-Brightmail-Tracker: AAAAAQAAAAI=
X-Brightmail-Tracker: AAAAAQAAAAI=
X-Whitelist: TRUE
X-Whitelist: TRUE
X-detected-operating-system: by monty-python.gnu.org: GNU/Linux 2.4-2.6

The part that treats a :tree of the code defining
`filesets-get-filelist' is not correct and never could have been
correct. And it does not correspond to the (correct) code from the
filesets author.  One wonders if the GNU Emacs code was ever tested.
 
This is the `case' clause that treats :tree in the definition
of `filesets-get-filelist':
 
((:tree)
 (let ((dir  (nth 0 entry))
       (patt (nth 1 entry)))
   (filesets-directory-files dir patt ':files t)))
 
But `entry' here is a complete fileset, which is of the form
("my-fs" (:tree "/some/directory" "^.+\.suffix$"))
 
The above code thus tries to use "my-fs" as the directory, whereas it
should use "/some/directory".
 
This is the (correct) code in the latest version from the author
(http://members.a1.net/t.link/CompEmacsFilesets.html). (The comment is
from the author.)
 
((:tree)
 ;;well, the way trees are handled is a mess +++
 (let* ((dirpatt (if (consp (nth 1 entry))
                     (filesets-entry-get-tree entry)
                   entry))
        (dir     (nth 0 dirpatt))
        (patt    (nth 1 dirpatt)))
   (filesets-list-dir dir patt ':files t)))
 
However, I think the following would be sufficient:
 
((:tree)
 (let* ((dirpatt (filesets-entry-get-tree entry))
        (dir  (nth 0 dirpatt))
        (patt (nth 1 dirpatt)))
   (filesets-directory-files dir patt ':files t)))
 
I don't see why the author's more complex treatment would ever be
needed, since in order for the :tree clause of the `case' to be
reached (consp (nth 1 entry)) must be a cons, AFAICT.
 
At any rate, either the author's code or what I suggest immediately
above is needed. There is no way that the current GNU Emacs code can
work with a :tree fileset.


In GNU Emacs 23.0.60.1 (i386-mingw-nt5.1.2600)
 of 2008-09-03 on LENNART-69DE564
Windowing system distributor `Microsoft Corp.', version 5.1.2600
configured using `configure --with-gcc (3.4) --no-opt --cflags -Ic:/g/include
-fno-crossjumping'
 





[-- Attachment #3: Type: text/plain, Size: 110 bytes --]

-- 
guile-email mailing list
guile-email@systemreboot.net
https://lists.systemreboot.net/listinfo/guile-email


* Re: [guile-email] parse-email-headers returns just “fields”
  2020-04-21 12:24 [guile-email] parse-email-headers returns just “fields” Ricardo Wurmus
@ 2020-04-23  1:26 ` Arun Isaac
  2020-04-23  1:26   ` Arun Isaac
  2020-04-23  6:35   ` Ricardo Wurmus
  0 siblings, 2 replies; 9+ messages in thread
From: Arun Isaac @ 2020-04-23  1:26 UTC
  To: Ricardo Wurmus, guile-email

[-- Attachment #1: Type: text/plain, Size: 908 bytes --]


> In some cases the emails don’t seem to have a
> content-transfer-encoding header,

This is not a problem. RFC2045 specifies that
"Content-Transfer-Encoding: 7BIT" should be assumed if the
Content-Transfer-Encoding header is not present. guile-email implements
this recommendation.
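
A minimal sketch of that fallback rule, for illustration only. It assumes
the parsed headers are an alist keyed by symbols with symbol values; the
procedure name effective-transfer-encoding is hypothetical and is not
part of guile-email.

```scheme
;; Sketch only: the alist layout is an assumption made for illustration,
;; not necessarily the exact structure guile-email returns.
(define (effective-transfer-encoding headers)
  "Return the Content-Transfer-Encoding in HEADERS, or 7bit when absent."
  (or (assq-ref headers 'content-transfer-encoding)
      '7bit))

;; No encoding header present: fall back to the RFC 2045 default.
(effective-transfer-encoding '((subject . "hello")))

;; An explicit header takes precedence over the default.
(effective-transfer-encoding '((content-transfer-encoding . base64)))
```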

> * I sometimes need to discard the first two lines of the raw email to
>   get the headers to be fully parsed

Your attachment is an mbox, not a raw email. Perhaps you are treating it
as a raw email and that's why you have to chop off the first line? And,
I'm guessing your other problems are also related to this.

The following snippet works for me. Could you confirm?

--8<---------------cut here---------------start------------->8---
(use-modules (email email)   ; parse-email, mbox->emails
             (srfi srfi-1))  ; first

(parse-email
 (first
  (call-with-input-file "/path/to/your/attachment"
    mbox->emails)))
--8<---------------cut here---------------end--------------->8---
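
For text that is already in memory as a string, the difference between an
mbox entry and a raw email comes down to the leading "From " separator
line. The following is a rough sketch of stripping it by hand, using only
core Guile string procedures. The name strip-mbox-from-line is
hypothetical, and LF line endings are assumed.

```scheme
;; Sketch only: drop a leading mbox "From " separator line, if present.
;; A "From:" header is left alone because it has no space after "From".
(define (strip-mbox-from-line text)
  (if (string-prefix? "From " text)
      (let ((end-of-line (string-index text #\newline)))
        (if end-of-line
            (substring text (+ end-of-line 1))
            ""))                        ; the separator was the only line
      text))
```

Using mbox->emails as shown above remains the simpler route whenever the
input really is a well-formed mbox.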

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* Re: [guile-email] parse-email-headers returns just “fields”
  2020-04-23  1:26 ` Arun Isaac
  2020-04-23  1:26   ` Arun Isaac
@ 2020-04-23  6:35   ` Ricardo Wurmus
  2020-04-23 11:31     ` Arun Isaac
  1 sibling, 1 reply; 9+ messages in thread
From: Ricardo Wurmus @ 2020-04-23  6:35 UTC
  To: Arun Isaac; +Cc: guile-email


Arun Isaac <arunisaac@systemreboot.net> writes:

>> In some cases the emails don’t seem to have a
>> content-transfer-encoding header,
>
> This is not a problem. RFC2045 specifies that
> "Content-Transfer-Encoding: 7BIT" should be assumed if the
> Content-Transfer-Encoding header is not present. guile-email implements
> this recommendation.

Hmm, I still get this error when the first line of an email to be
parsed is something like

    From debbugs-submit-bounces@debbugs.gnu.org Wed Apr 22 11:26:37 2020

The error message is

    Body decoding failed. Unknown encoding #f

So it fails to parse the headers correctly, ends up not finding the
header specifying the encoding, and then aborts body decoding.

>> * I sometimes need to discard the first two lines of the raw email to
>>   get the headers to be fully parsed
>
> Your attachment is an mbox, not a raw email. Perhaps you are treating it
> as a raw email and that's why you have to chop off the first line? And,
> I'm guessing your other problems are also related to this.

Oh, this may be.  However, I still get the same problem when I don’t
manually discard the first line if it looks like this:

    From debbugs-submit-bounces@debbugs.gnu.org Wed Apr 22 11:26:37 2020

I’m splitting the debbugs log file containing the messages and collecting
all lines between the delimiters.  I then do

  (parse-email
   (and=> (call-with-input-string contents mbox->emails) first))

But this will only work if I drop the first line from “contents”.

-- 
Ricardo


* Re: [guile-email] parse-email-headers returns just “fields”
  2020-04-23  6:35   ` Ricardo Wurmus
@ 2020-04-23 11:31     ` Arun Isaac
  2020-04-23 11:31       ` Arun Isaac
  2020-04-23 14:40       ` Ricardo Wurmus
  0 siblings, 2 replies; 9+ messages in thread
From: Arun Isaac @ 2020-04-23 11:31 UTC
  To: Ricardo Wurmus; +Cc: guile-email

[-- Attachment #1: Type: text/plain, Size: 837 bytes --]


> Hmm, I still get this error when the first line of an email to be
> parsed is something like
>
>     From debbugs-submit-bounces@debbugs.gnu.org Wed Apr 22 11:26:37 2020

Did the parse-email snippet I sent in my previous mail work on the
attached mbox you sent earlier? Could you send me any other mboxes you
are having trouble with? I'm unable to reproduce your problem, and
having a copy of the problematic mboxes would help very much.

>   (parse-email
>    (and=> (call-with-input-string contents mbox->emails) first))

Also, it isn't a good idea to handle emails or mboxes as strings (the
`contents' variable). They should be handled as bytevectors. Perhaps it
doesn't matter here, but just saying.

And, just out of curiosity, if there's no sensitive information in it,
could you share a copy of a debbugs log file?

Thank you!

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]


* Re: [guile-email] parse-email-headers returns just “fields”
  2020-04-23 11:31     ` Arun Isaac
  2020-04-23 11:31       ` Arun Isaac
@ 2020-04-23 14:40       ` Ricardo Wurmus
  2020-04-23 21:54         ` Arun Isaac
  1 sibling, 1 reply; 9+ messages in thread
From: Ricardo Wurmus @ 2020-04-23 14:40 UTC
  To: Arun Isaac; +Cc: guile-email

[-- Attachment #1: Type: text/plain, Size: 1091 bytes --]


Arun Isaac <arunisaac@systemreboot.net> writes:

> Did the parse-email snippet I sent in my previous mail work on the
> attached mbox you sent earlier?

Yes, this worked.  Thank you.  In any case this simplifies the code.

>>   (parse-email
>>    (and=> (call-with-input-string contents mbox->emails) first))
>
> Also, it isn't a good idea to handle emails or mboxes as strings (the
> `contents' variable). They should be handled as bytevectors. Perhaps it
> doesn't matter here, but just saying.

Yeah, I figured as much, but I shied away from reading the file in
bytevector chunks that would then need to be searched for control
characters to split the parts of the log file.  I’ll probably do that
later, but for the first pass I just decided to use read-line.

> And, just out of curiosity, if there's no sensitive information in it,
> could you share a copy of a debbugs log file?

An example is attached.  I picked it arbitrarily from the many logs that
fail when I don’t drop the first line.

The mbox begins at the first ^G and ends at the next ^C.
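
A minimal sketch of that delimiter rule, assuming the whole log has
already been read into a string (as in the read-line based first pass).
The names extract-between and first-mbox-region are hypothetical.

```scheme
;; Sketch only: return the text strictly between the first START-CHAR and
;; the next END-CHAR after it, or #f when either delimiter is missing.
(define (extract-between text start-char end-char)
  (let* ((start (string-index text start-char))
         (end (and start (string-index text end-char (+ start 1)))))
    (and end (substring text (+ start 1) end))))

;; ^G is character code 7 and ^C is character code 3.
(define (first-mbox-region log-text)
  (extract-between log-text (integer->char 7) (integer->char 3)))
```

The region returned this way still includes whatever follows ^G on the
same line, so further trimming may be needed before parsing.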


[-- Attachment #2: 40755.log --]
[-- Type: application/octet-stream, Size: 11557 bytes --]

[-- Attachment #3: Type: text/plain, Size: 12 bytes --]


--
Ricardo


* Re: [guile-email] parse-email-headers returns just “fields”
  2020-04-23 14:40       ` Ricardo Wurmus
@ 2020-04-23 21:54         ` Arun Isaac
  2020-04-23 21:54           ` Arun Isaac
  0 siblings, 1 reply; 9+ messages in thread
From: Arun Isaac @ 2020-04-23 21:54 UTC
  To: Ricardo Wurmus; +Cc: guile-email


[-- Attachment #1.1: Type: text/plain, Size: 1668 bytes --]


> Yeah, I figured as much, but I shied away from reading the file in
> bytevector chunks that would then need to be searched for control
> characters to split the parts of the log file.  I’ll probably do that
> later, but for the first pass I just decided to use read-line.

I understand. Binary read is a little painful. Perhaps you could use the
read-bytes-till function in email/utils.scm of guile-email. It is kinda
internal to guile-email. So, if you're using it, you should probably
copy it into your source tree.

> The mbox begins at the first ^G and ends at the next ^C.

Ah, I see the problem. This is actually a bug on debbugs' part. The
mbox/email starting at line 194 is invalid. It is neither a valid email
nor a valid mbox. For it to be a valid mbox, the "From ..." line
(currently at line 195) should be the first line. It should not occur in
between the email headers as it does now. For it to be a valid email,
the "From ..." line should not occur at all.

I guess the only workaround is to find and delete the "From ..."
line. Here's one possible way to do it.

--8<---------------cut here---------------start------------->8---
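(use-modules (email email)          ; parse-email
             (email utils)          ; read-bytes-till
             (ice-9 textual-ports)  ; get-line, unget-string
             (rnrs bytevectors))    ; make-bytevector

(parse-email
 (call-with-input-file "/path/to/40755.log"
   (lambda (port)
     ;; Skip everything up to the first ^G (#x07), where the mbox begins.
     (read-bytes-till port (make-bytevector 1 #x07))
     (get-line port)
     (get-line port)
     (let ((possible-from-line (get-line port)))
       ;; Drop a stray "From ..." line; push any other line back.
       ;; get-line strips the newline, so restore it when ungetting.
       (unless (string-prefix? "From " possible-from-line)
         (unget-string port (string-append possible-from-line "\n")))
       ;; Read the rest of the message, up to the next ^C (#x03).
       (read-bytes-till port (make-bytevector 1 #x03))))))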
--8<---------------cut here---------------end--------------->8---
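
If the message text is already held as a string, the same cleanup can be
sketched without port operations: drop any "From " line that appears
inside the header block and leave the body untouched. This is only an
illustration. The name drop-stray-from-lines is hypothetical, and LF line
endings are assumed.

```scheme
(use-modules (srfi srfi-1))   ; take-while, drop, remove

;; Sketch only: remove mbox-style "From " lines from the header block.
;; The header block is taken to end at the first empty line.
(define (drop-stray-from-lines text)
  (let* ((lines (string-split text #\newline))
         (headers (take-while (lambda (line) (not (string-null? line)))
                              lines))
         (body-lines (drop lines (length headers))))
    (string-join
     (append (remove (lambda (line) (string-prefix? "From " line)) headers)
             body-lines)
     "\n")))
```

A genuine "From:" header survives because the prefix test requires a
space directly after "From", and body lines that happen to begin with
"From " are not touched.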

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 487 bytes --]

