Open sourcing our email signature parsing library

Back in 2011 we had several customers ask us for a high level message parsing API that they could use to strip signatures and quotes from an email like you see below:

Email conversation

The problem

While simple for humans, this is actually quite a challenging task for machines. One of the main reasons is because there is no standard format for an email message. Different email clients compose replies in different manners and even within the same email client, the sender can change the format to whatever they choose. For example, users can place their reply after quoting the original message (bottom-posting):

At 10.01am Wednesday, Danny wrote:
> By the way, which systems will be updated? I had some network
> problems after last week's update.  Will I have to reboot?

No, you won't have to reboot.

before the quoted original message (top-posting):

No, you won't have to reboot.

-------- Original Message --------
From: Danny <danny@example.com>
Sent: Tuesday, October 16, 2007 10:01 AM
To: Jim <jim@example.com>
Subject: RE: Job

By the way, which systems will be updated? I had some network
problems after last week's update.  Will I have to reboot?

or even interleave their reply:

> Can you present your report an hour later?

Yes I can. The summary will be sent no later than 5pm.
Jim

At 10.01am Wednesday, Danny wrote:
>> 2.00pm: Present report
> Jim, I have a meeting at that time. Can you present your report an hour later?

In-fact, there are so many different ways to reply, there is even a Wikipedia article about it! All of this makes parsing the body of an email a challenging task.

Even with machine learning, we had to constantly adjust things. Email formatting is constantly changing, phone email clients are introducing new signatures like “Sent from your XXX phone”, new edge cases are discovered, etc.

Here is a simple example. Back in the day all email signatures were separated with dashes:
Signature separated by dashes So the first thing that comes to mind is to write a regex to detect dashes as a signature splitter and extract lines after it as signature:

>>> signature = regex.match("^[\s]*--*[\s]*[a-z \.]*$).*", message)

But the next thing you know you get an email like this:
Itemized List

And your parser strips off the most important part of the email. It's a very simple example and you could easily workaround it. But in real life things get much more complicated and tricky.

Our Solution

We did a lot of research, looked at all the variations of email that passes through Mailgun and came up with a solution based on some machine learning techniques. The solution has been in production for several years now, undergoing bug fixes and enhancements. Overall we have received positive feedback from customers, though naturally developers tend to point out where you could improve :)

So now you all have the chance to help improve the solution. Because of the constantly changing and distributed landscape of email, we've decided to tackle this problem with a distributed solution: we're open sourcing our library so we can hack on this together!

We're calling our new library talon after a multipurpose robot designed to perform missions ranging from reconnaissance to combat and operate in a number of hostile environments.

In case you want to start testing it right away, we've prepared a simple Demo app and a QuickStart Guide for you. Otherwise read on for a more general overview, approaches we took, and assessment results.

Here's how most common workflows look like:

Talon Workflow Diagram

Currently we use machine learning only to classify signature lines. The rest of the library are various heuristics and sanity checks we came up with while working on support tickets and analyzing message formatting patterns / trends.

The machine learning part of the library is inspired by the following research papers:

To classify signature lines we used SVM with Linear Kernel. To assess our classifiers we used 5-fold cross-validation:

Confusion Matrix

The dataset consisted of 2912 email lines. Out of 1030 signature lines 954 were classified correctly. Out of 1882 non-signature lines 147 were mistaken for signature. Overall it gives us 92% success rate and 78% area under the ROC curve. Which could be regarded as excellent and fair correspondingly.

When we modified the library for outsourcing, we tried to provide a sturdy skeleton while making it easy to add more meat. From experience, the parts that could use the most focus are the regexps for quotations / signature separators and html quotations extraction by html tags. However, you are certainly welcome to contribute to any part of the library you like.

We hope that you'll find the library useful and it makes your life easier.

Happy emailing!

comments powered by Disqus

Mailgun Get posts by email

Like what you're reading? Get these posts delivered to your inbox.

No spam, ever.