How connection pooling helped us cut delivery time in half, offer opportunistic TLS
This post was written by Russell Jones, a software developer at Mailgun responsible for open-sourcing our mime-parsing library flanker. Today he's going to blog about how we optimized outbound connections and reduced sending time while implementing opportunistic TLS. The same technique can be used to optimize everything from web crawling to high-throughput external APIs, but we'll discuss SMTP as an example.
In January of 2014 we decided that we needed to refocus on our core sending pipeline to reduce downtime and increase performance. That meant scaling and refactoring portions of Mailgun. I've focused on mail delivery, the last step an email takes in the Mailgun sending pipeline, where we transmit the actual message to the intended recipient, and that's what I'll be talking about today.
Our objective was simple: stay ahead of our growth so that, as we continue to add customers and send more mail, our customers don't experience any downtime or slowdown in delivery speed. We had a few concrete goals:
- Reduce the delivery time of an email.
- Reduce throttling we experience from recipient Email Service Providers (ESPs).
- Improve security by encrypting email delivery whenever possible (opportunistic TLS).
- Use monitoring to gain insight into the new SMTP engine so we can better track delivery time and throttling.
Our original SMTP engine was simple: for every email we pulled out of our delivery queue, we would open a connection to the server we were trying to deliver to, send the message, and then close the connection, as seen in Figure (1).
Figure 1: Original SMTP Engine Overview. Note that this is simplified: we establish the encrypted connection after the EHLO during the SMTP chat, using the STARTTLS verb.
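To make that flow concrete, here is a minimal sketch in Python using the standard library's smtplib. It is not Mailgun's actual engine; the function name deliver_once and the smtp_factory parameter are hypothetical, added only so the flow can be exercised without a real mail server.

```python
import smtplib


def deliver_once(host, sender, recipient, message, port=25,
                 smtp_factory=smtplib.SMTP):
    """Open a connection, send exactly one message, then close it.

    Returns True if the message went out over TLS, False otherwise.
    `smtp_factory` is a hypothetical hook for substituting a fake in tests.
    """
    conn = smtp_factory(host, port)          # TCP handshake
    try:
        conn.ehlo()                          # SMTP handshake
        used_tls = conn.has_extn("starttls")
        if used_tls:                         # opportunistic: only if offered
            conn.starttls()                  # TLS handshake
            conn.ehlo()                      # re-identify on the encrypted channel
        conn.sendmail(sender, [recipient], message)
        return used_tls
    finally:
        conn.quit()                          # the connection is never reused
```

Every call pays for all of the handshakes again, which is the cost described in the next paragraph.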
While incredibly simple and effective, this technique was wasteful at the scale Mailgun operates now. For every message we sent, we had the overhead of a TCP handshake, an SMTP handshake, and, if we were delivering over TLS, a TLS handshake. To give you some data, it would often take us over a minute and a half to deliver a message if we were trying to deliver over TLS. This is one of the reasons why we had not rolled out opportunistic TLS earlier: it was just too costly.
When we sat down and started thinking about improving delivery, we wanted to reduce the time it took to send a message as well as provide opportunistic TLS to our customers. Our solution was to send multiple messages per connection and use connection pooling to reuse already existing connections. Because Mailgun sends so much mail, finding a connection that is already open wasn't a problem, and it allowed us to eliminate the connection establishment overhead.
Figure 2: New SMTP Engine Overview. Note that this is simplified: if no free TCP connection is found, we establish a new connection.
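The idea of reusing open connections can be sketched roughly as follows, again in Python with smtplib. This is an illustration under stated assumptions, not the engine described here: the class SMTPConnectionPool, its connect hook, the max_idle_per_host limit, and the opened counter are all hypothetical, and the sketch is single-threaded (a real pool would need locking and idle timeouts).

```python
import smtplib
from collections import defaultdict, deque


class SMTPConnectionPool:
    """Keeps idle SMTP connections per destination host for reuse."""

    def __init__(self, connect=None, max_idle_per_host=4):
        self._connect = connect or self._default_connect
        self._idle = defaultdict(deque)      # host -> idle connections
        self._max_idle = max_idle_per_host
        self.opened = 0                      # how many handshakes we paid for

    @staticmethod
    def _default_connect(host):
        conn = smtplib.SMTP(host, 25, timeout=30)
        conn.ehlo()
        if conn.has_extn("starttls"):        # opportunistic TLS
            conn.starttls()
            conn.ehlo()
        return conn

    def acquire(self, host):
        idle = self._idle[host]
        while idle:
            conn = idle.popleft()
            try:
                if conn.noop()[0] == 250:    # cheap liveness check
                    return conn
            except (smtplib.SMTPException, OSError):
                pass
            conn.close()                     # stale or unhealthy, discard it
        self.opened += 1                     # no free connection: open a new one
        return self._connect(host)

    def release(self, host, conn):
        if len(self._idle[host]) < self._max_idle:
            self._idle[host].append(conn)
        else:
            conn.quit()

    def send(self, host, sender, recipient, message):
        conn = self.acquire(host)
        try:
            conn.sendmail(sender, [recipient], message)
        except smtplib.SMTPException:
            conn.close()                     # do not return a suspect connection
            raise
        self.release(host, conn)
```

Sending many messages to the same host through pool.send() opens one connection and reuses it, so pool.opened stays at 1 until a connection goes stale.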
This allows us to amortize the cost of the TCP and TLS handshakes over multiple messages, driving down their per-message cost while increasing delivery speed. The messages that could take a minute and a half or more to deliver now take about 600 ms or less, as you can see from Figure (3). It also allows us to fine-tune IP-to-recipient-ESP sending rates, which are critically important in email delivery, and to increase overall delivery while reducing the cost to ourselves and to ESPs. Being a better citizen in the email world leads to lower throttling, less resource utilization, and better delivery for customers.
Figure 3: On the left you see our original delivery time in seconds. On the right you see our current delivery time in milliseconds. Note the reduction of large delivery time spikes (ESP throttling) with the new SMTP engine.
We also monitor everything from delivery rate and memory usage to delivery time. This has allowed us to stay ahead of the health of the SMTP engine. We can now detect problems before they occur so customers are not impacted. This new data also helps us fight spammers, a never-ending battle, and helps narrow down where throttling is occurring so that we can improve Mailgun in other areas to reduce throttling and decrease delivery time.
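As a back-of-the-envelope illustration of amortization, the sketch below spreads a fixed handshake cost across the messages sent on one connection. The function name and the numbers in the note that follows are hypothetical, chosen only to show the shape of the curve; they are not measurements from Mailgun.

```python
def per_message_cost(handshake_s, transfer_s, messages_per_connection):
    """Average seconds per message when one handshake is shared by n messages."""
    if messages_per_connection < 1:
        raise ValueError("need at least one message per connection")
    return handshake_s / messages_per_connection + transfer_s
```

With an assumed 1.0 s of handshakes and 0.5 s to transfer a message, one message per connection costs 1.5 s each, ten messages per connection costs about 0.6 s each, and the average approaches the bare transfer time as the connection is reused more.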
Where we are going from here
The next logical step is to work on revamping our sending rate algorithms. Now that we have better delivery and monitoring, we can see when and where throttling occurs and what changes affect throttling. However, no amount of algorithmic change on our end can beat having good traffic. High-quality traffic trumps everything where email delivery is concerned. That is why our other big investment in 2014 is improving our reputation system to make it more accurate and to provide more data to our customers.
More to come on that front. Till then...