Mailgun Post Mortem May 2016
Mailgun recently incurred several separate incidents that impacted the availability of our services. We’d like to take this opportunity to provide our customers with visibility around the root cause of these incidents along with details on what we’ve done to to address them and ensure they do not occur again in the future.
Distributed Denial of Service (DDoS) Attacks
Mailgun is frequently the target of large and varied distributed denial of service (DDoS) attacks. While many attacks are blocked with minimal disruption, we’ve experienced several incidents where there has been a prolonged impact to our services. In particular, an attack on May 23rd targeted portions of our primary data center and the method of the attack was unique enough that our hosting provider required nearly an hour to effectively identify the attack and deploy the appropriate mitigations that allowed us to restore services to 100%.
API/SMTP Timeouts or “SSL handshake” Errors
Beginning earlier in the month, we observed a steady increase in customers experiencing abnormal timeouts when using the Mailgun API/SMTP service. As the number of reports regarding this issue increased, it became clear that there was a systemic issue with our service.
We maintain several different systems for monitoring throughput and latency in our service and our own data wasn’t correlating to what customers were experiencing. After we completed the investigation of our application, we escalated this issue to our hosting provider to inspect our network infrastructure to determine if a problem could be identified in our managed load balancer or other network devices.
The initial troubleshooting sessions were inconclusive as the issue itself was intermittent and difficult to reproduce. After several attempts, we were finally able to verify that the requests that were timing out were not reaching our edge networking device, leading us to investigate upstream devices in the network.
We started inspecting the device that is responsible for protecting our infrastructure from DDoS attacks and we observed that when this device was disabled, we were no longer able to reproduce these timeouts and we immediately began investigating to understand what the possible causes were. After analysis with our hosting provider’s DDoS team, we discovered that our DDoS mitigation system was impacting legitimate traffic. After making adjustments to our countermeasures, we were able to eliminate these errors.
Intermittent 421 Errors
Mailgun returns a 421 error when we are unable to successfully queue a message. This error message is designed to notify the user that the message was not received by Mailgun and should re-attempted later. This is a normal part of SMTP and is used to signal to the sender to re-try the message with a delay.
Last week, we started to see elevated levels of 421s being returned. The cause of this error was due to performance degradation we were experiencing with our Cassandra clusters, which is where we persist messages for storage.
The cause of our Cassandra performance issues was due to a compaction bug in the version of Cassandra we were running that was causing compactions to stall and disk I/O spikes reducing overall Cassandra throughput. While the cluster was in this condition, we were intermittently unable to store messages resulting in the 421 errors.
- Alerting – While in many cases our DDoS mitigation system does not cause disruption, we’ve learned that it’s important for our Mailgun engineering team to know when the system is activated. Having this data allows us to more effectively correlate whether or not issues are being caused by these protections. We’ve already worked with our hosting provider to deploy an alerting system that alerts our engineers when these protections activate.
- Mitigation Profiles - We are tuning our DDoS mitigation profiles to improve our defensive posture while minimizing the impact to legitimate traffic. We’re working with our hosting provider to develop these profiles and expect this work to be completed this week.
- Cassandra Upgrade - We’ve started performing rolling upgrades of our Cassandra clusters to upgrade them to a version that is not impacted by the Cassandra compaction bug along with configuration adjustments that are more suitable for the type of workload.
- Infrastructure Design – We’re in the process of redesigning the underlying Mailgun infrastructure. This effort will give us a more robust network and deployment structure that will reduce the impact of similar types of attacks. This effort is underway and we will be sharing more details in the future.
Finally, while we know these types of incidents are challenging, the Mailgun team is committed to focusing on the plan above and any other steps necessary to ensure that you can continue relying on Mailgun for your e-mail delivery.