Posts Tagged ‘monitoring’

Monitoring: Compare performance of local and remote DNS cache on spam filters

Monday, September 30th, 2013

Managing large-scale email platforms for hosting companies is something I have a strong passion for. The battle between spammers and email administrators has been raging for more than 10 years, and the dynamic nature of email is constantly changing, which keeps it interesting. One of the continual business challenges is keeping up with Moore's law: with performance doubling approximately every two years, people assume that if the volume of email stays the same, the size of the email cluster should shrink to cut operating costs. Unfortunately it does not work like that, as the complexity of spam and malware-based emails has increased dramatically, which means more processing and technology is required to keep end users' inboxes clean. One of the methods consistently touted as increasing the performance of an email service is a local DNS cache. The problem is quantifying that claim: how much does running a local DNS cache and forwarder on every anti-spam email server actually improve overall anti-spam/email solution performance in the real world?

The logic is that by running a local cache, host name lookups should take less time, freeing up resources sooner to process more email. That logic only holds in certain cases, though. Not every sending domain is seen repeatedly, so a local cache will not help those lookups much at all. However, for domains or services that send high volumes of email and whose records can be cached (the TTL is not too small), the performance difference is remarkable.
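
To make the cache-hit effect concrete, here is a minimal Python sketch that times a cold lookup against an immediate repeat of the same lookup. It assumes the dnspython package and a resolver listening on 127.0.0.1; the package, address, and domain are illustrative, not part of the original test.

import time
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["127.0.0.1"]  # the local caching resolver

def timed_lookup(name, rdtype="MX"):
    """Return the query time in milliseconds."""
    start = time.perf_counter()
    resolver.resolve(name, rdtype)
    return (time.perf_counter() - start) * 1000

# The first query is a cache miss and walks upstream; the second
# should be answered from the local cache until the record's TTL expires.
print("cold: %.2f ms" % timed_lookup("example.com"))
print("warm: %.2f ms" % timed_lookup("example.com"))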

Below is a chart comparing the average scan time of 500 email messages (out of the 17,000+ we ran the test over) with the local DNS cache versus the default remote DNS cache.

The numbers over the entire data set are as follows (all values are in seconds):

Local Mean : 0.298937447873 
Remote Mean : 0.814649430081 
Local Standard Deviation : 2.40180676356 
Remote Standard Deviation : 5.03549899418 
Local Minimum scan time : 0.0 
Remote Minimum scan time : 0.0 
Local Maximum scan time : 294.87 
Remote Maximum scan time : 589.74
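
For anyone repeating the exercise, these summary statistics can be reproduced with Python's standard library, assuming the scan times have been extracted into a plain text file with one value (in seconds) per line; the filename below is a placeholder, not the actual data set.

import statistics

with open("scan-times-local.txt") as f:
    times = [float(line) for line in f if line.strip()]

print("Mean               :", statistics.mean(times))
# Use statistics.stdev() instead if the sample (n-1) deviation is wanted.
print("Standard deviation :", statistics.pstdev(times))
print("Minimum scan time  :", min(times))
print("Maximum scan time  :", max(times))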

Over the entire data set, running a local DNS cache cut the mean scan time from roughly 0.81 seconds to 0.30 seconds. The statistics above also point out some interesting trends. The remote resolver still impacts performance even with the local caches in place, since they forward their cache misses upstream anyway. The architecture probably gives us some clues as to what causes the results to be so skewed, and it is very simple in this environment:

  • There are four email milter (email scanning) gateways, each running Postfix with Sophos PureMessage for UNIX on Ubuntu 12.04.
  • The first three milter/email filter boxes each run the PowerDNS Recursor, configured to forward to our hosting provider's corporate caching DNS resolver. If an entry is not in the local cache, the query is forwarded to the upstream; if the upstream does not have the entry in its cache either, the process continues up the chain. The upstream hosting provider's DNS server is an average of 2 milliseconds away. (A sketch of this forwarding configuration follows the list.)
  • Latency across the milter network is low (1 Gbps switched network).
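
For reference, the forwarding setup on each milter looks something like the recursor.conf below. This is a sketch using standard PowerDNS Recursor settings, not a verbatim copy of the production configuration; the upstream addresses are placeholders.

# /etc/powerdns/recursor.conf (sketch; upstream addresses are placeholders)
local-address=127.0.0.1        # answer queries from this host only
allow-from=127.0.0.0/8         # restrict clients to localhost
# Forward every cache miss to the upstream resolvers instead of
# recursing from the root servers directly.
forward-zones-recurse=.=192.0.2.53;192.0.2.54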

The servers' performance and the average email volumes and sizes do not change drastically between samples. All emails are legitimate customer emails spread across a user base of just under 500 active accounts.

Email Relay Distribution

The distribution of email relays plays an important part in any email service: if you know that a certain percentage of your email comes from a certain number of hosts, you can estimate the efficiency gain of a local cache, since a higher percentage of your DNS requests should be answered from it. Here is an interesting summary of the numbers (a sketch that derives figures like these from the mail logs follows the list).

  • 41% of all email comes from the top 100 email relays
  • 3924 total relays seen over the sample period
  • 3622 unique relays communicated with us via IPv4
  • 302 unique relays communicated with us via IPv6 (8.33%)
  • 2292 relays only delivered a single email during the sample period
  • 602 relays only delivered two emails during the sample period
  • 17985 emails delivered over the sample period
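
Numbers like these can be pulled from the mail logs with a short script. The sketch below assumes a Postfix-style log in which each accepted message records a client=hostname[address] pair; the log path and the regular expression are assumptions, not details from the original setup.

import re
from collections import Counter

client_re = re.compile(r"client=\S+?\[([0-9a-fA-F:.]+)\]")

relays = Counter()
with open("/var/log/mail.log") as log:
    for line in log:
        m = client_re.search(line)
        if m:
            relays[m.group(1)] += 1

total = sum(relays.values())
top100 = sum(count for _, count in relays.most_common(100))
print("emails delivered    :", total)
print("total relays seen   :", len(relays))
print("top 100 relay share : %.0f%%" % (100.0 * top100 / total))
print("single-email relays :", sum(1 for c in relays.values() if c == 1))
print("IPv6 relays         :", sum(1 for ip in relays if ":" in ip))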

To better visualize the data, here are two charts which sum up the relays by volume.

[Charts: relays summed by email volume]

CPU Load

CPU load over the period of the test was also logged at 1-second intervals, and the corresponding timestamps from the message logs were extracted to match against the CPU load log. Here is the chart.
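
The sampler itself can be as simple as the sketch below, which assumes a Linux host (reading /proc/loadavg) and appends one timestamped value per second; the output path is a placeholder, not the logger actually used in the test.

import time

# Append one "epoch-seconds load-average" line per second until
# interrupted with Ctrl-C.
with open("cpu-load.log", "a") as out:
    while True:
        with open("/proc/loadavg") as f:
            load1 = f.read().split()[0]  # 1-minute load average
        out.write("%d %s\n" % (int(time.time()), load1))
        out.flush()
        time.sleep(1)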

After investigating why there is such a spread between the remote and local DNS results, it seems that because the four milters reference two local name servers, the cache-miss rate for new domains is actually much higher when running on the local caches, and each miss means the local cache has to request the result from the upstream. Previously the upstreams returned those results directly from their caches, avoiding the additional queries required to serve the record. Once a domain was cached, however, the local DNS servers answered queries much faster.

Now let's review the results if we set the local DNS caches to query the Google Public DNS servers on 8.8.4.4 and 8.8.8.8. My guess is that because these servers are public and used by a huge number of users, their caches should be considerably bigger. The downside is that, from the test cluster, these servers are further away than the ISP's resolvers.
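
Switching the upstream is a one-line change to the recursor.conf sketched earlier (again illustrative, not a verbatim copy of the production config):

# Forward cache misses to Google Public DNS instead of the ISP resolvers
forward-zones-recurse=.=8.8.8.8;8.8.4.4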

[user@hostname:~]$ ping -c4 8.8.4.4
PING 8.8.4.4 (8.8.4.4): 56 data bytes
64 bytes from 8.8.4.4: icmp_seq=0 ttl=54 time=3.150 ms
64 bytes from 8.8.4.4: icmp_seq=1 ttl=54 time=6.270 ms
64 bytes from 8.8.4.4: icmp_seq=2 ttl=54 time=2.701 ms
64 bytes from 8.8.4.4: icmp_seq=3 ttl=54 time=2.535 ms

--- 8.8.4.4 ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 2.535/3.664/6.270/1.521 ms

[user@hostname:~]$ ping -c4 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=54 time=4.496 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=54 time=2.434 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=54 time=4.590 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=54 time=2.041 ms

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 2.041/3.390/4.590/1.162 ms

As we can see, the average response time is roughly 1 ms longer than to my ISP's resolver. Let's see what the results say. Stay tuned for the data, as it will take a few days to collate.

The raw data for the scan times can be downloaded here: puremessage-for-unix-scan-time