Today between 18:00 and 18:15 CEST, one of our DNS resolvers (220.127.116.11) experienced an issue. Our second DNS resolver (18.104.22.168) was not affected. The situation is stable again. We will follow up with a root cause analysis.
Root Cause Analysis
On 2018-08-21 from 18:00 to 18:15 CEST, one of our two DNS resolvers (22.214.171.124) stopped responding to queries. The cause was a temporarily increased log level on this resolver, which quickly filled its root partition. The second resolver (126.96.36.199) was not affected and continued to work as expected. To prevent a similar issue in the future, we have increased the size of the root partition on both resolvers and improved our internal monitoring.
In its default configuration, GNU libc always queries the first resolver listed in /etc/resolv.conf and only falls back to the subsequently listed resolvers after a timeout. If the first resolver fails, lookups are delayed by that timeout before the next resolver is tried, which degrades DNS performance.
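To illustrate the effect, here is a small Python sketch of a simplified model of that fallback behaviour. It is not the glibc implementation: the function name and the `start` parameter (which resolver is tried first) are hypothetical, and the model only counts the time spent waiting on unresponsive resolvers during a single pass through the list.

```python
def first_answer_delay(resolvers_up, timeout_s, start=0):
    """Seconds spent waiting on dead resolvers before a live one is tried.

    resolvers_up: list of booleans in resolv.conf order (True = responding).
    timeout_s:    per-resolver timeout in seconds.
    start:        index of the resolver tried first (0 = always the first one).
    Returns None if no resolver responds within one pass through the list.
    """
    n = len(resolvers_up)
    waited = 0
    for i in range(n):
        if resolvers_up[(start + i) % n]:
            return waited
        waited += timeout_s  # this resolver timed out, move on to the next
    return None


# Scenario from this incident: first resolver down, second resolver up.
up = [False, True]
delay_default = first_answer_delay(up, timeout_s=5)           # always starts at the first resolver
delay_shorter = first_answer_delay(up, timeout_s=2)           # same order, shorter timeout
delay_second_first = first_answer_delay(up, timeout_s=2, start=1)  # healthy resolver tried first
```

In this model, a lookup that starts at the failed resolver waits one full timeout before the healthy resolver is asked, while a lookup that starts at the healthy resolver is not delayed at all.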
You can reduce the impact of a failed resolver by configuring your system to rotate through all resolvers and by lowering the timeout from its default of 5 seconds to 2 seconds. To do so, set the following options in your /etc/resolv.conf:
options rotate timeout:2
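For illustration, a complete /etc/resolv.conf using these options might look like the sketch below. The nameserver addresses are placeholders from the documentation range 192.0.2.0/24, not our resolvers; substitute the resolver addresses you actually use.

```
# /etc/resolv.conf (sketch; the addresses below are placeholders)
nameserver 192.0.2.1
nameserver 192.0.2.2
options rotate timeout:2
```

On systems where /etc/resolv.conf is generated automatically (for example by a DHCP client or a network manager), manual changes may be overwritten, so the options may need to be set in that tool's configuration instead.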
Please accept our apologies for the inconvenience this incident may have caused you and your customers. We continue to do our best to prevent such situations in the future.