Operations at ARIN: New Blog Series and Recent Outage Information

Communicating with our community is vitally important to us,
especially when it comes to information about our services and operations. Currently,
we keep the community updated about the disposition of ARIN services at our
Public Policy and Members Meetings and directly through announcements. To keep
you better informed throughout the year, we are starting an operations-focused
blog series to update you on the development of new services, highlights about
our existing services, and any plans to modify our overall registry service
offerings. This series will feature posts from several different members of the
ARIN team to keep you current on operations at your Regional Internet Registry.

An important goal of this blog series is to openly address operational
topics, rather than serve as a marketing tool. As an example, today’s post covers an
operations mishap that affected our services at the end of December last
year.


One of the things we pride ourselves on at ARIN is our
ability to keep our services up and running – i.e. available to our customers
whenever they need them. When we do have
to schedule an outage, we communicate our plans to the community via
email and announcements posted on our website, and we do all that we can to
finish within the work window that we committed to in those announcements. Sadly,
we had a serious unplanned outage in December 2019 that we would like to explain.

On 23 December 2019, at 12:35 PM EST, ARIN Operations
received multiple alerts from monitoring systems concerning several virtual
machines that supported customer-facing ARIN services. Operations staff attempted to manually force the
failover of the virtual machines to other hardware that was on standby. However, this also failed, and at 12:55 PM we lost hosting
of the virtual machines in that part of our network. At that point,
both our website and our ARIN Online customer application were offline and
unavailable.

After troubleshooting, working with our virtualization
vendor, and rebuilding the virtualization nodes, we determined that a problem with access
to the shared storage platform used by the virtual machine cluster was the
likely culprit. We worked with our
storage system vendor on diagnosis, but they were unable to resolve
the storage problem in a timely manner. At 3:30 PM, given the length of the
outage and the uncertainty as to time-to-repair, we decided to swing ARIN’s
website to our disaster recovery site, a capability we maintain and test
so that it is available for such situations. We restored access to
the ARIN website by 4:00 PM and to our customer application
by 5:10 PM.
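
For readers who want the outage durations spelled out, the short Python sketch below works them out from the times reported above. The label names are illustrative only, and because the restorations are reported as "by" a given time, the results should be read as upper bounds rather than exact measurements.

```python
from datetime import datetime

# Times (EST, 23 December 2019) taken from the timeline in this post.
# The keys are illustrative labels, not official ARIN terminology.
TIMELINE = {
    "first_alerts": datetime(2019, 12, 23, 12, 35),
    "services_offline": datetime(2019, 12, 23, 12, 55),
    "failover_decision": datetime(2019, 12, 23, 15, 30),
    "website_restored": datetime(2019, 12, 23, 16, 0),
    "arin_online_restored": datetime(2019, 12, 23, 17, 10),
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two labelled events."""
    delta = TIMELINE[end] - TIMELINE[start]
    return int(delta.total_seconds() // 60)


# Upper bounds, since restoration is reported as "by" a given time.
website_outage = minutes_between("services_offline", "website_restored")
app_outage = minutes_between("services_offline", "arin_online_restored")
failover_time = minutes_between("failover_decision", "website_restored")

print(f"Website unavailable for up to {website_outage} minutes")
print(f"ARIN Online unavailable for up to {app_outage} minutes")
print(f"Website restored within {failover_time} minutes of the failover decision")
```

In other words, the website was unavailable for roughly three hours and ARIN Online for a little over four hours.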

While ARIN’s services were operational, we continued to work
with our storage system vendor through the Christmas holiday to determine the
underlying problem. By 2 January, they were able to replace the failed
component, and we were able to put the impacted cluster back in operation.
Because we knew that swinging the website back to the primary data center would
require an outage, we elected to wait to perform this work in conjunction with
a previously planned maintenance window on 25 January.

So, what went wrong? Our virtualization cluster is based on
a high-availability configuration with redundancy throughout the system. A thorough after-action review
found that a hardware failure, coupled with a system misconfiguration on
the storage appliance, caused the virtualization engine to fail. The hardware has
since been replaced, the corrected configuration has been validated by vendor
support, and everything has been put back into normal operation.

We are sorry for the inconvenience caused by the unplanned
outage, but also proud of our team for their quick thinking and hard work to
get us back online. We are committed to keeping
the community informed about the status of our services, and we hope you find this
operations-focused blog series a helpful window into operations
at ARIN.

The post Operations at ARIN: New Blog Series and Recent Outage Information appeared first on Team ARIN.
