NOTE: The issue has been resolved. Please see next post for details.
A bit before noon Central Standard Time today, an intermittent outage began on avalon.ocssolutions.com. We started investigating as soon as our monitors went off, but the problem was not immediately apparent and required further diagnostics. Because latency issues like this are sporadic and sometimes difficult to track down, we did not immediately post about it while we worked with our network vendors to resolve the issue.
The issue was eventually traced to a faulty network card. We would not normally suspect faulty hardware from the symptoms experienced today, but that was indeed the case. The card was replaced, but the issue is still occurring. Our level 3 technicians are working on this issue now and will have the server back up as soon as humanly possible.
We understand downtime can be very harmful to your site and business, and we take issues like this very seriously. We average 99.99% uptime on all of our servers, but unfortunately this does mean that at some point during the year we may have an outage that lasts several hours. When such an outage occurs, we know the second it does, and technicians work as quickly as possible to restore service. We appreciate your patience during this event.
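For a rough sense of what an uptime percentage means in practice, the arithmetic can be sketched as below. This is only an illustration, not how our monitoring calculates uptime: the function name `downtime_budget_minutes` is hypothetical, and the sketch assumes a 365-day year. Because our figure is an average across all servers, a single server can exceed this budget in a given year while others see no downtime at all.

```python
def downtime_budget_minutes(uptime_percent: float,
                            period_hours: float = 365 * 24) -> float:
    """Return the minutes of downtime implied by an uptime percentage.

    Assumes a 365-day year by default (an assumption for illustration).
    """
    downtime_fraction = (100.0 - uptime_percent) / 100.0
    return downtime_fraction * period_hours * 60.0


if __name__ == "__main__":
    # 99.99% uptime over one year works out to roughly 52.6 minutes.
    print(f"{downtime_budget_minutes(99.99):.1f} minutes per year")
```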
While we have been keeping customers up to date on the situation via our ticket system, an error on our status blog kept us from posting about this issue. We apologize for the delay and have remedied the situation.
UPDATE 6:01 PM CST:
We have confirmed that the issue is neither with the server nor with our immediate network hardware. Our upstream vendors and level 3 technicians are working with us minute by minute to resolve this. We are passing this update along to keep you informed.
UPDATE 6:57 PM CST:
Problem has been traced to a bad card in the distribution switch. Switching over to redundant card now and readjusting settings. Will update momentarily.
UPDATE 8:13 PM CST:
While slow, avalon.ocssolutions.com is up. Unfortunately, the packet loss is substantial and data isn't getting in and out of the server as it should. We know for certain the issue is with one of the core distribution switches, and we're working with both the upstream and hardware vendors to get this rectified.
UPDATE 11:50 PM CST:
We apologize for the continued poor network performance on avalon.ocssolutions.com. The issues have been complex and related to the internal network. We have even gone as far as rebooting internal distribution switches and core routers to resolve the problem, but to no avail. The networking team is working closely with our upstream providers, hardware vendors (in this case, Cisco), and our office and support teams to effect a fix as soon as possible. We realize that the delay has been unusual and that it is seriously affecting some of our customers. We will post a full analysis of the situation once it is resolved, but in the meantime we are putting 100% of our efforts into fixing it. Please accept our apologies for the inconvenience. We appreciate your patience during this difficult situation.
UPDATE 12:16 AM CST:
We have made significant progress on reducing network problems on avalon.ocssolutions.com. Traffic is coming through much more quickly now. We are planning to shut down the machine in a few minutes to make one final configuration change in the hope of completely eliminating the problem.
UPDATE 12:30 AM CST:
As explained in the last update, we are shutting down the machine to put it back on the rack. This process will take 5-10 minutes, and your site will be down during this time. After the machine is powered back on in the rack, service will return to normal. A full technical explanation will be posted shortly after that.
RESOLVED - Details in next post.