Thursday, June 25, 2009

pegasus.ocssolutions.com Filesystem Issue

NOTE: This issue is now resolved.

We are shutting down Pegasus for a file system check and a kernel upgrade to address recent issues on the server. We anticipate that it will be down for no more than 20 minutes.

UPDATE 2:29 AM CDT: The outage is lasting a bit longer than we anticipated. We apologize for the inconvenience. We will update again very soon on the status of this.

UPDATE 4:00 AM CDT: The file system issue is nearly resolved and we anticipate being able to run a final series of checks in a few minutes. The server will be up shortly.


Sunday, April 12, 2009

pegasus.ocssolutions.com Reboot

NOTE: This issue is now resolved.

We have to reboot pegasus.ocssolutions.com due to a filesystem issue. We expect the server to be online within a few minutes. We will provide an update as soon as possible.

UPDATE 3:36 AM CDT - The server is still checking the filesystem. It should be complete very soon. When this is complete, we will do a final reboot and bring the server fully back up.

UPDATE 3:55 AM CDT - The server is now back online.

Tuesday, April 7, 2009

sol.ocssolutions.com RAID Drive Replacement

NOTE: This issue is now resolved.

Our advanced monitoring systems detected that one of the drives in sol.ocssolutions.com needs to be replaced. While this isn't an absolute emergency, it does need to be attended to quickly. We have scheduled this for 8 PM CDT tonight (April 7, 2009).

We expect downtime of around 15-20 minutes. We will post updates related to this issue here.

Tuesday, March 3, 2009

VDS Host #5 Issues

NOTE: This issue is now resolved.

We are currently experiencing a hardware issue on one of our VDS servers. It looks like its network card has failed, and we're replacing it now. It should be back up momentarily. Once the server is back online, VDSs can take up to 10 minutes to fully spin up.

We appreciate your patience on this issue and will post an update once it is back up.

Tuesday, February 3, 2009

Scheduled Urgent Maintenance on a VDS Host

NOTE: This issue is now resolved.

One of our VDS hosts has reported that one of its drives in the RAID array has degraded and must be replaced. All data is safe due to the RAID array we use, but the drive must be replaced for optimal operation.

The operation to replace the drive is simple and we expect the machine to be down no more than 15 to 20 minutes. This maintenance will affect a very small percentage of our VDS customers.

The maintenance will occur at 2 AM Central Standard Time on February 4th, 2009.

This maintenance will also affect some Subversion and Trac accounts.

Tuesday, January 13, 2009

avalon.ocssolutions.com Situation Resolved

The machine avalon.ocssolutions.com is back up and now fully operational.  The network is responding normally to it.

This post is intended to give a full breakdown of what happened.  In 11 years we've never had an outage of this length, and while I'd like to attribute it to one specific cause, several things posed a problem during diagnostics.  Since this was a very unusual situation, we had to work through many steps before finding the root cause.

The problem seems to have originated a few weeks ago when a temporary re-routing was applied to get around some latency on one of the Savvis links.  This worked well, but had some unforeseen consequences later down the road when a card failed in one of the distribution switches.

Diagnostics included replacing network hardware, isolating the server from the hardware firewall, and applying various updates to the dists and cores.  The core problem was difficult to diagnose, but once it was found, the fix was implemented.

The network distribution cores and routers had to have the temporary settings cleared and reloaded.  Once this was done, we shut down the machine and moved it back on the rack, and the problem was resolved.

While we made every effort to be in constant contact with all of our affected customers, we realize that we could not get to everyone.  We also understand your frustration at the incident.  I also want to take a minute to thank everyone for being so patient with us and for all of the kind words we received during this time.

We will be issuing a credit of 10 times the actual downtime, which is 12 hours times 10, yielding 5 days of credit.  The credit will be applied on your next billing cycle.
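For anyone who wants to check the arithmetic, the credit calculation can be sketched as below. This is only an illustration under stated assumptions: the function name `service_credit` and the constant `CREDIT_MULTIPLIER` are made up for this example and do not describe any actual billing code.

```python
from datetime import timedelta

# From the post: the credit is 10 times the actual downtime.
CREDIT_MULTIPLIER = 10  # hypothetical constant name, for illustration only


def service_credit(downtime: timedelta, multiplier: int = CREDIT_MULTIPLIER) -> timedelta:
    """Return the amount of service time credited for an outage of the given length."""
    return downtime * multiplier


# The outage described in this post lasted 12 hours.
credit = service_credit(timedelta(hours=12))
print(credit)  # 5 days, 0:00:00
```

Twelve hours times 10 is 120 hours, which is the 5 days quoted above.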

If you have any questions, please feel free to open a ticket and let us know.

Monday, January 12, 2009

avalon.ocssolutions.com Outage

NOTE:  The issue has been resolved.  Please see next post for details.

A bit before noon Central Standard Time today, avalon.ocssolutions.com began experiencing an intermittent outage. We started investigating as soon as our monitors went off, but the problem was not immediately apparent and required further diagnostics. Because latency issues like this are sporadic and sometimes difficult to track down, we did not immediately post about this while we worked with our network vendors to resolve the issue.

The issue was eventually traced to a faulty network card. We do not normally suspect faulty hardware first with symptoms like those experienced today, but this was indeed the case.  The card was replaced, but the issue is still occurring.  Our level 3 technicians are working on this issue now and will have the server back up as soon as humanly possible.

We understand downtime can be very harmful to your site and business, and we take issues like this very seriously. We average 99.99% uptime on all of our servers, but unfortunately this does mean that at some point during the year, we do have an outage that can last several hours. When such an outage occurs, we know the second it does, and technicians work as quickly as possible to restore service. We appreciate your patience during this event.

While we have been keeping customers up to date on the situation via our ticket system, we had an error on our status blog that kept us from posting on this issue. We apologize for this delay and have remedied the situation.

UPDATE 6:01 PM CST:

We have confirmed that the issue is neither with the server nor with our immediate network hardware.  Our upstream vendors and level 3 technicians are working with us minute by minute to resolve this.  We are passing this update along to keep you up to date.

UPDATE 6:57 PM CST:

Problem has been traced to a bad card in the distribution switch.  Switching over to redundant card now and readjusting settings.  Will update momentarily.

UPDATE 8:13 PM CST:

While slow, avalon.ocssolutions.com is up.  Unfortunately the packet loss is substantial and data isn't getting in and out of it like it should.  We know for certain the issue is one of the core distribution switches and we're working both with the upstream and hardware vendors to get this rectified.

UPDATE 11:50 PM CST:

We apologize for the continued poor network performance on avalon.ocssolutions.com.  The issues have been complex and related to the internal network.  We have even gone as far as rebooting internal distribution switches and core routers to resolve the problem, but to no avail.  The networking team is working closely with our upstream providers, hardware vendors (in this case, Cisco), and our office and support teams to effect a fix as soon as possible.  We realize that the delay has been unusual and that it is seriously affecting some of our customers.  We will be posting a full analysis of the situation once it is resolved, but in the meantime are putting 100% of our efforts into fixing the situation.  Please accept our apologies for the inconvenience.  We appreciate your patience during this difficult situation.

UPDATE 12:16 AM CST:

We have made significant progress on reducing network problems on avalon.ocssolutions.com.  Traffic is coming through much more quickly now.  We are planning to shut down the machine in a few minutes to make one final configuration change in hopes of completely eliminating the problem.

UPDATE 12:30 AM CST:

As explained in the last update, we are shutting down the machine to put it back on the rack.  This process will take 5-10 minutes, and your site will be down during this time.  After the machine is powered back on in the rack, things will return to normal.  A full technical explanation will be posted shortly after that.

RESOLVED - Details in next post.