5/20 Power Outage RFO
Severe storm cells came through the North Georgia region this evening. AtlantaNAP experienced an over-current fault outage on one of our two main feeds. The affected feed is the original feed, which currently has the most load connected to it. The more load connected to a feed, the more of the lightning surge and over-current will try to pass into that system. A lightly loaded feed, like our new feed which is currently only at 1/6th load, does not draw much of that current. Our first system is currently at 65% load, so it tried to absorb much more of the lightning strike than the other one, which is why its main breaker went into over-current fault.

I have spoken with all of our key electrical engineers associated with the building at this point. According to Georgia Power and our PSSI and Cummins engineers, we likely took a lightning strike to the utility very near the facility, which caused an over-current fault on the main incoming breaker on our first set of switchgear. The breaker is designed to trip in the event of this kind of fault to protect the gear (your computers) inside the building from being burned up by the lightning strike.

When this type of fault happens, the control computer will not start the generators until an engineer verifies where the fault is. This is because a fault inside the building's wiring plant, such as a short caused by a damaged main feeder wire, could also cause this kind of over-current. In that case it would be very dangerous to turn the power back on manually, or to force a manual start of the gen sets and push current into a system with a fault remaining. Lives and machinery could be lost. We dispatched several of our staff to visually inspect for faults (we did not want to turn something on and have it fry everyone's gear). They found none, we verified it was likely a lightning strike, and we manually started the generators to restore power.
Unfortunately the UPS system is only designed to carry that load for 10 minutes, which was not enough time for us to safely verify the fault and do a manual start. This is apparently a rare event: a direct utility strike that close, which does not get dissipated before it hits us. The farther away from your site the strike occurs, the more other load and grounds it has to dissipate into before it gets to you.

The good news is we did not burn up any equipment. Some of you did not lose power because you were connected to the other, lightly loaded feed coming in, and at only 18% load it did not see enough current to overwhelm its breaker. Some of you lost network connectivity because the downstream feeder switches your computers are connected to are single power supply units.

We are in the process of examining a facility-wide network upgrade that would move to a newer chassis-based solution throughout the facility. We started looking at this as a way to offer new service capabilities that many of you have been asking for. It is a costly upgrade and will bring redundancy, but it also brings some pitfalls, since you have more connections into a single chassis. We are still looking at this and will keep you up to date on the direction we decide to move.

The engineers have told me that under normal operating conditions there is really nothing we could have done, and we should simply be glad we had good equipment installed that kept our computers from being fried. I am thankful that I am not looking at a lot of damaged equipment that could not simply be turned back on; that would be a disaster I do not want to deal with. At this point it seems like the new switchgear with over-current protection was a good investment.
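
For anyone who wants the sequence of events spelled out, here is a rough sketch of the lockout behavior described in this post. It is only an illustration, not the actual switchgear controller logic: the class and function names (`TripCause`, `Switchgear`, `load_dropped`) are made up for this example, and the 25-minute restore time is a hypothetical figure. Only the 10-minute UPS runtime comes from the post.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TripCause(Enum):
    UTILITY_LOSS = auto()  # utility power simply went away
    OVERCURRENT = auto()   # main breaker tripped on an over-current fault


@dataclass
class Switchgear:
    """Hypothetical model of the generator start interlock."""
    trip_cause: TripCause
    fault_cleared_by_engineer: bool = False

    def may_start_generators(self) -> bool:
        # A plain utility loss is assumed safe for an automatic start.
        if self.trip_cause is TripCause.UTILITY_LOSS:
            return True
        # After an over-current trip the fault could be inside the building,
        # so the start is held until someone has inspected and cleared it.
        return self.fault_cleared_by_engineer


def load_dropped(ups_runtime_min: float, minutes_to_restore: float) -> bool:
    """True if the UPS runs out before power is restored."""
    return minutes_to_restore > ups_runtime_min


gear = Switchgear(TripCause.OVERCURRENT)
print(gear.may_start_generators())   # False: locked out until inspected
gear.fault_cleared_by_engineer = True
print(gear.may_start_generators())   # True: manual start is now allowed
print(load_dropped(10, 25))          # True: 25 min is a hypothetical figure
```

The point of the sketch is the ordering: after an over-current trip, the inspection has to finish before the generators can start, so any inspection that takes longer than the UPS runtime means the load drops.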
|
||||
|
From the techs at the DC:
Vern is running an fsck on the VZ directory. Ratbite is up, but they cannot get it to reach the outside network; they are checking on it now.
__________________
::::: 01001100 00110011 00110011 00110111
|
||||
|
Fsck takes a long time on VPS because VPS servers usually hold:
(very large sites + an OS + cPanel) * "X" servers, which takes up a lot of disk space that needs to be checked. Basically it's like checking 15 to 20 or so dedicated servers, and the machine cannot come back online until fsck checks them all. Most people on VPS are on the upgrade path to dedicated, and that is why they have such large sites. It is one of the largest drawbacks to VPS hosting, IMHO.
__________________
::::: 01001100 00110011 00110011 00110111
Last edited by Matt; 05-20-2008 at 11:58 PM.
|
|||
|
How long is long? Plus, how long have they been down? The one day I decide to take a few hours off.
Curious minds want to know if Vern will be back up and running by the start of business tomorrow (PST).
|
||||
|
As you may have noticed, Vern is finally done with fsck. If any of you have problems, please open a ticket. Alex is standing by, waiting for your problems. We would have updated you a little sooner about the server being online, but we were busy putting out other fires.
I am sure most of you noticed the machines back online. Sorry that it took so long to respond.
__________________
::::: 01001100 00110011 00110011 00110111
|
||||
|
Still working on getting parts for Ratbite. Looks like some of them were hosed in the power fault. We haven't forgotten about you. We will try to get it solved before the phones start ringing for you.
__________________
::::: 01001100 00110011 00110011 00110111