Site is back up and running - just wanted to say thanks

munitalP · Jul 2, 2012

ADMIN - thanks for the Facebook updates and info during this extended outage.

It must be so frustrating for you - as a group, I think we all feel for you at times such as this.

G

Mwenenzi · Jul 2, 2012

Admin may have had a few stiff drinks after the outage !

admin · Jul 2, 2012

As you can imagine, it was a pretty frustrating 18 hours for me. Unfortunately it came at a really bad time - a Sunday, and just the weekend when I moved house!! Once things settle (and I get my office working properly!) I will be investigating the entire AFF technical setup. An 18 hour downtime is completely unacceptable. If there are any AFFers with experience in hosting high-traffic vBulletin dedicated servers, please PM me. All help will be appreciated!

SeatBackForward · Jul 2, 2012

Also, I had to do a manual refresh of the site for it to reload (Ctrl+F5), in case any other users still have trouble accessing it.

v8Statesman · Jul 2, 2012

I slept through most of it. I'll blame jet lag (even though I didn't change time zones). Mostly a refusal to sleep in CX F. Wanted to enjoy the offerings.

Glad it's all working now though. Thanks Admin and all that got it up and going.

Sent from my Telstra iPhone using the Australian Frequent Flyer application.

samh004 · Jul 2, 2012

Glad we're back online



k3nnis · Jul 2, 2012

Thanks admin for your efforts

What was the issue?

kpc · Jul 2, 2012

As posted on Facebook, I thought it was the Malaysian government blocking access to AFF

...as I was never able to access AFF since checking into the Hilton KL. Glad to see the site back up again as I was starting to suffer from withdrawal effects. Thanks admin for getting it back up again, and the updates on Facebook!

docjames · Jul 2, 2012

I just assumed it went out in sympathy to QF?

What was the issue (is it known?).

markis10 · Jul 2, 2012

It's been a while since I have had to endure Sunday night television. Thinking of taking up renovating; 605K profit for 10 weeks' work seems a good deal.

admin · Jul 2, 2012

As I posted on FB, I don't usually go into details about our technical environment. But given the completely unacceptable 18+ hour outage yesterday, the inconvenience it caused everyone, and the generous offers of help and support from our members, I'm now going to publish the analysis from our Server Administrator. I know that many of you are technical enough to understand this and have an opinion. I'd rather this doesn't become a public discussion, but if you have any opinions/insights/suggestions, please PM me.

Apologies again.

This is a follow up in a separate ticket regarding the problem you had with AFF this weekend and to discuss options.

The issue in itself related to the SSD drive taking over the boot loader, which transpired after a power issue (although this would have shown up the next time the server was rebooted anyway). This shouldn't have taken more than 1-2 hours to identify and fix; the rest of the time was all logistical, in trying to get us to troubleshoot (without the tools) or the datacenter not responding or not being there - there was also a nearly 3 hour delay at one point on this end, which I explained.

Given the timing of it, the datacenter likely didn't have staff onsite, which caused an initial delay of a few hours and the initial lack of troubleshooting assistance, which is evident if you read between the lines of the replies we received.

Going forward you have some options:

1) We can have an IPMI card (or similar) permanently for the server, which will avoid the need to wait for the datacenter to hook KVMs etc. up and cut out a large portion of these delays. You should however not be fooled into believing this will be the perfect solution, as these often break and some problems require actually being in front of the server. This issue, for example, needed a drive removed to confirm the issue and the order replaced. But it will significantly reduce the delay, because if we can see it we can generally identify most issues quite rapidly and then guide them on what to do, such as in this case.

This however is the cheapest option of them as it's basically a one time cost of the card. Anything that allows us an oversight of what is going on (assuming it actually works) will help us guide providers/take decisive action/resolve problems faster.

2) You can get an additional spare server with your current provider and we can configure it to be a failover server, so your downtime should be in the minute range, assuming you're not having a power issue at the datacenter. These setups are not perfect with disk syncing, but we've done thousands of them and maintain hundreds of them already, so we can quite comfortably manage this and wrap quirks up before they even occur.

3) You can move your server to another provider which will assist better in troubleshooting and has staff onsite 24/7. In reality, for issues such as these, while the actual work time is only 30-60 minutes, they end up turning into 3-6 hours with most providers by the time everything is relayed, unless you luck out and get a tech that actually is any good.

4) Same as #2, except with an additional server located elsewhere to cover the event of your datacenter going down. This however would ideally need to be located on the West coast (such as LA) due to latency, and it isn't perfect, again due to latency. However, it can be made near perfect, with the potential of losing 1-2 minutes' worth of posts in the event of a failover (you would need to remove the auto-fail back, however, as this would corrupt the database).

You however need to be the judge of what costs you most. Is the potential of ~12 hours or so of downtime in the event of a real hardware failure (will your datacenter even replace the hardware/troubleshoot it?) worse than paying $300-400/month for an additional server and management costs (depending on the provider it could be more; West coast/LA we can get you in that range, however) to guard against the potential of a long outage, say, once every 16-20 months?

I also wouldn't let this particular issue cloud everything, as it's a rare issue in that it relates to a recent hardware upgrade and only showed up when the server was rebooted. The real issue was all the delays, not the technical problem itself.
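For anyone wanting to put rough numbers on the cost question in the analysis above, here is a minimal sketch. It uses only the figures quoted there ($300-400/month, ~12 hours of outage, one long outage every 16-20 months). The function name is made up for illustration, and the sketch assumes the failover server avoids the whole outage and ignores any setup costs, neither of which the analysis promises.

```python
def cost_per_avoided_hour(monthly_cost, months_between_outages, outage_hours):
    """Dollars spent on a standby server per hour of downtime it avoids.

    Hypothetical helper: assumes one outage of `outage_hours` per
    `months_between_outages`, fully avoided by the failover server.
    """
    total_spend = monthly_cost * months_between_outages
    return total_spend / outage_hours


# Cheapest case quoted: $300/month, an outage every 16 months.
low = cost_per_avoided_hour(300, 16, 12)
# Dearest case quoted: $400/month, an outage every 20 months.
high = cost_per_avoided_hour(400, 20, 12)

print(f"Roughly ${low:.0f} to ${high:.0f} per hour of downtime avoided")
```

On those assumptions the failover server costs somewhere in the region of a few hundred dollars for each hour of downtime it saves, which is the number to weigh against what an hour offline actually costs the site.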

Shelf · Jul 2, 2012

I didn't realise how often I check this forum until yesterday.

docjames · Jul 2, 2012

Interesting, thanks for sharing. I think most here, whilst wishing AFF was available instantly via every known electronic medium, are realistic enough to know there's a cost, and the vast majority aren't paying anything.

We looked at this (downtime) issue for a professional body website, and decided the target downtime would be <24hrs, as the cost to do otherwise became prohibitive (and not a good use of members' funds). In the end, in AFF's case, it's "information and leisure", not banking or government, so whilst losing access for up to 24hrs isn't ideal, in the assessment we made, the cost to "prevent" wasn't worth it (and as you see, "prevent" doesn't always guarantee "prevention").
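A downtime target like this can be converted into the availability percentage that hosts usually quote. The sketch below assumes the target is counted per year, which the post above does not actually say, and the function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: the target is measured per (non-leap) year


def availability_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Percentage of the period the site is up, given total downtime."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


target = availability_percent(24)   # the "<24hrs" target
weekend = availability_percent(18)  # this weekend's 18 hour outage

print(f"24h/year target: {target:.2f}% available")
print(f"18h outage alone: {weekend:.2f}% available")
```

On that per-year assumption, an 18 hour outage on its own still sits inside a 24 hour annual budget.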

k3nnis · Jul 2, 2012

Thanks for sharing admin

We appreciate it

fbrimfield · Jul 2, 2012

Hey Admin.

Just some thoughts on the solutions provided by your sysadmin. I work as a freelance web and graphic designer in my spare time and have managed high traffic forums before.

Basically the hardware failure was something that shouldn't have happened in the first place. SSDs should not really be used for servers, especially when they can cause issues with server bootloaders. They should be using standard hard drives, which are just as reliable. The 18 hour downtime is a bit silly; it is an issue that many hosts would have had fixed within the hour. A lot of webhosts run redundancy servers at no or minimal extra cost that kick in, as your sysadmin said, after a certain period of downtime.

I would probably look at switching hosts if this kind of thing is happening often. I'm unsure if you're on a dedicated server or a shared server with your current host, but really I wouldn't think it would matter. Without seeing the traffic statistics of the site, I'd say that a shared server would be just fine for one vBulletin install. I use a fantastic host in America who I've hardly ever had problems with, and who have great support and are pretty cheap too. PM me if you're interested in more details.

777 · Jul 2, 2012

AFF runs on Amadeus? Who knew?

harvyk · Jul 2, 2012

Hi Admin

I've just sent over a PM with some options for you.
As I mentioned in the PM, the response has a "backyard operator" feel about it; it also mainly reads as a list of excuses.

k3nnis · Jul 2, 2012

I totally agree that servers shouldn't be run with SSDs. SAS drives will do the trick.

yohy?! · Jul 2, 2012

not sure if it strictly applies but have heard good things about Amazon's cloud hosting offering which apparently is being deployed in AU at present.

777 · Jul 2, 2012

yohy?! said:
not sure if it strictly applies but have heard good things about Amazon's cloud hosting offering which apparently is being deployed in AU at present.

Clearly you heard these good things before yesterday: Amazon cloud outage takes down Netflix, Instagram, Pinterest, & more | VentureBeat

:0
