Keeping an eye on your hosting

Since I no longer manage servers directly, I have removed the GSM phone that was attached to my office server and sent me text alerts whenever a server or service was down.

I do still want to keep an eye on client websites, of course, as well as this one (which has had some small downtime issues in the past month due to capacity issues with its host).

There are a plethora of website monitoring services out there, but the one I decided to take for a test drive was Site24x7.com. It offers the usual monitoring of a website and sends email and text alerts when a site is down. Best of all, it is free!

I had my concerns about the quality of the service given it is free – it is the kind of thing I would happily pay 10-15 euro a month for. I have checked the logs a few times: there are plenty of visits from the monitoring bot, and it has reported downtime on the sites it is monitoring within a minute or two. All in all, it seems a quality service.
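For the curious, the basic idea behind an uptime check is small enough to sketch in a few lines of Python. This is only my rough illustration of the general technique, not how Site24x7 actually works; the function names and the rule that any 2xx or 3xx answer counts as "up" are my own assumptions.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and so on.
        return None


def is_up(status):
    """Assumption: any 2xx or 3xx answer counts as 'up'."""
    return status is not None and 200 <= status < 400


def check_site(url, fetch=fetch_status):
    """Return 'UP' or 'DOWN' for url; fetch can be swapped out for testing."""
    return "UP" if is_up(fetch(url)) else "DOWN"
```

Run from cron every few minutes against each site, something like `check_site("https://www.example.com/")` would give a crude version of the same signal, minus the email and text alerts.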

One element which I am just about to take a look at in more detail is its ability to monitor transactions. This would allow you to ensure not only that a site is up, but that it is functioning correctly, by, say, performing a search or adding something to the website’s shopping cart.

In doing so, you can ensure that the database server is running ok too, and that nobody has made a change to your code that has broken the site!
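To give a feel for what a transaction check involves, here is a minimal Python sketch of the search case. It is my own illustration rather than anything taken from Site24x7: the `q` query parameter, the example URL and the function names are all hypothetical, and a real shop will have its own search URL and its own text worth looking for.

```python
import urllib.parse
import urllib.request


def fetch_body(url, timeout=10):
    """Download url and return the page body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def search_url(base_url, term):
    """Build the search URL; the 'q' parameter name is an assumption."""
    return base_url + "?" + urllib.parse.urlencode({"q": term})


def search_works(base_url, term, expected_text, fetch=fetch_body):
    """True only if the search page loads and contains expected_text."""
    try:
        body = fetch(search_url(base_url, term))
    except OSError:
        # urllib's URLError and HTTPError are both subclasses of OSError.
        return False
    return expected_text in body
```

For example, `search_works("https://shop.example/search", "red shoes", "results for")` only passes if the web server, the code and the database behind the search are all doing their jobs, which is exactly the point of monitoring a transaction rather than a single page.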


The more you know, the less happy you are

My first (and certainly not last) 4am post!
Prepare to be underwhelmed.

For the last 36 hours I have been doing battle with two servers located within the same data centre and the guys who manage the network therein.

At about 12pm Tuesday I noticed that a number of mails from server 1 had not yet reached server 2. A quick look and push of the mail queue, and Exim is telling me it has “No route to host”. Hmm, not a good sign.

My first priority was to see just how big an issue it was. After running some checks using external mail programs and some log file checking, I satisfied myself that at least the routing problem seemed to be restricted to those two servers. Or rather those two subnets within the data centre network.

Neither server could ping the other, and even switching off the software firewalls, iptables and APF, did nothing for me.

At this stage I felt the issue lay outside of either box and bumped it to DC support staff. I had a sense of foreboding about this as I knew in my heart that this kind of small, localised and fairly complex issue would take a lot of to-ing and fro-ing before it landed in the inbox of the sort of knowledgeable techie that could correctly identify and resolve the issue.

Meanwhile, back at the mail servers, mails are piling up in both mail queues. Gah! So a few “message not yet delivered after 4 hours” warning mails will be received; nothing serious.

I won’t bore you with the mind-numbing problem tennis I played with the support staff to get my issue aired; we all have our tales of woe in that department.

Eventually it was declared (second hand in the form “the network engineer said..”) that the network config on either or both machines was not correct.

Now, I’m not the world’s smartest guy, but if I have one strength, it is the ability to approach problem solving in an intelligent way, i.e. if two servers have been working fine for the past 12 months, then why the hell would you need to go about messing with their settings?

It was mentioned that a Cisco router (I think the brand was important as many techies seem to think Cisco is some sort of mythical place that nobody should ever enter or discuss in detail) was recently replaced and that the settings might have worked in the past but, blah blah blah.

“Hmm, ok, give me the settings and I will see how they compare to what I currently have”. And yes, on one server they were different. But changing the config for eth0 is not something to enter into with haste. In fact, it is up there with marrying a woman with a very nosy mother.

So I called them back to double check. No, no, I was assured, those values are correct.

Correct they may be, but not for my server, which was down for 3 hours after I applied said settings. I didn’t think it would take them 3 hours to get it back up after I called and told them to get someone over to the box, log in and apply my “wrong” settings once more, but they spent an hour rebooting the machine that was fine to get themselves warmed up for the main event.

So several hours later, I have my server back up and a gleeful response to the trouble ticket informing me that the server is now back online and the issue was resolved.

Apart from my original problem that is.

It was at that point that I remembered why I had tied bubble wrap to my forehead earlier in the day.

So, I have been immersing myself in learning more about networking, routing and eth0 than I really want to know, but after some careful information gathering, I have been able to create a static route between the two servers to re-establish the network route that was so cruelly taken from me by fate on Tuesday morning.

I had better remember to put a script in that adds the routes back in on next reboot.
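A rough sketch of what such a script might look like, in Python. Everything specific here is an assumption for illustration: the addresses are placeholders from the documentation ranges rather than my real ones, and it assumes the iproute2 `ip` command is available and that the script is run as root from a boot script.

```python
import subprocess

# Hypothetical placeholders: (destination, gateway, device) per static route.
STATIC_ROUTES = [
    ("198.51.100.25", "192.0.2.1", "eth0"),
]


def route_command(destination, gateway, device):
    """Build the iproute2 command that adds or updates one static route."""
    # "replace" rather than "add" so that re-running the script is harmless.
    return ["ip", "route", "replace", destination, "via", gateway, "dev", device]


def restore_routes(routes=STATIC_ROUTES, run=subprocess.run):
    """Apply every route; stop with an error if any command fails."""
    for destination, gateway, device in routes:
        run(route_command(destination, gateway, device), check=True)


if __name__ == "__main__":
    restore_routes()
```

Keeping the routes in a list at the top means the next change is a one-line edit rather than another 4am archaeology session.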

I’m left with a slightly more advanced knowledge of networking and a distrust of asking front-line tech support about such issues in future.

Which most likely means the next time something like this comes up I will try fixing it myself.

Apparently ignorance is bliss. I wouldn’t know.
