My first (and certainly not last) 4am post!
Prepare to be underwhelmed.
For the last 36 hours I have been doing battle with two servers located within the same data centre and the guys who manage the network therein.
At about 12pm Tuesday I noticed that a number of mails from server 1 had not yet reached server 2. A quick look and push of the mail queue, and Exim is telling me it has “No route to host”. Hmm, not a good sign.
My first priority was to see just how big an issue it was. After running some checks using external mail programs and going through the log files, I satisfied myself that the routing problem was at least restricted to those two servers. Or rather, to those two subnets within the data centre network.
Neither server could ping the other, and even switching off the software firewalls (iptables and APF) did nothing for me.
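For anyone wanting to run the same sort of check from a script rather than by hand, here is a minimal sketch in Python. It is not what I ran on the day; the function name `check_tcp` and the address in the usage note are made up for illustration. It simply tries to open a TCP connection to the mail port and hands back whatever error the operating system reports.

```python
import socket


def check_tcp(host, port=25, timeout=5.0):
    """Return "ok" if a TCP connection succeeds, otherwise the OS error text."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        # Timeouts, refused connections and unreachable hosts all end up here.
        return exc.strerror or str(exc)
```

Calling `check_tcp("198.51.100.10")` (a placeholder address) from each server in turn shows whether the failure is the same in both directions, and the returned text should be the same kind of complaint Exim was passing along.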
At this stage I felt the issue lay outside of either box and bumped it to DC support staff. I had a sense of foreboding about this, as I knew in my heart that this kind of small, localised and fairly complex issue would take a lot of to-ing and fro-ing before it landed in the inbox of the sort of knowledgeable techie who could correctly identify and resolve it.
Meanwhile, back at the mail servers, mails are piling up in both mail queues. Gah! So a few “message not yet delivered after 4 hours” warnings will be received, nothing serious.
I won’t bore you with the mind-numbing problem tennis I played with the support staff to get my issue aired. We all have our tales of woe in that department.
Eventually it was declared (second hand, in the form “the network engineer said…”) that the network config on either or both machines was not correct.
Now, I’m not the world’s smartest guy, but if I have one strength, it is the ability to approach problem solving in an intelligent way, i.e. if two servers have been working fine for the past 12 months, then why the hell would you need to go about messing with their settings?
It was mentioned that a Cisco router (I think the brand was important as many techies seem to think Cisco is some sort of mythical place that nobody should ever enter or discuss in detail) was recently replaced and that the settings might have worked in the past but, blah blah blah.
“Hmm, ok, give me the settings and I will see how they compare to what I currently have”. And yes, on one server they were different. But changing the config for eth0 is not something to enter into with haste. In fact, it is up there with marrying a woman with a very nosy mother.
So I called them back to double check. No, no, I was assured, those values are correct.
Correct they may be, but not for my server, which was down for 3 hours after I applied said settings. I called and asked them to get someone over to the box, log in and apply my “wrong” settings once more. I didn’t think that would take them 3 hours, but they spent an hour rebooting the machine that was fine, just to get themselves warmed up for the main event.
So several hours later, I have my server back up and a gleeful response to the trouble ticket informing me that the server is now back online and the issue was resolved.
Apart from my original problem that is.
It was at that point that I remembered why I had tied bubble wrap to my forehead earlier in the day.
So, I have been immersing myself in learning more about networking, routing and eth0 than I really want to know, but after some careful information gathering, I have been able to create a static route between the two servers to re-establish the network route that was so cruelly taken from me by fate on Tuesday morning.
I had better remember to put a script in that adds the routes back in on next reboot.
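Something along these lines would do as that script. This is a rough sketch in Python, not what is actually on my servers: the addresses in `ROUTES` are placeholders from the documentation ranges, the helper name `build_route_command` is my own invention, and it assumes the iproute2 `ip` command is available. It uses `ip route replace` rather than `add` so that running it twice does not fail on a route that already exists.

```python
import ipaddress
import subprocess
import sys

# Hypothetical values: (other server's address, gateway to reach it, interface).
ROUTES = [
    ("198.51.100.10", "192.0.2.1", "eth0"),
]


def build_route_command(dest, gateway, dev):
    """Build the argument list for a host route to dest via gateway on dev."""
    # IPv4Address raises ValueError on a malformed address, so typos fail early.
    host = ipaddress.IPv4Address(dest)
    gw = ipaddress.IPv4Address(gateway)
    return ["ip", "route", "replace", f"{host}/32", "via", str(gw), "dev", dev]


def main(argv):
    # Only touch the routing table when --apply is given; otherwise just print.
    apply = "--apply" in argv
    for dest, gateway, dev in ROUTES:
        cmd = build_route_command(dest, gateway, dev)
        print(" ".join(cmd))
        if apply:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run without arguments, it only prints the commands it would execute, which is a cheap way to double check them before letting it loose as root with `--apply` from whatever your distribution runs at boot.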
I’m left with a slightly more advanced knowledge of networking and a distrust of asking front line tech support about such issues in future.
Which most likely means the next time something like this comes up I will try fixing it myself.
Apparently ignorance is bliss. I wouldn’t know.