mPulse

Friday, September 4, 2009

GrabPERF - Database Failure

The GrabPERF database server failed sometime early this morning. The hosting facility is working to install a new machine, and then will begin the long process of restoring from backups and memory.

Updates will be posted here.

UPDATE - Sep 4 2009 22:00 GMT: The database listener is up, and data is flowing into the database and can be viewed in the GrabPERF interface. However, I have lost all of the management scripts that aggregate and drop data. These will be critical, as the new database server has a substantially smaller drive. There is a larger attached drive, and I will try to mount the data there.

It will likely take more time than I have at the moment to maintain and restore GrabPERF to its pre-existing state. You can expect serious outages and changes to the system in the next few weeks.

[Whining removed. Self-inflicted injuries are always the hardest to bear.]

UPDATE - Sep 5 2009 03:30 GMT: The database is back up and absorbing data. Attempts to move it to the larger drive on the system failed, so the entire database is running on an 11GB partition. <GULP>.

The two most vital maintenance scripts are also running the way they should be. I had to rewrite those from very old archives.

Status: Good, but not where I would like it. I will work with Technorati to see if there is something that I'm missing in trying to use the larger partition. Likely it comes down to my own lame-o Linux admin skillz.

I want to thank the ops team from Technorati for spending time on this today. They did an amazing job of finding a machine for this database to live on in record time.

I have also learned the hard lesson of backups. May I not have to learn it again.

UPDATE - Sep 5 2009 04:00 GMT: Thanks again to Jerry Huff at Technorati. He pointed out that if I use a symbolic link, I can move the db files over to the large partition with no problem. Storage is no longer an issue.
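For anyone curious what the symbolic link trick looks like, here is a minimal sketch in Python. It is not the actual procedure used on the GrabPERF server: the function name `relocate_with_symlink` and the directory names are hypothetical, and the demo runs against throwaway temp directories rather than a real MySQL data directory. On a real system the database server would need to be stopped first, and file ownership and permissions would need to survive the move.

```python
import os
import shutil
import tempfile


def relocate_with_symlink(data_dir, new_parent):
    """Move data_dir under new_parent and leave a symlink at the old path.

    Hypothetical helper for illustration only.
    """
    target = os.path.join(new_parent, os.path.basename(data_dir))
    shutil.move(data_dir, target)  # falls back to copy + delete across filesystems
    os.symlink(target, data_dir)   # the old path now points at the new location
    return target


# Demo with temp directories standing in for the small and large partitions.
small = tempfile.mkdtemp(prefix="small_")
large = tempfile.mkdtemp(prefix="large_")
data_dir = os.path.join(small, "mysql")
os.makedirs(data_dir)
with open(os.path.join(data_dir, "table.MYD"), "w") as f:
    f.write("rows")

target = relocate_with_symlink(data_dir, large)

# Anything that opens the old path still finds the files.
with open(os.path.join(data_dir, "table.MYD")) as f:
    contents = f.read()
```

The point of the design is that nothing that refers to the old path has to change: the files live on the large partition, while the original location still resolves to them.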

[And why, you ask, is Tara Hunt (@missrogue) on this post? Hey, when I asked Tagaroo for Technorati images, this is what it gave me. It was a bit of a shock after 8 hours of mind-stretching recovery work, but hey, ask and ye shall receive.]

UPDATE - Sep 7 2009 01:00 GMT: Seems that I got myself into trouble by using the default MySQL configuration that came with the CentOS distro. As a result, I ran out of database connections! Something that I have chided others for, I did myself.

The symptom appeared when I reactivated my logging database, which runs against the same MySQL installation, just in a separate database. It started to use up the default pool of connections (100) and the agents couldn't report in.

This has been resolved and everything is back to normal.
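The arithmetic behind this kind of failure is simple enough to sketch. The Python below is only an illustration: the function name `connection_headroom` and the per-client numbers are made up, since the real agent and logging connection counts are not recorded here. The only figure taken from this post is the default limit of 100, which in MySQL is controlled by the `max_connections` server variable.

```python
def connection_headroom(max_connections, demand):
    """Return spare connections; a negative result means clients get refused.

    demand maps a client name to the peak number of connections it may hold.
    """
    return max_connections - sum(demand.values())


# Hypothetical peak demand; these are not measured values.
demand = {"agents": 60, "logging": 50}

headroom = connection_headroom(100, demand)
if headroom < 0:
    print("Over the limit by %d connections" % -headroom)
else:
    print("%d connections to spare" % headroom)
```

With these invented numbers, either workload fits within the limit on its own and only the combination overflows it, which matches the symptom of the problem appearing when the second database was switched back on.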
