The first time this happened to me (8 years ago) I realized that it was a problem that had to be solved.

The worst feeling is getting a call from a client telling you that something is offline when you had no idea.


At this point in time we're using a few things:


- Nagios / Zabbix - both open source monitoring tools that let us know exactly how the server is performing. Especially when your VPS is on a host machine shared by 12+ other people, this is critical. Being online is great, but being online with MySQL using 90% of your CPU is not online enough for me.


- Pingdom - Pingdom pings the site every 60 seconds and confirms it is online. We also get a report every week showing the average load time of the site as a timeline across the week so we can identify any times that the server may be struggling to keep up.


- AWS Resource alerts - all our servers on AWS have resource alerts so that if anything spikes above 50% we are alerted and on the case immediately. We always maintain < 50% server usage so that spikes are mere blips on our radar when they happen.


- Chauffeur - this is an in-house tool we developed that actually checks and validates our websites. Not only does it show that they are online, but it also confirms that the necessary APIs are functioning and that the site is responding to performance checks.


- Ghost Inspector - this is more of a QA tool, but we still use it across the sites. Ghost Inspector lets you script front-end testing of the website, so you can confirm not only that the site is online and responding quickly, but also that lead submissions are working as expected, that every site has inventory showing, etc.
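
To give a feel for what a check that goes beyond "is it online" looks like, here is a rough Python sketch. It is not how Pingdom or our Chauffeur tool actually work internally; the names (check_site, CheckResult, must_contain, max_seconds) and the 2 second limit are made up for illustration. The idea is just: fetch the page, time it, and only call it healthy if it answered, answered fast enough, and contains something you expect to see (e.g. an inventory marker).

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class CheckResult:
    url: str
    online: bool            # did we get a non-error HTTP response?
    status: Optional[int]   # HTTP status code, None if the request failed
    seconds: float          # how long the request took
    healthy: bool           # online AND fast enough AND expected content present


def http_fetch(url: str, timeout: float = 10.0) -> Tuple[int, bytes]:
    """Fetch a URL with the standard library and return (status, body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses: the server answered, but with an error.
        return err.code, b""


def check_site(
    url: str,
    must_contain: bytes,
    max_seconds: float = 2.0,  # assumed threshold, tune for your sites
    fetch: Callable[[str], Tuple[int, bytes]] = http_fetch,
) -> CheckResult:
    """Run one check: reachability, response time, and expected content."""
    start = time.monotonic()
    try:
        status, body = fetch(url)
    except OSError:
        # DNS failures, refused connections and timeouts all land here.
        return CheckResult(url, False, None, time.monotonic() - start, False)
    elapsed = time.monotonic() - start
    online = 200 <= status < 400
    healthy = online and must_contain in body and elapsed <= max_seconds
    return CheckResult(url, online, status, elapsed, healthy)
```

You would run something like check_site("https://your-site.example/inventory", b"vehicle") on a schedule and alert whenever healthy comes back False. The fetch argument is only there so the check can be tested without hitting a real server.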


It's a fairly complicated stack, but it means that if a server ever goes offline, our office lights up like a Christmas tree and every developer, devops engineer and manager is aware immediately. Thankfully for my sanity this is a very rare occurrence, but even AWS goes down sometimes, so the safety net is in place. Now that we have redundant copies of everything running on a separate server in a separate region, downtime actually just means a failover to a backup website, rather than a red-alert, site-offline situation. If you can have two VPSes running with live-live redundancy you'll sleep easy :)
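
For anyone wondering what the failover decision boils down to, here is a stripped-down Python sketch. It is only an illustration, not our actual setup: pick_active and the endpoint names are hypothetical, and in practice this logic usually lives in DNS health checks or a load balancer rather than in your own script.

```python
from typing import Callable, Optional, Sequence


def pick_active(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None
```

With endpoints listed as primary first and backup second, traffic stays on the primary until its health check fails, then moves to the backup. A None result is the real red alert, because it means both regions are failing their checks.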