Just a quick note to inform websrv01 users that I have upgraded both Apache and PHP on websrv01 this afternoon. This server is now running PHP 5.2.12, and previous pdo_mysql issues should now be resolved. Our next PHP upgrade (likely the end of February) will be into the PHP 5.3 branch, so please ensure any PHP applications you may be running are ready for this.
Update: 9:29AM EST
Dell has successfully completed the maintenance work and replaced the failed hard drive on websrv01. All services are currently online; however, the server will be a little slow until it fully syncs the newly installed hard drive with its mirrored drive. We expect that this will take another few hours to complete. If you have any trouble, please let me know.
Original Post: Oct 27 @ 8:02PM EST
I confirmed earlier today with Dell that a service technician will be on-site in our Montreal data centre on Thursday, October 29th at 8:00AM EST to replace the failed hard drive on websrv01. While this maintenance is taking place, all services on the server will be unavailable. We expect that the downtime will last no longer than 1 hour; however, your patience as we replace the failed hardware is appreciated.
Just to confirm, all services on websrv01 will be unavailable:
Thursday, October 29th, 2009 @ 8:00AM EST
Update: Oct 21 2:24AM
websrv01 is back online. Diagnostic tests are running on the server as we speak (thanks to Greg), but initial reports indicate that 1 of the 2 hard drives on websrv01 has died. Luckily we run a RAID 1 (mirror) configuration, so the other drive is picking up the slack (whew). Dell is aware of the issue and will get back to me later in the day to schedule a time for them to visit the data centre and investigate further. I will post more information as it becomes available.
Initial Report: Oct 20, 11:03PM
websrv01 is currently off-line. It is highly upsetting to say that; however, we are currently experiencing some major hardware issues / failures. I am currently working with Dell and our co-location provider to resolve the issue; however, we expect that the server will be down most of the day on Wednesday while we recover service.
This morning we experienced a short unscheduled service outage on websrv01 due to a spam attack that took place early in the morning. This incident could easily have been avoided if a select few users had not chosen incredibly simple passwords for their e-mail addresses. If you have a simple e-mail address password, please change it immediately. Passwords should be alphanumeric, contain a minimum of 6 characters, and include no dictionary words.
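For anyone who wants to sanity-check a password against the rules above, here is a rough sketch in Python. It is not what runs on websrv01; the function name, the tiny word list, and the reading of "alphanumeric" as "at least one letter and at least one digit" are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for a real dictionary file; a serious check
# would load a much larger word list.
COMMON_WORDS = {"password", "letmein", "welcome", "qwerty", "monkey", "dragon"}

MIN_LENGTH = 6


def password_problems(password, words=COMMON_WORDS):
    """Return a list of reasons the password fails the policy (empty if OK)."""
    problems = []
    if len(password) < MIN_LENGTH:
        problems.append("shorter than 6 characters")
    # Assumption: "alphanumeric" means at least one letter and one digit.
    if not (re.search(r"[A-Za-z]", password) and re.search(r"[0-9]", password)):
        problems.append("must mix letters and digits")
    # Case-insensitive substring match against the word list.
    lowered = password.lower()
    if any(word in lowered for word in words):
        problems.append("contains a dictionary word")
    return problems
```

Calling `password_problems("password1")` reports the dictionary word, while something like `password_problems("x7Kq92mB")` comes back with an empty list.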
We are currently experiencing an unscheduled service outage on websrv01 due to what we believe may be a hardware issue on the server. In fact, I think this could be the same issue we encountered on April 25th, and I hate to say it but our co-location provider *still* has not fixed the misconfigured power port that our server is plugged into, so I am still unable to reboot the machine.
A technician has been informed of the problem, and someone is going down to the server to reboot it right now. Luckily, I am told there are people in the building today, so it should be back shortly. I will post an update as soon as I know anything.
Data centre technicians are making their way to the server right now to fix the APC switch and restart the machine.
I’m still waiting, and getting more angry by the minute. I apologize for the inconvenience.
websrv01 is back online after the technician finally rebooted the server. I apologize once again for the inconvenience. I am fairly certain that they assigned John to my support ticket:
We are currently experiencing an unscheduled service outage on websrv01 due to what we believe may be a hardware issue on the server. Unfortunately our co-location provider misconfigured the power port that our server is plugged into, so we were unable to reboot the machine ourselves. Currently we have a technician assigned in Montreal who is on his way to the data centre to reboot the server and investigate further. We will update this post as more information becomes available.
We are still working with our co-location provider to determine the exact cause of the problem. One theory currently being investigated is that we may be experiencing a distributed denial of service attack on the server. As soon as we have any further information, we will post it.
The problem has now been resolved, and all service has been fully restored. It does in fact appear to have been a distributed denial of service attack, which fortunately ceased on its own. We sustained 1Mbit of http traffic to websrv01 for only a short period of time before the server was unable to handle the requests. The 1Mbit wall continued until just after 6PM when it stopped just as mysteriously as it began. Further investigation is on-going and any new information will be made available.
We apologize for the inconvenience.
At approximately 3:05PM EDT today (Wednesday, March 21st, 2007) websrv01 encountered unexpected server downtime that lasted approximately 20 – 25 minutes. The cause of the downtime is currently unknown; a hard reboot of the server was required in order to restore service.
We will post more information as to the cause after we determine what happened. We apologize for the inconvenience.
On Wednesday of this week the digital SSL certificate for secure.digitalorphans.org expired before I had a chance to renew it. Ouch. If you attempted to access Plesk from Wednesday to Friday evening on websrv01 you would have been presented with a warning message from your web-browser stating that the digital certificate for the domain has expired.
I apologize for not getting this renewed on time; however, it was successfully renewed and installed this morning. The old digital certificate is still installed when sending e-mail through secure.digitalorphans.org if you have SSL enabled, but I will get Greg to copy that over today.
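To help avoid a repeat, a small script could warn before a certificate lapses. The following is only a sketch under stated assumptions: the function names are made up for this example, it uses nothing beyond Python's standard `ssl` and `socket` modules, and it is not something currently running on our servers.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate's notAfter timestamp.

    not_after uses the format returned by SSLSocket.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT". A negative result means expired.
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def fetch_not_after(host, port=443, timeout=10):
    """Connect to host over TLS and return its certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Run from a daily scheduled job, something like `days_until_expiry(fetch_not_after("secure.digitalorphans.org"))` could trigger a reminder once the result drops below a chosen threshold (30 days, say, which is an arbitrary figure). Note that with the default verification settings an already-expired certificate makes the connection itself fail with an `ssl.SSLError`, so that error should be treated as a warning sign too.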
Server Maintenance Report: websrv01
The scheduled maintenance of websrv01 went very well today with no problems to report. All relevant Red Hat updates were installed and I’ve upgraded PHP on the server to PHP 4.4.4.
Greg and I have not yet decided on a date to upgrade Plesk to version 8; however, we will be looking into it shortly and we will definitely post a solid date and time. I’m thinking arbitrarily sometime in mid-November… I’ve actually heard that SWSoft is going to be releasing Plesk 8.1 very shortly so at this point it might be better to wait and see what they are doing first.
Server Downtime: websrv02
On Saturday, September 16th at approximately 5:25PM EST we had some unexpected downtime on websrv02. This was caused by a permissions problem during a standard Red Hat package update. The issue was resolved after about 15 – 20 minutes when Greg found the problem and corrected it. Users may not have actually noticed an issue at all since it was a problem with DNS and it only affected new queries to BIND. websrv02 was rebooted during this update.
PHP Updated: websrv02
Thanks to Greg’s persistence and assistance, we have also upgraded PHP on websrv02 to PHP 5.1.6. If you had PHP scripts running on websrv02, please check them out for any compatibility issues.
Server Maintenance: websrv01
We need to do some updates to websrv01 as well and we’re thinking that Sunday morning would be ideal for this. Updates will be performed to Red Hat Enterprise on websrv01 on Sunday, September 24th @ 10:00AM EST. Downtime should be minimal, if any, although we will need to perform a reboot of the server. A PHP upgrade to PHP 4.4.4 will also be done at this time.
Future Maintenance: websrv01
We still have not decided on a date to perform the upgrade of Plesk on websrv01 to version 8.0.1, although I will consult with Greg about this shortly and provide a solid date. After we have upgraded Plesk to the 8.0 release, we will shortly thereafter be upgrading PHP to version 5.1.6 as well. As previously stated, please make sure that all PHP scripts on the server are compatible with PHP 5.1.