compo.sr infrastructure problems (now solved)

Added 20th February 2019, 11:10 pm
Author: Chris Graham

Hey all,

I wanted to write a bit about what has been happening over the last couple of weeks, as some of you may have noticed some instability here.

We got hit by a series of infrastructure problems that seemed to come out of nowhere, so it's been a mad rush dealing with these unplanned issues.

First, e-mail from our server stopped reaching Gmail users. Even though we've had SPF configured for a long time, Gmail decided that because we didn't have DKIM (cryptographic e-mail signing) set up, it should treat our e-mails as spam. It started with just a few, but eventually it was happening to all of them.

All ocProducts staff use Gmail for official purposes (Gmail serves the ocproducts.com domain), so this meant many e-mails, such as tracker e-mails, weren't coming through to us. We had a lot to catch up on due to the missed e-mail alerts.

We also of course had to configure DKIM on the server, as well as narrow the SPF configuration and enable DMARC. This isn't the easiest thing to do, so that was a day or two of work.
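For anyone unfamiliar with these three mechanisms, here is a rough sketch of what the DNS TXT records typically look like, with a few lines of Python to pick them apart. Everything specific in it (example.com, the IP address, the "mail" selector, the policy values, and the helper names) is a made-up placeholder for illustration, not our actual configuration.

```python
# Illustrative records only: example.com, the IP, the selector and the
# policies are hypothetical placeholders, not compo.sr's real DNS entries.
EXAMPLE_RECORDS = {
    # SPF: which hosts may send mail for the domain; "-all" fails the rest.
    "example.com": "v=spf1 ip4:203.0.113.10 -all",
    # DKIM: public key that receivers use to verify the message signature.
    "mail._domainkey.example.com": "v=DKIM1; k=rsa; p=PUBLIC_KEY_PLACEHOLDER",
    # DMARC: what receivers should do when the SPF/DKIM checks fail.
    "_dmarc.example.com": "v=DMARC1; p=quarantine; rua=mailto:postmaster@example.com",
}


def parse_tags(record: str) -> dict[str, str]:
    """Split a DKIM/DMARC style 'tag=value; tag=value' record into a dict."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        name, _, value = part.partition("=")
        tags[name.strip()] = value.strip()
    return tags


def spf_mechanisms(record: str) -> list[str]:
    """Return the SPF terms after the version tag, e.g. ['ip4:...', '-all']."""
    terms = record.split()
    if not terms or terms[0] != "v=spf1":
        raise ValueError("not an SPF record")
    return terms[1:]
```

In this sketch, "narrowing" SPF would correspond to listing fewer permitted senders or ending the record with a stricter qualifier such as "-all".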

Meanwhile, our server performance went from strong to abysmal. We were seeing web requests taking over 10 seconds or being dropped completely with timeout errors. We'd fix a problem and get a period of restored performance, and then something else would push the performance back down. It took about four days of heavy server reconfiguration, bot banning, and some Composr optimisation, and now we finally have performance back to the point of being very snappy. It's hard to pin down a single cause; it was a combination of many things happening at once.

(In fact, as I write this I just saw that a new dumb hack-bot has made over 9800 requests within the last hour trying to attack the site - banned!)
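Spotting a bot like that mostly comes down to counting requests per client. A minimal sketch of the idea, assuming an access log where each line starts with the client IP (as in the common Apache/nginx layouts); the function name and threshold are hypothetical, and this is not the tooling we actually used:

```python
from collections import Counter


def heavy_hitters(log_lines, threshold):
    """Return {ip: request_count} for IPs with more than `threshold` requests.

    Assumes each log line begins with the client IP followed by a space.
    """
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if line:
            # The first space-separated field is taken to be the client IP.
            counts[line.split(" ", 1)[0]] += 1
    return {ip: n for ip, n in counts.items() if n > threshold}
```

Fed the last hour of log lines with a suitably high threshold, this would list the candidates for a ban.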

The good news is that it prompted us to analyse Composr performance in our live environment (using a profiler called tideways-xhprof), and we have made a number of optimisations which will be in the next patch release. Also, the performance optimisation tutorial has been improved with quite a few further tips learned through this process.

During all this, we released the 10.0.23 patch release. This unfortunately has a couple of nasty notification bugs that came in with a reworking of that code: one of them produces an error message in some specific situations, and another, more critically, sends whisper notifications to a broader group of people than it should (a very nasty bug caused simply by a missing set of parentheses in an SQL query). The code changes were unit tested for stability, but obviously not well enough. Honestly, as it was in the middle of all these other problems, I was probably a bit distracted, and I was about to go on vacation for 2 days so wanted to get it out the door before I'd be back into catch-up mode. So I apologise for this; I hate it when releases have major glitches. A new patch release will come very soon.
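To show how easily a missing set of parentheses can widen a query, here is a small self-contained sketch using Python's built-in sqlite3 module. The table, column names, and data are invented for illustration and are not the actual Composr query; the point is only that in SQL, AND binds more tightly than OR.

```python
import sqlite3

# Hypothetical schema and data, purely for illustration.
ROWS = [
    ("alice", "whisper", 1),
    ("bob", "whisper", 2),
    ("carol", "post", 2),   # not subscribed to whispers
    ("dave", "whisper", 3),
]

INTENDED = "kind = 'whisper' AND (room = 1 OR room = 2)"
# Same clause without the parentheses: parsed as
# (kind = 'whisper' AND room = 1) OR room = 2
BUGGY = "kind = 'whisper' AND room = 1 OR room = 2"


def recipients(where_clause):
    """Return the members matched by the given WHERE clause, sorted by name."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE subs (member TEXT, kind TEXT, room INTEGER)")
        conn.executemany("INSERT INTO subs VALUES (?, ?, ?)", ROWS)
        query = "SELECT member FROM subs WHERE " + where_clause + " ORDER BY member"
        return [row[0] for row in conn.execute(query).fetchall()]
    finally:
        conn.close()
```

With this data, `recipients(INTENDED)` matches only alice and bob, while `recipients(BUGGY)` also picks up carol, who never subscribed to whispers at all.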

I wish the 7 or so days spent chasing these issues could have been spent on moving v11 forward towards a beta release. Maintaining the status quo does take more time than I'd like it to. To be fair, v10 keeps getting smoother, along with the documentation, but it's very much time for us to make our next big step forward. I want our users to know it's very much on my mind and there's been a lot of behind-the-scenes progress: v11 is coming.

Thanks for your understanding,
Chris