Community Help: Lussumo.com Connectivity Issues [RESOLVED]

Today you may have begun to notice database errors when attempting to load any of my websites. In particular, lussumo.com/community and markosullivan.ca/blog have been showing intermittent errors.

These errors have come at a particularly inopportune time (is there ever a good time?) because I am extremely busy with a new contract, development of the Garden framework and Vanilla 2, and I also manage to have a life in there somewhere (sometimes :).

When I began to notice the slow page-loading times on my server and then the errors that followed, I contacted my hosting company to find out what was going wrong. I am hosted at rackspace.com, and they are well known for their fanatical support. True to form, they got back to me quickly with a diagnosis of the problem:

Good Afternoon,

I have made some adjustments to the my.cnf configuration file in /etc

skip-bdb

query_cache_size=64M
query_cache_limit=12M

interactive_timeout=300
wait_timeout=300

tmp_table_size=128M
max_heap_table_size=128M

in order to decrease the high amount of disk I/O occurring on this server.  These settings should help query performance by allocating more memory to the query cache and to in-memory temporary tables.  I have also disabled persistent MySQL connections from PHP:

mysql.allow_persistent = Off

It appears you are reaching your maximum connection limit for MySQL.  The above adjustments are conservative due to the low amount of physical memory on this server.

When your server runs out of physical memory, it resorts to using disk space (swap).  This swapping can and will cause your server to become unresponsive.

You may also consider increasing the amount of physical memory on this server with a RAM upgrade.  If you are interested in proceeding, I can send this ticket to a BDC who can assist you with this upgrade and update you on pricing for this component.

Besides processes in "sleep" status, which indicate the use of persistent MySQL connections, it appears most of the connections are due to table locking occurring:

+-----+---------+-----------+-----------+---------+------+-------------------------------+------------------------------------------------------------------------------------------------------+
| Id  | User    | Host      | db        | Command | Time | State                         | Info                                                                                                 |
+-----+---------+-----------+-----------+---------+------+-------------------------------+------------------------------------------------------------------------------------------------------+
| 573 | xxxx | localhost | community | Query   |    9 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 574 | xxxx | localhost | community | Query   |   10 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 583 | xxxx | localhost | community | Query   |   10 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 584 | xxxx | localhost | community | Query   |    9 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 591 | xxxx | localhost | community | Query   |   10 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 593 | xxxx | localhost | community | Query   |   10 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 728 | xxxx | localhost | community | Query   |    5 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 729 | xxxx | localhost | community | Query   |    4 | Locked                        | select a.AddOnID  as AddOnID, a.AddOnTypeID  as AddOnTypeID, a.ApplicationID  as ApplicationID, a.Au | 
| 733 | xxxx | localhost | community | Query   |    3 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 734 | xxxx | localhost | community | Query   |    3 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 730 | xxxx | localhost | community | Query   |    3 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 735 | xxxx | localhost | community | Query   |    2 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 736 | xxxx | localhost | community | Query   |    2 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 737 | xxxx | localhost | community | Query   |    2 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 738 | xxxx | localhost | community | Query   |    0 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 739 | xxxx | localhost | community | Query   |    0 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs | 
| 740 | xxxx | localhost | community | Query   |    0 | Locked                        | SELECT t.DiscussionID  AS DiscussionID, t.FirstCommentID  AS FirstCommentID, t.AuthUserID  AS AuthUs |
+-----+---------+-----------+-----------+---------+------+-------------------------------+------------------------------------------------------------------------------------------------------+
As these queries are locking the table, subsequent queries have to wait, stacking up and consuming the available connections.  You may find that changing this table type to InnoDB helps with this table locking issue.  You may need to discuss with your developers whether this change would have an adverse effect on your applications.

As well, I have enabled slow query logging in:

/var/lib/mysqllogs/slow-log

which will log queries taking over 5 seconds to complete.  This information will help your developers to optimize any SQL queries and/or apply indexing where appropriate.

I have also added the following option in Apache:

MaxRequestsPerChild  1000

which will help to reduce the memory footprint of this service.

While it appears that the above changes helped with the unavailability of MySQL, the server is still highly loaded.
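
If you want to check for the same symptoms on your own server, these stock MySQL commands (nothing Vanilla-specific) show roughly what the tech was looking at:

-- compare peak connection usage against the configured ceiling
SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Max_used_connections';

-- a steadily growing Table_locks_waited counter points to table-lock contention
SHOW STATUS LIKE 'Table_locks_waited';

-- and the live view of what every connection is doing
SHOW FULL PROCESSLIST;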

Now, I always knew that the Vanilla 1 queries were hairy and could cause problems. I didn't think it would happen any time soon, and I was hoping to get Vanilla 2 in place before this became an issue (Vanilla 2's queries are much simpler and faster), but it looks like that is not going to happen. Regardless, traffic at lussumo.com has been increasing slowly and steadily over the years. In December we peaked at 2.5 million page views for the month at lussumo.com alone, and we've maintained that level of traffic ever since.

Obviously I could throw more RAM at the server as the Rackspace support person suggested; that seems to be a common answer to problems of this sort (we currently have only 1GB of RAM on the server). But I don't know if that is the answer I should be looking for, especially considering that I'm already paying a lot of money for the server.

So, I am hoping that all of you who use Vanilla can step up to the plate and offer your expertise on how to resolve this issue. I am opening the doors and accepting any and all advice, questions, and ideas on how to fix the problem.

Here is what I have tried so far:

* I reviewed the slow queries that MySQL logged and found that 99% of them were Vanilla's "comments page" and "discussions page" queries. I've uploaded a sample of the slow query log so you can see which queries are causing problems.

* I downloaded a copy of the Lussumo Community database to my local dev machine so I could get a good look at the tables, indexes, etc.

* I found that none of the indexes included with the current release of Vanilla 1 were applied to the tables (other than primary keys). This is probably because I've simply added columns as development continued and never had a problem before now.

* I added the indexes that ship with the current release of Vanilla 1 to the community database. I found that this had little to no effect on page-load speed (it might even have made the queries slower).

* I've created a script that converts all of the tables in the community db to InnoDB tables (as suggested by the Rackspace tech); there's a sketch of the idea just after this list. My googling turned up both good and bad results from this type of change. It could start to throw fatal errors when data is being inserted (rather than while it's being selected, as happens now). I have not yet run this script as I want to hear back from the community first.

* I’ve taken the community forums offline and enabled wp-cache on this blog so that everyone can have access to this blog post and be fully aware of the issue.
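
The conversion script itself doesn't need to be anything fancy. Here's a minimal sketch of the idea, assuming the database is named community as it is on my server (you'd want to review the generated statements before running them):

-- generate one ALTER statement per MyISAM table in the community database
SELECT CONCAT('ALTER TABLE ', TABLE_NAME, ' ENGINE=InnoDB;')
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'community'
  AND ENGINE = 'MyISAM';

-- then run the statements it produces, e.g.:
-- ALTER TABLE LUM_Comment ENGINE=InnoDB;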

Help!

So, I am reaching out to you for help. No question is a dumb one. Any idea is welcome. Please share your expertise and help us to get this convoy back on the road…

Update

It turns out that I had forgotten to apply all of the indexes & optimizations to this database through the years that we've been online. The growth of our community, combined with poor indexing, caused a couple of the tables to begin to lock. The LUM_User and LUM_UserDiscussionWatch tables in particular were locking. These tables are updated frequently with login information and discussion-tracking information, respectively. Because the tables used the MyISAM engine, which only supports table-level locks, the entire table would be locked when an update was applied to just a single row. This meant that all 9,000+ user records would get locked whenever anyone's "DateLastActive" field was updated, and all 90,000+ records in the LUM_UserDiscussionWatch table would get locked whenever anyone even looked at a single discussion (since their view of that discussion gets recorded).

To fix both of these issues, I changed their table engines to InnoDB so that only the affected row is locked when updates are applied.
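
If you want to check which engine your own tables are using before making a change like this, SHOW TABLE STATUS will tell you. The UPDATE below is hypothetical (the column names are just illustrative), but it is exactly the kind of single-row write that, under MyISAM, locks the whole table:

-- the Engine column shows MyISAM or InnoDB for each table
SHOW TABLE STATUS FROM community;

-- under MyISAM, even this hypothetical single-row update takes a
-- write lock on the entire LUM_User table until it completes
UPDATE LUM_User SET DateLastActive = NOW() WHERE UserID = 1;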

I also analyzed the Discussions & Comments queries, which are (obviously) the most frequently run queries in the application. The comments query was extremely slow. After running EXPLAIN on it, I found that it was indexed incorrectly. For some reason the LUM_Comment table was using both the CommentID and the DiscussionID columns as its primary key. I removed DiscussionID from the primary key and added it as a plain index, which allows the join to LUM_Discussion to use the index instead of scanning the entire LUM_Comment table. I also found that the LUM_UserBlock table had no indexes at all, so I added those and was able to further reduce the query time. Here is a list of the changes I made to the database, for anyone who might be interested:

ALTER TABLE `community`.`LUM_Comment` DROP PRIMARY KEY,
 ADD PRIMARY KEY  USING BTREE(`CommentID`),
 ADD INDEX `comment_discussion`(`DiscussionID`);

ALTER TABLE LUM_UserBlock ADD INDEX (BlockingUserID);
ALTER TABLE LUM_UserBlock ADD INDEX (BlockedUserID);

ALTER TABLE LUM_User ENGINE=InnoDB;
ALTER TABLE LUM_UserDiscussionWatch ENGINE=InnoDB;
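
If you're curious whether a change like this has actually taken effect, EXPLAIN is the quickest check. Here's a simplified, hypothetical version of the comments query (the real Vanilla 1 query selects far more columns, and 100 is just a stand-in discussion ID):

-- after the change, the key column in the EXPLAIN output should show
-- the comment_discussion index rather than a scan of LUM_Comment
EXPLAIN SELECT c.CommentID, d.DiscussionID
FROM LUM_Comment c
JOIN LUM_Discussion d ON d.DiscussionID = c.DiscussionID
WHERE c.DiscussionID = 100;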

Thanks to Damien (Dinoboff) and Dave (Wallphone) for jumping in and offering some assistance.