MySQL Logo

MySQL – First Impressions

imageIn my new job as the CTO of Livedrive, I have the pleasure of working with both Microsoft SQL Server and MySQL. We have a rather nice real estate with tens of petabytes online which keeps us entertained with scale challenges.

Having spent some time with MySQL lately, and being an old SQL Server user, I thought it might be interesting to share some of my early experiences. The good, the bad and the ugly.

The opinions expressed here are my own, not those of my current or previous employers. With that said, lets get on with it.

.NET Support is Excellent

MySQL is beautifully integrated to .NET. Switchers from MSSQL will barely notice the difference in the API. If you code directly against interfaces like IDataReader and IDbConnection (and you should) it is drop and replace. You can connect to both Linux and Windows instances, there isn’t really any difference (unlike for example, DB2 where connectivity is very different across target platforms).

The MySQL connectivity libraries are available as a NuGet package. You can even use the same driver to connect to MariaDb, Percona and Oracle MySQL.

Too much choice?

There are currently at least 4 different versions of MySQL out there: MariaDb, Percona, MySQL (from Oracle) and MemSQL (A hekaton like, in memory optimized version). It takes some time to get used to the ecosystem. The client protocol is identical between the different variants, but picking between them is non trivial. The good news is that both Facebook and Google actively contribute to MySQL, so you will find a lot of cool development mindshare.

But choice isn’t always a good thing. When the solution that must be implemented is clear and well understood, and I would claim that building a database is, pooling resources into solving it The Right Way™ once and for all provides a better experience for the end user.

MySQL Does Not do Query Parallelism

There is no such thing as a parallel query in MySQL! This includes index builds, which are also single threaded. Obviously, this sets some rather serious limitations on what you can reasonably use MySQL for. For large tables, you can work around this limitation by partitioning the tables, but still…

However, you do have the notion of sharding data and using the Shard-Query tool to achieve an effect similar to that of MSSQL, or should I perhaps say PDW.

MySQL does not do Hash Joins

Some might consider this a good thing. If you are an OLTP system, the last thing you want is some developer getting hash join queries on a small dataset, claiming great performance and then horribly breaking stuff when the data grows, the optimiser starts using loops joins and all the missing indexes show their ugly face.

Unfortunately, hash joins are amazing for data warehousing. I have written previously about how they are the bread and butter of good star schema designs. They are sorely missed in MySQL.

MySQL History Lists are EVIL!

(Thanks to Sunny Bains for correcting me on the details of how the purge records are held, I hope this update does it more justice)

MySQL uses MVCC by default, but with a very odd implementation. When a query reads a row that is concurrently updated, the OLD version of the row (the one being read) is kept around in the index. MySQL maintains a list of the old records that are still alive called the “history list”. Eventually, when the old version of the records is no longer visible to queries, they are removed by a background thread – this is called a “purge”. I suppose this implementation is done in an attempt to save on I/O.

Because history lists cannot be “purged” while someone is accessing data in them, what will happen when someone is executing a long running SELECT statement, is that you up making the purge thread fall behind. As the history list grows and long running queries continue to arrive, everything eventually grinds to a halt on the server.

Users of MSSQL or PostGres will find this implementation choice very odd. On older version of MySQL (pre 5.5) the history list purging is single threaded and runs in the main event loop, making things even worse. The history list design is probably the largest design flaw in MySQL and one I hope will be addressed properly.

MySQL is NOT a Data Warehouse Engine

The lack of query parallelism, hash joins and proper handling of long running queries makes MySQL unsuitable as a Data Warehouse engine. MemSQL seems to address some of these issues and there are interesting projects on the that way which may make me change my mind. But for now, I would not use MySQL for anything but OLTP – not even for a mixed workload system.

Users of MSSQL, who often see mixed workloads and handle them nicely, may find themselves frustrated by MySQL when trying to switch. You have been warned.

Auto-increment is Stupid!

As a MSSQL user, you will be deeply familiar with IDENTITY and SEQUENCER to generate keys. MySQL does have the notion of an auto-increment column, which looks somewhat like IDENTITY. However, it has some odd drawbacks:

  • The max increment is 64K (which is inconvenient for certain shard strategies)
  • You cannot reset the seed value easily, especially not to a value less than the current max value (even though there is plenty of space to generate new values)
  • Inserting a value higher than the current value of the auto-increment resets the seed to the new value. In other words: auto-increment is always = SELECT MAX(column)

While the above behaviour protects people against shooting themselves in the foot (having the auto increment duplicate keys for example), it does not provide the freedom to modify key generation in smart ways, for example in replication scenarios.

MySQL supports GUID keys, which has some major advantages that I have previously described.

MySQL has Great Management Tools

There is a large undergrowth of tools for MySQL which fully match the capabilities of Enterprise Manager / Management Studio for MSSQL. There is everything you would want here: Schema exploration, automatic scripting of objects, intellisense, syntax highlighting, version control integration and GUI based management (for those of you who don’t like to script things).

There are a ton of open source tools that help both developers and administrators use the database. My personal favourite is mysql-genocide – which we are using in our QA environment to flush out hard to find bugs.

MySQL Exposes ALL the Knobs

Imagine having access to every undocumented trace flag in SQL Server. That’s what the file my.cnf is all about. In here, you can tweak stuff to your heart’s content. Want to remove ACID properties? There’s a setting for that. Change the way threading works – yep, got that too.  Micro manage sort buffer space? – be my guest and shoot yourself in the foot!

Users of SQL Server have a love/hate relationship with trace flags. On the one hand, experts will often wish for certain trace flags that will make designed database that makes assumptions about the world provide reliable response times. On the other hand, the existence of trace flags often lead people to believe that there is some magic “make my crap schema go faster knob”. Instead of fixing the root cause, which is normally bad database design, such people ask Microsoft to help them find and use these magic settings, allowing their denialist mindset to continue… In contrast, the MySQL administrator has a rather powerful armoury that allows him to tweak everything for max performance – or to blow his own brains out.

MySQL has Lovely Documentation and a Strong Ecosystem

I find that the quality of documentation for MySQL is fully on par with SQL Server Book Online. Furthermore, I have observed that when people write about MySQL, they have a respectful and academic (in the sense that they assume intelligence) approach to the reader. Case in point, the O’reilly book: “High Performance MySQL” – which is like the Inside SQL Server books for MySQL. In here you will find a great introduction to MySQL that is both accessible to the novice, yet useful as a reference guide for the expert. Information is presented clearly, concisely and without bloating the book with screenshots or superfluous information. The focus is on the book is making databases go fast, be reliable and provide good response times. The authors never lose sight of that.

While the inevitable Linux religiosity and unsupported elitism DOES taint the MySQL community, the complexity of database engineering tempers this and creates a willingness to learn from mistakes and designs of other database products (including, you will find, MSSQL).

My experience with asking tough questions to the MySQL community, is that this often results in a deeply knowledgeable debate about the trade-offs between different design choices, not just the spewing of “best practices”.

Instrumentation is Lacking – But it is Not as Bad as You May Think

Users of SQL server who have gotten familiar with the DMVs will find that MySQL (which stores this type of information in the INFORMATION_SCHEMA) is lacking for diagnostic capabilities.

The rich querying of operating system resources from inside the database (memory, thread scheduling, CPU usage, wait times) and statistics about the objects in the database are still in their early days in MySQL – on par with what you would expect from SQL Server 7.0/2000. Tracing is reduced to a single “slow query log” which is stored in the file system.

However, when it comes to digesting trace outputs, there are excellent tools out there to analyse traces and group queries up nicely (example: pt-query-digest). Digesting traces beautifully is something that SQL Server has been lacking, the index optimiser just doesn’t cut the mustard. However, one can argue that with sys.dm_exec_query_stats, its less important to have this digest capability, as MSSQL essentially exposes the output of pt-query-digest from INSIDE the database. Replaying traces at scale is also well supported by MySQL (but like SQL Server, it is still a bit hacky).

Online Operations

Online Index Rebuilds and schema modification is basically not available to MySQL users. The usual ways to work around this can of course be applied, but MySQL is lacking in this area.

SQL Server is by no means perfect either, but quite a bit further ahead.

(Again a thanks to Sunny for lifting my ignorance on this one and enlightening me on the new features)

In MySQL 5.6, a lot of operations are online as documented here:

We are currently not on that version, but will be moving there soon. Since we have some nice index tweaking to do, I look forward to seeing if MySQL will have similar issue to the SQL Server SCH-M lock issue when committing changes.

Partitioning in MySQL is Cool

MySQL supports some powerful partitioning strategies. As you would expect, you can do sliding window, range style partitions. However, you can also perform hash partitioning (a feature we make good use of at Livedrive) or even COMBINE partitioning schemes into composite partitioning strategies (something that users of Analysis Services should be familiar with).

In SQL Server, the lack of deeply integrated hash partitioning has always been a bugbear of mine. While you can fake it with computed columns, MSSQL lacks the awareness of hashing that the MySQL optimiser supports. As should be obvious from studying MPP designs, hash partitioning is the foundation for highly scalable designs – it will be interesting to see where MySQL takes this feature.

One feature I lack in MySQL partitioning is the ability to have non-aligned indexes. In SQL Server, it is possible to partition an index on a table with a different partition function than the table itself. MySQL forces the same partitioning of all indexes.


Even after some exposure to MySQL I am still split in my opinion about the product. It it obvious that the features of MySQL are targeted towards coders who want to “whip up a quick database” and that the design philosophy is one inspired deeply by OLTP systems. This lack of warehouse database features sets some limits on what it can be used for. On the other hand, I am happy to see that MySQL doesn’t try to be everything to everyone.

As long as the big database vendors continue to charge a premium for their products, there will be functionality sacrifices that developers and DBAs are willing to make to get a free database. Considering the price point (0 USD) – MySQL is a surprisingly powerful product for what it does. It is also encouraging to see that the development of MySQL is still focused on “getting the basics right” – not bloating the database with all sorts of beyond SQL/No-SQL like features.

We live in a world where the Web 2.0 developers have forsworn the beauty of relational algebra (mainly, I suspect, because they don’t actually grok single box parallelism) and where the big vendors have their heads so high up in the cloud that they have forgotten their customer base. It is encouraging to see MySQL keeping the relational candle lit, so that the lost NoSQL souls have somewhere to turn when the ACID burns leave them in the dark.

  No Comments

  1. datatake   •  


    Here is a thought for you, if you had an open source database that you can create your own fork for, in fact you would only have to do this for the plug gable storage engine, then if you a high performance solid state card, the driver for which runs on the actual server, if you owned the driver, you could integrate this with the database storage engine. Now, if you have this level of integration and the storage engine has intimate knowledge of the storage . . . . ;-)


    • Thomas Kejser   •  

      Chris… yes, this would be nice. But mySQL is nowhere near the point where that will make a major difference worth going after (like 2x throughput speedup or similar)

  2. Sunny Bains   •  

    For #1, yes, if a record is in the view of a running transaction then purge cannot remove it. MSSQL makes a copy and I can see why it can refer to that snapshot and not rely on the version in the table. The copy IMO has a cost, like I said above, it is a trade off, I don’t want to get into which is better.

    For #2, as Marko mentioned there is a slight complication to do it out of the box but what I had/have in mind is to use the purge threads instead of the user threads and do something similar to what the Change Buffer does, remove entries if we know that it will not lead to an underflow. A bit of hand waving but that’s the general idea. In effect it will allow weakening the total order constraint that Purge works under now, assuming the delete-marked records can be removed of course.

    • Thomas Kejser   •  

      Sunny, we have test servers where we can repro these issues. If you have a build to test new algorithms, just get in touch.

  3. Sunny Bains   •  

    Sure, it is orthogonal. I can see why but it can run out of space and the ghost record is the equivalent of the delete marked record in InnoDB. The two main differences I see are:

    1. Dropping old versions is faster, it is a simply matter of deleting the segment or whatever it is called (inside the tempdb). However, one cannot delete the segments out of order. That’s what the MSDN doco says anyway. Therefore long running SELECTs will prevent version data delete.

    2. Purging of ghost/delete marked rows is not dependent on the version info stored in tempdb. InnoDB needs to traverse the UNDO log to figure that out. This I think can be optimised for the case where the purge threads in InnoDB are blocked by a long running select. We should be able to trigger a cleanup via user threads. Something to investigate.
    This would potentially allow the same behaviour that you perceive to be an SQL Server feature.

    The opportunistic delete etc. are minor optimisations IMO and not important in the big scheme of things. Looks like SQL Server starting from 2005 more or less ended up doing what InnoDB and Oracle do with MVCC but with a minor twist.

    Interesting anyway, one idea that I like is the way SQL Server version log can be constrained by size. It affects readers but not updaters. Something to think about :-)

    One question I have for you is, if the tempdb is not redo logged and is reset on startup how can it do rollback of active transactions after crash recovery?

    • Thomas Kejser   •  

      1) The ability to only hold version records for rows that actually had conflicts is crucial here. A long running SELECT will not necessarily bloat the version store in MSSQL (only for the rows that actually had conflicts). But in MySQL, even if there is no conflict, the purge cannot proceed until the SELECT is done running – correct?

      2) Indeed. Having the purge thread handle both the cleanup of the version record as well as the reorg of the tree seems like you are trying to do on a background thread what a user thread should handle.

      Redo is handled by the main transaction log WAL.

  4. Sunny Bains   •  

    Thomas, I read whatever publicly information was available around SQL Server’s read snapshot thingy. It suffers from the same problem:

    Version Cleanup Rate (KB/s) (SQL Server 2005)
    Monitors the version cleanup rate in KBps in all version stores. If the version cleanup rate is lower than the version generation rate, the version store will use more and more space in tempdb. However, if the version cleanup rate is 0 but the version generation rate is not, there is probably a long-running transaction that is preventing the version store cleanup.

    What am I missing?

    I couldn’t find any documentation about the internals but reading between the lines I’m guessing it is done via some internal trigger like mechanism. The rows in the view are copied to the “append-only-store” that is inside tempdb and that’s why there is no need to delete individual rows. Assuming I’m on the right track, this implies that it is slower than the InnoDB MVCC implementation and then becomes a question of trade off.

    One thing that my simplistic take doesn’t explain is when is the view actually taken, is it timestamp based or transaction ID based?

    • Thomas Kejser   •  

      A few things missing Sunny

      First of all, free space recovery from deleted records is orthogonal to version store. This mean that index bloat from heavily modified indexes can be cleaned up even while the version store is active. The cleaner thread is the ghost cleaner (along with the opportunistic cleaning that goes on during page split)

      Second, you are right the the copy happen as a form of background trigger. But tempdb is volatile (it can reside purely in RAM) – so you don’t need to do the random I/O disk walk to clean up. Its the SELECT statements that are responsible for taking the snapshot, not the modifiers.

      Third, in situations where I execute my long running query as READ COMMITTED (which is the typical approach), rows that that did not need a version are never even generated or prevented from being cleaned up (does that make sense?). Oddly, this does not seem to be the happen with mySQL where touching a row prevents it from being cleaned up, even when I am no longer looking at it and I am not transactionally guaranteed that it wont change behind me (this at least seems to be what I am observing on our servers).

      Fourth, the version store implemented as a series of mini tables inside tempdb. Each table knows the start and end sequence number it belongs to and there are several version rows in each. Cleaning up the version store is as simple as truncating the tables that have transaction sequence number less than currently active. This means that the cleanup is near instant – while it take mySQL a long time to catch up after the version store has stalled (because you need to pay the reorg price of indexes to clean up – correct?).

      The version stored is based on the transaction sequence number

  5. Marko Mäkelä   •  

    Thomas, that is a good suggestion, and it would not even require any file format changes. There could be some locking considerations and I/O overhead, though. Because InnoDB does not keep a per-record DB_TRX_ID in secondary indexes, it may need to do a clustered index lookup to determine if a secondary index record can be safely removed.

    Your idea could make sense when there are no secondary indexes, or if we start to store a per-record DB_TRX_ID in secondary indexes. Currently, purge must remove the clustered index record last, because the current undo log format does not necessarily contain enough information to identify the secondary index records and to determine if they should be removed.

    Another optimisation that I have been thinking about, would be to empty the two system columns when we notice that the record exists in all read views. This would save 6+7 bytes per row in mostly-constant tables, which can be significant when the table only contains a few short columns.

    • Thomas Kejser   •  

      Marko, I think this optimisation actually benefits you the most for secondary indexes (even if it does require a primary lookup). The reason is that a secondary index is much more likely to change than the primary.

      To take a concrete example: We hold metadata for user’s file structure in the data to do our sync. To quickly locate all files in a folder, we have an index on the parent folder for each file (the primary index is on the unique ID of the file). Now, when a user moves files to a new folder (changing the parent_id), this results in a rather large update that immediately bloats the history list. Because our parentID is not sequential (its a hash value) you are quite likely to get some page splits in the page belonging to the old folder too, which would give you an opportunity to piggyback on the user thread to clean up data.

  6. Marko Mäkelä   •  

    This is an interesting discussion.

    My old blog post about the change buffering in 5.5
    can be confusing, because it is describing both the existing InnoDB behaviour and what is new in 5.5.

    Sunny, I have to say that your comment “It doesn’t have anything to with Purge” is wrong. The change buffer can handle “purge” operations in secondary index leaf pages, as long as the operation is known to be doable without shrinking the tree. (This is less often that it could be. We should also redesign the data structures of the change buffer.)

    Now, back to MVCC. The clustered index records contain two system columns, DB_TRX_ID and DB_ROLL_PTR. The former identifies the transaction that last modified (or created) the record version. The latter column is a pointer to the undo log. Because undo log records contain copies of these two system columns in the previous version, it actually forms a “stack” or singly-linked list of record versions.

    I would classify 3 different cases of updates. Let us take the example

    INSERT INTO t VALUES(1,2,3),(4,5,6);

    Updating a non-indexed column
    (UPDATE t SET c=c+4):
    We simply do it in the clustered index record(s). For each updated row, an update_undo log record is created, and DB_TRX_ID,DB_ROLL_PTR will be updated in the clustered index record, along with the column. (From the perspective of undo logging, it does not matter if the size of the record changes or if the update results in splitting or merging pages.)

    Updating an indexed column
    (UPDATE t SET b=3 WHERE a=1):
    We will write an update_undo log record as in the previous step, but we will also perform two operations on the secondary index records (b,a):
    INSERT (3,1)
    DELETE_MARK (2,1)
    Eventually, purge will process the update_undo log record and will actually delete the secondary index record (2,1).

    Updating a PRIMARY KEY column
    (UPDATE t SET a=a+1)
    This will essentially do the same as
    INSERT INTO t SELECT a+1,b,c FROM t;
    DELETE FROM t WHERE a=1 OR a=4;
    That is, the old record will be delete-marked in all indexes, and a new record will be inserted into all indexes. It will be a “fresh insert” (using the insert_undo log) into the clustered index.

    One more note about COMMIT and ROLLBACK. COMMIT leads to almost no page changes in InnoDB. We will simply drop the insert_undo log and mark the update_undo log committed. The B-tree pages will already be in “committed” state. ROLLBACK will replay the undo logs backwards and then proceed to COMMIT the transaction. ROLLBACK TO SAVEPOINT a.k.a. “statement rollback” will replay the logs backwards until the start of the statement and leave the transaction active.

    Fresh inserts (those that go to the insert_undo log) will be rolled back by purging the records immediately by the user thread. All other purging will be handled by the background thread.

    • Sunny Bains   •  

      Marko, right , and thanks for the example, the point I was trying to make was around the “heavy lifting” by purge. So, yes, delete buffering in 5.5+ does what you claim but only if no tree modifications are involved.

    • Thomas Kejser   •  

      Thanks for the additional details Marko. If I may suggest an optimisation here: When you split a page, why not clean up the dead records instead of waiting for the purge to do it?

  7. joeharris76   •  

    In my opinion you’re doing your SQL Server readers a disservice by not calling out the difference between MySQL and the storage engines like InnoDB. I realize InnoDB is the default now but it’s still a big part of why MySQL has been successful. Cf. TokuDB.

    That’s doubly the case when you talk about data warehousing. There are at least 2 excellent storage engines for analytic use (Infobright and Calpont InfiniDB) that will blow the doors off SQL Server for star schema queries. Hit up the Percona blog for some interesting SSB results.

    • Thomas Kejser   •  

      Not sure I follow your point Joe. My comments here are based on experiences from an InnoDb solution. I believe you mean that I should have pointed out the there are other, more DW friendly storage engines?

      I realize that there are other engines that give better performance for data warehouses, but if I pick those, I forfeit the good things from InnoDb. For example, I would expect to run parallel queries on row stores that are optimized for high speed singleton inserts. In other words, I would like my query optimizer to generate parallel plans, no matter how the data is represented in the storage engine.

      One of the issues I have with mySQL is exactly this undergrowth of different vendors who all try to do just about the same thing – as a consumer of the database, this is not helping adoption.

      As you can probably infer from my sections on data warehousing, I am interested in using mySQL for this purpose. I would appreciate if you link the benchmark a your refer to. The ones I have found on the Percone blog have been rather underwhelming. Maybe I am missing something.

      • Joe Harris   •  

        “Undergrowth of different vendors”? Careful, I think your SQL Server background is showing. This diversity of options is a huge strength for MySQL. Consider the pace and scope of engineering that has been done by small lean teams vis-a-vis the glacially slow pace of improvement in SQL Server.

        Different storage engines allow me to obtain vastly different behaviour for different workloads:
        * _Infobright_: Highly compressed column store. *Very* fast on simple queries. Limited to 1 CPU. Limited DDL support. Commercial version removes some limits.
        ** Vadim’s comparison of Infobright with MyISAM from 2009:
        * _InfiniDB_: Column store without compression. Uses vectorised ops to really push the CPU. Commercial version adds compression and other stuff.
        ** Vadim’s SSB benchmark of Infobright and InfiniDB from 2010 :
        * _TokuDB_: Uses a unique B-tree alternative to allow very fast and consistent insert rates. Also compresses the data so you can get much better use of, say, SSD storage.
        ** Numerous posts at Percona’s blog:

        Having to research, understand and try out these ‘engines’ is, IMO, not much different than having to understand all the various options (traceflags, etc), versions (Std vs Ent) and features (Page compression, partitioning, etc.) in SQL Server. There’s a lot of knobs to turn on both they’re just in some different places.

        I should point out that I’m not actually a MySQL expert _at all_. But I do specialise in DW so I like to know what my options are. :)

    • Thomas Kejser   •  

      Joe, I don’t see a lot of vendors working on solving almost exactly the same problem as an advantage to a consumer. A few vendors, yes (monopolies are bad after all) – but not many. You are only creating an interop nightmare and a difficult support matrix.

      To take a parallel example: Mobile Phones. Before Apple invented the iPhone, the so called “smart phones” were horrible. Vendors like HTC, Samsung, Motorola were all creating an interop nightmare and devices were overall unusable and uninventive (I stuck with an old Nokia phone back then, something I previously blogged about). Along comes Apple and sets the standard – closely followed by Android/Samsung. Today, those two vendors dominate the market and as a consumer, my choices are simple, yet I am guaranteed to have a two horse race keeping each vendor sharp. One would hope something similar could happen in OSS with PostGres vs. mySQL.

      Another good example of a working duopoly solving a problem that is clearly defined in in the area of client operating system (Windows vs. OSX). Both vendors are now highly advanced and competing to solve the same problem. In contrast, all the different variants of Linux and its ugly GUI designs are just a mess. Having “choice” adds very little value here and I find it rather sad to see mySQL going down the same route.

      But yes, I suppose it’s the cathedral vs. bazaar argument all over again. I am not a believer in either, I think the best solutions happen when duopolies are established.

      • joeharris76   •  

        Windows vs OS X keeping both vendors sharp? Have you used Windows 8?

        I also note that your blog is running on Linux even if you don’t realize it. Nice mess!

        • Thomas Kejser   •  

          Have you seen the Linux Kernel? It can’t even get spinlocks right. Or what about NFS? A shadow of what SMB3 is… At least the Windows file sharing protocol doesn’t take down the kernel when things go wrong – that Linux have still not managed to implement SMB properly is their issue, not Win… Daily crashes of Linux is common for us, yet our Windows boxes are rock steady. We live with it, because Linux is cheap and we have plenty redundancy (and we have learned to fix it) – but that doesn’t mean Linux is an amazing OS. Similar for WordPress, it’s cheap and good enough.

          And yeah, Windows8 looks like crap – the graphics designers took the mobile idea a bit too far – the market will punish them for it and beat them back in shape (compare the endless GUI war on Linux where everything still looks like crap because there is no market force to fix it). You should have a look at the win8 kernel, it might teach you a thing or two of what it takes to write a true enterprise grade OS kernel.

    • Thomas Kejser   •  

      On the star schema benchmarks Joe: I am afraid the links you point to all represent what I like to call “an order of magnitude expectation error”. The numbers quoted there (like loading a million row table in 2 minutes and running 130 inserts/sec) are terribly underwhelming… Today, the standard for loading data like TPCH is that you need to hit the 1M LINEITEM rows/sec just to be in the game (and engines like Vertica hit 50M/sec)… SQL Server, even on SMP, will happy hit 16M LINEITEM/sec and scale index build linearly up to 128 cores.

      Column stores are also very old news at this time (they were invented in the 70′ies) and even though SQL Server has only recently implemented them – the 10-20x speedup that people measure on them are completely expected and not something that I consider as blowing the socks off anything. The gold standard today is to be able to seamlessly mix column and row stores within the same query optimisation process (something that Vertica again does impressively and SQL Server can be hacked to do)

      • joeharris76   •  

        Um, thanks of the (incorrect) history lesson. In reality, the first production ready SQL speaking column store was Sybase IQ in 1996.

        Also the row/column hybrid approach is called ‘PAX’ and it’s not any sort of ‘gold standard’ just one approach with specific tradeoffs.

        Anyway, if your plan is to stick strictly with Oracle’s MySQL then I wish you luck. If you look at some of the critical bugs that Percona fixed *years* before MySQL then you may need it.

        • Thomas Kejser   •  

          Whau Joe, you sure have some pent up anger

          First of all, if you take a quick dig into the literature on ACM, you will find plenty of earlier examples of columns stores than SyBase. Heck, even the IBM mainframe had variants of column stores available!

          And no, PAX (which I am fully aware of) is not the answer. It gives you some of the benefits of optimized memory layout, but hybrids need much more than that. They need things like large block allocations, bitmap indexes vs. B-tree, different clustering on master and slave, high speed loading optimized for either row or column layout and background repacking of rows to column format.

          As you may have noticed, we are not running Oracle mySQL. I don’t believe Oracle has enough if a commercial interest in evolving the code base. But on that note, Oracle is a very good example of how cool a database CAN be when constructed in Cathedral Mode.

          None of you comments explain what you mean by mySQL blowing the socks of anything. It’s a nice database, don’t get me wrong, but it’s nowhere near what I would expect from something that is a drop and replace for an environment that already runs SQL server, Oracle or DB2.

  8. Sunny Bains   •  

    1. True – but only for secondary indexes where the key changes, that’s why I’m reluctant to generalise the issue.

    2. I really need to look at the code :-) but I will try and answer this one later

    3. Saving on IO is a little difficult to claim in the general case, it really depends on whether the page is in the buffer pool. I would only go so far as to say that it can potentially reduce IO.

    Parting general comment and some ramblings :-). Purge behaviour is rather complex and difficult to generalise around IO IMO. There are several issues here. Think of the UNDO logs as a file based linked list (or queue), a little simplified but will do. User threads are inserting to the head of the queue and reading from arbitrary segments within the queue to build previous versions for MVCC. Purge is reading from the tail of the list and going through the entries one by one and deleting the garbage. Purge is responsible for pruning this file based list, in InnoDB terminology it is called UNDO log truncate. This access pattern leads to a lot of random IO on the UNDO log and also has strange effects on the buffer pool. One way to mitigate the IO part in 5.6 is to place the UNDO logs on SSD.

    You assert that the “History list is EVIL!” :-), however, I don’t actually see this as a bad design decision per se. IMO, the architecture is OK, the problem is the file layout. The UNDO log has several types of entries, only UPD+DEL are considered by purge. The other entries are either required for rollback or for MVCC only. They can be split and that will make the truncate fast. It will also allow Purge to skip entries and revisit them later. It should allow taking snapshots in time etc. Hope that adds some value to this interesting discussion.

    • Thomas Kejser   •  

      Thanks for all your insights Sunny. You certain proved that discussion with the MySQL community are constructive and a pleasure to participate in.

      I think our disagreement boils down to when you want to pay the cost of keeping the current (as opposed to historical) tree up to date. I would claim that paying the cost up front, as SQL server does, when the operation happens and on the hot path, makes for a simpler and more reliable (in the sense of: consistent response time) design. Now it may be that a better implementation of the history list will prove me wrong on this point… Time will tell. I plan to blog more about our experiences and data at a later time…

      Again, thanks for you valuable insights. It was a pleasure.

    • Thomas Kejser   •  

      Oh.. One more thing Sunny. You may want to have a look at the Ghost Record cleaner of SQL server. Here is a good link:

      As you can see there, the cleanup is of deleted records is done opportunistically when pages are split and by scanning allocation maps (decoupled from MVCC)… It seems to me that a hybrid of the UNDO log and the opportunistic cleansing of SQL server might be a pretty good hybrid solution that would provide the benefits of both engines.

  9. Thomas Kejser   •  

    Probably best to proceed with an example:

    I have a table with columns: k, a, b. Let’s say k is the cluster and I have a non-unique index on a and that b is a varchar.


    1) what happens if I change some ‘a’ column values to take a value that is not in the secondary index?
    2) what happens if I change some ‘b’ values to be larger than what fits in the page those rows are on?
    3) what is kept in the pages if I delete some rows by key?

    • Sunny Bains   •  

      Note: Secondary indexes are not UNDO logged, only the cluster index. That’s why the UNDO LOG PTR is only in the cluster index records and not in the secondary index record. Also, the transaction ID on the secondary indexes is on the page not per record, this leads to some complications (and additional lookups etc). Whereas it is per record in the cluster index. This is related to MVCC handling and indirectly to this discussion around UNDO logging.

      1. The cluster index UNDO LOG PTR for the affected records will point to the delta for building the previous column values. The secondary index old entries will be delete marked and new entries inserted for ‘a’.

      2. DELETE + INSERT where the DELETEd entries UNDO LOG PTR will point to the delta for building the previous column values

      3. The cluster index rows will be delete marked and the UNDO LOG PTR will point to the old deltas to reconstruct the previous versions.

      4. What if the deleted rows in 3 are reinserted?
      The old (matching) rows will be delete unmarked and a new delta will be inserted at the head of the delta list and the UNDO LOG PTR will point to the new head.

      • Thomas Kejser   •  

        Thanks Sunny, this matches my understanding.

        1) so what happens with the old entries in the secondary index? If I happen to change the secondary index values a lot, it seems I can build up a large backlog of deleted records. Those are cleaned by the purge thread too, correct? They also get in the way of range scans of that index (because they bloat the pages), true?
        2) I am curious to know what the exact rules are for update in place. I’m SQL Server, the rules are rather complex. Even if you don’t change the size of the column value, you can get into trouble with NULLability and ghost records (when changing a key)

        3+4) would it be fair to say that this is the scenario where the purge implementation saves on I/O (what I alluded to in my blog entry)?

  10. Thomas Kejser   •  

    Not sure I follow your logic Sunny. A record that is removed is still kept in the tree until no one needs previous version, right? Also, if you do update in place, how do you handle situations where the row had to grow?… And what about non unique secondary indexes? What happens when I update a value there to no longer be in the tree?

    • Sunny Bains   •  

      Yes, a record entry is kept in the cluster index but there is only one copy in the cluster index not multiple copies. The previous versions are built up from the deltas in the UNDO log by following the UNDO LOG PTR system column. If the row grows (IIRC) it is a DELETE + INSERT if the row can’t be expanded in place, the last part I need to check the code. There are two types of updates, an update where the key changes and one where it doesn’t. Where the key value changes then it is also a delete + insert but the underlying logic regarding the UNDO log is the same. Secondary indexes are a little bit different and the logic is slightly complicated because we have to do a cluster index lookup to determine the visibility for MVCC. Secondary index records do not have the UNDO LOG PTR system column.

  11. Sunny Bains   •  

    The MVCC description is still not correct IMO. The cluster index has a pointer in every record that points to a list of previous deltas required to build an older version of the row. InnoDB does not keep the entire row around. Long running selects can block purge from proceeding but that is a separate thing.

    • Thomas Kejser   •  

      IMO, it doesn’t really matter if the row is kept around or just the pointer to it. The problem is that the the “heavy lifting” of reorient the tree is done in a background thread instead of amortized over the threads that actually do the query work… It’s like building up a large I/O debt.

      • Sunny Bains   •  

        I don’t really understand the high IO debt part. Can you elaborate?

      • Thomas Kejser   •  

        Of course I can change the max purge lag to force the purge. But this still creates a spiky behavior in a workload that is never resting (like ours).

    • Thomas Kejser   •  

      Sure. So, you keep “delete pointers” around both in the cluster and if I understand correct, also in the secondary indexes. Getting rid of those is an expensive operation, since you may need to change the structure of the tree (reorg parent index nodes, split/grow tree etc). Because I can almost infinitely delay this operation from happening with a few statements, I will be building up a queue of work that needs doing for the purger… When the long runners are finally (if ever) done running, the purge now has a lot of tree modifications pending that require I/O to flush… Assuming that work continues to arrive by query threads – you now find yourself in a bursty I/O, checkpoint like, situation.

      • Sunny Bains   •  

        The records are delete marked. The “pointer” is in all cluster index records, it is the UNDO log pointer. This pointer is the head of a list of deltas (the updates). Therefore there is ever only one record in the cluster index, the current version. InnoDB does update in place, this means that a delete marked record can simply be delete unmarked. This should have the reverse effect in reducing IO.

        Without the purge you would have to delete and then reinsert a new record each time, increasing CPU and IO overhead.

  12. Sunny Bains   •  

    Closing comment, we (the InnoDB team) are not big fans of this global UNDO log (history list) format. There is a *lot* of room for improvement here :-). I think the current format is passed its use by date.

    • Thomas Kejser   •  

      I look forward to seeing what you do with it. Also very curious to know your plans for parallelism. With Hash partitioning already in place, you have the foundations for some seriously scalable parallel query execution.

    • Thomas Kejser   •  

      The books SQL Server Internals is the best place for information how this works. The heavy tree modifications you have to pay, a delete is a delete after all :-) But I would much rather pay the price up front, when I execute the delete, than rely on a background thread keeping up with the workload.

      The old version of the row is maintained in a separate tree (in temp space, which may never hit disk) which is then “purged” periodically. SQL Server does not have the distinction between UNDO and REDO logs that Oracle uses for this. This also means that even if the version store is full (due to long running SELECT) I can continue to modify data, just not select it until more space becomes available.

      There is a decent resource on it freely available here: (see the table comparing Oracle and SQL)

    • Thomas Kejser   •  

      We have been tweaking our max_purge_lag a bit, but the fact that I even need to worry about this problem is quite a foreign thought when coming from SQL Server. Note that SQL server does NOT pay the heavy tree modification cost, data is simply updated in place (or inserted in the tree in the case of INSERT)…When a SELECT statement acquires a lock, you know everything you need to know about which copies you may need to create. When an UPDATE/INSERT/DELETE tries to modify a locked row, it can safely do so by simply given the SELECT a copy of the old row (which is stored in tempdb for easy purging), changing the “master tree” and not polluting the main B-tree with additional rows needed for MVCC… If my understanding of Postgres is correct, it takes a similar approach with its vacuum feature.

      • Sunny Bains   •  

        Fair enough, it is yet another quirk that one has to deal with if you are coming from another environment.

        I don’t see how SQL server can avoid the heavy tree modifications in user threads though. When user threads delete rows in the Btree they may shrink the tree, what I refer to as “heavy lifting”.

        Regarding previous versions, InnoDB does the same thing, the only difference is that the previous version of the row is constructed by following the UNDO log “pointer” into the (global) UNDO log and then constructing the version required to satisfy the view constraint. The difference from what I can tell, if I understand your explanation correctly, is in the format of the UNDO log in SQL server that makes it easier to truncate the old UNDO log entries.

        Do you have a link the SQL server documentation that describes in detail their UNDO log format etc? Thanks in advance.

    • Thomas Kejser   •  

      Will be included in my edits. We are in the process of upgrading to 5.6, so I am quote excited to see this feature available to us. Thanks

  13. Sunny Bains   •  

    The description of MVCC and the history list is completely wrong. Secondary index updates are done in place. The history list is a synonm for the UNDO log. Purge reads the UNDO logs that are no longer in any active transaction’s view and “purges” the delete marked records, both from the cluster index and the secondary index. After which it will truncate the UNDO list entries. The problem that you allude to I suspect is that InnoDB does not reclaim the deleted UNDO log space. This can cause the system tablespace to grow sometimes.

    This background purge is done so that the heavy tree modification changes are not done in user threads. This gives InnoDB an advantage in speed over other storage engines.

    • Thomas Kejser   •  

      Hi Sunny. Thanks for this detail, I will be updating the entry with your additional description.

      Its still a pretty silly way to handle this problem. SQL Server does not use the index to keep the old records around, each query simply maintains it own snapshot of the data and purges it on completion. This is a much more scalable solution that does not suffer from the long running query problem

    • Thomas Kejser   •  

      Sunny. It is still my understanding that secondary indexes has a change log and don’t do the updates in places but instead rely on insert/delete/purge. I base this on the description here:

      We have several secondary indexes in our databases that change frequently and I was under the impression that this creates a lot more purge traffic.

      • Sunny Bains   •  

        I see where the confusion is coming from. The Change Buffer is a completely separate thing. It has not directly related to purge. The Change Bufer (in earlier versions of InnoDB it was known as the Insert Buffer) is an optimisaton to convert random IO into sequential IO. A short explanation (high level) follows:

        1. For inserts it only works on non-unique secondary indexes
        2. For deletes it works on all secondary indexe

        It is a special type of index that keeps track of the leaf pages in a secondary index. The inserts into a secondary index first go into this Change buffer instead of directly to the index leaf page. This avoids expensive IO if the page is not already in the buffer pool. If the page was already in the buffer pool then the change is applied directly to the leaf page. If the leag page happens to be read in during normal operation then the changes from the Change Buffer are merged into the secondary index leaf page.

        The Change Buffer merge is done to disk in bulk and tries to maximise sequential IO, it is usually done when there is low user activity (ideally). This Change Buffer merge is done by another thread, historically called the “Master Thread”.

        It doesn’t have anything to with Purge though :-)

    • Thomas Kejser   •  

      Thanks for explaining this in detail. Does the change buffer merge run on the main thread in older version of MySQL (we are on 5.1). If so, this would interfere with the purge traffic too, correct?

      • Sunny Bains   •  

        Yes, in 5.1 Purge, Insert Buffer merge and dirty page flushes are done by the Master Thread. 5.5 introduced a separate Purge thread. In 5.6 we introduced two new threads the page cleaner thread and up to 32 purge threads.

        Since the the three activities are done by a single thread, it is done sequentially and you can infer from that, if say flushing takes up too much time then the history list will have grown because purge is blocked etc. etc.

        In 5.1 only Inserts are buffered, that’s why it was called the Insert Buffer earlier.

    • Thomas Kejser   •  

      That explains what we are seeing in our environment. Upgrading to 5.6 should alleviate the worst of that it seems.

      You should probably consider having a separate purge and change buffer data structures per NUMA node to partition the memory traversal required while this is going on. This is how SQL Server does checkpointing and other memory intensive operations. Eventually, those background threads fail to keep up and what is designed to be the cold path becomes the hot path.

    • Thomas Kejser   •  

      Two things Toby.

      First of all, we are not greenfield. MySQL was in place before I arrived.

      Second, while I like PostGres for its strict adherence to the ANSI standard, the implementation leaves a lot to be desired. I would encourage you to run a quick code profile of PostGres doing a TPC-C run. It’s not a pretty sight!

  14. Davide   •  

    Hey Thomas, also Postgres is free, but for some reason MySQL seems to be the preferred choice of developers. Now that you’re into it, do you have in idea why? (I have mine but I’d like to ear yours first)
    And, btw, best luck with your new job!!!!!

    • Thomas Kejser   •  

      See my reply to Toby Davide. MySQL as an OLTP engine is actually quite scalable on the box. The backing of Facebook and Google to the project also helps a lot.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">