When Statistics are not Enough – Search Patterns
Co-author: Lasse Nedergaard
Yesterday, Lasse ran into an issues with a query pattern in the large database that he is responsible for. Based on our conversation, we wrote up this blog and created a repro.
The troublesome query we were debugging executed like this:
- Find a list of keys values to search for
- Insert these keys in a temp table – lets call this the SearchFor table
- Join the temp table to a large table (lets call it BigTable) and retrieve the full row from the large table
Why not use a correlated sub query in step 2? In this case, the customer in question had multiple code paths (including one accepting XML queries) that all needed to pass thousands of key to a final search procedure. They wanted a generic way to pass these key filters to to the final access of BigTable. The schema will make it clearer.
Schema
We created a repro and it turns out to be a great look into the strange world of database statistics. Here is the schema:
CREATE TABLE BigTable
(
NearKey UNIQUEIDENTIFIER NOT NULL
, Payload char(200) NOT NULL
, RowType INT NOT NULL
)/* Insert rows to simulate large table with 177 being the skewed value */
INSERT INTO Bigtable WITH (TABLOCK) (NearKey, PayLoad, RowType)
SELECT NEWID()
, REPLICATE(‘X’, 200)
, CASE WHEN n%100 BETWEEN 0 AND 50 THEN 177 ELSE n%100 END
FROM dbo.fn_nums(1000000)/* Creates indexes and statistics */
CREATE INDEX IX_NearKey ON BigTable (NearKey)
CREATE STATISTICS STAT_RowType ON BigTable(RowType)UPDATE STATISTICS BigTable WITH FULLSCAN
/* Insert rows that are NOT in the statistics with key = 199 */
INSERT INTO BigTable (NearKey, PayLoad, RowType)
SELECT NEWID(), ‘New Rows Not stat updated’, 199 /* New type*/
FROM fn_nums (100000)/* Create generic search table */
CREATE TABLE #SearchFor (NearKey UNIQUEIDENTIFIER)
In Production, BigTable is a very large table with billions of rows. Each row in the table has RowType that describes metadata about the row. The metadata is used to populate #SearchFor before accessing BigTable, as this saves significant I/O.
Things to note about BigTable:
- All stats have been fully updated for values that are not RowType = 199. As long as we stay below this value, we have done ALL we can
- There is big skew in the RowType column.
- There are 100K values that are not in the statistics (you can run DBCC SHOW_STATISTICS (BigTable, STAT_RowType) WITH HISTOGRAM to convince yourself of this)
Query Problems
- When a user queries this system, the code will first populate #SearchFor with a stored procedure like this:
Typical search procedures in the system look like this:
CREATE PROCEDURE GetDataA
ASSELECT … INTO #SearchFor
EXEC dbo.ReturnFromBigTable
ReturnFromBigTable contains the troublesome statement that is the subject of this blog:
SELECT *
FROM BigTable
JOIN #SearchFor
ON BigTable.NearKey = #SearchFor.NearKey
WHERE RowType = <SomeValue>
Looks quite innocent, doesn’t it?
To simulate the customer’s workload, let us say we are searching for 1% of the rows in BigTable. We do this by populating #SearchFor.
INSERT INTO #SearchFor
SELECT TOP 10000 NearKey FROM BigTable
TABLESAMPLE (1 PERCENT)
And now, the fun begins! Lets first use our NearKey search and for a non skewed RowType:
SELECT * FROM BigTable
JOIN #SearchFor
ON BigTable.NearKey = #SearchFor.NearKey
WHERE RowType = 99
Plan (only join shown)
Looking at Predicate in properties for the BigTable scan we unsurprisingly see that the predicate is being driven down: [dbo].[BigTable].[RowType]=(99)… Nothing interesting so far. The plan is parallel. Again, not surprising considering that we are scanning accessing 1M rows in BigTable.
… but as Lasse and I observed on his workload, something isn’t quite right here.
Let us run the same query, but with the filter RowType = 99 replaced by RowType = 177. The new plan is:
It is at this point your design intuition should wake you up with that good ol’ WTF feeling. “Wait a minute you should say”…before (when there were fewer rows coming out of BigTable) we were parallel and hashing the output of BigTable. Now, when there are more (50x more) rows, we are both reversing the hash order, probing into BigTable and hashing SearchFor. Note that all estimates are relatively accurate (within 10%) and statistics in the space we are searching is fully up to date… If we take a fully naïve approach, trusting relational algebra and query optimizers, there is nothing more we can do here.
But hey, it gets worse: What happens when we go outside the boundaries of the statistics. This for example, could happen when the stats are not fully updated – a completely normal situation in a database. Searching with RowType = 199 (that is not in the histogram) and estimating the plan, we get:
The estimate coming out of BigTable is 1 row, the actual is 100K; this is probably going to hurt.
Running the query gives us the deadly:
Congratulations to us, we just committed performance suicide!
Lets just list the execution times of these query variants:
| Join Strategy | RowType | CPU time | Elapsed | Logical I/O Reads |
| Hash BigTable Probe #SearchFor |
99 | 421ms | 549ms | 31429 |
| Hash #SearchFor Probe BigTable |
177 | 765ms | 1293ms | 31429+62 |
| Merge Join (sort both) | 199 | 1575ms | 2261ms | 31429+62 |
What is the problem here? From the user’s perspective, the query response time will often vary unpredictably – which is a poor experience. But worse, from a system and DBA perspective: the memory consumption will vary a lot depending on which query plan we end up with. This means it becomes hard to extrapolate the total system throughput and resource requirements based on the queries. If you are capitalist inclined in your mindset, let me translate this: At the end of the day, this costs you money!
Why is this happening? It is happening because there are hidden assumptions here that you have not brought into the light. We are trusting technology (in this case the database statistics and optimizer) with solving something that is a data modeling problem.
This is a very good example of something Thomas has been seeing a lot lately: An unwillingness to be specific about your requirements costs you money – a LOT of money. In other words: There is a price tag associated with ignorance. Let us make it clear what we are NOT saying: There is nothing wrong with ignorance: it is hard to know what you don’t know. Sometimes, you simply don’t know a requirement exists. The trouble arrives when denial and lack of curiosity and analysis trumps the observed data.
Fortunately, Lasse is the curious kind and he asked all the right questions and sought knowledge to remedy the situation.
Let’s get pragmatic, and talk about solutions.
The solution
With all the query plans above, which one do we want? In this case, the answer is: “None of them”
Here is an observation about the workload we discovered as we were analysing the solution: It is reasonable to assume ask the user to accept that response time is proportional with the number of results returned. From a system perspective, we also want the memory consumption and CPU to be proportional to the returned result size. These are questions you have to ask yourself as system designer and discussion you can have with users.
This is where computer science comes into the equation. We know:
- That we can search a B-tree of n rows in lg(n) time (which is as good as constant)
- That the the returned result are capped by the number of rows (let us call this R) in #SearchFor
- That we are willing to accept that the runtime and resource use is proportional with number of rows returned
- That we cant live with unpredictable behavior
The optimizer KNOWS about the search time and the cap (1+2). But the optimizer does NOT know that we are willing to declare certain properties that narrow the search spare (3) and that we want predictability (4). Our "ignorance” has a price, and the optimizer makes a choice that might be optimal, but which unfortunately is also unpredictable.
This is a great case for hinting. And we can do it like this:
SELECT *
FROM #SearchFor
INNER LOOP JOIN BigTable
ON BigTable.NearKey = #SearchFor.NearKey
WHERE RowType = 177
Note something important here: The order of the joined tables matter when you use the LOOP hint (which is why we flipped them above). The above hint BOTH forces the join order and the join strategy.
An alternative, which may be more generally applicable is this rewrite:
SELECT *
FROM BigTable
INNER JOIN #SearchFor WITH (FORCESEEK)
ON BigTable.NearKey = #SearchFor.NearKey
WHERE RowType = 177
OPTION (LOOP JOIN)
The above rewrite does not guarantee join ordering. However, with the FORCESEEK hint, the plan we want is almost guaranteed.
Let us add some new rows to the result:
| Join Strategy | RowType | CPU time | Elapsed | Logical I/O Reads |
| Hash BigTable Probe #SearchFor |
99 | 421ms | 549ms | 31429 |
| Hash #SearchFor Probe BigTable |
177 | 765ms | 1293ms | 31429+62 |
| Merge Join (sort both) | 199 | 1575ms | 2261ms | 31429+62 |
| Hinted LOOP | 99 | 406 | 718 | 81259 |
| Hinted LOOP | 177 | 499 | 964 | 81259 |
| Hinted LOOP | 199 | 306 | 718 | 81259 |
Above, we can see why the optimizer didn’t want to help us! The number of IOPS required for my hinted plan are much higher than the optimizer generated plans.
The hinted plan shape is:
Obviously, we could replace the non clustered index with a clustered on (which would get rid of the RID lookup) to make this query faster. But in this case, this was not a viable solution for other reasons.
Summary
In this blog, I have shown you an example of database optimizers failing to make the “right choice”. There are several reasons this is happening:
Lack of information: The database needs declarative information about the user’s intent and what we consider the Right Choice™. In the example, the missing information is: “the speed of the returned result should always be proportional with the result size”. We used a hint to control the optimizers behavior, using our knowledge of the physical execution of queries. Other people might have taken more extreme steps and argued for noSQL – perhaps too aggressive a step in this case.
Leaky abstractions: We saw the line between the “logical” and “physical” model break down. We believed that by building the right model, we had supplied enough information to the database. The distinction between physical and logical model has, as Thomas previously argued, always been a pretty useless paradigm. We have to consider the components of the system in a holistic way, but not at the expense of maintaining the overview of individual building blocks.
Optimality vs. Good enough: the optimizers default behavior is to look for “optimal plans”. In the plan search space, even a very small space like this example, plans are sometimes found that look optimal from statistics. Yet, these plans may not be optimal or have the properties we seek – or we may fail to recognize the good plans as we search the space. In our case, we just wanted a plan that was “good enough” and had certain predictable attributes.
False hardware assumptions: The optimizer makes some assumptions about the cost of IOPS which unfortunately do not correlate with the reality of a modern machine. It assumes that optimizing for a low number of Logical IOPS is a Good Thing™. In this light, the plans generated are the right plans. However, the optimizer does not take into account that the I/O system might be fast, and that there is enough RAM to have a likelihood of finding some of the needed data already residing in the buffer pool. To prove this: Recall that we had fully updated statistics available for queries going after RowType = 177. But, running the query on a petty IBM laptop the loop hinted plan, even on an empty buffer pool, is still faster than running the plan generated by the optimizer.
The simple query we studied here raise some interesting questions about the design tradeoffs that the data modeler and architect is forced to make. In this case, we were fortunate that Lasse was on the alert for these tradeoffs. With this example, we have seen what the potential consequences of “behavior and model ignorance” are and why you need to be alert to it.
Happy New Year Update – and OOF
Happy new year everyone. First of all, thanks for staying in touch and reading my site. According to Google Analytics, I have over 27K unique page hits and there is a steady traffic of over 50 hits/day of “not search engine crawler and not spam” traffic. Considering that I have done very little to expand my reach (with the exception of picking the inevitable fight with Dan Lindstedt in the Kimball forums) I think that is pretty nice. Peak traffic is now 500 page views in a single day for the most popular post (“Boosting INSERT speed Generating Scalable Keys” in case you were wondering)
The structure of my writing is beginning to converge on a well defined set of categories. One may even argue that I am becoming a grumpy old man with set opinions. Because of this, I have updated the taxonomy of the site to make it easier for you to track the stuff you are interested in.
Basically, it works like this (and I will keep it “insert only” after this – promise)
- Engines: Advise and observations specific to one database engines (and I use that term broadly).
- Grade of the Steel: My performance testing experiments and crazy tuning, where I try to squeeze out the max throughput of large systems.
- Modeling: Everything related to modeling, sub grouped by the type of modeling. There will be engine specific things in here too, but I will strive for it to be stuff everyone can benefit from reading. Readers who are only interested in my data warehouse series should RSS subscribe here.
- Musings: My personal views on the world, where I get to practice my writing and exhibit my freedom of speech. If you like my content, but hate my guts (or if you just are easily offended) you may want to skip past these posts.
- Utilities: Small utilities I sometimes create (under GPL) and a bag of tricks that did not fit in anywhere else.
- Note that there is a many-to-many relation between posts and categories: posts can appear in more than one category at a time. Of course, you can still subscribe to the “master level” directly at blog.kejser.org/feed where all posts will appear to track it all.
- The tag cloud will evolve over time and helps the search engines categorize what is going on here. It also has the narcissistic effect of showing me a “mental map” of what I am most interested in.
I will be going OOF until end of January. After this, the plan is to pick up the Data Warehouse series again.
Again, thanks to everyone who is reading and commenting. Here’s a toast to yet another year of arguments and strong opinions…
Cheers!
Thomas
Running Many Batch Statements in Parallel
Location: Somewhere over the Atlantic
When designing highly scalable architectures for modern machines, you will often need to do some form of manual parallelism control. Managing this is not always easy, but in this blog I will give you one piece of my toolbox to help you.
Let us walk through an example together, a tiny case study. This is a problem which many of you will be familiar with.
Let us say you have 16 files that you want to load into the same table in your database in an automated manner. The naïve approach will do something like this:
BULK INSERT MyTarget FROM ‘C:\temp\MyFile1′ WITH
(FIELDTERMINATOR = ‘;’, ROWTERMINATOR = ‘\n’)
BULK INSERT MyTarget FROM ‘C:\temp\MyFile2′ WITH
(FIELDTERMINATOR = ‘;’, ROWTERMINATOR = ‘\n’)… etc…
BULK INSERT MyTarget FROM ‘C:\temp\MyFile16′ WITH
(FIELDTERMINATOR = ‘;’, ROWTERMINATOR = ‘\n’)
Now here is the problem with this approach: it executes one statement at a time. Sequential execution is BAD, you need to stop thinking about the world like that if you want to scale on a modern architecture.
Lets assume we have enough hardware resources (in this case, it would take a blade server and a decent I/O system). What we really want is to run every one of these statements in parallel. Unfortunately, SQL server does not have a command to start up new connection from inside T-SQL… what to do?
Getting to the Command Line
Because you cannot execute more than one command on a single connection at a time, we will need multiple connections to SQL Server and this mean we have to go back to the command line. Let us start by creating a little batch file Worker.Cmd with this content:
REM Worker.Cmd File
CALL SQLCMD –S.\MyServer –q”BULK INSERT MyTarget FROM ‘C:\Temp\MyFile%1’ …EXIT
This allows us to invoke a bulk load for the first file by executing: Worker.Cmd 1
Unfortunately, we still cannot start multiple connections without manually firing up a lot of command prompts. The coders in the audience may at this point reach for their favorites programming language to write a little utility that can spawn multiple copies of an executable.
However, there is a problem with such a home made executable: you cannot generally rely on a server having the necessary runtime libraries. Typical comments might be:
“No, we don’t have .NET 4.0 here, this is not yet certified by our infrastructure department. Could you recompile it for 1.1 please?”.
“Power Shell is much too fancy for us, what is wrong with running this on Windows 2000?”
Perhaps this customer is just skeptical about letting you run your executable on a server. This may sound silly, but I have seen this happen too many times to make assumptions.
Start to the Rescue
There is a very nice little utility for the good old command prompt that allows you to fire up new processes: START.EXE. This comes with all versions of Windows and it takes any command line executable as input, fires it up in a new thread and returns control back to the caller.
Using start.exe, we can write batch script that fire up multiple copies of the same executable. It looks like this:
REM SpawnMany.cmd
REM Author: Thomas Kejser
REM Purpose: Spawns many copies of the same executable. Useful for running many things in parallel@ECHO OFF
ECHO Spawning %2 copies of %1FOR /L %%i IN (1, 1, %2) DO (
ECHO Spawning thread %%i
START "Worker%%i" /Min %1 %%i %2
)
Each new process is started in a minimized window and we pass the thread number and the total number of threads to it. Using this little batch script, we can now do this:
SpawnMany.exe Worker.Exe 16
This starts 16 workers, each with their own thread number assigned. Very useful for running stuff in parallel in a quick and dirty way. For example, I use this to run the TPC-H data generator dbgen.exe highly parallelized.
Notice that I added the EXIT command at the end of the worker.cmd batch. This makes sure that the window closes itself when done executing.
Summary
In this blog, I have shown you how to write a little batch script to fire up multiple threads, from the command line. each doing their own work in parallel. The script is “zero dependency” which makes it ideal for server use and for hacking together quick and dirty parallelism for test scenarios.
I mentioned that SQL server does not have a way to start up new connections from T-SQL. This is not strictly true. Sorry for leading you astray, but I wanted you to see how to do this from the command line first (and go through the pain
). There is a way to hack SQL Server and implement a stored procedure I like to call sp_executesql_async. This will be the subject of a future blog, but since I am heading into a lab for a few weeks, you just have to wait for it.
OT: Optimizing Skyrim
The winter solstice peace has finally settled and I am heading up North, Scotland, to indulge in good food, good company and great amounts of Whiskey.
In the little spare time I had the last month, I have had a chance to play Skyrim. Some people may call me a sad character, but I apply tuning to all aspects of my life – including the computer games I play.
Here follows my guide to the optimal Skyrim Character Build. If you do not play Skyrim, this article will be lost on you, but I hope to post more interesting things during this vacation.
Target Character
I like to tune things by first knowing the engine I am tuning for, and then looking for the optimal hill to stand on in the landscape of design points.
I both want the leveling experience to be smooth at lower levels (making me enjoy the process), while at same time not sacrificing ultimate power at later levels. Also, I don’t want (like it was often the case in older Elder Scrolls games) to spend stupid amounts of time on leveling some irrelevant skills.
Here is what we are aiming for:
Level: 81 (The max achievable in the game)
Skills: All at 100
Magic Resistance: >75%
Fire/Frost Resist: >50%
Melee Damage Output: One Handed Weapons 300+ DMG/hit
Ranged Weapon Output: Unlimited cast destruction spells, either fire or shock
Equipment:
| Item Slot | Item | Enchant 1 | Enchant 2 |
| Head | Dragon/Deadric Plate | Fortify Destruction | Fortify One-Handed |
| Body (Option 1) |
Dragon/Deadric Plate | Fortify Destruction | Fortify Destruction + Regenerate Magicka |
| Body (Option 2) |
Ebony Mail (Daedric Artefact) |
Muffle | Area Effect Damage |
| Arms | Dragon/Deadric Plate | Fortify Destruction | Fortify One-Handed |
| Legs | Dragon/Deadric Plate | Resist Fire/Frost | Muffle if you don’t use Ebony Mail, if not: Fire/Frost |
| Neck | Any | Resist Magic 25% | Resist Magic 25% |
| Ring | Any | Resist Magic 25% | Resist Fire/Frost |
| Weapon | Legendary Deadric Sword / Dagger | Shock Damage | Fire Damage |
This character is capable of sneaking silently in Heavy Armour (due to Muffle enchant). In dungeons, you can sneak up on pretty much any enemy and one hit them with a backstab.
But you are not safe, sometimes you do not have time to hide or people see through your sneak. The combination of legendary deadric weapons with shock damage enchant (which stacks with the Destruction perk) will make short work of any enemies that take you by surprise.Because your destruction skill is fortified to 100% with enchants, you will not spend charges on the item either.
Having 100% fortified destruction also means you can take down pesky ranged attack enemies (ex: dragons, vampires and necromancers) with either shock or fire (I prefer shock, it just looks cooler). You can do this while having your sword equipped, or you can dual wield spells. This takes away all the hassle of managing items.
Selecting Race and Attributes
Your first choice in the game is your race. I recommend one or these three:
- Nord: 50% frost resist
- Dunmer (Dark Elf): 50% Fire resist
- Breton: 25% Magic Resist
All the other races have abilities which are either useless (ex: Imperial Voice of the Emperor), can be easily replaced with potions (ex: Argonian Water Breathing, Wood elf Disease Resist) or matched with a few extra levels (ex: Altmer Magicka Bonus)
However, the resistances of the Nord, Dumer and Breton are very useful. Since you are limited to 12 apparel enchantments, any amount of additional resistance is good news. I started the game as Imperial because I liked the money perk – but later discovered that money is abundant later in the game. Looking back, I should have picked Nord.
My one other consideration would be the Orc, getting double damage output from berserk could make for some spectacular videos for YouTube and for powerful boss takedowns.
With regards to attributes, I recommend you don’t go higher than about 300 in Magicka. As your character progresses, you will gradually eliminate the need for this attribute through enchantments. You just need enough of it to cast the highest level spells once. Stamina is somewhat useful for carrying lots of loot and doing power attacks. A ratio of 3:2 between health and stamina is probably around right.
Building Craft Skills – The City Routine
Craft skills forms a very important part of Skyrim. My optimized character relies heavily on them for these reasons:
- They allow you to gradually improve the gear instead of relying in “big finds”
- They give you the OPTIMAL final gear in the game
- They are a great source of making gold
- They provide free levels, without combat, early in the game
In order to gain the benefits of crafting early, when money is scarce, you need to gradually train both Alchemy, Enchanting and Smithing.
Follow the routine below whenever you visit one of the major cities to restock. I like to use Whiterun as my base, because it has an enchanting table next to the alchemist lab in the castle and all stores conveniently located near the spawn point. Also, Whiterun has a home you can buy right next to the blacksmith
1) Visit the blacksmith.
- Buy all of his/her:
- Iron Ore
- Silver Ore
- Iron Ingots
- Leather Strips
- Leather
- Turn the leather into leather bracers
- Turn the iron ingots into iron daggers
Be on the lookout for Armour that has the muffle enchant. If you see it, buy it.
While you are walking around town, cast transmute to turn iron/silver ore into gold ore.
2) Visit the Alchemy Store
- Buy all ingredient that cost less than 100 gold (you will soon make this money back).
- Sell all potions you have acquired or created that are NOT Magicka or Heath potions.
3) Visit General Store, Tailor and Local Mage
- Buy all common, lesser and petty soul gems. Including the ones that already have souls in them.
- If you can afford it, buy great and grand soul gems too
- Buy any lock picks you see, you will need them
- If you do not already have them, buy unenchanted arm, head and body cloth armour
- Sell misc. stuff you have picked up on your travels to get the money you just spend back from the merchant.
- Sell old version of your gear that you no longer use (mages will buy rings/necklaces from you, general stores buy most items)
3) Visit an enchanting table
- Disenchant any gear you have picked up that has unknown enchantments
- Enchant all the iron daggers you have created with Fortify Stamina or Turn Undead. This levels up enchanting and makes you money later
- (Optional): Quaff a Fortify Enchantment potion
- Enchant Clothing, Rings and Necklaces with Fortify Alchemy. Four items take this enchant
- Enchant Clothing, Rings and Necklaces with Fortify Smithing. Four items take this enchant. (If you have the Extra Effect perk, you can combine this with step above)
- If you have a newly crafted weapon, enchant it with Shock or Fire Damage (or both, if you have Extra Effect perk)
4) Visit the Alchemist table
- Put on your best Fortify Alchemy gear
- If you have ingredients, create the one of each of the following:
- Fortify Enchanting (Blisterworth, Glowing Mushroom, Sabre Cat Tooth, Spriggan Sap)
- Fortify Smithing (Blue Butterfly Wing, Hagraven Claw, Snowberries, Spriggan Sap)
- Create all the Health and Magicka Potions you have ingredients for
- Use up the rest of the ingredients on the most expensive potions you can create, leveling skill. Good potions to level up with are:
- Slow (Deathbell, Large Antlers, River Betty, Salt Pile)
- Paralyze (Briar Heart, Canis Root, Human Flesh, Imp Stool, Swamp Fungal Pod)
- Invisibility (Chaurus Egg, Nirnroot, Ice Wraith Teeth, Luna Moth Wing, Crimson Nirnroot, Vampire Dust)
- Ravage Magicka (Many ingredients, experiment to discover)
- Any potion with multiple effects
5) Visit the Blacksmith again
- Sell the enchanted daggers you created. This should earn you back your money
- If you do not already have one of the proper material, create the highest level dagger or sword you can (buy ore/ingots if needed)
- Wear the best Fortify Smithing Gear you have created in Step 3
- Quaff your best Fortify Smithing potion
- Improve your weapon using a grindstone
- Improve your armour using the workbench
- Craft all your gold ingots into and equal number of necklaces and rings
Don’t spend gold on anything else than the above routine – just don’t buy anything else from vendors. After a few iterations, you will at least break even. Also, while this may seem complicated at first, it takes only a minute or two once you have done it a few times. I go through this iteration once per visit to a city, and every time your crafting gear gets a little better – until you get gear that boosts your Alchemy and Smithing skill with +150% or more. You will also be able to creating Fortify Enchanting Potions that add +32% to your skill, which adds a significant boost to your late game gear.
This routine also makes sure your gear levels with you, and it positions you to create the super powerful dragon/deadric plate armour and deadric dagger late in the game. Lets talk about how you get there.
Spoiler Warning: The following will contain spoilers
Your first 30 Levels
The focus of the first level is to build some basic survivability, while ensuring a strong setup for the later levels. The game is quite gentle on your in the beginning, so you get some space for error.
After escaping the dragon and visiting the standing stones, choose the mage stone. You will be doing a lot of spell casting soon.
After getting to Whiterun, your first task should be to join the companions and pursue their quest line till the very end of it. This gives you two advantages: 1) You get the ability to transform into a werewolf which is very useful for boss encounters. 2) You get access to free warrior training (more about this later).
The first 20 or so levels, your basic combat strategy is:
- Cast Conjured Sword in one hand, and attack enemies that get up close
- Cast shock or fire Attacks with the other hand
- Cast Heal spells whenever you are hurt (and gain skill in Restoration)
- Try to sneak up on enemies for backstabs
- Use alteration spells like Oakflesh and Lesser Ward to protect yourself
- Wear cloth armour for best speed and silent sneaking
This strategy levels up Destruction, Alteration, Sneak, Restoration and One-Handed, giving you that edge you need during combat. You will be wearing cloth armour, since the early game alteration spells are better than what you can craft (and it adds skill points too). Cloth also improves your chances of sneaking successfully.
Whenever you return to cities to sell loot, follow the city routine
Perks:
Pick perks in this order (if you do not yet qualify, skip to next one down the list, don’t save them up)
- Destruction: Novice Destruction – gets you started
- Destruction: Augmented Shock or Flames (x2) – Significant boost in damage output
- Destruction: Dual Casting – Unlocks more damage
- Destruction: Impact – stun while casting
- Smithing: Steel Smithing
- Smithing Dwarven Smithing – Easy to find in Dwemer ruins
- Stealth (x5)
- Stealth: Muffled Movement – needed to sneak in heavy armour
- Stealth: Backstab – Makes backstabbing worth it
- Smithing: Orcish Smithing – useless, but prerequisite
- Smithing: Ebony Smithing – Your run of the mill armour, until you get
- Smithing: Deadric Smithing
- Enchanting: Enchanter (x5)
- One-Handed: Armsman (x5) – boost output
- Heavy Armour: Juggernaut (one level only for now)
- Heavy Armour: Fists of Steel – prerequisite
- Heavy Armour: Cushioned – prerequisite, unlocks power!
“Wait a minute”, you may say: “How can I level up Heavy Armour when I am not using it?”. Glad you asked: This is why you should be joining the companions. Once you have reached a high enough status in the guild, you can recruit their members as followers. One of the followers (Vilkas, I think) will train you in Heavy Armour. At the first given opportunity, start doing this right after EVERY level up (you can only pay to train 5 skill points each level). The beauty of having Vilkas as a follower is you can trade with him and get back the gold you spend on training. Free training.
At level 30, you should have nearly 100 skill in heavy armour. A fair amount of skill in Destruction and Sneak and some basic skill in One-handed and Alteration. As you hit level 31, you can take the Heavy Armour: Conditioning perk and you can now move around unencumbered, muffled (if you found the enchantment) and with high speed, in the best armour you can craft. Don’t worry if you cannot craft Deadric or even Ebony yet, you will soon get there.
Other Tips:
- Equally balance Magicka, Stamina and Health when leveling up to 30.
- When you are in draugr ruins, always try to backstab the inactive draugr. Even with a low sneak skills, they are easy to sneak up on
- Enchant a blade with Soul Trap and use it in your off-hand to finish off any enemies whose soul you want.
- Obtain the Alteration: Transmute spell early so you can make gold necklaces and rings
- When cornered, switch to werewolf form. This is also useful for taking down bosses you cannot sneak up on
- Obtain the Muffle spell from the illusion school early, and cast it frequently in dungeons. This levels up Illusion skills and makes you much less likely to be detected
- If you still find at hard to sneak up on people, join the Dark Brotherhood or Thieves Guild to get their special armour. But if you do, don’t level up your light armour skill
- In Riften, you will find a quest (Unfathomable Depths) that takes you to some dwarven ruins to deliver a lexicon. Do this quest early, to obtain a powerful ability that makes you level faster in blacksmithing and gives you an armour bonus when wearing Dwarven
- Go to Solitude to work on becoming the thane there. You may have to do the Bard College quests first. The thane quest starts in the jarl’s castle in Solitude.
- It is a good idea to stick to the roads at early levels, and stay away from mountains. Trolls and necromancers can make short work of you while you are still weak.
- Travel with the coaches you find at stables to get to large cities, don’t try to walk there.
Rounding off the Character – Getting to around level 60
When you hit level 31, you have established a basic fighting technique and is now able to wear heavy armour while dual casting deadly destruction spells that knock enemies back. Likely, you will be wearing dwarven armour (with the nice 25% bonus if you took my advise) at this point.
It is time to bolster your defenses against spell casters, dragons and range attackers, all of which will be common now. At this point, it is probably a good idea to switch to the Lover’s Stone, since your skill leveling will be more varied.
You may also want to work on losing your werewolf blood to get well rested bonuses again (you cannot gain well rested while you have werewolf blood)
Keep doing the city routine until you have the best gear in the game
Perks:
Since you already have a strong build, it is now less important which order you take perks in – enjoy the smooth sailing. However, it IS important WHICH perks you take and I recommend you focus on this priority list:
- Enchanting: Insightful Enchanter – this makes your one-handed enchants much stronger, boosting damage output
- Enchanting: Corpus Enchanter – useless, prerequisite
- Enchanting: Extra Effect –this makes your gear ridiculously powerful, you can start experimenting with prototypes of the gear I introduced in the beginning
- Alchemy: Alchemist (x5) – a this will greatly boost your smithing skills, making your weapons and armour even more powerful, at this point, you should be using Ebony
- Alchemy: Physician – useless, unlocks next skill
- Alchemy: Benefactor – more skill points in your potions, this is your final skill in the Alchemy sign
- Heavy Armour – Finish off the sign. Perhaps with exception of Matching Set, if you plan to wear the Ebony Mail
- Illusion: Novice Illusion – Prerequisite
- Illusion: Animage – Useless, but prerequisite
- Illusion: Kindred Mage – as above
- Illusion: Quiet Casting –the only skill we really want from this school, hours of fun
- Sneak: Finish off the sign, the last skill is particularly useful. Take Deadly Aim/Assasins Blade only if you want to dual wield daggers.
With fully leveled up sneak and Quiet Casting, you can play a lot of interesting tricks on your enemies. Should they get close, you can finish them off quickly with enchanted, One-hand weapons.
Other tips:
- When you get access to Deadric armor, don’t enchant it until you have peaked out on both alchemy and enchanting skills as above. At this point, you should be able to create the ultimate gear as described in the introduction
- Stop adding Magicka during level up when you hit about 300. Destruction spells will be free soon.
- If you have trouble finding money to buy ebony, take the Dragon Smithing perk. Dragon scales and bones are abundant and you can experiment with different enchants until you have maxed out Alchemy and Enchanting.
- As your Destruction skill gets around 75, you need to work on the College of Winterhold quests. They will give you higher level spells.
- Do the Azura Star quest and get the Black Star. This gives you an unlimited amount of Grand Souls.
- When entering a room with many enemies, set traps for them using lightning runes. This levels up Destruction fast
- Do the question Boethia’s Calling to obtain the Ebony Mail. This rounds off your sneak abilities very nicely, especially if you have not found the Muffle enchant yet.
- Continue the trick used to level up Heavy Armour. The companions will supply you with free training in Archery, Block and Two-Handed. While you will not be using these skills – you should still train them 5 points every level. They contribute to your total levels and perks.
- If you have joined the thieves guild, you can use the same trick to level up Lock Picking and Pickpocket. While you will level up Lock Picking, it is annoying to level up, and so is Pickpocket
- Disenchant the Shield of Solitude (obtained by becoming thane of Solitude). It contains an enchantment that can be stacked with the regular Resist Magic on the same item. This allows you to reach the cap of 85% Resist Magic by enchanting a necklace and a ring.
- If you are Nord, enchant your boots with Resist Fire.
- If you are Dunmer, enchant your boost with Resist Frost
- If you are Breton, you can skip on enchant on the necklace slot and get both Resist Frost and Resist Fire.
- With the proper combination of resistance enchants (as per first section) you can stand in the middle of a dragons breath and take very little damage.
Boasting – getting to level 81
When you have reached the build above, most enemies will fall fast to either your spells or your blade. Enemies will still level up with you, so prepare for some interesting fights.
From here, there are a few paths you can take that will make your character even stronger, and more fun to play.
- Level up Alteration to get another 30% magic resist. This frees up an enchant slot (because Resist Magic caps at 85%) and allows you to boost one-handed damage even further. With the proper Fortify One-handed enchants, you can one-hit most enemies.
- Level up one-handed and specialize in axes, sword or maces to create fun effects while fighting people (heads will literally roll)
- Dual wield daedric swords. Combine with dual wield perks and fortify one-handed enchant for massive melee damage output (who needs a shield and magic anyway?)
- Level up Pickpocket and kill people by placing poisons (which you can make very strong now) in their inventory. Plunder entire cities, steal people’s weapons and armour and watch them helplessly try to fight you.
- Level up Conjuration (you already have a high skills, you used Conjured Sword, remember?) and summon an army of minions to do the dirty work for you.
- Level up Archery and destroy enemies from afar while comfortably hidden with Sneak.
Getting the Last Levels and skill points:
To get all the way to level 81, you have to get most of your skills to 100. All your craft skills (Alchemy/Enchanting/Smithing), Sneak, One-handed, Heavy Armour and Destruction should already be close and easy to finish off. If you used the trainers like I advised, you also have a good skill in Two-handed, Block, Archery, Pickpocket and Lock Picking. Here are some tips to finish it all off:
- First: Use your super powered enchanting skill to create items that boost the skill you want to level up.
- To level up two-handed and archery: Take you newly crafted, legendary great sword and deadric bow to a dragon or giant fight. Use items to boost damage output. Leave your companions at home.
- To level up Block and Light Armour, take your big shield and dragon scale armour to a giant fight. Heal with one hand and hold the shield with the other (this also levels up restoration)
- To level up Illusion, create gear to provide 100% Fortify Illusion, then cast Muffle non-stop while sneaking in town
- Buy as much skill training in Light Armour as you can (5 points each level)
- To level up Pickpocket, quaff potions or use skill items and plunder a city
- To level up alteration, hold up wards against dragons and other big creates
Tank Style Variant
As I played the game this far, I realized there is an interesting variant of this strategy. Overall, I think the cloth wearing technique for the first 30 levels is pretty solid. But it could be argued that you could instead use a shield and either destruction or a one-handed weapons for survivability (instead of sneak) until you reach get sufficient skill to make heavy armour weightless and fast (I find that moving in heavy armour is simply too slow, so I defer until I get the proper perk).
For such a build you would completely skip the Illusion and Sneak trees and instead focus on boosting melee damage output until you can eventually give up the shield as you acquire strong, enchanted gear. Without sneak, you will get into some much nastier fights, but you can counter this with additional skill in one-handed weapons. The basic gear you will be aiming for is mostly the same. Though you may free up the Muffle enchant slot for more one-handed damage output.
Summary
In this very off topic blog post, i have described some of the strategies I have used to optimize my Skyrim game experience. I am fully aware that not everyone enjoys computer games in this way – but this is the way I like to play it: With max power.
I have created a new category on this site for musing like these. If you are only interested in my technical ramblings, you can simply subscribe to individual categories found on the right side of this page.
Don’t Become a One-trick Architect
We are near the dawn of a new workload: BigData. While some people say that “it is always darkest just before the dawn”. I beg to differ: I think it is darkest just before it goes pitch black. Have a cup of wakeup coffee, get your eyes adjusted to the new light, and to flying blind a bit, because the next couple of years are going to be really interesting.
In this post, I will be sharing my views on where we have been and a bit about where we are heading in the enterprise architecture space. I will say in advance that my opinions on BigData are just crystalizing, and it is most likely that I will be adjusting them and changing my mind.
Yet, I think it will be useful to go back in history and try to learn from it. Bear we me, as I reflect on the past and walk down memory lane.
The Dawn of Computing: Mainframes
If you were to look at the single most successful computing platform of ancient times (that would be before 1990), you are stuck with the choice between the Apple II, the C64 or the IBM S/360 mainframe. The first two are consumer devices, and I may get back to those in another blog post. Today, let us look at the heavy lifting server platforms, since we are after all going to talk about data.
Under the guidance of Fred Brooks, IBM created one of the most durable, highly available and performing computing platforms in the history of mankind. Today, mainframes are challenged by other custom build supercomputers, x86/x64 scale-up platforms and massive scale out systems (ex: HADOOP). But even now, the good old S/360 still holds on to a very unique value proposition. No, it is not the fact that some of these machines almost never need to reboot. It is not the prophetic beauty of JCL (a job scheduler that “gets” parallelism) or the intricacies of COBOL or PL/I…
In fact, it is not the mainframe itself that gives it an edge, it is the idea of MIPS: Paying for compute cycles!
When you pay for compute cycles, every CPU instruction you use counts. The desire to write better code goes from being a question of aesthetics, to a question about making business sense (i.e. money). Good programmers, who can reduce MIPS count, can easily visualize their value proposition to business people, and justify extraordinary consulting fees.
As we shall see, it took the rest of us a long time to realize this.
The Early 90ies: Cataclysm
My programming career really took off in the late 80ies and early 90ies – before I got bitten by the database bug. I used to write C/C++, LISP, Perl, Assembler (various), ML and even a bit of Visual Basic (sorry!) back then. Pardon my nostalgia, but in those “old days” it was expected that you were fluent in more than one programming language.
There were some common themes back then.
First of all, we took pride in killing mainframes. We saw the old green/black terminals as an early, and failed, attempt at computing – dinosaurs that had to be killed off by the cataclysm of cheap, x86 compute power (or maybe RISK processors, though I never go around to using them). We embraced new programming languages and the IDE with open arms. We thought we succeeded: IBM entered a major crisis in the 80ies for the first time in their long and proud history. However, it could be argued that it was IBM as a company that failed, not the mainframe. There are still a lot of mainframes alive today, some of them have not been turned off since the 70ies and they run a large part of what we like to call civilisation. As a computer scientist, you have to tip your hat to that.
Another theme was a general sense of quick money. IBM had a lot of fat on their organization, and all that cost had to go somewhere: It ended up as MIPS charges to the customer. This made mainframes so expensive that it was easy to compete. It was the era of the “shallow port”. ERP systems running on ISAM style “databases” were ported 1-1 to relational databases on “decentral platforms” – aka: affordable machines. Back then, this was much cheaper than running on the mainframes and it required relatively little coding effort.
The results of shallow ports was code strewn with cursors. I suspect that a lot of our hatred towards cursors is from that time. People would do incredibly silly things with cursors, because that was the easy way to port mainframe code. Oracle supported this shallow port paradigm by improving the speed of cursors and introducing the notion of sequencers and Row ID, which allows even the database ignoramus to get decent, or should we say: “does not suck completely”, performance out of a shallow port. If you hook up a profiling tool to SAP or Axapta today, you can still see the leftovers from those times.
Late 90ies: All Abstractions are Leaky
Just as proper relational data models were beginning to take off, and we began realizing the benefits of cost based optimizers and set based operations, something happened that would throw us into a dark decade of computer science. As with all such things, it started well intentioned: A brilliant movement of programmers began to worry about encapsulation and maintainability of large code bases. Object oriented (OO) programming truly took off, it was older than that, but it now had critical mass.
The KoolAid everyone drank was that garbage collection, high level (OO) languages, IDE support and platform independence (P code/Byte Code) would massively boost productivity and that we never had to worry about low level code, storage and hardware again. To some extend, this was true: we could now trains armies of cheap programmers and throw them at them at problem (I that ironic way history plays tricks on us, the slang term was: The Mongolian Horde Technique). We also had less need to worry about ugly memory allocations and pointers – in the garbage collector we trusted. Everything was a class, or an object, and the compilers better damn well keep up with that idea (they didn’t). The champion of them all: JAVA, suffered the most under this delusion. C/C++ programmers became an endangered species. Visual Basic programmers became almost well respected. And, to please people who would give up pointers and malloc, but not curly brackets, a brilliant Dane invented C#.
At my university Danes, proud of our heritage, even invented our own languages supporting this new world (Mjolnir Beta if anyone is still interested). Everyone was high on OO. A natural consequence of this was that relational database had to keep up. If I could not squeeze a massive object graph into a relational database, that was a problem with the database, not with OO. Relational databases were not really taught at universities, it was bourgeois.
This was the dawn of object databases and the formalization of the 3-tier architecture. Central to the this architecture was the notion that scale-out was mostly about two things:
- Scaling out the business logic (ex: COM+, Entity Beans)
- Scaling out the rendering logic (ex: HTML, Java Applets)
We still hadn’t figured out how to fully scale databases, though most database vendors were working on it, and there were expensive implementations that already had viable solutions (DB2 and Tandem for example). What we called “Scale-out” in the 3-tier stack was functional scale, not data scale: Divide up the app into separate parts and run those parts on different machines.
I suspect we scaled out business logic because, we believed (I did!) that this was where the heavy lifting had to be done. There was also a sense that component reuse was a good thing, and that highly abstract implementations of business logic (libraries of classes) facilitated this. Of course, we did not see the dark side: that taking a dependency on a generic component, tightly coupled you to release cycles of that component. And thus was born “DLL-hell” and an undergrowth of JAVA protocols for distributed communication (ex: CORBA, JMS, RMI)
Moving business logic to components outside the database also created a demand for distributed transactions and frameworks for handling it (ex: DTC). People started arguing that code like this was the Right Way™ of doing things:
Begin Distributed Transaction
MyAccount = DataAccesLayer.ReadAccount()
if withdrawAmount <= accountBalance then
MyAccount.Balance = accountBalance – withdrawAmount
MyTransaction = DatabaseAccessLayer.CreateTran()
MyTransaction.Debit = withDrawAmount
MyTransaction.Target = MyAccount
MyTransaction.Credit = withDrawAmount
MyTransaction.Source = OtherAccount
MyTransaction.Commit
else
Rollback Transaction
Throw “You do not have enough money!”
end
Commit Distributed Transaction
You get the idea… Predictably, this led to chatty interfaces. Network administrators started worrying about latency, people were buying dark fibers like there was no tomorrow. Database administrators were told to fix the problem and tune the database, which was mostly seen as a poorly performing file system. Looking back, it was a miracle that noSQL wasn’t invented back then.
Since this was the dawn of the Internet, scale out rendering made a lot of sense. It was all about moving fat HTML code over stateless protocols. It was not unreasonable to assume that you needed web farms to handle this rendering, so we naturally scaled out this part of the application. Much later, someone would invent AJAX and improve browsers to a point where this would become less of a concern.
We were high on compute power and coding OO like crazy. But like the Titanic and the final end of the Victorian technology bubble, we never saw what was about to hit us.
Y2K: Mediocre, But Ambitious
The new millennium had dawned (in 2000 or 2001, depending on how you count) and people generally survived. The mainframes didn’t break down, life went on. But a lot of programmers found themselves out of work.
In this light, it is interesting to consider that programmers were considered the most expensive part of running successful software back then. JAVA didn’t live up to its promise of cross platform delivery – it became: “Write once, debug everywhere” and people hated the plug-ins for the browser. While productivity gains from OO were clearly delivered, I personally think that IntelliSense was the most significant advance in productivity that happened between 1995 and 2005 (it takes away work from typing, so I can use it on thinking).
Professional managers, as the good glory hunters they are, quickly sniffed out the money to be made in computers during the tech bubble. As these things go, they invented a new term: “IT” that they could use to convince themselves that once you name something, you understand it. It was argued that we needed to “build computer systems like we build bridges”, but the actions we took looked more like “building computer systems like we butcher animals”. The capitalist conveyor belt metaphor, made so popular by Henry Ford and “enriched” by the Jack Welsh’ish horror regime of the 10-80-10 curve, eventually led to the predictable result: Outsourcing.
Make no mistake, I am a great fan of outsourcing. I truly believe in the world-wide, utilitarian view of giving away our technology, knowledge, money and work to poor countries. It is a great way to equalize opportunities in the world and help nations grow out of poverty – fast. In fact, I think we need to start outsourcing more to Africa now. Outsourcing it is the ultimate, global socialist principle.
The problem with outsourcing isn’t that we gave work away to unqualified people in Asia and Russia – because we didn’t – those people quickly became great, and some of the best minds in computer science were created there. Chances are that you are reading this blog on a software written by Indians.
The problem with outsourcing was that it led to a large portion of westerners artificially inflating the demand for Feng Shui decorators, lifestyle coaches, Pilates instructors, postmodern “writers”, public servants and middle level management. These days, Europe is waking up to this problem.
But then again, if we send the jobs out of the country, all the unemployed have to do something with their time. Perhaps it is no coincidence that the computer game industry soon outgrew the movie industry. You can after all waste a lot more time playing Skyrim (amazing by the way) than watching the latest Mission Impossible, and you don’t have to support Tom Cruise’s delusions in the process
Sorry, I ramble, somebody stop me. Yet, the fact is that a lot of companies wasted an enormous amount of money sending work outside the country, dealing with cultural differences and building very poor software. One of the the results (again a big equalizer, this time of life expectancy) was that numerous people must have died from mal-practice as the doctors were busy arguing about the latest way to AVOID building a centralized, and properly implemented, Electronic Patient Journal/Record system. As far as I know, that battle is still with us.
The Rise of the East: Nothing interesting to say about XML
Once you have latched on to the idea of a fully componentized code base, it makes sense to standardize on the protocols used for interoperability. Just when we thought we couldn’t waste any more compute cycles – someone came up with SOAP and XML.
This led to a new generation of architects arguing that we need Service Oriented Architectures (SOA). There are very few of those today (both the architects and the systems), but the idea still rings true and for the customers who have adopted it. And it IS very true: a lot of things become easier if you can agree on XML as an interchange format. We also saw the rise of database agnostic data access layers (DAL). Everyone was really afraid to bind themselves to a monopoly provider.
I don’t think it is a coincidence that open standards and the rise of Asia/Eastern Europe/Russia coincided with *nix becoming very popular. The critical mass of programmers, as well as the interoperability offered by new and open standards, made it viable to build new businesses and software. The East has a cultural advantage of teaching mathematics in school at a level unknown to most westerns – I suspect it will be the end of our Western culture if we don’t adopt. Good riddance..
And the problem was again the over interpretation of a new idea. Just because XML is a great interchange format does not mean it is a good idea to store data in that format. We saw the rise (and fall) of XML database and a massive amount of programmer effort wasted on turning things that work perfectly well (SQL for example) into XML based, slower, versions of the same thing (XQuery for example). But something was true about this, and it lingered in our minds that SQL didn’t address everything, there WAS a gap: Unstructured data…
All the while, we were still high on compute power. Racking blade machines with SAN storage like there was no tomorrow. The term “server sprawl” was invented – but no one who wrote code cared. We continued to believe all that compute power was really cheap. Moore’s Law just kept on giving.
The end of the Free Lunch: Multi Core
But around 2005, it was the end of it. CPU clock speed stopped increasing. Throwing hardware at bad code was no longer an option. In fact, throwing hardware at bad code made it scale WORSE.
Those of us who had flirted with scale-out in the 90ies, and failed miserably, got flashbacks. Predictably, we acted like grumpy old men: angry and bitter that we didn’t get the blonde while she was still young and hot. Oracle became hugely successful with RAC, and the idea of good data modeling that actually has to run on hardware came back on the table, snatched from the hands of the OO fans while they were distracted by UML. People started arguing that perhaps, some of that business logic we all loved scaling out (functional scale-out, mind you), DID belong in the database after all. Thus came SQLCLR and JAVA support in Oracle and DB2. Opportunistic companies started publishing data models and selling them for good money, it was believed that if ONLY we could do 3NF modeling of data, all would be well. Inmon followers were high on data models!
People demanded automatic scale and everyone in the know, knew that this was not going to work without a fair amount of reengineering. But of course, we continued to pay homage to the idea – people don’t like their illusions broken.
While everyone was busy wrapping their head around scale-out, the infrastructure guys were beginning to show signs of panic. Big banks were complaining (and still are) that their server rooms were overflowing with machines. Their network switches were not keeping up anymore and the SAN guys went into denial: “You may measure a problem in Windows, but that is Windows who can’t do I/O, I see no problem in the SAN”. All our XML, OO and lazy code and CPU greed had started to take its toll on the environments. Somebody started talking about “Green Computing” – damn hippies!
Flashback: Kimball’s Revenge
While Teradata was busy selling “magic scale out” engines that had made great progress in this space, Kimball followers were spraying out Data Marts and delivering value in less time than it took the “EDW people” to do a vendor evaluation. The Kimball front got themselves a nice weapon of mass destruction with the Westmere and Nehalem CPU cores: cheap, powerful CPU power that took up very little space in the server room and didn’t need any special network components. It was all in-process on the same mainboard. Itanium (IA-64) went the way of the Alpha chip, the final realization that most code out there truly sucks: high clock speed beat elegant architectures.
Finally, a single machine with lots of compute power, no magic scale-out tricks, could solve a major subset of all warehouse problems (and interestingly most of the classic OLTP problems too). Vendors started digging out old 1970 storage formats that had a great fit with Kimball. Column stores and bitmap indexing got a revival. We saw the rise of Neteeza, Redbrick, Analysis Services, TM1, DB2 OLAP Views and ESSBase. This trend continues today, for example with Vertica and the Microsoft VertiPaq engines. there is even a great website dedicated to it: The BI Verdict.
Data kept growing though, and today we are seeing an interest in MPP, Kimball style, warehouses promising truly massive scale. Scale-out and -Up combined in a beautiful harmony.
But of course, social media beat us all to it. No matter how much data we tried to cram into the warehouse, the world could spam it faster than we could consume it.
BigData: More Pictures of Cats than you can Consume
Our tour of history is nearly at an end – bear with me a little longer. Today, the old dream of handling unstructured data is becoming reality – but perhaps a nightmare for some.
With BigData and HADOOP, we have a new architecture that for the first time makes some promises we can start believing:
- Storage is TRULY cheap and massive (But you have to live with SATA DAS, not SAN)
- Unstructured data, and queries on it, can be run in decent time (Not optimally fast, but good enough)
- Semi automated scale is doable (but only if you know how to write Map/Reduce or use expression trees)
- This means that we can now drink the barrage of data coming from the Internet. Of course, human nature being what it is, a lot of this data is pure noise. In fact, the signal to noise ratio is scary on the Internet – I can think of no better word for it.
- Among all the noise of pictures of cats, tweets and re-tweets about toilet visits, and latest news about Britney Spear’s haircut, there is something we treasure: Knowledge about human nature. The problem that BigData helps us address is that we don’t always know how to find the needle in the haystack. If we build BigData installations, we can keep the data around, unafraid of losing it or burning all our money, while we dig for that insight we so desperately seek about our customers and the world around us.
- I have written elsewhere about the great fit between BigData and Data Warehouses. But I would like to reinforce my point here: Once you know the structure of the data you are looking for, by all means model it and put it in a warehouse for fast consumption. If you don’t know what you are looking for yet (or if you can wait for it), put it in BigData.
- Just be careful, there is a price associated with choosing temporary ignorance over knowledge seeking.
The Cloud: Grow or Pay!
Remember those infrastructure people who complained about server rooms overflowing? Do I hear your growls of anger? As architect/programmer/DBA, I think we often have felt the frustration of dealing with infrastructure departments (in my case the SAN guys take the price). “Why can’t those people just rack the damn server and be done with it – instead of letting me wait 3 months?” we ask ourselves as we again bang our fist into the table in frustration.
But the fact of the matter is that those server rooms ARE overflowing and overheating all over the world. Handling all the complicated logistics of racking and patching servers is expensive, and that is a direct consequence of our greed for compute power. Those infrastructure people are just trying to survive the onslaught of the server sprawl – give them a break. They respond by virtualizing (SAN and virtual machines) and they fight back when we ask for hardware that is non-standard or “not certified”.
Perhaps it is time we look ourselves in the mirror and ask if we really need to waste all those CPU cycles on overly complex architectures that protect us from ourselves and our inability to properly analyse what we plan to do with data. Every time we rack a new server, power consumption goes up in the server room and it gets just a little warmer in there. We are incurring a long term cost in power and cooling by writing poor code and designing overly flexible architectures. At the end of the day, a conditional branch in the code (an “if statement”), metadata and dynamically typed, interpreted code costs CPU, and if we want to be flexible, we often need a lot of that.
If there is one thing we have learned from mainframes, it is that code lives nearly forever. Especially when it is so bad that no one dares to touch it, or so complex that no one understands it.
If it makes your day, think of this as a green initiative: you are saving the planet by writing better code and saving power. But of course, such tree hugger mentality gets us nowhere in the modern business world, so how about this instead: with the dawn (overcast?) of the cloud, compute cycles will once again be billed direct to you: the coder and DBA!
MIPS are back in a new form, and it again makes good business sense to write efficient code. In fact, it always made good sense (even in the 90ies), but all those costs you were incurring from writing bad code or forcing data into storage formats that are unnatural (e.g. XML) has so far been hidden in your corporate balance sheet. With the cloud, that cost is about to uncloak itself, and like a Klingon battleship – this might leave you in trouble!
Summary: What history Taught Us
If you are still with me here, thank you for putting up with my tantrums. It is time to wrap this up. What has history taught us about architectures and how we go about storing data?
I like to think of computing as evolutionary, not revolutionary. There are very few major breakthroughs in computer science (perhaps with the aforementioned exception of IntelliType) that don’t arrive with a lot of advance warning, and grow up slowly from many failed attempts at implementations. For example, column stores are super hot today, but they were invented back in the 70ies.
Let us have a look at the “evolutionary tree of life” for data storage engines:
Above is of course a gross over-simplification, and you could argue about where the arrows should connect. But no matter where those arrows lead, it makes the point I want to make: That nearly all of these technologies are still around today and very much alive (perhaps with the exception of XML databases, but those were pretty silly to begin with). In fact, most large organisations have most of them around.
Are all these technologies just alive because they have not yet been replaced with the latest and greatest? Is noSQL, HADOOP or some other new product bound to replace them all over time?
I think the different technologies are there for a good reason. A reason very similar to why there are many species of animals in the world. Each technology is evolved to solve a certain problem, and to live well in a specific environment. When you target the right technology to a certain problem (for example column stores towards dimensional models) you solve that problem elegantly and with the least amount of compute cycles – preserving energy. Very soon, solving problems with a low number of compute cycles in elegant ways is going to really count.
While you might be able to solve most of the structured DW data problems with HADOOP, you eventually have to ask: Is a generic map/reduce paradigm really the way to go about that?
Humans have a strange way of wanting a single answer to the big and complex question in life, and we waste significant time searching for it. Entire cults of management theories (and I use that term lightly) and religions are build around these over simplifications. We seek simple answers to complex question, and as IT architects, this can lead us down a path of believing in “killer architectures”.
What history has taught us is that such killer architectures do not exist, but that the optimal strategy (just like in nature) depends on which environment you find yourself in. Being an IT architect means rising above all these technologies, to see that big picture, and resisting the temptation to find a single solution/platform for all problems in the enterprise. They key is to understand the environment you live in, and design for it using a palette of technologies.
Staying in the metaphor, this leads me to another conclusion: Just like evolution made us store fat on our bodies in different places depending on the usage (thank goodness) you also have to consider the option of storing your data in more than one place. Having a bit of padding all over the body is a lot more charming (and healthy) than a beer belly ![]()
References:
- The Data Modeling Category on this site with more ramblings
- Fred Brooks wrote a classic, “The Mythical Man Month”, about his experiences working on the S/360
- Ralph Kimball’s books is all you need to read to get started on data warehousing
- Louis Gerstner, former CEO of IBM, wrote a great book about the fall and rebirth of this great company: “Who Says Elephants Can’t Dance”