Sunday, August 21, 2011

The 3 Principle Rules of Normalisation - Or was that 2 or was it 5 - Whatever !

Peter Vogel wrote some fantastic editorials over the decade that he was editor of Smart Access. Here is one of those ...

IF memory servers me correctly (and I believe it does), great database administrators have a saying that goes, “Normalize, normalize, normalize.” By which they mean that you should apply at least the first three rules of normalization to your database design in order to get the best design.

“Iron Chef” references aside, there’s a lot of truth to this rule, and it’s one that I believe in very much. However, like any other rule, you must recognize when it applies and when it doesn’t apply.

Please don’t misunderstand what I’m saying here: Full normalization is essential to high-performance database applications. I often find myself fighting to keep developers from “de-normalizing” their databases. My typical situation is arguing against designs that combine separate entities into a single table or repeat data from one table in another. Often, this drive for de-normalization is triggered by a concern that the application will run too slowly because of the number of tables generated by the normalization process. Just as often, this concern is raised before any testing has been done to determine whether there’s a performance problem. Even if testing shows a performance problem, de-normalization is often done before developers examine their code or question whether they’re following the best practices for developing against a relational database. More often than not, I’ve found that performance issues are related to insufficient knowledge of SQL or an insufficiently normalized design rather than one that’s been “over-normalized.”

De-normalization gives you exactly one benefit: a database optimized for specific activities and, as a result, suboptimized for every other activity. Hardly a desirable goal.

So please, please, please don’t use this editorial to justify de-normalizing a database. Though, considering the matter, why you’d think that my opinion would change anyone’s mind is a mystery to me.

The reason that I bring this up is that I was working with a client who was setting up a home-grown data warehouse using Access. Having been thoroughly indoctrinated in the rules of normalization, the DBA was intent on fully normalizing the data warehouse. The purpose of the data warehouse was to support end-user reporting (again, using Access), and some users had already begun to work with the information in the warehouse. Well, at least the users were trying to work with the data. My client was finding a lot of confusion among the users around the many tables they’d have to join together in order to get the result that they wanted. One user even complained that the reason his report took so long to run was because of the joins between the tables underlying the report. I suggested that any report that extracted tens of thousands of records to generate four summary totals was going to take awhile to run regardless of how many tables were being processed.

To reduce confusion among the users, the database administrator had started to create a set of queries that joined together tables in a way to support the typical queries. My suggestion was that the company shouldn’t bother with the queries. Instead, when loading the data into the data warehouse, they should store the data in de-normalized tables. I suggested that the only normalization rule that should be enforced is the first one: no repeating fields.

From the look on the DBA’s face you’d have thought I was suggesting that we de-throne the Queen (remember, I’m up in Canada). My point was that the higher rules of normalization (rule two and higher) were designed to prevent problems during updates, something that doesn’t happen in a data warehouse. The closest that this data warehouse got to an update was when new data was added during the nightly update.

Having abandoned one of the iron rules in my life as a database designer, I got a little giddy. My next recommendation was that we duplicate enormous quantities of data among tables. The benefits in simplifying end-user reporting would be significant and, since we had no updates, we needn’t be concerned about different copies of the data getting “out of sync” with each other. The only cost would be the extra disk space. With the cost of disk space so low, it was hard to get concerned about this cost.

We could, of course, have made data reporting easier by using queries that drew data from separate tables to create the illusion of repeated data without actually repeating the data in the base tables. However, using a query does impose a performance burden (however small) on a report. By actually duplicating the data, we were trading off disk space against CPU cycles. Since it was far easier for my client to add disk space than to create more CPU cycles, we duplicated the data. Since most of the end-user reporting was done during peak business hours when computer time was in short supply, our solution made even more sense.

Who knows where this de-evolution in my standards will end? By next month, I may be using GoTos in my code. And that report that took so long to run? It took just as long against the “de-normalized” tables.

No comments: