|
|||||||
|
|
![]() |
|
|
Database Normalization, Part 3 by Jim Coker In the first two installments of our articles, we refined our database design until it reached third normal form (3NF). You will find the majority of tables in most iSeries applications are in 3NF, which is usually a good balance between performance and database flexibility. So what else is left? Actually, there are a couple of things. This article will explore the less frequently used fourth and fifth normal forms. Then we will look at some performance issues to be considered when determining the appropriate level of normalization. Fourth Normal Form, or 4NF To begin, let's take a look at our sample database in its 3NF design. We are once again using the same diagram sequences to remain consistent with the earlier articles.
Table D1: Student Information (3NF)
Table D2: Student Class (3NF)
Table D3: Class (3NF)
Table D4: School (3NF) By reaching 3NF, we can accurately state that all non-key attributes in the table are related to the key, the whole key, and nothing but the key. However, what if there were two different relationships between the key and non-key attributes? Let's look at an example to clarify. Suppose the student information table contained some additional attributes. Say that in addition to information about the student's school, there is also information about professional societies or certifications. The updated example of the student information table would look like this:
Table E1: Student Information (3NF) Although a person may belong to many societies and hold many certifications, for simplicity we will include only one instance of each. To make the transition to 4NF, we must first achieve 3NF and then meet the conditions of the following rule. Rule: Isolate independent multiple relationships. No table may contain two or more 1:n or n:m relationships that are not directly related. Huh? What this rule is stating--and what our example is violating--is that not only must every non-key field be related to the key, but it must share the same relationship as well. That is, if one or more non-key fields are related to the student ID to define school information, all non-key fields should define school information. The table includes not only school information, such as graduation date and grade point average, but also attributes about professional societies and certifications. Fourth normal form would require us to break the school information out to a separate table, called student school; to place society information in a separate table, called student society; and to place certification information in a separate table, called student certification. This way, only the personal attributes (like name, address, and birth date) are left in the student information table. Here is the revised 4NF design:
Table F1: Student Information (4NF)
Table F2: Student School (4NF)
Table F3: Student Society (4NF)
Table F4: Student Certification (4NF)
Table F5: Student Class (4NF)
Table F6: Class (4NF)
Table F7: School (4NF) Again, for simplicity we are restricting each student to zero or one school, zero or one professional society, and zero or one certification. Now, for the big finale, what is this fifth normal form that is looming out there? Let's take a look. Fifth Normal Form, or 5NF We know that each successive level of normalization is based on the presumption that we have met all of the "lower" normal forms leading up to it. So the first requirement is that our database be in 4NF before even discussing 5NF. Then we must apply the following rule, which introduces a string of seemingly complex--and frightening--conditions. Rule: Isolate semantically related multiple relationships. There may be practical constraints on information that justify separating logically related many-to-many relationships. What the heck does that mean? It basically is relating to combinations of values. Say, for example, that there are three bookstores serving the school. The school places a requirement on each store that if they sell any of the books for a given class, they must sell all of the books for a given class. So class 101, for example, has two textbooks: A and B. Class 102 has three books: C, D, and E. And there is only one book for class 103: F. A total of six books is required for the three classes. Based on the restrictions of all or nothing, if we have three bookstores selling all six books, we would find 6 x 3 = 18 rows in our bookstore book table. On the other hand, if we update our database design to 5NF, we isolate semantically related multiple relationships. Meaning, we isolate the relationship of books to class, and class to bookstore. The bookstore example appears in the following diagram:
Table G1: Class (5NF)
Table G2: Class Books (5NF)
Table G3: Class Bookstore (5NF)
Table G4: Book (5NF)
Table G5: Bookstore (5NF) Looking at this form of the database, we have defined the same combinations but using only 6 (books) + 3 (classes) x 3 (bookstores) = 6 + 9 = 15 rows. Depending on the type of relationship and how many instances of each attribute exist, 5NF may save you a significant amount of time and space by not processing as many rows in the database. Why Be Normal? Other Considerations I have shown you how to take an unstructured database all of the way from 0NF through 5NF, but the developer is reminded that it is neither necessary nor recommended--nor even possible in all cases--to take your database to 5NF. You must choose the appropriate level of normalization for your requirements. In particular, you should consider overall system performance when it comes down to the design. As you normalize to a higher degree, you get closer to the goal of storing a fact in the database in one and only one place. However, with that higher degree comes higher complexity. For example, we started out with one flat file in 0NF. By our 5NF example, we had implemented a total of 11 tables. Some of this refinement into separate tables is absolutely necessary to ensure the integrity of the data. And some of the refinement may enhance performance by making it easier to search out those rows you are trying to process. On the other hand, breaking the design into too many tables could make your queries extremely complex in terms of the various joins needed to retrieve data. One could also argue that introducing multiple tables uses more disk space, since the same field, such as Class ID, may appear multiple times in the database. Typically, in this day and age of inexpensive, high-performance disk drives, it is not a strong argument. A more significant argument would be the impact on journaling or mirroring data across a wide area network for disaster recovery. There are no right or wrong answers. Basically, you are being cautioned to consider all of the alternatives and to decide whether it makes sense to continue normalizing the data model. If you look at many of the robust applications that exist on the iSeries platform, you will find many that take transaction files to 2NF and master files to 3NF. History files often tend to be normalized to a lower degree (i.e., 2NF) for the sake of performance. Data warehouses purposely introduce redundancy to the tables to facilitate much faster queries with easier selection. Given some of these ideas to consider, it is now up to you to balance the performance and integrity of your database. Check the Internet or booksellers for more information on database design and normalization. Other Articles in This Series: Database Normalization, Part 1 Database Normalization, Part 2 Jim Coker is the IT director for Behr Process Corp. in Santa Ana, California. He has worked on the AS/400 and iSeries since 1989. E-mail: jcoker@behr.com
|
Editors
Contact the Editors |
| Copyright © 1996-2008 Guild Companies, Inc. All Rights Reserved. |