Newsletters Subscriptions Media Kit About Us Contact Search Home

Stuff
OS/400 Edition
Volume 2, Number 18 -- September 11, 2003

Database Normalization, Part 3


by Jim Coker

In the first two installments of our articles, we refined our database design until it reached third normal form (3NF). You will find the majority of tables in most iSeries applications are in 3NF, which is usually a good balance between performance and database flexibility.

So what else is left? Actually, there are a couple of things. This article will explore the less frequently used fourth and fifth normal forms. Then we will look at some performance issues to be considered when determining the appropriate level of normalization.

Fourth Normal Form, or 4NF

To begin, let's take a look at our sample database in its 3NF design. We are once again using the same diagram sequences to remain consistent with the earlier articles.


Field Name Description
Student ID (PK) Unique identifier of the student
First Name First name of the student
Middle Name Middle name of the student
Last Name Last name of the student
Address Mailing address for the student
City City in which student lives
State State in which student lives
ZIP code ZIP code for this student
Birth Date MM/DD/YYYY that the student was born
School ID Unique identifier of the school

Table D1: Student Information (3NF)



Field Name Description
Student ID (PK) Unique identifier of the student
Class ID (PK) Unique identifier of the class
Grade Grade received for the class

Table D2: Student Class (3NF)


Field Name Description
Class ID (PK) Unique identifier of the class
Class Title Title of the class
Instructor Instructor of the class

Table D3: Class (3NF)



Field Name Description
School ID (PK) Unique identifier of the school
Name Name of the school
Address Street address of the school
City City in which school is located
State State in which school is located
ZIP ZIP code for the school

Table D4: School (3NF)



By reaching 3NF, we can accurately state that all non-key attributes in the table are related to the key, the whole key, and nothing but the key. However, what if there were two different relationships between the key and non-key attributes? Let's look at an example to clarify.

Suppose the student information table contained some additional attributes. Say that in addition to information about the student's school, there is also information about professional societies or certifications. The updated example of the student information table would look like this:


Field Name Description
Student ID (PK) Unique identifier of the student
First Name First name of the student
Middle Name Middle name of the student
Last Name Last name of the student
Address Mailing address for the student
City City in which student lives
State State in which student lives
ZIP code ZIP code for this student
Birth Date MM/DD/YYYY that the student was born
School ID Unique identifier of the school
Graduation Date Date student graduated from the school
GPA Grade point average of all coursework completed
Society ID Unique identifier of the professional society
Membership Date Date student joined the professional society
Chapter Number Chapter of the professional society
Certification ID Unique identifier of the professional certification
Certification Date Date professional certification was earned

Table E1: Student Information (3NF)



Although a person may belong to many societies and hold many certifications, for simplicity we will include only one instance of each. To make the transition to 4NF, we must first achieve 3NF and then meet the conditions of the following rule.

Rule: Isolate independent multiple relationships. No table may contain two or more 1:n or n:m relationships that are not directly related.

Huh? What this rule is stating--and what our example is violating--is that not only must every non-key field be related to the key, but it must share the same relationship as well. That is, if one or more non-key fields are related to the student ID to define school information, all non-key fields should define school information. The table includes not only school information, such as graduation date and grade point average, but also attributes about professional societies and certifications.

Fourth normal form would require us to break the school information out to a separate table, called student school; to place society information in a separate table, called student society; and to place certification information in a separate table, called student certification. This way, only the personal attributes (like name, address, and birth date) are left in the student information table. Here is the revised 4NF design:


Field Name Description
Student ID (PK) Unique identifier of the student
First Name First name of the student
Middle Name Middle name of the student
Last Name Last name of the student
Address Mailing address for the student
City City in which student lives
State State in which student lives
ZIP code ZIP code for this student
Birth Date MM/DD/YYYY that the student was born

Table F1: Student Information (4NF)



Field Name Description
Student ID (PK) Unique identifier of the student
School ID Unique identifier of the school
Graduation Date Date student graduated from the school
GPA Grade point average of all coursework completed

Table F2: Student School (4NF)




Field Name Description
Student ID (PK) Unique identifier of the student
Society ID Unique identifier of the professional society
Membership Date Date student joined the professional society
Chapter Number Chapter of the professional society

Table F3: Student Society (4NF)




Field Name Description
Student ID (PK) Unique identifier of the student
Certification ID Unique identifier of the professional certification
Certification Date Date professional certification was earned

Table F4: Student Certification (4NF)




Field Name Description
Student ID (PK) Unique identifier of the student
Class ID (PK) Unique identifier of the class
Grade Grade received for the class

Table F5: Student Class (4NF)




Field Name Description
Class ID (PK) Unique identifier of the class
Class Title Title of the class
Instructor Instructor of the class

Table F6: Class (4NF)




Field Name Description
School ID (PK) Unique identifier of the school
Name Name of the school
Address Street address of the school
City City in which school is located
State State in which school is located
ZIP ZIP code for the school

Table F7: School (4NF)



Again, for simplicity we are restricting each student to zero or one school, zero or one professional society, and zero or one certification. Now, for the big finale, what is this fifth normal form that is looming out there? Let's take a look.

Fifth Normal Form, or 5NF

We know that each successive level of normalization is based on the presumption that we have met all of the "lower" normal forms leading up to it. So the first requirement is that our database be in 4NF before even discussing 5NF. Then we must apply the following rule, which introduces a string of seemingly complex--and frightening--conditions.

Rule: Isolate semantically related multiple relationships. There may be practical constraints on information that justify separating logically related many-to-many relationships.

What the heck does that mean? It basically is relating to combinations of values. Say, for example, that there are three bookstores serving the school. The school places a requirement on each store that if they sell any of the books for a given class, they must sell all of the books for a given class. So class 101, for example, has two textbooks: A and B. Class 102 has three books: C, D, and E. And there is only one book for class 103: F. A total of six books is required for the three classes. Based on the restrictions of all or nothing, if we have three bookstores selling all six books, we would find 6 x 3 = 18 rows in our bookstore book table.

On the other hand, if we update our database design to 5NF, we isolate semantically related multiple relationships. Meaning, we isolate the relationship of books to class, and class to bookstore. The bookstore example appears in the following diagram:


Field Name Description
Class ID (PK) Unique identifier of the class
Class Title Title of the class
Instructor Instructor of the class

Table G1: Class (5NF)




Field Name Description
Class ID (PK) Unique identifier of the class
Book ID Unique identifier of the book

Table G2: Class Books (5NF)




Field Name Description
Class ID (PK) Unique identifier of the class
Store ID (PK) Unique identifier of the bookstore

Table G3: Class Bookstore (5NF)




Field Name Description
Book ID (PK) Unique identifier of the book
Book Title Title of this book
Author Author of this book

Table G4: Book (5NF)




Field Name Description
Store ID (PK) Unique identifier of the store
Name Name of the bookstore
Street Address Address of the bookstore
City City
State State
ZIP ZIP code
Phone Phone number
Hours Hours the store is open

Table G5: Bookstore (5NF)



Looking at this form of the database, we have defined the same combinations but using only 6 (books) + 3 (classes) x 3 (bookstores) = 6 + 9 = 15 rows. Depending on the type of relationship and how many instances of each attribute exist, 5NF may save you a significant amount of time and space by not processing as many rows in the database.

Why Be Normal? Other Considerations

I have shown you how to take an unstructured database all of the way from 0NF through 5NF, but the developer is reminded that it is neither necessary nor recommended--nor even possible in all cases--to take your database to 5NF. You must choose the appropriate level of normalization for your requirements.

In particular, you should consider overall system performance when it comes down to the design. As you normalize to a higher degree, you get closer to the goal of storing a fact in the database in one and only one place. However, with that higher degree comes higher complexity. For example, we started out with one flat file in 0NF. By our 5NF example, we had implemented a total of 11 tables. Some of this refinement into separate tables is absolutely necessary to ensure the integrity of the data. And some of the refinement may enhance performance by making it easier to search out those rows you are trying to process.

On the other hand, breaking the design into too many tables could make your queries extremely complex in terms of the various joins needed to retrieve data. One could also argue that introducing multiple tables uses more disk space, since the same field, such as Class ID, may appear multiple times in the database. Typically, in this day and age of inexpensive, high-performance disk drives, it is not a strong argument. A more significant argument would be the impact on journaling or mirroring data across a wide area network for disaster recovery.

There are no right or wrong answers. Basically, you are being cautioned to consider all of the alternatives and to decide whether it makes sense to continue normalizing the data model. If you look at many of the robust applications that exist on the iSeries platform, you will find many that take transaction files to 2NF and master files to 3NF. History files often tend to be normalized to a lower degree (i.e., 2NF) for the sake of performance. Data warehouses purposely introduce redundancy to the tables to facilitate much faster queries with easier selection.

Given some of these ideas to consider, it is now up to you to balance the performance and integrity of your database. Check the Internet or booksellers for more information on database design and normalization.


Other Articles in This Series:

Database Normalization, Part 1

Database Normalization, Part 2


Jim Coker is the IT director for Behr Process Corp. in Santa Ana, California. He has worked on the AS/400 and iSeries since 1989. E-mail: jcoker@behr.com


Sponsored By
T.L. ASHFORD

BARCODE400 by T.L. Ashford is the easiest
and fastest way to create and print Compliance
Labels directly from the AS/400 and iSeries.

Ashford's comprehensive library of Compliance formats is available to Barcode400 users. AIAG labels for Ford and Motorcraft, GM, and many more are available. BARCODE400 is backed by the best Technical Support Team in the industry.

FREE Guide to Bar Code Labeling

www.tlashford.com or call 800.541.4893


THIS ISSUE
SPONSORED BY:

Lakeview Technology
T.L. Ashford
ASNA
Profound Logic Software


BACK ISSUES

TABLE OF
CONTENTS
iSeries Access for Web

Data Normalization, Part 3

Cast Your Data Conversion Troubles Away

OS/400 Alert: Nasty Little Viruses


Editors
Shannon O'Donnell
Kevin Vandever

Managing Editor
Shannon Pastore

Contributing Editors:
Howard Arner
Raymond Everhart
Joe Hertvik
Ted Holt
Marc Logemann
David Morris

Publisher and
Advertising Director:

Jenny Thomas

Advertising Sales Representative
Kim Reed

Contact the Editors
Do you have a gripe, inside dope or an opinion?
Email the editors:
editors@itjungle.com


Copyright © 1996-2008 Guild Companies, Inc. All Rights Reserved.