Newsletters Subscriptions Media Kit About Us Contact Search Home

Stuff
OS/400 Edition
Volume 2, Number 17 -- August 28, 2003

Database Normalization, Part 2


by Jim Coker

The first article in this series discussed what is meant by the term database normalization and how it pertains to development on an iSeries. In this article, we will take our database design to second normal form, to eliminate redundant data, and then to third normal form, to address potential database anomalies.

The last article gave an example of a poorly designed student database that was then modified to "first normal form." That design appears below. (For consistency between articles, I will use the same diagrams that were used in the first article.)


Field Name Description
Student ID (PK) Unique identifier of the student
First Name First name of the student
Middle Name Middle name of the student
Last Name Last name of the student
Address Mailing address for the student
City City in which student lives
State State in which student lives
ZIP code ZIP code for this student
Birth Date MM/DD/YYYY that the student was born
School Name of the school the student attends
School Address Street address of the school
School City City in which school is located
School State State in which school is located
School ZIP ZIP code for the school

Table B1: Student Information (1NF)



Field Name Description
Student ID (PK) Unique identifier of the student
Class ID (PK) Unique identifier of the class
Class Title Title of the class
Instructor Instructor of the class
Grade Grade received for the class

Table B2: Student Class (1NF)


In this article, we will take our database design to second normal form and then to third normal form. To begin, let's consider what's wrong with the 1NF design.

There are two problems with the present design: update anomalies and delete anomalies. So what the heck are those? An anomaly indicates that there is inconsistent data in the database. Let's take a closer look.

Update Anomalies

Suppose several students are enrolled in a class. The class ID is WD, and the class title is "Web Development." Then the class is renamed to "Java Programming." The first challenge is that our application must go through all of the rows in the student class table to see if the class ID is WD and, where necessary, change the class title from "Web Development" to "Java Programming." If a row is not updated, there will be a problem. Some rows have a class ID of WD, with a class title of "Java Programming," while other rows have a class ID of WD, with the old class title of "Web Development." So there is now inconsistent data defined for class ID WD. That is what's known as an update anomaly.

Delete Anomalies

Let's use the same example outlined above. Suppose there are students enrolled in a class with a class ID of WD and a class title of "Web Development." Now suppose that every student were to drop out of the class. Removing the rows containing class ID WD created a new problem: There is no longer any trace of a class with a class ID of WD! By removing the students enrolled in the class, we inadvertently removed the class itself. That is what's called a delete anomaly.

Second Normal Form, or 2NF

We have a mess on our hands with these potential anomalies floating around in the database. So what shall we do about it? Go to second normal form, of course!

Rule: Eliminate redundant data. If an attribute depends on only part of a multi-value key, remove it to a separate table. Every non-key attribute must be related to the key. If the key consists of more than one field, every non-key field must be related to the entire composite key.

This rule might seem a little intimidating at first. But we will use an example to step you through it. Our first task is to eliminate redundant data. This is done by looking for any attribute that depends on only part of a multi-valued key. The student class table has a multi-value key that consists of Student ID and Class ID. However, it has some non-key attributes (or fields), class title and instructor, which are related only to Class ID and have nothing to do with Student ID.

The 2NF rule instructs us to remove these attributes and put them in a separate table. So we will now create a third table in the database called Class.


Field Name Description
Student ID (PK) Unique identifier of the student
First Name First name of the student
Middle Name Middle name of the student
Last Name Last name of the student
Address Mailing address for the student
City City in which student lives
State State in which student lives
ZIP code ZIP code for this student
Birth Date MM/DD/YYYY that the student was born
School Name of the school the student attends
School Address Street address of the school
School City City in which school is located
School State State in which school is located
School ZIP ZIP code for the school

Table C1: Student Information (2NF)



Field Name Description
Student ID (PK) Unique identifier of the student
Class ID (PK) Unique identifier of the class
Grade Grade received for the class

Table C2: Student Class (2NF)


Field Name Description
Class ID (PK) Unique identifier of the class
Class Title Title of the class
Instructor Instructor of the class

Table C3: Class (2NF)


Notice that we only moved the non-key attributes out of the student class table into the new table. Class ID remains part of the key in Student Class because there are, in fact, attributes (namely the grade earned in the class) that are based on the entire composite key of Student ID and Class ID.

We have solved the problem of update anomalies because we can now be certain that there is only one class title for any given class ID, since there is exactly one row for that class ID in the new table. We have also addressed the issue of delete anomalies, since a class can still be defined in the new table even if there are no students enrolled in it.

We are in second normal form, and finally everything is great, right? We have eliminated the redundant data and anomalies. Or have we? Taking another look at the 2NF design, there is still the possibility of an update or delete anomaly.

Reviewing the attributes of the student information table, you see the name and address of the school. While the school information is clearly "related" to the student, it is not "dependent" on the student. Therefore, we can once again introduce inconsistent values for the school name. Also, the school cannot exist in the database unless there is at least one student attending it. Once again, we are faced with update and delete anomalies.

Third Normal Form, or 3NF

But wait! The reason we went to 2NF was to eliminate anomalies. So how is it that we are in 2NF and still have anomalies? That's because we only addressed those attributes based on part of a composite key. The remaining problems exist because the school information is not dependent on any of the key fields. To solve this problem, we need to take the design to third normal form.

Rule: Eliminate columns not dependent on the key. If attributes do not contribute to a description of the key, remove them to a separate table. All non-key fields must be exclusively dependent on the key field.

This means it is time to break the school information out to a separate table. The school address, city, state, and ZIP code are all related to the school, not to the student. Therefore, those fields do not contribute to a description of the student ID and, in turn, do not belong in the student information table. The following table shows the revised database in third normal form.


Field Name Description
Student ID (PK) Unique identifier of the student
First Name First name of the student
Middle Name Middle name of the student
Last Name Last name of the student
Address Mailing address for the student
City City in which student lives
State State in which student lives
ZIP code ZIP code for this student
Birth Date MM/DD/YYYY that the student was born
School ID Unique identifier of the school

Table D1: Student Information (3NF)



Field Name Description
Student ID (PK) Unique identifier of the student
Class ID (PK) Unique identifier of the class
Grade Grade received for the class

Table D2: Student Class (3NF)


Field Name Description
Class ID (PK) Unique identifier of the class
Class Title Title of the class
Instructor Instructor of the class

Table D3: Class (3NF)



Field Name Description
School ID (PK) Unique identifier of the school
Name Name of the school
Address Street address of the school
City City in which school is located
State State in which school is located
ZIP ZIP code for the school

Table D4: School (3NF)



Looking at each of the four tables in the database, you find in each case that all non-key attributes are related to the entire key. There is a wonderful summary of what it takes to declare a table is in third normal form. In their book DB2: Design and Development Guide, Wiorkowski and Kull boiled it down to one statement: "The rules leading to and including the third normal form can be summed up in a single statement: Each attribute must be a fact about the key, the whole key, and nothing but the key."

By measuring all tables in the database with this one test, we can confirm that the design is in third normal form, which, by definition, means we are also in second normal form and, in turn, first normal form. That is a lot of validation resulting from one easy phrase!

Don't Give Up Now. We're Almost There!

In this article, we started with a database that was in first normal form. We moved all attributes that were related to part of a composite key to a separate table. That eliminated some anomalies and took us to second normal form. We then moved all the attributes that were not related to any of the key fields to a separate table. That eliminated the remaining anomalies and left our design in third normal form.

In many cases, it is acceptable to bring your design to 3NF, and that will provide adequate flexibility for a robust, maintainable application. In the third and final article in this series, I will reveal the less commonly used fourth and fifth normal forms. We will then examine some trade-offs and discuss how much normalization is appropriate for a given application design.


Jim Coker is the IT director for Behr Process Corp. in Santa Ana, California. He has worked on the AS/400 and iSeries since 1989. E-mail: jcoker@behr.com


Sponsored By
T.L. ASHFORD

BARCODE400 by T.L. Ashford is the easiest
and fastest way to create and print Compliance
Labels directly from the AS/400 and iSeries.

Ashford's comprehensive library of Compliance formats is available to Barcode400 users. AIAG labels for Ford and Motorcraft, GM, and many more are available. BARCODE400 is backed by the best Technical Support Team in the industry.

FREE Guide to Bar Code Labeling

www.tlashford.com or call 800.541.4893


THIS ISSUE
SPONSORED BY:

T.L. Ashford
ASNA
Lakeview Technology
Profound Logic Software


BACK ISSUES

TABLE OF
CONTENTS
Database Normalization, Part 2

The NULL Nemesis

Linux on iSeries Network Alternatives

OS/400 Alert: A New Feature for Midrange Programmer


Editors
Shannon O'Donnell
Kevin Vandever

Managing Editor
Shannon Pastore

Contributing Editors:
Howard Arner
Raymond Everhart
Joe Hertvik
Ted Holt
Marc Logemann
David Morris

Publisher and
Advertising Director:

Jenny Thomas

Advertising Sales Representative
Kim Reed

Contact the Editors
Do you have a gripe, inside dope or an opinion?
Email the editors:
editors@itjungle.com


Copyright © 1996-2008 Guild Companies, Inc. All Rights Reserved.