Newsletters Subscriptions Media Kit About Us Contact Search Home

Stuff
OS/400 Edition
Volume 2, Number 15 -- July 31, 2003

Database Normalization, Part 1


by Jim Coker

What is the best design for a database? As is the case with many aspects of application development, there are many approaches to properly organizing the data elements found in a database application. And many of these approaches can be successfully defended as good, efficient designs. But as far as relational database management systems (like DB2 UDB for the iSeries) are concerned, the "best" design is a normalized design.

In this article, I would like to explain why database normalization is important in supporting the future demands of an application. The properly designed database will support maximum flexibility in adding, maintaining, and retrieving data in the database.

Although to some normalization may sound somewhat esoteric or better suited for academia, it is actually very applicable to any project that relies on a relational database implementation. I would like to put forth a caveat, that if you look at a dozen different references, you are liable to find a dozen interpretations of what is meant by each "normal form." So I will add my own interpretation to the mix, based on a variety of sources that I have consulted.

First, let's try to define "normalization." There are various degrees of normalization. It can be thought of as a process with as many as five steps--each one building on the prior--that breaks down data attributes (or fields) into appropriately organized units. Carrying out this process will allow the data to be stored efficiently, while allowing flexibility for future processing requirements.

I will try to demonstrate this flexibility and to explain the benefits as we move through my example: a student database in which I want to store information about a student, the school he attends, and the classes he takes.

So let's start at the beginning.

Zero Normal Form (0NF)

Simply put, data that has not been normalized is be considered to be in "zero normal form." It is arguable whether this is a valid normal form; nevertheless it gives us a nice starting point for our discussion.

Let's look at an example of a student database that is not normalized.


Field Name Description
Name First, middle, and last name of the student
Address Mailing address for the student
Birth Date MM/DD/YYYY that the student was born
School Name of the school the student attends
School Address Street address of the school
Class 1 Class number 1
Description 1 Title of class number 1
Instructor 1 Instructor of class number 1
Grade 1 Grade earned in class number 1
Class 2 Class number 2
Description 2 Title of class number 2
Instructor 2 Instructor of class number 2
Grade 2 Grade earned in class number 2
Class 3 Class number 3
Description 3 Title of class number 3
Instructor 3 Instructor of class number 3
Grade 3 Grade earned in class number 3

Table A1: Student Information (0NF)


Looking at this initial design of a student information database, it is easy to point out several problems with this type of design. The developer who tries to implement this system with such a database design is in for a lot of complex coding to support it.

First of all, how would you locate a record for a given student? You would need to type his first, middle, and last names precisely how they appear in the Name column in order to find a matching record. Even if a matching record were found, how could you determine if that student were taking a "world history" course? The program would have to search the information from class 1, then class 2, and then class 3 before determining whether a given student was enrolled. What happens if the student is taking four classes? It is clear that this design is inflexible and would fail to achieve many foreseeable requirements.

In other words, zero normal form has offered zero flexibility! Let's see what we can do to improve the situation.

First Normal Form (1NF)

We can immediately solve some problems by refining the database design and implementing first normal form.

Rule: Eliminate repeating groups. Make a separate table for each set of related attributes, and give each table a primary key. All attributes must be in their simplest form, meaning none can be broken down into smaller forms.

We have seen that searching for a given class among a student's information is not easily accomplished with the initial design. So we will break out the information about classes--a repeating group--in order to reach first normal form. Once it has been decided to move that class information out of the student information table, we need to identify a way to link it to the rest of the student information that remains in the original table.

This presents a potential problem. At the moment, we are looking at identifying the student information by using his full name. It is possible, of course, to use the full name in each of the Student Class rows as a unique identifier. But this is not necessarily a good idea. By using the student's name, it is not impossible, although it is unlikely, to have more than one student with the same name. In addition, the student may get married or divorced, resulting in changing that name. If that were to happen, your application would have to read through all of the tables where that name is stored and change it. While feasible, it is once again inefficient and open to potential problems down the road.

A better approach would be to assign a "surrogate key" to each student; that is, create a student identification number that is issued by the application whenever a new student is added to the database. When the application issues the ID number, it can guarantee it to be unique throughout the entire database. In addition, it is more compact than the full name and, as such, requires less storage in each row of each table.

So the first step we will take is to assign a unique student ID number as our primary key in the Student Information table. The second step is to create a second table, called student class. We will subsequently move the class, description, instructor, and grade fields from the Student Information table into the new table. Doing so will result in a refined database that is in first normal form and looks like table 2. The primary key fields are denoted with a (PK) symbol.


Field Name Description
Student ID (PK) Unique identifier of the student
First Name First, middle, and last name of the student
Middle Name First, middle, and last name of the student
Last Name First, middle, and last name of the student
Address Mailing address for the student
City City in which student lives
State State in which student lives
ZIP code ZIP code for this student
Birth Date MM/DD/YYYY that the student was born
School Name of the school the student attends
School Address Street address of the school
School City City in which school is located
School State State in which school is located
School ZIP ZIP code for the school

Table B1: Student Information (1NF)



Field Name Description
Student ID (PK) Unique identifier of the student
Class ID (PK) Unique identifier of the class
Class Title Title of the class
Instructor Instructor of the class
Grade Grade received for the class

Table B2: Student Class (1NF)


It should be noted that a couple of other changes were made as we moved to the updated design. Instead of maintaining one student name field, it has been broken into three fields: first, middle, and last. That's because it is generally considered advantageous to break each attribute into its smallest, most atomic form. Doing so in this case permits us to execute a search by last name, if we so desire. From an application standpoint, it is much easier to combine smaller fields into one aggregate value, rather than parse a larger value into its smaller components.

The Address and School Address fields were similarly resolved into their simpler components.

One might reasonably ask, why, then, are we not breaking the Birth Date field from MMDDYYYY into its smaller components of MM, DD, and YYYY? This is an excellent question. If this question were posed a few years ago, it might have been argued successfully that these values should be stored atomically. However, the answer today lies in the way the database stores dates internally. Using Time, Date, or Timestamp data types, the aggregate value can be stored by the database in such a manner that any of those individual values can be easily retrieved using built-in functions. The database knows that in this case we are dealing with a date, so it knows that it may need to sort it as YYYYMMDD and display it as MMDDYYYY. DB2 will do all of the ugly work for us! For a closer look at the date data type, check out Kevin Vandever's article "Back to Basics: The Date Data Type."

We have also brought to light another database term in our example. We have now introduced a "foreign key" (FK) to our model. The FK is that column (or columns) used to link to another (foreign) table. In our example, the Student ID column in the Student Class table is a FK used to link back to the related student row in the Student Information table.

Having reached first normal form, our database queries are already starting to get easier! If we now want to know if a particular student is taking a particular class, we can immediately inspect the Student Class table for a row containing both the Student ID and the Class ID. We have also solved the problem of a student wanting to take four classes, since we can simply add another row to the Student Class table. And we don't waste space with empty class columns if a student is taking a semester off and is not enrolled in any classes; we simply have no associated records for the student in the Student Class table.

Stay Tuned!

We have already made some significant improvements to our original, "zero normal form" design by reaching first normal form. So now that we have solved our problems, does that mean that all is well in our database? Sorry, not yet! We still have a few problems to resolve. In my next installment, we will learn about database anomalies, and see how to continue refining our database into second and third normal forms.


Jim Coker is the IT director for Behr Process Corp. in Santa Ana, California. He has worked on the AS/400 and iSeries since 1989. E-mail: jcoker@behr.com


Sponsored By
WORKSRIGHT SOFTWARE

600 Billion

That's how much a recent independent study estimated U.S. businesses spend on dirty data. How much of that 600 billion is spent by your company? Cleanse your dirty ZIP Codes and mailing addresses with our software and save big bucks.

WorksRight Software, Inc.
Phone: 601-856-8337
E-mail: software@worksright.com
Web site: www.worksright.com


THIS ISSUE
SPONSORED BY:

ASNA
T.L. Ashford
WorksRight Software
Profound Logic Software


BACK ISSUES

TABLE OF
CONTENTS
BLOB Your Data for the Web

Debug iSeries Programs with the iSeries Distributed Debugger

Database Normalization, Part 1

Program Cloning to Prevent Overtime


Editors
Shannon O'Donnell
Kevin Vandever

Managing Editor
Shannon Pastore

Contributing Editors:
Howard Arner
Raymond Everhart
Joe Hertvik
Ted Holt
Marc Logemann
David Morris

Publisher and
Advertising Director:

Jenny Thomas

Advertising Sales Representative
Kim Reed

Contact the Editors
Do you have a gripe, inside dope or an opinion?
Email the editors:
editors@itjungle.com


Copyright © 1996-2008 Guild Companies, Inc. All Rights Reserved.