An Introduction to Database Normalization

This image for a blog about database normalization shows a screen displaying graphs and charts.
This image for a blog about database normalization shows a screen displaying graphs and charts.

Data has become the currency of marketing. And thanks to modern technology, marketers have more data-driven insights than ever when it comes to learning about audiences, buying behaviors, and the impacts of their campaigns. However, only data that is normalized can go beyond the database to provide those insights.

Database normalization adds context and structure to your data. Rather than having bits of information stashed here and there, data is arranged in a logical, usable format. Users can more quickly find information within the database without it feeling like a game of hide-and-seek. And when data can be more easily found, it can also be put to better use.

How can you transform database normalization from a problem into a top priority? Use this guide to get started.

Table of Contents:

What Is Database Normalization?

The letters SQL rest on a wooden background with the words "structured", "query", and "language" around them in this image for a blog about database normalization.

Let’s get some clarity on database normalization: What exactly is it?

Data normalization is the systematic process of inputting and organizing a data set. It uses Structured Query Language (SQL), which is a standard for accessing and altering database information.

Unlike the Marie Kondo approach, where you only keep what brings you joy, this type of organization focuses on arranging data in a logical manner. 

Normalizing data is the next logical step after creating a database. It is where you remove any potential anomaly, error, or redundancy, set up a rule to link certain data together, and test your rules to make sure they work.

The end results are simplicity and power. When you add structure and logic to your data, you can maintain a smaller database that’s accurate and easier to use. If that’s the case, you’re inherently able to do more with your data.

Why Is Normalized Data Important?

Normalized data has become increasingly important as marketing data proliferates. Marketers are collecting information from more sources than ever, including email, social media, search engines, and market research. If you’re using multiple Software-as-a-Service tools, those apps may not always support each other.

By normalizing how all of your data is stored and accessed, you can expect multiple benefits:

  • Eliminate data redundancy
  • Focus only on necessary values
  • Store related data together
  • Resolve conflicting data
  • Reduce the complexity of databases
  • Analyze data faster and more accurately

Normalizing data is also an opportunity to address anomalies that might cloud your analysis. For instance, anomalies might appear over time due to data insertion, deletion, or alteration. Once you find these anomalies and discover how to fix them, your data will ultimately do more for you.

The key thing to remember is that the ultimate benefit isn’t data normalization itself. Rather, companies need to focus on the benefits that lie behind the action. Without going through this process, a lot of the valuable data you collect will never get used — at least not to its full potential. Normalizing data is a great way to get rid of unnecessary or inaccurate data that affects big picture insights. 

Types of Data Normalization

Interconnected windows appear on a computer screen in this image for a blog about database normalization.

Normalization has many faces, depending on the database. What are the data normalization types? Let’s look at some examples:

  • First Normal Form: The first normal form (abbreviated as 1NF) is the simplest and most basic form of normalization. This process focuses on removing redundant data and separating related data into their own tables.
  • Second Normal Form: An extension of 1NF, the second normal form (2NF) eliminates subgroups of data that are displayed in multiple rows on a table. New tables are created and connections between them are added.
  • Third Normal Form: The third normal form (3NF) satisfies all the rules from 2NF and then some. It eliminates columns that have no intra-table dependencies on the main key value.
  • Fourth Normal Form and Fifth Normal Form: The most complex normal form is the fifth normal form (5NF), with the fourth normal form (4NF) following closely behind. These are rarely used, as the first three normal form types are most common.
  • Boyce-Codd Normal Form: The Boyce-Codd normal form (abbreviated as BCNF) is a happy medium between 3NF and 4NF (call it a 3.5NF). After meeting 3NF guidelines, BCNF can determine a superkey value if all other values are functionally dependent on it.

With each increase in the level of normalization, data receive a more technical makeover. For example, in 4NF, all dependencies between multiple values are deleted. However, technical and complex processes are not to be confused with “better” processes. A lower level of technicality can work just fine. In some cases, you might choose a combination of rules from different normalization types.

How to Normalize Data

Now for the big question: How do you normalize data with all of the above information?

Normalization is a process, one that’s best broken down into phases. In a typical normalization project, you’ll need to do the following:

  1. Create tables for each value
  2. Connect values between tables
  3. Connect main keys with non-key values

Let’s do a real-world sample project together.

Sample Normalization Use Case

Let’s say that you want to create a relational database for your marketing agency. You offer multiple marketing services for a variety of industries. So far, your relational database only has tables for campaigns, employees assigned to the campaigns, and projects within the campaign.

But you’d like to learn more. That’s why, after your initial database design, you’d like to add more details to your database. This will usually require normalization, since adding new values that weren’t part of the existing schema may not be properly connected to the other values. 

Phase One: Create Your Tables

The first step is to make sure your table schema supports all of your values. It helps to start by creating a map of all the relational data you need to collect.

In the above use case, you’d have one table for listing your marketing services, one for employees, and one for projects.

Now, let’s say you’d like to add a customers table to add an extra layer of visibility. This new table will contain key customer details, such as company names, contact person, industry, phone number, address, and ZIP code.

Your table should be broken down into the simplest information possible. For example, you would not want to combine your contact person and job title into the same table column. Assign only one value for each column and row. Here’s a basic example:

An example of a header row of a customers table showing 10 labeled boxes: Project ID, Company Name, Industry, Contact Name, Contact Role, Contact Phone, Address, City, State, and ZIP Code.

You also need to give the table a primary key (also called a candidate key). The primary key is how users can identify unique data that is not the same as anything else. In this use case, we could assign project IDs as the primary key, which requires a separate column.

Also, make sure there is no repeating information in the table. For example, if you have multiple columns for project ratings to account for multiple projects for the same client, you should move these into a new table. This reduces the complexity of the customers table without getting rid of important feedback. From there, you can link the project ratings table to the customers table and projects table (which comes into play in the next phase).

Doing this should satisfy all 1NF requirements.

Phase Two: Connect Values Between Tables

You could stop right here and consider your database normalized. But going the extra mile to finish 2NF doesn’t require much extra work. Plus, it can help you gain better control of your data.

In this phase, you need to make sure that all the columns in the table relate to the primary key. Those that don’t directly relate to the primary key should branch off into another table.

In our sample use case, we created a column for project IDs and made it our primary key. This unique identifier helps us see which projects belong to each client.

However, we wouldn’t be able to relate the contact person, job title, or phone number values to the primary key because this information might change. This adds some risk to your database accuracy: What if a new employee of that company becomes your new contact person? That might mean the phone number and job title values will also change. That’s a lot of extra fields to update when a point of contact changes.

So, since these values can’t relate to the primary key, we need to shift them to their own table. This new table will hold all the data related to the contact person. Then, we can connect the table to the customers table by using a contact person ID as the primary key. The new tables might look something like this:

A row of eight columns, including one a Project ID column, in a “Customers Table” above a row of four columns, including one labeled “ContactPersonID” in a “ContactPerson Table”.

Now, when you get a new contact person for a client, you can simply update the "Contact Person ID" in the customers table. Looking at this ID will pull up all details for the current point of contact.

If the contact person changes, simply create a new record in the contact person table.

Rinse and repeat to ensure all columns in a table are assigned to a primary key and consider 2NF complete.

Phase Three: Connect Main Keys with Non-Key Values

To satisfy 3NF, we need to make sure that the columns only rely on the primary key and no other columns. Having functional dependency on more than one column in a table can impact data accuracy as values change over time.

Using our example customers table above, we see functional dependency at work in the city, state, and ZIP code columns. These not only relate to the customer’s primary key, but also to each other. For instance, if a company were to change locations, updating the city would most certainly affect the ZIP code.

Why doesn’t this apply to address or state, you might wonder? That’s because these values only affect each other some of the time. If a company moved down the street, it likely wouldn’t affect the city, state, or ZIP. Changing an address may or may not change the state. But changing cities or states will always affect the ZIP code.  

We can align with 3NF by (you guessed it) making a separate table. This table should contain the city, state, and ZIP code(s) associated with each client. You can link it to the customers table by creating a foreign key. In this case, it’s the "ZIPCodeID".

Our new tables will look something like this:

A six-column Customers Table appears above a ZIP Codes Table with three boxes labeled “ZIPCodeID”, “City”, and “State”.

When your customer changes their address, you can create a new record in the ZIP codes table to update the city, state, and ZIP, then update the address in the customers table.

Turning your big data into a normalized database can be quite the undertaking, but it also helps you get the most from data science practices. If machine learning, data model automation, or data mining for deep learning is on your radar in the future, you need to standardize your dataset organization. Good database management will only help you.

To learn how Meltwater uses database normalization on your behalf, contact us today.

Loading...