To normalize or to denormalize, that is the question

Published in

ITNEXT

5 min readApr 23, 2021

One of your most valuable assets is data. And, because data is so valuable it is important you take time to think through and ask as many questions as possible before you start to define any schemas. How will the data be consumed? How often will it be updated? Are we ok with eventual consistency? What aggregates or reports will we need? And, much much more. If you are interested in learning how to model relational data in NoSQL you may be interested in my previous article NoSQL does not mean !Relational.

One of the benefits I find with adopting Cloud Firestore and other NoSQL databases is that they are extremely flexible but never forget “With great power comes great responsibility” — Uncle Ben (“Stan Lee”). NoSQL databases are typically marketed as schemaless and that you can alter the data to fit your model on the fly, but 9/10 times that thinking will cause you more pain than you realize.

So far I’ve done a bunch of talking and not a lot of explaining so let’s get to it. We will look at a ticket application something like a CRM in an attempt to understand when we should normalize and when we should denormalize.

First, for those that don’t know, data normalization is the technique of structuring data into multiple containers (e.g., tables) to reduce data duplication and to improve data integrity.

SQL ModelOrganizations
  - id
  - name
  - address
  - city
  - state
  - postal_code
  - phoneUsers
  - id
  - name
  - email
  - phone
  - organization_idTickets
  - id
  - submitter_name
  - note
  - assigned_to // REFERENCES App_User(id)

Looking at the SQL model above, you can see we have normalized data - we reference the User in Tickets using assigned_to, and we reference the Organization in User through the organization_id.

For those that don’t know what denormalized data is, denormalized data is a strategy used on a previously normalized database to increase read performance, at the cost of write performance, by duplicating data¹.

But that isn’t entirely accurate, denormalizing data can also be used for data integrity. In my previous article, I discuss a food ordering application that has dishes, orders and order_items , an order has many order_items and a dish has a price set by the restaurant . When a user selects a dish to order we create an order and anorder_item. Now if we were to reference the dish through a foreign key and the restaurant increases the dish price as a result of an ingredient shortage what would happen to our historical report data? The report data would be wrong! The customer for example would not be able to pull up their order history from a restaurant and see that they paid $10 for the dish previously, but if we were to denormalize and copy the price from the dish record? We preserve the price the customer paid at the time of the order.

Getting back to our ticket app, let’s see a denormalized model, and this time I’ll use a Document database model.

Document Model - Firestore
users {
  user_id {
    name,
    email,
    phone,
    organization: {
      id: ${organization_id}
      name: ${organization.name}
    }
  }
}
organizations {
  organization_id {
    name,
    address,
    city,
    state,
    postalCode,
    phone,
  }
  
  tickets (sub-collection)
    id {
      submitted_name,
      note,
      assigned_to: {
        id: ${user_id},
        name: ${user.name}
      }
    }
}// users & organizations are collections
// tickets are a sub-collection under the organization

Now looking at the Document Model and comparing it to the SQL Model, you’ll see I have normalized the assigned_to data on the SQL Tickets but in the Document tickets I have denormalized the assigned_to by including the user’s name directly on the ticket. To be fair, we can do the exact same denormalized schema with RDBMS, this is not something unique to a Document or NoSQL database. Many engineers will look at this and say, wait just reference the user and join the data, and that is a logical solution, but it may not be the best solution.

In a ticketing dashboard, you will usually have some limited data exposed— you’ll typically see the name of the submitter, create time, who it is assigned to, and likely the last updated timestamp. Now let’s say the employee assigned to the ticket gets married and changes their name from Jane Doe to Jane Smith, if using the normalized model a customer (“submitter”) that logs in and is looking at their tickets may be confused why Jane Doe is no longer their assigned rep when they see Jane Smith. In the normalized model, you cannot choose if the old name should stay on the ticket or if the new name should be present on the ticket. Now, looking at the denormalized model you see we have a choice and better yet we can offer that choice to Jane or the organization. We can let Jane or the organizationdecide if all of her tickets get updated with her new name or if only new tickets get her new name.

So how do we update the name if Jane decided she wants all tickets to have her new name? In SQL this is easy, let’s assume we have denormalized the SQL Model above and Tickets now includes assigned_to_id and assigned_to_name. Our SQL query would look like this

UPDATE Tickets 
SET assigned_to_name = $assigned_to_name 
WHERE assigned_to_id = $user_id;

Most document databases require you to query and update each document. An example using a Document DB would be

Firestore Node.js (Admin-SDK)// Function to update the ticket
// In the event of an error we push the data to pubsub
async function updateTicket({ticketId, data}){
  const docRef = firestore
    .doc(`organizations/{organizationId}/tickets/{ticketId}`);  try {
    await docRef.update(data);
  } catch error {
    // handle your error
    logging(error);
    pubsub.add(JSON.stringify({ticketId, data});
  }
};// retrieve all of the tickets for the user by the user id
const tickets = await firestore
  .collection(`organizations/${organizationId}/tickets`)
  .where('assigned_to.id', '==', ${userId})
  .get();const promises = tickets.docs.map(doc => {
  // use our updateTicket function to write data
  return updateTicket({doc.id, assigned_to_name: $assigned_to_name})
});await Promise.all(promises);
// This is only an example

As you can see above, the Document DB update does take more work. You will most likely want to perform this in the background, something like Cloud Functions are a great tool for this use case.

There are other caveats with denormalized data, you must ensure you update the denormalized data when changing the foreign key/reference. So what does that mean? If we decide to reassign a ticket from Jane to John, in a normalized database, we do not need to worry about updating the name on the ticket to John we will get his name when we JOIN the data. But, on a denormalized model, we would need to update both the assigned_to.id and the assigned_to.name. This is not a big deal as you are updating the document/record for the new id, you just need to be sure to remember to update the name field too — hint: this is a good place for some testing.

So when do you decide if you should denormalize or not? I generally think about it this way, if my presentation/view needs to show data I will put it on my document. If my data needs to preserve history I will denormalize the data. However, if the data has a tendency to change often and we do not need to preserve the historical value, I opt for a normalized model.

[1] Wikipedia.org — Denormalization https://en.wikipedia.org/wiki/Denormalization#:~:text=Denormalization%20is%20a%20strategy%20used,data%20or%20by%20grouping%20data.

To normalize or to denormalize, that is the question

Written by John Rodkey