7 DATABASE DESIGN PROBLEMS

How to avoid the worst problems in database design

INTRODUCTION: Data Architect Challenges

Data architects face many challenges on a day-to-day basis. In the data management arena, you may constantly hear from data professionals that if you don’t get the data right, nothing else matters. However, the business focus on applications often overshadows the priority of a well-organized database design, and the database just comes along for the ride as the application grows in scope and functionality.

Several factors can lead to a poor database design: lack of experience, a shortage of the necessary skills, tight timelines, and insufficient resources can all contribute. In turn, poor database design leads to many problems down the line, such as sub-par performance, the inability to make changes to accommodate new features, and low-quality data that can cost both time and money as the application evolves.

Addressing some simple data modeling and design fundamentals can greatly alleviate most, if not all, of these concerns. Let’s focus on seven common database design problems that can be easily avoided and suggest ways to correct them in future projects:
  • POOR OR MISSING DOCUMENTATION FOR DATABASE(S) IN PRODUCTION
  • POORLY DEFINED BUSINESS REQUIREMENTS
  • NOT TREATING THE DATA MODEL LIKE A LIVING, BREATHING ORGANISM
  • IMPROPER STORAGE OF REFERENCE DATA
  • NOT USING FOREIGN KEYS OR CHECK CONSTRAINTS
  • NOT USING DOMAINS AND STANDARD ELEMENTS
  • NOT UNDERSTANDING PRIMARY KEYS AND INDEXES PROPERLY

1. MISSING DATABASE DOCUMENTATION

Documentation for databases usually falls into three categories: incomplete, inaccurate, or none at all.

To make matters worse, it is almost never centralized in one place. Without proper, centralized documentation, understanding the impact of a change is difficult at best and impossible at worst. Data stewards, developers, DBAs, architects, and business analysts are left scrambling to get on the same page, relying on their own imagination to interpret the meaning and usage of the data. Developers may also assume that table and column names are descriptive or self-explanatory enough to convey how the data is used (more on this with Problem #6). However, if there is no documentation in place, then as the workforce turns over, the essential knowledge about the systems can literally walk out the door and leave a huge void that is impossible to fill. Even starting bottom-up with just the physical information can go a long way toward alleviating the lack of documentation.

In this day and age, we also recognize that data is not only the new oil but can be a poisoned chalice. Certain data, such as personal data, must be treated with great caution. We need to be able to ask and answer the question, “Where is this data?” at any time. We also need to be able to go to any data asset and ask, “What rules apply to this data?”

Ideally, we will maintain models of the technical metadata of our data assets in one place. These will be stored as Physical Data Models which clearly show the technical structure of the assets: the table and column names, data types, and relationships between tables in RDBMS assets or documents in NoSQL assets. But this is only half of the problem, because the technical structure alone does not tell us the meaning of each piece of data. We can document each piece of metadata in a number of ways:
  • Mark up each table, column, and other object with business names, comments, ownership rules, and so on.
  • Create a logical data model which shows the structure using business language, or connect to an existing common corporate model.
  • Classify each table, column, or document against Business Terms from within the Business Glossary.
This single contiguous model can now be interrogated to answer those important questions. We can then go further with the most important information, for example by applying data quality metrics to it.
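
As a minimal illustration of this idea (not the format of any particular tool), the short Python sketch below keeps technical and business metadata together so that a question like “Where is this data?” can be answered programmatically. All table, column, and glossary names in it are hypothetical.

    from dataclasses import dataclass, field

    @dataclass
    class ColumnDoc:
        name: str                      # physical column name
        data_type: str                 # physical data type
        business_name: str             # business-friendly name
        glossary_terms: list = field(default_factory=list)   # e.g. "Personal Data"

    @dataclass
    class TableDoc:
        name: str
        comment: str
        columns: list

    catalog = [
        TableDoc(
            name="CUST",
            comment="One row per registered customer.",
            columns=[
                ColumnDoc("CUST_ID", "INTEGER", "Customer Identifier"),
                ColumnDoc("CUST_EMAIL", "VARCHAR(255)", "Customer Email Address",
                          glossary_terms=["Personal Data"]),
            ],
        )
    ]

    # Answer "Where is this data?" for anything classified as Personal Data.
    for table in catalog:
        for column in table.columns:
            if "Personal Data" in column.glossary_terms:
                print(f"{table.name}.{column.name} -> {column.business_name}")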

2. POORLY DEFINED DATABASE BUSINESS REQUIREMENTS

We have all seen the rope swing graphic where what the client asks for differs wildly from what the architect designs and what the developer delivers.

This chain of communication is vital to ensure that the client’s needs are met. The initial capture of requirements must be in business language and in a form that the client can understand. Logical data models are an effective way to show the concepts of a data asset, their attributes, and their relationships, independent of the technology that implements the asset. These logical models can then be transformed into a physical model more closely aligned with the target technology. The physical model will often employ abbreviated, more technical naming conventions that are harder for the end user to understand. Using a tool provides traceability from the logical model through to the final database design in the physical model, which makes it possible to test the delivered design against the original requirements.
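
As a rough illustration of that transformation, the Python sketch below maps business-language attribute names to abbreviated physical column names while keeping a traceability map between the two. The abbreviation list and attribute names are assumptions made for the example, not a prescribed standard.

    # Hypothetical abbreviation standard; a real one would be agreed organization-wide.
    ABBREVIATIONS = {"customer": "CUST", "order": "ORD", "number": "NBR", "date": "DT"}

    def to_physical(logical_name: str) -> str:
        """Turn a business-language attribute name into an abbreviated physical column name."""
        words = logical_name.lower().split()
        return "_".join(ABBREVIATIONS.get(word, word.upper()) for word in words)

    logical_attributes = ["Customer Number", "Order Date", "Order Number"]

    # Traceability map from the logical (business) name to the physical column name.
    traceability = {logical: to_physical(logical) for logical in logical_attributes}
    print(traceability)   # {'Customer Number': 'CUST_NBR', 'Order Date': 'ORD_DT', ...}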

3. NOT TREATING THE DATA MODEL LIKE A LIVING, BREATHING ORGANISM

There are numerous examples of customers performing the modeling up front, but once the design is in production, all modeling ceases.

It is always preferable to design first, then build: errors corrected in the design phase are significantly less costly than those fixed once the system is in production. The problem arises when changes creep into the database due to critical production issues. If there is no process to update the model along with the database, the model is inevitably left languishing on the side, and as more changes accumulate in the database, the model becomes useless. Undocumented data can also lead to security and compliance risks, poor understanding of future changes, and an inability to adapt to the future needs of the business.

To maintain flexibility and ensure consistency when the database changes, those modifications need to find their way back to the model. As the model changes, configuration management processes can be applied: the requirements for a change are defined within the Logical Data Model, approved for sign-off, and a version snapshot is taken. Once approved, the physical model is updated, approved in turn, and another version snapshot taken. At this point, tooling can help produce scripts that apply only those changes to the target data asset. The DBA can review the scripts for the asset and give final approval before implementation. In this way changes are controlled and documented, which reduces the opportunity for errors to occur and, when they do occur, makes it easier to identify where they came from and fix them.
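
The “changes only” idea can be sketched very simply: compare two versions of a model and emit just the statements needed to bring the database in line. Real tooling compares far more than column names (data types, keys, indexes, constraints); the Python sketch below is only an illustration with made-up table and column names.

    # Two hypothetical versions of the same model, reduced to table -> column names.
    model_v1 = {"CUST": {"CUST_ID", "CUST_NAME"}}
    model_v2 = {"CUST": {"CUST_ID", "CUST_NAME", "CUST_EMAIL"}}

    def diff_columns(old: dict, new: dict) -> list:
        """Emit ALTER statements for added or dropped columns only (types simplified)."""
        statements = []
        for table, new_cols in new.items():
            old_cols = old.get(table, set())
            for col in sorted(new_cols - old_cols):
                statements.append(f"ALTER TABLE {table} ADD COLUMN {col} VARCHAR(255);")
            for col in sorted(old_cols - new_cols):
                statements.append(f"ALTER TABLE {table} DROP COLUMN {col};")
        return statements

    # Scripts a DBA could review and approve before they are run against the database.
    for statement in diff_columns(model_v1, model_v2):
        print(statement)   # ALTER TABLE CUST ADD COLUMN CUST_EMAIL VARCHAR(255);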

4. IMPROPER DATABASE REFERENCE DATA STORAGE

There are two main problems with reference data.

It is either stored in many places or, even worse, embedded in the application code, and it is almost never captured in the data model with the rest of the documentation. This causes huge problems when a reference value needs to change. Reference values provide valuable documentation, and that documentation should be communicated in an appropriate location; the model is often your best chance. It may not be practical to store reference values in the data model if you have large volumes, but it is a best practice to point to them from the model. The key is to have each set of values defined in one place and referenced everywhere else. The benefit is that any change can be understood before it is implemented, then properly controlled and documented.
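
As a concrete sketch of the single-definition approach, the example below uses SQLite (standing in for whatever DBMS you run): the reference codes live in one lookup table, and other tables point to it through a foreign key instead of hard-coding the values in application logic. The table and code values are illustrative.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when asked
    conn.executescript("""
        CREATE TABLE order_status (              -- the single home for these reference values
            status_code TEXT PRIMARY KEY,
            description TEXT NOT NULL
        );
        INSERT INTO order_status VALUES ('NEW', 'Order received'),
                                        ('SHP', 'Order shipped');

        CREATE TABLE customer_order (
            order_id    INTEGER PRIMARY KEY,
            status_code TEXT NOT NULL REFERENCES order_status (status_code)
        );
    """)

    conn.execute("INSERT INTO customer_order VALUES (1, 'NEW')")       # known code: accepted
    try:
        conn.execute("INSERT INTO customer_order VALUES (2, 'XXX')")   # unknown code: rejected
    except sqlite3.IntegrityError as error:
        print("Rejected unknown status code:", error)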

5. NOT USING FOREIGN KEYS AND CHECK CONSTRAINTS

When reverse engineering databases, customers complain all the time about the lack of referential integrity (RI) or validation checks defined in the database. With older database systems, it was thought that foreign keys and check constraints slowed performance, so RI and validation checks were pushed into the application. That might have been the case in the past, but DBMSs have come a long way. The ramifications for data quality can be significant if the data is not validated properly by either the application or the database. If it is possible to validate the data in the database, you should do it there: error handling will be drastically simplified and data quality will increase as a result.
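
Here is a minimal sketch of letting the database do the validation, again using SQLite as a stand-in for your DBMS: a CHECK constraint and a foreign key reject bad rows before they ever reach the data. The table names and rules are illustrative.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when asked
    conn.executescript("""
        CREATE TABLE product (
            product_id INTEGER PRIMARY KEY,
            unit_price NUMERIC NOT NULL CHECK (unit_price > 0)
        );
        CREATE TABLE order_line (
            order_line_id INTEGER PRIMARY KEY,
            product_id    INTEGER NOT NULL REFERENCES product (product_id),
            quantity      INTEGER NOT NULL CHECK (quantity > 0)
        );
    """)

    conn.execute("INSERT INTO product VALUES (1, 9.99)")                   # valid row: accepted
    for bad_insert in ("INSERT INTO product VALUES (2, -5)",               # violates the CHECK
                       "INSERT INTO order_line VALUES (1, 99, 1)"):        # violates the foreign key
        try:
            conn.execute(bad_insert)
        except sqlite3.IntegrityError as error:
            print("Rejected by the database:", error)

When the DBMS enforces these rules, every application and ad hoc script that touches the data gets the same protection automatically.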

6. NOT USING DATABASE DOMAINS AND STANDARD ELEMENTS

Domains and naming standards are probably two of the most important things you can incorporate into your modeling practices.

Domains allow you to create reusable attributes so that the same attribute is not created in different places with different properties. It is extremely important to have a common set that everyone can use across all models. Naming standards allow you to identify those attributes clearly and consistently. Having a set of standards also ensures consistency across systems and promotes readability of models and code. You don’t want short, cryptic names that users need to interpret, and given the advanced nature of the latest vendor releases, the days of tightly limited column name lengths are over when building new databases. Always have a common set of classwords to identify key types of data, and use modifiers as needed.

Reusing data elements from a well-maintained model will also help with Data Governance initiatives. Being able to trace the data in assets back to reusable elements in maintained Business Glossaries or Logical Models helps you understand whether policies, data quality rules, and usage rules are being upheld within the asset. Conversely, a model where data assets can be traced to reusable elements helps you find data for impact analysis or sources for Business Intelligence.
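
The domain idea can be sketched as a set of attribute definitions declared once and referenced by name, so that every model uses the same type for the same kind of data. The domain names, classword suffixes, and generated column definitions below are assumptions made for illustration.

    # Hypothetical shared domain definitions, declared once and reused everywhere.
    DOMAINS = {
        "EmailAddress": "VARCHAR(255)",
        "Amount":       "DECIMAL(12,2)",
        "CreateDate":   "TIMESTAMP",
    }

    def column_ddl(column_name: str, domain: str) -> str:
        """Build a column definition from a reusable domain rather than an ad hoc type."""
        return f"{column_name} {DOMAINS[domain]}"

    # Classword suffixes (_TXT, _AMT, _DT) identify the kind of data; modifiers vary per table.
    customer_columns = [column_ddl("CUST_EMAIL_TXT", "EmailAddress"),
                        column_ddl("CUST_CREATE_DT", "CreateDate")]
    invoice_columns  = [column_ddl("INV_TOTAL_AMT",  "Amount"),
                        column_ddl("INV_CREATE_DT",  "CreateDate")]
    print(customer_columns)   # ['CUST_EMAIL_TXT VARCHAR(255)', 'CUST_CREATE_DT TIMESTAMP']
    print(invoice_columns)    # ['INV_TOTAL_AMT DECIMAL(12,2)', 'INV_CREATE_DT TIMESTAMP']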

7. NOT UNDERSTANDING PRIMARY KEYS AND INDEXES

A senior data architect once said, “When choosing a primary key, you’d better get it right, because changing it down the line will be a royal pain.”

Sure enough, another customer had the unenviable chore of managing a migration project because a system used Social Security Number as the primary key for individuals. They found out the hard way that SSNs are not always unique and not everyone has one. The simplest principle to remember when picking a primary key is SUM: Static, Unique, Minimal. It is not necessary to delve into the whole natural vs. surrogate key debate; however, it is important to know that although surrogate keys may uniquely identify the record, they do not always uniquely identify the data. There is a time and a place for both, and you can always create an alternate key on the natural key if a surrogate is used as the primary key. Working out how records can be identified early in the design also pays off when optimizing the database later: candidate keys captured during design make it easier to define indexes, and unique indexes provide optimized methods of retrieving, sorting, and finding data. Plus, the value of the primary key may not always be available when searching for a record.
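
The surrogate-plus-alternate-key pattern described above looks roughly like the SQLite sketch below, with purely illustrative names: the surrogate key identifies the record, while a unique constraint on the natural identifier keeps the data itself unique and gives the optimizer an index to search on.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE employee (
            employee_id  INTEGER PRIMARY KEY,      -- surrogate key: static, unique, minimal
            badge_number TEXT NOT NULL UNIQUE,     -- alternate key on the natural identifier
            full_name    TEXT NOT NULL
        );
        -- The UNIQUE constraint is backed by a unique index, so lookups by badge_number are fast.
    """)

    conn.execute("INSERT INTO employee VALUES (1, 'B-1001', 'Ada Lopez')")
    try:
        conn.execute("INSERT INTO employee VALUES (2, 'B-1001', 'Someone Else')")
    except sqlite3.IntegrityError as error:
        print("Duplicate natural key rejected:", error)   # UNIQUE constraint failed: employee.badge_number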

ER/Studio Data Modeling Tools

It may be a multi-step process to define your strategy to eliminate all of these issues, but it is important to have a plan to get there. Hopefully this information has helped you to evaluate your data management habits and assess your current database structure. IDERA can help you to address these issues with the ER/Studio data modeling tools. ER/Studio enables your business and technical stakeholders to map your complex data landscape, building a business-driven data architecture that serves as a solid foundation for data governance. Create consistent data models, document metadata, and improve data quality for your organization.