Data Dictionary

“When I took the first survey of my undertaking, I found our speech copious without order, and energetick without rules: wherever I turned my view, there was perplexity to be disentangled, and confusion to be regulated; choice was to be made out of boundless variety, without any established principle of selection; adulterations were to be detected, without a settled test of purity; and modes of expression to be rejected or received, without the suffrages of any writers of classical reputation of acknowledged authority.”
Preface to Dictionary of the English Language, Samuel Johnson, 1755 [9]

What is it?

A data dictionary is defined in the IBM Dictionary of Computing as a “centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format.”[1] However, Business Analysts should be aware that the term data dictionary can have several meanings depending on the person one is speaking with. A data dictionary can be:

  1. A read-only set of tables that provides information about an Oracle database. [2]
  2. An automated tool for collecting and organizing the detailed information about system components. [3]
  3. A reference that includes standard definitions of data elements, their meanings, and allowable values. [4]
  4. A collection of data about the data, with the purpose of rigorously defining each and every data element, data structure, and data transform.[5]

As you can see from the list above, the term data dictionary may be interpreted differently by the person you are speaking to if they are a Oracle database administrator (#1 above), database architect (#4 above), system analyst (#2 above), or business analyst (#3 above). And if you are talking a non-technical business person, they may conflate a data dictionary with a glossary of business terms.

Depending on the context, a data dictionary can be a component part of database management system (DBMS) or a separate data store in the form of a database, spreadsheet, or even an index card file (in the old days).

Data Dictionary Types

Generally, there are two data dictionary types that Business Analysts deal with. These are logical data dictionaries and physical data dictionaries. And then some data dictionaries combine the metadata of both a logical and physical dictionary.

A logical data dictionary typically describes information in business terms and focuses on the meaning of terms and their relationship with other terms. For example, from the business perspective there may be a data entity called “Client”, which includes “Client Name”, Client ID”, and “Order History” attributes. Once defined, a logical data dictionary rarely needs to change.

In contrast, a physical data dictionary the physical attributes of a data element. What is the data type (string, boolean, integer, etc)? What is the maximum length? What is the format? What is the data encoding standard (ANSI, UTF-8, etc)? Where is the data physically stored (database name, column, row, etc)?

While a the attributes of a logical data dictionary are unlikely to change unless there are significant business changes, the physical attributes can change much more frequently. And because the data element could be stored in multiple databases with different physical structures, there may be different physical data dictionaries that include the same logical data entity.

Most business analysts will be creating the logical data dictionary as part of their requirements work, which would then be converted to physical data dictionary as part of the solution definition and design phases of an effort.

Why do it?

According to the International Standards Organization (ISO): “The increased use of data processing and electronic data interchange heavily relies on accurate, reliable, controllable, and verifiable data recorded in databases. One of the prerequisites for a correct and proper use and interpretation of data is that both users and owners of data have a common understanding of the meaning and descriptive characteristics (e.g., representation) of that data. To guarantee this shared view, a number of basic attributes has to be defined.” [9]

This is done through data dictionaries. Or put another way:

“Data dictionaries assist with gathering requirements, standardize data element definitions, reduce redundancy, and enhance consistency between systems. What may seem like administrative busy work when you’re launching a project can become a time and money-saving resource by revealing complexities that were overlooked in initial specifications and spotting costly inconsistencies before they are deeply embedded in the organization’s systems.”[8]

Among the many benefits of a data dictionary are:

  • Labeling information consistently, with agreed-upon definitions for data elements and a common set of properties for each data element, makes systems and data analysis easier and business intelligence more effective because of the access to high data quality information that the data dictionary enables.
  • Describing data elements and the use of a defined set of properties for each data element reduce or eliminate the creation of redundant data elements.
  • Using a common set of properties for each data element and consistent labeling of data elements ensure that business and programmer analysts can easily identify relevant data to support implementation of business requirements.
  • A business data dictionary provides a way of organizing all of the data elements that are relevant to a particular software system.

The data dictionary can be one of the most important requirement artifacts that a BA creates when dealing with software solutions. It ensures that all stakeholders are in agreement on the definition and structure of data that the solution will use. The saying “Garbage In, Garbage Out” is never more true than when dealing with system data, so spend time ensuring that you have carefully defined the data needs of your solution.

A data dictionary (and glossary) is often one of the first artifacts that gets created when the requirements elicitation process starts. One of the key needs in any project is to make sure everyone is speaking the same language. A glossary helps with this. But unless you dive into the level needed to define terms that goes into a data dictionary, you are unlikely to determine if there are slightly different terms that are used within the business units you are working with.

An Example

An example of a situation where a data dictionary can help identify the usage of a seemingly common term with different meanings comes from the financial services field, where a seemingly standard term such as “Cash” can have different meanings depending on who you talk to. For one person “Cash” might mean literally cash in hand or immediately accessible money such as that in a bank account. But from various investment perspectives “Cash” might include short-term Treasury Bills, Certificates of Deposit, certain types of promissory notes, and other types relatively liquid investments. Now imagine a situation where the formula you need to calculate a figure for a regulatory report requires one definition of “Cash” but your internal systems are all built around a different definition of “Cash”. Unless you have used a data dictionary to very precisely define what you mean by “Cash”, your stakeholders and developers may not even realize there is a difference.

How do I do it?

Step 1 – Define the Location and Structure

Choose your data dictionary location and structure. Are you going to use Excel? Does your requirements management tool specifically support a data dictionary?

Step 2 – Define the Attributes

Decide what attributes you are going to capture in data dictionary. Logical and Physical data dictionaries frequently have different attributes that should be captures. However, some data dictionaries are set up to capture both types of information. In those cases the business analyst might start the data dictionary by working with the business to define the logical data attributes while the technical team would then expand the data with the physical attributes for the specified data elements.

Logical Attributes

For each data element, the following logical attributes may be used in a data dictionary:

Attribute Name Description
ID Code Every data element in the dictionary should have a unique identification outside of its name. This might be a number, or an alphanumeric combination.
Name Every data element in the dictionary should have a unique name.
Aliases If the data element is potentially referred to by more than one name, the additional names can be recorded as aliases (or synonyms). For example, an element named “Price” may also be referred to as “Cost” or Value. Thus “Cost” and “Value” would be recorded as Aliases of “Price”
Definition Every attribute should include a definition that precisely defines the element associated with the “Name”, and which distinguishes it from other elements that may be similar.
Formula This specifies if the value being defined should be calculated using a specific formula.
Unit If the data element should be specified in a specific unit of measurement (such a miles, centimeters, or a currency)
Update Frequency This attribute how often the business expects the element value to be updated. This may be daily, weekly, monthly, point in time (if updates occur on a non-scheduled basis such as when a client changes their mailing address), or other values.
Definition Source It is often a good idea to keep track of who provided a specific element definition.
Range of Values This is the possible range of values the element needs to support. This could be a range of numbers, a set of letters, or one of a specific set of values (such as State abbreviations if part of an address)
Time Dimension Frequently, a business can want the same value but over a different period of time. For example, a business want to know the total amount of sales to a customer over a 1-month, 6-month, 1-year, and lifetime period. This attribute can also interact with the Update Frequency attribute to provide different results, so there could be a 1-year sales total that is updated on once a year on January 1st to provide a calendar year value, and another 1-year sales total that is updated daily to provide a rolling value.
History Required For some elements, the business may want to maintain not just a current value, but past values as well. For example, if the business was tracking a daily sales total, they may want to maintain specific daily values for the prior 5 years.
Notes This attribute field can be used to provide specific examples, background, or other points of information that the BA feels appropriate.
Components If the element is a compound element, this attribute can be used to indicate the component elements that are incorporated into this element. This can also be used to indicate the relationships with the component elements (mandatory, optional, 1 to 1, 1 to many, etc)
Owner This is the “owner” of the data value that is being defined. This is usually the business unit or person who is responsible for the data in question. This might mean that they are responsible for the system that generates the data, or are the subject-matter expert for the ensuring the quality and accuracy of the data in question.
Domain This is the context in which the data element exists. For example, data elements within a “client” domain may include client ID, first name, last name, address, city, state, zip code, order history, preferred payment method, and other data elements that are related to the client. Sometimes a data element can be part of more than one domain.
Default Value This specifies if the data element has a default value, and if so what that value is.
Security Classification If the data element has a particular security classification that should control its use or access to the data, it might be noted here. An example might be sensitive client data (such as Social Security Numbers in the U.S.) that a firm might have to store for regulatory purposes, but for which access should be restricted and which should not be made public.
* This is not intended to be a complete list of possible logical attributes. Please add any new attributes you may use or be aware of.

Physical Attributes

For each data element, the following attributes may be used in a data dictionary:

Attribute Name Description
Size If the element has a specific size (25 characters), or a maximum size (100kb), it should be documented here
Database Name The name of the database the element is stored in
Database location The specific location (column, row, record, etc) in a database that the element is stored in.
Data Type The type of data the elements is stored as. This can include a Boolean value, integer, number, record, or a number of other specific data types.

* This is not intended to be a complete list of possible physical attributes. Please add any new attributes you may use or be aware of.

Step 3 – Define the Process

Decide what your version control system and change process for the data dictionary is. The data dictionary most frequently the document that is worked on by multiple parties simultaneously (business analysts, systems analysts, data analysts, data architects, etc.) and is thus one of the documents most in need of a solid version control and update process.

Step 4 – Adding Elements

Start defining your data elements.

Make sure that each data dictionary entry defines only one data element. When working to define a new element, make sure to figure out if the element under discussion is a primitive or composite data element. Primitive data elements are those that cannot be deconstructed any further. Composite data elements are those that are made up of or incorporate primitive data elements.

For example, a name element is generally a composite data element that is made up of the following primitive data elements: first name, middle name, family name, and name suffix. There may also be a need to specify a name prefix element that would support titles such as “Mr.”, “Dr.”, “Ms.”, and other similar titles.

When defining a composite element you should specify the primitive elements first and then the composite element, making sure that the definition of the composite element explicitly references the primitive elements that it is made up of.

Step 5 – Verify Attributes

The attributes for each data element in the dictionary should be verified with all stakeholder groups in order to reach agreement on naming, definitions, attributes, formulas, and other critical attributes of the data.

What Should the Results be?

The end result of a data dictionary effort should be a single artifact that:

  • Uniquely identifies all data elements that are being used by a system that are used, created, or impacted by a solution effort.
  • Ensures consistent naming of data to ensure that the everybody involved in the solution effort understand exactly what is meant by any data reference.
  • Provides an easy-to-use reference for both the business and technical stakeholders to leverage for reporting, development, and similar needs.
  • Supports future development and support efforts by precisely defining every data element and its related attributes.

Advantages

  • Data dictionaries are the single best way to discover the sort of language ambiguities that occur when different users use the same term for different things, or different terms for the same thing. The rigor involved ensures that data elements used in a solution are the right elements.
  • A complete data dictionary with the necessary attributes captures is one the key factors in the success of many software projects. This is especially true for data-heavy applications where hundreds or even thousands of data elements are involved.

Disadvantages

  • Data dictionaries are among the most time-consuming and frustrating requirements artifacts to create and verify.
  • Verification of data dictionaries often requires the simultaneous involvement of multiple business groups and one or more technical groups, in addition to the business analyst(s) involved. Without all of these parties present you may miss important information or complexities in the data that are based on the internalized assumptions of some participants.

Tips

  • Be sure to double check the data sources your technical team propose to use for your application. Check with the originator of the data to make sure it meets your precise needs. There have been occasions when the technical team thought they were getting the data we specified, only to find that despite having a similar name and similar description it was calculated in a slightly different way than was needed. If you are working in a regulated environment, or even worse, in a multinational that has to deal with multiple regulatory environments, being 100% sure that the data is exactly what you need it to be can be the difference between a successful project and major fine from a regulator.
  • Class Diagrams and Entity Relationship Diagrams are fantastic companion pieces to your data dictionary.
  • When you start eliciting requirements, I have found it useful to start documentation using three documents: a data dictionary, a class diagram, and process flows or BPMN diagrams. The data dictionary helps me determine if different people are using the same term in different ways; the class diagram helps to start documenting the entities (as classes) involved, the data they use, and the actions they take; and the process diagram to start mapping the different work flows of those involved.
  • In some cases you may have to work through an Enterprise Data Dictionary, which acts as a single source for all data definitions used in an enterprise. Rather than creating a separate data dictionary, all references to data elements would provide a reference to the specific entry about that data element in the EDD. This usually involves working though a data governance and/or data stewardship process.

References

1. IBM Dictionary of Computing, 10th edition, 1993
2. Oracle Database Concepts 11g Release, Data Dictionary
3. Book Chapter: The Data Dictionary, by Marty Modell in the online version of his “A Professional’s Guide to Systems Analysis”, Second Edition
4. BABOK Guide, section 9.5.3
5. The Information System Consultant’s Handbook: System Analysis and Design, by William S. Davis and David C. Yen.
6. Wikipedia Entry – Data Dictionary
7. Structured Analysis Wiki – Data Dictionaries
8. Business Data Dictionary Template – Project Connections web site.
9. Best Practices for Data Dictionary Definitions and Usage, Version 1.1 2006-11-14. Northwest Environmental Data Network.

Other Resources

 



© 2013 by David Olson

Leave a Reply

Your email address will not be published. Required fields are marked *