Data Warehouse Testing – Demystified

The expected use of a product shapes how it will be built; for instance, a handbag must have the capacity to store items and a means to be carried. Features accumulate in a product as more details about its usage are discovered or invented.

Expected use not only drives how a product is built but also defines how testing should be performed. System testing is, for the most part, doing what a user would do to confirm that the system behaves as specified and expected under a given scenario. Knowing the purpose of the system is therefore the key to defining any test strategy.

Data warehouse (DW) testing is no exception. To understand this better, let's first touch on some concepts. One of my favorite data warehouse definitions goes something like this:

“A data warehouse is a copy of transactional data specifically structured for querying and reporting,” which essentially captures the purpose of its existence. A DW must therefore be tested for its ability to efficiently store and present information for effective querying and reporting. Let us dig deeper into how to go about that.

The sheer volume of data in a data warehouse makes it impractical to validate each row. On the other hand, selective testing carries its own risks and can allow faulty data to be loaded and/or reported unnoticed. Whenever a system produces so many output elements that it is impractical to check each one, ensuring quality involves more than inspecting sample outputs. For example, for a newspaper publishing millions of copies each day, it makes more sense to have a "controlled process" that lets the company know, within an acceptable margin of error, that the printed newspaper is just as good as what the editor saw on screen. In a software context, that means making sure development follows accepted quality standards and that the testers, along the assembly line, verify that development has done so. This pushes the test team to go inside the black box and identify points along the assembly line (integration) where additional tests can be performed; we call this Gray Box Testing.

Data follows a well-defined path through a data warehouse, so it is easier in a DW project to predefine the points where data will be tested. Typically, data is modeled, extracted from various sources, transformed and loaded into the data warehouse, and then queried as required. These steps provide a general guideline for where quality tests should be performed. Namely, a test team must verify:

  1. Data Model Quality (Logical and Physical)
  2. IO Channel Quality
  3. Data Quality
  4. Presentation Quality
  5. Performance

Test Data:

The most important prerequisite for any data warehouse testing effort is test data. Generally, test data should be available from historical data, and the testing team should not have to create it.

However, test data might not be available when the source system is being developed in parallel, or when historical data still needs to be migrated or is unavailable for some other reason. Once data warehouse development is over, test data will inevitably be needed; in such cases the testing team has to prepare data in the source system:

  • Identify the source tables from which the data warehouse extracts data.
  • Go through the constraints of these source tables and identify the dependent tables.
  • Consolidate this list.
  • Determine the range of possible values for each fact present in these tables, e.g., temperature should range between 45–50 °F. This is a value-add exercise; if it is too time-consuming, the tester can skip it without affecting effectiveness. Data generation tools are available to create the data in these tables; a simple generator is sketched below.
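
If a data generation tool is not available, even a small script can populate the source tables with values inside the agreed ranges. The following is a minimal sketch in Python; the table, the column names and the ranges are hypothetical and would be replaced by those identified in the steps above.

    import csv
    import random
    from datetime import date, timedelta

    # Hypothetical value ranges agreed for each fact in the source table
    FACT_RANGES = {
        "temperature_f": (45.0, 50.0),
        "quantity": (1, 100),
    }

    def generate_rows(row_count, start_date=date(2024, 1, 1)):
        """Yield one synthetic source row per day, keeping facts inside their ranges."""
        for i in range(row_count):
            yield {
                "reading_id": i + 1,  # simple surrogate key
                "reading_date": (start_date + timedelta(days=i)).isoformat(),
                "temperature_f": round(random.uniform(*FACT_RANGES["temperature_f"]), 1),
                "quantity": random.randint(*FACT_RANGES["quantity"]),
            }

    rows = list(generate_rows(30))
    with open("source_readings.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)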

Data Model Quality:

The key to a DW tester's success is understanding the data models. All further testing is based on comprehension of each data element and how it will assist the company, specifically in meeting the business objectives associated with analysis and reporting. Data models should be tested and accepted before the ETL process begins. Data model testing starts with the conceptual model, moves through the logical model and ends with the physical model. The definition of quality for data model elements may vary; testers are expected to use the standard being followed on a particular project.

Conceptual Data Model testing includes verifying the following

  1. The model is expressed in business terms
  2. Business Term Definition Quality
  3. Entity Type Definition Quality
  4. Attribute Definition Quality
  5. Data Name and Definition Consistency
  6. Domain Value Consistency
  7. Business Rule Quality

To test a Logical Data Model, consider the following

  1. Table names should be intuitive
  2. Each table should be identified by how it is to be used within a DW (e.g. fact table, dimension table, summary table)
  3. Primary keys should be identified and listed at the top
  4. Primary keys should make each row in a table unique (a quick uniqueness check is sketched after this list)
  5. Description of each data element should be captured in a repository and a physical name should be assigned
  6. Where possible, physical names should reflect names in the source tables and follow company standards, if any
  7. Data elements that are created through a transformation or summary process should be clearly identified and calculations for each of these data elements need to be captured and entered into the repository.
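
As a concrete illustration of the uniqueness check above, a simple grouping query is enough to flag duplicate primary keys. The sketch below uses an in-memory SQLite stand-in with hypothetical table and column names; on a real project the same query would run against the warehouse tables.

    import sqlite3

    # In-memory stand-in for a dimension table; names and rows are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dim_customer (customer_id INTEGER, customer_name TEXT)")
    conn.executemany(
        "INSERT INTO dim_customer VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex"), (2, "Globex Corp")],  # deliberate duplicate key
    )

    def find_duplicate_keys(conn, table, key_columns):
        """Return key values that appear more than once, i.e. primary key violations."""
        cols = ", ".join(key_columns)
        sql = f"SELECT {cols}, COUNT(*) FROM {table} GROUP BY {cols} HAVING COUNT(*) > 1"
        return conn.execute(sql).fetchall()

    print(find_duplicate_keys(conn, "dim_customer", ["customer_id"]))
    # [(2, 2)] means customer_id 2 appears twice, so it cannot serve as the primary key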

Physical Data Model can be tested using the following

  1. Some data elements captured in the logical model may reside in the physical tables, but will remain as placeholders for future enhancements. A list of all data elements that will not be loaded must be created and verified.
  2. The logical model may also contain several tables that will be collapsed into a single physical table. A list of these should be maintained and verified.

IO Channel and Data Quality:

Data in a data warehouse is extracted from a variety of sources, transformed and then loaded into the warehouse according to the data models defined earlier. This process is known as ETL (extract, transform, load), and there are several tools available on the market to do it. Some companies use home-grown ETL solutions. Since the basic function of off-the-shelf and home-grown tools remains the same, we'll focus on the aspects that affect the quality of this process.

Mature tools not only perform ETL but also provide features such as Data Profiling, Data Assessment and Duplicate Finding, and they report on errors during loading. All of these provide a simple and efficient way to determine IO channel quality and to locate trouble spots.

Any organized attempt to improve data quality must begin with an assessment of what you already have, which can happen at multiple levels. The simplest level is profiling: a way to analyze table- and column-level data using a set of metrics that helps you learn from the data distribution. Data profiling includes row count, null count, maximum and minimum values, statistical metrics and pattern analysis.
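
As a rough sketch of what such a profile looks like, the snippet below computes row count, null count, minimum, maximum and the most common value patterns for a single column; mature ETL tools produce this kind of output out of the box, and the sample values are hypothetical.

    import re
    from collections import Counter

    def profile_column(values):
        """Column-level profile: row count, null count, min/max and value patterns."""
        non_null = [v for v in values if v not in (None, "")]
        # Reduce each value to a pattern: digits become 9, letters become A
        patterns = Counter(
            re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in non_null
        )
        return {
            "row_count": len(values),
            "null_count": len(values) - len(non_null),
            "min": min(non_null) if non_null else None,
            "max": max(non_null) if non_null else None,
            "top_patterns": patterns.most_common(3),
        }

    # Hypothetical postal-code column pulled from a staging table
    print(profile_column(["07302", "08540", None, "0754A", ""]))
    # Unexpected patterns such as '9999A' point straight at suspect rows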

Data assessment is finding similar data elements and phrases and grouping them together to identify the inconsistencies. For example, determining that “New Jersey,” “Jersey,” and “Jersee” are really the same. 

Duplicate Finding identifies duplicate or near-duplicate data using a match definition and a sensitivity. Match definitions are predefined algorithms that let you specify the type of match you want to find. For example, a name column in a table might match on a full name, a first name or a last name. Sensitivity specifies the required precision.
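
The sketch below imitates this behaviour with a generic string-similarity ratio standing in for a tool's match definition, and a sensitivity threshold controlling how close two values must be before they are grouped; the values and the threshold are illustrative only.

    from difflib import SequenceMatcher

    def near_duplicates(values, sensitivity=0.7):
        """Group values whose similarity to an existing group member meets the threshold."""
        groups = []
        for value in values:
            for group in groups:
                if any(
                    SequenceMatcher(None, value.lower(), member.lower()).ratio() >= sensitivity
                    for member in group
                ):
                    group.append(value)
                    break
            else:
                groups.append([value])
        return groups

    # Hypothetical state values extracted from a customer address column
    print(near_duplicates(["New Jersey", "Jersey", "Jersee", "New York"]))
    # [['New Jersey', 'Jersey', 'Jersee'], ['New York']]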

Reviewing the mapping used by the ETL tool is the primary task during data validation. Given the sheer volume of data loaded into the fact tables, it is imperative to validate the mapping and verify that the data you think is being loaded into specific data elements is in fact sourced from the operational tables known to contain that information.

The ETL tool will create a report that can be viewed as is or imported into an analytical interface with reporting capabilities, such as Microsoft Access. Whatever interface is used, it should list the name of each table, both logical and physical, as well as each data element, also logical and physical. Each data element should have a formal description and a mapping back to the source table(s) used to populate it during the ETL process. If any transformation occurs, this too should be documented. If no transformation occurs, the sample SQL used to capture the data element from the source table may also be documented, but this is not required.
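
There is no single required format for this repository; the sketch below simply shows the kind of record worth keeping per data element, with hypothetical table and column names, plus a quick check for incomplete entries before data validation begins.

    # One record per target data element; all names here are hypothetical.
    mapping = [
        {
            "target_table": "fact_sales", "target_column": "net_amount",
            "source_table": "ORDERS", "source_column": "ORDER_AMT",
            "transformation": "ORDER_AMT - DISCOUNT_AMT",
            "description": "Order amount net of discounts",
        },
        {
            "target_table": "dim_customer", "target_column": "customer_name",
            "source_table": "CUSTOMER", "source_column": "CUST_NAME",
            "transformation": None,  # straight move, no transformation
            "description": "Customer legal name",
        },
    ]

    # Flag entries missing a source or description before validation starts
    for entry in mapping:
        missing = [k for k in ("source_table", "source_column", "description") if not entry[k]]
        if missing:
            print(f"{entry['target_table']}.{entry['target_column']} missing: {missing}")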

The next step in data validation is completed through the use of counts. Simple database queries can be run either on an entire table (for most dimensions this is possible) or on a subset (e.g., the number of activity records for the month of May). If the source and warehouse counts are equal, it can safely be assumed that records were not left out due to an error during the ETL or load process. This is further verified by the absence of errors (not necessarily warnings) in the exception reporting produced by the ETL tool.
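
A minimal sketch of such a count check follows; the two in-memory databases, the table names and the rows are stand-ins for the operational system and the warehouse.

    import sqlite3

    # Stand-ins for the operational source and the warehouse; names are hypothetical.
    source = sqlite3.connect(":memory:")
    warehouse = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (order_id INTEGER, order_month TEXT)")
    warehouse.execute("CREATE TABLE fact_orders (order_id INTEGER, order_month TEXT)")
    source.executemany("INSERT INTO orders VALUES (?, ?)",
                       [(1, "2024-05"), (2, "2024-05"), (3, "2024-06")])
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?)",
                          [(1, "2024-05"), (2, "2024-05")])

    def count_for_month(conn, table, month):
        """Count rows for one subset (here, a single month) in the given table."""
        return conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE order_month = ?", (month,)
        ).fetchone()[0]

    for month in ("2024-05", "2024-06"):
        src = count_for_month(source, "orders", month)
        dw = count_for_month(warehouse, "fact_orders", month)
        print(f"{month}: source={src} warehouse={dw} {'OK' if src == dw else 'MISMATCH'}")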

Many dimension tables have an exact duplicate in the operational database schema. Doing a simple table compare can further validate these tables. If a table compare indicates that no differences were found, these tables can be considered 100 percent validated. For additional verification, actual rows from both the operational and data warehouse tables can be printed and listed side by side for comparison.
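
A table compare can be as simple as a set difference over full rows. The toy example below uses literal tuples; in practice both row sets would come from queries against the operational schema and the warehouse.

    # Hypothetical rows from a dimension table and its operational twin
    operational_rows = {(1, "Acme", "NJ"), (2, "Globex", "NY"), (3, "Initech", "TX")}
    warehouse_rows = {(1, "Acme", "NJ"), (2, "Globex", "NY")}

    missing_in_warehouse = operational_rows - warehouse_rows
    extra_in_warehouse = warehouse_rows - operational_rows

    print("Missing in warehouse:", missing_in_warehouse)  # {(3, 'Initech', 'TX')}
    print("Extra in warehouse:", extra_in_warehouse)      # set()
    if not missing_in_warehouse and not extra_in_warehouse:
        print("Table compare: 100 percent match")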

Randomly selecting rows of data from the test environment and comparing them to data stored within the test data warehouse should give a high level of confidence in the ETL process. 
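
A sketch of that spot check, with hypothetical rows keyed on the first column: randomly chosen source rows are looked up in the warehouse and compared field by field.

    import random

    def sample_and_compare(source_rows, warehouse_rows, key_index=0, sample_size=3):
        """Randomly pick source rows and check that the same keyed row reached the warehouse."""
        dw_by_key = {row[key_index]: row for row in warehouse_rows}
        sample = random.sample(list(source_rows), min(sample_size, len(source_rows)))
        return [(row, dw_by_key.get(row[key_index]) == row) for row in sample]

    source_rows = [(1, "Acme", "NJ"), (2, "Globex", "NY"), (3, "Initech", "TX")]
    warehouse_rows = [(1, "Acme", "NJ"), (2, "Globex", "NV"), (3, "Initech", "TX")]
    for row, matched in sample_and_compare(source_rows, warehouse_rows):
        print(row, "OK" if matched else "DIFFERS IN WAREHOUSE")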

Presentation Quality:

While testing the presentation layer, testers should treat the ETL as a black box and assume that the presentation is the only interface available. Presentation testing includes verifying the functionality of the reports and the validity of the data displayed.

For the functionality of the presentation, consider these

  1. The fields present in the report are as specified in the specifications
  2. Units of data are displayed along with the column heading (or as specified)
  3. Report description is available and valid
  4. Drill-down features, if they exist, are functional and yield correct data
  5. Column headings should make business sense
  6. Graphs are present as specified in the requirements

For data validity consider the following

  • In every data warehouse, data will be present at an aggregated level, with reports viewed at summary levels. Invariably, however, data is also stored in the warehouse at the same granularity as the source system. If reports are verified against this granular data rather than against the source system, there is little reason for the aggregations to go wrong.
  • Understand each report: understand how every field displayed in the report links to the schema and trace its origin back to the source system.
  • Alternatively, the OLTP system may have a graphical screen through which the client can enter data. In this scenario the client populates a particular data item through the screens into the OLTP system and the ETL process is then run. After its successful completion, the testing team pulls up a report that shows the same data item; the client is satisfied if the same value appears in the report.
  • The test team can also develop small SQL queries to validate the most granular data items against the OLTP system (a minimal sketch follows this list). The only drawback with this approach is that some calculations done by the ETL process are too complex to emulate in these queries; manual work, such as transferring the data to Microsoft Excel, might then be needed to arrive at the result.
  • There might be some derived data items in the schema whose calculations are too complex to emulate during system testing. In such cases, the test team can exclude them from system testing and rely on the unit test results.
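
Here is what such a query-based check might look like. The OLTP table, the product name and the reported figure are all hypothetical; the point is simply to recompute a reported value independently from the most granular source data.

    import sqlite3

    # In-memory stand-in for the OLTP system; table and values are hypothetical.
    oltp = sqlite3.connect(":memory:")
    oltp.execute("CREATE TABLE order_lines (order_id INTEGER, product TEXT, amount REAL)")
    oltp.executemany("INSERT INTO order_lines VALUES (?, ?, ?)",
                     [(1, "Widget", 10.0), (1, "Widget", 5.0), (2, "Gadget", 7.5)])

    # Figure shown on the summary report for 'Widget', read off the report itself
    report_value = 15.0

    # Independent query against the OLTP detail to confirm the reported figure
    oltp_value = oltp.execute(
        "SELECT SUM(amount) FROM order_lines WHERE product = ?", ("Widget",)
    ).fetchone()[0]

    if abs(report_value - oltp_value) < 0.01:
        print("Report matches OLTP detail")
    else:
        print(f"Mismatch: report={report_value}, OLTP={oltp_value}")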

Performance Testing:

Performance testing is the most important aspect after data validation. A system might satisfy all of the tests above and still fail at the performance level. Performance testing should check for:

  • The ETL process completing within the load window. Check the ETL process for update times and the time taken to process rejected records.
  • Refresh times for standard reports.
  • Refresh times for complex reports.
  • A significant volume of data being present in the data warehouse. Beyond this, data loads for single days or batch windows should be run to check whether they finish within the provided window. Tools are available for simulating the number of concurrent users accessing the system; a small timing harness is sketched below.
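
The sketch below is a very small timing harness; the load window, the report budgets and the placeholder jobs are all hypothetical, and a real test would call the actual ETL run and report refreshes instead.

    import time

    LOAD_WINDOW_SECONDS = 4 * 60 * 60  # hypothetical four-hour load window

    def run_and_time(label, job, budget_seconds):
        """Run a job, time it and report whether it fits inside the agreed window."""
        start = time.perf_counter()
        job()
        elapsed = time.perf_counter() - start
        status = "within window" if elapsed <= budget_seconds else "EXCEEDS window"
        print(f"{label}: {elapsed:.1f}s ({status}, budget {budget_seconds}s)")
        return elapsed

    # Placeholders standing in for the real ETL load and report refreshes
    run_and_time("Daily ETL load", lambda: time.sleep(0.2), LOAD_WINDOW_SECONDS)
    run_and_time("Standard report refresh", lambda: time.sleep(0.05), 30)
    run_and_time("Complex report refresh", lambda: time.sleep(0.1), 300)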

Logging and Analysis of Test Results

Documentation of test results should be strictly adhered to. Log each defect with its error category, the area in which it was found and its characteristics.

  • For verification of report data against the OLTP system, the team can log the query results along with screenshots of the reports.
  • It is advisable to develop a customized spreadsheet for the unit, integration and system level tests.
  • Fix the grain at which testing activities are measured; this should be the condition within the test case.
  • Maintain counts of the number of conditions executed, the number that failed and the number that could not be executed, with reasons.
  • Give testers sufficient space to log their own observations over and above the test case, and maintain proper traceability between all the testing documents. One possible log layout is sketched below.
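
The sketch below writes such a log as a plain CSV; the columns, test cases and defect details are hypothetical and would be adapted to the project's own template.

    import csv

    # Hypothetical columns covering the counts and classifications discussed above
    LOG_COLUMNS = ["test_case", "condition", "status", "error_category", "area",
                   "characteristics", "tester_notes"]

    results = [
        {"test_case": "TC_012", "condition": "Row counts match for May load",
         "status": "Passed", "error_category": "", "area": "",
         "characteristics": "", "tester_notes": ""},
        {"test_case": "TC_013", "condition": "net_amount transformation applied",
         "status": "Failed", "error_category": "Transformation", "area": "ETL",
         "characteristics": "Discount not subtracted", "tester_notes": "Raised defect D-101"},
    ]

    with open("test_log.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_COLUMNS)
        writer.writeheader()
        writer.writerows(results)

    executed = len(results)
    failed = sum(1 for r in results if r["status"] == "Failed")
    print(f"Conditions executed: {executed}, failed: {failed}")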

Completion Criteria

Since you cannot check every data element, how do you know when testing and validation are complete? Completion of testing has to be determined by the number of defects found and resolved. The test log, which identifies all defects and the status of each, is a good source to use in measuring completion criteria. True acceptance and completion can only be determined when the sponsor feels comfortable with the results, both in sample reporting and in the outcomes of the tests previously listed. Without confidence in the numbers, the data warehouse will not be used and will thus be deemed a failure.

To this end, it is important to revisit the purpose of the data warehouse. The data warehouse is, by definition, not an operational system; the data stored in it is typically for analytical and reporting purposes only. The goals of most data warehousing efforts should focus on enabling individuals to create new strategies for success and to optimize their organization. Neither of these goals requires a 100 percent accuracy rate across the millions of rows of facts stored over several years, but inaccuracies caused by incorrect data mapping or unvalidated transformations can be worse than not doing anything at all.

This level of validation will add time to the plan and increase costs, but both can be minimized if these techniques are planned for early and integrated into the existing methodology. In the long run, the business users will be glad you did.

The Human Angle

A tester is in a unique position: while they need to depend on the development team to understand the process thoroughly, there can be some inclination on the developers' part to hide mistakes or not reveal everything.

Also, when you find a bug or something you feel is an error, be very careful about how you convey the issue, as the person who made the mistake may not like having the fault pointed out. The approach here will undoubtedly depend on individual personalities, but with experience the tester learns to deal with such situations diplomatically. Make sure to frame your words carefully when pointing out faults.

Data warehousing applications are subject to ever-changing requirements. It is often the case that a module tested at the start of the testing cycle has changed quite a bit by the time the cycle ends. Data warehousing also provides ad hoc reporting, so there is always the possibility of an untested query malfunctioning. This is unavoidable; it should be understood that this is where new requirements arise, which may require more development and testing.

Changing source systems also pose a risk to the credibility of the testing process.

Such risks can be understood and mitigated by making the client aware of the iterative nature of the data warehousing life cycle.