Case study: The process of transforming data into useful information
The ETL model for extracting, transforming, and loading data is used with an audit that guarantees success
“A company’s intelligence quotient is determined by the degree to which its IT infrastructure connects, shares, and gives structure to information,” Steve Haeckel, former Director of Strategic Studies at IBM’s Advance Business Institute.
We have always been told that information is power. But this is not exactly true. Data, in itself, is nothing or almost nothing. A transformation process is needed to convert data into information that once analyzed, is valuable.
In recent years, a variety of names that refer to data analysis have emerged. Traditional business intelligence now includes concepts such as data discovery, visual analysis, agile BI, and business analytics.
Data collection process
In order to use information, a series of initial processes must be performed to transform data and adapt it to our analysis model.
This process, known as ETL (Extraction, Transformation, Load), is made up of three main phases:
- During the extraction stage, data is obtained from sources by downloading flat text files (or obtaining them from the client), and then loading them into the repository (ODS) in intermediate tables that contain the data without the final structure of the model.
- The information is adjusted in the transformation stage. In this process, it is common to duplicate tables that contain the correct information and to create new fields or new tables with aggregate and/or calculated data. For example, this may be done in order to group information by criteria for geography, time period, or hierarchical or commercial structure.
- Finally, in the third and final step of ETL, data loading programs are executed, and once the information has been reorganized, it is loaded into the definitive tables of the data repository: datawarehouse (corporate) and/or datamart (departmental). The tables that contain the correct information are once again duplicated, and the new fields that are needed to contain all the information are created.
A highly recommended (and sometimes essential) step is to perform a data audit before the definitive load because incorrect information must not be added to our analysis model since this would distort the results by creating null values, duplicates, extreme values, text instead of numbers, etc.
A good audit must consist of the following steps:
- Detection of incorrect data
In the first step, we must determine the potential casuistics of incorrect data. Since most of the information usually originates from a database that is already operational, these casuistics are usually identified. However, executing simple processes such as counting null or accumulated data can help prevent last-minute surprises.
Cleaning refers to all the processes for deleting records that will not be added to our data storage. It is advisable for these records to be stored in tables that will not be used (as a backup copy) in the event that a back-out is required. For example, this refers to records without valid information, null or incorrect fields, isolated data, etc.
This step is followed by standardization, which ensures that all the values which refer to the same data contain unified information, in a single value, as in the case of people’s names, addresses, the languages for the addresses, etc.
The deduplication process consists of identifying potential data duplicates and placing them in quarantine until they have been ruled out and deleted as per the corresponding criteria. It is sometimes necessary to perform parallel calculation processes (sums, counting, averages, percentages, etc.) in order to group the different values contained in each of the client’s codes.
If all the previous steps of the audit have been completed successfully, the integrity (coherence) of the data is guaranteed and we may proceed with the definitive load.
For all of these steps, there are tools available on the market that facilitate the processes by automating them and making it possible to design a workflow so the changes in the ETL have a minimum impact on development costs.
Once all of these processes have been performed, the ETL and data collection portion is completed. We will then be ready to analyze our information. In this initial phase, the key is to guarantee that the data is debugged and imported correctly for its subsequent use and application in the area of business intelligence. This will be explained in the next article.