ETL

ETL stands for “extract, transform, load.” It is a process of acquiring and combining data from either separate databases, separate hardware, or different software applications. It is used primarily in business intelligence, building data warehouses and data marts, and migrating information.

The “extract” portion of ETL consists of locating the information required. Extraction is a very critical part of the process because if it is wrong, or not well implemented, data may become corrupted, misattributed, or lost. Part of the process may be analyzing that data to make sure that it fits within certain parameters or is in a particular format. If it is not, it may be rejected from being implemented in the target database. Programmers need to make sure that they are aware of the types of data that they are looking for, the possible formats, and what aberrations that may exist. Once data is extracted, it can be processed by either the engine of the target database, or it may be processed by an engine inherent to the ETL process.

As data is extracted and analyzed, it is then “transformed.” Transforming the data can be many possible things. Transformation can be merely making sure it is the right data or converting it to a different format, or it could be a complex process of combining different data sets to develop new data relationships for storage. The range of transformation options is dependent on the flavor of SQL being used, the engine that is providing the service, and the target database where the new data is going to be stored.

The level and purpose of transformation is determined by the purpose of going through an ETL process. For gaining business intelligence, it may require complex algorithms to discover trends, evaluate processes, or develop contingencies. If the purpose is to migrate or combine data systems, the ETL process may be as simple as making sure all the dates are in the same format, or tables with different names in different databases, but the same type of data, are aggregated into the correct place in the new database. Hence, the importance of accuracy in the extraction process.

“Load” is simply the ETL process writing the result of the extracted and transformed data into the right place in the correct format. However, the process of writing that data is very important as well. When developing an ETL process, the end result of what the data should be doing is essential. Since ETL is used a lot in data warehousing, loading the data can either overwrite data already stored in, or add new data to, the current database. There may be instances where both are necessary. For instance, in an organization with different departments, information about a specific case may be represented in different ways, with different types of information. The ETL process may over-write customer specific information like if the customer changes their address, but will add information to the database for a customer’s making a new order.