Search This Blog

30.12.10

What is ETL?

ETL Tool is a term related to Data Warehousing. Abbreviation for ETL is extraction, transformation and Load. Extraction is importing data, Transformation is transforming or preparing the data into required form or in a form that is understandable to business or report reader. Load is the last stage where whatever the data has been imported and transformed would be loaded to the Data Warehouse.

Extracting what and from where?

Data Warehouse is a collection of data. Data collection can be from operation databases or from files or from third party interfaces (third party interfaces can be anything which holds data and its data required in reporting or for decision making for example MIS, Web survey to name a few). So how this data would come to the place where it can be transformed? Normally in Data Warehouse design they keep a staging area where all the collected data can stored. Bringing data from different operation databases, files and third party interfaces is known as Extraction. Staging area is also used to keep all the collected data into one location and in singular format.

What is Transformation and transforming what?

Transformation is mainly changing the extracted data into such a form that it is understandable for the decision makers. Transformation of the data can be based on many business rules so that the extracted data can be polished in order to be up-loadable. Transformation of the data depends on the source of the data. Some operational databases have very good checks and verification in place which does not allow user to input un-cleansed data into database. But this is not the case every where. Some legacy database, files and third party interfaces has data which needs cleansing. This cleansing is nothing but the transformation however transformation is not limited to cleansing of the data. Sometime business logic is also the part of transformation.

For example
1) Updating the ID values in the form of YES or NO.
2) Uploading selected number of columns out of given file.
3) Uploading data from XML files in tabular forms.
4) Calculating the fields based on business logic.
5) Fetching descriptive fields from lookup (master) tables.
6) Grouping the values or aggregating the values.
7) Ordering the output.
8) Reading the complex files.
9) Preparing flat files for other interfaces.

All above example comes under the transformation of the data.

Now the last bit of the abbreviation LOAD, so what is Load?

Load is the last stage in any ETL process where whatever data which has been transformed get loaded to data warehouse. Loading can be of different type as well. This depends on the requirement. It could be incremental, rebuilding from scratch, fortnightly or monthly uploads. Sometime it can be once in a year as well. All depends on the requirement and design of the warehouse.

Rest in next post.

Happy working...

No comments:

Post a Comment