Incremental loading in DataStage can be performed using the Change Data Capture (CDC) stage. This stage takes inputs sorted on the key fields and generates a change code that denotes the type of change. Based on the change code, the required DML operations can be performed on the target database. The following figure shows the basic design of a CDC job:
To know more about data analysis and data science, follow dataenq.com.
In the design shown above, we have to:
1) Define the SOURCE and TARGET datasets as the after and before inputs. The dataset that holds the latest values is the after dataset, and the one it is compared against is the before dataset.
2) Define the change key columns on which the sorted inputs will be compared.
3) Set the change mode to "Explicit Keys, All Values". This means the whole record is considered: a change in any column of the record is detected as a change.
4) Define the DML change codes, e.g. DELETE code = 3, INSERT code = 2, and so on.
5) Once the above settings are done, we are ready to identify the changes between the two datasets. The CDC stage exposes the change code through the change_code() function in the derivation area of the CDC output mapping.
6) Pass this output to a Filter stage, which filters the data based on the change code and writes it to three different output links.
7) The filtered data then goes to a Modify stage, which drops the change_code column from the schema to make the records database compatible.
8) Finally, the output datasets hold all the newly inserted, updated and deleted records. Use the appropriate native database stages to read these datasets and load them into the database. A sketch of this end-to-end logic follows the list.
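To make the flow concrete, here is a minimal Python sketch of the compare-and-classify logic that the CDC, Filter and Modify stages perform together inside DataStage. The function names are illustrative; the INSERT and DELETE codes follow the values above, while the EDIT code (4) is an assumption, since all of these codes are configurable stage properties.

```python
# Minimal sketch of the compare-and-classify logic of the CDC, Filter and
# Modify stages. INSERT/DELETE codes follow the article; EDIT = 4 is an
# assumption (in DataStage all change codes are stage properties).

INSERT, DELETE, EDIT = 2, 3, 4

def change_capture(before, after, key_cols):
    """Merge-walk two record sets pre-sorted on key_cols and tag changes."""
    key = lambda rec: tuple(rec[c] for c in key_cols)
    changes, i, j = [], 0, 0
    while i < len(before) or j < len(after):
        if j == len(after) or (i < len(before) and key(before[i]) < key(after[j])):
            # Key present only in the before set: record was deleted.
            changes.append({**before[i], "change_code": DELETE})
            i += 1
        elif i == len(before) or key(after[j]) < key(before[i]):
            # Key present only in the after set: record is new.
            changes.append({**after[j], "change_code": INSERT})
            j += 1
        else:
            # Same key: "Explicit Keys, All Values" means any column
            # difference counts as an edit.
            if before[i] != after[j]:
                changes.append({**after[j], "change_code": EDIT})
            i += 1
            j += 1
    return changes

def filter_and_drop(changes):
    """Filter stage + Modify stage: route rows by change_code, then drop
    the column so the records match the target table schema."""
    links = {INSERT: [], DELETE: [], EDIT: []}
    for rec in changes:
        links[rec.pop("change_code")].append(rec)
    return links
```

The single-pass merge walk only works because both inputs are sorted on the key columns, which is exactly why the CDC stage demands sorted input.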
Points to remember while using the CDC stage:
1) CDC needs sorted input, so if you have a large number of records to compare, sorting can impact the performance of the job. This kind of job can also be very memory intensive, so place it carefully within the execution sequence.
2) DO NOT forget to drop the change_code field from the schema of the output datasets; otherwise the next job, which loads the data into the database, will fail with an error.
Reusable design:
The above design can be made reusable: the same job can be run for multiple datasets with different schemas to compare and identify changes. Define job parameters for the dataset names and keep the RCP (Runtime Column Propagation) check-box checked so that the schema is loaded at runtime, as in the sketch below.
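Continuing the sketch above (and assuming change_capture and filter_and_drop are in scope), a small parameterised wrapper illustrates the reusable pattern: dataset names and key columns arrive as runtime arguments, analogous to job parameters with RCP enabled. The file paths and column names shown are hypothetical.

```python
# Hypothetical reusable driver: dataset names and key columns are runtime
# parameters, analogous to DataStage job parameters with RCP enabled.
import csv
import sys

def run_compare(before_path, after_path, key_cols):
    """Load two delimited datasets, sort them on the key columns and
    return the filtered change links from the sketch above."""
    def load_sorted(path):
        with open(path, newline="") as f:
            return sorted(csv.DictReader(f),
                          key=lambda r: tuple(r[c] for c in key_cols))
    before = load_sorted(before_path)
    after = load_sorted(after_path)
    return filter_and_drop(change_capture(before, after, key_cols))

if __name__ == "__main__":
    # e.g. python compare.py before.csv after.csv CUST_ID,ACCT_NO
    links = run_compare(sys.argv[1], sys.argv[2], sys.argv[3].split(","))
```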
This is one of the methods that can be used for incremental loading. Here we have seen an example with datasets and a database; if you use the Sequential File stage instead of datasets, you can create files containing the differences as well, as sketched below.
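For the Sequential File variant, a last sketch writes each filtered link to its own delimited file; the output file names are illustrative assumptions.

```python
# Write each change set to its own file, mirroring the Sequential File
# stage variant of the design. File names per change code are assumptions.
import csv

LINK_FILES = {2: "inserts.csv", 3: "deletes.csv", 4: "edits.csv"}

def write_links(links):
    for code, rows in links.items():
        if not rows:
            continue
        with open(LINK_FILES[code], "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
```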