CDC and DataStage. My experience.

IBM Change Data Capture previously known as DataMirror is a Data replication tool and the latest version comes with ETL capabilities as well. Main use of this tool is to move data from one database to another (within a database or between two databases).

Tips - How to prepare for DataStage interview Part 2

Tips explained in part 1 would definitely help one preparing for interview. Now following are few more tips to help friends around preparing for DataStage interviews:

How to prepare for DataStage Interview - Part 1

There are many friends who are trying to crack down DataStage interviews but somehow because of lack of real world project exposure; not able to make it through. The tips explained below are to help them so that they can be confident while perusing interview and prepare in a way that they are capable of handling any kind of question and answers them tactfully.

Incremental or delta load in DataStage

Incremental loading in the DataStage can be performed by using the Change Data Capture stage. This is a stage which takes sorted input on the key fields and generates a change code which denotes the type of change. Based on change code required DMLs can be performed on target database. Following figure can explain the basic design of a CDC stage:

DataStage Interview Questions - Part 2

1. What is "Parallel Processing"?
In simple words, parallel processing means executing the jobs on multiple processors. Performing ETL simultaneously.

DataStage String Functions

Following is the list of DataStage String Functions

Function Usage
AlNum Cab be used to check if the given string has alphanumeric characters
Alpha TRUE if string is completely alphabetic
CompactWhiteSpace all consective whitespace will be reduced to single space
Compare Compares two strings for sort
ComparNoCase Compare two strings irrespective of Case in-sensitiveness
ComparNum Compare the first n characters of the two strings
CompareNumNoCase Compare first n characters of the two strings irrespective of case in-sensitiveness
Convert Replace character in a string with the given character.
Count Count number of times a given substring occurs in a string
Dcount Returns count of delimited fields in a string
DownCase Change all uppercase letters in a string to lowercase
DQuote Enclose a string in double quotation marks
Field Return 1 or more delimited substrings
Index Find starting character position of substring
Left Finds leftmost n characters of string
Len Length of the string or total number of characters in a string
Num Return 1 if string can be converted to a number
PadString Return the string padded with the optional pad character and optional
Right Finds Rightmost n characters of string
Soundex Returns a string which identifies a set of words that are
phonetically similar
Space Return a string of N space characters
Squote Covers a string into single quotation marks
Str Repeat a string
StripWhiteSpace Return the string after removing all whitespace
Trim Remove all leading and trailing spaces and tabs. Also reduce the internal occurrences of spaces and tabs into one.
TrimB Remove all trailing spaces and tabs
TrimF Remove all leading spaces and tabs
Trim Returns a string with leading and trailing whitespace removed
Upcase Change all lowercase letters in a string to uppercase

Food for Thought - Episode 6

Scenario: A column contains both numeric and alpha numeric values in it (see example below):

Source column

Challenge: Segregate these values into two columns one for numeric only and another one for alpha numeric values (see below)

Numeric Only - Alpha Only
123 - ABC
456 - DEF
789 - GHI
101 - JKL

Give this challenge a shot. Reply.

How to fetch just first and last record from DataStage sequential file.

How would you fetch just first and the last record from a sequential file without using operating system commands and where design should be platform independent. So following is one of the approach to address this problem:

Following points explain the above figure:

1. Read sequential file in parallel.

2. Enable "ROW NUMBER COLUMN" property of sequential file stage. This will give us the unique number of type BigInt into a separate column. Now name this column as ROWNUM. Remember user has to define this column into the output schema as this can be omitted in RCP is enabled into the design.

3. Now output of our sequential file stage would go into a transformer. Transformer stage will have two output links where a constraint will be applied on one input to only allow record where ROWNUM = 1. This will give us the first record of the file.

4. Now the second output of the transformer will be the input for a copy stage where an in-line sort will be performed on ROWNUM column in descending order. You can change the copy stage to run into sequential mode otherwise it would also work into parallel. You can drop the ROWNUM column here.

5. The sorted output of the copy stage would now be the input of another transformer stage where a stage variable StageVar will be defined as Integer (with 0 as initial value) and would be assigned by StageVar + 1 value. This will assign unique number to the records being passed from the transformer. A constraint will be defined in this transformer stage to filter only those records where StageVar = 1. This will give us the last record from the file.

6. Both the output from step 3 and 6 will be the input for funnel stage which will ultimately populating the any stage to store the first and last record from a sequential file.

This is one of the approach to handle this requirement. There can be lost of other way to address this situation. Even the approach given here can be tuned a bit more but this one would also do the trick.

If you have a more sophisticated and optimized way to handle it then please share it with the fellow readers.

DataStage - Triggers and Expressions of a Sequence

Sequence stage has a tab called trigger. This tab gives you a place to specify the expressions on the input links or in other words this tab provides a facility to control the behaviour of job activities inside sequences when executed one after another.

For example we have two job A and Job B. Jobs B should always run once job A finishes successfully. There should not be any warnings in the execution of job A and then only job B's execution should starts. Hence to achieve this we will select appropriate "Expression Type" and "Expression" in trigger tab of the sequence.

In above scenario "Expression Type" would be "OK - (Conditional)" and "Expression" section can be blank. Now what would we write if we have a job which can generate warnings and still we want job B to run. Now in this situation we will be specifying different "Expression Type" and "Expressions". Here we should choose Expression type as Custom - (Conditional) and specify Expression as "DSJS.RUNWARN Or DSJS.RUNOK". This expression will make sure that job B runs even if job A generated warnings or executed cleaned.

"DSJS.RUNWARN and DSJS.RUNOK" are status which can be used in the expression section of sequence. Following is the list of available status which can be used in the expression depending upon the requirement.

DSJS.RUNNING - Job running
DSJS.RUNOK - Job finished a normal run with no warnings
DSJS.RUNWARN - Job finished a normal run with warnings
DSJS.RUNFAILED - Job finished a normal run with a fatal error
DSJS.QUEUED - Job queued waiting for resource allocation
DSJS.VALOK - Job finished a validation run with no warnings
DSJS.VALWARN - Job finished a validation run with warnings
DSJS.VALFAILED - Job failed a validation run
DSJS.RESET - Job finished a reset run
DSJS.CRASHED - Job was stopped by some indeterminate action
DSJS.STOPPED - Job was stopped by operator intervention
DSJS.NOTRUNNABLE - Job has not been compiled
DSJS.NOTRUNNING - Any other status

There are few things we should remember with respect to triggers on sequences and they are as follows:

1 - Job activities in a sequence can have one input trigger and multiple output triggers.
2 - Trigger name must be unique in the activity however you can have multiple same named triggers in a sequence.
3 - Trigger would fire if the source meets the conditions defined in the activity. In our example even if Job A generates warning job B will be executed because trigger expression suffice this condition.
4 - DataStage macros can be used in triggers.

If we use correct and appropriate expressions types and expressions in our sequences then we can control the interconnected activities more efficiently in the sequences for better performance and output.

Happy Working...

Food for Thought - Episode 5 (Answered)

What is the use of following special values?

2. $ENV

Give this challenge a shot. Reply.

Array size property of Oracle Enterprise Stage

Array size is a property of Oracle Enterprise stage which helps to reduce the context switch between DataStage and Oracle database. We should keep this in mind that however it help us reducing the context switch but this can lead the metadata repository database toward having deadlocks (if the meta data repository is Sql Server). Having large array size would improve the performance of the job as it will fasten the write operation on database.

Happy Working...