
31.1.11

Food for Thought - Episode 7 (Answered)

What is the reason behind the following error:

29.1.11

DataStage Interview Questions - Part 2

1. What is "Parallel Processing"?
In simple words, parallel processing means executing a job on multiple processors, so that the ETL work is performed simultaneously.

26.1.11

DataStage String Functions

Following is the list of DataStage String Functions:

Function            Usage
AlNum               Can be used to check whether the given string contains only alphanumeric characters
Alpha               TRUE if the string is completely alphabetic
CompactWhiteSpace   Reduces all consecutive whitespace to a single space
Compare             Compares two strings for sorting
CompareNoCase       Compares two strings, ignoring case
CompareNum          Compares the first n characters of the two strings

Food for Thought - Episode 6

Scenario: A column contains both numeric and alphanumeric values (see the example below):

Source column
ABC
123
DEF
456
GHI
789
JKL
101


24.1.11

How to fetch just the first and last record from a DataStage sequential file

How would you fetch just the first and the last record from a sequential file without using operating system commands, keeping the design platform independent? Following is one approach to address this problem:



22.1.11

DataStage - Triggers and Expressions of a Sequence

Activity stages in a job sequence have a tab called Triggers. This tab gives you a place to specify expressions on the output links; in other words, it provides a facility to control the behaviour of the job activities inside a sequence when they are executed one after another.


20.1.11

Food for Thought - Episode 5 (Answered)

What is the use of the following special values?

1. $PROJDEF
2. $ENV
3. $UNSET

Give this challenge a shot. Reply.

16.1.11

Array size property of Oracle Enterprise Stage

Array size is a property of the Oracle Enterprise stage which helps reduce the number of context switches between DataStage and the Oracle database. A large array size improves job performance because it speeds up the write operation on the database. Keep in mind, however, that although it reduces context switches, it can also push the metadata repository database toward deadlocks (if the metadata repository is SQL Server).

Happy Working...

15.1.11

DataStage &PH& directory

The &PH& directory in DataStage contains files which get created during a job's execution. The two letters in the directory name hint at the word "phantom", which is the naming convention in DataStage for background processes. Every time a job runs, it generates these logs, which contain the status of the different phantom processes created during the run. Two types of files can be seen in the &PH& directory: those whose names start with DSD.RUN and those whose names start with DSD.OshMonitor. Following is sample content from a file starting with DSD.OshMonitor:

DataStage Job 57 Phantom 7652
readSocket() returned 16
DataStage Phantom Finished.

Sample content from a file starting with DSD.RUN is as follows:

DataStage Job 103 Phantom 5388
DataStage Phantom Finished.

Each project on the DataStage server contains a &PH& directory. This directory can grow very rapidly if it is not maintained periodically, so a scheduled job can be created on the server to clean it, as sketched below. Click here to see my post on server maintenance. Apart from the project folders, the DSEngine directory also has a &PH& directory in it, but that location does not receive log files during job execution.
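A minimal sketch of such a scheduled cleanup as a Windows batch script. The install path, the project name MYPROJECT and the seven-day retention are assumptions, and the files in &PH& belong to phantom processes, so run it only when no jobs are active:

@echo off
rem Hypothetical cleanup of one project's &PH& directory - adjust the path first
set "PHDIR=C:\IBM\InformationServer\Server\Projects\MYPROJECT\&PH&"

rem Delete phantom log files older than 7 days. forfiles is built into newer
rem Windows releases; on Windows Server 2003 it comes from the Resource Kit.
forfiles /P "%PHDIR%" /M DSD.* /D -7 /C "cmd /c del /Q @path"

The script can then be registered with the Windows Task Scheduler so the directory never grows unattended.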

Happy Working...

13.1.11

Exporting DataStage jobs from command prompt and GUI

DataStage jobs can be exported in two ways: using the GUI or using the command prompt. To export jobs from the GUI, go to the Export --> DataStage Components menu, or right-click on the particular job to export it. A pop-up window named "Repository Window" will appear on the screen, where all or selected jobs can be exported and where various options can be chosen. The command-prompt route is sketched below.
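For the command prompt, the dscmdexport client utility exports a whole project to a .dsx file. The following is only a sketch: the domain, host, credentials, project name and target path are placeholders, and the exact flags vary between DataStage versions, so verify them against the usage output of the tool on your client machine.

rem Hypothetical whole-project export (DataStage 8.x style flags, run from the client's install directory)
cd /d "C:\IBM\InformationServer\Clients\Classic"
dscmdexport /D=domain:9080 /H=dsserver /U=dsadm /P=secret MYPROJECT C:\exports\MYPROJECT.dsx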


Food for Thought - Episode 4 (Answered)

Scenario: We want to run a three-stage parallel job in an 8-CPU SMP environment with at least two parallel nodes defined in the configuration file.

Challenge: What is the minimum number of processes this job would create? No combining is set on any stage.

Give this challenge a shot. Reply.

12.1.11

What is Data Warehouse

A Data Warehouse is a relational database which is mainly used for analytical and decision-making purposes. Good analysis and decision making happen when different sets of data are available; by different sets I mean historical data to compare with. Almost all Data Warehouses hold historical information. This historical information gets stored in the data warehouse over a period of time from different sources, which can be operational databases (OLTP DBs), third-party interfaces or even various types of files.


10.1.11

Food for Thought - Episode 3 (Answered)

While collecting data, if the order of the records matters, which collection method should be used?

Give this challenge a shot. Reply.

8.1.11

Local and Shared Containers - A Quick walk-through

As the name suggests, a local container (LC) contains stages locally within a job, while a shared container (SC) contains stages within a job and also allows other jobs to use them. Differences and qualities of these two types of containers can be listed as follows:


6.1.11

Food for Thought - Episode 2 (Answered)

Scenario: Read a dataset and load an Oracle table using the Oracle Enterprise stage with the "Load" option. The table to be loaded has a column named "X" defined as VARCHAR2.

Challenge: Column "X" in the dataset has some values with leading and trailing spaces and some with only spaces. How would you make sure that, while loading the data, these spaces are not removed or converted into NULL?

Give this challenge a shot. Reply.

Happy Working...

Data Profiling - A dip in the data

Data profiling is an interesting task; if you love data, you will enjoy doing it. As the name suggests, data profiling means profiling or analysing the data: analysing its organization, type, volume and all the other aspects data can have. There can be various reasons for profiling. The following are just a few:

1) To check that the data is suitable for business requirements other than the one for which it originated.
2) In the case of migration from one system to another, it is an exercise to analyse whether the data is suitable for the target system.
3) Profiling can yield information about the data's container: whether the database or file system is suitable to hold the information or needs to change.
4) To categorise the data so it can be used more effectively to make business decisions.
5) Sometimes profiling is required to check the quality of the data. A quality check of the data can yield wonderful results and can be beneficial for DSS and other interfaces.

During the analysis following aspects of the data can be considered:

Database/Business File
Business Table/File Name
Business Table/Element Name
Element Definition
Table Name/File Name
Column Name
Data Type
Data Length
Decimal Place
Valid Values
PK
FK
NULL Y/N
Element Required
Element Mandatory

The above aspects can be subdivided into the logical and physical arrangement of the data. Following are some of the tools used in the industry to profile data:

Informatica Data Explorer
Talend Open Profiler
Oracle Data Integrator
Datamartist
Datiris
Pervasive Data Profiler

Happy Working...

5.1.11

Food for Thought - Episode 1 (Answered)

Scenario: We are reading an Oracle table using the Oracle Enterprise stage and putting the data into a dataset.

Challenge: What would happen if the column names of that table contain characters which are not supported by DataStage? Would this job fail or run? If it fails, why? If it runs, what would the output be?

Give this challenge a shot. Reply.

4.1.11

DataStage server hang issue

There could be various reasons behind this issue; I will list a few of them here. In my case, the set-up is IBM Information Server 8.1 on Windows Server 2003 SP2. It is an SMP set-up, and the metadata repository database (SQL Server 2005) is on the same server.

Now, a server hang issue can occur when:

1) The metadata repository database detects a deadlock condition and chooses the failing job as the victim of the deadlock.
2) Log maintenance is ignored.
3) Temp folders are not maintained periodically.

I will try to explain the above three points in detail below:

1) Occurrence of deadlocks in the metadata repository database - I have seen this scenario in DataStage 8.1 with no fix packs installed. If you have the fix packs (released later) installed, you may not get this problem at all. IBM Information Server throws an exception like "[IBM][SQLServer JDBC Driver][SQLServer]Transaction (Process ID 59) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction." This exception makes the whole process unresponsive, which in turn hangs the server and requires a physical reboot. Deadlocks can always happen; we can only reduce their occurrence. The following two actions can be performed to reduce the possibility of a deadlock situation:

a) Keep user interaction minimized during transactions. This is more relevant to SQL queries where user input is required.
b) Keep fewer records in one batch. For example, if you are using the Oracle Enterprise stage to load data, defining a smaller array size reduces the possibility of a deadlock during the job run.

2) Log maintenance is ignored - In DataStage 8.1, logs get stored in the metadata repository database. A table called "logging_xmetagen_LoggingEvent1466cb5f" holds all the logs in the XMETA database. If we ignore purging the logs periodically, this table grows very large and DMLs take longer to execute. Hence we should schedule log purging/clearing periodically; it can be scheduled from the Director client.

To schedule the auto log purge, select a job, go to the Job menu --> Clear Log, and choose the options from the pop-up window. You can set the logs to purge immediately after a job run, or set auto purge up to some number of previous runs or until a particular date. It is better to set the auto purge with the "Older than (days)" setting.

One more thing to note here is that the table "logging_xmetagen_LoggingEvent1466cb5f" keeps logs for all the projects present on the server. Hence we should not simply fire a DML against this table, as this could impact other projects as well. To remove entries from this table we can use the following command, which creates a schedule on the server using the LoggingAdmin utility and deletes the logs.

This command will delete the logs from the table "logging_xmetagen_LoggingEvent1466cb5f" irrespective of the project, so be careful while executing it.

LoggingAdmin -user [DataStage Admin User] -password [DataStage Admin Password] -create -schedule -name "DS job log purge" -frequency -minutes 30 -threshold 10000 -percentage 100 -includeCategories IIS-DSTAGE-RUN

The above schedule can be deleted with the following command:

LoggingAdmin -user [DataStage Admin User] -password [DataStage Admin Password] -delete -schedule -name "DS job log purge"

To check the number of records in the "logging_xmetagen_LoggingEvent1466cb5f" table, the following query can be fired against the XMETA database:

select count(*) from xmeta.logging_xmetagen_LoggingEvent1466cb5f

Having a lot of logs in the repository database slows down the log purge activity, which in turn slows down job performance.

In DataStage 8.1, job log storage changed from the file system to the database. In case your job logs have grown very large and you need a quick performance gain, you can switch back to file system logging by changing the following two project-level parameters in the DSParams file on Windows:

RTLogging=1
ORLogging=0

After changing the above settings, logs will be stored in the file system, and old logs can still be viewed in the Director client. Once we clear all the logs, we should revert the changes to the original settings. The DSParams file can be found at \IBM\InformationServer\Server\Projects\YOUR_PROJECT\
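A small sketch of making that change from a command prompt on the server. The drive, install path and project name are placeholders, and keeping a backup copy makes it easy to revert later:

rem Hypothetical example - adjust the install path and project name
cd /d "C:\IBM\InformationServer\Server\Projects\YOUR_PROJECT"
rem Keep a backup copy of the project settings before editing
copy DSParams DSParams.bak
rem Open DSParams, set RTLogging=1 and ORLogging=0, then save
notepad DSParams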

3) Folder maintenance - There are a few folders which need maintenance in DataStage in a Windows environment. The first is the tmp folder, which stores the virtual datasets and other material while jobs execute. Another is UVtemp; we should make sure this folder does not get full, as that would cause jobs to hang or slow down, so keep clearing it periodically. Last but not least is the &PH& folder; keep this folder clean as well. To remove the entries from the &PH& folder, the "CLEAR.FILE &PH&" command can be executed from the Administrator client: go to the Administrator client --> click on the Projects tab --> select the appropriate project --> click on Command --> paste CLEAR.FILE &PH& into the command text box and click on the Execute button. Once cleared, you will see a popup window with the message "File &PH& has been cleared".
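If you would rather script that last step than go through the Administrator client, the same command can be run through the engine shell from the server's command prompt. This is only a sketch under the assumption that dssh is available under DSEngine\bin on your install and is run from the project's own directory; verify both before relying on it:

rem Hypothetical scripted version of CLEAR.FILE &PH& (paths are placeholders)
cd /d "C:\IBM\InformationServer\Server\Projects\MYPROJECT"
"C:\IBM\InformationServer\Server\DSEngine\bin\dssh" "CLEAR.FILE &PH&"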

If we follow the maintenance activities listed above, we can improve job performance as well as keep the server from going into a hung state. There are a few more tunings we can perform on the server itself to make it suitable for large parallel jobs in a Windows environment. I will write something about them in the next post.

Happy Working...

2.1.11

DataStage - The specified field: does not exist in the view adapted schema.

Most of the time, the error "The specified field: XXXXXX does not exist in the view adapted schema" occurs when we have missed mapping a field. Every stage used in the middle of a job has an Output tab. Make sure you have mapped every single field required by the next stage.

Sometimes this error can occur even after mapping the fields, and one of the reasons could be that the view adapter has not linked the input and output fields. In this case the required field mapping should be dropped and recreated.

Just to give some insight into this, the view adapter is an operator which is responsible for mapping the input and output fields. DataStage creates an instance of APT_ViewAdapter, which translates the components of the operator's input interface schema to matching components of the interface schema. So if the interface schema does not have the same columns as the operator's input interface schema, this error is reported.

Happy working...
