
31.12.10

Understanding DataStage

Posting some videos related to DataStage.

DataStage Installation


DataStage Parallel Processing Architecture Overview


Lookup Stage


Reading the Score


Data Elements


IBM InfoSphere DataStage

DataStage - Problem when running large parallel jobs on Windows

If you have large parallel jobs (8.1) running in a Windows environment, it is worth checking the server's event log. You may find an error related to Nutcracker.exe. An administrator with access rights on the server can view the event logs.

The Nutcracker error can come up while running very heavy jobs where total CPU utilization is at or above 90% and the server starts using virtual memory. This is just one example of the impact of running large parallel jobs on a Windows server. To make the system somewhat more robust, the following changes can be made in the Windows registry. The idea is to tune the TCP/IP buffer, increase the heap size and set the virtual memory to the recommended amount.

To tune the TCP/IP settings, go to the registry editor and check whether the MaxUserPort and TcpNumConnections values are present under the TCP/IP parameters key. Generally they will not be there. The TcpNumConnections setting affects the number of osh processes.

To change the settings:
Go to the registry editor.
Go to HKEY_LOCAL_MACHINE > SYSTEM > CurrentControlSet > Services > Tcpip > Parameters.
Now on the right pane add or modify the values:
Value Name: MaxUserPort
Value Type: DWORD
Value data: 65534

Value Name: TcpNumConnections
Value Type: DWORD
Value data: 65534
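If you prefer, the two values above can be applied in one go by importing a small .reg file instead of editing the keys by hand. This is just a sketch of the same change (65534 decimal is fffe in hex; back up the registry before importing):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"MaxUserPort"=dword:0000fffe
"TcpNumConnections"=dword:0000fffe
```

Save the content as a .reg file and double-click it, or import it via File > Import in the registry editor. A reboot is needed for TCP/IP parameter changes to take effect.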

To change the heap size
Go to the registry editor to change the non-interactive desktop heap size.
Go to HKEY_LOCAL_MACHINE > SYSTEM > CurrentControlSet > Control > Session Manager > SubSystems.
Click on Windows in the right pane.
Now edit the SharedSection parameter in the value box.
Change the last value from 512 to 1024.
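For reference, the SharedSection setting sits inside the long Windows value under SubSystems, and only its third number (the non-interactive desktop heap size, in KB) is changed here. The other numbers vary by system, so treat this as an illustrative sketch:

```
SharedSection=1024,3072,512     (before: third value is the non-interactive desktop heap, in KB)
SharedSection=1024,3072,1024    (after)
```

Leave the rest of the Windows value string untouched; a reboot is required for the new heap size to apply.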

Now, what is the desktop heap and why are we changing it?

In Windows, anything you create as a desktop object has some amount of memory attached to it, known as the desktop heap. The desktop heap is responsible for storing user interface objects, for example menus (the Start menu) and windows (any new window). When we run an application which needs to create these kinds of interface objects, it allocates space from the desktop heap. In normal scenarios no one bothers about how much desktop heap has been configured in the system. The interesting thing here is that if an application does not require any interface objects, it will not allocate any area from the heap. The value we are changing here relates to the non-interactive desktop, which is used by services. So we are allowing more heap in our environment so that it is not exhausted while running services.

Now the last bit is changing the virtual memory configuration.

Virtual memory can be configured by right-clicking on "My Computer", opening Properties and going to the Advanced tab's Performance section. Now change the virtual memory to the recommended size. For example, if you have 8 GB of RAM then the recommended size would be 12 GB (roughly 1.5 times the RAM).

All the above settings can help overcome issues related to large jobs, but the patches recommended for the Nutcracker issue should also be installed.

Hope this will help.

Happy working...

30.12.10

DataStage - Log purge using Loggingadmin

Sometimes having too many logs stored in the metadata repository can be a reason behind a job's slow performance.

Periodic log clearance will not only improve performance but also keep the metadata database healthy. There are two ways to clear the job log:

1) By setting the clear log property present in the Director client under the Job menu. There are options to choose from: you can purge the log just after the job run, purge up to the previous few runs or days, or up to a particular date. This approach is fairly simple and can be performed by a developer with the appropriate access rights. It is also project dependent.

2) By using logging admin. Logging admin is a utility which can be executed on the server from the command prompt. The following command can be executed in order to schedule a log purge:

Loggingadmin -user <<admin user>> -password <<admin password>> -create -schedule -name "DS Job log purge" -frequency -minutes 30 -threshold 10000 -percentage 100 -includecategories IIS-DSTAGE-RUN

What is ETL?

ETL is a term related to Data Warehousing; the abbreviation stands for Extract, Transform and Load. Extraction is importing data; Transformation is preparing the data into the required form, one that is understandable to the business or the report reader; Load is the last stage, where the data that has been imported and transformed is loaded into the Data Warehouse.

Extracting what and from where?

A Data Warehouse is a collection of data. The collection can be from operational databases, from files or from third-party interfaces (a third-party interface can be anything which holds data that is required for reporting or decision making, for example an MIS or a web survey, to name a few). So how does this data come to the place where it can be transformed? Normally a Data Warehouse design keeps a staging area where all the collected data can be stored. Bringing data from the different operational databases, files and third-party interfaces is known as Extraction. The staging area is also used to keep all the collected data in one location and in a single format.

What is Transformation and transforming what?

Transformation is mainly changing the extracted data into a form that is understandable to the decision makers. Transformation can be based on many business rules, so that the extracted data is polished enough to be loadable. It also depends on the source of the data. Some operational databases have very good checks and verifications in place which do not allow users to input uncleansed data. But this is not the case everywhere: some legacy databases, files and third-party interfaces have data which needs cleansing. This cleansing is itself a transformation; however, transformation is not limited to cleansing. Sometimes business logic is also part of the transformation.

For example:
1) Updating the ID values in the form of YES or NO.
2) Uploading selected number of columns out of given file.
3) Uploading data from XML files in tabular forms.
4) Calculating the fields based on business logic.
5) Fetching descriptive fields from lookup (master) tables.
6) Grouping the values or aggregating the values.
7) Ordering the output.
8) Reading the complex files.
9) Preparing flat files for other interfaces.

All the above examples come under transformation of the data.
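A few of the transformations above (selecting columns, turning an ID value into YES/NO, fetching a descriptive field from a lookup table) can be sketched outside DataStage in a few lines of Python. The field names and lookup data here are made up purely for illustration:

```python
# Minimal transform sketch: 'rows' stands in for data read from a
# staging area, and country_lookup is a made-up master (lookup) table.
rows = [
    {"id": 1, "active": 1, "country_id": "IN", "amount": "250.5"},
    {"id": 2, "active": 0, "country_id": "UK", "amount": "99.0"},
]
country_lookup = {"IN": "India", "UK": "United Kingdom"}

def transform(row):
    return {
        "id": row["id"],                                        # keep selected columns
        "active_flag": "YES" if row["active"] == 1 else "NO",   # ID value as YES/NO
        "country": country_lookup.get(row["country_id"], "Unknown"),  # descriptive lookup
        "amount": float(row["amount"]),                         # type conversion
    }

transformed = [transform(r) for r in rows]
```

Each dictionary in `transformed` is now in the polished, loadable shape described above; in DataStage the same logic would typically live in a Transformer stage with a Lookup stage feeding it.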

Now the last bit of the abbreviation, Load. So what is Load?

Load is the last stage in any ETL process, where the data which has been transformed gets loaded into the data warehouse. Loading can be of different types as well, depending on the requirement: it could be incremental, rebuilding from scratch, fortnightly or monthly uploads. Sometimes it can even be once a year. It all depends on the requirement and the design of the warehouse.
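The difference between rebuilding from scratch and an incremental load can be sketched as follows; the key field and the dictionaries here are assumptions for illustration only:

```python
# Target keyed by a business key: a full load replaces everything,
# an incremental load only inserts or updates the arriving rows.
target = {1: {"id": 1, "amount": 100.0}}

new_rows = [{"id": 1, "amount": 150.0}, {"id": 2, "amount": 75.0}]

def full_load(rows):
    # Rebuild from scratch: drop existing contents, reload all rows.
    return {r["id"]: r for r in rows}

def incremental_load(target, rows):
    # Upsert: update matching keys, insert new ones, keep the rest.
    merged = dict(target)
    for r in rows:
        merged[r["id"]] = r
    return merged

target = incremental_load(target, new_rows)
```

After the incremental load, row 1 is updated and row 2 is inserted while any untouched rows would survive; a full load would simply replace the whole target with the new rows.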

Rest in next post.

Happy working...

DataStage Project Name With Space

It is a common problem which can cause rework, so we should remember not to give a project a name with spaces in it during installation. If the name contains spaces, the project will not be usable.

Happy working...

29.12.10

New user unable to login to DataStage client

What I have seen with many new users is that they complain about not being able to log in to the DataStage client after setup. This could be an issue with the access rights on the UV_USERS file.

Ideally the user information should be stored automatically in the UV_USERS file. The file will not have any information in it if the particular user does not have access to it. Make sure a new user has access to the file and then ask the user to try logging in.

It is also worth checking whether the user credentials have been set up properly if a general user credential is not being used. User credentials can be checked by logging into the DataStage web console --> user credentials.

28.12.10

DataStage Interview Questions - Part 1

Posting a few questions to help you prepare for a DataStage interview:

1) Explain the architecture of IBM information server?

2) What is the difference between SMP and MPP?

3) How to check the current version of the DataStage application?

4) In what order does DataStage execute stage variables?

5) How is a constraint different from a stage variable?

6) How do you perform a date difference in the Transformer stage?

7) What is a sparse lookup and how is it different from a normal lookup?

8) How does a sparse lookup impact performance?

9) What would be your first three steps when asked to optimize an existing job?

10) Consider a scenario where you have 2 different projects set up on a single server. Both projects are similar but differ in data. Can we use a dataset created in one project in another one? If not, then why?

11) Up to what record size can DataStage handle? What would happen if an Oracle table's record size is greater than the size supported by DataStage? How would you handle this?

12) How would you delete the Datasets in bulk?

13) What do you understand by an operator and how is it associated with a stage?

14) When should we use the APT_ORACLE_LOAD_OPTION environment variable?

15) How is the load method different from upsert?

16) Where does a dataset get stored in a DataStage setup? What is the significance of the scratch area?

17) What is fork join?

18) How would you clear the locks from a job?

19) Can a DataStage job be deleted from the Director client?

20) How would you turn a warning into information in the Director client?

21) Can we connect to Lotus Notes from DataStage? Does IBM support this interface?

22) What can be set from the Tunables page of the Administrator client?

23) What is runtime column propagation?

24) Which is faster, a Transformer or a Filter stage?

25) How does the score change when you have a Transformer in the job?

26) Why do you need a C++ compiler to be installed on the server?

27) Can you run multiple copies of the same job at the same time? How would you differentiate the two copies then?

28) How would you catch Oracle exceptions in case you are using Oracle Enterprise Stage?

29) How would you kill a particular DataStage session?

30) In what kind of requirement would you use the Analyzer?

31) How would you perform impact analysis in DataStage?

32) What is the drawback of using peek stage permanently in the job design?

33) What is the difference between BuildOps and Wrapper stages?

34) Can a column generator stage be the first stage in any job design?

35) List the services which need to be started for successful use of DataStage.

36) How can you stop and start DataStage services in a Windows environment?

37) How can we export only the executable of a job?

38) Can duplicate records be removed just by using the Sort stage?

39) We have a job which processes 50 million records and takes 4 minutes to run. If we add a Copy stage in between, would it impact the performance of the job?

40) Can one DataStage project have multiple configuration files?

41) How would you release resource locks?

42) How would you turn a warning into an info?

43) What is a checkpoint with respect to sequences? How would you set this property for all sequences in a project in one shot?

44) Can RCP property be set for all jobs?

45) Does shared container support RCP?


more questions on the way...