
6.6.20

Learning various data operations in R with references to SQL commands for easy understanding

Hello there, welcome to my blog. The reason behind this blog is to practice R and R Markdown myself and to share my work with the audience. I will make references to SQL/relational database commands while working with R. If you are not familiar with SQL, just ignore the SQL parts; they are only there for easier understanding.

Prerequisites:
  1. Install R
  2. RStudio (for convenience; R itself is sufficient)
Assumption: you know what your working directory is and where to keep the files you want to load.

10.5.20

How to import multiple .csv files into R when the files share the same column format

Loading multiple CSV files in R that share the same column format

Scenario - We have a number of CSV files with an identical column format and we would like to import all of this data into one data frame.

Solution - the following code loads multiple identically formatted files into a list of data frames in R and then combines them into one big data frame:
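A minimal sketch of one way to do this, assuming the files sit in the current working directory and end in .csv (adjust the pattern or the path to your own set-up):

# list every .csv file in the working directory
files <- list.files(pattern = "\\.csv$")

# read each file into a data frame, giving a list of data frames
df_list <- lapply(files, read.csv)

# stack the list into one big data frame (identical columns assumed)
big_df <- do.call(rbind, df_list)

In SQL terms, this is roughly a UNION ALL over identically shaped tables.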

7.5.20

Common R commands to learn in the early days

If you are just starting with R, here is a list of some commands you should master.

These commands will help you immensely on this journey. Here they are:
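To give a flavour, here are a few of the everyday basics in base R (just a sample; mtcars is one of R's built-in practice datasets, and the file and folder names are only placeholders):

getwd()                        # show the current working directory
setwd("~/Documents")           # change the working directory (placeholder path)
df <- read.csv("myfile.csv")   # read a CSV file into a data frame (placeholder file)
str(mtcars)                    # structure of a data frame
head(mtcars)                   # first few rows
summary(mtcars)                # quick summary statistics
help(mean)                     # open the help page for a function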

26.4.20

How to install R on a MacBook Pro


  • Download R from https://cran.r-project.org/bin/macosx/
  • Select the appropriate version. In this example we will use R-4.0.0.pkg
  • Once downloaded, put the package in your working directory or leave it in the Downloads folder
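Once the installer has run, a quick sanity check from the R console confirms the installed version:

R.version.string   # prints something like "R version 4.0.0 ..."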

23.3.13

CDC and DataStage. My experience.

IBM Change Data Capture, previously known as DataMirror, is a data replication tool, and the latest version comes with ETL capabilities as well. The main use of this tool is to move data from one database to another (within a database or between two databases).

7.6.11

Tips - How to prepare for DataStage interview Part 2

The tips explained in part 1 will definitely help anyone preparing for an interview. Following are a few more tips to help friends preparing for DataStage interviews:

6.6.11

How to prepare for DataStage Interview - Part 1

Many friends are trying to crack DataStage interviews but, for lack of real-world project exposure, are not able to make it through. The tips below are meant to help them feel confident while pursuing interviews and to prepare in a way that lets them handle any kind of question and answer it tactfully.

10.2.11

Incremental or delta load in DataStage

Incremental (delta) loading in DataStage can be performed using the Change Data Capture stage. This stage takes input sorted on the key fields and generates a change code which denotes the type of change. Based on the change code, the required DMLs can be performed on the target database.
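DataStage does this inside the stage itself, but to get a feel for the idea, here is a rough R analogy (my own illustration with made-up before/after data keyed on id; it is not the CDC stage):

before <- data.frame(id = c(1, 2, 3), amount = c(100, 200, 300))
after  <- data.frame(id = c(2, 3, 4), amount = c(200, 350, 400))

# full outer join on the key, then derive a change code per row
m <- merge(before, after, by = "id", all = TRUE, suffixes = c(".old", ".new"))
m$change_code <- ifelse(is.na(m$amount.old), "insert",
                 ifelse(is.na(m$amount.new), "delete",
                 ifelse(m$amount.old != m$amount.new, "update", "copy")))
m

The following figure explains the basic design of a CDC stage: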

31.1.11

Food for Thought - Episode 7 (Answered)

What is the reason behind the following error:

29.1.11

DataStage Interview Questions - Part 2

1. What is "Parallel Processing"?
In simple words, parallel processing means executing a job on multiple processors, performing the ETL work simultaneously.

26.1.11

DataStage String Functions

Following is a list of DataStage string functions:

Function Usage
AlNum Can be used to check whether the given string contains only alphanumeric characters
Alpha TRUE if the string is completely alphabetic
CompactWhiteSpace Reduces all consecutive whitespace to a single space
Compare Compares two strings for sorting
CompareNoCase Compares two strings, ignoring case
CompareNum Compares the first n characters of the two strings

Food for Thought - Episode 6

Scenario: A column contains both numeric and alphanumeric values (see the example below):

Source column
ABC
123
DEF
456
GHI
789
JKL
101


24.1.11

How to fetch just the first and last record from a DataStage sequential file

How would you fetch just the first and the last record from a sequential file without using operating system commands, keeping the design platform independent? Following is one approach to this problem:



22.1.11

DataStage - Triggers and Expressions of a Sequence

The stages of a sequence have a tab called Triggers. This tab gives you a place to specify expressions on the links between activities; in other words, it provides a facility to control the behaviour of job activities inside a sequence when they are executed one after another.


20.1.11

Food for Thought - Episode 5 (Answered)

What is the use of the following special values?

1. $PROJDEF
2. $ENV
3. $UNSET

Give this challenge a shot. Reply.

16.1.11

Array size property of Oracle Enterprise Stage

Array size is a property of the Oracle Enterprise stage which helps reduce context switches between DataStage and the Oracle database. Keep in mind, however, that while it reduces context switches, it can also push the metadata repository database toward deadlocks (if the metadata repository is SQL Server). A large array size improves job performance because it speeds up the write operation on the database.

Happy Working...

15.1.11

DataStage &PH& directory

The &PH& directory in DataStage contains files which get created during a job's execution. The two letters in the directory name hint at the word "phantom", which is the DataStage naming convention for background processes. Every time a job runs, it generates these logs, which contain the status of the different phantom processes created during the run. There are two types of files in the &PH& directory: those whose names start with DSD.RUN and those whose names start with DSD.OshMonitor. Following is sample content from a file starting with DSD.OshMonitor:

DataStage Job 57 Phantom 7652
readSocket() returned 16
DataStage Phantom Finished.

Sample content from a file starting with DSD.RUN is as follows:

DataStage Job 103 Phantom 5388
DataStage Phantom Finished.

Each project on the DataStage server contains a &PH& directory. This directory can grow very rapidly if not maintained, so a scheduled job can be created on the server to clean it periodically. Click here to see my post on server maintenance. Apart from the project folders, the DSEngine directory also has a &PH& directory in it, but that location does not receive log files during job execution.

Happy Working...

13.1.11

Exporting DataStage jobs from command prompt and GUI

DataStage jobs can be exported in two ways: using the GUI and using the command prompt. To export jobs from the GUI, go to the Export --> DataStage Components menu, or right-click on the particular job to export. A pop-up window named "Repository Window" will appear on the screen, where all or selected jobs can be exported. This Repository window offers various options to choose from. These options are as follows:


Food for Thought - Episode 4 (Answered)

Scenario: We want to run a three-stage parallel job in an 8-CPU SMP environment with at least two parallel nodes defined in the configuration file.

Challenge: How many processes, at minimum, would this job create? No combining is set on any stage.

Give this challenge a shot. Reply.

12.1.11

What is Data Warehouse

A data warehouse is a relational database which is mainly used for analytical and decision-making purposes. Good analysis and decision making can happen when different sets of data are available; by different sets, I mean historical data to compare with. Almost all data warehouses hold historical information, which gets stored in them over a period of time from different sources. These sources can be operational databases (OLTP DBs), third-party interfaces or even various types of files.


10.1.11

Food for Thought - Episode 3 (Answered)

While collecting data, if the order of records matters, which collection method should be used?

Give this challenge a shot. Reply.

8.1.11

Local and Shared Containers - A Quick walk-through

As the name suggests, a local container (LC) contains stages locally within a job, whereas a shared container (SC) contains stages in a job and also allows other jobs to use them. The differences and qualities of these two types of containers can be listed as follows:


6.1.11

Food for Thought - Episode 2 (Answered)

Scenario: Read a dataset and load an Oracle table using the Oracle Enterprise stage with the "Load" option. The table to be loaded has a column named "X" defined as VARCHAR2.

Challenge: Column "X" in the dataset has some values with leading and trailing spaces, and some that are purely spaces. How would you make sure that while loading the data these spaces are not removed or converted into NULL?

Give this challenge a shot. Reply.

Happy Working...

Data Profiling - A dip in the data

Data profiling is an interesting task; if you love data, you will enjoy doing it. As the name suggests, data profiling means profiling or analysing the data: analysing its organization, type, amount and all the other aspects data can have. There can be various reasons behind profiling. Following are just a few:

1) To check whether the data is suitable for business requirements other than the one it originated from.
2) In a migration from one system to another, to analyse whether the data is suitable for the target system.
3) To gather information about the data's container: whether the database or file system is suitable to hold the information or needs to change.
4) To categorise the data so it can be used more effectively for business decisions.
5) To check the quality of the data. A data quality check can yield wonderful results and benefit DSS and other interfaces.

During the analysis, the following aspects of the data can be considered:

Database/Business File
Business Table/File Name
Business Table/Element Name
Element Definition
Table Name/File Name
Column Name
Data Type
Data Length
Decimal Place
Valid Values
PK
FK
NULL Y/N
Element Required
Element Mandatory

The above aspects can be subdivided into the logical and physical arrangement of the data. Following are some of the tools used in the industry to profile data:

Informatica Data Explorer
Talend Open Profiler
Oracle Data Integrator
Datamartist
Datiris
Pervasive Data Profiler
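To get a feel for the idea without any of these tools, here is a tiny profiling sketch in R (my own illustration on a made-up data frame):

df <- data.frame(id = 1:5,
                 name = c("A", "B", NA, "D", "E"),
                 amount = c(10.5, 20, 30, NA, 50))

# one row per column: data type, NULL (NA) count and number of distinct values
data.frame(column    = names(df),
           data_type = sapply(df, function(x) class(x)[1]),
           nulls     = sapply(df, function(x) sum(is.na(x))),
           distinct  = sapply(df, function(x) length(unique(x))))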

Happy Working...

5.1.11

Food for Thought - Episode 1 (Answered)

Scenario: We are reading an Oracle table using the Oracle Enterprise stage and putting the data into a dataset.

Challenge: What would happen if the column names of that table contain characters which are not supported by DataStage? Would this job fail or run? If it fails, why? If it runs, what would the output be?

Give this challenge a shot. Reply.

4.1.11

DataStage server hang issue

There could be various reasons behind this issue; I will list a few of them here. In my case the IBM Information Server 8.1 set-up is on Windows Server 2003 SP2. It is an SMP set-up, and the metadata repository database is on the same server, running SQL Server 2005.

Now, a server hang issue can occur when:

1) The metadata repository database detects a deadlock condition and chooses the failing job as the deadlock victim.
2) Log maintenance is ignored.
3) Temp folders are not maintained periodically.

I will try to explain these three points in detail below:

1) Occurrence of a deadlock in the metadata repository database - I have seen this scenario in DataStage 8.1 with no fix packs installed. If you have the (later released) fix packs installed, you may not get this problem at all. IBM Information Server throws an exception like "[IBM][SQLServer JDBC Driver][SQLServer]Transaction (Process ID 59) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction." This exception makes the whole process unresponsive, which in turn hangs the server and requires a physical reboot. Deadlocks can always happen; we can only reduce their occurrence. The following two actions reduce the possibility of a deadlock situation:

a) Keep user interaction during transactions to a minimum. This applies mostly to SQL queries where user input is required.
b) Keep the number of records per batch small. For example, if you are using the Oracle Enterprise stage to load data, defining a smaller array size reduces the possibility of a deadlock during the job run.

2) Log maintenance is ignored - In DataStage 8.1, logs are stored in the metadata repository database. A table called "logging_xmetagen_LoggingEvent1466cb5f" holds all the logs in the XMETA database. If we ignore purging the logs periodically, this table grows very large and DMLs take longer to execute. Hence we should schedule log purging/clearing periodically; it can be scheduled from the Director client.

To schedule the auto purge, select a job and then go to the Job menu --> Clear Log --> choose the options from the pop-up window. You can set the logs to purge immediately after a job run, or set auto purge up to a number of previous runs or a particular date. It is better to set the auto purge with the "Older than (days)" setting.

One more thing to note is that the table "logging_xmetagen_LoggingEvent1466cb5f" keeps logs for all the projects on the server. Hence we should not simply fire a DML on this table, as this could impact other projects as well. To remove entries from this table we can use the following command, which creates a schedule on the server using the LoggingAdmin utility and deletes the logs.

This command deletes logs from the table "logging_xmetagen_LoggingEvent1466cb5f" irrespective of project, so be careful while executing it.

LoggingAdmin -user [DataStage Admin User] -password [DataStage Admin Password] -create -schedule -name "DS job log purge" -frequency -minutes 30 -threshold 10000 -percentage 100 -includeCategories IIS-DSTAGE-RUN

The above schedule can be deleted with the following command:

LoggingAdmin -user [DataStage Admin User] -password [DataStage Admin Password] -delete -schedule -name "DS job log purge"

To check the number of records in the "logging_xmetagen_LoggingEvent1466cb5f" table, the following query can be fired on the XMETA database:

select count(*) from xmeta.logging_xmetagen_LoggingEvent1466cb5f

Having a large number of logs in the repository database slows down the log purge activity, which in turn slows job performance.

In DataStage 8.1, job log storage changed from the file system to the database. If your job logs have grown very large and you need a quick performance gain, you can switch back to file system logging by changing the following two project-level parameters in the DSParams file on Windows:

RTLogging=1
ORLogging=0

After changing the above settings, logs will be stored in the file system and old logs can still be viewed in the Director client. Once we have cleared all the logs, we should revert the changes to the original settings. The DSParams file can be found at \IBM\InformationServer\Server\Projects\YOUR_PROJECT\

3) Folder maintenance - A few folders need maintenance in a DataStage Windows environment. The first is the tmp folder, which stores virtual datasets and other artefacts while jobs execute. Another is UVtemp; make sure this folder does not get full, as that can cause jobs to hang or slow down, and keep clearing it periodically. Last but not least is the &PH& folder; keep this folder clean as well. To remove entries from the &PH& folder, the "CLEAR.FILE &PH&" command can be executed from the Administrator client: go to the Administrator client --> click on the Projects tab --> select the appropriate project --> click on Command --> paste CLEAR.FILE &PH& into the command text box and click on the Execute button. Once it is cleared you will see a pop-up window with the message "File &PH& has been cleared".

If we follow the maintenance activities listed above, we can improve job performance as well as keep the server from going into a hung state. There are a few more tunings we can perform on the server itself to make it suitable for large parallel jobs in a Windows environment; I will write something on that in the next post.

Happy Working...

2.1.11

DataStage - The specified field: does not exist in the view adapted schema.

Most of the time, the error "The specified field: XXXXXX does not exist in the view adapted schema" occurs when we have missed mapping a field. Every stage used in the middle of a job has an Output tab; make sure you have mapped every single field required by the next stage.

Sometimes this error occurs even after mapping the fields, and one of the reasons could be that the view adapter has not linked the input and output fields. In that case the required field mapping should be dropped and recreated.

To give some insight: the view adapter is an operator responsible for mapping the input and output fields. DataStage creates an instance of APT_ViewAdapter which translates the components of the operator's input interface schema to matching components of the interface schema. So if the interface schema does not have the same columns as the operator's input interface schema, this error is reported.

Happy working...


31.12.10

Understanding DataStage

Posting some videos related to DataStage.

DataStage Installation


DataStage Parallel Processing Architecture Overview


Lookup Stage


Reading the Score


Data Elements


IBM InfoSphere DataStage

DataStage - Problem when running large parallel jobs on Windows

If you have large parallel jobs (8.1) running in a Windows environment, it is worth checking the server's event log; you may find an error related to Nutcracker.exe. An administrator with access rights on the server can see the event logs.

The Nutcracker error can come up while running very heavy jobs where total CPU utilization is at or above 90% and the server starts using virtual memory. This is just one example of the impact of running large parallel jobs on a Windows server. To make the system somewhat more robust, the following changes can be made in the Windows registry. The idea is to tune the TCP/IP buffer, increase the heap size and set the virtual memory to the recommended amount.

To tune the TCP/IP settings, go to the registry editor and check whether the MaxUserPort and TcpNumConnections keys are present under the TCP/IP parameters. Generally they will not be there. The TcpNumConnections setting affects the number of osh processes.

To change the settings:
Go to the registry editor.
Go to HKEY_LOCAL_MACHINE > SYSTEM > CurrentControlSet > Services > Tcpip > Parameters.
Now, in the right pane, add or modify the values:
Value name: MaxUserPort
Value type: DWORD
Value data: 65534
For more information on MaxUserPort, click here.

Value name: TcpNumConnections
Value type: DWORD
Value data: 65534
For more information on TcpNumConnections, click here.

To change the heap size:
Go to the registry editor to change the non-interactive desktop heap size.
Go to HKEY_LOCAL_MACHINE > SYSTEM > CurrentControlSet > Control > Session Manager > SubSystems.
Click on Windows in the right pane.
Edit the SharedSection parameter in the value box.
Change the last value from 512 to 1024.

Now, what is the desktop heap and why are we changing it?

In Windows, every desktop object you create has some amount of memory attached to it, which comes from what is known as the desktop heap. The desktop heap stores user interface objects, for example menus (the Start menu, say) and windows (any new window). When an application needs to create these kinds of interface objects, it needs space (in general terms; there is a defined process by which the area is allocated), and that space comes from the desktop heap. In normal scenarios nobody bothers about how much desktop heap has been configured on the system. Interestingly, if an application does not require any interface objects, it does not allocate area from the heap. The value we are changing here relates to the non-interactive desktop, which is used by services. So we are allowing a larger heap size in our environment so that it is not exhausted while running services. If you need more information on the Windows desktop heap, please click here.

Now the last bit is changing the virtual memory configuration.

Virtual memory can be configured by right-clicking on "My Computer" and going to the Performance section of the Advanced tab. Change the virtual memory to the recommended size, typically about 1.5 times the physical RAM; for example, if you have 8 GB of RAM, the recommended size would be 12 GB.

All the above settings can overcome issues related to large jobs, but the patches recommended for the Nutcracker issue should also be installed.

Hope this will help.

Happy working...

30.12.10

DataStage - Log purge using LoggingAdmin

Sometimes having too many logs stored in the metadata repository can be a reason behind a job's slow performance.

Periodic log clearance not only improves performance but also keeps the metadata database healthy. There are two ways to clear the job log:

1) By setting the Clear Log property in the Director client under the Job menu. There are options to choose from: you can purge the log just after the job run, or purge up to a number of previous runs or days, or up to a particular date. This approach is fairly simple and can be performed by a developer with the appropriate access rights. It is also project dependent.

2) By using LoggingAdmin. LoggingAdmin is a utility which can be executed on the server from the command prompt. The following command can be executed in order to purge the logs:

LoggingAdmin -user <<admin user>> -password <<admin password>> -create -schedule -name "DS Job log purge" -frequency -minutes 30 -threshold 10000 -percentage 100 -includeCategories IIS-DSTAGE-RUN

What is ETL?

ETL is a term related to data warehousing; the abbreviation stands for Extraction, Transformation and Load. Extraction is importing the data. Transformation is transforming or preparing the data into the required form, a form that is understandable to the business or the report reader. Load is the last stage, where whatever data has been imported and transformed is loaded into the data warehouse.

Extracting what and from where?

A data warehouse is a collection of data. Data can be collected from operational databases, from files or from third-party interfaces (a third-party interface can be anything that holds data needed for reporting or decision making, for example an MIS or a web survey, to name a few). So how does this data get to the place where it can be transformed? Normally a data warehouse design keeps a staging area where all the collected data can be stored. Bringing data from the different operational databases, files and third-party interfaces is known as extraction. The staging area is also used to keep all the collected data in one location and in a single format.

What is Transformation and transforming what?

Transformation is mainly changing the extracted data into a form that is understandable to decision makers. Transformation can be based on many business rules, so that the extracted data is polished and ready to load. How much transformation is needed depends on the source of the data. Some operational databases have very good checks and verification in place which do not allow users to input uncleansed data; but this is not the case everywhere. Some legacy databases, files and third-party interfaces have data which needs cleansing. This cleansing is itself a transformation, but transformation is not limited to cleansing; sometimes business logic is also part of it.

For example:
1) Recoding ID or flag values into the form YES or NO.
2) Loading a selected number of columns out of a given file.
3) Loading data from XML files into tabular form.
4) Calculating fields based on business logic.
5) Fetching descriptive fields from lookup (master) tables.
6) Grouping or aggregating values.
7) Ordering the output.
8) Reading complex files.
9) Preparing flat files for other interfaces.

All the above examples come under transformation of the data.
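A couple of these in a small R sketch (my own illustration with a made-up orders data frame; nothing tool-specific):

orders <- data.frame(id = c(1, 2, 3), qty = c(2, 5, 1),
                     price = c(10, 4, 25), active = c(1, 0, 1))

orders$active_flag <- ifelse(orders$active == 1, "YES", "NO")   # recode a flag into YES/NO
orders$amount      <- orders$qty * orders$price                 # calculate a field from business logic
aggregate(amount ~ active_flag, data = orders, FUN = sum)       # group and aggregate values

In SQL terms, the last line is roughly a GROUP BY with SUM.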

Now the last bit of the abbreviation: what is Load?

Load is the last stage in any ETL process, where the transformed data gets loaded into the data warehouse. Loading can be of different types, depending on the requirement: incremental, rebuilding from scratch, fortnightly or monthly uploads, or sometimes even once a year. It all depends on the requirement and the design of the warehouse.
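For the incremental flavour, a toy R sketch (my own illustration; warehouse and staging are made-up data frames keyed on id):

warehouse <- data.frame(id = c(1, 2), amount = c(100, 200))
staging   <- data.frame(id = c(2, 3), amount = c(200, 300))

new_rows  <- staging[!(staging$id %in% warehouse$id), ]   # keep only keys not loaded yet
warehouse <- rbind(warehouse, new_rows)                   # append just the delta
warehouse

In SQL terms this is an INSERT of the rows whose keys are NOT IN the target.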

More in the next post.

Happy working...

DataStage Project Name With Space

This is a common problem which can cause rework: remember that during installation we should not give a project a name containing spaces. If the name contains spaces, the project will not be usable.

Happy working...

29.12.10

New user unable to login to DataStage client

What I have seen with many new users is that they complain about not being able to log in to the DataStage client after setup. This could be an issue with the access rights on the UV_USERS file.

Ideally the user information should be stored automatically in the UV_USERS file. This file will not have any information in it if the particular user does not have access to it. Make sure a new user has access to the file and then ask the user to try logging in.

It is also worth checking whether the user credentials have been set up properly if a general user credential is not being used. User credentials can be checked by logging in to the DataStage Web Console --> user credentials.

28.12.10

DataStage Interview Questions - Part 1

Posting a few questions to help you prepare for a DataStage interview:

1) Explain the architecture of IBM Information Server.

2) What is the difference between SMP and MPP?

3) How do you check the current version of the DataStage application?

4) In what order does DataStage execute stage variables?

5) How is a constraint different from stage variables?

6) How do you perform a date difference in the Transformer stage?

7) What is a sparse lookup and how is it different from a normal lookup?

8) How does a sparse lookup impact performance?

9) What would your first three steps be when asked to optimize an existing job?

10) Consider a scenario where you have two different projects set up on a single server. Both projects are similar but differ in data. Can we use a dataset created in one project in the other? If not, why not?

11) Up to what record size can DataStage handle? What would happen if an Oracle table's record size is greater than the size supported by a DataStage record? How would you handle this?

12) How would you delete datasets in bulk?

13) What do you understand by an operator, and how is it associated with a stage?

14) When should we use the APT_ORACLE_LOAD_OPTION environment variable?

15) How is the load method different from upsert?

16) Where do datasets get stored in a DataStage setup? What is the significance of the scratch area?

17) What is a fork join?

18) How would you clear the locks from a job?

19) Can a DataStage job be deleted from the Director client?

20) How would you turn a warning into information in the Director client?

21) Can we connect to Lotus Notes from DataStage? Does IBM support this interface?

22) What can be set from the Tunables page of the Administrator client?

23) What is runtime column propagation?

24) Which is faster, a Transformer or a Filter?

25) How does the score change when you have a Transformer in the job?

26) Why do you need a C++ compiler to be installed on the server?

27) Can you run multiple copies of the same job at the same time? How would you differentiate the two copies?

28) How would you catch Oracle exceptions when using the Oracle Enterprise stage?

29) How would you kill a particular DataStage session?

30) For what kind of requirement would you use the Analyzer?

31) How would you perform impact analysis in DataStage?

32) What is the drawback of leaving a Peek stage permanently in a job design?

33) What is the difference between BuildOp and Wrapped stages?

34) Can a Column Generator stage be the first stage in a job design?

35) List the services which need to be started for successful use of DataStage.

36) How can you stop and start DataStage services in a Windows environment?

37) How can we export only the executable of a job?

38) Can duplicate records be removed just by using a Sort stage?

39) We have a job which processes 50 million records and takes 4 minutes to run. If we add a Copy stage in between, would it impact the performance of the job?

40) Can one DataStage project have multiple configuration files?

41) How would you release resource locks?

42) How would you turn a warning into an info?

43) What is a checkpoint with respect to sequences? How would you set this property for all sequences in a project in one shot?

44) Can the RCP property be set for all jobs?

45) Does a shared container support RCP?

More questions on the way...