Datawarehousing & Business Intelligence: May 2012

Friday, May 18, 2012

How to remove duplicates without using "Remove Duplicates" stage in DataStage

Step 1:Use Copy stage and use hash partitioning

Step 2:mark option perform sort to get sorted records

Step3: mark option stable( to get first value as distinct value in case of duplicates)

Step 4:mark option unique to get DISTINCT records

Step 5:select key fields for sorting and partitioning usage.

Thursday, May 17, 2012

Sort and Remove Duplicates stages in DataStage

Step 1: Drag all the stages.here in example we have chosen a sequential file(.csv file to be specific) as the source and a complex flat file as the destination file

Input data

Step 2: the records in source are sorted in sort stage.Make sure to choose a key in order to sort the records based on the key field.In the example, the key field is chosen as shipdate.

Step 3: In the Remove duplicate stage also we need to specify the key (shipdate chosen in the example) based on which it will search for the duplicate record(s) and remove them.Hence the output will contain only disctinct records based on the key specified.

Output file will be

Hence we get distcinct records with each of them having unique shipdate value.

Tuesday, May 15, 2012

Using Column Export stage in DataStage

The Column Export stage is a restructure stage. It can have a single input link, a single output link and a single rejects link.

The Column Export stage exports data from a number of columns of different data types into a single column of data type ustring, string, or binary. It is the complementary stage to Column Import

Monday, May 14, 2012

Using Pivot in DataStage

Here the source is taken as an excel file(.csv extension) and the data contains

Name Qty1 Qty2 Qty3

abc 10 20 30

def 70 60 50

After pivot stage (as described in figure) the derivation done is as follows in output column

Derivation

Name Name

Qty Qty1 Qty2 Qty3

(please make sure the datatype should be same in derivation column)

hence,The destination (flat file in thi example) will be

Name Qty

abc 10

abc 20

abc 30

def 70

def 60

def 50

Using funnel and filter in DataStage

This is an example of parallel job in datastage

Description:-

Using two excel files with same structure (condition for funnel).the output of funneling stage is input for filter stage giving where clause as id=1 which is loaded into another excel file as output

what is the main difference between lookup ,join and merge in DataStage

LOOK UP STAGE:
* One output link and one reject link.

* Faster if the volume of data is less than the memory size.

* If the size is too huge then DS uses scrap files to store the
remaining data and that hinders the performance.

JOIN STAGE:
* Can have any no. of input links

* Does not use memory, look up is done on table level,
so would be better to use if the data volume is pretty huge.

* Doesn't have any reject link.

* Only one output link.

MERGE STAGE:
*Can have any number of input link and update

links.

* Need to have a key column.

* Can have any number of reject link.

* Only one output link.