Saturday, October 11, 2008

Handling Upserts Using 'Slowly Changing Dimension' in SSIS

Many of us would have faced a scenario where we have to bring in records from a source and insert if records don't exist or update if exists (also referred to as Upserts).This can now easily be done using 'Slowly Changing Dimension' (SCD) in SSIS 2005/2008.

In my example, I have a dimension table called dimPerson in Datawarehouse DB - 'DWDB' and a source table called Person in OLTP database - 'OLPTSource'.

My aim is to transfer records from OLPTSource.Person to DWDB.dimPerson based on these rules:

- If PersonID exists in DWDB and if there is any difference/change in columns FName, MName, LName, Address1, Address2, City, State or Zip then update the row with changes.

- If PersonID exists in DWDB and Phone number differs then mark the record as outdated and insert a new record for the PersonID

- If PersonID doesn't exists in DWDB then insert record

For this example you will need to create OLTPSource and DWDB databases and run the following script:

http://docs.google.com/Doc?id=ddstdscf_0c653h9cm

This will do the following –

- Create the two tables Person & dimPerson

- Populate Person & load dimPerson with Person records

- Update & insert statements to Person records to understand how SCD
handles these changes to source after the initial load of dimPerson

Here’s the steps & screenshots for creating the SSIS package -

1. Create a Sequence Container to hold Data Flow task. Drag and drop
a Data Flow Task. Create two OLEDB connections :

OLE DB Connection for datawarehouse database – DWDB
OLE DB Connection for OLTP source database - OLTPSource



2. Go to Data Flow view and configure OLEDB data source to retrieve records from OLTPSource.Person table. Drag and drop ‘Slowly Changing Dimension’ item.




3. Double Click on SCD to bring up Wizard



4. Select Key Type as ‘Business Key’ for PersonID



5. Select Change Type as ‘Historical Attribute’ for PhoneNo and ‘Changing Attribute’ for the rest.



6. Don’t check any of the boxes in the below screen



7. Select the column ‘Active’ to indicate current and expired rows. Enter 1 for current and 0 for expired value. If PhoneNo changes for a personID, then a new record will be inserted with a ‘Active’ column value of ‘1’. Old record will get the ‘Active’ column value updated to ‘0’.



8. Click finish on the next screen and you will notice that the wizard has updated items on the data flow view



9. Excute the package. If you ran the whole script in http://docs.google.com/Doc?id=ddstdscf_0c653h9cm , then you will see a similar result as in screen below. There were two updates to PhoneNo column, that shows up as 2 rows in ‘Historical Attribute Inserts Output’. ‘New Output’ shows up as 1 row because of the one row inserted in the script. Finally, the ‘Change Attribute Updates Output’shows up as 1 because of the one update statement changing Zip and Addess1. Notice that the update statement for PhoneNo and Zip for PersonID =2didn’t change Zip for historical record (active =0). You can change this behaviour by setting UpdateChangingAttributeHistory to True.



Check below msdn link to find out what more on SCD:
http://msdn.microsoft.com/en-us/library/ms141715.aspx

4 comments:

INDIAN said...

Thank so much for the article, but when I open an existing SCD flow --> Edit, it starts again instead of showing previous settings...How to see previous settings on SCD transform in SSIS?

aSQLGuy said...

Thanks Indian for your comment. Sorry, I didn't see it earlier. I am assuming you are trying to view an old package.You can choose 'work offline' from SSIS menu of your VS 2008 environment.After this right click on SCD and choose 'Show Advanced Editor'.

Sarvanan said...

I have a scenario where there are duplicate records in source itself. In that case SCD inserts all the records in target without updating.

Here is the sample scenario:

My Source table structure is like

ID (Primary Key) , MGR_Id , MGR_NAME , MGR_DOJ

Values are :

1 1 XXX 2001-01-11
2 2 bbb 2009-02-23
3 3 ccc 2009-02-23
4 3 ddd 2009-02-23

I have to incrementaly load (Type 2) my Target :

Which contains cols like

ID , MGR_Id , MGR_NAME , MGR_DOJ

But while using SCD transformation for loading target, it inserts all the records in Target.

ie, after the load Target table look like

1 1 XXX 2001-01-11
2 2 bbb 2009-02-23
3 3 ccc 2009-02-23
4 3 ddd 2009-02-23

In SCD , I used ID as bussiness key and all the remaining coloumns as changing attribute

Could you please tell me why there is two records for MGR_ID = 3.

Please provide your inputs

Thanks in advance

ashuthinks said...

hi i also want to same thing please help me...
i have to take price from flat file and want to insert into table checking its company name ...