Sunday, March 16, 2008

My Analytics Experience...

I have worked on three fresh implementations (two major and one medium) and maintained two medium-sized Analytics applications. My first three implementations had just Siebel CRM as the data source and covered Retail, Consumer Product Goods, Trade Funds and Service Analytics. My last implementation, for a financial services firm, was well ahead of its time: we built real-time reporting, deployed ETLs that ran every 30 minutes and dealt with massive data volumes. Read-many, write-many data warehouses are being talked about now, but this was designed and delivered a year ago in that implementation. RTD systems are a headline feature of today's BI tools, yet we delivered a custom real-time report that refreshes every 3 seconds a year back.

The last implementation was more about Contact Center reporting and less about the Service Analytics modules. We had 10 data sources to extract from, and ETLs designed to run every 30 minutes, hourly, daily, weekly, monthly and yearly, so we covered just about every frequency one can think of. Some may ask: why would anyone need to run ETLs every 30 minutes? In a contact center, daily operations depend heavily on the latest data being available to team leaders and managers. The telephony systems are proprietary and cannot be used directly as data sources in a reporting tool like Analytics, so the only option left is to build ETLs that run every 30 minutes. There is a lot to consider when designing ETLs to run at such short intervals: aggregates need careful thought, and multiple options have to be explored to reduce the time spent aggregating data.

I have been using custom incremental aggregation and it has been working great. I join the fact and fact staging tables to identify the DISTINCT values of the columns used in the GROUP BY clause of the aggregate query, and load them into a custom change capture table. A PL/SQL procedure then deletes records from the aggregate table by joining the aggregate table to the change capture table. Finally, the aggregate query is modified to join with the change capture table so that only the changed keys are re-aggregated. I call these steps incremental aggregation.
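To make the three steps concrete, here is a minimal sketch using hypothetical revenue tables (W_REVN_F for the fact, W_REVN_FS for the staging table, W_REVN_A for the aggregate and W_REVN_CHG for the change capture table) and a simple day/product grain; the actual table names, keys and measures will differ in your warehouse:

    -- Step 1: capture the distinct GROUP BY keys touched by the current load
    -- by joining the fact table to its staging table (all names hypothetical)
    INSERT INTO w_revn_chg (day_wid, prod_wid)
    SELECT DISTINCT f.day_wid, f.prod_wid
      FROM w_revn_f  f,
           w_revn_fs fs
     WHERE f.integration_id = fs.integration_id;

    -- Step 2: delete the affected rows from the aggregate table
    -- (in the actual implementation this ran from a PL/SQL procedure)
    DELETE FROM w_revn_a a
     WHERE EXISTS (SELECT 1
                     FROM w_revn_chg c
                    WHERE c.day_wid  = a.day_wid
                      AND c.prod_wid = a.prod_wid);

    -- Step 3: re-aggregate only the changed keys
    INSERT INTO w_revn_a (day_wid, prod_wid, revn_amt)
    SELECT f.day_wid, f.prod_wid, SUM(f.revn_amt)
      FROM w_revn_f   f,
           w_revn_chg c
     WHERE f.day_wid  = c.day_wid
       AND f.prod_wid = c.prod_wid
     GROUP BY f.day_wid, f.prod_wid;

In this sketch the change capture table is truncated before every run, so it only ever carries the keys for the current load.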


OOTB ETLs use a lookup on the target table to determine whether a record needs to be inserted or updated. The lookup query should be changed to join the fact table with the staging table, and the staging table should have an index on the join columns. In addition, if the data volume is expected to be huge, it is also advisable to comment out the ORDER BY clause that Informatica generates automatically.
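As an illustration, here is a hedged sketch of an overridden lookup SQL and the supporting staging index, using a hypothetical service request fact (W_SRVREQ_F) and its staging table (W_SRVREQ_FS); the real lookup ports and keys depend on the mapping:

    -- Hypothetical lookup SQL override: only look up target rows that have a
    -- matching row in the current staging load, instead of the whole fact table
    SELECT f.row_wid,
           f.integration_id,
           f.datasource_num_id
      FROM w_srvreq_f  f,
           w_srvreq_fs fs
     WHERE f.integration_id    = fs.integration_id
       AND f.datasource_num_id = fs.datasource_num_id
     ORDER BY f.integration_id, f.datasource_num_id --

    -- The trailing "--" comments out the ORDER BY that Informatica appends
    -- to the override, so the database does not sort the result twice.

    -- Supporting index on the staging join columns (name is hypothetical)
    CREATE INDEX w_srvreq_fs_n1
        ON w_srvreq_fs (integration_id, datasource_num_id);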


These are not the only steps, of course, but they are ones that not many people implement. There are plenty of other techniques that most people do apply: index hints in the Source Qualifier and lookup queries, analyzing tables and indexes weekly, avoiding bitmap indexes and so on.
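For reference, two of those look roughly like the sketch below; the index, schema and table names are hypothetical:

    -- Hypothetical Source Qualifier override with an index hint on the staging table
    SELECT /*+ INDEX(fs w_srvreq_fs_n1) */
           fs.integration_id,
           fs.datasource_num_id,
           fs.srvreq_num
      FROM w_srvreq_fs fs
     WHERE fs.datasource_num_id = 2;

    -- Weekly job to analyze a table and its indexes (schema name is hypothetical)
    BEGIN
       DBMS_STATS.GATHER_TABLE_STATS(
          ownname => 'OLAP',
          tabname => 'W_SRVREQ_F',
          cascade => TRUE);   -- CASCADE => TRUE also gathers index statistics
    END;
    /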

One more thing I implemented in my last project that really helped is a custom ETL run-times subject area in Analytics. We created an ETL that pulls the session name, start time, end time and number of rows, and loads the data into a table configured as a fact table. We created another, dimension table holding the session name and its data source. We then built several reports in Analytics to trend data volumes and run times, so we could focus our tuning efforts on the ETLs that needed it most.
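Roughly, the two tables behind that subject area could look like the sketch below (names and column sizes are hypothetical; the fact joins to the dimension on the session key):

    -- Hypothetical run-time fact table loaded by the custom ETL
    CREATE TABLE wc_etl_run_f (
       session_wid  NUMBER,        -- foreign key to WC_ETL_SESSION_D
       start_time   DATE,
       end_time     DATE,
       num_rows     NUMBER
    );

    -- Hypothetical session dimension: session name and its data source
    CREATE TABLE wc_etl_session_d (
       session_wid   NUMBER PRIMARY KEY,
       session_name  VARCHAR2(240),
       datasource    VARCHAR2(100)
    );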

E-mail me at changa.reddy@gmail.com if you need any further information on any of the techniques.
