Tuesday, January 5, 2016

Data Analytics: - Analytics Cycle I: Extract Transform and Load (ETL)

Analytics Cycle
Everything has a process and so does Data Analytics.  This process I would call it as Analytics Cycle.  The Analytics Cycle is a looping process, included 5 steps as below:

  1. Extract Transform and Load (ETL);
  2. Analytics Rules;
  3. Case Management;
  4. Analytics Review;
  5. Optimization;

And today I would discuss about step 1:- Extract, Transform and Load (ETL).  The ultimate goal here is to identify and consolidate the data into an analyzable format namely a Common Data Model.  For example, a people database contains people information where this could be applied for customer analytics, HR analytics, relationship analytics, etc; OR a census database consolidates the government census data; etc. 

After identifying the Common Data Model needed, we would then identify the usable data source and to Extract, Transform and Load this into the Common Data Model.  Lucky thing is that data storage is normally not an issue given that the physical size of digital storage are becoming smaller and smaller and is available to be purchased anywhere anytime today excluding any of the business budgeting concern (if any).  

Assuming the infrastructure is ready, most of the time the first conflict between data expert and business management are between the data structure optimization and the project timeline limitation given that the best effective analyzable data structure does need times by performing data normalization.  People usually have no patient in waiting for data cleansing in the reality and hoping in quick wins with data.  How to balance the budget, timeline and technical works would be about where the data adviser value is.    

On the other hand, most of the data gathering works are robotic indeed, such as data archive extraction, data crawling from internet, etc.  Thus leveraging the scripting technique, such as VBScript, VBA in MS office, etc. to automate the works and let the computer works for you would definitely be one of the good options to speed up the robotic process in a effective manner.  Also, another advantages is that automated process could run with proper audit trial which would help to keep track of the works and to provide proof of works completion assuming no critical bugs in your script. 

Finally, all we need to do is to load the data into the database engine and to ensure all the data loading are being reconciled and verified with the concept of control total and data hash whichever appropriate.  Also, the more importantly, we should always pay high attention on the data with the high risk data types, for example the Datetime format and the Text format with character length limit, etc.

“Garbage in, garbage out”, ETL is the fundamental of successful data analytics works. Without a good plan of ETL with proper analyzable data structure and a meaningful data set towards your analytics idea, it will most likely be wasting of time.

No comments:

Post a Comment