Tuesday, December 6, 2016

Compliance framework vs. RegTech with Big Data

Around six months ago, I was invited to join a panel discussion at an IT security conference.  The topic was security innovation, and my focus was mainly on AML transaction monitoring, as at that time I was still working in this area for a global bank.  In recent years, people have started to discuss Regulatory Technology: - RegTech.  The term "RegTech" was introduced by the Financial Conduct Authority (FCA), which defines it as follows: "RegTech is a sub-set of FinTech that focuses on technologies that may facilitate the delivery of regulatory requirements more efficiently and effectively than existing capabilities".  I believe RegTech is not just the familiar technology already heavily used in compliance transaction monitoring or name screening, but an innovative, high-tech way of making compliance more effective and efficient; in other words, a real revolution of today's entire compliance framework.

Compliance today is largely driven by regulatory enquiries, with the result that regulators keep increasing their scrutiny and, in turn, their fines.  I think the major reason is that regulators simply do not know what the industry will do next or where it is heading.  The business model is ever-changing, from ATM and phone banking in the early stages, to e-Banking in the mid 90s, to the recent FinTech era.  Over the last several years, regulated firms have had to deal with an increasing number of diverse, uncoordinated and ever-changing regulations across different locations.  To stem the tide, moving towards a more holistic regulatory response mechanism might be the only way out, and RegTech innovation with Big Data would be the key.

Put into practice, I believe RegTech is more about a workflow, and one way forward is to combine different solutions already available in the market into the compliance working cycle.  For example, we could combine artificial intelligence with Big Data detection scenarios and audit sampling approaches, such as pairing Assisted Review in e-Discovery with the Cumulative Monetary Amount (CMA) sampling method on top of the existing detection scenarios.  I am keen to see RegTech create a revolution in today's compliance that facilitates the delivery of regulatory requirements and keeps improving the efficiency and effectiveness of existing capabilities.  There is always a better solution to make our world better and our lives easier, and RegTech would definitely be one of them; I will share more of my experience in a future post.  However, the very first road-block might be today's compliance management: who dares to take this little step forward and overcome today's problematic compliance approaches?
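To make the CMA idea a bit more concrete, below is a minimal Python sketch of how a Cumulative Monetary Amount (monetary unit) selection could be run over a set of alerted transactions.  The list-of-dicts layout, the "amount" field and the fixed sample size are purely illustrative assumptions, not a description of any real monitoring platform.

```python
import random

def cma_sample(transactions, sample_size, seed=42):
    """Select transactions by Cumulative Monetary Amount (monetary unit) sampling.

    Every monetary unit has an equal chance of selection, so large-value
    transactions are more likely to be picked -- the usual audit rationale.
    `transactions` is a list of dicts with an 'amount' key (an assumption for
    illustration; a real system would read from the monitoring platform).
    """
    total = sum(t["amount"] for t in transactions)
    interval = total / sample_size              # sampling interval in monetary units
    random.seed(seed)
    start = random.uniform(0, interval)         # random start within the first interval
    targets = [start + i * interval for i in range(sample_size)]

    selected, cumulative, idx = [], 0.0, 0
    for t in transactions:
        cumulative += t["amount"]
        # pick this transaction for every target that falls inside its cumulative range
        while idx < len(targets) and targets[idx] <= cumulative:
            if t not in selected:
                selected.append(t)
            idx += 1
    return selected

# Illustrative usage with made-up alert data
alerts = [{"id": i, "amount": a} for i, a in enumerate([120.0, 90000.0, 450.0, 3200.0, 780000.0, 56.0])]
for t in cma_sample(alerts, sample_size=3):
    print(t["id"], t["amount"])
```

Note that a single very large transaction can absorb several sampling points, which is the intended behaviour of monetary unit sampling.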

If you are also interested in this area, please feel free to drop me a note.  I am always happy to discuss and work together with you to take this little step forward.

Saturday, November 26, 2016

Data Analytics: - Are you ready for the Big Data era?


Many organizations today are interested in leveraging their data for decision making, exploring new opportunities and optimizing workflows.  However, as management, do you know whether your organization is ready for this?  Or have you already started your Big Data analytics plan, only to find it not as effective as expected?  As a data consultant, the following are three of the most common road-blocks I have seen when applying data analytics in reality: -

1.      Too many technologies exist

New technologies appear every day.  It is normal for companies to run multiple platforms side by side, including proper database-driven platforms and tailor-made applications, and some of these overlap in both functionality and the information they contain.  This creates data integrity issues and becomes a real headache when data scientists try to gather analyzable data, ultimately producing false analytics insights and conclusions.  To resolve this, my suggestion is to have an ETL data expert work together with business management and the subject matter experts to consolidate the data into a proper, extendable and analyzable format before rolling out any Big Data analytics plan.  Put another way, data completeness and readiness should always be the very first thing to consider, confirm and establish before moving into any real analytics battles.
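As a small illustration of that consolidation step, here is a minimal pandas sketch that outer-joins two hypothetical customer extracts and flags the records where the platforms disagree.  The table names, columns and values are made up for illustration only.

```python
import pandas as pd

# Hypothetical extracts from two overlapping platforms; the column names are
# illustrative assumptions, not a real system's schema.
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["a@x.com", "b@x.com", "c@x.com"],
})
billing = pd.DataFrame({
    "customer_id": [2, 3, 4],
    "email": ["b@x.com", "c-old@x.com", "d@x.com"],
})

# Outer-join on the shared key so no record is silently dropped
merged = crm.merge(billing, on="customer_id", how="outer",
                   suffixes=("_crm", "_billing"))

# Flag records where the two platforms disagree -- these are the data
# integrity issues to resolve before any analytics work starts
conflicts = merged[
    merged["email_crm"].notna()
    & merged["email_billing"].notna()
    & (merged["email_crm"] != merged["email_billing"])
]
print(conflicts[["customer_id", "email_crm", "email_billing"]])
```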

2.      Lack of business overview

This usually happens in large organizations, especially those running a Target Operating Model (TOM).  Put another way, managers in an organization today are all talented and experienced subject matter experts, but they may be too focused on their own responsibilities: Procurement does not understand Manufacturing and Manufacturing, in turn, is no friend of Sales or Logistics.  The fact is that, in most cases, no one knows the true picture of the business from top to bottom and inside out.  Although this is understandable, since it is difficult for anyone to understand everything, it still has a significant impact on the accuracy and effectiveness of the analytics results.  To resolve this, I believe a proper business plan and a very clear business goal for the data analytics project are essential for both the internal subject matter experts and the data scientists to identify the right analytics direction and approach.  Without a deep understanding of the business, there is no way any analytics project can succeed.  And that is why a successful analytics deployment always requires the business and the data scientists to work closely together, while most of the time I see these two parties in conflict.

3.      Lack of data compliance policy

People tend to use the easiest and most effective way to perform their jobs.  However, what counts as easiest and most effective is usually subjective.  For example, some people like to keep track of data in Excel, while others prefer a giant database or simple hard-copy documents.  In my view all of these are fine under normal circumstances, as long as they help drive business growth.  However, when something unexpected happens, for example a regulator-requested look-back review, which happens quite often in the banking industry these days, issues arise.  To avoid this, I believe it is never too late to deploy a proper data compliance policy within an organization.  We must ensure that the policy does not try to change or restrict how each manager runs their team, but instead ensures that all the information and processes are tracked properly in one place.  Otherwise, a policy that is deployed but that no one follows is completely useless, which I have seen in many large organizations.  This also demonstrates why not only the business team and the data scientists, but also the experts who execute the deployment, are important.


In conclusion, if you believe in data and would like to deploy data analytics in your organization, it is not only about the data experts, but also about your business team and the many other parties involved.  Big Data is not magic that only brings advantages; without a proper plan and understanding, Big Data might only create a nightmare for your organization.  If you are interested in learning or discussing more about your Big Data plan, please feel free to drop me a note and I am always happy to discuss and help.

Thursday, April 21, 2016

Data Analytics: - The Common Two-Step Approach to Identify Data Outliers

Regardless of all the fancy data-related wordings, one of the most common goals for any data expert will always be identifying data outliers.  There are plenty of ways to identify outliers by leveraging both internal and external data, and today I am going to share some of the most common ways to achieve this goal.

In statistics, a data outlier is a data point that stands apart from the main data set.  An outlier may sometimes be a data error but, in most circumstances, it is something we should be aware of and take special notice of.  The following is a common two-step approach, from both the quantitative and qualitative perspectives, to identify data outliers: -

1.      Statistical approach

This is the quantitative part of the approach: we can leverage common statistical concepts to identify data outliers, including, but not limited to, the mean and standard deviation, regardless of the actual industry knowledge.  This could also extend to more complex statistical models, such as probability-based ones, but today I will just discuss one of the most common ideas: identifying the data points that fall outside the mean plus or minus X times the standard deviation.

Example: customers' historical spending amounts

Assume all customers' spending power is the same and we are launching a new membership scheme.  We can calculate the mean plus and minus two standard deviations of the spending amounts as threshold 1 and threshold 2 respectively, so that (a minimal code sketch follows the list below): -
  • Whoever has a total spending above threshold 1 might be considered among the most valuable customers, and the top-level premium VIP card should be given, aiming to retain them and show our appreciation;
  • Whoever falls between threshold 1 and threshold 2 might be considered a normal customer, and a normal membership should be given to encourage them to spend more and reach the top-level premium VIP class; and
  • For the rest, below threshold 2, no membership might be granted, but we might offer something like a points collection scheme on a spending basis to encourage this group of customers to spend more, become members and earn the subsequent benefits.
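Here is the sketch referred to above: a short Python illustration of computing the two thresholds and assigning the tiers described in the bullets.  The spending figures and customer IDs are made up, and in practice you would check the distribution before relying on mean ± 2 standard deviations.

```python
import statistics

def classify_customers(spending):
    """Tier customers by mean +/- 2 standard deviations of total spending.

    `spending` maps customer id -> total spend (an illustrative data layout).
    Mirrors the example above: above mean + 2*SD -> top VIP, between the two
    thresholds -> normal membership, below mean - 2*SD -> points scheme only.
    """
    amounts = list(spending.values())
    mean = statistics.mean(amounts)
    sd = statistics.stdev(amounts)
    upper, lower = mean + 2 * sd, mean - 2 * sd   # threshold 1 and threshold 2

    tiers = {}
    for cust, amount in spending.items():
        if amount > upper:
            tiers[cust] = "top premium VIP"
        elif amount >= lower:
            tiers[cust] = "normal membership"
        else:
            tiers[cust] = "points scheme only"
    return tiers

# Made-up spending figures purely for illustration: 19 ordinary spenders
# plus one heavy spender who ends up above threshold 1
spending = {f"C{i:02d}": 1000 + 50 * i for i in range(19)}
spending["C19"] = 100_000
print(classify_customers(spending))
```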
2.      Detection scenario logic

Depending on the business nature, a qualitative approach using detection scenario logic with different available data elements can be applied, such as age, gender, profession, etc.

Example: customer data, following up on the example above

Assume the target customers are in the high-income group, which can be characterized by age group and profession; we can then further categorize the customers in addition to the spending amounts from the previous example.  One point to note is that, in order to identify the right scenario logic, we might consider leveraging both internal and external data sources.  For example, government census data can tell us which professions and which age groups tend to be high-income in this case.  After that, we might come up with logic like the following (a sketch of these rules follows the list): -
  • Top premium VIP might only be granted to those with high spending amounts, age above 40, professions in certain industries, etc., as the current customer base; and
  • A 2nd grade premium VIP class goes to those with normal spending amounts, age between 20 and 40 and professions in certain industries, etc., who will potentially be the successors of the current top-level group and should be treated as a highly potential customer base.
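And here is the sketch of those rules: a small Python function that combines the spending tier from the previous example with age and profession.  The profession list and field names are hypothetical placeholders for whatever the census data actually suggests.

```python
HIGH_INCOME_PROFESSIONS = {"doctor", "lawyer", "pilot"}   # illustrative list, e.g. derived from census data

def assign_vip_grade(customer):
    """Combine the quantitative spending tier with qualitative attributes.

    `customer` is a dict with 'spending_tier', 'age' and 'profession' keys --
    an assumed layout for illustration, following the example rules above.
    """
    in_target_profession = customer["profession"] in HIGH_INCOME_PROFESSIONS
    if customer["spending_tier"] == "high" and customer["age"] > 40 and in_target_profession:
        return "top premium VIP"
    if customer["spending_tier"] == "normal" and 20 <= customer["age"] <= 40 and in_target_profession:
        return "2nd grade premium VIP"
    return "standard membership"

print(assign_vip_grade({"spending_tier": "high", "age": 52, "profession": "doctor"}))
print(assign_vip_grade({"spending_tier": "normal", "age": 29, "profession": "lawyer"}))
```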
In conclusion, there are plenty of statistical methodologies and scenarios we can apply in our daily operations, depending on the availability of data sources and the data analysts' experience.  If you are interested in knowing more or would like to discuss this topic further, please feel free to reach out to me; I always appreciate the chance to learn from all of you.

Saturday, February 27, 2016

Insights from Alan Turing – Father of artificial intelligence

Alan Turing (1912-1954) was a pioneering mathematician, also known as the father of computing and artificial intelligence, who was granted a Royal pardon by the Queen for his contribution as a code-breaker in World War Two.  One of the most famous questions associated with Alan Turing is "Can machines think?"

Just a heads-up in advance: I am not going to write about who Alan Turing was, how he broke the Enigma ciphers or how the Turing machine works, and what follows in this post is, in a way, not really about Alan Turing at all.  The topic I would like to discuss a little here is the question mentioned in the last paragraph, whether a machine can think and, to a certain extent, "Can an analytics platform think?"

Many people ask whether Data Analytics is believable, which I have touched on in an earlier post.  My first question back is: would people consider getting "something" to help with decision making, whether or not they believe that "something" can give a reasonable, thinkable and questionable suggestion, produced by an acceptable analyzing or thinking process?

A data analytics platform normally works with various statistical and mathematical models derived and programmed by data experts, and then produces advice accordingly.  The question is: should we consider this advice-generating process as "machine thinking"?

One view is that the data experts have in fact only programmed the rules into the machine's memory, and the machine then works by itself to produce the results once we ask it to "think" by pressing the start button.  As such, should we consider what the data experts did as teaching rather than merely running the machine?  Alan Turing predicted that "machine learning" would play an important role in building powerful machines that could be taught, and I would say that every one of us working in the IT field is, in some circumstances, teaching the machine.

On the other hand, if the machine can only produce pre-programmed advisory options and nothing out of the box, is that still "thinking"?  But then, are we humans not also just producing our personal thoughts based on what our peers, such as teachers and friends, or our own experiences have taught us, or pre-programmed into our brains?

“Sometimes it is the people no one can imagine anything of who do the things no one can imagine.”
-          Alan Turing

“We can only see a short distance ahead, but we can see plenty there that needs to be done.”
-          Alan Turing
I will leave all the above questions open for discussion, and I hope to see what happens over the rest of my life.  However, there is no doubt in my mind that Data Analytics is one of the most promising ways to create a better world tomorrow, for example what I do in my own job: leveraging Big Data to fight against money laundering.

Friday, February 19, 2016

Computer Forensic: - Forensic Workflow III & IV – Reporting & Testifying as an Expert Witness


As I have mentioned before, Computer Forensics is mainly about storytelling: presenting the facts to facilitate the investigative work and the judgement of the case.  Reporting is therefore one of the most critical areas, and it demonstrates an examiner's seniority just as much as their analysis skills do.  A computer forensic report is usually litigious and is likely to be distributed to both technical and non-technical parties.  As such, accurately presenting the facts in a human-readable way, without bias, is always the key to writing a good report.  The following are some noticeable requirements and concepts based on my experience as a computer forensic examiner.

1.      Reporting purpose

The ultimate objective of reporting is to present the facts that address the technical questions.  This must be done in an understandable, human-readable manner.  Jargon must be carefully identified, assuming that readers have zero computer knowledge, especially if the report is going to be used in litigation, where the readers are likely to be non-technical individuals such as attorneys, the judge and the jury.  Besides, since the report may be the only opportunity to present the facts found in the investigation, it must encompass the whole of any testimony in detail for the trier of fact; otherwise, misrepresenting any of the findings may induce serious financial and legal consequences.

2.      Report structure and style

Ideally, every examiner report should be capable of standing on its own and should provide clear and accurate information that allows anyone who reads it to reach the same conclusions.  Terms such as "many", "significantly", "highly", etc., which are subjective and open to multiple interpretations, must be avoided.  Industry-accepted references should be used whenever possible to substantiate the statements and content presented.  Also, every single page should carry a unique identifier including the report title, date of issue and the examiner's basic information or company name for reference purposes.  More importantly, the examiner's background should be clearly stated at the beginning of the report.  The following are the sections typically included in an examiner report:-

·         Cover page
·         Executive summary
·         Examiner profile
·         Introduction / Background of the case
·         Scope of work
·         List of supporting documents
·         Observations and analyses conducted
·         Examiner’s log
·         Chain-of-custody records
·         Photographs / reference materials
·         Disclaimers
·         Signature

3.      Quality assurance

When the issues are complex, mistakes and errors may always be present no matter how careful the examiner is.  As such, I would suggest peer review as one of the most effective and essential ways to catch them.  A peer review should be conducted by someone at the same level as you, or more senior, in terms of experience, and I would suggest inviting at least two peer reviewers.  It is not only a general review of grammatical errors or the phrases and wording used, but also a quality check on any of the assumptions and analyses made in the report.

The above gives only a basic idea of what a forensic examiner's report looks like.  In conclusion, this is the end of the Computer Forensic Workflow overview.  In future computer forensic posts, I will try to share some real-life examples.  I hope you have found this useful, and I am always happy to discuss if you are interested.

Previous Step

Thursday, February 11, 2016

Computer Forensic: - Forensic Workflow II – Forensic Analysis


Following on from data acquisition, the next step is to conduct the actual forensic analysis.  There are numerous analyses available, and the most common quick ones are shared below.

1.      Deletion Analysis

This is one of the most common analyses, required in almost every kind of case.  We can normally achieve it easily by leveraging the forensic software's functionality.  Depending on the custodian's OS version, the type of storage device and the forensic software, the high-level results, such as the number of files recovered, can differ.  Deletion analysis may also be unavailable in some situations, such as SSDs, Linux, etc.  On the other hand, deletion analysis is also available in mobile forensics, subject to the level of data access available to the examiner and the mobile device model.

2.      Signature Analysis

One of the most common ways to hide data files from scanning is to alter the file extension, for example disguising an Excel file as a text file by changing the extension from xlsx to txt.  This can affect file extraction (if it relies on file type) and the subsequent keyword search in e-Discovery or any other downstream forensic data review.  However, the extension is not the only way to identify the file type: each file has a header telling the system what type of file it is.  Signature analysis confirms whether the file header, or signature, matches the extension and identifies the file's potential real identity.
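As a rough illustration of the idea (not how any particular forensic suite implements it), the following Python sketch compares a file's magic bytes against its extension using a deliberately tiny signature table.

```python
from pathlib import Path

# Known magic bytes for a few common types; a real tool ships a much larger table.
SIGNATURES = {
    b"PK\x03\x04": {".xlsx", ".docx", ".pptx", ".zip"},   # OOXML / ZIP containers
    b"%PDF": {".pdf"},
    b"\x89PNG\r\n\x1a\n": {".png"},
    b"\xff\xd8\xff": {".jpg", ".jpeg"},
}

def check_signature(path):
    """Compare a file's magic bytes against its extension.

    Returns None if the header is not in the (deliberately small) table,
    otherwise True/False for whether header and extension agree.
    """
    with open(path, "rb") as handle:
        header = handle.read(8)
    ext = Path(path).suffix.lower()
    for magic, extensions in SIGNATURES.items():
        if header.startswith(magic):
            return ext in extensions
    return None   # unknown signature -- a fuller table would be needed to decide

# Example: an .xlsx renamed to .txt still starts with PK\x03\x04 and would be flagged
# print(check_signature("suspicious.txt"))
```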

3.      Hash Analysis

Files might be duplicated for backup purposes in general computer usage, or be known to carry no risk because they are in fact system files.  To identify these, cryptographic hash functions can help.  According to Wikipedia, "a cryptographic hash function is a hash function which is considered practically impossible to invert, that is, to recreate the input data from its hash value alone."  MD5 is one of the most commonly used hash functions for data integrity verification.  If two files have the same hash value, it is generally accepted that they are identical in content.  For the zero-risk files, we can leverage the National Software Reference Library (NSRL), which provides a Reference Data Set (RDS) of files from most known and traceable software applications.  By comparing the hashes against each other and against the NSRL list, the review population can be reduced effectively.
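A minimal Python sketch of the same idea is below: it hashes the files under a folder with MD5, drops anything matching a known-hash set (standing in for the NSRL RDS, which is not loaded here) and groups the rest so duplicates only need to be reviewed once.  The paths and the known-hash set are illustrative.

```python
import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def hash_inventory(root, known_hashes=frozenset()):
    """Group files under `root` by MD5 and drop those matching a known-file set.

    `known_hashes` stands in for an NSRL-style reference set; duplicates share
    a hash value and therefore only need to be reviewed once.
    """
    groups = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = md5_of(path)
            if digest in known_hashes:
                continue                      # known, zero-risk file -- exclude from review
            groups.setdefault(digest, []).append(str(path))
    return groups

# for digest, paths in hash_inventory("/evidence/export").items():
#     if len(paths) > 1:
#         print("duplicates:", paths)
```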

4.      Keyword search

There are a number of ways to perform analytics on the acquired data, and keyword search is probably the most common.  The basic idea is similar to searching in Google: enter the keywords and review the search results accordingly.  There are plenty of ways to run a keyword search, such as running it inside the forensic software, or extracting the files and using Windows search.  The most effective, traceable and auditable way is to load the in-scope data into an e-Discovery platform for search and review.  When loading data for search and filtering, note that not all data has to be loaded: there are usually advanced data analytics and filtering steps, such as filtering by file type or date, or applying analytics on user deletion activities, to trim down the data size before loading and then running the keyword search to identify the high-risk population for review.
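Purely to show the basic idea of scoping a review population by keywords (a real e-Discovery platform does far more, including indexing, encodings and near-duplicate handling), here is a naive Python sketch over already-extracted text files; the folder path and keyword list are assumptions.

```python
from pathlib import Path

KEYWORDS = ["invoice", "offshore", "urgent transfer"]   # illustrative search terms

def keyword_hits(root, keywords):
    """Naive keyword search over already-extracted plain-text files.

    Returns a mapping of file path -> list of matched keywords, i.e. the
    candidate review population for those terms.
    """
    hits = {}
    for path in Path(root).rglob("*.txt"):
        text = path.read_text(errors="ignore").lower()
        matched = [kw for kw in keywords if kw.lower() in text]
        if matched:
            hits[str(path)] = matched
    return hits

# for path, matched in keyword_hits("/evidence/extracted_text", KEYWORDS).items():
#     print(path, matched)
```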
 
Please note that the above is only a quick overview of the most common tasks for general investigation purposes.  In fact there are thousands more analyses available for deep-dive investigations.  I will share more on this in the near future with some real-life examples.

Previous Step | Next Step

Thursday, January 21, 2016

Computer Forensic: - Forensic Workflow I - Data Acquisition And Preservation

Having focused on Data Analytics in the previous posts, it is time for Computer Forensics.  As I said earlier, Computer Forensics is mainly about storytelling: presenting the facts to facilitate the investigative work.  As such, in-depth technical knowledge of hardware and software, as well as proper presentation skills, are always essential.

Back in 2006, when I first jumped into this industry as a law enforcement officer, almost all cases were about analyzing hard disks.  However, there is no doubt that Computer Forensics is becoming more and more complicated, and in recent years people have started to call it Digital Forensics.  The fact is that the types of digital devices keep multiplying: smartphones, tablets, etc.  Having said that, the computer forensic workflow still works pretty much the same way, as below:-

1.      Data Acquisition And Preservation
2.      Forensic Analysis
3.      Reporting
4.      Testifying as an Expert Witness

The first step is to obtain the related data and preserve it with an auditable process and proper chain-of-custody maintenance, regardless of the target device type.  A sound forensic process is required to ensure that nothing is altered during the acquisition, using a proper forensic kit with an industry-accepted verification algorithm such as an MD5 or SHA hash.  The preferred acquisition method is a full data clone (also known as a bit-by-bit copy) through a write-blocker connection, to ensure that an identical copy is obtained and there is no data integrity concern, preserving the data in a non-alterable format.
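A small Python sketch of the verification part of this step is below: it re-hashes an image file and compares the result against the MD5 and SHA-1 values recorded at acquisition time.  The file name and recorded values are placeholders; the acquisition itself is done by the forensic kit, not by this script.

```python
import hashlib

def verify_image(image_path, recorded_md5, recorded_sha1, chunk_size=1 << 20):
    """Re-hash a forensic image and compare against the acquisition record.

    The recorded values would come from the imaging tool's log or the
    chain-of-custody form; this only shows the verification step.
    """
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(image_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest() == recorded_md5.lower() and sha1.hexdigest() == recorded_sha1.lower()

# Hypothetical usage with placeholder values from the acquisition log
# print(verify_image("evidence_001.dd", recorded_md5="<md5 from log>", recorded_sha1="<sha1 from log>"))
```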

However, subject to technical limitations, sometimes we can only acquire the logical data files, or can only perform a drag-and-drop copy, for example when acquiring a server's email data or dealing with an old hard disk with serious bad-sector issues.  Worst come to worst, in some rare circumstances the acquisition can only be vouched for by the examiner's personal integrity, though I would say that nothing is impossible when it comes to technology.

Throughout the data acquisition process, one Master copy and one Backup copy are produced, and sometimes an additional Working copy, depending on the case nature.  In most circumstances, the flow is to collect the custodian's devices, image the data and then return the devices.  With this approach, once the devices are returned and the custodian starts using them again, the source data is altered and the exact image can never be reproduced.  Therefore, a backup is essential, and all analysis should be performed on the backup or the working copy.  The master copy should only be used to create a new backup copy if it becomes the only workable copy left.

On top of the above data acquisition process, I have experienced cases where, due to sensitivity concerns, the original media also had to be seized and only a cloned copy was returned for the custodian's continued use.  The main disadvantage of returning only a cloned copy is the extra cost, but I believe this is the best and most secure process I have ever performed.

In my forensic life, I have worked with plenty of tools that allow me to perform forensic imaging, such as, but not limited to, EnCase, FTK Imager, Paladin, Helix, etc. for hard disk data acquisition; Oxygen Forensic, XRY, Cellebrite, etc. for mobile forensics; and MacQuisition for macOS data acquisition.  My main comment on these tools is that most of them are similar to one another: 90% of data acquisition work is pretty straightforward and these tools perform very well, but the remaining 10% is full of unexpected issues that rely on the examiner's experience.  I will share more on these unexpected issues in a future computer forensic post when discussing real-life examples.

Saturday, January 16, 2016

Data Analytics: - Analytics Cycle III to V: Case Management, Analytics Review and Optimization

Once all the analysis is confirmed with a properly analyzable data set, we come to step 3: - Case Management, to manage and execute the follow-up actions identified from the insights produced; step 4: - Analytics Review, to review the effectiveness of the identified Analytics Rules and suggest improvements; and step 5: - Optimization, to execute those improvements and make the process more efficient and effective.

These three steps are critical to the Analytics Cycle because:

A.    Case Management manages and executes the follow-up actions from the analytics insights, which continuously improve and contribute to the actual business.

B.     Analytics Review monitors the progress and performance of every single step of case execution, and identifies the best way to resolve the issues that come up when applying theoretical analytics ideas to the real business.

C.     Optimization executes and tests the suggestions identified and, more importantly, helps smooth the case execution and improve its effectiveness and efficiency.

These three steps are highly dependent on each other, and we might need to move back to step 1 or 2 if further data is needed or the analytics rules require fine-tuning during execution.  On one hand, step 4 mainly relies on the data produced from step 3; on the other hand, steps 4 and 5 can potentially speed up and improve the case execution process in step 3 and, more importantly, make the entire analytics project more efficient and effective.

Take the e-Discovery process as an example: after step 1 – ETL, loading all the acquired data into the system, and step 2 – Analytics Rules, identifying the right screening criteria (including, but not limited to, the keyword list, time period filtering, etc.), we move forward to review and tag the hit items with a proper step 3 – Case Management process.  We can then analyze the effectiveness and performance of the work with step 4 – Analytics Review, such as reviewing the periodic progress report to understand reviewer efficiency for management purposes, identifying each keyword's performance, and so on.  For step 5 – Optimization, one idea is that if a keyword keeps producing non-relevant hits, we could switch to a sampling approach rather than a full review of the related hit population, which would potentially speed up the review process.  If you are interested in how we can improve the e-Discovery process, please be patient and I will share more possible analytics on it in a future e-Discovery related post.
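To make the keyword-performance idea concrete, here is a minimal Python sketch that computes, per keyword, the share of hits that reviewers tagged as relevant; low-precision keywords then become candidates for a sampling-based review.  The review log format is an assumption about what the platform exports.

```python
def keyword_precision(review_log):
    """Summarise how often each keyword's hits were tagged relevant by reviewers.

    `review_log` is an assumed list of (keyword, relevant: bool) records
    exported from the review platform.
    """
    stats = {}
    for keyword, relevant in review_log:
        hits, relevant_hits = stats.get(keyword, (0, 0))
        stats[keyword] = (hits + 1, relevant_hits + (1 if relevant else 0))
    return {kw: rel / hits for kw, (hits, rel) in stats.items()}

# Made-up review tags purely for illustration
log = [("offshore", True), ("offshore", True), ("offshore", False),
       ("invoice", False), ("invoice", False), ("invoice", True), ("invoice", False)]
for keyword, precision in keyword_precision(log).items():
    print(f"{keyword}: {precision:.0%} of hits tagged relevant")
```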

In conclusion, this is the end of the Analytics Cycle overview.  It is only one general flow of how Data Analytics and the Big Data approach can be applied.  But again, it always requires an investment of time and money to produce a profitable and valuable result.  In future data analytics posts, I will try to share some real-life analytics examples.  I hope you have found this useful, and I am always happy to discuss if you are interested.

Previous Step

Sunday, January 10, 2016

Data Analytics: - Analytics Cycle II: Analytics Rules

Following up from the previous post, we come to step 2:- Analytics Rules of the Analytics Cycle, once all the ETL work is done.

Analytics Rules are the core of how and what the analysis is going to be.  The ultimate goal is to analyze the prepared data set and produce insights.  For example, in customer analytics, rules are set to classify and categorize profitable and non-profitable customers into different segments, or to trend customer behaviour for prediction purposes.  From my experience, analytics rules can be classified into two major types: 1. on-going alert based; and 2. look-back review based.

On-going alert based rules are usually applied in transaction monitoring.  An alert is triggered by pre-set criteria, for example any transaction amount larger than 1M.  Examples include data screening in forensic investigations, AML transaction monitoring systems, staff and customer behaviour monitoring, spam email filtering, etc.
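As a toy illustration of an on-going alert based rule (real monitoring systems layer many scenarios, tuning and case workflow on top), here is a minimal Python sketch of the "amount larger than 1M" example; the transaction layout is assumed.

```python
ALERT_THRESHOLD = 1_000_000   # illustrative pre-set criterion from the example above

def screen_transactions(transactions, threshold=ALERT_THRESHOLD):
    """Yield an alert for every transaction above the pre-set amount threshold.

    `transactions` is an assumed iterable of dicts with 'id' and 'amount' keys.
    """
    for txn in transactions:
        if txn["amount"] > threshold:
            yield {"transaction_id": txn["id"], "rule": "amount_gt_1M", "amount": txn["amount"]}

# Made-up transactions purely for illustration
sample = [{"id": "T1", "amount": 250_000}, {"id": "T2", "amount": 1_500_000}]
for alert in screen_transactions(sample):
    print(alert)
```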

Look-back review based rules review historical data and then produce suggestions, insights and fact-findings.  Examples include revenue optimization, predictive modelling, business reviews, assisted review in e-Discovery, etc.

Regardless of the type of analytics rule, this step is conducted with various mathematical methodologies, ranging from simple statistics with data visualization (such as bar charts, pie charts, trend lines, etc.) to advanced modelling techniques (such as back-propagation, self-organizing maps, regression analysis, probability, etc.).

There are plenty of tools that can help in implementing the analytics rules, and below are some tools I have experience with:

1.      Database Management Systems (DBMS), such as SQL Server, MySQL, Oracle, MS Access and even Excel, allow us to conduct analytics through SQL programming or built-in functions and stored procedures.  The major advantage is that a DBMS is always flexible, and we can freely do any analysis we want at the basic data level.  The disadvantage is that it requires in-depth programming skills and certain IT knowledge before any analytics skill comes into play.

2.      Data Visualization Applications, such as Tableau, Qlikview and i2 Analyst's Notebook, allow you to create fancy layouts of the data for presentation and to gain an overview and insights from various visual effects.  Outliers in the data set can be identified easily, and the major advantage is that these tools can be managed with some simple scripting or just drag and drop.  The disadvantage is that too much data can overwhelm the chart; for example, it becomes a mess when presenting 100,000+ entities in a network diagram, or when working with a data set with too many dimensions.


3.      Predictive Modelling Platforms, such as Viscovery SOMine or the Assisted Review functions in Relativity, allow you to leverage advanced mathematical models without needing to understand in detail how the equations work.  The major advantage is that users can operate the tools simply by identifying and setting a training set to leverage the predictive modelling concept.  The disadvantage is that it is hard to tell whether the tool is working correctly, as the theory behind it tends to be a black box for most ordinary users at the application level.

Tuesday, January 5, 2016

Data Analytics: - Analytics Cycle I: Extract Transform and Load (ETL)

Analytics Cycle
Everything has a process, and so does Data Analytics.  I call this process the Analytics Cycle.  The Analytics Cycle is a looping process comprising five steps, as below:

  1. Extract Transform and Load (ETL);
  2. Analytics Rules;
  3. Case Management;
  4. Analytics Review;
  5. Optimization;

And today I will discuss step 1:- Extract, Transform and Load (ETL).  The ultimate goal here is to identify and consolidate the data into an analyzable format, namely a Common Data Model.  For example, a people database containing personal information could be applied to customer analytics, HR analytics, relationship analytics, etc.; or a census database could consolidate the government census data; and so on.

After identifying the Common Data Model needed, we identify the usable data sources and Extract, Transform and Load them into the Common Data Model.  The lucky thing is that data storage is normally not an issue, given that digital storage keeps shrinking physically and can be purchased anywhere, anytime (business budgeting concerns aside).

Assuming the infrastructure is ready, most of the time the first conflict between the data experts and business management is between data structure optimization and the project timeline, given that the most effective analyzable data structure does take time to build through data normalization.  In reality people usually have no patience to wait for data cleansing and hope for quick wins from the data.  How to balance the budget, the timeline and the technical work is where a data adviser's value lies.

On the other hand, most data gathering work is in fact robotic, such as extracting data archives, crawling data from the internet, etc.  Leveraging scripting techniques, such as VBScript or VBA in MS Office, to automate the work and let the computer work for you is definitely one of the good options to speed up this robotic process effectively.  Another advantage is that an automated process can run with a proper audit trail, which helps to keep track of the work and to provide proof of completion, assuming there are no critical bugs in your script.
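The post mentions VBScript and VBA, but any scripting language works; as an illustration, here is a minimal Python sketch that copies source extracts into a staging area and writes an audit-trail log entry (timestamp, MD5 and size) for every file.  The paths and file pattern are assumptions.

```python
import hashlib
import logging
import shutil
from pathlib import Path

logging.basicConfig(filename="etl_audit.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def extract_files(source_dir, staging_dir):
    """Copy source extracts into a staging area, logging an audit trail per file.

    The MD5-based log record provides a simple proof of what was copied and when.
    """
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    for src in Path(source_dir).glob("*.csv"):
        dst = staging / src.name
        shutil.copy2(src, dst)
        digest = hashlib.md5(dst.read_bytes()).hexdigest()
        logging.info("copied %s -> %s md5=%s size=%d", src, dst, digest, dst.stat().st_size)

# Hypothetical usage with placeholder paths
# extract_files("/exports/daily", "/staging/2016-01-05")
```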

Finally, all we need to do is load the data into the database engine and ensure that every data load is reconciled and verified using control totals and data hashes, whichever is appropriate.  More importantly, we should always pay close attention to the high-risk data types, for example Datetime formats and Text fields with character length limits.
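Here is a minimal sketch of that reconciliation step, using Python with sqlite3 purely for illustration: it compares the row count and an amount control total between the source CSV and the loaded table.  The file, table and column names are assumptions.

```python
import csv
import sqlite3

def reconcile_load(csv_path, conn, table, amount_column="amount"):
    """Compare row count and amount control total between the source file and the loaded table.

    The same two checks apply to any database engine; sqlite3 is used here only
    to keep the sketch self-contained.
    """
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    source_count = len(rows)
    source_total = round(sum(float(r[amount_column]) for r in rows), 2)

    # Table and column names are assumed to be trusted here (not user input)
    loaded_count, loaded_total = conn.execute(
        f"SELECT COUNT(*), ROUND(SUM({amount_column}), 2) FROM {table}"
    ).fetchone()

    return {"rows_match": source_count == loaded_count,
            "totals_match": source_total == loaded_total}

# Hypothetical usage with placeholder names
# conn = sqlite3.connect("warehouse.db")
# print(reconcile_load("transactions.csv", conn, "transactions"))
```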

“Garbage in, garbage out”: ETL is the foundation of successful data analytics work.  Without a good ETL plan, a properly analyzable data structure and a data set that is meaningful for your analytics idea, it will most likely be a waste of time.