Thursday, January 21, 2016

Computer Forensics: - Forensic Workflow I - Data Acquisition and Preservation

Having focused on Data Analytics in the previous posts, it is time for Computer Forensics.  As I said earlier, Computer Forensics is mainly about storytelling: presenting the facts to support the investigative work.  As such, in-depth technical knowledge of hardware and software, as well as proper presentation skills, will always be essential.

Back in 2006, when I first jumped into this industry as a law enforcement officer, almost every case was about analyzing hard disks.  Today, however, Computer Forensics has become far more complicated, and in recent years people have started to call it Digital Forensics, as the range of digital devices keeps growing: smartphones, tablets and so on.  Having said that, the computer forensic workflow remains pretty much the same:

1. Data Acquisition and Preservation
2. Forensic Analysis
3. Reporting
4. Testifying as an Expert Witness

The first step is to obtain the relevant data and preserve it through an auditable process with proper chain-of-custody maintenance, regardless of the type of target device.  A forensically sound process, supported by a proper forensic kit, is required to ensure that nothing is altered during acquisition, with verification by an industry-accepted algorithm such as an MD5 or SHA hash.  The preferred approach is a full clone (also known as a bit-by-bit copy) taken through a write blocker, which ensures an identical copy is obtained and that the data is preserved in a non-alterable format, leaving no data integrity concerns.
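To make the verification step concrete, here is a minimal Python sketch of the hashing idea, not any particular tool's implementation; the image filename evidence.dd is hypothetical, and in practice the forensic kit records these values for you.

```python
import hashlib

def hash_image(path, chunk_size=1024 * 1024):
    """Compute MD5 and SHA-256 of an acquired image in a single pass."""
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):   # read in 1 MB chunks to handle large images
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()

# "evidence.dd" is a hypothetical image file; compare these values against the
# hashes recorded at acquisition time to confirm the copy has not been altered.
image_md5, image_sha256 = hash_image("evidence.dd")
print("MD5:   ", image_md5)
print("SHA256:", image_sha256)
```

The same comparison is also what justifies working on a Backup or Working copy later on: as long as the hashes still match the values recorded at acquisition, the copy is demonstrably identical.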

However, subject to technical limitations, sometimes we can only acquire logical data files or perform a simple drag-and-drop copy, for example when acquiring email data from a live server or imaging an old hard disk with serious bad-sector issues.  In those rare, worst-case circumstances, the acquisition can only be vouched for by the examiner's personal integrity, although I would say that in terms of technology nothing is truly impossible.

Throughout the data acquisition process, one Master copy and one Backup copy are produced, and sometimes an additional Working copy, depending on the nature of the case.  In most circumstances, the flow is to collect the custodian's device, image the data and then return the device.  With this approach, once the device is returned and the custodian starts using it again, the source data is altered and the exact image can never be reproduced.  A backup is therefore essential, and all analysis should be performed on the Backup or Working copy.  The Master copy is used only to create a new Backup copy when it is the only workable copy left.

On top of the above acquisition process, I have also experienced cases where, due to their sensitivity, the original media had to be seized and only a cloned copy was returned for the custodian's continued use.  The main disadvantage of returning only a cloned copy is the additional cost, but I believe it is the best and most secure process I have ever performed.

In my forensic career I have used plenty of tools for forensic imaging, including but not limited to EnCase, FTK Imager, Paradin, Helix, etc. for hard disk acquisition; Oxygen Forensic, XRY, Cellebrite, etc. for mobile forensics; and Macquisition for MacOS acquisition.  My main comment on these tools is that most of them are similar to one another: 90% of acquisition work is pretty straightforward and the tools perform very well, but the remaining 10% is full of unexpected issues that depend on the examiner's experience.  I will share more about these unexpected issues in future computer forensics posts with real-life examples.

Saturday, January 16, 2016

Data Analytics: - Analytics Cycle III to V: Case Management, Analytics Review and Optimization

Once the analysis is confirmed against a properly analyzable data set, we come to step 3: - Case Management, to manage and execute the follow-up actions identified from the insights produced; step 4: - Analytics Review, to review the effectiveness of the identified Analytics Rules and suggest improvements; and step 5: - Optimization, to execute those improvements and make the process more efficient and effective.

These three steps are critical to the Analytics Cycle because:

A. Case Management is to manage and execute the follow-up actions arising from the analysis insights, which continuously improve and contribute to the actual business.

B. Analytics Review is to monitor the progress and performance of every step of case execution, and to identify the best way to resolve the issues that arise when theoretical analytics ideas are applied to the real business.

C. Optimization is to execute and test the suggestions identified and, more importantly, to help smooth case execution and improve its effectiveness and efficiency.

These three steps are highly dependent on one another and may require moving back to step 1 or 2 if further data is needed or the analytics rules need fine-tuning during execution.  On one hand, step 4 relies mainly on the data produced in step 3; on the other hand, steps 4 and 5 can speed up and improve the case execution in step 3 and, more importantly, make the entire analytics project more efficient and effective.

Take the e-Discovery process as an example.  After step 1 - ETL, loading all the acquired data into the system, and step 2 - Analytics Rules, defining the right screening criteria (including, but not limited to, a keyword list, time-period filtering, etc.), we move on to review and tag the hit items under a proper step 3 - Case Management process.  We can then analyse the effectiveness and performance of the work with step 4 - Analytics Review, for example by reviewing periodic progress reports to gauge reviewer efficiency for management purposes, or by measuring how each keyword performs.  For step 5 - Optimization, one idea is that if a keyword consistently produces non-relevant hits, we could apply a sampling approach to its hit population rather than a full review, which would potentially speed up the review process.  If you are interested in how the e-Discovery process could be improved, please be patient; I will share more possible analytics on e-Discovery in a future e-Discovery related post.
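To illustrate that sampling idea, here is a minimal Python sketch; the document IDs, the keyword and the 10% sample rate are all hypothetical.  The point is simply to draw a reproducible sample from a keyword's hit population so a reviewer can estimate its relevance rate before committing to a full review.

```python
import random

# Hypothetical hit population: 5,000 documents hit by a low-performing keyword.
hits = [f"DOC-{i:05d}" for i in range(1, 5001)]

def sample_hits(hit_ids, rate=0.10, seed=42):
    """Draw a reproducible random sample from a keyword's hit population."""
    rng = random.Random(seed)                 # fixed seed so the sample can be re-drawn for QC
    size = max(1, int(len(hit_ids) * rate))
    return rng.sample(hit_ids, size)

sample = sample_hits(hits)
print(f"Review {len(sample)} of {len(hits)} hits to estimate the keyword's relevance rate")
```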

In conclusion, this brings the Analytics Cycle overview to an end.  It is only one general flow for how Data Analytics and the Big Data approach can be applied, and, again, it always requires investments of time and money to produce profitable and valuable results.  In future data analytics posts I will try to share some real-life analytics examples.  I hope you found this useful, and I am always happy to discuss further if you are interested.


Sunday, January 10, 2016

Data Analytics: - Analytics Cycle II: Analytics Rules

Following up from the previous post, we come to step 2: - Analytics Rules of the Analytics Cycle, now that all the ETL work is done.

Analytics Rules are the core of how and what the analysis is going to be.  The ultimate goal is to analyze the prepared data set and produce insights.  For example, in customer analytics, rules are set to classify profitable and non-profitable customers into different segments, or to trend customer behaviour for prediction purposes.  From my experience, analytics rules can be classified into two major types: 1. ongoing alert based; and 2. look-back review based.

Ongoing alert-based rules are usually applied in transaction monitoring.  An alert is triggered by pre-set criteria, for example any transaction amount larger than 1M.  Examples include data screening in forensic investigations, AML transaction monitoring systems, staff / customer behaviour monitoring, spam email filtering, etc.
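As a minimal sketch of such an alert-based rule, here is the 1M threshold example in Python; the transaction records and field names are hypothetical, and a real monitoring system would evaluate many rules against a live transaction feed.

```python
# A minimal alert-based rule, assuming each transaction arrives as a simple dictionary.
THRESHOLD = 1_000_000  # the "larger than 1M" criterion from the example above

transactions = [
    {"id": "TXN-001", "amount": 250_000,   "account": "A-101"},
    {"id": "TXN-002", "amount": 1_500_000, "account": "A-102"},
]

def check_transaction(txn):
    """Return an alert record when the pre-set criterion is breached, otherwise None."""
    if txn["amount"] > THRESHOLD:
        return {"rule": "AMOUNT_OVER_1M", "transaction": txn["id"], "amount": txn["amount"]}
    return None

alerts = [a for a in (check_transaction(t) for t in transactions) if a is not None]
for alert in alerts:
    print(alert)   # in practice this would be routed to a case queue, not printed
```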

Look-back review-based rules review historical data and then produce suggestions, insights and factual findings.  Examples include revenue optimization, predictive modelling, business reviews, assisted review in e-Discovery, etc.

Regardless of the type of analytics rule, this step is carried out with various mathematical methodologies, ranging from simple statistics with data visualization (bar charts, pie charts, trend lines, etc.) to advanced modelling techniques (back-propagation, self-organizing maps, regression analysis, probability, etc.).
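To give a feel for the simpler end of that range, here is a minimal sketch of a trend line fitted by least-squares regression; the monthly revenue figures are made up for illustration and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical monthly revenue figures for a simple look-back trend analysis.
months = np.arange(1, 13)
revenue = np.array([100, 105, 98, 110, 115, 120, 118, 125, 130, 128, 135, 140], dtype=float)

# Fit a straight trend line (degree-1 least-squares regression).
slope, intercept = np.polyfit(months, revenue, 1)
print(f"Trend: revenue ~ {slope:.2f} * month + {intercept:.2f}")
print(f"Projected month 13: {slope * 13 + intercept:.1f}")
```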

There are plenty of tools that can help implement analytics rules; below are some that I have worked with:

1. Database Management Systems (DBMS), such as SQL Server, MySQL, Oracle, MS Access and Excel, allow us to conduct analytics through SQL programming or built-in functions and stored procedures.  The major advantage is that a DBMS is flexible, so we can freely run any analysis we want at the basic data level (see the sketch after this list).  The disadvantage is that it requires in-depth programming skill and a certain amount of IT knowledge before any analytics skill comes into play.

2. Data Visualization Applications, such as Tableau, Qlikview and i2 Analyst Workbook, let you create polished layouts of the data for presentation and gain overviews and insights from various visual effects.  Outliers in the data set can be identified easily, and the major advantage is that these tools can be driven with simple scripting or just drag and drop.  The disadvantage is that too much data can overwhelm a chart; for example, it can become a mess when presenting 100,000+ entities in a network diagram, or when working with a dataset that has too many dimensions.


3. Predictive Modelling Platforms, such as Viscovery SOMine or the Assisted Review functions in Relativity, let you leverage advanced mathematical formulas without a detailed understanding of how the equations work.  The major advantage is that the user can operate the tool simply by identifying and setting a training set and letting the predictive modelling concept do the rest.  The disadvantage is that it is hard to tell whether the tool is working correctly, as the theory behind it tends to be a black box for most ordinary users at the application level.
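To illustrate the DBMS option in point 1, here is a minimal sketch using Python's built-in sqlite3 module; the sales table and its contents are hypothetical, and the same SQL would run on SQL Server, MySQL or Oracle with only minor dialect changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database for the example
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Alice", 1200.0), ("Bob", 300.0), ("Alice", 800.0), ("Carol", 50.0)],
)

# A simple rule at the basic data level: rank customers by total spend.
query = """
    SELECT customer, SUM(amount) AS total_spend
    FROM sales
    GROUP BY customer
    ORDER BY total_spend DESC
"""
for customer, total_spend in conn.execute(query):
    print(customer, total_spend)
```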

Tuesday, January 5, 2016

Data Analytics: - Analytics Cycle I: Extract Transform and Load (ETL)

Analytics Cycle
Everything has a process, and Data Analytics is no exception.  I call this process the Analytics Cycle.  The Analytics Cycle is a looping process made up of the following five steps:

  1. Extract Transform and Load (ETL);
  2. Analytics Rules;
  3. Case Management;
  4. Analytics Review;
  5. Optimization;

Today I will discuss step 1: - Extract, Transform and Load (ETL).  The ultimate goal here is to identify and consolidate the data into an analyzable format, namely a Common Data Model.  For example, a people database holding personal information could serve customer analytics, HR analytics, relationship analytics, etc.; or a census database could consolidate government census data (a minimal sketch of such a model follows below).
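As a rough illustration of what a Common Data Model can look like in code, here is a minimal Python sketch of a people-centric model; all field names and the CRM mapping are hypothetical.  The idea is that every source system is mapped into this one shape before any analysis starts.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Person:
    """One record in a hypothetical people-centric Common Data Model."""
    person_id: str        # stable identifier assigned during ETL
    full_name: str
    date_of_birth: date
    source_system: str    # e.g. "CRM", "HR", "census" - where the record came from

def from_crm_row(row: dict) -> Person:
    """Map one row from a hypothetical CRM export into the common model."""
    return Person(
        person_id=f"CRM-{row['cust_no']}",
        full_name=f"{row['first']} {row['last']}",
        date_of_birth=date.fromisoformat(row["dob"]),
        source_system="CRM",
    )

print(from_crm_row({"cust_no": 42, "first": "Alice", "last": "Lee", "dob": "1980-05-17"}))
```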

After identifying the Common Data Model needed, we identify the usable data sources and Extract, Transform and Load them into that model.  Fortunately, data storage is normally not an issue, given that physical storage keeps getting smaller and can be purchased anywhere, anytime these days, business budget considerations aside.

Assuming the infrastructure is ready, the first conflict between the data expert and business management is usually between optimizing the data structure and the project timeline, since the most effective analyzable structure takes time to achieve through data normalization.  In reality, people rarely have the patience to wait for data cleansing and instead hope for quick wins with the data.  Balancing the budget, the timeline and the technical work is where the data adviser's value lies.

On the other hand, most data gathering work is genuinely robotic, such as extracting data archives, crawling data from the internet, etc.  Leveraging scripting, such as VBScript or VBA in MS Office, to automate the work and let the computer do it for you is definitely a good way to speed up these repetitive tasks.  Another advantage is that an automated process can run with a proper audit trail, which helps keep track of the work and provides proof of completion, assuming there are no critical bugs in your script.
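The post mentions VBScript and VBA; purely as an illustration, here is the same idea sketched in Python instead.  The folder names and the archive-extraction task are hypothetical; the point is that every action is written to a log file, leaving an audit trail behind the run.

```python
import logging
import zipfile
from pathlib import Path

# Write every action to a log file so the run leaves an audit trail behind it.
logging.basicConfig(filename="etl_audit.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def extract_archives(source_dir: str, target_dir: str) -> None:
    """Unpack every .zip archive found in source_dir into target_dir, logging each one."""
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    for archive in sorted(Path(source_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            names = zf.namelist()
            zf.extractall(out / archive.stem)
        logging.info("extracted %s (%d files)", archive.name, len(names))

extract_archives("incoming", "staging")   # hypothetical folder names
logging.info("extraction run completed")
```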

Finally, all we need to do is load the data into the database engine and ensure that every load is reconciled and verified using control totals and data hashes, whichever is appropriate.  More importantly, we should always pay close attention to high-risk data types, for example datetime formats and text fields with character-length limits.
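As a minimal sketch of the control-total and hash idea, here is a reconciliation check in Python; the invoice rows, the amount column and the canonical record format are all hypothetical.

```python
import hashlib

# Hypothetical rows as extracted from the source, and as read back from the database after loading.
source_rows = [("INV-1", 100.50), ("INV-2", 250.00), ("INV-3", 75.25)]
loaded_rows = [("INV-1", 100.50), ("INV-2", 250.00), ("INV-3", 75.25)]

def control_totals(rows):
    """Row count, sum over the amount column, and a hash of the canonicalised records."""
    canonical = "|".join(f"{ref},{amount:.2f}" for ref, amount in rows)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return len(rows), round(sum(amount for _, amount in rows), 2), digest

assert control_totals(source_rows) == control_totals(loaded_rows), "load does not reconcile"
print("Load reconciled:", control_totals(loaded_rows))
```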

“Garbage in, garbage out”: ETL is the foundation of successful data analytics work. Without a good ETL plan, a properly analyzable data structure and a data set that is meaningful for your analytics idea, the effort will most likely be a waste of time.