Tuesday, March 28, 2017

Data Analytics: - Step-by-step sample size determination walk-through

As discussed in the previous post, throughout my experience as a Data Analytics consultant I have been required to perform estimation on almost every project I have worked on.  Sampling is one of the most commonly used techniques for this, since a properly selected set of samples can represent an unknown or very large population. 

Although sampling is widely used, it appears that many people still determine the sample size purely by personal professional judgement, in other words, gut feeling. In fact, there are several ways to determine the sample size in a more scientific and systematic manner, and a properly derived sample size makes your analysis far more justifiable and defensible.


Here we start our calculation with the “Margin of Error”: -


Depending on the nature of the project, we first need to determine our precision level (i.e. the Margin of Error), which we would normally set to 5% or 10%. 

We then choose a confidence level, typically 95%, which means we are 95% confident that our sample represents the whole population. 

At a 95% confidence level, the area to the left of the critical z-score is 1 - 0.05 / 2 = 0.975.  Therefore the z-score is 1.96.

Substituting all of the above into the equation, we find that 385 samples is the number we need to draw from our unknown or very large population. 
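Just as a rough sketch, the calculation above is consistent with the standard sample size formula n = z^2 * p * (1 - p) / e^2 using the most conservative proportion p = 0.5, which reproduces the 385 figure; the small helper below is an illustrative implementation under that assumption:

```python
import math
from statistics import NormalDist

def sample_size_infinite(margin_of_error=0.05, confidence=0.95, p=0.5):
    """Sample size for an unknown or very large population (assumed standard formula)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # ~1.96 at 95% confidence
    n = (z ** 2) * p * (1 - p) / margin_of_error ** 2    # ~384.1
    return math.ceil(n)

print(sample_size_infinite())  # 385
```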

For a small, finite population, I am not going to elaborate in detail here in order to keep things simple.  However, we can still apply a finite population correction to the sample size identified above when the population size is known.


For example, assume we know the population is 10,000 items.  Applying the correction, we find that 371 samples (instead of 385) is the number we need to draw from a population of 10,000 items.
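As a sketch under the same assumptions, the usual finite population correction n_adj = n / (1 + (n - 1) / N) reproduces the 371 figure for N = 10,000:

```python
import math

def sample_size_finite(n_infinite, population_size):
    """Adjust an infinite-population sample size for a known, finite population."""
    adjusted = n_infinite / (1 + (n_infinite - 1) / population_size)
    return math.ceil(adjusted)

print(sample_size_finite(385, 10_000))  # 371
```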

Hopefully the above gives you a high-level understanding of how to determine your sample size, so that pure "professional judgement" is no longer needed when you have to provide an estimation or justification based on sampling. If you have found any errors in the above or have any thoughts or interest in this topic, please feel free to drop me a note.  Thanks.



Tuesday, March 21, 2017

Data Analytics: - Sampling

Data Analytics is widely applied in data mining and data categorization today.  Among the available data mining techniques, Detection Scenario ("DS") approaches are among the most commonly used in the market, for example in transaction monitoring for Financial Crime Compliance (Anti-Money Laundering and proactive fraud detection), customer categorization and classification, etc.

To assess the effectiveness and efficiency of a DS, sampling is one of the most common and reasonable ways to provide justification.  As such, "How do we select the sample?" and "How many samples should we pick?" become the two most critical questions.  Most people answer these according to their experience; however, as a data analytics professional, I always prefer a more scientific (or more mathematical) approach to sampling.

1.      How to sample?

Depending on the ultimate goal of the sampling exercise, we might use either probabilistic or non-probabilistic sampling. The most commonly used methodologies are simple random sampling and systematic sampling on the probabilistic side, versus purposeful sampling on the non-probabilistic side. 

Simple random sampling is the easiest, like a lucky draw or lottery, and is normally used when we do not have a strong understanding of the data.  However, it can be vulnerable to sampling error, because the randomness of the selection may produce a sample that does not reflect the make-up of the population.  To spread the sample across the whole data population, systematic sampling can be applied instead; Cumulative Monetary Amounts (CMA) sampling (a.k.a. monetary unit sampling or dollar-unit sampling) is a typical example commonly used in auditing and other monetary analyses. 
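As a minimal sketch of the CMA idea (not any particular audit tool's implementation), assume a fixed sampling interval equal to the total monetary value divided by the desired sample size, with a random starting point; items whose cumulative amount crosses each interval point are selected, so larger items are more likely to be picked:

```python
import random

def cma_sample(amounts, sample_size, seed=None):
    """Cumulative Monetary Amount (monetary unit) sampling sketch.
    Returns the indices of the selected items."""
    rng = random.Random(seed)
    total = sum(amounts)
    interval = total / sample_size                 # fixed sampling interval
    start = rng.uniform(0, interval)               # random start in the first interval
    targets = [start + i * interval for i in range(sample_size)]

    selected, cumulative, t = [], 0.0, 0
    for idx, amount in enumerate(amounts):
        cumulative += amount
        while t < len(targets) and targets[t] <= cumulative:
            selected.append(idx)                   # this item covers the target unit
            t += 1
    return sorted(set(selected))                   # large items can be hit repeatedly

# Illustrative data: selection is weighted towards the larger amounts
items = [120.0, 45.5, 980.0, 10.0, 300.0, 2250.0, 75.0, 640.0]
print(cma_sample(items, sample_size=5, seed=42))
```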

In contrast, purposeful sampling relies mainly on the analyst's understanding and perception of the data.  Samples are drawn from the data the analyst believes to be most representative.  A typical example is taking the customers / suppliers with the top X purchase / sales amounts, covering a certain percentage of the total, as the samples for testing, while customers / suppliers with lower purchase / sales amounts are generally considered insignificant to the study.
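As an illustrative sketch of that top-X coverage idea (the 80% coverage threshold and the data below are assumptions for illustration only), we can simply take the largest customers until the chosen share of total sales is covered:

```python
def top_coverage_sample(sales_by_customer, coverage=0.80):
    """Purposeful sampling sketch: pick the largest customers until they
    account for the chosen share of total sales."""
    total = sum(sales_by_customer.values())
    selected, covered = [], 0.0
    for customer, amount in sorted(sales_by_customer.items(),
                                   key=lambda kv: kv[1], reverse=True):
        selected.append(customer)
        covered += amount
        if covered / total >= coverage:
            break
    return selected

sales = {"A": 5_000, "B": 12_000, "C": 800, "D": 30_000, "E": 2_200}
print(top_coverage_sample(sales))  # ['D', 'B'] covers over 80% of total sales
```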

2.      How many samples are required?

In my experience, a number of approaches are used to determine the sample size, such as relying on the leadership's experience, or working within the sampling team's size and time limitations.  However, whether the result is justifiable and defensible is usually the question.  As a data analytics professional, my preference is always to fall back on a more scientific approach, and we can start to derive the number from the "Margin of Error" in statistics.  The Margin of Error (or "Precision Level") is a statistic expressing the amount of random sampling error, based on the z-score, the standard deviation and the sample size.  I will write a step-by-step sample size determination walk-through in the next post. 

In conclusion, an appropriate sampling method with a proper sample size is essential to any analytics project in which sampling is applied.  Moreover, both the sampling method and the resulting sample size should be justifiable and defensible in today's Big Data era.  If you have found any errors in the above or have any thoughts or interest in this topic, please feel free to drop me a note.  Thanks.

Tuesday, December 6, 2016

Compliance framework vs. RegTech with Big Data

Around six months ago, I was invited to participate in a panel discussion at an IT security conference.  The topic was Security Innovations, and my focus was mainly on AML transaction monitoring, as at that time I was still working in this area for a global banking corporation.  In recent years, people have started to discuss Regulatory Technology: - RegTech.  The term "RegTech" was introduced by the Financial Conduct Authority (FCA), which defines it as: "RegTech is a sub-set of FinTech that focuses on technologies that may facilitate the delivery of regulatory requirements more efficiently and effectively than existing capabilities".  I believe that RegTech is not only the ordinary technology already heavily used in compliance transaction monitoring or name screening, but also an innovative solution for making compliance more effective and efficient, in other words, a real revolution of the entire compliance framework we have today.

Compliance today is basically regulatory-enquiry oriented, with the result that regulators keep increasing their scrutiny and, in turn, their fines.  I think the major reason is that regulators simply do not know what the industry will do or where it is moving next.  The business model is ever-changing, from ATM/phone banking in the early stages, to e-banking in the mid 90s, to the recent FinTech era.  Over the last several years, regulated firms have had to deal with an increasingly diverse, uncoordinated and ever-changing set of regulations across different locations.  To stem the tide, moving towards a more holistic regulatory response mechanism might be the only way out, and RegTech innovations built on Big Data would be the key.

Put into practice, I believe RegTech is more about the workflow, and one way forward is to combine different solutions available in the market into the compliance working cycle.  For example, we could combine artificial intelligence solutions with Big Data detection scenarios and audit sampling approaches, such as layering Assisted Review from e-Discovery and the Cumulative Monetary Amount (CMA) sampling method on top of the existing detection scenario approach.  I am keen to see RegTech create a revolution in today's compliance, one that facilitates the delivery of regulatory requirements and continuously improves the efficiency and effectiveness of existing capabilities.  There is always a better solution to make the world better and our lives easier, and the RegTech approach is certainly one of them; I will share more of my experience in a future post.  However, the very first road-block now may well be today's compliance management: who will dare to make this little step forward first, in order to overcome today's problematic compliance approaches? 

If you are also interested in this area, please feel free to drop me a note; I am always happy to discuss and work together with you to make this little step forward.

Saturday, November 26, 2016

Data Analytics: - Are you ready for the Big Data era?


Many organizations today are interested in leveraging their data for decision making, exploring new opportunities and workflow optimization.  However, as management, do you know whether your organization is ready for this?  Or have you already started your Big Data analytics plan, only to find it not as effective as expected?  As a data consultant, the following are three of the most common road-blocks I have seen when applying data analytics in reality: -

1.      Too many technologies exist

New technologies appear every day.  It is normal for companies to run multiple platforms at once, including proper database-driven platforms and tailor-made applications, and some of these overlap in both functionality and the information they contain.  This creates data integrity issues, and it becomes a real headache when data scientists try to gather analyzable data, often resulting in false analytics insights and conclusions.  To resolve this, my suggestion is to have an ETL data expert working together with business management and the subject matter experts to consolidate the data into a proper, extendable and analyzable format before rolling out any Big Data analytics plan.  Put another way, data completeness and readiness should always be the first things to confirm and establish before moving into any real analytics battles.

2.      Lack of business overview

This usually happens in large organizations, especially those running a Target Operating Model (TOM).  Put another way, the managers in such an organization are all talented and experienced subject matter experts, but they may focus too narrowly on their own responsibilities; e.g. Procurement does not understand Manufacturing, while Manufacturing is not on friendly terms with Sales or Logistics.  The fact is that, in most cases, no one knows the true picture of the business from top to bottom and inside out.  Although this may be understandable, since it is difficult to understand everything in reality, it still has a significant impact on the accuracy and effectiveness of the analytics results.  To resolve this, I believe a proper business plan and a very clear business goal for the data analytics project are essential for both the internal subject matter experts and the data scientists, so that the right analytics direction and approach can be identified.  Without a deep understanding of the business, there is no way to succeed in any analytics project.  That is why a successful analytics deployment always requires the business and the data scientists to work closely together, whereas most of the time I have seen these two parties in conflict.

3.      Lack of data compliance policy

People tend to use the easiest and most effective ways to perform their jobs.  However, what counts as easiest and most effective is usually subjective.  For example, some people like to keep track of data in Excel, while others prefer a large database or simple hard-copy documents.  In my view, all of these are fine under normal circumstances, as long as they help drive business growth.  However, when something unexpected happens, for example a regulator-requested look-back review, which happens quite often in the banking industry these days, issues arise.  To avoid this, I believe it is never too late to deploy a proper data compliance policy within an organization.  However, we must ensure that the data compliance policy does not try to change or restrict how each manager likes to run their team; rather, it should ensure that all information and processes are tracked properly in one place.  Otherwise, the policy is deployed but nobody follows it, and it becomes completely useless, which I have seen in many large organizations.  This demonstrates why not only the business team and the data scientists, but also the experts who execute the deployment, are genuinely important.


In conclusion, if you believe in data and would like to deploy data analytics in your organization, it is not only about the data experts, but also about your business team and the many other parties involved.  Big Data is not magic that only brings advantages; without a proper plan and understanding, Big Data may only create a nightmare for your organization.  If you are interested in learning or discussing more about your Big Data plan, please feel free to drop me a note and I am always happy to discuss and help.

Thursday, April 21, 2016

Data Analytics: - The Common Two-Step Approach to Identifying Data Outliers

Setting aside all the fancy data-related buzzwords, one of the most common goals for any data expert is "identifying data outliers".  There are plenty of ways to identify outliers by leveraging both internal and external data, and today I am going to share some of the most common ways to achieve this goal.

In statistics, a data outlier is a data point that stands apart from the main data set.  An outlier may sometimes be caused by a data error but, in most circumstances, it is something we should be aware of and pay special attention to.  The following is a common two-step approach, covering both the quantitative and qualitative perspectives, to identifying data outliers: -

1.      Statistical approach

This is the quantitative approach: we can leverage common statistical concepts, including but not limited to the mean and standard deviation, to identify outliers regardless of any specific industry knowledge.  This could also include more complex statistical models, such as probability-based ones, but today I will just discuss one of the most common ideas: identifying the data points that fall outside the mean plus or minus X times the standard deviation.

Example: customers' historical spending amounts

Assume all customers have the same spending power and we are launching a new membership scheme.  We can calculate the mean plus and minus 2 standard deviations of the spending amounts as threshold 1 and threshold 2 respectively (a short code sketch follows this list), so that: -
  • Whoever has total spending above threshold 1 may be considered among the most valuable customers; the top-level premium VIP card should be given to them to show our appreciation and to retain their spending with us;
  • Whoever falls between threshold 1 and threshold 2 may be considered a normal customer; a standard membership should be given to encourage them to spend more and reach the top-level premium VIP class; and
  • For the rest, below threshold 2, no membership may be granted, but we might introduce something like a points collection scheme based on spending to encourage this group of customers to spend more, become members and earn the subsequent benefits.
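Here is a minimal sketch of that mean plus or minus 2 standard deviations bucketing (the customer data and tier labels are made up for illustration):

```python
import statistics

def spending_tiers(spending):
    """Bucket customers by mean +/- 2 standard deviations of spending."""
    amounts = list(spending.values())
    mean = statistics.mean(amounts)
    sd = statistics.stdev(amounts)
    threshold_1 = mean + 2 * sd        # upper threshold
    threshold_2 = mean - 2 * sd        # lower threshold

    tiers = {}
    for customer, amount in spending.items():
        if amount > threshold_1:
            tiers[customer] = "premium VIP"
        elif amount >= threshold_2:
            tiers[customer] = "normal member"
        else:
            tiers[customer] = "points scheme"
    return tiers

spending = {
    "C01": 900, "C02": 1_100, "C03": 950, "C04": 1_050, "C05": 1_000,
    "C06": 980, "C07": 1_020, "C08": 1_010, "C09": 990, "C10": 50_000,
}
print(spending_tiers(spending))  # C10 lands above threshold 1 in this made-up data
```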
2.      Detection scenario logic

Depending on the nature of the business, a qualitative approach based on detection scenario logic can be applied using different available data elements, such as age, gender, profession, etc.

Example: customer data, following on from the example above

Assuming the target customers are in the high-income group, which can be classified by age group and profession, we can further categorize customers beyond the spending amounts used in the previous example. One point to note is that, in order to identify the right scenario logic, we might leverage both internal and external data sources; for example, government census data can tell us which professions and age groups tend to be high earners. After that, we might come up with logic such as the following (a small rule sketch follows this list): -
  • Top premium VIP might only be granted to those with high spending amounts, aged above 40 and working in certain industries, etc., as the current core customer base; and
  • A second-grade premium VIP class might be granted to those with normal spending amounts, aged between 20 and 40 and working in certain industries, etc., who are potentially the successors of the current top-level group and should be treated as a high-potential customer base.
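And here is a minimal sketch of such detection scenario rules; the age bands, tier names, spending threshold and industry list below are purely illustrative assumptions, not real census figures:

```python
HIGH_INCOME_INDUSTRIES = {"finance", "medicine", "law"}   # assumed from census-style data

def assign_tier(spending, age, industry, high_spend_threshold=10_000):
    """Rule-based detection scenario sketch combining spending with demographics."""
    in_target_industry = industry in HIGH_INCOME_INDUSTRIES
    if spending >= high_spend_threshold and age > 40 and in_target_industry:
        return "top premium VIP"
    if 20 <= age <= 40 and in_target_industry:
        return "2nd grade premium VIP"
    return "standard"

print(assign_tier(spending=15_000, age=52, industry="finance"))    # top premium VIP
print(assign_tier(spending=3_000, age=28, industry="medicine"))    # 2nd grade premium VIP
print(assign_tier(spending=3_000, age=28, industry="retail"))      # standard
```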
In conclusion, there are plenty of statistical methodologies and scenarios we can apply in our daily operations, depending on data availability and the analysts' experience.  If you are interested in knowing more or would like to discuss this topic further, please feel free to reach out; I always appreciate the chance to learn from all of you.

Saturday, February 27, 2016

Insights from Alan Turing – Father of artificial intelligence

Alan Turing (1912-1954) was a pioneering mathematician, also known as the father of computing and artificial intelligence, who was granted a royal pardon by the Queen in recognition of his contribution as a code-breaker in World War Two.  One of the most famous topics associated with Alan Turing is "Can machines think?"

Just a heads-up in advance: I am not going to cover who Alan Turing was, how he broke the Enigma ciphers or how the Turing machine works, and what follows in this post is not really about Alan Turing at all.  The topic I would like to discuss here is the question mentioned in the last paragraph, whether a machine can think, and, to a certain extent, "Can an analytics platform think?"    

Many people ask whether Data Analytics is believable, a question I touched on in an earlier post.  My first question is: would people consider getting "something" to help them in decision making, whether or not they believe that "something" can give them a reasonable, debatable and questionable suggestion produced by an acceptable analysing or thinking process? 

A data analytics platform normally works with various statistical and mathematical models derived and programmed by data experts, and then produces advice accordingly.  The question is: should we consider this advice-generating process to be "machine thinking"? 

One idea is that the data experts have in fact only programmed the rules into the machine's memory, and the machine then works on its own to produce the results once we ask it to "think" by pressing the start button. Should we therefore consider what the data experts did as teaching rather than merely running the machines? Alan Turing predicted that "machine learning" would play an important role in building powerful machines, and in that sense I would consider that every one of us who works in the IT field is, in some circumstances, teaching the machine.

On the other hand, the machine might only produce the pre-programmed advisory options and nothing out of the box; is that still considered "thinking"?  But then again, are we humans not also just producing our personal thoughts based on what our teachers, friends and our own experiences have taught us, or pre-programmed into our brains?

“Sometimes it is the people no one can imagine anything of who do the things no one can imagine.”
-          Alan Turing

“We can only see a short distance ahead, but we can see plenty there that needs to be done.”
-          Alan Turing
I will leave all the above questions open for discussion, and I hope to see how they play out over the rest of my life.  However, I have no doubt that Data Analytics is one of the most promising ways to create a better world tomorrow; for example, in my own job, leveraging Big Data to fight against money laundering activities.

Friday, February 19, 2016

Computer Forensic: - Forensic Workflow III & IV – Reporting & Testifying as an Expert Witness


As I have mentioned before, computer forensics is mainly about storytelling: presenting the facts to facilitate the investigation and the judgement of the case.  Reporting is therefore one of the most critical areas, demonstrating the examiner's seniority alongside their analysis skills.  A computer forensic report is usually used in litigation and is likely to be distributed to both technical and non-technical parties.  As such, accurately presenting the facts in a human-readable way, without bias, is always the key to writing a good report, and the following are some notable requirements and concepts drawn from my experience as a computer forensic examiner.

1.      Reporting purpose

The ultimate objective of reporting is to present the facts that address the technical concern.  This must be done in an understandable, human-readable manner.  Jargon must be carefully identified and explained on the assumption that the readers have zero computer knowledge, especially if the report is going to be used in litigation, where the readers are likely to be non-technical individuals such as attorneys, the judge or the jury.  Besides, since the report may be the only opportunity to present the facts found in the investigation, it must encompass the whole of any testimony in detail for the trier of fact; otherwise, misrepresenting any of the findings may lead to serious financial and legal consequences.

2.      Report structure and style

Ideally, every examiner report should be able to stand on its own, providing information clear and accurate enough for anyone who reads it to reach the same conclusions.  Terms such as "many", "significantly", "highly", etc., which are subjective and open to multiple interpretations, must be avoided.  Industry-accepted references should be used wherever possible to substantiate the statements and content presented.  Also, every single page should carry a unique identifier including the report title, date of issue and the examiner's basic information / company name for reference purposes.  More importantly, the examiner's background should be clearly stated at the beginning of the report.  The following sections are typically included in an examiner report:-

·         Cover page
·         Executive summary
·         Examiner profile
·         Introduction / Background of the case
·         Scope of work
·         List of supporting documents
·         Observations and analyses conducted
·         Examiner’s log
·         Chain-of-custody records
·         Photographs / reference materials
·         Disclaimers
·         Signature

3.      Quality assurance

When the issues are complex, mistakes and errors may always be present no matter how careful the examiner is.  As such, I would suggest peer review as one of the most effective and essential ways to catch them.  Peer review should be conducted by someone at the same level as, or more senior than, you in terms of experience, and inviting at least two peer reviewers is suggested.  It is not only a general review of grammatical errors or the phrases and wording used, but also quality assurance over any assumptions and analysis made in the report.

The above gives only a basic idea of what a forensic examiner report looks like.  This also concludes the Computer Forensic Workflow overview.  In future computer forensic posts, I will try to share some real-life examples.  I hope you found this useful, and I am always happy to discuss further if you are interested.
