Tuesday, March 28, 2017

Data Analytics: - Step by step sample size determination walk-through

As discussed in the previous post, throughout my experience as a Data Analytics consultant I have been required to perform estimation on almost every project I have worked on.  Sampling is one of the most commonly used analytics techniques, since a properly drawn set of samples can represent an unknown or very large population.

Although sampling is widely used, many people today still determine the sample size purely by personal professional judgement, in other words, gut feeling.  In fact, there are many ways to determine the sample size in a more scientific and systematic manner, and a properly derived sample size makes your analysis far more justifiable and defensible.


Here we start our calculation with the “Margin of Error”: -


Depending on the nature of the project, we first determine our precision level (i.e. Margin of Error), which we would normally set to 5% or 10%.

Subsequently, we set our confidence level, typically 95%, which means that in normal circumstances we are 95% certain that our samples represent the whole population.

At a 95% confidence level, the area to the left of the critical z-score is 1 – 0.05 / 2 = 0.975.  Therefore the z-score is 1.96.
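As a quick sanity check, the critical z-score can be recovered from the standard normal distribution with Python's standard library alone (the variable names here are my own, not from the post):

```python
from statistics import NormalDist

confidence = 0.95
alpha = 1 - confidence
# Two-tailed: the area to the LEFT of the critical z-score is 1 - alpha/2
z = NormalDist().inv_cdf(1 - alpha / 2)
print(round(z, 2))  # 1.96
```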

Substituting all the above into the standard sample-size formula, n = z² × p(1 − p) / e² (taking p = 0.5 for maximum variability), we determine that 385 is the right number of samples to pick from our unknown or very large population.
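The calculation above can be sketched as a small helper.  This is my own function name and parameterisation, assuming the post uses Cochran's formula with p = 0.5:

```python
import math
from statistics import NormalDist

def sample_size(confidence=0.95, margin_of_error=0.05, p=0.5):
    """Cochran's formula: n = z^2 * p * (1 - p) / e^2, rounded up.

    p = 0.5 maximises p * (1 - p), giving the most conservative size.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

print(sample_size())  # 385
```

Relaxing the margin of error to 10% shrinks the requirement to under 100 samples, which is why the precision level is worth agreeing on up front.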

For a small and finite population, to keep things simple, I am not going to elaborate in detail here.  However, we can still apply the finite population correction, n′ = n / (1 + (n − 1) / N), to the sample size identified above when the population size N is known and finite.


For example, assume the population is 10,000 items.  Applying the formula, we determine that 371 samples (instead of 385) is the right number of samples to pick from a population of 10,000 items.
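The finite population correction can be sketched the same way (again, the function name is mine; I assume the post's formula is the standard correction n′ = n / (1 + (n − 1) / N)):

```python
import math

def finite_sample_size(n0, population):
    """Finite population correction: n' = n0 / (1 + (n0 - 1) / N)."""
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(finite_sample_size(385, 10_000))  # 371
```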

Hope the above gives you a high-level understanding of how to determine your sample size, so that no more “professional judgement” is required when you are asked to provide an estimation or justification based on sampling.  If you have found any errors in the above, or have any thoughts or interest in this topic, please feel free to drop me a note.  Thanks.



Tuesday, March 21, 2017

Data Analytics: - Sampling

Data Analytics is widely applied in data mining and data categorization today.  Among the available data mining techniques, Detection Scenario (“DS”) approaches are the most commonly used in the market, for example transaction monitoring in Financial Crime Compliance for Anti-Money Laundering, proactive fraud detection, customer categorization and classification, etc.

To assess the effectiveness and efficiency of a DS, sampling is one of the most common and reasonable-sounding ways to provide justification.  As such, “How to select your sample?” and “How many samples shall we pick?” are the two most critical questions.  Most people today answer these according to their experience.  However, as a data analytics professional, I always prefer a more scientific (or more mathematical) way of sampling.

1.      How to sample?

Depending on the ultimate goal of the sampling process, we might use either probabilistic or non-probabilistic sampling.  The most commonly used methodologies are simple random sampling and systematic sampling on the probabilistic side, versus purposeful sampling on the non-probabilistic side.

Simple random sampling is the easiest, like a lucky draw or lottery, and is normally used when we do not have a strong understanding or perception of the data.  However, it can be vulnerable to sampling error, because the randomness of the selection may yield a sample that does not reflect the makeup of the population.  To spread the sample across the whole data population, a systematic sampling method is applied instead, such as Cumulative Monetary Amounts (CMA) sampling (a.k.a. monetary unit sampling or dollar-unit sampling), a typical example commonly used in auditing and other monetary-related analyses.
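A minimal sketch of the systematic approach, assuming the usual "every k-th item after a random start" scheme (the function name and interface are my own, not from the post):

```python
import random

def systematic_sample(items, n):
    """Pick every k-th item after a random start within the first interval."""
    k = len(items) // n          # sampling interval
    start = random.randrange(k)  # random offset in [0, k)
    return items[start::k][:n]

# Drawing 10 items from a population of 1,000 yields one item
# from each block of 100, spreading the sample across the data.
sample = systematic_sample(list(range(1000)), 10)
print(len(sample))  # 10
```

CMA sampling follows the same idea but steps through cumulative monetary value rather than item positions, so larger-value items are more likely to be selected.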

In contrast, purposeful sampling relies mainly on the analyst's understanding and perception of the data.  Samples are drawn from the data the analyst believes is most representative.  A typical example: we take the customers / suppliers with the top X purchase / sales amounts that together cover a certain percentage of the total as the samples for testing, since customers / suppliers with lower purchase / sales amounts are generally considered insignificant to our study.

2.      How many samples are required?

In my experience, there are a number of ways to determine the sample size, such as relying on leadership experience, or working within the sampling team's size and time limitations.  However, whether the result is justifiable and defensible is usually the question.  As a data analytics professional, my preference is always a more scientific approach, and we can start identifying the number from the “Margin of Error” in statistics.  The Margin of Error (or “Precision Level”) is a statistic expressing the amount of random sampling error, based on the z-score, the standard deviation and the sample size.  I will write a step-by-step sample size determination walk-through in the next post.

In conclusion, an appropriate sampling method with a proper sample size is essential to any analytics project in which sampling is applied.  More to the point, in today's Big Data era both the sampling method and the resulting sample size need to be justifiable and defensible.  If you have found any errors in the above, or have any thoughts or interest in this topic, please feel free to drop me a note.  Thanks.