Tuesday, March 21, 2017

Data Analytics: Sampling

Data analytics is widely applied in data mining and data categorization today.  Among all the available data mining techniques, Detection Scenario ("DS") approaches are the most commonly used in the market, for example transaction monitoring in Financial Crime Compliance for Anti-Money Laundering, proactive fraud detection, and customer categorization and classification.

To assess the effectiveness and efficiency of a DS, sampling is one of the most common and defensible approaches.  The two most critical questions are therefore "How do we select the sample?" and "How many samples should we pick?"  Most people today answer these according to their experience.  However, as a data analytics professional, I would always prefer a more scientific (or more mathematical) way of sampling.

1.      How to sample?

Depending on the ultimate goal of the sampling process, we might use either probabilistic or non-probabilistic sampling.  The most commonly used methodologies are simple random sampling and systematic sampling in probabilistic sampling, versus purposeful sampling in non-probabilistic sampling.

Simple random sampling is the easiest, like a lucky draw or a lottery, and is normally used when we do not have a strong understanding and perception of the data.  However, it is vulnerable to sampling error: the randomness of the selection may produce a sample that does not reflect the makeup of the population.  To spread the sample across the whole data population, a systematic sampling method can be applied instead, such as Cumulative Monetary Amounts (CMA) sampling (a.k.a. monetary unit sampling or dollar-unit sampling), which is a typical example commonly used in auditing and other monetary-related analyses.
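To make the contrast concrete, here is a minimal sketch of both methods in Python.  The transaction amounts are made-up illustrative data, and `cma_sample` is a simplified monetary unit sampler of my own naming: it steps through the cumulative total at a fixed interval, so larger amounts are more likely to be picked.

```python
import random

# Hypothetical transaction amounts (illustrative data only)
amounts = [120, 4500, 80, 960, 15000, 300, 2200, 75, 640, 8800]

random.seed(42)

# Simple random sampling: every record has an equal chance of selection.
simple_sample = random.sample(range(len(amounts)), k=3)

def cma_sample(values, n):
    """Cumulative Monetary Amounts sampling: treat each dollar as a
    sampling unit, select every `interval`-th dollar from a random start,
    and keep the record that contains each selected dollar."""
    total = sum(values)
    interval = total / n
    start = random.uniform(0, interval)
    picks, cumulative, idx = [], 0.0, 0
    for target in (start + i * interval for i in range(n)):
        # Advance to the record whose cumulative range covers `target`.
        while cumulative + values[idx] < target:
            cumulative += values[idx]
            idx += 1
        if idx not in picks:          # a large record may cover two targets
            picks.append(idx)
    return picks

cma_picks = cma_sample(amounts, 3)
```

Note that any record larger than the sampling interval (here, the 15,000 transaction) is guaranteed to be selected, which is exactly why auditors favour this method for monetary testing.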

In contrast, purposeful sampling relies mainly on the analyst's understanding and perception of the data.  Samples are drawn from the records the analyst believes are most representative.  A typical example: we take the customers or suppliers with the top X purchase or sales amounts, covering a certain percentage of the total, as the samples for testing, while customers or suppliers with lower purchase or sales amounts are generally considered insignificant to the study.

2.      How many samples are required?

In my experience, the sample size is determined in a number of ways, such as by leadership experience, or subject to the sampling team's size and time limitations.  However, the question is usually whether the size is justifiable and defensible.  As a data analytics professional, my preference is always a more scientific approach, and we can start deriving the number from the "Margin of Error" in statistics.  The Margin of Error (or "Precision Level") is a statistic expressing the amount of random sampling error, determined by the Z-score, the standard deviation, and the sample size.  I will write a step-by-step sample size determination walk-through in the next post.
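As a preview of that walk-through, one common way to turn a margin of error into a sample size is Cochran's formula for proportions, n = z²·p(1−p)/E², optionally adjusted with the finite population correction.  The defaults below (95% confidence, p = 0.5, 5% margin) are standard textbook assumptions, not figures from this post.

```python
import math

def sample_size(z=1.96, p=0.5, margin=0.05, population=None):
    """Cochran's sample size formula: n = z^2 * p * (1 - p) / E^2.
    z       : Z-score for the chosen confidence level (1.96 for 95%)
    p       : expected proportion (0.5 is the most conservative choice)
    margin  : acceptable margin of error E
    population : if given, apply the finite population correction
                 n' = n / (1 + (n - 1) / N)"""
    n = (z ** 2) * p * (1 - p) / margin ** 2
    if population:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)
```

For an effectively infinite population this gives the familiar 385 samples at 95% confidence with a 5% margin of error; for a population of 10,000 records the correction brings it down to 370.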

In conclusion, an appropriate sampling method with a proper sample size is essential to any analytics project that involves sampling.  Moreover, in today's Big Data era, both the sampling method and the resulting sample size should be justifiable and defensible.  If you find any errors in the above, or have any thoughts or interest in this topic, please feel free to drop me a note.  Thanks.
