Data analytics is widely applied in data mining and data categorization today. Among all the available data mining techniques, Detection Scenario ("DS") approaches are the most commonly used in the market, for example transaction monitoring in Financial Crime Compliance for Anti-Money Laundering, proactive fraud detection, customer categorization and classification, etc.
To assess the effectiveness and efficiency of a DS, sampling is one of the most common and most defensible ways to justify the results. As such, "How do we select the sample?" and "How many samples should we pick?" become the two most critical questions. Most people today would answer these according to their experience. However, as a data analytics professional, I would always prefer a more scientific (or more mathematical) approach to sampling.
1. How to sample?
Depending on the ultimate goal of the sampling exercise, we might use either probabilistic or non-probabilistic sampling. The most commonly used methodologies are simple random sampling and systematic sampling on the probabilistic side, versus purposeful sampling on the non-probabilistic side.
Simple random sampling is the easiest, like a lucky draw or a lottery, and is normally used when we do not have a strong understanding of or feel for the data. However, it is vulnerable to sampling error, because the randomness of the selection may produce a sample that does not reflect the makeup of the population. To spread the sample across the whole data population, a systematic sampling method can be applied instead, such as Cumulative Monetary Amounts (CMA) sampling (a.k.a. monetary unit sampling or dollar-unit sampling), a typical example commonly used in auditing and other monetary-related analyses.
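To make the contrast concrete, below is a minimal Python sketch of both approaches. The transaction amounts, seed, and helper names are illustrative assumptions, not data from any real engagement; note how CMA sampling makes larger amounts proportionally more likely to be selected, which is exactly why it suits monetary analyses.

```python
import random

# Hypothetical transaction amounts; in practice these come from your data set.
amounts = [120.0, 5400.0, 75.0, 980.0, 15000.0, 430.0, 2100.0, 60.0, 8800.0, 310.0]

def simple_random_sample(population, k, seed=42):
    """Draw k items uniformly at random, like a lucky draw."""
    rng = random.Random(seed)
    return rng.sample(population, k)

def cma_sample(amounts, k, seed=42):
    """Cumulative Monetary Amounts (monetary unit) sampling: pick k monetary
    units at a fixed interval through the cumulative total, so larger amounts
    are proportionally more likely to be selected."""
    total = sum(amounts)
    interval = total / k
    rng = random.Random(seed)
    start = rng.uniform(0, interval)
    targets = [start + i * interval for i in range(k)]
    selected, cumulative, idx = [], 0.0, 0
    for t in targets:
        # Advance until the running total reaches the target monetary unit.
        while cumulative + amounts[idx] < t:
            cumulative += amounts[idx]
            idx += 1
        if idx not in selected:  # one large amount may cover several targets
            selected.append(idx)
    return [amounts[i] for i in selected]

print(simple_random_sample(amounts, 3))
print(cma_sample(amounts, 3))
```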
In contrast, purposeful sampling relies mainly on the analyst's understanding and perception of the data. Samples are drawn from the data that the analyst believes to be most representative. A typical example is taking the customers / suppliers with the top X purchase / sales amounts, whose coverage exceeds a certain percentage of the total, as the samples for testing, since customers / suppliers with lower purchase / sales amounts are generally considered insignificant to the study; see the sketch below.
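Here is a minimal sketch of that top-X coverage selection. The customer names, totals, and the 80% coverage target are placeholder assumptions for illustration only.

```python
# Hypothetical customer purchase totals; figures are illustrative only.
purchases = {
    "Customer A": 500000, "Customer B": 320000, "Customer C": 150000,
    "Customer D": 80000, "Customer E": 30000, "Customer F": 12000,
}

def top_coverage_sample(totals, coverage=0.8):
    """Take the largest customers until their combined amount reaches the
    target share (e.g. 80%) of the grand total."""
    grand_total = sum(totals.values())
    selected, running = [], 0.0
    for name, amount in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(name)
        running += amount
        if running / grand_total >= coverage:
            break
    return selected

print(top_coverage_sample(purchases, coverage=0.8))
# -> ['Customer A', 'Customer B', 'Customer C'] covers >= 80% of total sales
```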
2. How many samples are required?
In my experience, there are a number of ways to determine the sample size, such as relying on leadership experience, or working within the sampling team's size and time constraints. However, whether the result is justifiable and defensible is usually the question. As a data analytics professional, my preference is always a more scientific approach, and we might start by deriving the number from the "Margin of Error" in statistics. The Margin of Error (or "Precision Level") is a statistic expressing the amount of random sampling error, based on the Z-score, the standard deviation, and the sample size. I will write a step-by-step sample size determination walk-through in the next post.
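As a preview, below is a minimal sketch of the two standard margin-of-error formulas: n = (z·σ/e)² for a mean, and Cochran's n = z²·p(1−p)/e² for a proportion. The z-value, standard deviation estimate, and tolerances here are illustrative assumptions; the full derivation will follow in the next post.

```python
import math

def sample_size_for_mean(margin_of_error, std_dev, z=1.96):
    """n = (z * sigma / e)^2 — sample size so the margin of error on a mean
    stays within e, given the Z-score and an estimated standard deviation."""
    return math.ceil((z * std_dev / margin_of_error) ** 2)

def sample_size_for_proportion(margin_of_error, p=0.5, z=1.96):
    """Cochran's formula n = z^2 * p * (1 - p) / e^2 — the common variant
    when testing a pass/fail rate; p = 0.5 is the most conservative choice."""
    return math.ceil((z ** 2) * p * (1 - p) / (margin_of_error ** 2))

# 95% confidence (z = 1.96), tolerate a $50 margin of error on the mean
# transaction amount, assuming an estimated standard deviation of $400:
print(sample_size_for_mean(50, 400))        # 246
# 95% confidence, 5% margin of error on an exception rate:
print(sample_size_for_proportion(0.05))     # 385
```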
In conclusion, an appropriate sampling method with a proper sample size is essential to any analytics project in which sampling is applied. Moreover, in today's Big Data era, both the sampling method and the resulting sample size had better be justifiable and defensible. If you have found any errors in the above, or have any thoughts on or interest in this topic, please feel free to drop me a note. Thanks.