Thursday, April 21, 2016

Data Analytics: The Common Two-Step Approach to Identifying Data Outliers

Fancy data-related wording aside, one of the most common goals for any data expert is identifying data outliers. There are plenty of ways to identify outliers by leveraging both internal and external data, and today I am going to share some of the most common ones.

In statistics, an outlier is a data point that stands apart from the bulk of the data set. An outlier is sometimes caused by a data error but, in most circumstances, it is something we should be aware of and pay special attention to. The following is a common two-step approach, covering both the quantitative and the qualitative perspective, to identifying data outliers:

1.      Statistical approach

This is the quantitative approach: we leverage common statistical concepts such as, but not limited to, the mean and standard deviation to identify outliers, regardless of industry knowledge. It could also include more complex statistical models, such as those based on probability theory, but today I will discuss just one of the most common ideas: identifying the data points that fall outside the mean plus or minus X times the standard deviation.

Example on customers’ historical spending amounts

Assume all these customers have the same spending power and we are launching a new membership scheme. We could calculate the mean plus and minus 2 standard deviations of the spending amounts as threshold 1 and threshold 2 respectively, so that:
  • Whoever has total spending above threshold 1 might be considered one of the most valuable customers; the top-level premium VIP card should be given to them, aiming to retain their spending with us and show our appreciation; and
  • Whoever stands between thresholds 1 and 2 might be considered a normal customer; normal membership should be given to encourage them to spend more and reach the top-level premium VIP class; and
  • For the rest, below threshold 2, no membership might be granted, but we might enhance something like a points collection scheme based on spending to encourage this group of customers to spend more, become members and earn the subsequent benefits.
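
The three groups above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the population standard deviation and X = 2. The function names, tier labels and spending figures are all made up for the example and do not come from any real data set.

```python
from statistics import mean, pstdev


def spending_thresholds(amounts, k=2.0):
    """Return (threshold 1, threshold 2) = mean plus / minus k standard deviations."""
    mu = mean(amounts)
    # Population standard deviation (an assumption); statistics.stdev
    # would give the sample version instead.
    sigma = pstdev(amounts)
    return mu + k * sigma, mu - k * sigma


def membership_tier(amount, upper, lower):
    """Map one customer's total spending to one of the three groups."""
    if amount > upper:
        return "premium VIP"
    if amount >= lower:
        return "normal member"
    return "points scheme"


# Hypothetical total spending per customer, invented for illustration.
spending = {
    "A": 510, "B": 530, "C": 490, "D": 520, "E": 500,
    "F": 515, "G": 505, "H": 510, "I": 1010, "J": 10,
}

upper, lower = spending_thresholds(list(spending.values()))
tiers = {name: membership_tier(amount, upper, lower)
         for name, amount in spending.items()}

for name, tier in tiers.items():
    print(f"{name}: {spending[name]:>5} -> {tier}")
```

With these made-up figures, customer I lands above threshold 1 and customer J below threshold 2, while everyone else falls in between. One thing to watch on real data: spending amounts are often skewed, and threshold 2 can then come out negative, in which case nobody falls into the third group and a different lower cut-off may be needed.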

2.      Detection scenario logic

Depending on the nature of the business, a qualitative approach can be applied: detection scenario logic built on whatever data elements are available, such as age, gender, profession, etc.

Example on customer data, following up on the above example

Assuming the target customers are in the high-income group, which can be classified by age group and profession, we could further categorize the customers in addition to the spending amounts from the previous example. One point to note is that, in order to identify the right scenario logic, we might consider leveraging both internal and external data sources: for example, government census data showing which professions and age groups make up the high-income population in this case. After this, we might come up with some logic like:
  • Top premium VIP might only be granted to those with high spending amounts, age above 40, professions in certain industries, etc., as the current customer base; and
  • A 2nd grade premium VIP class for those with normal spending amounts, age between 20 and 40, professions in certain industries, etc., who will potentially be the successors of the current top-level people; we should treat this group as a high-potential customer base.
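
Scenario logic like this can be written as a handful of explicit rules. The sketch below is only one possible way to express it in Python. The `Customer` fields, the `vip_class` function, the "high"/"normal"/"low" spending labels and the list of target professions are all hypothetical, since the actual industries would come from your own business knowledge and external data such as census figures.

```python
from dataclasses import dataclass

# Hypothetical stand-in for "professions in certain industries".
TARGET_PROFESSIONS = {"finance", "medicine", "law"}


@dataclass
class Customer:
    spending_level: str  # "high", "normal" or "low", from the statistical step
    age: int
    profession: str


def vip_class(customer):
    """Apply the two scenario rules; anyone matching neither gets 'none'."""
    if customer.profession not in TARGET_PROFESSIONS:
        return "none"
    if customer.spending_level == "high" and customer.age > 40:
        return "top premium VIP"
    if customer.spending_level == "normal" and 20 <= customer.age <= 40:
        return "2nd grade premium VIP"
    return "none"


# Made-up customers for illustration.
examples = [
    Customer("high", 52, "finance"),
    Customer("normal", 31, "medicine"),
    Customer("high", 45, "retail"),
    Customer("normal", 55, "law"),
]

for c in examples:
    print(c, "->", vip_class(c))
```

Keeping each rule as a separate, readable condition makes it easier to review the logic with business users and to adjust the age bands or professions when new data comes in.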

In conclusion, there are plenty of statistical methodologies and scenarios that we might apply in our daily operations, depending on the availability of data sources and the data analysts’ experience. If you are interested in knowing more or would like to discuss this topic further, please feel free to reach out to me; I always appreciate the chance to learn from all of you.
