Originally published: March 12, 2015
When children are young, they are dependent creatures. They depend on others for just about everything. When they become teenagers (if not sooner), children strive for their own independence. They want to make their own decisions and choices, as a sign of growing up. Fortunately, life has a little surprise for us – growing up has another dimension: interdependence! As we mature, we increasingly become interdependent members of our family, our community, and our social/work groups. Success therefore depends on a good balance between dependence and independence – i.e., mutual interdependence!
Data science is the same way. A good mix and balance of dependence and independence goes a long way. This truth spells success for digital marketing analytics. However, sometimes (like teenagers) we focus on independence and consequently miss the power of dependence. We take a look here at some examples of each.
If you learned some statistics in school, you heard a lot about the importance of independence for various statistical tests. This is essentially a warning against using a biased (not independent) sample to validate a hypothesis. Biased samples have dependencies that lead to false conclusions.
For example, if I go to a New York Giants football game and I survey the attendees to find out which is “the best football team,” then I will get an overwhelming majority of votes for the Giants. This is clearly a biased sample, leading to survey responses that are not independent of the persons being surveyed.
Another example that illustrates the importance of statistical independence is A/B testing, which many marketers use. It is important that each trial be independent: whether the customer is subjected to the control (A) or to the treatment (B) should be completely independent of all other explanatory factors that might affect the A/B test results comparison.
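One common way to get that independence is to let a random number generator, and nothing else, decide who sees A and who sees B. Here is a minimal sketch of the idea; the function name, the seed, and the customer IDs are made up for illustration and do not come from any particular testing tool.

```python
import random

def assign_groups(customer_ids, seed=0):
    """Assign each customer to 'A' (control) or 'B' (treatment) at random.

    The choice depends only on the random generator, not on anything
    known about the customer, which is what keeps the trials independent.
    """
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    return {cid: rng.choice("AB") for cid in customer_ids}

# Hypothetical customer IDs 0..9999, for illustration only.
assignments = assign_groups(range(10_000), seed=42)
share_b = sum(1 for g in assignments.values() if g == "B") / len(assignments)
print(f"Share of customers in treatment B: {share_b:.3f}")
```

With a large number of customers, the share in each group should land close to one half. If it tracks some customer attribute instead (region, device, time of day), the assignment is no longer independent.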
So independence is good. But dependence is also good, in the right circumstances.
Statistical Dependence & Uplift
Consider the case of uplift. In its simplest statistical form, uplift is the ratio P(X,Y)/[P(X)P(Y)]. The numerator P(X,Y) is the joint probability (the observed frequency) of two events X and Y occurring together. The denominator is the probability that X and Y would occur together if they were completely independent of each other, that is, purely by chance. If X and Y are in fact independent, then the numerator equals the denominator, and uplift equals 1 (i.e., no lift). Conversely, if X and Y co-occur more often than chance would predict, then uplift is greater than 1. That’s marketing gold!
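To make the ratio concrete, here is a small sketch that estimates it from raw counts. The function name and the counts are invented for illustration; they are not real sales figures.

```python
def lift(n_total, n_x, n_y, n_xy):
    """Estimate P(X,Y) / [P(X) * P(Y)] from raw counts.

    n_total: number of transactions observed
    n_x, n_y: transactions containing X, and containing Y
    n_xy: transactions containing both X and Y
    """
    if n_total <= 0 or n_x <= 0 or n_y <= 0:
        raise ValueError("n_total, n_x and n_y must all be positive")
    # (n_xy/n_total) / ((n_x/n_total) * (n_y/n_total)) simplifies to this:
    return n_xy * n_total / (n_x * n_y)

# Made-up counts: X is in 20% of baskets, Y is in 10% of baskets.
independent = lift(n_total=1000, n_x=200, n_y=100, n_xy=20)  # together 2% of the time
dependent = lift(n_total=1000, n_x=200, n_y=100, n_xy=60)    # together 6% of the time
print(independent, dependent)
```

In the first case, X and Y appear together exactly as often as chance predicts (20% of 10% is 2%), so the ratio is 1. In the second case, they appear together three times as often as chance predicts, so the ratio is 3.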
Uplift is beneficial specifically because X and Y are statistically dependent: the stronger their interdependence (correlation), the greater the uplift. The famous case of Walmart customers who bought 7 times more strawberry Pop-Tarts prior to a hurricane than on other days is a powerful example. If hurricanes didn’t inspire customers to make lots of Pop-Tart purchases, then the uplift would not be so high. The joint frequency of co-occurrence of the two events (hurricanes and Pop-Tart purchases) is significantly higher than random joint occurrence.
Mining your transaction logs (or any other data set in any other domain) for high uplift, or for strong downlift (where the above ratio is much less than 1), will identify the most significant cases of mutual dependence between two events X and Y. For example, if X and Y are specific products with strong uplift, then X can be offered through a recommender engine in an online store whenever Y is viewed. Strong downlift suggests the opposite: “do not offer X when the customer is buying Y.” That offer is a lost cause (or a case of “Do Not Disturb”)!
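As a rough sketch of what such mining could look like, the snippet below scores every pair of products that appears together in a toy list of baskets and ranks the pairs by the ratio P(X,Y)/[P(X)P(Y)]. The baskets, the product names, and the `pair_lifts` helper are all hypothetical; a real transaction log would be far larger and would call for a minimum-count threshold so that rare pairs do not dominate the ranking.

```python
from collections import Counter
from itertools import combinations

def pair_lifts(baskets, min_pair_count=1):
    """Return (x, y, ratio) for each co-occurring product pair, highest ratio first."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    results = []
    for (x, y), n_xy in pair_counts.items():
        if n_xy < min_pair_count:
            continue
        # Same ratio as P(X,Y) / [P(X) * P(Y)], written in terms of counts.
        ratio = n_xy * n / (item_counts[x] * item_counts[y])
        results.append((x, y, ratio))
    results.sort(key=lambda r: r[2], reverse=True)
    return results

# Toy baskets, invented for illustration only.
baskets = [
    ["flashlight", "batteries"],
    ["flashlight", "batteries", "water"],
    ["water", "bread"],
    ["bread", "milk"],
    ["milk", "water"],
    ["flashlight", "batteries"],
]
ranked = pair_lifts(baskets)
for x, y, ratio in ranked:
    print(f"{x} + {y}: {ratio:.2f}")
```

Pairs at the top of the list are candidates for a recommender engine, and pairs with a ratio well below 1 are the “Do Not Disturb” cases. Note that this sketch only scores pairs that co-occur at least once; pairs that never appear together (the strongest downlift of all) would need to be enumerated separately.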
Consider one additional powerful use case for statistical dependence: Bayesian inference. This is one of the fundamental methods for discovery and insights from data. Bayes’ rule is also at the core of Hidden Markov Modeling, which is a great technique for predictive analytics. The key statistical concept in Bayes’ rule is conditional probability: for example, P(O|E) denotes the probability that outcome O will occur given the presence of evidence E. If E and O are statistically independent, then P(O|E) simply equals P(O): the evidence tells you nothing new about the outcome, and Bayes’ rule cannot be exploited for business rule discovery, marketing insights, and predictive analytics.
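A short numerical sketch shows why dependence matters here. It applies Bayes’ rule, P(O|E) = P(E|O)P(O)/P(E), to two made-up scenarios: one where the evidence is more likely when the outcome occurs, and one where the evidence is equally likely either way (that is, E and O are independent). The function name and all probabilities are assumptions chosen for illustration.

```python
def posterior(prior_o, p_e_given_o, p_e_given_not_o):
    """Return P(O|E) via Bayes' rule for a yes/no outcome O and evidence E."""
    # Total probability of seeing the evidence at all.
    p_e = p_e_given_o * prior_o + p_e_given_not_o * (1.0 - prior_o)
    if p_e == 0:
        raise ValueError("evidence has zero probability; P(O|E) is undefined")
    return p_e_given_o * prior_o / p_e

# Hypothetical numbers: 10% of visitors convert (O).
# Dependent case: E (say, viewing a product page) is far more common among converters.
informative = posterior(prior_o=0.1, p_e_given_o=0.8, p_e_given_not_o=0.2)
# Independent case: E is equally common among converters and non-converters.
uninformative = posterior(prior_o=0.1, p_e_given_o=0.5, p_e_given_not_o=0.5)
print(f"dependent evidence:   P(O|E) = {informative:.3f}")
print(f"independent evidence: P(O|E) = {uninformative:.3f}")
```

In the dependent case, observing E raises the estimated conversion probability well above the 10% starting point. In the independent case, the result stays at the 10% prior, which is the sense in which independent evidence gives Bayes’ rule nothing to work with.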
As digital marketing spurs the digital business revolution, taking advantage of statistical dependencies in your data collection will lead to higher conversions, more effective marketing campaigns, and improved marketing analytics ROI.
For more from Kirk Borne, follow him on Twitter: @kirkdborne