Customer Churn / Attrition, a machine learning approach
I was chatting with a friend and heard about a challenge common in the SaaS (software as a service) industry: customer attrition, or churn. I'm always listening for business issues that data science tools can help with, so I revved up to see if machine learning could deliver some insight.
I pulled out Data Science for Business by Provost & Fawcett, which has a great case study on customer churn, and found a relevant customer churn data set (http://www.sgi.com/tech/mlc/db/). It comes from the telco industry rather than software, but the customer activity is comparable: delivering services for ongoing regular payments.
Can we take this cloud of client data & find patterns that help us make better decisions? The question in this case: Which clients are likely to leave? The strength of the answer surprised me. I ran the data through several machine learning classifier algorithms, and all models tested at >85% accuracy in predicting whether an individual client is likely to leave. For a data scientist, this is an unexpected level of accuracy for a first pass training session. It reinforces the power of applying machine learning to business challenges, even using smaller data sets that may have only a few thousand, or even hundreds, of entries.
Many of the variables in this data set seem to be around utilization of different services and level of usage. To answer this type of question in your business, data would likely be assembled from multiple sources: CRM system, financial system, internal platforms/databases or even external sources like economic trends or stock prices.
The first benefit of this type of study is that you've now examined data aggregated from all available sources & through the fine-tuning of the predictive model, you may uncover new relevant data points that you'll track in the future to improve predictive accuracy.
The next benefit is that you have a list of clients rated by probability of leaving (or whatever the question is that you're testing). You know, via your model, what an at-risk client 'looks like' and can identify them in your current client base or among new clients...even prospective clients, to evaluate the 'stickiness' of your product/service offering.
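Turning model output into that rated list is a small step. Here's a minimal sketch in Python (not the R used for the analysis in this post), assuming the model has already produced a churn probability per client. The client IDs, probabilities, and the 0.5 threshold are made up for illustration.

```python
from typing import NamedTuple


class ClientScore(NamedTuple):
    client_id: str
    churn_probability: float  # model output in [0, 1]


def rank_by_churn_risk(scores, threshold=0.5):
    """Return clients at or above the threshold, highest risk first."""
    at_risk = [s for s in scores if s.churn_probability >= threshold]
    return sorted(at_risk, key=lambda s: s.churn_probability, reverse=True)


# Hypothetical scores, for illustration only (not from the churn data set).
scores = [
    ClientScore("A-101", 0.12),
    ClientScore("B-202", 0.81),
    ClientScore("C-303", 0.47),
    ClientScore("D-404", 0.66),
]

watch_list = rank_by_churn_risk(scores, threshold=0.5)
for s in watch_list:
    print(f"{s.client_id}: {s.churn_probability:.0%}")
```

Where you set the threshold is a business call: a lower cutoff catches more at-risk clients but gives your retention team a longer list to work.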
Another machine learning approach is clustering...a look into how a population can be divided, starting with the criteria you select (such as likely to leave or likely to be profitable). The cluster diagram gives you a feel for groupings of your population of clients. The dendrogram (hierarchical diagram) shows the dividing points that the machine perceives, defining segments of the population.
The benefit here is in the progression from discovering 'who is at risk' to giving you groupings that can inform your strategy to retain a higher % of clients. The branching points shown on the dendrogram are unlabeled; the machine has no way to know the reason for the branching. Here you have the opportunity to label those nodes, applying your understanding of your customers' whys. Your typical analysis is better informed when you have the % likelihood of losing any given client and when machine learning highlights which types of clients are leaving and which are staying (an equally important insight for evaluating your strategy). Maybe you pinpoint the reason to be defection to a competitor, or find that different types of clients went to distinct competitors. Or maybe one of the groupings tells you that you lost clients based on price.
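To make the 'branching points' idea concrete, here's a minimal sketch of agglomerative (bottom-up) hierarchical clustering in plain Python. It assumes single linkage and Euclidean distance, which may not match the settings behind the dendrogram in this post, and the five 2-D points are made-up stand-ins for clients. Each recorded merge corresponds to one unlabeled branch point on a dendrogram.

```python
import math


def single_linkage(points):
    """Agglomerative clustering with single linkage.

    Returns the merge history as (member_indices, merge_distance) tuples,
    one per dendrogram branch point, in the order the merges happen.
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        merges.append((sorted(clusters[a]), d))
    return merges


# Hypothetical clients described by two made-up numeric features.
points = [(1, 1), (1, 2), (8, 8), (10, 8), (5, 5)]

merges = single_linkage(points)
for members, distance in merges:
    print(f"merge {members} at distance {distance:.2f}")
```

The code only tells you which clients end up grouped and how far apart the groups were when they joined. Naming each branch (price-sensitive, moved to a competitor, and so on) is still your job.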
Details on the machine training:
Data source: www.sgi.com/tech/mlc/db/
The data set has 5000 rows, 21 features (columns) including the target classification.
3 files to download: churn.data, churn.names, churn.test. The data is already split into train/test subsets.
The data set, contributed years ago by a telco company, was cleaner than you'd typically expect. No missing data, and categorical variables were easy to turn into ones or zeros. Evaluating the 20 features, I simply selected the 4 variables that most strongly correlated with the target classification (churn or not churn)...I trained logistic regression, random forest and neural network models using the caret package in R. Caret has some features that enhance accuracy over a simple call to randomForest for training, but it's much slower than a direct call. I'll often use the direct call (to whichever algorithm I'm using) for error testing, then use caret at the fine-tuning stage.
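The 'pick the most strongly correlated variables' step can be sketched as follows. This is plain Python rather than the R used for the actual training, with a hand-rolled Pearson correlation. The column names and values are invented for illustration and are not taken from the churn files, and the toy example keeps the top 2 columns rather than 4.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def top_features(columns, target, k=4):
    """Names of the k columns with the largest absolute correlation to target."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy data: 1 = churned, 0 = stayed.
target = [0, 0, 0, 1, 1, 1]
columns = {
    "support_calls": [1, 0, 2, 4, 5, 3],
    "intl_plan": [0, 0, 1, 1, 1, 0],
    "night_calls": [5, 5, 5, 5, 5, 5],
    "account_age": [10, 30, 20, 25, 15, 20],
}

selected = top_features(columns, target, k=2)
print(selected)
```

Ranking by absolute value matters: a strongly negative correlation is just as useful to a classifier as a strongly positive one.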
Dendrogram source: STHDA Guide to creating dendrograms
Family Incomes by Income Tax Tiers - Nov. 2016
Check out this snapshot of family incomes from the Census Bureau. I set out looking into correlations between federal income tax rates, GDP growth and US Budget deficits - until I started wading into the morass of shifting tax rate tiers -- what a mess! Look for another blog post when I get that sorted.
Early years show gradual income growth (in adjusted dollars), then there seems to be an inflection point for higher income tiers around 1982. I'm still working on making tax rate tables comparable, but I do see that the top federal income tax rate dropped by 29% (from 70% to 50%) in 1982. Btw, there was another reduction of about 23% in the max income tax rate in 1964 (from 91% to 70%).
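The percentage drops quoted for the top rate are relative reductions, (old - new) / old, rather than percentage-point differences. A quick check in Python, using only the rates quoted above (the function name is mine):

```python
def relative_drop(old_rate, new_rate):
    """Reduction as a fraction of the old rate."""
    return (old_rate - new_rate) / old_rate


# (year, old top rate %, new top rate %) as quoted in the text above.
rate_changes = [(1964, 91, 70), (1982, 70, 50)]

for year, old, new in rate_changes:
    points = old - new
    print(f"{year}: {points} percentage points, {relative_drop(old, new):.1%} relative")
```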
Are there conclusions that these trend lines show us? Dunno. The flatter growth rates for lower income tiers and their disparity in growth with higher tiers may be a factor in a feeling of being left behind by the system. Or - an opposite explanation - there may be a progression of taxpayers upward through these tiers, bearing out the American dream. After getting a feel for the data, next comes creating/testing a hypothesis to find the cause(s).
Straight Data on Federal Spending - July 2016
Federal tax receipts lagged GDP growth (the output to be taxed) by 8.65%, 1961 - 2015
(US OMB & World Bank data)
Note that tax receipts growth after 2000 shows a similar pattern of swings but with much greater magnitude than before, reverting in recent years to an earlier normal.
Federal Spending - Social Security included
Note some large shifts - Social Security starts in the 1940s and continues to expand, Medicare begins in the early 1960s.
Federal Spending with Social Security netted out, % by Category
Looking at the above chart of Federal Spending, I thought I had a good picture of the split between defense/social spending categories.
Federal Spending - Social Security excluded
(excluded because there is nearly equivalent offsetting revenue for US govt so ins/outs become 'noise' in this analysis)
This better description of Spending shows the scale/relationship of spending categories with less distortion
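Netting a category out and recomputing each remaining category's % share works like this. A minimal Python sketch (the charts in this post were built in R); the category names and dollar amounts are round placeholder numbers, not OMB figures.

```python
def category_shares(spending, exclude=()):
    """Each category's share of total spending, after dropping excluded ones."""
    kept = {name: amount for name, amount in spending.items() if name not in exclude}
    total = sum(kept.values())
    return {name: amount / total for name, amount in kept.items()}


# Placeholder amounts for illustration only (NOT actual federal spending).
spending = {
    "Defense": 600,
    "Social Security": 900,
    "Medicare": 500,
    "Other": 1000,
}

with_ss = category_shares(spending)
without_ss = category_shares(spending, exclude=("Social Security",))

for name in without_ss:
    print(f"{name}: {with_ss[name]:.1%} with SS, {without_ss[name]:.1%} with SS netted out")
```

Every remaining category's share rises once Social Security is removed from the total, which is why the two charts tell visibly different stories about the same dollars.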
Map Your Data for Valuable Insight - May 2016 post
I'm generating this visualization using R. Another great tool for generating map visualizations (or a wide range of other useful visualizations) is Tableau. Tableau is easy to use but medium expensive; on the other hand, R is free but does require some investment of time to build capability.
Data Analytics services are available through DataXploits.com
- Customer Sentiment appearing in social media
- Twitter, Facebook (public posts)
- Measure engagement generated by your social marketing campaigns