WSDM — KK Box’s Churn Prediction Challenge
Table of contents:
- Business problem
- Business problem to ML
- Metrics and business constraints
- Source of data
- Existing approaches and first cut solution
- Exploratory data analysis/Data cleaning
- Feature engineering
- Training the model
- Models comparison
- GitHub repository
- Future work
- References
WSDM KKBox churn prediction challenge:
In this problem, we need to predict whether a user will churn after his/her subscription expires. Most of KKBOX's revenue comes from 30-day plans, so a user counts as churned when there is no new service subscription transaction within 30 days after the current subscription's expiry date.
Business problem:
This problem is called churn prediction. What is churn? Churn quantifies the number of customers who have left your brand by cancelling their subscription or by no longer paying for your services.
KKBOX, a popular music streaming service provider in Asia, runs a subscription and advertisement-based business model. Subscriptions, primarily monthly, are the major source of revenue, and users can cancel their subscription whenever they see fit. Hence, improving churn prediction is indispensable for KKBOX's growth.
Churn is bad news for any business, as it costs five times as much to attract a new customer as it does to keep an existing one. A high customer churn rate will hit your company's finances hard. By leveraging machine learning (ML), you can anticipate potential churners who are about to abandon your services. Still not convinced? It will cost you 16 times more to bring a new customer up to the same level as an existing customer.
Business problem to ML :
I am converting this business problem into a two-class classification problem: the user will churn = 1 (does not re-subscribe) or will not churn = 0 (re-subscribes).
So it is a supervised classification task.
Metrics:
Why log loss? We are predicting which users will churn, so it is better to work with probability scores: how likely is each user to churn? Log loss is computed directly on those probabilities, and it penalizes confident wrong predictions heavily.
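To make the metric concrete: for labels y and predicted churn probabilities p, binary log loss is the mean of -(y*log(p) + (1-y)*log(1-p)). The snippet below is a small sketch of that formula with made-up numbers (the function name and the sample values are only for illustration). It shows how a single confident mistake raises the loss sharply.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; y_prob is the predicted probability of churn (class 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
cautious = [0.6, 0.4, 0.6, 0.4]         # every prediction leans the right way
overconfident = [0.6, 0.4, 0.01, 0.4]   # one churner is given a 1% churn probability

loss_cautious = binary_log_loss(labels, cautious)
loss_overconfident = binary_log_loss(labels, overconfident)
print(loss_cautious, loss_overconfident)
```

Only one prediction differs between the two lists, yet the second loss is roughly three times the first.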
Business constraints:
- No low-latency constraint.
- We should not miss churn users, because that directly impacts the business, so recall should be high (misclassifying a churner as a non-churner should be avoided).
- Interpretability is very important: instead of just predicting churn or not, it is important to explain why a user is predicted to churn.
- Giving a probability of churn is mandatory; it is also useful for setting the decision threshold for better accuracy.
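The recall and threshold constraints go together: with probabilities in hand, lowering the decision threshold flags more users as churners and so raises recall (at the cost of more false alarms). Here is a minimal sketch with made-up labels and probabilities; the helper name and the two thresholds are assumptions for illustration, not values from this case study.

```python
def recall_at_threshold(y_true, y_prob, threshold):
    """Share of actual churners (label 1) flagged when predicting churn for prob >= threshold."""
    flagged = [p >= threshold for p in y_prob]
    true_pos = sum(1 for y, f in zip(y_true, flagged) if y == 1 and f)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0

# Made-up example: 4 churners and 4 non-churners.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.45, 0.3, 0.2, 0.55, 0.1, 0.05, 0.6]

recall_default = recall_at_threshold(y_true, y_prob, 0.5)
recall_lowered = recall_at_threshold(y_true, y_prob, 0.25)
print(recall_default, recall_lowered)
```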
Source of data:
The data was taken from Kaggle, our home for data science :)
WSDM - KK Box's Churn Prediction Challenge ("Can you predict when subscribers will churn?"), www.kaggle.com
Existing Solutions:
Predicting Customer Churn: Extreme Gradient Boosting with Temporal Data by Bryan Gregory(1st place entry):
I chose this challenge mainly to focus on manual feature engineering; the first-place winner used only traditional machine learning models, which shows how much feature engineering matters.
Most of the features were extracted by aggregation and comparison over different time windows or the entire history, using operations like average or sum. A total of 76 features were used, and the 10 most useful are shared. XGBoost with a weight of 0.88 and LightGBM with a weight of 0.12 were the models used for prediction.
This solution shows that simple ML techniques, combined with a lot of feature engineering, can achieve a better log loss.
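The 0.88 / 0.12 weighting can be read as a weighted average of the two models' predicted churn probabilities. The sketch below assumes exactly that reading (the write-up does not spell out the blending code), and the probabilities are made up.

```python
def blend(p_xgb, p_lgbm, w_xgb=0.88, w_lgbm=0.12):
    """Weighted average of two models' churn probabilities, row by row."""
    return [w_xgb * a + w_lgbm * b for a, b in zip(p_xgb, p_lgbm)]

p_xgb = [0.10, 0.80, 0.35]    # made-up XGBoost probabilities
p_lgbm = [0.20, 0.70, 0.50]   # made-up LightGBM probabilities

blended = blend(p_xgb, p_lgbm)
print(blended)
```

Because the weights sum to 1, the blended values stay valid probabilities.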
First cut solution:
First, I will apply simple models like logistic regression after the feature engineering part, and later I will apply complex models like a neural net to improve the log loss.
One such complex model is a two-stage stack:
In the 1st stage, I take the predicted probabilities for the two classes (0 and 1) from each of three LGBM classifiers and concatenate them as input to a feed-forward neural network (FFNN) in the 2nd stage. The FFNN predicts the final probability of customer churn.
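The two-stage idea can be sketched as follows. This is only an illustration under stated assumptions: scikit-learn's GradientBoostingClassifier stands in for the three LGBM classifiers, MLPClassifier stands in for the FFNN, the data is synthetic, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the churn feature table.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5, stratify=y, random_state=0)

# Stage 1: three boosted-tree models with different settings, fitted on part A.
stage1 = [
    GradientBoostingClassifier(n_estimators=50, learning_rate=lr, random_state=seed)
    for seed, lr in enumerate((0.05, 0.1, 0.2))
]
for model in stage1:
    model.fit(X_a, y_a)

def stack_features(models, X):
    """Concatenate each model's [P(class 0), P(class 1)] columns side by side."""
    return np.hstack([m.predict_proba(X) for m in models])

# Stage 2: a small feed-forward network trained on stage-1 probabilities for part B,
# i.e. rows the stage-1 models did not see while fitting.
meta_X = stack_features(stage1, X_b)
ffnn = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
ffnn.fit(meta_X, y_b)

final_churn_prob = ffnn.predict_proba(meta_X)[:, 1]
print(meta_X.shape)
```

Training the second stage on rows the first stage has not seen keeps the FFNN from learning on over-optimistic stage-1 probabilities.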
Exploratory Data Analysis:
First we will look at the columns present in each file, and later we will do an in-depth analysis of each feature.
transactions_v2: this table contains the users' transaction data and has the following columns
- msno: user id
- payment_method_id: payment method
- payment_plan_days: length of membership plan in days
- plan_list_price: in New Taiwan Dollar (NTD)
- actual_amount_paid: in New Taiwan Dollar (NTD)
- is_auto_renew
- transaction_date: format %Y%m%d
- membership_expire_date: format %Y%m%d
- is_cancel: whether or not the user canceled the membership in this transaction.
Members (members.csv): contains data about the users, but not for every user. It has the following columns:
- msno
- city
- bd: age
- gender
- registered_via: registration method
- registration_init_time: format %Y%m%d
- expiration_date: format %Y%m%d, taken as a snapshot at the time members.csv was extracted. It does not represent the actual churn behavior.
User logs (user_logs.csv): contains, per user and day, the number of songs listened to, bucketed by how much of each song was played.
- date: format %Y%m%d
- num_25: # of songs played less than 25% of the song length
- num_50: # of songs played between 25% and 50% of the song length
- num_75: # of songs played between 50% and 75% of the song length
- num_985: # of songs played between 75% and 98.5% of the song length
- num_100: # of songs played over 98.5% of the song length
- num_unique: # of unique songs played
- total_seconds: total seconds played
Train data (train_v2.csv): contains the churn data for March and has two columns
- msno: the user id of the customer
- is_churn: 1 (churn); 0 (re-subscribed)
Test data (sample_submission_v2.csv): contains the users to predict for April, in the same two-column format
- msno: the user id of the customer
- is_churn: the value to predict, 1 (churn); 0 (re-subscribed)
The data is spread across multiple tables, so we combine them into one table keyed on the users in the train data.
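A rough sketch of that combination step with pandas left joins on msno is shown below. The tiny tables are made up, and keeping only each user's most recent transaction is an assumption made for the example; the actual aggregation may differ.

```python
import pandas as pd

# Made-up miniature versions of the three tables.
train = pd.DataFrame({"msno": ["u1", "u2", "u3"], "is_churn": [0, 1, 0]})
members = pd.DataFrame({"msno": ["u1", "u3"], "city": [1, 13], "registered_via": [7, 9]})
transactions = pd.DataFrame({
    "msno": ["u1", "u1", "u2", "u3"],
    "transaction_date": [20170201, 20170301, 20170215, 20170310],
    "payment_plan_days": [30, 30, 30, 410],
})

# Assumption: keep one row per user, here the most recent transaction.
latest_tx = (transactions.sort_values("transaction_date")
             .drop_duplicates("msno", keep="last"))

# Left joins keep every labelled user, even those missing from members.csv.
train_data = (train.merge(latest_tx, on="msno", how="left")
                   .merge(members, on="msno", how="left"))
print(train_data)
```

User u2 has no row in the members table, so its member columns come out as nulls, which is what the data cleaning step then has to handle.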
Data cleaning: checking for null values and filling them with mean and median values.
For the columns that contain nulls, I fill the missing values with the column mean or median.
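As an illustration of this kind of fill, the sketch below counts nulls per column and then imputes them. Which column gets the mean and which gets the median is an assumption here (median for a skewed column, mean for a near-constant one), and the values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "total_minutes": [120.0, np.nan, 3000.0, 45.0],    # skewed, so the median is the safer fill
    "payment_plan_days": [30.0, 30.0, np.nan, 30.0],
    "gender": ["male", None, "female", None],
})

null_counts = df.isnull().sum()  # null count per column, before filling
print(null_counts)

df["total_minutes"] = df["total_minutes"].fillna(df["total_minutes"].median())
df["payment_plan_days"] = df["payment_plan_days"].fillna(df["payment_plan_days"].mean())
df["gender"] = df["gender"].fillna("NP")  # "NP" = not provided
```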
Now we will do an in-depth EDA with respect to the churn label.
Payment_plan_days:
The plot makes it very clear that most users are on the 30-day plan.
For churn users, the most common plan lengths are 30, 150, 180, 200, 210, 230, 400, 410 and 415 days.
Plan_list_price:
The plot shows that many churn users are on plans priced above 180.
In the same way, most non-churn users are on plans priced between 0 and 200.
So users on higher-priced plans are more likely to churn.
Here I do more analysis of plan_list_price using bar plots.
We can observe that some users also churn when the plan price is very high, e.g. 1788.
Is_auto_renew:
Non-churn users are more interested in auto-renew than churn users.
Around 93.5% of non-churn users have auto-renew enabled.
By contrast, 28.6% of churn users do not have auto-renew enabled.
City:
Most users are from city 1, whether they churn or not.
Of all non-churn users, around 59% are from city 1 and 10% are from city 13.
Of all churn users, around 39% are from city 1, 14% are from city 13 and 11% are from city 5.
We can also conclude that most users come from cities 1, 13 and 5.
Is_cancel:
Almost all non-churn users, around 99.4%, did not cancel their service before the membership expiry.
Around 16% of churn users cancelled their service before the end of their plan (the membership expiry); the remaining churn users did not cancel.
So churn users cancel the service more often than non-churn users.
Age:
The 80th percentile of age is 28; printing the 85th and 90th percentiles gives approximately 31 and 34.
Note also that some users have outlier ages, so this variable needs preprocessing.
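One possible way to inspect and preprocess the age column is sketched below. The valid range of 10 to 80 and the median replacement are assumptions chosen for the example, not the rule used in this case study, and the ages are made up.

```python
import numpy as np

# Made-up ages, including a few implausible values.
age = np.array([0, 22, 25, 27, 28, 31, 34, -7000, 2015, 19], dtype=float)

# Percentiles like the 80th, 85th and 90th give a quick view of the distribution.
print(np.percentile(age, [80, 85, 90]))

# Assumed rule: ages outside 10..80 are treated as invalid and replaced
# with the median of the valid ages.
valid = (age >= 10) & (age <= 80)
median_valid = np.median(age[valid])
cleaned = np.where(valid, age, median_valid)
print(cleaned)
```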
payment_method_id:
60.1% of non-churn users use payment_method_id 41; the next most common methods are 39 and 38, at around 7.8% and 7% of non-churn users.
54.3% of churn users use payment_method_id 41, and 10.9% use payment_method_id 38, which is quite interesting. Another 8.2% of churn users use payment_method_id 32, for which there are no non-churn users at all.
In conclusion, payment methods 41, 38 and 32 are worth keeping an eye on, since they have more churn users.
Registered_via:
61.8% of non-churn users registered via method 7 and 23% registered via method 9. More churn users come from registration method 9 than from method 7.
So we can conclude that registering via method 7 is associated with less churn than any other method.
Gender:
This is after filling the null values with NP (not provided).
Around 62% of non-churn users did not provide their gender, and male users churn slightly more than female users.
Since most of the values are missing and the distribution is nearly the same for churn and non-churn users, this feature is unlikely to add much value to the model, even if the missing genders were imputed, so we will try dropping it.
num_25:
For most users, churn or not, the number of songs played for less than 25% of their length is small.
This feature does not clearly separate churn and non-churn users.
Churn users in particular have very few such songs.
num_100:
Almost all users, churn or not, have up to around 3000 songs played in full, but non-churn users have a slight edge, with more users at the higher counts.
There are also a few users who listen to even more full-length songs.
From the two graphs, the 80th percentile is about 590 full-length songs for churn users and about 650 for non-churn users, so non-churn users listen to more full-length songs than churn users.
num_unique:
There is a slight difference in the number of unique songs between churn and non-churn users; non-churn users have a small edge, meaning they are more interested in listening.
Total_minutes:
For non-churn users, the 75th percentile of listening time is around 3000 minutes.
For churn users, the 75th percentile is around 2500 minutes.
So churn users listen less.
Final conclusion of EDA:
First, the plots show no clear separation between churn and non-churn users.
Users with around 10,000 songs played in full (num_100) are mostly non-churn users, and plan_list_price is mostly below 250 for non-churn users, so users who listen more are less likely to churn.
The remaining features were discussed above; most of them do not clearly separate the two classes.
Feature engineering:
Feature engineering is one of the most important parts of machine learning. We can create new features from the raw data provided and use them to train the model in order to achieve a good value of the chosen metric (accuracy, log loss, etc.).
I created around 21 features; some of them are very useful for separating churn and non-churn users.
import numpy as np
from datetime import datetime

# Date parts derived from transaction_date (stored as a YYYYMMDD number)
train_data['day_of_month'] = train_data.transaction_date.apply(lambda x: np.int8(str(int(x))[6:]))
train_data['transaction_month'] = train_data.transaction_date.apply(lambda x: np.int8(str(int(x))[4:6]))
# isoweekday(): 1 = Monday ... 7 = Sunday
train_data['dayofweek'] = train_data.transaction_date.apply(lambda x: datetime.strptime(str(int(x)), "%Y%m%d").isoweekday())
train_data['trans_is_weekend'] = train_data.dayofweek.map(lambda x: 1 if x in (6, 7) else 0)
# month and day combined as MMDD, e.g. 20170315 -> 315
train_data['trans_month_day'] = train_data.transaction_date.apply(lambda x: np.int16(str(int(x))[4:]))
From the transaction date I derived the day of month, transaction month, day of week, a weekend flag and the combined month-day.
In the same way, I derived new features from the registration date.
train_data['registration_day'] = train_data.registration_init_time.apply(lambda x: np.int8(str(int(x))[6:]))
train_data['registration_month'] = train_data.registration_init_time.apply(lambda x: np.int8(str(int(x))[4:6]))
train_data['registration_day_month'] = train_data.registration_init_time.apply(lambda x: np.int16(str(int(x))[4:]))
train_data['registration_day_num'] = train_data.registration_init_time.apply(lambda x: datetime.strptime(str(int(x)), "%Y%m%d").isoweekday())
train_data['registration_weekend'] = train_data.registration_day_num.map(lambda x: 1 if x in (6, 7) else 0)
train_data['price_per_day'] = train_data['plan_list_price']/train_data['payment_plan_days']
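The same date parts can also be derived without row-by-row lambdas. The sketch below is an alternative, not the code used for this case study: it parses the YYYYMMDD values once with pd.to_datetime and uses the .dt accessor, on a small made-up frame.

```python
import pandas as pd

# Made-up transaction dates in the dataset's YYYYMMDD format.
df = pd.DataFrame({"transaction_date": [20170304, 20170315, 20170331]})
dt = pd.to_datetime(df["transaction_date"].astype(str), format="%Y%m%d")

df["day_of_month"] = dt.dt.day
df["transaction_month"] = dt.dt.month
df["dayofweek"] = dt.dt.dayofweek + 1            # pandas uses Monday=0; +1 matches isoweekday()
df["trans_is_weekend"] = (df["dayofweek"] >= 6).astype(int)
df["trans_month_day"] = dt.dt.month * 100 + dt.dt.day   # MMDD as a number
print(df)
```

On large tables this is usually much faster than apply with a Python lambda, because the parsing and arithmetic run column-wise.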
Using the user logs, I derived ratio features: the share of songs in each played-length bucket (25%, 50%, 75%, 98.5%, 100%) out of the total number of songs listened to.
# creating a feature 'total_num', i.e. the total number of songs listened to by the user
train_data['total_num'] = train_data['num_25']+train_data['num_50']+train_data['num_75']+train_data['num_985']+train_data['num_100']
train_data['min_per_song'] = train_data['total_minutes']/train_data['total_num']
# creating ratio features, i.e. the number of songs in each played-length bucket out of the total number of songs listened to
train_data['num_25_ratio'] = train_data['num_25']/train_data['total_num']
train_data['num_50_ratio'] = train_data['num_50']/train_data['total_num']
train_data['num_75_ratio'] = train_data['num_75']/train_data['total_num']
train_data['num_985_ratio'] = train_data['num_985']/train_data['total_num']
train_data['num_100_ratio'] = train_data['num_100']/train_data['total_num']
train_data['avg_min_per_unq'] = train_data['mean_total_min']/train_data['mean_num_unq']
train_data['total_by_unq'] = train_data['total_num']/train_data['num_unq']
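One thing to watch with ratio features is a user with no songs at all: the denominator is zero and the division produces NaN or inf. A small sketch of one way to guard against that follows; filling with 0 is an assumption for the example, and the frame is made up.

```python
import numpy as np
import pandas as pd

# Made-up logs: the second user has no listening activity.
df = pd.DataFrame({
    "num_25": [2, 0], "num_50": [1, 0], "num_75": [1, 0],
    "num_985": [1, 0], "num_100": [5, 0], "total_minutes": [30.0, 0.0],
})
count_cols = ["num_25", "num_50", "num_75", "num_985", "num_100"]
df["total_num"] = df[count_cols].sum(axis=1)

# Replace a zero denominator with NaN so the division yields NaN, then fill with 0.
denom = df["total_num"].replace(0, np.nan)
df["num_100_ratio"] = (df["num_100"] / denom).fillna(0.0)
df["min_per_song"] = (df["total_minutes"] / denom).fillna(0.0)
print(df[["total_num", "num_100_ratio", "min_per_song"]])
```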
In the same way I created some more features. Some did not add much value to the model (the weekend flag, for example), while others added a lot (trans_month_day, for example); this shows up in the feature importance section.
As the above makes clear, this case study relies heavily on feature engineering.
Training the model:
First, I split the data into train and test sets.
from sklearn.model_selection import train_test_split

# separating the features from the label
X = train_data.drop(columns=['msno','is_churn','transaction_date','registration_init_time'])
print(X.shape)
y = train_data['is_churn']
print(y.shape)
# splitting data into train and test sets, keeping the churn ratio the same in both (stratify)
xtr,xte,ytr,yte = train_test_split(X,y,test_size=0.20,stratify=y)
print("Shape of train = ",xtr.shape,ytr.shape)
print("Shape of test = ",xte.shape,yte.shape)
Second, I trained simple models like logistic regression, and later I trained complex models like a neural net.
LogisticRegression:
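from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# grid over the inverse regularization strength C
parameters=[{"C":[10**-4,10**-2,10**0,10**2,10**4]}]
logisticregression=LogisticRegression()
# scoring is the negative log loss, so higher (closer to 0) is better
algo=GridSearchCV(logisticregression,parameters,scoring="neg_log_loss",return_train_score=True,
                  n_jobs=-1,verbose=1,cv=3)
algo.fit(xtr,ytr)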
Logistic regression performed reasonably well. Later I trained an SVM, which performed worse than logistic regression; its results are shown below.
The decision tree performed well on both the train and test data, but there is some difference between its train and test loss, so it is overfitting to a degree; have a look at the results below.
The random forest classifier is slightly overfitted, as the results table below shows.
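The train/test gap mentioned for the tree models can be checked directly by computing log loss on both splits. The sketch below does this on synthetic data, comparing an unrestricted decision tree with a depth-limited one; the dataset and the depth of 3 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise, so a fully grown tree memorizes the noise.
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.1, random_state=0)
xtr, xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

gaps = {}
for depth in (None, 3):  # None = grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(xtr, ytr)
    train_loss = log_loss(ytr, tree.predict_proba(xtr))
    test_loss = log_loss(yte, tree.predict_proba(xte))
    gaps[depth] = test_loss - train_loss
    print(depth, round(train_loss, 3), round(test_loss, 3))
```

A large positive gap for the unrestricted tree and a small one for the shallow tree is the usual signature of overfitting.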
Next, I used the LightGBM classifier.
LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision trees, used for ranking, classification and many other machine learning tasks. It grows trees leaf-wise (best-first), whereas many other boosting implementations grow trees level-wise (depth-wise). For the same number of leaves, leaf-wise growth can reduce more loss than level-wise growth, which often results in better accuracy, although it can overfit on small datasets unless the depth or the number of leaves is limited. It is also very fast, hence the word 'Light'.
Fitting with the best parameters found by hyperparameter tuning:
from lightgbm import LGBMClassifier
lgb = LGBMClassifier(max_depth=9,
                     subsample=0.5,  # row subsampling; LightGBM applies it only when subsample_freq > 0
                     learning_rate=0.05,
                     n_estimators=700,
                     num_leaves=50,
                     colsample_bytree=0.9,
                     n_jobs=-1)
lgb.fit(xtr,ytr)
It performed well on the test data; have a look at the table below.
Since the LightGBM classifier performed well, I took 3 LightGBM classifiers, predicted their outputs, passed these outputs to a neural net model and trained the neural net, following the two-stage setup described in the first cut solution. It performed well.
After fitting the neural net model I obtained its log loss (see the models comparison).
Besides the neural net model I also tried other models, such as a custom ensemble. For the code, please visit my GitHub account, given at the end of the blog.
Feature importance: as returned by the LightGBM model.
Models comparison: the comparison below is based on Kaggle test data results.
I tried all of the models above. The best log loss is for the 3 LGBM + neural net model. Its score is the Kaggle test log loss, while the custom ensemble's log loss is from cross-validation.
GitHub repository: the full code is available in my GitHub account; please have a look.
Future work:
I need to experiment with more feature engineering, and after that with different models as well, to get a better log loss.
Linked In:
This is my first ML case study. I hope you enjoyed reading it; feel free to contact me.
https://www.linkedin.com/in/sai-teja-kandra-8246ba103/?originalSubdomain=in
References: