WSDM — KK Box’s Churn Prediction Challenge

Nov 19, 2020

**Image source:** https://blog.ekbana.com

Table of contents:

Business problem
Business problem to ML
Metrics and business constraints
Source of data
Existing approaches and first cut solution
Exploratory data analysis / data cleaning
Feature engineering
Training the model
Models comparison
GitHub repository
Future work
References

WSDM KKBOX churn prediction challenge:

In this problem, we need to predict whether a user will churn after his/her subscription expires. When does a subscription count as ended? Most of KKBOX’s revenue comes from the 30-day plan, so we look at the 30 days after the current subscription expiry date: a user who makes a new service subscription transaction within those 30 days has not churned, and a user who does not is counted as churned.

Business problem:

This problem is called churn prediction. What is churn? Churn quantifies the number of customers who have left your brand by cancelling their subscription or by no longer paying for your services.

KKBOX, a popular music streaming service provider in South-East Asia, runs a subscription and advertisement-based business model. Subscriptions, primarily monthly ones, are the major source of revenue, and users can cancel their subscription whenever they see fit. Hence, improving churn prediction is indispensable for KKBOX’s growth.

Churn is bad news for any business, as it costs five times as much to attract a new customer as it does to keep an existing one. A high customer churn rate will hit your company’s finances hard. By leveraging machine learning (ML), you can anticipate potential churners who are about to abandon your services. Still not convinced? It will cost you 16 times more to bring a new customer up to the same level as an existing customer.

Business problem to ML:

I’m converting this business problem into a two-class classification problem: the user will churn = 1 (does not re-subscribe) or will not churn = 0 (re-subscribes).

So it is a supervised classification task.

Metrics:

Why log loss? We’re predicting churn, so it is better to work with probability scores, i.e. how likely each user is to churn. Log loss is computed directly on those probabilities, and it penalizes confident wrong predictions heavily, so I’m going with the log loss metric.
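To make the penalty concrete, here is a minimal sketch of binary log loss in plain Python. The labels and probabilities are made up for illustration, and the helper name `binary_log_loss` is hypothetical; in practice a library implementation such as scikit-learn's `log_loss` would normally be used.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels under the predicted churn probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]                     # hypothetical is_churn values
cautious = [0.6, 0.4, 0.6, 0.4]           # mildly confident and always on the right side
confident_wrong = [0.01, 0.99, 0.6, 0.4]  # two very confident mistakes

loss_cautious = binary_log_loss(labels, cautious)
loss_confident_wrong = binary_log_loss(labels, confident_wrong)
print(loss_cautious, loss_confident_wrong)
```

The second set of predictions differs only in two rows, but because those two rows are confidently wrong, the average loss becomes several times larger.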

Business constraints:

No low-latency constraint.
We should not miss churn users, because that impacts the business, so recall should be high (missed churners are the misclassification to avoid).
Interpretability is very important: instead of only predicting churn or not, it’s important to explain why.
Giving a probability is mandatory: it tells us how likely the churn is, and it is also useful for setting the decision threshold.
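The link between probabilities, the threshold and recall can be sketched as follows. The data and the helper `recall_at_threshold` are hypothetical and only illustrate the idea; they are not taken from the case study.

```python
def recall_at_threshold(y_true, y_prob, threshold):
    """Fraction of actual churners that are flagged when we predict churn for prob >= threshold."""
    true_pos = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0

# made-up labels (1 = churn) and predicted churn probabilities
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.45, 0.2, 0.6, 0.3, 0.1, 0.05, 0.02]

recall_default = recall_at_threshold(labels, probs, 0.5)   # only the 0.9 user is caught
recall_lowered = recall_at_threshold(labels, probs, 0.15)  # all three churners are caught
print(recall_default, recall_lowered)
```

Lowering the threshold raises recall, but it also flags more users who would have stayed (here the users with 0.6 and 0.3), so the threshold is a business trade-off rather than a fixed value.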

Source of data:

The data was taken from Kaggle, our home for data science :)

WSDM - KK Box's Churn Prediction Challenge (www.kaggle.com)

Existing Solutions:

Predicting Customer Churn: Extreme Gradient Boosting with Temporal Data by Bryan Gregory (1st place entry):

The main reason for choosing this challenge is to focus on manual feature engineering. The first-place winner used only traditional machine learning models, which shows how much feature engineering matters here.

Most of the features were extracted by aggregation and comparison over different time windows or the entire history, using mathematical operations like average or sum. A total of 76 features were used, and the 10 most useful are shared. XGBoost with a weight of 0.88 and LightGBM with a weight of 0.12 were the models used for prediction.
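A weighted blend of two models can be sketched like this. The weights 0.88 and 0.12 come from the text above; applying them as a plain weighted average of the predicted probabilities is an assumption, and the probabilities and the helper `blend` are made up.

```python
def blend(xgb_probs, lgbm_probs, w_xgb=0.88, w_lgbm=0.12):
    """Weighted average of two models' churn probabilities (assumed blending scheme)."""
    return [w_xgb * a + w_lgbm * b for a, b in zip(xgb_probs, lgbm_probs)]

xgb_probs = [0.10, 0.80]   # hypothetical XGBoost predictions for two users
lgbm_probs = [0.20, 0.60]  # hypothetical LightGBM predictions for the same users

blended = blend(xgb_probs, lgbm_probs)
print(blended)
```

Because the weights sum to 1, the blended values stay valid probabilities.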

This paper shows that simple ML techniques, combined with a lot of feature engineering, can achieve a better log loss.

First cut solution:

First, I will apply simple models like logistic regression after the feature engineering part, and later I will apply complex models like a neural net to improve the log loss.

One such complex model is described below.

In the 1st stage, I take the predicted probabilities for the two classes (0 and 1) from each of three LGBM classifiers and concatenate them as input to a feed-forward neural network (FFNN) in the 2nd stage. The FFNN predicts the final probability of customer churn.
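The shape of the data handed from the 1st to the 2nd stage can be sketched with NumPy. The function `fake_predict_proba` is a hypothetical stand-in for a fitted classifier's `predict_proba`, and the random values are placeholders, so this only shows the concatenation step, not the actual models.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users = 4

def fake_predict_proba(n):
    """Hypothetical stand-in for predict_proba: one row per user, columns [P(0), P(1)]."""
    p_churn = rng.uniform(0, 1, size=n)
    return np.column_stack([1 - p_churn, p_churn])

# three stage-1 classifiers, each returning an (n_users, 2) array
stage1_outputs = [fake_predict_proba(n_users) for _ in range(3)]

# concatenate side by side: 3 classifiers x 2 classes = 6 input features for the FFNN
ffnn_input = np.hstack(stage1_outputs)
print(ffnn_input.shape)
```

In a real stacking setup, the stage-1 predictions used to train the 2nd stage would normally come from data the stage-1 models were not fitted on, to avoid leaking the labels.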

Exploratory Data Analysis:

First we will look at the columns present in each file, and later we will do an in-depth analysis of each feature.

transactions_v2: this table contains the users' transaction data and has the following columns:

msno: user id
payment_method_id: payment method
payment_plan_days: length of membership plan in days
plan_list_price: in New Taiwan Dollar (NTD)
actual_amount_paid: in New Taiwan Dollar (NTD)
is_auto_renew
transaction_date: format %Y%m%d
membership_expire_date: format %Y%m%d
is_cancel: whether or not the user canceled the membership in this transaction.

Members (members.csv): it contains data about the users, but not for every user. It has the following columns:

msno
city
age
gender
registered_via: registration method
registration_init_time: format %Y%m%d
expiration_date: format %Y%m%d, taken as a snapshot at the time members.csv was extracted. It does not represent the actual churn behavior.

User logs (user_logs.csv): it contains the number of songs played, bucketed by how much of each song was listened to.

date: format %Y%m%d
num_25: # of songs played less than 25% of the song length
num_50: # of songs played between 25% and 50% of the song length
num_75: # of songs played between 50% and 75% of the song length
num_985: # of songs played between 75% and 98.5% of the song length
num_100: # of songs played over 98.5% of the song length
num_unique: # of unique songs played
total_seconds: total seconds played

Train data (train_v2.csv): it contains the churn data for March and has two columns:

msno: the user id of the customer
is_churn: 1 (churn); 0 (re-subscribed)

Test data (sample_submission_v2.csv): it lists the users whose churn has to be predicted (April in the v2 data) and has the same two columns:

msno: the user id of the customer
is_churn: the label to predict; 1 (churn), 0 (re-subscribed)

The data is spread over multiple tables, so we combine them into one table keyed on the users in the train data.
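Combining the tables on the train data can be sketched with pandas left joins on msno. The tiny frames below are made up, and keeping only each user's latest transaction is an assumption made for this sketch; the case study may aggregate the transactions and logs differently.

```python
import pandas as pd

# made-up miniature versions of the tables
train = pd.DataFrame({"msno": ["u1", "u2", "u3"], "is_churn": [0, 1, 0]})
members = pd.DataFrame({"msno": ["u1", "u3"], "city": [1, 13]})
transactions = pd.DataFrame({
    "msno": ["u1", "u2", "u2"],
    "transaction_date": [20170301, 20170215, 20170310],
    "payment_plan_days": [30, 30, 30],
})

# assumption: keep one row per user, the most recent transaction
latest_tx = (transactions.sort_values("transaction_date")
             .drop_duplicates("msno", keep="last"))

# left joins keep every user from the train data, even without member or transaction rows
train_data = (train.merge(members, on="msno", how="left")
                   .merge(latest_tx, on="msno", how="left"))
print(train_data)
```

Users missing from a joined table end up with null values in its columns, which is why a cleaning step is needed after the merge.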

Data cleaning: checking for null values and filling them with mean and median values.

The columns that contain null values are filled with their mean or median values.
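A minimal sketch of this kind of imputation with pandas is shown below. Which column gets the mean and which gets the median is an assumption here, and the values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "total_minutes": [100.0, None, 300.0],
    "age": [25.0, 31.0, None],
})

# count the nulls per column before filling
print(df.isna().sum())

# assumption for this sketch: mean for listening time, median for age
df["total_minutes"] = df["total_minutes"].fillna(df["total_minutes"].mean())
df["age"] = df["age"].fillna(df["age"].median())
print(df)
```

The median is usually the safer choice for columns with outliers, because a few extreme values can pull the mean far away from a typical user.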

Now we will do an in-depth EDA with respect to the churn label.

payment_plan_days:

From the plot it is clear that most of the users are on the 30-day plan.

Among the churn users, the most common plan lengths are 30, 150, 180, 200, 210, 230, 400, 410 and 415 days.

plan_list_price:

From the plot we can see that many of the churn users have a plan price above 180.

In the same way, most of the non-churn users have a plan price in the range 0 to 200.

So users with a higher plan price are more likely to churn.

Here I do more analysis on plan_list_price using bar plots.

We can observe that some users also churn when the plan price is very high, e.g. 1788.

is_auto_renew:

From the plot we can say that non-churn users use auto-renew more than churn users.

Around 93.5% of the non-churn users have auto-renew enabled.

It was also observed that 28.6% of the churn users do not have auto-renew enabled.

City:

From the plot we can say that most of the users are from city 1, whether they churn or not.

Of all non-churn users, around 59% are from city 1 and 10% are from city 13.

Of all churn users, around 39% are from city 1, 14% are from city 13 and 11% are from city 5.

We can also conclude that most of the users come from cities 1, 13 and 5.

is_cancel:

Almost all of the non-churn users, around 99.4%, did not cancel their service before the membership expiry.

Around 16% of the churn users cancelled their service before the membership expiry; the remaining churn users did not cancel.

So churn users cancel the service more often than non-churn users.

Age:

From the percentiles we can say that the 80th percentile of age is 28, and the 85th and 90th percentiles are approximately 31 and 34.

Also note that this column contains outliers, so we need to pre-process this variable.
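One possible way to pre-process the age outliers is sketched below. The plausible range of 10 to 80 and the choice to replace outliers with the median of the valid ages are assumptions for illustration, not necessarily what was done in this case study; the ages are made up.

```python
import pandas as pd

ages = pd.Series([-5, 0, 22, 28, 34, 120, 1030])  # made-up ages including obvious outliers

valid = ages.between(10, 80)          # assumed plausible age range
median_valid = ages[valid].median()   # median of the plausible ages only

# keep valid ages, replace everything else with the median
ages_clean = ages.where(valid, median_valid)
print(ages_clean.tolist())
```

An alternative is to keep a separate flag column marking which ages were replaced, so the model can still see that the original value was missing or implausible.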

payment_method_id:

60.1% of the non-churn users use payment_method_id 41; the next most used methods are 39 and 38, with around 7.8% and 7% of the non-churn users.

54.3% of the churn users use payment_method_id 41, and 10.9% use payment_method_id 38, which is quite interesting. Another 8.2% of the churn users use payment_method_id 32, for which we don’t have any non-churn users.

In conclusion, we have to keep an eye on payment methods 41, 38 and 32, as they have more churn users.

registered_via:

From the plot we can say that 61.8% of the non-churn users registered via 7 and 23% of the non-churn users registered via 9. It was also observed that more churn users come from registration method 9 than from 7.

So we can conclude that users who registered via 7 churn less than users of any other registration method.

Gender:

After filling the null values with NP (not provided):

From the plot we can observe that gender is not provided for around 62% of the non-churn users, and that male users churn slightly more than female users.

Most of the values are null and the distribution is nearly the same for churn and non-churn users, so this feature is unlikely to add much value to the model. We will try dropping it.

num_25:

From the plot we can say that, for most users, the number of songs played for less than 25% of their length is small, whether they churn or not.

This feature does not clearly separate churn and non-churn users.

We can also see that churn users listen to fewer songs.

num_100:

From the plot we can say that almost all users, churn or non-churn, have up to around 3000 songs played at full length, but the non-churn users have an edge over the churn users.

There are also some users who listen to even more full-length songs.

From the two graphs, we can clearly see that 80% of the churn users listen to up to 590 full-length songs, while 80% of the non-churn users listen to up to 650, so non-churn users play more full-length songs than churn users.

num_unique:

From the plot we can say that there is a slight difference in the number of unique songs between churn and non-churn users. Non-churn users have a small edge, which means they are more interested in listening.

total_minutes:

75% of the non-churn users listen for up to around 3000 minutes.

75% of the churn users listen for up to around 2500 minutes.

Churn users listen less.

Final conclusion of EDA:

First, going by the plots, there is no clear separation between churn and non-churn users.

Users with around 10000 fully played songs (num_100) are mostly non-churn users, and the plan list price is also below 250 for the non-churn users, so we can say that users who listen more are less likely to churn.

The remaining features were already discussed above; most of them do not clearly separate the data.

Feature engineering:

Feature engineering is one of the most important parts of machine learning. We can create new features from the raw data provided and use them to train the model in order to achieve the desired value of the metric (accuracy, log loss, etc.).

I have created around 21 features; some of them are very useful in predicting churn/non-churn users.

import numpy as np
from datetime import datetime

# transaction_date is a number in %Y%m%d form; isoweekday() returns Monday = 1 ... Sunday = 7
train_data['day_of_month'] = train_data.transaction_date.apply(lambda x: np.int8(str(int(x))[6:]))
train_data['transaction_month'] = train_data.transaction_date.apply(lambda x: np.int8(str(int(x))[4:6]))
train_data['dayofweek'] = train_data.transaction_date.apply(lambda x: datetime.strptime(str(int(x)), "%Y%m%d").isoweekday())
train_data['trans_is_weekend'] = train_data.dayofweek.map(lambda x: 1 if x in (6, 7) else 0)
train_data['trans_month_day'] = train_data.transaction_date.apply(lambda x: np.int16(str(int(x))[4:]))

From the transaction date I derived the day of month, the transaction month, the day of week, a weekend flag and the combined month-day value.
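The same date features can also be derived without row-wise lambdas by parsing the column once with pandas. This is an alternative sketch on made-up dates, not the code used in the case study; the column names follow the ones above.

```python
import pandas as pd

df = pd.DataFrame({"transaction_date": [20170331, 20170304.0]})  # made-up %Y%m%d values

# parse the numeric dates once, then read the parts through the .dt accessor
dates = pd.to_datetime(df["transaction_date"].astype(int).astype(str), format="%Y%m%d")

df["day_of_month"] = dates.dt.day
df["transaction_month"] = dates.dt.month
df["dayofweek"] = dates.dt.dayofweek + 1  # shift to Monday = 1 ... Sunday = 7, like isoweekday()
df["trans_is_weekend"] = df["dayofweek"].isin([6, 7]).astype(int)
df["trans_month_day"] = dates.dt.month * 100 + dates.dt.day  # MMDD as a number, e.g. 331
print(df)
```

On large tables the vectorized version is typically much faster than calling `apply` with a Python lambda for every row.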

In the same way, I derived new features from the registration date.

train_data['registration_day'] = train_data.registration_init_time.apply(lambda x: np.int8(str(int(x))[6:]))
train_data['registration_month'] = train_data.registration_init_time.apply(lambda x: np.int8(str(int(x))[4:6]))
train_data['registration_day_month'] = train_data.registration_init_time.apply(lambda x: np.int16(str(int(x))[4:]))
train_data['registration_day_num'] = train_data.registration_init_time.apply(lambda x: datetime.strptime(str(int(x)), "%Y%m%d").isoweekday())
train_data['registration_weekend'] = train_data.registration_day_num.map(lambda x: 1 if x in (6, 7) else 0)
# price per plan day; a payment_plan_days of 0 would give inf or NaN here
train_data['price_per_day'] = train_data['plan_list_price'] / train_data['payment_plan_days']

Using the user logs file, I derived ratio features: the share of songs in each bucket (25, 50, 75, 98.5, 100) out of the total number of songs listened.

# 'total_num': total number of songs listened to by the user (sum of the five buckets)
train_data['total_num'] = train_data['num_25']+train_data['num_50']+train_data['num_75']+train_data['num_985']+train_data['num_100']
# average listening time per song
train_data['min_per_song'] = train_data['total_minutes']/train_data['total_num']
# ratio features: share of each bucket out of the total number of songs listened
train_data['num_25_ratio'] = train_data['num_25']/train_data['total_num']
train_data['num_50_ratio'] = train_data['num_50']/train_data['total_num']
train_data['num_75_ratio'] = train_data['num_75']/train_data['total_num']
train_data['num_985_ratio'] =train_data['num_985']/train_data['total_num']
train_data['num_100_ratio'] =train_data['num_100']/train_data['total_num']
train_data['avg_min_per_unq'] = train_data['mean_total_min']/train_data['mean_num_unq']
train_data['total_by_unq'] = train_data['total_num']/train_data['num_unq']
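One thing to watch with ratio features is users whose denominator is zero, for example a user with no songs in any bucket. A minimal sketch of guarding against that is shown below; mapping such users to a ratio of 0 is an assumption, and the two-bucket frame is made up to keep the example short.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"num_25": [2, 0], "num_100": [8, 0]})  # second user has no plays at all
df["total_num"] = df["num_25"] + df["num_100"]

# replace zero denominators with NaN so the division never produces inf
denominator = df["total_num"].replace(0, np.nan)

# assumption: a user with no plays gets ratio 0 instead of NaN
df["num_100_ratio"] = (df["num_100"] / denominator).fillna(0.0)
print(df)
```

Leaving the value as NaN is also a reasonable choice for tree-based models such as LightGBM, which handle missing values natively.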

In the same way I created some more features. Some did not add much value to the model, for example the weekend flag, and some added a lot, like trans_month_day; we will see this in the feature importance section.

As the above shows, this case study relies heavily on feature engineering.

Training the model:

First, I split the data into train and test sets.

# separating the features (X) from the target (y)
X = train_data.drop(columns=['msno','is_churn','transaction_date','registration_init_time'])
print(X.shape)
y = train_data['is_churn']
print(y.shape)
# splitting data into train and test sets, stratified on the churn label
from sklearn.model_selection import train_test_split
xtr,xte,ytr,yte = train_test_split(X,y,test_size=0.20,stratify=y)
print("Shape of train = ",xtr.shape,ytr.shape)
print("Shape of test = ",xte.shape,yte.shape)

Next, I trained simple models like logistic regression, and after that complex models like a neural net.

LogisticRegression:

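from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# grid search over the inverse regularization strength C, scored by negative log loss
parameters = [{"C": [10**-4, 10**-2, 10**0, 10**2, 10**4]}]
logisticregression = LogisticRegression()
algo = GridSearchCV(logisticregression, parameters, scoring="neg_log_loss",
                    return_train_score=True, n_jobs=-1, verbose=1, cv=3)
algo.fit(xtr, ytr)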
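Reading the result of such a grid search can be sketched as follows. The data is synthetic (generated with `make_classification`) and `max_iter=1000` is added only to keep the solver from stopping early on it, so the numbers say nothing about the KKBOX data. Note that scikit-learn reports `neg_log_loss`, so the sign has to be flipped to get the log loss itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [10**-4, 10**-2, 10**0, 10**2, 10**4]},
    scoring="neg_log_loss",
    cv=3,
)
search.fit(X_demo, y_demo)

best_c = search.best_params_["C"]
best_cv_log_loss = -search.best_score_  # best_score_ is the negated log loss
print(best_c, best_cv_log_loss)
```

After the search, the estimator refitted with the best C is available as `search.best_estimator_`.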

Logistic regression performed reasonably well. Later I trained an SVM, which performed worse than logistic regression; the results of the trained SVM model are shown below.

The decision tree model performed well on the train data, but the results below show a gap between the train and test loss, so we can say that it overfitted.

The random forest classifier is slightly overfitted, as we can see in the results table below.

Next, I used the LightGBM classifier.

LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks. It grows trees leaf-wise, always splitting the leaf with the best gain, whereas many other boosting implementations grow trees depth-wise or level-wise. For the same number of leaves, leaf-wise growth can reduce more loss than level-wise growth, which often results in better accuracy, although it can overfit on small datasets. It is also very fast, hence the word ‘Light’.

Fitting with the best parameters found by hyperparameter tuning:

from lightgbm import LGBMClassifier
# note: in LightGBM, subsample only takes effect when subsample_freq > 0
lgb = LGBMClassifier(max_depth=9,
                     subsample=0.5,
                     learning_rate=0.05,
                     n_estimators=700,
                     num_leaves=50,
                     colsample_bytree=0.9,
                     n_jobs=-1)
lgb.fit(xtr,ytr)

It performed well on the test data; see the table below.

From the above we can say that the LightGBM classifier performed well, so I took three LightGBM classifiers, passed their predicted outputs to the neural net model and then trained the neural net. It performed well.

This is how I set it up.

After fitting the neural net model, I was able to get the loss reported in the models comparison.

Besides the neural net model I tried other models as well, such as a custom ensemble. For the code, please visit my GitHub account, linked at the end of the blog.

Feature importance: returned by the LightGBM model

Models comparison: the comparison below is based on Kaggle test data results.

I tried all the models above. The best log loss is for 3 LGB + neural net; its log loss is on the Kaggle test data, while the custom ensemble log loss is from cross-validation.

GitHub repository: the full code is available in my GitHub account, please have a look at it.

kandrateja/Churn (github.com)

Future work:

I need to experiment with more feature engineering, and after that we can also experiment with different models for a better log loss.

Linked In:

This is my first ML case study; I hope you enjoyed reading it. Feel free to contact me.

https://www.linkedin.com/in/sai-teja-kandra-8246ba103/?originalSubdomain=in

References:

https://arxiv.org/ftp/arxiv/papers/1802/1802.03396.pdf

Applied Course (www.appliedaicourse.com)

seaborn.countplot - seaborn 0.11.0 documentation (seaborn.pydata.org)

Kaggle Top 4% Solution: WSDM-KKBOX’s Churn Prediction (medium.com)