Starbucks Rewards Mobile App Users: Predicting User Behaviour with Supervised Machine Learning Models in an Unbalanced Data Setup

Paul Dzitse
21 min read · May 30, 2022

1. Introduction

In this project, we analyse the behavioural patterns of Starbucks mobile app users by using supervised machine learning techniques. Specifically, we determine whether an offer sent to a potential customer will be successful or not.

Our aim is to provide decision makers with an insight into how users react to certain parameters in their business model. With this information at hand, they can adjust these parameters and thus optimise their business goals.

The article is divided into three parts. The first part provides some information on the dataset and shows the steps taken to clean it. The second part presents visualisations of important variables; at the end of part two, we explain how the dataset used for the analysis is prepared.

The last part answers a number of selected questions. Since our dataset is unbalanced, we handle it as an imbalanced classification problem. The newly prepared dataset has 68115 entries, with 1548 successful offers and 66567 unsuccessful offers. This gives a successful-to-unsuccessful ratio of roughly 1:43. Our classes in the model are:

· Majority Class: offer unsuccessful, class 0

· Minority Class: offer successful, class 1

There are methods that one can use to balance an unbalanced dataset, such as synthetic techniques like SMOTE or conventional oversampling and undersampling. Nevertheless, we work with the original unbalanced dataset without making any changes to it.

So, we focus on estimating the minority class and initially use five models. With the initial settings of these models, we evaluate their predictive performance using tools such as learning curves, which tell us how the models are learning and generalising. Further, we apply bootstrap resampling to understand the models' performance and time metrics; the bootstrap resamples help quantify the uncertainty in the model-building process. The evaluation of the models' performance is based on threshold, ranking and probability metrics. Within these metrics, we focus on evaluating scores such as precision, recall, F1-score, log loss and classification error.

After the general training and testing, we select the XGB model based on the outcome of these score metrics. Further tuning is done to improve its predictive performance by optimising the hyperparameters with a grid search. The recall, log loss and classification error scores improve drastically compared with the initial scores.

Finally, we use our improved model, with the help of SHAP values, to determine how the various features contributed to the final model. Here we show that some attributes in the business model may favour an offer being successful while others may not.

2. Data information

We now begin by providing some information on our dataset. It contains simulated data that mimics customer behaviour on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one, get one free). Some users might not receive any offer during certain weeks.

There are three datasets as follows:

portfolio.json — containing offer ids and meta data about each offer (duration, type, etc.)
profile.json — demographic data for each customer
transcript.json — records for transactions, offers received, offers viewed, and offers completed

The main purpose is to combine these three datasets for the analysis.

Before continuing, we want to note that Python code is included in the sections where it is most instructive. The full code and data information can be retrieved from my project GitHub site.

In the upcoming sections, we provide short information on the contents of the datasets and the steps we take in cleaning, visualising, and preparing the final dataset.

3. Data Cleaning

1. Portfolio Dataset
Portfolio dataset: data structure and distribution of offer types

The diagram above shows the structure of the portfolio data. We see variables such as reward, channels, difficulty, duration, offer_type and id. The distribution of offer_type can be seen in the pie chart; discount and BOGO offers have the largest shares.

The following steps are taken to clean the portfolio dataset:

a. Unpack the lists in the channels column and split the categorical columns into dummies, then drop their parent columns.

b. Rename the id column to offer_id.

c. Map each offer hash in the id column to an integer, as sketched in the code below.
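A minimal pandas sketch of these steps might look as follows; the file path and the integer mapping scheme are illustrative assumptions rather than the exact notebook code:

```python
import pandas as pd

# Read the raw portfolio data (path is illustrative)
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)

# a. Unpack the channels lists into dummy columns and drop the parent column
channel_dummies = pd.get_dummies(portfolio['channels'].explode()).groupby(level=0).max()
portfolio = pd.concat([portfolio.drop(columns='channels'), channel_dummies], axis=1)

# Split offer_type into dummy columns and drop the parent column
portfolio = pd.concat(
    [portfolio.drop(columns='offer_type'), pd.get_dummies(portfolio['offer_type'])],
    axis=1,
)

# b. Rename the id column to offer_id
portfolio = portfolio.rename(columns={'id': 'offer_id'})

# c. Map each offer hash to a small integer
offer_id_map = {h: i for i, h in enumerate(portfolio['offer_id'].unique(), start=1)}
portfolio['offer_id'] = portfolio['offer_id'].map(offer_id_map)
```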

The cleaned portfolio dataset has the following form.

portfolio dataset

2. Profile Dataset

Profile dataset with boxplot and strip plot of age and income distribution

The profile dataset has information on gender, age, id, membership year and income. It has about 17000 rows. The id column contains hash values and will be converted into integer values. The boxplot with strip plot shows the age and income distributions. Age values of 118 appear to be outliers, and they have corresponding null income values. These records make up 12.7% of the dataset and are dropped, bringing the age range to 20 to 100 years. The median age is around 55 years. Income has a median of about 70000 dollars and ranges from 20000 to 120000 dollars. The income values appear to be right-skewed.

We create age and income ranges for the app users to better categorise these variables; a sketch of this binning follows the dataset preview below. The final profile dataset has this form.

Profile dataset
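A minimal sketch of the outlier removal and binning, assuming illustrative bin edges (the actual ranges used in the notebook may differ):

```python
import pandas as pd

# Read the raw profile data (path is illustrative)
profile = pd.read_json('data/profile.json', orient='records', lines=True)

# Drop the outliers: rows with age 118 also have missing income (about 12.7% of the data)
profile = profile[profile['age'] != 118].dropna(subset=['income'])

# Illustrative age and income bins
profile['age_group'] = pd.cut(
    profile['age'],
    bins=[19, 34, 44, 54, 64, 74, 104],
    labels=['20-34', '35-44', '45-54', '55-64', '65-74', '75+'],
)
profile['income_range'] = pd.cut(
    profile['income'],
    bins=[19999, 40000, 60000, 80000, 100000, 120000],
    labels=['20-40k', '40-60k', '60-80k', '80-100k', '100-120k'],
)
```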

3. Transcript Dataset

Have a look at the transcript dataset below. The person, event and value columns need cleaning. The data types are object, float and int. The event column has values for transactions and for offers received, viewed and completed. The value column contains nested dictionaries. From the event column we create an offer dataset and a transaction dataset, using the steps below.

Transcript dataset

a. Map each hash in the person column to an integer and rename the column app_user.

b. Expand the value (dictionary) column into its own columns and get their dummies.

c. Split the event column into its own columns and get the dummies.

d. Construct the offer data (offer received, offer viewed and offer completed) and the transaction data, as sketched below.
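A minimal sketch of these steps; the file path and the keys inside the value dictionaries (e.g. amount) are assumptions:

```python
import pandas as pd

# Read the raw transcript data (path is illustrative)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

# a. Map each person hash to an integer and rename the column to app_user
user_map = {h: i for i, h in enumerate(transcript['person'].unique(), start=1)}
transcript['app_user'] = transcript['person'].map(user_map)
transcript = transcript.drop(columns='person')

# b. Expand the nested value dictionaries into their own columns
value_cols = pd.json_normalize(transcript['value'])
transcript = pd.concat(
    [transcript.drop(columns='value').reset_index(drop=True), value_cols], axis=1
)

# c. One-hot encode the event column
transcript = pd.concat([transcript, pd.get_dummies(transcript['event'])], axis=1)

# d. Split into offer events and pure transactions
offer_data = transcript[transcript['event'] != 'transaction'].copy()
transaction = transcript.loc[
    transcript['event'] == 'transaction', ['app_user', 'time', 'amount']
].copy()
```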

Offer and transaction dataset

The offer and transaction data are shown above. The offer data has 149715 rows and 8 columns, and the transaction data has 125792 rows and 3 columns.

We now present some visualizations of the important variables in the second part.

4. Visualization of the clean dataset

Consider the visualization of the profile dataset.

Distribution of Gender type, age group distribution and gender distribution by years

From the pie chart above, 57.2% of users are male, 41.3% are female and 1.4% belong to another gender. The 55-64 age group has the most members, followed by the 45-54 age group. The last graph shows the distribution of gender across the years; 2017 appears to have the most users within each gender group.

It is also interesting to observe how income is distributed among the genders, and how the age groups and income are distributed and spread.

Violin plot, boxplot and stripplot for Gender types

The violin plot gives an insight into the income distribution across the three gender types. We see that the minimum and maximum values for female and male users appear to be similar. Nevertheless, the male income distribution tends to be more spread out at lower income levels, as it is wider there. While the median incomes of male and other users are centred around 60000 dollars, that of female users is about 70000 dollars, which may indicate that female users are wealthier than the other gender types.

The distribution of ages appears to be less skewed than that of income. We see that as age approaches 100, the number of users drops drastically.

The next graph below shows a bar chart of the income distribution within age groups and across years. Within genders and across years, male users appear to dominate the overall income. This may be partly due to the sheer number of male users, since they are the largest group in total.

Income distribution between years and within gender, and transcript data

The graph on the right above shows the number of transactions, offers received, offers viewed and offers completed. Interestingly, around half of the offers received were completed.

We create two datasets from the transcript data: the offer data and the transaction data. These datasets are presented below.

Offer and transaction data

Construction of the dataset (offer successful)

Finally, we construct the dataset for the analysis by combining the cleaned portfolio, profile, offer and transaction datasets. In the construction, we first filter the offer data and transaction data for a specific user. Thereafter we create dataframes for when the customer receives, views and completes an offer. We then iterate over each offer a customer receives and determine whether a transaction is valid within the offer's time window. Next, we determine whether the customer completes that specific offer by finding out when they view it, whether they complete it and whether it is successful. Finally, we collect the customer's transactions occurring within the valid offer time window and determine the amount spent. Refer to the GitHub project page for the full code.
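The following simplified sketch shows this logic for a single user; the column names (e.g. 'offer received', 'offer_id', 'time') and the exact window rules are assumptions based on the description above, not the notebook code itself:

```python
def label_offers_for_user(user_id, offer_data, transaction, portfolio):
    """Decide, for each offer a user received, whether it was viewed and completed
    within the offer's validity window, and how much the user spent in that window.
    NOTE: column names used here are illustrative assumptions."""
    rows = []
    user_offers = offer_data[offer_data['app_user'] == user_id]
    user_tx = transaction[transaction['app_user'] == user_id]

    received = user_offers[user_offers['offer received'] == 1]
    viewed = user_offers[user_offers['offer viewed'] == 1]
    completed = user_offers[user_offers['offer completed'] == 1]

    for _, offer in received.iterrows():
        # duration is given in days in the portfolio, time in hours in the transcript
        duration_hours = portfolio.loc[
            portfolio['offer_id'] == offer['offer_id'], 'duration'
        ].iloc[0] * 24
        start, end = offer['time'], offer['time'] + duration_hours

        viewed_in_window = ((viewed['offer_id'] == offer['offer_id'])
                            & viewed['time'].between(start, end)).any()
        completed_in_window = ((completed['offer_id'] == offer['offer_id'])
                               & completed['time'].between(start, end)).any()

        # total spend inside the validity window
        total_amount = user_tx.loc[user_tx['time'].between(start, end), 'amount'].sum()

        rows.append({
            'app_user': user_id,
            'offer_id': offer['offer_id'],
            'total_amount': total_amount,
            'offer_successful': int(viewed_in_window and completed_in_window),
        })
    return rows
```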

Below we show the offer_completed dataset (offers that were completed) and the whole dataset (offers that were completed and not completed). We label offers that were completed as successful and those that were not completed as unsuccessful. Look at the various variables. There are 1548 data points for offers that were successfully completed. The clean data has 68118 rows with 38 columns; for our analysis, we use the 68115 clean data entries. Given its structure, the clean data is an unbalanced dataset.

Offer_completed and clean_data

Look at the boxplot below. It presents the distribution of gender and age against the total amount the company receives from all successful offers (offers received, viewed and completed).

Boxplot of successful offers

We see that most of the amounts received from successful offers are centred around lower values, roughly between 0 and 50 dollars. Nevertheless, there are users whose total amounts exceed 60 dollars. Considering that we have six years of data and the total amount generated for the company so far, it may be that the products offered to users belong to a lower price segment.

Finally, we present a heatmap to understand the pattern of relationship among the numerical variables.

Relation between variables

Interestingly, total_amount, discount and social are positively correlated. Duration and discount seem to be highly correlated. Variables like mobile and social also seem positively correlated. Duration and social appear to be negatively correlated, and mobile and discount seem to be negatively correlated as well.

With this short analysis, we continue to part three, where we answer our research questions.

5. Selection of variables and model construction

Consider our clean data for the analysis. We have 68115 entries and 38 columns in total. The variables are of type int or float. The dependent variable is offer_successful, a binary variable with value 1 for successful and 0 for unsuccessful offers.

Structure of the clean dataset

The minority class (offer successful) has 1548 observations, and the majority class (offer unsuccessful) has 66567 observations. So we have a minority-to-majority ratio of roughly 1:43, and hence the dataset is hugely unbalanced. Take a look at the code below and follow it.

We scale the important numerical variables.

The target and features are constructed, and we set random_state = 42 for consistency of results. Follow the code steps below to get an understanding of how we arrive at the train and test sets.

preprocessing and selection of train and test dataset
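A minimal sketch of this preprocessing; the scaler choice and the list of scaled columns are assumptions, while the 25% stratified split is consistent with the 17029 test samples (387 successful, 16642 unsuccessful) reported below:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Scale the important numerical variables (column list is illustrative)
num_cols = ['total_amount', 'income', 'age', 'time', 'duration', 'difficulty', 'reward']
scaler = MinMaxScaler()
clean_data[num_cols] = scaler.fit_transform(clean_data[num_cols])

# Features and target
X = clean_data.drop(columns='offer_successful')
y = clean_data['offer_successful']

# 25% held out for testing; stratify keeps the 1:43 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
```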

We have an unbalanced dataset and therefore use evaluation tools that can handle its skewed class distribution: threshold, ranking and probabilistic metrics.

A. Threshold Metrics

1. Which model has the best predictive power?

To answer this question, we look at evaluation scores such as precision, recall and F1-score. We first explain the different prediction outcomes, namely true positives, false positives, true negatives and false negatives, that are used to calculate these scores.

True positives are offers that are successful and that the classifier correctly identifies as successful. A false positive occurs when an offer is unsuccessful but the classifier identifies it as successful. True negatives are correctly identified unsuccessful offers, and false negatives are successful offers incorrectly identified as unsuccessful.

This is how to calculate these metrics:
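With TP, FP and FN denoting the counts of true positives, false positives and false negatives:

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```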

In order to answer our first research question, we train and test five models and determine which of them has the best predictive power by evaluating their scores. We set up the models with the code below. The models are LogisticRegression, RandomForestClassifier, KNeighborsClassifier, GaussianNB and XGBClassifier.
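A minimal sketch of the model setup and evaluation loop, using default hyperparameters (the exact settings in the notebook may differ):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Default hyperparameters throughout; the notebook's settings may differ
models = {
    'LogReg': LogisticRegression(max_iter=1000, random_state=42),
    'RF': RandomForestClassifier(random_state=42),
    'KNN': KNeighborsClassifier(),
    'GNB': GaussianNB(),
    'XGB': XGBClassifier(random_state=42),
}

# Fit each classifier on the training set and report its test-set performance
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
```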

The dataframe of the confusion matrix (left graph below) displays the results on the test dataset. There are two classes: 0 (offers that are not successful, the majority class) and 1 (offers that are successful, the minority class). The rows represent the models' predicted values, and the columns represent the actual true values.

Confusion Matrix and Classification Report

The first row represents the offers a model predicts as unsuccessful, i.e. the prediction for the majority class. As we see, the LogReg model has correctly predicted 16548 unsuccessful offers (true negatives) and incorrectly predicted 94 truly successful offers as unsuccessful (false negatives).

On the other hand, the second row of the LogReg results has index 1 (the prediction for the minority class) and represents offers that the model has predicted as successful. LogReg has predicted 334 unsuccessful offers as successful (false positives) and 53 truly successful offers as successful (true positives).

When we compare the models, XGB correctly predicted the highest number of true positives, with 219 successful offers. It is followed by the RF model with 191 successful offers. LogReg has the lowest count, with 53 correctly predicted offers.

Because of the unbalanced dataset, our focus is on analysing the minority class. We select the model with the highest predictive power for that class.

The outcome of the classification report is presented above. The precision, recall, F1-score and support values are reported for the minority class (offer successful) as well as the majority class (offer unsuccessful). The support on which the predictions are based is also shown: 387 samples for the minority class and 16642 for the majority class, a total of 17029. In addition, the overall accuracy, macro average and weighted average are reported.

Among the various metrics, we are interested in the precision, recall and F1-score of the minority class (successful offers). Given the unbalanced nature of the dataset, the recall score seems most appropriate. The scores above are relatively high for the majority class, but because of the imbalance these numbers are of little use. Looking at all the results, we observe that XGB has comparatively better evaluation scores (precision, recall and F1-score of 56%, 57% and 56% respectively), followed by the RF model with 57%, 49% and 53%. The worst performance comes from KNN.

2. Which model performs better under empirical bootstrapping with 40 samples?

We perform bootstrap resampling with 40 samples to determine the classifiers' performance, using the weighted recall, precision, accuracy and area under the ROC curve (roc_auc). We also retrieve the mean and standard deviation of these performance scores, and report the fit and score times. With regard to the performance scores, XGB performs better than the rest of the classifiers. KNN has the worst score time. The XGB classifier has better fit and score times than the RF model. Based on this analysis, we consider selecting XGB over the other classifiers.
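A minimal sketch of such a bootstrap evaluation; whether the original analysis scored on a fixed test set or on out-of-bag samples is an assumption here:

```python
import time
import numpy as np
import pandas as pd
from sklearn.utils import resample
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

def bootstrap_scores(model, X_train, y_train, X_test, y_test, n_rounds=40, seed=42):
    """Fit the model on n_rounds bootstrap resamples of the training data and
    collect weighted performance scores plus fit/score times.
    Scoring on the fixed test set is an assumption of this sketch."""
    rng = np.random.RandomState(seed)
    records = []
    for _ in range(n_rounds):
        Xb, yb = resample(X_train, y_train, replace=True, random_state=rng)

        t0 = time.time()
        model.fit(Xb, yb)
        fit_time = time.time() - t0

        t0 = time.time()
        y_pred = model.predict(X_test)
        score_time = time.time() - t0

        records.append({
            'precision': precision_score(y_test, y_pred, average='weighted', zero_division=0),
            'recall': recall_score(y_test, y_pred, average='weighted'),
            'accuracy': accuracy_score(y_test, y_pred),
            'roc_auc': roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
            'fit_time': fit_time,
            'score_time': score_time,
        })
    # mean and standard deviation of each metric over the 40 resamples
    return pd.DataFrame(records).agg(['mean', 'std'])
```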

Standard deviation and Mean of Performance and Time metrics

Training and Validation Accuracy

To assess how the models perform with varying numbers of training samples, we use learning curves. This is achieved by monitoring the training and validation scores as the number of training samples increases. Evaluation on the training dataset gives us an idea of how well the model is learning, and the validation dataset tells us how well the model is generalising.
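A minimal sketch for one classifier using scikit-learn's learning_curve; the number of folds and the training-size grid are illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from xgboost import XGBClassifier

# Accuracy on training and validation folds for increasing training-set sizes
# (cv and train_sizes below are illustrative choices)
train_sizes, train_scores, val_scores = learning_curve(
    XGBClassifier(random_state=42), X, y,
    cv=5, scoring='accuracy',
    train_sizes=np.linspace(0.1, 1.0, 10), n_jobs=-1,
)

plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training accuracy')
plt.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation accuracy')
plt.xlabel('Training sample size')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```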

The y-axis presents the model accuracy and the x-axis the sample size. The graphs below show the learning curves for the XGB, RF, GNB and KNN classifiers.

Learning curves for RF and XGB Classifiers
Learning curves for GNB and KNN

These are the training and validation curves for the four classifiers. For the XGB model, the training accuracy is very high at the beginning but begins to decrease as the sample size increases, while the validation accuracy increases with increasing sample size. This is understandable, as the model learns from previous errors and corrects them over successive iterations. The RF model's training accuracy, on the other hand, remains constant with increasing training samples, while its validation accuracy also increases with increasing sample size.

However, some caution is needed here: this result takes both classes into consideration.

B. Ranking Metrics

3. Which model has the highest area under the precision-recall curve?

To evaluate the performance of the models on this binary classification task, we use the precision-recall curve. As stated, given the unbalanced nature of our dataset, we focus on the minority class. We measure the performance of the classifiers using a metric known as AUC-PR, the area under the precision-recall curve, which can be expressed as a percentage.

Take a look at the code below:
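A minimal sketch, assuming model is one of the fitted classifiers from above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, auc

# 'model' is one fitted classifier from the previous section (assumption)
# Predicted probability of the positive (minority) class on the test set
y_proba = model.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
auc_pr = auc(recall, precision)

# No-skill baseline: precision equal to the positive-class prevalence
no_skill = y_test.sum() / len(y_test)

plt.plot(recall, precision, color='green', label=f'Model (AUC-PR = {auc_pr:.1%})')
plt.plot([0, 1], [no_skill, no_skill], 'b--', label='No skill')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()
```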

Below we present the precision-recall graphs for four of our classifiers. On the y-axis we have the precision and on the x-axis the recall values. The precision-recall curve is represented by the green line, and the dotted blue line is the no-skill classifier. The points form the curve, and classifiers that perform better over a range of thresholds are ranked higher. A perfect classifier is represented by a point in the top right corner.

XGB and RF classifiers
GNB and KNN classifiers

RF has the highest predictive power when we consider the area under the curve, with 58.9%, followed by the XGB classifier with 55.1%. The F-score of XGB is higher than that of the RF classifier. The worst classifier is KNN.

Which model do we select to move forward? The training and validation accuracy favours the XGB model, and the bootstrapping with 40 samples also favours XGB when we consider the performance and time metrics. Hence, we select XGB over the RF classifier.

C. Probabilistic Metrics for Imbalanced Classification

4. How certain is the selected model in its predictions?

Probabilistic metrics are designed specifically to quantify the uncertainty in a classifier’s predictions.

With this tool, we quantify the uncertainty in the predictions of our selected classifier, XGB. This allows us to examine whether there is any underfitting or overfitting, and whether the training dataset is suitable for modelling. We therefore use score measures such as log loss and classification error.

The graph below shows the log loss for the train and test data. Both begin to fall sharply initially, with no gap between them. After a while, however, a gap appears: the classification error for the test data appears to increase while that of the train set continues to fall. Given this observation, we tune the model by fine-tuning its hyperparameters.
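A minimal sketch of how such curves can be produced from XGBoost's evaluation history; note that the placement of eval_metric differs between xgboost versions:

```python
import matplotlib.pyplot as plt
from xgboost import XGBClassifier

# Track log loss and classification error on the train and test sets during boosting.
# eval_metric is passed to the constructor here; older versions expect it in fit().
xgb = XGBClassifier(random_state=42, eval_metric=['logloss', 'error'])
xgb.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], verbose=False)

results = xgb.evals_result()
epochs = range(len(results['validation_0']['logloss']))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(epochs, results['validation_0']['logloss'], label='Train')
ax1.plot(epochs, results['validation_1']['logloss'], label='Test')
ax1.set_title('Log loss')
ax1.legend()
ax2.plot(epochs, results['validation_0']['error'], label='Train')
ax2.plot(epochs, results['validation_1']['error'], label='Test')
ax2.set_title('Classification error')
ax2.legend()
plt.show()
```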

Hence, we set the following parameters by using GridSearch as an optimisation tool. We fit 5 folds for each of 120 candidates, totalling 640 fits.

Hyperparameter Search with GridSearchEstimator
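A minimal sketch of the grid search; the parameter grid and the scoring choice are illustrative assumptions and do not reproduce the exact 120-candidate grid used above:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative search space, not the grid used in the notebook
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.01, 0.1],
    'scale_pos_weight': [1, 43],  # ratio of negative to positive samples
}

grid = GridSearchCV(
    estimator=XGBClassifier(random_state=42),
    param_grid=param_grid,
    scoring='recall',  # we care most about recall on the minority class
    cv=5,
    n_jobs=-1,
    verbose=1,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
best_model = grid.best_estimator_
```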

In the graph above, we obtain the best parameters. Hence, we build our model with the newly acquired parameters and train and test it once again. Below are the resulting confusion matrix and classification report. The true positives for the minority class now stand at 282 and the false positives at 105, an improvement on the initial results of 218 true positives and 198 false positives. The precision, recall and F1-score are now 57%, 73% and 64% respectively; originally they were 56%, 57% and 57%. The false negatives for the minority class have reduced, much to our excitement, as the recall now stands at 73%.

Confusion matrix and classification report

After fitting and tuning the model with the grid search parameters, we observe that the log loss and classification error have decreased, and there is virtually no gap between the log loss of the train and test sets. This confirms the improvement on the minority class, which can also be observed in the classification error.

Log loss, classification error and threshold between precision and recall

The last graph shows the trade-off between precision and recall across thresholds. Obviously, if we want to increase recall, we need to compromise on precision, and vice versa; that is, we have to reduce the threshold to increase recall. We see that both curves meet at a threshold of 0.55 and a value of 0.6. The code is available on the GitHub site.

5. Which features should decision makers pay most attention to?

To assess how much each feature in the model has contributed to the prediction, we use SHAP values. SHAP values are used to explain the predictions of a model. First, we create the model, then we pass it into the SHAP explainer function to create an explainer object, which we then use to calculate the SHAP values for each observation.

Create SHAP Values

Passing the model to the SHAP explainer and creating SHAP values
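A minimal sketch, assuming best_model is the tuned XGB classifier from the grid search and that X_test is a DataFrame with named columns such as total_amount:

```python
import shap

# best_model: tuned XGB classifier from the grid search (assumed)
# Build an explainer from the tuned model and compute SHAP values for the test set
explainer = shap.Explainer(best_model, X_train)
shap_values = explainer(X_test)

# The plots discussed in the following sections
shap.plots.beeswarm(shap_values)                     # Plot 1
shap.plots.force(shap_values[0])                     # Plot 2: a single prediction
shap.plots.waterfall(shap_values[0])                 # Plot 3
shap.plots.scatter(shap_values[:, 'total_amount'])   # Plot 4
```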

Plot 1. Beeswarm plot

The beeswarm plot highlights important relationships between the features and the model's predictions. The values are grouped by feature on the y-axis, and within each group the colour of a point is determined by the value of that feature. Redder points indicate feature values that push the prediction towards the minority class (offer successful); bluer points indicate values that push it away.

Let us look at the beeswarm plot below. Consider the first variable, total_amount. As its value increases, its points become redder, an indication that it contributes positively towards the minority class (offer successful).

On the other hand, decreasing reward values may work against the minority class (offer successful). If the difficulty level is reduced, offers sent to users are more likely to be successful. Reducing the social variable may also work against the minority class. Some variables appear to be neutral; the web variable, for example, does not seem to influence the model. This analysis can be carried out for all variables to determine how each contributes to the model.

Plot 2.1 Force plot of the minority class

The force plot below shows the prediction for the minority class. Features that are important for the model's prediction are shown in red and blue, with red representing features that push the model score higher, and blue representing features that push the score lower.

In addition, features that have more of an impact on the score are located closer to the dividing boundary between red and blue, and the size of that impact is represented by the size of the bar. We see that the feature total_amount has the largest impact, as it is closest to the boundary and has the largest bar.

From the plot we can see that certain feature values contributed positively to this prediction. Feature values like difficulty = 10, social = 1, reward = 2 and total_amount = 14.44 all contributed to pushing the model's output from the base value of -3.262 to 0.5.

Waterfall plot of the minority class

Plot 2.2 Force plot for the majority class

We see that for the majority class, feature values like total_amount = 0, 2016 = 1, income = 5.8e+4 and time = 567 contribute negatively to the minority class. Feature values such as discount = 1, duration = 7, bogo = 0, offer_id = 6, difficulty = 7 and reward = 3 contribute positively towards the minority class. Starting at a base value of -3.262, they push the model output to -3.12. At the boundary we have the features reward = 3 and total_amount = 0. The impact of the variables pushing from the right (in blue) is stronger than that of those pushing from the left (in red), and thus favours the majority class over the minority class.

Plot 3. Waterfall plot

For a better understanding of the importance of the features for the minority class, we use a waterfall plot.

The bottom of the plot, E[f(x)] = -3.262, is the expected value of the model output. Each row then shows how the positive (red) or negative (blue) contribution of each feature moves the value from this expected output over the background dataset to the model output for this prediction, f(x) = 0.496. The units on the x-axis are log-odds, so negative values imply a probability of less than 0.5 that the feature contributes positively to a successful offer, and positive values imply a probability of more than 0.5.

The feature total_amount increases the log-odds by 2.98, while reward increases them by 0.24. In addition, 26 other features have been collapsed into a single term.

Plot 4. Scatter Plots

Since waterfall plots only show a single sample's worth of data, we cannot see the impact of changing a feature's value. To see this, we use scatter plots, which show how the feature values impact the success of offers (the minority class).

Scatter plot for total_amount, reward, social

From Plot 4, we see that positive values of total_amount contribute towards an offer being successful. With no reward, offers sent to users will most likely not be successful, but rewarding has a positive impact on an offer's success.

Scatter plot for difficulty, offer_id, time

There should be a certain degree of difficulty; however, a difficulty level of 15 clearly has a negative impact on successful offers. Some offer_ids have a positive impact and others a negative one. Those with negative values, such as 2, 3, 5 and 10, may be worth removing. The time feature values have a mixed impact.

Scatter plot for mobile, duration, discount

The last graph shows the features mobile, duration and discount. How will they impact our minority class?

Conclusion

In this project, we analysed the behavioural patterns of Starbucks mobile app users by using supervised machine learning techniques. Specifically, we determined whether an offer sent to a potential customer will be successful or not. Our aim is to provide decision makers with an insight into how users react to some of the parameters in the business model. With this information at hand, they can make changes to optimise their business goals.

The article is divided into three parts. The first part provides some information on the dataset and shows the steps taken to clean the data. The second part presents visualisations of the important variables.

The last part answers five selected questions. Given the unbalanced nature of our dataset, we handle it as an imbalanced classification problem and focus on analysing the minority class, starting with five models. We select the XGB model after evaluating the scores obtained from threshold, ranking and probability metrics. Furthermore, we fine-tune the model's hyperparameters using grid search, which improves evaluation metrics such as precision, recall and F1-score after training and testing. In addition, the log loss and classification error also improve.

Finally, we use our improved model, with the help of SHAP values, to determine how individual features contributed to the selected final model, thereby pointing out features that decision makers may consider adjusting for better success of the offers sent to app users.

Our advice to decision makers:

1. Keep reward at all levels

2. Use social media channels for campaigning

3. There should be a certain degree of difficulty, but a difficulty level of 15 should be removed.

4. Offer_id 1, 6, 7, 8 and 9 should be kept, but consider removing offer_ids 2, 3 and 5

5. Keep duration values of 5 and 7, but remove 1 and 4

6. Give discounts to customers

Many thanks for reading.

