Starbucks Capstone Challenge:
Offer Analysis and Success Prediction.

Jordi Lucas · The Startup · Jan 28, 2021


Project overview

The basis of this project is the analysis and search for offers that successfully engage the company’s existing customers and attract new ones.

Starbucks is a data-oriented company that clearly invests in getting a 360-degree customer view, using datasets that contain customer information, special offers and transactions.

To create a model that can classify the success of a special offer, I’ve worked in three phases:

  1. Inspect and clean the data provided by the company
  2. Create a dataset that combines all this information
  3. Build and measure the performance of three classification models that help predict if a special offer will be successful or not.

In a nutshell, the 360-degree view is the foundation that makes an organization’s relationship with customers experiential rather than transactional — the key to long-standing customer relationships and positive endorsements.

Problem Statement

Think of it as a real-life business problem.

Investing in a powerful marketing campaign is a decision that involves the approval of different stakeholders, money, and time. That’s why a predictive model that classifies whether or not it’s worth launching a certain offer for a specific target group should be a strategic asset for any company.

To create this model we’ll use supervised binary-classification techniques that allow us to predict whether an offer is worth it or not.

The model’s final outcome will show if the offer will be effective or not.

Classification is a type of supervised learning. It specifies the class to which data elements belong and is best used when the output takes finite, discrete values: the model predicts a class for each input.

Some Data Wrangling

Portfolio, Profile and Transcript are three datasets the company provides us with.

Starbucks provides us with three datasets in JSON format: portfolio, profile and transcript. The first offers us an inventory of active offers, the second contains demographic data about members, and the third records how customers interact with the available offers. Let’s analyze these datasets briefly.

Portfolio dataset

At first glance we can see that the channels column contains a list with the channels for that offer. To separate each channel into different columns, one-hot encoding is necessary to binarize them. We will do the same with offer_type.
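
A minimal sketch of this step, assuming the raw dataframe is called portfolio and that the column names match the original file:

import pandas as pd

# binarize the list-valued channels column into one column per channel
channel_dummies = portfolio['channels'].str.join('|').str.get_dummies()
# one-hot encode the categorical offer_type column
offer_type_dummies = pd.get_dummies(portfolio['offer_type'], prefix='offer_type')
# combine everything and drop the original columns
portfolio_clean = pd.concat(
    [portfolio.drop(columns=['channels', 'offer_type']),
     channel_dummies, offer_type_dummies], axis=1)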

Profile dataset

The first noticeable thing in the dataset is that there are 118-year-old members.

People do get older, but it is not very likely that there are 2,175 customers who turned 118 years old. We also observe that the affected rows have empty gender and income columns; they represent 12.97% of the dataset, so it’s better to drop them.

Another aspect to take into account is the treatment of the date column became_member_on, which we will split into separate year, month and day columns. Finally, just like we treated the channels column, we’ll treat the gender column, in which three different values appear.
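
A minimal sketch of these profile steps, assuming the dataframe is called profile, that became_member_on is stored as a YYYYMMDD integer, and that the new column names are illustrative:

import pandas as pd

# drop the rows with the placeholder age of 118 (they also lack gender and income)
profile_clean = profile[profile['age'] != 118].copy()
# split became_member_on into year, month and day columns
became_member = pd.to_datetime(profile_clean['became_member_on'].astype(str), format='%Y%m%d')
profile_clean['member_year'] = became_member.dt.year
profile_clean['member_month'] = became_member.dt.month
profile_clean['member_day'] = became_member.dt.day
# one-hot encode the three gender values
profile_clean = pd.concat(
    [profile_clean.drop(columns=['became_member_on', 'gender']),
     pd.get_dummies(profile_clean['gender'], prefix='gender')], axis=1)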

Transcript Dataset

The main thing we see here is that the value column has different contents that depend on the event column.

If event refers to one of the three possible offer statuses (viewed, received or completed), the value column contains the identifier of the offer; if event refers to a transaction, the value column shows the amount.

It is necessary to separate this behavior into two different datasets, so we will create two new columns: offer_id and amount. When creating the offer_id column, we find that some offers store the identifier under the key offer id and others under offer_id, so we need a small function that can recognize both, as sketched below.
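
A hedged sketch of that helper and the split, assuming the dataframe is called transcript and that the value column holds dictionaries:

def get_offer_id(value):
    """Return the offer identifier whether the key is 'offer id' or 'offer_id'."""
    return value.get('offer id', value.get('offer_id'))

transcript['offer_id'] = transcript['value'].apply(get_offer_id)
transcript['amount'] = transcript['value'].apply(lambda v: v.get('amount'))
# separate offer events from transactions
offers = transcript[transcript['event'] != 'transaction'].drop(columns=['value', 'amount'])
transactions = transcript[transcript['event'] == 'transaction'].drop(columns=['value', 'offer_id'])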

This is the result just before splitting the dataset:

Inspect the data

Some EDA conclusions

The first conclusion is that the age range between 50 and 60 years old contains the most members, regardless of their gender.

Figure 1

Continuing with the demographic data we can see in figure 2 that there are more male members than female members, in all age ranges except the 80-year-olds.

Figure 2

Focusing on income by gender, there is a big difference between men and women up to $80,000 a year. As figure 3 shows, women have the highest incomes according to the data provided by the company.

Figure 3

On the other hand, figure 4 shows that 2017 was the year with the largest increase in members, both female and male. This is interesting: perhaps the marketing campaign changed that year, because until then there had always been more female members.

Figure 4

If we make a bivariate analysis of the features amount and income, a correlation is expected. Members with higher incomes probably spend more.

Figure 5

Indeed, we see this correlation if we plot the heat map between three of the most relevant features related to the members’ demographic information, shown in figure 6.

Figure 6

Even so, possible outliers are observed if we look at the data related to the feature amount.

The description of the continuous variable amount clearly shows a maximum of $1,062.28, which suggests the presence of outliers. We will choose a threshold at the 0.995 quantile and substitute all the data above it with the mean. The code below shows how.
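
A minimal sketch of that substitution, assuming the amounts live in a dataframe called transactions:

# values above the 0.995 quantile are treated as outliers and replaced with the mean
threshold = transactions['amount'].quantile(0.995)
mean_amount = transactions['amount'].mean()
transactions.loc[transactions['amount'] > threshold, 'amount'] = mean_amount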

Combining all in one

At this point we need to combine the clean datasets and create a column that will act as the classification target (offer_successful). This has to be done in such a way that we get a single dataset that allows us to build the models that will determine whether a certain offer is successful for a certain type of customer or not.

The necessary condition to determine if an offer is successful is: a customer must view and complete it within the allowed time.

I have developed a support function to derive this target value from the offer completed and offer viewed events and the time range between them.
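
A heavily simplified sketch of the idea, assuming the events have already been pivoted into one row per customer/offer pair in a dataframe called combined_data, and that the column names (time_received, time_viewed and time_completed in hours, duration in days) are hypothetical:

import pandas as pd

def offer_successful(row):
    """Label an offer 1 if it was viewed and then completed within its validity window."""
    if pd.isnull(row['time_viewed']) or pd.isnull(row['time_completed']):
        return 0
    within_window = row['time_completed'] - row['time_received'] <= row['duration'] * 24
    return int(row['time_viewed'] <= row['time_completed'] and within_window)

combined_data['offer_successful'] = combined_data.apply(offer_successful, axis=1)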

Once we have the master dataset that will help us train the models, we can see the number of successful and unsuccessful offers. It is a balanced dataset.

Figure 7: data_combined is a balanced dataset

Finally, the master dataset, called combined_data, is saved. We’ll load it later into the next and last notebook, where we’ll build and evaluate our three models.

Metrics

Accuracy, precision, recall and F1-score.

Metrics used in this project are the common ones used to calculate any classification model’s performance. Performance evaluation of a classification model is based on the number of test records correctly and incorrectly predicted by the model. The confusion matrix represents this.

The confusion matrix provides a more detailed overview that not only includes the performance of a predictive model, but also which classes are being predicted correctly and incorrectly, and what type of errors are being made.

In our project, for example, the two classes are:

  • offer successful = 1
  • offer not successful = 0

As the confusion matrix shows, there are four types of results depending on their classification: TP, TN, FP and FN. In this project’s code, available on GitHub (link at the end of the article), you can see each of them for each of the models used. The metrics derived from these results are:

  • Accuracy
  • Precision
  • Recall
  • F1-score

Let’s have a look at them briefly:

Accuracy is the most frequent classification evaluation metric. It works well in balanced datasets (figure 7). Accuracy measures the percentage of cases that the model has classified correctly. Accuracy can be misleading, though. It can make a dysfunctional model look like it’s a good one.

The accuracy metric does not work well when classes are unbalanced. For problems with unbalanced classes it is much better to use precision, recall and F1-score. These metrics give a better idea of the quality of the model.

Precision tells us about the quality of the model’s positive predictions. In our project, this metric answers the question:

Q: Of the offers classified as successful, what percentage are actually successful?

Recall tells us how many of the truly successful offers the model is able to identify. In our project, this metric answers the question:

Q: Of the offers that are actually successful, what percentage does the model identify as such?

F1-score combines the precision and recall metrics into a single value. This is practical because it makes it easier to compare the combined precision/recall performance of different solutions.
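
A minimal sketch of how these metrics are obtained with scikit-learn, assuming y_test holds the true labels and predicted holds a fitted model’s predictions:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

print('Accuracy :', accuracy_score(y_test, predicted))
print('Precision:', precision_score(y_test, predicted))
print('Recall   :', recall_score(y_test, predicted))
print('F1-score :', f1_score(y_test, predicted))
# rows are the actual classes, columns the predicted classes
print(confusion_matrix(y_test, predicted))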

Build and model performance

To get a model that allows us to predict whether or not an offer is successful, we need to train on the dataset created by combining the three initial ones. This is clearly a classification problem, and for it we have chosen three different supervised algorithms:

  • Logistic Regression
  • Gradient Boosting
  • Random Forest

To measure our models’ performance we’ll use a visualization based on the ROC curve.

ROC-AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.

To run the search for each model, I’ve developed a dynamic function that receives the necessary parameters, and we’ll use RandomizedSearchCV with 12 iterations to keep the training time reasonable.

In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter.
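
A hedged sketch of what that helper could look like; the function name, scoring metric and number of cross-validation folds are illustrative assumptions, not the exact choices in the repository:

from sklearn.model_selection import RandomizedSearchCV

def run_random_search(clf, param_grid, X_train, y_train, n_iter=12):
    """Sample n_iter parameter settings from param_grid and fit the best one."""
    search = RandomizedSearchCV(estimator=clf,
                                param_distributions=param_grid,
                                n_iter=n_iter,
                                scoring='accuracy',
                                cv=3,
                                random_state=42,
                                n_jobs=-1)
    search.fit(X_train, y_train)
    return search

# e.g. log_reg_random = run_random_search(log_reg, grid_params, X_train, y_train)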

Logistic Regression

The results of this algorithm are not as good as those of the other two.

from sklearn.linear_model import LogisticRegression

# construct a params dict to tune the model
grid_params = {'penalty': ['l1', 'l2'],
               'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
# instantiate a logistic regression classifier object
log_reg = LogisticRegression(random_state=42, solver='liblinear')

Gradient Boosting

With gradient boosting we improve slightly and get better metrics than with logistic regression.

from sklearn.ensemble import GradientBoostingClassifier

# Minimum number of samples required to split a node
min_split_samples = [2, 5, 8, 11]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4, 6, 8]
# Create the random grid
gb_random_grid = {'loss': ['deviance', 'exponential'],
                  'learning_rate': [0.1, 0.01, 0.001],
                  'n_estimators': [10, 30, 50, 100, 150, 200, 250, 300, 350],
                  'min_samples_leaf': min_samples_leaf,
                  'min_samples_split': min_split_samples}
# instantiate the classifier object
gb_clf = GradientBoostingClassifier(random_state=42)

Random Forest

Finally, with Random Forest we get the best results.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Number of trees in random forest
n_estimators = [10, 50, 100, 150, 200, 250, 300, 350]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.arange(3, 13)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_split_samples = [2, 5, 8]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Create the random grid
random_grid_params = {'n_estimators': n_estimators,
                      'max_features': max_features,
                      'max_depth': max_depth,
                      'min_samples_split': min_split_samples,
                      'min_samples_leaf': min_samples_leaf}
# instantiate a random forest classifier
rf_clf = RandomForestClassifier(random_state=42)

Furthermore, with this algorithm we’ll know the most important features of the dataset.

relative_importance = rf_random.best_estimator_.feature_importances_
Figure 8
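
A minimal sketch of how the importances behind figure 8 could be ranked, assuming X_train is a pandas DataFrame whose columns match the model’s features:

import pandas as pd

# pair each importance with its feature name and show the top ten
feature_importance = pd.Series(relative_importance, index=X_train.columns)
print(feature_importance.sort_values(ascending=False).head(10))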

ROC-AUC comparison

Finally, we compare the three models using the ROC-AUC curve, and we see in figure 9 that the best classifier for our dataset is Random Forest.

from sklearn.metrics import roc_curve, auc

# Calculate the ROC-AUC measure from the test labels and the model's predictions
fpr, tpr, thresholds = roc_curve(y_test, predicted)
roc_auc = auc(fpr, tpr)
Figure 9
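
A minimal sketch of the kind of plot behind figure 9 for a single model; in the project the same curve is drawn for all three classifiers on one figure:

import matplotlib.pyplot as plt

plt.plot(fpr, tpr, label='Random Forest (AUC = %.3f)' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance level
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()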

Model Evaluation and Validation


As we’ve seen, Random Forest offers the best performance. We’re going to analyze and explain the parameters that we’ve used to configure it.

# Number of trees in random forest
n_estimators = [10, 50, 100, 150, 200, 250, 300, 350, 500]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.arange(3, 13)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_split_samples = [2, 5, 8]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
  • n_estimators: number of trees in the forest.
  • max_features: max number of features considered for splitting a node.

The more features an individual tree is allowed to consider, the more chance it has of overfitting the training data; however, since a Random Forest averages many individual trees, this is not a big problem.

  • max_depth: max number of levels in each decision tree.
  • min_split_samples: min number of data points placed in a node before the node is split.
  • min_samples_leaf: min number of data points allowed in a leaf node.

After training the model, these are the best parameters:

>>> rf_random.best_params_
{'n_estimators': 500,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 12}

The number of trees, represented by the n_estimators parameter, has an impact on the final results: more trees create a more robust aggregate model with less variance, at the cost of a longer training time.

On the other hand, increasing the depth of individual trees (max_depth) increases the number of feature/value combinations that are taken into account.

This parameter should be set to a reasonable value depending on the number of features in your data.

And this is the best estimator, with some of the most interesting parameters worth a closer look:

>>> rf_random.best_estimator_
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=12, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

There are some interesting parameters to look at more closely here. The criterion in classification decision trees can be either the Gini index or entropy.

The Gini index takes values in the interval [0, 0.5], whereas entropy lies in [0, 1]. Computationally, entropy is more expensive since it uses logarithms, so the Gini index is faster to calculate.

Usually the final results are quite similar, so the extra training time required by the entropy criterion is rarely worth it.
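
As a quick illustration of those ranges, here is a small snippet that computes both impurity measures for a binary split with class probability p:

import numpy as np

def gini(p):
    """Gini impurity for a binary split; maximum 0.5 at p = 0.5."""
    return 1 - (p ** 2 + (1 - p) ** 2)

def entropy(p):
    """Entropy for a binary split; maximum 1.0 at p = 0.5."""
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

print(gini(0.5), entropy(0.5))  # 0.5 1.0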

Evaluation

The final evaluation of the model is given by the confusion matrix explained above, which summarizes our model’s performance.

Metrics values and confusion matrix of Random Forest

Justification

To achieve the main objective of this problem — identify successful offers for a specific customer target group — I’ve used, after a meticulous data cleansing process and feature engineering, machine learning classification techniques based on supervised learning.

From a wide variety of algorithms, I’ve chosen three that are easy to implement and understand. After that, I’ve extracted the metrics described above and, finally, compared their classification performance.

The intention was to find the best-performing model, to provide the company with a useful, strategic tool for business processes and data-based decision making. I can say I’ve found a valid model which, although it can be improved, is a helpful starting point.

Random Forest gave the best performance

Improvements


There are some improvements that could be applied to this project. The one I find most interesting is creating multiple supervised learning models and combining them into a custom ensemble model.

The advantage of combining different models is that, because each model works differently, their errors tend to compensate for each other, which leads to a better generalization error.

At the code level, this technique would fit the project by using the VotingClassifier ensemble method.

from sklearn.ensemble import VotingClassifier

# create a list of (name, model) pairs from our fitted models
estimators = [('lr', log_reg_random), ('gb', gboost_random),
              ('rf', rf_random)]
# create our voting classifier, inputting our models
ensemble = VotingClassifier(estimators, voting='hard')
# fit model to training data
ensemble.fit(X_train, y_train)
# test our model on the test data
ensemble.score(X_test, y_test)

Conclusions

The project is open to checking performance with other types of classification algorithms such as SVM, k-NN, XGBoost or LightGBM. In addition, more demographic information about customers would help improve the models.

City, zip code, job, or a favorite Netflix series are easy data to obtain in surveys, and they provide relevant information when it comes to knowing customers and their preferences.

Creating an application that applies the best classifier model is an interesting idea to consider. This application can be used to predict whether or not a certain type of offer is suitable for a certain type of customer before investing in a marketing campaign.

See the complete code in my GitHub repository.
