Comparethemarket.com is exploring ways to offer a more personalised, relevant experience to each of its customers. In order to achieve a great content recommendation and targeted products, it is strongly advised to model BGL’s user data to better understand customer’s behaviour.

For the purpose of this case study, a sample dataset was kindly provided by BGL to assist in the process. RStudio is the software that is used with the modelling of the data being implemented using Machine Learning techniques.

Context and Proposed Approach

As Bryan Eisenberg once said, “Our jobs as marketers are to understand how the customer wants to buy and help them to do so.” Having that in mind, one invaluable question for every business to answer, is “Would someone buy this product?” or better yet, “How probable would be that someone would buy that product?”. The answer to these questions can be achieved with the assistive power of predictive modelling. And this is where data comes into play.

Using Logistic Regression with L1 penalisation (Lasso Logistic Regression), we intend to build a recommendation model that will have a specific probability as an output. This probability indicates how probable would be that a specific user would interact with the respective product. For instance, if the output of the model is “1”, this suggests that the data yields a high confidence that the customer would interact with the product, whereas an output of “0” suggests otherwise. The reason for the choice of this particular machine learning model is due to the fact that Logistic Regression is a powerful model for such occasions.

Preliminary Insights and Conclusions

## REQUIRED PACKAGES
library(ggplot2)
library(dplyr)
library(ROCR)
library(caret)
library(randomForest)
library(glmnet)
library(gridExtra)

In order to have the best results possible, it is highly important to explore the data and prepare it for modelling - preprocessing.

First, let us explore how the data looks like

## READ THE DATA
data <- read.csv("CTM_DataScientistCaseStudy.csv")

## EXPLORE HOW THE DATA LOOKS
dim(data)  ## 100.000   6
## [1] 100000      6
head(data)
##   UserID Age UserSegment  Recency PriorEvent Event
## 1      1  56           A Inactive          0     0
## 2      2  39           A Inactive          0     0
## 3      3  62           A   Active          1     1
## 4      4  29           C Inactive          0     0
## 5      5  41           C  Dormant          0     0
## 6      6  53           A   Active          1     1
tail(data)
##        UserID Age UserSegment Recency PriorEvent Event
## 99995   99995  23           C  Active          0     0
## 99996   99996  43           C  Active          0     0
## 99997   99997  58           C  Active          0     0
## 99998   99998  70           A  Active          0     0
## 99999   99999  24           C  Active          0     0
## 100000 100000  45           C  Active          0     0

The data consists of 100.000 instances (i.e. customers) and 6 columns that correspond to

## EXPLORE THE ATTRIBUTES OF THE DATA
colnames(data)
## [1] "UserID"      "Age"         "UserSegment" "Recency"     "PriorEvent" 
## [6] "Event"

We observe that there are no duplicated entries in the data and that every user ID is unique and corresponds to a single individual.

## CHECK FOR DUPLICATED ENTRIES 
sum(duplicated(data))    ## 0 DUPLICATED
## [1] 0
n_distinct(data$UserID)  ## 100.000 UNIQUE CUSTOMER IDS
## [1] 100000

For a sanity check, we make sure that the columns of the data are in the correct format as this plays a major role in the modelling process later on.

## CHECK IF VARIABLES ARE IN THE CORRECT FORM 
str(data)
## 'data.frame':    100000 obs. of  6 variables:
##  $ UserID     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age        : int  56 39 62 29 41 53 30 44 71 40 ...
##  $ UserSegment: Factor w/ 4 levels "A","B","C","NULL": 1 1 1 3 3 1 3 1 3 2 ...
##  $ Recency    : Factor w/ 4 levels "Active","Dormant",..: 3 3 1 3 2 1 3 1 2 1 ...
##  $ PriorEvent : int  0 0 1 0 0 1 0 0 0 0 ...
##  $ Event      : int  0 0 1 0 0 1 1 0 0 1 ...

To explore the data even further, it is highly recommended to have a summary of the data and the values that exist in each column. We also investigate for any N/A values. As it is shown below,

## TAKE A GENERAL INSIGHT OF THE DATA
summary(data)
##      UserID            Age        UserSegment      Recency     
##  Min.   :     1   Min.   :18.00   A   :21017   Active  :55428  
##  1st Qu.: 25001   1st Qu.:28.00   B   :19973   Dormant :22275  
##  Median : 50000   Median :36.00   C   :58517   Inactive:21804  
##  Mean   : 50000   Mean   :38.75   NULL:  493   NULL    :  493  
##  3rd Qu.: 75000   3rd Qu.:48.00                                
##  Max.   :100000   Max.   :75.00                                
##    PriorEvent          Event    
##  Min.   :0.00000   Min.   :0.0  
##  1st Qu.:0.00000   1st Qu.:0.0  
##  Median :0.00000   Median :0.0  
##  Mean   :0.09178   Mean   :0.3  
##  3rd Qu.:0.00000   3rd Qu.:1.0  
##  Max.   :1.00000   Max.   :1.0
cat("The number of N/A values are:", sum(is.na(data)), "\n")   ## NO N/A VALUES
## The number of N/A values are: 0
cat("The valid User Segment entries are:", levels(data$UserSegment), "\n")   ##   "A"        "B"        "C"      "NULL"
## The valid User Segment entries are: A B C NULL
cat("The valid Recency entries are:",levels(data$Recency))      ## "Active"   "Dormant"  "Inactive" "NULL"
## The valid Recency entries are: Active Dormant Inactive NULL

Although there are no N/A values per se, we discovered entries both in the “UserSegment” and “Recency” columns with the entry of “NULL”. These entries are 493 in both cases, which corresponds to almost 0.5% of the data. Therefore, and since we do not have any data on what the NULL values mean, we are going to ignore these records as this approach is considered an effective method for handling such data when the number of the occurrences is so low.

Our predictions will be focused on the “Event” variable as it will be treated as our response. “Event” will have the value of “1” when a customer interacted with a specific product or service (i.e. shown interest) and the value “0” otherwise. In our sample, 70% percent of the data corresponds to products where the customer showed no interest and the rest of 30% the opposite.

## DISTRIBUTION OF THE INTEREST OF THE CUSTOMER FOR A PRODUCT
table(data$Event)        ## POSTERIOR: NO INTEREST(0): 70.000      INTEREST(1): 30.000
##     0     1 
## 70000 30000

Apart from that, previous knowledge of the customer’s interest is also available in the “PriorEvent” variable where we see that the percentages are slightly more skewed in favor of the “No Interest” incidents. More specifically, almost 90% of the proposed products, were of no interest to the customers.

## DISTRIBUTION OF THE PRIOR INTEREST OF THE CUSTOMER FOR A PRODUCT
table(data$PriorEvent)   ## PRIOR: NO INTEREST(0): 90.822      INTEREST(1): 9.178
##     0     1 
## 90822  9178

Two major observations need to be discussed here. The first one is about what changed between the period of the “PriorEvent” and the “Event”. Whether this corresponds to a change in a marketing campaign or a period of economic growth, it could be a factor that needs to be investigated in order to understand the difference of interest of the customers. The second point that needs to be discussed is the question that the BGL group is trying to answer out of the data. The variable “PriorEvent” is going to be part of the modelling process. This denotes that the output of the model will indicate the probability that a customer will re-grow, or not, interest in the particular product. Thus, the question that will be answered is going to be whether the customer will re-perform an action to this particular product.

PRELIMINARY ANALYSIS

Our age range is from customers between 18 and 75 years old with an average age of almost 39 years old.

summary(data$Age)  ## CHECK THE SUMMARY OF THE AGE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   28.00   36.00   38.75   48.00   75.00

The major user segment is the “C” segment which accounts for 58,8% of the data and the rest almost evenly split between segments “A” and “B”; Segment “A” with 21,1% and segment “B” with 20,1%.

## DISTRIBUTION OF SEGMENTS
Segment <- as.data.frame(table(data$UserSegment))
Segment <- Segment[Segment$Var1 != "NULL",]   ## REMOVE THE NULL OBSERVATIONS
ggplot(data = Segment, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", fill = "#F8766D") +
  geom_text(y = Segment[,2] - 2000,
            aes(label = paste0(round((Segment[,2]/sum(Segment[,2]))*100, digits = 1), "%" ))
  ) +
  labs(x = "Segments", y = "Frequency") +
  ggtitle("User Segments Distribution")

In addition, almost 55,7% of the data are described as “Active” users, 22,4% as “Dormant” and 21,9% as “Inactive”.

## DISTRIBUTION OF RECENCY
RecencyDis <- as.data.frame(table(data$Recency))
RecencyDis <- RecencyDis[RecencyDis$Var1 != "NULL", ]  ## REMOVE THE NULL OBSERVATIONS
ggplot(data = RecencyDis, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", position = "dodge", fill = "#F8766D") +
  geom_text(y = RecencyDis[,2] - 2000,
            aes(label = paste0(round((RecencyDis[,2]/sum(RecencyDis[,2]))*100, digits = 1), "%" ))
            ) +
  labs(x = "Recency", y = "Frequency") +
  ggtitle("Recency Distribution")

We also observe that the “Ages” distribution with respect to the “Recency” status follows the same pattern. The age group of 20-35 looks to be the most frequent in the respective “Recency” statuses.

## DISTRIBUTION OF THE AGES ACCORDING TO RECENCY
## VISUALISE THE CORRESPONDING DATA
AgeActivity <- as.data.frame(data %>%
       group_by(Recency, Age)
     %>% summarise(counts = n())
     %>% filter(Recency != "NULL")
     %>% arrange(desc(counts))
)

p1 <- ggplot(data = AgeActivity, aes(x = factor(AgeActivity[,2]), y = AgeActivity[,3])) +
  geom_bar(stat = "identity", aes(fill = AgeActivity[,1])) +
  labs(x="Ages", y="Frequency", fill="Recency") +
  ggtitle("Distribution of Ages by Recency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
  
p2 <- ggplot(data = AgeActivity, aes(x = AgeActivity[,2], y = AgeActivity[,3], fill = AgeActivity[,1])) +
  geom_bar(stat = "identity") +
  facet_grid(AgeActivity[,1]) +
  labs(x="Ages", y="Frequency", fill="Recency")

grid.arrange(p1, p2, ncol = 1)

Finally, for the different types of “Recency” and “Segments”, we can see how the Users are distributed in the graphs below.

## DISTRIBUTION OF THE RECENCY ACCORDING TO USER SEGMENT
## VISUALISE THE CORRESPONDING DATA
ActivitySegment <- as.data.frame(data %>% 
                                   group_by(Recency, UserSegment) 
                                 %>% summarise(counts = n())
                                 %>% filter(Recency != "NULL")
                                 %>% arrange(desc(counts))
                                 %>% mutate(percent = round((counts/sum(counts))*100, digits = 2))
)
                    
p1.2 <- ggplot(data = ActivitySegment, aes(x = ActivitySegment[,2], y = ActivitySegment[,3], fill = ActivitySegment[,1])) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(size = 3, stat = "identity", position = position_dodge(.9), aes(y = ActivitySegment[,4]+ 800, label = paste0(ActivitySegment[,4], "%") )) +
  facet_grid(ActivitySegment[,1]) +
  labs(x = "Segment", y = "Frequency", fill = "Recency") +
  ggtitle("Segments by Recency")

## DISTRIBUTION OF THE USER SEGMENT ACCORDING TO RECENCY
ActivitySegment2 <- as.data.frame(data %>% 
                                   group_by(UserSegment, Recency) 
                                 %>% summarise(counts = n())
                                 %>% filter(Recency != "NULL")
                                 %>% arrange(desc(counts))
                                 %>% mutate(percent = round((counts/sum(counts))*100, digits = 2))
)

p2.2 <- ggplot(data = ActivitySegment2, aes(x = ActivitySegment2[,2], y = ActivitySegment2[,3], fill = ActivitySegment2[,1])) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(size = 3, stat = "identity", position = position_dodge(.9), aes(y = ActivitySegment2[,4] + 800, label = paste0(ActivitySegment2[,4], "%") )) +
  facet_grid(ActivitySegment2[,1]) +
  labs(x = "Segment", y = "Frequency", fill = "Segments") +
  ggtitle("Recency by Segments")

grid.arrange(p1.2, p2.2, nrow = 2)

MODEL CREATION AND CONCLUSIONS

Since there is a better understanding of the data, we can implement the Lasso Logistic Regression model to create our recommendation model. As it was mentioned, Lasso is an amazing technique as it prevents over-fitting by taking into account possible correlated features and eliminate variables that add noise to the model. Therefore, the final model is considered to provide with objective results and recommendations that can generalise well in unseen data.

In order for the model to work correctly, it is advised to be provided with balanced data; data that has the same number of Events = 1 and Events = 0. Since the proportion of our dataset is 70% in favor of products where the customer showed no interest, an artificially balanced dataset will be produced by undersampling the majority class. In other words, out of the 70,000 instances of Events = 0, a sample of 30,000 instances is going to be extracted to match the sample of 30,000 instances of Events = 1. Literature suggests undersampling in favor of techniques of oversampling. In addition, several iterations (e.g. 10 iterations) will be run in order to secure unbiased results due to sampling.

To validate the model, 10-Fold Cross Validation is performed and an additional test set is withheld from the training process with an 80%-20% split to test the model even further.

## SET NUMBER OF REPETITIONS
n <- 10
## EMPTY CONFUSION MATRIX TO BE STORED FROM VARIOUS REPETITIONS 
confusion <- data.frame(r1 = rep(0,n), r2 = rep(0,n), r3 = rep(0,n), r4 = rep(0,n))

## AUC VALUE EMPTY VECTOR TO STORE AUC VALUES FROM VARIOUS REPETITIONS
auc.total <- vector(mode = "numeric", length = n)

## RUN THE MODEL 
for(i in 1:n){
## CREATE BALANCED DATASET BY UNDERSAMPLING THE MAJORITY CLASS
event0 <- data[data$Event == 0,]     
event0 <- sample_n(event0, 30000, replace = FALSE)  ## FROM MAJORITY CLASS, TAKE A 30.000 SAMPLE 

event1 <- data[data$Event == 1,]    ## FROM MINORITY CLASS, TAKE ALL THE 30.000 INSTANCES
balancedData <- rbind(event0, event1)   ## CREATE THE BALANCED DATA

## TAKE THE DEISGN MATRIX
designMatrix <- sparse.model.matrix(Event ~ .  , data = balancedData[,-c(1)])[,-1]  ## REMOVE THE CUSTOMER ID AND THE INTERCEPT
response <- balancedData$Event  ## STORE THE RESPONSE VARIABLE

## SPLIT IN TRAIN AND TEST SET
ind <- sample(2, nrow(designMatrix), replace = TRUE, prob = c(0.8, 0.2))

trainData <- designMatrix[ind == 1,] 
responseTrain <- response[ind == 1]
dim(trainData) 

testData <- designMatrix[ind == 2,]
responseTest <- response[ind == 2]
dim(testData)

## CHECK FOR PROPORTION
# FOR DESIGN TRAIN
table(responseTrain)

# FOR DESIGN TEST
table(responseTest)

## LASSO CROSS VALIDATION TO FIND THE BEST LAMBDA
cv <- cv.glmnet(trainData, responseTrain, 
                nfolds = 10, 
                type.measure = "class",
                family = "binomial", 
                alpha = 1)

## PREDICT ON TEST.SET (UNSEEN DATA) USING THE BEST LAMBDA FROM CROSS VALIDATION
predictions <- predict(cv, newx = testData, type = "response", s = "lambda.min")
pred.roc <- predictions     ## STORE THE PROBABILTIES FOR THE ROC CURVE TO BE USED LATER
cut_off <- 0.47             ## SELECT THE CUT-OFF TO CLASSIFY THE RESULT 
predictions <- if_else(predictions >= cut_off, 1, 0)

## STORE AUC VALUE
predictions.bal <- prediction(pred.roc, responseTest)
auc <- performance(predictions.bal, "auc")
auc.total[i] <- unlist(slot(auc, "y.values"))

## STORE THE CONFUSION MATRIX FOR EVERY REPETITION
confusion[i,] <- as.vector(table(predictions, responseTest))
}

RESULTS

The most common and widest metric to rate the success of a machine learning model is the accuracy metric. As the name denotes, accuracy is the fraction of predictions the model got right in total. However, we are also interested in the proportion of the predictions that the user would interact with a product and the predictions that the user would not. Both of these proportions are measured with the Sensitivity and Specificity metrics respectively. Sometimes, we are more interested in obtaining a higher value in one of these metrics over the other as it is considered to be more important. In our case, more of a significant error it would be to not recommend a product that the customer would interact rather than suggesting a product that the customer would not. Therefore, a priority is to obtain a high value of Sensitivity as possible while keeping a reasonable balance between them. This can be achieved by being a little bit more willing to recommend a product to a customer. Recall that the model will have an output of a probability which means that an output of 0.5 would indicate half a chance of the customer re-showing interest for a particular product. Setting a cut-off slightly lower than that, we can be sure to not miss a high number of products or services that would be of an interest to a customer.

## CREATE A FUNCTION TO CALCULATE THE AVERAGE ACCURACY, SPECIFICITY, SENSITIVITY OF THE VARIOUS REPETITIONS
res <- function(x) {
  x <- matrix(as.numeric(x), ncol = 2)
  accur <- (x[1,1]+x[2,2])/sum(x)
  sens <- x[2,2]/sum(x[,2])
  spec <- x[1,1]/sum(x[,1])
  out <- list(Accuracy = accur, Specificity = spec, Sensitivity = sens)
  out
}
total <-  apply(confusion, 1, FUN = res)

## OBTAIN THE AVERAGE ACCURACY, SPECIFICITY, SENSITIVITY AND STANDARD DEVIATION OF THESE
avg.spec <-0
avg.acc <-0
avg.sens <- 0

spec.total <- vector(mode = "numeric", length = n)
sens.total <- vector(mode = "numeric", length = n)
acc.total <- vector(mode = "numeric", length = n)
for(i in 1:n){
  
  avg.acc = as.numeric(total[[i]][1]) + avg.acc
  acc.total[i] <- as.numeric(total[[i]][1])
  
  avg.sens = as.numeric(total[[i]][3]) + avg.sens
  sens.total[i] <- as.numeric(total[[i]][3])
  
  avg.spec = as.numeric(total[[i]][2]) + avg.spec  
  spec.total[i] <- as.numeric(total[[i]][2])
  
}

## MODEL VALIDATION RESULTS
## AVERAGE AVERAGE ACCURACY, SPECIFICITY, SENSITIVITY
cat("The accuracy of the model for", n, "repetitions is:", round(avg.acc/n, digits = 2))
## The accuracy of the model for 10 repetitions is: 0.65
cat("The sensitivity of the model for", n, "repetitions is:", round(avg.sens/n, digits = 2))
## The sensitivity of the model for 10 repetitions is: 0.82
cat("The specificity of the model for", n, "repetitions is:", round(avg.spec/n, digits = 2))
## The specificity of the model for 10 repetitions is: 0.48
## STANDARD DEVIATIONS OF ACCURACY, SPECIFICITY, SENSITIVITY
cat("The standard deviation of the accuracy for", n, "repetitions is:", round(sd(acc.total),3))
## The standard deviation of the accuracy for 10 repetitions is: 0.007
cat("The standard deviation of the sensitivity for", n, "repetitions is:", round(sd(sens.total),3))
## The standard deviation of the sensitivity for 10 repetitions is: 0.03
cat("The standard deviation of the specificity for", n, "repetitions is:", round(sd(spec.total),3))
## The standard deviation of the specificity for 10 repetitions is: 0.043

Last but not least, the following graphs create a bigger picture between the trade-off of Sensitivity and Specificity so as to have a wider sense of the performance of the machine learning model. The AUC value, which stands for Area Under the Curve, is another validation value that was put for cross-reference with the Accuracy, Sensitivity and Specificity.

## ROC CURVE
predictions.bal <- prediction(pred.roc, responseTest)
roc.lasso <- performance(predictions.bal, measure = "tpr", x.measure = "fpr")

## AUC
mean.auc <- round(mean(auc.total),3)   ## AVEGRAGE AUC
sd.auc <- round(sd(auc.total),3)       ## STANDARD DEVIATION AUC

## ACCURACY CUT-OFF
perf.bal.ac <- performance(predictions.bal, "acc")
plot(perf.bal.ac)

## PLOT ROC CURVE AND AUC VALUE TO FURTHER VALIDATE THE MODEL
plot(roc.lasso, colorize = TRUE, ylab = "Sensitivity", xlab = "1 - Specificity", main = "ROC Curve - Logistic Regression with L1")
lines(c(0,1), c(0,1), col = "black", lty = 2)
legend(.6, .35, mean.auc, title = "AUC", cex = .8)
legend(.8, .35, sd.auc, title = "+- SD", cex = .8)

Finally, it is of a great interest to see which variables show strong predictive power and towards what behaviour. For instance, knowing the segment of a specific customer, would it help in recommending a particular product or would that push the user to the opposite direction? The graph below shows the predictive power of each variable as it was captured by the model.

## STORE THE LASSO LOGISTIC REGRESSION COEFFICIENTS
tmp_coef <- coef(cv, s = "lambda.min")   # COEF OF LAMBDA MIN
coef <- data.frame(name = tmp_coef@Dimnames[[1]][tmp_coef@i + 1], coefficient = tmp_coef@x)
b_coef <- coef[order(coef[,2]),]

## VISUALISE THE COEFFICIENTS
ggplot(b_coef, aes(y = b_coef[,2], x = b_coef[,1], color = if_else(sign(b_coef[,2]) > 0, "Positive", "Negative"))) +
  geom_point(stat = "identity") +
  geom_segment(aes(y = 0, x = b_coef[,1], yend = b_coef[,2], xend = b_coef[,1])) +
  coord_flip() +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank()) +
  ggtitle("Coefficients Predictive Contribution")

FUTURE DEVELOPMENT AND ENHANCEMENT

The proposed model is merely one of the main approaches in trying to predict customers behaviours and interests. For instance, when the data becomes large enough in terms of available variables, the suggested model may have some limitations. Moreover, it was stated by BGL that the data provided relates to a single, specific event/product. That means that the Logistic regression model needs to be run on a single product each time to make a recommendation. It is not difficult to understand that this is not an efficient way to make a recommendation, especially when there is a wide range of products or services available. Thus, an enhancement of the already proposed model would a Bayesian Logistic Regression model since prior knowledge is available or a hybrid model using Data Mining techniques along with Machine Learning. More specifically, the idea would suggest performing association rules to identify items that are frequently bought together. This can create some natural clusters where certain products can fall into. Finally, lasso regression could be run individually on these clusters to improve recommendations.

A very promising approach would also be to implement User-Item Collaborative Filtering. This approach gives suggestions to statements like: “Users who are similar to you also liked …”. It is worth mentioning that Collaborative Filtering can also be used as a hybrid model with Lasso Regression as described previously. Major recommendations systems (e.g. Netflix, Spotify, Amazon, Facebook, etc) are collaborative filtering models which are based on assumption that people like things similar to other things they like, and things that are liked by other people with similar taste.

Finally, a more complex approach would be functional clustering. The idea of functional clustering is that the Events as described in our dataset are not independent but connected. Therefore, the available data for every user can be transformed into a function (e.g. a line) and examined as such. Functional clustering is a really advanced approach that allows to further investigate user’s behaviours thoroughly. For instance, keeping in mind that every customer is represented as a function, various information can be extracted through the derivative of that function.

BROADER APPLICATIONS WITHIN COMPARETHEMARKET.COM

The aforementioned approaches were created with the purpose of recommending particular products to customers. Yet, these models can be used to answer a variety of challenging questions that BGL and comparethemarket.com are facing every day. Although we are certain to bring a decent recommendation system, a broader application of these models could assist in a better customer targeting or discovery of new products of interest for the users.

EXTRAS

This section offers an attempt of different machine learning model for a product recommendation. The approach uses a quick implementation of the Random Forest algorithm. Note that Random Forest is a parametric model and thus, it is time-consuming to fully develop such a model.

## SET NUMBER OF REPETITIONS
n <- 10
## EMPTY CONFUSION MATRIX TO BE STORED FROM VARIOUS REPETITIONS 
confusionRF <- data.frame(r1 = rep(0,n), r2 = rep(0,n), r3 = rep(0,n), r4 = rep(0,n))

## AUC VALUE EMPTY VECTOR TO STORE AUC VALUES FROM VARIOUS REPETITIONS
auc.totalRF <- vector(mode = "numeric", length = n)
OOB <- vector(mode = "numeric", length = n)

## RUN THE MODEL 
for(i in 1:n){
  ## CREATE BALANCED DATASET BY UNDERSAMPLING THE MAJORITY CLASS
  event0 <- data[data$Event == 0,]     
  event0 <- sample_n(event0, 30000, replace = FALSE)  ## FROM MAJORITY CLASS, TAKE A 30.000 SAMPLE 
  
  event1 <- data[data$Event == 1,]    ## FROM MINORITY CLASS, TAKE ALL THE 30.000 INSTANCES
  balancedDataRF <- rbind(event0, event1)   ## CREATE THE BALANCED DATA
  
  ## TAKE THE DEISGN MATRIX
  designMatrixRF <- sparse.model.matrix(Event ~ .  , data = balancedDataRF[,-c(1)])[,-1]  ## REMOVE THE CUSTOMER ID AND THE INTERCEPT
  responseRF <- balancedDataRF$Event  ## STORE THE RESPONSE VARIABLE
  
  ## SPLIT IN TRAIN AND TEST SET
  ind <- sample(2, nrow(designMatrixRF), replace = TRUE, prob = c(0.8, 0.2))
  
  trainDataRF <- designMatrixRF[ind == 1,] 
  responseTrainRF <- responseRF[ind == 1]
  dim(trainDataRF) 
  
  testDataRF <- designMatrixRF[ind == 2,]
  responseTestRF <- responseRF[ind == 2]
  dim(testDataRF)
  
  ## CHECK FOR PROPORTION
  # FOR DESIGN TRAIN
  table(responseTrainRF)
  
  # FOR DESIGN TEST
  table(responseTestRF)
  
  ## RANDOM FORREST
  rf <- randomForest(x = as.matrix(trainDataRF), y = as.factor(responseTrainRF))
  rf
  
  OOB.error <- rf$err.rate[,1]
  OOB[i] <- OOB.error[length(OOB.error)]
  
  # tune(randomForest, train.x = as.matrix(train.data.rf), train.y = as.factor(response.x.rf))
  
  ## AUC VALUE
  prediction.rf <- as.vector(rf$votes[,2])
  pred.rf <- prediction(prediction.rf, ifelse(responseTrainRF == "1", 1, 0))
  auc.rf <- performance(pred.rf, "auc")
  auc.rf <- auc.rf@y.values[[1]]
  auc.totalRF[i] <- auc.rf
  
  # PREDICT ON UNSEEN DATA
  rf_pred <- predict(rf, testDataRF, cutoff = c(0.60, 0.40))
  
  # RESULTS
  confusionRF[i,] <- as.vector(table(as.factor(rf_pred), as.factor(responseTestRF)))
  
}

# VISUALIZATIONS
# plot(rforest)
varImpPlot(rf, sort = TRUE, main = "Variable Importance")

## CREATE A FUNCTION TO CALCULATE THE AVERAGE ACCURACY, SPECIFICITY, SENSITIVITY OF THE VARIOUS REPETITIONS
res <- function(x) {
  x <- matrix(as.numeric(x), ncol = 2)
  accur <- (x[1,1]+x[2,2])/sum(x)
  sens <- x[2,2]/sum(x[,2])
  spec <- x[1,1]/sum(x[,1])
  out <- list(Accuracy = accur, Specificity = spec, Sensitivity = sens)
  out
}

## OBTAIN RESULTS
total.bal <-  apply(confusionRF, 1, FUN = res)

avg.spec.bal <-0
avg.acc.bal <-0
avg.sens.bal <- 0

acc.rf.total <- vector(mode = "numeric", length = n)
sens.rf.total <- vector(mode = "numeric", length = n)
spec.rf.total <- vector(mode = "numeric", length = n)

for(i in 1:n){
  avg.spec.bal = as.numeric(total.bal[[i]][2]) + avg.spec.bal
  spec.rf.total[i] <- as.numeric(total.bal[[i]][2])
  
  avg.acc.bal = as.numeric(total.bal[[i]][1]) + avg.acc.bal
  acc.rf.total[i] <- as.numeric(total.bal[[i]][1])
  
  avg.sens.bal = as.numeric(total.bal[[i]][3]) + avg.sens.bal
  sens.rf.total[i] <- as.numeric(total.bal[[i]][3])
}
## AVERAGE OOB ERROR
cat("The average Out Of Bag error for random forest is:", mean(OOB))

## AVERAGES
cat("The average accuracy of", n, "repetitions is:", round(avg.acc.bal/n, digits = 3))
cat("The average sensitivity of", n, "repetitions is:", round(avg.sens.bal/n, digits = 3))
cat("The average specificity of", n, "repetitions is:", round(avg.spec.bal/n, digits = 3))

## STANDARD DEVIATIONS
round(sd(acc.rf.total),3)
round(sd(sens.rf.total),3)
round(sd(spec.rf.total),3)

## ROC CURVE
prediction.rf <- as.vector(rf$votes[,2])
pred.rf <- prediction(prediction.rf, ifelse(responseTrainRF == "1", 1, 0)) 
roc.rf <- performance(pred.rf, "tpr", "fpr")

## AUC
auc.rf.avg <- round(mean(auc.totalRF),3)
auc.rf.sd <- round(sd(auc.totalRF),3)

plot(roc.rf, colorize = TRUE, ylab = "Sensitivity", xlab = "1 - Specificity", main = "ROC Curve - Random Forest")
lines(c(0,1), c(0,1), col = "black", lty = 2)
legend(.6, .35, auc.rf.avg, title = "AUC", cex = .8)
legend(.8, .35, auc.rf.sd, title = "+- SD", cex = .8)