1 Objectives and Overview

For this task, we use the data collected by the marketing department to predict our customers' laptop brand preferences for the incomplete survey records, which were left without a stated brand preference.

We need this prediction so that our company knows with which of the two brands (Acer or Sony) to pursue a collaborative agreement.

To do this, we follow the usual data analysis and model building flow: inspecting the data, preparing it and performing any necessary preprocessing, selecting features, sampling the data, and then building, tuning and training the models. The final step is to apply the model to the incomplete data and predict our customers' preferred laptop brand.

2 Data Visualization

After loading our data into the R environment, we proceed to look over the data types, the variables and the general characteristics of our data set.

summary(complete)

##      salary            age            elevel           car       
##  Min.   : 20000   Min.   :20.00   Min.   :0.000   Min.   : 1.00  
##  1st Qu.: 52082   1st Qu.:35.00   1st Qu.:1.000   1st Qu.: 6.00  
##  Median : 84950   Median :50.00   Median :2.000   Median :11.00  
##  Mean   : 84871   Mean   :49.78   Mean   :1.983   Mean   :10.52  
##  3rd Qu.:117162   3rd Qu.:65.00   3rd Qu.:3.000   3rd Qu.:15.75  
##  Max.   :150000   Max.   :80.00   Max.   :4.000   Max.   :20.00  
##     zipcode          credit           brand       
##  Min.   :0.000   Min.   :     0   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:120807   1st Qu.:0.0000  
##  Median :4.000   Median :250607   Median :1.0000  
##  Mean   :4.041   Mean   :249176   Mean   :0.6217  
##  3rd Qu.:6.000   3rd Qu.:374640   3rd Qu.:1.0000  
##  Max.   :8.000   Max.   :500000   Max.   :1.0000

str(complete)

## 'data.frame':    9898 obs. of  7 variables:
##  $ salary : num  119807 106880 78021 63690 50874 ...
##  $ age    : int  45 63 23 51 20 56 24 62 29 41 ...
##  $ elevel : int  0 1 0 3 3 3 4 3 4 1 ...
##  $ car    : int  14 11 15 6 14 14 8 3 17 5 ...
##  $ zipcode: int  4 6 2 5 4 3 5 0 0 4 ...
##  $ credit : num  442038 45007 48795 40889 352951 ...
##  $ brand  : int  0 1 0 1 0 1 1 1 0 1 ...

nrow(complete) #number of rows

## [1] 9898

We can see that all of our variables are numeric and that we will have to change some of them to factors. We proceed to change the label variable brand from numeric to factor and rename the levels from “0” to “Acer” and from “1” to “Sony”.
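As a minimal sketch of this conversion, assuming brand is coded 0/1 as in the summary above, the mapping can be written explicitly with base R's factor(). The vector brand_codes is a made-up stand-in for complete$brand, not the survey data.

```r
# Hypothetical stand-in for complete$brand (0 = Acer, 1 = Sony)
brand_codes <- c(0, 1, 0, 1, 0, 1, 1, 1, 0, 1)

# Map each numeric code to its label explicitly
brand <- factor(brand_codes, levels = c(0, 1), labels = c("Acer", "Sony"))

# Count customers per brand
table(brand)
```

Naming the levels and labels side by side means the mapping does not depend on the order in which R happens to sort the codes.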

The next step is visualizing the data through some plots in order to see the distributions, relationships and patterns that may appear. To achieve this we will be using the ggplot2 library.

As we can see from the histogram above, the distribution of the salary values is not Gaussian (normal). We might want to check whether other features behave similarly.

After plotting histograms for credit and all the other features (age, education level, type of car), we can conclude that none of the features in our dataset is normally distributed. We will proceed with our prediction process and return to this point at the end of this report. Next, we check the percentage of people that prefer each brand and how the brand relates to the other attributes.

From the histogram and pie chart above we can conclude that the majority of the surveyed customers preferred Sony. Let’s see how the brand relates to the salary and age of the customers.

Based on the plot above we can conclude that age and salary do have an effect on the brand that our customers chose. After further visualization we saw that car type, credit and education level have little to no effect on the brand chosen. We will keep this in mind when doing our feature selection.

3 Data PreProcessing and Feature Selection

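We will start off by running a correlation matrix on our data to find any highly correlated features. No pair of features is strongly correlated: the value of 0.9963 reported for car and education level (elevel) is a p-value from the second table, not a correlation coefficient, and their correlation is 0.00. We will still discard both, because neither has a significant correlation with brand and the plots showed little to no effect. Also, in the correlation matrix salary shows only a weak linear correlation with brand (0.21) and age almost none (0.01), but we know from our visualizations that both affect the brand.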

##         salary   age elevel   car zipcode credit brand
## salary    1.00  0.01  -0.01 -0.01   -0.01  -0.03  0.21
## age       0.01  1.00  -0.01  0.01    0.00   0.00  0.01
## elevel   -0.01 -0.01   1.00  0.00    0.02   0.00  0.00
## car      -0.01  0.01   0.00  1.00    0.00  -0.01  0.01
## zipcode  -0.01  0.00   0.02  0.00    1.00   0.00  0.00
## credit   -0.03  0.00   0.00 -0.01    0.00   1.00  0.01
## brand     0.21  0.01   0.00  0.01    0.00   0.01  1.00
## 
## n= 9898 
## 
## 
## P
##         salary age    elevel car    zipcode credit brand 
## salary         0.4274 0.5102 0.5446 0.5863  0.0124 0.0000
## age     0.4274        0.5619 0.3081 0.7142  0.6616 0.1725
## elevel  0.5102 0.5619        0.9963 0.0718  0.7867 0.6310
## car     0.5446 0.3081 0.9963        0.8793  0.3042 0.5557
## zipcode 0.5863 0.7142 0.0718 0.8793         0.6216 0.6426
## credit  0.0124 0.6616 0.7867 0.3042 0.6216         0.5715
## brand   0.0000 0.1725 0.6310 0.5557 0.6426  0.5715
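When reading the two tables above, it helps to keep the correlation coefficient and its p-value apart. The sketch below uses base R's cor.test() on simulated, independent variables (x and y are made up for illustration and are not the survey data) to show that the two are separate quantities returned by the same test.

```r
set.seed(123)
# Simulated, independent variables (illustration only, not the survey data)
x <- rnorm(1000)
y <- rnorm(1000)

res <- cor.test(x, y)  # Pearson correlation test

r <- unname(res$estimate)  # correlation coefficient, between -1 and 1
p <- res$p.value           # p-value for H0: the true correlation is 0

round(c(r = r, p = p), 4)
```

A large p-value means the data give no evidence of a linear relationship, so it goes together with a coefficient near 0 rather than with a strong correlation.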

#keeping salary, age and brand (columns 1, 2, 7); this drops elevel, car, zipcode and credit
complete <- complete[c(1,2,7)]
str(complete)

## 'data.frame':    9898 obs. of  3 variables:
##  $ salary: num  119807 106880 78021 63690 50874 ...
##  $ age   : int  45 63 23 51 20 56 24 62 29 41 ...
##  $ brand : int  0 1 0 1 0 1 1 1 0 1 ...

Next we will transform the brand into a factor and check for missing values. We will also normalize the salary and age, find and remove the outliers in the salary, and bin it.

#transforming brand to a factor and setting the levels

complete$brand<-as.factor(complete$brand)
levels(complete$brand) <- c("Acer","Sony")

#normalize the salary and age

normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
 
complete$salary<-normalize(complete$salary)
complete$age<-normalize(complete$age)

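#finding missing values

sum(is.na(complete)) #no missing values

#finding outliers in salary
#outlier() is assumed to come from the "outliers" package

library(outliers)
outlier_tfSalary <- outlier(complete$salary, logical = TRUE)
sum(outlier_tfSalary)

#which rows hold the outliers in salary

find_outlierSalary <- which(outlier_tfSalary)

#removing the outliers in salary

complete <- complete[-find_outlierSalary, ]

#binning the salary into 5 equal-frequency bins
#bin() is assumed to come from the "OneR" package; the result is only
#printed, not assigned, so complete$salary itself stays numeric

library(OneR)
bin(
  complete$salary,
  nbins = 5,
  labels = NULL,
  method = "content"
)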

We can see that after the preprocessing we have 3 features (salary, age and brand); age and salary are normalized and brand is a factor with 2 levels.

## 'data.frame':    9783 obs. of  3 variables:
##  $ salary: num  0.768 0.668 0.446 0.336 0.237 ...
##  $ age   : num  0.417 0.717 0.05 0.517 0 ...
##  $ brand : Factor w/ 2 levels "Acer","Sony": 1 2 1 2 1 2 2 2 1 2 ...
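The structure above still shows salary as numeric, since the binning call prints its result without storing it. A minimal base R sketch of equal-frequency binning that keeps the result in a new column might look like this. The data frame toy and the column name salary_bin are hypothetical, and cut() with quantile() breaks is swapped in for bin().

```r
set.seed(123)
# Hypothetical stand-in for the normalized salary column
toy <- data.frame(salary = runif(100))

# Quantile breaks give 5 bins with roughly equal counts
breaks <- quantile(toy$salary, probs = seq(0, 1, length.out = 6))

# include.lowest = TRUE keeps the minimum value inside the first bin
toy$salary_bin <- cut(toy$salary, breaks = breaks, include.lowest = TRUE)

table(toy$salary_bin)
```

Storing the bins in a separate column leaves the numeric salary available, so either version can be passed to the model.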

4 Building the Models

For this task we will build 2 models: a C5.0 Decision Tree and a Random Forest. To build them, partition the data and make the predictions we will use the caret package.

4.1 Partitioning the Data

Before we start building the models we need to split our data into 2 sets: the training set, which we will use for training the models, and the testing set, which we will use for testing their performance.

#creating the numbers for each set with a 75 % ratio

set.seed(123)

inTrain<- createDataPartition(y= complete$brand,p=.75, list = FALSE) 

#partitioning the data into training and testing set

training <- complete[ inTrain,]
testing <- complete[-inTrain,]

#number of rows and the table head in each set
nrow(training)

## [1] 7338

head(training)

##      salary        age brand
## 1 0.7677427 0.41666667  Acer
## 2 0.6683114 0.71666667  Sony
## 3 0.4463135 0.05000000  Acer
## 4 0.3360764 0.51666667  Sony
## 6 0.8524057 0.60000000  Sony
## 7 0.8958411 0.06666667  Sony

nrow(testing)

## [1] 2445

head(testing)

##       salary       age brand
## 5  0.2374894 0.0000000  Acer
## 10 0.1369487 0.3500000  Sony
## 14 0.4805737 0.2166667  Acer
## 30 0.6746948 0.9166667  Sony
## 31 0.9884265 0.7000000  Sony
## 38 0.2284007 0.3666667  Sony

4.2 Setting Cross-Validation as Training Method

We set the resampling procedure to 10-fold cross-validation repeated 3 times.

ctrl <- trainControl(
                     method = "repeatedcv",
                     number = 10,
                     repeats = 3,
                     summaryFunction = multiClassSummary,
                     classProbs = TRUE
                     )
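To make the resampling scheme concrete, the following base R sketch shows how 10 folds repeated 3 times yield 30 train/validation splits. It is a simplified illustration: unlike caret, it does not stratify the folds by class, and n, k and repeats are illustrative values rather than the real training size.

```r
set.seed(123)
n <- 100       # illustrative number of training rows
k <- 10        # folds per repetition
repeats <- 3   # number of repetitions

# For each repetition, randomly assign every row to one of k folds
fold_ids <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# Each (repetition, fold) pair holds out one fold for validation
splits <- expand.grid(fold = seq_len(k), rep = seq_len(repeats))
nrow(splits)  # number of resamples

# Rows held out in fold 1 of repetition 1
holdout <- which(fold_ids[[1]] == 1)
length(holdout)
```

Every model candidate is therefore fitted 30 times, which is worth keeping in mind when comparing training times.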

4.3 C5.0 Decision tree

Using the train function available in caret, we now train our C5.0 model on the training data.

set.seed(123)
treeFit <- train(
                 brand~.,
                 data = training,
                 method = "C5.0", 
                 trControl=ctrl,
                 tuneLength = 3
                )

4.3.1 Prediction and Performance C5.0

Next we used the model to predict the brand on the testing data and built the confusion matrix for this model. From the confusion matrix we see that the model has an accuracy of 0.9235 and a Kappa of 0.839, well above the no-information rate of 0.6172. That means the model predicts the brand well on the testing data.

#applying the model on test

treeBrand <- predict(treeFit, newdata = testing)

#confusion matrix and the statistics

confusionMatrix(data= treeBrand, testing$brand)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  855  106
##       Sony   81 1403
##                                           
##                Accuracy : 0.9235          
##                  95% CI : (0.9123, 0.9337)
##     No Information Rate : 0.6172          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.839           
##                                           
##  Mcnemar's Test P-Value : 0.07925         
##                                           
##             Sensitivity : 0.9135          
##             Specificity : 0.9298          
##          Pos Pred Value : 0.8897          
##          Neg Pred Value : 0.9454          
##              Prevalence : 0.3828          
##          Detection Rate : 0.3497          
##    Detection Prevalence : 0.3930          
##       Balanced Accuracy : 0.9216          
##                                           
##        'Positive' Class : Acer            
##
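As a worked check, the headline statistics can be recomputed from the four counts in the confusion matrix with base R alone. This is a sketch of the standard formulas, not caret's internal code; the names cm, expected and kappa_stat are chosen here for illustration.

```r
# Counts copied from the C5.0 confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(855, 81, 106, 1403), nrow = 2,
             dimnames = list(Prediction = c("Acer", "Sony"),
                             Reference  = c("Acer", "Sony")))

n <- sum(cm)

# Accuracy: share of predictions on the diagonal
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa: accuracy corrected for chance agreement
kappa_stat <- (accuracy - expected) / (1 - expected)

# Sensitivity and specificity with Acer as the positive class
sensitivity <- cm["Acer", "Acer"] / sum(cm[, "Acer"])
specificity <- cm["Sony", "Sony"] / sum(cm[, "Sony"])

round(c(accuracy = accuracy, kappa = kappa_stat,
        sensitivity = sensitivity, specificity = specificity), 4)
```

The rounded values agree with the accuracy, Kappa, sensitivity and specificity printed by confusionMatrix above.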

4.4 Random Forest Model

For the second model we use the Random Forest algorithm with the same cross-validation procedure as for the Decision Tree. For this model we manually supplied a grid of candidate values for the mtry parameter, and the tuning procedure selected mtry = 1 as the best value.

#setting the cross validation

ctrl <- trainControl(
                     method = "repeatedcv",
                     number = 10,
                     repeats = 3,
                     summaryFunction = multiClassSummary
                    )


#setting the mtry candidates
#(only 2 predictors remain, so randomForest treats any value above 2 as mtry = 2)

mtry <- c(2,1,3,5,12,32)
rfGrid <- expand.grid(.mtry=mtry)

set.seed(123)


#train the model

system.time(
          randomForestFit <- train(
                                   brand~.,
                                   data = training,
                                   preProcess = c("center"),
                                   method = "rf", 
                                   trControl=ctrl,
                                   tuneGrid=rfGrid,
                                   importance=TRUE
                                  )
            )

##    user  system elapsed 
##  413.79   27.78  469.79

4.4.1 Prediction and Performance Random Forest

The model was then applied to the testing data and the confusion matrix was drawn up. The performance of this model was good, with an accuracy of 0.9149 and a Kappa of 0.8211.

#applying the model on test

rfBrand <- predict(randomForestFit, newdata = testing)

#confusion matrix for the random forest model

confusionMatrix(data= rfBrand, testing$brand)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Acer Sony
##       Acer  848  120
##       Sony   88 1389
##                                           
##                Accuracy : 0.9149          
##                  95% CI : (0.9032, 0.9257)
##     No Information Rate : 0.6172          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8211          
##                                           
##  Mcnemar's Test P-Value : 0.0316          
##                                           
##             Sensitivity : 0.9060          
##             Specificity : 0.9205          
##          Pos Pred Value : 0.8760          
##          Neg Pred Value : 0.9404          
##              Prevalence : 0.3828          
##          Detection Rate : 0.3468          
##    Detection Prevalence : 0.3959          
##       Balanced Accuracy : 0.9132          
##                                           
##        'Positive' Class : Acer            
##

4.5 Comparing the Models

Using caret's resamples function we compared the two models and could see that the C5.0 Decision Tree is slightly better than the Random Forest and less computationally expensive. We will also see this in the next visualizations.

5 Predicting the Preferred Brand

The last step in our task is to apply the model that we trained to the incomplete survey data. After applying it we will explore the results and infer which brand the customers prefer and with which of the two, Acer or Sony, Blackwell should consider a business relationship in the future.

This process consists of several steps: loading the incomplete data into the R environment and preprocessing it exactly as we did with the complete data (normalization, change of variable type, binning, feature selection). Then we apply the Decision Tree model to the data and explore the results.

After applying the model we can see that it predicted that 1913 of our customers prefer Acer and the remaining 3087 prefer Sony.
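One detail worth sketching: if the normalize() function from section 3 is called directly on the incomplete data, it rescales with that data's own minimum and maximum. To reproduce the training preprocessing exactly, one option is to reuse the ranges of the complete data. The helper normalize_with and the example salaries below are hypothetical; the range comes from summary(complete).

```r
# Rescale x to [0, 1] using a reference range instead of x's own range
normalize_with <- function(x, ref_min, ref_max) {
  (x - ref_min) / (ref_max - ref_min)
}

# Salary range taken from summary(complete): 20000 to 150000
salary_min <- 20000
salary_max <- 150000

# Hypothetical salaries from the incomplete data
new_salary <- c(20000, 85000, 150000)

normalize_with(new_salary, salary_min, salary_max)
```

When both data sets span the same ranges the two approaches give identical results, so this only matters if the new data has a different minimum or maximum.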

#Applying the model to the incomplete data
finalPredictionBrand <- predict(treeFit, newdata = incomplete)

#summary of the results
summary(finalPredictionBrand)

## Acer Sony 
## 1913 3087

#adding the predicted values to the incomplete data and bringing back the other features

completeNewData <- incompletef[c(1,2,3,4,5,6)]
completeNewData <- cbind(completeNewData, predictedBrand = finalPredictionBrand)
head(completeNewData)

##      salary age elevel car zipcode    credit predictedBrand
## 1 110499.74  54      3  15       4 354724.18           Acer
## 2 140893.78  44      4  20       7 395015.34           Sony
## 3 119159.65  49      2   1       3 122025.09           Acer
## 4  20000.00  56      0   9       1  99629.62           Sony
## 5  93956.32  59      1  15       1 458679.83           Acer
## 6  41365.43  71      2   7       2 216839.72           Acer

5.1 Visualizing the Results of the Prediction

The proportions of the two brands in the predicted data confirm the above statement. It’s visible that there are more customers that prefer the Sony brand than Acer. It’s interesting that this proportion is similar to the one in the complete data. We expected this because the two sets of data have similar distributions.
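The comparison can be checked with a few lines of base R, using only numbers already reported in this document: the predicted counts from summary(finalPredictionBrand) and the mean of the 0/1 brand column from summary(complete). The variable names are chosen here for illustration.

```r
# Predicted counts from summary(finalPredictionBrand)
predicted <- c(Acer = 1913, Sony = 3087)
predicted_share <- predicted / sum(predicted)

# In the complete data brand is coded 0 = Acer, 1 = Sony,
# so its mean (0.6217) is the share of Sony
complete_share <- c(Acer = 1 - 0.6217, Sony = 0.6217)

round(rbind(predicted = predicted_share, complete = complete_share), 4)
```

The predicted share of Sony is about 61.7 %, against about 62.2 % in the complete data.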

6 Suggestions for the Future

After doing the analysis and all the predictions, even though we have good accuracy and the model seems to work properly, we cannot be fully content, because we knew from the beginning that the data are not normally distributed and look like they came from a stratified sampling process. Because of this, the data that we have are not representative of the population. However, the decision tree, being a non-parametric model, is not highly sensitive to the shape of the data distribution, so we continued with the process.

To fully correct this and to be able to use the data with higher confidence in the future, we need to collect data in a way that makes it representative of the population.

Predict which Brand of Products Customers Prefer Report

Damsa Ioana

27 November 2019