For this task, we use the data collected by the marketing department to predict customers' laptop brand preferences for the incomplete survey records, that is, the records that were left without a stated brand preference.
We need this prediction so that our company knows with which of the two brands (Acer or Sony) to pursue a collaborative agreement.
To do this we follow the usual data analysis and model building flow: inspect the data, prepare it and perform any necessary preprocessing, select the features, split the data, and then build, tune and train the models. The final step is to apply the model to the incomplete data and predict our customers' preferred laptop brand.
After loading the data into the R environment, we proceed to look over the data types, the variables and the general characteristics of our data set.
## salary age elevel car
## Min. : 20000 Min. :20.00 Min. :0.000 Min. : 1.00
## 1st Qu.: 52082 1st Qu.:35.00 1st Qu.:1.000 1st Qu.: 6.00
## Median : 84950 Median :50.00 Median :2.000 Median :11.00
## Mean : 84871 Mean :49.78 Mean :1.983 Mean :10.52
## 3rd Qu.:117162 3rd Qu.:65.00 3rd Qu.:3.000 3rd Qu.:15.75
## Max. :150000 Max. :80.00 Max. :4.000 Max. :20.00
## zipcode credit brand
## Min. :0.000 Min. : 0 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:120807 1st Qu.:0.0000
## Median :4.000 Median :250607 Median :1.0000
## Mean :4.041 Mean :249176 Mean :0.6217
## 3rd Qu.:6.000 3rd Qu.:374640 3rd Qu.:1.0000
## Max. :8.000 Max. :500000 Max. :1.0000
## 'data.frame': 9898 obs. of 7 variables:
## $ salary : num 119807 106880 78021 63690 50874 ...
## $ age : int 45 63 23 51 20 56 24 62 29 41 ...
## $ elevel : int 0 1 0 3 3 3 4 3 4 1 ...
## $ car : int 14 11 15 6 14 14 8 3 17 5 ...
## $ zipcode: int 4 6 2 5 4 3 5 0 0 4 ...
## $ credit : num 442038 45007 48795 40889 352951 ...
## $ brand : int 0 1 0 1 0 1 1 1 0 1 ...
## [1] 9898
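The inspection output above can be reproduced with a few base R calls. The sketch below assumes the survey sits in a data frame called `complete` (the name used later in this report). Since the text does not give the file name, a small simulated stand-in with the same column names is built instead; its values are hypothetical and are not the real survey data.

```r
# Simulated stand-in for the survey data (hypothetical values, same column names)
set.seed(123)
n <- 100
complete <- data.frame(
  salary  = runif(n, 20000, 150000),
  age     = sample(20:80, n, replace = TRUE),
  elevel  = sample(0:4, n, replace = TRUE),
  car     = sample(1:20, n, replace = TRUE),
  zipcode = sample(0:8, n, replace = TRUE),
  credit  = runif(n, 0, 500000),
  brand   = sample(0:1, n, replace = TRUE)
)

summary(complete)  # quartiles and mean for every column
str(complete)      # column types and the first few values
nrow(complete)     # number of observations
```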
We can see that all the variables are numeric and that we will have to change some of them to factors. We proceed to change the label variable brand from numeric to factor and to rename its levels from “0” and “1” to “Acer” and “Sony” respectively.
The next step is to visualize the data through some plots in order to see the distributions, relationships and patterns that may appear. To achieve this we use the ggplot2 library.
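A histogram of this kind could be drawn as follows. This is only a sketch: the data frame `toy` and its salary values are simulated for illustration, and the bin count and colours are arbitrary choices rather than the ones used for the original figure.

```r
library(ggplot2)

# Hypothetical salaries, uniformly spread over the range seen in the summary
set.seed(123)
toy <- data.frame(salary = runif(500, 20000, 150000))

# Histogram of salary; a flat shape like this one is clearly not Gaussian
salaryHist <- ggplot(toy, aes(x = salary)) +
  geom_histogram(bins = 30, fill = "steelblue", colour = "white") +
  labs(x = "Salary", y = "Number of customers")

print(salaryHist)
```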
As we can see from the histogram above, the distribution of the salary values is not Gaussian (normal). We might want to check whether other features have a similar problem.
After running the histogram for credit and all the other features (age, education level, type of car) we can conclude that none of the features in our dataset is normally distributed. We will proceed with our prediction process and come back to this at the end of the report. We now check the percentage of people that prefer each brand and how the brand relates to the other attributes.
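The share of customers per brand can be computed with `table` and `prop.table`. The sketch below uses a simulated `brand` vector, so the resulting percentages are illustrative only and are not the survey results.

```r
# Hypothetical brand preferences, drawn with roughly a 38/62 split
set.seed(123)
brand <- factor(sample(c("Acer", "Sony"), 1000, replace = TRUE,
                       prob = c(0.38, 0.62)))

brandCounts <- table(brand)                            # customers per brand
brandShare  <- round(100 * prop.table(brandCounts), 1) # percentage per brand
brandShare

# Simple bar chart of the shares
barplot(brandShare, ylab = "Share of customers (%)")
```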
From the histogram and pie chart above we can conclude that the majority of the surveyed customers preferred Sony. Let’s see how the brand relates to the salary and age of the customers.
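One way to look at brand against salary and age is a scatter plot coloured by brand. The sketch below assumes ggplot2 and uses simulated data in a hypothetical data frame `toy`; because the brand labels here are random, it shows only the plotting technique and not the pattern found in the survey.

```r
library(ggplot2)

# Hypothetical customers with random brand labels
set.seed(123)
toy <- data.frame(
  age    = sample(20:80, 500, replace = TRUE),
  salary = runif(500, 20000, 150000),
  brand  = factor(sample(c("Acer", "Sony"), 500, replace = TRUE))
)

# Age on the x axis, salary on the y axis, one colour per brand
brandScatter <- ggplot(toy, aes(x = age, y = salary, colour = brand)) +
  geom_point(alpha = 0.6) +
  labs(x = "Age", y = "Salary", colour = "Brand")

print(brandScatter)
```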
Based on the plot above we can conclude that age and salary do have an effect on the brand that our customers chose. After further visualization we have seen that car type, credit and education level have little to no effect on the brand chosen. We will keep this in mind when doing our feature selection.
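We start by running a correlation matrix on our data in order to exclude any highly correlated features. Car and education level are not correlated with each other (r = 0.00; the 0.9963 in the output is the p-value for that pair, not a correlation coefficient), so there is no redundancy to remove. We nevertheless discard both, because neither is significantly correlated with brand (p = 0.5557 and p = 0.6310) and the plots showed that they have little to no effect on it. Note also that in the correlation matrix salary is only weakly correlated with brand (0.21) and age barely at all (0.01), although the visualizations show that both matter, which suggests the relationship is not a simple linear one. The resulting data set keeps only salary, age and brand.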
## salary age elevel car zipcode credit brand
## salary 1.00 0.01 -0.01 -0.01 -0.01 -0.03 0.21
## age 0.01 1.00 -0.01 0.01 0.00 0.00 0.01
## elevel -0.01 -0.01 1.00 0.00 0.02 0.00 0.00
## car -0.01 0.01 0.00 1.00 0.00 -0.01 0.01
## zipcode -0.01 0.00 0.02 0.00 1.00 0.00 0.00
## credit -0.03 0.00 0.00 -0.01 0.00 1.00 0.01
## brand 0.21 0.01 0.00 0.01 0.00 0.01 1.00
##
## n= 9898
##
##
## P
## salary age elevel car zipcode credit brand
## salary 0.4274 0.5102 0.5446 0.5863 0.0124 0.0000
## age 0.4274 0.5619 0.3081 0.7142 0.6616 0.1725
## elevel 0.5102 0.5619 0.9963 0.0718 0.7867 0.6310
## car 0.5446 0.3081 0.9963 0.8793 0.3042 0.5557
## zipcode 0.5863 0.7142 0.0718 0.8793 0.6216 0.6426
## credit 0.0124 0.6616 0.7867 0.3042 0.6216 0.5715
## brand 0.0000 0.1725 0.6310 0.5557 0.6426 0.5715
## 'data.frame': 9898 obs. of 3 variables:
## $ salary: num 119807 106880 78021 63690 50874 ...
## $ age : int 45 63 23 51 20 56 24 62 29 41 ...
## $ brand : int 0 1 0 1 0 1 1 1 0 1 ...
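When reading output like the one above, note that the first block holds correlation coefficients and the block headed `P` holds p-values. The layout looks like the output of `rcorr` from the Hmisc package, although the report does not name the function, so that is an assumption. The sketch below uses base R instead (`cor` and `cor.test`) on simulated data to show the two quantities side by side. It also shows that a variable can depend on another while having a Pearson correlation close to zero.

```r
set.seed(123)
n <- 1000

# Two unrelated variables (hypothetical data)
toy <- data.frame(x = runif(n), y = runif(n))

r <- cor(toy$x, toy$y)                   # coefficient: expected to be near 0
p <- cor.test(toy$x, toy$y)$p.value      # p-value: a different quantity
c(correlation = r, p.value = p)

# z is fully determined by x, but not in a linear way:
# it is 1 at both ends of the x range and 0 in the middle
toy$z <- as.numeric(abs(toy$x - 0.5) > 0.25)
rNonLinear <- cor(toy$x, toy$z)          # Pearson r stays close to 0
rNonLinear
```

A large p-value with a coefficient near zero therefore means "no evidence of linear correlation", not "highly correlated".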
Next we transform brand into a factor and check for missing values. We also normalize salary and age, find and remove the outliers in salary, and bin it.
#transforming brand to a factor and setting the levels
complete$brand<-as.factor(complete$brand)
levels(complete$brand) <- c("Acer","Sony")
#normalize the salary and age
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
complete$salary<-normalize(complete$salary)
complete$age<-normalize(complete$age)
#finding missing values
sum(is.na(complete)) #no missing values
#finding outliers in salary
outlier_tfSalary = outlier(complete$salary,logical=TRUE)
sum(outlier_tfSalary)
#what were the outliers in salary
find_outlierSalary = which(outlier_tfSalary==TRUE,arr.ind=TRUE)
#Removing the outliers in salary
complete= complete[-find_outlierSalary,]
#binning the salary
bin(
complete$salary,
nbins = 5,
labels = NULL,
method = c( "content")
)
We can see that after the preprocessing is done we have 3 features (salary, age and brand); age and salary are normalized and brand is a factor with 2 levels.
## 'data.frame': 9783 obs. of 3 variables:
## $ salary: num 0.768 0.668 0.446 0.336 0.237 ...
## $ age : num 0.417 0.717 0.05 0.517 0 ...
## $ brand : Factor w/ 2 levels "Acer","Sony": 1 2 1 2 1 2 2 2 1 2 ...
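Equal-content binning, which is what `method = "content"` requests in the `bin` call above as we understand that option, puts roughly the same number of observations in each bin. The base R sketch below illustrates the idea with `quantile` and `cut` on simulated values; it is a stand-in for the technique and not the code used in this report. It also stores the result, which is needed if the binned variable is to be used afterwards.

```r
# Hypothetical normalized salaries
set.seed(123)
salary <- runif(1000)

# Bin edges at the 0%, 20%, ..., 100% quantiles give 5 equally filled bins
edges <- quantile(salary, probs = seq(0, 1, length.out = 6))
salaryBinned <- cut(salary, breaks = edges, include.lowest = TRUE)

table(salaryBinned)  # each bin holds about a fifth of the observations
```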
For this task we build two models: a C5.0 decision tree and a random forest. To partition the data, build the models and make the predictions we use the caret library.
Before building the models we need to split our data into two sets: a training set used to fit the models and a testing set used to evaluate their performance.
#creating the numbers for each set with a 75 % ratio
set.seed(123)
inTrain<- createDataPartition(y= complete$brand,p=.75, list = FALSE)
#partitioning the data into training and testing set
training <- complete[ inTrain,]
testing <- complete[-inTrain,]
#number of rows and the table head in each set
nrow(training)
## [1] 7338
## salary age brand
## 1 0.7677427 0.41666667 Acer
## 2 0.6683114 0.71666667 Sony
## 3 0.4463135 0.05000000 Acer
## 4 0.3360764 0.51666667 Sony
## 6 0.8524057 0.60000000 Sony
## 7 0.8958411 0.06666667 Sony
## [1] 2445
## salary age brand
## 5 0.2374894 0.0000000 Acer
## 10 0.1369487 0.3500000 Sony
## 14 0.4805737 0.2166667 Acer
## 30 0.6746948 0.9166667 Sony
## 31 0.9884265 0.7000000 Sony
## 38 0.2284007 0.3666667 Sony
We set the resampling procedure to a 10-fold cross-validation repeated 3 times.
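With caret, this resampling scheme is expressed through `trainControl`. The sketch below shows a minimal version of the `ctrl` object that the training call relies on; any additional arguments used in the original run are not shown at this point in the report and are left out here.

```r
library(caret)

# 10-fold cross-validation, repeated 3 times
ctrl <- trainControl(
  method  = "repeatedcv",
  number  = 10,
  repeats = 3
)
```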
Using the train function available in caret, we now train our C5.0 model on the training data.
set.seed(123)
treeFit <- train(
brand~.,
data = training,
method = "C5.0",
trControl=ctrl,
tuneLength = 3
)
Next we used the model to predict the brand on the testing data and built the confusion matrix for this model. From the confusion matrix we know that the model has an accuracy of 0.9235 and a kappa of 0.839. That means that the model is good at predicting the brand on the testing data.
#applying the model on test
treeBrand <- predict(treeFit, newdata = testing)
#confusion matrix and the statistics
confusionMatrix(data= treeBrand, testing$brand)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 855 106
## Sony 81 1403
##
## Accuracy : 0.9235
## 95% CI : (0.9123, 0.9337)
## No Information Rate : 0.6172
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.839
##
## Mcnemar's Test P-Value : 0.07925
##
## Sensitivity : 0.9135
## Specificity : 0.9298
## Pos Pred Value : 0.8897
## Neg Pred Value : 0.9454
## Prevalence : 0.3828
## Detection Rate : 0.3497
## Detection Prevalence : 0.3930
## Balanced Accuracy : 0.9216
##
## 'Positive' Class : Acer
##
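The headline statistics can be checked by hand from the four cells of the confusion matrix above. The sketch below recomputes accuracy, Cohen's kappa, sensitivity and specificity in base R, with Acer as the positive class as in the output; the variable names are chosen for illustration.

```r
# Cells copied from the C5.0 confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(855, 81, 106, 1403), nrow = 2,
             dimnames = list(Prediction = c("Acer", "Sony"),
                             Reference  = c("Acer", "Sony")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n                        # share of correct predictions

# Agreement expected by chance, from the row and column totals
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappaStat <- (accuracy - expected) / (1 - expected)  # Cohen's kappa

sensitivity <- cm["Acer", "Acer"] / sum(cm[, "Acer"])  # true Acer found
specificity <- cm["Sony", "Sony"] / sum(cm[, "Sony"])  # true Sony found

round(c(accuracy = accuracy, kappa = kappaStat,
        sensitivity = sensitivity, specificity = specificity), 4)
```

The rounded values agree with the reported accuracy of 0.9235 and kappa of 0.839.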
For the second model we use the random forest algorithm with the same cross-validation procedure as for the decision tree. For this model we tuned the parameters manually, giving the mtry parameter different values, and the algorithm selected mtry = 1 as the best value.
#setting the cross validation
ctrl <- trainControl(
method = "repeatedcv",
number = 10,
repeats = 3,
summaryFunction = multiClassSummary
)
#setting the mtry
mtry <- c(2,1,3,5,12,32)
rfGrid <- expand.grid(.mtry=mtry)
set.seed(123)
#train the model
system.time(
randomForestFit <- train(
brand~.,
data = training,
preProcess = c("center"),
method = "rf",
trControl=ctrl,
tuneGrid=rfGrid,
importance=T
)
)
## user system elapsed
## 413.79 27.78 469.79
### Prediction and Performance Random Forest
The model was then applied to the testing data and the confusion matrix was drawn up. The performance of this model was good, with an accuracy of 0.9149 and a kappa of 0.8211.
#applying the model on test
rfBrand <- predict(randomForestFit, newdata = testing)
#confusion matrix for the random forest model
confusionMatrix(data= rfBrand, testing$brand)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Acer Sony
## Acer 848 120
## Sony 88 1389
##
## Accuracy : 0.9149
## 95% CI : (0.9032, 0.9257)
## No Information Rate : 0.6172
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8211
##
## Mcnemar's Test P-Value : 0.0316
##
## Sensitivity : 0.9060
## Specificity : 0.9205
## Pos Pred Value : 0.8760
## Neg Pred Value : 0.9404
## Prevalence : 0.3828
## Detection Rate : 0.3468
## Detection Prevalence : 0.3959
## Balanced Accuracy : 0.9132
##
## 'Positive' Class : Acer
##
Using the resamples function we compared the two models and saw that the C5.0 decision tree is slightly better than the random forest and less computationally expensive. We will also see this in the next visualizations.
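A comparison of this kind could look like the sketch below. It assumes the caret, C50 and randomForest packages are installed and uses a small simulated data set with hypothetical object names (`toy`, `toyTree`, `toyForest`), so the numbers it prints say nothing about the survey models. Setting the same seed before each `train` call gives both models the same resampling folds, which is what makes the comparison fair.

```r
library(caret)

# Small hypothetical data set: brand depends loosely on salary
set.seed(1)
n <- 300
toy <- data.frame(salary = runif(n), age = runif(n))
toy$brand <- factor(ifelse(toy$salary + rnorm(n, sd = 0.2) > 0.5, "Sony", "Acer"))

# Lighter resampling than in the report, to keep the sketch fast
toyCtrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

set.seed(123)
toyTree <- train(brand ~ ., data = toy, method = "C5.0",
                 trControl = toyCtrl, tuneLength = 2)

set.seed(123)  # same seed, hence the same folds as for the tree
toyForest <- train(brand ~ ., data = toy, method = "rf",
                   trControl = toyCtrl, tuneGrid = expand.grid(mtry = 1:2))

# Collect the per-fold accuracy and kappa of both models
resamps <- resamples(list(C5.0 = toyTree, RF = toyForest))
summary(resamps)
```

Calling `bwplot(resamps)` on the result draws the box plots of accuracy and kappa per model.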
The last step in our task was to take the model that we trained and apply it to the incomplete survey data. After applying it we explore the results and infer which brand the customers prefer and with which of the two, Acer or Sony, Blackwell should consider a business relationship in the future.
This process consists of several steps: bringing the incomplete data into the R environment and preprocessing it exactly as we did with the complete data (normalization, change of variable type, binning, feature selection). Then we apply the decision tree model to the data and explore the results.
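One point worth making concrete is the normalization of the new data. A min-max function that uses the minimum and maximum of whatever vector it receives would rescale the incomplete data by its own range. To put both data sets on the same scale, the range of the complete data can be reused. The report does not show this step, so whether it was done this way is an assumption; the helper `normalizeWith` and the salary values below are hypothetical.

```r
# Min-max scaling of x using the range of a reference vector
normalizeWith <- function(x, reference) {
  (x - min(reference)) / (max(reference) - min(reference))
}

completeSalary   <- c(20000, 85000, 150000)   # hypothetical training salaries
incompleteSalary <- c(20000, 52500, 117500)   # hypothetical new salaries

scaled <- normalizeWith(incompleteSalary, completeSalary)
scaled  # 0.00 0.25 0.75
```

If both data sets span the same range, as the salary limits of 20000 and 150000 suggest, the two approaches give the same result.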
After applying the model we can see that it predicts that 1913 of our customers prefer Acer and the remaining 3087 prefer Sony.
#Applying the model to the incomplete data
finalPredictionBrand <- predict(treeFit, newdata = incomplete)
#summary of the results
summary(finalPredictionBrand)
## Acer Sony
## 1913 3087
#adding the predicted values to the incomplete data and bringing back the other features
completeNewData <- incompletef[c(1,2,3,4,5,6)]
completeNewData <- cbind(completeNewData, predictedBrand = finalPredictionBrand)
head(completeNewData)
## salary age elevel car zipcode credit predictedBrand
## 1 110499.74 54 3 15 4 354724.18 Acer
## 2 140893.78 44 4 20 7 395015.34 Sony
## 3 119159.65 49 2 1 3 122025.09 Acer
## 4 20000.00 56 0 9 1 99629.62 Sony
## 5 93956.32 59 1 15 1 458679.83 Acer
## 6 41365.43 71 2 7 2 216839.72 Acer
The proportions of the two brands in the predicted data confirm the above statement: more customers prefer Sony than Acer. Interestingly, this proportion is similar to the one in the complete data. We expected this because the two data sets have similar distributions.
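The similarity can be checked with simple arithmetic on the figures already reported: the predicted counts for the incomplete data and the mean of the 0/1 brand column from the summary of the complete data, which is the Sony share there before outlier removal. The variable names below are chosen for illustration.

```r
# Predicted counts for the incomplete data
predicted      <- c(Acer = 1913, Sony = 3087)
predictedShare <- predicted / sum(predicted)
round(predictedShare, 4)   # Acer 0.3826, Sony 0.6174

# Sony share in the complete data (mean of brand in the summary output)
surveySonyShare <- 0.6217

# Difference between the two Sony shares, in percentage points
shareGap <- 100 * (predictedShare[["Sony"]] - surveySonyShare)
round(shareGap, 2)         # -0.43
```

The two Sony shares differ by less than half a percentage point.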
After the analysis and the predictions, even though the accuracy is good and the model seems to work properly, we cannot be fully satisfied: we knew from the beginning that the data is not normally distributed, and it looks like it came from a stratified sampling process. Because of this, the data that we have is not representative of the population. However, the decision tree, being a non-parametric model, is not highly sensitive to non-normal data distributions, so we continued with the process.
To fully correct this and to be able to use the data in the future with higher confidence, we need to collect the data in a way that makes it representative of the population.