The sales team at Blackwell Electronics conducted a survey to find out which brand customers prefer, Acer or Sony. Because the brand answer is missing from some of the surveys, the CTO of Blackwell Electronics asked whether it is possible to predict which of the two brands those customers prefer. The task is carried out in R, and the predictions are based on the survey data already collected.
The aim of this report was to build a model that predicts the brand preference (Acer or Sony) for the 5,000 incomplete survey responses, using the roughly 10,000 complete responses (9,898 observations in the dataset used here) to train it.
To find the best possible model, the team first had to make sure that the data fed into it was relevant, so that the result could be trusted. This was done by looking for relationships between each feature in the complete surveys and the brand chosen. Age and salary were kept because, taken together, they showed a clear relationship with brand. None of the other features appeared to be related to brand or to each other, so they were left out.
Plotting age against salary showed a clear pattern: there were three distinct regions of the data in which customers chose Acer (visible in the results for both the complete and the incomplete responses). To help the model pick up those regions, age and salary were divided into categories.
With the features selected, different models were compared to see which gave the better results. The C5.0 decision tree performed best, with an accuracy of 91%, which means its predictions are likely to be correct and can be trusted.
For the incomplete surveys, the model predicted that 1,967 customers (39.34%) prefer Acer and 3,033 (60.66%) prefer Sony. In the complete surveys, 3,744 customers (37.83%) picked Acer and 6,154 (62.17%) picked Sony.
After comparing the model's predictions with the existing data, the data analytics team at Blackwell Electronics concluded that the model is useful and that its predictions are very likely to be right.
After a first look at the data, the team used the function “ggplot” to look for relationships between the features that could help the model predict brand more reliably. The results were important: they showed that only age and salary had any relationship with the brand chosen, so all the other features were removed. To help the model perform better, these two features were then converted into categories. The figure below shows where the relationship between age, salary and brand was found.
0= Acer, 1= Sony
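A minimal sketch of how a plot like this could be produced with ggplot2. The column names `age`, `salary` and `brand` come from the code in this report, but the data frame `survey` and its values are simulated stand-ins (an assumption), so the brand pattern seen in the real survey will not appear here.

```r
library(ggplot2)

# Simulated stand-in for CompleteResponses (NOT the real survey data).
set.seed(42)
survey <- data.frame(
  age    = runif(500, 20, 80),
  salary = runif(500, 20000, 150000),
  brand  = factor(sample(c(0, 1), 500, replace = TRUE))
)

# Scatter plot of salary against age, coloured by preferred brand.
brand_plot <- ggplot(survey, aes(x = age, y = salary, colour = brand)) +
  geom_point(alpha = 0.6) +
  scale_colour_discrete(labels = c("Acer", "Sony")) +  # levels 0 and 1
  labs(x = "Age", y = "Salary", colour = "Brand")

brand_plot
```

With the real data, `survey` would be replaced by `CompleteResponses`, with `brand` converted to a factor first.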
This is how the team categorised age and salary and then created a new dataset containing only the selected features, ready to be passed to the model. The code also shows how the data was split for training and testing.
Age <- CompleteResponses$age
Salary <- CompleteResponses$salary
CompleteResponses$salary1 <- cut(Salary, 5)
CompleteResponses$age1 <- cut(Age , 3)
Features <- c("brand", "salary1", "age1")
CompleteResponsesC <- CompleteResponses[, Features]
str(CompleteResponsesC)
## 'data.frame': 9898 obs. of 3 variables:
## $ brand : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 2 2 1 2 ...
## $ salary1: Factor w/ 5 levels "(1.99e+04,4.6e+04]",..: 4 4 3 2 2 5 5 4 3 1 ...
## $ age1 : Factor w/ 3 levels "(19.9,40]","(40,60]",..: 2 3 1 2 1 2 1 3 1 2 ...
#Training----
library(caret)  # provides createDataPartition() and train()
# 75/25 split, stratified on brand
Intraining <- createDataPartition(y = CompleteResponsesC$brand,
                                  p = .75, list = FALSE)
TrainingData <- CompleteResponsesC[Intraining, ]
TestingData <- CompleteResponsesC[-Intraining, ]
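The split above relies on `createDataPartition`, which samples within each class so that both parts keep roughly the same brand mix. The sketch below illustrates this on simulated labels. The vector `brand`, the class probabilities and the seed are assumptions for illustration only; the report itself does not set a seed.

```r
library(caret)

# Arbitrary seed (an assumption) so that the split is reproducible.
set.seed(123)

# Simulated brand labels (NOT the real survey data): 0 = Acer, 1 = Sony.
brand <- factor(sample(c(0, 1), 1000, replace = TRUE, prob = c(0.38, 0.62)))

# Stratified 75/25 split; with list = FALSE the result is a one-column matrix.
in_train <- createDataPartition(y = brand, p = 0.75, list = FALSE)[, 1]

# Class shares in each part should be close to each other.
train_share <- prop.table(table(brand[in_train]))
test_share  <- prop.table(table(brand[-in_train]))
train_share
test_share
```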
This is how the random forest was trained, followed by its results:
rfGrid <- expand.grid(mtry=c(1, 2 ,3 , 4, 5))
fitControl <- trainControl(method = "repeatedcv",
number = 10, repeats = 1)
mod1 <- train(brand~., data = TrainingData, method = "rf",
trControl=fitControl, tuneGrid = rfGrid)
## Random Forest
##
## 7424 samples
## 2 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 6682, 6682, 6681, 6681, 6682, 6682, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.7098555 0.2745150
## 2 0.8591026 0.6871168
## 3 0.9096166 0.8084951
## 4 0.9128511 0.8162855
## 5 0.9128511 0.8162855
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
This is how the C5.0 decision tree was trained, followed by its results:
mod2 <- train(brand~., data = TrainingData,
method = "C5.0", trControl=fitControl, tuneLength = 2)
## C5.0
##
## 7424 samples
## 2 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 6681, 6681, 6681, 6682, 6683, 6683, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.9128516 0.8162622
## rules FALSE 10 0.8945431 0.7722923
## rules TRUE 1 0.9128516 0.8162622
## rules TRUE 10 0.8945431 0.7722923
## tree FALSE 1 0.9128516 0.8162622
## tree FALSE 10 0.8986996 0.7828544
## tree TRUE 1 0.9128516 0.8162622
## tree TRUE 10 0.8986996 0.7828544
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 1, model = rules
## and winnow = TRUE.
At their best settings, both methods perform almost identically: the random forest reaches an accuracy of 0.9129 (kappa 0.8163) at mtry = 4 and 5, and C5.0 reaches the same accuracy and kappa with trials = 1. The difference is in stability. C5.0 stays between 0.89 and 0.91 for every setting tried, whereas the random forest drops to 0.71 and 0.86 at mtry = 1 and 2. For this reason, the data analytics team at Blackwell Electronics decided that the C5.0 decision tree would make the predictions for this project.
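The accuracy and kappa values in the outputs above are cross-validation estimates computed on TrainingData. A minimal sketch of how the same two metrics could be computed by hand on a held-out set is shown here. The vectors `observed` and `predicted` are hypothetical stand-ins; with the objects from this report they would be `TestingData$brand` and `predict(mod2, TestingData)`.

```r
# Hypothetical stand-ins (NOT results from the report): 0 = Acer, 1 = Sony.
observed  <- factor(c(0, 1, 1, 0, 1, 1, 0, 1, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 1, 1, 1, 1, 1, 0, 0, 1, 1), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are the true answers.
confusion <- table(Predicted = predicted, Observed = observed)

# Accuracy: share of responses on the diagonal (predicted == observed).
accuracy <- sum(diag(confusion)) / sum(confusion)

# Cohen's kappa: agreement beyond what the class frequencies alone would give.
expected_agreement <- sum(rowSums(confusion) * colSums(confusion)) /
  sum(confusion)^2
cohen_kappa <- (accuracy - expected_agreement) / (1 - expected_agreement)

confusion
c(Accuracy = accuracy, Kappa = cohen_kappa)
```

Because the two brands are not equally common, kappa is a useful companion to accuracy: a model that always answered Sony would still be right about 62% of the time on the complete responses.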
This is how the prediction was achieved and the results attained:
#Loading the data
SurveyIncomplete <- read.csv("SurveyIncomplete.csv")
#Adjusting the data (same binning calls as for the complete responses)
SurveyIncomplete$brand <- as.factor(SurveyIncomplete$brand)
AgeSI <- SurveyIncomplete$age
SalarySI <- SurveyIncomplete$salary
SurveyIncomplete$salary1 <- cut(SalarySI, 5)
SurveyIncomplete$age1 <- cut(AgeSI, 3)
#Feature selection
FeaturesSI <- c("brand", "salary1", "age1")
SurveyIncompleteC <- SurveyIncomplete[, FeaturesSI]
#Prediction
PredictionBrands <- predict(mod2, SurveyIncompleteC)
#Including our prediction in survey incomplete
SurveyIncomplete$brand <- PredictionBrands
## 0 1
## 1967 3033
0= Acer, 1= Sony
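One caveat about the binning step: `cut(x, 5)` derives its interval boundaries from the range of `x`, so calling it separately on the complete and the incomplete survey only yields matching categories if both surveys span the same range of ages and salaries. A more defensive approach, sketched here on simulated salaries (the vectors and the name `salary_breaks` are assumptions, not part of the report), is to compute the breakpoints once and reuse them. Note that these breakpoints differ very slightly from those of `cut(x, 5)`, which widens the range by 0.1%.

```r
# Simulated stand-ins for the two salary columns (NOT the real data).
set.seed(1)
complete_salary   <- runif(1000, 20000, 150000)
incomplete_salary <- runif(500, 25000, 140000)

# Separate calls: each derives its own equal-width bins from its own range.
levels(cut(complete_salary, 5))
levels(cut(incomplete_salary, 5))

# Shared breakpoints: five equal-width bins taken from the complete survey.
salary_range  <- range(complete_salary)
salary_breaks <- seq(salary_range[1], salary_range[2], length.out = 6)

complete_bin   <- cut(complete_salary, breaks = salary_breaks,
                      include.lowest = TRUE)
incomplete_bin <- cut(incomplete_salary, breaks = salary_breaks,
                      include.lowest = TRUE)

# Both factors now use exactly the same categories.
identical(levels(complete_bin), levels(incomplete_bin))
```

Values in the incomplete survey that fall outside the complete survey's range would become `NA` under this scheme, which at least makes the mismatch visible instead of silently shifting the categories.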
The predictions show a clear preference for Sony over Acer. Given the model's accuracy of about 91%, the team considers these predictions likely to be correct and therefore trustworthy. Before drawing any conclusions, however, let's add these results to the overall comparison.
This is how we add both tables together into just one:
AllData <- rbind(CompleteResponses[1:7], SurveyIncomplete[1:7])
## 0 1
## 5711 9187
0= Acer, 1= Sony
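The shares behind these counts can be checked with a few lines of base R. The counts are the ones reported in the tables above; the object names `counts` and `shares` are illustrative.

```r
# Brand counts reported in this document (0 = Acer, 1 = Sony).
counts <- rbind(
  complete   = c(Acer = 3744, Sony = 6154),
  incomplete = c(Acer = 1967, Sony = 3033)
)

# Add the combined survey as a third row.
counts <- rbind(counts, combined = colSums(counts))

# Row-wise percentages: each row sums to 100.
shares <- round(100 * prop.table(counts, margin = 1), 2)

cbind(counts, total = rowSums(counts))
shares
```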
The overall results give Sony a clear victory, with 9,187 of the 14,898 responses (just under 62%), a share similar to the one in the model's predictions. Since the model has a high accuracy, the team trusts its predictions and therefore recommends that the sales team at Blackwell Electronics pick Sony over Acer, as it is the brand customers prefer.
The graph below shows how strongly age and salary together relate to brand, to the point that one fits into the other almost perfectly. That is why the model reaches 91% accuracy. It is very difficult to improve from there: none of the other features are related to brand or to each other, and it is hard to help the model pick up the customers who behave a little differently in age and salary without losing accuracy somewhere else.
0= Acer, 1= Sony