Brand Prediction
Example of solution
Background
We have been asked by Danielle Sherman, CTO of Blackwell Electronics, to predict the customers brand preferences that are missing from the incomplete surveys. In the following report, you will find a long analysis of both samples to see if they are useful for our purpose of predicting the customer brands preferences and find the differences between both of them, as well as a prediction for brand preferences in the incomplete survey.
Executive Summary
Highlights:
There are several indicators that strongly suggest the data was sampled by a clustered method, in each a stratification was created by education level, region and car. This type of sample is recommended for targeted marketing campaigns, but we cannot calculate total proportion of all Blackwell customers that prefer Sony/Acer.
Also, there is sign that data may have been fabricated, it has very questionable quality. For example, customers reported their salaries and credit with precision that doesnt make sense in the real world. For example a reported salary of $113,236.3836. No human would report their salary with 4 decimal places. An exchange rate conversion could add the decimal places, but even so, humans tend to report their salary or credit level in round whole numbers: $110,000, $75,000, perhaps $42,000, but not very often as precisely as $42,235.
Load packages and data
pacman:: p_load(dummies, caret, party, xlsx, corrplot, ggplot2, plotly, gplots, reshape2, dplyr)
complete <- read.csv("Complete_Responses_v1_inputR.csv", sep=";")
incomplete <- read.csv("SurveyIncomplete v1.csv",
sep=";")We see the same number of brand preference behaviour in all classes of education level, region and car
Initial exploration
# check sample method
plot(complete$elevel, complete$brand)plot(complete$region, complete$brand)plot(complete$car, complete$brand)# distribution of depedent variable
# change to pie chart
f <- ggplot(as.data.frame(table(complete$brand)), aes(x = "",y = Freq, fill = Var1)) +
geom_bar(width = 1, stat = "identity")
f + coord_polar("y", start=0)# check distribution complete versus incomplete
complete$survey <- "complete"
incomplete$survey <- "incomplete"
all <- rbind(complete, incomplete)
for(i in names(all[ ,!names(all) %in% c("survey","brand")])){
x <- ggplot(data = all,aes_string(i,color="survey")) + geom_density()
print(x)
}all lines are unique (semi join is a good technique to check that)
x <- semi_join(incomplete,complete, by = c("elevel", "car", "region", "age", "yearly.salary", "credit"))
nrow(x)## [1] 0
complete$survey <- NULL
incomplete$survey <- NULL
# Chi analysis
chi_eleval_brand <- chisq.test(complete$elevel,complete$brand)
chi_eleval_brand##
## Pearson's Chi-squared test
##
## data: complete$elevel and complete$brand
## X-squared = 0.65902, df = 4, p-value = 0.9563
chi_car_brand <- chisq.test(complete$car,complete$brand)
chi_car_brand##
## Pearson's Chi-squared test
##
## data: complete$car and complete$brand
## X-squared = 12.221, df = 19, p-value = 0.876
chi_region_brand <- chisq.test(complete$region,complete$brand)
chi_region_brand##
## Pearson's Chi-squared test
##
## data: complete$region and complete$brand
## X-squared = 7.3888, df = 8, p-value = 0.4953
# decision tree
decisiontree <- ctree(brand~.,data = complete, controls = ctree_control(maxdepth = 3))
plot(decisiontree)Modeling
set.seed(123)
inTraining <- createDataPartition(complete$brand,p=.75,list = FALSE)
training <- complete[inTraining,]
testing <- complete[-inTraining,]
fitControl <- trainControl(method = "cv", number=2, verboseIter = F)
# loop for models
combined <- c()
a <- c("knn", "rf", "svmRadial", "svmLinear", "gbm")
for(i in a) {
Fit <- train(brand~yearly.salary+age,data = training,method = i,
na.action = na.omit, trControl = fitControl, tuneLength = 5,
preProcess = c("center", "scale"))
pred <- predict(Fit,testing)
res <- postResample(pred,testing$brand)
combined <- cbind(combined, res)
}colnames(combined) <- a
combined## knn rf svmRadial svmLinear gbm
## Accuracy 0.9199680 0.9011605 0.9235694 0.6218487 0.9251701
## Kappa 0.8294054 0.7885685 0.8368447 0.0000000 0.8397947
## gbm
gbmFit <- train(brand~yearly.salary+age,data = training,method = "gbm",
na.action = na.omit, trControl = fitControl, tuneLength = 20,
preProcess = c("center", "scale"))
predictbrand_gbm <- predict(gbmFit,testing)
postResample(predictbrand_gbm,testing$brand)
confusionMatrix(testing$brand,predictbrand_gbm)
## Applying prediction
incomplete$brand <- predict(gbmFit,incomplete)
# Visual results
f <- ggplot(incomplete, aes(age, yearly.salary))
f + geom_point(aes(colour = factor(brand)))
all <- rbind(complete, incomplete)
## Export prediction
write.xlsx(incomplete, file = "incompleteprediction.xlsx")
write.csv(all, file = "all_regions.csv")