Brand Prediction

Example of solution

Background

We have been asked by Danielle Sherman, CTO of Blackwell Electronics, to predict the customers brand preferences that are missing from the incomplete surveys. In the following report, you will find a long analysis of both samples to see if they are useful for our purpose of predicting the customer brands preferences and find the differences between both of them, as well as a prediction for brand preferences in the incomplete survey.

Executive Summary

Highlights:

  • There are several indicators that strongly suggest the data was sampled by a clustered method, in each a stratification was created by education level, region and car. This type of sample is recommended for targeted marketing campaigns, but we cannot calculate total proportion of all Blackwell customers that prefer Sony/Acer.

  • Also, there is sign that data may have been fabricated, it has very questionable quality. For example, customers reported their salaries and credit with precision that doesnt make sense in the real world. For example a reported salary of $113,236.3836. No human would report their salary with 4 decimal places. An exchange rate conversion could add the decimal places, but even so, humans tend to report their salary or credit level in round whole numbers: $110,000, $75,000, perhaps $42,000, but not very often as precisely as $42,235.

Load packages and data

pacman:: p_load(dummies, caret, party, xlsx, corrplot, ggplot2, plotly, gplots, reshape2, dplyr)

complete <- read.csv("Complete_Responses_v1_inputR.csv", sep=";")

incomplete <- read.csv("SurveyIncomplete v1.csv", 
                                sep=";")

We see the same number of brand preference behaviour in all classes of education level, region and car

Initial exploration

# check sample method
plot(complete$elevel, complete$brand)

plot(complete$region, complete$brand)

plot(complete$car, complete$brand)

# distribution of depedent variable

# change to pie chart
f <- ggplot(as.data.frame(table(complete$brand)), aes(x = "",y = Freq, fill = Var1)) + 
  geom_bar(width = 1, stat = "identity")

f + coord_polar("y", start=0)

# check distribution complete versus incomplete 
complete$survey <- "complete"
incomplete$survey <- "incomplete"

all <- rbind(complete, incomplete)

for(i in names(all[ ,!names(all) %in% c("survey","brand")])){
x <- ggplot(data = all,aes_string(i,color="survey")) + geom_density()
print(x)
}

all lines are unique (semi join is a good technique to check that)

x <- semi_join(incomplete,complete, by = c("elevel", "car", "region", "age", "yearly.salary", "credit"))

nrow(x)
## [1] 0
complete$survey <- NULL
incomplete$survey <- NULL

# Chi analysis 

chi_eleval_brand <- chisq.test(complete$elevel,complete$brand)
chi_eleval_brand
## 
##  Pearson's Chi-squared test
## 
## data:  complete$elevel and complete$brand
## X-squared = 0.65902, df = 4, p-value = 0.9563
chi_car_brand <- chisq.test(complete$car,complete$brand)
chi_car_brand
## 
##  Pearson's Chi-squared test
## 
## data:  complete$car and complete$brand
## X-squared = 12.221, df = 19, p-value = 0.876
chi_region_brand <- chisq.test(complete$region,complete$brand)
chi_region_brand
## 
##  Pearson's Chi-squared test
## 
## data:  complete$region and complete$brand
## X-squared = 7.3888, df = 8, p-value = 0.4953
# decision tree

decisiontree <- ctree(brand~.,data = complete, controls = ctree_control(maxdepth = 3))
plot(decisiontree)

Modeling


set.seed(123)

inTraining <- createDataPartition(complete$brand,p=.75,list = FALSE)
training <- complete[inTraining,]
testing <- complete[-inTraining,]
fitControl <- trainControl(method = "cv", number=2, verboseIter = F)


# loop for models

combined <- c()

a <- c("knn", "rf", "svmRadial", "svmLinear", "gbm")
for(i in a) {

Fit <- train(brand~yearly.salary+age,data = training,method = i, 
                na.action = na.omit, trControl = fitControl, tuneLength = 5, 
                preProcess = c("center", "scale"))

pred <- predict(Fit,testing)

res <- postResample(pred,testing$brand)

combined <- cbind(combined, res) 

}
colnames(combined) <- a

combined
##                knn        rf svmRadial svmLinear       gbm
## Accuracy 0.9199680 0.9011605 0.9235694 0.6218487 0.9251701
## Kappa    0.8294054 0.7885685 0.8368447 0.0000000 0.8397947

## gbm

gbmFit <- train(brand~yearly.salary+age,data = training,method = "gbm", 
                na.action = na.omit, trControl = fitControl, tuneLength = 20, 
                preProcess = c("center", "scale"))

predictbrand_gbm <- predict(gbmFit,testing)
postResample(predictbrand_gbm,testing$brand)
confusionMatrix(testing$brand,predictbrand_gbm)

## Applying prediction

incomplete$brand <- predict(gbmFit,incomplete)

# Visual results

f <- ggplot(incomplete, aes(age, yearly.salary))

f + geom_point(aes(colour = factor(brand)))


all <- rbind(complete, incomplete)

## Export prediction

write.xlsx(incomplete, file = "incompleteprediction.xlsx")

write.csv(all, file = "all_regions.csv")

Analysing results

08/04/2019