This report main objective is to predict the customer’s brand preference on laptops (Sony or Acer) based on their characteristics such as Age and Salary.
| KNN k = 5 | Random Forest | |
|---|---|---|
| Accuracy | 0.9127651 | 0.8141291 |
| Kappa | 0.9171669 | 0.8239834 |
Comparing both algorithms metrics, we clearly identify that the best model to use for classification here is KNN with k = 5, which has higher accuracy and kappa than Random Forest.
Image showing the predicted amount of values for each brand on the incompete dataset
Image that shows the distribution of customers by Age.
It does not make much sense that it follows a uniform distribution. In theory, it should look like a normal distribution. This means that the sample has not been randomly selected, but a certain number of people has been selected for each Age. This is just an example, all the other distributions but the one on Salary have a uniform range.
Image showing distributions for Salary and Brand for both, Predicted and Original dataset. The distribution of data is quite similar, so we can use the model to predict on this data.
The distributions among data attributes on the Predicted values are quite similar compared to the distributions on the Original data set, which means we can use the model to classify customers that have this very similar characteristics to ones on our dataset. Due to to the issue that the sampling of customers has not been random, we cannot extrapolate the info for any new potential customer. Then again, is very important to make clear that the model for classification works, but only for potential customers that have similar attribute characteristics to the ones used for training our model, not for any kind of customer.
To improve this model and make it useful to classify any new potential customer, we would need access to a new sample, as big as possible, of randomly selected customers.
# df = read.xlsx(file = "C:/Users/David/Google Drive/Ubiqum/3_BrandPreference/survey_complete.xlsx",
# sheetIndex = 2)
# save(df, file = "survey_complete.Rdata")
load("survey_complete.Rdata")
head(df)
str(df)
## 'data.frame': 10000 obs. of 7 variables:
## $ salary : num 119807 106880 78021 63690 50874 ...
## $ age : num 45 63 23 51 20 56 24 62 29 41 ...
## $ elevel : num 0 1 0 3 3 3 4 3 4 1 ...
## $ car : num 14 11 15 6 14 14 8 3 17 5 ...
## $ zipcode: num 4 6 2 5 4 3 5 0 0 4 ...
## $ credit : num 442038 45007 48795 40889 352951 ...
## $ brand : num 0 1 0 1 0 1 1 1 0 1 ...
We need to format the label attributes properly, as all attributes appear like numeric and that is not the way all data is.
df$elevel = factor(x = df$elevel,
levels = 0:4,
labels = c("Less_Than_High_School", "High_School", "Some_College", "4_Year_Degree_College", "Master_PHD"),
ordered = TRUE) #Making 'elevel' an ordinal factor attribute
df$zipcode = factor(x = df$zipcode,
levels = 0:8,labels = c("New_England", "Mid_Atlantic", "ENC", "WNC", "South_Atlantic", "ESC", "WSC", "Mountain", "Pacific"),
ordered = FALSE) #Labeling regions
df$brand = factor(x = df$brand,
levels = 0:1,
labels = c("Acer", "Sony"),
ordered = FALSE) #Labeling factors
df$car = as.factor(df$car) #Convert variable to factor
str(df)
## 'data.frame': 10000 obs. of 7 variables:
## $ salary : num 119807 106880 78021 63690 50874 ...
## $ age : num 45 63 23 51 20 56 24 62 29 41 ...
## $ elevel : Ord.factor w/ 5 levels "Less_Than_High_School"<..: 1 2 1 4 4 4 5 4 5 2 ...
## $ car : Factor w/ 20 levels "1","2","3","4",..: 14 11 15 6 14 14 8 3 17 5 ...
## $ zipcode: Factor w/ 9 levels "New_England",..: 5 7 3 6 5 4 6 1 1 5 ...
## $ credit : num 442038 45007 48795 40889 352951 ...
## $ brand : Factor w/ 2 levels "Acer","Sony": 1 2 1 2 1 2 2 2 1 2 ...
# df = df[!df$credit == 0, ] #Deletig rows were the "credit" value is 0
# df["new"] <- df$salary/df$credit
# cor(df[,c("salary", "credit", "new")])
hist(df$salary, main = "Salary histogram")
hist(df$age, main = "Age histogram")
hist(df$credit, main = "Credit histogram")
barplot(table(df$elevel))
barplot(table(df$car), main = "Car brands category")
barplot(table(df$zipcode))
barplot(table(df$brand))
table(df$zipcode)
##
## New_England Mid_Atlantic ENC WNC South_Atlantic
## 1097 1066 1125 1094 1103
## ESC WSC Mountain Pacific
## 1113 1167 1091 1144
We can see that all the attributes follow some kind of uniform distribution, which is a little bit awkward. Almost the same amount of people has been selected in this study in each region. The sampling method does not look random at all.
df2 = df[c("salary", "age", "credit")]
plot(df2)
cor(df2)
## salary age credit
## salary 1.000000000 0.007025207 -0.024047739
## age 0.007025207 1.000000000 -0.004960033
## credit -0.024047739 -0.004960033 1.000000000
As we can see beforehand, there isn’t any kind of correlation between the numeric attributes.
par(mfrow = c(1,2))
plot(df$salary~df$brand, main = "Brand & Salary")
plot(df$brand~df$salary, main = "Brand & Salary")
par(mfrow = c(1,2))
plot(df$age~df$brand, main = "Brand & Age")
plot(df$brand~df$age, main = "Brand & Age")
par(mfrow = c(1,2))
plot(df$elevel~df$brand, main = "Brand & Elevel")
plot(df$brand~df$elevel, main = "Brand & Elevel")
par(mfrow = c(1,2))
plot(df$car~df$brand, main = "Brand & Car")
plot(df$brand~df$car, main = "Brand & Car")
par(mfrow = c(1,2))
plot(df$zipcode~df$brand, main = "Brand & Zipcode")
plot(df$brand~df$zipcode, main = "Brand & Zipcode")
par(mfrow = c(1,2))
plot(df$credit~df$brand, main = "Brand & Credit")
plot(df$brand~df$credit, main = "Brand & Credit")
Looks like the only attribute that explains if a brand is prefered is the “salary”. We can see how people that earn between 20k and 60k aprox. prefer Sony over Acer, while people who earn more prefer Acer. That is not true for people earning more than 140k though, they all prefer Sony.
tree = rpart(formula = brand~., data = df)
sumtree = summary(tree, file = "summTree")
sumtree$variable.importance
## age salary car zipcode credit elevel
## 2036.06398 1421.30552 78.59978 42.70644 21.20404 16.06770
The tree shows that the most important variables to segment customers are Age and Salary.
treeAge = rpart(formula = brand~age, data = df)
rpart.plot(x = treeAge, box.palette = "RdBu", nn = TRUE)
treeSalary = rpart(formula = brand~salary, data = df)
rpart.plot(x = treeSalary, box.palette = "RdBu", nn = TRUE)
tree = rpart(formula = brand~age+salary, data = df)
rpart.plot(x = tree, box.palette = "RdBu", nn = TRUE)
1. We can see how ‘Age’ alone does not explain anything about the chosen brand.
2. Salary itself segments customers pretty fine.
3. We must gather ‘Salary’ and ‘Age’ because they both explain better the segmentation of customers.
df$salary <- (df$salary - mean(df$salary)) / sd(df$salary)
df$age <- (df$age - mean(df$age)) / sd(df$age)
df$credit <- (df$credit - mean(df$credit)) / sd(df$credit)
head(df)
Normalizing transforms all attributes values so they have the same range on a Z distribution.
idxs = createDataPartition(y = df$brand, list = FALSE, p = 0.75)
train = df[idxs, ]
test = df[-idxs, ]
prop.table(table(train$brand)) * 100 #Training set
##
## Acer Sony
## 37.83496 62.16504
prop.table(table(test$brand)) * 100 #Test set
##
## Acer Sony
## 37.81513 62.18487
prop.table(table(df$brand)) * 100 #Original set
##
## Acer Sony
## 37.83 62.17
We can see how the function “createDataPartition” distributes data proportionally equal between partitions (not just randomly).
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
knnFit <- train(brand ~ age+salary,
data = train,
method = "knn",
trControl = ctrl,
tuneLength = 10)
#Output of kNN fit
knnFit
## k-Nearest Neighbors
##
## 7501 samples
## 2 predictor
## 2 classes: 'Acer', 'Sony'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 6751, 6750, 6750, 6751, 6751, 6751, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9148567 0.8192847
## 7 0.9167659 0.8232451
## 9 0.9171661 0.8242526
## 11 0.9197873 0.8298231
## 13 0.9203207 0.8309649
## 15 0.9214763 0.8333910
## 17 0.9209878 0.8322868
## 19 0.9205433 0.8313862
## 21 0.9203651 0.8309584
## 23 0.9208989 0.8320532
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 15.
plot(knnFit)
We can see that the best K is 23 but the difference for accuracy between k = 5 and 23 is really small, so we will take the most simple model, which is k = 5
knnFit <- train(brand ~ age+salary,
data = train,
method = "knn",
trControl = ctrl,
tuneGrid = expand.grid(k = 5))
knnFit
## k-Nearest Neighbors
##
## 7501 samples
## 2 predictor
## 2 classes: 'Acer', 'Sony'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 6751, 6751, 6751, 6751, 6750, 6751, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9164978 0.8226725
##
## Tuning parameter 'k' was held constant at a value of 5
knnPredict <- predict(knnFit, newdata = test)
#Get the confusion matrix to see accuracy value and other parameter values
confmat = confusionMatrix(knnPredict, test$brand)
knitr::kable(confmat$table)
| Acer | Sony | |
|---|---|---|
| Acer | 846 | 115 |
| Sony | 99 | 1439 |
knnMet = postResample(pred = knnPredict, test$brand)
knnMet
## Accuracy Kappa
## 0.9143657 0.8185200
Looks like our model works pretty well on the test-split data, but we need to take a look at how it performs on our validation data. Before that, let’s define a Random Forest model to see if it improves the error rates of KNN on the test data.
rfFit = randomForest(brand ~ age+salary, data = train, ntree = 60)
#Output of random forest fit
rfFit
##
## Call:
## randomForest(formula = brand ~ age + salary, data = train, ntree = 60)
## Type of random forest: classification
## Number of trees: 60
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 8.55%
## Confusion matrix:
## Acer Sony class.error
## Acer 2549 289 0.10183228
## Sony 352 4311 0.07548788
plot(rfFit)
The optimal number of trees is between 20 and 30.
rfPredict <- predict(rfFit, newdata = test)
confmat = confusionMatrix(rfPredict, test$brand)
knitr::kable(confmat$table)
| Acer | Sony | |
|---|---|---|
| Acer | 846 | 120 |
| Sony | 99 | 1434 |
rfMet = postResample(pred = rfPredict, test$brand)
rfMet
## Accuracy Kappa
## 0.9123649 0.8144715
Looks like Random Forest performs great, just as KNN.
metrics = matrix(c(knnMet[1], knnMet[2], rfMet[1], rfMet[2]),
ncol = 2,
nrow = 2,
byrow = T)
colnames(metrics) <- c("KNN k = 5", "Random Forest")
rownames(metrics) <- c("Accuracy", "Kappa")
metrics = knitr::kable(metrics)
# save(... = metrics, file = "metrics.Rdata")
metrics
| KNN k = 5 | Random Forest | |
|---|---|---|
| Accuracy | 0.9143657 | 0.8185200 |
| Kappa | 0.9123649 | 0.8144715 |
Comparing both algorithms metrics, we clearly identify that the best model to use for classification here is KNN, which has higher accuracy and kappa than Random Forest.
# dfIncomp = read.csv(file = "C:/Users/David/Google Drive/Ubiqum/3_BrandPreference/survey_incomplete.csv")
# save(dfIncomp, file = "survey_incomplete.Rdata")
load("survey_incomplete.Rdata")
head(dfIncomp)
str(dfIncomp)
## 'data.frame': 5000 obs. of 7 variables:
## $ salary : num 110500 140894 119160 20000 93956 ...
## $ age : int 54 44 49 56 59 71 32 33 32 58 ...
## $ elevel : int 3 4 2 0 1 2 1 4 1 2 ...
## $ car : int 15 20 1 9 15 7 17 17 19 8 ...
## $ zipcode: int 4 7 3 1 1 2 1 0 2 4 ...
## $ credit : num 354724 395015 122025 99630 458680 ...
## $ brand : int 0 0 0 0 0 0 0 0 0 0 ...
dfIncomp$elevel = factor(x = dfIncomp$elevel,
levels = 0:4,
labels = c("Less_Than_High_School", "High_School", "Some_College", "4_Year_Degree_College", "Master_PHD"),
ordered = TRUE) #Making 'elevel' an ordinal factor attribute
dfIncomp$zipcode = factor(x = dfIncomp$zipcode,
levels = 0:8,labels = c("New_England", "Mid_Atlantic", "ENC", "WNC", "South_Atlantic", "ESC", "WSC", "Mountain", "Pacific"),
ordered = FALSE) #Labeling regions
dfIncomp$brand = factor(x = dfIncomp$brand,
levels = 0:1,
labels = c("Acer", "Sony"),
ordered = FALSE) #Labeling factors
dfIncomp$car = as.factor(dfIncomp$car) #Convert variable to factor
str(dfIncomp)
## 'data.frame': 5000 obs. of 7 variables:
## $ salary : num 110500 140894 119160 20000 93956 ...
## $ age : int 54 44 49 56 59 71 32 33 32 58 ...
## $ elevel : Ord.factor w/ 5 levels "Less_Than_High_School"<..: 4 5 3 1 2 3 2 5 2 3 ...
## $ car : Factor w/ 20 levels "1","2","3","4",..: 15 20 1 9 15 7 17 17 19 8 ...
## $ zipcode: Factor w/ 9 levels "New_England",..: 5 8 4 2 2 3 2 1 3 5 ...
## $ credit : num 354724 395015 122025 99630 458680 ...
## $ brand : Factor w/ 2 levels "Acer","Sony": 1 1 1 1 1 1 1 1 1 1 ...
hist(dfIncomp$salary, main = "Salary histogram")
hist(dfIncomp$age, main = "Age histogram")
hist(dfIncomp$credit, main = "Credit histogram")
barplot(table(dfIncomp$elevel))
barplot(table(dfIncomp$car), main = "Car brands category")
barplot(table(dfIncomp$zipcode))
We can see that distributions here are unsual, just as on the Complete dataset, but they are similar to each other. Due to this reason though, we can predict (classify) values on the new dataset.
dfIncomp$salary <- (dfIncomp$salary - mean(dfIncomp$salary)) / sd(dfIncomp$salary)
dfIncomp$age <- (dfIncomp$age - mean(dfIncomp$age)) / sd(dfIncomp$age)
dfIncomp$credit <- (dfIncomp$credit - mean(dfIncomp$credit)) / sd(dfIncomp$credit)
head(dfIncomp)
knnPredict <- predict(knnFit, newdata = dfIncomp)
summary(knnPredict)
## Acer Sony
## 1897 3103
plot(knnPredict)
We can see the total amount of predictions made for each brand.
dfPredicted = as.data.frame(dfIncomp)
dfPredicted["predicted"] <- knnPredict
dfPredicted = dfPredicted[,-7]
par(mfrow = c(1,2))
plot(dfPredicted$salary~dfPredicted$predicted, main = "Brand & Salary")
plot(dfPredicted$predicted~dfPredicted$salary, main = "Brand & Salary")
par(mfrow = c(1,2))
plot(dfPredicted$age~dfPredicted$predicted, main = "Brand & Age")
plot(dfPredicted$predicted~dfPredicted$age, main = "Brand & Age")
par(mfrow = c(1,2))
plot(dfPredicted$elevel~dfPredicted$predicted, main = "Brand & Elevel")
plot(dfPredicted$predicted~dfPredicted$elevel, main = "Brand & Elevel")
par(mfrow = c(1,2))
plot(dfPredicted$car~dfPredicted$predicted, main = "Brand & Car")
plot(dfPredicted$predicted~dfPredicted$car, main = "Brand & Car")
par(mfrow = c(1,2))
plot(dfPredicted$zipcode~dfPredicted$predicted, main = "Brand & Zipcode")
plot(dfPredicted$predicted~dfPredicted$zipcode, main = "Brand & Zipcode")
par(mfrow = c(1,2))
plot(dfPredicted$credit~dfPredicted$predicted, main = "Brand & Credit")
plot(dfPredicted$predicted~dfPredicted$credit, main = "Brand & Credit")
The distributions between attributes and the label data are quite similar compared to the distributions on the “Complete” dataset, which means we can use the model to predict this specific observations, but we cannot extrapolate the info por the population at USS, due to sample and population issues.
par(mfrow = c(1,2))
plot(dfPredicted$predicted~dfPredicted$salary, main = "Brand & Salary on PREDICTED")
plot(df$brand~df$salary, main = "Brand & Salary on ORIGINAL")
As we have just seen with all other distribution plots, we can confirm that the data at the Incomplete dataset follows the Original data distribution.