REPORT DESCRIPTION

This report main objective is to predict the customer’s brand preference on laptops (Sony or Acer) based on their characteristics such as Age and Salary.


SUMMARISED REPORT

1. Metrics about the models used on classification

KNN k = 5 Random Forest
Accuracy 0.9127651 0.8141291
Kappa 0.9171669 0.8239834

Comparing both algorithms metrics, we clearly identify that the best model to use for classification here is KNN with k = 5, which has higher accuracy and kappa than Random Forest.

2. Predicting brand

Image showing the predicted amount of values for each brand on the incompete dataset

3. Data distribution

Image that shows the distribution of customers by Age.

It does not make much sense that it follows a uniform distribution. In theory, it should look like a normal distribution. This means that the sample has not been randomly selected, but a certain number of people has been selected for each Age. This is just an example, all the other distributions but the one on Salary have a uniform range.

4. Salary distribution comparison

Image showing distributions for Salary and Brand for both, Predicted and Original dataset. The distribution of data is quite similar, so we can use the model to predict on this data.

The distributions among data attributes on the Predicted values are quite similar compared to the distributions on the Original data set, which means we can use the model to classify customers that have this very similar characteristics to ones on our dataset. Due to to the issue that the sampling of customers has not been random, we cannot extrapolate the info for any new potential customer. Then again, is very important to make clear that the model for classification works, but only for potential customers that have similar attribute characteristics to the ones used for training our model, not for any kind of customer.

To improve this model and make it useful to classify any new potential customer, we would need access to a new sample, as big as possible, of randomly selected customers.


FULL REPORT

Loading data set

# df = read.xlsx(file = "C:/Users/David/Google Drive/Ubiqum/3_BrandPreference/survey_complete.xlsx",
#                sheetIndex = 2)
# save(df, file = "survey_complete.Rdata")
load("survey_complete.Rdata")
head(df)

Checking structure

str(df)
## 'data.frame':    10000 obs. of  7 variables:
##  $ salary : num  119807 106880 78021 63690 50874 ...
##  $ age    : num  45 63 23 51 20 56 24 62 29 41 ...
##  $ elevel : num  0 1 0 3 3 3 4 3 4 1 ...
##  $ car    : num  14 11 15 6 14 14 8 3 17 5 ...
##  $ zipcode: num  4 6 2 5 4 3 5 0 0 4 ...
##  $ credit : num  442038 45007 48795 40889 352951 ...
##  $ brand  : num  0 1 0 1 0 1 1 1 0 1 ...

We need to format the label attributes properly, as all attributes appear like numeric and that is not the way all data is.

Formatting data properly

df$elevel = factor(x = df$elevel,
                   levels = 0:4,
                   labels = c("Less_Than_High_School", "High_School", "Some_College", "4_Year_Degree_College", "Master_PHD"),
                   ordered = TRUE) #Making 'elevel' an ordinal factor attribute

df$zipcode = factor(x = df$zipcode,
                    levels = 0:8,labels = c("New_England", "Mid_Atlantic", "ENC", "WNC", "South_Atlantic", "ESC", "WSC", "Mountain", "Pacific"),
                    ordered = FALSE) #Labeling regions

df$brand = factor(x = df$brand,
                  levels = 0:1,
                  labels = c("Acer", "Sony"),
                  ordered = FALSE) #Labeling factors

df$car = as.factor(df$car) #Convert variable to factor

str(df)
## 'data.frame':    10000 obs. of  7 variables:
##  $ salary : num  119807 106880 78021 63690 50874 ...
##  $ age    : num  45 63 23 51 20 56 24 62 29 41 ...
##  $ elevel : Ord.factor w/ 5 levels "Less_Than_High_School"<..: 1 2 1 4 4 4 5 4 5 2 ...
##  $ car    : Factor w/ 20 levels "1","2","3","4",..: 14 11 15 6 14 14 8 3 17 5 ...
##  $ zipcode: Factor w/ 9 levels "New_England",..: 5 7 3 6 5 4 6 1 1 5 ...
##  $ credit : num  442038 45007 48795 40889 352951 ...
##  $ brand  : Factor w/ 2 levels "Acer","Sony": 1 2 1 2 1 2 2 2 1 2 ...

Feature engineering - Rows with value 0 at “Credit”

# df = df[!df$credit == 0, ] #Deletig rows were the "credit" value is 0
# df["new"] <- df$salary/df$credit
# cor(df[,c("salary", "credit", "new")])

Exploring data distribution

hist(df$salary, main = "Salary histogram")

hist(df$age, main = "Age histogram")

hist(df$credit, main = "Credit histogram")

barplot(table(df$elevel))

barplot(table(df$car), main = "Car brands category")

barplot(table(df$zipcode))

barplot(table(df$brand))

table(df$zipcode)
## 
##    New_England   Mid_Atlantic            ENC            WNC South_Atlantic 
##           1097           1066           1125           1094           1103 
##            ESC            WSC       Mountain        Pacific 
##           1113           1167           1091           1144

We can see that all the attributes follow some kind of uniform distribution, which is a little bit awkward. Almost the same amount of people has been selected in this study in each region. The sampling method does not look random at all.

Correlation plot

df2 = df[c("salary", "age", "credit")]
plot(df2)

cor(df2)
##              salary          age       credit
## salary  1.000000000  0.007025207 -0.024047739
## age     0.007025207  1.000000000 -0.004960033
## credit -0.024047739 -0.004960033  1.000000000

As we can see beforehand, there isn’t any kind of correlation between the numeric attributes.

Relation between attributes and Dependent variable “Brand”

par(mfrow = c(1,2))
plot(df$salary~df$brand, main = "Brand & Salary")
plot(df$brand~df$salary, main = "Brand & Salary")

par(mfrow = c(1,2))
plot(df$age~df$brand, main = "Brand & Age")
plot(df$brand~df$age, main = "Brand & Age")

par(mfrow = c(1,2))
plot(df$elevel~df$brand, main = "Brand & Elevel")
plot(df$brand~df$elevel, main = "Brand & Elevel")

par(mfrow = c(1,2))
plot(df$car~df$brand, main = "Brand & Car")
plot(df$brand~df$car, main = "Brand & Car")

par(mfrow = c(1,2))
plot(df$zipcode~df$brand, main = "Brand & Zipcode")
plot(df$brand~df$zipcode, main = "Brand & Zipcode")

par(mfrow = c(1,2))
plot(df$credit~df$brand, main = "Brand & Credit")
plot(df$brand~df$credit, main = "Brand & Credit")

Looks like the only attribute that explains if a brand is prefered is the “salary”. We can see how people that earn between 20k and 60k aprox. prefer Sony over Acer, while people who earn more prefer Acer. That is not true for people earning more than 140k though, they all prefer Sony.

Looking at a Decision Tree variables importance

tree = rpart(formula = brand~., data = df)
sumtree = summary(tree, file = "summTree")
sumtree$variable.importance
##        age     salary        car    zipcode     credit     elevel 
## 2036.06398 1421.30552   78.59978   42.70644   21.20404   16.06770

The tree shows that the most important variables to segment customers are Age and Salary.

Age & Salary vs Brand individually and gathered

treeAge = rpart(formula = brand~age, data = df)
rpart.plot(x = treeAge, box.palette = "RdBu", nn = TRUE)

treeSalary = rpart(formula = brand~salary, data = df)
rpart.plot(x = treeSalary, box.palette = "RdBu", nn = TRUE)

tree = rpart(formula = brand~age+salary, data = df)
rpart.plot(x = tree, box.palette = "RdBu", nn = TRUE)

1. We can see how ‘Age’ alone does not explain anything about the chosen brand.

2. Salary itself segments customers pretty fine.

3. We must gather ‘Salary’ and ‘Age’ because they both explain better the segmentation of customers.

Normalizing data

df$salary <- (df$salary - mean(df$salary)) / sd(df$salary)
df$age <- (df$age - mean(df$age)) / sd(df$age)
df$credit <- (df$credit - mean(df$credit)) / sd(df$credit)
head(df)

Normalizing transforms all attributes values so they have the same range on a Z distribution.

Train & Test data.frames

idxs = createDataPartition(y = df$brand, list = FALSE, p = 0.75)
train = df[idxs, ]
test = df[-idxs, ]

Checking distribution in original and partitioned data

prop.table(table(train$brand)) * 100 #Training set
## 
##     Acer     Sony 
## 37.83496 62.16504
prop.table(table(test$brand)) * 100 #Test set
## 
##     Acer     Sony 
## 37.81513 62.18487
prop.table(table(df$brand)) * 100 #Original set
## 
##  Acer  Sony 
## 37.83 62.17

We can see how the function “createDataPartition” distributes data proportionally equal between partitions (not just randomly).

KNN model

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

knnFit <- train(brand ~ age+salary,
                data = train,
                method = "knn",
                trControl = ctrl,
                tuneLength = 10)

#Output of kNN fit
knnFit
## k-Nearest Neighbors 
## 
## 7501 samples
##    2 predictor
##    2 classes: 'Acer', 'Sony' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 6751, 6750, 6750, 6751, 6751, 6751, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9148567  0.8192847
##    7  0.9167659  0.8232451
##    9  0.9171661  0.8242526
##   11  0.9197873  0.8298231
##   13  0.9203207  0.8309649
##   15  0.9214763  0.8333910
##   17  0.9209878  0.8322868
##   19  0.9205433  0.8313862
##   21  0.9203651  0.8309584
##   23  0.9208989  0.8320532
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 15.
plot(knnFit)

We can see that the best K is 23 but the difference for accuracy between k = 5 and 23 is really small, so we will take the most simple model, which is k = 5

KNN with k = 5

knnFit <- train(brand ~ age+salary,
                data = train,
                method = "knn",
                trControl = ctrl,
                tuneGrid = expand.grid(k = 5))

knnFit
## k-Nearest Neighbors 
## 
## 7501 samples
##    2 predictor
##    2 classes: 'Acer', 'Sony' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 6751, 6751, 6751, 6751, 6750, 6751, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9164978  0.8226725
## 
## Tuning parameter 'k' was held constant at a value of 5

KNN performance

knnPredict <- predict(knnFit, newdata = test)
#Get the confusion matrix to see accuracy value and other parameter values

confmat = confusionMatrix(knnPredict, test$brand)
knitr::kable(confmat$table)
Acer Sony
Acer 846 115
Sony 99 1439
knnMet = postResample(pred = knnPredict, test$brand)
knnMet
##  Accuracy     Kappa 
## 0.9143657 0.8185200

Looks like our model works pretty well on the test-split data, but we need to take a look at how it performs on our validation data. Before that, let’s define a Random Forest model to see if it improves the error rates of KNN on the test data.

Random Forest model

rfFit = randomForest(brand ~ age+salary, data = train, ntree = 60)

#Output of random forest fit
rfFit
## 
## Call:
##  randomForest(formula = brand ~ age + salary, data = train, ntree = 60) 
##                Type of random forest: classification
##                      Number of trees: 60
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 8.55%
## Confusion matrix:
##      Acer Sony class.error
## Acer 2549  289  0.10183228
## Sony  352 4311  0.07548788
plot(rfFit)

The optimal number of trees is between 20 and 30.

RF performance

rfPredict <- predict(rfFit, newdata = test)
confmat = confusionMatrix(rfPredict, test$brand)
knitr::kable(confmat$table)
Acer Sony
Acer 846 120
Sony 99 1434
rfMet = postResample(pred = rfPredict, test$brand)
rfMet
##  Accuracy     Kappa 
## 0.9123649 0.8144715

Looks like Random Forest performs great, just as KNN.

Metrics for predicted data on the test set

metrics = matrix(c(knnMet[1], knnMet[2], rfMet[1], rfMet[2]),
                 ncol = 2,
                 nrow = 2,
                 byrow = T)
colnames(metrics) <- c("KNN k = 5", "Random Forest")
rownames(metrics) <- c("Accuracy", "Kappa")
metrics = knitr::kable(metrics)
# save(... = metrics, file = "metrics.Rdata")
metrics
KNN k = 5 Random Forest
Accuracy 0.9143657 0.8185200
Kappa 0.9123649 0.8144715

Comparing both algorithms metrics, we clearly identify that the best model to use for classification here is KNN, which has higher accuracy and kappa than Random Forest.

Loading incomplete data

# dfIncomp = read.csv(file = "C:/Users/David/Google Drive/Ubiqum/3_BrandPreference/survey_incomplete.csv")
# save(dfIncomp, file = "survey_incomplete.Rdata")
load("survey_incomplete.Rdata")
head(dfIncomp)
str(dfIncomp)
## 'data.frame':    5000 obs. of  7 variables:
##  $ salary : num  110500 140894 119160 20000 93956 ...
##  $ age    : int  54 44 49 56 59 71 32 33 32 58 ...
##  $ elevel : int  3 4 2 0 1 2 1 4 1 2 ...
##  $ car    : int  15 20 1 9 15 7 17 17 19 8 ...
##  $ zipcode: int  4 7 3 1 1 2 1 0 2 4 ...
##  $ credit : num  354724 395015 122025 99630 458680 ...
##  $ brand  : int  0 0 0 0 0 0 0 0 0 0 ...

Format new data frame

dfIncomp$elevel = factor(x = dfIncomp$elevel,
                   levels = 0:4,
                   labels = c("Less_Than_High_School", "High_School", "Some_College", "4_Year_Degree_College", "Master_PHD"),
                   ordered = TRUE) #Making 'elevel' an ordinal factor attribute

dfIncomp$zipcode = factor(x = dfIncomp$zipcode,
                    levels = 0:8,labels = c("New_England", "Mid_Atlantic", "ENC", "WNC", "South_Atlantic", "ESC", "WSC", "Mountain", "Pacific"),
                    ordered = FALSE) #Labeling regions

dfIncomp$brand = factor(x = dfIncomp$brand,
                  levels = 0:1,
                  labels = c("Acer", "Sony"),
                  ordered = FALSE) #Labeling factors

dfIncomp$car = as.factor(dfIncomp$car) #Convert variable to factor

str(dfIncomp)
## 'data.frame':    5000 obs. of  7 variables:
##  $ salary : num  110500 140894 119160 20000 93956 ...
##  $ age    : int  54 44 49 56 59 71 32 33 32 58 ...
##  $ elevel : Ord.factor w/ 5 levels "Less_Than_High_School"<..: 4 5 3 1 2 3 2 5 2 3 ...
##  $ car    : Factor w/ 20 levels "1","2","3","4",..: 15 20 1 9 15 7 17 17 19 8 ...
##  $ zipcode: Factor w/ 9 levels "New_England",..: 5 8 4 2 2 3 2 1 3 5 ...
##  $ credit : num  354724 395015 122025 99630 458680 ...
##  $ brand  : Factor w/ 2 levels "Acer","Sony": 1 1 1 1 1 1 1 1 1 1 ...

Exploring data distribution for the Incomplete Dataset

hist(dfIncomp$salary, main = "Salary histogram")

hist(dfIncomp$age, main = "Age histogram")

hist(dfIncomp$credit, main = "Credit histogram")

barplot(table(dfIncomp$elevel))

barplot(table(dfIncomp$car), main = "Car brands category")

barplot(table(dfIncomp$zipcode))

We can see that distributions here are unsual, just as on the Complete dataset, but they are similar to each other. Due to this reason though, we can predict (classify) values on the new dataset.

Normalizing Incomplete dataset

dfIncomp$salary <- (dfIncomp$salary - mean(dfIncomp$salary)) / sd(dfIncomp$salary)
dfIncomp$age <- (dfIncomp$age - mean(dfIncomp$age)) / sd(dfIncomp$age)
dfIncomp$credit <- (dfIncomp$credit - mean(dfIncomp$credit)) / sd(dfIncomp$credit)
head(dfIncomp)

Classifying Incomplete dataset

knnPredict <- predict(knnFit, newdata = dfIncomp)
summary(knnPredict)
## Acer Sony 
## 1897 3103
plot(knnPredict)

We can see the total amount of predictions made for each brand.

Taking a look at Incomplete data with predicted values

dfPredicted = as.data.frame(dfIncomp)
dfPredicted["predicted"] <- knnPredict
dfPredicted = dfPredicted[,-7]

par(mfrow = c(1,2))
plot(dfPredicted$salary~dfPredicted$predicted, main = "Brand & Salary")
plot(dfPredicted$predicted~dfPredicted$salary, main = "Brand & Salary")

par(mfrow = c(1,2))
plot(dfPredicted$age~dfPredicted$predicted, main = "Brand & Age")
plot(dfPredicted$predicted~dfPredicted$age, main = "Brand & Age")

par(mfrow = c(1,2))
plot(dfPredicted$elevel~dfPredicted$predicted, main = "Brand & Elevel")
plot(dfPredicted$predicted~dfPredicted$elevel, main = "Brand & Elevel")

par(mfrow = c(1,2))
plot(dfPredicted$car~dfPredicted$predicted, main = "Brand & Car")
plot(dfPredicted$predicted~dfPredicted$car, main = "Brand & Car")

par(mfrow = c(1,2))
plot(dfPredicted$zipcode~dfPredicted$predicted, main = "Brand & Zipcode")
plot(dfPredicted$predicted~dfPredicted$zipcode, main = "Brand & Zipcode")

par(mfrow = c(1,2))
plot(dfPredicted$credit~dfPredicted$predicted, main = "Brand & Credit")
plot(dfPredicted$predicted~dfPredicted$credit, main = "Brand & Credit")

The distributions between attributes and the label data are quite similar compared to the distributions on the “Complete” dataset, which means we can use the model to predict this specific observations, but we cannot extrapolate the info por the population at USS, due to sample and population issues.

Comparing Predicted and Original datasets Brand vs Salary distribution

par(mfrow = c(1,2))
plot(dfPredicted$predicted~dfPredicted$salary, main = "Brand & Salary on PREDICTED")
plot(df$brand~df$salary, main = "Brand & Salary on ORIGINAL")

As we have just seen with all other distribution plots, we can confirm that the data at the Incomplete dataset follows the Original data distribution.