Executive summary

The conclusions of this report, based on the data I was given, are as follows. The predictions are very accurate, but this is largely because the complete survey and the incomplete survey are distributed almost identically, and both are distributed too evenly. The sampling was stratified. Put simply: the model predicts this incomplete survey well because the two datasets are so alike; if we feed it data that is distributed differently, the predictions will probably be unreliable.

Training using complete data

Data summary

Complete Data Histograms

Incomplete Data Histograms

As these graphs show, the complete and incomplete datasets are very similar to each other in how they were sampled.

## [1] "Complete summary"
##        BMW      Buick   Cadillac  Chevrolet   Chrysler      Dodge 
##        492        509        488        479        505        477 
##       Ford      Honda    Hyundai       Jeep        Kia    Lincoln 
##        495        511        487        500        473        498 
##      Mazda   Mercedes Mitsubishi     Nissan      Other        Ram 
##        473        494        542        470        484        508 
##     Subaru     Toyota 
##        524        489
##    0    1    2    3    4    5    6    7    8 
## 1085 1053 1112 1080 1087 1108 1155 1083 1135
##  4year College    High School No High School            PhD   Some College 
##           1947           1948           2052           1968           1983
## [1] "Incomplete summary"
##        BMW      Buick   Cadillac  Chevrolet   Chrysler      Dodge 
##        247        242        250        245        255        236 
##       Ford      Honda    Hyundai       Jeep        Kia    Lincoln 
##        249        263        237        251        259        243 
##      Mazda   Mercedes Mitsubishi     Nissan      Other        Ram 
##        249        238        239        259        250        250 
##     Subaru     Toyota 
##        262        276
##   0   1   2   3   4   5   6   7   8 
## 529 549 545 548 585 568 542 571 563
##  4year College    High School No High School            PhD   Some College 
##           1001            993            981           1005           1020

These summaries show that every level of each factor has roughly the same number of observations, in both datasets. Counts this even suggest that the data collection was flawed.
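To put a number on "roughly the same", the sketch below takes the five-level counts printed above (they appear to be education levels) and computes a Pearson chi-square statistic against equal expected counts, plus the largest gap in level proportions between the two datasets. The function names are illustrative and not part of the original analysis. The value 9.488 is the standard 5% critical value for 4 degrees of freedom.

```python
# Counts copied from the five-level factor in the summaries above.
complete = {
    "4year College": 1947, "High School": 1948, "No High School": 2052,
    "PhD": 1968, "Some College": 1983,
}
incomplete = {
    "4year College": 1001, "High School": 993, "No High School": 981,
    "PhD": 1005, "Some College": 1020,
}

CRITICAL_5PCT_DF4 = 9.488  # chi-square critical value, alpha = 0.05, df = 4


def proportions(counts):
    """Share of observations falling in each level."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected counts."""
    expected = sum(counts.values()) / len(counts)
    return sum((n - expected) ** 2 / expected for n in counts.values())


p_complete = proportions(complete)
p_incomplete = proportions(incomplete)

# Largest difference in level share between the two datasets.
max_gap = max(abs(p_complete[k] - p_incomplete[k]) for k in complete)

chi_complete = chi_square_uniform(complete)
chi_incomplete = chi_square_uniform(incomplete)

print(f"chi-square (complete)   = {chi_complete:.3f}")
print(f"chi-square (incomplete) = {chi_incomplete:.3f}")
print(f"largest proportion gap  = {max_gap:.4f}")
```

Both statistics fall below the critical value, so neither dataset departs detectably from a perfectly even split across levels, and the level shares of the two datasets differ by barely more than one percentage point.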

Correlations

This scatter plot shows that Brand preference changes as Salary and Age change. The other attributes are not useful for prediction in this model.

Model

The model is k-NN, predicting Brand preference from Salary and Age. I tried other models, such as PLSDA, C5.0 and decision trees, but k-NN gave the best results.
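As a rough illustration of how k-NN classifies a customer from Salary and Age, here is a minimal sketch in plain Python. It is not the code used for this report. The function name `knn_predict`, the toy training rows, and the min-max rescaling step are all assumptions made for the example; the report does not say whether the features were rescaled. The toy call uses k = 3 only because the made-up training set has six rows.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Predict a brand for `query` by majority vote of its k nearest rows.

    train: list of ((salary, age), brand) pairs
    query: (salary, age) tuple
    """
    # Min-max rescale each feature to [0, 1] so that salary, which has a
    # much larger numeric range, does not dominate the distance.
    # (Assumption: the report does not state how features were scaled.)
    lows = [min(x[i] for x, _ in train) for i in range(2)]
    highs = [max(x[i] for x, _ in train) for i in range(2)]

    def scale(point):
        return [
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(point, lows, highs)
        ]

    q = scale(query)
    # Sort training rows by Euclidean distance to the query, keep the k closest.
    nearest = sorted(train, key=lambda row: math.dist(scale(row[0]), q))[:k]
    votes = Counter(brand for _, brand in nearest)
    return votes.most_common(1)[0][0]


# Made-up rows for illustration only; they are not taken from the survey.
toy_train = [
    ((25000, 30), "Acer"),
    ((30000, 35), "Acer"),
    ((28000, 28), "Acer"),
    ((120000, 60), "Sony"),
    ((110000, 55), "Sony"),
    ((130000, 65), "Sony"),
]

print(knn_predict(toy_train, (27000, 32), k=3))
print(knn_predict(toy_train, (125000, 58), k=3))
```

An odd k avoids tied votes when there are only two brands.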

Model performance

This is the performance of the model on our complete dataset, with k = 13. The accuracy is 92% and the kappa is 0.83, so the model agrees with the true labels far more often than chance would explain. It is well trained and can be trusted when predicting on similar data.

   k   Accuracy      Kappa  AccuracySD    KappaSD
5 13  0.9215034  0.8320079   0.0096458  0.0211252
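Kappa rescales accuracy by the agreement expected from chance alone: kappa = (observed - expected) / (1 - expected). The sketch below inverts that formula to see what chance agreement the reported numbers imply. The function names are illustrative. Because the reported accuracy and kappa are presumably averages over resamples, the implied value is only approximate.

```python
def cohen_kappa(observed, expected):
    """Cohen's kappa from observed and chance-expected agreement."""
    return (observed - expected) / (1.0 - expected)


def implied_chance_agreement(observed, kappa):
    """Solve kappa = (observed - expected) / (1 - expected) for expected."""
    return (observed - kappa) / (1.0 - kappa)


accuracy = 0.9215034  # from the table above
kappa = 0.8320079     # from the table above

chance = implied_chance_agreement(accuracy, kappa)
print(f"implied chance agreement = {chance:.4f}")

# Round trip: plugging the implied value back in recovers the reported kappa.
print(f"kappa recomputed = {cohen_kappa(accuracy, chance):.7f}")
```

The implied chance agreement is about 0.53, so guessing from the class frequencies alone would be right a little over half the time. The model's 92% is well above that baseline, which is what a kappa of 0.83 expresses.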

Errors

## [1] "Number of errors = 210"
## [1] "Number of correct = 2265"

Accuracy

Accuracy computed from the counts above: 2265 correct out of 2475 predictions.

## [1] "91.52%"
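The figure follows directly from the error counts. The short sketch below reproduces it and compares it with the resampled accuracy and its standard deviation from the model performance table. The variable names are illustrative.

```python
errors = 210      # from the Errors section
correct = 2265    # from the Errors section
total = errors + correct

test_accuracy = correct / total
print(f"{test_accuracy:.2%}")

# Values from the model performance table.
cv_accuracy = 0.9215034
cv_accuracy_sd = 0.0096458

# Gap between the resampled estimate and the accuracy from the counts.
gap = cv_accuracy - test_accuracy
print(f"gap = {gap:.4f}, within one SD: {abs(gap) < cv_accuracy_sd}")
```

The two accuracy figures differ by less than one standard deviation of the resampled accuracy, so they are consistent with each other.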

Predicting on Incomplete Survey

Number of customers who prefer each brand

## [1] "Customers predicted Acer = 1910"
## [1] "Customers predicted Sony = 3090"
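Expressed as shares of the incomplete survey, the predicted counts work out as follows. This is a small sketch with illustrative variable names.

```python
predicted = {"Acer": 1910, "Sony": 3090}  # counts printed above
total = sum(predicted.values())

# Fraction of customers predicted to prefer each brand.
shares = {brand: n / total for brand, n in predicted.items()}

for brand, share in shares.items():
    print(f"{brand}: {share:.1%} of {total} customers")
```

The total of 5000 matches the number of observations in the incomplete summary, so every customer received a prediction.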