The conclusions of this report, based on the data I was given, are as follows. The predictions are very accurate, but only because the complete survey and the incomplete survey are very similar to each other and both are distributed too evenly. Our sampling method was stratified. Put more simply: the model predicts this incomplete survey well because the two datasets are distributed almost identically; if we input new data that is distributed differently, the predictions will probably be wrong.
As we can see in these graphs, the complete dataset and the incomplete one are very similar to each other in terms of sampling.
## [1] "Complete summary"
## BMW Buick Cadillac Chevrolet Chrysler Dodge
## 492 509 488 479 505 477
## Ford Honda Hyundai Jeep Kia Lincoln
## 495 511 487 500 473 498
## Mazda Mercedes Mitsubishi Nissan Other Ram
## 473 494 542 470 484 508
## Subaru Toyota
## 524 489
## 0 1 2 3 4 5 6 7 8
## 1085 1053 1112 1080 1087 1108 1155 1083 1135
## 4year College High School No High School PhD Some College
## 1947 1948 2052 1968 1983
## [1] "Incomplete summary"
## BMW Buick Cadillac Chevrolet Chrysler Dodge
## 247 242 250 245 255 236
## Ford Honda Hyundai Jeep Kia Lincoln
## 249 263 237 251 259 243
## Mazda Mercedes Mitsubishi Nissan Other Ram
## 249 238 239 259 250 250
## Subaru Toyota
## 262 276
## 0 1 2 3 4 5 6 7 8
## 529 549 545 548 585 568 542 571 563
## 4year College High School No High School PhD Some College
## 1001 993 981 1005 1020
These summaries show that every level of each factor has roughly the same number of observations, which means that the data were not collected correctly.
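The comparison behind this claim can be sketched numerically. The snippet below is a minimal illustration, not the code used for the report: it takes the counts printed in the two summaries for the factor with levels 0 to 8, converts them to relative frequencies, and reports the largest difference between the two surveys. The helper name `proportions` is an assumption introduced here for illustration.

```python
# Counts copied from the two summaries above (factor with levels 0-8).
complete = [1085, 1053, 1112, 1080, 1087, 1108, 1155, 1083, 1135]
incomplete = [529, 549, 545, 548, 585, 568, 542, 571, 563]


def proportions(counts):
    """Convert raw counts into relative frequencies that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


p_complete = proportions(complete)
p_incomplete = proportions(incomplete)

# Largest gap between the two surveys for any single level.
max_gap = max(abs(a - b) for a, b in zip(p_complete, p_incomplete))

# Share each level would have if the factor were perfectly uniform.
uniform_share = 1 / len(complete)

for level, (a, b) in enumerate(zip(p_complete, p_incomplete)):
    print(f"level {level}: complete={a:.3f} incomplete={b:.3f}")
print(f"uniform share = {uniform_share:.3f}, largest gap = {max_gap:.3f}")
```

With these counts, every level sits close to the uniform share of one ninth, and the largest gap between the two surveys is below one percentage point. That is the sense in which the two datasets are distributed alike.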
This scatter plot shows that Brand preference changes with Salary and Age. The other attributes are not useful predictors in this model.
The model uses k-NN to predict Brand preference from Salary and Age. I tried other models, such as PLSDA, C5.0, and decision trees, but k-NN gave the best results.
This is the accuracy of the model we trained on the complete dataset. The kappa is 0.83 and the accuracy is 92%, which means that the model is well trained and its predictions can be trusted when the new data is similar to the training data.
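To make the idea concrete, here is a minimal pure-Python sketch of k-NN classification on two features. It is not the model used in this report: the data rows, the labels attached to them, the helper names `standardize` and `knn_predict`, and the choice to standardize the features are all assumptions made for illustration, and the toy data says nothing about the real relationship between Salary, Age, and Brand.

```python
from collections import Counter
import math


def standardize(rows):
    """Rescale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column is constant, to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
    return scaled, means, stds


def knn_predict(train_x, train_y, query, k):
    """Return the majority label among the k nearest training points."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical (salary, age) rows and brand labels, for illustration only.
salary_age = [(25000, 60), (30000, 55), (28000, 65),
              (120000, 30), (110000, 25), (130000, 35)]
brand = ["Acer", "Acer", "Acer", "Sony", "Sony", "Sony"]

train_x, means, stds = standardize(salary_age)

# Scale a new customer with the training means and deviations, then classify.
new_customer = (115000, 28)
query = [(v - m) / s for v, m, s in zip(new_customer, means, stds)]
print(knn_predict(train_x, brand, query, k=3))
```

Standardizing matters here because Salary is several orders of magnitude larger than Age, so an unscaled Euclidean distance would be driven almost entirely by Salary.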
| | k | Accuracy | Kappa | AccuracySD | KappaSD |
|---|---|---|---|---|---|
| 5 | 13 | 0.9215034 | 0.8320079 | 0.0096458 | 0.0211252 |
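For readers unfamiliar with the Kappa column, the sketch below shows how accuracy and Cohen's kappa are derived from a confusion matrix. The confusion matrix of the actual model is not shown in this report, so the matrix used here is hypothetical, as is the function name `accuracy_and_kappa`.

```python
def accuracy_and_kappa(matrix):
    """Compute accuracy and Cohen's kappa from a square confusion matrix.

    matrix[i][j] is the number of cases of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected if predictions were independent of the actual class.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical confusion matrix (rows: actual, columns: predicted).
hypothetical = [[40, 10],
                [5, 45]]
acc, kappa = accuracy_and_kappa(hypothetical)
print(f"accuracy={acc:.2f} kappa={kappa:.2f}")
```

Kappa discounts the agreement that would occur by chance, which is why it is lower than the raw accuracy in the table.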
## [1] "Number of errors = 210"
## [1] "Number of correct = 2265"
Actual accuracy, computed from the counts of correct and incorrect predictions above:
## [1] "91.52%"
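As a quick check, this figure follows directly from the two counts reported above. The variable names in this short sketch are chosen here for illustration.

```python
# Counts taken from the output above.
errors = 210
correct = 2265

total = errors + correct      # number of predictions that were checked
accuracy = correct / total    # share of predictions that were right

print(f"{accuracy:.2%}")      # 2265 / 2475
```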
Predicted number of customers who prefer each brand in the incomplete survey:
## [1] "Customers predicted Acer = 1910"
## [1] "Customers predicted Sony = 3090"