I decided to use R Markdown so I could better comment on the different code chunks and have the results visible underneath. I did notice two errors in the code chunks below, which is probably why I am assuming you wanted it in PDF form.
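The chunks that follow rely on a few packages: the pipe and mutate() come from dplyr, knn.cv() from class, the plots from ggplot2, pivot_longer() from tidyr, and the / used to stack the final plots from patchwork. A minimal setup chunk along those lines would be:
library(dplyr)      # %>%, mutate(), select()
library(tidyr)      # pivot_longer()
library(ggplot2)    # ggplot(), aes(), theme_bw()
library(class)      # knn.cv()
library(patchwork)  # the / operator used to stack the final plots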
# read the data into R and convert brand to a factor
cars <- read.csv("~/RstudioProjects/BAN340/Data sets/cars.txt") %>%
mutate(brand = factor(brand))
# look at data
summary(cars)
## mpg cylinders cubicinches hp weightlbs
## Min. :10.00 Min. :3.00 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:16.90 1st Qu.:4.00 1st Qu.:101.0 1st Qu.: 75.0 1st Qu.:2246
## Median :22.00 Median :6.00 Median :156.0 Median : 95.0 Median :2835
## Mean :23.14 Mean :5.59 Mean :201.1 Mean :106.4 Mean :3005
## 3rd Qu.:28.80 3rd Qu.:8.00 3rd Qu.:302.0 3rd Qu.:138.0 3rd Qu.:3664
## Max. :46.60 Max. :8.00 Max. :455.0 Max. :230.0 Max. :4997
## time.to.60 year brand
## Min. : 8.00 Min. :1971 Europe.: 48
## 1st Qu.:14.00 1st Qu.:1974 Japan. : 51
## Median :16.00 Median :1977 US. :162
## Mean :15.55 Mean :1977
## 3rd Qu.:17.00 3rd Qu.:1980
## Max. :25.00 Max. :1983
# verify data types
str(cars)
## 'data.frame': 261 obs. of 8 variables:
## $ mpg : num 14 31.9 17 15 30.5 23 13 14 25.4 37.7 ...
## $ cylinders : int 8 4 8 8 4 8 8 8 5 4 ...
## $ cubicinches: int 350 89 302 400 98 350 351 440 183 89 ...
## $ hp : int 165 71 140 150 63 125 158 215 77 62 ...
## $ weightlbs : int 4209 1925 3449 3761 2051 3900 4363 4312 3530 2050 ...
## $ time.to.60 : int 12 14 11 10 17 17 13 9 20 17 ...
## $ year : int 1972 1980 1971 1971 1978 1980 1974 1971 1980 1982 ...
## $ brand : Factor w/ 3 levels " Europe."," Japan.",..: 3 1 3 3 3 3 3 3 1 2 ...
Here I create a normalize function to use with the data.
normalize <- function(x) {
return ((x-min(x)) / (max(x) - min(x)))
}
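A quick sanity check of the function on a made-up vector (values chosen only for illustration): the smallest value should map to 0 and the largest to 1.
# min-max scaling: (x - min) / (max - min)
normalize(c(10, 20, 40))
## [1] 0.0000000 0.3333333 1.0000000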
I take a look at the data prior to normalizing it, to see whether there are variables I want to compare against each other.
(Exploratory plot not reproduced in this text version; the only output captured here was the plot's theme object, which rotates the x-axis labels 90 degrees with element_text(angle = 90, hjust = 1).)
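A sketch of the kind of pairwise look described here, assuming GGally::ggpairs (which I mention liking later on) and the rotated x-axis labels from that theme:
library(GGally)  # assumed; ggpairs is referenced later in the write-up
# pairwise plot of the raw variables, coloured by brand,
# with x-axis labels rotated 90 degrees as in the theme object above
ggpairs(cars, columns = 1:7, mapping = aes(color = brand)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))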
# I create a variable to hold all the columns except the target
features <- select(cars, -brand)
I normalize the data (minus the target). However, it is a little concerning how unevenly the target data is distributed between the three brands.
features <- as.data.frame(lapply(features, normalize))
str(features)
## 'data.frame': 261 obs. of 7 variables:
## $ mpg : num 0.109 0.598 0.191 0.137 0.56 ...
## $ cylinders : num 1 0.2 1 1 0.2 1 1 1 0.4 0.2 ...
## $ cubicinches: num 0.7287 0.0543 0.6047 0.8579 0.0775 ...
## $ hp : num 0.6467 0.1359 0.5109 0.5652 0.0924 ...
## $ weightlbs : num 0.7671 0.0922 0.5426 0.6348 0.1294 ...
## $ time.to.60 : num 0.235 0.353 0.176 0.118 0.529 ...
## $ year : num 0.0833 0.75 0 0 0.5833 ...
summary(cars$brand)
## Europe. Japan. US.
## 48 51 162
Create a predictor set with various values of k using the normalized feature set. I am running more than necessary, but I want to see the differences between numbers of neighbors.
cars_with_pred <- cars %>%
mutate(
knn1 = knn.cv(features, cars$brand, k=1),
knn2 = knn.cv(features, cars$brand, k=2),
knn3 = knn.cv(features, cars$brand, k=3),
knn5 = knn.cv(features, cars$brand, k=5),
knn7 = knn.cv(features, cars$brand, k=7)
)
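Since the only thing changing between those mutate() calls is k, the same comparison can also be summarised in one pass. A small sketch (leave-one-out accuracy per k; note that knn.cv breaks ties at random, so the numbers can wobble slightly between runs):
# leave-one-out accuracy for several k values on the normalized features
k_values <- c(1, 2, 3, 5, 7)
sapply(k_values, function(k) {
  pred <- knn.cv(features, cars$brand, k = k)
  mean(pred == cars$brand)
})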
Before I start running the tables, I decided for kicks to run the same tables on the non-normalized data as well, so I could see the difference, especially since there is such a disparity in the data set with regard to the size of the US group. If I understand correctly, you normally wouldn't want such a disparity in the target variable (Europe = 48, Japan = 51, US = 162). Perhaps it doesn't matter because the features are normalized to 0 and 1, but if I understand the concept right it would be better if the target variable's factor levels were closer in size.
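One way to put that imbalance in perspective is the majority-class baseline: always predicting US would already be right 162 times out of 261, roughly 62%, so the accuracies below should be read against that floor rather than against zero. A quick check:
# class proportions and the accuracy of always guessing the largest class
prop.table(table(cars$brand))
max(prop.table(table(cars$brand)))  # about 0.62: the "always US" baseline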
# variable that is not normalized to see the difference in tables
features_x <- as.data.frame(cars[,c(1:7)])
str(features_x)
## 'data.frame': 261 obs. of 7 variables:
## $ mpg : num 14 31.9 17 15 30.5 23 13 14 25.4 37.7 ...
## $ cylinders : int 8 4 8 8 4 8 8 8 5 4 ...
## $ cubicinches: int 350 89 302 400 98 350 351 440 183 89 ...
## $ hp : int 165 71 140 150 63 125 158 215 77 62 ...
## $ weightlbs : int 4209 1925 3449 3761 2051 3900 4363 4312 3530 2050 ...
## $ time.to.60 : int 12 14 11 10 17 17 13 9 20 17 ...
## $ year : int 1972 1980 1971 1971 1978 1980 1974 1971 1980 1982 ...
summary(cars$brand)
## Europe. Japan. US.
## 48 51 162
As you will see below, I wanted to create a predictor set with the non-normalized data and the same values of k, so I could see whether there were any distinct differences between running the data not normalized as compared to normalized.
# create predictor set that is NOT normalized w/ various k=
cars_with_pred_x <- cars %>%
mutate(
knn1 = knn.cv(features_x, cars$brand, k=1),
knn2 = knn.cv(features_x, cars$brand, k=2),
knn3 = knn.cv(features_x, cars$brand, k=3),
knn5 = knn.cv(features_x, cars$brand, k=5),
knn7 = knn.cv(features_x, cars$brand, k=7)
)
Tables run with all variables (normalized and not normalized, for comparison) stacked on top of each other to more easily see the differences.
The first run using knn1 captured 72% of the Europe/Japan/US brands correctly, while the non-normalized set captured 78%. If you look at the counts for the non-normalized set, it produced different counts of matches by brand but differed more in how it classified the misses. The non-normalized set actually matched a little better in the first three runs of knn by brand, which I believe is because of the impact of the range (min-max) of the non-normalized values. In the first three runs (knn1, knn2, knn3), where I compared knn using both the normalized and the non-normalized values, the non-normalized set was able to pick up more correct brand-to-brand matches, both by percent and by count. However, as more neighbors were introduced, that gap shrank and then turned in favor of the normalized data.
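The table-then-accuracy pattern below repeats for every k; a small helper along these lines (hypothetical, not part of the original chunks) would capture it in one place:
# builds the confusion table and overall accuracy for one set of predictions
confusion_with_accuracy <- function(actual, predicted) {
  tab <- table(actual, predicted)
  list(table = tab, accuracy = sum(diag(tab)) / sum(tab))
}
# e.g. confusion_with_accuracy(cars_with_pred$brand, cars_with_pred$knn1)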
# tables run with all variables(normalized)
tab1 <- table(cars_with_pred$brand, cars_with_pred$knn1)
sum(diag(tab1)) / sum(tab1)
## [1] 0.7203065
tab1
##
## Europe. Japan. US.
## Europe. 23 14 11
## Japan. 7 24 20
## US. 10 11 141
# tables run with all variables(NOT normalized)
tab1x <- table(cars_with_pred_x$brand, cars_with_pred_x$knn1)
sum(diag(tab1x)) / sum(tab1x)
## [1] 0.7816092
tab1x
##
## Europe. Japan. US.
## Europe. 24 11 13
## Japan. 6 37 8
## US. 7 12 143
# Normalized
tab2 <- table(cars_with_pred$brand, cars_with_pred$knn2)
sum(diag(tab2)) / sum(tab2)
## [1] 0.7049808
tab2
##
## Europe. Japan. US.
## Europe. 20 16 12
## Japan. 9 28 14
## US. 10 16 136
# NOT normalized
tab2x <- table(cars_with_pred_x$brand, cars_with_pred_x$knn2)
sum(diag(tab2x)) / sum(tab2x)
## [1] 0.7203065
tab2x
##
## Europe. Japan. US.
## Europe. 22 9 17
## Japan. 9 26 16
## US. 8 14 140
# Normalized
tab3 <- table(cars_with_pred$brand, cars_with_pred$knn3)
sum(diag(tab3)) / sum(tab3)
## [1] 0.7241379
tab3
##
## Europe. Japan. US.
## Europe. 28 12 8
## Japan. 9 22 20
## US. 8 15 139
# NOT Normalized
tab3x <- table(cars_with_pred_x$brand, cars_with_pred_x$knn3)
sum(diag(tab3x)) / sum(tab3x)
## [1] 0.7509579
tab3x
##
## Europe. Japan. US.
## Europe. 22 13 13
## Japan. 5 31 15
## US. 5 14 143
This is where the brands start to be matched more accurately by the normalized set. It did a much better job matching the European cars and stayed almost the same for the US and Japan cars.
# Normalized
tab5 <- table(cars_with_pred$brand, cars_with_pred$knn5)
sum(diag(tab5)) / sum(tab5)
## [1] 0.7509579
tab5
##
## Europe. Japan. US.
## Europe. 29 10 9
## Japan. 11 26 14
## US. 8 13 141
#NOT Normalized
tab5x <- table(cars_with_pred_x$brand, cars_with_pred_x$knn5)
sum(diag(tab5x)) / sum(tab5x)
## [1] 0.7164751
tab5x
##
## Europe. Japan. US.
## Europe. 17 16 15
## Japan. 8 30 13
## US. 9 13 140
Comparing the normalized knn with the non-normalized knn was merely to see for myself how not normalizing can distort the results. knn5 and knn7 had better matches, which makes sense as they encompass more points, and it is obvious why the US cars match better, since there are so many more of them to feed into the model.
# Normalized
tab7 <- table(cars_with_pred$brand, cars_with_pred$knn7)
sum(diag(tab7)) / sum(tab7)
## [1] 0.7547893
tab7
##
## Europe. Japan. US.
## Europe. 24 16 8
## Japan. 11 29 11
## US. 7 11 144
# NOT Normalized
tab7x <- table(cars_with_pred_x$brand, cars_with_pred_x$knn7)
sum(diag(tab7x)) / sum(tab7x)
## [1] 0.7126437
tab7x
##
## Europe. Japan. US.
## Europe. 19 11 18
## Japan. 10 26 15
## US. 9 12 141
Looking at the top three (optimal, percent-wise) variables against the target variable with k=3. I really like ggpairs, as it gives a much better visual of the data as a whole. Originally I piped this same graph through plotly to make it interactive, but I decided to use the standard version because, for some reason, the plotly version would not scale to the larger figure size needed to see the results.
You can see in all the graphs below how much the Europe and Japan brands overlap each other. This is because the style of car the US prefers is much different from Europe and Japan, which favor smaller cars with higher mpg because of high fuel prices. The US, as is the case with everything, believes bigger is better and makes gas-guzzling monsters that are larger and more rugged.
Looking at the top three (optimal, percent-wise) variables against the target variable with k=5.
Looking at the top three (optimal, percent-wise) variables against the target variable with k=7.
Re-run graphs minus year variable -
I don’t believe the “year” variable adds anything to the data. US cars are clearly established outliers in comparison to the Europe and Japan cars, and the year variable is so overlapped across the different car brands that it makes almost no difference. Admittedly, if it is the same across the board it probably does not matter whether it stays or goes, but I figured I would give it a try so I could see what the data looked like without the “year” variable. From what I could see, removing the year variable changed the distribution of mpg, which went from very peaked (Europe/Japan) to a more rounded peak, while the US went from a low-lying rounded peak to a rounded peak more in line with the other two brands. Possibly this is because removing the year variable evened the distribution out over the span of time?
A run looking at knn5 without year
A run looking at knn7 without year. Removing year and doing a second set of graphs was merely to satisfy my own curiosity about its impact on the result. When running knn7 there was virtually no change in the result when removing year, I assume because it was bolstered by more neighbors in the model.
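The without-year runs are only shown as graphs; a sketch of what the underlying fit might look like, assuming the same normalized features object and using k = 7 as the example:
# drop year from the normalized features and refit, here with k = 7
features_noyear <- select(features, -year)
knn7_noyear <- knn.cv(features_noyear, cars$brand, k = 7)
mean(knn7_noyear == cars$brand)   # accuracy without year
table(cars$brand, knn7_noyear)    # confusion table without year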
I decided to make the next graphs long for easier visualization and comparison across the different knn runs.
cars_with_pred_long <- cars_with_pred %>%
tidyr::pivot_longer(starts_with("knn"), names_to = "knn", values_to = "pred_brand")
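As a quick sanity check on the reshape: each car should now appear once per value of k, so the long data should have 261 * 5 = 1305 rows, with the k label in knn and the prediction in pred_brand.
nrow(cars_with_pred_long)   # 261 cars x 5 k values = 1305
head(select(cars_with_pred_long, brand, knn, pred_brand))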
mpg vs time -
Here I start to graph the variables against each other. Based on the plots above and the clustering, I thought these three comparisons would yield the best results. I do not completely trust the data because of the size of the US group compared with the Japan and Europe cars; it too heavily favors the US. There is also the fact that Japan and Europe cars are much more likely to be confused with each other, because the style of car used in the US is so different. That is why, in the graphs above, the Japan and Europe points overlap each other so much.
mpg_vs_time - You can see below that the data does not differ very much at all, just minor changes in the matching for brands as you start to add more neighbors to the model.
mpg_vs_time <- ggplot(cars_with_pred_long, aes(mpg, `time.to.60`, color=pred_brand, shape=brand)) +
geom_point(size=2) + facet_grid(. ~ knn) +
theme_bw()
mpg_vs_time
hp vs cubic -
Below has the same issue, except that when you get to the larger k=5 and k=7 the clusters get tighter and the brands appear less spread out between Europe and Japan.
hp_vs_cubicinches <- ggplot(cars_with_pred_long, aes(hp, `cubicinches`, color=pred_brand, shape=brand)) +
geom_point(size=2) + facet_grid(. ~ knn) +
theme_bw()
hp_vs_cubicinches
mpg vs weight -
In these comparisons you can see that the data is virtually the same no matter which k is used.
mpg_vs_weight <- ggplot(cars_with_pred_long, aes(mpg, `weightlbs`, color=pred_brand, shape=brand)) +
geom_point(size=2) + facet_grid(. ~ knn) +
theme_bw()
mpg_vs_weight
See all three as a comparison by knn. It is a little harder to see how the brands fan out this way, but I think seeing all the data together like this is very useful. Overall, I think the knn would have performed better on a larger, more evenly distributed data set, but you can see enough to know that the higher values of k achieved better results.
mpg_vs_time / hp_vs_cubicinches / mpg_vs_weight