This assignment is designed as a follow-up to the logistic regression model developed in A5. We wiil use two other supervised classification models and one unsupervised model. All three models were explained in the Datacamp course “Introduction to Machine Learning.” You should refer to that course as a guide to the problems in this assignment.
Load all of the packages you need here.
Load or replicate the train and test dataframes from A5. Run summaries on both of these to verify success.
# Place your code here.
load("~/Dropbox/RProjects/Math 146 Notes/cdc.Rdata")
cdc$BMI = (cdc$weight*703)/(cdc$height)^2
cdc$BMIDes = (cdc$wtdesire*703)/(cdc$height)^2
cdc$DesActRatio = cdc$BMIDes/cdc$BMI
Rejects = cdc$BMIDes < 10 |
(cdc$BMIDes > 50 & cdc$DesActRatio > 1.0)
cdc <- cdc[!Rejects,]
# Splitting starts here.
# Place your code here.
nrows = nrow(cdc)
set.seed(123)
scramble = sample(1:nrows,size=nrows)
split = round(nrows*.7)
traini = scramble[1:split]
testi = scramble[(split+1):nrows]
length(traini)
## [1] 13998
length(testi)
## [1] 5999
train = cdc[traini,]
test = cdc[testi,]
str(train)
## 'data.frame': 13998 obs. of 12 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 1 1 3 1 3 1 3 3 2 2 ...
## $ exerany : num 1 1 1 1 1 1 1 0 1 1 ...
## $ hlthplan : num 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100 : num 1 0 0 1 0 0 0 1 0 0 ...
## $ height : num 71 71 69 65 67 60 63 65 62 61 ...
## $ weight : int 155 225 145 168 165 103 156 130 120 134 ...
## $ wtdesire : int 160 200 145 125 135 103 140 120 120 125 ...
## $ age : int 32 39 18 29 35 33 80 39 60 37 ...
## $ gender : Factor w/ 2 levels "m","f": 1 1 1 2 2 2 2 2 2 2 ...
## $ BMI : num 21.6 31.4 21.4 28 25.8 ...
## $ BMIDes : num 22.3 27.9 21.4 20.8 21.1 ...
## $ DesActRatio: num 1.032 0.889 1 0.744 0.818 ...
str(test)
## 'data.frame': 5999 obs. of 12 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 5 2 2 2 3 2 1 3 3 3 ...
## $ exerany : num 1 1 0 1 1 1 1 1 1 1 ...
## $ hlthplan : num 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100 : num 1 0 1 1 0 0 0 1 0 0 ...
## $ height : num 72 67 72 74 62 68 71 63 76 68 ...
## $ weight : int 400 120 215 250 115 175 180 125 160 159 ...
## $ wtdesire : int 190 120 180 230 115 160 170 125 160 150 ...
## $ age : int 50 25 45 53 18 70 61 72 18 29 ...
## $ gender : Factor w/ 2 levels "m","f": 1 2 1 1 2 2 1 2 1 2 ...
## $ BMI : num 54.2 18.8 29.2 32.1 21 ...
## $ BMIDes : num 25.8 18.8 24.4 29.5 21 ...
## $ DesActRatio: num 0.475 1 0.837 0.92 1 ...
Make a copy of train, train_km without the gender variable. Use kmeans() to create 2 subsets of train_clean.
Create a table to compare the class variable in the kmeans() dataframe with the gender variable in train. Does it appear that gender was very important in the definitions of the subsets created by kmeans()? Note that this is very similar to the interpretation of a confusion matrix from a supervised model.
# First we need to deal with scaling
train_km = select(train,-gender)
scaler = function(x){
maxv = max(x)
minv = min(x)
scaled = (x - min(x))/(maxv - minv)
}
train_km %>%
mutate(height = scaler(height),
weight = scaler(weight),
age = scaler(age),
DesActRatio = scaler(DesActRatio),
BMI = scaler(BMI),
BMIDes = scaler(BMIDes),
wtdesire = scaler(wtdesire)) -> train_km
# Convert genhlth to a collection of dummies
train_km =
dummy_cols(train_km,
select_columns = "genhlth",
remove_first_dummy = TRUE)
train_km = select(train_km,-genhlth)
glimpse(train_km)
## Observations: 13,998
## Variables: 14
## $ exerany <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,…
## $ hlthplan <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,…
## $ smoke100 <dbl> 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1,…
## $ height <dbl> 0.6388889, 0.6388889, 0.5833333, 0.4722222, …
## $ weight <dbl> 0.20138889, 0.36342593, 0.17824074, 0.231481…
## $ wtdesire <dbl> 0.32624113, 0.46808511, 0.27304965, 0.202127…
## $ age <dbl> 0.17283951, 0.25925926, 0.00000000, 0.135802…
## $ BMI <dbl> 0.1518414, 0.3126901, 0.1484581, 0.2562710, …
## $ BMIDes <dbl> 0.2672299, 0.3928019, 0.2469111, 0.2331434, …
## $ DesActRatio <dbl> 0.4499724, 0.3657079, 0.4310129, 0.2805782, …
## $ genhlth_good <int> 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ `genhlth_very good` <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,…
## $ genhlth_poor <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ genhlth_fair <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
k2 = kmeans(train_km,centers=2,nstart=25)
str(k2)
## List of 9
## $ cluster : int [1:13998] 1 1 1 1 1 1 1 1 2 2 ...
## $ centers : num [1:2, 1:14] 0.711 0.804 0.867 0.897 0.48 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:14] "exerany" "hlthplan" "smoke100" "height" ...
## $ totss : num 16765
## $ withinss : num [1:2] 9991 2866
## $ tot.withinss: num 12857
## $ betweenss : num 3908
## $ size : int [1:2] 9113 4885
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
table(k2$cluster,train$gender)
##
## m f
## 1 4321 4792
## 2 2375 2510