KMeans Example

This assignment is designed as a follow-up to the logistic regression model developed in A5. We wiil use two other supervised classification models and one unsupervised model. All three models were explained in the Datacamp course “Introduction to Machine Learning.” You should refer to that course as a guide to the problems in this assignment.

Setup

Load all of the packages you need here.

Problem 1

Load or replicate the train and test dataframes from A5. Run summaries on both of these to verify success.

# Place your code here. 
load("~/Dropbox/RProjects/Math 146 Notes/cdc.Rdata")

cdc$BMI = (cdc$weight*703)/(cdc$height)^2
cdc$BMIDes = (cdc$wtdesire*703)/(cdc$height)^2
cdc$DesActRatio = cdc$BMIDes/cdc$BMI

Rejects = cdc$BMIDes < 10 | 
   (cdc$BMIDes > 50 & cdc$DesActRatio > 1.0) 
cdc <- cdc[!Rejects,] 

# Splitting starts here.

# Place your code here.
nrows = nrow(cdc)
set.seed(123)
scramble = sample(1:nrows,size=nrows)
split = round(nrows*.7)
traini = scramble[1:split]
testi = scramble[(split+1):nrows]
length(traini)

## [1] 13998

length(testi)

## [1] 5999

train = cdc[traini,]
test = cdc[testi,]

str(train)

## 'data.frame':    13998 obs. of  12 variables:
##  $ genhlth    : Factor w/ 5 levels "excellent","very good",..: 1 1 3 1 3 1 3 3 2 2 ...
##  $ exerany    : num  1 1 1 1 1 1 1 0 1 1 ...
##  $ hlthplan   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100   : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ height     : num  71 71 69 65 67 60 63 65 62 61 ...
##  $ weight     : int  155 225 145 168 165 103 156 130 120 134 ...
##  $ wtdesire   : int  160 200 145 125 135 103 140 120 120 125 ...
##  $ age        : int  32 39 18 29 35 33 80 39 60 37 ...
##  $ gender     : Factor w/ 2 levels "m","f": 1 1 1 2 2 2 2 2 2 2 ...
##  $ BMI        : num  21.6 31.4 21.4 28 25.8 ...
##  $ BMIDes     : num  22.3 27.9 21.4 20.8 21.1 ...
##  $ DesActRatio: num  1.032 0.889 1 0.744 0.818 ...

str(test)

## 'data.frame':    5999 obs. of  12 variables:
##  $ genhlth    : Factor w/ 5 levels "excellent","very good",..: 5 2 2 2 3 2 1 3 3 3 ...
##  $ exerany    : num  1 1 0 1 1 1 1 1 1 1 ...
##  $ hlthplan   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100   : num  1 0 1 1 0 0 0 1 0 0 ...
##  $ height     : num  72 67 72 74 62 68 71 63 76 68 ...
##  $ weight     : int  400 120 215 250 115 175 180 125 160 159 ...
##  $ wtdesire   : int  190 120 180 230 115 160 170 125 160 150 ...
##  $ age        : int  50 25 45 53 18 70 61 72 18 29 ...
##  $ gender     : Factor w/ 2 levels "m","f": 1 2 1 1 2 2 1 2 1 2 ...
##  $ BMI        : num  54.2 18.8 29.2 32.1 21 ...
##  $ BMIDes     : num  25.8 18.8 24.4 29.5 21 ...
##  $ DesActRatio: num  0.475 1 0.837 0.92 1 ...

Problem 2

Make a copy of train, train_km without the gender variable. Use kmeans() to create 2 subsets of train_clean.

Create a table to compare the class variable in the kmeans() dataframe with the gender variable in train. Does it appear that gender was very important in the definitions of the subsets created by kmeans()? Note that this is very similar to the interpretation of a confusion matrix from a supervised model.

# First we need to deal with scaling
train_km = select(train,-gender)
scaler = function(x){
  maxv = max(x)
  minv = min(x)
  scaled = (x - min(x))/(maxv - minv)
}

train_km %>% 
  mutate(height = scaler(height),
         weight = scaler(weight),
         age = scaler(age),
         DesActRatio = scaler(DesActRatio),
         BMI = scaler(BMI),
         BMIDes = scaler(BMIDes),
         wtdesire = scaler(wtdesire)) -> train_km

# Convert genhlth to a collection of dummies
train_km = 
  dummy_cols(train_km, 
             select_columns = "genhlth",
             remove_first_dummy = TRUE)
train_km = select(train_km,-genhlth)
glimpse(train_km)

## Observations: 13,998
## Variables: 14
## $ exerany             <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,…
## $ hlthplan            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,…
## $ smoke100            <dbl> 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1,…
## $ height              <dbl> 0.6388889, 0.6388889, 0.5833333, 0.4722222, …
## $ weight              <dbl> 0.20138889, 0.36342593, 0.17824074, 0.231481…
## $ wtdesire            <dbl> 0.32624113, 0.46808511, 0.27304965, 0.202127…
## $ age                 <dbl> 0.17283951, 0.25925926, 0.00000000, 0.135802…
## $ BMI                 <dbl> 0.1518414, 0.3126901, 0.1484581, 0.2562710, …
## $ BMIDes              <dbl> 0.2672299, 0.3928019, 0.2469111, 0.2331434, …
## $ DesActRatio         <dbl> 0.4499724, 0.3657079, 0.4310129, 0.2805782, …
## $ genhlth_good        <int> 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ `genhlth_very good` <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,…
## $ genhlth_poor        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ genhlth_fair        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

k2 = kmeans(train_km,centers=2,nstart=25)
str(k2)

## List of 9
##  $ cluster     : int [1:13998] 1 1 1 1 1 1 1 1 2 2 ...
##  $ centers     : num [1:2, 1:14] 0.711 0.804 0.867 0.897 0.48 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "1" "2"
##   .. ..$ : chr [1:14] "exerany" "hlthplan" "smoke100" "height" ...
##  $ totss       : num 16765
##  $ withinss    : num [1:2] 9991 2866
##  $ tot.withinss: num 12857
##  $ betweenss   : num 3908
##  $ size        : int [1:2] 9113 4885
##  $ iter        : int 1
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

table(k2$cluster,train$gender)

##    
##        m    f
##   1 4321 4792
##   2 2375 2510

KMeans Example

Nelson

April 20, 2019

Setup

Problem 1

Problem 2