Answer the questions below and submit your completed work as a knitted docx file.
library(plyr) # Important that this one come first.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## arrange(): dplyr, plyr
## compact(): purrr, plyr
## count(): dplyr, plyr
## failwith(): dplyr, plyr
## filter(): dplyr, stats
## id(): dplyr, plyr
## lag(): dplyr, stats
## mutate(): dplyr, plyr
## rename(): dplyr, plyr
## summarise(): dplyr, plyr
## summarize(): dplyr, plyr
library(vcd)
## Loading required package: grid
library(gmodels)
# Replace this command with something that works on your computer.
load("E:/Download/cdc.rdata")
glimpse(cdc)
## Observations: 20,000
## Variables: 9
## $ genhlth <fctr> good, good, good, good, very good, very good, very g...
## $ exerany <dbl> 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,...
## $ hlthplan <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,...
## $ smoke100 <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,...
## $ height <dbl> 70, 64, 60, 66, 61, 64, 71, 67, 65, 70, 69, 69, 66, 7...
## $ weight <int> 175, 125, 105, 132, 150, 114, 194, 170, 150, 180, 186...
## $ wtdesire <int> 175, 115, 105, 124, 130, 114, 185, 160, 130, 170, 175...
## $ age <int> 77, 33, 49, 42, 55, 55, 31, 45, 27, 44, 46, 62, 21, 6...
## $ gender <fctr> m, f, f, f, f, f, m, m, f, m, m, m, m, m, m, m, m, m...
We’ll use the cdc dataset for this exercise. There are several categorical variables in this dataset. We’ll start off with exerany (Do you exercise?) and smoke100 (Have you smoked 100 cigarettes?). Both of these variables use 0/1 encoding, which makes tables and graphs hard to interpret. Recode the variables and save the results in new variables smoker and exerciser. Make the values “Smoker”, “Non-smoker”, “Exerciser”, and “Couch Potato”. Produce tables comparing the original variables with the new ones to make sure you have everything right.
hlthCat = as.character(cdc$exerany)
hlthCat[cdc$exerany == 0] = "Couch Potato"
hlthCat[cdc$exerany == 1] = "Exerciser"
hlthCat[is.na(cdc$exerany)] = "Unknown"
table(hlthCat)
## hlthCat
## Couch Potato Exerciser
## 5086 14914
table(cdc$exerany)
##
## 0 1
## 5086 14914
smkCat = as.character(cdc$smoke100)
smkCat[cdc$smoke100 == 0] = "Non-smoker"
smkCat[cdc$smoke100 == 1] = "Smoker"
smkCat[is.na(cdc$smoke100)] = "Unknown"
table(smkCat)
## smkCat
## Non-smoker Smoker
## 10559 9441
table(cdc$smoke100)
##
## 0 1
## 10559 9441
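The matching counts above suggest the recoding worked, but a two-way table of the original against the recoded variable checks it directly: every observation should land on the diagonal. The following is a minimal sketch, assuming a small made-up data frame `toy` in place of cdc, and using `factor()` with explicit labels as an alternative to the character-vector approach. The names `toy`, `check_exer`, and `check_smoke` are illustrative only.

```r
# Hypothetical stand-in for cdc (the real data has 20,000 rows).
toy <- data.frame(
  exerany  = c(0, 1, 1, 0, 1),
  smoke100 = c(1, 0, 1, 0, 0)
)

# Recode 0/1 into labelled factors; levels and labels are matched by position.
exerciser <- factor(toy$exerany,  levels = c(0, 1),
                    labels = c("Couch Potato", "Exerciser"))
smoker    <- factor(toy$smoke100, levels = c(0, 1),
                    labels = c("Non-smoker", "Smoker"))

# Cross-tabulate original against recoded values.
# A correct recoding puts every count on the diagonal.
check_exer  <- table(original = toy$exerany,  recoded = exerciser)
check_smoke <- table(original = toy$smoke100, recoded = smoker)

print(check_exer)
print(check_smoke)
```

With the real data, the same check would be `table(cdc$exerany, hlthCat)` and `table(cdc$smoke100, smkCat)`.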
Use CrossTable to examine the relationship between exercise and smoking in the cdc dataset. What can you say about the relationship between these two variables?
CrossTable(hlthCat,smkCat)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 20000
##
##
## | smkCat
## hlthCat | Non-smoker | Smoker | Row Total |
## -------------|------------|------------|------------|
## Couch Potato | 2543 | 2543 | 5086 |
## | 7.526 | 8.417 | |
## | 0.500 | 0.500 | 0.254 |
## | 0.241 | 0.269 | |
## | 0.127 | 0.127 | |
## -------------|------------|------------|------------|
## Exerciser | 8016 | 6898 | 14914 |
## | 2.566 | 2.870 | |
## | 0.537 | 0.463 | 0.746 |
## | 0.759 | 0.731 | |
## | 0.401 | 0.345 | |
## -------------|------------|------------|------------|
## Column Total | 10559 | 9441 | 20000 |
## | 0.528 | 0.472 | |
## -------------|------------|------------|------------|
##
##
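The chi-square contributions printed in each cell can be summed into a single test of independence. The exercise does not ask for this, so treat it as an optional sketch: it rebuilds the 2x2 table from the counts printed above and runs `chisq.test()` without the continuity correction, so that the statistic equals the sum of the cell contributions. The object names `tab`, `test`, and `row_props` are illustrative.

```r
# Counts copied from the CrossTable output above (filled column by column).
tab <- matrix(
  c(2543, 8016,   # Non-smoker: Couch Potato, Exerciser
    2543, 6898),  # Smoker:     Couch Potato, Exerciser
  nrow = 2,
  dimnames = list(
    exercise = c("Couch Potato", "Exerciser"),
    smoking  = c("Non-smoker", "Smoker")
  )
)

# Pearson chi-square test of independence, no continuity correction.
test <- chisq.test(tab, correct = FALSE)
print(test)

# Row proportions, matching the "N / Row Total" lines in CrossTable.
row_props <- prop.table(tab, margin = 1)
print(round(row_props, 3))
```

A small p-value here says the association is unlikely to be chance alone; with 20,000 observations even a modest difference in proportions (50.0% versus 53.7% non-smokers) will register.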
The largest group is non-smokers who exercise (8016 people, about 40% of the sample). Couch potatoes are split evenly between non-smokers and smokers (2543 each), whereas exercisers lean toward non-smoking (53.7% versus 46.3%).

## Problem 3

Produce a mosaic plot to show the relationship between exercise and smoking in the cdc dataset. What can you say about the relationship between these two variables?
mosaicplot(table(hlthCat,smkCat))
Far more people exercise than not (14914 versus 5086). Among exercisers, non-smokers slightly outnumber smokers, while couch potatoes are split evenly between smokers and non-smokers.

## Problem 4

Use CrossTable and mosaic to examine the relationship between gender and smoking in the cdc dataset. What can you say?
CrossTable(cdc$gender, smkCat)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 20000
##
##
## | smkCat
## cdc$gender | Non-smoker | Smoker | Row Total |
## -------------|------------|------------|------------|
## m | 4547 | 5022 | 9569 |
## | 50.471 | 56.448 | |
## | 0.475 | 0.525 | 0.478 |
## | 0.431 | 0.532 | |
## | 0.227 | 0.251 | |
## -------------|------------|------------|------------|
## f | 6012 | 4419 | 10431 |
## | 46.300 | 51.783 | |
## | 0.576 | 0.424 | 0.522 |
## | 0.569 | 0.468 | |
## | 0.301 | 0.221 | |
## -------------|------------|------------|------------|
## Column Total | 10559 | 9441 | 20000 |
## | 0.528 | 0.472 | |
## -------------|------------|------------|------------|
##
##
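Problem 4 also asks for a mosaic plot, which the code above does not produce. With the data loaded, `mosaicplot(table(cdc$gender, smkCat))` would follow the same pattern as Problem 3. The sketch below is self-contained: it rebuilds the table from the counts printed above rather than from cdc, and the `shade = TRUE` argument is an optional choice that colours cells by their residuals under independence. The names `tab` and `row_props` are illustrative.

```r
# Counts copied from the CrossTable output above (filled column by column).
tab <- as.table(matrix(
  c(4547, 6012,   # Non-smoker: m, f
    5022, 4419),  # Smoker:     m, f
  nrow = 2,
  dimnames = list(
    gender = c("m", "f"),
    smoker = c("Non-smoker", "Smoker")
  )
))

# Row proportions: share of non-smokers and smokers within each gender.
row_props <- prop.table(tab, margin = 1)
print(round(row_props, 3))

# Mosaic plot; shading highlights cells that depart from independence.
mosaicplot(tab, main = "Smoking by gender", shade = TRUE)
```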
Among females, non-smokers outnumber smokers (57.6% versus 42.4%), but among males smokers slightly outnumber non-smokers (52.5% versus 47.5%).

## Problem 5

Use CrossTable and mosaic to examine the relationship between gender and genhlth in the cdc dataset. What can you say?
CrossTable(cdc$gender, cdc$genhlth)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 20000
##
##
## | cdc$genhlth
## cdc$gender | excellent | very good | good | fair | poor | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## m | 2298 | 3382 | 2722 | 884 | 283 | 9569 |
## | 2.190 | 0.641 | 0.017 | 6.959 | 5.167 | |
## | 0.240 | 0.353 | 0.284 | 0.092 | 0.030 | 0.478 |
## | 0.493 | 0.485 | 0.480 | 0.438 | 0.418 | |
## | 0.115 | 0.169 | 0.136 | 0.044 | 0.014 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## f | 2359 | 3590 | 2953 | 1135 | 394 | 10431 |
## | 2.009 | 0.588 | 0.016 | 6.384 | 4.740 | |
## | 0.226 | 0.344 | 0.283 | 0.109 | 0.038 | 0.522 |
## | 0.507 | 0.515 | 0.520 | 0.562 | 0.582 | |
## | 0.118 | 0.179 | 0.148 | 0.057 | 0.020 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 4657 | 6972 | 5675 | 2019 | 677 | 20000 |
## | 0.233 | 0.349 | 0.284 | 0.101 | 0.034 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
mosaicplot(table(cdc$gender, cdc$genhlth))
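With five health categories, the differences between genders are easier to read as percentages within each gender than as raw counts. The following sketch rebuilds the table from the counts printed above (not from cdc itself) and computes the row percentages plus the combined share reporting fair or poor health. The names `tab`, `row_pct`, and `fair_or_poor` are illustrative.

```r
# Counts copied from the CrossTable output above (filled column by column).
tab <- matrix(
  c(2298, 2359,   # excellent: m, f
    3382, 3590,   # very good
    2722, 2953,   # good
     884, 1135,   # fair
     283,  394),  # poor
  nrow = 2,
  dimnames = list(
    gender  = c("m", "f"),
    genhlth = c("excellent", "very good", "good", "fair", "poor")
  )
)

# Percentage of each gender falling in each health category.
row_pct <- round(100 * prop.table(tab, margin = 1), 1)
print(row_pct)

# Combined share of each gender reporting fair or poor health.
fair_or_poor <- rowSums(tab[, c("fair", "poor")]) / rowSums(tab)
print(round(fair_or_poor, 3))
```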
Most respondents of both genders report excellent, very good, or good health. Females have a slightly higher share of fair and poor ratings than males (10.9% versus 9.2% fair, 3.8% versus 3.0% poor). The largest genhlth category for both males and females is very good.

## Problem 6

CrossTable can only handle two variables, but mosaic can handle three easily. Use mosaic to examine the relationships among smoking, gender and general health. Try different orderings of the variables to get a readable plot. What can you say?
mosaicplot(table(smkCat, cdc$gender, cdc$genhlth))
mosaicplot(table(cdc$gender, cdc$genhlth, smkCat)) # Most readable of the three orderings
mosaicplot(table(cdc$genhlth, cdc$gender, smkCat))
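A three-way mosaic can be hard to read precisely, so a flat table of the same counts is a useful companion. The three-way counts are not printed in this document, so the sketch below uses randomly generated stand-in data, and its numbers say nothing about cdc. It only shows the mechanics: naming the table dimensions with `dnn`, flattening with `ftable()`, and passing the same table to `mosaicplot()`. The names `toy`, `tab3`, and `flat` are illustrative.

```r
set.seed(1)
n <- 200

# Randomly generated stand-in data; NOT the cdc values.
health_levels <- c("excellent", "very good", "good", "fair", "poor")
toy <- data.frame(
  gender  = factor(sample(c("m", "f"), n, replace = TRUE),
                   levels = c("m", "f")),
  genhlth = factor(sample(health_levels, n, replace = TRUE),
                   levels = health_levels),
  smoker  = factor(sample(c("Non-smoker", "Smoker"), n, replace = TRUE),
                   levels = c("Non-smoker", "Smoker"))
)

# Three-way table with named dimensions, in the gender/genhlth/smoker order.
tab3 <- table(toy$gender, toy$genhlth, toy$smoker,
              dnn = c("gender", "genhlth", "smoker"))

# Flat view: one row per gender/health combination, one column per smoking status.
flat <- ftable(tab3, row.vars = c("gender", "genhlth"))
print(flat)

# Same table as a mosaic; las = 2 turns the axis labels to reduce overlap.
mosaicplot(tab3, main = "Gender, general health, smoking (toy data)", las = 2)
```

With the real data, the equivalent call would be `ftable(table(cdc$gender, cdc$genhlth, smkCat))`.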
Comparing males to females, the split between non-smokers and smokers is broadly similar. Females have a slightly higher share of non-smokers than smokers.