Exercise 1: How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).
There are 20000 cases and 9 variables, as the str() output reports. genhlth and gender are categorical (genhlth is ordinal); exerany, hlthplan and smoke100 are categorical, coded 0/1; height, weight, wtdesire and age are numerical, and since weight, wtdesire and age are stored as integers they can be treated as discrete.
source("http://www.openintro.org/stat/data/cdc.R")
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
## [7] "wtdesire" "age" "gender"
str(cdc)
## 'data.frame': 20000 obs. of 9 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
## $ exerany : num 0 0 1 1 0 1 1 0 0 1 ...
## $ hlthplan: num 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100: num 0 1 1 0 0 0 0 0 1 0 ...
## $ height : num 70 64 60 66 61 64 71 67 65 70 ...
## $ weight : int 175 125 105 132 150 114 194 170 150 180 ...
## $ wtdesire: int 175 115 105 124 130 114 185 160 130 170 ...
## $ age : int 77 33 49 42 55 55 31 45 27 44 ...
## $ gender : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
summary(cdc)
## genhlth exerany hlthplan smoke100
## excellent:4657 Min. :0.0000 Min. :0.0000 Min. :0.0000
## very good:6972 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## good :5675 Median :1.0000 Median :1.0000 Median :0.0000
## fair :2019 Mean :0.7457 Mean :0.8738 Mean :0.4721
## poor : 677 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## height weight wtdesire age gender
## Min. :48.00 Min. : 68.0 Min. : 68.0 Min. :18.00 m: 9569
## 1st Qu.:64.00 1st Qu.:140.0 1st Qu.:130.0 1st Qu.:31.00 f:10431
## Median :67.00 Median :165.0 Median :150.0 Median :43.00
## Mean :67.18 Mean :169.7 Mean :155.1 Mean :45.07
## 3rd Qu.:70.00 3rd Qu.:190.0 3rd Qu.:175.0 3rd Qu.:57.00
## Max. :93.00 Max. :500.0 Max. :680.0 Max. :99.00
table(cdc$smoke100)
##
## 0 1
## 10559 9441
table(cdc$smoke100)/20000
##
## 0 1
## 0.52795 0.47205
barplot(table(cdc$smoke100))
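The counts asked for in Exercise 1 can also be obtained programmatically rather than read from the printed output. The following is a minimal sketch, assuming a toy data frame (the hypothetical `cdc_head`, rebuilt from the first ten rows printed by `str(cdc)` and limited to three columns) so that it runs without downloading the data. On the full data set the same calls would be applied to `cdc`.

```r
# Toy stand-in for cdc (hypothetical name): first ten rows as printed by str(cdc),
# three columns only.
cdc_head <- data.frame(
  genhlth  = factor(c("good", "good", "good", "good", "very good",
                      "very good", "very good", "very good", "good", "good"),
                    levels = c("excellent", "very good", "good", "fair", "poor")),
  smoke100 = c(0, 1, 1, 0, 0, 0, 0, 0, 1, 0),
  age      = c(77L, 33L, 49L, 42L, 55L, 55L, 31L, 45L, 27L, 44L)
)

n_cases     <- nrow(cdc_head)           # number of cases (rows)
n_vars      <- ncol(cdc_head)           # number of variables (columns)
var_classes <- sapply(cdc_head, class)  # storage class of each variable

print(c(cases = n_cases, variables = n_vars))
print(var_classes)
```

The storage class (factor, numeric, integer) is only a hint at the statistical data type: a 0/1 column such as smoke100 is stored as a number but is categorical in meaning.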
Exercise 2: Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
# IQR = 3rd Qu. - 1st Qu. = 70 - 64
IQR(cdc$height)
## [1] 6
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
# IQR = 3rd Qu. - 1st Qu. = 57 - 31
IQR(cdc$age)
## [1] 26
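The interquartile range is the distance between the third and first quartiles, so it can be computed either with `IQR()` or as a difference of two `quantile()` values. A small sketch of that equivalence, assuming only the ten heights printed in the `str(cdc)` output (the vector name `height_head` is hypothetical):

```r
# First ten heights as printed by str(cdc); not the full sample.
height_head <- c(70, 64, 60, 66, 61, 64, 71, 67, 65, 70)

# Built-in interquartile range
iqr_builtin <- IQR(height_head)

# Same quantity by hand: third quartile minus first quartile
q <- quantile(height_head, c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

print(iqr_builtin)
print(iqr_manual)
```

Note that `quantile(x, .5)` returns the median, not the interquartile range.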
# frequency distribution
table(cdc$gender)/20000
##
## m f
## 0.47845 0.52155
table(cdc$exerany)/20000
##
## 0 1
## 0.2543 0.7457
table(cdc$gender)
##
## m f
## 9569 10431
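Dividing by a hard-coded 20000 works for this data set, but `prop.table()` derives the denominator from the table itself. A minimal sketch, assuming the ten gender values printed in the `str(cdc)` output (the name `gender_head` is hypothetical):

```r
# First ten gender values as printed by str(cdc) (codes 1 = "m", 2 = "f").
gender_head <- factor(c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m"),
                      levels = c("m", "f"))

gender_counts <- table(gender_head)         # absolute frequencies
gender_props  <- prop.table(gender_counts)  # relative frequencies, sum to 1
n_males       <- sum(gender_head == "m")    # count of one category directly

print(gender_counts)
print(gender_props)
print(n_males)
```

On the full data the same pattern would read `prop.table(table(cdc$gender))`.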
# Proportion of sample in excellent health: 0.23285
table(cdc$genhlth)/20000
##
## excellent very good good fair poor
## 0.23285 0.34860 0.28375 0.10095 0.03385
# Mosaic plot of gender against smoking status, as asked in Exercise 3
mosaicplot(table(cdc$gender, cdc$smoke100))
Exercise 3: What does the mosaic plot reveal about smoking habits and gender?
The mosaic plot shows that a larger share of males than females have smoked at least 100 cigarettes.
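The visual impression from a mosaic plot can be checked numerically with a two-way table and row proportions. The sketch below assumes only the first ten rows printed by `str(cdc)` (the names `gender_head` and `smoke_head` are hypothetical). Ten rows are far too few to say anything about the sample, so the point is the mechanics, not the numbers.

```r
# First ten rows as printed by str(cdc); illustration of the mechanics only.
gender_head <- factor(c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m"),
                      levels = c("m", "f"))
smoke_head  <- c(0, 1, 1, 0, 0, 0, 0, 0, 1, 0)

# Two-way table of counts: rows are gender, columns are smoking status
gender_smoke <- table(gender_head, smoke_head)

# Row proportions: within each gender, the share of non-smokers and smokers
row_props <- prop.table(gender_smoke, margin = 1)

print(gender_smoke)
print(row_props)
```

On the full data the equivalent call would be `prop.table(table(cdc$gender, cdc$smoke100), margin = 1)`, which compares the share of smokers within each gender rather than raw counts.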
Exercise 4: Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
#View(under23_and_smoke)
summary(under23_and_smoke)
## genhlth exerany hlthplan smoke100
## excellent:110 Min. :0.0000 Min. :0.0000 Min. :1
## very good:244 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1
## good :204 Median :1.0000 Median :1.0000 Median :1
## fair : 53 Mean :0.8145 Mean :0.6952 Mean :1
## poor : 9 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1
## Max. :1.0000 Max. :1.0000 Max. :1
## height weight wtdesire age gender
## Min. :59.00 Min. : 85.0 Min. : 80.0 Min. :18.00 m:305
## 1st Qu.:65.00 1st Qu.:130.0 1st Qu.:125.0 1st Qu.:19.00 f:315
## Median :68.00 Median :155.0 Median :150.0 Median :20.00
## Mean :67.92 Mean :158.9 Mean :152.2 Mean :20.22
## 3rd Qu.:71.00 3rd Qu.:180.0 3rd Qu.:175.0 3rd Qu.:21.00
## Max. :79.00 Max. :350.0 Max. :315.0 Max. :22.00
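`subset()` and logical indexing with square brackets select the same rows when no values are missing. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not taken from cdc) to show both forms and to count the matching rows with `nrow()`.

```r
# Hypothetical toy data: ages and smoking indicators invented for illustration.
toy <- data.frame(
  age      = c(19L, 22L, 23L, 40L, 21L),
  smoke100 = c(1, 0, 1, 1, 1)
)

# Same filter written two ways
by_subset <- subset(toy, age < 23 & smoke100 == 1)
by_index  <- toy[toy$age < 23 & toy$smoke100 == 1, ]

# Number of respondents who satisfy both conditions
n_match <- nrow(by_subset)

print(by_subset)
print(n_match)
```

The two forms differ when the condition evaluates to NA: `subset()` drops such rows, while bracket indexing returns a row of NAs for each of them.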
Exercise 5: What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.
The box plot shows that respondents reporting better general health tend to have lower BMI.
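# BMI from weight in pounds and height in inches; 703 is the unit conversion factor
bmi <- (cdc$weight / cdc$height^2) * 703
# Compare BMI across exercise status and across health plan coverage
boxplot(bmi ~ cdc$exerany)
boxplot(bmi ~ cdc$hlthplan)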
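The variables chosen are exerany and hlthplan. The box plot for exerany suggests that people who exercised have lower BMI, that is, better health; the box plot for hlthplan shows the same comparison for people with and without a health plan. Both plots may contain outliers.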
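A box plot comparison can be backed up with group-wise summaries. The following sketch computes the median and interquartile range of BMI per exercise group with `tapply()`, assuming only the first ten rows printed by `str(cdc)` (the vectors below are rebuilt from that output and are not the full sample).

```r
# First ten rows as printed by str(cdc)
weight  <- c(175, 125, 105, 132, 150, 114, 194, 170, 150, 180)  # pounds
height  <- c(70, 64, 60, 66, 61, 64, 71, 67, 65, 70)            # inches
exerany <- c(0, 0, 1, 1, 0, 1, 1, 0, 0, 1)                      # 1 = exercised

# Same BMI formula as in the lab
bmi <- (weight / height^2) * 703

# Median and spread of BMI within each exercise group
bmi_median <- tapply(bmi, exerany, median)
bmi_iqr    <- tapply(bmi, exerany, IQR)

print(round(bmi_median, 2))
print(round(bmi_iqr, 2))
```

On the full data, `tapply(bmi, cdc$exerany, median)` would give the medians that the box plot draws as the centre lines, which makes the size of the difference between groups explicit.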