606 Lab1: Introduction to Data

Centers for Disease Control and Prevention (CDC)

Exercise 1: How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

source("http://www.openintro.org/stat/data/cdc.R")
names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"
str(cdc)
## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
summary(cdc)
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000  
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire          age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431  
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00            
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00
table(cdc$smoke100)
## 
##     0     1 
## 10559  9441
table(cdc$smoke100)/20000
## 
##       0       1 
## 0.52795 0.47205
barplot(table(cdc$smoke100))
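
The counts reported by str() above (20000 observations of 9 variables) can also be obtained programmatically, and prop.table() avoids hard-coding the sample size. The following is a minimal sketch on a small made-up data frame; the name toy and its values are hypothetical and only mimic three cdc columns. With the lab data, toy would be replaced by cdc.

```r
# Hypothetical toy data mimicking three columns of cdc (not real survey data)
toy <- data.frame(
  genhlth  = factor(c("good", "excellent", "good", "fair")),
  smoke100 = c(0, 1, 1, 0),
  age      = c(77L, 33L, 49L, 42L)
)

n_cases     <- nrow(toy)           # number of cases (rows)
n_vars      <- ncol(toy)           # number of variables (columns)
var_classes <- sapply(toy, class)  # storage class of each variable

# Relative frequencies without hard-coding the sample size
smoke_prop <- prop.table(table(toy$smoke100))

print(n_cases)
print(n_vars)
print(var_classes)
print(smoke_prop)
```

Note that the class R reports is a storage type, not the statistical type: a 0/1 column such as smoke100 is stored as num but is a categorical variable.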

Exercise 2: Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
# IQR = Q3 - Q1 = 70 - 64
IQR(cdc$height)
## [1] 6
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
# IQR = Q3 - Q1 = 57 - 31
IQR(cdc$age)
## [1] 26
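
The interquartile range is the third quartile minus the first, not the median. From the summaries above this gives 70 - 64 = 6 for height and 57 - 31 = 26 for age. As a sketch, the built-in IQR() and the difference of the two quartiles from quantile() agree; the vector x below is hypothetical, and with the lab data it would be cdc$height or cdc$age.

```r
# Hypothetical sample values; with the lab data use cdc$height or cdc$age
x <- c(60, 64, 64, 66, 67, 70, 70, 71)

q <- quantile(x, c(0.25, 0.75))     # first and third quartiles
iqr_manual  <- unname(q[2] - q[1])  # Q3 - Q1 computed by hand
iqr_builtin <- IQR(x)               # same quantity from the built-in

print(iqr_manual)
print(iqr_builtin)
```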
# Relative frequency distributions for gender and exerany
table(cdc$gender)/20000
## 
##       m       f 
## 0.47845 0.52155
table(cdc$exerany)/20000
## 
##      0      1 
## 0.2543 0.7457
# Number of males in the sample
table(cdc$gender)
## 
##     m     f 
##  9569 10431
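
Single answers such as the number of males can be pulled out of a table by level name instead of being read off by eye. A minimal sketch, assuming a small made-up factor (gender below is hypothetical and stands in for cdc$gender):

```r
# Hypothetical toy factor standing in for cdc$gender (not real survey data)
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts   <- table(gender)
n_male   <- as.integer(counts["m"])               # count for the level "m"
p_female <- as.numeric(prop.table(counts)["f"])   # relative frequency of "f"

print(n_male)
print(p_female)
```

The same indexing applied to table(cdc$genhlth) with the name "excellent" would return the proportion asked for in this exercise.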
# Proportion of sample in excellent health: 0.23285
table(cdc$genhlth)/20000
## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385
# Mosaic plot of gender against smoking status (used in Exercise 3)
mosaicplot(table(cdc$gender, cdc$smoke100))

Exercise 3: What does the mosaic plot reveal about smoking habits and gender?

The mosaic plot suggests that males are more likely than females to have smoked at least 100 cigarettes in their lifetime.

Exercise 4: Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
#View(under23_and_smoke)
summary(under23_and_smoke)
##       genhlth       exerany          hlthplan         smoke100
##  excellent:110   Min.   :0.0000   Min.   :0.0000   Min.   :1  
##  very good:244   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1  
##  good     :204   Median :1.0000   Median :1.0000   Median :1  
##  fair     : 53   Mean   :0.8145   Mean   :0.6952   Mean   :1  
##  poor     :  9   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1  
##                  Max.   :1.0000   Max.   :1.0000   Max.   :1  
##      height          weight         wtdesire          age        gender 
##  Min.   :59.00   Min.   : 85.0   Min.   : 80.0   Min.   :18.00   m:305  
##  1st Qu.:65.00   1st Qu.:130.0   1st Qu.:125.0   1st Qu.:19.00   f:315  
##  Median :68.00   Median :155.0   Median :150.0   Median :20.00          
##  Mean   :67.92   Mean   :158.9   Mean   :152.2   Mean   :20.22          
##  3rd Qu.:71.00   3rd Qu.:180.0   3rd Qu.:175.0   3rd Qu.:21.00          
##  Max.   :79.00   Max.   :350.0   Max.   :315.0   Max.   :22.00
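
The summary above lists 305 males and 315 females, so the subset holds 620 respondents; nrow(under23_and_smoke) would report the same count directly. As a sketch of how the condition works, subset() and plain logical indexing select the same rows. The data frame toy below is hypothetical and only mimics two cdc columns.

```r
# Hypothetical toy data mimicking two columns of cdc (not real survey data)
toy <- data.frame(
  age      = c(19L, 22L, 23L, 45L, 20L),
  smoke100 = c(1, 0, 1, 1, 1)
)

# subset() evaluates the condition inside the data frame,
# so columns can be named without the toy$ prefix
by_subset <- subset(toy, age < 23 & smoke100 == 1)

# Equivalent base-R logical indexing
by_index <- toy[toy$age < 23 & toy$smoke100 == 1, ]

print(nrow(by_subset))
print(nrow(by_index))
```

Both conditions must hold for a row to be kept, so the respondent aged 23 and the non-smoker aged 22 are excluded.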

Exercise 5: What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

The box plot shows that BMI tends to be lower for respondents who report better general health, with the lowest values in the excellent group.

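# BMI from weight in pounds and height in inches
bmi <- (cdc$weight / cdc$height^2) * 703
# Box plot referred to in the exercise: BMI by general health
boxplot(bmi ~ cdc$genhlth)
# BMI by whether the respondent exercised in the past month
boxplot(bmi ~ cdc$exerany)

# BMI by whether the respondent has health coverage
boxplot(bmi ~ cdc$hlthplan)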
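
The BMI formula divides weight in pounds by squared height in inches and multiplies by 703 to convert from imperial units. As a sketch, it can be wrapped in a function and summarised by group, which gives the numbers a box plot draws. The function name bmi_imperial and the data frame toy are hypothetical; the four rows reuse the first four weight, height and exerany values shown in the str() output and are far too few to support any conclusion.

```r
# BMI from weight in pounds and height in inches
bmi_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# Illustrative rows only: first four values from the str() output above
toy <- data.frame(
  weight  = c(175, 125, 105, 132),
  height  = c(70, 64, 60, 66),
  exerany = c(0, 0, 1, 1)
)
toy$bmi <- bmi_imperial(toy$weight, toy$height)

# Per-group summaries (min, quartiles, median, mean, max) of BMI
bmi_by_group <- tapply(toy$bmi, toy$exerany, summary)

print(round(toy$bmi, 2))
print(bmi_by_group)

# With the lab data, a labelled plot could be drawn with:
# boxplot(bmi ~ cdc$exerany, xlab = "exerany", ylab = "BMI")
```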

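The variable chosen is exerany, on the expectation that people who exercise have a lower BMI. The box plot seems to suggest that respondents who exercised are in somewhat better health (lower BMI) than those who did not. The second plot, against hlthplan, compares respondents with and without a health plan. Both plots may also contain outliers.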