Initialization

## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 3rd Edition. You can read this by typing 
## vignette('os3') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.
## 
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
## 
##     demo

Exercise 1

dim(cdc)
## [1] 20000     9
summary(cdc)
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000  
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire          age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431  
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00            
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00

genhlth: categorical (ordinal)
exerany: categorical
hlthplan: categorical
smoke100: categorical
height: numerical (discrete)
weight: numerical (discrete)
wtdesire: numerical (discrete)
age: numerical (discrete)
gender: categorical

Exercise 2

age_summary <- summary(cdc$age)
age_summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
age_iqr <- age_summary[5] - age_summary[2]
sprintf("Age IQR: %s", age_iqr)
## [1] "Age IQR: 26"
height_summary <- summary(cdc$height)
height_summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
height_iqr <- height_summary[5] - height_summary[2]
sprintf("Height IQR: %s", height_iqr)
## [1] "Height IQR: 6"
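
The quartile subtraction above can also be done with base R's quantile() and IQR() functions, which avoids indexing the summary by position. A minimal sketch on a small made-up vector (the values in `ages` are hypothetical, not taken from cdc):

```r
# Hypothetical ages, for illustration only (not taken from cdc)
ages <- c(18, 25, 31, 43, 50, 57, 64, 99)

# First and third quartiles, then their difference
q <- quantile(ages, probs = c(0.25, 0.75))
iqr_from_quartiles <- unname(q[2] - q[1])

# Base R shortcut: IQR() uses the same default quantile definition (type 7)
iqr_builtin <- IQR(ages)

print(iqr_from_quartiles)
print(iqr_builtin)
```

With the lab data, the same calls would presumably be IQR(cdc$age) and IQR(cdc$height).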
table(cdc$gender)
## 
##     m     f 
##  9569 10431
table(cdc$exerany)
## 
##     0     1 
##  5086 14914
table(cdc$genhlth)/20000
## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385
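
Dividing by 20000 works because the number of rows is known from dim(cdc). A sketch of an alternative that does not hard-code the sample size, using prop.table() on made-up responses (the `genhlth` vector below is hypothetical):

```r
# Hypothetical responses, for illustration only (not taken from cdc)
genhlth <- factor(
  c("good", "excellent", "good", "fair", "very good", "good", "poor", "excellent"),
  levels = c("excellent", "very good", "good", "fair", "poor")
)

counts <- table(genhlth)

# Dividing by the number of observations, as in the text
rel_freq_manual <- counts / length(genhlth)

# prop.table() divides by the table's own total, so no hard-coded n is needed
rel_freq <- prop.table(counts)

print(rel_freq)
```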

Exercise 3

Males are more likely than females to have smoked at least 100 cigarettes in their lifetime.
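
One way to back a claim like this with numbers is a row-wise proportion table. The sketch below uses a made-up data frame (`toy` is hypothetical); with the lab data the same calls would presumably use cdc$gender and cdc$smoke100:

```r
# Hypothetical data, for illustration only (not taken from cdc)
toy <- data.frame(
  gender   = factor(c("m", "m", "m", "m", "f", "f", "f", "f"), levels = c("m", "f")),
  smoke100 = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Cross-tabulate, then convert each row to proportions (margin = 1)
smoke_by_gender <- prop.table(table(toy$gender, toy$smoke100), margin = 1)
print(smoke_by_gender)

# Because smoke100 is coded 0/1, the group mean is the share of smokers
smoker_share <- tapply(toy$smoke100, toy$gender, mean)
print(smoker_share)
```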

Exercise 4

under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
head(under23_and_smoke, 10)
##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f
## 262      fair       0        1        1     71    185      185  20      m
## 296      fair       1        1        1     72    185      170  19      m
## 297 excellent       1        0        1     63    105      100  19      m
## 300      fair       1        1        1     71    185      150  18      m
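
The subset() call above combines two conditions with &. As a sketch of the same filter written with logical indexing, on a made-up data frame (`toy` is hypothetical):

```r
# Hypothetical data, for illustration only (not taken from cdc)
toy <- data.frame(
  age      = c(18, 21, 22, 23, 35, 19),
  smoke100 = c(1, 0, 1, 1, 1, 1)
)

# subset() keeps the rows where the whole condition is TRUE
young_smokers <- subset(toy, age < 23 & smoke100 == 1)

# Equivalent logical indexing with [rows, ]
young_smokers_idx <- toy[toy$age < 23 & toy$smoke100 == 1, ]

# Number of rows that satisfy both conditions
print(nrow(young_smokers))
```

Note that age < 23 is a strict inequality, so a 23-year-old smoker is excluded.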

Exercise 5

The boxplot of BMI by general health shows an association between the two: BMI tends to be lower as reported health improves.

cdc$bmi <- (cdc$weight / cdc$height^2) * 703
ggplot(cdc) + geom_boxplot(aes(x = 1, y = bmi, group = gender, color = gender))

I chose gender because there are differences in body composition between men and women. The plot suggests men have a higher median BMI than women, as well as higher values for the 25th and 75th percentiles. There appears to be more variance in BMI for women, given the larger IQR and the greater number of outliers.
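
The quartiles and IQR read off the boxplot can also be computed directly. A minimal sketch, assuming the same BMI formula as above and a made-up data frame (`toy` and its values are hypothetical):

```r
# Hypothetical data, for illustration only (not taken from cdc)
toy <- data.frame(
  gender = factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f")),
  height = c(70, 72, 68, 64, 62, 66),       # inches
  weight = c(180, 200, 160, 130, 120, 150)  # pounds
)

# BMI from pounds and inches: weight / height^2 * 703
toy$bmi <- (toy$weight / toy$height^2) * 703

# Quartiles and IQR per gender, the quantities a boxplot draws
bmi_quartiles <- tapply(toy$bmi, toy$gender, quantile, probs = c(0.25, 0.5, 0.75))
bmi_iqr <- tapply(toy$bmi, toy$gender, IQR)

print(bmi_quartiles)
print(bmi_iqr)
```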

On your own

Question 1

ggplot(cdc) + geom_point(aes(x = wtdesire, y = weight))

There appears to be a positive correlation between weight and desired weight, but a lot of variability. Some people have desired weights well below their actual weight.

Question 2

cdc$wdiff <- cdc$wtdesire - cdc$weight

Question 3

wdiff is a numerical (discrete) variable. A value of 0 means the person is at their desired weight. Negative values mean their current weight is above their desired weight, and positive values mean it is below.
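
The sign convention can be illustrated with three made-up respondents (the `toy` data frame and the `goal` column are hypothetical, added only for this sketch):

```r
# Hypothetical data, for illustration only (not taken from cdc)
toy <- data.frame(
  weight   = c(150, 200, 120),
  wtdesire = c(150, 170, 130)
)

# Same definition as in Question 2: desired minus current weight
toy$wdiff <- toy$wtdesire - toy$weight

# 0 = at desired weight, negative = wants to lose, positive = wants to gain
toy$goal <- ifelse(toy$wdiff == 0, "at desired weight",
                   ifelse(toy$wdiff < 0, "wants to lose", "wants to gain"))

print(toy)
```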

Question 4

ggplot(cdc) + geom_histogram(aes(x = wdiff), binwidth = 10) 

ggplot(cdc) + geom_density(aes(x = wdiff)) 

The distribution is unimodal and left-skewed. The shape indicates that many people are unhappy with their current weight, some very much so. A number of people would like to lose more than 100 pounds, while far fewer would like to gain that much, although the summary for men under Question 5 shows a maximum of 500, so at least one respondent reports wanting to gain far more than 100 pounds.
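
A rough numeric check on the direction of skew is to compare the mean with the median: a long lower tail typically pulls the mean below the median. A sketch on made-up differences (the `wdiff_toy` vector is hypothetical):

```r
# Hypothetical weight differences, for illustration only (not taken from cdc)
wdiff_toy <- c(-120, -60, -30, -20, -10, -10, -5, 0, 0, 0, 5, 10)

center <- c(mean = mean(wdiff_toy), median = median(wdiff_toy))
print(center)

# TRUE here is consistent with a left skew; it is a heuristic, not a proof
mean_below_median <- mean(wdiff_toy) < median(wdiff_toy)
print(mean_below_median)
```

The summaries under Question 5 show the same pattern for both groups: the means (-10.71 and -18.15) lie below the medians (-5 and -10).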

Question 5

men <- subset(cdc, gender == "m")
women <- subset(cdc, gender == "f")
summary(men$wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -20.00   -5.00  -10.71    0.00  500.00
summary(women$wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -27.00  -10.00  -18.15    0.00   83.00
ggplot(cdc) + geom_boxplot(aes(x = 1, y = wdiff, group = gender, color = gender)) + ylim(-100,100)
## Warning: Removed 184 rows containing non-finite values (stat_boxplot).

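It is difficult to discern a large difference from the plots. The summaries are clearer: the median for women (-10) is lower than for men (-5), as are the first quartile (-27 vs. -20) and the mean (-18.15 vs. -10.71). Women therefore tend to want to lose more weight than men, although a formal test would be needed to say whether the difference is statistically significant.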
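
Instead of creating separate men and women data frames, the group medians can be computed in one call. A sketch using aggregate() on a made-up data frame (`toy` is hypothetical):

```r
# Hypothetical data, for illustration only (not taken from cdc)
toy <- data.frame(
  gender = factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f")),
  wdiff  = c(-5, 0, -20, -10, -30, 0)
)

# One row per gender, with the median of wdiff in each group
median_by_gender <- aggregate(wdiff ~ gender, data = toy, FUN = median)
print(median_by_gender)
```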

Question 6

stdev <- sd(cdc$weight)
avg <- mean(cdc$weight)
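answers <- ifelse(cdc$weight - avg < stdev,1,0)
proportion <- mean(answers)
sprintf("Proportion within 1 standard deviation: %s", proportion)
## [1] "Proportion within 1 standard deviation: 0.84675"

Note that the comparison cdc$weight - avg < stdev is one-sided: it also counts every weight more than one standard deviation below the mean. The 0.84675 above is therefore the proportion of weights below avg + stdev, which is an upper bound on the proportion within one standard deviation of the mean. The two-sided condition is abs(cdc$weight - avg) < stdev.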
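
Being within one standard deviation of the mean is a two-sided condition, so the distance from the mean needs abs(). A minimal sketch on a made-up vector (the `weights` values are hypothetical) comparing the one-sided comparison used above with the two-sided one:

```r
# Hypothetical weights in pounds, for illustration only (not taken from cdc)
weights <- c(120, 140, 150, 160, 165, 170, 180, 190, 210, 300)

avg   <- mean(weights)
stdev <- sd(weights)

# One-sided: counts every value below avg + stdev, including very low ones
one_sided <- mean(weights - avg < stdev)

# Two-sided: counts only values between avg - stdev and avg + stdev
two_sided <- mean(abs(weights - avg) < stdev)

print(c(one_sided = one_sided, two_sided = two_sided))
```

Taking mean() of a logical vector gives the proportion of TRUE values directly, so the ifelse(..., 1, 0) step is not needed.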