##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
##
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
##
## demo
dim(cdc)
## [1] 20000 9
summary(cdc)
## genhlth exerany hlthplan smoke100
## excellent:4657 Min. :0.0000 Min. :0.0000 Min. :0.0000
## very good:6972 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## good :5675 Median :1.0000 Median :1.0000 Median :0.0000
## fair :2019 Mean :0.7457 Mean :0.8738 Mean :0.4721
## poor : 677 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## height weight wtdesire age gender
## Min. :48.00 Min. : 68.0 Min. : 68.0 Min. :18.00 m: 9569
## 1st Qu.:64.00 1st Qu.:140.0 1st Qu.:130.0 1st Qu.:31.00 f:10431
## Median :67.00 Median :165.0 Median :150.0 Median :43.00
## Mean :67.18 Mean :169.7 Mean :155.1 Mean :45.07
## 3rd Qu.:70.00 3rd Qu.:190.0 3rd Qu.:175.0 3rd Qu.:57.00
## Max. :93.00 Max. :500.0 Max. :680.0 Max. :99.00
Variable types: genhlth: ordinal; exerany: categorical; hlthplan: categorical; smoke100: categorical; height: discrete; weight: discrete; wtdesire: discrete; age: discrete; gender: categorical
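The classification above can be encoded in R so that ordinal and categorical columns are not treated as plain numbers. This is a minimal sketch on a made-up three-row data frame; the values, the level order from "poor" to "excellent", and the "no"/"yes" labels for the 0/1 coding are assumptions for illustration, not taken from cdc.

```r
# Toy data frame with a few cdc-style columns (values invented for illustration)
toy <- data.frame(
  genhlth = c("good", "excellent", "poor"),
  exerany = c(1, 0, 1),
  age     = c(43L, 31L, 57L)
)

# Ordinal variable: ordered factor (level order is an assumption)
toy$genhlth <- factor(
  toy$genhlth,
  levels  = c("poor", "fair", "good", "very good", "excellent"),
  ordered = TRUE
)

# Categorical 0/1 indicator: unordered factor (labels are an assumption)
toy$exerany <- factor(toy$exerany, levels = c(0, 1), labels = c("no", "yes"))

# First class of each column
sapply(toy, function(col) class(col)[1])
```

With an ordered factor, comparisons such as `toy$genhlth[1] < toy$genhlth[2]` are meaningful, which is the practical difference between an ordinal and a purely categorical variable.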
age_summary <- summary(cdc$age)
age_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
# IQR = third quartile minus first quartile
age_iqr <- age_summary[["3rd Qu."]] - age_summary[["1st Qu."]]
sprintf("Age IQR: %s", age_iqr)
## [1] "Age IQR: 26"
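height_summary <- summary(cdc$height)
height_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
# IQR = third quartile minus first quartile
height_iqr <- height_summary[["3rd Qu."]] - height_summary[["1st Qu."]]
sprintf("Height IQR: %s", height_iqr)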
## [1] "Height IQR: 6"
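The interquartile range computed above by subtracting quartiles can also be obtained with base R's `quantile()` and `IQR()`. The sketch below uses a small made-up vector (the `ages` values are illustrative, not from cdc) to show that the two routes agree.

```r
# Toy ages, invented for illustration (not taken from cdc)
ages <- c(18, 25, 31, 43, 57, 60, 99)

# First and third quartiles; names = FALSE drops the "25%"/"75%" labels
q <- quantile(ages, probs = c(0.25, 0.75), names = FALSE)
iqr_manual <- q[2] - q[1]

# Built-in equivalent
iqr_builtin <- IQR(ages)

sprintf("Manual IQR: %s, IQR(): %s", iqr_manual, iqr_builtin)
```

Using `IQR()` directly avoids indexing the summary by position.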
table(cdc$gender)
##
## m f
## 9569 10431
table(cdc$exerany)
##
## 0 1
## 5086 14914
table(cdc$genhlth)/nrow(cdc)
##
## excellent very good good fair poor
## 0.23285 0.34860 0.28375 0.10095 0.03385
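As an alternative to dividing the counts by the sample size, base R's `prop.table()` converts a frequency table to relative frequencies. A minimal sketch, using a made-up factor in place of cdc$genhlth:

```r
# Toy factor standing in for cdc$genhlth (values invented for illustration)
health <- factor(
  c("good", "fair", "good", "excellent", "poor", "good"),
  levels = c("excellent", "very good", "good", "fair", "poor")
)

counts <- table(health)

# prop.table() divides each count by the table total
rel_freq <- prop.table(counts)
rel_freq
```

Because the total is taken from the table itself, the proportions stay correct if rows are added or removed.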
Males are more likely than females to have smoked at least 100 cigarettes in their lifetime.
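A claim like this can be checked numerically with a two-way table of row proportions. The sketch below is only an illustration: the `toy` data frame reuses the column names gender and smoke100, but its values are invented and say nothing about the real cdc proportions.

```r
# Toy data with cdc-style column names; values are invented
toy <- data.frame(
  gender   = factor(c("m", "m", "m", "f", "f", "f", "f", "m"),
                    levels = c("m", "f")),
  smoke100 = c(1, 1, 0, 0, 1, 0, 0, 1)
)

# Rows: gender, columns: smoke100
counts <- table(toy$gender, toy$smoke100, dnn = c("gender", "smoke100"))

# margin = 1 makes each row sum to 1, i.e. the share of smokers within each gender
row_props <- prop.table(counts, margin = 1)
row_props
```

Comparing the "1" column across rows gives the share of each gender that has smoked at least 100 cigarettes.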
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
head(under23_and_smoke, 10)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13 excellent 1 0 1 66 185 220 21 m
## 37 very good 1 0 1 70 160 140 18 f
## 96 excellent 1 1 1 74 175 200 22 m
## 180 good 1 1 1 64 190 140 20 f
## 182 very good 1 1 1 62 92 92 21 f
## 240 very good 1 0 1 64 125 115 22 f
## 262 fair 0 1 1 71 185 185 20 m
## 296 fair 1 1 1 72 185 170 19 m
## 297 excellent 1 0 1 63 105 100 19 m
## 300 fair 1 1 1 71 185 150 18 m
The boxplot of BMI by general health suggests an association between the two: BMI tends to be lower in the better health categories.
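The trend read off the boxplot can be backed up by comparing the median BMI in each health category. This is a sketch on a made-up data frame (the `toy` values are assumptions chosen for illustration); the BMI formula is the one used in this lab, weight in pounds divided by height in inches squared, times 703.

```r
# Toy data with cdc-style column names; values are invented
toy <- data.frame(
  genhlth = factor(
    c("excellent", "excellent", "good", "good", "poor", "poor"),
    levels = c("excellent", "very good", "good", "fair", "poor")
  ),
  height = c(70, 64, 68, 66, 65, 72),
  weight = c(160, 120, 180, 170, 200, 260)
)
toy$bmi <- (toy$weight / toy$height^2) * 703

# Median BMI within each health category present in the toy data
median_bmi <- tapply(toy$bmi, droplevels(toy$genhlth), median)
round(median_bmi, 1)
```

Medians are used here because they are the center line of each box in a boxplot.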
cdc$bmi <- (cdc$weight / cdc$height^2) * 703
library(ggplot2)
# Side-by-side boxplots of BMI, one per gender
ggplot(cdc) + geom_boxplot(aes(x = 1, y = bmi, group = gender, color = gender))
I chose gender because there are differences in body composition between men and women. The plot suggests men have a higher median BMI than women, as well as higher values for the 25th and 75th percentiles. There appears to be more variability in BMI for women, given the larger IQR and the greater number of outliers.
ggplot(cdc) + geom_point(aes(x = wtdesire, y = weight))
There appears to be a positive correlation between weight and desired weight, but a lot of variability. Some people have desired weights well below their actual weight.
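The two observations above (a positive relationship, and desired weights often below actual weights) can each be summarized with one number. The sketch below uses a made-up data frame with the column names weight and wtdesire; the values are assumptions for illustration only.

```r
# Toy data with cdc-style column names; values are invented
toy <- data.frame(
  weight   = c(120, 150, 180, 200, 250),
  wtdesire = c(120, 140, 170, 175, 190)
)

# Pearson correlation between desired and actual weight
r <- cor(toy$wtdesire, toy$weight)

# Share of people whose desired weight is below their actual weight
share_wanting_less <- mean(toy$wtdesire < toy$weight)

sprintf("r = %.2f, share wanting to weigh less = %.2f", r, share_wanting_less)
```

The correlation coefficient measures only the linear trend, so it complements the scatterplot rather than replacing it.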
cdc$wdiff <- cdc$wtdesire - cdc$weight
wdiff is a discrete numerical variable. A value of 0 means the person is at their desired weight. Negative values mean their current weight is above their desired weight, and positive values mean it is below.
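The sign convention can be made explicit by labeling each row according to the sign of wdiff. A small sketch on made-up values; the `goal` column and its labels are hypothetical names introduced for illustration.

```r
# Toy data with cdc-style column names; values are invented
toy <- data.frame(
  weight   = c(150, 180, 130),
  wtdesire = c(150, 160, 140)
)
toy$wdiff <- toy$wtdesire - toy$weight

# Negative: wants to lose weight; positive: wants to gain; zero: at desired weight
toy$goal <- ifelse(toy$wdiff < 0, "lose",
                   ifelse(toy$wdiff > 0, "gain", "maintain"))
toy
```

Tabulating the label with `table(toy$goal)` would give the count of people in each group.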
ggplot(cdc) + geom_histogram(aes(x = wdiff), binwidth = 10)
ggplot(cdc) + geom_density(aes(x = wdiff))
The distribution is unimodal and left skewed. The shape indicates that many people are unhappy with their current weight, some very much so. Some people would like to lose more than 100 pounds, while very few would like to gain that much (the maximum of 500 in the men's summary looks like an outlier).
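A quick numeric check for left skew is to compare the mean with the median: a long left tail pulls the mean below the median. This is a rule of thumb rather than a formal test, and the `wdiff` vector below is made up for illustration.

```r
# Toy wdiff values with a long left tail (invented for illustration)
wdiff <- c(-120, -60, -30, -20, -10, -10, -5, 0, 0, 0, 5, 10)

mean_wdiff   <- mean(wdiff)
median_wdiff <- median(wdiff)

# TRUE suggests left skew under the mean-vs-median rule of thumb
left_skew_hint <- mean_wdiff < median_wdiff

sprintf("mean = %s, median = %s, mean < median: %s",
        mean_wdiff, median_wdiff, left_skew_hint)
```

The gender summaries in this report show the same pattern, with each mean below the corresponding median.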
men <- subset(cdc, gender == "m")
women <- subset(cdc, gender == "f")
summary(men$wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -20.00 -5.00 -10.71 0.00 500.00
summary(women$wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -27.00 -10.00 -18.15 0.00 83.00
ggplot(cdc) + geom_boxplot(aes(x = 1, y = wdiff, group = gender, color = gender)) + ylim(-100,100)
## Warning: Removed 184 rows containing non-finite values (stat_boxplot).
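It is difficult to discern a large difference from the plots. From the summaries, the median for women (-10) is lower than for men (-5), as are the mean (-18.15 vs. -10.71) and the first quartile (-27 vs. -20), so women tend to want to lose more weight than men. The plots and summaries alone cannot show whether this difference is statistically significant.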
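One possible way to go beyond eyeballing the plots is to compare group medians directly and then run a two-sample test. The sketch below uses Welch's t-test via `t.test()` as an example; that choice of test is an assumption, not something the original analysis ran, and the `toy` values are invented.

```r
# Toy data with cdc-style column names; values are invented
toy <- data.frame(
  gender = rep(c("m", "f"), each = 6),
  wdiff  = c(-5, 0, -10, 5, -20, 0,      # men
             -10, -25, -5, -30, 0, -15)  # women
)

# Median wdiff within each gender
medians <- tapply(toy$wdiff, toy$gender, median)
medians

# Welch two-sample t-test comparing mean wdiff between genders
test_result <- t.test(wdiff ~ gender, data = toy)
test_result$p.value
```

A small p-value would indicate that a gap this large is unlikely under equal means. Since wdiff is skewed, a rank-based test may be worth considering as well.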
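stdev <- sd(cdc$weight)
avg <- mean(cdc$weight)
# abs() keeps only weights within one standard deviation on either side of the mean.
# The one-sided check (cdc$weight - avg < stdev) also counts every weight far below
# the mean; it gives 0.84675 on this data.
answers <- ifelse(abs(cdc$weight - avg) < stdev, 1, 0)
proportion <- mean(answers)
sprintf("Proportion within 1 standard deviation: %s", proportion)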
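"Within one standard deviation of the mean" is normally read as |x - mean| < sd, while a comparison such as `weight - avg < stdev` bounds only the upper side. The sketch below contrasts the two on a made-up vector (the `weights` values are assumptions for illustration).

```r
# Toy weights, invented for illustration (not taken from cdc)
weights <- c(100, 130, 150, 160, 165, 170, 180, 200, 260)

avg   <- mean(weights)
stdev <- sd(weights)

# One-sided: counts everything below mean + sd, including very low weights
one_sided <- mean(weights - avg < stdev)

# Two-sided: counts only values between mean - sd and mean + sd
two_sided <- mean(abs(weights - avg) < stdev)

sprintf("one-sided: %.3f, two-sided: %.3f", one_sided, two_sided)
```

The one-sided proportion can never be smaller than the two-sided one, so it overstates the share of observations close to the mean whenever any value lies more than one standard deviation below it.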