source("http://www.openintro.org/stat/data/cdc.R")
There are 20,000 cases, each with 9 variables. The following variables are categorical: genhlth (ordinal), exerany, hlthplan, smoke100, and gender. The following variables are continuous quantitative: height, weight, wtdesire, and age.
# enter code for Ex2 below
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.0 64.0 67.0 67.2 70.0 93.0
summary(cdc$height)[5] - summary(cdc$height)[2]
## 3rd Qu.
## 6
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 31.0 43.0 45.1 57.0 99.0
summary(cdc$age)[5] - summary(cdc$age)[2]
## 3rd Qu.
## 26
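The interquartile range can also be computed without indexing into `summary()`. This is a minimal sketch on a small made-up vector (the `heights` values are hypothetical, not taken from cdc), assuming R's default quantile method:

```r
# Toy heights in inches (hypothetical values, not from the cdc data)
heights <- c(60, 64, 66, 67, 68, 70, 75)

# IQR() returns Q3 - Q1 directly
iqr_direct <- IQR(heights)

# The same quantity computed from the two quartiles
q <- quantile(heights, c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

iqr_direct
iqr_manual
```

On the real data the equivalent call would be `IQR(cdc$height)`.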
There are 9,569 males in the sample (about 47.8%), and 23.285% of respondents report being in excellent health.
# enter code for Ex3 below
table(cdc$gender)/20000
##
## m f
## 0.4784 0.5215
table(cdc$genhlth)/20000
##
## excellent very good good fair poor
## 0.23285 0.34860 0.28375 0.10095 0.03385
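Counts and proportions can both be read from the same table without hard-coding the sample size. A small sketch on made-up data (the `gender` vector here is hypothetical, not the cdc column):

```r
# Toy gender vector (hypothetical values, not from the cdc data)
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

# Raw counts per category
counts <- table(gender)

# Proportions; prop.table() divides by the table total,
# so the sample size does not need to be typed in by hand
props <- prop.table(counts)

counts
props
```

With the real data this would be `table(cdc$gender)` for counts and `prop.table(table(cdc$gender))` for proportions.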
# code for Ex4 already given in lab
mosaicplot(table(cdc$gender, cdc$smoke100))
The mosaic plot suggests that gender and smoke100 are not independent: the proportion of respondents who have smoked at least 100 cigarettes differs between males and females.
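One way to put numbers on this comparison is to look at row proportions, that is, the share of smokers within each gender. The sketch below uses a made-up data frame (`toy` and its values are hypothetical) rather than the cdc data:

```r
# Toy data (hypothetical values, not from the cdc data)
toy <- data.frame(
  gender   = factor(c("m", "m", "m", "m", "f", "f", "f", "f"),
                    levels = c("m", "f")),
  smoke100 = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Two-way table of counts: rows are gender, columns are smoke100
tab <- table(toy$gender, toy$smoke100)

# margin = 1 makes each row sum to 1, giving P(smoke100 | gender)
row_props <- prop.table(tab, margin = 1)
row_props
```

If the two variables were independent, the rows of `row_props` would be roughly equal. The same call on `table(cdc$gender, cdc$smoke100)` gives the proportions behind the mosaic plot.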
# enter code for Ex5 below
# use the full column name smoke100 (cdc$smoke only works through partial matching)
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
nrow(under23_and_smoke)
## [1] 620
There are 620 respondents who are under 23 and have smoked at least 100 cigarettes in their lifetime.
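The same count can be checked two ways: by subsetting and counting rows, or by summing the logical condition. A sketch on a made-up data frame (the `toy` values are hypothetical):

```r
# Toy data (hypothetical values, not from the cdc data)
toy <- data.frame(
  age      = c(19, 22, 23, 35, 21),
  smoke100 = c(1, 0, 1, 1, 1)
)

# Approach 1: subset the rows, then count them
young_smokers <- subset(toy, age < 23 & smoke100 == 1)
n_subset <- nrow(young_smokers)

# Approach 2: TRUE counts as 1, so sum() counts the matching rows
n_sum <- sum(toy$age < 23 & toy$smoke100 == 1)

n_subset
n_sum
```

Both approaches should agree; a mismatch usually points to a typo in a column name or condition.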
# code for bmi vs. genhlth already given in lab
bmi = (cdc$weight/cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth, main = "BMI vs. general health")
The boxplots show that BMI tends to be higher in the groups reporting poorer general health.
# enter code for Ex6 below (boxplot for bmi vs. your chosen variable)
boxplot(bmi ~ cdc$gender, main = "BMI vs. gender")
tapply(bmi, cdc$gender, summary)
## $m
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.4 23.7 26.3 26.9 29.2 64.2
##
## $f
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.7 21.8 24.6 25.7 28.3 73.1
Men and women are built differently, so their BMIs may differ. The median and both quartiles of BMI are lower for women (median 24.6 vs. 26.3 for men). However, the difference is small relative to the spread within each group, so it is not obvious in the boxplot.
# enter code for Ex7 below
cdcnew <- subset(cdc, cdc$wtdesire < 500)
plot(cdcnew$weight, cdcnew$wtdesire)
These variables are closely correlated in the sense that people's desired weights do not differ greatly from their current weights. In general, people's desired weights are lower than their current weights.
No text needed for this question, just code.
# enter code for Ex8 below
wdiff = cdcnew$weight - cdcnew$wtdesire
wdiff is a numeric vector of integer values, one per respondent. If an observation in wdiff is 0, the person does not want to change his or her weight. If the observation is positive, the person wishes to lose weight. Likewise, if the observation is negative, the person wishes to gain weight.
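The interpretation of the sign can be turned into a quick tally. This is a sketch on made-up weights (the vectors and the `goal` labels are hypothetical, chosen only for illustration):

```r
# Toy weights in pounds (hypothetical values, not from the cdc data)
weight   <- c(150, 180, 200, 130, 165)
wtdesire <- c(150, 160, 170, 140, 165)

# Positive = wants to lose, negative = wants to gain, 0 = no change
wdiff <- weight - wtdesire

# sign() maps each difference to -1, 0, or 1; label those three cases
goal <- factor(sign(wdiff),
               levels = c(-1, 0, 1),
               labels = c("gain", "no change", "lose"))

goal_counts <- table(goal)
goal_counts
```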
# enter code for Ex10 below - numerical summary
summary(wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -110.0 0.0 10.0 14.6 21.0 300.0
# enter code for Q4 below - plot(s)
hist(wdiff)
The distribution of weight differences is right skewed. There are relatively few extreme cases where people want to lose large amounts of weight. The majority of the data is positive, which means people generally want to lose weight. The center is close to zero (median 10, first quartile 0), so most people's desired weight is not far from their current weight.
# enter code for Ex11 below - numerical summary
tapply(wdiff, cdcnew$gender, summary)
## $m
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -110.0 0.0 5.0 10.8 20.0 300.0
##
## $f
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -83.0 0.0 10.0 18.2 27.0 300.0
# enter code for Ex11 below - side-by-side box plot
boxplot(wdiff ~ cdcnew$gender)
The median and mean of wdiff are both larger for females than for males, indicating that females in general want to lose more weight.
Note: these statistics are computed on cdcnew, which excludes outliers with a desired weight of 500 or more.
# enter code for Ex12 below
summary(cdcnew$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68 140 165 170 190 500
sd(cdcnew$weight)
## [1] 40.07
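# bounds are mean +/- sd: 169.7 + 40.07 = 209.77 and 169.7 - 40.07 = 129.63
onesd <- subset(cdcnew$weight, cdcnew$weight < 209.77 & cdcnew$weight > 129.63)
# onesd is a vector, so use length(); dim() returns NULL for vectors
length(onesd)
## [1] 14151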
14151/20000
## [1] 0.7076
The mean is 169.7 and the standard deviation is 40.07. There are 14,151 people within one standard deviation of the mean, or about 70.8% of the sample.
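The proportion within one standard deviation can also be computed without typing the bounds by hand. A minimal sketch on made-up weights (the `weights` vector is hypothetical, not the cdc column):

```r
# Toy weights in pounds (hypothetical values, not from the cdc data)
weights <- c(120, 150, 160, 170, 180, 190, 220)

m <- mean(weights)
s <- sd(weights)

# TRUE where the observation lies strictly within one sd of the mean
within_one_sd <- abs(weights - m) < s

# Count and proportion; mean() of a logical vector is the share of TRUEs
n_within    <- sum(within_one_sd)
prop_within <- mean(within_one_sd)

n_within
prop_within
```

Because the denominator is the length of the vector being tested, this avoids dividing by a sample size that no longer matches after rows have been removed.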