source("more/cdc.R")
#Number of cases is 20,000
#Number of variables is 9
dim(cdc)
## [1] 20000     9
#check classes for each variable
sapply(cdc, class)
##   genhlth   exerany  hlthplan  smoke100    height    weight  wtdesire 
##  "factor" "numeric" "numeric" "numeric" "numeric" "integer" "integer" 
##       age    gender 
## "integer"  "factor"
Genhlth - Ordinal
Exerany - Binary
Hlthplan - Binary
Smoke100 - Binary
Height - Continuous numerical (recorded as discrete whole numbers)
Weight - Continuous numerical (recorded as discrete whole numbers)
Wtdesire - Continuous numerical (recorded as discrete whole numbers)
Age - Continuous numerical (recorded as discrete whole numbers)
Gender - Binary (nominal)
height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
#height summary
summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
#age summary
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
#Height IQR
unname(summary(cdc$height)[5] - summary(cdc$height)[2])
## [1] 6
#Age IQR
unname(summary(cdc$age)[5] - summary(cdc$age)[2])
## [1] 26
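The interquartile range can also be computed directly with base R's `IQR()`, or from the quartiles returned by `quantile()`. The following is a minimal sketch on a small made-up vector (the values are illustrative and are not taken from `cdc`):

```r
# Toy heights in inches (made-up values, NOT from the cdc data)
heights <- c(60, 64, 67, 70, 72)

# IQR() returns Q3 - Q1 using R's default quantile method (type 7)
iqr_direct <- IQR(heights)

# The same quantity built from the quartiles themselves
q <- quantile(heights, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

print(iqr_direct)
print(iqr_manual)
```

Applied to the real data, `IQR(cdc$height)` and `IQR(cdc$age)` should agree with the subtractions above, since `summary()` reports the same quartiles.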
#gender relative frequency distribution
table(cdc$gender)/length(cdc$gender)
## 
##       m       f 
## 0.47845 0.52155
#exerany relative frequency distribution
table(cdc$exerany)/length(cdc$exerany)
## 
##      0      1 
## 0.2543 0.7457
#9569 males are in the sample
table(cdc$gender)
## 
##     m     f 
##  9569 10431
#Approximately 23% of people in the sample report being in excellent health
table(cdc$genhlth)/length(cdc$genhlth)
## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385
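Dividing a table by the number of observations is one way to get relative frequencies; base R's `prop.table()` does the same division in a single call. A small sketch, assuming a made-up factor rather than the `cdc` columns:

```r
# Toy responses (made-up values, NOT from the cdc data)
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts <- table(gender)         # absolute frequencies
rel_freq <- prop.table(counts)  # each count divided by the total count

print(counts)
print(rel_freq)
```

The two approaches give the same proportions when there are no missing values, because `table()` drops `NA` by default while `length()` counts every element.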
mosaicplot(table(cdc$gender, cdc$smoke100))
According to the mosaic plot, a higher proportion of men than women report having smoked at least 100 cigarettes in their lifetime.
under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke = subset(cdc, age < 23 & smoke100 == 1)
head(under23_and_smoke)
##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f
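The same filter can be written with logical indexing instead of `subset()`. The sketch below uses a small made-up data frame (the rows are illustrative, not `cdc` observations) to show that both forms select the same rows:

```r
# Toy data frame (made-up rows, NOT from the cdc data)
toy <- data.frame(
  age      = c(19, 22, 35, 21, 50),
  smoke100 = c(1, 0, 1, 1, 1)
)

# Filter with subset(): column names are looked up inside the data frame
by_subset <- subset(toy, age < 23 & smoke100 == 1)

# Equivalent filter with a logical row index
by_index <- toy[toy$age < 23 & toy$smoke100 == 1, ]

print(by_subset)
print(by_index)
```

One difference to keep in mind: `subset()` drops rows where the condition evaluates to `NA`, whereas bracket indexing returns a row of `NA`s for each of them.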
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)
The boxplots above show the distribution of BMI for each level of general health. People reporting poor general health appear to have a larger IQR of BMI than people reporting excellent general health.
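The BMI formula used here divides weight in pounds by squared height in inches and multiplies by 703, which is approximately the factor that converts lb/in^2 to kg/m^2. A minimal sketch that wraps the formula in a helper (the name `bmi_imperial` and the input values are hypothetical, chosen only for illustration):

```r
# Hypothetical helper: BMI from weight in pounds and height in inches.
# The factor 703 approximately converts lb/in^2 to kg/m^2.
bmi_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# Made-up example inputs (NOT from the cdc data)
example_bmi <- bmi_imperial(150, 65)
print(round(example_bmi, 2))

# The function is vectorized, so it also works on whole columns
print(bmi_imperial(c(150, 185), c(65, 66)))
```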
boxplot(bmi ~ cdc$exerany, horizontal = TRUE, main = "BMI vs. Any Exercise", xlab = "BMI", ylab = "Any Exercise")
The variable I chose is exerany, because whether someone exercises probably influences their weight and therefore their BMI. According to the boxplot, people who exercise have a smaller median BMI and a smaller IQR than people who do not exercise. However, the wide spread of BMI values in both groups indicates that BMI varies considerably whether or not people exercise.
plot(cdc$weight, cdc$wtdesire, main = "Weight vs. Desired Weight", xlab = "Weight", ylab = "Desired Weight")
According to the scatterplot, there appears to be a positive, roughly linear relationship between weight and desired weight: as weight increases, desired weight tends to increase as well.
wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
#assign desired weight - weight to wdiff column in cdc dataframe
cdc$wdiff = cdc$wtdesire - cdc$weight
#first 5 values of wdiff
cdc$wdiff[1:5]
## [1]   0 -10   0  -8 -20
wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?
wdiff is a derived (calculated) variable and is numerical; R stores it as an integer because both source columns are integers. If wdiff is zero, the person's current weight equals their desired weight, which suggests they are satisfied with their weight. If wdiff is positive, the person wants to gain weight. If wdiff is negative, the person wants to lose weight.
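The sign interpretation can be made explicit by labelling each value. This sketch uses the first five wdiff values printed earlier plus one made-up positive value (the final 15 is an assumption added so that every category appears), and the label names are arbitrary:

```r
# First five wdiff values from the output above, plus one made-up
# positive value (15) so that all three cases are represented
wdiff_toy <- c(0, -10, 0, -8, -20, 15)

# Label each observation by the sign of desired weight minus current weight
goal <- ifelse(wdiff_toy > 0, "gain",
        ifelse(wdiff_toy < 0, "lose", "maintain"))

print(table(goal))
```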
wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
hist(cdc$wdiff, main = "Histogram of Weight Difference")
summary(cdc$wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00
According to the histogram and the numerical summary, wdiff is centered near -10 (the median). The bulk of the distribution is fairly symmetric and tightly concentrated, with the middle 50% of values between -21 and 0, although a few extreme values stretch the range from -300 to 500. It appears that most people are slightly dissatisfied with their weight and would prefer to lose a few pounds.
summary(cdc$wdiff[cdc$gender == "m"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -20.00   -5.00  -10.71    0.00  500.00
summary(cdc$wdiff[cdc$gender == "f"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -27.00  -10.00  -18.15    0.00   83.00
boxplot(cdc$wdiff ~ cdc$gender, horizontal = TRUE, xlab = "Weight Difference", ylab = "Gender", main = "Boxplot of Weight Difference by Gender")
According to the summaries, both men and women typically want to lose weight, but women want to lose somewhat more than men. Both the mean and the median weight difference are larger in magnitude for women (mean -18.15, median -10) than for men (mean -10.71, median -5).
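Per-group statistics like these can also be computed in one call with `tapply()`, which applies a function to a vector split by a grouping variable. A minimal sketch on a made-up data frame (the rows are illustrative, not `cdc` observations):

```r
# Toy data (made-up rows, NOT from the cdc data)
toy <- data.frame(
  gender = c("m", "m", "m", "f", "f", "f"),
  wdiff  = c(0, -5, -20, -10, -30, 0)
)

# Median weight difference within each gender group
med_by_gender <- tapply(toy$wdiff, toy$gender, median)
print(med_by_gender)
```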
weight and determine what proportion of the weights are within one standard deviation of the mean.
#mean
mean(cdc$weight)
## [1] 169.683
#standard deviation
sd(cdc$weight)
## [1] 40.08097
#About 70% of observations are within one standard deviation of the mean
sum(cdc$weight > (mean(cdc$weight) - sd(cdc$weight)) & cdc$weight < (mean(cdc$weight) + sd(cdc$weight)))/length(cdc$weight)
## [1] 0.7076
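The proportion within one standard deviation can be written more compactly by noting that the mean of a logical vector is the proportion of `TRUE` values, and that being strictly within one standard deviation of the mean is the same as `abs(x - mean(x)) < sd(x)`. A sketch on a made-up vector (the values are illustrative, not from `cdc`):

```r
# Toy weights in pounds (made-up values, NOT from the cdc data)
w <- c(120, 150, 160, 170, 180, 200, 260)

# TRUE where a value lies strictly within one SD of the mean
within_1sd <- abs(w - mean(w)) < sd(w)

# The mean of a logical vector is the proportion of TRUE values
prop_within <- mean(within_1sd)
print(prop_within)
```

This gives the same result as the sum-over-length expression above, just with the mean and standard deviation written once each.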