source("more/cdc.R")
dim(cdc)
## [1] 20000 9
There are 20,000 cases and 9 variables in this data set. The variables are:
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
## [7] "wtdesire" "age" "gender"
The first rows of the data, followed by the data type of each variable:
head(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 4 good 1 1 0 66 132 124 42 f
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
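genhlth - Categorical - Ordinal
exerany - Numerical - Discrete (a 0/1 indicator, so effectively a binary categorical variable)
hlthplan - Numerical - Discrete (a 0/1 indicator, so effectively a binary categorical variable)
smoke100 - Numerical - Discrete (a 0/1 indicator, so effectively a binary categorical variable)
height - Numerical - Continuous
weight - Numerical - Continuous
wtdesire - Numerical - Continuous
age - Numerical - Continuous
gender - Categorical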
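The classification above can be cross-checked against how R stores each column. The sketch below is only illustrative: `first_rows` is a hand-typed copy of some columns from the six rows printed by `head(cdc)`, not the real `cdc` object, so the storage classes it reports may differ from those in the actual data set. The "no"/"yes" labels are an assumption about what 0 and 1 mean.

```r
# Hand-typed copy of four columns from the six rows shown by head(cdc).
# This is NOT the real cdc data frame, only an illustration.
first_rows <- data.frame(
  genhlth = c("good", "good", "good", "good", "very good", "very good"),
  exerany = c(0, 0, 1, 1, 0, 1),
  height  = c(70, 64, 60, 66, 61, 64),
  gender  = c("m", "f", "f", "f", "f", "f"),
  stringsAsFactors = FALSE
)

# Storage class of each column in this small frame
col_classes <- sapply(first_rows, class)
print(col_classes)

# Treat the 0/1 indicator as a labelled categorical variable.
# The labels "no" and "yes" are assumed meanings of 0 and 1.
exer_factor <- factor(first_rows$exerany, levels = c(0, 1), labels = c("no", "yes"))
print(table(exer_factor))
```

On the full data, `str(cdc)` gives the same kind of overview in one call.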
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
70.00 - 64.00
## [1] 6
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
57.00 - 31.00
## [1] 26
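The two subtractions above are interquartile ranges (third quartile minus first quartile). As a small sketch of the same calculation, base R's `IQR()` returns that difference directly. The vector `heights` below holds only the six heights printed by `head(cdc)`, so its result differs from the full-data value of 6.

```r
# Heights of the six rows shown by head(cdc); an illustration, not the full data.
heights <- c(70, 64, 60, 66, 61, 64)

# Quartiles using R's default quantile method
q <- quantile(heights, probs = c(0.25, 0.75))

# IQR by hand (Q3 - Q1) and via the built-in helper
iqr_by_hand <- unname(q[2] - q[1])
iqr_builtin <- IQR(heights)

print(iqr_by_hand)
print(iqr_builtin)
```

Assuming `cdc` is loaded, `IQR(cdc$height)` and `IQR(cdc$age)` should agree with the differences computed from the `summary()` output.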
table(cdc$gender)/20000
##
## m f
## 0.47845 0.52155
table(cdc$exerany)/20000
##
## 0 1
## 0.2543 0.7457
table(cdc$gender)
##
## m f
## 9569 10431
There are 9569 males in the sample.
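The relative frequencies above were obtained by dividing by the hard-coded sample size of 20000. A minimal alternative sketch uses `prop.table()`, which divides by the total of whatever counts it is given. The counts below are the gender counts reported in the output above, typed in by hand.

```r
# Gender counts as reported by table(cdc$gender)
gender_counts <- c(m = 9569, f = 10431)

# prop.table() divides each count by the sum of all counts
gender_props <- prop.table(gender_counts)
print(gender_props)
```

With the data loaded, `prop.table(table(cdc$gender))` gives the same proportions and keeps working if rows are later added or filtered out.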
library(plyr)
count(cdc, "genhlth")
## genhlth freq
## 1 excellent 4657
## 2 very good 6972
## 3 good 5675
## 4 fair 2019
## 5 poor 677
We can see that 4657 people report being in excellent health.
4657/20000 * 100
## [1] 23.285
So the proportion is 23.285%.
mosaicplot(table(cdc$gender, cdc$smoke100))
From the mosaic plot we can see that more males than females reported having smoked at least 100 cigarettes in their lifetime, and that more females than males reported not having done so.
# smoke100 is stored as 0/1, so compare against the number 1 rather than the string "1"
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
nrow(under23_and_smoke)
## [1] 620
There are 620 respondents who are under 23 and smoked 100 cigarettes in their lifetime.
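The filter above combines two conditions with `&`. As an illustrative sketch, the same pattern is applied below to the ages and `smoke100` values of the six rows printed by `head(cdc)`. The cutoff of 50 is chosen only because none of those six respondents is under 23.

```r
# Ages and smoke100 flags of the six rows shown by head(cdc)
first_rows <- data.frame(
  age      = c(77, 33, 49, 42, 55, 55),
  smoke100 = c(0, 1, 1, 0, 0, 0)
)

# Keep rows that satisfy both conditions (hypothetical cutoff of 50)
under50_and_smoke <- subset(first_rows, age < 50 & smoke100 == 1)
n_subset <- nrow(under50_and_smoke)

# Equivalent count without building the subset: TRUE counts as 1 in sum()
n_logical <- sum(first_rows$age < 50 & first_rows$smoke100 == 1)

print(n_subset)
print(n_logical)
```

Summing the logical vector avoids creating an intermediate data frame when only the count is needed.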
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$age)
Plotting BMI against age shows that younger respondents tend to have lower BMI. BMI rises through middle age, as respondents carry more weight relative to their height, and then falls again among the oldest respondents.
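As a worked instance of the BMI formula, the sketch below applies it to the six rows printed by `head(cdc)` and then groups the ages into bands. The band boundaries and the names `age_band` and `median_bmi_by_band` are assumptions made for illustration, not part of the original analysis.

```r
# Height (in), weight (lb) and age of the six rows shown by head(cdc)
first_rows <- data.frame(
  height = c(70, 64, 60, 66, 61, 64),
  weight = c(175, 125, 105, 132, 150, 114),
  age    = c(77, 33, 49, 42, 55, 55)
)

# BMI from pounds and inches: weight / height^2 * 703
first_rows$bmi <- first_rows$weight / first_rows$height^2 * 703

# Hypothetical age bands, so a plot would show a few boxes instead of one per year
first_rows$age_band <- cut(
  first_rows$age,
  breaks = c(18, 40, 60, 100),
  labels = c("18-40", "41-60", "61+"),
  include.lowest = TRUE
)

# Median BMI within each band
median_bmi_by_band <- tapply(first_rows$bmi, first_rows$age_band, median)

print(round(first_rows$bmi, 2))
print(median_bmi_by_band)
```

On the full data, a call such as `boxplot(bmi ~ age_band)` would draw one box per band, which may be easier to read than one box per year of age.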
plot(cdc$weight, cdc$wtdesire)
From this scatterplot, we can say that the heavier respondents are, the more their desired weight tends to fall below their actual weight.
cdc_temp <- cdc
cdc_temp$wdiff <- cdc$wtdesire - cdc$weight
head(cdc_temp, 5)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 4 good 1 1 0 66 132 124 42 f
## 5 very good 0 1 0 61 150 130 55 f
## wdiff
## 1 0
## 2 -10
## 3 0
## 4 -8
## 5 -20
wdiff is a numerical continuous variable. If
wdiff == 0, weight and desired weight are the same (happy with the current weight)
wdiff > 0, desired weight is higher than the actual weight (wants to gain some weight)
wdiff < 0, desired weight is lower than the actual weight (wants to lose some weight)
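The three cases above can be turned into a label for each respondent. The sketch below does this for the five rows shown in the `head(cdc_temp, 5)` output. The label names "gain", "lose" and "maintain" are hypothetical and chosen only for illustration.

```r
# weight and wtdesire of the five rows shown above
weight   <- c(175, 125, 105, 132, 150)
wtdesire <- c(175, 115, 105, 124, 130)

# Desired minus actual weight, as in cdc_temp$wdiff
wdiff <- wtdesire - weight

# Map the sign of wdiff to a label (label names are illustrative)
goal <- ifelse(wdiff == 0, "maintain", ifelse(wdiff > 0, "gain", "lose"))

goal_counts <- table(goal)
print(wdiff)
print(goal_counts)
```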
summary(cdc_temp$wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -21.00 -10.00 -14.59 0.00 500.00
boxplot(cdc_temp$wdiff)
hist(cdc_temp$wdiff, xlim = c(-100, 200))
hist(cdc_temp$wdiff, breaks = 100, xlim = c(-100, 200))
The median is -10 and the mean is -14.59, so on average respondents want to lose weight rather than gain it. The plots show a left skew, consistent with the mean lying below the median and driven by the larger number of respondents who want to lose weight.
From the first histogram, approximately 16000-17000 respondents want to lose between 0 and 50 lb, while approximately 1000-2000 respondents want to gain between 0 and 50 lb.
From the second histogram, approximately 8000-9000 respondents are within 10 lb of their desired weight (a weight difference between -10 and 10 lb).
Summary for male:
summary(subset(cdc_temp$wdiff, cdc_temp$gender == "m"))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -20.00 -5.00 -10.71 0.00 500.00
Summary for female:
summary(subset(cdc_temp$wdiff, cdc_temp$gender == "f"))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -27.00 -10.00 -18.15 0.00 83.00
Side-by-side box plot:
boxplot(cdc_temp$wdiff ~ cdc_temp$gender)
From the summaries, the median for females is -10 while for males it is -5, which indicates that females are more inclined than males to want to lose weight.
The side-by-side box plot suggests that males might be more likely than females to want to gain some weight.
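The impression that males may be more likely to want to gain weight could be checked numerically by computing, per gender, the share of respondents with a positive weight difference. The sketch below shows one way to do that. The values in `wdiff` and `gender` are made up purely for illustration and do not come from the `cdc` data.

```r
# Made-up example values, NOT taken from the cdc data
wdiff  <- c(10, -5, 0, 20, -15, -30, 5, -10)
gender <- c("m", "m", "m", "m", "f", "f", "f", "f")

# wdiff > 0 is TRUE for respondents who want to gain weight;
# the mean of a logical vector is the share of TRUE values
gain_share <- tapply(wdiff > 0, gender, mean)

# Median difference per gender, mirroring the two summaries above
median_diff <- tapply(wdiff, gender, median)

print(gain_share)
print(median_diff)
```

Assuming `cdc_temp` is loaded, `tapply(cdc_temp$wdiff > 0, cdc_temp$gender, mean)` would apply the same check to the real data.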
Overall summary:
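# Named weight_mean and weight_sd so the base functions mean() and sd() are not masked
weight_mean <- mean(cdc$weight)
Standard deviation:
weight_sd <- sd(cdc$weight)
Determining proportion of weights:
# Rows whose weight lies strictly within one standard deviation of the mean
within_one_sd <- subset(cdc, weight > weight_mean - weight_sd & weight < weight_mean + weight_sd)
proportion <- nrow(within_one_sd) / nrow(cdc)
print(weight_mean)
## [1] 169.683
print(weight_sd)
## [1] 40.08097
print(proportion)
## [1] 0.7076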
That means 70.76% of the weights are within one standard deviation of the mean, 169.683.
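The same proportion can be computed without building a subset, by averaging a logical vector. The sketch below applies this to the six weights printed by `head(cdc)`, so its result differs from the full-data figure. The names `weights`, `w_mean`, `w_sd` and `prop_within` are illustrative.

```r
# Weights (lb) of the six rows shown by head(cdc); an illustration, not the full data
weights <- c(175, 125, 105, 132, 150, 114)

w_mean <- mean(weights)
w_sd   <- sd(weights)

# TRUE where a weight lies strictly within one standard deviation of the mean;
# the mean of the logical vector is the proportion of TRUE values
prop_within <- mean(abs(weights - w_mean) < w_sd)

print(w_mean)
print(prop_within)
```

Assuming `cdc` is loaded, `mean(abs(cdc$weight - mean(cdc$weight)) < sd(cdc$weight))` should reproduce the 0.7076 reported above.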