library(knitr)
source("http://www.openintro.org/stat/data/cdc.R")
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
## [7] "wtdesire" "age" "gender"
head(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 4 good 1 1 0 66 132 124 42 f
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
cases_count | variables_count | genhlth | exerany | hlthplan | smoke100 | height | weight | wtdesire | age | gender |
---|---|---|---|---|---|---|---|---|---|---|
20000 | 9 | categorical ordinal | discrete | discrete | discrete | continuous | continuous | discrete | discrete | categorical |
Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
iqr_height<-summary(cdc$height)[5]-summary(cdc$height)[2]
iqr_age<-summary(cdc$age)[5]-summary(cdc$age)[2]
relative_frequency_gender<-table(cdc$gender)/cases_count
relative_frequency_exerany<-table(cdc$exerany)/cases_count
male_count<-table(cdc$gender)[1]
excellent_health_reported<-table(cdc$genhlth)[1]/cases_count
iqr_height | iqr_age | relative_frequency_gender | relative_frequency_exerany | male_count | excellent_health_reported |
---|---|---|---|---|---|
6 | 26 | 0.47845 male / 0.52155 female | 0.2543 no (0) / 0.7457 yes (1) | 9569 | 23.285 % |
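As a cross-check on the approach above, base R's IQR() and prop.table() compute the same kinds of quantities directly. The sketch below uses a small made-up data frame named toy (an assumption for illustration, not the cdc data), so its numbers do not match the table.

```r
# Toy stand-in for cdc; all values are made up for illustration
toy <- data.frame(
  height  = c(60, 64, 66, 67, 70, 72, 74, 75),
  gender  = factor(c("m", "f", "f", "m", "f", "m", "f", "f"), levels = c("m", "f")),
  exerany = c(1, 0, 1, 1, 0, 1, 1, 1)
)

iqr_height  <- IQR(toy$height)                  # 3rd quartile minus 1st quartile
rel_gender  <- prop.table(table(toy$gender))    # proportions that sum to 1
rel_exerany <- prop.table(table(toy$exerany))   # share of 0 (no) and 1 (yes)
male_count  <- sum(toy$gender == "m")           # count by level name, not by position
```

Counting with toy$gender == "m" avoids relying on the order of the factor levels, which table(...)[1] depends on.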
table(cdc$gender,cdc$smoke100)
##
## 0 1
## m 4547 5022
## f 6012 4419
mosaicplot(table(cdc$gender,cdc$smoke100))
The mosaic plot shows that the (m, 1) box is larger than the (f, 1) box, meaning that a larger share of men than women have smoked at least 100 cigarettes. The following table gives the percentages within each gender to back this up:
m_smoking_less_or_100 | f_smoking_less_or_100 | m_smoking_above_or_100 | f_smoking_above_or_100 |
---|---|---|---|
48 % | 58 % | 52 % | 42 % |
The results show that men are more likely than women to have smoked at least 100 cigarettes (52% versus 42%).
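One way to obtain percentages within each gender is prop.table() with margin = 1, which divides every row by its row total. The sketch below rebuilds the contingency table from the counts printed earlier rather than loading cdc; the object names tab and row_props are illustrative.

```r
# Counts copied from the table(cdc$gender, cdc$smoke100) output
tab <- matrix(c(4547, 6012, 5022, 4419), nrow = 2,
              dimnames = list(gender = c("m", "f"), smoke100 = c("0", "1")))

# margin = 1 normalises within rows, so each gender's shares sum to 1
row_props <- prop.table(tab, margin = 1)
round(100 * row_props, 1)
```

With the real data, prop.table(table(cdc$gender, cdc$smoke100), margin = 1) should give the same row proportions.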
Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke<-subset(cdc, cdc$smoke100==1 & cdc$age<23)
head(under23_and_smoke)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13 excellent 1 0 1 66 185 220 21 m
## 37 very good 1 0 1 70 160 140 18 f
## 96 excellent 1 1 1 74 175 200 22 m
## 180 good 1 1 1 64 190 140 20 f
## 182 very good 1 1 1 62 92 92 21 f
## 240 very good 1 0 1 64 125 115 22 f
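Inside subset(), column names can be written without the cdc$ prefix because the condition is evaluated within the data frame. The sketch below shows this on a five-row toy data frame whose smoke100 and age values are taken from rows 1, 2, 13, 37, and 96 of the printed output; toy is a hypothetical name.

```r
# Five rows (smoke100 and age only) taken from the printed output above
toy <- data.frame(
  smoke100 = c(0, 1, 1, 1, 1),
  age      = c(77, 33, 21, 18, 22)
)

# Bare column names are looked up inside toy
young_smokers <- subset(toy, smoke100 == 1 & age < 23)

# Equivalent logical indexing, with the data frame spelled out
same_rows <- toy[toy$smoke100 == 1 & toy$age < 23, ]
```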
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)
This boxplot shows the relationship between BMI and general health. To see whether BMI differs between participants who did and did not exercise in the past month, draw the same box plot grouped by cdc$exerany:
boxplot(bmi ~ cdc$exerany)
The box for 0 on the left shows that participants who did not exercise in the past month tended to have higher BMIs than those who did.
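A numerical companion to the box plot is the median BMI per group, which tapply() can compute. The sketch below uses only the six rows printed by head(cdc), so it illustrates the mechanics and is not representative of all 20000 cases.

```r
# height, weight and exerany for the six rows shown by head(cdc)
height  <- c(70, 64, 60, 66, 61, 64)
weight  <- c(175, 125, 105, 132, 150, 114)
exerany <- c(0, 0, 1, 1, 0, 1)

# Same BMI formula as above (pounds and inches)
bmi <- (weight / height^2) * 703

# Median BMI within each exercise group (names are "0" and "1")
median_bmi <- tapply(bmi, exerany, median)
```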
Desired weight and current weight are positively associated: when one increases, so does the other. The distribution is skewed to the right.
Consider the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
wdiff<-cdc$wtdesire-cdc$weight
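What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?
1- wdiff is discrete (integer-valued)
2- wdiff = 0: current weight equals desired weight
3- wdiff > 0: desired weight is above current weight, so the person wants to gain weight
4- wdiff < 0: desired weight is below current weight, so the person wants to lose weight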
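Because wdiff is defined as wtdesire minus weight, its sign can be checked on a few rows. The sketch below uses the weight and wtdesire values of rows 1, 2, and 13 from the printed output; the goal labels are illustrative names, not part of the data set.

```r
# weight and wtdesire for rows 1, 2 and 13 of the printed output
weight   <- c(175, 125, 185)
wtdesire <- c(175, 115, 220)

wdiff <- wtdesire - weight   # desired minus current

# Positive: wants to weigh more; negative: wants to weigh less
goal <- ifelse(wdiff > 0, "gain",
               ifelse(wdiff < 0, "lose", "no change"))
```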
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
summary(wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -21.00 -10.00 -14.59 0.00 500.00
hist(wdiff, breaks=70)
From the two outputs above, the histogram peaks at and just below 0, and both the median (-10.00) and the mean (-14.59) are negative, which means that most respondents weigh more than they would like and want to lose weight.
men_weight | m_desired_weight | women_weight | f_desired_weight | m_diff | f_diff |
---|---|---|---|---|---|
1811629 | 1709182 | 1582030 | 1392695 | 102447 | 189335 |
m_diff = men_weight - m_desired_weight = 102447, f_diff = women_weight - f_desired_weight = 189335
As the table shows, m_diff < f_diff. In total, men want to change their weight less than women do.
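The two totals cover groups of different sizes (9569 men and 10431 women in the gender-by-smoke100 table), so a per-person average may be a fairer comparison. The sketch below only redoes the arithmetic with numbers copied from the tables; m_n, f_n, m_avg, and f_avg are illustrative names.

```r
# Totals and group sizes copied from the tables above
m_diff <- 102447
f_diff <- 189335
m_n    <- 9569     # men:   4547 + 5022
f_n    <- 10431    # women: 6012 + 4419

# Average gap between current and desired weight per person
m_avg <- m_diff / m_n
f_avg <- f_diff / f_n
```

Under these figures the average gap stays larger for women than for men, so the conclusion holds per person as well.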
Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.
 | mean_weight | weight_sd | sd_proportions_one |
---|---|---|---|
Mean | 169.683 | 40.08097 | 0.7076 |
70.76% of the weights are within one standard deviation of the mean.
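One way to compute such a proportion is to take the mean of a logical vector that flags values within one standard deviation of the mean. The sketch below uses a made-up vector named weights (an assumption, not cdc$weight), so its result differs from the 70.76% reported above.

```r
# Made-up weights for illustration only
weights <- c(120, 140, 150, 160, 170, 180, 190, 200, 230, 300)

w_mean <- mean(weights)
w_sd   <- sd(weights)    # sample standard deviation (n - 1 denominator)

# TRUE where a weight lies within one SD of the mean; mean() of a
# logical vector is the proportion of TRUE values
within_one_sd <- mean(abs(weights - w_mean) <= w_sd)
```

Replacing weights with cdc$weight would apply the same calculation to the survey data.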