library(knitr)
source("http://www.openintro.org/stat/data/cdc.R")
names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"
head(cdc)
##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f
Question 1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).
cases_count: 20000
variables_count: 9

variable   data type
genhlth    categorical (ordinal)
exerany    discrete (binary, coded 0/1)
hlthplan   discrete (binary, coded 0/1)
smoke100   discrete (binary, coded 0/1)
height     continuous
weight     continuous
wtdesire   discrete
age        discrete
gender     categorical
Question 2. Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
cases_count<-nrow(cdc)  # number of respondents, used as the denominator below
iqr_height<-IQR(cdc$height)  # third quartile minus first quartile
iqr_age<-IQR(cdc$age)
relative_frequency_gender<-table(cdc$gender)/cases_count
relative_frequency_exerany<-table(cdc$exerany)/cases_count
male_count<-table(cdc$gender)["m"]  # select by level name rather than position
excellent_health_reported<-table(cdc$genhlth)["excellent"]/cases_count
iqr_height: 6
iqr_age: 26
relative_frequency_gender: 0.47845 male / 0.52155 female
relative_frequency_exerany: 0.2543 did not exercise (0) / 0.7457 exercised (1)
male_count: 9569
excellent_health_reported: 23.285 %
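
The same quantities can also be obtained with R's built-in helpers IQR() and prop.table(). The following is a minimal sketch, assuming a toy data frame (`toy`, a name introduced here only for illustration) built from the six rows printed by head(cdc) above, so its numbers differ from the full-sample results:

```r
# Toy data: the six rows shown by head(cdc); `toy` is an illustrative name
toy <- data.frame(
  height = c(70, 64, 60, 66, 61, 64),
  age    = c(77, 33, 49, 42, 55, 55),
  gender = c("m", "f", "f", "f", "f", "f")
)

# IQR() returns the third quartile minus the first quartile
iqr_height_toy <- IQR(toy$height)
iqr_age_toy    <- IQR(toy$age)

# prop.table() converts a table of counts into relative frequencies
rel_freq_gender_toy <- prop.table(table(toy$gender))
print(rel_freq_gender_toy)
```

On the full data set, replacing `toy` with `cdc` should reproduce the values reported above.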
Question 3. What does the mosaic plot reveal about smoking habits and gender?
table(cdc$gender,cdc$smoke100)
##    
##        0    1
##   m 4547 5022
##   f 6012 4419
mosaicplot(table(cdc$gender,cdc$smoke100))

The mosaic plot shows that the (m, 1) tile is larger than the (f, 1) tile: a larger share of men than women report having smoked at least 100 cigarettes in their lifetime. The following table of within-gender percentages confirms this:

gender  smoked_fewer_than_100  smoked_100_or_more
m       48 %                   52 %
f       58 %                   42 %

The results show that men are more likely than women to have smoked at least 100 cigarettes (52% versus 42%).
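
The within-gender percentages can be derived directly from the contingency table. The following is a minimal sketch that re-enters the counts printed by table(cdc$gender, cdc$smoke100) above by hand; the names `smoke_tab` and `row_pct` are introduced here only for illustration:

```r
# Counts copied from the table(cdc$gender, cdc$smoke100) output above
smoke_tab <- as.table(matrix(
  c(4547, 6012, 5022, 4419), nrow = 2,
  dimnames = list(gender = c("m", "f"), smoke100 = c("0", "1"))
))

# margin = 1 divides each row by its row total, giving shares within gender
row_pct <- round(100 * prop.table(smoke_tab, margin = 1))
print(row_pct)
```

With the full data loaded, prop.table(table(cdc$gender, cdc$smoke100), margin = 1) should give the same shares without retyping the counts.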

Question 4. Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke<-subset(cdc, smoke100==1 & age<23)
head(under23_and_smoke)
##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f
Question 5. What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

This box plot shows how BMI is distributed within each level of general health. To see how BMI relates to whether participants exercised in the past month, let’s draw a box plot using cdc$exerany:

boxplot(bmi ~ cdc$exerany)

The box for group 0 (on the left) sits clearly higher: participants who did not exercise in the past month tend to have higher BMIs than those who did.
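
A numerical summary can back up what the box plot suggests. The following is a minimal sketch, assuming toy vectors built from the six rows printed by head(cdc) earlier (the `_toy` names are introduced here only for illustration), so it demonstrates the technique rather than the full-sample result:

```r
# Toy data: weight (lb), height (in) and exerany from the six head(cdc) rows
weight_toy  <- c(175, 125, 105, 132, 150, 114)
height_toy  <- c(70, 64, 60, 66, 61, 64)
exerany_toy <- c(0, 0, 1, 1, 0, 1)

# Same BMI formula as above: weight in pounds, height in inches
bmi_toy <- (weight_toy / height_toy^2) * 703

# Median BMI within each exercise group (0 = did not exercise, 1 = exercised)
median_bmi_toy <- tapply(bmi_toy, exerany_toy, median)
print(median_bmi_toy)
```

On the full data set, tapply(bmi, cdc$exerany, median) would give the two medians drawn as the thick lines in the box plot.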


On Your Own

Desired weight and current weight are positively correlated: when one increases, so does the other. The plot is skewed to the right.

wdiff<-cdc$wtdesire-cdc$weight

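1- wdiff is a discrete (integer) variable
2- wdiff = 0: the respondent is at their desired weight
3- wdiff > 0: the desired weight is above the current weight, so the respondent wants to gain weight
4- wdiff < 0: the desired weight is below the current weight, so the respondent wants to lose weight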

summary(wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00
hist(wdiff, breaks=70)

Taken together, the two outputs above show that the histogram peaks at and just below 0, and that both the median (-10.00) and the mean (-14.59) are negative. Most respondents are therefore above their desired weight and want to lose weight.
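
The share of respondents who want to lose weight can also be counted directly from the sign of wdiff, since a negative value means the desired weight is below the current weight. The following is a minimal sketch, assuming toy vectors taken from the six rows printed by head(cdc) earlier (the `_toy` names and `goal` are introduced here only for illustration):

```r
# Toy data: weight and wtdesire from the six head(cdc) rows
weight_toy   <- c(175, 125, 105, 132, 150, 114)
wtdesire_toy <- c(175, 115, 105, 124, 130, 114)
wdiff_toy    <- wtdesire_toy - weight_toy

# Negative difference: wants to lose; positive: wants to gain; zero: keep
goal <- ifelse(wdiff_toy < 0, "lose",
               ifelse(wdiff_toy > 0, "gain", "keep"))
print(table(goal))

# The mean of a logical vector is the proportion of TRUE values
share_lose <- mean(wdiff_toy < 0)
print(share_lose)
```

Applied to the full wdiff vector, mean(wdiff < 0) gives the proportion of the sample that wants to lose weight.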

men_weight m_desired_weight women_weight f_desired_weight m_diff f_diff
1811629 1709182 1582030 1392695 102447 189335

m_diff = men_weight - m_desired_weight = 102447, f_diff = women_weight - f_desired_weight = 189335

As the table shows, m_diff < f_diff: in total, men are closer to their desired weight than women are, so men appear to want less change in their weight.
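
Because the sample contains more women than men, the totals are easier to compare per person. The following is a minimal sketch that reuses the totals from the table above and the group sizes implied by the table(cdc$gender, cdc$smoke100) output; the names `n_men`, `n_women`, `avg_m` and `avg_f` are introduced here only for illustration:

```r
# Totals reported in the table above
m_diff <- 102447
f_diff <- 189335

# Group sizes: row sums of the gender by smoke100 table shown earlier
n_men   <- 4547 + 5022
n_women <- 6012 + 4419

# Average gap between current and desired weight per respondent
avg_m <- m_diff / n_men
avg_f <- f_diff / n_women
print(round(c(men = avg_m, women = avg_f), 1))
```

The per-person averages (about 10.7 for men and 18.2 for women) point the same way as the totals.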

70.76% of the weights are within 1 standard deviation of the mean.
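
One way to compute such a proportion is to test each weight against the interval from mean minus sd to mean plus sd. The following is a minimal sketch, assuming a toy vector with the six weights printed by head(cdc) earlier (`weight_toy` and `prop_within` are names introduced here only for illustration), so the result differs from the full-sample figure:

```r
# Toy data: the six weights shown by head(cdc)
weight_toy <- c(175, 125, 105, 132, 150, 114)

w_mean <- mean(weight_toy)
w_sd   <- sd(weight_toy)

# TRUE where a weight lies within one standard deviation of the mean
within_1sd <- abs(weight_toy - w_mean) <= w_sd

# The mean of a logical vector is the proportion of TRUE values
prop_within <- mean(within_1sd)
print(prop_within)
```

Running the same steps on cdc$weight should reproduce the percentage reported above.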