Loading the CDC dataset:

source("more/cdc.R")
  1. How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).
dim(cdc)
## [1] 20000     9

There are 20,000 cases and 9 variables in this data set. The variables are:

names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

The data type of each variable:

head(cdc)
##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f

genhlth - Categorical - Ordinal
exerany - Categorical - Binary (coded 0/1)
hlthplan - Categorical - Binary (coded 0/1)
smoke100 - Categorical - Binary (coded 0/1)
height - Numerical - Continuous
weight - Numerical - Continuous
wtdesire - Numerical - Continuous
age - Numerical - Continuous
gender - Categorical
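
Because exerany, hlthplan, and smoke100 are stored as 0/1 codes, R treats them as numbers by default. The following is a minimal sketch of making their yes/no meaning explicit with factor(). It assumes a small hypothetical data frame named toy (not the real cdc object), with values copied from the head(cdc) output above; the labels "no" and "yes" are also an assumption.

```r
# Hypothetical stand-in for cdc: the first six rows of the three indicator columns
toy <- data.frame(
  exerany  = c(0, 0, 1, 1, 0, 1),
  hlthplan = c(1, 1, 1, 1, 1, 1),
  smoke100 = c(0, 1, 1, 0, 0, 0)
)

# Recode every 0/1 column as a two-level categorical variable
toy[] <- lapply(toy, factor, levels = c(0, 1), labels = c("no", "yes"))

str(toy)             # each column is now a factor with levels "no" and "yes"
table(toy$smoke100)  # counts per category rather than a numeric summary
```

After this recoding, summary() and table() report counts per category instead of means and quartiles.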

  1. Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

Height summary and interquartile range:

summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
70.00-64.00
## [1] 6

Age summary and interquartile range:

summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
57.00-31.00
## [1] 26
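
The two interquartile ranges above were computed by hand from the summary() output. As a sketch, the same quantity can be obtained directly with IQR(); the vector heights below is hypothetical and is not taken from the cdc data.

```r
# Hypothetical values, chosen so the quartiles are easy to read off
heights <- c(60, 64, 67, 70, 72)

# Manual route: third quartile minus first quartile
quartiles <- quantile(heights, c(0.25, 0.75))
iqr_manual <- unname(quartiles[2] - quartiles[1])

# Built-in route: the same quantity in one call
iqr_builtin <- IQR(heights)

print(iqr_manual)
print(iqr_builtin)
```

On the real data, IQR(cdc$height) and IQR(cdc$age) would avoid retyping the quartiles.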

Relative frequency distribution for gender:

table(cdc$gender)/20000
## 
##       m       f 
## 0.47845 0.52155

Relative frequency distribution for exerany:

table(cdc$exerany)/20000
## 
##      0      1 
## 0.2543 0.7457
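
Dividing by 20000 works here because the sample size is known. A sketch of an alternative that does not hard-code the sample size uses prop.table(); the vector gender below is hypothetical and stands in for cdc$gender.

```r
# Hypothetical stand-in for cdc$gender
gender <- c("m", "f", "f", "m", "f")

counts <- table(gender)         # absolute frequencies
rel_freq <- prop.table(counts)  # divides each count by the total

print(rel_freq)
```

The relative frequencies always sum to 1, whatever the number of rows.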

Number of males in sample:

table(cdc$gender)
## 
##     m     f 
##  9569 10431

There are 9569 males in the sample.

Proportion of respondents in excellent health:

library(plyr)
count(cdc, "genhlth")
##     genhlth freq
## 1 excellent 4657
## 2 very good 6972
## 3      good 5675
## 4      fair 2019
## 5      poor  677

We can see that 4657 people report being in excellent health.

4657/20000 *100
## [1] 23.285

So the proportion is 0.23285, or 23.285%.
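
The arithmetic can be checked from the frequency table alone. The sketch below re-enters the counts printed by count(cdc, "genhlth") as a named vector; the names genhlth_counts and genhlth_prop are introduced here for illustration.

```r
# Frequencies copied from the count(cdc, "genhlth") output above
genhlth_counts <- c(excellent = 4657, "very good" = 6972, good = 5675,
                    fair = 2019, poor = 677)

n <- sum(genhlth_counts)            # total number of respondents
genhlth_prop <- genhlth_counts / n  # relative frequency of each level

print(n)
print(genhlth_prop["excellent"])
```

The counts add up to the 20,000 cases reported by dim(cdc), and the excellent share matches the value computed above.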

  1. What does the mosaic plot reveal about smoking habits and gender?
mosaicplot(table(cdc$gender, cdc$smoke100))

From the mosaic plot we can see that a larger proportion of males than females reported having smoked at least 100 cigarettes in their lifetime. Correspondingly, a larger proportion of females reported not having smoked 100 cigarettes in their lifetime.

  1. Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke <- subset(cdc, age < 23 & smoke100 == 1)
nrow(under23_and_smoke)
## [1] 620

There are 620 respondents who are under 23 and smoked 100 cigarettes in their lifetime.
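
For illustration, the same kind of filter can be written with logical indexing instead of subset(). This sketch uses a small hypothetical data frame named toy, not the cdc data.

```r
# Hypothetical respondents
toy <- data.frame(
  age      = c(19, 22, 23, 40, 21),
  smoke100 = c(1, 0, 1, 1, 1)
)

# subset() evaluates the condition using the data frame's column names
by_subset <- subset(toy, age < 23 & smoke100 == 1)

# Equivalent base-R form: a logical vector selects the matching rows
keep <- toy$age < 23 & toy$smoke100 == 1
by_index <- toy[keep, ]

print(by_subset)
print(nrow(by_index))
```

Both forms keep only rows where both conditions hold; a respondent aged exactly 23 is excluded by the strict inequality.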

  1. What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$age)

Age is a numerical variable rather than a categorical one, so this box plot draws one box for each distinct age. It suggests that younger respondents tend to have lower BMI and that BMI rises through middle age. Among the oldest respondents BMI falls again, meaning their weight is lower relative to their height.
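
One way to treat age as a categorical variable is to bin it into bands with cut(). The sketch below is only an illustration: the data frame toy holds the six rows shown by head(cdc), and the band boundaries and labels are assumptions chosen for the example.

```r
# The six rows printed by head(cdc); toy is a stand-in for the full data
toy <- data.frame(
  height = c(70, 64, 60, 66, 61, 64),
  weight = c(175, 125, 105, 132, 150, 114),
  age    = c(77, 33, 49, 42, 55, 55)
)

# Same BMI formula as above (pounds and inches)
toy$bmi <- (toy$weight / toy$height^2) * 703

# Assumed age bands: (17, 34], (34, 54], (54, Inf)
toy$age_group <- cut(toy$age, breaks = c(17, 34, 54, Inf),
                     labels = c("18-34", "35-54", "55+"))

# Median BMI within each band
print(tapply(toy$bmi, toy$age_group, median))
```

With the bands in place, boxplot(bmi ~ age_group, data = toy) would draw one box per band rather than one per year of age.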

On my Own:

  1. Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
plot(cdc$weight, cdc$wtdesire)

From this scatterplot, we can say that the desire for a lower weight increases as respondents' current weight increases.

  1. Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
cdc_temp <- cdc
cdc_temp$wdiff <- cdc$wtdesire - cdc$weight
head(cdc_temp, 5)
##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
##   wdiff
## 1     0
## 2   -10
## 3     0
## 4    -8
## 5   -20
  1. What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?

wdiff is a numerical continuous variable.
If,
wdiff == 0, weight and desired weight are the same (happy with the current weight)
wdiff > 0, desired weight is higher than the actual weight (wants to gain some weight)
wdiff < 0, desired weight is lower than the actual weight (wants to lose some weight)
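
The three cases can be turned into labels with nested ifelse() calls. In this sketch the first five wdiff values come from the head(cdc_temp, 5) output above, the last value (15) is hypothetical and added so that every case appears, and the labels "keep", "gain", and "lose" are assumptions.

```r
# Five values from the head() output plus one hypothetical positive value
wdiff <- c(0, -10, 0, -8, -20, 15)

# Label each respondent by the sign of wdiff
goal <- ifelse(wdiff == 0, "keep",
               ifelse(wdiff > 0, "gain", "lose"))

print(table(goal))
```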

  1. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
summary(cdc_temp$wdiff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00
boxplot(cdc_temp$wdiff)

hist(cdc_temp$wdiff, xlim = c(-100, 200))

hist(cdc_temp$wdiff, breaks = 100, xlim = c(-100, 200))

The median is -10 and the mean is -14.59, so on average respondents want to lose weight rather than gain it. The plots show a left skew, driven by the larger number of respondents who want to lose weight.

From the first histogram, approximately 16000-17000 respondents want to lose between 0 and 50 lb, while approximately 1000-2000 respondents want to gain between 0 and 50 lb.

From the second histogram, approximately 8000-9000 respondents are within about 10 lb of their desired weight (wdiff between -10 and 10).
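
Counts read off a histogram are approximate. As a sketch, they can be computed exactly by summing logical conditions; the wdiff vector below is hypothetical and stands in for cdc_temp$wdiff.

```r
# Hypothetical weight differences (desired minus current), in pounds
wdiff <- c(0, -10, 0, -8, -20, 15, -40, 5)

n_lose <- sum(wdiff < 0)               # respondents who want to lose weight
n_gain <- sum(wdiff > 0)               # respondents who want to gain weight
prop_within_10 <- mean(abs(wdiff) <= 10)  # share within 10 lb of their goal

print(c(lose = n_lose, gain = n_gain))
print(prop_within_10)
```

sum() counts the TRUE values of a condition, and mean() gives the proportion of TRUE values.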

  1. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

Summary for male:

summary(subset(cdc_temp$wdiff, cdc_temp$gender == "m"))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -20.00   -5.00  -10.71    0.00  500.00

Summary for female:

summary(subset(cdc_temp$wdiff, cdc_temp$gender == "f"))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -27.00  -10.00  -18.15    0.00   83.00

side-by-side box plot:

boxplot(cdc_temp$wdiff ~ cdc_temp$gender)

From the summaries, the median wdiff for females is -10, while for males it is -5. This indicates that women are more inclined to want to lose weight than men.

The side-by-side box plot indicates that men might be more likely than women to want to gain some weight.
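
The per-gender summaries can also be produced in one step with tapply(), which applies a function within each group. This is a sketch on a hypothetical data frame named toy, not the cdc data.

```r
# Hypothetical respondents
toy <- data.frame(
  gender = c("m", "m", "m", "f", "f", "f"),
  wdiff  = c(0, -5, 10, -10, -20, 0)
)

# Median weight difference within each gender
medians <- tapply(toy$wdiff, toy$gender, median)

# Share of each gender that wants to gain weight (wdiff > 0)
prop_gain <- tapply(toy$wdiff > 0, toy$gender, mean)

print(medians)
print(prop_gain)
```

The second call gives a direct number for the "wants to gain" comparison, rather than reading it from the box plot.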

  1. Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

Mean:

weight_mean <- mean(cdc$weight)

Standard deviation:

weight_sd <- sd(cdc$weight)

Determining proportion of weights:

within_one_sd <- subset(cdc, weight > (weight_mean - weight_sd) & weight < (weight_mean + weight_sd))
proportion <- nrow(within_one_sd) / nrow(cdc)
print(weight_mean)
## [1] 169.683
print(weight_sd)
## [1] 40.08097
print(proportion)
## [1] 0.7076

That means 70.76% of the weights are within one standard deviation of the mean, 169.683.
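
The same proportion can be computed without building an intermediate data frame, by averaging a logical condition. The sketch below uses the six weights shown by head(cdc) as a stand-in for cdc$weight; the variable names are introduced here for illustration.

```r
# The six weights printed by head(cdc), standing in for cdc$weight
weight <- c(175, 125, 105, 132, 150, 114)

w_mean <- mean(weight)
w_sd <- sd(weight)

# TRUE where a weight lies strictly within one standard deviation of the mean
within_one_sd <- abs(weight - w_mean) < w_sd

# The mean of a logical vector is the proportion of TRUE values
prop_within <- mean(within_one_sd)

print(w_mean)
print(prop_within)
```

For comparison, a normal distribution would place about 68% of values within one standard deviation of the mean.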