Lab report

Load data:

source("http://www.openintro.org/stat/data/cdc.R")

Exercise 1: How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, or numeric).

nrow(cdc)
## [1] 20000
ncol(cdc)
## [1] 9
# There are 20,000 people in the data set and there are 9 variables in it.
names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"   "wtdesire"
## [8] "age"      "gender"
head(cdc)
##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f
class(cdc$genhlth)
## [1] "factor"

There are 20,000 people in the data set and there are 9 variables in it.

genhlth: categorical (multi-category)

exerany: categorical (binary)

hlthplan: categorical (binary)

smoke100: categorical (binary)

height: numeric

weight: numeric

wtdesire: numeric

age: numeric

gender: categorical (binary)
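
The types listed above can be cross-checked in R. The sketch below is only an illustration: it uses a small made-up data frame called toy (a hypothetical stand-in with a few of the cdc columns) so that it runs without the download. With the real data the same sapply() calls would be applied to cdc.

```r
# Hypothetical toy rows with a few of the cdc columns (not the real data)
toy <- data.frame(
  genhlth = factor(c("good", "very good", "good"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(0, 1, 1),
  height  = c(70, 64, 60),
  gender  = factor(c("m", "f", "f"), levels = c("m", "f"))
)

# Storage class of every column
col_classes <- sapply(toy, class)
print(col_classes)

# 0/1 indicators such as exerany are stored as numbers but are categorical
# in meaning; flag numeric columns whose only values are 0 and 1
is_binary <- sapply(toy, function(x) is.numeric(x) && all(x %in% c(0, 1)))
print(is_binary)
```

The second check is why exerany, hlthplan and smoke100 are listed as categorical even though R stores them as numbers.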

Exercise 2: Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

#create a numerical summary for height and age
summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
#compute the interquartile range for each
IQR(cdc$height)
## [1] 6
IQR(cdc$age)
## [1] 26
#How many males are in the sample?
table(cdc$gender)
## 
##     m     f 
##  9569 10431
#What proportion of the sample reports being in excellent health?
table(cdc$genhlth)/20000
## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385
#About 23% of the sample reports being in excellent health.
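
The exercise also asks for the relative frequency distributions of gender and exerany. A minimal sketch of one way to get them is shown here with made-up vectors (hypothetical stand-ins for cdc$gender and cdc$exerany), so no results for the real data are implied.

```r
# Hypothetical toy vectors standing in for cdc$gender and cdc$exerany
gender  <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))
exerany <- c(1, 0, 1, 1, 1)

# prop.table() divides each count by the table total,
# so the sample size does not have to be typed in by hand
gender_rel  <- prop.table(table(gender))
exerany_rel <- prop.table(table(exerany))
print(gender_rel)
print(exerany_rel)
```

Using prop.table() instead of dividing by 20000 keeps the calculation correct if the number of rows ever changes.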

Exercise 3: What does the mosaic plot reveal about smoking habits and gender?

mosaicplot(table(cdc$gender,cdc$smoke100))

Exercise 4: Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

under23_and_smoke <- subset(cdc, smoke100 == 1 & age < 23)
head(under23_and_smoke)
##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f

Exercise 5:

# BMI from pounds and inches: weight divided by height squared, times 703
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)

boxplot(bmi ~ cdc$exerany)

This shows that respondents with more favorable health ratings tend to have lower BMIs overall, but the BMI ranges of the groups still overlap considerably.

I chose “exerany” on the expectation that whether or not a person exercised might be related to their BMI. The boxplot does not show dramatic differences between the two groups.
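
As a quick check of the BMI calculation, the sketch below applies the pounds-and-inches formula, 703 * weight / height^2, to the first row printed by head(cdc) (height 70, weight 175). The helper name bmi_imperial is hypothetical and used only for this illustration.

```r
# BMI from weight in pounds and height in inches: 703 * weight / height^2
# bmi_imperial is a hypothetical helper name used only in this sketch
bmi_imperial <- function(weight_lb, height_in) {
  703 * weight_lb / height_in^2
}

# First row shown by head(cdc): height 70 in, weight 175 lb
bmi_first <- bmi_imperial(175, 70)
print(round(bmi_first, 1))  # 25.1
```

Dividing by height without squaring it would give values in the thousands, which is an easy way to spot the mistake.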


On your own:

1: Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

plot(cdc$weight, cdc$wtdesire)

There appears to be a moderately strong, positive linear relationship between weight and desired weight, with a few outliers. A couple of people report very large desired weights.

2: Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

wdiff = cdc$wtdesire - cdc$weight

3: What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?

# wdiff is a numeric variable. If wdiff is 0, the person's desired weight is the same as their current weight. If wdiff is positive, the person wants to gain weight. If wdiff is negative, the person wants to lose weight.
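
The sign rule above can be sketched in code. The weights here are made up for illustration, and the labels "lose", "gain" and "stay" are hypothetical names, not part of the data set.

```r
# Hypothetical desired and current weights in pounds (not the cdc data)
wtdesire <- c(175, 115, 105, 140)
weight   <- c(175, 125, 105, 130)
wdiff    <- wtdesire - weight

# Label each difference by its sign
goal <- ifelse(wdiff < 0, "lose", ifelse(wdiff > 0, "gain", "stay"))
print(data.frame(wdiff, goal))
```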

4: Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

hist(wdiff, breaks = 100, col = "blue")

More people feel that they are overweight than feel that they are underweight. That is, more people desire to lose weight than to gain weight.

5: Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

boxplot(wdiff ~ cdc$gender)

fdata = subset(cdc, gender == "f")
mdata = subset(cdc, gender == "m")
# difference between desired and current weight within each gender
fwdiff = fdata$wtdesire - fdata$weight
mwdiff = mdata$wtdesire - mdata$weight
summary(fwdiff)
summary(mwdiff)
# proportion of each group that wants to lose weight (wdiff < 0)
mean(fwdiff < 0)
mean(mwdiff < 0)

About 72% of women want to lose weight while 55% of men want to lose weight. The women who want to lose weight typically want to lose more weight than men do.
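
The same comparison can be made without splitting the data frame by hand. This is a minimal sketch using tapply() on a made-up data frame called toy (hypothetical values, not the cdc data), so the numbers it prints say nothing about the real sample.

```r
# Hypothetical toy data with the three columns the comparison needs
toy <- data.frame(
  gender   = factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f")),
  weight   = c(175, 200, 160, 125, 150, 114),
  wtdesire = c(175, 180, 165, 115, 130, 114)
)
toy$wdiff <- toy$wtdesire - toy$weight

# Numerical summary of wdiff within each gender
by_gender <- tapply(toy$wdiff, toy$gender, summary)
print(by_gender)

# Share of each gender with wdiff < 0, i.e. wanting to lose weight
share_lose <- tapply(toy$wdiff < 0, toy$gender, mean)
print(share_lose)
```

Applied to cdc, the second tapply() call would give the two proportions in a single step.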

6: Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

sd.1.range = mean(cdc$weight) + c(-1,1)*sd(cdc$weight)
middle.1.sd = subset(cdc,weight >= sd.1.range[1] & weight <= sd.1.range[2])
nrow(middle.1.sd)/20000
## [1] 0.7076
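
An equivalent way to get this proportion, without building a subset, is to take the mean of a logical vector. The sketch below uses made-up weights (hypothetical values chosen so the arithmetic is easy to follow) rather than the cdc data.

```r
# Hypothetical weights in pounds (not the cdc data)
w <- c(120, 150, 160, 170, 180, 240)

w_mean <- mean(w)  # 170
w_sd   <- sd(w)    # 40

# TRUE where a weight lies within one standard deviation of the mean;
# the mean of a logical vector is the proportion of TRUE values
within_1sd <- mean(abs(w - w_mean) <= w_sd)
print(within_1sd)  # 4 of the 6 values, about 0.667
```

With the real data, replacing w by cdc$weight should reproduce the proportion computed above.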

Teamwork report

Team member          Attendance   Author     Contribution %
Diego Regules        Yes          Yes        25%
Gabriella Cardenas   Yes          No         25%
Cheyenne Korf        Yes          No         25%
Name of member 4     Yes / No     Yes / No   25%
Total                                        100%