source("more/cdc.R")
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
## [7] "wtdesire" "age" "gender"
Number of cases
nrow(cdc)
## [1] 20000
Number of Variables
ncol(cdc)
## [1] 9
Variable | Data Type |
---|---|
genhlth | categorical ordinal |
exerany | categorical |
hlthplan | categorical |
smoke100 | categorical |
height | numerical continuous |
weight | numerical continuous |
wtdesire | numerical discrete. One could argue it is continuous, but hardly anyone would report a desired weight of 180.25 lbs. |
age | numerical discrete, or continuous if months are taken into account |
gender | categorical |
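The classifications in the table can be cross-checked against how R stores each column. The following is a minimal sketch, assuming a small made-up data frame `toy` that stands in for a few `cdc` columns (the values are illustrative, not survey data):

```r
# Hypothetical stand-in for a few cdc columns; all values are made up.
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(0, 1, 1),
  height  = c(70, 64, 67)
)

# Storage class of each column: factors hold categories, numeric holds numbers.
col_classes <- sapply(toy, class)
print(col_classes)

# A 0/1 indicator is stored as numeric even though it is categorical in meaning,
# so the storage class alone does not settle the variable type.
is_binary <- sapply(toy, function(x) is.numeric(x) && all(x %in% c(0, 1)))
print(is_binary)
```

On the real data, `str(cdc)` gives the same kind of overview in a single call.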
Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
Summary and IQR : Height
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
IQR(cdc$height)
## [1] 6
Summary and IQR : Age
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
IQR(cdc$age)
## [1] 26
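The IQR is the third quartile minus the first quartile, which agrees with the summaries above: 70 - 64 = 6 for height and 57 - 31 = 26 for age. A small sketch of the same check, using a made-up vector `x` rather than the survey data:

```r
# Made-up heights in inches, for illustration only.
x <- c(48, 60, 64, 66, 67, 69, 70, 75, 93)

# First and third quartiles (R's default quantile method, type 7).
q <- quantile(x, probs = c(0.25, 0.75))

# IQR by hand: third quartile minus first quartile.
iqr_manual  <- unname(q[2] - q[1])
iqr_builtin <- IQR(x)

print(c(manual = iqr_manual, builtin = iqr_builtin))
```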
Relative Frequency Distribution for Gender
table(cdc$gender)/nrow(cdc)
##
## m f
## 0.47845 0.52155
Relative Frequency Distribution for exerany
table(cdc$exerany)/nrow(cdc)
##
## 0 1
## 0.2543 0.7457
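Dividing a table by the number of rows can also be written with `prop.table()`. A minimal sketch, assuming a made-up factor `gender` in place of `cdc$gender`:

```r
# Hypothetical responses; values are made up for illustration.
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts <- table(gender)       # absolute frequencies
rel    <- prop.table(counts)  # each count divided by the table total
print(rel)

# Same result as dividing by the number of observations.
manual <- counts / length(gender)
```

The two approaches agree when there are no missing values. With missing values, `table()` drops them by default, so `prop.table()` divides by the number of non-missing responses rather than by the number of rows.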
Number of Males
nrow(subset(cdc, gender =="m", select =c('gender')))
## [1] 9569
Or
summary(cdc$gender)
## m f
## 9569 10431
Excellent Health
nrow(subset(cdc, genhlth =="excellent", select =c('genhlth')))/(nrow(cdc))
## [1] 0.23285
Or
table(cdc$genhlth)/nrow(cdc)
##
## excellent very good good fair poor
## 0.23285 0.34860 0.28375 0.10095 0.03385
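Counts and proportions like these can also be computed directly from a logical comparison, since `TRUE` counts as 1 and `FALSE` as 0. A short sketch with made-up vectors (not the survey data):

```r
# Made-up responses for illustration only.
gender  <- c("m", "f", "m", "m")
genhlth <- c("good", "excellent", "fair", "excellent")

# sum() of a logical vector counts the TRUE values.
n_male <- sum(gender == "m")

# mean() of a logical vector gives the proportion of TRUE values.
p_excellent <- mean(genhlth == "excellent")

print(n_male)
print(p_excellent)
```

On the real data the equivalents would be `sum(cdc$gender == "m")` and `mean(cdc$genhlth == "excellent")`.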
A slightly larger share of males than females report having smoked at least 100 cigarettes.
Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke <- subset(cdc, smoke100 == 1 & age < 23)
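The same filter can be written with logical indexing instead of `subset()`. A minimal sketch, assuming a made-up data frame `toy` with only the two columns involved:

```r
# Hypothetical rows; values are made up for illustration.
toy <- data.frame(
  age      = c(19, 22, 23, 40, 21),
  smoke100 = c(1, 0, 1, 1, 1)
)

# subset() evaluates the condition inside the data frame.
by_subset <- subset(toy, smoke100 == 1 & age < 23)

# Logical indexing spells out the data frame name for each column.
by_index <- toy[toy$smoke100 == 1 & toy$age < 23, ]

print(by_subset)
print(isTRUE(all.equal(by_subset, by_index)))
```

Note that `age < 23` excludes respondents who are exactly 23, which matches "under the age of 23".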
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)
The box plot shows that BMI tends to increase as general health (genhlth) gets worse. The middle 50% of values (the IQR) also widens as general health gets worse.
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$exerany)
The box plot shows that people who exercise have a lower BMI and a narrower BMI range. Outliers in the exercising group may include very muscular people, such as Arnold Schwarzenegger, since BMI does not take muscle mass into account.
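What the box plots show visually can be backed up with group-wise numbers. The sketch below wraps the BMI formula used above in a helper (the name `bmi_from_imperial` is hypothetical) and computes the median and IQR per group on a made-up data frame:

```r
# Hypothetical helper: BMI from weight in pounds and height in inches,
# using the same formula as above.
bmi_from_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# Made-up rows for illustration; not survey data.
toy <- data.frame(
  weight  = c(150, 200, 180, 130),
  height  = c(65, 70, 72, 62),
  exerany = c(1, 0, 1, 0)
)
toy$bmi <- bmi_from_imperial(toy$weight, toy$height)

# Median (center) and IQR (spread of the middle 50%) within each group.
med_by_group <- tapply(toy$bmi, toy$exerany, median)
iqr_by_group <- tapply(toy$bmi, toy$exerany, IQR)

print(med_by_group)
print(iqr_by_group)
```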
Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
plot(cdc$wtdesire~ cdc$weight)
wdiff <- cdc$wtdesire - cdc$weight
plot(wdiff~cdc$weight)
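Weight and desired weight are positively associated: heavier respondents tend to report higher desired weights. The difference between the two, however, is negatively associated with weight: the more a person weighs, the more weight they tend to want to lose. The plot of weight difference vs actual weight shows this clearly.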
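The direction of each relationship can be checked numerically with `cor()`. A small sketch on made-up values (the real check would use `cdc$weight`, `cdc$wtdesire`, and `wdiff`):

```r
# Made-up values for illustration only.
weight   <- c(120, 150, 180, 210, 240)
wtdesire <- c(118, 145, 165, 185, 200)
wdiff    <- wtdesire - weight

# Sign of the correlation gives the direction of the linear association.
r_desire <- cor(weight, wtdesire)
r_diff   <- cor(weight, wdiff)

print(c(weight_vs_wtdesire = r_desire, weight_vs_wdiff = r_diff))
```

In this toy example desired weight rises with weight while the difference falls, so the two correlations have opposite signs.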
Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
wdiff <- cdc$wtdesire - cdc$weight
head(wdiff)
## [1] 0 -10 0 -8 -20 0
length(wdiff)
## [1] 20000
str(wdiff)
## int [1:20000] 0 -10 0 -8 -20 0 -9 -10 -20 -10 ...
typeof(wdiff)
## [1] "integer"
What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?
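Because wdiff is desired weight minus current weight, its sign has a direct reading: zero means the desired weight equals the current weight, a negative value means the desired weight is lower, and a positive value means it is higher. A minimal sketch that labels made-up values accordingly (the label text is illustrative):

```r
# Made-up differences (desired minus current weight), stored as integers.
wdiff_toy <- c(0L, -10L, 15L)

# Translate the sign of each difference into a plain-language label.
feeling <- ifelse(wdiff_toy == 0, "content with current weight",
           ifelse(wdiff_toy < 0,  "wants to lose weight",
                                  "wants to gain weight"))

print(data.frame(wdiff = wdiff_toy, feeling = feeling))
print(is.integer(wdiff_toy))
```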
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
wdiff <- cdc$wtdesire - cdc$weight
min(wdiff)
## [1] -300
max(wdiff)
## [1] 500
hist(wdiff, breaks = 100 ,freq = FALSE)
summary(wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -21.00 -10.00 -14.59 0.00 500.00
The histogram is skewed to the left and shows that most respondents would like to lose weight. The median is -10 and the mean is -14.59; the mean is probably pulled below the median by a few extreme negative values. A small percentage of respondents would like to gain weight.
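A mean below the median is consistent with a longer left tail, and the share of respondents in each direction can be computed directly. A sketch on a made-up vector `wdiff_toy` (not the survey values):

```r
# Made-up differences for illustration only.
wdiff_toy <- c(-60, -25, -10, -10, -5, 0, 0, 5)

center_mean   <- mean(wdiff_toy)
center_median <- median(wdiff_toy)

# Proportion wanting to lose weight (negative) or gain weight (positive).
share_lose <- mean(wdiff_toy < 0)
share_gain <- mean(wdiff_toy > 0)

print(c(mean = center_mean, median = center_median))
print(c(lose = share_lose, gain = share_gain))
```

On the real data, `mean(wdiff < 0)` and `mean(wdiff > 0)` would put numbers on "most" and "a small percentage".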
Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
men <-subset(cdc, gender == 'm')
women <-subset(cdc, gender == 'f')
men_wdiff <- men$wtdesire - men$weight
women_wdiff <- women$wtdesire - women$weight
summary(men_wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -20.00 -5.00 -10.71 0.00 500.00
summary(women_wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -27.00 -10.00 -18.15 0.00 83.00
wdiff <- cdc$wtdesire -cdc$weight
boxplot(wdiff~ cdc$gender)
boxplot(cdc$weight~ cdc$gender)
Women want to lose more weight than men: the median wdiff is -5.00 for men and -10.00 for women.
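The per-gender summaries can also be produced without creating separate `men` and `women` data frames, by grouping with `tapply()`. A minimal sketch on a made-up data frame `toy`:

```r
# Made-up rows for illustration; not survey data.
toy <- data.frame(
  gender   = factor(c("m", "f", "m", "f", "f", "m"), levels = c("m", "f")),
  weight   = c(180, 150, 200, 140, 160, 170),
  wtdesire = c(176, 138, 192, 140, 140, 170)
)
toy$wdiff <- toy$wtdesire - toy$weight

# Apply median() to wdiff within each gender group.
med_by_gender <- tapply(toy$wdiff, toy$gender, median)
print(med_by_gender)
```

Replacing `median` with `summary` returns the full five-number summary and mean for each group.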
Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.
sd(cdc$weight)
## [1] 40.08097
mean(cdc$weight)
## [1] 169.683
one_std_of_mean_after <- mean(cdc$weight) + sd(cdc$weight)
one_std_of_mean_before <- mean(cdc$weight) - sd(cdc$weight)
one_std_of_mean_after
## [1] 209.7639
one_std_of_mean_before
## [1] 129.602
with_in_one_sd <- subset(cdc, weight >= one_std_of_mean_before & weight <= one_std_of_mean_after)
dim(with_in_one_sd)
## [1] 14152 9
# Proportion of weights within one standard deviation of the mean
nrow(with_in_one_sd)/nrow(cdc)
## [1] 0.7076
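The same proportion can be computed in one step by testing whether each weight's distance from the mean is at most one standard deviation. A sketch on a made-up vector `w` (not the survey weights):

```r
# Made-up weights in pounds, for illustration only.
w <- c(120, 150, 160, 170, 180, 190, 250)

# TRUE where the weight lies within one standard deviation of the mean.
within_one_sd <- abs(w - mean(w)) <= sd(w)

# The mean of a logical vector is the proportion of TRUE values.
prop_within <- mean(within_one_sd)
print(prop_within)
```

For comparison, a normal distribution has about 68% of its values within one standard deviation of the mean.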
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.