To initialize this lab, we first set the working directory and then source the cdc data set.
setwd("C:/Users/Robert/Documents/R/win-library/3.2/IS606/labs/Lab1")
source("more/cdc.R")
Exercise 1 - How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).
nrow(cdc)
## [1] 20000
dim(cdc)
## [1] 20000 9
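Exercise 1 also asks for each variable's type, which nrow and dim do not report. A minimal sketch of one way to inspect types, using a made-up three-row stand-in for cdc (the column names come from this lab; the values and the factor encoding are assumptions, not the real data):

```r
# Toy stand-in for cdc: names follow the lab, values are invented.
toy <- data.frame(
  genhlth = factor(c("good", "excellent", "fair"),
                   levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany = c(0, 1, 1),        # 0/1 indicator, categorical in meaning
  height  = c(70, 64, 67),     # numerical
  age     = c(77, 33, 49),     # numerical
  gender  = factor(c("m", "f", "f"), levels = c("m", "f"))
)

# One class per column, returned as a named character vector.
var_classes <- sapply(toy, class)
print(var_classes)

# str() prints the same information plus the first few values.
str(toy)
```

Run against the real data, sapply(cdc, class) or str(cdc) would list the class of all nine variables. How each one is stored depends on how cdc.R builds the data frame, so a 0/1 column such as exerany may be numeric in storage while still being categorical in meaning.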
Exercise 2 - Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
summary(cdc$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 31.00 43.00 45.07 57.00 99.00
summary(cdc$height)[5]-summary(cdc$height)[2]
## 3rd Qu.
## 6
summary(cdc$age)[5]-summary(cdc$age)[2]
## 3rd Qu.
## 26
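The interquartile range can also be computed directly rather than by indexing into summary(). A small sketch on made-up values (the vector x is hypothetical), showing that the built-in IQR() agrees with the third quartile minus the first:

```r
x <- c(48, 64, 67, 70, 93)          # invented values, not the cdc heights

# Manual route: third quartile minus first quartile.
q <- quantile(x, c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

# Built-in route: IQR() uses the same default quantile definition.
iqr_builtin <- IQR(x)

print(c(manual = iqr_manual, builtin = iqr_builtin))
```

IQR() avoids the rounding that summary() applies to its printed values, which can matter for non-integer data.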
table(cdc$gender)/nrow(cdc)
##
## m f
## 0.47845 0.52155
table(cdc$exerany)/nrow(cdc)
##
## 0 1
## 0.2543 0.7457
table(cdc$gender)['m']
## m
## 9569
table(cdc$genhlth)['excellent']/nrow(cdc)
## excellent
## 0.23285
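Dividing a table by nrow() works when there are no missing values. As an alternative sketch, prop.table() divides by the table's own total; the gender vector below is invented for illustration:

```r
gender <- factor(c("m", "f", "f", "m", "f"), levels = c("m", "f"))

counts   <- table(gender)        # absolute frequencies
rel_freq <- prop.table(counts)   # each count divided by the table total

print(counts)
print(rel_freq)
```

The relative frequencies always sum to 1 with this approach, because table() drops NA values by default and prop.table() uses the remaining total.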
We now draw a mosaic plot of gender against smoke100.
mosaicplot(table(cdc$gender,cdc$smoke100))
Exercise 3 - What does the mosaic plot reveal about smoking habits and gender?
The mosaic plot shows the proportion of each gender that falls into each smoking category. There are more female respondents than male, but males are more likely to have smoked at least 100 cigarettes: over 50% of males have, compared with under 50% of females.
Exercise 4 - Create a new object called “under23_and_smoke” that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
under23_and_smoke <- subset(cdc, smoke100 == 1 & age < 23)
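For comparison, the same filter can be written with logical indexing. This is a sketch on a made-up five-row data frame (toy and its values are hypothetical); it only illustrates that the two forms select the same rows:

```r
toy <- data.frame(age      = c(19, 22, 23, 40, 21),
                  smoke100 = c(1, 0, 1, 1, 1))

# subset() evaluates the condition inside the data frame.
by_subset <- subset(toy, smoke100 == 1 & age < 23)

# Logical indexing needs the data frame name spelled out.
by_index <- toy[toy$smoke100 == 1 & toy$age < 23, ]

print(by_subset)
```

Note that age < 23 is strict, so a 23-year-old smoker (the third row here) is excluded. One difference between the two forms: subset() drops rows where the condition is NA, while logical indexing keeps them as NA rows.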
bmi <- (cdc$weight / cdc$height^2) * 703
boxplot(bmi ~ cdc$genhlth)
Exercise 5 - What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.
The box plot shows that median BMI increases, and the spread of BMI widens, as self-reported general health declines.
boxplot(bmi ~ cdc$smoke100)
This box plot shows relatively similar BMI distributions for respondents who have and have not smoked 100 cigarettes, but it does show a large number of extreme BMI values. These may be due to cancer, disease, or a cross-addiction to food.
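A box plot comparison can be backed by per-group numbers. The sketch below applies the same BMI formula used above (weight in pounds, height in inches, scaled by 703) to four invented respondents and takes the median within each smoke100 group; the data are hypothetical and say nothing about the real relationship:

```r
weight   <- c(150, 200, 120, 250)   # pounds (invented)
height   <- c(65, 70, 62, 68)       # inches (invented)
smoke100 <- c(0, 1, 0, 1)

# Same formula as in the lab: pounds / inches^2 * 703.
bmi <- (weight / height^2) * 703

# Median BMI within each smoking group.
med_bmi <- tapply(bmi, smoke100, median)
print(round(med_bmi, 2))
```

Replacing median with IQR in the tapply() call gives the spread of each group, which is the other quantity a box plot displays.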
Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
smoothScatter(cdc$wtdesire~cdc$weight)
abline(lm(cdc$wtdesire~cdc$weight), col="red")
There is a positive relationship between the two variables: respondents who weigh more tend to report a higher desired weight.
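The direction and strength of such a relationship can be quantified with the fitted slope and the correlation coefficient. A sketch on five invented pairs (not the cdc values):

```r
weight   <- c(120, 150, 180, 210, 240)   # invented current weights
wtdesire <- c(120, 145, 165, 185, 200)   # invented desired weights

fit   <- lm(wtdesire ~ weight)           # same model as the red line
slope <- unname(coef(fit)["weight"])     # desired pounds per current pound
r     <- cor(weight, wtdesire)           # Pearson correlation

print(c(slope = slope, r = r))
```

A slope between 0 and 1 would mean desired weight rises with current weight, but by less than a pound per pound.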
Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.
wdiff <- cdc$wtdesire-cdc$weight
What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?
typeof(wdiff)
## [1] "integer"
wdiff is an integer vector, so it is discrete numerical data. A value of 0 means the respondent considers their current weight to be their desired weight. Positive values reflect a desire to gain weight, and negative values a desire to lose weight.
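The interpretation by sign can be made explicit in code. A sketch that labels each value of an invented wdiff vector (the labels are my own wording, not part of the lab):

```r
wdiff <- c(-20L, 0L, 15L, -5L, 0L)       # invented integer differences

# sign() returns -1, 0, or 1; map those onto readable labels.
feeling <- factor(sign(wdiff),
                  levels = c(-1, 0, 1),
                  labels = c("wants to lose", "no change", "wants to gain"))

print(typeof(wdiff))
print(table(feeling))
```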
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?
d <- density(wdiff)
plot(d)
mean(wdiff)
## [1] -14.5891
median(wdiff)
## [1] -10
quantile(wdiff)
## 0% 25% 50% 75% 100%
## -300 -21 -10 0 500
boxplot(wdiff)
The mean of desired weight minus current weight is about -14.6 pounds and the median is -10, so the typical respondent wants to lose weight. The density plot has its mode at 0, which tells us that the single most common response is to want no change, and the third quartile of 0 shows that at most a quarter of respondents want to gain weight. Overall, respondents clearly tend to believe they are overweight. The box plot, while not pretty, helps highlight the outliers, including someone who wants to be 500 pounds heavier.
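A mode of 0 and a negative median can coexist, which is easy to check numerically. The sketch below uses eight invented values to find the most frequent value and the share of respondents on each side of 0:

```r
wdiff <- c(-30, -10, -10, 0, 0, 0, 5, -20)   # invented differences

# Most frequent value: tabulate, then take the name of the largest count.
mode_value <- as.numeric(names(which.max(table(wdiff))))

# The mean of a logical vector is the proportion of TRUE values.
prop_lose <- mean(wdiff < 0)
prop_same <- mean(wdiff == 0)
prop_gain <- mean(wdiff > 0)

print(c(mode = mode_value, lose = prop_lose,
        same = prop_same, gain = prop_gain))
```

In this toy vector 0 is the most common single value, yet half of the entries are negative, so the mode alone does not describe what most respondents want.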
Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
malesDesire <- subset(cdc, cdc$gender == 'm')$wtdesire
femalesDesire <- subset(cdc, cdc$gender == 'f')$wtdesire
malesWeight <- subset(cdc, cdc$gender == 'm')$weight
femalesWeight <- subset(cdc, cdc$gender == 'f')$weight
boxplot(malesDesire - malesWeight,femalesDesire - femalesWeight)
summary(malesDesire - malesWeight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -20.00 -5.00 -10.71 0.00 500.00
summary(femalesDesire - femalesWeight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -27.00 -10.00 -18.15 0.00 83.00
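The box plot and summary statistics show a greater interquartile range for women (27 pounds versus 20 for men). Women typically want to lose more weight: their median difference is -10 pounds against -5 for men, and their mean is -18.15 against -10.71. The largest desired gain is much higher among men (500 pounds versus 83), while the third quartile is 0 for both groups, so at most a quarter of either group wants to gain weight.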
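The four separate subsets can be avoided by storing the difference as a column and grouping on gender. A sketch with six invented respondents (toy and its values are hypothetical):

```r
toy <- data.frame(
  gender   = factor(c("m", "m", "m", "f", "f", "f"), levels = c("m", "f")),
  weight   = c(180, 200, 170, 140, 160, 150),
  wtdesire = c(175, 180, 170, 125, 150, 150)
)
toy$wdiff <- toy$wtdesire - toy$weight

# One statistic per gender, computed in a single call each.
med_by_gender <- tapply(toy$wdiff, toy$gender, median)
iqr_by_gender <- tapply(toy$wdiff, toy$gender, IQR)

print(med_by_gender)
print(iqr_by_gender)
```

With the column in place, boxplot(wdiff ~ gender, data = toy) would draw the side-by-side box plot with labeled groups, and tapply(toy$wdiff, toy$gender, summary) would give the full numerical summary for each.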
Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.
mean(cdc$weight)
## [1] 169.683
sd(cdc$weight)
## [1] 40.08097
belowMean <-subset(cdc, cdc$weight > mean(cdc$weight)-sd(cdc$weight))
aboveMean <-subset(cdc, cdc$weight < mean(cdc$weight)+sd(cdc$weight))
withinSD <-subset(belowMean, belowMean$weight < max(aboveMean$weight))
nrow(withinSD)/nrow(cdc)
## [1] 0.7071
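One detail in the three-step subset above: the last step compares with < against max(aboveMean$weight), which leaves out the observations sitting exactly at that maximum even though they lie within one standard deviation, so the reported proportion is slightly low. A sketch on seven invented weights that computes the proportion directly and shows the effect of the strict comparison:

```r
w <- c(100, 150, 160, 170, 180, 190, 250)   # invented weights
m <- mean(w)
s <- sd(w)

# Direct route: proportion of values strictly inside (m - s, m + s).
within_sd <- mean(w > m - s & w < m + s)

# Route used above: strict comparison against the largest value below m + s.
upper  <- w[w < m + s]
strict <- mean(w > m - s & w < max(upper))

print(c(direct = within_sd, strict_max = strict))
```

In this toy example the direct calculation counts five of seven values, while the strict comparison against the maximum counts only four, because the value 190 is dropped.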