Notes

Using the same cdc data set as in Lab03, answer the following questions. Before answering them, we load the mosaic package and the cdc (BRFSS) data set, which is in the oilabs package.

library(mosaic)
library(oilabs)
data(cdc)

Total out of 10 pts.

Question 1

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

(2 pts) Let’s plot all of the data, followed by a zoomed-in version:

xyplot(wtdesire ~ weight, data=cdc)

xyplot(wtdesire ~ weight, data=cdc, ylim=c(0,300))

People’s desired weight and current weight are positively associated: heavier people tend to report higher desired weights. Some people have desired weights of about 600 and 700. Sumo wrestlers perhaps?
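
To put a number on the association, a correlation coefficient can be computed. The sketch below uses a small made-up data frame (toy is hypothetical; only its column names match cdc) to illustrate how a single implausible desired weight can mask an otherwise strong positive relationship, which is also why the zoomed-in plot is easier to read.

```r
# Hypothetical toy data with the same column names as cdc (values are made up)
toy <- data.frame(
  weight   = c(150, 180, 200, 130, 250, 170),
  wtdesire = c(140, 170, 180, 130, 200, 680)
)

# Correlation using every row, including the implausible 680
r_all <- cor(toy$weight, toy$wtdesire)

# Correlation after setting aside desired weights of 600 or more
plausible <- subset(toy, wtdesire < 600)
r_trim <- cor(plausible$weight, plausible$wtdesire)

c(all_rows = r_all, trimmed = r_trim)
```

On the real data, the same calls would take cdc in place of toy.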

Question 2

Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning the result to a new object called wdiff.

(1 pt) This is similar to the mutate example from Lab03, where we created a new variable bmi. The following code does this:

cdc <-
  cdc %>% 
  mutate(wdiff = wtdesire - weight)
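
For comparison, here is a minimal sketch of the same step in base R next to the mutate() version, run on a small made-up data frame (toy is hypothetical; only its column names match cdc). It assumes the dplyr package is installed, which mosaic normally loads for you.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  weight   = c(175, 125, 105),
  wtdesire = c(175, 115, 110)
)

# dplyr version, mirroring the solution above
toy_dplyr <- toy %>%
  mutate(wdiff = wtdesire - weight)

# Base R version: assign the difference to a new column
toy_base <- toy
toy_base$wdiff <- toy_base$wtdesire - toy_base$weight

# Both approaches produce the same column
identical(toy_dplyr$wdiff, toy_base$wdiff)
```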

Question 3

What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight? What if wdiff is positive or negative?

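(3 pts) wdiff is a numerical variable, measured in pounds. Since wdiff = wtdesire - weight, a value of 0 means the person is already at their desired weight, a positive value means they want to gain weight, and a negative value means they want to lose weight.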
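
One way to read the sign of wdiff, given the definition wdiff = wtdesire - weight from Question 2, is sketched below. The three values are hypothetical and chosen only to cover the zero, negative, and positive cases.

```r
# Hypothetical wdiff values (wdiff = wtdesire - weight), in pounds
wdiff <- c(0, -10, 15)

# wdiff is a number, not a category
is.numeric(wdiff)

# Translate the sign into what the person wants
goal <- ifelse(wdiff < 0, "wants to lose weight",
        ifelse(wdiff > 0, "wants to gain weight",
               "is at desired weight"))

data.frame(wdiff, goal)
```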

Question 4

Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

histogram(~wdiff, data=cdc)

histogram(~wdiff, data=cdc, xlim=c(-100, 100), nint=50)

favstats(~wdiff, data=cdc)
##   min  Q1 median Q3 max     mean       sd     n missing
##  -300 -21    -10  0 500 -14.5891 24.04586 20000       0

(2 pts) For completeness’ sake, we also show a zoomed-in histogram with finer bins.

Both the mean and the median are negative. Since we defined wdiff = wtdesire - weight, a negative value means the person wants to lose weight. There are extreme outliers on either end, indicating that some people want to lose a lot of weight and others want to gain a lot. The distribution is not quite symmetric: there appear to be slightly more values on the negative side. As for the spread, the IQR discounts the effect of these outliers and gives a good sense of the spread of the middle 50% of the data: \(Q_3 - Q_1 = 0 - (-21) = 21\), i.e. about 21 pounds. In other words, the middle 50% of people want to lose between 0 and 21 pounds. Note that the 3rd quartile is 0, meaning at least 75% of people either want to lose weight or are happy with their current weight.
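
The IQR arithmetic can be reproduced from the quartiles that favstats() reported. As an added illustration, the sketch also applies the common 1.5 × IQR rule of thumb for flagging outliers; that rule is an assumption brought in here, not something the lab asks for.

```r
# Quartiles and extremes of wdiff, copied from the favstats() output
q1 <- -21
q3 <- 0
wdiff_min <- -300
wdiff_max <- 500

# Spread of the middle 50% of the data
iqr <- q3 - q1

# 1.5 * IQR rule of thumb (an assumption added for illustration)
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)

# Check whether the extremes fall outside the fences
wdiff_min < lower_fence
wdiff_max > upper_fence
```

Under that rule, both the minimum (-300) and the maximum (500) lie far outside the fences, consistent with the outliers visible in the histogram.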

The conclusion is, as would be expected, that most people would like to lose at least some weight; only a smaller number want to gain weight.

Question 5

Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

(2 pts) In the lab, we saw how to split off only the males in the cdc data set by using the filter() command. We do this again for males and separately for females.
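
A minimal sketch of that filter() step is shown here on a small made-up data frame (toy is hypothetical; the gender codes "m" and "f" follow the favstats() output in this answer). It assumes the dplyr package is installed, which mosaic normally loads for you.

```r
library(dplyr)

# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  gender = c("m", "f", "m", "f"),
  wdiff  = c(-5, -10, 0, -20)
)

# Split off the males and, separately, the females
males   <- toy %>% filter(gender == "m")
females <- toy %>% filter(gender == "f")

# Compare a summary statistic across the two subsets
c(m = median(males$wdiff), f = median(females$wdiff))
```

The formula interface, as in wdiff ~ gender, performs the same split in a single call, so the explicit subsets are not strictly needed.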

We can now compare the two groups using favstats():

favstats(wdiff~gender, data=cdc)
##   gender  min  Q1 median Q3 max      mean       sd     n missing
## 1      m -300 -20     -5  0 500 -10.70613 23.49262  9569       0
## 2      f -300 -27    -10  0  83 -18.15118 23.99713 10431       0

On average, men wanted to lose less weight than women: a mean of about 10.7 pounds versus about 18.2 pounds. We plot the side-by-side boxplot, followed by a zoomed-in version.

bwplot(wdiff~gender, data=cdc)

bwplot(wdiff~gender, data=cdc, ylim=c(-75, 75))

Because the median wdiff is lower for females, and a negative value of wdiff indicates a desire to lose weight, one can argue that women want to lose more weight. Furthermore, there is greater variability in this difference for the middle 50% of women than for the middle 50% of men (an IQR of 27 pounds versus 20 pounds).
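
The variability comparison can be checked directly from the quartiles in the favstats() output for this question. The sketch below only redoes that arithmetic; the numbers are copied from the output, not recomputed from cdc.

```r
# Quartiles of wdiff by gender, copied from the favstats() output
q1 <- c(m = -20, f = -27)
q3 <- c(m = 0, f = 0)

# IQR for each group: spread of the middle 50%
iqr <- q3 - q1
iqr
```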

Question 6: BONUS

Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean. Hint: & is the “and” operator. So if statement A is true and statement B is true, then A & B is true as well. But if B is false, A & B is false.

(+2 pts)

xbar <- mean(cdc$weight)
s <- sd(cdc$weight)
lower.bound <- xbar - s
upper.bound <- xbar + s
cdc <-
  cdc %>% 
  mutate(within_one_sd = lower.bound <= weight & weight <= upper.bound)
tally(~within_one_sd, data=cdc)
## 
##  TRUE FALSE 
## 14152  5848
mean(~within_one_sd, data=cdc)
## [1] 0.7076

We find 70.76% of weights are within one standard deviation of the mean.
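
As a point of reference, this proportion can be set against what a normal distribution would give. The comparison is an added illustration, not part of the question; the counts are copied from the tally() output.

```r
# Counts copied from the tally() output
n_within <- 14152
n_total  <- 14152 + 5848

# Observed share of weights within one sd of the mean
observed <- n_within / n_total

# Share of a normal distribution within one sd of its mean
normal_share <- pnorm(1) - pnorm(-1)

round(c(observed = observed, normal = normal_share), 4)
```

The observed share is a little above the roughly 68% a normal model would give, so the normal model is only a rough guide for these weights.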