NOW ON YOUR OWN.... Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.
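plot(cdc$weight, cdc$wtdesire, xlab = "weight", ylab = "wtdesire")
There is a positive relationship between weight and desired weight: people who weigh more tend to report a higher desired weight. A few potential outliers appear at the top of the plot (unusually high desired weights) and at the far right (unusually high current weights).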
Let's consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning the result to a new object called wdiff.
wdiff = cdc$wtdesire - cdc$weight
str(wdiff)
## int [1:20000] 0 -10 0 -8 -20 0 -9 -10 -20 -10 ...
What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person's weight and desired weight?
wdiff is a discrete numerical variable: str() shows that it is stored as integers.
A value of 0 means that the person's current weight and desired weight are the same.
What if wdiff is positive or negative?
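If wdiff is positive, the person's desired weight is higher than their current weight, so they would like to gain weight.
If wdiff is negative, the person's current weight is higher than their desired weight, so they would like to lose weight.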
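As a quick check of this sign convention, the sketch below tabulates the sign of the difference on a small made-up data frame. The object `toy` and its values are hypothetical and only stand in for `cdc`; on the real data the same idea would apply to `wdiff` directly.

```r
# Hypothetical toy data standing in for cdc (not the real survey data)
toy <- data.frame(weight   = c(150, 180, 200, 120, 165),
                  wtdesire = c(150, 170, 180, 125, 165))

# Same definition as wdiff: desired weight minus current weight
toy_wdiff <- toy$wtdesire - toy$weight

# sign() returns -1, 0, or 1; attach readable labels to each case
direction <- factor(sign(toy_wdiff),
                    levels = c(-1, 0, 1),
                    labels = c("wants to lose", "content", "wants to gain"))

counts <- table(direction)
print(counts)
```

Fixing the factor levels up front keeps all three categories in the table even if one of them happens to be empty.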
Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use.
boxplot(wdiff)
summary(wdiff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -21.00 -10.00 -14.59 0.00 500.00
hist(wdiff, breaks = 100)
What does this tell us about how people feel about their current weight?
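From summary(wdiff) and the histogram, the center of the distribution is below 0 (median -10, mean -14.59). The mean lies below the median, which is consistent with a left skew. The middle half of the values runs from -21 to 0, while the full range extends from -300 to 500. The boxplot shows a few potential outliers. Because the median is -10 and the third quartile is 0, at least half of the respondents weigh more than they would like to, and relatively few want to gain weight.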
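Because the range is driven by a handful of extreme values, the interquartile range and the share of negative differences can be more informative than the minimum and maximum. The sketch below computes both on the first ten values of wdiff printed by str() above; the object name `first_ten` is introduced here for illustration, and the results describe only those ten values, not the full data set.

```r
# First ten wdiff values, copied from the str(wdiff) output above
first_ten <- c(0, -10, 0, -8, -20, 0, -9, -10, -20, -10)

center_median <- median(first_ten)   # robust measure of center
center_mean   <- mean(first_ten)     # pulled toward extreme values
spread_iqr    <- IQR(first_ten)      # width of the middle 50%

# A logical vector averages to the proportion of TRUE values
share_negative <- mean(first_ten < 0)

print(c(median = center_median, mean = center_mean,
        IQR = spread_iqr, share_negative = share_negative))
```

Replacing `first_ten` with `wdiff` would give the same quantities for all 20000 respondents.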
Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.
# difference between desired and current weight, split by gender
menweight = subset(cdc$wtdesire - cdc$weight, cdc$gender == "m")
womenweight = subset(cdc$wtdesire - cdc$weight, cdc$gender == "f")
summary(menweight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -20.00 -5.00 -10.71 0.00 500.00
summary(womenweight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -300.00 -27.00 -10.00 -18.15 0.00 83.00
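boxplot(menweight, womenweight, names = c("men", "women"))
Women tend to report a larger gap between their current and desired weight than men: the median difference is -10 for women versus -5 for men, and the mean is -18.15 versus -10.71. The first quartile is also lower for women (-27 versus -20), while the third quartile is 0 in both groups.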
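The same comparison can be made without building a separate vector per group. The sketch below is one possible alternative, shown on a made-up data frame: `toy` and its values are hypothetical stand-ins for `cdc`, and only base R functions (`tapply` and the formula interface of `boxplot`) are used.

```r
# Hypothetical toy data standing in for cdc (not the real survey data)
toy <- data.frame(
  weight   = c(190, 210, 175, 140, 160, 130),
  wtdesire = c(185, 190, 175, 125, 140, 130),
  gender   = c("m", "m", "m", "f", "f", "f")
)

# Store the difference as a column so it stays aligned with gender
toy$wdiff <- toy$wtdesire - toy$weight

# One median per gender; the result is named by group
group_medians <- tapply(toy$wdiff, toy$gender, median)
print(group_medians)

# Formula interface: one box per gender, labelled automatically
boxplot(wdiff ~ gender, data = toy, ylab = "wtdesire - weight")
```

Keeping the difference inside the data frame avoids recomputing it for each group and lets the plot pick up its labels from the gender column.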
Now it's time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.
# keep the weights strictly within one standard deviation of the mean
m = mean(cdc$weight)
s = sd(cdc$weight)
mycount = subset(cdc$weight, cdc$weight > m - s & cdc$weight < m + s)
length(mycount) / nrow(cdc)
## [1] 0.7076
70.76% of the weights are within one standard deviation of the mean, close to the roughly 68% that a normal distribution would give.
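The proportion can also be computed without subsetting, since the mean of a logical vector is the fraction of TRUE values. The sketch below shows this on a small made-up vector; `toy_weight` and its values are hypothetical and are not taken from `cdc`.

```r
# Hypothetical weights standing in for cdc$weight
toy_weight <- c(120, 140, 150, 160, 165, 170, 180, 190, 210, 260)

m <- mean(toy_weight)
s <- sd(toy_weight)

# TRUE where a weight lies strictly within one SD of the mean
within_one_sd <- abs(toy_weight - m) < s

# Averaging TRUE/FALSE gives the proportion directly
prop_within <- mean(within_one_sd)
print(prop_within)
```

On the real data, `mean(abs(cdc$weight - m) < s)` with `m` and `s` computed from `cdc$weight` should match the value obtained from `length(mycount) / nrow(cdc)`.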