Set the working folder and import the dataset from the web
setwd("C:\\Users\\26291\\Documents\\Data606_lab1")
source("http://www.openintro.org/stat/data/cdc.R")
Attribute names
names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"

1. Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

(Answer) The overall trend of the plot is positive, which means that as a person’s weight increases, so does their desired weight.
plot(cdc$weight ~ cdc$wtdesire)

2. Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called wdiff.

wdif <- cdc$wtdesire - cdc$weight  # stored as wdif; the questions refer to it as wdiff

3. What type of data is wdiff? If an observation wdiff is 0, what does this mean about the person’s weight and desired weight. What if wdiff is positive or negative?

(Answer) wdiff is numerical and discrete. If an observation is 0, then the respondent is satisfied with their current weight.

If wdiff is negative, then they want to lose weight; if it is positive, they want to gain weight.
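The sign interpretation above can be sketched on a few made-up values. The vector `wdiff_toy` and the labels are assumptions for illustration only, not part of the cdc data.

```r
# Toy values (hypothetical, not from cdc): desired minus current weight, in pounds.
wdiff_toy <- c(-20, 0, 15, -5, 0)

# sign() returns -1, 0, or 1; map those onto readable labels.
wants <- factor(sign(wdiff_toy),
                levels = c(-1, 0, 1),
                labels = c("lose", "same", "gain"))

# Count how many toy respondents fall into each group.
table(wants)
```

Applied to the real variable, the same call on `wdif` would count how many respondents want to lose, keep, or gain weight.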

4. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

(Answer)
boxplot(wdif)

hist(wdif, breaks = 40)

plot(wdif)

summary(wdif)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -300.00  -21.00  -10.00  -14.59    0.00  500.00
The wdiff median is -10 and the mean is -14.59, so it’s typical for people to want to lose around 10 to 15 pounds.

The wdiff histogram is unimodal with a slight left skew, so there are some people who want to lose a lot of weight, and few people who want to gain weight.

The middle 50% of values lie between -21 and 0 pounds (IQR = 21), although there are many outliers, mostly people who want to lose weight.
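As a rough check of the spread and outlier statements, the IQR and the points a boxplot would flag can be computed directly. This is a sketch on a small made-up vector (`x_toy` is hypothetical); with the real data the same calls would take `wdif`.

```r
# Toy values (hypothetical), chosen so the quartiles are -21 and 0.
x_toy <- c(-300, -40, -21, -10, -10, 0, 0, 25, 500)

# Interquartile range: third quartile minus first quartile.
iqr_toy <- IQR(x_toy)
iqr_toy

# Points beyond 1.5 times the hinge spread, the rule boxplot() uses by default.
outliers_toy <- boxplot.stats(x_toy)$out
outliers_toy
```

Here only the two extreme toy values are flagged; the values -40 and 25 stay inside the whiskers.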

5. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

difwdif <- data.frame(wdif, cdc$gender)
summary(subset(difwdif, cdc.gender == "m"))
##       wdif         cdc.gender
##  Min.   :-300.00   m:9569    
##  1st Qu.: -20.00   f:   0    
##  Median :  -5.00             
##  Mean   : -10.71             
##  3rd Qu.:   0.00             
##  Max.   : 500.00
summary(subset(difwdif, cdc.gender == "f"))
##       wdif         cdc.gender
##  Min.   :-300.00   m:    0   
##  1st Qu.: -27.00   f:10431   
##  Median : -10.00             
##  Mean   : -18.15             
##  3rd Qu.:   0.00             
##  Max.   :  83.00
boxplot(difwdif$wdif ~ difwdif$cdc.gender)

Based on the summaries and the boxplot, women (median = -10) generally appear to want to lose a few more pounds than men (median = -5), and the middle 50% of women’s values is slightly more spread out (IQR = 27) than men’s (IQR = 20). Interestingly, more men than women appear to want to gain weight.
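The per-gender comparison can also be done without subsetting twice. The sketch below uses `tapply()` on a small made-up data frame (`toy` and its values are assumptions, not cdc rows) to get the median, the IQR, and the share of respondents who want to gain weight in each group.

```r
# Toy data frame (hypothetical values), mirroring the wdif / gender layout.
toy <- data.frame(
  wdif   = c(-20, -5, 0, 10, -30, -10, -10, 0),
  gender = factor(c("m", "m", "m", "m", "f", "f", "f", "f"))
)

# One summary per gender; results are ordered by factor level (f, then m).
med_by_gender  <- tapply(toy$wdif, toy$gender, median)
iqr_by_gender  <- tapply(toy$wdif, toy$gender, IQR)

# The mean of a logical vector is a proportion: share with wdif > 0.
gain_by_gender <- tapply(toy$wdif > 0, toy$gender, mean)

med_by_gender
iqr_by_gender
gain_by_gender
```

With the lab's objects, the equivalent calls would be along the lines of `tapply(difwdif$wdif, difwdif$cdc.gender, median)`.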

6. Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

avg_wt <- mean(cdc$weight)
avg_wt 
## [1] 169.683
sd_wt <- sd(cdc$weight)
sd_wt
## [1] 40.08097
weight_within_one_sd <- subset(cdc, weight < (avg_wt + sd_wt) & weight > (avg_wt - sd_wt))
dim(weight_within_one_sd)
## [1] 14152     9
dim(weight_within_one_sd)[1]/dim(cdc)[1]
## [1] 0.7076
Since there are 14152 observations in the required subset, out of a total of 20000 observations, this represents a proportion of 14152/20000 = 0.7076, or about 71%. Thus 71% of weights are within 1 standard deviation of the mean.
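The same proportion can be obtained in one step by averaging a logical vector, which avoids building a subset and dividing row counts. A minimal sketch on made-up weights (`w_toy` is hypothetical):

```r
# Toy weights in pounds (hypothetical, not from cdc).
w_toy <- c(120, 150, 160, 170, 180, 200, 260)

# TRUE where a weight lies strictly within one standard deviation of the mean.
within_one_sd <- abs(w_toy - mean(w_toy)) < sd(w_toy)

# The mean of a logical vector is the proportion of TRUE values.
prop_within <- mean(within_one_sd)
prop_within  # 5 of the 7 toy values qualify
```

On the real data this would read `mean(abs(cdc$weight - avg_wt) < sd_wt)`, which uses the same strict inequalities as the `subset()` call above.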