1. The scatterplot given below shows the relationship between the percentage of residents of each state that eats at least 5 servings of fruits and vegetables each day and the percentage that is obese.
  1. Does the scatterplot show a positive or negative association? Explain why your answer makes sense for these two variables.
    This scatterplot shows a negative association. This makes sense: as the percentage of residents who eat at least 5 servings of fruits and vegetables increases, the percentage of residents who are obese tends to decrease, since healthy eating habits can help prevent obesity.

  2. Which of the following is most likely to be the correlation between these two variables: \(-1, -0.941, -0.605, -0.083, 0.172, 0.445, 0.955,\) or \(1\)? Explain your reasoning.
    The most likely correlation between these two variables is \(-0.605\), because the relationship is negative and moderately strong: the points do not fall close enough to a line for \(-0.941\) or \(-1\), and the trend is too clear for \(-0.083\).

  3. Would a negative correlation imply that eating more vegetables will cause you to lose weight? Explain.
    No; correlation does not imply causation. A negative correlation only tells us that the two variables are related, not that eating more vegetables will cause you to lose weight.
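The point about causation can be illustrated with a small simulation. This is only a sketch on made-up data: the variable `lurking` and all numbers are hypothetical and are not taken from the state-level data. It shows how a third variable can produce a clearly negative correlation between two variables that have no direct effect on each other.

```r
# Hypothetical simulation: a lurking variable drives both x and y
set.seed(1)
n <- 500
lurking <- rnorm(n)                   # assumed unobserved common cause
x <- lurking + rnorm(n, sd = 0.5)     # x rises with the lurking variable
y <- -lurking + rnorm(n, sd = 0.5)    # y falls with it; x never enters y

# Marginal correlation is strongly negative
r_xy <- cor(x, y)
r_xy

# Once the lurking variable is accounted for, the coefficient on x is near zero
adj_slope <- coef(lm(y ~ x + lurking))[["x"]]
adj_slope
```

In observational data such as state-level survey percentages, a common cause of this kind cannot be ruled out from the correlation alone.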

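library(mosaic)  # assumed: supplies the formula interface to cor() and loads lattice for xyplot() and histogram()

x_values <- c(2002,2003,2004,2005,2006,2007,2008,2009,2010,2011)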
y_values <- c(50,45,54,49,54,66,59,68,54,62)

xyplot(y_values~x_values,type = c("p", "r"))

model<-lm(y_values~x_values)
model
## 
## Call:
## lm(formula = y_values ~ x_values)
## 
## Coefficients:
## (Intercept)     x_values  
##   -3385.352        1.715
# Observed correlation between year and count
origCorrelation <- cor(y_values~x_values)

# Permutation test: shuffle y 10,000 times to break any link with x
listOfCorrelations <- numeric(10000)

for (i in 1:10000){
  reordered_y_values <- sample(y_values,size=10,replace=FALSE)
  listOfCorrelations[i] <- cor(reordered_y_values~x_values)
}

histogram(~listOfCorrelations)

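# Two-sided p-value: share of shuffled correlations at least as extreme as the observed one
# (abs() keeps this correct even if the observed correlation is negative)
pvalue <- mean(abs(listOfCorrelations) >= abs(origCorrelation))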
pvalue
## [1] 0.0327
  1. The null hypothesis is that the correlation coefficient equals zero, and the alternative hypothesis is that it does not. With a p-value of 0.0327, we reject the null hypothesis at the 0.05 level and conclude that there is a significant linear relationship between the year and the number of hot dogs and buns eaten.

  2. \(HotDogs = 1.715 \cdot Year - 3385.352\) The slope tells us that for every additional year, the predicted number of hot dogs and buns eaten increases by about 1.715.
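As a cross-check on both answers above, here is a minimal sketch using only base R. It assumes the classical t-test from `cor.test()` is an acceptable comparison for the permutation test, and it reads the slope directly from the fitted model instead of the rounded printout. The names `ct`, `slope`, and `pred` are introduced here for illustration.

```r
# Data copied from the analysis above
x_values <- c(2002,2003,2004,2005,2006,2007,2008,2009,2010,2011)
y_values <- c(50,45,54,49,54,66,59,68,54,62)

# Classical test of H0: correlation = 0 (t-based, not a permutation test)
ct <- cor.test(x_values, y_values)
ct$estimate   # sample correlation
ct$p.value    # two-sided p-value

# Slope at full precision, and its interpretation as the change in the
# prediction when the year increases by one
model <- lm(y_values ~ x_values)
slope <- coef(model)[["x_values"]]
pred <- predict(model, newdata = data.frame(x_values = c(2010, 2011)))
unname(diff(pred))   # equals the slope
```

The permutation p-value varies a little from run to run because the shuffles are random, so it should be close to, but not identical to, the t-based value.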

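yavg = mean(y_values)
# Predictions use the rounded coefficients printed above, so SSE and SSR
# differ slightly from the anova() table; fitted(model) gives full precision
predicted = (1.715 * x_values) - 3385.352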


SST = sum((y_values - yavg)^2)
SST
## [1] 506.9
SSE = sum((y_values - predicted)^2)
SSE
## [1] 265.1333
SSR = sum((predicted - yavg)^2)
SSR
## [1] 243.5783
anova(model)
## Analysis of Variance Table
## 
## Response: y_values
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## x_values   1 242.69 242.694  7.3486 0.02662 *
## Residuals  8 264.21  33.026                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# R^2 is the ratio SSR/SST itself; the ratio must not be squared
r2 = SSR/SST
round(r2, 4)
## [1] 0.4805

The \(R^2\) value tells us that about 48% of the variability in the number of hot dogs and buns eaten is explained by the model.
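For a simple linear regression, \(R^2\) can be obtained in several equivalent ways. The sketch below, in base R only, compares three of them at full precision. The names `r2_model`, `r2_cor`, and `r2_ss` are introduced here for illustration.

```r
# Data copied from the analysis above
x_values <- c(2002,2003,2004,2005,2006,2007,2008,2009,2010,2011)
y_values <- c(50,45,54,49,54,66,59,68,54,62)

model <- lm(y_values ~ x_values)

# 1. Reported by the model summary
r2_model <- summary(model)$r.squared

# 2. Square of the correlation coefficient (valid with a single predictor)
r2_cor <- cor(x_values, y_values)^2

# 3. From sums of squares: 1 - SSE/SST, using unrounded residuals
r2_ss <- 1 - sum(resid(model)^2) / sum((y_values - mean(y_values))^2)

c(r2_model, r2_cor, r2_ss)   # all three agree
```

Using `resid(model)` instead of hand-typed coefficients avoids the small rounding discrepancy between manually computed sums of squares and the `anova()` table.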

  1. The p-value (0.0327 from the permutation test; 0.02662 from the ANOVA F-test) lets us conclude that there is a significant linear relationship between the year and the number of hot dogs and buns eaten.
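The F statistic and p-value in the ANOVA table can be rebuilt from the sums of squares. This is a sketch in base R; the names `ss_reg`, `ss_resid`, `f_stat`, and `p_f` are introduced here for illustration.

```r
# Data copied from the analysis above
x_values <- c(2002,2003,2004,2005,2006,2007,2008,2009,2010,2011)
y_values <- c(50,45,54,49,54,66,59,68,54,62)

model <- lm(y_values ~ x_values)

# Partition the total variation into regression and residual parts
ss_total <- sum((y_values - mean(y_values))^2)
ss_resid <- sum(resid(model)^2)
ss_reg   <- ss_total - ss_resid

# Degrees of freedom: 1 for the single predictor, n - 2 for the residuals
df_reg   <- 1
df_resid <- length(y_values) - 2

# F = mean square for regression / mean square for residuals
f_stat <- (ss_reg / df_reg) / (ss_resid / df_resid)

# Upper-tail probability of the F distribution gives Pr(>F)
p_f <- pf(f_stat, df_reg, df_resid, lower.tail = FALSE)

c(F = f_stat, p = p_f)
```

These two numbers should match the `F value` and `Pr(>F)` columns printed by `anova(model)`.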