Homework #6

1.) The scatterplot given shows the relationship between the percentage of residents of each state that eats at least 5 servings of fruits and vegetables each day and the percentage that is obese.

1.a) Does the scatterplot show a positive or negative association? Explain why your answer makes sense for these two variables.

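The scatterplot shows a negative association between the percentage of residents in each state who eat at least 5 servings of fruits and vegetables each day and the percentage of residents in each state who are obese. This is plausible for these two variables: a diet with more fruits and vegetables is generally part of a healthier lifestyle, so states where more residents eat this way would be expected to have lower obesity rates.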

1.b) Which of the following is most likely to be the correlation between these two variables: -1, -0.941, -0.605, -0.083, 0.172, 0.445, 0.955, or 1? Explain your reasoning.

-0.605 is most likely to be the correlation value because the scatterplot suggests a negative, moderately linear relationship.

1.c) Would a negative correlation imply that eating more vegetables will cause you to lose weight? Explain.

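No. The negative correlation only indicates that states with a higher percentage of residents who eat at least 5 servings of fruits and vegetables each day tend to have a lower percentage of obese residents. Correlation does not imply causation: the data are observational and measured at the state level, so other variables could account for the association, and nothing can be concluded about what would happen to an individual's weight.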

2.) Every Fourth of July, Nathan’s Famous in New York City holds a hot dog eating contest. The table shown below gives the winning number of hot dogs and buns eaten every year from 2002 to 2011.

2.a) Use a randomization test for correlation to determine if there is a significant linear relationship between the year and the number of hot dogs and buns eaten.

First, set alpha to 0.05 as the significance level against which the p-value will be compared. The null and alternative hypotheses are stated below, where rho represents the population correlation coefficient.

\(H_0\): \(\rho = 0\)

\(H_a\): \(\rho \neq 0\)

library(fastR2)
x_values <- c(2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011)
#f <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
#Used f as years since the 2002 competition to see if there is a difference; the intercept was 48.382 rather than -3385.352 
y_values <- c(50,45,54,49,54,66,59,68,54,62)

xyplot(y_values~x_values,type = c("p", "r"), xlab = "Year", ylab = "Hot Dogs")

#computes regression line components
model <- lm(y_values~x_values)
model

## 
## Call:
## lm(formula = y_values ~ x_values)
## 
## Coefficients:
## (Intercept)     x_values  
##   -3385.352        1.715

origCorrelation <- cor(y_values~x_values)
origCorrelation

## [1] 0.6919398

#preallocates storage for the 10,000 simulated correlations
listOfCorrelations <- numeric(10000)

for (i in 1:10000){
  #shuffles the response to break any association with the year
  reordered_y_values <- sample(y_values, size = 10, replace = FALSE)
  listOfCorrelations[i] <- cor(reordered_y_values~x_values)
}

histogram(~listOfCorrelations)

#2-sided p-value: proportion of shuffled correlations at least as extreme as the observed one
pvalue <- mean(abs(listOfCorrelations) >= abs(origCorrelation))
pvalue

## [1] 0.029

The test statistic is the sample correlation coefficient r, which was computed to be 0.6919398.

A randomization simulation is used to compute the p-value. Because the test is two-sided, both tails of the randomization distribution are counted toward the p-value.

The p-value was calculated to be 0.029, which is less than alpha. Thus we reject the null hypothesis: there is statistically significant evidence of a linear relationship between the year and the number of hot dogs and buns eaten at the contest, and the positive sample correlation indicates that the relationship is positive.

2.b) Find the equation of the regression line. Interpret the slope of this line in the context of the problem.

Regression line: \(\hat{y} = -3385.352 + 1.715x\)

The slope of the regression line tells us that for every year that passes, the predicted number of hot dogs and buns eaten by the contest winner increases by 1.715.
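The slope interpretation can be checked numerically. The sketch below refits the same model in base R and compares predictions for two consecutive years; the variable name `preds` and the choice of 2010 and 2011 are illustrative and not part of the original solution.

```r
# Contest data from the table in the problem statement
x_values <- 2002:2011
y_values <- c(50, 45, 54, 49, 54, 66, 59, 68, 54, 62)

# Same simple linear regression as in part a
model <- lm(y_values ~ x_values)

# Predicted winning totals for two consecutive years (illustrative choice)
preds <- predict(model, newdata = data.frame(x_values = c(2010, 2011)))

# The difference between consecutive-year predictions equals the slope
diff(preds)
coef(model)["x_values"]
```

Any pair of consecutive years gives the same difference, which is what it means for the slope to be the predicted change per year.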

2.c) Find the total sum of squares (SST), error sum of squares (SSE), and regression sum of squares (SSR) directly, without using the anova command. Then verify your results using the anova command.

\[\begin{aligned}
SST &= SSR + SSE \\
SSR &= \sum^{n}_{i=1} (\hat{y}_{i} - \bar{y})^2 \approx 242.69 \\
SSE &= \sum^{n}_{i=1} (y_{i} - \hat{y}_{i})^2 \approx 264.21 \\
SST &= \sum^{n}_{i=1} (y_{i} - \bar{y})^2 = 506.90
\end{aligned}\]

The anova command verifies these values: the Sum Sq column gives SSR = 242.69 and SSE = 264.21, which add to SST = 506.90.
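The direct computation of the three sums of squares can be sketched in base R as follows. The names `y_hat` and `y_bar` are introduced here for illustration and do not appear in the original code; the data are re-entered so the block runs on its own.

```r
# Contest data from the table in the problem statement
x_values <- 2002:2011
y_values <- c(50, 45, 54, 49, 54, 66, 59, 68, 54, 62)

model <- lm(y_values ~ x_values)
y_hat <- fitted(model)   # predicted values from the regression line
y_bar <- mean(y_values)  # mean of the observed responses

SSR <- sum((y_hat - y_bar)^2)     # regression sum of squares
SSE <- sum((y_values - y_hat)^2)  # error (residual) sum of squares
SST <- sum((y_values - y_bar)^2)  # total sum of squares

c(SSR = SSR, SSE = SSE, SST = SST)

# The decomposition SST = SSR + SSE should hold up to rounding error
all.equal(SST, SSR + SSE)
```

SSR and SSE can then be compared with the Sum Sq column of the anova table.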

anova(model)

## Analysis of Variance Table
## 
## Response: y_values
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## x_values   1 242.69 242.694  7.3486 0.02662 *
## Residuals  8 264.21  33.026                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.d) Find and interpret the coefficient of determination R squared using your answers to part c.

\[R^2 = 1 - \frac{SSE}{SST} = 0.4787729\]

The calculated R-squared value indicates that 47.88% of the variability in the number of hot dogs and buns eaten can be explained by the linear model on year.
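As a cross-check, the coefficient of determination can be computed from the sums of squares and compared with the squared correlation and with the value reported by `summary()`. This is a minimal base R sketch; `r_squared` is an illustrative name.

```r
# Contest data from the table in the problem statement
x_values <- 2002:2011
y_values <- c(50, 45, 54, 49, 54, 66, 59, 68, 54, 62)

model <- lm(y_values ~ x_values)

SSE <- sum(residuals(model)^2)              # error sum of squares
SST <- sum((y_values - mean(y_values))^2)   # total sum of squares

# Coefficient of determination from the sums of squares
r_squared <- 1 - SSE / SST
r_squared

# In simple linear regression these two quantities agree with r_squared
cor(x_values, y_values)^2
summary(model)$r.squared
```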

2.e) Complete the F Test Procedure for determining whether or not there is a significant linear relationship between the year and the number of hot dogs and buns eaten.

The anova command computes the F statistic, F = 7.3486 on 1 and 8 degrees of freedom, and the corresponding p-value, 0.02662. Because the p-value is less than alpha = 0.05, we reject the null hypothesis of no linear relationship (the same null hypothesis as in part a) and conclude that there is a significant linear relationship between the year and the number of hot dogs and buns eaten at the contest.
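The F statistic and its p-value can also be reproduced from the sums of squares, which makes the steps of the F test explicit. The sketch below uses base R only; the names `MSR`, `MSE`, `F_stat`, and `p_value` are illustrative.

```r
# Contest data from the table in the problem statement
x_values <- 2002:2011
y_values <- c(50, 45, 54, 49, 54, 66, 59, 68, 54, 62)

model <- lm(y_values ~ x_values)
n <- length(y_values)

SSR <- sum((fitted(model) - mean(y_values))^2)  # regression sum of squares
SSE <- sum(residuals(model)^2)                  # error sum of squares

MSR <- SSR / 1        # one predictor, so 1 regression degree of freedom
MSE <- SSE / (n - 2)  # n - 2 error degrees of freedom

# F statistic and its upper-tail p-value under the F(1, n - 2) distribution
F_stat <- MSR / MSE
p_value <- pf(F_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)

c(F = F_stat, p = p_value)
```

The two values can be compared with the F value and Pr(>F) columns of the anova table.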

Jacob Wolfla

December 2, 2017