Instructions

I strongly encourage you to read the questions as soon as you get the assignment. This will not only help you start thinking about how to solve them, but also leave you sufficient time to get help if you ever need it. In case of questions, or if you get stuck, please don’t hesitate to email me (though I would appreciate it if you could allow me at least 24 hours before the deadline to get back to you).

My regular office hours (please feel free to let me know if those don’t work for your schedule):

Also, remember that the Statistics & Data Science (SDS) Fellows have drop-in hours: Sunday-Thursday evenings from 7:00-9:00 PM in E208. I encourage you to use this resource; the fellows can help with questions about conceptual understanding of the course material, as well as with R and RMarkdown.

Steps to proceed:

  1. Download the file PS5.Rmd from Moodle
  2. Upload the file to the RStudio server (r.amherst.edu)
  3. Replace “YOUR NAME HERE” at the top of this document with your name, and update the date
  4. Add in your responses below where it is marked SOLUTION:
  5. Choose “Knit to PDF” under the “Knit” button
  6. Once you are done, upload the knitted PDF to Gradescope

Note that datasets for most of the tables, examples, and exercises are available via the Stat2Data package.

Problems to turn in:

Three problems are included below, but only Problem 1 and Problem 2 are required. Problem 3 is for your practice only. Again, you may want to take a look at Problem 2 first, as it requires teamwork!

Problem 1 [25 points - 5 points for completion, 20 points for accuracy]: Adirondacks High Peaks (modified from 4.2 and 4.12 in STAT2)

Forty-six mountains in the Adirondacks of upstate New York are known as the High Peaks, with elevations near or above 4,000 feet (although modern measurements show a couple of the peaks are actually under 4,000 feet). A goal for hikers in the region is to become a “46er”, scaling each of the peaks. The file HighPeaks contains information on the Elevation (in feet) of each peak along with data on typical hikes, including the Ascent (in feet), round-trip distance (Length, in miles), Difficulty rating (1-7), and expected trip Time (in hours).

  a. Look at a scatterplot of Time vs. Elevation of the mountain and the correlation between the two variables. Does it look like Elevation should be very helpful in predicting Time?

## [1] -0.0163

SOLUTION:
It looks like Elevation is not very helpful in predicting Time: the points appear randomly scattered, with no obvious relationship between the two variables, and the correlation coefficient (-0.0163) is essentially zero, implying that Elevation alone is a poor predictor of Time.
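For reference, here is a minimal sketch of how this scatterplot and correlation could be produced, assuming the Stat2Data and mosaic packages used in this course:

```r
library(Stat2Data)  # provides the HighPeaks dataset
library(mosaic)     # formula interface for cor() and the gf_ plotting functions
data(HighPeaks)

# Scatterplot of Time (hours) vs. Elevation (feet)
gf_point(Time ~ Elevation, data = HighPeaks)

# Correlation between the two variables
cor(Time ~ Elevation, data = HighPeaks)
```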

  b. Suppose we’ve fit a simple linear regression (SLR) model using Length to predict Time, and it’s significant. Now, we would like to consider adding Elevation as the second predictor for Time, and the following added variable plot for Elevation can help us see the effect of adding Elevation to the model:

Does the plot show that there is additional information in Elevation that is useful for predicting Time after accounting for Length? Explain.

SOLUTION:
Yes. The plot shows that the residuals have a negative linear relationship, so there is additional information in Elevation that is useful for predicting Time after accounting for Length. This negative linear trend shows that, after accounting for Length, a decent amount of the remaining (unexplained) variability in Time can be uniquely explained by Elevation.
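A sketch of how an added variable plot like this could be generated; the avPlots() helper from the car package is one option (the course materials may use a different function):

```r
library(Stat2Data)
library(car)  # avPlots() draws added variable (partial regression) plots
data(HighPeaks)

# AV plot for Elevation: residuals of Time ~ Length plotted against
# residuals of Elevation ~ Length
mod2 <- lm(Time ~ Length + Elevation, data = HighPeaks)
avPlots(mod2, terms = ~ Elevation)
```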

  c. Suppose we run a “best subsets” procedure to determine which set of predictors (not including Peak) would have the lowest \(C_p\) value when predicting Time, and got the following output from R:
|         | adjr2 |  cp  |  bic  | rss  | Length | Elevation | Difficulty | Ascent |
|---------|-------|------|-------|------|--------|-----------|------------|--------|
| 1 ( 1 ) | 0.731 | 25.4 | -53.8 | 92.4 |        |           |            |        |
| 2 ( 1 ) | 0.787 | 12.2 | -61.7 | 71.6 |        |           |            |        |
| 3 ( 1 ) | 0.815 |  6.3 | -65.4 | 60.7 |        |           |            |        |
| 4 ( 1 ) | 0.824 |  5.0 | -65.2 | 56.2 |   *    |     *     |     *      |   *    |

What model (which predictors) did the “best subsets” procedure suggest to use for predicting Time?

SOLUTION:
Since the last (four-predictor) model has the smallest \(C_p\) (5.0), and it is the only model satisfying the \(C_p\) rule of thumb of \(C_p \le m + 1\) (here 5.0 ≤ 4 + 1 = 5, with equality), the “best subsets” procedure suggests the last model for predicting Time. This model includes all four predictors: Length, Elevation, Difficulty, and Ascent.
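A sketch of how output like the table above could be reproduced with regsubsets() from the leaps package (the same template given in Problem 3):

```r
library(Stat2Data)
library(leaps)  # regsubsets() performs best subsets regression
data(HighPeaks)

best.sub <- regsubsets(Time ~ Length + Elevation + Difficulty + Ascent,
                       data = HighPeaks)
# adjr2, cp, bic, rss, and the predictor-inclusion matrix for each model size
with(summary(best.sub), data.frame(adjr2, cp, bic, rss, outmat))
```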

  d. Suppose we fit the model suggested by the “best subsets” regression above, and obtained the following plots:

Use the above plots to assess and comment on the regression conditions of linearity, homoscedasticity (equal variance), and normality in this situation.

SOLUTION:
Linearity: the residuals vs. fitted plot shows a slight curve, but the pattern is not strong, so the linearity condition is roughly satisfied; we should still be cautious in analyzing this model.
Homoscedasticity (equal variance): the vertical spread of the residuals stays roughly constant across the fitted values, with no strong fanning pattern, so the equal variance condition is also roughly satisfied.
Normality: in the Q-Q plot the points fall close to the reference line, with only a couple of outliers in the tails, so the normality condition is satisfied.
Based on these conditions, we can proceed with analyzing this model, but with caution.
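A sketch of how diagnostic plots like these could be produced for the suggested model, using base R’s plot() method for lm objects:

```r
library(Stat2Data)
data(HighPeaks)

mod4 <- lm(Time ~ Length + Elevation + Difficulty + Ascent, data = HighPeaks)

plot(mod4, which = 1)  # residuals vs. fitted: check linearity and equal variance
plot(mod4, which = 2)  # normal Q-Q plot: check normality of the residuals
```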

  e. Are there any mountains that stand out as being outliers? Are there any mountains that have high leverage or may be influential on the fit? Use the above plots to help you answer this question.

SOLUTION: From the above plots, the three mountains that stand out as potential outliers are cases 24, 33, and 40. In the Cook’s distance plot, these three have the largest values and sit farthest from the rest of the points, meaning they exert the most influence on the fit. In the residuals vs. leverage plot, these three mountains come close to exceeding the √3 ≈ 1.73 limit, meaning that they can definitely be viewed as potential outliers.

  f. If your answer is yes to either question in e), identify the mountain(s) and report their values for the standardized or studentized residual, or the leverage, or the Cook’s Distance.

SOLUTION: The three mountains are Seward Mtn. (case 24), Mt. Donaldson (case 33), and Mt. Emmons (case 40). Their studentized residuals are 2.96, 2.10, and 2.56, respectively, and their Cook’s distance/leverage values are 0.112, 0.082, and 0.0156, respectively.

## 24 33 40 
## 24 33 40
## [1] Seward Mtn.    Mt. Donaldson  Mt. Emmons    
## 46 Levels: Algonquin Peak  Allen Mtn.  Armstrong Mtn.  ... Wright Peak
##   24   33   40 
## 2.96 2.10 2.56
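The case statistics reported above could be computed along these lines (a sketch; the |studentized residual| > 2 cutoff is the usual rule of thumb):

```r
library(Stat2Data)
data(HighPeaks)
mod4 <- lm(Time ~ Length + Elevation + Difficulty + Ascent, data = HighPeaks)

stud  <- rstudent(mod4)        # studentized residuals
lev   <- hatvalues(mod4)       # leverages
cooks <- cooks.distance(mod4)  # Cook's distances

# Flag cases with unusually large studentized residuals
flagged <- which(abs(stud) > 2)
round(stud[flagged], 2)

# Look up the names of the flagged peaks
HighPeaks$Peak[flagged]
```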

Problem 2 [25 points - 5 points for completion, 10 points for proposed questions, 10 points for proposed solutions]: Brainstorm with your Work Team & Create the third exam problem (Q3) for our Exam Bank!!!

Now, find an appropriate dataset that has at least 4 variables - this could be one of the examples or applications from your posts on the Learning Forum this week, or from your Work Team Project dataset, or another interesting dataset/study your team found from other resources [As always, be sure to include the citation!].

Citation: https://www.kaggle.com/datasets/levyedgar44/income-and-happiness-correction

Here is how to create your own problem by adapting Problem 1, parts a) to e) only:

IMPORTANT NOTE - As said before, please make sure to have FUN creating exam problems with your work team! You are encouraged to split/rotate the jobs, like finding the dataset, writing the questions, making the solution, etc. Be CREATIVE!

As before, each team member needs to submit the SAME answers for Problem 2, but please write your own answers for Problem 1.

Problem for STAT230 Exam 1 Bank (Q3)

  a. Look at a scatterplot of Adjusted Satisfaction vs. Income Inequality of the countries and the correlation between the two variables. Does it look like Income Inequality should be very helpful in predicting Adjusted Satisfaction?

  • Correlation:
## [1] -0.124
  b. Suppose we’ve fit a simple linear regression (SLR) model using the log of avg_income to predict adjusted_satisfaction, and it’s significant. Now, we would like to consider adding income_inequality as the second predictor for adjusted_satisfaction, and the following added variable plot for income_inequality can help us see the effect of adding income_inequality to the model:

Does the plot show that there is additional information in income_inequality that is useful for predicting adjusted_satisfaction after accounting for avg_income? Explain.

  c. Suppose we run a “best subsets” procedure to determine which set of predictors would have the lowest \(C_p\) value when predicting adjusted_satisfaction, and got the following output from R:
##        avg_income income_inequality               GDP        happyScore 
##              3.86              1.22              3.63              3.28
|         | adjr2 |  cp  |  bic | rss  | avg_income | income_inequality | GDP | happyScore |
|---------|-------|------|------|------|------------|-------------------|-----|------------|
| 1 ( 1 ) | 0.81  | 4.92 | -176 | 3158 |            |                   |     |            |
| 2 ( 1 ) | 0.81  | 3.91 | -175 | 3073 |            |                   |     |            |
| 3 ( 1 ) | 0.82  | 3.29 | -173 | 3000 |            |         *         |  *  |     *      |
| 4 ( 1 ) | 0.82  | 5.00 | -168 | 2991 |     *      |         *         |  *  |     *      |

What model (which predictors) did the “best subsets” procedure suggest to use for predicting adjusted_satisfaction?

  d. Suppose we fit the model suggested by the “best subsets” regression above, and obtained the following output and plots:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -4.5309     3.7019   -1.22    0.224    
## income_inequality   0.1031     0.0635    1.62    0.108    
## GDP                 4.6184     2.1990    2.10    0.038 *  
## happyScore          8.3758     0.7002   11.96   <2e-16 ***
## 
## Residual standard error: 5.29 on 107 degrees of freedom
## Multiple R-squared:  0.822,  Adjusted R-squared:  0.817 
## F-statistic:  164 on 3 and 107 DF,  p-value: <2e-16

Use the above plots to assess and comment on the regression conditions of linearity, homoscedasticity (equal variance), and normality in this situation.

  e. Are there any countries that stand out as being outliers? Are there any countries that have high leverage or may be influential on the fit? Identify them and use the above plots to help you answer this question.

Your solution:

  a. It doesn’t look like income inequality can be used to predict adjusted satisfaction: the scatterplot shows little linear pattern, and the correlation coefficient is very small (about -0.124), indicating only a weak relationship between the two variables.

  b. Yes. As we can see from the AV plot, the residuals display a linear relationship. It is not extremely strong, since there are some outliers here and there, but overall the data display a linear pattern with a clearly non-horizontal slope, which means that a decent amount of the unexplained variation in adjusted satisfaction can be uniquely explained by income inequality.

  c. The best model here appears to be model 3, which uses income_inequality, GDP, and happyScore to predict adjusted_satisfaction. It has the smallest \(C_p\) (3.29) and satisfies the \(C_p\) rule of thumb of \(C_p \le m + 1\), since 3.29 < 3 + 1 = 4. It also has the second smallest BIC (-173).

  d. The residual plot has a slight pattern to it, and there are a few outliers (29, 57, 103) on the sides, so the model does not completely satisfy the linearity and equal variance conditions. But since the pattern is not strong or clearly curved, we can say both conditions are roughly satisfied, and we should proceed with the analysis cautiously. In the Q-Q plot, the points are closely scattered around the reference line, with the same few outliers in the tails, so the normality condition is also satisfied, again proceeding with caution.

  e. From the above plots, points 29, 57, and 103 stand out as outliers. They are more distant from the other points in the residual plot and Q-Q plot, and they are somewhat influential: their presence makes the fitted curve on the residual plot bend up at the beginning. In the leverage plot, all three points exceed the √2 ≈ 1.41 limit, so by the rule of thumb we can categorize them as potential outliers. Their Cook’s distances point the same way: 29 and 103 have the two highest values, which makes sense since Cook’s distance also takes leverage into account. More importantly, in the Cook’s distance vs. leverage plot, countries 29 and 103 lie farther from the fitted line than all other points, and country 57 is a bit closer, suggesting all three countries are potential outliers, especially 29 and 103. These three countries are the Dominican Republic (29), Liberia (57), and Tanzania (103).
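A sketch of how these cases could be identified, assuming the Kaggle data are loaded in a data frame called happy with a country column (both names are placeholders, not the actual file’s):

```r
# Hypothetical object/column names: `happy` and `happy$country`
mod3 <- lm(adjusted_satisfaction ~ income_inequality + GDP + happyScore,
           data = happy)

stud <- rstudent(mod3)   # studentized residuals
lev  <- hatvalues(mod3)  # leverages
n <- nrow(happy)
k <- 3                   # number of predictors

# Flag cases with a large residual, or leverage above the 2(k+1)/n rule of thumb
flagged <- which(abs(stud) > 2 | lev > 2 * (k + 1) / n)
happy$country[flagged]
```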

Problem 3 [For Practice only]: Fertility measurements (modified from 4.4 and 4.6 in STAT2)

A medical doctor and her team of researchers collected a variety of data on women who were having trouble getting pregnant. Data for randomly selected patients for whom complete information was available are provided in the file Fertility, including:

##  [1] "Age"        "LowAFC"     "MeanAFC"    "FSH"        "E2"        
##  [6] "MaxE2"      "MaxDailyGn" "TotalGn"    "Oocytes"    "Embryos"

A key method for assessing fertility is the count of antral follicles, which can be performed with noninvasive ultrasound. Researchers are interested in how the other variables are related to these counts (either LowAFC, the smallest antral follicle count, or MeanAFC, the average antral follicle count).

  a. Use a “best subsets” procedure to choose a model (excluding LowAFC) with the smallest \(C_p\) value to predict MeanAFC. Write down the predictors in this model and the corresponding Mallows’ \(C_p\).

SOLUTION:

## R example code for Best Subsets Regression using Cp values                     ##
## Only need to replace DATASET, X1, X2, X3, ..., Xk, Y with the correct names!   ##
## DELETE the option `eval=FALSE` above before knitting the file                  ##
## Requires the leaps (regsubsets), mosaic (msummary), knitr (kable), and         ##
## kableExtra (row_spec) packages                                                 ##
best.sub <- regsubsets(Y ~ X1 + X2 + ... + Xk, data = DATASET)
with(msummary(best.sub), data.frame(adjr2, cp, bic, rss, outmat)) %>%
  kable(digits = 3, booktabs = TRUE) %>%
  row_spec(row = 0, angle = 90)           # rotate the column headers 90 degrees
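For instance, applied to the Fertility data (a sketch; the formula simply lists every available predictor except LowAFC):

```r
library(Stat2Data)
library(leaps)
data(Fertility)

best.sub <- regsubsets(MeanAFC ~ Age + FSH + E2 + MaxE2 + MaxDailyGn +
                         TotalGn + Oocytes + Embryos, data = Fertility)
# Report the row with the smallest Mallows' Cp and its predictors
with(summary(best.sub), data.frame(adjr2, cp, bic, rss, outmat))
```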
  b. Use a stepwise regression procedure to choose a model (excluding LowAFC) to predict MeanAFC. Write down the predictors in this model and briefly describe each step of the model building process.

SOLUTION:

## R example code for Backward elimination / Forward selection / Stepwise Regression ##
## Only need to replace DATASET, X1, X2, X3, ..., Xk, Y with the correct names!      ##
## DELETE the option `eval=FALSE` above before knitting the file                     ##
## Requires the MASS package for stepAIC()                                           ##
mod.intercept <- lm(Y ~ 1, data = DATASET)                 # intercept-only model
mod.all <- lm(Y ~ X1 + X2 + ... + Xk, data = DATASET)      # full model
stepAIC(mod.intercept, scope = list(lower = mod.intercept, upper = mod.all),
        direction = "both", trace = FALSE)$anova           # Stepwise regression
  c. Did the two automated procedures suggest the same final model? If YES, fit the suggested final model and comment on the model’s appropriateness (checking the conditions), significance (checking the overall F-test and the individual t-tests), and goodness-of-fit (\(R^2\), \(\hat{\sigma}_\epsilon\)). If NO, fit both models and compare them via their appropriateness, significance, and goodness-of-fit (\(R^2_{adj}\), \(\hat{\sigma}_\epsilon\), AIC, BIC, and the nested F-test, used only when the two models are nested).

SOLUTION:
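If the two procedures disagree, the comparison could be set up along these lines (a sketch; mod.A and mod.B are hypothetical placeholders for the two suggested models):

```r
library(Stat2Data)
data(Fertility)

# Placeholders: substitute the models the two procedures actually chose
mod.A <- lm(MeanAFC ~ Age + FSH, data = Fertility)
mod.B <- lm(MeanAFC ~ Age + FSH + Oocytes, data = Fertility)

# Goodness-of-fit: adjusted R^2 and residual standard error
summary(mod.A)$adj.r.squared; summary(mod.A)$sigma
summary(mod.B)$adj.r.squared; summary(mod.B)$sigma

# Information criteria
AIC(mod.A, mod.B)
BIC(mod.A, mod.B)

# Nested F-test (valid here because mod.A is nested in mod.B)
anova(mod.A, mod.B)
```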