I strongly encourage you to read the questions as soon as you get the assignment. This will not only help you start thinking about how to solve them, but also allow you sufficient time to get help if you ever need it. In case of questions, or if you get stuck, please don’t hesitate to email me (though I would appreciate it if you could allow me at least 24 hours before the deadline to get back to you).
My regular office hours (please feel free to let me know if those don’t work for your schedule):
Also, remember that the Statistics & Data Science (SDS) Fellows have Drop-in hours: Sunday-Thursday evenings from 7:00-9:00PM, in E208. I encourage you to use this resource; the fellows are able to help with questions regarding conceptual understanding of the course material, as well as R and RMarkdown questions.
Note that datasets for most of the tables, examples, and exercises are available via the Stat2Data package.
Three problems are included below, but only Problem 1 and Problem 2 are required; Problem 3 is for your practice only. Again, you may want to take a look at Problem 2 first, as it requires teamwork!
Forty-six mountains in the Adirondacks of upstate New York are known as the High Peaks, with elevations near or above 4,000 feet (although modern measurements show a couple of the peaks are actually under 4,000 feet). A goal for hikers in the region is to become a “46er” by scaling each of the peaks. The file HighPeaks contains information on the Elevation (in feet) of each peak along with data on typical hikes, including the Ascent (in feet), round-trip distance (Length, in miles), Difficulty rating (1-7), and expected trip Time (in hours).
Below is a scatterplot of Time vs. Elevation of the mountain and the correlation between the two variables. Does it look like Elevation should be very helpful in predicting Time?

## [1] -0.0163
SOLUTION:
It looks like Elevation is not very helpful in predicting Time: the points appear to be randomly scattered, with no obvious relationship between the two variables. Also, the correlation coefficient (-0.0163) is essentially zero, implying that Elevation is a poor predictor of Time.
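For reference, here is a minimal sketch of how the scatterplot and correlation could be produced, assuming the HighPeaks data from the Stat2Data package noted above:

library(Stat2Data)
data(HighPeaks)
plot(Time ~ Elevation, data = HighPeaks)  # scatterplot of Time vs. Elevation
cor(HighPeaks$Elevation, HighPeaks$Time)  # correlation, about -0.0163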
We fit a simple linear model using Length to predict Time, and it’s significant.
Now, we would like to consider adding Elevation as the second predictor for Time, and the following added variable plot for Elevation can help us see the effect of adding Elevation to the model:

Does the plot show that there is additional information in Elevation that is useful for predicting Time after accounting for Length? Explain.
SOLUTION:
Yes. The added variable plot shows that the residuals have a negative linear relationship, so there is additional information in Elevation that is useful for predicting Time after accounting for Length. This negative linear relationship shows that, after accounting for Length, a decent amount of the unexplained variability in Time can be uniquely explained by Elevation.
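For reference, here is a sketch of how such an added variable plot could be made with avPlots() from the car package (one possible tool; the original plot may have been produced differently):

library(car)
mod2 <- lm(Time ~ Length + Elevation, data = HighPeaks)  # two-predictor model
avPlots(mod2, terms = ~ Elevation)  # added variable plot for Elevation given Length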
Next, we used the “best subsets” procedure to see which combination of the predictors (all variables except the Peak name) would have the lowest \(C_p\) value when predicting Time, and got the following output from R:

| | adjr2 | cp | bic | rss | Length | Elevation | Difficulty | Ascent |
|---|---|---|---|---|---|---|---|---|
| 1 ( 1 ) | 0.731 | 25.4 | -53.8 | 92.4 | | | | |
| 2 ( 1 ) | 0.787 | 12.2 | -61.7 | 71.6 | | | | |
| 3 ( 1 ) | 0.815 | 6.3 | -65.4 | 60.7 | | | | |
| 4 ( 1 ) | 0.824 | 5.0 | -65.2 | 56.2 | * | * | * | * |
What model (which predictors) did the “best subsets” procedure
suggest to use for predicting Time?
SOLUTION:
Since the last model has the smallest \(C_p\) (5.0), and it essentially meets the \(C_p\) rule of thumb of \(C_p \le m + 1 = 5\), the “best subsets” procedure suggests the last model for predicting Time. This model includes Length, Elevation, Difficulty, and Ascent as predictors.
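For reference, a minimal sketch of how this best subsets output could be generated with the leaps package (mirroring the example code given later in this assignment):

library(leaps)
best.sub <- regsubsets(Time ~ Length + Elevation + Difficulty + Ascent,
                       data = HighPeaks)
with(summary(best.sub), round(data.frame(adjr2, cp, bic, rss), 1))  # fit statistics by model size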
Use the above plots to assess and comment on the regression conditions of linearity, homoscedasticity (equal variance), and normality in this situation.
SOLUTION:
Linearity: the residual plot has a slight curve to it, but since this pattern is not very strong, we can say the linearity condition is reasonably satisfied, though we should be cautious in analyzing this model. Homoscedasticity (equal variance): the vertical spread of the residuals looks roughly constant across the fitted values, so the equal variance condition also appears to be reasonably satisfied. Normality: in the QQ plot, the points are scattered around the reference line (with only a couple of outliers), so the normality condition is satisfied. Based on these conditions, we can proceed with analyzing this model, but with some caution.
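For reference, the standard diagnostic plots used here could be produced as follows (a sketch, assuming the four-predictor model chosen above):

mod4 <- lm(Time ~ Length + Elevation + Difficulty + Ascent, data = HighPeaks)
plot(mod4, which = 1)  # residuals vs. fitted values: check linearity and equal variance
plot(mod4, which = 2)  # normal Q-Q plot: check normality of the residuals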
SOLUTION: From the above plots, the three mountains that stand out as potential outliers are 24, 33, and 40. On the Cook’s distance plot, these three mountains have the highest values and sit farthest from the fitted line, meaning they exert the most influence on the model and can be viewed as outliers. On the leverage plot, these three mountains come close to exceeding the 1.73 (\(\sqrt{3}\)) limit, meaning that they can definitely be viewed as potential outliers.
SOLUTION: The three mountains are Seward Mtn. (peak 24), Mt. Donaldson (peak 33), and Mt. Emmons (peak 40). Their studentized residuals are 2.96, 2.10, and 2.56, respectively, and their Cook’s distance/leverage values are 0.112, 0.082, and 0.0156, respectively.
## 24 33 40
## 24 33 40
## [1] Seward Mtn. Mt. Donaldson Mt. Emmons
## 46 Levels: Algonquin Peak Allen Mtn. Armstrong Mtn. ... Wright Peak
## 24 33 40
## 2.96 2.10 2.56
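For reference, here is a sketch of how these three peaks might be flagged, assuming the model mod4 above and a (hypothetical) cutoff of 2 for the absolute studentized residuals:

rstud <- rstudent(mod4)                  # studentized residuals
flagged <- which(abs(rstud) > 2)         # indices of the unusual peaks
round(rstud[flagged], 2)                 # their studentized residuals
HighPeaks$Peak[flagged]                  # their names
round(cooks.distance(mod4)[flagged], 3)  # their Cook's distances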
Now, find an appropriate dataset with at least 4 variables. This could be one of the examples or applications from your posts on the Learning Forum this week, your Work Team Project dataset, or another interesting dataset/study your team finds from other resources [As always, be sure to include the citation!].
Citation: https://www.kaggle.com/datasets/levyedgar44/income-and-happiness-correction
Here is how to create your own problem by adapting Problem 1, parts a) to e) only:
IMPORTANT NOTE - As said before, please make sure to have FUN creating exam problems with your work team! You are encouraged to split/rotate the jobs, like finding the dataset, writing the questions, making the solution, etc. Be CREATIVE!
As before, each team member needs to submit the SAME answers for Problem 2, but please write your own answers for Problem 1.
Please submit your exams on Moodle by 9:00 PM ET on Thursday, March 9th.

Below is a scatterplot of Adjusted Satisfaction vs. Income Inequality of the countries and the correlation between the two variables. Does it look like Income Inequality should be very helpful in predicting Adjusted Satisfaction?

## [1] -0.124
We fit a simple linear model using avg_income to predict adjusted_satisfaction, and it’s significant. Now, we would like to consider adding income_inequality as the second predictor for adjusted_satisfaction, and the following added variable plot for income_inequality can help us see the effect of adding income_inequality to the model:

Does the plot show that there is additional information in income_inequality that is useful for predicting adjusted_satisfaction after accounting for avg_income? Explain.
Next, we used the “best subsets” procedure to see which set of predictors would have the lowest \(C_p\) value when predicting adjusted_satisfaction, and got the following output from R:

## avg_income income_inequality GDP happyScore
## 3.86 1.22 3.63 3.28
| | adjr2 | cp | bic | rss | avg_income | income_inequality | GDP | happyScore |
|---|---|---|---|---|---|---|---|---|
| 1 ( 1 ) | 0.81 | 4.92 | -176 | 3158 | | | | |
| 2 ( 1 ) | 0.81 | 3.91 | -175 | 3073 | | | | |
| 3 ( 1 ) | 0.82 | 3.29 | -173 | 3000 | | * | * | * |
| 4 ( 1 ) | 0.82 | 5.00 | -168 | 2991 | * | * | * | * |
What model (which predictors) did the “best subsets” procedure suggest to use for predicting adjusted_satisfaction?
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.5309 3.7019 -1.22 0.224
## income_inequality 0.1031 0.0635 1.62 0.108
## GDP 4.6184 2.1990 2.10 0.038 *
## happyScore 8.3758 0.7002 11.96 <2e-16 ***
##
## Residual standard error: 5.29 on 107 degrees of freedom
## Multiple R-squared: 0.822, Adjusted R-squared: 0.817
## F-statistic: 164 on 3 and 107 DF, p-value: <2e-16
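For reference, a sketch of how this fitted model could be produced, assuming the Kaggle file was saved as "happyscore_income.csv" (a hypothetical file name) and using msummary() from the mosaic package:

library(mosaic)
happy <- read.csv("happyscore_income.csv")  # hypothetical file name for the Kaggle data
mod3 <- lm(adjusted_satisfaction ~ income_inequality + GDP + happyScore,
           data = happy)
msummary(mod3)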
Use the above plots to assess and comment on the regression
conditions of linearity, homoscedasticity (equal variance), and
normality in this situation.
It doesn’t look like income inequality can be used to predict the adjusted satisfaction score: the scatterplot shows little linear pattern, and the correlation coefficient is very small (about -0.124), which means the two variables have a weak relationship.
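For reference, the scatterplot and correlation for this part could be produced along these lines, reusing the hypothetical happy data frame from the sketch above:

plot(adjusted_satisfaction ~ income_inequality, data = happy)  # scatterplot
cor(happy$income_inequality, happy$adjusted_satisfaction)      # correlation, about -0.124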
Yes, as we can see from the added variable plot, the residuals display a linear relationship. It is not extremely strong, since there are some outliers here and there, but overall the points follow a linear pattern whose slope is clearly different from zero (i.e., not horizontal), which means that a decent amount of the unexplained variation in adjusted satisfaction can be uniquely explained by income inequality.
The best model here seems to be model 3, which uses income inequality, GDP, and happyScore to predict adjusted satisfaction. It clearly satisfies the \(C_p\) rule of thumb, since \(3 + 1 = 4 > 3.29 = C_p\). Because the third model has the smallest \(C_p\) and satisfies the rule of thumb, we can say it is the model the “best subsets” procedure suggests. Its BIC of -173 is also close to the smallest.
The residual plot has a slight pattern to it, and there are a few outliers (29, 57, 103) on the sides, so we can’t say it completely satisfies the linearity and equal variance conditions; but since the pattern is not very strong or curvy, we can say both conditions are reasonably satisfied, and we should proceed with analyzing this model with caution. In the QQ plot, the points are closely scattered around the reference line, with a few outliers (the same ones) in the tails, so we can say the normality condition is satisfied, again proceeding with caution.
From the above plots, points 29, 57, and 103 stand out as outliers. They are more distant from the other points on the residual plot and QQ plot, and they are somewhat influential on the residual plot, since their presence makes the fitted curve bend up a bit at the beginning. On the leverage plot, all three points exceed the \(\sqrt{2} \approx 1.41\) limit, so by the rule of thumb we can categorize them as potential outliers. Points 29 and 103 also have the two highest Cook’s distance values, which makes sense since Cook’s distance takes leverage into account. More importantly, on the Cook’s distance vs. leverage plot, countries 29 and 103 are farther from the fitted line than all other points, and country 57 is a bit closer, which suggests that all three countries are potential outliers, especially countries 29 and 103. The three countries are the Dominican Republic (29), Liberia (57), and Tanzania (103).
A medical doctor and her team of researchers collected a variety of data on women who were having trouble getting pregnant. Data for randomly selected patients for whom complete information was available are provided in the file Fertility, including:
## [1] "Age" "LowAFC" "MeanAFC" "FSH" "E2"
## [6] "MaxE2" "MaxDailyGn" "TotalGn" "Oocytes" "Embryos"
A key method for assessing fertility is the count of antral follicles, which can be performed with noninvasive ultrasound. Researchers are interested in how the other variables are related to these counts (either LowAFC, the smallest antral follicle count, or MeanAFC, the average antral follicle count).
Use the “best subsets” procedure to find the model (from the predictors other than LowAFC) with the smallest \(C_p\) value to predict MeanAFC. Write down the predictors in this model and the corresponding Mallow’s \(C_p\).

SOLUTION:
## R example code for the Best Subsets Regression using Cp values ##
## only need to replace DATASET, X1, X2, X3, ..., Xk, Y with the correct names! ##
## DELETE the option `eval=FALSE` above before knitting the file ##
library(leaps)       # for regsubsets()
library(mosaic)      # for msummary()
library(knitr)       # for kable()
library(kableExtra)  # for row_spec()
best.sub <- regsubsets(Y ~ X1 + X2 + ... + Xk, data = DATASET)
with(msummary(best.sub), data.frame(adjr2, cp, bic, rss, outmat)) %>%
  kable(digits = 3, booktabs = TRUE) %>%
  row_spec(row = 0, angle = 90)
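For this problem, the template might be filled in as follows (a sketch using the Fertility variables listed above, with all predictors except LowAFC):

library(Stat2Data)
data(Fertility)
best.sub <- regsubsets(MeanAFC ~ Age + FSH + E2 + MaxE2 + MaxDailyGn +
                         TotalGn + Oocytes + Embryos, data = Fertility)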
Use stepwise regression (again with the predictors other than LowAFC) to predict MeanAFC. Write down the predictors in this model and briefly describe each step of this model building process.

SOLUTION:
## R example code for Backward elimination / Forward selection / Stepwise regression ##
## only need to replace DATASET, X1, X2, X3, ..., Xk, Y with the correct names! ##
## DELETE the option `eval=FALSE` above before knitting the file ##
library(MASS)  # for stepAIC()
mod.intercept <- lm(Y ~ 1, data = DATASET)
mod.all <- lm(Y ~ X1 + X2 + ... + Xk, data = DATASET)
stepAIC(mod.intercept, scope = list(lower = mod.intercept, upper = mod.all),
        direction = "both", trace = FALSE)$anova  # stepwise regression
SOLUTION: