Write your answers below
SOLUTION: a. The response variable is WineQuality. It is a quantitative variable.
The explanatory variables are WinterRain, AverageTemp and Harvest Rain, and they are all quantitative variables.
Higher wine quality is associated with more winter rainfall because the coefficient of winter rainfall is positive.
Higher wine quality is associated with less harvest rainfall because the coefficient of harvest rainfall is negative.
Higher wine quality is associated with a higher average growing-season temperature because the coefficient of average temperature is positive.
f. The data that Ashenfelter analyzed are observational because the values were determined by nature, not assigned by an experimenter.
Write your answers below
SOLUTION: a. The coefficient for age would be negative because older roller coasters tend to be slower. The other predictors (maximum height, total length and maximum vertical drop) should have positive coefficients because each has a positive relationship with the top speed of the coaster.
I think maximum vertical drop could be the best of these predictors of top speed, since a roller coaster accelerates by dropping vertically.
The coefficient for Age agrees with our previous expectation that older coasters go slower. The coefficients of Height, Maximum Vertical Drop and Length are all positive, as expected. However, TypeCode has a negative coefficient, whereas it was positive in the original model.
Speed= 59.97 mph
Here is the initial R code to get you started
data(Day1Survey)
summary(Day1Survey)
## Section Class Sex Distance Height
## Min. :1.00 * : 1 F:17 Min. : 50 Min. :60.0
## 1st Qu.:1.00 Freshman :12 M:26 1st Qu.: 300 1st Qu.:66.0
## Median :1.00 Junior :11 Median : 600 Median :70.0
## Mean :1.44 N/A : 1 Mean : 1579 Mean :68.8
## 3rd Qu.:2.00 Senior :10 3rd Qu.: 1000 3rd Qu.:72.5
## Max. :2.00 Sophomore: 8 Max. :20000 Max. :77.0
## NA's :1
## Handedness Coins WhiteString BlackString
## Ambidextrous: 1 Min. : 0 Min. :22.0 Min. : 0.0
## Left : 4 1st Qu.: 0 1st Qu.:32.0 1st Qu.: 2.0
## Right :38 Median : 1 Median :38.0 Median : 4.0
## Mean : 19 Mean :38.3 Mean : 5.4
## 3rd Qu.: 4 3rd Qu.:43.0 3rd Qu.: 6.0
## Max. :500 Max. :70.0 Max. :42.0
##
## Reading TV Pulse Texting
## Min. : 2 Min. : 0.00 Min. :48.0 Min. : 0.0
## 1st Qu.: 50 1st Qu.: 2.00 1st Qu.:57.0 1st Qu.: 3.5
## Median : 100 Median : 3.00 Median :66.0 Median : 10.0
## Mean : 172 Mean : 4.81 Mean :67.1 Mean : 30.7
## 3rd Qu.: 200 3rd Qu.: 5.50 3rd Qu.:75.0 3rd Qu.: 35.0
## Max. :2000 Max. :25.00 Max. :96.0 Max. :250.0
##
part a: Does resting pulse differ by sex?
SOLUTION: CHOOSE: This study has a quantitative response (resting pulse rate) and a binary categorical explanatory factor (sex). We choose the model \(Y = \mu_i + \varepsilon\), where \(\mu_1\) is the mean pulse rate for women, \(\mu_2\) is the mean pulse rate for men, and \(\varepsilon\) is the random error.
FIT: From the output of the code below, the mean pulse rate is 67.8 for female students and 66.7 for male students. The sample standard deviations of pulse rate are \(s_1\) = 11.4 for female students and \(s_2\) = 11.3 for male students.
ASSESS: To assess the model, we see from the plots that pulse rates are roughly normally distributed, so there are no major concerns about normality. As a second component of assessment, I use the Welch two-sample t-test. We test the null hypothesis H0: \(\mu_1 = \mu_2\) against the alternative hypothesis Ha: \(\mu_1 \neq \mu_2\). The test gives t = 0.3 (df = 30) and p-value = 0.7. The p-value is large, so we fail to reject the null hypothesis. Thus, there is no significant difference in mean pulse rates by sex.
USE: There is no evidence of a significant difference in mean pulse rates between female and male students.
favstats(~Pulse,data=Day1Survey)
## min Q1 median Q3 max mean sd n missing
## 48 57 66 75 96 67.1 11.2 43 0
histogram(~Pulse,data=Day1Survey)
densityplot(~ Pulse, data=Day1Survey, groups=Sex, auto.key=TRUE)
bwplot(Pulse ~ Sex, data=Day1Survey)
favstats(~ Pulse | Sex, data=Day1Survey)
## Sex min Q1 median Q3 max mean sd n missing
## 1 F 51 60 72 75 90 67.8 11.4 17 0
## 2 M 48 57 66 72 96 66.7 11.3 26 0
Males <- subset(Day1Survey,Sex=="M")
Females <- subset(Day1Survey,Sex=="F")
t.test(Females$Pulse, Males$Pulse) #other ways to do this test exist
##
## Welch Two Sample t-test
##
## data: Females$Pulse and Males$Pulse
## t = 0.3, df = 30, p-value = 0.7
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.02 8.36
## sample estimates:
## mean of x mean of y
## 67.8 66.7
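As a cross-check, the Welch statistic can be recomputed by hand from the group summaries in the favstats output. This is a minimal sketch in base R, assuming the rounded means, standard deviations and group sizes printed above; because those inputs are rounded, the results only approximately match the t.test output. The variable names are illustrative.

```r
# Group summaries copied from the favstats(~ Pulse | Sex) output (rounded values)
mean_f <- 67.8; sd_f <- 11.4; n_f <- 17
mean_m <- 66.7; sd_m <- 11.3; n_m <- 26

# Estimated variance of each sample mean
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m

# Welch t statistic: difference in means over its standard error
se_diff <- sqrt(v_f + v_m)
t_stat <- (mean_f - mean_m) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch, p = p_value), 2)
```

The knitted output above prints very few significant digits, so agreement should be judged loosely: a small t statistic and a large p-value, leading to the same conclusion.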
part b: Write your question of interest here
Is there evidence that there is a significant height difference between male students and female students?
SOLUTION:
favstats(~Height, data=Day1Survey)
## min Q1 median Q3 max mean sd n missing
## 60 66 70 72.5 77 68.8 4.41 43 0
histogram(~Height, data=Day1Survey)
densityplot(~Height, data=Day1Survey, groups=Sex, auto.key=TRUE)
bwplot(Height~ Sex, data=Day1Survey)
favstats(~Height| Sex, data=Day1Survey )
## Sex min Q1 median Q3 max mean sd n missing
## 1 F 60 62 65 67 70 64.6 3.37 17 0
## 2 M 66 70 71 73 77 71.6 2.39 26 0
Males <- subset(Day1Survey,Sex=="M")
Females <- subset(Day1Survey,Sex=="F")
t.test(Females$Height, Males$Height) #other ways to do this test exist
##
## Welch Two Sample t-test
##
## data: Females$Height and Males$Height
## t = -7, df = 30, p-value = 8e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.87 -4.99
## sample estimates:
## mean of x mean of y
## 64.6 71.6
CHOOSE: Height in this study is quantitative, and sex is a binary categorical variable. We choose the model \(Y = \mu_i + \varepsilon\), where \(\mu_1\) is the mean height for women, \(\mu_2\) is the mean height for men, and \(\varepsilon\) is the random error.
FIT: From the output of the code above, the mean height is 64.6 for female students and 71.6 for male students. The sample standard deviations of height are \(s_1\) = 3.37 for female students and \(s_2\) = 2.39 for male students.
ASSESS: To assess the model, we first check normality. The histogram shows more values than expected clustered around 60, but overall the heights are roughly normally distributed. As a second component of assessment, I use the Welch two-sample t-test. We test the null hypothesis H0: \(\mu_1 = \mu_2\) against the alternative hypothesis Ha: \(\mu_1 \neq \mu_2\). The test gives t = -7 (df = 30) and p-value = 8e-08.
The p-value is very small, so we reject the null hypothesis.
USE: There is a significant height difference between male students and female students; in this sample, male students are taller on average.
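The size of the height difference can also be summarized with a confidence interval. The sketch below rebuilds a Welch 95% interval from the group summaries in the favstats output, assuming the rounded values printed there; it will therefore be close to, but not identical to, the interval reported by t.test. The variable names are illustrative.

```r
# Group summaries copied from the favstats(~ Height | Sex) output (rounded values)
mean_f <- 64.6; sd_f <- 3.37; n_f <- 17
mean_m <- 71.6; sd_m <- 2.39; n_m <- 26

# Standard error of the difference in means (unequal variances)
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m
se_diff <- sqrt(v_f + v_m)

# Welch-Satterthwaite degrees of freedom and the matching t critical value
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))
t_crit <- qt(0.975, df = df_welch)

# 95% interval for (female mean - male mean)
diff_means <- mean_f - mean_m
ci <- diff_means + c(-1, 1) * t_crit * se_diff
round(ci, 2)
```

Both endpoints are negative, which agrees with the test: zero is not a plausible value for the difference in mean heights.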
Remember your code goes first, followed by comments and your interpretations.
data(Pines)
summary(Pines)
## Row Col Hgt90 Hgt96 Diam96
## Min. : 1.0 Min. : 1.0 Min. : 6.0 Min. : 11 Min. :0.5
## 1st Qu.:14.0 1st Qu.: 6.0 1st Qu.:16.0 1st Qu.:235 1st Qu.:3.0
## Median :26.0 Median :12.0 Median :19.0 Median :289 Median :4.3
## Mean :24.8 Mean :12.4 Mean :19.2 Mean :279 Mean :4.1
## 3rd Qu.:35.0 3rd Qu.:18.0 3rd Qu.:22.4 3rd Qu.:329 3rd Qu.:5.3
## Max. :44.0 Max. :30.0 Max. :39.0 Max. :491 Max. :8.1
## NA's :182 NA's :139 NA's :149
## Grow96 Hgt97 Diam97 Spread.97
## Min. : 0.0 Min. : 87 Min. : 0.4 Min. : 20
## 1st Qu.: 64.0 1st Qu.:301 1st Qu.: 4.7 1st Qu.:152
## Median : 78.0 Median :357 Median : 6.2 Median :185
## Mean : 74.2 Mean :347 Mean : 5.9 Mean :183
## 3rd Qu.: 90.0 3rd Qu.:406 3rd Qu.: 7.4 3rd Qu.:214
## Max. :126.0 Max. :558 Max. :10.7 Max. :339
## NA's :136 NA's :135 NA's :139 NA's :136
## Needles97 Deer95 Deer97 Cover95
## Min. : 25.0 Min. :0.0 Min. :0.0 Min. :0.00
## 1st Qu.: 65.0 1st Qu.:0.0 1st Qu.:0.0 1st Qu.:0.00
## Median : 74.0 Median :0.0 Median :0.0 Median :1.00
## Mean : 73.8 Mean :0.2 Mean :0.1 Mean :1.33
## 3rd Qu.: 82.0 3rd Qu.:0.0 3rd Qu.:0.0 3rd Qu.:2.00
## Max. :172.0 Max. :1.0 Max. :1.0 Max. :3.00
## NA's :135 NA's :129 NA's :135
## Fert Spacing
## Min. :0.000 Min. :10.0
## 1st Qu.:0.000 1st Qu.:10.0
## Median :1.000 Median :10.0
## Mean :0.507 Mean :12.1
## 3rd Qu.:1.000 3rd Qu.:15.0
## Max. :1.000 Max. :15.0
##
require(ggplot2)
ggplot(data=Pines, aes(y=Hgt96, x=Hgt90)) +
  geom_jitter() +
  scale_x_continuous("Tree Height at Time of Planting (cm)") +
  scale_y_continuous("Tree Height in 1996(cm)") +
  geom_smooth(method="lm")
## Warning: Removed 193 rows containing non-finite values (stat_smooth).
## Warning: Removed 193 rows containing missing values (geom_point).
Treeheight.lm<-lm(Hgt96 ~ Hgt90, data=Pines)
plot(Treeheight.lm, which=2)
plot(Treeheight.lm, which=1)
part a: Scatterplot and comment on relationship
SOLUTION: In my opinion, the scatterplot shows a weak positive relationship between tree height at the time of planting and tree height in 1996. As planting height increases, 1996 height tends to increase, but with considerable scatter.
part b: Fit a least squares line and report it
SOLUTION: The least squares line fitted on the scatterplot is consistent with my comment in the previous part. It shows a positive relationship between tree height at the time of planting and tree height in 1996.
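Part b also asks for the equation itself, which can be read from the fitted coefficients. The following is a minimal sketch of that reporting step. It uses made-up toy heights, not values from the Pines data, and the names toy and toy_fit are hypothetical; with the real data the same calls would be applied to the Treeheight.lm object fitted above, whose coefficients are not reproduced here.

```r
# Made-up toy heights (cm) for illustration only; NOT values from the Pines data
toy <- data.frame(
  Hgt90 = c(10, 15, 20, 25, 30),
  Hgt96 = c(200, 240, 270, 310, 330)
)

# Fit the least squares line of 1996 height on planting height
toy_fit <- lm(Hgt96 ~ Hgt90, data = toy)

# coef() returns a named vector: "(Intercept)" first, then the slope for Hgt90
b <- coef(toy_fit)

# Report the fitted line as an equation
cat(sprintf("Predicted Hgt96 = %.1f + %.2f * Hgt90\n", b[1], b[2]))
```

The slope is the predicted change in 1996 height for each additional centimetre of height at planting.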
part c: Assess and discuss the fit of the model
SOLUTION: First, the normal QQ plot shows a consistent linear trend, which supports the normality condition. Second, the residuals vs. fitted plot shows roughly constant variance of the residuals across the fitted values. Therefore, this is a reasonable model for analysis.
data(Goldenrod)
require(ggplot2)
ggplot(data=Goldenrod, aes(y=Gdiam03, x=Stdiam03)) +
  geom_jitter() +
  scale_x_continuous("Stem Diameter in 2003(mm)") +
  scale_y_continuous("Gall Diameter in 2003(mm)") +
  geom_smooth(method="lm")
## Warning: Removed 293 rows containing non-finite values (stat_smooth).
## Warning: Removed 293 rows containing missing values (geom_point).
ggplot(data=Goldenrod, aes(y=Wall03, x=Stdiam03)) +
  geom_jitter() +
  scale_x_continuous("Stem Diameter in 2003(mm)") +
  scale_y_continuous("Wall Thickness in 2003(mm)") +
  geom_smooth(method="lm")
## Warning: Removed 460 rows containing non-finite values (stat_smooth).
## Warning: Removed 460 rows containing missing values (geom_point).
ggplot(data=Goldenrod, aes(y=Wall03, x=Gdiam03)) +
  geom_jitter() +
  scale_x_continuous("Gall Diameter in 2003(mm)") +
  scale_y_continuous("Wall Thickness in 2003(mm)") +
  geom_smooth(method="lm")
## Warning: Removed 460 rows containing non-finite values (stat_smooth).
## Warning: Removed 460 rows containing missing values (geom_point).
Goldenrod.lm<-lm(Wall03 ~ Gdiam03, data=Goldenrod)
plot(Goldenrod.lm, which=1)
part a: Check for a positive correlation in 2003
SOLUTION: From the scatterplot and least squares regression line we created, we can observe a positive relationship between stem diameter and gall diameter in 2003.
part b: Compare relationships with wall thickness
SOLUTION: From the scatterplots we created, in my opinion, gall diameter has a stronger linear association with wall thickness than stem diameter does. Comparing the two scatterplots, the points in the wall thickness vs. gall diameter plot follow a straight line more closely than those in the wall thickness vs. stem diameter plot.
part c: Fit a least squares line and report it
SOLUTION: The least squares lines show a positive relationship both between wall thickness and gall diameter and between wall thickness and stem diameter.
part d: Find fitted value and residual for first observation
SOLUTION: Fitted model: \(\hat{y} = -1.052 + 0.368x\). For the first observation, the observed value is (x, y) = (9.10, 3.30). Fitted value: \(\hat{y} = -1.052 + 0.368 \times 9.10 = 2.2968\). Residual = observed - fitted = 3.30 - 2.2968 = 1.0032, or about 1.00.
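The arithmetic in part d can be checked directly. This short sketch assumes the coefficients and the first observation exactly as quoted in the solution text; the variable names are illustrative.

```r
# Coefficients and first observation as quoted in the solution text
b0 <- -1.052   # intercept
b1 <- 0.368    # slope
x1 <- 9.10     # explanatory value of the first observation
y1 <- 3.30     # observed response of the first observation

# Fitted value from the line, then residual = observed - fitted
fitted_1 <- b0 + b1 * x1
resid_1 <- y1 - fitted_1

round(c(fitted = fitted_1, residual = resid_1), 4)
```

A positive residual means the first observation lies above the fitted line.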
part e: Report value of a typical residual (Hint: this is a value in the output that helps to assess model fit.)
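SOLUTION: The size of a typical residual is measured by the residual standard error (the regression standard error) reported in the output of summary(Goldenrod.lm). It estimates the standard deviation of the errors, so it describes how far a typical observation falls from the fitted line. The red line in the residuals vs. fitted plot is a smoothed trend through the residuals; it does not give the size of a typical residual.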
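One common reading of "typical residual", and the one the hint points to, is the residual standard error that summary() prints for a fitted lm. As a rough sketch, it is the square root of the sum of squared residuals divided by n - 2. The example below uses made-up toy values, not the Goldenrod data, and the names toy and toy_fit are hypothetical.

```r
# Made-up toy values for illustration only; NOT taken from the Goldenrod data
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(1.2, 1.9, 3.2, 3.8, 5.1, 5.8)
)
toy_fit <- lm(y ~ x, data = toy)

# Residual standard error by hand: sqrt(SSE / (n - 2)) for simple linear regression
n <- nrow(toy)
sse <- sum(resid(toy_fit)^2)
s_by_hand <- sqrt(sse / (n - 2))

# The same quantity as stored in the model summary
s_reported <- summary(toy_fit)$sigma

c(by_hand = s_by_hand, reported = s_reported)
```

For the homework model, the corresponding value would come from summary(Goldenrod.lm), in the line labelled "Residual standard error".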