Stat 230 - Homework #1 - Due Thursday, Feb. 1

Yichi Zhang (Niko)

Help From:NamesofAnyoneYouGotHelpFromGoHere

PROBLEMS TO TURN IN: #0.6, #0.13, #0.15, #1.18, #1.24

Exercise 0.6

Write your answers below

SOLUTION: a. The response variable is WineQuality. It is a quantitative variable.

b. The explanatory variables are WinterRain, AverageTemp, and HarvestRain, and they are all quantitative variables.
c. Higher wine quality is associated with more winter rainfall because the coefficient of WinterRain is positive.
d. Higher wine quality is associated with less harvest rainfall because the coefficient of HarvestRain is negative.
e. Higher wine quality is associated with a higher average growing-season temperature because the coefficient of AverageTemp is positive.

f. The data that Ashenfelter analyzed are observational because the values are determined by nature, not assigned by an experimenter.

Exercise 0.13

Write your answers below

SOLUTION: a. The coefficient for age would be negative because the longer a roller coaster has been in use, the slower it tends to be. All other predictors (maximum height, total length, and maximum vertical drop) should have positive coefficients because each has a positive relationship with the top speed of the coaster.

b. I think the maximum vertical drop is likely the best single predictor of top speed, since a roller coaster accelerates by dropping vertically.
c. The coefficient for Age agrees with our previous expectation that older coasters go slower. The coefficients of Height, Maximum Vertical Drop, and Length are all positive, as expected. However, TypeCode has a negative coefficient, although it was positive in the original model.
d. Speed = 59.97 mph

Exercise 0.15

Here is the initial R code to get you started

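require(Stat2Data) # assumption: package providing Day1Survey (setup not shown in the original)
require(mosaic)    # assumption: package providing favstats() and the formula-style plots
data(Day1Survey)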
summary(Day1Survey)

##     Section           Class    Sex       Distance         Height    
##  Min.   :1.00   *        : 1   F:17   Min.   :   50   Min.   :60.0  
##  1st Qu.:1.00   Freshman :12   M:26   1st Qu.:  300   1st Qu.:66.0  
##  Median :1.00   Junior   :11          Median :  600   Median :70.0  
##  Mean   :1.44   N/A      : 1          Mean   : 1579   Mean   :68.8  
##  3rd Qu.:2.00   Senior   :10          3rd Qu.: 1000   3rd Qu.:72.5  
##  Max.   :2.00   Sophomore: 8          Max.   :20000   Max.   :77.0  
##                                       NA's   :1                     
##         Handedness     Coins      WhiteString    BlackString  
##  Ambidextrous: 1   Min.   :  0   Min.   :22.0   Min.   : 0.0  
##  Left        : 4   1st Qu.:  0   1st Qu.:32.0   1st Qu.: 2.0  
##  Right       :38   Median :  1   Median :38.0   Median : 4.0  
##                    Mean   : 19   Mean   :38.3   Mean   : 5.4  
##                    3rd Qu.:  4   3rd Qu.:43.0   3rd Qu.: 6.0  
##                    Max.   :500   Max.   :70.0   Max.   :42.0  
##                                                               
##     Reading           TV            Pulse         Texting     
##  Min.   :   2   Min.   : 0.00   Min.   :48.0   Min.   :  0.0  
##  1st Qu.:  50   1st Qu.: 2.00   1st Qu.:57.0   1st Qu.:  3.5  
##  Median : 100   Median : 3.00   Median :66.0   Median : 10.0  
##  Mean   : 172   Mean   : 4.81   Mean   :67.1   Mean   : 30.7  
##  3rd Qu.: 200   3rd Qu.: 5.50   3rd Qu.:75.0   3rd Qu.: 35.0  
##  Max.   :2000   Max.   :25.00   Max.   :96.0   Max.   :250.0  
##

part a: Does resting pulse differ by sex?

SOLUTION: CHOOSE: The response (pulse rate) is quantitative and the explanatory factor (sex) is categorical and binary. We choose the model \(Y = μ_i + ε\). When i=1, \(μ_i\) is the mean pulse rate for women. When i=2, \(μ_i\) is the mean pulse rate for men. ε is the random error.

FIT: From the output generated by the code below, the mean pulse rate is 67.8 for female students and 66.7 for male students. The sample standard deviations of pulse rate are \(s_1\)=11.4 for female students and \(s_2\)=11.3 for male students.

ASSESS: To assess the model, we first check normality. The plots show that pulse rates are roughly normally distributed, so there are no major concerns about normality. As a second component of assessment, I use the Welch two-sample t-test. We test the null hypothesis H0: \(μ_1 = μ_2\) against the alternative hypothesis Ha: \(μ_1 \neq μ_2\). The test gives t = 0.3 and p-value = 0.7. The p-value is large, so we fail to reject the null hypothesis. Thus, there is no significant difference in mean pulse rates by sex.

USE: There is no evidence of a significant difference in mean pulse rates between female and male students.

favstats(~Pulse,data=Day1Survey)

##  min Q1 median Q3 max mean   sd  n missing
##   48 57     66 75  96 67.1 11.2 43       0

histogram(~Pulse,data=Day1Survey)

densityplot(~ Pulse, data=Day1Survey, groups=Sex, auto.key=TRUE)

bwplot(Pulse ~ Sex, data=Day1Survey)

favstats(~ Pulse | Sex, data=Day1Survey)

##   Sex min Q1 median Q3 max mean   sd  n missing
## 1   F  51 60     72 75  90 67.8 11.4 17       0
## 2   M  48 57     66 72  96 66.7 11.3 26       0

Males <- subset(Day1Survey,Sex=="M")
Females <- subset(Day1Survey,Sex=="F")
t.test(Females$Pulse, Males$Pulse) #other ways to do this test exist

## 
##  Welch Two Sample t-test
## 
## data:  Females$Pulse and Males$Pulse
## t = 0.3, df = 30, p-value = 0.7
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.02  8.36
## sample estimates:
## mean of x mean of y 
##      67.8      66.7
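The Welch statistic above can be approximately reproduced from the group summaries alone. The following is a minimal sketch, assuming the rounded means, standard deviations, and sample sizes printed by favstats; `welch_from_summary` is a hypothetical helper written for this illustration and is not part of mosaic or base R.

```r
# Welch two-sample t statistic and Satterthwaite degrees of freedom,
# computed from summary statistics instead of the raw data.
# welch_from_summary is a hypothetical helper (not a library function).
welch_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  v1 <- s1^2 / n1                      # squared standard error, group 1
  v2 <- s2^2 / n2                      # squared standard error, group 2
  t_stat <- (m1 - m2) / sqrt(v1 + v2)  # unpooled (Welch) t statistic
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p_value <- 2 * pt(-abs(t_stat), df)  # two-sided p-value
  list(t = t_stat, df = df, p.value = p_value)
}

# Rounded pulse summaries from the favstats output: females first, then males
pulse <- welch_from_summary(67.8, 11.4, 17, 66.7, 11.3, 26)
pulse
```

Because the inputs are rounded, the results only roughly match the t.test output: t comes out close to 0.3 with degrees of freedom in the mid-30s, and the p-value is large either way.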

part b: Write your question of interest here

Is there evidence that there is a significant height difference between male students and female students?

SOLUTION:

favstats(~Height, data=Day1Survey)

##  min Q1 median   Q3 max mean   sd  n missing
##   60 66     70 72.5  77 68.8 4.41 43       0

histogram(~Height, data=Day1Survey)

densityplot(~Height, data=Day1Survey, groups=Sex, auto.key=TRUE)

bwplot(Height~ Sex, data=Day1Survey)

favstats(~Height| Sex, data=Day1Survey )

##   Sex min Q1 median Q3 max mean   sd  n missing
## 1   F  60 62     65 67  70 64.6 3.37 17       0
## 2   M  66 70     71 73  77 71.6 2.39 26       0

Males <- subset(Day1Survey,Sex=="M")
Females <- subset(Day1Survey,Sex=="F")
t.test(Females$Height, Males$Height) #other ways to do this test exist

## 
##  Welch Two Sample t-test
## 
## data:  Females$Height and Males$Height
## t = -7, df = 30, p-value = 8e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.87 -4.99
## sample estimates:
## mean of x mean of y 
##      64.6      71.6

CHOOSE: The response (height) is quantitative and the explanatory factor (sex) is categorical and binary. We choose the model \(Y = μ_i + ε\). When i=1, \(μ_i\) is the mean height for women. When i=2, \(μ_i\) is the mean height for men. ε is the random error.

FIT: From the output generated by the code above, the mean height is 64.6 for female students and 71.6 for male students. The sample standard deviations of height are \(s_1\)=3.37 for female students and \(s_2\)=2.39 for male students.

ASSESS: To assess the model, we first check normality. The histogram shows more values than expected clustered around 60, but overall the heights are roughly normally distributed. As a second component of assessment, I use the Welch two-sample t-test. We test the null hypothesis H0: \(μ_1 = μ_2\) against the alternative hypothesis Ha: \(μ_1 \neq μ_2\). The test gives t = -7 and p-value = 8e-08.

In this case, the p-value is very small, so we reject the null hypothesis.

USE: There is strong evidence of a difference in mean height between male and female students. In this sample, male students are taller on average (71.6 vs. 64.6).
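As a rough hand check of the reported t statistic, the Welch formula can be evaluated with the rounded group summaries from the favstats output (\(n_1\)=17 females, \(n_2\)=26 males). The notation \(\bar{y}_i\) for the sample means is introduced here for illustration only.

\[
t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
  = \frac{64.6 - 71.6}{\sqrt{\dfrac{3.37^2}{17} + \dfrac{2.39^2}{26}}}
  \approx \frac{-7.0}{0.942}
  \approx -7.4
\]

This agrees with the printed value t = -7, which R shows rounded to one significant digit.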

Remember your code goes first, followed by comments and your interpretations.

Exercise 1.18

data(Pines)
summary(Pines)

##       Row            Col           Hgt90          Hgt96         Diam96   
##  Min.   : 1.0   Min.   : 1.0   Min.   : 6.0   Min.   : 11   Min.   :0.5  
##  1st Qu.:14.0   1st Qu.: 6.0   1st Qu.:16.0   1st Qu.:235   1st Qu.:3.0  
##  Median :26.0   Median :12.0   Median :19.0   Median :289   Median :4.3  
##  Mean   :24.8   Mean   :12.4   Mean   :19.2   Mean   :279   Mean   :4.1  
##  3rd Qu.:35.0   3rd Qu.:18.0   3rd Qu.:22.4   3rd Qu.:329   3rd Qu.:5.3  
##  Max.   :44.0   Max.   :30.0   Max.   :39.0   Max.   :491   Max.   :8.1  
##                                NA's   :182    NA's   :139   NA's   :149  
##      Grow96          Hgt97         Diam97       Spread.97  
##  Min.   :  0.0   Min.   : 87   Min.   : 0.4   Min.   : 20  
##  1st Qu.: 64.0   1st Qu.:301   1st Qu.: 4.7   1st Qu.:152  
##  Median : 78.0   Median :357   Median : 6.2   Median :185  
##  Mean   : 74.2   Mean   :347   Mean   : 5.9   Mean   :183  
##  3rd Qu.: 90.0   3rd Qu.:406   3rd Qu.: 7.4   3rd Qu.:214  
##  Max.   :126.0   Max.   :558   Max.   :10.7   Max.   :339  
##  NA's   :136     NA's   :135   NA's   :139    NA's   :136  
##    Needles97         Deer95        Deer97       Cover95    
##  Min.   : 25.0   Min.   :0.0   Min.   :0.0   Min.   :0.00  
##  1st Qu.: 65.0   1st Qu.:0.0   1st Qu.:0.0   1st Qu.:0.00  
##  Median : 74.0   Median :0.0   Median :0.0   Median :1.00  
##  Mean   : 73.8   Mean   :0.2   Mean   :0.1   Mean   :1.33  
##  3rd Qu.: 82.0   3rd Qu.:0.0   3rd Qu.:0.0   3rd Qu.:2.00  
##  Max.   :172.0   Max.   :1.0   Max.   :1.0   Max.   :3.00  
##  NA's   :135     NA's   :129   NA's   :135                 
##       Fert          Spacing    
##  Min.   :0.000   Min.   :10.0  
##  1st Qu.:0.000   1st Qu.:10.0  
##  Median :1.000   Median :10.0  
##  Mean   :0.507   Mean   :12.1  
##  3rd Qu.:1.000   3rd Qu.:15.0  
##  Max.   :1.000   Max.   :15.0  
##

require(ggplot2)
ggplot(data=Pines, aes(y=Hgt96, x=Hgt90))+
  geom_jitter()+
  scale_x_continuous("Tree Height at Time of Planting (cm)")+
  scale_y_continuous("Tree Height in 1996(cm)")+
  geom_smooth(method="lm")

## Warning: Removed 193 rows containing non-finite values (stat_smooth).

## Warning: Removed 193 rows containing missing values (geom_point).

Treeheight.lm<-lm(Hgt96 ~ Hgt90, data=Pines)
plot(Treeheight.lm, which=2)

plot(Treeheight.lm, which=1)

part a: Scatterplot and comment on relationship

SOLUTION: In my opinion, the scatterplot shows a weak, positive relationship between tree height at the time of planting and tree height in 1996. Trees that were taller at planting tend to be somewhat taller in 1996, but there is a lot of scatter around the trend.

part b: Fit a least squares line and report it

SOLUTION: The least squares line fitted to the scatterplot supports my comment in the previous part. It shows a positive relationship between tree height at the time of planting and tree height in 1996.
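To report the line itself, the intercept and slope can be read off the fitted model object with coef(). The sketch below is only an illustration: `report_line` is a hypothetical helper, and it is run on R's built-in `cars` data as a stand-in so that it works without the homework data. Applied to the model above, the call would be `report_line(Treeheight.lm)`.

```r
# Hypothetical helper: format the fitted line of a simple linear regression
# as "response = intercept + slope * predictor".
report_line <- function(model) {
  b <- coef(model)                          # named vector: intercept, slope
  response <- all.vars(formula(model))[1]   # name of the response variable
  sprintf("%s = %.3f + %.3f * %s", response, b[1], b[2], names(b)[2])
}

# Stand-in example on built-in data (not the Pines data)
demo.lm <- lm(dist ~ speed, data = cars)
report_line(demo.lm)
```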

part c: Assess and discuss the fit of the model

SOLUTION: Firstly, the QQ plot shows a consistent linear trend, which supports the normality condition. Also, the residuals vs. fitted plot shows roughly constant variance of the residuals around the fitted values. Therefore, this is a reasonable model for analysis.

Exercise 1.24

data(Goldenrod)
require(ggplot2)
ggplot(data=Goldenrod, aes(y=Gdiam03, x=Stdiam03))+
  geom_jitter()+
  scale_x_continuous("Stem Diameter in 2003(mm)")+
  scale_y_continuous("Gall Diameter in 2003(mm)")+
  geom_smooth(method="lm")

## Warning: Removed 293 rows containing non-finite values (stat_smooth).

## Warning: Removed 293 rows containing missing values (geom_point).

ggplot(data=Goldenrod, aes(y=Wall03, x=Stdiam03))+
  geom_jitter()+
  scale_x_continuous("Stem Diameter in 2003(mm)")+
  scale_y_continuous("Wall Thickness in 2003(mm)")+
  geom_smooth(method="lm")

## Warning: Removed 460 rows containing non-finite values (stat_smooth).

## Warning: Removed 460 rows containing missing values (geom_point).

ggplot(data=Goldenrod, aes(y=Wall03, x=Gdiam03))+
  geom_jitter()+
  scale_x_continuous("Gall Diameter in 2003(mm)")+
  scale_y_continuous("Wall Thickness in 2003(mm)")+
  geom_smooth(method="lm")

## Warning: Removed 460 rows containing non-finite values (stat_smooth).

## Warning: Removed 460 rows containing missing values (geom_point).

Goldenrod.lm<-lm(Wall03 ~ Gdiam03, data=Goldenrod)
plot(Goldenrod.lm, which=1)

part a: Check for a positive correlation in 2003

SOLUTION: From the scatterplot and least squares regression line we created, we can observe a positive relationship between stem diameter and gall diameter in 2003.

part b: Compare relationships with wall thickness

SOLUTION: From the scatterplots we created, in my opinion, gall diameter has a stronger linear association with wall thickness than stem diameter does. Comparing the two scatterplots, the points in the wall thickness vs. gall diameter plot follow a linear pattern more closely than those in the wall thickness vs. stem diameter plot.

part c: Fit a least squares line and report it

SOLUTION: The least squares lines show a positive relationship both between wall thickness and gall diameter and between wall thickness and stem diameter.

part d: Find fitted value and residual for first observation

SOLUTION: Fitted model: \(\hat{y} = -1.052 + 0.368x\). For the first observation, the observed values are (x, y) = (9.10, 3.30). Fitted value: \(\hat{y} = -1.052 + 0.368 \times 9.10 = 2.2968\). Residual = observed - fitted = 3.30 - 2.2968 = 1.0032, or about 1.00.
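The arithmetic can be checked with a few lines of R. This sketch uses only the coefficients and the first observation as reported in the solution; the variable names are chosen here for illustration.

```r
# Coefficients of the fitted line as reported in the solution
b0 <- -1.052
b1 <- 0.368

# First observation as reported: predictor value and observed response
x1 <- 9.10
y1 <- 3.30

fitted1 <- b0 + b1 * x1   # fitted value for the first observation
resid1  <- y1 - fitted1   # residual = observed - fitted

round(c(fitted = fitted1, residual = resid1), 4)
# fitted = 2.2968, residual = 1.0032
```

With the model object available, fitted(Goldenrod.lm) and resid(Goldenrod.lm) return the same quantities for every row used in the fit. Rows with missing values are dropped by lm, so the first element corresponds to the first complete case.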

part e: Report value of a typical residual (Hint: this is a value in the output that helps to assess model fit.)

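SOLUTION: I am not entirely sure about this question. The hint points to the regression standard error (the "Residual standard error" line in summary(Goldenrod.lm)), which measures the size of a typical residual. The red line in the residuals vs. fitted plot is only a smoothed trend through the residuals, so it does not give this value.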
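One common single-number summary of residual size is the regression standard error, which summary() prints as "Residual standard error". The sketch below shows where that number comes from, using R's built-in `cars` data as a stand-in because the Goldenrod output is not reproduced here. For the homework model the corresponding call would be `summary(Goldenrod.lm)$sigma`.

```r
# Stand-in simple regression on built-in data (not the Goldenrod data)
demo.lm <- lm(dist ~ speed, data = cars)

# Regression standard error by hand: sqrt(SSE / residual degrees of freedom)
rse_by_hand <- sqrt(sum(resid(demo.lm)^2) / df.residual(demo.lm))

# The same quantity as stored in the model summary
rse_from_summary <- summary(demo.lm)$sigma

c(by_hand = rse_by_hand, from_summary = rse_from_summary)
```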