Regression Homework Helper

I am again going to try to keep things simple by sticking to base R while working through some of the homework questions.

  1. Use a scatterplot and the linear correlation coefficient r to determine whether there is a correlation between the two variables. (Note: Use software, and don’t forget to look at the scatterplot!)

First I’ll load the data. I could load the data from a CSV, but to keep the reproducibility high I am going to enter the data manually.

x <- c(0.8, 1.4, 2, 3.5, 4.5, 5.4, 6.1, 7.8, 8.4, 9.7, 10.3, 11.7, 12.8, 13.2, 14.8)
y <- c(14.2, 13.6, 13, 11.5, 10.5, 9.6, 8.9, 7.2, 6.6, 5.3, 4.7, 3.3, 2.2, 1.8, 0.2)

plot(x,y)

That looks rather nicely linear! Of course it is negatively correlated (the slope is negative!).

To get the correlation \(r\)

cor(x,y,method = "pearson")
## [1] -1

We are using the ‘Pearson’ correlation method here. If I call the cor function without specifying a method it just defaults to Pearson anyway; the other choices are the rank-based ‘kendall’ and ‘spearman’ methods.
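
Just to check that, here is the call with no method argument, plus one of the rank-based alternatives (a quick aside, not part of the homework; both come out as -1 here because y is strictly decreasing in x):

cor(x,y)
## [1] -1
cor(x,y,method = "spearman")
## [1] -1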

Let’s repeat this using the definition of correlation! \[ r = \frac{\sum_{i=1}^n(x_i-\overline x)(y_i-\overline y )}{(n-1)s_xs_y} \]

sx = sd(x)
sy = sd(y) 
n = length(x)
xbar = mean(x)
ybar = mean(y)


numerator = mean((x-xbar)*(y-ybar))*n   # the mean times n is just the sum in the numerator of the formula

r = numerator/((n-1)*sx*sy)
r
## [1] -1

Sweet! There are a couple of interesting things I did above! I took a sequence \(x\) and subtracted a single value \(\overline x\) from each element, and I multiplied two sequences together elementwise. R does all of this vectorized, with no explicit loop.
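
Here is a tiny toy illustration of that vectorized behaviour, separate from the homework data:

a <- c(1,2,3)
a - 1                # the single value is recycled and subtracted from each element
## [1] 0 1 2
a*c(10,20,30)        # multiplication is also elementwise
## [1] 10 40 90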

Let’s also get the equation of the linear relationship.

lm(y~x)
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##          15           -1

This brings in a couple of interesting R calls. I needed to describe the model, so I wrote y ~ x; that tilde expression is a formula (not a function), and it tells lm which variable to regress on which. Let’s define a function too so I can make some predictions.

fit = lm(y~x)

slope = coef(fit)[2]
intercept = coef(fit)[1]

yhat <- function(xpredict){
  return( slope*xpredict+intercept)
}

yhat(10)
## x 
## 5

So here I did it with a user-defined function. (The x printed above the 5 is just the name that tags along with the slope coefficient.) While I was doing this I noticed the following command I could have used!

xToPredict <- data.frame(x = c(10))
predict(fit, newdata = xToPredict)
## 1 
## 5

We may want commands like this later! This command was a little bit of a pain because you can only pass a data frame to the newdata option, and its column name has to match the predictor name in the model (x here). One really nice feature though is that I can get confidence intervals!

predict(fit, newdata = xToPredict, interval = 'confidence')
##   fit lwr upr
## 1   5   5   5

Well crumb! This one is not interesting because it is a perfect fit, \(r=-1\), so the interval collapses to a single point. In any case it is giving us the 95% CI (the default confidence level for predict).

Lastly we should perform a hypothesis test. Here the null hypothesis is that there is no correlation. \[ H_0: \rho =0\\ H_A: \rho\neq 0 \] We have already computed the sample correlation \(r\). The test statistic \(t\) is then \[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \] Here the degrees of freedom will be \(df=n-2\)

t = r*sqrt(n-2)/sqrt(1-r^2)
t
## [1] -Inf

There is no chance this happens with real data! Since \(r=-1\) exactly, \(1-r^2=0\) and I am really dividing by 0, which is why \(t\) comes out as \(-\infty\)…

pt(t,n-2)
## [1] 0

So yes, we can reject the null hypothesis; the \(p\)-value is essentially zero. (For the two-sided alternative we would double the tail probability from pt, but doubling zero is still zero.)
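
In fact R will run this whole test in one call with cor.test, which reports the same \(t\), the degrees of freedom, and a two-sided \(p\)-value (I won’t reproduce the output here, but it agrees with the computation above):

cor.test(x,y)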

If instead we want to test the slope of the line, we consider the following as the theoretical fit \[ \hat y = \beta_0+\beta_1x \]

The standard error for the slope is \[ SE_{\beta_1} = \sqrt{\frac{MSE}{\sum\left(x_i-\overline x\right)^2}} \]

Where \[ MSE = \frac{\sum_{i=1}^n\left(y_i-\hat y_i\right)^2}{n-2} \]

\(MSE\) is referred to as the mean square error. It adds up all the squared errors and divides by \(n-2\), the degrees of freedom, rather than by \(n\).

yhat <- fitted(fit)      # fitted values at the original x's (note: this replaces the yhat function from earlier)
MSE = sum((y-yhat)^2)/(n-2)
MSE
## [1] 6.627806e-30

The denominator inside the square root above is sometimes called \(S_{xx}\), the sum of squared deviations of \(x\) from its mean.

sxx = sum((x-xbar)^2)

We can now compute the standard error for the slope as

SEslope = sqrt(MSE/sxx)
SEslope
## [1] 1.514884e-16

If instead I wanted to look at the intercept, \(\beta_0\), its standard error is \[ SE_{\beta_0} = \sqrt{MSE\left(\frac1n+\frac{\overline x^{\,2}}{S_{xx}}\right)} \]

This will be straightforward as I have already computed all of these values.

SEintercept = sqrt(MSE*(1/n+xbar^2/sxx))
SEintercept
## [1] 1.315456e-15

Both estimates \(b_0\) and \(b_1\), divided by their standard errors, follow a Student’s \(t\) distribution with degrees of freedom \(df = n-2\).
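
So, for example, a test of \(H_0:\beta_1=0\) just divides the slope estimate by its standard error, and a 95% confidence interval uses the matching \(t\) quantile. Here is a minimal sketch reusing the quantities computed above (the numbers are degenerate because of the perfect fit, but the recipe is the general one), with confint as a built-in check:

tslope = slope/SEslope                 # t statistic for H0: beta1 = 0
tcrit = qt(0.975, df = n-2)            # two-sided 95% critical value
c(slope - tcrit*SEslope, slope + tcrit*SEslope)   # 95% CI for the slope by hand
confint(fit)                           # R's built-in confidence intervals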

If I want to do the confidence interval for a prediction by hand, it is very similar! For an \(x_h\) at which you want to predict the mean response, \[ SE_{x_h}=\sqrt{MSE\left(\frac1n+\frac{\left(x_h-\overline x\right)^2}{S_{xx}}\right)} \] Again this is Student’s \(t\) distributed with \(df=n-2\). I had R do this computation above with predict, but a by-hand version is sketched below.
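
If I did do it by hand for the \(x_h = 10\) from earlier, it would look something like this (a sketch reusing MSE, sxx, n, xbar, slope, and intercept from above; it should match the predict output):

xh = 10
SExh = sqrt(MSE*(1/n + (xh-xbar)^2/sxx))       # SE of the mean response at xh
yhatxh = intercept + slope*xh                  # point estimate at xh
c(yhatxh - qt(0.975, n-2)*SExh, yhatxh + qt(0.975, n-2)*SExh)   # 95% CI, the same recipe predict uses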

My example is clunky here because I had a perfect fit to begin with, but you can see that these computations are not terrible with R! Also we are going to see shortly that R will return all of these values if I ask it to…

summary(fit)
## Warning in summary.lm(fit): essentially perfect fit: summary may be unreliable
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -8.777e-15 -5.387e-16  6.692e-16  1.111e-15  3.821e-15 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  1.500e+01  1.469e-15  1.021e+16   <2e-16 ***
## x           -1.000e+00  1.692e-16 -5.910e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.876e-15 on 13 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 3.493e+31 on 1 and 13 DF,  p-value: < 2.2e-16

I was thinking about this problem a few hours later and lo and behold there is a way to get R to tell you all of these values… Passing the fit to the summary() function, we see all of these outputs computed for us: the standard errors and even the MSE (the square of the residual standard error) are right there!
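
For instance, the individual pieces can be pulled straight out of the summary object (these are standard components of a summary.lm object):

s = summary(fit)
s$coefficients     # estimates, standard errors, t values, and p-values
s$sigma^2          # residual variance, i.e. the MSE from above
s$r.squared        # R-squared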

Okay I cannot resist graphing…

plot(x,y)
abline(fit)

plot(fit)

I don’t really know what many of these plots are. The Normal Q-Q plot is checking the normality of the residuals. The rest I have no idea about. Maybe a project if someone is interested…

  1. In some cases, the best-fitting multiple regression equation is of the form \(\hat y=b_0+b_1x+b_2x^2+b_3x^3\). The graph of such an equation is called a cubic. Using the data set given below, and letting \(x_1=x\), \(x_2=x^2\), and \(x_3=x^3\), find the multiple regression equation for the cubic that best fits the given data.
xnew <- c(-8,-7,-2,0,3,6,10)
ynew <- c(3.9,0.9,1.6,4.4,4.4,-5.2,-45.3)
xsquare = xnew^2
xcube = xnew^3
quadraticfit = lm(ynew~xnew+xsquare+xcube)
summary(quadraticfit)
## 
## Call:
## lm(formula = ynew ~ xnew + xsquare + xcube)
## 
## Residuals:
##          1          2          3          4          5          6          7 
##  8.218e-02 -1.074e-01 -4.438e-02  1.710e-01 -1.418e-01  4.047e-02 -1.704e-05 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.2290494  0.0936073   45.18 2.39e-05 ***
## xnew         1.0595259  0.0296256   35.76 4.81e-05 ***
## xsquare     -0.1972106  0.0021918  -89.98 3.03e-06 ***
## xcube       -0.0404032  0.0004339  -93.12 2.73e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1541 on 3 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:  0.9999 
## F-statistic: 2.749e+04 on 3 and 3 DF,  p-value: 3.724e-07
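
As an aside, I could have let R build the power columns for me with poly; with raw = TRUE this fits exactly the same cubic (the coefficient names just come out differently):

cubicfit2 = lm(ynew ~ poly(xnew, 3, raw = TRUE))
coef(cubicfit2)    # same estimates as the fit above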

I’ll conclude with a graphic!

eqn = function(x){
  # keep the '+' at the end of each line; a line starting with '+' begins a new
  # expression, so R would silently return only the last term
  coef(quadraticfit)[1] +
    coef(quadraticfit)[2]*x +
    coef(quadraticfit)[3]*x^2 +
    coef(quadraticfit)[4]*x^3
}


plot(x = 1, type = "n", xlim = c(-9,11), ylim = c(-46,15), xlab = "Great", ylab = "Scott")   # type = "n" draws the axes without plotting the placeholder point
points(xnew,ynew)
curve(eqn, add = TRUE)

This was harder than I thought it would be; I do prefer the ggplot2 package!
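
Since I brought up ggplot2, here is roughly what that version might look like, assuming the package is installed (a sketch rather than something I ran above):

library(ggplot2)
ggplot(data.frame(x = xnew, y = ynew), aes(x, y)) +
  geom_point() +                    # the data points
  stat_function(fun = eqn) +        # the fitted cubic, reusing eqn from above
  labs(x = "Great", y = "Scott")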