Chapter 3

For this guide I will be using a set of data about the girth, height, and volume of trees. For the most part I will be looking at the relationship between a trees girth as a predictor of its height.

trees <-read.csv(url("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/trees.csv"))
attach(trees)
trees
##     X Girth Height Volume
## 1   1   8.3     70   10.3
## 2   2   8.6     65   10.3
## 3   3   8.8     63   10.2
## 4   4  10.5     72   16.4
## 5   5  10.7     81   18.8
## 6   6  10.8     83   19.7
## 7   7  11.0     66   15.6
## 8   8  11.0     75   18.2
## 9   9  11.1     80   22.6
## 10 10  11.2     75   19.9
## 11 11  11.3     79   24.2
## 12 12  11.4     76   21.0
## 13 13  11.4     76   21.4
## 14 14  11.7     69   21.3
## 15 15  12.0     75   19.1
## 16 16  12.9     74   22.2
## 17 17  12.9     85   33.8
## 18 18  13.3     86   27.4
## 19 19  13.7     71   25.7
## 20 20  13.8     64   24.9
## 21 21  14.0     78   34.5
## 22 22  14.2     80   31.7
## 23 23  14.5     74   36.3
## 24 24  16.0     72   38.3
## 25 25  16.3     77   42.6
## 26 26  17.3     81   55.4
## 27 27  17.5     82   55.7
## 28 28  17.9     80   58.3
## 29 29  18.0     80   51.5
## 30 30  18.0     80   51.0
## 31 31  20.6     87   77.0

3.1 The Simple Linear Regression Model

Simple linear regression assumptions: -x and y have a relationship that can be approximated by a straight line.

To check that this assumption is rational, we make a scatterplot of the data.

plot(Girth, Height)

On seeing that the assumption is reasonable, we can plot a best fit line to it.

mytree <- lm(Height ~ Girth)  #One thing to note on for the lm function, is that the response variable goes first,, rather than second
plot(Girth, Height)
abline(mytree)

3.2 The Lease Squares Point Estimates

The next step is looking into how efficient our best fit line is. We do this by calculating the least squares point estimates.

The following few lines are to calculate the residuals of our line.

mytreeResids <- mytree$residuals
plot(Girth, mytreeResids)
abline(0,0)

3.3 Point Estimates and Point Predictions

To perform point estimates, we first need our slope and intercept.

mytree
## 
## Call:
## lm(formula = Height ~ Girth)
## 
## Coefficients:
## (Intercept)        Girth  
##      62.031        1.054

Once we have our slope and our intercept, then performing a point estimation is just a matter of plugging in a value for X. In this case our intercept is 62, and our slope is 1.054. The equation for this instance is : y=62.031 + 1.054X If we wanted to predict the height of a tree using its girth as a predictor (9 in this instance), we would have a line of code like this:

62.031+1.054*9
## [1] 71.517

One thing to note is when doing point predictions, you keep your predictions within the experimental region.

3.4 Model Assumptions

Regression Assumptions ~At any value X, the potential error term has a mean equal to 0 ~Constant variance assumption ~Normality assumption ~Independence assumption

3.5 Testing the Significance of the Slope and y-Intercept

slope test H0: B1=0 Ha: B1!=0 or B1>0 or B1<0

This is the code to get the confidence interval. One odd note when doing it by hand, the T stat will have n-2 DF.

confint(mytree, level=.90)
##                    5 %      95 %
## (Intercept) 54.5835265 69.479101
## Girth        0.5068706  1.601867

Considering that 0 isn’t in the interval given, we can reject Ho.

For an intercept test everything is exactly the same, therefor we get H0: B0=0 Ha: B0!=0 or B1>0 or B1<0

This is the code to get the confidence interval.

confint(mytree, level=.90)
##                    5 %      95 %
## (Intercept) 54.5835265 69.479101
## Girth        0.5068706  1.601867

Only this time we look at the (intercept) row. And again, the interval doesn’t contain 0, so we can reject Ho.

3.6 Confidence and Prediction Intervals

For confidence intervals, we are looking to estimate the mean of a set of values we have yet to collect. The code to get a confidence interval we use the code below. Note, that this code takes in the assumption that we know the girth of the trees we are going to measure is 9.6.

treeFrame <- data.frame( Girth=9.6 )
tc <- predict (mytree, treeFrame, interval="confidence")
tc
##        fit      lwr      upr
## 1 72.15325 69.00365 75.30286

Our output here tells us that the expected mean height of trees with a girth of 9.6 is between 69.00365 and 75.30286 with 95% confidence.

For our prediction interval, we use the code below. The same measurement of 9.6 of the girth is used for consistency

treeFrame <- data.frame( Girth=9.6 )
tp <- predict (mytree, treeFrame, interval="predict")
tp
##        fit      lwr      upr
## 1 72.15325 60.39609 83.91042

And again, our output tells us that the expected height of the next tree with a girth of 9.6 is expected to be between 60.39609 and 83.91042 tall, with 95% confidence.

Lastly, when predicting the next trees height, and the average of the next few trees height, the variation is going to be smaller when taking an average. This is reflected in how our prediction has a wider range than our confidence interval.

3.7 Simple Coefficients of Determination and Correlation

When looking at a set of data, an important piece of information is the correlation coefficient. The correlation coefficient is a measure of how well the model fits the data, in addition to whether or not the model has a negative slope, or a positive slope.

The code below gives us the correlation of our data.

cor(Girth,Height)
## [1] 0.5192801

This shows us that our data has a correlation coefficient of .5192801 and with a positive corrilation.

3.8 F-test for the model

For the F-test, we look into the significance of the regression relationship between x and y. The following code includes our F-test

summary(mytree)
## 
## Call:
## lm(formula = Height ~ Girth)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5816  -2.7686   0.3163   2.4728   9.9456 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  62.0313     4.3833  14.152 1.49e-14 ***
## Girth         1.0544     0.3222   3.272  0.00276 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.538 on 29 degrees of freedom
## Multiple R-squared:  0.2697, Adjusted R-squared:  0.2445 
## F-statistic: 10.71 on 1 and 29 DF,  p-value: 0.002758

The last line here has “F-Statistic: 10.71 on 1 and 29 DF, p-value: .002758” This tells us that our linear model resulted in an F-statistic of 10.71, with 1 and 29 degrees of freedom. The corresponding p-value was .002758, and this low probability shows us that the relationship between x and y is significant.

Definitions

time series

~data observed over a time sequence

cross-sectional

~data observed at a single time

experimental region

~region contained within the range of data points

simple coefficient of determination

~a measure of the usefulness of a simple linear regression model

Code lookup

read.csv(url (x))

~x is a url

~url is a command that extracts data from a url

~read.csv is a command that reads text, and separates the data based on newlines and commas

data(x)

~data(x) finds and extracts the data points from x

attach(x)

~when referencing sections of the data, you don’t need to specify x, beforehand every time

x

~just entering the title of the data spits out the dataset

plot(y, x)

~Plots a graph, with y, on the y axis, and x on the x axis (and all data points of the data as well)

z <- lm(x ~ y)

~creates a linear model, using X’s data for the x axis, and Y’s data for the y axis and stores the info in z

z

~stating the name of the linear model will output the y intercept and slope

abline(z)

~plots the linear model stored in z on the current graph

resid <- z$residuals

~records all the residuals of z

confint(z)

~gives you a confidence interval for your intercept, and slope. (note to only use one of these at a time, while the other remains constant)

cor(x,y))

~gives you the correlation between x and y.

a <- data.frame( x = 3.15)

~creates a frame of data of a, with the base info that x= 3.15

(p1 <- predict(line, frame, interval=“predict”))

~predict creates a prediction interval, using the prediction line “line” predicting what the y variable will be, with the x variables listed from “frame”. “interval=”predict“” states that it is performing a prediction interval.

~the extra parenthesis cause R to output the result of the code within them.

(c1 <- predict(Length.mod, Sepal.Frame, interval=“confidence”))

~predict creates a confidence interval, using the prediction line “line” predicting what the y variable will be, with the x variables listed from “frame”. “interval=”confidence“” states that it is performing a confidence interval.