For this guide I will be using a set of data about the girth, height, and volume of trees. For the most part I will be looking at the relationship between a trees girth as a predictor of its height.
trees <-read.csv(url("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/trees.csv"))
attach(trees)
trees
## X Girth Height Volume
## 1 1 8.3 70 10.3
## 2 2 8.6 65 10.3
## 3 3 8.8 63 10.2
## 4 4 10.5 72 16.4
## 5 5 10.7 81 18.8
## 6 6 10.8 83 19.7
## 7 7 11.0 66 15.6
## 8 8 11.0 75 18.2
## 9 9 11.1 80 22.6
## 10 10 11.2 75 19.9
## 11 11 11.3 79 24.2
## 12 12 11.4 76 21.0
## 13 13 11.4 76 21.4
## 14 14 11.7 69 21.3
## 15 15 12.0 75 19.1
## 16 16 12.9 74 22.2
## 17 17 12.9 85 33.8
## 18 18 13.3 86 27.4
## 19 19 13.7 71 25.7
## 20 20 13.8 64 24.9
## 21 21 14.0 78 34.5
## 22 22 14.2 80 31.7
## 23 23 14.5 74 36.3
## 24 24 16.0 72 38.3
## 25 25 16.3 77 42.6
## 26 26 17.3 81 55.4
## 27 27 17.5 82 55.7
## 28 28 17.9 80 58.3
## 29 29 18.0 80 51.5
## 30 30 18.0 80 51.0
## 31 31 20.6 87 77.0
Simple linear regression assumptions: -x and y have a relationship that can be approximated by a straight line.
To check that this assumption is rational, we make a scatterplot of the data.
plot(Girth, Height)
On seeing that the assumption is reasonable, we can plot a best fit line to it.
mytree <- lm(Height ~ Girth) #One thing to note on for the lm function, is that the response variable goes first,, rather than second
plot(Girth, Height)
abline(mytree)
The next step is looking into how efficient our best fit line is. We do this by calculating the least squares point estimates.
The following few lines are to calculate the residuals of our line.
mytreeResids <- mytree$residuals
plot(Girth, mytreeResids)
abline(0,0)
To perform point estimates, we first need our slope and intercept.
mytree
##
## Call:
## lm(formula = Height ~ Girth)
##
## Coefficients:
## (Intercept) Girth
## 62.031 1.054
Once we have our slope and our intercept, then performing a point estimation is just a matter of plugging in a value for X. In this case our intercept is 62, and our slope is 1.054. The equation for this instance is : y=62.031 + 1.054X If we wanted to predict the height of a tree using its girth as a predictor (9 in this instance), we would have a line of code like this:
62.031+1.054*9
## [1] 71.517
One thing to note is when doing point predictions, you keep your predictions within the experimental region.
Regression Assumptions ~At any value X, the potential error term has a mean equal to 0 ~Constant variance assumption ~Normality assumption ~Independence assumption
slope test H0: B1=0 Ha: B1!=0 or B1>0 or B1<0
This is the code to get the confidence interval. One odd note when doing it by hand, the T stat will have n-2 DF.
confint(mytree, level=.90)
## 5 % 95 %
## (Intercept) 54.5835265 69.479101
## Girth 0.5068706 1.601867
Considering that 0 isn’t in the interval given, we can reject Ho.
For an intercept test everything is exactly the same, therefor we get H0: B0=0 Ha: B0!=0 or B1>0 or B1<0
This is the code to get the confidence interval.
confint(mytree, level=.90)
## 5 % 95 %
## (Intercept) 54.5835265 69.479101
## Girth 0.5068706 1.601867
Only this time we look at the (intercept) row. And again, the interval doesn’t contain 0, so we can reject Ho.
For confidence intervals, we are looking to estimate the mean of a set of values we have yet to collect. The code to get a confidence interval we use the code below. Note, that this code takes in the assumption that we know the girth of the trees we are going to measure is 9.6.
treeFrame <- data.frame( Girth=9.6 )
tc <- predict (mytree, treeFrame, interval="confidence")
tc
## fit lwr upr
## 1 72.15325 69.00365 75.30286
Our output here tells us that the expected mean height of trees with a girth of 9.6 is between 69.00365 and 75.30286 with 95% confidence.
For our prediction interval, we use the code below. The same measurement of 9.6 of the girth is used for consistency
treeFrame <- data.frame( Girth=9.6 )
tp <- predict (mytree, treeFrame, interval="predict")
tp
## fit lwr upr
## 1 72.15325 60.39609 83.91042
And again, our output tells us that the expected height of the next tree with a girth of 9.6 is expected to be between 60.39609 and 83.91042 tall, with 95% confidence.
Lastly, when predicting the next trees height, and the average of the next few trees height, the variation is going to be smaller when taking an average. This is reflected in how our prediction has a wider range than our confidence interval.
When looking at a set of data, an important piece of information is the correlation coefficient. The correlation coefficient is a measure of how well the model fits the data, in addition to whether or not the model has a negative slope, or a positive slope.
The code below gives us the correlation of our data.
cor(Girth,Height)
## [1] 0.5192801
This shows us that our data has a correlation coefficient of .5192801 and with a positive corrilation.
For the F-test, we look into the significance of the regression relationship between x and y. The following code includes our F-test
summary(mytree)
##
## Call:
## lm(formula = Height ~ Girth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5816 -2.7686 0.3163 2.4728 9.9456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.0313 4.3833 14.152 1.49e-14 ***
## Girth 1.0544 0.3222 3.272 0.00276 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.538 on 29 degrees of freedom
## Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445
## F-statistic: 10.71 on 1 and 29 DF, p-value: 0.002758
The last line here has “F-Statistic: 10.71 on 1 and 29 DF, p-value: .002758” This tells us that our linear model resulted in an F-statistic of 10.71, with 1 and 29 degrees of freedom. The corresponding p-value was .002758, and this low probability shows us that the relationship between x and y is significant.
time series
~data observed over a time sequence
cross-sectional
~data observed at a single time
experimental region
~region contained within the range of data points
simple coefficient of determination
~a measure of the usefulness of a simple linear regression model
read.csv(url (x))
~x is a url
~url is a command that extracts data from a url
~read.csv is a command that reads text, and separates the data based on newlines and commas
data(x)
~data(x) finds and extracts the data points from x
attach(x)
~when referencing sections of the data, you don’t need to specify x, beforehand every time
x
~just entering the title of the data spits out the dataset
plot(y, x)
~Plots a graph, with y, on the y axis, and x on the x axis (and all data points of the data as well)
z <- lm(x ~ y)
~creates a linear model, using X’s data for the x axis, and Y’s data for the y axis and stores the info in z
z
~stating the name of the linear model will output the y intercept and slope
abline(z)
~plots the linear model stored in z on the current graph
resid <- z$residuals
~records all the residuals of z
confint(z)
~gives you a confidence interval for your intercept, and slope. (note to only use one of these at a time, while the other remains constant)
cor(x,y))
~gives you the correlation between x and y.
a <- data.frame( x = 3.15)
~creates a frame of data of a, with the base info that x= 3.15
(p1 <- predict(line, frame, interval=“predict”))
~predict creates a prediction interval, using the prediction line “line” predicting what the y variable will be, with the x variables listed from “frame”. “interval=”predict“” states that it is performing a prediction interval.
~the extra parenthesis cause R to output the result of the code within them.
(c1 <- predict(Length.mod, Sepal.Frame, interval=“confidence”))
~predict creates a confidence interval, using the prediction line “line” predicting what the y variable will be, with the x variables listed from “frame”. “interval=”confidence“” states that it is performing a confidence interval.