Today we are going to introduce a corner stone analytical technique while expanding some of our knowlegde of the capabilities included with r. For reference, we will be following the tutorial here: http://sciencefair.math.iit.edu/analysis/linereg/hand/
R Comes pre loaded with a ton of data sets collected from various sources. One of the more well known data sets is the cars data, It is a nice clean data set to practice regression and visualization. We attach the data as follows:
require(datasets)
attach(cars)
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
We have two variables in this dataset. The speed and dist variable. The summary function provides us some top level statistics. The stats can be quite useful when determining the distribution of data. The best way to check the relationship between two numeric variables is to plot them. R provides us with some base function to make this visualization.
The plot function is of the form plot(data for x axis, data for y axis, xlab=‘x axis name’, ylab=‘y axis name’, main=‘graph name’)
plot(cars$speed, cars$dist, xlab='Speed (mph)', ylab='Stopping Distance (ft)',
main='Stopping Distance vs. Speed')
The problem at hand is to see if we can predict stopping distance as a function of speed. We will need to compute the various metrics by hand or in our case, by using software and not fuctions.
Lets compute all of these metrics one by one and save them to variables. Lets call stopping distance y and speed x.
http://faculty.cas.usf.edu/mbrannick/regression/Reg2IV.html
The simple regression line takes the following form:
\[ y=a+b_1x_1 \]
We wantto compute the slope and the intercept using our data to best predict y but we first need to find the metrics stated earlier in this text.
mean_x<-mean(cars$speed)
mean_y<-mean(cars$dist)
Since we need to iterate value by value, we will compute the difference between each data point and the mean and put them into new columns.
n=nrow(cars)
for (x in n)
{
cars$x_to_mean_diff=(cars$speed-mean_x)
}
for (y in n)
{
cars$y_to_mean_diff=(cars$dist-mean_y)
}
head(cars)
## speed dist x_to_mean_diff y_to_mean_diff
## 1 4 2 -11.4 -40.98
## 2 4 10 -11.4 -32.98
## 3 7 4 -8.4 -38.98
## 4 7 22 -8.4 -20.98
## 5 8 16 -7.4 -26.98
## 6 9 10 -6.4 -32.98
We need to make a new column that is the product of x_to_mean_diff and y_to_mean_diff and call it xy
for (y in n)
{
cars$xy=(cars$x_to_mean_diff*cars$y_to_mean_diff)
}
head(cars)
## speed dist x_to_mean_diff y_to_mean_diff xy
## 1 4 2 -11.4 -40.98 467.172
## 2 4 10 -11.4 -32.98 375.972
## 3 7 4 -8.4 -38.98 327.432
## 4 7 22 -8.4 -20.98 176.232
## 5 8 16 -7.4 -26.98 199.652
## 6 9 10 -6.4 -32.98 211.072
We should have enought to compute the slope of our regression line. The formula is as follows:
\[ b=\frac{\sum(x-mean(x))(y-mean(y))}{\sum(x-mean(x))^{2}} \]
Lets deconstruct this formula part by part. The numerator says to take the sum of all the xy values in our xy column. The denominator says to the sum of squares of the x data point minus the mean.
b_num=sum(cars$xy)
b_den=sum((cars$x_to_mean_diff)**2)
b=b_num/b_den
b
## [1] 3.932409
The final step is to calculate the intercept, which we can do using the initial regression equation with the values of distance and speed set as their respective means, along with our newly calculated b
print(paste0("mean of distance is ", mean_y))
## [1] "mean of distance is 42.98"
print(paste0("mean of speed is ", mean_x))
## [1] "mean of speed is 15.4"
Now it is a question of doing simple algebra.
\[ y=bx+a\\ 42.98=(3.932409)(15.4)+a\\ 42.98=60.5591+a\\ -17.5791=a \]
\[ y=3.932409(speed)-17.5791 \]
Lets store the regression line in a formula and use it to make a prediciton. Lets predict the stopping distance when the speed is 64.
model <-function(s)
{
3.932409*s-17.5791
}
model(64)
## [1] 234.0951
Expert domain knowledge will tell us if this prediction makes sense or not. Lets check if we have the right regression line using r’s built in function.
mod<-lm(dist~speed, cars)
summary(mod)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
As you can see, our intrcept and speed is given to us however, we are presented with more metrics than we found. All of these are useful. For the next session we will go over how to check if the model is a good fit and some assumptions about linear regression.
Predict Ozone using Solar.R as a predictor variable. Compute the regression line by scratch as we did with the car example, then verify that the regression line is correct using r’s lm function.
attach(airquality)
air<-subset(airquality, select=c("Ozone", "Solar.R"))
summary(air)
## Ozone Solar.R
## Min. : 1.00 Min. : 7.0
## 1st Qu.: 18.00 1st Qu.:115.8
## Median : 31.50 Median :205.0
## Mean : 42.13 Mean :185.9
## 3rd Qu.: 63.25 3rd Qu.:258.8
## Max. :168.00 Max. :334.0
## NA's :37 NA's :7