Introduction to Simple Regression

Today we are going to introduce a corner stone analytical technique while expanding some of our knowlegde of the capabilities included with r. For reference, we will be following the tutorial here: http://sciencefair.math.iit.edu/analysis/linereg/hand/

R Comes pre loaded with a ton of data sets collected from various sources. One of the more well known data sets is the cars data, It is a nice clean data set to practice regression and visualization. We attach the data as follows:

require(datasets)

attach(cars)

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

We have two variables in this dataset. The speed and dist variable. The summary function provides us some top level statistics. The stats can be quite useful when determining the distribution of data. The best way to check the relationship between two numeric variables is to plot them. R provides us with some base function to make this visualization.

The plot function is of the form plot(data for x axis, data for y axis, xlab=‘x axis name’, ylab=‘y axis name’, main=‘graph name’)

plot(cars$speed, cars$dist, xlab='Speed (mph)', ylab='Stopping Distance (ft)', 
     main='Stopping Distance vs. Speed')

The problem at hand is to see if we can predict stopping distance as a function of speed. We will need to compute the various metrics by hand or in our case, by using software and not fuctions.

Lets compute all of these metrics one by one and save them to variables. Lets call stopping distance y and speed x.

http://faculty.cas.usf.edu/mbrannick/regression/Reg2IV.html

The simple regression line takes the following form:

\[ y=a+b_1x_1 \]

We wantto compute the slope and the intercept using our data to best predict y but we first need to find the metrics stated earlier in this text.

Compute the means of x and y

mean_x<-mean(cars$speed)

mean_y<-mean(cars$dist)

differences between the data point and the mean for each column

Since we need to iterate value by value, we will compute the difference between each data point and the mean and put them into new columns.

n=nrow(cars)

for (x in n) 
 {
  cars$x_to_mean_diff=(cars$speed-mean_x)
 }

for (y in n) 
 {
  cars$y_to_mean_diff=(cars$dist-mean_y)
}

head(cars)

##   speed dist x_to_mean_diff y_to_mean_diff
## 1     4    2          -11.4         -40.98
## 2     4   10          -11.4         -32.98
## 3     7    4           -8.4         -38.98
## 4     7   22           -8.4         -20.98
## 5     8   16           -7.4         -26.98
## 6     9   10           -6.4         -32.98

Product of mean to data point differences for x and y

We need to make a new column that is the product of x_to_mean_diff and y_to_mean_diff and call it xy

for (y in n) 
 {
  cars$xy=(cars$x_to_mean_diff*cars$y_to_mean_diff)
}

head(cars)

##   speed dist x_to_mean_diff y_to_mean_diff      xy
## 1     4    2          -11.4         -40.98 467.172
## 2     4   10          -11.4         -32.98 375.972
## 3     7    4           -8.4         -38.98 327.432
## 4     7   22           -8.4         -20.98 176.232
## 5     8   16           -7.4         -26.98 199.652
## 6     9   10           -6.4         -32.98 211.072

Calculate the slope b

We should have enought to compute the slope of our regression line. The formula is as follows:

\[ b=\frac{\sum(x-mean(x))(y-mean(y))}{\sum(x-mean(x))^{2}} \]

Lets deconstruct this formula part by part. The numerator says to take the sum of all the xy values in our xy column. The denominator says to the sum of squares of the x data point minus the mean.

b_num=sum(cars$xy)

b_den=sum((cars$x_to_mean_diff)**2)

b=b_num/b_den

b

## [1] 3.932409

calculate intercept

The final step is to calculate the intercept, which we can do using the initial regression equation with the values of distance and speed set as their respective means, along with our newly calculated b

print(paste0("mean of distance is ", mean_y))

## [1] "mean of distance is 42.98"

print(paste0("mean of speed is ", mean_x))

## [1] "mean of speed is 15.4"

Now it is a question of doing simple algebra.

\[ y=bx+a\\ 42.98=(3.932409)(15.4)+a\\ 42.98=60.5591+a\\ -17.5791=a \]

State the regression line

\[ y=3.932409(speed)-17.5791 \]

Lets store the regression line in a formula and use it to make a prediciton. Lets predict the stopping distance when the speed is 64.

model <-function(s)
  {
  3.932409*s-17.5791
  }

model(64)

## [1] 234.0951

Expert domain knowledge will tell us if this prediction makes sense or not. Lets check if we have the right regression line using r’s built in function.

mod<-lm(dist~speed, cars)

summary(mod)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

As you can see, our intrcept and speed is given to us however, we are presented with more metrics than we found. All of these are useful. For the next session we will go over how to check if the model is a good fit and some assumptions about linear regression.

Data Science Session 8b

Media Data Science

July 29, 2019