Regression Analysis

Linear regression analysis is a statistical model for testing whether there is a linear relationship between two quantitative variables. In other words, is there a correlation between two variables that is statistically distinguishable from a correlation of 0? Or is there a good chance we would find some correlation in our data even if in the real world the correlation is 0? Simple regression basically makes a scatterplot of two variables, draws a line that cuts as close as it can to all the data points, and gives us calculations of how well the line fits. One important calculation is the \( B_1 \) coefficient, the slope of the line, which R calls the “estimate.” In other words, it “estimates” how much Y increases when X increases by one, based on what the sample of data shows. A second important calculation is the p-value, which R labels Pr(>|t|); it tells us the probability that we'd find a \( B_1 \) this far from zero by chance if in the real world \( B_1 \) is actually zero. A third important calculation is the intercept, denoted \( B_0 \), where the regression line meets the Y-axis; in other words, the expected value of Y when X is equal to 0. Finally, we calculate an error term, \( e \), which reflects the gap between each data point and the line we fit through them. When we fit the line that cuts as closely as possible to all of the data points, the line can be summarized mathematically as

\[ Y = B_0 + B_{1}X + e \]
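To make the notation concrete with some purely hypothetical numbers: if \( B_0 = 2 \) and \( B_1 = 0.5 \), then for a case with \( X = 10 \) the line predicts

\[ Y = 2 + 0.5 \times 10 = 7, \]

and any individual data point will miss that prediction by its own error, \( e \).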

To run a regression model in R, we use the function lm(), where the first term we put in the parentheses is the dependent variable (the thing we want to explain) and the second term is the independent variable (what we think explains variation in the dependent variable), separating the two variables by a tilde.

setwd("~/Dropbox/Data General/GSS")
options(scipen = 999)
x <- read.csv("GSS.csv")

In its simplest form, that's:

lm(dependentvariable ~ independentvariable)

Simple OLS Regression with two quantitative variables

The variable tvhours in dataframe x records how many hours per day each respondent watches television, and the variable age in the same dataframe records the age of each respondent. Both variables are quantitative, so regression is a natural model for testing whether there is a linear relationship between them.

model <- lm(x$tvhours ~ x$age)
summary(model)
## 
## Call:
## lm(formula = x$tvhours ~ x$age)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.492 -1.450 -0.694  0.919 21.330 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 2.416013   0.035946    67.2 <0.0000000000000002 ***
## x$age       0.012092   0.000736    16.4 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 2.32 on 32437 degrees of freedom
##   (22648 observations deleted due to missingness)
## Multiple R-squared: 0.00825, Adjusted R-squared: 0.00822 
## F-statistic:  270 on 1 and 32437 DF,  p-value: <0.0000000000000002

So here we see that there is a positive relationship between age and hours per day spent watching television. The key outputs to note are:

\( B_0 \) = 2.42. This means that, looking at the relationship between age and hours per day spent watching tv, the regression line predicts that at age zero we enter the world already watching 2.42 hours of tv!

\( B_1 \) = .012. This means that a one-unit increase in X, our independent variable (in this case, age), is associated with a .012 increase in the number of tv hours per day.

\( X \) = The independent variable, in this case, age. There is no “result” for X because it's just a list of values we already know! For every value of this variable we have in our dataset, we can multiply it by \( B_1 \) and add \( B_0 \) to put us somewhere on the regression line and therefore predict the value of Y given that value of X (the sketch after these notes shows the calculation in R).

\( e \) = The error term is reflected in the standard error of the estimate, .0007, which tells us how much our estimate of \( B_1 \) would typically vary from sample to sample. If it is really big compared to the size of \( B_1 \), we can't say that \( B_1 \) is statistically different from 0. If it's tiny compared to \( B_1 \), then we can say it's really unlikely we'd find our regression line if, in the real world of the population at large, there is no relationship (a slope of 0). The gap between the data points and the line itself shows up in the output as the residual standard error, 2.32.

\( p \) or Pr(>|t|) = The p-value for our estimate rounds to 0, suggesting that it's very unlikely there's no relationship in the real world. In other words, because it's lower than .05, we reject the null hypothesis that there is no relationship in the population (that we'd find a slope of 0 if we had data for the whole, real world) with 95% confidence. In fact, here we are more than 99.9% confident that the null hypothesis is wrong. The p-value is just a probability generated from the standard error and the t-value (the sketch below re-derives it).

\( R^2 \) = The R-squared (adjusted) is a measurement of the overall fit of the model. Here we see it equals .008. This means that \( age \) explains only a very small amount of the total variation in \( tvhours \).
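As a minimal sketch of how these numbers fit together (using the estimates printed above and an arbitrary example age of 40), we can pull the coefficients out of the fitted model, compute a predicted value by hand, and re-derive the t-value and p-value for \( B_1 \):

b0 <- coef(model)[1]  # intercept, about 2.42
b1 <- coef(model)[2]  # slope for age, about 0.012
b0 + b1 * 40  # predicted tv hours for a 40-year-old: about 2.9

# the t value reported by summary() is just the estimate divided by its standard error
0.012092/0.000736  # about 16.4

# and the p-value is the probability of a t-statistic at least that extreme
# if the true slope were 0
2 * pt(-abs(0.012092/0.000736), df = 32437)  # effectively zero

The last number matches the Pr(>|t|) column in the regression output, up to rounding.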

We can get a graphical representation of the relationship by using the scatterplot function in the car package. It's probably going to be pretty messy because this is a huge survey, but the straight green line plotted automatically IS the regression line we just calculated.

library(car)
## Warning: package 'car' was built under R version 2.15.1
## Loading required package: MASS
## Loading required package: nnet
scatterplot(x$age, x$tvhours)

[Scatterplot of tvhours against age, with the fitted regression line]

Not very pretty, right? And the line is not very compelling, either, given all the noise in the data. But that's OK; it just means there is a lot of variation in tv watching that age fails to explain. Age picks up some of the variation in tv hours, but obviously not much.
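If the car package isn't available, a rough equivalent in base R (a sketch, not the exact plot above) is to draw the points with plot() and overlay the fitted line with abline(); the green color is just there to mimic car's regression line:

plot(x$age, x$tvhours, xlab = "age", ylab = "tv hours per day")
abline(model, col = "green")  # add the regression line from the lm() fit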

Regression with a dummy variable

Now, let's take a look at the race variable.

class(x$race)
## [1] "factor"
summary(x$race)
## black other white 
##  7625  2589 44873

The class of the variable is “factor”, which means it's categorical rather than quantitative. In other words, it puts every respondent into some group or category which we call “levels”. The summary function shows that 7,625 people are in the level labelled “black”, 2,589 in “other”, and 44,873 in “white”.

If our hypotheses only have to do with the difference between white and black people, it would be more useful to have a variable that only distinguishes white and black people. There are lots of ways to do this in R, but one way is to make a new variable that simply excludes respondents who identified as “other”. We can do this by duplicating the variable and then giving the new variable missing values (NA) for respondents who identified as “other”. After that, apply the factor() function to drop the now-empty level. When done, summarize the variable to check what we've done.

Make a dummy variable based on a factor with multiple levels

x$black <- x$race  #make a new variable x$black, a replica of x$race
x$black[x$black == "other"] <- NA
x$black <- factor(x$black)  #use factor() function to automatically remove unused levels
summary(x$black)  #check results
## black white  NA's 
##  7625 44873  2589
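As an aside, droplevels() is a convenient shortcut for the factor() step; a sketch of the same recoding would be:

x$black <- x$race
x$black[x$black == "other"] <- NA  # recode "other" respondents to missing
x$black <- droplevels(x$black)  # drop the now-empty "other" level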

We run the regression model with lm() just as before: dependent variable, then a tilde, then the independent variable. If the independent variable is a factor with two levels (a dummy variable), R treats the first level as the baseline and estimates how much the other level differs from it:

reg <- lm(x$sei ~ x$black)
summary(reg)
## 
## Call:
## lm(formula = x$sei ~ x$black)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -32.5  -15.7   -6.0   13.9   55.0 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)    42.200      0.295   143.0 <0.0000000000000002 ***
## x$blackwhite    7.424      0.318    23.3 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 19 on 29200 degrees of freedom
##   (25885 observations deleted due to missingness)
## Multiple R-squared: 0.0183,  Adjusted R-squared: 0.0182 
## F-statistic:  543 on 1 and 29200 DF,  p-value: <0.0000000000000002

The “estimate” or slope of the line is 7.42, meaning that in our data white respondents score, on average, 7.42 points higher on the socioeconomic index than black respondents (the reference category, whose mean is the intercept, 42.2). The p-value is effectively zero, which means we reject the null hypothesis of no relationship. In fact, regression using a dummy for an independent variable basically IS a t-test comparing the two group means, and we can check that directly before plotting the data.
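A minimal sketch of that check, using the same variables (var.equal = TRUE matches the equal-variance assumption built into OLS):

# two-sample t-test of mean sei by the race dummy; the difference in
# group means should match the 7.42 regression coefficient
t.test(x$sei ~ x$black, var.equal = TRUE)

The difference in group means it reports should match the 7.42 coefficient above. Now let's try to plot the comparison with scatterplot().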

library(car)
scatterplot(x$sei ~ x$black)
## Error: invalid type (NULL) for variable 'x$sei'
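The error seems to be a quirk of mixing dollar-sign variables with scatterplot()'s formula interface (the details depend on the car version). A workaround sketch is to pass a data argument instead, or to fall back on a base-R boxplot, which shows the same two-group comparison:

scatterplot(sei ~ black, data = x)  # let scatterplot() find the variables in x
boxplot(sei ~ black, data = x)  # base-R alternative: side-by-side boxplots of sei by race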