Introduction to Simple Linear Regression in R

Linear regressions are used to model the linear relationship between one dependent variable and one or more independent variables. A regression coefficient represents the mean change in the dependent variable for one unit of change in the independent variable. In this recitation, we will: (1) load a dataset from the car package, (2) graph the relationship between two variables, (3) perform a simple linear regression analysis between two variables using lm() and (4) extract the intercept and the regression coefficient.

1. Loading the dataset

In this recitation, we will use the Prestige dataset from the car package. First, install the car package using install.packages(). Then, load it into your library using library().

Variable name	Description
education	Average education of occupational incumbents in 1971 (in years)
income	Average income of incumbents in 1971 (in dollars)
women	Percentage of incumbents who are women
prestige	Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s
census	Canadian Census occupational code
type	Type of occupation: bc (Blue Collar), prof (Professional, Managerial, and Technical), wc (White Collar)

library(car) # Loading the car package using library()

## Warning: package 'car' was built under R version 3.4.4

## Loading required package: carData

## Warning: package 'carData' was built under R version 3.4.4

dim(Prestige) # Checking the dimensions of the Prestige dataset using dim()

## [1] 102   6

sapply(Prestige, FUN=class) # Viewing the class attribute of each variable using class()

## education    income     women  prestige    census      type 
## "numeric" "integer" "numeric" "numeric" "integer"  "factor"

head(Prestige,10) # Viewing the first 10 observations of the Prestige dataset using head()

##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof
## biologists              15.09   8258 25.65     72.6   2133 prof
## architects              15.44  14163  2.69     78.1   2141 prof
## civil.engineers         14.52  11377  1.03     73.1   2143 prof
## mining.engineers        14.64  11023  0.94     68.8   2153 prof

2. Measuring the relationship between education and income

We could intuitively expect to observe a positive relationship between education and income. But is it actually the case? And if so, how strong is this relationship? Using a simple linear regression model, we will predict the income of a given occupation given the average number of years of education of the respondents who have this occupation. We call this a simple linear regression model because we are only using one independent variable (i.e. education), to predict one dependent variable (i.e. income).

2.1 Graphing the relationship between education and income using `plot()`

Our first step is to visually assess whether there seems to be a linear relationship between education and income by looking at a scatterplot.

plot(Prestige$education, Prestige$income,
     las=1,
     col="blue",
     pch=16,
     xlim=c(0,17),
     xlab="Education (years)",
     ylab=NA,
     main="Relationship between Education and Income (Canada, 1971)")
title(ylab="Income (dollars)", line=3.3)

At first glance, does there seem to be a correlation between both variables? If so, does the relationship seem positive or negative?

2.2 Performing a linear regression analysis using `lm()`

Using lm(), we specify a simple linear regression formula: dependent variable ~ independent variable. Here, we are regressing one dependent variable (income, in dollars) on one independent variable (education, in years). The lm() function will generate an intercept and a regression coefficient. The intercept represents the value of the dependent variable when the independent variable is equal to zero. The regression coefficient represents the mean change in the dependent variable for one unit of change in the independent variable (in other words, the slope).

Note that lm() automatically performs listwise deletion of rows with a missing value within the vectors you are using. In other words, it only uses pairwise complete observations.

model <- lm(income ~ education, data = Prestige) # Creating our model using lm()

summary(model) # Viewing our regression results using summary()

## 
## Call:
## lm(formula = income ~ education, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5493.2 -2433.8   -41.9  1491.5 17713.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2853.6     1407.0  -2.028   0.0452 *  
## education      898.8      127.0   7.075 2.08e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3483 on 100 degrees of freedom
## Multiple R-squared:  0.3336, Adjusted R-squared:  0.3269 
## F-statistic: 50.06 on 1 and 100 DF,  p-value: 2.079e-10

nobs(model) # Sanity check: viewing how many observations were used in the model using nobs()

## [1] 102

nobs(model) == dim(Prestige)[1] # All rows were used: there were no missing observations

## [1] TRUE

Lexicon for the `lm()` output

Information	Description
Formula	Our linear model (dependent variable ~ independent variable)
Residuals	Difference between the observed and the estimated value of the dependent variable
Intercept	Value of the dependent variable when the independent variable is equal to zero
Coefficient	Mean change in the dependent variable for one unit of change in the independent variable
p-value	Probability of getting results similar to yours (or more extreme) given that the null hypothesis (= no relationship between the DV and the IV) is true
Degrees of freedom	Number of observations - number of parameters - 1
R-squared	Proportion of explained variance in the dependent variable

Note that you can graph this linear relationship using abline(). Visually, the line represents the predicted value of the income variable given the value of the education variable. In other words, it displays the best guess you could make regarding an occupation’s income if you were only given the average number of years of education of the respondents who have this occupation.

model <- lm(income ~ education, data = Prestige) # Creating our model using lm()

plot(Prestige$education, Prestige$income,
     las=1,
     col="blue",
     pch=16,
     xlim=c(0,17),
     xlab="Education (years)",
     ylab=NA,
     main="Relationship between Education and Income (Canada, 1971)")
title(ylab="Income (dollars)", line=3.3)

abline(model)

2.3 Extracting the coefficients

Almost done! We have successfully conducted our regression analysis. Here is how to store the intercept and the regression coefficient into a myCoeff vector using summary().

summary(model)$coefficients

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -2853.5856  1407.0392 -2.028078 4.521163e-02
## education     898.8128   127.0354  7.075294 2.079192e-10

myCoeff <- summary(model)$coefficients[1:2,1]

myCoeff

## (Intercept)   education 
##  -2853.5856    898.8128

Bonus: Multiple Linear Model Regression

As you move forward with data analysis, you might start using multiple linear regression models: dependent variable ~ independent variable #1 + independent variable #2. The main difference from simple linear models is that you will have more than one predictors (or independent variables). Obviously, you are not requested to make 3D graphs at this stage—I included this for your own interest! Visually, your regression analysis would look like a plane rather than a line.

Exercises

Exercise 1

Without conducting a regression analysis, graph the relationship between income (IV) and prestige (DV) within the Prestige dataset. Does there seem to be a linear relationship between these two variables? If so, does this relationship seem positive or negative?

Relevant function: plot().

Exercise 2

Perform a linear regression analysis of the relationship between income (IV) and prestige (DV). How do you interpret the intercept and the regression coefficient? Also, thinking back to what we saw last week on p-values, how convincing are these results?

Relevant functions: lm(), summary().

Exercise 3

Display this linear relationship on the scatterplot you created in Exercice 1.

Relevant functions: plot(), abline().

Exercise 4

Extract the intercept and the coefficient and store these values into a myCoef vector.

Exercise 5

Is there a causal relationship between income and prestige? What alternative explanations could explain the results of your linear regression model? (in words)