Linear Models I
Linear regression is used to model the linear relationship between
one dependent variable and one or more independent variables. A
regression coefficient represents the mean change in
the dependent variable for a one-unit change in the independent
variable. In this laboratory, we will: (1) load a dataset from the
car package, (2) graph the relationship between two
variables, (3) perform a simple linear regression analysis between two
variables using lm(), and (4) run a test on the residuals of
our model.
Relevant functions: lm(),
summary(), plot(), density(),
abline().
1. Loading Data
We will use the Prestige dataset from the
car package. First, install the car package
using install.packages(). Then, load it into your R session
using library().
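For example, a minimal sketch of these two steps (install.packages() only needs to be run once per computer, while library() must be run in every new R session):
install.packages("car") # Downloading and installing the car package (run once)
library(car) # Loading the car package into the current session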
| Variable name | Description |
|---|---|
| education | Average education of occupational incumbents in 1971 (in years) |
| income | Average income of incumbents in 1971 (in dollars) |
| women | Percentage of incumbents who are women |
| prestige | Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s |
| census | Canadian Census occupational code |
| type | Type of occupation: bc (Blue Collar), prof (Professional, Managerial, and Technical), wc (White Collar) |
library(car) # Loading the car package using library()
## Loading required package: carData
dim(Prestige) # Checking the dimensions of the Prestige dataset using dim()
## [1] 102 6
sapply(Prestige, FUN=class) # Viewing the class of each variable by applying class() with sapply()
## education income women prestige census type
## "numeric" "integer" "numeric" "numeric" "integer" "factor"
head(Prestige,10) # Viewing the first 10 observations of the Prestige dataset using head()
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
## biologists 15.09 8258 25.65 72.6 2133 prof
## architects 15.44 14163 2.69 78.1 2141 prof
## civil.engineers 14.52 11377 1.03 73.1 2143 prof
## mining.engineers 14.64 11023 0.94 68.8 2153 prof
2. Bivariate Relationship
We could intuitively expect to observe a positive relationship between education and income. But is that actually the case? And if so, how strong is the relationship? Using a simple linear regression model, we will predict the income of a given occupation from the average number of years of education of the respondents who hold that occupation. We call this a simple linear regression model because we are only using one independent variable (i.e., education) to predict one dependent variable (i.e., income).
2.1 Graphing our Data
Our first step is to visually assess whether there seems to be a linear relationship between education and income by looking at a scatterplot.
plot(Prestige$education, Prestige$income,
las=1,
col="blue",
pch=16,
xlim=c(0,17),
xlab="Education (years)",
ylab=NA,
main="Relationship between Education and Income (Canada, 1971)")
title(ylab="Income (dollars)", line=3.3)
At first glance, does there seem to be a correlation between both variables? If so, does the relationship seem positive or negative?
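To complement the visual check, you can also compute the Pearson correlation between the two variables with cor(). A quick sketch (given the R-squared of about 0.33 reported below, this value should come out to roughly 0.58):
cor(Prestige$education, Prestige$income) # Pearson correlation between education and income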
2.2 Running an OLS model
2.2.1 Using the lm() function
Using lm(), we specify a simple linear regression
formula: dependent variable ~ independent variable.
Here, we are regressing one dependent variable (income, in dollars) on
one independent variable (education, in years). The lm()
function will generate an intercept and a
regression coefficient. The intercept represents the value
of the dependent variable when the independent variable is equal to
zero. The regression coefficient represents the mean change in the
dependent variable for a one-unit change in the independent variable
(in other words, the slope).
Note that lm() automatically performs listwise
deletion of rows with a missing value in any of the variables you are using.
In other words, it only uses rows that are complete for every variable in the formula.
model <- lm(income ~ education, data = Prestige) # Creating our model using lm()
summary(model) # Viewing our regression results using summary()
##
## Call:
## lm(formula = income ~ education, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5493.2 -2433.8 -41.9 1491.5 17713.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2853.6 1407.0 -2.028 0.0452 *
## education 898.8 127.0 7.075 2.08e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3483 on 100 degrees of freedom
## Multiple R-squared: 0.3336, Adjusted R-squared: 0.3269
## F-statistic: 50.06 on 1 and 100 DF, p-value: 2.079e-10
nobs(model) # Sanity check: viewing how many observations were used in the model using nobs()
## [1] 102
nobs(model) == dim(Prestige)[1] # All rows were used: there were no missing observations
## [1] TRUE
2.2.2 Lexicon for the lm() output
| Information | Description |
|---|---|
| Formula | Our linear model (dependent variable ~ independent variable) |
| Residuals | Difference between the observed and the estimated value of the dependent variable |
| Intercept | Value of the dependent variable when the independent variable is equal to zero |
| Coefficient | Mean change in the dependent variable for one unit of change in the independent variable |
| p-value | Probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis (= no relationship between the DV and the IV) is true |
| Degrees of freedom | Number of observations minus the number of estimated parameters; here, 102 observations - 2 parameters (intercept and slope) = 100 |
| R-squared | Proportion of explained variance in the dependent variable |
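If you need any of these quantities programmatically rather than printed, most of them can be extracted from the fitted model or its summary. A short sketch:
summary(model)$r.squared # Proportion of explained variance (R-squared)
df.residual(model) # Residual degrees of freedom (here, 102 - 2 = 100)
residuals(model)[1:3] # Residuals for the first three occupations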
Note that you can graph this linear relationship using
abline(). Visually, the line represents the predicted value
of the income variable given the value of the education variable. In
other words, it displays the best guess you could make regarding an
occupation’s income if you were only given the average number of years
of education of the respondents who have this occupation.
model <- lm(income ~ education, data = Prestige) # Creating our model using lm()
plot(Prestige$education, Prestige$income,
las=1,
col="blue",
pch=16,
xlim=c(0,17),
xlab="Education (years)",
ylab=NA,
main="Relationship between Education and Income (Canada, 1971)")
title(ylab="Income (dollars)", line=3.3)
abline(model)
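The fitted line can also be queried numerically with predict(). For instance, a small sketch predicting the income associated with a hypothetical occupation whose incumbents have 12 years of education on average (given the coefficients above, this should come out to about -2853.59 + 12 * 898.81, i.e. roughly 7,930 dollars):
predict(model, newdata = data.frame(education = 12)) # Predicted income for 12 years of education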
2.2.3 Extracting the coefficients
Almost done! We have successfully conducted our regression analysis.
Here is how to store the intercept and the regression coefficient into a
myCoeff vector using summary().
summary(model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2853.5856 1407.0392 -2.028078 4.521163e-02
## education 898.8128 127.0354 7.075294 2.079192e-10
myCoeff <- summary(model)$coefficients[1:2,1]
myCoeff
## (Intercept) education
## -2853.5856 898.8128
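Equivalently, coef() extracts the same two values directly from the model object, without going through summary():
coef(model) # Named vector containing the intercept and the slope
## (Intercept)   education
##  -2853.5856    898.8128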
3. Residuals
One of the assumptions of linear models is that residuals should be normally distributed. In other words, there should not be a systematic bias in the residuals, which would imply that our model is better at predicting certain types of observations than others. The residuals represent the observed value minus the predicted value.
summary(residuals(model)) # Summarizing the residuals
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5493.20 -2433.80 -41.92 0.00 1491.50 17713.14
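As a quick sanity check on that definition (residuals = observed minus predicted values), you can compare the residuals against the difference between the observed incomes and the fitted values:
all.equal(as.numeric(residuals(model)), as.numeric(Prestige$income - fitted(model))) # Should be TRUE
## [1] TRUE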
plot(density(residuals(model)), col="blue",
main="Density Plot of the Residuals from our OLS Model") # Plotting the density of the residuals
Does this look normally distributed to you? Why or why not? Please note that we’ll do more advanced tests on residuals next week.
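If you would like one more visual check already, a common one is a quantile-quantile plot, which compares the quantiles of the residuals against those of a normal distribution (a sketch; points far from the reference line indicate departures from normality):
qqnorm(residuals(model)) # Q-Q plot of the residuals
qqline(residuals(model)) # Reference line for a normal distribution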
Bonus: Multiple Linear Regression
As you move forward with data analysis, you might start using multiple linear regression models: dependent variable ~ independent variable #1 + independent variable #2. The main difference from simple linear models is that you will have more than one predictor (or independent variable). Obviously, you are not required to make 3D graphs at this stage; I included this for your own interest! Visually, your regression analysis would look like a plane rather than a line.
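For instance, here is a minimal sketch of what such a model could look like with this dataset, arbitrarily adding women as a second predictor:
model_multiple <- lm(income ~ education + women, data = Prestige) # Two independent variables
summary(model_multiple)$coefficients # One intercept plus one coefficient per predictor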
4. Exercises
Exercise
| Null Hypothesis | Alternative Hypothesis |
|---|---|
| No effect of prestige on income | Effect of prestige on income |
| Similar income means for all prestige scores | Different income means depending on prestige scores |
- Perform a linear regression analysis of the relationship between prestige score (IV) and income (DV). In your own words, how do you interpret the intercept and the regression coefficient? Can you reject the null hypothesis?
model <- lm(income ~ prestige, data=Prestige)
summary(model)
##
## Call:
## lm(formula = income ~ prestige, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6693.3 -1434.9 136.5 1251.8 15152.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1465.03 860.47 -1.703 0.0918 .
## prestige 176.43 17.26 10.224 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2984 on 100 degrees of freedom
## Multiple R-squared: 0.5111, Adjusted R-squared: 0.5062
## F-statistic: 104.5 on 1 and 100 DF, p-value: < 2.2e-16
# For each additional point on the prestige score, income is
# $176.43 higher on average
# We can reject the null hypothesis since our p-value is < 0.05
- Generate a density plot of the residuals of your OLS model. Do the residuals look normally distributed?
plot(density(model$residuals))
# They do not look normally distributed: the distribution is right-skewed,
# meaning the model badly under-estimates income for a few high-income occupations
- Extract the intercept and the coefficient of your OLS model and store these values into a myCoeff vector.
myCoeff <- summary(model)$coefficients[1:2,1]
myCoeff
## (Intercept)    prestige
##    -1465.03      176.43