Linear Models I
Linear regression is used to model the linear relationship between
one dependent variable and one or more independent variables. A
regression coefficient represents the mean change in
the dependent variable for a one-unit change in the independent
variable. In this laboratory, we will: (1) load a dataset from the
car package, (2) graph the relationship between two
variables, (3) perform a simple linear regression analysis between two
variables using lm(), and (4) run a test on the residuals of
our model.
Relevant functions: lm(),
summary(), plot(), density(),
abline().
1. Loading Data
We will use the Prestige dataset from the
car package. First, install the car package
using install.packages(). Then, load it into your R session
using library().
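For example, a minimal sketch of these two steps (install.packages() only needs to be run once per computer, while library() must be run in every new R session):
install.packages("car") # Downloading and installing the car package (run once)
library(car) # Loading the car package into the current session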
| Variable name | Description |
|---|---|
| education | Average education of occupational incumbents in 1971 (in years) |
| income | Average income of incumbents in 1971 (in dollars) |
| women | Percentage of incumbents who are women |
| prestige | Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s |
| census | Canadian Census occupational code |
| type | Type of occupation: bc (Blue Collar), prof (Professional, Managerial, and Technical), wc (White Collar) |
library(car) # Loading the car package using library()
## Loading required package: carData
dim(Prestige) # Checking the dimensions of the Prestige dataset using dim()
## [1] 102 6
sapply(Prestige, FUN=class) # Viewing the class of each variable by applying class() with sapply()
## education income women prestige census type
## "numeric" "integer" "numeric" "numeric" "integer" "factor"
head(Prestige,10) # Viewing the first 10 observations of the Prestige dataset using head()
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
## biologists 15.09 8258 25.65 72.6 2133 prof
## architects 15.44 14163 2.69 78.1 2141 prof
## civil.engineers 14.52 11377 1.03 73.1 2143 prof
## mining.engineers 14.64 11023 0.94 68.8 2153 prof
2. Bivariate Relationship
We could intuitively expect to observe a positive relationship between education and income. But is that actually the case? And if so, how strong is the relationship? Using a simple linear regression model, we will predict the income of a given occupation from the average number of years of education of the respondents who hold that occupation. We call this a simple linear regression model because we are only using one independent variable (i.e., education) to predict one dependent variable (i.e., income).
2.1 Graphing our Data
Our first step is to visually assess whether there seems to be a linear relationship between education and income by looking at a scatterplot.
plot(Prestige$education, Prestige$income,
las=1,
col="blue",
pch=16,
xlim=c(0,17),
xlab="Education (years)",
ylab=NA,
main="Relationship between Education and Income (Canada, 1971)")
title(ylab="Income (dollars)", line=3.3)
At first glance, does there seem to be a correlation between both variables? If so, does the relationship seem positive or negative?
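To complement the visual check, you can also compute the Pearson correlation between the two variables with cor(). A quick sketch (given the R-squared of about 0.33 reported below, this value should come out to roughly 0.58):
cor(Prestige$education, Prestige$income) # Pearson correlation between education and income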
2.2 Running an OLS model
2.2.1 Using the lm() function
Using lm(), we specify a simple linear regression
formula: dependent variable ~ independent variable.
Here, we are regressing one dependent variable (income, in dollars) on
one independent variable (education, in years). The lm()
function will generate an intercept and a
regression coefficient. The intercept represents the value
of the dependent variable when the independent variable is equal to
zero. The regression coefficient represents the mean change in the
dependent variable for a one-unit change in the independent variable
(in other words, the slope).
Note that lm() automatically performs listwise
deletion of rows with a missing value in any of the variables you are using.
In other words, it only uses rows that are complete for every variable in the formula.
model <- lm(income ~ education, data = Prestige) # Creating our model using lm()
summary(model) # Viewing our regression results using summary()
##
## Call:
## lm(formula = income ~ education, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5493.2 -2433.8 -41.9 1491.5 17713.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2853.6 1407.0 -2.028 0.0452 *
## education 898.8 127.0 7.075 2.08e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3483 on 100 degrees of freedom
## Multiple R-squared: 0.3336, Adjusted R-squared: 0.3269
## F-statistic: 50.06 on 1 and 100 DF, p-value: 2.079e-10
nobs(model) # Sanity check: viewing how many observations were used in the model using nobs()
## [1] 102
nobs(model) == dim(Prestige)[1] # All rows were used: there were no missing observations
## [1] TRUE
2.2.2 Lexicon for the lm() output
| Information | Description |
|---|---|
| Formula | Our linear model (dependent variable ~ independent variable) |
| Residuals | Difference between the observed and the estimated value of the dependent variable |
| Intercept | Value of the dependent variable when the independent variable is equal to zero |
| Coefficient | Mean change in the dependent variable for one unit of change in the independent variable |
| p-value | Probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis (= no relationship between the DV and the IV) is true |
| Degrees of freedom | Number of observations minus the number of estimated parameters; here, 102 observations - 2 parameters (intercept and slope) = 100 |
| R-squared | Proportion of explained variance in the dependent variable |
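If you need any of these quantities programmatically rather than printed, most of them can be extracted from the fitted model or its summary. A short sketch:
summary(model)$r.squared # Proportion of explained variance (R-squared)
df.residual(model) # Residual degrees of freedom (here, 102 - 2 = 100)
residuals(model)[1:3] # Residuals for the first three occupations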
Note that you can graph this linear relationship using
abline(). Visually, the line represents the predicted value
of the income variable given the value of the education variable. In
other words, it displays the best guess you could make regarding an
occupation’s income if you were only given the average number of years
of education of the respondents who have this occupation.
model <- lm(income ~ education, data = Prestige) # Creating our model using lm()
plot(Prestige$education, Prestige$income,
las=1,
col="blue",
pch=16,
xlim=c(0,17),
xlab="Education (years)",
ylab=NA,
main="Relationship between Education and Income (Canada, 1971)")
title(ylab="Income (dollars)", line=3.3)
abline(model)
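The fitted line can also be queried numerically with predict(). For instance, a small sketch predicting the income associated with a hypothetical occupation whose incumbents have 12 years of education on average (given the coefficients above, this should come out to about -2853.59 + 12 * 898.81, i.e. roughly 7,930 dollars):
predict(model, newdata = data.frame(education = 12)) # Predicted income for 12 years of education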
2.2.3 Extracting the coefficients
Almost done! We have successfully conducted our regression analysis.
Here is how to store the intercept and the regression coefficient into a
myCoeff vector using summary().
summary(model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2853.5856 1407.0392 -2.028078 4.521163e-02
## education 898.8128 127.0354 7.075294 2.079192e-10
myCoeff <- summary(model)$coefficients[1:2,1]
myCoeff
## (Intercept) education
## -2853.5856 898.8128
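Equivalently, coef() extracts the same two values directly from the model object, without going through summary():
coef(model) # Named vector containing the intercept and the slope
## (Intercept)   education
##  -2853.5856    898.8128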
3. Residuals
One of the assumptions of linear models is that residuals should be normally distributed. In other words, there should not be a systematic bias in the residuals, which would imply that our model is better at predicting certain types of observations than others. The residuals represent the observed value minus the predicted value.
summary(residuals(model)) # Summarizing the residuals
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5493.20 -2433.80 -41.92 0.00 1491.50 17713.14
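As a quick sanity check on that definition (residuals = observed minus predicted values), you can compare the residuals against the difference between the observed incomes and the fitted values:
all.equal(as.numeric(residuals(model)), as.numeric(Prestige$income - fitted(model))) # Should be TRUE
## [1] TRUE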
plot(density(residuals(model)), col="blue",
main="Density Plot of the Residuals from our OLS Model") # Plotting the density of the residuals
Does this look normally distributed to you? Why or why not? Please note that we’ll do more advanced tests on residuals next week.
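If you would like one more visual check already, a common one is a quantile-quantile plot, which compares the quantiles of the residuals against those of a normal distribution (a sketch; points far from the reference line indicate departures from normality):
qqnorm(residuals(model)) # Q-Q plot of the residuals
qqline(residuals(model)) # Reference line for a normal distribution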
Bonus: Multiple Linear Regression
As you move forward with data analysis, you might start using multiple linear regression models: dependent variable ~ independent variable #1 + independent variable #2. The main difference from simple linear models is that you will have more than one predictor (or independent variable). Obviously, you are not required to make 3D graphs at this stage; I included this for your own interest! Visually, your regression analysis would look like a plane rather than a line.
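For instance, here is a minimal sketch of what such a model could look like with this dataset, arbitrarily adding women as a second predictor:
model_multiple <- lm(income ~ education + women, data = Prestige) # Two independent variables
summary(model_multiple)$coefficients # One intercept plus one coefficient per predictor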
4. Exercises
Exercise
| Null Hypothesis | Alternative Hypothesis |
|---|---|
| No effect of prestige on income | Effect of prestige on income |
| Similar income means for all prestige scores | Different income means depending on prestige scores |
- Perform a linear regression analysis of the relationship between prestige score (IV) and income (DV). In your own words, how do you interpret the intercept and the regression coefficient? Can you reject the null hypothesis?
model <- lm(income ~ prestige, data=Prestige)
summary(model)
##
## Call:
## lm(formula = income ~ prestige, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6693.3 -1434.9 136.5 1251.8 15152.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1465.03 860.47 -1.703 0.0918 .
## prestige 176.43 17.26 10.224 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2984 on 100 degrees of freedom
## Multiple R-squared: 0.5111, Adjusted R-squared: 0.5062
## F-statistic: 104.5 on 1 and 100 DF, p-value: < 2.2e-16
# For each additional point on the prestige score, income is
# $176.43 higher on average
# We can reject the null hypothesis since our p-value is < 0.05
- Generate a density plot of the residuals of your OLS model. Do the residuals look normally distributed?
plot(density(model$residuals))
# They do not look normally distributed: the distribution is right-skewed,
# meaning the model badly under-estimates income for a few high-income occupations
- Extract the intercept and the coefficient of your OLS model and store these values into a myCoeff vector.
myCoeff <- summary(model)$coefficients[1:2,1]
myCoeff
## (Intercept)    prestige
##    -1465.03      176.43