Linear regressions are used to model the linear relationship between one dependent variable and one or more independent variables. A regression coefficient represents the mean change in the dependent variable for one unit of change in the independent variable. In this recitation, we will: (1) load a dataset from the car package, (2) graph the relationship between two variables, (3) perform a simple linear regression analysis between two variables using lm() and (4) extract the intercept and the regression coefficient.
In this recitation, we will use the Prestige dataset from the car package. First, install the car package using install.packages(). Then, load it into your library using library().
| Variable name | Description |
|---|---|
| education | Average education of occupational incumbents in 1971 (in years) |
| income | Average income of incumbents in 1971 (in dollars) |
| women | Percentage of incumbents who are women |
| prestige | Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s |
| census | Canadian Census occupational code |
| type | Type of occupation: bc (Blue Collar), prof (Professional, Managerial, and Technical), wc (White Collar) |
library(car) # Loading the car package using library()
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
dim(Prestige) # Checking the dimensions of the Prestige dataset using dim()
## [1] 102 6
sapply(Prestige, FUN=class) # Viewing the class attribute of each variable using class()
## education income women prestige census type
## "numeric" "integer" "numeric" "numeric" "integer" "factor"
head(Prestige,10) # Viewing the first 10 observations of the Prestige dataset using head()
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
## biologists 15.09 8258 25.65 72.6 2133 prof
## architects 15.44 14163 2.69 78.1 2141 prof
## civil.engineers 14.52 11377 1.03 73.1 2143 prof
## mining.engineers 14.64 11023 0.94 68.8 2153 prof
We could intuitively expect to observe a positive relationship between education and income. But is it actually the case? And if so, how strong is this relationship? Using a simple linear regression model, we will predict the income of a given occupation given the average number of years of education of the respondents who have this occupation. We call this a simple linear regression model because we are only using one independent variable (i.e. education), to predict one dependent variable (i.e. income).
plot()
Our first step is to visually assess whether there seems to be a linear relationship between education and income by looking at a scatterplot.
plot(Prestige$education, Prestige$income,
las=1,
col="blue",
pch=16,
xlim=c(0,17),
xlab="Education (years)",
ylab=NA,
main="Relationship between Education and Income (Canada, 1971)")
title(ylab="Income (dollars)", line=3.3)
At first glance, does there seem to be a correlation between both variables? If so, does the relationship seem positive or negative?
lm()
Using lm(), we specify a simple linear regression formula: dependent variable ~ independent variable. Here, we are regressing one dependent variable (income, in dollars) on one independent variable (education, in years). The lm() function will generate an intercept and a regression coefficient. The intercept represents the value of the dependent variable when the independent variable is equal to zero. The regression coefficient represents the mean change in the dependent variable for one unit of change in the independent variable (in other words, the slope).
Note that lm() automatically performs listwise deletion of rows with a missing value within the vectors you are using. In other words, it only uses pairwise complete observations.
model <- lm(income ~ education, data = Prestige) # Creating our model using lm()
summary(model) # Viewing our regression results using summary()
##
## Call:
## lm(formula = income ~ education, data = Prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5493.2 -2433.8 -41.9 1491.5 17713.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2853.6 1407.0 -2.028 0.0452 *
## education 898.8 127.0 7.075 2.08e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3483 on 100 degrees of freedom
## Multiple R-squared: 0.3336, Adjusted R-squared: 0.3269
## F-statistic: 50.06 on 1 and 100 DF, p-value: 2.079e-10
nobs(model) # Sanity check: viewing how many observations were used in the model using nobs()
## [1] 102
nobs(model) == dim(Prestige)[1] # All rows were used: there were no missing observations
## [1] TRUE
lm() output
| Information | Description |
|---|---|
| Formula | Our linear model (dependent variable ~ independent variable) |
| Residuals | Difference between the observed and the estimated value of the dependent variable |
| Intercept | Value of the dependent variable when the independent variable is equal to zero |
| Coefficient | Mean change in the dependent variable for one unit of change in the independent variable |
| p-value | Probability of getting results similar to yours (or more extreme) given that the null hypothesis (= no relationship between the DV and the IV) is true |
| Degrees of freedom | Number of observations - number of parameters - 1 |
| R-squared | Proportion of explained variance in the dependent variable |
Note that you can graph this linear relationship using abline(). Visually, the line represents the predicted value of the income variable given the value of the education variable. In other words, it displays the best guess you could make regarding an occupation’s income if you were only given the average number of years of education of the respondents who have this occupation.
model <- lm(income ~ education, data = Prestige) # Creating our model using lm()
plot(Prestige$education, Prestige$income,
las=1,
col="blue",
pch=16,
xlim=c(0,17),
xlab="Education (years)",
ylab=NA,
main="Relationship between Education and Income (Canada, 1971)")
title(ylab="Income (dollars)", line=3.3)
abline(model)
Almost done! We have successfully conducted our regression analysis. Here is how to store the intercept and the regression coefficient into a myCoeff vector using summary().
summary(model)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2853.5856 1407.0392 -2.028078 4.521163e-02
## education 898.8128 127.0354 7.075294 2.079192e-10
myCoeff <- summary(model)$coefficients[1:2,1]
myCoeff
## (Intercept) education
## -2853.5856 898.8128
As you move forward with data analysis, you might start using multiple linear regression models: dependent variable ~ independent variable #1 + independent variable #2. The main difference from simple linear models is that you will have more than one predictors (or independent variables). Obviously, you are not requested to make 3D graphs at this stage—I included this for your own interest! Visually, your regression analysis would look like a plane rather than a line.
Without conducting a regression analysis, graph the relationship between income (IV) and prestige (DV) within the Prestige dataset. Does there seem to be a linear relationship between these two variables? If so, does this relationship seem positive or negative?
Relevant function: plot().
Perform a linear regression analysis of the relationship between income (IV) and prestige (DV). How do you interpret the intercept and the regression coefficient? Also, thinking back to what we saw last week on p-values, how convincing are these results?
Relevant functions: lm(), summary().
Display this linear relationship on the scatterplot you created in Exercice 1.
Relevant functions: plot(), abline().
Extract the intercept and the coefficient and store these values into a myCoef vector.
Is there a causal relationship between income and prestige? What alternative explanations could explain the results of your linear regression model? (in words)