Multiple regression is a powerful statistical technique, and here you will discover why and how to use it. Part of the course focuses on matrix algebra, since it is essential for estimating the regression coefficients in the regression equation. The final chapter introduces dummy coding as a technique to handle categorical variables.
setwd("d:/Cursos/Datacamp/RMultipleRegression")
fs <- read.csv('fs.csv')
fs
This chapter will help you develop your understanding of the basic framework of multiple regression. To this purpose, you will use a pre-loaded dataset fs that contains the yearly wages of professors at a certain university. By going through the exercises, you will familiarize yourself with the implementation of, and the intuition behind, the most important aspects of multiple regression in R.
As always, start by exploring the data to make sure that you get a global view of what you are working with. This first step is vital: a clear picture of the data makes the rest of the exercises much easier to follow. Use the R console to perform these actions.
Can you tell which description below most accurately matches the characteristics of a typical professor (the “average person”) in the data set?
summary(fs)
salary age years pubs dept
Min. : 60072 Min. :31.00 Min. : 5.00 Min. : 14.00 H:28
1st Qu.:101818 1st Qu.:44.00 1st Qu.:17.75 1st Qu.: 44.00 P:35
Median :133049 Median :49.00 Median :23.50 Median : 66.00 S:37
Mean :133607 Mean :50.33 Mean :24.14 Mean : 66.93
3rd Qu.:170374 3rd Qu.:59.00 3rd Qu.:31.25 3rd Qu.: 90.00
Max. :199606 Max. :67.00 Max. :41.00 Max. :125.00
A professor with a salary of around $133,000 a year, approximately 24 years of working experience, and a publication record of around 67 scientific articles.
Now that you have briefly looked at the data in the previous exercise, a good next step would be to get a view on the nature of the empirical relationships between the outcome variable (the annual salary of the professors) and two specific predictor variables (number of years as a faculty member and number of publications).
Intuitively, one would expect that a professor with more working experience would earn more. In addition, it would also make sense that a faculty member with a larger expertise in terms of numbers of publications would have a higher salary as well. Let us see whether this holds in practice by getting visual confirmation by means of some scatter plots!
# fs is available in your working environment
str(fs)
'data.frame': 100 obs. of 5 variables:
$ salary: int 60072 61017 61618 61976 66398 67083 69314 71653 72519 74821 ...
$ age : int 38 39 38 31 32 39 32 31 66 66 ...
$ years : int 16 14 18 15 30 25 24 29 32 26 ...
$ pubs : int 22 23 23 14 32 27 29 37 61 69 ...
$ dept : Factor w/ 3 levels "H","P","S": 2 2 2 2 2 2 2 3 3 3 ...
# Perform the two single regressions and save them in a variable
model_years <- lm(salary~years,data=fs)
model_pubs <- lm(salary~pubs,data=fs)
# Plot both enhanced scatter plots in one plot matrix of 1 by 2
par(mfrow = c(1, 2))
plot(fs$salary ~ fs$years, main = "plot_years", xlab = 'years', ylab = 'salary')
abline(model_years)
plot(fs$salary ~ fs$pubs, main = 'plot_pubs', xlab = 'pubs', ylab = 'salary')
abline(model_pubs)
In the video, Professor Conway talked about the R² coefficients of regression models. These coefficients are often used in practice to choose between competing models. The R² coefficient of a regression model is defined as the percentage of the variation in the outcome variable that can be explained by the predictor variables of the model. In general, the R² coefficient of a model increases when more predictor variables are added: every extra predictor gives the model another chance to explain variation in the outcome variable.
Check this generality by comparing the R² coefficient of a single regression model with that of a multiple regression model.
# Do a single regression of salary onto years of experience and check the output
model_1 <- lm(fs$salary ~ fs$years)
summary(model_1)
Call:
lm(formula = fs$salary ~ fs$years)
Residuals:
Min 1Q Median 3Q Max
-82972 -9537 4305 17703 57949
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68672.5 8259.0 8.315 5.38e-13 ***
fs$years 2689.9 318.4 8.448 2.78e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30220 on 98 degrees of freedom
Multiple R-squared: 0.4214, Adjusted R-squared: 0.4155
F-statistic: 71.37 on 1 and 98 DF, p-value: 2.779e-13
# Do a multiple regression of salary onto years of experience and numbers of publications and check the output
model_2 <- lm(fs$salary ~ fs$years + fs$pubs)
summary(model_2)
Call:
lm(formula = fs$salary ~ fs$years + fs$pubs)
Residuals:
Min 1Q Median 3Q Max
-67835 -14589 2362 13358 69613
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 58828.7 7605.9 7.735 9.79e-12 ***
fs$years 1337.4 387.1 3.455 0.000819 ***
fs$pubs 634.9 123.6 5.137 1.44e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 26930 on 97 degrees of freedom
Multiple R-squared: 0.5451, Adjusted R-squared: 0.5357
F-statistic: 58.12 on 2 and 97 DF, p-value: < 2.2e-16
# Save the R squared of both models in preliminary variables
preliminary_model_1 <- summary(model_1)$r.squared
preliminary_model_2 <- summary(model_2)$r.squared
# Round them off while you save them in new variables
r_squared <- c()
r_squared[1] <- round(preliminary_model_1,3)
r_squared[2] <- round(preliminary_model_2,3)
# Print out the vector to see both R squared coefficients
r_squared
[1] 0.421 0.545
Although the R² coefficient of a regression model tends to increase when more predictor variables are added to the mix, more is not always better. Throwing every variable at your disposal into a multiple regression is usually not a good idea. If the extra variables do not add value to the model (that is, when the corresponding regression coefficients are not significantly different from zero), it is good practice to leave them out.
In the next example you will see that adding the age of the professors as an extra predictor does not necessarily lead to a better model, even though the R² coefficient increases in the process. Hence, a good statistician should not rely blindly on the R² coefficient when selecting a regression model.
# Do multiple regression and check the regression output
model_3 <- lm(fs$salary ~ fs$years + fs$pubs + fs$age)
summary(model_3)
Call:
lm(formula = fs$salary ~ fs$years + fs$pubs + fs$age)
Residuals:
Min 1Q Median 3Q Max
-70488 -14268 2502 13233 70413
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 51538.4 13726.9 3.755 0.000298 ***
fs$years 1210.0 436.5 2.772 0.006691 **
fs$pubs 618.9 126.5 4.894 3.98e-06 ***
fs$age 227.2 355.6 0.639 0.524444
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 27010 on 96 degrees of freedom
Multiple R-squared: 0.5471, Adjusted R-squared: 0.5329
F-statistic: 38.65 on 3 and 96 DF, p-value: < 2.2e-16
# Round off the R squared coefficients and save the result in the vector (in one step!)
r_squared[3] <- round(summary(model_3)$r.squared,3)
# Print out the vector in order to display all R squared coefficients simultaneously
r_squared
[1] 0.421 0.545 0.547
Can you tell that the R² coefficient indeed increases when adding age as an extra predictor variable to the regression model? This would suggest that the extended model is preferred according to the R² criterion. However, this is not the case: the extended model is redundant, since the regression coefficient for age is not significantly different from zero. Remember that a good statistician does not look only at the R² coefficient when selecting a regression model!
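A quick way to see this in the outputs above is the adjusted R², which penalizes extra predictors: it drops from roughly 0.536 for model_2 to roughly 0.533 for model_3. You can extract it directly (a small illustration, not part of the original exercise):
# Compare the adjusted R squared of both models; it penalizes the extra, insignificant predictor
summary(model_2)$adj.r.squared
summary(model_3)$adj.r.squared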
Can you give the correct interpretation of the regression constant in a multiple regression model? > It is the predicted value for the outcome variable when all predictor variables are set equal to zero.
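As a small check with the model fitted earlier, the intercept of model_2 is exactly that prediction for years = 0 and pubs = 0:
# The intercept is the predicted salary when all predictors equal zero
coef(model_2)["(Intercept)"]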
Now that you are an expert in the domain of multiple regression, fit the multiple regression model of the outcome variable salary on the predictor variables years and pubs once more. The required dataset fs is still pre-loaded into R. Print out the output to help you answer the following question, which essentially consists of two parts. If you feel unsure about doing this, return to the earlier questions to revise the basics of multiple regression before moving on.
Which of the following is the correct value of the regression coefficient for publications? And how would you interpret this number? > The regression coefficient for publications is equal to 634.9. This means that the predicted increase in salary for a unit increase in the number of publications is $634.90, holding the number of years of experience constant.
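# Compute the standardized regression coefficients (betas) of model_2 with lm.beta() from the QuantPsyc package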
library('QuantPsyc')
lm.beta(model_2)
fs$years fs$pubs
0.3227385 0.4798728
This chapter will help you understand how the correlation matrix and the regression coefficients are constructed. To this purpose, you will first refresh and practice some basic matrix concepts. By going through the exercises, you will acquaint yourself with the implementation of the theoretical concepts in R and become familiar with the step-by-step construction of the correlation matrix and the regression coefficients.
A matrix is a rectangular table that contains known or unknown numbers. The size, or order, of the table is given by the number of rows and columns it contains. For example, matrix m below consists of 4 rows and 2 columns, which means that the order of matrix m is 4x2.
m <- matrix(1:8,ncol=2,byrow = T)
m
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[4,] 7 8
A trick that is often used in matrix algebra is taking the transpose of a matrix. The transpose of a matrix is constructed by rewriting its rows as columns. As a result, the order changes: the rows and columns swap places. For example, taking the transpose of matrix m results in a 2x4 matrix.
t(m)
[,1] [,2] [,3] [,4]
[1,] 1 3 5 7
[2,] 2 4 6 8
Start by creating and defining matrices r and s to be used in the next exercises. Use the R terminal to perform these actions.
# Construction of 3 by 8 matrix r that contains the numbers 1 up to 24
r <- matrix(1:24, nrow=3, ncol=8)
r
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 4 7 10 13 16 19 22
[2,] 2 5 8 11 14 17 20 23
[3,] 3 6 9 12 15 18 21 24
# Construction of 3 by 8 matrix s that contains the numbers 21 up to 44
s <- matrix(21:44, nrow=3, ncol=8)
s
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 21 24 27 30 33 36 39 42
[2,] 22 25 28 31 34 37 40 43
[3,] 23 26 29 32 35 38 41 44
# Take the transpose t of matrix r
t <- t(r)
t
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
[4,] 10 11 12
[5,] 13 14 15
[6,] 16 17 18
[7,] 19 20 21
[8,] 22 23 24
You can perform operations on matrices. Some operations put forward some requirements that need to be respected. For example, the addition or subtraction of two matrices is only possible if the matrices are of the same size or order. Elements with the same row and column number in the two matrices are added or subtracted and lead to a new matrix.
The multiplication of two matrices is only possible when they are conformable: the number of columns of the first matrix must equal the number of rows of the second matrix. Consider
R = M^T ⋅ N
with M a k-by-i matrix and N a k-by-i matrix. M and N cannot be multiplied directly since both matrices have the same order, but taking the transpose of M leads to a solution: M^T (an i-by-k matrix) and N (a k-by-i matrix) are conformable.
# Compute the sum of matrices r and s
operation_1 <- r + s
# Compute the difference between matrices r and s
operation_2 <- r - s
# Multiply matrices t and s
operation_3 <- t %*% s
In the previous video, Professor Conway explained how you can go from a raw dataframe to a correlation matrix. Here you will construct such a correlation matrix yourself from scratch.
In your R workspace, a raw data matrix X is needed; here it is built from the first three numeric columns of the built-in iris data, a 150-by-3 matrix. The first step is to sum up its 3 columns. In other words, we need to construct the row vector of sums:
T_{1p} = 1_{1n} X_{np}
where T_{1p} is the 1-by-p row vector of sums, X_{np} the n-by-p raw data matrix, and 1_{1n} a 1-by-n row vector of ones.
irislen <- nrow(iris)
X <- as.matrix(iris[,1:3])
head(X)
Sepal.Length Sepal.Width Petal.Length
[1,] 5.1 3.5 1.4
[2,] 4.9 3.0 1.4
[3,] 4.7 3.2 1.3
[4,] 4.6 3.1 1.5
[5,] 5.0 3.6 1.4
[6,] 5.4 3.9 1.7
I <- matrix(rep(1, irislen), ncol = irislen, nrow = 1)
t_mat <- I %*% X
print("Sum of columns")
[1] "Sum of columns"
t_mat
Sepal.Length Sepal.Width Petal.Length
[1,] 876.5 458.6 563.7
Now that you have the row vector of sums t_mat, a 1-by-3 matrix, you are ready to construct the row vector of means M via:
M_{1p} = T_{1p} N^{-1}
with M_{1p} the 1-by-p row vector of means, T_{1p} the 1-by-p row vector of sums, and N^{-1} the reciprocal of the number of observations for each variable.
Given the row vector of means M, you can also construct the matrix of means MM by multiplying the row vector of means with a column vector:
MM_{np} = 1_{n1} M_{1p}
with MM_{np} the n-by-p matrix of means, M_{1p} the 1-by-p row vector of means, and 1_{n1} an n-by-1 column vector of ones.
# Number of observations
n <- irislen
# Compute the row vector of means
M <- t_mat * n^-1
# Construction of an n-by-1 matrix J of which the elements are all 1
J <- matrix(rep(1,n), nrow=n, ncol=1)
# Compute the matrix of means
MM <- J %*% M
Why do you need a matrix of means? Because you want a matrix of deviation scores! You are interested in the deviation of an observation for that variable to its mean.
As Prof. Conway showed in the video, the formula to calculate a matrix of deviation scores D is:
D_{np} = X_{np} − MM_{np}
where D_{np} is the n-by-p matrix of deviation scores, X_{np} the n-by-p raw data matrix, and MM_{np} the n-by-p matrix of means (the one you calculated in the previous exercise).
# Matrix of deviation scores D
D <- X - MM
tail(D)
Sepal.Length Sepal.Width Petal.Length
[145,] 0.85666667 0.24266667 1.942
[146,] 0.85666667 -0.05733333 1.442
[147,] 0.45666667 -0.55733333 1.242
[148,] 0.65666667 -0.05733333 1.442
[149,] 0.35666667 0.34266667 1.642
[150,] 0.05666667 -0.05733333 1.342
Ok, so you have reached the really cool part now ;-)
If you now take your matrix of deviation scores D and multiply it with its transpose, just like Prof. Conway did in the video, you get the matrix of sums of squares and sums of cross products S. The formula is:
S_{XX} = D^T_{pn} ⋅ D_{np}
with S_{XX} the matrix of sums of squares and sums of cross products, D^T_{pn} the p-by-n transpose of the deviation matrix, and D_{np} the n-by-p deviation matrix.
# Sum of squares and sum of cross products matrix
S <- t(D) %*% D
S
Sepal.Length Sepal.Width Petal.Length
Sepal.Length 102.168333 -6.322667 189.8730
Sepal.Width -6.322667 28.306933 -49.1188
Petal.Length 189.873000 -49.118800 464.3254
diag(S)
Sepal.Length Sepal.Width Petal.Length
102.16833 28.30693 464.32540
Note that the sums of squares are on the diagonal and the sums of cross products are in the off-diagonal cells.
Almost at the finish line! Using the sums of squares and the sums of cross products matrix S you can calculate the variance-covariance matrix C:
C_{XX} = S_{XX} N^{-1}
with C_{XX} the variance-covariance matrix, S_{XX} the matrix of sums of squares and sums of cross products, and N^{-1} the reciprocal of the number of observations for each variable.
Next, you can standardize this variance-covariance matrix by pre- and post-multiplying it by the inverse of the diagonal standard deviation matrix SD. This gives us the correlation matrix R:
R_{XX} = SD_{XX}^{-1} C_{XX} SD_{XX}^{-1}
with R_{XX} the correlation matrix, C_{XX} the variance-covariance matrix, and SD_{XX} the diagonal matrix of standard deviations.
# Construct the variance-covariance matrix
C <- S * n^-1
# Generate the standard deviations matrix
SD <- diag(x = diag(C)^(1/2), nrow = 3, ncol = 3)
# Compute the correlation matrix
R <- solve(SD) %*% C %*% solve(SD)
R
[,1] [,2] [,3]
[1,] 1.0000000 -0.1175698 0.8717538
[2,] -0.1175698 1.0000000 -0.4284401
[3,] 0.8717538 -0.4284401 1.0000000
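As a sanity check (not part of the original exercise), the built-in cor() function should reproduce essentially the same correlation matrix directly from the raw data:
# Compare the hand-built correlation matrix with R's built-in cor()
cor(X)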
This chapter will teach you multiple methods to transform categorical variables into numerical variables by means of dummy coding. As such, you will be able to use these “dummies” alongside other numerical variables in multiple regressions. Additionally, a special version of dummy coding called “effects coding” will also be introduced at the end of the chapter.
Again, you will use the dataset fs. Besides the outcome variable, the yearly wages of professors (salary), and other characteristics of professors at a certain university, this dataset also contains a categorical variable (dept) that holds the information on the department each professor belongs to. There are three departments: history (H), psychology (P) and sociology (S).
As always, start off by exploring the data of each department in order to make sure that you get a good view of the data that you are working with.
library(psych)
describeBy(fs, fs$dept)
Descriptive statistics by group
group: H
vars n mean sd median trimmed mad min max range skew kurtosis se
salary 1 28 137421.29 33736.24 138411 137679.38 45224.49 89387 181427 92040 -0.04 -1.61 6375.55
age 2 28 49.04 11.00 48 49.04 14.08 32 66 34 -0.01 -1.37 2.08
years 3 28 22.25 10.95 22 22.21 14.08 5 40 35 -0.03 -1.34 2.07
pubs 4 28 63.32 29.13 57 62.62 28.17 17 122 105 0.29 -1.14 5.50
dept* 5 28 1.00 0.00 1 1.00 0.00 1 1 0 NaN NaN 0.00
------------------------------------------------------------------------------------------------------------------------------------------------
group: P
vars n mean sd median trimmed mad min max range skew kurtosis se
salary 1 35 129067.37 43738.84 131222 130061.03 48036.24 60072 189447 129375 -0.18 -1.33 7393.21
age 2 35 48.23 9.63 45 48.24 10.38 31 67 36 0.15 -0.97 1.63
years 3 35 22.09 8.35 22 21.93 8.90 5 40 35 0.20 -0.53 1.41
pubs 4 35 59.91 30.52 57 57.86 34.10 14 122 108 0.48 -0.65 5.16
dept* 5 35 2.00 0.00 2 2.00 0.00 2 2 0 NaN NaN 0.00
------------------------------------------------------------------------------------------------------------------------------------------------
group: S
vars n mean sd median trimmed mad min max range skew kurtosis se
salary 1 37 135015.59 40034.84 135837 135091.03 52777.59 71653 199606 127953 -0.02 -1.32 6581.69
age 2 37 53.30 10.01 54 54.10 11.86 31 66 35 -0.47 -0.72 1.64
years 3 37 27.51 8.72 29 28.00 10.38 5 41 36 -0.45 -0.16 1.43
pubs 4 37 76.30 28.11 67 76.19 31.13 25 125 100 0.21 -0.90 4.62
dept* 5 37 3.00 0.00 3 3.00 0.00 3 3 0 NaN NaN 0.00
Can you also tell which department has the largest number of professors?
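A quick way to check this from the console (an illustration, not part of the original exercise) is to tabulate the factor; the counts already shown in the summary (H: 28, P: 35, S: 37) indicate that sociology has the most professors:
# Count the number of professors per department
table(fs$dept)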
In order to automatically create dummy variables, the dummy.code() function of the psych package is easy to use.
The function takes a categorical variable as argument and automatically creates the required dummy variables: all levels are ranked alphabetically and the first one is taken as the reference group. Remember that only (N-1) dummies are created for a categorical variable with N levels. Consequently, the category which is not directly linked with a dummy variable is defined as the reference category.
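As a minimal sketch of how this might look for the department variable (the object name dept_dummies is illustrative; output not shown):
# Create dummy variables for the department factor with psych's dummy.code()
library(psych)
dept_dummies <- dummy.code(fs$dept)
head(dept_dummies)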
head(dept_code)
[1] P P P P P P
attr(,"contrasts")
.L .Q
H -7.071068e-01 0.4082483
P -7.850462e-17 -0.8164966
S 7.071068e-01 0.4082483
Levels: H P S
Before switching to some more advanced exercises, think about the summary results for the extended_fs dataframe. Is there a way to see which department has the largest faculty staff using the dummy variables? One way to look up which department has the highest number of professors is to take the mean of each dummy variable and multiply it by 100: the mean of a dummy equals the number of professors in that department divided by the total number of professors (here 100), as shown in the sketch below.
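A minimal sketch of that idea, assuming the dept_dummies matrix from the dummy.code() sketch above:
# The mean of each dummy times 100 equals the number of professors in that department
colMeans(dept_dummies) * 100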
In order to include a categorical variable in a regression, the variable needs to be converted into a numeric variable by means of dummy variables. Previously, dummy variables were generated using the intuitive, but less general, dummy.code() function from the psych library.
From this point onwards the contrast function C() is used to create dummy variables. Do not confuse this function with the c() function that is used to combine values into a vector or list. C() takes a categorical variable as its first argument and a contrast scheme as its second argument; with treatment contrasts, R ranks all levels alphabetically and takes the first category as the reference group.
This exercise illustrates the inclusion of the categorical variable dept in a multiple regression. The code below first estimates the regression without the categorical variable. The summary() function is used to get the summary of the regression results of model, and the confint() function is used to compute the confidence intervals.
# Regress salary against years and publications
model <- lm(fs$salary ~ fs$years + fs$pubs)
# Apply the summary function to get summarized results for model
summary(model)
Call:
lm(formula = fs$salary ~ fs$years + fs$pubs)
Residuals:
Min 1Q Median 3Q Max
-67835 -14589 2362 13358 69613
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 58828.7 7605.9 7.735 9.79e-12 ***
fs$years 1337.4 387.1 3.455 0.000819 ***
fs$pubs 634.9 123.6 5.137 1.44e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 26930 on 97 degrees of freedom
Multiple R-squared: 0.5451, Adjusted R-squared: 0.5357
F-statistic: 58.12 on 2 and 97 DF, p-value: < 2.2e-16
# Compute the confidence intervals for model
confint(model)
2.5 % 97.5 %
(Intercept) 43733.1268 73924.1805
fs$years 569.0514 2105.6692
fs$pubs 389.5972 880.2303
# Create dummies for the categorical variable fs$dept by using the C() function
dept_code <- C(fs$dept, contr.poly)
dept_code
[1] P P P P P P P S S S S S S S S H H H H H H H P P P P P P P S S S S S S S S H H H H H H H P P P P P P P S S S S S S S S H H H H H H H P P P P P P P S S S S S S S S H H H H H H H P P P P P P
[95] P S S S S S
attr(,"contrasts")
.L .Q
H -7.071068e-01 0.4082483
P -7.850462e-17 -0.8164966
S 7.071068e-01 0.4082483
Levels: H P S
# Regress salary against years, publications and department
model_dummy <- lm(fs$salary ~ fs$years + fs$pubs + dept_code)
# Apply the summary function to get summarized results for model_dummy
summary(model_dummy)
Call:
lm(formula = fs$salary ~ fs$years + fs$pubs + dept_code)
Residuals:
Min 1Q Median 3Q Max
-59233 -12490 -1488 13227 67294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54125.3 7597.6 7.124 2.01e-10 ***
fs$years 1507.8 379.4 3.974 0.000138 ***
fs$pubs 655.5 120.2 5.452 3.93e-07 ***
dept_code.L -13327.3 4742.3 -2.810 0.006011 **
dept_code.Q -2899.4 4543.9 -0.638 0.524947
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 26080 on 95 degrees of freedom
Multiple R-squared: 0.5824, Adjusted R-squared: 0.5648
F-statistic: 33.12 on 4 and 95 DF, p-value: < 2.2e-16
# Compute the confidence intervals for model_dummy
confint(model_dummy)
2.5 % 97.5 %
(Intercept) 39042.0765 69208.4611
fs$years 754.5720 2260.9439
fs$pubs 416.8113 894.2219
dept_code.L -22742.0151 -3912.6476
dept_code.Q -11920.2193 6121.3400
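The object variance2 in the next output is not created anywhere in these notes; judging from the terms in its ANOVA table, it was presumably fitted with aov() along these lines (a hedged reconstruction, not code from the course):
# Presumed construction of variance2: full-factorial ANOVA of salary on years, pubs and dept
variance2 <- aov(salary ~ years * pubs * dept, data = fs)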
summary(variance2)
Df Sum Sq Mean Sq F value Pr(>F)
years 1 6.518e+10 6.518e+10 99.853 3.72e-16 ***
pubs 1 1.914e+10 1.914e+10 29.320 5.26e-07 ***
dept 2 5.765e+09 2.882e+09 4.416 0.0149 *
years:pubs 1 1.074e+05 1.074e+05 0.000 0.9898
years:dept 2 1.995e+08 9.976e+07 0.153 0.8585
pubs:dept 2 4.523e+09 2.262e+09 3.465 0.0356 *
years:pubs:dept 2 2.429e+09 1.214e+09 1.860 0.1617
Residuals 88 5.744e+10 6.528e+08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As a next step, we would like to test whether the inclusion of the categorical variable improves the fit of the model. The dataset fs and the regressions model and model_dummy are available in your workspace. The anova() function compares both models and reports whether or not they differ in a significant way (see the sketch below).
If the inclusion of the categorical variable effectively improves the fit, one might wonder to what extent the specific departments differ from each other.
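A minimal sketch of that comparison, using the two models fitted above (output not shown here):
# Compare the model without and with the department variable
anova(model, model_dummy)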
summary(aov(fs$salary ~ dept_code * fs$years * fs$pubs))
Df Sum Sq Mean Sq F value Pr(>F)
dept_code 2 1.202e+09 6.010e+08 0.921 0.4020
fs$years 1 6.867e+10 6.867e+10 105.204 < 2e-16 ***
fs$pubs 1 2.021e+10 2.021e+10 30.959 2.8e-07 ***
dept_code:fs$years 2 1.621e+08 8.104e+07 0.124 0.8834
dept_code:fs$pubs 2 4.528e+09 2.264e+09 3.469 0.0355 *
fs$years:fs$pubs 1 3.236e+07 3.236e+07 0.050 0.8243
dept_code:fs$years:fs$pubs 2 2.429e+09 1.214e+09 1.860 0.1617
Residuals 88 5.744e+10 6.528e+08
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Retrieve the output of the previous exercise by means of the anova() function. Based on this output, can you tell whether the inclusion of the department dummy variable dept_code significantly improves the fit of the model? And why? > The inclusion of department significantly improves the fit of the model as it results in a significant p-value at a 5 % significance level.
To see what role the department plays in explaining the professors' salary, you can take a look at the actual differences in mean salary among departments.
In order to compute the actual means of the salaries for each department easily, use the tapply() function, in which you enter the variable, the categorical variable and the requested summary statistic (that is, the mean).
# Actual means of fs$salary
tapply(fs$salary, fs$dept, mean)
As seen in the previous exercise, the actual mean salaries do not differ much among the departments. So there must be an explanation for why the predicted salaries of professors in the sociology department tend to be lower on average than those of professors in the history department. Remember that the regression coefficient for the sociology department was significantly different from zero in a multiple regression in which a dummy variable for department was included.
How would you explain the apparent discrepancy between the actual and the predicted mean salary of professors across departments? Take a look at the actual means of the other predictor variables years and publications by means of the tapply() function to help you to come up with a well-considered answer.
The variables defined in the previous exercises remain available. Check via ls()
tapply(fs$salary,fs$dept,mean)
tapply(fs$pubs,fs$dept,mean)
tapply(fs$years,fs$dept,mean)
Because sociology professors have more years on the job and more publications on average than their colleagues in the other departments, their actual mean salary is comparable to that of the other departments. The discrepancy between actual and predicted salary is solely a consequence of the fact that the predicted mean salaries only hold for professors with an average number of publications and an average number of years of experience across all professors of the university.
Professor Conway mentions effects coding as an interesting dummy coding scheme. In the effects coding scheme, a reference category still needs to be appointed, but this time the reference category gets a weight of -1 on every dummy. The number of dummies that must be generated still equals the number of levels minus 1. The effects coding scheme for a categorical variable with 4 levels, of which the fourth category is picked as the reference category, is shown below:
dummy = [  1  0  0
           0  1  0
           0  0  1
          -1 -1 -1 ]
This type of effects coding is called unweighted effects coding since the number of observations within each category of the variable is not taken into account.
In the following steps, you will learn to set up a regression by using the unweighted effects coding scheme for a categorical variable. In the example, the dependent variable salary is regressed against the independent variable dept. The data is already loaded in the workspace under the name fs.
# Check the levels of the categorical variable fs$dept
fs$dept
[1] P P P P P P P S S S S S S S S H H H H H H H P P P P P P P S S S S S S S S H H H H H H H P P P P P P P S S S S S S S S H H H H H H H P P P P P P P S S S S S S S S H H H H H H H P P P P P P
[95] P S S S S S
Levels: H P S
# Factorize the categorical variable fs$dept and name the factorized variable dept.f
dept.f <- factor(fs$dept)
dept.f
[1] P P P P P P P S S S S S S S S H H H H H H H P P P P P P P S S S S S S S S H H H H H H H P P P P P P P S S S S S S S S H H H H H H H P P P P P P P S S S S S S S S H H H H H H H P P P P P P
[95] P S S S S S
Levels: H P S
# Assign the unweighted effects coding scheme (sum contrasts) to dept.f
contrasts(dept.f) <- contr.sum(3)
# Regress salary against dept.f
model_unweighted <- lm(fs$salary ~ dept.f)
# Apply the summary() function
summary(model_unweighted)
Call:
lm(formula = fs$salary ~ dept.f)
Residuals:
Min 1Q Median 3Q Max
-68995 -27249 1488 36471 64590
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 133835 4007 33.403 <2e-16 ***
dept.f1 3586 5907 0.607 0.545
dept.f2 -4767 5579 -0.855 0.395
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 39780 on 97 degrees of freedom
Multiple R-squared: 0.007771, Adjusted R-squared: -0.01269
F-statistic: 0.3799 on 2 and 97 DF, p-value: 0.685
Weighted effects coding differs from unweighted effects coding with respect to the weights, which are now fractions. A reference category is again chosen, and the weights form the following coding scheme (shown here for a categorical variable with 4 levels and the fourth level as the reference category):
dummy = [  1        0        0
           0        1        0
           0        0        1
          -n1/n4  -n2/n4  -n3/n4 ]
with n_i the number of observations in group i. The weights in the reference row represent the number of observations of each non-reference category relative to those of the reference category.
If the weights are computed, the regression of the dependent variable against the non-categorical and categorical variables using the weighted effects coding scheme can start.
# Factorize the categorical variable fs$dept and name the factorized variable dept.g
dept.g <- factor(fs$dept)
# Create the weights matrix from the group sizes (H: 28, P: 35, S: 37),
# here taking the last level (S) as the reference category, as with contr.sum
weights <- rbind(c(1, 0), c(0, 1), c(-28/37, -35/37))
# Assign the weights matrix to dept.g
contrasts(dept.g) <- weights
# Regress salary against dept.g
model_weighted <- lm(fs$salary ~ dept.g)
# Apply the summary() function
summary(model_weighted)