#### Sociology 333: Introduction to Quantitative Analysis
#### Duke University, Summer 2014, Instructor: David Eagle, PhD (Cand.)
You will need to load the GSS 2012 data using this command:
load(url("http://www.soc.duke.edu/~dee4/soc333data/hw10.data"))
In a regression analysis, the dependent variable is one we want to predict or explain; the independent variables are the variables that we think predict or explain the dependent variable.
We'd like to ask: how much extra income does another year of education get you? We'll restrict the analysis to people who report working full time. In R, use the following commands to run a regression:
gss.new = subset(gss, gss$wrkstat == "working fulltime")
# Notice that we specify 'data=gss.new'. This way we don't have to use
# attach or the gss.new$ notation:
lm(coninc ~ educ, data = gss.new)
##
## Call:
## lm(formula = coninc ~ educ, data = gss.new)
##
## Coefficients:
## (Intercept) educ
## -34912 6702
# We usually store the results from our regression in a variable. We'll name
# that variable fit:
fit = lm(coninc ~ educ, data = gss.new)
# We use the summary command to get the output we need
summary(fit)
##
## Call:
## lm(formula = coninc ~ educ, data = gss.new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88595 -30190 -10086 17684 133201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34912 7319 -4.77 2.2e-06 ***
## educ 6702 506 13.25 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44400 on 845 degrees of freedom
## (65 observations deleted due to missingness)
## Multiple R-squared: 0.172, Adjusted R-squared: 0.171
## F-statistic: 176 on 1 and 845 DF, p-value: <2e-16
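The output needs some explanation. Under Coefficients, the intercept tells us that when educ = 0, the predicted average income is -$34,912. A negative income makes no sense; it appears because the line is extended down to zero years of education, where there are few, if any, full-time workers in the data. The educ coefficient says that every additional year of education adds $6,702 to the average predicted income.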
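To see how the coefficients turn into predictions, here is a minimal sketch that plugs the rounded estimates from the output above into the regression equation by hand. The names `b0`, `b1`, and `predicted_income` are made up for illustration; with the fitted model loaded, `coef(fit)` would return the unrounded values.

```r
# Rounded estimates copied from the summary(fit) output above
b0 <- -34912  # intercept
b1 <- 6702    # slope on educ

# Hypothetical helper: predicted income for a given number of years of education
predicted_income <- function(educ) {
  b0 + b1 * educ
}

predicted_income(12)                         # 12 years of education
predicted_income(16)                         # 16 years of education
predicted_income(16) - predicted_income(12)  # four extra years = 4 * b1
```

With these rounded numbers, the prediction is $45,512 for 12 years of education and $72,320 for 16 years.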
The Std. Error column gives you the standard error of each coefficient.
The p-values (the Pr(>|t|) column) tell you the probability of observing a coefficient at least this far from zero if the null hypothesis (that the true coefficient = 0) were true. A small p-value is evidence against the null hypothesis. If you want the confidence intervals for the coefficients, type:
confint(fit)
## 2.5 % 97.5 %
## (Intercept) -49277 -20547
## educ 5709 7695
We are 95% confident that one additional year of education predicts between $5,709 and $7,695 of additional income.
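The t value, p-value, and confidence interval all come from the estimate and its standard error. The sketch below rebuilds them for educ from the rounded numbers printed above, so the results only match the R output approximately. The variable names are illustrative and not part of the original handout.

```r
# Rounded values copied from the educ row of summary(fit)
estimate  <- 6702
std_error <- 506
df_resid  <- 845  # residual degrees of freedom

# t value = estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval: estimate +/- critical t * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)

round(t_value, 2)
p_value
round(ci)
```

The rounded interval should agree with the educ row of `confint(fit)` to within a dollar or two.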
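Below the coefficients are Multiple R-squared and Adjusted R-squared. DON'T USE R-squared. Use Adjusted R-squared. This tells you roughly what proportion of the variance in the dependent variable is explained by the independent variables, with a small penalty for each variable in the model. Here it is 0.171, so education explains about 17% of the variance in income.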
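For reference, Adjusted R-squared can be computed from R-squared with the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations used and p is the number of independent variables. The sketch below applies it to the rounded values in the output; `adj_r2` is a hypothetical helper name, and n = 847 is inferred from the 845 residual degrees of freedom plus the two estimated coefficients.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# sample size n, and number of independent variables p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Model with educ only: R-squared = 0.172, one independent variable
round(adj_r2(0.172, n = 847, p = 1), 3)
```

On a fitted model, `summary(fit)$adj.r.squared` returns the same quantity without rounding.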
We can plot the regression line on a scatterplot with:
plot(x = gss.new$educ, y = gss.new$coninc, ylim = c(-40000, 2e+05))
title("Education Versus Income")
abline(fit, col = "red")
Say we suspect that for every year older we get, we also expect higher earnings. We can add more independent variables, in this case age, by simply adding them to the model. (We also make a new age variable, age2 = age - 18, which counts years since your eighteenth birthday. The first model below uses the original age; age2 is used in a later model.)
gss.new$age2 = gss.new$age - 18
fit2 = lm(coninc ~ educ + age, data = gss.new)
# We use the summary command to get the output we need
summary(fit2)
##
## Call:
## lm(formula = coninc ~ educ + age, data = gss.new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99069 -29598 -10061 15056 142943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57198 8951 -6.39 2.7e-10 ***
## educ 6716 501 13.41 < 2e-16 ***
## age 515 122 4.24 2.5e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43900 on 844 degrees of freedom
## (65 observations deleted due to missingness)
## Multiple R-squared: 0.189, Adjusted R-squared: 0.187
## F-statistic: 98.5 on 2 and 844 DF, p-value: <2e-16
# It might be better to change educ to (educ-12), so that the intercept
# tells us the average income with 12 years of education
gss.new$educ2 = gss.new$educ - 12
fit3 = lm(coninc ~ educ2 + age2, data = gss.new)
# We use the summary command to get the output we need
summary(fit3)
##
## Call:
## lm(formula = coninc ~ educ2 + age2, data = gss.new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -99069 -29598 -10061 15056 142943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32677 3549 9.21 < 2e-16 ***
## educ2 6716 501 13.41 < 2e-16 ***
## age2 515 122 4.24 2.5e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43900 on 844 degrees of freedom
## (65 observations deleted due to missingness)
## Multiple R-squared: 0.189, Adjusted R-squared: 0.187
## F-statistic: 98.5 on 2 and 844 DF, p-value: <2e-16
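Centering the predictors changes only the intercept: the slopes, residuals, and R-squared are identical in fit2 and fit3. As a rough check using the rounded estimates printed for fit2 (the variable names are illustrative), the fit3 intercept should equal the fit2 prediction for someone with 12 years of education at age 18:

```r
# Rounded estimates copied from the summary(fit2) output
b0_fit2 <- -57198  # intercept
b_educ  <- 6716    # slope on educ
b_age   <- 515     # slope on age

# fit2 prediction at educ = 12 and age = 18, where educ2 = 0 and age2 = 0
intercept_check <- b0_fit2 + b_educ * 12 + b_age * 18
intercept_check
```

This gives 32,664, close to the 32,677 reported for fit3. The small gap comes from using rounded coefficients.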
Now, if you have 12 years of education and are 18 years old, both independent variables are equal to zero and your predicted income is given by the following formula:
y = 32677 + 6716*0 + 515*0 = $32,677.
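Instead of plugging values into the formula by hand, you can ask R to do it with `predict()`. The sketch below uses a small made-up data frame (`toy`, with invented numbers and no noise so the answer is easy to verify) rather than the GSS data; the same call pattern should work with fit3 and a `newdata` frame containing educ2 and age2.

```r
# Made-up data: income is exactly 30000 + 6000 * educ2 + 500 * age2
toy <- data.frame(educ2 = c(0, 2, 4, -2, 1, 3),
                  age2  = c(0, 5, 10, 20, 7, 2))
toy$coninc <- 30000 + 6000 * toy$educ2 + 500 * toy$age2

toy_fit <- lm(coninc ~ educ2 + age2, data = toy)

# A hypothetical person: 13 years of education (educ2 = 1), age 21 (age2 = 3)
new_person <- data.frame(educ2 = 1, age2 = 3)

# predict() applies the fitted equation to the new values
by_predict <- predict(toy_fit, newdata = new_person)

# The same calculation by hand: intercept + slope * value for each variable
by_hand <- sum(coef(toy_fit) * c(1, 1, 3))

by_predict
by_hand
```

Both approaches return 37,500 here, which is 30000 + 6000*1 + 500*3.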
**Exercise 1:**
1. If you are 19 years old with 12 years education, what is the predicted amount of income you will earn?
2. If you are 18 years old with 16 years education, what is the predicted amount of income you will earn?
3. If you are 24 years old with 16 years education, what is the predicted amount of income you will earn?
4. If you are 20 years old with 10 years education, what is the predicted amount of income you will earn?
**Exercise 2:**
1. Using the data frame gss, fit a regression that predicts tvhours from age, and store the result in a variable called tvreg.
2. How many hours of television is a 21-year-old predicted to watch per day?
3. What is the adjusted R-squared for this model?
4. Add the number of hours worked per week (in variable hrs1) to the model. What does the coefficient on hrs1 tell us? Is it statistically significant (i.e. can we conclude that it is not zero)?
5. If I work 40 hours per week and I am 40 years old, what is the predicted number of hours I watch TV per day?