For this learning log I am using the Prestige data to look at how education (years) and the percentage of women in an occupation relate to income (dollars).
library(car)
## Warning: package 'car' was built under R version 3.4.3
data(Prestige)
summary(Prestige)
##    education          income          women           prestige
##  Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80
##  1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23
##  Median :10.540   Median : 5930   Median :13.600   Median :43.60
##  Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83
##  3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27
##  Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20
##      census       type
##  Min.   :1113   bc  :44
##  1st Qu.:3120   prof:31
##  Median :5135   wc  :23
##  Mean   :5402   NA's: 4
##  3rd Qu.:8312
##  Max.   :9517
head(Prestige)
##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof
mymod <- lm(income ~ education * women, data = Prestige)
summary(mymod)
##
## Call:
## lm(formula = income ~ education * women, data = Prestige)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -7997.2  -968.6   -54.8   672.3 15716.1
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -3843.435   1488.382  -2.582   0.0113 *
## education        1168.069    136.434   8.561 1.59e-13 ***
## women              36.650     42.181   0.869   0.3870
## education:women    -9.364      3.838  -2.440   0.0165 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2770 on 98 degrees of freedom
## Multiple R-squared: 0.5869, Adjusted R-squared: 0.5742
## F-statistic: 46.41 on 3 and 98 DF, p-value: < 2.2e-16
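The formula `education * women` is R shorthand for both main effects plus their interaction, which is why the coefficient table has an `education:women` row. A minimal sketch on a small made-up data frame (the `toy` values are hypothetical and are not taken from Prestige) shows the columns R builds for this formula:

```r
# Hypothetical toy data, used only to inspect the design matrix columns
toy <- data.frame(education = c(8, 12, 16), women = c(10, 50, 90))

# education * women expands to education + women + education:women
X <- model.matrix(~ education * women, data = toy)
colnames(X)

# The interaction column is the elementwise product of the two predictors
all.equal(unname(X[, "education:women"]), toy$education * toy$women)
```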
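From this summary, we get the estimated coefficients of our model. Our beta(zero) is -3843.435, our beta(one) is 1168.069, our beta(two) is 36.65, and our beta(three) is -9.364. Because the model includes an education:women interaction, beta(one) and beta(two) are not constant effects. Our beta(one) corresponds to the education variable and tells us that, for a profession with 0 percent women, every one-year increase in education is associated with an increase in income of 1168.069 dollars. Our beta(two) corresponds to the women variable and tells us that, for a profession with 0 years of education, every additional percentage point of women is associated with an increase in income of 36.65 dollars; since the minimum education in the data is 6.38 years, this value is an extrapolation and is not very meaningful on its own. Our beta(three) corresponds to the interaction and tells us that the slope for education decreases by 9.364 dollars for every additional percentage point of women in a profession.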
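To see what the interaction does to the education slope, it can help to evaluate the slope at a few values of women. This is a minimal sketch that copies the two relevant coefficients from the summary output; `education_slope` is a hypothetical helper name, not something from the original analysis.

```r
# Coefficients copied from the summary(mymod) output above
b_education   <- 1168.069
b_interaction <- -9.364

# Slope of income with respect to education at a given percentage of women
# (education_slope is a hypothetical helper name)
education_slope <- function(women) b_education + b_interaction * women

# Evaluate at 0%, 30%, and the sample maximum of 97.51% women
education_slope(c(0, 30, 97.51))
```

Under this model the estimated gain per extra year of education shrinks from roughly 1168 dollars at 0 percent women to roughly 887 dollars at 30 percent and roughly 255 dollars at 97.51 percent.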
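We also get information about our hypothesis tests from this summary. Each row of the coefficient table tests the null hypothesis that the corresponding beta equals 0, given the other terms in the model. In the table, you can find the test statistic for each term under t value. For example, the test stat for education is 8.561. At the 0.05 level, we reject the null hypothesis for education (p-value of 1.59e-13) and for the education:women interaction (p-value of 0.0165). In words, this means that there is a relationship between education and income, and that this relationship depends on the percentage of women in a profession. The p-value of .387 for women is not small enough to reject its null hypothesis, but because of the interaction this only tests the effect of women when education is 0, so it does not tell us that the percentage of women is unrelated to income. The F-statistic at the bottom of the summary (46.41 on 3 and 98 DF, p-value < 2.2e-16) tests the null hypothesis that income has no linear relationship with any of the predictors, meaning all of the slopes would equal 0, and we reject it.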
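The t values and p-values in the table can be reproduced from the estimates and standard errors. The sketch below copies those numbers from the summary output and assumes two-sided tests on 98 residual degrees of freedom; the vector names are my own.

```r
# Estimates and standard errors copied from the coefficient table
est <- c(intercept = -3843.435, education = 1168.069,
         women = 36.650, interaction = -9.364)
se  <- c(intercept = 1488.382, education = 136.434,
         women = 42.181, interaction = 3.838)
df_resid <- 98  # residual degrees of freedom from the summary

t_value <- est / se                              # test statistic for each term
p_value <- 2 * pt(-abs(t_value), df = df_resid)  # two-sided p-values

round(cbind(t_value, p_value), 4)
```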
We can create a confidence interval for income levels by doing the following:
mydata <- data.frame(education = 12, women = 30)
predict(mymod, mydata, interval = "confidence")
##        fit      lwr      upr
## 1 7901.849 7301.251 8502.447
In this example, we created a confidence interval for income with an education level of 12 years and 30 percent women in the profession. The estimate for income at this level of education and women is 7901.849 dollars, while the confidence interval is (7301.251, 8502.447), meaning that we are 95% confident that the average income for jobs with an education level of 12 years and 30 percent women is between 7301.251 and 8502.447 dollars.
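As a check on the fitted value, it can be recomputed by hand from the rounded coefficients in the summary. This is only a sketch: the names `b`, `x`, `fit_by_hand`, `half_width`, and `se_fit` are mine, and the standard error is backed out of the reported interval rather than taken from the model object.

```r
# Coefficients copied from summary(mymod): intercept, education, women, interaction
b <- c(-3843.435, 1168.069, 36.650, -9.364)

# Predictor values for the new profession: 1 (intercept), education, women, product
x <- c(1, 12, 30, 12 * 30)

fit_by_hand <- sum(b * x)
fit_by_hand   # close to the reported fit of 7901.849 (coefficients are rounded)

# Half-width of the reported 95% interval and the standard error it implies
half_width <- (8502.447 - 7301.251) / 2
se_fit <- half_width / qt(0.975, df = 98)
se_fit
```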
We can also create a prediction interval for our data by doing the following:
pred <- predict(mymod, mydata, interval = "prediction")
pred
##        fit      lwr      upr
## 1 7901.849 2371.259 13432.44
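Our prediction interval is much wider than our confidence interval because it is an interval for the income of one individual profession with these predictor values rather than for the mean income of all such professions. There is more variability when predicting just one point than when estimating a mean, because a single observation also varies around the mean by the residual error.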
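The extra width can be reproduced approximately from numbers already in the output. The sketch below assumes the usual linear-model result that the prediction variance is the variance of the fitted mean plus the residual variance; the variable names are mine, and the residual standard error is the rounded value of 2770 from the summary, so the match is only approximate.

```r
sigma_hat <- 2770               # residual standard error from the summary (rounded)
t_crit    <- qt(0.975, df = 98) # 95% two-sided critical value

# Standard error of the fitted mean, backed out of the confidence interval
se_fit <- (8502.447 - 7301.251) / 2 / t_crit

# A new observation adds the residual variance to the variance of the mean
se_pred <- sqrt(se_fit^2 + sigma_hat^2)

# Half-widths of the two intervals
half_widths <- c(confidence = t_crit * se_fit, prediction = t_crit * se_pred)
half_widths
```

The prediction half-width computed this way lands within a few dollars of the reported one, (13432.44 - 2371.259) / 2 = 5530.59, and almost all of it comes from the residual standard error rather than from uncertainty in the mean.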