R Markdown

For this learning log I am using the Prestige data to model how income (dollars) depends on education (years) and the percentage of women in an occupation.

library(car)
data(Prestige)
summary(Prestige)
##    education          income          women           prestige    
##  Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80  
##  1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23  
##  Median :10.540   Median : 5930   Median :13.600   Median :43.60  
##  Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83  
##  3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27  
##  Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20  
##      census       type   
##  Min.   :1113   bc  :44  
##  1st Qu.:3120   prof:31  
##  Median :5135   wc  :23  
##  Mean   :5402   NA's: 4  
##  3rd Qu.:8312            
##  Max.   :9517
head(Prestige)
##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof
mymod <- lm(income ~ education * women, data = Prestige)
summary(mymod)
## 
## Call:
## lm(formula = income ~ education * women, data = Prestige)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7997.2  -968.6   -54.8   672.3 15716.1 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -3843.435   1488.382  -2.582   0.0113 *  
## education        1168.069    136.434   8.561 1.59e-13 ***
## women              36.650     42.181   0.869   0.3870    
## education:women    -9.364      3.838  -2.440   0.0165 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2770 on 98 degrees of freedom
## Multiple R-squared:  0.5869, Adjusted R-squared:  0.5742 
## F-statistic: 46.41 on 3 and 98 DF,  p-value: < 2.2e-16

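From this summary, we learn a lot about our model. Our beta(zero) is -3843.435, our beta(one) is 1168.069, our beta(two) is 36.65, and our beta(three), the coefficient on the education:women interaction, is -9.364. Because the model includes an interaction, beta(one) and beta(two) cannot be read as effects with all else held constant. Beta(one) tells us that for an occupation with 0 percent women, each additional year of education is associated with an increase in average income of 1168.069 dollars. Beta(two) tells us that for an occupation with 0 years of education, each additional percentage point of women is associated with an increase in average income of 36.65 dollars; since the lowest education value in the data is 6.38 years, this is an extrapolation and is hard to interpret on its own. Beta(three) tells us that the education slope drops by 9.364 dollars for each additional percentage point of women in the occupation.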
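To see how the interaction changes the slopes, here is a small sketch that plugs the coefficients printed in the summary into the model equation. The helper names `education_slope` and `women_slope` are made up for this illustration and are not part of the original analysis.

```r
# Coefficients copied from the summary(mymod) output above
b0 <- -3843.435   # intercept
b1 <- 1168.069    # education
b2 <- 36.650      # women
b3 <- -9.364      # education:women interaction

# Change in average income per extra year of education,
# at a given percentage of women (hypothetical helper)
education_slope <- function(women) b1 + b3 * women

# Change in average income per extra percentage point of women,
# at a given number of years of education (hypothetical helper)
women_slope <- function(education) b2 + b3 * education

education_slope(0)    # equals b1: the slope when there are 0 percent women
education_slope(50)   # smaller slope in an occupation with 50 percent women
women_slope(12)       # negative at 12 years of education
```

With these coefficients the education slope shrinks as the percentage of women rises, and the women slope is negative over the range of education values seen in the data, even though beta(two) itself is positive.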

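We also get information about hypothesis tests from this summary. The F-statistic at the bottom (46.41 on 3 and 98 DF, p-value < 2.2e-16) tests the null hypothesis that income does not have a linear relationship with any of the predictors, meaning all slope betas would equal 0; we reject it. Each row of the coefficient table then tests whether that one beta equals 0, given the other terms in the model. In the table, you can find the test statistic for each term under t value. For example, the test statistic for education is 8.561. At the 0.05 level, the p-values for education (1.59e-13) and for the education:women interaction (0.0165) are small enough to reject the null hypothesis. In words, there is evidence of a relationship between education and income across occupations, and evidence that this relationship changes with the percentage of women in the occupation. With a p-value of 0.387, we do not have enough evidence that the women coefficient on its own differs from 0, but because the interaction is significant, the percentage of women in an occupation still matters for income in this model.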
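As a check on where the p-values come from, the two-sided p-value for a coefficient can be recomputed from its t value and the residual degrees of freedom. This is a minimal sketch using the numbers printed in the summary; the variable names are my own.

```r
# Values copied from the education row of the coefficient table
t_value  <- 8.561
df_resid <- 98      # residual degrees of freedom from the summary

# Two-sided p-value: probability of a t statistic at least this extreme
p_value <- 2 * pt(-abs(t_value), df = df_resid)
p_value

# Same calculation for the women row
p_women <- 2 * pt(-abs(0.869), df = df_resid)
p_women
```

The results should agree with the Pr(>|t|) column up to rounding of the printed t values.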

Confidence and Prediction Intervals

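We can create a confidence interval for income levels by doing the following:

# new data point: 12 years of education, 30 percent women (the values described below)
mydata <- data.frame(education = 12, women = 30)
conf <- predict(mymod, mydata, interval = "confidence")
conf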
##        fit      lwr      upr
## 1 7901.849 7301.251 8502.447

In this example, we created a confidence interval for income at an education level of 12 years and 30 percent women in the profession. The estimate for income at these values is 7901.849 dollars, and the confidence interval is (7301.251, 8502.447), meaning that we are 95% confident that the average income for jobs with 12 years of education and 30 percent women is between 7301.251 and 8502.447 dollars.
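The fit column can be checked by hand from the model equation. The sketch below uses the rounded coefficients printed in the summary, so it should match the fit value only up to rounding.

```r
# Coefficients copied from the summary(mymod) output
b0 <- -3843.435   # intercept
b1 <- 1168.069    # education
b2 <- 36.650      # women
b3 <- -9.364      # education:women interaction

education <- 12   # years
women     <- 30   # percent

# Fitted income = intercept + main effects + interaction term
fit_by_hand <- b0 + b1 * education + b2 * women + b3 * education * women
fit_by_hand
```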

We can also create a prediction interval for our data by doing the following:

pred <- predict(mymod, mydata, interval = "prediction")
pred
##        fit      lwr      upr
## 1 7901.849 2371.259 13432.44

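Our prediction interval (2371.259, 13432.44) is much wider than our confidence interval because it is an interval for the income of a single new occupation rather than for the average income of all occupations with these values. Predicting one point involves the uncertainty in the estimated mean plus the variability of individual observations around that mean, so the interval has to be wider.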
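To make the extra variability concrete, here is a rough sketch that rebuilds the prediction interval from the confidence interval and the residual standard error. It assumes the usual linear-model formula, in which the prediction variance is the variance of the fitted mean plus the residual variance. The inputs are the rounded values printed above, so the result should be close to, but not exactly, the interval that predict() reports.

```r
# Values copied from the output above
fit      <- 7901.849
ci_upper <- 8502.447
sigma    <- 2770     # residual standard error (rounded in the summary)
df_resid <- 98

t_crit  <- qt(0.975, df = df_resid)    # critical value for a 95% interval
se_fit  <- (ci_upper - fit) / t_crit   # standard error of the fitted mean
se_pred <- sqrt(se_fit^2 + sigma^2)    # adds the variance of one observation

pi_lower <- fit - t_crit * se_pred
pi_upper <- fit + t_crit * se_pred
c(lwr = pi_lower, upr = pi_upper)
```

The residual standard error is much larger than the standard error of the fitted mean, which is why the prediction interval is so much wider than the confidence interval.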