Load libraries
library(knitr)
library(dplyr)
library(GGally)
Set WD
setwd("C:/Users/chenk/OneDrive/Documents/Spring 2020/PMAP 4041/Computer Assignments/CA05 - Bivariate Regression")
Load Data
load("C:/Users/chenk/OneDrive/Documents/Spring 2020/PMAP 4041/Datasets/Class4set/opm94.RData")
Check for structure
names(opm94)
## [1] "x" "sal" "grade" "patco" "major" "age"
## [7] "male" "vet" "handvet" "hand" "yos" "edyrs"
## [13] "promo" "exit" "supmgr" "race" "minority" "grade4"
## [19] "promo01" "supmgr01" "male01" "exit01" "vet01"
Check for format/values
str(opm94)
## 'data.frame': 1000 obs. of 23 variables:
## $ x : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sal : int 26045 37651 64926 18588 19573 28648 27805 16560 40440 24285 ...
## $ grade : int 7 9 14 4 3 9 7 3 11 6 ...
## $ patco : Factor w/ 5 levels "Administrative",..: 1 4 4 2 2 4 5 2 1 2 ...
## $ major : Factor w/ 23 levels " ","AGRIC",..: 16 11 10 1 1 11 1 1 1 6 ...
## $ age : int 52 34 37 26 51 44 50 37 59 57 ...
## $ male : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
## $ vet : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 1 ...
## $ handvet : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ hand : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ yos : int 6 4 3 6 14 1 7 5 13 6 ...
## $ edyrs : int 16 16 16 12 12 16 14 12 12 14 ...
## $ promo : Factor w/ 2 levels "no","yes": 2 1 1 1 NA 1 1 1 1 1 ...
## $ exit : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ supmgr : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ race : Factor w/ 5 levels "American Indian",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ minority: int 1 1 1 1 1 1 1 1 1 1 ...
## $ grade4 : Factor w/ 4 levels "grades 1 to 4",..: 3 4 2 1 1 4 3 1 4 3 ...
## $ promo01 : num 1 0 0 0 NA 0 0 0 0 0 ...
## $ supmgr01: num 0 0 0 0 0 0 0 0 0 0 ...
## $ male01 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ exit01 : num 0 0 0 0 1 0 0 0 0 0 ...
## $ vet01 : num 0 0 0 0 0 0 0 0 1 0 ...
To see how changing the units of measurement affects the regression coefficient and the correlation coefficient, create a new variable (edyrs_months) that measures education in months instead of years.
opm94 <- opm94 %>% mutate(edyrs_months = edyrs*12)
Correlation table for sal, grade, edyrs, edyrs_months, yos, age, male01, and minority:
opm94 %>% select(sal, grade, edyrs, edyrs_months, yos, age, male01, minority) %>% cor(use = "pairwise.complete.obs") %>% round(2)
## sal grade edyrs edyrs_months yos age male01 minority
## sal 1.00 0.91 0.59 0.59 0.40 0.29 0.36 -0.23
## grade 0.91 1.00 0.61 0.61 0.31 0.19 0.35 -0.23
## edyrs 0.59 0.61 1.00 1.00 0.01 0.08 0.31 -0.15
## edyrs_months 0.59 0.61 1.00 1.00 0.01 0.08 0.31 -0.15
## yos 0.40 0.31 0.01 0.01 1.00 0.62 0.08 -0.13
## age 0.29 0.19 0.08 0.08 0.62 1.00 0.09 -0.15
## male01 0.36 0.35 0.31 0.31 0.08 0.09 1.00 -0.12
## minority -0.23 -0.23 -0.15 -0.15 -0.13 -0.15 -0.12 1.00
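Since GGally is already loaded, the same matrix can be visualized as a correlation heatmap. A minimal sketch, assuming GGally's ggcorr() and its label arguments:
opm94 %>%
  select(sal, grade, edyrs, edyrs_months, yos, age, male01, minority) %>%
  ggcorr(label = TRUE, label_round = 2) # heatmap with each correlation printed in its cell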
3a. Which variable is grade most strongly related to? Rank order the variables in terms of the strength of their relationship with grade.
grade is most strongly related to sal (r = .91). Ranked by the absolute value of the correlation (so minority at -.23 outranks age at .19):

| Rank | Variable | Correlation with grade |
|---|---|---|
| 1 | sal | .91 |
| 2 (tie) | edyrs | .61 |
| 2 (tie) | edyrs_months | .61 |
| 4 | male01 | .35 |
| 5 | yos | .31 |
| 6 | minority | -.23 |
| 7 | age | .19 |
3b. Which variable is years of federal service most strongly related to? Most weakly related to?
yos (years of service) is most strongly related to age, with a correlation coefficient of .62, a moderately strong positive relationship.
yos is most weakly related to edyrs and edyrs_months, each with a correlation coefficient of only .01, essentially no linear relationship.
3c. Look at the correlations between edyrs and edyrs_months, and between these two variables and all the other variables. What's going on?
The correlation between edyrs and edyrs_months is a perfect 1.00, and each has identical correlations with every other variable. When we created edyrs_months, all we did was multiply edyrs by 12, the number of months in a year. Correlation is unaffected by this kind of linear rescaling: changing the units of a variable changes its scale but not the underlying relationship, so all of its correlations with other variables stay the same.
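A minimal sketch illustrating this with toy numbers (the vectors here are made up purely for demonstration):
x <- c(1, 2, 3, 4, 5) # years
y <- c(2, 4, 5, 4, 6) # arbitrary outcome
cor(x, y)      # some value r
cor(x * 12, y) # identical r: multiplying by a positive constant leaves correlation unchanged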
Run four regressions:

- sal on grade
- grade on yos
- grade on edyrs
- yos on age

sal on grade:
lm(sal ~ grade, data = opm94) %>% summary()
##
## Call:
## lm(formula = sal ~ grade, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12775 -4778 -505 3413 45197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5132.8 698.5 -7.348 4.19e-13 ***
## grade 4779.0 68.6 69.662 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7292 on 993 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8301, Adjusted R-squared: 0.83
## F-statistic: 4853 on 1 and 993 DF, p-value: < 2.2e-16
grade on yos:
lm(grade ~ yos, data = opm94) %>% summary()
##
## Call:
## lm(formula = grade ~ yos, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.252 -2.833 0.527 2.684 6.539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.87967 0.19747 39.90 <2e-16 ***
## yos 0.11629 0.01144 10.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.21 on 998 degrees of freedom
## Multiple R-squared: 0.09387, Adjusted R-squared: 0.09296
## F-statistic: 103.4 on 1 and 998 DF, p-value: < 2.2e-16
grade on edyrs:
lm(grade ~ edyrs, data = opm94) %>% summary()
##
## Call:
## lm(formula = grade ~ edyrs, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0775 -2.0775 -0.0775 1.9225 7.5345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.37071 0.54503 -6.184 9.08e-10 ***
## edyrs 0.90301 0.03748 24.095 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.681 on 998 degrees of freedom
## Multiple R-squared: 0.3678, Adjusted R-squared: 0.3671
## F-statistic: 580.6 on 1 and 998 DF, p-value: < 2.2e-16
yos on age:
lm(yos ~ age, data = opm94) %>% summary()
##
## Call:
## lm(formula = yos ~ age, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.2467 -4.3889 0.2288 4.9875 16.6804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.85485 0.96979 -9.131 <2e-16 ***
## age 0.53883 0.02151 25.056 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.96 on 998 degrees of freedom
## Multiple R-squared: 0.3861, Adjusted R-squared: 0.3855
## F-statistic: 627.8 on 1 and 998 DF, p-value: < 2.2e-16
Formula: \(\hat{Y} = a + bX\), where \(a\) is the y-intercept and \(b\) is the regression coefficient (slope).
4a. For each regression, briefly explain the meaning of the y-intercept and the regression coefficient.
For sal on grade, the y-intercept (a) is -5132.8 and the regression coefficient (b) is 4779.0.
What this means: \(\hat{Y}\) is the predicted value of the response variable, here sal, and X is the explanatory variable, grade. The intercept is where the line crosses the y-axis: a hypothetical employee at grade 0 would have a predicted salary of -5132.8 (an extrapolation with no real meaning, since grades start at 1). The coefficient is the slope of the line: each one-grade increase is associated with a 4779.0 increase in predicted salary.
For grade on yos, the y-intercept is 7.8797 and the regression coefficient is 0.1163. The predicted grade for someone with 0 years of service is 7.88, and each additional year of service raises the predicted grade by 0.1163.
For grade on edyrs, the y-intercept is -3.3707 and the regression coefficient is 0.9030. The predicted grade at 0 years of education is -3.37 (again an out-of-range extrapolation), and each additional year of education raises the predicted grade by 0.9030.
For yos on age, the y-intercept is -8.8549 and the regression coefficient is 0.5388. The predicted years of service at age 0 is -8.85 (a meaningless extrapolation), and each additional year of age is associated with 0.5388 more years of service.
Note that the regression coefficient, unlike the correlation coefficient, is not bounded between -1 and 1; it measures the change in Y per one-unit change in X, in the units of the variables.
4b. Find the expected salary for someone at grade 16
# using the formula above: sal_hat = -5132.8 + 4779 * grade
grade16 <- 16
sal16 <- -5132.8 + 4779 * grade16
print(sal16)
## [1] 71331.2
4c. Find the expected grade for someone with 5 years of service
# using the formula above: grade_hat = 7.87967 + 0.11629 * yos
yos5 <- 5
grade5 <- 7.87967 + 0.11629 * yos5
print(grade5)
## [1] 8.46112
4d. Find the expected grade for someone with 12 years of education
# using the formula above: grade_hat = -3.37071 + 0.90301 * edyrs
edyrs12 <- 12
grade12 <- -3.37071 + 0.90301 * edyrs12
print(grade12)
## [1] 7.46541
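The predictions in 4b-4d can also be generated with predict() on the fitted models, which avoids transcription errors from hand-copied coefficients. A sketch, assuming opm94 is still loaded:
m_sal <- lm(sal ~ grade, data = opm94)  # refit the models from question 4
m_yos <- lm(grade ~ yos, data = opm94)
m_ed  <- lm(grade ~ edyrs, data = opm94)
predict(m_sal, newdata = data.frame(grade = 16)) # ~71331 (4b)
predict(m_yos, newdata = data.frame(yos = 5))    # ~8.46 (4c)
predict(m_ed,  newdata = data.frame(edyrs = 12)) # ~7.47 (4d)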
grade on edyrs_months:
lm(grade ~ edyrs_months, data = opm94) %>% summary()
##
## Call:
## lm(formula = grade ~ edyrs_months, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0775 -2.0775 -0.0775 1.9225 7.5345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.370706 0.545034 -6.184 9.08e-10 ***
## edyrs_months 0.075251 0.003123 24.095 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.681 on 998 degrees of freedom
## Multiple R-squared: 0.3678, Adjusted R-squared: 0.3671
## F-statistic: 580.6 on 1 and 998 DF, p-value: < 2.2e-16
grade on edyrs (repeated for comparison):
lm(grade ~ edyrs, data = opm94) %>% summary()
##
## Call:
## lm(formula = grade ~ edyrs, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0775 -2.0775 -0.0775 1.9225 7.5345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.37071 0.54503 -6.184 9.08e-10 ***
## edyrs 0.90301 0.03748 24.095 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.681 on 998 degrees of freedom
## Multiple R-squared: 0.3678, Adjusted R-squared: 0.3671
## F-statistic: 580.6 on 1 and 998 DF, p-value: < 2.2e-16
4e. Why is the regression coefficient on edyrs_months different from the coefficient on edyrs? How are they the same?
The coefficient on edyrs_months differs from the coefficient on edyrs because of the change in units: a month is 1/12 of a year, so the per-month slope is the per-year slope divided by 12 (0.90301 / 12 ≈ 0.075251). Multiplying the per-month coefficient by 12 recovers the per-year coefficient:
COE12 <- 0.075251 * 12 # per-month coefficient from grade ~ edyrs_months, converted back to per-year units
print(COE12)
## [1] 0.903012
COE13 <- 0.90301 # per-year coefficient from grade ~ edyrs
print(COE13)
## [1] 0.90301
Everything else in the two regression tables is identical except the slope and its standard error, both of which are divided by 12: the y-intercept, residuals, t-value, R-squared, and F-statistic are all unchanged. Rescaling the predictor from years to months only rescales its coefficient; it does not change the fit of the model.
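The same check can be done by pulling the slopes from the fitted models directly rather than typing them in. A sketch:
b_year  <- coef(lm(grade ~ edyrs, data = opm94))["edyrs"]               # per-year slope
b_month <- coef(lm(grade ~ edyrs_months, data = opm94))["edyrs_months"] # per-month slope
all.equal(unname(b_year), unname(b_month * 12)) # TRUE: the slopes differ only by the unit conversion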
opm94 <- opm94 %>% mutate(nonvet01 = if_else(vet01 == 0, 1, 0)) # create a 0/1 indicator for nonveterans
sal on vet01:
lm(sal ~ vet01, data = opm94) %>% summary()
##
## Call:
## lm(formula = sal ~ vet01, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30056 -14230 -3155 10464 72731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39439.7 636.2 61.995 < 2e-16 ***
## vet01 5669.8 1306.3 4.341 1.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17530 on 993 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.01862, Adjusted R-squared: 0.01763
## F-statistic: 18.84 on 1 and 993 DF, p-value: 1.567e-05
sal on nonvet01:
lm(sal ~ nonvet01, data = opm94) %>% summary()
##
## Call:
## lm(formula = sal ~ nonvet01, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30056 -14230 -3155 10464 72731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45110 1141 39.539 < 2e-16 ***
## nonvet01 -5670 1306 -4.341 1.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17530 on 993 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.01862, Adjusted R-squared: 0.01763
## F-statistic: 18.84 on 1 and 993 DF, p-value: 1.567e-05
Mean salaries of vets and nonvets:
opm94 %>% group_by(vet) %>% summarise(Mean_Salary = mean(sal, na.rm = TRUE))
## # A tibble: 2 x 2
## vet Mean_Salary
## <fct> <dbl>
## 1 no 39440.
## 2 yes 45110.
5a. Find the mean grades of veterans and nonveterans from the two regression outputs
lm(grade ~ nonvet01, data = opm94) %>% summary()
##
## Call:
## lm(formula = grade ~ nonvet01, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2331 -3.4071 0.7669 2.5929 6.5929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.2331 0.2183 46.875 < 2e-16 ***
## nonvet01 -0.8260 0.2498 -3.307 0.000976 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.354 on 998 degrees of freedom
## Multiple R-squared: 0.01084, Adjusted R-squared: 0.009849
## F-statistic: 10.94 on 1 and 998 DF, p-value: 0.0009761
lm(grade ~ vet01, data = opm94) %>% summary()
##
## Call:
## lm(formula = grade ~ vet01, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2331 -3.4071 0.7669 2.5929 6.5929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4071 0.1213 77.532 < 2e-16 ***
## vet01 0.8260 0.2498 3.307 0.000976 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.354 on 998 degrees of freedom
## Multiple R-squared: 0.01084, Adjusted R-squared: 0.009849
## F-statistic: 10.94 on 1 and 998 DF, p-value: 0.0009761
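From the outputs: the intercept of grade ~ vet01 (9.4071) is the mean grade of nonveterans, and the intercept of grade ~ nonvet01 (10.2331) is the mean grade of veterans; each intercept plus its slope recovers the other group's mean. A quick check using arithmetic on the printed coefficients:
9.4071 + 0.8260  # mean grade of veterans, 10.2331
10.2331 - 0.8260 # mean grade of nonveterans, 9.4071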
5b. Interpret the Y-intercepts. Why do they differ?
The y-intercept of the grade ~ vet01 regression is 9.4071: the predicted grade when vet01 = 0, i.e., the mean grade of nonveterans. The y-intercept of the grade ~ nonvet01 regression is 10.2331: the predicted grade when nonvet01 = 0, i.e., the mean grade of veterans.
They differ because the intercept is always the predicted Y when X = 0, and "X = 0" refers to a different group under each coding. The same pattern appears in the salary regressions: the intercept of sal ~ vet01 is 39439.7 (the nonveteran mean salary), and adding the coefficient 5669.8 gives 45109.5, the veteran mean; the intercept of sal ~ nonvet01 is 45110 (the veteran mean), and subtracting 5670 returns the nonveteran mean. Both match the group means in the table above.
5c. Interpret the regression coefficients. Why do they differ?
The regression coefficient on vet01 is 0.8260: veterans' mean grade is 0.8260 higher than nonveterans' (9.4071 + 0.8260 = 10.2331). The coefficient on nonvet01 is -0.8260: nonveterans' mean grade is 0.8260 lower than veterans' (10.2331 - 0.8260 = 9.4071). The two coefficients are the same difference in group means viewed from opposite baselines, so they differ only in sign. Note that a dummy-variable coefficient is a mean difference, not a measure of strength like a correlation; the R-squared of 0.011 shows that veteran status explains only about 1% of the variation in grade.
The findings suggest that veterans in the government tend to hold somewhat higher grades than their nonveteran peers.
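A sketch verifying that the dummy coefficient is exactly the difference in group means (assuming opm94 is loaded):
grade_means <- tapply(opm94$grade, opm94$vet, mean, na.rm = TRUE) # mean grade by veteran status
grade_means["yes"] - grade_means["no"] # should equal the vet01 slope, 0.8260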