Load libraries
library(knitr)
library(dplyr)
library(GGally)
Set WD
setwd("C:/Users/chenk/OneDrive/Documents/Spring 2020/PMAP 4041/Computer Assignments/CA05 - Bivariate Regression")
Load Data
load("C:/Users/chenk/OneDrive/Documents/Spring 2020/PMAP 4041/Datasets/Class4set/opm94.RData")
Check for structure
names(opm94)
## [1] "x" "sal" "grade" "patco" "major" "age"
## [7] "male" "vet" "handvet" "hand" "yos" "edyrs"
## [13] "promo" "exit" "supmgr" "race" "minority" "grade4"
## [19] "promo01" "supmgr01" "male01" "exit01" "vet01"
Check for format/values
str(opm94)
## 'data.frame': 1000 obs. of 23 variables:
## $ x : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sal : int 26045 37651 64926 18588 19573 28648 27805 16560 40440 24285 ...
## $ grade : int 7 9 14 4 3 9 7 3 11 6 ...
## $ patco : Factor w/ 5 levels "Administrative",..: 1 4 4 2 2 4 5 2 1 2 ...
## $ major : Factor w/ 23 levels " ","AGRIC",..: 16 11 10 1 1 11 1 1 1 6 ...
## $ age : int 52 34 37 26 51 44 50 37 59 57 ...
## $ male : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
## $ vet : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 1 ...
## $ handvet : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ hand : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ yos : int 6 4 3 6 14 1 7 5 13 6 ...
## $ edyrs : int 16 16 16 12 12 16 14 12 12 14 ...
## $ promo : Factor w/ 2 levels "no","yes": 2 1 1 1 NA 1 1 1 1 1 ...
## $ exit : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ supmgr : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ race : Factor w/ 5 levels "American Indian",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ minority: int 1 1 1 1 1 1 1 1 1 1 ...
## $ grade4 : Factor w/ 4 levels "grades 1 to 4",..: 3 4 2 1 1 4 3 1 4 3 ...
## $ promo01 : num 1 0 0 0 NA 0 0 0 0 0 ...
## $ supmgr01: num 0 0 0 0 0 0 0 0 0 0 ...
## $ male01 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ exit01 : num 0 0 0 0 1 0 0 0 0 0 ...
## $ vet01 : num 0 0 0 0 0 0 0 0 1 0 ...
To see how changing the units of measurement affects the regression coefficient and the correlation coefficient, create a new variable (edyrs_months) that measures education in months instead of years.
opm94 <- opm94 %>% mutate(edyrs_months = edyrs*12)
Correlation table for sal, grade, edyrs, edyrs_months, yos, age, male01, and minority:
opm94 %>% select(sal, grade, edyrs, edyrs_months, yos, age, male01, minority) %>% cor(use = "pairwise.complete.obs") %>% round(2)
## sal grade edyrs edyrs_months yos age male01 minority
## sal 1.00 0.91 0.59 0.59 0.40 0.29 0.36 -0.23
## grade 0.91 1.00 0.61 0.61 0.31 0.19 0.35 -0.23
## edyrs 0.59 0.61 1.00 1.00 0.01 0.08 0.31 -0.15
## edyrs_months 0.59 0.61 1.00 1.00 0.01 0.08 0.31 -0.15
## yos 0.40 0.31 0.01 0.01 1.00 0.62 0.08 -0.13
## age 0.29 0.19 0.08 0.08 0.62 1.00 0.09 -0.15
## male01 0.36 0.35 0.31 0.31 0.08 0.09 1.00 -0.12
## minority -0.23 -0.23 -0.15 -0.15 -0.13 -0.15 -0.12 1.00
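Since GGally is already loaded, the same matrix can be visualized as a correlation heatmap. A minimal sketch, assuming GGally's ggcorr() and its label arguments:
opm94 %>%
  select(sal, grade, edyrs, edyrs_months, yos, age, male01, minority) %>%
  ggcorr(label = TRUE, label_round = 2) # heatmap with each correlation printed in its cell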
3a. Which variable is grade most strongly related to? Rank order the variables in terms of the strength of their relationship with grade.
grade is most strongly related to sal (r = .91). Ranked by the absolute value of the correlation (so minority at -.23 outranks age at .19):

| Rank | Variable | Correlation with grade |
|---|---|---|
| 1 | sal | .91 |
| 2 (tie) | edyrs | .61 |
| 2 (tie) | edyrs_months | .61 |
| 4 | male01 | .35 |
| 5 | yos | .31 |
| 6 | minority | -.23 |
| 7 | age | .19 |
3b. Which variable is years of federal service most strongly related to? Most weakly related to?
yos (years of service) is most strongly related to age, with a correlation coefficient of .62, a moderately strong positive relationship.
yos is most weakly related to edyrs and edyrs_months, each with a correlation coefficient of only .01, essentially no linear relationship.
3c. Look at the correlations between edyrs and edyrs_months, and between these two variables and all the other variables. What's going on?
The correlation between edyrs and edyrs_months is a perfect 1.00, and each has identical correlations with every other variable. When we created edyrs_months, all we did was multiply edyrs by 12, the number of months in a year. Correlation is unaffected by this kind of linear rescaling: changing the units of a variable changes its scale but not the underlying relationship, so all of its correlations with other variables stay the same.
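A minimal sketch illustrating this with toy numbers (the vectors here are made up purely for demonstration):
x <- c(1, 2, 3, 4, 5) # years
y <- c(2, 4, 5, 4, 6) # arbitrary outcome
cor(x, y)      # some value r
cor(x * 12, y) # identical r: multiplying by a positive constant leaves correlation unchanged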
Run four regressions:

- sal on grade
- grade on yos
- grade on edyrs
- yos on age

sal on grade:
lm(sal ~ grade, data = opm94) %>% summary()
##
## Call:
## lm(formula = sal ~ grade, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12775 -4778 -505 3413 45197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5132.8 698.5 -7.348 4.19e-13 ***
## grade 4779.0 68.6 69.662 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7292 on 993 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8301, Adjusted R-squared: 0.83
## F-statistic: 4853 on 1 and 993 DF, p-value: < 2.2e-16
grade on yos:
lm(grade ~ yos, data = opm94) %>% summary()
##
## Call:
## lm(formula = grade ~ yos, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.252 -2.833 0.527 2.684 6.539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.87967 0.19747 39.90 <2e-16 ***
## yos 0.11629 0.01144 10.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.21 on 998 degrees of freedom
## Multiple R-squared: 0.09387, Adjusted R-squared: 0.09296
## F-statistic: 103.4 on 1 and 998 DF, p-value: < 2.2e-16
grade on edyrs:
lm(grade ~ edyrs, data = opm94) %>% summary()
##
## Call:
## lm(formula = grade ~ edyrs, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0775 -2.0775 -0.0775 1.9225 7.5345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.37071 0.54503 -6.184 9.08e-10 ***
## edyrs 0.90301 0.03748 24.095 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.681 on 998 degrees of freedom
## Multiple R-squared: 0.3678, Adjusted R-squared: 0.3671
## F-statistic: 580.6 on 1 and 998 DF, p-value: < 2.2e-16
yos on age:
lm(yos ~ age, data = opm94) %>% summary()
##
## Call:
## lm(formula = yos ~ age, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.2467 -4.3889 0.2288 4.9875 16.6804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.85485 0.96979 -9.131 <2e-16 ***
## age 0.53883 0.02151 25.056 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.96 on 998 degrees of freedom
## Multiple R-squared: 0.3861, Adjusted R-squared: 0.3855
## F-statistic: 627.8 on 1 and 998 DF, p-value: < 2.2e-16
Formula: \(\hat{Y} = a + bX\), where \(a\) is the y-intercept and \(b\) is the regression coefficient (slope).
4a. For each regression, briefly explain the meaning of the y-intercept and the regression coefficient.
For sal on grade, the y-intercept (a) is -5132.8 and the regression coefficient (b) is 4779.0.
What this means: \(\hat{Y}\) is the predicted value of the response variable, here sal, and X is the explanatory variable, grade. The intercept is where the line crosses the y-axis: a hypothetical employee at grade 0 would have a predicted salary of -5132.8 (an extrapolation with no real meaning, since grades start at 1). The coefficient is the slope of the line: each one-grade increase is associated with a 4779.0 increase in predicted salary.
For grade on yos, the y-intercept is 7.8797 and the regression coefficient is 0.1163. The predicted grade for someone with 0 years of service is 7.88, and each additional year of service raises the predicted grade by 0.1163.
For grade on edyrs, the y-intercept is -3.3707 and the regression coefficient is 0.9030. The predicted grade at 0 years of education is -3.37 (again an out-of-range extrapolation), and each additional year of education raises the predicted grade by 0.9030.
For yos on age, the y-intercept is -8.8549 and the regression coefficient is 0.5388. The predicted years of service at age 0 is -8.85 (a meaningless extrapolation), and each additional year of age is associated with 0.5388 more years of service.
Note that the regression coefficient, unlike the correlation coefficient, is not bounded between -1 and 1; it measures the change in Y per one-unit change in X, in the units of the variables.
4b. Find the expected salary for someone at grade 16
# using the formula above: sal_hat = -5132.8 + 4779 * grade
grade16 <- 16
sal16 <- -5132.8 + 4779 * grade16
print(sal16)
## [1] 71331.2
4c. Find the expected grade for someone with 5 years of service
# using the formula above: grade_hat = 7.87967 + 0.11629 * yos
yos5 <- 5
grade5 <- 7.87967 + 0.11629 * yos5
print(grade5)
## [1] 8.46112
4d. Find the expected grade for someone with 12 years of education
# using the formula above: grade_hat = -3.37071 + 0.90301 * edyrs
edyrs12 <- 12
grade12 <- -3.37071 + 0.90301 * edyrs12
print(grade12)
## [1] 7.46541
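The predictions in 4b-4d can also be generated with predict() on the fitted models, which avoids transcription errors from hand-copied coefficients. A sketch, assuming opm94 is still loaded:
m_sal <- lm(sal ~ grade, data = opm94)  # refit the models from question 4
m_yos <- lm(grade ~ yos, data = opm94)
m_ed  <- lm(grade ~ edyrs, data = opm94)
predict(m_sal, newdata = data.frame(grade = 16)) # ~71331 (4b)
predict(m_yos, newdata = data.frame(yos = 5))    # ~8.46 (4c)
predict(m_ed,  newdata = data.frame(edyrs = 12)) # ~7.47 (4d)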
grade on edyrs_months:
lm(grade ~ edyrs_months, data = opm94) %>% summary()
##
## Call:
## lm(formula = grade ~ edyrs_months, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0775 -2.0775 -0.0775 1.9225 7.5345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.370706 0.545034 -6.184 9.08e-10 ***
## edyrs_months 0.075251 0.003123 24.095 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.681 on 998 degrees of freedom
## Multiple R-squared: 0.3678, Adjusted R-squared: 0.3671
## F-statistic: 580.6 on 1 and 998 DF, p-value: < 2.2e-16
grade on edyrs (repeated for comparison):
lm(grade ~ edyrs, data = opm94) %>% summary()
##
## Call:
## lm(formula = grade ~ edyrs, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0775 -2.0775 -0.0775 1.9225 7.5345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.37071 0.54503 -6.184 9.08e-10 ***
## edyrs 0.90301 0.03748 24.095 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.681 on 998 degrees of freedom
## Multiple R-squared: 0.3678, Adjusted R-squared: 0.3671
## F-statistic: 580.6 on 1 and 998 DF, p-value: < 2.2e-16
4e. Why is the regression coefficient on edyrs_months different from the coefficient on edyrs? How are they the same?
The coefficient on edyrs_months differs from the coefficient on edyrs because of the change in units: a month is 1/12 of a year, so the per-month slope is the per-year slope divided by 12 (0.90301 / 12 ≈ 0.075251). Multiplying the per-month coefficient by 12 recovers the per-year coefficient:
COE12 <- 0.075251 * 12 # per-month coefficient from grade ~ edyrs_months, converted back to per-year units
print(COE12)
## [1] 0.903012
COE13 <- 0.90301 # per-year coefficient from grade ~ edyrs
print(COE13)
## [1] 0.90301
Everything else in the two regression tables is identical except the slope and its standard error, both of which are divided by 12: the y-intercept, residuals, t-value, R-squared, and F-statistic are all unchanged. Rescaling the predictor from years to months only rescales its coefficient; it does not change the fit of the model.
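The same check can be done by pulling the slopes from the fitted models directly rather than typing them in. A sketch:
b_year  <- coef(lm(grade ~ edyrs, data = opm94))["edyrs"]               # per-year slope
b_month <- coef(lm(grade ~ edyrs_months, data = opm94))["edyrs_months"] # per-month slope
all.equal(unname(b_year), unname(b_month * 12)) # TRUE: the slopes differ only by the unit conversion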
opm94 <- opm94 %>% mutate(nonvet01 = if_else(vet01 == 0, 1, 0)) # create a 0/1 indicator for nonveterans
sal on vet01:
lm(sal ~ vet01, data = opm94) %>% summary()
##
## Call:
## lm(formula = sal ~ vet01, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30056 -14230 -3155 10464 72731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39439.7 636.2 61.995 < 2e-16 ***
## vet01 5669.8 1306.3 4.341 1.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17530 on 993 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.01862, Adjusted R-squared: 0.01763
## F-statistic: 18.84 on 1 and 993 DF, p-value: 1.567e-05
sal on nonvet01:
lm(sal ~ nonvet01, data = opm94) %>% summary()
##
## Call:
## lm(formula = sal ~ nonvet01, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30056 -14230 -3155 10464 72731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45110 1141 39.539 < 2e-16 ***
## nonvet01 -5670 1306 -4.341 1.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17530 on 993 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.01862, Adjusted R-squared: 0.01763
## F-statistic: 18.84 on 1 and 993 DF, p-value: 1.567e-05
Mean salaries of vets and nonvets:
opm94 %>% group_by(vet) %>% summarise(Mean_Salary = mean(sal, na.rm = TRUE))
## # A tibble: 2 x 2
## vet Mean_Salary
## <fct> <dbl>
## 1 no 39440.
## 2 yes 45110.
5a. Find the mean grades of veterans and nonveterans from the two regression outputs
lm(grade ~ nonvet01, data = opm94) %>% summary()
##
## Call:
## lm(formula = grade ~ nonvet01, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2331 -3.4071 0.7669 2.5929 6.5929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.2331 0.2183 46.875 < 2e-16 ***
## nonvet01 -0.8260 0.2498 -3.307 0.000976 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.354 on 998 degrees of freedom
## Multiple R-squared: 0.01084, Adjusted R-squared: 0.009849
## F-statistic: 10.94 on 1 and 998 DF, p-value: 0.0009761
lm(grade ~ vet01, data = opm94) %>% summary()
##
## Call:
## lm(formula = grade ~ vet01, data = opm94)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2331 -3.4071 0.7669 2.5929 6.5929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4071 0.1213 77.532 < 2e-16 ***
## vet01 0.8260 0.2498 3.307 0.000976 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.354 on 998 degrees of freedom
## Multiple R-squared: 0.01084, Adjusted R-squared: 0.009849
## F-statistic: 10.94 on 1 and 998 DF, p-value: 0.0009761
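From the outputs: the intercept of grade ~ vet01 (9.4071) is the mean grade of nonveterans, and the intercept of grade ~ nonvet01 (10.2331) is the mean grade of veterans; each intercept plus its slope recovers the other group's mean. A quick check using arithmetic on the printed coefficients:
9.4071 + 0.8260  # mean grade of veterans, 10.2331
10.2331 - 0.8260 # mean grade of nonveterans, 9.4071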
5b. Interpret the Y-intercepts. Why do they differ?
The y-intercept of the grade ~ vet01 regression is 9.4071: the predicted grade when vet01 = 0, i.e., the mean grade of nonveterans. The y-intercept of the grade ~ nonvet01 regression is 10.2331: the predicted grade when nonvet01 = 0, i.e., the mean grade of veterans.
They differ because the intercept is always the predicted Y when X = 0, and "X = 0" refers to a different group under each coding. The same pattern appears in the salary regressions: the intercept of sal ~ vet01 is 39439.7 (the nonveteran mean salary), and adding the coefficient 5669.8 gives 45109.5, the veteran mean; the intercept of sal ~ nonvet01 is 45110 (the veteran mean), and subtracting 5670 returns the nonveteran mean. Both match the group means in the table above.
5c. Interpret the regression coefficients. Why do they differ?
The regression coefficient on vet01 is 0.8260: veterans' mean grade is 0.8260 higher than nonveterans' (9.4071 + 0.8260 = 10.2331). The coefficient on nonvet01 is -0.8260: nonveterans' mean grade is 0.8260 lower than veterans' (10.2331 - 0.8260 = 9.4071). The two coefficients are the same difference in group means viewed from opposite baselines, so they differ only in sign. Note that a dummy-variable coefficient is a mean difference, not a measure of strength like a correlation; the R-squared of 0.011 shows that veteran status explains only about 1% of the variation in grade.
The findings suggest that veterans in the government tend to hold somewhat higher grades than their nonveteran peers.
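A sketch verifying that the dummy coefficient is exactly the difference in group means (assuming opm94 is loaded):
grade_means <- tapply(opm94$grade, opm94$vet, mean, na.rm = TRUE) # mean grade by veteran status
grade_means["yes"] - grade_means["no"] # should equal the vet01 slope, 0.8260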