1. Load Libraries, Set WD, and Load Data

Load libraries

library(dplyr)
library(ggplot2)
library(knitr)
library(pander)
library(GGally)
library(corrplot)

Set WD

setwd("C:/Users/chenk/OneDrive/Documents/Spring 2020/PMAP 4041/Computer Assignments/CA04-Scatterplots and Correlations")

Load Data

We are going to work with a new data set - a random sample of 1,000 federal personnel records for March 1994. This set includes the sort of information the government keeps in its personnel files: grade, salary, occupation, supervisory status, education, age, years of federal experience, sex, race, etc.

load("C:/Users/chenk/OneDrive/Documents/Spring 2020/PMAP 4041/Datasets/Class4set/OPM94.RData")
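Because load() restores objects under the names they were saved with, it can help to check what a file contains before relying on opm94 being present. The following is a minimal sketch that uses a made-up data frame (toy) and a temporary file in place of the real OPM94.RData; both are assumptions for illustration only.

```r
# Hypothetical stand-in for the real data set.
toy <- data.frame(sal = c(26045, 37651), grade = c(7, 9))

# Save it to a temporary .RData file.
tmp <- tempfile(fileext = ".RData")
save(toy, file = tmp)

# Load into a fresh environment so nothing in the workspace is overwritten.
env <- new.env()
loaded <- load(tmp, envir = env)  # load() returns the names of the restored objects

print(loaded)        # names of the objects stored in the file
print(dim(env$toy))  # rows and columns of the restored data frame
```

The same check on the real file would show whether it holds an object called opm94.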

Variable Names:

names(opm94)
##  [1] "x"        "sal"      "grade"    "patco"    "major"    "age"     
##  [7] "male"     "vet"      "handvet"  "hand"     "yos"      "edyrs"   
## [13] "promo"    "exit"     "supmgr"   "race"     "minority" "grade4"  
## [19] "promo01"  "supmgr01" "male01"   "exit01"   "vet01"

Display variable types and sample values

str(opm94)
## 'data.frame':    1000 obs. of  23 variables:
##  $ x       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sal     : int  26045 37651 64926 18588 19573 28648 27805 16560 40440 24285 ...
##  $ grade   : int  7 9 14 4 3 9 7 3 11 6 ...
##  $ patco   : Factor w/ 5 levels "Administrative",..: 1 4 4 2 2 4 5 2 1 2 ...
##  $ major   : Factor w/ 23 levels "     ","AGRIC",..: 16 11 10 1 1 11 1 1 1 6 ...
##  $ age     : int  52 34 37 26 51 44 50 37 59 57 ...
##  $ male    : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ vet     : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 1 ...
##  $ handvet : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ hand    : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ yos     : int  6 4 3 6 14 1 7 5 13 6 ...
##  $ edyrs   : int  16 16 16 12 12 16 14 12 12 14 ...
##  $ promo   : Factor w/ 2 levels "no","yes": 2 1 1 1 NA 1 1 1 1 1 ...
##  $ exit    : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
##  $ supmgr  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ race    : Factor w/ 5 levels "American Indian",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ minority: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ grade4  : Factor w/ 4 levels "grades 1 to 4",..: 3 4 2 1 1 4 3 1 4 3 ...
##  $ promo01 : num  1 0 0 0 NA 0 0 0 0 0 ...
##  $ supmgr01: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ male01  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ exit01  : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ vet01   : num  0 0 0 0 0 0 0 0 1 0 ...


2. African American Males (AAM) Graphs of Certain Variables

summary(opm94$race) # Displays the count for each level of the variable race
## American Indian           Asian           Black        Hispanic           White 
##              17              31             175              49             728
summary(opm94$male) # Displays the count for each level of the variable male
## female   male 
##    488    512
opm94AAM <- opm94 %>% dplyr::filter(race == "Black", male == "male") # keeps only the rows where race is "Black" and male is "male"
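A quick way to confirm a subset like this is to compare its row count with the matching cell of a race-by-sex cross-tabulation. The sketch below uses a small made-up data frame (toy) and base R's subset() rather than dplyr::filter(), so it runs without extra packages; the data values are assumptions.

```r
# Hypothetical six-row data set with the same two factor columns.
toy <- data.frame(
  race = factor(c("Black", "Black", "White", "Black", "Asian", "White")),
  male = factor(c("male", "female", "male", "male", "male", "female"))
)

# Cross-tabulate race by sex.
counts <- table(toy$race, toy$male)
print(counts)

# Keep only the rows that satisfy both conditions.
sub <- subset(toy, race == "Black" & male == "male")

# The subset should have as many rows as the Black/male cell of the table.
print(nrow(sub) == counts["Black", "male"])
```

Applied to the real data, the same comparison would use table(opm94$race, opm94$male) and nrow(opm94AAM).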


3. Scatterplots

Create scatterplots of salary (sal) as the dependent variable against four independent variables: grade, yos, edyrs, and age.

ggplot(data = opm94AAM) +
  geom_point(mapping = aes(x = grade, y = sal)) +
  ggtitle("Salary vs. grade, African American males") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("grade") + ylab("salary")

ggplot(data = opm94AAM) +
  geom_point(mapping = aes(x = yos, y = sal)) +
  ggtitle("Salary vs. years of service (yos), African American males") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("yos") + ylab("salary")

ggplot(data = opm94AAM) +
  geom_point(mapping = aes(x = edyrs, y = sal)) +
  ggtitle("Salary vs. years of education (edyrs), African American males") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("edyrs") + ylab("salary")

ggplot(data = opm94AAM) +
  geom_point(mapping = aes(x = age, y = sal)) +
  ggtitle("Salary vs. age, African American males") +
  theme(plot.title = element_text(hjust = 0.5)) +
  xlab("age") + ylab("salary")
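One common way to look for curvilinearity in scatterplots like these is to overlay a straight-line fit and a flexible smoother and see whether they diverge. This is a sketch on made-up data (toy, with salary rising faster at higher grades), not the assignment's required method.

```r
library(ggplot2)

# Hypothetical data: salary grows with the square of grade, so the true
# relationship is curved rather than a straight line.
toy <- data.frame(grade = 1:15)
toy$sal <- 15000 + 300 * toy$grade^2

p <- ggplot(data = toy, mapping = aes(x = grade, y = sal)) +
  geom_point() +
  # Dashed straight-line (least squares) fit.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linetype = "dashed") +
  # Solid loess smoother that is free to bend with the data.
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

print(p)
```

If the smoother bends away from the dashed line, the relationship is probably not linear, and a single correlation coefficient will understate how closely the two variables are related.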


4. Correlation Matrix

This table is a correlation matrix: each cell shows the Pearson correlation coefficient (r) for one pair of variables, which measures the strength and direction of their linear relationship. The coefficient ranges from -1 to 1. A value of 1 is a perfect positive linear relationship, -1 is a perfect negative linear relationship, and 0 means there is no linear relationship.
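To make the scale concrete, the sketch below computes the coefficient by hand and with cor() for three made-up cases: a perfect upward line, a perfect downward line, and a symmetric V shape that has no linear trend. The helper pearson_r and the vectors are illustrative names, not part of the assignment.

```r
# Pearson's r from its definition: the sum of cross-products of deviations
# divided by the square root of the product of the sums of squared deviations.
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

x    <- c(1, 2, 3, 4, 5)
up   <- 2 * x + 10        # points on a rising straight line
down <- -3 * x + 7        # points on a falling straight line
vee  <- c(2, 1, 0, 1, 2)  # symmetric V: clearly related to x, but not linearly

# The built-in cor() and the hand-written version should agree.
print(c(up = cor(x, up), down = cor(x, down), vee = cor(x, vee)))
print(c(up = pearson_r(x, up), down = pearson_r(x, down), vee = pearson_r(x, vee)))
```

The V-shaped case shows why a coefficient near 0 only rules out a linear relationship, not a relationship of any kind.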

opm94AAM %>% select(sal, grade, yos, edyrs, age, supmgr01, promo01) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 2) %>% pandoc.table(style = "grid")
## 
## 
## +--------------+------+-------+-------+-------+-------+----------+---------+
## |    &nbsp;    | sal  | grade |  yos  | edyrs |  age  | supmgr01 | promo01 |
## +==============+======+=======+=======+=======+=======+==========+=========+
## |   **sal**    |  1   | 0.92  | 0.31  | 0.52  | 0.31  |   0.43   |   0.1   |
## +--------------+------+-------+-------+-------+-------+----------+---------+
## |  **grade**   | 0.92 |   1   | 0.22  | 0.47  | 0.19  |   0.39   |  0.07   |
## +--------------+------+-------+-------+-------+-------+----------+---------+
## |   **yos**    | 0.31 | 0.22  |   1   |  0.1  | 0.61  |   0.3    |  -0.06  |
## +--------------+------+-------+-------+-------+-------+----------+---------+
## |  **edyrs**   | 0.52 | 0.47  |  0.1  |   1   | 0.22  |   0.14   |  0.07   |
## +--------------+------+-------+-------+-------+-------+----------+---------+
## |   **age**    | 0.31 | 0.19  | 0.61  | 0.22  |   1   |   0.27   |  -0.18  |
## +--------------+------+-------+-------+-------+-------+----------+---------+
## | **supmgr01** | 0.43 | 0.39  |  0.3  | 0.14  | 0.27  |    1     |  -0.02  |
## +--------------+------+-------+-------+-------+-------+----------+---------+
## | **promo01**  | 0.1  | 0.07  | -0.06 | 0.07  | -0.18 |  -0.02   |    1    |
## +--------------+------+-------+-------+-------+-------+----------+---------+
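The use = "pairwise.complete.obs" argument matters here because promo01 contains missing values. The sketch below, on a made-up five-row data frame (toy), shows that each pairwise coefficient is computed from only the rows where both variables are present, so different cells of the matrix can rest on different sample sizes.

```r
# Hypothetical data with one missing promo01 value in row 3.
toy <- data.frame(
  sal     = c(20, 30, 40, 50, 60),
  grade   = c(3, 5, 7, 9, 11),
  promo01 = c(0, 0, NA, 1, 1)
)

# Default behaviour: any pair involving a column with an NA comes back as NA.
default_cor <- cor(toy)
print(default_cor)

# Pairwise deletion: each pair uses every row where both values are present.
pw_cor <- cor(toy, use = "pairwise.complete.obs")
print(round(pw_cor, 2))

# Number of rows available to each pair of variables.
n_pairs <- crossprod(!is.na(as.matrix(toy)))
print(n_pairs)
```

In this toy case the sal and grade pair uses all five rows, while every pair involving promo01 uses four.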


5. Questions

Question 1: Write a couple of sentences about each graph. Talk about the strength and direction of each relationship. Does there seem to be any evidence of curvilinearity?


Question 2: Rank-order the strength of the correlations between sal and each of the other variables. Do these seem in line with what you would have guessed based on the scatterplots? Explain briefly.

Rank  Variable  Correlation with Salary
1     grade     0.92
2     edyrs     0.52
3     supmgr01  0.43
4     yos       0.31
4     age       0.31
6     promo01   0.10
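The ranking above can also be produced in code rather than by eye. This sketch types in the sal row of the correlation matrix as a named vector (r_sal, a name chosen here for illustration) and ranks the coefficients by absolute value, giving tied values the same rank.

```r
# Correlations with sal, copied from the matrix in Section 4.
r_sal <- c(grade = 0.92, yos = 0.31, edyrs = 0.52,
           age = 0.31, supmgr01 = 0.43, promo01 = 0.10)

# Sort by strength (absolute value), strongest first.
ranked <- sort(abs(r_sal), decreasing = TRUE)

# ties.method = "min" gives tied values the same, lowest available rank.
ranking <- data.frame(
  rank     = rank(-ranked, ties.method = "min"),
  variable = names(ranked),
  r        = unname(ranked),
  row.names = NULL
)
print(ranking)
```

With the full matrix stored in an object such as M, the same vector could be taken from M["sal", ] after dropping the sal entry itself.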

It is not surprising that grade has by far the strongest correlation with salary. Higher grade levels are assigned to jobs that require skills that are harder to replace and that carry more responsibility, and those jobs pay more.

Question 3: Talk about the direction of each of the correlations between sal and the other variables. Who tends to earn higher salaries, those with more or less education, higher or lower grades, etc.?

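  1. There is a strong positive correlation between grade and salary (r = 0.92): those in higher grades tend to earn higher salaries.
  2. There is a moderate positive correlation between edyrs and salary (r = 0.52): those with more years of education tend to earn higher salaries.
  3. There is a moderate positive correlation between supmgr01 and salary (r = 0.43): supervisors tend to earn higher salaries than non-supervisors.
  4. There is a weak positive correlation between yos and salary (r = 0.31): those with more years of service tend to earn somewhat higher salaries.
  5. There is a weak positive correlation between age and salary (r = 0.31): older employees tend to earn somewhat higher salaries.
  6. There is a very weak positive correlation between promo01 and salary (r = 0.10): promotion is nearly unrelated to salary.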


Question 4: How do supervisors (supmgr=1) differ from other people (supmgr=0), based on the correlation coefficients? Do supervisors tend to have higher or lower salaries, higher or lower grades, etc.? Which variables is supervisory status most strongly related to?

opm94AAM %>% select(sal, grade, yos, edyrs, age, promo01,supmgr01) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 2) %>% pandoc.table(style = "grid")
## 
## 
## +--------------+------+-------+-------+-------+-------+---------+----------+
## |    &nbsp;    | sal  | grade |  yos  | edyrs |  age  | promo01 | supmgr01 |
## +==============+======+=======+=======+=======+=======+=========+==========+
## |   **sal**    |  1   | 0.92  | 0.31  | 0.52  | 0.31  |   0.1   |   0.43   |
## +--------------+------+-------+-------+-------+-------+---------+----------+
## |  **grade**   | 0.92 |   1   | 0.22  | 0.47  | 0.19  |  0.07   |   0.39   |
## +--------------+------+-------+-------+-------+-------+---------+----------+
## |   **yos**    | 0.31 | 0.22  |   1   |  0.1  | 0.61  |  -0.06  |   0.3    |
## +--------------+------+-------+-------+-------+-------+---------+----------+
## |  **edyrs**   | 0.52 | 0.47  |  0.1  |   1   | 0.22  |  0.07   |   0.14   |
## +--------------+------+-------+-------+-------+-------+---------+----------+
## |   **age**    | 0.31 | 0.19  | 0.61  | 0.22  |   1   |  -0.18  |   0.27   |
## +--------------+------+-------+-------+-------+-------+---------+----------+
## | **promo01**  | 0.1  | 0.07  | -0.06 | 0.07  | -0.18 |    1    |  -0.02   |
## +--------------+------+-------+-------+-------+-------+---------+----------+
## | **supmgr01** | 0.43 | 0.39  |  0.3  | 0.14  | 0.27  |  -0.02  |    1     |
## +--------------+------+-------+-------+-------+-------+---------+----------+

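Supervisory status (supmgr01) is most strongly correlated with salary (r = 0.43), a moderate positive relationship, so supervisors tend to earn higher salaries than non-supervisors. Its correlations with grade (0.39), yos (0.30), age (0.27), and edyrs (0.14) are also positive but weaker: supervisors tend to hold higher grades, have more years of service, be older, and have slightly more education. Supervisory status is essentially unrelated to promotion (-0.02).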
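A correlation with a 0/1 variable such as supmgr01 can be read as a comparison of two groups: a positive coefficient means the group coded 1 has the higher mean. The sketch below illustrates this on made-up salaries (toy); the numbers are assumptions, not values from opm94.

```r
# Hypothetical data: four non-supervisors (0) and two supervisors (1).
toy <- data.frame(
  supmgr01 = c(0, 0, 0, 0, 1, 1),
  sal      = c(20000, 25000, 30000, 28000, 45000, 52000)
)

# Correlation between the 0/1 indicator and salary.
r <- cor(toy$supmgr01, toy$sal)

# Mean salary within each group; the names of the result are "0" and "1".
means <- tapply(toy$sal, toy$supmgr01, mean)

print(round(r, 2))
print(means)
```

The sign of r matches the direction of the gap between the two group means, which is why the coefficients can answer "higher or lower" questions about supervisors.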


Question 5: How do people who were promoted between 1994 and 1995 (promo=1) differ from those who were not?

M <- opm94AAM %>% select(sal, grade, yos, edyrs, age, supmgr01, promo01) %>% cor(use = "pairwise.complete.obs") # store the correlation matrix in M
corrplot(M, method = "number")

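Promotion (promo01) is only weakly correlated with every other variable; no coefficient exceeds 0.18 in absolute value. The largest is with age (-0.18), so those who were promoted tended to be slightly younger. The correlations with sal (0.10), grade (0.07), edyrs (0.07), yos (-0.06), and supmgr01 (-0.02) are all close to zero. Overall, people who were promoted do not differ in any distinct way from those who were not.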