Load library
library(dplyr)
library(ggplot2)
library(knitr)
library(pander)
library(GGally)
library(corrplot)
Set WD
setwd("C:/Users/chenk/OneDrive/Documents/Spring 2020/PMAP 4041/Computer Assignments/CA04-Scatterplots and Correlations")
Load Data
We are going to work with a new data set - a random sample of 1,000 federal personnel records for March 1994. This set includes the sort of information the government keeps in its personnel files: grade, salary, occupation, supervisory status, education, age, years of federal experience, sex, race, etc.
load("C:/Users/chenk/OneDrive/Documents/Spring 2020/PMAP 4041/Datasets/Class4set/OPM94.RData")
Variable Names:
names(opm94)
## [1] "x" "sal" "grade" "patco" "major" "age"
## [7] "male" "vet" "handvet" "hand" "yos" "edyrs"
## [13] "promo" "exit" "supmgr" "race" "minority" "grade4"
## [19] "promo01" "supmgr01" "male01" "exit01" "vet01"
Displays variable type / values
str(opm94)
## 'data.frame': 1000 obs. of 23 variables:
## $ x : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sal : int 26045 37651 64926 18588 19573 28648 27805 16560 40440 24285 ...
## $ grade : int 7 9 14 4 3 9 7 3 11 6 ...
## $ patco : Factor w/ 5 levels "Administrative",..: 1 4 4 2 2 4 5 2 1 2 ...
## $ major : Factor w/ 23 levels " ","AGRIC",..: 16 11 10 1 1 11 1 1 1 6 ...
## $ age : int 52 34 37 26 51 44 50 37 59 57 ...
## $ male : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
## $ vet : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 1 ...
## $ handvet : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ hand : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ yos : int 6 4 3 6 14 1 7 5 13 6 ...
## $ edyrs : int 16 16 16 12 12 16 14 12 12 14 ...
## $ promo : Factor w/ 2 levels "no","yes": 2 1 1 1 NA 1 1 1 1 1 ...
## $ exit : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
## $ supmgr : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ race : Factor w/ 5 levels "American Indian",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ minority: int 1 1 1 1 1 1 1 1 1 1 ...
## $ grade4 : Factor w/ 4 levels "grades 1 to 4",..: 3 4 2 1 1 4 3 1 4 3 ...
## $ promo01 : num 1 0 0 0 NA 0 0 0 0 0 ...
## $ supmgr01: num 0 0 0 0 0 0 0 0 0 0 ...
## $ male01 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ exit01 : num 0 0 0 0 1 0 0 0 0 0 ...
## $ vet01 : num 0 0 0 0 0 0 0 0 1 0 ...
summary(opm94$race) #Displays the levels of the variable race
## American Indian Asian Black Hispanic White
## 17 31 175 49 728
summary(opm94$male) #Displays the levels of the variable male
## female male
## 488 512
opm94AAM <- opm94 %>% dplyr::filter(race == "Black", male == "male") # creates a subset of data where race is equal to black and male is equal to male
Create scatterplots with salary (sal) as the dependent variable against four independent variables: grade, yos, edyrs, and age.
ggplot(data=opm94AAM) + geom_point(mapping = aes(x=grade, y = sal)) + ggtitle("Scatterplot of AAM on correlation between grade and salary") + theme(plot.title = element_text(hjust = 0.5)) + xlab("grade") + ylab("salary")
ggplot(data=opm94AAM) + geom_point(mapping = aes(x=yos, y = sal)) + ggtitle("Scatterplot of AAM on correlation between yos and salary") + theme(plot.title = element_text(hjust = 0.5)) + xlab("yos") + ylab("salary")
ggplot(data=opm94AAM) + geom_point(mapping = aes(x=edyrs, y = sal)) + ggtitle("Scatterplot of AAM on correlation between edyrs and salary") + theme(plot.title = element_text(hjust = 0.5)) + xlab("edyrs") + ylab("salary")
ggplot(data=opm94AAM) + geom_point(mapping = aes(x=age, y = sal)) + ggtitle("Scatterplot of AAM on correlation between age and salary") + theme(plot.title = element_text(hjust = 0.5)) + xlab("age") + ylab("salary")
This table shows the correlation coefficient for each pair of variables, which measures the strength and direction of their linear relationship. Coefficients range from -1 to 1: values near 1 indicate a strong positive relationship, values near -1 indicate a strong negative relationship, and values near 0 indicate little or no linear relationship.
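To make the scale of the coefficient concrete, the sketch below uses made-up vectors (not the OPM data) that rise exactly with x, fall exactly with x, or have no linear trend at all. The variable names are hypothetical and chosen only for this illustration.

```r
# Toy vectors, invented for illustration; they are not from opm94.
x <- 1:10
y_up   <- 2 * x + 3      # rises exactly in step with x
y_down <- 50 - 4 * x     # falls exactly in step with x
y_none <- (x - 5.5)^2    # symmetric U shape: no linear trend in x

r_up   <- cor(x, y_up)    # perfect positive correlation
r_down <- cor(x, y_down)  # perfect negative correlation
r_none <- cor(x, y_none)  # essentially zero

print(round(c(up = r_up, down = r_down, none = r_none), 2))
```

The last case is a reminder that a coefficient near 0 only rules out a linear relationship; the U-shaped pattern is still a relationship, which is why the scatterplots are worth checking alongside the table.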
opm94AAM %>% select(sal, grade, yos, edyrs, age, supmgr01, promo01) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 2) %>% pandoc.table(style = "grid")
##
##
## +--------------+------+-------+-------+-------+-------+----------+---------+
## | | sal | grade | yos | edyrs | age | supmgr01 | promo01 |
## +==============+======+=======+=======+=======+=======+==========+=========+
## | **sal** | 1 | 0.92 | 0.31 | 0.52 | 0.31 | 0.43 | 0.1 |
## +--------------+------+-------+-------+-------+-------+----------+---------+
## | **grade** | 0.92 | 1 | 0.22 | 0.47 | 0.19 | 0.39 | 0.07 |
## +--------------+------+-------+-------+-------+-------+----------+---------+
## | **yos** | 0.31 | 0.22 | 1 | 0.1 | 0.61 | 0.3 | -0.06 |
## +--------------+------+-------+-------+-------+-------+----------+---------+
## | **edyrs** | 0.52 | 0.47 | 0.1 | 1 | 0.22 | 0.14 | 0.07 |
## +--------------+------+-------+-------+-------+-------+----------+---------+
## | **age** | 0.31 | 0.19 | 0.61 | 0.22 | 1 | 0.27 | -0.18 |
## +--------------+------+-------+-------+-------+-------+----------+---------+
## | **supmgr01** | 0.43 | 0.39 | 0.3 | 0.14 | 0.27 | 1 | -0.02 |
## +--------------+------+-------+-------+-------+-------+----------+---------+
## | **promo01** | 0.1 | 0.07 | -0.06 | 0.07 | -0.18 | -0.02 | 1 |
## +--------------+------+-------+-------+-------+-------+----------+---------+
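The `use = "pairwise.complete.obs"` argument in the call above matters because promo01 contains missing values (the `str()` output shows an `NA`). The following is a minimal sketch on a hypothetical five-row data frame, not the OPM data, showing what `cor()` returns with and without that argument.

```r
# Hypothetical mini data set; the NA mimics the missing promo01 value seen in str(opm94).
toy <- data.frame(
  sal     = c(20000, 25000, 31000, 40000, 52000),
  grade   = c(3, 5, 7, 9, 12),
  promo01 = c(1, 0, NA, 0, 1)
)

# Default behaviour (use = "everything"): any pair involving a column with NA gives NA.
r_default <- cor(toy)

# Pairwise deletion: each pair uses every row where both of its variables are present.
r_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(round(r_default, 2))
print(round(r_pairwise, 2))
```

With pairwise deletion, the sal and grade correlation still uses all five rows, while the correlations involving promo01 use only the four rows where promo01 is observed.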
Question 1: Write a couple of sentences about each graph. Talk about the strength and direction of each relationship. Does there seem to be any evidence of curvilinearity?
In the first graph (grade and salary), there is a strong positive linear relationship: as x (grade) increases, y (salary) increases almost perfectly in step. In other words, the further an employee advances in grade level, the higher the salary. In this dataset, grade is similar to the clearance level an employee holds.
In the second graph (yos, or years of service, and salary), there is a weak positive relationship that is arguably linear: as x (yos) increases, y (salary) increases in some cases. The graph does not appear curvilinear.
In the third graph (edyrs, or years of education, and salary), there is a moderate positive relationship that somewhat resembles a curvilinear form: the pattern in y (salary) changes at a certain point in x (12 edyrs) and then curves back up toward another point (15.75 edyrs). This is probably because many individuals in this file finish high school, which ends in grade 12, and do not pursue higher education, which results in a widely varied distribution of salaries at that point. The later point is similar: it falls almost four years after the end of high school, which marks the end of a bachelor's degree.
In the fourth graph (age and salary), there is a moderate positive relationship that somewhat resembles a linear form: as x (age) increases, y (salary) increases in some instances.
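The judgments about curvilinearity above are made by eye. One informal way to back up such a judgment, which this assignment does not use, is to compare a straight-line fit with a quadratic fit. The sketch below does this on simulated data (not the OPM data) that was deliberately built with a bend; every variable name and number is an assumption made for illustration.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Simulated data, invented for illustration: y bends upward as x grows.
x <- rep(8:20, each = 5)
y <- 15000 + 120 * (x - 8)^2 + rnorm(length(x), sd = 2000)

fit_linear <- lm(y ~ x)           # straight line
fit_quad   <- lm(y ~ x + I(x^2))  # adds a squared term to allow a curve

r2_linear <- summary(fit_linear)$r.squared
r2_quad   <- summary(fit_quad)$r.squared

print(round(c(linear = r2_linear, quadratic = r2_quad), 3))
```

The quadratic model can never have a lower R-squared than the linear one, so only a clearly larger value, together with a visible bend in the scatterplot, would suggest curvilinearity.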
Question 2: Rank-order the strength of the correlations between sal and each of the other variables. Do these seem in line with what you would have guessed based on the scatterplots? Explain briefly.
| Rank | Variable | Correlation with Salary |
|---|---|---|
| 1 | Grade | .92 |
| 2 | Edyrs | .52 |
| 3 | Supmgr01 | .43 |
| 4 | Yos | .31 |
| 4 | Age | .31 |
| 6 | Promo01 | .10 |
It is not surprising that individuals at higher grade levels generally receive higher salaries. Jobs at high grade levels require skills that are harder to imitate and carry more responsibility, which results in higher pay.
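The rank order above can also be reproduced in code. This is a small sketch that types in the sal row of the correlation table by hand (the vector name `r_sal` is hypothetical) and ranks the variables by the absolute size of their correlation with sal.

```r
# Correlations with sal, copied from the sal row of the correlation table.
r_sal <- c(grade = 0.92, yos = 0.31, edyrs = 0.52, age = 0.31,
           supmgr01 = 0.43, promo01 = 0.10)

# Rank by absolute size, largest first; tied values share the lowest rank.
ranks <- rank(-abs(r_sal), ties.method = "min")

ranking <- data.frame(variable = names(r_sal), r = r_sal, rank = ranks,
                      row.names = NULL)
ranking <- ranking[order(ranking$rank), ]
print(ranking)
```

Using `ties.method = "min"` gives yos and age the same rank of 4 and places promo01 at 6, matching the table.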
Question 3: Talk about the direction of each of the correlations between sal and the other variables. Who tends to earn higher salaries, those with more or less education, higher or lower grades, etc.?
The correlation between grade and salary is positive: those with higher grade levels tend to earn higher salaries.
The correlation between edyrs and salary is positive: those with more years of education tend to earn higher salaries.
The correlation between yos and salary is positive: those with more years of service tend to earn higher salaries.
Question 4: How do supervisors (supmgr=1) differ from other people (supmgr=0), based on the correlation coefficients? Do supervisors tend to have higher or lower salaries, higher or lower grades, etc.? Which variables is supervisory status most strongly related to?
opm94AAM %>% select(sal, grade, yos, edyrs, age, promo01,supmgr01) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 2) %>% pandoc.table(style = "grid")
##
##
## +--------------+------+-------+-------+-------+-------+---------+----------+
## | | sal | grade | yos | edyrs | age | promo01 | supmgr01 |
## +==============+======+=======+=======+=======+=======+=========+==========+
## | **sal** | 1 | 0.92 | 0.31 | 0.52 | 0.31 | 0.1 | 0.43 |
## +--------------+------+-------+-------+-------+-------+---------+----------+
## | **grade** | 0.92 | 1 | 0.22 | 0.47 | 0.19 | 0.07 | 0.39 |
## +--------------+------+-------+-------+-------+-------+---------+----------+
## | **yos** | 0.31 | 0.22 | 1 | 0.1 | 0.61 | -0.06 | 0.3 |
## +--------------+------+-------+-------+-------+-------+---------+----------+
## | **edyrs** | 0.52 | 0.47 | 0.1 | 1 | 0.22 | 0.07 | 0.14 |
## +--------------+------+-------+-------+-------+-------+---------+----------+
## | **age** | 0.31 | 0.19 | 0.61 | 0.22 | 1 | -0.18 | 0.27 |
## +--------------+------+-------+-------+-------+-------+---------+----------+
## | **promo01** | 0.1 | 0.07 | -0.06 | 0.07 | -0.18 | 1 | -0.02 |
## +--------------+------+-------+-------+-------+-------+---------+----------+
## | **supmgr01** | 0.43 | 0.39 | 0.3 | 0.14 | 0.27 | -0.02 | 1 |
## +--------------+------+-------+-------+-------+-------+---------+----------+
The variable most strongly correlated with supmgr01 is salary, with a correlation coefficient of .43. This is a moderate positive relationship, meaning that supervisors tend to earn higher salaries than non-supervisors.
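Because supmgr01 is coded 0/1, the sign of its correlation with salary says which group has the higher average salary. The sketch below illustrates that link on a hypothetical eight-row data frame; the salaries are invented and are not taken from opm94.

```r
# Hypothetical salaries, invented for illustration; not taken from opm94.
toy <- data.frame(
  sal      = c(21000, 24000, 26000, 30000, 33000, 41000, 47000, 52000),
  supmgr01 = c(0, 0, 0, 0, 0, 1, 0, 1)
)

# Mean salary within each group (names of the result are "0" and "1").
group_means <- tapply(toy$sal, toy$supmgr01, mean)
mean_gap <- unname(group_means["1"] - group_means["0"])

# Correlation between salary and the 0/1 indicator.
r <- cor(toy$sal, toy$supmgr01)

print(group_means)
print(round(r, 2))
```

A positive correlation goes with a positive gap in mean salary (supervisors minus non-supervisors), and a negative correlation would go with a negative gap.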
Question 5: How do people who were promoted between 1994 and 1995 (promo=1) differ from those who were not?
M <- opm94AAM %>% select(sal, grade, yos, edyrs, age, supmgr01, promo01) %>% cor(use = "pairwise.complete.obs") # store the correlation matrix in M
corrplot(M, method = "number")
For people who were promoted (promo01), there is no distinct relationship with any of the variables listed above: every correlation coefficient involving promo01 is weak.