library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(knitr)
library(pander)
##
## Attaching package: 'pander'
## The following object is masked from 'package:GGally':
##
## wrap
load(file = "Datasets/OPM94.RData")
names(opm94)
## [1] "x" "sal" "grade" "patco" "major" "age"
## [7] "male" "vet" "handvet" "hand" "yos" "edyrs"
## [13] "promo" "exit" "supmgr" "race" "minority" "grade4"
## [19] "promo01" "supmgr01" "male01" "exit01" "vet01"
summary(opm94$race) # see the levels of variable race
## American Indian Asian Black Hispanic White
## 17 31 175 49 728
summary(opm94$male) # see the levels of variable male
## female male
## 488 512
opm94AAM <- opm94 %>% dplyr::filter(race == "Black", male == "male") # subset: keep rows where race == "Black" and male == "male"
ggplot(data=opm94AAM) + geom_point(mapping = aes(x=grade, y = sal))

ggplot(data=opm94AAM) + geom_point(mapping = aes(x=yos, y = sal))

ggplot(data=opm94AAM) + geom_point(mapping = aes(x=edyrs, y = sal))

ggplot(data=opm94AAM) + geom_point(mapping = aes(x=age, y = sal))

# cor(opm94AAM[, c("sal", "grade", "yos", "edyrs", "age", "supmgr01", "promo01")], use = "pairwise.complete.obs" ) %>% round(digits = 2)
opm94AAM %>% select(sal, grade, yos, edyrs, age, supmgr01, promo01) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 2)
## sal grade yos edyrs age supmgr01 promo01
## sal 1.00 0.92 0.31 0.52 0.31 0.43 0.10
## grade 0.92 1.00 0.22 0.47 0.19 0.39 0.07
## yos 0.31 0.22 1.00 0.10 0.61 0.30 -0.06
## edyrs 0.52 0.47 0.10 1.00 0.22 0.14 0.07
## age 0.31 0.19 0.61 0.22 1.00 0.27 -0.18
## supmgr01 0.43 0.39 0.30 0.14 0.27 1.00 -0.02
## promo01 0.10 0.07 -0.06 0.07 -0.18 -0.02 1.00
#1. Write a couple of sentences about each graph. Talk about the strength and direction of each relationship. Does there seem to be any evidence of curvilinearity?
#In graph 1, there is a strong positive relationship between grade and salary (r = 0.92). This graph is curvilinear.
#In graph 2, there is a weak positive relationship between years of service (yos) and salary (r = 0.31). The points show no clear trend, so there is no evidence of curvilinearity.
#In graph 3, there is a moderate positive relationship between edyrs and salary (r = 0.52). As one increases, the other tends to increase as well.
#In graph 4, there is a weak positive relationship between age and salary as well (r = 0.31). Most of the points are spread out and the graph is curvilinear.
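# A possible way to check the curvilinearity claims above more directly is to
# overlay a straight-line fit and a loess smoother on each scatterplot. This is
# only a sketch: plot_with_fits() is a hypothetical helper, not part of the
# original assignment, and it assumes the column names used in opm94AAM.
library(ggplot2)

plot_with_fits <- function(data, xvar, yvar = "sal") {
  ggplot(data, aes(x = .data[[xvar]], y = .data[[yvar]])) +
    geom_point() +
    # straight-line fit for reference
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "grey40") +
    # flexible curve that is free to bend with the data
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
}

# If the loess curve bends clearly away from the straight line, that suggests
# curvilinearity. The guard only skips the plot when the data are not loaded.
if (exists("opm94AAM")) print(plot_with_fits(opm94AAM, "grade"))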
#2. Rank-order the strength of the correlations between sal and each of the other variables. Do these seem in line with what you would have guessed based on the scatterplots? Explain briefly.
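#1st grade (0.92)
#2nd edyrs (0.52)
#3rd supmgr01 (0.43)
#4th yos and age (tied at 0.31)
#6th promo01 (0.10)
#For the four plotted variables, this order is in line with the scatterplots: grade shows the tightest pattern, while the points for yos and age are the most spread out.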
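# A small sketch of how the ranking can be reproduced: sort the absolute
# correlations with sal. The values below are copied by hand from the sal row
# of the correlation matrix printed above; sal_cor and ranked are illustrative
# names, not objects from the original assignment.
sal_cor <- c(grade = 0.92, yos = 0.31, edyrs = 0.52, age = 0.31,
             supmgr01 = 0.43, promo01 = 0.10)
ranked <- sort(abs(sal_cor), decreasing = TRUE)  # strongest relationship first
ranked
# Taking abs() matters in general because a strong negative correlation is
# still a strong relationship; here every correlation with sal is positive.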
#3. Talk about the direction of each of the correlations between sal and the other variables. Who tends to earn higher salaries, those with more or less education, higher or lower grades, etc.?
#All of the correlations with salary are positive. Grade has the strongest relationship with salary. People with more education, higher grades, more years of service, and greater age tend to earn higher salaries, as do supervisors and those who were promoted.
#4. How do supervisors (supmgr=1) differ from other people (supmgr=0), based on the correlation coefficients? Do supervisors tend to have higher or lower salaries, higher or lower grades, etc.? Which variables is supervisory status most strongly related to?
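#Supervisors tend to have higher salaries, higher grades, more education, more years of service, and greater age, because all of these correlations with supmgr01 are positive. Supervisory status is most strongly related to salary (0.43) and grade (0.39).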
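# Correlations with a 0/1 indicator can be read alongside group means. The
# sketch below assumes supmgr01 is coded 0/1 and that dplyr >= 1.0 is
# installed (for across()). mean_by_flag() is a hypothetical helper, not part
# of the original assignment.
library(dplyr)

mean_by_flag <- function(data, flag, vars) {
  data %>%
    group_by(across(all_of(flag))) %>%   # one group per level of the indicator
    summarise(across(all_of(vars), ~ mean(.x, na.rm = TRUE)), .groups = "drop")
}

# A positive correlation with supmgr01 should show up as a higher mean in the
# supmgr01 == 1 row. The guard only skips this when the data are not loaded.
if (exists("opm94AAM")) {
  print(mean_by_flag(opm94AAM, "supmgr01", c("sal", "grade", "yos", "edyrs", "age")))
}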
#5. How do people who were promoted between 1994 and 1995 (promo=1) differ from those who were not?
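#Because all of the correlation coefficients with promo01 are small (between -0.18 and 0.10), people who were promoted do not differ much from those who were not. The largest is with age (-0.18), so those who were promoted tend to be slightly younger.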
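# One possible follow-up is to ask whether such a small correlation is
# distinguishable from zero in a sample of this size. The sketch below uses
# stats::cor.test(), which drops incomplete pairs. promo_cor_test() is a
# hypothetical helper, and it assumes promo01 is a numeric 0/1 column.
promo_cor_test <- function(data, var, flag = "promo01") {
  cor.test(data[[var]], data[[flag]])  # Pearson correlation test by default
}

# The guard only skips the test when the data are not loaded.
if (exists("opm94AAM")) print(promo_cor_test(opm94AAM, "age"))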