library(dplyr)
library(ggplot2)
library(GGally)
library(knitr)
library(pander)
setwd("C:/Users/ramin/Desktop/2020 winter/Data Analysis/Computer Assignment 4/Dataset")
load(file = "OPM94.RData")
names(opm94)
## [1] "x" "sal" "grade" "patco" "major" "age"
## [7] "male" "vet" "handvet" "hand" "yos" "edyrs"
## [13] "promo" "exit" "supmgr" "race" "minority" "grade4"
## [19] "promo01" "supmgr01" "male01" "exit01"
summary(opm94$race) # see the levels of variable race
## American Indian Asian Black Hispanic White
## 17 31 175 49 728
summary(opm94$male) # see the levels of variable male
## female male
## 488 512
opm94AAM <- opm94 %>% dplyr::filter(race == "Black", male == "male")
Scatterplots with salary (sal) as the dependent variable and four independent variables: grade, yos, edyrs and age
ggplot(data=opm94AAM) + geom_point(mapping = aes(x=grade, y = sal))
ggplot(data=opm94AAM) + geom_point(mapping = aes(x=yos, y = sal))
ggplot(data=opm94AAM) + geom_point(mapping = aes(x=edyrs, y = sal))
ggplot(data=opm94AAM) + geom_point(mapping = aes(x=age, y = sal))
cor(opm94AAM[, c("sal", "grade", "yos", "edyrs", "age", "supmgr01", "promo01")], use = "pairwise.complete.obs" ) %>% round(digits = 2)
## sal grade yos edyrs age supmgr01 promo01
## sal 1.00 0.92 0.31 0.52 0.31 0.43 0.10
## grade 0.92 1.00 0.22 0.47 0.19 0.39 0.07
## yos 0.31 0.22 1.00 0.10 0.61 0.30 -0.06
## edyrs 0.52 0.47 0.10 1.00 0.22 0.14 0.07
## age 0.31 0.19 0.61 0.22 1.00 0.27 -0.18
## supmgr01 0.43 0.39 0.30 0.14 0.27 1.00 -0.02
## promo01 0.10 0.07 -0.06 0.07 -0.18 -0.02 1.00
opm94AAM %>% select(sal, grade, yos, edyrs, age, supmgr01, promo01) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 2)
## sal grade yos edyrs age supmgr01 promo01
## sal 1.00 0.92 0.31 0.52 0.31 0.43 0.10
## grade 0.92 1.00 0.22 0.47 0.19 0.39 0.07
## yos 0.31 0.22 1.00 0.10 0.61 0.30 -0.06
## edyrs 0.52 0.47 0.10 1.00 0.22 0.14 0.07
## age 0.31 0.19 0.61 0.22 1.00 0.27 -0.18
## supmgr01 0.43 0.39 0.30 0.14 0.27 1.00 -0.02
## promo01 0.10 0.07 -0.06 0.07 -0.18 -0.02 1.00
QUESTIONS:
Write a couple of sentences about each graph. Talk about the strength and direction of each relationship. Does there seem to be any evidence of curvilinearity?
In the grade plot, salary and grade have a strong positive relationship (r = 0.92): salary rises as grade rises. There is little, if any, evidence of curvilinearity.
In the years of service (yos) plot, salary and yos have a weak positive relationship (r = 0.31): salary tends to rise with years of service. Salary appears to increase sharply at around 20 years of service, which is some evidence of curvilinearity.
In the edyrs plot, salary and years of education have a moderate positive relationship (r = 0.52): salary tends to rise as edyrs increases. At about 11.8 edyrs salary is high, then it drops and starts increasing again, which is evidence of curvilinearity.
In the age plot, age and salary have a weak positive relationship (r = 0.31): salary tends to rise with age. There is no evidence of curvilinearity in this case.
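One way to check an impression of curvilinearity more formally is to compare a straight-line fit with a fit that adds a squared term. The sketch below is only illustrative: it uses simulated data in a hypothetical data frame `sim` (not `opm94AAM`), borrows the column names `yos` and `sal` for readability, and assumes a curved relationship purely to demonstrate the check.

```r
# Illustrative sketch with simulated data; `sim` is hypothetical, not opm94AAM.
set.seed(1)
sim <- data.frame(yos = runif(200, min = 0, max = 35))
# Assumed curved relationship, used only to demonstrate the check.
sim$sal <- 25000 + 300 * sim$yos + 25 * sim$yos^2 + rnorm(200, sd = 4000)

fit_linear    <- lm(sal ~ yos, data = sim)             # straight line
fit_quadratic <- lm(sal ~ yos + I(yos^2), data = sim)  # adds a squared term

# F-test: does the squared term improve the fit beyond the straight line?
comparison <- anova(fit_linear, fit_quadratic)
print(comparison)
```

With the real data, the same comparison could be run by replacing `sim` with `opm94AAM`. A small p-value for the squared term would support curvilinearity; a large one would suggest a straight line is adequate.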
Rank-order the strength of the correlations between sal and each of the other variables. Do these seem in line with what you would have guessed based on the scatterplots? Explain briefly.
Ranked from strongest to weakest, the correlations between sal and the four plotted variables are: grade (0.92), edyrs (0.52), then age and yos (both 0.31). This is in line with what I would have guessed from the scatterplots: the age and yos plots look very similar, and their correlation coefficients are identical.
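The rank order can also be produced in code rather than read off by eye. The sketch below copies the `sal` row of the correlation matrix printed earlier into a named vector (the name `sal_cor` is just for illustration) and sorts all six variables by absolute value.

```r
# Correlations with sal, copied from the "sal" row of the matrix above.
sal_cor <- c(grade = 0.92, yos = 0.31, edyrs = 0.52, age = 0.31,
             supmgr01 = 0.43, promo01 = 0.10)

# Order by absolute value so a strong negative correlation would rank high too.
ranked <- sal_cor[order(abs(sal_cor), decreasing = TRUE)]
print(ranked)
```

Sorting on the absolute value matters in general, because the strength of a correlation does not depend on its sign; here all six values happen to be positive.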
Talk about the direction of each of the correlations between sal and the other variables. Who tends to earn higher salaries, those with more or less education, higher or lower grades, etc.?
All of the correlations with sal are positive, and grade is by far the most strongly correlated with salary. Those who have more education and higher grades tend to have the highest salaries; salary also tends to be somewhat higher for older employees and those with more years of service.
How do supervisors (supmgr=1) differ from other people (supmgr=0), based on the correlation coefficients? Do supervisors tend to have higher or lower salaries, higher or lower grades, etc.? Which variables is supervisory status most strongly related to?
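Based on the correlation coefficients, supervisors tend to have higher salaries (r = 0.43) and higher grades (r = 0.39) than other employees. They also tend to have somewhat more years of service (r = 0.30) and to be somewhat older (r = 0.27), while the link with education is weak (r = 0.14). Supervisory status is most strongly related to salary and grade.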
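When one variable is coded 0/1, the sign of its correlation with another variable says which group has the higher average: a positive correlation means the group coded 1 has the higher mean. The sketch below illustrates this with simulated data only; the data frame `sim` and its values are hypothetical and are not taken from `opm94AAM`.

```r
# Illustrative sketch with simulated data; `sim` and its values are hypothetical.
set.seed(2)
sim <- data.frame(supmgr01 = rbinom(300, size = 1, prob = 0.2))
# Assumed: supervisors (1) earn more on average, purely for demonstration.
sim$sal <- 35000 + 12000 * sim$supmgr01 + rnorm(300, sd = 8000)

# Correlation between the 0/1 indicator and salary.
r <- cor(sim$supmgr01, sim$sal)

# Group means: element "0" = non-supervisors, element "1" = supervisors.
group_means <- tapply(sim$sal, sim$supmgr01, mean)

print(r)
print(group_means)
```

Applied to the real data, `tapply(opm94AAM$sal, opm94AAM$supmgr01, mean)` would give the two group means directly, which is often easier to interpret than the correlation coefficient alone.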
How do people who were promoted between 1994 and 1995 (promo=1) differ from those who were not?
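All of the correlations with promo01 are small (none larger than 0.18 in absolute value), so people who were promoted did not differ much from those who were not. The largest is with age (r = -0.18), which suggests that promoted employees tended to be slightly younger.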