1. Load Libraries, Set Your Working Directory, & Load Data

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
library(GGally)
## Warning: package 'GGally' was built under R version 3.6.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(knitr)
## Warning: package 'knitr' was built under R version 3.6.2
library(pander)
## Warning: package 'pander' was built under R version 3.6.2
## 
## Attaching package: 'pander'
## The following object is masked from 'package:GGally':
## 
##     wrap
setwd("C:/Users/ramin/Desktop/2020 winter/Data Analysis/Computer Assignment 4/Dataset")

load(file = "OPM94.RData")

names(opm94)
##  [1] "x"        "sal"      "grade"    "patco"    "major"    "age"     
##  [7] "male"     "vet"      "handvet"  "hand"     "yos"      "edyrs"   
## [13] "promo"    "exit"     "supmgr"   "race"     "minority" "grade4"  
## [19] "promo01"  "supmgr01" "male01"   "exit01"

2. Let’s work with African-American males only to keep the graphs easy to work with:

summary(opm94$race)  # see the levels of variable race
## American Indian           Asian           Black        Hispanic           White 
##              17              31             175              49             728
summary(opm94$male)  # see the levels of variable male
## female   male 
##    488    512
opm94AAM <- opm94 %>% dplyr::filter(race == "Black", male == "male")
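
A quick sanity check on a subset like this is to compare its row count with the matching cell of a cross-tabulation. The sketch below uses a small made-up data frame (`toy`, not the OPM94 data) with the same two factor columns. The same pattern should carry over to `opm94` and `opm94AAM`, assuming `race` and `male` are factors, as the summaries above suggest.

```r
library(dplyr)

# Toy stand-in for opm94 (made-up rows); race and male are factors
toy <- data.frame(
  race = factor(c("Black", "White", "Black", "Asian", "Black", "White")),
  male = factor(c("male", "male", "female", "male", "male", "female"))
)

toy_sub <- toy %>% dplyr::filter(race == "Black", male == "male")

# The cross-tab cell for Black males should equal the number of rows kept
expected_n <- table(toy$race, toy$male)["Black", "male"]
stopifnot(nrow(toy_sub) == expected_n)

# filter() keeps unused factor levels; droplevels() removes them
toy_sub <- droplevels(toy_sub)
levels(toy_sub$race)
```

Dropping unused levels is optional, but it keeps later summaries and plot legends from listing groups that are no longer in the data.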

3. Scatterplots

Scatterplots with salary (sal) as the dependent variable and four independent variables: grade, yos, edyrs and age

ggplot(data=opm94AAM) + geom_point(mapping = aes(x=grade, y = sal))

ggplot(data=opm94AAM) + geom_point(mapping = aes(x=yos, y = sal))

ggplot(data=opm94AAM) + geom_point(mapping = aes(x=edyrs, y = sal))

ggplot(data=opm94AAM) + geom_point(mapping = aes(x=age, y = sal))
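
One way to judge curvilinearity by eye is to overlay a straight-line fit and a flexible loess fit on the same scatterplot. Where the two lines separate, the relationship is probably not linear. The sketch below uses made-up data (`toy`, deliberately built to curve upward) rather than `opm94AAM`. The same layers could be added to any of the four plots above.

```r
library(ggplot2)

# Toy data with made-up values (not from OPM94), built to curve upward
toy <- data.frame(yos = 0:30)
toy$sal <- 25000 + 300 * toy$yos + 25 * toy$yos^2 + 500 * sin(toy$yos)

p <- ggplot(data = toy, mapping = aes(x = yos, y = sal)) +
  geom_point() +
  # straight-line fit for reference
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "grey40") +
  # flexible local fit; it bends away from the line if the trend is curved
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

p
```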

4. Correlation Matrix

cor(opm94AAM[, c("sal", "grade", "yos", "edyrs", "age", "supmgr01", "promo01")], use = "pairwise.complete.obs" ) %>% round(digits = 2)
##           sal grade   yos edyrs   age supmgr01 promo01
## sal      1.00  0.92  0.31  0.52  0.31     0.43    0.10
## grade    0.92  1.00  0.22  0.47  0.19     0.39    0.07
## yos      0.31  0.22  1.00  0.10  0.61     0.30   -0.06
## edyrs    0.52  0.47  0.10  1.00  0.22     0.14    0.07
## age      0.31  0.19  0.61  0.22  1.00     0.27   -0.18
## supmgr01 0.43  0.39  0.30  0.14  0.27     1.00   -0.02
## promo01  0.10  0.07 -0.06  0.07 -0.18    -0.02    1.00
opm94AAM %>% select(sal, grade, yos, edyrs, age, supmgr01, promo01) %>% cor(use = "pairwise.complete.obs") %>% round(digits = 2)
##           sal grade   yos edyrs   age supmgr01 promo01
## sal      1.00  0.92  0.31  0.52  0.31     0.43    0.10
## grade    0.92  1.00  0.22  0.47  0.19     0.39    0.07
## yos      0.31  0.22  1.00  0.10  0.61     0.30   -0.06
## edyrs    0.52  0.47  0.10  1.00  0.22     0.14    0.07
## age      0.31  0.19  0.61  0.22  1.00     0.27   -0.18
## supmgr01 0.43  0.39  0.30  0.14  0.27     1.00   -0.02
## promo01  0.10  0.07 -0.06  0.07 -0.18    -0.02    1.00
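
To rank-order the correlations with sal, the sal column of the matrix can be sorted by absolute value. The sketch below copies the rounded values from the output above into a named vector (`r_sal` is an illustrative name). With the real data, the same vector could presumably be taken from the matrix itself by dropping the sal row and keeping the sal column.

```r
# Correlations with sal, copied from the rounded matrix above
r_sal <- c(grade = 0.92, yos = 0.31, edyrs = 0.52, age = 0.31,
           supmgr01 = 0.43, promo01 = 0.10)

# Sort on absolute value so a negative correlation ranks by strength, not sign
ranked <- r_sal[order(abs(r_sal), decreasing = TRUE)]
print(ranked)
```

Because the matrix is rounded to two digits, yos and age tie at 0.31; unrounded coefficients would be needed to separate them.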

QUESTIONS:

  1. Write a couple of sentences about each graph. Talk about the strength and direction of each relationship. Does there seem to be any evidence of curvilinearity?

    For the grade graph, there is a strong positive relationship between salary and grade: as grade increases, salary tends to increase. There is little, if any, evidence of curvilinearity.

    For the years of service (yos) graph, there is a weak positive relationship between salary and years of service: as years of service increase, salary tends to increase. Salary appears to rise sharply at around 20 years of service, which is some evidence of curvilinearity.

    For the edyrs graph, there is a moderate positive relationship between salary and edyrs: as edyrs increases, salary tends to increase. Salary is high at about 11.8 edyrs, drops afterward, and then starts increasing again, so there is some evidence of curvilinearity.

    For the age graph, there is a weak positive relationship between age and salary: as age increases, salary tends to increase. There is no evidence of curvilinearity in this case.

  2. Rank-order the strength of the correlations between sal and each of the other variables. Do these seem in line with what you would have guessed based on the scatterplots? Explain briefly.

    The correlations with sal, from strongest to weakest, are: grade (0.92), edyrs (0.52), supmgr01 (0.43), yos and age (tied at 0.31), and promo01 (0.10). For the four variables plotted, this is in line with what I would have guessed from the scatterplots: grade is clearly the strongest, and yos and age look very similar, which matches their identical rounded coefficients.

  3. Talk about the direction of each of the correlations between sal and the other variables. Who tends to earn higher salaries, those with more or less education, higher or lower grades, etc.?

    All of the correlations with sal are positive, and grade is by far the most strongly correlated with salary. Those who have more education and higher grades tend to have the highest salaries.

  4. How do supervisors (supmgr=1) differ from other people (supmgr=0), based on the correlation coefficients? Do supervisors tend to have higher or lower salaries, higher or lower grades, etc.? Which variables is supervisory status most strongly related to?

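    Based on the correlation coefficients, supervisors tend to have higher salaries (r = 0.43) and higher grades (0.39) than non-supervisors. They also tend to have more years of service (0.30), to be older (0.27), and to have slightly more education (0.14). Supervisory status is most strongly related to sal and grade.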

  5. How do people who were promoted between 1994 and 1995 (promo=1) differ from those who were not?

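    All of the correlations with promo01 are weak (between -0.18 and 0.10), so those who were promoted did not differ much from those who were not. If anything, they tended to be slightly younger (r = -0.18) and to have slightly higher salaries (0.10).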
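
For questions 4 and 5, it may help to recall that a correlation with a 0/1 indicator has the same sign as the difference in group means. A positive coefficient means the group coded 1 has the higher average. The sketch below illustrates this with made-up values (`toy` is not drawn from OPM94; salary is in thousands of dollars).

```r
# Toy data with made-up values (not from OPM94)
toy <- data.frame(
  sal      = c(30, 32, 35, 41, 52, 58, 60, 75),
  supmgr01 = c(0, 0, 0, 0, 0, 1, 1, 1)
)

# Correlation between salary and the 0/1 supervisor flag
r <- cor(toy$sal, toy$supmgr01)

# Mean salary within each group, and the difference (group 1 minus group 0)
group_means <- tapply(toy$sal, toy$supmgr01, mean)
mean_diff <- group_means[["1"]] - group_means[["0"]]

print(r)
print(group_means)
```

On the real data, comparing group means for sal, grade, and the other variables by supmgr01 or promo01 would give the same direction as the corresponding row of the correlation matrix.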