CUNY MSDS Bridge-Final Project

Gehad Gad

January 19, 2020

Winter Bridge

#Read data into R.

Data<- read.csv(file="https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/carData/Salaries.csv")

#Import libraries and/or Packages
if (!require(ggplot2)){
install.packages("ggplot2")
library(ggplot2)}

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.6.2

if (!require(dplyr)){
install.packages("dplyr")
library(dplyr)}

## Loading required package: dplyr

## Warning: package 'dplyr' was built under R version 3.6.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

if (!require(caTools)){
install.packages("caTools")
library(caTools)}

## Loading required package: caTools

if (!require(ggpubr)){
install.packages("ggpubr")
library(ggpubr)}

## Loading required package: ggpubr

## Warning: package 'ggpubr' was built under R version 3.6.2

## Loading required package: magrittr

Question 1: Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

#check for NA values in the data.

sum(is.na(Data))

## [1] 0

There are no NA values in the data.
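
A single total can hide where missing values sit. As a minimal sketch on a made-up data frame (`toy` and its values are assumptions, not part of the Salaries data), `colSums(is.na(...))` reports the count per column:

```r
# Toy data frame standing in for Data (values are made up for illustration)
toy <- data.frame(
  salary      = c(100000, NA, 85000, 120000),
  yrs.service = c(10, 5, NA, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Total across the whole data frame, as sum(is.na(Data)) computes
na_total <- sum(is.na(toy))
print(na_total)
```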

#Display the head of the data.

head(Data)

##   X      rank discipline yrs.since.phd yrs.service  sex salary
## 1 1      Prof          B            19          18 Male 139750
## 2 2      Prof          B            20          16 Male 173200
## 3 3  AsstProf          B             4           3 Male  79750
## 4 4      Prof          B            45          39 Male 115000
## 5 5      Prof          B            40          41 Male 141500
## 6 6 AssocProf          B             6           6 Male  97000

The head shows the first 6 rows to give an overall idea of the data.

#Get the data summary.

summary(Data)

##        X              rank     discipline yrs.since.phd    yrs.service   
##  Min.   :  1   AssocProf: 64   A:181      Min.   : 1.00   Min.   : 0.00  
##  1st Qu.:100   AsstProf : 67   B:216      1st Qu.:12.00   1st Qu.: 7.00  
##  Median :199   Prof     :266              Median :21.00   Median :16.00  
##  Mean   :199                              Mean   :22.31   Mean   :17.61  
##  3rd Qu.:298                              3rd Qu.:32.00   3rd Qu.:27.00  
##  Max.   :397                              Max.   :56.00   Max.   :60.00  
##      sex          salary      
##  Female: 39   Min.   : 57800  
##  Male  :358   1st Qu.: 91000  
##               Median :107300  
##               Mean   :113706  
##               3rd Qu.:134185  
##               Max.   :231545

The summary displays some useful information about the data. Besides the row index X, there are 6 columns (rank, discipline, yrs.since.phd, yrs.service, sex, and salary). The rank has three levels (AssocProf, AsstProf, and Prof). The discipline has two classes (A and B). The yrs.since.phd ranges from 1 to 56, the yrs.service from 0 to 60, and the salary from 57800 to 231545. The data is heavily unbalanced by sex: 358 males and 39 females.
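
Quartiles can also be requested directly for a single column. The sketch below uses only the six salaries printed by head(Data) above, so its numbers describe those six rows and not the full data set:

```r
# The six salaries shown in the head() output
x <- c(139750, 173200, 79750, 115000, 141500, 97000)

# First quartile, median, and third quartile
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
print(q)

# Interquartile range (third quartile minus first quartile)
iqr_x <- IQR(x)
print(iqr_x)
```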

mean(Data$salary)

## [1] 113706.5

median(Data$salary)

## [1] 107300

mean(Data$yrs.service)

## [1] 17.61461

#Find the correlation between salary and yrs.service

cor(Data$yrs.service, Data$salary, method= c("kendall"))

## [1] 0.3048343

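Correlation measures how strongly two variables are related. The coefficient ranges from -1.0 to +1.0. The closer it is to +1 or -1, the more closely the two variables are related. The method used here is Kendall's rank correlation (tau), and a value of about 0.30 indicates a modest positive association between years of service and salary.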
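
The choice of method matters because Kendall's tau looks only at the ordering of the values, while Pearson's r measures linear association. A small sketch on made-up vectors (not the Salaries data) shows the difference for a relationship that always increases but is not a straight line:

```r
# Made-up data: y always increases with x, but not linearly
x <- 1:10
y <- x^3

# Rank-based correlation: every pair is ordered the same way in x and y
tau <- cor(x, y, method = "kendall")

# Linear correlation: penalised because the points do not lie on a line
r <- cor(x, y, method = "pearson")

print(c(kendall = tau, pearson = r))
```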

#Find the correlation between salary and yrs.since.phd

cor(Data$yrs.since.phd, Data$salary, method= c("kendall"))

## [1] 0.3410184

lm(Data$yrs.service ~ Data$salary)

## 
## Call:
## lm(formula = Data$yrs.service ~ Data$salary)
## 
## Coefficients:
## (Intercept)  Data$salary  
##   1.2706280    0.0001437

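In a linear model (lm), the two coefficients are the intercept and the slope. Because the formula is yrs.service ~ salary, years of service is the response here: each additional dollar of salary is associated with about 0.0001437 more years of service, or roughly 1.4 years per 10,000 dollars.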
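
To make the roles of intercept and slope concrete, here is a minimal sketch on made-up data generated from y = 2 + 3x (the data frame `toy` is an assumption for illustration). Passing `data =` instead of `Data$...` in the formula gives shorter coefficient names and allows `predict()` on new values:

```r
# Made-up data that lies exactly on the line y = 2 + 3 * x
toy <- data.frame(x = 1:5)
toy$y <- 2 + 3 * toy$x

# Fit the model; the left side of ~ is the response, the right side the predictor
fit <- lm(y ~ x, data = toy)

# Named vector holding the intercept and the slope
coefs <- coef(fit)
print(coefs)

# Prediction for a new x value: intercept + slope * 10
pred <- predict(fit, newdata = data.frame(x = 10))
print(pred)
```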

#Get the variance. The formula for the sample variance is: 1/(n-1)*sum((x-mean(x))^2)

var(Data$salary)

## [1] 917425865

This is the variance of salary.
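
The variance formula in the comment above can be checked by hand. This sketch uses the six salaries from the head() output, so the result applies to those six rows only:

```r
# The six salaries shown in the head() output
x <- c(139750, 173200, 79750, 115000, 141500, 97000)
n <- length(x)

# Sample variance written out: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Built-in functions for comparison
builtin_var <- var(x)
builtin_sd  <- sd(x)

print(c(manual = manual_var, builtin = builtin_var))

# The standard deviation is the square root of the variance
print(c(sqrt_var = sqrt(manual_var), sd = builtin_sd))
```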

#Get the standard deviation
sd(Data$salary)

## [1] 30289.04

This is the standard deviation of salary, which is the square root of the variance.

#Get the Anova
aov(yrs.service ~ salary,data=Data)

## Call:
##    aov(formula = yrs.service ~ salary, data = Data)
## 
## Terms:
##                   salary Residuals
## Sum of Squares   7506.05  59479.98
## Deg. of Freedom        1       395
## 
## Residual standard error: 12.2712
## Estimated effects may be unbalanced

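An ANOVA test is used to study differences in means between two or more groups. Here the predictor, salary, is numeric rather than a grouping factor, so aov() fits the same model as the linear regression above, with one degree of freedom for salary.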
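
For comparison, this is what a one-way ANOVA looks like when the predictor is a grouping factor. The data frame `toy` and its three groups are made up for illustration:

```r
# Made-up data: three groups of four observations with clearly different means
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 4)),
  value = c(10, 11, 9, 10,  14, 15, 13, 14,  20, 21, 19, 20)
)

# One-way ANOVA: does the mean of value differ between groups?
fit <- aov(value ~ group, data = toy)

# summary() returns a list; its first element is the ANOVA table
tab <- summary(fit)[[1]]
print(tab)

# Degrees of freedom and p-value for the group term (first row of the table)
group_df <- tab[["Df"]][1]
p_value  <- tab[["Pr(>F)"]][1]
```

With k groups the group term has k - 1 degrees of freedom, which is why a factor with three levels gives 2 here while the numeric salary term above gives 1.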

#Find any outliers

boxplot(Data$salary,plot=FALSE)$out

## [1] 231545 204000 205500

outliers= boxplot(Data$salary,plot=FALSE)$out

Three salary values are flagged as outliers: 231545, 204000, and 205500.
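
The boxplot rule behind `$out` flags values more than 1.5 times the box height beyond the hinges. A minimal sketch on a made-up vector reproduces it with `fivenum()`, which returns the same hinges that `boxplot.stats()` uses (these can differ slightly from `quantile()` quartiles):

```r
# Made-up values with one obvious extreme point
x <- c(10, 12, 11, 13, 12, 11, 50)

# Tukey's five numbers: min, lower hinge, median, upper hinge, max
h <- fivenum(x)
box_height <- h[4] - h[2]

# Fences at 1.5 box heights beyond each hinge
lower_fence <- h[2] - 1.5 * box_height
upper_fence <- h[4] + 1.5 * box_height

# Values outside the fences
manual_out <- x[x < lower_fence | x > upper_fence]

# Same result from the function that boxplot() relies on
builtin_out <- boxplot.stats(x)$out

print(manual_out)
print(builtin_out)
```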

Question 2: Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)

#Create a new data frame with a subset of the columns and rows.  

Salaries<- data.frame (Data[c(1:100),c(2:7)])

#Renaming the column names for the new data frame.

names(Salaries) <- c("Position","Class","YrsSincePhD","WorkExp","Gender", "AnnualSalary")

#Use group by function to explore our data.

salary_group=Data %>% group_by (sex) %>% summarise(M=sum(salary))

Yrs.service_group=Data %>% group_by (yrs.service) %>% summarise(M=sum(salary))
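
A group total depends on how many rows each group has, so with 358 males and 39 females the sum of salary mostly reflects head count. A mean per group compares typical salaries instead. The sketch below uses base R `tapply()` on a made-up data frame (`toy` is an assumption) to show the difference:

```r
# Made-up data: one female and three males, all with similar salaries
toy <- data.frame(
  sex    = c("Female", "Male", "Male", "Male"),
  salary = c(100000, 90000, 110000, 100000)
)

# Totals are driven by group size
totals <- tapply(toy$salary, toy$sex, sum)

# Means are comparable across groups of different size
means <- tapply(toy$salary, toy$sex, mean)

# Rows per group
counts <- table(toy$sex)

print(totals)
print(means)
print(counts)
```

In this toy example the male total is three times the female total even though the average salary is identical.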

#Delete all outliers.
Data1=Data[!(Data$salary %in% outliers),]

Data1 is a new data frame without any outliers.
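
One detail worth knowing when dropping rows by a membership test: `-which(cond)` and `!cond` give the same result when something matches, but behave differently when nothing does. A minimal sketch on made-up data (`toy` and `no_outliers` are assumptions):

```r
# Made-up data with no outliers at all
toy <- data.frame(salary = c(90000, 100000, 110000))
no_outliers <- numeric(0)

# which() returns integer(0); negating it is still integer(0),
# and indexing with integer(0) selects zero rows
dropped_all <- toy[-which(toy$salary %in% no_outliers), , drop = FALSE]

# Logical negation keeps every row that is not an outlier
kept_all <- toy[!(toy$salary %in% no_outliers), , drop = FALSE]

print(nrow(dropped_all))
print(nrow(kept_all))
```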

lm(Data1$yrs.service ~ Data1$salary)

## 
## Call:
## lm(formula = Data1$yrs.service ~ Data1$salary)
## 
## Coefficients:
##  (Intercept)  Data1$salary  
##    1.0833161     0.0001456

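Running the linear model (lm) again without the outliers shows how much they influence the fit. The slope changes from 0.0001437 to 0.0001456 and the intercept from 1.27 to 1.08, so the three outliers have little effect.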

#Scaling
Data$yrs.service=scale(Data$yrs.service,center=TRUE,scale=TRUE)

Data$yrs.since.phd=scale(Data$yrs.since.phd,center=TRUE,scale=TRUE)
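
With `center=TRUE` and `scale=TRUE`, `scale()` subtracts the mean and divides by the standard deviation. It also returns a one-column matrix rather than a plain vector, which `as.numeric()` converts back. A minimal sketch on the six yrs.service values from the head() output:

```r
# The six yrs.service values shown in the head() output
x <- c(18, 16, 3, 39, 41, 6)

# scale() returns a matrix with centering and scaling attributes
scaled <- scale(x, center = TRUE, scale = TRUE)
print(is.matrix(scaled))

# The same standardisation written out by hand
manual <- (x - mean(x)) / sd(x)

# Plain numeric vector, convenient for storing in a data frame column
scaled_vec <- as.numeric(scaled)
print(scaled_vec)
```

After scaling, the values are in standard deviation units and no longer in years, which affects how axis labels on later plots should be read.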

Question 3: Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

#Set features as factors. 
Data$sex <- as.factor (Data$sex)
Data$rank <- as.factor (Data$rank)
Data$discipline <- as.factor (Data$discipline)

hist(Data$salary)

The histogram above shows the distribution of salary.

ggscatter(Data,x="yrs.service",y="salary")

The scatterplot shows the relationship between salary and years of service. Salary tends to rise with years of service, most visibly over roughly the first 30 years, although the Kendall correlation computed earlier (about 0.30) indicates only a modest association.

ggscatter(Data,x="yrs.since.phd",y="salary")

The scatterplot shows the relationship between salary and years since the PhD.

pie(salary_group$M,salary_group$sex)

The pie chart shows each gender's share of the total salary, using the sums computed in salary_group.

ggscatter(Data,x="yrs.service",y="salary", add = "reg.line", conf.int=TRUE, cor.coef = TRUE, cor.method= "kendall")

This scatter plot adds a regression line with its confidence band, plus the Kendall correlation coefficient and its p-value.

ggplot(Data,aes(x= sex, y=salary)) +
  theme_bw() +
  geom_boxplot()+
  labs(y ="salary",
      x= "sex",
       title = "Salary by Gender")

The boxplot displays the distribution of salary by gender. The two dots lie beyond the whiskers and are treated as outliers.

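ggplot(data=Data,aes(x=rank,fill=salary))+geom_bar()

This graph displays the number of faculty in each rank. The bar heights are counts, so it shows that Prof is by far the most common rank (266 of 397). It does not by itself show which rank receives the highest salary.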

ggplot(Data,aes(x= sex,fill= salary)) +
  theme_bw() +
  facet_wrap(~discipline) +
  geom_bar()+
  labs(y ="Freq",
       title = "Faculty Counts by Gender and Discipline")

This graph displays the number of faculty by gender within each discipline. Both females and males are more numerous in discipline B.

ggplot(Data,aes(x= rank,fill= salary)) +
  theme_bw() +
  facet_wrap(~sex) +
  geom_bar()+
  labs(y ="Freq",
       title = "Faculty Counts by Rank and Gender")

This graph displays the number of faculty in each rank, split by gender. Males outnumber females in every rank. The bar heights are counts, so the chart does not show salary directly.

Question 4: Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

#How is salary affected by the other features in the data? Find the relationship between salary and the other features.

The graphs above show that males are more numerous than females in each rank and discipline and receive higher salaries. In addition, Prof is the largest rank and receives higher salaries than AsstProf and AssocProf. Males in discipline B have higher salaries than males in discipline A. Years of service and years since the PhD was obtained also have an impact on salary. The graphs below explore the relationship between salary and other features further.

ggplot(data=Data,aes(x=yrs.since.phd,fill=salary))+geom_bar()

ggscatter(Data,x="yrs.service",y="salary", add = "reg.line")

Question 5: BONUS – place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

Salaries_Github<- read.csv(file="https://github.com/GehadGad/MSDS-R-Bridge-Final-Project/raw/master/Salaries.csv")