Gehad Gad
January 19, 2020
Winter Bridge
#Read data into R.
Data<- read.csv(file="https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/carData/Salaries.csv")
#Import libraries and/or Packages
if (!require(ggplot2)){
install.packages("ggplot2")
library(ggplot2)}
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.6.2
if (!require(dplyr)){
install.packages("dplyr")
library(dplyr)}
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 3.6.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
if (!require(caTools)){
install.packages("caTools")
library(caTools)}
## Loading required package: caTools
if (!require(ggpubr)){
install.packages("ggpubr")
library(ggpubr)}
## Loading required package: ggpubr
## Warning: package 'ggpubr' was built under R version 3.6.2
## Loading required package: magrittr
Question 1: Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
#check for NA values in the data.
sum(is.na(Data))
## [1] 0
There are no NA values in the data.
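When a data set does contain missing values, a per-column count is usually more informative than a single total. The sketch below illustrates this with a small made-up data frame (`toy` is hypothetical and is not the Salaries data).

```r
# Hypothetical toy data frame with two missing values (not the Salaries data)
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), stringsAsFactors = FALSE)

# Total number of NA values in the whole data frame
total_na <- sum(is.na(toy))

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

The same two calls can be applied to `Data` directly.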
#Display the head of the data.
head(Data)
## X rank discipline yrs.since.phd yrs.service sex salary
## 1 1 Prof B 19 18 Male 139750
## 2 2 Prof B 20 16 Male 173200
## 3 3 AsstProf B 4 3 Male 79750
## 4 4 Prof B 45 39 Male 115000
## 5 5 Prof B 40 41 Male 141500
## 6 6 AssocProf B 6 6 Male 97000
The head shows the first 6 rows to give an overall idea of the data.
#Get the data summary.
summary(Data)
## X rank discipline yrs.since.phd yrs.service
## Min. : 1 AssocProf: 64 A:181 Min. : 1.00 Min. : 0.00
## 1st Qu.:100 AsstProf : 67 B:216 1st Qu.:12.00 1st Qu.: 7.00
## Median :199 Prof :266 Median :21.00 Median :16.00
## Mean :199 Mean :22.31 Mean :17.61
## 3rd Qu.:298 3rd Qu.:32.00 3rd Qu.:27.00
## Max. :397 Max. :56.00 Max. :60.00
## sex salary
## Female: 39 Min. : 57800
## Male :358 1st Qu.: 91000
## Median :107300
## Mean :113706
## 3rd Qu.:134185
## Max. :231545
The summary displays useful information about the data. Besides the row index X, there are 6 columns (rank, discipline, yrs.since.phd, yrs.service, sex, and salary). rank has three levels (AssocProf, AsstProf, and Prof), discipline has two (A and B), and sex has two (39 Female and 358 Male). yrs.since.phd ranges from 1 to 56, yrs.service from 0 to 60, and salary from 57800 to 231545. The mean salary (113706) is higher than the median (107300), which suggests that the salary distribution is skewed to the right.
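The quartiles reported by summary() can also be requested one variable at a time. The following is a minimal sketch on a made-up vector (`x` is hypothetical, not a column of the Salaries data).

```r
# Hypothetical toy vector (not the Salaries data)
x <- c(2, 4, 6, 8, 10)

# Minimum, quartiles, and maximum, as reported by summary()
q <- quantile(x)
print(q)

# Interquartile range: third quartile minus first quartile
iqr_x <- IQR(x)
print(iqr_x)
```

For the real data, `quantile(Data$salary)` would return the same quartiles that appear in the summary table.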
mean(Data$salary)
## [1] 113706.5
median(Data$salary)
## [1] 107300
mean(Data$yrs.service)
## [1] 17.61461
#Find the correlation between salary and yrs.service
cor(Data$yrs.service, Data$salary, method= c("kendall"))
## [1] 0.3048343
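Correlation measures how strongly two variables are related. The coefficient ranges from -1.0 to +1.0; the closer it is to +1 or -1, the more closely the two variables are related. Here Kendall's tau is about 0.30, which indicates a moderate positive association between years of service and salary.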
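The choice of method matters because Kendall and Spearman correlations are based on ranks, while Pearson correlation measures linear association. The sketch below compares the three on made-up data where `y` always increases with `x` but not along a straight line (both vectors are hypothetical).

```r
# Hypothetical toy data: y increases with x, but not linearly
x <- 1:6
y <- x^2

# Pearson measures linear association
pearson <- cor(x, y, method = "pearson")

# Spearman and Kendall are rank-based, so they only look at the ordering
spearman <- cor(x, y, method = "spearman")
kendall <- cor(x, y, method = "kendall")

print(c(pearson = pearson, spearman = spearman, kendall = kendall))
```

Both rank-based coefficients equal 1 here because the ordering is perfectly preserved, while the Pearson coefficient is slightly below 1.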
#Find the correlation between salary and yrs.since.phd
cor(Data$yrs.since.phd, Data$salary, method= c("kendall"))
## [1] 0.3410184
lm(Data$yrs.service ~ Data$salary)
##
## Call:
## lm(formula = Data$yrs.service ~ Data$salary)
##
## Coefficients:
## (Intercept) Data$salary
## 1.2706280 0.0001437
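In a linear model (lm), the two coefficients are the intercept and the slope. In this call yrs.service is the response and salary is the predictor, so the slope of 0.0001437 means that each additional dollar of salary is associated with about 0.00014 more years of service (roughly 1.4 years per 10000 dollars).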
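The roles of the two sides of the formula can be checked on data where the answer is known in advance. This is a minimal sketch with a made-up data frame (`toy`, `x`, and `y` are hypothetical), built so that y = 1 + 2x exactly.

```r
# Hypothetical toy data constructed so that y = 1 + 2 * x exactly
toy <- data.frame(x = 1:5)
toy$y <- 1 + 2 * toy$x

# Response on the left of ~, predictor on the right
fit <- lm(y ~ x, data = toy)

# Extract the intercept and the slope
coefs <- coef(fit)
print(coefs)
```

Passing `data = toy` keeps the coefficient names short (`x` instead of `toy$x`). To model salary as a function of years of service, the formula would be written the other way round, as `salary ~ yrs.service`.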
#Get the Variance. The formula for Variance is: 1/(n-1)*sum((x-mean)**2)
var(Data$salary)
## [1] 917425865
This is the variance of salary.
#Get the standard deviation
sd(Data$salary)
## [1] 30289.04
This is the standard deviation of salary, which is the square root of the variance.
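The variance formula quoted in the comment above can be checked by hand. The sketch below does this on a made-up vector (`x` is hypothetical) and compares the result with var() and sd().

```r
# Hypothetical toy vector (not the Salaries data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance: 1/(n-1) * sum((x - mean)^2)
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

print(c(manual_var = manual_var, builtin_var = var(x)))
print(c(manual_sd = manual_sd, builtin_sd = sd(x)))
```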
#Get the Anova
aov(yrs.service ~ salary,data=Data)
## Call:
## aov(formula = yrs.service ~ salary, data = Data)
##
## Terms:
## salary Residuals
## Sum of Squares 7506.05 59479.98
## Deg. of Freedom 1 395
##
## Residual standard error: 12.2712
## Estimated effects may be unbalanced
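ANOVA is used to study differences between the means of two or more groups. Because salary is numeric here, this call fits the same model as the linear regression above and splits the variation in yrs.service into the part explained by salary (7506.05) and the residual part (59479.98).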
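ANOVA is more commonly run with a categorical grouping variable on the right-hand side. The following is a minimal sketch on a made-up data frame (`toy`, `group`, and `value` are hypothetical) with three groups of three observations.

```r
# Hypothetical toy data: three groups with means 2, 5, and 8
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 3)),
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)

# One-way ANOVA: does the mean of value differ between groups?
fit <- aov(value ~ group, data = toy)

# summary() returns the ANOVA table with the F statistic and p-value
anova_table <- summary(fit)[[1]]
print(anova_table)

# F statistic for the group term (first row of the table)
f_value <- anova_table[["F value"]][1]
```

For the Salaries data, a comparable call would be `aov(salary ~ rank, data = Data)`, which asks whether mean salary differs between ranks.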
#Find any outliers
boxplot(Data$salary,plot=FALSE)$out
## [1] 231545 204000 205500
outliers= boxplot(Data$salary,plot=FALSE)$out
The boxplot rule flags three salaries as outliers.
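The values returned in `$out` are those lying more than 1.5 times the box length beyond the box, which is the default rule of boxplot(). The sketch below reproduces the rule by hand on a made-up vector (`x` is hypothetical) and compares it with boxplot.stats().

```r
# Hypothetical toy vector with one extreme value
x <- c(10, 12, 11, 13, 12, 11, 50)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
box_length <- five[4] - five[2]

# Fences at 1.5 box lengths below the lower hinge and above the upper hinge
lower_fence <- five[2] - 1.5 * box_length
upper_fence <- five[4] + 1.5 * box_length

# Values outside the fences are flagged
flagged <- x[x < lower_fence | x > upper_fence]
print(flagged)

# boxplot.stats() applies the same rule
print(boxplot.stats(x)$out)
```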
Question 2: Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)
#Create a new data frame with a subset of the columns and rows.
Salaries<- data.frame (Data[c(1:100),c(2:7)])
#Renaming the column names for the new data frame.
names(Salaries) <- c("Position","Class","YrsSincePhD","WorkExp","Gender", "Annualsalary")
#Use group_by() to total the salary by sex and by years of service.
salary_group=Data %>% group_by (sex) %>% summarise(M=sum(salary))
Yrs.service_group=Data %>% group_by (yrs.service) %>% summarise(M=sum(salary))
#Delete all outliers.
Data1=Data[!(Data$salary %in% outliers),]
Data1 is a new data frame without any outliers.
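Removing rows with a logical condition is safer than using a negated which(), because the two behave differently when nothing matches. The sketch below shows the difference on a made-up data frame (`toy` and `no_outliers` are hypothetical).

```r
# Hypothetical toy data frame and an empty set of outliers
toy <- data.frame(id = 1:3, salary = c(100, 200, 300))
no_outliers <- numeric(0)

# which() returns integer(0), and negative indexing with it selects no rows
dropped_everything <- toy[-which(toy$salary %in% no_outliers), ]

# A logical index keeps every row when nothing matches
kept_everything <- toy[!(toy$salary %in% no_outliers), ]

print(nrow(dropped_everything))
print(nrow(kept_everything))
```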
lm(Data1$yrs.service ~ Data1$salary)
##
## Call:
## lm(formula = Data1$yrs.service ~ Data1$salary)
##
## Coefficients:
## (Intercept) Data1$salary
## 1.0833161 0.0001456
Refitting the linear model (lm) without the outliers changes the coefficients only slightly: the intercept moves from 1.2706 to 1.0833 and the slope from 0.0001437 to 0.0001456.
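#Scaling: standardize to mean 0 and standard deviation 1.
#as.numeric() drops the matrix attributes that scale() adds.
Data$yrs.service=as.numeric(scale(Data$yrs.service,center=TRUE,scale=TRUE))
Data$yrs.since.phd=as.numeric(scale(Data$yrs.since.phd,center=TRUE,scale=TRUE))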
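The effect of scale() can be seen on a small made-up vector (`x` is hypothetical). This sketch shows that the result is a one-column matrix holding z-scores, and that as.numeric() turns it back into a plain vector.

```r
# Hypothetical toy vector with mean 20 and standard deviation 10
x <- c(10, 20, 30)

# scale() subtracts the mean and divides by the standard deviation
scaled <- scale(x, center = TRUE, scale = TRUE)

# The result is a one-column matrix that also stores the center and scale used
print(is.matrix(scaled))
print(attr(scaled, "scaled:center"))
print(attr(scaled, "scaled:scale"))

# as.numeric() keeps only the z-scores
z <- as.numeric(scaled)
print(z)
```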
Question 3: Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
#Set features as factors.
Data$sex <- as.factor (Data$sex)
Data$rank <- as.factor (Data$rank)
Data$discipline <- as.factor (Data$discipline)
hist(Data$salary)
The histogram above shows the distribution of salary.
ggscatter(Data,x="yrs.service",y="salary")
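The scatterplot shows the relationship between salary and years of service. Salary tends to rise with years of service, most visibly over roughly the first 30 years, but the Kendall correlation of about 0.30 computed earlier indicates a moderate rather than a strong relationship. Because yrs.service was standardized above, the x-axis is in standard-deviation units.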
ggscatter(Data,x="yrs.since.phd",y="salary")
The scatterplot shows the relationship between salary and years since the PhD was obtained.
pie(salary_group$M,salary_group$sex)
The pie chart shows each sex's share of the total salary.
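Because the two groups differ greatly in size (358 males and 39 females in the summary above), a total mostly reflects headcount, and a mean per group is easier to compare. The sketch below illustrates the difference with base R on a made-up data frame (`toy` is hypothetical); base R is used here only to keep the example free of package dependencies.

```r
# Hypothetical toy data: one female and three males with the same average salary
toy <- data.frame(
  sex = c("Female", "Male", "Male", "Male"),
  salary = c(100, 90, 110, 100)
)

# Totals differ because the groups differ in size
totals <- tapply(toy$salary, toy$sex, sum)
print(totals)

# Means are identical in this toy example
means <- tapply(toy$salary, toy$sex, mean)
print(means)
```

With dplyr, the equivalent of the second call would replace `sum(salary)` with `mean(salary)` inside `summarise()`.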
ggscatter(Data,x="yrs.service",y="salary", add = "reg.line", conf.int=TRUE, cor.coef = TRUE, cor.method= "kendall")
This scatter plot adds a regression line with its confidence band and displays the Kendall correlation coefficient and its p-value.
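The coefficient and p-value printed on the plot can also be obtained as numbers with cor.test(). This is a minimal sketch on made-up vectors (`x` and `y` are hypothetical), in which neighbouring pairs of values are swapped so that the ordering is mostly, but not fully, preserved.

```r
# Hypothetical toy data: y follows x except that neighbouring pairs are swapped
x <- 1:8
y <- c(2, 1, 4, 3, 6, 5, 8, 7)

# Kendall rank correlation test
result <- cor.test(x, y, method = "kendall")

# The estimate is Kendall's tau; p.value tests the hypothesis that tau is zero
tau <- unname(result$estimate)
p_value <- result$p.value

print(tau)
print(p_value)
```

Here 24 of the 28 pairs of points are ordered the same way in `x` and `y` and 4 are not, so tau is (24 - 4) / 28.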
ggplot(Data,aes(x= sex, y=salary)) +
  theme_bw() +
  geom_boxplot()+
  labs(y ="salary",
       x= "sex",
       title = "Salary by Gender")
The boxplot displays the distribution of salary by gender. The two dots are points that the boxplot rule flags as outliers.
ggplot(data=Data,aes(x=rank,y=salary))+geom_boxplot()
This graph displays the distribution of salary by rank. We can see that the Prof rank receives the highest salaries.
ggplot(Data,aes(x= sex)) +
  theme_bw() +
  facet_wrap(~discipline) +
  geom_bar()+
  labs(y ="Freq",
       title = "Number of Faculty by Gender within Each Discipline")
This graph displays the number of faculty by gender within each discipline. geom_bar() counts rows, so the bar heights are frequencies rather than salaries. There are more females and more males in discipline B than in discipline A.
ggplot(Data,aes(x= rank)) +
  theme_bw() +
  facet_wrap(~sex) +
  geom_bar()+
  labs(y ="Freq",
       title = "Number of Faculty by Rank within Each Gender")
This graph displays the number of faculty at each rank, split by gender. Males outnumber females at every rank.
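The counts behind a faceted bar chart of two categorical variables can be listed as a cross-tabulation. The sketch below uses a made-up data frame (`toy` is hypothetical) to show the idea with table().

```r
# Hypothetical toy data with two categorical columns
toy <- data.frame(
  sex = c("Female", "Male", "Male", "Male", "Female", "Male"),
  rank = c("Prof", "Prof", "AsstProf", "Prof", "AsstProf", "AssocProf")
)

# Rows are sexes, columns are ranks, cells are counts
counts <- table(toy$sex, toy$rank)
print(counts)
```

For the real data, `table(Data$sex, Data$rank)` would give the bar heights as numbers.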
Question 4: Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.
#How is salary affected by the other features in the data? Find the relationship between salary and the other features.
The graphs above show that males outnumber females in every rank and discipline, and the boxplot by gender suggests that males tend to receive higher salaries than females. In addition, professors are the largest group and receive higher salaries than AsstProf and AssocProf. Both genders are more numerous in discipline B than in discipline A. Years of service and years since the PhD was obtained are also positively associated with salary (Kendall's tau of about 0.30 and 0.34). The graphs below explore the relationship between salary and other features further.
ggplot(data=Data,aes(x=yrs.since.phd,y=salary))+geom_point()+geom_smooth(method="lm")
ggscatter(Data,x="yrs.service",y="salary", add = "reg.line")
Question 5: BONUS – place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.
Salaries_Github<- read.csv(file="https://github.com/GehadGad/MSDS-R-Bridge-Final-Project/raw/master/Salaries.csv")
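Links to raw files on GitHub follow a regular pattern, so they can be assembled from the repository details. The helper below is a hypothetical convenience function (`github_raw_url` is not part of any package), shown here as a sketch with the Rdatasets file used at the top of this report.

```r
# Hypothetical helper: build the raw-file URL for a file in a GitHub repository
github_raw_url <- function(user, repo, branch, path) {
  paste("https://raw.githubusercontent.com", user, repo, branch, path, sep = "/")
}

# URL of the Salaries file in the Rdatasets repository
url <- github_raw_url("vincentarelbundock", "Rdatasets", "master",
                      "csv/carData/Salaries.csv")
print(url)

# Reading the file needs network access, so the call is left commented out:
# Salaries_raw <- read.csv(file = url)
```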