library(RCurl)
library(tidyverse)
url <- getURL("https://rawgit.com/nschettini/CUNY-MSDS-Bridge-R/master/Salaries.csv")
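# stringsAsFactors = TRUE keeps rank, discipline and sex as factors (no longer the default in R >= 4.0),
# so summary() tabulates them as shown below
salaries <- read.csv(text = url, stringsAsFactors = TRUE)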
head(salaries)
## X rank discipline yrs.since.phd yrs.service sex salary
## 1 1 Prof B 19 18 Male 139750
## 2 2 Prof B 20 16 Male 173200
## 3 3 AsstProf B 4 3 Male 79750
## 4 4 Prof B 45 39 Male 115000
## 5 5 Prof B 40 41 Male 141500
## 6 6 AssocProf B 6 6 Male 97000
summary(salaries)
##        X              rank     discipline yrs.since.phd    yrs.service
##  Min.   :  1   AssocProf: 64   A:181      Min.   : 1.00   Min.   : 0.00
##  1st Qu.:100   AsstProf : 67   B:216      1st Qu.:12.00   1st Qu.: 7.00
##  Median :199   Prof     :266              Median :21.00   Median :16.00
##  Mean   :199                              Mean   :22.31   Mean   :17.61
##  3rd Qu.:298                              3rd Qu.:32.00   3rd Qu.:27.00
##  Max.   :397                              Max.   :56.00   Max.   :60.00
##      sex          salary
##  Female: 39   Min.   : 57800
##  Male  :358   1st Qu.: 91000
##               Median :107300
##               Mean   :113706
##               3rd Qu.:134185
##               Max.   :231545
cat("The mean of all salaries is", mean(salaries$salary))
## The mean of all salaries is 113706.5
cat("The median of all salaries is", median(salaries$salary))
## The median of all salaries is 107300
Median salary by gender and rank.
aggregate(salary ~ (sex + rank), salaries, median)
## sex rank salary
## 1 Female AssocProf 90556.5
## 2 Male AssocProf 95626.5
## 3 Female AsstProf 77000.0
## 4 Male AsstProf 80182.0
## 5 Female Prof 120257.5
## 6 Male Prof 123996.0
Median salary by gender, then by rank.
aggregate(salary ~ sex , salaries, median)
## sex salary
## 1 Female 103750
## 2 Male 108043
aggregate(salary ~ rank, salaries, median)
## rank salary
## 1 AssocProf 95626.5
## 2 AsstProf 79800.0
## 3 Prof 123321.5
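According to the medians above, Professors have the highest salary (123,321.5), followed by Associate Professors (95,626.5), then Assistant Professors (79,800). The data also show that males have a higher median salary than females, both overall (108,043 vs. 103,750) and within each rank.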
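The size of the male/female gap within each rank can be made explicit. The following is a minimal sketch that re-enters the six medians from the aggregate() output above instead of reloading the data; the names medians, wide, gap and gap_pct are introduced here for illustration only.

```r
# Medians copied from the aggregate(salary ~ (sex + rank), ...) output above
medians <- data.frame(
  sex    = c("Female", "Male", "Female", "Male", "Female", "Male"),
  rank   = c("AssocProf", "AssocProf", "AsstProf", "AsstProf", "Prof", "Prof"),
  salary = c(90556.5, 95626.5, 77000.0, 80182.0, 120257.5, 123996.0)
)

# Reshape to one row per rank, with one salary column per sex
wide <- reshape(medians, idvar = "rank", timevar = "sex", direction = "wide")

# Absolute gap and gap as a percentage of the male median
wide$gap     <- wide$salary.Male - wide$salary.Female
wide$gap_pct <- round(100 * wide$gap / wide$salary.Male, 1)
print(wide)
```

These are differences between medians only; they do not account for discipline or years of service.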
rename1 <- rename(salaries, gender = sex, title = rank, total_salary = salary, years.since.phd = yrs.since.phd, years.service = yrs.service)
head(rename1)
## X title discipline years.since.phd years.service gender total_salary
## 1 1 Prof B 19 18 Male 139750
## 2 2 Prof B 20 16 Male 173200
## 3 3 AsstProf B 4 3 Male 79750
## 4 4 Prof B 45 39 Male 115000
## 5 5 Prof B 40 41 Male 141500
## 6 6 AssocProf B 6 6 Male 97000
total_males <- subset(salaries, sex == "Male")
count(total_males)
## # A tibble: 1 x 1
## n
## <int>
## 1 358
total_females <- subset(salaries, sex == "Female")
count(total_females)
## # A tibble: 1 x 1
## n
## <int>
## 1 39
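The two counts are easier to compare as shares of the whole dataset. This is a small sketch that uses the counts copied from the output above; counts and share are illustrative names. On the full data frame, table(salaries$sex) should give both counts in a single call.

```r
# Counts copied from the count() output above
counts <- c(Male = 358, Female = 39)

# Share of each group as a percentage of all 397 rows
share <- round(100 * counts / sum(counts), 1)
print(share)
```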
Given this imbalance (358 males vs. 39 females), it is worth plotting the number of males and females at this college, broken down by rank:
ggplot(salaries, aes(x= sex)) + geom_bar(aes(fill=rank)) +
ggtitle("# of Female vs. Male Professors, by rank") +
xlab("Female vs. Male") +
ylab("Count")
The graph above shows that males far outnumber females in this dataset (358 vs. 39).
ggplot(salaries, aes(x = salary, fill = rank)) + geom_histogram(bins = 10) +
theme_dark() +
xlab("Salaries") +
ylab("Freq. of Salaries") +
ggtitle("Histogram of Salaries for Professors") +
geom_vline(xintercept=median(salaries$salary), col='yellow')+
geom_vline(xintercept=mean(salaries$salary), col='orange')
This histogram shows the distribution of salaries across the dataset, with the median marked in yellow and the mean marked in orange. The mean (113,706.5) sits above the median (107,300), which is consistent with a right-skewed distribution.
ggplot(salaries, aes(y = salary, x = rank)) + geom_boxplot(aes(fill=factor(rank))) +
theme_dark() +
xlab("Rank") +
ylab("Salary") +
ggtitle("Median Salary for Different Ranks")
This boxplot shows the salary distribution for each of the three ranks. It shows visually that Professors earn the most, followed by Associate Professors, and lastly Assistant Professors.
ggplot(salaries, aes(x= yrs.service, y= salary)) +
geom_point(aes(color = rank, shape = sex), size = 4) +
theme_dark() +
xlab("Years of Service") +
ylab("Salary") +
ggtitle("Years of Service vs. Salary") +
geom_smooth(method=lm)
ggplot(salaries, aes(x= yrs.since.phd, y= salary)) +
geom_point(aes(color = rank), size = 4) +
theme_dark()+
xlab("Years since PhD was obtained") +
ylab("Salary") +
ggtitle("Years since PhD Obtained vs. Salary") +
geom_smooth(method=lm)
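The line drawn by geom_smooth(method=lm) is a least-squares fit, and the same kind of fit can be inspected numerically with lm(). The sketch below is only an illustration: it uses the six rows printed by head(salaries) above, so its coefficients say nothing reliable about the full dataset. The name first_six is hypothetical; with the data loaded, salaries could be passed instead.

```r
# Illustrative subset: the six rows printed by head(salaries) above
first_six <- data.frame(
  yrs.since.phd = c(19, 20, 4, 45, 40, 6),
  yrs.service   = c(18, 16, 3, 39, 41, 6),
  salary        = c(139750, 173200, 79750, 115000, 141500, 97000)
)

# Least-squares line of salary on years of service
fit <- lm(salary ~ yrs.service, data = first_six)
print(coef(fit))  # intercept and slope (salary change per year of service)

# Pearson correlation between years since PhD and salary
print(cor(first_six$yrs.since.phd, first_six$salary))
```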