Final Data

Data used for answering the questions:

Salaries <- read.csv("~/Uni/SS 2015/R/Salaries.csv")

Section 1: Dataset Description

How did you obtain the dataset?

I searched for R datasets on Google and found the following website: https://vincentarelbundock.github.io/Rdatasets/datasets.html There I downloaded the “Salaries” dataset as a CSV file and imported it into R.

How were the data originally collected?

The data were originally collected at a US college for the 2008-09 academic year. They record the nine-month academic salary of Assistant Professors, Associate Professors and Professors. The original goal of the data collection was to determine whether there were any salary differences between female and male faculty members.

How many rows and columns are in the dataset?

The dataset consists of 397 rows and 7 columns.
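
The row and column counts can be checked directly in R. The sketch below rebuilds only the first six rows that head(Salaries) prints in Section 3 as a small stand-in (the name Salaries.sample is hypothetical), so it reports 6 rows rather than 397. On the full data, the same calls should give the figures stated above.

```r
# Stand-in for the full dataset: the first six rows printed by head(Salaries).
# Salaries.sample is a hypothetical name used only for this illustration.
Salaries.sample <- data.frame(
  X = 1:6,
  rank = c("Prof", "Prof", "AsstProf", "Prof", "Prof", "AssocProf"),
  discipline = rep("B", 6),
  yrs.since.phd = c(19, 20, 4, 45, 40, 6),
  yrs.service = c(18, 16, 3, 39, 41, 6),
  sex = rep("Male", 6),
  salary = c(139750, 173200, 79750, 115000, 141500, 97000)
)

dim(Salaries.sample)    # number of rows, then number of columns
names(Salaries.sample)  # column names
str(Salaries.sample)    # type and first values of every column
```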

What are the columns in the dataset? For each column, give the variable name and a brief description of what it represents.

The names of the columns are “X”, “rank”, “discipline”, “yrs.since.phd”, “yrs.service”, “sex” and “salary”. The variable “X” is a running number that identifies an individual faculty member. “rank” is a factor with 3 levels. It shows whether the person is an Assistant Professor (“AsstProf”), an Associate Professor (“AssocProf”) or a Professor (“Prof”). The variable “discipline” is a factor with 2 levels (A and B). “A” means the person works in a “theoretical” department. “B” means the person works in an “applied” department. “yrs.since.phd” shows the number of years since the person obtained their PhD. “yrs.service” shows the number of years the person has been working in his or her profession. “sex” is a factor with the levels “Male” and “Female”. The variable “salary” shows the nine-month salary in US dollars.

Section 2: Questions

What are the mean, median and standard deviation of the salaries at the college?
How are the salaries distributed?
Is there a difference in salary between men and women?
Is there a relationship between salary in euros and the number of years a person has been in service, or the number of years since the person obtained their PhD?
What is the mean salary for Assistant Professors, Associate Professors and Professors respectively?

Section 3: Analysis

Recode the values of at least one column using indexing and reassignment.

# Convert the dollar salaries to euros at a fixed rate of 0.9 euros per dollar
Salaries$salary.euro <- Salaries$salary * 0.9

head(Salaries)

##   X      rank discipline yrs.since.phd yrs.service  sex salary salary.euro
## 1 1      Prof          B            19          18 Male 139750      125775
## 2 2      Prof          B            20          16 Male 173200      155880
## 3 3  AsstProf          B             4           3 Male  79750       71775
## 4 4      Prof          B            45          39 Male 115000      103500
## 5 5      Prof          B            40          41 Male 141500      127350
## 6 6 AssocProf          B             6           6 Male  97000       87300
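
The conversion above creates a new column by arithmetic. Recoding by indexing and reassignment in the narrower sense could look like the following sketch, which replaces the discipline codes with the labels given in the dataset description. The toy vector and the name discipline.label are assumptions for illustration.

```r
# Toy discipline codes standing in for Salaries$discipline (hypothetical values)
discipline <- c("B", "B", "A", "B", "A", "A")

# Start from a copy, then overwrite selected elements via logical indexing
discipline.label <- discipline
discipline.label[discipline == "A"] <- "theoretical"
discipline.label[discipline == "B"] <- "applied"

table(discipline.label)
```

If the column is stored as a factor rather than as character strings, it would need to be converted with as.character() first, because new values cannot be assigned to a factor unless they are already among its levels.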

Calculate at least one standard deviation using sd(), one mean using mean() and one median using median().

sd(Salaries$salary)

## [1] 30289.04

mean(Salaries$salary)

## [1] 113706.5

median(Salaries$salary)

## [1] 107300

Calculate at least one t-test using t.test().

salary.male <- subset(Salaries, subset = sex == "Male")$salary
salary.female <- subset(Salaries, subset = sex == "Female")$salary
test.result1 <- t.test(x = salary.male, 
y = salary.female, 
alternative = "two.sided" 
)
test.result1

## 
##  Welch Two Sample t-test
## 
## data:  salary.male and salary.female
## t = 3.1615, df = 50.122, p-value = 0.002664
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   5138.102 23037.916
## sample estimates:
## mean of x mean of y 
##  115090.4  101002.4
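
The output is labelled "Welch Two Sample t-test" because t.test() does not assume equal variances by default. As a rough illustration of what that means, the sketch below computes the Welch statistic and its approximate degrees of freedom by hand and compares them with t.test(). The two vectors are hypothetical toy values, not taken from the dataset.

```r
# Toy salary vectors (hypothetical values, not taken from the dataset)
group.a <- c(100, 120, 130, 150, 160)
group.b <- c(90, 95, 110, 115)

# Squared standard error of each group mean (variances are not pooled)
va <- var(group.a) / length(group.a)
vb <- var(group.b) / length(group.b)

# Welch statistic: difference in means over the unpooled standard error
t.manual <- (mean(group.a) - mean(group.b)) / sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom
df.manual <- (va + vb)^2 /
  (va^2 / (length(group.a) - 1) + vb^2 / (length(group.b) - 1))

result <- t.test(x = group.a, y = group.b)  # var.equal = FALSE is the default
c(manual = t.manual, t.test = unname(result$statistic))
c(manual = df.manual, t.test = unname(result$parameter))
```

The non-integer degrees of freedom in the output above (df = 50.122) come from this approximation.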

Calculate at least one correlation test using cor.test().

test.result2 <- cor.test(x = Salaries$yrs.service,
y = Salaries$salary.euro
)
test.result2

## 
##  Pearson's product-moment correlation
## 
## data:  Salaries$yrs.service and Salaries$salary.euro
## t = 7.0602, df = 395, p-value = 7.529e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2443740 0.4193506
## sample estimates:
##       cor 
## 0.3347447
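
Because the euro column is the dollar column multiplied by a positive constant, the Pearson correlation with years of service should be the same in either currency. The sketch below checks this on the six rows printed by head(Salaries); the variable names are chosen for this illustration only.

```r
# Years of service and dollar salaries from the six rows shown by head(Salaries)
years <- c(18, 16, 3, 39, 41, 6)
salary.usd <- c(139750, 173200, 79750, 115000, 141500, 97000)
salary.eur <- salary.usd * 0.9  # same conversion factor as in the report

r.usd <- cor(years, salary.usd)
r.eur <- cor(years, salary.eur)

# TRUE if both correlations agree up to numerical tolerance
all.equal(r.usd, r.eur)
```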

Calculate at least one regression analysis using lm() or glm().

Salaries.lm <- lm(salary.euro ~ yrs.since.phd,
data = Salaries)
summary(Salaries.lm)

## 
## Call:
## lm(formula = salary.euro ~ yrs.since.phd, data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -75754 -17489  -2572  14478  92145 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   82546.82    2489.21  33.162   <2e-16 ***
## yrs.since.phd   886.81      96.63   9.177   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24780 on 395 degrees of freedom
## Multiple R-squared:  0.1758, Adjusted R-squared:  0.1737 
## F-statistic: 84.23 on 1 and 395 DF,  p-value: < 2.2e-16
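
To read the coefficients: the fitted line predicts salary.euro as the intercept plus the slope times the years since the PhD. A minimal worked example using the two estimates printed above (the variable names are chosen for this illustration):

```r
# Coefficients copied from the summary output above
intercept <- 82546.82
slope <- 886.81

# Predicted nine-month salary in euros at 0, 10 and 20 years since the PhD
yrs <- c(0, 10, 20)
predicted.euro <- intercept + slope * yrs
predicted.euro
```

Under this model, a faculty member 10 years past the PhD is predicted to earn about 91 415 euros, and each further decade adds about 8 868 euros.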

Create at least one scatterplot containing data from two different groups (e.g., a set of green points and a set of red points representing different groups), with added regression lines using abline().

plot(x = 1, y = 1, xlab = "Time", ylab = "Salary in Euro",
type = "n", main = "The Effect of Time since phd and Years of Service on Salary in Euro",
xlim = c(0, 60), ylim = c(0, 210000))

points(x = Salaries$yrs.since.phd, y = Salaries$salary.euro, pch = 16, cex = 0.7, col = "red")

points(x = Salaries$yrs.service, y = Salaries$salary.euro, pch = 16, cex = 0.7, col = "blue")


summary(lm(formula = salary.euro ~ yrs.since.phd, data = Salaries))

## 
## Call:
## lm(formula = salary.euro ~ yrs.since.phd, data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -75754 -17489  -2572  14478  92145 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   82546.82    2489.21  33.162   <2e-16 ***
## yrs.since.phd   886.81      96.63   9.177   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24780 on 395 degrees of freedom
## Multiple R-squared:  0.1758, Adjusted R-squared:  0.1737 
## F-statistic: 84.23 on 1 and 395 DF,  p-value: < 2.2e-16

summary(lm(formula = salary.euro ~ yrs.service, data = Salaries))

## 
## Call:
## lm(formula = salary.euro ~ yrs.service, data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -73739 -18460  -3399  14775  91752 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 89977.19    2174.94   41.37  < 2e-16 ***
## yrs.service   701.61      99.38    7.06 7.53e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25720 on 395 degrees of freedom
## Multiple R-squared:  0.1121, Adjusted R-squared:  0.1098 
## F-statistic: 49.85 on 1 and 395 DF,  p-value: 7.529e-12

abline(a = 82546.82, b = 886.81, lwd = 3, col = "red")

abline(a = 89977.19, b = 701.61, lwd = 3, col = "blue")

legend("bottomright",
legend = c("years since phd", "years service"),
col = c('red', 'blue'),
pch = c(16, 16),
bg = "white"
)
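
Typing the coefficients into abline() by hand works, but the numbers can drift out of sync with the model. As an alternative sketch, abline() also accepts a fitted lm object and reads the intercept and slope from it. The toy data below are hypothetical and lie exactly on a known line so the result is easy to check.

```r
# Toy data lying exactly on y = 2 + 3x (hypothetical values)
x <- c(1, 2, 3, 4, 5)
y <- 2 + 3 * x

fit <- lm(y ~ x)
coef(fit)  # intercept first, then slope

pdf(NULL)  # null graphics device so the sketch also runs without a display
plot(x, y, pch = 16)
abline(fit, lwd = 3, col = "red")  # abline() takes both coefficients from the model
dev.off()
```

In an interactive session the pdf(NULL) and dev.off() lines can be dropped. For this report, the equivalent call would pass the lm() fit on salary.euro to abline() instead of the copied numbers.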

Create at least one histogram using hist() with additional reference lines showing the mean and median of the group.

hist(x = Salaries$salary,
main = "Salary Distribution",
xlab = "Salary"
)

abline(v = mean(Salaries$salary), lwd = 2, col = "blue")
abline(v = median(Salaries$salary), lwd = 2, col = "red")

legend("topright",
       legend = c("Mean", "Median"),
       col = c("blue", "red"),
       lty = c(1, 1),  # line symbols, matching the vertical reference lines
       lwd = 2,
       bg = "white")

Use the aggregate function or dplyr to calculate descriptive statistics across groups of data.

# The formula is passed unnamed because the name of this argument differs between R versions
aggregate(salary ~ rank,
          data = Salaries,
          FUN = mean,
          na.rm = TRUE)

##        rank    salary
## 1 AssocProf  93876.44
## 2  AsstProf  80775.99
## 3      Prof 126772.11
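
The groups above appear in alphabetical order (AssocProf, AsstProf, Prof). If the career order is preferred, one option is to set the factor levels explicitly before aggregating. The sketch below uses the ranks and salaries from the six rows printed by head(Salaries); the names toy and rank.means are hypothetical.

```r
# Ranks and salaries from the first six rows printed by head(Salaries)
toy <- data.frame(
  rank = c("Prof", "Prof", "AsstProf", "Prof", "Prof", "AssocProf"),
  salary = c(139750, 173200, 79750, 115000, 141500, 97000)
)

# Set the level order explicitly so that groups follow the career ladder
toy$rank <- factor(toy$rank, levels = c("AsstProf", "AssocProf", "Prof"))

rank.means <- aggregate(salary ~ rank, data = toy, FUN = mean)
rank.means
```

Because the sample has only six rows, the means differ from those of the full dataset.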

Use par(mfrow = c(x, y)) to put two or more plots next to each other.

par(mfrow = c(1, 2))


plot(x = 1, y = 1, xlab = "Time", ylab = "Salary in Euro",
type = "n", main = "The Effect of Time since phd and Years of Service on Salary in Euro",
xlim = c(0, 60), ylim = c(0, 210000))

points(x = Salaries$yrs.since.phd, y = Salaries$salary.euro, pch = 16, cex = 0.7, col = "red")

points(x = Salaries$yrs.service, y = Salaries$salary.euro, pch = 16, cex = 0.7, col = "blue")

# Coefficients from the lm() fits on salary.euro, matching the euro-scaled y-axis
abline(a = 82546.82, b = 886.81, lwd = 3, col = "red")

abline(a = 89977.19, b = 701.61, lwd = 3, col = "blue")



hist(x = Salaries$salary,
main = "Salary Distribution",
xlab = "Salary"
)

abline(v = mean(Salaries$salary), lwd = 2, col = "blue")

abline(v = median(Salaries$salary), lwd = 2, col = "red")

legend("topright",
       legend = c("Mean", "Median"),
       col = c("blue", "red"),
       lty = c(1, 1),  # line symbols, matching the vertical reference lines
       lwd = 2,
       bg = "white")

Create (and use!) at least one custom function.

Professor.Salary <- function(what) {
  # Return the mean salary for one rank, or a hint if the rank is not recognised
  valid.input <- what %in% c("AsstProf", "AssocProf", "Prof")
  if (valid.input) {
    output <- mean(Salaries$salary[Salaries$rank == what])
  } else {
    output <- "Please enter AsstProf, AssocProf or Prof"
  }
  return(output)
}

Professor.Salary("AsstProf")

## [1] 80775.99

Professor.Salary("AssocProf")

## [1] 93876.44

Professor.Salary("Prof")

## [1] 126772.1

Professor.Salary("LOL")

## [1] "Please enter AsstProf, AssocProf or Prof"
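
The function above reads the global Salaries object and returns a text message for invalid input. One possible variant, shown here only as a sketch, takes the data frame as an argument and signals an error instead, so that a wrong rank cannot be mistaken for a result. The names Rank.Salary and toy are hypothetical, and the toy values come from the six rows printed by head(Salaries).

```r
# Hypothetical variant: the data frame is an argument and bad input is an error
Rank.Salary <- function(data, what) {
  valid.ranks <- unique(as.character(data$rank))
  if (!(what %in% valid.ranks)) {
    stop("Unknown rank '", what, "'. Please enter one of: ",
         paste(valid.ranks, collapse = ", "))
  }
  mean(data$salary[data$rank == what])
}

# Ranks and salaries from the first six rows printed by head(Salaries)
toy <- data.frame(
  rank = c("Prof", "Prof", "AsstProf", "Prof", "Prof", "AssocProf"),
  salary = c(139750, 173200, 79750, 115000, 141500, 97000)
)

Rank.Salary(toy, "Prof")

# Catch the error and show only its message
tryCatch(Rank.Salary(toy, "LOL"), error = function(e) conditionMessage(e))
```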

Section 4: Conclusion

The analysis of the dataset provided several interesting results. Some were expected and others surprising, and together they give insights for potential future policy changes.

Descriptive statistics indicate that the salaries at the college are approximately normally distributed, although the mean (113 706.5 dollars) lies above the median (107 300 dollars), which points to a mild right skew. Only a few people earn less than 50 000 dollars or more than 200 000 dollars, while most people have a salary of around 100 000 dollars. The group means also suggest a relationship between salary and faculty rank: Professors have the highest mean salary (126 772 dollars), followed by Associate Professors (93 876 dollars) and Assistant Professors (80 776 dollars). Future analyses need to determine whether these differences are statistically significant.

Unfortunately, at this college, there is still a statistically significant difference between the salaries of men and women. Over the nine-month salary period, men earn on average approximately 14 000 dollars more than women (115 090 versus 101 002 dollars).

test.result1

## 
##  Welch Two Sample t-test
## 
## data:  salary.male and salary.female
## t = 3.1615, df = 50.122, p-value = 0.002664
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   5138.102 23037.916
## sample estimates:
## mean of x mean of y 
##  115090.4  101002.4

However, this analysis did not control for rank, years since PhD or years of service. Future analyses need to account for these factors to see whether policy interventions might be necessary to adjust the salary distribution at the college.
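
One way such a follow-up analysis could look is a multiple regression with sex, rank and both time variables as predictors. The sketch below is not run on the real data: it uses simulated values (all numbers and the name sim are made up for illustration, and no sex effect is built into the simulation) only to show the form of the model call.

```r
set.seed(1)  # reproducible simulated data; none of these values are real
n <- 60
sim <- data.frame(
  sex = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  rank = factor(sample(c("AsstProf", "AssocProf", "Prof"), n, replace = TRUE),
                levels = c("AsstProf", "AssocProf", "Prof")),
  yrs.since.phd = sample(1:40, n, replace = TRUE)
)
# Years of service lag the years since the PhD by 0 to 5 years, never below 0
sim$yrs.service <- pmax(sim$yrs.since.phd - sample(0:5, n, replace = TRUE), 0)
# Simulated salary depends on rank and years since PhD plus noise, not on sex
sim$salary <- 70000 + 15000 * as.integer(sim$rank) + 500 * sim$yrs.since.phd +
  rnorm(n, sd = 10000)

# Sex coefficient adjusted for rank and both time variables
fit <- lm(salary ~ sex + rank + yrs.since.phd + yrs.service, data = sim)
summary(fit)$coefficients
```

On the real data, the row for sexMale would estimate the salary difference between men and women who share the same rank, years since PhD and years of service.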

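Another interesting result was the relationship between time and salary. Both the time since a person obtained their PhD and their years of service were positively related to salary. People who have held their PhD for longer earn more: each additional year since the PhD is associated with roughly 887 euros more salary. People with more years of service also have a higher salary, at roughly 702 euros per additional year. Regression analyses revealed that both relationships were statistically significant, although years since PhD explains only about 18% of the variance in salary and years of service about 11%.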

Alexander Rein

August 2015