This research uses the Salaries dataset (Salaries for Professors) from the “car” package. The dataset records the 2008-09 nine-month academic salary for Assistant Professors, Associate Professors, and Professors at a college in the U.S. The data were collected as part of the ongoing effort of the college’s administration to monitor salary differences between male and female faculty members. The dataset contains 397 observations and 6 variables. This research focuses on the factors that influence college professors’ salaries. The response variable (Y) is salary (nine-month salary, in dollars). The three explanatory variables are yrs.since.phd (years since PhD), yrs.service (years of service), and sex (a factor with levels Female and Male). We explore how sex, years of service, and years since PhD relate to the salaries professors receive. We suspect that the response variable is related to the explanatory variables, and we use statistical methods to analyze and explain the data.
Univariate Analysis
1. Salary
library(car)  # attaches carData, which provides the Salaries data frame
attach(Salaries)
summary(salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57800 91000 107300 113706 134185 231545
diff(range(salary))
## [1] 173745
IQR(salary)
## [1] 43185
hist(Salaries$salary, breaks = 15, col="blue", xlab="nine-month salary, in dollars",
main="Histogram", xlim = c(0,300000), ylim = c(0,100))
boxplot(salary)
The professors’ salaries in this sample range from 57800 to 231545 dollars, a difference of 173745 dollars, so salaries are widely spread out and individual professors can earn very different amounts. The mean salary is 113706 dollars, which is the average amount a professor earns over nine months. The median, 107300 dollars, is lower than the mean, which points to a right-skewed distribution. The histogram shows that most professors earn salaries near the mean. The maximum salary, 231545 dollars, is about 1.7 times the 3rd-quartile salary of 134185 dollars, but such extremely high salaries are not common. The box plot shows a few outliers at the upper end of the salaries. Overall, the professors’ salaries are spread out, but most of the data lie around the mean and median.
2. Years since PhD
summary(yrs.since.phd)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 12.00 21.00 22.31 32.00 56.00
diff(range(yrs.since.phd))
## [1] 55
IQR(yrs.since.phd)
## [1] 20
hist(Salaries$yrs.since.phd, breaks = 15, col="blue", xlab="years since PhD",
main="Histogram", ylim = c(0,60), xlim = c(0,60))
boxplot(yrs.since.phd)
The range between the maximum and minimum years since PhD is 55 years, which indicates great variety in how long ago professors earned their PhD. The mean is 22.31, meaning that, on average, professors received their PhD 22.31 years ago. The median is 21, meaning that half of the professors earned their PhD 21 or fewer years ago; because the median is below the mean, at least half of the professors fall below the average of 22.31 years. The histogram shows only a couple of professors who are more than 50 years past their PhD. Also, according to the boxplot, there are no outliers in this variable, which suggests that its spread is not as extreme as that of salary.
3. Years of service
summary(yrs.service)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 7.00 16.00 17.61 27.00 60.00
diff(range(yrs.service))
## [1] 60
IQR(yrs.service)
## [1] 20
hist(Salaries$yrs.service, breaks = 15, col="blue", xlab="years of service",
main="Histogram", ylim = c(0,100), xlim = c(0,70))
boxplot(yrs.service)
The years of service range from 0 to 60 years, which again indicates that the data are widely spread out. The mean is 17.6 years, meaning that, on average, professors have been in service for 17.6 years. The histogram is right-skewed: the more years of service, the fewer professors there are with that many years. The box plot shows one outlier at the upper end of the scale. The median is 16 years, so half of the professors in this sample have 16 or fewer years of service.
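The box plot observations for the three numeric variables can be cross-checked against the summary statistics. The sketch below applies the usual 1.5 × IQR rule to the quartiles and maxima reported by summary() above. The helper name tukey_upper_fence is hypothetical, and boxplot() computes its box from hinges rather than from these quartiles, so the fences are approximate.

```r
# Quartiles and maxima copied from the summary() output in this report.
# tukey_upper_fence() is a hypothetical helper, not part of base R or car.
tukey_upper_fence <- function(q1, q3) q3 + 1.5 * (q3 - q1)

stats <- data.frame(
  variable = c("salary", "yrs.since.phd", "yrs.service"),
  q1       = c(91000, 12, 7),
  q3       = c(134185, 32, 27),
  max      = c(231545, 56, 60)
)

stats$upper_fence <- tukey_upper_fence(stats$q1, stats$q3)
# TRUE means the largest observation lies beyond the upper fence
stats$has_upper_outlier <- stats$max > stats$upper_fence
print(stats)
```

Under these assumptions, the largest salary and the largest years of service fall beyond their upper fences, while the largest years since PhD does not, which agrees with what the box plots show.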
4. Sex
counts <- table(Salaries$sex)
counts
##
## Female Male
## 39 358
barplot(counts, main= "barplot", ylim=c(0,400))
The table of professors’ sex shows that there are only 39 female professors and 358 male professors out of 397 in total, so women make up about 10% of the sample. In this sample, female professors are far outnumbered by male professors.
Bivariate Analysis
1. Salary & Years since PhD
plot(Salaries$salary ~ Salaries$yrs.since.phd)
The scatterplot shows a moderate positive relationship between salary and years since PhD. It suggests that the more years that have passed since professors earned their PhD, the higher the salaries they tend to earn.
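One way to put a number on the strength of this relationship is to convert the R-squared that this report gives later for the regression of salary on years since PhD into a correlation coefficient. This is a minimal sketch that relies only on those reported values; it applies to a simple regression with a single predictor.

```r
# Values copied from the regression output reported later in this report
r_squared <- 0.1757547   # summary(lm(salary ~ yrs.since.phd))$r.squared
slope     <- 985.3421    # only the sign of the slope is used here

# With one predictor, the correlation is the signed square root of R-squared
r <- sign(slope) * sqrt(r_squared)
round(r, 2)
```

A correlation of roughly 0.42 is usually described as moderate rather than strong.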
2. Salary & Years of service
plot(Salaries$salary ~ Salaries$yrs.service)
From the scatterplot, salary and years of service do not appear to have a strong relationship. In other words, years of service alone do not appear to explain much of the variation in professors’ salaries.
3. Salary & Sex
boxplot(Salaries$salary ~ Salaries$sex, col="blue", main="Side-by-side boxplot", ylab="nine-month salary, in dollars", xlab="sex")
According to the side-by-side boxplot, female professors earn a lower salary than male professors on average. Male professors’ salaries also vary more, so the highest salaries in the sample go to men. Female professors’ salaries vary less, but their lowest salary is higher than the lowest salary among male professors.
Conclusion: The summary statistics and graphs together indicate a couple of things. First, this sample has a lot of variety. The ranges of the individual variables are large, and salary and years of service even contain outliers, which means professors differ widely in years of service, years since PhD, and salary. Second, plotting the response variable against one explanatory variable at a time, with scatterplots and boxplots, suggests that years since PhD and sex are more closely related to professors’ salaries than years of service. Therefore, the analysis seems to suggest that this sample covers a wide range of professors and that years since PhD and sex are factors related to salary.
Intro:
Quantitative predictor: salary & years since PhD
plot(Salaries$salary ~ Salaries$yrs.since.phd)
fit1 <- lm(Salaries$salary ~ Salaries$yrs.since.phd)
fit1
##
## Call:
## lm(formula = Salaries$salary ~ Salaries$yrs.since.phd)
##
## Coefficients:
## (Intercept) Salaries$yrs.since.phd
## 91718.7 985.3
abline(fit1)
summary(fit1)
##
## Call:
## lm(formula = Salaries$salary ~ Salaries$yrs.since.phd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84171 -19432 -2858 16086 102383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91718.7 2765.8 33.162 <2e-16 ***
## Salaries$yrs.since.phd 985.3 107.4 9.177 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27530 on 395 degrees of freedom
## Multiple R-squared: 0.1758, Adjusted R-squared: 0.1737
## F-statistic: 84.23 on 1 and 395 DF, p-value: < 2.2e-16
coef(fit1)
## (Intercept) Salaries$yrs.since.phd
## 91718.6854 985.3421
Explanation: The estimated regression line is y = 985.3x + 91718.7, where x is years since PhD and y is the predicted salary. Each additional year since obtaining a PhD is associated with a predicted salary increase of 985.3 dollars. The y-intercept, 91718.7 dollars, is the predicted salary for a professor with zero years since PhD. It has little practical meaning here because the smallest value in the sample is 1 year, so it is an extrapolation that does not closely apply to real life.
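To make the interpretation of the slope and intercept concrete, the sketch below plugs a few values into the fitted line. The coefficients are copied from coef(fit1) above; the helper predict_salary and the chosen years are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from coef(fit1) in this report
intercept <- 91718.6854
slope     <- 985.3421

# predict_salary() is a hypothetical helper for illustration only
predict_salary <- function(years_since_phd) {
  intercept + slope * years_since_phd
}

years <- c(1, 10, 20, 30)
predicted <- predict_salary(years)
print(data.frame(years_since_phd = years, predicted_salary = round(predicted)))

# The gap between any two consecutive years equals the slope
one_year_gap <- predict_salary(21) - predict_salary(20)
print(one_year_gap)
```

With the fitted model object available, predict(fit1, ...) would be the more usual route; the arithmetic is spelled out here only to show what the two coefficients mean.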
Qualitative predictor: salary & sex
boxplot(salary ~ sex, data=Salaries)
is.factor(Salaries$sex)
## [1] TRUE
fit2 <- lm(salary ~ factor(sex), data=Salaries)
summary(fit2)
##
## Call:
## lm(formula = salary ~ factor(sex), data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57290 -23502 -6828 19710 116455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101002 4809 21.001 < 2e-16 ***
## factor(sex)Male 14088 5065 2.782 0.00567 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30030 on 395 degrees of freedom
## Multiple R-squared: 0.01921, Adjusted R-squared: 0.01673
## F-statistic: 7.738 on 1 and 395 DF, p-value: 0.005667
abline(h=100000)
tapply(Salaries$salary, Salaries$sex, mean)
## Female Male
## 101002.4 115090.4
Explanation: The y-intercept is 101002 and the coefficient for Male is 14088. Because sex is a qualitative predictor, the y-intercept is the mean salary for female professors (the reference level), and the mean salary for male professors is the sum of the coefficient and the y-intercept, which is 115090 dollars. These values match the group means from tapply(). The horizontal line at y = 100000 lies closer to the median of the female professors’ salaries than to the median of the male professors’ salaries. In other words, the horizontal line makes it even clearer that female professors in this sample earn less than male professors on average.
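The link between the dummy-coded coefficients and the group means can be checked with a few lines of arithmetic. This sketch uses only the rounded coefficients from summary(fit2) and the means from tapply() shown above; the variable names are illustrative.

```r
# Coefficients copied from summary(fit2) in this report
intercept <- 101002   # mean salary for the reference level, Female
male_coef <- 14088    # estimated Male minus Female difference

group_means <- c(Female = intercept, Male = intercept + male_coef)
print(group_means)

# Means reported by tapply() in this report, for comparison
tapply_means <- c(Female = 101002.4, Male = 115090.4)
print(round(group_means - tapply_means, 1))
```

The small remaining differences come from rounding in the printed coefficient table.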
summary(lm(Salaries$salary ~ Salaries$yrs.since.phd))$r.squared
## [1] 0.1757547
summary(lm(salary ~ factor(sex)))$r.squared
## [1] 0.01921278
The R-squared for salary and years since PhD is about 17.6%. It is not a strong relationship, but it does suggest a positive relationship between the response and the numerical explanatory variable. The R-squared for salary and sex is only about 2%, which means that sex alone explains very little of the variation in salary. However, a low R-squared does not mean there is no association: the coefficient for Male is statistically significant (p = 0.0057). I think the R-squared is low because salaries vary widely within each group, so a mean difference of about 14000 dollars is small relative to the overall spread. Another reason could be that there are so few female professors (39 of 397) that the difference between the groups accounts for little of the total variation. I think the two models are useful together because they cover both a continuous and a categorical explanatory variable.
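One way to see where the roughly 2% comes from is to rebuild the R-squared for the sex model from its parts. The sketch below uses the group sizes, the Male coefficient, and the residual standard error reported above, so the result is approximate because of rounding. The balanced-groups comparison at the end is a hypothetical scenario that holds the mean difference and within-group variation fixed, added only for illustration.

```r
# Values copied from table(Salaries$sex) and summary(fit2) in this report
n_female  <- 39
n_male    <- 358
mean_diff <- 14088   # Male minus Female mean salary
resid_se  <- 30030   # residual standard error
resid_df  <- 395     # residual degrees of freedom

# Between-group sum of squares for a two-group comparison
ss_between <- (n_female * n_male) / (n_female + n_male) * mean_diff^2
# Within-group (residual) sum of squares
ss_within <- resid_se^2 * resid_df

r_squared <- ss_between / (ss_between + ss_within)
print(round(r_squared, 4))

# Hypothetical: same total n, same mean difference and within-group
# variation, but an even split between the two groups
n_half <- (n_female + n_male) / 2
ss_between_balanced <- (n_half * n_half) / (2 * n_half) * mean_diff^2
r_squared_balanced <- ss_between_balanced / (ss_between_balanced + ss_within)
print(round(r_squared_balanced, 4))
```

The first value reproduces the reported R-squared of about 0.019. The second is larger, which illustrates how the small number of female professors keeps the between-group share of the variation low.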
Conclusion: The calculations and graphs suggest a positive relationship between salary and years since PhD; the R-squared of about 0.18 for that model, together with the significant slope, supports this. The boxplot by group and the horizontal line further show that, within this sample, female professors earn a lower average salary than male professors. Even though the R-squared for the model with sex is very low, I do think there is an association: the difference in mean salaries is statistically significant, but sex explains only a small share of the variation in salary.
Pledged by Kai Yan.