Section I: Introduction

This research uses the Salaries dataset (Salaries for Professors) from the “car” package. The dataset records the 2008-09 nine-month academic salary for Assistant Professors, Associate Professors and Professors at a college in the U.S. The data were collected as part of the college administration’s ongoing effort to monitor salary differences between male and female faculty members. The dataset contains 397 observations and 6 variables. This research focuses on the factors that influence the college professors’ salaries. The response variable (Y) is salary (nine-month salary, in dollars). The three explanatory variables are yrs.since.phd (years since PhD), yrs.service (years of service), and sex (a factor with levels Female and Male). The goal is to explore how sex, years of service and years since PhD relate to the salaries professors receive. We suspect that the response variable is related to the explanatory variables, and we use statistical methods to analyze and explain the data.

Section II: Exploratory Data Analysis

Univariate Analysis

  1. Salary

library(car)  # makes the Salaries dataset available (via carData in recent versions)
attach(Salaries)
summary(salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57800   91000  107300  113706  134185  231545
diff(range(salary))
## [1] 173745
IQR(salary)
## [1] 43185
hist(Salaries$salary, breaks = 15, col="blue", xlab="nine-month salary, in dollars",
     main="Histogram", xlim = c(0,300000), ylim = c(0,100))

boxplot(salary)

The professors’ salaries in this sample range from 57800 to 231545 dollars, a difference of 173745 dollars, so salaries are widely spread out and can differ greatly from one professor to another. The mean is 113706 dollars, which is the average amount a professor earns in nine months; the median is slightly lower at 107300 dollars. The histogram shows that most professors earn salaries near these central values. The maximum salary, 231545 dollars, is about 1.7 times the 3rd quartile, 134185 dollars, but extremely high salaries are not common: the box plot shows only a few outliers at the upper end. Overall, the salaries are spread out, but most of the data lie around the mean and median.

  2. Years since PhD
summary(yrs.since.phd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   12.00   21.00   22.31   32.00   56.00
diff(range(yrs.since.phd))
## [1] 55
IQR(yrs.since.phd)
## [1] 20
hist(Salaries$yrs.since.phd, breaks = 15, col="blue", xlab="years since PhD",
     main="Histogram", ylim = c(0,60), xlim = c(0,60))

boxplot(yrs.since.phd)

The range between the maximum and minimum years since PhD is 55 years, which indicates great variety among the professors. Because the median (21) is below the mean (22.31), at least half of the professors are below the average number of years since PhD, and the histogram shows only a couple of professors who are more than 50 years past their PhD. According to the boxplot, there are no outliers in this variable, which suggests that its spread is less extreme than that of salary. The mean of 22.31 means that, on average, professors have held a PhD for 22.31 years. The median of 21 means that half of the professors received their PhD 21 or fewer years ago.

  3. Years of service
summary(yrs.service)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.00   16.00   17.61   27.00   60.00
diff(range(yrs.service))
## [1] 60
IQR(yrs.service)
## [1] 20
hist(Salaries$yrs.service, breaks = 15, col="blue", xlab="years of service",
     main="Histogram", ylim = c(0,100), xlim = c(0,70))

boxplot(yrs.service)

Years of service range from 0 to 60 years, which again indicates that the data are widely spread out. The mean is 17.6 years, meaning that on average professors have been in service for 17.6 years. The histogram is right-skewed: the longer the years of service, the fewer professors there are at that level. The box plot shows one outlier at the upper end of the scale. The median of 16 years indicates that half of the professors in this sample have 16 or fewer years of service.
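
The outlier remarks for the three quantitative variables can be cross-checked with the usual 1.5 × IQR rule, using only the quartiles reported by summary() above. This is a minimal sketch: iqr_fences is a hypothetical helper, and boxplot() computes its hinges slightly differently from summary(), so the fences are approximate.

```r
# Hypothetical helper: Tukey-style fences from a first and third quartile.
iqr_fences <- function(q1, q3) {
  iqr <- q3 - q1
  c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
}

# Quartiles copied from the summary() output in this section.
salary_f  <- iqr_fences(91000, 134185)
phd_f     <- iqr_fences(12, 32)
service_f <- iqr_fences(7, 27)

# Compare each reported maximum with its upper fence.
salary_has_high_outlier  <- 231545 > salary_f[["upper"]]
phd_has_high_outlier     <- 56 > phd_f[["upper"]]
service_has_high_outlier <- 60 > service_f[["upper"]]

print(rbind(salary = salary_f, yrs.since.phd = phd_f, yrs.service = service_f))
```

Under these assumptions the upper fences are 198962.5 for salary, 62 for years since PhD and 57 for years of service. This agrees with the boxplots: the maxima for salary (231545) and years of service (60) exceed their fences, while the maximum for years since PhD (56) does not.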

  4. Sex
counts <- table(Salaries$sex)
counts
## 
## Female   Male 
##     39    358
barplot(counts, main= "barplot", ylim=c(0,400))
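
The counts above can also be expressed as proportions, which makes the imbalance between the two groups easier to read. A minimal sketch that re-enters the two counts by hand (sex_counts and sex_props are illustrative names, not objects from the original analysis):

```r
# Counts copied from the table() output above.
sex_counts <- c(Female = 39, Male = 358)

# Share of each group among the 397 professors.
sex_props <- prop.table(sex_counts)

# Percentages rounded to one decimal place.
print(round(100 * sex_props, 1))
```

With these counts, female professors make up about 9.8% of the sample and male professors about 90.2%.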

The table of professors’ sex shows that there are only 39 female professors and 358 male professors out of 397 professors in total. In this sample, female professors are far fewer than male professors.

Bivariate Analysis

  1. Salary & Years since PhD

plot(Salaries$salary ~ Salaries$yrs.since.phd)

The scatterplot shows a positive relationship between salary and years since PhD. It suggests that the longer it has been since a professor received a PhD, the higher the salary tends to be.

  2. Salary & Years of Service
plot(Salaries$salary ~ Salaries$yrs.service)  # response on the left: salary against years of service

From the scatterplot, salary and years of service do not appear to have a strong relationship. In other words, years of service alone do not seem to strongly influence professors’ salaries.

  3. Salary & Sex
boxplot(Salaries$salary ~ Salaries$sex, col="blue", main="Side-by-side boxplot", ylab="nine-month salary, in dollars", xlab="a factor with levels Female Male")

According to the side-by-side boxplot, female professors earn a lower salary than male professors on average. Male salaries also vary more, so male professors have a chance of being paid more. Although female salaries vary less, the lowest female salary is higher than the lowest male salary.

Conclusion: The summary statistics and graphs together indicate a couple of things. First, this sample has large variety. The ranges of the individual variables are large and some contain outliers, which means professors’ years of service, years since PhD and salaries differ greatly from one another. Second, plotting the response variable against each explanatory variable one at a time (scatterplots and a boxplot) suggests that years since PhD and sex have greater influence on professors’ salaries than years of service. Therefore, the analysis seems to suggest that this sample covers a wide range of professors and that years since PhD and sex are factors that influence salary.

Section III: Simple Linear Regression

Intro:

Quantitative predictor: salary & years since PhD

plot(Salaries$salary ~ Salaries$yrs.since.phd)
lm(Salaries$salary ~ Salaries$yrs.since.phd)
## 
## Call:
## lm(formula = Salaries$salary ~ Salaries$yrs.since.phd)
## 
## Coefficients:
##            (Intercept)  Salaries$yrs.since.phd  
##                91718.7                   985.3
fit1 <- lm(Salaries$salary ~ Salaries$yrs.since.phd)
fit1
## 
## Call:
## lm(formula = Salaries$salary ~ Salaries$yrs.since.phd)
## 
## Coefficients:
##            (Intercept)  Salaries$yrs.since.phd  
##                91718.7                   985.3
abline(fit1)

summary(fit1)
## 
## Call:
## lm(formula = Salaries$salary ~ Salaries$yrs.since.phd)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -84171 -19432  -2858  16086 102383 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             91718.7     2765.8  33.162   <2e-16 ***
## Salaries$yrs.since.phd    985.3      107.4   9.177   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27530 on 395 degrees of freedom
## Multiple R-squared:  0.1758, Adjusted R-squared:  0.1737 
## F-statistic: 84.23 on 1 and 395 DF,  p-value: < 2.2e-16
coef(fit1)
##            (Intercept) Salaries$yrs.since.phd 
##             91718.6854               985.3421

Explanation: The estimated regression line is y = 91718.7 + 985.3x, meaning that for each additional year since a professor obtained a PhD, the predicted salary goes up by 985.3 dollars. The y-intercept is not very meaningful in this case because it implies that professors who have just obtained their PhD have a starting salary of 91718.7 dollars, which does not closely match real life.
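
To make the interpretation of the slope concrete, the fitted line can be evaluated by hand. This is a minimal sketch that copies the two estimates from the coef(fit1) output above; predict_salary is a hypothetical helper, and the chosen years (1, 10, 20) are arbitrary values inside the observed range of 1 to 56.

```r
# Estimates copied from the coef(fit1) output.
b0 <- 91718.6854  # intercept
b1 <- 985.3421    # slope: dollars per additional year since PhD

# Hypothetical helper: predicted nine-month salary for given years since PhD.
predict_salary <- function(years_since_phd) {
  b0 + b1 * years_since_phd
}

# Predicted salaries at 1, 10 and 20 years since PhD.
predicted <- predict_salary(c(1, 10, 20))
print(predicted)

# One extra year always changes the prediction by the slope.
one_year_change <- predict_salary(11) - predict_salary(10)
print(one_year_change)
```

For example, at 10 years since PhD the line predicts 91718.6854 + 985.3421 × 10 = 101572.1 dollars. Predictions far outside the observed range of years would be extrapolation and should be treated with caution.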

Qualitative predictor: salary & sex

boxplot(salary ~ sex, data=Salaries)
is.factor(Salaries$sex)
## [1] TRUE
fit2 <- lm(salary ~ factor(sex), data=Salaries)
summary(fit2)
## 
## Call:
## lm(formula = salary ~ factor(sex), data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -57290 -23502  -6828  19710 116455 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       101002       4809  21.001  < 2e-16 ***
## factor(sex)Male    14088       5065   2.782  0.00567 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30030 on 395 degrees of freedom
## Multiple R-squared:  0.01921,    Adjusted R-squared:  0.01673 
## F-statistic: 7.738 on 1 and 395 DF,  p-value: 0.005667
abline(h=100000)

tapply(Salaries$salary, Salaries$sex, mean)
##   Female     Male 
## 101002.4 115090.4

Explanation: The y-intercept is 101002 and the coefficient is 14088. Because sex is a qualitative predictor, the y-intercept is the mean salary for female professors (the reference level), and the mean salary for male professors is the sum of the y-intercept and the coefficient, which is 115090 dollars. The horizontal line at y = 100000 shows that 100000 dollars is closer to the median of the female professors’ salaries than to that of the male professors’. In other words, the horizontal line makes it even clearer that female professors in this sample earn less than male professors on average.
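
The link between the regression coefficients and the group means can be illustrated on a small example. The sketch below uses hypothetical toy data (not the Salaries dataset) to show that, with a two-level factor, the intercept equals the mean of the reference level and the intercept plus the coefficient equals the mean of the other level. It then repeats the arithmetic with the rounded estimates from summary(fit2).

```r
# Hypothetical toy data, used only to show the mechanics of dummy coding.
toy <- data.frame(
  pay   = c(90, 100, 110, 120, 130, 140),
  group = factor(c("Female", "Female", "Female", "Male", "Male", "Male"))
)

fit_toy     <- lm(pay ~ group, data = toy)
b           <- coef(fit_toy)
group_means <- tapply(toy$pay, toy$group, mean)

# Intercept = mean of the reference level ("Female", first alphabetically).
intercept_is_female_mean <- isTRUE(all.equal(
  unname(b["(Intercept)"]), unname(group_means["Female"])))

# Intercept + coefficient = mean of the other level ("Male").
sum_is_male_mean <- isTRUE(all.equal(
  unname(b["(Intercept)"] + b["groupMale"]), unname(group_means["Male"])))

# Same arithmetic with the rounded estimates reported for fit2.
female_mean <- 101002
male_mean   <- 101002 + 14088
print(c(Female = female_mean, Male = male_mean))
```

The second part gives 115090 dollars for male professors, which matches the group mean reported by tapply() up to rounding.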

summary(lm(Salaries$salary ~ Salaries$yrs.since.phd))$r.squared
## [1] 0.1757547
summary(lm(salary ~ factor(sex)))$r.squared
## [1] 0.01921278

The R-squared for salary and years since PhD is about 17.6%. Although this is not a strong relationship, it suggests a positive association between the response and the numerical explanatory variable. The R-squared for salary and sex is only about 2%, which means sex alone explains very little of the variation in salary. However, I think a low R-squared does not by itself mean there is no association: the coefficient for sex is statistically significant (p = 0.0057), so the difference in mean salary is unlikely to be due to chance alone. The low R-squared may also reflect how few female professors (39 of 397) are in the sample. I think these are useful models because together they cover both a continuous and a categorical explanatory variable.
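
As a consistency check on the two R-squared values, each one can be recovered from the F-statistic and residual degrees of freedom printed by summary(). For a model with a single predictor, R-squared = F / (F + residual df), and the identity is the same whether the predictor is numeric or a two-level factor. A minimal sketch, with r2_from_f as a hypothetical helper and the inputs copied from the rounded output above:

```r
# Hypothetical helper: R-squared from the F-statistic of a one-predictor model.
r2_from_f <- function(f_stat, df_resid) {
  f_stat / (f_stat + df_resid)
}

# F-statistics and residual df copied from summary(fit1) and summary(fit2).
r2_phd <- r2_from_f(84.23, 395)
r2_sex <- r2_from_f(7.738, 395)

print(round(c(yrs.since.phd = r2_phd, sex = r2_sex), 4))
```

Up to rounding of the printed F-statistics, both results agree with the reported values of 0.1758 and 0.0192.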

Conclusion: The calculations and graphs suggest a positive relationship between salary and years since PhD, and the R-squared for that quantitative predictor supports this. The boxplot by group and the horizontal line further confirm that, within this sample, female professors earn a lower average salary than male professors. Even though the R-squared for the qualitative predictor is very low, I do think that there is an association between salary and sex; it simply cannot be judged by R-squared alone.

Section IV: Multiple Linear Regression

This section is to be completed for Part 2.

Section V: Hypothesis Testing

This section is to be completed for Part 2.

Section VI: Conclusions


Honor Code

Pledged by Kai Yan.
