Take Home Test Solutions

1. A 1998 article in The New York Times (31 May 1998, Section 3, p. 1) claimed that companies whose CEOs were better at golf had higher stock prices. This conclusion was based on data available at http://web02.gonzaga.edu/faculty/axon/422/golf.txt. The third column, Handicap, is a measure of golf prowess, with better golfers having lower handicaps. Perform your own analysis of the data, clearly stating null and alternative hypotheses and the results of a test of your hypotheses. Address any potential problems with your test (i.e. are the assumptions of your test reasonable?).

library(readxl)
## Warning: package 'readxl' was built under R version 3.4.2
Golf <- read_excel("~/R/win-library/3.4/gudatavizfa17/data/MathStatsCalcScores/TakeHomeTest/Golf.xlsx")

The null hypothesis is that ρ = 0, meaning there is no correlation between CEO golf scores and company stock prices. If the two variables are jointly normal, this is equivalent to the two variables being independent. The alternative hypothesis is that ρ ≠ 0, meaning there is a correlation between CEO golf scores and company stock prices.

I did most of the work for this problem by hand, but I used R to find the sample correlation coefficient with the following command.

cor(Golf$Handicap, Golf$StockRate)
## [1] -0.04172447

The rest of the test was conducted on my paper.
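
The hand calculation can be cross-checked in R. The sketch below assumes the test done on paper was the usual t-test of ρ = 0, and it uses only the correlation reported above and the sample size of 33 stated in this solution. The variable names are my own.

```r
# Assumed inputs: r is the cor() output above; n = 33 is the stated sample size.
r <- -0.04172447
n <- 33

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

t_stat
p_value
```

With the data frame loaded, cor.test(Golf$Handicap, Golf$StockRate) runs the same Pearson test directly.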

One potential problem with the test is the reliability of the handicaps. The CEOs likely played on different courses and on different days, and there is no way to know how truthful their responses were. I think the implied benefit of being good at golf for company stock prices is that the CEO conducts business while on the golf course. However, a CEO who is poor at golf could send a business associate on the golf trips instead.

Likely the biggest concern with the test I conducted is that I did not know whether the golf handicap and company stock price variables were normally distributed. The sample size was only 33, so normality was not something I should have assumed without checking.

library(nortest)
qqnorm(Golf$Handicap)

shapiro.test(Golf$Handicap)
## 
##  Shapiro-Wilk normality test
## 
## data:  Golf$Handicap
## W = 0.97284, p-value = 0.2892
ad.test(Golf$Handicap)
## 
##  Anderson-Darling normality test
## 
## data:  Golf$Handicap
## A = 0.45805, p-value = 0.2537
qqnorm(Golf$StockRate)

shapiro.test(Golf$StockRate)
## 
##  Shapiro-Wilk normality test
## 
## data:  Golf$StockRate
## W = 0.9704, p-value = 0.2298
ad.test(Golf$StockRate)
## 
##  Anderson-Darling normality test
## 
## data:  Golf$StockRate
## A = 0.38399, p-value = 0.3826

After examining q-q plots and running two normality tests on each variable, I found no significant evidence that either variable is not normally distributed, so I think the normality assumption was reasonable.

5. Another sample of 33 bags of fun-sized M&Ms was analyzed and the numbers of green and yellow M&Ms in each bag were recorded. The results are available at http://web02.gonzaga.edu/faculty/axon/422/yellow-green.csv.

YellowGreen <- read.csv("~/R/win-library/3.4/gudatavizfa17/data/MathStatsCalcScores/TakeHomeTest/yellow-green.csv")

a) Fit a least squares line to the data.

plot(YellowGreen$Yellow, YellowGreen$X.Green, col="blue", main="# of Green and Yellow M&Ms", xlab="# of Yellow M&Ms", ylab="# of Green M&Ms")
# Green is on the vertical axis of the plot, so it must be the response in the model
m <- lm(X.Green ~ Yellow, data = YellowGreen)
abline(m)
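
As a check on what lm() computes, the least squares estimates can also be found from the textbook formulas. The sketch below uses a small set of made-up counts, not the M&M data, and the names yellow, green and fit are hypothetical.

```r
# Hypothetical counts for illustration only (not the M&M data).
yellow <- c(2, 3, 1, 4, 2, 5, 3, 2)
green  <- c(4, 2, 5, 1, 3, 2, 3, 4)

# Least squares estimates from the textbook formulas
slope     <- cov(yellow, green) / var(yellow)
intercept <- mean(green) - slope * mean(yellow)

# The same estimates from lm(), with green as the response
fit <- lm(green ~ yellow)
coef(fit)
c(intercept, slope)
```

Note that abline(fit) draws the fitted line correctly only when the response in the model is the variable on the vertical axis of the plot.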

b) Calculate the sample correlation coefficient.

cor(YellowGreen$Yellow, YellowGreen$X.Green)
## [1] -0.1916997

c) Would it be appropriate to use normal correlation analysis to test for independence of our two variables? Why or why not?

To decide whether normal correlation analysis is appropriate for testing the independence of our two variables, we must first check whether each variable is normally distributed.

qqnorm(YellowGreen$Yellow)

This normal q-q plot shows the sample quantiles of the yellow M&M data against quantiles from the standard normal distribution. If the yellow M&M data were normally distributed, the points would fall much closer to a straight line.
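
Judging linearity in a q-q plot is easier with a reference line. The sketch below uses simulated counts as a stand-in for the M&M data (the Poisson model and its mean are assumptions made only for illustration) and adds the line with qqline().

```r
set.seed(1)
# Hypothetical small counts standing in for M&Ms per bag (not the real data)
counts <- rpois(33, lambda = 3)

# qqnorm() draws the plot and invisibly returns the plotted coordinates
qq <- qqnorm(counts, main = "Normal Q-Q plot (simulated counts)")

# Reference line through the first and third quartiles
qqline(counts, col = "red")
```

Because counts take only a few integer values, the points form horizontal steps, which is one reason count data often looks non-normal in these plots.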

shapiro.test(YellowGreen$Yellow)
## 
##  Shapiro-Wilk normality test
## 
## data:  YellowGreen$Yellow
## W = 0.92281, p-value = 0.02193

To further explore the normality of the yellow M&M distribution, we can use a Shapiro-Wilk test. This test has the null hypothesis that the data is normally distributed and the alternative hypothesis that the data is not normally distributed. Since the p-value, 0.02193, is less than the significance level of 0.05, we reject the null hypothesis that the data on yellow M&Ms is normally distributed.

ad.test(YellowGreen$Yellow)
## 
##  Anderson-Darling normality test
## 
## data:  YellowGreen$Yellow
## A = 0.85393, p-value = 0.02502

Another normality test is the Anderson-Darling test. This test is slightly different from the Shapiro-Wilk test, but it shares the same null and alternative hypotheses. The p-value for the yellow M&M data is 0.02502. Since this is again less than the significance level of 0.05, we reject the null hypothesis that the data on yellow M&Ms is normally distributed.

We will next test the normality of the green M&M data.

qqnorm(YellowGreen$X.Green)

This normal q-q plot shows the sample quantiles of the green M&M data against quantiles from the standard normal distribution. If the green M&M data were normally distributed, the points would fall much closer to a straight line.

shapiro.test(YellowGreen$X.Green)
## 
##  Shapiro-Wilk normality test
## 
## data:  YellowGreen$X.Green
## W = 0.94179, p-value = 0.07661

To further explore the normality of the green M&M distribution, we can use a Shapiro-Wilk test. Since the p-value, 0.07661, is greater than the significance level of 0.05, we fail to reject the null hypothesis that the data on green M&Ms is normally distributed. Failing to reject is not the same as accepting the null hypothesis, so this result alone does not establish that the green M&M data is normally distributed.

ad.test(YellowGreen$X.Green)
## 
##  Anderson-Darling normality test
## 
## data:  YellowGreen$X.Green
## A = 0.89697, p-value = 0.01948

We will now use an Anderson-Darling normality test on the green M&M data. The p-value for the green M&M data is 0.01948. Since this is less than the significance level of 0.05, we reject the null hypothesis that the data on green M&Ms is normally distributed.

There is evidence that neither the yellow M&M data nor the green M&M data is normally distributed: both Anderson-Darling tests reject normality, as does the Shapiro-Wilk test for the yellow data. To run a normal correlation analysis, both variables must be normally distributed, so in this case a normal correlation analysis is not appropriate.
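
When normality is doubtful, one commonly used alternative is a rank-based test of association, such as Spearman's. This is not part of the question as posed; the sketch below only shows the call, using made-up counts rather than the M&M data.

```r
# Hypothetical counts for illustration only (not the M&M data).
yellow <- c(2, 3, 1, 4, 2, 5, 3, 2, 4, 1)
green  <- c(4, 2, 5, 1, 3, 2, 3, 4, 2, 5)

# exact = FALSE avoids the warning R gives when ties prevent an exact p-value
rank_test <- cor.test(yellow, green, method = "spearman", exact = FALSE)

rank_test$estimate  # Spearman's rho
rank_test$p.value
```

Count data usually contains many ties, which is why the sketch asks for the approximate p-value rather than the exact one.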

6. The 1994 salaries of major league baseball players from 3 teams (New York Yankees, Boston Red Sox, and Seattle Mariners) are collected in a table available at http://web02.gonzaga.edu/faculty/axon/422/baseball.csv.

library(readxl)
Baseball <- read_excel("~/R/win-library/3.4/gudatavizfa17/data/MathStatsCalcScores/TakeHomeTest/baseball.xlsx")

a) Run a 2-way ANOVA to determine if mean salaries are different for different teams or different positions.

In order to do this, we must first set up a linear model. In our case, we are going to set up a model testing how much baseball player salaries are affected by both the position they play and what team they play for.

n <- lm(Baseball$Salary~Baseball$Position+Baseball$Team)
anova(n)
## Analysis of Variance Table
## 
## Response: Baseball$Salary
##                   Df     Sum Sq    Mean Sq F value Pr(>F)
## Baseball$Position 10 2.5707e+13 2.5707e+12  1.0820 0.3892
## Baseball$Team      2 5.0638e+12 2.5319e+12  1.0656 0.3505
## Residuals         64 1.5207e+14 2.3760e+12

Now we can interpret the ANOVA table. The null hypothesis of an ANOVA test is that the means of all the treatments are equal, and the alternative hypothesis is that the means are not all equal. In our case, the first null hypothesis is that the mean salaries of players at different positions are all equal, while the alternative hypothesis is that they are not all equal. The second null hypothesis is that the mean salaries of players on different teams are all equal, while the alternative hypothesis is that they are not all equal.

The p-value of the test of salaries against player position is 0.3892. This is greater than the significance level of 0.05, so we fail to reject the null hypothesis. The difference in the mean salaries of players at different positions is not statistically significant.

The p-value of the test of salaries against team is 0.3505. This is greater than the significance level of 0.05, so we fail to reject the null hypothesis. The difference in the mean salaries of players on different teams is not statistically significant.
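
To see where the F values and p-values in the ANOVA table come from, they can be rebuilt from the mean squares and degrees of freedom in the table. The sketch below copies those numbers from the output above; the variable names are my own, and small differences in the last digit are expected because the table values are rounded.

```r
# Values copied from the ANOVA table above
ms_position <- 2.5707e+12   # mean square for position, 10 df
ms_team     <- 2.5319e+12   # mean square for team, 2 df
ms_resid    <- 2.3760e+12   # residual mean square, 64 df

# Each F statistic is a factor mean square over the residual mean square
f_position <- ms_position / ms_resid
f_team     <- ms_team / ms_resid

# p-values are upper-tail probabilities of the F distribution
p_position <- pf(f_position, df1 = 10, df2 = 64, lower.tail = FALSE)
p_team     <- pf(f_team, df1 = 2, df2 = 64, lower.tail = FALSE)

round(c(f_position, f_team), 4)
round(c(p_position, p_team), 4)
```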

b) Determine if the assumptions behind the ANOVA you just ran are reasonable.