Question 1: Calc Scores

Given the scores on tests 1 and 4 from Calc 157 across two sections, there are a number of questions I can answer. The first is whether scores on test 1 are an indicator of performance on test 4 and vice versa; in other words, is there a correlation between test 1 scores and test 4 scores? Additionally, I can check whether it is reasonable to assume that section A’s performance on the tests is similar to section B’s performance. To compare section A against section B, I will perform a hypothesis test using a two-sample t-test. Then, to see whether test 1 and test 4 scores are correlated, I will fit a linear regression.

library(readxl)

## Warning: package 'readxl' was built under R version 3.4.2

scores <- read_excel("~/R/win-library/3.4/gudatavizfa17/data/MathStatsCalcScores/scores.xlsx")

calcscores <- scores
colnames(calcscores) <- c("A1", "A4", "B1", "B4")
summary(calcscores)

##        A1               A4               B1               B4       
##  Min.   : 38.00   Min.   : 35.00   Min.   : 52.00   Min.   :44.00  
##  1st Qu.: 69.25   1st Qu.: 70.75   1st Qu.: 70.00   1st Qu.:66.00  
##  Median : 84.00   Median : 84.50   Median : 81.00   Median :77.00  
##  Mean   : 78.29   Mean   : 78.67   Mean   : 79.31   Mean   :76.66  
##  3rd Qu.: 90.50   3rd Qu.: 91.25   3rd Qu.: 88.00   3rd Qu.:91.00  
##  Max.   :100.00   Max.   :100.00   Max.   :100.00   Max.   :96.00  
##  NA's   :5        NA's   :5

plot(stack(calcscores)[, 2:1])

t.test(((calcscores$A1+calcscores$A4)/2),((calcscores$B1+calcscores$B4)/2))

## 
##  Welch Two Sample t-test
## 
## data:  ((calcscores$A1 + calcscores$A4)/2) and ((calcscores$B1 + calcscores$B4)/2)
## t = 0.12487, df = 42.776, p-value = 0.9012
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.521721  8.514537
## sample estimates:
## mean of x mean of y 
##  78.47917  77.98276

This hypothesis test compared the mean scores of section A and section B, where each student's score is the average of tests 1 and 4. I fail to reject the null hypothesis that the difference between the mean scores of the two sections is 0. The p-value from this test is 0.9012, which is greater than the significance level of 0.05, so the null hypothesis cannot be rejected. The sample means (78.48 and 77.98) are too close to suggest that the null hypothesis is incorrect.
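As a rough illustration of what the Welch test above computes, the sketch below rebuilds the t statistic, degrees of freedom, and p-value by hand and compares them with `t.test`. The vectors `sec_a` and `sec_b` are hypothetical values chosen for the example, not the Calc 157 data.

```r
# Hypothetical per-student averages for two sections (illustrative only).
sec_a <- c(82, 75, 91, 68, 88, 79, 94, 71)
sec_b <- c(80, 77, 85, 70, 90, 73, 86)

# Squared standard error of each sample mean.
se2_a <- var(sec_a) / length(sec_a)
se2_b <- var(sec_b) / length(sec_b)

# Welch t statistic: difference in means over its estimated standard error.
t_stat <- (mean(sec_a) - mean(sec_b)) / sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation to the degrees of freedom.
df <- (se2_a + se2_b)^2 /
  (se2_a^2 / (length(sec_a) - 1) + se2_b^2 / (length(sec_b) - 1))

# Two-sided p-value from the t distribution.
p_val <- 2 * pt(-abs(t_stat), df)

# Built-in test for comparison (var.equal = FALSE is the default).
fit <- t.test(sec_a, sec_b)
```

The hand-computed `t_stat`, `df`, and `p_val` should agree with `fit$statistic`, `fit$parameter`, and `fit$p.value`. A p-value above the chosen significance level means the null hypothesis of equal means is not rejected.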

plot(scores$A1, scores$A4, col="blue", main="Scores on Exam 1 and Exam 4 in Math 157 A", xlab="Exam 1 Score", ylab="Exam 4 Score")
m<-lm(scores$A4~scores$A1)
abline(m)

summary(m)

## 
## Call:
## lm(formula = scores$A4 ~ scores$A1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.449  -5.738  -0.088   7.563  20.551 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  21.2425    10.8035   1.966    0.062 .  
## scores$A1     0.7335     0.1349   5.438 1.84e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.17 on 22 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.5734, Adjusted R-squared:  0.554 
## F-statistic: 29.57 on 1 and 22 DF,  p-value: 1.838e-05

plot(m)

After fitting a regression of section A’s test 4 scores on its test 1 scores, I find evidence of a positive linear relationship, although the fit has some weaknesses. The p-value for the test 1 coefficient, 1.84e-05, is much smaller than the significance level of .05, so I reject the null hypothesis that the slope is zero, that is, that the scores are uncorrelated. The R-squared of 0.5734 means that test 1 scores explain about 57% of the variation in test 4 scores, which corresponds to a correlation of roughly 0.76. The diagnostic charts do raise some concerns about the model assumptions. In the residuals vs fitted chart, the residuals at the lower fitted values are distributed quite differently from those at the higher fitted values. In the Normal Q-Q chart, most of the data points fall on the line, but the points with theoretical quantiles between -1 and -2 are concerning because they are too far away from the line.
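For a regression with a single predictor, the p-value on the slope is the same as the p-value of a Pearson correlation test, and R-squared is the square of the correlation. The sketch below checks this on simulated data; `exam1` and `exam4` are made-up vectors, not the Math 157 scores.

```r
# Simulated exam scores (hypothetical; the seed just makes the run repeatable).
set.seed(1)
exam1 <- round(runif(30, 40, 100))
exam4 <- 20 + 0.7 * exam1 + rnorm(30, sd = 10)

# Simple linear regression of exam 4 on exam 1.
fit <- lm(exam4 ~ exam1)

# p-value for H0: slope = 0, taken from the coefficient table.
slope_p <- summary(fit)$coefficients["exam1", "Pr(>|t|)"]

# p-value for H0: correlation = 0 (Pearson is the default method).
cor_p <- cor.test(exam1, exam4)$p.value

# R-squared from the model and the sample correlation.
r_squared <- summary(fit)$r.squared
r <- cor(exam1, exam4)
```

Here `slope_p` equals `cor_p` and `r_squared` equals `r^2`, so a small slope p-value is evidence for a linear relationship rather than against one.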

plot(scores$B1, scores$B4, col="blue", main="Scores on Exam 1 and Exam 4 in Math 157 B", xlab="Exam 1 Score", ylab="Exam 4 Score")
n<-lm(scores$B4~scores$B1)
abline(n)

summary(n)

## 
## Call:
## lm(formula = scores$B4 ~ scores$B1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.442  -6.869   3.051   8.624  23.508 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  22.0561    15.3364   1.438  0.16188   
## scores$B1     0.6884     0.1911   3.603  0.00125 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.71 on 27 degrees of freedom
## Multiple R-squared:  0.3247, Adjusted R-squared:  0.2997 
## F-statistic: 12.98 on 1 and 27 DF,  p-value: 0.001252

plot(n)

After fitting a regression of section B’s test 4 scores on its test 1 scores, I again find evidence of a positive linear relationship, although it is weaker than in section A. The p-value for the test 1 coefficient, 0.00125, is much smaller than the significance level of .05, so I reject the null hypothesis that the slope is zero, that is, that the scores are uncorrelated. The R-squared of 0.3247 means that test 1 scores explain only about 32% of the variation in test 4 scores, which corresponds to a correlation of roughly 0.57. The diagnostic charts raise some concerns about the model assumptions. The residuals vs fitted chart shows no clean pattern but forms a sort of wave, which may hint that a straight line does not fully describe the relationship. In the Normal Q-Q chart, most of the data points fall on the line, but the points with theoretical quantiles between -1 and -2 are concerning because they are too far away from the line.

Some other interesting analyses would require gathering data on how much each student studied for each test. That data could show which section studied more, which test was studied for more, and how much influence study time had on exam success. Another test would be to compare the variances of the two sections to determine which class has the more volatile scores.
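One possible way to compare the volatility of two sections is an F test of the ratio of variances, available in base R as `var.test`. The sketch below uses hypothetical score vectors, and the F test assumes roughly normal data, so it is only a starting point.

```r
# Hypothetical section scores (illustrative only, not the Calc 157 data).
sec_a <- c(38, 62, 70, 84, 88, 91, 95, 100)
sec_b <- c(52, 66, 70, 77, 81, 88, 91, 96)

# The F statistic is the ratio of the two sample variances.
f_stat <- var(sec_a) / var(sec_b)

# Built-in two-sided F test of H0: the variances are equal.
res <- var.test(sec_a, sec_b)
```

`res$statistic` matches `f_stat`, and `res$p.value` indicates whether the difference in spread is larger than chance alone would suggest.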

Question 2: Lion Hunting

Given the data on lion hunting success rates by group size and prey, the obvious question to answer is whether group size correlates with hunting success rate. In order to test this, I will do regression tests for group size against gazelle success, group size against Wildebeest and Zebra success, group size against other success, and group size against mean success.

library(readxl)
gazelle <- read_excel("~/R/win-library/3.4/gudatavizfa17/data/MathStatsCalcScores/gazelle.xlsx")
WildebeestandZebra <- read_excel("~/R/win-library/3.4/gudatavizfa17/data/MathStatsCalcScores/WildebeestandZebra.xlsx")
Other <- read_excel("~/R/win-library/3.4/gudatavizfa17/data/MathStatsCalcScores/Other.xlsx")
LionHunting <- read_excel("~/R/win-library/3.4/gudatavizfa17/data/MathStatsCalcScores/Lion Hunting.xlsx")

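# The model below has group size as the response, so it goes on the y-axis; otherwise abline(q) draws the wrong line.
plot(LionHunting$'G Successes', LionHunting$'Group Size', col="blue", main="Group Size Against Gazelle Success", xlab="Gazelle Success", ylab="Group Size")
q<-lm(LionHunting$'Group Size'~LionHunting$'G Successes')
abline(q)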

summary(q)

## 
## Call:
## lm(formula = LionHunting$"Group Size" ~ LionHunting$"G Successes")
## 
## Residuals:
##       1       2       3       4       5 
##  0.1310 -1.6699 -1.1292  0.7973  1.8708 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 -1.842      3.178  -0.580    0.603
## LionHunting$"G Successes"   17.915     10.766   1.664    0.195
## 
## Residual standard error: 1.655 on 3 degrees of freedom
## Multiple R-squared:   0.48,  Adjusted R-squared:  0.3066 
## F-statistic: 2.769 on 1 and 3 DF,  p-value: 0.1947

plot(q)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

After performing this regression test, I fail to reject the null hypothesis that there is no linear relationship between group size and gazelle success rate. The p-value is 0.195, which is greater than the significance level of .05. With only five observations the test has little power, so this result does not show that the two are unrelated. In the residuals vs fitted chart, the points follow a roughly linear trend with the exception of one data point, although with so few points the diagnostic charts are hard to interpret.
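In a regression with one predictor, swapping the response and the predictor changes the coefficients but not the slope p-value. The line drawn by `abline`, however, is only correct when the plot axes match the model formula. The sketch below illustrates both points with five hypothetical observations; `group_size` and `success` are made-up values, not the lion data.

```r
# Five hypothetical observations (illustrative only).
group_size <- c(1, 2, 3, 4.5, 7)
success    <- c(0.15, 0.19, 0.27, 0.31, 0.32)

# Fit the regression in both directions.
fit_yx <- lm(success ~ group_size)   # success rate as the response
fit_xy <- lm(group_size ~ success)   # group size as the response

# Slope p-values: row 2 of the coefficient table, column 4 is Pr(>|t|).
p_yx <- summary(fit_yx)$coefficients[2, 4]
p_xy <- summary(fit_xy)$coefficients[2, 4]

# The two slopes multiply to the squared correlation.
slope_product <- unname(coef(fit_yx)[2] * coef(fit_xy)[2])
r_squared <- cor(group_size, success)^2

# Plot with the response of fit_yx on the y-axis so abline() matches the model.
plot(group_size, success, xlab = "Group Size", ylab = "Success Rate")
abline(fit_yx)
```

Both p-values are identical, so the conclusion of the test does not depend on which variable is treated as the response.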

# The model below has group size as the response, so it goes on the y-axis; otherwise abline(s) draws the wrong line.
plot(LionHunting$'WZ Successes', LionHunting$'Group Size', col="blue", main="Group Size Against Wildebeest and Zebra Success", xlab="Wildebeest and Zebra Success", ylab="Group Size")
s<-lm(LionHunting$'Group Size'~LionHunting$'WZ Successes')
abline(s)

summary(s)

## 
## Call:
## lm(formula = LionHunting$"Group Size" ~ LionHunting$"WZ Successes")
## 
## Residuals:
##       1       2       3       4       5 
## -0.9579 -1.9590  1.3055  0.3218  1.2896 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                  0.4526     1.8665   0.243    0.824
## LionHunting$"WZ Successes"   9.9348     5.9779   1.662    0.195
## 
## Residual standard error: 1.656 on 3 degrees of freedom
## Multiple R-squared:  0.4793, Adjusted R-squared:  0.3058 
## F-statistic: 2.762 on 1 and 3 DF,  p-value: 0.1951

plot(s)

After performing this regression test, I fail to reject the null hypothesis that there is no linear relationship between group size and Wildebeest and Zebra success rate. The p-value is 0.195, which is greater than the significance level of .05. In the normal q-q chart, the data fall on or near the line with one exception, so the residuals give no strong reason to doubt the test itself.

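# The model below has group size as the response, so it goes on the y-axis; otherwise abline(t) draws the wrong line.
plot(LionHunting$'O Successes', LionHunting$'Group Size', col="blue", main="Group Size Against Other Success", xlab="Other Success", ylab="Group Size")
t<-lm(LionHunting$'Group Size'~LionHunting$'O Successes')
abline(t)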

summary(t)

## 
## Call:
## lm(formula = LionHunting$"Group Size" ~ LionHunting$"O Successes")
## 
## Residuals:
##       1       2       3       4       5 
## -1.9260 -1.7487  0.1257  2.0265  1.5226 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                  4.477      1.799   2.489   0.0885 .
## LionHunting$"O Successes"   -8.015     10.447  -0.767   0.4988  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.098 on 3 degrees of freedom
## Multiple R-squared:  0.164,  Adjusted R-squared:  -0.1146 
## F-statistic: 0.5887 on 1 and 3 DF,  p-value: 0.4988

plot(t)

After performing this regression test, I fail to reject the null hypothesis that there is no linear relationship between group size and Other success rate. The p-value is 0.4988, which is greater than the significance level of .05. Neither the residuals vs fitted chart nor the normal q-q chart gives a reason to doubt this conclusion.

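# The model below has group size as the response, so it goes on the y-axis; otherwise abline(c) draws the wrong line.
plot(LionHunting$'Mean Success Rate', LionHunting$'Group Size', col="blue", main="Group Size Against Mean Success", xlab="Mean Success", ylab="Group Size")
c<-lm(LionHunting$'Group Size'~LionHunting$'Mean Success Rate')
abline(c)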

summary(c)

## 
## Call:
## lm(formula = LionHunting$"Group Size" ~ LionHunting$"Mean Success Rate")
## 
## Residuals:
##       1       2       3       4       5 
##  0.4121 -1.7421 -0.2170  0.0583  1.4886 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                       -3.049      2.735  -1.115   0.3462  
## LionHunting$"Mean Success Rate"   23.222      9.758   2.380   0.0976 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.35 on 3 degrees of freedom
## Multiple R-squared:  0.6537, Adjusted R-squared:  0.5383 
## F-statistic: 5.664 on 1 and 3 DF,  p-value: 0.09763

plot(c)

This was likely the most relevant of the tests I have performed for question 2. After performing this regression test, I fail to reject the null hypothesis that there is no linear relationship between group size and Mean success rate. The p-value is 0.0976, which is greater than the significance level of .05, although it is the smallest of the four tests and would be significant at the .1 level. Again, apart from single outliers, the residuals vs fitted and normal q-q charts give no reason to doubt this conclusion.

Now that I have tested group size against success rate, I would like to approach the question from the other direction and try to determine which group size is most efficient for which prey. Additionally, for the most part, hunts were more successful with larger group sizes, but that does not necessarily mean the lions go less hungry, as each meal is spread across more mouths. I would like to see which group size results in the lions having the fullest stomachs.
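One rough way to frame the "fullest stomachs" question is expected food per lion per hunt: success rate times prey mass, divided by group size. In the sketch below every number is hypothetical, including the success rates and the single prey mass `prey_kg`. It shows only the arithmetic, not a result from the lion data.

```r
# All values are hypothetical placeholders for illustration.
group_size   <- c(1, 2, 3, 4, 6)
success_rate <- c(0.15, 0.20, 0.25, 0.30, 0.35)
prey_kg      <- 100   # assumed edible mass of one kill

# Expected kilograms per lion per hunt attempt.
kg_per_lion <- success_rate * prey_kg / group_size

# Group size with the highest expected intake per lion.
best <- group_size[which.max(kg_per_lion)]
```

With these made-up numbers the success rate rises more slowly than the group size, so the per-lion intake falls as the group grows. Real data with prey-specific masses could give a different answer.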

Question 3: Life Expectancy with Contraceptive Use

Given the data on different countries’ life expectancies and the percentage of their populations that uses contraceptives, the obvious question to address is whether there is a correlation between contraceptive use and life expectancy. To test this, I will fit a linear regression.

library(readxl)
LifeExpectancy <- read_excel("~/R/win-library/3.4/gudatavizfa17/data/MathStatsCalcScores/Life Expectancy.xlsx")

# The model below has life expectancy as the response, so it goes on the y-axis; otherwise abline(p) draws the wrong line.
plot(LifeExpectancy$`Contraceptive use(%)`, LifeExpectancy$`Life Expectancy`, col="blue", main="Life Expectancy against Contraceptive Use", xlab="Contraceptive Use", ylab="Life Expectancy")
p<-lm(LifeExpectancy$`Life Expectancy`~LifeExpectancy$`Contraceptive use(%)`)
abline(p)

summary(p)

## 
## Call:
## lm(formula = LifeExpectancy$`Life Expectancy` ~ LifeExpectancy$`Contraceptive use(%)`)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.9879  -2.9394   0.3244   3.5552  17.6975 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)                           57.40771    1.37022  41.897   <2e-16
## LifeExpectancy$`Contraceptive use(%)`  0.25235    0.02545   9.917   <2e-16
##                                          
## (Intercept)                           ***
## LifeExpectancy$`Contraceptive use(%)` ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.231 on 131 degrees of freedom
## Multiple R-squared:  0.4288, Adjusted R-squared:  0.4244 
## F-statistic: 98.34 on 1 and 131 DF,  p-value: < 2.2e-16

plot(p)

After fitting a regression of life expectancy on contraceptive use across countries, I find strong evidence of a positive linear relationship, although the fit is not perfect. The p-value from the regression is less than 2e-16, far below the significance level of .05, so I reject the null hypothesis that the slope is zero, that is, that life expectancy and contraceptive use are uncorrelated. The R-squared of 0.4288 means that contraceptive use explains about 43% of the variation in life expectancy, which corresponds to a correlation of roughly 0.65. The Normal Q-Q chart raises some concern about the model assumptions. While a majority of the data points fall on the line, the points with theoretical quantiles less than -1 and greater than 2 drift away from it. That is enough to question whether the residuals are normally distributed, but it does not undo the evidence of a correlation.
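A visual reading of the Q-Q chart can be supplemented with a formal check of the residuals. The sketch below pulls the residuals from a fitted model and runs a Shapiro-Wilk test. The data are simulated, with `use_pct` and `life_exp` as hypothetical stand-ins loosely shaped like the fit above, so the numbers say nothing about the real countries.

```r
# Simulated stand-in data (hypothetical; the seed just makes the run repeatable).
set.seed(42)
use_pct  <- runif(100, 5, 85)
life_exp <- 57 + 0.25 * use_pct + rnorm(100, sd = 6)

fit <- lm(life_exp ~ use_pct)

# Residuals of a model with an intercept sum to zero (up to rounding).
res <- resid(fit)

# Shapiro-Wilk test of H0: the residuals come from a normal distribution.
sw <- shapiro.test(res)

# Q-Q plot of the residuals with a reference line.
qqnorm(res)
qqline(res)
```

A small `sw$p.value` would suggest non-normal residuals. That affects how far the standard errors can be trusted, but it is a separate question from whether the two variables are correlated.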

Other factors that would be relevant to explore are how much access each country has to contraceptives and how many children each family has on average. I think life expectancy could be more strongly correlated with contraceptive access than with contraceptive use.

Math 422 R HW 1

Brian Grob

April 8, 2018
