Look into the idea of blending the Indian and Asian categories, as the census does, and simply including a statement that the Hispanic sample is not completely represented in the data.
## Warning in cbind(white_qaf, asian_qaf, black_qaf, hispanic_qaf,
## indian_qaf, : number of rows of result is not a multiple of vector length
## (arg 2)
## Warning: package 'graphics' is not available (for R version 3.5.1)
## Warning: package 'graphics' is a base package, and should not be updated
## Warning in cbind(Blue_iris, Green_iris, Brown_iris, Hazel_iris, na.rm =
## TRUE): number of rows of result is not a multiple of vector length (arg 1)
##              Df   Sum Sq Mean Sq F value Pr(>F)
## ind           4 10490243 2622561     534 <2e-16 ***
## Residuals   790  3879492    4911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 590 observations deleted due to missingness
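The `cbind` warnings above mean that the group vectors had different lengths, so R recycled values from the shorter ones to fill the matrix. A minimal sketch of one way to avoid that, assuming the goal is a one-way ANOVA across groups: put the vectors in a named list and let `stack()` build the long-format data frame. The data below are simulated stand-ins, not the class data, and the group names and sizes are hypothetical.

```r
# Simulated stand-in data (hypothetical); the real group vectors are not shown here.
set.seed(1)
groups <- list(
  white = rnorm(40, mean = 250, sd = 70),
  asian = rnorm(25, mean = 230, sd = 70),
  black = rnorm(15, mean = 200, sd = 70)
)

# stack() accepts a named list of unequal-length vectors and returns
# a data frame with columns `values` and `ind`, without recycling.
long <- stack(groups)

# One-way ANOVA of the measurement on the group label.
fit <- aov(values ~ ind, data = long)
summary(fit)
```

Because no values are recycled, the residual degrees of freedom reflect the actual number of observations in each group.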
##
## Welch Two Sample t-test
##
## data: BrownEyes and NonBrownEyes
## t = -4.5517, df = 91.256, p-value = 1.637e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -82.23230 -32.26669
## sample estimates:
## mean of x mean of y
## 202.1866 259.4361
It seems that the brown iris group has a lower qAF overall (mean 202.2 versus 259.4 for the other colors) while retaining a similar variance. And if the boxplot was not enough, the Welch t-test (t = -4.55, p = 1.6e-05) gives strong evidence of a significant difference in means between the two populations.
For question #3, let's look at the R code and justify our answer at the end.
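As a rough illustration of the comparison described above, here is a minimal sketch of a boxplot followed by a Welch t-test. The vectors `brown` and `non_brown` are simulated, hypothetical stand-ins for `BrownEyes` and `NonBrownEyes`, so the numbers will not match the output shown earlier.

```r
# Simulated stand-in data (hypothetical); not the class measurements.
set.seed(42)
brown     <- rnorm(60,  mean = 200, sd = 80)
non_brown <- rnorm(200, mean = 260, sd = 80)

# Side-by-side boxplot to compare location and spread visually.
boxplot(list(Brown = brown, NonBrown = non_brown), ylab = "qAF")

# t.test() uses var.equal = FALSE by default, which gives the Welch test
# and does not assume equal variances in the two groups.
res <- t.test(brown, non_brown)
res$p.value    # p-value for the difference in means
res$conf.int   # 95% confidence interval for mean(brown) - mean(non_brown)
```

The Welch form is a reasonable default here because the two groups differ a lot in size, which is also why the reported degrees of freedom are not a whole number.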
#Question 3 - The over-riding effect that we found was a change in qAF with age. Plot age against qAF. What do you notice? What is the problem? What might you do to the data to allow you to run a regression (similar to correlation, and same issues are relevant)? Plot the data again once you have reduced the problem.
# Let's first plot Age vs qAF without any manipulation of the data, and add a regression line to confirm whatever trend we see.
mod1 = lm(post_class$qaf ~ post_class$age)
plot(post_class$age, post_class$qaf, main = "Age vs qAF with Regression Lines", type = 'p', xlab = 'Age (yrs)', ylab = "qAF")
abline(mod1, lwd = 2)
# calculate residuals and predicted values - uncomment the lower three lines if you want.
#res = signif(residuals(mod1), 5)
#pre = predict(mod1) # plot distances between points and the regression line
#segments(post_class$age, post_class$qaf, post_class$age, pre, col="red")
#mod2 = lm(log10(post_class$qaf) ~ post_class$age)
#plot(post_class$age, log10(post_class$qaf), main = "Age vs log10(qAF) with Regression Line", type = 'p', xlab = 'Age (yrs)', ylab = "log10(qAF)")
#abline(mod2, lwd = 2)
After running linear regressions on both the unmanipulated and the log10-transformed data, the fitted regression lines indicate a linear relationship between age and qAF.
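A fitted line on its own does not show whether the regression assumptions hold, so a residual plot is a useful companion. The sketch below assumes the problem the question hints at is that the spread of qAF grows with age; that is an assumption, not something the output above confirms. The data frame `demo` is simulated and hypothetical, standing in for `post_class`.

```r
# Simulated stand-in for post_class (hypothetical): spread of qaf grows with age.
set.seed(7)
age  <- runif(200, min = 5, max = 60)
qaf  <- 10^(1.8 + 0.012 * age + rnorm(200, sd = 0.12))
demo <- data.frame(age = age, qaf = qaf)

# Fit the raw and log10-transformed models using the data argument.
mod_raw <- lm(qaf ~ age, data = demo)
mod_log <- lm(log10(qaf) ~ age, data = demo)

# Residuals vs fitted values: a funnel shape suggests non-constant variance.
op <- par(mfrow = c(1, 2))
plot(fitted(mod_raw), resid(mod_raw), main = "Raw qAF",
     xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)
plot(fitted(mod_log), resid(mod_log), main = "log10(qAF)",
     xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)
par(op)

coef(mod_log)  # intercept and slope on the log10 scale
```

If the residuals of the raw model fan out with the fitted values while those of the log10 model stay in an even band, that would support running the regression on the transformed data.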