This week, I worked through and reviewed the second chapter of Stat2, on Inference for Simple Linear Regression. After finishing the reading, I answered practice exercises from the back of the chapter, using R both for graphing and for computing statistical values.
Many concepts were introduced, with the importance of the t-test emphasized. The t-test for slope assesses whether the slope estimated from a set of sample data differs from zero. Its test statistic equals the estimated slope β1 divided by the standard error of the slope, SE(β1). If this test shows that the slope of the regression line is significantly different from zero, we can conclude that there is a linear relationship between the variables. Computing the test statistic leads to another important value, the p-value.
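As a minimal sketch of this calculation (assuming a hypothetical data frame df with columns x and y, not one of the chapter's datasets), the t-statistic can be recovered by hand from the coefficient table that summary() returns:
# fit a simple linear model (df, x, and y are hypothetical placeholders)
fit = lm(y ~ x, data = df)
coefs = summary(fit)$coefficients
# t = estimated slope / standard error of the slope
coefs["x", "Estimate"] / coefs["x", "Std. Error"]   # matches coefs["x", "t value"]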
The p-value is a probability, and a useful way to summarize the strength of evidence in the model. The chapter stressed using the p-value correctly and not abusing it. A p-value below the significance level (usually 0.05) signals a statistically significant association between the variables, while a larger p-value signals weak or no evidence of one. A large p-value does not prove the null hypothesis; it only means the data do not provide evidence against it. This section also pointed out how easy it now is to compute a p-value with technology such as R. While doing regression in R, you can simply plug a variable into a model and immediately obtain the p-value, something that took far more time thirty years ago. I had not truly understood the significance of the p-value before, and this section taught me to take it with a grain of salt and to use it as a guide for finding associations, not causation.
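For instance (continuing the same hypothetical fit from the sketch above), the two-sided p-value that summary() reports for the slope comes from the t-distribution with n - 2 degrees of freedom:
fit = lm(y ~ x, data = df)                    # same hypothetical fit as above
t_stat = summary(fit)$coefficients["x", "t value"]
2 * pt(-abs(t_stat), df = df.residual(fit))   # matches Pr(>|t|) for the slope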
This chapter introduced ANOVA, or analysis of variance, for simple linear models. The purpose of ANOVA is to “measure how much of the variability in the response variable is explained by the predictions based on the fitted model.” When I first read this, I simply did not understand the jumble of words. Thankfully, visuals, equations, and simpler wording were provided to clarify it. My own understanding of ANOVA came from realizing that it splits the variability into two pieces, the model and the residuals: the total variability in the response equals the variability explained by the model plus the unexplained variability left in the residuals.
Another important topic introduced was the coefficient of determination, R^2. R is the correlation coefficient, while R^2 is a measure of how much of the variance in Y is explained by the linear model.
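Here is a minimal sketch of both ideas, again using the hypothetical fit from above: the total variability in the response splits into a piece explained by the model and a piece left in the residuals, and R^2 is the explained fraction.
resp = model.response(model.frame(fit))       # the response variable y
ss_total = sum((resp - mean(resp))^2)         # total variability (SSTotal)
ss_model = sum((fitted(fit) - mean(resp))^2)  # variability explained by the model (SSModel)
ss_error = sum(resid(fit)^2)                  # unexplained variability in the residuals (SSE)
all.equal(ss_total, ss_model + ss_error)      # SSTotal = SSModel + SSE
ss_model / ss_total                           # R^2, matches summary(fit)$r.squared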
I liked this chapter because it talked a lot about not relying solely on statistical measures to check for linear relationships. It stressed using common sense, something I had not seen before in a textbook. It suggested asking questions that have nothing to do with the numbers themselves, such as “Even if we suppose that X and Y are related, what difference does it make?” This made me think back to the p-values, and how important it is not to judge an entire model and its strength simply by glancing at a single statistic such as a p-value.
I chose to do the homework in RPubs, which lets me keep my graphs, R code, and answers all in one place, accessible from a single link. The homework problems used simple linear regression and ANOVA, so I had to learn the functions that perform ANOVA in R. After some internet searching, I found aov(), a function with the same formula structure as lm(). I did problems that integrated both regression and ANOVA, which gave me another way to check whether my model was statistically significant.
## import libraries
library(ggplot2)
library(Stat2Data)
library(broom)
library(AICcmodavg)
library(plotrix)
# Produce a scatterplot
brain = read.csv("~/downloads/BrainpH.csv")
ggplot(brain, aes(x = Age, y = pH)) +
geom_point() +
stat_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
brainLM = lm(pH ~ Age, data = brain)
summary(brainLM)
##
## Call:
## lm(formula = pH ~ Age, data = brain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56976 -0.21781 0.02032 0.16801 0.38649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.8881113 0.1321194 52.13 <2e-16 ***
## Age -0.0003905 0.0022944 -0.17 0.866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.235 on 52 degrees of freedom
## Multiple R-squared: 0.0005566, Adjusted R-squared: -0.01866
## F-statistic: 0.02896 on 1 and 52 DF, p-value: 0.8655
There does not appear to be a linear relationship: the fitted line is close to horizontal, meaning its slope is close to zero, and there is a great deal of scatter in the data.
# anova test
brainAN = aov(pH ~ Age, data = brain)
summary(brainAN)
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 1 0.0016 0.00160 0.029 0.866
## Residuals 52 2.8717 0.05522
The null hypothesis is that age is not a predictor of brain tissue pH at death. The alternative hypothesis is that age is a predictor of brain tissue pH at death. To test these hypotheses, I performed an ANOVA (analysis of variance) test on the data in R. The p-value of the F-statistic is 0.866, which is much greater than the significance level of 0.05. Because the p-value is so large, we fail to reject the null hypothesis, meaning we do not have evidence that age predicts brain tissue pH at death.
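As an optional cross-check (not part of the exercise), the same ANOVA table can be produced from the linear model fit computed earlier, since anova() applied to an lm object gives the identical F-statistic and p-value:
# same F-test as summary(brainAN), but obtained from the lm object
anova(brainLM)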
## 2.15
cereal = read.csv("~/desktop/Cereal.csv")
# produce a scatterplot
ggplot(cereal, aes(x = Sugar, y = Calories)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# anova test
cerealAN = aov(Calories ~ Sugar, data = cereal)
summary(cerealAN)
## Df Sum Sq Mean Sq F value Pr(>F)
## Sugar 1 4567 4567 12.3 0.0013 **
## Residuals 34 12626 371
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# linear model
cerealLM = lm(Calories ~ Sugar, data = cereal)
summary(cerealLM)
##
## Call:
## lm(formula = Calories ~ Sugar, data = cereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.428 -9.832 0.245 8.909 40.322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.4277 5.1627 16.935 <2e-16 ***
## Sugar 2.4808 0.7074 3.507 0.0013 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.27 on 34 degrees of freedom
## Multiple R-squared: 0.2656, Adjusted R-squared: 0.244
## F-statistic: 12.3 on 1 and 34 DF, p-value: 0.001296
a. Null: Sugar content and calories in cereal have no linear relationship.
Alternative: There is a linear relationship between sugar and calories in cereal.
The equation of the fitted model is Calories(hat) = 87.428 + 2.4808(Sugar), the slope test statistic is 3.51, and the p-value is 0.001296. The p-value is below the significance level of 0.05, allowing us to reject the null hypothesis and conclude that there is a linear relationship between sugar and calories in cereal.
b. The 95% confidence interval is (1.04, 3.92). This means we are 95% confident that, as sugar increases by 1 gram, the average calories increase by between 1.04 and 3.92.
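That interval can also be obtained in R with confint() on the fitted model (a quick sketch using the cerealLM object above; the endpoints may differ slightly from hand-rounded values):
# 95% confidence interval for the Sugar slope
confint(cerealLM, "Sugar", level = 0.95)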
Other work is in the Google Doc linked here: https://docs.google.com/document/d/1trVckAvZGUJoQJKwn_eJeVRUx4Waq1L82QjGpeb8kUo/edit
lewy = read.csv("~/downloads/LewyDLBad.csv")
# create scatterplot
ggplot(lewy, aes(x = APC, y = MMSE)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# linear model
lewylm = lm(MMSE ~ APC, data = lewy)
summary(lewylm)
##
## Call:
## lm(formula = MMSE ~ APC, data = lewy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1791 -1.6991 -0.1081 1.9911 3.3963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.4359 0.6858 -3.552 0.00228 **
## APC 1.3444 0.4225 3.182 0.00516 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.184 on 18 degrees of freedom
## Multiple R-squared: 0.36, Adjusted R-squared: 0.3245
## F-statistic: 10.13 on 1 and 18 DF, p-value: 0.005161
# residual model
lewyResid = resid(lewylm)
plot(fitted(lewylm), lewyResid, main = "Residuals without log transformation")
abline(0,0)
hist(lewyResid, main = "No log transformation")
# quantile-quantile plot
qqnorm(lewyResid, pch = 1, frame = FALSE)
qqline(lewyResid)
# add log transformation to residual histogram
hist(log(lewyResid), main = "Log Transformation", breaks = 10)
## Warning in log(lewyResid): NaNs produced
The histogram of the residuals does not show a normal distribution, suggesting that this linear model does not meet the normality condition. Because of this, I decided to try a log transformation in hopes of producing a better residual distribution. (The NaN warning appears because many residuals are negative, and log() is undefined for negative values, so those points are dropped.) Even with the log transformation, the histogram still does not show a normal distribution. The residuals do not appear to be centered at 0 either, so the zero-mean condition does not seem to be met.
The slope of this graph is 1.3444 with a standard error of 0.4225.
spar = read.csv("~/downloads/Sparrows.csv")
ggplot(spar, aes(x = WingLength, y = Weight)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# linear summary
sparlm = lm(Weight ~ WingLength, data = spar)
summary(sparlm)
##
## Call:
## lm(formula = Weight ~ WingLength, data = spar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5440 -0.9935 0.0809 1.0559 3.4168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.36549 0.95731 1.426 0.156
## WingLength 0.46740 0.03472 13.463 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.4 on 114 degrees of freedom
## Multiple R-squared: 0.6139, Adjusted R-squared: 0.6105
## F-statistic: 181.3 on 1 and 114 DF, p-value: < 2.2e-16
The p-value is less than 2.2e-16, which is below the alpha level of 0.05, so we reject the null hypothesis, meaning that the slope of the least squares regression line is different from zero.
As WingLength increases by 1mm, we are 95% confident that the average weight of the sparrow will increase by between 0.4 and 0.5 grams in the population of all sparrows.
The confidence interval does not contain zero. This supports our finding in part a that the slope of the regression model is not zero, indicating that WingLength is a useful predictor of sparrow weight.
rail = read.csv("~/desktop/RailsTrails.csv")
# linear regression model
ggplot(rail, aes(x = squarefeet, y = adj2007)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# summary
raillm = lm(formula = adj2007 ~ squarefeet, data = rail)
summary(raillm)
##
## Call:
## lm(formula = adj2007 ~ squarefeet, data = rail)
##
## Residuals:
## Min 1Q Median 3Q Max
## -131.940 -36.635 -4.109 32.470 153.323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.973 15.541 4.695 8.32e-06 ***
## squarefeet 162.526 9.351 17.381 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53 on 102 degrees of freedom
## Multiple R-squared: 0.7476, Adjusted R-squared: 0.7451
## F-statistic: 302.1 on 1 and 102 DF, p-value: < 2.2e-16
# residual plot
railResid = resid(raillm)
plot(fitted(raillm), railResid)
abline(0,0)
# quantile-quantile plot
qqnorm(railResid, frame = FALSE)
qqline(railResid)
adj2007(hat) = 72.973 + 162.526(squarefeet). As the square footage of a house increases by one square foot, the estimated adjusted 2007 price increases by about $162.53.
(See written work below.) This means that we are 90% confident that the adjusted 2007 price of a house will increase by between $41.39 and $283.66.
The first residual scatterplot shows a violation of the equal-variance condition, because there is less scatter as the predicted response gets larger. Despite this, the linear model can still be reasonable, since variability usually increases as the predicted response gets larger. The quantile-quantile plot has small tails at the left and right ends but mostly follows the reference line, suggesting that the residuals come from a roughly Normal distribution.
met = read.csv("~/desktop/MetRate.csv")
# least squares regression line
ggplot(data = met, aes(x = LogBodySize, y = LogMrate)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# linear summary
metlm = lm(LogMrate ~ LogBodySize, data = met)
summary(metlm)
##
## Call:
## lm(formula = LogMrate ~ LogBodySize, data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59203 -0.11194 0.00361 0.12118 0.47292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.30655 0.01356 96.33 <2e-16 ***
## LogBodySize 0.91641 0.01235 74.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1752 on 303 degrees of freedom
## Multiple R-squared: 0.9478, Adjusted R-squared: 0.9477
## F-statistic: 5505 on 1 and 303 DF, p-value: < 2.2e-16
# anova
metan = aov(LogMrate ~ LogBodySize, data = met)
summary(metan)
## Df Sum Sq Mean Sq F value Pr(>F)
## LogBodySize 1 169.0 169.02 5505 <2e-16 ***
## Residuals 303 9.3 0.03
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
LogMrate(hat) = 1.30655 + 0.91641(LogBodySize)
The p-value is less than 2.2e-16, which is below the alpha level of 0.05, so we reject the null hypothesis: there is sufficient statistical evidence that the slope of the regression line predicting LogMrate from LogBodySize is different from zero.
Comparing the F-statistic to an F-distribution with 1 numerator and 303 denominator degrees of freedom gives a p-value of less than 2.2e-16, which is essentially 0. Thus, we reject the null hypothesis and conclude that there is a relationship between LogMrate and LogBodySize, meaning that LogBodySize is effective for predicting LogMrate.
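That tail probability can be computed directly from the F-distribution in R (a sketch using the F-statistic and degrees of freedom from the ANOVA table above):
# P(F >= 5505) with 1 and 303 degrees of freedom
pf(5505, df1 = 1, df2 = 303, lower.tail = FALSE)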
leaf = read.csv("~/downloads/LeafWidth.csv")
# find r value
cor(leaf$Width, leaf$Year)
## [1] -0.2469483
leaflm = lm(Year ~ Width, data = leaf)
summary(leaflm)
##
## Call:
## lm(formula = Year ~ Width, data = leaf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81.823 -8.571 3.823 10.548 39.204
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1978.5313 3.0116 656.974 < 2e-16 ***
## Width -3.4728 0.8619 -4.029 7.43e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.02 on 250 degrees of freedom
## Multiple R-squared: 0.06098, Adjusted R-squared: 0.05723
## F-statistic: 16.24 on 1 and 250 DF, p-value: 7.425e-05
# anova
leafan = aov(Year ~ Width, data = leaf)
summary(leafan)
## Df Sum Sq Mean Sq F value Pr(>F)
## Width 1 6509 6509 16.24 7.43e-05 ***
## Residuals 250 100220 401
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because the r value is -0.247, there is a weak negative linear relationship between width and year. Despite this, the p-value is low, allowing us to reject the null hypothesis, meaning that this model is a usable, though weak, predictor of year based on the width of a leaf.
The R-squared value of 0.06 means that 6% of the variation in Year is explained by the linear model.
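As a quick check, squaring the correlation computed earlier recovers the Multiple R-squared reported by summary(leaflm):
# r^2 should match Multiple R-squared (about 0.061)
cor(leaf$Width, leaf$Year)^2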
See summary(leafan)
The F-statistic is the t-statistic squared, so since the F-value is 16.24, the t-statistic here is about 4.03 in magnitude (the summary shows t = -4.029).
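A quick numeric check of that relationship, using the values reported in the two summaries above:
sqrt(16.24)    # about 4.03, the magnitude of the slope's t-statistic
(-4.029)^2     # about 16.23, matching the F-statistic up to rounding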
## 2.40
grin = read.csv("~/downloads/Grinnell.csv")
# linear summary
grinlm = lm(SalePrice ~ ListPrice, data = grin)
summary(grinlm)
##
## Call:
## lm(formula = SalePrice ~ ListPrice, data = grin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55942 -3275 846 4141 44168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.448e+02 5.236e+02 -0.277 0.782
## ListPrice 9.431e-01 3.201e-03 294.578 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8019 on 927 degrees of freedom
## Multiple R-squared: 0.9894, Adjusted R-squared: 0.9894
## F-statistic: 8.678e+04 on 1 and 927 DF, p-value: < 2.2e-16
# anova
grinan = aov(SalePrice ~ ListPrice, data = grin)
summary(grinan)
## Df Sum Sq Mean Sq F value Pr(>F)
## ListPrice 1 5.580e+12 5.580e+12 86776 <2e-16 ***
## Residuals 927 5.961e+10 6.431e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1