This week, I spent hours working through the first chapter of Stat2, Simple Linear Regression. After the reading, I worked on the practice problems, creating the graphs in R and using R’s computational abilities to find equations for least-squares regression lines. While reading through the chapter, I was reminded of concepts I had learned previously in AP Stats, such as the formula for a linear model, residual plots, and checking conditions. While I already knew these, the new chapter delved much deeper into them. In linear models, new notation was introduced for the intercept and slope: the Greek letter beta, with β0 for the intercept and β1 for the slope. On the topic of residual plots, much more was introduced, such as new types of graphs (see graph to the right, labeled “Normal Q-Q”). Normal quantile plots are essentially scatterplots of the residuals against the values we would expect from a normal sample of the same size. These graphs are useful for detecting whether the errors are normal, since models with nonnormal errors tend to show skew that departs from the reference line. Another graph I found interesting that I had not seen before was a histogram of residuals (see graph to the right titled “Histogram of textResid”). At first, I didn’t understand how a histogram of residuals would reveal anything new about the data, but I soon realized it is helpful for checking the zero mean condition. When using least-squares regression, the sample mean of the residuals is always zero, so the histogram of the residuals must be centered around zero. The histogram is also another way to check that the errors follow a normal distribution, with a bell-shaped curve.
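To keep these two checks concrete, here is a minimal sketch using R’s built-in cars data (chosen so the example stands alone; it is not one of the chapter’s datasets):
# fit any simple linear model; here, stopping distance on speed
fit = lm(dist ~ speed, data = cars)
res = resid(fit)
# normal quantile plot: points fall roughly on the line if the errors are normal
qqnorm(res)
qqline(res)
# histogram of residuals: should be bell-shaped and centered at zero
hist(res)
# the sample mean of the residuals is zero (up to rounding) by construction
mean(res)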
This chapter also introduced the idea of reexpression rather than transformation. In statistics, we are usually interested in finding differences, such as comparing one group with another. A common form of reexpression uses logs: logs turn almost any multiplication and division into addition and subtraction, allowing us to work with differences. While logs are the most common choice, raising a variable to a power is also common. Logs and other forms of reexpression can also improve graphs, such as when a scatterplot shows curvature or a distribution is skewed. Linearity is very important for accurately seeing a relationship between variables, and by applying a log reexpression, we can often see these relationships much more clearly.
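Here is a minimal sketch of log reexpression on simulated data (the power relationship and noise level are made up for illustration):
# a power relationship y = 2 * x^1.5 with multiplicative noise looks curved
set.seed(1)
x = runif(100, 1, 50)
y = 2 * x^1.5 * exp(rnorm(100, sd = 0.2))
plot(x, y) # curved on the raw scale
plot(log(x), log(y)) # roughly linear after taking logs of both variables
# on the log scale, lm() recovers the power as the slope (about 1.5)
summary(lm(log(y) ~ log(x)))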
While working through the exercises, I was really excited to integrate R by using its graphing and summary statistic capabilities. I completed problems ranging from finding the intercepts of least-squares regression lines to assessing residual plots of textbook prices and checking their conditions. When doing this, I sometimes struggled to decide which residual graph was most appropriate to use. Then I realized that running through the conditions was the solution to my problem: to show that a linear model truly met its conditions, I would create graphs such as histograms of residuals to check the zero mean condition, or residual scatterplots to check the constant variance condition. Overall, I really enjoyed completing the practice problems because I was able to connect the theory I had learned in the chapter with graphing and summary statistics.
This week, I also spent a few hours trying to figure out how to use R Markdown to create a website. While I have made a website, I have struggled with how to format and integrate my code into it. Because of this, I have included a link to a GitHub README.md file at the top of this doc to show my code. Most of the data are in the package “Stat2Data”, which can easily be imported, which helps save storage space. This weekend, I will keep working out how to integrate my code into the website so that I can show not only the code but also my graphs alongside it. I also aim to create a tab on the website for these reflections, and a tab with links to my homework assignments in these docs. I could also create a “homework” tab with files that format each problem alongside the code used to solve it, making it easier to interpret as a whole. I also plan to continue reading through chapters 2 and 3 this week and over the weekend.
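A minimal sketch of the R Markdown “simple site” workflow I am aiming for (the page names here are hypothetical): each .Rmd file in the project root becomes a page, a _site.yml file lists the navbar tabs, and rendering knits the code chunks and their graphs into HTML together:
# hypothetical layout: index.Rmd, reflections.Rmd, homework.Rmd in the project
# root, plus a _site.yml that lists each page as a navbar tab
library(rmarkdown)
render_site() # knits every page in the project, embedding code and plots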
# import libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(skimr)
library(gapminder)
library(Stat2Data)
# import data
leaf = read.csv("~/desktop/LeafWidth.csv")
leaf %>%
summarize(mean_Width_avg = mean(Width))
## mean_Width_avg
## 1 3.173037
plot(ggplot(leaf, aes(x = Year, y = Width)) +
geom_point() +
stat_smooth(method = lm, se = FALSE))
## `geom_smooth()` using formula 'y ~ x'
The fitted regression model is Width(hat) = 37.72 - 0.0176(Year)
fm = lm(Width ~ Year, data = leaf)
summary(fm)
##
## Call:
## lm(formula = Width ~ Year, data = leaf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1214 -1.1253 -0.3136 0.9320 5.4144
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.723091 8.574977 4.399 1.61e-05 ***
## Year -0.017560 0.004358 -4.029 7.43e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.424 on 250 degrees of freedom
## Multiple R-squared: 0.06098, Adjusted R-squared: 0.05723
## F-statistic: 16.24 on 1 and 250 DF, p-value: 7.425e-05
In the context of this setting, each additional year is associated with a decrease of about 0.0176 mm in a leaf’s width.
The predicted width of the leaves in 1966 is 37.723 - 0.01756(1966) ≈ 3.20 mm. (Rounding the slope to 0.0176 before predicting gives 3.12 instead, so it is worth carrying the full coefficients through the calculation.)
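We can confirm this with predict(), which uses the unrounded coefficients stored in the model object:
# predicted leaf width (mm) for Year = 1966, at full precision
predict(fm, newdata = data.frame(Year = 1966))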
GlowWorms = read.csv("~/desktop/GlowWorms.csv")
regressionWorm = lm(Eggs ~ Lantern, data = GlowWorms)
summary(regressionWorm)
##
## Call:
## lm(formula = Eggs ~ Lantern, data = GlowWorms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.50 -23.59 -3.20 22.95 63.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.977 21.869 -0.410 0.685087
## Lantern 7.325 1.757 4.169 0.000343 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.71 on 24 degrees of freedom
## Multiple R-squared: 0.4201, Adjusted R-squared: 0.3959
## F-statistic: 17.38 on 1 and 24 DF, p-value: 0.0003431
plot(ggplot(GlowWorms, aes(x = Lantern, y = Eggs)) +
geom_point() +
stat_smooth(method = lm, se = FALSE))
## `geom_smooth()` using formula 'y ~ x'
The fitted regression model is Eggs(hat) = -8.977 + 7.325(Lantern)
Each time the lantern size increases by 1 mm, the predicted number of eggs laid increases by about 7.325.
For a glow-worm with a 14 mm lantern, the predicted number of eggs she will lay is -8.977 + 7.325(14) ≈ 93.573.
There is a strong linear relationship between the listing price and the sale price of a house in Grinnell.
house = read.csv("~/desktop/GrinnellHouses.csv")
plot(ggplot(house, aes(x = ListPrice, y = SalePrice)) +
geom_point())
regressionHouse = lm(SalePrice ~ ListPrice, data = house)
summary(regressionHouse)
##
## Call:
## lm(formula = SalePrice ~ ListPrice, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55942 -3275 846 4141 44168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.448e+02 5.236e+02 -0.277 0.782
## ListPrice 9.431e-01 3.201e-03 294.578 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8019 on 927 degrees of freedom
## Multiple R-squared: 0.9894, Adjusted R-squared: 0.9894
## F-statistic: 8.678e+04 on 1 and 927 DF, p-value: < 2.2e-16
The fitted regression model is SalePrice(hat) = -144.8 + 0.943(ListPrice)
If the listing price of a house in Grinnell increases by a dollar, the sale price will increase by about $0.94.
Using the regression equation we created earlier, the predicted sale price of a house listed at $95,000 is about $89,450.
Listing price - Sale price = $99,500 - $95,000 = $4,500
Yes, the regression line is a good summary of the relationship between listing price and sale price, because the linear relationship between the two variables is very strong (r-squared of about 0.99).
bird = read.csv("~/desktop/Sparrows.csv")
plot(ggplot(bird, aes(x = Weight, y = WingLength)) +
geom_point() +
stat_smooth(method = lm, se = FALSE))
## `geom_smooth()` using formula 'y ~ x'
birdModel = lm(WingLength ~ Weight, data = bird)
summary(birdModel)
##
## Call:
## lm(formula = WingLength ~ Weight, data = bird)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1170 -1.4080 0.4279 1.3513 6.1457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.75465 1.39601 6.271 6.62e-09 ***
## Weight 1.31341 0.09756 13.463 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.346 on 114 degrees of freedom
## Multiple R-squared: 0.6139, Adjusted R-squared: 0.6105
## F-statistic: 181.3 on 1 and 114 DF, p-value: < 2.2e-16
# histogram of residuals
ggplot(bird, aes(x = birdModel$residuals)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# normal probability plot of residuals
bird.stdres = rstandard(birdModel)
qqnorm(bird.stdres)
# reference line the points should follow if the residuals are normal
qqline(bird.stdres)
# RailsTrails exercise (chunk not run)
train = read.csv("~/desktop/Railsrrails1.csv")
plot(ggplot(train, aes(x = SquareFeet, y = Adj2007)) +
geom_point())
regressionTrain = lm(Adj2007 ~ SquareFeet, data = train)
summary(regressionTrain)
ggplot(train, aes(x = SquareFeet, y = Adj2007)) +
geom_point() +
stat_smooth(method = lm, se = FALSE)
# residual plot
resTrain = resid(regressionTrain)
plot(fitted(regressionTrain), resTrain)
abline(0, 0)
# normal probability plot of residuals
train.stdres = rstandard(regressionTrain)
qqnorm(train.stdres)
qqline(train.stdres, col = "steelblue", lwd = 2)
# histogram of residuals
ggplot(train, aes(x = regressionTrain$residuals)) +
geom_histogram()
bug = read.csv("~/desktop/Caterpillars.csv")
# regress WetFrass on Mass
bugModel = lm(WetFrass ~ Mass, data = bug)
summary(bugModel)
##
## Call:
## lm(formula = WetFrass ~ Mass, data = bug)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.65454 -0.04796 -0.03336 -0.01014 1.50828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.033198 0.027436 1.21 0.227
## Mass 0.247696 0.007463 33.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3885 on 265 degrees of freedom
## Multiple R-squared: 0.8061, Adjusted R-squared: 0.8054
## F-statistic: 1102 on 1 and 265 DF, p-value: < 2.2e-16
plot(ggplot(data = bug, aes(x = Mass, y = WetFrass)) +
geom_point() +
stat_smooth(method = lm, se = FALSE))
## `geom_smooth()` using formula 'y ~ x'
As the mass increases past 10 grams, the amount of waste a caterpillar produces falls below the least-squares regression line, while before the mass reaches 10 grams it trends above the line.
# Log graph
ggplot(data = bug) +
geom_point(aes(x = log(Mass), y = log(WetFrass)))
The plot of logs is more linear with less curvature than the original plot.
# fit the model on the log scale
bugModel = lm(LogWetFrass ~ LogMass, data = bug)
ggplot(bug, aes(x = LogMass, y = LogWetFrass)) +
geom_point() +
stat_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
I would prefer the graph in part (b) because the positive linear association is much clearer, which makes it easier to predict.
Because of the linear trend in the graph, the relationship seems to be consistent for all stages of the caterpillar’s life.
stamp = read.csv("~/downloads/USStamp.csv")
stampLM = lm(Price ~ Year, data = stamp)
ggplot(stamp, aes(x = Year, y = Price)) +
geom_point()
# remove first four observations
stampNew = stamp %>%
slice(5:45)
stampLM1 = lm(Price ~ Year, data = stampNew)
ggplot(stampNew, aes(x = Year, y = Price)) +
geom_point() + stat_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
summary(stampLM1)
##
## Call:
## lm(formula = Price ~ Year, data = stampNew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9232 -0.9478 0.1195 1.1899 4.5325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.647e+03 4.686e+01 -35.15 <2e-16 ***
## Year 8.410e-01 2.357e-02 35.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.737 on 19 degrees of freedom
## Multiple R-squared: 0.9853, Adjusted R-squared: 0.9845
## F-statistic: 1273 on 1 and 19 DF, p-value: < 2.2e-16
# plot residuals
stampRes = resid(stampLM1)
plot(fitted(stampLM1), stampRes)
abline(0,0)
After around 1952, the graph appears to curve upward more steeply.
The equation for the new regression line is Price(hat) = -1647 + 0.84(Year)
I think the regression line is a good fit: the r-squared value is 0.985, which is very close to 1, meaning the model explains nearly all of the variation in price.
Yes, there is roughly equal spread above and below the zero line of the residual plot, which supports using a linear model.
An unusual residual is the first data point, at around (0, 4) on the residual plot.
text = read.csv("~/downloads/TextPrices.csv")
ggplot(text, aes(x = Pages, y = Price)) +
geom_point()
# regression line
textRes = lm(Price ~ Pages, text)
summary(textRes)
##
## Call:
## lm(formula = Price ~ Pages, data = text)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.475 -12.324 -0.584 15.304 72.991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.42231 10.46374 -0.327 0.746
## Pages 0.14733 0.01925 7.653 2.45e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.76 on 28 degrees of freedom
## Multiple R-squared: 0.6766, Adjusted R-squared: 0.665
## F-statistic: 58.57 on 1 and 28 DF, p-value: 2.452e-08
ggplot(text, aes(x = Pages, y = Price)) +
geom_point() + stat_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# built-in diagnostic plots for the lm fit (residuals vs. fitted, normal Q-Q,
# scale-location, residuals vs. leverage)
plot(textRes)
# Residual plot
textResid = resid(textRes)
plot(fitted(textRes), textResid)
abline(0,0)
hist(textResid)
I plotted the price of the textbooks against the number of pages in each book, and found that as the number of pages increases, so does the price.
The equation for the regression is Price(hat) = -3.42 + 0.147(Pages)
See the residual graphs above.
Zero Mean: Because the residuals are scattered evenly on either side of zero, the histogram is centered at zero and the sample mean of the residuals is in fact zero (a quick numeric check follows this list).
Constant Variance: Looking at the zero line on the scatterplot of residuals, we can see an even spread of points above and below the line that stays roughly the same across the range of fitted values, supporting the constant variance condition.
Randomness and Independence: We cannot tell from a graph whether these conditions are satisfied, but we can judge from the context given. Since the data are a random sample of books from a college bookstore, we can assume the sample is representative of most majors, which allows us to extend the results to most majors and colleges.
Normality: The histogram of residuals shows a bell-shaped curve, suggesting a normal distribution. The normal quantile plot shows a linear trend that supports the normality condition, though we need to be wary of the points at the left and right tails that stray from the line.
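The zero mean property can be verified numerically for the textbook model fitted earlier:
# residuals from a least-squares fit with an intercept average to zero
# (up to floating-point rounding error)
mean(resid(textRes))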