title: ‘Stat 272: Homework #2’ subtitle: ‘Eric Besonen’ output: pdf_document: fig_height: 3 fig_width: 4.5 html_document: default word_document: default geometry: “left=2.5cm,right=2.5cm,top=2.5cm,bottom=2.5cm” editor_options: chunk_output_type: console
Complete parts (a)-(c) in the book and (d)-(f) below:
library(MASS) # for stdres() and studres() - load BEFORE tidyverse
library(Stat2Data) # if planning to use textbook data
library(mosaic) # for favstats()
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: lattice
## Loading required package: ggformula
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
## Loading required package: ggstance
##
## Attaching package: 'ggstance'
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
##
## New to ggformula? Try the tutorials:
## learnr::run_tutorial("introduction", package = "ggformula")
## learnr::run_tutorial("refining", package = "ggformula")
## Loading required package: mosaicData
## Loading required package: Matrix
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Note: If you use the Matrix package, be sure to load it BEFORE loading mosaic.
##
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
##
## mean
## The following object is masked from 'package:ggplot2':
##
## stat
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median,
## prop.test, quantile, sd, t.test, var
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
library(tidyverse) # ALWAYS load
## Registered S3 method overwritten by 'rvest':
## method from
## read_xml.response xml2
## ── Attaching packages ─────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 2.1.1 ✔ purrr 0.3.2
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ mosaic::count() masks dplyr::count()
## ✖ purrr::cross() masks mosaic::cross()
## ✖ mosaic::do() masks dplyr::do()
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ✖ mosaic::stat() masks ggplot2::stat()
## ✖ mosaic::tally() masks dplyr::tally()
data("TextPrices")
data("TextPrices")
TextPrices <- as_tibble(TextPrices)
ggplot(data = TextPrices, mapping = aes(x = Pages, y = Price)) +
geom_point() +
geom_smooth(method = "lm")
ggplot(data = TextPrices, mapping = aes(x = Pages, y = Price)) +
geom_point() +
geom_smooth(method = "lm")
model5 <- lm(Price ~ Pages, data = TextPrices)
summary(model5)
##
## Call:
## lm(formula = Price ~ Pages, data = TextPrices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.475 -12.324 -0.584 15.304 72.991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.42231 10.46374 -0.327 0.746
## Pages 0.14733 0.01925 7.653 2.45e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.76 on 28 degrees of freedom
## Multiple R-squared: 0.6766, Adjusted R-squared: 0.665
## F-statistic: 58.57 on 1 and 28 DF, p-value: 2.452e-08
Price=.147pages - 3.42
Textbook 4: 400 pages priced at $128.5 because a book at 400 pages is expected to be priced at $56. This high difference in actual vs fitted explanatory variable is an indicator of a leverage point. Another point/textbook that would fall under this category would be 9, 600 pages priced at $19.5. Its fited value should mark the book price to $80 but the book goes for 19.5.There are some books under the 400 page range which may be a cause of concern for being too cheap relative to the line of best fit.
TextPrices <- TextPrices %>%
mutate(fitted = fitted(model5),
standardized = stdres(model5),
studentized = studres(model5))
ggplot(data = TextPrices, aes(x = Pages, y = studentized)) +
geom_point() +
geom_hline(yintercept = c(-2, 2), color = "red")
TextPrices %>%
filter(studentized > 2 | studentized < -2)
## # A tibble: 3 x 5
## Pages Price fitted standardized studentized
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 400 128. 55.5 2.50 2.78
## 2 600 19.5 85.0 -2.25 -2.44
## 3 930 70.8 134. -2.26 -2.45
Books 4, 9, and 16 are outliers based on the standardized and studentized residual plots derived above.
no24 <- TextPrices %>%
filter(Pages != "24")
model2_no24 <- lm(Price ~ Pages, data = TextPrices)
summary(model2_no24)
##
## Call:
## lm(formula = Price ~ Pages, data = TextPrices)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.475 -12.324 -0.584 15.304 72.991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.42231 10.46374 -0.327 0.746
## Pages 0.14733 0.01925 7.653 2.45e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.76 on 28 degrees of freedom
## Multiple R-squared: 0.6766, Adjusted R-squared: 0.665
## F-statistic: 58.57 on 1 and 28 DF, p-value: 2.452e-08
ggplot(data = TextPrices, aes(x = Pages, y = Price)) +
geom_point() +
geom_smooth(method = "lm") +
geom_smooth(data = no24, method = "lm", color = "red") +
geom_label(data = subset(TextPrices, Pages == "24"),
aes(label = Pages), nudge_x = -350)
It seems to fall in the distribution when the point is excluded. We can see this as it falls within the grey area of the outline. The grey outline indicates the likely values that would fit in the distribution.
Based off the residual model there is no evidence that book 24 is an influential point.
Complete parts (a)-(b) in the book and (c)-(e) below:
Null hypothesis: The slope of the Least Squared Regression is equal to 0. This indicates that there is no difference in true mean between predicted and observed.
par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))
TextPrices <- TextPrices %>%
mutate(resid1 = model5$residuals,
fit1 = model5$fitted.values)
TextPrices
## # A tibble: 30 x 7
## Pages Price fitted standardized studentized resid1 fit1
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 600 95 85.0 0.344 0.339 10.0 85.0
## 2 91 20.0 9.98 0.351 0.346 9.97 9.98
## 3 200 51.5 26.0 0.883 0.880 25.5 26.0
## 4 400 128. 55.5 2.50 2.78 73.0 55.5
## 5 521 96 73.3 0.775 0.769 22.7 73.3
## 6 315 48.5 43.0 0.189 0.186 5.51 43.0
## 7 800 147. 114. 1.13 1.14 32.3 114.
## 8 800 92 114. -0.786 -0.781 -22.4 114.
## 9 600 19.5 85.0 -2.25 -2.44 -65.5 85.0
## 10 488 85.5 68.5 0.582 0.575 17.0 68.5
## # … with 20 more rows
It helps when analyzing group effects. We can see if our data when centered still has similar results indicating the validity of our null hypothesis that the difference in true means is zero. It is a test to see the validitiy in the null.
ggplot(data = TextPrices, aes(x = resid1)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = TextPrices, aes(x = fit1, y = resid1)) +
geom_point() +
geom_smooth(se = FALSE) +
geom_hline(yintercept = 0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Y= 65.02 + .147x
65.02 signifies the price of a book that is at the mean page value, which is 465.
Both slopes are the same because it is calculated based off our conception of the null hypothesis.The null hypothesis being that the difference in means is zero. Adding the mean doesn’t change the slope as it has a negating effect on its calculation. It shifts the regression line to create a more useful intercept adjusting for extrapolation
Answer parts (a)-(h); you should include the code below in your setup chunk (after removing “eval = FALSE”):
The background for this problem is provided on page 46, and the data is stored as PalmBeach in the Stat2Data library. One theory espoused by some observers of the 2000 U.S. Presidential Election is that errors induced by the butterfly ballot in Palm Beach County, Florida, may have swung the entire 2000 Presidential election. We are going to specifically investigate the question of whether or not the butterfly ballot caused voters to accidentally vote for Pat Buchanan instead of Al Gore in Palm Beach County. The angle we’ll take is to use regression techniques to establish a relationship between the number of Buchanan votes and the number of Bush votes in Florida counties, and then show that the number of Buchanan votes in Palm Beach County does not follow that pattern, even after accounting for uncertainty in predictions.
r library(tidyverse) library(Stat2Data) data("PalmBeach") PalmBeach <- as_tibble(PalmBeach) model2 <- lm(Buchanan ~ Bush, data = PalmBeach) summary(model2)
## ## Call: ## lm(formula = Buchanan ~ Bush, data = PalmBeach) ## ## Residuals: ## Min 1Q Median 3Q Max ## -907.50 -46.10 -29.19 12.26 2610.19 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.529e+01 5.448e+01 0.831 0.409 ## Bush 4.917e-03 7.644e-04 6.432 1.73e-08 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 353.9 on 65 degrees of freedom ## Multiple R-squared: 0.3889, Adjusted R-squared: 0.3795 ## F-statistic: 41.37 on 1 and 65 DF, p-value: 1.727e-08
r noPalmBeach <- PalmBeach %>% filter(County != "PALM BEACH") model2_noPalmBeach <- lm(Buchanan ~ Bush, data = noPalmBeach) summary(model2_noPalmBeach)
## ## Call: ## lm(formula = Buchanan ~ Bush, data = noPalmBeach) ## ## Residuals: ## Min 1Q Median 3Q Max ## -512.43 -47.97 -17.09 41.78 305.45 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 6.557e+01 1.733e+01 3.784 0.000343 *** ## Bush 3.482e-03 2.501e-04 13.923 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 112.5 on 64 degrees of freedom ## Multiple R-squared: 0.7518, Adjusted R-squared: 0.7479 ## F-statistic: 193.8 on 1 and 64 DF, p-value: < 2.2e-16
r ggplot(data = model2_noPalmBeach, mapping = aes(x = Bush, y = Buchanan)) + geom_point() + geom_smooth(method = "lm")
r confint(model2_noPalmBeach)
## 2.5 % 97.5 % ## (Intercept) 30.951986107 1.001950e+02 ## Bush 0.002982285 3.981511e-03
r B1_hat <- summary(model2_noPalmBeach)$coefficients[2,1] SE_B1 <- summary(model2_noPalmBeach)$coefficients[2,2] tstar <- qt(.975, df = model2_noPalmBeach$df.residual) B1_hat - tstar * SE_B1
## [1] 0.002982285
r B1_hat + tstar * SE_B1
## [1] 0.003981511
model2 <- lm(Buchanan ~ Bush, data = PalmBeach)
summary(model2)
##
## Call:
## lm(formula = Buchanan ~ Bush, data = PalmBeach)
##
## Residuals:
## Min 1Q Median 3Q Max
## -907.50 -46.10 -29.19 12.26 2610.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.529e+01 5.448e+01 0.831 0.409
## Bush 4.917e-03 7.644e-04 6.432 1.73e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 353.9 on 65 degrees of freedom
## Multiple R-squared: 0.3889, Adjusted R-squared: 0.3795
## F-statistic: 41.37 on 1 and 65 DF, p-value: 1.727e-08
noPalmBeach <- PalmBeach %>%
filter(County != "PALM BEACH")
model2_noPalmBeach <- lm(Buchanan ~ Bush, data = noPalmBeach)
summary(model2_noPalmBeach)
##
## Call:
## lm(formula = Buchanan ~ Bush, data = noPalmBeach)
##
## Residuals:
## Min 1Q Median 3Q Max
## -512.43 -47.97 -17.09 41.78 305.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.557e+01 1.733e+01 3.784 0.000343 ***
## Bush 3.482e-03 2.501e-04 13.923 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 112.5 on 64 degrees of freedom
## Multiple R-squared: 0.7518, Adjusted R-squared: 0.7479
## F-statistic: 193.8 on 1 and 64 DF, p-value: < 2.2e-16
ggplot(data = PalmBeach, aes(x = Bush, y = Buchanan)) +
geom_point() +
geom_smooth(method = "lm") +
geom_smooth(data = noPalmBeach, method = "lm", color = "red") +
geom_label(data = subset(PalmBeach, County == "PALM BEACH"),
aes(label = County), nudge_x = -350)
par(mfrow = c(2, 2), mar = c(2, 2, 2, 2))
plot(model2_noPalmBeach, which = c(1, 2, 3, 5))
par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))
Linearity: seems varily patternless as seen through the residuals vs fitted graph. Independence: within the context of the data set it seems to fit. Normality: With the outlier excluded we see a linear line in regression. Equal Variance: scale location is flat. Sample: Based off the reading it seems legit in context.
noPalmBeach <- noPalmBeach %>% mutate(logBuchanan = log(Buchanan), logBush = log(Bush)) ggplot(noPalmBeach, aes(x = logBush, y = Buchanan)) + geom_point() + geom_smooth(method = “lm”, se = FALSE) + labs(y = “Buchanan”, x = “log Bush”)
model4 <- lm(Buchanan ~ logBush, data = noPalmBeach) summary(model4)
When one of the values of Bush votes is doubled it changes the mean number of Buchanan votes by a difference of 38.375
noPalmBeach <- noPalmBeach %>%
mutate(logBuchanan = log(Buchanan),
logBush = log(Bush))
ggplot(noPalmBeach, aes(x = logBush, y = logBuchanan)) +
geom_point() +
geom_smooth(method = "lm") +
labs(y = "log Buchanan", x = "log Bush")
model7 <- lm(logBuchanan ~ logBush, data = noPalmBeach)
summary(model7)
##
## Call:
## lm(formula = logBuchanan ~ logBush, data = noPalmBeach)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95631 -0.21236 0.02503 0.28102 1.02056
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.34149 0.35442 -6.607 9.07e-09 ***
## logBush 0.73096 0.03597 20.323 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4198 on 64 degrees of freedom
## Multiple R-squared: 0.8658, Adjusted R-squared: 0.8637
## F-statistic: 413 on 1 and 64 DF, p-value: < 2.2e-16
confint(model7)
## 2.5 % 97.5 %
## (Intercept) -3.049512 -1.633461
## logBush 0.659109 0.802815
par(mfrow = c(2, 2), mar = c(2, 2, 2, 2))
plot(model7, which = c(1, 2, 3, 5))
par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))
Linearity: seems varily patternless as seen through the residuals vs fitted graph. Independence: within the context of the data set it seems to fit. Normality: With the outlier excluded we see a linear line in regression. Equal Variance: scale location is flat. Sample: Based off the reading it seems legit in context.
y=.759x-2.577 From this summary data we are 95% confident that when an amount of Bush votes are doubled, the median number of Buchanan votes changes by between 1.6 and 1.8 votes
116007.537
95%: 521-597 votes
pred_range <- tibble(logBush = seq(7, 12.7, length = 1000)) %>%
mutate(Bush = exp(logBush))
# note: model7 comes from (d)
PIs <- predict(model7, pred_range, interval = "prediction") %>%
as_tibble() %>%
mutate(fitted_Buchanan = exp(fit),
lowerPI_Buchanan = exp(lwr),
upperPI_Buchanan = exp(upr)) %>%
bind_cols(pred_range)
ggplot(data = PalmBeach, mapping = aes(x = Bush, y = Buchanan)) +
geom_point() +
geom_line(data = PIs, mapping = aes(x = Bush, y = fitted_Buchanan,
color = "fit")) +
geom_line(data = PIs, mapping = aes(x = Bush, y = lowerPI_Buchanan,
color = "PI")) +
geom_line(data = PIs, mapping = aes(x = Bush, y = upperPI_Buchanan,
color = "PI")) +
geom_label(data = subset(PalmBeach, County == "PALM BEACH"),
aes(label = County), nudge_x = 35000) +
scale_color_manual(values = c('PI' = 'darkblue', 'fit' = 'red')) +
labs(y = "Buchanan votes", x = "Bush votes", color = "Prediction Bounds")
The first part of the r code seems to take the log of all the bush data and mutate it by exponentially manipulating it. The mutation that this line of code goes under is meant to predict a range of values of which the log of the bush votes would fit. The next part then takes the logged data of both Bush and Buchanan to make a predicted interval while now exponentially mutating the Buchanan votes in addition. The final part creates the plot and sets the parameters for the axis and labels. This includes the original data which Palm Beach is included. It shows the expected fit of bush to buchanan votes.
Something unusual did occur in Palm Beach County. It is probably due to human error and how the ballot itself was formatted. Putting Buchanan as the second bubble made it seem confusing as the top two bubbles are typically reserved for the primary positions.The residual difference between PBC and other counties is large enough for a cause of concern. Since the ballot is specific to Palm Beach County the anomaly is harder to isolate as we lack data of other butterfly ballots. It does seem to be significant that Gore should have won Florida due to the fault of the butterfly ballot due to the history provided in the article. on page 798 it quotes that ‘In 1996 no county in Florida has a reisdual even remotely as large as the one for PBC in 2000’.