title: ‘Stat 272: Homework #2’ subtitle: ‘Eric Besonen’ output: pdf_document: fig_height: 3 fig_width: 4.5 html_document: default word_document: default geometry: “left=2.5cm,right=2.5cm,top=2.5cm,bottom=2.5cm” editor_options: chunk_output_type: console

1) Open-ended Exercise 1.44: Textbook prices (p.57)

Complete parts (a)-(c) in the book and (d)-(f) below:

library(MASS)        # for stdres() and studres() - load BEFORE tidyverse
library(Stat2Data)   # if planning to use textbook data
library(mosaic)      # for favstats()

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:MASS':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: lattice

## Loading required package: ggformula

## Loading required package: ggplot2

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

## Loading required package: ggstance

## 
## Attaching package: 'ggstance'

## The following objects are masked from 'package:ggplot2':
## 
##     geom_errorbarh, GeomErrorbarh

## 
## New to ggformula?  Try the tutorials: 
##  learnr::run_tutorial("introduction", package = "ggformula")
##  learnr::run_tutorial("refining", package = "ggformula")

## Loading required package: mosaicData

## Loading required package: Matrix

## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2

## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Note: If you use the Matrix package, be sure to load it BEFORE loading mosaic.

## 
## Attaching package: 'mosaic'

## The following object is masked from 'package:Matrix':
## 
##     mean

## The following object is masked from 'package:ggplot2':
## 
##     stat

## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally

## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median,
##     prop.test, quantile, sd, t.test, var

## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum

library(tidyverse)   # ALWAYS load

## Registered S3 method overwritten by 'rvest':
##   method            from
##   read_xml.response xml2

## ── Attaching packages ─────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ tibble  2.1.1     ✔ purrr   0.3.2
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ mosaic::count() masks dplyr::count()
## ✖ purrr::cross()  masks mosaic::cross()
## ✖ mosaic::do()    masks dplyr::do()
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ✖ mosaic::stat()  masks ggplot2::stat()
## ✖ mosaic::tally() masks dplyr::tally()

data("TextPrices")

data("TextPrices")
TextPrices <- as_tibble(TextPrices)
ggplot(data = TextPrices, mapping = aes(x = Pages, y = Price)) +
geom_point() +
geom_smooth(method = "lm")

ggplot(data = TextPrices, mapping = aes(x = Pages, y = Price)) +
geom_point() +
geom_smooth(method = "lm")

model5 <- lm(Price ~ Pages, data = TextPrices)
summary(model5)

## 
## Call:
## lm(formula = Price ~ Pages, data = TextPrices)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.475 -12.324  -0.584  15.304  72.991 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.42231   10.46374  -0.327    0.746    
## Pages        0.14733    0.01925   7.653 2.45e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.76 on 28 degrees of freedom
## Multiple R-squared:  0.6766, Adjusted R-squared:  0.665 
## F-statistic: 58.57 on 1 and 28 DF,  p-value: 2.452e-08

Price=.147pages - 3.42

Are there any potential leverage points? If so, which point(s)? If not, why not?

Textbook 4: 400 pages priced at $128.5 because a book at 400 pages is expected to be priced at $56. This high difference in actual vs fitted explanatory variable is an indicator of a leverage point. Another point/textbook that would fall under this category would be 9, 600 pages priced at $19.5. Its fited value should mark the book price to $80 but the book goes for 19.5.There are some books under the 400 page range which may be a cause of concern for being too cheap relative to the line of best fit.

Are there any outliers? If so, which point(s)? Produce plots of standardized and studentized residuals to investigate this question.

TextPrices <- TextPrices %>%
mutate(fitted = fitted(model5),
standardized = stdres(model5),
studentized = studres(model5))
ggplot(data = TextPrices, aes(x = Pages, y = studentized)) +
geom_point() +
geom_hline(yintercept = c(-2, 2), color = "red")

TextPrices %>%
filter(studentized > 2 | studentized < -2)

## # A tibble: 3 x 5
##   Pages Price fitted standardized studentized
##   <int> <dbl>  <dbl>        <dbl>       <dbl>
## 1   400 128.    55.5         2.50        2.78
## 2   600  19.5   85.0        -2.25       -2.44
## 3   930  70.8  134.         -2.26       -2.45

Books 4, 9, and 16 are outliers based on the standardized and studentized residual plots derived above.

Is there any evidence that Book #24 (with 1060 pages) is an influential point? Compare fitted models with and without this book to address this question.

no24 <- TextPrices %>%
filter(Pages != "24")

model2_no24 <- lm(Price ~ Pages, data = TextPrices)
summary(model2_no24)

## 
## Call:
## lm(formula = Price ~ Pages, data = TextPrices)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.475 -12.324  -0.584  15.304  72.991 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.42231   10.46374  -0.327    0.746    
## Pages        0.14733    0.01925   7.653 2.45e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.76 on 28 degrees of freedom
## Multiple R-squared:  0.6766, Adjusted R-squared:  0.665 
## F-statistic: 58.57 on 1 and 28 DF,  p-value: 2.452e-08

ggplot(data = TextPrices, aes(x = Pages, y = Price)) +
geom_point() +
geom_smooth(method = "lm") +
geom_smooth(data = no24, method = "lm", color = "red") +
geom_label(data = subset(TextPrices, Pages == "24"),
aes(label = Pages), nudge_x = -350)

It seems to fall in the distribution when the point is excluded. We can see this as it falls within the grey area of the outline. The grey outline indicates the likely values that would fit in the distribution.

Based off the residual model there is no evidence that book 24 is an influential point.

2) Guided Exercise 2.16: Textbook prices (continued) (p.78)

Complete parts (a)-(b) in the book and (c)-(e) below:

Null hypothesis: The slope of the Least Squared Regression is equal to 0. This indicates that there is no difference in true mean between predicted and observed.

Create a new variable by centering the number of pages on the mean number of pages in this study (465). Why is centering a good idea in this case?

par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))
TextPrices <- TextPrices %>%
mutate(resid1 = model5$residuals,
fit1 = model5$fitted.values)
TextPrices

## # A tibble: 30 x 7
##    Pages Price fitted standardized studentized resid1   fit1
##    <int> <dbl>  <dbl>        <dbl>       <dbl>  <dbl>  <dbl>
##  1   600  95    85.0         0.344       0.339  10.0   85.0 
##  2    91  20.0   9.98        0.351       0.346   9.97   9.98
##  3   200  51.5  26.0         0.883       0.880  25.5   26.0 
##  4   400 128.   55.5         2.50        2.78   73.0   55.5 
##  5   521  96    73.3         0.775       0.769  22.7   73.3 
##  6   315  48.5  43.0         0.189       0.186   5.51  43.0 
##  7   800 147.  114.          1.13        1.14   32.3  114.  
##  8   800  92   114.         -0.786      -0.781 -22.4  114.  
##  9   600  19.5  85.0        -2.25       -2.44  -65.5   85.0 
## 10   488  85.5  68.5         0.582       0.575  17.0   68.5 
## # … with 20 more rows

It helps when analyzing group effects. We can see if our data when centered still has similar results indicating the validity of our null hypothesis that the difference in true means is zero. It is a test to see the validitiy in the null.

Fit a new model with Price as the response and your centered variable from (c) as the explanatory. Report the equation of your fitted line, and interpret the intercept in context.

ggplot(data = TextPrices, aes(x = resid1)) +
geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = TextPrices, aes(x = fit1, y = resid1)) +
geom_point() +
geom_smooth(se = FALSE) +
geom_hline(yintercept = 0)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Y= 65.02 + .147x

65.02 signifies the price of a book that is at the mean page value, which is 465.

Your slope in (d) should match your slope from (b). Explain why this is true.

Both slopes are the same because it is calculated based off our conception of the null hypothesis.The null hypothesis being that the difference in means is zero. Adding the mean doesn’t change the slope as it has a negating effect on its calculation. It shifts the regression line to create a more useful intercept adjusting for extrapolation

4) Example 1.11: Butterfly ballot (p.46)

Answer parts (a)-(h); you should include the code below in your setup chunk (after removing “eval = FALSE”):

The background for this problem is provided on page 46, and the data is stored as PalmBeach in the Stat2Data library. One theory espoused by some observers of the 2000 U.S. Presidential Election is that errors induced by the butterfly ballot in Palm Beach County, Florida, may have swung the entire 2000 Presidential election. We are going to specifically investigate the question of whether or not the butterfly ballot caused voters to accidentally vote for Pat Buchanan instead of Al Gore in Palm Beach County. The angle we’ll take is to use regression techniques to establish a relationship between the number of Buchanan votes and the number of Bush votes in Florida counties, and then show that the number of Buchanan votes in Palm Beach County does not follow that pattern, even after accounting for uncertainty in predictions.

Produce a scatterplot with Buchanan votes by county as the response and Bush votes as the explanatory variable. Note that Palm Beach County is an obvious outlier in this relationship; remove Palm Beach County so that we can build a model without it.

r library(tidyverse) library(Stat2Data) data("PalmBeach") PalmBeach <- as_tibble(PalmBeach) model2 <- lm(Buchanan ~ Bush, data = PalmBeach) summary(model2)

## ## Call: ## lm(formula = Buchanan ~ Bush, data = PalmBeach) ## ## Residuals: ## Min 1Q Median 3Q Max ## -907.50 -46.10 -29.19 12.26 2610.19 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 4.529e+01 5.448e+01 0.831 0.409 ## Bush 4.917e-03 7.644e-04 6.432 1.73e-08 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 353.9 on 65 degrees of freedom ## Multiple R-squared: 0.3889, Adjusted R-squared: 0.3795 ## F-statistic: 41.37 on 1 and 65 DF, p-value: 1.727e-08

r noPalmBeach <- PalmBeach %>% filter(County != "PALM BEACH") model2_noPalmBeach <- lm(Buchanan ~ Bush, data = noPalmBeach) summary(model2_noPalmBeach)

## ## Call: ## lm(formula = Buchanan ~ Bush, data = noPalmBeach) ## ## Residuals: ## Min 1Q Median 3Q Max ## -512.43 -47.97 -17.09 41.78 305.45 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 6.557e+01 1.733e+01 3.784 0.000343 *** ## Bush 3.482e-03 2.501e-04 13.923 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 112.5 on 64 degrees of freedom ## Multiple R-squared: 0.7518, Adjusted R-squared: 0.7479 ## F-statistic: 193.8 on 1 and 64 DF, p-value: < 2.2e-16

r ggplot(data = model2_noPalmBeach, mapping = aes(x = Bush, y = Buchanan)) + geom_point() + geom_smooth(method = "lm")

r confint(model2_noPalmBeach)

## 2.5 % 97.5 % ## (Intercept) 30.951986107 1.001950e+02 ## Bush 0.002982285 3.981511e-03

r B1_hat <- summary(model2_noPalmBeach)$coefficients[2,1] SE_B1 <- summary(model2_noPalmBeach)$coefficients[2,2] tstar <- qt(.975, df = model2_noPalmBeach$df.residual) B1_hat - tstar * SE_B1

## [1] 0.002982285

r B1_hat + tstar * SE_B1

## [1] 0.003981511

After removing Palm Beach County, fit a regression line predicting Buchanan votes as a function of Bush votes. Assess the assumptions for our simple linear regression model using residual plots.

model2 <- lm(Buchanan ~ Bush, data = PalmBeach)
summary(model2)

## 
## Call:
## lm(formula = Buchanan ~ Bush, data = PalmBeach)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -907.50  -46.10  -29.19   12.26 2610.19 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.529e+01  5.448e+01   0.831    0.409    
## Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 353.9 on 65 degrees of freedom
## Multiple R-squared:  0.3889, Adjusted R-squared:  0.3795 
## F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08

noPalmBeach <- PalmBeach %>%
  filter(County != "PALM BEACH")
model2_noPalmBeach <- lm(Buchanan ~ Bush, data = noPalmBeach)
summary(model2_noPalmBeach)

## 
## Call:
## lm(formula = Buchanan ~ Bush, data = noPalmBeach)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -512.43  -47.97  -17.09   41.78  305.45 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.557e+01  1.733e+01   3.784 0.000343 ***
## Bush        3.482e-03  2.501e-04  13.923  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 112.5 on 64 degrees of freedom
## Multiple R-squared:  0.7518, Adjusted R-squared:  0.7479 
## F-statistic: 193.8 on 1 and 64 DF,  p-value: < 2.2e-16

ggplot(data = PalmBeach, aes(x = Bush, y = Buchanan)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_smooth(data = noPalmBeach, method = "lm", color = "red") +
  geom_label(data = subset(PalmBeach, County == "PALM BEACH"),
             aes(label = County), nudge_x = -350)

par(mfrow = c(2, 2), mar = c(2, 2, 2, 2))
plot(model2_noPalmBeach, which = c(1, 2, 3, 5))

par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))

Linearity: seems varily patternless as seen through the residuals vs fitted graph. Independence: within the context of the data set it seems to fit. Normality: With the outlier excluded we see a linear line in regression. Equal Variance: scale location is flat. Sample: Based off the reading it seems legit in context.

Log transfrom Bush votes and refit the regression model, still using untransformed Buchanan votes as the response. Interpret the slope of this model in context.

noPalmBeach <- noPalmBeach %>% mutate(logBuchanan = log(Buchanan), logBush = log(Bush)) ggplot(noPalmBeach, aes(x = logBush, y = Buchanan)) + geom_point() + geom_smooth(method = “lm”, se = FALSE) + labs(y = “Buchanan”, x = “log Bush”)

model4 <- lm(Buchanan ~ logBush, data = noPalmBeach) summary(model4)

When one of the values of Bush votes is doubled it changes the mean number of Buchanan votes by a difference of 38.375

Log transform both variables (Buchanan votes and Bush votes) and refit the regression model. Again, assess the assumptions for the regression model, showing graphs or numerical summaries when appropriate. Use this model to answer the remaining questions.

noPalmBeach <- noPalmBeach %>%
mutate(logBuchanan = log(Buchanan),
logBush = log(Bush))

ggplot(noPalmBeach, aes(x = logBush, y = logBuchanan)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(y = "log Buchanan", x = "log Bush")

model7 <- lm(logBuchanan ~ logBush, data = noPalmBeach)
summary(model7)

## 
## Call:
## lm(formula = logBuchanan ~ logBush, data = noPalmBeach)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95631 -0.21236  0.02503  0.28102  1.02056 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.34149    0.35442  -6.607 9.07e-09 ***
## logBush      0.73096    0.03597  20.323  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4198 on 64 degrees of freedom
## Multiple R-squared:  0.8658, Adjusted R-squared:  0.8637 
## F-statistic:   413 on 1 and 64 DF,  p-value: < 2.2e-16

confint(model7)

##                 2.5 %    97.5 %
## (Intercept) -3.049512 -1.633461
## logBush      0.659109  0.802815

par(mfrow = c(2, 2), mar = c(2, 2, 2, 2))
plot(model7, which = c(1, 2, 3, 5))

par(mfrow = c(1, 1), mar = c(5.1, 4.1, 4.1, 2.1))

What is the equation of the least-squares regression line from (d)? Interpret the slope and its 95% confidence interval in the context of the problem.

y=.759x-2.577 From this summary data we are 95% confident that when an amount of Bush votes are doubled, the median number of Buchanan votes changes by between 1.6 and 1.8 votes

What is the predicted number of Buchanan votes in Palm Beach County based on your model?

116007.537

Expand on (f) by generating and interpreting a prediction interval for the number of Buchanan votes in Palm Beach County based on the number of Bush votes cast there (remembering any necessary back transformations).

95%: 521-597 votes

Explain what the following R code does by describing in detail what each of the 3 parts do in a few sentences each (be sure to remove “eval = FALSE” when you knit it).

pred_range <- tibble(logBush = seq(7, 12.7, length = 1000)) %>%
  mutate(Bush = exp(logBush))

# note: model7 comes from (d)
PIs <- predict(model7, pred_range, interval = "prediction") %>%
  as_tibble() %>%
  mutate(fitted_Buchanan = exp(fit),
         lowerPI_Buchanan = exp(lwr),
         upperPI_Buchanan = exp(upr)) %>%
  bind_cols(pred_range)

ggplot(data = PalmBeach, mapping = aes(x = Bush, y = Buchanan)) +
  geom_point() +
  geom_line(data = PIs, mapping = aes(x = Bush, y = fitted_Buchanan, 
                                      color = "fit")) +
  geom_line(data = PIs, mapping = aes(x = Bush, y = lowerPI_Buchanan, 
                                      color = "PI")) +
  geom_line(data = PIs, mapping = aes(x = Bush, y = upperPI_Buchanan, 
                                      color = "PI")) +
  geom_label(data = subset(PalmBeach, County == "PALM BEACH"),
             aes(label = County), nudge_x = 35000) +
  scale_color_manual(values = c('PI' = 'darkblue', 'fit' = 'red')) +
  labs(y = "Buchanan votes", x = "Bush votes", color = "Prediction Bounds")

The first part of the r code seems to take the log of all the bush data and mutate it by exponentially manipulating it. The mutation that this line of code goes under is meant to predict a range of values of which the log of the bush votes would fit. The next part then takes the logged data of both Bush and Buchanan to make a predicted interval while now exponentially mutating the Buchanan votes in addition. The final part creates the plot and sets the parameters for the axis and labels. This includes the original data which Palm Beach is included. It shows the expected fit of bush to buchanan votes.

Thoughtfully discuss what you can conclude from this analysis. For instance, did something unusual occur in Palm Beach County? Is there evidence that the butterfly ballot caused voters to accidentally vote for Pat Buchanan instead of Al Gore in Palm Beach County in 2000? Is it likely that Al Gore really should have won Florida (he lost by 537 votes) and thus the entire election?

Something unusual did occur in Palm Beach County. It is probably due to human error and how the ballot itself was formatted. Putting Buchanan as the second bubble made it seem confusing as the top two bubbles are typically reserved for the primary positions.The residual difference between PBC and other counties is large enough for a cause of concern. Since the ballot is specific to Palm Beach County the anomaly is harder to isolate as we lack data of other butterfly ballots. It does seem to be significant that Gore should have won Florida due to the fault of the butterfly ballot due to the history provided in the article. on page 798 it quotes that ‘In 1996 no county in Florida has a reisdual even remotely as large as the one for PBC in 2000’.