DATA 621 Week #2 Textbook Exercises

MARR 2.1
MARR 2.2
LMR 2.4
LMR 2.5

MARR 2.1

The web site www.playbill.com provides weekly reports on the box office ticket sales for plays on Broadway in New York. We shall consider the data for the week October 11-17, 2004 (reffered to below as the current week). The data are in the form of the gross box office results for the current week and the gross box office results for the previous wek (i.e., October 3-10, 2004).

Fit the following model to the data: $Y=\beta_0 + \beta_1 x + e$ where $Y$ is the gross box office results for the current week (in $) and $x$ is the gross box office results for the previous week (in $).

playbill <- read.csv("data/playbill.csv")
fit <- lm(CurrentWeek ~ LastWeek, data = playbill)

Complete the following tasks:

A. Find a 95% confidence interval for the slope of the regression model, $\beta_1$. Is 1 a plausible value for $\beta_1$? Give a reason to support your answer.

confint(fit, "LastWeek")

             2.5 %   97.5 %
LastWeek 0.9514971 1.012666

Yes. It is plausible because it falls within the range of the 95% confidence interval.

B. Test the null hypothesis $H_0 : \beta_0 = 10000$ against a two-sided alternative. Interpret your result.

confint(fit, "(Intercept)")

                2.5 %  97.5 %
(Intercept) -14244.33 27854.1

The 95% confidence interval ranges from -14,244 to 27,854. Since the null hypothesis falls outside of this range we would reject the null hypothesis.

C. Use the fitted regression model to estimate the gross box office results for the current week (in $) for a production with $400,000 in gross box office the previous week. Find a 95% prediction interval for the gross box office results for the current week (in $) for a production with $400,000 in gross box office the previous week.

predict(fit, data.frame("LastWeek" = 400000), interval="predict")

       fit      lwr      upr
1 399637.5 359832.8 439442.2

Is $450,000 a feasible value for the gross box office results in the current week, for a production with $400,000 in gros box office the previous week? Give a reason to support your answer.

No, because it falls outside of the upper bounds of the prediction interval.

D. Some promoters of Brodway plays use the rediction rult that next weeks gross box office results will be equal to this weeks’s gross box office results. Comment on the appropriateness of this rule.

Let’s examine this model graphically. We will plot the data and add a red line that represents the model:

ggplot(playbill, aes(x=LastWeek/1000, y=CurrentWeek/1000)) +
  geom_point() +
  labs(title = "Gross Box Office Results",
       subtitle = "(in thousands of dollars)",
       x = "Previous Week",
       y = "Current Week") +
  geom_abline(slope=1, intercept = 0, color = "red") + 
  theme(panel.background = element_blank(),
        panel.grid.major = element_line(color = "gray90"))

Most of the data lines up with the line. It is not a bad model. This also relates to question A. 1 falls in the 95% confidence interval.

MARR 2.2

A story by James R. Hagerty entitled With Buyers Sidelined, Home Prices Slide published in the Thursday October 25, 2007 edition of the Wall Street Journal contained data on so-called fundamental housing indicators in major real estate markets across the US. The author argues that…prices are generally falling and overdue loan payments are pilling up. Thus, we shall consider data presented in the article on

$Y$ = Percentage change in average price from July 2006 to July 2007 (based on the S&P/Case-Shiller national housing index); and

$x$ = Percentage of mortgage loans 30 days or more overdue in latest quarter (based on data from Equifax and Moody’s).

Fit the following model to the data: $Y = \beta_0 + \beta_1 x + e$.

indicators <- read.delim("data/indicators.txt")
fit <- lm(PriceChange ~ LoanPaymentsOverdue, data = indicators)

Complete the following tasks:

A. Find a 95% confidence interval for the slope of the regression model, $\beta_1$. On the basis of this confidence interval decide whether there is evidence of a significant negative linear association.

confint(fit, "LoanPaymentsOverdue")

                        2.5 %    97.5 %
LoanPaymentsOverdue -6.759119 0.3731191

The data suggest that there may be a negative linear association. However this relationship is not very strong.

B. Use the fitted regression model to estimate $E(Y|X=4)$. Find a 95% confidence interval for $E(Y|X=4)$. Is 0% a feasible value for $E(Y|X=4)$? Give a reason to support your answer.

predict(fit, data.frame("LoanPaymentsOverdue" = 4), interval="predict")

        fit       lwr      upr
1 -3.789726 -14.48141 6.901961

The expected value is -3.79 (roughly). The 95% CI is -14.48 to 6.9. 0 is a feasible value because it falls within the confidence interval.

LMR 2.4

The dataset prostate comes from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Fit a model with lpsa as the response and lcavol as the predictor. Record the residual standard error and the $R^2$. Now add lweight, svi, lbph, age, lcp, pgg45 and gleason to the model on one at a time. For each model record the residual standard error and the $R^2$. Plot the trends in these two statistics.

prostate <- read.delim("data/prostate.txt")

get_row <- function(fit, n_vars){
  r2 <- summary(fit)$r.squared
  rse <- sigma(fit)
  data.frame("Predictors" = n_vars, "RSE" = rse, "R2" = r2)
}

df <- get_row(lm(lpsa ~ lcavol, prostate), 1) %>% 
  bind_rows(get_row(lm(lpsa ~ lcavol + lweight, prostate), 2)) %>%
  bind_rows(get_row(lm(lpsa ~ lcavol + lweight + svi, prostate), 3)) %>%
  bind_rows(get_row(lm(lpsa ~ lcavol + lweight + svi + lbph, prostate), 4)) %>%
  bind_rows(get_row(lm(lpsa ~ lcavol + lweight + svi + lbph, prostate), 5)) %>%
  bind_rows(get_row(lm(lpsa ~ lcavol + lweight + svi + lbph + age, prostate), 6)) %>%
  bind_rows(get_row(lm(lpsa ~ lcavol + lweight + svi + lbph + age + lcp, prostate), 7)) %>%
  bind_rows(get_row(lm(lpsa ~ lcavol + lweight + svi + lbph + age + lcp + pgg45, prostate), 8)) %>%
  bind_rows(get_row(lm(lpsa ~ lcavol + lweight + svi + lbph + age + lcp + pgg45 + gleason, prostate), 9)) %>%
  gather(Measure, Value, -Predictors) %>%
  mutate(Measure = as.factor(Measure))

ggplot(df, aes(x=Predictors, y = Value, color = Measure)) +
  geom_line() +
  scale_x_continuous(breaks = round(seq(min(df$Predictors), max(df$Predictors), by = 1), 0)) +
  scale_y_continuous(breaks = round(seq(0, 1, by = 0.05), 2)) +
  annotate("text", x = 5, y = max(df[df$Measure == "R2",]$Value) + 0.01, label ="R2") +
  annotate("text", x = 5, y = min(df[df$Measure == "RSE",]$Value) + 0.02, label ="RSE") +
  labs(title="R2 and RSE by Number of Variables", 
       x = element_blank(), 
       y = element_blank()) +
  theme(legend.position = "None",
        panel.background = element_blank(),
        panel.grid.major = element_line(color = "gray90"))

LMR 2.5

Using the prostate data, plot lpsa against lcavol. Fit the regressions of lpsa on lcavol and lcavol on lpsa. Display both regression lines on the plot. At what point do the two lines intersect? They intersect where the cancer volume roughly 1.3 and the log of PSA is 2.5

fit <- lm(lpsa ~ lcavol, prostate)
df <- data.frame(lcavol = prostate$lcavol, 
                 lpsa = predict(fit, prostate), 
                 Model = "lpsa ~ lcavol")

fit <- lm(lcavol ~ lpsa, prostate)
df <- df %>% 
  bind_rows(data.frame(lcavol = predict(fit, prostate), 
                       lpsa = prostate$lpsa, 
                       Model = "lcavol ~ lpsa")) %>%
  mutate(Model = as.factor(Model))

ggplot(prostate, aes(lcavol, lpsa)) + 
  geom_point() + 
  geom_line(aes(lcavol, lpsa, color = Model), data = df) + 
  labs(x = "Cancer Volume", 
       y = "PSA (log)") +
  theme(legend.position = "bottom",
        panel.background = element_blank(),
        panel.grid.major = element_line(color = "gray90")) +
  scale_x_continuous(breaks = round(seq(min(prostate$lcavol), max(prostate$lcavol), by = 1), 0)) +
  scale_y_continuous(breaks = round(seq(min(prostate$lpsa), max(prostate$lpsa), by = 1), 0))