MARR 5.1

The question itself can be found on page 146 of the text.

The first thing we’ll do here is load the data and take a look at what we’re working with.

d <- read.csv('http://gattonweb.uky.edu/sheather/book/docs/datasets/overdue.txt',sep='')
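Since `sep=''` tells `read.csv` to split on any run of whitespace, a quick structural check after loading is worthwhile. The sketch below is only an illustration: it uses a few made-up rows (not the real data) passed through the `text` argument so that it runs offline, and the column names BILL and LATE are the ones used in the code in this write-up.

```r
# Hypothetical whitespace-delimited rows standing in for overdue.txt
raw <- "BILL LATE
100   20
150 25
200    31"

# sep = "" splits on one or more spaces or tabs
toy <- read.csv(text = raw, sep = "")

str(toy)      # column types and dimensions
head(toy)     # first few rows
summary(toy)  # range of each column
```

On the real data, the same three calls can be run on `d`.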

plot(d$LATE, d$BILL)

The data appear to be piecewise linear, which will be hard to fit with a single straight line. The question indicates that the first 48 observations are RESIDENTIAL and the second 48 are COMMERCIAL. Those labels are not included in the file that I found, but they are easy enough to add. Judging from the plot, this categorical variable looks extremely important.

Below we assign categories to the data and can see two distinct, clear relationships:

* Residentials tend to pay more slowly as the bills get larger
* Commercials tend to pay more quickly if the bill is larger

# Rows 1-48 are residential, rows 49-96 are commercial
d$TYPE[1:48] <- "RESIDENTIAL"
d$TYPE[49:96] <- "COMMERCIAL"

d$TYPE <- as.factor(d$TYPE)
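An off-by-one in the index ranges is easy to make when labelling rows by position, so a quick count per level is a cheap safeguard. The following is a minimal sketch on a made-up 96-row data frame (`toy` is a hypothetical stand-in, not the real data); it builds the labels with `rep()` and checks that each level appears 48 times.

```r
# Toy stand-in for the 96-row data set (values are made up)
toy <- data.frame(BILL = seq_len(96), LATE = seq_len(96))

# rep(..., each = 48) yields 48 RESIDENTIAL followed by 48 COMMERCIAL
toy$TYPE <- factor(rep(c("RESIDENTIAL", "COMMERCIAL"), each = 48))

# Each level should appear exactly 48 times
counts <- table(toy$TYPE)
print(counts)

# Row 48 should be residential and row 49 commercial
boundary <- as.character(toy$TYPE[48:49])
print(boundary)
```

Running `table(d$TYPE)` on the real data gives the same kind of check.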

library(ggplot2)

ggplot(d,aes(x=LATE,y=BILL,color=TYPE))+
  geom_point()

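Next we’ll fit a linear regression that uses BILL and TYPE to predict LATE. Because this model is additive, both customer types share a single BILL slope and differ only in their intercepts.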

m <- lm(LATE ~ BILL + TYPE, data = d)

summary(m)
## 
## Call:
## lm(formula = LATE ~ BILL + TYPE, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.451 -12.331   0.615  12.782  29.572 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      70.28146    4.01526  17.504   <2e-16 ***
## BILL             -0.01422    0.01953  -0.728    0.468    
## TYPERESIDENTIAL -36.81058    3.02390 -12.173   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.81 on 93 degrees of freedom
## Multiple R-squared:  0.6151, Adjusted R-squared:  0.6068 
## F-statistic:  74.3 on 2 and 93 DF,  p-value: < 2.2e-16

I note that BILL isn’t significant here, so let’s try a reduced model:

m.reduced <- lm(LATE ~ TYPE, data = d)

summary(m.reduced)
## 
## Call:
## lm(formula = LATE ~ TYPE, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.796 -12.000   0.704  12.403  31.204 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       67.796      2.110   32.12   <2e-16 ***
## TYPERESIDENTIAL  -36.796      3.016  -12.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.77 on 94 degrees of freedom
## Multiple R-squared:  0.6129, Adjusted R-squared:  0.6088 
## F-statistic: 148.8 on 1 and 94 DF,  p-value: < 2.2e-16

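Next we compare the two models with a partial F-test, where the null hypothesis is that the BILL coefficient is zero. Because the models differ by a single term, this test is equivalent to the t-test on BILL in the full model (F = t² ≈ 0.53, p ≈ 0.47), so we would not reject the null. It appears okay to use the reduced model here.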

anova(m.reduced,m)
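To make clear what `anova()` computes for two nested models, here is a minimal sketch that rebuilds the partial F-statistic from the residual sums of squares. It runs on simulated data (the numbers are made up and `toy`, `full`, and `reduced` are hypothetical names), so only the mechanics carry over to the models above.

```r
set.seed(1)

# Simulated stand-in data (made up, not the overdue-bills data)
n   <- 96
toy <- data.frame(
  BILL = runif(n, 50, 300),
  TYPE = factor(rep(c("RESIDENTIAL", "COMMERCIAL"), each = n / 2))
)
toy$LATE <- 60 - 30 * (toy$TYPE == "RESIDENTIAL") + rnorm(n, sd = 10)

full    <- lm(LATE ~ BILL + TYPE, data = toy)
reduced <- lm(LATE ~ TYPE, data = toy)

# Partial F = ((RSS_reduced - RSS_full) / df difference) / (RSS_full / df_full)
rss_full    <- sum(resid(full)^2)
rss_reduced <- sum(resid(reduced)^2)
df_diff     <- df.residual(reduced) - df.residual(full)
f_manual    <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df.residual(full))
p_manual    <- pf(f_manual, df_diff, df.residual(full), lower.tail = FALSE)

# anova() reports the same F, and with one dropped term F equals t^2 for BILL
tab    <- anova(reduced, full)
t_bill <- coef(summary(full))["BILL", "t value"]
print(c(f_manual = f_manual, f_anova = tab$F[2], t_squared = t_bill^2))
```

With a single dropped term, the F-test and the t-test in `summary()` always agree, which is why the p-values match.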

I’m not 100% confident on this, but it appears as though in this case we don’t need the “BILL” variable and can rely on the categorical “TYPE” variable instead.
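Because the scatterplot suggested BILL slopes of opposite sign for the two customer types, one further check (not part of the analysis above) would be to let the slope differ by TYPE. In an additive model, opposite slopes can cancel and make BILL look unimportant. The sketch below uses simulated data with that pattern built in; the numbers are made up and `toy`, `m_additive`, and `m_interact` are hypothetical names, so it says nothing about the real data until the same formulas are tried on `d`.

```r
set.seed(2)

# Simulated data with opposite BILL slopes per group (made up for illustration)
n   <- 96
toy <- data.frame(
  BILL = runif(n, 50, 300),
  TYPE = factor(rep(c("RESIDENTIAL", "COMMERCIAL"), each = n / 2))
)
slope    <- ifelse(toy$TYPE == "RESIDENTIAL", 0.15, -0.15)
toy$LATE <- 50 + slope * (toy$BILL - 175) + rnorm(n, sd = 3)

m_additive <- lm(LATE ~ BILL + TYPE, data = toy)  # one common slope
m_interact <- lm(LATE ~ BILL * TYPE, data = toy)  # separate slope per TYPE

# The additive fit averages the two slopes; the interaction fit separates them
print(coef(m_additive)["BILL"])
print(coef(m_interact))

# Partial F-test of the additive model against the interaction model
cmp <- anova(m_additive, m_interact)
print(cmp)
```

On the real data, the analogous comparison would be `anova(m, lm(LATE ~ BILL * TYPE, data = d))`.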