These notes are based on the book *Beyond Multiple Linear Regression* by Roback and Legler.

1.2 Introduction

Non-Gaussian data can be modeled using generalized linear models (GLMs) instead of ordinary least squares linear regression (LLSR).

Assumptions of GLM

  1. The response does not need to be normally distributed.

  2. Observations must still be independent of one another (see the note below).

  3. The variance in y at each level of x does not need to be equal.

  4. A linear relationship between x and y does not need to be met; linearity is instead assumed on the scale of the link function.

NOTE: GLMs cannot be used in the following circumstances: medical researchers collect data on patients in clinical trials weekly for 6 months; rat dams are injected with teratogenic substances and their offspring are monitored for defects; and musicians’ performance anxiety is recorded for several performances. Each of these examples involves correlated data: the same patient’s outcomes are more likely to be similar from week to week than outcomes from different patients; litter mates are more likely to suffer defects at similar rates than unrelated rat pups; and a musician’s anxiety is more similar from performance to performance than it is to other musicians’. Each of these examples violates the independence assumption of simpler linear models for LLSR or GLM inference.

1.3 Assumptions of LLSR

  • L: There is a linear relationship between the mean response (y) and the explanatory variable (x).

  • I: Errors are independent - there is no connection between how far any two points lie from the regression line.

  • N: The responses are normally distributed at each level of x.

  • E: The variance (or SD) of the responses is equal for all levels of x.

These assumptions are shown in Fig. 1.1 below.

  • L: The mean value for Y at each level of X falls on the regression line.

  • I: We’ll need to check the design of the study to determine if the errors (vertical distances from the line) are independent of one another.

  • N: At each level of X, the values for Y are normally distributed.

  • E: The spread in the Y’s for each level of X is the same.

1.3.1 Violation Examples

Below are some examples of violations of the LLSR assumptions and options for dealing with those data.

Models with binary responses

Response variable (y): Exam outcome (pass or fail)

Explanatory variable (x): Time spent studying in hours

Logistic Regression can be used here. Chapter 6 explains this.
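
A minimal sketch of how such a model might be fit in R (the data frame exams and its columns pass and hours are hypothetical, not from the book):

# Hedged sketch: logistic regression for a binary response
# `exams`, `pass` (0/1), and `hours` are hypothetical names
fit <- glm(pass ~ hours, family = binomial, data = exams)
summary(fit)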

Models with counts

Response variable (y): Family size, number of children

Explanatory variable (x): Annual income in dollars

Poisson models can be used here. Chapter 4 explains this.
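
A minimal sketch of the corresponding fit in R (the data frame families and its columns children and income are hypothetical):

# Hedged sketch: Poisson regression for a count response
# `families`, `children`, and `income` are hypothetical names
fit <- glm(children ~ income, family = poisson, data = families)
summary(fit)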

Models with two predictors

Response variable (y): Weight

Explanatory variables (x): Sex and hours spent exercising per week

The linearity assumption implies that there is a linear relationship between mean weight and amount of exercise for males and, similarly, a linear relationship between mean weight and amount of exercise for females. These data may not be appropriate for LLSR modeling because the standard deviation in weight for students who do not exercise, within each sex, is likely to be considerably greater than the standard deviation in weight for students who follow an exercise regime.

We can determine this violation by plotting weight by amount of exercise for males and females separately.
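
A minimal sketch of such a plot (the data frame studentData and its columns exercise, weight, and sex are hypothetical):

# Hedged sketch: look for unequal spread in weight across exercise levels,
# plotted separately by sex; data frame and column names are hypothetical
ggplot(studentData, aes(exercise, weight)) +
  geom_point() +
  facet_wrap(~ sex)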

This example might also violate independence because there is no indication that the subjects were randomly selected.

Multilevel models can be used here and can be found in Chapter 8.

Models with scales in response

Response variable (y): Surgery outcome, scale 1-10

Explanatory variable (x): Patient age, in years

Because outcomes for patients operated on by the same surgeon are more likely to be similar, there is a violation of independence.

Multilevel models can be used here and can be found in Chapter 8.

1.4 MLR Review

Case Study: Kentucky Derby

The Kentucky Derby is a 1.25-mile horse race held annually at the Churchill Downs race track in Louisville, Kentucky. Our data set derbyplus.csv contains the year of the race, the winning horse (winner), the condition of the track, the average speed (in feet per second) of the winner, and the number of starters (field size, or horses who raced) for the years 1896-2017 (Wikipedia contributors 2018). The track condition has been grouped into three categories: fast, good (which includes the official designations “good” and “dusty”), and slow (which includes the designations “slow”, “heavy”, “muddy”, and “sloppy”). We would like to use least squares linear regression techniques to model the speed of the winning horse as a function of track condition, field size, and trends over time.

derbyData <- read.csv("derbyplus.csv")

headData <- head(derbyData)

gt(headData) |>
  opt_table_font(google_font("Corbel")) |>
  as_raw_html(inline_css = TRUE)
| year | winner        | condition | speed | starters |
|------|---------------|-----------|-------|----------|
| 1896 | Ben Brush     | good      | 51.66 | 8        |
| 1897 | Typhoon II    | slow      | 49.81 | 6        |
| 1898 | Plaudit       | good      | 51.16 | 4        |
| 1899 | Manuel        | fast      | 50.00 | 5        |
| 1900 | Lieut. Gibson | fast      | 52.28 | 7        |
| 1901 | His Eminence  | fast      | 51.66 | 5        |

Step 1

Look at the data

a <- ggplot(derbyData, aes(speed))+
  geom_histogram(bins=14, fill="white", color="black")+
  labs(x="Winning speed (ft/s)")

b <- ggplot(derbyData, aes(starters))+
  geom_histogram(bins=14, fill="white", color="black")+
  labs(x="Number of starters")

ggarrange(a, b, labels = c("A", "B"))

Figure 1.2: Histograms of key continuous variables. Plot (A) shows winning speeds, while plot (B) shows the number of starters.

In Figure 1.2(A), we see that the primary response, winning speed, follows a distribution with a slight left skew, with a large number of horses winning with speeds between 53-55 feet per second. Plot (B) shows that the number of starters is mainly distributed between 5 and 20, with the largest number of races having between 15 and 20 starters.

The primary categorical explanatory variable is track condition, where 88 (72%) of the 122 races were run under fast conditions, 10 (8%) under good conditions, and 24 (20%) under slow conditions.
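
These counts and percentages can be verified with a quick tabulation (a sketch, not from the text):

# Hedged sketch: tabulate track conditions to check the counts quoted above
table(derbyData$condition)
round(100 * prop.table(table(derbyData$condition)))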

Step 2

Bivariate Summaries

# keep condition, year, starters, and speed (condition first so the
# categorical variable leads the pairs plot)
reduceData <- derbyData[c("condition", "year", "starters", "speed")]

ggpairs(reduceData)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 1.3: Relationships between pairs of variables in the Kentucky Derby data set.


Relationships between categorical variables like track condition and continuous variables can be illustrated with side-by-side boxplots as in the top row, or with stacked histograms as in the first column. As expected, we see evidence of higher speeds on fast tracks and also a tendency for recent years to have more fast conditions. These observed trends can be supported with summary statistics generated by subgroup. For instance, the mean speed under fast conditions is 53.6 feet per second, compared to 52.7 ft/s under good conditions and 51.7 ft/s under slow conditions. Variability in winning speeds, however, is greatest under slow conditions (SD = 1.36 ft/s) and least under fast conditions (0.94 ft/s).
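
The subgroup means and standard deviations quoted above can be reproduced with a quick summary (a sketch, assuming dplyr is loaded):

# Hedged sketch: mean and SD of winning speed by track condition
derbyData %>%
  group_by(condition) %>%
  summarise(mean_speed = mean(speed), sd_speed = sd(speed))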

Finally, notice that the diagonal illustrates the distribution of individual variables, using density curves for continuous variables and a bar chart for categorical variables. Trends observed in the last two diagonal entries match trends observed in Figure 1.2.

By using shape or color or other attributes, we can incorporate the effect of a third or even fourth variable into the scatterplots of Figure 1.3. For example, in the coded scatterplot of Figure 1.4 we see that speeds are generally faster under fast conditions, but the rate of increasing speed over time is greater under good or slow conditions.

reduceData$fastfactor = ifelse(reduceData$condition == "fast", "fast", "notfast")

ggplot(reduceData, aes(year, speed, color=fastfactor))+
  geom_point()+
  geom_smooth(method="lm", se=F, linetype="dashed")
## `geom_smooth()` using formula = 'y ~ x'
Figure 1.4: Linear trends in winning speeds over time, presented separately for fast conditions vs. good or slow conditions.


Of course, any graphical analysis is exploratory, and any notable trends at this stage should be checked through formal modeling. At this point, a statistician begins to ask familiar questions such as:

  • are winning speeds increasing in a linear fashion?

  • does the rate of increase in winning speed depend on track condition or number of starters?

  • after accounting for other explanatory variables, is greater field size (number of starters) associated with faster winning speeds (because more horses in the field means a greater chance one horse will run a very fast time) or slower winning speeds (because horses are more likely to bump into each other or crowd each others’ attempts to run at full gait)?

  • are any of these associations statistically significant?

  • how well can we predict the winning speed in the Kentucky Derby?

As you might expect, answers to these questions will arise from proper consideration of variability and properly identified statistical models.

Step 3

Multiple Regression Models with a Continuous Predictor

We will begin by modeling the winning speed as a function of time; for example, have winning speeds increased at a constant rate since 1896? For this initial model, let \(Y_i\) be the speed of the winning horse in year \(i\). Then, we might consider Model 1:

\(Y_i = \beta_0 + \beta_1 Year_i + \epsilon_i\) where \(\epsilon_i \sim N(0, \sigma^2)\)

In this case, \(\beta_0\) represents the true intercept—the expected winning speed during Year 0. \(\beta_1\) represents the true slope—the expected increase in winning speed from one year to the next, assuming the rate of increase is linear (i.e., constant with each successive year since 1896). Finally, the error (\(\epsilon_i\)) terms represent the deviations of the actual winning speed in Year \(i\) (\(Y_i\)) from the expected speeds under this model (\(\beta_0 + \beta_1 Year_i\))—the part of a horse’s winning speed that is not explained by a linear trend over time. The variability in these deviations from the regression model is denoted by \(\sigma^2\).

The parameters in this model (\(\beta_0\), \(\beta_1\), and \(\sigma^2\)) can be estimated through ordinary least squares methods; we will use hats to denote estimates of population parameters based on empirical data. Values for \(\hat\beta_0\) and \(\hat\beta_1\) are selected to minimize the sum of squared residuals, where a residual is simply the observed prediction error – the actual winning speed for a given year minus the winning speed predicted by the model. In the notation of this section,

  • Predicted speed: \(\hat{Y}_i = \hat\beta_0 + \hat\beta_1 Year_i\)

  • Residual (estimated error): \(\hat\epsilon_i = Y_i - \hat{Y}_i\)

  • Estimated variance of points around the line: \(\hat\sigma^2 = \sum \hat\epsilon_i^2 / (n-2)\)
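
These quantities can be cross-checked by hand; a minimal sketch:

# Hedged sketch (not from the text): recompute sigma-hat^2 from residuals
fit <- lm(speed ~ year, data = derbyData)
sum(residuals(fit)^2) / (nrow(derbyData) - 2)  # should match 0.9032^2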

THE MODEL

model1 <- lm(speed ~ year, data=derbyData)

model1
## 
## Call:
## lm(formula = speed ~ year, data = derbyData)
## 
## Coefficients:
## (Intercept)         year  
##     2.05347      0.02613
summary(model1)
## 
## Call:
## lm(formula = speed ~ year, data = derbyData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.08190 -0.50026  0.07387  0.67367  1.68720 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.053473   4.543754   0.452    0.652    
## year        0.026126   0.002322  11.251   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9032 on 120 degrees of freedom
## Multiple R-squared:  0.5134, Adjusted R-squared:  0.5093 
## F-statistic: 126.6 on 1 and 120 DF,  p-value: < 2.2e-16

THE SUMMARY

Using Kentucky Derby data, we estimate \(\hat\beta_0\) = 2.05, \(\hat\beta_1\) = 0.026, and \(\hat\sigma\) = 0.90. Thus, according to our simple linear regression model, winning horses of the Kentucky Derby have an estimated winning speed of 2.05 ft/s in Year 0 (more than 2000 years ago!), and the winning speed improves by an estimated 0.026 ft/s every year. With an \(R^2\) of 0.513, the regression model explains a moderate amount (51.3%) of the year-to-year variability in winning speeds, and the trend toward a linear rate of improvement each year is statistically significant at the 0.05 level (\(t_{120} = 11.251\), p < .001).

Year 0 is confusing, since the data begin in 1896. To make your model more meaningful, you can center the data.

In this case, we could create a centered year variable by subtracting 1896 from each year for Model 2:

\(Y_i = \beta_0 + \beta_1 Yearnew_i + \epsilon_i\) where \(\epsilon_i \sim N(0, \sigma^2)\) and \(Yearnew = Year - 1896\)

reduceData$Yearnew = reduceData$year - 1896

model2 <- lm(speed ~ Yearnew, data=reduceData)

model2
## 
## Call:
## lm(formula = speed ~ Yearnew, data = reduceData)
## 
## Coefficients:
## (Intercept)      Yearnew  
##    51.58839      0.02613
sum2 <- summary(model2)
## R squared = 0.5134
## Residual Error = 0.9032

Note that the only thing that changes from Model 1 to Model 2 is the estimated intercept; \(\hat\beta_1\), \(R^2\), and \(\hat\sigma\) all remain exactly the same. Now \(\hat\beta_0\) tells us that the estimated winning speed in 1896 is 51.59 ft/s, but estimates of the linear rate of improvement or the variability explained by the model remain the same. As Figure 1.5 shows, centering year has the effect of shifting the y-axis from year 0 to year 1896, but nothing else changes.

ggplot(reduceData, aes(Yearnew, speed))+
  geom_smooth(method="lm", se=F)
## `geom_smooth()` using formula = 'y ~ x'
Figure 1.5: Compare Model 1 (with intercept at 0) to Model 2 (with intercept at 1896).


Step 4

Verify Assumptions

op <- par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(model2)                # standard graphical output
par(op)                     # restore the previous plotting parameters
Figure 1.6: Residual Plots for Model 2.


In this case, the Residuals vs. Fitted plot indicates that a quadratic fit might be better than the linear fit of Model 2; other assumptions look reasonable. Influential points would be denoted by high values of Cook’s Distance; they would fall outside of the cutoff lines in the northeast or southeast section of the Residuals vs. Leverage plot. Since no cutoff lines are even noticeable, there are no potentially influential points of concern. GREAT!
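
Cook’s distances can also be pulled out numerically as a cross-check (a sketch, not part of the original analysis; the 4/n cutoff is one common rule of thumb):

# Hedged sketch: flag observations with large Cook's distance
cd <- cooks.distance(model2)
which(cd > 4 / length(cd))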

We recommend relying on graphical evidence for identifying regression model assumption violations, looking for highly obvious violations of assumptions before trying corrective actions. While some numerical tests have been devised for issues such as normality and influence, most of these tests are not very reliable, highly influenced by sample size and other factors. There is typically no residual plot, however, to evaluate the independence assumption; evidence for lack of independence comes from knowing about the study design and methods of data collection. In this case, with a new field of horses each year, the assumption of independence is pretty reasonable.

Based on residual diagnostics, we should test Model 2Q, in which a quadratic term is added to the linear term in Model 2.

\(Y_i = \beta_0 + \beta_1 Yearnew_i + \beta_2 Yearnew_i^2 + \epsilon_i\) where \(\epsilon_i \sim N(0, \sigma^2)\)

This model could suggest that the rate of increase in winning speed is slowing down over time. In fact, there is evidence that the quadratic model improves upon the linear model (see Fig. 1.7). \(R^2\), the proportion of year-to-year variability in winning speeds explained by the model, has increased from 51.3% to 64.1%, and the pattern in the Residuals vs. Fitted plot of Fig. 1.6 has disappeared in Figure 1.8, although normality is a little sketchier in the left tail, and the larger mass of points with fitted values near 54 appears to have slightly lower variability. The significantly negative coefficient for \(\beta_2\) suggests that the rate of increase is indeed slowing in more recent years.

derby.df <- mutate(reduceData, yearnew2 = Yearnew^2)
model2q <- lm(speed ~ Yearnew + yearnew2, data = derby.df)

model2q
## 
## Call:
## lm(formula = speed ~ Yearnew + yearnew2, data = derby.df)
## 
## Coefficients:
## (Intercept)      Yearnew     yearnew2  
##  50.5875394    0.0761710   -0.0004136
sum2 <- summary(model2q)

Step 5

Recheck assumptions (not doing it here)
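
If we did recheck here, a minimal sketch would simply rerun the diagnostic panel on the quadratic model:

# Hedged sketch: diagnostic plots for Model 2Q
op <- par(mfrow = c(2, 2))
plot(model2q)
par(op)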

Step 6

More models

# create the 0/1 indicator used in the book: 1 = fast track, 0 = otherwise
derby.df <- mutate(derby.df, fast = ifelse(condition == "fast", 1, 0))

model3 <- lm(speed ~ fast, data = derby.df)

model3
## 
## Call:
## lm(formula = speed ~ fast, data = derby.df)
## 
## Coefficients:
## (Intercept)         fast  
##      51.994        1.629
sum4 <- summary(model3)

##  R squared =  0.3236 
##  Residual standard error =  1.065

## Note that this is simply a two-sample t-test comparing mean winning
## speeds on fast vs. non-fast tracks


model4 <- lm(speed ~ Yearnew + fast, data = derby.df)

model4
## 
## Call:
## lm(formula = speed ~ Yearnew + fast, data = derby.df)
## 
## Coefficients:
## (Intercept)      Yearnew         fast  
##    50.91782      0.02258      1.22685

Our new model estimates that winning speeds are, on average, 1.23 ft/s faster under fast conditions after accounting for time trends, which is down from an estimated 1.63 ft/s without accounting for time. It appears our original model (Model 3) may have overestimated the effect of fast conditions by conflating it with improvements over time. Through our new model, we also estimate that winning speeds increase by 0.023 ft/s per year, after accounting for track condition. This yearly effect is also smaller than the 0.026 ft/s per year we estimated in Model 1, without adjusting for track condition. Based on the \(R^2\) value, Model 4 explains 68.7% of the year-to-year variability in winning speeds, a noticeable increase over using either explanatory variable alone.
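
The quoted \(R^2\) can be pulled directly from the fitted model (a sketch, not from the text):

# Hedged sketch: extract R^2 for Model 4
summary(model4)$r.squared  # ~0.687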

Stopped at 1.6.4

to be continued…