Last lab, we focused on learning to fit a LSLR Line (i.e., find the values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\)) using R. We looked specifically at coffee data, and went through the steps of building and interpreting an LSLR line.
It turns out that the choice to use a LSLR model involves some assumptions. We have already discussed the fact that we really shouldn't use a LSLR model if the relationship between Y and X isn't, well, linear. In other words, in a graph, the relationship between Y and X should look like a line. This is called the shape condition, and it is one of the first things we check when we choose a LSLR model.
In addition to the shape condition, there are 5 other conditions that have to be met in order for us to use a LSLR model. Today, we will focus on understanding these conditions and learning how to check them using R.
Remember that now is a good time to clear your Environment if you have not already! We created a lot of objects in the last lab, and today's work will be easier if your Environment is clean.
The data we are working with today is a subset of a larger data set concerning births in North Carolina. The sample we will work with is a random sample of 800 births. To load the data, put the code below in a chunk in your Markdown and press play.
download.file("http://www.openintro.org/stat/data/nc.RData",destfile = "nc.RData") load("nc.RData")
nc <- na.omit(nc)
Note that this is a good time to try the echo=FALSE
tip for saving space in your final document. At this point in the course, you all know how to load the data, and we don't need that to show up in your knitted document. At the top of the R chunk where you have loaded the data, change the {r} command to {r, echo = FALSE}. Knit and make sure your document compiles.
The nc
data set contains information on 12 variables. Each row in the data set represents a birth (meaning a baby born in North Carolina).
fage
: father’s age in years.mage
: mother’s age in years.mature
: maturity status of mother (over 35 or under 35)weeks
: length of pregnancy in weeks.premie
: whether the birth was classified as premature (premie) or full-term.visits
: number of hospital visits during pregnancy.marital
: whether mother is married
or not married
at the baby's birth.gained
: weight gained by mother during pregnancy in pounds.weight
: weight of the baby at birth in pounds.lowbirthweight
: whether baby was classified as low birthweight (low
) or not (not low
).gender
: biological sex of the baby, limited to female
or male
.habit
: status of the mother as a nonsmoker
or a smoker
.Today, we will be exploring 4 of these variables, two at at a time. In Part 1 will we will explore mother's age and father's age. In Part 2, we will explore weight at birth and weeks.
The first step in starting any analysis is exploratory data analysis (EDA). This generally involves graphs and a few numeric computations.
We are going to run two analyses today. For the first, we will use fage
, i.e., the age of the father, as our response variable.
Now that we have chosen and explored our response variable, let's choose an explanatory variable. For now, let's use mage
, the age of the mother, and explore the relationship between the two variables.
We have now gained some insight into whether or not an LSLR model might be a possibility for these data. Before we decide, there are a few other conditions we need to check to see if an LSLR model is an appropriate choice. Let's check these now!
One very powerful tool we use for checking conditions of an LSLR model is a residual plot. Residual plots are special scatter plot where the x-axis is the explanatory variable, and the y-axis is the residuals from the LSLR line.
To make a residual plot, we first have to fit our LSLR line. To do this, we use the code below:
m1 <- lm(fage ~ mage, data = nc)
You always have to fit the LSLR model before making the residual plot, as we need the residuals from the model fit on the y-axis of the plot. Once we have fit the model, we can obtain the residuals using the code m1$residuals
.
+geom_hline(yintercept = 0)
.
Hint: This is just a special scatter plot, so you use the same code you would for a scatter plot. You just need a very specific y-axis.Another condition for LSLR models is the normality condition. Not only do we assume that our errors \(\epsilon \) have a mean of 0 and some constant standard deviation \(\sigma\), we also assume that these errors follow a normal distribution. There are two different kinds of plots we use to check this condition.
The normality condition tells us that we assume that our errors follow a normal distribution. This assumption is going to be very important when we move into our discussion of inference for LSLR models. Hypothesis tests, confidence intervals...they will all rely on this. Our task now is to determine if the normality assumption is met for our LSLR line.
Since we do not actually have information on the errors, we use the residuals as a way to check the normality assumption. One way to assess whether or not the residuals seem normal is to look at a plot of their distribution. In other words, if we plot the residuals, do we get a shape that looks like a normal distribution? The first plot we will use to check is a histogram.
m1
look like what we would expect for a normal distribution? Explain.
With a histogram, we can check to see if the distribution has the modality and skew we expect from a normal distribution, but it does not allow us to check the spread. Normal distributions have a very specfic spread, where 95% of the data is within rough 1.96 standard deviations of the mean, and 68% is within rouhgl 1 standard deviation, etc. We cannot use histograms to check to see if the spread of the data matches up with what we would expect to see in a normal distribution. To check that, we need a second plot. This is called a quantile plot, or QQ plot.
To make a QQplot, run the following two lines of code:
qqnorm(m1$residuals)
qqline(m1$residuals)
The normal QQplot is a little less intuitive than the histogram - however, it provides useful information. The QQplot plots the quantiles of a normal distribution on the x-axis, and the quantiles of our residuals on the y-axis. If our residuals follow a normal distribution, then the quantiles should roughly line up. In other words, the plot should yield a fairly straight line.
Now that we have some experience checking conditions, let's try it with another pair of variables. We will now use weeks
, i.e., the number of weeks the mother carried the baby, as our response variable and weight
, the amount a baby weighed at birth (in pounds), as our explanatory variable. We are also going to change our data set a little. Now, we are only going to look at preemie babies.
preemie <- subset(nc, nc$premie == "premie" & nc$weeks > 25)
I'll give you a pretty strong hint here - at least one of these conditions does not look like it is reasonable to assume. So, what do we do???
One of the ways to address concerns with violating the constant variance assumption is to try a transformation of the response variable. To do this, we create a model called log_model
. This model uses exactly the same type of code as m1
, but with different predictors and log(weight)
in the place of the response variable.
Notice that this code is the same as our usual code, aside from the fact that we have taken the log (natural log, remember!) of our response variable. Applying log()
takes the natural log of every value in the parentheses. Similarly, using sqrt()
would take the square root, if we decided that transformation was more appropriate.
Let's work with the transformed model for now to get some practice.
log_model
and interpret the slope in terms of both the transformed AND original data scale Hint: Check the slides from last class if you aren't sure what this means. Remember that you have code in the last lab to help with writing math code in R.