Complete all Questions, and submit final documents in PDF or html form on Canvas.

The Goal

Last lab, we focused on learning to fit a LSLR Line (i.e., find the values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\)) using R. We looked specifically at coffee data, and went through the steps of building and interpreting an LSLR line.

It turns out that the choice to use a LSLR model involves some assumptions. We have already discussed the fact that we really shouldn't use a LSLR model if the relationship between Y and X isn't, well, linear. In other words, in a graph, the relationship between Y and X should look like a line. This is called the shape condition, and it is one of the first things we check when we choose a LSLR model.

In addition to the shape condition, there are 5 other conditions that have to be met in order for us to use a LSLR model. Today, we will focus on understanding these conditions and learning how to check them using R.

Remember that now is a good time to clear your Environment if you have not already! We created a lot of objects in the last lab, and today's work will be easier if your Environment is clean.

The Data

The data we are working with today is a subset of a larger data set concerning births in North Carolina. The sample we will work with is a random sample of 800 births. To load the data, put the code below in a chunk in your Markdown and press play.

download.file("http://www.openintro.org/stat/data/nc.RData",destfile = "nc.RData")
load("nc.RData")
nc <- na.omit(nc)

Note that this is a good time to try the echo=FALSE tip for saving space in your final document. At this point in the course, you all know how to load the data, and we don't need that to show up in your knitted document. At the top of the R chunk where you have loaded the data, change the {r} command to {r, echo = FALSE}. Knit and make sure your document compiles.

The nc data set contains information on 12 variables. Each row in the data set represents a birth (meaning a baby born in North Carolina).

  • fage: father’s age in years.
  • mage: mother’s age in years.
  • mature: maturity status of mother (over 35 or under 35)
  • weeks: length of pregnancy in weeks.
  • premie: whether the birth was classified as premature (premie) or full-term.
  • visits: number of hospital visits during pregnancy.
  • marital: whether mother is married or not married at the baby's birth.
  • gained: weight gained by mother during pregnancy in pounds.
  • weight: weight of the baby at birth in pounds.
  • lowbirthweight: whether baby was classified as low birthweight (low) or not (not low).
  • gender: biological sex of the baby, limited to female or male.
  • habit: status of the mother as a nonsmoker or a smoker.

Today, we will be exploring 4 of these variables, two at at a time. In Part 1 will we will explore mother's age and father's age. In Part 2, we will explore weight at birth and weeks.

Part 1: Mother's Age and Father's Age

Step 1: Exploratory Data Analysis

The first step in starting any analysis is exploratory data analysis (EDA). This generally involves graphs and a few numeric computations.

We are going to run two analyses today. For the first, we will use fage, i.e., the age of the father, as our response variable.

  1. Use one visualization (i.e., a plot you think is appropriate) to explore the distribution of father's age. Justify why the type of plot you chose is appropriate, and explain what your plot illustrates.
  2. Now that we have chosen and explored our response variable, let's choose an explanatory variable. For now, let's use mage, the age of the mother, and explore the relationship between the two variables.

    1. Make a plot to explore the relationship between X = mother's age and Y = father's age. Make sure you have labeled your axes, and describe the relationship between the two variables.
    1. Based on what we have so far, does it seem like an LSLR model might be appropriate? Comment on why or why not.

    We have now gained some insight into whether or not an LSLR model might be a possibility for these data. Before we decide, there are a few other conditions we need to check to see if an LSLR model is an appropriate choice. Let's check these now!

    Residual Plots

    One very powerful tool we use for checking conditions of an LSLR model is a residual plot. Residual plots are special scatter plot where the x-axis is the explanatory variable, and the y-axis is the residuals from the LSLR line.

    1. Which of the 6 conditions can be checked using a residual plot? Write down all that apply.

    To make a residual plot, we first have to fit our LSLR line. To do this, we use the code below:

    m1 <- lm(fage ~ mage, data = nc)

    You always have to fit the LSLR model before making the residual plot, as we need the residuals from the model fit on the y-axis of the plot. Once we have fit the model, we can obtain the residuals using the code m1$residuals.

    1. Create a residual plot for our LSLR model (m1). Label the x-axis "Mother's Age" and the y-axis "Residuals". If you want to add on the horizontal line at 0, add on +geom_hline(yintercept = 0). Hint: This is just a special scatter plot, so you use the same code you would for a scatter plot. You just need a very specific y-axis.
    1. Based on the plot, so you think constant variance seems to be a reasonable assumption? Explain your answer.

    QQ Plots

    Another condition for LSLR models is the normality condition. Not only do we assume that our errors \(\epsilon \) have a mean of 0 and some constant standard deviation \(\sigma\), we also assume that these errors follow a normal distribution. There are two different kinds of plots we use to check this condition.

    The normality condition tells us that we assume that our errors follow a normal distribution. This assumption is going to be very important when we move into our discussion of inference for LSLR models. Hypothesis tests, confidence intervals...they will all rely on this. Our task now is to determine if the normality assumption is met for our LSLR line.

    Since we do not actually have information on the errors, we use the residuals as a way to check the normality assumption. One way to assess whether or not the residuals seem normal is to look at a plot of their distribution. In other words, if we plot the residuals, do we get a shape that looks like a normal distribution? The first plot we will use to check is a histogram.

    1. Create a histogram of the residuals, and label the x-axis Residuals. Does the skew and the modality of the histogram of the residuals of m1 look like what we would expect for a normal distribution? Explain.

    With a histogram, we can check to see if the distribution has the modality and skew we expect from a normal distribution, but it does not allow us to check the spread. Normal distributions have a very specfic spread, where 95% of the data is within rough 1.96 standard deviations of the mean, and 68% is within rouhgl 1 standard deviation, etc. We cannot use histograms to check to see if the spread of the data matches up with what we would expect to see in a normal distribution. To check that, we need a second plot. This is called a quantile plot, or QQ plot.

    To make a QQplot, run the following two lines of code:

    qqnorm(m1$residuals)
    qqline(m1$residuals)  

    The normal QQplot is a little less intuitive than the histogram - however, it provides useful information. The QQplot plots the quantiles of a normal distribution on the x-axis, and the quantiles of our residuals on the y-axis. If our residuals follow a normal distribution, then the quantiles should roughly line up. In other words, the plot should yield a fairly straight line.

    1. Create a QQplot of the residuals. Based on the plot, does it look like the spread of the data reasonably matches what we might see in a normal distribution? Explain.

    Part 2

    Now that we have some experience checking conditions, let's try it with another pair of variables. We will now use weeks, i.e., the number of weeks the mother carried the baby, as our response variable and weight, the amount a baby weighed at birth (in pounds), as our explanatory variable. We are also going to change our data set a little. Now, we are only going to look at preemie babies.

preemie <-  subset(nc, nc$premie == "premie" & nc$weeks > 25)
  1. Make a plot to explore the relationship between X = weeks and Y = weight for the preemie data set. Label your axes, and describe the relationship between the two variables.
  1. Make and show appropriate plots to check the zero mean and constant variance conditions. Explain whether or not the shape seems linear, and whether zero mean and constant variance seem to be reasonable.

I'll give you a pretty strong hint here - at least one of these conditions does not look like it is reasonable to assume. So, what do we do???

Trying to Improve

One of the ways to address concerns with violating the constant variance assumption is to try a transformation of the response variable. To do this, we create a model called log_model. This model uses exactly the same type of code as m1, but with different predictors and log(weight) in the place of the response variable.

Notice that this code is the same as our usual code, aside from the fact that we have taken the log (natural log, remember!) of our response variable. Applying log() takes the natural log of every value in the parentheses. Similarly, using sqrt() would take the square root, if we decided that transformation was more appropriate.

  1. Fit the model for log(weight). Check the constant variability condition for the transformed model. Does the log transformation result in any improvement? Justify your response.

Let's work with the transformed model for now to get some practice.

  1. Write down the LSLR line for log_model and interpret the slope in terms of both the transformed AND original data scale Hint: Check the slides from last class if you aren't sure what this means. Remember that you have code in the last lab to help with writing math code in R.
Creative Commons License
This work was created by Nicole Dalzell is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2020 Sept 21.
The css file used to format this lab was retrieved from the GitHub of Mine Çetinkaya-Rundel, version 2016 Jan 13.
The data set used in this lab is the nc data set, provided by OpenIntro.