Today's lab serves as a review for your first exam. Work carefully, and ask questions if you have any.

There are two parts to this lab. The first part involves building a linear model using a specified response and explanatory variable. In the second part of the lab, you will be asked to choose a different variable that might provide a stronger linear model.

Remember that now is a good time to clear your workspace if you have not already! We created a lot of objects in the last lab, and today's work will be easier if your workspace is clean.

The Data

The data we are working with today are a subset of a larger data set concerning births in North Carolina. The sample we will work with is a random sample of 1000 births. To load the nc data set, put the code below in a chunk in your Markdown.

download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")

Note that this is a good time to try the echo=FALSE tip for saving space in your final document. At this point in the course, you all know how to load the data, and we don't need that to show up in your knitted document. At the top of the R chunk where you have loaded the data, change the {r} command to {r, echo = FALSE}. Knit and make sure your document compiles.

The nc data set contains information on 13 variables. Each row in the data set represents a birth (meaning a baby born in North Carolina).

fage: father’s age in years.
mage: mother’s age in years.
mature: maturity status of mother.
weeks: length of pregnancy in weeks.
premie: whether the birth was classified as premature (premie) or full-term.
visits: number of hospital visits during pregnancy.
marital: whether mother is married or not married at birth.
gained: weight gained by mother during pregnancy in pounds.
weight: weight of the baby at birth in pounds.
lowbirthweight: whether baby was classified as low birthweight (low) or not (not low).
gender: gender of the baby, female or male.
habit: status of the mother as a nonsmoker or a smoker.
whitemom: whether mom is white or not white.

Part 1

Step 1: Exploratory Data Analysis

The first step in starting any analysis is exploratory data analysis (EDA). This generally involves graphs and a few numeric computations.

For our work today, we will use weight, i.e., the birth weight of a baby, as our response variable. Before we choose an explanatory variable, let's explore this variable on its own.

Use one visualization (i.e., a plot you think is appropriate) to explore the distribution of baby weight. Justify why the type of plot you chose is appropriate, and explain what your plot illustrates.

Now that we have chosen and explored our response variable, let's choose an explanatory variable. For now, let's use gained, the amount of weight a mother gained during pregnancy.

Get a summary of the gained variable using the summary command. From how many mothers are we missing weight gain data? Why is this a problem if we want to use this variable as an explanatory variable?
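As a reminder of how missing values show up in R output, here is a minimal sketch on a toy vector. The name toy_gain and its values are made up for illustration and are not taken from the nc data.

```r
# Toy vector (hypothetical values, not from the nc data) with two missing entries.
toy_gain <- c(10, NA, 25, 30, NA, 18)

# summary() reports an "NA's" count alongside the usual numeric summaries.
summary(toy_gain)

# The missing values can also be counted directly.
n_missing <- sum(is.na(toy_gain))
n_missing
```

The same two commands can be applied to a column of a data set, such as nc$gained.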

1.1: Missing Data

Missing data is a common reality in statistics, and as part of EDA, it is important to consider which of the variables involved have missing information. Most data sets (including data sets you might choose for your projects) have missing data to some degree.

There are a lot of ways of handling missing data in the real world, but unfortunately many of these methods are beyond the scope of this class. If you are interested, ask me! For today, we will be handling the missing data by creating a "cleaned" version of the data set. We say data have been "cleaned" if we have checked for and handled all data quality issues (i.e., missing data, incorrect values, etc.).

R has a variety of ways of dealing with missing data, but the most direct in this case is the na.omit function:

ncClean <- na.omit(nc)

This command creates a new data set, called ncClean. Basically, we have told R, "hey, R, take a look at the nc data set. Now, remove every row that contains an NA in any column, i.e., every row with missing data. Store the result in a new data set called ncClean."

All of this means that nc is a bigger data set than ncClean. Take a look at your workspace, and find both data sets.
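To see what na.omit does to the rows of a data set, here is a small sketch on a toy data frame. The data frame toy and its values are hypothetical and chosen only so that the missing values are easy to spot.

```r
# Toy data frame (hypothetical values) with missing entries in two different columns.
toy <- data.frame(
  gained = c(20, NA, 35, 28),
  weeks  = c(39, 40, NA, 38),
  weight = c(7.1, 6.8, 7.5, 6.9)
)

# na.omit() keeps only the rows in which every column is observed.
toyClean <- na.omit(toy)

nrow(toy)       # number of rows before cleaning
nrow(toyClean)  # number of rows after cleaning
toyClean
```

Notice that a row is dropped if any of its columns is missing, not just the column we happen to care about.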

How many fewer rows does ncClean have than nc? Based on this, do you think gained was the only variable with missing data?

Now, an important note. For the rest of the analysis, you should use ncClean, not nc, as your data set.

1.2: Back to EDA

Now that we have determined how to deal with the missing data in our explanatory variable, let's actually explore the relationship between mother's weight gained and baby weight.

There are 4 things we comment on when we are considering the relationships between two numeric variables. What are they?

Make a plot to explore the relationship between mother's weight gain and baby weight. Comment on each of the 4 things you listed in the previous question.

Step 2: Fitting the model

Now that we have completed our EDA, chosen our response and predictor variables, and explored their relationship, let's fit a model, i.e., find the regression line.

Write down the regression line and interpret both coefficients. We will call this line M1.

What percentage of the variability in weight is explained by its linear relationship with gained?
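If you need a refresher on the commands, here is a minimal sketch of fitting a line and pulling out R-squared. It uses cars, a small data set that ships with R, as a stand-in, so the variable names dist and speed are not part of today's lab; swap in your own response, explanatory variable, and data set.

```r
# `cars` ships with R (stopping distance `dist` versus `speed`); it stands in for the lab data.
fit <- lm(dist ~ speed, data = cars)

# Intercept and slope of the fitted regression line.
coef(fit)

# Proportion of variability in the response explained by the line (R-squared).
r2 <- summary(fit)$r.squared
r2
```

The full summary(fit) output contains both of these pieces, along with the standard errors used in hypothesis tests.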

Step 4: Using the model

Do we have convincing evidence of a linear relationship between mother's weight gain and baby weight? Show all steps, interpret the p-value, and state your conclusion. To produce the necessary mathematical notation for the 6 steps, copy and paste the following into the white space in your Markdown.

Step 1: $H_0:$, $H_a:$

Step 2: $\hat{\beta}_1 =$, $SE_{\hat{\beta}_1} =$

Step 3:

Step 4: The sampling distribution of the test statistic under the null is...

Step 5: The p-value is ?. This means that ...

Step 6:
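The numbers for Steps 2 through 5 can be read from the summary of a fitted model or computed by hand. Here is a sketch using the built-in cars data as a stand-in (not the lab data), assuming a two-sided test of a slope of zero.

```r
# Stand-in model on the built-in `cars` data; replace with your own model.
fit <- lm(dist ~ speed, data = cars)
coefs <- summary(fit)$coefficients

# Step 2: the estimated slope and its standard error.
b1    <- coefs["speed", "Estimate"]
se_b1 <- coefs["speed", "Std. Error"]

# Step 3: the test statistic under H0: slope = 0.
t_stat <- b1 / se_b1

# Step 4: under the null, the statistic follows a t distribution with n - 2 degrees of freedom.
resid_df <- df.residual(fit)

# Step 5: two-sided p-value.
p_value <- 2 * pt(abs(t_stat), df = resid_df, lower.tail = FALSE)
p_value
```

The hand computation should agree with the Pr(>|t|) column that summary(fit) prints for the slope.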

Given your answer to Question 7, are you surprised by the result of your hypothesis test? Why or why not?

Create the confidence interval using the confidence level you chose in the previous question. Write down and interpret the interval.

Is 0 in your confidence interval? Did you expect it to be? Explain.
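A confidence interval for the slope can be built with confint or by hand. The sketch below again uses the built-in cars data as a stand-in and assumes a 95% confidence level; change level (and the 0.975 quantile) if you chose a different one.

```r
# Stand-in model on the built-in `cars` data; replace with your own model.
fit <- lm(dist ~ speed, data = cars)

# Interval for the slope straight from R.
ci <- confint(fit, "speed", level = 0.95)
ci

# The same interval by hand: estimate plus or minus t critical value times standard error.
coefs  <- summary(fit)$coefficients
b1     <- coefs["speed", "Estimate"]
se_b1  <- coefs["speed", "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))
ci_manual <- c(b1 - t_crit * se_b1, b1 + t_crit * se_b1)
ci_manual
```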

Practical versus significant

There is a difference in statistics between statistical significance and practical significance. Statistical significance has to do with the hypothesis test framework. If we determine there is convincing evidence of a linear relationship using a confidence interval or hypothesis test, we generally say the relationship is statistically significant.

Practical significance is a different beast. For instance, suppose we find out that there is evidence of a relationship, i.e., a statistically significant relationship, between the chance that it is going to rain and the temperature. Cool. Now say we go a step further and build a 95% confidence interval to describe this relationship. We find that for every 1% increase in the chance of rain, we are 95% confident that on average the temperature will decrease by between 0.0005 and 0.001 degrees Fahrenheit.

Statistically, this is significant. Practically, do we really care if the temperature changes by less than a tenth of a degree? Does this make any practical difference? If we're walking outside, probably not. If we are running some kind of physics experiment that is highly sensitive to temperature, maybe. This is an important distinction. Statistical significance depends on your p-value. Practical significance depends upon the situation. And something can be statistically significant without being practically significant.

So, how do we determine practical significance? In the case of the weather example, you can just use logic to determine if we have practical significance. If I'm deciding whether or not to wear a jacket, less than a tenth of a degree makes little difference. If the situation is more specific (the technical physics experiment, for instance), we use expert knowledge to determine practical significance. Ask someone who knows what might make a practical difference in the situation at hand.

This is why in statistics the confidence interval is generally more important, and more useful, than being able to claim something is statistically significant. If we have a range of plausible values, like a confidence interval, we provide concrete information about the relationship between two variables. This information can then be used for a variety of purposes. The bottom line: In general, claiming statistical significance is less important than providing a range of values.
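To see the distinction in action, here is a small simulation sketch. Everything in it is made up for illustration: the seed, the sample size, and the true slope of 0.03 are arbitrary choices, and the data are simulated rather than real.

```r
set.seed(212)  # arbitrary seed so the simulation is reproducible

# Simulated data: a very large sample with a very weak true relationship.
n <- 100000
x <- rnorm(n)
y <- 0.03 * x + rnorm(n)

fit <- lm(y ~ x)

# With this many observations, even a tiny slope gives a small p-value...
p_value <- summary(fit)$coefficients["x", "Pr(>|t|)"]
p_value

# ...yet the line explains almost none of the variability in y.
r2 <- summary(fit)$r.squared
r2

# The confidence interval shows how small the plausible slopes are.
confint(fit, "x", level = 0.95)
```

Whether a slope of this size matters depends on the situation, which is exactly the practical significance question.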

Does the confidence interval you created in the previous question suggest a practical relationship between weight gained by the mother and baby weight? Why or why not? Based on your value of R-squared, is this surprising?

Part 2

Let's now consider only babies that were born prematurely. We will use this data subset for all remaining questions in this lab.

premie <- subset(ncClean, premie == "premie" & weeks > 25)

What have we just done? We have created a subset of the ncClean data set which restricts us to only looking at babies which (1) were born prematurely (i.e., the premie variable equals "premie") and (2) specifically were born at more than 25 weeks.
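Here is the same kind of subsetting on a toy data frame, as a quick sketch of how the two conditions combine. The names toy and toy_premie and all of the values are hypothetical.

```r
# Toy data frame (hypothetical values) with one full-term birth and three premature births.
toy <- data.frame(
  premie = c("premie", "full term", "premie", "premie"),
  weeks  = c(34, 39, 24, 30),
  weight = c(5.2, 7.4, 1.6, 3.9)
)

# Keep only the rows that satisfy BOTH conditions (& means "and").
toy_premie <- subset(toy, premie == "premie" & weeks > 25)
toy_premie
```

The premature birth at 24 weeks is dropped because it fails the second condition, and the full-term birth is dropped because it fails the first.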

Step 1: EDA

As we noted earlier, we are working with a rich data set, meaning there are a lot of variables to explore.

We have explored weight and gained. Write down the names of the other numeric variables in the data set. How many are there?

Our model for weight using gained was not particularly strong. Now, let's try choosing a predictor that will yield a stronger linear relationship. We will keep weight as our response variable, but we want a new explanatory variable.

What could we use to help decide among the models? Explain.

Based on your answer to the previous question, choose the numeric explanatory variable that you think is the best predictor of weight. Which variable have you chosen, and why?

Step 2: Fit the Model

Write out the regression line. Interpret the slope coefficient.

Step 3: Check Conditions

Is the constant variance assumption violated for this model? Justify your response. If the condition is violated, how might we try to fix that?
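As a reminder, one common way to examine this condition is a plot of residuals against fitted values. The sketch below uses the built-in cars data as a stand-in; replace the model with the one you fit in Step 2.

```r
# Stand-in model on the built-in `cars` data; replace with your own model.
fit <- lm(dist ~ speed, data = cars)

# Residuals versus fitted values: look at whether the vertical spread
# stays roughly the same from left to right.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # dashed reference line at zero
```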

Implement the suggestion you made in the previous question, and check the constant variability condition again. Does your change result in any improvement? Justify your response.

Are there any general outliers? Justify your answer. Hint: Suppose x is a vector containing 1,4,3,5,2. If I want to know which elements in x are greater than 3, I can use which(x > 3). The answer comes back 2 and 4, because the 2nd and 4th elements in the vector are greater than 3.
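The hint above can be run directly. The names idx and big are just illustrative.

```r
# The vector from the hint.
x <- c(1, 4, 3, 5, 2)

# which() returns the positions (not the values) where the condition is TRUE.
idx <- which(x > 3)
idx   # 2 4

# Use those positions to pull out the values themselves.
big <- x[idx]
big   # 4 5
```

The same idea works on any numeric vector, for example a vector of residuals from a fitted model.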

Write down the final chosen regression line and interpret the slope coefficient in terms of both the transformed and original data scale.

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2020 February 6.

The css file used to format this lab was retrieved from the GitHub of Mine Çetinkaya-Rundel, version 2016 Jan 13.

The data set used in this lab is the nc data set, provided by OpenIntro.

STA 212 Lab 4: Simple Linear Regression Synopsis