Today's lab serves as a review for your first exam. Work carefully, and ask questions if you have any.
There are two parts to this lab. The first part involves building a linear model using a specified response and explanatory variable. In the second part of the lab, you will be asked to choose a different variable that might provide a stronger linear model.
Remember that now is a good time to clear your workspace if you have not already! We created a lot of objects in the last lab, and today's work will be easier if your workspace is clean.
The data we are working with today is a subset of a larger data set concerning births in North Carolina. The sample we will work with is a random sample of 1000 births. To load the nc data set, put the code below in a chunk in your Markdown.
download.file("http://www.openintro.org/stat/data/nc.RData",destfile = "nc.RData")
load("nc.RData")
Note that this is a good time to try the echo=FALSE tip for saving space in your final document. At this point in the course, you all know how to load the data, and we don't need that to show up in your knitted document. At the top of the R chunk where you have loaded the data, change the {r} command to {r, echo = FALSE}. Knit and make sure your document compiles.
The nc data set contains information on 13 variables. Each row in the data set represents a birth (meaning a baby born in North Carolina).
fage: father’s age in years.
mage: mother’s age in years.
mature: maturity status of mother.
weeks: length of pregnancy in weeks.
premie: whether the birth was classified as premature (premie) or full-term.
visits: number of hospital visits during pregnancy.
marital: whether mother is married or not married at birth.
gained: weight gained by mother during pregnancy in pounds.
weight: weight of the baby at birth in pounds.
lowbirthweight: whether baby was classified as low birthweight (low) or not (not low).
gender: gender of the baby, female or male.
habit: status of the mother as a nonsmoker or a smoker.
whitemom: whether mom is white or not white.
The first step in starting any analysis is exploratory data analysis (EDA). This generally involves graphs and a few numeric computations.
For our work today, we will use weight, i.e., the birth weight of a baby, as our response variable. Before we choose an explanatory variable, let's explore this variable on its own.
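Exploring a single numeric variable usually means a numeric summary plus a graph. Here is a minimal sketch of that idea. The vector below is a made-up stand-in (the values are not from the nc data), so in your own work you would use the weight column of the data set instead.

```r
# Toy stand-in for the weight variable (made-up values, not the real data)
weight <- c(7.6, 6.9, 8.1, 5.5, 7.3, 6.2, 9.0, 7.8)

# Numeric summary: minimum, quartiles, median, mean, maximum
weight_summary <- summary(weight)
print(weight_summary)

# Histogram to look at the shape, center, and spread
hist(weight, main = "Birth weight", xlab = "Weight (pounds)")
```

The summary gives you the center and spread in numbers, and the histogram shows the shape (symmetric, skewed, any unusual values).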
Now that we have chosen and explored our response variable, let's choose an explanatory variable. For now, let's use gained, the amount of weight a mother gained during pregnancy.
Summarize the gained variable using the summary command. From how many mothers are we missing weight gain data? Why is this a problem if we want to use this variable as an explanatory variable?
Missing data is a common reality in statistics, and as part of EDA, it is important to consider which of the variables involved have missing information. Most data sets (including data sets you might choose for your projects) have missing data to some degree.
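To see how R reports missing values, here is a small sketch on a made-up vector (the numbers are an assumption for illustration, not values from nc). R marks a missing entry with NA.

```r
# Toy stand-in for the gained variable (made-up values); NA marks a missing entry
gained <- c(30, NA, 25, 41, NA, 18)

# summary() reports the number of NAs alongside the usual statistics
summary(gained)

# is.na() flags each missing entry; summing the flags counts them
n_missing <- sum(is.na(gained))
n_missing
```

Running the same two commands on the real variable tells you how many mothers have no recorded weight gain.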
There are a lot of ways of handling missing data in the real world, but unfortunately a lot of these methods are beyond the scope of this class. If you are interested, ask me! For today, we will be handling the missing data by creating a "cleaned" version of the data set. We say data has been "cleaned" if we have checked for and handled all data quality issues (e.g., missing data, incorrect values, etc.).
R has a variety of ways of dealing with missing data, but the most direct in this case is the na.omit function:
ncClean <- na.omit(nc)
This command creates a new data set, called ncClean. Basically, we have told R, "hey, R, take a look at the nc data set. Now, remove all of the rows of the data set that contain an NA, i.e., have missing data in any variable. Store the result in a new data set called ncClean."
All of this means that nc is a bigger data set than ncClean. Take a look at your workspace, and find both data sets.
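If it helps to see exactly what na.omit does, here is a minimal sketch on a tiny made-up data frame. The names toy and toyClean and all of the values are assumptions for illustration only.

```r
# Toy data frame (made-up values) with missing data in two different columns
toy <- data.frame(
  gained = c(30, NA, 25, 41),
  fage   = c(28, 31, NA, 35),
  weight = c(7.6, 6.9, 8.1, 5.5)
)

# na.omit() drops every row that has an NA in ANY column
toyClean <- na.omit(toy)

nrow(toy)        # 4 rows before cleaning
nrow(toyClean)   # 2 rows after cleaning

# Number of missing values in each column of the original data
colSums(is.na(toy))
```

Notice that a row is removed even when the missing value sits in a column you do not plan to use. Comparing the row counts before and after, together with the per-column NA counts, tells you where the removed rows came from.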
gained was the only variable with missing data?
Now, an important note. For the rest of the analysis, you should use ncClean, not nc, as your data set.
Now that we have determined how to deal with the missing data in our explanatory variable, let's actually explore the relationship between mother's weight gained and baby weight.
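A scatterplot and a correlation are the usual tools for exploring the relationship between two numeric variables. The sketch below uses a made-up data frame called toy (an assumption, not the real data). In the lab you would point the same commands at ncClean.

```r
# Toy stand-in for the cleaned data (made-up values)
toy <- data.frame(
  gained = c(15, 22, 30, 35, 41, 50),
  weight = c(6.1, 6.8, 7.0, 7.4, 7.3, 8.0)
)

# Scatterplot: explanatory variable on the x-axis, response on the y-axis
plot(toy$gained, toy$weight,
     xlab = "Weight gained by mother (pounds)",
     ylab = "Birth weight (pounds)")

# Correlation measures the strength and direction of the linear relationship
r <- cor(toy$gained, toy$weight)
r
```

When you look at the plot, describe the form, direction, and strength of the relationship, and note any unusual points.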
We have now completed our EDA: we have our response and predictor variables, and we have explored their relationship. Let's fit a model, i.e., find the regression line.
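As a reminder of the syntax, here is a minimal sketch of fitting a regression line with lm. The data frame toy and the model name M_toy are made up for illustration. Your model should be fit on ncClean.

```r
# Toy stand-in for the cleaned data (made-up values)
toy <- data.frame(
  gained = c(15, 22, 30, 35, 41, 50),
  weight = c(6.1, 6.8, 7.0, 7.4, 7.3, 8.0)
)

# Fit the least-squares line; the formula reads "response ~ explanatory"
M_toy <- lm(weight ~ gained, data = toy)

# Estimated intercept and slope
coef(M_toy)

# Proportion of the variability in the response explained by the line
r_squared <- summary(M_toy)$r.squared
r_squared
```

The full summary(M_toy) output also contains the standard errors and p-values used in a hypothesis test for the slope.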
M1.
weight is explained by its linear relationship with gained?
Step 1: $H_0:$, $H_a:$
Step 2: $\hat{\beta}_1 =$, $SE_{\hat{\beta}_1} =$
Step 3:
Step 4: The sampling distribution of the test statistic under the null is...
Step 5: The p-value is ?. This means that ...
Step 6:
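The numbers needed for these steps all live in the coefficient table of the model summary. The sketch below shows one way to pull them out and rebuild the test statistic and p-value by hand. It assumes Step 3 asks for the test statistic and that the test is the usual two-sided t-test of a zero slope. The data frame toy and model M_toy are made up for illustration.

```r
# Toy stand-in for the cleaned data (made-up values)
toy <- data.frame(
  gained = c(15, 22, 30, 35, 41, 50),
  weight = c(6.1, 6.8, 7.0, 7.4, 7.3, 8.0)
)
M_toy <- lm(weight ~ gained, data = toy)

# Coefficient table: estimate, standard error, t value, p-value
coefs <- summary(M_toy)$coefficients

b1 <- coefs["gained", "Estimate"]     # estimated slope
se <- coefs["gained", "Std. Error"]   # standard error of the slope

# Test statistic for a null hypothesis of zero slope
t_stat <- b1 / se

# Under the null, the statistic follows a t distribution with n - 2 df
df <- df.residual(M_toy)

# Two-sided p-value
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
p_value
```

The hand-built values should match the t value and Pr(>|t|) columns that summary prints, which is a handy check on your work.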
In statistics, there is a difference between statistical significance and practical significance. Statistical significance has to do with the hypothesis test framework. If we determine there is convincing evidence of a linear relationship using a confidence interval or hypothesis test, we generally say the relationship is statistically significant.
Practical significance is a different beast. For instance, suppose we find out that there is evidence of a relationship, i.e., a statistically significant relationship, between the chance that it is going to rain and the temperature. Cool. Now say we go a step further and build a 95% confidence interval to describe this relationship. We find that for every 1% increase in the chance of rain, we are 95% confident that on average the temperature will decrease by between 0.0005 and 0.001 degrees Fahrenheit.
Statistically, this is significant. Practically, do we really care if the temperature changes by less than a tenth of a degree? Does this make any practical difference? If we're walking outside, probably not. If we are running some kind of physics experiment that is highly sensitive to temperature, maybe. This is an important distinction. Statistical significance depends on your p-value. Practical significance depends upon the situation. And something can be statistically significant without being practically significant.
So, how do we determine practical significance? In the case of the weather example, you can just use logic to determine if we have practical significance. If I'm deciding whether or not to wear a jacket, less than a tenth of a degree makes little difference. If the situation is more specific (the technical physics experiment, for instance), we use expert knowledge to determine practical significance. Ask someone who knows what might make a practical difference in the situation at hand.
This is why, in statistics, the confidence interval is generally more important, and more useful, than being able to claim something is statistically significant. If we have a range of plausible values, like a confidence interval, we provide concrete information about the relationship between two variables. This information can then be used for a variety of purposes. The bottom line: In general, claiming statistical significance is less important than providing a range of values.
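For a fitted regression model, R can build the confidence interval for the slope directly. Here is a minimal sketch using confint. The data frame toy and model M_toy are made up for illustration and are not the lab's model.

```r
# Toy stand-in for the cleaned data (made-up values)
toy <- data.frame(
  gained = c(15, 22, 30, 35, 41, 50),
  weight = c(6.1, 6.8, 7.0, 7.4, 7.3, 8.0)
)
M_toy <- lm(weight ~ gained, data = toy)

# 95% confidence interval for the slope on gained
ci <- confint(M_toy, "gained", level = 0.95)
ci
```

The two endpoints are the range of plausible values for the average change in the response per one-unit increase in the explanatory variable. Whether a change of that size matters is the practical significance question.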
Let's now consider only babies that were born prematurely. We will use this data subset for all remaining questions in this lab.
premie = subset(ncClean, ncClean$premie == "premie" & ncClean$weeks > 25)
What have we just done? We have created a subset of the ncClean data set which restricts us to only looking at babies which (1) were born prematurely (i.e., ncClean$premie == "premie") and (2) specifically were born at more than 25 weeks.
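To see how subset behaves with two conditions joined by &, here is a small sketch on a made-up data frame. The names toy and toy_premie and all values are assumptions for illustration.

```r
# Toy data frame (made-up values)
toy <- data.frame(
  premie = c("premie", "full term", "premie", "premie", "full term"),
  weeks  = c(24, 39, 33, 30, 40),
  weight = c(1.5, 7.6, 4.4, 3.1, 8.0)
)

# Keep only the rows that satisfy BOTH conditions
toy_premie <- subset(toy, premie == "premie" & weeks > 25)

toy_premie         # rows 3 and 4 of the original data
nrow(toy_premie)   # 2
```

The first row is premature but is dropped because it fails the second condition (24 weeks is not more than 25).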
As we noted earlier, we are working with a rich data set, meaning there are a lot of variables to explore.
weight and gained. Write down the names of the other numeric variables in the data set. How many are there?
Our model for weight using gained was not particularly strong. Now, let's try choosing a predictor that will yield a stronger linear relationship. We will keep weight as our response variable, but we want a new explanatory variable.
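One possible way to find the numeric variables and compare candidate predictors is sketched below. The data frame toy is made up for illustration. The approach (checking each column with is.numeric, then looking at correlations) is just one option, not the required method.

```r
# Toy data frame (made-up values) mixing numeric and categorical columns
toy <- data.frame(
  weeks  = c(33, 30, 36, 34),
  weight = c(4.4, 3.1, 5.2, 4.9),
  habit  = c("nonsmoker", "smoker", "nonsmoker", "nonsmoker")
)

# TRUE for each column that holds numeric data
is_num <- sapply(toy, is.numeric)
numeric_names <- names(toy)[is_num]
numeric_names

# Correlations among the numeric columns; a correlation with weight
# that is farther from 0 suggests a stronger linear relationship
cor(toy[, is_num])
```

Always back up a correlation with a scatterplot, since a single number can hide curvature or outliers.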
weight. Which variable have you chosen, and why?
which(x > 3). The answer comes back 2 and 4, because the 2nd and 4th elements in the list are greater than 3.
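Here is a small runnable version of the which idea. The vector x below is hypothetical, chosen so that the 2nd and 4th elements are the ones greater than 3.

```r
# Hypothetical vector: the 2nd and 4th elements are greater than 3
x <- c(1, 5, 2, 7)

# which() returns the POSITIONS where the condition is TRUE
idx <- which(x > 3)
idx        # 2 4

# Use the positions to pull out the values themselves
x[idx]     # 5 7
```

Keep in mind that which gives you positions, not values. To see the values, index back into the vector as in the last line.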