STA 112 Project Part 3
Getting Started
Step 1
- Open your Project Part 2 RMarkdown file.
Step 2
- Find “File” at the top of your screen and choose “Save As”.
- Save the file as ProjectPart3 (do not change the file extension at all, meaning do not change .Rmd or add .Rmd - just change the name)
- You have just created a copy of your Project Part 2 file.
Step 3
- Change the first chunk in your Markdown file to match what is below. This allows you to hide all your code so you have a professional looking file at the end!
Doing these steps will mean that your data, all data cleaning work you did, and your model will be loaded.
Note: This is a good time to have Dr. Dalzell come over during class and help you delete what you do not need to keep in the document!!
Question 1
In Project Part 1, you proposed several explanatory variables that you were going to use for your model. State all of those variables here, and identify each of them as categorical or numeric.
Section 1: Adding a Categorical Variable
In Project Part 2, you fit a model with one numeric \(X\). Now, we are going to add one additional \(X\) variable, and we are going to make it categorical.
Question 2
Choose one categorical variable that you are going to add into the model. Tell me what the variable is and state how many levels it has.
PRO TIP
If you do not have a categorical variable in your data set, you can
make one! Suppose I have a dataset called candy that has
information on how much sugar is in different types of candy. Grams of
sugar is a numeric variable, but I can turn it into a categorical
variable by deciding that more than 10 grams of sugar is high sugar, and
10 or less is not high sugar. This means that instead of knowing a candy
has 9 grams, the variable would say “not high”.
To do this in R is actually relatively easy. I tell R to create a new
variable (highsugar) that is high if the candy
has more than 10 grams of Sugar, and
not high otherwise.
Here
candy= the name of your data sethighsugar= the name of the new variable I am creating; you can make this up depending on your datacandy$Sugar= the name of the numeric column you want to turn into a categorical variable.10= the cutoff I used for high sugar; you can change it to match what makes sense for your data."high"and"not high"are the values I am assigning for my variable; change this to match your data!
If you get stuck or want more than 2 levels, ask Dr. Dalzell!
Question 3
Create a table of your categorical variable. We do this to make sure we have enough data at each level for it to make sense to use it as an \(X\). Ideally, you want to make sure each level has at least 10% of the data. Show the table and state whether your explanatory variable meets the 10% standard.
Code
where
data: replace with the name of your data set.categorical: replace with the name of the column with your categorical variable.
Question 4
Build an MR model using both your X from Project Part 2 and your
categorical X (so the model should have 2 X variables in it). Call the
model model_both.
As an answer to this question (1) use the code below to show the coefficient table and (2) write down the fitted model using appropriate notation.
Hint: To write down the model, put in the white space:
$$\hat{Y} = $$
Code:
Question 5
Choose one categorical coefficient from your fitted model and interpret it.
Example: If my variable is Wake Forest class, I could interpret the coefficient for sophomore.
Question 6
Use ANOVA to determine how much adding your categorical variable improved your model fit. As an answer to this question, (1) show the ANOVA table using the code below and (2) use appropriate numeric values from the table to describe how much the categorical variable is contributing to the model fit.
Code
Question 7
We have now built two models on the same data set, one with one X and the other with two Xs. To check to see if adding the new variable improved the fit of the model (meaning how well the model matches the data), we compute the adjusted \(R^2\).
- State the adjusted \(R^2\) for both models, (2) state which model is a better fit to the data, and (3) interpret the adjusted \(R^2\) (which you interpret the same way as \(R^2\)).
Code:
Section 2: Adding all the variables
We are now going to add in all the variables!
Question 8
We are about to build a model called model_all using all
the explanatory variables you proposed in Project Part 1 (and are listed
in Question 1). Check the shape assumption for model_all
(you do not need to build the model yet!!). Show all necessary plots and
clearly state if any transformations are needed or if assuming linearity
works for all your variables.
PRO TIP: This is about to make a lot of graphs. When we make a lot of graphs, it helps to stack them. Let’s say I create two graphs.
I can stack them using
You can stack more than 2 graphs! If you have any questions, let me know.
Question 9
Build an MR model that meets the shape assumption using all the
explanatory variables you proposed in Project Part 1 (and are listed in
Question 1). Call the model model_all.
As an answer to this question use the code below to create the coefficient table.
Code:
Question 10
You have three models, the model with one X, model_both,
and model_all. Which one is the best fit to the data?
Explain your choice.
For the rest of the project, you will use the model you identify as best in this question!!
Question 11
The whole point of having you all choose your own data was so that you could build a model about something that interested you. Now that we have done this, and chosen a model…what does it tell us?
In this section, write a paragraph explaining to a friend who is also interested in the topic what relationships are highlighted in the model. In other words, what seems to be related to \(Y\), and this is a strong relationship, a positive one, a negative one, does the significance seem practical, etc.?
Here, you do NOT need to give a formal interpretation of the slopes. You are talking to someone who does not know statistics, and the goal is story telling. What can you learn about the relationships from the model?
Section 3: Inference
Question 12
Check all assumptions for inference, using appropriate plots. Are all assumptions reasonable? Explain. Note: You do NOT need to check assumptions that you have already checked, and you may assume your data is a random sample for the purposes of this project.
Residual Plot
ggplot( data , aes(modelchosen$fitted.values, modelchosen$residuals)) +
geom_point() +
labs(x = "Predicted Values", y = "Residuals",
title = "Residual Plot") +
geom_hline(yintercept = 0, lty = 2,
color="red", lwd = 1.5)data= change to the name of your data setmodelchosen= the name where you stored your chosen model
QQPlot
ggplot(data, aes(sample = modelchosen$residuals)) +
geom_qq() +
geom_qq_line(color="blue") +
labs(x = "Theoretical normal quantiles",
y = "Observed residual quantiles") +
theme_bw()data= change to the name of your data setmodelchosen= the name where you stored your chosen model
PRO TIP 2 : What if the assumptions are not reasonable??
We run into situations all the time when one or more of our assumptions for inference are not met. In this case, we should not use the hypothesis tests and confidence intervals we learned in class that rely on t-distributions, as the theory that let us use those methods no longer applied.
Instead, we use randomization methods to perform hypothesis tests and confidence intervals when assumptions for inference are not met. We will explore how these are built in more depth later, but for now, here are the codes you need!
Hypothesis Test
Put the following in a chunk and press play. DO NOT CHANGE ANYTHING.
randomizationHT <- function(mydata, Y, X,whereX){
mydata <- data.frame(mydata)
model <- lm( mydata[,Y] ~ ., data = mydata[,-Y])
samplestat <- model$coefficients[whereX+1]
storage <- rep(NA,1000)
newdata <- mydata
for(i in 1:1000){
newdata[,X] <- mydata[sample(1:nrow(mydata),nrow(mydata)),X]
model <- lm( mydata[,Y] ~ ., data = newdata[,-Y])
storage[i] <- model$coefficients[whereX+1]
}
if(samplestat <0){
pvalue <- length(which(storage <= samplestat ))/1000
}
if(samplestat > 0){
pvalue <- length(which(storage >= samplestat ))/1000
}
pvalue
}To run the HT, use the following:
mydata= replace with your data setY= replace with the number of the column where \(Y\) is in your data set.X= replace with the number of the column where the \(X\) you are doing inference on is in your data set.whereX= the number of the \(\beta\) you are testing + 1. This means if you are testing \(\beta_2\), you usewhereX= 3.
The number you get is the p-value for your test.
Confidence Intervals
Put the following in a chunk and press play. DO NOT CHANGE ANYTHING.
randomizationCI <- function(mydata, Y, X,whereX){
mydata <- data.frame(mydata)
set.seed(112)
model <- lm( mydata[,Y] ~ ., data = mydata[,-Y])
samplestat <- model$coefficients[whereX]
storage <- rep(NA,1000)
newdata <-mydata
for(i in 1:1000){
newdata<- mydata[sample(1:nrow(mydata),nrow(mydata),replace = TRUE),]
model <- lm( newdata[,Y] ~ ., data = newdata[,-Y])
storage[i] <- model$coefficients[whereX+1]
}
upper <- quantile(storage, .975)
lower <- quantile(storage, .025)
print(lower)
print(upper)
}To build a 95% CI, use the following:
mydata= replace with your data setY= replace with the number of the column where \(Y\) is in your data set.X= replace with the number of the column where the \(X\) you are doing inference on is in your data set.whereX= the number of the \(\beta\) you are testing. This means if you are testing \(\beta_2\), you usewhereX= 2.
Question 13
Choose one coefficient (any one you like) and run a test to check to see if the data provide evidence that the population value of this coefficient would be statistically different from 0. You do not need to show all your steps, but you DO need to (1) state the hypotheses, (2) state the p-value, and (3) state the conclusion clearly in the context of the data (so useful for a person interested in the topic).
Question 14
Choose one coefficient (any one you like) and build and interpret a 95% confidence interval for that coefficient.
Section 4: Prediction
The final step is to assess our predictive ability now that we have chosen a final model!
Question 15
Suppose my model is called \(modelMine\) and my data is called \(dataset\), with variables \(x\) and \(y\). To create a plot of the predictions versus the true values of \(Y\), the code is:
ggplot( dataset, aes( x = y, y = predict(modelMine))) +
geom_point( ) +
labs( x = "True value of Y", y = "Predicted value of Y" ) +
stat_smooth(formula = y ~ x, method = "lm", se = FALSE)Create the plot using the code above, adapting the code and labels as needed to suit your data. Based on the plot, comment on the predictive abilities of your model. Again, comment so that a friend who is interested in the subject but does not know statistics can understand how well the model is doing at prediction.
Turning in your assignment
Go ahead and knit your assignment. Make sure:
- You have run spell check.
- There is NO R output of any kind that does not have words right near it to describe the output.
- All plots have labelled axes and titles or captions.
- You do NOT have any super long R output (like printing 50 numbers on the screen.)
When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or html document to Canvas. No other formats will be accepted. Make sure you look through the final document to make sure everything has knit correctly.