If you’ve had a go at this week’s homework practical, you’ll have come across today’s data already. Don’t worry if you haven’t had a chance to do this yet, as everything you need to know for the practical is explained below.
The file cereal.csv contains the data of interest. It
gives nutritional information about a variety of cereals, including fat
(in grams), protein (in grams), complex carbohydrates (in grams), sugars
(in grams), fibre (in grams) and the calories per 28g serving. The aim
is to try and determine how fat, protein, complex carbohydrates,
sugar and fibre content affects calories.
If you build a normal linear regression model, with all possible variables as predictors, you’ll get the following output:
==========================================================
                        calories
----------------------------------------------------------
protein                 -0.434
                        (1.286)
fat                     8.186***
                        (0.942)
ccarb                   1.429***
                        (0.290)
sugar                   1.216***
                        (0.283)
fibre                   -2.229***
                        (0.487)
Constant                72.555***
                        (6.831)
Observations            77
R2                      0.759
Adjusted R2             0.742
Residual Std. Error     7.041 (df = 71)
F Statistic             44.622*** (df = 5; 71)
----------------------------------------------------------
Notes:                  ***Significant at the 1 percent level.
                        **Significant at the 5 percent level.
                        *Significant at the 10 percent level.
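A fit like the one summarised above can be sketched in base R. This is only a minimal sketch: it assumes cereal.csv is in the working directory and that its columns carry the names shown in the table (calories, protein, fat, ccarb, sugar, fibre). The helper name fit_full is hypothetical, and summary() prints the same kind of estimates in a different layout from the table above.

```r
# Hypothetical helper: regress calories on all five predictors.
# Column names are assumed to match those in the regression table.
fit_full <- function(data) {
  lm(calories ~ protein + fat + ccarb + sugar + fibre, data = data)
}

# Assumption: cereal.csv is in the working directory.
if (file.exists("cereal.csv")) {
  cereal <- read.csv("cereal.csv")
  print(summary(fit_full(cereal)))
}
```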
We now want to start thinking about whether the assumptions of this model are satisfied.
One of the assumptions of a linear regression model is that of
linearity: that the expected response is a
linear function of the model parameters. Later in the course
we’ll consider this assumption in more detail, but here we’ll perform a
simple check for linearity by plotting the response,
calories, against each of the predictors in turn. An
example for one of the predictors is given below, but you need to do
this for all predictors in the above model.
You might notice that the scatterplot above looks as though it
contains fewer than 77 observations (the number of observations in the
dataset). Why do you think that’s the case?
Because several cereals share exactly the same values. For example, at \(protein = 1g\) there is more than one cereal with \(calories = 110cal\), so those points are plotted on top of one another and appear as a single point.
## There are 20 combinations of Calories & Protein.
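The number of visually separate points can be counted directly. The sketch below uses a hypothetical helper, count_distinct_points, and made-up values rather than the cereal data:

```r
# Hypothetical helper: number of distinct (x, y) pairs, i.e. how many
# separate points a plain scatterplot of x against y can show.
count_distinct_points <- function(x, y) {
  nrow(unique(data.frame(x = x, y = y)))
}

# Made-up values: three observations, two of which coincide.
n_distinct <- count_distinct_points(c(1, 1, 2), c(110, 110, 100))
print(n_distinct)
```

With the data loaded, the equivalent call would be count_distinct_points(cereal$protein, cereal$calories).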
What does the function jitter do in R, for example
plot(jitter(cereal$protein, 30), cereal$calories)? How
might it help in this case? Why do you need to be a little careful in
using jitter in plots when using them to assess
assumptions?
The jitter function adds a small amount of random noise to the
data so that overlapping points are separated.
In this case, it lets us see how many points have \(protein = 1g\).
Increasing the amount of jitter spreads
the points further apart.
Because the plotted positions are no longer the true values,
too much jitter can distort the visual assessment of:
linearity (artificial scatter can hide or blur the trend),
homoscedasticity of the error term (the spread can look inflated because of the artificial noise),
normality of the error term (the added noise changes the apparent
distribution of the points), and
independence of the error term (patterns in the residuals become harder to spot).
Note: jitter should only be used for
visualization; the regression model itself should be fitted to the
original, unjittered data.
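To see how much jitter can move points, here is a small sketch on made-up values (not the cereal data). It passes jitter()'s amount argument explicitly, in which case the noise is uniform between -amount and +amount; the vector and the two amounts are chosen purely for illustration.

```r
set.seed(1)  # jitter() is random; fix the seed so the result is repeatable

# Made-up integer-valued predictor with ties (not the cereal data).
protein <- c(1, 1, 1, 2, 2, 3)

# With an explicit `amount`, jitter() adds uniform noise in [-amount, amount].
mild   <- jitter(protein, amount = 0.1)
strong <- jitter(protein, amount = 2)

# Largest displacement of any point under each setting.
shift_mild   <- max(abs(mild - protein))
shift_strong <- max(abs(strong - protein))
print(c(mild = shift_mild, strong = shift_strong))
```

With the mild setting the tied points separate but stay close to their true value. With the strong setting a value recorded as 1g may be drawn anywhere between -1g and 3g, so neighbouring groups overlap and the plot no longer reflects the data.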
Create a similar scatterplot for each of your predictors (with or
without jitter, but only use jitter if you really have
to).
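One way to produce all five scatterplots in one go is sketched below. The helper plot_each_predictor is hypothetical, and the column names and the location of cereal.csv are assumptions taken from the regression output.

```r
# Hypothetical helper: one scatterplot of the response per predictor.
plot_each_predictor <- function(data, response = "calories",
                                predictors = c("protein", "fat", "ccarb",
                                               "sugar", "fibre")) {
  old_par <- par(mfrow = c(2, 3))  # a 2 x 3 grid holds the five plots
  on.exit(par(old_par))            # restore the previous layout on exit
  for (p in predictors) {
    plot(data[[p]], data[[response]], xlab = p, ylab = response)
  }
  invisible(predictors)
}

# Assumption: cereal.csv is in the working directory.
if (file.exists("cereal.csv")) {
  plot_each_predictor(read.csv("cereal.csv"))
}
```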
For each plot you created, comment on whether you think linearity is reasonable. No need for long justifications!
PROTEIN: Not reasonable, as most of the points are concentrated at \(protein = 1g\).
FAT: Reasonable: as grams of fat increase, calorie level gets higher.
COMPLEX CARBS: Reasonable, most of the points plotted form a straight line.
SUGAR: Reasonable, most of the points plotted form a straight line.
FIBRE: Reasonable, there’s an obvious downward pattern of the points.
For any predictors where you think linearity is unreasonable in your opinion, choose ONE and try transforming the predictor (leave the outcome alone) as you see fit. Recall that you saw an example of this using the brain and body weight data in this week’s lecture, except there I also transformed the outcome. If none need transforming, you can skip this part.
# Log transform of protein (requires protein > 0)
cereal$logprotein <- log(cereal$protein)
# Square-root transform of protein
cereal$sqrtprotein <- sqrt(cereal$protein)
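When writing a power transformation in R, keep in mind that `^` is evaluated before `/`, so `x^1/2` means `(x^1)/2` rather than a square root. A quick check on a made-up value:

```r
x <- 9                   # made-up value for illustration

half      <- x^1/2       # parsed as (x^1) / 2
root_pow  <- x^(1/2)     # square root via an explicit exponent
root_sqrt <- sqrt(x)     # square root via the built-in function

print(c(half = half, root_pow = root_pow, root_sqrt = root_sqrt))
```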
For the predictor you transformed, show the ‘before and after plot’
and state which transformation you applied to the predictor. You can
display plots side-by-side as follows, which you should adapt for your
own example. If none need transforming, you can skip this part.
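A side-by-side layout can be sketched with par(mfrow = c(1, 2)). The helper plot_before_after is hypothetical, the log transform is just one possible choice, and the file location and column names are assumptions.

```r
# Hypothetical helper: original predictor on the left, transformed on the right.
plot_before_after <- function(x, y, f = log, label = "log(protein)") {
  old_par <- par(mfrow = c(1, 2))  # one row, two panels
  on.exit(par(old_par))            # restore the previous layout on exit
  plot(x, y, xlab = "protein", ylab = "calories", main = "Before")
  plot(f(x), y, xlab = label, ylab = "calories", main = "After")
  invisible(f(x))
}

# Assumption: cereal.csv is in the working directory.
if (file.exists("cereal.csv")) {
  cereal <- read.csv("cereal.csv")
  plot_before_after(cereal$protein, cereal$calories)
}
```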
Now run the updated regression model with the transformed predictor. If you didn’t transform any predictor, just stick any transformation in for one of the predictors so that you complete the rest of the worksheet:
==========================================================
                        calories
----------------------------------------------------------
logprotein              -1.094
                        (2.782)
fat                     8.197***
                        (0.942)
ccarb                   1.434***
                        (0.290)
sugar                   1.212***
                        (0.283)
fibre                   -2.232***
                        (0.487)
Constant                72.070***
                        (6.668)
Observations            77
R2                      0.759
Adjusted R2             0.742
Residual Std. Error     7.039 (df = 71)
F Statistic             44.656*** (df = 5; 71)
----------------------------------------------------------
Notes:                  ***Significant at the 1 percent level.
                        **Significant at the 5 percent level.
                        *Significant at the 10 percent level.
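The updated model summarised above could be fitted along these lines. This is a sketch only: fit_with_logprotein is a hypothetical helper, and it assumes the data frame has the column names shown in the table and that protein is strictly positive so the log is defined.

```r
# Hypothetical helper: refit the model with log(protein) in place of protein.
# Assumes protein > 0 for every row, otherwise log() gives -Inf or NaN.
fit_with_logprotein <- function(data) {
  data$logprotein <- log(data$protein)
  lm(calories ~ logprotein + fat + ccarb + sugar + fibre, data = data)
}

# Assumption: cereal.csv is in the working directory.
if (file.exists("cereal.csv")) {
  print(summary(fit_with_logprotein(read.csv("cereal.csv"))))
}
```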
Why did I ask you not to transform the outcome in question 5, do you
think?
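Because transforming the outcome changes its relationship with every predictor at once, including those that already look linear. Transforming a single predictor only changes the relationship for that predictor, so linearity can be improved where it is needed without disturbing the rest.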
Compare the outputs from the two models. Have they changed
dramatically?
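Not dramatically. Only the coefficient on the protein term changes noticeably (-0.434 for protein versus -1.094 for logprotein, neither of which is significant); the other coefficients, R2 and the residual standard error barely change.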
=====================================================================
                                Calories per 28g serving
                                (1)           (2)
---------------------------------------------------------------------
protein                         -0.434
                                (1.286)
logprotein                                    -1.094
                                              (2.782)
fat                             8.186***      8.197***
                                (0.942)       (0.942)
ccarb (Complex Carbohydrate)    1.429***      1.434***
                                (0.290)       (0.290)
sugar                           1.216***      1.212***
                                (0.283)       (0.283)
fibre                           -2.229***     -2.232***
                                (0.487)       (0.487)
Constant                        72.555***     72.070***
                                (6.831)       (6.668)
Observations                    77            77
R2                              0.759         0.759
Adjusted R2                     0.742         0.742
Residual Std. Error (df = 71)   7.041         7.039
F Statistic (df = 5; 71)        44.622***     44.656***
---------------------------------------------------------------------
Notes:                          ***Significant at the 1 percent level.
                                **Significant at the 5 percent level.
                                *Significant at the 10 percent level.
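The headline fit statistics of two or more models can also be put side by side in base R. The helper below, compare_fits, is a hypothetical name for a minimal sketch; it does not reproduce the table layout above.

```r
# Hypothetical helper: R-squared, adjusted R-squared and residual standard
# error for any number of fitted lm objects, one row per model.
compare_fits <- function(...) {
  fits <- list(...)
  data.frame(
    r2     = sapply(fits, function(m) summary(m)$r.squared),
    adj_r2 = sapply(fits, function(m) summary(m)$adj.r.squared),
    sigma  = sapply(fits, function(m) summary(m)$sigma)
  )
}
```

Given two fitted models, say fit1 with protein and fit2 with logprotein (both names are placeholders), compare_fits(fit1, fit2) returns a two-row data frame that makes it easy to see whether the transformation changed the overall fit.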
As was the case last week, you should get one person in your group to submit the work on Moodle before the end of the session. Emma will then provide general feedback and an example at the start of next week.