Week 2 - Computer practical

Today’s practical

If you’d like to, find a group of students to work with today. It’s also fine to work alone if you’d prefer.
Then move on to the task below.

The data

If you’ve already had a go at this week’s homework practical, you’ll have come across today’s data already. Don’t worry if you haven’t had a chance to do this yet, as everything you need to know for the practical is explained below.

The file cereal.csv contains the data of interest. It gives nutritional information about a variety of cereals: fat, protein, complex carbohydrates, sugars and fibre (all in grams), together with the calories per 28g serving. The aim is to determine how fat, protein, complex carbohydrate, sugar and fibre content affect calories.

If you build a normal linear regression model, with all possible variables as predictors, you’ll get the following output:


==========================================================
                                   calories               
----------------------------------------------------------
protein                             -0.434                
                                   (1.286)                
                                                          
fat                                8.186***               
                                   (0.942)                
                                                          
ccarb                              1.429***               
                                   (0.290)                
                                                          
sugar                              1.216***               
                                   (0.283)                
                                                          
fibre                             -2.229***               
                                   (0.487)                
                                                          
Constant                          72.555***               
                                   (6.831)                
                                                          
Observations                          77                  
R2                                  0.759                 
Adjusted R2                         0.742                 
Residual Std. Error            7.041 (df = 71)            
F Statistic                 44.622*** (df = 5; 71)        
----------------------------------------------------------
Notes:              ***Significant at the 1 percent level.
                    **Significant at the 5 percent level. 
                    *Significant at the 10 percent level.
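
A minimal sketch of how a model of this form can be fitted in R. The column names are taken from the table above and are assumed to match those in cereal.csv. The toy data frame is made up so that the code runs on its own, so it will not reproduce the numbers shown.

```r
# Toy stand-in for cereal.csv (made-up values, NOT the real data).
# With the real file you would use something like: cereal <- read.csv("cereal.csv")
set.seed(1)
n <- 30
cereal <- data.frame(
  protein = sample(1:6, n, replace = TRUE),
  fat     = sample(0:3, n, replace = TRUE),
  ccarb   = round(runif(n, 5, 23), 1),
  sugar   = sample(0:15, n, replace = TRUE),
  fibre   = round(runif(n, 0, 10), 1)
)
cereal$calories <- 70 + 8 * cereal$fat + 1.4 * cereal$ccarb +
  1.2 * cereal$sugar - 2 * cereal$fibre + rnorm(n, sd = 7)

# Linear regression of calories on all five predictors
fit1 <- lm(calories ~ protein + fat + ccarb + sugar + fibre, data = cereal)

# Coefficient estimates, standard errors, R-squared and the F statistic
summary(fit1)
```

With five predictors plus an intercept, the residual degrees of freedom are the number of observations minus 6, which matches the df = 71 reported for the 77 cereals.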

We now want to start thinking about whether the assumptions of this model are satisfied.

The assumption of linearity

One of the assumptions of a linear regression model is that of linearity: that the expected response is a linear function of the model parameters. Later in the course we’ll consider this assumption in more detail, but here we’ll perform a simple check for linearity by plotting the response, calories, against each of the predictors in turn. An example for one of the predictors is given below, but you need to do this for all predictors in the above model.

Question 1

You might notice that the scatterplot above looks as though it contains fewer than 77 observations (the number of observations in the dataset). Why do you think that’s the case?

Because several cereals share exactly the same values. For example, at $protein = 1g$ there is more than one cereal with $calories = 110cal$. These points are plotted on top of one another, so they appear as a single point.

## There are 20 combinations of Calories & Protein.
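
One way such a count could be obtained is to count the distinct (protein, calories) pairs. The sketch below uses a small made-up data frame rather than the real data, so its count differs from the one above.

```r
# Toy values for illustration only (NOT the real cereal.csv)
cereal <- data.frame(
  protein  = c(1, 1, 1, 2, 2, 3, 3, 3),
  calories = c(110, 110, 100, 110, 110, 120, 120, 90)
)

# Number of distinct (protein, calories) pairs, i.e. visible points
n_distinct <- nrow(unique(cereal[, c("protein", "calories")]))
n_distinct   # 5 visible points from 8 observations

# How many observations sit on each visible point
table(cereal$protein, cereal$calories)
```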

Question 2

What does the function jitter do in R, for example plot(jitter(cereal$protein, 30), cereal$calories)? How might it help in this case? Why do you need to be a little careful in using jitter in plots when using them to assess assumptions?

The jitter function adds a small amount of random noise to the data so that overlapping points are separated.
In this case, it lets us see how many points share a value such as $protein = 1g$.
Increasing the factor (or amount) argument of jitter() separates the points further.

Using jitter with too much noise can distort what a plot shows about the assumptions of:

linearity (the added randomness can blur or mask the trend),
homoscedasticity of the error term (the spread may look larger because of the artificial noise),
normality of the error term (the jittered values may no longer look normally distributed), and
independence of the error term (patterns in the residuals become harder to spot)

Note: jitter should only be used for visualization; the regression model should be fitted to the original, unjittered data.
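
The effect of the factor argument can be seen in a small sketch. The values below are made up for illustration; only jitter() and its factor argument come from the question.

```r
# Toy values for illustration only (NOT the real cereal.csv)
protein  <- c(1, 1, 1, 2, 2, 3, 3, 3)
calories <- c(110, 110, 100, 110, 110, 120, 120, 90)

set.seed(42)                              # jitter() is random; fix the seed
j_small <- jitter(protein)                # default factor = 1
j_large <- jitter(protein, factor = 30)   # as in the question

# With amount left at its default, the noise is uniform within
# +/- factor/5 times the smallest gap between distinct x values (here 1 g)
max(abs(j_small - protein))   # at most 0.2
max(abs(j_large - protein))   # at most 6

op <- par(mfrow = c(1, 2))
plot(j_small, calories, xlab = "protein (g), factor = 1", ylab = "calories")
plot(j_large, calories, xlab = "protein (g), factor = 30", ylab = "calories")
par(op)
```

In this toy example a factor of 30 can move a point by up to 6g, more than the whole range of the toy protein values, which is why heavy jitter needs care. The original protein vector is left unchanged, since jitter() returns a noisy copy.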

Question 3

Create a similar scatterplot for each of your predictors (with or without jitter, but only use jitter if you really have to).
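
One possible way to produce all five plots at once is to loop over the predictor names. This is a sketch on made-up data, and it assumes the column names match those in the regression table.

```r
# Toy stand-in for cereal.csv (made-up values, NOT the real data)
set.seed(2)
n <- 30
cereal <- data.frame(
  protein = sample(1:6, n, replace = TRUE),
  fat     = sample(0:3, n, replace = TRUE),
  ccarb   = round(runif(n, 5, 23), 1),
  sugar   = sample(0:15, n, replace = TRUE),
  fibre   = round(runif(n, 0, 10), 1)
)
cereal$calories <- round(70 + 8 * cereal$fat + 1.4 * cereal$ccarb +
                           rnorm(n, sd = 7))

predictors <- c("protein", "fat", "ccarb", "sugar", "fibre")

op <- par(mfrow = c(2, 3))   # 2 x 3 grid of panels
for (p in predictors) {
  plot(cereal[[p]], cereal$calories, xlab = p, ylab = "calories")
}
par(op)                      # restore the previous layout
```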

Question 4

For each plot you created, comment on whether you think linearity is reasonable. No need for long justifications!

PROTEIN: Not reasonable, as most of the points are concentrated at $protein = 1g$.
FAT: Reasonable: as grams of fat increase, calorie level gets higher.
COMPLEX CARBS: Reasonable, most of the points plotted form a straight line.
SUGAR: Reasonable, most of the points plotted form a straight line.
FIBRE: Reasonable, there’s an obvious downward pattern of the points.

Question 5

For any predictors where you think linearity is unreasonable, choose ONE and try transforming the predictor (leave the outcome alone) as you see fit. Recall that you saw an example of this using the brain and body weight data in this week’s lecture, except there I also transformed the outcome. If none need transforming, you can skip this part.

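# Two candidate transformations of protein (the log requires protein > 0)
cereal$logprotein <- log(cereal$protein)
cereal$sqrtprotein <- sqrt(cereal$protein)  # protein^1/2 would compute protein/2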

Question 6

For the predictor you transformed, show the ‘before and after plot’ and state which transformation you applied to the predictor. You can display plots side-by-side as follows, which you should adapt for your own example. If none need transforming, you can skip this part.
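
One common way to place two base R plots side by side is par(mfrow = c(1, 2)). The sketch below assumes a log transformation of protein and uses made-up values, so adapt the variable and transformation to your own choice.

```r
# Toy values for illustration only (NOT the real cereal.csv)
cereal <- data.frame(
  protein  = c(1, 1, 2, 2, 3, 3, 4, 6),
  calories = c(110, 100, 110, 90, 120, 110, 100, 120)
)
cereal$logprotein <- log(cereal$protein)

op <- par(mfrow = c(1, 2))   # one row, two columns
plot(cereal$protein, cereal$calories,
     xlab = "protein (g)", ylab = "calories", main = "Before")
plot(cereal$logprotein, cereal$calories,
     xlab = "log(protein)", ylab = "calories", main = "After: log")
par(op)                      # restore the previous layout
```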

Question 7

Now run the updated regression model with the transformed predictor. If you didn’t transform any predictor, just stick any transformation in for one of the predictors so that you complete the rest of the worksheet:


==========================================================
                                   calories               
----------------------------------------------------------
logprotein                          -1.094                
                                   (2.782)                
                                                          
fat                                8.197***               
                                   (0.942)                
                                                          
ccarb                              1.434***               
                                   (0.290)                
                                                          
sugar                              1.212***               
                                   (0.283)                
                                                          
fibre                             -2.232***               
                                   (0.487)                
                                                          
Constant                          72.070***               
                                   (6.668)                
                                                          
Observations                          77                  
R2                                  0.759                 
Adjusted R2                         0.742                 
Residual Std. Error            7.039 (df = 71)            
F Statistic                 44.656*** (df = 5; 71)        
----------------------------------------------------------
Notes:              ***Significant at the 1 percent level.
                    **Significant at the 5 percent level. 
                    *Significant at the 10 percent level.
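
A sketch of how the updated model might be fitted and set against the original one. The data are made up so the code is self-contained, and the column names are assumed from the tables, so the numbers will differ from those above.

```r
# Toy stand-in for cereal.csv (made-up values, NOT the real data)
set.seed(3)
n <- 30
cereal <- data.frame(
  protein = sample(1:6, n, replace = TRUE),
  fat     = sample(0:3, n, replace = TRUE),
  ccarb   = round(runif(n, 5, 23), 1),
  sugar   = sample(0:15, n, replace = TRUE),
  fibre   = round(runif(n, 0, 10), 1)
)
cereal$calories <- 70 + 8 * cereal$fat + 1.4 * cereal$ccarb +
  1.2 * cereal$sugar - 2 * cereal$fibre + rnorm(n, sd = 7)
cereal$logprotein <- log(cereal$protein)

# Original model and the model with the transformed predictor
fit1 <- lm(calories ~ protein + fat + ccarb + sugar + fibre, data = cereal)
fit2 <- lm(calories ~ logprotein + fat + ccarb + sugar + fibre, data = cereal)

# Coefficients that appear in both models, side by side
shared <- c("(Intercept)", "fat", "ccarb", "sugar", "fibre")
cbind(model1 = coef(fit1)[shared], model2 = coef(fit2)[shared])

# Overall fit of each model
c(model1 = summary(fit1)$r.squared, model2 = summary(fit2)$r.squared)
```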

Question 8

Why did I ask you not to transform the outcome in question 5, do you think?

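Because transforming the outcome changes its relationship with every predictor at once, including those where linearity already looks reasonable. Transforming only the predictor adjusts the one relationship of concern and keeps the response on its original scale (calories).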

Question 9

Compare the outputs from the two models. Have they changed dramatically?

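No. The only noticeable change is in the protein term (-0.434 for protein versus -1.094 for logprotein). These two coefficients are on different scales, so they are not directly comparable, and neither is significant. The other coefficients, their standard errors, R2 and the residual standard error are almost identical.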


=====================================================================
                                     Calories per 28g serving        
                                      (1)                 (2)        
---------------------------------------------------------------------
protein                             -0.434                           
                                    (1.286)                          
                                                                     
logprotein                                              -1.094       
                                                        (2.782)      
                                                                     
fat                                8.186***            8.197***      
                                    (0.942)             (0.942)      
                                                                     
ccarb (Complex Carbohydrate)       1.429***            1.434***      
                                    (0.290)             (0.290)      
                                                                     
sugar                              1.216***            1.212***      
                                    (0.283)             (0.283)      
                                                                     
fibre                              -2.229***           -2.232***     
                                    (0.487)             (0.487)      
                                                                     
Constant                           72.555***           72.070***     
                                    (6.831)             (6.668)      
                                                                     
Observations                          77                  77         
R2                                   0.759               0.759       
Adjusted R2                          0.742               0.742       
Residual Std. Error (df = 71)        7.041               7.039       
F Statistic (df = 5; 71)           44.622***           44.656***     
---------------------------------------------------------------------
Notes:                        ***Significant at the 1 percent level. 
                              **Significant at the 5 percent level.  
                              *Significant at the 10 percent level.

As was the case last week, you should get one person in your group to submit the work on Moodle before the end of the session. Emma will then provide general feedback and an example at the start of next week.