Load Data

setwd("C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/1_STAT_Duq_2017/HOMEWORK/Duq_Biostats_Homework_4_Milk_regression/homework4_RMD_vs2")
dat2 <- read.csv("milk_subset.csv")

Regression of fat ~ body size

The author’s of the milk paper indicate that there are typically correlations between the size of mammal and fat content.

They note: “W/in particular phylogenetic groups, there are significant negative relationships between body mass & … fat … (Merchant et al. 1989; Derrickson et el 1996; Hinde & Milligan 2011). Spp w/ small body masses are expected to produce more concentrated & energetically dense milk b/c they have higher mass-specific metabolic demands & reduced digestive capacity due to smaller GI tracts (Blaxter 1961). Thus, smallbodied species should be incapable of ingesting greater quantities of milk to meet metabolic & nutritional demands & should instead require more concentrated & energy dense milk (Blaxter 1961).”

References

Derrickson 1992. Comparative reproductive strategies of altricial & precocial eutherian mammals. Functional Ecology, 6, 57-65.
Derrickson et al. 1996. Milk composition of two precocial, arid-dwelling rodents, Kerodon rupestris and Acomys cahirinus. Physiological Zoology, 69, 1402-1418.

Plot fat ~ body size

Looking at the data it doesn’t look like there is a consistent negative trend

library(ggplot2)
qplot(y=fat.log10,
      x=mass.log10, 
      data = dat2)

Indeed, across multiple taxonomic groups, The author’s of the milk paper do not find a negative relationship betwee fat % and body size.
They used a log transformation, which I think is less than ideal.
We’ll check if a logit transformation matters.

The author’s model: log(fat) vs log(mass)

Regression fat % against mass of animal

log.log.mass <- lm(fat.log10 ~ mass.log10, data = dat2)

summary(log.log.mass)

## 
## Call:
## lm(formula = fat.log10 ~ mass.log10, data = dat2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68120 -0.27680 -0.01053  0.32121  0.81289 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.84813    0.11531   7.355 2.02e-11 ***
## mass.log10   0.02255    0.02734   0.825    0.411    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4573 on 128 degrees of freedom
## Multiple R-squared:  0.005287,   Adjusted R-squared:  -0.002484 
## F-statistic: 0.6804 on 1 and 128 DF,  p-value: 0.411

The slope of the relationship between fat & mass is 0.022 on the log-log scale.
This is in the opposite direction hypothesized. * The p value is 0.411, so the relationship is not significant.

QUESTION:

Write the regression equation for this line

My model logit(fat) ~ log(mass)

Percentage data is best transformed with a logit transformation. See Warton & Hui “Why the arcsine is asinine”

Fit a model to logit transformed data

logit.log.mass <- lm(fat.logit ~ mass.log10, data = dat2)

summary(logit.log.mass)

## 
## Call:
## lm(formula = fat.logit ~ mass.log10, data = dat2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1076 -0.7710 -0.0754  0.7805  2.5916 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.62319    0.30681  -8.550  3.2e-14 ***
## mass.log10   0.08713    0.07275   1.198    0.233    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.217 on 128 degrees of freedom
## Multiple R-squared:  0.01108,    Adjusted R-squared:  0.003358 
## F-statistic: 1.435 on 1 and 128 DF,  p-value: 0.2332

The slope is even more positive, and the p-value smaller.

using coef()

Compare just at the coefficients using coef()

coef(log.log.mass)

## (Intercept)  mass.log10 
##  0.84812647  0.02255102

coef(logit.log.mass)

## (Intercept)  mass.log10 
## -2.62319113  0.08713388

The logit Transformation substanially changes the intercepts and slopes
This is expected because log data and logit transformation are on different scales.
Because of this, these regression parameters can’t be compared.

Look at the p values for the two slopes

summary(log.log.mass)$coefficients

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 0.84812647 0.11530526 7.3554882 2.016361e-11
## mass.log10  0.02255102 0.02733924 0.8248592 4.109861e-01

summary(logit.log.mass)$coefficients

##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -2.62319113 0.30681324 -8.549798 3.197174e-14
## mass.log10   0.08713388 0.07274638  1.197776 2.332174e-01

The p value is ~0.4 for the log-log and ~0.2 for logit-log.

Look at the residuals of the log-log model

par(mfrow = c(2,2)) sets up a 2 x 2 panel

par(mfrow = c(2,2), 
    mar = c(4,4,2,1))
plot(log.log.mass)

QUESTIONS

Why plot provides information about… * … homogenity of variance? * … normality of residuals? * … outliers/influential poitns?

ANSWERS

Residual vs. fitt plot: Homogenity of variance (homoscedasticity)
Normal Q-Q plot: Normality of residaul
Scale-Location plot: …
Residuals vs. Leverage: outliers/influencitla points

Some thoughts:

At 1st glance I thought that the Residual vs. fitted plot had problems;
- it looks like the residuals are skewed negative.
However, I think this is just due to point 86 being so low.
Ignoring 86, the points are basically a football shape centered on zero, which is typical.
Point 86 does stick the furthest below the QQplot lines

Look at the residuals of the logit-log model

Let’s see if my transformation makes any difference

par(mfrow = c(2,2), 
    mar = c(4,4,2,1))
plot(logit.log.mass)

Residuals look similar regardless of the transformation.
Note that point 86 sticks out on both of the top 2 plots.
In the Normal Q-Q plot is down near the x axis, away from the dashed diagonal line. This is probably not an issue but it would be good to double check this point to make sure it is correct (ie no typo).

On (not) Comparing models with different transformations of the response variable

Note that we can’t compare the fit of these 2 models because transforming the y variable changes the underlying structure of the data.
You can’t compare the “log-log”" and “logit-log”" model with either ANVOA (because the models are not nested) or AIC (because the underlying response variables differ).
Transformations of the y variable are assessed based on their residuals; you can’t formally test whether one transformatio is better than the other (doing so would probably lead to p-hacking too)

Plot model

The slope of this model is not significant, but let’s practice plotting it against the raw data * This is an essential but frequently skipped step in model checking.

Get predictions from model

Use the predict() function to get predictions from the fitted regression model.
When applied directly to a fitted regression model (ie, our model log.log.mass), the predict() function
- takes each observed row of data
- plugs the appropriate variables into the fitted regression model
- determines the predicted values of the response (y) variable.

Get predictions

Get the predictions with predict()

y.hat <- predict(log.log.mass)

Add prediction to the dataframe

dat2$y.hat <- y.hat

Plot predictions

Plot predictions with points()

par(mfrow = c(1,1))

#plot raw dat
plot(fat.log10 ~ mass.log10, data = dat2)

#plot predicitons as points
points(y.hat ~ mass.log10, data  = dat2, col = 2)

We can plot the predictions as a straight line like this by using the “type = ‘l’” arguement in points().

#plot raw dat
plot(fat.log10 ~ mass.log10, data = dat2)

#plot predicitons as points
points(y.hat ~ mass.log10, 
       data  = dat2, 
       col = 2, 
       type = "l")

Plot predictions from a defined dataset

A good way to plot the output of a fitted model is to make a new dataframe that contains data that spans the range of numbers you want to plug into the model.
This is especially important when you have more complex datasets
In this case, our regression is simple and just has a single predictor, mass.log10
We can make a small new dataframe with a column called “mass.log10” with just 2 numbers, the min and max values in the observed data
we can get these min and max using min() and max(), or the function range()

1st, get the range of the x variable

#get the range of x variable
observed.range <- range(dat2$mass.log10)

observed.range

## [1] 0.903090 8.230449

Make into a dataframee

#make new dataframe
newdat <- data.frame(mass.log10 = observed.range)

newdat

##   mass.log10
## 1   0.903090
## 2   8.230449

Using predict() with newdata

Use predict() w/ the arguement newdat =

y.hat2 <- predict(log.log.mass, 
                  newdat = newdat)

y.hat2

##         1         2 
## 0.8684921 1.0337315

Add the y.hat predictions to our ne wdataframe

newdat$y.hat2 <- y.hat2

newdat

##   mass.log10    y.hat2
## 1   0.903090 0.8684921
## 2   8.230449 1.0337315

Plot raw data w/ line defined by the predictions

#plot raw dat
plot(fat.log10 ~ mass.log10, data = dat2)

#plot predicitons as points
points(y.hat2 ~ mass.log10, 
       data  = newdat, 
       col = 2, 
       type = "l")

Plotting simple model output with abline()

When a model just has a single predictor we can just use the function abline to quikly plot it.
This only works for models with a single predictor.

#plot raw dat
plot(fat.log10 ~ mass.log10, 
     data = dat2)

#plot model w/abline
abline(log.log.mass, 
       col = 2)

HW 4: Regression Case Study Tutorial 4: Regression with one variable

brouwern@gmail.com

February 13, 2017