Load Data

setwd("C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/1_STAT_Duq_2017/HOMEWORK/Duq_Biostats_Homework_4_Milk_regression/homework4_RMD_vs2")
dat2 <- read.csv("milk_subset.csv")

Regression of fat ~ body size

They note: “W/in particular phylogenetic groups, there are significant negative relationships between body mass & … fat … (Merchant et al. 1989; Derrickson et el 1996; Hinde & Milligan 2011). Spp w/ small body masses are expected to produce more concentrated & energetically dense milk b/c they have higher mass-specific metabolic demands & reduced digestive capacity due to smaller GI tracts (Blaxter 1961). Thus, smallbodied species should be incapable of ingesting greater quantities of milk to meet metabolic & nutritional demands & should instead require more concentrated & energy dense milk (Blaxter 1961).”

References

Plot fat ~ body size

library(ggplot2)
qplot(y=fat.log10,
      x=mass.log10, 
      data = dat2)

The author’s model: log(fat) vs log(mass)

Regression fat % against mass of animal

log.log.mass <- lm(fat.log10 ~ mass.log10, data = dat2)
summary(log.log.mass)
## 
## Call:
## lm(formula = fat.log10 ~ mass.log10, data = dat2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68120 -0.27680 -0.01053  0.32121  0.81289 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.84813    0.11531   7.355 2.02e-11 ***
## mass.log10   0.02255    0.02734   0.825    0.411    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4573 on 128 degrees of freedom
## Multiple R-squared:  0.005287,   Adjusted R-squared:  -0.002484 
## F-statistic: 0.6804 on 1 and 128 DF,  p-value: 0.411
  • The slope of the relationship between fat & mass is 0.022 on the log-log scale.
  • This is in the opposite direction hypothesized. * The p value is 0.411, so the relationship is not significant.

QUESTION:

  • Write the regression equation for this line

My model logit(fat) ~ log(mass)

  • Percentage data is best transformed with a logit transformation. See Warton & Hui “Why the arcsine is asinine”

Fit a model to logit transformed data

logit.log.mass <- lm(fat.logit ~ mass.log10, data = dat2)

summary(logit.log.mass)
## 
## Call:
## lm(formula = fat.logit ~ mass.log10, data = dat2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1076 -0.7710 -0.0754  0.7805  2.5916 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.62319    0.30681  -8.550  3.2e-14 ***
## mass.log10   0.08713    0.07275   1.198    0.233    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.217 on 128 degrees of freedom
## Multiple R-squared:  0.01108,    Adjusted R-squared:  0.003358 
## F-statistic: 1.435 on 1 and 128 DF,  p-value: 0.2332
  • The slope is even more positive, and the p-value smaller.

using coef()

Compare just at the coefficients using coef()

coef(log.log.mass)
## (Intercept)  mass.log10 
##  0.84812647  0.02255102
coef(logit.log.mass)
## (Intercept)  mass.log10 
## -2.62319113  0.08713388
  • The logit Transformation substanially changes the intercepts and slopes
  • This is expected because log data and logit transformation are on different scales.
  • Because of this, these regression parameters can’t be compared.

Look at the p values for the two slopes

summary(log.log.mass)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 0.84812647 0.11530526 7.3554882 2.016361e-11
## mass.log10  0.02255102 0.02733924 0.8248592 4.109861e-01
summary(logit.log.mass)$coefficients
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -2.62319113 0.30681324 -8.549798 3.197174e-14
## mass.log10   0.08713388 0.07274638  1.197776 2.332174e-01

The p value is ~0.4 for the log-log and ~0.2 for logit-log.

Look at the residuals of the log-log model

  • par(mfrow = c(2,2)) sets up a 2 x 2 panel
par(mfrow = c(2,2), 
    mar = c(4,4,2,1))
plot(log.log.mass)

QUESTIONS

Why plot provides information about… * … homogenity of variance? * … normality of residuals? * … outliers/influential poitns?













ANSWERS

  • Residual vs. fitt plot: Homogenity of variance (homoscedasticity)
  • Normal Q-Q plot: Normality of residaul
  • Scale-Location plot: …
  • Residuals vs. Leverage: outliers/influencitla points

Some thoughts:

  • At 1st glance I thought that the Residual vs. fitted plot had problems;
    • it looks like the residuals are skewed negative.
  • However, I think this is just due to point 86 being so low.
  • Ignoring 86, the points are basically a football shape centered on zero, which is typical.
  • Point 86 does stick the furthest below the QQplot lines

Look at the residuals of the logit-log model

  • Let’s see if my transformation makes any difference
par(mfrow = c(2,2), 
    mar = c(4,4,2,1))
plot(logit.log.mass)

  • Residuals look similar regardless of the transformation.
  • Note that point 86 sticks out on both of the top 2 plots.
  • In the Normal Q-Q plot is down near the x axis, away from the dashed diagonal line. This is probably not an issue but it would be good to double check this point to make sure it is correct (ie no typo).

On (not) Comparing models with different transformations of the response variable

Plot model

Get predictions from model

  • Use the predict() function to get predictions from the fitted regression model.
  • When applied directly to a fitted regression model (ie, our model log.log.mass), the predict() function
    • takes each observed row of data
    • plugs the appropriate variables into the fitted regression model
    • determines the predicted values of the response (y) variable.

Get predictions

Get the predictions with predict()

y.hat <- predict(log.log.mass)

Add prediction to the dataframe

dat2$y.hat <- y.hat

Plot predictions

Plot predictions with points()

par(mfrow = c(1,1))

#plot raw dat
plot(fat.log10 ~ mass.log10, data = dat2)

#plot predicitons as points
points(y.hat ~ mass.log10, data  = dat2, col = 2)

We can plot the predictions as a straight line like this by using the “type = ‘l’” arguement in points().

#plot raw dat
plot(fat.log10 ~ mass.log10, data = dat2)

#plot predicitons as points
points(y.hat ~ mass.log10, 
       data  = dat2, 
       col = 2, 
       type = "l")

Plot predictions from a defined dataset

  • A good way to plot the output of a fitted model is to make a new dataframe that contains data that spans the range of numbers you want to plug into the model.
  • This is especially important when you have more complex datasets
  • In this case, our regression is simple and just has a single predictor, mass.log10
  • We can make a small new dataframe with a column called “mass.log10” with just 2 numbers, the min and max values in the observed data
  • we can get these min and max using min() and max(), or the function range()

1st, get the range of the x variable

#get the range of x variable
observed.range <- range(dat2$mass.log10)

observed.range
## [1] 0.903090 8.230449

Make into a dataframee

#make new dataframe
newdat <- data.frame(mass.log10 = observed.range)

newdat
##   mass.log10
## 1   0.903090
## 2   8.230449

Using predict() with newdata

  • Use predict() w/ the arguement newdat =
y.hat2 <- predict(log.log.mass, 
                  newdat = newdat)

y.hat2
##         1         2 
## 0.8684921 1.0337315

Add the y.hat predictions to our ne wdataframe

newdat$y.hat2 <- y.hat2

newdat
##   mass.log10    y.hat2
## 1   0.903090 0.8684921
## 2   8.230449 1.0337315

Plot raw data w/ line defined by the predictions

#plot raw dat
plot(fat.log10 ~ mass.log10, data = dat2)

#plot predicitons as points
points(y.hat2 ~ mass.log10, 
       data  = newdat, 
       col = 2, 
       type = "l")

Plotting simple model output with abline()

  • When a model just has a single predictor we can just use the function abline to quikly plot it.
  • This only works for models with a single predictor.
#plot raw dat
plot(fat.log10 ~ mass.log10, 
     data = dat2)

#plot model w/abline
abline(log.log.mass, 
       col = 2)