Functions in R

A simple function to calculate the standard SE

Standard error

Equation

\[SE = \frac{SD}{\sqrt{N}}\]

R-ish notation

\[SE = \frac{SD(x)}{sqrt(length(x))}\]

Calcualte SE by hand

Make some fake data

fake.data <- rnorm(n = 100,mean = 0,sd = 1)

Calc SE

sd(fake.data)/sqrt(length(fake.data))

## [1] 0.1037973

A function that does these calculation

First I’ll make a function that breaks the steps up

SE.function <- function(x){ 
      
      #Calcualte the SD
      sd.x <- sd(x)
      
      #sample size
      length.x <- length(x)
      
      #sqrt of sample size
      sqrt.length <- sqrt(length.x)
      
      #calculate the SE
      SE <- sd.x/sqrt.length
      
      #return teh S
      return(SE)
       }

This function does the same thing directly

SE.function <- function(x){ sd(x)/sqrt(length(x)) }

Test the function

SE.function(fake.data)

## [1] 0.1037973

Compare to a real function

library(plotrix)

Check

std.error(fake.data)

## [1] 0.1037973

How exact is my function?

std.error(fake.data) == SE.function(fake.data)

## [1] TRUE

Problem: My function will choke if there’s an NA.

Make the first entry in the vector NA

fake.data[1] <- NA

Check

std.error(fake.data)

## [1] 0.1040933

SE.function(fake.data)

## [1] NA

This is b/c the sd function doesnt like NAs

sd(fake.data)

## [1] NA

I’ll fix the function by add “na.rm = TRUE” to the function

SE.function <- function(x){sd(x, na.rm = T)/sqrt(length(x))}

Now it doesn’t choke, but it gives the wrong asnwer

#correct
std.error(fake.data)

## [1] 0.1040933

#mine
SE.function(fake.data)

## [1] 0.1035715

This is b/c the length() data gives the totally length of the vector, including any NAs

length(fake.data)

## [1] 100

length() doesn’t have an arugement related to NAs. So I need to remove them first before I do the calculations. I’ll use the na.omit() function to do this

SE.function2 <- function(x){
  # remove NAs
  x2 <- na.omit(x)
  
  # calcualte
  sd(x2)/sqrt(length(x2))}

Check the results

std.error(fake.data)

## [1] 0.1040933

SE.function2(fake.data)

## [1] 0.1040933

A more complex function

Writing lots of plotting code can be a drag in R. I’ll make a function that can speed this up.

Regression model

Normally to plot a regression line we’d do something like this

Load milk data

setwd("C:/Users/lisanjie2/Desktop/TEACHING/1_STATS_CalU/1_STAT_CalU_2016_by_NLB/1_STAT_Duq_2017/HOMEWORK/Duq_Biostats_Homework_4_Milk_regression/homework4_RMD_vs2")
dat2 <- read.csv("milk_subset.csv")

#the the model
model. <-lm(fat.log10 ~ mass.log10, data = dat2)

#plot the raw dat
plot(fat.log10 ~ mass.log10, data = dat2)

#plot regression line
abline(model., col = 2)

We can make a function the simplifies this a bit

plot.my.mod <- function(mod){
  
  #extract formula from model
  model.formula <- formula(mod)
  
  #plot the raw dat
  plot( model.formula , data = dat2)
  
  #plot regression line
  abline(mod, col = 2)

}

Test it

plot.my.mod(model.)

I’d like to be able to change the colors of the regression line on the fly

add “…” to the function() code

plot.my.mod <- function(mod, ...){
  
  #extract formula from model
  model.formula <- formula(mod)
  
  #plot the raw dat
  plot( model.formula , data = dat2)
  
  #plot regression line
  abline(mod, ...)

}

Test it

plot.my.mod(model., col = 3)

I can change the line type of the regression line now too

plot.my.mod(model., col = 3, lty = 2)

A more complex situation

Build a model with a continuous and categorical predictor.
This model defines two parralell lines, one for aquatic species, one for terrestrial species.

Build the model

model.2 <-lm(fat.log10 ~ mass.log10 + 
               biome, 
             data = dat2)


summary(model.2)

## 
## Call:
## lm(formula = fat.log10 ~ mass.log10 + biome, data = dat2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3528 -0.1585  0.0565  0.2075  0.6763 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.92354    0.14453  13.309  < 2e-16 ***
## mass.log10       -0.06955    0.02321  -2.997  0.00328 ** 
## biometerrestrial -0.85608    0.09079  -9.430 2.52e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3521 on 127 degrees of freedom
## Multiple R-squared:  0.4149, Adjusted R-squared:  0.4057 
## F-statistic: 45.03 on 2 and 127 DF,  p-value: 1.653e-15

Question What are the equations for these two lines?

My function won’t work properly on model output this b/c it uses the abline() function, which does’t work w/ complex regression models

plot.my.mod(model.2, col = 3, lty = 2)

## Warning in abline(mod, ...): only using the first two of 3 regression
## coefficients

A function to plot models with two predictors

In a files called “fnxn_plot_intxns.R” I have a function that can plot this model properly. I will load this function using source()

source("fnxn_plot_ANCOVA.R")

The function is called plot.ANCOVA()
As written this function will not work if you did the log transformations “on the fly!”
- MUSt have transformation “hard coded” as a seperate column

plot.ANCOVA(model.2)

Code for the plot.ANCOVA()

#makes interaction plot of simple regression
#models w/1 numeric and 1 categorical predictor



#makes interaction plot of simple regression
#models w/1 numeric and 1 categorical predictor


plot.ANCOVA <- function(model,
                        raw.data = NULL,
                        x.axis = NULL,
                        cols = c(1,2)){
  
  #Data
  ##extract data used in model
  dat <- model$model
  
  ##if x.axis arguement contains data, add to dataframe
  if(is.character(x.axis) == TRUE){
    dim.orig <- dim(dat)[2]
    dat <- cbind(dat,raw.data[,x.axis])
    names(dat)[-c(1:dim.orig)] <-  x.axis
  }
  
  
  
  #extract formula from model object
  ##get formula
  form <- as.character(formula(model))[-1]
  
  ##split into vector
  model.terms <- strsplit(form[2],"[\\+\\*]")
  
  ##name list returned by strsplit()
  names(model.terms) <- "x.names"
  
  ##get rid of spaces
  model.terms$x.names <- gsub(" ","" ,model.terms$x.names)
  
  ##count number of parameters
  i.terms <- length(model.terms$x.names)
  
  #create new entry in list (this will be overwritten)
  model.terms$x.types <- model.terms$x.names
  
  #Loop over each term in the model to determine
  ##whether it is categorical or continous
  for(i in 1:i.terms){
    term.i <- model.terms$x.names[i]
    
    if(any(strsplit(term.i,"")[[1]] == "*") == TRUE){
      model.terms$x.types[i] <- "interactions"
      next
    }
    
    
    model.terms$x.types[i] <- ifelse(is.numeric(dat[,term.i]) == TRUE, 
                                     "numeric",
                                     "categorical")
  }
  
  
  
  ## y axis
  model.terms$y <- form[1]
  
  
  
 
  
  #Define continous variable
  ## If the model doesn't contain a continuous, 
  ## get it from x.axis arguement
  if(is.null(x.axis)== FALSE){
    x.cont <- x.axis
  }
  
  ##If model contains a continous variable...
  if(is.null(x.axis)== TRUE){
    i.num <- which(model.terms$x.types == "numeric")
    x.cont <- model.terms$x.names[i.num]
    
  }

  
  #Define categorical variable
  i.cat <- which(model.terms$x.types == "categorical")
  x.cat <- model.terms$x.names[i.cat]
  
  if(length(x.cat) > 1){
    x.cat <- x.cat[1]
    
  }
  
  
  #Define y variable
  y <- model.terms$y
  
  

  
  # range of observed numeric data
  ## if calling from dataframe
  if(is.character(x.cont) == TRUE){
    rng <- range(dat[,x.cont])
    
  }
  
  ## if calling from x.var arguement
  if(is.numeric(x.cont) == TRUE){
    rng <- range(x.cont)
    
  }
  
  ## levels of the categorical variable
  levs <- levels(dat[,x.cat])
  
  
  ## generate new data with expand.grid() 
  newdat <- expand.grid(x.cont = rng,
                        x.cat = levs)
  

  ## label columns

  #if calling from x.axis
  if(is.numeric(x.cont) == TRUE){
    names(newdat) <- c("x.axis", x.cat)
    
  }
 
  #if calling from model datat
  if(is.character(x.cont) == TRUE){
    names(newdat) <- c(x.cont, x.cat)
    
  }
  
  # generate predictions from fitted model
  newdat$yhat <- predict(model,newdat)
  
  #determine baseline level of fitted model
  i.lev1 <- which(dat[,x.cat] == levs[1])
  
  #determine range of values
  ## range of observed x values
  #if calling from x.axis
  if(is.numeric(x.cont) == TRUE){
    xlims <- range(x.cont)
    
    
  }
  
  #if calling from model datat
  if(is.character(x.cont) == TRUE){
    xlims <- range(dat[,x.cont])
    
    
  }
  
 
  
  
  
  ## range of observed y values
  ylims <- range(dat[,y])
  
  
  #Plot raw data
  ## Plot level 1 of model (intercept)
  plot(dat[i.lev1,y] ~ dat[i.lev1,x.cont], 
       xlab = x.cont,
       ylab = y,
       xlim = xlims,
       ylim = ylims,col=cols[1])
  
  ## Plot level 2 of model (intercept+effect)
  points(dat[-i.lev1,y] ~ dat[-i.lev1,x.cont], 
         xlab = x.cont,
         ylab = y,
         col = cols[2])
  
  
  #Plot regession lines
  ##subset data for regression lines
  i.lev1 <- newdat[,x.cat] == levs[1]
  i.lev2 <- newdat[,x.cat] == levs[2]
  
  ##plot regression lines
  points(newdat$yhat ~ newdat[,x.cont], 
         subset = i.lev1,
         col = cols[1],
         type = "l")
  
  points(newdat$yhat ~ newdat[,x.cont], 
         subset = i.lev2,
         col = cols[2],
         type = "l")
  
  
}

Duq Biostats Week 6 | Regression Tutorial 5: Functions

brouwern@gmail.com

February 13, 2017