Growth Models

Title: Growth models

Synopsis: This document is aimed at helping you to undestand the main principles of growth models.

Linear Model

Details: UTAustinX: UT.7.10x Foundations of Data Analysis - Part 1 Week 5: Linear Functions > Lecture Videos > Functions and Models

Characteristic: Contant Rate of change.

Using SDSFoundations Package

    library(SDSFoundations)
    
    MyOrange = Orange[Orange$Tree == 1,]
    linFit(MyOrange$age, MyOrange $circumference)

## Linear Fit 
##  Intercept =  24.43785 
##  Slope =  0.08148 
##  R-squared =  0.97115
    #expFit(MyOrange $age, MyOrange $circumference)
    #logisticFit(MyOrange $age, MyOrange $circumference)

Using R Package

    MyOrange = Orange[Orange$Tree == 1,]
    Reg = lm(MyOrange$circumference ~ MyOrange$age)
    plot(MyOrange$age , MyOrange$circumference)
    abline(Reg)

    SumReg = summary(Reg)
    # Linear fit
    Reg$coefficients[1] # Intercept
## (Intercept) 
##    24.43785
    Reg$coefficients[2] # Slope
## MyOrange$age 
##   0.08147716
    SumReg$r.squared # r.squared
## [1] 0.9711463

From the above results we gather together the function

circumference = 24.43785 + 0.08148 (age)

Interpretation:

To predict the outcome of CIRCUMFERENCE start with a line at 0 and 0.08148 and for every unit increase in x, our AGE value, we increate by 0.08148 units of our y, our CIRCUMFERENCE.

Using this function we can find a single output value for any value of our input.

Linear Function fatal flaw is that it assumes a contant rate of change.

You can predit a specific value

What will be the circunference of a tree with 1250 years?

      linFitPred(MyOrange$age, MyOrange $circumference, 1250)
    
      # you can store this value
    
      PredictedValue = linFitPred(MyOrange$age, MyOrange $circumference, 1250)

      PredictedValue
## [1] 126.28

Vertical Line Test

If we draw a vertical line our function should only produce one vertical value for any value of X. You can see from the graph above that this is not the case.

Our data does not pass the vertical line test, but remember our data is not the function. The function is a model of our data. It is represented by the straigth line and tries to represent the model as best as possible. Our model, the ideia of a model, is to find a way to succintly describe a relationship in data. It´s a sumarization.

Summarizing:

Correlation coeficiente (r): A summary of the relationship that describes the extent of the relationship.

Model: Is a function. It expands the idea of that relationship to describe and summarize what the relationship looks like in terms of the actual variables involved. The specific input and the specific output.

Line of best fit

Details: Week 5: Linear Functions > Lecture Videos > The Line of Best Fit

Any linear function line we draw throu correlated data, must pass throu the value of the mean of X and the mean o Y (same values used to calculate the Pearson Correlation Coefficient).

    meanX = mean(MyOrange$age)
    meanY = mean(MyOrange$circumference)
    
    plot(MyOrange$age, MyOrange$circumference)
    Reg = lm(MyOrange$circumference~MyOrange$age)
    abline(Reg)
    abline(v=meanX, col="red")
    abline(h=meanY, col="red")

However there is a infinite number of lines possible, how to solve this puzzle?

Residuals tells us which line is the best line.

Residuals

A residual can formally be defined as y - Y’. In other words, the actual value of y and the predicted value of y’ by the function

  • Positive residuals are actual points above the line.
  • Nevative residuals are points below the line.

So: Residuals = y - y’

And if we add all resilduals we have ZERO

sum(Residuals) = 0

    Reg = lm(MyOrange$circumference~MyOrange$age)    
    #summary(Reg)
    Reg$residuals
##         1         2         3         4         5         6         7 
## -4.052152 -5.872793  8.461319  8.759084 -4.736232  5.775489 -8.334715
    sum(-4.052152,-5.872793,8.461319,8.759084,-4.736232,5.775489,-8.334715)
## [1] 0
    #However sum(Reg$residuals) =   1.776357e-15 what makes no sense.

So instead we square the residuals and them sum then, obtaining SUM OF SQUARED RESIDUALS VALUE.

SRQ = sum(Residuals^2)

    Reg = lm(MyOrange$circumference~MyOrange$age)   
    #summary(Reg)
    sum(Reg$residuals^2)
## [1] 324.4807

The Sum of Squared Residuals gives us a measure of appropiatness of the line we just draw. How well it works.

The line of best fit for a linear function is the line that has the lowest value of sums of squares residuals of all the possibly line that we can ever draw.

If you plot the intercept value (x) versus the SSQ(y) the lowest point is the line that have the lowest residuals.

Coefficient of Determination

The coefficiente of determination is also called the PROPORTION OF VARIANCE ACCOUNTED FOR or r-squared (r^2)

It is simply the correlation coeficiente squared (r^2).

    Reg = lm(MyOrange$circumference~MyOrange$age)   
    # Correlation
    cor(MyOrange$circumference,MyOrange$age)
## [1] 0.9854675
    # Correlation Coeficient (r^2)
    cor(MyOrange$circumference,MyOrange$age)^2
## [1] 0.9711463
    # As you can see by
    MySum = summary(Reg)
    MySum$r.squared
## [1] 0.9711463

r^2

“It measures the amount of error that we can account for in the variable of interest (dependent variable) by knowing the model of the relationship with it and our independent variable.”

Where the Coefficiente of determination actually comes from.

    # Plotting the mean line
    meanY = mean(MyOrange$circumference)
    plot(MyOrange$age, MyOrange$circumference)
    Reg = lm(MyOrange$circumference~MyOrange$age)
    abline(Reg)
    abline(h=meanY, col="red")

    # Calculate the Sum of Squared Residuals for the meanline
    # Total Sums of Squares (TotalSS)
    
    MyOrange = Orange[Orange$Tree == 1,]
    MyOrange$MyMean = mean(MyOrange$circumference)
    MyOrange$Residual1 = MyOrange$circumference - MyOrange$MyMean
    ssr1 = sum(MyOrange$Residual1^2)
    ssr1
## [1] 11245.71
    # Calculate the Sum of Squared Residuals for the line of best fit 
    # Model Sums of Squares (ModelSS)    
    Reg = lm(MyOrange$circumference~MyOrange$age)   
    MyOrange$Predict = predict(Reg)
    MyOrange$Residual2 = MyOrange$circumference - MyOrange$Predict
    ssr2 = sum(MyOrange$Residual2^2)
    ssr2
## [1] 324.4807
    # Calculate the r.squared   
    diferenca = ssr1-ssr2
    r.square = diferenca/ssr1
    r.square
## [1] 0.9711463
    # Confirm the calculations
    SumReg = summary(Reg)
    SumReg$r.squared
## [1] 0.9711463

We will eventually use this value to help determine which model, linear, exponential or logistic works better with our specific data.

Interpreting the Linear Model

Details: Week 5: Linear Functions > Lecture Videos > Interpreting the Linear Model

Let´s see what each of the linear model parameter are telling us.

In other words:

What do the values of our linear model actually mean?

Suppose the folowwing model:

f(x) = a’ + b’ * x

b’ = slope = Rate of change

The linear model has a constant reate of change over equally spaced distances of the input variable.

slope = f(c2)-f(c1) / c2 - c1
Given: c2 - c1 = 1
As long as c2-c1 = 1, as long as we move one hole unit on our input variable the numerator will be our slope.

If x = 0 Then

a’ = Intercept value

Caution about extrapolation

Let´s see following example, useing the wolf dataset that records the number of wolfs in the yellowstone park in a series of years:

    library(SDSFoundations)
    str(wolf)
## 'data.frame':    13 obs. of  2 variables:
##  $ Year  : int  1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 ...
##  $ Number: int  21 25 41 45 82 121 130 147 174 163 ...
    wolf
##    Year Number
## 1  1995     21
## 2  1996     25
## 3  1997     41
## 4  1998     45
## 5  1999     82
## 6  2000    121
## 7  2001    130
## 8  2002    147
## 9  2003    174
## 10 2004    163
## 11 2005    155
## 12 2006    141
## 13 2007    165
    linFit(wolf$Year,wolf$Number)

## Linear Fit 
##  Intercept =  -26982 
##  Slope =  13.53846 
##  R-squared =  0.84046

From the data above we get:

Wolves = -26982 + 13.53846 * Year

It means that in the year zero we will have -26982 which does not even makes sense.

It occurs because this value is way out of our range of interest and that must be avoided. It is called extrapolation.

Those results are not reliable and even not intepretable.

How do you fix extrapolation?

Instead of using the year we could use years since wolves were introduced in the yellowstone park.

    w = wolf
    w$Time = w$Year- min(w$Year)

Now we have a variable that will make 1995 our year zero.

    linFit(w$Time,w$Number, xlab="Years since 1995", ylab="Number of grey wolves")

## Linear Fit 
##  Intercept =  27.23077 
##  Slope =  13.53846 
##  R-squared =  0.84046

Which will result in a more meaningfull model.

Number = 27.23077 + 13.53846 * Time

Observe that we haven´t change the relationships between the variables. We´ve just made a zero value of our x axis variable in our intercept more meaningful.

This time in year zero we would have about 27 wolves.

SUMMING UP

Slope = Rate of change in our predicted outcome variable we expect to see one hole unit increase in our x axis variable.

Intercept = Predicted value of our outcome variable when our x axis variable is zero.

It is up to US to make sure that this value of zero makes sense.

Exponential Model

Details: UTAustinX: UT.7.10x Foundations of Data Analysis - Part 1 Week 6: Exponential and Logistic Function Models > Lecture Videos > Exponential Growth and Decay

Characteristic: Contant Percentage Rate of change

How does the exponential model differ from the linear model?

Using the SDSFoundations Package

    library(SDSFoundations)
    
    MyOrange = Orange[Orange$Tree == 1,]
    expFit(MyOrange$age, MyOrange$circumference)

## Exponential Fit 
##  a =  33.92672 
##  b =  1.00104 
##  R-squared =  0.90234
    #logisticFit(MyOrange $age, MyOrange $circumference)

Using the R Package

    expFit = lm(log(MyOrange$circumference) ~ MyOrange$age)
    plot(MyOrange$age,MyOrange$circumference)
    lines(MyOrange$age, exp(fitted(expFit)))

Basically one is straight and the exponential is not, however that NOT STRAIGTH has a very specific property that makes it exponential.

In the linear model for every one unit change in x (independent) the change in Y (dependent) is constant.

In the Exponential model there is a constant percent rate of change.

The exponential funciton is:

f(x) = a * b ^ x

a = value when x = 0
b = Growth factor
r = Growth rate
Where:
b = 1 + r

So, for every point on the exponential courve there is a fixed exponential rate of change.

Therefore.

Given the following points in an exponential function;

x’, x’‘, x’‘’, x’‘’’

Their relation is as follows:
percent rate of change = prc

prc% * x’ + x’ = x’‘
prc% * x’‘+ x’‘= x’‘’
prc% * x’‘’ + x’‘’ = x’‘’’ ( and so on )

However to find the Constante rate of change you must return to the function.

The b parameter holds in it the percent rate of change as follows:

b = 1 + r

Where r = Percentage Growth Rate

Where r is the proportional change of the outcome (dependent variable) for every hole unit increase on the input variable.

Exponential Function fatal flaw is that it assumes a contant percent rate of change.

Logistic Model

Details: UTAustinX: UT.7.10x Foundations of Data Analysis - Part 1 Week 6: Exponential and Logistic Function Models > Lecture Videos > The Logistic Growth Model

Characteristic: ability to model the natural upper limit to most systems.

Why the logistic model is good to know? Because unlike the linear and the exponential models it takes into account the natural upper limit we see in most instances of biological processes. The price however is that better fitting of more complex data means a more complex function with parameters that are less intuitive to inerpret.

Using the SDSFoundations Package

    library(SDSFoundations)
    
    MyOrange = Orange[Orange$Tree == 1,]
    logisticFit(MyOrange$age, MyOrange$circumference)

## Logistic Fit 
##  C =  154.1629 
##  a =  5.63979 
##  b =  1.00276 
##  R-squared =  0.98426

Using the R Package

    # I did not know how to do it with my example so I picked this one
    
    #install.packages("LOGIT")
    library(car)
    library(ggplot2)
    #Data
    mass<-c(6.25,10,20,23,26,27.6,29.8,31.6,37.2,41.2,48.7,54,54,63,66,72,72.2,
    76,75) #Wilson's mass in pounds
    
    days.since.birth<-c(31,62,93,99,107,113,121,127,148,161,180,214,221,307,
    452,482,923, 955,1308) #days since Wilson's birth
    #days.since.birth
    data<-data.frame(mass,days.since.birth) #create the data frame
    #str(data)
    #data
    plot(mass~days.since.birth, data=data) #always look at your data first!

    coef(lm(logit(mass/100)~days.since.birth,data=data))
##      (Intercept) days.since.birth 
##     -1.096091866      0.002362622
    wilson<-nls(mass~phi1/(1+exp(-(phi2+phi3*days.since.birth))),
    start=list(phi1=100,phi2=-1.096,phi3=.002),data=data,trace=TRUE)
## 3825.744 :  100.000  -1.096   0.002
## 3716.262 :  81.463877204 -0.886292540  0.002512256
## 3489.696 :  66.115027751 -0.731862991  0.003791928
## 1927.422 :  63.447368011 -1.036245947  0.007113719
## 204.0813 :  71.10755282 -2.06528377  0.01379199
## 123.3499 :  71.76966631 -2.39321544  0.01639836
## 121.3052 :  71.62701380 -2.45163194  0.01692932
## 121.2515 :  71.58084567 -2.46029145  0.01701556
## 121.2502 :  71.57256235 -2.46167633  0.01702943
## 121.2501 :  71.57121657 -2.46189524  0.01703163
## 121.2501 :  71.57100257 -2.46192995  0.01703198
## 121.2501 :  71.57096862 -2.46193546  0.01703204
    #build the model, start is the starting parameters, trace=TRUE will return the iterations
    #summary(wilson)
    
    
    #set parameters
    phi1<-coef(wilson)[1]
    phi2<-coef(wilson)[2]
    phi3<-coef(wilson)[3]
    x<-c(min(data$days.since.birth):max(data$days.since.birth)) #construct a range of x values bounded by the data
    y<-phi1/(1+exp(-(phi2+phi3*x))) #predicted mass
    predict<-data.frame(x,y) #create the prediction data frame#And add a nice plot (I cheated and added the awesome inset jpg in another program)
    ggplot(data=data,aes(x=days.since.birth,y=mass))+
    geom_point(color="blue",size=5)+theme_bw()+
    labs(x="Days Since Birth",y="Mass (lbs)")+
    scale_x_continuous(breaks=c(0,250,500,750, 1000,1250))+
    scale_y_continuous(breaks=c(0,10,20,30,40,50,60,70,80))+
    theme(axis.text=element_text(size=18),axis.title=element_text(size=24))+
    geom_line(data=predict,aes(x=x,y=y), size=1)

The logistc funciton is:

f(t) = C / (1 + a b^-t)

C = Carrying capacity

The A and B values have nothing to do with their linear and exponential counterparts.

a = It is a helper, it helps to find where of the f(t) is when our input value is zero. Because in this case f(t=0) = C / 1 + a. Diffente from the other models where a or intercept is the actual value. b = In the case of the logistic growth b must be greater than 1 (b>1)

Carrying capacity:

It is the upper limit of the function. It represents the upper Y (dependent) value.

The logistic function has a point where it turns over.

The point where the function turns over is the inflection point.

The inflection point occurs at C/2.

The inflection point can also be calculated by:

Inflection point = I.P = log(a)/log(b).

DOUBT???: IS BE A MEANSURE OF RATE OF CHANGE OR PERCENTAGE RATE OF CHANGE OR WHAT?

Using the models to our advantage

1 We use the model to predict amount of an outcome (dependnet) variable at an specific input (independent) value. 2 We can also use the functions to determine at what value of our input (indepenent) variable we have a certain level of outcome (dependent variable).

We can call that solving for t.

As in:

f(t) = a + b * t (Linear model)
f(t) = a * b ^ t (Exponential model)
f(t) = C / (1 + a b^-t) (Logistic model)

And solving for t means you have the value of f(t) and have to do the calculations to find the value o t. Solving for t therefore.

To do so you´ll need to know the basic log rules.

Log Rules

y = log(x) = 10^y = x

log(ab) = log(a) + log(b)
log(a/b) = log(a) - log(b) 7 log(b^t) = t * log(b)

How to find the best model to fit the data

Details: UTAustinX: UT.7.10x Foundations of Data Analysis - Part 1 Week 6: Exponential and Logistic Function Models > Lecture Videos > Finding the “Best” Model

How we can tell which model fits our data best?

When we look at data the best model is something very subjective. Specially when we take into account context of the data. But when context is out of the table we want to simplify. And to do that we turn to residuals.

Useful Residuals:

Total Sums of Squares (TotalSS)

Uses the mean of the outcome variable.

Model Sums of Squares (ModelSS)

It´s how much of the error in the outcome variable that is explained by the model.

    linFit(MyOrange$age, MyOrange$circumference)

## Linear Fit 
##  Intercept =  24.43785 
##  Slope =  0.08148 
##  R-squared =  0.97115
    expFit(MyOrange$age, MyOrange$circumference)

## Exponential Fit 
##  a =  33.92672 
##  b =  1.00104 
##  R-squared =  0.90234
    logisticFit(MyOrange$age, MyOrange$circumference)

## Logistic Fit 
##  C =  154.1629 
##  a =  5.63979 
##  b =  1.00276 
##  R-squared =  0.98426

Observe the results of r.square values:

Linear = 0.97115 (97%)
Exponential = 0.90234 (90%)
Logistic = 0.98426 (98%)

So the Logistic model explains better our data.

    tripleFit(MyOrange$age, MyOrange$circumference)

A few explanations

When we take the log of our dependent variable we get closer to a linear line.

    plot(MyOrange$age,log(MyOrange$circumference)) 

    linFit(MyOrange$age,log(MyOrange$circumference))

## Linear Fit 
##  Intercept =  3.5242 
##  Slope =  0.00104 
##  R-squared =  0.90234
    Reg = lm(log(MyOrange$circumference)~ MyOrange$age)
    SumReg = summary(Reg)
    SumReg
## 
## Call:
## lm(formula = log(MyOrange$circumference) ~ MyOrange$age)
## 
## Residuals:
##         1         2         3         4         5         6         7 
## -0.245907  0.032137  0.250125  0.175027 -0.018842  0.002637 -0.195179 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.5242030  0.1596461  22.075 3.54e-06 ***
## MyOrange$age 0.0010415  0.0001532   6.797  0.00105 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1965 on 5 degrees of freedom
## Multiple R-squared:  0.9023, Adjusted R-squared:  0.8828 
## F-statistic:  46.2 on 1 and 5 DF,  p-value: 0.00105
    #Intercept
    Reg$coefficients[1] 
## (Intercept) 
##    3.524203
    #Slope
    Reg$coefficients[2]    
## MyOrange$age 
##  0.001041536
    # Reverting the log
    exp(3.524203) # intercept
## [1] 33.92672
    exp(0.00104153) # slope    
## [1] 1.001042
    expFit(MyOrange$age, MyOrange$circumference)

## Exponential Fit 
##  a =  33.92672 
##  b =  1.00104 
##  R-squared =  0.90234

In other words the r.squared of a exponential function is the same r.square of a function who has had it´s dependent value logged.

Predictions

Linear

One Specific VAlue

What will be the circunference of a tree with 1250 years?

      PredictedValue = linFitPred(MyOrange$age, MyOrange $circumference, 1250)

      PredictedValue
## [1] 126.28

Array of values

What will be the circunference of a tree with 1250, 1260, 1360 years?

      PredictedValue = linFitPred(MyOrange$age, MyOrange $circumference, c(1250, 1260, 1360))

      PredictedValue
## [1] 126.28 127.10 135.25

The values of the dataset predicted by the function

What will be the circunference of a tree with 1250, 1260, 1360 years?

      PredictedValue = linFitPred(MyOrange$age, MyOrange $circumference, MyOrange$age)

      PredictedValue
## [1]  34.05  63.87  78.54 106.24 124.74 136.22 153.33

Exponential

One Specific VAlue

What will be the circunference of a tree with 1250 years?

      expFitPred(MyOrange$age, MyOrange $circumference, 1250)
      # you can store this value
    
      PredictedValue = expFitPred(MyOrange$age, MyOrange $circumference, 1250)

      PredictedValue
## [1] 124.726

Array of values

What will be the circunference of a tree with 1250, 1260, 1360 years?

      PredictedValue = expFitPred(MyOrange$age, MyOrange $circumference, c(1250, 1260, 1360))

      PredictedValue
## [1] 124.726 126.032 139.867

The values of the dataset predicted by the function

What will be the circunference of a tree with 1250, 1260, 1360 years?

      PredictedValue = expFitPred(MyOrange$age, MyOrange $circumference, MyOrange$age)

      PredictedValue
## [1]  38.363  56.166  67.747  96.535 122.282 141.626 176.252

Logistic

One Specific VAlue

What will be the circunference of a tree with 1250 years?

      PredictedValue = logisticFitPred(MyOrange$age, MyOrange $circumference, 1250)

      PredictedValue
## [1] 130.705

Array of values

What will be the circunference of a tree with 1250, 1260, 1360 years?

      PredictedValue = logisticFitPred(MyOrange$age, MyOrange $circumference, c(1250, 1260, 1360))

      PredictedValue
## [1] 130.705 131.249 136.126

The values of the dataset predicted by the function

What will be the circunference of a tree with 1250, 1260, 1360 years?

      PredictedValue = logisticFitPred(MyOrange$age, MyOrange $circumference, MyOrange$age)

      PredictedValue
## [1]  30.389  62.055  80.991 113.881 129.644 136.646 143.831