Details: UTAustinX: UT.7.10x Foundations of Data Analysis - Part 1 Week 5: Linear Functions > Lecture Videos > Functions and Models
Characteristic: Contant Rate of change.
Using SDSFoundations Package
library(SDSFoundations)
MyOrange = Orange[Orange$Tree == 1,]
linFit(MyOrange$age, MyOrange $circumference)
## Linear Fit
## Intercept = 24.43785
## Slope = 0.08148
## R-squared = 0.97115
#expFit(MyOrange $age, MyOrange $circumference)
#logisticFit(MyOrange $age, MyOrange $circumference)
Using R Package
MyOrange = Orange[Orange$Tree == 1,]
Reg = lm(MyOrange$circumference ~ MyOrange$age)
plot(MyOrange$age , MyOrange$circumference)
abline(Reg)
SumReg = summary(Reg)
# Linear fit
Reg$coefficients[1] # Intercept
## (Intercept)
## 24.43785
Reg$coefficients[2] # Slope
## MyOrange$age
## 0.08147716
SumReg$r.squared # r.squared
## [1] 0.9711463
From the above results we gather together the function
circumference = 24.43785 + 0.08148 (age)
Interpretation:
To predict the outcome of CIRCUMFERENCE start with a line at 0 and 0.08148 and for every unit increase in x, our AGE value, we increate by 0.08148 units of our y, our CIRCUMFERENCE.
Using this function we can find a single output value for any value of our input.
Linear Function fatal flaw is that it assumes a contant rate of change.
You can predit a specific value
What will be the circunference of a tree with 1250 years?
linFitPred(MyOrange$age, MyOrange $circumference, 1250)
# you can store this value
PredictedValue = linFitPred(MyOrange$age, MyOrange $circumference, 1250)
PredictedValue
## [1] 126.28
If we draw a vertical line our function should only produce one vertical value for any value of X. You can see from the graph above that this is not the case.
Our data does not pass the vertical line test, but remember our data is not the function. The function is a model of our data. It is represented by the straigth line and tries to represent the model as best as possible. Our model, the ideia of a model, is to find a way to succintly describe a relationship in data. It´s a sumarization.
Summarizing:
Correlation coeficiente (r): A summary of the relationship that describes the extent of the relationship.
Model: Is a function. It expands the idea of that relationship to describe and summarize what the relationship looks like in terms of the actual variables involved. The specific input and the specific output.
Details: Week 5: Linear Functions > Lecture Videos > The Line of Best Fit
Any linear function line we draw throu correlated data, must pass throu the value of the mean of X and the mean o Y (same values used to calculate the Pearson Correlation Coefficient).
meanX = mean(MyOrange$age)
meanY = mean(MyOrange$circumference)
plot(MyOrange$age, MyOrange$circumference)
Reg = lm(MyOrange$circumference~MyOrange$age)
abline(Reg)
abline(v=meanX, col="red")
abline(h=meanY, col="red")
However there is a infinite number of lines possible, how to solve this puzzle?
Residuals tells us which line is the best line.
A residual can formally be defined as y - Y’. In other words, the actual value of y and the predicted value of y’ by the function
So: Residuals = y - y’
And if we add all resilduals we have ZERO
sum(Residuals) = 0
Reg = lm(MyOrange$circumference~MyOrange$age)
#summary(Reg)
Reg$residuals
## 1 2 3 4 5 6 7
## -4.052152 -5.872793 8.461319 8.759084 -4.736232 5.775489 -8.334715
sum(-4.052152,-5.872793,8.461319,8.759084,-4.736232,5.775489,-8.334715)
## [1] 0
#However sum(Reg$residuals) = 1.776357e-15 what makes no sense.
So instead we square the residuals and them sum then, obtaining SUM OF SQUARED RESIDUALS VALUE.
SRQ = sum(Residuals^2)
Reg = lm(MyOrange$circumference~MyOrange$age)
#summary(Reg)
sum(Reg$residuals^2)
## [1] 324.4807
The Sum of Squared Residuals gives us a measure of appropiatness of the line we just draw. How well it works.
The line of best fit for a linear function is the line that has the lowest value of sums of squares residuals of all the possibly line that we can ever draw.
If you plot the intercept value (x) versus the SSQ(y) the lowest point is the line that have the lowest residuals.
The coefficiente of determination is also called the PROPORTION OF VARIANCE ACCOUNTED FOR or r-squared (r^2)
It is simply the correlation coeficiente squared (r^2).
Reg = lm(MyOrange$circumference~MyOrange$age)
# Correlation
cor(MyOrange$circumference,MyOrange$age)
## [1] 0.9854675
# Correlation Coeficient (r^2)
cor(MyOrange$circumference,MyOrange$age)^2
## [1] 0.9711463
# As you can see by
MySum = summary(Reg)
MySum$r.squared
## [1] 0.9711463
r^2
“It measures the amount of error that we can account for in the variable of interest (dependent variable) by knowing the model of the relationship with it and our independent variable.”
Where the Coefficiente of determination actually comes from.
# Plotting the mean line
meanY = mean(MyOrange$circumference)
plot(MyOrange$age, MyOrange$circumference)
Reg = lm(MyOrange$circumference~MyOrange$age)
abline(Reg)
abline(h=meanY, col="red")
# Calculate the Sum of Squared Residuals for the meanline
# Total Sums of Squares (TotalSS)
MyOrange = Orange[Orange$Tree == 1,]
MyOrange$MyMean = mean(MyOrange$circumference)
MyOrange$Residual1 = MyOrange$circumference - MyOrange$MyMean
ssr1 = sum(MyOrange$Residual1^2)
ssr1
## [1] 11245.71
# Calculate the Sum of Squared Residuals for the line of best fit
# Model Sums of Squares (ModelSS)
Reg = lm(MyOrange$circumference~MyOrange$age)
MyOrange$Predict = predict(Reg)
MyOrange$Residual2 = MyOrange$circumference - MyOrange$Predict
ssr2 = sum(MyOrange$Residual2^2)
ssr2
## [1] 324.4807
# Calculate the r.squared
diferenca = ssr1-ssr2
r.square = diferenca/ssr1
r.square
## [1] 0.9711463
# Confirm the calculations
SumReg = summary(Reg)
SumReg$r.squared
## [1] 0.9711463
We will eventually use this value to help determine which model, linear, exponential or logistic works better with our specific data.
Details: Week 5: Linear Functions > Lecture Videos > Interpreting the Linear Model
Let´s see what each of the linear model parameter are telling us.
In other words:
What do the values of our linear model actually mean?
Suppose the folowwing model:
f(x) = a’ + b’ * x
b’ = slope = Rate of change
The linear model has a constant reate of change over equally spaced distances of the input variable.
slope = f(c2)-f(c1) / c2 - c1
Given: c2 - c1 = 1
As long as c2-c1 = 1, as long as we move one hole unit on our input variable the numerator will be our slope.
If x = 0 Then
a’ = Intercept value
Caution about extrapolation
Let´s see following example, useing the wolf dataset that records the number of wolfs in the yellowstone park in a series of years:
library(SDSFoundations)
str(wolf)
## 'data.frame': 13 obs. of 2 variables:
## $ Year : int 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 ...
## $ Number: int 21 25 41 45 82 121 130 147 174 163 ...
wolf
## Year Number
## 1 1995 21
## 2 1996 25
## 3 1997 41
## 4 1998 45
## 5 1999 82
## 6 2000 121
## 7 2001 130
## 8 2002 147
## 9 2003 174
## 10 2004 163
## 11 2005 155
## 12 2006 141
## 13 2007 165
linFit(wolf$Year,wolf$Number)
## Linear Fit
## Intercept = -26982
## Slope = 13.53846
## R-squared = 0.84046
From the data above we get:
Wolves = -26982 + 13.53846 * Year
It means that in the year zero we will have -26982 which does not even makes sense.
It occurs because this value is way out of our range of interest and that must be avoided. It is called extrapolation.
Those results are not reliable and even not intepretable.
How do you fix extrapolation?
Instead of using the year we could use years since wolves were introduced in the yellowstone park.
w = wolf
w$Time = w$Year- min(w$Year)
Now we have a variable that will make 1995 our year zero.
linFit(w$Time,w$Number, xlab="Years since 1995", ylab="Number of grey wolves")
## Linear Fit
## Intercept = 27.23077
## Slope = 13.53846
## R-squared = 0.84046
Which will result in a more meaningfull model.
Number = 27.23077 + 13.53846 * Time
Observe that we haven´t change the relationships between the variables. We´ve just made a zero value of our x axis variable in our intercept more meaningful.
This time in year zero we would have about 27 wolves.
SUMMING UP
Slope = Rate of change in our predicted outcome variable we expect to see one hole unit increase in our x axis variable.
Intercept = Predicted value of our outcome variable when our x axis variable is zero.
It is up to US to make sure that this value of zero makes sense.
Details: UTAustinX: UT.7.10x Foundations of Data Analysis - Part 1 Week 6: Exponential and Logistic Function Models > Lecture Videos > Exponential Growth and Decay
Characteristic: Contant Percentage Rate of change
How does the exponential model differ from the linear model?
Using the SDSFoundations Package
library(SDSFoundations)
MyOrange = Orange[Orange$Tree == 1,]
expFit(MyOrange$age, MyOrange$circumference)
## Exponential Fit
## a = 33.92672
## b = 1.00104
## R-squared = 0.90234
#logisticFit(MyOrange $age, MyOrange $circumference)
Using the R Package
expFit = lm(log(MyOrange$circumference) ~ MyOrange$age)
plot(MyOrange$age,MyOrange$circumference)
lines(MyOrange$age, exp(fitted(expFit)))
Basically one is straight and the exponential is not, however that NOT STRAIGTH has a very specific property that makes it exponential.
In the linear model for every one unit change in x (independent) the change in Y (dependent) is constant.
In the Exponential model there is a constant percent rate of change.
The exponential funciton is:
f(x) = a * b ^ x
a = value when x = 0
b = Growth factor
r = Growth rate
Where:
b = 1 + r
So, for every point on the exponential courve there is a fixed exponential rate of change.
Therefore.
Given the following points in an exponential function;
x’, x’‘, x’‘’, x’‘’’
Their relation is as follows:
percent rate of change = prc
prc% * x’ + x’ = x’‘
prc% * x’‘+ x’‘= x’‘’
prc% * x’‘’ + x’‘’ = x’‘’’ ( and so on )
However to find the Constante rate of change you must return to the function.
The b parameter holds in it the percent rate of change as follows:
b = 1 + r
Where r = Percentage Growth Rate
Where r is the proportional change of the outcome (dependent variable) for every hole unit increase on the input variable.
Exponential Function fatal flaw is that it assumes a contant percent rate of change.
Details: UTAustinX: UT.7.10x Foundations of Data Analysis - Part 1 Week 6: Exponential and Logistic Function Models > Lecture Videos > The Logistic Growth Model
Characteristic: ability to model the natural upper limit to most systems.
Why the logistic model is good to know? Because unlike the linear and the exponential models it takes into account the natural upper limit we see in most instances of biological processes. The price however is that better fitting of more complex data means a more complex function with parameters that are less intuitive to inerpret.
Using the SDSFoundations Package
library(SDSFoundations)
MyOrange = Orange[Orange$Tree == 1,]
logisticFit(MyOrange$age, MyOrange$circumference)
## Logistic Fit
## C = 154.1629
## a = 5.63979
## b = 1.00276
## R-squared = 0.98426
Using the R Package
# I did not know how to do it with my example so I picked this one
#install.packages("LOGIT")
library(car)
library(ggplot2)
#Data
mass<-c(6.25,10,20,23,26,27.6,29.8,31.6,37.2,41.2,48.7,54,54,63,66,72,72.2,
76,75) #Wilson's mass in pounds
days.since.birth<-c(31,62,93,99,107,113,121,127,148,161,180,214,221,307,
452,482,923, 955,1308) #days since Wilson's birth
#days.since.birth
data<-data.frame(mass,days.since.birth) #create the data frame
#str(data)
#data
plot(mass~days.since.birth, data=data) #always look at your data first!
coef(lm(logit(mass/100)~days.since.birth,data=data))
## (Intercept) days.since.birth
## -1.096091866 0.002362622
wilson<-nls(mass~phi1/(1+exp(-(phi2+phi3*days.since.birth))),
start=list(phi1=100,phi2=-1.096,phi3=.002),data=data,trace=TRUE)
## 3825.744 : 100.000 -1.096 0.002
## 3716.262 : 81.463877204 -0.886292540 0.002512256
## 3489.696 : 66.115027751 -0.731862991 0.003791928
## 1927.422 : 63.447368011 -1.036245947 0.007113719
## 204.0813 : 71.10755282 -2.06528377 0.01379199
## 123.3499 : 71.76966631 -2.39321544 0.01639836
## 121.3052 : 71.62701380 -2.45163194 0.01692932
## 121.2515 : 71.58084567 -2.46029145 0.01701556
## 121.2502 : 71.57256235 -2.46167633 0.01702943
## 121.2501 : 71.57121657 -2.46189524 0.01703163
## 121.2501 : 71.57100257 -2.46192995 0.01703198
## 121.2501 : 71.57096862 -2.46193546 0.01703204
#build the model, start is the starting parameters, trace=TRUE will return the iterations
#summary(wilson)
#set parameters
phi1<-coef(wilson)[1]
phi2<-coef(wilson)[2]
phi3<-coef(wilson)[3]
x<-c(min(data$days.since.birth):max(data$days.since.birth)) #construct a range of x values bounded by the data
y<-phi1/(1+exp(-(phi2+phi3*x))) #predicted mass
predict<-data.frame(x,y) #create the prediction data frame#And add a nice plot (I cheated and added the awesome inset jpg in another program)
ggplot(data=data,aes(x=days.since.birth,y=mass))+
geom_point(color="blue",size=5)+theme_bw()+
labs(x="Days Since Birth",y="Mass (lbs)")+
scale_x_continuous(breaks=c(0,250,500,750, 1000,1250))+
scale_y_continuous(breaks=c(0,10,20,30,40,50,60,70,80))+
theme(axis.text=element_text(size=18),axis.title=element_text(size=24))+
geom_line(data=predict,aes(x=x,y=y), size=1)
The logistc funciton is:
f(t) = C / (1 + a b^-t)
C = Carrying capacity
The A and B values have nothing to do with their linear and exponential counterparts.
a = It is a helper, it helps to find where of the f(t) is when our input value is zero. Because in this case f(t=0) = C / 1 + a. Diffente from the other models where a or intercept is the actual value. b = In the case of the logistic growth b must be greater than 1 (b>1)
Carrying capacity:
It is the upper limit of the function. It represents the upper Y (dependent) value.
The logistic function has a point where it turns over.
The point where the function turns over is the inflection point.
The inflection point occurs at C/2.
The inflection point can also be calculated by:
Inflection point = I.P = log(a)/log(b).
DOUBT???: IS BE A MEANSURE OF RATE OF CHANGE OR PERCENTAGE RATE OF CHANGE OR WHAT?
1 We use the model to predict amount of an outcome (dependnet) variable at an specific input (independent) value. 2 We can also use the functions to determine at what value of our input (indepenent) variable we have a certain level of outcome (dependent variable).
We can call that solving for t.
As in:
f(t) = a + b * t (Linear model)
f(t) = a * b ^ t (Exponential model)
f(t) = C / (1 + a b^-t) (Logistic model)
And solving for t means you have the value of f(t) and have to do the calculations to find the value o t. Solving for t therefore.
To do so you´ll need to know the basic log rules.
Log Rules
y = log(x) = 10^y = x
log(ab) = log(a) + log(b)
log(a/b) = log(a) - log(b) 7 log(b^t) = t * log(b)
Details: UTAustinX: UT.7.10x Foundations of Data Analysis - Part 1 Week 6: Exponential and Logistic Function Models > Lecture Videos > Finding the “Best” Model
How we can tell which model fits our data best?
When we look at data the best model is something very subjective. Specially when we take into account context of the data. But when context is out of the table we want to simplify. And to do that we turn to residuals.
Useful Residuals:
Total Sums of Squares (TotalSS)
Uses the mean of the outcome variable.
Model Sums of Squares (ModelSS)
It´s how much of the error in the outcome variable that is explained by the model.
linFit(MyOrange$age, MyOrange$circumference)
## Linear Fit
## Intercept = 24.43785
## Slope = 0.08148
## R-squared = 0.97115
expFit(MyOrange$age, MyOrange$circumference)
## Exponential Fit
## a = 33.92672
## b = 1.00104
## R-squared = 0.90234
logisticFit(MyOrange$age, MyOrange$circumference)
## Logistic Fit
## C = 154.1629
## a = 5.63979
## b = 1.00276
## R-squared = 0.98426
Observe the results of r.square values:
Linear = 0.97115 (97%)
Exponential = 0.90234 (90%)
Logistic = 0.98426 (98%)
So the Logistic model explains better our data.
tripleFit(MyOrange$age, MyOrange$circumference)
A few explanations
When we take the log of our dependent variable we get closer to a linear line.
plot(MyOrange$age,log(MyOrange$circumference))
linFit(MyOrange$age,log(MyOrange$circumference))
## Linear Fit
## Intercept = 3.5242
## Slope = 0.00104
## R-squared = 0.90234
Reg = lm(log(MyOrange$circumference)~ MyOrange$age)
SumReg = summary(Reg)
SumReg
##
## Call:
## lm(formula = log(MyOrange$circumference) ~ MyOrange$age)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.245907 0.032137 0.250125 0.175027 -0.018842 0.002637 -0.195179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5242030 0.1596461 22.075 3.54e-06 ***
## MyOrange$age 0.0010415 0.0001532 6.797 0.00105 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1965 on 5 degrees of freedom
## Multiple R-squared: 0.9023, Adjusted R-squared: 0.8828
## F-statistic: 46.2 on 1 and 5 DF, p-value: 0.00105
#Intercept
Reg$coefficients[1]
## (Intercept)
## 3.524203
#Slope
Reg$coefficients[2]
## MyOrange$age
## 0.001041536
# Reverting the log
exp(3.524203) # intercept
## [1] 33.92672
exp(0.00104153) # slope
## [1] 1.001042
expFit(MyOrange$age, MyOrange$circumference)
## Exponential Fit
## a = 33.92672
## b = 1.00104
## R-squared = 0.90234
In other words the r.squared of a exponential function is the same r.square of a function who has had it´s dependent value logged.
One Specific VAlue
What will be the circunference of a tree with 1250 years?
PredictedValue = linFitPred(MyOrange$age, MyOrange $circumference, 1250)
PredictedValue
## [1] 126.28
Array of values
What will be the circunference of a tree with 1250, 1260, 1360 years?
PredictedValue = linFitPred(MyOrange$age, MyOrange $circumference, c(1250, 1260, 1360))
PredictedValue
## [1] 126.28 127.10 135.25
The values of the dataset predicted by the function
What will be the circunference of a tree with 1250, 1260, 1360 years?
PredictedValue = linFitPred(MyOrange$age, MyOrange $circumference, MyOrange$age)
PredictedValue
## [1] 34.05 63.87 78.54 106.24 124.74 136.22 153.33
One Specific VAlue
What will be the circunference of a tree with 1250 years?
expFitPred(MyOrange$age, MyOrange $circumference, 1250)
# you can store this value
PredictedValue = expFitPred(MyOrange$age, MyOrange $circumference, 1250)
PredictedValue
## [1] 124.726
Array of values
What will be the circunference of a tree with 1250, 1260, 1360 years?
PredictedValue = expFitPred(MyOrange$age, MyOrange $circumference, c(1250, 1260, 1360))
PredictedValue
## [1] 124.726 126.032 139.867
The values of the dataset predicted by the function
What will be the circunference of a tree with 1250, 1260, 1360 years?
PredictedValue = expFitPred(MyOrange$age, MyOrange $circumference, MyOrange$age)
PredictedValue
## [1] 38.363 56.166 67.747 96.535 122.282 141.626 176.252
One Specific VAlue
What will be the circunference of a tree with 1250 years?
PredictedValue = logisticFitPred(MyOrange$age, MyOrange $circumference, 1250)
PredictedValue
## [1] 130.705
Array of values
What will be the circunference of a tree with 1250, 1260, 1360 years?
PredictedValue = logisticFitPred(MyOrange$age, MyOrange $circumference, c(1250, 1260, 1360))
PredictedValue
## [1] 130.705 131.249 136.126
The values of the dataset predicted by the function
What will be the circunference of a tree with 1250, 1260, 1360 years?
PredictedValue = logisticFitPred(MyOrange$age, MyOrange $circumference, MyOrange$age)
PredictedValue
## [1] 30.389 62.055 80.991 113.881 129.644 136.646 143.831