Seth J. Chandler
September 9, 2014
This document discusses two matters. First, it introduces the idea of a “maximum likelihood estimate.” It then shows how regression is conducted in R. If I have done a good job, I will show how these two matters tie together.
Suppose we have 5 values that I tell you are drawn from a member of the normal distribution family. Here they are. I’ll call them y. I’ll also print out the mean and standard deviation of the sample.
set.seed(9092014)
print(y<-rnorm(5,3,4))
## [1] 10.384 3.593 1.294 6.162 3.070
print(c(mean(y),sd(y)))
## [1] 4.901 3.526
I now ask you what the probabilities are that these numbers would be drawn if the sample came from a member of the normal distribution family with mean m and standard deviation s. Since the draws are independent, the probability of the sample is the product of the probabilities of the members. We now need to realize two things.
First, the logarithm is a strictly increasing function, so whatever choice of parameters maximizes a (positive) quantity also maximizes its log. Second, the log of a product is the sum of the logs of its factors.
All of this means that if some parameters create a distribution that maximizes the product of the probabilities of the draws then it must also be the case that the same parameters maximize the sum of the log of the probabilities of the draws. Thus, instead of computing the product of the probabilities, we can equally well compute the “log-likelihoods”: the sum of the log of the probabilities of the draws. And since this computation ends up being considerably simpler, we are going to follow tradition here and do precisely that.
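Here is a quick numeric check of that claim, using a few arbitrary made-up probabilities (the particular numbers mean nothing):
p <- c(0.2, 0.05, 0.6)
log(prod(p))   # log of the product...
sum(log(p))    # ...equals the sum of the logs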
So we can write a function sampleprob as follows. It takes the density function of a normal distribution with mean m and standard deviation s and evaluates it for all values of the data. It then takes the log of the results. And it then sums up the logs to produce a single value.
sampleprob<- function(data,m,s) sum(log(dnorm(data,m,s)))
By way of example, suppose I guessed that the sample came from a normal distribution with mean of 2 and standard deviation of 5. Here’s what the sum of the logs of the probabilities of the sample would be. And I can do the same computation with a guessed mean of 4 and a standard deviation of 2. We can see that the first guess is a little better because the sum of the log of the probabilities is a little higher.
print(sampleprob(y,2,5))
## [1] -14.48
print(sampleprob(y,4,2))
## [1] -14.78
We could make a three-dimensional plot of the ways in which sampleprob varied as we varied m (the proposed mean) and s (the proposed standard deviation). I’ve done so in the graphic below.
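Such a surface can be generated with a few lines of base R. The grid ranges below and the use of persp are my own choices, offered as a sketch rather than as the code that produced the original figure:
# Evaluate sampleprob over a grid of candidate means and standard deviations.
m.grid <- seq(0, 10, length.out = 50)
s.grid <- seq(1, 8, length.out = 50)
z <- outer(m.grid, s.grid, Vectorize(function(m, s) sampleprob(y, m, s)))
# Draw the resulting log-likelihood surface.
persp(m.grid, s.grid, z, xlab = "m", ylab = "s",
      zlab = "sum of log-probabilities", theta = 30, phi = 25)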
We can see that there is some value (it looks like about mean=5, standard deviation=3) at which the sum of the log of the probabilities is maximized. This is the maximum likelihood estimate of the distribution from which the sample was derived. We can also use R’s optim¹ command to zero in on this distribution. The idea behind optim is that we supply a guess for the parameters as the first argument (a vector) and then provide the function to be minimized as the second argument. If we actually want to maximize a function, as here, we just take the negative of the function. Here my initial guess is that the normal distribution from which the sample derives has the same mean and standard deviation as my sample.
optim(c(mean(y),(sd(y))),function (params) -sampleprob(y,params[1],params[2]))
## $par
## [1] 4.900 3.154
##
## $value
## [1] 12.84
##
## $counts
## function gradient
## 47 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
The output tells us the optimal results are a mean of 4.9 and a standard deviation of 3.154. The $value of 12.84 is the value of the function we asked optim to minimize, which was the negative of sampleprob; in other words, the maximized sum of the log of the probabilities is -12.84. You can ignore the remaining information for now.
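If you want to work with the answer rather than just read it off the screen, you can store the optim result and pull out the pieces by name. This is just a sketch of the same call as above:
mle <- optim(c(mean(y), sd(y)),
             function(params) -sampleprob(y, params[1], params[2]))
mle$par     # the estimated mean and standard deviation (about 4.90 and 3.15)
-mle$value  # the maximized sum of log-probabilities (about -12.84)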
Let’s kick it up a notch now. Suppose we believe each item in our list comes from a different normal distribution but that all the distributions are related to each other. Suppose, in particular, that for each number in our random sample we associate an “x-value.” I’ll do this now (don’t ask where the numbers came from). I’ll also plot the x-values and y-values together.
print(x<-c(1,0,7,6,-2))
## [1] 1 0 7 6 -2
require(ggplot2)
ggplot(data=data.frame(x=x,y=y),aes(x=x,y=y))+geom_point()
Suppose further that we believe that the normal distribution from which each y-value derives is a function of the associated x-value. In particular, it is a normal distribution whose mean is given by \(a x + b\) and whose standard deviation is some constant value. I hope you recognize that \(a x + b\) describes a line with y-intercept b and slope a. And now the question is this: what are the values of a and b, and what is the standard deviation s, for which the sum of the log of the probabilities is maximized? Read the preceding sentence over and over until you understand it well.
Let’s create a function sampleproblm that computes the sum of the log of the probabilities given our y data, our x data, and arbitrary parameters a, b, and s. Then let’s see how that function behaves given our values of y and x under (a) a guess that a is -1, b is 2, and the standard deviation of the distribution is a constant equal to the standard deviation of our sample, and (b) a guess that a is 3, b is 0.4, and the standard deviation of the distribution is again the standard deviation of our sample.
sampleproblm<-function(ydata,xdata,a,b,s) sum(log(dnorm(ydata,a*xdata+b,s)))
sampleproblm(y,x,-1,2,sd(y))
## [1] -20.32
sampleproblm(y,x,3,0.4,sd(y))
## [1] -38.57
We can see that our first guess is better. Again, we can plot the value of sampleproblm as we vary a and b.
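Here is a sketch of one way to draw that surface, holding s fixed at sd(y); the grid ranges and the choice of a contour display are my own rather than the original figure’s:
# Evaluate sampleproblm over a grid of candidate slopes a and intercepts b.
a.grid <- seq(-3, 3, length.out = 60)
b.grid <- seq(0, 10, length.out = 60)
z <- outer(a.grid, b.grid,
           Vectorize(function(a, b) sampleproblm(y, x, a, b, sd(y))))
# Contours of the sum of the log-probabilities; the peak sits near a = -0.14, b = 5.2.
contour(a.grid, b.grid, z, xlab = "a (slope)", ylab = "b (intercept)")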
And we can use optim to find the values of a and b that maximize the sum of the log of the probabilities. I’ll guess that the best values are 1 and 2.
optim(c(1,2),function (params) -sampleproblm(y,x,params[1],params[2],sd(y)))
## $par
## [1] -0.1395 5.2350
##
## $value
## [1] 12.85
##
## $counts
## function gradient
## 65 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
Rather than fix the standard deviation at sd(y), we can also hand it to optim as a third parameter and let it be estimated along with a and b.
optim(c(1,2,sd(y)),function (params) -sampleproblm(y,x,params[1],params[2],params[3]) )
## $par
## [1] -0.1393 5.2347 3.1155
##
## $value
## [1] 12.78
##
## $counts
## function gradient
## 100 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
So, we have now found the “maximum likelihood estimate.” We can now plot the line associated with the parameters a and b that generate a maximum likelihood estimate. I’ll do it on the same graph as above.
ggplot(data=data.frame(x=x,y=y),aes(x=x,y=y))+geom_point()+stat_function(fun=function(x) -0.1395*x+5.2350)
If we assume, as we have, (a) that the values of the draws are a function of the associated x value and (b) that this function must be linear, and (c) that the draws come from a normal distribution, and (d) that the standard deviation of that normal distribution is independent of the associated x value, then the line we have found is the line that would maximize the likelihood of the draws. It is the “best fit.”
Take a good look again at the plot of sampleproblm over a and b. If you think about it for a while, you will recognize that this plot is just the log transform of a probability distribution. The parameters a and b must, after all, take on some values, and our plot just shows the (log of the) probability that they take on each of those values.
We are now done, in some sense, with the hard part. The theory behind regression is challenging the first time it is presented. The good news is that actually doing regression in R is easy. You might ask, then, why we went through the preceding exercise. The answer is that R (and other computer languages) make it too easy. You can use the functions as black boxes without thinking through what they are calculating and what the output really means. You can fail to recognize the restricted set of assumptions under which the statistical technique being used is valid. My hope is that the preceding material makes you aware of what is going on behind the scenes.
With those caveats in mind, here is how we would regress the data in R.
summary(lm(y~x,data=data.frame(x=x,y=y)))
##
## Call:
## lm(formula = y ~ x, data = data.frame(x = x, y = y))
##
## Residuals:
## 1 2 3 4 5
## 5.29 -1.64 -2.97 1.76 -2.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.235 2.181 2.40 0.096 .
## x -0.139 0.514 -0.27 0.804
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.02 on 3 degrees of freedom
## Multiple R-squared: 0.0239, Adjusted R-squared: -0.301
## F-statistic: 0.0735 on 1 and 3 DF, p-value: 0.804
Here’s how we read the report produced by R. The residuals are the differences between our sample values and the values predicted by our model \(a x + b\). The coefficients section is a table showing, for each parameter we want to estimate (here the intercept and slope of a line), our best estimate, the “standard error” around that estimate, the t-statistic of the value we obtained under the assumption that the true value of the parameter is zero, and the probability that the absolute value of that t-statistic would be as great as it is if the true value of the parameter were zero. In our case, there is only a 9.6% chance that the intercept and its associated t-statistic would be as far from 0 as they are if the true intercept were 0. There is, however, an 80.4% chance that the slope and its associated t-statistic would be as far from 0 as they are if the true slope were 0.
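As a quick sanity check (a sketch, not part of the original output), you can recompute those residuals by hand and confirm they match what lm reports:
fit <- lm(y ~ x, data = data.frame(x = x, y = y))
# The residuals are the observed y-values minus the fitted values a*x + b.
y - predict(fit)   # computed by hand
residuals(fit)     # the same numbers lm reports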
The next part of the report I want you to focus on deals with the “Multiple R Squared,” the “Adjusted R Squared,” and the F-statistic. This part of the report focuses not on the individual coefficients in the model (slope and intercept) but on whether the model, taken as a whole, is particularly predictive. Here, we have a miserable model. Either the relationship isn’t linear, there are a lot of other variables we aren’t aware of that do predict y, or it’s all just highly random. If we were to calculate the variance of our sample by taking the mean of the squared deviations from the sample mean, and then calculate it again using the squared residuals after taking the x-values into account, we would find that we have reduced the variance by only about 2.4%. And it’s even worse than that. It turns out that you will almost always be able to reduce the variance by using a linear model even if there is basically no relationship between the x-value and the y-value; the regression procedure will tune itself to the noise in the data. The “Adjusted R Squared” is a (crude) attempt to take this phenomenon into account. And a negative Adjusted R Squared basically means our model is garbage.
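To make that 2.4% figure concrete, here is a sketch of the Multiple R Squared computed by hand as the proportional reduction in the mean squared deviation:
fit <- lm(y ~ x, data = data.frame(x = x, y = y))
# Variance of the sample around its own mean, ignoring x entirely...
v0 <- mean((y - mean(y))^2)
# ...versus the mean squared residual once x is taken into account.
v1 <- mean(residuals(fit)^2)
1 - v1/v0   # about 0.024, matching the Multiple R-squared above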
The final part of the report deals with an “F Statistic.” We will not discuss the F Statistic here much more than to say that it is another measure of how well the model does, overall, in fitting the data.
Our purely linear model is lousy. But suppose we were to assume not that the mean of the distribution was a linear function of the x-value but rather that it was a quadratic function. Here’s how we would write that in R. The only thing that has changed is that we have added an I(x^2) term to the model. You might well wonder what the purpose of the I function is. The short answer is that operators such as ^ have a special meaning inside a model formula (they describe how terms are crossed), so writing x^2 on its own would not give us a squared predictor. I stands for “Inhibit,” and tells R to leave the expression alone and evaluate it as ordinary arithmetic.
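To see concretely what I() inhibits, here is a quick sketch comparing the fitted coefficients with and without it; only the I() version actually includes a squared term:
# In formula notation ^ means term crossing, not exponentiation, so the
# bare x^2 below collapses and R quietly fits the same model as y ~ x.
coef(lm(y ~ x + x^2,    data = data.frame(x = x, y = y)))
coef(lm(y ~ x + I(x^2), data = data.frame(x = x, y = y)))
With that out of the way, here is the quadratic regression itself.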
summary(lm(y~x+I(x^2),data=data.frame(x=x,y=y)))
##
## Call:
## lm(formula = y ~ x + I(x^2), data = data.frame(x = x, y = y))
##
## Residuals:
## 1 2 3 4 5
## 2.064 -3.312 -0.976 1.152 1.072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.905 1.901 3.63 0.068 .
## x 1.761 1.132 1.56 0.260
## I(x^2) -0.346 0.193 -1.79 0.215
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.05 on 2 degrees of freedom
## Multiple R-squared: 0.625, Adjusted R-squared: 0.25
## F-statistic: 1.67 on 2 and 2 DF, p-value: 0.375
ggplot(data=data.frame(x=x,y=y),aes(x=x,y=y))+geom_point()+
stat_function(fun=function(x) 6.905+1.761*x-0.346*x^2)
This model works somewhat better. Our R Squared has jumped to 0.625, meaning that this model reduces the variance of our sample by 62.5%. Our Adjusted R Squared now stands at 0.25, which means that even after taking into account the fact that we are partly training to noise, our model appears to be accomplishing something.
By the way, even though we have added a quadratic term to our model, this is still considered a form of “linear regression.” This is so because the prediction is still a linear combination of the estimated coefficients: each coefficient simply multiplies one term (here x and x squared) and the products are summed. If our model were something like \(\log \left(a x+b x^2\right)\), we would not be able to use lm.
Thus far, we have looked at a data set that had only one “independent variable.” We could do various things with it, such as square it, but ultimately, there was only one value – x – on which the values in our sample depended. This need not be the case, however. There are often situations in which the values in the sample are associated with multiple independent variables.
Let’s load a new data set. The independent variables are the limit of the insurance policy involved and the “reserve,” the amount the insurer initially thought it might have to pay in the case. The dependent variable is the “result,” the size of the judgment or settlement that ultimately resulted in the case.
head(litigation<-read.csv("/Users/sethchandler/Dropbox/Courses/Analytic Methods/Course Materials/litigation.csv",stringsAsFactors=FALSE))
## X policylimits reserve result
## 1 5047 25000 2500 12500
## 2 6313 100000 5000 12500
## 3 3802 20000 2500 18000
## 4 6151 500000 150000 115257
## 5 1140 1000000 25000 600000
## 6 3029 25000 12500 17000
Let’s do a linear regression, now, of the settlement amount on the policy limit and the initial amount the insurer thought it would have to pay, the “reserve.” Our syntax is very similar to what we used for the quadratic model above.
summary(lm(result~policylimits+reserve,data=litigation))
##
## Call:
## lm(formula = result ~ policylimits + reserve, data = litigation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -177706 -27378 3884 15080 380501
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.25e+03 1.43e+04 -0.51 0.614
## policylimits 2.08e-01 3.76e-02 5.53 1.4e-06 ***
## reserve 7.46e-01 3.83e-01 1.95 0.057 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70400 on 47 degrees of freedom
## Multiple R-squared: 0.443, Adjusted R-squared: 0.419
## F-statistic: 18.7 on 2 and 47 DF, p-value: 1.06e-06
One reads the results of this regression precisely the same way one reads the results of the simple univariate regression above. Every dollar of policy limits appears to add 20.8 cents to the ultimate judgment, and insurers are doing a tolerable job reserving claims: every dollar they reserve correlates with a 74.6 cent increase in the result. We now see three asterisks following the p-value of the policylimits variable. According to the key provided in the output summary, and following a convention that one often sees in statistics, the three asterisks mean the result is highly significant from a statistical perspective. There is less than a one in a thousand chance that, if policylimits had no effect, we would see a coefficient as large as the one we see in front of the policylimits variable. The output also shows that our model is successful in accounting for 44.3% of the variance in the result data. Even with an adjustment, we are still accounting for over 41% of the variation in the result data.
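One natural use of such a model is prediction. Here is a sketch (the case below is hypothetical, with values of my own choosing) of asking the fitted model what result to expect for a given policy limit and reserve:
fit <- lm(result ~ policylimits + reserve, data = litigation)
# A hypothetical case: a $100,000 policy limit and a $25,000 reserve.
predict(fit, newdata = data.frame(policylimits = 100000, reserve = 25000))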
Some of you may have heard about something called Ordinary Least Squares, or OLS, in connection with regression. And you may be wondering what the relationship is between OLS and the “maximum likelihood” method that I have indicated is behind regression. Let me try to explain it. Suppose I try to represent some data with a formula. I could then take the differences between the predictions the formula makes from the independent variables and the actual values of the dependent variable. I could take those differences, square them, and add them up. What I could then do is find the formula that minimized the sum of these squared “residuals.” Now, obviously, if I could pick any formula in the universe, I would just pick a formula that went exactly through each of the data points. (There always is one.) But that is a form of “cheating.” Instead, I might constrain my formula. It must be of the form \(a x + b\). Or it must be of the form \(a x + b y + c\). I would then find the values of a and b (or the values of a, b, and c in the second example) that minimized the squared residuals. Such a formula would represent the least squares estimate of the data.
Now, as it happens, under a very restricted set of circumstances, the result given by this OLS method is the same as the result given from maximum likelihood. And so, because OLS is a lot easier to calculate than maximum likelihood, for a long time, OLS was emphasized. But, in my humble opinion, this is a bad idea. There are lots of occasions under which OLS will not give the right answer. It has very little going for it on a principled basis. It is still sometimes useful as an approximation, but I would prefer that you think about the maximum likelihood idea in understanding what is going on behind the scenes in a regression.
To see the connection concretely, here is a function that computes the sum of squared residuals for our small data set given an arbitrary prediction function f; minimizing it over linear functions recovers exactly the slope and intercept that maximum likelihood gave us.
squaredResiduals<- function(data,f) sum((data$y-f(data$x))^2)
squaredResiduals(data.frame(x=x,y=y),function(x) -0.13*x +5)
## [1] 48.77
Now constrain the formula to a line, with params[1] as the slope and params[2] as the intercept, and let optim minimize the squared residuals.
linearmodel<-function (params) squaredResiduals(data.frame(x=x,y=y),function(x) params[1]*x+params[2])
linearmodel(c(0.4,1.7))
## [1] 91.44
optim(c(1,2),linearmodel)
## $par
## [1] -0.1395 5.2350
##
## $value
## [1] 48.54
##
## $counts
## function gradient
## 67 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
¹ I assume that optim is short for optimize.