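I have chosen the rock data set to work with. It contains measurements on 48 rock samples from a petroleum reservoir. This data set has four variables: area (total pore area, in pixels), peri (pore perimeter, in pixels), shape (perimeter divided by the square root of the area), and perm (permeability, in millidarcies).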
My independent variable is area and my dependent variable is permeability (perm). I want to see whether the total pore area of a rock sample is related to the sample’s permeability. I expected a strong, linear relationship between the two, but that does not appear to be the case.
plot(rock$area, rock$perm)
abline(lm(rock$perm ~ rock$area))
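To put a number on how weak the linear relationship looks in the plot, one option is the Pearson correlation coefficient. This is only a quick sketch; the name `r` is my own choice for illustration.

```r
# Pearson correlation between pore area and permeability.
# Values near -1 or 1 indicate a strong linear relationship;
# values near 0 indicate a weak one.
r <- cor(rock$area, rock$perm)
r
```

The sign of `r` always matches the sign of the fitted slope, so it also shows the direction of the relationship.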
The simple linear regression model I will use is: \[ y_i=\beta_0+\beta_1x_i+\epsilon_i \] where \(y_i\) is the observed value of my dependent variable rock.perm, \(x_i\) is my independent variable rock.area, \(\beta_0\) is my y-intercept, \(\beta_1\) is my slope, and \(\epsilon_i\) is the error term for each observation. The fitted value \(\hat{y_i}\) is the model's prediction at the estimated coefficients and carries no error term.
est <- lm(rock$perm ~ rock$area)
intercept <- est$coefficients[1]
slope <- est$coefficients[2]
est
##
## Call:
## lm(formula = rock$perm ~ rock$area)
##
## Coefficients:
## (Intercept) rock$area
## 880.5226 -0.0647
An intercept value of 880.5226 tells me that, going by this linear model, if a rock sample were to have a total pore area of 0 pixels, we would expect the permeability of the sample to be 880.5226 millidarcies. A rock in this case wouldn’t have any pores, so it is surprising that it would have such high permeability compared to other rocks in the sample. The slope is the more interesting parameter here. Its negative sign means that there is an inverse relationship between the two variables: as pore area goes up, permeability goes down instead of up. We would intuitively expect these two variables to have a positive relationship, but the data suggests the opposite.
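One way to check how literally the intercept should be read is to compare a pore area of 0 against the range of areas actually observed. The sketch below refits the model with the `data = rock` form of the formula (a change from the code above, made only so that `predict()` accepts new data); the names `fit`, `area_range`, and `pred_at_zero` are my own.

```r
# Refit using the data argument so predict() can take a newdata frame.
fit <- lm(perm ~ area, data = rock)

# Observed range of pore areas in the sample.
area_range <- range(rock$area)
area_range

# Prediction at area = 0, which is simply the intercept.
pred_at_zero <- predict(fit, newdata = data.frame(area = 0))
pred_at_zero
```

If 0 falls below the smallest observed area, the intercept is an extrapolation outside the data, and the model has no evidence about how permeability behaves there.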
\(\beta_1=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}\)
\(\beta_0=\bar{y}-\beta_1\bar{x}\)
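These two formulas can be recovered by minimizing the sum of squared residuals. A brief sketch of that derivation follows, using the same symbols as above; \(S\) is a name I am introducing here for the sum of squares.

\[ S(\beta_0,\beta_1)=\sum_{i=1}^n(y_i-\beta_0-\beta_1x_i)^2 \]

Setting the partial derivative with respect to \(\beta_0\) to zero gives the intercept:

\[ \frac{\partial S}{\partial\beta_0}=-2\sum_{i=1}^n(y_i-\beta_0-\beta_1x_i)=0 \quad\Rightarrow\quad \beta_0=\bar{y}-\beta_1\bar{x} \]

Setting the partial derivative with respect to \(\beta_1\) to zero and substituting the expression for \(\beta_0\) gives:

\[ \sum_{i=1}^n x_i\left((y_i-\bar{y})-\beta_1(x_i-\bar{x})\right)=0 \quad\Rightarrow\quad \beta_1=\frac{\sum_{i=1}^n x_i(y_i-\bar{y})}{\sum_{i=1}^n x_i(x_i-\bar{x})} \]

Because \(\sum_{i=1}^n(y_i-\bar{y})=0\) and \(\sum_{i=1}^n(x_i-\bar{x})=0\), replacing \(x_i\) with \(x_i-\bar{x}\) in the numerator and the denominator changes neither sum, which yields the slope formula above.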
x_bar <- mean(rock$area)
y_bar <- mean(rock$perm)
beta_1 <- sum((rock$area-x_bar)*(rock$perm-y_bar))/sum((rock$area-x_bar)^2)
beta_0 <- y_bar - beta_1 * x_bar
c(beta_0, beta_1)
## [1] 880.52257985 -0.06470369
est$coefficients
## (Intercept) rock$area
## 880.52257985 -0.06470369
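Rather than comparing the printed digits by eye, the two results can be compared numerically. This is a small sketch that recomputes both so it runs on its own; `fit`, `b0`, and `b1` are names I chose for illustration.

```r
# Recompute both sets of coefficients so this snippet is self-contained.
fit <- lm(perm ~ area, data = rock)
x <- rock$area
y <- rock$perm
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# unname() drops the coefficient labels so only the numbers are compared.
# all.equal() allows for tiny floating-point differences.
same <- isTRUE(all.equal(unname(coef(fit)), c(b0, b1)))
same
```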
One type of linear model that can be fit to a data set is the ordinary least squares (OLS) regression line. The conditions under which the OLS estimator is the best linear unbiased estimator (BLUE) are called the Gauss-Markov assumptions. If they are met, then the OLS coefficient estimates have the smallest variance of any linear unbiased estimator. With real data it is rarely possible to know for certain that all of the criteria are met, which is why these are assumptions made about the data rather than rigid conditions that must be proven.
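To make "best" concrete, here is a small simulation sketch comparing the OLS slope with another linear unbiased estimator of the slope: the line through only the first and last points. Everything in it is made up for illustration (the sample size, the true coefficients, the error standard deviation, and the names `one_draw`, `sims`, `est_mean`, and `est_var`); it does not use the rock data.

```r
# Simulated data only: all constants here are illustrative assumptions.
set.seed(1)
n <- 30
x <- seq(1, 10, length.out = n)   # fixed design points
beta0 <- 2
beta1 <- 0.5

one_draw <- function() {
  # Errors with mean zero, constant variance, and no correlation.
  y <- beta0 + beta1 * x + rnorm(n, sd = 1)
  # OLS slope, from the usual formula.
  ols <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
  # Alternative linear unbiased estimator: slope through the two end points.
  endpoints <- (y[n] - y[1]) / (x[n] - x[1])
  c(ols = ols, endpoints = endpoints)
}

sims <- replicate(5000, one_draw())   # 2 x 5000 matrix of slope estimates

est_mean <- rowMeans(sims)       # both should be close to beta1
est_var <- apply(sims, 1, var)   # OLS should have the smaller variance
est_mean
est_var
```

Both estimators average out near the true slope, but the OLS estimates should be spread much more tightly around it. That is the sense in which OLS is "best" under these assumptions.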
Unsurprisingly, the first assumption is that the relationship between the variables should be linear. If the data follows a curve, a straight line will not be the best fit. The next condition is that the residuals should be nearly normal in their distribution. Strictly speaking, the Gauss-Markov theorem does not require normality (it requires only that the errors have mean zero), but nearly normal residuals are needed for the usual confidence intervals and hypothesis tests, so the condition is commonly checked alongside the others. While the regression will likely not predict every data point perfectly, outliers (points much farther from the line than the rest) should be minimal, ideally nonexistent. The third assumption is that the variability of points around the line should be roughly constant; that is, the line should not clearly become a better or worse predictor as the independent variable increases. Lastly, the observations should be independent and randomly sampled. If there is an underlying structure to the data, such as a time series, or if the samples were chosen for specific reasons, then the errors may be correlated and the sample may not reflect the population.
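These conditions are usually checked visually. The sketch below shows two common diagnostic plots for a fitted model, using the rock data as the example. The object name `fit` and the axis labels are my own choices, and the plots are an informal check rather than a formal test.

```r
fit <- lm(perm ~ area, data = rock)

op <- par(mfrow = c(1, 2))   # draw two plots side by side

# Residuals vs. fitted values: look for curvature (nonlinearity)
# or a funnel shape (non-constant variance).
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest
# non-normal residuals or outliers.
qqnorm(resid(fit))
qqline(resid(fit))

par(op)   # restore the previous plotting layout
```

Independence cannot be read off these plots; it has to be judged from how the samples were collected.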
The data set I chose earlier in the exercise is a poor candidate for an OLS model. While I could have looked at the R-squared value to see this, I can instead compare it to the above assumptions. The data is not clearly linear, there do appear to be outliers in my data, and variability is clearly not constant. While OLS might still be a good fit (though not BLUE) if only one of the assumptions had failed, it is clear that OLS was not a good model for the rock data set.
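For completeness, the R-squared value mentioned above can be pulled from the model summary. This is a minimal sketch; `fit` and `r2` are names I picked for illustration.

```r
fit <- lm(perm ~ area, data = rock)

# Proportion of the variance in perm explained by the linear model.
r2 <- summary(fit)$r.squared
r2
```

A value close to 0 means the line explains little of the variation in permeability, which is consistent with the visual impression from the scatterplot.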