In class we talked about diagnostic plots, transformations of variables, and outliers.
We covered the different diagnostic plots you get when you plot a fitted linear model. We had been making these plots by hand to check the assumptions of a linear model, but this way is much easier because it only takes one line of code. The first plot is residuals vs. fitted values. In it we look for a trend in the residuals (curvature suggests our mean function is missing something), for heteroscedasticity, and for outliers.
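As a minimal sketch in R (the data frame `dat` and the variables `y` and `x` are made up purely for illustration), fitting a model and pulling up that plot looks like:

```r
# Hypothetical data frame 'dat' with a response y and a predictor x
fit <- lm(y ~ x, data = dat)

# One line gives the standard diagnostic plots; the first panel is
# residuals vs. fitted values
par(mfrow = c(2, 2))
plot(fit)

# Or ask for just the residuals-vs-fitted plot on its own
par(mfrow = c(1, 1))
plot(fit, which = 1)
```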
Our second graph is a normal Q-Q plot. It compares the quantiles of the standardized residuals to the quantiles we would expect if the error terms were normally distributed, so it is used to check the normality assumption; since the extreme ends of the plot are naturally more variable, a few stray points in the tails carry less weight.
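Using the same hypothetical fit, the Q-Q plot can be pulled out on its own or built by hand from the standardized residuals:

```r
# Second panel of the default diagnostics: the normal Q-Q plot
plot(fit, which = 2)

# The same picture by hand
qqnorm(rstandard(fit))
qqline(rstandard(fit))
```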
The third graph is a scale-location plot, which plots the square root of the absolute standardized residuals against the fitted values. Again we are looking for a trend in the residuals: an upward trend means their spread grows with the fitted values. Taking the square root of the absolute residuals reduces their skewness, which the first graph does not handle as well.
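Again with the made-up fit from above, the scale-location plot is the third panel, or you can draw the underlying quantities yourself:

```r
# Third panel: sqrt(|standardized residuals|) against fitted values
plot(fit, which = 3)

# Roughly the same thing by hand
plot(fitted(fit), sqrt(abs(rstandard(fit))),
     xlab = "Fitted values",
     ylab = "sqrt(|standardized residuals|)")
```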
The fourth graph involves Cook's distance: it plots the standardized residuals against leverage with Cook's distance contour lines overlaid, and the goal is to get all of your points inside the two Cook's distance lines, since points outside them are highly influential.

Then we talked about transformations. There are a few common ways to transform the variables: we can take the square root of y, 1/y, or the log of y (and the same for x in all three cases). The square root of y helps with heteroscedasticity and can also help the mean function. log(y) requires y to be greater than 0, and 1/y requires y to be nonzero.
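Sticking with the same hypothetical fit and data, the fourth panel and the candidate transformations might be tried like this (the model formulas are only illustrative):

```r
# Fourth panel shown by plot(fit): standardized residuals vs. leverage,
# with the Cook's distance contour lines we want all points inside
plot(fit, which = 5)

# The raw Cook's distances, if you want to look at the numbers
cooks.distance(fit)

# Candidate transformations of the hypothetical response y
fit_sqrt <- lm(sqrt(y) ~ x, data = dat)    # can calm heteroscedasticity
fit_log  <- lm(log(y) ~ x, data = dat)     # requires y > 0
fit_inv  <- lm(I(1 / y) ~ x, data = dat)   # requires y != 0
```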
When looking at your data, you can try a few transformations if you have reason to believe the variables might follow that kind of relationship. In class we went through trying various transformations to see whether they gave our data better diagnostic plots.
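Checking whether a transformation helped is just a matter of re-running the same diagnostics on the new fit, for example with the log model from the sketch above:

```r
# Do the residual plots look healthier for the transformed model?
par(mfrow = c(2, 2))
plot(fit_log)
```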
Finally, we talked about outliers. Basically, you cannot just get rid of outliers; the only scenario in which you can simply delete one is when you know the point is a mistake. You can temporarily remove an outlier to see whether your fit improves, but you have to make sure to state in your analysis that you did so.
A data point is considered an outlier if the absolute value of the studentized residual is greater than 2.
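As a last sketch, still assuming the hypothetical fit and data frame from before, flagging and temporarily setting aside such points could look like this:

```r
# Rows whose studentized residuals exceed 2 in absolute value
out_idx <- which(abs(rstudent(fit)) > 2)
out_idx

# Temporarily refit without those rows (assuming out_idx is nonempty and
# no rows were dropped for missing values); if you do this, say so
# explicitly in your write-up
fit_no_out <- lm(y ~ x, data = dat[-out_idx, ])
summary(fit_no_out)
```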