Previously we looked at the 4 pairs of x and y variables in the anscombe dataset. We saw that although the variables were very similar to each other in terms of some statistics, such as the means, and the regression results were all identical, the plotted data looked quite different.
Remember you can see the anscombe data by typing View(anscombe) in the console.
Run the code below to review the plots.
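The plotting code itself is not repeated here, but as a rough sketch of one way it could be done (assuming the ggplot2 package, which the original exercise may or may not have used), the first pair can be drawn with its OLS line and standard-error ribbon like this; the other three pairs work the same way with x2/y2, x3/y3, and x4/y4.

# Sketch: scatterplot of x1 and y1 with the fitted OLS line and its ribbon
library(ggplot2)
ggplot(anscombe, aes(x = x1, y = y1)) +
  geom_point() +
  geom_smooth(method = "lm")   # straight line plus the shaded standard-error ribbon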
I believe graph 3 accurately describes the relationship between x and y because its points are the closest to the OLS regression line.
Just use words to say what you see. You can mention individual points if it makes sense to do so.
For the other graphs I would say the relationship is weaker. Graph 1: some of the points touch the regression line and the rest are near it. Graph 2: the points do not stay within the ribbon. Graph 4: the points are on top of each other and are not on the regression line.
Ordinary least squares regression is designed to estimate a straight line that has the “best” fit to the data. But what does “best” fit mean? In OLS, “best” means the line that makes the sum of the squared errors (the squared vertical distances between the points and the line) as small as possible. We have already seen that sometimes the regression line is not really the most accurate way to summarize data.
Remember that all of the regression results for the four pairs were the same. Let’s just look at the summary for the first result to remind us of this. We’ll also create the 3 other results objects.
results1<-lm(y1 ~ x1, data=anscombe)
summary(results1)
##
## Call:
## lm(formula = y1 ~ x1, data = anscombe)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## x1 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
results2<-lm(y2 ~ x2, data=anscombe)
results3<-lm(y3 ~ x3 , data=anscombe)
results4<-lm(y4 ~ x4, data=anscombe)
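As a quick illustrative check (not part of the original code), you can stack the four sets of coefficient estimates next to each other to see that they really are nearly identical:

# Intercept and slope from each of the four models; every column
# should be roughly 3.0 and 0.5
sapply(list(results1, results2, results3, results4), coef)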
The two things we want to look closely at here are the coefficient estimates and the Multiple R-Squared values.
The Multiple R-squared is 0.6665.
The multiple R squared represents how much of the variation in the dependent variable is “explained by” the independent variable. A 0 would mean none, a 1 would mean all. This is a proportion, so it has to be between 0 and 1.
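As an illustration of that proportion (this is just a check, not something the exercise asks for), the Multiple R-squared can be recomputed by hand from the residuals of the first model:

# R-squared = 1 - (sum of squared errors) / (total sum of squares of y)
ss_error <- sum(resid(results1)^2)
ss_total <- sum((anscombe$y1 - mean(anscombe$y1))^2)
1 - ss_error / ss_total   # should come out near the 0.6665 shown in the summary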
The coefficients represent a regression equation:
predicted(y) = 3.0001 + 0.5001*x
Use R as a calculator below to calculate the predicted(y) for 0, 8, 9, 19 and a value of your choice.
3+.5*0
## [1] 3
3+.5*8
## [1] 7
3+.5*9
## [1] 7.5
3+.5*19
## [1] 12.5
The predicted values all fall on a line, and x can be any value I choose.
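Instead of doing the arithmetic by hand, here is a sketch using R’s built-in predict() function; note that the new data frame has to use the variable name x1 so it matches the model formula.

# Predicted y for x1 = 0, 8, 9, and 19, calculated from the fitted model
predict(results1, newdata = data.frame(x1 = c(0, 8, 9, 19)))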
Fortunately R will calculate the predicted values of y for each observation. These are found in fitted.values(lm_results_object). Fitted values and predicted values mean the same thing.
It will also calculate the actual y value minus the predicted value. These are called either residuals or errors. These are found in resid(lm_results_object). Let’s get all 4 sets of actual x, actual y, predicted and residual.
# Results 1 (x1, y1)
fitdata1 <-data.frame(x=anscombe$x1, y=anscombe$y1, predicted = fitted.values(results1), error = resid(results1))
# To make the results a bit easier to read, arrange them by the size of x.
dplyr::arrange(fitdata1, x)
## x y predicted error
## 1 4 4.26 5.000455 -0.74045455
## 2 5 5.68 5.500545 0.17945455
## 3 6 7.24 6.000636 1.23936364
## 4 7 4.82 6.500727 -1.68072727
## 5 8 6.95 7.000818 -0.05081818
## 6 9 8.81 7.500909 1.30909091
## 7 10 8.04 8.001000 0.03900000
## 8 11 8.33 8.501091 -0.17109091
## 9 12 10.84 9.001182 1.83881818
## 10 13 7.58 9.501273 -1.92127273
## 11 14 9.96 10.001364 -0.04136364
# Results 2 (x2, y2)
fitdata2 <- data.frame(x=anscombe$x2, y=anscombe$y2, predicted = fitted.values(results2), error = resid(results2))
# To make the results a bit easier to read, arrange them by the size of x.
dplyr::arrange(fitdata2, x)
## x y predicted error
## 1 4 3.10 5.000909 -1.9009091
## 2 5 4.74 5.500909 -0.7609091
## 3 6 6.13 6.000909 0.1290909
## 4 7 7.26 6.500909 0.7590909
## 5 8 8.14 7.000909 1.1390909
## 6 9 8.77 7.500909 1.2690909
## 7 10 9.14 8.000909 1.1390909
## 8 11 9.26 8.500909 0.7590909
## 9 12 9.13 9.000909 0.1290909
## 10 13 8.74 9.500909 -0.7609091
## 11 14 8.10 10.000909 -1.9009091
# Results 3 (x3, y3)
fitdata3 <- data.frame(x=anscombe$x3, y=anscombe$y3, predicted = fitted.values(results3), error = resid(results3))
# To make the results a bit easier to read, arrange them by the size of x.
dplyr::arrange(fitdata3, x)
## x y predicted error
## 1 4 5.39 5.001364 0.38863636
## 2 5 5.73 5.501091 0.22890909
## 3 6 6.08 6.000818 0.07918182
## 4 7 6.42 6.500545 -0.08054545
## 5 8 6.77 7.000273 -0.23027273
## 6 9 7.11 7.500000 -0.39000000
## 7 10 7.46 7.999727 -0.53972727
## 8 11 7.81 8.499455 -0.68945455
## 9 12 8.15 8.999182 -0.84918182
## 10 13 12.74 9.498909 3.24109091
## 11 14 8.84 9.998636 -1.15863636
# Results 4 (x4, y4)
fitdata4 <- data.frame(x=anscombe$x4, y=anscombe$y4, predicted = fitted.values(results4), error = resid(results4))
# To make the results a bit easier to read, arrange them by the size of x.
dplyr::arrange(fitdata4, x)
## x y predicted error
## 1 8 6.58 7.001 -4.210000e-01
## 2 8 5.76 7.001 -1.241000e+00
## 3 8 7.71 7.001 7.090000e-01
## 4 8 8.84 7.001 1.839000e+00
## 5 8 8.47 7.001 1.469000e+00
## 6 8 7.04 7.001 3.900000e-02
## 7 8 5.25 7.001 -1.751000e+00
## 8 8 5.56 7.001 -1.441000e+00
## 9 8 7.91 7.001 9.090000e-01
## 10 8 6.89 7.001 -1.110000e-01
## 11 19 12.50 12.500 -1.526557e-16
Yes, it would be the same.
No, their actual y values are not the same.
The table shows the actual y value, the predicted value, and the error.
It turns out that one way to tell if a regression line is an appropriate approach for your data is to look for patterns or strange values, such as
extremely large values, in the errors. If there are patterns then the regression model you have used probably does not make sense. Errors should be evenly (and randomly) distributed around the regression line.
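One common way to look for such patterns (a sketch, not required by the exercise) is to plot the residuals against the fitted values; a model that fits well shows an unstructured cloud of points scattered around zero.

# Residuals versus fitted values for the first model
plot(fitted.values(results1), resid(results1),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # dashed reference line at zero error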
I see a large residual in data 3: the residual for x = 13 is 3.24109091. In data 4 there are also some fairly large residuals, such as 1.839 and -1.751.
Now we want to look closely at the last 3 models, since the first one looks fine.
For x4 and y4, an outlier could cause the unusual value; because it is the only observation with a different x value, its predicted value comes out equal to its actual value.

### What is the row number of the unusual value?

The unusual value is in row 8 (x4 = 19, y4 = 12.50). Without it there would be no outlier, but all of the remaining points would be on top of each other.
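Since almost every value of x4 is 8, one illustrative way to locate the odd observation in code (assuming the unusual point is the single one with a different x) is:

# Row number of the observation whose x4 is not 8, and its x and y values
which(anscombe$x4 != 8)
anscombe[anscombe$x4 != 8, c("x4", "y4")]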
# Create a new data set without observation 8.
no_obs_8 <- anscombe[-8,]
results4<-lm(y4 ~ x4, data=no_obs_8)
# Get the summary
summary(results4)
##
## Call:
## lm(formula = y4 ~ x4, data = no_obs_8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -1.036 -0.036 0.859 1.839
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0010 0.3908 17.92 2.39e-08 ***
## x4 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
You might want to also try some of the techniques we used earlier, such as graphing or looking at the errors.
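For example, a quick sketch of graphing the x4/y4 data with observation 8 removed shows why the slope cannot be estimated:

# All of the remaining x4 values are 8, so the points stack in a single column
plot(no_obs_8$x4, no_obs_8$y4, xlab = "x4", ylab = "y4")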
x4 has the results it has because, without observation 8, all of the x values are the same, so you cannot fit a line (the slope is not defined).
There is not a right answer, only well thought out answers.
No, because then you cannot fit a line at all.
Now let’s look at the x3, y3 results.
The outlier is observation 3, where x3 = 13 and the outlier value is y3 = 12.74.
Write the code to see what happens if we drop the outlier.
# Create a new data set without observation 3.
no_obs_3 <- anscombe[-3,]
results3<-lm(y3 ~ x3, data=no_obs_3)
summary(results3)
##
## Call:
## lm(formula = y3 ~ x3, data = no_obs_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0041558 -0.0022240 0.0000649 0.0018182 0.0050649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0056494 0.0029242 1370 <2e-16 ***
## x3 0.3453896 0.0003206 1077 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003082 on 8 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.161e+06 on 1 and 8 DF, p-value: < 2.2e-16
The intercept increased, the slope decreased, and the Multiple R-squared is now equal to 1.
You might want to also try some of the techniques we used earlier, such as graphing or looking at the errors.
I think it means that 100 percent of the variation in y3 is explained by x3.
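Following the graphing suggestion above, one sketch for checking the no-outlier fit is to plot the remaining points with the new line and to look at the residuals, which are now tiny:

# The remaining ten points lie almost exactly on the fitted line
plot(no_obs_3$x3, no_obs_3$y3, xlab = "x3", ylab = "y3")
abline(results3)
# Residuals are all within roughly +/- 0.005, which is why R-squared rounds to 1
resid(results3)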
Another way to approach the x3, y3 data would be to make a dichotomous variable representing observation 3 and add that to the regression.
anscombe$obs3<-anscombe$y3 == 12.74
# Add it to the model using a +.
results3<-lm(y3 ~ x3 + obs3, data=anscombe)
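A sketch of how you might inspect this version (output not shown here): because the dummy variable fits observation 3 exactly, the x3 coefficient is estimated from the other ten points, so it should look much like the slope from the no-outlier model above.

# x3 gives the slope without the outlier's influence; obs3TRUE captures
# how far observation 3 sits from that line
summary(results3)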
There is not a right answer, only well thought out answers.
Finally, let’s look at the x2, y2 results.
Really try to remember!
One kind of curved line in algebra is a parabola, which is a function with a squared value of x. (Google parabola if you need to.)
Let’s try a model with a squared term. We add it using a + sign.
anscombe$x2squared<- anscombe$x2^2
results2<-lm(y2 ~ x2 + x2squared, data=anscombe)
You will definitely want to look at the predicted values and the errors. You may also want to plot x and predicted.
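For instance, a sketch of collecting the fitted values and errors for the squared-term model, and then plotting x2 against both the actual and predicted values (the predictions should now follow the curve of the data closely):

# Fitted values and residuals for the model with the squared term
fitdata2 <- data.frame(x = anscombe$x2, y = anscombe$y2,
                       predicted = fitted.values(results2),
                       error = resid(results2))
dplyr::arrange(fitdata2, x)

# Plot the actual points (open circles) and the predicted values (filled circles)
plot(anscombe$x2, anscombe$y2, xlab = "x2", ylab = "y2")
points(anscombe$x2, fitted.values(results2), pch = 19)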
We should take two lessons from this. First, just because you can do something does not mean you should do it. A regression will run for all kinds of data but that does not mean it is right.
Second, always look at your data graphically to help decide whether a regression model makes sense and to spot problems such as outliers. Looking at the residuals can give you the same information, especially as your models get more complex.
As a reader of regression results you should always ask whether the author has really investigated if there are any such issues in their data.