Previously we looked at the four pairs of x and y variables in the anscombe dataset. We saw that although the pairs were very similar in terms of some statistics, such as their means, and the regression results were all identical, the plotted data looked quite different.

Remember you can see the anscombe data by typing View(anscombe) in the console.

Run the code below to review the plots.
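The worksheet's plotting chunk is not reproduced here; as a stand-in, a minimal sketch (assuming the ggplot2 package is installed, and not necessarily the code the worksheet used) might look like this:

# Plot the first pair with its OLS line and confidence ribbon;
# repeat with x2/y2, x3/y3 and x4/y4 to see all four graphs.
library(ggplot2)
ggplot(anscombe, aes(x = x1, y = y1)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)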

Looking at these graphs, in which graph does the straight line, known as the OLS regression line, accurately describe the relationship between x and y?

I believe graph 3 accurately describes the relationship between x and y because the plotted points are the closest to the OLS regression line.

For the other three graphs, how would you describe x, y and the relationship between x and y?

Just use words to say what you see. You can mention individual points if it makes sense to do so.

For the other graphs I would say the relationship is weaker. Graph 1: some of the points touch the regression line but others are only near it. Graph 2: the points do not stay within the confidence ribbon. Graph 4: the points are stacked on top of each other but are not on the regression line.

Ordinary least squares regression is designed to estimate a straight line that has the “best” fit to the data. But what does “best” fit mean? We have already seen that sometimes the regression line is not really the most accurate way to summarize data.

Remember that all of the regression results for the four pairs were the same. Let’s just look at the summary for the first result to remind us of this. We’ll also create the 3 other results objects.

results1<-lm(y1 ~ x1, data=anscombe)
summary(results1)
## 
## Call:
## lm(formula = y1 ~ x1, data = anscombe)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
results2<-lm(y2 ~ x2, data=anscombe)
results3<-lm(y3 ~ x3 , data=anscombe)
results4<-lm(y4 ~ x4, data=anscombe)

The two things we want to look closely at here are the coefficient estimates and the Multiple R-Squared values.

What is the value of the multiple R-squared?

0.6665

The multiple R squared represents how much of the variation in the dependent variable is “explained by” the independent variable. A 0 would mean none, a 1 would mean all. This is a proportion, so it has to be between 0 and 1.
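One illustrative check, not part of the original worksheet: for a simple regression like this one, the multiple R-squared is the squared correlation between the actual and fitted values of y.

# Squared correlation between actual y1 and the fitted values (should match Multiple R-squared).
cor(anscombe$y1, fitted.values(results1))^2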

The coefficients represent a regression equation that is

predicted(y) = 3.0001 + 0.5001*x

Find where the coefficients are in the summary. Where are they?

Use R as a calculator below to calculate the predicted(y) for 0, 8, 9, 19 and a value of your choice.

3+.5*0
## [1] 3
3+.5*8
## [1] 7
3+.5*9
## [1] 7.5
3+.5*19
## [1] 12.5
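An optional alternative, not required by the worksheet: pull the coefficients out of the results object, or let predict() do the arithmetic, instead of typing the numbers by hand.

# Use the stored coefficients rather than retyping them (b is just an illustrative name).
b <- coef(results1)
b[1] + b[2] * c(0, 8, 9, 19)
# Or have R compute the predictions directly.
predict(results1, newdata = data.frame(x1 = c(0, 8, 9, 19)))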

What does the .5001 say about the relationship of x and predicted(y)?

It says that for every one-unit increase in x, predicted(y) increases by about 0.5; the slope tells you how much the prediction changes when x changes.

Fortunately R will calculate the predicted values of y for each observation. These are found in fitted.values(lm_results_object). Fitted values and predicted values mean the same thing.

It will also calculate the actual y value minus the predicted value. These are called either residuals or errors. These are found in resid(lm_results_object). Let’s get all 4 sets of actual x, actual y, predicted and residual.

# Results 1 (x1, y1)
fitdata1 <-data.frame(x=anscombe$x1, y=anscombe$y1,  predicted = fitted.values(results1), error = resid(results1))
# To make the results a bit easier to read, arrange them by the size of x.
dplyr::arrange(fitdata1, x)
##     x     y predicted       error
## 1   4  4.26  5.000455 -0.74045455
## 2   5  5.68  5.500545  0.17945455
## 3   6  7.24  6.000636  1.23936364
## 4   7  4.82  6.500727 -1.68072727
## 5   8  6.95  7.000818 -0.05081818
## 6   9  8.81  7.500909  1.30909091
## 7  10  8.04  8.001000  0.03900000
## 8  11  8.33  8.501091 -0.17109091
## 9  12 10.84  9.001182  1.83881818
## 10 13  7.58  9.501273 -1.92127273
## 11 14  9.96 10.001364 -0.04136364
# Results 2 (x2, y2)
fitdata2 <-data.frame(x=anscombe$x2, y=anscombe$y2,  predicted = fitted.values(results2), error = resid(results2))
# To make the results a bit easier to read, arrange them by the size of x.
dplyr::arrange(fitdata2, x)
##     x    y predicted      error
## 1   4 3.10  5.000909 -1.9009091
## 2   5 4.74  5.500909 -0.7609091
## 3   6 6.13  6.000909  0.1290909
## 4   7 7.26  6.500909  0.7590909
## 5   8 8.14  7.000909  1.1390909
## 6   9 8.77  7.500909  1.2690909
## 7  10 9.14  8.000909  1.1390909
## 8  11 9.26  8.500909  0.7590909
## 9  12 9.13  9.000909  0.1290909
## 10 13 8.74  9.500909 -0.7609091
## 11 14 8.10 10.000909 -1.9009091
# Results 3 (x3, y3)
fitdata3 <-data.frame(x=anscombe$x3, y=anscombe$y3,  predicted = fitted.values(results3), error = resid(results3))
# To make the results a bit easier to read, arrange them by the size of x.
dplyr::arrange(fitdata3, x)
##     x     y predicted       error
## 1   4  5.39  5.001364  0.38863636
## 2   5  5.73  5.501091  0.22890909
## 3   6  6.08  6.000818  0.07918182
## 4   7  6.42  6.500545 -0.08054545
## 5   8  6.77  7.000273 -0.23027273
## 6   9  7.11  7.500000 -0.39000000
## 7  10  7.46  7.999727 -0.53972727
## 8  11  7.81  8.499455 -0.68945455
## 9  12  8.15  8.999182 -0.84918182
## 10 13 12.74  9.498909  3.24109091
## 11 14  8.84  9.998636 -1.15863636
# Results 4 (x4, y4)
fitdata4 <-data.frame(x=anscombe$x4, y=anscombe$y4,  predicted = fitted.values(results4), error = resid(results4))
# To make the results a bit easier to read, arrange them by the size of x.
dplyr::arrange(fitdata4, x)
##     x     y predicted         error
## 1   8  6.58     7.001 -4.210000e-01
## 2   8  5.76     7.001 -1.241000e+00
## 3   8  7.71     7.001  7.090000e-01
## 4   8  8.84     7.001  1.839000e+00
## 5   8  8.47     7.001  1.469000e+00
## 6   8  7.04     7.001  3.900000e-02
## 7   8  5.25     7.001 -1.751000e+00
## 8   8  5.56     7.001 -1.441000e+00
## 9   8  7.91     7.001  9.090000e-01
## 10  8  6.89     7.001 -1.110000e-01
## 11 19 12.50    12.500 -1.526557e-16

If two observations have the same value of x, does that mean their predicted(y) values will be the same?

Yes, their predicted(y) values would be the same.

If two observations have the same value of x, does that mean that their actual y values will be the same? You may want to look at results4 and the 4th graph to help answer this.

No, their actual y values are not necessarily the same.

How do the values of y, the predicted value and the error relate to each other?

Answer in both words and with an equation.

The actual y value equals the predicted value plus the error: y = predicted(y) + error.
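A quick way to confirm this in R (an illustrative check, not part of the worksheet):

# Actual y equals predicted plus error, up to floating-point rounding.
all.equal(anscombe$y1, as.numeric(fitted.values(results1) + resid(results1)))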

It turns out that one way to tell if a regression line is an appropriate approach for your data is to look for patterns or strange values, such as
extremely large values, in the errors. If there are patterns then the regression model you have used probably does not make sense. Errors should be evenly (and randomly) distributed around the regression line.

Just based on looking at the values of the errors, for which pairs do you see patterns in the errors? What are the patterns?

A pattern is anything that makes it easier for you to guess the size or sign of the error based on either the x or y values.

Just based on looking at the residuals, do you see any particularly large residuals? In which data?

I see a large residual in data 3: 3.24109091. In data 4, values such as 7.090000e-01 and 9.090000e-01 are written in scientific notation, so they are actually 0.709 and 0.909; the largest residuals there are about 1.84 and -1.75.

Deciding what to do

Now we want to look closely at the last 3 models, since the first one looks fine.

Looking at the last one (x4, y4), what could cause such an unusual value?

For x4 and y4, the unusual value could be caused by an outlier; because it is the only observation with a different x value, its predicted value equals its actual value.

What is the row number of the unusual value?

The unusual value is in row 8 of the data (x4 = 19, y4 = 12.50), which the code below drops.

What would happen to your regression if you just left that value out of the analysis?

There would be no outlier, but all the remaining points would be stacked on top of each other at the same x value.

# Create a new data set without observation 8.
no_obs_8 <- anscombe[-8,]

results4<-lm(y4 ~ x4, data=no_obs_8)
# Get the summary 

summary(results4)
## 
## Call:
## lm(formula = y4 ~ x4, data = no_obs_8)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -1.036 -0.036  0.859  1.839 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.0010     0.3908   17.92 2.39e-08 ***
## x4                NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom

What is the equation for your results? Why does x4 have the coefficient that it does?

You might want to also try some of the techniques we used earlier, such as graphing or looking at the errors.

x4 has the coefficient it does (NA) because, once observation 8 is dropped, all of the remaining x4 values are the same; with no variation in x you cannot fit a line, so the equation is just predicted(y) = 7.001.
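One quick way to see this (an optional check, not in the original worksheet):

# After dropping observation 8, every remaining x4 value is 8,
# so there is no variation left in x to estimate a slope from.
table(no_obs_8$x4)
var(no_obs_8$x4)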

Do you think it is valid to drop an outlier from the analysis in this case?

There is not a right answer, only well thought out answers.

No, because without that observation you cannot fit a line at all.

Now let’s look at the x3, y3 results.

What is the observation number and value of the outlier?

The observation number is 3 (the row where x3 = 13) and the value of the outlier is 12.74.

Write the code to see what happens if we drop the outlier.

# Create a new data set without observation 3.
no_obs_3 <- anscombe[-3,]

results3<-lm(y3 ~ x3, data=no_obs_3)
summary(results3)
## 
## Call:
## lm(formula = y3 ~ x3, data = no_obs_3)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0041558 -0.0022240  0.0000649  0.0018182  0.0050649 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.0056494  0.0029242    1370   <2e-16 ***
## x3          0.3453896  0.0003206    1077   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003082 on 8 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.161e+06 on 1 and 8 DF,  p-value: < 2.2e-16

How did dropping the outlier change your results? Look at both the coefficients and the Multiple R-squared.

The intercept increased (from about 3.0 to 4.0), the slope decreased (from about 0.50 to 0.35), and the Multiple R-squared is now essentially 1.

You might want to also try some of the techniques we used earlier, such as graphing or looking at the errors.
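For example, you could look at the errors from the refitted model (a suggestion, not part of the original worksheet):

# Residuals from the fit without observation 3; they are all very small.
resid(results3)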

What do you think it means when the Multiple R-Squared is 1?

I think it means that 100% of the variation in y is explained by x; the points fall exactly on the regression line.

Another way to approach the x3, y3 data would be to make a dichotomous variable representing observation 3 and add that to the regression.

anscombe$obs3<-anscombe$y3 == 12.74
# Add it to the model using a +.
results3<-lm(y3 ~ x3 + obs3, data=anscombe)
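One way to inspect this new fit (a suggestion; the output is not shown in the worksheet):

# Coefficients of the model with the dichotomous term,
# and the predicted value for observation 3.
summary(results3)
fitted.values(results3)[3]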

How does this analysis change your results?

It gives me whole numbers and also negative numbers.

What is the equation for the line?

What is the predicted value for observation 3?

Do you think it is valid to drop an outlier from the analysis in this case, or is it better to add a dichotomous variable for the observation? Why?

There is not a right answer, only well thought out answers.

Finally, let’s look at the x2, y2 results.

What is the basic problem with using a straight line model for this pair?

Can you remember any kinds of lines or functions from algebra that were not straight lines? If so, what are they?

Really try to remember!

One kind of curved line in algebra is a parabola, which is the graph of a function with a squared value of x. (Google parabola if you need to.)

Let’s try a model with a squared term. We add it using a + sign.

anscombe$x2squared<- anscombe$x2^2
results2<-lm(y2 ~ x2 + x2squared, data=anscombe)

You will definitely want to look at the predicted values and the errors. You may also want to plot x and predicted.
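Here is a sketch of how you might do that (assuming ggplot2 is available; the object name fitdata2sq is just illustrative):

# Collect actual, predicted and error for the squared-term model.
fitdata2sq <- data.frame(x = anscombe$x2, y = anscombe$y2,
                         predicted = fitted.values(results2),
                         error = resid(results2))
dplyr::arrange(fitdata2sq, x)
# Plot the actual points and the predicted curve.
library(ggplot2)
ggplot(fitdata2sq, aes(x = x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = predicted))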

How did adding the squared term change your results?

Do you think it is legitimate to add a squared term?

Can you think of any set of variables that might have a relationship that is curved like this?

Conclusion

We should take two lessons from this. First, just because you can do something does not mean you should do it. A regression will run for all kinds of data, but that does not mean it is right.

Second, always look at your data graphically to help decide whether a regression model makes sense and to spot problems such as outliers. Looking at residuals can give you the same information, especially as your models get more complex.

As a reader of regression results, you should always ask whether the author has really investigated whether there are any such issues in their data.