Section 2.3 (continued) and Section 2.4

Example from last class: Fat Gain vs. Non-Exercise Activity (NEA)

Let’s load the data and look at the regression line again for the example from last class:

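library(mosaic)  # provides read.file(), makeFun(), plotFun(), mplot(), and the formula interface to cor()
fidget<-read.file("/home/emesekennedy/Data/Ch2/fidget.txt")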
## Reading data with read.table()
reg<-lm(Fat~NEA, data=fidget)
f<-makeFun(reg)
xyplot(Fat~NEA, data=fidget)

plotFun(f(NEA)~NEA, data=fidget, add=TRUE)

As the plot indicates, the relationship between Fat and NEA seems fairly linear with a negative association, so we would expect the correlation r to be negative and closer to -1 than to 0.

cor(Fat~NEA, data=fidget)
## [1] -0.7785558

The correlation is as we would expect. Now, let’s find the value of the correlation squared.

cor(Fat~NEA, data=fidget)^2
## [1] 0.6061492

This value indicates that approximately 61% of the variation in Fat is explained by the least-squares regression line. Now, let’s look at the residual plot.
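To see why the squared correlation can be read as the fraction of variation explained, here is a minimal sketch in base R. The data frame `toy` and its values are made up for illustration (they are not the fidget data). The idea is that 1 - SSE/SST computed from the fitted line should agree with the squared correlation.

```r
# Hypothetical toy data (not the fidget data), used only to illustrate the identity.
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(5.1, 4.2, 3.8, 2.9, 2.4, 1.1))

fit <- lm(y ~ x, data = toy)

# Total variation in y, and the variation left over after fitting the line.
sst <- sum((toy$y - mean(toy$y))^2)
sse <- sum(resid(fit)^2)

r_squared_from_fit <- 1 - sse / sst
r_squared_from_cor <- cor(toy$x, toy$y)^2

# The two values should agree up to rounding error.
print(c(from_fit = r_squared_from_fit, from_cor = r_squared_from_cor))
```

The same comparison should work for any simple linear regression with an intercept, including the fit of Fat on NEA.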

mplot(reg, which=1)
## [[1]]

As we would expect, the residuals look fairly scattered around the zero line with no specific pattern, and the residual values are fairly small. This means that the least-squares regression line captures the overall pattern of the data well.
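For readers who want to see what a residual plot contains, the following is a rough base R sketch. It uses a made-up data frame `toy` rather than the fidget data, and it uses `plot()` and `abline()` instead of `mplot()`, so the picture will look different from the one above. Each residual is simply the observed value minus the value predicted by the line.

```r
# Hypothetical toy data (not the fidget data).
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(5.1, 4.2, 3.8, 2.9, 2.4, 1.1))
fit <- lm(y ~ x, data = toy)

# A residual is the observed value minus the value predicted by the line.
res_by_hand <- toy$y - fitted(fit)

# Base R residual plot: residuals against fitted values, with a dashed zero line.
plot(fitted(fit), res_by_hand, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

For a least-squares line with an intercept, the residuals sum to zero, which is why they scatter around the zero line rather than around some other value.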

Example: Data Created by a Statistician Named Anscombe

Let’s load the data, and before we look at a plot of the data, let’s find the correlation and the value of the correlation squared between the variables xB and yB.

anscombe<-read.file("/home/emesekennedy/Data/Ch2/anscombe.txt")
## Reading data with read.table()
cor(yB~xB, data=anscombe)
## [1] 0.8162365
cor(yB~xB, data=anscombe)^2
## [1] 0.666242

The correlation and its square are both fairly close to 1, so we would expect the two variables to have a fairly strong linear relationship. Let’s plot the data, find the regression line, and add the regression line to the scatterplot.

xyplot(yB~xB, data=anscombe)

regB<-lm(yB~xB, data=anscombe)
fB<-makeFun(regB)
plotFun(fB(xB)~xB, data=anscombe, add=TRUE)

Looking at the plot, it seems clear that the relationship between xB and yB is curved rather than linear, so the least-squares regression line does not capture the overall pattern of the data. Let’s confirm this with the residual plot.

mplot(regB, which=1)
## [[1]]

As expected, the residuals have a clear pattern. Moral of this Example: Always look at the graph of your data and not just the numerical measures to get a full picture.
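The same point can be explored with the copy of Anscombe's data that ships with R in the `datasets` package, with columns `x1` to `x4` and `y1` to `y4`. This is a sketch under one assumption: the course file's `xB` and `yB` appear to correspond to `x2` and `y2`, since the correlation matches, but the course file itself is not checked here. All four pairs have a correlation close to 0.82, yet their scatterplots look very different.

```r
# Built-in Anscombe quartet from the datasets package (columns x1..x4, y1..y4).
# Writing datasets::anscombe avoids a clash with an object named anscombe
# that was read from a file earlier in the session.
quartet <- datasets::anscombe

# Correlation for each of the four x/y pairs.
r <- sapply(1:4, function(i) cor(quartet[[paste0("x", i)]],
                                 quartet[[paste0("y", i)]]))
names(r) <- paste0("set", 1:4)
print(round(r, 3))

# Four scatterplots in a 2-by-2 grid, each with its own least-squares line.
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- quartet[[paste0("x", i)]]
  y <- quartet[[paste0("y", i)]]
  plot(x, y, main = paste("Set", i))
  abline(lm(y ~ x))
}
par(old_par)
```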

Example: Diabetes

Refer to the handout from class for a step-by-step guide on how to analyze the least-squares regression and correlation for this example. Below are the commands that correspond to the problems on the handout.

Step 1

diabetes<-read.file("/home/emesekennedy/Data/Ch2/diabetes.txt")
## Reading data with read.table()

Step 2

xyplot(FPG~HbA1c, data=diabetes)

Step 3(a)

reg1<-lm(FPG~HbA1c, data=diabetes)

Step 3(b)

f1<-makeFun(reg1)

Step 3(c)

plotFun(f1(HbA1c)~HbA1c, data=diabetes, add=TRUE)

Step 3(d)

mplot(reg1, which=1)
## [[1]]

Output the slope and intercept for the regression line (needed to fill in the table in Step 6)

reg1
## 
## Call:
## lm(formula = FPG ~ HbA1c, data = diabetes)
## 
## Coefficients:
## (Intercept)        HbA1c  
##       66.43        10.41

Step 3(e)

cor(FPG~HbA1c, data=diabetes)
## [1] 0.4819017

Step 3(f)

cor(FPG~HbA1c, data=diabetes)^2
## [1] 0.2322292

Step 4

Remove the observation corresponding to Subject 15 by creating a new data set without that observation.

diabetes2<-subset(diabetes, FPG<300)
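Removing a row with a condition such as `FPG<300` only does what we want if exactly one observation fails the condition. A quick way to check this is sketched below on a made-up data frame `toy` (the column names mimic the diabetes data, but the values are invented): list the rows that would be dropped, then confirm how many rows were removed.

```r
# Hypothetical stand-in for the diabetes data; all values are made up.
toy <- data.frame(Subject = 1:6,
                  HbA1c   = c(5.5, 6.1, 6.8, 7.4, 8.0, 6.5),
                  FPG     = c(90, 110, 130, 150, 170, 350))

# Which rows fail the condition, i.e. which rows will be dropped?
dropped <- subset(toy, !(FPG < 300))
print(dropped)

# Keep only the rows that satisfy the condition.
toy2 <- subset(toy, FPG < 300)

# Sanity check: number of observations removed.
print(nrow(toy) - nrow(toy2))
```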

Repeat Step 3 with the new data set.

reg2<-lm(FPG~HbA1c, data=diabetes2)
f2<-makeFun(reg2)
xyplot(FPG~HbA1c, data=diabetes2)

plotFun(f2(HbA1c)~HbA1c, data=diabetes2, add=TRUE)

mplot(reg2, which=1)
## [[1]]

cor(FPG~HbA1c, data=diabetes2)
## [1] 0.5683967
cor(FPG~HbA1c, data=diabetes2)^2
## [1] 0.3230749
reg2
## 
## Call:
## lm(formula = FPG ~ HbA1c, data = diabetes2)
## 
## Coefficients:
## (Intercept)        HbA1c  
##       69.49         8.92

Step 5

Remove the observation corresponding to Subject 18 by creating a new data set without that observation.

diabetes3<-subset(diabetes, HbA1c<15)

Repeat Step 3 with the new data set.

reg3<-lm(FPG~HbA1c, data=diabetes3)
f3<-makeFun(reg3)
xyplot(FPG~HbA1c, data=diabetes3)

plotFun(f3(HbA1c)~HbA1c, data=diabetes3, add=TRUE)

mplot(reg3, which=1)
## [[1]]

cor(FPG~HbA1c, data=diabetes3)
## [1] 0.3837007
cor(FPG~HbA1c, data=diabetes3)^2
## [1] 0.1472262
reg3
## 
## Call:
## lm(formula = FPG ~ HbA1c, data = diabetes3)
## 
## Coefficients:
## (Intercept)        HbA1c  
##       52.26        12.12

Step 6(b)

Create a scatter plot and plot all three regression lines on it.

xyplot(FPG~HbA1c, data=diabetes)
plotFun(f1(HbA1c)~HbA1c, data=diabetes, add=TRUE)
plotFun(f2(HbA1c)~HbA1c, data=diabetes2, add=TRUE, col="red")
plotFun(f3(HbA1c)~HbA1c, data=diabetes3, add=TRUE, col="magenta")

Step 6(c)

The outlier corresponding to Subject 15 pulls the regression line up and seems to have a larger effect on the correlation than the other outlier. The outlier corresponding to Subject 18 pulls the regression line down toward itself. The influence of this outlier is not as large as the influence of the other outlier. The blue line (the regression line corresponding to the entire data set) and the magenta line (the one corresponding to the data without Subject 18) are close together for HbA1c values between 6 and 14.
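One possible way to line up the intercept, slope, and correlation of several fits for comparison is sketched below in base R. The data frame `toy` is made up (one point is unusual in y and one is unusual in x, loosely mimicking Subjects 15 and 18), and the names `no_high_y`, `no_high_x`, and `summary_table` are hypothetical. The layout of the handout's table may differ.

```r
# Hypothetical data: values are made up, not the diabetes data.
toy <- data.frame(x = c(5, 6, 6.5, 7, 7.5, 8, 9, 7, 16),
                  y = c(90, 105, 120, 125, 140, 150, 170, 350, 180))

no_high_y <- subset(toy, y < 300)   # drop the point that is unusual in y
no_high_x <- subset(toy, x < 15)    # drop the point that is unusual in x

fits <- list(all       = lm(y ~ x, data = toy),
             no_high_y = lm(y ~ x, data = no_high_y),
             no_high_x = lm(y ~ x, data = no_high_x))

# One row per fit: intercept, slope, and correlation of the data used in that fit.
summary_table <- t(sapply(fits, function(fit) {
  d <- model.frame(fit)
  c(coef(fit), r = cor(d$x, d$y))
}))
print(round(summary_table, 3))
```

Reading across the rows shows how much each removed point moves the slope and the correlation, which is the comparison made in the paragraph above.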