Section 2.3 (continued) and Section 2.4

Example from last class: Fat Gain vs. Non-Exercise Activity (NEA)

Let’s load the data and look at the regression line again for the example from last class:

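library(mosaic)  # assumed to be loaded already; provides read.file(), makeFun(), plotFun(), and mplot()
fidget<-read.file("/home/emesekennedy/Data/Ch2/fidget.txt")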
## Reading data with read.table()
reg<-lm(Fat~NEA, data=fidget)
f<-makeFun(reg)
xyplot(Fat~NEA, data=fidget)

plotFun(f(NEA)~NEA, data=fidget, add=TRUE)

As the plot indicates, the relationship between Fat and NEA seems fairly linear with a negative association, so we would expect the correlation r to be negative and closer to -1 than to 0.

cor(Fat~NEA, data=fidget)
## [1] -0.7785558

The correlation is as we would expect. Now, let’s find the value of the correlation squared.

cor(Fat~NEA, data=fidget)^2
## [1] 0.6061492

This value indicates that approximately 61% of the variation in Fat is explained by the least-squares regression line. Now, let’s look at the residual plot.

mplot(reg, which=1)
## [[1]]

As we would expect, the residuals look fairly scattered around the zero line with no specific pattern, and the residual values are fairly small. This means that the least-squares regression line captures the overall pattern of the data well.
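
To make the two summaries above concrete, here is a minimal base-R sketch on simulated data (the values of x and y are made up and are not the fidget data). Under that assumption, it checks that the squared correlation equals the fraction of the variation in y accounted for by the line, 1 - SSE/SST, and that the residuals are simply the observed values minus the fitted values, which sum to zero for a least-squares line with an intercept.

```r
# Simulated data for illustration only (not the fidget data)
set.seed(42)
x <- seq(-100, 700, length.out = 16)
y <- 3.5 - 0.003 * x + rnorm(16, sd = 0.7)

fit <- lm(y ~ x)

# Residual = observed value minus the value predicted by the line
res <- y - fitted(fit)

# r^2 computed three ways; all three should agree
r2_from_cor <- cor(x, y)^2
r2_from_ss  <- 1 - sum(res^2) / sum((y - mean(y))^2)  # 1 - SSE/SST
r2_from_lm  <- summary(fit)$r.squared

print(c(r2_from_cor, r2_from_ss, r2_from_lm))
print(sum(res))     # zero up to rounding error
print(cor(res, x))  # also zero up to rounding error
```

Because the residuals are uncorrelated with x by construction, a residual plot of a well-fitting line shows no trend; any curve or fan shape that remains is pattern the line failed to capture.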

Example: Data Created by a Statistician Named Anscombe

Let’s load the data, find the least-squares regression line, and create a scatterplot with the regression line added to it.

anscombe<-read.file("/home/emesekennedy/Data/Ch2/anscombe.txt")
## Reading data with read.table()
xyplot(yB~xB, data=anscombe)

regB<-lm(yB~xB, data=anscombe)
fB<-makeFun(regB)
plotFun(fB(xB)~xB, data=anscombe, add=TRUE)

Looking at the plot, it seems clear that the relationship between yB and xB is curved rather than linear, so the least-squares regression line does not capture the overall pattern of the data. Let’s confirm this with the residual plot.

mplot(regB, which=1)
## [[1]]

As expected, the residuals have a clear pattern. Now, let’s find the correlation and the value of the correlation squared.

cor(yB~xB, data=anscombe)
## [1] 0.8162365
cor(yB~xB, data=anscombe)^2
## [1] 0.666242

Judging by the correlation and the correlation squared alone, we would have expected a roughly linear relationship, since these values are about as strong as the ones in the Fat vs. NEA example. Moral: Always look at the graph of your data, and not just the numerical measures, to get the full picture.
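
R ships with Anscombe's four data sets as the built-in data frame anscombe (columns x1 to x4 and y1 to y4). The base-R sketch below assumes that xB and yB in the course file correspond to the second built-in set (x2 and y2); that assumption is consistent with the correlation reported above but is not confirmed by the file itself. It shows that all four sets have nearly the same correlation even though their scatterplots look very different.

```r
# Built-in copy of Anscombe's data (assumed to match the course file)
data(anscombe)

# Correlation of y_i with x_i for each of the four sets
r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(r) <- paste0("set", 1:4)

print(round(r, 3))    # correlations
print(round(r^2, 3))  # squared correlations

# In an interactive session, draw the four scatterplots with their lines
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  for (i in 1:4) {
    xi <- anscombe[[paste0("x", i)]]
    yi <- anscombe[[paste0("y", i)]]
    plot(xi, yi, main = paste("Set", i), xlab = "x", ylab = "y")
    abline(lm(yi ~ xi))
  }
  par(op)
}
```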

Example: Diabetes

Refer to the handout from class for a step-by-step guide on how to analyze the least-squares regression and correlation for this example. Below are the commands that correspond to the problems on the handout.

Step 1

diabetes<-read.file("/home/emesekennedy/Data/Ch2/diabetes.txt")
## Reading data with read.table()

Step 2

xyplot(FPG~HbA1c, data=diabetes)

Step 3(a)

reg1<-lm(FPG~HbA1c, data=diabetes)

Step 3(b)

f1<-makeFun(reg1)

Step 3(c)

plotFun(f1(HbA1c)~HbA1c, data=diabetes, add=TRUE)

Step 3(d)

mplot(reg1, which=1)
## [[1]]

Output the slope and intercept for the regression line (needed to fill in the table in Step 6)

reg1
## 
## Call:
## lm(formula = FPG ~ HbA1c, data = diabetes)
## 
## Coefficients:
## (Intercept)        HbA1c  
##       66.43        10.41
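
The printed coefficients give the fitted line: predicted FPG = 66.43 + 10.41 × HbA1c. As a quick check of how to read this output, the sketch below rebuilds the line by hand from the rounded coefficients, so its values should match f1 only up to rounding. The name f1_by_hand and the input value HbA1c = 8 are chosen purely for illustration.

```r
# Rounded coefficients copied from the printed output of reg1
b0 <- 66.43  # intercept
b1 <- 10.41  # slope: change in predicted FPG per one-unit increase in HbA1c

# Hand-built version of the regression line (hypothetical helper)
f1_by_hand <- function(HbA1c) b0 + b1 * HbA1c

pred <- f1_by_hand(8)
print(pred)                           # 66.43 + 10.41 * 8 = 149.71
print(f1_by_hand(9) - f1_by_hand(8))  # equals the slope, 10.41
```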

Step 3(e)

cor(FPG~HbA1c, data=diabetes)
## [1] 0.4819017

Step 3(f)

cor(FPG~HbA1c, data=diabetes)^2
## [1] 0.2322292

Step 4

Remove the observation corresponding to Subject 15 by creating a new data set without that observation.

diabetes2<-subset(diabetes, FPG<300)
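
subset() keeps only the rows for which the condition is TRUE, so the condition FPG<300 drops every subject whose FPG is 300 or more. The toy data frame below is made up for illustration (it is not the diabetes data); it shows the row count before and after, along with the bracket-indexing form, which gives the same result when there are no missing values.

```r
# Made-up data for illustration only (not the diabetes data)
toy <- data.frame(
  Subject = 1:5,
  HbA1c   = c(6.1, 7.4, 8.0, 9.2, 6.8),
  FPG     = c(120, 145, 350, 180, 130)
)

# Keep only the rows with FPG below 300
toy2 <- subset(toy, FPG < 300)

# Same selection written with bracket indexing
toy2_alt <- toy[toy$FPG < 300, ]

print(nrow(toy))   # 5 rows before
print(nrow(toy2))  # 4 rows after
print(setdiff(toy$Subject, toy2$Subject))  # the subject that was removed: 3
```

It is worth checking the row count after a step like this, since a threshold condition removes every row that meets it, not just the one observation you had in mind.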

Repeat Step 3 with the new data set.

reg2<-lm(FPG~HbA1c, data=diabetes2)
f2<-makeFun(reg2)
xyplot(FPG~HbA1c, data=diabetes2)

plotFun(f2(HbA1c)~HbA1c, data=diabetes2, add=TRUE)

mplot(reg2, which=1)
## [[1]]

cor(FPG~HbA1c, data=diabetes2)
## [1] 0.5683967
cor(FPG~HbA1c, data=diabetes2)^2
## [1] 0.3230749
reg2
## 
## Call:
## lm(formula = FPG ~ HbA1c, data = diabetes2)
## 
## Coefficients:
## (Intercept)        HbA1c  
##       69.49         8.92

Step 5

Remove the observation corresponding to Subject 18 by creating a new data set without that observation.

diabetes3<-subset(diabetes, HbA1c<15)

Repeat Step 3 with the new data set.

reg3<-lm(FPG~HbA1c, data=diabetes3)
f3<-makeFun(reg3)
xyplot(FPG~HbA1c, data=diabetes3)

plotFun(f3(HbA1c)~HbA1c, data=diabetes3, add=TRUE)

mplot(reg3, which=1)
## [[1]]

cor(FPG~HbA1c, data=diabetes3)
## [1] 0.3837007
cor(FPG~HbA1c, data=diabetes3)^2
## [1] 0.1472262
reg3
## 
## Call:
## lm(formula = FPG ~ HbA1c, data = diabetes3)
## 
## Coefficients:
## (Intercept)        HbA1c  
##       52.26        12.12

Step 6(b)

Create a scatterplot of the full data set and add all three regression lines to it.

xyplot(FPG~HbA1c, data=diabetes)
plotFun(f1(HbA1c)~HbA1c, data=diabetes, add=TRUE)
plotFun(f2(HbA1c)~HbA1c, data=diabetes2, add=TRUE, col="red")
plotFun(f3(HbA1c)~HbA1c, data=diabetes3, add=TRUE, col="magenta")
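
The three lines can also be compared numerically. The sketch below plugs the rounded intercepts and slopes printed in Steps 3 to 5 into the line equation at a few HbA1c values. The grid 6, 10, 14 is an arbitrary choice for illustration, and the results are only as precise as the rounded coefficients.

```r
# Rounded coefficients copied from the printed output of reg1, reg2, reg3
coefs <- data.frame(
  line      = c("all subjects", "without Subject 15", "without Subject 18"),
  intercept = c(66.43, 69.49, 52.26),
  slope     = c(10.41, 8.92, 12.12)
)

# Arbitrary grid of HbA1c values (an assumption made for this sketch)
HbA1c <- c(6, 10, 14)

# Row i holds the predictions of line i: intercept_i + slope_i * HbA1c
pred <- outer(coefs$slope, HbA1c) + coefs$intercept
dimnames(pred) <- list(coefs$line, paste0("HbA1c=", HbA1c))
print(round(pred, 2))

# Vertical gap between the full-data line and each reduced-data line
gap15 <- abs(pred["all subjects", ] - pred["without Subject 15", ])
gap18 <- abs(pred["all subjects", ] - pred["without Subject 18", ])
print(round(rbind(gap15, gap18), 2))
```

On this grid, the line fitted without Subject 18 stays closer to the full-data line than the line fitted without Subject 15 does at each of the three points.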

Step 6(c)

The outlier corresponding to Subject 15 pulls the regression line up and seems to have a larger effect on the correlation than the other outlier. The outlier corresponding to Subject 18 pulls the regression line down toward itself, but its influence is not as large as that of Subject 15. The blue line (the regression line for the entire data set) and the magenta line (the regression line for the data without Subject 18) are close together for HbA1c values between 6 and 14.
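
The two kinds of outlier can be illustrated on a small made-up data set (these numbers are invented and are not the diabetes data). In this minimal base-R sketch, one added point is unusual only in y and sits at the mean of x, while the other is unusual in x and falls well below the trend. The names clean, with_y_out, with_x_out, and summarize_fit are hypothetical.

```r
# Made-up, nearly linear data for illustration only
x <- 5:12
y <- 60 + 10 * x + c(3, -4, 2, -1, 1, -2, 4, -3)
clean <- data.frame(x = x, y = y)

# Add one outlier in the y direction (at the mean of x) ...
with_y_out <- rbind(clean, data.frame(x = 8.5, y = 300))
# ... and, separately, one outlier in the x direction (below the trend)
with_x_out <- rbind(clean, data.frame(x = 19, y = 150))

# Intercept, slope, and correlation for a given data frame
summarize_fit <- function(d) {
  fit <- lm(y ~ x, data = d)
  c(intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r         = cor(d$x, d$y))
}

results <- rbind(
  clean     = summarize_fit(clean),
  y_outlier = summarize_fit(with_y_out),
  x_outlier = summarize_fit(with_x_out)
)
print(round(results, 3))
```

In this toy example the y-outlier leaves the slope unchanged but raises the intercept and weakens the correlation, while the x-outlier pulls the slope down sharply. How strongly a real outlier acts depends on where it sits relative to the rest of the data, so the sketch does not by itself settle which subject is more influential in the diabetes data.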