Section 2.2: Correlation

Let’s find the correlation r for three data sets that we looked at last class.

Forbes Best Countries for Business

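library(mosaic)  # provides read.file(), the formula interface to cor(), makeFun(), and plotFun()
best<-read.file("/home/emesekennedy/Data/Ch2/bestcountries.txt", sep="\t", header=TRUE)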
## Reading data with read.table()
xyplot(GDPPerCap~Unemployment, data=best)

cor(GDPPerCap~Unemployment, data=best)
## [1] -0.3655093

In the scatterplot, the relationship between the two variables does not look linear. This is consistent with the r value of -0.366, which is fairly close to zero and indicates a weak negative linear relationship between the two variables.

Botnets

spam<-read.file("/home/emesekennedy/Data/Ch2/spam.txt", sep="\t", header=TRUE)
## Reading data with read.table()
xyplot(SpamsPerDay~Bots, data=spam)

cor(SpamsPerDay~Bots, data=spam)
## [1] 0.8839139

In the scatterplot, the relationship between the two variables looks fairly linear. This is confirmed by the r value of 0.88, which indicates a fairly strong positive linear relationship between the two variables.

Debt of Countries

debt<-read.file("/home/emesekennedy/Data/Ch2/debt.txt", sep="\t", header=TRUE)
## Reading data with read.table()
xyplot(Debt2007~Debt2006, data=debt)

cor(Debt2007~Debt2006, data=debt)
## [1] 0.9971255

In the scatterplot, the relationship between the two variables looks strongly linear. This is confirmed by the r value of 0.997, which indicates a very strong positive linear relationship between the two variables.
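
As a reminder of what the number returned by cor() measures, here is a minimal sketch that computes r from its definition (the sum of the products of the standardized values, divided by n - 1) and compares it with the built-in command. The vectors x and y are hypothetical toy data, not one of the data sets above, and the sketch uses only base R.

```r
# Hypothetical toy data (not from any of the data sets above)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

n <- length(x)

# Standardize each variable: subtract the mean, divide by the standard deviation
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# r is the sum of the products of the standardized values, divided by n - 1
r_by_hand <- sum(zx * zy) / (n - 1)

r_by_hand
cor(x, y)   # built-in command; should agree with r_by_hand
```

Because r is built from standardized values, it does not change if we change the units of x or y.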

Section 2.3: Least-Squares Regression

Example: Fat vs Non-Exercise Activity (NEA) When Overeating

Refer to the handout from class for a step-by-step guide on how to do least-squares regression on this example. Below are the commands that correspond to the problems on the handout.

Step 1

fidget<-read.file("/home/emesekennedy/Data/Ch2/fidget.txt")
## Reading data with read.table()

Step 2 (a)

xyplot(Fat~NEA, data=fidget)

Step 2 (b)

We can use the following command to get a quick visual representation of our data with a regression line. Note: this command does not give the equation of the regression line, so we cannot use it for analysis.

xyplot(Fat~NEA, data=fidget, panel=panel.lmbands)

Step 3 (a)

Use a built-in R command to fit our data with a regression line.

reg<-lm(Fat~NEA, data=fidget)

Step 3 (b)

reg
## 
## Call:
## lm(formula = Fat ~ NEA, data = fidget)
## 
## Coefficients:
## (Intercept)          NEA  
##    3.505123    -0.003441

The slope of the regression line is approximately -0.0034 and the intercept is approximately 3.5, so the equation of the line is: predicted Fat = 3.505 - 0.00344 * NEA.

Step 3 (c)

The command below gives a lot more information about the regression line. We will talk about what some of these things mean later.

summary(reg)
## 
## Call:
## lm(formula = Fat ~ NEA, data = fidget)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1091 -0.3904 -0.1039  0.4125  1.6439 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.5051229  0.3036164  11.545 1.53e-08 ***
## NEA         -0.0034415  0.0007414  -4.642 0.000381 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7399 on 14 degrees of freedom
## Multiple R-squared:  0.6061, Adjusted R-squared:  0.578 
## F-statistic: 21.55 on 1 and 14 DF,  p-value: 0.000381
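
One line of this output connects directly to Section 2.2: in a regression with a single explanatory variable, Multiple R-squared is the square of the correlation r, and r has the same sign as the slope. The sketch below recovers r from the numbers printed above. The variable names are our own, and the values are copied by hand from the output, so the result is only as precise as the rounded output.

```r
# Values copied from the summary(reg) output above
r_squared <- 0.6061
slope     <- -0.0034415

# r has the same sign as the slope, and its square is Multiple R-squared
r <- sign(slope) * sqrt(r_squared)
r   # about -0.78
```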

Step 4 (a)

Create a function using the regression line we found.

f<-makeFun(reg)
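
To see what such a function does, here is a rough base-R sketch of the same idea. It assumes a hypothetical toy data frame (toy) rather than the fidget data, and the names toy_reg, b, and g are made up for illustration. It pulls the intercept and slope out of the fitted model and builds the prediction function by hand.

```r
# Hypothetical toy data standing in for the class data set
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.0, 4.1, 5.9, 8.2, 9.8))

toy_reg <- lm(y ~ x, data = toy)

# Pull the intercept and slope out of the fitted model
b <- coef(toy_reg)

# A hand-written prediction function: intercept + slope * x
g <- function(x) unname(b[1] + b[2] * x)

g(3.5)
predict(toy_reg, newdata = data.frame(x = 3.5))   # same prediction from base R
```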

Step 4 (b)

Plot the data again.

xyplot(Fat~NEA, data=fidget)

Step 4 (c)

Add the regression line we found to the scatterplot.

plotFun(f(NEA)~NEA, data=fidget, add=TRUE)

Step 5 (a)

Use the regression line to predict the fat gain for a person whose NEA increases by 400 calories when overeating.

f(NEA=400)
##        1 
## 2.128528

The predicted fat gain for this person is approximately 2.13 kg.

Step 5 (b)

Use the regression line to predict the fat gain for a person whose NEA increases by 1500 calories when overeating.

f(NEA=1500)
##         1 
## -1.657108

The predicted fat gain for this person is approximately -1.66 kg, but this prediction is not reasonable: NEA=1500 is far outside the range of NEA values in our data, so the line is being extrapolated beyond where it was fitted.
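
The two predictions can also be checked by plugging the coefficients into the equation of the line by hand. The sketch below copies the intercept and slope from the summary(reg) output in Step 3 (c), so the results agree with f() only up to rounding. The name nea_zero is our own. It marks the NEA value at which the line crosses zero, which helps show why a prediction at 1500 comes out negative.

```r
# Coefficients copied from the summary(reg) output in Step 3 (c)
b0 <- 3.5051229    # intercept
b1 <- -0.0034415   # slope

# Predicted fat gain = intercept + slope * NEA
b0 + b1 * 400    # close to f(NEA=400)
b0 + b1 * 1500   # close to f(NEA=1500)

# NEA value at which the line predicts zero fat gain;
# for any larger NEA the predicted fat gain is negative
nea_zero <- -b0 / b1
nea_zero
```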

Step 6 (a)

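Now let’s find the slope b1 of the least-squares regression line using the formula we learned: b1 = r * (sy / sx), where r is the correlation and sx and sy are the standard deviations of NEA and Fat.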

b1<-cor(Fat~NEA, data=fidget)*(sd(~Fat, data=fidget)/sd(~NEA, data=fidget))
b1
## [1] -0.003441487

Step 6 (b)

Now let’s find the intercept b0 of the least-squares regression line using the formula we learned: b0 = ybar - b1 * xbar, where xbar and ybar are the means of NEA and Fat.

b0<-mean(~Fat, data=fidget)-b1*mean(~NEA, data=fidget)
b0
## [1] 3.505123

Step 6 (c)

The slope and intercept we computed from the formulas match the slope and intercept that the built-in R command gave us. This confirms that the built-in R command uses the method of least squares to find the regression line.
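
To illustrate what "least squares" means, here is a small sketch on hypothetical toy data (again not the fidget data; the names toy, fit, and sse are made up for illustration). It computes the sum of squared vertical distances from the points to a line, and checks that the line returned by lm() has a smaller sum than two slightly different lines.

```r
# Hypothetical toy data with a negative trend
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(5.2, 4.1, 4.4, 2.9, 2.3, 1.0))

fit <- lm(y ~ x, data = toy)
b <- coef(fit)

# Sum of squared vertical distances from the points to the line b0 + b1 * x
sse <- function(b0, b1) sum((toy$y - (b0 + b1 * toy$x))^2)

sse_fit     <- sse(b[1], b[2])          # the line found by lm()
sse_steeper <- sse(b[1], b[2] - 0.1)    # same intercept, steeper slope
sse_higher  <- sse(b[1] + 0.5, b[2])    # same slope, higher intercept

# The fitted line should have the smallest value of the three
c(fit = sse_fit, steeper = sse_steeper, higher = sse_higher)
```

Any other choice of slope or intercept would also give a larger sum of squares than the fitted line; the two alternatives here are just examples.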