Let’s find the correlation r for three data sets that we looked at last class.
best<-read.file("/home/emesekennedy/Data/Ch2/bestcountries.txt", sep="\t", header=T)
## Reading data with read.table()
xyplot(GDPPerCap~Unemployment, data=best)
cor(GDPPerCap~Unemployment, data=best)
## [1] -0.3655093
Looking at the scatterplot, the relationship between the two variables does not appear to be linear. This is consistent with the r value of -0.366, which is fairly close to zero and indicates a weak negative linear relationship between the two variables.
spam<-read.file("/home/emesekennedy/Data/Ch2/spam.txt", sep="\t", header=T)
## Reading data with read.table()
xyplot(SpamsPerDay~Bots, data=spam)
cor(SpamsPerDay~Bots, data=spam)
## [1] 0.8839139
Looking at the scatterplot, the relationship between the two variables appears to be fairly linear. This is consistent with the r value of 0.88, which indicates a fairly strong positive linear relationship between the two variables.
debt<-read.file("/home/emesekennedy/Data/Ch2/debt.txt", sep="\t", header=T)
## Reading data with read.table()
xyplot(Debt2007~Debt2006, data=debt)
cor(Debt2007~Debt2006, data=debt)
## [1] 0.9971255
Looking at the scatterplot, the relationship between the two variables appears to be strongly linear. This is consistent with the r value of 0.997, which indicates a very strong positive linear relationship between the two variables.
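To see what the r values above measure, here is a minimal sketch that computes r by hand as the sum of the products of the standardized values, divided by n - 1. The vectors x and y are made-up toy data, not one of the class data sets, and the sketch uses base R's cor(x, y) rather than the formula interface used above.

```r
# Hypothetical toy data, for illustration only (not one of the class data sets)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

n <- length(x)

# Standardize each variable: subtract the mean, divide by the standard deviation
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# r is the sum of the products of the z-scores, divided by n - 1
r_by_hand <- sum(zx * zy) / (n - 1)

r_by_hand
cor(x, y)   # base R's built-in correlation, for comparison
```

The two printed values should agree, which is one way to check that the built-in command matches the definition.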
Refer to the handout from class for a step-by-step guide on how to do least-squares regression on this example. Below are the commands that correspond to the problems on the handout.
fidget<-read.file("/home/emesekennedy/Data/Ch2/fidget.txt")
## Reading data with read.table()
xyplot(Fat~NEA, data=fidget)
We can use the following command to get a quick visual representation of our data with a regression line. Note: this command does not give the equation of our regression line, so we cannot use it for analysis.
xyplot(Fat~NEA, data=fidget, panel=panel.lmbands)
Use a built-in R command to fit our data with a regression line.
reg<-lm(Fat~NEA, data=fidget)
reg
##
## Call:
## lm(formula = Fat ~ NEA, data = fidget)
##
## Coefficients:
## (Intercept)          NEA
##    3.505123    -0.003441
The slope of the regression line is -0.0034 and the intercept is 3.5.
The command below gives a lot more information about the regression line. We will talk about what some of these things mean later.
summary(reg)
##
## Call:
## lm(formula = Fat ~ NEA, data = fidget)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.1091 -0.3904 -0.1039  0.4125  1.6439
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.5051229  0.3036164  11.545 1.53e-08 ***
## NEA         -0.0034415  0.0007414  -4.642 0.000381 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7399 on 14 degrees of freedom
## Multiple R-squared: 0.6061, Adjusted R-squared: 0.578
## F-statistic: 21.55 on 1 and 14 DF, p-value: 0.000381
Create a function using the regression line we found.
f<-makeFun(reg)
Plot the data again.
xyplot(Fat~NEA, data=fidget)
Add the regression line we found to the scatterplot.
plotFun(f(NEA)~NEA,data=fidget,add=T)
Use the regression line to predict the fat gain for a person whose NEA increases by 400 calories when overeating.
f(NEA=400)
## 1
## 2.128528
The predicted fat gain for this person is approximately 2.13 kg.
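The same prediction can be reproduced by plugging NEA = 400 into the equation of the regression line. The sketch below copies the intercept and slope from the summary(reg) output above; the names intercept, slope, and pred are just illustrative.

```r
# Coefficients copied from the summary(reg) output above
intercept <- 3.5051229    # predicted fat gain (kg) when NEA = 0
slope     <- -0.0034415   # change in predicted fat gain (kg) per calorie of NEA

# Predicted fat gain at NEA = 400: intercept + slope * NEA
pred <- intercept + slope * 400
pred
```

Up to rounding in the copied coefficients, this should match the value returned by f(NEA=400).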
Use the regression line to predict the fat gain for a person whose NEA increases by 1500 calories when overeating.
f(NEA=1500)
## 1
## -1.657108
The predicted fat gain is approximately -1.66 kg, but this prediction is not trustworthy: NEA=1500 is far outside the range of NEA values in our data, so we would be extrapolating.
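One way to guard against this kind of extrapolation is to check whether a new value falls inside the observed range before predicting. The sketch below is only an illustration: within_observed_range is a hypothetical helper (not part of any package), and x_obs is made-up data rather than the fidget data.

```r
# Hypothetical helper: TRUE when x_new lies inside the observed range of x
within_observed_range <- function(x_new, x) {
  x_new >= min(x) & x_new <= max(x)
}

# Made-up explanatory values, for illustration only (not the fidget data)
x_obs <- c(10, 25, 40, 55, 70)

within_observed_range(30, x_obs)    # 30 is between 10 and 70
within_observed_range(200, x_obs)   # 200 is above the largest observed value
```

With the class data loaded, the same check could be written as within_observed_range(1500, fidget$NEA).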
Now let’s find the slope b1 of a least-squares regression line using the formula we learned.
b1<-cor(Fat~NEA, data=fidget)*(sd(~Fat, data=fidget)/sd(~NEA, data=fidget))
b1
## [1] -0.003441487
Now let’s find the intercept b0 of a least-squares regression line using the formula we learned.
b0<-mean(~Fat, data=fidget)-b1*mean(~NEA, data=fidget)
b0
## [1] 3.505123
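For reference, the two commands above correspond to the following formulas, where x is the explanatory variable (NEA), y is the response variable (Fat), r is their correlation, and s_x and s_y are their standard deviations:

```latex
b_1 = r \, \frac{s_y}{s_x},
\qquad
b_0 = \bar{y} - b_1 \, \bar{x}
```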
The slope and intercept we computed from the formulas are the same as the slope and intercept of the regression line we found using the built-in R command. This confirms that the built-in R command uses the method of least squares to find the regression line.
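The agreement is not special to the fidget data. As a small sketch, the same comparison can be run on any data set. The example below uses made-up toy vectors and base R only; the names b1_formula, b0_formula, and fit are illustrative.

```r
# Hypothetical toy data, for illustration only (not the fidget data)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(5.2, 4.1, 3.9, 2.8, 2.2, 0.9)

# Slope and intercept from the formulas
b1_formula <- cor(x, y) * sd(y) / sd(x)
b0_formula <- mean(y) - b1_formula * mean(x)

# Slope and intercept from the built-in lm() command
fit <- lm(y ~ x)

coef(fit)                    # (Intercept) first, then the slope for x
c(b0_formula, b1_formula)    # same order, computed by hand
```

The two printed pairs should match up to floating-point rounding.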