Regression and Correlation with R

Scatter plot

To create a scatter plot, see the Graphing with R file.

Regression equation

To calculate regression equation: lm(dependent variable ~ independent variable)

You can also do:

lm.out = lm(dependent variable ~ independent variable)
lm.out

The first command calculates the linear model and stores it (you can call it anything you want; it doesn't have to be lm.out). The second command prints out the linear model.

Example: Find the linear model for the amount of gas used based on temperature

Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
lm(gas_consumed~temperature, data=Gas)

## 
## Call:
## lm(formula = gas_consumed ~ temperature, data = Gas)
## 
## Coefficients:
## (Intercept)  temperature  
##       4.571       -0.223
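The two coefficients can also be pulled out of the model and used as an equation. The following is a minimal sketch, assuming R's built-in cars data set as a stand-in (it is not the gas data used in this handout); the names fit, intercept, slope, and y_hat are made up for illustration.

```r
# Stand-in data: R's built-in cars data set (an assumption for illustration,
# not the gas data used in this handout).
fit <- lm(dist ~ speed, data = cars)

b <- coef(fit)                          # named vector of coefficients
intercept <- unname(b["(Intercept)"])   # y-intercept of the regression line
slope     <- unname(b["speed"])         # slope of the regression line

# Predicted value at speed = 10, computed directly from the equation
# y-hat = intercept + slope * x
y_hat <- intercept + slope * 10
y_hat
```

The same pattern should work for any model made with lm(): replace "speed" with the name of your independent variable.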

Plot linear model on the scatter plot

To plot the linear model on the scatter plot:

gf_point(dependent variable ~ independent variable, data=Dataset, title="type a title for the graph") %>%
gf_lm(dependent variable ~ independent variable, data=Dataset)

gf_point draws the scatter plot, and gf_lm plots the linear model on the scatter plot.

Example: Draw the scatter plot with the linear model for gas consumed versus temperature.

Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
lm.out<-lm(gas_consumed~temperature, data=Gas)
gf_point(gas_consumed~temperature, data=Gas, title="Gas Consumed vs Temperature")%>% 
gf_lm(gas_consumed~temperature, data=Gas)

Residuals

To find and plot residuals:

residuals(lm.out)

calculates the residuals.

gf_point(residuals(lm.out) ~ independent variable, data=Dataset) %>%
gf_hline(yintercept = 0)

plots the residuals against the independent variable and adds a horizontal line at y = 0.

Example: Find and plot the residuals for gas consumed vs temperature.

Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
lm.out<-lm(gas_consumed~temperature, data=Gas)
residuals(lm.out)

##           1           2           3           4           5           6 
##  0.07256170  0.20706857  0.35166949 -0.25912868 -0.03682822 -0.01452777 
##           7           8           9          10          11          12 
##  0.04157544 -0.01382365 -0.51382365 -0.68002090  0.19838276  0.82068322 
##          13          14          15          16          17          18 
##  0.02068322 -0.13471586 -0.11241541  0.15448597 -0.02321357 -0.07861266

gf_point(residuals(lm.out)~temperature, data=Gas)%>%
gf_hline(yintercept = 0)
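A residual is the observed value minus the value the linear model predicts. The sketch below checks this by hand, again assuming the built-in cars data set as a stand-in for your own data; fit and res_by_hand are illustrative names.

```r
# Stand-in data: R's built-in cars data set (an assumption for illustration).
fit <- lm(dist ~ speed, data = cars)

# Residual = observed value - fitted (predicted) value
res_by_hand <- cars$dist - fitted(fit)

# Compare the hand-computed residuals with residuals()
all.equal(unname(res_by_hand), unname(residuals(fit)))

# For a least-squares line with an intercept, the residuals add up to
# zero apart from rounding error
sum(residuals(fit))
```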

Correlation Coefficient

To calculate the correlation coefficient:

cor(dependent variable ~ independent variable, data=Dataset)

Example: Find the correlation coefficient for the amount of gas used based on temperature

Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
cor(gas_consumed~temperature, data= Gas)

## [1] -0.7484644

Coefficient of Determination

The coefficient of determination is found by doing:

lm.out<-lm(dependent variable ~ independent variable, data=Dataset)
rsquared(lm.out)

Example: Find the coefficient of determination for the amount of gas used based on temperature

Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
lm.out<-lm(gas_consumed~temperature, data=Gas)
rsquared(lm.out)

## [1] 0.560199
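For a linear model with one independent variable, the coefficient of determination is the square of the correlation coefficient. A quick check, using the correlation coefficient for gas consumed and temperature (-0.7484644) typed in by hand; the variable names are illustrative.

```r
# Correlation coefficient for gas_consumed and temperature, typed in by hand
r <- -0.7484644

# Squaring r gives the coefficient of determination
r_squared <- r^2
round(r_squared, 6)
```

The rounded result should agree with the value rsquared(lm.out) printed above.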

Hypothesis test for the Correlation Coefficient

If you are testing for a correlation, you can use the command cor.test(dependent variable~independent variable, data=Dataset)

Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
cor.test(gas_consumed~temperature, data=Gas)

## 
##  Pearson's product-moment correlation
## 
## data:  gas_consumed and temperature
## t = -4.5144, df = 16, p-value = 0.0003529
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9006243 -0.4328463
## sample estimates:
##        cor 
## -0.7484644
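The t value and p-value in this output can be reproduced from the correlation coefficient and the sample size. The sketch below assumes the usual t statistic for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2), and takes n = 18 from the output (df = 16 means n - 2 = 16); the variable names are illustrative.

```r
# Values typed in by hand from the cor.test output above
r <- -0.7484644   # sample correlation coefficient
n <- 18           # sample size (df = n - 2 = 16)

df <- n - 2

# t statistic for testing whether the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)

t_stat
p_value
```

Both values should agree with the cor.test output up to rounding.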

Standard Error of the Estimate

You can also find many of the calculations above, plus the standard error of the estimate, by using the command summary(lm.out). You will see the coefficients of the linear model, the t value and p-value of the hypothesis test, the coefficient of determination, and the standard error of the estimate. The t value and p-value are in the row of the output labeled with your independent variable's name. The standard error of the estimate is labeled Residual standard error, and the coefficient of determination is labeled Multiple R-squared.

Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
lm.out<-lm(gas_consumed~temperature, data=Gas)
summary(lm.out)

## 
## Call:
## lm(formula = gas_consumed ~ temperature, data = Gas)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68002 -0.10396 -0.01418  0.13400  0.82068 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.5713     0.1592  28.719 3.41e-15 ***
## temperature  -0.2230     0.0494  -4.514 0.000353 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3301 on 16 degrees of freedom
## Multiple R-squared:  0.5602, Adjusted R-squared:  0.5327 
## F-statistic: 20.38 on 1 and 16 DF,  p-value: 0.0003529

The regression line is formed using the Estimate values next to (Intercept), which is the y-intercept, and next to temperature, which is the slope. This gives y-hat=4.5713+(-0.2230)x. The t value and p-value for the hypothesis test come from the temperature row: the t value is -4.514 and the p-value is 0.000353. The coefficient of determination is 0.5602. The standard error of the estimate is 0.3301.
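If you need these numbers for later calculations, they can be read out of the summary object instead of being copied from the printout. This is a minimal sketch, assuming the built-in cars data set as a stand-in for your own data; fit, s, coefs, and the other names are illustrative.

```r
# Stand-in data: R's built-in cars data set (an assumption for illustration).
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

# Coefficient table: one row per term, with columns
# "Estimate", "Std. Error", "t value", "Pr(>|t|)"
coefs <- coef(s)

# t value and p-value from the independent variable's row (not the intercept row)
slope_t <- coefs["speed", "t value"]
slope_p <- coefs["speed", "Pr(>|t|)"]

# Standard error of the estimate ("Residual standard error" in the printout)
std_error_estimate <- s$sigma

# Coefficient of determination ("Multiple R-squared" in the printout)
r_squared <- s$r.squared

c(t = slope_t, p = slope_p, se = std_error_estimate, r2 = r_squared)
```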

Prediction Interval

To calculate a C% prediction interval, perform the commands:

lm.out = lm(dependent variable ~ independent variable, data=Dataset)
predict(lm.out, newdata=list(independent variable=value), interval="prediction", level=C)

This computes a prediction interval for the dependent variable when the independent variable is set to a particular value (put that value in place of the word value), at a particular level C (given as a decimal).

Example: Find the 95% prediction interval for the amount of gas consumed when the temperature is 3.5 degrees

Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
lm.out <-lm(gas_consumed ~ temperature, data=Gas)
predict(lm.out, newdata=list(temperature=3.5), interval="prediction", level=0.95)

##        fit     lwr      upr
## 1 3.790819 3.06823 4.513408
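In this output, fit is the value the regression equation predicts at temperature 3.5, and lwr and upr are the ends of the prediction interval, which is centered on fit. The sketch below checks both facts with numbers typed in by hand: the rounded coefficients from the summary output and the interval ends printed above. The variable names are illustrative.

```r
# Rounded coefficients from the summary output, typed in by hand
intercept <- 4.5713
slope     <- -0.2230
temperature <- 3.5

# Point prediction from the regression equation y-hat = intercept + slope * x
y_hat <- intercept + slope * temperature
y_hat

# Ends of the 95% prediction interval, typed in from the predict() output
lwr <- 3.06823
upr <- 4.513408

# The interval is symmetric, so its midpoint is the fit value
midpoint <- (lwr + upr) / 2
midpoint
```

The hand-computed y_hat differs from fit only in the later decimal places, because the coefficients used here are rounded.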