---
title: "Regression_Correlation"
output: html_document
---
To create a scatter plot: See Graphing with R file.
To calculate the regression equation:
lm(dependent variable ~ independent variable, data=Dataset)
You can also store the model under a name:
lm.out = lm(dependent variable ~ independent variable, data=Dataset)
This calculates the linear model and saves it as lm.out (you can call it anything you want; it doesn't have to be lm.out). Typing
lm.out
on its own line prints out the linear model.
Example: Find the linear model for the amount of gas used based on temperature
Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
lm(gas_consumed~temperature, data=Gas)
##
## Call:
## lm(formula = gas_consumed ~ temperature, data = Gas)
##
## Coefficients:
## (Intercept) temperature
## 4.571 -0.223
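The two coefficients are the y-intercept and the slope of the regression line, so they can be used to compute a predicted value by hand. The following is a minimal sketch, assuming a small made-up data frame called Toy (hypothetical values, not the consumption.csv data) so that it runs without an internet connection.

```r
# Hypothetical toy data (NOT the consumption.csv data), used only for illustration
Toy <- data.frame(
  temperature  = c(-1, 0, 1, 2, 3, 4, 5),
  gas_consumed = c(4.9, 4.5, 4.4, 4.1, 3.9, 3.7, 3.4)
)

lm.out <- lm(gas_consumed ~ temperature, data = Toy)

# coef() returns a named vector: "(Intercept)" and "temperature"
b <- coef(lm.out)

# Predicted gas consumption at temperature = 2: y-hat = intercept + slope * x
y_hat <- b[["(Intercept)"]] + b[["temperature"]] * 2
y_hat
```

The same value comes back from predict(lm.out, newdata = data.frame(temperature = 2)).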
To plot the linear model on the scatter plot:
gf_point(dependent variable~independent variable, data=Dataset, title="type a title for the graph")%>% gf_lm(dependent variable~independent variable, data=Dataset)
gf_point draws the scatter plot, and gf_lm adds the linear model to it. Both functions come from the ggformula package, which is loaded along with the mosaic package.
Example: Draw the scatter plot with the linear model for gas consumed versus temperature.
Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
lm.out<-lm(gas_consumed~temperature, data=Gas)
gf_point(gas_consumed~temperature, data=Gas, title="Gas Consumed vs Temperature")%>%
gf_lm(gas_consumed~temperature, data=Gas)
To find and plot residuals:
residuals(lm.out) calculates the residuals.
gf_point(residuals(lm.out)~independent variable, data=Dataset)%>% gf_hline(yintercept = 0) plots the residuals against the independent variable and adds a horizontal line at y = 0.
Example: Find and plot the residuals for gas consumed vs temperature.
Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
lm.out<-lm(gas_consumed~temperature, data=Gas)
residuals(lm.out)
## 1 2 3 4 5 6
## 0.07256170 0.20706857 0.35166949 -0.25912868 -0.03682822 -0.01452777
## 7 8 9 10 11 12
## 0.04157544 -0.01382365 -0.51382365 -0.68002090 0.19838276 0.82068322
## 13 14 15 16 17 18
## 0.02068322 -0.13471586 -0.11241541 0.15448597 -0.02321357 -0.07861266
gf_point(residuals(lm.out)~temperature, data=Gas)%>%
gf_hline(yintercept = 0)
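Each residual is the observed value minus the value predicted by the linear model. A minimal sketch of that check, assuming a small made-up data frame called Toy (hypothetical values, not the consumption.csv data):

```r
# Hypothetical toy data (NOT the consumption.csv data), used only for illustration
Toy <- data.frame(
  temperature  = c(-1, 0, 1, 2, 3, 4, 5),
  gas_consumed = c(4.9, 4.5, 4.4, 4.1, 3.9, 3.7, 3.4)
)

lm.out <- lm(gas_consumed ~ temperature, data = Toy)

# Residual = observed y - predicted y (fitted value)
by_hand <- Toy$gas_consumed - fitted(lm.out)

# TRUE if the hand calculation agrees with residuals()
all.equal(unname(by_hand), unname(residuals(lm.out)))

# For a least-squares line with an intercept, the residuals sum to (essentially) zero
sum(residuals(lm.out))
```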
To find the correlation coefficient:
cor(dependent variable~independent variable, data=Dataset)
Example: Find the correlation coefficient for the amount of gas used based on temperature.
Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
cor(gas_consumed~temperature, data= Gas)
## [1] -0.7484644
The coefficient of determination is found by doing:
lm.out<-lm(dependent variable ~ independent variable, data=Dataset)
rsquared(lm.out)
Example: Find the coefficient of determination for the amount of gas used based on temperature
Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
lm.out<-lm(gas_consumed~temperature, data=Gas)
rsquared(lm.out)
## [1] 0.560199
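As a check, when there is only one independent variable, the coefficient of determination is the square of the correlation coefficient. Using the correlation coefficient found for the gas data:

$$
r^2 = (-0.7484644)^2 \approx 0.560199
$$

This agrees with the value from rsquared(lm.out).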
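If you are testing for a correlation, you can use the command:
cor.test(dependent variable~independent variable, data=Dataset)
The output reports the t value, the degrees of freedom, the p-value, and a confidence interval for the correlation.
Example: Test for a correlation between the amount of gas used and temperature.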
Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
cor.test(gas_consumed~temperature, data=Gas)
##
## Pearson's product-moment correlation
##
## data: gas_consumed and temperature
## t = -4.5144, df = 16, p-value = 0.0003529
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9006243 -0.4328463
## sample estimates:
## cor
## -0.7484644
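The t value in this output can be reproduced from the correlation coefficient and the sample size. A worked check, assuming n = 18 observations (consistent with df = n - 2 = 16 in the output):

$$
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
  = \frac{-0.7484644\,\sqrt{16}}{\sqrt{1-0.560199}}
  \approx \frac{-2.9939}{0.6632}
  \approx -4.514
$$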
You can also find many of the calculations above, plus the standard error of the estimate, by using the command summary(lm.out). You will see the coefficients of the linear model, the t value and p-value of the hypothesis test, the coefficient of determination, and the standard error of the estimate. The t value and p-value for the hypothesis test are in the row of the output labeled with your independent variable's name. The standard error of the estimate is labeled Residual standard error, and the coefficient of determination is labeled Multiple R-squared.
Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
lm.out<-lm(gas_consumed~temperature, data=Gas)
summary(lm.out)
##
## Call:
## lm(formula = gas_consumed ~ temperature, data = Gas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68002 -0.10396 -0.01418 0.13400 0.82068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.5713 0.1592 28.719 3.41e-15 ***
## temperature -0.2230 0.0494 -4.514 0.000353 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3301 on 16 degrees of freedom
## Multiple R-squared: 0.5602, Adjusted R-squared: 0.5327
## F-statistic: 20.38 on 1 and 16 DF, p-value: 0.0003529
This output gives you the regression line, the hypothesis test p-value, the coefficient of determination, and the standard error of the estimate. The regression line is formed using the Estimate values next to (Intercept), which is the y-intercept, and next to temperature, which is the slope. This gives y-hat=4.5713+(-0.2230)x. For the hypothesis test, read the temperature row: the t value is -4.514 and the p-value is 0.000353. (The t value 28.719 and p-value 3.41e-15 in the (Intercept) row test whether the y-intercept is zero, not whether there is a linear relationship.) The coefficient of determination is 0.5602. The standard error of the estimate is 0.3301.
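The numbers in the summary output can also be pulled out in code instead of being read off the printout. A minimal sketch, assuming a small made-up data frame called Toy (hypothetical values, not the consumption.csv data):

```r
# Hypothetical toy data (NOT the consumption.csv data), used only for illustration
Toy <- data.frame(
  temperature  = c(-1, 0, 1, 2, 3, 4, 5),
  gas_consumed = c(4.9, 4.5, 4.4, 4.1, 3.9, 3.7, 3.4)
)

lm.out <- lm(gas_consumed ~ temperature, data = Toy)
s <- summary(lm.out)

# Row of the coefficient table for the independent variable:
# Estimate (slope), Std. Error, t value, Pr(>|t|)
slope_row <- coef(s)["temperature", ]

slope_row[["t value"]]    # t value for the hypothesis test
slope_row[["Pr(>|t|)"]]   # p-value for the hypothesis test
s$r.squared               # coefficient of determination (Multiple R-squared)
s$sigma                   # standard error of the estimate (Residual standard error)
```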
To calculate a C% prediction interval, perform the commands:
lm.out = lm(dependent variable ~ independent variable, data=Dataset)
predict(lm.out, newdata=list(independent variable=value), interval="prediction", level=C)
This computes a prediction interval for the dependent variable when the independent variable is set to a particular value (put that value in place of the word value), at a particular level C (given as a decimal).
Example: Find the 95% prediction interval for the amount of gas consumed when the temperature is 3.5 degrees.
Gas <-read.csv("https://krkozak.github.io/MAT160/consumption.csv")
lm.out <-lm(gas_consumed ~ temperature, data=Gas)
predict(lm.out, newdata=list(temperature=3.5), interval="prediction", level=0.95)
## fit lwr upr
## 1 3.790819 3.06823 4.513408
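In this output, fit is the predicted value from the regression line, and lwr and upr are the lower and upper ends of the prediction interval. As a check, substituting x = 3.5 into the regression line from the summary output gives the same fit, up to rounding of the coefficients:

$$
\hat{y} = 4.5713 + (-0.2230)(3.5) = 4.5713 - 0.7805 = 3.7908
$$

The interval is centered on this value: both 4.513408 - 3.790819 and 3.790819 - 3.06823 are about 0.7226.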