This R-Codeline has been published to RPubs.
This document covers the R-Codeline referenced in the PowerPoint presentation for this CAS study. Please consult the presentation for the detailed discussion and further input on this topic.
Linear Regression
Perform a linear regression analysis for a numeric predictor and a numeric response.
From a machine learning perspective, regression is the task of predicting numerical outcomes from various inputs. Linear regression is the fundamental method in this field: y is "linearly" related to each xi, and each xi contributes additively to y.
y = a + b * x
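To make the formula concrete, the following minimal sketch estimates the intercept a and the slope b by ordinary least squares and compares them with the result of lm(). The data vectors x and y are made-up illustration values, and the names a_hat and b_hat are assumptions that do not appear in the original codeline.

```r
# Hypothetical toy data, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least squares estimates for y = a + b * x
b_hat <- cov(x, y) / var(x)          # slope
a_hat <- mean(y) - b_hat * mean(x)   # intercept

# lm() should return the same two coefficients
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(a_hat, b_hat))
```

The closed form only applies to a single predictor; with several predictors, lm() solves the corresponding least squares problem in general.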
New York Air Quality Measurements
Daily air quality measurements in New York, May to September 1973.
A data frame with 153 observations on 6 variables.
[,1] Ozone numeric Ozone (ppb)
[,2] Solar.R numeric Solar R (lang)
[,3] Wind numeric Wind (mph)
[,4] Temp numeric Temperature (degrees F)
[,5] Month numeric Month (1–12)
[,6] Day numeric Day of month (1–31)
Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms
Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.
In R, the linear regression model can be calculated and a response value can be predicted.
Example: New York Air Quality Measurements
Evaluate and predict Ozone (y) based on Temp (x)
library(psych)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
Remove all objects currently in the RStudio session.
rm(list=ls())
str(airquality)
## 'data.frame': 153 obs. of 6 variables:
## $ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
## $ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
## $ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
## $ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
## $ Month : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
summary(airquality)
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
head(airquality,10)
airquality_wa_NA <- airquality[complete.cases(airquality),]
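Filtering with complete.cases drops every row that contains at least one missing value. As a small sketch, the effect of this step can be checked by counting rows and missing values; the variable names n_total, n_complete and na_per_column are illustrative assumptions.

```r
# How many rows does the NA filter keep?
n_total    <- nrow(airquality)                   # all observations
n_complete <- sum(complete.cases(airquality))    # rows without any NA

# Missing values per column (only Ozone and Solar.R contain NAs)
na_per_column <- colSums(is.na(airquality))

n_total
n_complete
na_per_column
```

The number of complete rows is consistent with the 109 degrees of freedom reported by the correlation test and the model summary (number of rows minus 2).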
pairs.panels shows a scatter plot matrix (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal.
There appears to be a clear positive correlation between "Ozone" and "Temp".
pairs.panels(airquality_wa_NA, main="New York Air Quality Measurements")
cor.test(airquality_wa_NA$Ozone, airquality_wa_NA$Temp)
##
## Pearson's product-moment correlation
##
## data: airquality_wa_NA$Ozone and airquality_wa_NA$Temp
## t = 10.192, df = 109, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5888139 0.7829869
## sample estimates:
## cor
## 0.6985414
cor = 0.699 -> positive correlation
airq_model <- lm(Ozone ~ Temp,airquality_wa_NA)
airq_model
##
## Call:
## lm(formula = Ozone ~ Temp, data = airquality_wa_NA)
##
## Coefficients:
## (Intercept) Temp
## -147.646 2.439
R-Model (formula): Ozone = -147.646 + 2.439 * Temp
summary(airq_model)
##
## Call:
## lm(formula = Ozone ~ Temp, data = airquality_wa_NA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.922 -17.459 -0.874 10.444 118.078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -147.6461 18.7553 -7.872 2.76e-12 ***
## Temp 2.4391 0.2393 10.192 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.92 on 109 degrees of freedom
## Multiple R-squared: 0.488, Adjusted R-squared: 0.4833
## F-statistic: 103.9 on 1 and 109 DF, p-value: < 2.2e-16
R-squared = 0.488 (adjusted R-squared = 0.483)
p-value < 2.2e-16
With a p-value below 5%, the correlation can be considered statistically significant.
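The R-squared value can be reproduced by hand, which helps to see what it measures: the share of the variance in Ozone that the model explains. The sketch below is not part of the original codeline; the names aq, fit, ss_res and ss_tot are assumptions chosen for illustration.

```r
# Refit the model on the complete cases
aq  <- airquality[complete.cases(airquality), ]
fit <- lm(Ozone ~ Temp, data = aq)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res    <- sum(residuals(fit)^2)
ss_tot    <- sum((aq$Ozone - mean(aq$Ozone))^2)
r_squared <- 1 - ss_res / ss_tot

# With a single predictor, R-squared equals the squared Pearson correlation
r <- cor(aq$Ozone, aq$Temp)
c(manual = r_squared, cor_squared = r^2, lm_value = summary(fit)$r.squared)
```

All three values agree, so roughly 49% of the variation in Ozone is explained by Temp alone.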
Once the linear model has been fitted, it can be checked against the observed data. In the current example, the predicted values are added to the original data frame and displayed graphically.
The ggplot graphic plots the predicted values (x) against the observed Ozone values (y); points on the blue line are predicted exactly.
airquality_wa_NA$prediction <- predict(airq_model)
head(airquality_wa_NA,10)
Seen in DataCamp - Supervised Learning in R: Regression
ggplot(airquality_wa_NA, aes(x=prediction, y=Ozone)) +
geom_point() +
geom_abline(color = "blue")
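The agreement between predicted and observed values can also be summarized as a single number. The following sketch, which is an addition and not part of the original codeline, computes the root mean squared error (RMSE) of the predictions and compares it with the residual standard error from the model summary. The names aq, fit, rmse and rse are illustrative assumptions.

```r
# Refit the model on the complete cases
aq  <- airquality[complete.cases(airquality), ]
fit <- lm(Ozone ~ Temp, data = aq)

# Residual = observed Ozone minus predicted Ozone
res <- aq$Ozone - predict(fit)

# RMSE divides by n; the residual standard error divides by n - 2
rmse <- sqrt(mean(res^2))
rse  <- summary(fit)$sigma

c(rmse = rmse, residual_standard_error = rse)
```

Both values are in ppb of Ozone. The RMSE is slightly smaller because it does not correct for the two estimated coefficients.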
predict is a generic function for predictions from the results of various model fitting functions.
ozone_predict.data <- data.frame(Temp=c(65,70,75))
ozone_predict.prediction <- predict(airq_model,ozone_predict.data)
ozone_predict.prediction
## 1 2 3
## 10.89607 23.09162 35.28717
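These predictions follow directly from the model formula. As a minimal sketch, the same values can be computed by hand from the fitted coefficients; the names aq, fit, new_temp and manual are assumptions introduced for illustration.

```r
# Refit the model on the complete cases
aq  <- airquality[complete.cases(airquality), ]
fit <- lm(Ozone ~ Temp, data = aq)

new_temp <- c(65, 70, 75)

# Ozone = intercept + slope * Temp, applied to each new temperature
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["Temp"]] * new_temp

# predict() performs the same calculation
predicted <- predict(fit, data.frame(Temp = new_temp))
all.equal(unname(predicted), manual)
```

Each additional degree Fahrenheit raises the predicted Ozone value by the slope, about 2.44 ppb.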
Results: