Preface / Reference

This R-Codeline has been published on RPubs (link).

This document covers the R-Codeline referenced in the PowerPoint presentation for this CAS study. Please consult the presentation for the detailed discussion and further input on this topic.

Introduction

Linear Regression

Do a linear regression analysis for a numeric predictor and a numeric response

From a machine learning perspective, regression is the task of predicting numerical outcomes from various inputs. Linear regression is the fundamental method in this field: y is “linearly” related to each xi, and each xi contributes additively to y. With a single predictor x, intercept a and slope b, the model is:

y = a + b*x
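
As a minimal sketch of what the formula means in practice, the intercept a and slope b of a least-squares line can be computed by hand from the sample covariance and variance. The names aq, x, y, a, b and fit are illustrative only; the built-in airquality data set described in the next section serves as sample input, with incomplete rows dropped.

```r
# Least-squares estimates for y = a + b*x, computed by hand.
# Assumption: rows with any missing value are dropped first.
aq <- airquality[complete.cases(airquality), ]
x <- aq$Temp    # predictor
y <- aq$Ozone   # response

b <- cov(x, y) / var(x)      # slope
a <- mean(y) - b * mean(x)   # intercept

# lm() fits the same line, so both outputs should agree
fit <- lm(y ~ x)
c(a = a, b = b)
coef(fit)
```

The hand-computed values and the lm() coefficients should agree up to floating-point rounding.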

Data Frame used

New York Air Quality Measurements
Daily air quality measurements in New York, May to September 1973.

A data frame with 153 observations on 6 variables.

[,1] Ozone numeric Ozone (ppb)
[,2] Solar.R numeric Solar R (lang)
[,3] Wind numeric Wind (mph)
[,4] Temp numeric Temperature (degrees F)
[,5] Month numeric Month (1–12)
[,6] Day numeric Day of month (1–31)

Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms
Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
Temp: Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport

In R, the linear regression model can be calculated and a response value can be predicted.

Example: New York Air Quality Measurements
Evaluate and predict Ozone (y) based on Temp (x)

Preparation

Load Libraries

library(psych)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

Cleanup

Remove all objects currently in the R session (global environment).

rm(list=ls())

Dataset “airquality”

str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
summary(airquality)
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 
head(airquality,10)

Remove NAs

airquality_wa_NA <- airquality[complete.cases(airquality),]
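
Note that complete.cases(airquality) checks all six columns, so a row is also dropped when only Solar.R is missing, although Solar.R is not part of the Ozone ~ Temp model. The sketch below compares the resulting row counts. The names all_cols and aq_ozone_temp are hypothetical and are not used elsewhere in this document.

```r
# Rows kept when NAs are checked across all columns (the approach used above)
all_cols <- airquality[complete.cases(airquality), ]
nrow(all_cols)

# Hypothetical alternative: only require Ozone and Temp to be present
keep <- complete.cases(airquality[, c("Ozone", "Temp")])
aq_ozone_temp <- airquality[keep, ]
nrow(aq_ozone_temp)

# Number of missing values per column
colSums(is.na(airquality))
```

Filtering on all columns keeps 111 rows, filtering on Ozone and Temp only would keep 116. The rest of this document continues with the 111-row data frame.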

Graphical representation of data frame (Histograms and Correlations)

pairs.panels shows a scatter plot matrix (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal.

There appears to be a clear positive correlation between “Ozone” and “Temp”.

pairs.panels(airquality_wa_NA, main="New York Air Quality Measurements")

Check for linear regression Ozone vs. Temp

Do a correlation test

cor.test(airquality_wa_NA$Ozone, airquality_wa_NA$Temp)
## 
##  Pearson's product-moment correlation
## 
## data:  airquality_wa_NA$Ozone and airquality_wa_NA$Temp
## t = 10.192, df = 109, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5888139 0.7829869
## sample estimates:
##       cor 
## 0.6985414

cor = 0.699 > positive correlation
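
For a regression with a single predictor, the squared Pearson correlation equals the R-squared of the fitted line, so the correlation test already indicates how much variance a linear model can explain. A minimal sketch of this check, with aq, r and fit as illustrative names:

```r
# Same NA handling as in the rest of the document
aq <- airquality[complete.cases(airquality), ]

r <- cor(aq$Ozone, aq$Temp)          # Pearson correlation
fit <- lm(Ozone ~ Temp, data = aq)   # simple linear regression

r^2                      # squared correlation
summary(fit)$r.squared   # multiple R-squared of the model
```

Both values should be identical up to rounding.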

Establish regression model

airq_model <- lm(Ozone ~ Temp,airquality_wa_NA)
airq_model
## 
## Call:
## lm(formula = Ozone ~ Temp, data = airquality_wa_NA)
## 
## Coefficients:
## (Intercept)         Temp  
##    -147.646        2.439

R-Model (formula): Ozone = -147.646 + 2.439 * Temp

summary(airq_model)
## 
## Call:
## lm(formula = Ozone ~ Temp, data = airquality_wa_NA)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.922 -17.459  -0.874  10.444 118.078 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -147.6461    18.7553  -7.872 2.76e-12 ***
## Temp           2.4391     0.2393  10.192  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.92 on 109 degrees of freedom
## Multiple R-squared:  0.488,  Adjusted R-squared:  0.4833 
## F-statistic: 103.9 on 1 and 109 DF,  p-value: < 2.2e-16

R-Squared = 0.488 (adjusted R-Squared = 0.483)
p-value = < 2.2e-16
With a p-value below 5%, the relationship between Temp and Ozone can be considered statistically significant.
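
The figures above can also be read from the model object instead of the printed summary, which helps when they are needed in later calculations. The following is a sketch; aq, fit and s are illustrative names, and the model is refitted so that the block runs on its own.

```r
aq <- airquality[complete.cases(airquality), ]
fit <- lm(Ozone ~ Temp, data = aq)
s <- summary(fit)

s$r.squared                   # multiple R-squared
s$adj.r.squared               # adjusted R-squared
coef(s)["Temp", "Pr(>|t|)"]   # p-value of the Temp slope
confint(fit, level = 0.95)    # 95% confidence intervals for both coefficients
```

confint() adds information the summary does not print directly: a range of plausible values for the intercept and the slope.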

Predict data

Once the linear model has been fitted, it can be checked against the real data. In the current example, the predicted values are added to the original data frame and displayed graphically.

The ggplot graphic depicts the predicted values (x) vs. the observed Ozone values (y). The blue line marks the points where prediction and observation are equal.

Add prediction values to DF in new column

airquality_wa_NA$prediction <- predict(airq_model)
head(airquality_wa_NA,10)

Check prediction in data plot

Seen in DataCamp - Supervised Learning in R: Regression

ggplot(airquality_wa_NA, aes(x=prediction, y=Ozone)) +
  geom_point() +
  geom_abline(color = "blue")
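
As a numeric complement to the plot, the root mean squared error (RMSE) summarises how far the predictions are from the observed Ozone values on average. This is a sketch with illustrative names (aq, fit, res, rmse); comparing the RMSE with the standard deviation of Ozone is one common rule of thumb and not part of the original analysis.

```r
aq <- airquality[complete.cases(airquality), ]
fit <- lm(Ozone ~ Temp, data = aq)
aq$prediction <- predict(fit)

# Residual = observed value minus predicted value
res <- aq$Ozone - aq$prediction

rmse <- sqrt(mean(res^2))   # typical prediction error in ppb
rmse
sd(aq$Ozone)                # spread of Ozone without any model
```

An RMSE below the standard deviation of Ozone suggests that the model predicts better than simply using the mean.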

Do prediction for new values

predict is a generic function for predictions from the results of various model fitting functions.

ozone_predict.data <- data.frame(Temp=c(65,70,75))
ozone_predict.prediction <- predict(airq_model,ozone_predict.data)
ozone_predict.prediction
##        1        2        3 
## 10.89607 23.09162 35.28717

Results:

  • Temp = 65 > Ozone = 10.9
  • Temp = 70 > Ozone = 23.1
  • Temp = 75 > Ozone = 35.3
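
The point predictions above carry no information about their uncertainty. As a sketch, predict() for lm models accepts interval = "prediction", which returns a lower and upper bound for each new observation. The names aq, fit, new_temps and pred are illustrative, and the 95% level is an assumption.

```r
aq <- airquality[complete.cases(airquality), ]
fit <- lm(Ozone ~ Temp, data = aq)

new_temps <- data.frame(Temp = c(65, 70, 75))

# Matrix with columns fit (point prediction), lwr and upr (interval bounds)
pred <- predict(fit, newdata = new_temps,
                interval = "prediction", level = 0.95)
cbind(new_temps, pred)
```

With a residual standard error of about 24 ppb, the intervals are wide, and the lower bounds may fall below zero, which is not physically meaningful for an ozone concentration. This is a limitation of the plain linear model rather than of predict().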