2025-10-25

Introduction and Importing

  • For this homework, I used the Air Quality dataset from the UCI Machine Learning Repository.
  • The dataset contains hourly levels of several air contaminants measured in an Italian city, such as:
    • Carbon Monoxide, Non-Methane Hydrocarbons, Benzene, etc.
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
airq <- read.csv("AirQualityUCI.csv", sep=";", dec=",")
# Dates in this file are day/month/year (the series starts on 10 March 2004)
airq$Date_Objects <- as.Date(airq$Date, format="%d/%m/%Y")
head(airq)
##         Date     Time CO.GT. PT08.S1.CO. NMHC.GT. C6H6.GT. PT08.S2.NMHC.
## 1 10/03/2004 18.00.00    2.6        1360      150     11.9          1046
## 2 10/03/2004 19.00.00    2.0        1292      112      9.4           955
## 3 10/03/2004 20.00.00    2.2        1402       88      9.0           939
## 4 10/03/2004 21.00.00    2.2        1376       80      9.2           948
## 5 10/03/2004 22.00.00    1.6        1272       51      6.5           836
## 6 10/03/2004 23.00.00    1.2        1197       38      4.7           750
##   NOx.GT. PT08.S3.NOx. NO2.GT. PT08.S4.NO2. PT08.S5.O3.    T   RH     AH  X X.1
## 1     166         1056     113         1692        1268 13.6 48.9 0.7578 NA  NA
## 2     103         1174      92         1559         972 13.3 47.7 0.7255 NA  NA
## 3     131         1140     114         1555        1074 11.9 54.0 0.7502 NA  NA
## 4     172         1092     122         1584        1203 11.0 60.0 0.7867 NA  NA
## 5     131         1205     116         1490        1110 11.2 59.6 0.7888 NA  NA
## 6      89         1337      96         1393         949 11.2 59.2 0.7848 NA  NA
##   Date_Objects
## 1   2004-03-10
## 2   2004-03-10
## 3   2004-03-10
## 4   2004-03-10
## 5   2004-03-10
## 6   2004-03-10
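
The UCI documentation for this dataset tags missing readings with the value -200, and the `head()` output above shows two empty trailing columns (`X`, `X.1`). Below is a minimal sketch of one way to clean both. It uses a small made-up data frame (`toy` and its values are hypothetical, not rows from the file) so that it runs on its own. The slides do not show whether a step like this was applied before plotting or modelling.

```r
# Toy stand-in for the imported data; the values are invented for illustration.
toy <- data.frame(
  CO.GT.  = c(2.6, -200, 2.2),
  NOx.GT. = c(166, 103, -200),
  X       = c(NA, NA, NA),
  X.1     = c(NA, NA, NA)
)

# Drop columns that contain only NA (the empty trailing CSV fields).
toy_clean <- toy[, colSums(!is.na(toy)) > 0, drop = FALSE]

# Recode the -200 sentinel as NA in numeric columns.
# which() ignores existing NAs, so rows that are already missing are safe.
num_cols <- sapply(toy_clean, is.numeric)
toy_clean[num_cols] <- lapply(
  toy_clean[num_cols],
  function(col) replace(col, which(col == -200), NA)
)

print(toy_clean)
```

With the sentinel recoded as `NA`, `lm()` and the plotting functions skip those readings instead of treating -200 as a real measurement.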

Nitrogen Oxides

figure <- plot_ly(x=airq$Date_Objects, y=airq$NOx.GT., type="scatter", mode="markers", name="data",
        width=600, height=350) %>%
        layout(title="Plotly Plot of Nitrogen Oxides Concentration by Date",
               xaxis=list(title="Date"), yaxis=list(title="NOx Concentration (ppb)"))
figure

ggplot Number 1

figure2 <- ggplot(airq, aes(x=NOx.GT., y=PT08.S1.CO.)) +
  geom_point() +
  geom_smooth(method="lm", se=TRUE) +
  labs(title="Hourly Averaged CO Sensor Response vs Hourly Averaged NOx Concentration")
figure2
## `geom_smooth()` using formula = 'y ~ x'

ggplot Number 2

figure3 <- ggplot(airq, aes(x=CO.GT., y=PT08.S1.CO.)) +
  geom_point() +
  geom_smooth(method="lm", se=TRUE) +
  labs(title="Hourly Averaged CO Sensor Response vs Hourly Averaged Carbon Monoxide Concentration") +
  coord_cartesian(xlim=c(0, 15))
figure3
## `geom_smooth()` using formula = 'y ~ x'

Linear Regression Model For GGplot 1

model <- lm(PT08.S1.CO. ~ NOx.GT., data=airq)
summary(model)
## 
## Call:
## lm(formula = PT08.S1.CO. ~ NOx.GT., data = airq)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1655.52   -92.70    18.74   146.33   922.30 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 988.93322    3.91569  252.56   <2e-16 ***
## NOx.GT.       0.35617    0.01272   27.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 316.8 on 9355 degrees of freedom
##   (114 observations deleted due to missingness)
## Multiple R-squared:  0.07728,    Adjusted R-squared:  0.07718 
## F-statistic: 783.5 on 1 and 9355 DF,  p-value: < 2.2e-16

Explanation

Model: \(\text{PT08.S1.CO.} = \beta_0 + \beta_1 \cdot \text{NOx.GT.} + \varepsilon; \qquad \varepsilon \sim N(0, \sigma^2)\)

  • PT08.S1.CO. is the “Hourly Averaged Sensor Response (when nominally targeting CO)”
  • NOx.GT. is the “True hourly averaged NOx (nitrogen oxides) concentration”
  • The equation above is the linear regression model fitted on the previous slide
  • As seen on the previous slide, the R-squared value is 0.07728, so NOx.GT. explains only about 7.7% of the variance in PT08.S1.CO. and the linear relationship is weak
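
To make the fitted line concrete, here is a small sketch that plugs the coefficients reported by `summary(model)` into the model equation. The NOx readings in `nox` are hypothetical values chosen only for illustration, and the variable names are not from the original code.

```r
# Coefficients copied from the summary(model) output on the previous slide.
b0 <- 988.93322   # intercept
b1 <- 0.35617     # slope for NOx.GT.

# Hypothetical NOx readings chosen only for illustration.
nox <- c(0, 100, 200)

# Predicted sensor response on the fitted line.
pred <- b0 + b1 * nox
print(pred)

# Share of variance explained by the model, as a percentage.
r2 <- 0.07728
print(round(100 * r2, 1))
```

Each additional 100 units of NOx.GT. moves the predicted sensor response by about 35.6 units, which is small next to the residual standard error of 316.8 and consistent with the low R-squared.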

Carbon Monoxide Plot_ly

figure4 <- plot_ly(x=airq$Date_Objects, y=airq$CO.GT., type="scatter", mode="markers", name="data",
        width=600, height=200) %>%
        layout(title="Plotly Plot of Carbon Monoxide Concentration by Date",
               xaxis=list(title="Date"), yaxis=list(title="Carbon Monoxide Concentration (mg/m^3)"))
figure4

Second Example of Linear Regression

model_2 <- lm(NOx.GT. ~ CO.GT., data=airq)
summary(model_2)
## 
## Call:
## lm(formula = NOx.GT. ~ CO.GT., data = airq)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -439.83 -123.28  -73.58   97.54 1233.41 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 228.31550    2.47262   92.34   <2e-16 ***
## CO.GT.        1.74519    0.02914   59.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 218.9 on 9355 degrees of freedom
##   (114 observations deleted due to missingness)
## Multiple R-squared:  0.2772, Adjusted R-squared:  0.2771 
## F-statistic:  3587 on 1 and 9355 DF,  p-value: < 2.2e-16

Explanation

  • NOx.GT. is the true hourly averaged NOx (nitrogen oxides) concentration
  • CO.GT. is the true hourly averaged carbon monoxide concentration
  • The R-squared value for this regression model is 0.2772 (adjusted: 0.2771), roughly 3.6 times the value for the previous model, which indicates a stronger, though still moderate, linear relationship

    Model: \(\text{NOx.GT.} = \beta_0 + \beta_1 \cdot \text{CO.GT.} + \varepsilon; \qquad \varepsilon \sim N(0, \sigma^2)\)
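
Both summaries report a multiple and an adjusted R-squared. As a quick check on how the two relate, the sketch below applies the standard formula \(R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}\) to the values printed in these slides. The helper `adj_r2` is a hypothetical name introduced here, not part of the original code.

```r
# Adjusted R-squared from R-squared, residual degrees of freedom, and the
# number of predictors p (excluding the intercept).
adj_r2 <- function(r2, df_resid, p = 1) {
  n <- df_resid + p + 1          # observations actually used in the fit
  1 - (1 - r2) * (n - 1) / df_resid
}

# Values copied from the two summary() outputs in these slides.
adj_model   <- adj_r2(0.07728, 9355)
adj_model_2 <- adj_r2(0.2772, 9355)

print(signif(c(adj_model, adj_model_2), 4))
```

With more than 9,000 observations and a single predictor, the adjustment is tiny, so the results agree with the reported 0.07718 and 0.2771 to four significant figures.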