This data set includes hourly air pollutants data from 12 nationally-controlled air-quality monitoring sites. The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data in each air-quality site are matched with the nearest weather station from the China Meteorological Administration. The time period is from March 1st, 2013 to February 28th, 2017. The station of interest and hence the region of interest is Aotizhongxin.
The air pollutants tracked are PM2.5, PM10, SO2, NO2, CO, O3. Along with that other air important indicators such as TEMP, PRES, DEWP, RAIN, wd (winddirection), WSPM (windspeed in m/s) are also recorded.
# Load libraries
library(dplyr)
library(ggplot2)
library(janitor)
library(lubridate)
library(readr)
library(zoo)
df <- read.csv("D:/Downloads/PRSA2017_Data_20130301-20170228/PRSA_Data_20130301-20170228/PRSA_Data_Aotizhongxin_20130301-20170228.csv")
head(df)
## No year month day hour PM2.5 PM10 SO2 NO2 CO O3 TEMP PRES DEWP RAIN wd
## 1 1 2013 3 1 0 4 4 4 7 300 77 -0.7 1023.0 -18.8 0 NNW
## 2 2 2013 3 1 1 8 8 4 7 300 77 -1.1 1023.2 -18.2 0 N
## 3 3 2013 3 1 2 7 7 5 10 300 73 -1.1 1023.5 -18.2 0 NNW
## 4 4 2013 3 1 3 6 6 11 11 300 72 -1.4 1024.5 -19.4 0 NW
## 5 5 2013 3 1 4 3 3 12 12 300 72 -2.0 1025.2 -19.5 0 N
## 6 6 2013 3 1 5 5 5 18 18 400 66 -2.2 1025.6 -19.6 0 N
## WSPM station
## 1 4.4 Aotizhongxin
## 2 4.7 Aotizhongxin
## 3 5.6 Aotizhongxin
## 4 3.1 Aotizhongxin
## 5 2.0 Aotizhongxin
## 6 3.7 Aotizhongxin
Temperature is the variable of interest. Logically speaking, the temperature is lower at night and goes higher around the noon and then drops as evening approaches. Two most plausible factors that could cause this variation are presence of Sun and vehicular activity due to humans.
Particulates like SO2, NO2 etc. are a function of human activities like vehicular emmissions, factory emmissions etc. Since, humans usually are involved in such activities during the normal working hours, there is a usual rise in the concentrations of those particulates in such hours.
For this project, Temperature is the variable of interest. However, there are many significant indicators of air quality. This dataset has many null values which makes it difficult to forecast as important information is lost at many time instances. Hence, only for this assignment, Temperature, which has only 20 null values, is chosen for forecasting.
attach(df)
boxplot(TEMP)
#looks like a bimodal distribution
hist(TEMP)
#summarising by hour
#avg. temp increases as day approaches and decreases from evening to night till early morning
by_hour <- group_by(df, hour)
df1 <- summarise(by_hour, T1 = mean(TEMP, na.rm = TRUE))
ggplot(df1, aes(x=hour, y=T1)) +
geom_line()
#summarising by day
by_day <- group_by(df, day)
df2 <- summarise(by_day, T1 = mean(TEMP, na.rm = TRUE))
ggplot(df2, aes(x=day, y=T1)) +
geom_line()
#summarising by month
by_month <- group_by(df, month)
df3 <- summarise(by_month, T1 = mean(TEMP, na.rm = TRUE))
ggplot(df3, aes(x=month, y=T1)) +
geom_line()
#range, mean, median
summary(df$TEMP)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -16.80 3.10 14.50 13.58 23.30 40.50 20
#standard deviation
sd(df$TEMP, na.rm = TRUE)
## [1] 11.3991
The intended variable of interest TEMP has no outliers. This can be confirmed from the previous boxplot visualization and summaries.
# fitting linear regression model as per avg temp vs hour
by_hour <- group_by(df, hour)
df1 <- summarise(by_hour, T1 = mean(TEMP, na.rm = TRUE))
fit_h <- lm(T1 ~ hour, data = df1)
summary(fit_h)
##
## Call:
## lm(formula = T1 ~ hour, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3577 -2.0117 -0.4459 2.0965 3.9947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.91132 1.02638 10.63 3.92e-10 ***
## hour 0.23244 0.07647 3.04 0.00601 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.593 on 22 degrees of freedom
## Multiple R-squared: 0.2958, Adjusted R-squared: 0.2638
## F-statistic: 9.24 on 1 and 22 DF, p-value: 0.006013
Here we can see that this model(TEMP vs hour) has a p- value less than 0.05. However,the model is insignificant as adj-Rsq is really low.
# fitting linear regression model as per avg temp vs day
by_day <- group_by(df, day)
df2 <- summarise(by_day, T1 = mean(TEMP, na.rm = TRUE))
fit_d <- lm(T1 ~ day, data = df2)
summary(fit_d)
##
## Call:
## lm(formula = T1 ~ day, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.78194 -0.22470 0.00436 0.22682 1.08759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.297306 0.157928 84.199 <2e-16 ***
## day 0.018161 0.008616 2.108 0.0438 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4291 on 29 degrees of freedom
## Multiple R-squared: 0.1329, Adjusted R-squared: 0.103
## F-statistic: 4.443 on 1 and 29 DF, p-value: 0.0438
Here we can see that this model(TEMP vs hour) has a p-value less than 0.05. However,the model is insignificant as adj-Rsq is really low.
One conclusion that can be drawn here is that fitting a linear regression model is not adequately explaining the movement of variable TEMP i.e. the temperature.