Section 1

This data set includes hourly air pollutants data from 12 nationally-controlled air-quality monitoring sites. The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data in each air-quality site are matched with the nearest weather station from the China Meteorological Administration. The time period is from March 1st, 2013 to February 28th, 2017. The station of interest and hence the region of interest is Aotizhongxin.

The air pollutants tracked are PM2.5, PM10, SO2, NO2, CO, O3. Along with that other air important indicators such as TEMP, PRES, DEWP, RAIN, wd (winddirection), WSPM (windspeed in m/s) are also recorded.

# Load libraries
library(dplyr)
library(ggplot2)
library(janitor)
library(lubridate)
library(readr)
library(zoo)

df <- read.csv("D:/Downloads/PRSA2017_Data_20130301-20170228/PRSA_Data_20130301-20170228/PRSA_Data_Aotizhongxin_20130301-20170228.csv")

head(df)
##   No year month day hour PM2.5 PM10 SO2 NO2  CO O3 TEMP   PRES  DEWP RAIN  wd
## 1  1 2013     3   1    0     4    4   4   7 300 77 -0.7 1023.0 -18.8    0 NNW
## 2  2 2013     3   1    1     8    8   4   7 300 77 -1.1 1023.2 -18.2    0   N
## 3  3 2013     3   1    2     7    7   5  10 300 73 -1.1 1023.5 -18.2    0 NNW
## 4  4 2013     3   1    3     6    6  11  11 300 72 -1.4 1024.5 -19.4    0  NW
## 5  5 2013     3   1    4     3    3  12  12 300 72 -2.0 1025.2 -19.5    0   N
## 6  6 2013     3   1    5     5    5  18  18 400 66 -2.2 1025.6 -19.6    0   N
##   WSPM      station
## 1  4.4 Aotizhongxin
## 2  4.7 Aotizhongxin
## 3  5.6 Aotizhongxin
## 4  3.1 Aotizhongxin
## 5  2.0 Aotizhongxin
## 6  3.7 Aotizhongxin

Temperature is the variable of interest. Logically speaking, the temperature is lower at night and goes higher around the noon and then drops as evening approaches. Two most plausible factors that could cause this variation are presence of Sun and vehicular activity due to humans.

Particulates like SO2, NO2 etc. are a function of human activities like vehicular emmissions, factory emmissions etc. Since, humans usually are involved in such activities during the normal working hours, there is a usual rise in the concentrations of those particulates in such hours.

For this project, Temperature is the variable of interest. However, there are many significant indicators of air quality. This dataset has many null values which makes it difficult to forecast as important information is lost at many time instances. Hence, only for this assignment, Temperature, which has only 20 null values, is chosen for forecasting.

Section 2

attach(df)
boxplot(TEMP)

#looks like a bimodal distribution
hist(TEMP)

#summarising by hour
#avg. temp increases as day approaches and decreases from evening to night till early morning
by_hour <- group_by(df, hour)
df1 <- summarise(by_hour, T1 = mean(TEMP, na.rm = TRUE)) 
ggplot(df1, aes(x=hour, y=T1)) +
  geom_line()

#summarising by day
by_day <- group_by(df, day)
df2 <- summarise(by_day, T1 = mean(TEMP, na.rm = TRUE)) 
ggplot(df2, aes(x=day, y=T1)) +
  geom_line()

#summarising by month
by_month <- group_by(df, month)
df3 <- summarise(by_month, T1 = mean(TEMP, na.rm = TRUE)) 
ggplot(df3, aes(x=month, y=T1)) +
  geom_line()

#range, mean, median
summary(df$TEMP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -16.80    3.10   14.50   13.58   23.30   40.50      20
#standard deviation
sd(df$TEMP, na.rm = TRUE)
## [1] 11.3991

Section 3

The intended variable of interest TEMP has no outliers. This can be confirmed from the previous boxplot visualization and summaries.

# fitting linear regression model as per avg temp vs hour
by_hour <- group_by(df, hour)
df1 <- summarise(by_hour, T1 = mean(TEMP, na.rm = TRUE))

fit_h <- lm(T1 ~ hour, data = df1)
summary(fit_h)
## 
## Call:
## lm(formula = T1 ~ hour, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3577 -2.0117 -0.4459  2.0965  3.9947 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.91132    1.02638   10.63 3.92e-10 ***
## hour         0.23244    0.07647    3.04  0.00601 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 22 degrees of freedom
## Multiple R-squared:  0.2958, Adjusted R-squared:  0.2638 
## F-statistic:  9.24 on 1 and 22 DF,  p-value: 0.006013

Here we can see that this model(TEMP vs hour) has a p- value less than 0.05. However,the model is insignificant as adj-Rsq is really low.

# fitting linear regression model as per avg temp vs day
by_day <- group_by(df, day)
df2 <- summarise(by_day, T1 = mean(TEMP, na.rm = TRUE))

fit_d <- lm(T1 ~ day, data = df2)
summary(fit_d)
## 
## Call:
## lm(formula = T1 ~ day, data = df2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.78194 -0.22470  0.00436  0.22682  1.08759 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.297306   0.157928  84.199   <2e-16 ***
## day          0.018161   0.008616   2.108   0.0438 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4291 on 29 degrees of freedom
## Multiple R-squared:  0.1329, Adjusted R-squared:  0.103 
## F-statistic: 4.443 on 1 and 29 DF,  p-value: 0.0438

Here we can see that this model(TEMP vs hour) has a p-value less than 0.05. However,the model is insignificant as adj-Rsq is really low.

One conclusion that can be drawn here is that fitting a linear regression model is not adequately explaining the movement of variable TEMP i.e. the temperature.