About dataset

Content

This dataset is intended for developers who want to train weather-forecasting models for the Indian climate. It provides daily data from 1st January 2013 to 24th April 2017 for the city of Delhi, India. The 4 parameters are meantemp, humidity, wind_speed and meanpressure.

Acknowledgements

This dataset was collected from the Weather Underground API. Dataset ownership and credit go to them.

Split Train and Test

The dataset is split into a Train dataset and a Test dataset. The Train dataset includes records from 01/01/2013 - 01/01/2017. The Test dataset includes records from 01/01/2017 - 24/04/2017. We use the Train dataset to train the models and the Test dataset to check their accuracy.

Now our group imports the Train and Test datasets by using library(readr).

library(readr)
Train <- read_csv("C:/Users/LAPTOP-LC/Downloads/DailyDelhiClimateTrain.csv")
## Rows: 1462 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): date
## dbl (4): meantemp, humidity, wind_speed, meanpressure
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Test <- read_csv("DailyDelhiClimateTest.csv")
## Rows: 114 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): date
## dbl (4): meantemp, humidity, wind_speed, meanpressure
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attach(Train)
names(Train)
## [1] "date"         "meantemp"     "humidity"     "wind_speed"   "meanpressure"
dim(Train)
## [1] 1462    5
head(Train)
## # A tibble: 6 × 5
##   date     meantemp humidity wind_speed meanpressure
##   <chr>       <dbl>    <dbl>      <dbl>        <dbl>
## 1 1/1/2013    10        84.5       0           1016.
## 2 1/2/2013     7.4      92         2.98        1018.
## 3 1/3/2013     7.17     87         4.63        1019.
## 4 1/4/2013     8.67     71.3       1.23        1017.
## 5 1/5/2013     6        86.8       3.7         1016.
## 6 1/6/2013     7        82.8       1.48        1018
dim(Test)
## [1] 114   5
summary(Train)
##      date              meantemp        humidity        wind_speed    
##  Length:1462        Min.   : 6.00   Min.   : 13.43   Min.   : 0.000  
##  Class :character   1st Qu.:18.86   1st Qu.: 50.38   1st Qu.: 3.475  
##  Mode  :character   Median :27.71   Median : 62.62   Median : 6.222  
##                     Mean   :25.50   Mean   : 60.77   Mean   : 6.802  
##                     3rd Qu.:31.31   3rd Qu.: 72.22   3rd Qu.: 9.238  
##                     Max.   :38.71   Max.   :100.00   Max.   :42.220  
##   meanpressure     
##  Min.   :  -3.042  
##  1st Qu.:1001.580  
##  Median :1008.563  
##  Mean   :1011.105  
##  3rd Qu.:1014.945  
##  Max.   :7679.333

The Train dataset has 5 columns (“date”, “meantemp”, “humidity”, “wind_speed”, “meanpressure”) and 1462 records from 01/01/2013 to 01/01/2017. The Test dataset has 114 records with the same 5 columns.
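Since read_csv reports the date column as character, a minimal sketch of converting it to a Date may help when dates are needed for plotting or filtering. It assumes the strings are month/day/year, as the consecutive values in head(Train) suggest; the vector name date_chr is hypothetical and only a few values are copied in.

```r
# Toy stand-in for the first rows of the date column (values copied from head(Train)).
date_chr <- c("1/1/2013", "1/2/2013", "1/3/2013")

# Assumption: the strings are month/day/year.
date_parsed <- as.Date(date_chr, format = "%m/%d/%Y")

# Consecutive calendar days should differ by exactly one day.
day_gaps <- as.numeric(diff(date_parsed))
print(date_parsed)
print(day_gaps)
```

If the assumed format were wrong, as.Date would return NA for days above 12, which is a quick way to check it on the full column.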

We take meantemp as the response variable and “humidity”, “wind_speed”, “meanpressure” as the explanatory variables.

Plot mean temperature

temp_ts <- ts(Train$meantemp, frequency=365, start=c(2013,1))
plot(temp_ts)

The graph shows that the mean temperature in the Train dataset has seasonal variation. Our group first looks for the regression model that best explains mean temperature. We then fit a time series forecasting model to predict the Test dataset and calculate its accuracy.

1. Regression model

Cleaning data

First, we check the NA values in dataframe.

sum(is.na(Train$meantemp))
## [1] 0
sum(is.na(Train$humidity))
## [1] 0
sum(is.na(Train$wind_speed))
## [1] 0
sum(is.na(Train$meanpressure))
## [1] 0

=> The dataset doesn’t contain NA values.

Identify outliers with box plots

boxplot(meantemp, horizontal=FALSE, main="Box plot of meantemp", ylab="Times", col = "pink") 

boxplot(humidity, horizontal=FALSE, main="Box plot of humidity", ylab="Times", col = "pink") 

boxplot(wind_speed, horizontal=FALSE, main="Box plot of wind speed", ylab="Times", col = "pink") 

boxplot(meanpressure, horizontal=FALSE, main="Box plot of mean pressure", ylab="Times", col = "pink") 

Method to remove outliers

The box plots show outliers in wind_speed. The usual rule would be to remove points greater than Q3 + 1.5 * IQR or lower than Q1 - 1.5 * IQR before putting this explanatory variable into the linear regression model. => The dataset has outliers, but we do not remove them: after removing them, the model did not predict better than with the outliers kept.
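The Q1 - 1.5 * IQR and Q3 + 1.5 * IQR rule can be sketched as follows. The helper name iqr_outliers and the toy vector are hypothetical; only the value 42.22 comes from the summary of Train, and this is an illustration of the rule rather than the code our group ran.

```r
# Hypothetical helper: flags values outside [Q1 - k*IQR, Q3 + k*IQR].
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  x < q[1] - k * spread | x > q[2] + k * spread
}

# Toy wind-speed-like values; 42.22 is the maximum reported by summary(Train).
toy_wind <- c(0, 2.98, 4.63, 1.23, 3.7, 1.48, 6.2, 9.2, 42.22)
flags <- iqr_outliers(toy_wind)

toy_wind[flags]   # values the rule would drop
toy_wind[!flags]  # values the rule would keep
```

Returning a logical vector instead of a filtered vector makes it easy to drop the same rows from every column of a data frame.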

We draw a pair plot to see which variable best explains mean temperature.

pairs(data.frame(meantemp,humidity,wind_speed,meanpressure))

Correlogram

library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.2
## corrplot 0.92 loaded
corrplot(cor(Train[2:5]), method="color")

From the pair plot and the correlogram, there seems to be a relationship between mean temperature and wind speed. We therefore build 4 regression models to check it.

1, Simple linear regression: use wind_speed to explain mean temperature.
2, Exponential regression: use wind_speed to explain mean temperature through an exponential model.
3, Multiple regression with 2 variables: add humidity to model 1, giving a model in wind_speed and humidity.
4, Multiple regression with 3 variables: add meanpressure to model 3, giving a model in wind_speed, humidity and meanpressure.

We use MSE and R-squared to assess the accuracy of the models, giving priority to MSE.
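As a hedged sketch of these two criteria, the helpers below compute MSE and an out-of-sample R-squared defined as 1 - SSE/SST. The function names and toy numbers are hypothetical. For comparison, the sketch also includes the ratio of explained to total sum of squares, which coincides with R-squared only for a least-squares fit evaluated on its own training data and is not bounded by 1 on new data.

```r
# Hypothetical helpers; obs = observed values, pred = predicted values.
mse <- function(obs, pred) mean((obs - pred)^2)

# Out-of-sample R-squared: 1 - SSE/SST. Never above 1, may be negative.
r2_oos <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Explained sum of squares over total sum of squares.
ess_ratio <- function(obs, pred) {
  sum((pred - mean(obs))^2) / sum((obs - mean(obs))^2)
}

obs  <- c(16, 18, 17, 19, 18)   # toy observations
pred <- c(23, 26, 20, 28, 22)   # toy predictions that overshoot

mse(obs, pred)
r2_oos(obs, pred)
ess_ratio(obs, pred)
```

On these toy values the predictions are far from the observations, so the 1 - SSE/SST version is negative while the explained-to-total ratio is well above 1.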

1.1 Simple linear regression

We fit a linear model that explains mean temperature by wind speed, following the formula: \[meantemp=\beta_0+\beta_1\times windspeed.\]

Apply the linear regression model

reg <- lm(Train$meantemp~Train$wind_speed)
beta0 <- reg$coef[1];beta0
## (Intercept) 
##    22.13743
beta1 <- reg$coef[2];beta1
## Train$wind_speed 
##        0.4936766

With this result, mean temperature can be determined by the formula: \[meantemp=22.14+0.49\times windspeed.\] We plot mean temperature against wind speed and add the trend line found by the simple linear regression model.

plot(Train$wind_speed,Train$meantemp,col="pink",pch=19,xlab="Wind speed",ylab="Mean temperature")
abline(beta0,beta1)

Predict the mean temperature of Test dataset

pre_meantemp=beta0+beta1*Test$wind_speed
head(data.frame(Test$meantemp,pre_meantemp))
##   Test.meantemp pre_meantemp
## 1      15.91304     23.49182
## 2      18.50000     23.56635
## 3      17.11111     24.12036
## 4      18.70000     24.38119
## 5      18.38889     23.76656
## 6      19.31818     26.42344
plot(seq(114),Test$meantemp,col="pink",type="l")
lines(seq(114),pre_meantemp)
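The manual prediction beta0 + beta1 * wind_speed can also be obtained with predict(), provided the model is fitted with column names in the formula and a data= argument. The sketch below uses made-up toy data frames (toy_train and toy_test are hypothetical names) to show that both routes agree.

```r
# Toy stand-ins for the Train and Test data frames (values are made up).
toy_train <- data.frame(wind_speed = c(0, 3, 5, 8, 12),
                        meantemp   = c(21, 24, 25, 26, 29))
toy_test  <- data.frame(wind_speed = c(2, 6, 10))

# Naming the columns in the formula and passing data= lets predict()
# look up the same column names in new data.
fit <- lm(meantemp ~ wind_speed, data = toy_train)
pred_auto   <- predict(fit, newdata = toy_test)
pred_manual <- coef(fit)[1] + coef(fit)[2] * toy_test$wind_speed

all.equal(unname(pred_auto), unname(pred_manual))
```

A model fitted as lm(Train$meantemp ~ Train$wind_speed) cannot be used this way, because predict() would look for a column literally named Train$wind_speed in the new data.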

Evaluate model

We evaluate the model by calculating MSE and R-squared.

MSE_linear = sum((Test$meantemp-pre_meantemp)^2)/length(Test$meantemp);MSE_linear
## [1] 58.09911
R2_linear=sum((pre_meantemp-mean(Test$meantemp))^2)/sum((Test$meantemp-mean(Test$meantemp))^2);R2_linear
## [1] 0.570298

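The Mean Squared Error of this model on the Test dataset is 58.10. The ratio R2_linear = 0.57 should not be read as 57.03% of the variation in mean temperature being explained by wind speed: it divides the explained sum of squares by the total sum of squares on the Test dataset, which matches R-squared only for a least-squares fit evaluated on its own training data. The training R-squared reported by summary(reg) below is 0.094, so wind speed explains about 9.4% of the variation in mean temperature in the Train dataset.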

summary(reg)
## 
## Call:
## lm(formula = Train$meantemp ~ Train$wind_speed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.496  -5.590   1.823   5.613  14.479 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      22.13743    0.32863   67.36   <2e-16 ***
## Train$wind_speed  0.49368    0.04013   12.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.997 on 1460 degrees of freedom
## Multiple R-squared:  0.09392,    Adjusted R-squared:  0.0933 
## F-statistic: 151.3 on 1 and 1460 DF,  p-value: < 2.2e-16

1.2 Exponential model

Apply the exponential model

reg <- lm(log(Train$meantemp)~Train$wind_speed)
loga <- reg$coef[1]
b <- reg$coef[2];b
## Train$wind_speed 
##       0.02121529
a <- exp(loga);a
## (Intercept) 
##    20.98371

With this result, mean temperature can be determined by the formula: \[meantemp=20.98\times e^{0.021\times windspeed}.\]
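The exponential form follows from the log-linear fit. The regression of log(meantemp) on wind speed estimates \[\log(meantemp)=\log a+b\times windspeed,\] and exponentiating both sides gives \[meantemp=e^{\log a}\times e^{b\times windspeed}=a\times e^{b\times windspeed},\] where a = exp(loga) and b are the values printed above.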

Predict the mean temperature of Test dataset

pred_exp_temp=a*exp(b*Test$wind_speed)
head(data.frame(Test$meantemp,pred_exp_temp))
##   Test.meantemp pred_exp_temp
## 1      15.91304      22.24128
## 2      18.50000      22.31263
## 3      17.11111      22.85023
## 4      18.70000      23.10779
## 5      18.38889      22.50544
## 6      19.31818      25.22748
plot(seq(114),Test$meantemp,col="pink",pch=20,type="l")
lines(seq(114),pred_exp_temp)

Evaluate model

We evaluate the model by calculating MSE and R-squared.

MSE_exp = sum((Test$meantemp-pred_exp_temp)^2)/length(Test$meantemp);MSE_exp
## [1] 49.31626
R2_exp=sum((pred_exp_temp-mean(Test$meantemp))^2)/sum((Test$meantemp-mean(Test$meantemp))^2);R2_exp
## [1] 0.3650258

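The Mean Squared Error of this model on the Test dataset is 49.32. The ratio R2_exp = 0.365 is the explained sum of squares over the total sum of squares on the Test dataset, which is not a true R-squared out of sample, so it should not be read as 36.5% of the variation in mean temperature being explained by wind speed.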

1.3 Multiple regression

1.3.1 With 2 variables

Our group adds one more variable (humidity) to the model to predict mean temperature.

We denote by X the matrix of the 2 variables (humidity and wind_speed), with a leading column of ones for the intercept, and by \(\beta\) the vector of coefficients. We have this formula: \[meantemp = X \times \beta\] or \[meantemp = \beta_0 + \beta_1\times humidity + \beta_2\times windspeed.\]

Apply the multiple regression model

mlreg <- lm(Train$meantemp ~ Train$humidity + Train$wind_speed)
beta_0 = mlreg$coef[1];beta_0
## (Intercept) 
##    38.47482
beta_1 = mlreg$coef[2];beta_1
## Train$humidity 
##     -0.2329802
beta_2 = mlreg$coef[3];beta_2
## Train$wind_speed 
##        0.1733711

We can consider the following model to predict the mean temperature: \[meantemp=38.47-0.23\times humidity+0.17\times windspeed\]
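To connect the matrix form with lm(), the sketch below solves the normal equations (X'X) beta = X'y on made-up toy data and compares the result with the coefficients from lm(). All variable names and values are hypothetical, and the design matrix includes a column of ones for the intercept.

```r
# Toy data (made up) playing the roles of humidity, wind_speed and meantemp.
toy_humidity <- c(84, 92, 87, 71, 60, 55)
toy_wind     <- c(0.0, 3.0, 4.6, 1.2, 7.5, 9.1)
toy_temp     <- c(10.0, 7.4, 7.2, 8.7, 15.3, 18.1)

# Design matrix: a column of ones for the intercept, then the two predictors.
X <- cbind(1, toy_humidity, toy_wind)

# Least-squares coefficients from the normal equations (X'X) beta = X'y.
beta_hat <- solve(t(X) %*% X, t(X) %*% toy_temp)

# Same coefficients from lm(), for comparison.
fit <- lm(toy_temp ~ toy_humidity + toy_wind)
cbind(normal_eq = as.vector(beta_hat), lm = unname(coef(fit)))
```

lm() uses a QR decomposition rather than the normal equations, which is numerically safer, so the two columns agree only up to rounding error.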

Predict mean_temp

pred_mlreg=beta_0+beta_1*Test$humidity+beta_2*Test$wind_speed
Compare = data.frame(pred_mlreg, Test$meantemp)
head(Compare)
##   pred_mlreg Test.meantemp
## 1   18.94455      15.91304
## 2   20.98538      18.50000
## 3   20.09270      17.11111
## 4   22.94253      18.70000
## 5   21.58637      18.38889
## 6   21.50043      19.31818
plot(seq(114),Test$meantemp,col="pink",pch=20,type="l")
lines(seq(114),pred_mlreg)

Evaluate model

MSE_mlreg= sum((Test$meantemp-pred_mlreg)^2)/length(Test$meantemp);MSE_mlreg
## [1] 37.82552
R2_mlreg= sum((pred_mlreg-mean(Test$meantemp))^2)/sum((Test$meantemp-mean(Test$meantemp))^2);R2_mlreg
## [1] 1.184212

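The Mean Squared Error of this model on the Test dataset is 37.83. The ratio R2_mlreg = 1.18 is above 1, which a true R-squared cannot be: it divides the explained sum of squares by the total sum of squares on the Test dataset, and here the predictions vary more around the Test mean than the observed temperatures do. It therefore cannot be read as 118.42% of the variation in mean temperature being explained by wind speed and humidity.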

1.3.2 With 3 variables

Our group adds one more variable (meanpressure) to the model to predict mean temperature.

We denote by X the matrix of the 3 variables (humidity, wind_speed and meanpressure), with a leading column of ones for the intercept, and by \(\beta\) the vector of coefficients. We have this formula: \[meantemp = X \times \beta\] or \[meantemp = \beta_0 + \beta_1\times humidity + \beta_2\times windspeed + \beta_3\times meanpressure.\]

Apply the multiple regression model

mlreg3 <- lm(Train$meantemp ~ Train$humidity + Train$wind_speed + Train$meanpressure)
b_0 = mlreg3$coef[1];b_0
## (Intercept) 
##    39.96173
b_1 = mlreg3$coef[2];b_1
## Train$humidity 
##     -0.2330892
b_2 = mlreg3$coef[3];b_2
## Train$wind_speed 
##        0.1720329
b_3 = mlreg3$coef[4];b_3
## Train$meanpressure 
##       -0.001455031

We can consider the following model to predict the mean temperature: \[meantemp=39.96-0.23\times humidity+0.17\times windspeed-0.0015\times meanpressure\]

Predict mean_temp

pred_mlreg3=b_0+b_1*Test$humidity+b_2*Test$wind_speed+b_3*Test$meanpressure
Compare3 = data.frame(pred_mlreg3, Test$meantemp)
head(Compare3)
##   pred_mlreg3 Test.meantemp
## 1    20.33259      15.91304
## 2    20.97838      18.50000
## 3    20.08361      17.11111
## 4    22.93785      18.70000
## 5    21.58481      18.38889
## 6    21.49492      19.31818
plot(seq(114),Test$meantemp,col="pink",type="l")
lines(seq(114),pred_mlreg3)

Evaluate model

MSE_mlreg3 = sum((Test$meantemp-pred_mlreg3)^2)/length(Test$meantemp);MSE_mlreg3
## [1] 37.84033
R2_mlreg3= sum((pred_mlreg3-mean(Test$meantemp))^2)/sum((Test$meantemp-mean(Test$meantemp))^2);R2_mlreg3
## [1] 1.183557

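The Mean Squared Error of this model on the Test dataset is 37.84. The ratio R2_mlreg3 = 1.18 is above 1, which a true R-squared cannot be, because it divides the explained sum of squares by the total sum of squares on the Test dataset. It therefore cannot be read as 118.36% of the variation in mean temperature being explained by wind speed, humidity and meanpressure.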

1.4 Compare model

MSE = cbind(MSE_linear,MSE_exp,MSE_mlreg,MSE_mlreg3)
R2 = cbind(R2_linear,R2_exp,R2_mlreg,R2_mlreg3)
accuracy = matrix(c(MSE,R2),nrow=2,byrow=TRUE)
colnames(accuracy) <- c("Linear","Exponential","Multi 2 var", "Multi 3 var")
rownames(accuracy) <- c("MSE","R-Squared")
accuracy
##              Linear Exponential Multi 2 var Multi 3 var
## MSE       58.099105  49.3162562   37.825515   37.840334
## R-Squared  0.570298   0.3650258    1.184212    1.183557

Multiple regression with 2 variables shows the lowest MSE. Therefore, our group chooses multiple regression with 2 variables to predict mean temperature.

2. Time series forecasting

Plot mean temperature

temp_ts <- ts(Train$meantemp, frequency=365, start=c(2013,1))
plot(temp_ts)

class(temp_ts)
## [1] "ts"

Time series has seasonal variation.

Check stationarity of the time series with the ADF test

library(tseries)
## Warning: package 'tseries' was built under R version 4.2.2
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
adf.test(temp_ts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  temp_ts
## Dickey-Fuller = -1.8526, Lag order = 11, p-value = 0.6407
## alternative hypothesis: stationary

With p-value = 0.6407, we cannot reject the null hypothesis of a unit root at the 5% level, so temp_ts is treated as non-stationary.

Seasonal differencing time series

Our group differences the time series at lag = 365.

diff365=diff(temp_ts,lag=365)
plot(seq(1097),diff365,type="l")
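To illustrate what differencing at a seasonal lag does, here is a small sketch on a made-up series with period 4 (the names toy, pattern and period are hypothetical). diff(x, lag = s) returns x[t] - x[t - s], so a pattern that repeats every s steps cancels out and the result is s values shorter than the input, just as diff365 has 1462 - 365 = 1097 values.

```r
# Toy series: a repeating pattern of period 4 plus a slow upward drift.
period  <- 4
pattern <- c(10, 20, 30, 20)
toy     <- rep(pattern, times = 5) + 0.1 * (0:19)

# The repeating pattern cancels; only the drift accumulated over one
# period (0.1 * 4) is left in every differenced value.
toy_diff <- diff(toy, lag = period)

length(toy_diff)  # original length minus the lag
range(toy_diff)
```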

We check the stationarity of diff365.

library(forecast)
adf.test(diff365)
## Warning in adf.test(diff365): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  diff365
## Dickey-Fuller = -7.6136, Lag order = 10, p-value = 0.01
## alternative hypothesis: stationary

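With p-value = 0.01 (the warning indicates that the true p-value is smaller), the unit-root hypothesis is rejected, so diff365 is treated as a stationary time series. Therefore, we apply an ARIMA model with a seasonal component.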

SARIMA

We use the function auto.arima to select order = c(p,d,q) for the ARIMA model.

library(forecast)
auto.arima(diff365)
## Series: diff365 
## ARIMA(3,1,1) 
## 
## Coefficients:
##          ar1     ar2      ar3      ma1
##       0.7188  0.0102  -0.0375  -0.9883
## s.e.  0.0311  0.0373   0.0311   0.0081
## 
## sigma^2 = 4.591:  log likelihood = -2389.45
## AIC=4788.9   AICc=4788.95   BIC=4813.9
A = auto.arima(diff365)
# A$arma stores the model orders as c(p, q, P, Q, period, d, D)
orders <- A$arma
p <- orders[1]
q <- orders[2]
P <- orders[3]
Q <- orders[4]
d <- orders[6]
D <- orders[7]

Apply SARIMA

Predict <- arima(temp_ts,order=c(3,1,1),seasonal=list(order=c(0,1,0),period=365))
Predict
## 
## Call:
## arima(x = temp_ts, order = c(3, 1, 1), seasonal = list(order = c(0, 1, 0), period = 365))
## 
## Coefficients:
##          ar1     ar2      ar3      ma1
##       0.7188  0.0102  -0.0375  -0.9883
## s.e.  0.0311  0.0373   0.0311   0.0081
## 
## sigma^2 estimated as 4.575:  log likelihood = -2392.92,  aic = 4795.84

We check the residuals of the Predict model.

library(forecast)
checkresiduals(Predict)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,1)(0,1,0)[365]
## Q* = 367.14, df = 288, p-value = 0.001095
## 
## Model df: 4.   Total lags used: 292

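The Ljung-Box test gives p-value = 0.001095, so the hypothesis of uncorrelated residuals is rejected at the 5% level: some autocorrelation remains in the residuals. We keep this model as a working choice and check its accuracy on the Train and Test datasets.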
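The Ljung-Box test can also be run directly with Box.test from base R. The sketch below applies it to simulated toy series rather than to the residuals of Predict: one white-noise series and one autocorrelated series built from it. The seed, series length and lag are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed so the toy example is reproducible

# Toy residuals: white noise, and an AR(1)-like series built from it.
white <- rnorm(300)
ar1   <- as.numeric(stats::filter(white, filter = 0.7, method = "recursive"))

# The null hypothesis is "no autocorrelation up to the given lag".
# fitdf would be the number of fitted ARMA coefficients when testing
# model residuals; it is 0 here because nothing was fitted.
lb_white <- Box.test(white, lag = 10, type = "Ljung-Box", fitdf = 0)
lb_ar1   <- Box.test(ar1,   lag = 10, type = "Ljung-Box", fitdf = 0)

c(white = lb_white$p.value, ar1 = lb_ar1$p.value)
```

A small p-value, as for the autocorrelated series, means the series still contains structure that the model has not captured.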

accuracy(forecast(Predict),Test$meantemp)
##                       ME     RMSE      MAE         MPE      MAPE      MASE
## Training set  0.01624947 1.851854 1.212622  -0.2772842  5.181115 0.9785001
## Test set     -2.86057334 4.179003 3.495903 -14.5198544 17.948692 2.8209458
##                     ACF1
## Training set 0.002660007
## Test set              NA

The Test errors are larger than the Train errors (RMSE 4.18 against 1.85, MAPE 17.95% against 5.18%), but we judge them acceptable and decide to use this model to predict mean temperature.

plot(forecast(Predict))

Comparing Regression and Time series forecasting

With multiple regression on 2 variables, the MSE is 37.83.

With time series forecasting (SARIMA), the MSE is RMSE^2 = 4.179003^2 = 17.46.

Thus, time series forecasting with SARIMA is more accurate, with a lower MSE.
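The conversion from RMSE to MSE used in this comparison can be checked with a few lines. The two numbers are copied from the outputs above; the variable names are hypothetical.

```r
# Values copied from the outputs above.
mse_regression <- 37.825515   # MSE_mlreg on the Test dataset
rmse_sarima    <- 4.179003    # Test set RMSE from accuracy()

# MSE is the square of RMSE, which puts both models on the same scale.
mse_sarima <- rmse_sarima^2
round(c(regression = mse_regression, sarima = mse_sarima), 2)
```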