About dataset
Content
This dataset is intended for developers who want to train weather
forecasting models for the Indian climate. It provides data from
1st January 2013 to 24th April 2017 for the city of Delhi, India.
The 4 parameters are meantemp, humidity, wind_speed and
meanpressure.
Acknowledgements
This dataset was collected from the Weather Underground API. Dataset
ownership and credit go to them.
Split Train and Test
The dataset is split into a Train and a Test dataset. The Train dataset
includes records from 01/01/2013 to 01/01/2017. The Test dataset includes
records from 01/01/2017 to 24/04/2017. We use the Train dataset to fit the
models and the Test dataset to check their accuracy.
Our group now imports the Train and Test datasets using
library(readr).
library(readr)
Train <- read_csv("C:/Users/LAPTOP-LC/Downloads/DailyDelhiClimateTrain.csv")
## Rows: 1462 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): date
## dbl (4): meantemp, humidity, wind_speed, meanpressure
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Test <- read_csv("DailyDelhiClimateTest.csv")
## Rows: 114 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): date
## dbl (4): meantemp, humidity, wind_speed, meanpressure
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attach(Train)
names(Train)
## [1] "date" "meantemp" "humidity" "wind_speed" "meanpressure"
dim(Train)
## [1] 1462 5
head(Train)
## # A tibble: 6 × 5
## date meantemp humidity wind_speed meanpressure
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1/1/2013 10 84.5 0 1016.
## 2 1/2/2013 7.4 92 2.98 1018.
## 3 1/3/2013 7.17 87 4.63 1019.
## 4 1/4/2013 8.67 71.3 1.23 1017.
## 5 1/5/2013 6 86.8 3.7 1016.
## 6 1/6/2013 7 82.8 1.48 1018
dim(Test)
## [1] 114 5
summary(Train)
## date meantemp humidity wind_speed
## Length:1462 Min. : 6.00 Min. : 13.43 Min. : 0.000
## Class :character 1st Qu.:18.86 1st Qu.: 50.38 1st Qu.: 3.475
## Mode :character Median :27.71 Median : 62.62 Median : 6.222
## Mean :25.50 Mean : 60.77 Mean : 6.802
## 3rd Qu.:31.31 3rd Qu.: 72.22 3rd Qu.: 9.238
## Max. :38.71 Max. :100.00 Max. :42.220
## meanpressure
## Min. : -3.042
## 1st Qu.:1001.580
## Median :1008.563
## Mean :1011.105
## 3rd Qu.:1014.945
## Max. :7679.333
The Train dataset has 5 columns (“date”, “meantemp”, “humidity”,
“wind_speed”, “meanpressure”) and 1462 records from 01/01/2013 to
01/01/2017. The Test dataset has 114 records with the same 5 columns.
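The date column is read as character, so it may help to convert it before any date-based filtering or plotting. The sketch below uses a hypothetical three-row sample (`sample_rows` and its values are made up for illustration) and assumes the file stores dates as month/day/year, as the first rows of the Train output suggest.

```r
# Hypothetical sample in the same layout as the Train file.
sample_rows <- data.frame(
  date = c("1/1/2013", "1/2/2013", "12/31/2013"),
  meantemp = c(10, 7.4, 15.2),
  stringsAsFactors = FALSE
)

# Assumption: dates are written month/day/year.
sample_rows$date <- as.Date(sample_rows$date, format = "%m/%d/%Y")

class(sample_rows$date)
range(sample_rows$date)
```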
We take meantemp as the response variable and “humidity”,
“wind_speed” and “meanpressure” as the explanatory variables.
Plot mean temperature
temp_ts <- ts(Train$meantemp, frequency=365, start=c(2013,1))
plot(temp_ts)

The graph shows that the mean temperature in the Train dataset has
seasonal variation. Our group will fit several regression models to find
the one that best explains mean temperature. In addition, we fit a time
series forecasting model to predict the Test dataset and
calculate its accuracy.
1. Regression model
Cleaning data
First, we check the NA values in dataframe.
sum(is.na(Train$meantemp))
## [1] 0
sum(is.na(Train$humidity))
## [1] 0
sum(is.na(Train$wind_speed))
## [1] 0
sum(is.na(Train$meanpressure))
## [1] 0
=> The dataset doesn’t contain NA values.
Identify outliers with box plots
boxplot(meantemp, horizontal=FALSE, main="Box plot of meantemp", ylab="Times", col = "pink")

boxplot(humidity, horizontal=FALSE, main="Box plot of humidity", ylab="Times", col = "pink")

boxplot(wind_speed, horizontal=FALSE, main="Box plot of wind speed", ylab="Times", col = "pink")

boxplot(meanpressure, horizontal=FALSE, main="Box plot of mean pressure", ylab="Times", col = "pink")

Method to remove outliers
The box plots show outliers in wind_speed. The usual rule flags
points greater than Q3 + 1.5 * IQR or lower than Q1 - 1.5 * IQR.
We tried removing these points before putting wind_speed into the
linear regression model, but the resulting model did not predict
better than the model fitted on all records. => We therefore keep
the outliers in the dataset.
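The Q1 - 1.5 * IQR and Q3 + 1.5 * IQR rule can be sketched as a small helper. The function name `iqr_outliers` and the sample vector `wind` are hypothetical and only illustrate the rule; they are not taken from the report.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR].
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  x < q[1] - k * spread | x > q[2] + k * spread
}

# Hypothetical wind speeds with one extreme value.
wind <- c(2, 3, 4, 5, 6, 7, 8, 42)
flags <- iqr_outliers(wind)
wind[flags]   # values the rule would flag
wind[!flags]  # values that would remain after removal
```

With the real data, the same helper could be applied to Train$wind_speed to compare models fitted with and without the flagged rows.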
We draw a pair plot to see which variable best explains mean
temperature.
pairs(data.frame(meantemp,humidity,wind_speed,meanpressure))

Correlogram
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.2
## corrplot 0.92 loaded
corrplot(cor(Train[2:5]), method="color")

Looking at the pair plot and the correlogram, there seems to
be a relationship between mean temperature and wind speed. Therefore, we
build 4 regression models to verify it.
1, Simple linear regression: use wind_speed to explain mean
temperature.
2, Exponential regression: use wind_speed to explain mean
temperature with an exponential model.
3, Multiple regression with 2 variables: we add the humidity variable to
model 1, so the explanatory variables are wind_speed
and humidity.
4, Multiple regression with 3 variables: we add the meanpressure
variable to model 3, so the explanatory variables are
wind_speed, humidity and meanpressure.
We use MSE and R-squared to estimate the accuracy of the models,
giving priority to MSE.
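As a reference for the two metrics, here is a minimal sketch of MSE and of a held-out R-squared written as 1 - SSE/SST. The helper names `mse` and `r2_oos` and the small vectors are hypothetical. The 1 - SSE/SST form is one common choice for test data because it can never exceed 1.

```r
# Mean squared error between observed and predicted values.
mse <- function(actual, predicted) mean((actual - predicted)^2)

# R-squared as 1 - SSE/SST; at most 1, and negative when the model
# predicts worse than the mean of the observed values.
r2_oos <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - mean(actual))^2)
  1 - sse / sst
}

# Hypothetical observed and predicted temperatures.
actual <- c(10, 12, 14, 16)
predicted <- c(11, 12, 13, 17)
mse(actual, predicted)
r2_oos(actual, predicted)
```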
1.1 Simple linear regression
We fit a linear model that explains mean temperature with
respect to wind speed, following the formula: \[meantemp=\beta_0+\beta_1 windspeed.\]
Applied linear regression model
reg <- lm(Train$meantemp~Train$wind_speed)
beta0 <- reg$coef[1];beta0
## (Intercept)
## 22.13743
beta1 <- reg$coef[2];beta1
## Train$wind_speed
## 0.4936766
With this result, mean temperature can be estimated by
the formula: \[meantemp=22.14+0.49\times
windspeed.\] We plot mean temperature against wind speed and add
the trend line found by the simple linear regression model.
plot(Train$wind_speed,Train$meantemp,col="pink",pch=19,xlab="Wind speed",ylab="Mean temperature")
abline(beta0,beta1)

Predict the mean temperature of Test dataset
pre_meantemp=beta0+beta1*Test$wind_speed
head(data.frame(Test$meantemp,pre_meantemp))
## Test.meantemp pre_meantemp
## 1 15.91304 23.49182
## 2 18.50000 23.56635
## 3 17.11111 24.12036
## 4 18.70000 24.38119
## 5 18.38889 23.76656
## 6 19.31818 26.42344
plot(seq(114),Test$meantemp,col="pink",type="l")
lines(seq(114),pre_meantemp)
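An alternative way to obtain the same kind of predictions is to fit with a `data =` argument and call predict() on new data. The sketch below uses synthetic data (`train_demo` and `test_demo` are hypothetical stand-ins for Train and Test) and checks that predict() agrees with the manual beta0 + beta1 * x computation.

```r
set.seed(1)

# Synthetic stand-ins for the Train and Test data frames.
train_demo <- data.frame(wind_speed = runif(50, 0, 15))
train_demo$meantemp <- 22 + 0.5 * train_demo$wind_speed + rnorm(50)
test_demo <- data.frame(wind_speed = c(1, 5, 10))

# Fitting with data = lets predict() match columns by name.
fit <- lm(meantemp ~ wind_speed, data = train_demo)
pred <- predict(fit, newdata = test_demo)

# Same values computed by hand from the coefficients.
manual <- coef(fit)[1] + coef(fit)[2] * test_demo$wind_speed
all.equal(unname(pred), unname(manual))
```

Writing the formula as Train$meantemp ~ Train$wind_speed prevents predict() from using new data, which is why the manual computation is needed in that case.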

Evaluate model
We are going to evaluate model by calculating MSE and R-squared.
MSE_linear = sum((Test$meantemp-pre_meantemp)^2)/length(Test$meantemp);MSE_linear
## [1] 58.09911
R2_linear=sum((pre_meantemp-mean(Test$meantemp))^2)/sum((Test$meantemp-mean(Test$meantemp))^2);R2_linear
## [1] 0.570298
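The Mean Squared Error of this model on the Test dataset is 58.10. The
ratio computed above is 0.57, but it compares predictions from the
training fit against Test data, so it is not a true R-squared. On the
Train dataset, the summary below reports a multiple R-squared of 0.094:
wind speed alone explains about 9.4% of the variation in mean temperature.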
summary(reg)
##
## Call:
## lm(formula = Train$meantemp ~ Train$wind_speed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.496 -5.590 1.823 5.613 14.479
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.13743 0.32863 67.36 <2e-16 ***
## Train$wind_speed 0.49368 0.04013 12.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.997 on 1460 degrees of freedom
## Multiple R-squared: 0.09392, Adjusted R-squared: 0.0933
## F-statistic: 151.3 on 1 and 1460 DF, p-value: < 2.2e-16
1.2 Exponential model
Applied exponential model
reg <- lm(log(Train$meantemp)~Train$wind_speed)
loga <- reg$coef[1]
b <- reg$coef[2];b
## Train$wind_speed
## 0.02121529
a <- exp(loga);a
## (Intercept)
## 20.98371
With this result, mean temperature can be estimated by
the formula: \[meantemp=20.98\times
e^{0.021\times windspeed}.\]
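The exponential model is fitted as a linear model on the log scale: log(meantemp) = log(a) + b * windspeed, so a is recovered by exponentiating the intercept. A minimal sketch on noise-free synthetic data (the values 2 and 0.3 are arbitrary choices for illustration) shows the round trip.

```r
# Noise-free synthetic data generated from y = 2 * exp(0.3 * x).
x <- 0:10
y <- 2 * exp(0.3 * x)

# Linear fit on the log scale.
fit_log <- lm(log(y) ~ x)

# Back-transform the intercept; the slope is b directly.
a_hat <- exp(coef(fit_log)[[1]])
b_hat <- coef(fit_log)[[2]]
c(a = a_hat, b = b_hat)
```

Because log() requires positive values, this approach only works when every observed temperature is above zero, which holds for the Train dataset (minimum 6.00).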
Predict the mean temperature of Test dataset
pred_exp_temp=a*exp(b*Test$wind_speed)
head(data.frame(Test$meantemp,pred_exp_temp))
## Test.meantemp pred_exp_temp
## 1 15.91304 22.24128
## 2 18.50000 22.31263
## 3 17.11111 22.85023
## 4 18.70000 23.10779
## 5 18.38889 22.50544
## 6 19.31818 25.22748
plot(seq(114),Test$meantemp,col="pink",pch=20,type="l")
lines(seq(114),pred_exp_temp)

Evaluate model
We are going to evaluate model by calculating MSE and R-squared.
MSE_exp = sum((Test$meantemp-pred_exp_temp)^2)/length(Test$meantemp);MSE_exp
## [1] 49.31626
R2_exp=sum((pred_exp_temp-mean(Test$meantemp))^2)/sum((Test$meantemp-mean(Test$meantemp))^2);R2_exp
## [1] 0.3650258
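The Mean Squared Error of this model on the Test dataset is 49.32. The
ratio computed above is 0.365, but since it is computed on Test data from
a model fitted on the Train dataset, it should not be read as the share
of variation explained by wind speed.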
1.3 Multiple regression
1.3.1 With 2 variables
Our group adds one more variable (humidity) to the model to predict mean
temperature.
We denote by X the design matrix (a column of ones plus the 2 variables
humidity and wind_speed) and by \(\beta\) the vector of coefficients.
We have this formula: \[meantemp =
X \times \beta\] or \[meantemp = \beta_0
+ \beta_1\times humidity + \beta_2\times windspeed.\]
Applied multiple regression model
mlreg <- lm(Train$meantemp ~ Train$humidity + Train$wind_speed)
beta_0 = mlreg$coef[1];beta_0
## (Intercept)
## 38.47482
beta_1 = mlreg$coef[2];beta_1
## Train$humidity
## -0.2329802
beta_2 = mlreg$coef[3];beta_2
## Train$wind_speed
## 0.1733711
We can therefore consider the following model to predict
the mean temperature: \[meantemp=38.47-0.23\times humidity+0.17\times
windspeed\]
Predict mean_temp
pred_mlreg=beta_0+beta_1*Test$humidity+beta_2*Test$wind_speed
Compare = data.frame(pred_mlreg, Test$meantemp)
head(Compare)
## pred_mlreg Test.meantemp
## 1 18.94455 15.91304
## 2 20.98538 18.50000
## 3 20.09270 17.11111
## 4 22.94253 18.70000
## 5 21.58637 18.38889
## 6 21.50043 19.31818
plot(seq(114),Test$meantemp,col="pink",pch=20,type="l")
lines(seq(114),pred_mlreg)

Evaluate model
MSE_mlreg= sum((Test$meantemp-pred_mlreg)^2)/length(Test$meantemp);MSE_mlreg
## [1] 37.82552
R2_mlreg= sum((pred_mlreg-mean(Test$meantemp))^2)/sum((Test$meantemp-mean(Test$meantemp))^2);R2_mlreg
## [1] 1.184212
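The Mean Squared Error of this model on the Test dataset is 37.83. The
ratio computed above is 1.18, which exceeds 1 and therefore cannot be read
as a share of explained variation: the explained-over-total sum of squares
equals R-squared only on the data used to fit the model.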
1.3.2 With 3 variables
Our group adds one more variable (meanpressure) to the model to predict
mean temperature.
We denote by X the design matrix (a column of ones plus the 3 variables
humidity, wind_speed and meanpressure) and by \(\beta\) the vector of
coefficients. We have this formula: \[meantemp = X \times \beta\] or \[meantemp = \beta_0 + \beta_1\times humidity +
\beta_2\times windspeed + \beta_3\times meanpressure.\]
Applied multiple regression model
mlreg3 <- lm(Train$meantemp ~ Train$humidity + Train$wind_speed + Train$meanpressure)
b_0 = mlreg3$coef[1];b_0
## (Intercept)
## 39.96173
b_1 = mlreg3$coef[2];b_1
## Train$humidity
## -0.2330892
b_2 = mlreg3$coef[3];b_2
## Train$wind_speed
## 0.1720329
b_3 = mlreg3$coef[4];b_3
## Train$meanpressure
## -0.001455031
We can therefore consider the following model to predict
the mean temperature: \[meantemp=39.96-0.23\times humidity+0.17\times
windspeed-0.0015\times meanpressure\]
Predict mean_temp
pred_mlreg3=b_0+b_1*Test$humidity+b_2*Test$wind_speed+b_3*Test$meanpressure
Compare3 = data.frame(pred_mlreg3, Test$meantemp)
head(Compare3)
## pred_mlreg3 Test.meantemp
## 1 20.33259 15.91304
## 2 20.97838 18.50000
## 3 20.08361 17.11111
## 4 22.93785 18.70000
## 5 21.58481 18.38889
## 6 21.49492 19.31818
plot(seq(114),Test$meantemp,col="pink",type="l")
lines(seq(114),pred_mlreg3)

Evaluate model
MSE_mlreg3 = sum((Test$meantemp-pred_mlreg3)^2)/length(Test$meantemp);MSE_mlreg3
## [1] 37.84033
R2_mlreg3= sum((pred_mlreg3-mean(Test$meantemp))^2)/sum((Test$meantemp-mean(Test$meantemp))^2);R2_mlreg3
## [1] 1.183557
The Mean Squared Error of this model on the Test dataset is 37.84. The
ratio computed above is 1.18, which exceeds 1 and therefore cannot be read
as the share of variation explained by wind speed, humidity and
meanpressure.
1.4 Compare model
MSE = cbind(MSE_linear,MSE_exp,MSE_mlreg,MSE_mlreg3)
R2 = cbind(R2_linear,R2_exp,R2_mlreg,R2_mlreg3)
accuracy = matrix(c(MSE,R2),nrow=2,byrow=TRUE)
colnames(accuracy) <- c("Linear","Exponential","Multi 2 var", "Multi 3 var")
rownames(accuracy) <- c("MSE","R-Squared")
accuracy
## Linear Exponential Multi 2 var Multi 3 var
## MSE 58.099105 49.3162562 37.825515 37.840334
## R-Squared 0.570298 0.3650258 1.184212 1.183557
Multiple regression with 2 variables shows the lowest MSE (37.83),
marginally below the 3-variable model (37.84). Therefore, our group
chooses multiple regression with 2 variables to predict mean temperature.
2. Time series forecasting
Plot mean temperature
temp_ts <- ts(Train$meantemp, frequency=365, start=c(2013,1))
plot(temp_ts)

class(temp_ts)
## [1] "ts"
The time series has seasonal variation.
Check stationarity of the time series with the ADF test
library(tseries)
## Warning: package 'tseries' was built under R version 4.2.2
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
adf.test(temp_ts)
##
## Augmented Dickey-Fuller Test
##
## data: temp_ts
## Dickey-Fuller = -1.8526, Lag order = 11, p-value = 0.6407
## alternative hypothesis: stationary
With p-value = 0.6407, we cannot reject the null hypothesis of a unit
root, so we treat temp_ts as non-stationary.
Seasonal differencing time series
Our group differences the time series at lag = 365.
diff365=diff(temp_ts,lag=365)
plot(seq(1097),diff365,type="l")
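Seasonal differencing at lag s replaces each value with y_t - y_{t-s}, so the result is s observations shorter than the input. A minimal sketch on a hypothetical quarterly-style series (lag 4 instead of 365, values made up) makes the operation explicit.

```r
# Hypothetical short series with a period of 4.
x <- ts(c(1, 2, 3, 4, 6, 8, 10, 12), frequency = 4)

# diff() with lag = 4 computes x[t] - x[t - 4].
d4 <- diff(x, lag = 4)

# The same differences computed by hand.
manual <- x[5:8] - x[1:4]

as.numeric(d4)
length(x) - length(d4)  # number of observations lost
```

This is why the differenced Train series has 1462 - 365 = 1097 points.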

We check the stationarity of diff365.
library(forecast)
adf.test(diff365)
## Warning in adf.test(diff365): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff365
## Dickey-Fuller = -7.6136, Lag order = 10, p-value = 0.01
## alternative hypothesis: stationary
With p-value = 0.01 (the warning indicates the true p-value is smaller),
we reject the unit-root null hypothesis and treat diff365 as stationary.
Therefore, we are going to apply an ARIMA model with a seasonal component.
SARIMA
We use the function auto.arima to select order = c(p,d,q) for the ARIMA
model.
library(forecast)
auto.arima(diff365)
## Series: diff365
## ARIMA(3,1,1)
##
## Coefficients:
## ar1 ar2 ar3 ma1
## 0.7188 0.0102 -0.0375 -0.9883
## s.e. 0.0311 0.0373 0.0311 0.0081
##
## sigma^2 = 4.591: log likelihood = -2389.45
## AIC=4788.9 AICc=4788.95 BIC=4813.9
A <- auto.arima(diff365)
# A$arma stores the orders as c(p, q, P, Q, period, d, D)
arma <- A$arma
p <- arma[1]
q <- arma[2]
P <- arma[3]
Q <- arma[4]
d <- arma[6]
D <- arma[7]
Applied SARIMA
Predict <- arima(temp_ts,order=c(3,1,1),seasonal=list(order=c(0,1,0),period=365))
Predict
##
## Call:
## arima(x = temp_ts, order = c(3, 1, 1), seasonal = list(order = c(0, 1, 0), period = 365))
##
## Coefficients:
## ar1 ar2 ar3 ma1
## 0.7188 0.0102 -0.0375 -0.9883
## s.e. 0.0311 0.0373 0.0311 0.0081
##
## sigma^2 estimated as 4.575: log likelihood = -2392.92, aic = 4795.84
We check the residual of Predict model.
library(forecast)
checkresiduals(Predict)

##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,1)(0,1,0)[365]
## Q* = 367.14, df = 288, p-value = 0.001095
##
## Model df: 4. Total lags used: 292
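The Ljung-Box test rejects the hypothesis of uncorrelated residuals
(p-value = 0.001095), so some autocorrelation remains in the residuals.
We still keep this model as a working approximation and check its
accuracy on the Train and Test datasets.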
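To read the Ljung-Box output, it can help to see the test on a series that is white noise by construction. The sketch below uses Box.test() from base R's stats package on simulated data (`white_noise` is hypothetical); the null hypothesis is that the series has no autocorrelation up to the chosen lag, so a small p-value is evidence of leftover structure.

```r
set.seed(42)

# Simulated white noise, standing in for well-behaved residuals.
white_noise <- rnorm(500)

# Ljung-Box test over the first 20 lags.
lb <- Box.test(white_noise, lag = 20, type = "Ljung-Box")
lb$statistic
lb$p.value
```

When the test is applied to residuals of a fitted ARIMA model, the `fitdf` argument of Box.test() can be set to the number of estimated coefficients to adjust the degrees of freedom.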
accuracy(forecast(Predict),Test$meantemp)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01624947 1.851854 1.212622 -0.2772842 5.181115 0.9785001
## Test set -2.86057334 4.179003 3.495903 -14.5198544 17.948692 2.8209458
## ACF1
## Training set 0.002660007
## Test set NA
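The Test RMSE (4.18) is more than twice the Training RMSE (1.85), so the
model fits the Train dataset better than it forecasts. The Test error is
still moderate, so we decide to use this model to predict mean temperature.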
plot(forecast(Predict))

Comparing Regression and Time series forecasting
With multiple regression on 2 variables, the Test MSE is 37.83.
With time series forecasting (SARIMA), the Test MSE is RMSE^2 =
4.179003^2 = 17.46.
Thus, our group concludes that time series forecasting with SARIMA is
more accurate on the Test dataset, with a lower MSE.
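The conversion from RMSE to MSE used in this comparison can be reproduced directly. The two input numbers below are copied from the outputs earlier in the report; the variable names are only illustrative.

```r
# Test-set RMSE reported by accuracy() for the SARIMA model.
rmse_sarima <- 4.179003
mse_sarima <- rmse_sarima^2

# Test-set MSE of the 2-variable regression (MSE_mlreg).
mse_regression <- 37.82552

round(mse_sarima, 2)
mse_sarima < mse_regression
```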