Weather forecasting is the application of science and technology to predict the state of the atmosphere for a given location and time. Forecasts are made by collecting data on the current state of the atmosphere and applying a scientific understanding of atmospheric processes to project how it will evolve.
Weather has a great impact on many aspects of human life, so knowing what it will do matters to individuals and organizations alike. Accurate weather forecasts can tell a farmer the best time to plant; an airport control tower what information to send to aircraft that are landing and taking off; and residents of a coastal region when a hurricane might strike. The goal of weather prediction is to give people information in advance so that they can make better decisions.
The weather dataset used for our analysis is taken from Kaggle and can be found here. It has 21 variables and 366 observations. The variables include the minimum and maximum temperature, humidity, pressure, wind direction, wind speed and other weather-related factors. There is much to explore in this dataset, but we will mainly focus on predicting rainfall, specifically whether it will rain the next day (RainTomorrow), from the other environmental factors. We will use supervised learning methods, namely logistic regression and a decision tree, to build the prediction models.
Packages Required
library(rpart)
library(rpart.plot)
library(ggplot2)
library(lattice)
library(caret)
library(rattle)
library(gplots)
library(ROCR)
library(randomForest)
library(dplyr)
library(tidyverse)
library(reshape2)
Importing the Dataset
weather <- read.csv("/Users/sindhuherle/Documents/Data mining/weather.csv")
Before analyzing the data, let us first inspect it with the head() and str() functions to understand the variables.
head(weather)
## MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
## 1 8.0 24.3 0.0 3.4 6.3 NW 30
## 2 14.0 26.9 3.6 4.4 9.7 ENE 39
## 3 13.7 23.4 3.6 5.8 3.3 NW 85
## 4 13.3 15.5 39.8 7.2 9.1 NW 54
## 5 7.6 16.1 2.8 5.6 10.6 SSE 50
## 6 6.2 16.9 0.0 5.8 8.2 SE 44
## WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm
## 1 SW NW 6 20 68 29
## 2 E W 4 17 80 36
## 3 N NNE 6 6 82 69
## 4 WNW W 30 24 62 56
## 5 SSE ESE 20 28 68 49
## 6 SE E 20 24 70 57
## Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
## 1 1019.7 1015.0 7 7 14.4 23.6 No
## 2 1012.4 1008.4 5 3 17.5 25.7 Yes
## 3 1009.5 1007.2 8 7 15.4 20.2 Yes
## 4 1005.5 1007.0 2 7 13.5 14.1 Yes
## 5 1018.3 1018.5 7 7 11.1 15.4 Yes
## 6 1023.8 1021.7 7 5 10.9 14.8 No
## RainTomorrow
## 1 Yes
## 2 Yes
## 3 Yes
## 4 Yes
## 5 No
## 6 No
str(weather)
## 'data.frame': 366 obs. of 21 variables:
## $ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 8.8 8.4 ...
## $ MaxTemp : num 24.3 26.9 23.4 15.5 16.1 16.9 18.2 17 19.5 22.8 ...
## $ Rainfall : num 0 3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 ...
## $ Evaporation : num 3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6 4 5.4 ...
## $ Sunshine : num 6.3 9.7 3.3 9.1 10.6 8.2 8.4 4.6 4.1 7.7 ...
## $ WindGustDir : chr "NW" "ENE" "NW" "NW" ...
## $ WindGustSpeed: int 30 39 85 54 50 44 43 41 48 31 ...
## $ WindDir9am : chr "SW" "E" "N" "WNW" ...
## $ WindDir3pm : chr "NW" "W" "NNE" "W" ...
## $ WindSpeed9am : int 6 4 6 30 20 20 19 11 19 7 ...
## $ WindSpeed3pm : int 20 17 6 24 28 24 26 24 17 6 ...
## $ Humidity9am : int 68 80 82 62 68 70 63 65 70 82 ...
## $ Humidity3pm : int 29 36 69 56 49 57 47 57 48 32 ...
## $ Pressure9am : num 1020 1012 1010 1006 1018 ...
## $ Pressure3pm : num 1015 1008 1007 1007 1018 ...
## $ Cloud9am : int 7 5 8 2 7 7 4 6 7 7 ...
## $ Cloud3pm : int 7 3 7 7 7 5 6 7 7 1 ...
## $ Temp9am : num 14.4 17.5 15.4 13.5 11.1 10.9 12.4 12.1 14.1 13.3 ...
## $ Temp3pm : num 23.6 25.7 20.2 14.1 15.4 14.8 17.3 15.5 18.9 21.7 ...
## $ RainToday : chr "No" "Yes" "Yes" "Yes" ...
## $ RainTomorrow : chr "Yes" "Yes" "Yes" "Yes" ...
Basic insights into the data can be obtained by exploring it through visualizations. We use a correlation matrix and scatter plots to visualize the data.
The correlation matrix helps us analyze the effect of the different parameters on rainfall. The correlation heatmap is particularly informative: the variables most strongly correlated with rainfall are humidity, cloud cover and temperature.
To better understand the parameters that most affect rainfall, we created scatter plots for the four variables with the highest correlation with rainfall. From the scatter plots, it is evident that rainfall is highest when the weather is cloudier and more humid.
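The plotting code is not reproduced in this write-up. As a rough sketch of how the underlying correlations could be computed, the snippet below keeps only the numeric columns, uses pairwise-complete observations to cope with missing values, and ranks the variables by absolute correlation with Rainfall. The helper name rank_correlations and the choice of Rainfall as the target column are assumptions made for illustration; the toy data frame reuses a few values from the head(weather) output above.

```r
# Hypothetical helper: rank numeric columns by |correlation| with a target column.
rank_correlations <- function(df, target = "Rainfall") {
  numeric_cols <- vapply(df, is.numeric, logical(1))
  # Pairwise-complete observations, so rows with NAs are not dropped wholesale
  cor_mat <- cor(df[, numeric_cols, drop = FALSE], use = "pairwise.complete.obs")
  target_cor <- cor_mat[, target]
  target_cor <- target_cor[names(target_cor) != target]
  # Strongest relationships (positive or negative) first
  target_cor[order(abs(target_cor), decreasing = TRUE)]
}

# Toy data: three numeric columns and one character column from the first six rows
toy <- data.frame(
  Rainfall    = c(0.0, 3.6, 3.6, 39.8, 2.8, 0.0),
  Humidity3pm = c(29, 36, 69, 56, 49, 57),
  Sunshine    = c(6.3, 9.7, 3.3, 9.1, 10.6, 8.2),
  RainToday   = c("No", "Yes", "Yes", "Yes", "Yes", "No")
)
ranked <- rank_correlations(toy)
print(round(ranked, 2))
```

On the full data this would be called as rank_correlations(weather). To draw a heatmap, the full matrix could be reshaped with reshape2::melt() and plotted with ggplot2::geom_tile(), although that is only one of several ways to produce such a figure.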
require(dplyr)
weatherRegg<- weather
weatherRegg$RainToday <- recode(weather$RainToday, "No" = 0, "Yes" = 1)
weatherRegg$RainTomorrow <- recode(weather$RainTomorrow, "No" = 0, "Yes" = 1)
reggmodel <- glm(RainTomorrow ~ ., family = binomial(link="logit"), data=weatherRegg)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(reggmodel)
##
## Call:
## glm(formula = RainTomorrow ~ ., family = binomial(link = "logit"),
## data = weatherRegg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.02887 -0.10685 -0.01224 -0.00002 3.15153
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 260.52707 107.44767 2.425 0.015322 *
## MinTemp -0.33823 0.26318 -1.285 0.198737
## MaxTemp -0.21135 0.36661 -0.576 0.564286
## Rainfall 0.12914 0.09683 1.334 0.182297
## Evaporation 0.31842 0.23904 1.332 0.182835
## Sunshine -0.66296 0.27655 -2.397 0.016518 *
## WindGustDirENE -0.54738 2.19208 -0.250 0.802813
## WindGustDirESE 3.64764 1.82417 2.000 0.045541 *
## WindGustDirN 0.91555 2.39382 0.382 0.702115
## WindGustDirNE 2.23105 2.68240 0.832 0.405558
## WindGustDirNNE -14.42241 3260.78978 -0.004 0.996471
## WindGustDirNNW 4.97603 2.16524 2.298 0.021554 *
## WindGustDirNW 2.98815 2.15169 1.389 0.164909
## WindGustDirS 2.40579 2.65890 0.905 0.365567
## WindGustDirSE -21.15686 2291.10106 -0.009 0.992632
## WindGustDirSSE 2.51173 1.84431 1.362 0.173237
## WindGustDirSSW 0.86792 2.64498 0.328 0.742807
## WindGustDirSW 5.52328 3.54491 1.558 0.119212
## WindGustDirW 0.85325 1.89019 0.451 0.651693
## WindGustDirWNW -0.29152 2.06771 -0.141 0.887882
## WindGustDirWSW -13.89983 6825.85033 -0.002 0.998375
## WindGustSpeed 0.06231 0.05893 1.057 0.290349
## WindDir9amENE -2.12126 2.82320 -0.751 0.452433
## WindDir9amESE -5.95309 3.06067 -1.945 0.051771 .
## WindDir9amN -1.67279 2.18905 -0.764 0.444771
## WindDir9amNE -0.20852 2.83790 -0.073 0.941426
## WindDir9amNNE 0.16845 2.29555 0.073 0.941503
## WindDir9amNNW -3.39296 2.23621 -1.517 0.129195
## WindDir9amNW -5.05989 2.49940 -2.024 0.042925 *
## WindDir9amS -4.78507 2.69140 -1.778 0.075418 .
## WindDir9amSE 1.22870 2.06594 0.595 0.552016
## WindDir9amSSE -2.05909 2.10754 -0.977 0.328564
## WindDir9amSSW 0.36536 2.83968 0.129 0.897626
## WindDir9amSW 0.54461 2.36431 0.230 0.817822
## WindDir9amW -23.37262 2819.70716 -0.008 0.993386
## WindDir9amWNW -4.06728 2.87520 -1.415 0.157185
## WindDir9amWSW -1.91935 7.43437 -0.258 0.796274
## WindDir3pmENE 2.43092 2.71153 0.897 0.369979
## WindDir3pmESE 3.47374 2.69452 1.289 0.197333
## WindDir3pmN 0.97895 2.81154 0.348 0.727699
## WindDir3pmNE 2.64567 3.40070 0.778 0.436581
## WindDir3pmNNE 1.55092 3.43962 0.451 0.652063
## WindDir3pmNNW 2.33835 2.60362 0.898 0.369124
## WindDir3pmNW 2.44865 2.48173 0.987 0.323805
## WindDir3pmS 0.39389 8.65676 0.046 0.963708
## WindDir3pmSE 1.46772 3.06486 0.479 0.632019
## WindDir3pmSSE -2.53686 3.27723 -0.774 0.438879
## WindDir3pmSSW -8.45834 3692.46450 -0.002 0.998172
## WindDir3pmSW -7.99777 3677.13315 -0.002 0.998265
## WindDir3pmW 6.99768 2.82163 2.480 0.013138 *
## WindDir3pmWNW 3.07910 2.55812 1.204 0.228722
## WindDir3pmWSW 0.36702 3.02197 0.121 0.903335
## WindSpeed9am 0.18091 0.08901 2.032 0.042109 *
## WindSpeed3pm -0.10912 0.08082 -1.350 0.176956
## Humidity9am 0.05699 0.05344 1.066 0.286222
## Humidity3pm 0.22536 0.07857 2.868 0.004127 **
## Pressure9am 1.71116 0.53794 3.181 0.001468 **
## Pressure3pm -1.99797 0.57550 -3.472 0.000517 ***
## Cloud9am -0.20677 0.21496 -0.962 0.336108
## Cloud3pm -0.02541 0.24399 -0.104 0.917057
## Temp9am 0.24636 0.37720 0.653 0.513667
## Temp3pm 0.57604 0.38584 1.493 0.135445
## RainToday -1.36880 1.28648 -1.064 0.287332
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 312.127 on 327 degrees of freedom
## Residual deviance: 92.717 on 265 degrees of freedom
## (38 observations deleted due to missingness)
## AIC: 218.72
##
## Number of Fisher Scoring iterations: 18
fitted_results <- predict(reggmodel,newdata=weatherRegg,type='response')
fitted_results <- ifelse(fitted_results > 0.5,1,0)
fitted_results <- na.omit(fitted_results)
misClasificError <- mean(fitted_results != weatherRegg$RainTomorrow)
## Warning in fitted_results != weatherRegg$RainTomorrow: longer object length is
## not a multiple of shorter object length
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.816939890710383"
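The length warning above arises because na.omit() shortens the prediction vector (38 rows have missing values) while weatherRegg$RainTomorrow still has all 366 entries, so the shorter vector is recycled and some predictions are compared with labels from other rows. The reported accuracy should therefore be read with caution. A minimal sketch of an aligned comparison follows; aligned_accuracy is a hypothetical helper, and the vectors are made-up stand-ins for the real predictions and labels.

```r
# Hypothetical helper: accuracy over the rows where a prediction exists.
# probs  : predicted probabilities, NA where predictors were missing
# actual : 0/1 labels, same length and row order as probs
aligned_accuracy <- function(probs, actual, threshold = 0.5) {
  stopifnot(length(probs) == length(actual))
  keep <- !is.na(probs) & !is.na(actual)   # keep the same rows on both sides
  predicted <- ifelse(probs[keep] > threshold, 1, 0)
  mean(predicted == actual[keep])
}

# Made-up example: the third row has no prediction and is skipped
probs  <- c(0.9, 0.2, NA, 0.7, 0.1)
actual <- c(1,   0,   1,  0,   0)
acc <- aligned_accuracy(probs, actual)
print(acc)  # 3 of the 4 usable rows are classified correctly: 0.75
```

For the model above, the call would be aligned_accuracy(predict(reggmodel, newdata = weatherRegg, type = 'response'), weatherRegg$RainTomorrow). Since the model is scored on the rows it was fitted on, the result would still be an in-sample figure.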
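In the summary, a term whose p-value in the last column is greater than 0.05 is considered to make no significant contribution. Apart from a few individual wind-direction levels (WindGustDirESE, WindGustDirNNW, WindDir9amNW and WindDir3pmW), the only predictors that are relevant to the model are: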
Sunshine
WindSpeed9am
Humidity3pm
Pressure9am
Pressure3pm
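The 0.05 rule can also be applied programmatically to the coefficient table instead of reading it off by eye. The sketch below fits a small logistic regression on simulated data (not the weather dataset) and filters coef(summary(model)) by its p-value column. The helper significant_terms and the variables x1, x2 and y are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: names of model terms with a p-value below alpha.
# Assumes a binomial glm, whose coefficient table has a "Pr(>|z|)" column.
significant_terms <- function(model, alpha = 0.05) {
  coefs <- coef(summary(model))   # columns: Estimate, Std. Error, z value, Pr(>|z|)
  pvals <- coefs[, "Pr(>|z|)"]
  setdiff(names(pvals)[pvals < alpha], "(Intercept)")
}

# Simulated data: the outcome depends on x1 only; x2 is pure noise
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(2 * x1))
toy_model <- glm(y ~ x1 + x2, family = binomial(link = "logit"))

print(significant_terms(toy_model))
```

Calling significant_terms(reggmodel) on the model above would return individual dummy levels of the wind-direction factors alongside the numeric predictors, which is why a factor can appear partly significant.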
set.seed(1)
trainingIndex <- sample(nrow(weather),nrow(weather)*0.80)
Train <- weather[trainingIndex,]
Test <- weather[-trainingIndex,]
dectree <- rpart(RainTomorrow~., data = Train, method = 'class')
prp(dectree, type=5,extra=100, main= "Probabilities per class")
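Using parameters such as pressure, cloud cover, humidity and wind direction, we construct a decision tree that predicts whether it will rain tomorrow. Each leaf shows the predicted class and the percentage of training observations that fall into it.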
#Determining the prediction table
pred_rainfall <- predict(dectree, type ="class")
table(pred_rainfall)
## pred_rainfall
## No Yes
## 239 53
#confusion matrix for decision tree
confmat_rainfall <- table(pred_rainfall, Train$RainTomorrow)
# Compute the accuracies, precision, recall and F-measure
cat("\nConfusion matrix:\n")
##
## Confusion matrix:
confmat_rainfall
##
## pred_rainfall No Yes
## No 225 14
## Yes 13 40
accuracy_rainfall <- sum(diag(confmat_rainfall)) / nrow(Train)
# Rows are predictions and columns are actual values; "No" is treated as the positive class
precision_rainfall <- confmat_rainfall[1,1]/sum(confmat_rainfall[1,])
recall_rainfall <- confmat_rainfall[1,1]/sum(confmat_rainfall[,1])
fmeasure_rainfall <- 2 * (precision_rainfall * recall_rainfall) / (precision_rainfall + recall_rainfall)
cat(paste("Accuracy:\t", format(accuracy_rainfall, digits=2), "\n",sep=" "))
## Accuracy: 0.91
cat(paste("Precision:\t", format(precision_rainfall, digits=2), "\n",sep=" "))
## Precision: 0.94
cat(paste("Recall:\t\t", format(recall_rainfall, digits=2), "\n",sep=" "))
## Recall: 0.95
cat(paste("F-measure:\t", format(fmeasure_rainfall, digits=2), "\n",sep=" "))
## F-measure: 0.94
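These metrics describe the training rows, because predict(dectree, type = "class") without newdata scores the data the tree was fitted on. They also take "No" as the class of interest. The sketch below is a hypothetical base R helper, class_metrics, that computes the same quantities for any chosen class, here "Yes" (rain). The demo vectors are rebuilt from the counts in the confusion matrix above, so no new results are introduced.

```r
# Hypothetical helper: accuracy, precision, recall and F-measure for one class.
# In the table, rows are predictions and columns are actual labels.
class_metrics <- function(predicted, actual, positive = "Yes") {
  lv <- union(unique(as.character(actual)), unique(as.character(predicted)))
  cm <- table(factor(predicted, levels = lv), factor(actual, levels = lv))
  tp <- cm[positive, positive]
  precision <- tp / sum(cm[positive, ])   # share of predicted positives that are right
  recall    <- tp / sum(cm[, positive])   # share of actual positives that are found
  c(accuracy  = sum(diag(cm)) / sum(cm),
    precision = precision,
    recall    = recall,
    f_measure = 2 * precision * recall / (precision + recall))
}

# Demo vectors rebuilt from the confusion-matrix counts (225, 14, 13, 40)
predicted <- rep(c("No", "No", "Yes", "Yes"), times = c(225, 14, 13, 40))
actual    <- rep(c("No", "Yes", "No", "Yes"), times = c(225, 14, 13, 40))
metrics <- class_metrics(predicted, actual, positive = "Yes")
print(round(metrics, 2))
```

For a held-out estimate, the same helper could be applied to the 20% of rows left out of training, for example class_metrics(predict(dectree, newdata = Test, type = "class"), Test$RainTomorrow).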
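Comparing the two models, we conclude that the decision tree performs better in terms of accuracy (0.91, against 0.82 for the logistic regression). Both figures were measured on the data each model was fitted on.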
print(paste('Logistic Regression Accuracy',1-misClasificError))
## [1] "Logistic Regression Accuracy 0.816939890710383"
print(paste("Decision Tree Accuracy:", format(accuracy_rainfall, digits=2)))
## [1] "Decision Tree Accuracy: 0.91"