library(rpart)
library(rpart.plot)
library(ggplot2)
library(lattice)
library(caret)
library(rattle)
library(gplots)
library(ROCR)
library(randomForest)
library(dplyr)
library(tidyverse)
library(reshape2)
weather <- read.csv("/Users/sindhuherle/Documents/Data mining/weather.csv")

1. Introduction

Weather forecasting is the application of science and technology to predict the state of the atmosphere for a given location and time. Forecasts are made by collecting data on the current state of the atmosphere and applying a scientific understanding of atmospheric processes to predict how the atmosphere will evolve. Weather has a great impact on many aspects of human life, and the goal of weather prediction is to give people information in advance so that they can make better decisions. We used a dataset from Kaggle (https://www.kaggle.com/datasets/arpina/weather). The dataset has 21 variables and 366 observations. The variables include the minimum and maximum temperature, humidity, pressure, wind direction, wind speed, and other weather-related factors.

2. Visualizations

Before analyzing the data, let us view the structure of the dataset to get a better understanding of the variables.

head(weather)
##   MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
## 1     8.0    24.3      0.0         3.4      6.3          NW            30
## 2    14.0    26.9      3.6         4.4      9.7         ENE            39
## 3    13.7    23.4      3.6         5.8      3.3          NW            85
## 4    13.3    15.5     39.8         7.2      9.1          NW            54
## 5     7.6    16.1      2.8         5.6     10.6         SSE            50
## 6     6.2    16.9      0.0         5.8      8.2          SE            44
##   WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm
## 1         SW         NW            6           20          68          29
## 2          E          W            4           17          80          36
## 3          N        NNE            6            6          82          69
## 4        WNW          W           30           24          62          56
## 5        SSE        ESE           20           28          68          49
## 6         SE          E           20           24          70          57
##   Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
## 1      1019.7      1015.0        7        7    14.4    23.6        No
## 2      1012.4      1008.4        5        3    17.5    25.7       Yes
## 3      1009.5      1007.2        8        7    15.4    20.2       Yes
## 4      1005.5      1007.0        2        7    13.5    14.1       Yes
## 5      1018.3      1018.5        7        7    11.1    15.4       Yes
## 6      1023.8      1021.7        7        5    10.9    14.8        No
##   RainTomorrow
## 1          Yes
## 2          Yes
## 3          Yes
## 4          Yes
## 5           No
## 6           No
str(weather)
## 'data.frame':    366 obs. of  21 variables:
##  $ MinTemp      : num  8 14 13.7 13.3 7.6 6.2 6.1 8.3 8.8 8.4 ...
##  $ MaxTemp      : num  24.3 26.9 23.4 15.5 16.1 16.9 18.2 17 19.5 22.8 ...
##  $ Rainfall     : num  0 3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 ...
##  $ Evaporation  : num  3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6 4 5.4 ...
##  $ Sunshine     : num  6.3 9.7 3.3 9.1 10.6 8.2 8.4 4.6 4.1 7.7 ...
##  $ WindGustDir  : chr  "NW" "ENE" "NW" "NW" ...
##  $ WindGustSpeed: int  30 39 85 54 50 44 43 41 48 31 ...
##  $ WindDir9am   : chr  "SW" "E" "N" "WNW" ...
##  $ WindDir3pm   : chr  "NW" "W" "NNE" "W" ...
##  $ WindSpeed9am : int  6 4 6 30 20 20 19 11 19 7 ...
##  $ WindSpeed3pm : int  20 17 6 24 28 24 26 24 17 6 ...
##  $ Humidity9am  : int  68 80 82 62 68 70 63 65 70 82 ...
##  $ Humidity3pm  : int  29 36 69 56 49 57 47 57 48 32 ...
##  $ Pressure9am  : num  1020 1012 1010 1006 1018 ...
##  $ Pressure3pm  : num  1015 1008 1007 1007 1018 ...
##  $ Cloud9am     : int  7 5 8 2 7 7 4 6 7 7 ...
##  $ Cloud3pm     : int  7 3 7 7 7 5 6 7 7 1 ...
##  $ Temp9am      : num  14.4 17.5 15.4 13.5 11.1 10.9 12.4 12.1 14.1 13.3 ...
##  $ Temp3pm      : num  23.6 25.7 20.2 14.1 15.4 14.8 17.3 15.5 18.9 21.7 ...
##  $ RainToday    : chr  "No" "Yes" "Yes" "Yes" ...
##  $ RainTomorrow : chr  "Yes" "Yes" "Yes" "Yes" ...
2.1 Correlation Matrix Heatmap

The correlation matrix heatmap helps us analyze how the different parameters relate to rainfall. The variables most strongly correlated with rainfall are humidity, cloud cover, and temperature.
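A heatmap of this kind could be produced roughly as follows. This is a minimal sketch, not necessarily the code used for the report: `cor_long` is a hypothetical helper name, and the plot assumes the `weather` data frame loaded earlier and an installed ggplot2.

```r
# Hypothetical helper: pairwise correlations of the numeric columns,
# reshaped to long format (one row per pair of variables).
cor_long <- function(df) {
  num <- df[sapply(df, is.numeric)]              # keep numeric columns only
  cm <- cor(num, use = "pairwise.complete.obs")  # tolerate missing values
  long <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
  names(long) <- c("Var1", "Var2", "value")
  long
}

# Draw the heatmap only if the data and ggplot2 are available.
if (exists("weather") && requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(cor_long(weather), aes(Var1, Var2, fill = value)) +
    geom_tile() +
    scale_fill_gradient2(limits = c(-1, 1)) +    # diverging scale centred at 0
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  print(p)
}
```

The character columns (wind directions, RainToday, RainTomorrow) are dropped here because `cor()` only accepts numeric input.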

2.2 Scatter Plots

To further understand the parameters that most strongly affect rainfall, we created scatterplots for the four variables with the highest correlation with rainfall. The scatterplots show that rainfall is highest when the weather is cloudier and more humid.
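One way to pick the four variables and draw the scatterplots is sketched below. The helper name `top_correlates` is hypothetical, ranking by absolute Pearson correlation is an assumption about how "highest correlation" was determined, and the plotting part assumes the `weather` data frame loaded earlier.

```r
# Hypothetical helper: names of the n numeric columns most strongly
# correlated (by absolute Pearson correlation) with the target column.
top_correlates <- function(df, target, n = 4) {
  num <- df[sapply(df, is.numeric)]
  stopifnot(target %in% names(num))
  r <- cor(num, use = "pairwise.complete.obs")[, target]
  r <- r[names(r) != target]                     # drop the target itself
  head(names(r)[order(abs(r), decreasing = TRUE)], n)
}

# Scatterplots of Rainfall against each selected variable (2 x 2 grid).
if (exists("weather")) {
  vars <- top_correlates(weather, "Rainfall")
  op <- par(mfrow = c(2, 2))
  for (v in vars) {
    plot(weather[[v]], weather$Rainfall, xlab = v, ylab = "Rainfall")
  }
  par(op)                                        # restore plot settings
}
```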

3. Linear Regression

4. Decision Tree

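set.seed(1)  # fix the random seed so the split is reproducible
# 80/20 train/test split; floor() keeps the sample size an integer
trainingIndex <- sample(nrow(weather), floor(nrow(weather) * 0.80))
Train <- weather[trainingIndex, ]
Test <- weather[-trainingIndex, ]
# Classification tree for RainTomorrow using all other columns as predictors
dectree <- rpart(RainTomorrow ~ ., data = Train, method = "class")
# extra = 4 prints the per-class probabilities at each node, matching the title
prp(dectree, type = 5, extra = 4, main = "Probabilities per class")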

Using parameters such as pressure, cloud cover, humidity, and wind direction, we construct a decision tree that predicts whether it will rain the next day (RainTomorrow).
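The held-out `Test` set can be used to check how well the tree generalizes. The following is a minimal sketch under stated assumptions: `confusion_summary` is a hypothetical helper, and the evaluation step runs only if the `dectree` and `Test` objects created above exist.

```r
# Hypothetical helper: confusion table and overall accuracy for two
# label vectors of equal length.
confusion_summary <- function(actual, predicted) {
  lv <- union(as.character(actual), as.character(predicted))
  tab <- table(Actual    = factor(actual, levels = lv),
               Predicted = factor(predicted, levels = lv))
  list(table = tab, accuracy = sum(diag(tab)) / sum(tab))
}

# Evaluate the fitted tree on the held-out rows.
if (exists("dectree") && exists("Test")) {
  pred <- predict(dectree, newdata = Test, type = "class")
  res <- confusion_summary(Test$RainTomorrow, pred)
  print(res$table)
  cat("Accuracy:", round(res$accuracy, 3), "\n")
}
```

Accuracy alone can be misleading if dry days dominate the data, so the full confusion table is worth reading alongside it.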

5. Model Comparison and Conclusion