library(rpart)
library(rpart.plot)
library(ggplot2)
library(lattice)
library(caret)
library(rattle)
library(gplots)
library(ROCR)
library(randomForest)
library(dplyr)
library(tidyverse)
library(reshape2)
weather <- read.csv("/Users/sindhuherle/Documents/Data mining/weather.csv")
Weather forecasting is the application of science and technology to predict the state of the atmosphere for a given location and time. Forecasts are made by collecting data on the current state of the atmosphere and applying a scientific understanding of atmospheric processes to project how it will evolve. Weather has a great impact on many aspects of human life, and the goal of weather prediction is to give people information in advance so that they can make better decisions.
We used a dataset from Kaggle (https://www.kaggle.com/datasets/arpina/weather). It has 21 variables and 366 observations. The variables include the minimum and maximum temperature, humidity, pressure, wind direction, wind speed and other weather-related factors.
Before analyzing the data, let us view the structure of the dataset to get a better understanding of the variables.
head(weather)
## MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
## 1 8.0 24.3 0.0 3.4 6.3 NW 30
## 2 14.0 26.9 3.6 4.4 9.7 ENE 39
## 3 13.7 23.4 3.6 5.8 3.3 NW 85
## 4 13.3 15.5 39.8 7.2 9.1 NW 54
## 5 7.6 16.1 2.8 5.6 10.6 SSE 50
## 6 6.2 16.9 0.0 5.8 8.2 SE 44
## WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm
## 1 SW NW 6 20 68 29
## 2 E W 4 17 80 36
## 3 N NNE 6 6 82 69
## 4 WNW W 30 24 62 56
## 5 SSE ESE 20 28 68 49
## 6 SE E 20 24 70 57
## Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday
## 1 1019.7 1015.0 7 7 14.4 23.6 No
## 2 1012.4 1008.4 5 3 17.5 25.7 Yes
## 3 1009.5 1007.2 8 7 15.4 20.2 Yes
## 4 1005.5 1007.0 2 7 13.5 14.1 Yes
## 5 1018.3 1018.5 7 7 11.1 15.4 Yes
## 6 1023.8 1021.7 7 5 10.9 14.8 No
## RainTomorrow
## 1 Yes
## 2 Yes
## 3 Yes
## 4 Yes
## 5 No
## 6 No
str(weather)
## 'data.frame': 366 obs. of 21 variables:
## $ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 8.8 8.4 ...
## $ MaxTemp : num 24.3 26.9 23.4 15.5 16.1 16.9 18.2 17 19.5 22.8 ...
## $ Rainfall : num 0 3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 ...
## $ Evaporation : num 3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6 4 5.4 ...
## $ Sunshine : num 6.3 9.7 3.3 9.1 10.6 8.2 8.4 4.6 4.1 7.7 ...
## $ WindGustDir : chr "NW" "ENE" "NW" "NW" ...
## $ WindGustSpeed: int 30 39 85 54 50 44 43 41 48 31 ...
## $ WindDir9am : chr "SW" "E" "N" "WNW" ...
## $ WindDir3pm : chr "NW" "W" "NNE" "W" ...
## $ WindSpeed9am : int 6 4 6 30 20 20 19 11 19 7 ...
## $ WindSpeed3pm : int 20 17 6 24 28 24 26 24 17 6 ...
## $ Humidity9am : int 68 80 82 62 68 70 63 65 70 82 ...
## $ Humidity3pm : int 29 36 69 56 49 57 47 57 48 32 ...
## $ Pressure9am : num 1020 1012 1010 1006 1018 ...
## $ Pressure3pm : num 1015 1008 1007 1007 1018 ...
## $ Cloud9am : int 7 5 8 2 7 7 4 6 7 7 ...
## $ Cloud3pm : int 7 3 7 7 7 5 6 7 7 1 ...
## $ Temp9am : num 14.4 17.5 15.4 13.5 11.1 10.9 12.4 12.1 14.1 13.3 ...
## $ Temp3pm : num 23.6 25.7 20.2 14.1 15.4 14.8 17.3 15.5 18.9 21.7 ...
## $ RainToday : chr "No" "Yes" "Yes" "Yes" ...
## $ RainTomorrow : chr "Yes" "Yes" "Yes" "Yes" ...
The correlation matrix above helps us analyze the effect of the different parameters on rainfall. The correlation heatmap is particularly informative: the variables most strongly correlated with rainfall are humidity, cloud cover and temperature.
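The code that produced the correlation matrix is not shown in this report, so the following is only a minimal sketch of how such a ranking could be computed. It assumes that "rainfall" refers to the `Rainfall` column; `cor_with_target` is a hypothetical helper, and `weather_head` holds just the six rows printed by `head(weather)` for a few columns, so the resulting numbers are illustrative and are not the report's results.

```r
# Hypothetical helper: correlation of every numeric column with one target column.
cor_with_target <- function(df, target = "Rainfall") {
  # Keep numeric columns only; cor() cannot handle character columns
  numeric_cols <- df[, sapply(df, is.numeric), drop = FALSE]
  # pairwise.complete.obs skips missing values pair by pair
  cor_matrix <- cor(numeric_cols, use = "pairwise.complete.obs")
  target_cor <- cor_matrix[, target]
  target_cor <- target_cor[names(target_cor) != target]
  # Strongest absolute correlations first
  target_cor[order(abs(target_cor), decreasing = TRUE)]
}

# The six rows shown by head(weather), restricted to a few columns (illustration only)
weather_head <- data.frame(
  MaxTemp     = c(24.3, 26.9, 23.4, 15.5, 16.1, 16.9),
  Rainfall    = c(0.0, 3.6, 3.6, 39.8, 2.8, 0.0),
  Humidity3pm = c(29, 36, 69, 56, 49, 57),
  Cloud3pm    = c(7, 3, 7, 7, 7, 5),
  RainToday   = c("No", "Yes", "Yes", "Yes", "Yes", "No")
)

rain_cor <- cor_with_target(weather_head)
print(round(rain_cor, 2))
```

On the full data the same call would be `cor_with_target(weather)`. Ranking by absolute value keeps strong negative correlations near the top as well.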
To further understand the parameters that most strongly affect rainfall, we created scatterplots for the four variables with the highest correlation with rainfall. From the scatterplots, it is evident that rainfall is highest when the weather is more cloudy and humid.
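The plotting code is not included in the report either, and the four variables are not named. As a hedged sketch, scatterplots of this kind could be drawn as follows. `plot_against_rainfall` is a hypothetical helper, base graphics are used here instead of ggplot2 to keep the example short, and the three predictors and six rows are taken from the `head(weather)` output purely for illustration.

```r
# Hypothetical helper: one scatterplot of the target against each chosen predictor.
plot_against_rainfall <- function(df, predictors, target = "Rainfall") {
  # Put all panels side by side and restore the layout afterwards
  old_par <- par(mfrow = c(1, length(predictors)))
  on.exit(par(old_par))
  for (p in predictors) {
    plot(df[[p]], df[[target]],
         xlab = p, ylab = target, main = paste(target, "vs", p))
  }
  invisible(predictors)
}

# The six rows shown by head(weather), restricted to a few columns (illustration only)
weather_head <- data.frame(
  MaxTemp     = c(24.3, 26.9, 23.4, 15.5, 16.1, 16.9),
  Rainfall    = c(0.0, 3.6, 3.6, 39.8, 2.8, 0.0),
  Humidity3pm = c(29, 36, 69, 56, 49, 57),
  Cloud3pm    = c(7, 3, 7, 7, 7, 5)
)

shown <- plot_against_rainfall(weather_head, c("Humidity3pm", "Cloud3pm", "MaxTemp"))
```

With the full dataset, the predictor names would be replaced by whichever columns rank highest in the correlation analysis.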
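# Fix the seed so the random train/test split is reproducible
set.seed(1)
# Use 80% of the rows for training and hold out the remaining 20% for testing
trainingIndex <- sample(nrow(weather), floor(nrow(weather) * 0.80))
Train <- weather[trainingIndex, ]
Test <- weather[-trainingIndex, ]
# Fit a classification tree that predicts RainTomorrow from all other variables
dectree <- rpart(RainTomorrow ~ ., data = Train, method = "class")
# type = 5 shows the split variable name in the interior nodes;
# extra = 3 shows the misclassification rate per node (extra = 4 would show per-class probabilities)
prp(dectree, type = 5, extra = 3, main = "Probabilities per class")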
Based on parameters such as pressure, cloud cover, humidity and wind direction, we constructed a decision tree to predict whether it will rain the next day (the RainTomorrow variable).
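The report fits the tree on `Train` but does not show how it performs on `Test`. The sketch below illustrates one way to check that with a confusion matrix and accuracy. It is not the authors' evaluation: `evaluate_tree` is a hypothetical helper, and because `weather.csv` is not available here, the example runs on the `kyphosis` data that ships with rpart as a stand-in.

```r
library(rpart)

# Hypothetical helper: fit a classification tree on a random training split
# and report its confusion matrix and accuracy on the held-out rows.
evaluate_tree <- function(df, target, train_frac = 0.80, seed = 1) {
  # Convert character columns to factors so both splits share the same levels
  df[] <- lapply(df, function(x) if (is.character(x)) factor(x) else x)
  set.seed(seed)
  train_idx <- sample(nrow(df), floor(nrow(df) * train_frac))
  train <- df[train_idx, , drop = FALSE]
  test <- df[-train_idx, , drop = FALSE]
  # Build the formula "<target> ~ ." from the target column name
  fit <- rpart(reformulate(".", response = target), data = train, method = "class")
  predicted <- predict(fit, newdata = test, type = "class")
  confusion <- table(Predicted = predicted, Actual = test[[target]])
  list(confusion = confusion, accuracy = sum(diag(confusion)) / sum(confusion))
}

# Stand-in data (assumption): kyphosis is bundled with rpart
result <- evaluate_tree(kyphosis, target = "Kyphosis")
print(result$confusion)
print(result$accuracy)
```

For the weather data the equivalent call would be `evaluate_tree(weather, target = "RainTomorrow")`. Converting the wind direction columns to factors before splitting avoids a `predict()` error when a direction appears in the test rows but not in the training rows.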