Introduction

Weather forecasting is the application of science and technology to predict the conditions of the atmosphere for a given location and time.

Weather forecasts are made by collecting quantitative data about the current state of the atmosphere at a given place and using meteorology to project how the atmosphere will change.

The importance of accurately forecasting the weather can be felt in sectors such as:

Aviation - Heavy rain or exceptionally low ceilings can prevent an aircraft from landing and taking off.

Marine - Commercial and recreational use of waterways can be limited significantly by wind direction and speed, wave periodicity and heights, tides, and precipitation.

Agriculture - Farmers rely on weather forecasts to decide what work to do on any particular day. For example, drying hay is only feasible in dry weather, but on the other hand, prolonged periods of dryness can ruin cotton, wheat, and corn crops.

Forestry - Forecasting wind, precipitation, and humidity is essential for preventing and controlling wildfires.

Utility companies - Electricity and gas companies rely on weather forecasts to anticipate demand, which can be strongly affected by the weather.

Military applications - The UK Royal Navy, working with the UK Met Office, uses data to provide accurate and timely weather and oceanographic information to submarines, ships, and Fleet Air Arm aircraft.

In this paper, we will use a decision tree algorithm to predict whether it rains on a specific day, training the model on historical daily records.

The dataset used contains complete records of daily rainfall patterns from January 1, 1948 to December 12, 2017 and was collected at the Seattle-Tacoma International Airport. It can be downloaded here.

Note: A decision tree is a type of classification algorithm that is often effective in practice. We will use the rpart package in this case study.

#Seattle Weather data_Rain prediction_1948-2017
library(readr)
sw <-read_csv("seattleWeather_1948-2017.csv")
## Parsed with column specification:
## cols(
##   DATE = col_date(format = ""),
##   PRCP = col_double(),
##   TMAX = col_integer(),
##   TMIN = col_integer(),
##   RAIN = col_logical()
## )
dim(sw)
## [1] 25551     5
head(sw)
## # A tibble: 6 x 5
##   DATE        PRCP  TMAX  TMIN RAIN 
##   <date>     <dbl> <int> <int> <lgl>
## 1 1948-01-01  0.47    51    42 TRUE 
## 2 1948-01-02  0.59    45    36 TRUE 
## 3 1948-01-03  0.42    45    35 TRUE 
## 4 1948-01-04  0.31    45    34 TRUE 
## 5 1948-01-05  0.17    45    32 TRUE 
## 6 1948-01-06  0.44    48    39 TRUE

We have 25,551 observations and 5 variables.

Features

The data consists of 5 variables, namely:

DATE = the date of the observation

PRCP = the amount of precipitation, in inches

TMAX = the maximum temperature for that day, in degrees Fahrenheit

TMIN = the minimum temperature for that day, in degrees Fahrenheit

RAIN = TRUE if rain was observed on that day, FALSE if it was not

#Removing RAIN "NA" value records
# Logical indexing keeps every row when there are no NAs, unlike -which(...)
sw1 <- sw[!is.na(sw$RAIN), ]
dim(sw1)
## [1] 25548     5

We’ve removed the 3 records with a missing RAIN value, and the data is ready for analysis.
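
Before dropping rows, it can help to see which columns actually contain missing values. The sketch below uses a small made-up data frame rather than the real file; `toy`, `na_counts`, and `toy_clean` are hypothetical names, and only the column names mirror the dataset.

```r
# Hypothetical toy data with the same columns as the Seattle dataset
toy <- data.frame(
  DATE = as.Date(c("1948-01-01", "1948-01-02", "1948-01-03", "1948-01-04")),
  PRCP = c(0.47, NA, 0.00, 0.31),
  TMAX = c(51L, 45L, 45L, 45L),
  TMIN = c(42L, 36L, 35L, 34L),
  RAIN = c(TRUE, NA, FALSE, TRUE)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Keep only the rows where RAIN is known
toy_clean <- toy[!is.na(toy$RAIN), ]
dim(toy_clean)
```

Indexing with `!is.na(...)` is a deliberate choice here: indexing with `-which(...)` selects zero rows when nothing is missing, because the negated index is empty.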

# Split the data 80-20
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(123)  # set the seed before the random split so the partition is reproducible
index <- createDataPartition(sw1$RAIN, p = 0.8, list = FALSE)
# Training dataset
sw1_train <- sw1[index,]
dim(sw1_train)
## [1] 20439     5
# Testing dataset
sw1_test <- sw1[-index,]
dim(sw1_test)
## [1] 5109    5

The data has been split into training and test sets in a 4:1 ratio.
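
A split like this is meant to keep the share of rainy days roughly the same in both parts. The following is a base-R sketch of that idea on simulated labels, not caret's actual implementation; the vector `rain` and the 43% rain rate are assumptions made up for illustration.

```r
set.seed(123)  # fix the random draw so the split is reproducible

# Hypothetical labels: 1000 days, roughly 43% of them rainy
rain <- runif(1000) < 0.43

# Sample 80% of the row indices within each class (a simple stratified split)
idx_by_class <- split(seq_along(rain), rain)
train_idx <- sort(unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = round(0.8 * length(i)))]
})))

train <- rain[train_idx]
test  <- rain[-train_idx]

# The share of rainy days should be nearly identical in all three
c(all = mean(rain), train = mean(train), test = mean(test))
```

If the proportions in the two parts differed noticeably, accuracy on the test set would be harder to compare with accuracy on the training set.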

#Decision Tree Model Used
set.seed(123)
library(rpart)
sw1_model<-rpart(RAIN~TMAX+TMIN, method="class", control=rpart.control(minsplit=5, cp=0.000001), data=sw1_train)

sw1_pred<-predict(sw1_model, type="class")

conf_matrix<-table(sw1_pred,sw1_train$RAIN)
cat("Confusion_Matrix:")
## Confusion_Matrix:
conf_matrix
##         
## sw1_pred FALSE TRUE
##    FALSE  9100 2020
##    TRUE   2619 6700

Of the 11,719 days on which no rain fell, the model correctly predicted no rain 9,100 times. Of the 8,720 days on which rain fell, it correctly predicted rain 6,700 times.
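
As a check on how these figures are read from the table, the sketch below re-enters the training confusion matrix by hand and recovers the per-class totals and the correct predictions. The names `conf_train`, `actual_totals`, and `correct` are introduced here for illustration only.

```r
# Training confusion matrix as printed above (rows = predicted, columns = actual)
conf_train <- matrix(
  c(9100, 2619, 2020, 6700), nrow = 2,
  dimnames = list(predicted = c("FALSE", "TRUE"), actual = c("FALSE", "TRUE"))
)

# Column totals: how many dry and rainy days there actually were
actual_totals <- colSums(conf_train)

# Diagonal: days on which the prediction matched the observation
correct <- diag(conf_train)

actual_totals
correct
```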

accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
cat("Accuracy of the Model:")
## Accuracy of the Model:
accuracy
## [1] 0.7730319

Accuracy is the main metric here, and 77% on the training data is good but not great. Feeding more data to the algorithm may improve the result further.

sensi <- (conf_matrix[1,1])/(conf_matrix[1,1]+conf_matrix[1,2])
cat("Sensitivity of the Model:")
## Sensitivity of the Model:
sensi
## [1] 0.8183453
speci <- (conf_matrix[2,2])/(conf_matrix[2,1]+conf_matrix[2,2])
cat("Specificity of the Model:")
## Specificity of the Model:
speci
## [1] 0.7189613

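The model's sensitivity comes out at 82% and its specificity at 72%. Note that the rows of the table are predictions, so these ratios are computed over predicted classes: 82% of the days predicted dry were dry, and 72% of the days predicted rainy were rainy. These are predictive values (the second is usually called precision) rather than sensitivity and specificity in the usual per-actual-class sense.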
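
For comparison, sensitivity and specificity can also be computed over the actual classes, that is, column-wise. The sketch below assumes rain (`TRUE`) is the positive class, which is a choice made here and not something the original code fixes; the matrix is re-entered by hand from the training output above, and all variable names are illustrative.

```r
# Training confusion matrix as printed above (rows = predicted, columns = actual)
conf_train <- matrix(
  c(9100, 2619, 2020, 6700), nrow = 2,
  dimnames = list(predicted = c("FALSE", "TRUE"), actual = c("FALSE", "TRUE"))
)

# Assumption: rain (TRUE) is the positive class
tp <- conf_train["TRUE", "TRUE"]    # rain predicted, rain observed
tn <- conf_train["FALSE", "FALSE"]  # dry predicted, dry observed
fp <- conf_train["TRUE", "FALSE"]   # rain predicted, dry observed
fn <- conf_train["FALSE", "TRUE"]   # dry predicted, rain observed

sensitivity <- tp / (tp + fn)  # share of rainy days predicted as rainy
specificity <- tn / (tn + fp)  # share of dry days predicted as dry
precision   <- tp / (tp + fp)  # share of rain predictions that were right
npv         <- tn / (tn + fn)  # share of dry predictions that were right

round(c(sensitivity = sensitivity, specificity = specificity,
        precision = precision, npv = npv), 3)
```

Under this convention, sensitivity is about 77% and specificity about 78%, while the 72% and 82% figures reappear as precision and negative predictive value.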

#Validation accuracy
sw1_test$pred <- predict(sw1_model, sw1_test, type="class")

conf_matrix_val<-table(sw1_test$pred,sw1_test$RAIN)
cat("Confusion_Matrix_test_val:")
## Confusion_Matrix_test_val:
conf_matrix_val
##        
##         FALSE TRUE
##   FALSE  2265  543
##   TRUE    664 1637
set.seed(4400)

Validating the model on the test set, and treating "no rain" (FALSE) as the positive class, we get 2,265 true positives, 543 false positives, 664 false negatives, and 1,637 true negatives.

accuracy_val<-(conf_matrix_val[1,1]+conf_matrix_val[2,2])/(sum(conf_matrix_val))
cat("Accuracy of the Model_test_val:")
## Accuracy of the Model_test_val:
accuracy_val
## [1] 0.7637502

Our test set achieved an accuracy of 76%, which is what this model is likely to achieve on similar data.
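
One way to put the 76% in context is to compare it with the accuracy of always predicting the more common class. The sketch below re-enters the test confusion matrix printed above by hand; `conf_test`, `accuracy_test`, and `baseline` are names introduced here for illustration.

```r
# Test confusion matrix as printed above (rows = predicted, columns = actual)
conf_test <- matrix(
  c(2265, 664, 543, 1637), nrow = 2,
  dimnames = list(predicted = c("FALSE", "TRUE"), actual = c("FALSE", "TRUE"))
)

# Share of test days on which the prediction matched the observation
accuracy_test <- sum(diag(conf_test)) / sum(conf_test)

# Accuracy of always predicting the more common actual class
baseline <- max(colSums(conf_test)) / sum(conf_test)

round(c(accuracy = accuracy_test, baseline = baseline), 3)
```

Here the baseline is about 57%, so the tree improves on it by roughly 19 percentage points.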

sensi_val <- (conf_matrix_val[1,1])/(conf_matrix_val[1,1]+conf_matrix_val[1,2])
cat("Sensitivity of the Model_test_val:")
## Sensitivity of the Model_test_val:
sensi_val
## [1] 0.8066239
speci_val <- (conf_matrix_val[2,2])/(conf_matrix_val[2,1]+conf_matrix_val[2,2])
cat("Specificity of the Model_test_val:")
## Specificity of the Model_test_val:
speci_val
## [1] 0.7114298

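Computed the same way as for the training set, sensitivity on the test set is 81% and specificity is 71%. Because the rows of the table are predictions, these figures say that 81% of the days predicted dry were dry and 71% of the days predicted rainy were rainy.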

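#View result of test dataset
library(readr)
# Assumed step (not shown originally): save the test set with its predictions so the
# file exists; write.csv() stores the row names as a first column, shown as X1 below
write.csv(sw1_test, "sw1_test.csv")
result <-read_csv("sw1_test.csv")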
head(result)
## # A tibble: 6 x 7
##      X1 DATE        PRCP  TMAX  TMIN RAIN  pred 
##   <int> <date>     <dbl> <int> <int> <lgl> <lgl>
## 1     1 1948-01-12  0       41    26 FALSE FALSE
## 2     2 1948-01-23  0       47    43 FALSE TRUE 
## 3     3 1948-01-27  0       53    33 FALSE FALSE
## 4     4 1948-01-29  0.22    42    34 TRUE  TRUE 
## 5     5 1948-02-01  0.03    39    30 TRUE  TRUE 
## 6     6 1948-02-04  0.14    39    31 TRUE  TRUE

These are the model's predictions for the first 6 days in the test set. It was wrong once out of 6, on 1948-01-23.