Weather forecasting is the application of science and technology to predict the conditions of the atmosphere for a given location and time.
Weather forecasts are made by collecting quantitative data about the current state of the atmosphere at a given place and using meteorology to project how the atmosphere will change.
The importance of accurately forecasting the weather can be felt in sectors such as:
Aviation - Heavy rain or exceptionally low ceilings can prevent an aircraft from landing or taking off.
Marine - Commercial and recreational use of waterways can be limited significantly by wind direction and speed, wave periodicity and heights, tides, and precipitation.
Agriculture - Farmers rely on weather forecasts to decide what work to do on any particular day. For example, drying hay is only feasible in dry weather; on the other hand, prolonged periods of dryness can ruin cotton, wheat, and corn crops.
Forestry - Forecasts of wind, precipitation, and humidity are essential for preventing and controlling wildfires.
Utility companies - Electricity and gas companies rely on weather forecasts to anticipate demand, which can be strongly affected by the weather.
Military applications - The UK Royal Navy, working with the UK Met Office, uses data to provide accurate and timely weather and oceanographic information to submarines, ships, and Fleet Air Arm aircraft.
In this paper, we will use a decision tree algorithm to predict whether it rains on a specific day, using a model trained on historical daily records.
The dataset contains complete records of daily rainfall patterns from January 1, 1948 to December 12, 2017, collected at the Seattle-Tacoma International Airport. It can be downloaded here.
Note: A decision tree is a type of classification algorithm known to be effective because of its high accuracy rate. We will use the rpart package in this case study.
#Seattle Weather data_Rain prediction_1948-2017
library(readr)
sw <-read_csv("seattleWeather_1948-2017.csv")
## Parsed with column specification:
## cols(
## DATE = col_date(format = ""),
## PRCP = col_double(),
## TMAX = col_integer(),
## TMIN = col_integer(),
## RAIN = col_logical()
## )
dim(sw)
## [1] 25551 5
head(sw)
## # A tibble: 6 x 5
## DATE PRCP TMAX TMIN RAIN
## <date> <dbl> <int> <int> <lgl>
## 1 1948-01-01 0.47 51 42 TRUE
## 2 1948-01-02 0.59 45 36 TRUE
## 3 1948-01-03 0.42 45 35 TRUE
## 4 1948-01-04 0.31 45 34 TRUE
## 5 1948-01-05 0.17 45 32 TRUE
## 6 1948-01-06 0.44 48 39 TRUE
We have 25,551 observations of 5 variables:
DATE = the date of the observation
PRCP = the amount of precipitation, in inches
TMAX = the maximum temperature for that day, in degrees Fahrenheit
TMIN = the minimum temperature for that day, in degrees Fahrenheit
RAIN = TRUE if rain was observed on that day, FALSE if it was not
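In the rows printed above, RAIN is TRUE exactly when PRCP is greater than zero. If that holds for the whole file, PRCP effectively encodes the answer, which is presumably why the model later in this post uses only TMAX and TMIN. A quick check could look like the sketch below. The toy data frame stands in for sw, and its values are made up for illustration.

```r
# Toy stand-in for `sw` (illustrative values, not taken from the dataset)
toy <- data.frame(
  PRCP = c(0.47, 0.00, 0.12, NA, 0.00),
  RAIN = c(TRUE, FALSE, TRUE, NA, FALSE)
)

# Compare only rows where both columns are present
complete <- !is.na(toy$PRCP) & !is.na(toy$RAIN)

# Does RAIN agree with "some precipitation was recorded" on every complete row?
rain_matches_prcp <- all(toy$RAIN[complete] == (toy$PRCP[complete] > 0))
rain_matches_prcp
```

Running the same three lines with sw in place of toy would show whether the relationship holds for the real data.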
# Remove records where RAIN is NA
sw1 <- sw[!is.na(sw$RAIN), ]
dim(sw1)
## [1] 25548 5
We’ve removed the 3 records with a missing RAIN value, so the data is ready for analysis.
# Split the data 80-20
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(123) # seed before the random split so the partition is reproducible
index <- createDataPartition(sw1$RAIN, p = 0.8, list = FALSE)
# Training dataset
sw1_train <- sw1[index,]
dim(sw1_train)
## [1] 20439 5
# Testing dataset
sw1_test <- sw1[-index,]
dim(sw1_test)
## [1] 5109 5
The data has been split into training and test sets in a 4:1 ratio.
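createDataPartition samples within each level of the outcome, so the share of rainy days should be about the same in both sets. The base R sketch below imitates that idea on simulated labels. stratified_index is a hypothetical helper written for this illustration, not a caret function, and caret's exact rounding may differ.

```r
# Hypothetical helper: draw a fraction p of the row indices within each class
stratified_index <- function(y, p) {
  idx <- unlist(lapply(split(seq_along(y), y), function(rows) {
    rows[sample.int(length(rows), ceiling(p * length(rows)))]
  }))
  sort(unname(idx))
}

set.seed(42)
# Simulated rain labels (not the Seattle data)
rain <- sample(c(TRUE, FALSE), 1000, replace = TRUE, prob = c(0.43, 0.57))
train_rows <- stratified_index(rain, p = 0.8)

# Share of rainy days in each part; the two numbers should be close
c(train = mean(rain[train_rows]), test = mean(rain[-train_rows]))
```

On the real data the same comparison would be mean(sw1_train$RAIN) against mean(sw1_test$RAIN).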
#Decision Tree Model Used
set.seed(123)
library(rpart)
sw1_model<-rpart(RAIN~TMAX+TMIN, method="class", control=rpart.control(minsplit=5, cp=0.000001), data=sw1_train)
sw1_pred<-predict(sw1_model, type="class")
conf_matrix<-table(sw1_pred,sw1_train$RAIN)
cat("Confusion_Matrix:")
## Confusion_Matrix:
conf_matrix
##
## sw1_pred FALSE TRUE
## FALSE 9100 2020
## TRUE 2619 6700
Of the 11,719 days on which rain didn’t fall, the system correctly predicted no rain 9,100 times. Of the 8,720 days on which rain fell, it correctly predicted rain 6,700 times.
accuracy<-(conf_matrix[1,1]+conf_matrix[2,2])/(sum(conf_matrix))
cat("Accuracy of the Model:")
## Accuracy of the Model:
accuracy
## [1] 0.7730319
Accuracy is the main metric here. At 77% on the training data it is good but not great; feeding more data to the algorithm may improve the result further.
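To put that figure in context, it helps to compare it with the accuracy of always predicting the more common class. The sketch below rebuilds the confusion matrix printed above by hand and computes that baseline. The names cm, baseline, and tree_accuracy are introduced here for illustration only.

```r
# Training confusion matrix as printed above (rows = predicted, columns = actual)
cm <- matrix(c(9100, 2619, 2020, 6700), nrow = 2,
             dimnames = list(predicted = c("FALSE", "TRUE"),
                             actual    = c("FALSE", "TRUE")))

actual_totals <- colSums(cm)                # dry days and rainy days in the training set
baseline <- max(actual_totals) / sum(cm)    # accuracy of always guessing the majority class
tree_accuracy <- sum(diag(cm)) / sum(cm)    # accuracy of the decision tree

round(c(baseline = baseline, tree_accuracy = tree_accuracy), 3)
```

Always predicting no rain would be right on about 57% of the training days, so the tree adds roughly 20 percentage points over that baseline.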
sensi <- (conf_matrix[1,1])/(conf_matrix[1,1]+conf_matrix[1,2])
cat("Sensitivity of the Model:")
## Sensitivity of the Model:
sensi
## [1] 0.8183453
speci <- (conf_matrix[2,2])/(conf_matrix[2,1]+conf_matrix[2,2])
cat("Specificity of the Model:")
## Specificity of the Model:
speci
## [1] 0.7189613
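The value the code labels sensitivity is 82% and the value it labels specificity is 72%. Because the rows of conf_matrix are the predictions, these ratios describe the predictions: 82% of the days predicted dry were dry, and 72% of the days predicted rainy were rainy.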
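The textbook definitions of sensitivity and specificity divide by the actual class totals, which are the columns of conf_matrix. The sketch below applies them to the training matrix printed earlier. It treats no rain (FALSE) as the positive class, following the order used in the code above; that choice and the variable names are assumptions made for this illustration.

```r
# Training confusion matrix as printed above (rows = predicted, columns = actual)
cm <- matrix(c(9100, 2619, 2020, 6700), nrow = 2,
             dimnames = list(predicted = c("FALSE", "TRUE"),
                             actual    = c("FALSE", "TRUE")))

# Positive class assumed to be "no rain" (FALSE)
tp <- cm["FALSE", "FALSE"]   # predicted dry, was dry
fn <- cm["TRUE",  "FALSE"]   # predicted rain, was dry
tn <- cm["TRUE",  "TRUE"]    # predicted rain, was rainy
fp <- cm["FALSE", "TRUE"]    # predicted dry, was rainy

sensitivity <- tp / (tp + fn)   # share of dry days the model identified
specificity <- tn / (tn + fp)   # share of rainy days the model identified
round(c(sensitivity = sensitivity, specificity = specificity), 3)
```

On these numbers the column-wise definitions give about 78% sensitivity and 77% specificity for the training data.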
#Validation accuracy
sw1_test$pred <- predict(sw1_model, sw1_test, type="class")
conf_matrix_val<-table(sw1_test$pred,sw1_test$RAIN)
cat("Confusion_Matrix_test_val:")
## Confusion_Matrix_test_val:
conf_matrix_val
##
## FALSE TRUE
## FALSE 2265 543
## TRUE 664 1637
set.seed(4400)
Validating the model on the test set, and treating no rain (FALSE) as the positive class, we get 2,265 true positives, 543 false positives, 664 false negatives, and 1,637 true negatives.
accuracy_val<-(conf_matrix_val[1,1]+conf_matrix_val[2,2])/(sum(conf_matrix_val))
cat("Accuracy of the Model_test_val:")
## Accuracy of the Model_test_val:
accuracy_val
## [1] 0.7637502
Our test set achieved an accuracy of 76%, which is roughly what this model is likely to achieve on similar data.
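The tree was grown with minsplit = 5 and a very small cp, so it is allowed to become large. The gap between training and test accuracy is small here, but one optional check is to prune the tree back to the complexity with the lowest cross-validated error. The sketch below shows the idea on simulated data, not the Seattle records; toy, full_tree, and pruned_tree are names introduced for this illustration.

```r
library(rpart)

# Simulated stand-in for sw1_train (not the Seattle data)
set.seed(123)
n <- 2000
toy <- data.frame(TMAX = round(rnorm(n, mean = 60, sd = 12)))
toy$TMIN <- toy$TMAX - sample(5:25, n, replace = TRUE)
toy$RAIN <- factor(runif(n) < plogis((58 - toy$TMAX) / 8))

# Same control settings as the model above: a deliberately large tree
full_tree <- rpart(RAIN ~ TMAX + TMIN, data = toy, method = "class",
                   control = rpart.control(minsplit = 5, cp = 0.000001))

# Pick the complexity parameter with the lowest cross-validated error, then prune
cp_table <- full_tree$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)

# Number of leaves before and after pruning
leaves <- c(full = sum(full_tree$frame$var == "<leaf>"),
            pruned = sum(pruned_tree$frame$var == "<leaf>"))
leaves
```

Whether pruning helps on the real data would have to be checked by comparing test accuracy for sw1_model and its pruned version.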
sensi_val <- (conf_matrix_val[1,1])/(conf_matrix_val[1,1]+conf_matrix_val[1,2])
cat("Sensitivity of the Model_test_val:")
## Sensitivity of the Model_test_val:
sensi_val
## [1] 0.8066239
speci_val <- (conf_matrix_val[2,2])/(conf_matrix_val[2,1]+conf_matrix_val[2,2])
cat("Specificity of the Model_test_val:")
## Specificity of the Model_test_val:
speci_val
## [1] 0.7114298
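Sensitivity as computed here is 81% on the test set and specificity is 71%. Both ratios are taken over the rows of the table, which are the predictions, so they say that 81% of the days predicted dry were dry and 71% of the days predicted rainy were rainy.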
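The next chunk reads sw1_test.csv, but the code that created that file is not shown. Judging by the extra column of row numbers in the output, it was probably written with write.csv and its default row names. The sketch below shows that round trip on a two-row stand-in, using base read.csv and a temporary file path in place of the real one; both are assumptions made for this illustration.

```r
# Two-row stand-in for sw1_test with its pred column
toy_test <- data.frame(
  DATE = as.Date(c("1948-01-12", "1948-01-23")),
  PRCP = c(0, 0),
  TMAX = c(41L, 47L),
  TMIN = c(26L, 43L),
  RAIN = c(FALSE, FALSE),
  pred = factor(c("FALSE", "TRUE"))
)

path <- tempfile(fileext = ".csv")   # stand-in for "sw1_test.csv"
write.csv(toy_test, path)            # row.names = TRUE by default, adding a first column of row numbers
result <- read.csv(path)

names(result)   # the row-number column comes first, followed by the six data columns
```

Passing row.names = FALSE to write.csv would leave that extra column out.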
#View result of test dataset
set.seed(4400) # For identical results across all document evaluations
library(readr)
result <-read_csv("sw1_test.csv")
head(result)
## # A tibble: 6 x 7
## X1 DATE PRCP TMAX TMIN RAIN pred
## <int> <date> <dbl> <int> <int> <lgl> <lgl>
## 1 1 1948-01-12 0 41 26 FALSE FALSE
## 2 2 1948-01-23 0 47 43 FALSE TRUE
## 3 3 1948-01-27 0 53 33 FALSE FALSE
## 4 4 1948-01-29 0.22 42 34 TRUE TRUE
## 5 5 1948-02-01 0.03 39 30 TRUE TRUE
## 6 6 1948-02-04 0.14 39 31 TRUE TRUE
These are the predictions for the first 6 days in the test set. The algorithm was wrong once out of 6, on 1948-01-23.