Overview of Data:
13 variable in Train set (12 in test, need rainfall - but that is what we are predicting)
Structures seen below, need to adjust rainfall for a factor
#Load Library
library(mice)
## Warning: package 'mice' was built under R version 4.4.3
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Bring in Data
train <- read.csv("C:/Users/gianc/OneDrive/Desktop/BUA 302 - FINAL FINAL FINAL/train.csv", na.strings = c("", "NA", header = TRUE))
test <- read.csv("C:/Users/gianc/OneDrive/Desktop/BUA 302 - FINAL FINAL FINAL/test.csv")
test$rainfall <- ""
test$rainfall <- as.factor(test$rainfall)
#See structure
str(train)
## 'data.frame': 2190 obs. of 13 variables:
## $ id : int 0 1 2 3 4 5 6 7 8 9 ...
## $ day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ pressure : num 1017 1020 1024 1013 1022 ...
## $ maxtemp : num 21.2 16.2 19.4 18.1 21.3 20.6 19.5 15.8 17.6 16.5 ...
## $ temparature : num 20.6 16.9 16.1 17.8 18.4 18.6 18.4 13.6 16.5 14.4 ...
## $ mintemp : num 19.9 15.8 14.6 16.9 15.2 16.5 15.3 12.7 15.6 12 ...
## $ dewpoint : num 19.4 15.4 9.3 16.8 9.6 12.5 11.3 11.8 12.5 8.6 ...
## $ humidity : num 87 95 75 95 52 79 56 96 86 77 ...
## $ cloud : num 88 91 47 95 45 81 46 100 100 84 ...
## $ sunshine : num 1.1 0 8.3 0 3.6 0 7.6 0 0 1 ...
## $ winddirection: num 60 50 70 60 40 20 20 50 50 50 ...
## $ windspeed : num 17.2 21.9 18.1 35.6 24.8 15.7 28.4 52.8 37.5 38.3 ...
## $ rainfall : int 1 1 1 1 0 1 0 1 1 0 ...
str(test)
## 'data.frame': 730 obs. of 13 variables:
## $ id : int 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 ...
## $ day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ pressure : num 1020 1016 1024 1023 1022 ...
## $ maxtemp : num 17.5 17.5 11.2 20.6 16.1 15.6 15.5 20.5 16.3 10.4 ...
## $ temparature : num 15.8 16.5 10.4 17.3 13.8 12.6 13.7 16.2 13.2 8.5 ...
## $ mintemp : num 12.7 15.8 9.4 15.2 6.4 11.5 10.7 15.2 11.3 7 ...
## $ dewpoint : num 14.9 15.1 8.9 9.5 4.3 9 11.8 13.1 10.8 3.1 ...
## $ humidity : num 96 97 86 75 68 76 79 94 85 69 ...
## $ cloud : num 99 99 96 45 49 94 95 93 99 88 ...
## $ sunshine : num 0 0 0 7.1 9.2 0 0 0.2 0.1 0 ...
## $ winddirection: num 50 50 40 20 20 20 20 70 20 20 ...
## $ windspeed : num 24.3 35.3 16.9 50.6 19.4 41.4 43.1 41.3 34 26.4 ...
## $ rainfall : Factor w/ 1 level "": 1 1 1 1 1 1 1 1 1 1 ...
#Double check for change as factor
str(train)
## 'data.frame': 2190 obs. of 13 variables:
## $ id : int 0 1 2 3 4 5 6 7 8 9 ...
## $ day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ pressure : num 1017 1020 1024 1013 1022 ...
## $ maxtemp : num 21.2 16.2 19.4 18.1 21.3 20.6 19.5 15.8 17.6 16.5 ...
## $ temparature : num 20.6 16.9 16.1 17.8 18.4 18.6 18.4 13.6 16.5 14.4 ...
## $ mintemp : num 19.9 15.8 14.6 16.9 15.2 16.5 15.3 12.7 15.6 12 ...
## $ dewpoint : num 19.4 15.4 9.3 16.8 9.6 12.5 11.3 11.8 12.5 8.6 ...
## $ humidity : num 87 95 75 95 52 79 56 96 86 77 ...
## $ cloud : num 88 91 47 95 45 81 46 100 100 84 ...
## $ sunshine : num 1.1 0 8.3 0 3.6 0 7.6 0 0 1 ...
## $ winddirection: num 60 50 70 60 40 20 20 50 50 50 ...
## $ windspeed : num 17.2 21.9 18.1 35.6 24.8 15.7 28.4 52.8 37.5 38.3 ...
## $ rainfall : int 1 1 1 1 0 1 0 1 1 0 ...
str(test)
## 'data.frame': 730 obs. of 13 variables:
## $ id : int 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 ...
## $ day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ pressure : num 1020 1016 1024 1023 1022 ...
## $ maxtemp : num 17.5 17.5 11.2 20.6 16.1 15.6 15.5 20.5 16.3 10.4 ...
## $ temparature : num 15.8 16.5 10.4 17.3 13.8 12.6 13.7 16.2 13.2 8.5 ...
## $ mintemp : num 12.7 15.8 9.4 15.2 6.4 11.5 10.7 15.2 11.3 7 ...
## $ dewpoint : num 14.9 15.1 8.9 9.5 4.3 9 11.8 13.1 10.8 3.1 ...
## $ humidity : num 96 97 86 75 68 76 79 94 85 69 ...
## $ cloud : num 99 99 96 45 49 94 95 93 99 88 ...
## $ sunshine : num 0 0 0 7.1 9.2 0 0 0.2 0.1 0 ...
## $ winddirection: num 50 50 40 20 20 20 20 70 20 20 ...
## $ windspeed : num 24.3 35.3 16.9 50.6 19.4 41.4 43.1 41.3 34 26.4 ...
## $ rainfall : Factor w/ 1 level "": 1 1 1 1 1 1 1 1 1 1 ...
colSums(is.na(train)) #No NAs
## id day pressure maxtemp temparature
## 0 0 0 0 0
## mintemp dewpoint humidity cloud sunshine
## 0 0 0 0 0
## winddirection windspeed rainfall
## 0 0 0
colSums(is.na(test)) #One NA in TEST
## id day pressure maxtemp temparature
## 0 0 0 0 0
## mintemp dewpoint humidity cloud sunshine
## 0 0 0 0 0
## winddirection windspeed rainfall
## 1 0 0
#used package mice, to use rainfall method to input missing winddirection value.
test <- test %>%
mice(method = "rf")%>%
complete()
##
## iter imp variable
## 1 1 winddirection
## 1 2 winddirection
## 1 3 winddirection
## 1 4 winddirection
## 1 5 winddirection
## 2 1 winddirection
## 2 2 winddirection
## 2 3 winddirection
## 2 4 winddirection
## 2 5 winddirection
## 3 1 winddirection
## 3 2 winddirection
## 3 3 winddirection
## 3 4 winddirection
## 3 5 winddirection
## 4 1 winddirection
## 4 2 winddirection
## 4 3 winddirection
## 4 4 winddirection
## 4 5 winddirection
## 5 1 winddirection
## 5 2 winddirection
## 5 3 winddirection
## 5 4 winddirection
## 5 5 winddirection
## Warning: Number of logged events: 1
#No more missing values
colSums(is.na(test))
## id day pressure maxtemp temparature
## 0 0 0 0 0
## mintemp dewpoint humidity cloud sunshine
## 0 0 0 0 0
## winddirection windspeed rainfall
## 0 0 0
Simply looking at the correlation those that are significant are below:
Positive - humidity & cloud
Negative - sunshine
We are able to understand this since our predictor variable is binary. There is a positive relationship we humidity and cloud present because it would result in rainfall, with a 1. On the other hand, if there is sunshine then this will result in a 0. BUT nothing is certain as for right now, this is just logical thinking.
I would also like to point our the strong relationship with our numerical variables such as pressure, maxtemp, temperature, minttemp, dewpoint. We should keep aware of these relationship when generating our model.
str(train)
## 'data.frame': 2190 obs. of 13 variables:
## $ id : int 0 1 2 3 4 5 6 7 8 9 ...
## $ day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ pressure : num 1017 1020 1024 1013 1022 ...
## $ maxtemp : num 21.2 16.2 19.4 18.1 21.3 20.6 19.5 15.8 17.6 16.5 ...
## $ temparature : num 20.6 16.9 16.1 17.8 18.4 18.6 18.4 13.6 16.5 14.4 ...
## $ mintemp : num 19.9 15.8 14.6 16.9 15.2 16.5 15.3 12.7 15.6 12 ...
## $ dewpoint : num 19.4 15.4 9.3 16.8 9.6 12.5 11.3 11.8 12.5 8.6 ...
## $ humidity : num 87 95 75 95 52 79 56 96 86 77 ...
## $ cloud : num 88 91 47 95 45 81 46 100 100 84 ...
## $ sunshine : num 1.1 0 8.3 0 3.6 0 7.6 0 0 1 ...
## $ winddirection: num 60 50 70 60 40 20 20 50 50 50 ...
## $ windspeed : num 17.2 21.9 18.1 35.6 24.8 15.7 28.4 52.8 37.5 38.3 ...
## $ rainfall : int 1 1 1 1 0 1 0 1 1 0 ...
cor(train)
## id day pressure maxtemp temparature
## id 1.000000000 0.1530651187 -0.008234737 0.01258991 0.01430652
## day 0.153065119 1.0000000000 0.005337193 0.14629426 0.15359022
## pressure -0.008234737 0.0053371931 1.000000000 -0.80049902 -0.81653085
## maxtemp 0.012589910 0.1462942610 -0.800499025 1.00000000 0.98293176
## temparature 0.014306518 0.1535902206 -0.816530852 0.98293176 1.00000000
## mintemp 0.018707781 0.1614752942 -0.814453338 0.96552873 0.98714995
## dewpoint 0.006796853 0.1379288312 -0.817007955 0.90670259 0.93361741
## humidity -0.029042329 -0.0740478530 -0.119949253 -0.07261494 -0.02501622
## cloud 0.002226192 -0.0481749768 0.098599808 -0.28904652 -0.24935465
## sunshine -0.003022229 0.0609361285 -0.257163439 0.45238748 0.41401861
## winddirection -0.004222734 0.0247996669 -0.643292856 0.66223482 0.66896268
## windspeed 0.020167291 -0.0001985325 0.266012467 -0.35416795 -0.34226188
## rainfall 0.033674078 -0.0004619472 -0.049885545 -0.07930352 -0.04966020
## mintemp dewpoint humidity cloud sunshine
## id 0.018707781 0.006796853 -0.029042329 0.002226192 -0.003022229
## day 0.161475294 0.137928831 -0.074047853 -0.048174977 0.060936128
## pressure -0.814453338 -0.817007955 -0.119949253 0.098599808 -0.257163439
## maxtemp 0.965528730 0.906702585 -0.072614942 -0.289046518 0.452387476
## temparature 0.987149954 0.933617410 -0.025016217 -0.249354652 0.414018610
## mintemp 1.000000000 0.941341633 0.009891374 -0.219398782 0.379496549
## dewpoint 0.941341633 1.000000000 0.153389749 -0.088446054 0.249676271
## humidity 0.009891374 0.153389749 1.000000000 0.584854349 -0.541591914
## cloud -0.219398782 -0.088446054 0.584854349 1.000000000 -0.805128433
## sunshine 0.379496549 0.249676271 -0.541591914 -0.805128433 1.000000000
## winddirection 0.663828248 0.643073259 -0.012430170 -0.127086795 0.272234726
## windspeed -0.328870508 -0.312178683 0.062284700 0.184697543 -0.241751882
## rainfall -0.026841468 0.081965037 0.454213483 0.641191204 -0.555286920
## winddirection windspeed rainfall
## id -0.004222734 0.0201672909 0.0336740779
## day 0.024799667 -0.0001985325 -0.0004619472
## pressure -0.643292856 0.2660124670 -0.0498855452
## maxtemp 0.662234820 -0.3541679464 -0.0793035209
## temparature 0.668962683 -0.3422618828 -0.0496601951
## mintemp 0.663828248 -0.3288705084 -0.0268414679
## dewpoint 0.643073259 -0.3121786832 0.0819650369
## humidity -0.012430170 0.0622847001 0.4542134832
## cloud -0.127086795 0.1846975426 0.6411912036
## sunshine 0.272234726 -0.2417518820 -0.5552869202
## winddirection 1.000000000 -0.1924169131 -0.0069391214
## windspeed -0.192416913 1.0000000000 0.1116245937
## rainfall -0.006939121 0.1116245937 1.0000000000
corrplot(cor(train))
We created a Density plots to depict the smoothed estimate of data distribution. Remember Density does not equate to frequency, but probabilities under the curve.
pressure
Very slight difference. When pressure reaches 1010-1020, slighting more likely to rain.
If pressure exceeds 1020, the it will most likely NOT rain.
maxtemp
when maximum temperature reach 24-30 degrees, it will be more likely to rain.
after 33 degrees, more likely NOT to rain
temperature
mintemp
dewpoint
will NOT rain below 10
will rain around 15-17
after 25 it will be more likely to rain
humidity
cloud
Anything below 60 it will most likely NOT rain
Huge spike anything after 75 it will most likely rain
sunshine
Huge opposite shifts comparing when it rains and when it does not
Anything 0-2.5 measurement of sunshine it is most likely going to rain
Anything around5 and above, there will most likely not be any rainfall
winddirection
windspeed
In conclusion there were only 3 variables that had significant disparity in density when compared to their status of rainfall.
# Select numerical variables and rainfall
numeric_vars <- c("pressure", "maxtemp", "temparature", "mintemp", "dewpoint",
"humidity", "cloud", "sunshine", "winddirection", "windspeed")
# Melt data to long format for facetting
train_long <- melt(train, id.vars = "rainfall", measure.vars = numeric_vars)
# Plot density with facet wrap
ggplot(train_long, aes(x = value, fill = factor(rainfall))) +
geom_density(alpha = 0.5) + # Transparent fill
facet_wrap(~ variable, scales = "free") + # Wrap by variable, free scales
labs(title = "Density Plots of Numerical Variables by Rainfall",
x = "Value", y = "Density", fill = "Rainfall") +
theme_minimal()
We will choose a logistic regression model to answer what is the probability of an event occurring. In this case, we are attempting to answer the likelihood of rainfall.
Running a logistic regression and utilizing all of our variables except for id and day, we receive a accuracy rate of 86.8%.
The decrease from 2444.4 to 1443.7 suggest that the model has improved over the null model, when predictors are involved.
As FOR THE AIC comparing to our second model our first model is better.
# Fit logistic regression
log_reg <- glm(rainfall ~ . - id - day , data = train, family = "binomial")
summary(log_reg)
##
## Call:
## glm(formula = rainfall ~ . - id - day, family = "binomial", data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 26.3349469 22.8572309 1.152 0.249260
## pressure -0.0333959 0.0219648 -1.520 0.128404
## maxtemp 0.0276169 0.0678539 0.407 0.684004
## temparature -0.0411014 0.1152849 -0.357 0.721451
## mintemp -0.0678926 0.0894629 -0.759 0.447918
## dewpoint 0.1504699 0.0415929 3.618 0.000297 ***
## humidity 0.0402758 0.0124140 3.244 0.001177 **
## cloud 0.0638011 0.0062067 10.279 < 2e-16 ***
## sunshine -0.1540483 0.0320463 -4.807 1.53e-06 ***
## winddirection -0.0003703 0.0011284 -0.328 0.742781
## windspeed 0.0151766 0.0077001 1.971 0.048728 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2446.4 on 2189 degrees of freedom
## Residual deviance: 1443.7 on 2179 degrees of freedom
## AIC: 1465.7
##
## Number of Fisher Scoring iterations: 5
# Predict on training data
log_probs <- predict(log_reg, newdata = train, type = "response")
# Convert probabilities to binary predictions
log_preds <- ifelse(log_probs > 0.5, 1, 0)
# Compute accuracy on TRAINING data
train_accuracy <- mean(log_preds == train$rainfall, na.rm = TRUE) # Ensure NA values are ignored
print(train_accuracy)
## [1] 0.8680365
While just focusing on the most significant variables, especially choosing these from our EDA, we see that it does not make or model any better. Statistically, we choose these variables due to the significant in our previous model (looking a p-values).
This model slightly decreases with out accuracy rate being 85.9%, an increase in Residual Deviance and an increase in AIC.
We will not be using this model and sticking to our first model. In the end, although, we believed that focusing on “important” variables; it did not make our model better.
# Fit logistic regression
log_reg2 <- glm(rainfall ~ humidity + cloud + sunshine, data = train, family = "binomial")
summary(log_reg2)
##
## Call:
## glm(formula = rainfall ~ humidity + cloud + sunshine, family = "binomial",
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.170335 0.973988 -9.415 < 2e-16 ***
## humidity 0.075670 0.011349 6.667 2.6e-11 ***
## cloud 0.065169 0.005991 10.877 < 2e-16 ***
## sunshine -0.087411 0.027317 -3.200 0.00138 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2446.4 on 2189 degrees of freedom
## Residual deviance: 1497.4 on 2186 degrees of freedom
## AIC: 1505.4
##
## Number of Fisher Scoring iterations: 5
# Predict on training data
log_probs2 <- predict(log_reg2, newdata = train, type = "response")
# Convert probabilities to binary predictions
log_preds2 <- ifelse(log_probs2 > 0.5, 1, 0)
# Compute accuracy on TRAINING data
train_accuracy2 <- mean(log_preds2 == train$rainfall, na.rm = TRUE) # Ensure NA values are ignored
print(train_accuracy2)
## [1] 0.8598174
# Predict probabilities on the TEST data
log_probs_test <- predict(log_reg, newdata = test, type = "response")
# Convert probabilities to binary predictions for the TEST data
log_preds_test <- ifelse(log_probs_test > 0.5, 1, 0)
# Fill the predictions into the 'rainfall' column of the test dataset
test$rainfall <- log_preds_test
# Select only the 'id' and 'rainfall' columns from the test dataframe
submission <- data.frame(id = test$id, rainfall = test$rainfall)
# Write the submission dataframe to a CSV file
write.csv(submission, file = 'final_project_log_v1.csv', row.names = FALSE)
Plan to see if using RandomForest will yield better results. One problem I had was understanding in this case, we are trying to classify. That is why for train we had to change it to a factor.
Overview of data:
We are trying to classify. In this case the binary outcome or rainfall or not.
Using 500 trees
3 variables are choose randomly at each split
Results of first model:
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(123)
# Convert rainfall to factor (if it's binary or categorical)
train$rainfall <- factor(train$rainfall, levels = c(0, 1))
# Re-run the Random Forest model for classification
rf_model <- randomForest(rainfall ~ . - id - day, data = train, importance = TRUE)
# Print model summary
print(rf_model)
##
## Call:
## randomForest(formula = rainfall ~ . - id - day, data = train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 13.7%
## Confusion matrix:
## 0 1 class.error
## 0 354 186 0.34444444
## 1 114 1536 0.06909091
Overall looking at the variable importance. We can see that for mean decrease accuracy, cloud is they highest variable. Mean decrease accuracy means is the measurement of accuracy if we were to exclude this variable from our model. This suggest that cloud is one strong predictor.
Look at mean decrease gini, which is the measurement of how strong a predictor is when splitting the data. We see sunshine and cloud be the strongest. Overall the variables that stand out within the graph to be useful predictors are: dewpoint, humidity, sunshine, and cloud.
# Show variable importance
var_importance <- importance(rf_model)
print(var_importance)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## pressure 8.984766 19.635104 23.703522 51.49967
## maxtemp 2.257418 23.193758 25.280407 51.70055
## temparature 9.594227 19.197536 23.139947 49.41607
## mintemp 6.484718 21.430431 24.626359 47.23691
## dewpoint 4.991463 27.887977 31.974545 56.76731
## humidity 15.602779 18.492336 25.785691 89.20012
## cloud 55.031423 49.591576 74.203579 231.32486
## sunshine 17.465665 27.492565 35.677425 147.88095
## winddirection 12.855963 13.401913 20.192767 34.12076
## windspeed 5.571231 5.901377 7.952763 53.97901
# Plot the importance of variables
varImpPlot(rf_model)
When wondering how to improve our randomForest model, I thought to change the number of trees. We decided to see that around 200 trees our model begins to stabilize.
# Track OOB error rate for various values of ntree
oob_errors <- rf_model$err.rate[, 1] # OOB error rate for class 0
trees <- 1:length(oob_errors) # Number of trees
# Plot the OOB error rate against the number of trees
plot(trees, oob_errors, type = "l", col = "blue", xlab = "Number of Trees", ylab = "OOB Error Rate")
Taking into consideration of what we have learned these are the changes we have made:
Only decided to use cloud, dewpoint, humidity, and sunshine due to our variable importance.
Change the number of trees to 200 instead of 500
Results of our new model:
We were able to decrease our class 0 error slightly to 34.1%
Slightly decrease our class 1 error rate to 6.8%
Overall there is a slight decrease to OOB error rate to 13.56%
set.seed(123)
rf_model2 <- randomForest(rainfall ~ cloud + dewpoint + humidity + sunshine - id - day, data = train, importance = TRUE, ntree = 200)
# Print model summary
print(rf_model2)
##
## Call:
## randomForest(formula = rainfall ~ cloud + dewpoint + humidity + sunshine - id - day, data = train, importance = TRUE, ntree = 200)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 13.56%
## Confusion matrix:
## 0 1 class.error
## 0 356 184 0.34074074
## 1 113 1537 0.06848485
# Predict class labels (0 or 1) for the test set
rf_preds <- predict(rf_model2, newdata = test)
# Create a submission dataframe with 'id' and the predicted 'rainfall'
rf_submission <- data.frame(id = test$id, rainfall = rf_preds)
# Write the submission dataframe to a CSV file
write.csv(rf_submission, file = 'submission_rf_model2.2.csv', row.names = FALSE)
After submitted our 1st logistic model and 2nd randomForest model here are the scores.
submission_rf_model2.2 received a score of .77205
final_project_log_v1 received a score of .75502
Ultimately our random forest model performed better when predicting rainfall. Since our best score is from our randomforest model, we are in #2486 on the leaderboard.