1 Data Preparation

Overview of Data:

13 variable in Train set (12 in test, need rainfall - but that is what we are predicting)
Structures seen below, need to adjust rainfall for a factor

#Load Library
library(mice)

## Warning: package 'mice' was built under R version 4.4.3

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.4.2

## corrplot 0.95 loaded

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.2

library(reshape2)

## Warning: package 'reshape2' was built under R version 4.4.2

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.4.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#Bring in Data 
train <- read.csv("C:/Users/gianc/OneDrive/Desktop/BUA 302 - FINAL FINAL FINAL/train.csv", na.strings = c("", "NA", header = TRUE))

test <- read.csv("C:/Users/gianc/OneDrive/Desktop/BUA 302 - FINAL FINAL FINAL/test.csv")
test$rainfall <- ""
test$rainfall <- as.factor(test$rainfall)


#See structure
str(train)

## 'data.frame':    2190 obs. of  13 variables:
##  $ id           : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ day          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pressure     : num  1017 1020 1024 1013 1022 ...
##  $ maxtemp      : num  21.2 16.2 19.4 18.1 21.3 20.6 19.5 15.8 17.6 16.5 ...
##  $ temparature  : num  20.6 16.9 16.1 17.8 18.4 18.6 18.4 13.6 16.5 14.4 ...
##  $ mintemp      : num  19.9 15.8 14.6 16.9 15.2 16.5 15.3 12.7 15.6 12 ...
##  $ dewpoint     : num  19.4 15.4 9.3 16.8 9.6 12.5 11.3 11.8 12.5 8.6 ...
##  $ humidity     : num  87 95 75 95 52 79 56 96 86 77 ...
##  $ cloud        : num  88 91 47 95 45 81 46 100 100 84 ...
##  $ sunshine     : num  1.1 0 8.3 0 3.6 0 7.6 0 0 1 ...
##  $ winddirection: num  60 50 70 60 40 20 20 50 50 50 ...
##  $ windspeed    : num  17.2 21.9 18.1 35.6 24.8 15.7 28.4 52.8 37.5 38.3 ...
##  $ rainfall     : int  1 1 1 1 0 1 0 1 1 0 ...

str(test)

## 'data.frame':    730 obs. of  13 variables:
##  $ id           : int  2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 ...
##  $ day          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pressure     : num  1020 1016 1024 1023 1022 ...
##  $ maxtemp      : num  17.5 17.5 11.2 20.6 16.1 15.6 15.5 20.5 16.3 10.4 ...
##  $ temparature  : num  15.8 16.5 10.4 17.3 13.8 12.6 13.7 16.2 13.2 8.5 ...
##  $ mintemp      : num  12.7 15.8 9.4 15.2 6.4 11.5 10.7 15.2 11.3 7 ...
##  $ dewpoint     : num  14.9 15.1 8.9 9.5 4.3 9 11.8 13.1 10.8 3.1 ...
##  $ humidity     : num  96 97 86 75 68 76 79 94 85 69 ...
##  $ cloud        : num  99 99 96 45 49 94 95 93 99 88 ...
##  $ sunshine     : num  0 0 0 7.1 9.2 0 0 0.2 0.1 0 ...
##  $ winddirection: num  50 50 40 20 20 20 20 70 20 20 ...
##  $ windspeed    : num  24.3 35.3 16.9 50.6 19.4 41.4 43.1 41.3 34 26.4 ...
##  $ rainfall     : Factor w/ 1 level "": 1 1 1 1 1 1 1 1 1 1 ...

#Double check for change as factor
str(train)

## 'data.frame':    2190 obs. of  13 variables:
##  $ id           : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ day          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pressure     : num  1017 1020 1024 1013 1022 ...
##  $ maxtemp      : num  21.2 16.2 19.4 18.1 21.3 20.6 19.5 15.8 17.6 16.5 ...
##  $ temparature  : num  20.6 16.9 16.1 17.8 18.4 18.6 18.4 13.6 16.5 14.4 ...
##  $ mintemp      : num  19.9 15.8 14.6 16.9 15.2 16.5 15.3 12.7 15.6 12 ...
##  $ dewpoint     : num  19.4 15.4 9.3 16.8 9.6 12.5 11.3 11.8 12.5 8.6 ...
##  $ humidity     : num  87 95 75 95 52 79 56 96 86 77 ...
##  $ cloud        : num  88 91 47 95 45 81 46 100 100 84 ...
##  $ sunshine     : num  1.1 0 8.3 0 3.6 0 7.6 0 0 1 ...
##  $ winddirection: num  60 50 70 60 40 20 20 50 50 50 ...
##  $ windspeed    : num  17.2 21.9 18.1 35.6 24.8 15.7 28.4 52.8 37.5 38.3 ...
##  $ rainfall     : int  1 1 1 1 0 1 0 1 1 0 ...

str(test)

## 'data.frame':    730 obs. of  13 variables:
##  $ id           : int  2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 ...
##  $ day          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pressure     : num  1020 1016 1024 1023 1022 ...
##  $ maxtemp      : num  17.5 17.5 11.2 20.6 16.1 15.6 15.5 20.5 16.3 10.4 ...
##  $ temparature  : num  15.8 16.5 10.4 17.3 13.8 12.6 13.7 16.2 13.2 8.5 ...
##  $ mintemp      : num  12.7 15.8 9.4 15.2 6.4 11.5 10.7 15.2 11.3 7 ...
##  $ dewpoint     : num  14.9 15.1 8.9 9.5 4.3 9 11.8 13.1 10.8 3.1 ...
##  $ humidity     : num  96 97 86 75 68 76 79 94 85 69 ...
##  $ cloud        : num  99 99 96 45 49 94 95 93 99 88 ...
##  $ sunshine     : num  0 0 0 7.1 9.2 0 0 0.2 0.1 0 ...
##  $ winddirection: num  50 50 40 20 20 20 20 70 20 20 ...
##  $ windspeed    : num  24.3 35.3 16.9 50.6 19.4 41.4 43.1 41.3 34 26.4 ...
##  $ rainfall     : Factor w/ 1 level "": 1 1 1 1 1 1 1 1 1 1 ...

2 Handling Missing Data

One NA for winddirection in Test Set
- Used ‘mice’, specifically using the rainfall method to generate value for missing NA

colSums(is.na(train)) #No NAs

##            id           day      pressure       maxtemp   temparature 
##             0             0             0             0             0 
##       mintemp      dewpoint      humidity         cloud      sunshine 
##             0             0             0             0             0 
## winddirection     windspeed      rainfall 
##             0             0             0

colSums(is.na(test)) #One NA in TEST

##            id           day      pressure       maxtemp   temparature 
##             0             0             0             0             0 
##       mintemp      dewpoint      humidity         cloud      sunshine 
##             0             0             0             0             0 
## winddirection     windspeed      rainfall 
##             1             0             0

#used package mice, to use rainfall method to input missing winddirection value.
test <- test %>%
  mice(method = "rf")%>%
  complete()

## 
##  iter imp variable
##   1   1  winddirection
##   1   2  winddirection
##   1   3  winddirection
##   1   4  winddirection
##   1   5  winddirection
##   2   1  winddirection
##   2   2  winddirection
##   2   3  winddirection
##   2   4  winddirection
##   2   5  winddirection
##   3   1  winddirection
##   3   2  winddirection
##   3   3  winddirection
##   3   4  winddirection
##   3   5  winddirection
##   4   1  winddirection
##   4   2  winddirection
##   4   3  winddirection
##   4   4  winddirection
##   4   5  winddirection
##   5   1  winddirection
##   5   2  winddirection
##   5   3  winddirection
##   5   4  winddirection
##   5   5  winddirection

## Warning: Number of logged events: 1

#No more missing values 
colSums(is.na(test))

##            id           day      pressure       maxtemp   temparature 
##             0             0             0             0             0 
##       mintemp      dewpoint      humidity         cloud      sunshine 
##             0             0             0             0             0 
## winddirection     windspeed      rainfall 
##             0             0             0

3 Feature Engineering - Correlation

Simply looking at the correlation those that are significant are below:

Positive - humidity & cloud
Negative - sunshine

We are able to understand this since our predictor variable is binary. There is a positive relationship we humidity and cloud present because it would result in rainfall, with a 1. On the other hand, if there is sunshine then this will result in a 0. BUT nothing is certain as for right now, this is just logical thinking.

I would also like to point our the strong relationship with our numerical variables such as pressure, maxtemp, temperature, minttemp, dewpoint. We should keep aware of these relationship when generating our model.

str(train)

## 'data.frame':    2190 obs. of  13 variables:
##  $ id           : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ day          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pressure     : num  1017 1020 1024 1013 1022 ...
##  $ maxtemp      : num  21.2 16.2 19.4 18.1 21.3 20.6 19.5 15.8 17.6 16.5 ...
##  $ temparature  : num  20.6 16.9 16.1 17.8 18.4 18.6 18.4 13.6 16.5 14.4 ...
##  $ mintemp      : num  19.9 15.8 14.6 16.9 15.2 16.5 15.3 12.7 15.6 12 ...
##  $ dewpoint     : num  19.4 15.4 9.3 16.8 9.6 12.5 11.3 11.8 12.5 8.6 ...
##  $ humidity     : num  87 95 75 95 52 79 56 96 86 77 ...
##  $ cloud        : num  88 91 47 95 45 81 46 100 100 84 ...
##  $ sunshine     : num  1.1 0 8.3 0 3.6 0 7.6 0 0 1 ...
##  $ winddirection: num  60 50 70 60 40 20 20 50 50 50 ...
##  $ windspeed    : num  17.2 21.9 18.1 35.6 24.8 15.7 28.4 52.8 37.5 38.3 ...
##  $ rainfall     : int  1 1 1 1 0 1 0 1 1 0 ...

cor(train)

##                         id           day     pressure     maxtemp temparature
## id             1.000000000  0.1530651187 -0.008234737  0.01258991  0.01430652
## day            0.153065119  1.0000000000  0.005337193  0.14629426  0.15359022
## pressure      -0.008234737  0.0053371931  1.000000000 -0.80049902 -0.81653085
## maxtemp        0.012589910  0.1462942610 -0.800499025  1.00000000  0.98293176
## temparature    0.014306518  0.1535902206 -0.816530852  0.98293176  1.00000000
## mintemp        0.018707781  0.1614752942 -0.814453338  0.96552873  0.98714995
## dewpoint       0.006796853  0.1379288312 -0.817007955  0.90670259  0.93361741
## humidity      -0.029042329 -0.0740478530 -0.119949253 -0.07261494 -0.02501622
## cloud          0.002226192 -0.0481749768  0.098599808 -0.28904652 -0.24935465
## sunshine      -0.003022229  0.0609361285 -0.257163439  0.45238748  0.41401861
## winddirection -0.004222734  0.0247996669 -0.643292856  0.66223482  0.66896268
## windspeed      0.020167291 -0.0001985325  0.266012467 -0.35416795 -0.34226188
## rainfall       0.033674078 -0.0004619472 -0.049885545 -0.07930352 -0.04966020
##                    mintemp     dewpoint     humidity        cloud     sunshine
## id             0.018707781  0.006796853 -0.029042329  0.002226192 -0.003022229
## day            0.161475294  0.137928831 -0.074047853 -0.048174977  0.060936128
## pressure      -0.814453338 -0.817007955 -0.119949253  0.098599808 -0.257163439
## maxtemp        0.965528730  0.906702585 -0.072614942 -0.289046518  0.452387476
## temparature    0.987149954  0.933617410 -0.025016217 -0.249354652  0.414018610
## mintemp        1.000000000  0.941341633  0.009891374 -0.219398782  0.379496549
## dewpoint       0.941341633  1.000000000  0.153389749 -0.088446054  0.249676271
## humidity       0.009891374  0.153389749  1.000000000  0.584854349 -0.541591914
## cloud         -0.219398782 -0.088446054  0.584854349  1.000000000 -0.805128433
## sunshine       0.379496549  0.249676271 -0.541591914 -0.805128433  1.000000000
## winddirection  0.663828248  0.643073259 -0.012430170 -0.127086795  0.272234726
## windspeed     -0.328870508 -0.312178683  0.062284700  0.184697543 -0.241751882
## rainfall      -0.026841468  0.081965037  0.454213483  0.641191204 -0.555286920
##               winddirection     windspeed      rainfall
## id             -0.004222734  0.0201672909  0.0336740779
## day             0.024799667 -0.0001985325 -0.0004619472
## pressure       -0.643292856  0.2660124670 -0.0498855452
## maxtemp         0.662234820 -0.3541679464 -0.0793035209
## temparature     0.668962683 -0.3422618828 -0.0496601951
## mintemp         0.663828248 -0.3288705084 -0.0268414679
## dewpoint        0.643073259 -0.3121786832  0.0819650369
## humidity       -0.012430170  0.0622847001  0.4542134832
## cloud          -0.127086795  0.1846975426  0.6411912036
## sunshine        0.272234726 -0.2417518820 -0.5552869202
## winddirection   1.000000000 -0.1924169131 -0.0069391214
## windspeed      -0.192416913  1.0000000000  0.1116245937
## rainfall       -0.006939121  0.1116245937  1.0000000000

corrplot(cor(train))

4 Exploratory Data Analysis

4.1 Density plots across all variables by rainfall

We created a Density plots to depict the smoothed estimate of data distribution. Remember Density does not equate to frequency, but probabilities under the curve.

pressure
1. Very slight difference. When pressure reaches 1010-1020, slighting more likely to rain.
2. If pressure exceeds 1020, the it will most likely NOT rain.
maxtemp
1. when maximum temperature reach 24-30 degrees, it will be more likely to rain.
2. after 33 degrees, more likely NOT to rain
temperature
1. Very slight increase in likelihood to rain between 20-27.
mintemp
1. Very slight increase in rainfall when minimum temp is between 17 - 15 degrees.
dewpoint
1. will NOT rain below 10
2. will rain around 15-17
3. after 25 it will be more likely to rain
humidity
1. VERY significant different, when humidity reaches 80, there is a massive decrease in likelihood not to rain & a significant increase in rainfall.
cloud
1. Anything below 60 it will most likely NOT rain
2. Huge spike anything after 75 it will most likely rain
sunshine
1. Huge opposite shifts comparing when it rains and when it does not
2. Anything 0-2.5 measurement of sunshine it is most likely going to rain
3. Anything around5 and above, there will most likely not be any rainfall
winddirection
1. It is difficult to see and high peaks, expect for ranges of 50-100 more likely to rain.
windspeed
1. If windspeed are above 20, it is a very slight increase that there will be rain

In conclusion there were only 3 variables that had significant disparity in density when compared to their status of rainfall.

humidity, cloud, and sunshine

# Select numerical variables and rainfall
numeric_vars <- c("pressure", "maxtemp", "temparature", "mintemp", "dewpoint", 
                  "humidity", "cloud", "sunshine", "winddirection", "windspeed")

# Melt data to long format for facetting
train_long <- melt(train, id.vars = "rainfall", measure.vars = numeric_vars)

# Plot density with facet wrap
ggplot(train_long, aes(x = value, fill = factor(rainfall))) +
  geom_density(alpha = 0.5) +  # Transparent fill
  facet_wrap(~ variable, scales = "free") +  # Wrap by variable, free scales
  labs(title = "Density Plots of Numerical Variables by Rainfall",
       x = "Value", y = "Density", fill = "Rainfall") +
  theme_minimal()

5 Model Development

5.1 Model Selection

We will choose a logistic regression model to answer what is the probability of an event occurring. In this case, we are attempting to answer the likelihood of rainfall.

5.2 Logistic Regression Model & Evaluation

Running a logistic regression and utilizing all of our variables except for id and day, we receive a accuracy rate of 86.8%.

The decrease from 2444.4 to 1443.7 suggest that the model has improved over the null model, when predictors are involved.

As FOR THE AIC comparing to our second model our first model is better.

# Fit logistic regression
log_reg <- glm(rainfall ~ . - id - day , data = train, family = "binomial")
summary(log_reg)

## 
## Call:
## glm(formula = rainfall ~ . - id - day, family = "binomial", data = train)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   26.3349469 22.8572309   1.152 0.249260    
## pressure      -0.0333959  0.0219648  -1.520 0.128404    
## maxtemp        0.0276169  0.0678539   0.407 0.684004    
## temparature   -0.0411014  0.1152849  -0.357 0.721451    
## mintemp       -0.0678926  0.0894629  -0.759 0.447918    
## dewpoint       0.1504699  0.0415929   3.618 0.000297 ***
## humidity       0.0402758  0.0124140   3.244 0.001177 ** 
## cloud          0.0638011  0.0062067  10.279  < 2e-16 ***
## sunshine      -0.1540483  0.0320463  -4.807 1.53e-06 ***
## winddirection -0.0003703  0.0011284  -0.328 0.742781    
## windspeed      0.0151766  0.0077001   1.971 0.048728 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2446.4  on 2189  degrees of freedom
## Residual deviance: 1443.7  on 2179  degrees of freedom
## AIC: 1465.7
## 
## Number of Fisher Scoring iterations: 5

# Predict on training data
log_probs <- predict(log_reg, newdata = train, type = "response")

# Convert probabilities to binary predictions
log_preds <- ifelse(log_probs > 0.5, 1, 0)

# Compute accuracy on TRAINING data
train_accuracy <- mean(log_preds == train$rainfall, na.rm = TRUE)  # Ensure NA values are ignored
print(train_accuracy)

## [1] 0.8680365

5.3 Feature Importance

While just focusing on the most significant variables, especially choosing these from our EDA, we see that it does not make or model any better. Statistically, we choose these variables due to the significant in our previous model (looking a p-values).

This model slightly decreases with out accuracy rate being 85.9%, an increase in Residual Deviance and an increase in AIC.

We will not be using this model and sticking to our first model. In the end, although, we believed that focusing on “important” variables; it did not make our model better.

# Fit logistic regression
log_reg2 <- glm(rainfall ~ humidity + cloud + sunshine, data = train, family = "binomial")
summary(log_reg2)

## 
## Call:
## glm(formula = rainfall ~ humidity + cloud + sunshine, family = "binomial", 
##     data = train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.170335   0.973988  -9.415  < 2e-16 ***
## humidity     0.075670   0.011349   6.667  2.6e-11 ***
## cloud        0.065169   0.005991  10.877  < 2e-16 ***
## sunshine    -0.087411   0.027317  -3.200  0.00138 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2446.4  on 2189  degrees of freedom
## Residual deviance: 1497.4  on 2186  degrees of freedom
## AIC: 1505.4
## 
## Number of Fisher Scoring iterations: 5

# Predict on training data
log_probs2 <- predict(log_reg2, newdata = train, type = "response")

# Convert probabilities to binary predictions
log_preds2 <- ifelse(log_probs2 > 0.5, 1, 0)

# Compute accuracy on TRAINING data
train_accuracy2 <- mean(log_preds2 == train$rainfall, na.rm = TRUE)  # Ensure NA values are ignored
print(train_accuracy2)

## [1] 0.8598174

5.4 Submission to Kaggle - Logistic Regression

# Predict probabilities on the TEST data
log_probs_test <- predict(log_reg, newdata = test, type = "response")

# Convert probabilities to binary predictions for the TEST data
log_preds_test <- ifelse(log_probs_test > 0.5, 1, 0)

# Fill the predictions into the 'rainfall' column of the test dataset
test$rainfall <- log_preds_test

# Select only the 'id' and 'rainfall' columns from the test dataframe
submission <- data.frame(id = test$id, rainfall = test$rainfall)

# Write the submission dataframe to a CSV file
write.csv(submission, file = 'final_project_log_v1.csv', row.names = FALSE)

6 Model Development - RandomForest

6.1 Model Selection

Plan to see if using RandomForest will yield better results. One problem I had was understanding in this case, we are trying to classify. That is why for train we had to change it to a factor.

6.2 RandomForest

Overview of data:

We are trying to classify. In this case the binary outcome or rainfall or not.
Using 500 trees
3 variables are choose randomly at each split

Results of first model:

OOB estimate error rate is 13.7%
Our class 0 error rate shows that 34.4% of the time our model misclassifies when it does not rain
Our class 1 error shows that 6.9% of the time our model misclassifies day when it rains
Our overall accuracy is 86.3% that the model performs well.

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.4.2

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

set.seed(123)

# Convert rainfall to factor (if it's binary or categorical)
train$rainfall <- factor(train$rainfall, levels = c(0, 1))

# Re-run the Random Forest model for classification
rf_model <- randomForest(rainfall ~ . - id - day, data = train, importance = TRUE)

# Print model summary
print(rf_model)

## 
## Call:
##  randomForest(formula = rainfall ~ . - id - day, data = train,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 13.7%
## Confusion matrix:
##     0    1 class.error
## 0 354  186  0.34444444
## 1 114 1536  0.06909091

6.3 Variable importance

Overall looking at the variable importance. We can see that for mean decrease accuracy, cloud is they highest variable. Mean decrease accuracy means is the measurement of accuracy if we were to exclude this variable from our model. This suggest that cloud is one strong predictor.

Look at mean decrease gini, which is the measurement of how strong a predictor is when splitting the data. We see sunshine and cloud be the strongest. Overall the variables that stand out within the graph to be useful predictors are: dewpoint, humidity, sunshine, and cloud.

# Show variable importance
var_importance <- importance(rf_model)
print(var_importance)

##                       0         1 MeanDecreaseAccuracy MeanDecreaseGini
## pressure       8.984766 19.635104            23.703522         51.49967
## maxtemp        2.257418 23.193758            25.280407         51.70055
## temparature    9.594227 19.197536            23.139947         49.41607
## mintemp        6.484718 21.430431            24.626359         47.23691
## dewpoint       4.991463 27.887977            31.974545         56.76731
## humidity      15.602779 18.492336            25.785691         89.20012
## cloud         55.031423 49.591576            74.203579        231.32486
## sunshine      17.465665 27.492565            35.677425        147.88095
## winddirection 12.855963 13.401913            20.192767         34.12076
## windspeed      5.571231  5.901377             7.952763         53.97901

# Plot the importance of variables
varImpPlot(rf_model)

6.4 Model Evaluation

When wondering how to improve our randomForest model, I thought to change the number of trees. We decided to see that around 200 trees our model begins to stabilize.

# Track OOB error rate for various values of ntree
oob_errors <- rf_model$err.rate[, 1]  # OOB error rate for class 0
trees <- 1:length(oob_errors)  # Number of trees

# Plot the OOB error rate against the number of trees
plot(trees, oob_errors, type = "l", col = "blue", xlab = "Number of Trees", ylab = "OOB Error Rate")

6.5 Revision to rf model

Taking into consideration of what we have learned these are the changes we have made:

Only decided to use cloud, dewpoint, humidity, and sunshine due to our variable importance.
Change the number of trees to 200 instead of 500

Results of our new model:

We were able to decrease our class 0 error slightly to 34.1%
Slightly decrease our class 1 error rate to 6.8%
Overall there is a slight decrease to OOB error rate to 13.56%

set.seed(123)

rf_model2 <- randomForest(rainfall ~ cloud + dewpoint + humidity + sunshine - id - day, data = train, importance = TRUE, ntree = 200)

# Print model summary
print(rf_model2)

## 
## Call:
##  randomForest(formula = rainfall ~ cloud + dewpoint + humidity +      sunshine - id - day, data = train, importance = TRUE, ntree = 200) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 13.56%
## Confusion matrix:
##     0    1 class.error
## 0 356  184  0.34074074
## 1 113 1537  0.06848485

6.6 Submission to Kaggle - randomForest

# Predict class labels (0 or 1) for the test set
rf_preds <- predict(rf_model2, newdata = test)

# Create a submission dataframe with 'id' and the predicted 'rainfall'
rf_submission <- data.frame(id = test$id, rainfall = rf_preds)

# Write the submission dataframe to a CSV file
write.csv(rf_submission, file = 'submission_rf_model2.2.csv', row.names = FALSE)

7 Submission to Kaggle

After submitted our 1st logistic model and 2nd randomForest model here are the scores.

submission_rf_model2.2 received a score of .77205
final_project_log_v1 received a score of .75502

Ultimately our random forest model performed better when predicting rainfall. Since our best score is from our randomforest model, we are in #2486 on the leaderboard.

BUA 302 FINAL PROJECT

GianC

2025-03-21