Linear Regression - Weather Data

library(RCurl)
library(rattle)
library(batchtools)
library(ggplot2)
library(caret)
library(gmodels)
library(gridExtra)
library(corrplot)
library(e1071)
library(dplyr)
library(GGally)

First Data-Import

weather_raw <- read.csv(url("https://raw.githubusercontent.com/josephsimone/DATA607/master/ww-ii-data..csv"), header = TRUE, sep = ",", dec=".", stringsAsFactors = TRUE)
dim(weather_raw)
## [1] 119040     31
str(weather_raw)
## 'data.frame':    119040 obs. of  31 variables:
##  $ STA        : int  10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 ...
##  $ Date       : Factor w/ 2192 levels "1/1/1940","1/1/1941",..: 1643 1709 1775 1793 1799 1805 1811 1817 1823 1649 ...
##  $ Precip     : Factor w/ 540 levels "0","0.254","0.508",..: 5 1 182 182 1 1 540 244 540 244 ...
##  $ WindGustSpd: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MaxTemp    : num  25.6 28.9 26.1 26.7 26.7 ...
##  $ MinTemp    : num  22.2 21.7 22.2 22.2 21.7 ...
##  $ MeanTemp   : num  23.9 25.6 24.4 24.4 24.4 ...
##  $ Snowfall   : Factor w/ 36 levels "","#VALUE!","0",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ PoorWeather: Factor w/ 39 levels "","0","1","1        1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ YR         : int  42 42 42 42 42 42 42 42 42 42 ...
##  $ MO         : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ DA         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ PRCP       : Factor w/ 541 levels "","0","0.01",..: 6 2 12 12 2 2 541 16 541 16 ...
##  $ DR         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SPD        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ MAX        : int  78 84 79 80 80 80 83 80 81 78 ...
##  $ MIN        : int  72 71 72 72 71 71 73 72 73 71 ...
##  $ MEA        : int  75 78 76 76 76 76 78 76 77 74 ...
##  $ SNF        : Factor w/ 36 levels "","0","0.1","0.2",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ SND        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ FT         : logi  NA NA NA NA NA NA ...
##  $ FB         : logi  NA NA NA NA NA NA ...
##  $ FTI        : logi  NA NA NA NA NA NA ...
##  $ ITH        : logi  NA NA NA NA NA NA ...
##  $ PGT        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ TSHDSBRSGF : Factor w/ 39 levels "","0","1","1        1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ SD3        : logi  NA NA NA NA NA NA ...
##  $ RHX        : logi  NA NA NA NA NA NA ...
##  $ RHN        : logi  NA NA NA NA NA NA ...
##  $ RVG        : logi  NA NA NA NA NA NA ...
##  $ WTE        : logi  NA NA NA NA NA NA ...

Cases

To begin with, I wanted to build a linear regression model using a World War II weather dataset that I found on Kaggle. However, as the data types above show, several columns that should be numeric were read as factors. Since this dataset is from the 1940s, some of the fields mix character values with decimal values. Nevertheless, I would like to continue to use weather-related data. Therefore, I will be using the sample dataset “weather” from “Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery” by Graham Williams. This dataset comes from a weather monitoring station located in Canberra, Australia. The Bureau of Meteorology makes available 13 months of daily weather observations from many locations across Australia.
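The mixed character/decimal problem visible in the `str()` output above can be illustrated with a small sketch. The vector below is a made-up stand-in for a column such as `Snowfall`: only the `#VALUE!` entry is taken from the levels shown above, and `"T"` is a hypothetical placeholder, not a value confirmed to be in the data.

```r
# Hypothetical stand-in for a column read as a factor because of stray text.
raw <- factor(c("0", "0.254", "T", "#VALUE!", "1.5"))

# as.numeric() on a factor returns the internal level codes, not the values,
# so convert to character first. Non-numeric entries become NA (with a warning).
clean <- suppressWarnings(as.numeric(as.character(raw)))

# Collect the entries that could not be parsed as numbers.
bad <- as.character(raw)[is.na(clean)]

print(clean)
print(bad)
```

Inspecting `bad` before discarding anything is one way to decide whether the stray entries carry meaning or are just spreadsheet errors.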

At first glance, the information in this dataset includes rainfall, temperatures, wind speed and humidity at different times of day. There are over twenty variables in this dataset; however, many of them contain missing values (NA), so I will be eliminating those from my dataset for analysis. The variables that I will be keeping include the station number, date, precipitation, wind gust speed, max, min and mean temperature, and snowfall.

Data collection

Observations were drawn from Canberra Airport (station 070351).

Some cloud observations are from automated equipment; these are somewhat different to those made by a human observer and may not appear every day.

Type of study

This is an observational study: a collection of weather conditions recorded each day at weather stations in Australia.

Dependent Variable

The response variable that I will be trying to predict is Rainfall: given the Minimum Temperature, can you predict the amount of Rainfall? This is a quantitative variable.

Independent Variable

The independent variable that I will be using for this linear regression analysis is the Minimum Temperature; Rainfall is the response.

Final Data-Import

dim(weather)
## [1] 366  24
names(weather)
##  [1] "Date"          "Location"      "MinTemp"       "MaxTemp"      
##  [5] "Rainfall"      "Evaporation"   "Sunshine"      "WindGustDir"  
##  [9] "WindGustSpeed" "WindDir9am"    "WindDir3pm"    "WindSpeed9am" 
## [13] "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"   "Pressure9am"  
## [17] "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"      
## [21] "Temp3pm"       "RainToday"     "RISK_MM"       "RainTomorrow"
Variable Descriptions:

Date: The date of observation (a date object)
Location: The common name of the location of the weather station
MinTemp: The minimum temperature in degrees centigrade
MaxTemp: The maximum temperature in degrees centigrade
Rainfall: The amount of rainfall recorded for the day in millimeters
Evaporation: Class A pan evaporation (in millimeters) during 24 h
Sunshine: The number of hours of bright sunshine in the day
WindGustDir: The direction of the strongest wind gust in the 24 h to midnight
WindGustSpeed: The speed (in kilometers per hour) of the strongest wind gust in the 24 h to midnight
WindDir9am: The direction of the wind gust at 9 a.m.
WindDir3pm: The direction of the wind gust at 3 p.m.
WindSpeed9am: Wind speed (in kilometers per hour) averaged over 10 min before 9 a.m.
WindSpeed3pm: Wind speed (in kilometers per hour) averaged over 10 min before 3 p.m.
Humidity9am: Relative humidity (in percent) at 9 a.m.
Humidity3pm: Relative humidity (in percent) at 3 p.m.
Pressure9am: Atmospheric pressure (hPa) reduced to mean sea level at 9 a.m.
Pressure3pm: Atmospheric pressure (hPa) reduced to mean sea level at 3 p.m.
Cloud9am: Fraction of sky obscured by cloud at 9 a.m. This is measured in “oktas,” which are a unit of eighths. It records how many eighths of the sky are obscured by cloud. A 0 measure indicates completely clear sky, while an 8 indicates that it is completely overcast
Cloud3pm: Fraction of sky obscured by cloud at 3 p.m.; see Cloud9am for a description of the values
Temp9am: Temperature (degrees C) at 9 a.m.
Temp3pm: Temperature (degrees C) at 3 p.m.
RainToday: Integer 1 if precipitation (in millimeters) in the 24 h to 9 a.m. exceeds 1 mm, otherwise 0
RISK_MM: The continuous target variable; the amount of rain recorded during the next day
RainTomorrow: The binary target variable whether it rains or not during the next day

write.csv(weather, file = "weather.csv")
set.seed(123)
wd <- read.csv(url("https://raw.githubusercontent.com/josephsimone/DATA607/master/weather.csv"), header = TRUE, sep = ",", stringsAsFactors = TRUE)

dim(wd)
## [1] 366  24
wd <- wd[complete.cases(wd),]
wd <- subset(wd, select = -c(Date, Location, RISK_MM, RainTomorrow, RainToday, WindDir9am, WindDir3pm))
wd <- wd%>%
   select(Rainfall, everything())
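
As a rough illustration of what `complete.cases()` does in the step above, here is a minimal sketch on a small made-up data frame. The values are invented for the example; only the column names echo the dataset.

```r
# Toy data frame with invented values; NA marks a missing observation.
toy <- data.frame(
  Rainfall = c(0.0, 3.6, NA, 0.2),
  MinTemp  = c(8.0, 14.0, 13.7, NA),
  Sunshine = c(6.3, 9.7, 3.3, 9.1)
)

# Count missing values per column before dropping anything.
na_per_column <- colSums(is.na(toy))

# complete.cases() is TRUE only for rows with no NA in any column.
toy_complete <- toy[complete.cases(toy), ]

print(na_per_column)
print(nrow(toy_complete))
```

Checking the per-column counts first shows which variables drive the row loss; running `colSums(is.na(wd))` right after the import would give the same view for the real data.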

Research Question

Given the Minimum Temperature of a day, can one predict the amount of Rainfall for that day?

\(H_0:\) Minimum Temperature has no linear relationship with the amount of Rainfall for that day (the regression slope is zero).

\(H_A:\) Minimum Temperature has a linear relationship with the amount of Rainfall for that day (the regression slope is not zero).

Scatter Plot

scatter.smooth(x=wd$MinTemp, y=wd$Rainfall, main="Rainfall ~ MinTemp")

Box Plot

par(mfrow=c(1, 2))

boxplot(wd$Rainfall, main="Rainfall", sub=paste("Outlier values:", paste(boxplot.stats(wd$Rainfall)$out, collapse=", ")))
boxplot(wd$MinTemp, main="MinTemp", sub=paste("Outlier values:", paste(boxplot.stats(wd$MinTemp)$out, collapse=", ")))
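
Note that `boxplot.stats()$out` returns the outlying values themselves rather than their row numbers. A small sketch, using an invented vector rather than the weather data, of how both the values and their positions can be recovered:

```r
# Invented example vector; the last value sits far from the rest.
x <- c(0, 0, 0.2, 0.4, 1, 1.2, 2, 15)

# Values lying more than 1.5 times the hinge spread beyond the hinges
# are reported in $out.
out_values <- boxplot.stats(x)$out

# Positions of those values in the original vector.
out_rows <- which(x %in% out_values)

print(out_values)
print(out_rows)
```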

Density Plot

par(mfrow=c(1, 2))

plot(density(wd$Rainfall), main="Density Plot: Rainfall", ylab="Density", sub=paste("Skewness:", round(e1071::skewness(wd$Rainfall), 2)))
polygon(density(wd$Rainfall), col="blue")
plot(density(wd$MinTemp), main="Density Plot: MinTemp", ylab="Density", sub=paste("Skewness:", round(e1071::skewness(wd$MinTemp), 2)))
polygon(density(wd$MinTemp), col="blue")

Correlation between Rainfall and MinTemp

cor(wd$Rainfall, wd$MinTemp)
## [1] 0.1973385
ggpairs(data=wd, columns=1:2, title="Rainfall vs MinTemp")

Linear Model

linearMod <- lm(Rainfall~MinTemp, data=wd)
print(linearMod)
## 
## Call:
## lm(formula = Rainfall ~ MinTemp, data = wd)
## 
## Coefficients:
## (Intercept)      MinTemp  
##      0.3384       0.1424
summary(linearMod)
## 
## Call:
## lm(formula = Rainfall ~ MinTemp, data = wd)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.314 -1.875 -0.965 -0.185 37.568 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.33842    0.38221   0.885 0.376576    
## MinTemp      0.14238    0.03917   3.635 0.000323 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.212 on 326 degrees of freedom
## Multiple R-squared:  0.03894,    Adjusted R-squared:  0.03599 
## F-statistic: 13.21 on 1 and 326 DF,  p-value: 0.0003235
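
The t value and p-value in the coefficient table can be reproduced by hand from the estimate and its standard error. The sketch below plugs in the rounded figures printed above, so the results should only match the table to about three significant digits.

```r
# Rounded values copied from the summary() output above.
estimate <- 0.14238   # slope for MinTemp
std_err  <- 0.03917   # standard error of the slope
df_resid <- 326       # residual degrees of freedom

# t statistic: estimate divided by its standard error.
t_value <- estimate / std_err

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# With a single predictor, the F statistic is the square of the slope's t value.
f_value <- t_value^2

print(c(t = t_value, p = p_value, F = f_value))
```

This is also why the p-value of the F-statistic equals the p-value of the MinTemp slope in a model with one predictor.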

P-value

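lmp <- function (modelobject) {
    # inherits() also handles objects with more than one class (e.g. glm)
    if (!inherits(modelobject, "lm")) stop("Not an object of class 'lm' ")
    f <- summary(modelobject)$fstatistic
    # upper-tail probability of the F distribution: value, numerator df, denominator df
    p <- pf(f[1], f[2], f[3], lower.tail=FALSE)
    attributes(p) <- NULL
    return(p)
}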
lmp(linearMod)
## [1] 0.0003234825

R-Squared

summary(linearMod)$r.squared
## [1] 0.0389425
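
For a regression with one predictor, R-squared is the square of the Pearson correlation, which is why the correlation of 0.1973 reported earlier corresponds to an R-squared of about 0.039. A minimal sketch on simulated data (the variables `x` and `y` are invented for illustration):

```r
set.seed(1)
# Simulated data, invented for illustration only.
x <- rnorm(100)
y <- 0.3 * x + rnorm(100)

fit <- lm(y ~ x)

r_squared_from_cor <- cor(x, y)^2
r_squared_from_lm  <- summary(fit)$r.squared

# The two quantities agree up to floating-point error.
print(all.equal(r_squared_from_cor, r_squared_from_lm))

# Same check using the correlation reported in this document.
reported_r_squared <- 0.1973385^2
print(round(reported_r_squared, 7))
```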

Conclusion

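The correlation between Rainfall and Minimum Temperature is very weak (about 0.20), and the \(R^2\) for this linear model is very low: Minimum Temperature explains only about 3.9% of the variation in Rainfall.

The \(P\) value (about 0.0003) is below 0.05, so we reject the null hypothesis that the slope is zero: the relationship is statistically significant. Even so, with such a low \(R^2\), one cannot usefully predict the amount of Rainfall given the Minimum Temperature alone.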
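
To give a sense of scale, the fitted line can be evaluated at a few temperatures and compared with the residual standard error. The coefficients are the rounded values printed in the model summary; the three temperatures are arbitrary illustrative choices.

```r
# Rounded values copied from the model summary above.
intercept <- 0.33842
slope     <- 0.14238
resid_se  <- 4.212     # residual standard error, in mm

# Arbitrary illustrative minimum temperatures (degrees C).
min_temp <- c(0, 10, 20)

# Predicted rainfall (mm) from the fitted line.
predicted_rainfall <- intercept + slope * min_temp

# Range of predictions across a 20-degree swing in MinTemp.
spread <- diff(range(predicted_rainfall))

print(predicted_rainfall)
print(spread < resid_se)
```

Across a 20-degree swing in Minimum Temperature the predicted Rainfall moves by less than the typical size of a residual, which is consistent with the low \(R^2\).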

Weather patterns are known to be non-linear.

In the future, I would like to include more variables than the Minimum Temperature to predict Rainfall.
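
A minimal sketch of what adding a predictor could look like, using simulated data rather than the weather set. The names `x1`, `x2`, and `y` and the coefficients used to generate them are invented for illustration; with the real data the analogous call would be something like `lm(Rainfall ~ MinTemp + Humidity3pm, data = wd)`, and whether that actually helps is an open question.

```r
set.seed(42)
n <- 200

# Simulated predictors and response, invented for illustration only.
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.5 * x1 + 2 * x2 + rnorm(n)
sim <- data.frame(y, x1, x2)

# One-predictor model versus two-predictor model.
fit_one <- lm(y ~ x1, data = sim)
fit_two <- lm(y ~ x1 + x2, data = sim)

# Adjusted R-squared penalises extra predictors, so it is the fairer comparison.
adj_one <- summary(fit_one)$adj.r.squared
adj_two <- summary(fit_two)$adj.r.squared

print(c(one_predictor = adj_one, two_predictors = adj_two))
```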