library(RCurl)
library(rattle)
library(batchtools)
library(ggplot2)
library(caret)
library(gmodels)
library(gridExtra)
library(corrplot)
library(e1071)
library(dplyr)
library(GGally)
weather_raw <- read.csv(url("https://raw.githubusercontent.com/josephsimone/DATA607/master/ww-ii-data..csv"), header = TRUE, sep = ",", dec=".", stringsAsFactors = TRUE)
dim(weather_raw)
## [1] 119040 31
str(weather_raw)
## 'data.frame': 119040 obs. of 31 variables:
## $ STA : int 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 ...
## $ Date : Factor w/ 2192 levels "1/1/1940","1/1/1941",..: 1643 1709 1775 1793 1799 1805 1811 1817 1823 1649 ...
## $ Precip : Factor w/ 540 levels "0","0.254","0.508",..: 5 1 182 182 1 1 540 244 540 244 ...
## $ WindGustSpd: num NA NA NA NA NA NA NA NA NA NA ...
## $ MaxTemp : num 25.6 28.9 26.1 26.7 26.7 ...
## $ MinTemp : num 22.2 21.7 22.2 22.2 21.7 ...
## $ MeanTemp : num 23.9 25.6 24.4 24.4 24.4 ...
## $ Snowfall : Factor w/ 36 levels "","#VALUE!","0",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ PoorWeather: Factor w/ 39 levels "","0","1","1 1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ YR : int 42 42 42 42 42 42 42 42 42 42 ...
## $ MO : int 7 7 7 7 7 7 7 7 7 7 ...
## $ DA : int 1 2 3 4 5 6 7 8 9 10 ...
## $ PRCP : Factor w/ 541 levels "","0","0.01",..: 6 2 12 12 2 2 541 16 541 16 ...
## $ DR : int NA NA NA NA NA NA NA NA NA NA ...
## $ SPD : int NA NA NA NA NA NA NA NA NA NA ...
## $ MAX : int 78 84 79 80 80 80 83 80 81 78 ...
## $ MIN : int 72 71 72 72 71 71 73 72 73 71 ...
## $ MEA : int 75 78 76 76 76 76 78 76 77 74 ...
## $ SNF : Factor w/ 36 levels "","0","0.1","0.2",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ SND : int NA NA NA NA NA NA NA NA NA NA ...
## $ FT : logi NA NA NA NA NA NA ...
## $ FB : logi NA NA NA NA NA NA ...
## $ FTI : logi NA NA NA NA NA NA ...
## $ ITH : logi NA NA NA NA NA NA ...
## $ PGT : num NA NA NA NA NA NA NA NA NA NA ...
## $ TSHDSBRSGF : Factor w/ 39 levels "","0","1","1 1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ SD3 : logi NA NA NA NA NA NA ...
## $ RHX : logi NA NA NA NA NA NA ...
## $ RHN : logi NA NA NA NA NA NA ...
## $ RVG : logi NA NA NA NA NA NA ...
## $ WTE : logi NA NA NA NA NA NA ...
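The `str()` output shows `Precip`, `Snowfall` and `PRCP` stored as factors rather than numbers, which usually means some entries are not valid numerals. As a minimal sketch, assuming a small hand-made vector (`snow` below is illustrative, built from a few of the `Snowfall` levels listed above, not read from the real file), one way to see which entries block numeric conversion is:

```r
# Hypothetical stand-in for a column such as weather_raw$Snowfall
snow <- factor(c("0", "#VALUE!", "", "0", "0"))

# Always go through character first: as.numeric() on a factor
# returns the internal level codes, not the printed values
snow_chr <- as.character(snow)
snow_num <- suppressWarnings(as.numeric(snow_chr))

# Entries that could not be parsed as numbers
bad <- snow_chr[is.na(snow_num)]
table(bad)
```

Counting these entries per column gives a rough idea of how much cleaning a column would need before it could be used in a regression.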
To begin with, I wanted to fit a linear regression model using the World War II weather dataset that I found on Kaggle. However, as the data types above show, several columns that should be numeric were read in as factors. Since this dataset is from the 1940s, some of the values have character entries mixed in with the decimal values. Nevertheless, I would like to continue to use weather-related data. Therefore, I will be using the sample dataset “weather” from “Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery” by Graham Williams. This dataset comes from a weather monitoring station located in Canberra, Australia. The Bureau of Meteorology makes available 13 months of daily weather observations from many locations across Australia.
At first glance, the information in this dataset includes rainfall, temperatures, wind speed and humidity at different times of day. There are over twenty variables in this dataset; however, many of them contain missing (NULL/NA) values. Therefore, I will be eliminating them from my dataset for analysis. The variables that I will be keeping include the station number, date, precipitation, wind gust speed, maximum, minimum and mean temperature, and snowfall.
Observations were drawn from Canberra Airport (station 070351).
Some cloud observations are from automated equipment; these are somewhat different to those made by a human observer and may not appear every day.
This is an observational study: the data are a collection of weather conditions recorded each day at weather stations in Australia.
Australian Government, Bureau of Meteorology.
“Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery” by Graham Williams
The response variable that I will be trying to predict is Rainfall: given the Minimum Temperature, can you predict the amount of Rainfall? Rainfall is a quantitative variable.
The two variables that I will be using for this linear regression analysis are Rainfall (the response) and Minimum Temperature (the explanatory variable).
dim(weather)
## [1] 366 24
names(weather)
## [1] "Date" "Location" "MinTemp" "MaxTemp"
## [5] "Rainfall" "Evaporation" "Sunshine" "WindGustDir"
## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "WindSpeed9am"
## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "Pressure9am"
## [17] "Pressure3pm" "Cloud9am" "Cloud3pm" "Temp9am"
## [21] "Temp3pm" "RainToday" "RISK_MM" "RainTomorrow"
Date: The date of observation (a date object)
Location: The common name of the location of the weather station
MinTemp: The minimum temperature in degrees centigrade
MaxTemp: The maximum temperature in degrees centigrade
Rainfall: The amount of rainfall recorded for the day in millimeters
Evaporation: Class A pan evaporation (in millimeters) during 24 h
Sunshine: The number of hours of bright sunshine in the day
WindGustDir: The direction of the strongest wind gust in the 24 h to midnight
WindGustSpeed: The speed (in kilometers per hour) of the strongest wind gust in the 24 h to midnight
WindDir9am: The direction of the wind gust at 9 a.m.
WindDir3pm: The direction of the wind gust at 3 p.m.
WindSpeed9am: Wind speed (in kilometers per hour) averaged over 10 min before 9 a.m.
WindSpeed3pm: Wind speed (in kilometers per hour) averaged over 10 min before 3 p.m.
Humidity9am: Relative humidity (in percent) at 9 a.m.
Humidity3pm: Relative humidity (in percent) at 3 p.m.
Pressure9am: Atmospheric pressure (hPa) reduced to mean sea level at 9 a.m.
Pressure3pm: Atmospheric pressure (hPa) reduced to mean sea level at 3 p.m.
Cloud9am: Fraction of sky obscured by cloud at 9 a.m. This is measured in “oktas,” which are a unit of eighths. It records how many eighths of the sky are obscured by cloud. A 0 measure indicates completely clear sky, while an 8 indicates that it is completely overcast
Cloud3pm: Fraction of sky obscured by cloud at 3 p.m.; see Cloud9am for a description of the values
Temp9am: Temperature (degrees C) at 9 a.m.
Temp3pm: Temperature (degrees C) at 3 p.m.
RainToday: Integer 1 if precipitation (in millimeters) in the 24 h to 9 a.m. exceeds 1 mm, otherwise 0
RISK_MM: The continuous target variable; the amount of rain recorded during the next day
RainTomorrow: The binary target variable whether it rains or not during the next day
write.csv(weather, file = "weather.csv", row.names = FALSE)  # omit row names so the file keeps 24 columns when read back
set.seed(123)
wd <- read.csv(url("https://raw.githubusercontent.com/josephsimone/DATA607/master/weather.csv"), header = TRUE, sep = ",", stringsAsFactors = TRUE)
dim(wd)
## [1] 366 24
wd <- wd[complete.cases(wd),]
wd <- subset(wd, select = -c(Date, Location, RISK_MM, RainTomorrow, RainToday, WindDir9am, WindDir3pm))
wd <- wd %>%
  select(Rainfall, everything())
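The `complete.cases()` step above drops every row that has at least one missing value. The residual degrees of freedom reported later (326) imply that 328 of the 366 rows survive, since a one-predictor model has n - 2 residual degrees of freedom. As a minimal sketch on a made-up data frame (`toy` and its values are illustrative, not taken from the weather data), the filtering works like this:

```r
# Hypothetical four-row data frame with two incomplete rows
toy <- data.frame(
  MinTemp  = c(8.0, 14.0, NA, 13.3),
  Rainfall = c(0.0, 3.6, 3.6, NA),
  Sunshine = c(6.3, 9.7, 3.3, 9.1)
)

# Missing values per column, useful to inspect before dropping rows
na_per_column <- colSums(is.na(toy))

# Keep only rows where every column is observed
toy_clean <- toy[complete.cases(toy), ]
nrow(toy_clean)
```

Checking the per-column counts first shows whether a few sparse columns are responsible for most of the lost rows.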
\(H_0:\) Given the Minimum Temperature, one can predict the amount of Rainfall for that day.

\(H_A:\) One cannot predict the amount of Rainfall in a day, given the Minimum Temperature.

### Scatter Plot
# Put the predictor on the x-axis and the response on the y-axis
scatter.smooth(x=wd$MinTemp, y=wd$Rainfall, main="Rainfall ~ MinTemp")
par(mfrow=c(1, 2))
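# boxplot.stats()$out returns the outlier values themselves; collapse them into one subtitle string
boxplot(wd$Rainfall, main="Rainfall", sub=paste("Outlier values:", paste(boxplot.stats(wd$Rainfall)$out, collapse=", ")))
boxplot(wd$MinTemp, main="MinTemp", sub=paste("Outlier values:", paste(boxplot.stats(wd$MinTemp)$out, collapse=", ")))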
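`boxplot.stats()` flags a point as an outlier when it lies more than 1.5 times the hinge spread beyond the lower or upper hinge (its default `coef = 1.5`). A minimal sketch on a made-up vector `x`, chosen to be right-skewed like daily rainfall (the values are illustrative, not from the dataset):

```r
# Hypothetical right-skewed sample: mostly dry days plus one very wet day
x <- c(0, 0, 0, 0.2, 0.4, 1, 2, 30)

# Lower and upper hinges are the 2nd and 4th Tukey five-number values
hinges <- fivenum(x)[c(2, 4)]
spread <- diff(hinges)

# Fences at 1.5 times the hinge spread beyond each hinge
lower <- hinges[1] - 1.5 * spread
upper <- hinges[2] + 1.5 * spread
manual_out <- x[x < lower | x > upper]

# The built-in helper applies the same rule
builtin_out <- boxplot.stats(x)$out
```

For a variable such as rainfall, where most days are at or near zero, this rule tends to flag many wet days, so a long list of outliers is expected rather than a sign of bad data.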
par(mfrow=c(1, 2))
plot(density(wd$Rainfall), main="Density Plot: Rainfall", ylab="Density", sub=paste("Skewness:", round(e1071::skewness(wd$Rainfall), 2)))
polygon(density(wd$Rainfall), col="blue")  # fill the Rainfall curve as well
plot(density(wd$MinTemp), main="Density Plot: MinTemp", ylab="Density", sub=paste("Skewness:", round(e1071::skewness(wd$MinTemp), 2)))
polygon(density(wd$MinTemp), col="blue")
cor(wd$Rainfall, wd$MinTemp)
## [1] 0.1973385
ggpairs(data=wd, columns=1:2, title="Rainfall vs MinTemp")
linearMod <- lm(Rainfall~MinTemp, data=wd)
print(linearMod)
##
## Call:
## lm(formula = Rainfall ~ MinTemp, data = wd)
##
## Coefficients:
## (Intercept) MinTemp
## 0.3384 0.1424
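The printed coefficients define the fitted line, Rainfall ≈ 0.3384 + 0.1424 × MinTemp. A small sketch of what that implies, typing the rounded coefficients in by hand (the three temperatures are arbitrary illustrative values):

```r
# Rounded coefficients copied from the print(linearMod) output
b0 <- 0.3384   # intercept: predicted rainfall (mm) at MinTemp = 0
b1 <- 0.1424   # slope: extra mm of rainfall per degree of MinTemp

# Arbitrary example temperatures in degrees C
min_temp <- c(0, 10, 20)
predicted_rain <- b0 + b1 * min_temp

data.frame(min_temp, predicted_rain)
```

With the fitted model object available, `predict(linearMod, newdata = data.frame(MinTemp = min_temp))` should give the same values up to rounding.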
summary(linearMod)
##
## Call:
## lm(formula = Rainfall ~ MinTemp, data = wd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.314 -1.875 -0.965 -0.185 37.568
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.33842 0.38221 0.885 0.376576
## MinTemp 0.14238 0.03917 3.635 0.000323 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.212 on 326 degrees of freedom
## Multiple R-squared: 0.03894, Adjusted R-squared: 0.03599
## F-statistic: 13.21 on 1 and 326 DF, p-value: 0.0003235
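The test statistics in this summary can be reproduced from the slope estimate, its standard error, and the residual degrees of freedom. The sketch below types those three numbers in by hand from the output above, so the results agree with the summary only up to rounding:

```r
# Values copied from the MinTemp row of the coefficient table
est <- 0.14238   # slope estimate
se  <- 0.03917   # standard error of the slope
df  <- 326       # residual degrees of freedom

# t statistic: estimate divided by its standard error
t_val <- est / se

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_val), df)

# With a single predictor, the overall F statistic is the squared t statistic
f_val <- t_val^2

c(t = t_val, p = p_val, F = f_val)
```

This is why the p-value on the `MinTemp` row and the p-value of the F-statistic are the same number in a one-predictor model.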
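# Extract the overall F-test p-value from a fitted lm object
lmp <- function(modelobject) {
  # inherits() stays safe when class() returns more than one class name
  if (!inherits(modelobject, "lm")) stop("Not an object of class 'lm' ")
  f <- summary(modelobject)$fstatistic
  p <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  attributes(p) <- NULL
  return(p)
}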
lmp(linearMod)
## [1] 0.0003234825
summary(linearMod)$r.squared
## [1] 0.0389425
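In a regression with a single predictor, \(R^2\) equals the square of the Pearson correlation between the two variables, which ties this value back to the earlier `cor()` output. A quick check, using the correlation typed in by hand from that output:

```r
# Correlation copied from the cor(wd$Rainfall, wd$MinTemp) output
r <- 0.1973385

# Squared correlation, to compare with summary(linearMod)$r.squared
r_squared <- r^2
r_squared

# Share of the variation in Rainfall accounted for by MinTemp, in percent
round(100 * r_squared, 1)
```

The result matches the reported `r.squared` to the printed precision: MinTemp accounts for roughly 4% of the variation in Rainfall.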
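The correlation between Rainfall and Minimum Temperature is very weak (\(r \approx 0.20\)). The \(p\)-value for this linear model is very low (about 0.0003), which indicates that the slope is statistically different from zero, but the \(R^2\) is also very low: Minimum Temperature explains only about 3.9% of the variation in Rainfall.
Therefore, because the model explains so little of the variation, I reject the null hypothesis as framed above in favor of the alternative hypothesis: you cannot usefully predict the amount of Rainfall given only the Minimum Temperature.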
Weather patterns are known to be non-linear.
In the future, I would like to include more variables than the Minimum Temperature alone to predict Rainfall.