#Embassy Weather Prediction with Regression
https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data

- No: row number
- year: year of data in this row
- month: month of data in this row
- day: day of data in this row
- hour: hour of data in this row
- pm2.5: PM2.5 concentration (ug/m^3)
- DEWP: Dew Point (℃)
- TEMP: Temperature (℃)
- PRES: Pressure (hPa)
- cbwd: Combined wind direction
- Iws: Cumulated wind speed (m/s)
- Is: Cumulated hours of snow
- Ir: Cumulated hours of rain
ew <- read.csv("embassy.csv", header=TRUE)
str(ew)
'data.frame': 43824 obs. of 13 variables:
$ No : int 1 2 3 4 5 6 7 8 9 10 ...
$ year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
$ month: int 1 1 1 1 1 1 1 1 1 1 ...
$ day : int 1 1 1 1 1 1 1 1 1 1 ...
$ hour : int 0 1 2 3 4 5 6 7 8 9 ...
$ pm2.5: int NA NA NA NA NA NA NA NA NA NA ...
$ DEWP : int -21 -21 -21 -21 -20 -19 -19 -19 -19 -20 ...
$ TEMP : num -11 -12 -11 -14 -12 -10 -9 -9 -9 -8 ...
$ PRES : num 1021 1020 1019 1019 1018 ...
$ cbwd : Factor w/ 4 levels "cv","NE","NW",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Iws : num 1.79 4.92 6.71 9.84 12.97 ...
$ Is : int 0 0 0 0 0 0 0 0 0 0 ...
$ Ir : int 0 0 0 0 0 0 0 0 0 0 ...
head(ew)
###Removing NAs
sapply(ew, function(x) sum(is.na(x)))
No year month day hour pm2.5 DEWP TEMP PRES cbwd Iws Is Ir
0 0 0 0 0 2067 0 0 0 0 0 0 0
ew <- na.exclude(ew)
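Dropping incomplete rows rather than imputing costs only a small slice of the data. A quick sketch to confirm, using the row count from str() and the NA tally above:
2067 / 43824 # fraction of rows lost to missing pm2.5, just under 5%
nrow(ew)     # 43824 - 2067 = 41757 rows remain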
Let’s remove No, since it’s just a row index and useless as a predictor.
ew <- ew[,-1]
attach(ew)
Let’s take a look at Temperature based on all other predictors
##Plots
plot(ew$TEMP, ew$month, main="Temperature based on Month", xlab="Temperature in Celsius", ylab="Month")
The middle of the year is the hottest.
plot(ew$TEMP, ew$Iws, main="Temperature vs Windspeed", xlab="Temperature in Celsius", ylab="Windspeed m/s")
Higher wind speeds are associated with cooler temperatures.
plot(ew$TEMP, ew$PRES, main="Temperature vs Pressure", xlab="Temperature in Celsius", ylab="Pressure in hPa")
Higher pressures correlate with lower temperatures
plot(ew$TEMP, ew$hour, main="Temperature vs Hour of Day", xlab="Temperature in Celsius", ylab="Hour")
Temperature is higher in the afternoons and cooler in the twilight hours of the morning.
##Train/Test Split
set.seed(1234)
i <- sample(1:nrow(ew), .8*nrow(ew), replace=FALSE)
ewtrain <- ew[i,]
ewtest <- ew[-i,]
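As a sanity check, a minimal sketch confirming the 80/20 split (with 41757 complete rows, sample() truncates 0.8*41757 to 33405 training observations, leaving 8352 for the test set):
nrow(ewtrain) # should be 33405
nrow(ewtest)  # should be 8352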
##Predicting Temperature based on all other predictors via Linear Regression
ew_lm <- lm(TEMP~., data=ewtrain)
summary(ew_lm)
Call:
lm(formula = TEMP ~ ., data = ewtrain)
Residuals:
Min 1Q Median 3Q Max
-17.0726 -3.4019 -0.2823 3.1559 22.1552
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.376e+02 3.893e+01 -6.103 1.05e-09 ***
year 3.606e-01 1.920e-02 18.786 < 2e-16 ***
month 5.215e-02 8.377e-03 6.225 4.87e-10 ***
day 1.977e-02 3.093e-03 6.391 1.67e-10 ***
hour 2.219e-01 4.045e-03 54.851 < 2e-16 ***
pm2.5 -2.634e-02 3.133e-04 -84.083 < 2e-16 ***
DEWP 4.630e-01 3.412e-03 135.698 < 2e-16 ***
PRES -4.699e-01 4.354e-03 -107.924 < 2e-16 ***
cbwdNE 2.117e-01 9.975e-02 2.123 0.0338 *
cbwdNW 2.831e-02 8.279e-02 0.342 0.7324
cbwdSE 1.360e+00 7.668e-02 17.730 < 2e-16 ***
Iws 8.000e-03 6.106e-04 13.101 < 2e-16 ***
Is -6.958e-01 3.521e-02 -19.763 < 2e-16 ***
Ir -5.014e-01 1.944e-02 -25.791 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.942 on 33391 degrees of freedom
Multiple R-squared: 0.8358, Adjusted R-squared: 0.8358
F-statistic: 1.308e+04 on 13 and 33391 DF, p-value: < 2.2e-16
ew_lm_pred <- predict(ew_lm, newdata=ewtest)
ew_lm_mse <- mean((ew_lm_pred - ewtest$TEMP)^2)
ew_lm_cor <- cor(ew_lm_pred, ewtest$TEMP)
print(paste("Linear Regression accuracy: ", ew_lm_cor))
[1] "Linear Regression accuracy: 0.91238975053004"
##Predicting Temperature based on all other predictors via Decision Tree
library(tree)
ew_tree <- tree(TEMP~., data=ewtrain)
ew_tree
node), split, n, deviance, yval
* denotes terminal node
1) root 33405 4967000 12.3900
2) DEWP < 2.5 17159 1395000 3.3970
4) PRES < 1019.25 5029 421200 10.9100
8) pm2.5 < 80.5 2885 201600 14.5100
16) month < 10.5 2342 132000 16.5400 *
17) month > 10.5 543 18200 5.7380 *
9) pm2.5 > 80.5 2144 131700 6.0620 *
5) PRES > 1019.25 12130 572600 0.2832
10) month < 2.5 4930 115200 -3.5960 *
11) month > 2.5 7200 332400 2.9390
22) month < 11.5 4818 174800 5.7030 *
23) month > 11.5 2382 46320 -2.6520 *
3) DEWP > 2.5 16246 718800 21.8900
6) PRES < 1010.5 9642 254900 24.9800
12) hour < 8.5 3562 57750 21.5900 *
13) hour > 8.5 6080 132100 26.9700 *
7) PRES > 1010.5 6604 236400 17.3600
14) month < 9.5 4454 130600 19.5200 *
15) month > 9.5 2150 42330 12.9000 *
summary(ew_tree)
Regression tree:
tree(formula = TEMP ~ ., data = ewtrain)
Variables actually used in tree construction:
[1] "DEWP" "PRES" "pm2.5" "month" "hour"
Number of terminal nodes: 10
Residual mean deviance: 29.38 = 981100 / 33400
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-26.54000 -3.51700 0.09628 0.00000 3.41000 25.94000
plot(ew_tree)
text(ew_tree, cex=.7, pretty=1)
ew_tree_pred <- predict(ew_tree, newdata=ewtest)
ew_tree_mse <- mean((ew_tree_pred - ewtest$TEMP)^2)
ew_tree_cor <- cor(ew_tree_pred, ewtest$TEMP)
print(paste("Decision Tree accuracy: ", ew_tree_cor))
[1] "Decision Tree accuracy: 0.89266297926514"
##Predicting Temperature based on all other predictors via RandomForest
library("randomForest")
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
set.seed(1234)
ew_rf <- randomForest(TEMP~., data=ewtrain)
ew_rf
Call:
randomForest(formula = TEMP ~ ., data = ewtrain)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 3
Mean of squared residuals: 6.113181
% Var explained: 95.83
ew_rf_pred <- predict(ew_rf, newdata=ewtest)
ew_rf_mse <- mean((ew_rf_pred - ewtest$TEMP)^2)
ew_rf_cor <- cor(ew_rf_pred, ewtest$TEMP)
print(paste("RandomForest accuracy: ", ew_rf_cor))
[1] "RandomForest accuracy: 0.995525009204737"
##Predicting Temperature based on all other predictors via K Nearest Neighbors
library(caret)                             # knnreg() comes from the caret package
ew$cbwd <- as.integer(ew$cbwd)             # KNN distances need numeric inputs, so encode wind direction as an integer code
ewknndata <- data.frame(scale(ew[,1:12]))  # standardize every column (including TEMP) so no single feature dominates the distances
ewknntrain <- ewknndata[i,]
ewknntest <- ewknndata[-i,]
knnfeats <- c("year", "month", "day", "hour", "pm2.5", "DEWP", "PRES", "cbwd", "Iws", "Is", "Ir")
ew_knn <- knnreg(ewknntrain[, knnfeats], ewknntrain[, "TEMP"], k=3)
ew_knn_pred <- predict(ew_knn, ewknntest[, knnfeats])
ew_knn_cor <- cor(ew_knn_pred, ewknntest$TEMP)
ew_knn_mse <- mean((ew_knn_pred - ewknntest$TEMP)^2)
print(paste("KNN with k=3 accuracy: ", ew_knn_cor))
[1] "KNN with k=3 accuracy: 0.981239623058666"
ew_knn2 <- knnreg(ewknntrain[, knnfeats], ewknntrain[, "TEMP"], k=2)
ew_knn_pred2 <- predict(ew_knn2, ewknntest[, knnfeats])
ew_knn_cor2 <- cor(ew_knn_pred2, ewknntest$TEMP)
ew_knn_mse2 <- mean((ew_knn_pred2 - ewknntest$TEMP)^2)
print(paste("KNN with k=2 accuracy: ", ew_knn_cor2))
[1] "KNN with k=2 accuracy: 0.982482967925183"
##Embassy Weather Machine Learning Results
###Accuracy of Models
print(paste("Linear Regression accuracy: ", ew_lm_cor))
[1] "Linear Regression accuracy: 0.91238975053004"
print(paste("Decision Tree accuracy: ", ew_tree_cor))
[1] "Decision Tree accuracy: 0.89266297926514"
print(paste("RandomForest accuracy: ", ew_rf_cor))
[1] "RandomForest accuracy: 0.995525009204737"
print(paste("KNN with k=2 accuracy: ", ew_knn_cor2))
[1] "KNN with k=2 accuracy: 0.982482967925183"
Note that the "test correlation" reported above is the correlation between each model's predictions and the observed temperatures in the held-out test set.

The decision tree was the least accurate, probably because it assumed orthogonal, axis-aligned splits that did not exist in the data, and it could not revise those splits after the fact as it learned more about the data further down the tree. Linear regression was more accurate than the single tree; its assumption of a linear relationship between the predictors and temperature evidently fit the data better than the tree's rigid partitions.

Random forest was more accurate than everything else, probably because it took the deficiency of decision trees and averaged 500 of them: any individual tree's weakness was overcome by the collective. KNN was second most accurate, because a given time's temperature can be assumed to be like similar times' temperatures. That assumption is not perfect, however, which is why KNN did worse than random forest. And although two neighbors gave the highest correlation, in a perfect world every observation would have an exact twin we could predict from instead of having to average two neighbors.
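The averaging claim can be seen directly: the plot method for a randomForest object shows the out-of-bag error falling as trees are added to the ensemble. A minimal sketch:
# Out-of-bag MSE as a function of the number of trees in the forest
plot(ew_rf, main="Random forest error vs. number of trees")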
###What We Learned
With the random forest model, we can predict the temperature at the US embassy in Beijing with a test correlation of about 0.995 from air pressure, date, dew point, and precipitation. Looking at the decision tree, we can predict the temperature with a correlation of about 0.9 from just the dew point, pressure, and date, and unlike the forest, the tree is not an opaque algorithm. We also learned the relationships between pressure, wind speed, and temperature.