Importing Data and Grooming

This is a data set from the University of California Irvine Data Collection. It was used to try to predict the square area of a forest fire on a given day based on certain environmental, temporal and geographic features. I chose the set because it was relatively limited in variables and reasonable populated, yet still possessed enough different data types that the required transformations would be possible or needed.

Variable Names:

X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
month - month of the year: ‘jan’ to ‘dec’
day - day of the week: ‘mon’ to ‘sun’
FFMC - FFMC index from the FWI system: 18.7 to 96.20
DMC - DMC index from the FWI system: 1.1 to 291.3
DC - DC index from the FWI system: 7.9 to 860.6
ISI - ISI index from the FWI system: 0.0 to 56.10
temp - temperature in Celsius degrees: 2.2 to 33.30
RH - relative humidity in %: 15.0 to 100
wind - wind speed in km/h: 0.40 to 9.40
rain - outside rain in mm/m2 : 0.0 to 6.4
area - the burned area of the forest (in ha): 0.00 to 1090.84

require(SciViews)
require(magrittr)
require(GGally)
setwd('/Users/gaboston/Documents/data605')
data  <- read.csv("forestfires.csv") # From UCIrvine 
#[Cortez and Morais, 2007] P. Cortez and A. Morais. A Data Mining Approach to Predict Forest data using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimarães, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9.
data$X_trans <- ln(data$X +1)
data$Y_trans <- ln(data$Y +1)
weekdays <- c('fri', 'tue',  'mon', 'wed', 'thu', 'sun')
data$day_type <- NA
data$day<-as.character(data$day)
data$day_type =ifelse(data$day %in% weekdays, 'no_sat', 'sat')
ggpairs(data)

After looking at the data, and a test regression (not shown), it was clear that some transformations would be necessary to make the model work. At the suggestion of the documentation with the data set, the X and Y grid locations were transformed using natural logs and the SciViews wrapper (which allows you to us ln as you would in traditional notation). The days were initially converted weekdays and weekends, but further refined to Saturdays and weekdays as there seemed to be more of a correlation with the Saturday and less with other days of the week. This became the dichotomous variable in the linear equation.

Cortez and Morais, 2007 P. Cortez and A. Morais. A Data Mining Approach to Predict Forest data using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimarães, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9.

Data Dictionary

More Transformations

In an effort to facilitate performance of the model and to reduce the effects of heavily skewed data in some fields (with many observations of zero) area, rain, and ISI were incremented by one and transformed using natural logarithm as above.

With the transformations made I tried a number of configurations of data to build the model backing out of variables until there was no additional improvement in doing so (as well as trying a few different interaction terms. The following was the most successful, defining success as the highest adjusted \(R^2\) values combined with the most statistically significant coefficients. I could have squeezed a few more hundredths out of the adjusted \(R^2\) at the cost of a statistically insignificant y-intercept and the loss of significant slope coefficient, so I settled for whopping 1.77% of the areas probable size being predicted by this model.

data$area <-ln(data$area+1)
data$rain <- ln(data$rain +1)
data$ISI<- ln(data$ISI +1)

model <- lm(area~ wind + X_trans + DMC^2+ ISI^2 + RH^2 + day_type*wind + day_type*ISI, data = data)
summary(model)

## 
## Call:
## lm(formula = area ~ wind + X_trans + DMC^2 + ISI^2 + RH^2 + day_type * 
##     wind + day_type * ISI, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8070 -1.0870 -0.5348  0.8633  5.5446 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)       1.128952   0.455873   2.476   0.0136 *
## wind              0.084763   0.037551   2.257   0.0244 *
## X_trans           0.167135   0.131675   1.269   0.2049  
## DMC               0.002462   0.001079   2.282   0.0229 *
## ISI              -0.282115   0.156761  -1.800   0.0725 .
## RH               -0.006921   0.003895  -1.777   0.0762 .
## day_typesat      -1.083182   0.821630  -1.318   0.1880  
## wind:day_typesat -0.131085   0.098999  -1.324   0.1861  
## ISI:day_typesat   0.792310   0.351067   2.257   0.0244 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.386 on 508 degrees of freedom
## Multiple R-squared:  0.03293,    Adjusted R-squared:  0.0177 
## F-statistic: 2.162 on 8 and 508 DF,  p-value: 0.029

plot(model)

Based on the residuals vs. fitted plot above, there is a very strong linear direction for many of the points which suggests two things: 1 That there is an effect (likely linear in nature) MISSED in this regression 2. The slope of the OLS line is likely skewed to compensate for the downward direction of the associated misfits which are correlated and if that influence were in someway captured by the regression, then the distribution of residuals would likely become more random as we would prefer to see it.

When you look at the the scale-location, the alignment of some many on a line with the others randomly supports the sense that there is a missing influence in this model and combined with qqplot it appears that on the lower tail, this linear shift is most apparent with a diversion on the upper tail that seems driven more by a few large outliers.

In looking at the leverage plot, there is some suggestion that there might be a few outliers influencing that fit. It could be possible that they are drawing the slope of the line away from the true slope we see in the linear arrangement of residuals in the fitted v residual plot.

Given that the adjusted \(R^2\) is very small and the F-statistic is smaller than 5 and with a small p-value, it seems likely that, despite significant variables in the model itself, the overall model is not a particularly good estimator of the burn area as constructed.

Based on the literature within University of California’s website on this set and my own personal experiences with modeling, it does appear like using a support vector might prove more useful.

Disussion Week 12 - Data 605

Bethany Poulin

4/29/2018

Importing Data and Grooming

More Transformations