This is an assignment from a regression models class I took at Penn State. The assignment explores the use of regularization techniques, logistic regression, cross-validation, sensitivity, and specificity on a wildfire dataset from California in 2009. The first part of the assignment covers exploratory data analysis and building a model from the dataset. The second part uses that model to investigate possible recommendations to give the state of California on preventing houses from being burnt down by wildfires.
Severe wildfires in southern California in 2009 present a rare opportunity to study which factors influence whether a house near a forest is burned in a wildfire, or is not. I will analyze data that contains 487 randomly selected houses in neighborhoods judged to be “at risk” to forest fires. Many characteristics of these houses were obtained from satellite images and other remote geospatial data sources. The state of California is interested in using this data to make recommendations to homeowners on how to maintain their property to minimize the risk of fire damage to their home.
The state of California is planning to launch a campaign to promote one of the following three actions, and wants to determine which of these three practices is likely to be the most beneficial in preventing house loss to wildfire.
The state of California also wants an estimate of the number of houses that would have been burned in the 2009 fires if one of the three proposed management actions had been taken before 2009. Which of the three proposed management actions will be most beneficial?
The data set contains 487 observations of 21 variables. Of the 487 homes in the survey, 274 burned, i.e. 56% of homes in this area burned. The response variable is called burnt and is a binary variable coded as 0 for a house not being burnt down and 1 for a house being burnt down. The remaining 20 variables are different types of predictors, which I have grouped as follows:
For this exploratory data analysis (EDA), I look only at the variables that the homeowner can address. First, I examined four of the five continuous variables using boxplots. The four plots show the distribution of each variable by burnt status (0, 1). The boxplots suggest that perc.woody, and possibly perc.cleared, influence whether a house is burnt, since the medians differ between the two groups. Note that the risk of being burnt appears to rise sharply once perc.woody exceeds the 20% threshold. The boxplots for the other two variables, distance2bush and distance2tree, are difficult to read and do not add useful information to the EDA. The buildings variable can be treated as either a continuous or a factor variable. Its boxplot (not included) did not show any discernible pattern, so I examine it through a contingency table instead.
The two variables planted and buildings are treated as factor variables. The planted variable identifies whether the home had native species replanted (r) or was replanted with non-native plants (p). The buildings variable gives the number of buildings on the property. In Table 1, the houses are cross-tabulated by planted level and burnt outcome. I use a chi-square test on the table to check for dependence between burnt and planted. If the null hypothesis of independence is rejected, it can be concluded that the two variables are dependent. The chi-square test gives a test statistic of 117.84 with a corresponding p-value of approximately 0 (1.884e-27), so there is strong evidence of dependence between planted and burnt.
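As a rough illustration of the test described above, the following Python sketch computes a Pearson chi-square statistic (1 degree of freedom, no continuity correction) for a 2x2 table. The original analysis appears to have been run in R; the function name `chi_square_2x2` and the counts in `example_table` are hypothetical and are not the values from Table 1.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value) using 1 degree of freedom and no
    continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    # Chi-square survival function for df = 1: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# HYPOTHETICAL counts for illustration only, not the data behind Table 1.
# Rows: planted level (p, r); columns: burnt (0, 1).
example_table = [[30, 10], [15, 45]]
stat, p_value = chi_square_2x2(example_table)
print(f"chi-square = {stat:.2f}, p-value = {p_value:.3g}")
```

A large statistic with a very small p-value, as with the 117.84 reported for the real table, leads to rejecting independence.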
In Table 2, I look at the relationship between buildings and the burnt outcome. It is a 2x9 table with an ordinal variable, so it calls for a different approach than the chi-square test. I use Spearman's correlation and the Mantel-Haenszel test statistic to determine whether there is any relationship. Spearman's correlation is a non-parametric alternative to the common Pearson correlation: the observations are converted into ranks and the correlation is computed from the ranked pairs. The Mantel-Haenszel (MH) statistic tests the null hypothesis of independence for ordinal variables (i.e., the correlation parameter, \(\rho\), is equal to zero) against a two-sided alternative. The analysis gives a correlation of -0.05 and an MH statistic of 1.19. The correlation indicates almost no monotonic relationship between burnt and the number of buildings. The MH statistic of 1.19 is well below the 5% critical value of 3.84 for a chi-square distribution with 1 degree of freedom, so there is no evidence to reject the null hypothesis of independence.
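The link between the two reported numbers can be checked with a few lines of Python. This sketch assumes the standard form of the Mantel-Haenszel statistic for ordinal association, \(M^2 = (n - 1) r^2\), referred to a chi-square distribution with 1 degree of freedom. The values of `n` and `r` are taken from the text and the output at the end of this post; the variable names are my own.

```python
import math

n = 487            # number of houses in the data set
r = -0.04945235    # correlation between burnt and buildings, from the output

# Mantel-Haenszel statistic for ordinal association: M^2 = (n - 1) * r^2.
m2 = (n - 1) * r ** 2

# Chi-square survival function for df = 1: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(m2 / 2))

print(f"M^2 = {m2:.5f}, p-value = {p_value:.3f}")
```

The computed statistic agrees with the second value in the reported output, and the p-value is far above 0.05, in line with the conclusion that burnt and buildings look independent.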
Next, I look at five variables in a scatterplot matrix to see whether any of the predictors are correlated with each other. Distance2tree and perc.woody are highly negatively correlated, indicated by the red. Distance2bush and perc.woody are highly positively correlated, indicated by the blue. Buildings does not appear to be correlated with any of the variables. Burnt, the response variable, is correlated, either negatively or positively, with every variable except buildings. Since the variables of interest are so highly correlated, it would be wise to use a regularization technique such as Ridge or Lasso in the analysis. Regularization also allows the analysis to use all the variables in the dataset, giving more insight into what determines the likelihood of a house being burnt down.
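For reference, the two regularization techniques can be written as penalized logistic regression. The notation below is my own and is a generic textbook form, not necessarily the exact parameterization used by the software in this analysis: \(y_i\) is the burnt indicator, \(x_i\) the vector of predictors for house \(i\), and \(\lambda \ge 0\) the penalty strength chosen by cross-validation.

\[
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \; -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i (\beta_0 + x_i^\top \beta) - \log\big(1 + e^{\beta_0 + x_i^\top \beta}\big) \Big] + \lambda \sum_{j=1}^{p} P(\beta_j)
\]

with \(P(\beta_j) = \beta_j^2\) for Ridge and \(P(\beta_j) = |\beta_j|\) for Lasso. The Ridge penalty shrinks correlated coefficients toward each other without setting them to zero, while the Lasso penalty can set some coefficients exactly to zero.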
The EDA points to the following:
In conclusion, I expect planted and perc.woody to be important in the models and buildings to be unimportant; the other variables will need to be investigated more thoroughly through the statistical analysis.
To conduct the statistical analysis, I first divide the data into a Train and a Test set using the tools available in the caret package. The Train set consists of 70% of the data. I then use two regularization techniques, Ridge and Lasso, to build a model on the Train set; both help with the multicollinearity and the large number of variables. Since the response variable is binary, I use logistic regression to model the data. To assess model fit, I look at the sensitivity and specificity from the confusion tables. Sensitivity is the proportion of actual positives that the model correctly identifies, and specificity is the proportion of actual negatives that the model correctly identifies. A better model discriminates well between the two outcomes, so it scores high on both measures. For both regularization techniques, I use 10-fold cross-validation on the Train set to build the model. I then use the remaining 30% of the data, the Test set, to evaluate the model. The technique with the best performance (sensitivity and specificity) on the Test set is used to investigate which of the three proposals I would recommend to the state of California.
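To make the two evaluation measures concrete, here is a minimal Python sketch that computes sensitivity and specificity from actual and predicted labels. The function name and the toy label vectors are hypothetical and are not taken from the wildfire data. The sketch assumes burnt = 1 is the positive class; which class counts as positive is a choice, and the original software may have used a different default.

```python
def sensitivity_specificity(actual, predicted, positive=1):
    """Return (sensitivity, specificity) for two equal-length label lists."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1   # actual positive, predicted positive
            else:
                fn += 1   # actual positive, predicted negative
        else:
            if p == positive:
                fp += 1   # actual negative, predicted positive
            else:
                tn += 1   # actual negative, predicted negative
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# TOY labels for illustration only (1 = burnt, 0 = not burnt).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]
sens, spec = sensitivity_specificity(actual, predicted)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```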
In the Ridge model fitted to the Train set, only planted and perc.woody are significant among the variables the homeowner can control. Adj.for.type is also significant, but the homeowner cannot control this variable. The sensitivity on the Test data is 78% and the specificity is 81%, i.e. the model correctly identifies 78% of the actual positive outcomes and 81% of the actual negative outcomes.
In the Lasso model fitted to the Train set, only perc.woody and ffdi are significant. The ffdi variable is an index of fire threat and is not controlled by the homeowner. The sensitivity on the Test data is 78% and the specificity is 80%, i.e. the model correctly identifies 78% of the actual positive outcomes and 80% of the actual negative outcomes.
The Ridge technique scores slightly better in this analysis based on specificity and sensitivity. Planted and perc.woody make sense as influential variables, as borne out by the EDA. I will use the Ridge model to investigate the three proposals for the state of California that seek to reduce the number of burnt houses.
Proposal A: Remove all trees within 10 meters of the house.
Method: I took the entire fire data set and recoded every distance2tree value below 10 as 10. I then ran the chosen Ridge model on the Test data to make comparisons.
Results: Proposal A gives a sensitivity of 78% and a specificity of 82%.
Proposal B: Require any home with more than 50% woody vegetation within 40m of the house to remove vegetation until it has only 50% woody vegetation within 40m of the house.
Method: I took the entire fire data set and recoded every perc.woody value greater than 50% as 50%. I then ran the chosen Ridge model on the Test data to make comparisons.
Results: Proposal B gives a sensitivity of 75% and a specificity of 80%.
Proposal C: Replant any “remnant” vegetation within 40m of all houses with identical “planted” vegetation.
Method: I took the entire fire data set and recoded every planted value of r as p. I then ran the chosen Ridge model on the Test data to make comparisons.
Results: Proposal C gives a sensitivity of 70% and a specificity of 71%.
| A Sens | A Spec | B Sens | B Spec | C Sens | C Spec |
|---|---|---|---|---|---|
| 0.7777778 | 0.8192771 | 0.7460317 | 0.7951807 | 0.6984127 | 0.7108434 |
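The recoding steps described in the Method notes for the three proposals can be sketched in Python as follows. This is an illustration under assumptions, not the code used in the analysis: each house is represented as a plain dictionary, the function names are my own, and the example record is hypothetical. Only the variable names (distance2tree, perc.woody, planted) come from the data set.

```python
def apply_proposal_a(house):
    """Proposal A: no trees within 10 m, so distance2tree is at least 10."""
    return {**house, "distance2tree": max(house["distance2tree"], 10)}


def apply_proposal_b(house):
    """Proposal B: woody vegetation within 40 m is capped at 50%."""
    return {**house, "perc.woody": min(house["perc.woody"], 50)}


def apply_proposal_c(house):
    """Proposal C: remnant vegetation (r) is replaced by planted (p)."""
    planted = "p" if house["planted"] == "r" else house["planted"]
    return {**house, "planted": planted}


# HYPOTHETICAL house record for illustration only.
house = {"distance2tree": 4, "perc.woody": 80, "planted": "r"}

house_a = apply_proposal_a(house)
house_b = apply_proposal_b(house)
house_c = apply_proposal_c(house)
print(house_a, house_b, house_c, sep="\n")
```

Each function returns a modified copy, so the original record stays available as the do-nothing baseline. The fitted model would then be applied to the recoded records to compare predictions across proposals.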
Recommendations: I found the Ridge model, with perc.woody and planted as the key predictor variables, to be better than the Lasso model. I used the Ridge model to make predictions under the three proposals. Proposal A gives only a slight improvement over doing nothing, while Proposal B and Proposal C increase the number of houses being burnt. Therefore, I recommend Proposal A to the state of California. However, to better understand what decreases the likelihood of a house being burnt down, I would recommend investigating interactions and perhaps transformations of some of the variables. Investigating the classification threshold for whether a house is predicted to be burnt could be another avenue to explore.
Call: xtabs(formula = ~fire$planted + fire$burnt)
Number of cases in table: 487
Number of factors: 2
Test for independence of all factors:
Chisq = 117.84, df = 1, p-value = 1.884e-27
[1] -0.04945235 1.18853000
21 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -8.727970e-01
ffdi 1.775019e-03
slope 9.509125e-03
aspect 5.897851e-05
topo -2.306314e-02
perc.cleared 1.377767e-04
amt.burntless5yrs 8.377812e-07
perc.burntless5yrs 7.812707e-04
amt.not.burnt5to10yrs -8.530190e-07
perc.burnt5to10yrs 2.970346e-03
amt.unlogged -1.543007e-06
perc.logged -7.074507e-04
amt.not.NP 1.174078e-06
amt.not.SF -3.577478e-06
adj.for.type 5.553955e-02
edge 5.662403e-07
distance2tree -6.248115e-04
planted 4.333087e-01
buildings -1.435315e-02
perc.woody 9.159507e-03
distance2bush -4.101150e-05
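One way to read the coefficients above: in a logistic regression, a coefficient is the change in log-odds per unit change in the predictor, holding the others fixed. The Python sketch below converts the perc.woody coefficient into an odds ratio for a reduction from 80% to 50% woody cover. The coefficient is copied from the output above; the 80% starting value is a hypothetical example, and the sketch assumes the coefficients are reported on the original scale of the variables.

```python
import math

coef_perc_woody = 9.159507e-03   # perc.woody coefficient from the output above


def odds_ratio(coef, delta):
    """Multiplicative change in the odds of the outcome when a predictor
    changes by `delta` units, with all other predictors held fixed."""
    return math.exp(coef * delta)


# HYPOTHETICAL scenario: woody cover reduced from 80% to 50%.
ratio = odds_ratio(coef_perc_woody, 50 - 80)
print(f"odds ratio for burnt: {ratio:.3f}")
```

A ratio below 1 means the modeled odds of being burnt go down after the reduction. Because Ridge shrinks coefficients, this should be read as a rough indication rather than an unbiased effect estimate.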