Wildland forest fires are an issue that plagues much of the United States and can be started by many different sources, from lightning and power lines to debris burning and arson. According to the National Interagency Fire Center (www.nifc.gov), humans cause an average of 62,631 fires a year across the US, which add up to over 2.5 million acres burned a year. The majority of these fires, 67%, are in the US Forest Service’s Eastern and Southern Regions, with nearly 40% of the total acres in the Southern Region. This issue also affects our family personally. Part of my wife’s job for the Virginia Department of Forestry is responding to wildfires, and her agency is struggling with how to respond to them. It has faced a severe reduction in force over the last 15 years, going from approximately 300 trained wildland firefighters to 150. With these limited resources, it has fallen more often than not to local volunteer fire companies to deal with wildfires. Now, as the state faces even further budget cuts, the question has arisen: where should fire and education resources be spent, and how can the state reduce the number of fires that are caused by humans?
The main goal of this project is to determine whether there are any relationships between the counties that have the largest percentages of human caused fires and demographic data about these locations. There is a belief that the areas with the most human caused fires in the state tend to be the rural counties that have smaller populations, are in general poorer, and have the most limited resources because of their low populations and tax bases. At the end of this project I hope to have a data-based answer to that question. The data that I will be using are a summary of the United States’ 2010 Census data sourced from http://en.wikipedia.org/wiki/Virginia_locations_by_per_capita_income and a data set on the fires and causes for the state of Virginia sourced from https://fam.nwcg.gov/fam-web/ . Access to the data on this website appears to have changed over the last few weeks, so I will be using a pull of the data that I generated a few weeks ago as a possible weekly assignment data set.
The two questions that we want to answer throughout the project are:
Is there a relationship between the percentage of fires that are human caused and the income in an area?
Is there a relationship between the percentage of fires that are human caused and the population in an area?
Downloading the Virginia fire data from 2001 - 2013
file <- 'https://raw.githubusercontent.com/eriknylander99/Data/master/NASF_State_Data_Final_Project.csv'
download.file(file, destfile = 'va_fire.csv')
Downloading the State Population and Household Income Data
file <- 'https://raw.githubusercontent.com/eriknylander99/Data/master/Wiki_State_Demographics.csv'
download.file(file, destfile = 'va_demo.csv')
Downloading the Shapefile for the counties of Virginia
file <- 'ftp://ftp2.census.gov/geo/pvs/tiger2010st/51_Virginia/51/tl_2010_51_county10.zip'
download.file(file, destfile = 'census.zip')
unzip('census.zip')
Loading the data into R.
require(dplyr)
require(ggplot2)
firetmp <- read.csv('va_fire.csv', stringsAsFactors = FALSE)
firetmp <- tbl_df(firetmp)
str(firetmp)
## Classes 'tbl_df', 'tbl' and 'data.frame': 13233 obs. of 22 variables:
## $ Local_Incident_ID : chr "ACC369620.5" "ACC370182" "ACC370183" "ACC370194" ...
## $ Fire_Discovery_Date : chr "3/12/2001 0:00" "5/7/2001 0:00" "5/7/2001 0:00" "5/8/2001 0:00" ...
## $ Fire_Discovery_Time : chr "" "" "" "" ...
## $ Fire_Containment_Date : logi NA NA NA NA NA NA ...
## $ Fire_Containment_Time : logi NA NA NA NA NA NA ...
## $ Fire_Reporting_Agency_Unit_ID: chr "VAVDOF" "VAVDOF" "VAVDOF" "VAVDOF" ...
## $ State : chr "VA" "VA" "VA" "VA" ...
## $ State.FIPS : int 51 51 51 51 51 51 51 51 51 51 ...
## $ County : chr "Accomack" "Accomack" "Accomack" "Accomack" ...
## $ County_FIPS : int 1 1 1 1 1 1 1 1 1 1 ...
## $ District : logi NA NA NA NA NA NA ...
## $ Latitude : num NA NA NA NA NA NA NA NA NA NA ...
## $ Longitude : num NA NA NA NA NA NA NA NA NA NA ...
## $ Statistical_Cause_Code : int 5 5 2 2 5 1 9 5 7 6 ...
## $ Ownership_Code : logi NA NA NA NA NA NA ...
## $ Residences_Threatened : int 2 0 0 4 0 4 1 0 0 0 ...
## $ Residences_Destroyed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Other_Structures_Threatened : int 0 0 0 2 0 4 1 0 0 0 ...
## $ Other_Structures_Destroyed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Number_Injuries : logi NA NA NA NA NA NA ...
## $ Number_Fatalities : logi NA NA NA NA NA NA ...
## $ Final_Fire_Acre_Quantity : num 0.5 2 3 4 2.5 23.5 0.6 0.2 1 0.1 ...
demotmp <- read.csv('va_demo.csv', stringsAsFactors = FALSE)
demotmp <- tbl_df(demotmp)
str(demotmp)
## Classes 'tbl_df', 'tbl' and 'data.frame': 134 obs. of 7 variables:
## $ Rank : int 3 100 105 114 58 28 49 94 116 110 ...
## $ County_City : chr "City of Alexandria" "City of Bedford" "City of Bristol" "City of Buena Vista" ...
## $ Per_capita_income : int 54345 20092 19700 19030 24578 29306 26115 20781 18840 19245 ...
## $ Median_household_income: int 80847 32262 32079 39955 42240 67855 50571 35277 29936 32788 ...
## $ Median_family_income : int 102017 41026 39212 46081 62378 77561 64154 47188 39198 39410 ...
## $ Population : int 139966 6222 17835 6650 43475 222209 17411 5961 43055 5927 ...
## $ Households : int 68082 2627 7879 2603 17778 79574 7275 2632 18831 2316 ...
The state fire data contains a date column that I will convert from string to date to allow for a time-series based analysis in the future.
firetmp$Fire_Discovery_Date <- as.Date(firetmp$Fire_Discovery_Date, "%m/%d/%Y")
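As a quick check of the conversion above, the sketch below parses two of the date strings shown in the `str()` output. It assumes that `as.Date()` stops reading once the supplied format has been matched, so the trailing ` 0:00` is ignored; the vector name `raw_dates` is introduced here for illustration only.

```r
# Two sample values copied from the str() output of the fire data
raw_dates <- c("3/12/2001 0:00", "5/7/2001 0:00")

# Parse month/day/year; the trailing time portion is not part of the format
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

print(parsed)
print(format(parsed, "%Y"))  # year extracted for a possible time-series grouping
```

If any value failed to parse it would come back as `NA`, so a check such as `anyNA(parsed)` on the full column is a cheap safeguard.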
The fire data set contains 13,233 different incidents that the Virginia Department of Forestry has responded to over the 13 year period from 2001 to 2013. This is not the total number of wildland fires that occurred during this time period as this data is only for the fires that a VDOF employee responded to and completed a fire report. The data does not include the many small grass and wild fires that are dealt with by local fire companies. The demographic data set contains 134 rows for the cities and counties in Virginia from the 2010 Census Data.
The question that I am trying to answer has to do with human caused fires in the state of Virginia so some work needs to be done with the state fire data to prepare it for analysis. The first few steps will be to drop columns that we don’t need and to remove any rows that do not contain data on the statistical causes for the fire.
fire_df <- firetmp %>%
select(Fire_Discovery_Date, State, State.FIPS, County, County_FIPS, Statistical_Cause_Code,
Residences_Threatened, Residences_Destroyed, Other_Structures_Threatened, Other_Structures_Destroyed,
Final_Fire_Acre_Quantity) %>%
filter(!is.na(Statistical_Cause_Code)) %>%
filter(Statistical_Cause_Code <=9)
The data set now contains information about the wildland fires reported in the state of Virginia that have a valid statistical cause code. These codes can be seen in the table below.
| USFS Code | Statistical cause |
|---|---|
| 1 | Lightning |
| 2 | Equipment use |
| 3 | Smoking |
| 4 | Campfire |
| 5 | Debris burning |
| 6 | Railroad |
| 7 | Arson |
| 8 | Children |
| 9 | Miscellaneous |
ggplot(fire_df, aes(factor(Statistical_Cause_Code),
fill = factor(Statistical_Cause_Code,
labels = c('1-Lightning', '2-Equipment Use', '3-Smoking',
'4-Campfire', '5-Debris Burning', '6-Railroad',
'7-Arson', '8-Children', '9-Misc')))) +
geom_bar() + # count of fires per cause code; the x variable is a factor, so a bar chart applies
labs(x='Cause Code', y='Total Number of Fires', fill = 'Cause Code') +
ggtitle('Total Number of Fires by Cause')
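The cause codes in the table can also be kept as a small lookup in R, which may be handy when labelling summaries later on. This is only a sketch: `cause_labels` and `label_cause()` are hypothetical names that do not appear in the analysis code.

```r
# Lookup table of USFS statistical cause codes, as listed in the table above
cause_labels <- c("1" = "Lightning", "2" = "Equipment use", "3" = "Smoking",
                  "4" = "Campfire", "5" = "Debris burning", "6" = "Railroad",
                  "7" = "Arson", "8" = "Children", "9" = "Miscellaneous")

# Hypothetical helper: translate integer codes into readable labels
label_cause <- function(code) unname(cause_labels[as.character(code)])

# First few codes from the str() output of the fire data
codes <- c(5, 5, 2, 2, 5, 1, 9, 5, 7, 6)
print(table(label_cause(codes)))
```

Codes outside 1 to 9 come back as `NA`, which mirrors the filter that drops invalid cause codes.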
Is there a relationship between the percentage of fires that are human caused and the income in an area?
Is there a relationship between the percentage of fires that are human caused and the population in an area?
To answer these questions the first step is to determine which causes to use as ‘Human Caused’. It can be argued that most of the above causes are ‘Human’; however, for this analysis we will focus on the two causes that my wife most commonly has to respond to: arson and debris burning. The next step of the analysis is to determine what percent of the fires in a given county is caused by these two factors.
fire <- fire_df %>%
group_by(County, County_FIPS, Statistical_Cause_Code) %>%
summarise(cause = n(),
acres = sum(Final_Fire_Acre_Quantity)) %>%
mutate(cause_per = round((cause/sum(cause))*100, digits=3))
fire
## Source: local data frame [707 x 6]
## Groups: County, County_FIPS
##
## County County_FIPS Statistical_Cause_Code cause acres cause_per
## 1 Accomack 1 1 8 385.0 8.247
## 2 Accomack 1 2 8 170.3 8.247
## 3 Accomack 1 3 2 23.0 2.062
## 4 Accomack 1 5 19 58.6 19.588
## 5 Accomack 1 6 1 0.1 1.031
## 6 Accomack 1 7 48 152.9 49.485
## 7 Accomack 1 8 2 0.3 2.062
## 8 Accomack 1 9 9 57.8 9.278
## 9 Albemarle 3 1 6 26.9 3.046
## 10 Albemarle 3 2 17 41.3 8.629
## .. ... ... ... ... ... ...
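The percentage column depends on how grouping behaves after `summarise()`: the last grouping variable is dropped, so `sum(cause)` in the following `mutate()` is a per-county total. The sketch below illustrates this on made-up data. It assumes dplyr 1.0.0 or later (for the `.groups` argument), and the `toy` data frame is hypothetical.

```r
library(dplyr)

# Toy incident data (made up for illustration): one row per fire
toy <- data.frame(
  County = c("A", "A", "A", "A", "B", "B"),
  Statistical_Cause_Code = c(5, 5, 7, 1, 7, 1)
)

toy_pct <- toy %>%
  group_by(County, Statistical_Cause_Code) %>%
  # Count fires per county and cause; keep the County grouping afterwards
  summarise(cause = n(), .groups = "drop_last") %>%
  # sum(cause) is evaluated within each county, giving a within-county share
  mutate(cause_per = round(cause / sum(cause) * 100, digits = 3)) %>%
  ungroup()

print(toy_pct)
```

Within each county the shares add up to 100, which is a useful sanity check on the real `fire` table as well.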
Now that the data is aggregated by fire cause, we can aggregate it further to generate a single result per county, focusing on just the human caused fires and their percent of the total fires in a county.
human <- fire %>%
group_by(County, County_FIPS) %>%
filter(Statistical_Cause_Code == 5 | Statistical_Cause_Code == 7) %>%
summarise(humancaused = sum(cause),
percent_fires = sum(cause_per),
total_acres = sum(acres))
head(human)
## Source: local data frame [6 x 5]
## Groups: County
##
## County County_FIPS humancaused percent_fires total_acres
## 1 Accomack 1 67 69.073 211.5
## 2 Albemarle 3 74 37.563 903.5
## 3 Alleghany 5 8 24.242 50.5
## 4 Amelia 7 52 48.149 73.0
## 5 Amherst 9 36 40.449 649.3
## 6 Appomattox 11 54 48.649 703.8
The data set now contains only the number of human caused fires, the percent of fires in each county caused by humans, and the total number of acres burned. The next task in preparing the data is to merge the data from the fire and demographic data frames.
human.fire <- merge(human, demotmp, by.x = "County", by.y = "County_City")
head(human.fire)
## County County_FIPS humancaused percent_fires total_acres Rank
## 1 Accomack 1 67 69.073 211.5 79
## 2 Albemarle 3 74 37.563 903.5 12
## 3 Alleghany 5 8 24.242 50.5 86
## 4 Amelia 7 52 48.149 73.0 61
## 5 Amherst 9 36 40.449 649.3 91
## 6 Appomattox 11 54 48.649 703.8 81
## Per_capita_income Median_household_income Median_family_income
## 1 22766 41372 49727
## 2 36685 64847 83894
## 3 22013 43160 53205
## 4 24197 50135 58029
## 5 21097 44757 55211
## 6 22388 49224 58954
## Population Households
## 1 33164 13798
## 2 98970 38157
## 3 16250 6891
## 4 12690 4821
## 5 32353 12560
## 6 14973 6033
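Because `merge()` performs an inner join on exact name matches by default, any county whose name is spelled differently in the two tables is silently dropped. The regression output later in the report shows 93 residual degrees of freedom, which implies 95 merged rows out of the 134 rows in the demographic table. The sketch below shows one way to list unmatched names; the toy data frames and the "Example" entries are made up for illustration.

```r
# Toy stand-ins for the fire summary and the demographic table
# (the two "Example" rows and their values are made up)
toy_fire <- data.frame(County = c("Accomack", "Albemarle", "Example County"),
                       percent_fires = c(69.1, 37.6, 50.0))
toy_demo <- data.frame(County_City = c("Accomack", "Albemarle", "City of Example"),
                       Population = c(33164, 98970, 1000))

# merge() defaults to an inner join, so names without an exact match are dropped
joined <- merge(toy_fire, toy_demo, by.x = "County", by.y = "County_City")

# Names present on only one side of the join
fire_only <- setdiff(toy_fire$County, toy_demo$County_City)
demo_only <- setdiff(toy_demo$County_City, toy_fire$County)

print(nrow(joined))
print(fire_only)
print(demo_only)
```

Running the same two `setdiff()` calls on `human$County` and `demotmp$County_City` would show which localities fall out of the analysis.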
For this analysis I will be investigating the relationship between the median household income, the county population and the percent of human caused fires. Finally I will be looking to see if there is any relationship between the income / population and the total amount of acres burned in a county through human causes.
ggplot(human.fire, aes(Median_household_income, percent_fires)) +
geom_point(colour = 'red', size = 3) + geom_smooth(method = lm) +
labs(x='Median Household Income', y='Percent of Human Caused Fires') +
ggtitle("Percent of Human Caused Fires vs Median Household Income")
From the above plot we can see that there may be a relationship between the percent of human caused fires in a given county and the median household income in the county. The first thing that I will check is whether there is a linear relationship, using R’s built-in linear model.
x <- human.fire$Median_household_income
y <- human.fire$percent_fires
fit <- lm(y~x)
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.429 -9.815 -2.696 12.710 33.556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.3248824 5.2884021 13.109 < 2e-16 ***
## x -0.0003653 0.0000976 -3.743 0.000315 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.44 on 93 degrees of freedom
## Multiple R-squared: 0.1309, Adjusted R-squared: 0.1216
## F-statistic: 14.01 on 1 and 93 DF, p-value: 0.0003151
The results of this analysis indicate a weak negative relationship between the median household income and the percent of human caused fires: the fitted slope corresponds to roughly 3.7 fewer percentage points for every additional $10,000 of median household income, but with an R-squared of 0.13, income explains only about 13% of the variation between counties. However, if we look closely at the graph there seems to be a non-linear relationship between the variables. I will re-plot the data using a loess regression model.
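To make the size of the linear effect easier to read, the coefficients from the summary can be rescaled by hand. The sketch below copies the two estimates from the output above; the income levels of $40,000 and $60,000 are arbitrary values chosen here for illustration.

```r
# Coefficients copied from the summary(fit) output above
intercept <- 69.3248824
slope     <- -0.0003653   # percentage points per dollar of median household income

# Rescale the slope to a more readable unit: per $10,000 of income
slope_per_10k <- slope * 10000

# Fitted values at two illustrative income levels (chosen here for illustration)
income <- c(40000, 60000)
fitted_pct <- intercept + slope * income

print(slope_per_10k)
print(fitted_pct)
```

With the fitted model object available, `predict(fit, newdata = data.frame(x = income))` should give the same fitted values without copying coefficients.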
ggplot(human.fire, aes(Median_household_income, percent_fires)) +
geom_point(colour = 'red', size = 3) + geom_smooth(method = loess) +
labs(x='Median Household Income', y='Percent of Human Caused Fires') +
ggtitle("Percent of Human Caused Fires vs Median Household Income")
Looking at the graph of the loess regression, the relationship in the data appears to be non-linear, and there may be an exponential relationship between the median household income and the percent of human caused fires.
exponential.model <- lm(-log(y)~ x)
summary(exponential.model)
##
## Call:
## lm(formula = -log(y) ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51028 -0.27403 -0.00627 0.18971 1.34259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.224e+00 1.159e-01 -36.433 < 2e-16 ***
## x 6.979e-06 2.140e-06 3.262 0.00155 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3386 on 93 degrees of freedom
## Multiple R-squared: 0.1027, Adjusted R-squared: 0.093
## F-statistic: 10.64 on 1 and 93 DF, p-value: 0.001549
From the exponential model, with an R-squared of about 0.10, I can see that this is also not a strong predictor for the data set. This would be an area to continue to explore later on in the MSDA program once we have covered more statistical analysis tools.
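Because the model was fitted to `-log(y)`, its coefficients have to be back-transformed before they can be read as percentages. The sketch below inverts the fitted equation using the estimates copied from the output above; `predict_pct()` is a hypothetical helper, and the income values are illustrative.

```r
# Coefficients copied from the summary(exponential.model) output above
a <- -4.224      # intercept of -log(y) ~ x
b <- 6.979e-06   # slope per dollar of median household income

# Invert -log(y) = a + b * x to get y = exp(-(a + b * x))
predict_pct <- function(income) exp(-(a + b * income))

# Multiplicative change in the fitted percentage per additional $10,000
ratio_per_10k <- exp(-b * 10000)

print(predict_pct(c(40000, 60000)))
print(ratio_per_10k)
```

The ratio comes out at roughly 0.93, so under this model each additional $10,000 of median household income is associated with about a 7% relative drop in the fitted percentage of human caused fires.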
require(scales)
## Loading required package: scales
ggplot(human.fire, aes(Population, percent_fires)) +
geom_point(colour = 'red', size = 3) + geom_smooth(method = lm) + scale_x_log10(labels = comma) +
labs(y='Percent of Human Caused Fires') +
ggtitle("Percent of Human Caused Fires vs Population")
Interestingly, there appears to be no relationship between the population of a given county and the percentage of fires in the county caused by humans. To explore this further, let’s take a look at the linear model for the data.
x <- human.fire$Population
y <- human.fire$percent_fires
fit <- lm(y~x)
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.045 -10.675 -1.025 8.855 39.597
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.342e+01 2.000e+00 26.714 <2e-16 ***
## x -6.201e-05 2.381e-05 -2.604 0.0107 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.99 on 93 degrees of freedom
## Multiple R-squared: 0.06797, Adjusted R-squared: 0.05794
## F-statistic: 6.782 on 1 and 93 DF, p-value: 0.01072
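This gives us a very low R-squared value of 0.068. Although the slope is statistically significant at the 5% level (p = 0.011), population explains less than 7% of the variation, which further indicates that there is little to no relationship between the population of a given county and the percent of human caused fires.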
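One thing worth noting is that the plot uses a log10 scale for population while the linear model is fitted to the raw population values. A possible follow-up, sketched below, is to compare the R-squared of both specifications. `compare_population_fits()` is a hypothetical helper, and the data used here is simulated purely to show that the code runs; it is not the Virginia data and says nothing about the real result.

```r
# Hypothetical helper: compare R-squared for raw and log10 population predictors.
# Assumes columns named Population and percent_fires, as in human.fire.
compare_population_fits <- function(df) {
  raw_fit <- lm(percent_fires ~ Population, data = df)
  log_fit <- lm(percent_fires ~ log10(Population), data = df)
  c(raw = summary(raw_fit)$r.squared,
    log10 = summary(log_fit)$r.squared)
}

# Simulated data for demonstration only; these are NOT the Virginia values
set.seed(1)
sim <- data.frame(Population = round(10^runif(95, min = 3.5, max = 6)))
sim$percent_fires <- 80 - 8 * log10(sim$Population) + rnorm(95, sd = 10)

print(compare_population_fits(sim))
```

Calling `compare_population_fits(human.fire)` would apply the same comparison to the merged county data.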
ggplot(human.fire, aes(Median_household_income, total_acres)) +
geom_point(colour = 'blue', size = 3) + geom_smooth(method = lm) +
labs(x='Median Household Income', y='Total Acres Burned') +
ggtitle("Total Acres Burned by Human Caused Fires vs Median Household Income")
Once again there is no real relationship in this data, but there is an interesting observation: the largest totals of acres burned by human caused fires in the state of Virginia have occurred in the poorest counties in the state.
ggplot(human.fire, aes(Population, total_acres)) +
geom_point(colour = 'blue', size = 3) + geom_smooth(method = lm) + scale_x_log10(labels = comma) +
labs(y='Total Acres Burned') +
ggtitle("Total Acres Burned by Human Caused Fires vs Population")
This result is rather interesting: there is no real relationship between the total acres burned and the population. That being said, all of the largest totals of acres burned by human caused fires in the state have occurred in counties that are in the middle of the population curve. This raises some questions about what other factors are contributing to human caused fires in these counties.
The final analysis that I will look at is a basic geographic analysis of the human caused fires in the state of Virginia to see if there are any patterns in the locations of counties with a high percentage of human caused fires.
The first part of the analysis is to load in the packages and the census shapefile that was downloaded at the beginning of the project.
require(ggmap)
## Loading required package: ggmap
require(maptools) # The rgeos package must be installed for this portion
## Loading required package: maptools
## Loading required package: sp
## Checking rgeos availability: TRUE
va10 <- readShapeSpatial("tl_2010_51_county10.shp")
vadata <- fortify(va10, region = 'COUNTYFP10') # requires rgeos()
## Loading required package: rgeos
## rgeos version: 0.3-8, (SVN revision 460)
## GEOS runtime version: 3.4.2-CAPI-1.8.2 r3921
## Polygon checking: TRUE
vadata$id <- as.numeric(vadata$id)
va_info <- merge(vadata, human.fire, by.x = 'id', by.y = 'County_FIPS', all.x = TRUE)
va_info <- arrange(va_info, group, order) # the data frame must be reordered from the merge to create the maps
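The reordering step matters because `merge()` sorts its result on the join key, which scrambles the vertex sequence that `geom_polygon()` relies on. The sketch below shows the effect on a made-up set of vertices; the toy data frames are hypothetical, the `percent_fires` values are rounded from the table earlier in the report, and the toy data is sorted on `order` alone because it has no `group` column.

```r
# Toy polygon vertices (made up): two regions, drawing sequence stored in `order`
verts <- data.frame(id    = c(3, 3, 3, 1, 1, 1),
                    long  = c(0, 1, 1, 2, 3, 3),
                    lat   = c(0, 0, 1, 0, 0, 1),
                    order = 1:6)
attrs <- data.frame(County_FIPS = c(1, 3), percent_fires = c(69.1, 37.6))

# merge() sorts its result on the key, so region 1 now precedes region 3
merged <- merge(verts, attrs, by.x = "id", by.y = "County_FIPS", all.x = TRUE)
print(is.unsorted(merged$order))

# Restore the original drawing sequence before plotting
restored <- merged[order(merged$order), ]
print(restored$order)
```

Skipping the reordering typically produces polygons with stray lines across the map, since vertices are connected in row order.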
Now that data and packages have been loaded into R the first thing that I will do is get a sense for the median household income across the state of Virginia. There is a need to limit the shading on the graph to exclude the Washington D.C. area from the analysis. This decision was made to see more shading across the state and to eliminate counties that see very little wildfire.
ggplot(va_info, aes(x = long, y = lat, group = group, fill = (Median_household_income))) +
scale_fill_gradient(limits = c(20000, 100000), low="green", high="white", na.value = 'grey50', labels = comma) +
geom_polygon() +
geom_path(color="Black") +
coord_map() +
labs(fill = 'Median Household\nIncome') +
ggtitle('Household Income in Counties with Human Caused Fires')
The areas of the state with the highest median household incomes are the areas around Northern Virginia / Washington D.C. and the Richmond through Virginia Beach corridor. The poorest areas of the state are the agricultural counties in the south central part of the state and the region of Appalachia in south western Virginia. What will be interesting to see is whether either of these areas has counties with a higher percentage of human caused fires than the rest of the state.
ggplot(va_info, aes(x = long, y = lat, group = group,
fill = percent_fires)) +
scale_fill_gradient(low="White", high="red", na.value = 'grey50') +
geom_polygon() +
geom_path(color="Black") +
coord_map() +
labs(fill = 'Percent of\nHuman Caused Fires') +
ggtitle('Percentage of Human Caused Fire by County in Virginia')
The area of the state with the largest percentage of human caused fires is the Appalachia area of south western Virginia. This area is also one of the poorest areas in the state and has a number of interesting cultural and sociological differences from the rest of the state. However, there are a few other ‘hot spots’ around the state that also lead to further questions. Why is there an area of increased human caused fires just west of the Northern Virginia area, which is one of the most populous and wealthiest parts of the state? Another interesting question is, why do we have one county just to the east of Appalachia that has an extremely low percentage of human caused fires? As with most analyses, the number of questions at the end of the project is greater than the number of questions that I had at the start.
As I have learned throughout this semester, the end of a data analysis project often leads to a more nuanced view of the topic and more questions than I had when I started the project. Ultimately there are a number of factors that lead to human caused fires in the state of Virginia. There is a correlation between the median household income and the percent of human caused fires that is pronounced in the poorest of counties. However, we also see a number of issues with fires in counties that are in the middle of the income range. Even though there is the expectation that there would be more fires in rural, low population counties, the data indicates that there needs to be a certain level of population before arson and debris burning become an issue. There are also interesting geographical relationships in the data. The geographical analysis reinforces the correlation between income and human caused fires, but it also shows us that there are certain ‘hot-spots’ in the state. The largest one of these lies in the ‘Coal Country’ of Appalachia. This area of the state has very rugged terrain and a very different culture from the rest of the state. That being said, there are also pockets of higher percentages of human caused fires in areas of the state that also have a higher income. This project has just scratched the surface of the cultural, social, economic, and historical reasons for human caused fires throughout the state, and while I found some correlations, it’s now obvious that the problem of human caused fires in the state of Virginia and the Southern United States is much more complicated than one would assume.