The purpose of this document is to perform some initial data exploration on the firefighter monitoring data provided by Joe Sol and Joe Domitrovich of the USDA Forest Service. There is a lot of data, and its structure is somewhat complex, so I figured I’d put together a markdown document that provides some initial insight into how we might best work with it. We’ll start by loading all of the necessary libraries.
library(stringr)
library(dplyr)
library(DT)
library(stringdist)
The data can be found here: S:\ursa\campbell\Firefighter_Monitoring\Data\Firefighter_Data_Analysis. Next, we’ll create a list of every file in that directory structure, searching recursively through every subdirectory:
# set working directory
setwd("S:/ursa/campbell/Firefighter_Monitoring")
# create file list
files <- list.files("Data/Firefighter_Data_Analysis",
full.names = TRUE,
recursive = TRUE) %>%
tolower() %>%
paste0("s:/ursa/campbell/firefighter_monitoring/", .)
# convert to datatable format for display
files.dt <- as.data.frame(files) %>%
datatable(rownames = FALSE)
files.dt
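To see what the recursive listing does without touching the shared drive, here is a minimal sketch on a throwaway directory. The directory and file names are made up for illustration, and `normalizePath()` is used to build absolute paths in place of the hard-coded prefix above.

```r
# build a small throwaway directory tree (hypothetical names, not the real data)
demo.dir <- tempfile("listing_demo")
dir.create(file.path(demo.dir, "crew_a", "day_1"), recursive = TRUE)
file.create(file.path(demo.dir, "crew_a", "GPS_1.csv"))
file.create(file.path(demo.dir, "crew_a", "day_1", "notes.TXT"))

# recursive = TRUE descends into every subdirectory;
# full.names = TRUE prepends the directory passed as the first argument
demo.files <- list.files(demo.dir, full.names = TRUE, recursive = TRUE)

# resolve to absolute paths with forward slashes, then lower-case for matching
demo.files <- tolower(normalizePath(demo.files, winslash = "/"))
print(demo.files)
```

Lower-casing the paths is harmless on Windows, where file names are case-insensitive. On a case-sensitive file system the lower-cased strings may no longer point at real files, so there it would be safer to lower-case a copy used only for matching.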
There are 856 individual files of many different types. Let’s see which file types (as represented by file extension) are present in the directory:
# create a list of file extensions
extensions <- str_split(files, "\\.") %>%
sapply(tail, 1) %>%
unique()
# count the number of files with each extension
ext.count.df <- data.frame(extension = extensions,
count = NA)
for (extension in extensions){
# endsWith() compares literally, so the "." is not treated as a regex wildcard
count <- sum(endsWith(files, paste0(".", extension)))
ext.count.df$count[ext.count.df$extension == extension] <- count
}
ext.count.df <- ext.count.df[order(ext.count.df$count, decreasing = TRUE),]
# convert to datatable for display
ext.count.dt <- datatable(ext.count.df, rownames = FALSE)
ext.count.dt
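The same tally can be sketched more compactly with `tools::file_ext()` and `table()`. The paths below are toy values, not files from the real directory, and this is an alternative to the loop above rather than the method used to produce the table.

```r
# toy file paths (hypothetical), already lower-cased like `files`
demo.files <- c("a/gps_1.csv", "a/gps_2.csv", "b/notes.txt", "b/readme", "c/x.csv.bak")

# file_ext() returns the text after the last dot, or "" when there is none
demo.ext <- tools::file_ext(demo.files)

# table() tallies each extension; sort so the most common comes first
demo.counts <- sort(table(demo.ext), decreasing = TRUE)

# convert to a plain data frame with the same column names used above
demo.counts.df <- as.data.frame(demo.counts, stringsAsFactors = FALSE)
names(demo.counts.df) <- c("extension", "count")
demo.counts.df
```

One caveat: `file_ext()` only recognizes purely alphanumeric extensions, so a file without a recognizable extension is counted under the empty string instead of under its full path.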
Of these file types, we’re most interested in the tabular data (e.g. CSV files, TXT files, XLSX files). So, let’s dig a little deeper into them, starting with the CSV files:
# list all of the csv files (the escaped dot matches a literal ".")
csv.files <- files[grep("\\.csv$", files)]
# convert to datatable for display
csv.files.dt <- as.data.frame(csv.files) %>%
datatable(rownames = FALSE)
csv.files.dt
There are many different file names, but they share some similarities (e.g. “qtravel_x.csv”, “gps_x.csv”). Let’s see if we can narrow this list down to groups of similar files:
# isolate just the file names (without the full paths or extensions)
file.names <- str_split(csv.files, "/") %>%
sapply(tail, 1) %>%
str_split("\\.") %>%
sapply(head, 1) %>%
sort()
# create a list of the unique file name parts
file.parts <- str_split(file.names, "_", simplify = T) %>%
as.vector() %>%
unique()
# count the number of file names containing each part
part.count.df <- data.frame(file.part = file.parts,
count = NA)
for (file.part in file.parts){
count <- sum(grepl(file.part, file.names))
part.count.df$count[part.count.df$file.part == file.part] <- count
}
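part.count.df <- part.count.df[order(part.count.df$count, decreasing = TRUE),]
# clean up: keep non-empty, non-numeric parts that occur in more than one file name
# (as.character() guards against factor columns; suppressWarnings() hides the expected coercion warnings)
part.is.numeric <- !is.na(suppressWarnings(as.numeric(as.character(part.count.df$file.part))))
part.count.df <- part.count.df[part.count.df$count > 1 &
part.count.df$file.part != "" &
!part.is.numeric,]
# convert to datatable for display
part.count.dt <- datatable(part.count.df, rownames = FALSE)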
part.count.dt
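As a rough illustration of the grouping idea, the sketch below tokenizes a handful of made-up file names and counts how many names contain each token. Unlike the `grepl()` loop above, it counts whole tokens rather than substrings, so the two approaches can give different counts on real data.

```r
# toy file names without paths or extensions (hypothetical)
demo.names <- c("gps_1", "gps_2", "qtravel_1", "qtravel_gps_3", "hr_1")

# split each name on "_" and keep each token once per file name
demo.tokens <- lapply(strsplit(demo.names, "_", fixed = TRUE), unique)

# count how many file names contain each token
demo.token.counts <- sort(table(unlist(demo.tokens)), decreasing = TRUE)

# drop purely numeric tokens and tokens that occur in only one file name
keep <- !grepl("^[0-9]+$", names(demo.token.counts)) &
  as.vector(demo.token.counts) > 1
demo.kept <- demo.token.counts[keep]
demo.kept
```

With these toy names, only "gps" and "qtravel" survive the filter, which mirrors the kind of grouping we are after in the real list.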
Intuitively, files with “gps” in the name should have GPS data, but just to be sure, let’s take a look at a randomly selected file with “gps” in the name:
# randomly select one of the "gps" csv files, read it in, and convert to datatable for display
# (sampling the full paths directly guarantees that exactly one file is selected)
gps <- sample(csv.files[grepl("gps", basename(csv.files))], 1)
gps.dt <- gps %>%
read.csv() %>%
head() %>%
datatable(rownames = FALSE)
gps.dt
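The pick-one-file-and-peek step can be sketched on toy data as follows. The column names here are invented for the example and are not the real GPS schema; the point is only to show a sampling pattern that always yields exactly one path, plus `nrows` to avoid parsing a whole file just to preview it.

```r
# write two small toy CSV files (hypothetical columns, not the real schema)
demo.dir <- tempfile("gps_demo")
dir.create(demo.dir)
for (i in 1:2) {
  write.csv(data.frame(time = 1:10, lat = 46.9 + i, lon = -114.0),
            file.path(demo.dir, paste0("gps_", i, ".csv")),
            row.names = FALSE)
}
demo.csv <- list.files(demo.dir, full.names = TRUE)

# pick one path by sampling an index; sample.int() always returns
# an index in 1..length(demo.csv), even when there is only one file
set.seed(1)
demo.pick <- demo.csv[sample.int(length(demo.csv), 1)]

# nrows limits how many rows are parsed, which helps with large logs
demo.head <- read.csv(demo.pick, nrows = 6)
demo.head
```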
Yep! Looks like what we want. Knowing that they were using “QTravel” brand GPS units, it’s likely that the CSV files with “qtravel” in the name likewise contain GPS data. Let’s check it out:
# randomly select one of the "qtravel" csv files, read it in, and convert to datatable for display
# (sampling the full paths directly guarantees that exactly one file is selected)
qtravel <- sample(csv.files[grepl("qtravel", basename(csv.files))], 1)
qtravel.dt <- qtravel %>%
read.csv() %>%
head() %>%
datatable(rownames = FALSE)
qtravel.dt
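Eyeballing two tables works for a first look, but the comparison can also be made explicit. The sketch below checks whether two data frames share the same column names and column types; the two data frames are toy stand-ins with invented columns, not the real "gps" and "qtravel" files.

```r
# toy stand-ins for one "gps" file and one "qtravel" file (hypothetical columns)
demo.gps <- data.frame(time = 1:3,
                       lat = c(46.9, 46.9, 47.0),
                       lon = c(-114.0, -114.1, -114.1))
demo.qtravel <- data.frame(time = 4:6,
                           lat = c(47.0, 47.1, 47.1),
                           lon = c(-114.2, -114.2, -114.3))

# same column names, in the same order?
same.names <- identical(names(demo.gps), names(demo.qtravel))

# same column types?
same.types <- identical(sapply(demo.gps, class), sapply(demo.qtravel, class))

# columns present in one table but missing from the other
only.in.gps <- setdiff(names(demo.gps), names(demo.qtravel))
only.in.qtravel <- setdiff(names(demo.qtravel), names(demo.gps))

c(same.names = same.names, same.types = same.types)
```

If either check came back `FALSE` on the real files, the two `setdiff()` results would show which columns differ.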
Looks exactly the same. OK, so I think we’ve now narrowed down at least some of our key data. It’s worth looking into some of the other files as well, but between the “gps” files and the “qtravel” files, we have a good starting point. Here’s a list of all of the “gps” and “qtravel” files:
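# list all of the csv files with "gps" or "qtravel" in the file name
# (matching on basename() ignores directory names)
gps.files.1 <- csv.files[grep("qtravel", basename(csv.files))]
gps.files.2 <- csv.files[grep("gps", basename(csv.files))]
# unique() drops files whose names contain both "qtravel" and "gps"
gps.files <- unique(c(gps.files.1, gps.files.2))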
# convert to data table for display
gps.files.dt <- as.data.frame(gps.files) %>%
datatable(rownames = FALSE)
gps.files.dt
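The same selection can also be written as a single pattern. In the sketch below, a regular expression with alternation is matched against `basename()`, so each matching file is selected once and directory names are ignored. The paths are made up for illustration.

```r
# toy csv paths (hypothetical); the last one sits in a directory named
# "gps_archive" but is not itself a gps or qtravel file
demo.csv <- c("s:/data/crew_a/gps_1.csv",
              "s:/data/crew_a/qtravel_1.csv",
              "s:/data/crew_b/qtravel_gps_2.csv",
              "s:/data/gps_archive/hr_1.csv")

# "gps|qtravel" matches either word; basename() strips the directories
demo.gps.files <- demo.csv[grepl("gps|qtravel", basename(demo.csv))]
demo.gps.files
```

Here "qtravel_gps_2.csv" appears only once even though it matches both words, and "hr_1.csv" is excluded because only its parent directory contains "gps".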