Like much of the western United States, my adopted home state of Utah is experiencing water stress from increasing demand, drought, and conflict over water rights. At the same time, Utahns use a lot of water per capita compared to residents of other states. According to the United States Geological Survey, in 2014 people in Utah used more water per person than in any other state. For my proposal project, I propose to explore water consumption in Salt Lake City, the largest city in Utah. A first step to water solutions in Utah is to better understand who is using water, when, and for what.
The city of Salt Lake makes water consumption data publicly available at the census tract and block level, with information on the type of water user (single residence, apartment, hospital, business, etc.) and amount of water used. The data at the block level is available here and can be accessed via Socrata Open Data API.
To do an exploratory data analysis, I first clean up the data a bit. I remove rows that have non-year entries in the year column and NA entries in the water consumption column. I also set the data types to what I need for this analysis. Then, let’s look at the class of each column and see the dimensions of the data frame.
# remove lines with non-year entry for year
water <- waterbyblock[grep("[0-9]{4}", waterbyblock$YEAR),]
# remove lines with NA for consumption
water <- water[!is.na(water$CONSUMPTION),]
# set data types to what I need
water[,1:5] <- lapply(water[,1:5], as.factor)
water[,6:7] <- lapply(water[,6:7], as.numeric)
sapply(water, class)
## YEAR MONTH TRACT BLOCK TYPE CONSUMPTION
## "factor" "factor" "factor" "factor" "factor" "numeric"
## CONNECTIONS
## "numeric"
dim(water)
## [1] 1204707 7
So this data set after cleaning includes 1204707 observations of water consumption in Salt Lake City.
To make the first exploratory plot, I group these observations by month, year, and type of user; the types include categories like single residence, park, business, etc. Then I sum up all water consumption within these groups so that I can see the distribution of aggregated monthly water consumption across the types of water users.
monthly <- water %>% group_by(MONTH, YEAR, TYPE) %>%
summarize(consumption = sum(CONSUMPTION))
Let’s see what these distributions look like.
This box plot shows that single residence and business users consume the most water each month in Salt Lake City. There are some very high outliers; exploring these points showed that they all occurred in 2014, a drought year. This first exploratory information can be used to inform further analysis about how water is used.