Christian McIntire, supported by Dr. Alexandre Scarcioffolo
Introduction
In this research, I compare and predict housing price trends in Tennessee and Ohio, which represent Southeastern and Midwestern U.S. housing markets, respectively. Both states are broadly representative of their regions, and having lived in each for an extended period, I am interested in contrasting their housing markets to explore affordability and growth potential for young professionals who are navigating rising real estate costs or looking for potential investments. Housing affordability is a critical concern for younger demographics, especially graduates entering the workforce. Because of rising housing prices in recent years, there has been a steady increase in the number of adults choosing to live with their parents after graduating from college (Avery, 2022). Comparing these two states offers insight into regional differences in the United States and potential opportunities for future homeowners or investors. The goal of this research is to uncover how population growth and housing prices in Tennessee and Ohio compare over time and to understand what factors drive these trends. Can predictive models forecast future housing price trends in each state to guide young professionals in making informed decisions about affordability and opportunities?
This report will use historical housing price data along with state population statistics to identify trends and differences between Tennessee and Ohio. It will also apply predictive modeling techniques, such as time series analysis and machine learning, to forecast housing prices. The expected outcomes for this project are a comparative analysis of housing price dynamics in Tennessee and Ohio, insights into the drivers of housing prices and their implications for young professionals and eager investors, and predictive modeling to guide decision-making on housing affordability, investment profitability, and long-term opportunities.
Ethical Considerations
When conducting research that involves county-specific housing price data and state population data, it is imperative to understand the ethical considerations at hand. To begin, data privacy and sensitivity are major ethical considerations within this research. While population data is often anonymized, the aggregation of demographic statistics can still create risks, such as exposing sensitive trends about vulnerable communities or encouraging discriminatory practices. The purpose of this research is to understand and predict housing trends across Tennessee and Ohio as a whole and to examine the factors that have led to housing price increases in both states. By studying state- and county-wide data rather than the trends of any specific community, this approach reduces the risk of misinterpretation or misuse of the data. Population growth can often highlight disparities in access to housing, infrastructure, and resources, and results from this research can positively inform urban planning or development policies. However, careful framing and proper background are needed to avoid marginalizing underrepresented populations. Population data may also exclude transient or undocumented populations, which can lead to incomplete conclusions about growth or housing demand. Antony K. Cooper and Serena Coetzee note another potential downside of using public data: “any data set is invariably a biased representation of the population” (Cooper & Coetzee, 2020). To mitigate this, I will focus on broader trends and compare the larger-picture results of my prediction models between Ohio and Tennessee.
Additionally, these findings are tailored to young professionals and eager investors, so findings that highlight rising housing prices might inadvertently attract speculative investment, leading to higher levels of gentrification. Increased gentrification is not the desired result of this research; the aim is to inform readers of which trends are present and which are possible in future months and years. This research will also provide insights into regional housing affordability, contributing to an understanding of how young professionals or low-income groups might be disproportionately affected by rising housing costs. Future research could suggest actionable strategies for addressing housing inequality and focus on affordable but sustainable growth in the busier counties of both Tennessee and Ohio.
Another important consideration is that the House Price Index datasets, sourced from the Federal Reserve Bank of St. Louis, are not adjusted for inflation. Using these nominal values without adjusting for inflation could misrepresent affordability trends over time, which I will take into account within this research.
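One common way to handle nominal values is to deflate the index by a consumer price index. The sketch below is illustrative only: the `hpi` and `cpi` values are made-up placeholders rather than FRED observations, and the choice of CPI as the deflator and 2020 as the base year are assumptions, not necessarily the adjustment applied in this report.

```r
# Placeholder values for illustration; these are NOT actual FRED observations.
hpi <- data.frame(
  year = c(2020, 2021, 2022),
  hpi_nominal = c(100, 110, 121)
)
cpi <- data.frame(
  year = c(2020, 2021, 2022),
  cpi = c(100, 105, 110)
)

base_year <- 2020  # assumed base year for the inflation-adjusted index

# Combine the two series on year
merged <- merge(hpi, cpi, by = "year")

# Deflate: real = nominal * (CPI in base year / CPI in current year)
cpi_base <- merged$cpi[merged$year == base_year]
merged$hpi_real <- merged$hpi_nominal * cpi_base / merged$cpi

print(merged)
```

In this toy series the nominal index rises 21% over two years, while the deflated index rises 10%, which is the kind of gap that can distort an affordability comparison if it is ignored.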
I will mention these limitations to ensure that any relevant stakeholders understand the contextual nature and nuances of the data and any pertinent visualizations that I create. Also, in alignment with ethical standards outlined for real estate professionals, this research ensures that all visualizations and predictive models accurately represent trends without unnecessary exaggeration or bias (Treleaven et al., 2021).
Stakeholders
Stakeholders in this research may include, but are not limited to:
Young Professionals and Recent College Graduates
Concerned about affordable housing options and proximity to urban job markets in larger cities within Tennessee and Ohio
Likely interested in actionable insights to make informed decisions on housing investments or a potential relocation
Local and State Governments
May use these findings for urban planning, zoning decisions, and housing policies
Real Estate Investors and Developers
Research provides insights into housing market trends
Can inform investment strategies in both Tennessee and Ohio
Ethical concerns include the risk of exacerbating affordability crises through speculative development
Academics and Future Researchers
May build on the predictive methodology and findings of this research to broaden studies on population and housing dynamics
Overall, there are several key ethical principles I will keep in mind when conducting this research. The first is transparency: I will clearly outline the methodologies, assumptions, and limitations of my models within this document. The second is equity: I will strive to ensure that the findings of this research do not disproportionately harm or benefit any one demographic group, but instead give any curious researcher more information on past trends and how they may extend into the future. The third is long-term benefit: I will prioritize recommendations that align with long-term societal benefits. Additionally, I have ensured that all data used for analysis and modeling is properly sourced and cited within this research.
Comprehensive Analysis and Data Exploration
Data and Design
Due to the complexity of this project and the integration of multiple metrics, I used five key datasets that form the foundation of this analysis. These datasets were selected to provide both the historical depth and the variables necessary for a comprehensive comparative analysis of housing price trends and population dynamics in Tennessee and Ohio. Because no single dataset covered every aspect of this research, combining several publicly available datasets allows for predictive modeling across a variety of variables, along with interactive data visualizations and insights into affordability, growth potential, and investment strategies for young professionals.
The first dataset that I will utilize is the All-Transactions House Price Index for Tennessee. The source for this dataset is the Federal Reserve Bank of St. Louis (FRED). This time series dataset covers housing price trends in Tennessee from 1975 to 2024. It provides a housing price index normalized to a baseline year, enabling the analysis of price changes over nearly five decades. This dataset offers a comprehensive view of historical trends within the housing price index of Tennessee, critical for identifying prolonged growth and market dynamics in the state.
The second dataset that I will utilize is the All-Transactions House Price Index for Ohio, also sourced from the Federal Reserve Bank of St. Louis. It provides a direct counterpart to the Tennessee series, allowing for the time-series predictions and visualizations that are key to this report. By pairing this dataset with the Tennessee HPI, I will perform comparative analyses between Tennessee and Ohio, showcasing similarities and differences in regional housing markets over time.
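Because an index measures change relative to its base period rather than a dollar price level, comparing the two states means comparing growth. The following is a minimal sketch of one way to do that, by rebasing both series to a shared reference year. The column names `tn` and `oh`, the values, and the reference year are hypothetical placeholders, not FRED data.

```r
# Placeholder index values; these are NOT actual FRED observations.
hpi <- data.frame(
  year = c(2000, 2010, 2020),
  tn = c(200, 250, 400),
  oh = c(220, 231, 330)
)

base_year <- 2000  # assumed common reference year

# Rebase a series so that its value in the reference year equals 100
rebase <- function(x, years, base) {
  100 * x / x[years == base]
}

hpi$tn_rebased <- rebase(hpi$tn, hpi$year, base_year)
hpi$oh_rebased <- rebase(hpi$oh, hpi$year, base_year)

# Difference in cumulative growth since the reference year, in index points
hpi$gap <- hpi$tn_rebased - hpi$oh_rebased

print(hpi)
```

After rebasing, both series start at 100 in the reference year, so the `gap` column reads directly as how far one state's cumulative price growth has pulled ahead of the other's.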
The third dataset, County Median Home Prices and Monthly Mortgage Payments, is essential for interactive visualizations which will demonstrate the drastic variations in county home prices in each state. This dataset includes county-level median home prices and estimated monthly mortgage payments, offering insights into housing affordability. While the raw dataset encompasses all U.S. counties, I will be focusing on counties specific to Tennessee and Ohio, organizing and cleaning the dataset to enhance its utility for this report and research. This dataset is crucial for visualizing regional housing affordability trends and analyzing potential economic impacts on young professionals entering the housing market, and eager investors looking for sustained gain.
The fourth dataset in this report is published by Macrotrends, which sources it from the U.S. Census Bureau. It is a Tennessee population dataset (1900-2023) with over a century of annual population estimates for the state. It allows for time series analysis of population growth and its correlation with housing price trends. While it does not differentiate between counties, the state-level data offers sufficient granularity for this project’s focus on macroeconomic and demographic drivers.
The fifth and final dataset, like the Tennessee dataset, provides annual population estimates for Ohio from 1900 to 2023. It enables a comparative analysis of population trends between the two states, supporting insights into how demographic shifts influence housing markets. This dataset is also from Macrotrends, sourced from the U.S. Census Bureau.
While county-level historical population data would have provided further specificity, the lack of sufficiently detailed datasets made statewide population data the strongest available option for this project. By using state-level data for population dynamics and county-level data for housing prices, this approach balances historical depth with regional detail. Statewide population trends remain effective for modeling broad economic and housing market shifts, while county-level housing data supports the more specific visualizations and exploratory sections of this report. Together, these datasets address this project’s core objectives: understanding historical and predictive housing trends, exploring the drivers of these trends, and assessing affordability and growth potential for Tennessee and Ohio. Pursuing these objectives through statistical analyses, interactive data visualizations, and predictive modeling grounds the report in reliable, publicly available data.
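To relate annual population estimates to a house price index that is reported more often than once a year, the two series first need a common frequency. The sketch below assumes a quarterly index and uses made-up values (not FRED or Macrotrends data). It averages the index by year, merges it with population, and compares year-over-year growth rates. It illustrates the mechanics only and is not the model used in this report.

```r
# Placeholder values; these are NOT actual FRED or Macrotrends observations.
hpi_q <- data.frame(
  year = rep(2019:2022, each = 4),
  hpi = c(100, 102, 104, 106, 108, 110, 112, 114,
          120, 124, 128, 132, 140, 142, 144, 146)
)
pop <- data.frame(
  year = 2019:2022,
  population = c(6.80e6, 6.86e6, 6.93e6, 7.01e6)
)

# Collapse the quarterly index to an annual mean to match the population frequency
hpi_a <- aggregate(hpi ~ year, data = hpi_q, FUN = mean)

# Combine the two annual series on year
combined <- merge(hpi_a, pop, by = "year")

# Year-over-year percent change; the first year has no predecessor, so it is NA
pct_change <- function(x) c(NA, 100 * diff(x) / head(x, -1))
combined$hpi_growth <- pct_change(combined$hpi)
combined$pop_growth <- pct_change(combined$population)

# Correlation between the two growth series, skipping the leading NA
growth_cor <- cor(combined$hpi_growth, combined$pop_growth, use = "complete.obs")

print(combined)
print(growth_cor)
```

A correlation computed this way describes co-movement only. It does not show that population growth causes price growth, and with short series it is highly sensitive to individual years.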
Visualization Preparation
To begin the analysis, I will create interactive visualizations for Tennessee and Ohio. The first two will be three-dimensional choropleth graphs of housing prices in each state, built with the rayshader package in RStudio.
The three-dimensional element encodes price as height as well as color, making it easier to identify disparities in housing prices across counties. For instance, counties with drastically higher prices will stand out visually and draw attention to regions of economic interest. Additionally, presenting Tennessee and Ohio through matching three-dimensional visualizations allows for a direct comparison of housing price distributions by county, which offers insight into regional differences. Visualizing housing prices at the county level also helps uncover patterns that a state-wide analysis might miss. This perspective is key to understanding affordability trends for young professionals who may prioritize specific regions within each state. The visualizations will use a green gradient color palette and a simple, consistent design so that the diagrams can be compared directly. While the initial choropleth maps provide an overview of county-level trends, the three-dimensional visualizations serve as a bridge to future predictive modeling. They visually emphasize counties within each state that have high variance or noticeable outliers, which could warrant further investigation in future forecasting models.
Before I began gathering data for this visualization, I was interested in seeing median home prices across Tennessee counties. The dataset that I found came from the National Association of Realtors, and after cleaning the data, I was able to map each county’s location to its corresponding median home price. Once I had created a 2D choropleth of Tennessee, I applied a green gradient color palette to visualize changes around the state. This simple green theme, along with a plain black border, became the basis for the design theme used in later visualizations in this project. A choropleth fit this purpose because it provided a highly customizable base graph that I could build on. Next, I created a choropleth graph of Tennessee housing prices to expand my research and compare Tennessee prices with those of other states in the Southeast, again by county. To do so, I used the same dataset, again with median county home price as the main metric. However, I knew the data could be better served by a more complex visualization. I then used the rayshader package to convert my static, flat choropleth graphs into three-dimensional visualizations to show how drastic the changes across counties were. The three-dimensional visualizations encode price a second time through the height of each county, which makes differences between counties easier to see. The graphs share identical labeling and styling so that they can be compared directly.
Choropleth Visualizations
Data Cleaning
# CSV Cleaning
library(tidyverse)

file_path <- "/Users/cmacbook/Documents/1-Denison/Senior Fall/DA 352/1-Final Project/Data/USMedianHousingPrices.csv"
housing_data <- read_csv(file_path)

housing_data <- housing_data %>%
  mutate(
    `MedianHomePrice` = str_replace_all(`MedianHomePrice`, "[$,]", "") %>%
      as.numeric(), # Remove dollar signs and commas, then convert to numeric
    state = str_extract(County, ",\\s*\\w+$") %>% # Extract state after comma
      str_remove_all(",\\s*"), # Remove comma and spaces
    County = str_extract(County, "^[^,]+") %>% # Extract the county name before the comma
      str_trim() %>% # Trim leading and trailing spaces
      tolower() # Convert to lowercase
  ) %>%
  distinct(County, state, .keep_all = TRUE) # Remove duplicates

write_csv(housing_data, "/Users/cmacbook/Documents/1-Denison/Senior Fall/DA 352/1-Final Project/Data/Cleaned_USMedianHousingPrices.csv")
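To make the cleaning rules above easier to follow, here is a small sketch that applies the same regular expressions to two hypothetical rows. The rows, including the prices, are invented for illustration and assume the raw file stores values in the form "County name, State" with dollar-formatted prices.

```r
library(dplyr)
library(stringr)

# Hypothetical rows mimicking the assumed raw format; not taken from the dataset
sample_raw <- tibble(
  County = c("Davidson County, Tennessee", "Franklin County, Ohio"),
  MedianHomePrice = c("$450,000", "$275,500")
)

sample_clean <- sample_raw %>%
  mutate(
    # Strip "$" and "," so the price can be parsed as a number
    MedianHomePrice = as.numeric(str_replace_all(MedianHomePrice, "[$,]", "")),
    # Take the text after the comma as the state (computed before County is overwritten)
    state = str_remove_all(str_extract(County, ",\\s*\\w+$"), ",\\s*"),
    # Keep the text before the comma, trimmed and lowercased
    County = tolower(str_trim(str_extract(County, "^[^,]+")))
  )

print(sample_clean)
```

One caveat of the pattern `,\\s*\\w+$` is that it matches only a single trailing word, so a two-word state name would come back as `NA`. This does not affect Tennessee or Ohio, but it matters if the cleaned file is reused for other states.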
Tennessee - 2D Choropleth Visualization
Visualization Code
library(tidyverse)
library(sf)
library(scales)

# This code was based on my visualization in DA301, and adjusted to provide a cleaner output for this visualization
USMedianHousingPrices <- normalizePath("/Users/cmacbook/Documents/1-Denison/Senior Fall/DA 352/1-Final Project/Data/Cleaned_USMedianHousingPrices.csv")
shapefile <- normalizePath("/Users/cmacbook/Documents/1-Denison/Senior Fall/DA 352/1-Final Project/Data/tl_2024_us_county.shp")

# Step 1: Load and prepare the housing data
home_sales <- read_csv(USMedianHousingPrices) %>%
  mutate(
    county = str_trim(tolower(County)), # Standardize county names
    state = str_trim(tolower(state)) # Standardize state names
  ) %>%
  distinct(county, state, .keep_all = TRUE) # Remove duplicates

home_sales <- home_sales %>%
  mutate(county = str_replace(county, " county$", ""))

# Step 2: Load and prepare the Tennessee shapefile
tennessee_geo <- st_read(shapefile, quiet = TRUE) %>%
  filter(STATEFP == "47") %>%
  mutate(
    NAME = str_trim(tolower(NAME)),
    state = "tennessee"
  ) %>%
  distinct(NAME, state, .keep_all = TRUE)

# Step 3: Merge shapefile with housing data, matching by both `county` and `state`
tennessee_geo <- tennessee_geo %>%
  left_join(home_sales, by = c("NAME" = "county", "state" = "state")) %>%
  mutate(price = ifelse(is.na(`MedianHomePrice`), median(`MedianHomePrice`, na.rm = TRUE), `MedianHomePrice`)) # Handle missing prices

# Step 4: Create the 2D ggplot choropleth map
ggTNPrices <- ggplot(data = tennessee_geo) +
  geom_sf(aes(fill = price), color = "black", size = 0.3) + # Add subtle borders
  scale_fill_gradient(
    name = NULL, low = "white", high = "darkgreen", na.value = "grey",
    limits = c(0, 1000000), # Set limits for the color scale
    breaks = c(0, 250000, 500000, 750000, 1000000), # Define specific breaks
    labels = scales::label_dollar(scale = 1) # Format labels with dollar signs and commas
  ) +
  labs(
    title = "County Median Home Prices Q1 2024 - Tennessee",
    subtitle = "Data Source: National Association of Realtors"
  ) +
  theme_linedraw() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16),
    plot.subtitle = element_text(hjust = 0.5, size = 12),
    legend.position = "right",
    axis.text = element_blank(), # Remove axis text
    axis.ticks = element_blank(), # Remove axis ticks
    axis.title = element_blank(), # Remove axis titles
    panel.grid.major = element_blank(), # Remove major gridlines
    panel.grid.minor = element_blank(), # Remove minor gridlines
    plot.margin = margin(t = 30, r = 30, b = 30, l = 30) # Add space around the plot
  ) +
  coord_sf(expand = TRUE, clip = "off") # Ensure the map fits nicely within the plot

# Step 5: Render the 2D map
print(ggTNPrices)
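Step 3 of the code above fills any county without a matching price with the state median, so it can be worth checking how many counties fall into that case before reading the map. The sketch below shows one way to do so with `anti_join`. The two small tables are hypothetical stand-ins for the shapefile attributes and the housing data, and the name mismatch is invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the shapefile attribute table
geo_names <- tibble(
  NAME = c("davidson", "williamson", "dekalb"),
  state = "tennessee"
)

# Hypothetical stand-in for the cleaned housing data (values invented)
prices <- tibble(
  county = c("davidson", "williamson", "de kalb"),
  state = "tennessee",
  MedianHomePrice = c(450000, 800000, 250000)
)

# Counties on the map with no matching price row; these would be median-filled
unmatched <- anti_join(geo_names, prices, by = c("NAME" = "county", "state" = "state"))
print(unmatched)

# Share of counties whose displayed price would be an imputed value
imputed_share <- nrow(unmatched) / nrow(geo_names)
print(imputed_share)
```

If the imputed share is more than a handful of counties, the fill color in those counties reflects the state median rather than local prices, which is worth noting alongside the map.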
The 2D Tennessee choropleth map shows how median home prices ranged across Tennessee counties in the first quarter of 2024. The color gradient, ranging from light (lower prices) to dark green (higher prices), reveals disparities and trends in housing affordability across the state.
From this visualization, there are several key observations to be made. The first is the dominance of the Nashville metropolitan area. The dark green shading in the counties around Nashville, such as Davidson and Williamson counties, clearly indicates that the highest median home prices are found in the greater Nashville area (County Median Home Prices and Monthly Mortgage Payment, 2024). There are also regional disparities: rural and less urbanized areas in Tennessee exhibit much lighter shading, with significantly lower median home prices. The same pattern appears from the other direction, as the darker-shaded counties (with more expensive median home prices) cluster around the state's urban centers, namely Chattanooga, Memphis, and Knoxville, in addition to Nashville.
The map captures subtle gradations in median home prices, making it easier to identify transitional areas where affordability starts to decline as proximity to urban centers increases. This visualization can be useful for policymakers and urban planners to pinpoint areas of high housing cost stress and potentially identify regions in which to establish affordable housing developments. Conversely, investors may look at Davidson and Williamson counties and either be deterred by the high prices or drawn to the potential for further growth in these urban centers.
Tennessee - 3D Choropleth Visualization
* Click and Drag to rotate figure
Visualization Code
# Tennessee 3D Choropleth
# This code takes the previous 2D visualization and uses the rayshader package to transform it into a 3D visualization
# It reuses the USMedianHousingPrices and shapefile paths defined in the 2D visualization code
library(tidyverse)
library(sf)
library(rayshader) # Provides plot_gg() for the 3D rendering

# Step 1: Load and prepare the housing data
home_sales <- read_csv(USMedianHousingPrices) %>%
  mutate(
    county = str_trim(tolower(County)), # Standardize county names
    state = str_trim(tolower(state)) # Standardize state names
  ) %>%
  distinct(county, state, .keep_all = TRUE) # Remove duplicates

home_sales <- home_sales %>%
  mutate(county = str_replace(county, " county$", ""))

# Step 2: Load and prepare the Tennessee shapefile
tennessee_geo <- st_read(shapefile, quiet = TRUE) %>% # Suppress reading output
  filter(STATEFP == "47") %>% # Tennessee FIPS code
  mutate(
    NAME = str_trim(tolower(NAME)), # Standardize county names
    state = "tennessee" # Add state column
  ) %>%
  distinct(NAME, state, .keep_all = TRUE) # Remove duplicates

# Step 3: Merge shapefile with housing data, matching by both `county` and `state`
tennessee_geo <- tennessee_geo %>%
  left_join(home_sales, by = c("NAME" = "county", "state" = "state")) %>%
  mutate(price = ifelse(is.na(`MedianHomePrice`), median(`MedianHomePrice`, na.rm = TRUE), `MedianHomePrice`)) # Handle missing prices

# Step 4: Create a 2D ggplot choropleth map
ggTNPrices <- ggplot(data = tennessee_geo) +
  geom_sf(aes(fill = price), color = NA) +
  scale_fill_gradient(
    name = NULL, low = "white", high = "darkgreen", na.value = "grey",
    limits = c(0, 1000000), # Set limits for the color scale
    breaks = c(0, 250000, 500000, 750000, 1000000), # Define specific breaks
    labels = scales::label_dollar(scale = 1) # Format labels with dollar signs and commas
  ) +
  labs(
    title = "County Median Home Prices Q1 2024 - Tennessee",
    subtitle = "Data Source: National Association of Realtors"
  ) +
  theme_linedraw() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 10, margin = margin(b = 15)),
    plot.subtitle = element_text(hjust = 0.5, size = 8, margin = margin(b = 25)),
    legend.text = element_text(size = 7), # Adjust legend text size
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.margin = margin(t = 30, r = 20, b = 10, l = 10)
  )

# Step 5: Render the 3D map with rayshader
plot_gg(
  ggTNPrices,
  multicore = TRUE,
  width = 5,
  height = 5,
  scale = 200, # Adjust scale for height exaggeration
  windowsize = c(1400, 866),
  zoom = 0.6,
  phi = 25
)
rgl::rglwidget()