
Data Dive 5 - Analyzing the Documentation of the Texas Housing Dataset

Loading our Data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
##    city     year month sales   volume median listings inventory  date
##    <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
##  1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
##  2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
##  3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
##  4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
##  5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
##  6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
##  7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.
##  8 Abilene  2000     8   131 10710000  75000      765       6.4 2001.
##  9 Abilene  2000     9   104  7615000  64500      771       6.5 2001.
## 10 Abilene  2000    10   101  7040000  59300      764       6.6 2001.
## # ℹ 8,592 more rows
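
The column descriptions referenced throughout this report come from the dataset’s built-in help page (txhousing is documented in ggplot2), which can be pulled up directly in R:

# Open the help page describing each column of txhousing
?txhousing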

Unclear Aspects of the dataset:

    • Volume Column: The name ‘Volume’ on its own does not convey what the column measures. My initial understanding was that Volume represented the physical volume of the houses sold in a given city. Only upon reading the documentation did I realize that it is the total dollar value of all sales made in that city at a given time. The confusion is compounded by the fact that there is already a ‘sales’ column in the dataset, which I had initially assumed carried the information that actually lives in Volume. (A quick check of the ‘total dollar value’ reading is sketched after this list.)
    • Inventory Column: The Inventory column is supposed to represent the amount of time it would take to sell all current listings at the current pace of sales. However, upon my initial inspection, I assumed that Inventory referred to the number of unsold homes in the city at a given time. This was further complicated by the fact that the column values are stored as doubles with decimal values, which seemed unusual, since a count of unsold houses should rarely be fractional.
    • Median Column: The Median column does not say what it is the median of. At first glance, it could be the median volume, the median number of sales, the median house size, etc. Only upon reading the documentation does it become clear that it refers to the median sale price.
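
To illustrate the Volume point, a rough sanity check of the documented meaning (not my original assumption) is to divide volume by sales: if Volume is the total dollar value of sales, this ratio is an average sale price and should sit in the same general range as the median column.

# Rough check of the documented meaning of Volume: volume / sales should be
# an average sale price, in the same ballpark as the median sale price
txhousing |>
  drop_na(sales, volume, median) |>
  mutate(avg_price = volume / sales) |>
  select(city, year, month, sales, volume, median, avg_price) |>
  head()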

Reasons for this lack of Clarity:

The dataset comes from the tidyverse package, but even there it was collected from a different primary source: Texas A&M University’s Texas Real Estate Research Center. It therefore appears that the column names were assigned by Texas A&M University. As for the reasoning behind the names, the primary dataset actually uses slightly different column names, as pictured below:

Figure: Texas A&M Original Data Set (screenshot of the primary source’s column names)

These original names appear to be somewhat more descriptive than the tidyverse versions, but a few issues still remain, such as Inventory and Volume. My guess as to why these are named this way is that the nomenclature is natural for people who already have domain knowledge in real estate and can therefore understand the intended meaning without explanation. While I cannot verify this, I expect that universities have domain experts in the relevant departments who decide the column names of their research datasets.

Persistent Ambiguities even with the Documentation:

According to the documentation, the Inventory column represents the amount of time it would take to sell all current listings at the current pace of sales. However, there is hardly any mention of what “current pace of sales” means. This is especially problematic because there is no Pace of Sales column, which makes the definition difficult to interpret qualitatively. I imagine the rate of sales would have to be calculated manually from the Inventory and Listings columns. While that might be a workable fix, it does not tell us whether this pace of sales is the current month’s figure, an average, or a median. Ultimately, this ambiguity makes the column difficult to interpret.
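
As a sketch of this ambiguity (and of the manual calculation suggested above), the implied pace can be backed out as listings / inventory and set next to the same month’s sales. The first rows of txhousing already show the two do not match (for Abilene in January 2000, 701 / 6.3 ≈ 111 versus 72 sales), which suggests the “current pace” is some smoothed or trailing measure rather than that month’s sales count.

# Back out the pace implied by the documentation (listings / inventory)
# and compare it with the same month's sales count
txhousing |>
  drop_na(listings, inventory, sales) |>
  mutate(implied_pace = listings / inventory) |>
  select(city, year, month, sales, implied_pace) |>
  head()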

Illustrating these Irregularities and Ambiguities using charts:

As described above, we do not know the right way to calculate this ‘pace’ across multiple years or multiple cities. While it is possible to compute an individual pace of sales for a given city on a given date, aggregating it across time or across cities is difficult because we do not know whether to take the cumulative, median, or average pace within each group. (A by-city version of the same comparison is sketched after the yearly plots below.)

In all cases, we end up with wildly different understandings of ‘Pace of Sales’ across the three aggregates.

# Compute a monthly "pace of sales" implied by the documentation: listings / inventory
new_housing <- txhousing |>
  mutate(pace = listings / inventory) |>
  drop_na()

# Aggregate the pace per year in two different ways: mean and median
yearly_paces <- new_housing |>
  group_by(year) |>
  summarize(avg_pace = mean(pace),
            med_pace = median(pace))

# Reshape to long format so both aggregates share one plot
df_long <- yearly_paces |>
  pivot_longer(-year, names_to = "variable", values_to = "value")

# Plot average vs. median pace by year
x <- ggplot(df_long, aes(x = year, y = value, color = variable)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Value", title = "Change of 'Pace of Sales' across Years") +
  scale_color_manual(values = c("blue", "red")) +
  theme_minimal()
print(x)

As we can see in the plot above, we do not get a clear picture of how ‘Pace of Sales’ changes over time, because we do not know how to aggregate pace, and the documentation gives us no clues as to how this particular feature should be calculated and analyzed.

Similarly, the cumulative pace transforms the picture in a way that differs greatly from the average or median pace:

# Group by year and compute the cumulative (summed) pace
yearly_cum_pace <- new_housing |>
  group_by(year) |>
  summarize(cum_pace = sum(pace)) |>
  pivot_longer(-year, names_to = "variable", values_to = "value")

# Plot the cumulative pace by year
y <- ggplot(yearly_cum_pace, aes(x = year, y = value, color = variable)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Value", title = "Line Plot of Cumulative Pace") +
  theme_minimal()

print(y)
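
The same ambiguity appears when aggregating across cities rather than years. A minimal sketch, reusing the pace column computed above, puts all three candidate aggregates side by side; nothing in the documentation says which one is the “right” summary of a city’s pace of sales.

# Aggregate pace by city with all three candidate summaries
city_paces <- new_housing |>
  group_by(city) |>
  summarize(avg_pace = mean(pace),
            med_pace = median(pace),
            cum_pace = sum(pace)) |>
  arrange(desc(avg_pace))

head(city_paces)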

Conclusion and Risks:

In conclusion, we see that the documentation for this Texas Housing dataset has a few ambiguities that make it difficult to pin down the interpretation of certain columns. In particular, some columns are named in such a way that they can easily be mistaken as representing other values. For example, Inventory can very easily be mistaken for the number of unsold or vacant properties. However, as discussed earlier, it refers to the amount of time it would take to sell all current listings at the current pace of sales. This ambiguity in the labelling can result in risks such as misleading customers or investors about the state of the housing market.

Likewise, Volume can be misinterpreted as the physical volume of a property, which also runs the risk of misleading property buyers. Overall, most of the issues and risks in this dataset could be mitigated by labelling these columns more descriptively, as well as by providing accurate descriptions and example use cases in the documentation where appropriate.