library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
## 7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.
## 8 Abilene 2000 8 131 10710000 75000 765 6.4 2001.
## 9 Abilene 2000 9 104 7615000 64500 771 6.5 2001.
## 10 Abilene 2000 10 101 7040000 59300 764 6.6 2001.
## # ℹ 8,592 more rows
The source of this dataset is the tidyverse package, however even this collected from a different primary data source - Texas A&M University’s Texas Real Estate Research Center. Therefore, it appears like that column names and were labelled by the Texas A&M University. As for the reasons behind the names, it appears like the primary data set actually has slightly different column names as pictured below:
~/Desktop
These original names appear to be a little bit more descriptive than the tidyverse version, but a few issues still remain- like Inventory and Volume. My guess as to why theses have been named this way is that perhaps this nomenclature is appropriate for those who already have domain knowledge in the field of Real Estate and can therefore understand the intended meaning without explanation. While I cannot verify if this is the case, I believe that Universities would probably have departments with strong domain-experts that decide the column names of their research data sets
According to the documentation, the Inventory Column represents the amount of time it would take to sell all current listings at current pace of sales. That being said, there is hardly any mention of what “current pace of sales” means. This is especially true because there is no column called Pace of Sales which makes it difficult to interpret form a qualitative approach. I would imagine that we might have to calculate the rate of sales manually using the Inventory and Listings column. While this might be an effective way of solving the issue, it doesn’t give us much information on whether this pace of sales is continuous, averaged, or the median pace. Ultimately, this ambiguity makes this column difficult to interpret.
As described above, we do not know what the right way to calculate the ‘pace’ across multiple years or multiple cities. While it is possible to find the individual pace of sales for given city on a given date, aggregating this across time or across cities is difficult as we do not whether to take the cumulative, median or average pace across an individual group.
In all cases, we end up with wildly different understanding of ‘Pace of sales’ accross the 3 aggregates.
#library(dplyr)
new_housing = txhousing |> mutate(pace = (listings/inventory))
new_housing = na.omit(new_housing)
yearly_paces = new_housing|> group_by(year) |> summarize(avg_pace = mean(pace),
med_pace = median(pace))
yearly_paces = na.omit(yearly_paces)
library(ggplot2)
df_long <- reshape2::melt(yearly_paces, id.vars = "year")
df_long = na.omit(df_long)
# Plotting using ggplot2
x = ggplot(df_long, aes(x = year, y = value, color = variable)) +
geom_line() + geom_point()+
labs(x = "Year", y = "Value", title = "Change of 'Pace of Sales' across Years") +
scale_color_manual(values = c("blue", "red")) +
theme_minimal()
print(x)
As we see in the above image, we don’t get a clear picture of how ‘Pace of Sales’ because we do not know how to aggreagte pace and the documentation gives us no clues as to how this particular feature needs to be calculated and analyzed.
Similarly for cumulative price, the transformation of the Pace varies greatly from those of average or median pace:
## Groupby year and finding cumulative Price:
yearly_cum_pace = new_housing |> group_by(year) |> summarize(cum_pace = sum(pace))
yearly_cum_pace = reshape2::melt(yearly_cum_pace, id.vars = "year")
yearly_cum_pace = na.omit(yearly_cum_pace)
library(ggplot2)
y = ggplot(yearly_cum_pace, aes(x = year, y = value, color = variable)) +
geom_line() + geom_point()+
labs(x = "Year", y = "Value", title = "Line Plot of Cumulative Pace") +
theme_minimal()
print(y)
In conclusion, we see that the documentation for this Texas Housing Data Set has a few ambiguities that make it difficult to determine the interpretation behind certain columns. In particular, we see that some columns are named in such a way that they can be very easily mistaken as representating some other values. For example, Inventory can be very easily mistaken as representing the number of unsold or vacant properties.However, as discussed earlier, it refers to the amount of time it would take to sell all current listings at current pace of sales. This ambiguity in the labelling can result in risks such as decieving customers or investors about the state of the housing market
Likewise, we see that the case of Volume can also be misinterpreted as the volume of property, which also runs the risk of deicieving property buyers. Overall, most of the issues and risks in this dataset can be mitigated by appropriately labelling these columns as well as providing some accurate descriptions and use cases in the documentation where appropriate.