Data Dive 4

Loading our txhousing Data

## Loading the tidyverse library as well as the TXHousing Dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
##    city     year month sales   volume median listings inventory  date
##    <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
##  1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
##  2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
##  3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
##  4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
##  5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
##  6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
##  7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.
##  8 Abilene  2000     8   131 10710000  75000      765       6.4 2001.
##  9 Abilene  2000     9   104  7615000  64500      771       6.5 2001.
## 10 Abilene  2000    10   101  7040000  59300      764       6.6 2001.
## # ℹ 8,592 more rows

Taking 5 different samples of our data:

We use the sample() function to create 5 different dataframes (df_1, df_2, df_3, df_4, df_5) that have been sampled with replacement from our original txhousing dataframe.
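
The sampling code itself is not shown here, but a minimal sketch of what it could look like is below (the 51% sample size is taken from the conclusion further down; the seed and the loop over assign() are assumptions for illustration only):

# Sketch only: draw five subsamples with replacement from txhousing
set.seed(2023)                             # arbitrary seed, assumed for reproducibility
n_rows <- floor(0.51 * nrow(txhousing))    # 51% sample size, as mentioned in the conclusion

for (i in 1:5) {
  rows <- sample(nrow(txhousing), size = n_rows, replace = TRUE)
  assign(paste0("df_", i), txhousing[rows, ])
}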

Visualising the differences in our Samples:

Let us make use of different visualisations to see how our samples differ from each other:

1) How do ‘Median Sales’ vary across each subsample?

df_list <- list(df_1, df_2, df_3, df_4, df_5)

# Set up the plotting area
par(mfrow = c(1, 5), mar = c(4, 4, 2, 1))  # 1 row, 5 columns

# Create box plots for each dataframe
for (i in 1:5) {
  boxplot(df_list[[i]]$median, main = paste("Sample", i), ylab = "Median Sales")
}

The median sale prices seem to be distributed relatively similarly, with the central value sitting at roughly $120,000 for all samples.
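
To put a rough number on this claim, the centres of the median column can also be compared directly (a minimal sketch, reusing the df_list defined in the chunk above):

# Median of the median (sale price) column in each subsample;
# na.rm = TRUE because txhousing contains missing values
sapply(df_list, function(df) median(df$median, na.rm = TRUE))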

Distribution of Volume across subsamples:

We will use a histogram to visualise how the volume of sales varies across the subsamples.

df_list <- list(df_1, df_2, df_3, df_4, df_5)

# Set up the plotting area
par(mfrow = c(1, 5), mar = c(4, 4, 2, 1))  # 1 row, 5 columns

# Create a histogram for each dataframe
for (i in 1:5) {
  hist(df_list[[i]]$volume, main = paste("Sample", i), breaks = 15, xlab = "Volume")
}

Similar to the case with Median Sales, the distribution of the Volume column is largely similar across samples. The differences, however, are visible in the right tails of the samples: in the case of Sample 3, the histogram shows frequencies that fall off almost monotonically as the bins increase.
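
The right-tail differences can also be checked numerically, for example by comparing upper quantiles of volume across the subsamples (again a sketch reusing df_list):

# 90th and 99th percentiles of volume in each subsample, to compare right tails
sapply(df_list, function(df) quantile(df$volume, probs = c(0.9, 0.99), na.rm = TRUE))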

How does the Year column vary across the subsamples?

df_1_year = df_1 |> group_by(year) |> count()
df_2_year = df_2 |> group_by(year) |> count()
df_3_year = df_3 |> group_by(year) |> count()
df_4_year = df_4 |> group_by(year) |> count()
df_5_year = df_5 |> group_by(year) |> count()


library(ggplot2)
ggplot(mapping = aes(x, y)) +
  geom_col(data = data.frame(x = df_1_year$year, y = df_1_year$n), mapping = aes(fill = "sample 1"), width = 0.9) +
  geom_col(data = data.frame(x = df_2_year$year, y = df_2_year$n), mapping = aes(fill = "sample 2"), width = 0.7) +
  geom_col(data = data.frame(x = df_3_year$year, y = df_3_year$n), mapping = aes(fill = "sample 3"), width = 0.5) +
  geom_col(data = data.frame(x = df_4_year$year, y = df_4_year$n), mapping = aes(fill = "sample 4"), width = 0.3) +
  geom_col(data = data.frame(x = df_5_year$year, y = df_5_year$n), mapping = aes(fill = "sample 5"), width = 0.2) +
  # Mapping fill inside aes() (rather than setting it as a constant) is what lets
  # scale_fill_manual() apply the colours and draw the legend
  scale_fill_manual(values = c("sample 1" = "#CCCCCC", "sample 2" = "#9DACBB", "sample 3" = "#6E8DAB",
                               "sample 4" = "#3F6D9B", "sample 5" = "#104E8B"),
                    name = "Subgroup Legend")

  #theme_classic() + scale_y_continuous(expand = c(0, 0))

This distribution is a lot more interesting, as we can clearly see that each of the 5 samples has a different distribution of counts across the years in the dataset.

Anomaly:

It is also interesting because we observe an anomaly in Sample 5 for the year 2015. Our original dataset has the lowest number of entries for 2015 (the data stop partway through that year). While this is reflected in Samples 1, 2, 3, and 4, Sample 5 appears to be the only subsample with a proportionally high number of entries from 2015.
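
A quick count makes this anomaly concrete (a sketch; df_list is the list of subsamples defined earlier):

# Number of rows from 2015 in each subsample, compared with the original data
sapply(df_list, function(df) sum(df$year == 2015))
sum(txhousing$year == 2015)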

Conclusions about our Sampling Process:

From this process it appears that the quantitative columns show a high degree of consistency, with broadly similar distributions across the 5 subsamples. However, categorical columns like Year tend to exhibit more variation, and the occasional anomaly, across samples.

Going forward, any analyses drawn from this sampling process need to take this phenomenon into account, as it may indicate certain biases in our samples. I would suggest drawing more samples from our population dataset, as well as increasing the size of each sample to something higher than just 51%. That way each subsample should be more representative, allowing us to make more accurate predictions.
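
As a rough sketch of that suggestion, more and larger subsamples could be drawn as below (the choice of 20 samples and a 75% fraction is purely an illustrative assumption, not a value used above):

# Illustrative only: 20 subsamples at 75% of the data, still drawn with replacement
n_samples   <- 20
sample_frac <- 0.75
sample_size <- floor(sample_frac * nrow(txhousing))

bigger_samples <- lapply(1:n_samples, function(i) {
  txhousing[sample(nrow(txhousing), size = sample_size, replace = TRUE), ]
})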