## Loading the tidyverse library and the txhousing dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
## 7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.
## 8 Abilene 2000 8 131 10710000 75000 765 6.4 2001.
## 9 Abilene 2000 9 104 7615000 64500 771 6.5 2001.
## 10 Abilene 2000 10 101 7040000 59300 764 6.6 2001.
## # ℹ 8,592 more rows
We use the sample() function to create five dataframes (df_1, df_2, df_3, df_4, df_5), each sampled with replacement from the original txhousing dataframe.
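The sampling code itself is not shown here, so the following is only a sketch of how the five subsamples might be drawn with sample(), assuming each subsample takes 51% of the rows of txhousing (the size mentioned in the conclusion) and an arbitrary seed:
set.seed(123) # assumed seed, purely for reproducibility of this sketch
n <- floor(0.51 * nrow(txhousing)) # assumed sample size: 51% of the rows
df_1 <- txhousing[sample(nrow(txhousing), n, replace = TRUE), ]
df_2 <- txhousing[sample(nrow(txhousing), n, replace = TRUE), ]
df_3 <- txhousing[sample(nrow(txhousing), n, replace = TRUE), ]
df_4 <- txhousing[sample(nrow(txhousing), n, replace = TRUE), ]
df_5 <- txhousing[sample(nrow(txhousing), n, replace = TRUE), ]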
Let us use a few different visualisations to see how our samples differ from each other:
df_list <- list(df_1, df_2, df_3, df_4, df_5)
# Set up the plotting area
par(mfrow = c(1, 5), mar = c(4, 4, 2, 1)) # 1 row, 5 columns
# Create box plots for each dataframe
for (i in 1:5) {
boxplot(df_list[[i]]$median, main = paste("Sample", i), ylab = "Median Sale Price")
}
The median sale prices appear to be distributed similarly across the five samples, with the centre of each distribution sitting at roughly $120,000.
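The boxplots can also be checked numerically; as a minimal sketch reusing the df_list defined above, the following prints the summary statistics of the median column for each sample:
# Summary statistics of the median sale price in each subsample
sapply(df_list, function(df) summary(df$median))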
We will use a histogram to visualise how sales volume varies across the subsamples.
df_list <- list(df_1, df_2, df_3, df_4, df_5)
# Set up the plotting area
par(mfrow = c(1, 5), mar = c(4, 4, 2, 1)) # 1 row, 5 columns
# Create histograms for each dataframe
for (i in 1:5) {
hist(df_list[[i]]$volume, main = paste("Sample", i), breaks = 15, xlab = "Volume")
}
As with the median sale prices, the distribution of the volume column is largely similar across samples. The differences, however, are visible in the right tails. In sample 3, for instance, the histogram shows bin frequencies that fall off almost monotonically as volume increases.
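The tail behaviour can be compared more directly as well; as a rough check, again reusing df_list, the upper quantiles of volume show how far the right tail of each sample extends:
# Median and upper quantiles of sales volume per subsample (missing values removed)
sapply(df_list, function(df) quantile(df$volume, probs = c(0.5, 0.9, 0.99), na.rm = TRUE))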
# Count how many rows fall in each year within each subsample
df_1_year = df_1 |> group_by(year) |> count()
df_2_year = df_2 |> group_by(year) |> count()
df_3_year = df_3 |> group_by(year) |> count()
df_4_year = df_4 |> group_by(year) |> count()
df_5_year = df_5 |> group_by(year) |> count()
library(ggplot2)
# Overlay the yearly counts of the five subsamples; mapping fill to a sample label
# lets scale_fill_manual() colour the bars and build the legend
ggplot(mapping = aes(x, y, fill = sample)) +
geom_bar(data = data.frame(x = df_1_year$year, y = df_1_year$n, sample = "sample 1"), width = 0.9, stat = 'identity') +
geom_bar(data = data.frame(x = df_2_year$year, y = df_2_year$n, sample = "sample 2"), width = 0.7, stat = 'identity') +
geom_bar(data = data.frame(x = df_3_year$year, y = df_3_year$n, sample = "sample 3"), width = 0.5, stat = 'identity') +
geom_bar(data = data.frame(x = df_4_year$year, y = df_4_year$n, sample = "sample 4"), width = 0.3, stat = 'identity') +
geom_bar(data = data.frame(x = df_5_year$year, y = df_5_year$n, sample = "sample 5"), width = 0.2, stat = 'identity') +
scale_fill_manual(values = c("sample 1" = "#CCCCCC", "sample 2" = "#9DACBB", "sample 3" = "#6E8DAB", "sample 4" = "#3F6D9B", "sample 5" = "#104E8B"),
name = "Subgroup Legend") +
labs(x = "Year", y = "Count")
#theme_classic() + scale_y_continuous(expand = c(0, 0))
This view is more interesting, as we can now clearly see that each of the five samples has a somewhat different distribution of rows across the years in the dataset.
It is also interesting because we observe an anomaly in sample 5 for the year 2015. Our original dataset has the lowest number of entries for 2015. While this is reflected in samples 1, 2, 3 and 4, sample 5 appears to be the only subsample with a proportionally high number of entries from 2015.
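One way to quantify this anomaly (a small sketch, assuming df_1 to df_5 are still in memory) is to compare the share of rows from 2015 in the full dataset with the share in each subsample:
# Proportion of rows from the year 2015 in the population and in each subsample
prop_2015 <- function(df) mean(df$year == 2015)
sapply(list(txhousing = txhousing, sample_1 = df_1, sample_2 = df_2,
sample_3 = df_3, sample_4 = df_4, sample_5 = df_5), prop_2015)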
From this process it appears that the quantitative columns are fairly consistent and rather uniformly distributed across the five subsamples, whereas categorical variables like year tend to exhibit more visible variation and the occasional anomaly across samples.
Going forward, any analyses drawn from this sampling process need to take this phenomenon into account, as it may indicate biases in our samples. I would suggest drawing more samples from the population dataset, as well as increasing the sample size beyond 51%, so that the subsamples are more representative and support more accurate conclusions.
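As an illustration of that suggestion (a sketch with assumed parameters, not part of the analysis above), one could draw many larger subsamples and check how stable the share of 2015 rows becomes:
set.seed(42) # assumed seed
# Draw 100 subsamples, each 80% of the rows (both values assumed), and record the
# share of 2015 rows in each; a narrow spread indicates a more representative sample
shares_2015 <- replicate(100, {
idx <- sample(nrow(txhousing), size = floor(0.8 * nrow(txhousing)), replace = TRUE)
mean(txhousing$year[idx] == 2015)
})
summary(shares_2015)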