| Package | Version |
|---|---|
| moments | 0.14.1 |
| knitr | 1.51 |
| kableExtra | 1.4.0 |
| ggplot2 | 4.0.3 |
| tidyr | 1.3.2 |
| dplyr | 1.2.1 |
| scales | 1.4.0 |
| stringr | 1.6.0 |
# Load the dataset stored in the same folder as the .Rmd file
df <- read.csv("realestate_texas.csv")
# Randomly select row indices, then sort them to keep original order
idx <- sort(sample(nrow(df), min(10, nrow(df))))
kable(df[idx, ])
| | city | year | month | sales | volume | median_price | listings | months_inventory |
|---|---|---|---|---|---|---|---|---|
| 10 | Beaumont | 2010 | 10 | 150 | 23.904 | 138500 | 1779 | 11.5 |
| 34 | Beaumont | 2012 | 10 | 193 | 27.350 | 121800 | 1671 | 9.9 |
| 42 | Beaumont | 2013 | 6 | 232 | 36.275 | 134100 | 1675 | 9.0 |
| 74 | Bryan-College Station | 2011 | 2 | 101 | 16.125 | 148500 | 1562 | 9.3 |
| 95 | Bryan-College Station | 2012 | 11 | 159 | 28.882 | 149100 | 1442 | 7.3 |
| 119 | Bryan-College Station | 2014 | 11 | 169 | 34.903 | 172800 | 973 | 3.8 |
| 165 | Tyler | 2013 | 9 | 287 | 51.099 | 147600 | 2917 | 10.2 |
| 199 | Wichita Falls | 2011 | 7 | 127 | 13.594 | 102300 | 1029 | 9.2 |
| 222 | Wichita Falls | 2013 | 6 | 121 | 15.547 | 104700 | 923 | 7.9 |
| 233 | Wichita Falls | 2014 | 5 | 140 | 17.833 | 115700 | 899 | 7.6 |
| Variable | Description | Type |
|---|---|---|
| city | City or market area observed | Qualitative nominal |
| year | Year of the observation | Quantitative discrete |
| month | Month of the observation | Quantitative discrete |
| sales | Number of sales in that city–month | Quantitative discrete |
| volume | Total sales volume in millions of US dollars | Quantitative continuous |
| median_price | Median sale price (US dollars) | Quantitative continuous |
| listings | Active for-sale listings (inventory) | Quantitative discrete |
| months_inventory | Months needed to clear inventory at the current sales pace | Quantitative continuous |
Time dimension: year and month index repeated observations per city. For time-based analysis, a period variable (e.g., the first day of the month) can be built from year and month. This enables chronological ordering, moving averages for analyzing price trends, trend and seasonality analysis (on the aggregated series or on city-level series), and cross-city comparisons within the same period.
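The idea can be sketched on a small toy series. The data frame `toy`, its values, and the column `ma3` are illustrative assumptions and are not taken from the dataset; the sketch only shows how a first-of-month date supports chronological sorting and a centered 3-month moving average.

```r
# Toy data: illustrative values only, not taken from realestate_texas.csv
toy <- data.frame(
  year  = c(2010, 2010, 2010, 2010, 2010, 2010),
  month = c(3, 1, 2, 6, 4, 5),
  median_price = c(120, 100, 110, 150, 130, 140)
)

# Build a Date for the first day of each month
toy$period <- as.Date(sprintf("%04d-%02d-01",
                              as.integer(toy$year), as.integer(toy$month)))

# Sort rows chronologically using the Date column
toy <- toy[order(toy$period), ]

# Centered 3-month moving average; the first and last values are NA
# because the window is incomplete at the edges
toy$ma3 <- as.numeric(stats::filter(toy$median_price, rep(1 / 3, 3), sides = 2))
```

With real data, the same steps would typically be applied within each city so that the moving window never mixes observations from different markets.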
Comprehensive descriptive analysis involves using measures of position (central tendency), variability (dispersion) and shape to summarize distributions. For quantitative variables, the relevant measures include the mean, median, 1st and 3rd quartiles, standard deviation, interquartile range, variance and skewness/kurtosis indices.
In the code chunk below, the function compute_indices() is defined to produce a comprehensive set of descriptive statistics for a quantitative variable. The function returns measures of position (mean, median, first and third quartiles) together with minimum and maximum values, dispersion (variance, standard deviation, interquartile range and coefficient of variation), and distributional shape (skewness and excess kurtosis).
The function is then applied to the selected quantitative variables (sales, volume, median_price, listings, and months_inventory), and the resulting statistics are assembled into a single summary table and rounded to improve readability in the final report.
# Function to compute descriptive statistics for a quantitative variable
compute_indices <- function(x) {
# Convert to numeric and remove missing values
x <- na.omit(as.numeric(x))
# if no valid observations (size of data equals to 0)
if (length(x) == 0) {
return(c(mean = NA, median = NA, q1 = NA, q3 = NA,min = NA, max = NA,
variance = NA, sd = NA, iqr = NA, cv = NA,skewness = NA, kurtosis_excess = NA))
}
# Aliases for mean and std.dev functions
m <- mean(x)
s <- sd(x)
# Return a named vector of descriptive indices
c(
# --- position indices ---
mean = m,
median = median(x),
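# names = FALSE drops the "25%"/"75%" suffix so the elements stay named q1 and q3
q1 = quantile(x, 0.25, names = FALSE),
q3 = quantile(x, 0.75, names = FALSE),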
# --- extreme values ---
min = min(x),
max = max(x),
# --- variability indices ---
variance = var(x),
sd = s,
iqr = IQR(x),
cv = ifelse(m == 0, NA, s/m),
# --- shape indices ---
# When std.dev is 0, all observations are identical (constant variable),
# so skewness and kurtosis are not defined
skewness = ifelse(is.na(s) || s == 0, NA, skewness(x)),
# the excess kurtosis is defined as kurtosis minus 3
kurtosis_excess = ifelse(is.na(s) || s == 0, NA, kurtosis(x) - 3)
)
}
# Quantitative variables of interest
vars <- c("sales", "volume", "median_price", "listings", "months_inventory")
# Apply the function to each variable and combine results in a table
results <- t(sapply(vars, function(v) compute_indices(df[[v]])))
results <- as.data.frame(results)
# round values for cleaner reporting
results_rounded <- round(results, 2)
Then, the summary statistics are transformed into a report-ready table.
# Build final table from computed results
# names of variables are moved from row names into first column (Variable)
table_results <- cbind(Variable = rownames(results_rounded), results_rounded)
rownames(table_results) <- NULL
# Rename columns to report-friendly label
# with first column header intentionally left blank
colnames(table_results) <- c("","Mean","Median","Q1","Q3","Min","Max","Variance","SD","IQR","CV","Skewness","Excess Kurtosis")
# Create formatted table
kable(
table_results,
caption = "Descriptive statistics for quantitative variables",
align = "lrrrrrrrrrrrr",
booktabs = TRUE
) %>%
# Add headers above for grouping indices
add_header_above(c(" " = 1,"Position" = 4,"Extremes" = 2,"Variability" = 4,"Shape" = 2)) %>%
# Add style for HTML
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")
)
| | Mean | Median | Q1 | Q3 | Min | Max | Variance | SD | IQR | CV | Skewness | Excess Kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sales | 192.29 | 175.50 | 127.00 | 247.00 | 79.00 | 423.00 | 6.34430e+03 | 79.65 | 120.00 | 0.41 | 0.72 | -0.31 |
| volume | 31.01 | 27.06 | 17.66 | 40.89 | 8.17 | 83.55 | 2.77270e+02 | 16.65 | 23.23 | 0.54 | 0.88 | 0.18 |
| median_price | 132665.42 | 134500.00 | 117300.00 | 150050.00 | 73800.00 | 180000.00 | 5.13573e+08 | 22662.15 | 32750.00 | 0.17 | -0.36 | -0.62 |
| listings | 1738.02 | 1618.50 | 1026.50 | 2056.00 | 743.00 | 3296.00 | 5.66569e+05 | 752.71 | 1029.50 | 0.43 | 0.65 | -0.79 |
| months_inventory | 9.19 | 8.95 | 7.80 | 10.95 | 3.40 | 14.90 | 5.31000e+00 | 2.30 | 3.15 | 0.25 | 0.04 | -0.17 |
Descriptive statistics were computed above for the quantitative variables (sales, volume, median_price, listings, and months_inventory). For the remaining categorical or discrete time-related variables, frequency distributions are more appropriate; year is used here as an example:
# function used to compute frequency table for specific variable
freq_dist_1var <- function(x) {
ni <- table(x)
fi <- ni/length(x)
Ni <- cumsum(ni)
Fi <- cumsum(fi)
return (cbind(ni,fi,Ni,Fi))
}
# apply function to variable year
freq_year <- freq_dist_1var(df$year)
# render table
kable(freq_year, caption = "Frequency distribution - Year", align = "lrrrr", booktabs = TRUE)
| | ni | fi | Ni | Fi |
|---|---|---|---|---|
| 2010 | 48 | 0.2 | 48 | 0.2 |
| 2011 | 48 | 0.2 | 96 | 0.4 |
| 2012 | 48 | 0.2 | 144 | 0.6 |
| 2013 | 48 | 0.2 | 192 | 0.8 |
| 2014 | 48 | 0.2 | 240 | 1.0 |
Based on the computed summary table, the conclusions are as follows:
## 1 - Highest variability (CV): volume - 0.54
## 2 - Highest skewness (absolute): volume - 0.88
The first conclusion is based on the coefficient of variation (CV = sd / mean), which is the appropriate measure for comparing dispersion across variables expressed on different scales. In the reported results, volume has the largest CV (0.54), therefore volume exhibits the greatest relative variability.
The second conclusion is based on skewness. Considering asymmetry in absolute value, volume shows the largest skewness (0.88). Since the skewness is positive, the distribution of volume is right-skewed, indicating a longer upper tail and the presence of relatively high observations.
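As a minimal sketch, the two conclusions can also be derived programmatically instead of being read off the table. The data frame `shape` below is a hypothetical helper that simply copies the CV and skewness columns from the summary table above.

```r
# Values copied from the summary table (CV and skewness columns)
shape <- data.frame(
  variable = c("sales", "volume", "median_price", "listings", "months_inventory"),
  cv       = c(0.41, 0.54, 0.17, 0.43, 0.25),
  skewness = c(0.72, 0.88, -0.36, 0.65, 0.04)
)

# Variable with the largest relative variability
most_variable <- shape$variable[which.max(shape$cv)]

# Variable with the largest asymmetry, regardless of sign
most_skewed <- shape$variable[which.max(abs(shape$skewness))]

cat("Highest CV:", most_variable, "\n")
cat("Highest |skewness|:", most_skewed, "\n")
```

Taking the absolute value matters here because median_price is negatively skewed; ranking on the signed value would never select a left-skewed variable.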
The quantitative variable median_price was selected and partitioned into class intervals in order to construct a frequency distribution and visualize the resulting frequencies using a bar chart.
# breakpoints of the class intervals (pretty() chooses round values)
breaks_price <- pretty(df$median_price)
# Human-readable numeric labels (no scientific notation)
labels_price <- paste0(
"[",
formatC(breaks_price[-length(breaks_price)], format = "f", digits = 0, big.mark = ""),
" - ",
formatC(breaks_price[-1], format = "f", digits = 0, big.mark = ""),
")"
)
df$median_price_class <- cut(
df$median_price,
breaks = breaks_price,
labels = labels_price,
include.lowest = TRUE,
right = FALSE
)
# apply function to variable median_price_class
freq_median_price <- freq_dist_1var(df$median_price_class)
# render table
kable(freq_median_price, caption = "Frequency distribution - Median price", align = "lrrrr", booktabs = TRUE)
| | ni | fi | Ni | Fi |
|---|---|---|---|---|
| [60000 - 80000) | 1 | 0.0041667 | 1 | 0.0041667 |
| [80000 - 100000) | 23 | 0.0958333 | 24 | 0.1000000 |
| [100000 - 120000) | 41 | 0.1708333 | 65 | 0.2708333 |
| [120000 - 140000) | 74 | 0.3083333 | 139 | 0.5791667 |
| [140000 - 160000) | 80 | 0.3333333 | 219 | 0.9125000 |
| [160000 - 180000) | 21 | 0.0875000 | 240 | 1.0000000 |
# Convert matrix to data frame and keep class labels
freq_median_price_df <- as.data.frame(freq_median_price)
freq_median_price_df$Class <- rownames(freq_median_price_df)
# Preserve original order of classes
freq_median_price_df$Class <- factor(freq_median_price_df$Class, levels = freq_median_price_df$Class)
# Relative frequency bar chart
ggplot(freq_median_price_df, aes(x = Class, y = fi)) +
geom_col(fill = "#2c7fb8", width = 0.5, color = "grey25") +
labs(
title = "Relative frequency distribution for median price classes",
x = "Median price",
y = "Relative frequency"
) +
theme_bw(base_size = 12) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid.minor = element_blank()
)
The Gini heterogeneity index was computed from the distribution of observations across classes, using the relative frequencies \(f_i\) (the fi column).
The unnormalized index is
\[ G = 1 - \sum_{i=1}^{k} f_i^2 \]
while the normalized version is
\[ G^{*} = \frac{k}{k-1}\left(1 - \sum_{i=1}^{k} f_i^2\right) \]
where \(k\) is the number of non-empty classes. The factor \(k/(k-1)\) rescales \(G\) by its maximum value \((k-1)/k\), which is attained when all classes are equally frequent, so that \(G^{*}\) lies in \([0,1]\).
p <- freq_median_price_df$fi
# remove possible zero/NA classes
p <- p[!is.na(p) & p > 0]
# number of non empty classes
k <- length(p)
# gini heterogeneity (unnormalized)
G <- 1 - sum(p^2)
# normalized Gini heterogeneity in [0,1]
G_norm <- if (k > 1) (k / (k - 1)) * G else NA_real_
gini_df <- data.frame(
k = k,
G = G,
G_norm = G_norm
)
kable(
gini_df,
caption = "Gini heterogeneity indices based on class relative frequencies",
digits = 4,
col.names = c("Number of non empty classes (k)",
"Gini heterogeneity (G)",
"Normalized Gini heterogeneity (G*)"),
align = "lll",
booktabs = TRUE
)
| Number of non empty classes (k) | Gini heterogeneity (G) | Normalized Gini heterogeneity (G*) |
|---|---|---|
| 6 | 0.7478 | 0.8973 |
Based on the reported frequency distribution and the Gini heterogeneity indices, the empirical distribution of median_price across the six class intervals has a relatively high Gini heterogeneity index (G=0.7478). With k=6 non-empty classes, the corresponding normalized index (G*=0.8973) is close to the upper bound of 1, indicating substantial dispersion of observations across price bands.
At the same time, the class frequencies show clear substantive concentration in the mid-to-upper market segments: the intervals [120,000–140,000) and [140,000–160,000) jointly account for approximately 64% of the sample (about 0.31 and 0.33 relative frequency, respectively). By contrast, the lowest interval [60,000–80,000) is extremely sparse (ni=1, fi≈0.0042), indicating negligible representation at the bottom of the defined price bands.
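As a check on the reported indices, the computation can be reproduced by hand from the absolute frequencies in the table above (1, 23, 41, 74, 80, 21, with N = 240), writing each relative frequency as \(f_i = n_i / N\):

\[ \sum_{i=1}^{6} f_i^2 = \frac{1^2 + 23^2 + 41^2 + 74^2 + 80^2 + 21^2}{240^2} = \frac{14528}{57600} \approx 0.2522 \]

\[ G = 1 - \frac{14528}{57600} = \frac{43072}{57600} \approx 0.7478, \qquad G^{*} = \frac{6}{5} \cdot \frac{43072}{57600} \approx 0.8973 \]

For comparison, six equally frequent classes would give the maximum \(G = 1 - 1/6 \approx 0.8333\), so the observed value reaches about 90% of the attainable heterogeneity.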
Probabilities are estimated empirically from observed frequencies. For an event \(A\), the estimated probability is
\[ \widehat{P}(A)=\frac{n_A}{N}, \]
where \(n_A\) denotes the absolute frequency of event \(A\) and \(N\) is the total number of observations.
For two events \(A\) and \(B\), the joint probability is
\[ \widehat{P}(A \cap B)=\frac{n_{A,B}}{N}, \]
while the conditional probability (for \(n_B>0\)) is
\[ \widehat{P}(A \mid B)=\frac{n_{A,B}}{n_B}. \]
Percent values are obtained as \(100 \times \widehat{P}(\cdot)\).
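The three estimators can be illustrated with logical vectors. The sketch below uses a toy data frame (`toy`, with made-up rows that are not taken from the dataset) and hypothetical names `A`, `B`, `p_A`, `p_AB`, and `p_A_given_B`.

```r
# Toy data: illustrative rows only, not the real dataset
toy <- data.frame(
  city = c("Beaumont", "Beaumont", "Tyler", "Tyler",
           "Tyler", "Beaumont", "Tyler", "Beaumont"),
  year = c(2010, 2011, 2010, 2011, 2011, 2011, 2010, 2010)
)

N <- nrow(toy)
A <- toy$city == "Beaumont"   # event A
B <- toy$year == 2011         # event B

# Marginal probability: n_A / N
p_A <- sum(A) / N

# Joint probability: n_{A,B} / N
p_AB <- sum(A & B) / N

# Conditional probability: n_{A,B} / n_B, defined only when n_B > 0
n_B <- sum(B)
p_A_given_B <- if (n_B > 0) sum(A & B) / n_B else NA_real_

# Percent values
round(100 * c(p_A = p_A, p_AB = p_AB, p_A_given_B = p_A_given_B), 2)
```

In this toy sample the conditional probability equals the marginal one, which is what a balanced design produces: knowing the year carries no information about the city.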
The probability is estimated using an empirical frequency approach by computing a frequency distribution for city, obtaining absolute frequencies (ni) and relative frequencies (fi) for each category.
# apply function to variable city
freq_city <- freq_dist_1var(df$city)
# render table
kable(freq_city, caption = "Frequency distribution - City", align = "lrrrr", booktabs = TRUE)
| | ni | fi | Ni | Fi |
|---|---|---|---|---|
| Beaumont | 60 | 0.25 | 60 | 0.25 |
| Bryan-College Station | 60 | 0.25 | 120 | 0.50 |
| Tyler | 60 | 0.25 | 180 | 0.75 |
| Wichita Falls | 60 | 0.25 | 240 | 1.00 |
The estimated probability of selecting a row corresponding to Beaumont is then derived from the relative frequency of that category:
\[ \hat{P}(\text{City}=\text{Beaumont}) = f_{\text{Beaumont}} = \frac{n_{\text{Beaumont}}}{N} \]
# Estimated probability P(city = "Beaumont")
p_beaumont <- freq_city["Beaumont", "fi"]
# Build report table
prob_table <- data.frame(Event = "City = Beaumont", Probability = p_beaumont, Percentage = 100 * p_beaumont)
# Print with kable
knitr::kable(prob_table, caption = "Probability of selecting Beaumont",
digits = c(0, 2, 2), align = "lrr", booktabs = TRUE)
| Event | Probability | Percentage |
|---|---|---|
| City = Beaumont | 0.25 | 25 |
Based on the frequency distribution, the estimate is:
\[ \hat{P}(\text{City}=\text{Beaumont}) = 0.25 \]
which corresponds to 25% of the sample.
Following the same approach described in section 5.1, the probability of selecting a row corresponding to July is obtained from the relative frequency of month = 7 in the distribution of month:
\[ \hat{P}(\text{Month}=7)=f_{7} \]
# apply function to variable month
freq_month <- freq_dist_1var(df$month)
# render table
kable(freq_month, row.names = TRUE, caption = "Frequency distribution - Month",
digits = c(0, 4, 0, 4), align = "lrrrr", booktabs = TRUE)
| | ni | fi | Ni | Fi |
|---|---|---|---|---|
| 1 | 20 | 0.0833 | 20 | 0.0833 |
| 2 | 20 | 0.0833 | 40 | 0.1667 |
| 3 | 20 | 0.0833 | 60 | 0.2500 |
| 4 | 20 | 0.0833 | 80 | 0.3333 |
| 5 | 20 | 0.0833 | 100 | 0.4167 |
| 6 | 20 | 0.0833 | 120 | 0.5000 |
| 7 | 20 | 0.0833 | 140 | 0.5833 |
| 8 | 20 | 0.0833 | 160 | 0.6667 |
| 9 | 20 | 0.0833 | 180 | 0.7500 |
| 10 | 20 | 0.0833 | 200 | 0.8333 |
| 11 | 20 | 0.0833 | 220 | 0.9167 |
| 12 | 20 | 0.0833 | 240 | 1.0000 |
# Estimated probability P(month = 7)
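# index by row name rather than by position, so the lookup stays correct
# even if some month were missing from the data
p_month_july <- freq_month["7", "fi"]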
# Build report table
prob_table <- data.frame(Event = "Month = 7", Probability = p_month_july, Percentage = 100 * p_month_july)
# Print with kable
knitr::kable(prob_table,caption = "Probability of selecting July",
digits = c(0, 4, 4), align = "lrr", booktabs = TRUE)
| Event | Probability | Percentage |
|---|---|---|
| Month = 7 | 0.0833 | 8.3333 |
The estimated probability is:
\[ \hat{P}(\text{Month}=7)=0.0833 \]
which corresponds to approximately 8.33% of the
sample.
Based on the above frequency table, this result is consistent with a
monthly partition over 12 months, where each month is expected to
contribute about one-twelfth of observations under a balanced temporal
structure.
Consistent with the empirical procedure adopted in sections 5.1 and 5.2, the probability for this event is obtained from the relative frequency of the combined category period = 2012-12-01 (December 2012) in the frequency distribution of period.
# define column period in the data frame as the first day of each month
# the month is zero-padded to two digits to form an ISO date string
df$period <- as.Date(sprintf("%04d-%02d-01", as.integer(df$year), as.integer(df$month)))
# apply function to variable period
freq_period <- freq_dist_1var(df$period)
# filter only 2012 rows (2012-01-01 ... 2012-12-01) to avoid printing full table
# Ni and Fi remain cumulative with respect to the original full period ordering
freq_period_2012 <- freq_period[grepl("^2012-", rownames(freq_period)), , drop = FALSE]
# render table
kable(freq_period_2012, caption = "Frequency distribution - Period \n (filtered for 2012)",
digits = c(0, 4, 0, 4), align = "lrrrr", booktabs = TRUE)
| | ni | fi | Ni | Fi |
|---|---|---|---|---|
| 2012-01-01 | 4 | 0.0167 | 100 | 0.4167 |
| 2012-02-01 | 4 | 0.0167 | 104 | 0.4333 |
| 2012-03-01 | 4 | 0.0167 | 108 | 0.4500 |
| 2012-04-01 | 4 | 0.0167 | 112 | 0.4667 |
| 2012-05-01 | 4 | 0.0167 | 116 | 0.4833 |
| 2012-06-01 | 4 | 0.0167 | 120 | 0.5000 |
| 2012-07-01 | 4 | 0.0167 | 124 | 0.5167 |
| 2012-08-01 | 4 | 0.0167 | 128 | 0.5333 |
| 2012-09-01 | 4 | 0.0167 | 132 | 0.5500 |
| 2012-10-01 | 4 | 0.0167 | 136 | 0.5667 |
| 2012-11-01 | 4 | 0.0167 | 140 | 0.5833 |
| 2012-12-01 | 4 | 0.0167 | 144 | 0.6000 |
# Estimated probability P(month = 12 and year = 2012)
p_event_dec2012 <- freq_period["2012-12-01", "fi"]
# Build report table
prob_table <- data.frame(Event = "December 2012",
Probability = p_event_dec2012, Percentage = 100 * p_event_dec2012)
# Print with kable
knitr::kable(prob_table, caption = "Probability of selecting December 2012",
digits = c(0, 4, 4), align = "lrr", booktabs = TRUE)
| Event | Probability | Percentage |
|---|---|---|
| December 2012 | 0.0167 | 1.6667 |
Therefore, the estimated probability is:
\[ \hat{P}(\text{December 2012})=0.0167 \]
which corresponds to approximately 1.67% of the full dataset.
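As a cross-check under the balanced structure seen in the frequency tables for year and month, the joint estimate coincides with the product of the two marginal estimates:

\[ \hat{P}(\text{December 2012}) = \frac{4}{240} = \frac{1}{60} \approx 0.0167 \]

\[ \hat{P}(\text{Month}=12)\times\hat{P}(\text{Year}=2012) = \frac{20}{240}\times\frac{48}{240} = \frac{1}{12}\times\frac{1}{5} = \frac{1}{60} \]

This factorization is a consequence of the design (every city is observed once in every month of every year), not a property of the housing market itself.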
The estimated average sale price is computed as: \[ \text{avg_sale_price}=\frac{\text{volume}\times 10^{6}}{\text{sales}} \] since, as described in the variable table in section 1, volume is measured in millions of US dollars, while sales is the number of transactions per city-month.
# volume is in millions of USD, so multiply by 10^6
# then divide by number of sales
df$avg_sale_price <- ifelse(df$sales > 0, (df$volume * 1e+06) / df$sales, NA)
# print examples
# randomly select row indices, then sort them to keep original order
idx <- sort(sample(nrow(df), min(5, nrow(df))))
kable(df[idx, c("city", "year", "month", "volume", "sales", "avg_sale_price")])
| | city | year | month | volume | sales | avg_sale_price |
|---|---|---|---|---|---|---|
| 40 | Beaumont | 2013 | 4 | 29.433 | 198 | 148651.5 |
| 93 | Bryan-College Station | 2012 | 9 | 28.434 | 149 | 190832.2 |
| 206 | Wichita Falls | 2012 | 2 | 10.697 | 90 | 118855.6 |
| 209 | Wichita Falls | 2012 | 5 | 12.451 | 102 | 122068.6 |
| 229 | Wichita Falls | 2014 | 1 | 9.626 | 89 | 108157.3 |
The avg_sale_price provides a mean value per transaction and complements the existing median-based price indicator (median_price).
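As a worked check of the unit conversion, take row 40 of the sample above (Beaumont, April 2013), with volume = 29.433 million USD and sales = 198:

\[ \text{avg_sale_price} = \frac{29.433 \times 10^{6}}{198} \approx 148651.5 \]

which matches the value printed in the table.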
Listing effectiveness is defined as the extent to which available
inventory is converted into sales.
A simple monthly absorption proxy is: \[
\text{listing_effectiveness}=\frac{\text{sales}}{\text{listings}}
\] An alternative turnover-based indicator is: \[
\text{inventory_turnover_ratio}=\frac{1}{\text{months_inventory}}
\]
While months_inventory describes how long the current stock would last at the current sales pace, its inverse flips the perspective to show how quickly that inventory is being converted into sales: higher values indicate faster market turnover.
#listing effectiveness
df$listing_effectiveness <- ifelse(df$listings > 0, df$sales / df$listings, NA)
# inverse of months_inventory (higher means faster inventory turnover)
df$inventory_turnover_ratio <- ifelse(df$months_inventory > 0, 1 / df$months_inventory, NA)
# print examples
# randomly select row indices, then sort them to keep original order
idx <- sort(sample(nrow(df), min(5, nrow(df))))
kable(df[idx, c("city", "year", "month", "sales", "listings", "months_inventory",
"listing_effectiveness", "inventory_turnover_ratio")])
| | city | year | month | sales | listings | months_inventory | listing_effectiveness | inventory_turnover_ratio |
|---|---|---|---|---|---|---|---|---|
| 25 | Beaumont | 2012 | 1 | 110 | 1647 | 11.4 | 0.0667881 | 0.0877193 |
| 92 | Bryan-College Station | 2012 | 8 | 296 | 1518 | 8.1 | 0.1949934 | 0.1234568 |
| 133 | Tyler | 2011 | 1 | 143 | 2852 | 12.6 | 0.0501403 | 0.0793651 |
| 163 | Tyler | 2013 | 7 | 369 | 2998 | 10.7 | 0.1230821 | 0.0934579 |
| 233 | Wichita Falls | 2014 | 5 | 140 | 899 | 7.6 | 0.1557286 | 0.1315789 |
# aggregate data by period across all cities
monthly_trend <- df %>%
group_by(period) %>%
summarise(
# calculate means for both indicators
listing_effectiveness = mean(listing_effectiveness, na.rm = TRUE),
inventory_turnover_ratio = mean(inventory_turnover_ratio, na.rm = TRUE),
.groups = "drop"
) %>%
pivot_longer(cols = c(listing_effectiveness, inventory_turnover_ratio),
names_to = "Metric", values_to = "Value")
# plot monthly trends of effectiveness metrics
ggplot(monthly_trend, aes(x = period, y = Value, color = Metric)) +
geom_line(linewidth = 0.9) +
theme_bw(base_size = 11) +
labs(title = "Monthly Trend of effectiveness metrics", x = NULL, y = "Mean value", color = NULL)
The monthly trend confirms a general increase in market absorption over the sample period for both indicators, consistent with their conceptual link, since both proxy the speed of inventory absorption:
- inventory_turnover_ratio shows a relatively smooth upward trajectory, rising from about 0.10–0.12 in the early years to around 0.17–0.18 by the end of the period.
- listing_effectiveness is more volatile, with pronounced short-term oscillations and several peaks (up to ~0.20), suggesting higher sensitivity to short-run fluctuations in sales and listings; it nevertheless exhibits a clear positive trend.
Conditional summary statistics are descriptive measures computed on subsets of the data defined by specific criteria (strata). Typical measures include the mean, median, total, and standard deviation; in particular, the mean under a given condition can be interpreted as a conditional expectation. These summaries are used to characterize how a variable behaves across different sub-populations or time segments.
This section reports conditional statistical analyses stratified by city, year, and month. The analysis is implemented in R (using either dplyr or base R). For each stratum, key summary statistics, primarily mean and standard deviation, are estimated and then visualized to support comparison of cross-sectional differences and temporal patterns.
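For reference, a base R counterpart of a grouped mean and standard deviation can be sketched with `aggregate()`. The toy data frame and its values are illustrative assumptions; only the column name avg_sale_price mirrors the report.

```r
# Toy data: illustrative values only, not taken from the dataset
toy <- data.frame(
  city = c("A", "A", "A", "B", "B", "B"),
  avg_sale_price = c(100, 110, 120, 200, 220, 240)
)

# Conditional mean and sd of avg_sale_price within each city.
# With the formula interface, rows containing NA are dropped by default.
by_city <- aggregate(
  avg_sale_price ~ city,
  data = toy,
  FUN = function(x) c(mean = mean(x), sd = sd(x))
)

# aggregate() returns the two statistics as a matrix column;
# do.call(data.frame, ...) flattens it into ordinary columns
by_city <- do.call(data.frame, by_city)
by_city
```

The flattened result has the columns city, avg_sale_price.mean, and avg_sale_price.sd. Stratifying on two variables only requires extending the formula, for example `avg_sale_price ~ city + year`.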
# quantitative variables and derived indicators
num_vars <- c("avg_sale_price","listing_effectiveness", "inventory_turnover_ratio")
# compute mean and standard deviation of every variable in num_vars
summarise_mean_sd <- function(data) {
data %>%
summarise(
# mean of each selected column within the current group
across(all_of(num_vars),\(x) mean(x, na.rm = TRUE), .names = "{.col}_mean"),
# standard deviation of each selected column within the current group
across(all_of(num_vars), \(x) sd(x, na.rm = TRUE), .names = "{.col}_sd"),
.groups = "drop" # return an ungrouped data frame
)
}
# conditional summaries for different stratifications
summary_by_city <- df %>% group_by(city) %>% summarise_mean_sd()
summary_by_year <- df %>% group_by(year) %>% summarise_mean_sd()
summary_by_month <- df %>% group_by(month) %>% summarise_mean_sd()
summary_by_city_year <- df %>% group_by(city, year) %>% summarise_mean_sd()
summary_by_city_month <- df %>% group_by(city, month) %>% summarise_mean_sd()
# print summary tables
kable(summary_by_city, caption = "Conditional summary by city (mean, sd)", digits = 4, booktabs = TRUE)
| city | avg_sale_price_mean | listing_effectiveness_mean | inventory_turnover_ratio_mean | avg_sale_price_sd | listing_effectiveness_sd | inventory_turnover_ratio_sd |
|---|---|---|---|---|---|---|
| Beaumont | 146640.4 | 0.1061 | 0.1032 | 11232.13 | 0.0267 | 0.0182 |
| Bryan-College Station | 183534.3 | 0.1473 | 0.1453 | 15149.35 | 0.0729 | 0.0533 |
| Tyler | 167676.8 | 0.0935 | 0.0909 | 12350.51 | 0.0235 | 0.0164 |
| Wichita Falls | 119430.0 | 0.1280 | 0.1293 | 11398.48 | 0.0247 | 0.0136 |
kable(summary_by_year, caption = "Conditional summary by year (mean, sd)", digits = 4, booktabs = TRUE)
| year | avg_sale_price_mean | listing_effectiveness_mean | inventory_turnover_ratio_mean | avg_sale_price_sd | listing_effectiveness_sd | inventory_turnover_ratio_sd |
|---|---|---|---|---|---|---|
| 2010 | 150188.6 | 0.0997 | 0.1046 | 23279.55 | 0.0337 | 0.0217 |
| 2011 | 148250.6 | 0.0927 | 0.0950 | 24938.38 | 0.0232 | 0.0181 |
| 2012 | 150898.7 | 0.1097 | 0.1040 | 26438.50 | 0.0281 | 0.0178 |
| 2013 | 158705.2 | 0.1346 | 0.1288 | 26523.81 | 0.0448 | 0.0314 |
| 2014 | 163558.7 | 0.1570 | 0.1534 | 31740.53 | 0.0618 | 0.0496 |
kable(summary_by_month, caption = "Conditional summary by month (mean, sd)", digits = 4, booktabs = TRUE)
| month | avg_sale_price_mean | listing_effectiveness_mean | inventory_turnover_ratio_mean | avg_sale_price_sd | listing_effectiveness_sd | inventory_turnover_ratio_sd |
|---|---|---|---|---|---|---|
| 1 | 145640.4 | 0.0831 | 0.1190 | 29819.11 | 0.0230 | 0.0290 |
| 2 | 148840.5 | 0.0878 | 0.1160 | 25120.42 | 0.0219 | 0.0284 |
| 3 | 151136.5 | 0.1160 | 0.1119 | 23237.92 | 0.0346 | 0.0283 |
| 4 | 151461.3 | 0.1253 | 0.1091 | 26174.30 | 0.0380 | 0.0297 |
| 5 | 158235.0 | 0.1415 | 0.1102 | 25787.19 | 0.0503 | 0.0314 |
| 6 | 161545.8 | 0.1424 | 0.1104 | 23470.46 | 0.0576 | 0.0337 |
| 7 | 156881.0 | 0.1435 | 0.1127 | 27220.12 | 0.0740 | 0.0386 |
| 8 | 156455.6 | 0.1419 | 0.1154 | 28253.21 | 0.0526 | 0.0394 |
| 9 | 156522.3 | 0.1117 | 0.1188 | 29669.41 | 0.0348 | 0.0417 |
| 10 | 155897.4 | 0.1119 | 0.1218 | 32527.29 | 0.0360 | 0.0418 |
| 11 | 154233.0 | 0.1025 | 0.1259 | 29684.87 | 0.0293 | 0.0437 |
| 12 | 154995.5 | 0.1173 | 0.1351 | 27008.87 | 0.0379 | 0.0496 |
kable(summary_by_city_year, caption = "Conditional summary by city/year (mean, sd)", digits = 4, booktabs = TRUE)
| city | year | avg_sale_price_mean | listing_effectiveness_mean | inventory_turnover_ratio_mean | avg_sale_price_sd | listing_effectiveness_sd | inventory_turnover_ratio_sd |
|---|---|---|---|---|---|---|---|
| Beaumont | 2010 | 146582.5 | 0.0898 | 0.0920 | 13960.173 | 0.0195 | 0.0062 |
| Beaumont | 2011 | 145921.9 | 0.0823 | 0.0855 | 12655.337 | 0.0117 | 0.0052 |
| Beaumont | 2012 | 141475.9 | 0.1015 | 0.0933 | 10345.771 | 0.0158 | 0.0079 |
| Beaumont | 2013 | 150079.0 | 0.1225 | 0.1142 | 6245.121 | 0.0215 | 0.0069 |
| Beaumont | 2014 | 149142.7 | 0.1346 | 0.1311 | 11234.169 | 0.0218 | 0.0064 |
| Bryan-College Station | 2010 | 174601.8 | 0.1056 | 0.1163 | 11964.068 | 0.0396 | 0.0111 |
| Bryan-College Station | 2011 | 173689.0 | 0.1027 | 0.1033 | 11645.001 | 0.0315 | 0.0117 |
| Bryan-College Station | 2012 | 179360.6 | 0.1215 | 0.1141 | 9072.876 | 0.0423 | 0.0167 |
| Bryan-College Station | 2013 | 187315.8 | 0.1708 | 0.1613 | 12931.505 | 0.0649 | 0.0372 |
| Bryan-College Station | 2014 | 202704.3 | 0.2362 | 0.2318 | 8625.369 | 0.0768 | 0.0312 |
| Tyler | 2010 | 159537.5 | 0.0745 | 0.0795 | 8554.899 | 0.0151 | 0.0059 |
| Tyler | 2011 | 160248.0 | 0.0773 | 0.0747 | 8949.978 | 0.0126 | 0.0064 |
| Tyler | 2012 | 165533.0 | 0.0902 | 0.0866 | 12271.146 | 0.0134 | 0.0052 |
| Tyler | 2013 | 174501.8 | 0.1012 | 0.0986 | 8939.224 | 0.0143 | 0.0065 |
| Tyler | 2014 | 178563.5 | 0.1242 | 0.1152 | 10805.818 | 0.0199 | 0.0122 |
| Wichita Falls | 2010 | 120032.5 | 0.1290 | 0.1306 | 12351.214 | 0.0302 | 0.0084 |
| Wichita Falls | 2011 | 113143.6 | 0.1085 | 0.1166 | 8247.222 | 0.0159 | 0.0084 |
| Wichita Falls | 2012 | 117225.3 | 0.1255 | 0.1222 | 13981.539 | 0.0154 | 0.0072 |
| Wichita Falls | 2013 | 122924.3 | 0.1439 | 0.1413 | 8760.490 | 0.0283 | 0.0133 |
| Wichita Falls | 2014 | 123824.3 | 0.1331 | 0.1355 | 10994.397 | 0.0187 | 0.0135 |
kable(summary_by_city_month, caption = "Conditional summary by city/month (mean, sd)", digits = 4, booktabs = TRUE)
| city | month | avg_sale_price_mean | listing_effectiveness_mean | inventory_turnover_ratio_mean | avg_sale_price_sd | listing_effectiveness_sd | inventory_turnover_ratio_sd |
|---|---|---|---|---|---|---|---|
| Beaumont | 1 | 142059.2 | 0.0760 | 0.1050 | 20363.512 | 0.0201 | 0.0151 |
| Beaumont | 2 | 146503.0 | 0.0826 | 0.1030 | 12974.719 | 0.0197 | 0.0146 |
| Beaumont | 3 | 149918.4 | 0.1037 | 0.1029 | 5398.706 | 0.0132 | 0.0198 |
| Beaumont | 4 | 142949.1 | 0.1118 | 0.1000 | 5511.596 | 0.0141 | 0.0176 |
| Beaumont | 5 | 146873.9 | 0.1208 | 0.0993 | 6495.480 | 0.0303 | 0.0187 |
| Beaumont | 6 | 148591.7 | 0.1183 | 0.0990 | 4913.971 | 0.0252 | 0.0186 |
| Beaumont | 7 | 153993.7 | 0.1061 | 0.0981 | 15215.577 | 0.0179 | 0.0190 |
| Beaumont | 8 | 150966.9 | 0.1278 | 0.1012 | 6549.042 | 0.0352 | 0.0198 |
| Beaumont | 9 | 144663.8 | 0.1043 | 0.1038 | 13874.571 | 0.0352 | 0.0238 |
| Beaumont | 10 | 148133.6 | 0.1137 | 0.1051 | 9899.859 | 0.0319 | 0.0213 |
| Beaumont | 11 | 134896.1 | 0.0966 | 0.1074 | 11773.634 | 0.0173 | 0.0223 |
| Beaumont | 12 | 150135.5 | 0.1119 | 0.1139 | 10028.542 | 0.0229 | 0.0228 |
| Bryan-College Station | 1 | 179365.7 | 0.0862 | 0.1403 | 13494.092 | 0.0256 | 0.0355 |
| Bryan-College Station | 2 | 169985.7 | 0.0867 | 0.1330 | 18446.113 | 0.0305 | 0.0389 |
| Bryan-College Station | 3 | 174920.3 | 0.1226 | 0.1246 | 8552.149 | 0.0546 | 0.0435 |
| Bryan-College Station | 4 | 182128.2 | 0.1443 | 0.1200 | 14123.928 | 0.0573 | 0.0468 |
| Bryan-College Station | 5 | 181804.4 | 0.1950 | 0.1294 | 18412.798 | 0.0620 | 0.0480 |
| Bryan-College Station | 6 | 181582.2 | 0.2164 | 0.1363 | 18298.850 | 0.0701 | 0.0530 |
| Bryan-College Station | 7 | 183344.8 | 0.2228 | 0.1447 | 16508.899 | 0.1132 | 0.0612 |
| Bryan-College Station | 8 | 184104.9 | 0.1943 | 0.1506 | 16633.849 | 0.0737 | 0.0608 |
| Bryan-College Station | 9 | 191815.7 | 0.1236 | 0.1578 | 9544.628 | 0.0520 | 0.0618 |
| Bryan-College Station | 10 | 193938.3 | 0.1214 | 0.1614 | 13905.882 | 0.0587 | 0.0614 |
| Bryan-College Station | 11 | 192760.5 | 0.1167 | 0.1661 | 11943.247 | 0.0436 | 0.0659 |
| Bryan-College Station | 12 | 186660.8 | 0.1381 | 0.1802 | 15651.209 | 0.0618 | 0.0775 |
| Tyler | 1 | 154935.3 | 0.0669 | 0.0929 | 6400.878 | 0.0161 | 0.0126 |
| Tyler | 2 | 164516.8 | 0.0768 | 0.0921 | 8645.045 | 0.0132 | 0.0135 |
| Tyler | 3 | 161441.0 | 0.0947 | 0.0899 | 11066.124 | 0.0114 | 0.0126 |
| Tyler | 4 | 162962.8 | 0.0971 | 0.0868 | 10856.908 | 0.0148 | 0.0133 |
| Tyler | 5 | 178711.5 | 0.1042 | 0.0866 | 6087.930 | 0.0233 | 0.0155 |
| Tyler | 6 | 180028.9 | 0.1071 | 0.0854 | 11050.260 | 0.0258 | 0.0151 |
| Tyler | 7 | 170866.7 | 0.1040 | 0.0852 | 8333.915 | 0.0225 | 0.0149 |
| Tyler | 8 | 173738.0 | 0.1028 | 0.0871 | 11343.693 | 0.0213 | 0.0159 |
| Tyler | 9 | 169106.3 | 0.0955 | 0.0896 | 17250.045 | 0.0248 | 0.0180 |
| Tyler | 10 | 167987.0 | 0.0950 | 0.0927 | 15113.128 | 0.0300 | 0.0199 |
| Tyler | 11 | 166102.4 | 0.0826 | 0.0975 | 7061.601 | 0.0267 | 0.0223 |
| Tyler | 12 | 161724.3 | 0.0952 | 0.1053 | 14740.546 | 0.0302 | 0.0260 |
| Wichita Falls | 1 | 106201.5 | 0.1032 | 0.1379 | 9788.224 | 0.0169 | 0.0156 |
| Wichita Falls | 2 | 114356.4 | 0.1052 | 0.1359 | 7397.539 | 0.0152 | 0.0120 |
| Wichita Falls | 3 | 118266.5 | 0.1428 | 0.1301 | 12167.279 | 0.0263 | 0.0067 |
| Wichita Falls | 4 | 117805.3 | 0.1481 | 0.1294 | 7684.451 | 0.0286 | 0.0116 |
| Wichita Falls | 5 | 125550.3 | 0.1459 | 0.1256 | 5015.104 | 0.0285 | 0.0136 |
| Wichita Falls | 6 | 135980.5 | 0.1278 | 0.1208 | 13412.726 | 0.0119 | 0.0092 |
| Wichita Falls | 7 | 119318.8 | 0.1411 | 0.1229 | 7206.987 | 0.0288 | 0.0120 |
| Wichita Falls | 8 | 117012.4 | 0.1428 | 0.1226 | 5664.009 | 0.0211 | 0.0128 |
| Wichita Falls | 9 | 120503.5 | 0.1235 | 0.1241 | 6905.672 | 0.0213 | 0.0162 |
| Wichita Falls | 10 | 113530.6 | 0.1176 | 0.1281 | 13971.742 | 0.0165 | 0.0166 |
| Wichita Falls | 11 | 123173.0 | 0.1141 | 0.1325 | 12234.014 | 0.0145 | 0.0154 |
| Wichita Falls | 12 | 121461.4 | 0.1242 | 0.1412 | 12532.343 | 0.0176 | 0.0144 |
The city-level chart summarizes conditional means of average sale price across all months and years within each market. Bryan–College Station and Tyler exhibit the highest central price levels, Wichita Falls records the lowest mean, and Beaumont occupies an intermediate position. The error bars reflect dispersion at the city level, indicating that cities differ not only in price level but also in short-run variability.
# average sale price by city (mean ± SD)
ggplot(summary_by_city,
aes(x = city, y = avg_sale_price_mean)) +
geom_col(col="black", fill = "#2c7fb8", width = 0.7) +
geom_errorbar(
aes(
ymin = avg_sale_price_mean - avg_sale_price_sd,
ymax = avg_sale_price_mean + avg_sale_price_sd
),
width = 0.3
) +
labs(
title = "Average sale price by city (mean ± SD)",
x = "City",
y = "Average sale price (USD)"
) +
theme_bw(base_size = 11)
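As a minimal sketch of how a city-level summary with the columns used in the chart could be assembled, the block below computes the mean and standard deviation of an average sale price per city. Two things are assumptions made for illustration: the definition of `avg_sale_price` as volume (in millions of USD) times 1e6 divided by sales, and the toy data frame `toy_df`, which holds only a few rows from the random sample printed at the top of the report rather than the full dataset.

```r
library(dplyr)

# Toy subset: nine rows taken from the sample table, not the full dataset
toy_df <- data.frame(
  city   = rep(c("Beaumont", "Bryan-College Station", "Wichita Falls"), each = 3),
  sales  = c(150, 193, 232, 101, 159, 169, 127, 121, 140),
  volume = c(23.904, 27.350, 36.275, 16.125, 28.882, 34.903, 13.594, 15.547, 17.833)
)

# Assumption: average sale price = total volume (USD) / number of sales
toy_summary_by_city <- toy_df %>%
  mutate(avg_sale_price = volume * 1e6 / sales) %>%
  group_by(city) %>%
  summarise(
    avg_sale_price_mean = mean(avg_sale_price, na.rm = TRUE),
    avg_sale_price_sd   = sd(avg_sale_price, na.rm = TRUE),
    .groups = "drop"
  )

toy_summary_by_city
```

With only three rows per city the standard deviations are very rough, so the output is useful for checking the mechanics, not for reading off market conclusions.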
The grouped bar chart refines the comparison by conditioning on both city and year, so each bar represents the mean price within a city–year stratum and the error bar captures dispersion within that stratum. Bryan–College Station shows the highest levels and a positive temporal gradient from 2011 to 2014, consistent with strengthening market conditions in that sub-market. Tyler also displays a steady increase over the sample period. Beaumont appears comparatively stable, with limited evidence of a sustained upward shift. Wichita Falls remains the lowest-priced market throughout, with only modest growth after 2011.
# average sale price by city and year (mean ± SD)
ggplot(summary_by_city_year,
aes(x = city, y = avg_sale_price_mean, fill = factor(year))) +
geom_col(color = "black", position = position_dodge(width = 0.8), width = 0.7) +
geom_errorbar(
aes(
ymin = avg_sale_price_mean - avg_sale_price_sd,
ymax = avg_sale_price_mean + avg_sale_price_sd
),
position = position_dodge(width = 0.8),
width = 0.2
) +
labs(
title = "Average sale price by city and year (mean ± SD)",
x = "City",
y = "Average sale price (USD)",
fill = "Year"
) +
theme_bw(base_size = 11)
Overall, the figures point to marked cross-sectional heterogeneity in the Texas panel: geographic stratification is a primary source of variation in average transaction values.
The line chart displays conditional means of listing effectiveness (sales relative to listings) for each city–year stratum. The series show a common upward shift after an early dip around 2011, consistent with an improvement in the conversion of active listings into sales. The magnitude and timing of that improvement, however, differ across cities:
Bryan–College Station stands out from 2012 onward: effectiveness rises from roughly 0.10 in the early years to about 0.24 by 2014, the highest level in the plot.
Wichita Falls begins with the highest effectiveness in 2010 (near 0.13), declines in 2011, recovers to a local peak around 2013 (≈0.144), and then falls slightly by 2014 (to ≈0.133).
Beaumont follows a comparatively smooth upward path after 2011, from about 0.09 to roughly 0.135 in 2014, converging toward the levels of Wichita Falls by the end of the period.
Tyler shows a steady, roughly linear improvement, reaching approximately 0.124 in 2014 and narrowing the gap with Beaumont and Wichita Falls.
# listing effectiveness (sales/listings) over time, by city
ggplot(
summary_by_city_year,
aes(x = year, y = listing_effectiveness_mean, color = city, group = city)
) +
geom_line(linewidth = 0.8) +
geom_point(size = 1.8) +
labs(
title = "Listing Effectiveness by year and city",
x = "Year",
y = "Listing effectiveness (mean)",
color = "City"
) +
theme_bw(base_size = 11)
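The city–year means plotted here rest on a simple ratio, sales divided by listings, averaged within each stratum. The sketch below illustrates that computation on a toy data frame (`toy_df`, a hypothetical name) built from a few rows of the sample printed at the top of the report; each city–year stratum in this toy example contains a single month, so the mean equals that month's ratio.

```r
library(dplyr)

# Toy subset: nine rows taken from the sample table, not the full dataset
toy_df <- data.frame(
  city     = rep(c("Beaumont", "Bryan-College Station", "Wichita Falls"), each = 3),
  year     = c(2010, 2012, 2013, 2011, 2012, 2014, 2011, 2013, 2014),
  sales    = c(150, 193, 232, 101, 159, 169, 127, 121, 140),
  listings = c(1779, 1671, 1675, 1562, 1442, 973, 1029, 923, 899)
)

# Listing effectiveness: share of active listings converted into sales
toy_city_year <- toy_df %>%
  mutate(listing_effectiveness = sales / listings) %>%
  group_by(city, year) %>%
  summarise(
    listing_effectiveness_mean = mean(listing_effectiveness, na.rm = TRUE),
    .groups = "drop"
  )

toy_city_year
```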
The figure supports two main conclusions for the report.
First, city-specific dynamics dominate: Bryan–College Station diverges upward while the other three markets move in a narrower band.
Second, by 2014 there is partial convergence among Beaumont, Tyler and Wichita Falls (effectiveness roughly 0.12–0.14), whereas Bryan–College Station remains an outlier on the high side.
The chart reports conditional means of inventory turnover by calendar month within each city by aggregating across years in summary_by_city_month:
Bryan–College Station shows the strongest seasonal swing: turnover falls to about 0.12 around month 4, then rises steadily to roughly 0.18 in December, the highest value on the plot, which is consistent with faster clearing of listings in the second half of the year.
Wichita Falls starts close to Bryan–College Station (≈0.138), drifts down to a trough near 0.12 in month 6, and recovers to about 0.14 by December, ranking second.
Beaumont and Tyler follow a milder U-shaped pattern, with mid-year lows (Tyler near 0.085 in month 7, the lowest overall) and modest year-end gains (Tyler ≈0.105, Beaumont ≈0.115).
Shared features include a mid-year dip (months 4–7) and a fourth-quarter rise, which supports interpreting turnover as driven by both seasonality and persistent city effects.
# seasonal pattern of inventory turnover by month and city
ggplot(
summary_by_city_month,
aes(x = month, y = inventory_turnover_ratio_mean, color = city, group = city)
) +
geom_line(linewidth = 0.9) +
geom_point(size = 1.6) +
scale_x_continuous(breaks = 1:12) +
labs(
title = "Inventory turnover ratio by month and city",
x = "Month",
y = "Inventory turnover ratio (mean)",
color = "City"
) +
theme_bw(base_size = 11)
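The trough and peak months quoted for a city can be checked directly against the monthly table. The sketch below does this for Wichita Falls, using the twelve values copied from the table above; it assumes that the fifth column of that table is `inventory_turnover_ratio_mean`, which is consistent with the figures cited in the text but is not stated in the table excerpt itself.

```r
# Wichita Falls monthly means, months 1 to 12, copied from the table above
# (assumption: fifth column = inventory_turnover_ratio_mean)
wf_turnover <- c(0.1379, 0.1359, 0.1301, 0.1294, 0.1256, 0.1208,
                 0.1229, 0.1226, 0.1241, 0.1281, 0.1325, 0.1412)

trough_month <- which.min(wf_turnover)              # month with the lowest turnover
peak_month   <- which.max(wf_turnover)              # month with the highest turnover
amplitude    <- max(wf_turnover) - min(wf_turnover) # size of the seasonal swing

c(trough = trough_month, peak = peak_month, amplitude = amplitude)
```

For these values the trough falls in month 6 and the peak in month 12, matching the description of a mid-year dip followed by a year-end recovery.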
The second figure plots conditional means of listing effectiveness by month and city. Seasonality is again evident, with most cities showing higher absorption in spring–summer and lower values in late autumn, then a small December uptick.
Bryan–College Station exhibits the largest amplitude: effectiveness rises from about 0.09 early in the year to a peak near 0.22 in July, then falls sharply through August–September and stabilizes around 0.12–0.14. That pattern indicates a concentrated summer selling season in that market.
Wichita Falls peaks earlier, around 0.15 in April, with a secondary high near 0.14 in August, and ends the year near 0.12.
Beaumont fluctuates between roughly 0.08 and 0.13, with a high near 0.13 in August.
Tyler stays below the other cities all year (about 0.07–0.11), with a gentle peak near 0.11 in June and a November low near 0.08.
# seasonal pattern of listing effectiveness by month and city
ggplot(
summary_by_city_month,
aes(x = month, y = listing_effectiveness_mean, color = city, group = city)
) +
geom_line(linewidth = 0.9) +
geom_point(size = 1.6) +
scale_x_continuous(breaks = 1:12) +
labs(
title = "Seasonal pattern of listing effectiveness by month and city",
x = "Month",
y = "Listing effectiveness (mean)",
color = "City"
) +
theme_bw(base_size = 11)
In this section, customized graphics are produced using ggplot2 to support comparative analysis of the Texas real estate data. The visualizations address three objectives:
Together, these figures facilitate assessment of cross-sectional heterogeneity, seasonal patterns, and temporal trends in market activity.
The boxplots summarize the empirical distribution of monthly median sale prices within each city (all years pooled). The median line and interquartile range (IQR) describe the central level and spread of typical prices; whiskers and points indicate the remaining range and upper-tail outliers.
Bryan–College Station shows the highest price level (mean of monthly medians about $157,000), followed by Tyler (≈ $141,000), Beaumont (≈ $130,000), and Wichita Falls (≈ $102,000). The gaps between cities indicate strong cross-sectional heterogeneity in price levels.
| City | Mean (USD) | IQR (USD) |
|---|---|---|
| Beaumont | 129988.3 | 11525 |
| Bryan-College Station | 157488.3 | 11175 |
| Tyler | 141441.7 | 13700 |
| Wichita Falls | 101743.3 | 16375 |
Box heights (IQR) are broadly similar across cities, ranging from about $11,000 to $16,000, though Wichita Falls and Tyler are somewhat wider. That pattern suggests that differences between cities are driven mainly by level shifts, not by differences in volatility.
ggplot(df, aes(x = city, y = median_price, fill = city)) +
geom_boxplot(width = 0.6, color = "black") +
labs(
title = "Distribution of median sale price by city",
x = "City",
y = "Median sale price (USD)"
) +
theme_bw(base_size = 11) +
theme(legend.position = "none")
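A per-city table of mean and IQR like the one above can be produced with a grouped summary. The following is a minimal sketch on a toy data frame (`toy_prices`, a hypothetical name) holding nine `median_price` values from the sample printed at the top of the report; the results therefore differ from the full-data figures in the table.

```r
library(dplyr)

# Toy subset: three monthly median prices per city from the sample table
toy_prices <- data.frame(
  city         = rep(c("Beaumont", "Bryan-College Station", "Wichita Falls"), each = 3),
  median_price = c(138500, 121800, 134100, 148500, 149100, 172800, 102300, 104700, 115700)
)

# Mean and interquartile range of monthly median prices within each city
toy_price_summary <- toy_prices %>%
  group_by(city) %>%
  summarise(
    mean_price = mean(median_price, na.rm = TRUE),
    iqr_price  = IQR(median_price, na.rm = TRUE),
    .groups = "drop"
  )

toy_price_summary
```

`IQR()` uses R's default quantile definition (type 7); a different quantile type would give slightly different box heights on small samples.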
Disaggregating by year refines the pooled comparison and highlights temporal dynamics within each market.
The ordering by price level is preserved in most years: Bryan–College Station remains highest and Wichita Falls lowest, with Tyler and Beaumont in between. By 2014, Tyler's distribution has shifted upward and overtaken Beaumont's; Tyler also displays the most regular year-over-year rise in median level, consistent with a sustained local price increase.
ggplot(df, aes(x = city, y = median_price, fill = factor(year))) +
geom_boxplot(
position = position_dodge(width = 0.8),
width = 0.7,
color = "black"
) +
labs(
title = "Distribution of median sale price by city and year",
x = "City",
y = "Median sale price (USD)",
fill = "Year"
) +
theme_bw(base_size = 11)
The stacked bar chart aggregates sales counts by calendar month and city, pooling observations across all years in the panel.
Total market activity shows a pronounced seasonal cycle: sales are lowest in January (2548), rise through spring, peak in June (4871), and then decline through autumn to a secondary low in November (3137), with a modest recovery in December. The pattern defines a summer-weighted selling season.
ggplot(total_sales_month_city, aes(x = factor(month), y = total_sales, fill = city)) +
geom_col(position = "stack", width = 0.7, color = "black") +
scale_x_discrete(breaks = 1:12) +
labs(title = "Total sales by month and city (absolute)", x = "Month", y = "Total sales (count)", fill = "City") +
theme_bw(base_size = 11)
The normalized chart displays each city’s share of that month’s aggregate. Relative composition is broadly stable across months: Tyler typically accounts for about 33–39% of monthly sales, Beaumont for about 20–26%, Bryan–College Station for about 21–33%, and Wichita Falls for about 13–17%. The main mid-year shift is a gain in relative share for Bryan–College Station around months 5–7 (to about 32–33%), with small offsetting declines for Tyler and Beaumont.
This view supports the conclusion that geographic structure in the four-city panel is persistent, with only limited evidence of month-specific reallocation among cities.
| Month | Beaumont | Bryan-College Station | Tyler | Wichita Falls |
|---|---|---|---|---|
| 1 | 0.24 | 0.23 | 0.36 | 0.17 |
| 2 | 0.24 | 0.22 | 0.38 | 0.16 |
| 3 | 0.23 | 0.25 | 0.35 | 0.17 |
| 4 | 0.22 | 0.28 | 0.34 | 0.16 |
| 5 | 0.22 | 0.32 | 0.33 | 0.14 |
| 6 | 0.21 | 0.33 | 0.34 | 0.13 |
| 7 | 0.20 | 0.32 | 0.34 | 0.14 |
| 8 | 0.23 | 0.28 | 0.34 | 0.15 |
| 9 | 0.24 | 0.22 | 0.39 | 0.16 |
| 10 | 0.26 | 0.21 | 0.38 | 0.15 |
| 11 | 0.25 | 0.23 | 0.36 | 0.16 |
| 12 | 0.26 | 0.23 | 0.36 | 0.15 |
ggplot(total_sales_month_city) +
geom_bar(aes(x = month, y = total_sales, fill = factor(city)), stat = "identity", position = "fill") +
scale_x_continuous(breaks = 1:12) +
labs(title = "Total sales by month and city (normalized)", x = "Month", y = "Share of total sales", fill = "City") +
theme_bw(base_size = 11)
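The shares that `position = "fill"` draws can also be computed explicitly, which is how a table of monthly shares can be obtained. The sketch below divides each city's total by the monthly total; the counts in `toy_totals` are hypothetical values chosen for illustration and are not taken from the dataset.

```r
library(dplyr)

# Hypothetical totals for two months (illustrative counts only)
toy_totals <- data.frame(
  month = rep(c(1, 6), each = 4),
  city  = rep(c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"), times = 2),
  total_sales = c(60, 55, 90, 45, 100, 160, 165, 65)
)

# Share of each city within its month; shares sum to 1 per month
toy_shares <- toy_totals %>%
  group_by(month) %>%
  mutate(share = total_sales / sum(total_sales)) %>%
  ungroup()

toy_shares
```

Reshaping `toy_shares` to one column per city (for example with `tidyr::pivot_wider()`) would give the same layout as the share table shown above.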
The chart below displays absolute monthly sales counts stacked by city within each calendar year. Each bar’s height is the total number of transactions in that month–year, and segment heights show each city’s contribution in units.
Total monthly sales generally increase over the sample period, and the city segments grow in parallel in many months, so the rise reflects an overall gain in volume rather than a single city driving the entire increase.
In every year considered, sales tend to be lower in the early months (especially January), peak in mid-summer (June–August), and then ease toward year-end. This pattern aligns with the pooled monthly stacked chart, with summer as the dominant selling season in these markets.
# total sales per city and month and years
total_sales_year_month_city <- df %>%
group_by(city, year, month) %>%
summarise(total_sales = sum(sales, na.rm = TRUE), .groups = "drop")
ggplot(total_sales_year_month_city) +
geom_bar(aes(x = month, y = total_sales, fill = factor(city)), stat = "identity", position = "stack") +
scale_x_continuous(breaks = 1:12) +
facet_wrap(~ year, ncol = 2) +
labs(title = "Total sales by year, month and city", x = "Month", y = "Total sales (count)", fill = "City") +
theme_bw(base_size = 11)
The city-specific line charts (month on the x-axis, one line per year, 2010–2014) show strong seasonality and generally rising sales over the period, but with clear cross-city differences.
Tyler, Beaumont, and Bryan–College Station all record higher volumes in the last two years (2013–2014), with strong counts in spring–summer and lower counts in winter. In particular, Bryan–College Station has the most regular pattern (a sharp June–July peak, then a fast drop) and the highest peak levels.
# plot monthly sales for one city with one line per calendar year.
plot_sales_by_city_period <- function(city_name) {
df %>%
# case insensitive comparison
filter(str_to_upper(city) == str_to_upper(city_name)) %>%
# one line for each year
ggplot(aes(x = month, y = sales, color = factor(year), group = year)) +
geom_line(linewidth = 0.9) +
geom_point(size = 1.5) +
scale_x_continuous(breaks = 1:12) +
labs(
title = paste("Seasonal sales by year —", city_name),
x = "Month",
y = "Sales (count)",
color = "Year"
) +
theme_bw(base_size = 11)
}
Wichita Falls is smaller in scale and less predictable: year lines overlap, peaks shift across months, and there is no clear upward trend over 2010–2014, making it the most volatile market in the report.
The figure plots monthly sales against period (2010–early 2015), faceted by city with free y-scales so each panel shows local level and variation. Observed series (blue) are overlaid with a linear trend (red dashed).
Beaumont shows a positive upward trend with clear seasonal oscillation.
Bryan–College Station combines a steep positive trend with regular, amplifying seasonality: peak months rise markedly in later years (above 400 units), while troughs remain near 100, so the market exhibits both the strongest growth and large seasonal swings.
Tyler displays a steady upward trend and consistent seasonal cycles, with a higher baseline than Beaumont and Wichita Falls (exceeding 400 at peak in 2014). Tyler is the most active of the four markets by peak volume.
For Wichita Falls the linear trend is essentially flat, implying little net growth over the period, and the seasonal pattern is less smooth than in the other three cities.
ggplot(df, aes(x = period, y = sales)) +
geom_line(linewidth = 0.8, color = "#2c7fb8") +
geom_point(size = 1.2, color = "#2c7fb8") +
geom_smooth(method = "lm", se = FALSE, color = "red",
linewidth = 0.8, linetype = "dashed") +
facet_wrap(~ city, ncol = 2, scales = "free_y") +
labs(
title = "Sales dynamics over time by city",
x = NULL,
y = "Sales (count)"
) +
theme_bw(base_size = 11)
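The red dashed line drawn by `geom_smooth(method = "lm")` corresponds to an ordinary least-squares fit of sales on the period. As a rough sketch of the two steps involved, the block below builds a `period` date (first day of each month) from `year` and `month` and then fits that line for a single series. The data in `toy` are synthetic (a linear drift plus a sine-shaped seasonal term), so the slope it produces says nothing about the actual cities.

```r
# Synthetic monthly series for one hypothetical city, in chronological order
toy <- expand.grid(month = 1:12, year = 2010:2011)
toy$sales <- 100 + 2 * seq_len(nrow(toy)) + 10 * sin(2 * pi * toy$month / 12)

# Period variable: first day of each month
toy$period <- as.Date(sprintf("%d-%02d-01", toy$year, toy$month))

# Dates are stored as days since 1970-01-01, so the slope is in sales per day
toy$days <- as.numeric(toy$period)
fit <- lm(sales ~ days, data = toy)

slope_per_day   <- coef(fit)[["days"]]
slope_per_month <- slope_per_day * 365.25 / 12  # approximate monthly change

slope_per_month
```

Running the same fit within each city (for example inside a grouped summary) would give one slope per panel, which makes the "steep", "steady" and "flat" descriptions above directly comparable.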
This report presents a descriptive statistical analysis of the Texas real estate panel in realestate_texas.csv with monthly market indicators for four cities (Beaumont, Bryan–College Station, Tyler, Wichita Falls) over 2010–2014 (N = 240 city–month observations).
Pooled descriptive statistics show that volume has the greatest relative dispersion (CV ≈ 0.54) and the strongest right-skewness (≈ 0.88), while median_price is comparatively stable (CV ≈ 0.17):
- volume is therefore the most heterogeneous and asymmetric variable
- median_price is the most suitable for central-tendency comparisons
Classifying median_price into six intervals yields substantial heterogeneity across bands (Gini G = 0.7478, normalized G = 0.8973), yet most observations fall in mid-to-upper segments ([120,000–140,000) and [140,000–160,000) ≈ 64% combined), with negligible mass at the lowest band.
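The two Gini figures quoted here are consistent with the usual heterogeneity index, G = 1 − Σ fᵢ², and its normalization by the maximum value (k − 1)/k for k classes. The sketch below implements that definition; the function name `gini_heterogeneity` and the counts in `toy_counts` are hypothetical and serve only as an illustration, not as the report's actual frequency table.

```r
# Gini heterogeneity index for a vector of class counts
gini_heterogeneity <- function(counts) {
  f <- counts / sum(counts)   # relative frequencies
  k <- length(counts)         # number of classes (empty classes included)
  g <- 1 - sum(f^2)           # 0 = all mass in one class
  c(G = g, G_normalized = g * k / (k - 1))
}

# Hypothetical counts over six price bands (illustrative only)
toy_counts <- c(2, 40, 80, 75, 35, 8)
gini_heterogeneity(toy_counts)

# Consistency check on the reported values: with k = 6, G_normalized = G * 6 / 5
0.7478 * 6 / 5
```

The last line gives 0.89736, which agrees with the reported normalized value of 0.8973 up to rounding of G.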
Conditional summaries by city, year, and month confirm that geography, calendar year, and season jointly shape outcomes:
- city effects dominate price levels
- year effects capture growth in several markets
- month effects capture recurring summer peaks and winter troughs in sales
On the graphical side, boxplots of median_price reveal a stable city ranking (Bryan–College Station and Tyler highest; Wichita Falls lowest), with appreciation in 2013–2014 for the upper-tier cities.
Stacked and normalized sales charts show strong seasonality in aggregate volume (peak around June) but stable city shares across months and years.
Line charts of sales over the period and by year within city indicate positive trends in Beaumont, Bryan–College Station, and Tyler, with pronounced and regular seasonality in the larger markets. By contrast, Wichita Falls shows stagnant dynamics.
Overall, the Texas submarkets in this sample are characterized by:
- persistent cross-city heterogeneity in prices and absorption
- similar seasonal cycles in sales
- a moderate upward drift in activity and prices in three of the four cities over 2010–2014
- a concentration of median prices in the middle-to-upper brackets

Wichita Falls behaves as a smaller market with little discernible trend.