0. Intro

The Texas real estate market has always been dynamic, shaped by economic growth, demographic shifts, and evolving buyer preferences. With rising housing demand and fluctuating market conditions, understanding historical sales trends is essential for making informed decisions. This report explores home sales data across Texas, uncovering key patterns in prices, sales volumes, and inventory. By combining statistical analysis with clear visualizations, the goal is to highlight opportunities for growth and provide insights that can help optimize real estate listings and marketing strategies.

# Load data & quick look

df <- read.csv("realestate_texas.csv")

df <- df |>
  mutate(
    city = as.factor(city),
    year = as.integer(year),
    month = as.integer(month),
    sales = as.integer(sales),
    volume = as.numeric(volume),       
    median_price = as.numeric(median_price),
    listings = as.integer(listings),
    months_inventory = as.numeric(months_inventory),
    date = make_date(year, month, 1),
    month_name = month(date, label = TRUE, abbr = TRUE)
  ) |>
  arrange(city, date)

df <- df |>
  rename(
    MonthsI = months_inventory,
    MonthN = month_name,
    MedianP = median_price
  )


glimpse(df)
## Rows: 240
## Columns: 10
## $ city     <fct> Beaumont, Beaumont, Beaumont, Beaumont, Beaumont, Beaumont, B…
## $ year     <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2…
## $ month    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8…
## $ sales    <int> 83, 108, 182, 200, 202, 189, 164, 174, 124, 150, 150, 148, 10…
## $ volume   <dbl> 14.162, 17.690, 28.701, 26.819, 28.833, 27.219, 22.706, 25.23…
## $ MedianP  <dbl> 163800, 138200, 122400, 123200, 123100, 122800, 124300, 13680…
## $ listings <int> 1533, 1586, 1689, 1708, 1771, 1803, 1857, 1830, 1829, 1779, 1…
## $ MonthsI  <dbl> 9.5, 10.0, 10.6, 10.6, 10.9, 11.1, 11.7, 11.6, 11.7, 11.5, 11…
## $ date     <date> 2010-01-01, 2010-02-01, 2010-03-01, 2010-04-01, 2010-05-01, …
## $ MonthN   <ord> Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, J…

2. Position, variability and shape indices

## Position, Variability, and Shape Indices

# Select numeric variables
numeric_vars <-df %>%
  select(sales, volume, MedianP, listings, MonthsI)

# Summary statistics (position + variability)
variability_cv <- numeric_vars %>%
  summarise(
    across(everything(),
           list(
             mean = ~round(mean(. , na.rm = TRUE),2),
             median = ~round(median(. , na.rm = TRUE),2),
             sd = ~round(sd(. , na.rm = TRUE),2),
             cv = ~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2),
             min = ~round(min(. , na.rm = TRUE),2),
             max = ~round(max(. , na.rm = TRUE),2)
           ),
           .names = "{.col}_{.fn}"
    )
  ) %>%
  pivot_longer(everything(),
               names_to = c("Variable", "Statistic"),
               names_sep = "_",
               values_to = "Value")

kable(variability_cv)
Variable Statistic Value
sales mean 192.29
sales median 175.50
sales sd 79.65
sales cv 0.41
sales min 79.00
sales max 423.00
volume mean 31.01
volume median 27.06
volume sd 16.65
volume cv 0.54
volume min 8.17
volume max 83.55
MedianP mean 132665.42
MedianP median 134500.00
MedianP sd 22662.15
MedianP cv 0.17
MedianP min 73800.00
MedianP max 180000.00
listings mean 1738.02
listings median 1618.50
listings sd 752.71
listings cv 0.43
listings min 743.00
listings max 3296.00
MonthsI mean 9.19
MonthsI median 8.95
MonthsI sd 2.30
MonthsI cv 0.25
MonthsI min 3.40
MonthsI max 14.90
# --- Shape indices (skewness & kurtosis) ---
shape_stats <- numeric_vars %>%
  summarise(
    across(everything(),
           list(
             skewness = ~skewness(. , na.rm = TRUE),
             kurtosis = ~kurtosis(. , na.rm = TRUE)
           ),
           .names = "{.col}_{.fn}"
    )
  ) %>%
  pivot_longer(everything(),
               names_to = c("Variable", "Statistic"),
               names_sep = "_",
               values_to = "Value")

kable(shape_stats, caption = "Shape Indices")
Shape Indices
Variable Statistic Value
sales skewness 0.7181040
sales kurtosis 2.6868236
volume skewness 0.8847420
volume kurtosis 3.1769870
MedianP skewness -0.3645529
MedianP kurtosis 2.3770382
listings skewness 0.6494982
listings kurtosis 2.2082100
MonthsI skewness 0.0409753
MonthsI kurtosis 2.8255525

This section focused on five numerical variables: sales, volume, median_price, listings, and months_inventory.

# Frequency distribution for categorical / time variables

# City frequency table
table_city <- as.data.frame(table(df$city))
colnames(table_city) <- c("City", "Frequency")

# Year frequency table
table_year <- as.data.frame(table(df$year))
colnames(table_year) <- c("Year", "Frequency")

# Month frequency table
table_month <- as.data.frame(table(df$month))
colnames(table_month) <- c("Month", "Frequency")


# Display with knitr::kable (simple & knitting safe)
library(knitr)

kable(table_city, caption = "Frequency Distribution of Cities")
Frequency Distribution of Cities
City Frequency
Beaumont 60
Bryan-College Station 60
Tyler 60
Wichita Falls 60
kable(table_year, caption = "Frequency Distribution of Years")
Frequency Distribution of Years
Year Frequency
2010 48
2011 48
2012 48
2013 48
2014 48
kable(table_month, caption = "Frequency Distribution of Months")
Frequency Distribution of Months
Month Frequency
1 20
2 20
3 20
4 20
5 20
6 20
7 20
8 20
9 20
10 20
11 20
12 20

From the results above we can note that the dataset is evenly distributed across time and cities, which makes the analysis more consistent. Each month from January to December appears 20 times, ensuring a balanced view of seasonal patterns. The years from 2010 to 2014 each contribute 48 records, giving us a steady five-year timeline for tracking changes. Likewise, the four cities; Beaumont (60), Bryan-College Station (60), Tyler (60), and Wichita Falls (60) are all equally represented, allowing us to compare locations without bias toward any single city.

3. Which variables show greatest variability and skewness?

# Identify variable with max CV
variability_cv <- numeric_vars %>%
  summarise(across(everything(),
                   ~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2))) %>%
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "CV")

max_cv <- variability_cv %>%
  arrange(desc(CV)) %>%
  slice(1)

# Identify variable with max skewness
skewness_vals <- numeric_vars %>%
  summarise(across(everything(),
                   ~round(skewness(. , na.rm = TRUE),2))) %>%
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "Skewness")

max_skewness <- skewness_vals %>%
  arrange(desc(abs(Skewness))) %>%
  slice(1)

kable(max_cv, caption = "Variable with Highest Variability (CV)")
Variable with Highest Variability (CV)
Variable CV
volume 0.54
kable(max_skewness, caption = "Variable with Highest Skewness")
Variable with Highest Skewness
Variable Skewness
volume 0.88

Among the variables analyzed, volume shows the highest variability with a coefficient of variation of 0.54. This indicates that transaction values fluctuate widely relative to their mean, pointing to inconsistent market activity across months. In addition, volume also has the highest skewness (0.88), reflecting a right-skewed distribution where most months record moderate transaction volumes, but a few months experience exceptionally high spikes. These conclusions were reached by comparing variability with a coefficient of variation and skewness as an indicator of distribution shape across all the variables.

4. Create classes for a quantitative variable & Gini index

# Custom Gini index
gini.index <- function(x){
  ni = table(x)
  fi = ni/length(x)
  fi2 = fi^2
  J = length(table(x))
  
  gini = 1 - sum(fi2)
  gini.norm = gini / ((J-1)/J)
  
  return(gini.norm)
}

# Divide median_price into classes (bins)
df$price_class <- cut(
  df$MedianP,
  breaks = 5,   # divide into 5 equal-width classes
  labels = c("Very Low", "Low", "Medium", "High", "Very High")
)

freq_dist <- df %>%
  count(price_class)

kable(freq_dist)
price_class n
Very Low 18
Low 40
Medium 73
High 84
Very High 25
# Bar plot
ggplot(as.data.frame(freq_dist), aes(x = price_class, y = n)) +
  geom_col(fill = "steelblue") +
  labs(title = "Frequency of Median Price Classes",
       x = "Price Class", y = "Count") +
  theme_minimal()

# Apply Gini index
gini_value <- round(gini.index(df$price_class),3)
kable(tibble(`Normalized Gini Index` = gini_value))
Normalized Gini Index
0.926

The results show that most properties fall within the Medium (73) and High (84) price categories, while fewer are found in the Very Low (18) and Very High (25) ends of the scale. This suggests that the housing market is fairly balanced, with prices clustering around the middle ranges rather than being heavily concentrated at the extremes. The Gini index of 0.93 indicated that the model demonstrated excellent discriminatory power.

5. Probabilities

P_beaumont <- mean(df$city == "Beaumont")
P_july <- mean(df$month == 7)
P_dec2012 <- mean(df$month == 12 & df$year == 2012)

prob_table <- tibble(
  event = c("city=Beaumont","month=July","month=Dec & year=2012"),
  probability = round(c(P_beaumont, P_july, P_dec2012),3)
)

kable(prob_table, caption = "Probabilities of Selected Events")
Probabilities of Selected Events
event probability
city=Beaumont 0.250
month=July 0.083
month=Dec & year=2012 0.017

The probabilities show that a random observation has a 25% chance of being from Beaumont, making it the most likely among the listed events. The chance of selecting a record from July is much lower at about 8.3%, reflecting the uniform spread of months across the dataset. Also, the probability of finding a record specifically from December 2012 is just 1.7%, highlighting how rare it is when narrowing down to both a specific month and year.

6. Create new variables (average price, ad-effectiveness proxy)

# Create new variables
df <- df %>%
  mutate(
    avg_price = volume / sales,  # Average property price
    ad_effectiveness = sales / listings  # Effectiveness of ads (conversion rate)
  )


head(df %>% select(city, year, month, sales, volume, listings, avg_price, ad_effectiveness))
# Overall Summary of the new variables
summary_new_vars <- df %>%
  
  summarise(
    avg_price_mean = round(mean(avg_price, na.rm = TRUE),2),
    avg_price_median = round(median(avg_price, na.rm = TRUE),2),
    avg_price_sd = round(sd(avg_price, na.rm = TRUE),2),
    ad_effectiveness_mean = round(mean(ad_effectiveness, na.rm = TRUE),2),
    ad_effectiveness_median = round(median(ad_effectiveness, na.rm = TRUE),2),
    ad_effectiveness_sd = round(sd(ad_effectiveness, na.rm = TRUE),2)
  )
  
kable(summary_new_vars, caption = "Summary of New Variables")
Summary of New Variables
avg_price_mean avg_price_median avg_price_sd ad_effectiveness_mean ad_effectiveness_median ad_effectiveness_sd
0.15 0.16 0.03 0.12 0.11 0.05

The results show that the average property price (avg_price) remains relatively stable, with the mean (0.15) and median (0.16) being very close, suggesting a fairly symmetric distribution and low dispersion (SD ≈ 0.03). In contrast, the effectiveness of sales ads (ad_effectiveness) is more varied, with a much smaller mean (0.12) and median (0.11), but a relatively higher spread (SD ≈ 0.05) compared to its scale.

7. Conditional analysis (by city, year, month)

# Summary by cities
city_summary <- df %>%
  group_by(city) %>%
  summarise(across(c(sales, volume, MedianP, listings, MonthsI),
                   list(mean = ~round(mean(. , na.rm = TRUE),2),
                        sd = ~round(sd(. , na.rm = TRUE),2),
                        cv = ~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2)),
                   .names = "{.col}_{.fn}"))
# Summary by years
year_summary <- df %>%
  group_by(year) %>%
  summarise(across(c(sales, volume, MedianP, listings, MonthsI),
                   list(mean = ~round(mean(. , na.rm = TRUE),2),
                        sd = ~round(sd(. , na.rm = TRUE),2),
                        cv = ~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2)),
                   .names = "{.col}_{.fn}"))
# Summary by months
month_summary <- df %>%
  group_by(month) %>%
  summarise(across(c(sales, volume, MedianP, listings, MonthsI),
                   list(mean = ~round(mean(. , na.rm = TRUE),2),
                        sd = ~round(sd(. , na.rm = TRUE),2),
                        cv = ~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2)),
                   .names = "{.col}_{.fn}"))
# CITY LEVEL

# Sales mean by city
ggplot(city_summary, aes(x = city, y = sales_mean)) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  labs(title = "Average Sales by City", x = "City", y = "Mean Sales") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Coefficient of variation of volume by city
ggplot(city_summary, aes(x = city, y = volume_cv)) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  labs(title = "Volume CV by City", x = "City", y = "CV (Volume)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# YEAR LEVEL

# Median price trend over years
ggplot(year_summary, aes(x = year, y = MedianP_mean)) +
  geom_line(color = "steelblue", size = 1.2, group = 1) +
  geom_point(color = "steelblue", size = 2) +
  theme_minimal() +
  labs(title = "Average Median Price Over Years", x = "Year", y = "Mean Median Price")

# Listings variability (SD) over years
ggplot(year_summary, aes(x = year, y = listings_sd)) +
  geom_line(color = "steelblue", size = 1.2, group = 1) +
  geom_point(color = "steelblue", size = 2) +
  theme_minimal() +
  labs(title = "Listings Standard Deviation Over Years", x = "Year", y = "Listings SD")

# MONTH LEVEL

# Months inventory trend over months
ggplot(month_summary, aes(x = month, y = MonthsI_mean)) +
  geom_line(color = "steelblue", size = 1.2, group = 1) +
  geom_point(color = "steelblue", size = 2) +
  theme_minimal() +
  labs(title = "Average Inventory by Month", x = "Month", y = "Mean Months Inventory")

# CV of sales by month
ggplot(month_summary, aes(x = month, y = sales_cv)) +
  geom_line(color = "steelblue", size = 1.2, group = 1) +
  geom_point(color = "steelblue", size = 2) +
  theme_minimal() +
  labs(title = "Sales CV by Month", x = "Month", y = "Coefficient of Variation")

9. Figures: ggplot2 visualizations

# Boxplot: Distribution of median prices by year and city 
ggplot(df, aes(x = factor(year), y = MedianP, fill = city)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of Median Prices by Year and City",
       x = "Year", y = "Median Price") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Bar Chart: Total sales by month and city, faceted by year
sales_by_month_city <- df %>%
  group_by(year, city, month) %>%
  summarise(total_sales = sum(sales, na.rm = TRUE), .groups = "drop")

ggplot(sales_by_month_city, aes(x = factor(month), y = total_sales, fill = city)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~year) +
  theme_minimal() +
  labs(title = "Total Sales by Month and City (Faceted by Year)",
       x = "Month", y = "Total Sales") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Line Chart: Sales trends over time using year-month date
df <- df %>%
  mutate(date = make_date(year, month, 1)) 

sales_trend <- df %>%
  group_by(date) %>%
  summarise(avg_sales = mean(sales, na.rm = TRUE), .groups = "drop")

ggplot(sales_trend, aes(x = date, y = avg_sales)) +
  geom_line(color = "steelblue", size = 1.2) +
  geom_point(color = "darkblue", size = 2) +
  theme_minimal() +
  labs(title = "Sales Trends Over Time",
       x = "Date", y = "Average Sales")

9. Summary & recommendations

This analysis explored property sales data across cities, months, and years, using descriptive statistics, probability analysis, conditional summaries, and visualization techniques to identify key market dynamics.

Descriptive Statistics:

Frequency Distributions:

Probability Analysis:

Class Intervals (Median Price):

Conditional Analysis:

Visual Insights:

  1. Sales Trends Over Time (Line Chart)
  1. Distribution of Median Prices by Year and City (Boxplots)
  1. Total Sales by Month and City (Bar Charts, Faceted by Year)

Conclusions & Recommendations