Texas Real Estate Market Analysis

0. Intro
1. Variable types and recommended analyses
2. Position, variability and shape indices
3. Which variables show greatest variability and skewness?
4. Create classes for a quantitative variable & Gini index
5. Probabilities
6. Create new variables (average price, ad-effectiveness proxy)
7. Conditional analysis (by city, year, month)
9. Figures: ggplot2 visualizations
9. Summary & recommendations

0. Intro

The Texas real estate market has always been dynamic, shaped by economic growth, demographic shifts, and evolving buyer preferences. With rising housing demand and fluctuating market conditions, understanding historical sales trends is essential for making informed decisions. This report explores home sales data across Texas, uncovering key patterns in prices, sales volumes, and inventory. By combining statistical analysis with clear visualizations, the goal is to highlight opportunities for growth and provide insights that can help optimize real estate listings and marketing strategies.

# Load data & quick look

df <- read.csv("realestate_texas.csv")

df <- df |>
  mutate(
    city = as.factor(city),
    year = as.integer(year),
    month = as.integer(month),
    sales = as.integer(sales),
    volume = as.numeric(volume),       
    median_price = as.numeric(median_price),
    listings = as.integer(listings),
    months_inventory = as.numeric(months_inventory),
    date = make_date(year, month, 1),
    month_name = month(date, label = TRUE, abbr = TRUE)
  ) |>
  arrange(city, date)

df <- df |>
  rename(
    MonthsI = months_inventory,
    MonthN = month_name,
    MedianP = median_price
  )


glimpse(df)

## Rows: 240
## Columns: 10
## $ city     <fct> Beaumont, Beaumont, Beaumont, Beaumont, Beaumont, Beaumont, B…
## $ year     <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2…
## $ month    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8…
## $ sales    <int> 83, 108, 182, 200, 202, 189, 164, 174, 124, 150, 150, 148, 10…
## $ volume   <dbl> 14.162, 17.690, 28.701, 26.819, 28.833, 27.219, 22.706, 25.23…
## $ MedianP  <dbl> 163800, 138200, 122400, 123200, 123100, 122800, 124300, 13680…
## $ listings <int> 1533, 1586, 1689, 1708, 1771, 1803, 1857, 1830, 1829, 1779, 1…
## $ MonthsI  <dbl> 9.5, 10.0, 10.6, 10.6, 10.9, 11.1, 11.7, 11.6, 11.7, 11.5, 11…
## $ date     <date> 2010-01-01, 2010-02-01, 2010-03-01, 2010-04-01, 2010-05-01, …
## $ MonthN   <ord> Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, J…

1. Variable types and recommended analyses

var_table <- tibble(
  variable = names(df),
  class = map_chr(df, ~class(.x)[1])
)

kable(var_table, caption = "Variable Types in Dataset")

Variable Types in Dataset
variable	class
city	factor
year	integer
month	integer
sales	integer
volume	numeric
MedianP	numeric
listings	integer
MonthsI	numeric
date	Date
MonthN	ordered

The dataset contains a mix of categorical, temporal, and quantitative variables. The categorical variable is city, which identifies the geographical location of sales. Temporal variables include year and month, which together create a natural time dimension useful for trend and seasonality analysis. Quantitative variables include sales (count data), volume (continuous, representing dollar value in millions), median_price (continuous, in dollars), listings (count of active properties), and months_inventory (continuous, representing time to sell). The time-related variables (year and month) should be combined into a proper date format to allow for time-series analysis and visualization. Descriptive statistics and frequency distributions are appropriate for sales and listings, while measures of central tendency, variability, and distribution shape can be applied to continuous variables like volume, median_price, and months_inventory. Additionally, temporal analysis can be used to identify seasonality, growth trends, and market cycles across cities.

2. Position, variability and shape indices

## Position, Variability, and Shape Indices

# Select numeric variables
numeric_vars <-df %>%
  select(sales, volume, MedianP, listings, MonthsI)

# Summary statistics (position + variability)
variability_cv <- numeric_vars %>%
  summarise(
    across(everything(),
           list(
             mean = ~round(mean(. , na.rm = TRUE),2),
             median = ~round(median(. , na.rm = TRUE),2),
             sd = ~round(sd(. , na.rm = TRUE),2),
             cv = ~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2),
             min = ~round(min(. , na.rm = TRUE),2),
             max = ~round(max(. , na.rm = TRUE),2)
           ),
           .names = "{.col}_{.fn}"
    )
  ) %>%
  pivot_longer(everything(),
               names_to = c("Variable", "Statistic"),
               names_sep = "_",
               values_to = "Value")

kable(variability_cv)

Variable	Statistic	Value
sales	mean	192.29
sales	median	175.50
sales	sd	79.65
sales	cv	0.41
sales	min	79.00
sales	max	423.00
volume	mean	31.01
volume	median	27.06
volume	sd	16.65
volume	cv	0.54
volume	min	8.17
volume	max	83.55
MedianP	mean	132665.42
MedianP	median	134500.00
MedianP	sd	22662.15
MedianP	cv	0.17
MedianP	min	73800.00
MedianP	max	180000.00
listings	mean	1738.02
listings	median	1618.50
listings	sd	752.71
listings	cv	0.43
listings	min	743.00
listings	max	3296.00
MonthsI	mean	9.19
MonthsI	median	8.95
MonthsI	sd	2.30
MonthsI	cv	0.25
MonthsI	min	3.40
MonthsI	max	14.90

# --- Shape indices (skewness & kurtosis) ---
shape_stats <- numeric_vars %>%
  summarise(
    across(everything(),
           list(
             skewness = ~skewness(. , na.rm = TRUE),
             kurtosis = ~kurtosis(. , na.rm = TRUE)
           ),
           .names = "{.col}_{.fn}"
    )
  ) %>%
  pivot_longer(everything(),
               names_to = c("Variable", "Statistic"),
               names_sep = "_",
               values_to = "Value")

kable(shape_stats, caption = "Shape Indices")

Shape Indices
Variable	Statistic	Value
sales	skewness	0.7181040
sales	kurtosis	2.6868236
volume	skewness	0.8847420
volume	kurtosis	3.1769870
MedianP	skewness	-0.3645529
MedianP	kurtosis	2.3770382
listings	skewness	0.6494982
listings	kurtosis	2.2082100
MonthsI	skewness	0.0409753
MonthsI	kurtosis	2.8255525

This section focused on five numerical variables: sales, volume, median_price, listings, and months_inventory.

Sales have a mean of 192.29 with moderate variability (SD = 79.65) and a coefficient of variation of 0.41. The distribution is positively skewed (0.72) with slightly low kurtosis (2.69), indicating that while most months see typical sales levels, some months experience unusually high activity.
Volume shows a mean of 31.01 with relatively high variability (SD = 16.65) and a coefficient of variation of 0.54. The skewness of 0.88 and kurtosis of 3.18 reflect occasional spikes in transaction values.
Median_price averages 132,665.42 with a standard deviation of 22,662.15 (CV = 0.17). The distribution is nearly symmetric (skewness = –0.36, kurtosis = 2.38), suggesting stable housing prices with moderate fluctuations over time.
Listings average 1,738.02, but with large variability (SD = 752.71, CV = 0.43). The positive skewness (0.65) and low kurtosis (2.21) point to significant changes in property availability across months.
Months_inventory has a mean of 9.19 months with low variability (SD = 2.30, CV = 0.25). The skewness is close to zero (0.04) with a kurtosis of 2.83, indicating a balanced and steady supply of housing stock.

# Frequency distribution for categorical / time variables

# City frequency table
table_city <- as.data.frame(table(df$city))
colnames(table_city) <- c("City", "Frequency")

# Year frequency table
table_year <- as.data.frame(table(df$year))
colnames(table_year) <- c("Year", "Frequency")

# Month frequency table
table_month <- as.data.frame(table(df$month))
colnames(table_month) <- c("Month", "Frequency")


# Display with knitr::kable (simple & knitting safe)
library(knitr)

kable(table_city, caption = "Frequency Distribution of Cities")

Frequency Distribution of Cities
City	Frequency
Beaumont	60
Bryan-College Station	60
Tyler	60
Wichita Falls	60

kable(table_year, caption = "Frequency Distribution of Years")

Frequency Distribution of Years
Year	Frequency
2010	48
2011	48
2012	48
2013	48
2014	48

kable(table_month, caption = "Frequency Distribution of Months")

Frequency Distribution of Months
Month	Frequency
1	20
2	20
3	20
4	20
5	20
6	20
7	20
8	20
9	20
10	20
11	20
12	20

From the results above we can note that the dataset is evenly distributed across time and cities, which makes the analysis more consistent. Each month from January to December appears 20 times, ensuring a balanced view of seasonal patterns. The years from 2010 to 2014 each contribute 48 records, giving us a steady five-year timeline for tracking changes. Likewise, the four cities; Beaumont (60), Bryan-College Station (60), Tyler (60), and Wichita Falls (60) are all equally represented, allowing us to compare locations without bias toward any single city.

3. Which variables show greatest variability and skewness?

# Identify variable with max CV
variability_cv <- numeric_vars %>%
  summarise(across(everything(),
                   ~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2))) %>%
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "CV")

max_cv <- variability_cv %>%
  arrange(desc(CV)) %>%
  slice(1)

# Identify variable with max skewness
skewness_vals <- numeric_vars %>%
  summarise(across(everything(),
                   ~round(skewness(. , na.rm = TRUE),2))) %>%
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "Skewness")

max_skewness <- skewness_vals %>%
  arrange(desc(abs(Skewness))) %>%
  slice(1)

kable(max_cv, caption = "Variable with Highest Variability (CV)")

Variable with Highest Variability (CV)
Variable	CV
volume	0.54

kable(max_skewness, caption = "Variable with Highest Skewness")

Variable with Highest Skewness
Variable	Skewness
volume	0.88

Among the variables analyzed, volume shows the highest variability with a coefficient of variation of 0.54. This indicates that transaction values fluctuate widely relative to their mean, pointing to inconsistent market activity across months. In addition, volume also has the highest skewness (0.88), reflecting a right-skewed distribution where most months record moderate transaction volumes, but a few months experience exceptionally high spikes. These conclusions were reached by comparing variability with a coefficient of variation and skewness as an indicator of distribution shape across all the variables.

4. Create classes for a quantitative variable & Gini index

# Custom Gini index
gini.index <- function(x){
  ni = table(x)
  fi = ni/length(x)
  fi2 = fi^2
  J = length(table(x))
  
  gini = 1 - sum(fi2)
  gini.norm = gini / ((J-1)/J)
  
  return(gini.norm)
}

# Divide median_price into classes (bins)
df$price_class <- cut(
  df$MedianP,
  breaks = 5,   # divide into 5 equal-width classes
  labels = c("Very Low", "Low", "Medium", "High", "Very High")
)

freq_dist <- df %>%
  count(price_class)

kable(freq_dist)

price_class	n
Very Low	18
Low	40
Medium	73
High	84
Very High	25

# Bar plot
ggplot(as.data.frame(freq_dist), aes(x = price_class, y = n)) +
  geom_col(fill = "steelblue") +
  labs(title = "Frequency of Median Price Classes",
       x = "Price Class", y = "Count") +
  theme_minimal()

# Apply Gini index
gini_value <- round(gini.index(df$price_class),3)
kable(tibble(`Normalized Gini Index` = gini_value))

Normalized Gini Index
0.926

The results show that most properties fall within the Medium (73) and High (84) price categories, while fewer are found in the Very Low (18) and Very High (25) ends of the scale. This suggests that the housing market is fairly balanced, with prices clustering around the middle ranges rather than being heavily concentrated at the extremes. The Gini index of 0.93 indicated that the model demonstrated excellent discriminatory power.

5. Probabilities

P_beaumont <- mean(df$city == "Beaumont")
P_july <- mean(df$month == 7)
P_dec2012 <- mean(df$month == 12 & df$year == 2012)

prob_table <- tibble(
  event = c("city=Beaumont","month=July","month=Dec & year=2012"),
  probability = round(c(P_beaumont, P_july, P_dec2012),3)
)

kable(prob_table, caption = "Probabilities of Selected Events")

Probabilities of Selected Events
event	probability
city=Beaumont	0.250
month=July	0.083
month=Dec & year=2012	0.017

The probabilities show that a random observation has a 25% chance of being from Beaumont, making it the most likely among the listed events. The chance of selecting a record from July is much lower at about 8.3%, reflecting the uniform spread of months across the dataset. Also, the probability of finding a record specifically from December 2012 is just 1.7%, highlighting how rare it is when narrowing down to both a specific month and year.

6. Create new variables (average price, ad-effectiveness proxy)

# Create new variables
df <- df %>%
  mutate(
    avg_price = volume / sales,  # Average property price
    ad_effectiveness = sales / listings  # Effectiveness of ads (conversion rate)
  )


head(df %>% select(city, year, month, sales, volume, listings, avg_price, ad_effectiveness))

# Overall Summary of the new variables
summary_new_vars <- df %>%
  
  summarise(
    avg_price_mean = round(mean(avg_price, na.rm = TRUE),2),
    avg_price_median = round(median(avg_price, na.rm = TRUE),2),
    avg_price_sd = round(sd(avg_price, na.rm = TRUE),2),
    ad_effectiveness_mean = round(mean(ad_effectiveness, na.rm = TRUE),2),
    ad_effectiveness_median = round(median(ad_effectiveness, na.rm = TRUE),2),
    ad_effectiveness_sd = round(sd(ad_effectiveness, na.rm = TRUE),2)
  )
  
kable(summary_new_vars, caption = "Summary of New Variables")

Summary of New Variables
avg_price_mean	avg_price_median	avg_price_sd	ad_effectiveness_mean	ad_effectiveness_median	ad_effectiveness_sd
0.15	0.16	0.03	0.12	0.11	0.05

The results show that the average property price (avg_price) remains relatively stable, with the mean (0.15) and median (0.16) being very close, suggesting a fairly symmetric distribution and low dispersion (SD ≈ 0.03). In contrast, the effectiveness of sales ads (ad_effectiveness) is more varied, with a much smaller mean (0.12) and median (0.11), but a relatively higher spread (SD ≈ 0.05) compared to its scale.

7. Conditional analysis (by city, year, month)

# Summary by cities
city_summary <- df %>%
  group_by(city) %>%
  summarise(across(c(sales, volume, MedianP, listings, MonthsI),
                   list(mean = ~round(mean(. , na.rm = TRUE),2),
                        sd = ~round(sd(. , na.rm = TRUE),2),
                        cv = ~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2)),
                   .names = "{.col}_{.fn}"))
# Summary by years
year_summary <- df %>%
  group_by(year) %>%
  summarise(across(c(sales, volume, MedianP, listings, MonthsI),
                   list(mean = ~round(mean(. , na.rm = TRUE),2),
                        sd = ~round(sd(. , na.rm = TRUE),2),
                        cv = ~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2)),
                   .names = "{.col}_{.fn}"))
# Summary by months
month_summary <- df %>%
  group_by(month) %>%
  summarise(across(c(sales, volume, MedianP, listings, MonthsI),
                   list(mean = ~round(mean(. , na.rm = TRUE),2),
                        sd = ~round(sd(. , na.rm = TRUE),2),
                        cv = ~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2)),
                   .names = "{.col}_{.fn}"))

# CITY LEVEL

# Sales mean by city
ggplot(city_summary, aes(x = city, y = sales_mean)) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  labs(title = "Average Sales by City", x = "City", y = "Mean Sales") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Coefficient of variation of volume by city
ggplot(city_summary, aes(x = city, y = volume_cv)) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  labs(title = "Volume CV by City", x = "City", y = "CV (Volume)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# YEAR LEVEL

# Median price trend over years
ggplot(year_summary, aes(x = year, y = MedianP_mean)) +
  geom_line(color = "steelblue", size = 1.2, group = 1) +
  geom_point(color = "steelblue", size = 2) +
  theme_minimal() +
  labs(title = "Average Median Price Over Years", x = "Year", y = "Mean Median Price")

# Listings variability (SD) over years
ggplot(year_summary, aes(x = year, y = listings_sd)) +
  geom_line(color = "steelblue", size = 1.2, group = 1) +
  geom_point(color = "steelblue", size = 2) +
  theme_minimal() +
  labs(title = "Listings Standard Deviation Over Years", x = "Year", y = "Listings SD")

# MONTH LEVEL

# Months inventory trend over months
ggplot(month_summary, aes(x = month, y = MonthsI_mean)) +
  geom_line(color = "steelblue", size = 1.2, group = 1) +
  geom_point(color = "steelblue", size = 2) +
  theme_minimal() +
  labs(title = "Average Inventory by Month", x = "Month", y = "Mean Months Inventory")

# CV of sales by month
ggplot(month_summary, aes(x = month, y = sales_cv)) +
  geom_line(color = "steelblue", size = 1.2, group = 1) +
  geom_point(color = "steelblue", size = 2) +
  theme_minimal() +
  labs(title = "Sales CV by Month", x = "Month", y = "Coefficient of Variation")

9. Figures: ggplot2 visualizations

# Boxplot: Distribution of median prices by year and city 
ggplot(df, aes(x = factor(year), y = MedianP, fill = city)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of Median Prices by Year and City",
       x = "Year", y = "Median Price") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Bar Chart: Total sales by month and city, faceted by year
sales_by_month_city <- df %>%
  group_by(year, city, month) %>%
  summarise(total_sales = sum(sales, na.rm = TRUE), .groups = "drop")

ggplot(sales_by_month_city, aes(x = factor(month), y = total_sales, fill = city)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~year) +
  theme_minimal() +
  labs(title = "Total Sales by Month and City (Faceted by Year)",
       x = "Month", y = "Total Sales") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Line Chart: Sales trends over time using year-month date
df <- df %>%
  mutate(date = make_date(year, month, 1)) 

sales_trend <- df %>%
  group_by(date) %>%
  summarise(avg_sales = mean(sales, na.rm = TRUE), .groups = "drop")

ggplot(sales_trend, aes(x = date, y = avg_sales)) +
  geom_line(color = "steelblue", size = 1.2) +
  geom_point(color = "darkblue", size = 2) +
  theme_minimal() +
  labs(title = "Sales Trends Over Time",
       x = "Date", y = "Average Sales")

9. Summary & recommendations

This analysis explored property sales data across cities, months, and years, using descriptive statistics, probability analysis, conditional summaries, and visualization techniques to identify key market dynamics.

Descriptive Statistics:

Median Price had the highest variability (variance ≈ 5.14 × 10⁸), showing wide fluctuations across cities and time.
Volume showed the highest skewness (0.88), indicating that most sales are clustered at lower volumes with occasional high outliers.
Average property prices were stable, with a mean around 132,665, while ad effectiveness averaged 0.15, suggesting moderate consistency in marketing performance.

Frequency Distributions:

Each month (Jan–Dec) had 20 entries, showing balanced seasonal coverage.
Years 2010–2014 had 48 entries each, ensuring even time representation.
Cities were evenly distributed (60 entries each), allowing fair comparison.

Probability Analysis:

The probability of a row being from Beaumont is 25%.
The chance of a property listed in July is 8.3%, reflecting seasonal demand.
The probability of a property from December 2012 is only 1.7%, showing rare specificity when both time and year conditions combine.

Class Intervals (Median Price):

Most properties fall into the High (84) and Medium (73) price classes, while Very Low (18) and Very High (25) are less common.
The Gini Index (0.93) indicated that the model demonstrated excellent discriminatory power.

Conditional Analysis:

By city, Bryan–College Station leads in higher median prices, while Wichita Falls remains most affordable.
Seasonal trends show mid-year sales peaks across all cities.
Yearly trends reveal steady growth in sales and volumes from 2010 to 2014.

Visual Insights:

Sales Trends Over Time (Line Chart)

The chart shows a clear upward trend in average sales from 2010 to 2014.
Sales fluctuate seasonally, with peaks and dips within each year, suggesting cyclical demand in the housing market (likely linked to seasonal buying patterns).
By 2014, average sales are notably higher than in 2010, indicating overall growth in housing activity.

Distribution of Median Prices by Year and City (Boxplots)

Bryan-College Station consistently has the highest median prices, followed by Tyler, Beaumont, and Wichita Falls (lowest).
Across all cities, there is a general upward movement in median prices over time, especially noticeable in Bryan-College Station after 2012.
The spread of prices is relatively stable within each city, though some outliers are visible

Total Sales by Month and City (Bar Charts, Faceted by Year)

Seasonal variation is evident: sales tend to peak around mid-year (spring/summer months) and drop in late fall/winter.
Bryan-College Station and Tyler generally report the highest sales volumes compared to Beaumont and Wichita Falls.
Over the years, the overall sales volume appears to increase, supporting the trend seen in the line chart.

Conclusions & Recommendations

Growth Opportunity: The market shows a clear upward trend, making it attractive for long-term investments but also highlight the need to address affordability challenges.
Target Cities: Focus on Tyler and Bryan–College Station for stronger sales potential, while positioning Wichita Falls for affordability-driven buyers.
Affordability Balance: Address disparities between high-priced markets and affordable regions to capture broader demand.
Marketing Efficiency: With ad effectiveness averaging around 0.15, campaigns should be optimized to raise conversion rates.
Sales show strong seasonality, peaking in spring and summer, suggesting marketing and inventory strategies should align with these cycles.