The Texas real estate market has always been dynamic, shaped by economic growth, demographic shifts, and evolving buyer preferences. With rising housing demand and fluctuating market conditions, understanding historical sales trends is essential for making informed decisions. This report explores home sales data across Texas, uncovering key patterns in prices, sales volumes, and inventory. By combining statistical analysis with clear visualizations, the goal is to highlight opportunities for growth and provide insights that can help optimize real estate listings and marketing strategies.
# Load data & quick look
df <- read.csv("realestate_texas.csv")
df <- df |>
mutate(
city = as.factor(city),
year = as.integer(year),
month = as.integer(month),
sales = as.integer(sales),
volume = as.numeric(volume),
median_price = as.numeric(median_price),
listings = as.integer(listings),
months_inventory = as.numeric(months_inventory),
date = make_date(year, month, 1),
month_name = month(date, label = TRUE, abbr = TRUE)
) |>
arrange(city, date)
df <- df |>
rename(
MonthsI = months_inventory,
MonthN = month_name,
MedianP = median_price
)
glimpse(df)
## Rows: 240
## Columns: 10
## $ city <fct> Beaumont, Beaumont, Beaumont, Beaumont, Beaumont, Beaumont, B…
## $ year <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2…
## $ month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8…
## $ sales <int> 83, 108, 182, 200, 202, 189, 164, 174, 124, 150, 150, 148, 10…
## $ volume <dbl> 14.162, 17.690, 28.701, 26.819, 28.833, 27.219, 22.706, 25.23…
## $ MedianP <dbl> 163800, 138200, 122400, 123200, 123100, 122800, 124300, 13680…
## $ listings <int> 1533, 1586, 1689, 1708, 1771, 1803, 1857, 1830, 1829, 1779, 1…
## $ MonthsI <dbl> 9.5, 10.0, 10.6, 10.6, 10.9, 11.1, 11.7, 11.6, 11.7, 11.5, 11…
## $ date <date> 2010-01-01, 2010-02-01, 2010-03-01, 2010-04-01, 2010-05-01, …
## $ MonthN <ord> Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, J…
var_table <- tibble(
variable = names(df),
class = map_chr(df, ~class(.x)[1])
)
kable(var_table, caption = "Variable Types in Dataset")
variable | class |
---|---|
city | factor |
year | integer |
month | integer |
sales | integer |
volume | numeric |
MedianP | numeric |
listings | integer |
MonthsI | numeric |
date | Date |
MonthN | ordered |
The dataset contains a mix of categorical, temporal, and quantitative variables. The categorical variable is city, which identifies the geographical location of sales. Temporal variables include year and month, which together create a natural time dimension useful for trend and seasonality analysis. Quantitative variables include sales (count data), volume (continuous, representing dollar value in millions), median_price (continuous, in dollars), listings (count of active properties), and months_inventory (continuous, representing time to sell). The time-related variables (year and month) should be combined into a proper date format to allow for time-series analysis and visualization. Descriptive statistics and frequency distributions are appropriate for sales and listings, while measures of central tendency, variability, and distribution shape can be applied to continuous variables like volume, median_price, and months_inventory. Additionally, temporal analysis can be used to identify seasonality, growth trends, and market cycles across cities.
## Position, Variability, and Shape Indices
# Select numeric variables
numeric_vars <-df %>%
select(sales, volume, MedianP, listings, MonthsI)
# Summary statistics (position + variability)
variability_cv <- numeric_vars %>%
summarise(
across(everything(),
list(
mean = ~round(mean(. , na.rm = TRUE),2),
median = ~round(median(. , na.rm = TRUE),2),
sd = ~round(sd(. , na.rm = TRUE),2),
cv = ~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2),
min = ~round(min(. , na.rm = TRUE),2),
max = ~round(max(. , na.rm = TRUE),2)
),
.names = "{.col}_{.fn}"
)
) %>%
pivot_longer(everything(),
names_to = c("Variable", "Statistic"),
names_sep = "_",
values_to = "Value")
kable(variability_cv)
Variable | Statistic | Value |
---|---|---|
sales | mean | 192.29 |
sales | median | 175.50 |
sales | sd | 79.65 |
sales | cv | 0.41 |
sales | min | 79.00 |
sales | max | 423.00 |
volume | mean | 31.01 |
volume | median | 27.06 |
volume | sd | 16.65 |
volume | cv | 0.54 |
volume | min | 8.17 |
volume | max | 83.55 |
MedianP | mean | 132665.42 |
MedianP | median | 134500.00 |
MedianP | sd | 22662.15 |
MedianP | cv | 0.17 |
MedianP | min | 73800.00 |
MedianP | max | 180000.00 |
listings | mean | 1738.02 |
listings | median | 1618.50 |
listings | sd | 752.71 |
listings | cv | 0.43 |
listings | min | 743.00 |
listings | max | 3296.00 |
MonthsI | mean | 9.19 |
MonthsI | median | 8.95 |
MonthsI | sd | 2.30 |
MonthsI | cv | 0.25 |
MonthsI | min | 3.40 |
MonthsI | max | 14.90 |
# --- Shape indices (skewness & kurtosis) ---
shape_stats <- numeric_vars %>%
summarise(
across(everything(),
list(
skewness = ~skewness(. , na.rm = TRUE),
kurtosis = ~kurtosis(. , na.rm = TRUE)
),
.names = "{.col}_{.fn}"
)
) %>%
pivot_longer(everything(),
names_to = c("Variable", "Statistic"),
names_sep = "_",
values_to = "Value")
kable(shape_stats, caption = "Shape Indices")
Variable | Statistic | Value |
---|---|---|
sales | skewness | 0.7181040 |
sales | kurtosis | 2.6868236 |
volume | skewness | 0.8847420 |
volume | kurtosis | 3.1769870 |
MedianP | skewness | -0.3645529 |
MedianP | kurtosis | 2.3770382 |
listings | skewness | 0.6494982 |
listings | kurtosis | 2.2082100 |
MonthsI | skewness | 0.0409753 |
MonthsI | kurtosis | 2.8255525 |
This section focused on five numerical variables: sales, volume, median_price, listings, and months_inventory.
Sales have a mean of 192.29 with moderate variability (SD = 79.65) and a coefficient of variation of 0.41. The distribution is positively skewed (0.72) with slightly low kurtosis (2.69), indicating that while most months see typical sales levels, some months experience unusually high activity.
Volume shows a mean of 31.01 with relatively high variability (SD = 16.65) and a coefficient of variation of 0.54. The skewness of 0.88 and kurtosis of 3.18 reflect occasional spikes in transaction values.
Median_price averages 132,665.42 with a standard deviation of 22,662.15 (CV = 0.17). The distribution is nearly symmetric (skewness = –0.36, kurtosis = 2.38), suggesting stable housing prices with moderate fluctuations over time.
Listings average 1,738.02, but with large variability (SD = 752.71, CV = 0.43). The positive skewness (0.65) and low kurtosis (2.21) point to significant changes in property availability across months.
Months_inventory has a mean of 9.19 months with low variability (SD = 2.30, CV = 0.25). The skewness is close to zero (0.04) with a kurtosis of 2.83, indicating a balanced and steady supply of housing stock.
# Frequency distribution for categorical / time variables
# City frequency table
table_city <- as.data.frame(table(df$city))
colnames(table_city) <- c("City", "Frequency")
# Year frequency table
table_year <- as.data.frame(table(df$year))
colnames(table_year) <- c("Year", "Frequency")
# Month frequency table
table_month <- as.data.frame(table(df$month))
colnames(table_month) <- c("Month", "Frequency")
# Display with knitr::kable (simple & knitting safe)
library(knitr)
kable(table_city, caption = "Frequency Distribution of Cities")
City | Frequency |
---|---|
Beaumont | 60 |
Bryan-College Station | 60 |
Tyler | 60 |
Wichita Falls | 60 |
kable(table_year, caption = "Frequency Distribution of Years")
Year | Frequency |
---|---|
2010 | 48 |
2011 | 48 |
2012 | 48 |
2013 | 48 |
2014 | 48 |
kable(table_month, caption = "Frequency Distribution of Months")
Month | Frequency |
---|---|
1 | 20 |
2 | 20 |
3 | 20 |
4 | 20 |
5 | 20 |
6 | 20 |
7 | 20 |
8 | 20 |
9 | 20 |
10 | 20 |
11 | 20 |
12 | 20 |
From the results above we can note that the dataset is evenly distributed across time and cities, which makes the analysis more consistent. Each month from January to December appears 20 times, ensuring a balanced view of seasonal patterns. The years from 2010 to 2014 each contribute 48 records, giving us a steady five-year timeline for tracking changes. Likewise, the four cities; Beaumont (60), Bryan-College Station (60), Tyler (60), and Wichita Falls (60) are all equally represented, allowing us to compare locations without bias toward any single city.
# Identify variable with max CV
variability_cv <- numeric_vars %>%
summarise(across(everything(),
~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2))) %>%
pivot_longer(cols = everything(),
names_to = "Variable",
values_to = "CV")
max_cv <- variability_cv %>%
arrange(desc(CV)) %>%
slice(1)
# Identify variable with max skewness
skewness_vals <- numeric_vars %>%
summarise(across(everything(),
~round(skewness(. , na.rm = TRUE),2))) %>%
pivot_longer(cols = everything(),
names_to = "Variable",
values_to = "Skewness")
max_skewness <- skewness_vals %>%
arrange(desc(abs(Skewness))) %>%
slice(1)
kable(max_cv, caption = "Variable with Highest Variability (CV)")
Variable | CV |
---|---|
volume | 0.54 |
kable(max_skewness, caption = "Variable with Highest Skewness")
Variable | Skewness |
---|---|
volume | 0.88 |
Among the variables analyzed, volume shows the highest variability with a coefficient of variation of 0.54. This indicates that transaction values fluctuate widely relative to their mean, pointing to inconsistent market activity across months. In addition, volume also has the highest skewness (0.88), reflecting a right-skewed distribution where most months record moderate transaction volumes, but a few months experience exceptionally high spikes. These conclusions were reached by comparing variability with a coefficient of variation and skewness as an indicator of distribution shape across all the variables.
# Custom Gini index
gini.index <- function(x){
ni = table(x)
fi = ni/length(x)
fi2 = fi^2
J = length(table(x))
gini = 1 - sum(fi2)
gini.norm = gini / ((J-1)/J)
return(gini.norm)
}
# Divide median_price into classes (bins)
df$price_class <- cut(
df$MedianP,
breaks = 5, # divide into 5 equal-width classes
labels = c("Very Low", "Low", "Medium", "High", "Very High")
)
freq_dist <- df %>%
count(price_class)
kable(freq_dist)
price_class | n |
---|---|
Very Low | 18 |
Low | 40 |
Medium | 73 |
High | 84 |
Very High | 25 |
# Bar plot
ggplot(as.data.frame(freq_dist), aes(x = price_class, y = n)) +
geom_col(fill = "steelblue") +
labs(title = "Frequency of Median Price Classes",
x = "Price Class", y = "Count") +
theme_minimal()
# Apply Gini index
gini_value <- round(gini.index(df$price_class),3)
kable(tibble(`Normalized Gini Index` = gini_value))
Normalized Gini Index |
---|
0.926 |
The results show that most properties fall within the Medium (73) and High (84) price categories, while fewer are found in the Very Low (18) and Very High (25) ends of the scale. This suggests that the housing market is fairly balanced, with prices clustering around the middle ranges rather than being heavily concentrated at the extremes. The Gini index of 0.93 indicated that the model demonstrated excellent discriminatory power.
P_beaumont <- mean(df$city == "Beaumont")
P_july <- mean(df$month == 7)
P_dec2012 <- mean(df$month == 12 & df$year == 2012)
prob_table <- tibble(
event = c("city=Beaumont","month=July","month=Dec & year=2012"),
probability = round(c(P_beaumont, P_july, P_dec2012),3)
)
kable(prob_table, caption = "Probabilities of Selected Events")
event | probability |
---|---|
city=Beaumont | 0.250 |
month=July | 0.083 |
month=Dec & year=2012 | 0.017 |
The probabilities show that a random observation has a 25% chance of being from Beaumont, making it the most likely among the listed events. The chance of selecting a record from July is much lower at about 8.3%, reflecting the uniform spread of months across the dataset. Also, the probability of finding a record specifically from December 2012 is just 1.7%, highlighting how rare it is when narrowing down to both a specific month and year.
# Create new variables
df <- df %>%
mutate(
avg_price = volume / sales, # Average property price
ad_effectiveness = sales / listings # Effectiveness of ads (conversion rate)
)
head(df %>% select(city, year, month, sales, volume, listings, avg_price, ad_effectiveness))
# Overall Summary of the new variables
summary_new_vars <- df %>%
summarise(
avg_price_mean = round(mean(avg_price, na.rm = TRUE),2),
avg_price_median = round(median(avg_price, na.rm = TRUE),2),
avg_price_sd = round(sd(avg_price, na.rm = TRUE),2),
ad_effectiveness_mean = round(mean(ad_effectiveness, na.rm = TRUE),2),
ad_effectiveness_median = round(median(ad_effectiveness, na.rm = TRUE),2),
ad_effectiveness_sd = round(sd(ad_effectiveness, na.rm = TRUE),2)
)
kable(summary_new_vars, caption = "Summary of New Variables")
avg_price_mean | avg_price_median | avg_price_sd | ad_effectiveness_mean | ad_effectiveness_median | ad_effectiveness_sd |
---|---|---|---|---|---|
0.15 | 0.16 | 0.03 | 0.12 | 0.11 | 0.05 |
The results show that the average property price (avg_price) remains relatively stable, with the mean (0.15) and median (0.16) being very close, suggesting a fairly symmetric distribution and low dispersion (SD ≈ 0.03). In contrast, the effectiveness of sales ads (ad_effectiveness) is more varied, with a much smaller mean (0.12) and median (0.11), but a relatively higher spread (SD ≈ 0.05) compared to its scale.
# Summary by cities
city_summary <- df %>%
group_by(city) %>%
summarise(across(c(sales, volume, MedianP, listings, MonthsI),
list(mean = ~round(mean(. , na.rm = TRUE),2),
sd = ~round(sd(. , na.rm = TRUE),2),
cv = ~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2)),
.names = "{.col}_{.fn}"))
# Summary by years
year_summary <- df %>%
group_by(year) %>%
summarise(across(c(sales, volume, MedianP, listings, MonthsI),
list(mean = ~round(mean(. , na.rm = TRUE),2),
sd = ~round(sd(. , na.rm = TRUE),2),
cv = ~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2)),
.names = "{.col}_{.fn}"))
# Summary by months
month_summary <- df %>%
group_by(month) %>%
summarise(across(c(sales, volume, MedianP, listings, MonthsI),
list(mean = ~round(mean(. , na.rm = TRUE),2),
sd = ~round(sd(. , na.rm = TRUE),2),
cv = ~round(sd(. , na.rm = TRUE)/mean(. , na.rm = TRUE),2)),
.names = "{.col}_{.fn}"))
# CITY LEVEL
# Sales mean by city
ggplot(city_summary, aes(x = city, y = sales_mean)) +
geom_col(fill = "steelblue") +
theme_minimal() +
labs(title = "Average Sales by City", x = "City", y = "Mean Sales") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Coefficient of variation of volume by city
ggplot(city_summary, aes(x = city, y = volume_cv)) +
geom_col(fill = "steelblue") +
theme_minimal() +
labs(title = "Volume CV by City", x = "City", y = "CV (Volume)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# YEAR LEVEL
# Median price trend over years
ggplot(year_summary, aes(x = year, y = MedianP_mean)) +
geom_line(color = "steelblue", size = 1.2, group = 1) +
geom_point(color = "steelblue", size = 2) +
theme_minimal() +
labs(title = "Average Median Price Over Years", x = "Year", y = "Mean Median Price")
# Listings variability (SD) over years
ggplot(year_summary, aes(x = year, y = listings_sd)) +
geom_line(color = "steelblue", size = 1.2, group = 1) +
geom_point(color = "steelblue", size = 2) +
theme_minimal() +
labs(title = "Listings Standard Deviation Over Years", x = "Year", y = "Listings SD")
# MONTH LEVEL
# Months inventory trend over months
ggplot(month_summary, aes(x = month, y = MonthsI_mean)) +
geom_line(color = "steelblue", size = 1.2, group = 1) +
geom_point(color = "steelblue", size = 2) +
theme_minimal() +
labs(title = "Average Inventory by Month", x = "Month", y = "Mean Months Inventory")
# CV of sales by month
ggplot(month_summary, aes(x = month, y = sales_cv)) +
geom_line(color = "steelblue", size = 1.2, group = 1) +
geom_point(color = "steelblue", size = 2) +
theme_minimal() +
labs(title = "Sales CV by Month", x = "Month", y = "Coefficient of Variation")
# Boxplot: Distribution of median prices by year and city
ggplot(df, aes(x = factor(year), y = MedianP, fill = city)) +
geom_boxplot(alpha = 0.7) +
theme_minimal() +
labs(title = "Distribution of Median Prices by Year and City",
x = "Year", y = "Median Price") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Bar Chart: Total sales by month and city, faceted by year
sales_by_month_city <- df %>%
group_by(year, city, month) %>%
summarise(total_sales = sum(sales, na.rm = TRUE), .groups = "drop")
ggplot(sales_by_month_city, aes(x = factor(month), y = total_sales, fill = city)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~year) +
theme_minimal() +
labs(title = "Total Sales by Month and City (Faceted by Year)",
x = "Month", y = "Total Sales") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Line Chart: Sales trends over time using year-month date
df <- df %>%
mutate(date = make_date(year, month, 1))
sales_trend <- df %>%
group_by(date) %>%
summarise(avg_sales = mean(sales, na.rm = TRUE), .groups = "drop")
ggplot(sales_trend, aes(x = date, y = avg_sales)) +
geom_line(color = "steelblue", size = 1.2) +
geom_point(color = "darkblue", size = 2) +
theme_minimal() +
labs(title = "Sales Trends Over Time",
x = "Date", y = "Average Sales")
This analysis explored property sales data across cities, months, and years, using descriptive statistics, probability analysis, conditional summaries, and visualization techniques to identify key market dynamics.
Descriptive Statistics:
Median Price had the highest variability (variance ≈ 5.14 × 10⁸), showing wide fluctuations across cities and time.
Volume showed the highest skewness (0.88), indicating that most sales are clustered at lower volumes with occasional high outliers.
Average property prices were stable, with a mean around 132,665, while ad effectiveness averaged 0.15, suggesting moderate consistency in marketing performance.
Frequency Distributions:
Each month (Jan–Dec) had 20 entries, showing balanced seasonal coverage.
Years 2010–2014 had 48 entries each, ensuring even time representation.
Cities were evenly distributed (60 entries each), allowing fair comparison.
Probability Analysis:
The probability of a row being from Beaumont is 25%.
The chance of a property listed in July is 8.3%, reflecting seasonal demand.
The probability of a property from December 2012 is only 1.7%, showing rare specificity when both time and year conditions combine.
Class Intervals (Median Price):
Most properties fall into the High (84) and Medium (73) price classes, while Very Low (18) and Very High (25) are less common.
The Gini Index (0.93) indicated that the model demonstrated excellent discriminatory power.
Conditional Analysis:
By city, Bryan–College Station leads in higher median prices, while Wichita Falls remains most affordable.
Seasonal trends show mid-year sales peaks across all cities.
Yearly trends reveal steady growth in sales and volumes from 2010 to 2014.
Visual Insights:
The chart shows a clear upward trend in average sales from 2010 to 2014.
Sales fluctuate seasonally, with peaks and dips within each year, suggesting cyclical demand in the housing market (likely linked to seasonal buying patterns).
By 2014, average sales are notably higher than in 2010, indicating overall growth in housing activity.
Bryan-College Station consistently has the highest median prices, followed by Tyler, Beaumont, and Wichita Falls (lowest).
Across all cities, there is a general upward movement in median prices over time, especially noticeable in Bryan-College Station after 2012.
The spread of prices is relatively stable within each city, though some outliers are visible
Seasonal variation is evident: sales tend to peak around mid-year (spring/summer months) and drop in late fall/winter.
Bryan-College Station and Tyler generally report the highest sales volumes compared to Beaumont and Wichita Falls.
Over the years, the overall sales volume appears to increase, supporting the trend seen in the line chart.
Conclusions & Recommendations
Growth Opportunity: The market shows a clear upward trend, making it attractive for long-term investments but also highlight the need to address affordability challenges.
Target Cities: Focus on Tyler and Bryan–College Station for stronger sales potential, while positioning Wichita Falls for affordability-driven buyers.
Affordability Balance: Address disparities between high-priced markets and affordable regions to capture broader demand.
Marketing Efficiency: With ad effectiveness averaging around 0.15, campaigns should be optimized to raise conversion rates.
Sales show strong seasonality, peaking in spring and summer, suggesting marketing and inventory strategies should align with these cycles.