Project Description

This project analyzes the Texas housing market from 2010 to 2014. The dataset includes four cities: Beaumont, Bryan-College Station, Tyler, and Wichita Falls. The aim is to understand spatial and temporal dynamics in real estate through descriptive statistics, visualizations, and derived indicators such as listing effectiveness.

1 – Variable Analysis

The dataset contains both qualitative and quantitative variables related to real estate activity in Texas. Below is a breakdown of each variable:

Regarding the treatment of time-related variables (year and month), two strategies are possible: 1. Merge them into a single ‘date’ variable (e.g., “MM-YYYY”) to perform time series analysis. 2. Analyze them separately to detect annual trends and monthly seasonality.

Type of analysis by variable: - city: Comparison across cities using distributions, mean, and median. - year/month: Useful for identifying trends and seasonality in sales and listings. - sales/volume: Analyze trends, inter-city comparisons, and correlations. - median_price: Monitor price evolution and relationship with listings and sales. - listings: Evaluate supply and its link with sales and inventory time. - months_inventory: Measures market efficiency; useful for regression or correlation analysis.

Summary statistics such as measures of central tendency (mean, median, mode), dispersion (variance, standard deviation, CV, IQR), and shape (skewness, kurtosis) will be applied to all quantitative variables where appropriate.

2 – Descriptive Statistics

We calculate measures of central tendency (mean, median, mode), dispersion (standard deviation, variance, range, interquartile range), coefficient of variation, and distribution shape (skewness and kurtosis) for all relevant quantitative variables.

data <- read.csv("~/Downloads/realestate_texas.csv")
attach(data)
if(!require(moments)) install.packages("moments")
library(moments)

# Creating a function to calculate the mode in R
get_mode <- function(x) {
  uniq <- unique(x)
  uniq[which.max(tabulate(match(x, uniq)))]
}

#mean
sales_mean <- mean(sales)
volume_mean <- mean(volume)
listings_mean <- mean(listings)
medianprice_mean <- mean(median_price)
monthsinvet_mean <- mean(months_inventory)

#mode
sales_mode <- get_mode(sales)
volume_mode <- get_mode(volume)
listings_mode <- get_mode(listings)
medianprice_mode <- get_mode(median_price)
monthsinvet_mode <- get_mode(months_inventory)

#median
sales_median <- median(sales)
volume_median <- median(volume)
listings_median <- median(listings)
medianprice_median <- median(median_price)
monthsinvet_median <- median(months_inventory)

#standard deviation
sales_sd <- sd(sales)
volume_sd <- sd(volume)
listings_sd <- sd(listings)
medianprice_sd <- sd(median_price)
monthsinvet_sd <- sd(months_inventory)

#variance
sales_var <- var(sales)
volume_var <- var(volume)
listings_var <- var(listings)
medianprice_var <- var(median_price)
monthsinvet_var <- var(months_inventory)

#range
sales_range <- diff(range(sales))
volume_range <- diff(range(volume))
listings_range <- diff(range(listings))
medianprice_range <- diff(range(median_price))
monthsinvet_range <- diff(range(months_inventory))

#CV
sales_CV <- (sales_sd/sales_mean)*100
volume_CV <- (volume_sd / volume_mean) * 100
medianprice_CV <- (medianprice_sd / medianprice_mean) * 100
listings_CV <- (listings_sd / listings_mean) * 100
monthsinvet_CV <- (monthsinvet_sd / monthsinvet_mean) * 100

#IQR
sales_IQR <- IQR(sales)
volume_IQR <- IQR(volume)
listings_IQR <- IQR(listings)
medianprice_IQR <- IQR(median_price)
monthsinvet_IQR <- IQR(months_inventory)

#skewnss and kurtosis
sales_skew <- skewness(sales)
sales_kurt <- kurtosis(sales) - 3 # Subtracting 3 from the kurtosis index because the 'moments' package computes it centered at 3

volume_skew <- skewness(volume)
volume_kurt <- kurtosis(volume) - 3

medianprice_skew <- skewness(median_price)
medianprice_kurt <- kurtosis(median_price) - 3

listings_skew <- skewness(listings)
listings_kurt <- kurtosis(listings) - 3

monthsinvet_skew <- skewness(months_inventory)
monthsinvet_kurt <- kurtosis(months_inventory) - 3

results <- data.frame(
  Variable = c("sales", "volume", "median_price", "listings", "months_inventory"),
  Mean = c(sales_mean, volume_mean, medianprice_mean, listings_mean, monthsinvet_mean),
  Median = c(sales_median, volume_median, medianprice_median, listings_median, monthsinvet_median),
  Mode = c(sales_mode, volume_mode, medianprice_mode, listings_mode, monthsinvet_mode),
  SD = c(sales_sd, volume_sd, medianprice_sd, listings_sd, monthsinvet_sd),
  Variance = c(sales_var, volume_var, medianprice_var, listings_var, monthsinvet_var),
  CV_percent = c(sales_CV, volume_CV, medianprice_CV, listings_CV, monthsinvet_CV),
  Range = c(sales_range, volume_range, medianprice_range, listings_range, monthsinvet_range),
  IQR = c(sales_IQR, volume_IQR, medianprice_IQR, listings_IQR, monthsinvet_IQR),
  Skewness = c(sales_skew, volume_skew, medianprice_skew, listings_skew, monthsinvet_skew),
  Kurtosis = c(sales_kurt, volume_kurt, medianprice_kurt, listings_kurt, monthsinvet_kurt)
)

knitr::kable(results)
Variable Mean Median Mode SD Variance CV_percent Range IQR Skewness Kurtosis
sales 192.29167 175.5000 124.000 79.651111 6.344300e+03 41.42203 344.000 120.0000 0.7181040 -0.3131764
volume 31.00519 27.0625 35.335 16.651447 2.772707e+02 53.70536 75.381 23.2335 0.8847420 0.1769870
median_price 132665.41667 134500.0000 130000.000 22662.148687 5.135730e+08 17.08218 106200.000 32750.0000 -0.3645529 -0.6229618
listings 1738.02083 1618.5000 1581.000 752.707756 5.665690e+05 43.30833 2553.000 1029.5000 0.6494982 -0.7917900
months_inventory 9.19250 8.9500 8.100 2.303669 5.306889e+00 25.06031 11.500 3.1500 0.0409753 -0.1744475

Quantitative variables (sales, volume, median_price, listings, months_inventory) were analyzed using measures of central tendency (mean, median, mode), dispersion (standard deviation, variance, range, IQR, coefficient of variation), and shape (skewness and excess kurtosis).

The variable ‘volume’ shows the highest relative variability (CV ≈ 53.7%) and the most pronounced positive skewness, suggesting a distribution with some exceptionally high-value sales.

median_price has the lowest relative variability (CV ≈ 17.1%) and a slight negative skewness, indicating a concentration of prices above the mean and some lower outliers.

months_inventory is nearly symmetrical (skewness ≈ 0.04) with moderate variability. Its kurtosis is slightly negative, suggesting a distribution flatter than normal (platykurtic).

In general, all variables have negative or near-zero excess kurtosis, indicating distributions that are not sharply peaked and show limited presence of extreme values.

For the categorical and temporal variables (city, year, month), absolute, relative, and cumulative frequency distributions were created. All are perfectly balanced, ensuring fair and unbiased comparisons across cities and time periods.

3 – Variability and Skewness

These insights indicate that real estate transaction values in Texas include some outliers or very high-value sales.

The coefficient of variation (CV) and skewness were compared across variables to identify which had the highest relative variability and most asymmetrical distribution.

CV is useful because it normalizes variability relative to the mean, making it easier to compare variables on different scales (e.g., dollars vs. months).

Skewness helps identify how asymmetrical a distribution is—positive skewness indicates a long right tail, negative skewness a long left tail.

var_max_cv <- results$Variable[which.max(results$CV_percent)]
var_max_skew <- results$Variable[which.max(results$Skewness)]
cv_val <- round(max(results$CV_percent), 2)
skew_val <- round(max(results$Skewness), 2)

The variable with the highest relative variability is volume (CV ≈ 53.7%). The variable with the most pronounced positive skewness is also volume.

4 – Frequency Distribution and Gini Index

sales_clas <- cut(data$sales, breaks = 5, right=FALSE)
distr_freq_sales <- as.data.frame(
  cbind(
    ni_sales = table(sales_clas),
    fi_sales = table(sales_clas)/length(data$sales),
    Ni_sales = cumsum(table(sales_clas)),
    Fi_sales = cumsum(table(sales_clas)/length(data$sales))
  )
)

distr_freq_sales$Classe <- rownames(distr_freq_sales)
rownames(distr_freq_sales) <- NULL
library(ggplot2)

ggplot(data = distr_freq_sales, aes(x = Classe, y = ni_sales)) +
  geom_col(fill = "steelblue") +
  labs(title = "Distribution of 'sales' by class", x = "Sales class intervals", y = "Frequency") +
  theme_minimal()

gini.index <- function(x) {
  fi <- x / sum(x)
  J <- length(x)
  gini <- 1 - sum(fi^2)
  gini / ((J - 1) / J)
}
gini.index(distr_freq_sales$ni_sales)
## [1] 0.9132813

Gini index based on sales class frequencies

The normalized Gini index calculated on the class frequencies is 0.913, indicating a high level of heterogeneity in the distribution of sales.
This means that a large portion of the observations is concentrated in a few intervals, while other classes are much less populated.

5 – Probability Calculations

We calculated the probability that a randomly selected row:
- Refers to the city “Beaumont”.
- Refers to the month of July (7).
- Refers to December 2012.

n_total <- nrow(data)
prob_beaumont <- sum(data$city == "Beaumont") / n_total
prob_july <- sum(data$month == 7) / n_total
prob_dec_2012 <- sum(data$month == 12 & data$year == 2012) / n_total

prob_beaumont
## [1] 0.25
prob_july
## [1] 0.08333333
prob_dec_2012
## [1] 0.01666667

These calculations help understand dataset balance and the distribution of specific time periods and cities.

6 – New Variables

Two new variables were created:
1. mean_price calculated as volume divided by sales (converted to dollars).
2. listings_effectiveness calculated as the ratio of sales to listings.

data$mean_price <- data$volume * 1e6 / data$sales
data$listings_effectiveness <- data$sales / data$listings

Correlations were also calculated to explore relationships between variables. Sales are strongly correlated with volume, and moderately with listings and median price.
Listings and months_inventory also show a high correlation, suggesting more listings lead to longer inventory time.
Months_inventory has weak or no correlation with sales, indicating it may depend on supply factors rather than transaction volume.

cor_matrix <- round(cor(data[, c("sales", "listings", "volume", "median_price", "months_inventory")]), 2)
cor_matrix
##                  sales listings volume median_price months_inventory
## sales             1.00     0.62   0.98         0.59             0.15
## listings          0.62     1.00   0.57         0.40             0.74
## volume            0.98     0.57   1.00         0.70             0.06
## median_price      0.59     0.40   0.70         1.00            -0.04
## months_inventory  0.15     0.74   0.06        -0.04             1.00
ggplot(data, aes(x = listings, y = sales, color = listings_effectiveness)) +
  geom_point() +
  scale_color_gradient(low = "lightblue", high = "darkred") +
  labs(
    title = "Listings vs Sales colored by Listings Effectiveness",
    x = "Listings",
    y = "Sales",
    color = "listings Effectiveness"
  ) +
  theme_minimal()

The scatter plot reveals that when the number of active listings exceeds 2000, the effectiveness of the ads tends to drop significantly. In many cases, a lower number of listings leads to a similar number ofsales, indicating that volume alone does not guarantee better performance. This suggests that ad effectiveness may be influenced by contextual factorssuch as the city, average property price, or seasonal trends in the real estate market.

ggplot(data, aes(x = median_price, y = listings_effectiveness)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "darkred") +
  labs(
    title = "Listings Effectiveness vs Median Price",
    x = "Mean Sale Price",
    y = "Sales / Listings Ratio"
  ) +
  theme_minimal()

The scatter plot reveals a slight positive trend between median sale price and ad effectiveness.However, the relationship is weak and the data is highly dispersed, indicating that median price alone does not strongly explain listing performance. Additional factors such as location, timing, and demand are likely influencing the observed effectiveness more significantly.

ggplot(data, aes(x = city, y = listings_effectiveness)) +
  geom_boxplot(fill = "skyblue") +
  labs(
    title = "Listings Effectiveness by City",
    x = "City",
    y = "Sales / Listings Ratio"
  ) +
  theme_minimal()

The analysis of ad effectiveness (sales / listings) across cities reveals meaningful differences:
- Bryan-College Station stands out with the highest average effectiveness, although with considerable variability and several outliers.
- Wichita Falls combines decent effectiveness with more consistent performance, suggesting a well-balanced market.
- Beaumont and Tyler exhibit lower average effectiveness but more stable behavior.

In conclusion, geographic location does influence ad performance, thoughit is likely not the sole explanatory factor.

ggplot(data, aes(x = factor(month), y = listings_effectiveness)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "Listings Effectiveness by Month", x = "Month", y = "Sales / Listings")

Listings effectiveness measures how many sales each listing generates. Boxplots by city and month reveal geographic differences and seasonal patterns, with central months of the year and certain cities showing higher listing performance.
Additionally, the scatter plot between median sale price and ad effectiveness reveals a slight positive correlation: as the median price increases, listing effectiveness also tends to rise slightly.
This suggests that higher-priced markets might experience slightly better listing performance.

7 – Conditional Analysis

We analyzed listing effectiveness by grouping data by city, year, and month.

Using dplyr, we calculated mean and standard deviation for each group. Visualizations such as heatmaps and error bar charts show how listing effectiveness changes over time and across cities.

library(dplyr)
summary_data <- data %>%
  group_by(city, year, month) %>%
  summarise(
    mean_effectiveness = mean(listings_effectiveness, na.rm = TRUE),
    sd_effectiveness = sd(listings_effectiveness, na.rm = TRUE),
    .groups = 'drop'
  )

# Heatmap
ggplot(summary_data, aes(x = factor(month), y = city, fill = mean_effectiveness)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(
    title = "Listings Effectiveness Heatmap",
    x = "Month",
    y = "City",
    fill = "Effectiveness"
  ) +
  theme_minimal()

ggplot(summary_data, aes(x = month, y = mean_effectiveness, color = city)) +
  geom_line(size = 1) +
  geom_errorbar(
    aes(ymin = mean_effectiveness - sd_effectiveness,
        ymax = mean_effectiveness + sd_effectiveness),
    width = 0.3, alpha = 0.5
  ) +
  facet_wrap(~ year) +
  scale_x_continuous(breaks = seq(2,12,2)) +
  labs(
    title = "Monthly Listings Effectiveness with Standard Deviation",
    x = "Month",
    y = "Effectiveness ± SD",
    color = "City"
  ) +
  theme_minimal()

The two charts provide a complementary view of the real estate listings effectiveness over time.

The first graph (line chart with standard deviation) illustrates the monthly trend in listing effectiveness from 2010 to 2014 across all cities. Bryan-College Station clearly shows a strong increase in effectiveness during mid-year months, especially from 2013 to 2014, peaking between May and July. The other cities display more stable and generally lower effectiveness.

The second graph (heatmap) summarizes the average monthly effectiveness per city, disregarding yearly differences. Again, Bryan-College Station stands out with consistently higher effectiveness during summer months. The color intensity highlights the influence of seasonality and helps identify the best periods to maximize listing impact.

In summary, Bryan-College Station shows strong positive seasonality, while the other cities remain more stable but with lower average effectiveness.

8 - Visualizations

# Boxplot median price by city
ggplot(data, aes(x = city, y = median_price)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Median Price by City", x = "City", y = "Median Price") +
  theme_minimal()

# Total sales by month and city
monthly_sales <- data %>% group_by(month, city) %>% summarise(total_sales = sum(sales), .groups = "drop")

ggplot(monthly_sales, aes(x = factor(month), y = total_sales, fill = city)) +
  geom_col(position = "dodge") +
  labs(title = "Total Sales by Month and City", x = "Month", y = "Sales") +
  theme_minimal()

# Normalized bar chart
monthly_sales_norm <- monthly_sales %>% group_by(month) %>% mutate(perc_sales = total_sales / sum(total_sales))

ggplot(monthly_sales_norm, aes(x = factor(month), y = perc_sales, fill = city)) +
  geom_col(position = "fill") +
  labs(title = "Relative Sales Distribution", x = "Month", y = "Percentage") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal()

# Line chart: Sales trends by year and city
monthly_trend <- data %>% group_by(year, month, city) %>% summarise(mean_sales = mean(sales), .groups = "drop")

ggplot(monthly_trend, aes(x = month, y = mean_sales, color = city)) +
  geom_line() +
  facet_wrap(~year) +
  scale_x_continuous(breaks = seq(2, 12, 2)) +
  labs(title = "Monthly Sales Trends", x = "Month", y = "Avg Sales") +
  theme_minimal()

1 - Boxplot of median prices by city. The boxplot highlights that Bryan-College Station has the highest median sale prices, followed by Tyler, Beaumont, and Wichita Falls. Price variability is noticeable, suggesting differences even within the same city.

2 - Bar chart of total sales by month and city. This chart shows a clear seasonal trend: sales rise in summer months (May to July) across all cities. Tyler consistently leads in total sales volume.

3 - Normalized (percentage) bar chart. Highlights the relative share of each city in the total monthly sales. Tyler maintains a strong presence each month, followed by Bryan-College Station. This chart is useful to assess city-level market contribution.

4 - Monthly sales trend by city and year (line chart). Illustrates seasonal and yearly sales patterns. Cities show similar seasonal dynamics, with noticeable peaks mid-year. There’s a general upward trend in sales over time, particularly for Tyler and Bryan-College Station.

9 - Final Considerations

The analysis of Texas real estate data from 2010 to 2014 revealed clear seasonal patterns and significant geographic differences. Property sales peak during the summer months, especially from May to July. Bryan-College Station stands out for its high listing effectiveness, while Tyler leads in total sales volume.

The highest coefficient of variation (CV) was found in the ‘volume’ variable, indicating a wide dispersion in total sales values. This same variable also showed the highest skewness, suggesting the presence of some exceptionally large transactions.

The custom indicator of listing effectiveness (sales/listings) proved useful to identify inefficiencies: a high number of listings does not necessarily result in higher sales, pointing to the need for more targeted strategies.

Visualizations confirmed the influence of both geography and time, with seasonality emerging as a key factor in marketing performance. These insights provide valuable support for optimizing sales strategies and identifying cities with the greatest growth potential.