Script for data analysis

This report presents an analysis of the real estate market in Texas based on sales carried out from 2010 to 2014 in the cities of Beaumont, Bryan-College Station, Tyler, and Wichita Falls. Below is a statistical report constructed according to the following approach:

  1. Univariate analysis of the dataset variables with graphical representations, calculation of the main statistical indicators, identification of variables with the greatest variability and skewness, creation of class for some variables, examples of occurrence probabilities within the dataset and creation of new variables such as average price and an effectiveness index of sales listings.

  2. Bivariate analysis to individually explore the relationships between each numerical variable and the categorical variables in the dataset, including graphical representations of relationships, distributions, and main findings.

  3. Bivariate analysis to individually explore the relationships among the numerical variables in the dataset, with investigation of possible linear correlations, linear fitting and residual distribution analysis, nonlinear smooth fitting, and related graphical representations.

  4. Multivariate analysis aimed at finding a plausible model to explain the variability of the dataset and the relationships among its variables, and to generally describe the model’s behavior, including any nonlinearities and group effects, with related graphical representations.

Load dataset

#Set project directory
setwd("C:/Users/matmi/OneDrive/Desktop/ProgettiR")

#List of library to use in project 
library(ggplot2)
library(dplyr)
## 
## Caricamento pacchetto: 'dplyr'
## I seguenti oggetti sono mascherati da 'package:stats':
## 
##     filter, lag
## I seguenti oggetti sono mascherati da 'package:base':
## 
##     intersect, setdiff, setequal, union
library(DescTools)
library(moments)
library(tidyr)
library(corrplot)
## corrplot 0.95 loaded
library(broom)
library(mgcv)
## Caricamento del pacchetto richiesto: nlme
## 
## Caricamento pacchetto: 'nlme'
## Il seguente oggetto è mascherato da 'package:dplyr':
## 
##     collapse
## This is mgcv 1.9-4. For overview type '?mgcv'.
library(lme4)
## Caricamento del pacchetto richiesto: Matrix
## 
## Caricamento pacchetto: 'Matrix'
## I seguenti oggetti sono mascherati da 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Caricamento pacchetto: 'lme4'
## Il seguente oggetto è mascherato da 'package:nlme':
## 
##     lmList
library(ggeffects)
library(knitr)

#Import csv file
data <- read.csv("realestate_texas.csv")

#Attach dataset
attach(data)

#Summary of dataset
summary(data)
##      city                year          month           sales      
##  Length:240         Min.   :2010   Min.   : 1.00   Min.   : 79.0  
##  Class :character   1st Qu.:2011   1st Qu.: 3.75   1st Qu.:127.0  
##  Mode  :character   Median :2012   Median : 6.50   Median :175.5  
##                     Mean   :2012   Mean   : 6.50   Mean   :192.3  
##                     3rd Qu.:2013   3rd Qu.: 9.25   3rd Qu.:247.0  
##                     Max.   :2014   Max.   :12.00   Max.   :423.0  
##      volume        median_price       listings    months_inventory
##  Min.   : 8.166   Min.   : 73800   Min.   : 743   Min.   : 3.400  
##  1st Qu.:17.660   1st Qu.:117300   1st Qu.:1026   1st Qu.: 7.800  
##  Median :27.062   Median :134500   Median :1618   Median : 8.950  
##  Mean   :31.005   Mean   :132665   Mean   :1738   Mean   : 9.193  
##  3rd Qu.:40.893   3rd Qu.:150050   3rd Qu.:2056   3rd Qu.:10.950  
##  Max.   :83.547   Max.   :180000   Max.   :3296   Max.   :14.900

1) Univariate analysis

Variable “city”

This is a categorical variable. First of all we can explore the frequency of each value in variable and calculate Gini Index

#Function for distribution of all variables frequencies
dist_freq <- function(vector) {
  n <- length(vector)
  tab <- table(vector)
  df <- data.frame(
    value = names(tab),
    ni = as.numeric(tab),
    fi = as.numeric(tab) / n,
    Ni = cumsum(as.numeric(tab)),
    Fi = cumsum(as.numeric(tab)) / n
  )
  return(df)
}

dist_city_freq <- dist_freq(city)

#Function for Gini index calculate
gini_index <- function(freq) {
  freq_vec <- as.numeric(freq)
  Gini(freq_vec)
}

gini_city <- gini_index(dist_city_freq$ni)

For the categorical variable “city”, we observe a uniform distribution with equal frequencies across its 4 categories. Consequently, the Gini index is 0, indicating perfect equality in the distribution. Given the categorical nature and the balanced frequencies, this exhaustive analysis sufficiently describes this variable.

#Plot of cities distribution
ggplot(data = data)+
  geom_bar(aes(x = city),
           stat = "count",
           col = "black",
           fill = "blue")+
  labs(title = "City distribution",
       x = "City",
       y = "Abosulte frequencies")+
  theme_classic()

Variable “year”

This is a categorical variable. First of all we can explore the frequency of each value in variable and calculate Gini Index

#Distribution of all variables frequencies
dist_year_freq <- dist_freq(year)

#Gini index calculate
gini_years <- gini_index(dist_year_freq$ni)

For the categorical variable “year”, we observe a uniform distribution with equal frequencies across its 5 categories. Consequently, the Gini index is 0, indicating perfect equality in the distribution. Given the categorical nature and the balanced frequencies, this exhaustive analysis sufficiently describes this variable.

#Plot of years distribution
ggplot(data = data)+
  geom_bar(aes(x = year),
           stat = "count",
           col = "black",
           fill = "blue")+
  labs(title = "Years distribution",
       x = "Years",
       y = "Absolute frequencies")+
  theme_classic()

Variable “month”

This is a categorical variable. First of all we can explore the frequency of each value in variable and calculate Gini Index

#Built correct labels for months
data$month <- sprintf("%02d", as.numeric(data$month))
month_names <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
data$month <- factor(data$month, levels = sprintf("%02d", 1:12), labels = month_names)

#Distribution of all variables frequencies
dist_months_freq <- dist_freq(month)

#Gini index calculate
gini_months <- gini_index(dist_months_freq$ni)

For the categorical variable “month”, we observe a uniform distribution with equal frequencies across its 12 categories. Consequently, the Gini index is 0, indicating perfect equality in the distribution. Given the categorical nature and the balanced frequencies, this exhaustive analysis sufficiently describes this variable.

#Plot of months distribution
ggplot(data = data)+
  geom_bar(aes(x = month),
           stat = "count",
           col = "black",
           fill = "blue")+
  labs(title = "Months distribution",
       x = "Months",
       y = "Aboslute frequencies")+
  theme_classic()

Variable “date”

For time analysis of data we can built a new variable that concatenate year and month

#Variable "date" creation
date <- paste(year, month, "1", sep = "-")
date <- as.Date(date)
data$date <- date

#Distribution of all variables frequencies
dist_dates_freq <- dist_freq(date)

#Gini index calculate
gini_dates <- gini_index(dist_dates_freq$ni)

For the categorical variables “month” and “year”, we observe two uniform distributionswith equal frequencies respectively across their 12 and 5 categories. Consequently, the Gini index of each distribution is 0. Analysis beetwen year and month shows other uniform distribution of month for each years.Starting to paste year and month we built another variable (type date) to show trend and patterns by time

#Plot of month distribution by year (every month is a date "YYYY-MM-01")
ggplot(data = data)+
  geom_bar(aes(x = month),
           stat = "count",
           col = "black",
           fill = "blue")+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months distribution for each year",
       x = "Months",
       y = "Absolute frequencies")+
  scale_y_continuous(breaks = seq(0, 12, 1)) +
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Variable “sales”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index.

  • Range: range of variable is beetwen 79 and 423 number of sales
  • Mean: mean is 192.29 number of sales
  • Median: median is 175.5 number of sales. Mean > Median, so we have positive asimmetry (influenced by high values of outliers)
  • Standard deviation: standard deviation is 79.65. Mean > Median > Standard Deviation. Positive asimmetry is confirmed
  • Variation coefficient: variation coefficient is 41.4%. Generally, the variable shows 41.4% of variability instead its means
  • Skewness: skewness is 0.72. So, it’s > 0 and that confirms positive asimmetry
  • Kurtosis: kurtosis is -0.313. So, the variable “sales” distribution is platykurtic
  • Gini index: Gini index is 0.24. The distribution shows moderate heterogeneity
#Distribution of all variables frequencies
dist_sales_freq <- dist_freq(sales)

#Gini index calculate
gini_sales <- gini_index(dist_sales_freq$ni)

#Aritmetic mean
mean_sales <- mean(sales)

#Median
median_sales <- median(sales)

#Standard deviation
sd_sales <- sd(sales)

#Variation coefficient
var_cf_sales <- (sd_sales/mean_sales)*100

#Range
range_sales <- range(sales)

#Quantile and IQR
q_sales <- quantile(sales, seq(0,1, 0.25))
iqr_sales <- IQR(sales)

#Skewness
skew_sales <- skewness(sales)

#Kurtosis
kurt_sales <- kurtosis(sales)-3

#Function to create a vector with principal statistical index
stat_index_vect <- function (pos, mean_v, median_v, sd_v, cf_v, iqr_v, skew_v, kurt_v) {
  name_v <- colnames(data)[pos]
  v_st_index <- cbind(name_v, mean_v, median_v, sd_v, cf_v, iqr_v, skew_v, kurt_v)
  return(v_st_index)
}

pos_sales <- which(names(data) == "sales")
st_index_sales <- stat_index_vect(pos_sales, mean_sales, median_sales, sd_sales, var_cf_sales, iqr_sales, skew_sales, kurt_sales)
#Plot of sales distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = sales),
               binwidth = 10,
               col = "black",
               fill = "blue")+
  geom_vline(xintercept = mean_sales, col = "red", lwd = 1)+
  geom_vline(xintercept = median_sales, col = "pink", lwd = 1)+
  annotate("label", x = mean_sales + 20, y = 16, label = round(mean_sales, 2), color = "red") +
  annotate("label", x = median_sales - 20, y = 16, label = median_sales, color = "pink") +
  labs(title = "Number of sales distribution",
       x = "Number of sales",
       y = "Absolute frequencies")+
  theme_classic()

#Boxplot for variable "sales"
ggplot(data = data)+
  geom_boxplot(aes(y = sales), fill = "blue")+
  labs(title = "Number of sales boxplot",
           x = "Relative frequencies",
           y = "Number of sales")+
  theme_classic()

Variable “volume”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index

  • Range: range of variable is beetwen 8.17 and 83.55 number of sales
  • Mean: mean is 31 mln $ of sales
  • Median: median is 27.06 mln $ of sales. Mean > Median, so we have positive asimmetry (influenced by high values of outliers) Standard deviation: standard deviation is 16.65. Mean > Median > Standard - Deviation. Positive asimmetry is confirmed
  • Variation coefficient: variation coefficient is 53.7%. Generally, the variable shows 53.7% of variability instead its means
  • Skewness: skewness is 0.88. So, it’s > 0 and that confirms positive asimmetry
  • Kurtosis: kurtosis is 0.17. So, the variable “sales” distribution is leptokurtic
  • Gini index: Gini index is 0.012. The distribution shows moderate heterogeneity
#Distribution of all variables frequencies
dist_volume_freq <- dist_freq(volume)

#Gini index calculate
gini_volume <- gini_index(dist_volume_freq$ni)

#Aritmetic mean
mean_volume <- mean(volume)

#Median
median_volume <- median(volume)

#Standard deviation
sd_volume <- sd(volume)

#Variation coefficient
var_cf_volume <- (sd_volume/mean_volume)*100

#Range
range_volume <- range(volume)

#Quantile and IQR
q_volume <- quantile(volume, seq(0,1, 0.25))
iqr_volume <- IQR(volume)

#Skewness
skew_volume <- skewness(volume)

#Kurtosis
kurt_volume <- kurtosis(volume)-3

pos_volume <- which(names(data) == "volume")
st_index_volume <- stat_index_vect(pos_volume, mean_volume, median_volume, sd_volume, var_cf_volume, iqr_volume, skew_volume, kurt_volume)
#Plot of volume distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = volume),
                 binwidth = 5,
                 col = "black",
                 fill = "blue")+
  geom_vline(xintercept = mean_volume, col = "red", lwd = 1)+
  geom_vline(xintercept = median_volume, col = "pink", lwd = 1)+
  annotate("label", x = mean_volume + 5, y = 30, label = round(mean_volume, 2), color = "red") +
  annotate("label", x = median_volume - 5, y = 30, label = median_volume, color = "pink") +
  labs(title = "Volume of sales distribution",
       x = "Volume of sales (mln $)",
       y = "Absolute frequencies")+
  theme_classic()

#Boxplot for variable "volume"
ggplot(data = data)+
  geom_boxplot(aes(y = volume), fill = "blue")+
  labs(title = "Volume of sales boxplot",
           x = "Relative frequencies",
           y = "Volume of sales (mln $)")+
  theme_classic()

Variable “median_price”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index

  • Range: range of variable is beetwen 73800 and 180000
  • Mean: mean is 132665 $ of sales
  • Median: median is 134500 $ of sales. Mean < Median, so we have negative asimmetry (influenced by low values of outliers)
  • Standard deviation: standard deviation is 22662
  • Variation coefficient: variation coefficient is 17.1%. Generally, the variable shows 17.1% of variability instead its means
  • Skewness: skewness is -0.364. So, it’s < 0 and that confirms negative asimmetry
  • Kurtosis: kurtosis is -0.623. So, the variable “median_price” distribution is mesokurtic
  • Gini index: Gini index is 0.14. The distribution shows moderate heterogeneity
#Distribution of all variables frequencies
dist_median_price_freq <- dist_freq(median_price)

#Gini index calculate
gini_median_price <- gini_index(dist_median_price_freq$ni)

#Aritmetic mean
mean_median_price <- mean(median_price)

#Median
median_median_price <- median(median_price)

#Standard deviation
sd_median_price <- sd(median_price)

#Variation coefficient
var_cf_median_price <- (sd_median_price/mean_median_price)*100

#Range
range_median_price <- range(median_price)

#Quantile and IQR
q_median_price <- quantile(median_price, seq(0,1, 0.25))
iqr_median_price <- IQR(median_price)

#Skewness
skew_median_price <- skewness(median_price)

#Kurtosis
kurt_median_price <- kurtosis(median_price)-3

pos_mprice <- which(names(data) == "median_price")
st_index_mprice <- stat_index_vect(pos_mprice, mean_median_price, median_median_price, sd_median_price, var_cf_median_price, iqr_median_price, skew_median_price, kurt_median_price)
#Plot of median price distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = median_price),
                 binwidth = 2500,
                 col = "black",
                 fill = "blue")+
  geom_vline(xintercept = mean_median_price, col = "red", lwd = 1)+
  geom_vline(xintercept = median_median_price, col = "pink", lwd = 1)+
  annotate("label", x = mean_median_price + 12000, y = 18, label = round(mean_median_price, 2), color = "red") +
  annotate("label", x = median_median_price - 12000, y = 18, label = median_median_price, color = "pink") +
  labs(title = "Median price distribution",
       x = "Median price",
       y = "Absolute frequencies")+
  theme_classic()

#Boxplot for variable "median price"
ggplot(data = data)+
  geom_boxplot(aes(y = median_price), fill = "blue")+
  labs(title = "Median price boxplot",
           x = "Relative frequencies",
           y = "Median price")+
  theme_classic()

Variable “listings”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index.

  • Range: range of variable is beetwen 743 and 3296
  • Mean: mean is 1738.02 of listings
  • Median: median is 1618.5 of listings. Mean > Median, so we have positive asimmetry (influenced by high values of outliers)
  • Standard deviation: standard deviation is 752.71. Mean > Median > Standard Deviation that confirms positive asimmetry
  • Variation coefficient: variation coefficient is 43.31%. Generally, the variable shows 43.31% of variability instead its means
  • Skewness: skewness is 0.649. So, it’s > 0 and that confirms positive asimmetry
  • Kurtosis: kurtosis is -0.792. So, the variable “listings” distribution is mesokurtic
  • Gini index: Gini index is 0.083. The distribution shows moderate heterogeneity
#Distribution of all variables frequencies
dist_listings_freq <- dist_freq(listings)

#Gini index calculate
gini_listings <- gini_index(dist_listings_freq$ni)

#Aritmetic mean
mean_listings <- mean(listings)

#Median
median_listings <- median(listings)

#Standard deviation
sd_listings <- sd(listings)

#Variation coefficient
var_cf_listings <- (sd_listings/mean_listings)*100

#Range
range_listings <- range(listings)

#Quantile and IQR
q_listings <- quantile(listings, seq(0,1, 0.25))
iqr_listings <- IQR(listings)

#Skewness
skew_listings <- skewness(listings)

#Kurtosis
kurt_listings <- kurtosis(listings)-3

pos_listings <- which(names(data) == "listings")
st_index_listings <- stat_index_vect(pos_listings, mean_listings, median_listings, sd_listings, var_cf_listings, iqr_listings, skew_listings, kurt_listings)
#Plot of listings distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = listings),
                 binwidth = 100,
                 col = "black",
                 fill = "blue")+
  geom_vline(xintercept = mean_listings, col = "red", lwd = 1)+
  geom_vline(xintercept = median_listings, col = "pink", lwd = 1)+
  annotate("label", x = mean_listings + 250, y = 20, label = round(mean_listings, 2), color = "red") +
  annotate("label", x = median_listings - 250, y = 20, label = median_listings, color = "pink") +
  labs(title = "Number of listings distribution",
       x = "Number of listings",
       y = "Absolute frequencies")+
  theme_classic()

#Boxplot for variable "listings"
ggplot(data = data)+
  geom_boxplot(aes(y = listings), fill = "blue")+
  labs(title = "Number of listings boxplot",
           x = "Relative frequencies",
           y = "Number of listings")+
  theme_classic()

Variable “months_inventory”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index

  • Range: range of variable is beetwen 3.4 and 14.9
  • Mean: mean is 9.19
  • Median: median is 8.95. Mean > Median, so we have positive asimmetry (influenced by high values of outliers)
  • Standard deviation: standard deviation is 2.3. Mean > Median > Standard Deviation that confirms positive asimmetry
  • Variation coefficient: variation coefficient is 25.06%. Generally, the variable shows 25.06% of variability instead its means
  • Skewness: skewness is 0.041. So, it’s > 0 and that confirms positive asimmetry
  • Kurtosis: kurtosis is -0.174. So, the variable “months_inventory” distribution is mesokurtic
  • Gini index: Gini index is 0.359. The distribution shows moderate heterogeneity
#Distribution of all variables frequencies
dist_minv_freq <- dist_freq(months_inventory)

#Gini index calculate
gini_minv <- gini_index(dist_minv_freq$ni)

#Aritmetic mean
mean_minv <- mean(months_inventory)

#Median
median_minv <- median(months_inventory)

#Standard deviation
sd_minv <- sd(months_inventory)

#Variation coefficient
var_cf_minv <- (sd_minv/mean_minv)*100

#Range
range_minv <- range(months_inventory)

#Quantile and IQR
q_minv <- quantile(months_inventory, seq(0,1, 0.25))
iqr_minv <- IQR(months_inventory)

#Skewness
skew_minv <- skewness(months_inventory)

#Kurtosis
kurt_minv <- kurtosis(months_inventory)-3

pos_minv <- which(names(data) == "months_inventory")
st_index_minv <- stat_index_vect(pos_minv, mean_minv, median_minv, sd_minv, var_cf_minv, iqr_minv, skew_minv, kurt_minv)
#Plot of months inventory distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = months_inventory),
                 binwidth = 1,
                 col = "black",
                 fill = "blue")+
  geom_vline(xintercept = mean_minv, col = "red", lwd = 1)+
  geom_vline(xintercept = median_minv, col = "pink", lwd = 1)+
  annotate("label", x = mean_minv + 0.5, y = 45, label = round(mean_minv, 2), color = "red") +
  annotate("label", x = median_minv - 0.5, y = 45, label = median_minv, color = "pink") +
  labs(title = "Months inventory distribution",
       x = "Months inventory",
       y = "Absolute frequencies")+
  theme_classic()

#Boxplot for variable "months inventory"
ggplot(data = data)+
  geom_boxplot(aes(y = months_inventory), fill = "blue")+
  labs(title = "Months frequencies boxplot",
           x = "Relative frequencies",
           y = "Months inventory")+
  theme_classic()

Other variable analysis

Below are shown some analysis on varibles:
1) Classification of variable in decreasing order of variability
2) Classification of variable in decreasing order of asimmetry
3) Creation of class and its analysis on variables
4) Probability of every occurrency of each variable
5) Creation and analysis of new variable, like mean price and efficiency index of listings

Classification of variable in decreasing order of variability

The variable with the bigger variability is “volume”, with a variation coefficient equal to 53%, while the variable “median_price” has a smaller variability. The results of statistic analysis are shown below

#Creation of matrix of statistical index for each variable in dataset
m_stat_index <- as.data.frame(rbind(
  st_index_sales,
  st_index_volume,
  st_index_mprice,
  st_index_listings,
  st_index_minv
))

m_stat_index_cv <- m_stat_index[order(m_stat_index$cf_v, decreasing = TRUE), ]

#Plot of statistical index table
kable(m_stat_index_cv, caption = "Statistical index for each variable ordinated by variability", row.names = FALSE)
Statistical index for each variable ordinated by variability
name_v mean_v median_v sd_v cf_v iqr_v skew_v kurt_v
volume 31.0051875 27.0625 16.6514471564494 53.7053586805415 23.2335 0.884742026325995 0.176986997089741
listings 1738.02083333333 1618.5 752.707756098841 43.3083275909432 1029.5 0.649498226273971 -0.791790033332591
sales 192.291666666667 175.5 79.6511111777793 41.4220296482492 120 0.718104024884959 -0.313176409071494
months_inventory 9.1925 8.95 2.30366862229334 25.0603059264982 3.15 0.040975265871081 -0.174447541638487
median_price 132665.416666667 134500 22662.148686505 17.0821825732064 32750 -0.364552878177372 -0.622961820755544

Classification of variable in decreasing order of asimmetry

The variable with the bigger asimmetry is “volume” with a skewness of 88,4%, while the variable “months_inventory” has a smaller asimmetry. The results of statistic analysis are shown below

#Matrix ordered by decreasing asimmetry
m_stat_index$skew_v <- as.numeric(as.character(m_stat_index$skew_v))
m_stat_index_as <- m_stat_index[order(abs(m_stat_index$skew_v), decreasing = TRUE), ]

#Plot of statistical index table
kable(m_stat_index_as, caption = "Statistical index for each variable ordinated by skewness", row.names = FALSE)
Statistical index for each variable ordinated by skewness
name_v mean_v median_v sd_v cf_v iqr_v skew_v kurt_v
volume 31.0051875 27.0625 16.6514471564494 53.7053586805415 23.2335 0.8847420 0.176986997089741
sales 192.291666666667 175.5 79.6511111777793 41.4220296482492 120 0.7181040 -0.313176409071494
listings 1738.02083333333 1618.5 752.707756098841 43.3083275909432 1029.5 0.6494982 -0.791790033332591
median_price 132665.416666667 134500 22662.148686505 17.0821825732064 32750 -0.3645529 -0.622961820755544
months_inventory 9.1925 8.95 2.30366862229334 25.0603059264982 3.15 0.0409753 -0.174447541638487

Creation of class and its analysis on variables

This analysis divides each variable into classes defined by their respective quartiles to verify whether observations are evenly distributed among the four theoretical groups, each representing approximately 25% of the data. Quartiles were chosen to segment data into balanced subsets, facilitating the evaluation of distribution uniformity and identification of any anomalies such as duplicates at interval boundaries.Statistically, this approach highlights:

  • Uniformity of segmentation: Ideally, quartile-based classes should have similar absolute frequencies. Significant differences may point to duplicated values, clustering at interval edges, or distribution asymmetries.
  • Robustness to specific data values: Unequal frequencies suggest that the variable does not split neatly into four equal groups due to repeated values or particular distribution features.
  • Our results show that, except for the variable months_inventory, all variables display near-uniform frequency distributions across quartiles, indicating effective segmentation and lack of distortion from duplicates or outliers.

In contrast, months_inventory shows low uneven distribution, suggesting data aggregation or clusters that cause asymmetry and call for further examination or alternative segmentation methods.

#Function to create classes for each variable based on thir respective quartiles
create_class <- function(data, field, quantiles) {
  vec <- data[[field]]
  
  class <- cut(vec,
               breaks = quantiles,
               include.lowest = TRUE,
               labels = c("Q1", "Q2", "Q3", "Q4"))
  return(class)
}

#Class for variable "sales", frequency distribution and Gini index
data$sales_class <- create_class(data, "sales", q_sales)
dist_sales_class <- dist_freq(data$sales_class)
gini_sales_class <- gini_index(dist_sales_class$ni)

#Class for variable "volume", frequency distribution and Gini index
data$volume_class <- create_class(data, "volume", q_volume)
dist_volume_class <- dist_freq(data$volume_class)
gini_volume_class <- gini_index(dist_volume_class$ni)

#Class for variable "median_price", frequency distribution and Gini index
data$mdpr_class <- create_class(data, "median_price", q_median_price)
dist_mdpr_class <- dist_freq(data$mdpr_class)
gini_mdpr_class <- gini_index(dist_mdpr_class$ni)

#Class for variable "listings", frequency distribution and Gini index
data$list_class <- create_class(data, "listings", q_listings)
dist_list_class <- dist_freq(data$list_class)
gini_list_class <- gini_index(dist_list_class$ni)

#Class for variable "months_inventory", frequency distribution and Gini index
data$minv_class <- create_class(data, "months_inventory", q_minv)
dist_minv_class <- dist_freq(data$minv_class)
gini_minv_class <- gini_index(dist_minv_class$ni)

#Function to extract only value and absolute frequencies from dataset
exct_data <- function(data) {
  exct_data <- data[,1:2]
  return(exct_data)
}

#Extract data from frequencies total distribution for each variable and built of new dataset
d_s_class <- exct_data(dist_sales_class)
d_s_class$name_var <- "sales"
d_v_class <- exct_data(dist_volume_class)
d_v_class$name_var <- "volume"
d_mp_class <- exct_data(dist_mdpr_class)
d_mp_class$name_var <- "median_price"
d_l_class <- exct_data(dist_list_class)
d_l_class$name_var <- "listings"
d_mv_class <- exct_data(dist_minv_class)
d_mv_class$name_var <- "months_inventory"
dist_freq_class <- rbind(d_s_class, d_v_class, d_mp_class, d_l_class, d_mv_class)

#Plot of absolute frequency distribution of class (quartile) of variable "sales"
ggplot(data = dist_freq_class)+
     geom_bar(aes(x = value, y = ni), stat = "identity", color = "black", fill = "blue") +
     facet_wrap(~ name_var, scales = "free_x")+
     labs(title = "Absolute frequencies for class (quartiles) for each variable", 
          x = "Classes (quartiles)", 
          y = "Absolute frequencies")+ 
 theme_classic()

#Plot of table of Gini index for each variable
v_gini <- as.data.frame(cbind(gini_sales_class, gini_volume_class, gini_mdpr_class, gini_list_class, gini_minv_class))
kable(v_gini,
      caption = "Gini index of classes built from quartiles for each variable", row.naes = FALSE)
Gini index of classes built from quartiles for each variable
gini_sales_class gini_volume_class gini_mdpr_class gini_list_class gini_minv_class
0.0083333 0 0.0083333 0 0.0416667

Probability of every occurrency of each variable

#Function to create dataset with value and relative frequencies for each variable
exct_freq <- function(data) {

  exct_r1 <- data[, 1]
  exct_r2 <- data[,3]
  exct_rows <- as.data.frame(cbind(exct_r1, exct_r2))
  return(exct_rows)
}

#Creation of dataset for each categorical variable and new dataset
f_city <- exct_freq(dist_city_freq)
f_year <- exct_freq(dist_year_freq)
f_month <- exct_freq(dist_months_freq)
f_date <- exct_freq(dist_dates_freq)

f_dataset <- as.data.frame(rbind(f_city, f_year, f_month, f_date))

#Example of some probabilities
p_beaumont <- f_dataset$exct_r2[f_dataset$exct_r1 == "Beaumont"]
print(paste("The probability that the city associated with a dataset record could be Beaumont is: ", p_beaumont))
## [1] "The probability that the city associated with a dataset record could be Beaumont is:  0.25"
p_tyler <- f_dataset$exct_r2[f_dataset$exct_r1 == "Tyler"]
print(paste("The probability that the city associated with a dataset record could be Tyler is: ", p_tyler))
## [1] "The probability that the city associated with a dataset record could be Tyler is:  0.25"
p_2011 <- f_dataset$exct_r2[f_dataset$exct_r1 == "2011"]
print(paste("The probability that the year associated with a dataset record could be 2011 is: ", p_2011)) 
## [1] "The probability that the year associated with a dataset record could be 2011 is:  0.2"
p_2011_02_01 <- f_dataset$exct_r2[f_dataset$exct_r1 == "2011-02-01"]
print(paste("The probability that the date associated with a dataset record could be 2011 is: ", p_2011_02_01)) 
## [1] "The probability that the date associated with a dataset record could be 2011 is:  0.0166666666666667"

Creation and analysis of new variables

Based on variables in the dataset, it could create other variables:
1) Mean price = [volume/sales]
2) Efficiency listings index = [(sales/listings)]x100
3) Variation of efficiency listings index = ([sales - (listings/months_inventory)]/(listings/months_inventory))x100

  • Mean price represents the mean price of sell for each month, year and city in the dataset
  • Efficiency listings index represents the percentage of sales on relative listings for each month, year and city
  • Variation of efficiency listings index represent the percentage of variation between the sales in one period and the efficiency calculated like mean number of sales to sell all listings. The term listings/months_inventory represents the mean of sales to sell all listings. The difference between sales and this last term, divided by the last term, represents the variation of efficiency listings index
#Variable "mean_price"
data$mean_price <- ifelse(sales == 0, 0, volume/sales)

#Variable "efficiency_list_index"
data$efficiency_list_index <- (ifelse(listings == 0, 0, sales/listings))*100

#Variable "variation_eff_list_index"
mean_eff_listings <- ifelse(months_inventory == 0, 0, listings/months_inventory)
data$mean_eff_list_index <- ifelse(mean_eff_listings == 0, 0, 
                                   ((sales - mean_eff_listings)/(mean_eff_listings))*100)

Variable “mean_price”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index

  • Range: range of variable is beetwen 0.097 and 0.213 (mln $)
  • Mean: mean is 0.154 (mln $)
  • Median: median is 0.156 (mkn $). Mean < Median, so we have negative asimmetry (influenced by low values of outliers)
  • Standard deviation: standard deviation is 0.027.
  • Variation coefficient: variation coefficient is 17.59%. Generally, the variable shows 17.59% of variability instead its means
  • Skewness: skewness is -0.069. So, it’s < 0 and that confirms negative asimmetry
  • Kurtosis: kurtosis is -0.778. So, the variable “mean_price” distribution is mesokurtic
  • Gini index: Gini index is 0. The distribution is homogeneous
#Distribution of all variables frequencies
dist_mpr_freq <- dist_freq(data$mean_price)

#Gini index calculate
gini_mpr <- gini_index(dist_mpr_freq$ni)

#Aritmetic mean
mean_mpr <- mean(data$mean_price)

#Median
median_mpr <- median(data$mean_price)

#Standard deviation
sd_mpr <- sd(data$mean_price)

#Variation coefficient
var_cf_mpr <- (sd_mpr/mean_mpr)*100

#Range
range_mpr <- range(data$mean_price)

#Quantile and IQR
q_mpr <- quantile(data$mean_price, seq(0,1, 0.25))
iqr_mpr <- IQR(data$mean_price)

#Skewness
skew_mpr <- skewness(data$mean_price)

#Kurtosis
kurt_mpr <- kurtosis(data$mean_price)-3

#Creation of statistical index to plot table
pos_mpr <- which(names(data) == "mean_price")
st_index_mpr <- stat_index_vect(pos_mpr, mean_mpr, median_mpr, sd_mpr, var_cf_mpr, iqr_mpr, skew_mpr, kurt_mpr)

kable(st_index_mpr, caption = "Statistical index for the variable mean_price", row.names = FALSE) 
Statistical index for the variable mean_price
name_v mean_v median_v sd_v cf_v iqr_v skew_v kurt_v
mean_price 0.154320365777658 0.156588479842459 0.0271474563315486 17.5916225928743 0.0409762176937704 -0.0687052806534169 -0.77843287102
#Plot of mean price distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = mean_price),
                 binwidth = 0.01,
                 col = "black",
                 fill = "blue")+
  geom_vline(xintercept = mean_mpr, col = "red", lwd = 1)+
  geom_vline(xintercept = median_mpr, col = "pink", lwd = 1)+
  annotate("label", x = mean_mpr - 0.01, y = 40, label = round(mean_mpr, 2), color = "red") +
  annotate("label", x = median_mpr + 0.01, y = 40, label = round(median_mpr, 2), color = "pink") +
  labs(title = "Mean price distribution ",
       x = "Mean price (mln $)", 
       y = "Absolute frequencies")+ 
  theme_classic()

#Boxplot for variable "mean price"
ggplot(data = data)+
  geom_boxplot(aes(y = mean_price), fill = "blue")+
  labs(title = "Mean price boxplot", 
           x = "Relative frequencies", 
           y = "Mean price (mln $)")+ 
  theme_classic()

Variable “efficiency_list_index”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index

  • Range: range of variable is beetwen 5.01 and 38.71
  • Mean: mean is 11.87
  • Median: median is 10.96. Mean > Median, so we have positive asimmetry (influenced by high values of outliers)
  • Standard deviation: standard deviation is 4.69. Mean > Median > Standard Deviation, so it confirms positive asimmetry
  • Variation coefficient: variation coefficient is 39.49%. Generally, the variable shows 39.49% of variability instead its means
  • Skewness: skewness is 2.09. So, it’s > 0 and that confirms positive asimmetry
  • Kurtosis: kurtosis is 6.88. So, the variable “efficiency_list_index” distribution is leptokurtic
  • Gini index: Gini index is 0.042. The distribution shows very small heterogeneity
#Distribution of all variables frequencies
dist_eff_list_freq <- dist_freq(data$efficiency_list_index)

#Gini index calculate
gini_eff_list <- gini_index(dist_eff_list_freq$ni)

#Aritmetic mean
mean_eff_list <- mean(data$efficiency_list_index)

#Median
median_eff_list <- median(data$efficiency_list_index)

#Standard deviation
sd_eff_list <- sd(data$efficiency_list_index)

#Variation coefficient
var_cf_eff_list <- (sd_eff_list/mean_eff_list)*100

#Range
range_eff_list <- range(data$efficiency_list_index)

#Quantile and IQR
q_eff_list <- quantile(data$efficiency_list_index, seq(0,1, 0.25))
iqr_eff_list <- IQR(data$efficiency_list_index)

#Skewness
skew_eff_list <- skewness(data$efficiency_list_index)

#Kurtosis
kurt_eff_list <- kurtosis(data$efficiency_list_index)-3

#Creation of statistical index to plot table
pos_mlis <- which(names(data) == "efficiency_list_index")
st_index_mlis <- stat_index_vect(pos_mlis, mean_eff_list, median_eff_list, sd_eff_list, var_cf_eff_list, iqr_eff_list, skew_eff_list, kurt_eff_list)

kable(st_index_mlis, caption = "Statistical index for variable efficency_list_index", row.names = FALSE) 
Statistical index for variable efficency_list_index
name_v mean_v median_v sd_v cf_v iqr_v skew_v kurt_v
efficiency_list_index 11.8744921731651 10.9626835334067 4.68990009256993 39.4955845199724 4.51207215683548 2.08931813879356 6.88174747715347

Histogram of variable “efficiency_list_index” absolute frequencies with mean and median

#Plot of efficiency listings index distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = efficiency_list_index),
                 binwidth = 1,
                 col = "black",
                 fill = "blue")+
  geom_vline(xintercept = mean_eff_list, col = "red", lwd = 1)+
  geom_vline(xintercept = median_eff_list, col = "pink", lwd = 1)+
  annotate("label", x = mean_eff_list + 2, y = 40, label = round(mean_eff_list, 2), color = "red") +
  annotate("label", x = median_eff_list - 2, y = 40, label = round(median_eff_list, 2), color = "pink") +
  labs(title = "Efficiency listings index distribution",
       x = "Efficiency listings index", 
       y = "Absolute frequencies")+ 
  theme_classic()

#Boxplot for variable "efficiency listings index"
ggplot(data = data)+
  geom_boxplot(aes(y = efficiency_list_index), fill = "blue")+
  labs(title = "Efficiency listings index boxplot", 
           x = "Relative frequencies", 
           y = "Efficiency listings index")+ 
  theme_classic()

Variable “mean_eff_list_index”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index

  • Range: range of variable is beetwen -51.3 and 79.3
  • Mean: mean is 2.07
  • Median: median is 2.06. Mean > Median, so we have positive asimmetry (influenced by high values of outliers)
  • Standard deviation: standard deviation is 24.82. This index shows a very high variability of this calculated KPI
  • Variation coefficient: variation coefficient is 1196.9%. Generally, the variable shows a very high varibility instead its means
  • Skewness: skewness is 0.46. So, it’s > 0 and that confirms positive asimmetry
  • Kurtosis: kurtosis is 0.172. So, the variable “mean_eff_list_index” distribution is leptokurtic
  • Gini index: Gini index is 0. The distribution is homogeneous
#Distribution of all variables frequencies
dist_meff_list_freq <- dist_freq(data$mean_eff_list_index)

#Gini index calculate
gini_meff_list <- gini_index(dist_meff_list_freq$ni)

#Aritmetic mean
mean_meff_list <- mean(data$mean_eff_list_index)

#Median
median_meff_list <- median(data$mean_eff_list_index)

#Standard deviation
sd_meff_list <- sd(data$mean_eff_list_index)

#Variation coefficient
var_cf_meff_list <- (sd_meff_list/mean_meff_list)*100

#Range
range_meff_list <- range(data$mean_eff_list_index)

#Quantile and IQR
q_meff_list <- quantile(data$mean_eff_list_index, seq(0,1, 0.25))
iqr_meff_list <- IQR(data$mean_eff_list_index)

#Skewness
skew_meff_list <- skewness(data$mean_eff_list_index)

#Kurtosis
kurt_meff_list <- kurtosis(data$mean_eff_list_index)-3

#Creation of statistical index to plot table
pos_mel <- which(names(data) == "mean_eff_list_index")
st_index_mel <- stat_index_vect(pos_mel, mean_meff_list, median_meff_list, sd_meff_list, var_cf_meff_list, iqr_meff_list, skew_meff_list, kurt_meff_list)

kable(st_index_mel, caption = "Statistical index for variation of efficiency listings index", row.names = FALSE) 
Statistical index for variation of efficiency listings index
name_v mean_v median_v sd_v cf_v iqr_v skew_v kurt_v
mean_eff_list_index 2.07809270544987 2.06321958662783 24.8726892383184 1196.89988676102 34.0724048034506 0.465258599586698 0.171896889361363
#Plot of variation efficiency listings index distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = mean_eff_list_index),
                 binwidth = 5,
                 col = "black",
                 fill = "blue")+
  geom_vline(xintercept = mean_meff_list, col = "red", lwd = 1)+
  geom_vline(xintercept = median_meff_list, col = "pink", lwd = 1)+
  annotate("label", x = mean_meff_list + 10, y = 25, label = round(mean_meff_list, 2), color = "red") +
  annotate("label", x = median_meff_list - 10, y = 25, label = round(median_meff_list, 2), color = "pink") +
  labs(title = "Variation of efficiency listings index distribution",
       x = "Variation of efficiency listings index (%)", 
       y = "Absolute frequencies")+ 
  theme_classic()

#Boxplot for variable "variation efficiency listings index"
ggplot(data = data)+
  geom_boxplot(aes(y = mean_eff_list_index), fill = "blue")+
  labs(title = "Variation of efficiency listings index boxplot", 
           x = "Relative frequencies", 
           y = "Variation of efficiency listings index (%)")+ 
  theme_classic()

Bivariate analysis by year, month, city and time series

Variable “sales”

Bivariate analysis by city: Tyler is the city with the highest total and average number of sales, while Wichita Falls has the lowest indicators. The variability in the number of sales is around 20% for all cities except Bryan-College Station. Bryan shows a much higher variability in sales number (41%). This situation suggests a healthier market in Tyler and Bryan-College Station, although the latter has high variability and is therefore more subject to temporal fluctuations, and a somewhat less healthy market in Beaumont and Wichita Falls, albeit with lower temporal fluctuations.

Bivariate analysis by year and month: The analysis shows that the total and average number of sales peak around mid-year, regardless of the year. This suggests a seasonal trend in the data, which should be confirmed through the analysis of the historical data trends. Additionally, three peaks (at the beginning, middle, and end of the year) in sales variability are observed, independently of the year. Overall, the results indicate real estate markets subject to seasonality, with differences in sales opportunities based on different times of the year and varying peaks in sales variability. These factors should be considered, for example, when managing advertising campaigns.

Bivariate analysis by date: The analysis confirms the seasonal pattern of the data for each city, showing a periodic and repetitive trend with a peak in the number of sales around mid-year. The temporal distribution of the boxplots by city further highlights the mid-year sales peak and the three variability peaks identified earlier. Overall, these findings confirm at the local level—with some differences—what was observed at the global level. Moreover, the temporal trends support the analysis of sales numbers by city, identifying a positive sales trend for all cities (especially Tyler and Bryan-College Station, which have higher total and average sales) and a downward trend for Wichita Falls.

#Function to aggregate data by city
agg_city <- function (data, city, kpi){
  data_by_city <- data %>%
    group_by({{city}}) %>%
    summarise(num_kpi = sum({{kpi}}), mean_kpi = mean({{kpi}}), sd_kpi = sd({{kpi}}), .groups = "drop") %>%
    mutate(cv_kpi = (sd_kpi/mean_kpi)*100)
  
  return(data_by_city)
}

#Total number, mean and standard deviation of sales by city
sales_by_city <- agg_city(data, city, sales)
names(sales_by_city)[2:5] <- c("num_sales", "mean_sales", "sd_sales", "cv_sales")

kable(sales_by_city, caption = "Statistical index for number of sales aggregated by city", row.names = FALSE) 
Statistical index for number of sales aggregated by city
city num_sales mean_sales sd_sales cv_sales
Beaumont 10643 177.3833 41.48395 23.38661
Bryan-College Station 12358 205.9667 84.98374 41.26092
Tyler 16185 269.7500 61.96380 22.97083
Wichita Falls 6964 116.0667 22.15192 19.08551
#Plot of total number of sales by city
ggplot(data = sales_by_city)+
  geom_bar(aes(x = city, y = num_sales),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Number of sales by city", 
       x = "City", 
       y = "Number of sales")+ 
  theme_classic()

#Plot of mean number of sales by city
ggplot(data = sales_by_city)+
  geom_bar(aes(x = city, y = mean_sales),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Mean number of sales by city", 
       x = "City", 
       y = "Mean number of sales")+ 
  theme_classic()

#Plot of variation coefficient of number of sales by city
ggplot(data = sales_by_city)+
  geom_bar(aes(x = city, y = cv_sales),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Number of sales variability by city", 
       x = "City", 
       y = "Number of sales variability")+ 
  theme_classic()

#Function to aggregate data by month and year
agg_date <- function (data, year, month, kpi){
  data_by_date <- data %>%
    group_by({{year}}, {{month}}) %>%
    summarise(num_kpi = sum({{kpi}}), mean_kpi = mean({{kpi}}), sd_kpi = sd({{kpi}}), .groups = "drop") %>%
    mutate(cv_kpi = (sd_kpi/mean_kpi)*100)
  
  return(data_by_date)
}

#Total number, mean and standard deviation of sales by month and year
sales_by_date <- agg_date(data, year, month, sales)
names(sales_by_date)[3:6] <- c("num_sales", "mean_sales", "sd_sales", "cv_sales")

kable(sales_by_date,caption = "Statistical index for number of salesw aggregated by year and months", row.names = FALSE) 
Statistical index for number of salesw aggregated by year and months
year month num_sales mean_sales sd_sales cv_sales
2010 Jan 421 105.25 36.60943 34.78330
2010 Feb 487 121.75 40.26061 33.06826
2010 Mar 755 188.75 43.59950 23.09907
2010 Apr 916 229.00 63.95311 27.92712
2010 May 931 232.75 58.84089 25.28072
2010 Jun 866 216.50 71.44462 32.99982
2010 Jul 712 178.00 62.50867 35.11723
2010 Aug 738 184.50 45.00000 24.39024
2010 Sep 598 149.50 47.19816 31.57068
2010 Oct 565 141.25 45.70467 32.35729
2010 Nov 503 125.75 30.99866 24.65102
2010 Dec 604 151.00 40.70217 26.95508
2011 Jan 425 106.25 27.03547 25.44515
2011 Feb 469 117.25 44.25965 37.74810
2011 Mar 668 167.00 52.42773 31.39385
2011 Apr 716 179.00 58.64583 32.76303
2011 May 780 195.00 70.28039 36.04123
2011 Jun 885 221.25 93.93038 42.45441
2011 Jul 812 203.00 69.95713 34.46164
2011 Aug 786 196.50 70.27802 35.76490
2011 Sep 629 157.25 67.60855 42.99431
2011 Oct 594 148.50 57.57604 38.77174
2011 Nov 549 137.25 49.37864 35.97715
2011 Dec 565 141.25 48.25194 34.16067
2012 Jan 499 124.75 29.78115 23.87266
2012 Feb 574 143.50 57.61076 40.14687
2012 Mar 711 177.75 66.69020 37.51910
2012 Apr 747 186.75 52.77863 28.26165
2012 May 882 220.50 90.71751 41.14173
2012 Jun 898 224.50 86.18004 38.38755
2012 Jul 928 232.00 89.81462 38.71320
2012 Aug 954 238.50 87.99432 36.89489
2012 Sep 707 176.75 78.20646 44.24694
2012 Oct 742 185.50 79.80601 43.02211
2012 Nov 650 162.50 37.24245 22.91843
2012 Dec 643 160.75 52.20073 32.47324
2013 Jan 576 144.00 49.22059 34.18097
2013 Feb 593 148.25 54.90219 37.03351
2013 Mar 814 203.50 64.04426 31.47138
2013 Apr 878 219.50 74.54082 33.95937
2013 May 1057 264.25 90.36362 34.19626
2013 Jun 1045 261.25 108.21699 41.42277
2013 Jul 1127 281.75 122.70391 43.55063
2013 Aug 1107 276.75 92.01585 33.24873
2013 Sep 814 203.50 66.00253 32.43367
2013 Oct 738 184.50 65.97727 35.76004
2013 Nov 690 172.50 65.07688 37.72573
2013 Dec 733 183.25 64.04881 34.95160
2014 Jan 627 156.75 61.34805 39.13751
2014 Feb 694 173.50 62.21736 35.86015
2014 Mar 841 210.25 85.35563 40.59721
2014 Apr 977 244.25 84.10063 34.43219
2014 May 1127 281.75 112.15577 39.80684
2014 Jun 1177 294.25 134.62386 45.75152
2014 Jul 1136 284.00 122.29745 43.06248
2014 Aug 1044 261.00 89.70693 34.37047
2014 Sep 899 224.75 103.54186 46.06979
2014 Oct 959 239.75 106.31518 44.34418
2014 Nov 745 186.25 84.50000 45.36913
2014 Dec 843 210.75 91.73649 43.52858
#Plot of total number of sales by month and year
ggplot(data = sales_by_date)+
  geom_bar(aes(x = month, y = num_sales),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = num_sales, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Number of sales by year and month", 
       x = "Months", 
       y = "Number of sales")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of mean number of sales by month and year
ggplot(data = sales_by_date)+
  geom_bar(aes(x = month, y = mean_sales),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = mean_sales, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Mean number of sales by year and month", 
       x = "Months", 
       y = "Mean number of sales")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of variation coefficient of sales by month and year
ggplot(data = sales_by_date)+
  geom_bar(aes(x = month, y = cv_sales),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = cv_sales, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Number of sales variability by year and month", 
       x = "Months", 
       y = "Number of sales variability")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Function to aggregate data by date and city
agg_datecity <- function (data, city, date, kpi){
  data_by_datecity <- data %>%
    group_by({{city}}, {{date}}) %>%
    summarise(num_kpi = sum({{kpi}}), .groups = "drop")
  
  return(data_by_datecity)
}

#Total number, mean and standard deviation of sales by date and city for trend analysis
sales_by_datecity <- agg_datecity(data, city, date, sales)
names(sales_by_datecity)[3] <- c("num_sales")

#Plot by year and city to explore trend 
ggplot(data = sales_by_datecity)+
  geom_smooth(aes(x = date, y = num_sales, colour = city), lwd = 1, method = "loess")+
  geom_point(aes(x = date, y = num_sales, colour = city))+
  labs(title = "Number of sales trend by city", 
       x = "Period", 
       y = "Number of sales")+ 
  facet_wrap(~ city, scales = "free_x")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using formula = 'y ~ x'

#Boxplot of number of sales by date and city
ggplot(data = data)+
  geom_boxplot(aes(x = month, y = sales, fill = city))+
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Number of sales distribution by city and months", 
       x = "Months", 
       y = "Number of sales")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of number of sales by month, year and city
ggplot(data = data)+
  geom_bar(aes(x = month, y = sales, fill = city),
           stat = "identity",
           position = "stack",
           col = "black")+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Number of sales by year, month and city", 
       x = "Months", 
       y = "Number of sales")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Variable “volume”

Bivariate analysis by city: Tyler is the city with the highest total and average volume of sales, while Wichita Falls has the lowest indicators. The variability in the volume of sales is around 25-30% for all cities except Bryan-College Station. Bryan shows a much higher variability in sales number (45%). This situation suggests a healthier market in Tyler and Bryan-College Station, although the latter has high variability and is therefore more subject to temporal fluctuations, and a somewhat less healthy market in Beaumont and Wichita Falls, albeit with lower temporal fluctuations.

Bivariate analysis by year and month: The analysis shows that the total and average volume of sales peak around mid-year, regardless of the year. This suggests a seasonal trend in the data, which should be confirmed through the analysis of the historical data trends. Additionally, three peaks (at the beginning, middle, and end of the year) in volume of sales variability are observed, independently of the year. Overall, the results indicate real estate markets subject to seasonality, with differences in sales opportunities based on different times of the year and varying peaks in sales variability. These factors should be considered, for example, when managing advertising campaigns.

Bivariate analysis by date: The analysis confirms the seasonal pattern of the data for each city, showing a periodic and repetitive trend with a peak in the volume of sales around mid-year. The temporal distribution of the boxplots by city further highlights the mid-year sales peak and the three variability peaks identified earlier. Overall, these findings confirm at the local level—with some differences—what was observed at the global level. Moreover, the temporal trends support the analysis of sales volume by city, identifying a positive sales trend for all cities (especially Tyler and Bryan-College Station, which have higher total and average sales) and a downward trend for Wichita Falls.

Overall, the analysis suggests a likely linear relationship between the number of sales and the value of sales, which should be further confirmed by investigating the relationship between the two variables.

#Total, mean and standard deviation of volume by city
volume_by_city <- agg_city(data, city, volume)
names(volume_by_city)[2:5] <- c("num_volume", "mean_volume", "sd_volume", "cv_volume")

kable(volume_by_city, caption = "Statistical index for volume of sales in mln $ aggregated by city", row.names = FALSE) 
Statistical index for volume of sales in mln $ aggregated by city
city num_volume mean_volume sd_volume cv_volume
Beaumont 1567.896 26.13160 6.970384 26.67416
Bryan-College Station 2291.496 38.19160 17.248577 45.16327
Tyler 2746.043 45.76738 13.107146 28.63862
Wichita Falls 835.810 13.93017 3.239766 23.25719
#Plot of volume of sales by city
ggplot(data = volume_by_city)+
  geom_bar(aes(x = city, y = num_volume),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Volume of sales by city", 
       x = "City", 
       y = "Volume of sales (mln $)")+ 
  theme_classic()

#Plot of mean volume of sales by city
ggplot(data = volume_by_city)+
  geom_bar(aes(x = city, y = mean_volume),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Mean volume of sales by city", 
       x = "City", 
       y = "Mean volume of sales (mln $)")+ 
  theme_classic()

#Plot of variation coefficient of volume of sales by city
ggplot(data = volume_by_city)+
  geom_bar(aes(x = city, y = cv_volume),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Volume of sales variability by city", 
       x = "City", 
       y = "Volume of sales variability")+ 
  theme_classic()

#Total number, mean and standard deviation of volume by month and year
volume_by_date <- agg_date(data, year, month, volume)
names(volume_by_date)[3:6] <- c("num_volume", "mean_volume", "sd_volume", "cv_volume")

kable(volume_by_date, caption = "Statistical index for volume of sales in mln $ aggregated by year and month", row.names = FALSE) 
Statistical index for volume of sales in mln $ aggregated by year and month
year month num_volume mean_volume sd_volume cv_volume
2010 Jan 63.751 15.93775 6.922791 43.43644
2010 Feb 76.897 19.22425 8.535588 44.40011
2010 Mar 111.876 27.96900 7.298540 26.09511
2010 Apr 135.196 33.79900 13.280337 39.29210
2010 May 143.689 35.92225 13.236626 36.84799
2010 Jun 138.191 34.54775 13.560836 39.25244
2010 Jul 106.764 26.69100 12.121784 45.41525
2010 Aug 114.356 28.58900 10.666873 37.31111
2010 Sep 88.826 22.20650 7.228736 32.55234
2010 Oct 87.164 21.79100 8.068802 37.02814
2010 Nov 73.663 18.41575 5.684010 30.86494
2010 Dec 92.070 23.01750 7.441038 32.32774
2011 Jan 60.658 15.16450 5.313391 35.03835
2011 Feb 69.379 17.34475 8.107725 46.74455
2011 Mar 101.566 25.39150 10.674532 42.03979
2011 Apr 110.267 27.56675 11.256460 40.83347
2011 May 122.036 30.50900 15.375147 50.39545
2011 Jun 138.986 34.74650 18.521100 53.30350
2011 Jul 130.855 32.71375 14.836718 45.35315
2011 Aug 120.816 30.20400 14.510272 48.04089
2011 Sep 95.103 23.77575 12.214512 51.37382
2011 Oct 88.274 22.06850 10.280971 46.58663
2011 Nov 86.288 21.57200 10.695940 49.58251
2011 Dec 83.347 20.83675 8.284211 39.75769
2012 Jan 69.791 17.44775 6.837480 39.18832
2012 Feb 82.504 20.62600 10.198258 49.44370
2012 Mar 109.129 27.28225 12.719385 46.62147
2012 Apr 112.128 28.03200 10.981393 39.17449
2012 May 147.526 36.88150 19.394133 52.58499
2012 Jun 149.335 37.33375 18.633920 49.91173
2012 Jul 152.463 38.11575 20.164871 52.90430
2012 Aug 154.271 38.56775 19.762346 51.24060
2012 Sep 112.759 28.18975 15.251248 54.10210
2012 Oct 115.004 28.75100 15.203367 52.87944
2012 Nov 102.547 25.63675 7.173133 27.97989
2012 Dec 97.386 24.34650 10.330214 42.42997
2013 Jan 91.739 22.93475 9.591138 41.81924
2013 Feb 89.946 22.48650 11.426154 50.81340
2013 Mar 125.502 31.37550 13.893366 44.28094
2013 Apr 145.621 36.40525 18.420387 50.59816
2013 May 184.474 46.11850 22.272422 48.29390
2013 Jun 184.315 46.07875 24.932241 54.10789
2013 Jul 188.551 47.13775 26.927092 57.12426
2013 Aug 185.637 46.40925 20.868560 44.96638
2013 Sep 136.781 34.19525 14.307844 41.84161
2013 Oct 124.589 31.14725 15.597292 50.07598
2013 Nov 110.195 27.54875 12.784377 46.40638
2013 Dec 119.965 29.99125 12.059098 40.20872
2014 Jan 94.076 23.51900 12.074357 51.33873
2014 Feb 114.304 28.57600 13.162092 46.05995
2014 Mar 139.621 34.90525 17.803333 51.00474
2014 Apr 162.877 40.71925 20.136292 49.45153
2014 May 196.317 49.07925 26.326048 53.63987
2014 Jun 215.236 53.80900 30.670720 56.99924
2014 Jul 203.805 50.95125 29.507853 57.91389
2014 Aug 185.203 46.30075 22.953933 49.57573
2014 Sep 158.514 39.62850 23.355544 58.93623
2014 Oct 166.541 41.63525 21.281137 51.11327
2014 Nov 123.450 30.86250 17.263882 55.93805
2014 Dec 149.125 37.28125 19.756553 52.99327
#Plot of volume of sales by month and year
ggplot(data = volume_by_date)+
  geom_bar(aes(x = month, y = num_volume),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = num_volume, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Volume of sales by year and month", 
       x = "Months", 
       y = "Volume of sales (mln $)")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of mean volume of sales by month and year
ggplot(data = volume_by_date)+
  geom_bar(aes(x = month, y = mean_volume),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = mean_volume, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Mean volume of sales by year and month", 
       x = "Months", 
       y = "Mean volume of sales (mln $)")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of variation coefficient of volume of sales by month and year
ggplot(data = volume_by_date)+
  geom_bar(aes(x = month, y = cv_volume),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = cv_volume, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Volume of sales varibility by year and month", 
       x = "Months", 
       y = "Volume of sales variability")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Total volume of sales by date and city for trend analysis
volume_by_datecity <- agg_date(data, city, date, volume)
names(volume_by_datecity)[3] <- c("num_volume")

#Plot by year and city to explore trend 
ggplot(data = volume_by_datecity)+
  geom_smooth(aes(x = date, y = num_volume, colour = city), lwd = 1)+
  geom_point(aes(x = date, y = num_volume, colour = city))+
  labs(title = "Volume of sales trend by city", 
       x = "Period", 
       y = "Volume of sales (mln $)")+ 
  facet_wrap(~ city, scales = "free_x")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Boxplot of volume of sales by date and city
ggplot(data = data)+
  geom_boxplot(aes(x = month, y = volume, fill = city))+
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Volume of sales distribution by city and month", 
       x = "Months", 
       y = "Volume of sales (mln $)")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of volume of sales by month, year and city
ggplot(data = data)+
  geom_bar(aes(x = month, y = volume, fill = city),
           stat = "identity",
           position = "stack",
           col = "black")+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Volume of sales by year, month and city", 
       x = "Months", 
       y = "Volume of sales (mln $)")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Variable “median_price”

Bivariate analysis by city: Bryan-College Station is the city with the highest total and average median price, even exceeding the same indicators for Tyler. Tyler shows generally higher median sale prices, despite having higher total sales value and number of sales compared to Bryan. Wichita Falls, on the other hand, confirms to be the city with the least healthy real estate market, having the lowest median prices among the cities and the highest variability in median prices.

Bivariate analysis by year and month: The analysis shows a more stable pattern across months and years for the median sale price compared to the number and value of sales. No peaks are observed in the total or average median sale prices. This suggests that seasonality affects the number and value of sales but not house valuations and thus the median price. However, the variability over time reveals a more chaotic situation across years, without a clear pattern, indicating influences from more complex dynamics rather than periodic or recurring variability.

Bivariate analysis by date: The temporal analysis confirms a relatively stable trend in the median sale price, showing a slight increase for the cities of Bryan-College Station and Tyler, which are also the cities with the highest number and value of sales and the strongest upward trends. Unlike the previous analysis, this one highlights a seasonal trend more clearly, although it is less pronounced than that of the number and value of sales, confirming the impression that additional dynamics influence the median sale price, obscuring a purely seasonal pattern.

Overall, it appears that the median sale price does not follow a strictly linear relationship with the number and value of sales, suggesting a more complex dynamic influenced by local behaviors. This situation should be further confirmed through bivariate analysis of the relationship with the mentioned variables.

#Total, mean and standard deviation of median price by city
median_price_by_city <- agg_city(data, city, median_price)
names(median_price_by_city)[2:5] <- c("num_median_price", "mean_median_price", "sd_median_price", "cv_median_price")

kable(median_price_by_city, caption = "Statistical index for median price aggregated by city", row.names = FALSE) 
Statistical index for median price aggregated by city
city num_median_price mean_median_price sd_median_price cv_median_price
Beaumont 7799300 129988.3 10104.993 7.773769
Bryan-College Station 9449300 157488.3 8852.235 5.620883
Tyler 8486500 141441.7 9336.538 6.600981
Wichita Falls 6104600 101743.3 11320.034 11.126070
#Plot of volume of median price by city
ggplot(data = median_price_by_city)+
  geom_bar(aes(x = city, y = num_median_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Total median price by city", 
       x = "City", 
       y = "Total median price")+ 
  theme_classic()

#Plot of mean median price of sales by city
ggplot(data = median_price_by_city)+
  geom_bar(aes(x = city, y = mean_median_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Mean of median price by city", 
       x = "City", 
       y = "Mean of median price")+ 
  theme_classic()

#Plot of variation coefficient of median price of sales by city
ggplot(data = median_price_by_city)+
  geom_bar(aes(x = city, y = cv_median_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Median price variability by city", 
       x = "City", 
       y = "Median price variability")+ 
  theme_classic()

#Total number, mean and standard deviation of median price by month and year
median_price_by_date <- agg_date(data, year, month, median_price)
names(median_price_by_date)[3:6] <- c("num_median_price", "mean_median_price", "sd_median_price", "cv_median_price")

kable(median_price_by_date, caption = "Statistical index for median price by year and month", row.names = FALSE) 
Statistical index for median price by year and month
year month num_median_price mean_median_price sd_median_price cv_median_price
2010 Jan 541800 135450 33735.69 24.906378
2010 Feb 524400 131100 31406.47 23.956121
2010 Mar 489300 122325 24973.24 20.415480
2010 Apr 513200 128300 18899.38 14.730618
2010 May 520500 130125 20121.53 15.463229
2010 Jun 533100 133275 15108.80 11.336560
2010 Jul 508900 127225 23351.71 18.354657
2010 Aug 524400 131100 23883.47 18.217748
2010 Sep 503300 125825 20739.56 16.482858
2010 Oct 536600 134150 25058.66 18.679586
2010 Nov 526400 131600 32545.87 24.730901
2010 Dec 527300 131825 16897.61 12.818212
2011 Jan 489000 122250 24234.20 19.823480
2011 Feb 484800 121200 24134.76 19.913169
2011 Mar 507000 126750 21773.76 17.178510
2011 Apr 529900 132475 15490.94 11.693480
2011 May 515100 128775 17313.46 13.444736
2011 Jun 521400 130350 24845.86 19.060881
2011 Jul 523000 130750 20059.99 15.342251
2011 Aug 541400 135350 20121.38 14.866185
2011 Sep 516400 129100 22192.04 17.189807
2011 Oct 496400 124100 35492.44 28.599871
2011 Nov 511600 127900 31642.06 24.739688
2011 Dec 501000 125250 22397.10 17.881915
2012 Jan 457000 114250 24834.72 21.737173
2012 Feb 508900 127225 20443.48 16.068759
2012 Mar 517100 129275 23645.21 18.290626
2012 Apr 505000 126250 26047.33 20.631547
2012 May 522300 130575 23545.05 18.031817
2012 Jun 544700 136175 15604.14 11.458887
2012 Jul 549500 137375 18896.27 13.755249
2012 Aug 540200 135050 24313.85 18.003592
2012 Sep 541100 135275 27574.31 20.383891
2012 Oct 500300 125075 28832.55 23.052207
2012 Nov 547000 136750 13484.93 9.861009
2012 Dec 510600 127650 27354.52 21.429318
2013 Jan 513300 128325 23111.96 18.010486
2013 Feb 522400 130600 20824.18 15.945011
2013 Mar 512900 128225 32607.20 25.429671
2013 Apr 524200 131050 28138.88 21.471867
2013 May 564100 141025 17159.33 12.167578
2013 Jun 558100 139525 26363.15 18.894931
2013 Jul 544800 136200 26459.65 19.427059
2013 Aug 558100 139525 23961.97 17.173959
2013 Sep 565400 141350 18549.12 13.122833
2013 Oct 562200 140550 24629.86 17.523913
2013 Nov 536600 134150 24053.90 17.930598
2013 Dec 552600 138150 19851.53 14.369549
2014 Jan 483900 120975 28066.04 23.199866
2014 Feb 561000 140250 24954.96 17.793198
2014 Mar 522000 130500 26588.59 20.374402
2014 Apr 557500 139375 26724.94 19.174846
2014 May 567700 141925 21542.57 15.178844
2014 Jun 595100 148775 16999.88 11.426569
2014 Jul 568800 142200 29364.04 20.649815
2014 Aug 569400 142350 31003.82 21.779991
2014 Sep 554600 138650 38758.10 27.953910
2014 Oct 574100 143525 25713.60 17.915763
2014 Nov 564500 141125 30097.77 21.327030
2014 Dec 576500 144125 32345.36 22.442576
#Plot of median price of sales by month and year
ggplot(data = median_price_by_date)+
  geom_bar(aes(x = month, y = num_median_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = num_median_price, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Total median price by year and month", 
       x = "Months", 
       y = "Total median price")+  
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of mean median price of sales by month and year
ggplot(data = median_price_by_date)+
  geom_bar(aes(x = month, y = mean_median_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = mean_median_price, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Mean of median pruce by year and month", 
       x = "Months", 
       y = "Mean of median price")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of variation coefficient of median price of sales by month and year
ggplot(data = median_price_by_date)+
  geom_bar(aes(x = month, y = cv_median_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = cv_median_price, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Median price variability by year and month", 
       x = "Months", 
       y = "Variability of median price")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Total volume of median price by date and city for trend analysis
median_price_by_datecity <- agg_datecity(data, city, date, median_price)
names(median_price_by_datecity)[3] <- c("num_median_price")

#Plot by year and city to explore trend 
ggplot(data = median_price_by_datecity)+
  geom_smooth(aes(x = date, y = num_median_price, colour = city), lwd = 1)+
  geom_point(aes(x = date, y = num_median_price, colour = city))+
  labs(title = "Total median price trend by city", 
       x = "Period", 
       y = "Total median price")+ 
  facet_wrap(~ city, scales = "free_x")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Boxplot of median price of sales by date and city
ggplot(data = data)+
  geom_boxplot(aes(x = month, y = median_price, fill = city))+
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price distribution by city and month", 
       x = "Months", 
       y = "Median price")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of median price by month, year and city
ggplot(data = data)+
  geom_bar(aes(x = month, y = median_price, fill = city),
           stat = "identity",
           position = "stack",
           col = "black")+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Median price by year, month and city", 
       x = "Months", 
       y = "Median price")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Variable “listings”

Bivariate analysis by city: The analysis shows that Tyler is the city with the highest total and average number of listings, and it is also the city with the highest number and value of sales. Wichita Falls, on the other hand, has the lowest total and average number of listings and ranks last in both sales number and value. This may suggest some correlation between the number of listings and the number and value of sales; however, it could also reflect higher demand, which is mirrored in the larger number of available listings as well as in the higher number and value of sales, revealing a correlation that may not be truly causal. The highest variability in the number of listings occurs in Bryan-College Station, which is also the city with the highest total and average median sale price. This could indicate that in this city, the relationship between supply and demand—which influences the sale price of individual listings—is unstable or subject to variation due to complex dynamics, which also affect the number of available listings.

Bivariate analysis by year and month: The analysis shows that the total and average number of listings has a slight peak around mid-year, regardless of the year, similar to what was observed for the number and value of sales. Variability, however, does not appear to follow a specific pattern and seems rather stable over time, except for some cases (such as in 2013 or 2014).

Bivariate analysis by date: The analysis shows a decrease over time in the number of listings for the cities of Bryan-College Station and Tyler, while Beaumont and Wichita Falls exhibit a stable situation. Overall, it appears that cities experiencing an increase over time in the number and value of sales—and thus seemingly having a better real estate market—show the opposite trend in the number of listings. This could be due to the fact that, as demand and the respective sales capacity and market penetration increase, it is no longer necessary to invest as much in advertising and listings. This behavior is confirmed by the boxplot, which supports the temporal trend of average values and variability of observations by city.

Overall, the number of listings shows an inverse proportional relationship with the number and value of sales, which, however, needs to be further investigated to determine whether it is linear or influenced by more complex nonlinear behaviors.

#Total, mean and standard deviation of listings by city
listings_by_city <- agg_city(data, city, listings)
names(listings_by_city)[2:5] <- c("num_listings", "mean_listings", "sd_listings", "cv_listings")

kable(listings_by_city, caption = "Statistical index for number of listings aggregated by city", row.names = FALSE) 
Statistical index for number of listings aggregated by city
city num_listings mean_listings sd_listings cv_listings
Beaumont 100759 1679.3167 91.13382 5.426839
Bryan-College Station 87488 1458.1333 252.52753 17.318548
Tyler 174303 2905.0500 226.75458 7.805531
Wichita Falls 54575 909.5833 73.75504 8.108663
#Plot of volume of listings by city
ggplot(data = listings_by_city)+
  geom_bar(aes(x = city, y = num_listings),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Number of listings by city", 
       x = "City", 
       y = "Number of listings")+ 
  theme_classic()

#Plot of mean listings of sales by city
ggplot(data = listings_by_city)+
  geom_bar(aes(x = city, y = mean_listings),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Mean of listings by city", 
       x = "City", 
       y = "Mean of listings")+ 
  theme_classic()

#Plot of variation coefficient of listings of sales by city
ggplot(data = listings_by_city)+
  geom_bar(aes(x = city, y = cv_listings),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Listings variability by city", 
       x = "City", 
       y = "Listings variability")+ 
  theme_classic()

#Total number, mean and standard deviation of listings by month and year
listings_by_date <- agg_date(data, year, month, listings)
names(listings_by_date)[3:6] <- c("num_listings", "mean_listings", "sd_listings", "cv_listings")

kable(listings_by_date, caption = "Statistical index for number of listings aggregated by year and month", row.names = FALSE) 
Statistical index for number of listings aggregated by year and month
year month num_listings mean_listings sd_listings cv_listings
2010 Jan 6466 1616.50 783.9211 48.49497
2010 Feb 6703 1675.75 779.9484 46.54325
2010 Mar 6941 1735.25 738.8362 42.57808
2010 Apr 7610 1902.50 871.1883 45.79176
2010 May 7473 1868.25 947.4173 50.71149
2010 Jun 7657 1914.25 984.9468 51.45341
2010 Jul 7768 1942.00 959.9205 49.42948
2010 Aug 7718 1929.50 954.2175 49.45413
2010 Sep 7702 1925.50 972.0838 50.48475
2010 Oct 7470 1867.50 917.5832 49.13431
2010 Nov 7262 1815.50 879.6384 48.45158
2010 Dec 6878 1719.50 826.7162 48.07887
2011 Jan 6964 1741.00 800.9024 46.00244
2011 Feb 7141 1785.25 833.7063 46.69970
2011 Mar 7554 1888.50 887.1056 46.97408
2011 Apr 7792 1948.00 914.6573 46.95366
2011 May 7990 1997.50 922.9047 46.20299
2011 Jun 7889 1972.25 930.5813 47.18374
2011 Jul 7776 1944.00 943.6444 48.54138
2011 Aug 7592 1898.00 940.5683 49.55576
2011 Sep 7420 1855.00 887.7466 47.85696
2011 Oct 7233 1808.25 890.4644 49.24454
2011 Nov 6910 1727.50 833.0564 48.22324
2011 Dec 6522 1630.50 791.3859 48.53639
2012 Jan 6803 1700.75 814.4947 47.89032
2012 Feb 7018 1754.50 823.8529 46.95657
2012 Mar 7291 1822.75 811.8002 44.53711
2012 Apr 7416 1854.00 834.8185 45.02797
2012 May 7453 1863.25 849.5153 45.59320
2012 Jun 7433 1858.25 873.0400 46.98184
2012 Jul 7431 1857.75 887.3719 47.76595
2012 Aug 7176 1794.00 892.0617 49.72473
2012 Sep 7086 1771.50 852.7878 48.13930
2012 Oct 6933 1733.25 839.9684 48.46205
2012 Nov 6801 1700.25 821.1977 48.29864
2012 Dec 6446 1611.50 759.5668 47.13415
2013 Jan 6579 1644.75 748.6046 45.51480
2013 Feb 6735 1683.75 746.3692 44.32779
2013 Mar 6973 1743.25 793.4471 45.51539
2013 Apr 7191 1797.75 836.3852 46.52400
2013 May 7086 1771.50 853.9924 48.20730
2013 Jun 7046 1761.50 875.5313 49.70373
2013 Jul 6935 1733.75 915.1089 52.78205
2013 Aug 6843 1710.75 899.3140 52.56840
2013 Sep 6611 1652.75 914.7066 55.34453
2013 Oct 6458 1614.50 897.0561 55.56247
2013 Nov 6234 1558.50 837.1063 53.71231
2013 Dec 5834 1458.50 766.7170 52.56887
2014 Jan 6129 1532.25 793.8226 51.80764
2014 Feb 6253 1563.25 790.5101 50.56837
2014 Mar 6375 1593.75 814.8412 51.12729
2014 Apr 6505 1626.25 827.1817 50.86436
2014 May 6475 1618.75 806.1585 49.80130
2014 Jun 6640 1660.00 851.4368 51.29138
2014 Jul 6514 1628.50 889.2669 54.60650
2014 Aug 6397 1599.25 847.1428 52.97125
2014 Sep 6159 1539.75 809.5638 52.57761
2014 Oct 6113 1528.25 772.6195 50.55583
2014 Nov 5847 1461.75 728.5403 49.84028
2014 Dec 5475 1368.75 675.7817 49.37218
#Plot of listings of sales by month and year
ggplot(data = listings_by_date)+
  geom_bar(aes(x = month, y = num_listings),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = num_listings, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Number of listings by year and month", 
       x = "Months", 
       y = "Number of listings")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of mean listings of sales by month and year
ggplot(data = listings_by_date)+
  geom_bar(aes(x = month, y = mean_listings),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = mean_listings, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Mean of listings by year and month", 
       x = "Months", 
       y = "Mean of listings")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of variation coefficient of listings by month and year
ggplot(data = listings_by_date)+
  geom_bar(aes(x = month, y = cv_listings),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = cv_listings, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Listings variability by year and month", 
       x = "Months", 
       y = "Listings variability")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Total listings by date and city for trend analysis
listings_by_datecity <- agg_datecity(data, city, date, listings)
names(listings_by_datecity)[3] <- c("num_listings")

#Plot by year and city to explore trend 
ggplot(data = listings_by_datecity)+
  geom_smooth(aes(x = date, y = num_listings, colour = city), lwd = 1)+
  geom_point(aes(x = date, y = num_listings, colour = city))+
  labs(title = "Number of listings trend by city", 
       x = "Period", 
       y = "Number of listings")+ 
  facet_wrap(~ city, scales = "free_x")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Boxplot of listings by date and city
ggplot(data = data)+
  geom_boxplot(aes(x = month, y = listings, fill = city))+
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Number of listings distribution by city and month", 
       x = "Months", 
       y = "Number of listings")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of listings by month, year and city
ggplot(data = data)+
  geom_bar(aes(x = month, y = listings, fill = city),
           stat = "identity",
           position = "stack",
           col = "black")+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Number of listings by year, month and city", 
       x = "Months", 
       y = "Number of listings")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Variable “months_inventory”

Bivariate analysis by city: The analysis shows that Tyler has the highest total and average number of months to sell all listings, while Wichita Falls has the lowest indicators. Variability, on the other hand, is greater for Bryan-College Station, which aligns with the higher variability in the number of listings observed earlier. The first result is linked to the fact that Tyler is also the city with the highest number and value of sales, so in absolute terms it may take longer to process all listings. To assess efficiency, it would be necessary to evaluate other indicators created that relate the different measures present in the dataset.

Bivariate analysis by year and month: The analysis shows a pattern of the number of months to sell listings that is partially similar, in terms of average and total values, to that of the number and value of sales, with a peak mid-year for some years, but much less pronounced. The variability pattern appears less marked and more dependent on individual years (for example, in 2013 there is a noticeable increasing variability over the months).

Bivariate analysis by date: The temporal analysis reveals seasonality, although less pronounced compared to the number and value of sales, and a decreasing trend over time, especially for those cities (Tyler and Bryan-College Station) with a positive trend in the number and value of sales and a negative trend in the number of listings. The other cities also show a generally decreasing trend, but less markedly. This suggests a direct, probably nearly linear, proportional relationship between the number of listings and the number of months to sell listings, and an inverse proportionality of this latter indicator with the number and value of sales, following a relationship that appears to involve nonlinear and more complex dynamics, confirming what was already suggested by the city-level analysis.

#Total, mean and standard deviation of months inventory by city
minv_by_city <- agg_city(data, city, months_inventory)
names(minv_by_city)[2:5] <- c("num_minv", "mean_minv", "sd_minv", "cv_minv")

kable(minv_by_city, caption = "Statistical index for months inventory aggregated by city", row.names = FALSE) 
Statistical index for months inventory aggregated by city
city num_minv mean_minv sd_minv cv_minv
Beaumont 598.2 9.970000 1.6495814 16.545450
Bryan-College Station 459.5 7.658333 2.2472048 29.343262
Tyler 679.5 11.325000 1.8864032 16.656982
Wichita Falls 469.0 7.816667 0.7809526 9.990865
#Plot of total months inventory by city
ggplot(data = minv_by_city)+
  geom_bar(aes(x = city, y = num_minv),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Months inventory by city", 
       x = "City", 
       y = "Months inventory")+ 
  theme_classic()

#Plot of mean months inventory by city
ggplot(data = minv_by_city)+
  geom_bar(aes(x = city, y = mean_minv),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Mean of months inventory by city", 
       x = "City", 
       y = "Mean of months inventory")+  
  theme_classic()

#Plot of variation coefficient of months inventory by city
ggplot(data = minv_by_city)+
  geom_bar(aes(x = city, y = cv_minv),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Months inventory variability by city", 
       x = "City", 
       y = "Months inventory variability")+ 
  theme_classic()

#Total number, mean and standard deviation of months inventory by month and year
minv_by_date <- agg_date(data, year, month, months_inventory)
names(minv_by_date)[3:6] <- c("num_minv", "mean_minv", "sd_minv", "cv_minv")

kable(minv_by_date, caption = "Statistical index for months inventory aggregated by year and city", row.names = FALSE) 
Statistical index for months inventory aggregated by year and city
year month num_minv mean_minv sd_minv cv_minv
2010 Jan 35.0 8.750 2.042058 23.33780
2010 Feb 36.6 9.150 1.957039 21.38840
2010 Mar 37.9 9.475 1.774589 18.72917
2010 Apr 40.1 10.025 2.129750 21.24439
2010 May 38.8 9.700 2.393045 24.67057
2010 Jun 40.2 10.050 2.452889 24.40686
2010 Jul 42.2 10.550 2.379776 22.55712
2010 Aug 42.5 10.625 2.304163 21.68624
2010 Sep 42.6 10.650 2.580052 24.22584
2010 Oct 42.0 10.500 2.471167 23.53492
2010 Nov 41.6 10.400 2.405549 23.13028
2010 Dec 39.2 9.800 2.287648 23.34335
2011 Jan 39.8 9.950 2.112660 21.23277
2011 Feb 40.8 10.200 2.210581 21.67236
2011 Mar 43.6 10.900 2.356551 21.61974
2011 Apr 46.1 11.525 2.372587 20.58643
2011 May 48.3 12.075 2.270646 18.80452
2011 Jun 47.7 11.925 2.211146 18.54210
2011 Jul 46.3 11.575 2.364142 20.42455
2011 Aug 44.9 11.225 2.351418 20.94804
2011 Sep 44.1 11.025 2.093442 18.98814
2011 Oct 42.6 10.650 2.080865 19.53864
2011 Nov 40.7 10.175 1.944865 19.11415
2011 Dec 38.5 9.625 1.882153 19.55484
2012 Jan 39.8 9.950 1.887679 18.97165
2012 Feb 40.6 10.150 1.763519 17.37457
2012 Mar 41.9 10.475 1.652019 15.77106
2012 Apr 42.5 10.625 1.682013 15.83071
2012 May 42.1 10.525 1.543535 14.66541
2012 Jun 41.9 10.475 1.641899 15.67446
2012 Jul 41.4 10.350 1.682260 16.25372
2012 Aug 39.2 9.800 1.794436 18.31057
2012 Sep 38.5 9.625 1.652019 17.16383
2012 Oct 36.9 9.225 1.594522 17.28479
2012 Nov 35.9 8.975 1.631717 18.18069
2012 Dec 33.5 8.375 1.408013 16.81209
2013 Jan 34.2 8.550 1.173314 13.72298
2013 Feb 34.8 8.700 1.101514 12.66108
2013 Mar 35.6 8.900 1.151810 12.94169
2013 Apr 36.2 9.050 1.382027 15.27102
2013 May 34.8 8.700 1.534058 17.63285
2013 Jun 34.4 8.600 1.741647 20.25171
2013 Jul 32.9 8.225 2.041854 24.82497
2013 Aug 32.0 8.000 1.954482 24.43103
2013 Sep 30.6 7.650 2.199242 28.74827
2013 Oct 30.0 7.500 2.252406 30.03208
2013 Nov 28.9 7.225 2.012254 27.85127
2013 Dec 26.8 6.700 1.867262 27.86958
2014 Jan 28.0 7.000 1.790717 25.58167
2014 Feb 28.4 7.100 1.762574 24.82498
2014 Mar 29.1 7.275 1.802544 24.77724
2014 Apr 29.5 7.375 1.858987 25.20661
2014 May 29.5 7.375 1.851801 25.10917
2014 Jun 29.9 7.475 2.069420 27.68455
2014 Jul 29.5 7.375 2.280899 30.92744
2014 Aug 29.2 7.300 2.264214 31.01664
2014 Sep 27.9 6.975 2.066196 29.62288
2014 Oct 27.3 6.825 1.905037 27.91263
2014 Nov 26.1 6.525 1.824600 27.96322
2014 Dec 24.3 6.075 1.783956 29.36554
#Plot of months inventory by month and year
ggplot(data = minv_by_date)+
  geom_bar(aes(x = month, y = num_minv),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = num_minv, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months inventory by year and month", 
       x = "Months", 
       y = "Months inventory")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of mean months inventory by month and year
ggplot(data = minv_by_date)+
  geom_bar(aes(x = month, y = mean_minv),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = mean_minv, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Mean of months inventory by year and month", 
       x = "Months", 
       y = "Mean of months inventory")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of variation coefficient of months inventory by month and year
ggplot(data = minv_by_date)+
  geom_bar(aes(x = month, y = cv_minv),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = cv_minv, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months inventory variabiliy by year and month", 
       x = "Months",  
       y = "Months inventory variability")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Total months inventory by date and city for trend analysis
minv_by_datecity <- agg_datecity(data, city, date, months_inventory)
names(minv_by_datecity)[3] <- c("num_minv")

#Plot by year and city to explore trend 
ggplot(data = minv_by_datecity)+
  geom_smooth(aes(x = date, y = num_minv, colour = city), lwd = 1)+
  geom_point(aes(x = date, y = num_minv, colour = city))+
  labs(title = "Months inventory trend by city", 
       x = "Period", 
       y = "Months inventory")+ 
  facet_wrap(~ city, scales = "free_x")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Boxplot of months inventory by date and city
ggplot(data = data)+
  geom_boxplot(aes(x = month, y = months_inventory, fill = city))+
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory distribution by city and month", 
       x = "Months", 
       y = "Months inventory")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of months inventory by month, year and city
ggplot(data = data)+
  geom_bar(aes(x = month, y = months_inventory, fill = city),
           stat = "identity",
           position = "stack",
           col = "black")+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months inventory by year, month and city", 
       x = "Months", 
       y = "Months inventory")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

New created variables

Bivariate analysis by city:

  • Average sales value: Bryan-College Station and Tyler show higher average values, indicating markets with generally more expensive or higher-value properties. Wichita Falls shows the lowest average value.
  • Listing efficiency index: Bryan-College Station stands out for the highest efficiency, followed by Wichita Falls and Beaumont, while Tyler exhibits the lowest efficiency, suggesting more difficulty in converting listings into sales.
  • Variation in listing efficiency: Wichita Falls shows the largest positive variation, indicating significant improvements over time; Tyler shows a negative variation, signaling a decline in efficiency.

Bivariate analysis by year and month:

  • Average sales value: Moderately seasonal patterns are evident with peaks in spring/summer months and a slight overall upward trend, signaling a market with seasonality and positive but not too pronounced trends.
  • Listing efficiency index: Shows clear seasonality with recurring peaks in the mid-year months and a year-on-year increasing trend indicating progressive improvements in efficiency.
  • Efficiency variability: Initially high in early years with peaks in central seasons, variability lessens over subsequent years, indicating more stable markets or those less subject to strong fluctuations.

Bivariate analysis by date:

  • Efficiency indices show a general positive trend from 2010 to 2014 in all cities, with Bryan-College Station exhibiting strong growth and Wichita Falls remaining more stable.
  • Differences among cities indicate local dynamics where some areas are more resilient or improving relative to others.
  • Variability suggests markets tend to stabilize but at different rates and levels depending on the territorial context.

There is a substantial correlation between the number and value of sales and efficiency indicators, highlighting that more active and higher-value markets tend to be more efficient or improve over time. Seasonality and annual dynamics suggest opportunities for strategic interventions targeted at specific periods.

#Function to aggregate new created variables by city
agg_new_city <- function(data, city, kpi1, kpi2, kpi3, kpi4) {
  datanew_by_city <- data %>% 
    group_by({{city}}) %>%
    summarize(mean_price = (sum({{kpi2}})/sum({{kpi1}})),
            eff_list_ind = (sum({{kpi1}})/sum({{kpi3}}))*100,
            mean_eff_listings = (sum({{kpi3}})/sum({{kpi4}})),
             .groups = "drop")%>%
            mutate(var_eff_list = ((mean({{kpi1}}) - mean_eff_listings)/(mean_eff_listings))*100)
  
  return(datanew_by_city)
}

agg_new_var_city <- agg_new_city(data, city, sales, volume, listings, months_inventory)

#Plot of statistical index table
agg_new_var_city <- agg_new_var_city[order(agg_new_var_city$eff_list_ind, decreasing = TRUE), ]

kable(agg_new_var_city, caption = "Statistical index for new variables aggregated by city and ordinated by efficiency listings index", row.names = FALSE) 
Statistical index for new variables aggregated by city and ordinated by efficiency listings index
city mean_price eff_list_ind mean_eff_listings var_eff_list
Bryan-College Station 0.1854261 14.125366 190.3983 0.9944459
Wichita Falls 0.1200187 12.760421 116.3646 65.2492747
Beaumont 0.1473171 10.562828 168.4370 14.1623825
Tyler 0.1696659 9.285554 256.5166 -25.0373272
#Plot of mean price by city
ggplot(data = agg_new_var_city)+
  geom_bar(aes(x = city, y = mean_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Mean price (mln $) by city", 
       x = "City", 
       y = "Mean price (mln $)")+ 
  theme_classic()

#Plot of efficiency listings index by city
ggplot(data = agg_new_var_city)+
  geom_bar(aes(x = city, y = eff_list_ind),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Efficiency listings index by city", 
       x = "City", 
       y = "Efficiency listings index")+ 
  theme_classic()

#Plot of variation efficiency listings index by city
ggplot(data = agg_new_var_city)+
  geom_bar(aes(x = city, y = var_eff_list),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Efficency listings index variation by city", 
       x = "City", 
       y = "Efficiency listings index variation (%)")+ 
  theme_classic()

#Function to aggregate new created variables by year and month
agg_new_ym <- function(data, year, month, kpi1, kpi2, kpi3, kpi4) {
  datanew_by_ym <- data %>% 
    group_by({{year}}, {{month}}) %>%
    summarize(mean_price = (sum({{kpi2}})/sum({{kpi1}})),
            eff_list_ind = (sum({{kpi1}})/sum({{kpi3}}))*100,
            mean_eff_listings = (sum({{kpi3}})/sum({{kpi4}})),
             .groups = "drop")%>%
            mutate(var_eff_list = ((mean({{kpi1}}) - mean_eff_listings)/(mean_eff_listings))*100)
  
  return(datanew_by_ym)
}

agg_new_var_ym <- agg_new_ym(data, year, month, sales, volume, listings, months_inventory)

#Plot of statistical index table
agg_new_var_ym <- agg_new_var_ym[order(agg_new_var_ym$eff_list_ind, decreasing = TRUE), ]

kable(agg_new_var_ym, caption = "Statistical index for new variables aggregated by year and month and ordinated by efficiency listings index", row.names = FALSE) 
Statistical index for new variables aggregated by year and month and ordinated by efficiency listings index
year month mean_price eff_list_ind mean_eff_listings var_eff_list
2014 Jun 0.1828683 17.725904 222.0736 -13.4108308
2014 Jul 0.1794058 17.439361 220.8136 -12.9167306
2014 May 0.1741943 17.405405 219.4915 -12.3922136
2014 Aug 0.1773975 16.320150 219.0753 -12.2257829
2013 Jul 0.1673035 16.250901 210.7903 -8.7758351
2013 Aug 0.1676938 16.177115 213.8438 -10.0784256
2014 Oct 0.1736611 15.687878 223.9194 -14.1246115
2014 Dec 0.1768980 15.397260 225.3086 -14.6541096
2014 Apr 0.1667114 15.019216 220.5085 -12.7962465
2013 May 0.1745260 14.916737 203.6207 -5.5637878
2013 Jun 0.1763780 14.831110 204.8256 -6.1193112
2014 Sep 0.1763226 14.596525 220.7527 -12.8927180
2012 Aug 0.1617096 13.294314 183.0612 5.0422705
2014 Mar 0.1660178 13.192157 219.0722 -12.2245098
2014 Nov 0.1657047 12.741577 224.0230 -14.1643150
2013 Dec 0.1636630 12.564278 217.6866 -11.6658096
2012 Jul 0.1642920 12.488225 179.4928 7.1306015
2010 May 0.1543383 12.458183 192.6031 -0.1616932
2013 Sep 0.1680356 12.312812 216.0458 -10.9949327
2013 Apr 0.1658554 12.209707 198.6464 -3.1990219
2012 Jun 0.1662973 12.081259 177.3986 8.3952756
2010 Apr 0.1475939 12.036794 189.7756 1.3258322
2012 May 0.1672630 11.834161 177.0309 8.6204101
2013 Mar 0.1541794 11.673598 195.8708 -1.8272862
2013 Oct 0.1688198 11.427687 215.2667 -10.6728089
2010 Jun 0.1595739 11.309912 190.4726 0.9550085
2011 Jun 0.1570463 11.218152 165.3878 16.2671124
2014 Feb 0.1647032 11.098673 220.1761 -12.6645877
2013 Nov 0.1597029 11.068335 215.7093 -10.8561250
2010 Mar 0.1481801 10.877395 183.1398 4.9971786
2012 Oct 0.1549919 10.702438 187.8862 2.3447642
2011 Jul 0.1611515 10.442387 167.9482 14.4946523
2011 Aug 0.1537099 10.353003 169.0869 13.7236016
2014 Jan 0.1500415 10.230054 218.8929 -12.1526078
2012 Apr 0.1501044 10.072816 174.4941 10.1995123
2012 Sep 0.1594894 9.977420 184.0519 4.4768440
2012 Dec 0.1514557 9.975178 192.4179 -0.0656092
2011 May 0.1564564 9.762203 165.4244 16.2413955
2012 Mar 0.1534866 9.751749 174.0095 10.5063892
2010 Aug 0.1549539 9.562063 181.6000 5.8874816
2012 Nov 0.1577646 9.557418 189.4429 1.5037617
2011 Apr 0.1540042 9.188912 169.0239 13.7659886
2010 Jul 0.1499494 9.165808 184.0758 4.4632896
2011 Mar 0.1520449 8.842997 173.2569 10.9864531
2013 Feb 0.1516796 8.804751 193.5345 -0.6421678
2010 Dec 0.1524338 8.781623 175.4592 9.5933896
2013 Jan 0.1592691 8.755130 192.3684 -0.0398997
2011 Dec 0.1475168 8.662987 169.4026 13.5116401
2011 Sep 0.1511971 8.477089 168.2540 14.2865566
2011 Oct 0.1486094 8.212360 169.7887 13.2534909
2012 Feb 0.1437352 8.178968 172.8571 11.2431129
2011 Nov 0.1571730 7.945007 169.7789 13.2600699
2010 Sep 0.1485385 7.764217 180.7981 6.3571150
2010 Oct 0.1542726 7.563588 177.8571 8.1157965
2012 Jan 0.1398617 7.334999 170.9296 12.4975501
2010 Feb 0.1578994 7.265404 183.1421 4.9958974
2010 Nov 0.1464473 6.926467 174.5673 10.1533095
2011 Feb 0.1479296 6.567708 175.0245 9.8655650
2010 Jan 0.1514276 6.510980 184.7429 4.0861171
2011 Jan 0.1427247 6.102815 174.9749 9.8967308
#Plot of mean price by month and year
ggplot(data = agg_new_var_ym)+
  geom_bar(aes(x = month, y = mean_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = mean_price, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Mean price by year and month", 
       x = "Months", 
       y = "Mean price (mln $)")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of efficiency listings index by month and year
ggplot(data = agg_new_var_ym)+
  geom_bar(aes(x = month, y = eff_list_ind),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = eff_list_ind, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Efficiency listings index by year and month", 
       x = "Months", 
       y = "Efficiency listings index")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

##Plot of variation efficiency listings index by month and year
ggplot(data = agg_new_var_ym)+
  geom_bar(aes(x = month, y = var_eff_list),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = var_eff_list, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Efficiency listings index variation by year and month", 
       x = "Months", 
       y = "Efficiency listings index variation")+  
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Efficiency listings index by date and city for trend analysis
agg_new_efl <- function(data, city, date, kpi1, kpi3) {
  datanew_by_efl <- data %>% 
    group_by({{city}}, {{date}}) %>%
    summarize(eff_list_ind = (sum({{kpi1}})/sum({{kpi3}}))*100, .groups = "drop")%>%
  
  return(datanew_by_efl)
}

efl_by_datecity <- agg_new_efl(data, city, date, sales, listings)

#Plot by year and city to explore trend 
ggplot(data = efl_by_datecity)+
  geom_smooth(aes(x = date, y = eff_list_ind, colour = city), lwd = 1)+
  geom_point(aes(x = date, y = eff_list_ind, colour = city))+
  labs(title = "Efficiency listings index trend by city", 
       x = "Period", 
       y = "Efficiency listings index")+ 
  facet_wrap(~ city, scales = "free_x")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Boxplot of efficiency list index by date and city
ggplot(data = data)+
  geom_boxplot(aes(x = month, y = efficiency_list_index, fill = city))+
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Efficiency listings index distribution by city and month", 
       x = "Months",
       y = "Efficiency listings index")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of efficiency listings index by month, year and city
ggplot(data = data)+
  geom_bar(aes(x = month, y = efficiency_list_index, fill = city),
           stat = "identity",
           position = "stack",
           col = "black")+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Efficiency listings index by year, month and city", 
       x = "Months", 
       y = "Efficiency listings index")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

First breakpoint: results of analysis (points 1 and 2)

Historical Trends in Real Estate Sales:

  • Real estate sales in Texas show a well-defined seasonality, with a significant increase in both the number and average value of transactions during the spring and early summer months. For example, May, June, and July stand out for having sales volumes above the average, likely related to favorable weather conditions and seasonal economic factors that encourage property purchases. This seasonality is also reflected in the monthly variations of sales indicators and in the higher efficiency observed during these periods.
  • Looking at the cities, Bryan-College Station and Tyler confirm themselves as more dynamic markets: Bryan-College Station shows a particularly high average sales value, suggesting the presence of higher-end properties or stronger demand, while Tyler records a very high total number of sales with contained variability, indicating market stability. In contrast, Wichita Falls shows more moderate dynamics, with a lower average sales value and higher variability, possibly reflecting less stable demand or a less homogeneous supply.
  • An interesting finding is the progress in sales efficiency recorded in almost all cities: Bryan-College Station in particular shows a clear improvement over time in the ability to convert listings into actual sales, indicating a market evolving positively and likely more effective marketing and sales strategies. This trend is an encouraging sign, suggesting, for example, that real estate operators in this area are refining their sales techniques and market responsiveness.

Effectiveness of Marketing Strategies for Listings:

  • The listing efficiency index, calculated as the percentage of listings sold relative to the total, varies significantly among cities and over different periods analyzed but generally shows a positive trend over time. For instance, Wichita Falls exhibits a marked increase in this index in recent years, suggesting that advertising campaigns or property positioning strategies are becoming more effective, possibly due to better targeting of buyers or additional services offered.
  • Monthly and annual variations of the index also reveal a seasonal pattern: spring and summer periods are when listing conversion reaches its highest levels. This indicates strategic time windows in which to focus marketing efforts, optimizing resources and maximizing results.
  • The differences between cities are significant: while Bryan-College Station maintains a consistently high index, signaling an efficient and competitive market, Tyler suffers from lower efficiency values, highlighting a possible oversupply or less responsive demand. In these cases, targeted interventions such as more aggressive promotional campaigns, improved listing quality, or incentives for real estate agents could help increase sales effectiveness.

Initial Overall Assessment:

  • The overall picture shows a Texas real estate market characterized by clear seasonality and sharp territorial differences in terms of both volume and sales efficiency. The positive trends in listing efficiency and average sales value in some cities indicate improvements in market dynamics and commercial strategies.
  • However, disparities among different areas suggest that not all cities are fully exploiting their market potential, making further in-depth analysis desirable to support personalized interventions. Overall, this analysis represents a solid starting point to optimize marketing actions and enhance commercial penetration in the territory, aiming to increase profitability and competitiveness for real estate operators in Texas.

Bivariate analysis to explore relationship among variables

Variables “volume” vs “sales”

The analysis of the relationship between the number of sales and the value of sales variables shows a strong linear correlation with a marked direct proportionality. This was suggested by the similar trends of both variables by city and by year and month, indicating that this relationship holds at a global level and not only locally. The analysis of the shape and slope of the confidence ellipses, as well as the visualization of the linear regression lines of the data, further confirm the linear relationship between these two variables.

#The analysis of trends suggest some kind of correlation between number of sales and volume. 
#For this reason, we can plot this two variables and explore this correlation
ggplot(data = data)+
  geom_point(aes(x = sales, y = volume, colour = city))+
  geom_smooth(aes(x = sales, y = volume), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Volume of sales vs Number of sales", 
        x = "Number of sales", 
        y = "Volume of sales (mln $)")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Function for correlation index by city and year between number and volume of sales
corr_fun <- function(data, year, city, kpi1, kpi2){
  corr_by_yc <- data %>%
  group_by({{year}}, {{city}}) %>%
  summarize(corr_index = cor({{kpi1}}, {{kpi2}}, use = "complete.obs"), .groups = "drop")
  
  return(corr_by_yc)
}

corr_by_yearcity <- corr_fun(data, year, city, sales, volume)

#Plot of corrplot for this calculated index
corr_by_yearcity %>% 
  ungroup()
## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                   0.937
##  2  2010 Bryan-College Station      0.995
##  3  2010 Tyler                      0.969
##  4  2010 Wichita Falls              0.953
##  5  2011 Beaumont                   0.912
##  6  2011 Bryan-College Station      0.990
##  7  2011 Tyler                      0.988
##  8  2011 Wichita Falls              0.943
##  9  2012 Beaumont                   0.960
## 10  2012 Bryan-College Station      0.993
## 11  2012 Tyler                      0.973
## 12  2012 Wichita Falls              0.820
## 13  2013 Beaumont                   0.987
## 14  2013 Bryan-College Station      0.996
## 15  2013 Tyler                      0.973
## 16  2013 Wichita Falls              0.930
## 17  2014 Beaumont                   0.979
## 18  2014 Bryan-College Station      0.993
## 19  2014 Tyler                      0.983
## 20  2014 Wichita Falls              0.923
#Transform in wide format
corr_matrix <- corr_by_yearcity %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat <- as.matrix(corr_matrix[,-1])

#Set rownames like years, to be more clear
rownames(mat) <- corr_matrix$year

#Corrplot
corrplot(mat, method = "color", addCoef.col = "white", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Volume of sales vs Number of sales correlation by year and city",
         mar = c(0,0,1,0)) 

#This analysis confirm that number of sales and volume show a very important correlation,
#globally and by city and year (values from 0.91 to 1.00). The only value of correlation under 0.91
#is for year = 2012 and city = Wichita Falls (0.82)

#The high correlation suggests a strong linear relationship between sales and volume, with data 
#clustered near a straight line in bivariate space. The joint distributions are therefore expected 
#to exhibit a narrow elliptical shape. However, since the correlation reflects only the linear 
#association, further controls for outliers, non-Gaussian distributions, or possible nonlinear 
#relationships are necessary.

#Plot of joint distribution of sales and volume
ggplot(data, aes(x = sales, y = volume)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Volume of sales vs Number of sales joint distribution with confidence ellipses", 
       x = "Number of sales", 
       y = "Volume of sales (mln $)") + 
  theme_classic()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

#Plot of points around linear regression rect
ggplot(data, aes(x = sales, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Volume of sales vs Number of sales linear regression with confidence intervals", 
       x = "Number of sales", 
       y = "Volume of sales (mln $)") + 
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#All this analysis confirm strong linearity beetwen sales and volume

Variables “median_price” vs “sales”

Summary of the Analysis of the Relationship between Number of Sales and Median Sales Value:
The exploratory analysis of the relationship between the number of sales and the median sales value of real estate reveals a complex relationship that is not perfectly linear, but shows several significant trends at territorial and temporal levels.

Correlations and Trends by City and Year:
At a first level, the correlation between the two variables varies considerably across cities and over the years. In general, the correlation is positive, indicating that an increase in the number of sales tends to be associated with an increase in the median sales value, but this relationship is not strong or stable in all cases:

  • Tyler: shows a positive and robust trend between 2011 and 2013, although a slight decline is observed in 2014, suggesting a still dynamic market that may be slowing down.
  • Bryan-College Station: demonstrates a growing market from 2010 to 2013, with stabilization in 2014.
  • Beaumont: exhibits a correlation ranging from -0.69 to 0.75, indicating a potentially growing market but with variability over time.
  • Wichita Falls: shows a generally stable but weaker correlation trend, indicating a more stagnant or troubled market.

Linear Model and Residual Distribution:
Applying a linear regression model between number of sales and median value confirms the presence of a linear relationship, albeit imperfect:

  • The presence of heteroscedasticity and skewness in the residuals indicates that errors are not distributed uniformly, with residual variability changing depending on the prediction region.
  • Consequently, the model tends to overestimate or underestimate the relationship, especially for low sales values.
  • The presence of outliers suggests possible market stratification or atypical cases that deserve further investigation.

Nonlinear Analysis with Loess:
The nonlinear regression analysis (loess) highlights a better fit of the relationship between sales and median prices, capturing curvatures and local dynamics more accurately:

  • Confirms market growth for Beaumont, Tyler, and Bryan-College Station in a more nuanced and realistic manner.
  • Makes the stability of Wichita Falls’ market clearer.

Overall Statistical Considerations:
The analysis suggests that although the number of sales and median market value are correlated, the relationship cannot be simply modeled with a basic linear regression without accounting for variability, nonlinearity, and outliers. Seasonality, local dynamics, and specific market conditions influence the link between quantity and price.

#Correlation with sales
ggplot(data = data)+
  geom_point(aes(x = sales, y = median_price, colour = city))+
  geom_smooth(aes(x = sales, y = median_price), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Median price vs Number of sales by year", 
       x = "Number of sales", 
       y = "Median price")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between number and median price of sales
corr_by_yearcity_mp <- corr_fun(data, year, city, sales, median_price)

#Plot of corrplot for this calculated index
corr_by_yearcity_mp %>% 
  ungroup()
## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                -0.692  
##  2  2010 Bryan-College Station   -0.535  
##  3  2010 Tyler                    0.280  
##  4  2010 Wichita Falls            0.463  
##  5  2011 Beaumont                 0.486  
##  6  2011 Bryan-College Station   -0.426  
##  7  2011 Tyler                    0.627  
##  8  2011 Wichita Falls            0.542  
##  9  2012 Beaumont                 0.618  
## 10  2012 Bryan-College Station    0.173  
## 11  2012 Tyler                    0.666  
## 12  2012 Wichita Falls            0.375  
## 13  2013 Beaumont                 0.394  
## 14  2013 Bryan-College Station    0.531  
## 15  2013 Tyler                    0.708  
## 16  2013 Wichita Falls           -0.0353 
## 17  2014 Beaumont                 0.752  
## 18  2014 Bryan-College Station   -0.00894
## 19  2014 Tyler                    0.493  
## 20  2014 Wichita Falls            0.232
#Transform in wide format
corr_matrix_mp <- corr_by_yearcity_mp %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_mp <- as.matrix(corr_matrix_mp[,-1])

#Set rownames like years, to be more clear
rownames(mat_mp) <- corr_matrix_mp$year

#Corrplot
corrplot(mat_mp, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Median price vs Number of sales correlation by year and city",
         mar = c(0,0,1,0)) 

#This analysis shows low correlation between two variables,
#except for city of Tyler in the year from 2011 to 2013.
#Generally correlation is positive, so median price growth like number of sales, but not in a
#strong dependency and in other cases the correlation is negative or null.
#Analyzing the situation by city into years, the correlation plot shows that
#Beaumont --> the correlation goes from -0.69 to 0.75, so the market meybe is growing
#Bryan-College Station --> the market are growth from 2010 to 2013, and into 2014 is freeze
#Tyler --> shows a positive market trend, but into 2014 a small decrease
#Wichita Falls --> shows a stable trend, but generally the market seems in a difficult situation

#Plot of joint distribution of sales and median price
ggplot(data, aes(x = sales, y = median_price)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Number of sales joint distribution with confidence ellipses", 
       x = "Number of sales", 
       y = "Median price") + 
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = sales, y = median_price, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Number of sales linear regression with confidence intervals", 
       x = "Number of sales", 
       y = "Median price") + 
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(median_price ~ sales, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (median price vs number of sales))", 
       x = "Predicted values", 
       y = "Residuals") + 
  theme_classic()

#The linear relationship exists but is not perfectly described by the simple linear model due to heteroskedasticity and skewness in the residuals.
#The model may underestimate or overestimate the dependence in certain areas (especially for low values ​​of the variable).
#The presence of outliers suggests that a more in-depth analysis of the data would be useful to identify possible subgroups or outliers.
#It may be worth trying variable transformations (e.g., logarithmic) or more flexible nonlinear models to better accommodate the shape of
#the relationship and improve the homogeneity of errors.

#Plot of loess smooth of sales and median price by city
ggplot(data, aes(x = sales, y = median_price, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Number of sales non linear regression with confidence intervals", 
       x = "Number of sales", 
       y = "Median price") + 
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Trends confirm grwoth in market for Beaoumont and Tyler and also for Bryan-College Station.
#Wichita Falls confirms that its market show a stable situation

Variables “median_price” vs “volume”

The correlation between the total sales value and the median sales value is generally positive, indicating that an increase in the overall transaction value tends to be accompanied by an increase in the median value. However, this relationship is not always strong and shows considerable variability across time and between cities.

In particular, the city of Tyler exhibits a strong positive correlation from 2011 to 2013, with a slight decline in 2014, confirming similar trends observed in the previous analysis on the number of sales. Other cities, such as Beaumont, show wide fluctuations in correlation over time, ranging from negative to highly positive values, suggesting more unstable or evolving markets. Bryan-College Station shows a growth phase until 2013 followed by stabilization in 2014, whereas Wichita Falls displays a more stable but generally weaker trend.

Linear Model and Residual Analysis
The linear regression confirms not a very high positive linear relationship between volume and median value and as in the previous case, residuals exhibit heteroscedasticity and asymmetry, indicating that the adequacy of the linear model is limited. The presence of outliers and significant dispersion in residuals suggests that nonlinear factors or specific market conditions influence this relationship.

Nonlinear Analysis with Loess
The loess regression improves the fit to data curvature and local dynamics, more accurately confirming market growth in the major cities and highlighting a more stagnant situation in Wichita Falls.

Spearman Correlation
The use of the Spearman correlation, less sensitive to outliers, reinforces the previous conclusions, showing a generally positive and monotonic relationship between volume and median value, though not perfectly linear.

Final Considerations
These results indicate that the median sales value is influenced, albeit in a complex and nonlinear way, by the total volume of real estate transactions. The presence of heteroscedasticity and outliers suggests considering more flexible models or variable transformations (for example, logarithmic) for a better market representation.

The analysis also highlights significant territorial differences that may reflect economic conditions, micro-dynamics of supply and demand, or varied sales strategies.

#Correlation with volume
ggplot(data = data)+
  geom_point(aes(x = volume, y = median_price, colour = city))+
  geom_smooth(aes(x = volume, y = median_price), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Median price vs Volume of sales by year", 
       x = "Volume of sales (mln $)", 
       y = "Median price")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between volume and median price of sales
corr_by_yearcity_mpv <- corr_fun(data, year, city, volume, median_price)

#Plot of corrplot for this calculated index
corr_by_yearcity_mpv %>% 
  ungroup()
## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                 -0.686 
##  2  2010 Bryan-College Station    -0.525 
##  3  2010 Tyler                     0.436 
##  4  2010 Wichita Falls             0.644 
##  5  2011 Beaumont                  0.637 
##  6  2011 Bryan-College Station    -0.315 
##  7  2011 Tyler                     0.682 
##  8  2011 Wichita Falls             0.721 
##  9  2012 Beaumont                  0.768 
## 10  2012 Bryan-College Station     0.234 
## 11  2012 Tyler                     0.723 
## 12  2012 Wichita Falls             0.793 
## 13  2013 Beaumont                  0.492 
## 14  2013 Bryan-College Station     0.579 
## 15  2013 Tyler                     0.798 
## 16  2013 Wichita Falls             0.297 
## 17  2014 Beaumont                  0.818 
## 18  2014 Bryan-College Station     0.0872
## 19  2014 Tyler                     0.585 
## 20  2014 Wichita Falls             0.487
#Transform in wide format
corr_matrix_mpv <- corr_by_yearcity_mpv %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_mpv <- as.matrix(corr_matrix_mpv[,-1])

#Set rownames like years, to be more clear
rownames(mat_mpv) <- corr_matrix_mpv$year

#Corrplot
corrplot(mat_mpv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Median price vs Volume of sales correlation by year and city",
         mar = c(0,0,1,0)) 

#This analysis shows low correlation between two variables,
#except for city of Tyler in the year from 2011 to 2013.
#Generally correlation is positive, so median price growth like number of sales, but not in a
#strong dependency and in other cases the correlation is negative or null.
#Analyzing the situation by city into years, the correlation plot shows that
#Beaumont --> the correlation goes from -0.69 to 0.82, so the market meybe is growing
#Bryan-College Station --> the market are growth from 2010 to 2013, and into 2014 is freeze
#Tyler --> shows a positive market trend, but into 2014 a small decrease
#Wichita Falls --> shows a stable trend

#Plot of joint distribution of volume and median price
ggplot(data, aes(x = volume, y = median_price)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Volume of sales joint distribution with confidence ellipses", 
       x = "Volume of sales (mln $)", 
       y = "Median price") + 
  theme_classic() 

#Plot of points around linear regression rect
ggplot(data, aes(x = volume, y = median_price, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Volume of sales linear regression with confidence intervals", 
       x = "Volume of sales (mln $)",  
       y = "Median price") + 
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(median_price ~ volume, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (median price vs volume of sales)", 
       x = "Predicted values", 
       y = "Residuals") + 
  theme_classic()

#The linear relationship exists but is not perfectly described by the simple linear model due to heteroskedasticity and skewness in the residuals.
#The model may underestimate or overestimate the dependence in certain areas (especially for low values ​​of the variable).
#The presence of outliers suggests that a more in-depth analysis of the data would be useful to identify possible subgroups or outliers.
#It may be worth trying variable transformations (e.g., logarithmic) or more flexible nonlinear models to better accommodate the shape of
#the relationship and improve the homogeneity of errors.

#Plot of loess smooth of volume and median price by city
ggplot(data, aes(x = volume, y = median_price, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Volume of sales non linear regression with confidence intervals", 
       x = "Volume of sales (mln $)", 
       y = "Median price") + 
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Trends confirm grwoth in market for Beaoumont and Tyler and also for Bryan-College Station.
#Also Wichita Falls show a low growth, despite the flat trend of median price by number of sales,
#while correlation shows a bad situation for the market
#Generally, between volume and median price there isn't a linear correlation, but more linear
#than the correlation between sales e median price

#Beacause residuals analysis shows an important dispersion of values, we could try to calculate
#correlation using Spearman index
corr_sp_fun <- function(data, year, city, kpi1, kpi2) {
  corr_sp_data <- data %>%
  group_by({{year}}, {{city}}) %>%
  summarize(corr_spearman = cor({{kpi1}}, {{kpi2}}, method = "spearman", use = "complete.obs"), .groups = "drop")
  
  return(corr_sp_data)
}

corr_by_yearcity_mpv_sp <- corr_sp_fun(data, year, city, volume, median_price)
  
corr_by_yearcity_mpv_sp %>%
  ungroup()
## # A tibble: 20 × 3
##     year city                  corr_spearman
##    <int> <chr>                         <dbl>
##  1  2010 Beaumont                    -0.503 
##  2  2010 Bryan-College Station       -0.545 
##  3  2010 Tyler                        0.406 
##  4  2010 Wichita Falls                0.643 
##  5  2011 Beaumont                     0.748 
##  6  2011 Bryan-College Station       -0.154 
##  7  2011 Tyler                        0.776 
##  8  2011 Wichita Falls                0.827 
##  9  2012 Beaumont                     0.510 
## 10  2012 Bryan-College Station        0.497 
## 11  2012 Tyler                        0.650 
## 12  2012 Wichita Falls                0.718 
## 13  2013 Beaumont                     0.608 
## 14  2013 Bryan-College Station        0.586 
## 15  2013 Tyler                        0.802 
## 16  2013 Wichita Falls                0.427 
## 17  2014 Beaumont                     0.760 
## 18  2014 Bryan-College Station       -0.0140
## 19  2014 Tyler                        0.503 
## 20  2014 Wichita Falls                0.315
corr_matrix_spearman_mpv <- corr_by_yearcity_mpv_sp %>%
  pivot_wider(names_from = city, values_from = corr_spearman)

mat_spearman <- as.matrix(corr_matrix_spearman_mpv[,-1])
rownames(mat_spearman) <- corr_matrix_spearman_mpv$year

corrplot(mat_spearman, method = "color", addCoef.col = "black", tl.col = "black",
         tl.cex = 0.7, tl.srt = 30, number.cex = 0.9, cl.pos = "b",
         title = "Median price vs Volume of sales Spearman corr by year and city",
         mar = c(0,0,1,0)) 

#The bivariate analysis between median sales price and total sales value highlights a positive 
#relationship with a less pronounced nonlinearity than that observed between median price and number 
#of sales. This result is consistent with the fact that, although the median price is less 
#influenced by outliers than the mean, it still tends to reflect the central market trend, which 
#increases along with the total sales value. Analysis using Spearman's correlation, which is less 
#sensitive to outliers, confirms this dynamic, suggesting a monotonic but not perfectly linear 
#relationship between the two variables. In conclusion, these results invite the exploration of 
#models or transformations that can better capture the complexity of the relationship between 
#sales volume and median price.

Variables “listings” vs “sales”

Correlation and General Trend
The analysis highlights a generally positive correlation between the number of active property listings and the number of sales, suggesting that greater supply tends to be accompanied by a higher number of sales. However, the strength of this correlation shows a decreasing trend over time across all cities considered, indicating a likely change in market dynamics.

City-Level Considerations
Beaumont and Bryan-College Station show a decrease in correlation over the analyzed period, with values ranging from moderate to low, suggesting that sales no longer proportionally follow the trend in listings over the years. Tyler shows a slight decrease in correlation over the years but maintains a more stable relationship compared to other cities. Wichita Falls presents a rather stable trend, with a more consistent correlation over time.

Linear Model and Residual Analysis
The application of a linear regression model confirms a positive relationship between the number of listings and sales, but the residuals show heteroscedasticity and skewness, meaning non-constant variance of errors and asymmetry. Furthermore, the residuals display strong unexplained variability and cyclical temporal patterns, underscoring the influence of additional factors or nonlinear effects. The presence of outliers indicates markets or periods with particular behavior.

Nonlinear Regression (Loess)
Using loess regression provides a more flexible and realistic model, capturing local variations over time and possible nonlinearities in the relationship between listings and sales.

Spearman Correlation
The analysis with Spearman’s correlation, which is robust to outliers, confirms the observed trends: a positive but complex and evolving relationship, with a decreasing correlation over time.

#Correlation with sales (verify if listings influence sales)
ggplot(data = data)+
  geom_point(aes(x = listings, y = sales, colour = city))+
  geom_smooth(aes(x = listings, y = sales), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Number of sales vs Listings by year", 
       x = "Number of listings", 
       y = "Number of sales")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between number and listings
corr_by_yearcity_ls <- corr_fun(data, year, city, listings, sales)

#Plot of corrplot for this calculated index
corr_by_yearcity_ls %>% 
  ungroup()
## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                   0.568
##  2  2010 Bryan-College Station      0.618
##  3  2010 Tyler                      0.390
##  4  2010 Wichita Falls             -0.118
##  5  2011 Beaumont                   0.509
##  6  2011 Bryan-College Station      0.722
##  7  2011 Tyler                      0.866
##  8  2011 Wichita Falls              0.760
##  9  2012 Beaumont                   0.446
## 10  2012 Bryan-College Station      0.414
## 11  2012 Tyler                      0.749
## 12  2012 Wichita Falls              0.274
## 13  2013 Beaumont                   0.443
## 14  2013 Bryan-College Station      0.191
## 15  2013 Tyler                      0.852
## 16  2013 Wichita Falls              0.415
## 17  2014 Beaumont                   0.334
## 18  2014 Bryan-College Station      0.208
## 19  2014 Tyler                      0.382
## 20  2014 Wichita Falls              0.657
#Transform in wide format
corr_matrix_ls <- corr_by_yearcity_ls %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_ls <- as.matrix(corr_matrix_ls[,-1])

#Set rownames like years, to be more clear
rownames(mat_ls) <- corr_matrix_ls$year

#Corrplot
corrplot(mat_ls, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Number of sales vs Listings correlation by year and city",
         mar = c(0,0,1,0)) 

#This analysis shows low correlation between two variables,
#except for city of Tyler in the year from 2011 to 2013.
#Generally correlation is positive, so number of sales growth like listings, but not at all in a
#strong dependency.
#Analyzing the situation by city into years, the correlation plot shows that
#Beaumont --> the correlation goes from 0.57 to 0.33, so the number of sales are decreasing according to number of listings
#Bryan-College Station --> the correlation goes from 0.62 to 0.21, so the number of sales are decreasing according to number of listings
#Tyler --> shows a small decrease into the years
#Wichita Falls --> shows a stable trend

#Plot of joint distribution of listings and sales
ggplot(data, aes(x = listings, y = sales)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Number of sales vs Listings joint distribution with confidence intervals", 
       x = "Number of listings", 
       y = "Number of sales") + 
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = listings, y = sales, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Number of sales vs Listings linear regression with confidence intervals", 
       x = "Listings", 
       y = "Number of sales") + 
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(sales ~ listings, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (number of sales vs listings)", 
       x = "Predicted values", 
       y = "Residuals") + 
  theme_classic()

#The linear relationship exists but is not perfectly described by the simple linear model due to heteroskedasticity and skewness in the residuals.
#The model may underestimate or overestimate the dependence in certain areas (especially for low values of the variable).
#The presence of outliers suggests that a more in-depth analysis of the data would be useful to identify possible subgroups or outliers.
#It may be worth trying variable transformations (e.g., logarithmic) or more flexible nonlinear models to better accommodate the shape of
#the relationship and improve the homogeneity of errors.

#Plot of loess smooth of sales and median price by city
ggplot(data, aes(x = listings, y = sales, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Number of sales vs Listings non linear regression with confidence intervals", 
       x = "Number of listings", 
       y = "Number of sales") + 
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Analysis of the relationship between the number of active listings and the number of sales 
#highlights a generalized positive correlation, but one that decreases over time for all cities 
#considered. This decrease tends to coincide with an increase in the median price and the total 
#value of sales, suggesting a market in which the decrease in active supply does not translate 
#into a proportional reduction in sales, likely due to demand and price dynamics. Examination of 
#the residuals from the linear regression model reveals strong unexplained variability and wave-like 
#temporal patterns, indicating that the relationship between listings and sales is influenced by 
#additional factors or complex nonlinear processes. The large confidence intervals, which indicate 
#considerable data dispersion, require caution in interpretation, except for one city where the 
#relationship appears more defined and stable. Overall, the results underscore the need for more 
#flexible models or the inclusion of additional explanatory variables to fully describe market 
#dynamics.

#Beacause residuals analysis shows an important dispersion of values, we could try to calculate
#correlation using Spearman index
corr_by_yearcity_ls_sp <- corr_sp_fun(data, year, city, listings, sales)

corr_by_yearcity_ls_sp %>%
  ungroup()
## # A tibble: 20 × 3
##     year city                  corr_spearman
##    <int> <chr>                         <dbl>
##  1  2010 Beaumont                     0.378 
##  2  2010 Bryan-College Station        0.797 
##  3  2010 Tyler                        0.399 
##  4  2010 Wichita Falls               -0.0210
##  5  2011 Beaumont                     0.581 
##  6  2011 Bryan-College Station        0.706 
##  7  2011 Tyler                        0.881 
##  8  2011 Wichita Falls                0.778 
##  9  2012 Beaumont                     0.623 
## 10  2012 Bryan-College Station        0.404 
## 11  2012 Tyler                        0.914 
## 12  2012 Wichita Falls                0.480 
## 13  2013 Beaumont                     0.497 
## 14  2013 Bryan-College Station        0.0385
## 15  2013 Tyler                        0.930 
## 16  2013 Wichita Falls                0.343 
## 17  2014 Beaumont                     0.371 
## 18  2014 Bryan-College Station        0.259 
## 19  2014 Tyler                        0.503 
## 20  2014 Wichita Falls                0.685
corr_matrix_spearman_ls <- corr_by_yearcity_ls_sp %>%
  pivot_wider(names_from = city, values_from = corr_spearman)

mat_spearman <- as.matrix(corr_matrix_spearman_ls[,-1])
rownames(mat_spearman) <- corr_matrix_spearman_ls$year

corrplot(mat_spearman, method = "color", addCoef.col = "black", tl.col = "black",
         tl.cex = 0.7, tl.srt = 30, number.cex = 0.9, cl.pos = "b",
         title = "Number of sales vs Listings Spearman corr by year and city",
         mar = c(0,0,1,0)) 

#Also the analysis with correlation Spearman index confirms the same considerations already done

Variables “listings” vs “volume”

General Correlation and Residual Patterns The relationship between the number of listings and the total sales value (volume) appears generally positive, although with a low to moderate correlation that varies over time and among cities. The residual analysis from the linear regression model reveals four distinct clusters corresponding to the analyzed cities, suggesting that the relationship between the variables differs by geographic area. This indicates the need for separate models or models with interaction terms to better understand territory-specific dynamics.

Linear Interaction Model The linear model with interaction between number of listings and city explains about 54% of the overall variability (R² ≈ 0.54). However, the statistical significance of the interaction terms is generally low, indicating that, statistically, the slope of the relationship between volume and listings does not differ significantly among cities. This suggests that, under the simple linear model, city does not significantly modify the effect of listings on sales value.

Models with Nonlinearity
The polynomial (quadratic) model with interactions did not show significant improvement or relevant terms. Generalized Additive Models (GAMs), instead, offer a more flexible and realistic description:

  • The relationship between number of listings and sales volume is nonlinear and significant for some cities (Bryan-College Station and Tyler), while for others (Wichita Falls and Beaumont) the relationship is rather linear or less influential.
  • The GAM model highlights that the city effect on the relationship between listings and sales volume is significant and depends on the shape of the relationship, i.e., it modifies linearity.

Visualizations and Local Trends
Loess regressions for each city clearly show different territorial market dynamics, with varying curves relating number of listings and sales volume. In particular, some cities exhibit strong deviations from linearity and significant local variation.

Final Considerations
This analysis shows that the relationship between the number of active listings and total sales value varies considerably among cities and cannot be adequately captured by a simple linear model. Flexible models such as GAMs are more effective in describing local complexities and diversity. For future in-depth analyses, it is advisable to employ approaches that allow modeling nonlinearities and spatiotemporal interactions to improve interpretation and forecasting of the real estate market.

#Correlation with volume
ggplot(data = data)+
  geom_point(aes(x = listings, y = volume, colour = city))+
  geom_smooth(aes(x = listings, y = volume), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Volume of sales vs Listings by year",  
       x = "Number of listings", 
       y = "Volume of sales (mln $)")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between volume and listings
corr_by_yearcity_ls <- corr_fun(data, year, city, listings, volume)

#Plot of corrplot for this calculated index
corr_by_yearcity_ls %>% 
  ungroup()
## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                  0.448 
##  2  2010 Bryan-College Station     0.624 
##  3  2010 Tyler                     0.450 
##  4  2010 Wichita Falls            -0.0104
##  5  2011 Beaumont                  0.559 
##  6  2011 Bryan-College Station     0.670 
##  7  2011 Tyler                     0.882 
##  8  2011 Wichita Falls             0.704 
##  9  2012 Beaumont                  0.355 
## 10  2012 Bryan-College Station     0.368 
## 11  2012 Tyler                     0.776 
## 12  2012 Wichita Falls             0.412 
## 13  2013 Beaumont                  0.421 
## 14  2013 Bryan-College Station     0.172 
## 15  2013 Tyler                     0.899 
## 16  2013 Wichita Falls             0.326 
## 17  2014 Beaumont                  0.285 
## 18  2014 Bryan-College Station     0.120 
## 19  2014 Tyler                     0.363 
## 20  2014 Wichita Falls             0.638
#Transform in wide format
corr_matrix_ls <- corr_by_yearcity_ls %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_ls <- as.matrix(corr_matrix_ls[,-1])

#Set rownames like years, to be more clear
rownames(mat_ls) <- corr_matrix_ls$year

#Corrplot
corrplot(mat_ls, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Volume of sales vs Listings correlation by year and city",
         mar = c(0,0,1,0)) 

#This analysis shows low correlation between two variables,
#except for city of Tyler in the year from 2011 to 2013.
#Generally correlation is positive, so volume growth like number of listings, but not in a
#strong dependency
#Analyzing the situation by city into years, the correlation plot shows that
#Beaumont --> the correlation goes from 0.45 to 0.29 (decreasing)
#Bryan-College Station --> the correlation goes from 0.62 to 0.12 (decreasing)
#Tyler --> shows a positive trend from 2011 to 2013 with a decrease into 2014
#Wichita Falls --> shows strong variability in trend

#Plot of joint distribution of volume and listings
ggplot(data, aes(x = listings, y = volume)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Volume of sales vs Listings joint distribution with confidence ellipses", 
       x = "Number of listings", 
       y = "Volume of sales (mln $)") + 
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = listings, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Volume of sales vs Listings linear regression with confidence intervals", 
       x = "Number of listings", 
       y = "Volume of sales (mln $)") + 
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(volume ~ listings, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals, color = city)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (volume of sales vs listings))",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

#The residuals plot from the linear regression model linking sales value to the number of active 
#listings reveals an interesting pattern: the residuals are not randomly distributed but highlight 
#the presence of four distinct clusters, corresponding to the different cities analyzed. This 
#suggests that the relationship between sales value and the number of listings varies significantly 
#across cities, indicating potentially different market dynamics. To further explore these 
#differences, it is necessary to estimate separate regression models for each city or introduce 
#interaction terms into a single model. This step will allow us to better understand how individual 
#cities contribute to the observed variability and whether the slope of the relationship varies 
#significantly depending on the geographical context.

#These four different trends of residuals suggest four different trends of regression between volume
#and sales by city. So, below there is a plot of these four regression lines
ggplot(data, aes(x = listings, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +  
  labs(title = "Volume of sales vs Listings linear regressions", 
       x = "Number of listings", 
       y = "Volume of sales (mln $)") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Interaction model to investigate the influence of variable city on volume-listings relationship
model_interaction <- lm(volume ~ listings * city, data = data)
summary(model_interaction)
## 
## Call:
## lm(formula = volume ~ listings * city, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.405  -6.782  -0.731   4.847  39.197 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                         4.940e+01  2.741e+01   1.802   0.0728 .
## listings                           -1.386e-02  1.630e-02  -0.850   0.3960  
## cityBryan-College Station           1.032e+01  2.876e+01   0.359   0.7201  
## cityTyler                           8.051e+00  3.340e+01   0.241   0.8097  
## cityWichita Falls                  -4.136e+01  3.300e+01  -1.253   0.2113  
## listings:cityBryan-College Station -9.057e-04  1.733e-02  -0.052   0.9584  
## listings:cityTyler                  9.835e-03  1.757e-02   0.560   0.5761  
## listings:cityWichita Falls          2.033e-02  2.591e-02   0.785   0.4334  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.41 on 232 degrees of freedom
## Multiple R-squared:  0.5443, Adjusted R-squared:  0.5305 
## F-statistic: 39.59 on 7 and 232 DF,  p-value: < 2.2e-16
#From the summary of interaction model we can explore some results
#R^2 --> this interaction explain only the 54% of variability of the entire dataset
#Global p-value --> 2.2e-16, so at least one predictor (excluding intercept) has a significant contribution to the model
#Intercept p-value --> intercept is statistically significant for 10%
#Other p-values --> have values very high, so there are not statistically significant
#Generally, from interaction model, there is no significant dependence of the volume vs listings 
#relationship on cities at a statistical level, that is, the slope of the relationship does not 
#differ significantly between cities. For this reason, we can explore other type of interaction,
#introducing non linearity

#Given a dataset with approximately 240 observations evenly distributed across four cities 
#(about 60 cases per city), model complexity must be balanced with data availability to avoid 
#overfitting and unstable estimates. To explore the relationship between sales volume and number 
#of active listings, accounting for potential non-linearities and city-specific differences, two 
#flexible yet parsimonious modeling approaches are considered.

#First, polynomial regression with interaction terms allows capturing simple curvilinear effects 
#(e.g., quadratic terms) and varying relationships by city without inflating the number of 
#parameters excessively.

model_poly <- lm(volume ~ city * (listings + I(listings^2)), data = data)
summary(model_poly)
## 
## Call:
## lm(formula = volume ~ city * (listings + I(listings^2)), data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.707  -6.608  -0.815   4.691  39.701 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             -8.451e+01  4.479e+02  -0.189    0.851
## cityBryan-College Station                2.311e+02  4.497e+02   0.514    0.608
## cityTyler                                1.793e+02  4.830e+02   0.371    0.711
## cityWichita Falls                        4.964e+01  4.838e+02   0.103    0.918
## listings                                 1.460e-01  5.340e-01   0.273    0.785
## I(listings^2)                           -4.757e-05  1.588e-04  -0.300    0.765
## cityBryan-College Station:listings      -2.877e-01  5.371e-01  -0.536    0.593
## cityTyler:listings                      -1.760e-01  5.485e-01  -0.321    0.749
## cityWichita Falls:listings              -4.359e-02  6.717e-01  -0.065    0.948
## cityBryan-College Station:I(listings^2)  9.242e-05  1.601e-04   0.577    0.564
## cityTyler:I(listings^2)                  5.207e-05  1.603e-04   0.325    0.746
## cityWichita Falls:I(listings^2)         -5.703e-06  2.763e-04  -0.021    0.984
## 
## Residual standard error: 11.38 on 228 degrees of freedom
## Multiple R-squared:  0.5542, Adjusted R-squared:  0.5327 
## F-statistic: 25.77 on 11 and 228 DF,  p-value: < 2.2e-16
#All p-values show a low stastic significance

#Second, Generalized Additive Models (GAMs) provide a more flexible framework to model smooth 
#non-linear relationships that can differ between cities, while controlling complexity through 
#limited degrees of freedom in the smoothing parameters to adapt to the modest sample size.

data$Bryan <- ifelse(data$city == "Bryan-College Station", 1, 0)
data$Tyler <- ifelse(data$city == "Tyler", 1, 0)
data$Wichita <- ifelse(data$city == "Wichita Falls", 1, 0)
data$Beaumont <- ifelse(data$city == "Beaumont", 1, 0)

gam_model <- gam(volume ~ city + 
                   s(listings, by=Bryan, k=5) + 
                   s(listings, by=Tyler, k=5) + 
                   s(listings, by=Wichita, k=5)+
                   s(listings, by=Beaumont, k=5),
                 data=data)

summary(gam_model)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## volume ~ city + s(listings, by = Bryan, k = 5) + s(listings, 
##     by = Tyler, k = 5) + s(listings, by = Wichita, k = 5) + s(listings, 
##     by = Beaumont, k = 5)
## 
## Parametric coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 21.370      5.032   4.247 3.15e-05 ***
## cityBryan-College Station    3.914     12.882   0.304 0.761511    
## cityTyler                   14.546      4.118   3.533 0.000497 ***
## cityWichita Falls           -1.039      7.450  -0.139 0.889252    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                        edf Ref.df      F p-value  
## s(listings):Bryan    2.352  2.682  4.674  0.0461 *
## s(listings):Tyler    1.429  1.429 12.858  0.0104 *
## s(listings):Wichita  1.429  1.429  1.255  0.4410  
## s(listings):Beaumont 1.714  1.714  1.059  0.4613  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 20/24
## R-sq.(adj) =  0.539   Deviance explained = 55.4%
## GCV = 132.72  Scale est. = 127.78    n = 240
#1. The intercept estimate for Tyler is statistically significant, indicating that at zero listings, 
#the sales volume in Tyler is higher than the baseline city, while the intercepts for the other 
#cities are not significant.
#2.The higher effective degrees of freedom (edf) for Bryan and Tyler indicate a nonlinear 
#relationship between listings and sales volume, whereas for Wichita and Beaumont the relationship 
#is linear or nearly linear.
#3.The smooth terms for Bryan and Tyler are statistically significant, meaning that the 
#relationship between listings and sales volume in these cities is significant and non-constant 
#(nonlinear), while for Wichita and Beaumont listings have little influence on the sales volume.
#4.Overall, the nonlinear GAM model, unlike the linear interaction model, confirms that the 
#relationship between listings and sales volume depends on the city, but this dependence is 
#statistically significant only for Bryan and Tyler. Furthermore, the city modifies the linearity of 
#the relationship; where the relationship is linear, the city factor does not significantly 
#influence it.

#Plot of loess smooth of volume and listings by city
ggplot(data, aes(x = listings, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Volume of sales vs Listings non linear regression with confidence intervals", 
       x = "Number of listings", 
       y = "Volume of sales (mln $)") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

Variables “listings” vs “median_price”

Correlation and General Trend
The correlation between the number of listings and the median sales price is generally low and subject to significant variations across cities and years:

  • Bryan-College Station shows predominantly negative correlations, except in 2014.
  • Tyler and Wichita Falls exhibit positive but weak and inconsistent correlations over time.
  • Beaumont shows a shift from negative to positive correlations, though always of modest magnitude.
  • In summary, no strong and stable linear relationship emerges between these two variables.

Linear Model and Residual Analysis
The linear regression model reveals residual patterns similar to those observed in the relationship between listings and sales volume, with distinct clusters by city suggesting different market dynamics. The presence of these residual differences indicates the need to consider geographic variability in the model.

Linear Interaction Model
The interaction model between number of listings and city explains about 83% of the overall variability, but the interaction terms are generally not significant. This implies that the slope of the relationship between median price and number of listings does not differ significantly among cities in a simple linear model.

Nonlinear Models and Generalized Additive Models (GAM)
Polynomial regression with interaction terms does not show significant improvements. GAMs reveal that the relationship between listings and median price is significant and nonlinear for Bryan, Tyler, and Beaumont, while the effect is less relevant for Wichita. The GAM model confirms the dependence of the relationship on the geographic area and shows how the city modifies the linearity of the relationship.

Visualization with LOESS Regression
LOESS curves by city clearly highlight local dynamics and nonlinear relationships, reinforcing the idea of territorial heterogeneity in the relationship between listings and median price.

Final considerations
The relationship between the number of property listings and median sales price appears complex, weakly linear, and influenced by territorial and temporal factors. A simple linear model is insufficient to adequately describe it, whereas flexible approaches like GAMs enable capturing nonlinearities and geographic differences. These results suggest further analysis using models that allow precise modeling of spatial and nonlinear variability in the real estate market.

#Correlation with median price
ggplot(data = data)+
  geom_point(aes(x = listings, y = median_price, colour = city))+
  geom_smooth(aes(x = listings, y = median_price), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Median price vs listings by year",
       x = "Number of listings",
       y = "Median price")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between median price and listings
corr_by_yearcity_ls <- corr_fun(data, year, city, listings, median_price)

#Plot of corrplot for this calculated index
corr_by_yearcity_ls %>% 
  ungroup()
## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                 -0.581 
##  2  2010 Bryan-College Station    -0.316 
##  3  2010 Tyler                     0.493 
##  4  2010 Wichita Falls             0.0507
##  5  2011 Beaumont                  0.508 
##  6  2011 Bryan-College Station    -0.424 
##  7  2011 Tyler                     0.376 
##  8  2011 Wichita Falls             0.312 
##  9  2012 Beaumont                  0.0904
## 10  2012 Bryan-College Station     0.0654
## 11  2012 Tyler                     0.465 
## 12  2012 Wichita Falls             0.500 
## 13  2013 Beaumont                  0.122 
## 14  2013 Bryan-College Station    -0.184 
## 15  2013 Tyler                     0.655 
## 16  2013 Wichita Falls            -0.260 
## 17  2014 Beaumont                  0.371 
## 18  2014 Bryan-College Station    -0.742 
## 19  2014 Tyler                    -0.226 
## 20  2014 Wichita Falls             0.317
#Transform in wide format
corr_matrix_ls <- corr_by_yearcity_ls %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_ls <- as.matrix(corr_matrix_ls[,-1])

#Set rownames like years, to be more clear
rownames(mat_ls) <- corr_matrix_ls$year

#Corrplot
corrplot(mat_ls, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Median price vs Listings correlation by year and city",
         mar = c(0,0,1,0))

#This analysis reveals varying relationships between listings and median price across cities and 
#years. For Bryan-College Station, the relationship is predominantly negative with generally low 
#linear correlation, except in 2014. Tyler and Wichita Falls exhibit positive but weak and 
#inconsistent correlations over time. Beaumont shows a shift from negative to positive correlations 
#across years, though correlations remain weak. Overall, regardless of the year, all cities display 
#low linear correlation between these two variables, suggesting the absence of a strong linear 
#relationship.

#Plot of joint distribution of median price and listings
ggplot(data, aes(x = listings, y = median_price)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Listings joint distribution with confidence ellipses", 
       x = "Listings",
       y = "Median price") +
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = listings, y = median_price, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Listings linear regression with confidence intervals", 
       x = "Number of listings",
       y = "Median price") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(median_price ~ listings, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals, color = city)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (median price vs listings)",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

#This analysis shows a particular pattern of residuals, like the analysis between listings and volume.
#So, some add on this analysis, like done previously

#These four different trends of residuals suggest four different trends of regression between median price
#and sales by city. So, below there is a plot of these four regression lines
ggplot(data, aes(x = listings, y = median_price, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +  
  labs(title = "Median price vs Listings linear regressions",
       x = "Number of listings",
       y = "Median price") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Interaction model to investigate the influence of variable city on volume-listings relationship
model_interaction <- lm(median_price ~ listings * city, data = data)
summary(model_interaction)
## 
## Call:
## lm(formula = median_price ~ listings * city, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27642  -6338    652   6038  33846 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        154590.766  22727.226   6.802 8.68e-11 ***
## listings                              -14.650     13.514  -1.084   0.2795    
## cityBryan-College Station           37982.241  23845.127   1.593   0.1126    
## cityTyler                           15533.789  27694.337   0.561   0.5754    
## cityWichita Falls                  -47723.146  27362.571  -1.744   0.0825 .  
## listings:cityBryan-College Station     -9.411     14.367  -0.655   0.5131    
## listings:cityTyler                      4.777     14.565   0.328   0.7432    
## listings:cityWichita Falls              9.017     21.482   0.420   0.6751    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9460 on 232 degrees of freedom
## Multiple R-squared:  0.8309, Adjusted R-squared:  0.8257 
## F-statistic: 162.8 on 7 and 232 DF,  p-value: < 2.2e-16
#From the summary of interaction model we can explore some results
#R^2 --> this interaction explain only the 83% of variability of the entire dataset
#Global p-value --> 2.2e-16, so at least one predictor (excluding intercept) has a significant contribution to the model
#Intercept p-value --> intercept is statistically significant
#Other p-values --> have values high, so there are not statistically significant
#Generally, from interaction model, there is no significant dependence of the median price vs listings 
#relationship on cities at a statistical level, that is, the slope of the relationship does not 
#differ significantly between cities. For this reason, we can explore other type of interaction,
#introducing non linearity

#First, polynomial regression with interaction terms allows capturing simple curvilinear effects 
#(e.g., quadratic terms) and varying relationships by city without inflating the number of 
#parameters excessively.

model_poly <- lm(median_price ~ city * (listings + I(listings^2)), data = data)
summary(model_poly)
## 
## Call:
## lm(formula = median_price ~ city * (listings + I(listings^2)), 
##     data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27494  -6103     58   5577  34019 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              3.906e+05  3.702e+05   1.055    0.293
## cityBryan-College Station               -1.330e+05  3.717e+05  -0.358    0.721
## cityTyler                               -1.697e+04  3.992e+05  -0.043    0.966
## cityWichita Falls                       -2.052e+05  3.999e+05  -0.513    0.608
## listings                                -2.963e+02  4.414e+02  -0.671    0.503
## I(listings^2)                            8.383e-02  1.313e-01   0.639    0.524
## cityBryan-College Station:listings       1.774e+02  4.439e+02   0.400    0.690
## cityTyler:listings                       1.446e+02  4.534e+02   0.319    0.750
## cityWichita Falls:listings               1.151e+02  5.552e+02   0.207    0.836
## cityBryan-College Station:I(listings^2) -5.031e-02  1.323e-01  -0.380    0.704
## cityTyler:I(listings^2)                 -5.926e-02  1.325e-01  -0.447    0.655
## cityWichita Falls:I(listings^2)          1.366e-02  2.283e-01   0.060    0.952
## 
## Residual standard error: 9408 on 228 degrees of freedom
## Multiple R-squared:  0.8356, Adjusted R-squared:  0.8276 
## F-statistic: 105.3 on 11 and 228 DF,  p-value: < 2.2e-16
#All p-values show a low stastic significance

#Second, Generalized Additive Models (GAMs) provide a more flexible framework to model smooth 
#non-linear relationships that can differ between cities, while controlling complexity through 
#limited degrees of freedom in the smoothing parameters to adapt to the modest sample size.
gam_model <- gam(median_price ~ city + 
                   s(listings, by=Bryan, k=5) + 
                   s(listings, by=Tyler, k=5) + 
                   s(listings, by=Wichita, k=5)+
                   s(listings, by=Beaumont, k=5),
                 data=data)

summary(gam_model)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## median_price ~ city + s(listings, by = Bryan, k = 5) + s(listings, 
##     by = Tyler, k = 5) + s(listings, by = Wichita, k = 5) + s(listings, 
##     by = Beaumont, k = 5)
## 
## Parametric coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  93801       3905  24.019  < 2e-16 ***
## cityBryan-College Station    25577       8472   3.019  0.00282 ** 
## cityTyler                    31260       5829   5.363 1.98e-07 ***
## cityWichita Falls             1638       6136   0.267  0.78981    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                        edf Ref.df      F  p-value    
## s(listings):Bryan    2.188  2.516 15.370 1.39e-06 ***
## s(listings):Tyler    2.008  2.263  4.475  0.00512 ** 
## s(listings):Wichita  1.429  1.429  2.287  0.25782    
## s(listings):Beaumont 1.714  1.714 50.053  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 20/24
## R-sq.(adj) =  0.829   Deviance explained = 83.5%
## GCV = 9.1502e+07  Scale est. = 8.7942e+07  n = 240
#1. The intercept estimate for Tyler and Bryan are statistically significant, indicating that at zero listings, 
#the median price in Tyler and Bryan is higher than the baseline city, while the intercepts for the other 
#cities are not significant.
#2.The higher effective degrees of freedom (edf) for all cities indicate a nonlinear 
#relationship between listings and median price.
#3.The smooth terms for Bryan, Tyler and Beaumont are statistically significant, meaning that the 
#relationship between listings and median price in these cities is significant and non-constant 
#(nonlinear), while for Wichita listings have little influence on median price.
#4.Overall, the nonlinear GAM model, unlike the linear interaction model, confirms that the 
#relationship between listings and median price depends on the city, but this dependence is 
#statistically significant only for Bryan, Tyler and Beaumont. Furthermore, the city modifies the linearity of 
#the relationship; where the relationship is linear, the city factor does not significantly 
#influence it.

#Plot of loess smooth of median price and listings by city
ggplot(data, aes(x = listings, y = median_price, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Listings non linear regression with confidence intervals",
       x = "Number of listings",
       y = "Median price") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

Variables “months_inventory” vs “sales”

Correlation and General Trend
The linear correlation between number of sales and months needed to sell listings (months_inventory) is generally low, except for 2011 which shows stronger correlations. Overall, a positive correlation is observed: as sales increase, months of inventory tend to increase as well, but this relationship is weak and sometimes negative. Among cities, Beaumont and Bryan-College Station show a declining correlation over time, while Tyler and Wichita Falls display trends of increase followed by decrease.

Visualizations and Linear Model
Joint distributions and linear regression plots reveal considerable data dispersion and scatter, confirmed by residual analysis. Residuals exhibit heteroscedasticity and skewness, indicating that the simple linear model does not adequately describe the actual relationship.

Nature of the Relationship and Nonlinear Models
Loess regression analysis suggests a nonlinear and partly inverse relationship: generally, higher sales correspond to fewer months needed to sell listings. This aligns with market intuition but contrasts with the averaged positive correlations previously observed.

Spearman Correlation
Spearman correlation analysis, less sensitive to outliers, confirms the weak and complex nature of the relationship, emphasizing the need for flexible models and variable transformations for accurate description.

Final Considerations
The relationship between sales numbers and months inventory shows complex features not well captured by simple linear modeling. Evidence indicates a nonlinear, likely inverse relationship, warranting further investigation via nonlinear models or transformations. The city-specific differences and time dynamics highlight the structural market complexity.

#Correlation with sales
ggplot(data = data)+
  geom_point(aes(x = sales, y = months_inventory, colour = city))+
  geom_smooth(aes(x = sales, y = months_inventory), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months inventory vs Number of sales by year",
       x = "Number of sales",
       y = "Months inventory")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between number and months inventory
corr_by_yearcity_minv <- corr_fun(data, year, city, sales, months_inventory)

#Plot of corrplot for this calculated index
corr_by_yearcity_minv %>% 
  ungroup()
## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                 0.431  
##  2  2010 Bryan-College Station    0.345  
##  3  2010 Tyler                   -0.00956
##  4  2010 Wichita Falls           -0.249  
##  5  2011 Beaumont                 0.646  
##  6  2011 Bryan-College Station    0.778  
##  7  2011 Tyler                    0.757  
##  8  2011 Wichita Falls            0.769  
##  9  2012 Beaumont                -0.149  
## 10  2012 Bryan-College Station    0.289  
## 11  2012 Tyler                    0.453  
## 12  2012 Wichita Falls            0.0972 
## 13  2013 Beaumont                -0.118  
## 14  2013 Bryan-College Station    0.0449 
## 15  2013 Tyler                    0.653  
## 16  2013 Wichita Falls            0.235  
## 17  2014 Beaumont                 0.0904 
## 18  2014 Bryan-College Station    0.113  
## 19  2014 Tyler                    0.0578 
## 20  2014 Wichita Falls            0.616
#Transform in wide format
corr_matrix_minv <- corr_by_yearcity_minv %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_minv <- as.matrix(corr_matrix_minv[,-1])

#Set rownames like years, to be more clear
rownames(mat_minv) <- corr_matrix_minv$year

#Corrplot
corrplot(mat_minv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Months inventory vs Number of sales correlation by year and city",
         mar = c(0,0,1,0))

#This analysis shows low correlation between two variables,
#except for the year 2011.
#Generally correlation is positive, so months inventory growth like number of sales, but not at all in a
#strong dependency. There are, also, negative terms of correlation.
#Analyzing the situation by city into years, the correlation plot shows that
#Beaumont --> the correlation goes from 0.43 to 0.09
#Bryan-College Station --> the correlation goes from 0.34 to 0.11
#Tyler --> shows a growth in positive correlation from 2010 to 2011 and a decrease from 2013 to 2014
#Wichita Falls --> shows a similar trend than Tyler

#Plot of joint distribution of months inventory and sales
ggplot(data, aes(x = sales, y = months_inventory)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Number of sales joint distribution with confidence ellipses",
       x = "Number of sales",
       y = "Months inventory") +
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = sales, y = months_inventory, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Number of sales linear regression with confidence intervals",
       x = "Number of sales",
       y = "Months inventory") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(months_inventory ~ sales, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (months inventory vs number of sales)",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

#The linear relationship exists but is not perfectly described by the simple linear model due to heteroskedasticity and skewness in the residuals.
#The model may underestimate or overestimate the dependence in certain areas (especially for low values of the variable).
#The presence of outliers suggests that a more in-depth analysis of the data would be useful to identify possible subgroups or outliers.
#It may be worth trying variable transformations (e.g., logarithmic) or more flexible nonlinear models to better accommodate the shape of
#the relationship and improve the homogeneity of errors.

#Plot of loess smooth of sales and months inventory by city
ggplot(data, aes(x = sales, y = months_inventory, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Number of sales non linear regression with confidence intervals",
       x = "Number of sales",
       y = "Months inventory") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#The relationship between the number of sales and the months needed to sell listings 
#(months_inventory) appears to be nonlinear and inversely proportional. Specifically, 
#as the number of sales increases, the time required to sell all listings tends to decrease. 
#This intuition aligns with market expectations and is supported by the scatter of residuals in 
#the linear regression model, which indicates that a simple linear trend may not adequately capture 
#the complexity of this relationship.This appears to be at odds with the positive correlation 
#coefficients found for year and month, but the “average” positive correlation coefficient 
#alone may not correctly reflect the true nature of the relationship between variables, best 
#highlighted with this trend.

#Beacause residuals analysis shows an important dispersion of values, we could try to calculate
#correlation using Spearman index
corr_by_yearcity_minv_sp <- corr_sp_fun(data, year, city, sales, months_inventory)

corr_by_yearcity_minv_sp %>%
  ungroup()
## # A tibble: 20 × 3
##     year city                  corr_spearman
##    <int> <chr>                         <dbl>
##  1  2010 Beaumont                    0.214  
##  2  2010 Bryan-College Station       0.258  
##  3  2010 Tyler                      -0.112  
##  4  2010 Wichita Falls              -0.123  
##  5  2011 Beaumont                    0.587  
##  6  2011 Bryan-College Station       0.727  
##  7  2011 Tyler                       0.828  
##  8  2011 Wichita Falls               0.743  
##  9  2012 Beaumont                   -0.0495 
## 10  2012 Bryan-College Station       0.253  
## 11  2012 Tyler                       0.621  
## 12  2012 Wichita Falls               0.239  
## 13  2013 Beaumont                   -0.120  
## 14  2013 Bryan-College Station      -0.00699
## 15  2013 Tyler                       0.796  
## 16  2013 Wichita Falls               0.162  
## 17  2014 Beaumont                    0.106  
## 18  2014 Bryan-College Station       0.115  
## 19  2014 Tyler                       0.168  
## 20  2014 Wichita Falls               0.614
corr_matrix_spearman_minv <- corr_by_yearcity_minv_sp %>%
  pivot_wider(names_from = city, values_from = corr_spearman)

mat_spearman <- as.matrix(corr_matrix_spearman_minv[,-1])
rownames(mat_spearman) <- corr_matrix_spearman_minv$year

corrplot(mat_spearman, method = "color", addCoef.col = "black", tl.col = "black",
         tl.cex = 0.7, tl.srt = 30, number.cex = 0.9, cl.pos = "b",
         title = "Months inventory vs Number of sales Spearman corr by year and city",
         mar = c(0,0,1,0))

#Also the analysis with correlation Spearman index confirms the same considerations already done

Variables “months_inventory” vs “volume”

Correlation and General Trend
The linear correlation between total sales value and months needed to sell listings is generally low, with some exceptions in 2011 showing stronger correlations. Overall, a positive correlation is observed: as sales value increases, months inventory tends to increase as well, but this dependency is neither strong nor stable over time. Looking at individual cities, Beaumont and Bryan-College Station show a decreasing correlation over time, Tyler displays an initial positive trend followed by a decline, and Wichita Falls exhibits high variability in trend.

Visualizations and Linear Model
Joint distributions and linear regression plots reveal considerable data dispersion and non-random residual patterns. Residual analysis of the linear model highlights heteroscedasticity and non-random patterns, suggesting that the relationship is not adequately captured by a simple linear model.

Nature of the Relationship and Nonlinear Models
LOESS regression confirms a nonlinear relationship between sales volume and months inventory, consistent with a pattern where an increase in sales value does not correspond linearly to an increase in time to sell. This indicates complex and varied phenomena across cities and periods.

Final Considerations
The relationship between sales value and months inventory is weak, unstable, and nonlinear, making a simple linear model inadequate to describe it. Temporal and spatial differences suggest the need for more flexible models and possibly additional variables or transformations for better understanding.

#Correlation with volume
ggplot(data = data)+
  geom_point(aes(x = volume, y = months_inventory, colour = city))+
  geom_smooth(aes(x = volume, y = months_inventory), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months inventory vs Volume of sales by year",
       x = "Volume of sales (mln $)",
       y = "Months inventory")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between volume and months inventory
corr_by_yearcity_minv <- corr_fun(data, year, city, volume, months_inventory)

#Plot of corrplot for this calculated index
corr_by_yearcity_minv %>% 
  ungroup()
## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                  0.307 
##  2  2010 Bryan-College Station     0.372 
##  3  2010 Tyler                     0.0157
##  4  2010 Wichita Falls            -0.152 
##  5  2011 Beaumont                  0.559 
##  6  2011 Bryan-College Station     0.733 
##  7  2011 Tyler                     0.781 
##  8  2011 Wichita Falls             0.747 
##  9  2012 Beaumont                 -0.219 
## 10  2012 Bryan-College Station     0.238 
## 11  2012 Tyler                     0.503 
## 12  2012 Wichita Falls             0.275 
## 13  2013 Beaumont                 -0.127 
## 14  2013 Bryan-College Station     0.0261
## 15  2013 Tyler                     0.696 
## 16  2013 Wichita Falls             0.0916
## 17  2014 Beaumont                  0.0292
## 18  2014 Bryan-College Station     0.0252
## 19  2014 Tyler                     0.0315
## 20  2014 Wichita Falls             0.595
#Transform in wide format
corr_matrix_minv <- corr_by_yearcity_minv %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_minv <- as.matrix(corr_matrix_minv[,-1])

#Set rownames like years, to be more clear
rownames(mat_minv) <- corr_matrix_minv$year

#Corrplot
corrplot(mat_minv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Months inventory vs Volume of sales correlation by year and city",
         mar = c(0,0,1,0))

#This analysis shows low correlation between two variables,
#except for the year 2011.
#Generally correlation is positive, so months inventory growth like volume, but not in a
#strong dependency
#Analyzing the situation by city into years, the correlation plot shows that
#Beaumont --> the correlation goes from 0.31 to 0.03 (decreasing)
#Bryan-College Station --> the correlation goes from 0.37 to 0.03 (decreasing)
#Tyler --> shows a positive trend from 2010 to 2011 with a decrease from 2013 to 2014
#Wichita Falls --> shows strong variability in trend

#Plot of joint distribution of volume and months inventory
ggplot(data, aes(x = volume, y = months_inventory)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Volume of sales joint distribution with confidence ellipses",
       x = "Volume of sales (mln $)",
       y = "Months inventory") +
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = volume, y = months_inventory, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Volume of sales linear regression with confidence intervals",
       x = "Volume of sales (mln $)",
       y = "Months inventory") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(months_inventory ~ volume, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (months inventory vs volume of sales)",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

#There is a similar situation between months inventory and sales, with a non linear relationship
#explained by the distribution of residuals

#Plot of loess smooth of sales and months inventory by city
ggplot(data, aes(x = volume, y = months_inventory, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Volume of sales non linear regression with confidence intervals",
       x = "Volume of sales (mln $)",
       y = "Months inventory") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

Variables “months_inventory” vs “median_price”

Correlation and General Trend
The correlation between median sales price and months needed to sell listings is generally low and varies over time and across cities. Bryan-College Station in 2014 shows a relatively strong negative correlation, while other cities exhibit fluctuating or decreasing values. The negative correlation suggests that, in some cases, an increase in median price tends to be associated with a decrease in the time required to sell homes, a counterintuitive result compared to the expectation that higher prices require more time to sell.

Visualizations and Linear Model
Joint distributions and linear regressions show dispersion and city differences, with residuals indicating a nonlinear relationship not well captured by the simple linear model. Residuals are non-randomly distributed, suggesting complex patterns or local factors not accounted for by the model.

Nonlinear Models and Local Trends
LOESS regression by city clearly highlights nonlinear relationships. In some cities, a significant inverse relationship emerges between median price and months inventory, where increases in median price correspond to decreases in selling time.

Final Considerations
The relationship between median sales price and months inventory is complex, predominantly nonlinear, with varying territorial effects. Counterintuitive results (price increases linked to shorter selling times) suggest sophisticated market dynamics and the importance of flexible model-based interpretations.

#Correlation with median price
ggplot(data = data)+
  geom_point(aes(x = median_price, y = months_inventory, colour = city))+
  geom_smooth(aes(x = median_price, y = months_inventory), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months inventory vs Median price by year",
       x = "Median price",
       y = "Months inventory")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between median price and months inventory
corr_by_yearcity_minv <- corr_fun(data, year, city, median_price, months_inventory)

#Plot of corrplot for this calculated index
corr_by_yearcity_minv %>% 
  ungroup()
## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                 -0.475 
##  2  2010 Bryan-College Station    -0.194 
##  3  2010 Tyler                     0.155 
##  4  2010 Wichita Falls            -0.0413
##  5  2011 Beaumont                  0.406 
##  6  2011 Bryan-College Station    -0.408 
##  7  2011 Tyler                     0.218 
##  8  2011 Wichita Falls             0.283 
##  9  2012 Beaumont                 -0.308 
## 10  2012 Bryan-College Station    -0.0548
## 11  2012 Tyler                     0.120 
## 12  2012 Wichita Falls             0.443 
## 13  2013 Beaumont                 -0.400 
## 14  2013 Bryan-College Station    -0.288 
## 15  2013 Tyler                     0.438 
## 16  2013 Wichita Falls            -0.440 
## 17  2014 Beaumont                  0.0964
## 18  2014 Bryan-College Station    -0.783 
## 19  2014 Tyler                    -0.459 
## 20  2014 Wichita Falls             0.269
#Transform in wide format
corr_matrix_minv <- corr_by_yearcity_minv %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_minv <- as.matrix(corr_matrix_minv[,-1])

#Set rownames like years, to be more clear
rownames(mat_minv) <- corr_matrix_minv$year

#Corrplot
corrplot(mat_minv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Months inventory vs Median price correlation by year and city",
         mar = c(0,0,1,0))

#This analysis shows low correlation between two variables,
#except for city of Bryan-College Station in the year 2014.
#Generally correlation is negative, so months inventory decrease while median price increase, 
#but not in a strong dependency. This first result is not expected, because if median price is high
#we may expect that the number of months for sell is high, but the relationship shows non linearity
#and this correlation could not explore all trends in data
#Analyzing the situation by city into years, the correlation plot shows that
#Beaumont --> the correlation goes from -0.47 to 0.10 (decreasing in absolute value)
#Bryan-College Station --> the correlation goes from -0.19 to -0.78 (increasing)
#Tyler --> the correlation goes from 0.15 to -0.46 (increasing in absolute value)
#Wichita Falls --> shows strong variability in trend

#Plot of joint distribution of median price and months inventory
ggplot(data, aes(x = median_price, y = months_inventory)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Median price joint distribution with confidence intervals",
       x = "Median price",
       y = "Months inventory") +
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = median_price, y = months_inventory, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Median price linear regression with confidence intervals",
       x = "Median price",
       y = "Months inventory") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(months_inventory ~ median_price, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (months inventory vs median price)",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

#There is a similar situation between months inventory and median price, with a non linear relationship
#explained by the distribution of residuals

#Plot of loess smooth of median price and months inventory by city
ggplot(data, aes(x = median_price, y = months_inventory, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Median price non lineare regression with confidence intervals",
       x = "Median price",
       y = "Months inventory") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Overall, the relationship between the two variables appears nonlinear, as indicated by the 
#residual plot showing points scattered around the horizontal line without clear patterns 
#suggesting a significant local influence of the city variable. However, in certain cities, a 
#significant inverse relationship is observed: as the median price increases, the number of months 
#required to sell all listings tends to decrease

Variables “months_inventory” vs “listings”

Correlation and General Trend
The analysis shows a strong positive linear correlation between the number of active listings and the number of months needed to sell all listings (months_inventory). This outcome aligns with the notion that a higher number of active listings indicates lower selling capacity and therefore a longer time to clear the market.Joint plots and confidence ellipses confirm this robust positive relationship between listings and months inventory, with variations across cities. Linear regression lines for each city illustrate trends consistent with the positive correlation.

Linear Regression Model and Residuals
The linear regression model confirms the statistical significance of both the intercept and slope (both with very low p-values around 2e-16). The coefficient of determination (R²) is moderate, about 54%, indicating the model explains a good portion of variance but leaves room for unexplained factors.Residual analysis reveals some non-random patterns, suggesting possible local effects or additional influencing variables.

Final Considerations
The strong positive linear dependence between listings and months inventory is expected and statistically significant. Residual variability and city-level differences suggest further investigation could improve understanding and modeling of the local real estate market.

#Correlation with listings
ggplot(data = data)+
  geom_point(aes(x = listings, y = months_inventory, colour = city))+
  geom_smooth(aes(x = listings, y = months_inventory), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months inventory vs Listings by year",
       x = "Number of listings",
       y = "Months inventory")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between listings and months inventory
corr_by_yearcity_minv <- corr_fun(data, year, city, listings, months_inventory)

#Plot of corrplot for this calculated index
corr_by_yearcity_minv %>% 
  ungroup()
## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                   0.966
##  2  2010 Bryan-College Station      0.898
##  3  2010 Tyler                      0.854
##  4  2010 Wichita Falls              0.974
##  5  2011 Beaumont                   0.813
##  6  2011 Bryan-College Station      0.993
##  7  2011 Tyler                      0.965
##  8  2011 Wichita Falls              0.796
##  9  2012 Beaumont                   0.730
## 10  2012 Bryan-College Station      0.965
## 11  2012 Tyler                      0.882
## 12  2012 Wichita Falls              0.980
## 13  2013 Beaumont                   0.526
## 14  2013 Bryan-College Station      0.986
## 15  2013 Tyler                      0.883
## 16  2013 Wichita Falls              0.941
## 17  2014 Beaumont                   0.910
## 18  2014 Bryan-College Station      0.988
## 19  2014 Tyler                      0.922
## 20  2014 Wichita Falls              0.991
#Transform in wide format
corr_matrix_minv <- corr_by_yearcity_minv %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_minv <- as.matrix(corr_matrix_minv[,-1])

#Set rownames like years, to be more clear
rownames(mat_minv) <- corr_matrix_minv$year

#Corrplot
corrplot(mat_minv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Months inventory vs Listings correlation by year and city",
         mar = c(0,0,1,0))

#This analysis shows a strong linear correlation between number of listings and months
#inventory. We could expect this trend because a bigger number of listings suggest a small number
#of sell, so bigger values of variable months inventory.
#To confirm this below are shown joint distribution with confidence ellipses

#Plot of joint distribution of listings and months inventory
ggplot(data, aes(x = listings, y = months_inventory)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Listings joint distribution with confidence ellipses",
       x = "Number of listings",
       y = "Months inventory") +
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = listings, y = months_inventory, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Listings linear regression with confidence intervals",
       x = "Number of listings",
       y = "Months inventory") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(months_inventory ~ listings, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals, color = city)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (months inventory vs listings)",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

#The analysis of the relationship between the number of months needed to sell listings and the 
#number of active listings reveals a strong positive linear dependence, consistent with the notion 
#that a higher number of listings reflects a lower selling capacity and therefore a longer time to 
#clear the market. In the summary of the linear regression model, both the slope coefficient for 
#the listings variable and the intercept have very low p-values (2e-16), indicating statistical 
#significance. However, the coefficient of determination R² is moderate, around 54%

#The analysis of the relationship between the number of months needed to sell listings and the 
#number of active listings reveals a strong positive linear dependence, consistent with the notion 
#that a higher number of listings reflects a lower selling capacity and therefore a longer time to 
#clear the market. In the summary of the linear regression model, both the slope coefficient for 
#the listings variable and the intercept have very low p-values (2e-16), indicating statistical 
#significance. However, the coefficient of determination R² is moderate, around 54%

Second breakpoint: results of analysis (point 3)

Historical Trends in Sales and Values:

  • The number and value of sales are strongly correlated and show a significant linear trend over time and across cities, confirming the consistency and growth of the Texas real estate market.
  • The median sales value exhibits more complex and localized variations, suggesting price dynamics influenced by territorial and temporal factors.

Effectiveness of Marketing Strategies (Listings):

  • The relationship between the number of active listings and sales shows a declining correlation over time, indicating that the mere quantity of listings does not fully explain current sales.
  • Significant variation among cities and the presence of non-linearity (highlighted by GAM models and loess regressions) underscore the need for differentiated marketing strategies attentive to local dynamics.

Distribution and Relationships among Prices, Sales, and Listing Duration:

  • Correlations between median prices and total sales values are positive but moderate and unstable, with heteroscedasticity and outliers requiring complex models.
  • The average selling time (months_inventory) shows complex and nonlinear relationships with sales, prices, and number of listings, suggesting that the market is also influenced by liquidity factors and more sophisticated supply/demand dynamics.
  • Nonlinear models (GAM, loess) are effective in capturing these dynamics and could guide targeted interventions to reduce selling times.

Implications:

  • The evidence suggests that there is no “one-size-fits-all” model: each city presents specific dynamics, with different signals of growth, stability, or slowdown.
  • Marketing and listing management strategies must therefore be personalized and adapted to local contexts, monitoring not only listing volume but also nonlinear patterns of price, sales, and duration.
  • The use of advanced models and detailed visualizations allows the identification of growth opportunities and the optimization of timing and quality of commercial interventions on listings.

Multivariate analysis

Linear multivariate model with interactions between variables

After conducting thorough bivariate analyses, it could proceed with a multivariate analysis focusing on variables that do not exhibit multicollinearity. Hypothesizing a linear mixed-effects model that includes the joint effects and all interactions among volume, listings, and months_inventory variables, while accounting for random intercepts by year and month. This approach allows me to capture complex relationships and local variations—such as differences across cities—without bias from strongly correlated predictors.

#Creation of scaled data frame
volume_scaled <- as.vector(scale(data$volume))
listings_scaled <- as.vector(scale(data$listings))
months_inv_scaled <- as.vector(scale(months_inventory))
data_scaled <- data.frame(
                          year = data$year,
                          month = data$month,
                          city = data$city,
                          volume = volume_scaled,
                          listings = listings_scaled,
                          months_inventory = months_inv_scaled
                          )

mod_multivariate <- lmer(volume ~ listings * months_inventory * city + (1|year) + (1|month), data = data_scaled)
summary(mod_multivariate)
## Linear mixed model fit by REML ['lmerMod']
## Formula: volume ~ listings * months_inventory * city + (1 | year) + (1 |  
##     month)
##    Data: data_scaled
## 
## REML criterion at convergence: 229
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4525 -0.5906 -0.0039  0.5382  4.2657 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  month    (Intercept) 0.21712  0.4660  
##  year     (Intercept) 0.01398  0.1182  
##  Residual             0.11826  0.3439  
## Number of obs: 240, groups:  month, 12; year, 5
## 
## Fixed effects:
##                                                     Estimate Std. Error t value
## (Intercept)                                         -0.20734    0.19196  -1.080
## listings                                            -0.92099    0.79243  -1.162
## months_inventory                                    -0.37871    0.14355  -2.638
## cityBryan-College Station                            0.83173    0.14995   5.547
## cityTyler                                           -0.01248    0.52803  -0.024
## cityWichita Falls                                   -2.77226    1.23405  -2.246
## listings:months_inventory                           -0.67500    0.57726  -1.169
## listings:cityBryan-College Station                   4.46206    0.87585   5.095
## listings:cityTyler                                   2.24928    0.80297   2.801
## listings:cityWichita Falls                          -0.52055    1.39737  -0.373
## months_inventory:cityBryan-College Station          -1.06162    0.20119  -5.277
## months_inventory:cityTyler                          -0.86318    0.30189  -2.859
## months_inventory:cityWichita Falls                  -2.55279    1.27016  -2.010
## listings:months_inventory:cityBryan-College Station  0.97342    0.58925   1.652
## listings:months_inventory:cityTyler                  0.79424    0.58685   1.353
## listings:months_inventory:cityWichita Falls         -1.34502    1.23212  -1.092
## 
## Correlation matrix not shown by default, as p = 16 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it
#Plot of listings and volume by city
ggplot(data_scaled, aes(x = listings, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_classic() +
  labs(title = "Volume of sales vs Listings by city", 
       x = "Number of listings",
       y = "Volume of sales (scaled)")
## `geom_smooth()` using formula = 'y ~ x'

#Plot of listings and volume by city and with interaction with months inventory

#Creation of bins of months inventory (25% of population for every bins)
data_scaled <- data_scaled %>%
  mutate(
    months_inventory_bin_num = ntile(months_inventory, 4),
    months_inventory_bin = factor(
      months_inventory_bin_num,
      levels = 1:4,
      labels = c("Low", "Medium-Low", "Medium-High", "High")
    )
  )

ggplot(data_scaled, aes(x = listings, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ months_inventory_bin) +
  theme_classic() +
  labs(title = "Interaction Volume of sales vs Listings by year and months inventory",
       x = "Number of listings",
       y = "Volume of sales (scaled)")
## `geom_smooth()` using formula = 'y ~ x'

#Plot of marginal effects between variables
effects <- ggpredict(mod_multivariate, terms = c("listings", "months_inventory", "city"))
p <- plot(effects) + 
  labs(
    title = "Marginal effects of listings and months inventory by city",
    x = "Number of listings (scaled)",
    y = "Predicted volume of sales (scaled)"
  )

print(p)

#Plot of residuals to show other caratheristic on correlation
data_scaled$residuals <- resid(mod_multivariate)
data_scaled$fitted <- fitted(mod_multivariate)

ggplot(data_scaled, aes(x = fitted, y = residuals, colour = city)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression of multivariate model residuals",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

After careful scaling and accounting for multicollinearity, a linear mixed-effects model was fitted incorporating the full factorial interactions among the numerical predictors (listings and months_inventory) and the categorical variable city, while including random intercepts for year and month. The results indicate that:

  • The effect of months_inventory on sales volume is significantly negative, with stronger negative impacts observed in certain cities (Bryan, Tyler, Wichita Falls).
  • The influence of listings on volume varies considerably by city, showing positive and significant effects in Bryan and Tyler, while the overall baseline effect is non-significant or negative.
  • Interactions between listings and months_inventory are generally non-significant, suggesting that their combined influence may be less critical than their individual and city-specific effects.
  • Higher-order interactions (three-way) do not contribute significantly in the current model, indicating potential for simplification.

Visualizing predicted effects via interaction plots or marginal effect plots (using tools such as ggplot2 or ggeffects) can elucidate how the relationships between listings, months_inventory, and sales volume differ across cities, helping to capture heterogeneity in effects. Given the limited significance of complex triple interactions, consider simplifying the model by removing non-significant higher-order terms to enhance interpretability and reduce overfitting. Exploring alternative model specifications or incorporating additional relevant covariates may further improve model accuracy and insight into drivers of sales volume.

For Tyler and Bryan, the parallel lines with narrow confidence intervals indicate that the effects of listings and months_inventory on volume are stable and similar across these cities, showing consistent and precise model estimates. This reflects a relatively homogeneous pattern of influence. In contrast, for Beaumont and Wichita Falls, the presence of intersecting lines with wider confidence intervals suggests more variability and complexity in how these predictors affect volume. This could reflect more uncertain or non-linear relationships, possibly due to less consistent data or more complex local dynamics.

Non linear multivariate model with interactions between variables

Plotting residuals against fitted values reveals positive linear patterns that differ by city. This indicates that the current model under- or overestimates volume in a systematic way depending on predicted value and city. Such patterns suggest that the model does not fully capture the relationship between predictors and volume across all cities, pointing to potential model misspecification and the need to consider nonlinear terms or random slopes to better represent city-specific effects.

#Non linear model GAM with the capture of non linear effect and smooth terms
data_scaled$city <- as.factor(data_scaled$city)

mod_gam <- gamm(
  volume ~ s(listings, by = city) + s(months_inventory, by = city) + city,
  random = list(year = ~1, month = ~1),
  data = data_scaled
)

summary(mod_gam$gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## volume ~ s(listings, by = city) + s(months_inventory, by = city) + 
##     city
## 
## Parametric coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -0.08993    0.11841  -0.759   0.4484    
## cityBryan-College Station  0.91006    0.14479   6.285 1.67e-09 ***
## cityTyler                 -1.19451    0.49073  -2.434   0.0157 *  
## cityWichita Falls         -0.81361    0.92225  -0.882   0.3786    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                                  edf Ref.df      F  p-value    
## s(listings):cityBeaumont                      0.9999 0.9999  0.230 0.632280    
## s(listings):cityBryan-College Station         1.0000 1.0000 64.057  < 2e-16 ***
## s(listings):cityTyler                         1.0000 1.0000 37.920  < 2e-16 ***
## s(listings):cityWichita Falls                 1.0000 1.0000  0.336 0.562894    
## s(months_inventory):cityBeaumont              1.0000 1.0000 15.530 0.000109 ***
## s(months_inventory):cityBryan-College Station 2.4917 2.4917 43.121  < 2e-16 ***
## s(months_inventory):cityTyler                 1.8676 1.8676 44.673  < 2e-16 ***
## s(months_inventory):cityWichita Falls         1.0000 1.0000  8.898 0.003168 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.713   
##   Scale est. = 0.11131   n = 240
#Plot of predicted values
data_scaled$predicted <- predict(mod_gam$gam, newdata = data_scaled)

ggplot(data_scaled, aes(x = listings, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_point(aes(y = predicted), size = 1, colour = "red") +
  facet_wrap(~ city)+
  labs(title = " Predicted vs Observed values of volume of sales by listings and city",
       x = "Number of listings(scaled)",
       y = "Volume of sales (scaled)") +
  theme_classic()

#So, we can allow slopes of predictors like listings and months_inventory to vary by city, providing 
#flexibility to capture city-specific effect heterogeneity
mod_gam_random_slopes <- gamm(
  volume ~ s(listings, by = city) + s(months_inventory, by = city) + city,
  random = list(
    city = ~ 1 + listings + months_inventory,
    year = ~ 1,
    month = ~ 1
  ),
  data = data_scaled
)

summary(mod_gam_random_slopes$gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## volume ~ s(listings, by = city) + s(months_inventory, by = city) + 
##     city
## 
## Parametric coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                0.17045    0.13696   1.244    0.215    
## cityBryan-College Station  0.87642    0.19748   4.438 1.42e-05 ***
## cityTyler                 -3.50305    0.57635  -6.078 5.11e-09 ***
## cityWichita Falls         -0.03652    1.21113  -0.030    0.976    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                                 edf Ref.df      F  p-value    
## s(listings):cityBeaumont                      1.000  1.000  9.691  0.00209 ** 
## s(listings):cityBryan-College Station         1.000  1.000 56.761  < 2e-16 ***
## s(listings):cityTyler                         1.000  1.000 69.365  < 2e-16 ***
## s(listings):cityWichita Falls                 1.000  1.000  0.938  0.33382    
## s(months_inventory):cityBeaumont              1.000  1.000 20.004 1.24e-05 ***
## s(months_inventory):cityBryan-College Station 2.606  2.606 34.330  < 2e-16 ***
## s(months_inventory):cityTyler                 1.000  1.000 92.742  < 2e-16 ***
## s(months_inventory):cityWichita Falls         1.000  1.000  0.664  0.41587    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.754   
##   Scale est. = 6.1686e-08  n = 240
#Plot of predicted values
data_scaled$predicted <- predict(mod_gam_random_slopes$gam, newdata = data_scaled)

ggplot(data_scaled, aes(x = listings, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_point(aes(y = predicted), size = 1, colour = "red") +
  facet_wrap(~ city)+
  labs(title = "Predicted vs Observed values of volume of sales by listings and city",
       x = "Number of listings (scaled)",
       y = "Volume of sales (scaled)")+
  theme_classic()

#Evaluation of residuals for investigate model modified

#Plot of residuals to show other caratheristic on correlation
data_scaled$residuals <- resid(mod_gam_random_slopes)
data_scaled$fitted <- fitted(mod_gam_random_slopes)

ggplot(data, aes(x = fitted, y = residuals, colour = city)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Non linear muultivariate model residuals",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

The second model (GAM mixed model with splines)better captures the complexity and heterogeneity of the relationships between volume, listings, and months_inventory across different cities. It allows for non-linear effects of predictors by fitting spline functions, which better represent complex and potentially non-monotonic relationships that the linear model cannot capture. The GAM model shows a higher adjusted R² (~0.71), indicating improved explanatory power compared to the linear mixed model. Significant spline terms for months_inventory and listings (in some cities) reveal meaningful variation in the shape of effects that linear interactions missed.

The plot comparing observed versus predicted sales volume across cities shows mixed predictive performance of the model:

  • For Beaumont and Wichita Falls, the predicted values (red points) align reasonably well with the observed data, indicating the model captures the general pattern for these cities effectively.
  • In contrast, for Bryan-College Station and Tyler, the predicted values appear more clustered and less spread along the observed data range, suggesting that the model may be underfitting or failing to fully capture the variability in these cities.
  • Overall, the model exhibits decent predictive capability but shows variability in fit quality across different cities.

Third breakpoints (point 4)

Model and Main Results:

  • The mixed GAM (gamm) model with splines allows modeling the relationship between sales value (volume), number of listings, and months on market (months_inventory) flexibly, capturing nonlinearities and city-specific variations. Moreover, including random effects for year, month, and city with varying slopes (random slopes) enhances the ability to capture temporal and territorial heterogeneity.
  • The improvement in adjusted R² (from about 0.54 in the linear model to around 0.71 in the GAM model) indicates a greater ability of the model to explain sales volume variability.
  • Significant spline terms for some cities highlight important nonlinear effects that simple linear interaction models could not capture. The capacity to model city-specific relationship curves helps represent distinctive local market dynamics.

Goodness of Fit and Predictions Analysis:

  • Beaumont and Wichita Falls show good alignment between observed and predicted values, suggesting the model effectively captures trends in these markets, likely more stable or with dynamics well represented by the data.
  • Bryan-College Station and Tyler exhibit more clustered and less variable predicted values compared to observed data, indicating the model struggles to reproduce the real complexity and variability of these markets, probably influenced by more complex factors or unmodeled noise.

Interpretation in the Project Context:

  • Identify and interpret historical sales trends: The model highlights general and city-specific trends, going beyond simple linear associations, providing a more realistic and dynamic market representation.
  • Assess marketing strategies efficacy (listings): Including nonlinear interaction between number of listings and months on market captures the combined effect of listings and speed of sales, identifying conditions that enhance or limit generated value. This helps evaluate whether increasing listings effectively leads to more sales or if long selling times mitigate this effect.

Added Value:

  • The model’s ability to recognize nonlinear dynamics and territorial differences helps better understand which cities offer the greatest growth opportunities and where intervention is needed to improve listing effectiveness.
  • Residual analysis and differentiated performance by city suggest orienting personalized strategies for more complex and less linear markets (like Bryan-College Station and Tyler), possibly integrating additional variables or higher-frequency data to improve predictions.
  • Finally, visual representation of effects and models provides practical tools to monitor and communicate market trends and marketing strategy effectiveness over time.

Final consideration about the analysis and raccomandations

Based on the detailed analysis conducted, several data-driven strategies could be adopted to improve sales management and the effectiveness of property listings. Specifically:

Territorial segmentation and personalized marketing strategies:

  • Tailor advertising campaigns by city: Since sales dynamics, prices, and listing durations vary significantly across Beaumont, Bryan-College Station, Tyler, and Wichita Falls, it is useful to develop market-specific strategies by customizing messages, channels, and timing.
  • Targeted resource allocation: Allocate greater investments and focus on more dynamic markets with higher growth potential (e.g., Tyler and Bryan-College Station), while applying more conservative strategies or demand-stimulating interventions in more stable or less dynamic markets (e.g., Wichita Falls).

Optimization of listing timing and promotions:

  • Leverage seasonality: Intensify marketing and promotion efforts during peak sales months (spring/summer) to maximize listing conversion.
  • Manage listing efficiency: Monitor the listing efficiency index over time and take actions (e.g., price revision, improved descriptions, new channels) in periods or areas with lower performance.

Predictive modeling and continuous monitoring:

  • Implement flexible predictive models (e.g., GAM) to forecast sales volumes based on the number of listings, months on market, and territorial characteristics, improving commercial planning.
  • Analyze residuals and outliers to identify anomalous cases and adapt listing management strategies for particularly complex markets.

Development of customized KPIs:

  • Use the listing efficiency index as a standard metric, integrated with territorial and temporal variability indicators, to continuously and promptly assess campaign quality and intervene quickly.
  • Monitor market liquidity through variables such as average months to sell listings (months_inventory), to calibrate prices, incentives, and positioning strategies.

Improvement of offer and demand segmentation:

  • The correlation analysis between median price and sales volume suggests the need to evaluate targeted offers for specific market segments with different price and demand characteristics, avoiding risky one-size-fits-all strategies.