Real Estate Market Analysis in Texas

Script for data analysis

This report presents an analysis of the real estate market in Texas based on sales carried out from 2010 to 2014 in the cities of Beaumont, Bryan-College Station, Tyler, and Wichita Falls. Below is a statistical report constructed according to the following approach:

Univariate analysis of the dataset variables with graphical representations, calculation of the main statistical indicators, identification of variables with the greatest variability and skewness, creation of class for some variables, examples of occurrence probabilities within the dataset and creation of new variables such as average price and an effectiveness index of sales listings.
Bivariate analysis to individually explore the relationships between each numerical variable and the categorical variables in the dataset, including graphical representations of relationships, distributions, and main findings.
Bivariate analysis to individually explore the relationships among the numerical variables in the dataset, with investigation of possible linear correlations, linear fitting and residual distribution analysis, nonlinear smooth fitting, and related graphical representations.
Multivariate analysis aimed at finding a plausible model to explain the variability of the dataset and the relationships among its variables, and to generally describe the model’s behavior, including any nonlinearities and group effects, with related graphical representations.

Load dataset

#Set project directory
setwd("C:/Users/matmi/OneDrive/Desktop/ProgettiR")

#List of library to use in project 
library(ggplot2)
library(dplyr)

## 
## Caricamento pacchetto: 'dplyr'

## I seguenti oggetti sono mascherati da 'package:stats':
## 
##     filter, lag

## I seguenti oggetti sono mascherati da 'package:base':
## 
##     intersect, setdiff, setequal, union

library(DescTools)
library(moments)
library(tidyr)
library(corrplot)

## corrplot 0.95 loaded

library(broom)
library(mgcv)

## Caricamento del pacchetto richiesto: nlme

## 
## Caricamento pacchetto: 'nlme'

## Il seguente oggetto è mascherato da 'package:dplyr':
## 
##     collapse

## This is mgcv 1.9-4. For overview type '?mgcv'.

library(lme4)

## Caricamento del pacchetto richiesto: Matrix

## 
## Caricamento pacchetto: 'Matrix'

## I seguenti oggetti sono mascherati da 'package:tidyr':
## 
##     expand, pack, unpack

## 
## Caricamento pacchetto: 'lme4'

## Il seguente oggetto è mascherato da 'package:nlme':
## 
##     lmList

library(ggeffects)
library(knitr)

#Import csv file
data <- read.csv("realestate_texas.csv")

#Attach dataset
attach(data)

#Summary of dataset
summary(data)

##      city                year          month           sales      
##  Length:240         Min.   :2010   Min.   : 1.00   Min.   : 79.0  
##  Class :character   1st Qu.:2011   1st Qu.: 3.75   1st Qu.:127.0  
##  Mode  :character   Median :2012   Median : 6.50   Median :175.5  
##                     Mean   :2012   Mean   : 6.50   Mean   :192.3  
##                     3rd Qu.:2013   3rd Qu.: 9.25   3rd Qu.:247.0  
##                     Max.   :2014   Max.   :12.00   Max.   :423.0  
##      volume        median_price       listings    months_inventory
##  Min.   : 8.166   Min.   : 73800   Min.   : 743   Min.   : 3.400  
##  1st Qu.:17.660   1st Qu.:117300   1st Qu.:1026   1st Qu.: 7.800  
##  Median :27.062   Median :134500   Median :1618   Median : 8.950  
##  Mean   :31.005   Mean   :132665   Mean   :1738   Mean   : 9.193  
##  3rd Qu.:40.893   3rd Qu.:150050   3rd Qu.:2056   3rd Qu.:10.950  
##  Max.   :83.547   Max.   :180000   Max.   :3296   Max.   :14.900

1) Univariate analysis

Variable “city”

This is a categorical variable. First of all we can explore the frequency of each value in variable and calculate Gini Index

#Function for distribution of all variables frequencies
dist_freq <- function(vector) {
  n <- length(vector)
  tab <- table(vector)
  df <- data.frame(
    value = names(tab),
    ni = as.numeric(tab),
    fi = as.numeric(tab) / n,
    Ni = cumsum(as.numeric(tab)),
    Fi = cumsum(as.numeric(tab)) / n
  )
  return(df)
}

dist_city_freq <- dist_freq(city)

#Function for Gini index calculate
gini_index <- function(freq) {
  freq_vec <- as.numeric(freq)
  Gini(freq_vec)
}

gini_city <- gini_index(dist_city_freq$ni)

For the categorical variable “city”, we observe a uniform distribution with equal frequencies across its 4 categories. Consequently, the Gini index is 0, indicating perfect equality in the distribution. Given the categorical nature and the balanced frequencies, this exhaustive analysis sufficiently describes this variable.

#Plot of cities distribution
ggplot(data = data)+
  geom_bar(aes(x = city),
           stat = "count",
           col = "black",
           fill = "blue")+
  labs(title = "City distribution",
       x = "City",
       y = "Abosulte frequencies")+
  theme_classic()

Variable “year”

This is a categorical variable. First of all we can explore the frequency of each value in variable and calculate Gini Index

#Distribution of all variables frequencies
dist_year_freq <- dist_freq(year)

#Gini index calculate
gini_years <- gini_index(dist_year_freq$ni)

For the categorical variable “year”, we observe a uniform distribution with equal frequencies across its 5 categories. Consequently, the Gini index is 0, indicating perfect equality in the distribution. Given the categorical nature and the balanced frequencies, this exhaustive analysis sufficiently describes this variable.

#Plot of years distribution
ggplot(data = data)+
  geom_bar(aes(x = year),
           stat = "count",
           col = "black",
           fill = "blue")+
  labs(title = "Years distribution",
       x = "Years",
       y = "Absolute frequencies")+
  theme_classic()

Variable “month”

This is a categorical variable. First of all we can explore the frequency of each value in variable and calculate Gini Index

#Built correct labels for months
data$month <- sprintf("%02d", as.numeric(data$month))
month_names <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
data$month <- factor(data$month, levels = sprintf("%02d", 1:12), labels = month_names)

#Distribution of all variables frequencies
dist_months_freq <- dist_freq(month)

#Gini index calculate
gini_months <- gini_index(dist_months_freq$ni)

For the categorical variable “month”, we observe a uniform distribution with equal frequencies across its 12 categories. Consequently, the Gini index is 0, indicating perfect equality in the distribution. Given the categorical nature and the balanced frequencies, this exhaustive analysis sufficiently describes this variable.

#Plot of months distribution
ggplot(data = data)+
  geom_bar(aes(x = month),
           stat = "count",
           col = "black",
           fill = "blue")+
  labs(title = "Months distribution",
       x = "Months",
       y = "Aboslute frequencies")+
  theme_classic()

Variable “date”

For time analysis of data we can built a new variable that concatenate year and month

#Variable "date" creation
date <- paste(year, month, "1", sep = "-")
date <- as.Date(date)
data$date <- date

#Distribution of all variables frequencies
dist_dates_freq <- dist_freq(date)

#Gini index calculate
gini_dates <- gini_index(dist_dates_freq$ni)

For the categorical variables “month” and “year”, we observe two uniform distributionswith equal frequencies respectively across their 12 and 5 categories. Consequently, the Gini index of each distribution is 0. Analysis beetwen year and month shows other uniform distribution of month for each years.Starting to paste year and month we built another variable (type date) to show trend and patterns by time

#Plot of month distribution by year (every month is a date "YYYY-MM-01")
ggplot(data = data)+
  geom_bar(aes(x = month),
           stat = "count",
           col = "black",
           fill = "blue")+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months distribution for each year",
       x = "Months",
       y = "Absolute frequencies")+
  scale_y_continuous(breaks = seq(0, 12, 1)) +
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Variable “sales”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index.

Range: range of variable is beetwen 79 and 423 number of sales
Mean: mean is 192.29 number of sales
Median: median is 175.5 number of sales. Mean > Median, so we have positive asimmetry (influenced by high values of outliers)
Standard deviation: standard deviation is 79.65. Mean > Median > Standard Deviation. Positive asimmetry is confirmed
Variation coefficient: variation coefficient is 41.4%. Generally, the variable shows 41.4% of variability instead its means
Skewness: skewness is 0.72. So, it’s > 0 and that confirms positive asimmetry
Kurtosis: kurtosis is -0.313. So, the variable “sales” distribution is platykurtic
Gini index: Gini index is 0.24. The distribution shows moderate heterogeneity

#Distribution of all variables frequencies
dist_sales_freq <- dist_freq(sales)

#Gini index calculate
gini_sales <- gini_index(dist_sales_freq$ni)

#Aritmetic mean
mean_sales <- mean(sales)

#Median
median_sales <- median(sales)

#Standard deviation
sd_sales <- sd(sales)

#Variation coefficient
var_cf_sales <- (sd_sales/mean_sales)*100

#Range
range_sales <- range(sales)

#Quantile and IQR
q_sales <- quantile(sales, seq(0,1, 0.25))
iqr_sales <- IQR(sales)

#Skewness
skew_sales <- skewness(sales)

#Kurtosis
kurt_sales <- kurtosis(sales)-3

#Function to create a vector with principal statistical index
stat_index_vect <- function (pos, mean_v, median_v, sd_v, cf_v, iqr_v, skew_v, kurt_v) {
  name_v <- colnames(data)[pos]
  v_st_index <- cbind(name_v, mean_v, median_v, sd_v, cf_v, iqr_v, skew_v, kurt_v)
  return(v_st_index)
}

pos_sales <- which(names(data) == "sales")
st_index_sales <- stat_index_vect(pos_sales, mean_sales, median_sales, sd_sales, var_cf_sales, iqr_sales, skew_sales, kurt_sales)

#Plot of sales distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = sales),
               binwidth = 10,
               col = "black",
               fill = "blue")+
  geom_vline(xintercept = mean_sales, col = "red", lwd = 1)+
  geom_vline(xintercept = median_sales, col = "pink", lwd = 1)+
  annotate("label", x = mean_sales + 20, y = 16, label = round(mean_sales, 2), color = "red") +
  annotate("label", x = median_sales - 20, y = 16, label = median_sales, color = "pink") +
  labs(title = "Number of sales distribution",
       x = "Number of sales",
       y = "Absolute frequencies")+
  theme_classic()

#Boxplot for variable "sales"
ggplot(data = data)+
  geom_boxplot(aes(y = sales), fill = "blue")+
  labs(title = "Number of sales boxplot",
           x = "Relative frequencies",
           y = "Number of sales")+
  theme_classic()

Variable “volume”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index

Range: range of variable is beetwen 8.17 and 83.55 number of sales
Mean: mean is 31 mln $ of sales
Median: median is 27.06 mln $ of sales. Mean > Median, so we have positive asimmetry (influenced by high values of outliers) Standard deviation: standard deviation is 16.65. Mean > Median > Standard - Deviation. Positive asimmetry is confirmed
Variation coefficient: variation coefficient is 53.7%. Generally, the variable shows 53.7% of variability instead its means
Skewness: skewness is 0.88. So, it’s > 0 and that confirms positive asimmetry
Kurtosis: kurtosis is 0.17. So, the variable “sales” distribution is leptokurtic
Gini index: Gini index is 0.012. The distribution shows moderate heterogeneity

#Distribution of all variables frequencies
dist_volume_freq <- dist_freq(volume)

#Gini index calculate
gini_volume <- gini_index(dist_volume_freq$ni)

#Aritmetic mean
mean_volume <- mean(volume)

#Median
median_volume <- median(volume)

#Standard deviation
sd_volume <- sd(volume)

#Variation coefficient
var_cf_volume <- (sd_volume/mean_volume)*100

#Range
range_volume <- range(volume)

#Quantile and IQR
q_volume <- quantile(volume, seq(0,1, 0.25))
iqr_volume <- IQR(volume)

#Skewness
skew_volume <- skewness(volume)

#Kurtosis
kurt_volume <- kurtosis(volume)-3

pos_volume <- which(names(data) == "volume")
st_index_volume <- stat_index_vect(pos_volume, mean_volume, median_volume, sd_volume, var_cf_volume, iqr_volume, skew_volume, kurt_volume)

#Plot of volume distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = volume),
                 binwidth = 5,
                 col = "black",
                 fill = "blue")+
  geom_vline(xintercept = mean_volume, col = "red", lwd = 1)+
  geom_vline(xintercept = median_volume, col = "pink", lwd = 1)+
  annotate("label", x = mean_volume + 5, y = 30, label = round(mean_volume, 2), color = "red") +
  annotate("label", x = median_volume - 5, y = 30, label = median_volume, color = "pink") +
  labs(title = "Volume of sales distribution",
       x = "Volume of sales (mln $)",
       y = "Absolute frequencies")+
  theme_classic()

#Boxplot for variable "volume"
ggplot(data = data)+
  geom_boxplot(aes(y = volume), fill = "blue")+
  labs(title = "Volume of sales boxplot",
           x = "Relative frequencies",
           y = "Volume of sales (mln $)")+
  theme_classic()

Variable “median_price”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index

Range: range of variable is beetwen 73800 and 180000
Mean: mean is 132665 $ of sales
Median: median is 134500 $ of sales. Mean < Median, so we have negative asimmetry (influenced by low values of outliers)
Standard deviation: standard deviation is 22662
Variation coefficient: variation coefficient is 17.1%. Generally, the variable shows 17.1% of variability instead its means
Skewness: skewness is -0.364. So, it’s < 0 and that confirms negative asimmetry
Kurtosis: kurtosis is -0.623. So, the variable “median_price” distribution is mesokurtic
Gini index: Gini index is 0.14. The distribution shows moderate heterogeneity

#Distribution of all variables frequencies
dist_median_price_freq <- dist_freq(median_price)

#Gini index calculate
gini_median_price <- gini_index(dist_median_price_freq$ni)

#Aritmetic mean
mean_median_price <- mean(median_price)

#Median
median_median_price <- median(median_price)

#Standard deviation
sd_median_price <- sd(median_price)

#Variation coefficient
var_cf_median_price <- (sd_median_price/mean_median_price)*100

#Range
range_median_price <- range(median_price)

#Quantile and IQR
q_median_price <- quantile(median_price, seq(0,1, 0.25))
iqr_median_price <- IQR(median_price)

#Skewness
skew_median_price <- skewness(median_price)

#Kurtosis
kurt_median_price <- kurtosis(median_price)-3

pos_mprice <- which(names(data) == "median_price")
st_index_mprice <- stat_index_vect(pos_mprice, mean_median_price, median_median_price, sd_median_price, var_cf_median_price, iqr_median_price, skew_median_price, kurt_median_price)

#Plot of median price distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = median_price),
                 binwidth = 2500,
                 col = "black",
                 fill = "blue")+
  geom_vline(xintercept = mean_median_price, col = "red", lwd = 1)+
  geom_vline(xintercept = median_median_price, col = "pink", lwd = 1)+
  annotate("label", x = mean_median_price + 12000, y = 18, label = round(mean_median_price, 2), color = "red") +
  annotate("label", x = median_median_price - 12000, y = 18, label = median_median_price, color = "pink") +
  labs(title = "Median price distribution",
       x = "Median price",
       y = "Absolute frequencies")+
  theme_classic()

#Boxplot for variable "median price"
ggplot(data = data)+
  geom_boxplot(aes(y = median_price), fill = "blue")+
  labs(title = "Median price boxplot",
           x = "Relative frequencies",
           y = "Median price")+
  theme_classic()

Variable “listings”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index.

Range: range of variable is beetwen 743 and 3296
Mean: mean is 1738.02 of listings
Median: median is 1618.5 of listings. Mean > Median, so we have positive asimmetry (influenced by high values of outliers)
Standard deviation: standard deviation is 752.71. Mean > Median > Standard Deviation that confirms positive asimmetry
Variation coefficient: variation coefficient is 43.31%. Generally, the variable shows 43.31% of variability instead its means
Skewness: skewness is 0.649. So, it’s > 0 and that confirms positive asimmetry
Kurtosis: kurtosis is -0.792. So, the variable “listings” distribution is mesokurtic
Gini index: Gini index is 0.083. The distribution shows moderate heterogeneity

#Distribution of all variables frequencies
dist_listings_freq <- dist_freq(listings)

#Gini index calculate
gini_listings <- gini_index(dist_listings_freq$ni)

#Aritmetic mean
mean_listings <- mean(listings)

#Median
median_listings <- median(listings)

#Standard deviation
sd_listings <- sd(listings)

#Variation coefficient
var_cf_listings <- (sd_listings/mean_listings)*100

#Range
range_listings <- range(listings)

#Quantile and IQR
q_listings <- quantile(listings, seq(0,1, 0.25))
iqr_listings <- IQR(listings)

#Skewness
skew_listings <- skewness(listings)

#Kurtosis
kurt_listings <- kurtosis(listings)-3

pos_listings <- which(names(data) == "listings")
st_index_listings <- stat_index_vect(pos_listings, mean_listings, median_listings, sd_listings, var_cf_listings, iqr_listings, skew_listings, kurt_listings)

#Plot of listings distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = listings),
                 binwidth = 100,
                 col = "black",
                 fill = "blue")+
  geom_vline(xintercept = mean_listings, col = "red", lwd = 1)+
  geom_vline(xintercept = median_listings, col = "pink", lwd = 1)+
  annotate("label", x = mean_listings + 250, y = 20, label = round(mean_listings, 2), color = "red") +
  annotate("label", x = median_listings - 250, y = 20, label = median_listings, color = "pink") +
  labs(title = "Number of listings distribution",
       x = "Number of listings",
       y = "Absolute frequencies")+
  theme_classic()

#Boxplot for variable "listings"
ggplot(data = data)+
  geom_boxplot(aes(y = listings), fill = "blue")+
  labs(title = "Number of listings boxplot",
           x = "Relative frequencies",
           y = "Number of listings")+
  theme_classic()

Variable “months_inventory”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index

Range: range of variable is beetwen 3.4 and 14.9
Mean: mean is 9.19
Median: median is 8.95. Mean > Median, so we have positive asimmetry (influenced by high values of outliers)
Standard deviation: standard deviation is 2.3. Mean > Median > Standard Deviation that confirms positive asimmetry
Variation coefficient: variation coefficient is 25.06%. Generally, the variable shows 25.06% of variability instead its means
Skewness: skewness is 0.041. So, it’s > 0 and that confirms positive asimmetry
Kurtosis: kurtosis is -0.174. So, the variable “months_inventory” distribution is mesokurtic
Gini index: Gini index is 0.359. The distribution shows moderate heterogeneity

#Distribution of all variables frequencies
dist_minv_freq <- dist_freq(months_inventory)

#Gini index calculate
gini_minv <- gini_index(dist_minv_freq$ni)

#Aritmetic mean
mean_minv <- mean(months_inventory)

#Median
median_minv <- median(months_inventory)

#Standard deviation
sd_minv <- sd(months_inventory)

#Variation coefficient
var_cf_minv <- (sd_minv/mean_minv)*100

#Range
range_minv <- range(months_inventory)

#Quantile and IQR
q_minv <- quantile(months_inventory, seq(0,1, 0.25))
iqr_minv <- IQR(months_inventory)

#Skewness
skew_minv <- skewness(months_inventory)

#Kurtosis
kurt_minv <- kurtosis(months_inventory)-3

pos_minv <- which(names(data) == "months_inventory")
st_index_minv <- stat_index_vect(pos_minv, mean_minv, median_minv, sd_minv, var_cf_minv, iqr_minv, skew_minv, kurt_minv)

#Plot of months inventory distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = months_inventory),
                 binwidth = 1,
                 col = "black",
                 fill = "blue")+
  geom_vline(xintercept = mean_minv, col = "red", lwd = 1)+
  geom_vline(xintercept = median_minv, col = "pink", lwd = 1)+
  annotate("label", x = mean_minv + 0.5, y = 45, label = round(mean_minv, 2), color = "red") +
  annotate("label", x = median_minv - 0.5, y = 45, label = median_minv, color = "pink") +
  labs(title = "Months inventory distribution",
       x = "Months inventory",
       y = "Absolute frequencies")+
  theme_classic()

#Boxplot for variable "months inventory"
ggplot(data = data)+
  geom_boxplot(aes(y = months_inventory), fill = "blue")+
  labs(title = "Months frequencies boxplot",
           x = "Relative frequencies",
           y = "Months inventory")+
  theme_classic()

Other variable analysis

Below are shown some analysis on varibles:
1) Classification of variable in decreasing order of variability
2) Classification of variable in decreasing order of asimmetry
3) Creation of class and its analysis on variables
4) Probability of every occurrency of each variable
5) Creation and analysis of new variable, like mean price and efficiency index of listings

Classification of variable in decreasing order of variability

The variable with the bigger variability is “volume”, with a variation coefficient equal to 53%, while the variable “median_price” has a smaller variability. The results of statistic analysis are shown below

#Creation of matrix of statistical index for each variable in dataset
m_stat_index <- as.data.frame(rbind(
  st_index_sales,
  st_index_volume,
  st_index_mprice,
  st_index_listings,
  st_index_minv
))

m_stat_index_cv <- m_stat_index[order(m_stat_index$cf_v, decreasing = TRUE), ]

#Plot of statistical index table
kable(m_stat_index_cv, caption = "Statistical index for each variable ordinated by variability", row.names = FALSE)

Statistical index for each variable ordinated by variability
name_v	mean_v	median_v	sd_v	cf_v	iqr_v	skew_v	kurt_v
volume	31.0051875	27.0625	16.6514471564494	53.7053586805415	23.2335	0.884742026325995	0.176986997089741
listings	1738.02083333333	1618.5	752.707756098841	43.3083275909432	1029.5	0.649498226273971	-0.791790033332591
sales	192.291666666667	175.5	79.6511111777793	41.4220296482492	120	0.718104024884959	-0.313176409071494
months_inventory	9.1925	8.95	2.30366862229334	25.0603059264982	3.15	0.040975265871081	-0.174447541638487
median_price	132665.416666667	134500	22662.148686505	17.0821825732064	32750	-0.364552878177372	-0.622961820755544

Classification of variable in decreasing order of asimmetry

The variable with the bigger asimmetry is “volume” with a skewness of 88,4%, while the variable “months_inventory” has a smaller asimmetry. The results of statistic analysis are shown below

#Matrix ordered by decreasing asimmetry
m_stat_index$skew_v <- as.numeric(as.character(m_stat_index$skew_v))
m_stat_index_as <- m_stat_index[order(abs(m_stat_index$skew_v), decreasing = TRUE), ]

#Plot of statistical index table
kable(m_stat_index_as, caption = "Statistical index for each variable ordinated by skewness", row.names = FALSE)

Statistical index for each variable ordinated by skewness
name_v	mean_v	median_v	sd_v	cf_v	iqr_v	skew_v	kurt_v
volume	31.0051875	27.0625	16.6514471564494	53.7053586805415	23.2335	0.8847420	0.176986997089741
sales	192.291666666667	175.5	79.6511111777793	41.4220296482492	120	0.7181040	-0.313176409071494
listings	1738.02083333333	1618.5	752.707756098841	43.3083275909432	1029.5	0.6494982	-0.791790033332591
median_price	132665.416666667	134500	22662.148686505	17.0821825732064	32750	-0.3645529	-0.622961820755544
months_inventory	9.1925	8.95	2.30366862229334	25.0603059264982	3.15	0.0409753	-0.174447541638487

Creation of class and its analysis on variables

This analysis divides each variable into classes defined by their respective quartiles to verify whether observations are evenly distributed among the four theoretical groups, each representing approximately 25% of the data. Quartiles were chosen to segment data into balanced subsets, facilitating the evaluation of distribution uniformity and identification of any anomalies such as duplicates at interval boundaries.Statistically, this approach highlights:

Uniformity of segmentation: Ideally, quartile-based classes should have similar absolute frequencies. Significant differences may point to duplicated values, clustering at interval edges, or distribution asymmetries.
Robustness to specific data values: Unequal frequencies suggest that the variable does not split neatly into four equal groups due to repeated values or particular distribution features.
Our results show that, except for the variable months_inventory, all variables display near-uniform frequency distributions across quartiles, indicating effective segmentation and lack of distortion from duplicates or outliers.

In contrast, months_inventory shows low uneven distribution, suggesting data aggregation or clusters that cause asymmetry and call for further examination or alternative segmentation methods.

#Function to create classes for each variable based on thir respective quartiles
create_class <- function(data, field, quantiles) {
  vec <- data[[field]]
  
  class <- cut(vec,
               breaks = quantiles,
               include.lowest = TRUE,
               labels = c("Q1", "Q2", "Q3", "Q4"))
  return(class)
}

#Class for variable "sales", frequency distribution and Gini index
data$sales_class <- create_class(data, "sales", q_sales)
dist_sales_class <- dist_freq(data$sales_class)
gini_sales_class <- gini_index(dist_sales_class$ni)

#Class for variable "volume", frequency distribution and Gini index
data$volume_class <- create_class(data, "volume", q_volume)
dist_volume_class <- dist_freq(data$volume_class)
gini_volume_class <- gini_index(dist_volume_class$ni)

#Class for variable "median_price", frequency distribution and Gini index
data$mdpr_class <- create_class(data, "median_price", q_median_price)
dist_mdpr_class <- dist_freq(data$mdpr_class)
gini_mdpr_class <- gini_index(dist_mdpr_class$ni)

#Class for variable "listings", frequency distribution and Gini index
data$list_class <- create_class(data, "listings", q_listings)
dist_list_class <- dist_freq(data$list_class)
gini_list_class <- gini_index(dist_list_class$ni)

#Class for variable "months_inventory", frequency distribution and Gini index
data$minv_class <- create_class(data, "months_inventory", q_minv)
dist_minv_class <- dist_freq(data$minv_class)
gini_minv_class <- gini_index(dist_minv_class$ni)

#Function to extract only value and absolute frequencies from dataset
exct_data <- function(data) {
  exct_data <- data[,1:2]
  return(exct_data)
}

#Extract data from frequencies total distribution for each variable and built of new dataset
d_s_class <- exct_data(dist_sales_class)
d_s_class$name_var <- "sales"
d_v_class <- exct_data(dist_volume_class)
d_v_class$name_var <- "volume"
d_mp_class <- exct_data(dist_mdpr_class)
d_mp_class$name_var <- "median_price"
d_l_class <- exct_data(dist_list_class)
d_l_class$name_var <- "listings"
d_mv_class <- exct_data(dist_minv_class)
d_mv_class$name_var <- "months_inventory"
dist_freq_class <- rbind(d_s_class, d_v_class, d_mp_class, d_l_class, d_mv_class)

#Plot of absolute frequency distribution of class (quartile) of variable "sales"
ggplot(data = dist_freq_class)+
     geom_bar(aes(x = value, y = ni), stat = "identity", color = "black", fill = "blue") +
     facet_wrap(~ name_var, scales = "free_x")+
     labs(title = "Absolute frequencies for class (quartiles) for each variable", 
          x = "Classes (quartiles)", 
          y = "Absolute frequencies")+ 
 theme_classic()

#Plot of table of Gini index for each variable
v_gini <- as.data.frame(cbind(gini_sales_class, gini_volume_class, gini_mdpr_class, gini_list_class, gini_minv_class))
kable(v_gini,
      caption = "Gini index of classes built from quartiles for each variable", row.naes = FALSE)

Gini index of classes built from quartiles for each variable
gini_sales_class	gini_volume_class	gini_mdpr_class	gini_list_class	gini_minv_class
0.0083333	0	0.0083333	0	0.0416667

Probability of every occurrency of each variable

#Function to create dataset with value and relative frequencies for each variable
exct_freq <- function(data) {

  exct_r1 <- data[, 1]
  exct_r2 <- data[,3]
  exct_rows <- as.data.frame(cbind(exct_r1, exct_r2))
  return(exct_rows)
}

#Creation of dataset for each categorical variable and new dataset
f_city <- exct_freq(dist_city_freq)
f_year <- exct_freq(dist_year_freq)
f_month <- exct_freq(dist_months_freq)
f_date <- exct_freq(dist_dates_freq)

f_dataset <- as.data.frame(rbind(f_city, f_year, f_month, f_date))

#Example of some probabilities
p_beaumont <- f_dataset$exct_r2[f_dataset$exct_r1 == "Beaumont"]
print(paste("The probability that the city associated with a dataset record could be Beaumont is: ", p_beaumont))

## [1] "The probability that the city associated with a dataset record could be Beaumont is:  0.25"

p_tyler <- f_dataset$exct_r2[f_dataset$exct_r1 == "Tyler"]
print(paste("The probability that the city associated with a dataset record could be Tyler is: ", p_tyler))

## [1] "The probability that the city associated with a dataset record could be Tyler is:  0.25"

p_2011 <- f_dataset$exct_r2[f_dataset$exct_r1 == "2011"]
print(paste("The probability that the year associated with a dataset record could be 2011 is: ", p_2011))

## [1] "The probability that the year associated with a dataset record could be 2011 is:  0.2"

p_2011_02_01 <- f_dataset$exct_r2[f_dataset$exct_r1 == "2011-02-01"]
print(paste("The probability that the date associated with a dataset record could be 2011 is: ", p_2011_02_01))

## [1] "The probability that the date associated with a dataset record could be 2011 is:  0.0166666666666667"

Creation and analysis of new variables

Based on variables in the dataset, it could create other variables:
1) Mean price = [volume/sales]
2) Efficiency listings index = [(sales/listings)]x100
3) Variation of efficiency listings index = ([sales - (listings/months_inventory)]/(listings/months_inventory))x100

Mean price represents the mean price of sell for each month, year and city in the dataset
Efficiency listings index represents the percentage of sales on relative listings for each month, year and city
Variation of efficiency listings index represent the percentage of variation between the sales in one period and the efficiency calculated like mean number of sales to sell all listings. The term listings/months_inventory represents the mean of sales to sell all listings. The difference between sales and this last term, divided by the last term, represents the variation of efficiency listings index

#Variable "mean_price"
data$mean_price <- ifelse(sales == 0, 0, volume/sales)

#Variable "efficiency_list_index"
data$efficiency_list_index <- (ifelse(listings == 0, 0, sales/listings))*100

#Variable "variation_eff_list_index"
mean_eff_listings <- ifelse(months_inventory == 0, 0, listings/months_inventory)
data$mean_eff_list_index <- ifelse(mean_eff_listings == 0, 0, 
                                   ((sales - mean_eff_listings)/(mean_eff_listings))*100)

Variable “mean_price”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index

Range: range of variable is beetwen 0.097 and 0.213 (mln $)
Mean: mean is 0.154 (mln $)
Median: median is 0.156 (mkn $). Mean < Median, so we have negative asimmetry (influenced by low values of outliers)
Standard deviation: standard deviation is 0.027.
Variation coefficient: variation coefficient is 17.59%. Generally, the variable shows 17.59% of variability instead its means
Skewness: skewness is -0.069. So, it’s < 0 and that confirms negative asimmetry
Kurtosis: kurtosis is -0.778. So, the variable “mean_price” distribution is mesokurtic
Gini index: Gini index is 0. The distribution is homogeneous

#Distribution of all variables frequencies
dist_mpr_freq <- dist_freq(data$mean_price)

#Gini index calculate
gini_mpr <- gini_index(dist_mpr_freq$ni)

#Aritmetic mean
mean_mpr <- mean(data$mean_price)

#Median
median_mpr <- median(data$mean_price)

#Standard deviation
sd_mpr <- sd(data$mean_price)

#Variation coefficient
var_cf_mpr <- (sd_mpr/mean_mpr)*100

#Range
range_mpr <- range(data$mean_price)

#Quantile and IQR
q_mpr <- quantile(data$mean_price, seq(0,1, 0.25))
iqr_mpr <- IQR(data$mean_price)

#Skewness
skew_mpr <- skewness(data$mean_price)

#Kurtosis
kurt_mpr <- kurtosis(data$mean_price)-3

#Creation of statistical index to plot table
pos_mpr <- which(names(data) == "mean_price")
st_index_mpr <- stat_index_vect(pos_mpr, mean_mpr, median_mpr, sd_mpr, var_cf_mpr, iqr_mpr, skew_mpr, kurt_mpr)

kable(st_index_mpr, caption = "Statistical index for the variable mean_price", row.names = FALSE)

Statistical index for the variable mean_price
name_v	mean_v	median_v	sd_v	cf_v	iqr_v	skew_v	kurt_v
mean_price	0.154320365777658	0.156588479842459	0.0271474563315486	17.5916225928743	0.0409762176937704	-0.0687052806534169	-0.77843287102

#Plot of mean price distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = mean_price),
                 binwidth = 0.01,
                 col = "black",
                 fill = "blue")+
  geom_vline(xintercept = mean_mpr, col = "red", lwd = 1)+
  geom_vline(xintercept = median_mpr, col = "pink", lwd = 1)+
  annotate("label", x = mean_mpr - 0.01, y = 40, label = round(mean_mpr, 2), color = "red") +
  annotate("label", x = median_mpr + 0.01, y = 40, label = round(median_mpr, 2), color = "pink") +
  labs(title = "Mean price distribution ",
       x = "Mean price (mln $)", 
       y = "Absolute frequencies")+ 
  theme_classic()

#Boxplot for variable "mean price"
ggplot(data = data)+
  geom_boxplot(aes(y = mean_price), fill = "blue")+
  labs(title = "Mean price boxplot", 
           x = "Relative frequencies", 
           y = "Mean price (mln $)")+ 
  theme_classic()

Variable “efficiency_list_index”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index

Range: range of variable is beetwen 5.01 and 38.71
Mean: mean is 11.87
Median: median is 10.96. Mean > Median, so we have positive asimmetry (influenced by high values of outliers)
Standard deviation: standard deviation is 4.69. Mean > Median > Standard Deviation, so it confirms positive asimmetry
Variation coefficient: variation coefficient is 39.49%. Generally, the variable shows 39.49% of variability instead its means
Skewness: skewness is 2.09. So, it’s > 0 and that confirms positive asimmetry
Kurtosis: kurtosis is 6.88. So, the variable “efficiency_list_index” distribution is leptokurtic
Gini index: Gini index is 0.042. The distribution shows very small heterogeneity

#Distribution of all variables frequencies
dist_eff_list_freq <- dist_freq(data$efficiency_list_index)

#Gini index calculate
gini_eff_list <- gini_index(dist_eff_list_freq$ni)

#Aritmetic mean
mean_eff_list <- mean(data$efficiency_list_index)

#Median
median_eff_list <- median(data$efficiency_list_index)

#Standard deviation
sd_eff_list <- sd(data$efficiency_list_index)

#Variation coefficient
var_cf_eff_list <- (sd_eff_list/mean_eff_list)*100

#Range
range_eff_list <- range(data$efficiency_list_index)

#Quantile and IQR
q_eff_list <- quantile(data$efficiency_list_index, seq(0,1, 0.25))
iqr_eff_list <- IQR(data$efficiency_list_index)

#Skewness
skew_eff_list <- skewness(data$efficiency_list_index)

#Kurtosis
kurt_eff_list <- kurtosis(data$efficiency_list_index)-3

#Creation of statistical index to plot table
pos_mlis <- which(names(data) == "efficiency_list_index")
st_index_mlis <- stat_index_vect(pos_mlis, mean_eff_list, median_eff_list, sd_eff_list, var_cf_eff_list, iqr_eff_list, skew_eff_list, kurt_eff_list)

kable(st_index_mlis, caption = "Statistical index for variable efficency_list_index", row.names = FALSE)

Statistical index for variable efficency_list_index
name_v	mean_v	median_v	sd_v	cf_v	iqr_v	skew_v	kurt_v
efficiency_list_index	11.8744921731651	10.9626835334067	4.68990009256993	39.4955845199724	4.51207215683548	2.08931813879356	6.88174747715347

Histogram of variable “efficiency_list_index” absolute frequencies with mean and median

#Plot of efficiency listings index distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = efficiency_list_index),
                 binwidth = 1,
                 col = "black",
                 fill = "blue")+
  geom_vline(xintercept = mean_eff_list, col = "red", lwd = 1)+
  geom_vline(xintercept = median_eff_list, col = "pink", lwd = 1)+
  annotate("label", x = mean_eff_list + 2, y = 40, label = round(mean_eff_list, 2), color = "red") +
  annotate("label", x = median_eff_list - 2, y = 40, label = round(median_eff_list, 2), color = "pink") +
  labs(title = "Efficiency listings index distribution",
       x = "Efficiency listings index", 
       y = "Absolute frequencies")+ 
  theme_classic()

#Boxplot for variable "efficiency listings index"
ggplot(data = data)+
  geom_boxplot(aes(y = efficiency_list_index), fill = "blue")+
  labs(title = "Efficiency listings index boxplot", 
           x = "Relative frequencies", 
           y = "Efficiency listings index")+ 
  theme_classic()

Variable “mean_eff_list_index”

This is a numeric variable. First of all we can explore the frequency of each value in variable and calculate all statistical index

Range: range of variable is beetwen -51.3 and 79.3
Mean: mean is 2.07
Median: median is 2.06. Mean > Median, so we have positive asimmetry (influenced by high values of outliers)
Standard deviation: standard deviation is 24.82. This index shows a very high variability of this calculated KPI
Variation coefficient: variation coefficient is 1196.9%. Generally, the variable shows a very high varibility instead its means
Skewness: skewness is 0.46. So, it’s > 0 and that confirms positive asimmetry
Kurtosis: kurtosis is 0.172. So, the variable “mean_eff_list_index” distribution is leptokurtic
Gini index: Gini index is 0. The distribution is homogeneous

#Distribution of all variables frequencies
dist_meff_list_freq <- dist_freq(data$mean_eff_list_index)

#Gini index calculate
gini_meff_list <- gini_index(dist_meff_list_freq$ni)

#Aritmetic mean
mean_meff_list <- mean(data$mean_eff_list_index)

#Median
median_meff_list <- median(data$mean_eff_list_index)

#Standard deviation
sd_meff_list <- sd(data$mean_eff_list_index)

#Variation coefficient
var_cf_meff_list <- (sd_meff_list/mean_meff_list)*100

#Range
range_meff_list <- range(data$mean_eff_list_index)

#Quantile and IQR
q_meff_list <- quantile(data$mean_eff_list_index, seq(0,1, 0.25))
iqr_meff_list <- IQR(data$mean_eff_list_index)

#Skewness
skew_meff_list <- skewness(data$mean_eff_list_index)

#Kurtosis
kurt_meff_list <- kurtosis(data$mean_eff_list_index)-3

#Creation of statistical index to plot table
pos_mel <- which(names(data) == "mean_eff_list_index")
st_index_mel <- stat_index_vect(pos_mel, mean_meff_list, median_meff_list, sd_meff_list, var_cf_meff_list, iqr_meff_list, skew_meff_list, kurt_meff_list)

kable(st_index_mel, caption = "Statistical index for variation of efficiency listings index", row.names = FALSE)

Statistical index for variation of efficiency listings index
name_v	mean_v	median_v	sd_v	cf_v	iqr_v	skew_v	kurt_v
mean_eff_list_index	2.07809270544987	2.06321958662783	24.8726892383184	1196.89988676102	34.0724048034506	0.465258599586698	0.171896889361363

#Plot of variation efficiency listings index distribution with mean and median
ggplot(data = data)+
  geom_histogram(aes(x = mean_eff_list_index),
                 binwidth = 5,
                 col = "black",
                 fill = "blue")+
  geom_vline(xintercept = mean_meff_list, col = "red", lwd = 1)+
  geom_vline(xintercept = median_meff_list, col = "pink", lwd = 1)+
  annotate("label", x = mean_meff_list + 10, y = 25, label = round(mean_meff_list, 2), color = "red") +
  annotate("label", x = median_meff_list - 10, y = 25, label = round(median_meff_list, 2), color = "pink") +
  labs(title = "Variation of efficiency listings index distribution",
       x = "Variation of efficiency listings index (%)", 
       y = "Absolute frequencies")+ 
  theme_classic()

#Boxplot for variable "variation efficiency listings index"
ggplot(data = data)+
  geom_boxplot(aes(y = mean_eff_list_index), fill = "blue")+
  labs(title = "Variation of efficiency listings index boxplot", 
           x = "Relative frequencies", 
           y = "Variation of efficiency listings index (%)")+ 
  theme_classic()

Bivariate analysis by year, month, city and time series

Variable “sales”

Bivariate analysis by city: Tyler is the city with the highest total and average number of sales, while Wichita Falls has the lowest indicators. The variability in the number of sales is around 20% for all cities except Bryan-College Station. Bryan shows a much higher variability in sales number (41%). This situation suggests a healthier market in Tyler and Bryan-College Station, although the latter has high variability and is therefore more subject to temporal fluctuations, and a somewhat less healthy market in Beaumont and Wichita Falls, albeit with lower temporal fluctuations.

Bivariate analysis by year and month: The analysis shows that the total and average number of sales peak around mid-year, regardless of the year. This suggests a seasonal trend in the data, which should be confirmed through the analysis of the historical data trends. Additionally, three peaks (at the beginning, middle, and end of the year) in sales variability are observed, independently of the year. Overall, the results indicate real estate markets subject to seasonality, with differences in sales opportunities based on different times of the year and varying peaks in sales variability. These factors should be considered, for example, when managing advertising campaigns.

Bivariate analysis by date: The analysis confirms the seasonal pattern of the data for each city, showing a periodic and repetitive trend with a peak in the number of sales around mid-year. The temporal distribution of the boxplots by city further highlights the mid-year sales peak and the three variability peaks identified earlier. Overall, these findings confirm at the local level—with some differences—what was observed at the global level. Moreover, the temporal trends support the analysis of sales numbers by city, identifying a positive sales trend for all cities (especially Tyler and Bryan-College Station, which have higher total and average sales) and a downward trend for Wichita Falls.

#Function to aggregate data by city
agg_city <- function (data, city, kpi){
  data_by_city <- data %>%
    group_by({{city}}) %>%
    summarise(num_kpi = sum({{kpi}}), mean_kpi = mean({{kpi}}), sd_kpi = sd({{kpi}}), .groups = "drop") %>%
    mutate(cv_kpi = (sd_kpi/mean_kpi)*100)
  
  return(data_by_city)
}

#Total number, mean and standard deviation of sales by city
sales_by_city <- agg_city(data, city, sales)
names(sales_by_city)[2:5] <- c("num_sales", "mean_sales", "sd_sales", "cv_sales")

kable(sales_by_city, caption = "Statistical index for number of sales aggregated by city", row.names = FALSE)

Statistical index for number of sales aggregated by city
city	num_sales	mean_sales	sd_sales	cv_sales
Beaumont	10643	177.3833	41.48395	23.38661
Bryan-College Station	12358	205.9667	84.98374	41.26092
Tyler	16185	269.7500	61.96380	22.97083
Wichita Falls	6964	116.0667	22.15192	19.08551

#Plot of total number of sales by city
ggplot(data = sales_by_city)+
  geom_bar(aes(x = city, y = num_sales),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Number of sales by city", 
       x = "City", 
       y = "Number of sales")+ 
  theme_classic()

#Plot of mean number of sales by city
ggplot(data = sales_by_city)+
  geom_bar(aes(x = city, y = mean_sales),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Mean number of sales by city", 
       x = "City", 
       y = "Mean number of sales")+ 
  theme_classic()

#Plot of variation coefficient of number of sales by city
ggplot(data = sales_by_city)+
  geom_bar(aes(x = city, y = cv_sales),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Number of sales variability by city", 
       x = "City", 
       y = "Number of sales variability")+ 
  theme_classic()

#Function to aggregate data by month and year
agg_date <- function (data, year, month, kpi){
  data_by_date <- data %>%
    group_by({{year}}, {{month}}) %>%
    summarise(num_kpi = sum({{kpi}}), mean_kpi = mean({{kpi}}), sd_kpi = sd({{kpi}}), .groups = "drop") %>%
    mutate(cv_kpi = (sd_kpi/mean_kpi)*100)
  
  return(data_by_date)
}

#Total number, mean and standard deviation of sales by month and year
sales_by_date <- agg_date(data, year, month, sales)
names(sales_by_date)[3:6] <- c("num_sales", "mean_sales", "sd_sales", "cv_sales")

kable(sales_by_date,caption = "Statistical index for number of salesw aggregated by year and months", row.names = FALSE)

Statistical index for number of salesw aggregated by year and months
year	month	num_sales	mean_sales	sd_sales	cv_sales
2010	Jan	421	105.25	36.60943	34.78330
2010	Feb	487	121.75	40.26061	33.06826
2010	Mar	755	188.75	43.59950	23.09907
2010	Apr	916	229.00	63.95311	27.92712
2010	May	931	232.75	58.84089	25.28072
2010	Jun	866	216.50	71.44462	32.99982
2010	Jul	712	178.00	62.50867	35.11723
2010	Aug	738	184.50	45.00000	24.39024
2010	Sep	598	149.50	47.19816	31.57068
2010	Oct	565	141.25	45.70467	32.35729
2010	Nov	503	125.75	30.99866	24.65102
2010	Dec	604	151.00	40.70217	26.95508
2011	Jan	425	106.25	27.03547	25.44515
2011	Feb	469	117.25	44.25965	37.74810
2011	Mar	668	167.00	52.42773	31.39385
2011	Apr	716	179.00	58.64583	32.76303
2011	May	780	195.00	70.28039	36.04123
2011	Jun	885	221.25	93.93038	42.45441
2011	Jul	812	203.00	69.95713	34.46164
2011	Aug	786	196.50	70.27802	35.76490
2011	Sep	629	157.25	67.60855	42.99431
2011	Oct	594	148.50	57.57604	38.77174
2011	Nov	549	137.25	49.37864	35.97715
2011	Dec	565	141.25	48.25194	34.16067
2012	Jan	499	124.75	29.78115	23.87266
2012	Feb	574	143.50	57.61076	40.14687
2012	Mar	711	177.75	66.69020	37.51910
2012	Apr	747	186.75	52.77863	28.26165
2012	May	882	220.50	90.71751	41.14173
2012	Jun	898	224.50	86.18004	38.38755
2012	Jul	928	232.00	89.81462	38.71320
2012	Aug	954	238.50	87.99432	36.89489
2012	Sep	707	176.75	78.20646	44.24694
2012	Oct	742	185.50	79.80601	43.02211
2012	Nov	650	162.50	37.24245	22.91843
2012	Dec	643	160.75	52.20073	32.47324
2013	Jan	576	144.00	49.22059	34.18097
2013	Feb	593	148.25	54.90219	37.03351
2013	Mar	814	203.50	64.04426	31.47138
2013	Apr	878	219.50	74.54082	33.95937
2013	May	1057	264.25	90.36362	34.19626
2013	Jun	1045	261.25	108.21699	41.42277
2013	Jul	1127	281.75	122.70391	43.55063
2013	Aug	1107	276.75	92.01585	33.24873
2013	Sep	814	203.50	66.00253	32.43367
2013	Oct	738	184.50	65.97727	35.76004
2013	Nov	690	172.50	65.07688	37.72573
2013	Dec	733	183.25	64.04881	34.95160
2014	Jan	627	156.75	61.34805	39.13751
2014	Feb	694	173.50	62.21736	35.86015
2014	Mar	841	210.25	85.35563	40.59721
2014	Apr	977	244.25	84.10063	34.43219
2014	May	1127	281.75	112.15577	39.80684
2014	Jun	1177	294.25	134.62386	45.75152
2014	Jul	1136	284.00	122.29745	43.06248
2014	Aug	1044	261.00	89.70693	34.37047
2014	Sep	899	224.75	103.54186	46.06979
2014	Oct	959	239.75	106.31518	44.34418
2014	Nov	745	186.25	84.50000	45.36913
2014	Dec	843	210.75	91.73649	43.52858

#Plot of total number of sales by month and year
ggplot(data = sales_by_date)+
  geom_bar(aes(x = month, y = num_sales),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = num_sales, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Number of sales by year and month", 
       x = "Months", 
       y = "Number of sales")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of mean number of sales by month and year
ggplot(data = sales_by_date)+
  geom_bar(aes(x = month, y = mean_sales),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = mean_sales, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Mean number of sales by year and month", 
       x = "Months", 
       y = "Mean number of sales")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of variation coefficient of sales by month and year
ggplot(data = sales_by_date)+
  geom_bar(aes(x = month, y = cv_sales),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = cv_sales, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Number of sales variability by year and month", 
       x = "Months", 
       y = "Number of sales variability")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Function to aggregate data by date and city
agg_datecity <- function (data, city, date, kpi){
  data_by_datecity <- data %>%
    group_by({{city}}, {{date}}) %>%
    summarise(num_kpi = sum({{kpi}}), .groups = "drop")
  
  return(data_by_datecity)
}

#Total number, mean and standard deviation of sales by date and city for trend analysis
sales_by_datecity <- agg_datecity(data, city, date, sales)
names(sales_by_datecity)[3] <- c("num_sales")

#Plot by year and city to explore trend 
ggplot(data = sales_by_datecity)+
  geom_smooth(aes(x = date, y = num_sales, colour = city), lwd = 1, method = "loess")+
  geom_point(aes(x = date, y = num_sales, colour = city))+
  labs(title = "Number of sales trend by city", 
       x = "Period", 
       y = "Number of sales")+ 
  facet_wrap(~ city, scales = "free_x")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using formula = 'y ~ x'

#Boxplot of number of sales by date and city
ggplot(data = data)+
  geom_boxplot(aes(x = month, y = sales, fill = city))+
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Number of sales distribution by city and months", 
       x = "Months", 
       y = "Number of sales")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of number of sales by month, year and city
ggplot(data = data)+
  geom_bar(aes(x = month, y = sales, fill = city),
           stat = "identity",
           position = "stack",
           col = "black")+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Number of sales by year, month and city", 
       x = "Months", 
       y = "Number of sales")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Variable “volume”

Bivariate analysis by city: Tyler is the city with the highest total and average volume of sales, while Wichita Falls has the lowest indicators. The variability in the volume of sales is around 25-30% for all cities except Bryan-College Station. Bryan shows a much higher variability in sales number (45%). This situation suggests a healthier market in Tyler and Bryan-College Station, although the latter has high variability and is therefore more subject to temporal fluctuations, and a somewhat less healthy market in Beaumont and Wichita Falls, albeit with lower temporal fluctuations.

Bivariate analysis by year and month: The analysis shows that the total and average volume of sales peak around mid-year, regardless of the year. This suggests a seasonal trend in the data, which should be confirmed through the analysis of the historical data trends. Additionally, three peaks (at the beginning, middle, and end of the year) in volume of sales variability are observed, independently of the year. Overall, the results indicate real estate markets subject to seasonality, with differences in sales opportunities based on different times of the year and varying peaks in sales variability. These factors should be considered, for example, when managing advertising campaigns.

Bivariate analysis by date: The analysis confirms the seasonal pattern of the data for each city, showing a periodic and repetitive trend with a peak in the volume of sales around mid-year. The temporal distribution of the boxplots by city further highlights the mid-year sales peak and the three variability peaks identified earlier. Overall, these findings confirm at the local level—with some differences—what was observed at the global level. Moreover, the temporal trends support the analysis of sales volume by city, identifying a positive sales trend for all cities (especially Tyler and Bryan-College Station, which have higher total and average sales) and a downward trend for Wichita Falls.

Overall, the analysis suggests a likely linear relationship between the number of sales and the value of sales, which should be further confirmed by investigating the relationship between the two variables.

#Total, mean and standard deviation of volume by city
volume_by_city <- agg_city(data, city, volume)
names(volume_by_city)[2:5] <- c("num_volume", "mean_volume", "sd_volume", "cv_volume")

kable(volume_by_city, caption = "Statistical index for volume of sales in mln $ aggregated by city", row.names = FALSE)

Statistical index for volume of sales in mln $ aggregated by city
city	num_volume	mean_volume	sd_volume	cv_volume
Beaumont	1567.896	26.13160	6.970384	26.67416
Bryan-College Station	2291.496	38.19160	17.248577	45.16327
Tyler	2746.043	45.76738	13.107146	28.63862
Wichita Falls	835.810	13.93017	3.239766	23.25719

#Plot of volume of sales by city
ggplot(data = volume_by_city)+
  geom_bar(aes(x = city, y = num_volume),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Volume of sales by city", 
       x = "City", 
       y = "Volume of sales (mln $)")+ 
  theme_classic()

#Plot of mean volume of sales by city
ggplot(data = volume_by_city)+
  geom_bar(aes(x = city, y = mean_volume),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Mean volume of sales by city", 
       x = "City", 
       y = "Mean volume of sales (mln $)")+ 
  theme_classic()

#Plot of variation coefficient of volume of sales by city
ggplot(data = volume_by_city)+
  geom_bar(aes(x = city, y = cv_volume),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Volume of sales variability by city", 
       x = "City", 
       y = "Volume of sales variability")+ 
  theme_classic()

#Total number, mean and standard deviation of volume by month and year
volume_by_date <- agg_date(data, year, month, volume)
names(volume_by_date)[3:6] <- c("num_volume", "mean_volume", "sd_volume", "cv_volume")

kable(volume_by_date, caption = "Statistical index for volume of sales in mln $ aggregated by year and month", row.names = FALSE)

Statistical index for volume of sales in mln $ aggregated by year and month
year	month	num_volume	mean_volume	sd_volume	cv_volume
2010	Jan	63.751	15.93775	6.922791	43.43644
2010	Feb	76.897	19.22425	8.535588	44.40011
2010	Mar	111.876	27.96900	7.298540	26.09511
2010	Apr	135.196	33.79900	13.280337	39.29210
2010	May	143.689	35.92225	13.236626	36.84799
2010	Jun	138.191	34.54775	13.560836	39.25244
2010	Jul	106.764	26.69100	12.121784	45.41525
2010	Aug	114.356	28.58900	10.666873	37.31111
2010	Sep	88.826	22.20650	7.228736	32.55234
2010	Oct	87.164	21.79100	8.068802	37.02814
2010	Nov	73.663	18.41575	5.684010	30.86494
2010	Dec	92.070	23.01750	7.441038	32.32774
2011	Jan	60.658	15.16450	5.313391	35.03835
2011	Feb	69.379	17.34475	8.107725	46.74455
2011	Mar	101.566	25.39150	10.674532	42.03979
2011	Apr	110.267	27.56675	11.256460	40.83347
2011	May	122.036	30.50900	15.375147	50.39545
2011	Jun	138.986	34.74650	18.521100	53.30350
2011	Jul	130.855	32.71375	14.836718	45.35315
2011	Aug	120.816	30.20400	14.510272	48.04089
2011	Sep	95.103	23.77575	12.214512	51.37382
2011	Oct	88.274	22.06850	10.280971	46.58663
2011	Nov	86.288	21.57200	10.695940	49.58251
2011	Dec	83.347	20.83675	8.284211	39.75769
2012	Jan	69.791	17.44775	6.837480	39.18832
2012	Feb	82.504	20.62600	10.198258	49.44370
2012	Mar	109.129	27.28225	12.719385	46.62147
2012	Apr	112.128	28.03200	10.981393	39.17449
2012	May	147.526	36.88150	19.394133	52.58499
2012	Jun	149.335	37.33375	18.633920	49.91173
2012	Jul	152.463	38.11575	20.164871	52.90430
2012	Aug	154.271	38.56775	19.762346	51.24060
2012	Sep	112.759	28.18975	15.251248	54.10210
2012	Oct	115.004	28.75100	15.203367	52.87944
2012	Nov	102.547	25.63675	7.173133	27.97989
2012	Dec	97.386	24.34650	10.330214	42.42997
2013	Jan	91.739	22.93475	9.591138	41.81924
2013	Feb	89.946	22.48650	11.426154	50.81340
2013	Mar	125.502	31.37550	13.893366	44.28094
2013	Apr	145.621	36.40525	18.420387	50.59816
2013	May	184.474	46.11850	22.272422	48.29390
2013	Jun	184.315	46.07875	24.932241	54.10789
2013	Jul	188.551	47.13775	26.927092	57.12426
2013	Aug	185.637	46.40925	20.868560	44.96638
2013	Sep	136.781	34.19525	14.307844	41.84161
2013	Oct	124.589	31.14725	15.597292	50.07598
2013	Nov	110.195	27.54875	12.784377	46.40638
2013	Dec	119.965	29.99125	12.059098	40.20872
2014	Jan	94.076	23.51900	12.074357	51.33873
2014	Feb	114.304	28.57600	13.162092	46.05995
2014	Mar	139.621	34.90525	17.803333	51.00474
2014	Apr	162.877	40.71925	20.136292	49.45153
2014	May	196.317	49.07925	26.326048	53.63987
2014	Jun	215.236	53.80900	30.670720	56.99924
2014	Jul	203.805	50.95125	29.507853	57.91389
2014	Aug	185.203	46.30075	22.953933	49.57573
2014	Sep	158.514	39.62850	23.355544	58.93623
2014	Oct	166.541	41.63525	21.281137	51.11327
2014	Nov	123.450	30.86250	17.263882	55.93805
2014	Dec	149.125	37.28125	19.756553	52.99327

#Plot of volume of sales by month and year
ggplot(data = volume_by_date)+
  geom_bar(aes(x = month, y = num_volume),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = num_volume, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Volume of sales by year and month", 
       x = "Months", 
       y = "Volume of sales (mln $)")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of mean volume of sales by month and year
ggplot(data = volume_by_date)+
  geom_bar(aes(x = month, y = mean_volume),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = mean_volume, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Mean volume of sales by year and month", 
       x = "Months", 
       y = "Mean volume of sales (mln $)")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of variation coefficient of volume of sales by month and year
ggplot(data = volume_by_date)+
  geom_bar(aes(x = month, y = cv_volume),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = cv_volume, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Volume of sales varibility by year and month", 
       x = "Months", 
       y = "Volume of sales variability")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Total volume of sales by date and city for trend analysis
volume_by_datecity <- agg_date(data, city, date, volume)
names(volume_by_datecity)[3] <- c("num_volume")

#Plot by year and city to explore trend 
ggplot(data = volume_by_datecity)+
  geom_smooth(aes(x = date, y = num_volume, colour = city), lwd = 1)+
  geom_point(aes(x = date, y = num_volume, colour = city))+
  labs(title = "Volume of sales trend by city", 
       x = "Period", 
       y = "Volume of sales (mln $)")+ 
  facet_wrap(~ city, scales = "free_x")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Boxplot of volume of sales by date and city
ggplot(data = data)+
  geom_boxplot(aes(x = month, y = volume, fill = city))+
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Volume of sales distribution by city and month", 
       x = "Months", 
       y = "Volume of sales (mln $)")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of volume of sales by month, year and city
ggplot(data = data)+
  geom_bar(aes(x = month, y = volume, fill = city),
           stat = "identity",
           position = "stack",
           col = "black")+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Volume of sales by year, month and city", 
       x = "Months", 
       y = "Volume of sales (mln $)")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Variable “median_price”

Bivariate analysis by city: Bryan-College Station is the city with the highest total and average median price, even exceeding the same indicators for Tyler. Tyler shows generally higher median sale prices, despite having higher total sales value and number of sales compared to Bryan. Wichita Falls, on the other hand, confirms to be the city with the least healthy real estate market, having the lowest median prices among the cities and the highest variability in median prices.

Bivariate analysis by year and month: The analysis shows a more stable pattern across months and years for the median sale price compared to the number and value of sales. No peaks are observed in the total or average median sale prices. This suggests that seasonality affects the number and value of sales but not house valuations and thus the median price. However, the variability over time reveals a more chaotic situation across years, without a clear pattern, indicating influences from more complex dynamics rather than periodic or recurring variability.

Bivariate analysis by date: The temporal analysis confirms a relatively stable trend in the median sale price, showing a slight increase for the cities of Bryan-College Station and Tyler, which are also the cities with the highest number and value of sales and the strongest upward trends. Unlike the previous analysis, this one highlights a seasonal trend more clearly, although it is less pronounced than that of the number and value of sales, confirming the impression that additional dynamics influence the median sale price, obscuring a purely seasonal pattern.

Overall, it appears that the median sale price does not follow a strictly linear relationship with the number and value of sales, suggesting a more complex dynamic influenced by local behaviors. This situation should be further confirmed through bivariate analysis of the relationship with the mentioned variables.

#Total, mean and standard deviation of median price by city
median_price_by_city <- agg_city(data, city, median_price)
names(median_price_by_city)[2:5] <- c("num_median_price", "mean_median_price", "sd_median_price", "cv_median_price")

kable(median_price_by_city, caption = "Statistical index for median price aggregated by city", row.names = FALSE)

Statistical index for median price aggregated by city
city	num_median_price	mean_median_price	sd_median_price	cv_median_price
Beaumont	7799300	129988.3	10104.993	7.773769
Bryan-College Station	9449300	157488.3	8852.235	5.620883
Tyler	8486500	141441.7	9336.538	6.600981
Wichita Falls	6104600	101743.3	11320.034	11.126070

#Plot of volume of median price by city
ggplot(data = median_price_by_city)+
  geom_bar(aes(x = city, y = num_median_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Total median price by city", 
       x = "City", 
       y = "Total median price")+ 
  theme_classic()

#Plot of mean median price of sales by city
ggplot(data = median_price_by_city)+
  geom_bar(aes(x = city, y = mean_median_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Mean of median price by city", 
       x = "City", 
       y = "Mean of median price")+ 
  theme_classic()

#Plot of variation coefficient of median price of sales by city
ggplot(data = median_price_by_city)+
  geom_bar(aes(x = city, y = cv_median_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Median price variability by city", 
       x = "City", 
       y = "Median price variability")+ 
  theme_classic()

#Total number, mean and standard deviation of median price by month and year
median_price_by_date <- agg_date(data, year, month, median_price)
names(median_price_by_date)[3:6] <- c("num_median_price", "mean_median_price", "sd_median_price", "cv_median_price")

kable(median_price_by_date, caption = "Statistical index for median price by year and month", row.names = FALSE)

Statistical index for median price by year and month
year	month	num_median_price	mean_median_price	sd_median_price	cv_median_price
2010	Jan	541800	135450	33735.69	24.906378
2010	Feb	524400	131100	31406.47	23.956121
2010	Mar	489300	122325	24973.24	20.415480
2010	Apr	513200	128300	18899.38	14.730618
2010	May	520500	130125	20121.53	15.463229
2010	Jun	533100	133275	15108.80	11.336560
2010	Jul	508900	127225	23351.71	18.354657
2010	Aug	524400	131100	23883.47	18.217748
2010	Sep	503300	125825	20739.56	16.482858
2010	Oct	536600	134150	25058.66	18.679586
2010	Nov	526400	131600	32545.87	24.730901
2010	Dec	527300	131825	16897.61	12.818212
2011	Jan	489000	122250	24234.20	19.823480
2011	Feb	484800	121200	24134.76	19.913169
2011	Mar	507000	126750	21773.76	17.178510
2011	Apr	529900	132475	15490.94	11.693480
2011	May	515100	128775	17313.46	13.444736
2011	Jun	521400	130350	24845.86	19.060881
2011	Jul	523000	130750	20059.99	15.342251
2011	Aug	541400	135350	20121.38	14.866185
2011	Sep	516400	129100	22192.04	17.189807
2011	Oct	496400	124100	35492.44	28.599871
2011	Nov	511600	127900	31642.06	24.739688
2011	Dec	501000	125250	22397.10	17.881915
2012	Jan	457000	114250	24834.72	21.737173
2012	Feb	508900	127225	20443.48	16.068759
2012	Mar	517100	129275	23645.21	18.290626
2012	Apr	505000	126250	26047.33	20.631547
2012	May	522300	130575	23545.05	18.031817
2012	Jun	544700	136175	15604.14	11.458887
2012	Jul	549500	137375	18896.27	13.755249
2012	Aug	540200	135050	24313.85	18.003592
2012	Sep	541100	135275	27574.31	20.383891
2012	Oct	500300	125075	28832.55	23.052207
2012	Nov	547000	136750	13484.93	9.861009
2012	Dec	510600	127650	27354.52	21.429318
2013	Jan	513300	128325	23111.96	18.010486
2013	Feb	522400	130600	20824.18	15.945011
2013	Mar	512900	128225	32607.20	25.429671
2013	Apr	524200	131050	28138.88	21.471867
2013	May	564100	141025	17159.33	12.167578
2013	Jun	558100	139525	26363.15	18.894931
2013	Jul	544800	136200	26459.65	19.427059
2013	Aug	558100	139525	23961.97	17.173959
2013	Sep	565400	141350	18549.12	13.122833
2013	Oct	562200	140550	24629.86	17.523913
2013	Nov	536600	134150	24053.90	17.930598
2013	Dec	552600	138150	19851.53	14.369549
2014	Jan	483900	120975	28066.04	23.199866
2014	Feb	561000	140250	24954.96	17.793198
2014	Mar	522000	130500	26588.59	20.374402
2014	Apr	557500	139375	26724.94	19.174846
2014	May	567700	141925	21542.57	15.178844
2014	Jun	595100	148775	16999.88	11.426569
2014	Jul	568800	142200	29364.04	20.649815
2014	Aug	569400	142350	31003.82	21.779991
2014	Sep	554600	138650	38758.10	27.953910
2014	Oct	574100	143525	25713.60	17.915763
2014	Nov	564500	141125	30097.77	21.327030
2014	Dec	576500	144125	32345.36	22.442576

#Plot of median price of sales by month and year
ggplot(data = median_price_by_date)+
  geom_bar(aes(x = month, y = num_median_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = num_median_price, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Total median price by year and month", 
       x = "Months", 
       y = "Total median price")+  
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of mean median price of sales by month and year
ggplot(data = median_price_by_date)+
  geom_bar(aes(x = month, y = mean_median_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = mean_median_price, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Mean of median pruce by year and month", 
       x = "Months", 
       y = "Mean of median price")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of variation coefficient of median price of sales by month and year
ggplot(data = median_price_by_date)+
  geom_bar(aes(x = month, y = cv_median_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = cv_median_price, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Median price variability by year and month", 
       x = "Months", 
       y = "Variability of median price")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Total volume of median price by date and city for trend analysis
median_price_by_datecity <- agg_datecity(data, city, date, median_price)
names(median_price_by_datecity)[3] <- c("num_median_price")

#Plot by year and city to explore trend 
ggplot(data = median_price_by_datecity)+
  geom_smooth(aes(x = date, y = num_median_price, colour = city), lwd = 1)+
  geom_point(aes(x = date, y = num_median_price, colour = city))+
  labs(title = "Total median price trend by city", 
       x = "Period", 
       y = "Total median price")+ 
  facet_wrap(~ city, scales = "free_x")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Boxplot of median price of sales by date and city
ggplot(data = data)+
  geom_boxplot(aes(x = month, y = median_price, fill = city))+
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price distribution by city and month", 
       x = "Months", 
       y = "Median price")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of median price by month, year and city
ggplot(data = data)+
  geom_bar(aes(x = month, y = median_price, fill = city),
           stat = "identity",
           position = "stack",
           col = "black")+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Median price by year, month and city", 
       x = "Months", 
       y = "Median price")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Variable “listings”

Bivariate analysis by city: The analysis shows that Tyler is the city with the highest total and average number of listings, and it is also the city with the highest number and value of sales. Wichita Falls, on the other hand, has the lowest total and average number of listings and ranks last in both sales number and value. This may suggest some correlation between the number of listings and the number and value of sales; however, it could also reflect higher demand, which is mirrored in the larger number of available listings as well as in the higher number and value of sales, revealing a correlation that may not be truly causal. The highest variability in the number of listings occurs in Bryan-College Station, which is also the city with the highest total and average median sale price. This could indicate that in this city, the relationship between supply and demand—which influences the sale price of individual listings—is unstable or subject to variation due to complex dynamics, which also affect the number of available listings.

Bivariate analysis by year and month: The analysis shows that the total and average number of listings has a slight peak around mid-year, regardless of the year, similar to what was observed for the number and value of sales. Variability, however, does not appear to follow a specific pattern and seems rather stable over time, except for some cases (such as in 2013 or 2014).

Bivariate analysis by date: The analysis shows a decrease over time in the number of listings for the cities of Bryan-College Station and Tyler, while Beaumont and Wichita Falls exhibit a stable situation. Overall, it appears that cities experiencing an increase over time in the number and value of sales—and thus seemingly having a better real estate market—show the opposite trend in the number of listings. This could be due to the fact that, as demand and the respective sales capacity and market penetration increase, it is no longer necessary to invest as much in advertising and listings. This behavior is confirmed by the boxplot, which supports the temporal trend of average values and variability of observations by city.

Overall, the number of listings shows an inverse proportional relationship with the number and value of sales, which, however, needs to be further investigated to determine whether it is linear or influenced by more complex nonlinear behaviors.

#Total, mean and standard deviation of listings by city
listings_by_city <- agg_city(data, city, listings)
names(listings_by_city)[2:5] <- c("num_listings", "mean_listings", "sd_listings", "cv_listings")

kable(listings_by_city, caption = "Statistical index for number of listings aggregated by city", row.names = FALSE)

Statistical index for number of listings aggregated by city
city	num_listings	mean_listings	sd_listings	cv_listings
Beaumont	100759	1679.3167	91.13382	5.426839
Bryan-College Station	87488	1458.1333	252.52753	17.318548
Tyler	174303	2905.0500	226.75458	7.805531
Wichita Falls	54575	909.5833	73.75504	8.108663

#Plot of volume of listings by city
ggplot(data = listings_by_city)+
  geom_bar(aes(x = city, y = num_listings),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Number of listings by city", 
       x = "City", 
       y = "Number of listings")+ 
  theme_classic()

#Plot of mean listings of sales by city
ggplot(data = listings_by_city)+
  geom_bar(aes(x = city, y = mean_listings),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Mean of listings by city", 
       x = "City", 
       y = "Mean of listings")+ 
  theme_classic()

#Plot of variation coefficient of listings of sales by city
ggplot(data = listings_by_city)+
  geom_bar(aes(x = city, y = cv_listings),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Listings variability by city", 
       x = "City", 
       y = "Listings variability")+ 
  theme_classic()

#Total number, mean and standard deviation of listings by month and year
listings_by_date <- agg_date(data, year, month, listings)
names(listings_by_date)[3:6] <- c("num_listings", "mean_listings", "sd_listings", "cv_listings")

kable(listings_by_date, caption = "Statistical index for number of listings aggregated by year and month", row.names = FALSE)

Statistical index for number of listings aggregated by year and month
year	month	num_listings	mean_listings	sd_listings	cv_listings
2010	Jan	6466	1616.50	783.9211	48.49497
2010	Feb	6703	1675.75	779.9484	46.54325
2010	Mar	6941	1735.25	738.8362	42.57808
2010	Apr	7610	1902.50	871.1883	45.79176
2010	May	7473	1868.25	947.4173	50.71149
2010	Jun	7657	1914.25	984.9468	51.45341
2010	Jul	7768	1942.00	959.9205	49.42948
2010	Aug	7718	1929.50	954.2175	49.45413
2010	Sep	7702	1925.50	972.0838	50.48475
2010	Oct	7470	1867.50	917.5832	49.13431
2010	Nov	7262	1815.50	879.6384	48.45158
2010	Dec	6878	1719.50	826.7162	48.07887
2011	Jan	6964	1741.00	800.9024	46.00244
2011	Feb	7141	1785.25	833.7063	46.69970
2011	Mar	7554	1888.50	887.1056	46.97408
2011	Apr	7792	1948.00	914.6573	46.95366
2011	May	7990	1997.50	922.9047	46.20299
2011	Jun	7889	1972.25	930.5813	47.18374
2011	Jul	7776	1944.00	943.6444	48.54138
2011	Aug	7592	1898.00	940.5683	49.55576
2011	Sep	7420	1855.00	887.7466	47.85696
2011	Oct	7233	1808.25	890.4644	49.24454
2011	Nov	6910	1727.50	833.0564	48.22324
2011	Dec	6522	1630.50	791.3859	48.53639
2012	Jan	6803	1700.75	814.4947	47.89032
2012	Feb	7018	1754.50	823.8529	46.95657
2012	Mar	7291	1822.75	811.8002	44.53711
2012	Apr	7416	1854.00	834.8185	45.02797
2012	May	7453	1863.25	849.5153	45.59320
2012	Jun	7433	1858.25	873.0400	46.98184
2012	Jul	7431	1857.75	887.3719	47.76595
2012	Aug	7176	1794.00	892.0617	49.72473
2012	Sep	7086	1771.50	852.7878	48.13930
2012	Oct	6933	1733.25	839.9684	48.46205
2012	Nov	6801	1700.25	821.1977	48.29864
2012	Dec	6446	1611.50	759.5668	47.13415
2013	Jan	6579	1644.75	748.6046	45.51480
2013	Feb	6735	1683.75	746.3692	44.32779
2013	Mar	6973	1743.25	793.4471	45.51539
2013	Apr	7191	1797.75	836.3852	46.52400
2013	May	7086	1771.50	853.9924	48.20730
2013	Jun	7046	1761.50	875.5313	49.70373
2013	Jul	6935	1733.75	915.1089	52.78205
2013	Aug	6843	1710.75	899.3140	52.56840
2013	Sep	6611	1652.75	914.7066	55.34453
2013	Oct	6458	1614.50	897.0561	55.56247
2013	Nov	6234	1558.50	837.1063	53.71231
2013	Dec	5834	1458.50	766.7170	52.56887
2014	Jan	6129	1532.25	793.8226	51.80764
2014	Feb	6253	1563.25	790.5101	50.56837
2014	Mar	6375	1593.75	814.8412	51.12729
2014	Apr	6505	1626.25	827.1817	50.86436
2014	May	6475	1618.75	806.1585	49.80130
2014	Jun	6640	1660.00	851.4368	51.29138
2014	Jul	6514	1628.50	889.2669	54.60650
2014	Aug	6397	1599.25	847.1428	52.97125
2014	Sep	6159	1539.75	809.5638	52.57761
2014	Oct	6113	1528.25	772.6195	50.55583
2014	Nov	5847	1461.75	728.5403	49.84028
2014	Dec	5475	1368.75	675.7817	49.37218

#Plot of listings of sales by month and year
ggplot(data = listings_by_date)+
  geom_bar(aes(x = month, y = num_listings),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = num_listings, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Number of listings by year and month", 
       x = "Months", 
       y = "Number of listings")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of mean listings of sales by month and year
ggplot(data = listings_by_date)+
  geom_bar(aes(x = month, y = mean_listings),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = mean_listings, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Mean of listings by year and month", 
       x = "Months", 
       y = "Mean of listings")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of variation coefficient of listings by month and year
ggplot(data = listings_by_date)+
  geom_bar(aes(x = month, y = cv_listings),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = cv_listings, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Listings variability by year and month", 
       x = "Months", 
       y = "Listings variability")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Total listings by date and city for trend analysis
listings_by_datecity <- agg_datecity(data, city, date, listings)
names(listings_by_datecity)[3] <- c("num_listings")

#Plot by year and city to explore trend 
ggplot(data = listings_by_datecity)+
  geom_smooth(aes(x = date, y = num_listings, colour = city), lwd = 1)+
  geom_point(aes(x = date, y = num_listings, colour = city))+
  labs(title = "Number of listings trend by city", 
       x = "Period", 
       y = "Number of listings")+ 
  facet_wrap(~ city, scales = "free_x")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Boxplot of listings by date and city
ggplot(data = data)+
  geom_boxplot(aes(x = month, y = listings, fill = city))+
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Number of listings distribution by city and month", 
       x = "Months", 
       y = "Number of listings")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of listings by month, year and city
ggplot(data = data)+
  geom_bar(aes(x = month, y = listings, fill = city),
           stat = "identity",
           position = "stack",
           col = "black")+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Number of listings by year, month and city", 
       x = "Months", 
       y = "Number of listings")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

Variable “months_inventory”

Bivariate analysis by city: The analysis shows that Tyler has the highest total and average number of months to sell all listings, while Wichita Falls has the lowest indicators. Variability, on the other hand, is greater for Bryan-College Station, which aligns with the higher variability in the number of listings observed earlier. The first result is linked to the fact that Tyler is also the city with the highest number and value of sales, so in absolute terms it may take longer to process all listings. To assess efficiency, it would be necessary to evaluate other indicators created that relate the different measures present in the dataset.

Bivariate analysis by year and month: The analysis shows a pattern of the number of months to sell listings that is partially similar, in terms of average and total values, to that of the number and value of sales, with a peak mid-year for some years, but much less pronounced. The variability pattern appears less marked and more dependent on individual years (for example, in 2013 there is a noticeable increasing variability over the months).

Bivariate analysis by date: The temporal analysis reveals seasonality, although less pronounced compared to the number and value of sales, and a decreasing trend over time, especially for those cities (Tyler and Bryan-College Station) with a positive trend in the number and value of sales and a negative trend in the number of listings. The other cities also show a generally decreasing trend, but less markedly. This suggests a direct, probably nearly linear, proportional relationship between the number of listings and the number of months to sell listings, and an inverse proportionality of this latter indicator with the number and value of sales, following a relationship that appears to involve nonlinear and more complex dynamics, confirming what was already suggested by the city-level analysis.

#Total, mean and standard deviation of months inventory by city
minv_by_city <- agg_city(data, city, months_inventory)
names(minv_by_city)[2:5] <- c("num_minv", "mean_minv", "sd_minv", "cv_minv")

kable(minv_by_city, caption = "Statistical index for months inventory aggregated by city", row.names = FALSE)

Statistical index for months inventory aggregated by city
city	num_minv	mean_minv	sd_minv	cv_minv
Beaumont	598.2	9.970000	1.6495814	16.545450
Bryan-College Station	459.5	7.658333	2.2472048	29.343262
Tyler	679.5	11.325000	1.8864032	16.656982
Wichita Falls	469.0	7.816667	0.7809526	9.990865

#Plot of total months inventory by city
ggplot(data = minv_by_city)+
  geom_bar(aes(x = city, y = num_minv),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Months inventory by city", 
       x = "City", 
       y = "Months inventory")+ 
  theme_classic()

#Plot of mean months inventory by city
ggplot(data = minv_by_city)+
  geom_bar(aes(x = city, y = mean_minv),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Mean of months inventory by city", 
       x = "City", 
       y = "Mean of months inventory")+  
  theme_classic()

#Plot of variation coefficient of months inventory by city
ggplot(data = minv_by_city)+
  geom_bar(aes(x = city, y = cv_minv),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Months inventory variability by city", 
       x = "City", 
       y = "Months inventory variability")+ 
  theme_classic()

#Total number, mean and standard deviation of months inventory by month and year
minv_by_date <- agg_date(data, year, month, months_inventory)
names(minv_by_date)[3:6] <- c("num_minv", "mean_minv", "sd_minv", "cv_minv")

kable(minv_by_date, caption = "Statistical index for months inventory aggregated by year and city", row.names = FALSE)

Statistical index for months inventory aggregated by year and city
year	month	num_minv	mean_minv	sd_minv	cv_minv
2010	Jan	35.0	8.750	2.042058	23.33780
2010	Feb	36.6	9.150	1.957039	21.38840
2010	Mar	37.9	9.475	1.774589	18.72917
2010	Apr	40.1	10.025	2.129750	21.24439
2010	May	38.8	9.700	2.393045	24.67057
2010	Jun	40.2	10.050	2.452889	24.40686
2010	Jul	42.2	10.550	2.379776	22.55712
2010	Aug	42.5	10.625	2.304163	21.68624
2010	Sep	42.6	10.650	2.580052	24.22584
2010	Oct	42.0	10.500	2.471167	23.53492
2010	Nov	41.6	10.400	2.405549	23.13028
2010	Dec	39.2	9.800	2.287648	23.34335
2011	Jan	39.8	9.950	2.112660	21.23277
2011	Feb	40.8	10.200	2.210581	21.67236
2011	Mar	43.6	10.900	2.356551	21.61974
2011	Apr	46.1	11.525	2.372587	20.58643
2011	May	48.3	12.075	2.270646	18.80452
2011	Jun	47.7	11.925	2.211146	18.54210
2011	Jul	46.3	11.575	2.364142	20.42455
2011	Aug	44.9	11.225	2.351418	20.94804
2011	Sep	44.1	11.025	2.093442	18.98814
2011	Oct	42.6	10.650	2.080865	19.53864
2011	Nov	40.7	10.175	1.944865	19.11415
2011	Dec	38.5	9.625	1.882153	19.55484
2012	Jan	39.8	9.950	1.887679	18.97165
2012	Feb	40.6	10.150	1.763519	17.37457
2012	Mar	41.9	10.475	1.652019	15.77106
2012	Apr	42.5	10.625	1.682013	15.83071
2012	May	42.1	10.525	1.543535	14.66541
2012	Jun	41.9	10.475	1.641899	15.67446
2012	Jul	41.4	10.350	1.682260	16.25372
2012	Aug	39.2	9.800	1.794436	18.31057
2012	Sep	38.5	9.625	1.652019	17.16383
2012	Oct	36.9	9.225	1.594522	17.28479
2012	Nov	35.9	8.975	1.631717	18.18069
2012	Dec	33.5	8.375	1.408013	16.81209
2013	Jan	34.2	8.550	1.173314	13.72298
2013	Feb	34.8	8.700	1.101514	12.66108
2013	Mar	35.6	8.900	1.151810	12.94169
2013	Apr	36.2	9.050	1.382027	15.27102
2013	May	34.8	8.700	1.534058	17.63285
2013	Jun	34.4	8.600	1.741647	20.25171
2013	Jul	32.9	8.225	2.041854	24.82497
2013	Aug	32.0	8.000	1.954482	24.43103
2013	Sep	30.6	7.650	2.199242	28.74827
2013	Oct	30.0	7.500	2.252406	30.03208
2013	Nov	28.9	7.225	2.012254	27.85127
2013	Dec	26.8	6.700	1.867262	27.86958
2014	Jan	28.0	7.000	1.790717	25.58167
2014	Feb	28.4	7.100	1.762574	24.82498
2014	Mar	29.1	7.275	1.802544	24.77724
2014	Apr	29.5	7.375	1.858987	25.20661
2014	May	29.5	7.375	1.851801	25.10917
2014	Jun	29.9	7.475	2.069420	27.68455
2014	Jul	29.5	7.375	2.280899	30.92744
2014	Aug	29.2	7.300	2.264214	31.01664
2014	Sep	27.9	6.975	2.066196	29.62288
2014	Oct	27.3	6.825	1.905037	27.91263
2014	Nov	26.1	6.525	1.824600	27.96322
2014	Dec	24.3	6.075	1.783956	29.36554

#Plot of months inventory by month and year
ggplot(data = minv_by_date)+
  geom_bar(aes(x = month, y = num_minv),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = num_minv, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months inventory by year and month", 
       x = "Months", 
       y = "Months inventory")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of mean months inventory by month and year
ggplot(data = minv_by_date)+
  geom_bar(aes(x = month, y = mean_minv),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = mean_minv, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Mean of months inventory by year and month", 
       x = "Months", 
       y = "Mean of months inventory")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of variation coefficient of months inventory by month and year
ggplot(data = minv_by_date)+
  geom_bar(aes(x = month, y = cv_minv),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = cv_minv, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months inventory variabiliy by year and month", 
       x = "Months",  
       y = "Months inventory variability")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Total months inventory by date and city for trend analysis
minv_by_datecity <- agg_datecity(data, city, date, months_inventory)
names(minv_by_datecity)[3] <- c("num_minv")

#Plot by year and city to explore trend 
ggplot(data = minv_by_datecity)+
  geom_smooth(aes(x = date, y = num_minv, colour = city), lwd = 1)+
  geom_point(aes(x = date, y = num_minv, colour = city))+
  labs(title = "Months inventory trend by city", 
       x = "Period", 
       y = "Months inventory")+ 
  facet_wrap(~ city, scales = "free_x")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Boxplot of months inventory by date and city
ggplot(data = data)+
  geom_boxplot(aes(x = month, y = months_inventory, fill = city))+
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory distribution by city and month", 
       x = "Months", 
       y = "Months inventory")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of months inventory by month, year and city
ggplot(data = data)+
  geom_bar(aes(x = month, y = months_inventory, fill = city),
           stat = "identity",
           position = "stack",
           col = "black")+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months inventory by year, month and city", 
       x = "Months", 
       y = "Months inventory")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

New created variables

Bivariate analysis by city:

Average sales value: Bryan-College Station and Tyler show higher average values, indicating markets with generally more expensive or higher-value properties. Wichita Falls shows the lowest average value.
Listing efficiency index: Bryan-College Station stands out for the highest efficiency, followed by Wichita Falls and Beaumont, while Tyler exhibits the lowest efficiency, suggesting more difficulty in converting listings into sales.
Variation in listing efficiency: Wichita Falls shows the largest positive variation, indicating significant improvements over time; Tyler shows a negative variation, signaling a decline in efficiency.

Bivariate analysis by year and month:

Average sales value: Moderately seasonal patterns are evident with peaks in spring/summer months and a slight overall upward trend, signaling a market with seasonality and positive but not too pronounced trends.
Listing efficiency index: Shows clear seasonality with recurring peaks in the mid-year months and a year-on-year increasing trend indicating progressive improvements in efficiency.
Efficiency variability: Initially high in early years with peaks in central seasons, variability lessens over subsequent years, indicating more stable markets or those less subject to strong fluctuations.

Bivariate analysis by date:

Efficiency indices show a general positive trend from 2010 to 2014 in all cities, with Bryan-College Station exhibiting strong growth and Wichita Falls remaining more stable.
Differences among cities indicate local dynamics where some areas are more resilient or improving relative to others.
Variability suggests markets tend to stabilize but at different rates and levels depending on the territorial context.

There is a substantial correlation between the number and value of sales and efficiency indicators, highlighting that more active and higher-value markets tend to be more efficient or improve over time. Seasonality and annual dynamics suggest opportunities for strategic interventions targeted at specific periods.

#Function to aggregate new created variables by city
agg_new_city <- function(data, city, kpi1, kpi2, kpi3, kpi4) {
  datanew_by_city <- data %>% 
    group_by({{city}}) %>%
    summarize(mean_price = (sum({{kpi2}})/sum({{kpi1}})),
            eff_list_ind = (sum({{kpi1}})/sum({{kpi3}}))*100,
            mean_eff_listings = (sum({{kpi3}})/sum({{kpi4}})),
             .groups = "drop")%>%
            mutate(var_eff_list = ((mean({{kpi1}}) - mean_eff_listings)/(mean_eff_listings))*100)
  
  return(datanew_by_city)
}

agg_new_var_city <- agg_new_city(data, city, sales, volume, listings, months_inventory)

#Plot of statistical index table
agg_new_var_city <- agg_new_var_city[order(agg_new_var_city$eff_list_ind, decreasing = TRUE), ]

kable(agg_new_var_city, caption = "Statistical index for new variables aggregated by city and ordinated by efficiency listings index", row.names = FALSE)

Statistical index for new variables aggregated by city and ordinated by efficiency listings index
city	mean_price	eff_list_ind	mean_eff_listings	var_eff_list
Bryan-College Station	0.1854261	14.125366	190.3983	0.9944459
Wichita Falls	0.1200187	12.760421	116.3646	65.2492747
Beaumont	0.1473171	10.562828	168.4370	14.1623825
Tyler	0.1696659	9.285554	256.5166	-25.0373272

#Plot of mean price by city
ggplot(data = agg_new_var_city)+
  geom_bar(aes(x = city, y = mean_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Mean price (mln $) by city", 
       x = "City", 
       y = "Mean price (mln $)")+ 
  theme_classic()

#Plot of efficiency listings index by city
ggplot(data = agg_new_var_city)+
  geom_bar(aes(x = city, y = eff_list_ind),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Efficiency listings index by city", 
       x = "City", 
       y = "Efficiency listings index")+ 
  theme_classic()

#Plot of variation efficiency listings index by city
ggplot(data = agg_new_var_city)+
  geom_bar(aes(x = city, y = var_eff_list),
           stat = "identity",
           col = "black",
           fill = "blue")+
  labs(title = "Efficency listings index variation by city", 
       x = "City", 
       y = "Efficiency listings index variation (%)")+ 
  theme_classic()

#Function to aggregate new created variables by year and month
agg_new_ym <- function(data, year, month, kpi1, kpi2, kpi3, kpi4) {
  datanew_by_ym <- data %>% 
    group_by({{year}}, {{month}}) %>%
    summarize(mean_price = (sum({{kpi2}})/sum({{kpi1}})),
            eff_list_ind = (sum({{kpi1}})/sum({{kpi3}}))*100,
            mean_eff_listings = (sum({{kpi3}})/sum({{kpi4}})),
             .groups = "drop")%>%
            mutate(var_eff_list = ((mean({{kpi1}}) - mean_eff_listings)/(mean_eff_listings))*100)
  
  return(datanew_by_ym)
}

agg_new_var_ym <- agg_new_ym(data, year, month, sales, volume, listings, months_inventory)

#Plot of statistical index table
agg_new_var_ym <- agg_new_var_ym[order(agg_new_var_ym$eff_list_ind, decreasing = TRUE), ]

kable(agg_new_var_ym, caption = "Statistical index for new variables aggregated by year and month and ordinated by efficiency listings index", row.names = FALSE)

Statistical index for new variables aggregated by year and month and ordinated by efficiency listings index
year	month	mean_price	eff_list_ind	mean_eff_listings	var_eff_list
2014	Jun	0.1828683	17.725904	222.0736	-13.4108308
2014	Jul	0.1794058	17.439361	220.8136	-12.9167306
2014	May	0.1741943	17.405405	219.4915	-12.3922136
2014	Aug	0.1773975	16.320150	219.0753	-12.2257829
2013	Jul	0.1673035	16.250901	210.7903	-8.7758351
2013	Aug	0.1676938	16.177115	213.8438	-10.0784256
2014	Oct	0.1736611	15.687878	223.9194	-14.1246115
2014	Dec	0.1768980	15.397260	225.3086	-14.6541096
2014	Apr	0.1667114	15.019216	220.5085	-12.7962465
2013	May	0.1745260	14.916737	203.6207	-5.5637878
2013	Jun	0.1763780	14.831110	204.8256	-6.1193112
2014	Sep	0.1763226	14.596525	220.7527	-12.8927180
2012	Aug	0.1617096	13.294314	183.0612	5.0422705
2014	Mar	0.1660178	13.192157	219.0722	-12.2245098
2014	Nov	0.1657047	12.741577	224.0230	-14.1643150
2013	Dec	0.1636630	12.564278	217.6866	-11.6658096
2012	Jul	0.1642920	12.488225	179.4928	7.1306015
2010	May	0.1543383	12.458183	192.6031	-0.1616932
2013	Sep	0.1680356	12.312812	216.0458	-10.9949327
2013	Apr	0.1658554	12.209707	198.6464	-3.1990219
2012	Jun	0.1662973	12.081259	177.3986	8.3952756
2010	Apr	0.1475939	12.036794	189.7756	1.3258322
2012	May	0.1672630	11.834161	177.0309	8.6204101
2013	Mar	0.1541794	11.673598	195.8708	-1.8272862
2013	Oct	0.1688198	11.427687	215.2667	-10.6728089
2010	Jun	0.1595739	11.309912	190.4726	0.9550085
2011	Jun	0.1570463	11.218152	165.3878	16.2671124
2014	Feb	0.1647032	11.098673	220.1761	-12.6645877
2013	Nov	0.1597029	11.068335	215.7093	-10.8561250
2010	Mar	0.1481801	10.877395	183.1398	4.9971786
2012	Oct	0.1549919	10.702438	187.8862	2.3447642
2011	Jul	0.1611515	10.442387	167.9482	14.4946523
2011	Aug	0.1537099	10.353003	169.0869	13.7236016
2014	Jan	0.1500415	10.230054	218.8929	-12.1526078
2012	Apr	0.1501044	10.072816	174.4941	10.1995123
2012	Sep	0.1594894	9.977420	184.0519	4.4768440
2012	Dec	0.1514557	9.975178	192.4179	-0.0656092
2011	May	0.1564564	9.762203	165.4244	16.2413955
2012	Mar	0.1534866	9.751749	174.0095	10.5063892
2010	Aug	0.1549539	9.562063	181.6000	5.8874816
2012	Nov	0.1577646	9.557418	189.4429	1.5037617
2011	Apr	0.1540042	9.188912	169.0239	13.7659886
2010	Jul	0.1499494	9.165808	184.0758	4.4632896
2011	Mar	0.1520449	8.842997	173.2569	10.9864531
2013	Feb	0.1516796	8.804751	193.5345	-0.6421678
2010	Dec	0.1524338	8.781623	175.4592	9.5933896
2013	Jan	0.1592691	8.755130	192.3684	-0.0398997
2011	Dec	0.1475168	8.662987	169.4026	13.5116401
2011	Sep	0.1511971	8.477089	168.2540	14.2865566
2011	Oct	0.1486094	8.212360	169.7887	13.2534909
2012	Feb	0.1437352	8.178968	172.8571	11.2431129
2011	Nov	0.1571730	7.945007	169.7789	13.2600699
2010	Sep	0.1485385	7.764217	180.7981	6.3571150
2010	Oct	0.1542726	7.563588	177.8571	8.1157965
2012	Jan	0.1398617	7.334999	170.9296	12.4975501
2010	Feb	0.1578994	7.265404	183.1421	4.9958974
2010	Nov	0.1464473	6.926467	174.5673	10.1533095
2011	Feb	0.1479296	6.567708	175.0245	9.8655650
2010	Jan	0.1514276	6.510980	184.7429	4.0861171
2011	Jan	0.1427247	6.102815	174.9749	9.8967308

#Plot of mean price by month and year
ggplot(data = agg_new_var_ym)+
  geom_bar(aes(x = month, y = mean_price),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = mean_price, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Mean price by year and month", 
       x = "Months", 
       y = "Mean price (mln $)")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of efficiency listings index by month and year
ggplot(data = agg_new_var_ym)+
  geom_bar(aes(x = month, y = eff_list_ind),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = eff_list_ind, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Efficiency listings index by year and month", 
       x = "Months", 
       y = "Efficiency listings index")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

##Plot of variation efficiency listings index by month and year
ggplot(data = agg_new_var_ym)+
  geom_bar(aes(x = month, y = var_eff_list),
           stat = "identity",
           col = "black",
           fill = "blue")+
  geom_line(aes(x = month, y = var_eff_list, group = 1), color = "red", lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Efficiency listings index variation by year and month", 
       x = "Months", 
       y = "Efficiency listings index variation")+  
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Efficiency listings index by date and city for trend analysis
agg_new_efl <- function(data, city, date, kpi1, kpi3) {
  datanew_by_efl <- data %>% 
    group_by({{city}}, {{date}}) %>%
    summarize(eff_list_ind = (sum({{kpi1}})/sum({{kpi3}}))*100, .groups = "drop")%>%
  
  return(datanew_by_efl)
}

efl_by_datecity <- agg_new_efl(data, city, date, sales, listings)

#Plot by year and city to explore trend 
ggplot(data = efl_by_datecity)+
  geom_smooth(aes(x = date, y = eff_list_ind, colour = city), lwd = 1)+
  geom_point(aes(x = date, y = eff_list_ind, colour = city))+
  labs(title = "Efficiency listings index trend by city", 
       x = "Period", 
       y = "Efficiency listings index")+ 
  facet_wrap(~ city, scales = "free_x")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Boxplot of efficiency list index by date and city
ggplot(data = data)+
  geom_boxplot(aes(x = month, y = efficiency_list_index, fill = city))+
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Efficiency listings index distribution by city and month", 
       x = "Months",
       y = "Efficiency listings index")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Plot of efficiency listings index by month, year and city
ggplot(data = data)+
  geom_bar(aes(x = month, y = efficiency_list_index, fill = city),
           stat = "identity",
           position = "stack",
           col = "black")+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Efficiency listings index by year, month and city", 
       x = "Months", 
       y = "Efficiency listings index")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

First breakpoint: results of analysis (points 1 and 2)

Historical Trends in Real Estate Sales:

Real estate sales in Texas show a well-defined seasonality, with a significant increase in both the number and average value of transactions during the spring and early summer months. For example, May, June, and July stand out for having sales volumes above the average, likely related to favorable weather conditions and seasonal economic factors that encourage property purchases. This seasonality is also reflected in the monthly variations of sales indicators and in the higher efficiency observed during these periods.
Looking at the cities, Bryan-College Station and Tyler confirm themselves as more dynamic markets: Bryan-College Station shows a particularly high average sales value, suggesting the presence of higher-end properties or stronger demand, while Tyler records a very high total number of sales with contained variability, indicating market stability. In contrast, Wichita Falls shows more moderate dynamics, with a lower average sales value and higher variability, possibly reflecting less stable demand or a less homogeneous supply.
An interesting finding is the progress in sales efficiency recorded in almost all cities: Bryan-College Station in particular shows a clear improvement over time in the ability to convert listings into actual sales, indicating a market evolving positively and likely more effective marketing and sales strategies. This trend is an encouraging sign, suggesting, for example, that real estate operators in this area are refining their sales techniques and market responsiveness.

Effectiveness of Marketing Strategies for Listings:

The listing efficiency index, calculated as the percentage of listings sold relative to the total, varies significantly among cities and over different periods analyzed but generally shows a positive trend over time. For instance, Wichita Falls exhibits a marked increase in this index in recent years, suggesting that advertising campaigns or property positioning strategies are becoming more effective, possibly due to better targeting of buyers or additional services offered.
Monthly and annual variations of the index also reveal a seasonal pattern: spring and summer periods are when listing conversion reaches its highest levels. This indicates strategic time windows in which to focus marketing efforts, optimizing resources and maximizing results.
The differences between cities are significant: while Bryan-College Station maintains a consistently high index, signaling an efficient and competitive market, Tyler suffers from lower efficiency values, highlighting a possible oversupply or less responsive demand. In these cases, targeted interventions such as more aggressive promotional campaigns, improved listing quality, or incentives for real estate agents could help increase sales effectiveness.

Initial Overall Assessment:

The overall picture shows a Texas real estate market characterized by clear seasonality and sharp territorial differences in terms of both volume and sales efficiency. The positive trends in listing efficiency and average sales value in some cities indicate improvements in market dynamics and commercial strategies.
However, disparities among different areas suggest that not all cities are fully exploiting their market potential, making further in-depth analysis desirable to support personalized interventions. Overall, this analysis represents a solid starting point to optimize marketing actions and enhance commercial penetration in the territory, aiming to increase profitability and competitiveness for real estate operators in Texas.

Bivariate analysis to explore relationship among variables

Variables “volume” vs “sales”

The analysis of the relationship between the number of sales and the value of sales variables shows a strong linear correlation with a marked direct proportionality. This was suggested by the similar trends of both variables by city and by year and month, indicating that this relationship holds at a global level and not only locally. The analysis of the shape and slope of the confidence ellipses, as well as the visualization of the linear regression lines of the data, further confirm the linear relationship between these two variables.

#The analysis of trends suggest some kind of correlation between number of sales and volume. 
#For this reason, we can plot this two variables and explore this correlation
ggplot(data = data)+
  geom_point(aes(x = sales, y = volume, colour = city))+
  geom_smooth(aes(x = sales, y = volume), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Volume of sales vs Number of sales", 
        x = "Number of sales", 
        y = "Volume of sales (mln $)")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Function for correlation index by city and year between number and volume of sales
corr_fun <- function(data, year, city, kpi1, kpi2){
  corr_by_yc <- data %>%
  group_by({{year}}, {{city}}) %>%
  summarize(corr_index = cor({{kpi1}}, {{kpi2}}, use = "complete.obs"), .groups = "drop")
  
  return(corr_by_yc)
}

corr_by_yearcity <- corr_fun(data, year, city, sales, volume)

#Plot of corrplot for this calculated index
corr_by_yearcity %>% 
  ungroup()

## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                   0.937
##  2  2010 Bryan-College Station      0.995
##  3  2010 Tyler                      0.969
##  4  2010 Wichita Falls              0.953
##  5  2011 Beaumont                   0.912
##  6  2011 Bryan-College Station      0.990
##  7  2011 Tyler                      0.988
##  8  2011 Wichita Falls              0.943
##  9  2012 Beaumont                   0.960
## 10  2012 Bryan-College Station      0.993
## 11  2012 Tyler                      0.973
## 12  2012 Wichita Falls              0.820
## 13  2013 Beaumont                   0.987
## 14  2013 Bryan-College Station      0.996
## 15  2013 Tyler                      0.973
## 16  2013 Wichita Falls              0.930
## 17  2014 Beaumont                   0.979
## 18  2014 Bryan-College Station      0.993
## 19  2014 Tyler                      0.983
## 20  2014 Wichita Falls              0.923

#Transform in wide format
corr_matrix <- corr_by_yearcity %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat <- as.matrix(corr_matrix[,-1])

#Set rownames like years, to be more clear
rownames(mat) <- corr_matrix$year

#Corrplot
corrplot(mat, method = "color", addCoef.col = "white", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Volume of sales vs Number of sales correlation by year and city",
         mar = c(0,0,1,0))

#This analysis confirm that number of sales and volume show a very important correlation,
#globally and by city and year (values from 0.91 to 1.00). The only value of correlation under 0.91
#is for year = 2012 and city = Wichita Falls (0.82)

#The high correlation suggests a strong linear relationship between sales and volume, with data 
#clustered near a straight line in bivariate space. The joint distributions are therefore expected 
#to exhibit a narrow elliptical shape. However, since the correlation reflects only the linear 
#association, further controls for outliers, non-Gaussian distributions, or possible nonlinear 
#relationships are necessary.

#Plot of joint distribution of sales and volume
ggplot(data, aes(x = sales, y = volume)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Volume of sales vs Number of sales joint distribution with confidence ellipses", 
       x = "Number of sales", 
       y = "Volume of sales (mln $)") + 
  theme_classic()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

#Plot of points around linear regression rect
ggplot(data, aes(x = sales, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Volume of sales vs Number of sales linear regression with confidence intervals", 
       x = "Number of sales", 
       y = "Volume of sales (mln $)") + 
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#All this analysis confirm strong linearity beetwen sales and volume

Variables “median_price” vs “sales”

Summary of the Analysis of the Relationship between Number of Sales and Median Sales Value:
The exploratory analysis of the relationship between the number of sales and the median sales value of real estate reveals a complex relationship that is not perfectly linear, but shows several significant trends at territorial and temporal levels.

Correlations and Trends by City and Year:
At a first level, the correlation between the two variables varies considerably across cities and over the years. In general, the correlation is positive, indicating that an increase in the number of sales tends to be associated with an increase in the median sales value, but this relationship is not strong or stable in all cases:

Tyler: shows a positive and robust trend between 2011 and 2013, although a slight decline is observed in 2014, suggesting a still dynamic market that may be slowing down.
Bryan-College Station: demonstrates a growing market from 2010 to 2013, with stabilization in 2014.
Beaumont: exhibits a correlation ranging from -0.69 to 0.75, indicating a potentially growing market but with variability over time.
Wichita Falls: shows a generally stable but weaker correlation trend, indicating a more stagnant or troubled market.

Linear Model and Residual Distribution:
Applying a linear regression model between number of sales and median value confirms the presence of a linear relationship, albeit imperfect:

The presence of heteroscedasticity and skewness in the residuals indicates that errors are not distributed uniformly, with residual variability changing depending on the prediction region.
Consequently, the model tends to overestimate or underestimate the relationship, especially for low sales values.
The presence of outliers suggests possible market stratification or atypical cases that deserve further investigation.

Nonlinear Analysis with Loess:
The nonlinear regression analysis (loess) highlights a better fit of the relationship between sales and median prices, capturing curvatures and local dynamics more accurately:

Confirms market growth for Beaumont, Tyler, and Bryan-College Station in a more nuanced and realistic manner.
Makes the stability of Wichita Falls’ market clearer.

Overall Statistical Considerations:
The analysis suggests that although the number of sales and median market value are correlated, the relationship cannot be simply modeled with a basic linear regression without accounting for variability, nonlinearity, and outliers. Seasonality, local dynamics, and specific market conditions influence the link between quantity and price.

#Correlation with sales
ggplot(data = data)+
  geom_point(aes(x = sales, y = median_price, colour = city))+
  geom_smooth(aes(x = sales, y = median_price), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Median price vs Number of sales by year", 
       x = "Number of sales", 
       y = "Median price")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between number and median price of sales
corr_by_yearcity_mp <- corr_fun(data, year, city, sales, median_price)

#Plot of corrplot for this calculated index
corr_by_yearcity_mp %>% 
  ungroup()

## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                -0.692  
##  2  2010 Bryan-College Station   -0.535  
##  3  2010 Tyler                    0.280  
##  4  2010 Wichita Falls            0.463  
##  5  2011 Beaumont                 0.486  
##  6  2011 Bryan-College Station   -0.426  
##  7  2011 Tyler                    0.627  
##  8  2011 Wichita Falls            0.542  
##  9  2012 Beaumont                 0.618  
## 10  2012 Bryan-College Station    0.173  
## 11  2012 Tyler                    0.666  
## 12  2012 Wichita Falls            0.375  
## 13  2013 Beaumont                 0.394  
## 14  2013 Bryan-College Station    0.531  
## 15  2013 Tyler                    0.708  
## 16  2013 Wichita Falls           -0.0353 
## 17  2014 Beaumont                 0.752  
## 18  2014 Bryan-College Station   -0.00894
## 19  2014 Tyler                    0.493  
## 20  2014 Wichita Falls            0.232

#Transform in wide format
corr_matrix_mp <- corr_by_yearcity_mp %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_mp <- as.matrix(corr_matrix_mp[,-1])

#Set rownames like years, to be more clear
rownames(mat_mp) <- corr_matrix_mp$year

#Corrplot
corrplot(mat_mp, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Median price vs Number of sales correlation by year and city",
         mar = c(0,0,1,0))

#This analysis shows low correlation between two variables,
#except for city of Tyler in the year from 2011 to 2013.
#Generally correlation is positive, so median price growth like number of sales, but not in a
#strong dependency and in other cases the correlation is negative or null.
#Analyzing the situation by city into years, the correlation plot shows that
#Beaumont --> the correlation goes from -0.69 to 0.75, so the market meybe is growing
#Bryan-College Station --> the market are growth from 2010 to 2013, and into 2014 is freeze
#Tyler --> shows a positive market trend, but into 2014 a small decrease
#Wichita Falls --> shows a stable trend, but generally the market seems in a difficult situation

#Plot of joint distribution of sales and median price
ggplot(data, aes(x = sales, y = median_price)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Number of sales joint distribution with confidence ellipses", 
       x = "Number of sales", 
       y = "Median price") + 
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = sales, y = median_price, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Number of sales linear regression with confidence intervals", 
       x = "Number of sales", 
       y = "Median price") + 
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(median_price ~ sales, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (median price vs number of sales))", 
       x = "Predicted values", 
       y = "Residuals") + 
  theme_classic()

#The linear relationship exists but is not perfectly described by the simple linear model due to heteroskedasticity and skewness in the residuals.
#The model may underestimate or overestimate the dependence in certain areas (especially for low values of the variable).
#The presence of outliers suggests that a more in-depth analysis of the data would be useful to identify possible subgroups or outliers.
#It may be worth trying variable transformations (e.g., logarithmic) or more flexible nonlinear models to better accommodate the shape of
#the relationship and improve the homogeneity of errors.

#Plot of loess smooth of sales and median price by city
ggplot(data, aes(x = sales, y = median_price, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Number of sales non linear regression with confidence intervals", 
       x = "Number of sales", 
       y = "Median price") + 
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Trends confirm grwoth in market for Beaoumont and Tyler and also for Bryan-College Station.
#Wichita Falls confirms that its market show a stable situation

Variables “median_price” vs “volume”

The correlation between the total sales value and the median sales value is generally positive, indicating that an increase in the overall transaction value tends to be accompanied by an increase in the median value. However, this relationship is not always strong and shows considerable variability across time and between cities.

In particular, the city of Tyler exhibits a strong positive correlation from 2011 to 2013, with a slight decline in 2014, confirming similar trends observed in the previous analysis on the number of sales. Other cities, such as Beaumont, show wide fluctuations in correlation over time, ranging from negative to highly positive values, suggesting more unstable or evolving markets. Bryan-College Station shows a growth phase until 2013 followed by stabilization in 2014, whereas Wichita Falls displays a more stable but generally weaker trend.

Linear Model and Residual Analysis
The linear regression confirms not a very high positive linear relationship between volume and median value and as in the previous case, residuals exhibit heteroscedasticity and asymmetry, indicating that the adequacy of the linear model is limited. The presence of outliers and significant dispersion in residuals suggests that nonlinear factors or specific market conditions influence this relationship.

Nonlinear Analysis with Loess
The loess regression improves the fit to data curvature and local dynamics, more accurately confirming market growth in the major cities and highlighting a more stagnant situation in Wichita Falls.

Spearman Correlation
The use of the Spearman correlation, less sensitive to outliers, reinforces the previous conclusions, showing a generally positive and monotonic relationship between volume and median value, though not perfectly linear.

Final Considerations
These results indicate that the median sales value is influenced, albeit in a complex and nonlinear way, by the total volume of real estate transactions. The presence of heteroscedasticity and outliers suggests considering more flexible models or variable transformations (for example, logarithmic) for a better market representation.

The analysis also highlights significant territorial differences that may reflect economic conditions, micro-dynamics of supply and demand, or varied sales strategies.

#Correlation with volume
ggplot(data = data)+
  geom_point(aes(x = volume, y = median_price, colour = city))+
  geom_smooth(aes(x = volume, y = median_price), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Median price vs Volume of sales by year", 
       x = "Volume of sales (mln $)", 
       y = "Median price")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between volume and median price of sales
corr_by_yearcity_mpv <- corr_fun(data, year, city, volume, median_price)

#Plot of corrplot for this calculated index
corr_by_yearcity_mpv %>% 
  ungroup()

## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                 -0.686 
##  2  2010 Bryan-College Station    -0.525 
##  3  2010 Tyler                     0.436 
##  4  2010 Wichita Falls             0.644 
##  5  2011 Beaumont                  0.637 
##  6  2011 Bryan-College Station    -0.315 
##  7  2011 Tyler                     0.682 
##  8  2011 Wichita Falls             0.721 
##  9  2012 Beaumont                  0.768 
## 10  2012 Bryan-College Station     0.234 
## 11  2012 Tyler                     0.723 
## 12  2012 Wichita Falls             0.793 
## 13  2013 Beaumont                  0.492 
## 14  2013 Bryan-College Station     0.579 
## 15  2013 Tyler                     0.798 
## 16  2013 Wichita Falls             0.297 
## 17  2014 Beaumont                  0.818 
## 18  2014 Bryan-College Station     0.0872
## 19  2014 Tyler                     0.585 
## 20  2014 Wichita Falls             0.487

#Transform in wide format
corr_matrix_mpv <- corr_by_yearcity_mpv %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_mpv <- as.matrix(corr_matrix_mpv[,-1])

#Set rownames like years, to be more clear
rownames(mat_mpv) <- corr_matrix_mpv$year

#Corrplot
corrplot(mat_mpv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Median price vs Volume of sales correlation by year and city",
         mar = c(0,0,1,0))

#This analysis shows low correlation between two variables,
#except for city of Tyler in the year from 2011 to 2013.
#Generally correlation is positive, so median price growth like number of sales, but not in a
#strong dependency and in other cases the correlation is negative or null.
#Analyzing the situation by city into years, the correlation plot shows that
#Beaumont --> the correlation goes from -0.69 to 0.82, so the market meybe is growing
#Bryan-College Station --> the market are growth from 2010 to 2013, and into 2014 is freeze
#Tyler --> shows a positive market trend, but into 2014 a small decrease
#Wichita Falls --> shows a stable trend

#Plot of joint distribution of volume and median price
ggplot(data, aes(x = volume, y = median_price)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Volume of sales joint distribution with confidence ellipses", 
       x = "Volume of sales (mln $)", 
       y = "Median price") + 
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = volume, y = median_price, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Volume of sales linear regression with confidence intervals", 
       x = "Volume of sales (mln $)",  
       y = "Median price") + 
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(median_price ~ volume, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (median price vs volume of sales)", 
       x = "Predicted values", 
       y = "Residuals") + 
  theme_classic()

#The linear relationship exists but is not perfectly described by the simple linear model due to heteroskedasticity and skewness in the residuals.
#The model may underestimate or overestimate the dependence in certain areas (especially for low values of the variable).
#The presence of outliers suggests that a more in-depth analysis of the data would be useful to identify possible subgroups or outliers.
#It may be worth trying variable transformations (e.g., logarithmic) or more flexible nonlinear models to better accommodate the shape of
#the relationship and improve the homogeneity of errors.

#Plot of loess smooth of volume and median price by city
ggplot(data, aes(x = volume, y = median_price, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Volume of sales non linear regression with confidence intervals", 
       x = "Volume of sales (mln $)", 
       y = "Median price") + 
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Trends confirm grwoth in market for Beaoumont and Tyler and also for Bryan-College Station.
#Also Wichita Falls show a low growth, despite the flat trend of median price by number of sales,
#while correlation shows a bad situation for the market
#Generally, between volume and median price there isn't a linear correlation, but more linear
#than the correlation between sales e median price

#Beacause residuals analysis shows an important dispersion of values, we could try to calculate
#correlation using Spearman index
corr_sp_fun <- function(data, year, city, kpi1, kpi2) {
  corr_sp_data <- data %>%
  group_by({{year}}, {{city}}) %>%
  summarize(corr_spearman = cor({{kpi1}}, {{kpi2}}, method = "spearman", use = "complete.obs"), .groups = "drop")
  
  return(corr_sp_data)
}

corr_by_yearcity_mpv_sp <- corr_sp_fun(data, year, city, volume, median_price)
  
corr_by_yearcity_mpv_sp %>%
  ungroup()

## # A tibble: 20 × 3
##     year city                  corr_spearman
##    <int> <chr>                         <dbl>
##  1  2010 Beaumont                    -0.503 
##  2  2010 Bryan-College Station       -0.545 
##  3  2010 Tyler                        0.406 
##  4  2010 Wichita Falls                0.643 
##  5  2011 Beaumont                     0.748 
##  6  2011 Bryan-College Station       -0.154 
##  7  2011 Tyler                        0.776 
##  8  2011 Wichita Falls                0.827 
##  9  2012 Beaumont                     0.510 
## 10  2012 Bryan-College Station        0.497 
## 11  2012 Tyler                        0.650 
## 12  2012 Wichita Falls                0.718 
## 13  2013 Beaumont                     0.608 
## 14  2013 Bryan-College Station        0.586 
## 15  2013 Tyler                        0.802 
## 16  2013 Wichita Falls                0.427 
## 17  2014 Beaumont                     0.760 
## 18  2014 Bryan-College Station       -0.0140
## 19  2014 Tyler                        0.503 
## 20  2014 Wichita Falls                0.315

corr_matrix_spearman_mpv <- corr_by_yearcity_mpv_sp %>%
  pivot_wider(names_from = city, values_from = corr_spearman)

mat_spearman <- as.matrix(corr_matrix_spearman_mpv[,-1])
rownames(mat_spearman) <- corr_matrix_spearman_mpv$year

corrplot(mat_spearman, method = "color", addCoef.col = "black", tl.col = "black",
         tl.cex = 0.7, tl.srt = 30, number.cex = 0.9, cl.pos = "b",
         title = "Median price vs Volume of sales Spearman corr by year and city",
         mar = c(0,0,1,0))

#The bivariate analysis between median sales price and total sales value highlights a positive 
#relationship with a less pronounced nonlinearity than that observed between median price and number 
#of sales. This result is consistent with the fact that, although the median price is less 
#influenced by outliers than the mean, it still tends to reflect the central market trend, which 
#increases along with the total sales value. Analysis using Spearman's correlation, which is less 
#sensitive to outliers, confirms this dynamic, suggesting a monotonic but not perfectly linear 
#relationship between the two variables. In conclusion, these results invite the exploration of 
#models or transformations that can better capture the complexity of the relationship between 
#sales volume and median price.

Variables “listings” vs “sales”

Correlation and General Trend
The analysis highlights a generally positive correlation between the number of active property listings and the number of sales, suggesting that greater supply tends to be accompanied by a higher number of sales. However, the strength of this correlation shows a decreasing trend over time across all cities considered, indicating a likely change in market dynamics.

City-Level Considerations
Beaumont and Bryan-College Station show a decrease in correlation over the analyzed period, with values ranging from moderate to low, suggesting that sales no longer proportionally follow the trend in listings over the years. Tyler shows a slight decrease in correlation over the years but maintains a more stable relationship compared to other cities. Wichita Falls presents a rather stable trend, with a more consistent correlation over time.

Linear Model and Residual Analysis
The application of a linear regression model confirms a positive relationship between the number of listings and sales, but the residuals show heteroscedasticity and skewness, meaning non-constant variance of errors and asymmetry. Furthermore, the residuals display strong unexplained variability and cyclical temporal patterns, underscoring the influence of additional factors or nonlinear effects. The presence of outliers indicates markets or periods with particular behavior.

Nonlinear Regression (Loess)
Using loess regression provides a more flexible and realistic model, capturing local variations over time and possible nonlinearities in the relationship between listings and sales.

Spearman Correlation
The analysis with Spearman’s correlation, which is robust to outliers, confirms the observed trends: a positive but complex and evolving relationship, with a decreasing correlation over time.

#Correlation with sales (verify if listings influence sales)
ggplot(data = data)+
  geom_point(aes(x = listings, y = sales, colour = city))+
  geom_smooth(aes(x = listings, y = sales), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Number of sales vs Listings by year", 
       x = "Number of listings", 
       y = "Number of sales")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between number and listings
corr_by_yearcity_ls <- corr_fun(data, year, city, listings, sales)

#Plot of corrplot for this calculated index
corr_by_yearcity_ls %>% 
  ungroup()

## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                   0.568
##  2  2010 Bryan-College Station      0.618
##  3  2010 Tyler                      0.390
##  4  2010 Wichita Falls             -0.118
##  5  2011 Beaumont                   0.509
##  6  2011 Bryan-College Station      0.722
##  7  2011 Tyler                      0.866
##  8  2011 Wichita Falls              0.760
##  9  2012 Beaumont                   0.446
## 10  2012 Bryan-College Station      0.414
## 11  2012 Tyler                      0.749
## 12  2012 Wichita Falls              0.274
## 13  2013 Beaumont                   0.443
## 14  2013 Bryan-College Station      0.191
## 15  2013 Tyler                      0.852
## 16  2013 Wichita Falls              0.415
## 17  2014 Beaumont                   0.334
## 18  2014 Bryan-College Station      0.208
## 19  2014 Tyler                      0.382
## 20  2014 Wichita Falls              0.657

#Transform in wide format
corr_matrix_ls <- corr_by_yearcity_ls %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_ls <- as.matrix(corr_matrix_ls[,-1])

#Set rownames like years, to be more clear
rownames(mat_ls) <- corr_matrix_ls$year

#Corrplot
corrplot(mat_ls, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Number of sales vs Listings correlation by year and city",
         mar = c(0,0,1,0))

#This analysis shows low correlation between two variables,
#except for city of Tyler in the year from 2011 to 2013.
#Generally correlation is positive, so number of sales growth like listings, but not at all in a
#strong dependency.
#Analyzing the situation by city into years, the correlation plot shows that
#Beaumont --> the correlation goes from 0.57 to 0.33, so the number of sales are decreasing according to number of listings
#Bryan-College Station --> the correlation goes from 0.62 to 0.21, so the number of sales are decreasing according to number of listings
#Tyler --> shows a small decrease into the years
#Wichita Falls --> shows a stable trend

#Plot of joint distribution of listings and sales
ggplot(data, aes(x = listings, y = sales)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Number of sales vs Listings joint distribution with confidence intervals", 
       x = "Number of listings", 
       y = "Number of sales") + 
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = listings, y = sales, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Number of sales vs Listings linear regression with confidence intervals", 
       x = "Listings", 
       y = "Number of sales") + 
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(sales ~ listings, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (number of sales vs listings)", 
       x = "Predicted values", 
       y = "Residuals") + 
  theme_classic()

#The linear relationship exists but is not perfectly described by the simple linear model due to heteroskedasticity and skewness in the residuals.
#The model may underestimate or overestimate the dependence in certain areas (especially for low values of the variable).
#The presence of outliers suggests that a more in-depth analysis of the data would be useful to identify possible subgroups or outliers.
#It may be worth trying variable transformations (e.g., logarithmic) or more flexible nonlinear models to better accommodate the shape of
#the relationship and improve the homogeneity of errors.

#Plot of loess smooth of sales and median price by city
ggplot(data, aes(x = listings, y = sales, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Number of sales vs Listings non linear regression with confidence intervals", 
       x = "Number of listings", 
       y = "Number of sales") + 
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Analysis of the relationship between the number of active listings and the number of sales 
#highlights a generalized positive correlation, but one that decreases over time for all cities 
#considered. This decrease tends to coincide with an increase in the median price and the total 
#value of sales, suggesting a market in which the decrease in active supply does not translate 
#into a proportional reduction in sales, likely due to demand and price dynamics. Examination of 
#the residuals from the linear regression model reveals strong unexplained variability and wave-like 
#temporal patterns, indicating that the relationship between listings and sales is influenced by 
#additional factors or complex nonlinear processes. The large confidence intervals, which indicate 
#considerable data dispersion, require caution in interpretation, except for one city where the 
#relationship appears more defined and stable. Overall, the results underscore the need for more 
#flexible models or the inclusion of additional explanatory variables to fully describe market 
#dynamics.

#Beacause residuals analysis shows an important dispersion of values, we could try to calculate
#correlation using Spearman index
corr_by_yearcity_ls_sp <- corr_sp_fun(data, year, city, listings, sales)

corr_by_yearcity_ls_sp %>%
  ungroup()

## # A tibble: 20 × 3
##     year city                  corr_spearman
##    <int> <chr>                         <dbl>
##  1  2010 Beaumont                     0.378 
##  2  2010 Bryan-College Station        0.797 
##  3  2010 Tyler                        0.399 
##  4  2010 Wichita Falls               -0.0210
##  5  2011 Beaumont                     0.581 
##  6  2011 Bryan-College Station        0.706 
##  7  2011 Tyler                        0.881 
##  8  2011 Wichita Falls                0.778 
##  9  2012 Beaumont                     0.623 
## 10  2012 Bryan-College Station        0.404 
## 11  2012 Tyler                        0.914 
## 12  2012 Wichita Falls                0.480 
## 13  2013 Beaumont                     0.497 
## 14  2013 Bryan-College Station        0.0385
## 15  2013 Tyler                        0.930 
## 16  2013 Wichita Falls                0.343 
## 17  2014 Beaumont                     0.371 
## 18  2014 Bryan-College Station        0.259 
## 19  2014 Tyler                        0.503 
## 20  2014 Wichita Falls                0.685

corr_matrix_spearman_ls <- corr_by_yearcity_ls_sp %>%
  pivot_wider(names_from = city, values_from = corr_spearman)

mat_spearman <- as.matrix(corr_matrix_spearman_ls[,-1])
rownames(mat_spearman) <- corr_matrix_spearman_ls$year

corrplot(mat_spearman, method = "color", addCoef.col = "black", tl.col = "black",
         tl.cex = 0.7, tl.srt = 30, number.cex = 0.9, cl.pos = "b",
         title = "Number of sales vs Listings Spearman corr by year and city",
         mar = c(0,0,1,0))

#Also the analysis with correlation Spearman index confirms the same considerations already done

Variables “listings” vs “volume”

General Correlation and Residual Patterns The relationship between the number of listings and the total sales value (volume) appears generally positive, although with a low to moderate correlation that varies over time and among cities. The residual analysis from the linear regression model reveals four distinct clusters corresponding to the analyzed cities, suggesting that the relationship between the variables differs by geographic area. This indicates the need for separate models or models with interaction terms to better understand territory-specific dynamics.

Linear Interaction Model The linear model with interaction between number of listings and city explains about 54% of the overall variability (R² ≈ 0.54). However, the statistical significance of the interaction terms is generally low, indicating that, statistically, the slope of the relationship between volume and listings does not differ significantly among cities. This suggests that, under the simple linear model, city does not significantly modify the effect of listings on sales value.

Models with Nonlinearity
The polynomial (quadratic) model with interactions did not show significant improvement or relevant terms. Generalized Additive Models (GAMs), instead, offer a more flexible and realistic description:

The relationship between number of listings and sales volume is nonlinear and significant for some cities (Bryan-College Station and Tyler), while for others (Wichita Falls and Beaumont) the relationship is rather linear or less influential.
The GAM model highlights that the city effect on the relationship between listings and sales volume is significant and depends on the shape of the relationship, i.e., it modifies linearity.

Visualizations and Local Trends
Loess regressions for each city clearly show different territorial market dynamics, with varying curves relating number of listings and sales volume. In particular, some cities exhibit strong deviations from linearity and significant local variation.

Final Considerations
This analysis shows that the relationship between the number of active listings and total sales value varies considerably among cities and cannot be adequately captured by a simple linear model. Flexible models such as GAMs are more effective in describing local complexities and diversity. For future in-depth analyses, it is advisable to employ approaches that allow modeling nonlinearities and spatiotemporal interactions to improve interpretation and forecasting of the real estate market.

#Correlation with volume
ggplot(data = data)+
  geom_point(aes(x = listings, y = volume, colour = city))+
  geom_smooth(aes(x = listings, y = volume), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Volume of sales vs Listings by year",  
       x = "Number of listings", 
       y = "Volume of sales (mln $)")+ 
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between volume and listings
corr_by_yearcity_ls <- corr_fun(data, year, city, listings, volume)

#Plot of corrplot for this calculated index
corr_by_yearcity_ls %>% 
  ungroup()

## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                  0.448 
##  2  2010 Bryan-College Station     0.624 
##  3  2010 Tyler                     0.450 
##  4  2010 Wichita Falls            -0.0104
##  5  2011 Beaumont                  0.559 
##  6  2011 Bryan-College Station     0.670 
##  7  2011 Tyler                     0.882 
##  8  2011 Wichita Falls             0.704 
##  9  2012 Beaumont                  0.355 
## 10  2012 Bryan-College Station     0.368 
## 11  2012 Tyler                     0.776 
## 12  2012 Wichita Falls             0.412 
## 13  2013 Beaumont                  0.421 
## 14  2013 Bryan-College Station     0.172 
## 15  2013 Tyler                     0.899 
## 16  2013 Wichita Falls             0.326 
## 17  2014 Beaumont                  0.285 
## 18  2014 Bryan-College Station     0.120 
## 19  2014 Tyler                     0.363 
## 20  2014 Wichita Falls             0.638

#Transform in wide format
corr_matrix_ls <- corr_by_yearcity_ls %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_ls <- as.matrix(corr_matrix_ls[,-1])

#Set rownames like years, to be more clear
rownames(mat_ls) <- corr_matrix_ls$year

#Corrplot
corrplot(mat_ls, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Volume of sales vs Listings correlation by year and city",
         mar = c(0,0,1,0))

#This analysis shows low correlation between two variables,
#except for city of Tyler in the year from 2011 to 2013.
#Generally correlation is positive, so volume growth like number of listings, but not in a
#strong dependency
#Analyzing the situation by city into years, the correlation plot shows that
#Beaumont --> the correlation goes from 0.45 to 0.29 (decreasing)
#Bryan-College Station --> the correlation goes from 0.62 to 0.12 (decreasing)
#Tyler --> shows a positive trend from 2011 to 2013 with a decrease into 2014
#Wichita Falls --> shows strong variability in trend

#Plot of joint distribution of volume and listings
ggplot(data, aes(x = listings, y = volume)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Volume of sales vs Listings joint distribution with confidence ellipses", 
       x = "Number of listings", 
       y = "Volume of sales (mln $)") + 
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = listings, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Volume of sales vs Listings linear regression with confidence intervals", 
       x = "Number of listings", 
       y = "Volume of sales (mln $)") + 
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(volume ~ listings, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals, color = city)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (volume of sales vs listings))",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

#The residuals plot from the linear regression model linking sales value to the number of active 
#listings reveals an interesting pattern: the residuals are not randomly distributed but highlight 
#the presence of four distinct clusters, corresponding to the different cities analyzed. This 
#suggests that the relationship between sales value and the number of listings varies significantly 
#across cities, indicating potentially different market dynamics. To further explore these 
#differences, it is necessary to estimate separate regression models for each city or introduce 
#interaction terms into a single model. This step will allow us to better understand how individual 
#cities contribute to the observed variability and whether the slope of the relationship varies 
#significantly depending on the geographical context.

#These four different trends of residuals suggest four different trends of regression between volume
#and sales by city. So, below there is a plot of these four regression lines
ggplot(data, aes(x = listings, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +  
  labs(title = "Volume of sales vs Listings linear regressions", 
       x = "Number of listings", 
       y = "Volume of sales (mln $)") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Interaction model to investigate the influence of variable city on volume-listings relationship
model_interaction <- lm(volume ~ listings * city, data = data)
summary(model_interaction)

## 
## Call:
## lm(formula = volume ~ listings * city, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.405  -6.782  -0.731   4.847  39.197 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                         4.940e+01  2.741e+01   1.802   0.0728 .
## listings                           -1.386e-02  1.630e-02  -0.850   0.3960  
## cityBryan-College Station           1.032e+01  2.876e+01   0.359   0.7201  
## cityTyler                           8.051e+00  3.340e+01   0.241   0.8097  
## cityWichita Falls                  -4.136e+01  3.300e+01  -1.253   0.2113  
## listings:cityBryan-College Station -9.057e-04  1.733e-02  -0.052   0.9584  
## listings:cityTyler                  9.835e-03  1.757e-02   0.560   0.5761  
## listings:cityWichita Falls          2.033e-02  2.591e-02   0.785   0.4334  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.41 on 232 degrees of freedom
## Multiple R-squared:  0.5443, Adjusted R-squared:  0.5305 
## F-statistic: 39.59 on 7 and 232 DF,  p-value: < 2.2e-16

#From the summary of interaction model we can explore some results
#R^2 --> this interaction explain only the 54% of variability of the entire dataset
#Global p-value --> 2.2e-16, so at least one predictor (excluding intercept) has a significant contribution to the model
#Intercept p-value --> intercept is statistically significant for 10%
#Other p-values --> have values very high, so there are not statistically significant
#Generally, from interaction model, there is no significant dependence of the volume vs listings 
#relationship on cities at a statistical level, that is, the slope of the relationship does not 
#differ significantly between cities. For this reason, we can explore other type of interaction,
#introducing non linearity

#Given a dataset with approximately 240 observations evenly distributed across four cities 
#(about 60 cases per city), model complexity must be balanced with data availability to avoid 
#overfitting and unstable estimates. To explore the relationship between sales volume and number 
#of active listings, accounting for potential non-linearities and city-specific differences, two 
#flexible yet parsimonious modeling approaches are considered.

#First, polynomial regression with interaction terms allows capturing simple curvilinear effects 
#(e.g., quadratic terms) and varying relationships by city without inflating the number of 
#parameters excessively.

model_poly <- lm(volume ~ city * (listings + I(listings^2)), data = data)
summary(model_poly)

## 
## Call:
## lm(formula = volume ~ city * (listings + I(listings^2)), data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.707  -6.608  -0.815   4.691  39.701 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             -8.451e+01  4.479e+02  -0.189    0.851
## cityBryan-College Station                2.311e+02  4.497e+02   0.514    0.608
## cityTyler                                1.793e+02  4.830e+02   0.371    0.711
## cityWichita Falls                        4.964e+01  4.838e+02   0.103    0.918
## listings                                 1.460e-01  5.340e-01   0.273    0.785
## I(listings^2)                           -4.757e-05  1.588e-04  -0.300    0.765
## cityBryan-College Station:listings      -2.877e-01  5.371e-01  -0.536    0.593
## cityTyler:listings                      -1.760e-01  5.485e-01  -0.321    0.749
## cityWichita Falls:listings              -4.359e-02  6.717e-01  -0.065    0.948
## cityBryan-College Station:I(listings^2)  9.242e-05  1.601e-04   0.577    0.564
## cityTyler:I(listings^2)                  5.207e-05  1.603e-04   0.325    0.746
## cityWichita Falls:I(listings^2)         -5.703e-06  2.763e-04  -0.021    0.984
## 
## Residual standard error: 11.38 on 228 degrees of freedom
## Multiple R-squared:  0.5542, Adjusted R-squared:  0.5327 
## F-statistic: 25.77 on 11 and 228 DF,  p-value: < 2.2e-16

#All p-values show a low stastic significance

#Second, Generalized Additive Models (GAMs) provide a more flexible framework to model smooth 
#non-linear relationships that can differ between cities, while controlling complexity through 
#limited degrees of freedom in the smoothing parameters to adapt to the modest sample size.

data$Bryan <- ifelse(data$city == "Bryan-College Station", 1, 0)
data$Tyler <- ifelse(data$city == "Tyler", 1, 0)
data$Wichita <- ifelse(data$city == "Wichita Falls", 1, 0)
data$Beaumont <- ifelse(data$city == "Beaumont", 1, 0)

gam_model <- gam(volume ~ city + 
                   s(listings, by=Bryan, k=5) + 
                   s(listings, by=Tyler, k=5) + 
                   s(listings, by=Wichita, k=5)+
                   s(listings, by=Beaumont, k=5),
                 data=data)

summary(gam_model)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## volume ~ city + s(listings, by = Bryan, k = 5) + s(listings, 
##     by = Tyler, k = 5) + s(listings, by = Wichita, k = 5) + s(listings, 
##     by = Beaumont, k = 5)
## 
## Parametric coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 21.370      5.032   4.247 3.15e-05 ***
## cityBryan-College Station    3.914     12.882   0.304 0.761511    
## cityTyler                   14.546      4.118   3.533 0.000497 ***
## cityWichita Falls           -1.039      7.450  -0.139 0.889252    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                        edf Ref.df      F p-value  
## s(listings):Bryan    2.352  2.682  4.674  0.0461 *
## s(listings):Tyler    1.429  1.429 12.858  0.0104 *
## s(listings):Wichita  1.429  1.429  1.255  0.4410  
## s(listings):Beaumont 1.714  1.714  1.059  0.4613  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 20/24
## R-sq.(adj) =  0.539   Deviance explained = 55.4%
## GCV = 132.72  Scale est. = 127.78    n = 240

#1. The intercept estimate for Tyler is statistically significant, indicating that at zero listings, 
#the sales volume in Tyler is higher than the baseline city, while the intercepts for the other 
#cities are not significant.
#2.The higher effective degrees of freedom (edf) for Bryan and Tyler indicate a nonlinear 
#relationship between listings and sales volume, whereas for Wichita and Beaumont the relationship 
#is linear or nearly linear.
#3.The smooth terms for Bryan and Tyler are statistically significant, meaning that the 
#relationship between listings and sales volume in these cities is significant and non-constant 
#(nonlinear), while for Wichita and Beaumont listings have little influence on the sales volume.
#4.Overall, the nonlinear GAM model, unlike the linear interaction model, confirms that the 
#relationship between listings and sales volume depends on the city, but this dependence is 
#statistically significant only for Bryan and Tyler. Furthermore, the city modifies the linearity of 
#the relationship; where the relationship is linear, the city factor does not significantly 
#influence it.

#Plot of loess smooth of volume and listings by city
ggplot(data, aes(x = listings, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Volume of sales vs Listings non linear regression with confidence intervals", 
       x = "Number of listings", 
       y = "Volume of sales (mln $)") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

Variables “listings” vs “median_price”

Correlation and General Trend
The correlation between the number of listings and the median sales price is generally low and subject to significant variations across cities and years:

Bryan-College Station shows predominantly negative correlations, except in 2014.
Tyler and Wichita Falls exhibit positive but weak and inconsistent correlations over time.
Beaumont shows a shift from negative to positive correlations, though always of modest magnitude.
In summary, no strong and stable linear relationship emerges between these two variables.

Linear Model and Residual Analysis
The linear regression model reveals residual patterns similar to those observed in the relationship between listings and sales volume, with distinct clusters by city suggesting different market dynamics. The presence of these residual differences indicates the need to consider geographic variability in the model.

Linear Interaction Model
The interaction model between number of listings and city explains about 83% of the overall variability, but the interaction terms are generally not significant. This implies that the slope of the relationship between median price and number of listings does not differ significantly among cities in a simple linear model.

Nonlinear Models and Generalized Additive Models (GAM)
Polynomial regression with interaction terms does not show significant improvements. GAMs reveal that the relationship between listings and median price is significant and nonlinear for Bryan, Tyler, and Beaumont, while the effect is less relevant for Wichita. The GAM model confirms the dependence of the relationship on the geographic area and shows how the city modifies the linearity of the relationship.

Visualization with LOESS Regression
LOESS curves by city clearly highlight local dynamics and nonlinear relationships, reinforcing the idea of territorial heterogeneity in the relationship between listings and median price.

Final considerations
The relationship between the number of property listings and median sales price appears complex, weakly linear, and influenced by territorial and temporal factors. A simple linear model is insufficient to adequately describe it, whereas flexible approaches like GAMs enable capturing nonlinearities and geographic differences. These results suggest further analysis using models that allow precise modeling of spatial and nonlinear variability in the real estate market.

#Correlation with median price
ggplot(data = data)+
  geom_point(aes(x = listings, y = median_price, colour = city))+
  geom_smooth(aes(x = listings, y = median_price), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Median price vs listings by year",
       x = "Number of listings",
       y = "Median price")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between median price and listings
corr_by_yearcity_ls <- corr_fun(data, year, city, listings, median_price)

#Plot of corrplot for this calculated index
corr_by_yearcity_ls %>% 
  ungroup()

## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                 -0.581 
##  2  2010 Bryan-College Station    -0.316 
##  3  2010 Tyler                     0.493 
##  4  2010 Wichita Falls             0.0507
##  5  2011 Beaumont                  0.508 
##  6  2011 Bryan-College Station    -0.424 
##  7  2011 Tyler                     0.376 
##  8  2011 Wichita Falls             0.312 
##  9  2012 Beaumont                  0.0904
## 10  2012 Bryan-College Station     0.0654
## 11  2012 Tyler                     0.465 
## 12  2012 Wichita Falls             0.500 
## 13  2013 Beaumont                  0.122 
## 14  2013 Bryan-College Station    -0.184 
## 15  2013 Tyler                     0.655 
## 16  2013 Wichita Falls            -0.260 
## 17  2014 Beaumont                  0.371 
## 18  2014 Bryan-College Station    -0.742 
## 19  2014 Tyler                    -0.226 
## 20  2014 Wichita Falls             0.317

#Transform in wide format
corr_matrix_ls <- corr_by_yearcity_ls %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_ls <- as.matrix(corr_matrix_ls[,-1])

#Set rownames like years, to be more clear
rownames(mat_ls) <- corr_matrix_ls$year

#Corrplot
corrplot(mat_ls, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Median price vs Listings correlation by year and city",
         mar = c(0,0,1,0))

#This analysis reveals varying relationships between listings and median price across cities and 
#years. For Bryan-College Station, the relationship is predominantly negative with generally low 
#linear correlation, except in 2014. Tyler and Wichita Falls exhibit positive but weak and 
#inconsistent correlations over time. Beaumont shows a shift from negative to positive correlations 
#across years, though correlations remain weak. Overall, regardless of the year, all cities display 
#low linear correlation between these two variables, suggesting the absence of a strong linear 
#relationship.

#Plot of joint distribution of median price and listings
ggplot(data, aes(x = listings, y = median_price)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Listings joint distribution with confidence ellipses", 
       x = "Listings",
       y = "Median price") +
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = listings, y = median_price, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Listings linear regression with confidence intervals", 
       x = "Number of listings",
       y = "Median price") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(median_price ~ listings, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals, color = city)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (median price vs listings)",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

#This analysis shows a particular pattern of residuals, like the analysis between listings and volume.
#So, some add on this analysis, like done previously

#These four different trends of residuals suggest four different trends of regression between median price
#and sales by city. So, below there is a plot of these four regression lines
ggplot(data, aes(x = listings, y = median_price, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +  
  labs(title = "Median price vs Listings linear regressions",
       x = "Number of listings",
       y = "Median price") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Interaction model to investigate the influence of variable city on volume-listings relationship
model_interaction <- lm(median_price ~ listings * city, data = data)
summary(model_interaction)

## 
## Call:
## lm(formula = median_price ~ listings * city, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27642  -6338    652   6038  33846 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        154590.766  22727.226   6.802 8.68e-11 ***
## listings                              -14.650     13.514  -1.084   0.2795    
## cityBryan-College Station           37982.241  23845.127   1.593   0.1126    
## cityTyler                           15533.789  27694.337   0.561   0.5754    
## cityWichita Falls                  -47723.146  27362.571  -1.744   0.0825 .  
## listings:cityBryan-College Station     -9.411     14.367  -0.655   0.5131    
## listings:cityTyler                      4.777     14.565   0.328   0.7432    
## listings:cityWichita Falls              9.017     21.482   0.420   0.6751    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9460 on 232 degrees of freedom
## Multiple R-squared:  0.8309, Adjusted R-squared:  0.8257 
## F-statistic: 162.8 on 7 and 232 DF,  p-value: < 2.2e-16

#From the summary of interaction model we can explore some results
#R^2 --> this interaction explain only the 83% of variability of the entire dataset
#Global p-value --> 2.2e-16, so at least one predictor (excluding intercept) has a significant contribution to the model
#Intercept p-value --> intercept is statistically significant
#Other p-values --> have values high, so there are not statistically significant
#Generally, from interaction model, there is no significant dependence of the median price vs listings 
#relationship on cities at a statistical level, that is, the slope of the relationship does not 
#differ significantly between cities. For this reason, we can explore other type of interaction,
#introducing non linearity

#First, polynomial regression with interaction terms allows capturing simple curvilinear effects 
#(e.g., quadratic terms) and varying relationships by city without inflating the number of 
#parameters excessively.

model_poly <- lm(median_price ~ city * (listings + I(listings^2)), data = data)
summary(model_poly)

## 
## Call:
## lm(formula = median_price ~ city * (listings + I(listings^2)), 
##     data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27494  -6103     58   5577  34019 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              3.906e+05  3.702e+05   1.055    0.293
## cityBryan-College Station               -1.330e+05  3.717e+05  -0.358    0.721
## cityTyler                               -1.697e+04  3.992e+05  -0.043    0.966
## cityWichita Falls                       -2.052e+05  3.999e+05  -0.513    0.608
## listings                                -2.963e+02  4.414e+02  -0.671    0.503
## I(listings^2)                            8.383e-02  1.313e-01   0.639    0.524
## cityBryan-College Station:listings       1.774e+02  4.439e+02   0.400    0.690
## cityTyler:listings                       1.446e+02  4.534e+02   0.319    0.750
## cityWichita Falls:listings               1.151e+02  5.552e+02   0.207    0.836
## cityBryan-College Station:I(listings^2) -5.031e-02  1.323e-01  -0.380    0.704
## cityTyler:I(listings^2)                 -5.926e-02  1.325e-01  -0.447    0.655
## cityWichita Falls:I(listings^2)          1.366e-02  2.283e-01   0.060    0.952
## 
## Residual standard error: 9408 on 228 degrees of freedom
## Multiple R-squared:  0.8356, Adjusted R-squared:  0.8276 
## F-statistic: 105.3 on 11 and 228 DF,  p-value: < 2.2e-16

#All p-values show a low stastic significance

#Second, Generalized Additive Models (GAMs) provide a more flexible framework to model smooth 
#non-linear relationships that can differ between cities, while controlling complexity through 
#limited degrees of freedom in the smoothing parameters to adapt to the modest sample size.
gam_model <- gam(median_price ~ city + 
                   s(listings, by=Bryan, k=5) + 
                   s(listings, by=Tyler, k=5) + 
                   s(listings, by=Wichita, k=5)+
                   s(listings, by=Beaumont, k=5),
                 data=data)

summary(gam_model)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## median_price ~ city + s(listings, by = Bryan, k = 5) + s(listings, 
##     by = Tyler, k = 5) + s(listings, by = Wichita, k = 5) + s(listings, 
##     by = Beaumont, k = 5)
## 
## Parametric coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  93801       3905  24.019  < 2e-16 ***
## cityBryan-College Station    25577       8472   3.019  0.00282 ** 
## cityTyler                    31260       5829   5.363 1.98e-07 ***
## cityWichita Falls             1638       6136   0.267  0.78981    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                        edf Ref.df      F  p-value    
## s(listings):Bryan    2.188  2.516 15.370 1.39e-06 ***
## s(listings):Tyler    2.008  2.263  4.475  0.00512 ** 
## s(listings):Wichita  1.429  1.429  2.287  0.25782    
## s(listings):Beaumont 1.714  1.714 50.053  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 20/24
## R-sq.(adj) =  0.829   Deviance explained = 83.5%
## GCV = 9.1502e+07  Scale est. = 8.7942e+07  n = 240

#1. The intercept estimate for Tyler and Bryan are statistically significant, indicating that at zero listings, 
#the median price in Tyler and Bryan is higher than the baseline city, while the intercepts for the other 
#cities are not significant.
#2.The higher effective degrees of freedom (edf) for all cities indicate a nonlinear 
#relationship between listings and median price.
#3.The smooth terms for Bryan, Tyler and Beaumont are statistically significant, meaning that the 
#relationship between listings and median price in these cities is significant and non-constant 
#(nonlinear), while for Wichita listings have little influence on median price.
#4.Overall, the nonlinear GAM model, unlike the linear interaction model, confirms that the 
#relationship between listings and median price depends on the city, but this dependence is 
#statistically significant only for Bryan, Tyler and Beaumont. Furthermore, the city modifies the linearity of 
#the relationship; where the relationship is linear, the city factor does not significantly 
#influence it.

#Plot of loess smooth of median price and listings by city
ggplot(data, aes(x = listings, y = median_price, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Median price vs Listings non linear regression with confidence intervals",
       x = "Number of listings",
       y = "Median price") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

Variables “months_inventory” vs “sales”

Correlation and General Trend
The linear correlation between number of sales and months needed to sell listings (months_inventory) is generally low, except for 2011 which shows stronger correlations. Overall, a positive correlation is observed: as sales increase, months of inventory tend to increase as well, but this relationship is weak and sometimes negative. Among cities, Beaumont and Bryan-College Station show a declining correlation over time, while Tyler and Wichita Falls display trends of increase followed by decrease.

Visualizations and Linear Model
Joint distributions and linear regression plots reveal considerable data dispersion and scatter, confirmed by residual analysis. Residuals exhibit heteroscedasticity and skewness, indicating that the simple linear model does not adequately describe the actual relationship.

Nature of the Relationship and Nonlinear Models
Loess regression analysis suggests a nonlinear and partly inverse relationship: generally, higher sales correspond to fewer months needed to sell listings. This aligns with market intuition but contrasts with the averaged positive correlations previously observed.

Spearman Correlation
Spearman correlation analysis, less sensitive to outliers, confirms the weak and complex nature of the relationship, emphasizing the need for flexible models and variable transformations for accurate description.

Final Considerations
The relationship between sales numbers and months inventory shows complex features not well captured by simple linear modeling. Evidence indicates a nonlinear, likely inverse relationship, warranting further investigation via nonlinear models or transformations. The city-specific differences and time dynamics highlight the structural market complexity.

#Correlation with sales
ggplot(data = data)+
  geom_point(aes(x = sales, y = months_inventory, colour = city))+
  geom_smooth(aes(x = sales, y = months_inventory), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months inventory vs Number of sales by year",
       x = "Number of sales",
       y = "Months inventory")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between number and months inventory
corr_by_yearcity_minv <- corr_fun(data, year, city, sales, months_inventory)

#Plot of corrplot for this calculated index
corr_by_yearcity_minv %>% 
  ungroup()

## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                 0.431  
##  2  2010 Bryan-College Station    0.345  
##  3  2010 Tyler                   -0.00956
##  4  2010 Wichita Falls           -0.249  
##  5  2011 Beaumont                 0.646  
##  6  2011 Bryan-College Station    0.778  
##  7  2011 Tyler                    0.757  
##  8  2011 Wichita Falls            0.769  
##  9  2012 Beaumont                -0.149  
## 10  2012 Bryan-College Station    0.289  
## 11  2012 Tyler                    0.453  
## 12  2012 Wichita Falls            0.0972 
## 13  2013 Beaumont                -0.118  
## 14  2013 Bryan-College Station    0.0449 
## 15  2013 Tyler                    0.653  
## 16  2013 Wichita Falls            0.235  
## 17  2014 Beaumont                 0.0904 
## 18  2014 Bryan-College Station    0.113  
## 19  2014 Tyler                    0.0578 
## 20  2014 Wichita Falls            0.616

#Transform in wide format
corr_matrix_minv <- corr_by_yearcity_minv %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_minv <- as.matrix(corr_matrix_minv[,-1])

#Set rownames like years, to be more clear
rownames(mat_minv) <- corr_matrix_minv$year

#Corrplot
corrplot(mat_minv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Months inventory vs Number of sales correlation by year and city",
         mar = c(0,0,1,0))

#This analysis shows low correlation between two variables,
#except for the year 2011.
#Generally correlation is positive, so months inventory growth like number of sales, but not at all in a
#strong dependency. There are, also, negative terms of correlation.
#Analyzing the situation by city into years, the correlation plot shows that
#Beaumont --> the correlation goes from 0.43 to 0.09
#Bryan-College Station --> the correlation goes from 0.34 to 0.11
#Tyler --> shows a growth in positive correlation from 2010 to 2011 and a decrease from 2013 to 2014
#Wichita Falls --> shows a similar trend than Tyler

#Plot of joint distribution of months inventory and sales
ggplot(data, aes(x = sales, y = months_inventory)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Number of sales joint distribution with confidence ellipses",
       x = "Number of sales",
       y = "Months inventory") +
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = sales, y = months_inventory, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Number of sales linear regression with confidence intervals",
       x = "Number of sales",
       y = "Months inventory") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(months_inventory ~ sales, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (months inventory vs number of sales)",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

#The linear relationship exists but is not perfectly described by the simple linear model due to heteroskedasticity and skewness in the residuals.
#The model may underestimate or overestimate the dependence in certain areas (especially for low values of the variable).
#The presence of outliers suggests that a more in-depth analysis of the data would be useful to identify possible subgroups or outliers.
#It may be worth trying variable transformations (e.g., logarithmic) or more flexible nonlinear models to better accommodate the shape of
#the relationship and improve the homogeneity of errors.

#Plot of loess smooth of sales and months inventory by city
ggplot(data, aes(x = sales, y = months_inventory, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Number of sales non linear regression with confidence intervals",
       x = "Number of sales",
       y = "Months inventory") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#The relationship between the number of sales and the months needed to sell listings 
#(months_inventory) appears to be nonlinear and inversely proportional. Specifically, 
#as the number of sales increases, the time required to sell all listings tends to decrease. 
#This intuition aligns with market expectations and is supported by the scatter of residuals in 
#the linear regression model, which indicates that a simple linear trend may not adequately capture 
#the complexity of this relationship.This appears to be at odds with the positive correlation 
#coefficients found for year and month, but the “average” positive correlation coefficient 
#alone may not correctly reflect the true nature of the relationship between variables, best 
#highlighted with this trend.

#Beacause residuals analysis shows an important dispersion of values, we could try to calculate
#correlation using Spearman index
corr_by_yearcity_minv_sp <- corr_sp_fun(data, year, city, sales, months_inventory)

corr_by_yearcity_minv_sp %>%
  ungroup()

## # A tibble: 20 × 3
##     year city                  corr_spearman
##    <int> <chr>                         <dbl>
##  1  2010 Beaumont                    0.214  
##  2  2010 Bryan-College Station       0.258  
##  3  2010 Tyler                      -0.112  
##  4  2010 Wichita Falls              -0.123  
##  5  2011 Beaumont                    0.587  
##  6  2011 Bryan-College Station       0.727  
##  7  2011 Tyler                       0.828  
##  8  2011 Wichita Falls               0.743  
##  9  2012 Beaumont                   -0.0495 
## 10  2012 Bryan-College Station       0.253  
## 11  2012 Tyler                       0.621  
## 12  2012 Wichita Falls               0.239  
## 13  2013 Beaumont                   -0.120  
## 14  2013 Bryan-College Station      -0.00699
## 15  2013 Tyler                       0.796  
## 16  2013 Wichita Falls               0.162  
## 17  2014 Beaumont                    0.106  
## 18  2014 Bryan-College Station       0.115  
## 19  2014 Tyler                       0.168  
## 20  2014 Wichita Falls               0.614

corr_matrix_spearman_minv <- corr_by_yearcity_minv_sp %>%
  pivot_wider(names_from = city, values_from = corr_spearman)

mat_spearman <- as.matrix(corr_matrix_spearman_minv[,-1])
rownames(mat_spearman) <- corr_matrix_spearman_minv$year

corrplot(mat_spearman, method = "color", addCoef.col = "black", tl.col = "black",
         tl.cex = 0.7, tl.srt = 30, number.cex = 0.9, cl.pos = "b",
         title = "Months inventory vs Number of sales Spearman corr by year and city",
         mar = c(0,0,1,0))

#Also the analysis with correlation Spearman index confirms the same considerations already done

Variables “months_inventory” vs “volume”

Correlation and General Trend
The linear correlation between total sales value and months needed to sell listings is generally low, with some exceptions in 2011 showing stronger correlations. Overall, a positive correlation is observed: as sales value increases, months inventory tends to increase as well, but this dependency is neither strong nor stable over time. Looking at individual cities, Beaumont and Bryan-College Station show a decreasing correlation over time, Tyler displays an initial positive trend followed by a decline, and Wichita Falls exhibits high variability in trend.

Visualizations and Linear Model
Joint distributions and linear regression plots reveal considerable data dispersion and non-random residual patterns. Residual analysis of the linear model highlights heteroscedasticity and non-random patterns, suggesting that the relationship is not adequately captured by a simple linear model.

Nature of the Relationship and Nonlinear Models
LOESS regression confirms a nonlinear relationship between sales volume and months inventory, consistent with a pattern where an increase in sales value does not correspond linearly to an increase in time to sell. This indicates complex and varied phenomena across cities and periods.

Final Considerations
The relationship between sales value and months inventory is weak, unstable, and nonlinear, making a simple linear model inadequate to describe it. Temporal and spatial differences suggest the need for more flexible models and possibly additional variables or transformations for better understanding.

#Correlation with volume
ggplot(data = data)+
  geom_point(aes(x = volume, y = months_inventory, colour = city))+
  geom_smooth(aes(x = volume, y = months_inventory), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months inventory vs Volume of sales by year",
       x = "Volume of sales (mln $)",
       y = "Months inventory")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between volume and months inventory
corr_by_yearcity_minv <- corr_fun(data, year, city, volume, months_inventory)

#Plot of corrplot for this calculated index
corr_by_yearcity_minv %>% 
  ungroup()

## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                  0.307 
##  2  2010 Bryan-College Station     0.372 
##  3  2010 Tyler                     0.0157
##  4  2010 Wichita Falls            -0.152 
##  5  2011 Beaumont                  0.559 
##  6  2011 Bryan-College Station     0.733 
##  7  2011 Tyler                     0.781 
##  8  2011 Wichita Falls             0.747 
##  9  2012 Beaumont                 -0.219 
## 10  2012 Bryan-College Station     0.238 
## 11  2012 Tyler                     0.503 
## 12  2012 Wichita Falls             0.275 
## 13  2013 Beaumont                 -0.127 
## 14  2013 Bryan-College Station     0.0261
## 15  2013 Tyler                     0.696 
## 16  2013 Wichita Falls             0.0916
## 17  2014 Beaumont                  0.0292
## 18  2014 Bryan-College Station     0.0252
## 19  2014 Tyler                     0.0315
## 20  2014 Wichita Falls             0.595

#Transform in wide format
corr_matrix_minv <- corr_by_yearcity_minv %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_minv <- as.matrix(corr_matrix_minv[,-1])

#Set rownames like years, to be more clear
rownames(mat_minv) <- corr_matrix_minv$year

#Corrplot
corrplot(mat_minv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Months inventory vs Volume of sales correlation by year and city",
         mar = c(0,0,1,0))

#This analysis shows low correlation between two variables,
#except for the year 2011.
#Generally correlation is positive, so months inventory growth like volume, but not in a
#strong dependency
#Analyzing the situation by city into years, the correlation plot shows that
#Beaumont --> the correlation goes from 0.31 to 0.03 (decreasing)
#Bryan-College Station --> the correlation goes from 0.37 to 0.03 (decreasing)
#Tyler --> shows a positive trend from 2010 to 2011 with a decrease from 2013 to 2014
#Wichita Falls --> shows strong variability in trend

#Plot of joint distribution of volume and months inventory
ggplot(data, aes(x = volume, y = months_inventory)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Volume of sales joint distribution with confidence ellipses",
       x = "Volume of sales (mln $)",
       y = "Months inventory") +
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = volume, y = months_inventory, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Volume of sales linear regression with confidence intervals",
       x = "Volume of sales (mln $)",
       y = "Months inventory") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(months_inventory ~ volume, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (months inventory vs volume of sales)",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

#There is a similar situation between months inventory and sales, with a non linear relationship
#explained by the distribution of residuals

#Plot of loess smooth of sales and months inventory by city
ggplot(data, aes(x = volume, y = months_inventory, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Volume of sales non linear regression with confidence intervals",
       x = "Volume of sales (mln $)",
       y = "Months inventory") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

Variables “months_inventory” vs “median_price”

Correlation and General Trend
The correlation between median sales price and months needed to sell listings is generally low and varies over time and across cities. Bryan-College Station in 2014 shows a relatively strong negative correlation, while other cities exhibit fluctuating or decreasing values. The negative correlation suggests that, in some cases, an increase in median price tends to be associated with a decrease in the time required to sell homes, a counterintuitive result compared to the expectation that higher prices require more time to sell.

Visualizations and Linear Model
Joint distributions and linear regressions show dispersion and city differences, with residuals indicating a nonlinear relationship not well captured by the simple linear model. Residuals are non-randomly distributed, suggesting complex patterns or local factors not accounted for by the model.

Nonlinear Models and Local Trends
LOESS regression by city clearly highlights nonlinear relationships. In some cities, a significant inverse relationship emerges between median price and months inventory, where increases in median price correspond to decreases in selling time.

Final Considerations
The relationship between median sales price and months inventory is complex, predominantly nonlinear, with varying territorial effects. Counterintuitive results (price increases linked to shorter selling times) suggest sophisticated market dynamics and the importance of flexible model-based interpretations.

#Correlation with median price
ggplot(data = data)+
  geom_point(aes(x = median_price, y = months_inventory, colour = city))+
  geom_smooth(aes(x = median_price, y = months_inventory), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months inventory vs Median price by year",
       x = "Median price",
       y = "Months inventory")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between median price and months inventory
corr_by_yearcity_minv <- corr_fun(data, year, city, median_price, months_inventory)

#Plot of corrplot for this calculated index
corr_by_yearcity_minv %>% 
  ungroup()

## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                 -0.475 
##  2  2010 Bryan-College Station    -0.194 
##  3  2010 Tyler                     0.155 
##  4  2010 Wichita Falls            -0.0413
##  5  2011 Beaumont                  0.406 
##  6  2011 Bryan-College Station    -0.408 
##  7  2011 Tyler                     0.218 
##  8  2011 Wichita Falls             0.283 
##  9  2012 Beaumont                 -0.308 
## 10  2012 Bryan-College Station    -0.0548
## 11  2012 Tyler                     0.120 
## 12  2012 Wichita Falls             0.443 
## 13  2013 Beaumont                 -0.400 
## 14  2013 Bryan-College Station    -0.288 
## 15  2013 Tyler                     0.438 
## 16  2013 Wichita Falls            -0.440 
## 17  2014 Beaumont                  0.0964
## 18  2014 Bryan-College Station    -0.783 
## 19  2014 Tyler                    -0.459 
## 20  2014 Wichita Falls             0.269

#Transform in wide format
corr_matrix_minv <- corr_by_yearcity_minv %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_minv <- as.matrix(corr_matrix_minv[,-1])

#Set rownames like years, to be more clear
rownames(mat_minv) <- corr_matrix_minv$year

#Corrplot
corrplot(mat_minv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Months inventory vs Median price correlation by year and city",
         mar = c(0,0,1,0))

#This analysis shows low correlation between two variables,
#except for city of Bryan-College Station in the year 2014.
#Generally correlation is negative, so months inventory decrease while median price increase, 
#but not in a strong dependency. This first result is not expected, because if median price is high
#we may expect that the number of months for sell is high, but the relationship shows non linearity
#and this correlation could not explore all trends in data
#Analyzing the situation by city into years, the correlation plot shows that
#Beaumont --> the correlation goes from -0.47 to 0.10 (decreasing in absolute value)
#Bryan-College Station --> the correlation goes from -0.19 to -0.78 (increasing)
#Tyler --> the correlation goes from 0.15 to -0.46 (increasing in absolute value)
#Wichita Falls --> shows strong variability in trend

#Plot of joint distribution of median price and months inventory
ggplot(data, aes(x = median_price, y = months_inventory)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Median price joint distribution with confidence intervals",
       x = "Median price",
       y = "Months inventory") +
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = median_price, y = months_inventory, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Median price linear regression with confidence intervals",
       x = "Median price",
       y = "Months inventory") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(months_inventory ~ median_price, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (months inventory vs median price)",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

#There is a similar situation between months inventory and median price, with a non linear relationship
#explained by the distribution of residuals

#Plot of loess smooth of median price and months inventory by city
ggplot(data, aes(x = median_price, y = months_inventory, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Median price non lineare regression with confidence intervals",
       x = "Median price",
       y = "Months inventory") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Overall, the relationship between the two variables appears nonlinear, as indicated by the 
#residual plot showing points scattered around the horizontal line without clear patterns 
#suggesting a significant local influence of the city variable. However, in certain cities, a 
#significant inverse relationship is observed: as the median price increases, the number of months 
#required to sell all listings tends to decrease

Variables “months_inventory” vs “listings”

Correlation and General Trend
The analysis shows a strong positive linear correlation between the number of active listings and the number of months needed to sell all listings (months_inventory). This outcome aligns with the notion that a higher number of active listings indicates lower selling capacity and therefore a longer time to clear the market.Joint plots and confidence ellipses confirm this robust positive relationship between listings and months inventory, with variations across cities. Linear regression lines for each city illustrate trends consistent with the positive correlation.

Linear Regression Model and Residuals
The linear regression model confirms the statistical significance of both the intercept and slope (both with very low p-values around 2e-16). The coefficient of determination (R²) is moderate, about 54%, indicating the model explains a good portion of variance but leaves room for unexplained factors.Residual analysis reveals some non-random patterns, suggesting possible local effects or additional influencing variables.

Final Considerations
The strong positive linear dependence between listings and months inventory is expected and statistically significant. Residual variability and city-level differences suggest further investigation could improve understanding and modeling of the local real estate market.

#Correlation with listings
ggplot(data = data)+
  geom_point(aes(x = listings, y = months_inventory, colour = city))+
  geom_smooth(aes(x = listings, y = months_inventory), lwd = 1)+
  facet_wrap(~ year, scales = "free_x")+
  labs(title = "Months inventory vs Listings by year",
       x = "Number of listings",
       y = "Months inventory")+
  theme_classic()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Correlation index by city and year between listings and months inventory
corr_by_yearcity_minv <- corr_fun(data, year, city, listings, months_inventory)

#Plot of corrplot for this calculated index
corr_by_yearcity_minv %>% 
  ungroup()

## # A tibble: 20 × 3
##     year city                  corr_index
##    <int> <chr>                      <dbl>
##  1  2010 Beaumont                   0.966
##  2  2010 Bryan-College Station      0.898
##  3  2010 Tyler                      0.854
##  4  2010 Wichita Falls              0.974
##  5  2011 Beaumont                   0.813
##  6  2011 Bryan-College Station      0.993
##  7  2011 Tyler                      0.965
##  8  2011 Wichita Falls              0.796
##  9  2012 Beaumont                   0.730
## 10  2012 Bryan-College Station      0.965
## 11  2012 Tyler                      0.882
## 12  2012 Wichita Falls              0.980
## 13  2013 Beaumont                   0.526
## 14  2013 Bryan-College Station      0.986
## 15  2013 Tyler                      0.883
## 16  2013 Wichita Falls              0.941
## 17  2014 Beaumont                   0.910
## 18  2014 Bryan-College Station      0.988
## 19  2014 Tyler                      0.922
## 20  2014 Wichita Falls              0.991

#Transform in wide format
corr_matrix_minv <- corr_by_yearcity_minv %>%
  pivot_wider(names_from = city, values_from = corr_index)

#Remove column "year" to have only numeric variable
mat_minv <- as.matrix(corr_matrix_minv[,-1])

#Set rownames like years, to be more clear
rownames(mat_minv) <- corr_matrix_minv$year

#Corrplot
corrplot(mat_minv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30, 
         number.cex = 0.9, cl.pos = "b", title = "Months inventory vs Listings correlation by year and city",
         mar = c(0,0,1,0))

#This analysis shows a strong linear correlation between number of listings and months
#inventory. We could expect this trend because a bigger number of listings suggest a small number
#of sell, so bigger values of variable months inventory.
#To confirm this below are shown joint distribution with confidence ellipses

#Plot of joint distribution of listings and months inventory
ggplot(data, aes(x = listings, y = months_inventory)) +
  geom_point(alpha = 0.5, aes(color = city)) +
  stat_ellipse(level = 0.95, aes(color = city), size = 1) +
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Listings joint distribution with confidence ellipses",
       x = "Number of listings",
       y = "Months inventory") +
  theme_classic()

#Plot of points around linear regression rect
ggplot(data, aes(x = listings, y = months_inventory, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) + 
  facet_wrap(~ city, scales = "free_x")+
  labs(title = "Months inventory vs Listings linear regression with confidence intervals",
       x = "Number of listings",
       y = "Months inventory") +
  theme_classic()

## `geom_smooth()` using formula = 'y ~ x'

#Plot of residuals to show other caratheristic on correlation
model <- lm(months_inventory ~ listings, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)

ggplot(data, aes(x = fitted, y = residuals, color = city)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression residuals (months inventory vs listings)",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

#The analysis of the relationship between the number of months needed to sell listings and the 
#number of active listings reveals a strong positive linear dependence, consistent with the notion 
#that a higher number of listings reflects a lower selling capacity and therefore a longer time to 
#clear the market. In the summary of the linear regression model, both the slope coefficient for 
#the listings variable and the intercept have very low p-values (2e-16), indicating statistical 
#significance. However, the coefficient of determination R² is moderate, around 54%

#The analysis of the relationship between the number of months needed to sell listings and the 
#number of active listings reveals a strong positive linear dependence, consistent with the notion 
#that a higher number of listings reflects a lower selling capacity and therefore a longer time to 
#clear the market. In the summary of the linear regression model, both the slope coefficient for 
#the listings variable and the intercept have very low p-values (2e-16), indicating statistical 
#significance. However, the coefficient of determination R² is moderate, around 54%

Second breakpoint: results of analysis (point 3)

Historical Trends in Sales and Values:

The number and value of sales are strongly correlated and show a significant linear trend over time and across cities, confirming the consistency and growth of the Texas real estate market.
The median sales value exhibits more complex and localized variations, suggesting price dynamics influenced by territorial and temporal factors.

Effectiveness of Marketing Strategies (Listings):

The relationship between the number of active listings and sales shows a declining correlation over time, indicating that the mere quantity of listings does not fully explain current sales.
Significant variation among cities and the presence of non-linearity (highlighted by GAM models and loess regressions) underscore the need for differentiated marketing strategies attentive to local dynamics.

Distribution and Relationships among Prices, Sales, and Listing Duration:

Correlations between median prices and total sales values are positive but moderate and unstable, with heteroscedasticity and outliers requiring complex models.
The average selling time (months_inventory) shows complex and nonlinear relationships with sales, prices, and number of listings, suggesting that the market is also influenced by liquidity factors and more sophisticated supply/demand dynamics.
Nonlinear models (GAM, loess) are effective in capturing these dynamics and could guide targeted interventions to reduce selling times.

Implications:

The evidence suggests that there is no “one-size-fits-all” model: each city presents specific dynamics, with different signals of growth, stability, or slowdown.
Marketing and listing management strategies must therefore be personalized and adapted to local contexts, monitoring not only listing volume but also nonlinear patterns of price, sales, and duration.
The use of advanced models and detailed visualizations allows the identification of growth opportunities and the optimization of timing and quality of commercial interventions on listings.

Multivariate analysis

Linear multivariate model with interactions between variables

After conducting thorough bivariate analyses, it could proceed with a multivariate analysis focusing on variables that do not exhibit multicollinearity. Hypothesizing a linear mixed-effects model that includes the joint effects and all interactions among volume, listings, and months_inventory variables, while accounting for random intercepts by year and month. This approach allows me to capture complex relationships and local variations—such as differences across cities—without bias from strongly correlated predictors.

#Creation of scaled data frame
volume_scaled <- as.vector(scale(data$volume))
listings_scaled <- as.vector(scale(data$listings))
months_inv_scaled <- as.vector(scale(months_inventory))
data_scaled <- data.frame(
                          year = data$year,
                          month = data$month,
                          city = data$city,
                          volume = volume_scaled,
                          listings = listings_scaled,
                          months_inventory = months_inv_scaled
                          )

mod_multivariate <- lmer(volume ~ listings * months_inventory * city + (1|year) + (1|month), data = data_scaled)
summary(mod_multivariate)

## Linear mixed model fit by REML ['lmerMod']
## Formula: volume ~ listings * months_inventory * city + (1 | year) + (1 |  
##     month)
##    Data: data_scaled
## 
## REML criterion at convergence: 229
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4525 -0.5906 -0.0039  0.5382  4.2657 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  month    (Intercept) 0.21712  0.4660  
##  year     (Intercept) 0.01398  0.1182  
##  Residual             0.11826  0.3439  
## Number of obs: 240, groups:  month, 12; year, 5
## 
## Fixed effects:
##                                                     Estimate Std. Error t value
## (Intercept)                                         -0.20734    0.19196  -1.080
## listings                                            -0.92099    0.79243  -1.162
## months_inventory                                    -0.37871    0.14355  -2.638
## cityBryan-College Station                            0.83173    0.14995   5.547
## cityTyler                                           -0.01248    0.52803  -0.024
## cityWichita Falls                                   -2.77226    1.23405  -2.246
## listings:months_inventory                           -0.67500    0.57726  -1.169
## listings:cityBryan-College Station                   4.46206    0.87585   5.095
## listings:cityTyler                                   2.24928    0.80297   2.801
## listings:cityWichita Falls                          -0.52055    1.39737  -0.373
## months_inventory:cityBryan-College Station          -1.06162    0.20119  -5.277
## months_inventory:cityTyler                          -0.86318    0.30189  -2.859
## months_inventory:cityWichita Falls                  -2.55279    1.27016  -2.010
## listings:months_inventory:cityBryan-College Station  0.97342    0.58925   1.652
## listings:months_inventory:cityTyler                  0.79424    0.58685   1.353
## listings:months_inventory:cityWichita Falls         -1.34502    1.23212  -1.092

## 
## Correlation matrix not shown by default, as p = 16 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it

#Plot of listings and volume by city
ggplot(data_scaled, aes(x = listings, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_classic() +
  labs(title = "Volume of sales vs Listings by city", 
       x = "Number of listings",
       y = "Volume of sales (scaled)")

## `geom_smooth()` using formula = 'y ~ x'

#Plot of listings and volume by city and with interaction with months inventory

#Creation of bins of months inventory (25% of population for every bins)
data_scaled <- data_scaled %>%
  mutate(
    months_inventory_bin_num = ntile(months_inventory, 4),
    months_inventory_bin = factor(
      months_inventory_bin_num,
      levels = 1:4,
      labels = c("Low", "Medium-Low", "Medium-High", "High")
    )
  )

ggplot(data_scaled, aes(x = listings, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ months_inventory_bin) +
  theme_classic() +
  labs(title = "Interaction Volume of sales vs Listings by year and months inventory",
       x = "Number of listings",
       y = "Volume of sales (scaled)")

## `geom_smooth()` using formula = 'y ~ x'

#Plot of marginal effects between variables
effects <- ggpredict(mod_multivariate, terms = c("listings", "months_inventory", "city"))
p <- plot(effects) + 
  labs(
    title = "Marginal effects of listings and months inventory by city",
    x = "Number of listings (scaled)",
    y = "Predicted volume of sales (scaled)"
  )

print(p)

#Plot of residuals to show other caratheristic on correlation
data_scaled$residuals <- resid(mod_multivariate)
data_scaled$fitted <- fitted(mod_multivariate)

ggplot(data_scaled, aes(x = fitted, y = residuals, colour = city)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Linear regression of multivariate model residuals",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

After careful scaling and accounting for multicollinearity, a linear mixed-effects model was fitted incorporating the full factorial interactions among the numerical predictors (listings and months_inventory) and the categorical variable city, while including random intercepts for year and month. The results indicate that:

The effect of months_inventory on sales volume is significantly negative, with stronger negative impacts observed in certain cities (Bryan, Tyler, Wichita Falls).
The influence of listings on volume varies considerably by city, showing positive and significant effects in Bryan and Tyler, while the overall baseline effect is non-significant or negative.
Interactions between listings and months_inventory are generally non-significant, suggesting that their combined influence may be less critical than their individual and city-specific effects.
Higher-order interactions (three-way) do not contribute significantly in the current model, indicating potential for simplification.

Visualizing predicted effects via interaction plots or marginal effect plots (using tools such as ggplot2 or ggeffects) can elucidate how the relationships between listings, months_inventory, and sales volume differ across cities, helping to capture heterogeneity in effects. Given the limited significance of complex triple interactions, consider simplifying the model by removing non-significant higher-order terms to enhance interpretability and reduce overfitting. Exploring alternative model specifications or incorporating additional relevant covariates may further improve model accuracy and insight into drivers of sales volume.

For Tyler and Bryan, the parallel lines with narrow confidence intervals indicate that the effects of listings and months_inventory on volume are stable and similar across these cities, showing consistent and precise model estimates. This reflects a relatively homogeneous pattern of influence. In contrast, for Beaumont and Wichita Falls, the presence of intersecting lines with wider confidence intervals suggests more variability and complexity in how these predictors affect volume. This could reflect more uncertain or non-linear relationships, possibly due to less consistent data or more complex local dynamics.

Non linear multivariate model with interactions between variables

Plotting residuals against fitted values reveals positive linear patterns that differ by city. This indicates that the current model under- or overestimates volume in a systematic way depending on predicted value and city. Such patterns suggest that the model does not fully capture the relationship between predictors and volume across all cities, pointing to potential model misspecification and the need to consider nonlinear terms or random slopes to better represent city-specific effects.

#Non linear model GAM with the capture of non linear effect and smooth terms
data_scaled$city <- as.factor(data_scaled$city)

mod_gam <- gamm(
  volume ~ s(listings, by = city) + s(months_inventory, by = city) + city,
  random = list(year = ~1, month = ~1),
  data = data_scaled
)

summary(mod_gam$gam)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## volume ~ s(listings, by = city) + s(months_inventory, by = city) + 
##     city
## 
## Parametric coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -0.08993    0.11841  -0.759   0.4484    
## cityBryan-College Station  0.91006    0.14479   6.285 1.67e-09 ***
## cityTyler                 -1.19451    0.49073  -2.434   0.0157 *  
## cityWichita Falls         -0.81361    0.92225  -0.882   0.3786    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                                  edf Ref.df      F  p-value    
## s(listings):cityBeaumont                      0.9999 0.9999  0.230 0.632280    
## s(listings):cityBryan-College Station         1.0000 1.0000 64.057  < 2e-16 ***
## s(listings):cityTyler                         1.0000 1.0000 37.920  < 2e-16 ***
## s(listings):cityWichita Falls                 1.0000 1.0000  0.336 0.562894    
## s(months_inventory):cityBeaumont              1.0000 1.0000 15.530 0.000109 ***
## s(months_inventory):cityBryan-College Station 2.4917 2.4917 43.121  < 2e-16 ***
## s(months_inventory):cityTyler                 1.8676 1.8676 44.673  < 2e-16 ***
## s(months_inventory):cityWichita Falls         1.0000 1.0000  8.898 0.003168 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.713   
##   Scale est. = 0.11131   n = 240

#Plot of predicted values
data_scaled$predicted <- predict(mod_gam$gam, newdata = data_scaled)

ggplot(data_scaled, aes(x = listings, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_point(aes(y = predicted), size = 1, colour = "red") +
  facet_wrap(~ city)+
  labs(title = " Predicted vs Observed values of volume of sales by listings and city",
       x = "Number of listings(scaled)",
       y = "Volume of sales (scaled)") +
  theme_classic()

#So, we can allow slopes of predictors like listings and months_inventory to vary by city, providing 
#flexibility to capture city-specific effect heterogeneity
mod_gam_random_slopes <- gamm(
  volume ~ s(listings, by = city) + s(months_inventory, by = city) + city,
  random = list(
    city = ~ 1 + listings + months_inventory,
    year = ~ 1,
    month = ~ 1
  ),
  data = data_scaled
)

summary(mod_gam_random_slopes$gam)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## volume ~ s(listings, by = city) + s(months_inventory, by = city) + 
##     city
## 
## Parametric coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                0.17045    0.13696   1.244    0.215    
## cityBryan-College Station  0.87642    0.19748   4.438 1.42e-05 ***
## cityTyler                 -3.50305    0.57635  -6.078 5.11e-09 ***
## cityWichita Falls         -0.03652    1.21113  -0.030    0.976    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                                 edf Ref.df      F  p-value    
## s(listings):cityBeaumont                      1.000  1.000  9.691  0.00209 ** 
## s(listings):cityBryan-College Station         1.000  1.000 56.761  < 2e-16 ***
## s(listings):cityTyler                         1.000  1.000 69.365  < 2e-16 ***
## s(listings):cityWichita Falls                 1.000  1.000  0.938  0.33382    
## s(months_inventory):cityBeaumont              1.000  1.000 20.004 1.24e-05 ***
## s(months_inventory):cityBryan-College Station 2.606  2.606 34.330  < 2e-16 ***
## s(months_inventory):cityTyler                 1.000  1.000 92.742  < 2e-16 ***
## s(months_inventory):cityWichita Falls         1.000  1.000  0.664  0.41587    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.754   
##   Scale est. = 6.1686e-08  n = 240

#Plot of predicted values
data_scaled$predicted <- predict(mod_gam_random_slopes$gam, newdata = data_scaled)

ggplot(data_scaled, aes(x = listings, y = volume, color = city)) +
  geom_point(alpha = 0.5) +
  geom_point(aes(y = predicted), size = 1, colour = "red") +
  facet_wrap(~ city)+
  labs(title = "Predicted vs Observed values of volume of sales by listings and city",
       x = "Number of listings (scaled)",
       y = "Volume of sales (scaled)")+
  theme_classic()

#Evaluation of residuals for investigate model modified

#Plot of residuals to show other caratheristic on correlation
data_scaled$residuals <- resid(mod_gam_random_slopes)
data_scaled$fitted <- fitted(mod_gam_random_slopes)

ggplot(data, aes(x = fitted, y = residuals, colour = city)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Non linear muultivariate model residuals",
       x = "Predicted values",
       y = "Residuals") +
  theme_classic()

The second model (GAM mixed model with splines)better captures the complexity and heterogeneity of the relationships between volume, listings, and months_inventory across different cities. It allows for non-linear effects of predictors by fitting spline functions, which better represent complex and potentially non-monotonic relationships that the linear model cannot capture. The GAM model shows a higher adjusted R² (~0.71), indicating improved explanatory power compared to the linear mixed model. Significant spline terms for months_inventory and listings (in some cities) reveal meaningful variation in the shape of effects that linear interactions missed.

The plot comparing observed versus predicted sales volume across cities shows mixed predictive performance of the model:

For Beaumont and Wichita Falls, the predicted values (red points) align reasonably well with the observed data, indicating the model captures the general pattern for these cities effectively.
In contrast, for Bryan-College Station and Tyler, the predicted values appear more clustered and less spread along the observed data range, suggesting that the model may be underfitting or failing to fully capture the variability in these cities.
Overall, the model exhibits decent predictive capability but shows variability in fit quality across different cities.

Third breakpoints (point 4)

Model and Main Results:

The mixed GAM (gamm) model with splines allows modeling the relationship between sales value (volume), number of listings, and months on market (months_inventory) flexibly, capturing nonlinearities and city-specific variations. Moreover, including random effects for year, month, and city with varying slopes (random slopes) enhances the ability to capture temporal and territorial heterogeneity.
The improvement in adjusted R² (from about 0.54 in the linear model to around 0.71 in the GAM model) indicates a greater ability of the model to explain sales volume variability.
Significant spline terms for some cities highlight important nonlinear effects that simple linear interaction models could not capture. The capacity to model city-specific relationship curves helps represent distinctive local market dynamics.

Goodness of Fit and Predictions Analysis:

Beaumont and Wichita Falls show good alignment between observed and predicted values, suggesting the model effectively captures trends in these markets, likely more stable or with dynamics well represented by the data.
Bryan-College Station and Tyler exhibit more clustered and less variable predicted values compared to observed data, indicating the model struggles to reproduce the real complexity and variability of these markets, probably influenced by more complex factors or unmodeled noise.

Interpretation in the Project Context:

Identify and interpret historical sales trends: The model highlights general and city-specific trends, going beyond simple linear associations, providing a more realistic and dynamic market representation.
Assess marketing strategies efficacy (listings): Including nonlinear interaction between number of listings and months on market captures the combined effect of listings and speed of sales, identifying conditions that enhance or limit generated value. This helps evaluate whether increasing listings effectively leads to more sales or if long selling times mitigate this effect.

Added Value:

The model’s ability to recognize nonlinear dynamics and territorial differences helps better understand which cities offer the greatest growth opportunities and where intervention is needed to improve listing effectiveness.
Residual analysis and differentiated performance by city suggest orienting personalized strategies for more complex and less linear markets (like Bryan-College Station and Tyler), possibly integrating additional variables or higher-frequency data to improve predictions.
Finally, visual representation of effects and models provides practical tools to monitor and communicate market trends and marketing strategy effectiveness over time.

Final consideration about the analysis and raccomandations

Based on the detailed analysis conducted, several data-driven strategies could be adopted to improve sales management and the effectiveness of property listings. Specifically:

Territorial segmentation and personalized marketing strategies:

Tailor advertising campaigns by city: Since sales dynamics, prices, and listing durations vary significantly across Beaumont, Bryan-College Station, Tyler, and Wichita Falls, it is useful to develop market-specific strategies by customizing messages, channels, and timing.
Targeted resource allocation: Allocate greater investments and focus on more dynamic markets with higher growth potential (e.g., Tyler and Bryan-College Station), while applying more conservative strategies or demand-stimulating interventions in more stable or less dynamic markets (e.g., Wichita Falls).

Optimization of listing timing and promotions:

Leverage seasonality: Intensify marketing and promotion efforts during peak sales months (spring/summer) to maximize listing conversion.
Manage listing efficiency: Monitor the listing efficiency index over time and take actions (e.g., price revision, improved descriptions, new channels) in periods or areas with lower performance.

Predictive modeling and continuous monitoring:

Implement flexible predictive models (e.g., GAM) to forecast sales volumes based on the number of listings, months on market, and territorial characteristics, improving commercial planning.
Analyze residuals and outliers to identify anomalous cases and adapt listing management strategies for particularly complex markets.

Development of customized KPIs:

Use the listing efficiency index as a standard metric, integrated with territorial and temporal variability indicators, to continuously and promptly assess campaign quality and intervene quickly.
Monitor market liquidity through variables such as average months to sell listings (months_inventory), to calibrate prices, incentives, and positioning strategies.

Improvement of offer and demand segmentation:

The correlation analysis between median price and sales volume suggests the need to evaluate targeted offers for specific market segments with different price and demand characteristics, avoiding risky one-size-fits-all strategies.

Real Estate Market Analysis in Texas

2025-12-28

Script for data analysis

Load dataset

1) Univariate analysis

Variable “city”

Variable “year”

Variable “month”

Variable “date”

Variable “sales”

Variable “volume”

Variable “median_price”

Variable “listings”

Variable “months_inventory”

Other variable analysis

Classification of variable in decreasing order of variability

Classification of variable in decreasing order of asimmetry

Creation of class and its analysis on variables

Probability of every occurrency of each variable

Creation and analysis of new variables

Variable “mean_price”

Variable “efficiency_list_index”

Histogram of variable “efficiency_list_index” absolute frequencies with mean and median

Variable “mean_eff_list_index”

Bivariate analysis by year, month, city and time series

Variable “sales”

Variable “volume”

Variable “median_price”

Variable “listings”

Variable “months_inventory”

New created variables

First breakpoint: results of analysis (points 1 and 2)

Bivariate analysis to explore relationship among variables

Variables “volume” vs “sales”

Variables “median_price” vs “sales”

Variables “median_price” vs “volume”

Variables “listings” vs “sales”

Variables “listings” vs “volume”

Variables “listings” vs “median_price”

Variables “months_inventory” vs “sales”

Variables “months_inventory” vs “volume”

Variables “months_inventory” vs “median_price”

Variables “months_inventory” vs “listings”

Second breakpoint: results of analysis (point 3)

Multivariate analysis

Linear multivariate model with interactions between variables

Non linear multivariate model with interactions between variables

Third breakpoints (point 4)

Final consideration about the analysis and raccomandations