This report presents an analysis of the real estate market in Texas based on sales carried out from 2010 to 2014 in the cities of Beaumont, Bryan-College Station, Tyler, and Wichita Falls. Below is a statistical report constructed according to the following approach:
Univariate analysis of the dataset variables, with graphical representations, calculation of the main statistical indicators, identification of the variables with the greatest variability and skewness, creation of classes for some variables, examples of occurrence probabilities within the dataset, and creation of new variables such as the average price and an efficiency index of sales listings.
Bivariate analysis to individually explore the relationships between each numerical variable and the categorical variables in the dataset, including graphical representations of relationships, distributions, and main findings.
Bivariate analysis to individually explore the relationships among the numerical variables in the dataset, with investigation of possible linear correlations, linear fitting and residual distribution analysis, nonlinear smooth fitting, and related graphical representations.
Multivariate analysis aimed at finding a plausible model to explain the variability of the dataset and the relationships among its variables, and to generally describe the model’s behavior, including any nonlinearities and group effects, with related graphical representations.
#Set project directory
setwd("C:/Users/matmi/OneDrive/Desktop/ProgettiR")
#Libraries used in the project
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(DescTools)
library(moments)
library(tidyr)
library(corrplot)
## corrplot 0.95 loaded
library(broom)
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.9-4. For overview type '?mgcv'.
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'lme4'
## The following object is masked from 'package:nlme':
##
## lmList
library(ggeffects)
library(knitr)
#Import csv file
data <- read.csv("realestate_texas.csv")
#Attach dataset
attach(data)
#Summary of dataset
summary(data)
## city year month sales
## Length:240 Min. :2010 Min. : 1.00 Min. : 79.0
## Class :character 1st Qu.:2011 1st Qu.: 3.75 1st Qu.:127.0
## Mode :character Median :2012 Median : 6.50 Median :175.5
## Mean :2012 Mean : 6.50 Mean :192.3
## 3rd Qu.:2013 3rd Qu.: 9.25 3rd Qu.:247.0
## Max. :2014 Max. :12.00 Max. :423.0
## volume median_price listings months_inventory
## Min. : 8.166 Min. : 73800 Min. : 743 Min. : 3.400
## 1st Qu.:17.660 1st Qu.:117300 1st Qu.:1026 1st Qu.: 7.800
## Median :27.062 Median :134500 Median :1618 Median : 8.950
## Mean :31.005 Mean :132665 Mean :1738 Mean : 9.193
## 3rd Qu.:40.893 3rd Qu.:150050 3rd Qu.:2056 3rd Qu.:10.950
## Max. :83.547 Max. :180000 Max. :3296 Max. :14.900
This is a categorical variable. First of all, we can explore the frequency of each value of the variable and calculate the Gini index.
#Function to compute the frequency distribution of a variable (absolute, relative and cumulative frequencies)
dist_freq <- function(vector) {
n <- length(vector)
tab <- table(vector)
df <- data.frame(
value = names(tab),
ni = as.numeric(tab),
fi = as.numeric(tab) / n,
Ni = cumsum(as.numeric(tab)),
Fi = cumsum(as.numeric(tab)) / n
)
return(df)
}
dist_city_freq <- dist_freq(city)
#Function to calculate the Gini index
gini_index <- function(freq) {
freq_vec <- as.numeric(freq)
Gini(freq_vec)
}
gini_city <- gini_index(dist_city_freq$ni)
For the categorical variable “city”, we observe a uniform distribution with equal frequencies across its 4 categories. Consequently, the Gini index is 0, indicating perfect equality in the distribution. Given the categorical nature and the balanced frequencies, this exhaustive analysis sufficiently describes this variable.
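As a quick illustration of how the index behaves, the following minimal sketch (assuming DescTools is loaded as above; the unbalanced counts are invented for illustration) compares perfectly uniform absolute frequencies with an unbalanced set:
#Gini index on uniform vs unbalanced absolute frequencies (illustrative values)
Gini(c(60, 60, 60, 60)) #uniform counts: index equal to 0
Gini(c(150, 50, 30, 10)) #unbalanced counts: index greater than 0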
#Plot of cities distribution
ggplot(data = data)+
geom_bar(aes(x = city),
stat = "count",
col = "black",
fill = "blue")+
labs(title = "City distribution",
x = "City",
y = "Abosulte frequencies")+
theme_classic()
This is a categorical variable. First of all, we can explore the frequency of each value of the variable and calculate the Gini index.
#Frequency distribution of the variable
dist_year_freq <- dist_freq(year)
#Gini index calculation
gini_years <- gini_index(dist_year_freq$ni)
For the categorical variable “year”, we observe a uniform distribution with equal frequencies across its 5 categories. Consequently, the Gini index is 0, indicating perfect equality in the distribution. Given the categorical nature and the balanced frequencies, this exhaustive analysis sufficiently describes this variable.
#Plot of years distribution
ggplot(data = data)+
geom_bar(aes(x = year),
stat = "count",
col = "black",
fill = "blue")+
labs(title = "Years distribution",
x = "Years",
y = "Absolute frequencies")+
theme_classic()
This is a categorical variable. First of all, we can explore the frequency of each value of the variable and calculate the Gini index.
#Build proper labels for the months
data$month <- sprintf("%02d", as.numeric(data$month))
month_names <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
data$month <- factor(data$month, levels = sprintf("%02d", 1:12), labels = month_names)
#Frequency distribution of the variable
dist_months_freq <- dist_freq(month)
#Gini index calculation
gini_months <- gini_index(dist_months_freq$ni)
For the categorical variable “month”, we observe a uniform distribution with equal frequencies across its 12 categories. Consequently, the Gini index is 0, indicating perfect equality in the distribution. Given the categorical nature and the balanced frequencies, this exhaustive analysis sufficiently describes this variable.
#Plot of months distribution
ggplot(data = data)+
geom_bar(aes(x = month),
stat = "count",
col = "black",
fill = "blue")+
labs(title = "Months distribution",
x = "Months",
y = "Aboslute frequencies")+
theme_classic()
For the time analysis of the data we can build a new variable that concatenates year and month.
#Variable "date" creation
date <- paste(year, month, "1", sep = "-")
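#Note: attach(data) attached a copy of the original numeric "month" column, so the
#string "YYYY-M-1" built here is still parsed correctly by as.Date() even though
#data$month has been recoded above as a labelled factor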
date <- as.Date(date)
data$date <- date
#Frequency distribution of the variable
dist_dates_freq <- dist_freq(date)
#Gini index calculation
gini_dates <- gini_index(dist_dates_freq$ni)
For the categorical variables “month” and “year”, we observe two uniform distributions with equal frequencies across their 12 and 5 categories respectively. Consequently, the Gini index of each distribution is 0. The cross-analysis of year and month shows a further uniform distribution of the months within each year. By pasting year and month together we built another variable (of type date) to show trends and patterns over time.
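As a quick cross-check of this uniformity, a simple cross-tabulation (a minimal base R sketch) should show every year-month combination appearing four times, one record per city:
#Cross-tabulation of year and month: each combination appears once per city
with(data, table(year, month))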
#Plot of month distribution by year (every month is a date "YYYY-MM-01")
ggplot(data = data)+
geom_bar(aes(x = month),
stat = "count",
col = "black",
fill = "blue")+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Months distribution for each year",
x = "Months",
y = "Absolute frequencies")+
scale_y_continuous(breaks = seq(0, 12, 1)) +
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
This is a numeric variable. First of all, we can explore the frequency of each value of the variable and calculate the main statistical indices.
#Frequency distribution of the variable
dist_sales_freq <- dist_freq(sales)
#Gini index calculation
gini_sales <- gini_index(dist_sales_freq$ni)
#Arithmetic mean
mean_sales <- mean(sales)
#Median
median_sales <- median(sales)
#Standard deviation
sd_sales <- sd(sales)
#Variation coefficient
var_cf_sales <- (sd_sales/mean_sales)*100
#Range
range_sales <- range(sales)
#Quantile and IQR
q_sales <- quantile(sales, seq(0,1, 0.25))
iqr_sales <- IQR(sales)
#Skewness
skew_sales <- skewness(sales)
#Excess kurtosis (moments::kurtosis returns raw kurtosis, so 3 is subtracted)
kurt_sales <- kurtosis(sales)-3
#Function to create a vector with the main statistical indices of a variable
stat_index_vect <- function (pos, mean_v, median_v, sd_v, cf_v, iqr_v, skew_v, kurt_v) {
name_v <- colnames(data)[pos]
v_st_index <- cbind(name_v, mean_v, median_v, sd_v, cf_v, iqr_v, skew_v, kurt_v)
return(v_st_index)
}
pos_sales <- which(names(data) == "sales")
st_index_sales <- stat_index_vect(pos_sales, mean_sales, median_sales, sd_sales, var_cf_sales, iqr_sales, skew_sales, kurt_sales)
#Plot of sales distribution with mean and median
ggplot(data = data)+
geom_histogram(aes(x = sales),
binwidth = 10,
col = "black",
fill = "blue")+
geom_vline(xintercept = mean_sales, col = "red", lwd = 1)+
geom_vline(xintercept = median_sales, col = "pink", lwd = 1)+
annotate("label", x = mean_sales + 20, y = 16, label = round(mean_sales, 2), color = "red") +
annotate("label", x = median_sales - 20, y = 16, label = median_sales, color = "pink") +
labs(title = "Number of sales distribution",
x = "Number of sales",
y = "Absolute frequencies")+
theme_classic()
#Boxplot for variable "sales"
ggplot(data = data)+
geom_boxplot(aes(y = sales), fill = "blue")+
labs(title = "Number of sales boxplot",
x = "Relative frequencies",
y = "Number of sales")+
theme_classic()
This is a numeric variable. First of all, we can explore the frequency of each value of the variable and calculate the main statistical indices.
#Frequency distribution of the variable
dist_volume_freq <- dist_freq(volume)
#Gini index calculation
gini_volume <- gini_index(dist_volume_freq$ni)
#Arithmetic mean
mean_volume <- mean(volume)
#Median
median_volume <- median(volume)
#Standard deviation
sd_volume <- sd(volume)
#Variation coefficient
var_cf_volume <- (sd_volume/mean_volume)*100
#Range
range_volume <- range(volume)
#Quantile and IQR
q_volume <- quantile(volume, seq(0,1, 0.25))
iqr_volume <- IQR(volume)
#Skewness
skew_volume <- skewness(volume)
#Kurtosis
kurt_volume <- kurtosis(volume)-3
pos_volume <- which(names(data) == "volume")
st_index_volume <- stat_index_vect(pos_volume, mean_volume, median_volume, sd_volume, var_cf_volume, iqr_volume, skew_volume, kurt_volume)
#Plot of volume distribution with mean and median
ggplot(data = data)+
geom_histogram(aes(x = volume),
binwidth = 5,
col = "black",
fill = "blue")+
geom_vline(xintercept = mean_volume, col = "red", lwd = 1)+
geom_vline(xintercept = median_volume, col = "pink", lwd = 1)+
annotate("label", x = mean_volume + 5, y = 30, label = round(mean_volume, 2), color = "red") +
annotate("label", x = median_volume - 5, y = 30, label = median_volume, color = "pink") +
labs(title = "Volume of sales distribution",
x = "Volume of sales (mln $)",
y = "Absolute frequencies")+
theme_classic()
#Boxplot for variable "volume"
ggplot(data = data)+
geom_boxplot(aes(y = volume), fill = "blue")+
labs(title = "Volume of sales boxplot",
x = "Relative frequencies",
y = "Volume of sales (mln $)")+
theme_classic()
This is a numeric variable. First of all, we can explore the frequency of each value of the variable and calculate the main statistical indices.
#Frequency distribution of the variable
dist_median_price_freq <- dist_freq(median_price)
#Gini index calculation
gini_median_price <- gini_index(dist_median_price_freq$ni)
#Arithmetic mean
mean_median_price <- mean(median_price)
#Median
median_median_price <- median(median_price)
#Standard deviation
sd_median_price <- sd(median_price)
#Variation coefficient
var_cf_median_price <- (sd_median_price/mean_median_price)*100
#Range
range_median_price <- range(median_price)
#Quantile and IQR
q_median_price <- quantile(median_price, seq(0,1, 0.25))
iqr_median_price <- IQR(median_price)
#Skewness
skew_median_price <- skewness(median_price)
#Kurtosis
kurt_median_price <- kurtosis(median_price)-3
pos_mprice <- which(names(data) == "median_price")
st_index_mprice <- stat_index_vect(pos_mprice, mean_median_price, median_median_price, sd_median_price, var_cf_median_price, iqr_median_price, skew_median_price, kurt_median_price)
#Plot of median price distribution with mean and median
ggplot(data = data)+
geom_histogram(aes(x = median_price),
binwidth = 2500,
col = "black",
fill = "blue")+
geom_vline(xintercept = mean_median_price, col = "red", lwd = 1)+
geom_vline(xintercept = median_median_price, col = "pink", lwd = 1)+
annotate("label", x = mean_median_price + 12000, y = 18, label = round(mean_median_price, 2), color = "red") +
annotate("label", x = median_median_price - 12000, y = 18, label = median_median_price, color = "pink") +
labs(title = "Median price distribution",
x = "Median price",
y = "Absolute frequencies")+
theme_classic()
#Boxplot for variable "median price"
ggplot(data = data)+
geom_boxplot(aes(y = median_price), fill = "blue")+
labs(title = "Median price boxplot",
x = "Relative frequencies",
y = "Median price")+
theme_classic()
This is a numeric variable. First of all, we can explore the frequency of each value of the variable and calculate the main statistical indices.
#Frequency distribution of the variable
dist_listings_freq <- dist_freq(listings)
#Gini index calculation
gini_listings <- gini_index(dist_listings_freq$ni)
#Arithmetic mean
mean_listings <- mean(listings)
#Median
median_listings <- median(listings)
#Standard deviation
sd_listings <- sd(listings)
#Variation coefficient
var_cf_listings <- (sd_listings/mean_listings)*100
#Range
range_listings <- range(listings)
#Quantile and IQR
q_listings <- quantile(listings, seq(0,1, 0.25))
iqr_listings <- IQR(listings)
#Skewness
skew_listings <- skewness(listings)
#Kurtosis
kurt_listings <- kurtosis(listings)-3
pos_listings <- which(names(data) == "listings")
st_index_listings <- stat_index_vect(pos_listings, mean_listings, median_listings, sd_listings, var_cf_listings, iqr_listings, skew_listings, kurt_listings)
#Plot of listings distribution with mean and median
ggplot(data = data)+
geom_histogram(aes(x = listings),
binwidth = 100,
col = "black",
fill = "blue")+
geom_vline(xintercept = mean_listings, col = "red", lwd = 1)+
geom_vline(xintercept = median_listings, col = "pink", lwd = 1)+
annotate("label", x = mean_listings + 250, y = 20, label = round(mean_listings, 2), color = "red") +
annotate("label", x = median_listings - 250, y = 20, label = median_listings, color = "pink") +
labs(title = "Number of listings distribution",
x = "Number of listings",
y = "Absolute frequencies")+
theme_classic()
#Boxplot for variable "listings"
ggplot(data = data)+
geom_boxplot(aes(y = listings), fill = "blue")+
labs(title = "Number of listings boxplot",
x = "Relative frequencies",
y = "Number of listings")+
theme_classic()
This is a numeric variable. First of all, we can explore the frequency of each value of the variable and calculate the main statistical indices.
#Frequency distribution of the variable
dist_minv_freq <- dist_freq(months_inventory)
#Gini index calculation
gini_minv <- gini_index(dist_minv_freq$ni)
#Arithmetic mean
mean_minv <- mean(months_inventory)
#Median
median_minv <- median(months_inventory)
#Standard deviation
sd_minv <- sd(months_inventory)
#Variation coefficient
var_cf_minv <- (sd_minv/mean_minv)*100
#Range
range_minv <- range(months_inventory)
#Quantile and IQR
q_minv <- quantile(months_inventory, seq(0,1, 0.25))
iqr_minv <- IQR(months_inventory)
#Skewness
skew_minv <- skewness(months_inventory)
#Kurtosis
kurt_minv <- kurtosis(months_inventory)-3
pos_minv <- which(names(data) == "months_inventory")
st_index_minv <- stat_index_vect(pos_minv, mean_minv, median_minv, sd_minv, var_cf_minv, iqr_minv, skew_minv, kurt_minv)
#Plot of months inventory distribution with mean and median
ggplot(data = data)+
geom_histogram(aes(x = months_inventory),
binwidth = 1,
col = "black",
fill = "blue")+
geom_vline(xintercept = mean_minv, col = "red", lwd = 1)+
geom_vline(xintercept = median_minv, col = "pink", lwd = 1)+
annotate("label", x = mean_minv + 0.5, y = 45, label = round(mean_minv, 2), color = "red") +
annotate("label", x = median_minv - 0.5, y = 45, label = median_minv, color = "pink") +
labs(title = "Months inventory distribution",
x = "Months inventory",
y = "Absolute frequencies")+
theme_classic()
#Boxplot for variable "months inventory"
ggplot(data = data)+
geom_boxplot(aes(y = months_inventory), fill = "blue")+
labs(title = "Months frequencies boxplot",
x = "Relative frequencies",
y = "Months inventory")+
theme_classic()
Below are shown some analyses of the variables:
1) Classification of the variables in decreasing order of variability
2) Classification of the variables in decreasing order of skewness
3) Creation of classes for the variables and their analysis
4) Examples of occurrence probabilities for values of the variables
5) Creation and analysis of new variables, such as the mean price and an efficiency index of listings
The variable with the greatest variability is “volume”, with a coefficient of variation of about 53.7%, while the variable “median_price” has the smallest variability (about 17.1%). The results of the statistical analysis are shown below; a quick cross-check of the coefficients of variation is sketched after the table.
#Creation of matrix of statistical index for each variable in dataset
m_stat_index <- as.data.frame(rbind(
st_index_sales,
st_index_volume,
st_index_mprice,
st_index_listings,
st_index_minv
))
m_stat_index_cv <- m_stat_index[order(m_stat_index$cf_v, decreasing = TRUE), ]
#Table of statistical indices
kable(m_stat_index_cv, caption = "Statistical indices for each variable ordered by variability", row.names = FALSE)
| name_v | mean_v | median_v | sd_v | cf_v | iqr_v | skew_v | kurt_v |
|---|---|---|---|---|---|---|---|
| volume | 31.0051875 | 27.0625 | 16.6514471564494 | 53.7053586805415 | 23.2335 | 0.884742026325995 | 0.176986997089741 |
| listings | 1738.02083333333 | 1618.5 | 752.707756098841 | 43.3083275909432 | 1029.5 | 0.649498226273971 | -0.791790033332591 |
| sales | 192.291666666667 | 175.5 | 79.6511111777793 | 41.4220296482492 | 120 | 0.718104024884959 | -0.313176409071494 |
| months_inventory | 9.1925 | 8.95 | 2.30366862229334 | 25.0603059264982 | 3.15 | 0.040975265871081 | -0.174447541638487 |
| median_price | 132665.416666667 | 134500 | 22662.148686505 | 17.0821825732064 | 32750 | -0.364552878177372 | -0.622961820755544 |
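As a cross-check of the ranking above, the coefficients of variation can also be computed directly from the raw columns; this is a minimal sketch using base R (num_vars and cv_check are helper names introduced here):
#Cross-check: variation coefficients computed directly on the raw numeric columns
num_vars <- c("sales", "volume", "median_price", "listings", "months_inventory")
cv_check <- sapply(data[num_vars], function(x) sd(x)/mean(x)*100)
sort(cv_check, decreasing = TRUE)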
The variable with the greatest asymmetry is “volume”, with a skewness of about 0.88, while the variable “months_inventory” has the smallest asymmetry (skewness of about 0.04). The results of the statistical analysis are shown below; a manual computation of the skewness of “volume” is sketched after the table.
#Matrix ordered by decreasing asimmetry
m_stat_index$skew_v <- as.numeric(as.character(m_stat_index$skew_v))
m_stat_index_as <- m_stat_index[order(abs(m_stat_index$skew_v), decreasing = TRUE), ]
#Table of statistical indices
kable(m_stat_index_as, caption = "Statistical indices for each variable ordered by skewness", row.names = FALSE)
| name_v | mean_v | median_v | sd_v | cf_v | iqr_v | skew_v | kurt_v |
|---|---|---|---|---|---|---|---|
| volume | 31.0051875 | 27.0625 | 16.6514471564494 | 53.7053586805415 | 23.2335 | 0.8847420 | 0.176986997089741 |
| sales | 192.291666666667 | 175.5 | 79.6511111777793 | 41.4220296482492 | 120 | 0.7181040 | -0.313176409071494 |
| listings | 1738.02083333333 | 1618.5 | 752.707756098841 | 43.3083275909432 | 1029.5 | 0.6494982 | -0.791790033332591 |
| median_price | 132665.416666667 | 134500 | 22662.148686505 | 17.0821825732064 | 32750 | -0.3645529 | -0.622961820755544 |
| months_inventory | 9.1925 | 8.95 | 2.30366862229334 | 25.0603059264982 | 3.15 | 0.0409753 | -0.174447541638487 |
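For reference, the skewness used here (from the moments package) is the third standardized moment; the following minimal sketch reproduces the value for “volume” manually (x_vol, m2_vol and m3_vol are helper names):
#Manual skewness for "volume": third central moment over the second raised to 3/2
x_vol <- data$volume
n_vol <- length(x_vol)
m2_vol <- sum((x_vol - mean(x_vol))^2)/n_vol
m3_vol <- sum((x_vol - mean(x_vol))^3)/n_vol
m3_vol/m2_vol^(3/2)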
This analysis divides each variable into classes defined by its quartiles, to verify whether the observations are evenly distributed among the four theoretical groups, each representing approximately 25% of the data. Quartiles were chosen to segment the data into balanced subsets, facilitating the evaluation of distribution uniformity and the identification of anomalies such as duplicate values at the interval boundaries. Statistically, this approach shows that sales, volume, median_price, and listings split almost evenly into the four quartile classes, with Gini indices equal or very close to 0. In contrast, months_inventory shows a slightly uneven distribution, suggesting ties or small clusters at the quartile boundaries that cause the asymmetry and call for further examination or alternative segmentation methods.
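The classes below are built with cut() using the quartiles as breaks; a minimal sketch on a small hypothetical vector (x_demo) shows how repeated values at a quartile boundary make the four classes uneven, which is the effect observed for months_inventory:
#Illustration on a hypothetical vector: ties at a quartile boundary inflate one class
x_demo <- c(1, 2, 2, 2, 3, 4, 5, 6, 7, 8)
table(cut(x_demo, breaks = quantile(x_demo), include.lowest = TRUE))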
#Function to create classes for each variable based on their respective quartiles
create_class <- function(data, field, quantiles) {
vec <- data[[field]]
class <- cut(vec,
breaks = quantiles,
include.lowest = TRUE,
labels = c("Q1", "Q2", "Q3", "Q4"))
return(class)
}
#Class for variable "sales", frequency distribution and Gini index
data$sales_class <- create_class(data, "sales", q_sales)
dist_sales_class <- dist_freq(data$sales_class)
gini_sales_class <- gini_index(dist_sales_class$ni)
#Class for variable "volume", frequency distribution and Gini index
data$volume_class <- create_class(data, "volume", q_volume)
dist_volume_class <- dist_freq(data$volume_class)
gini_volume_class <- gini_index(dist_volume_class$ni)
#Class for variable "median_price", frequency distribution and Gini index
data$mdpr_class <- create_class(data, "median_price", q_median_price)
dist_mdpr_class <- dist_freq(data$mdpr_class)
gini_mdpr_class <- gini_index(dist_mdpr_class$ni)
#Class for variable "listings", frequency distribution and Gini index
data$list_class <- create_class(data, "listings", q_listings)
dist_list_class <- dist_freq(data$list_class)
gini_list_class <- gini_index(dist_list_class$ni)
#Class for variable "months_inventory", frequency distribution and Gini index
data$minv_class <- create_class(data, "months_inventory", q_minv)
dist_minv_class <- dist_freq(data$minv_class)
gini_minv_class <- gini_index(dist_minv_class$ni)
#Function to extract only value and absolute frequencies from dataset
exct_data <- function(data) {
exct_data <- data[,1:2]
return(exct_data)
}
#Extract data from frequencies total distribution for each variable and built of new dataset
d_s_class <- exct_data(dist_sales_class)
d_s_class$name_var <- "sales"
d_v_class <- exct_data(dist_volume_class)
d_v_class$name_var <- "volume"
d_mp_class <- exct_data(dist_mdpr_class)
d_mp_class$name_var <- "median_price"
d_l_class <- exct_data(dist_list_class)
d_l_class$name_var <- "listings"
d_mv_class <- exct_data(dist_minv_class)
d_mv_class$name_var <- "months_inventory"
dist_freq_class <- rbind(d_s_class, d_v_class, d_mp_class, d_l_class, d_mv_class)
#Plot of absolute frequency distribution of class (quartile) of variable "sales"
ggplot(data = dist_freq_class)+
geom_bar(aes(x = value, y = ni), stat = "identity", color = "black", fill = "blue") +
facet_wrap(~ name_var, scales = "free_x")+
labs(title = "Absolute frequencies for class (quartiles) for each variable",
x = "Classes (quartiles)",
y = "Absolute frequencies")+
theme_classic()
#Table of Gini indices for each variable
v_gini <- as.data.frame(cbind(gini_sales_class, gini_volume_class, gini_mdpr_class, gini_list_class, gini_minv_class))
kable(v_gini,
caption = "Gini index of classes built from quartiles for each variable", row.names = FALSE)
| gini_sales_class | gini_volume_class | gini_mdpr_class | gini_list_class | gini_minv_class |
|---|---|---|---|---|
| 0.0083333 | 0 | 0.0083333 | 0 | 0.0416667 |
#Function to create dataset with value and relative frequencies for each variable
exct_freq <- function(data) {
exct_r1 <- data[, 1]
exct_r2 <- data[,3]
exct_rows <- as.data.frame(cbind(exct_r1, exct_r2))
return(exct_rows)
}
#Creation of a relative-frequency dataset for each categorical variable and of a combined dataset
f_city <- exct_freq(dist_city_freq)
f_year <- exct_freq(dist_year_freq)
f_month <- exct_freq(dist_months_freq)
f_date <- exct_freq(dist_dates_freq)
f_dataset <- as.data.frame(rbind(f_city, f_year, f_month, f_date))
#Example of some probabilities
p_beaumont <- f_dataset$exct_r2[f_dataset$exct_r1 == "Beaumont"]
print(paste("The probability that the city associated with a dataset record could be Beaumont is: ", p_beaumont))
## [1] "The probability that the city associated with a dataset record could be Beaumont is: 0.25"
p_tyler <- f_dataset$exct_r2[f_dataset$exct_r1 == "Tyler"]
print(paste("The probability that the city associated with a dataset record could be Tyler is: ", p_tyler))
## [1] "The probability that the city associated with a dataset record could be Tyler is: 0.25"
p_2011 <- f_dataset$exct_r2[f_dataset$exct_r1 == "2011"]
print(paste("The probability that the year associated with a dataset record could be 2011 is: ", p_2011))
## [1] "The probability that the year associated with a dataset record could be 2011 is: 0.2"
p_2011_02_01 <- f_dataset$exct_r2[f_dataset$exct_r1 == "2011-02-01"]
print(paste("The probability that the date associated with a dataset record could be 2011-02-01 is: ", p_2011_02_01))
## [1] "The probability that the date associated with a dataset record could be 2011-02-01 is: 0.0166666666666667"
Based on the variables in the dataset, we can create other variables:
1) Mean price = volume/sales
2) Efficiency listings index = (sales/listings) x 100
3) Variation of the efficiency listings index = ((sales - listings/months_inventory)/(listings/months_inventory)) x 100
A worked example with hypothetical values is shown below.
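As a worked example, applying the three formulas to a single hypothetical record with sales = 150, volume = 30 mln $, listings = 1000 and months_inventory = 8 (values invented only for illustration) gives a mean price of 0.2 mln $, an efficiency listings index of 15 and a variation of +20%:
#Worked example on a hypothetical record (values invented for illustration)
sales_ex <- 150; volume_ex <- 30; listings_ex <- 1000; minv_ex <- 8
volume_ex/sales_ex #mean price: 0.2 mln $
(sales_ex/listings_ex)*100 #efficiency listings index: 15
baseline_ex <- listings_ex/minv_ex #expected sales pace: 125
((sales_ex - baseline_ex)/baseline_ex)*100 #variation of the index: +20%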
#Variable "mean_price"
data$mean_price <- ifelse(sales == 0, 0, volume/sales)
#Variable "efficiency_list_index"
data$efficiency_list_index <- (ifelse(listings == 0, 0, sales/listings))*100
#Variable "variation_eff_list_index"
mean_eff_listings <- ifelse(months_inventory == 0, 0, listings/months_inventory)
data$mean_eff_list_index <- ifelse(mean_eff_listings == 0, 0,
((sales - mean_eff_listings)/(mean_eff_listings))*100)
This is a numeric variable. First of all, we can explore the frequency of each value of the variable and calculate the main statistical indices.
#Frequency distribution of the variable
dist_mpr_freq <- dist_freq(data$mean_price)
#Gini index calculation
gini_mpr <- gini_index(dist_mpr_freq$ni)
#Arithmetic mean
mean_mpr <- mean(data$mean_price)
#Median
median_mpr <- median(data$mean_price)
#Standard deviation
sd_mpr <- sd(data$mean_price)
#Variation coefficient
var_cf_mpr <- (sd_mpr/mean_mpr)*100
#Range
range_mpr <- range(data$mean_price)
#Quantile and IQR
q_mpr <- quantile(data$mean_price, seq(0,1, 0.25))
iqr_mpr <- IQR(data$mean_price)
#Skewness
skew_mpr <- skewness(data$mean_price)
#Kurtosis
kurt_mpr <- kurtosis(data$mean_price)-3
#Creation of statistical index to plot table
pos_mpr <- which(names(data) == "mean_price")
st_index_mpr <- stat_index_vect(pos_mpr, mean_mpr, median_mpr, sd_mpr, var_cf_mpr, iqr_mpr, skew_mpr, kurt_mpr)
kable(st_index_mpr, caption = "Statistical index for the variable mean_price", row.names = FALSE)
| name_v | mean_v | median_v | sd_v | cf_v | iqr_v | skew_v | kurt_v |
|---|---|---|---|---|---|---|---|
| mean_price | 0.154320365777658 | 0.156588479842459 | 0.0271474563315486 | 17.5916225928743 | 0.0409762176937704 | -0.0687052806534169 | -0.77843287102 |
#Plot of mean price distribution with mean and median
ggplot(data = data)+
geom_histogram(aes(x = mean_price),
binwidth = 0.01,
col = "black",
fill = "blue")+
geom_vline(xintercept = mean_mpr, col = "red", lwd = 1)+
geom_vline(xintercept = median_mpr, col = "pink", lwd = 1)+
annotate("label", x = mean_mpr - 0.01, y = 40, label = round(mean_mpr, 2), color = "red") +
annotate("label", x = median_mpr + 0.01, y = 40, label = round(median_mpr, 2), color = "pink") +
labs(title = "Mean price distribution ",
x = "Mean price (mln $)",
y = "Absolute frequencies")+
theme_classic()
#Boxplot for variable "mean price"
ggplot(data = data)+
geom_boxplot(aes(y = mean_price), fill = "blue")+
labs(title = "Mean price boxplot",
x = "Relative frequencies",
y = "Mean price (mln $)")+
theme_classic()
This is a numeric variable. First of all, we can explore the frequency of each value of the variable and calculate the main statistical indices.
#Frequency distribution of the variable
dist_eff_list_freq <- dist_freq(data$efficiency_list_index)
#Gini index calculation
gini_eff_list <- gini_index(dist_eff_list_freq$ni)
#Arithmetic mean
mean_eff_list <- mean(data$efficiency_list_index)
#Median
median_eff_list <- median(data$efficiency_list_index)
#Standard deviation
sd_eff_list <- sd(data$efficiency_list_index)
#Variation coefficient
var_cf_eff_list <- (sd_eff_list/mean_eff_list)*100
#Range
range_eff_list <- range(data$efficiency_list_index)
#Quantile and IQR
q_eff_list <- quantile(data$efficiency_list_index, seq(0,1, 0.25))
iqr_eff_list <- IQR(data$efficiency_list_index)
#Skewness
skew_eff_list <- skewness(data$efficiency_list_index)
#Kurtosis
kurt_eff_list <- kurtosis(data$efficiency_list_index)-3
#Creation of statistical index to plot table
pos_mlis <- which(names(data) == "efficiency_list_index")
st_index_mlis <- stat_index_vect(pos_mlis, mean_eff_list, median_eff_list, sd_eff_list, var_cf_eff_list, iqr_eff_list, skew_eff_list, kurt_eff_list)
kable(st_index_mlis, caption = "Statistical index for variable efficency_list_index", row.names = FALSE)
| name_v | mean_v | median_v | sd_v | cf_v | iqr_v | skew_v | kurt_v |
|---|---|---|---|---|---|---|---|
| efficiency_list_index | 11.8744921731651 | 10.9626835334067 | 4.68990009256993 | 39.4955845199724 | 4.51207215683548 | 2.08931813879356 | 6.88174747715347 |
#Plot of efficiency listings index distribution with mean and median
ggplot(data = data)+
geom_histogram(aes(x = efficiency_list_index),
binwidth = 1,
col = "black",
fill = "blue")+
geom_vline(xintercept = mean_eff_list, col = "red", lwd = 1)+
geom_vline(xintercept = median_eff_list, col = "pink", lwd = 1)+
annotate("label", x = mean_eff_list + 2, y = 40, label = round(mean_eff_list, 2), color = "red") +
annotate("label", x = median_eff_list - 2, y = 40, label = round(median_eff_list, 2), color = "pink") +
labs(title = "Efficiency listings index distribution",
x = "Efficiency listings index",
y = "Absolute frequencies")+
theme_classic()
#Boxplot for variable "efficiency listings index"
ggplot(data = data)+
geom_boxplot(aes(y = efficiency_list_index), fill = "blue")+
labs(title = "Efficiency listings index boxplot",
x = "Relative frequencies",
y = "Efficiency listings index")+
theme_classic()
This is a numeric variable. First of all, we can explore the frequency of each value of the variable and calculate the main statistical indices.
#Frequency distribution of the variable
dist_meff_list_freq <- dist_freq(data$mean_eff_list_index)
#Gini index calculation
gini_meff_list <- gini_index(dist_meff_list_freq$ni)
#Arithmetic mean
mean_meff_list <- mean(data$mean_eff_list_index)
#Median
median_meff_list <- median(data$mean_eff_list_index)
#Standard deviation
sd_meff_list <- sd(data$mean_eff_list_index)
#Variation coefficient
var_cf_meff_list <- (sd_meff_list/mean_meff_list)*100
#Range
range_meff_list <- range(data$mean_eff_list_index)
#Quantile and IQR
q_meff_list <- quantile(data$mean_eff_list_index, seq(0,1, 0.25))
iqr_meff_list <- IQR(data$mean_eff_list_index)
#Skewness
skew_meff_list <- skewness(data$mean_eff_list_index)
#Kurtosis
kurt_meff_list <- kurtosis(data$mean_eff_list_index)-3
#Creation of statistical index to plot table
pos_mel <- which(names(data) == "mean_eff_list_index")
st_index_mel <- stat_index_vect(pos_mel, mean_meff_list, median_meff_list, sd_meff_list, var_cf_meff_list, iqr_meff_list, skew_meff_list, kurt_meff_list)
kable(st_index_mel, caption = "Statistical index for variation of efficiency listings index", row.names = FALSE)
| name_v | mean_v | median_v | sd_v | cf_v | iqr_v | skew_v | kurt_v |
|---|---|---|---|---|---|---|---|
| mean_eff_list_index | 2.07809270544987 | 2.06321958662783 | 24.8726892383184 | 1196.89988676102 | 34.0724048034506 | 0.465258599586698 | 0.171896889361363 |
#Plot of variation efficiency listings index distribution with mean and median
ggplot(data = data)+
geom_histogram(aes(x = mean_eff_list_index),
binwidth = 5,
col = "black",
fill = "blue")+
geom_vline(xintercept = mean_meff_list, col = "red", lwd = 1)+
geom_vline(xintercept = median_meff_list, col = "pink", lwd = 1)+
annotate("label", x = mean_meff_list + 10, y = 25, label = round(mean_meff_list, 2), color = "red") +
annotate("label", x = median_meff_list - 10, y = 25, label = round(median_meff_list, 2), color = "pink") +
labs(title = "Variation of efficiency listings index distribution",
x = "Variation of efficiency listings index (%)",
y = "Absolute frequencies")+
theme_classic()
#Boxplot for variable "variation efficiency listings index"
ggplot(data = data)+
geom_boxplot(aes(y = mean_eff_list_index), fill = "blue")+
labs(title = "Variation of efficiency listings index boxplot",
x = "Relative frequencies",
y = "Variation of efficiency listings index (%)")+
theme_classic()
Bivariate analysis by city: Tyler is the city with the highest total and average number of sales, while Wichita Falls has the lowest indicators. The variability in the number of sales is around 19-23% for all cities except Bryan-College Station, which shows a much higher variability (about 41%). This situation suggests a healthier market in Tyler and Bryan-College Station, although the latter has high variability and is therefore more subject to temporal fluctuations, and a somewhat less healthy market in Beaumont and Wichita Falls, albeit with lower temporal fluctuations.
Bivariate analysis by year and month: The analysis shows that the total and average number of sales peak around mid-year, regardless of the year. This suggests a seasonal trend in the data, which should be confirmed through the analysis of the historical data trends. Additionally, three peaks (at the beginning, middle, and end of the year) in sales variability are observed, independently of the year. Overall, the results indicate real estate markets subject to seasonality, with differences in sales opportunities based on different times of the year and varying peaks in sales variability. These factors should be considered, for example, when managing advertising campaigns.
Bivariate analysis by date: The analysis confirms the seasonal pattern of the data for each city, showing a periodic and repetitive trend with a peak in the number of sales around mid-year. The temporal distribution of the boxplots by city further highlights the mid-year sales peak and the three variability peaks identified earlier. Overall, these findings confirm at the local level, with some differences, what was observed at the global level. Moreover, the temporal trends support the analysis of sales numbers by city, identifying a positive sales trend for all cities except Wichita Falls, which shows a downward trend; the upward trend is strongest for Tyler and Bryan-College Station, which also have the highest total and average sales.
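To confirm the suspected seasonality beyond the bar charts, the total monthly sales can be decomposed into trend, seasonal and irregular components; the following is a minimal sketch using a classical additive decomposition (the additive model and the helper names monthly_sales and sales_ts are assumptions of this sketch):
#Sketch: classical additive decomposition of total monthly sales (all cities combined)
monthly_sales <- data %>%
group_by(date) %>%
summarise(total_sales = sum(sales), .groups = "drop") %>%
arrange(date)
sales_ts <- ts(monthly_sales$total_sales, start = c(2010, 1), frequency = 12)
plot(decompose(sales_ts))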
#Function to aggregate data by city
agg_city <- function (data, city, kpi){
data_by_city <- data %>%
group_by({{city}}) %>%
summarise(num_kpi = sum({{kpi}}), mean_kpi = mean({{kpi}}), sd_kpi = sd({{kpi}}), .groups = "drop") %>%
mutate(cv_kpi = (sd_kpi/mean_kpi)*100)
return(data_by_city)
}
#Total number, mean and standard deviation of sales by city
sales_by_city <- agg_city(data, city, sales)
names(sales_by_city)[2:5] <- c("num_sales", "mean_sales", "sd_sales", "cv_sales")
kable(sales_by_city, caption = "Statistical index for number of sales aggregated by city", row.names = FALSE)
| city | num_sales | mean_sales | sd_sales | cv_sales |
|---|---|---|---|---|
| Beaumont | 10643 | 177.3833 | 41.48395 | 23.38661 |
| Bryan-College Station | 12358 | 205.9667 | 84.98374 | 41.26092 |
| Tyler | 16185 | 269.7500 | 61.96380 | 22.97083 |
| Wichita Falls | 6964 | 116.0667 | 22.15192 | 19.08551 |
#Plot of total number of sales by city
ggplot(data = sales_by_city)+
geom_bar(aes(x = city, y = num_sales),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Number of sales by city",
x = "City",
y = "Number of sales")+
theme_classic()
#Plot of mean number of sales by city
ggplot(data = sales_by_city)+
geom_bar(aes(x = city, y = mean_sales),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Mean number of sales by city",
x = "City",
y = "Mean number of sales")+
theme_classic()
#Plot of variation coefficient of number of sales by city
ggplot(data = sales_by_city)+
geom_bar(aes(x = city, y = cv_sales),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Number of sales variability by city",
x = "City",
y = "Number of sales variability")+
theme_classic()
#Function to aggregate data by month and year
agg_date <- function (data, year, month, kpi){
data_by_date <- data %>%
group_by({{year}}, {{month}}) %>%
summarise(num_kpi = sum({{kpi}}), mean_kpi = mean({{kpi}}), sd_kpi = sd({{kpi}}), .groups = "drop") %>%
mutate(cv_kpi = (sd_kpi/mean_kpi)*100)
return(data_by_date)
}
#Total number, mean and standard deviation of sales by month and year
sales_by_date <- agg_date(data, year, month, sales)
names(sales_by_date)[3:6] <- c("num_sales", "mean_sales", "sd_sales", "cv_sales")
kable(sales_by_date, caption = "Statistical index for number of sales aggregated by year and month", row.names = FALSE)
| year | month | num_sales | mean_sales | sd_sales | cv_sales |
|---|---|---|---|---|---|
| 2010 | Jan | 421 | 105.25 | 36.60943 | 34.78330 |
| 2010 | Feb | 487 | 121.75 | 40.26061 | 33.06826 |
| 2010 | Mar | 755 | 188.75 | 43.59950 | 23.09907 |
| 2010 | Apr | 916 | 229.00 | 63.95311 | 27.92712 |
| 2010 | May | 931 | 232.75 | 58.84089 | 25.28072 |
| 2010 | Jun | 866 | 216.50 | 71.44462 | 32.99982 |
| 2010 | Jul | 712 | 178.00 | 62.50867 | 35.11723 |
| 2010 | Aug | 738 | 184.50 | 45.00000 | 24.39024 |
| 2010 | Sep | 598 | 149.50 | 47.19816 | 31.57068 |
| 2010 | Oct | 565 | 141.25 | 45.70467 | 32.35729 |
| 2010 | Nov | 503 | 125.75 | 30.99866 | 24.65102 |
| 2010 | Dec | 604 | 151.00 | 40.70217 | 26.95508 |
| 2011 | Jan | 425 | 106.25 | 27.03547 | 25.44515 |
| 2011 | Feb | 469 | 117.25 | 44.25965 | 37.74810 |
| 2011 | Mar | 668 | 167.00 | 52.42773 | 31.39385 |
| 2011 | Apr | 716 | 179.00 | 58.64583 | 32.76303 |
| 2011 | May | 780 | 195.00 | 70.28039 | 36.04123 |
| 2011 | Jun | 885 | 221.25 | 93.93038 | 42.45441 |
| 2011 | Jul | 812 | 203.00 | 69.95713 | 34.46164 |
| 2011 | Aug | 786 | 196.50 | 70.27802 | 35.76490 |
| 2011 | Sep | 629 | 157.25 | 67.60855 | 42.99431 |
| 2011 | Oct | 594 | 148.50 | 57.57604 | 38.77174 |
| 2011 | Nov | 549 | 137.25 | 49.37864 | 35.97715 |
| 2011 | Dec | 565 | 141.25 | 48.25194 | 34.16067 |
| 2012 | Jan | 499 | 124.75 | 29.78115 | 23.87266 |
| 2012 | Feb | 574 | 143.50 | 57.61076 | 40.14687 |
| 2012 | Mar | 711 | 177.75 | 66.69020 | 37.51910 |
| 2012 | Apr | 747 | 186.75 | 52.77863 | 28.26165 |
| 2012 | May | 882 | 220.50 | 90.71751 | 41.14173 |
| 2012 | Jun | 898 | 224.50 | 86.18004 | 38.38755 |
| 2012 | Jul | 928 | 232.00 | 89.81462 | 38.71320 |
| 2012 | Aug | 954 | 238.50 | 87.99432 | 36.89489 |
| 2012 | Sep | 707 | 176.75 | 78.20646 | 44.24694 |
| 2012 | Oct | 742 | 185.50 | 79.80601 | 43.02211 |
| 2012 | Nov | 650 | 162.50 | 37.24245 | 22.91843 |
| 2012 | Dec | 643 | 160.75 | 52.20073 | 32.47324 |
| 2013 | Jan | 576 | 144.00 | 49.22059 | 34.18097 |
| 2013 | Feb | 593 | 148.25 | 54.90219 | 37.03351 |
| 2013 | Mar | 814 | 203.50 | 64.04426 | 31.47138 |
| 2013 | Apr | 878 | 219.50 | 74.54082 | 33.95937 |
| 2013 | May | 1057 | 264.25 | 90.36362 | 34.19626 |
| 2013 | Jun | 1045 | 261.25 | 108.21699 | 41.42277 |
| 2013 | Jul | 1127 | 281.75 | 122.70391 | 43.55063 |
| 2013 | Aug | 1107 | 276.75 | 92.01585 | 33.24873 |
| 2013 | Sep | 814 | 203.50 | 66.00253 | 32.43367 |
| 2013 | Oct | 738 | 184.50 | 65.97727 | 35.76004 |
| 2013 | Nov | 690 | 172.50 | 65.07688 | 37.72573 |
| 2013 | Dec | 733 | 183.25 | 64.04881 | 34.95160 |
| 2014 | Jan | 627 | 156.75 | 61.34805 | 39.13751 |
| 2014 | Feb | 694 | 173.50 | 62.21736 | 35.86015 |
| 2014 | Mar | 841 | 210.25 | 85.35563 | 40.59721 |
| 2014 | Apr | 977 | 244.25 | 84.10063 | 34.43219 |
| 2014 | May | 1127 | 281.75 | 112.15577 | 39.80684 |
| 2014 | Jun | 1177 | 294.25 | 134.62386 | 45.75152 |
| 2014 | Jul | 1136 | 284.00 | 122.29745 | 43.06248 |
| 2014 | Aug | 1044 | 261.00 | 89.70693 | 34.37047 |
| 2014 | Sep | 899 | 224.75 | 103.54186 | 46.06979 |
| 2014 | Oct | 959 | 239.75 | 106.31518 | 44.34418 |
| 2014 | Nov | 745 | 186.25 | 84.50000 | 45.36913 |
| 2014 | Dec | 843 | 210.75 | 91.73649 | 43.52858 |
#Plot of total number of sales by month and year
ggplot(data = sales_by_date)+
geom_bar(aes(x = month, y = num_sales),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = num_sales, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Number of sales by year and month",
x = "Months",
y = "Number of sales")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of mean number of sales by month and year
ggplot(data = sales_by_date)+
geom_bar(aes(x = month, y = mean_sales),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = mean_sales, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Mean number of sales by year and month",
x = "Months",
y = "Mean number of sales")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of variation coefficient of sales by month and year
ggplot(data = sales_by_date)+
geom_bar(aes(x = month, y = cv_sales),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = cv_sales, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Number of sales variability by year and month",
x = "Months",
y = "Number of sales variability")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Function to aggregate data by date and city
agg_datecity <- function (data, city, date, kpi){
data_by_datecity <- data %>%
group_by({{city}}, {{date}}) %>%
summarise(num_kpi = sum({{kpi}}), .groups = "drop")
return(data_by_datecity)
}
#Total number, mean and standard deviation of sales by date and city for trend analysis
sales_by_datecity <- agg_datecity(data, city, date, sales)
names(sales_by_datecity)[3] <- c("num_sales")
#Plot by year and city to explore trend
ggplot(data = sales_by_datecity)+
geom_smooth(aes(x = date, y = num_sales, colour = city), lwd = 1, method = "loess")+
geom_point(aes(x = date, y = num_sales, colour = city))+
labs(title = "Number of sales trend by city",
x = "Period",
y = "Number of sales")+
facet_wrap(~ city, scales = "free_x")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using formula = 'y ~ x'
#Boxplot of number of sales by date and city
ggplot(data = data)+
geom_boxplot(aes(x = month, y = sales, fill = city))+
facet_wrap(~ city, scales = "free_x")+
labs(title = "Number of sales distribution by city and months",
x = "Months",
y = "Number of sales")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of number of sales by month, year and city
ggplot(data = data)+
geom_bar(aes(x = month, y = sales, fill = city),
stat = "identity",
position = "stack",
col = "black")+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Number of sales by year, month and city",
x = "Months",
y = "Number of sales")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
Bivariate analysis by city: Tyler is the city with the highest total and average volume of sales, while Wichita Falls has the lowest indicators. The variability in the volume of sales is around 23-29% for all cities except Bryan-College Station, which shows a much higher variability (about 45%). This situation suggests a healthier market in Tyler and Bryan-College Station, although the latter has high variability and is therefore more subject to temporal fluctuations, and a somewhat less healthy market in Beaumont and Wichita Falls, albeit with lower temporal fluctuations.
Bivariate analysis by year and month: The analysis shows that the total and average volume of sales peak around mid-year, regardless of the year. This suggests a seasonal trend in the data, which should be confirmed through the analysis of the historical data trends. Additionally, three peaks (at the beginning, middle, and end of the year) in volume of sales variability are observed, independently of the year. Overall, the results indicate real estate markets subject to seasonality, with differences in sales opportunities based on different times of the year and varying peaks in sales variability. These factors should be considered, for example, when managing advertising campaigns.
Bivariate analysis by date: The analysis confirms the seasonal pattern of the data for each city, showing a periodic and repetitive trend with a peak in the volume of sales around mid-year. The temporal distribution of the boxplots by city further highlights the mid-year sales peak and the three variability peaks identified earlier. Overall, these findings confirm at the local level, with some differences, what was observed at the global level. Moreover, the temporal trends support the analysis of sales volume by city, identifying a positive trend for all cities except Wichita Falls, which shows a downward trend; the upward trend is strongest for Tyler and Bryan-College Station, which also have the highest total and average sales volumes.
Overall, the analysis suggests a likely linear relationship between the number of sales and the value of sales, which should be further confirmed by investigating the relationship between the two variables.
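A quick screening of this relationship is sketched below with a correlation coefficient and a simple linear fit (fit_sales_volume is a helper name); the relationship is examined in more detail in the bivariate analysis of the numerical variables:
#Quick screening of the relationship between number of sales and volume of sales
cor(data$sales, data$volume)
fit_sales_volume <- lm(volume ~ sales, data = data)
summary(fit_sales_volume)$r.squared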
#Total, mean and standard deviation of volume by city
volume_by_city <- agg_city(data, city, volume)
names(volume_by_city)[2:5] <- c("num_volume", "mean_volume", "sd_volume", "cv_volume")
kable(volume_by_city, caption = "Statistical index for volume of sales in mln $ aggregated by city", row.names = FALSE)
| city | num_volume | mean_volume | sd_volume | cv_volume |
|---|---|---|---|---|
| Beaumont | 1567.896 | 26.13160 | 6.970384 | 26.67416 |
| Bryan-College Station | 2291.496 | 38.19160 | 17.248577 | 45.16327 |
| Tyler | 2746.043 | 45.76738 | 13.107146 | 28.63862 |
| Wichita Falls | 835.810 | 13.93017 | 3.239766 | 23.25719 |
#Plot of volume of sales by city
ggplot(data = volume_by_city)+
geom_bar(aes(x = city, y = num_volume),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Volume of sales by city",
x = "City",
y = "Volume of sales (mln $)")+
theme_classic()
#Plot of mean volume of sales by city
ggplot(data = volume_by_city)+
geom_bar(aes(x = city, y = mean_volume),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Mean volume of sales by city",
x = "City",
y = "Mean volume of sales (mln $)")+
theme_classic()
#Plot of variation coefficient of volume of sales by city
ggplot(data = volume_by_city)+
geom_bar(aes(x = city, y = cv_volume),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Volume of sales variability by city",
x = "City",
y = "Volume of sales variability")+
theme_classic()
#Total number, mean and standard deviation of volume by month and year
volume_by_date <- agg_date(data, year, month, volume)
names(volume_by_date)[3:6] <- c("num_volume", "mean_volume", "sd_volume", "cv_volume")
kable(volume_by_date, caption = "Statistical index for volume of sales in mln $ aggregated by year and month", row.names = FALSE)
| year | month | num_volume | mean_volume | sd_volume | cv_volume |
|---|---|---|---|---|---|
| 2010 | Jan | 63.751 | 15.93775 | 6.922791 | 43.43644 |
| 2010 | Feb | 76.897 | 19.22425 | 8.535588 | 44.40011 |
| 2010 | Mar | 111.876 | 27.96900 | 7.298540 | 26.09511 |
| 2010 | Apr | 135.196 | 33.79900 | 13.280337 | 39.29210 |
| 2010 | May | 143.689 | 35.92225 | 13.236626 | 36.84799 |
| 2010 | Jun | 138.191 | 34.54775 | 13.560836 | 39.25244 |
| 2010 | Jul | 106.764 | 26.69100 | 12.121784 | 45.41525 |
| 2010 | Aug | 114.356 | 28.58900 | 10.666873 | 37.31111 |
| 2010 | Sep | 88.826 | 22.20650 | 7.228736 | 32.55234 |
| 2010 | Oct | 87.164 | 21.79100 | 8.068802 | 37.02814 |
| 2010 | Nov | 73.663 | 18.41575 | 5.684010 | 30.86494 |
| 2010 | Dec | 92.070 | 23.01750 | 7.441038 | 32.32774 |
| 2011 | Jan | 60.658 | 15.16450 | 5.313391 | 35.03835 |
| 2011 | Feb | 69.379 | 17.34475 | 8.107725 | 46.74455 |
| 2011 | Mar | 101.566 | 25.39150 | 10.674532 | 42.03979 |
| 2011 | Apr | 110.267 | 27.56675 | 11.256460 | 40.83347 |
| 2011 | May | 122.036 | 30.50900 | 15.375147 | 50.39545 |
| 2011 | Jun | 138.986 | 34.74650 | 18.521100 | 53.30350 |
| 2011 | Jul | 130.855 | 32.71375 | 14.836718 | 45.35315 |
| 2011 | Aug | 120.816 | 30.20400 | 14.510272 | 48.04089 |
| 2011 | Sep | 95.103 | 23.77575 | 12.214512 | 51.37382 |
| 2011 | Oct | 88.274 | 22.06850 | 10.280971 | 46.58663 |
| 2011 | Nov | 86.288 | 21.57200 | 10.695940 | 49.58251 |
| 2011 | Dec | 83.347 | 20.83675 | 8.284211 | 39.75769 |
| 2012 | Jan | 69.791 | 17.44775 | 6.837480 | 39.18832 |
| 2012 | Feb | 82.504 | 20.62600 | 10.198258 | 49.44370 |
| 2012 | Mar | 109.129 | 27.28225 | 12.719385 | 46.62147 |
| 2012 | Apr | 112.128 | 28.03200 | 10.981393 | 39.17449 |
| 2012 | May | 147.526 | 36.88150 | 19.394133 | 52.58499 |
| 2012 | Jun | 149.335 | 37.33375 | 18.633920 | 49.91173 |
| 2012 | Jul | 152.463 | 38.11575 | 20.164871 | 52.90430 |
| 2012 | Aug | 154.271 | 38.56775 | 19.762346 | 51.24060 |
| 2012 | Sep | 112.759 | 28.18975 | 15.251248 | 54.10210 |
| 2012 | Oct | 115.004 | 28.75100 | 15.203367 | 52.87944 |
| 2012 | Nov | 102.547 | 25.63675 | 7.173133 | 27.97989 |
| 2012 | Dec | 97.386 | 24.34650 | 10.330214 | 42.42997 |
| 2013 | Jan | 91.739 | 22.93475 | 9.591138 | 41.81924 |
| 2013 | Feb | 89.946 | 22.48650 | 11.426154 | 50.81340 |
| 2013 | Mar | 125.502 | 31.37550 | 13.893366 | 44.28094 |
| 2013 | Apr | 145.621 | 36.40525 | 18.420387 | 50.59816 |
| 2013 | May | 184.474 | 46.11850 | 22.272422 | 48.29390 |
| 2013 | Jun | 184.315 | 46.07875 | 24.932241 | 54.10789 |
| 2013 | Jul | 188.551 | 47.13775 | 26.927092 | 57.12426 |
| 2013 | Aug | 185.637 | 46.40925 | 20.868560 | 44.96638 |
| 2013 | Sep | 136.781 | 34.19525 | 14.307844 | 41.84161 |
| 2013 | Oct | 124.589 | 31.14725 | 15.597292 | 50.07598 |
| 2013 | Nov | 110.195 | 27.54875 | 12.784377 | 46.40638 |
| 2013 | Dec | 119.965 | 29.99125 | 12.059098 | 40.20872 |
| 2014 | Jan | 94.076 | 23.51900 | 12.074357 | 51.33873 |
| 2014 | Feb | 114.304 | 28.57600 | 13.162092 | 46.05995 |
| 2014 | Mar | 139.621 | 34.90525 | 17.803333 | 51.00474 |
| 2014 | Apr | 162.877 | 40.71925 | 20.136292 | 49.45153 |
| 2014 | May | 196.317 | 49.07925 | 26.326048 | 53.63987 |
| 2014 | Jun | 215.236 | 53.80900 | 30.670720 | 56.99924 |
| 2014 | Jul | 203.805 | 50.95125 | 29.507853 | 57.91389 |
| 2014 | Aug | 185.203 | 46.30075 | 22.953933 | 49.57573 |
| 2014 | Sep | 158.514 | 39.62850 | 23.355544 | 58.93623 |
| 2014 | Oct | 166.541 | 41.63525 | 21.281137 | 51.11327 |
| 2014 | Nov | 123.450 | 30.86250 | 17.263882 | 55.93805 |
| 2014 | Dec | 149.125 | 37.28125 | 19.756553 | 52.99327 |
#Plot of volume of sales by month and year
ggplot(data = volume_by_date)+
geom_bar(aes(x = month, y = num_volume),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = num_volume, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Volume of sales by year and month",
x = "Months",
y = "Volume of sales (mln $)")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of mean volume of sales by month and year
ggplot(data = volume_by_date)+
geom_bar(aes(x = month, y = mean_volume),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = mean_volume, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Mean volume of sales by year and month",
x = "Months",
y = "Mean volume of sales (mln $)")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of variation coefficient of volume of sales by month and year
ggplot(data = volume_by_date)+
geom_bar(aes(x = month, y = cv_volume),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = cv_volume, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Volume of sales varibility by year and month",
x = "Months",
y = "Volume of sales variability")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Total volume of sales by date and city for trend analysis
volume_by_datecity <- agg_datecity(data, city, date, volume)
names(volume_by_datecity)[3] <- c("num_volume")
#Plot by year and city to explore trend
ggplot(data = volume_by_datecity)+
geom_smooth(aes(x = date, y = num_volume, colour = city), lwd = 1)+
geom_point(aes(x = date, y = num_volume, colour = city))+
labs(title = "Volume of sales trend by city",
x = "Period",
y = "Volume of sales (mln $)")+
facet_wrap(~ city, scales = "free_x")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Boxplot of volume of sales by date and city
ggplot(data = data)+
geom_boxplot(aes(x = month, y = volume, fill = city))+
facet_wrap(~ city, scales = "free_x")+
labs(title = "Volume of sales distribution by city and month",
x = "Months",
y = "Volume of sales (mln $)")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of volume of sales by month, year and city
ggplot(data = data)+
geom_bar(aes(x = month, y = volume, fill = city),
stat = "identity",
position = "stack",
col = "black")+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Volume of sales by year, month and city",
x = "Months",
y = "Volume of sales (mln $)")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
Bivariate analysis by city: Bryan-College Station is the city with the highest total and average median price, exceeding even Tyler: it shows generally higher median sale prices despite Tyler having a higher total sales value and number of sales. Wichita Falls, on the other hand, is confirmed as the city with the least healthy real estate market, having both the lowest median prices among the cities and the highest variability in median prices.
Bivariate analysis by year and month: The analysis shows a more stable pattern across months and years for the median sale price compared to the number and value of sales. No peaks are observed in the total or average median sale prices. This suggests that seasonality affects the number and value of sales but not house valuations and thus the median price. However, the variability over time reveals a more chaotic situation across years, without a clear pattern, indicating influences from more complex dynamics rather than periodic or recurring variability.
Bivariate analysis by date: The temporal analysis confirms a relatively stable trend in the median sale price, showing a slight increase for the cities of Bryan-College Station and Tyler, which are also the cities with the highest number and value of sales and the strongest upward trends. Unlike the previous analysis, this one highlights a seasonal trend more clearly, although it is less pronounced than that of the number and value of sales, confirming the impression that additional dynamics influence the median sale price, obscuring a purely seasonal pattern.
Overall, it appears that the median sale price does not follow a strictly linear relationship with the number and value of sales, suggesting a more complex dynamic influenced by local behaviors. This situation should be further confirmed through bivariate analysis of the relationship with the mentioned variables.
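As a preliminary screening before that bivariate analysis, the overall and per-city correlations of the median price with the number and volume of sales can be sketched as follows (cor_sales and cor_volume are helper names):
#Overall and per-city correlations of median price with number and volume of sales
cor(data$median_price, data$sales)
cor(data$median_price, data$volume)
data %>%
group_by(city) %>%
summarise(cor_sales = cor(median_price, sales),
cor_volume = cor(median_price, volume),
.groups = "drop")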
#Total, mean and standard deviation of median price by city
median_price_by_city <- agg_city(data, city, median_price)
names(median_price_by_city)[2:5] <- c("num_median_price", "mean_median_price", "sd_median_price", "cv_median_price")
kable(median_price_by_city, caption = "Statistical index for median price aggregated by city", row.names = FALSE)
| city | num_median_price | mean_median_price | sd_median_price | cv_median_price |
|---|---|---|---|---|
| Beaumont | 7799300 | 129988.3 | 10104.993 | 7.773769 |
| Bryan-College Station | 9449300 | 157488.3 | 8852.235 | 5.620883 |
| Tyler | 8486500 | 141441.7 | 9336.538 | 6.600981 |
| Wichita Falls | 6104600 | 101743.3 | 11320.034 | 11.126070 |
#Plot of total median price by city
ggplot(data = median_price_by_city)+
geom_bar(aes(x = city, y = num_median_price),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Total median price by city",
x = "City",
y = "Total median price")+
theme_classic()
#Plot of mean median price of sales by city
ggplot(data = median_price_by_city)+
geom_bar(aes(x = city, y = mean_median_price),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Mean of median price by city",
x = "City",
y = "Mean of median price")+
theme_classic()
#Plot of variation coefficient of median price of sales by city
ggplot(data = median_price_by_city)+
geom_bar(aes(x = city, y = cv_median_price),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Median price variability by city",
x = "City",
y = "Median price variability")+
theme_classic()
#Total number, mean and standard deviation of median price by month and year
median_price_by_date <- agg_date(data, year, month, median_price)
names(median_price_by_date)[3:6] <- c("num_median_price", "mean_median_price", "sd_median_price", "cv_median_price")
kable(median_price_by_date, caption = "Statistical index for median price by year and month", row.names = FALSE)
| year | month | num_median_price | mean_median_price | sd_median_price | cv_median_price |
|---|---|---|---|---|---|
| 2010 | Jan | 541800 | 135450 | 33735.69 | 24.906378 |
| 2010 | Feb | 524400 | 131100 | 31406.47 | 23.956121 |
| 2010 | Mar | 489300 | 122325 | 24973.24 | 20.415480 |
| 2010 | Apr | 513200 | 128300 | 18899.38 | 14.730618 |
| 2010 | May | 520500 | 130125 | 20121.53 | 15.463229 |
| 2010 | Jun | 533100 | 133275 | 15108.80 | 11.336560 |
| 2010 | Jul | 508900 | 127225 | 23351.71 | 18.354657 |
| 2010 | Aug | 524400 | 131100 | 23883.47 | 18.217748 |
| 2010 | Sep | 503300 | 125825 | 20739.56 | 16.482858 |
| 2010 | Oct | 536600 | 134150 | 25058.66 | 18.679586 |
| 2010 | Nov | 526400 | 131600 | 32545.87 | 24.730901 |
| 2010 | Dec | 527300 | 131825 | 16897.61 | 12.818212 |
| 2011 | Jan | 489000 | 122250 | 24234.20 | 19.823480 |
| 2011 | Feb | 484800 | 121200 | 24134.76 | 19.913169 |
| 2011 | Mar | 507000 | 126750 | 21773.76 | 17.178510 |
| 2011 | Apr | 529900 | 132475 | 15490.94 | 11.693480 |
| 2011 | May | 515100 | 128775 | 17313.46 | 13.444736 |
| 2011 | Jun | 521400 | 130350 | 24845.86 | 19.060881 |
| 2011 | Jul | 523000 | 130750 | 20059.99 | 15.342251 |
| 2011 | Aug | 541400 | 135350 | 20121.38 | 14.866185 |
| 2011 | Sep | 516400 | 129100 | 22192.04 | 17.189807 |
| 2011 | Oct | 496400 | 124100 | 35492.44 | 28.599871 |
| 2011 | Nov | 511600 | 127900 | 31642.06 | 24.739688 |
| 2011 | Dec | 501000 | 125250 | 22397.10 | 17.881915 |
| 2012 | Jan | 457000 | 114250 | 24834.72 | 21.737173 |
| 2012 | Feb | 508900 | 127225 | 20443.48 | 16.068759 |
| 2012 | Mar | 517100 | 129275 | 23645.21 | 18.290626 |
| 2012 | Apr | 505000 | 126250 | 26047.33 | 20.631547 |
| 2012 | May | 522300 | 130575 | 23545.05 | 18.031817 |
| 2012 | Jun | 544700 | 136175 | 15604.14 | 11.458887 |
| 2012 | Jul | 549500 | 137375 | 18896.27 | 13.755249 |
| 2012 | Aug | 540200 | 135050 | 24313.85 | 18.003592 |
| 2012 | Sep | 541100 | 135275 | 27574.31 | 20.383891 |
| 2012 | Oct | 500300 | 125075 | 28832.55 | 23.052207 |
| 2012 | Nov | 547000 | 136750 | 13484.93 | 9.861009 |
| 2012 | Dec | 510600 | 127650 | 27354.52 | 21.429318 |
| 2013 | Jan | 513300 | 128325 | 23111.96 | 18.010486 |
| 2013 | Feb | 522400 | 130600 | 20824.18 | 15.945011 |
| 2013 | Mar | 512900 | 128225 | 32607.20 | 25.429671 |
| 2013 | Apr | 524200 | 131050 | 28138.88 | 21.471867 |
| 2013 | May | 564100 | 141025 | 17159.33 | 12.167578 |
| 2013 | Jun | 558100 | 139525 | 26363.15 | 18.894931 |
| 2013 | Jul | 544800 | 136200 | 26459.65 | 19.427059 |
| 2013 | Aug | 558100 | 139525 | 23961.97 | 17.173959 |
| 2013 | Sep | 565400 | 141350 | 18549.12 | 13.122833 |
| 2013 | Oct | 562200 | 140550 | 24629.86 | 17.523913 |
| 2013 | Nov | 536600 | 134150 | 24053.90 | 17.930598 |
| 2013 | Dec | 552600 | 138150 | 19851.53 | 14.369549 |
| 2014 | Jan | 483900 | 120975 | 28066.04 | 23.199866 |
| 2014 | Feb | 561000 | 140250 | 24954.96 | 17.793198 |
| 2014 | Mar | 522000 | 130500 | 26588.59 | 20.374402 |
| 2014 | Apr | 557500 | 139375 | 26724.94 | 19.174846 |
| 2014 | May | 567700 | 141925 | 21542.57 | 15.178844 |
| 2014 | Jun | 595100 | 148775 | 16999.88 | 11.426569 |
| 2014 | Jul | 568800 | 142200 | 29364.04 | 20.649815 |
| 2014 | Aug | 569400 | 142350 | 31003.82 | 21.779991 |
| 2014 | Sep | 554600 | 138650 | 38758.10 | 27.953910 |
| 2014 | Oct | 574100 | 143525 | 25713.60 | 17.915763 |
| 2014 | Nov | 564500 | 141125 | 30097.77 | 21.327030 |
| 2014 | Dec | 576500 | 144125 | 32345.36 | 22.442576 |
#Plot of median price of sales by month and year
ggplot(data = median_price_by_date)+
geom_bar(aes(x = month, y = num_median_price),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = num_median_price, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Total median price by year and month",
x = "Months",
y = "Total median price")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of mean median price of sales by month and year
ggplot(data = median_price_by_date)+
geom_bar(aes(x = month, y = mean_median_price),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = mean_median_price, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Mean of median pruce by year and month",
x = "Months",
y = "Mean of median price")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of variation coefficient of median price of sales by month and year
ggplot(data = median_price_by_date)+
geom_bar(aes(x = month, y = cv_median_price),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = cv_median_price, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Median price variability by year and month",
x = "Months",
y = "Variability of median price")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Total volume of median price by date and city for trend analysis
median_price_by_datecity <- agg_datecity(data, city, date, median_price)
names(median_price_by_datecity)[3] <- c("num_median_price")
#Plot by year and city to explore trend
ggplot(data = median_price_by_datecity)+
geom_smooth(aes(x = date, y = num_median_price, colour = city), lwd = 1)+
geom_point(aes(x = date, y = num_median_price, colour = city))+
labs(title = "Total median price trend by city",
x = "Period",
y = "Total median price")+
facet_wrap(~ city, scales = "free_x")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Boxplot of median price of sales by date and city
ggplot(data = data)+
geom_boxplot(aes(x = month, y = median_price, fill = city))+
facet_wrap(~ city, scales = "free_x")+
labs(title = "Median price distribution by city and month",
x = "Months",
y = "Median price")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of median price by month, year and city
ggplot(data = data)+
geom_bar(aes(x = month, y = median_price, fill = city),
stat = "identity",
position = "stack",
col = "black")+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Median price by year, month and city",
x = "Months",
y = "Median price")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
Bivariate analysis by city: The analysis shows that Tyler is the city with the highest total and average number of listings, and it is also the city with the highest number and value of sales. Wichita Falls, on the other hand, has the lowest total and average number of listings and ranks last in both sales number and value. This may suggest some correlation between the number of listings and the number and value of sales; however, it could also reflect higher demand, which is mirrored in the larger number of available listings as well as in the higher number and value of sales, revealing a correlation that may not be truly causal. The highest variability in the number of listings occurs in Bryan-College Station, which is also the city with the highest total and average median sale price. This could indicate that in this city, the relationship between supply and demand—which influences the sale price of individual listings—is unstable or subject to variation due to complex dynamics, which also affect the number of available listings.
Bivariate analysis by year and month: The analysis shows that the total and average number of listings has a slight peak around mid-year, regardless of the year, similar to what was observed for the number and value of sales. Variability, however, does not appear to follow a specific pattern and seems rather stable over time, except for some cases (such as in 2013 or 2014).
Bivariate analysis by date: The analysis shows a decrease over time in the number of listings for the cities of Bryan-College Station and Tyler, while Beaumont and Wichita Falls exhibit a stable situation. Overall, it appears that cities experiencing an increase over time in the number and value of sales—and thus seemingly having a better real estate market—show the opposite trend in the number of listings. This could be due to the fact that, as demand and the respective sales capacity and market penetration increase, it is no longer necessary to invest as much in advertising and listings. This behavior is confirmed by the boxplot, which supports the temporal trend of average values and variability of observations by city.
Overall, the number of listings shows an inversely proportional relationship with the number and value of sales, which, however, needs to be investigated further to determine whether it is linear or shaped by more complex nonlinear behaviors; a quick pooled check is sketched below.
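As a hedged preliminary sketch of the direction of this relationship (the formal correlation analysis by city and year follows later in the report), the pooled Pearson and Spearman coefficients can be computed:
#Pooled association between listings and sales (sketch)
cor(data$listings, data$sales) #Pearson: linear association
cor(data$listings, data$sales, method = "spearman") #Spearman: monotonic association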
#Total, mean and standard deviation of listings by city
listings_by_city <- agg_city(data, city, listings)
names(listings_by_city)[2:5] <- c("num_listings", "mean_listings", "sd_listings", "cv_listings")
kable(listings_by_city, caption = "Statistical index for number of listings aggregated by city", row.names = FALSE)
| city | num_listings | mean_listings | sd_listings | cv_listings |
|---|---|---|---|---|
| Beaumont | 100759 | 1679.3167 | 91.13382 | 5.426839 |
| Bryan-College Station | 87488 | 1458.1333 | 252.52753 | 17.318548 |
| Tyler | 174303 | 2905.0500 | 226.75458 | 7.805531 |
| Wichita Falls | 54575 | 909.5833 | 73.75504 | 8.108663 |
#Plot of volume of listings by city
ggplot(data = listings_by_city)+
geom_bar(aes(x = city, y = num_listings),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Number of listings by city",
x = "City",
y = "Number of listings")+
theme_classic()
#Plot of mean listings of sales by city
ggplot(data = listings_by_city)+
geom_bar(aes(x = city, y = mean_listings),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Mean of listings by city",
x = "City",
y = "Mean of listings")+
theme_classic()
#Plot of variation coefficient of listings of sales by city
ggplot(data = listings_by_city)+
geom_bar(aes(x = city, y = cv_listings),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Listings variability by city",
x = "City",
y = "Listings variability")+
theme_classic()
#Total number, mean and standard deviation of listings by month and year
listings_by_date <- agg_date(data, year, month, listings)
names(listings_by_date)[3:6] <- c("num_listings", "mean_listings", "sd_listings", "cv_listings")
kable(listings_by_date, caption = "Statistical index for number of listings aggregated by year and month", row.names = FALSE)
| year | month | num_listings | mean_listings | sd_listings | cv_listings |
|---|---|---|---|---|---|
| 2010 | Jan | 6466 | 1616.50 | 783.9211 | 48.49497 |
| 2010 | Feb | 6703 | 1675.75 | 779.9484 | 46.54325 |
| 2010 | Mar | 6941 | 1735.25 | 738.8362 | 42.57808 |
| 2010 | Apr | 7610 | 1902.50 | 871.1883 | 45.79176 |
| 2010 | May | 7473 | 1868.25 | 947.4173 | 50.71149 |
| 2010 | Jun | 7657 | 1914.25 | 984.9468 | 51.45341 |
| 2010 | Jul | 7768 | 1942.00 | 959.9205 | 49.42948 |
| 2010 | Aug | 7718 | 1929.50 | 954.2175 | 49.45413 |
| 2010 | Sep | 7702 | 1925.50 | 972.0838 | 50.48475 |
| 2010 | Oct | 7470 | 1867.50 | 917.5832 | 49.13431 |
| 2010 | Nov | 7262 | 1815.50 | 879.6384 | 48.45158 |
| 2010 | Dec | 6878 | 1719.50 | 826.7162 | 48.07887 |
| 2011 | Jan | 6964 | 1741.00 | 800.9024 | 46.00244 |
| 2011 | Feb | 7141 | 1785.25 | 833.7063 | 46.69970 |
| 2011 | Mar | 7554 | 1888.50 | 887.1056 | 46.97408 |
| 2011 | Apr | 7792 | 1948.00 | 914.6573 | 46.95366 |
| 2011 | May | 7990 | 1997.50 | 922.9047 | 46.20299 |
| 2011 | Jun | 7889 | 1972.25 | 930.5813 | 47.18374 |
| 2011 | Jul | 7776 | 1944.00 | 943.6444 | 48.54138 |
| 2011 | Aug | 7592 | 1898.00 | 940.5683 | 49.55576 |
| 2011 | Sep | 7420 | 1855.00 | 887.7466 | 47.85696 |
| 2011 | Oct | 7233 | 1808.25 | 890.4644 | 49.24454 |
| 2011 | Nov | 6910 | 1727.50 | 833.0564 | 48.22324 |
| 2011 | Dec | 6522 | 1630.50 | 791.3859 | 48.53639 |
| 2012 | Jan | 6803 | 1700.75 | 814.4947 | 47.89032 |
| 2012 | Feb | 7018 | 1754.50 | 823.8529 | 46.95657 |
| 2012 | Mar | 7291 | 1822.75 | 811.8002 | 44.53711 |
| 2012 | Apr | 7416 | 1854.00 | 834.8185 | 45.02797 |
| 2012 | May | 7453 | 1863.25 | 849.5153 | 45.59320 |
| 2012 | Jun | 7433 | 1858.25 | 873.0400 | 46.98184 |
| 2012 | Jul | 7431 | 1857.75 | 887.3719 | 47.76595 |
| 2012 | Aug | 7176 | 1794.00 | 892.0617 | 49.72473 |
| 2012 | Sep | 7086 | 1771.50 | 852.7878 | 48.13930 |
| 2012 | Oct | 6933 | 1733.25 | 839.9684 | 48.46205 |
| 2012 | Nov | 6801 | 1700.25 | 821.1977 | 48.29864 |
| 2012 | Dec | 6446 | 1611.50 | 759.5668 | 47.13415 |
| 2013 | Jan | 6579 | 1644.75 | 748.6046 | 45.51480 |
| 2013 | Feb | 6735 | 1683.75 | 746.3692 | 44.32779 |
| 2013 | Mar | 6973 | 1743.25 | 793.4471 | 45.51539 |
| 2013 | Apr | 7191 | 1797.75 | 836.3852 | 46.52400 |
| 2013 | May | 7086 | 1771.50 | 853.9924 | 48.20730 |
| 2013 | Jun | 7046 | 1761.50 | 875.5313 | 49.70373 |
| 2013 | Jul | 6935 | 1733.75 | 915.1089 | 52.78205 |
| 2013 | Aug | 6843 | 1710.75 | 899.3140 | 52.56840 |
| 2013 | Sep | 6611 | 1652.75 | 914.7066 | 55.34453 |
| 2013 | Oct | 6458 | 1614.50 | 897.0561 | 55.56247 |
| 2013 | Nov | 6234 | 1558.50 | 837.1063 | 53.71231 |
| 2013 | Dec | 5834 | 1458.50 | 766.7170 | 52.56887 |
| 2014 | Jan | 6129 | 1532.25 | 793.8226 | 51.80764 |
| 2014 | Feb | 6253 | 1563.25 | 790.5101 | 50.56837 |
| 2014 | Mar | 6375 | 1593.75 | 814.8412 | 51.12729 |
| 2014 | Apr | 6505 | 1626.25 | 827.1817 | 50.86436 |
| 2014 | May | 6475 | 1618.75 | 806.1585 | 49.80130 |
| 2014 | Jun | 6640 | 1660.00 | 851.4368 | 51.29138 |
| 2014 | Jul | 6514 | 1628.50 | 889.2669 | 54.60650 |
| 2014 | Aug | 6397 | 1599.25 | 847.1428 | 52.97125 |
| 2014 | Sep | 6159 | 1539.75 | 809.5638 | 52.57761 |
| 2014 | Oct | 6113 | 1528.25 | 772.6195 | 50.55583 |
| 2014 | Nov | 5847 | 1461.75 | 728.5403 | 49.84028 |
| 2014 | Dec | 5475 | 1368.75 | 675.7817 | 49.37218 |
#Plot of listings of sales by month and year
ggplot(data = listings_by_date)+
geom_bar(aes(x = month, y = num_listings),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = num_listings, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Number of listings by year and month",
x = "Months",
y = "Number of listings")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of mean listings of sales by month and year
ggplot(data = listings_by_date)+
geom_bar(aes(x = month, y = mean_listings),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = mean_listings, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Mean of listings by year and month",
x = "Months",
y = "Mean of listings")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of variation coefficient of listings by month and year
ggplot(data = listings_by_date)+
geom_bar(aes(x = month, y = cv_listings),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = cv_listings, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Listings variability by year and month",
x = "Months",
y = "Listings variability")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Total listings by date and city for trend analysis
listings_by_datecity <- agg_datecity(data, city, date, listings)
names(listings_by_datecity)[3] <- c("num_listings")
#Plot by year and city to explore trend
ggplot(data = listings_by_datecity)+
geom_smooth(aes(x = date, y = num_listings, colour = city), lwd = 1)+
geom_point(aes(x = date, y = num_listings, colour = city))+
labs(title = "Number of listings trend by city",
x = "Period",
y = "Number of listings")+
facet_wrap(~ city, scales = "free_x")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Boxplot of listings by date and city
ggplot(data = data)+
geom_boxplot(aes(x = month, y = listings, fill = city))+
facet_wrap(~ city, scales = "free_x")+
labs(title = "Number of listings distribution by city and month",
x = "Months",
y = "Number of listings")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of listings by month, year and city
ggplot(data = data)+
geom_bar(aes(x = month, y = listings, fill = city),
stat = "identity",
position = "stack",
col = "black")+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Number of listings by year, month and city",
x = "Months",
y = "Number of listings")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
Bivariate analysis by city: The analysis shows that Tyler has the highest total and average months of inventory (the number of months needed to sell all active listings), while Bryan-College Station has the lowest values, closely followed by Wichita Falls. Variability, on the other hand, is greatest for Bryan-College Station, in line with the higher variability in the number of listings observed earlier, whereas Wichita Falls shows the most stable values. The first result is linked to the fact that Tyler is also the city with the highest number and value of sales, so in absolute terms it may take longer to absorb all of its listings. To assess efficiency, it is therefore necessary to turn to the derived indicators, created earlier, that relate the different measures present in the dataset.
Bivariate analysis by year and month: The analysis shows a pattern of the number of months to sell listings that is partially similar, in terms of average and total values, to that of the number and value of sales, with a peak mid-year for some years, but much less pronounced. The variability pattern appears less marked and more dependent on individual years (for example, in 2013 there is a noticeable increasing variability over the months).
Bivariate analysis by date: The temporal analysis reveals seasonality, although less pronounced compared to the number and value of sales, and a decreasing trend over time, especially for those cities (Tyler and Bryan-College Station) with a positive trend in the number and value of sales and a negative trend in the number of listings. The other cities also show a generally decreasing trend, but less markedly. This suggests a direct, probably nearly linear, proportional relationship between the number of listings and the number of months to sell listings, and an inverse proportionality of this latter indicator with the number and value of sales, following a relationship that appears to involve nonlinear and more complex dynamics, confirming what was already suggested by the city-level analysis.
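A minimal, hedged check of the signs suggested above can be obtained from the pooled correlations; note that the pooled coefficients may differ from the within-city temporal patterns described here.
#Pooled correlations of months inventory with listings and sales (sketch)
cor(data$months_inventory, data$listings)
cor(data$months_inventory, data$sales)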
#Total, mean and standard deviation of months inventory by city
minv_by_city <- agg_city(data, city, months_inventory)
names(minv_by_city)[2:5] <- c("num_minv", "mean_minv", "sd_minv", "cv_minv")
kable(minv_by_city, caption = "Statistical index for months inventory aggregated by city", row.names = FALSE)
| city | num_minv | mean_minv | sd_minv | cv_minv |
|---|---|---|---|---|
| Beaumont | 598.2 | 9.970000 | 1.6495814 | 16.545450 |
| Bryan-College Station | 459.5 | 7.658333 | 2.2472048 | 29.343262 |
| Tyler | 679.5 | 11.325000 | 1.8864032 | 16.656982 |
| Wichita Falls | 469.0 | 7.816667 | 0.7809526 | 9.990865 |
#Plot of total months inventory by city
ggplot(data = minv_by_city)+
geom_bar(aes(x = city, y = num_minv),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Months inventory by city",
x = "City",
y = "Months inventory")+
theme_classic()
#Plot of mean months inventory by city
ggplot(data = minv_by_city)+
geom_bar(aes(x = city, y = mean_minv),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Mean of months inventory by city",
x = "City",
y = "Mean of months inventory")+
theme_classic()
#Plot of variation coefficient of months inventory by city
ggplot(data = minv_by_city)+
geom_bar(aes(x = city, y = cv_minv),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Months inventory variability by city",
x = "City",
y = "Months inventory variability")+
theme_classic()
#Total number, mean and standard deviation of months inventory by month and year
minv_by_date <- agg_date(data, year, month, months_inventory)
names(minv_by_date)[3:6] <- c("num_minv", "mean_minv", "sd_minv", "cv_minv")
kable(minv_by_date, caption = "Statistical index for months inventory aggregated by year and month", row.names = FALSE)
| year | month | num_minv | mean_minv | sd_minv | cv_minv |
|---|---|---|---|---|---|
| 2010 | Jan | 35.0 | 8.750 | 2.042058 | 23.33780 |
| 2010 | Feb | 36.6 | 9.150 | 1.957039 | 21.38840 |
| 2010 | Mar | 37.9 | 9.475 | 1.774589 | 18.72917 |
| 2010 | Apr | 40.1 | 10.025 | 2.129750 | 21.24439 |
| 2010 | May | 38.8 | 9.700 | 2.393045 | 24.67057 |
| 2010 | Jun | 40.2 | 10.050 | 2.452889 | 24.40686 |
| 2010 | Jul | 42.2 | 10.550 | 2.379776 | 22.55712 |
| 2010 | Aug | 42.5 | 10.625 | 2.304163 | 21.68624 |
| 2010 | Sep | 42.6 | 10.650 | 2.580052 | 24.22584 |
| 2010 | Oct | 42.0 | 10.500 | 2.471167 | 23.53492 |
| 2010 | Nov | 41.6 | 10.400 | 2.405549 | 23.13028 |
| 2010 | Dec | 39.2 | 9.800 | 2.287648 | 23.34335 |
| 2011 | Jan | 39.8 | 9.950 | 2.112660 | 21.23277 |
| 2011 | Feb | 40.8 | 10.200 | 2.210581 | 21.67236 |
| 2011 | Mar | 43.6 | 10.900 | 2.356551 | 21.61974 |
| 2011 | Apr | 46.1 | 11.525 | 2.372587 | 20.58643 |
| 2011 | May | 48.3 | 12.075 | 2.270646 | 18.80452 |
| 2011 | Jun | 47.7 | 11.925 | 2.211146 | 18.54210 |
| 2011 | Jul | 46.3 | 11.575 | 2.364142 | 20.42455 |
| 2011 | Aug | 44.9 | 11.225 | 2.351418 | 20.94804 |
| 2011 | Sep | 44.1 | 11.025 | 2.093442 | 18.98814 |
| 2011 | Oct | 42.6 | 10.650 | 2.080865 | 19.53864 |
| 2011 | Nov | 40.7 | 10.175 | 1.944865 | 19.11415 |
| 2011 | Dec | 38.5 | 9.625 | 1.882153 | 19.55484 |
| 2012 | Jan | 39.8 | 9.950 | 1.887679 | 18.97165 |
| 2012 | Feb | 40.6 | 10.150 | 1.763519 | 17.37457 |
| 2012 | Mar | 41.9 | 10.475 | 1.652019 | 15.77106 |
| 2012 | Apr | 42.5 | 10.625 | 1.682013 | 15.83071 |
| 2012 | May | 42.1 | 10.525 | 1.543535 | 14.66541 |
| 2012 | Jun | 41.9 | 10.475 | 1.641899 | 15.67446 |
| 2012 | Jul | 41.4 | 10.350 | 1.682260 | 16.25372 |
| 2012 | Aug | 39.2 | 9.800 | 1.794436 | 18.31057 |
| 2012 | Sep | 38.5 | 9.625 | 1.652019 | 17.16383 |
| 2012 | Oct | 36.9 | 9.225 | 1.594522 | 17.28479 |
| 2012 | Nov | 35.9 | 8.975 | 1.631717 | 18.18069 |
| 2012 | Dec | 33.5 | 8.375 | 1.408013 | 16.81209 |
| 2013 | Jan | 34.2 | 8.550 | 1.173314 | 13.72298 |
| 2013 | Feb | 34.8 | 8.700 | 1.101514 | 12.66108 |
| 2013 | Mar | 35.6 | 8.900 | 1.151810 | 12.94169 |
| 2013 | Apr | 36.2 | 9.050 | 1.382027 | 15.27102 |
| 2013 | May | 34.8 | 8.700 | 1.534058 | 17.63285 |
| 2013 | Jun | 34.4 | 8.600 | 1.741647 | 20.25171 |
| 2013 | Jul | 32.9 | 8.225 | 2.041854 | 24.82497 |
| 2013 | Aug | 32.0 | 8.000 | 1.954482 | 24.43103 |
| 2013 | Sep | 30.6 | 7.650 | 2.199242 | 28.74827 |
| 2013 | Oct | 30.0 | 7.500 | 2.252406 | 30.03208 |
| 2013 | Nov | 28.9 | 7.225 | 2.012254 | 27.85127 |
| 2013 | Dec | 26.8 | 6.700 | 1.867262 | 27.86958 |
| 2014 | Jan | 28.0 | 7.000 | 1.790717 | 25.58167 |
| 2014 | Feb | 28.4 | 7.100 | 1.762574 | 24.82498 |
| 2014 | Mar | 29.1 | 7.275 | 1.802544 | 24.77724 |
| 2014 | Apr | 29.5 | 7.375 | 1.858987 | 25.20661 |
| 2014 | May | 29.5 | 7.375 | 1.851801 | 25.10917 |
| 2014 | Jun | 29.9 | 7.475 | 2.069420 | 27.68455 |
| 2014 | Jul | 29.5 | 7.375 | 2.280899 | 30.92744 |
| 2014 | Aug | 29.2 | 7.300 | 2.264214 | 31.01664 |
| 2014 | Sep | 27.9 | 6.975 | 2.066196 | 29.62288 |
| 2014 | Oct | 27.3 | 6.825 | 1.905037 | 27.91263 |
| 2014 | Nov | 26.1 | 6.525 | 1.824600 | 27.96322 |
| 2014 | Dec | 24.3 | 6.075 | 1.783956 | 29.36554 |
#Plot of months inventory by month and year
ggplot(data = minv_by_date)+
geom_bar(aes(x = month, y = num_minv),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = num_minv, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Months inventory by year and month",
x = "Months",
y = "Months inventory")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of mean months inventory by month and year
ggplot(data = minv_by_date)+
geom_bar(aes(x = month, y = mean_minv),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = mean_minv, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Mean of months inventory by year and month",
x = "Months",
y = "Mean of months inventory")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of variation coefficient of months inventory by month and year
ggplot(data = minv_by_date)+
geom_bar(aes(x = month, y = cv_minv),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = cv_minv, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Months inventory variabiliy by year and month",
x = "Months",
y = "Months inventory variability")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Total months inventory by date and city for trend analysis
minv_by_datecity <- agg_datecity(data, city, date, months_inventory)
names(minv_by_datecity)[3] <- c("num_minv")
#Plot by year and city to explore trend
ggplot(data = minv_by_datecity)+
geom_smooth(aes(x = date, y = num_minv, colour = city), lwd = 1)+
geom_point(aes(x = date, y = num_minv, colour = city))+
labs(title = "Months inventory trend by city",
x = "Period",
y = "Months inventory")+
facet_wrap(~ city, scales = "free_x")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Boxplot of months inventory by date and city
ggplot(data = data)+
geom_boxplot(aes(x = month, y = months_inventory, fill = city))+
facet_wrap(~ city, scales = "free_x")+
labs(title = "Months inventory distribution by city and month",
x = "Months",
y = "Months inventory")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of months inventory by month, year and city
ggplot(data = data)+
geom_bar(aes(x = month, y = months_inventory, fill = city),
stat = "identity",
position = "stack",
col = "black")+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Months inventory by year, month and city",
x = "Months",
y = "Months inventory")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
Bivariate analysis by city:
Bivariate analysis by year and month:
Bivariate analysis by date:
There is a substantial correlation between the number and value of sales and efficiency indicators, highlighting that more active and higher-value markets tend to be more efficient or improve over time. Seasonality and annual dynamics suggest opportunities for strategic interventions targeted at specific periods.
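As a hedged sketch of the correlation mentioned above, the efficiency index created in the univariate section (efficiency_list_index) can be related directly to sales activity:
#Pooled correlation between the efficiency listings index and sales activity (sketch)
cor(data$efficiency_list_index, data$sales)
cor(data$efficiency_list_index, data$volume)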
#Function to aggregate new created variables by city
agg_new_city <- function(data, city, kpi1, kpi2, kpi3, kpi4) {
datanew_by_city <- data %>%
group_by({{city}}) %>%
summarize(mean_price = (sum({{kpi2}})/sum({{kpi1}})),
eff_list_ind = (sum({{kpi1}})/sum({{kpi3}}))*100,
mean_eff_listings = (sum({{kpi3}})/sum({{kpi4}})),
.groups = "drop")%>%
mutate(var_eff_list = ((mean({{kpi1}}) - mean_eff_listings)/(mean_eff_listings))*100)
return(datanew_by_city)
}
agg_new_var_city <- agg_new_city(data, city, sales, volume, listings, months_inventory)
#Plot of statistical index table
agg_new_var_city <- agg_new_var_city[order(agg_new_var_city$eff_list_ind, decreasing = TRUE), ]
kable(agg_new_var_city, caption = "Statistical index for new variables aggregated by city and ordered by the efficiency listings index", row.names = FALSE)
| city | mean_price | eff_list_ind | mean_eff_listings | var_eff_list |
|---|---|---|---|---|
| Bryan-College Station | 0.1854261 | 14.125366 | 190.3983 | 0.9944459 |
| Wichita Falls | 0.1200187 | 12.760421 | 116.3646 | 65.2492747 |
| Beaumont | 0.1473171 | 10.562828 | 168.4370 | 14.1623825 |
| Tyler | 0.1696659 | 9.285554 | 256.5166 | -25.0373272 |
#Plot of mean price by city
ggplot(data = agg_new_var_city)+
geom_bar(aes(x = city, y = mean_price),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Mean price (mln $) by city",
x = "City",
y = "Mean price (mln $)")+
theme_classic()
#Plot of efficiency listings index by city
ggplot(data = agg_new_var_city)+
geom_bar(aes(x = city, y = eff_list_ind),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Efficiency listings index by city",
x = "City",
y = "Efficiency listings index")+
theme_classic()
#Plot of efficiency listings index variation by city
ggplot(data = agg_new_var_city)+
geom_bar(aes(x = city, y = var_eff_list),
stat = "identity",
col = "black",
fill = "blue")+
labs(title = "Efficency listings index variation by city",
x = "City",
y = "Efficiency listings index variation (%)")+
theme_classic()
#Function to aggregate new created variables by year and month
agg_new_ym <- function(data, year, month, kpi1, kpi2, kpi3, kpi4) {
datanew_by_ym <- data %>%
group_by({{year}}, {{month}}) %>%
summarize(mean_price = (sum({{kpi2}})/sum({{kpi1}})),
eff_list_ind = (sum({{kpi1}})/sum({{kpi3}}))*100,
mean_eff_listings = (sum({{kpi3}})/sum({{kpi4}})),
.groups = "drop")%>%
mutate(var_eff_list = ((mean({{kpi1}}) - mean_eff_listings)/(mean_eff_listings))*100)
return(datanew_by_ym)
}
agg_new_var_ym <- agg_new_ym(data, year, month, sales, volume, listings, months_inventory)
#Plot of statistical index table
agg_new_var_ym <- agg_new_var_ym[order(agg_new_var_ym$eff_list_ind, decreasing = TRUE), ]
kable(agg_new_var_ym, caption = "Statistical index for new variables aggregated by year and month and ordered by the efficiency listings index", row.names = FALSE)
| year | month | mean_price | eff_list_ind | mean_eff_listings | var_eff_list |
|---|---|---|---|---|---|
| 2014 | Jun | 0.1828683 | 17.725904 | 222.0736 | -13.4108308 |
| 2014 | Jul | 0.1794058 | 17.439361 | 220.8136 | -12.9167306 |
| 2014 | May | 0.1741943 | 17.405405 | 219.4915 | -12.3922136 |
| 2014 | Aug | 0.1773975 | 16.320150 | 219.0753 | -12.2257829 |
| 2013 | Jul | 0.1673035 | 16.250901 | 210.7903 | -8.7758351 |
| 2013 | Aug | 0.1676938 | 16.177115 | 213.8438 | -10.0784256 |
| 2014 | Oct | 0.1736611 | 15.687878 | 223.9194 | -14.1246115 |
| 2014 | Dec | 0.1768980 | 15.397260 | 225.3086 | -14.6541096 |
| 2014 | Apr | 0.1667114 | 15.019216 | 220.5085 | -12.7962465 |
| 2013 | May | 0.1745260 | 14.916737 | 203.6207 | -5.5637878 |
| 2013 | Jun | 0.1763780 | 14.831110 | 204.8256 | -6.1193112 |
| 2014 | Sep | 0.1763226 | 14.596525 | 220.7527 | -12.8927180 |
| 2012 | Aug | 0.1617096 | 13.294314 | 183.0612 | 5.0422705 |
| 2014 | Mar | 0.1660178 | 13.192157 | 219.0722 | -12.2245098 |
| 2014 | Nov | 0.1657047 | 12.741577 | 224.0230 | -14.1643150 |
| 2013 | Dec | 0.1636630 | 12.564278 | 217.6866 | -11.6658096 |
| 2012 | Jul | 0.1642920 | 12.488225 | 179.4928 | 7.1306015 |
| 2010 | May | 0.1543383 | 12.458183 | 192.6031 | -0.1616932 |
| 2013 | Sep | 0.1680356 | 12.312812 | 216.0458 | -10.9949327 |
| 2013 | Apr | 0.1658554 | 12.209707 | 198.6464 | -3.1990219 |
| 2012 | Jun | 0.1662973 | 12.081259 | 177.3986 | 8.3952756 |
| 2010 | Apr | 0.1475939 | 12.036794 | 189.7756 | 1.3258322 |
| 2012 | May | 0.1672630 | 11.834161 | 177.0309 | 8.6204101 |
| 2013 | Mar | 0.1541794 | 11.673598 | 195.8708 | -1.8272862 |
| 2013 | Oct | 0.1688198 | 11.427687 | 215.2667 | -10.6728089 |
| 2010 | Jun | 0.1595739 | 11.309912 | 190.4726 | 0.9550085 |
| 2011 | Jun | 0.1570463 | 11.218152 | 165.3878 | 16.2671124 |
| 2014 | Feb | 0.1647032 | 11.098673 | 220.1761 | -12.6645877 |
| 2013 | Nov | 0.1597029 | 11.068335 | 215.7093 | -10.8561250 |
| 2010 | Mar | 0.1481801 | 10.877395 | 183.1398 | 4.9971786 |
| 2012 | Oct | 0.1549919 | 10.702438 | 187.8862 | 2.3447642 |
| 2011 | Jul | 0.1611515 | 10.442387 | 167.9482 | 14.4946523 |
| 2011 | Aug | 0.1537099 | 10.353003 | 169.0869 | 13.7236016 |
| 2014 | Jan | 0.1500415 | 10.230054 | 218.8929 | -12.1526078 |
| 2012 | Apr | 0.1501044 | 10.072816 | 174.4941 | 10.1995123 |
| 2012 | Sep | 0.1594894 | 9.977420 | 184.0519 | 4.4768440 |
| 2012 | Dec | 0.1514557 | 9.975178 | 192.4179 | -0.0656092 |
| 2011 | May | 0.1564564 | 9.762203 | 165.4244 | 16.2413955 |
| 2012 | Mar | 0.1534866 | 9.751749 | 174.0095 | 10.5063892 |
| 2010 | Aug | 0.1549539 | 9.562063 | 181.6000 | 5.8874816 |
| 2012 | Nov | 0.1577646 | 9.557418 | 189.4429 | 1.5037617 |
| 2011 | Apr | 0.1540042 | 9.188912 | 169.0239 | 13.7659886 |
| 2010 | Jul | 0.1499494 | 9.165808 | 184.0758 | 4.4632896 |
| 2011 | Mar | 0.1520449 | 8.842997 | 173.2569 | 10.9864531 |
| 2013 | Feb | 0.1516796 | 8.804751 | 193.5345 | -0.6421678 |
| 2010 | Dec | 0.1524338 | 8.781623 | 175.4592 | 9.5933896 |
| 2013 | Jan | 0.1592691 | 8.755130 | 192.3684 | -0.0398997 |
| 2011 | Dec | 0.1475168 | 8.662987 | 169.4026 | 13.5116401 |
| 2011 | Sep | 0.1511971 | 8.477089 | 168.2540 | 14.2865566 |
| 2011 | Oct | 0.1486094 | 8.212360 | 169.7887 | 13.2534909 |
| 2012 | Feb | 0.1437352 | 8.178968 | 172.8571 | 11.2431129 |
| 2011 | Nov | 0.1571730 | 7.945007 | 169.7789 | 13.2600699 |
| 2010 | Sep | 0.1485385 | 7.764217 | 180.7981 | 6.3571150 |
| 2010 | Oct | 0.1542726 | 7.563588 | 177.8571 | 8.1157965 |
| 2012 | Jan | 0.1398617 | 7.334999 | 170.9296 | 12.4975501 |
| 2010 | Feb | 0.1578994 | 7.265404 | 183.1421 | 4.9958974 |
| 2010 | Nov | 0.1464473 | 6.926467 | 174.5673 | 10.1533095 |
| 2011 | Feb | 0.1479296 | 6.567708 | 175.0245 | 9.8655650 |
| 2010 | Jan | 0.1514276 | 6.510980 | 184.7429 | 4.0861171 |
| 2011 | Jan | 0.1427247 | 6.102815 | 174.9749 | 9.8967308 |
#Plot of mean price by month and year
ggplot(data = agg_new_var_ym)+
geom_bar(aes(x = month, y = mean_price),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = mean_price, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Mean price by year and month",
x = "Months",
y = "Mean price (mln $)")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of efficiency listings index by month and year
ggplot(data = agg_new_var_ym)+
geom_bar(aes(x = month, y = eff_list_ind),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = eff_list_ind, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Efficiency listings index by year and month",
x = "Months",
y = "Efficiency listings index")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of efficiency listings index variation by month and year
ggplot(data = agg_new_var_ym)+
geom_bar(aes(x = month, y = var_eff_list),
stat = "identity",
col = "black",
fill = "blue")+
geom_line(aes(x = month, y = var_eff_list, group = 1), color = "red", lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Efficiency listings index variation by year and month",
x = "Months",
y = "Efficiency listings index variation")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Efficiency listings index by date and city for trend analysis
agg_new_efl <- function(data, city, date, kpi1, kpi3) {
datanew_by_efl <- data %>%
group_by({{city}}, {{date}}) %>%
summarize(eff_list_ind = (sum({{kpi1}})/sum({{kpi3}}))*100, .groups = "drop")
return(datanew_by_efl)
}
efl_by_datecity <- agg_new_efl(data, city, date, sales, listings)
#Plot by year and city to explore trend
ggplot(data = efl_by_datecity)+
geom_smooth(aes(x = date, y = eff_list_ind, colour = city), lwd = 1)+
geom_point(aes(x = date, y = eff_list_ind, colour = city))+
labs(title = "Efficiency listings index trend by city",
x = "Period",
y = "Efficiency listings index")+
facet_wrap(~ city, scales = "free_x")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Boxplot of efficiency listings index by date and city
ggplot(data = data)+
geom_boxplot(aes(x = month, y = efficiency_list_index, fill = city))+
facet_wrap(~ city, scales = "free_x")+
labs(title = "Efficiency listings index distribution by city and month",
x = "Months",
y = "Efficiency listings index")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
#Plot of efficiency listings index by month, year and city
ggplot(data = data)+
geom_bar(aes(x = month, y = efficiency_list_index, fill = city),
stat = "identity",
position = "stack",
col = "black")+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Efficiency listings index by year, month and city",
x = "Months",
y = "Efficiency listings index")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
Historical Trends in Real Estate Sales:
Effectiveness of Marketing Strategies for Listings:
Initial Overall Assessment:
The analysis of the relationship between the number of sales and the value of sales variables shows a strong linear correlation with a marked direct proportionality. This was suggested by the similar trends of both variables by city and by year and month, indicating that this relationship holds at a global level and not only locally. The shape and slope of the confidence ellipses, together with the fitted linear regression lines, further confirm the linear relationship between these two variables.
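A minimal numerical sketch of this global relationship, assuming the same data frame used throughout, is a simple linear fit of volume on sales and its R^2:
#Hedged check: global linear fit of volume on sales
fit_vs <- lm(volume ~ sales, data = data)
summary(fit_vs)$r.squared #a value close to 1 supports the strong linearity described above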
#The analysis of the trends suggests a correlation between the number of sales and the sales volume.
#For this reason, we can plot these two variables and explore this correlation
ggplot(data = data)+
geom_point(aes(x = sales, y = volume, colour = city))+
geom_smooth(aes(x = sales, y = volume), lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Volume of sales vs Number of sales",
x = "Number of sales",
y = "Volume of sales (mln $)")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Function for correlation index by city and year between number and volume of sales
corr_fun <- function(data, year, city, kpi1, kpi2){
corr_by_yc <- data %>%
group_by({{year}}, {{city}}) %>%
summarize(corr_index = cor({{kpi1}}, {{kpi2}}, use = "complete.obs"), .groups = "drop")
return(corr_by_yc)
}
corr_by_yearcity <- corr_fun(data, year, city, sales, volume)
#Plot of corrplot for this calculated index
corr_by_yearcity %>%
ungroup()
## # A tibble: 20 × 3
## year city corr_index
## <int> <chr> <dbl>
## 1 2010 Beaumont 0.937
## 2 2010 Bryan-College Station 0.995
## 3 2010 Tyler 0.969
## 4 2010 Wichita Falls 0.953
## 5 2011 Beaumont 0.912
## 6 2011 Bryan-College Station 0.990
## 7 2011 Tyler 0.988
## 8 2011 Wichita Falls 0.943
## 9 2012 Beaumont 0.960
## 10 2012 Bryan-College Station 0.993
## 11 2012 Tyler 0.973
## 12 2012 Wichita Falls 0.820
## 13 2013 Beaumont 0.987
## 14 2013 Bryan-College Station 0.996
## 15 2013 Tyler 0.973
## 16 2013 Wichita Falls 0.930
## 17 2014 Beaumont 0.979
## 18 2014 Bryan-College Station 0.993
## 19 2014 Tyler 0.983
## 20 2014 Wichita Falls 0.923
#Transform in wide format
corr_matrix <- corr_by_yearcity %>%
pivot_wider(names_from = city, values_from = corr_index)
#Remove column "year" to have only numeric variable
mat <- as.matrix(corr_matrix[,-1])
#Set rownames like years, to be more clear
rownames(mat) <- corr_matrix$year
#Corrplot
corrplot(mat, method = "color", addCoef.col = "white", tl.col = "black", tl.cex = 0.7, tl.srt = 30,
number.cex = 0.9, cl.pos = "b", title = "Volume of sales vs Number of sales correlation by year and city",
mar = c(0,0,1,0))
#This analysis confirms that the number of sales and the sales volume are very strongly correlated,
#both globally and by city and year (values from 0.91 to 1.00). The only correlation below 0.91
#is for year = 2012 and city = Wichita Falls (0.82)
#The high correlation suggests a strong linear relationship between sales and volume, with data
#clustered near a straight line in bivariate space. The joint distributions are therefore expected
#to exhibit a narrow elliptical shape. However, since the correlation reflects only the linear
#association, further controls for outliers, non-Gaussian distributions, or possible nonlinear
#relationships are necessary.
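#Hedged sketch of the additional checks mentioned above (illustrative, not exhaustive)
res_vs <- resid(lm(volume ~ sales, data = data))
shapiro.test(res_vs) #Shapiro-Wilk test for non-Gaussian residuals
boxplot.stats(res_vs)$out #residuals flagged as outliers by the boxplot rule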
#Plot of joint distribution of sales and volume
ggplot(data, aes(x = sales, y = volume)) +
geom_point(alpha = 0.5, aes(color = city)) +
stat_ellipse(level = 0.95, aes(color = city), size = 1) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Volume of sales vs Number of sales joint distribution with confidence ellipses",
x = "Number of sales",
y = "Volume of sales (mln $)") +
theme_classic()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#Plot of points around the linear regression line
ggplot(data, aes(x = sales, y = volume, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Volume of sales vs Number of sales linear regression with confidence intervals",
x = "Number of sales",
y = "Volume of sales (mln $)") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#All these analyses confirm a strong linear relationship between sales and volume
Summary of the Analysis of the Relationship between Number of Sales
and Median Sales Value:
The exploratory analysis of the relationship between the number of sales
and the median sales value of real estate reveals a complex relationship
that is not perfectly linear, but shows several significant trends at
territorial and temporal levels.
Correlations and Trends by City and Year:
At a first level, the correlation between the two variables varies
considerably across cities and over the years. In general, the
correlation is positive, indicating that an increase in the number of
sales tends to be associated with an increase in the median sales value,
but this relationship is not strong or stable in all cases.
Linear Model and Residual Distribution:
Applying a linear regression model between number of sales and median
value confirms the presence of a linear relationship, albeit
imperfect.
Nonlinear Analysis with Loess:
The nonlinear regression analysis (loess) highlights a better fit of the
relationship between sales and median prices, capturing curvatures and
local dynamics more accurately.
Overall Statistical Considerations:
The analysis suggests that although the number of sales and median
market value are correlated, the relationship cannot be simply modeled
with a basic linear regression without accounting for variability,
nonlinearity, and outliers. Seasonality, local dynamics, and specific
market conditions influence the link between quantity and price.
#Correlation with sales
ggplot(data = data)+
geom_point(aes(x = sales, y = median_price, colour = city))+
geom_smooth(aes(x = sales, y = median_price), lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Median price vs Number of sales by year",
x = "Number of sales",
y = "Median price")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Correlation index by city and year between number and median price of sales
corr_by_yearcity_mp <- corr_fun(data, year, city, sales, median_price)
#Plot of corrplot for this calculated index
corr_by_yearcity_mp %>%
ungroup()
## # A tibble: 20 × 3
## year city corr_index
## <int> <chr> <dbl>
## 1 2010 Beaumont -0.692
## 2 2010 Bryan-College Station -0.535
## 3 2010 Tyler 0.280
## 4 2010 Wichita Falls 0.463
## 5 2011 Beaumont 0.486
## 6 2011 Bryan-College Station -0.426
## 7 2011 Tyler 0.627
## 8 2011 Wichita Falls 0.542
## 9 2012 Beaumont 0.618
## 10 2012 Bryan-College Station 0.173
## 11 2012 Tyler 0.666
## 12 2012 Wichita Falls 0.375
## 13 2013 Beaumont 0.394
## 14 2013 Bryan-College Station 0.531
## 15 2013 Tyler 0.708
## 16 2013 Wichita Falls -0.0353
## 17 2014 Beaumont 0.752
## 18 2014 Bryan-College Station -0.00894
## 19 2014 Tyler 0.493
## 20 2014 Wichita Falls 0.232
#Transform in wide format
corr_matrix_mp <- corr_by_yearcity_mp %>%
pivot_wider(names_from = city, values_from = corr_index)
#Remove column "year" to have only numeric variable
mat_mp <- as.matrix(corr_matrix_mp[,-1])
#Set rownames like years, to be more clear
rownames(mat_mp) <- corr_matrix_mp$year
#Corrplot
corrplot(mat_mp, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30,
number.cex = 0.9, cl.pos = "b", title = "Median price vs Number of sales correlation by year and city",
mar = c(0,0,1,0))
#This analysis shows a low correlation between the two variables,
#except for the city of Tyler in the years from 2011 to 2013.
#The correlation is generally positive, so the median price tends to grow with the number of sales,
#but not strongly, and in some cases the correlation is negative or close to zero.
#Analyzing the situation by city across the years, the correlation plot shows that:
#Beaumont --> the correlation goes from -0.69 to 0.75, so the market may be growing
#Bryan-College Station --> the market grew from 2010 to 2013 and froze in 2014
#Tyler --> shows a positive market trend, with a small decrease in 2014
#Wichita Falls --> shows a stable trend, but the market generally seems to be in a difficult situation
#Plot of joint distribution of sales and median price
ggplot(data, aes(x = sales, y = median_price)) +
geom_point(alpha = 0.5, aes(color = city)) +
stat_ellipse(level = 0.95, aes(color = city), size = 1) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Median price vs Number of sales joint distribution with confidence ellipses",
x = "Number of sales",
y = "Median price") +
theme_classic()
#Plot of points around the linear regression line
ggplot(data, aes(x = sales, y = median_price, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Median price vs Number of sales linear regression with confidence intervals",
x = "Number of sales",
y = "Median price") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#Plot of residuals to show other characteristics of the relationship
model <- lm(median_price ~ sales, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)
ggplot(data, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Linear regression residuals (median price vs number of sales))",
x = "Predicted values",
y = "Residuals") +
theme_classic()
#The linear relationship exists but is not perfectly described by the simple linear model due to heteroskedasticity and skewness in the residuals.
#The model may underestimate or overestimate the dependence in certain areas (especially for low values of the variable).
#The presence of outliers suggests that a more in-depth analysis of the data would be useful to identify possible subgroups or outliers.
#It may be worth trying variable transformations (e.g., logarithmic) or more flexible nonlinear models to better accommodate the shape of
#the relationship and improve the homogeneity of errors.
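#Hedged sketch of the suggested transformation: compare the simple linear fit with a
#log-transformed alternative (illustrative only)
fit_lin <- lm(median_price ~ sales, data = data)
fit_log <- lm(median_price ~ log(sales), data = data)
AIC(fit_lin, fit_log) #the lower AIC indicates the better-fitting specification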
#Plot of loess smooth of sales and median price by city
ggplot(data, aes(x = sales, y = median_price, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Median price vs Number of sales non linear regression with confidence intervals",
x = "Number of sales",
y = "Median price") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#The trends confirm market growth for Beaumont and Tyler, and also for Bryan-College Station.
#Wichita Falls confirms that its market shows a stable situation
The correlation between the total sales value and the median sales value is generally positive, indicating that an increase in the overall transaction value tends to be accompanied by an increase in the median value. However, this relationship is not always strong and shows considerable variability across time and between cities.
In particular, the city of Tyler exhibits a strong positive correlation from 2011 to 2013, with a slight decline in 2014, confirming similar trends observed in the previous analysis on the number of sales. Other cities, such as Beaumont, show wide fluctuations in correlation over time, ranging from negative to highly positive values, suggesting more unstable or evolving markets. Bryan-College Station shows a growth phase until 2013 followed by stabilization in 2014, whereas Wichita Falls displays a more stable but generally weaker trend.
Linear Model and Residual Analysis
The linear regression confirms a positive but not particularly strong
linear relationship between volume and median value and, as in the
previous case, the residuals exhibit heteroscedasticity and asymmetry,
indicating that the adequacy of the linear model is limited. The
presence of outliers and significant dispersion in the residuals
suggests that nonlinear factors or specific market conditions influence
this relationship.
Nonlinear Analysis with Loess
The loess regression improves the fit to data curvature and local
dynamics, more accurately confirming market growth in the major cities
and highlighting a more stagnant situation in Wichita Falls.
Spearman Correlation
The use of the Spearman correlation, less sensitive to outliers,
reinforces the previous conclusions, showing a generally positive and
monotonic relationship between volume and median value, though not
perfectly linear.
Final Considerations
These results indicate that the median sales value is influenced, albeit
in a complex and nonlinear way, by the total volume of real estate
transactions. The presence of heteroscedasticity and outliers suggests
considering more flexible models or variable transformations (for
example, logarithmic) for a better market representation.
The analysis also highlights significant territorial differences that may reflect economic conditions, micro-dynamics of supply and demand, or varied sales strategies.
#Correlation with volume
ggplot(data = data)+
geom_point(aes(x = volume, y = median_price, colour = city))+
geom_smooth(aes(x = volume, y = median_price), lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Median price vs Volume of sales by year",
x = "Volume of sales (mln $)",
y = "Median price")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Correlation index by city and year between volume and median price of sales
corr_by_yearcity_mpv <- corr_fun(data, year, city, volume, median_price)
#Plot of corrplot for this calculated index
corr_by_yearcity_mpv %>%
ungroup()
## # A tibble: 20 × 3
## year city corr_index
## <int> <chr> <dbl>
## 1 2010 Beaumont -0.686
## 2 2010 Bryan-College Station -0.525
## 3 2010 Tyler 0.436
## 4 2010 Wichita Falls 0.644
## 5 2011 Beaumont 0.637
## 6 2011 Bryan-College Station -0.315
## 7 2011 Tyler 0.682
## 8 2011 Wichita Falls 0.721
## 9 2012 Beaumont 0.768
## 10 2012 Bryan-College Station 0.234
## 11 2012 Tyler 0.723
## 12 2012 Wichita Falls 0.793
## 13 2013 Beaumont 0.492
## 14 2013 Bryan-College Station 0.579
## 15 2013 Tyler 0.798
## 16 2013 Wichita Falls 0.297
## 17 2014 Beaumont 0.818
## 18 2014 Bryan-College Station 0.0872
## 19 2014 Tyler 0.585
## 20 2014 Wichita Falls 0.487
#Transform in wide format
corr_matrix_mpv <- corr_by_yearcity_mpv %>%
pivot_wider(names_from = city, values_from = corr_index)
#Remove column "year" to have only numeric variable
mat_mpv <- as.matrix(corr_matrix_mpv[,-1])
#Set rownames like years, to be more clear
rownames(mat_mpv) <- corr_matrix_mpv$year
#Corrplot
corrplot(mat_mpv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30,
number.cex = 0.9, cl.pos = "b", title = "Median price vs Volume of sales correlation by year and city",
mar = c(0,0,1,0))
#This analysis shows a low correlation between the two variables,
#except for the city of Tyler in the years from 2011 to 2013.
#The correlation is generally positive, so the median price tends to grow with the volume of sales,
#but not strongly, and in some cases the correlation is negative or close to zero.
#Analyzing the situation by city across the years, the correlation plot shows that:
#Beaumont --> the correlation goes from -0.69 to 0.82, so the market may be growing
#Bryan-College Station --> the market grew from 2010 to 2013 and froze in 2014
#Tyler --> shows a positive market trend, with a small decrease in 2014
#Wichita Falls --> shows a stable trend
#Plot of joint distribution of volume and median price
ggplot(data, aes(x = volume, y = median_price)) +
geom_point(alpha = 0.5, aes(color = city)) +
stat_ellipse(level = 0.95, aes(color = city), size = 1) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Median price vs Volume of sales joint distribution with confidence ellipses",
x = "Volume of sales (mln $)",
y = "Median price") +
theme_classic()
#Plot of points around the linear regression line
ggplot(data, aes(x = volume, y = median_price, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Median price vs Volume of sales linear regression with confidence intervals",
x = "Volume of sales (mln $)",
y = "Median price") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#Plot of residuals to show other characteristics of the relationship
model <- lm(median_price ~ volume, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)
ggplot(data, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Linear regression residuals (median price vs volume of sales)",
x = "Predicted values",
y = "Residuals") +
theme_classic()
#The linear relationship exists but is not perfectly described by the simple linear model due to heteroskedasticity and skewness in the residuals.
#The model may underestimate or overestimate the dependence in certain areas (especially for low values of the variable).
#The presence of outliers suggests that a more in-depth analysis of the data would be useful to identify possible subgroups or outliers.
#It may be worth trying variable transformations (e.g., logarithmic) or more flexible nonlinear models to better accommodate the shape of
#the relationship and improve the homogeneity of errors.
#Plot of loess smooth of volume and median price by city
ggplot(data, aes(x = volume, y = median_price, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Median price vs Volume of sales non linear regression with confidence intervals",
x = "Volume of sales (mln $)",
y = "Median price") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#The trends confirm market growth for Beaumont and Tyler, and also for Bryan-College Station.
#Wichita Falls also shows slight growth, despite the flat trend of median price versus number of sales,
#while the correlation indicates a weaker market situation
#Generally, the relationship between volume and median price is not strictly linear, but it is closer
#to linearity than the relationship between sales and median price
#Because the residual analysis shows a large dispersion of values, we can also compute the
#correlation using the Spearman index
corr_sp_fun <- function(data, year, city, kpi1, kpi2) {
corr_sp_data <- data %>%
group_by({{year}}, {{city}}) %>%
summarize(corr_spearman = cor({{kpi1}}, {{kpi2}}, method = "spearman", use = "complete.obs"), .groups = "drop")
return(corr_sp_data)
}
corr_by_yearcity_mpv_sp <- corr_sp_fun(data, year, city, volume, median_price)
corr_by_yearcity_mpv_sp %>%
ungroup()
## # A tibble: 20 × 3
## year city corr_spearman
## <int> <chr> <dbl>
## 1 2010 Beaumont -0.503
## 2 2010 Bryan-College Station -0.545
## 3 2010 Tyler 0.406
## 4 2010 Wichita Falls 0.643
## 5 2011 Beaumont 0.748
## 6 2011 Bryan-College Station -0.154
## 7 2011 Tyler 0.776
## 8 2011 Wichita Falls 0.827
## 9 2012 Beaumont 0.510
## 10 2012 Bryan-College Station 0.497
## 11 2012 Tyler 0.650
## 12 2012 Wichita Falls 0.718
## 13 2013 Beaumont 0.608
## 14 2013 Bryan-College Station 0.586
## 15 2013 Tyler 0.802
## 16 2013 Wichita Falls 0.427
## 17 2014 Beaumont 0.760
## 18 2014 Bryan-College Station -0.0140
## 19 2014 Tyler 0.503
## 20 2014 Wichita Falls 0.315
corr_matrix_spearman_mpv <- corr_by_yearcity_mpv_sp %>%
pivot_wider(names_from = city, values_from = corr_spearman)
mat_spearman <- as.matrix(corr_matrix_spearman_mpv[,-1])
rownames(mat_spearman) <- corr_matrix_spearman_mpv$year
corrplot(mat_spearman, method = "color", addCoef.col = "black", tl.col = "black",
tl.cex = 0.7, tl.srt = 30, number.cex = 0.9, cl.pos = "b",
title = "Median price vs Volume of sales Spearman corr by year and city",
mar = c(0,0,1,0))
#The bivariate analysis between median sales price and total sales value highlights a positive
#relationship with a less pronounced nonlinearity than that observed between median price and number
#of sales. This result is consistent with the fact that, although the median price is less
#influenced by outliers than the mean, it still tends to reflect the central market trend, which
#increases along with the total sales value. Analysis using Spearman's correlation, which is less
#sensitive to outliers, confirms this dynamic, suggesting a monotonic but not perfectly linear
#relationship between the two variables. In conclusion, these results invite the exploration of
#models or transformations that can better capture the complexity of the relationship between
#sales volume and median price.
Correlation and General Trend
The analysis highlights a generally positive correlation between the
number of active property listings and the number of sales, suggesting
that greater supply tends to be accompanied by a higher number of sales.
However, the strength of this correlation shows a decreasing trend over
time across all cities considered, indicating a likely change in market
dynamics.
City-Level Considerations
Beaumont and Bryan-College Station show a decrease in correlation over
the analyzed period, with values ranging from moderate to low,
suggesting that sales no longer proportionally follow the trend in
listings over the years. Tyler shows a slight decrease in correlation
over the years but maintains a more stable relationship compared to
other cities. Wichita Falls presents a rather stable trend, with a more
consistent correlation over time.
Linear Model and Residual Analysis
The application of a linear regression model confirms a positive
relationship between the number of listings and sales, but the residuals
show heteroscedasticity and skewness, meaning non-constant variance of
errors and asymmetry. Furthermore, the residuals display strong
unexplained variability and cyclical temporal patterns, underscoring the
influence of additional factors or nonlinear effects. The presence of
outliers indicates markets or periods with particular behavior.
Nonlinear Regression (Loess)
Using loess regression provides a more flexible and realistic model,
capturing local variations over time and possible nonlinearities in the
relationship between listings and sales.
Spearman Correlation
The analysis with Spearman’s correlation, which is robust to outliers,
confirms the observed trends: a positive but complex and evolving
relationship, with a decreasing correlation over time.
#Correlation with sales (verify if listings influence sales)
ggplot(data = data)+
geom_point(aes(x = listings, y = sales, colour = city))+
geom_smooth(aes(x = listings, y = sales), lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Number of sales vs Listings by year",
x = "Number of listings",
y = "Number of sales")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Correlation index by city and year between number of sales and listings
corr_by_yearcity_ls <- corr_fun(data, year, city, listings, sales)
#Plot of corrplot for this calculated index
corr_by_yearcity_ls %>%
ungroup()
## # A tibble: 20 × 3
## year city corr_index
## <int> <chr> <dbl>
## 1 2010 Beaumont 0.568
## 2 2010 Bryan-College Station 0.618
## 3 2010 Tyler 0.390
## 4 2010 Wichita Falls -0.118
## 5 2011 Beaumont 0.509
## 6 2011 Bryan-College Station 0.722
## 7 2011 Tyler 0.866
## 8 2011 Wichita Falls 0.760
## 9 2012 Beaumont 0.446
## 10 2012 Bryan-College Station 0.414
## 11 2012 Tyler 0.749
## 12 2012 Wichita Falls 0.274
## 13 2013 Beaumont 0.443
## 14 2013 Bryan-College Station 0.191
## 15 2013 Tyler 0.852
## 16 2013 Wichita Falls 0.415
## 17 2014 Beaumont 0.334
## 18 2014 Bryan-College Station 0.208
## 19 2014 Tyler 0.382
## 20 2014 Wichita Falls 0.657
#Transform in wide format
corr_matrix_ls <- corr_by_yearcity_ls %>%
pivot_wider(names_from = city, values_from = corr_index)
#Remove column "year" to have only numeric variable
mat_ls <- as.matrix(corr_matrix_ls[,-1])
#Set rownames like years, to be more clear
rownames(mat_ls) <- corr_matrix_ls$year
#Corrplot
corrplot(mat_ls, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30,
number.cex = 0.9, cl.pos = "b", title = "Number of sales vs Listings correlation by year and city",
mar = c(0,0,1,0))
#This analysis shows a low correlation between the two variables,
#except for the city of Tyler in the years 2011 to 2013.
#Generally the correlation is positive, so the number of sales grows with the number of listings,
#but the dependency is not strong.
#Analyzing the situation by city over the years, the correlation plot shows that:
#Beaumont --> the correlation goes from 0.57 to 0.33, so sales track the number of listings less and less closely
#Bryan-College Station --> the correlation goes from 0.62 to 0.21, so sales track the number of listings less and less closely
#Tyler --> shows a small decrease over the years
#Wichita Falls --> shows a stable trend
#Plot of joint distribution of listings and sales
ggplot(data, aes(x = listings, y = sales)) +
geom_point(alpha = 0.5, aes(color = city)) +
stat_ellipse(level = 0.95, aes(color = city), size = 1) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Number of sales vs Listings joint distribution with confidence intervals",
x = "Number of listings",
y = "Number of sales") +
theme_classic()
#Plot of points around the linear regression line
ggplot(data, aes(x = listings, y = sales, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Number of sales vs Listings linear regression with confidence intervals",
x = "Listings",
y = "Number of sales") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#Plot of residuals to show other characteristics of the correlation
model <- lm(sales ~ listings, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)
ggplot(data, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Linear regression residuals (number of sales vs listings)",
x = "Predicted values",
y = "Residuals") +
theme_classic()
#The linear relationship exists but is not perfectly described by the simple linear model due to heteroskedasticity and skewness in the residuals.
#The model may underestimate or overestimate the dependence in certain areas (especially for low values of the variable).
#The presence of outliers suggests that a more in-depth analysis of the data would be useful to identify possible subgroups or outliers.
#It may be worth trying variable transformations (e.g., logarithmic) or more flexible nonlinear models to better accommodate the shape of
#the relationship and improve the homogeneity of errors.
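#As a minimal additional sketch (not part of the original analysis), the possible subgroups
#mentioned above can be made explicit by fitting a separate simple linear model per city with
#lmList from the lme4 package loaded above
fits_by_city <- lmList(sales ~ listings | city, data = data)
summary(fits_by_city)
#The per-city intercepts and slopes, with their standard errors, show how much the
#listings-sales relationship differs across the four markets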
#Plot of loess smooth of sales and listings by city
ggplot(data, aes(x = listings, y = sales, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Number of sales vs Listings non linear regression with confidence intervals",
x = "Number of listings",
y = "Number of sales") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#Analysis of the relationship between the number of active listings and the number of sales
#highlights a generalized positive correlation, but one that decreases over time for all cities
#considered. This decrease tends to coincide with an increase in the median price and the total
#value of sales, suggesting a market in which the decrease in active supply does not translate
#into a proportional reduction in sales, likely due to demand and price dynamics. Examination of
#the residuals from the linear regression model reveals strong unexplained variability and wave-like
#temporal patterns, indicating that the relationship between listings and sales is influenced by
#additional factors or complex nonlinear processes. The large confidence intervals, which indicate
#considerable data dispersion, require caution in interpretation, except for one city where the
#relationship appears more defined and stable. Overall, the results underscore the need for more
#flexible models or the inclusion of additional explanatory variables to fully describe market
#dynamics.
#Because the residual analysis shows a wide dispersion of values, we can try to calculate the
#correlation using the Spearman index
corr_by_yearcity_ls_sp <- corr_sp_fun(data, year, city, listings, sales)
corr_by_yearcity_ls_sp %>%
ungroup()
## # A tibble: 20 × 3
## year city corr_spearman
## <int> <chr> <dbl>
## 1 2010 Beaumont 0.378
## 2 2010 Bryan-College Station 0.797
## 3 2010 Tyler 0.399
## 4 2010 Wichita Falls -0.0210
## 5 2011 Beaumont 0.581
## 6 2011 Bryan-College Station 0.706
## 7 2011 Tyler 0.881
## 8 2011 Wichita Falls 0.778
## 9 2012 Beaumont 0.623
## 10 2012 Bryan-College Station 0.404
## 11 2012 Tyler 0.914
## 12 2012 Wichita Falls 0.480
## 13 2013 Beaumont 0.497
## 14 2013 Bryan-College Station 0.0385
## 15 2013 Tyler 0.930
## 16 2013 Wichita Falls 0.343
## 17 2014 Beaumont 0.371
## 18 2014 Bryan-College Station 0.259
## 19 2014 Tyler 0.503
## 20 2014 Wichita Falls 0.685
corr_matrix_spearman_ls <- corr_by_yearcity_ls_sp %>%
pivot_wider(names_from = city, values_from = corr_spearman)
mat_spearman <- as.matrix(corr_matrix_spearman_ls[,-1])
rownames(mat_spearman) <- corr_matrix_spearman_ls$year
corrplot(mat_spearman, method = "color", addCoef.col = "black", tl.col = "black",
tl.cex = 0.7, tl.srt = 30, number.cex = 0.9, cl.pos = "b",
title = "Number of sales vs Listings Spearman corr by year and city",
mar = c(0,0,1,0))
#The analysis with the Spearman correlation index also confirms the considerations made above
General Correlation and Residual Patterns
The relationship between the number of listings and the total sales value (volume) appears generally positive, although with a low to moderate correlation that varies over time and among cities. The residual analysis from the linear regression model reveals four distinct clusters corresponding to the analyzed cities, suggesting that the relationship between the variables differs by geographic area. This indicates the need for separate models or models with interaction terms to better understand territory-specific dynamics.
Linear Interaction Model
The linear model with interaction between number of listings and city explains about 54% of the overall variability (R² ≈ 0.54). However, the statistical significance of the interaction terms is generally low, indicating that, statistically, the slope of the relationship between volume and listings does not differ significantly among cities. This suggests that, under the simple linear model, city does not significantly modify the effect of listings on sales value.
Models with Nonlinearity
The polynomial (quadratic) model with interactions did not show
significant improvement or relevant terms. Generalized Additive Models
(GAMs), instead, offer a more flexible and realistic description.
Visualizations and Local Trends
Loess regressions for each city clearly show different territorial
market dynamics, with varying curves relating number of listings and
sales volume. In particular, some cities exhibit strong deviations from
linearity and significant local variation.
Final Considerations
This analysis shows that the relationship between the number of active
listings and total sales value varies considerably among cities and
cannot be adequately captured by a simple linear model. Flexible models
such as GAMs are more effective in describing local complexities and
diversity. For future in-depth analyses, it is advisable to employ
approaches that allow modeling nonlinearities and spatiotemporal
interactions to improve interpretation and forecasting of the real
estate market.
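As a minimal illustration of such an approach (a sketch under assumptions, not a model estimated in this report), a mixed-effects model with city-specific slopes could be fitted with the lme4 package already loaded:
#Hypothetical sketch: random intercept and random slope for listings by city
mixed_model <- lmer(volume ~ listings + (1 + listings | city), data = data)
summary(mixed_model)
#The estimated random-slope variance indicates how much the listings-volume relationship
#varies across cities beyond the average fixed effect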
#Correlation with volume
ggplot(data = data)+
geom_point(aes(x = listings, y = volume, colour = city))+
geom_smooth(aes(x = listings, y = volume), lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Volume of sales vs Listings by year",
x = "Number of listings",
y = "Volume of sales (mln $)")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Correlation index by city and year between volume and listings
corr_by_yearcity_ls <- corr_fun(data, year, city, listings, volume)
#Plot of corrplot for this calculated index
corr_by_yearcity_ls %>%
ungroup()
## # A tibble: 20 × 3
## year city corr_index
## <int> <chr> <dbl>
## 1 2010 Beaumont 0.448
## 2 2010 Bryan-College Station 0.624
## 3 2010 Tyler 0.450
## 4 2010 Wichita Falls -0.0104
## 5 2011 Beaumont 0.559
## 6 2011 Bryan-College Station 0.670
## 7 2011 Tyler 0.882
## 8 2011 Wichita Falls 0.704
## 9 2012 Beaumont 0.355
## 10 2012 Bryan-College Station 0.368
## 11 2012 Tyler 0.776
## 12 2012 Wichita Falls 0.412
## 13 2013 Beaumont 0.421
## 14 2013 Bryan-College Station 0.172
## 15 2013 Tyler 0.899
## 16 2013 Wichita Falls 0.326
## 17 2014 Beaumont 0.285
## 18 2014 Bryan-College Station 0.120
## 19 2014 Tyler 0.363
## 20 2014 Wichita Falls 0.638
#Transform in wide format
corr_matrix_ls <- corr_by_yearcity_ls %>%
pivot_wider(names_from = city, values_from = corr_index)
#Remove column "year" to have only numeric variable
mat_ls <- as.matrix(corr_matrix_ls[,-1])
#Set rownames like years, to be more clear
rownames(mat_ls) <- corr_matrix_ls$year
#Corrplot
corrplot(mat_ls, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30,
number.cex = 0.9, cl.pos = "b", title = "Volume of sales vs Listings correlation by year and city",
mar = c(0,0,1,0))
#This analysis shows a low correlation between the two variables,
#except for the city of Tyler in the years 2011 to 2013.
#Generally the correlation is positive, so the volume of sales grows with the number of listings,
#but the dependency is not strong.
#Analyzing the situation by city over the years, the correlation plot shows that:
#Beaumont --> the correlation goes from 0.45 to 0.29 (decreasing)
#Bryan-College Station --> the correlation goes from 0.62 to 0.12 (decreasing)
#Tyler --> shows a positive trend from 2011 to 2013 with a decrease in 2014
#Wichita Falls --> shows strong variability in its trend
#Plot of joint distribution of volume and listings
ggplot(data, aes(x = listings, y = volume)) +
geom_point(alpha = 0.5, aes(color = city)) +
stat_ellipse(level = 0.95, aes(color = city), size = 1) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Volume of sales vs Listings joint distribution with confidence ellipses",
x = "Number of listings",
y = "Volume of sales (mln $)") +
theme_classic()
#Plot of points around the linear regression line
ggplot(data, aes(x = listings, y = volume, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Volume of sales vs Listings linear regression with confidence intervals",
x = "Number of listings",
y = "Volume of sales (mln $)") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#Plot of residuals to show other characteristics of the correlation
model <- lm(volume ~ listings, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)
ggplot(data, aes(x = fitted, y = residuals, color = city)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Linear regression residuals (volume of sales vs listings))",
x = "Predicted values",
y = "Residuals") +
theme_classic()
#The residuals plot from the linear regression model linking sales value to the number of active
#listings reveals an interesting pattern: the residuals are not randomly distributed but highlight
#the presence of four distinct clusters, corresponding to the different cities analyzed. This
#suggests that the relationship between sales value and the number of listings varies significantly
#across cities, indicating potentially different market dynamics. To further explore these
#differences, it is necessary to estimate separate regression models for each city or introduce
#interaction terms into a single model. This step will allow us to better understand how individual
#cities contribute to the observed variability and whether the slope of the relationship varies
#significantly depending on the geographical context.
#These four different trends in the residuals suggest four different regression trends between volume
#and listings by city. So, below is a plot of these four regression lines
ggplot(data, aes(x = listings, y = volume, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Volume of sales vs Listings linear regressions",
x = "Number of listings",
y = "Volume of sales (mln $)") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#Interaction model to investigate the influence of the variable city on the volume-listings relationship
model_interaction <- lm(volume ~ listings * city, data = data)
summary(model_interaction)
##
## Call:
## lm(formula = volume ~ listings * city, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.405 -6.782 -0.731 4.847 39.197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.940e+01 2.741e+01 1.802 0.0728 .
## listings -1.386e-02 1.630e-02 -0.850 0.3960
## cityBryan-College Station 1.032e+01 2.876e+01 0.359 0.7201
## cityTyler 8.051e+00 3.340e+01 0.241 0.8097
## cityWichita Falls -4.136e+01 3.300e+01 -1.253 0.2113
## listings:cityBryan-College Station -9.057e-04 1.733e-02 -0.052 0.9584
## listings:cityTyler 9.835e-03 1.757e-02 0.560 0.5761
## listings:cityWichita Falls 2.033e-02 2.591e-02 0.785 0.4334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.41 on 232 degrees of freedom
## Multiple R-squared: 0.5443, Adjusted R-squared: 0.5305
## F-statistic: 39.59 on 7 and 232 DF, p-value: < 2.2e-16
#From the summary of the interaction model we can highlight some results:
#R^2 --> the interaction model explains only 54% of the variability in the dataset
#Global p-value --> < 2.2e-16, so at least one predictor (excluding the intercept) contributes significantly to the model
#Intercept p-value --> the intercept is statistically significant at the 10% level
#Other p-values --> are very high, so the corresponding terms are not statistically significant
#Generally, from the interaction model, there is no statistically significant dependence of the volume vs listings
#relationship on the city, that is, the slope of the relationship does not
#differ significantly between cities. For this reason, we can explore other types of interaction,
#introducing nonlinearity
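#As an additional sketch (not part of the original analysis), the lack of significant interaction
#terms can also be checked with a formal F-test comparing the additive model with the
#interaction model fitted above
model_additive <- lm(volume ~ listings + city, data = data)
anova(model_additive, model_interaction)
#A high p-value in this comparison would confirm that letting the slope vary by city does not
#significantly improve the fit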
#Given a dataset with approximately 240 observations evenly distributed across four cities
#(about 60 cases per city), model complexity must be balanced with data availability to avoid
#overfitting and unstable estimates. To explore the relationship between sales volume and number
#of active listings, accounting for potential non-linearities and city-specific differences, two
#flexible yet parsimonious modeling approaches are considered.
#First, polynomial regression with interaction terms allows capturing simple curvilinear effects
#(e.g., quadratic terms) and varying relationships by city without inflating the number of
#parameters excessively.
model_poly <- lm(volume ~ city * (listings + I(listings^2)), data = data)
summary(model_poly)
##
## Call:
## lm(formula = volume ~ city * (listings + I(listings^2)), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.707 -6.608 -0.815 4.691 39.701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.451e+01 4.479e+02 -0.189 0.851
## cityBryan-College Station 2.311e+02 4.497e+02 0.514 0.608
## cityTyler 1.793e+02 4.830e+02 0.371 0.711
## cityWichita Falls 4.964e+01 4.838e+02 0.103 0.918
## listings 1.460e-01 5.340e-01 0.273 0.785
## I(listings^2) -4.757e-05 1.588e-04 -0.300 0.765
## cityBryan-College Station:listings -2.877e-01 5.371e-01 -0.536 0.593
## cityTyler:listings -1.760e-01 5.485e-01 -0.321 0.749
## cityWichita Falls:listings -4.359e-02 6.717e-01 -0.065 0.948
## cityBryan-College Station:I(listings^2) 9.242e-05 1.601e-04 0.577 0.564
## cityTyler:I(listings^2) 5.207e-05 1.603e-04 0.325 0.746
## cityWichita Falls:I(listings^2) -5.703e-06 2.763e-04 -0.021 0.984
##
## Residual standard error: 11.38 on 228 degrees of freedom
## Multiple R-squared: 0.5542, Adjusted R-squared: 0.5327
## F-statistic: 25.77 on 11 and 228 DF, p-value: < 2.2e-16
#All p-values indicate low statistical significance
#Second, Generalized Additive Models (GAMs) provide a more flexible framework to model smooth
#non-linear relationships that can differ between cities, while controlling complexity through
#limited degrees of freedom in the smoothing parameters to adapt to the modest sample size.
data$Bryan <- ifelse(data$city == "Bryan-College Station", 1, 0)
data$Tyler <- ifelse(data$city == "Tyler", 1, 0)
data$Wichita <- ifelse(data$city == "Wichita Falls", 1, 0)
data$Beaumont <- ifelse(data$city == "Beaumont", 1, 0)
gam_model <- gam(volume ~ city +
s(listings, by=Bryan, k=5) +
s(listings, by=Tyler, k=5) +
s(listings, by=Wichita, k=5)+
s(listings, by=Beaumont, k=5),
data=data)
summary(gam_model)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## volume ~ city + s(listings, by = Bryan, k = 5) + s(listings,
## by = Tyler, k = 5) + s(listings, by = Wichita, k = 5) + s(listings,
## by = Beaumont, k = 5)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.370 5.032 4.247 3.15e-05 ***
## cityBryan-College Station 3.914 12.882 0.304 0.761511
## cityTyler 14.546 4.118 3.533 0.000497 ***
## cityWichita Falls -1.039 7.450 -0.139 0.889252
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(listings):Bryan 2.352 2.682 4.674 0.0461 *
## s(listings):Tyler 1.429 1.429 12.858 0.0104 *
## s(listings):Wichita 1.429 1.429 1.255 0.4410
## s(listings):Beaumont 1.714 1.714 1.059 0.4613
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 20/24
## R-sq.(adj) = 0.539 Deviance explained = 55.4%
## GCV = 132.72 Scale est. = 127.78 n = 240
#1. The parametric coefficient for Tyler is statistically significant, indicating that, for a given number of listings,
#the sales volume in Tyler is higher than in the baseline city (Beaumont), while the coefficients for the other
#cities are not significant.
#2. The higher effective degrees of freedom (edf) for Bryan and Tyler indicate a nonlinear
#relationship between listings and sales volume, whereas for Wichita and Beaumont the relationship
#is linear or nearly linear.
#3. The smooth terms for Bryan and Tyler are statistically significant, meaning that the
#relationship between listings and sales volume in these cities is significant and non-constant
#(nonlinear), while for Wichita and Beaumont listings have little influence on the sales volume.
#4. Overall, the nonlinear GAM model, unlike the linear interaction model, confirms that the
#relationship between listings and sales volume depends on the city, but this dependence is
#statistically significant only for Bryan and Tyler. Furthermore, the city modifies the linearity of
#the relationship; where the relationship is linear, the city factor does not significantly
#influence it.
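#As an alternative sketch (an assumption, not the formulation used above), the same city-specific
#smooths can be written more compactly with a factor "by" variable instead of four dummy indicators
data$city_f <- factor(data$city)
gam_model_f <- gam(volume ~ city_f + s(listings, by = city_f, k = 5), data = data)
summary(gam_model_f)
#mgcv then fits one smooth of listings per level of city_f, equivalent in spirit to the
#dummy-based specification above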
#Plot of loess smooth of volume and listings by city
ggplot(data, aes(x = listings, y = volume, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Volume of sales vs Listings non linear regression with confidence intervals",
x = "Number of listings",
y = "Volume of sales (mln $)") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
Correlation and General Trend
The correlation between the number of listings and the median sales
price is generally low and subject to significant variations across
cities and years.
Linear Model and Residual Analysis
The linear regression model reveals residual patterns similar to those
observed in the relationship between listings and sales volume, with
distinct clusters by city suggesting different market dynamics. The
presence of these residual differences indicates the need to consider
geographic variability in the model.
Linear Interaction Model
The interaction model between number of listings and city explains about
83% of the overall variability, but the interaction terms are generally
not significant. This implies that the slope of the relationship between
median price and number of listings does not differ significantly among
cities in a simple linear model.
Nonlinear Models and Generalized Additive Models (GAM)
Polynomial regression with interaction terms does not show significant
improvements. GAMs reveal that the relationship between listings and
median price is significant and nonlinear for Bryan, Tyler, and
Beaumont, while the effect is less relevant for Wichita. The GAM model
confirms the dependence of the relationship on the geographic area and
shows how the city modifies the linearity of the relationship.
Visualization with LOESS Regression
LOESS curves by city clearly highlight local dynamics and nonlinear
relationships, reinforcing the idea of territorial heterogeneity in the
relationship between listings and median price.
Final considerations
The relationship between the number of property listings and median
sales price appears complex, weakly linear, and influenced by
territorial and temporal factors. A simple linear model is insufficient
to adequately describe it, whereas flexible approaches like GAMs enable
capturing nonlinearities and geographic differences. These results
suggest further analysis using models that allow precise modeling of
spatial and nonlinear variability in the real estate market.
#Correlation with median price
ggplot(data = data)+
geom_point(aes(x = listings, y = median_price, colour = city))+
geom_smooth(aes(x = listings, y = median_price), lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Median price vs listings by year",
x = "Number of listings",
y = "Median price")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Correlation index by city and year between median price and listings
corr_by_yearcity_ls <- corr_fun(data, year, city, listings, median_price)
#Plot of corrplot for this calculated index
corr_by_yearcity_ls %>%
ungroup()
## # A tibble: 20 × 3
## year city corr_index
## <int> <chr> <dbl>
## 1 2010 Beaumont -0.581
## 2 2010 Bryan-College Station -0.316
## 3 2010 Tyler 0.493
## 4 2010 Wichita Falls 0.0507
## 5 2011 Beaumont 0.508
## 6 2011 Bryan-College Station -0.424
## 7 2011 Tyler 0.376
## 8 2011 Wichita Falls 0.312
## 9 2012 Beaumont 0.0904
## 10 2012 Bryan-College Station 0.0654
## 11 2012 Tyler 0.465
## 12 2012 Wichita Falls 0.500
## 13 2013 Beaumont 0.122
## 14 2013 Bryan-College Station -0.184
## 15 2013 Tyler 0.655
## 16 2013 Wichita Falls -0.260
## 17 2014 Beaumont 0.371
## 18 2014 Bryan-College Station -0.742
## 19 2014 Tyler -0.226
## 20 2014 Wichita Falls 0.317
#Transform in wide format
corr_matrix_ls <- corr_by_yearcity_ls %>%
pivot_wider(names_from = city, values_from = corr_index)
#Remove column "year" to have only numeric variable
mat_ls <- as.matrix(corr_matrix_ls[,-1])
#Set rownames like years, to be more clear
rownames(mat_ls) <- corr_matrix_ls$year
#Corrplot
corrplot(mat_ls, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30,
number.cex = 0.9, cl.pos = "b", title = "Median price vs Listings correlation by year and city",
mar = c(0,0,1,0))
#This analysis reveals varying relationships between listings and median price across cities and
#years. For Bryan-College Station, the relationship is predominantly negative with generally low
#linear correlation, except in 2014. Tyler and Wichita Falls exhibit positive but weak and
#inconsistent correlations over time. Beaumont shows a shift from negative to positive correlations
#across years, though correlations remain weak. Overall, regardless of the year, all cities display
#low linear correlation between these two variables, suggesting the absence of a strong linear
#relationship.
#Plot of joint distribution of median price and listings
ggplot(data, aes(x = listings, y = median_price)) +
geom_point(alpha = 0.5, aes(color = city)) +
stat_ellipse(level = 0.95, aes(color = city), size = 1) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Median price vs Listings joint distribution with confidence ellipses",
x = "Listings",
y = "Median price") +
theme_classic()
#Plot of points around the linear regression line
ggplot(data, aes(x = listings, y = median_price, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Median price vs Listings linear regression with confidence intervals",
x = "Number of listings",
y = "Median price") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#Plot of residuals to show other characteristics of the correlation
model <- lm(median_price ~ listings, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)
ggplot(data, aes(x = fitted, y = residuals, color = city)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Linear regression residuals (median price vs listings)",
x = "Predicted values",
y = "Residuals") +
theme_classic()
#This analysis shows a particular pattern of residuals, similar to the analysis between listings and volume.
#So, we extend this analysis as done previously.
#These four different trends in the residuals suggest four different regression trends between median price
#and listings by city. So, below is a plot of these four regression lines
ggplot(data, aes(x = listings, y = median_price, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Median price vs Listings linear regressions",
x = "Number of listings",
y = "Median price") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#Interaction model to investigate the influence of the variable city on the median price-listings relationship
model_interaction <- lm(median_price ~ listings * city, data = data)
summary(model_interaction)
##
## Call:
## lm(formula = median_price ~ listings * city, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27642 -6338 652 6038 33846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 154590.766 22727.226 6.802 8.68e-11 ***
## listings -14.650 13.514 -1.084 0.2795
## cityBryan-College Station 37982.241 23845.127 1.593 0.1126
## cityTyler 15533.789 27694.337 0.561 0.5754
## cityWichita Falls -47723.146 27362.571 -1.744 0.0825 .
## listings:cityBryan-College Station -9.411 14.367 -0.655 0.5131
## listings:cityTyler 4.777 14.565 0.328 0.7432
## listings:cityWichita Falls 9.017 21.482 0.420 0.6751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9460 on 232 degrees of freedom
## Multiple R-squared: 0.8309, Adjusted R-squared: 0.8257
## F-statistic: 162.8 on 7 and 232 DF, p-value: < 2.2e-16
#From the summary of the interaction model we can highlight some results:
#R^2 --> the interaction model explains about 83% of the variability in the dataset
#Global p-value --> < 2.2e-16, so at least one predictor (excluding the intercept) contributes significantly to the model
#Intercept p-value --> the intercept is statistically significant
#Other p-values --> are high, so the corresponding terms are not statistically significant
#Generally, from the interaction model, there is no statistically significant dependence of the median price vs listings
#relationship on the city, that is, the slope of the relationship does not
#differ significantly between cities. For this reason, we can explore other types of interaction,
#introducing nonlinearity
#First, polynomial regression with interaction terms allows capturing simple curvilinear effects
#(e.g., quadratic terms) and varying relationships by city without inflating the number of
#parameters excessively.
model_poly <- lm(median_price ~ city * (listings + I(listings^2)), data = data)
summary(model_poly)
##
## Call:
## lm(formula = median_price ~ city * (listings + I(listings^2)),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27494 -6103 58 5577 34019
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.906e+05 3.702e+05 1.055 0.293
## cityBryan-College Station -1.330e+05 3.717e+05 -0.358 0.721
## cityTyler -1.697e+04 3.992e+05 -0.043 0.966
## cityWichita Falls -2.052e+05 3.999e+05 -0.513 0.608
## listings -2.963e+02 4.414e+02 -0.671 0.503
## I(listings^2) 8.383e-02 1.313e-01 0.639 0.524
## cityBryan-College Station:listings 1.774e+02 4.439e+02 0.400 0.690
## cityTyler:listings 1.446e+02 4.534e+02 0.319 0.750
## cityWichita Falls:listings 1.151e+02 5.552e+02 0.207 0.836
## cityBryan-College Station:I(listings^2) -5.031e-02 1.323e-01 -0.380 0.704
## cityTyler:I(listings^2) -5.926e-02 1.325e-01 -0.447 0.655
## cityWichita Falls:I(listings^2) 1.366e-02 2.283e-01 0.060 0.952
##
## Residual standard error: 9408 on 228 degrees of freedom
## Multiple R-squared: 0.8356, Adjusted R-squared: 0.8276
## F-statistic: 105.3 on 11 and 228 DF, p-value: < 2.2e-16
#All p-values indicate low statistical significance
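#As an additional sketch (not part of the original analysis), the linear interaction model and the
#polynomial model can also be compared directly with an information criterion
AIC(model_interaction, model_poly)
#Similar AIC values would confirm that the quadratic terms do not add useful explanatory power
#for the median price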
#Second, Generalized Additive Models (GAMs) provide a more flexible framework to model smooth
#non-linear relationships that can differ between cities, while controlling complexity through
#limited degrees of freedom in the smoothing parameters to adapt to the modest sample size.
gam_model <- gam(median_price ~ city +
s(listings, by=Bryan, k=5) +
s(listings, by=Tyler, k=5) +
s(listings, by=Wichita, k=5)+
s(listings, by=Beaumont, k=5),
data=data)
summary(gam_model)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## median_price ~ city + s(listings, by = Bryan, k = 5) + s(listings,
## by = Tyler, k = 5) + s(listings, by = Wichita, k = 5) + s(listings,
## by = Beaumont, k = 5)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93801 3905 24.019 < 2e-16 ***
## cityBryan-College Station 25577 8472 3.019 0.00282 **
## cityTyler 31260 5829 5.363 1.98e-07 ***
## cityWichita Falls 1638 6136 0.267 0.78981
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(listings):Bryan 2.188 2.516 15.370 1.39e-06 ***
## s(listings):Tyler 2.008 2.263 4.475 0.00512 **
## s(listings):Wichita 1.429 1.429 2.287 0.25782
## s(listings):Beaumont 1.714 1.714 50.053 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 20/24
## R-sq.(adj) = 0.829 Deviance explained = 83.5%
## GCV = 9.1502e+07 Scale est. = 8.7942e+07 n = 240
#1. The parametric coefficients for Tyler and Bryan-College Station are statistically significant, indicating that, for a given
#number of listings, the median price in Tyler and Bryan-College Station is higher than in the baseline city (Beaumont),
#while the coefficient for the remaining city is not significant.
#2. The higher effective degrees of freedom (edf) for all cities indicate a nonlinear
#relationship between listings and median price.
#3. The smooth terms for Bryan, Tyler and Beaumont are statistically significant, meaning that the
#relationship between listings and median price in these cities is significant and non-constant
#(nonlinear), while for Wichita Falls listings have little influence on the median price.
#4. Overall, the nonlinear GAM model, unlike the linear interaction model, confirms that the
#relationship between listings and median price depends on the city, but this dependence is
#statistically significant only for Bryan, Tyler and Beaumont. Furthermore, the city modifies the linearity of
#the relationship; where the relationship is linear, the city factor does not significantly
#influence it.
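#As an additional sketch (not part of the original analysis), the estimated smooths of the GAM can
#be inspected directly with the plotting method provided by mgcv
plot(gam_model, pages = 1, shade = TRUE)
#Each panel shows a city-specific smooth of listings with its confidence band, making the
#nonlinearities described above visible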
#Plot of loess smooth of median price and listings by city
ggplot(data, aes(x = listings, y = median_price, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Median price vs Listings non linear regression with confidence intervals",
x = "Number of listings",
y = "Median price") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
Correlation and General Trend
The linear correlation between number of sales and months needed to sell
listings (months_inventory) is generally low, except for 2011 which
shows stronger correlations. Overall, a positive correlation is
observed: as sales increase, months of inventory tend to increase as
well, but this relationship is weak and sometimes negative. Among
cities, Beaumont and Bryan-College Station show a declining correlation
over time, while Tyler and Wichita Falls display trends of increase
followed by decrease.
Visualizations and Linear Model
Joint distributions and linear regression plots reveal considerable data
dispersion and scatter, confirmed by residual analysis. Residuals
exhibit heteroscedasticity and skewness, indicating that the simple
linear model does not adequately describe the actual relationship.
Nature of the Relationship and Nonlinear Models
Loess regression analysis suggests a nonlinear and partly inverse
relationship: generally, higher sales correspond to fewer months needed
to sell listings. This aligns with market intuition but contrasts with
the averaged positive correlations previously observed.
Spearman Correlation
Spearman correlation analysis, less sensitive to outliers, confirms the
weak and complex nature of the relationship, emphasizing the need for
flexible models and variable transformations for accurate
description.
Final Considerations
The relationship between sales numbers and months inventory shows
complex features not well captured by simple linear modeling. Evidence
indicates a nonlinear, likely inverse relationship, warranting further
investigation via nonlinear models or transformations. The city-specific
differences and time dynamics highlight the structural market
complexity.
#Correlation with sales
ggplot(data = data)+
geom_point(aes(x = sales, y = months_inventory, colour = city))+
geom_smooth(aes(x = sales, y = months_inventory), lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Months inventory vs Number of sales by year",
x = "Number of sales",
y = "Months inventory")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Correlation index by city and year between number of sales and months inventory
corr_by_yearcity_minv <- corr_fun(data, year, city, sales, months_inventory)
#Plot of corrplot for this calculated index
corr_by_yearcity_minv %>%
ungroup()
## # A tibble: 20 × 3
## year city corr_index
## <int> <chr> <dbl>
## 1 2010 Beaumont 0.431
## 2 2010 Bryan-College Station 0.345
## 3 2010 Tyler -0.00956
## 4 2010 Wichita Falls -0.249
## 5 2011 Beaumont 0.646
## 6 2011 Bryan-College Station 0.778
## 7 2011 Tyler 0.757
## 8 2011 Wichita Falls 0.769
## 9 2012 Beaumont -0.149
## 10 2012 Bryan-College Station 0.289
## 11 2012 Tyler 0.453
## 12 2012 Wichita Falls 0.0972
## 13 2013 Beaumont -0.118
## 14 2013 Bryan-College Station 0.0449
## 15 2013 Tyler 0.653
## 16 2013 Wichita Falls 0.235
## 17 2014 Beaumont 0.0904
## 18 2014 Bryan-College Station 0.113
## 19 2014 Tyler 0.0578
## 20 2014 Wichita Falls 0.616
#Transform in wide format
corr_matrix_minv <- corr_by_yearcity_minv %>%
pivot_wider(names_from = city, values_from = corr_index)
#Remove column "year" to have only numeric variable
mat_minv <- as.matrix(corr_matrix_minv[,-1])
#Set rownames like years, to be more clear
rownames(mat_minv) <- corr_matrix_minv$year
#Corrplot
corrplot(mat_minv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30,
number.cex = 0.9, cl.pos = "b", title = "Months inventory vs Number of sales correlation by year and city",
mar = c(0,0,1,0))
#This analysis shows a low correlation between the two variables,
#except for the year 2011.
#Generally the correlation is positive, so months inventory grows with the number of sales,
#but the dependency is not strong. There are also some negative correlation values.
#Analyzing the situation by city over the years, the correlation plot shows that:
#Beaumont --> the correlation goes from 0.43 to 0.09
#Bryan-College Station --> the correlation goes from 0.34 to 0.11
#Tyler --> shows an increase in positive correlation from 2010 to 2011 and a decrease from 2013 to 2014
#Wichita Falls --> shows a trend similar to Tyler
#Plot of joint distribution of months inventory and sales
ggplot(data, aes(x = sales, y = months_inventory)) +
geom_point(alpha = 0.5, aes(color = city)) +
stat_ellipse(level = 0.95, aes(color = city), size = 1) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Months inventory vs Number of sales joint distribution with confidence ellipses",
x = "Number of sales",
y = "Months inventory") +
theme_classic()
#Plot of points around the linear regression line
ggplot(data, aes(x = sales, y = months_inventory, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Months inventory vs Number of sales linear regression with confidence intervals",
x = "Number of sales",
y = "Months inventory") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#Plot of residuals to show other characteristics of the correlation
model <- lm(months_inventory ~ sales, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)
ggplot(data, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Linear regression residuals (months inventory vs number of sales)",
x = "Predicted values",
y = "Residuals") +
theme_classic()
#The linear relationship exists but is not perfectly described by the simple linear model due to heteroskedasticity and skewness in the residuals.
#The model may underestimate or overestimate the dependence in certain areas (especially for low values of the variable).
#The presence of outliers suggests that a more in-depth analysis of the data would be useful to identify possible subgroups or outliers.
#It may be worth trying variable transformations (e.g., logarithmic) or more flexible nonlinear models to better accommodate the shape of
#the relationship and improve the homogeneity of errors.
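#As a minimal sketch of one such nonlinear model (an assumption, not part of the original
#analysis), a GAM with a smooth term for the number of sales could be fitted with the mgcv
#package loaded above
gam_minv <- gam(months_inventory ~ city + s(sales, k = 5), data = data)
summary(gam_minv)
#A significant smooth term with edf greater than 1 would support the nonlinear, partly inverse
#relationship suggested by the loess curves below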
#Plot of loess smooth of sales and months inventory by city
ggplot(data, aes(x = sales, y = months_inventory, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Months inventory vs Number of sales non linear regression with confidence intervals",
x = "Number of sales",
y = "Months inventory") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#The relationship between the number of sales and the months needed to sell listings
#(months_inventory) appears to be nonlinear and inversely proportional. Specifically,
#as the number of sales increases, the time required to sell all listings tends to decrease.
#This intuition aligns with market expectations and is supported by the scatter of residuals in
#the linear regression model, which indicates that a simple linear trend may not adequately capture
#the complexity of this relationship. This appears to be at odds with the positive correlation
#coefficients computed by year and city, but the “average” positive correlation coefficient
#alone may not correctly reflect the true nature of the relationship between the variables, which is
#better highlighted by this trend.
#Because the residual analysis shows a wide dispersion of values, we can try to calculate the
#correlation using the Spearman index
corr_by_yearcity_minv_sp <- corr_sp_fun(data, year, city, sales, months_inventory)
corr_by_yearcity_minv_sp %>%
ungroup()
## # A tibble: 20 × 3
## year city corr_spearman
## <int> <chr> <dbl>
## 1 2010 Beaumont 0.214
## 2 2010 Bryan-College Station 0.258
## 3 2010 Tyler -0.112
## 4 2010 Wichita Falls -0.123
## 5 2011 Beaumont 0.587
## 6 2011 Bryan-College Station 0.727
## 7 2011 Tyler 0.828
## 8 2011 Wichita Falls 0.743
## 9 2012 Beaumont -0.0495
## 10 2012 Bryan-College Station 0.253
## 11 2012 Tyler 0.621
## 12 2012 Wichita Falls 0.239
## 13 2013 Beaumont -0.120
## 14 2013 Bryan-College Station -0.00699
## 15 2013 Tyler 0.796
## 16 2013 Wichita Falls 0.162
## 17 2014 Beaumont 0.106
## 18 2014 Bryan-College Station 0.115
## 19 2014 Tyler 0.168
## 20 2014 Wichita Falls 0.614
corr_matrix_spearman_minv <- corr_by_yearcity_minv_sp %>%
pivot_wider(names_from = city, values_from = corr_spearman)
mat_spearman <- as.matrix(corr_matrix_spearman_minv[,-1])
rownames(mat_spearman) <- corr_matrix_spearman_minv$year
corrplot(mat_spearman, method = "color", addCoef.col = "black", tl.col = "black",
tl.cex = 0.7, tl.srt = 30, number.cex = 0.9, cl.pos = "b",
title = "Months inventory vs Number of sales Spearman corr by year and city",
mar = c(0,0,1,0))
#The analysis with the Spearman correlation index also confirms the considerations made above
Correlation and General Trend
The linear correlation between total sales value and months needed to
sell listings is generally low, with some exceptions in 2011 showing
stronger correlations. Overall, a positive correlation is observed: as
sales value increases, months inventory tends to increase as well, but
this dependency is neither strong nor stable over time. Looking at
individual cities, Beaumont and Bryan-College Station show a decreasing
correlation over time, Tyler displays an initial positive trend followed
by a decline, and Wichita Falls exhibits high variability in trend.
Visualizations and Linear Model
Joint distributions and linear regression plots reveal considerable data
dispersion and non-random residual patterns. Residual analysis of the
linear model highlights heteroscedasticity and non-random patterns,
suggesting that the relationship is not adequately captured by a simple
linear model.
Nature of the Relationship and Nonlinear Models
LOESS regression confirms a nonlinear relationship between sales volume
and months inventory, consistent with a pattern where an increase in
sales value does not correspond linearly to an increase in time to sell.
This indicates complex and varied phenomena across cities and
periods.
Final Considerations
The relationship between sales value and months inventory is weak,
unstable, and nonlinear, making a simple linear model inadequate to
describe it. Temporal and spatial differences suggest the need for more
flexible models and possibly additional variables or transformations for
better understanding.
#Correlation with volume
ggplot(data = data)+
geom_point(aes(x = volume, y = months_inventory, colour = city))+
geom_smooth(aes(x = volume, y = months_inventory), lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Months inventory vs Volume of sales by year",
x = "Volume of sales (mln $)",
y = "Months inventory")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Correlation index by city and year between volume and months inventory
corr_by_yearcity_minv <- corr_fun(data, year, city, volume, months_inventory)
#Plot of corrplot for this calculated index
corr_by_yearcity_minv %>%
ungroup()
## # A tibble: 20 × 3
## year city corr_index
## <int> <chr> <dbl>
## 1 2010 Beaumont 0.307
## 2 2010 Bryan-College Station 0.372
## 3 2010 Tyler 0.0157
## 4 2010 Wichita Falls -0.152
## 5 2011 Beaumont 0.559
## 6 2011 Bryan-College Station 0.733
## 7 2011 Tyler 0.781
## 8 2011 Wichita Falls 0.747
## 9 2012 Beaumont -0.219
## 10 2012 Bryan-College Station 0.238
## 11 2012 Tyler 0.503
## 12 2012 Wichita Falls 0.275
## 13 2013 Beaumont -0.127
## 14 2013 Bryan-College Station 0.0261
## 15 2013 Tyler 0.696
## 16 2013 Wichita Falls 0.0916
## 17 2014 Beaumont 0.0292
## 18 2014 Bryan-College Station 0.0252
## 19 2014 Tyler 0.0315
## 20 2014 Wichita Falls 0.595
#Transform in wide format
corr_matrix_minv <- corr_by_yearcity_minv %>%
pivot_wider(names_from = city, values_from = corr_index)
#Remove column "year" to have only numeric variable
mat_minv <- as.matrix(corr_matrix_minv[,-1])
#Set rownames like years, to be more clear
rownames(mat_minv) <- corr_matrix_minv$year
#Corrplot
corrplot(mat_minv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30,
number.cex = 0.9, cl.pos = "b", title = "Months inventory vs Volume of sales correlation by year and city",
mar = c(0,0,1,0))
#This analysis shows a low correlation between the two variables,
#except for the year 2011.
#Generally the correlation is positive, so months inventory grows with the volume of sales,
#but the dependency is not strong.
#Analyzing the situation by city over the years, the correlation plot shows that:
#Beaumont --> the correlation goes from 0.31 to 0.03 (decreasing)
#Bryan-College Station --> the correlation goes from 0.37 to 0.03 (decreasing)
#Tyler --> shows a positive trend from 2010 to 2011 with a decrease from 2013 to 2014
#Wichita Falls --> shows strong variability in its trend
#Plot of joint distribution of volume and months inventory
ggplot(data, aes(x = volume, y = months_inventory)) +
geom_point(alpha = 0.5, aes(color = city)) +
stat_ellipse(level = 0.95, aes(color = city), size = 1) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Months inventory vs Volume of sales joint distribution with confidence ellipses",
x = "Volume of sales (mln $)",
y = "Months inventory") +
theme_classic()
#Plot of points around the linear regression line
ggplot(data, aes(x = volume, y = months_inventory, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Months inventory vs Volume of sales linear regression with confidence intervals",
x = "Volume of sales (mln $)",
y = "Months inventory") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#Plot of residuals to show other characteristics of the correlation
model <- lm(months_inventory ~ volume, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)
ggplot(data, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Linear regression residuals (months inventory vs volume of sales)",
x = "Predicted values",
y = "Residuals") +
theme_classic()
#The situation is similar to that between months inventory and number of sales, with a nonlinear relationship
#suggested by the distribution of the residuals
#Plot of loess smooth of volume and months inventory by city
ggplot(data, aes(x = volume, y = months_inventory, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Months inventory vs Volume of sales non linear regression with confidence intervals",
x = "Volume of sales (mln $)",
y = "Months inventory") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
Correlation and General Trend
The correlation between median sales price and months needed to sell
listings is generally low and varies over time and across cities.
Bryan-College Station in 2014 shows a relatively strong negative
correlation, while other cities exhibit fluctuating or decreasing
values. The negative correlation suggests that, in some cases, an
increase in median price tends to be associated with a decrease in the
time required to sell homes, a counterintuitive result compared to the
expectation that higher prices require more time to sell.
Visualizations and Linear Model
Joint distributions and linear regressions show dispersion and city
differences, with residuals indicating a nonlinear relationship not well
captured by the simple linear model. Residuals are non-randomly
distributed, suggesting complex patterns or local factors not accounted
for by the model.
Nonlinear Models and Local Trends
LOESS regression by city clearly highlights nonlinear relationships. In
some cities, a significant inverse relationship emerges between median
price and months inventory, where increases in median price correspond
to decreases in selling time.
Final Considerations
The relationship between median sales price and months inventory is
complex, predominantly nonlinear, with varying territorial effects.
Counterintuitive results (price increases linked to shorter selling
times) suggest sophisticated market dynamics and the importance of
flexible model-based interpretations.
#Correlation with median price
ggplot(data = data)+
geom_point(aes(x = median_price, y = months_inventory, colour = city))+
geom_smooth(aes(x = median_price, y = months_inventory), lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Months inventory vs Median price by year",
x = "Median price",
y = "Months inventory")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Correlation index by city and year between median price and months inventory
corr_by_yearcity_minv <- corr_fun(data, year, city, median_price, months_inventory)
#Plot of corrplot for this calculated index
corr_by_yearcity_minv %>%
ungroup()
## # A tibble: 20 × 3
## year city corr_index
## <int> <chr> <dbl>
## 1 2010 Beaumont -0.475
## 2 2010 Bryan-College Station -0.194
## 3 2010 Tyler 0.155
## 4 2010 Wichita Falls -0.0413
## 5 2011 Beaumont 0.406
## 6 2011 Bryan-College Station -0.408
## 7 2011 Tyler 0.218
## 8 2011 Wichita Falls 0.283
## 9 2012 Beaumont -0.308
## 10 2012 Bryan-College Station -0.0548
## 11 2012 Tyler 0.120
## 12 2012 Wichita Falls 0.443
## 13 2013 Beaumont -0.400
## 14 2013 Bryan-College Station -0.288
## 15 2013 Tyler 0.438
## 16 2013 Wichita Falls -0.440
## 17 2014 Beaumont 0.0964
## 18 2014 Bryan-College Station -0.783
## 19 2014 Tyler -0.459
## 20 2014 Wichita Falls 0.269
#Transform in wide format
corr_matrix_minv <- corr_by_yearcity_minv %>%
pivot_wider(names_from = city, values_from = corr_index)
#Remove column "year" to have only numeric variable
mat_minv <- as.matrix(corr_matrix_minv[,-1])
#Set rownames like years, to be more clear
rownames(mat_minv) <- corr_matrix_minv$year
#Corrplot
corrplot(mat_minv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30,
number.cex = 0.9, cl.pos = "b", title = "Months inventory vs Median price correlation by year and city",
mar = c(0,0,1,0))
#This analysis shows a low correlation between the two variables,
#except for the city of Bryan-College Station in the year 2014.
#Generally the correlation is negative, so months inventory decreases while the median price increases,
#but the dependency is not strong. This first result is unexpected, because with a high median price
#we might expect the number of months needed to sell to be high as well, but the relationship is nonlinear
#and this correlation may not capture all the trends in the data.
#Analyzing the situation by city over the years, the correlation plot shows that:
#Beaumont --> the correlation goes from -0.47 to 0.10 (decreasing in absolute value)
#Bryan-College Station --> the correlation goes from -0.19 to -0.78 (increasing in absolute value)
#Tyler --> the correlation goes from 0.15 to -0.46 (increasing in absolute value)
#Wichita Falls --> shows strong variability in its trend
#Plot of joint distribution of median price and months inventory
ggplot(data, aes(x = median_price, y = months_inventory)) +
geom_point(alpha = 0.5, aes(color = city)) +
stat_ellipse(level = 0.95, aes(color = city), size = 1) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Months inventory vs Median price joint distribution with confidence intervals",
x = "Median price",
y = "Months inventory") +
theme_classic()
#Plot of points around the linear regression line
ggplot(data, aes(x = median_price, y = months_inventory, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Months inventory vs Median price linear regression with confidence intervals",
x = "Median price",
y = "Months inventory") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#Plot of residuals to show other characteristics of the correlation
model <- lm(months_inventory ~ median_price, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)
ggplot(data, aes(x = fitted, y = residuals)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Linear regression residuals (months inventory vs median price)",
x = "Predicted values",
y = "Residuals") +
theme_classic()
#Months inventory and median price show a similar situation, with a nonlinear relationship
#suggested by the distribution of the residuals
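#To quantify how weak this linear fit is, the overall fit statistics can be extracted with
#broom (already loaded); a minimal sketch, the exact values depend on the model fitted above
glance(model) %>%
select(r.squared, adj.r.squared, p.value)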
#Plot of loess smooth of median price and months inventory by city
ggplot(data, aes(x = median_price, y = months_inventory, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Months inventory vs Median price non lineare regression with confidence intervals",
x = "Median price",
y = "Months inventory") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#Overall, the relationship between the two variables appears nonlinear: the residual plot shows
#points scattered around the horizontal line without a clear global pattern, which suggests a
#significant local influence of the city variable. However, in certain cities a clear inverse
#relationship is observed: as the median price increases, the number of months required to sell
#all listings tends to decrease
Correlation and General Trend
The analysis shows a strong positive linear correlation between the
number of active listings and the number of months needed to sell all
listings (months_inventory). This outcome aligns with the notion that a
higher number of active listings indicates lower selling capacity and
therefore a longer time to clear the market. Joint plots and confidence
ellipses confirm this robust positive relationship between listings and
months inventory, with variations across cities. Linear regression lines
for each city illustrate trends consistent with the positive
correlation.
Linear Regression Model and Residuals
The linear regression model confirms the statistical significance of
both the intercept and slope (both with very low p-values around 2e-16).
The coefficient of determination (R²) is moderate, about 54%, indicating
the model explains a good portion of variance but leaves room for
unexplained factors. Residual analysis reveals some non-random patterns,
suggesting possible local effects or additional influencing
variables.
Final Considerations
The strong positive linear dependence between listings and months
inventory is expected and statistically significant. Residual
variability and city-level differences suggest further investigation
could improve understanding and modeling of the local real estate
market.
#Correlation with listings
ggplot(data = data)+
geom_point(aes(x = listings, y = months_inventory, colour = city))+
geom_smooth(aes(x = listings, y = months_inventory), lwd = 1)+
facet_wrap(~ year, scales = "free_x")+
labs(title = "Months inventory vs Listings by year",
x = "Number of listings",
y = "Months inventory")+
theme_classic()+
theme(axis.text.x = element_text(angle = 60, hjust = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#Correlation index by city and year between listings and months inventory
corr_by_yearcity_minv <- corr_fun(data, year, city, listings, months_inventory)
#Display the calculated correlation indices (the corrplot is built below)
corr_by_yearcity_minv %>%
ungroup()
## # A tibble: 20 × 3
## year city corr_index
## <int> <chr> <dbl>
## 1 2010 Beaumont 0.966
## 2 2010 Bryan-College Station 0.898
## 3 2010 Tyler 0.854
## 4 2010 Wichita Falls 0.974
## 5 2011 Beaumont 0.813
## 6 2011 Bryan-College Station 0.993
## 7 2011 Tyler 0.965
## 8 2011 Wichita Falls 0.796
## 9 2012 Beaumont 0.730
## 10 2012 Bryan-College Station 0.965
## 11 2012 Tyler 0.882
## 12 2012 Wichita Falls 0.980
## 13 2013 Beaumont 0.526
## 14 2013 Bryan-College Station 0.986
## 15 2013 Tyler 0.883
## 16 2013 Wichita Falls 0.941
## 17 2014 Beaumont 0.910
## 18 2014 Bryan-College Station 0.988
## 19 2014 Tyler 0.922
## 20 2014 Wichita Falls 0.991
#Transform in wide format
corr_matrix_minv <- corr_by_yearcity_minv %>%
pivot_wider(names_from = city, values_from = corr_index)
#Remove the "year" column to keep only numeric variables
mat_minv <- as.matrix(corr_matrix_minv[,-1])
#Set the row names to the years, for clarity
rownames(mat_minv) <- corr_matrix_minv$year
#Corrplot
corrplot(mat_minv, method = "color", addCoef.col = "black", tl.col = "black", tl.cex = 0.7, tl.srt = 30,
number.cex = 0.9, cl.pos = "b", title = "Months inventory vs Listings correlation by year and city",
mar = c(0,0,1,0))
#This analysis shows a strong linear correlation between the number of listings and months
#inventory. This trend is expected, because a larger number of active listings suggests a lower
#selling rate, and therefore higher values of months inventory.
#To confirm this, the joint distribution with confidence ellipses is shown below
#Plot of joint distribution of listings and months inventory
ggplot(data, aes(x = listings, y = months_inventory)) +
geom_point(alpha = 0.5, aes(color = city)) +
stat_ellipse(level = 0.95, aes(color = city), size = 1) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Months inventory vs Listings joint distribution with confidence ellipses",
x = "Number of listings",
y = "Months inventory") +
theme_classic()
#Plot of points around the linear regression line
ggplot(data, aes(x = listings, y = months_inventory, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
facet_wrap(~ city, scales = "free_x")+
labs(title = "Months inventory vs Listings linear regression with confidence intervals",
x = "Number of listings",
y = "Months inventory") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
#Plot of residuals to show other characteristics of the relationship
model <- lm(months_inventory ~ listings, data = data)
data$residuals <- resid(model)
data$fitted <- fitted(model)
ggplot(data, aes(x = fitted, y = residuals, color = city)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Linear regression residuals (months inventory vs listings)",
x = "Predicted values",
y = "Residuals") +
theme_classic()
#The analysis of the relationship between the number of months needed to sell listings and the
#number of active listings reveals a strong positive linear dependence, consistent with the notion
#that a higher number of listings reflects a lower selling capacity and therefore a longer time to
#clear the market. In the summary of the linear regression model, both the slope coefficient for
#the listings variable and the intercept have very low p-values (2e-16), indicating statistical
#significance. However, the coefficient of determination R² is moderate, around 54%
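#The figures quoted above (p-values near 2e-16, R² of about 54%) come from the standard lm
#summary; a minimal sketch of how to display them for the model fitted above
summary(model)
summary(model)$r.squared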
Historical Trends in Sales and Values:
Effectiveness of Marketing Strategies (Listings):
Distribution and Relationships among Prices, Sales, and Listing Duration:
Implications:
After the bivariate analyses, we can proceed with a multivariate analysis focusing on variables that do not exhibit multicollinearity. We hypothesize a linear mixed-effects model for sales volume that includes the main effects and all interactions among listings, months_inventory, and city, with random intercepts by year and month. This approach captures complex relationships and local variations, such as differences across cities, without bias from strongly correlated predictors.
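As a quick check on multicollinearity before fitting the mixed model, the pairwise correlations among the candidate numerical variables can be inspected. The snippet below is a minimal sketch; the common rule of thumb of flagging pairs with |r| above roughly 0.8 is an assumption, not a result of this analysis.
#Pairwise correlation matrix of the numerical variables considered as predictors
cor(data[, c("sales", "volume", "median_price", "listings", "months_inventory")])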
#Creation of scaled data frame
volume_scaled <- as.vector(scale(data$volume))
listings_scaled <- as.vector(scale(data$listings))
months_inv_scaled <- as.vector(scale(data$months_inventory))
data_scaled <- data.frame(
year = data$year,
month = data$month,
city = data$city,
volume = volume_scaled,
listings = listings_scaled,
months_inventory = months_inv_scaled
)
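#Quick check that the scaling worked as intended (means close to 0, standard deviations close
#to 1); a minimal sketch
sapply(data_scaled[c("volume", "listings", "months_inventory")],
function(x) round(c(mean = mean(x), sd = sd(x)), 3))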
mod_multivariate <- lmer(volume ~ listings * months_inventory * city + (1|year) + (1|month), data = data_scaled)
summary(mod_multivariate)
## Linear mixed model fit by REML ['lmerMod']
## Formula: volume ~ listings * months_inventory * city + (1 | year) + (1 |
## month)
## Data: data_scaled
##
## REML criterion at convergence: 229
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4525 -0.5906 -0.0039 0.5382 4.2657
##
## Random effects:
## Groups Name Variance Std.Dev.
## month (Intercept) 0.21712 0.4660
## year (Intercept) 0.01398 0.1182
## Residual 0.11826 0.3439
## Number of obs: 240, groups: month, 12; year, 5
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -0.20734 0.19196 -1.080
## listings -0.92099 0.79243 -1.162
## months_inventory -0.37871 0.14355 -2.638
## cityBryan-College Station 0.83173 0.14995 5.547
## cityTyler -0.01248 0.52803 -0.024
## cityWichita Falls -2.77226 1.23405 -2.246
## listings:months_inventory -0.67500 0.57726 -1.169
## listings:cityBryan-College Station 4.46206 0.87585 5.095
## listings:cityTyler 2.24928 0.80297 2.801
## listings:cityWichita Falls -0.52055 1.39737 -0.373
## months_inventory:cityBryan-College Station -1.06162 0.20119 -5.277
## months_inventory:cityTyler -0.86318 0.30189 -2.859
## months_inventory:cityWichita Falls -2.55279 1.27016 -2.010
## listings:months_inventory:cityBryan-College Station 0.97342 0.58925 1.652
## listings:months_inventory:cityTyler 0.79424 0.58685 1.353
## listings:months_inventory:cityWichita Falls -1.34502 1.23212 -1.092
##
## Correlation matrix not shown by default, as p = 16 > 12.
## Use print(x, correlation=TRUE) or
## vcov(x) if you need it
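#lmer reports t values but no p-values; approximate Wald confidence intervals for the fixed
#effects give a rough significance check (a minimal sketch)
confint(mod_multivariate, parm = "beta_", method = "Wald")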
#Plot of listings and volume by city
ggplot(data_scaled, aes(x = listings, y = volume, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
theme_classic() +
labs(title = "Volume of sales vs Listings by city",
x = "Number of listings",
y = "Volume of sales (scaled)")
## `geom_smooth()` using formula = 'y ~ x'
#Plot of listings and volume by city, including the interaction with months inventory
#Creation of months inventory bins (quartiles, 25% of the observations in each bin)
data_scaled <- data_scaled %>%
mutate(
months_inventory_bin_num = ntile(months_inventory, 4),
months_inventory_bin = factor(
months_inventory_bin_num,
levels = 1:4,
labels = c("Low", "Medium-Low", "Medium-High", "High")
)
)
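#Quick check that the quartile bins are balanced (60 observations each, since n = 240);
#a minimal sketch
table(data_scaled$months_inventory_bin)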
ggplot(data_scaled, aes(x = listings, y = volume, color = city)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE) +
facet_wrap(~ months_inventory_bin) +
theme_classic() +
labs(title = "Interaction Volume of sales vs Listings by year and months inventory",
x = "Number of listings",
y = "Volume of sales (scaled)")
## `geom_smooth()` using formula = 'y ~ x'
#Plot of marginal effects between variables
effects <- ggpredict(mod_multivariate, terms = c("listings", "months_inventory", "city"))
p <- plot(effects) +
labs(
title = "Marginal effects of listings and months inventory by city",
x = "Number of listings (scaled)",
y = "Predicted volume of sales (scaled)"
)
print(p)
#Plot of residuals to show remaining structure in the model fit
data_scaled$residuals <- resid(mod_multivariate)
data_scaled$fitted <- fitted(mod_multivariate)
ggplot(data_scaled, aes(x = fitted, y = residuals, colour = city)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Linear regression of multivariate model residuals",
x = "Predicted values",
y = "Residuals") +
theme_classic()
After careful scaling and accounting for multicollinearity, a linear mixed-effects model was fitted incorporating the full factorial interactions among the numerical predictors (listings and months_inventory) and the categorical variable city, while including random intercepts for year and month. The results indicate that:
Visualizing predicted effects via interaction plots or marginal effect plots (using tools such as ggplot2 or ggeffects) can elucidate how the relationships between listings, months_inventory, and sales volume differ across cities, helping to capture heterogeneity in effects. Given the limited significance of complex triple interactions, consider simplifying the model by removing non-significant higher-order terms to enhance interpretability and reduce overfitting. Exploring alternative model specifications or incorporating additional relevant covariates may further improve model accuracy and insight into drivers of sales volume.
For Tyler and Bryan-College Station, the parallel lines with narrow confidence intervals indicate that the effects of listings and months_inventory on volume are stable and similar across these cities, showing consistent and precise model estimates. This reflects a relatively homogeneous pattern of influence. In contrast, for Beaumont and Wichita Falls, the presence of intersecting lines with wider confidence intervals suggests more variability and complexity in how these predictors affect volume. This could reflect more uncertain or non-linear relationships, possibly due to less consistent data or more complex local dynamics.
Plotting residuals against fitted values reveals positive linear patterns that differ by city. This indicates that the current model under- or overestimates volume in a systematic way depending on predicted value and city. Such patterns suggest that the model does not fully capture the relationship between predictors and volume across all cities, pointing to potential model misspecification and the need to consider nonlinear terms or random slopes to better represent city-specific effects.
#Nonlinear GAM model to capture nonlinear effects through smooth terms
data_scaled$city <- as.factor(data_scaled$city)
mod_gam <- gamm(
volume ~ s(listings, by = city) + s(months_inventory, by = city) + city,
random = list(year = ~1, month = ~1),
data = data_scaled
)
summary(mod_gam$gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## volume ~ s(listings, by = city) + s(months_inventory, by = city) +
## city
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.08993 0.11841 -0.759 0.4484
## cityBryan-College Station 0.91006 0.14479 6.285 1.67e-09 ***
## cityTyler -1.19451 0.49073 -2.434 0.0157 *
## cityWichita Falls -0.81361 0.92225 -0.882 0.3786
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(listings):cityBeaumont 0.9999 0.9999 0.230 0.632280
## s(listings):cityBryan-College Station 1.0000 1.0000 64.057 < 2e-16 ***
## s(listings):cityTyler 1.0000 1.0000 37.920 < 2e-16 ***
## s(listings):cityWichita Falls 1.0000 1.0000 0.336 0.562894
## s(months_inventory):cityBeaumont 1.0000 1.0000 15.530 0.000109 ***
## s(months_inventory):cityBryan-College Station 2.4917 2.4917 43.121 < 2e-16 ***
## s(months_inventory):cityTyler 1.8676 1.8676 44.673 < 2e-16 ***
## s(months_inventory):cityWichita Falls 1.0000 1.0000 8.898 0.003168 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.713
## Scale est. = 0.11131 n = 240
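#Quick normality check of the GAM residuals with a QQ-plot (a minimal sketch using base R)
qqnorm(resid(mod_gam$gam))
qqline(resid(mod_gam$gam), col = "red")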
#Plot of predicted values
data_scaled$predicted <- predict(mod_gam$gam, newdata = data_scaled)
ggplot(data_scaled, aes(x = listings, y = volume, color = city)) +
geom_point(alpha = 0.5) +
geom_point(aes(y = predicted), size = 1, colour = "red") +
facet_wrap(~ city)+
labs(title = " Predicted vs Observed values of volume of sales by listings and city",
x = "Number of listings(scaled)",
y = "Volume of sales (scaled)") +
theme_classic()
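#The estimated smooths themselves can be inspected with mgcv's built-in plot method;
#a minimal sketch showing all per-city smooth terms on a single page
plot(mod_gam$gam, pages = 1, shade = TRUE)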
#We can also allow the slopes of the predictors (listings and months_inventory) to vary by city,
#providing the flexibility to capture city-specific heterogeneity in their effects
mod_gam_random_slopes <- gamm(
volume ~ s(listings, by = city) + s(months_inventory, by = city) + city,
random = list(
city = ~ 1 + listings + months_inventory,
year = ~ 1,
month = ~ 1
),
data = data_scaled
)
summary(mod_gam_random_slopes$gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## volume ~ s(listings, by = city) + s(months_inventory, by = city) +
## city
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.17045 0.13696 1.244 0.215
## cityBryan-College Station 0.87642 0.19748 4.438 1.42e-05 ***
## cityTyler -3.50305 0.57635 -6.078 5.11e-09 ***
## cityWichita Falls -0.03652 1.21113 -0.030 0.976
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(listings):cityBeaumont 1.000 1.000 9.691 0.00209 **
## s(listings):cityBryan-College Station 1.000 1.000 56.761 < 2e-16 ***
## s(listings):cityTyler 1.000 1.000 69.365 < 2e-16 ***
## s(listings):cityWichita Falls 1.000 1.000 0.938 0.33382
## s(months_inventory):cityBeaumont 1.000 1.000 20.004 1.24e-05 ***
## s(months_inventory):cityBryan-College Station 2.606 2.606 34.330 < 2e-16 ***
## s(months_inventory):cityTyler 1.000 1.000 92.742 < 2e-16 ***
## s(months_inventory):cityWichita Falls 1.000 1.000 0.664 0.41587
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.754
## Scale est. = 6.1686e-08 n = 240
#Plot of predicted values
data_scaled$predicted <- predict(mod_gam_random_slopes$gam, newdata = data_scaled)
ggplot(data_scaled, aes(x = listings, y = volume, color = city)) +
geom_point(alpha = 0.5) +
geom_point(aes(y = predicted), size = 1, colour = "red") +
facet_wrap(~ city)+
labs(title = "Predicted vs Observed values of volume of sales by listings and city",
x = "Number of listings (scaled)",
y = "Volume of sales (scaled)")+
theme_classic()
#Evaluation of the residuals to investigate the modified model
#Plot of residuals against fitted values (extracted from the gam component of the gamm fit)
data_scaled$residuals <- resid(mod_gam_random_slopes$gam)
data_scaled$fitted <- fitted(mod_gam_random_slopes$gam)
ggplot(data_scaled, aes(x = fitted, y = residuals, colour = city)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Nonlinear multivariate model residuals",
x = "Predicted values",
y = "Residuals") +
theme_classic()
The second model (GAM mixed model with splines) better captures the complexity and heterogeneity of the relationships between volume, listings, and months_inventory across different cities. It allows for non-linear effects of predictors by fitting spline functions, which better represent complex and potentially non-monotonic relationships that the linear model cannot capture. The GAM model shows a higher adjusted R² (~0.71), indicating improved explanatory power compared to the linear mixed model. Significant spline terms for months_inventory and listings (in some cities) reveal meaningful variation in the shape of effects that linear interactions missed.
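To make this comparison concrete, approximate fit measures can be placed side by side. The sketch below uses the squared correlation between fitted and observed values as a rough R² analogue for the linear mixed model; this is an approximation, not an exact counterpart of the GAM's adjusted R².
#Rough comparison of explanatory power between the two models
cor(fitted(mod_multivariate), data_scaled$volume)^2
summary(mod_gam$gam)$r.sq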
The plot comparing observed versus predicted sales volume across cities shows mixed predictive performance of the model:
Model and Main Results:
Goodness of Fit and Predictions Analysis:
Interpretation in the Project Context:
Added Value:
Based on the detailed analysis conducted, several data-driven strategies could be adopted to improve sales management and the effectiveness of property listings. Specifically:
Territorial segmentation and personalized marketing strategies:
Optimization of listing timing and promotions:
Predictive modeling and continuous monitoring:
Development of customized KPIs:
Improvement of offer and demand segmentation: