Texas Realty Insights aims to analyze real estate market trends within the state of Texas by leveraging historical property sales data. The goal is to provide statistical and visual insights that support strategic decisions related to sales and the optimization of property listings.
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = "C:/Users/Francesco/Desktop/ProfessionAI")
library(moments)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(ggplot2)
## Importing DataFrame
getwd()
## [1] "C:/Users/Francesco/Desktop/ProfessionAI"
data = read.csv("realestate_texas.csv", sep = ",") # Importing
attach(data)
Let’s now identify the variables of the problem under consideration. This is the head of the dataset:
head(data)
## city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010 1 83 14.162 163800 1533 9.5
## 2 Beaumont 2010 2 108 17.690 138200 1586 10.0
## 3 Beaumont 2010 3 182 28.701 122400 1689 10.6
## 4 Beaumont 2010 4 200 26.819 123200 1708 10.6
## 5 Beaumont 2010 5 202 28.833 123100 1771 10.9
## 6 Beaumont 2010 6 189 27.219 122800 1803 11.1
As we can see, the variables are:

- "city": qualitative, nominal. It says little on its own, but it can be used as a grouping factor to study the other variables.
- "year": quantitative and discrete, but treated here as a qualitative ordinal variable. Like "city", it is used as a grouping factor for the other variables.
- "month": qualitative, nominal, although coded as a number. It too is used as a grouping factor.
- "sales": quantitative, discrete. It can be studied as a function of year, month and city by calculating its position and variability indices.
- "volume": quantitative, continuous. It can be studied in the same way, as a function of year, month and city.
- "median_price": quantitative, continuous. Same as above.
- "listings": quantitative, discrete. Same as above.
- "months_inventory": quantitative, continuous. Same as above.
All continuous quantitative variables are measured on a ratio scale.
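As a quick sanity check on these descriptions, we can inspect how R parsed each column (a minimal sketch; output omitted):
# Inspect the structure of the imported data frame: column names, types and first values
str(data)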
Let's now calculate the position and variability indices for the variables for which this makes sense:
labels_stats = c("Median", "Quantile 25%", "Mean", "Quantile 75%", "Interval", "IQR", "Variance", "Standard deviation")
sales_statistics = c(
median(sales),
quantile(sales, .25),
mean(sales),
quantile(sales, .75),
max(sales) - min(sales),
IQR(sales),
var(sales),
sd(sales))
volume_statistics = c(
median(volume),
quantile(volume, .25),
mean(volume),
quantile(volume, .75),
max(volume) - min(volume),
IQR(volume),
var(volume),
sd(volume))
median_price_statistics <- c(
median(median_price),
quantile(median_price, 0.25),
mean(median_price),
quantile(median_price, 0.75),
max(median_price) - min(median_price),
IQR(median_price),
var(median_price),
sd(median_price)
)
listings_statistics = c(
median(listings),
quantile(listings, .25),
mean(listings),
quantile(listings, .75),
max(listings) - min(listings),
IQR(listings),
var(listings),
sd(listings)
)
months_inventory_statistics = c(
median(months_inventory),
quantile(months_inventory, .25),
mean(months_inventory),
quantile(months_inventory, .75),
max(months_inventory) - min(months_inventory),
IQR(months_inventory),
var(months_inventory),
sd(months_inventory)
)
df_all_stats = t(data.frame(row.names = labels_stats, Sales = sales_statistics, Volume = volume_statistics, Median_Price = median_price_statistics, Listings = listings_statistics, Months_Inventory = months_inventory_statistics))
knitr::kable(df_all_stats, caption = "Statistics", digits = 2)
|                  | Median | Quantile 25% | Mean | Quantile 75% | Interval | IQR | Variance | Standard deviation |
|------------------|--------|--------------|------|--------------|----------|-----|----------|--------------------|
| Sales            | 175.50 | 127.00 | 192.29 | 247.00 | 344.00 | 120.00 | 6.34430e+03 | 79.65 |
| Volume           | 27.06 | 17.66 | 31.01 | 40.89 | 75.38 | 23.23 | 2.77270e+02 | 16.65 |
| Median_Price     | 134500.00 | 117300.00 | 132665.42 | 150050.00 | 106200.00 | 32750.00 | 5.13573e+08 | 22662.15 |
| Listings         | 1618.50 | 1026.50 | 1738.02 | 2056.00 | 2553.00 | 1029.50 | 5.66569e+05 | 752.71 |
| Months_Inventory | 8.95 | 7.80 | 9.19 | 10.95 | 11.50 | 3.15 | 5.31000e+00 | 2.30 |
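For reference, the same table can be built more compactly by applying one helper function over the numeric columns. This is only a sketch of an alternative construction; summary_stats is an illustrative name, not part of the original analysis:
# Sketch: one helper computing all position/variability indices, applied to each numeric column
summary_stats = function(x) {
  c("Median" = median(x),
    "Quantile 25%" = unname(quantile(x, .25)),
    "Mean" = mean(x),
    "Quantile 75%" = unname(quantile(x, .75)),
    "Interval" = max(x) - min(x),
    "IQR" = IQR(x),
    "Variance" = var(x),
    "Standard deviation" = sd(x))
}
df_all_stats_alt = t(sapply(data[c("sales", "volume", "median_price", "listings", "months_inventory")], summary_stats))
knitr::kable(df_all_stats_alt, caption = "Statistics (alternative construction)", digits = 2)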
Let’s calculate the coefficients of variation to understand which variable has more variability:
CV_sales = sd(sales)/mean(sales)
CV_volume = sd(volume)/mean(volume)
CV_median_price = sd(median_price)/mean(median_price)
CV_listings = sd(listings)/mean(listings)
CV_months_inventory = sd(months_inventory)/mean(months_inventory)
df_CVs = t(data.frame(row.names = c("CV_sales", "CV_volume", "CV_median_price", "CV_listings", "CV_months_inventory"), CV = c(CV_sales, CV_volume, CV_median_price, CV_listings, CV_months_inventory)))
knitr::kable(df_CVs, caption = "CV", digits = 2)
|    | CV_sales | CV_volume | CV_median_price | CV_listings | CV_months_inventory |
|----|----------|-----------|-----------------|-------------|---------------------|
| CV | 0.41 | 0.54 | 0.17 | 0.43 | 0.25 |
As we can see, the largest CV belongs to volume. In other words, the variable representing the total value of sales shows the greatest relative variability.
Let's calculate the Fisher skewness index:
Fisher_sales = skewness(sales)
Fisher_volume = skewness(volume)
Fisher_median_price = skewness(median_price)
Fisher_listings = skewness(listings)
Fisher_months_inventory = skewness(months_inventory)
df_Fisher = t(data.frame(row.names = c("Fisher_sales", "Fisher_volume", "Fisher_median_price", "Fisher_listings", "Fisher_months_inventory"), Fisher = c(Fisher_sales, Fisher_volume, Fisher_median_price, Fisher_listings, Fisher_months_inventory)))
knitr::kable(df_Fisher, caption = "Fisher index", digits = 2)
|        | Fisher_sales | Fisher_volume | Fisher_median_price | Fisher_listings | Fisher_months_inventory |
|--------|--------------|---------------|---------------------|-----------------|-------------------------|
| Fisher | 0.72 | 0.88 | -0.36 | 0.65 | 0.04 |
The largest Fisher index again belongs to volume. It indicates positive asymmetry: the mean is larger than the median and the distribution has a long right tail.
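As a quick check of this relationship (a minimal sketch), the mean of volume should indeed exceed its median:
# Positive skewness: the mean is pulled above the median by the long right tail
mean(volume) > median(volume)  # expected TRUE (31.01 vs 27.06 in the statistics table)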
Let's now divide the sales variable into classes, after finding its minimum and maximum values:
min_value = min(sales) # Min
min_value
## [1] 79
max_value = max(sales) # Max
max_value
## [1] 423
classes = cut(sales, breaks = c(50,100,150,200,250,300,350,400,450))
N = length(sales)
ni = table(classes)        # absolute frequencies
fi = table(classes)/N      # relative frequencies
Ni = cumsum(ni)            # cumulative absolute frequencies
Fi = Ni/N                  # cumulative relative frequencies
cbind(ni,fi,Ni,Fi)
## ni fi Ni Fi
## (50,100] 21 0.08750000 21 0.0875000
## (100,150] 72 0.30000000 93 0.3875000
## (150,200] 56 0.23333333 149 0.6208333
## (200,250] 32 0.13333333 181 0.7541667
## (250,300] 34 0.14166667 215 0.8958333
## (300,350] 13 0.05416667 228 0.9500000
## (350,400] 9 0.03750000 237 0.9875000
## (400,450] 3 0.01250000 240 1.0000000
barplot(ni,
        col = "skyblue",
        main = "Sales divided by classes",
        ylab = "Frequency (number of months)",
        names.arg = levels(classes),
        las = 2)
fi2 = fi^2                        # squared relative frequencies
gini = 1 - sum(fi2)               # Gini heterogeneity index
j = length(ni)                    # number of classes
gini_normalized = gini/((j-1)/j)  # normalized by its maximum value (J-1)/J
print(gini_normalized) # Gini normalized
## [1] 0.9206349
As we can see, the sales distribution is strongly positively skewed and, as indicated by the Gini index, also highly heterogeneous. Most monthly sales fall within the 100–150 range, followed by the 150–200 class. This suggests that lower sales values are much more common, while a few high sales significantly raise the average, contributing to the observed inequality.
Let's calculate some empirical probabilities, as the ratio of favorable cases to total cases:
favorable_cases = sum(city == "Beaumont")
total_cases = length(city)
prob_beaumont = favorable_cases/total_cases
print(prob_beaumont)
## [1] 0.25
In light of this result, 1/4 of the observations in the dataset relate to "Beaumont", as expected with 4 cities equally represented.
favorable_cases = sum(month == 7)
total_cases = length(month)
prob_july = favorable_cases/total_cases
print(prob_july)
## [1] 0.08333333
In light of this result, 1/12 of the observations in the dataset relate to July, as expected with 12 months equally represented.
favorable_cases = sum(month == 12 & year == 2012)
total_cases = length(month)
prob_2012 = round(favorable_cases/total_cases,2)
print(prob_2012)
## [1] 0.02
In light of this result, 1/60 of the observations in the dataset relate to December 2012 (12 months x 5 years = 60 year-month combinations).
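The same empirical probabilities can be obtained more compactly as proportions of a logical condition (a minimal sketch):
# mean() of a logical vector gives the proportion of TRUE values
mean(city == "Beaumont")          # P(Beaumont)  = 0.25
mean(month == 7)                  # P(July)      ≈ 0.083
mean(month == 12 & year == 2012)  # P(Dec 2012)  ≈ 0.017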
Let's create a new column with the mean price per sale. Since volume is expressed in millions of dollars, the mean price is obtained as volume multiplied by 10^6 and divided by sales:
data$mean_price = volume*10^6/sales # volume is in millions of dollars
mean_price = data$mean_price
head(data)
## city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010 1 83 14.162 163800 1533 9.5
## 2 Beaumont 2010 2 108 17.690 138200 1586 10.0
## 3 Beaumont 2010 3 182 28.701 122400 1689 10.6
## 4 Beaumont 2010 4 200 26.819 123200 1708 10.6
## 5 Beaumont 2010 5 202 28.833 123100 1771 10.9
## 6 Beaumont 2010 6 189 27.219 122800 1803 11.1
## mean_price
## 1 170626.5
## 2 163796.3
## 3 157697.8
## 4 134095.0
## 5 142737.6
## 6 144015.9
mean_price_statistics = c(
median(mean_price),
quantile(mean_price, .25),
mean(mean_price),
quantile(mean_price, .75),
max(mean_price) - min(mean_price),
IQR(mean_price),
var(mean_price),
sd(mean_price)
)
mean_price_labels = c(
"Median", "Quantile 25%", "Mean", "Quantile 75%",
"Interval", "IQR", "Variance", "Standard deviation"
)
df_mean_price = t(data.frame(row.names = mean_price_labels, Mean_price = mean_price_statistics))
knitr::kable(df_mean_price, caption = "Mean price", digits = 2)
|            | Median | Quantile 25% | Mean | Quantile 75% | Interval | IQR | Variance | Standard deviation |
|------------|--------|--------------|------|--------------|----------|-----|----------|--------------------|
| Mean_price | 134500 | 117300 | 132665.4 | 150050 | 106200 | 32750 | 513572983 | 22662.15 |
Looking at the summary statistics, the average price across all cities and years is around 132,665, with a median of 134,500. Most prices fall within an interquartile range (IQR) of 32,750, indicating a moderate spread around the median. The standard deviation of about 22,662 confirms there’s some variability in the data, though not extreme.
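Note that the mean of the monthly mean prices is not necessarily equal to the overall price per sale pooled across all transactions, since the pooled figure weights each month by its number of sales. A quick comparison (a minimal sketch):
# Pooled price per sale (total dollars / total sales) vs. simple mean of monthly mean prices
sum(volume)*10^6 / sum(sales)
mean(data$mean_price)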
Now let's add a column measuring the effectiveness of property listings, defined as the ratio of sales to the total number of available properties (sales plus active listings):
data$effectiveness = sales/(listings + sales)
effectiveness = data$effectiveness
head(data)
## city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010 1 83 14.162 163800 1533 9.5
## 2 Beaumont 2010 2 108 17.690 138200 1586 10.0
## 3 Beaumont 2010 3 182 28.701 122400 1689 10.6
## 4 Beaumont 2010 4 200 26.819 123200 1708 10.6
## 5 Beaumont 2010 5 202 28.833 123100 1771 10.9
## 6 Beaumont 2010 6 189 27.219 122800 1803 11.1
## mean_price effectiveness
## 1 170626.5 0.05136139
## 2 163796.3 0.06375443
## 3 157697.8 0.09727418
## 4 134095.0 0.10482180
## 5 142737.6 0.10238216
## 6 144015.9 0.09487952
effectiveness_statistics = c(
median(effectiveness),
quantile(effectiveness, .25),
mean(effectiveness),
quantile(effectiveness, .75),
max(effectiveness) - min(effectiveness),
IQR(effectiveness),
var(effectiveness),
sd(effectiveness)
)
effectiveness_labels = c(
"Median", "Quantile 25%", "Mean", "Quantile 75%",
"Interval", "IQR", "Variance", "Standard deviation"
)
df_effectiveness = t(data.frame(row.names = effectiveness_labels, Effectiveness = effectiveness_statistics))
knitr::kable(df_effectiveness, caption = "Effectiveness", digits = 3)
|               | Median | Quantile 25% | Mean | Quantile 75% | Interval | IQR | Variance | Standard deviation |
|---------------|--------|--------------|------|--------------|----------|-----|----------|--------------------|
| Effectiveness | 0.099 | 0.082 | 0.105 | 0.119 | 0.231 | 0.036 | 0.001 | 0.035 |
The average effectiveness is approximately 0.105, with a median of around 0.099. In other words, on average, about 1 out of 10 listings results in an effective sale. Most values fall between the first and third quartiles (0.082 to 0.119), resulting in an interquartile range (IQR) of about 0.036. The standard deviation is relatively low (0.035), suggesting limited variability in effectiveness across observations.
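Since effectiveness is likely to differ across markets, a natural follow-up, sketched here, is to compare its average by city:
# Sketch: average listing effectiveness by city, sorted from most to least effective
data %>%
  group_by(city) %>%
  summarise(mean_effectiveness = mean(effectiveness)) %>%
  arrange(desc(mean_effectiveness))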
Let’s see the mean and standard deviation for sales by year:
cond_prob_year = data %>%
group_by(year) %>%
summarise(mean_sales_by_year = mean(sales),
dev_standard_sales_by_year = sd(sales))
print(cond_prob_year)
## # A tibble: 5 × 3
## year mean_sales_by_year dev_standard_sales_by_year
## <int> <dbl> <dbl>
## 1 2010 169. 60.5
## 2 2011 164. 63.9
## 3 2012 186. 70.9
## 4 2013 212. 84.0
## 5 2014 231. 95.5
barplot(cond_prob_year$mean_sales_by_year,
names.arg = cond_prob_year$year,
las = 2,
col = "skyblue",
main = "Sales per year",
ylab = "Sales")
As we can see, average monthly sales increased over the period by roughly one third, from about 169 in 2010 to about 231 in 2014, after a slight dip in 2011.
Now by month:
cond_prob_month = data %>%
group_by(month) %>%
summarise(mean_sales_by_month = mean(sales),
dev_standard_sales_by_month = sd(sales))
print(cond_prob_month)
## # A tibble: 12 × 3
## month mean_sales_by_month dev_standard_sales_by_month
## <int> <dbl> <dbl>
## 1 1 127. 43.4
## 2 2 141. 51.1
## 3 3 189. 59.2
## 4 4 212. 65.4
## 5 5 239. 83.1
## 6 6 244. 95.0
## 7 7 236. 96.3
## 8 8 231. 79.2
## 9 9 182. 72.5
## 10 10 180. 75.0
## 11 11 157. 55.5
## 12 12 169. 60.7
barplot(cond_prob_month$mean_sales_by_month,
names.arg = cond_prob_month$month,
las = 2,
col = "skyblue",
main = "Sales per month",
ylab = "Sales")
As we can see, sales are concentrated around the middle of the year: the monthly averages peak between May and August, while January and February show the lowest values.
Then by city:
cond_prob_city = data %>%
group_by(city) %>%
summarise(mean_sales_by_city = mean(sales),
dev_standard_sales_by_city = sd(sales))
print(cond_prob_city)
## # A tibble: 4 × 3
## city mean_sales_by_city dev_standard_sales_by_city
## <chr> <dbl> <dbl>
## 1 Beaumont 177. 41.5
## 2 Bryan-College Station 206. 85.0
## 3 Tyler 270. 62.0
## 4 Wichita Falls 116. 22.2
barplot(cond_prob_city$mean_sales_by_city,
names.arg = cond_prob_city$city,
las = 2,
col = "skyblue",
main = "Sales per city",
ylab = "Sales")
As we can see, Tyler recorded the highest average monthly sales over these years, while Wichita Falls recorded the lowest.
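The standard deviations computed above can also be visualized. Here is a sketch using ggplot2, adding +/- 1 standard deviation error bars to the per-city means:
# Sketch: mean sales per city with +/- 1 standard deviation error bars
ggplot(cond_prob_city,
       aes(x = city, y = mean_sales_by_city)) +
  geom_col(fill = "skyblue", color = "black") +
  geom_errorbar(aes(ymin = mean_sales_by_city - dev_standard_sales_by_city,
                    ymax = mean_sales_by_city + dev_standard_sales_by_city),
                width = 0.2) +
  labs(title = "Mean sales per city (+/- 1 SD)",
       x = "City",
       y = "Mean sales")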
Let's do the same for the mean sale price, first by year:
cond_prob_year = data %>%
group_by(year) %>%
summarise(mean_price_by_year = mean(mean_price),
dev_standard_mean_price_by_year = sd(mean_price))
print(cond_prob_year)
## # A tibble: 5 × 3
## year mean_price_by_year dev_standard_mean_price_by_year
## <int> <dbl> <dbl>
## 1 2010 150189. 23280.
## 2 2011 148251. 24938.
## 3 2012 150899. 26438.
## 4 2013 158705. 26524.
## 5 2014 163559. 31741.
barplot(cond_prob_year$mean_price_by_year,
names.arg = cond_prob_year$year,
las = 2,
col = "skyblue",
main = "Mean price per year",
ylab = "Mean price")
As we can see, the mean price increases over the years, with most of the growth occurring from 2012 onwards.
Then by month:
cond_prob_month = data %>%
group_by(month) %>%
summarise(mean_price_by_month = mean(mean_price),
dev_standard_mean_price_by_month = sd(mean_price))
print(cond_prob_month)
## # A tibble: 12 × 3
## month mean_price_by_month dev_standard_mean_price_by_month
## <int> <dbl> <dbl>
## 1 1 145640. 29819.
## 2 2 148840. 25120.
## 3 3 151137. 23238.
## 4 4 151461. 26174.
## 5 5 158235. 25787.
## 6 6 161546. 23470.
## 7 7 156881. 27220.
## 8 8 156456. 28253.
## 9 9 156522. 29669.
## 10 10 155897. 32527.
## 11 11 154233. 29685.
## 12 12 154996. 27009.
barplot(cond_prob_month$mean_price_by_month,
names.arg = cond_prob_month$month,
las = 2,
col = "skyblue",
main = "Mean price per month",
ylab = "Mean price")
As we can see, there are no striking price variations over the course of the year, though prices are slightly lower in the first few months.
Then by city:
cond_prob_city = data %>%
group_by(city) %>%
summarise(mean_price_by_city = mean(mean_price),
dev_standard_mean_price_by_city = sd(mean_price))
print(cond_prob_city)
## # A tibble: 4 × 3
## city mean_price_by_city dev_standard_mean_price_by_city
## <chr> <dbl> <dbl>
## 1 Beaumont 146640. 11232.
## 2 Bryan-College Station 183534. 15149.
## 3 Tyler 167677. 12351.
## 4 Wichita Falls 119430. 11398.
barplot(cond_prob_city$mean_price_by_city,
names.arg = cond_prob_city$city,
las = 2,
col = "skyblue",
main = "Mean price per city",
ylab = "Mean price")
As we can see, the mean price increases progressively from Wichita Falls to Beaumont, then Tyler, and finally Bryan-College Station.
Let's now turn to visualizations, starting with a boxplot of the median price distribution by city:
ggplot(data,
aes(x = city,
y = median_price))+
  geom_boxplot(
    col = "black",
    fill = "skyblue"
  )+
labs(title = "Median price distribution per city",
x = "City",
y = "Median price"
)
The boxplot shows the distribution of median home prices across four cities: Beaumont, Bryan-College Station, Tyler, and Wichita Falls.
Among these, Bryan-College Station has the highest median price, followed by Tyler and Beaumont, while Wichita Falls has the lowest.
Price variability is greatest in Bryan-College Station, indicating a wider range of home prices.
Bryan-College Station, Beaumont and Wichita Falls exhibit outliers, suggesting the presence of unusually high prices.
Since the median is close to the middle of the box, the distributions appear approximately symmetrical in most cities, though Wichita Falls may show a slight negative skew due to the longer lower whisker.
Overall, this plot highlights notable differences in housing markets, with Bryan-College Station being the most expensive and variable, and Wichita Falls offering the most affordable options.
ggplot(data,
aes(x = factor(year),
y = median_price,
fill = city))+
geom_boxplot(
position = position_dodge(width = 0.75),
color = "black"
)+
labs(title = "Median price distribution per year for all the cities",
x = "City",
y = "Median price"
)
In 2010, the distributions appear noticeably skewed, as indicated by the differing whisker lengths and the off-center medians. The median price increases progressively from Wichita Falls to Beaumont, Tyler, and Bryan-College Station. Among these, Bryan-College Station exhibits the smallest variability, which increases for Tyler and is even greater for Beaumont and Wichita Falls.
A similar pattern is observed in the other years, with the notable exception that in 2014 the variability of Tyler becomes comparable to that of Wichita Falls and Beaumont.
Overall, the median prices in Bryan-College Station and Tyler show a clear increasing trend over time. Meanwhile, smaller increases can be observed in Wichita Falls and Beaumont, where the median price remains relatively stable.
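To quantify these trends, here is a sketch of the yearly median of median_price for each city:
# Sketch: yearly median of the median price, per city
data %>%
  group_by(city, year) %>%
  summarise(median_price_by_city_year = median(median_price), .groups = "drop") %>%
  arrange(city, year)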
We can also compare the median price across cities and months, with one panel per year:
ggplot(data,
aes(x = factor(month, levels = 1:12, labels = month.abb),
y = median_price,
fill = city)) +
geom_bar(stat = "identity",
position = position_dodge(),
color = "black") +
labs(title = "Median price distribution by city and month for each year",
x = "Month",
y = "Median price",
fill = "City") +
facet_wrap(~ year, nrow = 5) +
theme_minimal() +
theme(legend.position = "bottom")
Let's also follow the monthly evolution of the median price within each year, from 2011 to 2014:
ggplot(subset(data, year == 2011),
aes(x = month,
y = median_price,
color = city))+
geom_line(linewidth = 1.5)+
geom_point(shape = 21, size = 2.5, fill = "black", stroke = 1)+
scale_x_continuous(breaks = 1:12)+
labs(title = "Median price over the year 2011",
x = "Month",
y = "Median price")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(subset(data, year == 2012),
aes(x = month,
y = median_price,
color = city))+
geom_line(linewidth = 1.5)+
geom_point(shape = 21, size = 2.5, fill = "black", stroke = 1)+
scale_x_continuous(breaks = 1:12)+
labs(title = "Median price over the year 2012",
x = "Month",
y = "Median price")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(subset(data, year == 2013),
aes(x = month,
y = median_price,
color = city))+
geom_line(linewidth = 1.5)+
geom_point(shape = 21, size = 2.5, fill = "black", stroke = 1)+
scale_x_continuous(breaks = 1:12)+
labs(title = "Median price over the year 2013",
x = "Month",
y = "Median price")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(subset(data, year == 2014),
aes(x = month,
y = median_price,
color = city))+
geom_line(linewidth = 1.5)+
geom_point(shape = 21, size = 2.5, fill = "black", stroke = 1)+
scale_x_continuous(breaks = 1:12)+
labs(title = "Median price over the year 2014",
x = "Month",
y = "Median price")+
theme(plot.title = element_text(hjust = 0.5))
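The four per-year charts above can also be drawn in one step by faceting on year; this is a sketch of that alternative, not part of the original workflow:
# Sketch: the same monthly median-price lines, one panel per year
ggplot(data,
       aes(x = month,
           y = median_price,
           color = city)) +
  geom_line(linewidth = 1) +
  geom_point(shape = 21, size = 2, fill = "black", stroke = 1) +
  scale_x_continuous(breaks = 1:12) +
  facet_wrap(~ year, ncol = 1) +
  labs(title = "Median price by month, per year",
       x = "Month",
       y = "Median price") +
  theme(plot.title = element_text(hjust = 0.5))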
To draw the whole 2010-2014 period on a single axis, let's convert year and month into a running month index from 1 to 60:
data2 = data
min_year = min(data2$year)
# Running month index: 1-12 for the first year, 13-24 for the second, and so on
data2$month = data2$month + 12*(data2$year - min_year)
ggplot(data2,
aes(x = month,
y = median_price,
color = city))+
geom_line(linewidth = 1.5)+
geom_point(shape = 21, size = 2.5, fill = "black", stroke = 1
)+
scale_x_continuous(breaks = 1:60)+
labs(title = "Median price over the years 2010-2014",
x = "Month",
y = "Median price"
)+
theme(plot.title = element_text(hjust = 0.5), # centering the title
panel.grid.major.y = element_line(color = "black"),
panel.grid.major.x = element_line(color = "black")
)
This chart shows the median home prices for four Texas cities—Beaumont, Bryan-College Station, Tyler, and Wichita Falls—over the five-year period from 2010 to 2014.
Bryan-College Station consistently has the highest home prices throughout the entire period. Aside from some peaks around months 10 and 33, prices generally remain steady until around month 37, after which there’s a clear upward trend. Despite some fluctuations, Bryan-College Station appears to have the most stable and steadily increasing prices overall.
Interestingly, in the very first month, Beaumont had the highest median price. However, prices dropped soon after, and for most of the period Beaumont ranks third—except for a few months (roughly months 10 to 13), where prices temporarily exceeded those in Tyler. Overall, Beaumont shows noticeable volatility and lacks a clear long-term trend.
Tyler generally holds the second-highest price position, aside from the brief exception mentioned above. There are plenty of short-term fluctuations, but starting from around month 49, there’s a more obvious upward trend, suggesting some late-stage growth.
Wichita Falls consistently shows the lowest prices among the four cities. The data is quite volatile, with frequent spikes and dips. Notably, there’s a sharp dip around month 22 and a significant spike at month 54, where prices briefly approach those of Beaumont. Despite that, the overall trend in Wichita Falls is relatively flat compared to the other cities.