Analysis of dataframe’s structure and
dimensions
my_table <- df %>%
summarise(across(where(is.numeric),
list(
Min = ~min(.x, na.rm = TRUE),
Q1 = ~quantile(.x, 0.25, na.rm = TRUE),
Median = ~median(.x, na.rm = TRUE),
Mean = ~mean(.x, na.rm = TRUE),
Q3 = ~quantile(.x, 0.75, na.rm = TRUE),
Max = ~max(.x, na.rm = TRUE)
))) %>%
pivot_longer(
everything(),
names_to = c("Variables", ".value"),
names_pattern = "(.+)_(Min|Q1|Median|Mean|Q3|Max)"
)
my_table %>%
mutate(across(-Variables, ~round(as.numeric(.), 2))) %>%
kable(caption = "Variable analysis") %>%
kable_styling(full_width = FALSE)
Variable analysis
|
Variables
|
Min
|
Q1
|
Median
|
Mean
|
Q3
|
Max
|
|
year
|
2010.00
|
2011.00
|
2012.00
|
2012.00
|
2013.00
|
2014.00
|
|
month
|
1.00
|
3.75
|
6.50
|
6.50
|
9.25
|
12.00
|
|
sales
|
79.00
|
127.00
|
175.50
|
192.29
|
247.00
|
423.00
|
|
volume
|
8.17
|
17.66
|
27.06
|
31.01
|
40.89
|
83.55
|
|
median_price
|
73800.00
|
117300.00
|
134500.00
|
132665.42
|
150050.00
|
180000.00
|
|
listings
|
743.00
|
1026.50
|
1618.50
|
1738.02
|
2056.00
|
3296.00
|
|
months_inventory
|
3.40
|
7.80
|
8.95
|
9.19
|
10.95
|
14.90
|
kable(data.frame(Dimension = c("Rows", "Columns"), Value = dim(df)),
caption = "Dimension of dataset") %>%
kable_styling(full_width = FALSE)
Dimension of dataset
|
Dimension
|
Value
|
|
Rows
|
240
|
|
Columns
|
8
|
Quick analysis of the dataset’s variables
kable(head(df), caption = "Firs rows of dataset") %>%
kable_styling(full_width = FALSE)
Firs rows of dataset
|
city
|
year
|
month
|
sales
|
volume
|
median_price
|
listings
|
months_inventory
|
month_year
|
|
Beaumont
|
2010
|
1
|
83
|
14.162
|
163800
|
1533
|
9.5
|
1 / 2010
|
|
Beaumont
|
2010
|
2
|
108
|
17.690
|
138200
|
1586
|
10.0
|
2 / 2010
|
|
Beaumont
|
2010
|
3
|
182
|
28.701
|
122400
|
1689
|
10.6
|
3 / 2010
|
|
Beaumont
|
2010
|
4
|
200
|
26.819
|
123200
|
1708
|
10.6
|
4 / 2010
|
|
Beaumont
|
2010
|
5
|
202
|
28.833
|
123100
|
1771
|
10.9
|
5 / 2010
|
|
Beaumont
|
2010
|
6
|
189
|
27.219
|
122800
|
1803
|
11.1
|
6 / 2010
|
The dataset contains 240 rows and 8
variables:
-‘city’ nominal qualitative variable
-‘year’ continuous quantitative to be treated as ordinal qualitative
variable
-‘month’ nominal qualitative variable coded with numbers
-‘sales’ discrete quantitative variable
-‘volume’ continuous quantitative variable on a ratio scale
-‘median_price’ continuous quantitative variable on a ratio
scale
-‘listings’ discrete quantitative variable
-‘months_inventory’ continuous quantitative variable on a ratio
scale
I decided to add the ‘month_year’ variable so I can do analysis on
time dimension
We can do analysis on the distribution of quantitative variables by
qualitative variables
Position, variability and shape indexes for quantitative
variables
We would like to analyse the distribution of
quantitative variables, especially ‘sales’ and ‘volume’; with that we
can see the sales, both in quantity
and in amount
stats_table <- data.frame(
Variable = c("sales", "volume", "median_price", "listings", "months_inventory"),
Mean = round(c(mean(sales), mean(volume), mean(median_price), mean(listings), mean(months_inventory)), 2),
SD = round(c(sd(sales), sd(volume), sd(median_price), sd(listings), sd(months_inventory)), 2),
IQR = round(c(IQR(sales), IQR(volume), IQR(median_price), IQR(listings), IQR(months_inventory)), 2),
Skewness = round(c(skewness(sales), skewness(volume), skewness(median_price), skewness(listings), skewness(months_inventory)), 2),
Kurtosis = round(c(kurtosis(sales)-3, kurtosis(volume)-3, kurtosis(median_price)-3, kurtosis(listings)-3, kurtosis(months_inventory)-3), 2)
)
kable(stats_table, caption = "Statistical indexes") %>%
kable_styling(full_width = FALSE)
Statistical indexes
|
Variable
|
Mean
|
SD
|
IQR
|
Skewness
|
Kurtosis
|
|
sales
|
192.29
|
79.65
|
120.00
|
0.72
|
-0.31
|
|
volume
|
31.01
|
16.65
|
23.23
|
0.88
|
0.18
|
|
median_price
|
132665.42
|
22662.15
|
32750.00
|
-0.36
|
-0.62
|
|
listings
|
1738.02
|
752.71
|
1029.50
|
0.65
|
-0.79
|
|
months_inventory
|
9.19
|
2.30
|
3.15
|
0.04
|
-0.17
|
Most of the variables have positive skewness and
are platicurtic with the exception of ‘median_price’
that has negative skewness and ‘volume’ that is leptocurtic
The positive skeweness indicates that the
lowest values are more frequent
If we take as an example the ‘sales’ variabile we
can see a good amount of variance between datas with an extense range of
datas.
Graphic analysis for ‘volume’
boxplot(volume, col = "lightblue",
main = "Volume Boxplot (mln)",
ylab = "Volume") # few outliers

hist(volume, probability = T, col = "lightblue",
main = "Volume trend analysis (mln)",
xlab = "Volume",
ylab = "Density")
lines(density(volume)) # we can clearly see the positive skewness

I analysed the ‘volume’ variable with 2 graphs: a
boxplot to see the quantiles and the outliers and a
histogram (with the density curve) to see the frequency
for each class (to see the positive skewness)
‘City’, ‘year’, ‘month’ and ‘month_year’
frequency
[[1]]
City frequencies
|
city
|
Freq
|
|
Beaumont
|
60
|
|
Bryan-College Station
|
60
|
|
Tyler
|
60
|
|
Wichita Falls
|
60
|
[[2]]
Year frequencies
|
year
|
Freq
|
|
2010
|
48
|
|
2011
|
48
|
|
2012
|
48
|
|
2013
|
48
|
|
2014
|
48
|
[[3]]
Month frequencies
|
month
|
Freq
|
|
1
|
20
|
|
2
|
20
|
|
3
|
20
|
|
4
|
20
|
|
5
|
20
|
|
6
|
20
|
|
7
|
20
|
|
8
|
20
|
|
9
|
20
|
|
10
|
20
|
|
11
|
20
|
|
12
|
20
|
Month-Year frequencies (pivoted)
|
Month
|
2010
|
2011
|
2012
|
2013
|
2014
|
|
1
|
4
|
4
|
4
|
4
|
4
|
|
2
|
4
|
4
|
4
|
4
|
4
|
|
3
|
4
|
4
|
4
|
4
|
4
|
|
4
|
4
|
4
|
4
|
4
|
4
|
|
5
|
4
|
4
|
4
|
4
|
4
|
|
6
|
4
|
4
|
4
|
4
|
4
|
|
7
|
4
|
4
|
4
|
4
|
4
|
|
8
|
4
|
4
|
4
|
4
|
4
|
|
9
|
4
|
4
|
4
|
4
|
4
|
|
10
|
4
|
4
|
4
|
4
|
4
|
|
11
|
4
|
4
|
4
|
4
|
4
|
|
12
|
4
|
4
|
4
|
4
|
4
|
Comparisons between variables for variability and
skewness
It is essential to analyze the variability of the
variables in order to understand how much the data deviate from
the central value (e.g. to identify anomalous behaviour).
Low variability indicates relatively
homogeneous observations, while high
dispersion signals the presence of heterogeneity
Coefficient of variation function; it is necessary in order to
compare variables with different units of measurement
CV <- function(datas, column){
sd_v = sd(datas[[column]])
mean_v = mean(datas[[column]])
round(sd_v/mean_v*100, 2)
}
Comparison between some variables to see the one with the highest
variability
cols <- c("sales", "volume", "median_price", "listings", "months_inventory")
cv_table <- data.frame(
Variable = cols,
CV_percent = unname(sapply(cols, CV, datas=df))
)
kable(cv_table, caption = "Coefficient of Variation (%)", align = "c") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped","hover"))
Coefficient of Variation (%)
|
Variable
|
CV_percent
|
|
sales
|
41.42
|
|
volume
|
53.71
|
|
median_price
|
17.08
|
|
listings
|
43.31
|
|
months_inventory
|
25.06
|
Histograms of the 5 variables above (to analyse the skewness)
par(mfrow = c(3, 2))
hist(sales, col = "lightblue")
hist(volume, col = "lightgreen")
hist(median_price, col = "lightgrey")
hist(listings, col = "brown")
hist(months_inventory, col = "yellow")
par(mfrow = c(1, 1))

In order to compare the variability for columns
with different unit of measurement we must use the coefficient
of variation
From a statistical point of view the variable
‘volume’ is the one whose values vary greatly relative
to the average
If we see above the skewness functions (where I
executed the position, variability and shape indexes) and the
histograms, ‘volume’ is the variable
with the highest skewness
Probabilities
Probability for ‘city’ = ‘Beaumont’
P_Beaumont = sum(city == 'Beaumont')/length(city)
cat("Probability Beaumont:", round(P_Beaumont, 2), "%") # P=1/4
Probability Beaumont: 0.25 %
Probability for ‘month’ = ‘July’ (7)
P_July = sum(month == 7)/length(month)
cat("Probability July:", round(P_July, 2), "%") # P=1/12
Probability July: 0.08 %
Probability for ‘month’ = ‘December’ (12) and ‘year’ = 2012
P_Dec_2012 = sum((month == 7) & (year == 2012))/nrow(df)
cat("Probability Dec_2012:", round(P_Dec_2012, 2), "%") # P=1/60
Probability Dec_2012: 0.02 %
Adding new variables
Variable ‘mean_price’ calculated by
‘volume’/‘sales’
Variable ‘listings_efficiency’ calculated by
‘sales’/‘listings’. We can obtain how well listings have performed
Visualize the results of the new variables
‘Listings_efficiency’ is a perfect indicator of
how well the real estate market is performing in Texas.
A high value indicates a fluid and responsive
market, where demand is ableto quickly absorb supply.
Conversely, low efficiency suggests that many
listings remain unsold, indicating possible imbalances, such as
uncompetitive prices or weak demand
hist(df$mean_price, col = "lightgreen",
main = "Histogram of mean_prices",
xlab = "Mean_prices")

max_val <- max(df$listings_efficiency, na.rm = TRUE)
idx_max <- which.max(df$listings_efficiency)
city_max <- df$city[idx_max]
month_year_max <- df$month_year[idx_max]
boxplot(df$listings_efficiency, col = "lightblue",
main = "Listings_efficiency Boxplot",
xlab = "Listings_efficiency",
horizontal = TRUE)
points(max_val, 1, col="red", pch=19, cex=1.5)
text(max_val - 0.08, 1.1, labels = paste0(round(max_val,2), " (", city_max, ", ", month_year_max, ")"), col="red")

From the histogram we can see that ‘mean_prices’ between
140k and 170k are the ones with the highest frequency.
From the boxplot we can see the distribution of
‘listings_efficiency’, calculated by ‘sales’/‘listings’.
The maximum value for ‘listings_efficency’ was
registrated in 07/2014 for Bryan-College Station where
there were 39 sales for every 100 listings.
Statistical analysis
With the analysis below we can see different
aspects of the market. We can see which city
has the highest amount of sales, which
year has the highest quantity of sales
and which month has the highest
listings. This way we can decide in which city we can decide to
buy a new house for example, or in which period of the year
‘Volume’ by cities
We can clearly see the cities with higher variability (Tyler and
Bryan-College Station)
volume_per_city <- df %>%
group_by(city) %>%
summarise(mean_val_city = mean(volume),
dev_std_city = sd(volume)
) %>%
mutate(label = paste0(city, " (SD=", round(dev_std_city, 2), ")"))
kable(
volume_per_city %>% select(label, mean_val_city),
caption = "Mean Volume per City with SD",
col.names = c("City (SD)", "Mean Volume"),
digits = 2,
align = c("l", "c")
) %>%
kable_styling(
full_width = FALSE,
bootstrap_options = c("striped", "hover")
)
Mean Volume per City with SD
|
City (SD)
|
Mean Volume
|
|
Beaumont (SD=6.97)
|
26.13
|
|
Bryan-College Station (SD=17.25)
|
38.19
|
|
Tyler (SD=13.11)
|
45.77
|
|
Wichita Falls (SD=3.24)
|
13.93
|
ggplot(df, aes(x = volume, color = city, fill = city)) +
geom_density(alpha = 0.3) +
geom_vline(data = volume_per_city, aes(xintercept = mean_val_city, color = city),
linetype = "dashed", size = 1) +
scale_color_manual(
values = scales::hue_pal()(nrow(volume_per_city)),
labels = volume_per_city$label
) +
labs(x = "Volume", y = "Density",
title = "Density curve for cities with highlighting means") +
theme_minimal()

‘Sales’ by years
We can see the variability between the years and the trend over
time; higher variability and sales (in average) for 2013 and 2014
sales_per_year <- df %>%
group_by(year) %>%
summarise(mean_val_year = mean(sales),
dev_std_year = sd(sales),
.groups = "drop"
) %>%
mutate(label = paste0(year, " (SD=", round(dev_std_year, 2), ")"))
kable(
sales_per_year %>% select(label, mean_val_year),
caption = "Mean Sales per Year with SD",
col.names = c("Year (SD)", "Mean Sales"),
digits = 2,
align = c("l", "c")
) %>%
kable_styling(
full_width = FALSE,
bootstrap_options = c("striped", "hover")
)
Mean Sales per Year with SD
|
Year (SD)
|
Mean Sales
|
|
2010 (SD=60.54)
|
168.67
|
|
2011 (SD=63.87)
|
164.12
|
|
2012 (SD=70.91)
|
186.15
|
|
2013 (SD=84)
|
211.92
|
|
2014 (SD=95.51)
|
230.60
|
ggplot(df, aes(x=year, y=sales)) +
geom_jitter(width=0.2, alpha=0.5) +
stat_summary(fun=mean, geom="point", color="red", size=3) +
labs(title="Volume per città con punti e media")

‘Listings’ by months
We can see the 4° month as the one with the highest mean and the 7°
with the highest SD
listings_per_month <- df %>%
group_by(month) %>%
summarise(mean_val_month = mean(listings),
dev_std_month = sd(listings),
.groups = "drop"
) %>%
mutate(
label = paste0(month, " (SD=", round(dev_std_month, 2), ")")
)
kable(
listings_per_month %>% select(label, mean_val_month),
caption = "Mean Listings per Month with SD",
col.names = c("Month (SD)", "Mean Listings"),
digits = 2,
align = c("l", "c")
) %>%
kable_styling(
full_width = FALSE,
bootstrap_options = c("striped", "hover")
)
Mean Listings per Month with SD
|
Month (SD)
|
Mean Listings
|
|
1 (SD=704.61)
|
1647.05
|
|
2 (SD=711.2)
|
1692.50
|
|
3 (SD=727.35)
|
1756.70
|
|
4 (SD=770.43)
|
1825.70
|
|
5 (SD=790.22)
|
1823.85
|
|
6 (SD=811.63)
|
1833.25
|
|
7 (SD=826.72)
|
1821.20
|
|
8 (SD=815.87)
|
1786.30
|
|
9 (SD=802.66)
|
1748.90
|
|
10 (SD=779.16)
|
1710.35
|
|
11 (SD=741.25)
|
1652.70
|
|
12 (SD=692.57)
|
1557.75
|
boxplot(listings~month, col = "lightblue",
main = "Boxplot of listings over months",
xlab = "Month", ylab = "Listings")

Barplots: ‘sales’ by ‘month’ and ‘sales’ by ‘city’ for each
‘year’
Same analysis can be made with the visualization below, that can
lead to same conclusions; cities with lower prices
registered lower sales. A special case could be
Bryan-College Station that, even if prices are higher tha Beaumont,
registered a higher quantity of sales for each year, maybe due to some
attractions in the city or due to more professional occasions
The highest amount of sales are between May and August and for
Tyler, while the lowest sales are in January and for Wichita Falls
I also created a stacked barplot to see the ‘sales’ bu ‘city’ and
‘month’ (Tyler and Bryan-College Station have the highest values)
ggplot(df, aes(x=month, y=sales)) +
geom_col(aes(fill = factor(year))) +
labs(x="Month", y="Sales",
title="Barplot of sales by month for each year") +
facet_wrap(~year, nrow = 1, ncol = 5) +
scale_x_continuous(
breaks = 1:12,
labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
) +
theme_minimal() +
theme(axis.text.x = element_blank())

ggplot(df, aes(x=month, y=sales, fill=city)) +
geom_bar(stat="identity") +
labs(x="Month", y="Sales",
title="Stacked barplot of sales by month and city for each year") +
facet_wrap(~year, nrow = 1, ncol = 5) +
scale_x_continuous(
breaks = 1:12,
labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
) +
theme_minimal() +
theme(axis.text.x = element_blank())

ggplot(df, aes(x=city, y=sales)) +
geom_col(fill="violet") +
labs(x="City", y="Sales",
title="Barplot of sales by city for each year") +
facet_wrap(~year, nrow = 1, ncol = 5) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90))

Linecharts for ‘sales’ trend over the years by
‘city’
For every city, the 2014 and 2013 where the years with highest
sales
df$month_year <- gsub(" ", "", df$month_year)
df$date <- as.Date(paste0("01/", df$month_year), format = "%d/%m/%Y")
ggplot(df, aes(x = df$date, y = sales, color = city, group = city)) +
geom_line() +
geom_point() +
labs(
title = "Monthly sales trend by city",
x = "Month",
y = "Sales",
color = "City"
) +
scale_x_date(date_labels = "%Y", date_breaks = "1 year") +
theme_minimal()

Linecharts of ‘volume’ by ‘year’ (in average) for every
‘city’
By examining volume over time for each city, we can illustrate
how different markets develop and change. Not all
cities follow the same trajectory: most of them show a gradual
growth, while Wichita Falls go through phases of
expansion and contraction. These differences could
reflect local dynamics, such as variations in demand, prices, or market
attractiveness
They all seem to have an upward trend. Tyler has the highest values,
while Whicita Falls has a great drop in 2011
par(mfrow = c(2,2))
df_Beaumont <- subset(df, city == "Beaumont")
df_agg_Beaumont <- aggregate(volume ~ year, data = df_Beaumont, FUN = mean)
plot(df_agg_Beaumont$year, df_agg_Beaumont$volume, type="l", col="blue", lwd=2,
main="Beaumont city",
xlab="Year", ylab="Volume (mln)")
df_Bryan_College_Station <- subset(df, city == "Bryan-College Station")
df_agg_Bryan_College_Station <- aggregate(volume ~ year, data = df_Bryan_College_Station, FUN = mean)
plot(df_agg_Bryan_College_Station$year, df_agg_Bryan_College_Station$volume, type="l", col="red", lwd=2,
main="Bryan-College Station city",
xlab="Year", ylab="Volume (mln)")
df_Tyler <- subset(df, city == "Tyler")
df_agg_Tyler <- aggregate(volume ~ year, data = df_Tyler, FUN = mean)
plot(df_agg_Tyler$year, df_agg_Tyler$volume, type="l", col="green", lwd=2,
main="Tyler city",
xlab="Year", ylab="Volume (mln)")
df_Wichita_Falls <- subset(df, city == "Wichita Falls")
df_agg_Wichita_Falls <- aggregate(volume ~ year, data = df_Wichita_Falls, FUN = mean)
plot(df_agg_Wichita_Falls$year, df_agg_Wichita_Falls$volume, type="l", col="orange", lwd=2,
main="Wichita Falls city",
xlab="Year", ylab="Volume (mln)")

par(mfrow = c(1,1))
————————–
CONCLUSIONS
————————–
From the historical trends,
volumes of sales (in mln) increased from 2010
to 2014 for every city, with Tyler having the highest values
and Wichita Falls the lowest values.
We can also highlight a fall in volumes for
Wichita Falls in 2011.
If we compare monthly every year for every city,
quantity of sales have the highest values in
2013 and 2014, with a peak in the central
months (May-Aug) for Bryan-College Station.
The efficacy of listings (that become sales) is
between 5% and 20% for almost every observation, with
few outliers.
We can highlight the efficacy of
39% for Bryan-College Station in July
2014. That is the maximum value
registered.
The highest classes of median price registered were
127k-138k and 148k-159k. These amounts
are nearer to the highest value of median price than to the lowest.