Analysis of dataframe’s structure and dimensions

my_table <- df %>%
  summarise(across(where(is.numeric),
                   list(
                     Min = ~min(.x, na.rm = TRUE),
                     Q1 = ~quantile(.x, 0.25, na.rm = TRUE),
                     Median = ~median(.x, na.rm = TRUE),
                     Mean = ~mean(.x, na.rm = TRUE),
                     Q3 = ~quantile(.x, 0.75, na.rm = TRUE),
                     Max = ~max(.x, na.rm = TRUE)
                   ))) %>%
  pivot_longer(
    everything(),
    names_to = c("Variables", ".value"),
    names_pattern = "(.+)_(Min|Q1|Median|Mean|Q3|Max)"
  )
my_table %>%
  mutate(across(-Variables, ~round(as.numeric(.), 2))) %>%
  kable(caption = "Variable analysis") %>%
  kable_styling(full_width = FALSE)
Variable analysis
Variables Min Q1 Median Mean Q3 Max
year 2010.00 2011.00 2012.00 2012.00 2013.00 2014.00
month 1.00 3.75 6.50 6.50 9.25 12.00
sales 79.00 127.00 175.50 192.29 247.00 423.00
volume 8.17 17.66 27.06 31.01 40.89 83.55
median_price 73800.00 117300.00 134500.00 132665.42 150050.00 180000.00
listings 743.00 1026.50 1618.50 1738.02 2056.00 3296.00
months_inventory 3.40 7.80 8.95 9.19 10.95 14.90
kable(data.frame(Dimension = c("Rows", "Columns"), Value = dim(df)),
      caption = "Dimension of dataset") %>%
  kable_styling(full_width = FALSE)
Dimension of dataset
Dimension Value
Rows 240
Columns 8

Quick analysis of the dataset’s variables

kable(head(df), caption = "Firs rows of dataset") %>%
  kable_styling(full_width = FALSE)
Firs rows of dataset
city year month sales volume median_price listings months_inventory month_year
Beaumont 2010 1 83 14.162 163800 1533 9.5 1 / 2010
Beaumont 2010 2 108 17.690 138200 1586 10.0 2 / 2010
Beaumont 2010 3 182 28.701 122400 1689 10.6 3 / 2010
Beaumont 2010 4 200 26.819 123200 1708 10.6 4 / 2010
Beaumont 2010 5 202 28.833 123100 1771 10.9 5 / 2010
Beaumont 2010 6 189 27.219 122800 1803 11.1 6 / 2010
The dataset contains 240 rows and 8 variables:
-‘city’ nominal qualitative variable
-‘year’ continuous quantitative to be treated as ordinal qualitative variable
-‘month’ nominal qualitative variable coded with numbers
-‘sales’ discrete quantitative variable
-‘volume’ continuous quantitative variable on a ratio scale
-‘median_price’ continuous quantitative variable on a ratio scale
-‘listings’ discrete quantitative variable
-‘months_inventory’ continuous quantitative variable on a ratio scale
I decided to add the ‘month_year’ variable so I can do analysis on time dimension
We can do analysis on the distribution of quantitative variables by qualitative variables

Position, variability and shape indexes for quantitative variables

We would like to analyse the distribution of quantitative variables, especially ‘sales’ and ‘volume’; with that we can see the sales, both in quantity and in amount

stats_table <- data.frame(
  Variable = c("sales", "volume", "median_price", "listings", "months_inventory"),
  Mean = round(c(mean(sales), mean(volume), mean(median_price), mean(listings), mean(months_inventory)), 2),
  SD = round(c(sd(sales), sd(volume), sd(median_price), sd(listings), sd(months_inventory)), 2),
  IQR = round(c(IQR(sales), IQR(volume), IQR(median_price), IQR(listings), IQR(months_inventory)), 2),
  Skewness = round(c(skewness(sales), skewness(volume), skewness(median_price), skewness(listings), skewness(months_inventory)), 2),
  Kurtosis = round(c(kurtosis(sales)-3, kurtosis(volume)-3, kurtosis(median_price)-3, kurtosis(listings)-3, kurtosis(months_inventory)-3), 2)
)

kable(stats_table, caption = "Statistical indexes") %>%
  kable_styling(full_width = FALSE)
Statistical indexes
Variable Mean SD IQR Skewness Kurtosis
sales 192.29 79.65 120.00 0.72 -0.31
volume 31.01 16.65 23.23 0.88 0.18
median_price 132665.42 22662.15 32750.00 -0.36 -0.62
listings 1738.02 752.71 1029.50 0.65 -0.79
months_inventory 9.19 2.30 3.15 0.04 -0.17
Most of the variables have positive skewness and are platicurtic with the exception of ‘median_price’ that has negative skewness and ‘volume’ that is leptocurtic
The positive skeweness indicates that the lowest values are more frequent
If we take as an example the ‘sales’ variabile we can see a good amount of variance between datas with an extense range of datas.

Graphic analysis for ‘volume’

boxplot(volume, col = "lightblue",
        main = "Volume Boxplot (mln)",
        ylab = "Volume") # few outliers

hist(volume, probability = T, col = "lightblue",
     main = "Volume trend analysis (mln)",
     xlab = "Volume",
     ylab = "Density")
lines(density(volume)) # we can clearly see the positive skewness

I analysed the ‘volume’ variable with 2 graphs: a boxplot to see the quantiles and the outliers and a histogram (with the density curve) to see the frequency for each class (to see the positive skewness)

‘City’, ‘year’, ‘month’ and ‘month_year’ frequency

[[1]]
City frequencies
city Freq
Beaumont 60
Bryan-College Station 60
Tyler 60
Wichita Falls 60
[[2]]
Year frequencies
year Freq
2010 48
2011 48
2012 48
2013 48
2014 48
[[3]]
Month frequencies
month Freq
1 20
2 20
3 20
4 20
5 20
6 20
7 20
8 20
9 20
10 20
11 20
12 20
Month-Year frequencies (pivoted)
Month 2010 2011 2012 2013 2014
1 4 4 4 4 4
2 4 4 4 4 4
3 4 4 4 4 4
4 4 4 4 4 4
5 4 4 4 4 4
6 4 4 4 4 4
7 4 4 4 4 4
8 4 4 4 4 4
9 4 4 4 4 4
10 4 4 4 4 4
11 4 4 4 4 4
12 4 4 4 4 4

Comparisons between variables for variability and skewness

It is essential to analyze the variability of the variables in order to understand how much the data deviate from the central value (e.g. to identify anomalous behaviour). Low variability indicates relatively homogeneous observations, while high dispersion signals the presence of heterogeneity

Coefficient of variation function; it is necessary in order to compare variables with different units of measurement
CV <- function(datas, column){
  sd_v = sd(datas[[column]])
  mean_v = mean(datas[[column]])
  
  round(sd_v/mean_v*100, 2)
}
Comparison between some variables to see the one with the highest variability
cols <- c("sales", "volume", "median_price", "listings", "months_inventory")

cv_table <- data.frame(
  Variable = cols,
  CV_percent = unname(sapply(cols, CV, datas=df))
)

kable(cv_table, caption = "Coefficient of Variation (%)", align = "c") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped","hover"))
Coefficient of Variation (%)
Variable CV_percent
sales 41.42
volume 53.71
median_price 17.08
listings 43.31
months_inventory 25.06
Histograms of the 5 variables above (to analyse the skewness)
par(mfrow = c(3, 2))
hist(sales, col = "lightblue")
hist(volume, col = "lightgreen")
hist(median_price, col = "lightgrey")
hist(listings, col = "brown")
hist(months_inventory, col = "yellow")
par(mfrow = c(1, 1))

In order to compare the variability for columns with different unit of measurement we must use the coefficient of variation
From a statistical point of view the variable ‘volume’ is the one whose values vary greatly relative to the average
If we see above the skewness functions (where I executed the position, variability and shape indexes) and the histograms, ‘volume’ is the variable with the highest skewness

Classes for ‘median_price’

By dividing ‘median_price’ into classes, we can group them to identify those with the highest frequency. In Texas, average prices are concentrated in the center of the distribution

median_price_class = cut(median_price, breaks = 10, labels = FALSE)
breaks_vec <- seq(min(median_price), max(median_price), length.out=11)
Barplot for ‘median_price’ divided in 10 classes
label <- sapply(1:(length(breaks_vec)-1), function(i){
  paste0(round(breaks_vec[i]/1000), "-", round(breaks_vec[i+1]/1000), "k")
})

barplot(table(median_price_class), names.arg = label, las=2, col = "lightblue",
        main = "Barplot for median_price in classes",
        ylab = "Frequency")

Calculation of Gini index
sum_median_price_class = sum(table(median_price_class))
freq_rel_median_price = round(table(median_price_class)/sum_median_price_class, digits = 3)
Gini <- 1-sum(freq_rel_median_price^2)

cat("Gini Index:", round(Gini, 2), "%") # 0.86 -> high eterogenity

Gini Index: 0.86 %

Variable ‘median_price’, divided in classes and plotted with a barplot, has high heterogeneity.
It can be seen also through the Gini index (0.86).

Probabilities

Probability for ‘city’ = ‘Beaumont’
P_Beaumont = sum(city == 'Beaumont')/length(city)
cat("Probability Beaumont:", round(P_Beaumont, 2), "%") # P=1/4

Probability Beaumont: 0.25 %

Probability for ‘month’ = ‘July’ (7)
P_July = sum(month == 7)/length(month)
cat("Probability July:", round(P_July, 2), "%") # P=1/12

Probability July: 0.08 %

Probability for ‘month’ = ‘December’ (12) and ‘year’ = 2012
P_Dec_2012 = sum((month == 7) & (year == 2012))/nrow(df)
cat("Probability Dec_2012:", round(P_Dec_2012, 2), "%") # P=1/60

Probability Dec_2012: 0.02 %

Adding new variables

Variable ‘mean_price’ calculated by ‘volume’/‘sales’
Variable ‘listings_efficiency’ calculated by ‘sales’/‘listings’. We can obtain how well listings have performed

Visualize the results of the new variables

‘Listings_efficiency’ is a perfect indicator of how well the real estate market is performing in Texas. A high value indicates a fluid and responsive market, where demand is ableto quickly absorb supply. Conversely, low efficiency suggests that many listings remain unsold, indicating possible imbalances, such as uncompetitive prices or weak demand

hist(df$mean_price, col = "lightgreen",
     main = "Histogram of mean_prices",
     xlab = "Mean_prices")

max_val <- max(df$listings_efficiency, na.rm = TRUE)
idx_max <- which.max(df$listings_efficiency)
city_max <- df$city[idx_max]
month_year_max <- df$month_year[idx_max]

boxplot(df$listings_efficiency, col = "lightblue",
        main = "Listings_efficiency Boxplot",
        xlab = "Listings_efficiency",
        horizontal = TRUE)
points(max_val, 1, col="red", pch=19, cex=1.5)
text(max_val - 0.08, 1.1, labels = paste0(round(max_val,2), " (", city_max, ", ", month_year_max, ")"), col="red")

From the histogram we can see that ‘mean_prices’ between 140k and 170k are the ones with the highest frequency.
From the boxplot we can see the distribution of ‘listings_efficiency’, calculated by ‘sales’/‘listings’.
The maximum value for ‘listings_efficency’ was registrated in 07/2014 for Bryan-College Station where there were 39 sales for every 100 listings.

Statistical analysis

With the analysis below we can see different aspects of the market. We can see which city has the highest amount of sales, which year has the highest quantity of sales and which month has the highest listings. This way we can decide in which city we can decide to buy a new house for example, or in which period of the year

‘Volume’ by cities

We can clearly see the cities with higher variability (Tyler and Bryan-College Station)
volume_per_city <- df %>%
  group_by(city) %>%
  summarise(mean_val_city = mean(volume),
            dev_std_city = sd(volume)
  ) %>%
  mutate(label = paste0(city, " (SD=", round(dev_std_city, 2), ")"))

kable(
  volume_per_city %>% select(label, mean_val_city),
  caption = "Mean Volume per City with SD",
  col.names = c("City (SD)", "Mean Volume"),
  digits = 2,
  align = c("l", "c")
) %>%
  kable_styling(
    full_width = FALSE,
    bootstrap_options = c("striped", "hover")
  )
Mean Volume per City with SD
City (SD) Mean Volume
Beaumont (SD=6.97) 26.13
Bryan-College Station (SD=17.25) 38.19
Tyler (SD=13.11) 45.77
Wichita Falls (SD=3.24) 13.93
ggplot(df, aes(x = volume, color = city, fill = city)) +
  geom_density(alpha = 0.3) +
  geom_vline(data = volume_per_city, aes(xintercept = mean_val_city, color = city),
             linetype = "dashed", size = 1) +
  scale_color_manual(
    values = scales::hue_pal()(nrow(volume_per_city)),
    labels = volume_per_city$label
  ) +
  labs(x = "Volume", y = "Density",
       title = "Density curve for cities with highlighting means") +
  theme_minimal()

‘Sales’ by years

We can see the variability between the years and the trend over time; higher variability and sales (in average) for 2013 and 2014
sales_per_year <- df %>%
  group_by(year) %>%
  summarise(mean_val_year = mean(sales),
            dev_std_year = sd(sales),
            .groups = "drop"
            ) %>%
  mutate(label = paste0(year, " (SD=", round(dev_std_year, 2), ")"))

kable(
  sales_per_year %>% select(label, mean_val_year),
  caption = "Mean Sales per Year with SD",
  col.names = c("Year (SD)", "Mean Sales"),
  digits = 2,
  align = c("l", "c")
) %>%
  kable_styling(
    full_width = FALSE,
    bootstrap_options = c("striped", "hover")
  )
Mean Sales per Year with SD
Year (SD) Mean Sales
2010 (SD=60.54) 168.67
2011 (SD=63.87) 164.12
2012 (SD=70.91) 186.15
2013 (SD=84) 211.92
2014 (SD=95.51) 230.60
ggplot(df, aes(x=year, y=sales)) +
  geom_jitter(width=0.2, alpha=0.5) +
  stat_summary(fun=mean, geom="point", color="red", size=3) +
  labs(title="Volume per città con punti e media")

‘Listings’ by months

We can see the 4° month as the one with the highest mean and the 7° with the highest SD
listings_per_month <- df %>%
  group_by(month) %>%
  summarise(mean_val_month = mean(listings),
            dev_std_month = sd(listings),
            .groups = "drop"
            ) %>%
  mutate(
    label = paste0(month, " (SD=", round(dev_std_month, 2), ")")
  )

kable(
  listings_per_month %>% select(label, mean_val_month),
  caption = "Mean Listings per Month with SD",
  col.names = c("Month (SD)", "Mean Listings"),
  digits = 2,
  align = c("l", "c")
) %>%
  kable_styling(
    full_width = FALSE,
    bootstrap_options = c("striped", "hover")
  )
Mean Listings per Month with SD
Month (SD) Mean Listings
1 (SD=704.61) 1647.05
2 (SD=711.2) 1692.50
3 (SD=727.35) 1756.70
4 (SD=770.43) 1825.70
5 (SD=790.22) 1823.85
6 (SD=811.63) 1833.25
7 (SD=826.72) 1821.20
8 (SD=815.87) 1786.30
9 (SD=802.66) 1748.90
10 (SD=779.16) 1710.35
11 (SD=741.25) 1652.70
12 (SD=692.57) 1557.75
boxplot(listings~month, col = "lightblue",
        main = "Boxplot of listings over months",
        xlab = "Month", ylab = "Listings")

Boxplot to compare ‘median_price’ by city

‘Median_price’ has increased during the years. Given that, we can decide in which city of Texas is better to invest if we would like to buy a new house

Bryan-College Station has the highest median prices, while Wichita Falls has the lowest median prices.
ggplot(df, aes(x=city, y=median_price, fill=factor(year))) +
  geom_boxplot() +
  labs(x="City", y="Median_price",
       title="Boxplot of median_price by city for each year")

Barplots: ‘sales’ by ‘month’ and ‘sales’ by ‘city’ for each ‘year’

Same analysis can be made with the visualization below, that can lead to same conclusions; cities with lower prices registered lower sales. A special case could be Bryan-College Station that, even if prices are higher tha Beaumont, registered a higher quantity of sales for each year, maybe due to some attractions in the city or due to more professional occasions

The highest amount of sales are between May and August and for Tyler, while the lowest sales are in January and for Wichita Falls
I also created a stacked barplot to see the ‘sales’ bu ‘city’ and ‘month’ (Tyler and Bryan-College Station have the highest values)
ggplot(df, aes(x=month, y=sales)) +
  geom_col(aes(fill = factor(year))) +
  labs(x="Month", y="Sales",
       title="Barplot of sales by month for each year") +
  facet_wrap(~year, nrow = 1, ncol = 5) +
  scale_x_continuous(
    breaks = 1:12,
    labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
  ) +
  theme_minimal() +
  theme(axis.text.x = element_blank())

ggplot(df, aes(x=month, y=sales, fill=city)) +
  geom_bar(stat="identity") +
  labs(x="Month", y="Sales",
       title="Stacked barplot of sales by month and city for each year") +
  facet_wrap(~year, nrow = 1, ncol = 5) +
  scale_x_continuous(
    breaks = 1:12,
    labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
  ) +
  theme_minimal() +
  theme(axis.text.x = element_blank())

ggplot(df, aes(x=city, y=sales)) +
  geom_col(fill="violet") +
  labs(x="City", y="Sales",
       title="Barplot of sales by city for each year") +
  facet_wrap(~year, nrow = 1, ncol = 5) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90))

Linecharts for ‘sales’ trend over the years by ‘city’

For every city, the 2014 and 2013 where the years with highest sales
df$month_year <- gsub(" ", "", df$month_year)
df$date <- as.Date(paste0("01/", df$month_year), format = "%d/%m/%Y")

ggplot(df, aes(x = df$date, y = sales, color = city, group = city)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Monthly sales trend by city",
    x = "Month",
    y = "Sales",
    color = "City"
  ) +
  scale_x_date(date_labels = "%Y", date_breaks = "1 year") +
  theme_minimal()

Linecharts of ‘volume’ by ‘year’ (in average) for every ‘city’

By examining volume over time for each city, we can illustrate how different markets develop and change. Not all cities follow the same trajectory: most of them show a gradual growth, while Wichita Falls go through phases of expansion and contraction. These differences could reflect local dynamics, such as variations in demand, prices, or market attractiveness

They all seem to have an upward trend. Tyler has the highest values, while Whicita Falls has a great drop in 2011
par(mfrow = c(2,2))
df_Beaumont <- subset(df, city == "Beaumont")
df_agg_Beaumont <- aggregate(volume ~ year, data = df_Beaumont, FUN = mean)
plot(df_agg_Beaumont$year, df_agg_Beaumont$volume, type="l", col="blue", lwd=2,
     main="Beaumont city",
     xlab="Year", ylab="Volume (mln)")

df_Bryan_College_Station <- subset(df, city == "Bryan-College Station")
df_agg_Bryan_College_Station <- aggregate(volume ~ year, data = df_Bryan_College_Station, FUN = mean)
plot(df_agg_Bryan_College_Station$year, df_agg_Bryan_College_Station$volume, type="l", col="red", lwd=2,
     main="Bryan-College Station city",
     xlab="Year", ylab="Volume (mln)")

df_Tyler <- subset(df, city == "Tyler")
df_agg_Tyler <- aggregate(volume ~ year, data = df_Tyler, FUN = mean)
plot(df_agg_Tyler$year, df_agg_Tyler$volume, type="l", col="green", lwd=2,
     main="Tyler city",
     xlab="Year", ylab="Volume (mln)")

df_Wichita_Falls <- subset(df, city == "Wichita Falls")
df_agg_Wichita_Falls <- aggregate(volume ~ year, data = df_Wichita_Falls, FUN = mean)
plot(df_agg_Wichita_Falls$year, df_agg_Wichita_Falls$volume, type="l", col="orange", lwd=2,
     main="Wichita Falls city",
     xlab="Year", ylab="Volume (mln)")

par(mfrow = c(1,1))

————————–

CONCLUSIONS

————————–

We can also highlight a fall in volumes for Wichita Falls in 2011.

If we compare monthly every year for every city, quantity of sales have the highest values in 2013 and 2014, with a peak in the central months (May-Aug) for Bryan-College Station.

The efficacy of listings (that become sales) is between 5% and 20% for almost every observation, with few outliers.

We can highlight the efficacy of 39% for Bryan-College Station in July 2014. That is the maximum value registered.

The highest classes of median price registered were 127k-138k and 148k-159k. These amounts are nearer to the highest value of median price than to the lowest.