VARIABLES ANALYSIS

library(readr)
realestate_texas <- read_csv("C:/Users/Utente/Downloads/realestate_texas.csv")
View(realestate_texas)

The imported dataset contains 240 observations across 8 columns. Below is an identification and description of the statistical variables in the dataset:

-city (Type: Qualitative Nominal): Represents the city associated with the observation. A categorical variable with no intrinsic order. Possible analysis: distribution of observations by city, comparisons between cities for quantitative variables (e.g., sales).

-year (Type: Quantitative Discrete, treated here as Qualitative Ordinal): Represents the year of the observation. It takes only a few distinct values (2010 to 2014) and implies a temporal dimension. Possible analysis: assessing yearly trends in other variables.

-month (Type: Qualitative Ordinal and cyclic, coded as numbers): Represents the month of the observation, with 12 distinct values (1 to 12). Implies a seasonal component. Possible analysis: identifying seasonal patterns in sales, volume, or median price.

-sales (Type: Quantitative Discrete): Indicates the number of real estate sales. Possible analysis: distribution analysis, identifying trends over time, or city-based comparisons.

-volume (Type: Quantitative Continuous): Represents the total value of sales, in millions of dollars. Possible analysis: evaluation of total market activity, mean or temporal variation analysis.

-median_price (Type: Quantitative Continuous): Indicates the median price of real estate sales. Possible analysis: distribution analysis, spatial and temporal variations, identifying correlations with other variables.

-listings (Type: Quantitative Discrete): Represents the number of active sales listings. Possible analysis: temporal trends, relationship with sales or available inventory.

-months_inventory (Type: Quantitative Continuous): Represents the number of months required to sell the current inventory at the current sales rate. Possible analysis: assessment of market balance (supply vs. demand).

The variables year and month together imply a time dimension:

-Combining year and month: the two variables can be combined into a single temporal variable (e.g., a timestamp or a year-month label) for more sophisticated time-based analyses, such as time series modeling.

-Seasonal patterns and trends: seasonal patterns (monthly) and long-term trends (yearly) can be analyzed separately using tools like time series decomposition.
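
As a minimal sketch of combining year and month into one temporal variable, the snippet below builds a Date column on a small hypothetical data frame. The object toy and the column name date are assumptions for illustration; they are not part of the original dataset.

```r
# Toy data standing in for the year and month columns (hypothetical values).
toy <- data.frame(year = c(2010, 2010, 2011), month = c(1, 12, 7))

# Build a Date on the first day of each month, using base R only.
toy$date <- as.Date(sprintf("%d-%02d-01",
                            as.integer(toy$year),
                            as.integer(toy$month)))

# A year-month label, useful for grouping or for axis labels.
toy$year_month <- format(toy$date, "%Y-%m")
toy
```

A Date column sorts chronologically, which a pasted label such as "2012-7" does not.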

Categorical Variables (e.g., city): Distribution of observations across categories. Comparisons with summary statistics (mean, median).

Temporal Variables: Trends, seasonality, and time series analysis.

Quantitative Variables (sales, volume, median_price, listings, months_inventory): Correlations. Trend analysis (e.g., linear regressions over time). Spatial comparisons (between cities).

attach(realestate_texas) #in order to use dataset variables, without $

POSITION INDEXES, VARIABILITY INDEXES AND SHAPE INDEXES

#POSITION INDEXES
#Modal values
library(knitr)

quant_variables <- c("sales", "volume", "median_price", "listings", "months_inventory")

get_mode <- function(x) {
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}

modal_values <- lapply(quant_variables, function(var) get_mode(get(var)))

format_number <- function(x) {
  ifelse(x %% 1 == 0, as.character(x), sprintf("%.2f", round(x, 2)))
}

formatted_modal_values <- sapply(modal_values, format_number)

mode_table <- data.frame(
  Variable = c("Sales", "Volume", "Median Price", "Listings", "Months Inventory"),
  Modal_value = formatted_modal_values  
)

kable(mode_table, caption = "Quantitative variables modal values")

Quantitative variables modal values
Variable	Modal_value
Sales	124
Volume	35.34
Median Price	130000
Listings	1581
Months Inventory	8.10

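Position indices are summary measures: single values, computed from the data, that describe where the distribution is located. The mode can be defined for every type of variable: it is the modality (value or category) with the highest absolute frequency. In case of ties, get_mode returns the first such value encountered. For continuous variables such as volume, where most values occur only once, the mode is of limited use.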

#Maximum and minimum
get_min_max <- function(x) {
  return(c(min(x), max(x)))
}

min_max_values <- sapply(quant_variables, function(var) get_min_max(get(var)))

formatted_min_max_values <- apply(min_max_values, 2, function(x) sapply(x, format_number))

min_max_table <- data.frame(
  Min = formatted_min_max_values[1, ],
  Max = formatted_min_max_values[2, ]
)

kable(min_max_table, caption = "Maximum and Minimum Values for Each Variable")

Maximum and Minimum Values for Each Variable
	Min	Max
sales	79	423
volume	8.17	83.55
median_price	73800	180000
listings	743	3296
months_inventory	3.40	14.90

The minimum and the maximum are the smallest and the largest observed values of each variable.

#Median values
get_median <- function(x) {
  return(median(x))
}

median_values <- sapply(quant_variables, function(var) get_median(get(var)))

formatted_median_values <- sapply(median_values, format_number)

median_table <- data.frame(
  Median = formatted_median_values
)

kable(median_table, caption = "Median Values for Each Variable")

Median Values for Each Variable
	Median
sales	175.50
volume	27.06
median_price	134500
listings	1618.50
months_inventory	8.95

The median is a robust index, meaning that it is not affected by outliers in the distribution. It can be calculated for both quantitative and ordinal qualitative variables and corresponds to the 50th percentile. It is obtained by sorting the data in ascending or descending order and taking the value exactly in the middle of the series (for quantitative data with an even number of observations, the average of the two central values).
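
The robustness of the median can be illustrated with a small hypothetical example (the vectors x and x_out are made up for this sketch): replacing the largest value with an extreme one moves the mean but leaves the median unchanged.

```r
# Hypothetical sample and the same sample with one extreme value.
x     <- c(10, 12, 13, 15, 18)
x_out <- c(10, 12, 13, 15, 180)

# The median is the middle value of the sorted data in both cases.
med_x     <- median(x)
med_x_out <- median(x_out)

# The mean is pulled towards the outlier.
mean_x     <- mean(x)
mean_x_out <- mean(x_out)

c(median = med_x, median_outlier = med_x_out,
  mean = mean_x, mean_outlier = mean_x_out)
```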

#Quantiles and quartiles
get_quantiles_and_summary <- function(var) {
  quantiles_10 <- quantile(var, seq(0, 1, 0.1))
  quantiles_100 <- quantile(var, seq(0, 1, 0.01))
  summary_stats <- summary(var)
  return(list(quantiles_10 = quantiles_10, quantiles_100 = quantiles_100, summary = summary_stats))
}

results <- lapply(quant_variables, function(var) get_quantiles_and_summary(get(var)))

names(results) <- quant_variables  
results

## $sales
## $sales$quantiles_10
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##  79.0 101.9 120.6 135.0 155.0 175.5 197.0 228.5 271.0 302.1 423.0 
## 
## $sales$quantiles_100
##     0%     1%     2%     3%     4%     5%     6%     7%     8%     9%    10% 
##  79.00  80.39  87.68  89.17  91.56  93.00  94.34  96.73  98.24 101.00 101.90 
##    11%    12%    13%    14%    15%    16%    17%    18%    19%    20%    21% 
## 104.29 107.68 108.07 110.00 111.00 113.00 114.00 115.02 117.82 120.60 123.00 
##    22%    23%    24%    25%    26%    27%    28%    29%    30%    31%    32% 
## 124.00 124.00 125.00 127.00 128.14 130.00 130.92 132.62 135.00 137.27 140.48 
##    33%    34%    35%    36%    37%    38%    39%    40%    41%    42%    43% 
## 143.00 144.52 147.65 149.00 149.43 150.00 151.21 155.00 159.00 160.00 161.54 
##    44%    45%    46%    47%    48%    49%    50%    51%    52%    53%    54% 
## 163.00 164.00 165.94 167.66 169.72 173.11 175.50 176.89 180.28 181.67 182.00 
##    55%    56%    57%    58%    59%    60%    61%    62%    63%    64%    65% 
## 186.00 186.84 189.00 191.86 196.00 197.00 198.00 200.36 202.00 205.92 208.35 
##    66%    67%    68%    69%    70%    71%    72%    73%    74%    75%    76% 
## 211.48 213.65 219.04 224.91 228.50 233.69 238.00 239.94 244.00 247.00 253.00 
##    77%    78%    79%    80%    81%    82%    83%    84%    85%    86%    87% 
## 254.03 258.84 262.00 271.00 272.59 277.94 282.00 283.52 287.30 289.00 292.93 
##    88%    89%    90%    91%    92%    93%    94%    95%    96%    97%    98% 
## 295.32 298.00 302.10 314.47 321.40 326.54 333.98 347.30 357.00 367.64 372.32 
##    99%   100% 
## 396.54 423.00 
## 
## $sales$summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    79.0   127.0   175.5   192.3   247.0   423.0 
## 
## 
## $volume
## $volume$quantiles_10
##      0%     10%     20%     30%     40%     50%     60%     70%     80%     90% 
##  8.1660 13.0967 16.1206 19.0344 23.9976 27.0625 31.8436 36.9307 45.5920 53.7391 
##    100% 
## 83.5470 
## 
## $volume$quantiles_100
##       0%       1%       2%       3%       4%       5%       6%       7% 
##  8.16600  9.11909  9.48346  9.67093 10.51088 11.15940 11.36852 11.81244 
##       8%       9%      10%      11%      12%      13%      14%      15% 
## 12.05580 12.46069 13.09670 13.45837 13.57160 13.82113 13.90286 14.00300 
##      16%      17%      18%      19%      20%      21%      22%      23% 
## 14.52280 15.04925 15.28384 15.51278 16.12060 16.16999 16.35950 16.90221 
##      24%      25%      26%      27%      28%      29%      30%      31% 
## 17.21956 17.65950 17.79860 18.09290 18.24536 18.67634 19.03440 19.68418 
##      32%      33%      34%      35%      36%      37%      38%      39% 
## 19.78016 20.33667 20.69896 21.17025 22.19824 22.78383 22.96914 23.55640 
##      40%      41%      42%      43%      44%      45%      46%      47% 
## 23.99760 24.22659 24.88318 24.94887 25.29348 25.44155 25.67790 25.89993 
##      48%      49%      50%      51%      52%      53%      54%      55% 
## 26.36344 26.82692 27.06250 27.33559 28.44688 28.63698 28.89610 29.44380 
##      56%      57%      58%      59%      60%      61%      62%      63% 
## 30.07740 30.31716 31.02136 31.65611 31.84360 32.12492 32.70494 33.50099 
##      64%      65%      66%      67%      68%      69%      70%      71% 
## 34.04908 34.71695 34.89208 35.23495 35.36620 36.23819 36.93070 37.74989 
##      72%      73%      74%      75%      76%      77%      78%      79% 
## 38.69228 39.37236 40.23348 40.89300 41.09704 42.42399 42.56602 43.81661 
##      80%      81%      82%      83%      84%      85%      86%      87% 
## 45.59200 46.60949 47.43920 48.45832 49.25504 49.97910 50.75608 51.57237 
##      88%      89%      90%      91%      92%      93%      94%      95% 
## 51.90736 52.75139 53.73910 55.67585 59.66368 60.74511 62.27676 63.83685 
##      96%      97%      98%      99%     100% 
## 66.95336 68.64489 70.54574 77.25487 83.54700 
## 
## $volume$summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.166  17.660  27.062  31.005  40.893  83.547 
## 
## 
## $median_price
## $median_price$quantiles_10
##     0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
##  73800  99960 110000 121650 130700 134500 141220 147960 152360 158850 180000 
## 
## $median_price$quantiles_100
##     0%     1%     2%     3%     4%     5%     6%     7%     8%     9%    10% 
##  73800  86095  87156  88736  90000  91180  92472  94730  96796  98418  99960 
##    11%    12%    13%    14%    15%    16%    17%    18%    19%    20%    21% 
## 100700 102012 102314 102684 104565 105048 105578 108006 109223 110000 111290 
##    22%    23%    24%    25%    26%    27%    28%    29%    30%    31%    32% 
## 113474 114291 116180 117300 118800 119624 120692 121131 121650 122827 123488 
##    33%    34%    35%    36%    37%    38%    39%    40%    41%    42%    43% 
## 124287 126126 127715 128812 129286 129928 130000 130700 130800 131438 132331 
##    44%    45%    46%    47%    48%    49%    50%    51%    52%    53%    54% 
## 132548 133065 133294 133899 134172 134311 134500 134967 135412 135901 136330 
##    55%    56%    57%    58%    59%    60%    61%    62%    63%    64%    65% 
## 137870 138468 139246 139848 140501 141220 142358 142772 143828 144100 144670 
##    66%    67%    68%    69%    70%    71%    72%    73%    74%    75%    76% 
## 144874 145917 146952 147582 147960 148369 148516 148994 149300 150050 150892 
##    77%    78%    79%    80%    81%    82%    83%    84%    85%    86%    87% 
## 151506 151942 152100 152360 153100 153888 154696 155276 155515 155654 156386 
##    88%    89%    90%    91%    92%    93%    94%    95%    96%    97%    98% 
## 156500 157097 158850 159547 161000 161454 163766 165340 167828 169583 172288 
##    99%   100% 
## 174813 180000 
## 
## $median_price$summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   73800  117300  134500  132665  150050  180000 
## 
## 
## $listings
## $listings$quantiles_10
##     0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
##  743.0  899.9  968.0 1208.7 1525.2 1618.5 1687.8 1796.0 2721.4 2946.7 3296.0 
## 
## $listings$quantiles_100
##      0%      1%      2%      3%      4%      5%      6%      7%      8%      9% 
##  743.00  775.17  799.90  822.53  841.36  849.70  855.70  866.11  877.60  891.08 
##     10%     11%     12%     13%     14%     15%     16%     17%     18%     19% 
##  899.90  904.29  907.00  914.00  918.68  932.70  938.48  941.00  950.10  961.82 
##     20%     21%     22%     23%     24%     25%     26%     27%     28%     29% 
##  968.00  973.00  994.74 1004.97 1018.16 1026.50 1030.14 1046.83 1126.00 1168.64 
##     30%     31%     32%     33%     34%     35%     36%     37%     38%     39% 
## 1208.70 1261.90 1328.72 1385.00 1439.78 1443.95 1462.72 1486.00 1496.92 1502.89 
##     40%     41%     42%     43%     44%     45%     46%     47%     48%     49% 
## 1525.20 1533.99 1540.90 1559.01 1570.80 1576.10 1581.00 1586.66 1598.16 1604.99 
##     50%     51%     52%     53%     54%     55%     56%     57%     58%     59% 
## 1618.50 1623.56 1638.80 1646.67 1655.12 1660.35 1668.52 1672.69 1676.24 1681.02 
##     60%     61%     62%     63%     64%     65%     66%     67%     68%     69% 
## 1687.80 1694.95 1708.00 1722.57 1729.76 1737.45 1749.74 1762.39 1769.08 1784.46 
##     70%     71%     72%     73%     74%     75%     76%     77%     78%     79% 
## 1796.00 1817.04 1830.16 1833.47 1844.30 2056.00 2485.60 2609.48 2643.50 2690.30 
##     80%     81%     82%     83%     84%     85%     86%     87%     88%     89% 
## 2721.40 2733.72 2762.62 2789.11 2825.44 2852.45 2862.94 2875.93 2903.40 2932.78 
##     90%     91%     92%     93%     94%     95%     96%     97%     98%     99% 
## 2946.70 2966.72 2996.56 3041.27 3051.24 3094.35 3164.36 3214.26 3263.66 3270.05 
##    100% 
## 3296.00 
## 
## $listings$summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     743    1026    1618    1738    2056    3296 
## 
## 
## $months_inventory
## $months_inventory$quantiles_10
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##  3.40  6.69  7.50  7.97  8.40  8.95  9.40 10.53 11.40 12.21 14.90 
## 
## $months_inventory$quantiles_100
##     0%     1%     2%     3%     4%     5%     6%     7%     8%     9%    10% 
##  3.400  4.000  4.078  4.551  4.956  5.000  5.068  6.073  6.124  6.400  6.690 
##    11%    12%    13%    14%    15%    16%    17%    18%    19%    20%    21% 
##  6.929  7.000  7.100  7.100  7.200  7.224  7.300  7.402  7.500  7.500  7.600 
##    22%    23%    24%    25%    26%    27%    28%    29%    30%    31%    32% 
##  7.600  7.600  7.736  7.800  7.800  7.853  7.900  7.900  7.970  8.000  8.000 
##    33%    34%    35%    36%    37%    38%    39%    40%    41%    42%    43% 
##  8.087  8.100  8.100  8.100  8.143  8.300  8.321  8.400  8.400  8.500  8.577 
##    44%    45%    46%    47%    48%    49%    50%    51%    52%    53%    54% 
##  8.600  8.700  8.700  8.733  8.800  8.900  8.950  9.000  9.000  9.067  9.100 
##    55%    56%    57%    58%    59%    60%    61%    62%    63%    64%    65% 
##  9.100  9.200  9.300  9.362  9.400  9.400  9.500  9.618  9.814  9.996 10.000 
##    66%    67%    68%    69%    70%    71%    72%    73%    74%    75%    76% 
## 10.174 10.213 10.400 10.491 10.530 10.600 10.700 10.747 10.800 10.950 11.100 
##    77%    78%    79%    80%    81%    82%    83%    84%    85%    86%    87% 
## 11.200 11.300 11.300 11.400 11.400 11.500 11.600 11.600 11.615 11.700 11.700 
##    88%    89%    90%    91%    92%    93%    94%    95%    96%    97%    98% 
## 11.932 12.000 12.210 12.300 12.400 12.600 12.898 13.015 13.344 13.483 13.844 
##    99%   100% 
## 14.500 14.900 
## 
## $months_inventory$summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.400   7.800   8.950   9.193  10.950  14.900

Quantiles are the values that split a set of data, sorted in ascending order, into several parts of equal frequency. Quartiles are the three values that divide the ordered series into four equal parts; they correspond to the 25th, 50th and 75th percentiles, respectively. A percentile Xp is the value that divides the distribution into two parts, such that p% of the values are less than or equal to Xp and (100-p)% are greater than Xp.
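
A small sketch of these definitions on hypothetical data (the integers 1 to 9). By default quantile() interpolates linearly between neighbouring observations, which is why the percentiles printed above can take values, such as 101.9 for sales, that are not observed in the data.

```r
# Hypothetical ordered data: the integers 1 to 9.
x <- 1:9

# Quartiles, i.e. the 25th, 50th and 75th percentiles.
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q

# Share of observations less than or equal to the median (5 out of 9).
share_below_median <- mean(x <= q[["50%"]])
share_below_median
```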

#Mean values
get_mean <- function(x) {
  return(mean(x))
}

mean_values <- sapply(quant_variables, function(var) get_mean(get(var)))

formatted_mean_values <- sapply(mean_values, format_number)

mean_table <- data.frame(
  Mean = formatted_mean_values
)

kable(mean_table, caption = "Mean Values for Each Variable")

Mean Values for Each Variable
	Mean
sales	192.29
volume	31.01
median_price	132665.42
listings	1738.02
months_inventory	9.19

The arithmetic mean is obtained by summing all the values of a quantitative variable and dividing by the number of values.

#VARIABILITY INDEXES
#Values range
get_range <- function(x) {
  return(range(x))
}

range_values <- sapply(quant_variables, function(var) get_range(get(var)))

formatted_range_values <- apply(range_values, 2, function(x) sapply(x, format_number))

range_table <- data.frame(
  Min = formatted_range_values[1, ],
  Max = formatted_range_values[2, ]
)

kable(range_table, caption = "Range (Min and Max) for Each Variable")

Range (Min and Max) for Each Variable
	Min	Max
sales	79	423
volume	8.17	83.55
median_price	73800	180000
listings	743	3296
months_inventory	3.40	14.90

Position indices alone are not enough, since they give no information about variability, that is, how and how much the data are spread over their range. The variability of a distribution measures the tendency of the units to take different values or modalities of the variable. Variability indices summarize this diversity in a single value and thus make it possible to compare different distributions with each other. The range of a variable is the difference between its maximum and minimum observed values; note that R's range() returns the two endpoints (as in the table above), so the width of the range is Max minus Min.

#InterQuartile range (IQR)
get_iqr <- function(x) {
  return(IQR(x))
}

iqr_values <- sapply(quant_variables, function(var) get_iqr(get(var)))

formatted_iqr_values <- sapply(iqr_values, format_number)

iqr_table <- data.frame(
  IQR = formatted_iqr_values
)

kable(iqr_table, caption = "Interquartile Range (IQR) for Each Variable")

Interquartile Range (IQR) for Each Variable
	IQR
sales	120
volume	23.23
median_price	32750
listings	1029.50
months_inventory	3.15

The interquartile range is the difference between the 3rd quartile and the 1st quartile. It shows the spread of the central body of the data, i.e., the 50% of observations lying between the two quartiles. It is a robust index, because it is not affected by outliers in the tails of the distribution.
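
The contrast between the range and the IQR can be sketched on hypothetical data (x and x_out are made-up vectors): a single extreme value inflates the width of the range but leaves the IQR unchanged.

```r
# Hypothetical data, without and with one extreme value.
x     <- 1:9
x_out <- c(1:8, 900)

# Width of the range: maximum minus minimum.
range_width     <- diff(range(x))
range_width_out <- diff(range(x_out))

# Interquartile range: 3rd quartile minus 1st quartile.
iqr_x     <- IQR(x)
iqr_x_out <- IQR(x_out)

c(range = range_width, range_outlier = range_width_out,
  iqr = iqr_x, iqr_outlier = iqr_x_out)
```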

#Variance and standard deviation
options(scipen = 999)

get_variance_sd <- function(x) {
  return(c(variance = var(x), sd = sd(x)))
}

variance_sd_values <- sapply(quant_variables, function(var) get_variance_sd(get(var)))

formatted_variance_values <- sapply(variance_sd_values["variance", ], format_number)
formatted_sd_values <- sapply(variance_sd_values["sd", ], format_number)

variance_sd_table <- data.frame(
  Variance = formatted_variance_values,
  Standard_Deviation = formatted_sd_values
)

kable(variance_sd_table, caption = "Variance and Standard Deviation for Each Variable")

Variance and Standard Deviation for Each Variable
	Variance	Standard_Deviation
sales	6344.30	79.65
volume	277.27	16.65
median_price	513572983.09	22662.15
listings	566568.97	752.71
months_inventory	5.31	2.30

Variance shows the degree of dispersion around the mean value. Standard deviation represents the square root of the variance and gives us an index of variability in the same unit as the observed data (a measure more consistent with the source data). The latter is an absolute index and, therefore, is affected by both the unit of measurement of the variable and the order of magnitude of the data. If the mean values are very different, the standard deviation may not be a suitable measure for comparing different data.
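
As a worked example on hypothetical data, the variance and standard deviation can be computed by hand and compared with var() and sd(). Note that R uses the sample formula, with n - 1 in the denominator.

```r
# Hypothetical sample with mean 5.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sum of squared deviations from the mean, divided by n - 1.
var_by_hand <- sum((x - mean(x))^2) / (n - 1)

# The standard deviation is the square root of the variance,
# so it is expressed in the same unit as x.
sd_by_hand <- sqrt(var_by_hand)

c(var_by_hand = var_by_hand, var_r = var(x),
  sd_by_hand = sd_by_hand, sd_r = sd(x))
```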

#Coefficient of variation
CV <- function(x) {
  return(sd(x) / mean(x) * 100)
}

cv_values <- sapply(quant_variables, function(var) CV(get(var)))

formatted_cv_values <- sapply(cv_values, format_number)

cv_table <- data.frame(
  CV = formatted_cv_values
)

kable(cv_table, caption = "Coefficient of Variation (CV) for Each Variable")

Coefficient of Variation (CV) for Each Variable
	CV
sales	41.42
volume	53.71
median_price	17.08
listings	43.31
months_inventory	25.06

The coefficient of variation makes it possible to compare the variability of a sample relative to two different variables, or the variability of two samples relative to the same variable. This coefficient, however, is problematic if the variable has both positive and negative values, or in the case of a conventional zero in the measurement scale.
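
A short sketch of why the CV allows comparisons across scales, using two hypothetical variables in very different units (the vectors prices and months are made up; cv is a local helper equivalent to the CV function above).

```r
# Local helper: standard deviation as a percentage of the mean.
cv <- function(x) sd(x) / mean(x) * 100

# Hypothetical variables on very different scales.
prices <- c(100000, 120000, 140000)  # dollars
months <- c(6, 9, 12)                # months

# The standard deviations are not comparable (20000 vs 3) ...
c(sd_prices = sd(prices), sd_months = sd(months))

# ... but the CVs are: months is relatively more variable.
cv_prices <- cv(prices)
cv_months <- cv(months)
c(cv_prices = cv_prices, cv_months = cv_months)

# The CV does not change when the unit of measurement changes.
cv_prices_thousands <- cv(prices / 1000)
```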

#SHAPE INDEXES
#Skewness index
library(moments)

skew_values <- sapply(quant_variables, function(var) skewness(get(var)))

formatted_skew_values <- sapply(skew_values, format_number)

skewness_table <- data.frame(
  Skewness = formatted_skew_values
)

kable(skewness_table, caption = "Skewness for Each Variable")

Skewness for Each Variable
	Skewness
sales	0.72
volume	0.88
median_price	-0.36
listings	0.65
months_inventory	0.04

The shape indices are skewness and kurtosis, features of the distribution based on the central moments of order three and four of a random variable, respectively. A distribution is said to be symmetrical if it is possible to identify a vertical axis that cuts it into two mirror-image halves (the variable must be at least ordinal). The skewness index is zero for a symmetrical distribution, positive when the right tail is longer, and negative when the left tail is longer.
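
As a sketch of the third-moment definition, skewness can be computed by hand in base R. The function skew_by_hand is a hypothetical helper; it uses moments with denominator n, which to my understanding is the same definition used by moments::skewness.

```r
# Skewness as the third central moment divided by the
# second central moment raised to the power 3/2.
skew_by_hand <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Hypothetical examples: a symmetric sample and a right-skewed one.
skew_symmetric <- skew_by_hand(c(1, 2, 3, 4, 5))
skew_right     <- skew_by_hand(c(1, 1, 2, 2, 3, 10))

c(symmetric = skew_symmetric, right_skewed = skew_right)
```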

#Kurtosis coefficient
kurt_values <- sapply(quant_variables, function(var) kurtosis(get(var)) - 3)

formatted_kurt_values <- sapply(kurt_values, format_number)

kurtosis_table <- data.frame(
  Excess_Kurtosis = formatted_kurt_values
)

kable(kurtosis_table, caption = "Kurtosis for Each Variable")

Kurtosis for Each Variable
	Excess_Kurtosis
sales	-0.31
volume	0.18
median_price	-0.62
listings	-0.79
months_inventory	-0.17

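Kurtosis measures how flattened or peaked (and how heavy-tailed) a distribution is relative to the normal distribution. Since the code subtracts 3, the kurtosis of a normal distribution, the values reported are excess kurtosis: negative values indicate a flatter distribution with lighter tails than the normal (platykurtic), positive values a more peaked one with heavier tails (leptokurtic).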
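
The fourth-moment definition can be sketched in the same way. The helper excess_kurt_by_hand is hypothetical and uses moments with denominator n, then subtracts 3 as in the code above.

```r
# Excess kurtosis: fourth central moment divided by the squared
# second central moment, minus 3 (the value for a normal distribution).
excess_kurt_by_hand <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Hypothetical flat sample: evenly spread values give a negative result.
excess_flat <- excess_kurt_by_hand(c(1, 2, 3, 4, 5))
excess_flat
```

For this sample the second and fourth central moments are 2 and 6.8, so the result is 6.8 / 4 - 3 = -1.3.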

for (var in quant_variables) {
  plot(density(get(var)), main = paste(var, "density plot"), xlab = var)
}

These density plots show graphically the distribution of each quantitative variable.

#frequency distribution for qualitative variables
freq_distr <- function(x){
  N <- length(x)
  freq <- table(x)
  rel_freq <- freq / N
  cum_freq <- cumsum(freq)
  cum_rel_freq <- cumsum(freq) / N
  return(cbind(freq, rel_freq, cum_freq, cum_rel_freq))
}

city_freq <- freq_distr(city)
year_freq <- freq_distr(year)
month_freq <- freq_distr(month)

formatted_city_freq <- apply(city_freq, 2, function(col) sapply(col, format_number))
formatted_year_freq <- apply(year_freq, 2, function(col) sapply(col, format_number))
formatted_month_freq <- apply(month_freq, 2, function(col) sapply(col, format_number))

city_freq_table <- data.frame(formatted_city_freq)
year_freq_table <- data.frame(formatted_year_freq)
month_freq_table <- data.frame(formatted_month_freq)

kable(city_freq_table, caption = "Frequency Distribution for City")

Frequency Distribution for City
	freq	rel_freq	cum_freq	cum_rel_freq
Beaumont	60	0.25	60	0.25
Bryan-College Station	60	0.25	120	0.50
Tyler	60	0.25	180	0.75
Wichita Falls	60	0.25	240	1

kable(year_freq_table, caption = "Frequency Distribution for Year")

Frequency Distribution for Year
	freq	rel_freq	cum_freq	cum_rel_freq
2010	48	0.20	48	0.20
2011	48	0.20	96	0.40
2012	48	0.20	144	0.60
2013	48	0.20	192	0.80
2014	48	0.20	240	1

kable(month_freq_table, caption = "Frequency Distribution for Month", row.names = TRUE) #row.names = TRUE keeps the month labels 1-12, which kable otherwise drops because they look like default row numbers

Frequency Distribution for Month
	freq	rel_freq	cum_freq	cum_rel_freq
1	20	0.08	20	0.08
2	20	0.08	40	0.17
3	20	0.08	60	0.25
4	20	0.08	80	0.33
5	20	0.08	100	0.42
6	20	0.08	120	0.50
7	20	0.08	140	0.58
8	20	0.08	160	0.67
9	20	0.08	180	0.75
10	20	0.08	200	0.83
11	20	0.08	220	0.92
12	20	0.08	240	1

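These are the frequency tables for the qualitative variable city and for the variables treated as qualitative (year and month). The dataset is perfectly balanced: each city, each year, and each month has the same number of observations.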

IDENTIFICATION OF VARIABLES WITH GREATER VARIABILITY AND SKEWNESS

“volume” is the variable with the highest relative variability: its coefficient of variation (CV) is 53.70536, or about 53.71%. A high CV means that the standard deviation is large relative to the mean, so individual observations tend to lie far from the mean. This can be due to several factors, such as the presence of outliers or strong skewness.

The CV is used for this comparison because the standard deviation, being an absolute index, has the same unit of measurement and order of magnitude as the variable on which it is calculated, so it is not suitable for comparing variables measured on different scales.

“volume” is also the variable with the highest skewness (positive, with a value of 0.884742). In a positively skewed distribution low values are more frequent and the right tail is longer, and typically mean > median > mode. For “volume” the mean (31.01) is indeed greater than the median (27.06), although the reported mode (35.34) does not follow this ordering, since the mode of a continuous variable with few repeated values is not very informative. The distribution therefore departs from the symmetry of a normal distribution, in which a vertical axis splits the distribution into two mirror-image halves.

CLASSES CREATION FOR QUANTITATIVE VARIABLES

realestate_texas$cl_sales <- cut(
  realestate_texas$sales,
  breaks = c(79, 100, 150, 200, 250, 300, 350, 400, 423),  
  right = TRUE,
  include.lowest = TRUE
)
View(realestate_texas)

attach(realestate_texas) #attach again so that the new column cl_sales can be used without $

freq_distr_cl_sales <- freq_distr(cl_sales)
View(freq_distr_cl_sales)
freq_distr_cl_sales <- as.data.frame(freq_distr_cl_sales)

A frequency distribution was created starting from the quantitative variable “sales”.

create_barplot <- function(data, freq_col, ylab, ylim, main) {
  barplot_heights <- barplot(data[[freq_col]],
                             main = main,
                             xlab = "Sales Class",
                             ylab = ylab,
                             ylim = ylim,
                             names.arg = rownames(data),
                             cex.names = 0.7)
  text(x = barplot_heights, 
       y = data[[freq_col]],  
       labels = round(data[[freq_col]], 3), 
       cex = 0.8,  
       col = "black",  
       pos = 3)
}

par(mfrow = c(2, 2))

create_barplot(freq_distr_cl_sales, "freq", "Frequency", c(0, 90), "Frequency Class Distribution")
create_barplot(freq_distr_cl_sales, "rel_freq", "Relative Frequency", c(0, 0.35), "Relative Frequency Class Distribution")
create_barplot(freq_distr_cl_sales, "cum_freq", "Cumulative Frequency", c(0, 300), "Cumulative Frequency Class Distribution")
create_barplot(freq_distr_cl_sales, "cum_rel_freq", "Cumulative Relative Frequency", c(0, 1.2), "Cumulative Relative Frequency Class Distribution")

par(mfrow = c(1, 1))

The class (100, 150] has both the highest absolute frequency and the highest relative frequency; the two always coincide, because relative frequencies are absolute frequencies divided by the total number of observations.

#Gini coefficient
gini_index <- function(x) {
  ni <- table(x)            #absolute frequencies
  fi <- ni / length(x)      #relative frequencies
  fi2 <- fi^2
  J <- length(ni)           #number of classes

  gini <- 1 - sum(fi2)
  normalized_gini <- gini / ((J - 1) / J)

  return(normalized_gini)
}

gini_index_cl_sales <- gini_index(cl_sales)
gini_index_cl_sales_rounded <- round(gini_index_cl_sales, 2)
gini_index_cl_sales_table <- data.frame(Gini_Index = gini_index_cl_sales_rounded)
kable(gini_index_cl_sales_table, col.names = c("cl_sales Gini index"))

cl_sales Gini index
0.92

The normalized Gini heterogeneity index for the qualitative variable “cl_sales” is about 0.92, close to its maximum of 1. This indicates high heterogeneity: the statistical units are spread fairly evenly across the classes of the variable, although not perfectly so (perfect equidistribution would give exactly 1).
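
To make the scale of the index concrete, the sketch below applies the same formula to two hypothetical categorical vectors. The helper gini_norm is a compact restatement of the formula above, written here only for illustration.

```r
# Normalized Gini heterogeneity index: (1 - sum(f^2)) / ((J - 1) / J),
# where f are the relative frequencies and J the number of categories.
gini_norm <- function(x) {
  f <- table(x) / length(x)
  J <- length(f)
  (1 - sum(f^2)) / ((J - 1) / J)
}

# Hypothetical data: perfectly even split over four categories.
gini_even <- gini_norm(c("a", "b", "c", "d"))

# Hypothetical data: three units out of four in the same category.
gini_uneven <- gini_norm(c("a", "a", "a", "b"))

c(even = gini_even, uneven = gini_uneven)
```

The even split gives the maximum value 1, while the uneven one gives (1 - 10/16) / (1/2) = 0.75. With a single observed category the denominator is zero, so the index is undefined.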

PROBABILITY CALCULATION

library(ggplot2)

total_obs <- nrow(realestate_texas)
num_beaumont <- sum(city == "Beaumont")
prob_beaumont <- num_beaumont / total_obs

num_july <- sum(month == 7) #month is numeric, so compare with a number
prob_july <- num_july / total_obs

year_month <- paste(year, month, sep = "-")
num_dec_2012 <- sum(year_month == "2012-12")
prob_dec_2012 <- num_dec_2012 / total_obs

prob_data <- data.frame(
  Category = c("Beaumont", "July", "December 2012"),
  Probability = c(prob_beaumont, prob_july, prob_dec_2012)
)

combined_plot <- ggplot(prob_data, aes(x = Category, y = Probability)) +
  geom_bar(stat = "identity", col = "black", fill = "gray", width = 0.6) +  
  geom_text(aes(y = Probability, label = round(Probability, 3)),    
            color = "black", vjust = -0.5, size = 4) +  
  labs(x = "Category", y = "Probability") +
  theme(aspect.ratio = 1, plot.margin = margin(5, 30, 5, 5))

combined_plot

These probabilities are the relative frequencies of the rows that satisfy each condition. The probability that a randomly chosen row of the dataset refers to the city “Beaumont” is 0.25 (25% of the cases); the probability that it refers to the month “July” is about 0.083 (about 8.3%); and the probability that it refers to “December 2012” is about 0.017 (about 1.7%).
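
Because the mean of a logical vector is the share of TRUE values, the same relative frequencies can be obtained more compactly. The sketch below uses a small hypothetical data frame (toy), not the real dataset.

```r
# Hypothetical data frame with the same column names as the dataset.
toy <- data.frame(
  city  = c("Beaumont", "Tyler", "Beaumont", "Tyler"),
  year  = c(2012, 2012, 2013, 2013),
  month = c(12, 12, 7, 7)
)

# Probability of a single condition.
p_beaumont <- mean(toy$city == "Beaumont")

# Probability of a joint condition on year and month.
p_dec_2012 <- mean(toy$year == 2012 & toy$month == 12)

# Probability of all three conditions together.
p_all <- mean(toy$city == "Beaumont" & toy$year == 2012 & toy$month == 12)

c(p_beaumont = p_beaumont, p_dec_2012 = p_dec_2012, p_all = p_all)
```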

CREATION OF NEW VARIABLES

realestate_texas$mean_price <- realestate_texas$volume / realestate_texas$sales * 1000000 #volume is in millions of dollars

“mean_price” represents the average sales price per unit. It was calculated as the ratio of the total sales volume (volume) to the total number of sales (sales), multiplied by 1,000,000 to express the result in dollars (volume is expressed in millions of dollars). Basically, this value gives you the average sales price of a property for each completed transaction.
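
A worked example of the unit conversion, with hypothetical figures (the vectors volume_musd and sales_n are made up): 30 million dollars spread over 200 sales corresponds to an average price of 150,000 dollars.

```r
# Hypothetical monthly figures.
volume_musd <- c(30, 12.5)  # total value of sales, millions of dollars
sales_n     <- c(200, 100)  # number of sales

# Average price per sale, converted from millions of dollars to dollars.
mean_price_usd <- volume_musd / sales_n * 1e6
mean_price_usd
```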

realestate_texas$sales_efficiency <- round(realestate_texas$sales / realestate_texas$listings, 3)
View(realestate_texas)

“sales_efficiency” measures how effectively listings turn into sales. It is calculated as the ratio of the number of sales (sales) to the number of active listings (listings), so it indicates how many sales are made per active listing in a given month.

A high value means that a large share of the active listings results in a transaction. This could be the case during periods, or in cities, with high demand for real estate, where listings find buyers quickly. A low value means that there are many listings without correspondingly high sales: properties remain unsold for longer periods, which could reflect low demand, prices that are too high, or other issues related to the properties for sale.

Finally, View() shows the new variables in the dataset.

CONDITIONAL ANALYSIS

library(dplyr)
library(ggplot2)
library(tidyr)

summary_data_city_sales <- realestate_texas %>%
  group_by(city) %>%
  summarise(mean = mean(sales),
            st_dev = sd(sales))

summary_data_city_sales_plot <- ggplot(summary_data_city_sales) +
  geom_bar(aes(x = city, y = mean, fill = "Mean"), 
           stat = "identity", position = "dodge") +
  geom_errorbar(aes(x = city, ymin = mean - st_dev, ymax = mean + st_dev, color = "Standard Deviation"), width = 0.2, linewidth = 1.5) +
  scale_fill_manual(values = c("Mean" = "#00BFFF")) +  
  scale_color_manual(values = c("Standard Deviation" = "red1")) +
  labs(title = "Sales Mean (Blue) and Standard Deviation (Red) by City",
       x = "City",
       y = "Value") + 
  theme_minimal() +
  theme(legend.position = "none")

summary_data_city_sales_plot

“Sales” shows the highest mean in “Tyler” and the highest variability in “Bryan-College Station”.

summary_data_year_sales <- realestate_texas %>%
  group_by(year) %>%
  summarise(mean = mean(sales),
            st_dev = sd(sales))

summary_data_year_sales_plot <- ggplot(summary_data_year_sales) +
  geom_bar(aes(x = year, y = mean, fill = "Mean"), 
           stat = "identity", position = "dodge") +
  geom_errorbar(aes(x = year, ymin = mean - st_dev, ymax = mean + st_dev, color = "Standard Deviation"), width = 0.2, linewidth = 1.5) +
  scale_fill_manual(values = c("Mean" = "#00BFFF")) +  
  scale_color_manual(values = c("Standard Deviation" = "red1")) +
  labs(title = "Sales Mean (Blue) and Standard Deviation (Red) by Year",
       x = "Year",
       y = "Value") + 
  theme_minimal() +
  theme(legend.position = "none")

summary_data_year_sales_plot

“Sales” shows the highest mean and the highest variability in “2014”.

summary_data_month_sales <- realestate_texas %>%
  group_by(month) %>%
  summarise(mean = mean(sales),
            st_dev = sd(sales)) %>%
  mutate(month = factor(month, levels = 1:12, labels = month.name))

summary_data_month_sales_plot <- ggplot(summary_data_month_sales) +
  geom_bar(aes(x = month, y = mean, fill = "Mean"), 
           stat = "identity", position = "dodge") +
  geom_errorbar(aes(x = month, ymin = mean - st_dev, ymax = mean + st_dev, color = "Standard Deviation"), width = 0.2, linewidth = 1.5) +
  scale_fill_manual(values = c("Mean" = "#00BFFF")) +  
  scale_color_manual(values = c("Standard Deviation" = "red1")) +
  labs(title = "Sales Mean (Blue) and Standard Deviation (Red) by Month",
       x = "Month",
       y = "Value") + 
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

summary_data_month_sales_plot

“Sales” shows the highest mean and the highest variability in “June”.

summary_data_city_year_sales <- realestate_texas %>%
  group_by(city, year) %>%
  summarise(mean = mean(sales),
            st_dev = sd(sales),
            .groups = "keep")

summary_data_city_year_sales_plot <- ggplot(summary_data_city_year_sales) +
  geom_bar(aes(x = interaction(city, year), y = mean, fill = "Mean"), 
           stat = "identity", position = "dodge") +
  geom_errorbar(aes(x = interaction(city, year), 
                    ymin = mean - st_dev, 
                    ymax = mean + st_dev, 
                    color = "Standard Deviation"), 
                width = 0.2, linewidth = 1.5) +
  scale_fill_manual(values = c("Mean" = "#00BFFF")) +  
  scale_color_manual(values = c("Standard Deviation" = "red1")) +
  labs(title = "Sales Mean (Blue) and Standard Deviation (Red) by City and Year",
       x = "City and Year",
       y = "Value") + 
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))

summary_data_city_year_sales_plot

“Sales” shows the highest mean in “Tyler 2014” and the highest variability in “Bryan-College Station 2013”.

GGPLOT2 VISUALIZATIONS

library(ggplot2)
library(lubridate)
library(dplyr)

# Median Prices by City box-plot
median_prices_boxplot <- ggplot(data = realestate_texas) +
  geom_boxplot(aes(x = city, y = median_price), 
               fill = "gray80",            
               color = "black",            
               outlier.colour = "red",     
               outlier.size = 2,           
               width = 0.7) + 
  labs(title = "Distribution of Median Prices by City",
       x = "City",
       y = "Median Price") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),  
        legend.position = "none")

median_prices_boxplot

Bryan-College Station has a wider range of variation than the other cities, together with higher Q1, median and Q3 values. Beaumont and Tyler have similar ranges of variation, although Tyler has a higher median. Wichita Falls has the smallest range of variation. Bryan-College Station, Beaumont and Wichita Falls show outliers, whereas Tyler shows none.
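
The quantities read from the box plot can also be computed numerically. The sketch below uses invented toy data and the 1.5 × IQR rule that geom_boxplot() applies by default to flag outliers; the column names in box_stats are illustrative.

```r
library(dplyr)

# Toy data standing in for realestate_texas (values are invented)
toy <- data.frame(
  city         = rep(c("A", "B"), each = 6),
  median_price = c(100, 110, 120, 130, 140, 300,
                   90, 95, 100, 105, 110, 115)
)

box_stats <- toy %>%
  group_by(city) %>%
  summarise(
    q1  = quantile(median_price, 0.25, names = FALSE),
    med = median(median_price),
    q3  = quantile(median_price, 0.75, names = FALSE),
    iqr = q3 - q1,
    # points beyond 1.5 * IQR from the quartiles are drawn as outliers
    n_outliers = sum(median_price < q1 - 1.5 * iqr |
                       median_price > q3 + 1.5 * iqr)
  )

box_stats
```

On the real data, replacing toy with realestate_texas would give one row per city, which makes it easier to compare quartiles and outlier counts than reading them off the chart.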

Sys.setlocale("LC_TIME", "en_US.UTF-8")

## [1] "en_US.UTF-8"

# Average Sales Volume / Number of Sales by Month and City dodged barplots
dodged_barplot_1 <- function(variable, y_label, title) {
  monthly_data <- realestate_texas %>%
    group_by(city, month) %>%
    summarise(mean_value = mean(!!sym(variable), na.rm = TRUE), .groups = "drop") %>%
    mutate(month = factor(month, levels = 1:12, labels = month.name))  
  
  ggplot(data = monthly_data) +
    geom_bar(aes(x = month, y = mean_value), stat = "identity", position = "dodge", fill = "gray") +  
    facet_wrap(~ city, scales = "free_y") + 
    labs(title = title,
         x = "Month", y = y_label) +
    scale_x_discrete(labels = month.name) +  
    theme_classic() +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size = 8), 
      panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
      panel.grid.minor = element_blank(),
      legend.position = "none"
    )
}

dodged_barplot_volume_1 <- dodged_barplot_1("volume", "Volume", "Average Sales Volume by Month and City")
dodged_barplot_sales_1 <- dodged_barplot_1("sales", "Sales", "Average Number of Sales by Month and City")

dodged_barplot_volume_1

dodged_barplot_sales_1

These graphics consist of 4 bar charts, one for each city in the dataset, showing sales volume/number of sales by month (each bar is the average of that month's values over the 5 years in the dataset):
- in Beaumont, sales volume/number of sales tend to be higher and more concentrated in the spring and summer periods;
- in Bryan-College Station, the situation is very similar, but the concentration of sales volume/number of sales in the spring and summer periods is even greater than in Beaumont;
- in Tyler, the situation is similar to the previous ones, with a significant increase in sales volume/number of sales in June;
- in Wichita Falls, the usual increase in sales volume/number of sales appears around the spring/summer periods but, unlike in the other cities, the trend does not appear to be increasing: it is stable over time, with a conspicuous concentration in the spring period.
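
To make the monthly averaging explicit, here is a small sketch on invented toy data. The helper monthly_mean is hypothetical; it uses the .data[[variable]] pronoun, which is an alternative to the !!sym(variable) form used in the functions above.

```r
library(dplyr)

# Toy data: one city, two years, three months (values are invented)
toy <- data.frame(
  city  = "A",
  year  = rep(2010:2011, each = 3),
  month = rep(1:3, times = 2),
  sales = c(10, 20, 30, 30, 40, 50)
)

# Average a column, chosen by name, per city and calendar month
monthly_mean <- function(data, variable) {
  data %>%
    group_by(city, month) %>%
    summarise(mean_value = mean(.data[[variable]], na.rm = TRUE),
              .groups = "drop")
}

res <- monthly_mean(toy, "sales")
res
```

Each row of the result is the mean of the same calendar month across the available years, which is the value each bar represents.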

dodged_barplot_2 <- function(variable, y_label, title) {
  monthly_data <- realestate_texas %>%
    group_by(city, month) %>%
    summarise(mean_value = mean(!!sym(variable), na.rm = TRUE), .groups = "drop") %>%
    mutate(month = factor(month, levels = 1:12, labels = month.name))

  ggplot(data = monthly_data) +
    geom_bar(aes(x = month, y = mean_value, fill = city), stat = "identity", position = "dodge") +  
    labs(title = title,
         x = "Month", y = y_label) +
    scale_x_discrete(labels = month.name) +  
    scale_fill_brewer(palette = "Set1") +   
    theme_classic() +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size = 8), 
      panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
      panel.grid.minor = element_blank(),
      legend.position = "right"  
    )
}

dodged_barplot_volume_2 <- dodged_barplot_2("volume", "Volume", "Average Sales Volume by Month and City")
dodged_barplot_sales_2 <- dodged_barplot_2("sales", "Sales", "Average Number of Sales by Month and City")

dodged_barplot_volume_2

dodged_barplot_sales_2

Unlike before, here we can compare the distributions of sales volume/number of sales across cities and how they evolve over time (each month is calculated as in the previous dodged bar plots): Tyler and Bryan-College Station have the highest and most strongly increasing sales volume/number of sales over time, with the same concentration of higher values during the spring/summer periods.

# Average Sales Volume / Number of Sales by Month and City time series
time_series_plot_1 <- function(variable, y_label, title) {
  monthly_data <- realestate_texas %>%
    group_by(city, month) %>%
    summarise(mean_value = mean(!!sym(variable), na.rm = TRUE), .groups = "drop") %>%
    mutate(month = factor(month, levels = 1:12, labels = month.name))  
  
  ggplot(data = monthly_data) +
    geom_line(aes(x = month, y = mean_value, group = city),
              linewidth = 1, alpha = 0.8, linetype = "solid", color = "black") +
    facet_wrap(~city, scales = "free_y") + 
    labs(
      title = title,
      x = "Month", y = y_label,
      color = "City"
    ) +
    scale_x_discrete(labels = month.name) + 
    theme_classic() +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size = 8), 
      panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
      panel.grid.minor = element_blank(),
      legend.position = "none"
    )
}

time_series_plot_sales_1 <- time_series_plot_1("sales", "Sales", "Average Number of Sales Trend by City and Month")
time_series_plot_volume_1 <- time_series_plot_1("volume", "Volume", "Average Sales Volume Trend by City and Month")

time_series_plot_sales_1

time_series_plot_volume_1

Again, the data are broken down by the 4 cities to assess the trend in sales volume/number of sales over time but, in this case, line charts are used (same calculation for every month):
- in Beaumont, we see an increasing trend over time for sales volume/number of sales, with the usual concentration in the spring/summer periods; there is also a very conspicuous increase in the spring period, which is then consolidated in the following months;
- in Bryan-College Station, the peaks between the spring/summer and fall/winter periods are very pronounced for sales volume/number of sales; the trend appears stable, despite these conspicuous rises and falls in the number of sales;
- in Tyler, the trend and concentration of values are very similar to those found in Beaumont;
- in Wichita Falls, we again find a high concentration in the spring/summer periods.

time_series_plot_2 <- function(variable, y_label, title) {
  monthly_data <- realestate_texas %>%
    group_by(city, month) %>%
    summarise(mean_value = mean(!!sym(variable), na.rm = TRUE), .groups = "drop") %>%
    mutate(month = factor(month, levels = 1:12, labels = month.name)) 
  
  ggplot(data = monthly_data) +
    geom_line(aes(x = month, y = mean_value, color = city, group = city),
              linewidth = 1, alpha = 0.8, linetype = "solid") +
    labs(
      title = title,
      x = "Month", y = y_label,
      color = "City"
    ) +
    scale_x_discrete(labels = month.name) + 
    theme_classic() +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size = 8), 
      panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
      panel.grid.minor = element_blank(),
      legend.position = "top"
    )
}

time_series_plot_sales_2 <- time_series_plot_2("sales", "Sales", "Average Number of Sales Trend by City and Month")
time_series_plot_volume_2 <- time_series_plot_2("volume", "Volume", "Average Sales Volume Trend by City and Month")

time_series_plot_sales_2

time_series_plot_volume_2

Here, we compare the sales volume/number of sales of the different cities over time (same calculation for months): Bryan-College Station and Tyler show very similar trends and concentrations of values; in particular, they share an ever-increasing trend over time in the number of sales. Note, however, that Tyler shows a more “jagged” pattern around the spring/summer periods than Bryan-College Station.
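
The charts above average each calendar month across years, so the x-axis runs from January to December rather than through the whole observation period. If a chronological view were also wanted, one option (an assumption, not part of the original analysis) is to build a date column with lubridate::make_date(). The toy data and the name toy_dated are illustrative.

```r
library(dplyr)
library(lubridate)

# Toy data spanning a year boundary (values are invented)
toy <- data.frame(
  city  = "A",
  year  = c(2011, 2010, 2011, 2010),
  month = c(1, 11, 2, 12),
  sales = c(80, 100, 95, 90)
)

# Combine year and month into a Date (first day of the month)
# and sort chronologically within each city
toy_dated <- toy %>%
  mutate(date = make_date(year, month, 1)) %>%
  arrange(city, date)

toy_dated
```

With such a column, a plot using aes(x = date, y = sales, color = city) and geom_line() would show the evolution across years instead of the within-year seasonal profile.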

# Total Sales Volume / Number of Sales by Month and City stacked barplots
stacked_barplot_1 <- function(variable, y_label, title) {
  
  monthly_data <- realestate_texas %>%
    group_by(city, month) %>%
    summarise(sum_value = sum(!!sym(variable), na.rm = TRUE), .groups = "drop") %>%
    mutate(month = factor(month, levels = 1:12, labels = month.name)) 
  
  ggplot(data = monthly_data) +
    geom_bar(aes(x = month, y = sum_value, fill = city), stat = "identity", position = "stack") +  
    labs(title = title,
         x = "Month", y = y_label) +
    scale_x_discrete(labels = month.name) + 
    theme_classic() +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size = 8), 
      panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
      panel.grid.minor = element_blank(),
      legend.position = "top"
    )
}

stacked_barplot_volume_1 <- stacked_barplot_1("volume", "Total Volume", "Total Sales Volume by Month and City")
stacked_barplot_sales_1 <- stacked_barplot_1("sales", "Total Sales", "Total Number of Sales by Month and City")

stacked_barplot_volume_1

stacked_barplot_sales_1

Here, on the other hand, we have stacked bar charts showing sales volume/number of sales over time. Each segment is the sum of one city's values for a given month across the 5 years in the dataset, so each bar represents the cumulative total of all cities for that month. It can be seen that Tyler and Bryan-College Station have the largest sales volume/number of sales compared to the other cities, and both show increasing trends over time, with a consistently strong concentration of increases during the spring/summer periods.

stacked_barplot_2 <- function(variable, y_label, title) {
  
  monthly_data <- realestate_texas %>%
    group_by(city, month) %>%
    summarise(sum_value = sum(!!sym(variable), na.rm = TRUE), .groups = "drop") %>%
    mutate(month = factor(month, levels = 1:12, labels = month.name)) 
  
  ggplot(data = monthly_data) +
    geom_bar(aes(x = month, y = sum_value, fill = city), stat = "identity", position = "fill") +  
    labs(title = title,
         x = "Month", y = y_label) +
    scale_x_discrete(labels = month.name) + 
    theme_classic() +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size = 8), 
      panel.grid.major = element_line(color = "gray90", linetype = "dashed"),
      panel.grid.minor = element_blank(),
      legend.position = "top"
    )
}

stacked_barplot_volume_2 <- stacked_barplot_2("volume", "Normalized Total Sales Volume", "Normalized Total Sales Volume by Month and City")
stacked_barplot_sales_2 <- stacked_barplot_2("sales", "Normalized Total Sales", "Normalized Total Number of Sales by Month and City")

stacked_barplot_volume_2

stacked_barplot_sales_2

Here, the stacked bar charts are normalized: each segment still represents the sum of sales volume/number of sales for a city in a given month (same calculation as before), but every bar now sums to 1 (100%), because a relative evaluation is made rather than an absolute one. The same considerations made for the previous stacked bar charts apply.
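
The normalization performed by position = "fill" is equivalent to dividing each city's monthly sum by the total of all cities for that month. A minimal sketch on invented toy data (the column name share is illustrative):

```r
library(dplyr)

# Toy data: two cities, two months (values are invented)
toy <- data.frame(
  city  = rep(c("A", "B"), times = 2),
  month = rep(1:2, each = 2),
  sales = c(30, 10, 60, 60)
)

shares <- toy %>%
  group_by(city, month) %>%
  summarise(sum_value = sum(sales), .groups = "drop") %>%
  group_by(month) %>%
  # share of each city within the month; shares sum to 1 per month
  mutate(share = sum_value / sum(sum_value)) %>%
  ungroup()

shares
```

Printing the shares alongside the chart makes it possible to quote each city's relative weight per month as a number.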

# Average Sales Efficiency by Month and City time series
time_series_plot_sales_efficiency_1 <- time_series_plot_1("sales_efficiency", "Sales Efficiency", "Average Sales Efficiency Trend by City and Month")

time_series_plot_sales_efficiency_1

Here, a chart with 4 line charts has been created, each showing the trend over time of the new variable ‘sales_efficiency’ (sales/listings) for one of the 4 cities (same monthly calculation as in the previous time series):
- In Wichita Falls, it can be seen that active listings turned into actual sales more consistently during the spring period. The highest concentrations of efficiency are typically observed during the spring/summer periods;
- In Beaumont, there is a continuously increasing trend, with the same spring/summer concentration observed in Wichita Falls;
- In Bryan-College Station, it starts with relatively low sales_efficiency values, then reaches significant peaks in the summer period;
- In Tyler, the trend is very similar to the one observed in Beaumont.

time_series_plot_sales_efficiency_2 <- time_series_plot_2("sales_efficiency", "Sales Efficiency", "Average Sales Efficiency Trend by City and Month")

time_series_plot_sales_efficiency_2

Here, the 4 line charts showing the trend over time of the new variable ‘sales_efficiency’ (sales/listings) are combined into a single chart, with each line representing a city (same monthly calculation as in the previous time series). The same observations made earlier apply, but it can be added that Bryan-College Station shows higher, steadily increasing, and more consistent sales_efficiency values than the other cities, reaching a very pronounced peak in July.
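
Because the plotted value is the mean of the monthly ratios, it can differ from the ratio of total sales to total listings for the same month. The sketch below, on invented toy data, shows both so the distinction is clear; ratio_of_totals is an alternative shown only for comparison, not what the charts above use.

```r
library(dplyr)

# Toy data: the same calendar month observed in two years (values are invented)
toy <- data.frame(
  city     = "A",
  year     = c(2010, 2011),
  month    = 1,
  sales    = c(10, 90),
  listings = c(100, 300)
)

efficiency <- toy %>%
  group_by(city, month) %>%
  summarise(
    mean_of_ratios  = mean(sales / listings),       # what the charts average
    ratio_of_totals = sum(sales) / sum(listings),   # pooled alternative
    .groups = "drop"
  )

efficiency
```

The two measures coincide only when listings are the same in every year; otherwise the pooled ratio gives more weight to the years with more listings.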

volume_boxplots <- ggplot(data = realestate_texas) +
  geom_boxplot(aes(x = city, y = volume), 
               fill = "gray80",          
               color = "black",          
               outlier.colour = "red",   
               outlier.size = 2,         
               width = 0.7) + 
  labs(title = "Distribution of Total Sales Volume by City for Each Year",
       x = "City",
       y = "Volume") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 4),  
    legend.position = "none"  
  ) +
  facet_wrap(~ year, scales = "free_y")

volume_boxplots

Several box plots have been created, one panel per year; within each year, the medians, Q1, Q3, and ranges of sales volume are compared across cities. Generally, in each year, Tyler and Bryan-College Station show fairly similar distributions. However, in 2012 and 2013, although Bryan-College Station has a lower median than Tyler, its Q3 is slightly lower (2012) and slightly higher (2013) than that of Tyler. It is worth noting that Beaumont shows several outliers in 2012.

CONCLUSIONS

Based on the charts above, Bryan-College Station and Tyler have the highest values, especially for sales volume and number of sales. Nevertheless, all cities show a very high concentration of values during the spring/summer periods, and this pattern is repeated every year: real estate sales tend, as a rule, to go up during the warmer seasons, which also encourage more visits to properties for sale, for example through real estate agencies. Thus, based on the insights from the analyses done so far, in order to increase the effectiveness of real estate listings, one could increase the number of listings published in the “hottest” periods (spring/summer) and diversify the channels, including digital channels, where the listings are published, in order to attract more customers. These conclusions are particularly relevant for cities like Beaumont and Wichita Falls, which have significant room for improvement, especially in terms of sales volume and number of sales.

Statistica_Descrittiva_Project

Lorenzo Canobbio

2025-02-19