Texas Real Estate

Project Description
The company Texas Realty Insights wants to analyze real estate market trends in the state of Texas, using historical data on property sales. The goal is to provide statistical and visual insights to support strategic decisions for sales and optimization of property listings.
Project Objectives
- Identify and interpret historical trends in real estate sales in Texas.
- Evaluate the effectiveness of marketing strategies for property listings.
- Provide graphical representations of the data to highlight the distribution of prices and sales across cities, months, and years.
Added Value
The proposed statistical analysis will allow Texas Realty Insights to optimize its market strategies by identifying cities with growth opportunities and assessing the effectiveness of property listings over time. With a clear and structured view of the data, the company will be able to make decisions based on concrete information, improving the management of real estate sales in Texas.
- Dataset: “Real Estate Texas.csv”
The dataset contains the following variables:
- city: reference city
- year: reference year
- month: reference month
- sales: total number of sales
- volume: total sales value (in millions of dollars)
- median_price: median sale price (in dollars)
- listings: total number of active listings
- months_inventory: the amount of time needed to sell all current listings, expressed in months.

library(ineq)
library(e1071)
library(ggplot2)
library(dplyr)
options(scipen = 999)


df <- read.csv("C:\\Users\\utente\\Desktop\\Texas Real Estate\\realestate_texas.csv")

1) ANALYSIS OF VARIABLES.

Request: Identify and describe the type of statistical variables present in the dataset. Evaluate how to handle variables that involve a time dimension and comment on the type of analysis that can be conducted for each variable.

City.

Type: Qualitative, nominal scale.
Description: String identifying a city.

Year.

Type: Qualitative, sorted with a natural order, interval scale.
Description: Number identifying a year (the range is from 2010 to 2014).

Month.

Type: Qualitative, sorted with a natural order, interval scale.
Description: Number from 1 to 12, where 1 is January, 2 is February,…,12 is December.

Sales.

Type: Quantitative, discrete, ratio scale.
Description: integer representing the number of sales.

Volume.

Type: Quantitative, continuous, ratio scale.
Description: Total sales value in millions.

Median price.

Type: Quantitative, discrete, ratio scale.
Description: Median price of sales in dollars.

Listings.

Type: Quantitative, discrete, ratio scale.
Description: Integer representing the number of active listings.

Months_inventory.

Type: Quantitative, continuous, ratio scale.
Description: Number of months needed to sell the listings.

How to handle variables involving time dimension.

Year and Month will be used as qualitative variables for the frequency analysis. Later, I will consider adding a new variable, date, built from Year and Month, which can be used as a quantitative variable.
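
Should such a date variable be needed, one way to build it in base R is sketched below. The toy data frame `toy` and the column name `date` are illustrative assumptions, and the day is arbitrarily fixed to the first of the month.

```r
# Toy data standing in for the 'year' and 'month' columns of the dataset
toy <- data.frame(year = c(2010, 2010, 2014), month = c(1, 12, 6))

# Build a Date from year and month; the day is arbitrarily set to 1
toy$date <- as.Date(sprintf("%d-%02d-01",
                            as.integer(toy$year),
                            as.integer(toy$month)))

# Dates are ordered and support arithmetic, so they can serve as a time axis
toy[order(toy$date), ]
```

A column of class `Date` keeps the natural ordering of the 60 months and can be used directly on the x axis of a line chart.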

Analysis for the qualitative variables (City, Year, Month).

Frequency-based, the goal is to understand the model and the trends.

Analysis for the quantitative variables (the others).

Based on measures of location, variability, and shape, to obtain concise descriptions of the values across the entire dataset.

2) MEASURES (POSITION, VARIABILITY, AND SHAPE) AND FREQUENCIES

Request: Compute measures of central tendency, variability, and shape for all variables where it makes sense. For the others, create a frequency distribution. Finally, provide a brief commentary.

Computation of Frequency.

I compute frequencies for the qualitative variables (“city”, “year”, and “month”).

#install.packages("e1071") 
city_freq<- as.data.frame(table(df[["city"]]))
colnames(city_freq) <- c("City","Frequence")
year_freq<- as.data.frame(table(df[["year"]]))
colnames(year_freq) <- c("Year","Frequence")
month_freq<- as.data.frame(table(df[["month"]]))
colnames(month_freq) <- c("Month","Frequence")
print(city_freq)

##                    City Frequence
## 1              Beaumont        60
## 2 Bryan-College Station        60
## 3                 Tyler        60
## 4         Wichita Falls        60

print(year_freq)

##   Year Frequence
## 1 2010        48
## 2 2011        48
## 3 2012        48
## 4 2013        48
## 5 2014        48

print(month_freq)

##    Month Frequence
## 1      1        20
## 2      2        20
## 3      3        20
## 4      4        20
## 5      5        20
## 6      6        20
## 7      7        20
## 8      8        20
## 9      9        20
## 10    10        20
## 11    11        20
## 12    12        20

Computation of Measures

These measures are computed for quantitative variables (“sales”, “volume”, “median_price”, “listings”, and “months_inventory”)

Measures of Location.

We consider the following measures of location: the mean, the median, the minimum, the maximum, and the quartiles.
The mode is ignored for the moment, as the data is too granular (we will consider it later, when we group variables into classes).

loc_measures <- matrix(nrow = 0, ncol = 7)
for (z in c("sales", "volume", "median_price", "listings", "months_inventory")){
  x <- df[[z]]
  stats <- c(mean(x),median(x),quantile(x))
  loc_measures <- rbind(loc_measures,stats)
}
loc_measures <- round(loc_measures, 2)
colnames(loc_measures) <- c("mean","median", "min", "Q1", "Q2","Q3", "max")
rownames(loc_measures) <- c("sales", "volume", "median_price", "listings", "months_inventory")
loc_measures_norm<- as.data.frame(loc_measures)
knitr::kable(loc_measures_norm)

	mean	median	min	Q1	Q2	Q3	max
sales	192.29	175.50	79.00	127.00	175.50	247.00	423.00
volume	31.01	27.06	8.17	17.66	27.06	40.89	83.55
median_price	132665.42	134500.00	73800.00	117300.00	134500.00	150050.00	180000.00
listings	1738.02	1618.50	743.00	1026.50	1618.50	2056.00	3296.00
months_inventory	9.19	8.95	3.40	7.80	8.95	10.95	14.90

Measures of Variability.

We consider the following measures of variability: range, variance, standard deviation, coefficient of variation, interquartile range, and Gini index.

var_measures <- matrix(nrow = 0, ncol = 6)
for (z in c("sales", "volume", "median_price", "listings", "months_inventory")){
  x <- df[[z]]
  IQR_value <- IQR(x)
  stats <- c(max(x) - min(x),var(x),sd(x),sd(x) / mean(x),IQR_value,ineq(x, type = "Gini"))
  var_measures <- rbind(var_measures,stats)
}
var_measures <- round(var_measures, 2)
colnames(var_measures) <- c("range","variance", "standard deviation", "coefficient of variation", "interquartile range", "gini index")
rownames(var_measures) <- c("sales", "volume", "median_price", "listings", "months_inventory")
var_measures<- as.data.frame(var_measures)
knitr::kable(var_measures)

	range	variance	standard deviation	coefficient of variation	interquartile range	gini index
sales	344.00	6344.30	79.65	0.41	120.00	0.23
volume	75.38	277.27	16.65	0.54	23.23	0.30
median_price	106200.00	513572983.09	22662.15	0.17	32750.00	0.10
listings	2553.00	566568.97	752.71	0.43	1029.50	0.24
months_inventory	11.50	5.31	2.30	0.25	3.15	0.14

# Second table: alternative coefficients of variation (CV)
var_measures <- matrix(nrow = 0, ncol = 4)
for (z in c("sales", "volume", "median_price", "listings", "months_inventory")){
  x <- df[[z]]
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  # Note: IQR_value is not reassigned in this loop, so it still holds the value
  # from the last iteration of the previous loop (the IQR of months_inventory)
  lower <- Q1 - 1.5 * IQR_value
  upper <- Q3 + 1.5 * IQR_value
  x_no_outliers <- x[x >= lower & x <= upper]
  x_log <- log(x)
  # The log-scale CV is left as NA for median_price
  cv_log <- if (z == "median_price") NA else sd(x_log) / mean(x_log)
  # Note: the last entry divides IQR_value by the mean of x
  stats <- c(sd(x) / mean(x), sd(x_no_outliers) / mean(x_no_outliers), cv_log, IQR_value / mean(x))
  var_measures <- rbind(var_measures, stats)
}
var_measures <- round(var_measures, 2)
colnames(var_measures) <- c("coefficient of variation (CV)", "CV no outliers", "CV logarithmic", "IQR/median")
rownames(var_measures) <- c("sales", "volume", "median_price", "listings", "months_inventory")
var_measures<- as.data.frame(var_measures)
knitr::kable(var_measures)

	coefficient of variation (CV)	CV no outliers	CV logarithmic	IQR/median
sales	0.41	0.20	0.08	0.02
volume	0.54	0.34	0.17	0.10
median_price	0.17	0.07	NA	0.00
listings	0.43	0.14	0.06	0.00
months_inventory	0.25	0.25	0.12	0.34
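
In the loop above, `IQR_value` is carried over from the previous loop rather than recomputed, and the last column divides by the mean. A per-variable version of the same indices, with the IQR recomputed for each vector and divided by the median, could be sketched as follows. The helper name `robust_dispersion` and the toy vector are illustrative assumptions, not part of the original analysis, and applying the helper to the dataset would give values that differ from the table above.

```r
# Hypothetical helper: relative dispersion indices computed on a single vector
robust_dispersion <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]                    # IQR of x itself
  lower <- q[1] - 1.5 * iqr             # Tukey fences
  upper <- q[2] + 1.5 * iqr
  x_in <- x[x >= lower & x <= upper]    # values inside the fences
  c(cv = sd(x) / mean(x),
    cv_no_outliers = sd(x_in) / mean(x_in),
    iqr_over_median = iqr / median(x))
}

# Toy vector with one extreme value (not taken from the dataset)
toy <- c(10, 12, 13, 14, 15, 16, 18, 60)
res <- robust_dispersion(toy)
round(res, 2)
```

On the dataset, the helper could be applied column by column, for example with `sapply(df[c("sales", "volume")], robust_dispersion)`.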

Measures of Shape.

We consider the following measures of shape: skewness and kurtosis.

#install.packages("e1071") 
shape_measures <- matrix(nrow = 0, ncol = 2)
for (z in c("sales", "volume", "median_price", "listings", "months_inventory")){
  x <- df[[z]]
  stats <- c(skewness(x),kurtosis(x))
  shape_measures <- rbind(shape_measures,stats)
}
shape_measures <- round(shape_measures, 2)
colnames(shape_measures) <- c("skewness","kurtosis")
rownames(shape_measures) <- c("sales", "volume", "median_price", "listings", "months_inventory")
shape_measures <- as.data.frame(shape_measures)
knitr::kable(shape_measures)

	skewness	kurtosis
sales	0.71	-0.34
volume	0.88	0.15
median_price	-0.36	-0.64
listings	0.65	-0.81
months_inventory	0.04	-0.20
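
To read the table, it helps to recall that `kurtosis()` in e1071 returns the excess kurtosis, so values near 0 indicate tails comparable to those of a normal distribution. The moment-based definitions can be sketched in base R as below. The helper names and toy vectors are illustrative assumptions, and the e1071 defaults (`type = 3`) apply a slightly different small-sample scaling, so the numbers may not match exactly.

```r
# Moment-based skewness: third central moment over variance^(3/2)
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3
moment_excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

sym <- c(1, 2, 3, 4, 5)         # symmetric toy vector
right <- c(1, 1, 2, 2, 3, 10)   # toy vector with a long right tail

moment_skewness(sym)            # zero for a symmetric vector
moment_skewness(right)          # positive for a right tail
moment_excess_kurtosis(sym)     # negative: lighter tails than a normal
```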

Brief Comments

Comments about Frequencies: Noticing how year and month are regular (as well as city), I decided not to add a derived variable “date”.
Comments about measures of location: I did not compute the mode, as the quantitative variables were too granular.
Comments about measures of variability: The variables have different scales, so I mostly decided to focus on the coefficient of variation and the Gini index. In particular, I decided to compute different types of coefficients of variation, as the variables seem to have different symmetries and tail weights (see below).
Comments about measures of shape: The measures of shape give us interesting indications about how the variables are distributed.

3) Identification of the variables with the greatest variability and asymmetry

Request: Determine: which variable has the highest variability; which variable has the most skewed distribution. Explain how you reached these conclusions and provide statistical considerations.

Volume is the variable with the greatest variability and skewness, having the highest coefficient of variation and the highest skewness (equal to 0.54 and 0.88, respectively).

Statistical considerations.

Observe that the high skewness of volume is not independent of its high variability. Indeed, the variability of volume is affected by its asymmetry and could be overestimated because of it. This is the main reason why I also computed other kinds of coefficients of variation (CV). The standard CV tells us that “volume” is the variable with the highest variability and that “months_inventory” has low variability relative to the other variables. On the other hand, “months_inventory” is almost perfectly symmetric and its tails are low, so the standard CV puts the other variables at a disadvantage with respect to it. The other types of CV, which I now briefly introduce, tell us something more about the variables.

CV with no outliers: Here, I simply ignored the tails of the variables, in order to compute a variability measure focused on the center of the data, with the same idea as the standard CV. Volume is still the variable with the highest variability, but not as sharply as with the standard CV. As expected, “median_price” and “listings” (both having high tails) have a lower CV in this setting. In particular, “median_price” is clearly the most stable variable, as it has the lowest CV in both cases.
CV in logarithmic scale: Transforming the data with the log function reduces asymmetries. The result tells us something similar to “CV with no outliers”, and we use this number mostly to validate the previous one. Two notes:
1) On the logarithmic scale, the shape of the distribution changes. Such a change is usually not drastic, and since we use this index mostly to validate the previous one, we can ignore this detail for the moment.
2) The skewness of “median_price” is negative (its left tail is long), so the logarithmic transformation is not informative for this variable. On the other hand, the other two indices clearly indicate that the variable has low variability, so we do not need to investigate it further.
IQR/median: This index takes the central 50% of the data into consideration, like “CV with no outliers”, and uses the median instead of the mean, so that the effect of skewness is reduced. It is called “robust CV”, as it is less affected by small changes in the data and by the tails. This measure tells us what we expected, that is, most of the variability of “sales”, “volume”, “median_price”, and “listings” is caused by the tails.

In general, volume is the variable with the highest variability overall, but if we consider only the central data, the variable with the highest variability is “months_inventory”. Concerning the Gini index, the variables seem to be fairly evenly distributed, as the index ranges from a minimum of 0.1 to a maximum of 0.3.

4) Creation of classes for a quantitative variable.

Request: Select a quantitative variable (e.g., sales or median_price) and divide it into classes. Create a frequency distribution and represent the data with a bar chart. Compute the Gini heterogeneity index and discuss the results.

The classification is made by splitting the range of sales into 10 classes, each one having the same interval length.

x <- df[["sales"]]
size=0.7
classes <- table(cut(x, breaks = 10, include.lowest = TRUE))
barplot(classes, col = "lightgreen", las = 1, cex.names = size,
        main = "Barplots of 'sales', 10 classes",
        xlab = "classes", ylab = "frequency",
        names.arg = 1:10, width = 2,
        cex.axis=size,
        cex.main=size,
        cex.lab=size)

legend("topright", legend = paste0(1:10, ": ", names(classes)), bty = "n",xpd = TRUE,cex=0.7)

#
print(paste("Gini index of sales subdivided in 10 classes: ",round(ineq(classes, type = "Gini"),2)))

## [1] "Gini index of sales subdivided in 10 classes:  0.35"

classes <- table(cut(x, breaks = 100, include.lowest = TRUE))
print(paste("Gini index of sales subdivided in 100 classes: ",round(ineq(classes, type = "Gini"),2)))

## [1] "Gini index of sales subdivided in 100 classes:  0.48"

classes <- table(cut(x, breaks = 4, include.lowest = TRUE))
print(paste("Gini index of sales subdivided in 4 classes: ",round(ineq(classes, type = "Gini"),2)))

## [1] "Gini index of sales subdivided in 4 classes:  0.33"

Comments of the results

First of all, I point out that the Gini index computed here is different from the one computed in Section 2: the one in Section 2 is based on the values of the variable, while this one is computed on the class frequencies. The distribution of sales is not uniform across the classes: some classes contain many observations, while others contain few, but the disparity is not extreme.
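
The request mentions the Gini heterogeneity index, which for a frequency distribution is usually defined as G = 1 - sum(f_i^2) on the relative class frequencies f_i, and normalized by (k - 1)/k for k classes. `ineq(..., type = "Gini")` instead returns the Gini concentration coefficient of the values it receives, so the two are not the same quantity. A minimal sketch of the heterogeneity version follows; the helper name `gini_heterogeneity` and the toy counts are illustrative assumptions.

```r
# Hypothetical helper: normalized Gini heterogeneity index of class counts
gini_heterogeneity <- function(counts) {
  counts <- as.numeric(counts)
  f <- counts / sum(counts)    # relative frequencies
  k <- length(counts)          # number of classes
  g <- 1 - sum(f^2)            # Gini heterogeneity
  g / ((k - 1) / k)            # rescaled to the interval [0, 1]
}

g_uniform <- gini_heterogeneity(c(6, 6, 6, 6))     # maximum heterogeneity
g_single  <- gini_heterogeneity(c(24, 0, 0, 0))    # minimum heterogeneity
g_mixed   <- gini_heterogeneity(c(12, 8, 3, 1))    # in between
c(g_uniform, g_single, g_mixed)
```

On the dataset, it could be applied to the same classes used above, for example `gini_heterogeneity(table(cut(df$sales, breaks = 10)))`. Values close to 1 indicate observations spread evenly across the classes.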

5) Probability Calculation.

Request: What is the probability that, when taking a random row from this dataset, it refers to the city Beaumont? And the probability that it refers to the month of July? And the probability that it refers to December 2012?

Looking at the frequency tables in Section 2, every city (there are 4 cities) has one row for every pair <year,month>, where the number of years is 5 and the number of months is 12. (The number of rows is 4 × 5 × 12 = 240.) Hence:

The probability that a row has the city “Beaumont” is 1/4.
The probability that a row has the month “July” is 1/12.
The probability that a row has month equal to December and year equal to 2012 is 4/240=1/60 (once for every city).
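
These probabilities are relative frequencies, so they can be checked by taking the mean of a logical condition. The sketch below uses a synthetic grid with the same structure as the dataset (one row per city, year, and month) rather than the real file; the object names are illustrative.

```r
# Synthetic grid: one row per combination of city, year, and month
grid <- expand.grid(
  city = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
  year = 2010:2014,
  month = 1:12,
  stringsAsFactors = FALSE
)

# The mean of a logical vector is the share of rows satisfying the condition
p_beaumont <- mean(grid$city == "Beaumont")
p_july     <- mean(grid$month == 7)
p_dec_2012 <- mean(grid$month == 12 & grid$year == 2012)
```

The same expressions can be evaluated on `df` directly, for example `mean(df$city == "Beaumont")`.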

6) Creation of new variables.

Request: Create a new column that calculates the average price of properties using the available variables. Try to create a column that measures the effectiveness of sales listings. Comment and discuss the results.

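# Mean sale price in dollars (volume is in millions), rounded to the nearest hundred
df$mean_price <- round(df$volume *10000/ df$sales)*100
# Estimated hours to clear one listing (30-day months), rounded to the nearest quarter hour
df$est_hours <- round(df$months_inventory / df$listings * 24 * 30 *4)/4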
cols <- names(df)
cols <- cols[cols != "mean_price"]
pos <- which(cols == "volume")
cols <- append(cols, "mean_price", after = pos)
df <- df[, cols]

df_summary <- df %>%
  group_by(city) %>%
  summarise(
    mean_price_avg = round(mean(mean_price, na.rm = TRUE), 2),
    est_hours_avg  = round(mean(est_hours, na.rm = TRUE), 2),
    mean_price_cv  = round(sd(mean_price, na.rm = TRUE) / mean(mean_price, na.rm = TRUE), 2),
    est_hours_cv   = round(sd(est_hours, na.rm = TRUE) / mean(est_hours, na.rm = TRUE), 2)
  )
print(as.data.frame(df_summary))

##                    city mean_price_avg est_hours_avg mean_price_cv est_hours_cv
## 1              Beaumont       146638.3          4.25          0.08         0.13
## 2 Bryan-College Station       183533.3          3.68          0.08         0.17
## 3                 Tyler       167683.3          2.79          0.07         0.11
## 4         Wichita Falls       119435.0          6.19          0.10         0.06

The column with the mean price of a house in a particular month of a particular year is simply the sales volume (converted from millions of dollars to dollars) divided by the number of sales. I left the last two digits at 0 to be consistent with median_price. I moved this column right after “volume”.
Concerning efficiency, I decided to estimate the number of hours required to clear a listing (the column name is ‘est_hours’).
Note: In both cases I decided to give up some precision in exchange for clearer information. Also, I am not sure how much sense it would make to be more precise than this. For example, since “months_inventory” is rounded to the first decimal place, being more precise in terms of hours to clear a listing could even be misleading.
To comment on the new columns, I print their mean and coefficient of variation, aggregated by city. Concerning ‘est_hours’, ‘Wichita Falls’ is the city that seems to take the longest to clear a listing on average. Concerning the mean price, ‘Wichita Falls’ has the cheapest houses and ‘Bryan-College Station’ the most expensive ones. Concerning the CV, the two variables seem to have low variability, with no significant difference between cities.
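
As a worked example of the two formulas, the sketch below applies them to a single toy record. The values are illustrative and not taken from the dataset, and the conversion assumes 30-day months, as in the code above.

```r
# Toy monthly record (illustrative values, not taken from the dataset)
volume <- 15.5           # total sales value, in millions of dollars
sales <- 90              # number of sales
months_inventory <- 9    # months needed to sell all current listings
listings <- 1500         # active listings

# volume * 1e6 / sales, rounded to the nearest hundred dollars
mean_price <- round(volume * 10000 / sales) * 100

# months per listing converted to hours, rounded to the nearest quarter hour
est_hours <- round(months_inventory / listings * 24 * 30 * 4) / 4

c(mean_price = mean_price, est_hours = est_hours)
```

Lower values of `est_hours` mean that, at the current pace, listings are cleared faster.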

7) Conditional analysis

Request: Use the dplyr package or base R to perform conditional statistical analyses by city, year, and month. Generate summaries (mean, standard deviation) and represent the results graphically.

make_group_summary <- function(df, group_col) {
  f <- as.formula(paste(". ~", group_col))
  cols <- c(group_col, "sales","volume","mean_price","median_price",
            "listings","months_inventory","est_hours")
  df_mean <- aggregate(f,
                    data = df[, cols],
                    FUN = mean, na.rm = TRUE)
  names(df_mean)[-1] <- paste0(names(df_mean)[-1], ".am")
  df_sd <- aggregate(f,
                     data = df[, cols],
                     FUN = sd, na.rm = TRUE)
  names(df_sd)[-1] <- paste0(names(df_sd)[-1], ".sd")
  df_mean <- as.data.frame(lapply(df_mean, function(x) {
    if (is.numeric(x)) round(x, 2) else x
  }))
  
  df_sd <- as.data.frame(lapply(df_sd, function(x) {
    if (is.numeric(x)) round(x, 2) else x
  }))
  return(list(mean = df_mean, sd = df_sd))
}


results<-make_group_summary(df,"city")
df_ccm<-results$mean
df_ccsd<-results$sd
results<-make_group_summary(df,"year")
df_ycm<-results$mean
df_ycsd<-results$sd
results<-make_group_summary(df,"month")
df_mcm<-results$mean
df_mcsd<-results$sd

I decided to visualize the data using bar charts, with a separate chart for each aggregation.

For each chart, I show both the original values and a version shifted vertically by subtracting the minimum value. The shifted version makes the relative differences between groups easier to see, while the original version gives a more general idea of how much the values really differ. Plotting everything twice could be seen as a waste, but I think the two views together give a better picture. (Showing only one of them can give incomplete or even misleading information.)

I could have chosen the better of the two visualizations for each variable, but in the end I took the easiest option and plotted everything, as both visualizations are informative in almost all cases here.

plot_conditionate <- function(df, columnname, title, color, names,zoom) {
  par(mar = c(4, 4, 4, 2))
  size=0.7
  par(bg = color[2])
  col <- df[[columnname]]
  if(zoom==1 || zoom==3){
    bp <- barplot(col,
                names.arg = names, 
                col = color[1],
                cex.axis = size,
                cex.names=size)
    mtext(title[1], side = 3, line = 3, adj = 0, cex = size,font=2)
    mtext(title[2], side = 3, line = 2, adj = 0, cex = size,font=2)
    mtext(title[3], side = 3, line = 1, adj = 0, cex = size,font=2)
  }
  if(zoom==2 || zoom==3){
    col_shift<-col-min(col)
    bp <- barplot(col_shift,
                names.arg = names, 
                col = color[1],
                cex.axis = size,
                cex.names=size,
                yaxt="n")
    axis(2, at = col_shift, labels = col,cex.axis = size)
    mtext(paste(title[1],"(ZOOM)"), side = 3, line = 3, adj = 0, cex = size,font=2)
    mtext(title[2], side = 3, line = 2, adj = 0, cex = size,font=2)
    mtext(toupper(title[3]), side = 3, line = 1, adj = 0, cex = size,font=2)
  }
  
  
  
  #
  
  #text(x = bp, y = 0, labels = names, cex = 0.6, col = "black", srt = 90)
}

for(s in names(df_ccm)[-1]){
  title=c("Value: MEAN","x-> CITY",paste0("y-> ",toupper(substring(s, 1, nchar(s) - 3))))
  plot_conditionate(df_ccm,s,title,c("orange","lightyellow"),gsub(" ","\n",gsub("[aeiouAEIOU]", "", df_ccm$city)),3)
}

for(s in names(df_ccsd)[-1]){
  title=c("Value: STANDARD DEVIATION","x-> CITY",paste0("y-> ",toupper(substring(s, 1, nchar(s) - 3))))
  plot_conditionate(df_ccsd,s,title,c("blue","lightyellow"),gsub(" ","\n",gsub("[aeiouAEIOU]", "", df_ccm$city)),3)
}

for(s in names(df_ycm)[-1]){
  title=c("Value: MEAN","x-> YEAR",paste0("y->",toupper(substring(s, 1, nchar(s) - 3))))
  plot_conditionate(df_ycm,s,title,c("orange","lightgrey"),df_ycm$year,3)
}

for(s in names(df_ycsd)[-1]){
  title=c("Value: STANDARD DEVIATION","x-> YEAR",paste0("y->",toupper(substring(s, 1, nchar(s) - 3))))
  plot_conditionate(df_ycsd,s,title,c("blue","lightgrey"),df_ycsd$year,3)
}

for(s in names(df_mcm)[-1]){
  title=c("Value: MEAN","x-> MONTH",paste0("y->",toupper(substring(s, 1, nchar(s) - 3))))
  plot_conditionate(df_mcm,s,title,c("orange","white"),df_mcm$month,3)
}

for(s in names(df_mcsd)[-1]){
  title=c("Value: STANDARD DEVIATION","x-> MONTH",paste0("y->",toupper(substring(s, 1, nchar(s) - 3))))
  plot_conditionate(df_mcsd,s,title,c("blue","white"),df_mcsd$month,3)
}

8) Visualizations with ggplot2

Request: Use ggplot2 to create customized graphs. Make sure to explore: boxplots to compare the distribution of median prices across cities. Bar charts to compare the total sales by month and city; line charts to compare sales trends over different historical periods.

Box plots

ggplot(df, aes(x = city, y = median_price)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Boxplot for median_price and city", x = "city", y = "median_price") +
  theme_minimal()

Bar charts

ggplot(df, aes(x = city, y = sales, fill = factor(month))) +
  geom_col() +
  labs(title = "Sales for City and Month",
       x = "City",
       y = "Sales") +
  theme_minimal()

Line charts

df_ys <- df %>%
  group_by(year) %>%
  summarise(total_sales = sum(sales, na.rm = TRUE))

ggplot(df_ys, aes(x = year, y = total_sales)) +
  geom_line(color = "steelblue", linewidth = 1) +   
  geom_point(color = "darkblue", size = 2) +  
  labs(title = "Total Sales per Year",
       x = "Year",
       y = "Total Sales") +
  theme_minimal()

df_yv <- df %>%
  group_by(year) %>%
  summarise(total_volume = sum(volume, na.rm = TRUE))

ggplot(df_yv, aes(x = year, y = total_volume)) +
  geom_line(color = "steelblue", linewidth = 1) +   
  geom_point(color = "darkblue", size = 2) +  
  labs(title = "Total Volume per Year",
       x = "Year",
       y = "Total Volume") +
  theme_minimal()

9) Conclusions

Request: Provide a synthesis of the results obtained, referring to the main trends observed and giving recommendations based on the analysis. This is not a programming project but a statistics project, so you are expected to provide comments and statistical insights at each step and for each result.

Summary of the considerations made so far for each section.

Section 1. There are 3 qualitative variables and 5 quantitative ones. The analyses are frequency-based for the former and based on statistical measures for the latter. The main decision here was to choose year and month as (ordered) qualitative variables, which is justified by the later considerations.
Section 2. Here we carried out the frequency analysis, showing the regularity of the data with respect to ‘city’, ‘year’, and ‘month’. In particular, <‘city’,‘year’,‘month’> is a primary key of this table. Concerning the statistical results, we chose not to compute the mode for the quantitative variables, as they are too granular, and to compute different kinds of coefficients of variation, as the variables have different skewness. We did not normalize the data to consider measures other than the coefficient of variation, as the analysis seemed convincing enough.
Section 3. Here we commented on the variability of the variables, considering the main part of the data and the tails both separately and together. The main observation is that ‘volume’ is the variable with the highest variability. On the other hand, ‘months_inventory’, which at first seemed to be the variable with the lowest variability, is actually not so regular if we look only at the main part of its values, ignoring the tails.
Section 4. Here we divided sales into 10 classes, and we see a tail on the right, as expected from its skewness of 0.71.
Section 5. Some probability computations on the regular qualitative variables.
Section 6. Here the main decision was how to measure the efficiency of the listings, and we did it by calculating how many hours on average are required to clear a listing for each row.
Section 7. Here we carried out a conditional analysis by the qualitative variables to support the understanding of the data, showing the data in both absolute and relative terms. Many interesting observations can be made. For example, consider ‘Wichita Falls’: it is clearly the least attractive city. The reason is that selling there requires much more time, as the last bar chart (or the last two, if you count the relative comparison) highlights. At the same time, together with ‘Tyler’, it is a city where the time to clear a listing is more regular, which is interesting. Concerning ‘year’, 2014 behaves differently. It seems that listings barely had time to accumulate before the houses were sold, so it was a successful year compared with the other four. At the same time, it was a year with more variability than the others, which is reasonable since there was a lot of activity, and it is probably just this scaling factor that made everything less stable. Finally, the bar charts by month tell us that summer and, to a lesser extent, the Christmas period are better times to sell a house with respect to prices. Timings are roughly the same for all months, although a slightly slower rate of sales can be observed in winter. In general, ‘month’ is regular, and an analysis that looks only at the relative differences between months can be misleading in this case, while it is quite useful for ‘city’ and ‘year’.
Section 8. Here we produced boxplots and bar charts, which we discuss in the rest of this section.

Operational notes
Some considerations:
- Use boxplots to compare the distribution of the median house price across different cities. Comment on the result.
- Use boxplots (or variants) to compare the distribution of the total sales value across different cities and across years. Any insights?
- Use a stacked bar chart to compare the total sales in different months, broken down by city. Comment on what emerges. While you’re at it, also try a normalized bar chart. Tip: Pay attention to the difference between geom_bar() and geom_col(). Pro level: Find a smart way to include the Year variable in the same block of code, without creating messy graphs.
- Try creating a line chart of a variable of your choice to make commented comparisons between cities and time periods.

Boxplot result.

We discuss the main characteristics of a boxplot and the differences between cities.

Median: ‘Bryan-College Station’ is the city with the highest median, followed by ‘Tyler’ and ‘Beaumont’. Much lower, ‘Wichita Falls’ appears to be the cheapest city in which to buy a house, judging by the median.

-> A last observation about symmetry with respect to the median: ‘Bryan-College Station’ and ‘Beaumont’ are asymmetric, with the part of the box above the median wider than the part below, hinting that there is more variability among the higher prices than among the lower ones. The other two cities are symmetric.

IQR: The IQR appears to be the same in all the cities, so the central dispersion is equivalent. In other words, the variability of the main part of the data is the same in all cities.
Range without outliers: It seems the same for all cities, with one exception. The lower whisker of the city with the lowest median, ‘Wichita Falls’, extends significantly further. Here, many observations have a very low price with respect to the main part of the data, compared with the same range for the other cities.
Outliers: Just three outliers. It is interesting to see that ‘Wichita Falls’, the cheapest place to buy a house, has exactly one outlier: a particular period when houses there were more expensive. This seems worth investigating more deeply.

-> A last consideration. A quick look at the data shows that this particular period of very high prices is June 2014, when buying a house there was as expensive as buying one in the more prestigious cities (without distinguishing by month, and that is the key, as I explain now). This is actually not surprising, given what we observed when commenting on the results of Section 7. Checking the conditional analysis by month, it is clear that June is the month in which prices are generally higher. This is probably because the four cities are popular holiday destinations, and people accept a higher price for a house at the beginning of the summer. At the same time, as observed before, 2014 was a successful year and houses were sold at higher prices. So, overall, it is no surprise to see a house sold at a high price in the summer of 2014.
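
Rather than reading the outliers off the plot, they can be located programmatically. The sketch below flags, within each city, the observations that `boxplot.stats()` reports as outliers under the 1.5 × IQR rule. The toy data and the helper `is_outlier` are illustrative assumptions, and ggplot2 computes its hinges slightly differently from `boxplot.stats()`, so borderline points may not coincide with the plot.

```r
# Toy data: city A has one unusually high median price, city B has none
toy <- data.frame(
  city = rep(c("A", "B"), each = 6),
  year = 2014,
  month = rep(1:6, times = 2),
  median_price = c(100, 102, 104, 106, 108, 160,
                   200, 202, 204, 206, 208, 210)
)

# TRUE for values that boxplot.stats() places beyond the whiskers
is_outlier <- function(x) x %in% boxplot.stats(x)$out

# Apply the rule separately within each city (ave() returns 0/1 here)
flag <- ave(toy$median_price, toy$city, FUN = is_outlier) == 1

# Rows identified as outliers, with their city, year, and month
toy[flag, ]
```

Applied to `df`, the same pattern returns the city, year, and month of each outlying median price.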

ggplot(df, aes(x = city, y = volume)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Boxplot for volume and city", x = "city", y = "volume") +
  theme_minimal()

- The above boxplot shows something similar to what we already showed in Section 7 for ‘city’ and ‘volume’. Looking only at the boxes, it resembles the bar chart of the means, while in terms of spread it looks more like the one of the standard deviations. I would summarize the results by saying that ‘Tyler’ and ‘Bryan-College Station’ are the most interesting cities in terms of sales success, with a slight preference for ‘Tyler’ in terms of volume and a clear preference for ‘Tyler’ in terms of consistency of sales. These considerations are confirmed by both the boxplots and the bar charts in Section 7.

ggplot(df, aes(x = factor(year), y = volume)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Boxplot for volume and year", x = "year", y = "volume") +
  theme_minimal()

Here we just confirm what we said when commenting on Section 7, that is, 2014 was a more successful year than the others. We add that sales seem to increase over time, which could also be seen from the bar charts of Section 7.

ggplot(df, aes(x = factor(month), y = volume)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "Boxplot for volume and month", x = "month", y = "volume") +
  theme_minimal()

Here we confirm what we already said when commenting on Section 7 and, earlier, when commenting on an outlier, but we add something interesting. October seems to be a month where the data is more spread out, while there are two outliers before Christmas and one in September. In general, I think Christmas is not a relevant period for sales in these cities, or at least there does not seem to be a significant difference. Maybe a slight one.

Stacked bar chart

Here we comment on the chart ‘Sales for City and Month’ shown in Section 8. It confirms what we just said (summer is the best period, while Christmas helps only marginally, if at all), with everything plotted together in a single chart. In the normalized version of the chart, shown below, it is easier to recognize this trend.

ggplot(df, aes(x = city, y = sales, fill = factor(month))) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Normalized Sales for City and Month",
       x = "City",
       y = "Proportion",
       fill = "Month") +
  theme_minimal()
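
To make explicit what the normalized chart displays: with `position = "fill"`, the bar of each city is rescaled so that its monthly segments sum to 1. A minimal base R sketch of the same shares follows, on toy values that are not taken from the dataset.

```r
# Toy sales by city and month (illustrative values, not from the dataset)
toy <- data.frame(
  city  = rep(c("A", "B"), each = 3),
  month = rep(1:3, times = 2),
  sales = c(10, 30, 60, 5, 5, 10)
)

# Total sales per city and month
totals <- xtabs(sales ~ city + month, data = toy)

# Share of each month within its city (each row sums to 1)
shares <- prop.table(totals, margin = 1)
round(shares, 2)
```

Because each row sums to 1, cities with very different sales totals become directly comparable in terms of seasonality.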

When there is too much data, you can use an animation, several well-grouped charts, or aggregated data. I use the latter two options. There are 12 months, but since we now also consider the variable year, we can simply focus on seasons. Then, we have 4 cities, 4 seasons, and 5 years. I decided to produce 5 different charts, one per year.

df_season <- df %>%
  mutate(
    season = case_when(
      month %in% c(12, 1, 2)  ~ "Winter",
      month %in% c(3, 4, 5)   ~ "Spring",
      month %in% c(6, 7, 8)   ~ "Summer",
      month %in% c(9, 10, 11) ~ "Fall"
    ),
    city = gsub("[aeiouAEIOU]", "", city)  # remove vowels
  ) %>%
  group_by(city, year, season) %>%
  summarise(avg_sales = mean(sales, na.rm = TRUE), .groups = "drop")

ggplot(df_season, aes(x = city, y = avg_sales, fill = season)) +
  geom_col(position = "dodge") +
  facet_wrap(~ year) +
  labs(title = "Average Seasonal Sales by City and Year",
       x = "City",
       y = "Average Seasonal Sales",
       fill = "Season") +
  theme_minimal() +
  coord_flip()

Line chart of a variable

I decided to consider the variable ‘listings’, which measures something we have analyzed less so far, that is, how many houses in each city are put up for sale. This is interesting information in itself (and of course correlated with the other variables).

df_avg <- df %>%
  group_by(city, year) %>%
  summarise(avg_listings = mean(listings, na.rm = TRUE), .groups = "drop")

ggplot(df_avg, aes(x = year, y = avg_listings, color = city, group = city)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(title = "Average Listings by City over Years",
       x = "Year",
       y = "Average Listings",
       color = "City") +
  theme_minimal()

Final consideration
According to the analysis above, the main suggestion is to invest in all four cities, as sales seem to improve over time. In particular, I would suggest investing in ‘Wichita Falls’, where (as observed before) the currently low prices may rise considerably over the years. It is true that selling a house there still takes more time, but the trend is decreasing, and the outlier in June 2014, together with the many other considerations made above, indicates that this direction should be investigated further.