Project Description
The company Texas Realty Insights wants to analyze real estate
market trends in the state of Texas, using historical data on property
sales. The goal is to provide statistical and visual insights to support
strategic decisions for sales and optimization of property listings.
Project Objectives
- Identify and interpret historical trends in real estate sales in Texas.
- Evaluate the effectiveness of marketing strategies for property listings.
- Provide graphical representations of the data to highlight the distribution of prices and sales across cities, months, and years.
Added Value
The proposed statistical analysis will
allow Texas Realty Insights to optimize its market strategies by
identifying cities with growth opportunities and assessing the
effectiveness of property listings over time. With a clear and
structured view of the data, the company will be able to make decisions
based on concrete information, improving the management of real estate
sales in Texas.
- Dataset: “Real Estate Texas.csv”
The dataset contains the following variables:
- city: reference city
- year: reference year
- month: reference month
- sales: total number of sales
- volume: total sales value (in millions of dollars)
- median_price: median sale price (in dollars)
- listings: total number of active listings
- months_inventory: the amount of time needed to sell all current listings, expressed in months.
library(ineq)
library(e1071)
library(ggplot2)
library(dplyr)
options(scipen = 999)
df <- read.csv("C:\\Users\\utente\\Desktop\\Texas Real Estate\\realestate_texas.csv")
Request: Identify and describe the type of statistical variables present in the dataset. Evaluate how to handle variables that involve a time dimension and comment on the type of analysis that can be conducted for each variable.
City.
Type: Qualitative, nominal scale.
Description: String identifying a city.
Year.
Type: Qualitative, sorted with a natural order, interval scale.
Description: Number identifying a year (the range is from 2010 to 2014).
Month.
Type: Qualitative, sorted with a natural order, interval scale.
Description: Number from 1 to 12, where 1 is January, 2 is February,…,12 is December.
Sales.
Type: Quantitative, discrete, ratio scale.
Description: integer representing the number of sales.
Volume.
Type: Quantitative, continuous, ratio scale.
Description: Total sales value in millions.
Median price.
Type: Quantitative, discrete, ratio scale.
Description: Median price of sales in dollars.
Listings.
Type: Quantitative, discrete, ratio scale.
Description: Integer representing the number of active listings.
Months_inventory.
Type: Quantitative, continuous, ratio scale.
Description: Number of months needed to sell the listings.
Year and Month will be treated as qualitative variables for the frequency analysis. I will consider adding later a new variable, date, built from Year and Month, which can be used as a quantitative variable.
Two types of analysis can be conducted:
Frequency-based analysis (city, year, month): the goal is to understand the structure of the data and its trends.
Analysis based on location, variability, and shape indices (sales, volume, median_price, listings, months_inventory): the goal is to obtain concise descriptions of the values across the entire dataset.
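The date variable mentioned above could be built along the following lines. This is only a minimal sketch: the helper name `make_date` and the toy data frame are assumptions introduced for illustration, and the first day of each month is an arbitrary convention.

```r
# Hypothetical helper: build a Date (first day of the month) from year and month.
make_date <- function(year, month) {
  as.Date(sprintf("%04d-%02d-01", as.integer(year), as.integer(month)))
}

# Toy example (illustrative values only, not taken from the dataset).
toy <- data.frame(year = c(2010, 2010, 2011), month = c(1, 12, 6))
toy$date <- make_date(toy$year, toy$month)
print(toy)
```

A Date column of this kind sorts chronologically and can be used directly on the x axis of a time plot.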
Request: Compute measures of central tendency, variability, and shape for all variables where it makes sense. For the others, create a frequency distribution. Finally, provide a brief commentary.
#install.packages("e1071")
city_freq<- as.data.frame(table(df[["city"]]))
colnames(city_freq) <- c("City","Frequence")
year_freq<- as.data.frame(table(df[["year"]]))
colnames(year_freq) <- c("Year","Frequence")
month_freq<- as.data.frame(table(df[["month"]]))
colnames(month_freq) <- c("Month","Frequence")
print(city_freq)
## City Frequence
## 1 Beaumont 60
## 2 Bryan-College Station 60
## 3 Tyler 60
## 4 Wichita Falls 60
print(year_freq)
## Year Frequence
## 1 2010 48
## 2 2011 48
## 3 2012 48
## 4 2013 48
## 5 2014 48
print(month_freq)
## Month Frequence
## 1 1 20
## 2 2 20
## 3 3 20
## 4 4 20
## 5 5 20
## 6 6 20
## 7 7 20
## 8 8 20
## 9 9 20
## 10 10 20
## 11 11 20
## 12 12 20
loc_measures <- matrix(nrow = 0, ncol = 7)
for (z in c("sales", "volume", "median_price", "listings", "months_inventory")){
x <- df[[z]]
stats <- c(mean(x),median(x),quantile(x))
loc_measures <- rbind(loc_measures,stats)
}
loc_measures <- round(loc_measures, 2)
colnames(loc_measures) <- c("mean","median", "min", "Q1", "Q2","Q3", "max")
rownames(loc_measures) <- c("sales", "volume", "median_price", "listings", "months_inventory")
loc_measures_norm<- as.data.frame(loc_measures)
knitr::kable(loc_measures_norm)
| | mean | median | min | Q1 | Q2 | Q3 | max |
|---|---|---|---|---|---|---|---|
| sales | 192.29 | 175.50 | 79.00 | 127.00 | 175.50 | 247.00 | 423.00 |
| volume | 31.01 | 27.06 | 8.17 | 17.66 | 27.06 | 40.89 | 83.55 |
| median_price | 132665.42 | 134500.00 | 73800.00 | 117300.00 | 134500.00 | 150050.00 | 180000.00 |
| listings | 1738.02 | 1618.50 | 743.00 | 1026.50 | 1618.50 | 2056.00 | 3296.00 |
| months_inventory | 9.19 | 8.95 | 3.40 | 7.80 | 8.95 | 10.95 | 14.90 |
var_measures <- matrix(nrow = 0, ncol = 6)
for (z in c("sales", "volume", "median_price", "listings", "months_inventory")){
x <- df[[z]]
IQR_value <- IQR(x)
stats <- c(max(x) - min(x),var(x),sd(x),sd(x) / mean(x),IQR_value,ineq(x, type = "Gini"))
var_measures <- rbind(var_measures,stats)
}
var_measures <- round(var_measures, 2)
colnames(var_measures) <- c("range","variance", "standard deviation", "coefficient of variation", "interquartile range", "gini index")
rownames(var_measures) <- c("sales", "volume", "median_price", "listings", "months_inventory")
var_measures<- as.data.frame(var_measures)
knitr::kable(var_measures)
| | range | variance | standard deviation | coefficient of variation | interquartile range | gini index |
|---|---|---|---|---|---|---|
| sales | 344.00 | 6344.30 | 79.65 | 0.41 | 120.00 | 0.23 |
| volume | 75.38 | 277.27 | 16.65 | 0.54 | 23.23 | 0.30 |
| median_price | 106200.00 | 513572983.09 | 22662.15 | 0.17 | 32750.00 | 0.10 |
| listings | 2553.00 | 566568.97 | 752.71 | 0.43 | 1029.50 | 0.24 |
| months_inventory | 11.50 | 5.31 | 2.30 | 0.25 | 3.15 | 0.14 |
var_measures <- matrix(nrow = 0, ncol = 4)
for (z in c("sales", "volume", "median_price", "listings", "months_inventory")){
x <- df[[z]]
Q1 <- quantile(x, 0.25)
Q3 <- quantile(x, 0.75)
lower <- Q1 - 1.5 * IQR_value
upper <- Q3 + 1.5 * IQR_value
x_no_outliers <- x[x >= lower & x <= upper]
x_log<-log(x)
if (z=="median_price"){
stats <- c(sd(x) / mean(x),sd(x_no_outliers) / mean(x_no_outliers),NA,IQR_value/mean(x))
var_measures <- rbind(var_measures,stats)
}
else{
stats <- c(sd(x) / mean(x),sd(x_no_outliers) / mean(x_no_outliers),sd(x_log)/mean(x_log),IQR_value/mean(x))
var_measures <- rbind(var_measures,stats)
}
}
var_measures <- round(var_measures, 2)
colnames(var_measures) <- c("coefficient of variation (CV)", "CV no outliers", "CV logarithmic", "IQR/median")
rownames(var_measures) <- c("sales", "volume", "median_price", "listings", "months_inventory")
var_measures<- as.data.frame(var_measures)
knitr::kable(var_measures)
| | coefficient of variation (CV) | CV no outliers | CV logarithmic | IQR/median |
|---|---|---|---|---|
| sales | 0.41 | 0.20 | 0.08 | 0.02 |
| volume | 0.54 | 0.34 | 0.17 | 0.10 |
| median_price | 0.17 | 0.07 | NA | 0.00 |
| listings | 0.43 | 0.14 | 0.06 | 0.00 |
| months_inventory | 0.25 | 0.25 | 0.12 | 0.34 |
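One detail of the loop above is worth flagging: `IQR_value` is never reassigned inside it, so it keeps the value left by the previous loop (3.15, the IQR of months_inventory), and the last column divides by the mean rather than the median. The table is consistent with this, for example 3.15 / 192.29 ≈ 0.02 for sales. A minimal sketch of a per-variable version follows. The function name `cv_variants` and the toy vector are assumptions made for illustration, and the logarithmic CV is returned for any strictly positive vector, which differs from the choice made above for median_price.

```r
# Sketch: CV variants with quartiles and IQR recomputed for the given vector.
cv_variants <- function(x) {
  q1 <- unname(quantile(x, 0.25))
  q3 <- unname(quantile(x, 0.75))
  iqr <- q3 - q1                       # IQR of this vector, not a leftover global
  lower <- q1 - 1.5 * iqr              # lower fence
  upper <- q3 + 1.5 * iqr              # upper fence
  x_in <- x[x >= lower & x <= upper]   # values inside the fences
  c(cv = sd(x) / mean(x),
    cv_no_outliers = sd(x_in) / mean(x_in),
    cv_log = if (all(x > 0)) sd(log(x)) / mean(log(x)) else NA_real_,
    iqr_over_median = iqr / median(x))
}

# Toy vector with one extreme value (illustrative only).
toy <- c(10, 12, 11, 13, 12, 11, 40)
res <- cv_variants(toy)
print(round(res, 2))
```

Applied to each column, for example with `sapply(df[c("sales", "volume")], cv_variants)`, this would give one column of indices per variable.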
#install.packages("e1071")
shape_measures <- matrix(nrow = 0, ncol = 2)
for (z in c("sales", "volume", "median_price", "listings", "months_inventory")){
x <- df[[z]]
stats <- c(skewness(x),kurtosis(x))
shape_measures <- rbind(shape_measures,stats)
}
shape_measures <- round(shape_measures, 2)
colnames(shape_measures) <- c("skewness","kurtosis")
rownames(shape_measures) <- c("sales", "volume", "median_price", "listings", "months_inventory")
shape_measures <- as.data.frame(shape_measures)
knitr::kable(shape_measures)
| | skewness | kurtosis |
|---|---|---|
| sales | 0.71 | -0.34 |
| volume | 0.88 | 0.15 |
| median_price | -0.36 | -0.64 |
| listings | 0.65 | -0.81 |
| months_inventory | 0.04 | -0.20 |
Comments about frequencies: Since year and month (as well as city) are perfectly balanced, I decided not to add a derived variable “date”.
Comments about measures of location: I did not compute the mode, as the quantitative variables are too granular.
Comments about measures of variability: The variables have different scales, so I mostly focused on the coefficient of variation and the Gini index. In particular, I computed several variants of the coefficient of variation, as the variables appear to differ in symmetry and in the weight of their tails (see below).
Comments about measures of shape: “volume” and “sales” are the most right-skewed variables (0.88 and 0.71), “median_price” is slightly left-skewed (-0.36), and “months_inventory” is almost symmetric (0.04). The kurtosis values (excess kurtosis, as returned by e1071) are all negative or close to zero, so no variable has heavier tails than a normal distribution, apart from a marginal 0.15 for “volume”.
Request: Determine: which variable has the highest variability; which variable has the most skewed distribution. Explain how you reached these conclusions and provide statistical considerations.
Volume is the variable with the greatest variability and the greatest skewness: it has the highest coefficient of variation (0.54) and the highest skewness (0.88).
Note that the high skewness of volume is not independent of its high variability: the variability of volume is affected by its asymmetry and could be overestimated because of it. This is the main reason why I also computed other kinds of coefficient of variation (CV). The standard CV tells us that “volume” has the highest variability and that “months_inventory” has low variability relative to the other variables. On the other hand, “months_inventory” is almost perfectly symmetric and has light tails, so the standard CV puts the other variables at a disadvantage with respect to it. The other types of CV, briefly introduced below, tell us something more about the variables.
CV with no outliers: Here I simply ignored the tails of the variables, in order to measure variability around the center of the data with the same idea as the standard CV. Volume is still the variable with the highest variability, but not as sharply as with the standard CV. As expected, “median_price” and “listings” (both with pronounced tails) have a lower CV in this setting. In particular, “median_price” is clearly the most stable variable, as it has the lowest CV in both cases.
CV in logarithmic scale: Rescaling the data with the log function reduces asymmetries. The result tells us something similar to “CV with no outliers”, and we use this number mostly to validate the previous one. Two notes:
1) On the logarithmic scale, the shape of the distribution changes. The change is usually not drastic, and since we use this index mostly to validate the previous one, we can ignore this detail for the moment.
2) The skewness of “median_price” is negative (its left tail is large), so the logarithmic transformation is not informative for this variable. On the other hand, the other two indices clearly indicate that the variable has low variability, so we do not need to investigate it further.
IQR/median: Like “CV with no outliers”, this index focuses on the central part of the data (the middle 50%), and it uses the median instead of the mean, so that the effect of skewness is reduced. It is called a “robust CV”, as it is less affected by small changes in the data and by the tails. This measure tells us what we expected: most of the variability of “sales”, “volume”, “median_price”, and “listings” is caused by the tails.
Overall, volume is the variable with the highest variability, but if we consider only the central data, the variable with the highest variability is “months_inventory”. Concerning the Gini index, the values of each variable appear to be fairly evenly distributed, as the index ranges from a minimum of 0.10 to a maximum of 0.30.
Request: Select a quantitative variable (e.g., sales or median_price) and divide it into classes. Create a frequency distribution and represent the data with a bar chart. Compute the Gini heterogeneity index and discuss the results.
The classification is made by splitting the range of sales into 10 classes, each one having the same interval length.
x <- df[["sales"]]
size=0.7
classes <- table(cut(x, breaks = 10, include.lowest = TRUE))
barplot(classes, col = "lightgreen", las = 1, cex.names = size,
main = "Barplots of 'sales', 10 classes",
xlab = "classes", ylab = "frequency",
names.arg = 1:10, width = 2,
cex.axis=size,
cex.main=size,
cex.lab=size)
legend("topright", legend = paste0(1:10, ": ", names(classes)), bty = "n",xpd = TRUE,cex=0.7)
#
print(paste("Gini index of sales subdivided in 10 classes: ",round(ineq(classes, type = "Gini"),2)))
## [1] "Gini index of sales subdivided in 10 classes: 0.35"
classes <- table(cut(x, breaks = 100, include.lowest = TRUE))
print(paste("Gini index of sales subdivided in 100 classes: ",round(ineq(classes, type = "Gini"),2)))
## [1] "Gini index of sales subdivided in 100 classes: 0.48"
classes <- table(cut(x, breaks = 4, include.lowest = TRUE))
print(paste("Gini index of sales subdivided in 4 classes: ",round(ineq(classes, type = "Gini"),2)))
## [1] "Gini index of sales subdivided in 4 classes: 0.33"
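The values printed above come from `ineq(..., type = "Gini")`, which computes the Gini concentration coefficient of the class counts. The Gini heterogeneity index named in the request is a different quantity, usually defined on the relative frequencies f_i of the k classes as G = 1 - Σ f_i², with maximum (k - 1)/k. A minimal sketch follows; the function name `gini_heterogeneity` and the toy counts are assumptions made for illustration.

```r
# Sketch: Gini heterogeneity index from a vector (or table) of class counts.
gini_heterogeneity <- function(counts) {
  counts <- as.numeric(counts)
  f <- counts / sum(counts)          # relative frequencies
  k <- length(counts)                # number of classes
  g <- 1 - sum(f^2)                  # raw index, between 0 and (k - 1) / k
  c(raw = g, normalized = g / ((k - 1) / k))
}

# Toy counts (illustrative only).
equal_classes <- gini_heterogeneity(c(25, 25, 25, 25))  # maximum heterogeneity
single_class  <- gini_heterogeneity(c(100, 0, 0, 0))    # no heterogeneity
print(equal_classes)
print(single_class)
```

The function also accepts a table, so it could be called as `gini_heterogeneity(classes)` on the class counts built with `cut()` above. A normalized value close to 1 would mean that the observations are spread almost evenly over the classes.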
Request: What is the probability that, when taking a random row from this dataset, it refers to the city Beaumont? And the probability that it refers to the month of July? And the probability that it refers to December 2012?
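Looking at the frequency tables in Section 2, every city (there are 4 cities) has one row for every pair <year,month>, where the number of years is 5 and the number of months is 12. (The number of rows is 4 × 12 × 5 = 240.) Hence:
P(city = Beaumont) = 60/240 = 1/4 = 0.25
P(month = July) = 20/240 = 1/12 ≈ 0.083
P(month = December and year = 2012) = 4/240 = 1/60 ≈ 0.017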
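These probabilities can also be obtained as relative frequencies. The sketch below rebuilds the balanced city-year-month grid described by the frequency tables instead of reading the CSV. The object name `grid` is an assumption made for illustration; on the real data the same expressions would be applied to `df`.

```r
# Rebuild the balanced design: one row per city, year, and month.
grid <- expand.grid(
  city = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
  year = 2010:2014,
  month = 1:12,
  stringsAsFactors = FALSE
)
n <- nrow(grid)  # 4 * 5 * 12 = 240

# The mean of a logical vector is the relative frequency of TRUE.
p_beaumont <- mean(grid$city == "Beaumont")                # 1/4
p_july     <- mean(grid$month == 7)                        # 1/12
p_dec_2012 <- mean(grid$month == 12 & grid$year == 2012)   # 1/60
print(round(c(Beaumont = p_beaumont, July = p_july, Dec2012 = p_dec_2012), 4))
```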
Request: Create a new column that calculates the average price of properties using the available variables. Try to create a column that measures the effectiveness of sales listings. Comment and discuss the results.
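# Mean price per sale: volume is in millions of dollars, result rounded to the nearest 100 dollars
df$mean_price <- round(df$volume *10000/ df$sales)*100
# Estimated hours per listing, rounded to the nearest quarter of an hour
# (full column name used instead of relying on partial matching of df$months)
df$est_hours <- round(df$months_inventory / df$listings * 24 * 30 *4)/4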
cols <- names(df)
cols <- cols[cols != "mean_price"]
pos <- which(cols == "volume")
cols <- append(cols, "mean_price", after = pos)
df <- df[, cols]
df_summary <- df %>%
group_by(city) %>%
summarise(
mean_price_avg = round(mean(mean_price, na.rm = TRUE), 2),
est_hours_avg = round(mean(est_hours, na.rm = TRUE), 2),
mean_price_cv = round(sd(mean_price, na.rm = TRUE) / mean(mean_price, na.rm = TRUE), 2),
est_hours_cv = round(sd(est_hours, na.rm = TRUE) / mean(est_hours, na.rm = TRUE), 2)
)
print(as.data.frame(df_summary))
## city mean_price_avg est_hours_avg mean_price_cv est_hours_cv
## 1 Beaumont 146638.3 4.25 0.08 0.13
## 2 Bryan-College Station 183533.3 3.68 0.08 0.17
## 3 Tyler 167683.3 2.79 0.07 0.11
## 4 Wichita Falls 119435.0 6.19 0.10 0.06
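To make the units of the two derived columns explicit, the same arithmetic can be isolated in two small functions. This is a sketch only: the function names and the toy inputs are assumptions made for illustration, and the 30-day month is the same approximation used in the code above.

```r
# Average price per sale in dollars, rounded to the nearest 100.
# volume is expressed in millions of dollars: volume * 1e6 / sales.
mean_price_from_volume <- function(volume_millions, sales) {
  round(volume_millions * 10000 / sales) * 100
}

# months_inventory / listings is "months per listing"; multiplying by
# 30 days and 24 hours converts it to hours, rounded to a quarter of an hour.
hours_per_listing <- function(months_inventory, listings) {
  round(months_inventory / listings * 24 * 30 * 4) / 4
}

# Toy inputs (illustrative only, not taken from the dataset).
toy_price <- mean_price_from_volume(30, 200)   # 30 million over 200 sales
toy_hours <- hours_per_listing(9, 1800)        # 9 months, 1800 listings
print(c(price = toy_price, hours = toy_hours))
```

A lower value of the second index means that, at the current pace, each listing "costs" less selling time, which is why Tyler appears as the most effective city in the summary above and Wichita Falls as the least.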
Request: Use the dplyr package or base R to perform conditional statistical analyses by city, year, and month. Generate summaries (mean, standard deviation) and represent the results graphically.
make_group_summary <- function(df, group_col) {
f <- as.formula(paste(". ~", group_col))
cols <- c(group_col, "sales","volume","mean_price","median_price",
"listings","months_inventory","est_hours")
df_mean <- aggregate(f,
data = df[, cols],
FUN = mean, na.rm = TRUE)
names(df_mean)[-1] <- paste0(names(df_mean)[-1], ".am")
df_sd <- aggregate(f,
data = df[, cols],
FUN = sd, na.rm = TRUE)
names(df_sd)[-1] <- paste0(names(df_sd)[-1], ".sd")
df_mean <- as.data.frame(lapply(df_mean, function(x) {
if (is.numeric(x)) round(x, 2) else x
}))
df_sd <- as.data.frame(lapply(df_sd, function(x) {
if (is.numeric(x)) round(x, 2) else x
}))
return(list(mean = df_mean, sd = df_sd))
}
results<-make_group_summary(df,"city")
df_ccm<-results$mean
df_ccsd<-results$sd
results<-make_group_summary(df,"year")
df_ycm<-results$mean
df_ycsd<-results$sd
results<-make_group_summary(df,"month")
df_mcm<-results$mean
df_mcsd<-results$sd
I decided to visualize the data with bar charts, using a separate chart for each aggregation.
For each chart, I show both the original values and a vertically shifted version in which the minimum value is subtracted. The shifted version makes the relative differences between groups easier to see, while the original version gives a more general idea of how much the values really differ. Plotting everything twice could be seen as a waste, but I think it gives a better picture: showing only one of the two versions may be less informative or even misleading.
I could have chosen the better of the two versions for each variable, but in the end I took the simplest option and plotted everything, as both versions are informative in almost all cases here.
plot_conditionate <- function(df, columnname, title, color, names,zoom) {
par(mar = c(4, 4, 4, 2))
size=0.7
par(bg = color[2])
col <- df[[columnname]]
if(zoom==1 || zoom==3){
bp <- barplot(col,
names.arg = names,
col = color[1],
cex.axis = size,
cex.names=size)
mtext(title[1], side = 3, line = 3, adj = 0, cex = size,font=2)
mtext(title[2], side = 3, line = 2, adj = 0, cex = size,font=2)
mtext(title[3], side = 3, line = 1, adj = 0, cex = size,font=2)
}
if(zoom==2 || zoom==3){
col_shift<-col-min(col)
bp <- barplot(col_shift,
names.arg = names,
col = color[1],
cex.axis = size,
cex.names=size,
yaxt="n")
axis(2, at = col_shift, labels = col,cex.axis = size)
mtext(paste(title[1],"(ZOOM)"), side = 3, line = 3, adj = 0, cex = size,font=2)
mtext(title[2], side = 3, line = 2, adj = 0, cex = size,font=2)
mtext(toupper(title[3]), side = 3, line = 1, adj = 0, cex = size,font=2)
}
#
#text(x = bp, y = 0, labels = names, cex = 0.6, col = "black", srt = 90)
}
for(s in names(df_ccm)[-1]){
title=c("Value: MEAN","x-> CITY",paste0("y-> ",toupper(substring(s, 1, nchar(s) - 3))))
plot_conditionate(df_ccm,s,title,c("orange","lightyellow"),gsub(" ","\n",gsub("[aeiouAEIOU]", "", df_ccm$city)),3)
}
for(s in names(df_ccsd)[-1]){
title=c("Value: STANDARD DEVIATION","x-> CITY",paste0("y-> ",toupper(substring(s, 1, nchar(s) - 3))))
plot_conditionate(df_ccsd,s,title,c("blue","lightyellow"),gsub(" ","\n",gsub("[aeiouAEIOU]", "", df_ccm$city)),3)
}
for(s in names(df_ycm)[-1]){
title=c("Value: MEAN","x-> YEAR",paste0("y->",toupper(substring(s, 1, nchar(s) - 3))))
plot_conditionate(df_ycm,s,title,c("orange","lightgrey"),df_ycm$year,3)
}
for(s in names(df_ycsd)[-1]){
title=c("Value: STANDARD DEVIATION","x-> YEAR",paste0("y->",toupper(substring(s, 1, nchar(s) - 3))))
plot_conditionate(df_ycsd,s,title,c("blue","lightgrey"),df_ycsd$year,3)
}
for(s in names(df_mcm)[-1]){
title=c("Value: MEAN","x-> MONTH",paste0("y->",toupper(substring(s, 1, nchar(s) - 3))))
plot_conditionate(df_mcm,s,title,c("orange","white"),df_mcm$month,3)
}
for(s in names(df_mcsd)[-1]){
title=c("Value: STANDARD DEVIATION","x-> MONTH",paste0("y->",toupper(substring(s, 1, nchar(s) - 3))))
plot_conditionate(df_mcsd,s,title,c("blue","white"),df_mcsd$month,3)
}
Request: Use ggplot2 to create customized graphs. Make sure to explore: boxplots to compare the distribution of median prices across cities. Bar charts to compare the total sales by month and city; line charts to compare sales trends over different historical periods.
ggplot(df, aes(x = city, y = median_price)) +
geom_boxplot(fill = "lightblue") +
labs(title = "Boxplot for median_price and city", x = "city", y = "median_price") +
theme_minimal()
ggplot(df, aes(x = city, y = sales, fill = factor(month))) +
geom_bar(stat = "identity") +
labs(title = "Sales for City and Month",
x = "City",
y = "Sales") +
theme_minimal()
df_ys <- df %>%
group_by(year) %>%
summarise(total_sales = sum(sales, na.rm = TRUE))
ggplot(df_ys, aes(x = year, y = total_sales)) +
geom_line(color = "steelblue", linewidth = 1) +
geom_point(color = "darkblue", size = 2) +
labs(title = "Total Sales per Year",
x = "Year",
y = "Total Sales") +
theme_minimal()
df_yv <- df %>%
group_by(year) %>%
summarise(total_volume = sum(volume, na.rm = TRUE))
ggplot(df_yv, aes(x = year, y = total_volume)) +
geom_line(color = "steelblue", linewidth = 1) +
geom_point(color = "darkblue", size = 2) +
labs(title = "Total Volume per Year",
x = "Year",
y = "Total Volume") +
theme_minimal()
Request: Provide a synthesis of the results obtained, referring to the main trends observed and giving recommendations based on the analysis. This is not a programming project but a statistics project, so you are expected to provide comments and statistical insights at each step and for each result.
Operational notes
Some considerations:
- Use boxplots to compare the distribution of the median house price across different cities. Comment on the result.
- Use boxplots (or variants) to compare the distribution of the total sales value across different cities and across years. Any insights?
- Use a stacked bar chart to compare the total sales in different months, broken down by city. Comment on what emerges. While you’re at it, also try a normalized bar chart. Tip: Pay attention to the difference between geom_bar() and geom_col(). Pro level: Find a smart way to include the Year variable in the same block of code, without creating messy graphs.
- Try creating a line chart of a variable of your choice to make commented comparisons between cities and time periods.
We discuss the main characteristics of a boxplot and the differences between cities.
-> A last observation about symmetry with respect to the median: ‘Bryan-College Station’ and ‘Beaumont’ are asymmetric, with the part of the distribution above the median more spread out than the part below, which hints at more variability among the higher prices than among the lower ones. The other two cities are symmetric.
IQR: The IQR is about the same in all the cities, so the central dispersion is equivalent. In other words, the variability of the main part of the data is the same in all cities.
Range without outliers: It appears to be the same for all cities, with one exception. The lower whisker of ‘Wichita Falls’, the city with the lowest median, is significantly longer: compared with the other cities, many of its monthly median prices are very low relative to the main part of the data.
Outliers: There are just three outliers. Interestingly, ‘Wichita Falls’, the cheapest place to buy a house, has exactly one outlier: a particular period in which houses there were more expensive. This could be worth investigating more deeply.
-> A last consideration. A quick look at the data shows that this particularly expensive period is June 2014, when buying a house here was as expensive as buying one in the more prestigious cities (without distinguishing by month, and that is the key point, as I explain now). This is actually not surprising, given what we observed when commenting on the results of Section 7. The conditional analysis by month makes it clear that June is the month in which prices are generally higher. This is probably because the four cities are popular holiday destinations, and people accept a higher price for a house at the beginning of the summer. At the same time, as observed before, 2014 was a successful year and houses were sold at higher prices. So, overall, it is no surprise to see a house sold at a high price in the summer of 2014.
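The quantities discussed above (quartiles, IQR, and outliers per city) can also be read off numerically instead of visually. The sketch below uses the same 1.5 × IQR rule that boxplots apply. The function name `city_box_summary` and the toy data are assumptions made for illustration; on the real data the function would be called as `city_box_summary(df)`.

```r
# Sketch: per-city quartiles, IQR, and number of points outside the 1.5 * IQR fences.
city_box_summary <- function(data, value_col = "median_price") {
  rows <- lapply(split(data, data$city), function(d) {
    x <- d[[value_col]]
    q <- unname(quantile(x, c(0.25, 0.5, 0.75)))
    iqr <- q[3] - q[1]
    is_out <- x < q[1] - 1.5 * iqr | x > q[3] + 1.5 * iqr
    data.frame(city = d$city[1], q1 = q[1], median = q[2], q3 = q[3],
               iqr = iqr, n_outliers = sum(is_out))
  })
  do.call(rbind, rows)
}

# Toy data (illustrative values only, not taken from the dataset).
toy <- data.frame(
  city = rep(c("A", "B"), each = 7),
  median_price = c(100, 102, 101, 103, 102, 101, 150,
                   200, 205, 210, 215, 220, 225, 230)
)
res <- city_box_summary(toy)
print(res)
```

To see which month an outlier belongs to, the rows of a single city can then be sorted by the value column, for example with `df[df$city == "Wichita Falls", ]` ordered by `median_price`.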
ggplot(df, aes(x = city, y = volume)) +
geom_boxplot(fill = "lightblue") +
labs(title = "Boxplot for volume and city", x = "city", y = "volume") +
theme_minimal()
- The above boxplot shows something similar to what we already showed in Section 7 for ‘city’ and ‘volume’. Looking only at the boxes, it resembles the bar chart of the means, while in terms of variability it resembles the bar chart of the standard deviations. I would summarize the results by saying that ‘Tyler’ and ‘Bryan-College Station’ are the most interesting cities in terms of sales success, with a slight preference for ‘Tyler’ in terms of volume and a clear preference for ‘Tyler’ in terms of consistency of sales. These considerations are confirmed by both the boxplots and the bar charts in Section 7.
ggplot(df, aes(x = factor(year), y = volume)) +
geom_boxplot(fill = "lightblue") +
labs(title = "Boxplot for volume and year", x = "year", y = "volume") +
theme_minimal()
ggplot(df, aes(x = factor(month), y = volume)) +
geom_boxplot(fill = "lightblue") +
labs(title = "Boxplot for volume and month", x = "month", y = "volume") +
theme_minimal()
ggplot(df, aes(x = city, y = sales, fill = factor(month))) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Normalized Sales for City and Month",
x = "City",
y = "Proportion",
fill = "Month") +
theme_minimal()
df_season <- df %>%
mutate(
season = case_when(
month %in% c(12, 1, 2) ~ "Winter",
month %in% c(3, 4, 5) ~ "Spring",
month %in% c(6, 7, 8) ~ "Summer",
month %in% c(9, 10, 11) ~ "Fall"
),
city = gsub("[aeiouAEIOU]", "", city) # remove vowels
) %>%
group_by(city, year, season) %>%
summarise(avg_sales = mean(sales, na.rm = TRUE), .groups = "drop")
ggplot(df_season, aes(x = city, y = avg_sales, fill = season)) +
geom_col(position = "dodge") +
facet_wrap(~ year) +
labs(title = "Average Seasonal Sales by City and Year",
x = "City",
y = "Average Seasonal Sales",
fill = "Season") +
theme_minimal() +
coord_flip()
df_avg <- df %>%
group_by(city, year) %>%
summarise(avg_listings = mean(listings, na.rm = TRUE), .groups = "drop")
ggplot(df_avg, aes(x = year, y = avg_listings, color = city, group = city)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
labs(title = "Average Listings by City over Years",
x = "Year",
y = "Average Listings",
color = "City") +
theme_minimal()
Final consideration
According to the analysis above, the main suggestion is to invest in all four cities, as sales appear to improve over time. In particular, I would suggest investing in ‘Wichita Falls’, where (as observed before) the currently low prices may rise considerably over the coming years. It is true that it still takes more time to sell a house here, but the trend is decreasing, and the outlier in June 2014, together with the many other considerations made above, indicates that this direction should be investigated further.
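A claim such as "the trend is decreasing" can be checked numerically with yearly means per city and the slope of a straight line fitted to them. The sketch below is only illustrative: the function name `yearly_trend` and the toy data are assumptions, and a linear slope over five yearly points is a rough indicator rather than a formal test.

```r
# Hypothetical helper: yearly mean of a column by city, plus the slope of a
# straight line fitted to those yearly means (negative slope = decreasing).
yearly_trend <- function(data, value_col) {
  f <- as.formula(paste(value_col, "~ city + year"))
  yearly <- aggregate(f, data = data, FUN = mean)
  slopes <- sapply(split(yearly, yearly$city), function(d) {
    unname(coef(lm(d[[value_col]] ~ d$year))[2])
  })
  list(yearly = yearly, slope_per_year = slopes)
}

# Toy data (illustrative values only, not taken from the dataset).
toy <- data.frame(
  city = rep(c("A", "B"), each = 6),
  year = rep(rep(2010:2012, each = 2), times = 2),
  months_inventory = c(12, 11, 10, 9, 8, 7,
                       6, 6, 6, 6, 6, 6)
)
res <- yearly_trend(toy, "months_inventory")
print(res$slope_per_year)
```

On the real data, `yearly_trend(df, "months_inventory")` would return one slope per city, and a negative value for ‘Wichita Falls’ would support the statement above.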
Comments on the results
First of all, I point out that the Gini index computed on the classes of sales is different from the one computed in Section 2: the one in Section 2 is based on the values of the variable, while this one is computed on the class frequencies of the classification. The distribution of sales is not uniform across the classes: some classes contain many observations, while others contain few, but the disparity is not extreme.