#import dataset
setwd("D:/Users/Anton/Desktop/ProfessionAI/02-Statistica descrittiva/reporitory/progetto finale/Progetto")
data=read.csv("Real Estate Texas.csv")
attach(data)
#show first dataset row
head(data)
## city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010 1 83 14.162 163800 1533 9.5
## 2 Beaumont 2010 2 108 17.690 138200 1586 10.0
## 3 Beaumont 2010 3 182 28.701 122400 1689 10.6
## 4 Beaumont 2010 4 200 26.819 123200 1708 10.6
## 5 Beaumont 2010 5 202 28.833 123100 1771 10.9
## 6 Beaumont 2010 6 189 27.219 122800 1803 11.1
The dataset contains the following variables:
city: reference city (Qualitative variable on a nominal scale)
year: reference year (Quantitative discrete variable)
month: reference month (Qualitative variable on an ordinal scale, in this case, the month is coded with numbers from 1 January to 12 December)
sales: total number of sales (Quantitative discrete variable)
volume: total sales value in millions of dollars (Quantitative continuous variable)
median price: median sales price expressed in dollars (Quantitative continuous variable)
listings: total number of listings (Quantitative discrete variable)
months inventory: amount of time needed to sell all current listings at the current sales pace, expressed in months (Quantitative continuous variable)
tot_observation=length(data$city)
sprintf("tot_obs=%s.",tot_observation)
## [1] "tot_obs=240."
summary(data)
## city year month sales
## Length:240 Min. :2010 Min. : 1.00 Min. : 79.0
## Class :character 1st Qu.:2011 1st Qu.: 3.75 1st Qu.:127.0
## Mode :character Median :2012 Median : 6.50 Median :175.5
## Mean :2012 Mean : 6.50 Mean :192.3
## 3rd Qu.:2013 3rd Qu.: 9.25 3rd Qu.:247.0
## Max. :2014 Max. :12.00 Max. :423.0
## volume median_price listings months_inventory
## Min. : 8.166 Min. : 73800 Min. : 743 Min. : 3.400
## 1st Qu.:17.660 1st Qu.:117300 1st Qu.:1026 1st Qu.: 7.800
## Median :27.062 Median :134500 Median :1618 Median : 8.950
## Mean :31.005 Mean :132665 Mean :1738 Mean : 9.193
## 3rd Qu.:40.893 3rd Qu.:150050 3rd Qu.:2056 3rd Qu.:10.950
## Max. :83.547 Max. :180000 Max. :3296 Max. :14.900
The dataset consists of 240 observations. The data refers to sales in the period 2010-2014. Each row of the dataset represents the sales for a given city, in the reference year and month.
For a better data comprehension, it is possible to identify the unique value of variable city.
matrix(unique(data$city,length((data$city)),1))
## [,1]
## [1,] "Beaumont"
## [2,] "Bryan-College Station"
## [3,] "Tyler"
## [4,] "Wichita Falls"
The data refers to 4 city:
# Define the function to calculate variability index
variability_index = function(df) {
# If df is a vector, convert it to a dataframe
if (is.vector(df)) {
df = data.frame(col = df)
}
# Initialize an empty dataframe to store the results
variability = data.frame(
Index = c("Range", "IQR", "SD", "CV"),
stringsAsFactors = FALSE
)
# Loop through each column in the dataframe
for (col_name in names(df)) {
col_data = df[[col_name]]
# Check if the column is numeric
if (is.numeric(col_data)) {
# Calculate the variability index
range_val = round(diff(range(col_data)),2)
iqr_val = round(IQR(col_data),2)
sd_val = round(sd(col_data),2)
cv_val = round(sd_val / mean(col_data),2)
# Create a new column with the results
new_column = data.frame(
col_name = c(range_val, iqr_val, sd_val, cv_val),
stringsAsFactors = FALSE
)
# Set the column name
colnames(new_column) = col_name
# Combine the new column with the variability dataframe
variability = cbind(variability, new_column)
}
}
return(variability)
}
# Load necessary library
library(moments)
# Define the function to calculate shape index
shape_index = function(df) {
# If df is a vector, convert it to a dataframe
if (is.vector(df)) {
df = data.frame(col = df)
}
# Initialize an empty dataframe to store the results
shape_ind = data.frame(
Index = c("Skewness", "Kurtosis"),
stringsAsFactors = FALSE
)
# Loop through each column in the dataframe
for (col_name in names(df)) {
col_data = df[[col_name]]
# Check if the column is numeric
if (is.numeric(col_data)) {
# Calculate the shape index
skewness_val = round(skewness(col_data), 2)
kurtosis_val = round(kurtosis(col_data)-3, 2)
# Create a new column with the results
new_column = data.frame(
col_name = c(skewness_val, kurtosis_val),
stringsAsFactors = FALSE
)
# Set the column name
colnames(new_column) = col_name
# Combine the new column with the shape_index dataframe
shape_ind = cbind(shape_ind, new_column)
}
}
return(shape_ind)
}
summary(data[,c(4,5,6,8)])
## sales volume median_price months_inventory
## Min. : 79.0 Min. : 8.166 Min. : 73800 Min. : 3.400
## 1st Qu.:127.0 1st Qu.:17.660 1st Qu.:117300 1st Qu.: 7.800
## Median :175.5 Median :27.062 Median :134500 Median : 8.950
## Mean :192.3 Mean :31.005 Mean :132665 Mean : 9.193
## 3rd Qu.:247.0 3rd Qu.:40.893 3rd Qu.:150050 3rd Qu.:10.950
## Max. :423.0 Max. :83.547 Max. :180000 Max. :14.900
variability_index(data[,c(4,5,6,8)])
## Index sales volume median_price months_inventory
## 1 Range 344.00 75.38 106200.00 11.50
## 2 IQR 120.00 23.23 32750.00 3.15
## 3 SD 79.65 16.65 22662.15 2.30
## 4 CV 0.41 0.54 0.17 0.25
shape_index(data[,c(4,5,6,8)])
## Index sales volume median_price months_inventory
## 1 Skewness 0.72 0.88 -0.36 0.04
## 2 Kurtosis -0.31 0.18 -0.62 -0.17
The variables sales and volume have a high range and a positive asymmetric distribution (\(\gamma_{sales}=0.72 , \gamma_{volume}=0.88\)). By their nature these variables are linked to each other through the median price which, however is negatively asymmetric (\(\gamma_{median\_price}=-0.36\)).
Comparing CV index appears that sales (\(CV_{sales}=0.41\)) and volume (\(CV_{volume}=0.54\)) has the higher variability, median price has the lowest (\(CV_{median\_price}=0.17\)).
The variable months inventory is the only nearly symmetric (\(\gamma_{months\_inventory}=0.04\)), symptom of a balanced distribution of the data.
From the tables relating to the variability and asymmetry index it is possible to identify the variable with the greatest variability and the most asymmetric one.
For what concern variability, since the analyzed variables present standard deviations with different magnitude index, it is necessary to focus on the Coefficient of Variation in order to make comparisons.
Focusing on CV index, it is clear that volume is the variable with highest variability.
\(CV_{volume}\) =
0.54
Focusing on Skeweness index, appears that volume is also the most asymmetric variable.
\(\gamma_{volume}\)=
0.88
The variable volume has a positive asymmetry ( \(\gamma>0\) ), which means that there is a higher frequency of lower values.
The variable sales has a range from \(Min_{sales}\)=79 to \(Max_{sales}\)=423. It is
possible to divide the variable in classes, ranging from 50 to 450 by
steps of 50.
# Sales class from 50 to 450 by 50
breaks_sales = seq(50, 450, by = 50)
labels_sales = paste(head(breaks_sales, -1), "-", tail(breaks_sales, -1), sep="")
#Sales partition by sales class
data$class_sales = cut(sales,
breaks = breaks_sales,
include.lowest = TRUE,
labels = labels_sales)
# Create frequency distribution
freq_distribution_sales = table(data$class_sales)
# Convert frequency distribution in dataframe
df_freq_sales = as.data.frame(freq_distribution_sales)
# Bar chart sales classes
library(ggplot2)
ggplot(df_freq_sales, aes(x = Var1, y = Freq)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(x = "Sales Class", y = "Absolute Frequency", title = "Sales Frequency Distribution") +
scale_y_continuous(breaks = seq(0, 80, by = 10), limits = c(0, 80)) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 14), # Center Title
axis.title.x = element_text(size = 12), # axis name size
axis.title.y = element_text(size = 12))
Fig. 1. Sales frequency distribution in classes
Fig. 1 shows that the lowest sales classes (100-150, 150-200) have a high frequency, which could indicate a prevalence of months and cities with low number of sales. The higher classes (300-350, 350-400) have very low frequencies, suggesting that it is rare to have months and cities with such a high number of sales. By further investigations, it is possible to find out if there are trends related to seasonality or if there are areas that on average record higher sales than others.
#Gini Index
gini.index = function(x){
ni = table(x)
fi = ni/length(x)
fi2 = fi^2
J = length(table(x))
gini = 1-sum(fi2)
gini.norm = gini/((J-1)/J)
return(gini.norm)
}
Calculating the Gini index for the variable sales divided into
classes, it comes out that \(G'_{sales}\)=0.92. The
value obtained is very close to the maximum value of 1, which indicates
that the variable sales divided into classes has a very high
heterogeneity, confirming the results shown in the graph.
What is the probability that, if you take a random row from this dataset, it will show the city “Beaumont”? And the probability that it will bring back the month of July? And the probability that it will carry over the month of December 2012?
From the dataset nature, it is intuitive to find the answer to these questions. The dataset consists of 240 observations. An observation for each city in the month and year of reference. The probability that taken a row will report the city of Beumont will be equal to the total number of observations(240), divided by the total number of cities(4) reported in dataset. The result will be 25%.
It is possible to verify this statement by applying the definition of probability:
\(p= \frac{Number~of~favorable~outcomes}{Total~number~of~outcomes}\)
p_Beumont=(sum(data$city == "Beaumont")/tot_observation)*100
\(p_{Beumont}=\)25%
similarly we have that
p_july=round((sum(data$month == "7")/tot_observation)*100,2)
p_dec2012=round((sum(data$month == "12" & data$year==2012)/tot_observation)*100,2)
\(p_{july}=\)8.33%
\(p_{dec2012}=\)1.67%
Starting from available data, it is possible to define new variables that helps in analyzing the data.
Define average price as:
\(mean_{price}=\frac{volume}{sales}\)
Define listings efficiency as total number of sales versus listings. Multiplying by 100, the output will represent the percentage of listings converted in effective sales.
\(\eta_{listings}=\frac{sales}{listings}
*100\)
data$mean_price=(data$volume*1000000)/data$sales
data$eta_listings=round((data$sales/data$listings)*100,2)
summary(data[,c(4,6,10,11)])
## sales median_price mean_price eta_listings
## Min. : 79.0 Min. : 73800 Min. : 97010 Min. : 5.01
## 1st Qu.:127.0 1st Qu.:117300 1st Qu.:132939 1st Qu.: 8.98
## Median :175.5 Median :134500 Median :156588 Median :10.96
## Mean :192.3 Mean :132665 Mean :154320 Mean :11.87
## 3rd Qu.:247.0 3rd Qu.:150050 3rd Qu.:173915 3rd Qu.:13.49
## Max. :423.0 Max. :180000 Max. :213234 Max. :38.71
All positional index calculated on the mean price are about 20-30 k$ greater than the respective positional index calculated on the variable median price. This implies that for a given city, in a given month and year, among all the sales made, there are some (in limited number) that have recorded higher revenues than the others.
For what concern listings efficiency, although the range of this variable is high, ranging from a minimum of 5% to a maximum of about 40%, it is important to focus on the \(3^{rd} Qu.\) index.
This parameter has a value of 13.5%. This means that in 75% of the cases observed, the effectiveness of the ads was less than 15% and only in rare cases has been recorded higher sales.
Deeper analyses will allow to understand if there is a relationship between the increase in sales and variables such as cities or a certain historical period.
# Median Price by city
ggplot(data, aes(x = city, y = median_price)) +
geom_boxplot() +
stat_summary(fun = median, geom = "text", aes(label = round(after_stat(y), 1)), vjust = -0.5, size = 3, color = "red") + #add median label
labs(title = "Median Price by City", subtitle = "2010-2014", x = "City", y = "Median Price [$]") +
scale_y_continuous(breaks = seq(70000, 180000, by = 10000),
limits = c(70000, 180000))+ # Set y-axis limits
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5, size=14), # Center the title
plot.subtitle = element_text(hjust = 0.5),
axis.title.x = element_text(size = 12), # axis name size
axis.title.y = element_text(size = 12))
Fig. 2. Median Price by City
Bryan-College Station has the highest median prices among the cities represented, with a median price of around 155400$. Follows Tyler with a median price of 142200$, and Beaumont with 130750$. Wichita Falls has the lowest median prices 102300$.
Fig. 2 also shows that Brian-College is the city with the lowest IQR. Probably, properties in this city are in greater demand than in other cities, managing to keep prices relatively stable.
Wichita Falls, in addition to having the lowest median price, also has the highest IQR. It is likely that this area is more subject to market trends. Another point to underline regarding Wichita Falls is the presence of a relative minimum 25% lower than the median price, suggesting that there have been periods in this area when property sale prices have collapsed drastically.
#Volume by year
ggplot(data, aes(x = factor(year), y = volume, fill = city)) +
geom_boxplot(position = position_dodge(width = 0.8)) +
scale_fill_manual(values = c("Beaumont" = "lightblue", "Bryan-College Station" = "lightgreen", "Tyler" = "lightcoral", "Wichita Falls" = "lightgoldenrod")) +
labs(title = "Volume by Year",
x = "Year",
y = "Volume [Billion $]") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size=14),
axis.title.x = element_text(size=12),
axis.title.y = element_text(size=12))+
scale_y_continuous(breaks = seq(0, 100, by = 10),
limits = c(0, 100)) # Set y-axis limits
Fig. 3. Volume by year
Fig. 3 shows that excep Whicita Falls, all cities show an uptrend in the analyzed time period.
Wichita Falls is the city with the lowest sales volume, with a relatively stable trend. Considering that all the other cities show an increasing trend, the fact that its sales volume has been stable over the years could be a symptom of a lack of interest in this area by buyers.
Bryan-College Station is the city with the highest sales volume, with a peak value exceeding $80 million in 2014. It is also the city with the highest average variability. This suggests the presence of a possible seasonal trend, with months characterized by higher number of sales.
Tyler has sales volumes with an uptred. It is the city with the highest median volume in the entire period analyzed. In addition, the variability is more contained than in Bryan-College city, suggesting the presence of a more stable market.
Beumont despite having lower sales volumes, among the cities registering an uptrend, is the one with the lowest variability. The market is stable and growing.
#sales by month (2010-2014)
ggplot(data, aes(x = factor(month), y = sales, fill = city)) +
geom_bar(stat = "identity", position = "stack") +
geom_text(aes(label = sales), position = position_stack(vjust = 0.5), size = 3) +
facet_wrap(~ year, ncol = 1) +
labs(title = "Sales by Month (2010-2014)",
x = "Month",
y = "Sales") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5,size=16),
axis.title.x = element_text(size=14),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.y = element_text(size=14)) +
scale_x_discrete(labels = month.abb)
Fig. 4. Sales by Month (2010-2014)
Analyzing the sales trend in the various months, it can be seen that starting from 2011, the most favorable months for sales are those ranging from May to August. The months with lowest sales are January and February.
library(dplyr)
library(gridExtra)
# Filter the data for the year 2014
data_2014 = data %>% filter(year == 2014)
data_2014_normalized = data_2014 %>%
group_by(month) %>%
mutate(sales_normalized = sales / sum(sales))
#sales by months
p1=ggplot(data_2014, aes(x = factor(month), y = sales, fill = city)) +
geom_bar(stat = "identity", position = "stack") +
geom_text(aes(label = sales), position = position_stack(vjust = 0.5), size = 4) +
labs(title = "Sales by Month (2014)",
x = "Month",
y = "Sales") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5,size=14),
legend.position = c(0.9,0.9),
axis.title.x = element_text(size=12),
axis.title.y=element_text(size=12))+
scale_y_continuous(breaks = seq(0, 1200, by = 100),
limits = c(0, 1200))
#sales by months normalized
p2=ggplot(data_2014_normalized, aes(x = factor(month), y = sales_normalized, fill = city)) +
geom_bar(stat = "identity", position = "stack") +
geom_text(aes(label = round((sales_normalized*100), 1)), position = position_stack(vjust = 0.5), size = 4) +
labs(title = "Normalized Sales by Month (2014)",
x = "Month",
y = "Normalized Sales") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5,size=14),
axis.title.x = element_text(size=12),
axis.title.y = element_text(size=12),
legend.position = "none")+
scale_y_continuous(labels = scales::percent)
# Arrange the two plots side by side
grid.arrange(p1, p2, ncol = 2)
Fig. 5. Sales by Month (2014)
Graph on the left clearly shows the presence of a seasonal trend, with the presence of a sales peak, for all cities, recorded in the summer months.
Analyzing the graph on the right, it is possible to observe that the distribution of sales between the various cities remains more or less stable as the months vary.
Wichita Falls is the city with the lowest but also most stable sales percentage in 2014. The sales percentage varies from a minimum of 10.5% to a maximum of 14.6% ( \(\Delta _{\%sales}=4.1\%\) )
Tyler is the city with the highest percentage of sales and also the second in terms of stability. The sales percentage fluctuates from a minimum of 32.7% to a maximum of 40.2% ( \(\Delta _{\%sales}=7.5\%\) ). This suggests that Tyler is a good area to invest in, as demand is high and stable throughout the year.
# Median price line chart
ggplot(data, aes(x = year + month/12, y = median_price, color = city, group = city)) +
geom_line(size = 1) +
labs(title = "Median Price Trends",
x = "year",
y = "Median Price [$]",
color = "city") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5,size=16),
axis.title.x = element_text(size=14),
axis.title.y = element_text(size=14))+
theme(legend.position = "bottom",
legend.title = element_text(size=14),
legend.text = element_text(size=12))
Fig. 6 Median Price Trends
All cities show an uptrend in the median price over the period analyzed. This suggests a growth in the Texas real estate market. It is also possible to notice the presence of seasonal variations with peaks and declines that repeat cyclically from year to year.
Bryan - College Station, generally, has the highest median prices, symptom of higher demand or greater attractiveness of this city. Furthermore, it is the city with the least volatility with a more or less stable trend and limited annual price fluctuations.
Wichita Falls has the lowest median prices, which may suggest less buyer interest due to local economic conditions. It also the city with the highest volatility among the cities analyzed, with numerous annual peaks and drops.
Tyler is the second city by median price, has an upward trend and intermediate variability. The presence of seasonal trends could serve as the basis for a possible buying and selling strategy.
It could be useful to investigate possible relationship between Tyler median price and sales trends.
# Tyler data filter
tyler_data = data %>% filter(city == "Tyler")
# Find min median price value
min_median_price = tyler_data %>%
group_by(year) %>%
summarize(min_price = min(median_price, na.rm = TRUE))
# Join tyler data with min median price
tyler_data = tyler_data %>%
left_join(min_median_price, by = "year")
# Tyler sales line chart
plot_sales = ggplot(tyler_data, aes(x = year + month/12, y = sales)) +
geom_line(color = "blue", size = 1) +
labs(title = "Tyler Sales over years",
x = "Year",
y = "Sales") +
scale_y_continuous(breaks = seq(100, 500, by = 100), limits = c(100, 500)) +
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5,size=14),
axis.title.x = element_text(size=12),
axis.title.y = element_text(size=12))
# Tyler median price line chart
plot_median_price = ggplot(tyler_data, aes(x = year + month/12, y = median_price)) +
geom_line(color = "red", size = 1) +
geom_text(aes(label = ifelse(median_price == min_price & year>2010, round(median_price, 1), "")),
hjust = 0, vjust = +1, size = 3, color = "black") +
labs(title = "Tyler Median Price over years",
x = "Year",
y = "Median Price [$]") +
scale_y_continuous(breaks = seq(120000, 170000, by=10000), limits = c(120000, 170000))+
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5,size=14),
axis.title.x = element_text(size=12),
axis.title.y = element_text(size=12))
# Join plots
grid.arrange(plot_sales, plot_median_price, ncol = 1)
Fig. 7. Tyler median price vs sales
As already seen previously, the trend of the median price for the city of Tyler presents an annual cyclical nature. Although the overall behavior is uptrend, in the years from 2011 to 2014, the price recorded its annual minimum close to the beginning of the year. Comparing the two graphs, an inverse proportionality relationship emerges between the price and the number of sales. In the summer months, where the price is highest, the greatest number of sales are also recorded, so it is better to buy properties at the beginning of the year and then resell them at a higher price in the summer months.
The analysis of the Texas real estate market from 2010 to 2014 revealed that:
Volume and sales are the parameters with the highest variability, while the median price has the lowest variability.
Listing efficiency (defined as sales on listings) in 75% of observed cases is lower than 15%.
Lower sales classes (100-150, 150-200) have the highest frequency, indicating it is rare to have months and cities with a high number of sales.
Volume by year analysis shows that except for Wichita Falls, all cities registered an increase in volume from 2010 to 2014, indicating that the market is growing.
Sales by month, both for the entire time period and focused on 2014, highlight the presence of a seasonal trend with the highest sales registered in the summer months (May-August). It also shows that Tyler city’s sales percentage is the highest and relatively stable.
The median price generally grew in the last year. It is also characterized by seasonal trends, with drops at the beginning of the year and peaks in the middle of the year.
Comparing Tyler city’s sales and median price over the years highlights an inverse proportionality between these two parameters. In the summer months when the median price reaches its peak, the highest number of sales is also recorded.
Considering that Tyler has a generally high median price, a market affected by an uptrend, a relatively stable percentage of sales, and the highest sales compared to other cities, in addition to the presence of seasonal trends, it is suggested to invest in this city.