setwd("C:/Users/tomma/Desktop/Profession AI")
texas <- read_csv("realestate_texas.csv")
## Rows: 240 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): city
## dbl (7): year, month, sales, volume, median_price, listings, months_inventory
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attach(texas)
str(texas)
## spc_tbl_ [240 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ city : chr [1:240] "Beaumont" "Beaumont" "Beaumont" "Beaumont" ...
## $ year : num [1:240] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ month : num [1:240] 1 2 3 4 5 6 7 8 9 10 ...
## $ sales : num [1:240] 83 108 182 200 202 189 164 174 124 150 ...
## $ volume : num [1:240] 14.2 17.7 28.7 26.8 28.8 ...
## $ median_price : num [1:240] 163800 138200 122400 123200 123100 ...
## $ listings : num [1:240] 1533 1586 1689 1708 1771 ...
## $ months_inventory: num [1:240] 9.5 10 10.6 10.6 10.9 11.1 11.7 11.6 11.7 11.5 ...
## - attr(*, "spec")=
## .. cols(
## .. city = col_character(),
## .. year = col_double(),
## .. month = col_double(),
## .. sales = col_double(),
## .. volume = col_double(),
## .. median_price = col_double(),
## .. listings = col_double(),
## .. months_inventory = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
City: nominal qualitative variable
Year: even if expressed with integers it is to be considered as an ordinal qualitative variable
Month: for the reason expressed previously it is an ordinal qualitative variable
Sales: quantitative variable
Volume: quantitative variable
Median Price: quantitative variable
Listings: quantitative variable
Months Inventory: quantitative variable
#Variable "city"
ni.city<-table(city)
fi.city<-table(city)/dim(texas)[1]
Tabcity<-cbind(ni.city,fi.city)
#Variable "year"
ni.year<-table(year)
fi.year<-table(year)/dim(texas)[1]
Ni.year<-cumsum(table(year))
Fi.year<-cumsum(table(year))/dim(texas)[1]
Tabyear<-cbind(ni.year,fi.year,Ni.year,Fi.year)
#Variable "month"
ni.month<-table(month)
fi.month<-table(month)/dim(texas)[1]
Ni.month<-cumsum(table(month))
Fi.month<-cumsum(table(month))/dim(texas)[1]
Tabmonth<-cbind(ni.month,fi.month,Ni.month,Fi.month)
#Other variables
quant_var <- colnames(texas[4:8])
lista_tab <- c()
for (i in quant_var) {
#Position Indices
Mean<-mean(texas[[i]]) #average
Mode<-as.numeric(names(which.max(table(texas[[i]])))) #mode
Median<-median(texas[[i]]) #median
Min<-min(texas[[i]])
Max<-max(texas[[i]])
#Variability Indices
Range<-Max-Min
IQR<-IQR(texas[[i]])
SD<-sd(texas[[i]]) #variance
#Variance.sales<-(var(sales)*(length(sales)-1))/length(sales) or
Variance<-SD^2
CV<-SD/Mean
#Shape Indices
Kurtosis<-kurtosis(texas[[i]])-3
Skewness<-skewness(texas[[i]])
Tab<-cbind(Mean,Mode,Median,Min,Max,Range,IQR,Variance,SD,CV,Skewness,Kurtosis)
lista_tab[[i]] <- Tab
}
Tabcompl_quant<-rbind(lista_tab[[(1)]], lista_tab[[(2)]], lista_tab[[(3)]], lista_tab[[(4)]], lista_tab[[(5)]])
Tabcompl_quant<-as.data.frame(Tabcompl_quant)
rownames(Tabcompl_quant)<-colnames(texas[4:8])
For the quantitative variables I calculated the main position, variability and shape indices. I then grouped them in a summary table. We can compare the variables with the skewness and coefficient of variation indices to better understand their distribution and variability. Most of the variables have a positive skewness, where the values are concentrated mostly in the lower part of the distribution, leaving a longer tail to the right. The variable “median_price” is an exception. From the point of view of relative variability, measured by the coefficient of variation (CV), the variable with the greatest dispersion with respect to the mean is volume. This variable is also the one with the highest degree of skewness, as indicated by the skewness index in the table, reflecting a distribution particularly unbalanced towards higher values.
kable(round(Tabcompl_quant, 2))
| Mean | Mode | Median | Min | Max | Range | IQR | Variance | SD | CV | Skewness | Kurtosis | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sales | 192.29 | 124.0 | 175.50 | 79.00 | 423.00 | 344.00 | 120.00 | 6.34430e+03 | 79.65 | 0.41 | 0.72 | -0.31 |
| volume | 31.01 | 14.0 | 27.06 | 8.17 | 83.55 | 75.38 | 23.23 | 2.77270e+02 | 16.65 | 0.54 | 0.88 | 0.18 |
| median_price | 132665.42 | 130000.0 | 134500.00 | 73800.00 | 180000.00 | 106200.00 | 32750.00 | 5.13573e+08 | 22662.15 | 0.17 | -0.36 | -0.62 |
| listings | 1738.02 | 1581.0 | 1618.50 | 743.00 | 3296.00 | 2553.00 | 1029.50 | 5.66569e+05 | 752.71 | 0.43 | 0.65 | -0.79 |
| months_inventory | 9.19 | 8.1 | 8.95 | 3.40 | 14.90 | 11.50 | 3.15 | 5.31000e+00 | 2.30 | 0.25 | 0.04 | -0.17 |
For the variables “city”, “years” and “month”, being qualitative, I created only the frequency distributions. As you can see, within the dataset, the number of observations for each city is identical and data were collected for each month from 2010 to 2014, for the cities of Beaumont, Bryan-College Station, Tyler, Wichita Falls.
kable(Tabcity, caption = "Frequency Table for the variable City", align = "c")
| ni.city | fi.city | |
|---|---|---|
| Beaumont | 60 | 0.25 |
| Bryan-College Station | 60 | 0.25 |
| Tyler | 60 | 0.25 |
| Wichita Falls | 60 | 0.25 |
kable(round(Tabmonth,2), caption = "Frequency Table for the variable Month", align = "c")
| ni.month | fi.month | Ni.month | Fi.month |
|---|---|---|---|
| 20 | 0.08 | 20 | 0.08 |
| 20 | 0.08 | 40 | 0.17 |
| 20 | 0.08 | 60 | 0.25 |
| 20 | 0.08 | 80 | 0.33 |
| 20 | 0.08 | 100 | 0.42 |
| 20 | 0.08 | 120 | 0.50 |
| 20 | 0.08 | 140 | 0.58 |
| 20 | 0.08 | 160 | 0.67 |
| 20 | 0.08 | 180 | 0.75 |
| 20 | 0.08 | 200 | 0.83 |
| 20 | 0.08 | 220 | 0.92 |
| 20 | 0.08 | 240 | 1.00 |
kable(Tabyear, caption = "Frequency Table for the variable Year", align = "c")
| ni.year | fi.year | Ni.year | Fi.year | |
|---|---|---|---|---|
| 2010 | 48 | 0.2 | 48 | 0.2 |
| 2011 | 48 | 0.2 | 96 | 0.4 |
| 2012 | 48 | 0.2 | 144 | 0.6 |
| 2013 | 48 | 0.2 | 192 | 0.8 |
| 2014 | 48 | 0.2 | 240 | 1.0 |
The variable with index 2, or “volume”, has a higher CV, therefore higher variability.
kable(rownames(Tabcompl_quant[which.max(Tabcompl_quant$CV),]))
| x |
|---|
| volume |
The “volume” variable has a higher skewness index in absolute value, therefore it is more asymmetric.
kable(rownames(Tabcompl_quant[which.max(abs(Tabcompl_quant$Skewness)),]))
| x |
|---|
| volume |
For the number of classes, Sturges’ formula is used.
S<-round(1+10/3*log(length(sales)))
W<-ceiling((max(sales)-min(sales))/S)
class.sales<-cut(sales,breaks=c(seq(min(sales)-1,max(sales),W),max(sales)))
ni.class.sales<-table(class.sales)
Ni.class.sales<-cumsum(table(class.sales))
fi.class.sales<-table(class.sales)/length(sales)
Fi.class.sales<-cumsum(table(class.sales)/length(sales))
Tabclass.sales<-cbind(ni.class.sales,Ni.class.sales,fi.class.sales,Fi.class.sales)
kable(round(Tabclass.sales, 2), caption = "Frequency Table of the Variable Sales divided into Classes", align = "c")
| ni.class.sales | Ni.class.sales | fi.class.sales | Fi.class.sales | |
|---|---|---|---|---|
| (78,97] | 19 | 19 | 0.08 | 0.08 |
| (97,116] | 26 | 45 | 0.11 | 0.19 |
| (116,135] | 29 | 74 | 0.12 | 0.31 |
| (135,154] | 21 | 95 | 0.09 | 0.40 |
| (154,173] | 23 | 118 | 0.10 | 0.49 |
| (173,192] | 21 | 139 | 0.09 | 0.58 |
| (192,211] | 19 | 158 | 0.08 | 0.66 |
| (211,230] | 10 | 168 | 0.04 | 0.70 |
| (230,249] | 12 | 180 | 0.05 | 0.75 |
| (249,268] | 11 | 191 | 0.05 | 0.80 |
| (268,287] | 13 | 204 | 0.05 | 0.85 |
| (287,306] | 13 | 217 | 0.05 | 0.90 |
| (306,325] | 5 | 222 | 0.02 | 0.92 |
| (325,344] | 5 | 227 | 0.02 | 0.95 |
| (344,363] | 5 | 232 | 0.02 | 0.97 |
| (363,382] | 4 | 236 | 0.02 | 0.98 |
| (382,401] | 1 | 237 | 0.00 | 0.99 |
| (401,420] | 2 | 239 | 0.01 | 1.00 |
| (420,423] | 1 | 240 | 0.00 | 1.00 |
barplot(table(class.sales),ylim = c(0,30),xlab = "Sales Classes",ylab = "Abs.Freq.")
There are 19 classes, 18 of which have a width of 19 and the last one has a width of 3. In most months there is a total number of sales belonging to the interval (116, 135], with a probability of 12%.
The Gini index is equal to 0.97, very close to the upper limit value. This means that there is almost an equal distribution of the classes of total sales, as expected, given the low frequency of the most frequent class.
gini.index <- function(x){
ni = table(x)
fi = ni/length(x)
fi2 = fi^2
J = length(table(x))
gini = 1 - sum(fi2)
gini.normalizzato = gini/((J-1)/J)
return(gini.normalizzato)}
kable(round(gini.index(class.sales),2))
| x |
|---|
| 0.97 |
The Gini index will be equal to 1, since there is an equal distribution of the absolute frequencies of the modes of the city variable.
p1 <- sum(city=="Beaumont")/nrow(texas)
p2 <- sum(month==7)/nrow(texas)
p3 <- sum(month==12 & year==2012)/nrow(texas)
p1 <- sum(city == "Beaumont") / nrow(texas)
p2 <- sum(month == 7) / nrow(texas)
p3 <- sum(month == 12 & year == 2012) / nrow(texas)
probabilities <- data.frame(
p1,
p2,
p3)
kable(
round(probabilities, 2),
align = "c",
col.names = c("P(city='Beaumont')", "P(month=7)", "P(month=12 & year=2012)"))
| P(city=‘Beaumont’) | P(month=7) | P(month=12 & year=2012) |
|---|---|---|
| 0.25 | 0.08 | 0.02 |
texas$mean_price<-(volume/sales)*1000000
texas$efflistings<-(sales/listings)*100
texas$date <- as.Date(paste(texas$year,texas$month,"01",sep = "-"))
To create a column that reflects the effectiveness of sales ads, I calculated, for each observation, the percentage of ads that actually led to a sale.
texas$efflistings<-(sales/listings)*100
texas$date <- as.Date(paste(texas$year,texas$month,"01",sep = "-"))
ggplot(data = texas) +
geom_line(aes(x = date, y = efflistings, colour = city)) +
geom_point(aes(x = date, y = efflistings, colour = city)) +
labs(
title = "Time Series of ADS effectiveness by City",
x = "Period",
y = "Percentage of Sales on active ADS"
) +
scale_x_date(
breaks = seq(as.Date("2010-01-01"), as.Date("2014-12-01"), by = "6 months"),
date_labels = "%m/%Y",
expand = expansion(mult = c(0.001, 0.001))
) +
theme_classic()
The most effective sales ADS are those in the Bryan-Colege Station city. In general, it is observed that positive and negative trends occur simultaneously in the four cities but with different magnitudes.
summary_texas <- texas %>%
group_by(year, city) %>%
summarise(
media.di.sales = round(mean(sales),2),
sd.di.volume = round(sd(volume),2))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
kable(summary_texas)
| year | city | media.di.sales | sd.di.volume |
|---|---|---|---|
| 2010 | Beaumont | 156.17 | 4.95 |
| 2010 | Bryan-College Station | 167.58 | 10.82 |
| 2010 | Tyler | 227.50 | 8.39 |
| 2010 | Wichita Falls | 123.42 | 4.07 |
| 2011 | Beaumont | 144.00 | 4.30 |
| 2011 | Bryan-College Station | 167.42 | 10.31 |
| 2011 | Tyler | 238.83 | 9.41 |
| 2011 | Wichita Falls | 106.25 | 2.52 |
| 2012 | Beaumont | 171.92 | 4.92 |
| 2012 | Bryan-College Station | 196.75 | 13.49 |
| 2012 | Tyler | 263.50 | 10.23 |
| 2012 | Wichita Falls | 112.42 | 2.66 |
| 2013 | Beaumont | 201.17 | 6.44 |
| 2013 | Bryan-College Station | 237.83 | 19.54 |
| 2013 | Tyler | 287.42 | 10.33 |
| 2013 | Wichita Falls | 121.25 | 3.11 |
| 2014 | Beaumont | 213.67 | 7.05 |
| 2014 | Bryan-College Station | 260.25 | 17.97 |
| 2014 | Tyler | 331.50 | 12.76 |
| 2014 | Wichita Falls | 117.00 | 3.13 |
ggplot(summary_texas, aes(x = year, y = media.di.sales, color = city, group = city)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
labs(
title = "Average monthly Sales by City and Year",
x = "Year",
y = "Average Monthly Sales",
color = "City"
) +
theme_classic()
There is a decline in average monthly sales in 2011 for all cities in the sample, then an increase every year, except for the city of Wichita Falls, with a decrease in 2014.
ggplot(data=texas)+
geom_boxplot(aes(y=median_price,x=city))+
labs(title = "Median Price Blox",
x="City",
y="Median Price")+
theme_classic()
The graph shows that the variable “median_price” presents a greater variability for the city of Wichita Falls, where there are also houses with the lowest prices, while the same variable is distributed with a lower variability for the observations relating to the city of Bryan-College Station, where the most expensive houses are present.
ggplot(data=texas)+
geom_boxplot(aes(y=volume,x=city,fill=as.factor(year)))+
labs(title = "BoxPlot Total Sales value by City",
x="City",
y="Total Sales Value",
fill="Year")+
theme_classic()
From the graph, it can be observed that the variable “volume” is characterized by greater variability when it comes to the city of Bryan-College Station. In particular, the total value of sales has a greater variability in the year 2013. Instead, “volume” is less variable for the city of Wichita Falls. Finally, in the year 2012, limited to the city of Beumont, the volume of sales is characterized, in absolute terms, by a lower variability.
for (i in 2010:2014) {
grafico <- ggplot(data=subset(texas, year == i)) +
geom_col(aes(x = month, y = sales, fill = city), position = "stack") +
labs(
title = paste("Total Sales by Month and City in the year", i),
x = "Month",
y = "Total Sales",
fill = "City"
) +
scale_x_continuous(breaks = seq(1, 12, 1)) +
scale_y_continuous(breaks = seq(100, 1200, 100)) +
theme_classic() +
theme(plot.title = element_text(size = 12))
print(grafico)
}
for (i in 2010:2014) {
grafico1 <- ggplot(data=subset(texas, year == i)) +
geom_col(aes(x = month, y = sales, fill = city), position = "fill") +
labs(
title = paste("Total Sales Percentage by Month and City in the year", i),
x = "Month",
y = "Percentage of Sales",
fill = "City"
) +
scale_x_continuous(breaks = seq(1, 12, 1)) +
scale_y_continuous(breaks = seq(0, 1, 0.05)) +
theme_classic() +
theme(plot.title = element_text(size = 10))
print(grafico1)
}
From the graphs it is noted that for most of the months examined the highest percentage of sales is recorded in the city of Tyler while the lowest in the city of Wichita Falls. Furthermore, the percentages of sales for each city appear to be more or less constant over time. Another consideration to add concerns the fact that sales are concentrated, for each year, during the summer period.
ggplot(data = texas) +
geom_line(aes(x = date, y = mean_price, colour = city)) +
geom_point(aes(x = date, y = mean_price, colour = city)) +
labs(
title = "Historical Series of Average Price between Cities",
x = "Period",
y = "Average Price",
colour = "City"
) +
scale_x_date(
breaks = seq(as.Date("2010-01-01"), as.Date("2014-12-01"), by = "6 months"),
date_labels = "%m/%y",
expand = expansion(mult = c(0.01, 0.01))
) +
theme_classic()
I chose to represent the average price of sales in each month from 2010 to 2015 for the cities in the dataset. You can see that the highest average prices are in the city of Bryan-College Station while the lowest are in the city of Wichita Falls. At the beginning of 2012 there was a sharp drop in prices that involved all cities. The average price undergoes strong variations, for all cities, undergoes strong variations, represented as broken lines. Often, the variations of the average prices of the cities of Beaumont and Tyler go in the same direction.
Interesting insights have been extracted from the descriptive analyses conducted on the dataset. First of all, we note the continuous price variations that involve all the cities included in the sample, therefore they are shocks that concern the real estate market itself. On the sales side, it should be noted that the trends are generally increasing, with transactions that are concluded more frequently during the summer months of the year. From the point of view of the individual cities, it is observed that the highest average price is recorded in the city of Bryan-College Station while the lowest is in the city of Wichita Falls. In the city of Tyler, more real estate sales are made than in the others and at the same time, a higher overall turnover in the sector is recorded. Going down a level further, considering the quality of real estate agencies, it is necessary to mention the agents of the city of Bryan-College Station who outperform their colleagues in other cities in terms of percentage of sales on active ADS.