INTRODUCTION

IMPORT PACKAGES

Import packages useful for the analysis:

library(ggplot2) #
library(moments) # for skewness - kurtosis functions
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(gghalves)

IMPORT DATASET

The name of the csv file with the data to analyze is:

realestate_texas.csv

Copy the file in the current working directory and import with read.csv. Dataframe data will appear in environment window of Rstudio.

data = read.csv("realestate_texas.csv", sep = ',')

There are 240 observations; we can see the first rows with the function head:

head(data, 5)

With function str we can have an overview:

str(data)

## 'data.frame':    240 obs. of  8 variables:
##  $ city            : chr  "Beaumont" "Beaumont" "Beaumont" "Beaumont" ...
##  $ year            : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ month           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sales           : int  83 108 182 200 202 189 164 174 124 150 ...
##  $ volume          : num  14.2 17.7 28.7 26.8 28.8 ...
##  $ median_price    : num  163800 138200 122400 123200 123100 ...
##  $ listings        : int  1533 1586 1689 1708 1771 1803 1857 1830 1829 1779 ...
##  $ months_inventory: num  9.5 10 10.6 10.6 10.9 11.1 11.7 11.6 11.7 11.5 ...

Use function attach to directly access to data columns without symbol ‘$’.

1) Variables analysis

Identify and describe the type of statistical variables in the dataset. Evaluate how to manage the variables that imply a time dimension and comment which type of analysis can be done on each variable.

Each observation (row) of the dataset contains the values of quantitative variables sales, volume, median_price, listings and months_inventory, observed for a certain city, in a certain month, in a certain year.

Variable city

city: reference city.
It is a nominal categorical variable. For this variable it will not be done a statistical analysis, but it will be used to see how other quantitative variables (sales, volume, median_price, listings e month_inventory) changes for different values of city.

Variable year

year: reference year.
It is a numerical variable who refers to the reference year considered as an ordinal categorical variable. Similarly to city, a statistical analysis will not be done for this variable, but it will be used to see how other quantitative variables changes over the years (time series).

Variable month

month: reference month.
It is a numerical variable who refers to the reference month considered as an ordinal categorical variable. Similarly to year, a statistical analysis will not be done for this variable, but it will be used to see how other quantitative variables evolves over the months to highlight (if present) seasonal trends.

Variables year and month can be “combined” together to build detailed time series, starting from first month of 2010 (January 2010) to the last one (60th) (December 2014).

Variable sales

sales: total number of sales.
It is a ratio scale based discrete quantitative variable.
ANALYSIS TYPE: study of sales distribution for different city and evolution of sales over time (time series over year to have a large scale view of the data, and time series on month scale to have a more detailed behavior of sales during different period of the year).

Variable volume

volume: total volume fo sales (million dollar).
It is a ratio scale based continue quantitative variable.
ANALYSIS TYPE: study of volume distribution for different city and behavior over time (year and month based time series).

Variable median_price

median_price: median sale price (dollars).
It is a ratio scale based continue quantitative variable.
ANALYSIS TYPE: the same type of analysis will be performed for this variable, median_price distribution for different city and its behavior over time (year and month based time series).

Variable listings

listings: total number of active listings.
It is a ratio scale based discrete quantitative variable.
ANALYSIS TYPE: the same type of analysis will be performed for this variable, listings distribution for different city and its behavior over time (year and month based time series).

Variable months_inventory

months_inventory: quantity of time (in months) necessary to sell all the listings.
It is a ratio scale continue quantitative variable.
ANALYSIS TYPE: combined with listings to evaluate the request of properties fir city over time.

2) Position, variability and shape indices

Calculate the indices of position, variability and shape for all the variables for which it makes sense to do so, for the others create a frequency distribution. At the end, comment everything briefly.

Variable sales

Position indices

Calculation of position indices (summary could be an alternative):

# min
min(sales) # or quantile(sales)[1]

## [1] 79

# max
max(sales) # or quantile(sales)[5]

## [1] 423

# first quartile Q1
quantile(sales)[2]

## 25% 
## 127

# median
median(sales) # or quantile(sales)[3]

## [1] 175.5

# third quartile Q3
quantile(sales)[4]

## 75% 
## 247

# arithmetic mean
mean(sales)

## [1] 192.2917

NOTE: mode has not been calculated because it has a lot of discrete values, and should be more correct to divide it in class and calculate a modal class.

Variability indices

# range (min and max)
range(sales)

## [1]  79 423

# IQR interquartile range
IQR(sales)

## [1] 120

# variance
var(sales)

## [1] 6344.3

# standard deviation
sd(sales)

## [1] 79.65111

# CV coefficient of variation
100*sd(sales)/mean(sales)

## [1] 41.42203

Shape indices

# Fisher index
skewness(sales)

## [1] 0.718104

# Kurtosis index
kurtosis(sales)-3

## [1] -0.3131764

Variable sales has a positive skewness coefficient (0.72), means that the values with higher probability are located in the left part of the probability density function. This can also be noted looking at mean and median, in fact:

median (175.50) < mean (192.29) –> positive skewness

Kurtosis index is negative (-0.31), means that its shape is platykurtic, with lower probability for central and extreme (tail) values and higher probability for intermediate values.

Variable volume

Position indices

Calculation of position indices (summary could be an alternative):

# min
min(volume) # or quantile(volume)[1]

## [1] 8.166

# max
max(volume) # or quantile(volume)[5]

## [1] 83.547

# first quartile Q1
quantile(volume)[2]

##     25% 
## 17.6595

# median
median(volume) # or quantile(volume)[3]

## [1] 27.0625

# third quartile Q3
quantile(volume)[4]

##    75% 
## 40.893

# arithmetic mean
mean(volume)

## [1] 31.00519

NOTE: mode has not been calculated because it has a lot of discrete values, and should be more correct to divide it in class and calculate a modal class.

Variability indices

# range (min and max)
range(volume)

## [1]  8.166 83.547

# IQR interquartile range
IQR(volume)

## [1] 23.2335

# variance
var(volume)

## [1] 277.2707

# standard deviation
sd(volume)

## [1] 16.65145

# CV coefficient of variation
100*sd(volume)/mean(volume)

## [1] 53.70536

Shape indices

# Fisher index
skewness(volume)

## [1] 0.884742

# Kurtosis index
kurtosis(volume)-3

## [1] 0.176987

Variable volume has a positive skewness coefficient (0.88), means that the values with higher probability are located in the left part of the probability density function. This can also be noted looking at mean and median, in fact:

median (27.06) < mean (31.00) –> positive skewness

Kurtosis index is positive (0.177), means that its shape is leptokurtic, with higher probability for central and extreme (tail) values and lower probability for intermediate values.

Variable median_price

Position indices

Calculation of position indices (summary could be an alternative):

# min
min(median_price) # or quantile(median_price)[1]

## [1] 73800

# max
max(median_price) # or quantile(median_price)[5]

## [1] 180000

# first quartile Q1
quantile(median_price)[2]

##    25% 
## 117300

# median
median(median_price) # or quantile(median_price)[3]

## [1] 134500

# third quartile Q3
quantile(median_price)[4]

##    75% 
## 150050

# arithmetic mean
mean(median_price)

## [1] 132665.4

NOTE: mode has not been calculated because it has a lot of discrete values, and should be more correct to divide it in class and calculate a modal class.

Variability indices

# range (min and max)
range(median_price)

## [1]  73800 180000

# IQR interquartile range
IQR(median_price)

## [1] 32750

# variance
var(median_price)

## [1] 513572983

# standard deviation
sd(median_price)

## [1] 22662.15

# CV coefficient of variation
100*sd(median_price)/mean(median_price)

## [1] 17.08218

Shape indices

# Fisher index
skewness(median_price)

## [1] -0.3645529

# Kurtosis index
kurtosis(median_price)-3

## [1] -0.6229618

Variable median_price has a negative skewness coefficient (-0.36), means that the value with higher probability are located in the right part of the probability density function. This can also be noted looking at mean and median, in fact:

median (134500.00) > mean (132665.40) –> negative skewness

Kurtosis index is negative (-0.62), means that its shape is platykurtic, with lower probability for central and extreme (tail) values and higher probability for intermediate values.

Variable listings

Position indices

Calculation of position indices (summary could be an alternative):

# min
min(listings) # or quantile(listings)[1]

## [1] 743

# max
max(listings) # or quantile(listings)[5]

## [1] 3296

# first quartile Q1
quantile(listings)[2]

##    25% 
## 1026.5

# median
median(listings) # or quantile(listings)[3]

## [1] 1618.5

# third quartile Q3
quantile(listings)[4]

##  75% 
## 2056

# arithmetic mean
mean(listings)

## [1] 1738.021

NOTE: mode has not been calculated because it has a lot of discrete values, and should be more correct to divide it in class and calculate a modal class.

Variability indices

# range (min and max)
range(listings)

## [1]  743 3296

# IQR interquartile range
IQR(listings)

## [1] 1029.5

# variance
var(listings)

## [1] 566569

# standard deviation
sd(listings)

## [1] 752.7078

# CV coefficient of variation
100*sd(listings)/mean(listings)

## [1] 43.30833

Shape indices

# Fisher index
skewness(listings)

## [1] 0.6494982

# Kurtosis index
kurtosis(listings)-3

## [1] -0.79179

Variable listings has a positive skewness coefficient (0.65), means that the values with higher probability are located in the left part of the probability density function. This can also be noted looking at mean and median, in fact:

median (1618.50) < mean (1738.02) –> positive skewness

Kurtosis index is negative (-0.79), means that its shape is platykurtic, with lower probability for central and extreme (tail) values and higher probability for intermediate values.

Variable months_inventory

Position indices

Calculation of position indices (summary could be an alternative):

# min
min(months_inventory) # or quantile(months_inventory)[1]

## [1] 3.4

# max
max(months_inventory) # or quantile(months_inventory)[5]

## [1] 14.9

# first quartile Q1
quantile(months_inventory)[2]

## 25% 
## 7.8

# median
median(months_inventory) # or quantile(months_inventory)[3]

## [1] 8.95

# third quartile Q3
quantile(months_inventory)[4]

##   75% 
## 10.95

# arithmetic mean
mean(months_inventory)

## [1] 9.1925

NOTE: mode has not been calculated because it has a lot of discrete values, and should be more correct to divide it in class and calculate a modal class.

Variability indices

# range (min and max)
range(months_inventory)

## [1]  3.4 14.9

# IQR interquartile range
IQR(months_inventory)

## [1] 3.15

# variance
var(months_inventory)

## [1] 5.306889

# standard deviation
sd(months_inventory)

## [1] 2.303669

# CV coefficient of variation
100*sd(months_inventory)/mean(months_inventory)

## [1] 25.06031

Shape indices

# Fisher index
skewness(months_inventory)

## [1] 0.04097527

# Kurtosis index
kurtosis(months_inventory)-3

## [1] -0.1744475

Variable months_inventory has a positive skewness coefficient (0.040) (really close to zero value), means that the values with higher probability are located in the left part of the probability density function. This can also be noted looking at mean and median, in fact:

median (8.95) < mean (9.19) –> positive skewness

Kurtosis index is negative (-0.17), means that its shape is platykurtic, with lower probability for central and extreme (tail) values and higher probability for intermediate values.

Variable city

city is a nominal categorical value; here the calculation of the distribution frequencies (absolute frequency $n_i$ and the relative frequency $n_i$):

city_n_i = table(city)
city_f_i = table(city)/length(city) 
cbind(city_n_i, city_f_i)

##                       city_n_i city_f_i
## Beaumont                    60     0.25
## Bryan-College Station       60     0.25
## Tyler                       60     0.25
## Wichita Falls               60     0.25

distr_freq_city = cbind(city_n_i, city_f_i)

city has four values: Beaumont, Bryan-College Station, Tyler, and Wichita Falls.
The absolute frequency is 60 and the relative frequency is 0.25 for each city.

Variable year

year is an ordinal categorical value; here the calculation of the distribution frequencies (absolute frequency $n_i$ and the relative frequency $n_i$):

year_n_i = table(year)
year_f_i = table(year)/length(year) 
cbind(year_n_i, year_f_i)

##      year_n_i year_f_i
## 2010       48      0.2
## 2011       48      0.2
## 2012       48      0.2
## 2013       48      0.2
## 2014       48      0.2

distr_freq_year = cbind(year_n_i, year_f_i)

year has five values: 2010, 2011, 2012, 2013 and 2014.
The absolute frequency is 48 and the relative frequency is 0.2 for each value of year.

Variable month

month is an ordinal categorical value; here the calculation of the distribution frequencies (absolute frequency $n_i$ and the relative frequency $n_i$):

month_n_i = table(month)
month_f_i = table(month)/length(month) 
cbind(month_n_i, month_f_i)

##    month_n_i  month_f_i
## 1         20 0.08333333
## 2         20 0.08333333
## 3         20 0.08333333
## 4         20 0.08333333
## 5         20 0.08333333
## 6         20 0.08333333
## 7         20 0.08333333
## 8         20 0.08333333
## 9         20 0.08333333
## 10        20 0.08333333
## 11        20 0.08333333
## 12        20 0.08333333

distr_freq_month = cbind(month_n_i, month_f_i)

month has 12 values: from 1 to 12.
The absolute frequency is 20 and the relative frequency is 0.083 for each value of month.

3) Identification of variables with higher variability and asymmetry

Determine - Which is the variable with higher variability? - Which is the variable with more asymmetric distribution? Explain how you reach these conclusions and provide statistical considerations.

VARIABLE WITH HIGHER VARIABILITY

To identify the variable with higher variability it is necessary to compare the variation coefficient CV of each one:

# CV sales
100*sd(sales)/mean(sales)

## [1] 41.42203

# CV volume
100*sd(volume)/mean(volume)

## [1] 53.70536

# CV median_price
100*sd(median_price)/mean(median_price)

## [1] 17.08218

# CV listings
100*sd(listings)/mean(listings)

## [1] 43.30833

# CV months_inventory
100*sd(months_inventory)/mean(months_inventory)

## [1] 25.06031

Comparing the values obtained:

CV_volume (53.70536) > CV_listing (43.30833) > CV_sales (41.42203) > CV_months_inventory (25.06031) > CV_median_price (17.08218)

We can conclude that the variable with higher variability is volume, with a percentage value of 53.7 %.

VARIABLE WITH HIGHER ASIMMETRY

To identify the variable with higher asymmetry it is necessary to compare the skewness of each one:

# asimmetry index sales
skewness(sales)

## [1] 0.718104

# asimmetry index volume
skewness(volume)

## [1] 0.884742

# asimmetry index median_price
skewness(median_price)

## [1] -0.3645529

# asimmetry index listings
skewness(listings)

## [1] 0.6494982

# asimmetry index months_inventory
skewness(months_inventory)

## [1] 0.04097527

Looking at the results above, we can say that the variable with higher asymmetry is volume with a skewness value equal to 0.884742 (it is positive, so its probability frequency distribution will be higher in the left region of the distribution). The only variable with a negative skewness is median_price (-0.3645529).
The variable with lower asymmetry is months_inventory (0.04097527).

4) Creation of classes for a quantitative variable

Select one quantitative variable (ex. sales or median_price) and split it up in classes. Create a frequency distribution presenting the results with a bar chart. Calculate the heterogeneity index of Gini and discuss the results.

CLASS CREATION

Analyze the variable sales:

range(sales)

## [1]  79 423

Considering minimum (79) and maximum value (423), classes with extension equal to 50 will be created, starting from 75 up to 425:

data$sales_cl = cut(sales, breaks = seq(75,425,50))
# attach(data)

Now that the now column has been created and added to the dataset data, we can calculate the distribution frequency:

sales_cl_n_i = table(sales_cl)
sales_cl_f_i = sales_cl_n_i/length(sales)
sales_cl_N_i = cumsum(sales_cl_n_i)
sales_cl_F_i = cumsum(sales_cl_f_i)

distr_freq_sales_cl = cbind (sales_cl_n_i, sales_cl_f_i, sales_cl_N_i, sales_cl_F_i)
distr_freq_sales_cl

##           sales_cl_n_i sales_cl_f_i sales_cl_N_i sales_cl_F_i
## (75,125]            59   0.24583333           59    0.2458333
## (125,175]           61   0.25416667          120    0.5000000
## (175,225]           46   0.19166667          166    0.6916667
## (225,275]           30   0.12500000          196    0.8166667
## (275,325]           26   0.10833333          222    0.9250000
## (325,375]           13   0.05416667          235    0.9791667
## (375,425]            5   0.02083333          240    1.0000000

The distribution frequency is complete:

the class with higher frequency is the class (125,175].
the first quartile class is (125,175].
the median class is (125,175].
the third quartile class is (225,275].

It is interesting to note that 69 % of the sales number is inferior to 225. This is confirmed by the positive skewness previously calculated for varibale sales.

FREQUENCY DISTRIBUTION WITH BAR CHART

To create a bar chart of the variable sales_cl we use the ggplot functions geom_bar:

ggplot(data = data)+
  geom_bar(aes(x = sales_cl),
           stat = "count",
           col = "darkorange4",
           fill = "darkorange")+
  labs(title = "SALES BAR CHART",
       x = "SALES CLASSES",
       y = "ABSOLUTE FREQUENCY")+
  scale_y_continuous(breaks = seq(0,70,10), limits = c(0, 70))+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

HETEROGENEITY INDEX OF GINI

Function to calculate the heterogeneity index of Gini:

gini.index = function(x) {
  n_i = table(x)
  f_i = n_i/length(x)
  f_i2 = f_i^2
  J = length(table(x))
  
  gini = 1-sum(f_i2)
  gini.normalized = gini/((J-1)/J)
  return(gini.normalized)
}

gini.index(sales_cl)

## [1] 0.9420833

The obtained Gini index for the variable sales_cl is 0.92, a value close to 1. It means that there is an high heterogeneity for what concern the distribution of the variable sales inside the classes, and there is no class that significantly prevails on the others.

5) Probability calculation

Which is the probability that, taken a row casually from the dataset, it contains the city “Beaumont”? And the probability that contains July month? And the probability that contains the month of December 2012?

PROBABILITY THAT CITY IS “Beaumont”

We can calculate the probability of the single event considering that it can be defined as the ratio between the number of favorable events and the total events.

For what concern the probability that the city is “Beaumont”, we know the it occurs 60 times and the total number of observation is 240, so it simply is 60/240; but to automatize this function, we decide to “filter” the column city, counting the number of occurrences where city is “Beaumont”:

length(which(city == "Beaumont"))/length(city) # 60/240

## [1] 0.25

The probability is 0.25.

PROBABILITY THAT MONTH IS JULY

Applying the same method explained above:

length(which(month == 7))/length(month) # 20/240

## [1] 0.08333333

The probability is 0.083.

PROBABILITY THAT MONTH IS DECEMBER 2012

Use the same method above but in this case I add the logic operator & to make a “double filter”:

length(which(month == 12 & year == 2012))/length(month) # 4/240

## [1] 0.01666667

The probability is 0.0167.

6) Creation of new variables

Create a new column that calculates the mean price of the estates using the available variables.
Try to create a column that measures the efficiency of the sale announcement. Comment and discuss the results.

CALCULATION OF MEAN PRICE OF ESTATES

We can add a new column mean_price obtained dividing, for each observation, the volume (in million dollar) with the number of sales properties sales. The unity of measure adopted for mean_price is kilo dollar to be more readable.

data$mean_price = 1000*volume/(sales)

summary(data$mean_price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   97.01  132.94  156.59  154.32  173.92  213.23

The minimum mean price is 90010 $, the maximum mean price 213234 $.

We can analyze this new variable, seeing how it changes for different cities. First of all divide it in classes:

range(mean_price)

## [1]  97.0102 213.2339

data$mean_price_cl = cut(mean_price, breaks = seq(95, 215, 15))

ggplot(data = data)+
  geom_bar(aes(x = mean_price_cl),
           stat = "count",
           col = "darkorange4",
           fill = "darkorange")+
  labs(x = "MEAN PRICE CLASSES",
       y = "ABSOLUTE FREQUENCY",
       title = "MEAN PRICE BAR CHART")+
  scale_y_continuous(breaks = seq(0,60,10), limits = c(0, 60))+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

The bar chart above shows that the mean price has a shape close to a normal distribution, with medium mean_price with higher probability located in the center, and lower and higher mean_price with lower probability located near the tails. We can see how it is distributed for different city:

ggplot(data = data)+
  geom_bar(aes(x = mean_price_cl, fill = city), 
           stat = "count", 
           col = "black",
           position = "dodge")+
  labs(x = "MEAN PRICE CLASSES",
       y = "ABSOLUTE FREQUENCY",
       title = "MEAN PRICE CLASSES FOR DIFFERENT CITIES")+
  scale_y_continuous(breaks = seq(0,40,5), limits = c(0, 35))+
  theme_classic()+
  theme(plot.title = element_text(hjust=1))+
  theme(plot.title = element_text(face="bold"))

From the graph above we can say that:

the properties with lower price are located in Wichita Falls; the most expansive property in Wichita Falls belongs to (140;155] class.
the properties in Beaumont have a low medium price; in fact most of them are located in the left part of the bar chart.
the properties in Tyler and Bryan-College Station are similar, with higher cost for the properties in Bryan-College Station.

If we do not divide the variable mean_price in classes, we can directly plot its probability density function for different cities, that confirm what was seen in previous bar chart.

ggplot(data = data)+
  geom_density(aes(x = mean_price, col = city, fill = city), alpha = 0, lwd = 1)+
  labs(title = "PROBABILITY DENSITY FUNCTION OF MEAN PRICE",
       x = " ",
       y = "DENSITY")+
  guides(colour = guide_legend(override.aes = list(alpha = 1)))+ # override aes in geom_density with alpha = 1
  scale_y_continuous(breaks = seq(0,0.05,0.005), limits = c(0, 0.045))+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

### EFFICIENCY OF SELLING ADS

There is not a unique way to calculate an efficiency index of selling ads, depending on what we consider as “efficient”, a different index can be calculated.

EFFICIENCY INDEX VOLUME BASED

If we correlate the efficiency with the earned money, volume can be used to calculate the efficiency index, that represents the medium volume (expressed in M$) coming from a single active sales announcement:

data$index_volume = volume/listings

summary(data$index_volume)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.007381 0.012976 0.016047 0.018638 0.020508 0.080256

data$index_volume_cl = cut(index_volume, breaks = seq(0, 0.090, 0.01))
# attach(data)

ggplot(data = data)+
  geom_bar(aes(x = index_volume_cl),
           stat = "count",
           col = "darkorange4",
           fill = "darkorange")+
  labs(x = "INDEX CLASSES",
       y = "EFFICIENCY INDEX",
       title = "EFFICIENCY INDEX (VOLUME BASED)")+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

The graph above shows that the modal class is (0.01, 0.02], meaning that for each advert published, the medium economic income is 0.015 M$ (central value of the class) (15000 $). We can also categorize by city:

ggplot(data = data)+
  geom_bar(aes(x = index_volume_cl, fill = city),
           stat = "count",
           col = "darkorange4",
           position = "dodge")+
  labs(x = "INDEX CLASSES",
       y = "EFFICIENCY INDEX",
       title = "EFFICIENCY INDEX (VOLUME BASED)")+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.8))+
  theme(plot.title = element_text(face = "bold"))

If we categorize the data by city, we can clearly see that this index is much higher (better) for Bryan-Collage Station. If we do a resume of this index categorized by city and years:

data$year_str <- as.character(year)
ggplot(data = data) +
  geom_boxplot(aes(x = city, 
                   y = index_volume, 
                   fill = year_str))+
  labs(title = "BOXPLOT OF INDEX (VOLUME BASED)",
       x = "CITY",
       y ="INDEX")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

The index trend is positive in all the cities except Wichita Falls. We can also see the trend line of these index over time:

month_vector = seq(1,60,1)

index_volume_Beaumont = c(data[which(city == "Beaumont"), 12])
index_volume_Bryan_College_Station = (data[which(city == "Bryan-College Station"), 12])
index_volume_Tyler = c(data[which(city == "Tyler"), 12])
index_volume_Wichita_Falls = c(data[which(city == "Wichita Falls"), 12])

ggplot()+
  geom_line(aes(x=month_vector, y=index_volume_Beaumont, col = "Beaumont"))+
  geom_hline(yintercept = mean(index_volume_Beaumont), col = "1")+
  geom_line(aes(x=month_vector, y=index_volume_Bryan_College_Station, col = "Bryan-College Station"))+
  geom_hline(yintercept = mean(index_volume_Bryan_College_Station), col = "2")+
  geom_line(aes(x=month_vector, y=index_volume_Tyler, col = "Tyler"))+
  geom_hline(yintercept = mean(index_volume_Tyler), col = "3")+
  geom_line(aes(x=month_vector, y=index_volume_Wichita_Falls, col = "Wichita Falls"))+
  geom_hline(yintercept = mean(index_volume_Wichita_Falls), col = "4")+
  labs(title = "EFFICIENCY INDEX BASED ON VOLUME OVER YEARS",
       x = "MONTH",
       y = "EFFICIENCY INDEX")+
  scale_color_manual(name = "",
                     breaks = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
                     values = c("1", "2", "3", "4"),
                     labels = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"))+
  scale_x_continuous(breaks = seq(0,60,12), limits = c(0, 60))+
  theme_classic()+
  theme(plot.title = element_text(hjust=1))+
  theme(plot.title = element_text(face = "bold"))

The index reaches higher values in Brian-College Station, meaning that the ratio volume - listings is much better then in the other cities. High seasonal trend is also present. The other cities index moves similarly over years.

EFFICIENCY INDEX SALES BASED

If we correlate the efficiency with the number of sold properties (regardless in their value) sales can be used to calculate the efficiency index, that represents the number of properties sold for each active sale announcement:

data$index_sales = sales/listings

summary(data$index_sales)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05014 0.08980 0.10963 0.11874 0.13492 0.38713

#attach(data)

data$index_sales_cl = cut(index_sales, breaks = seq(0, 0.40, 0.05))

# attach(data)
ggplot(data = data)+
  geom_bar(aes(x = index_sales_cl),
           stat = "count",
           col = "darkorange4",
           fill = "darkorange")+
  labs(x = "INDEX CLASSES",
       y = "EFFICIENCY INDEX",
       title = "EFFICIENCY INDEX (SALES BASED)")+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

The graph above shows that the modal class is (0.1, 0.15], meaning that for each advert published, 8 properties are sold (centre value of the class chosen). We can also categorize by city:

ggplot(data = data)+
  geom_bar(aes(x = index_sales_cl, fill = city),
           stat = "count",
           col = "darkorange4",
           position = "dodge")+
  labs(x = "INDEX CLASSES",
       y = "EFFICIENCY INDEX",
       title = "EFFICIENCY INDEX (SALES BASED)")+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.8))+
  theme(plot.title = element_text(face = "bold"))

We can see that this index is higher in Bryan-Collage Station. If we do a resume of this index categorizing by city and years:

ggplot(data = data) +
  geom_boxplot(aes(x = city, 
                   y = index_sales, 
                   fill = year_str))+
  labs(title = "BOXPLOT OF INDEX (SALES BASED)",
       x = "CITY",
       y ="INDEX")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

The index trend is positive in all the cities except Wichita Falls.
We can also see the trend line of these index over time:

index_sales_Beaumont = c(data[which(city == "Beaumont"), 15])
index_sales_Bryan_College_Station = (data[which(city == "Bryan-College Station"), 15])
index_sales_Tyler = c(data[which(city == "Tyler"), 15])
index_sales_Wichita_Falls = c(data[which(city == "Wichita Falls"), 15])

ggplot()+
  geom_line(aes(x=month_vector, y=index_sales_Beaumont, col = "Beaumont"))+
  geom_hline(yintercept = mean(index_sales_Beaumont), col = "1")+
  geom_line(aes(x=month_vector, y=index_sales_Bryan_College_Station, col = "Bryan-College Station"))+
  geom_hline(yintercept = mean(index_sales_Bryan_College_Station), col = "2")+
  geom_line(aes(x=month_vector, y=index_sales_Tyler, col = "Tyler"))+
  geom_hline(yintercept = mean(index_sales_Tyler), col = "3")+
  geom_line(aes(x=month_vector, y=index_sales_Wichita_Falls, col = "Wichita Falls"))+
  geom_hline(yintercept = mean(index_sales_Wichita_Falls), col = "4")+
  labs(title = "EFFICIENCY INDEX BASED ON SALES OVER YEARS",
       x = "MONTH",
       y = "EFFICIENCY INDEX")+
  scale_color_manual(name = "",
                     breaks = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
                     values = c("1", "2", "3", "4"),
                     labels = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"))+
  #scale_x_continuous(breaks = seq(0,60,12), limits = c(0, 60))+
  theme_classic()+
  theme(plot.title = element_text(hjust=1))+
  theme(plot.title = element_text(face = "bold"))

The index reaches higher values in Brian-College Station, meaning that the ratio sales - listings is better then in the other cities. Looking at the other locations, a good ratio (looking at the mean value over years) is Whicita Falls e Beaumont. In this classification, Tyler is the worst. High seasonal trend is also present. The other cities index moves similar over years.

EFFICIENCY INDEX SPEED BASED

If we correlate the efficiency to the speed at which one listings passes from active to deactivated, we can calculate ad additional efficiency index as the ratio between months_inventory and listings; it represents the medium time (expressed in month) to sell one estate present in the active sale announcement listing:

data$index_speed = months_inventory/listings

summary(data$index_speed)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.003037 0.004238 0.005680 0.005881 0.007152 0.009360

# attach(data)

data$index_speed_cl = cut(index_speed, breaks = seq(0.003, 0.010, 0.001))

Categorizing by city:

# attach(data)
ggplot(data = data)+
  geom_bar(aes(x = index_speed_cl, fill = city),
           stat = "count",
           col = "darkorange4",
           position = "dodge")+
  labs(x = "INDEX CLASSES",
       y = "EFFICIENCY INDEX",
       title = "EFFICIENCY INDEX (SPEED BASED)")+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.8))+
  theme(plot.title = element_text(face = "bold"))

We can see that this index is higher in Wichita Falls and minimum in Tyler. It means that once one advertisement is published it requires more time in Wichita Falls then in Tyler. Contrary to the other indices, this is better if small. If we do a resume of this index categorizing by city and years:

ggplot(data = data) +
  geom_boxplot(aes(x = city, 
                   y = index_speed, 
                   fill = year_str))+
  labs(title = "BOXPLOT OF INDEX (SPEED BASED)",
       x = "CITY",
       y ="INDEX")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

The index trend is negative (good) in all the cities except Wichita Falls, where there isn’t a clear trend line. We can also see the evolution of this speed index over years:

index_speed_Beaumont = c(data[which(city == "Beaumont"), 17])
index_speed_Bryan_College_Station = (data[which(city == "Bryan-College Station"), 17])
index_speed_Tyler = c(data[which(city == "Tyler"), 17])
index_speed_Wichita_Falls = c(data[which(city == "Wichita Falls"), 17])

ggplot()+
  geom_line(aes(x=month_vector, y=index_speed_Beaumont, col = "Beaumont"))+
  geom_hline(yintercept = mean(index_speed_Beaumont), col = "1")+
  geom_line(aes(x=month_vector, y=index_speed_Bryan_College_Station, col = "Bryan-College Station"))+
  geom_hline(yintercept = mean(index_speed_Bryan_College_Station), col = "2")+
  geom_line(aes(x=month_vector, y=index_speed_Tyler, col = "Tyler"))+
  geom_hline(yintercept = mean(index_speed_Tyler), col = "3")+
  geom_line(aes(x=month_vector, y=index_speed_Wichita_Falls, col = "Wichita Falls"))+
  geom_hline(yintercept = mean(index_speed_Wichita_Falls), col = "4")+
  labs(title = "EFFICIENCY INDEX BASED ON SPEED OVER YEARS",
       x = "MONTH",
       y = "EFFICIENCY INDEX")+
  scale_color_manual(name = "",
                     breaks = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
                     values = c("1", "2", "3", "4"),
                     labels = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"))+
  scale_x_continuous(breaks = seq(0,60,12), limits = c(0, 60))+
  theme_classic()+
  theme(plot.title = element_text(hjust=1))+
  theme(plot.title = element_text(face = "bold"))

The index reaches lower values in Tyler, meaning that the speed is higher then in the other cities. No seasonal trend is present.

EFFICIENCY INDEX CONCLUSION

We can do a resume for what concern the efficiency index:

volume and sales based indices move similarly, and it can be reasonable considering that there is a high correlation between the number of sold properties and the volume; but the volume that we consider, is affected by the mean price of the property (depending on location), so it is obvious that Wichita Falls (that is the less expansive city) will perform worse then the other cities. So, the index based on sales could be considered a better “quality factor”.
If we put together the information coming from sales index and speed index, we can conclude that the city where sale announcements last less is Tyler (followed by Bryan-College Station and Beaumont, with very similar historical trend). Bryan-College Station is the city with higher seasonal trend, showing an high number of properties sold during Summer.

7) Conditional analysis

Use the package dplyr or the base language R to make statistical analysis conditioned by city, year and month. Generate some summary (mean, standard deviation) and represents the results by graphs.

VARIABLE sales

Adopt dplyr to do the conditional analysis:

sales_city

data %>% 
  group_by(city) %>%
  summarize(mean_sales_city = mean(sales), sd_sales_city = sd(sales))

ggplot(data = data) +
  geom_boxplot(aes(x = city, y = sales), fill = "darkorange1", col = "black")+
  #geom_hline(yintercept = mean(mean_sales_city), col = "blue", lwd = 0.8)+
  labs(title = "BOXPLOT - SALES FOR DIFFERENT CITIES", x = "CITY", y =" ")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

From the graph above is clear that the city with the minor number of selling properties is Wichita Falls (below the mean), followed by Beaumont (also below the mean), Bryan-College Station and Tyler (both above the mean).

The city with lower standard deviation of sold properties is Wichita Falls, followed by Beaumont, Tyler and Bryan-College Station. It means that the market (even if the sales number is less then in the other cities) is more stable, with less variance, in Wichita Falls.

sales_year

data %>% 
  group_by(year) %>%
  summarize(mean_sales_year = mean(sales), sd_sales_year = sd(sales))

ggplot(data = data) +
  geom_boxplot(aes(x = year_str, y = sales), fill = "darkorange1", col = "black")+
  #geom_hline(yintercept = mean(mean_sales_year), col = "blue", lwd = 0.8)+
  labs(title = "BOXPLOT - SALES OVER YEARS", x = "YEAR", y =" ")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

From the graph above we can say that there is a positive trend (from 2010 to 2014) for what concern the number of sold properties (exception for a small decrease in 2010/2011). Also the standard deviation shows a positive trend, reaching its maximum value in 2014; it means that, for that year, the number of sold properties during different months, has high variability (seasonal trend). We can also group the dataset for city and year:

data %>%
  group_by(city, year) %>% 
  summarise(mean_sales_city_year = mean(sales),
            sd_sales_city_year = sd(sales),
            .groups = 'drop')

ggplot(data = data) +
  geom_boxplot(aes(x = city, y = sales, fill = year_str), col = "black")+
  #geom_hline(yintercept = mean(mean_sales_city), col = "blue", lwd = 0.8)+
  labs(title = "BOXPLOT - SALES (CITY AND YEAR)", x = "CITY", y =" ")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

This graph shows that the number of sales has a positive trend over years 2010-2014 for all cities except from Wichita Falls, where it can be considered quite constant.

sales_month

data_sales_month <- data %>% 
  group_by(month) %>%
  summarize(mean_sales_month = mean(sales), sd_sales_month = sd(sales))

data_sales_month

ggplot(data_sales_month, aes(x=month, y=mean_sales_month))+
  geom_bar(position=position_dodge(), stat="identity", 
           fill = "darkorange1", col = "darkorange4") + 
  geom_errorbar(aes(ymin = mean_sales_month - sd_sales_month,
                    ymax = mean_sales_month + sd_sales_month),
                col = "black", width=.2)+
  labs(title = "MEAN AND STANDARD DEVIATION OF SALES OVER MONTHS",
       x = "MONTHS",
       y = "MEAN/STANDARD DEVIATION")+
  geom_hline(yintercept = mean(mean_sales_month), col = "blue", lwd = 0.8)+
  scale_x_continuous(breaks = seq(1,12,1), limits = c(0, 13))+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.4))+
  theme(plot.title = element_text(face = "bold"))

The bar chart above shows that, in years from 2010 to 2014, there is a seasonal trend, showing that the number of sold properties is higher during Summer (above the mean for months from April to August).

VARIABLE volume

volume_city

data %>% 
  group_by(city) %>%
  summarize(mean_volume_city = mean(volume), sd_volume_city = sd(volume))

ggplot(data = data) +
  geom_boxplot(aes(x = city, y = volume), fill = "darkorange1", col = "black")+
  #geom_hline(yintercept = mean(mean_volume_city), col = "blue", lwd = 0.8)+
  labs(title = "BOXPLOT - VOLUME FOR DIFFERENT CITIES", x = "CITY", y =" ")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

The volume is lower (under the mean value) for Whicita Falls and Beaumont, higher (above the mean) for Brian_College Station and Tyler. Variance of volume is also minimum in Whicita Falls (it follows also the trend of sales) and higher in Brian_College Station.

volume_year

data %>% 
  group_by(year) %>%
  summarize(mean_volume_year = mean(volume), sd_volume_year = sd(volume))

ggplot(data = data) +
  geom_boxplot(aes(x = year_str, y = volume), fill = "darkorange1", col = "black")+
  #geom_hline(yintercept = mean(mean_volume_year), col = "blue", lwd = 0.8)+
  labs(title = "BOXPLOT - VOLUME OVER YEARS", x = "YEARS", y =" ")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

The volume trend is positive, rising from 2010 to 2014, with higher variance in 2014.
If we group the variable volume by city and year:

data %>%
  group_by(city, year) %>% 
  summarise(mean_volume_city_year = mean(volume),
            sd_volume_city_year = sd(volume),
            .groups = 'drop')

ggplot(data = data) +
  geom_boxplot(aes(x = city, y = volume, fill = year_str), col = "black")+
  #geom_hline(yintercept = mean(mean_volume_city), col = "blue", lwd = 0.8)+
  labs(title = "BOXPLOT - VOLUME (CITY AND YEAR)", x = "CITY", y =" ")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

The graph above shows that there is a positive trend for volume variable over years, for all cities, except for Wichita Falls. This result confirms the correlation between volume and sales, in fact they follow (qualitatively) the same trend.

volume_month

data_volume_month <- data %>% 
  group_by(month) %>%
  summarize(mean_volume_month = mean(volume), sd_volume_month = sd(volume))

data_volume_month

ggplot(data_volume_month, aes(x=month, y=mean_volume_month))+
  geom_bar(position=position_dodge(), stat="identity", 
           fill = "darkorange1", col = "darkorange4") + 
  geom_errorbar(aes(ymin = mean_volume_month - sd_volume_month,
                    ymax = mean_volume_month + sd_volume_month),
                col = "black", width=.2)+
  labs(title = "MEAN AND STANDARD DEVIATION OF VOLUME OVER MONTHS",
       x = "MONTHS",
       y = "MEAN/STANDARD DEVIATION")+
  geom_hline(yintercept = mean(mean_volume_month), col = "blue", lwd = 0.8)+
  scale_x_continuous(breaks = seq(1,12,1), limits = c(0, 13))+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.4))+
  theme(plot.title = element_text(face = "bold"))

The graph above shows a seasonal trend of variale volume for different years, with higher value (over the mean) from April to August (just like sales trend).

VARIABLE median_price

median_price_city

data %>% 
  group_by(city) %>%
  summarize(mean_median_price_city = mean(median_price), sd_median_price_city = sd(median_price))

ggplot(data = data) +
  geom_boxplot(aes(x = city, y = median_price), fill = "darkorange1", col = "black")+
  #geom_hline(yintercept = mean(mean_median_price_city), col = "blue", lwd = 0.8)+
  labs(title = "BOXPLOT - MEDIAN PRICE FOR DIFFERENT CITIES", x = "CITY", y =" ")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.25))+
  theme(plot.title = element_text(face = "bold"))

Median price is lower for Whicita Falls and Beaumont, higher (above the mean) for Brian_College Station and Tyler.

median_price_year

data %>% 
  group_by(year) %>%
  summarize(mean_median_price_year = mean(median_price), sd_median_price_year = sd(median_price))

ggplot(data = data) +
  geom_boxplot(aes(x = year_str, y = median_price), fill = "darkorange1", col = "black")+
  #geom_hline(yintercept = mean(mean_median_price_year), col = "blue", lwd = 0.8)+
  labs(title = "BOXPLOT - MEDIAN PRICE OVER YEARS", x = "YEAR", y =" ")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

The volume trend of median price is slightly positive, rising from 2010 to 2014, with higher variance in 2014.

median_price_month

data_median_price_month <- data %>% 
  group_by(month) %>%
  summarize(mean_median_price_month = mean(median_price), sd_median_price_month = sd(median_price))

data_median_price_month

ggplot(data_median_price_month, aes(x=month, y=mean_median_price_month))+
  geom_bar(position=position_dodge(), stat="identity", 
           fill = "darkorange1", col = "darkorange4") + 
  geom_errorbar(aes(ymin = mean_median_price_month - sd_median_price_month,
                    ymax = mean_median_price_month + sd_median_price_month),
                col = "black", width=.2)+
  labs(title = "MEAN AND STANDARD DEVIATION OF MEDIAN PRICE OVER MONTHS",
       x = "MONTH",
       y = "MEAN/STANDARD DEVIATION")+
  #geom_hline(yintercept = mean(mean_median_price_month), col = "black", lwd = 0.8)+
  scale_x_continuous(breaks = seq(1,12,1), limits = c(0, 13))+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.4))+
  theme(plot.title = element_text(face = "bold"))

The graph above shows that there is not a strong seasonal trend for what concern the median price of sold properties, only a peak of median price in June (means that, on avarage, on June houses with higher median price are sold). this is reasonable because it is normal that the median price is constant. The peak suring Summer colud be affected by the seasonal trend and properties request by the market.

VARIABLE listings

listings_city

data %>% 
  group_by(city) %>%
  summarize(mean_listings_city = mean(listings), sd_listings_city = sd(listings))

ggplot(data = data) +
  geom_boxplot(aes(x = city, y = listings), fill = "darkorange1", col = "black")+
  labs(title = "BOXPLOT - LISTINGS FOR DIFFERENT CITIES", x = "CITY", y =" ")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

The city with the higher numer of active sale announcements is Tyler, followed by Beaumont, Bryan-College Station and Wichita Falls.

sales_year

data %>% 
  group_by(year) %>%
  summarize(mean_listings_year = mean(listings), sd_listings_year = sd(listings))

ggplot(data = data) +
  geom_boxplot(aes(x = year_str, y = listings), fill = "darkorange1", col = "black")+
  labs(title = "BOXPLOT - LISTINGS OVER YEARS", x = "YEAR", y =" ")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

From the graph above we can see that the number of listing is decreasing, passing from 1826 (in 2010) to 1560 (in 2014).

data %>%
  group_by(city, year) %>% 
  summarise(mean_listings_city_year = mean(listings),
            sd_listings_city_year = sd(listings),
            .groups = 'drop')

ggplot(data = data) +
  geom_boxplot(aes(x = city, y = listings, fill = year_str), col = "black")+
  labs(title = "BOXPLOT - LISTINGS (CITY AND YEAR)", x = "CITY", y =" ")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

This graph shows that the number of sale announcements is decreased everywhere, except for Wichita Falls.

sales_month

data_listings_month <- data %>% 
  group_by(month) %>%
  summarize(mean_listings_month = mean(listings), sd_listings_month = sd(listings))

data_listings_month

ggplot(data_listings_month, aes(x=month, y=mean_listings_month))+
  geom_bar(position=position_dodge(), stat="identity", 
           fill = "darkorange1", col = "darkorange4") + 
  geom_errorbar(aes(ymin = mean_listings_month - sd_listings_month,
                    ymax = mean_listings_month + sd_listings_month),
                col = "black", width=.2)+
  labs(title = "MEAN AND STANDARD DEVIATION OF LISTINGS OVER MONTHS",
       x = "MONTHS",
       y = "MEAN/STANDARD DEVIATION")+
  scale_x_continuous(breaks = seq(1,12,1), limits = c(0, 13))+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.4))+
  theme(plot.title = element_text(face = "bold"))

The bar chart above shows that the number of listings suffers a light increase during Summer.

8) Creation of graphs with ggplot2

Use ggplot2 to create personalized graphs. Be sure to explore:
- Boxplot to compare the distribution of median price for different cities.
- Bar chart to compare the total of sales over months and cities.
- Line chart to compare the trend of sales in different historical period.

BOXPLOT OF MEDIAN PRICE

Already done in section 7, reported also here:

ggplot(data = data)+
  geom_boxplot(aes(x = city, y = median_price),
               fill = "darkorange1",
               col = "black")+
  labs(title = "BOXPLOT OF MEDIAN PRICE",
       x = "CITY",
       y =" ")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

Higher median price is present in Bryan-College Station, lower in Wichita Falls.

Using also gghalves package, we can show also the probability function distribution in the same graph:

ggplot(data = data)+
  geom_half_boxplot(aes(x = city, y = median_price),
                    side = 'l', fill = "orange")+
  geom_half_violin(aes(x = city, y = median_price),
                    side = 'r', fill = "lightblue")+
  labs(title = "gghalves BOXPLOT OF MEDIAN PRICE",
       x = "CITY",
       y ="")+
    theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

From the graph above it is possible to highlight the positive skewness of Beaumont and the negative skewness of Bryan-College Station.

We can also study the distribution of median price for different cities, including the variable sales (previously divided in classes).

ggplot(data = data) +
  geom_boxplot(aes(x = city, 
                     y = median_price, 
                     fill = sales_cl)) +
  labs(title = "BOXPLOT OF MEDIAN PRICE (SALES AND CITY)",
       x = "CITY",
       y ="MEDIAN PRICE")+
  scale_y_continuous(breaks = seq(0,180000,10000), limits = c(70000, 180000))+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

This graph shows that:

in Beaumont the higher sales number (class (225, 275]) is registered for properties with higher median price.
the same trend in Bryan-College Station, where the higher number of sales (class (375, 425]) is registered for most expansive house (mean median class around 170000 $).
trend quite similar in Tyler.
in Wichita Falls the registered sales number is lower and for properties with lower median price.

We can repeat this analysis, showing boxplot of median price for different cities over the years:

ggplot(data = data) +
  geom_boxplot(aes(x = city, 
                     y = median_price, 
                     fill = year_str)) +
  labs(title = "BOXPLOT OF MEDIAN PRICE (CITY AND YEAR)",
       x = "CITY",
       y ="MEDIAN PRICE")+
  scale_y_continuous(breaks = seq(0,180000,10000), limits = c(70000, 180000))+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

Looking at the median price over the years for different cities, we can observe that: * in Wichita Falls and Tyler the median price is increased over years, with a reduction of its variance, meaning an higher number of expansive properties, with a price more stable during the years. * Beaumont is the city that registered an higher median price rise.

SALES

data_sales_city_month_sum <- data %>%
  group_by(city, month) %>% 
  summarise(sum_data_sales_city_month = sum(sales),
            .groups = 'drop') %>%
  as.data.frame()

data_sales_city_month_sum

ggplot(data_sales_city_month_sum, aes(x=month, y=sum_data_sales_city_month, fill = city))+
  geom_bar(position="stack", stat="identity", col = "black")+
  labs(title = "SUM OF SALES FOR CITIES/MONTHS",
       x = "MONTHS",
       y = "SALES NUMBER")+
  scale_y_continuous(breaks = seq(0,5000,1000), limits = c(0, 5000))+
  scale_x_continuous(breaks = seq(1,12,1), limits = c(0, 13))+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

If we do the same graph with the attribute position equal to dodge.

ggplot(data_sales_city_month_sum, aes(x=month, y=sum_data_sales_city_month, fill = city))+
  geom_bar(position="dodge", stat="identity", col = "black")+
  labs(title = "SUM OF SALES FOR CITIES/MONTHS",
       x = "MONTHS",
       y = "SALES NUMBER")+
  scale_y_continuous(breaks = seq(0,1750,250), limits = c(0, 1750))+
  scale_x_continuous(breaks = seq(1,12,1), limits = c(0, 13))+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

From the stack graph, it is clear a total seasonal trend for what concern all the sales; the same graph, with the attribute position equal to dodge, has been made to see if this seasonal trend is present in all the cities. The lower number of properties has been sold in Wichita Falls in January (442), the higher in Tyler in June (1635). Overall Beaumont, Bryan-College Station and Tyler shows a clear seasonal trend, where the sales increases in Summer (except for the month of July in Beaumont). For what concern Wichita Falls, this trend is also present, but in a less markedly way.

The same graph in a normalized scale:

ggplot(data_sales_city_month_sum, aes(x=month, y=sum_data_sales_city_month, fill = city))+
  geom_bar(position="fill", stat="identity", col = "black")+
  labs(title = "SUM OF SALES FOR CITIES/MONTHS NORMALIZED",
       x = "MONTHS",
       y = "SALES NUMBER")+
  scale_x_continuous(breaks = seq(1,12,1), limits = c(0, 13))+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

The normalized version shows more clearly the Bryan-College Station rise trend during Summer.

To add the variable year to the same graph, we can show the variable sales and not its total.

ggplot(data, aes(x=month, y=sales, fill = city))+
  geom_bar(position="fill", stat="identity", col = "black")+
  labs(title = "SUM OF SALES FOR CITIES/MONTHS",
       x = "MONTHS",
       y = "SALES NUMBER")+
  scale_x_continuous(breaks = seq(1,12,1), limits = c(0, 13))+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

LINE CHART OF SALES OVER YEARS

This graph shows the total sale registered with a net positive trend.

sales_2010 = sum(data[which(year==2010), 4])
sales_2011 = sum(data[which(year==2011), 4])
sales_2012 = sum(data[which(year==2012), 4])
sales_2013 = sum(data[which(year==2013), 4])
sales_2014 = sum(data[which(year==2014), 4])

sales_year = c(sales_2010, sales_2011, sales_2012, sales_2013, sales_2014)

ggplot()+
  geom_line(aes(x=c(2010, 2011, 2012, 2013, 2014), y=sales_year), col = "orange")+
  geom_point(aes(x=c(2010, 2011, 2012, 2013, 2014), y=sales_year), shape=7, col = "black")+
  labs(title = "TREND LINE OF TOTAL SALES OVER YEARS",
       x = "YEARS",
       y = "SALES NUMBER")+
  theme_classic()+
  theme(plot.title = element_text(hjust=0.5))+
  theme(plot.title = element_text(face = "bold"))

We can separate for different cities to observe their single behaviors:

data_sales_city_year_sum <- data %>%
  group_by(city, year) %>% 
  summarise(sum_data_sales_city_year = sum(sales),
            .groups = 'drop') %>%
  as.data.frame()

ggplot(data = data_sales_city_year_sum)+
  geom_line(aes(x=year, y=sum_data_sales_city_year, col = city))+
  geom_point(aes(x=year, y=sum_data_sales_city_year), shape=7, col = "black")+
  labs(title = "TREND LINE OF TOTAL SALES OVER YEARS FOR DIFFERENT CITIES",
       x = "YEARS",
       y = "SALES NUMBER")+
  theme_classic()+
  theme(plot.title = element_text(hjust=0))+
  theme(plot.title = element_text(face = "bold"))

It shows strong positive trend for all the cities except for Wichita Falls. If we plot the volume over years for different cities:

data_volume_city_year_sum <- data %>%
  group_by(city, year) %>% 
  summarise(sum_data_volume_city_year = sum(volume),
            .groups = 'drop') %>%
  as.data.frame()

ggplot(data = data_volume_city_year_sum)+
  geom_line(aes(x=year, y=sum_data_volume_city_year, col = city))+
  geom_point(aes(x=year, y=sum_data_volume_city_year), shape=7, col = "black")+
  labs(title = "TREND LINE OF TOTAL VOLUME OVER YEARS FOR DIFFERENT CITIES",
       x = "YEARS",
       y = "VOLUME [M$]")+
  theme_classic()+
  theme(plot.title = element_text(hjust=0))+
  theme(plot.title = element_text(face = "bold"))

Just like sales, volume shows a positive trend for all cities except for Wichita Falls.

9) Conclusions

Make a resume of obtained results, referring to main trends e providing reccomendations based on analysis.

The mean median price of Wichita Falls in 2010 - 2014 is 101 k$, the mean number of sales properties per month is 116 and the mean volume per month is 13.93 M$.

The mean median price of Beaumont in 2010 - 2014 is 130 k$, the mean number of sales properties per month is 177 and the mean volume per month is 26.13 M$.

The mean median price of Bryan-College Station in 2010 - 2014 is 157 k$, the mean number of sales properties per month is 205 and the mean volume per month is 38.19 M$.

The mean median price of Tyler in 2010 - 2014 is 141 k$, the mean number of sales properties per month is 269 and the mean volume per month is 45.76 M$.

The city with higher sales, higher volume is Tyler. It is also the city with the higher number of sale announcements active, showing high performance for what concern the speed index.

The trend market is positive everywhere except for Wichita Falls; high seasonable trend is present in all the other city, mainly in Bryan-College Station.

Overall, the median price of a property in Texas has been passed from 130 k$ in 2010 to 139 k$ in 2014, with an increase equal to +14 %. The number of sales increase is +36 %. In the mean time, the acitve announcements is reduced over years, passing from a mean value of 1826 in 2010 to 1560 in 2014; despite this, the volume is increased from 25.67 M$ in 2010 to 39.77 M$ in 2014, with a percentage increase of 54 %. Considering that the median price increase is 14 %, the estate market in Texas is in a rapid grow phase.

Analysis of the Texas real estate market

Giuseppe Frascolla

10-11-2024

INTRODUCTION

IMPORT PACKAGES

IMPORT DATASET

1) Variables analysis

Variable city

Variable year

Variable month

Variable sales

Variable volume

Variable median_price

Variable listings

Variable months_inventory

2) Position, variability and shape indices

Variable sales

Position indices

Variability indices

Shape indices

Variable volume

Position indices

Variability indices

Shape indices

Variable median_price

Position indices

Variability indices

Shape indices

Variable listings

Position indices

Variability indices

Shape indices

Variable months_inventory

Position indices

Variability indices

Shape indices

Variable city

Variable year

Variable month

3) Identification of variables with higher variability and asymmetry

VARIABLE WITH HIGHER VARIABILITY

VARIABLE WITH HIGHER ASIMMETRY

4) Creation of classes for a quantitative variable

CLASS CREATION

FREQUENCY DISTRIBUTION WITH BAR CHART

HETEROGENEITY INDEX OF GINI

5) Probability calculation

PROBABILITY THAT CITY IS “Beaumont”

PROBABILITY THAT MONTH IS JULY

PROBABILITY THAT MONTH IS DECEMBER 2012

6) Creation of new variables

CALCULATION OF MEAN PRICE OF ESTATES

EFFICIENCY INDEX VOLUME BASED

EFFICIENCY INDEX SALES BASED

EFFICIENCY INDEX SPEED BASED

EFFICIENCY INDEX CONCLUSION

7) Conditional analysis

VARIABLE sales

VARIABLE volume

VARIABLE median_price

VARIABLE listings

8) Creation of graphs with ggplot2

BOXPLOT OF MEDIAN PRICE

SALES

LINE CHART OF SALES OVER YEARS

9) Conclusions