The goal of the project, developed for Texas Realty Insights, is to analyze historical real estate data in the state of Texas to extract statistical insights and create visual representations that can support the company in making strategic decisions.
The analysis focuses on understanding trends in home sales, price distribution, and market dynamics across cities and over time (months and years). By exploring the relationship between key variables such as number of sales, total volume, median price, active listings, and inventory months, the project aims to:
Evaluate the effectiveness of real estate listings and marketing efforts.
Identify cities or periods with growth potential or underperformance.
Help the company optimize property pricing and inventory management.
Deliver clear and data-driven recommendations for sales strategy.
The outcome will be a comprehensive statistical report with summaries, graphs, and interpretations to guide data-informed decision-making in the Texas housing market.
dataframe creation
df <- read.csv("realestate_texas.csv")
packages:
install.packages("moments")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("ineq")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(moments)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(ineq)
“city” (of reference): Qualitative (Nominal), possibility to create a distribution of frequencies and a barplot.
“year” (of reference): Temporal (discrete), time series
“month” (of reference): Temporal (discrete), time series
“sales” (total of sales): Quantitative discrete, position indexes, variability indexes, Shape Indexes
“volume” (total value (in M dollars) of the sales): Quantitative continuous, position indexes, variability indexes, Shape Indexes
“median_price” (median sale price in dollars): Quantitative continuous (eve if for simplicity rounded), position indexes, variability indexes, Shape Indexes
“listing” (total number of active advertisements): Quantitative discrete, position indexes, variability indexes, Shape Indexes
“months_inventory” (months needed to sell all the active advertisements): Quantitative continuous, position indexes, variability indexes, Shape Indexes
2° assignment, calculate indexes for all the variables where is possible,
for the others create a distribution of freq, comment briefly:
Mode function:
mode_value <- function(x) names(which.max(table(x)))
—- sales —-
summary_sales = summary(df$sales) # Summary statistics: min, 1st Qu., median, mean, 3rd Qu., max
mean_sales = mean(df$sales) # Mean: average sales
median_sales = median(df$sales) # Median: middle value of sales
mode_sales = mode_value(df$sales) # Most frequent number of sales
variance_sales = var(df$sales) # Variance: how much sales values vary
sd_sales = sd(df$sales) # Standard deviation: average distance from the mean
CoV_sales = sd(df$sales) / mean(df$sales) # CoV: relative measure of variability
skewness_sales = skewness(df$sales) # Skewness: asymmetry of the sales distribution
kurtosis_sales = kurtosis(df$sales) - 3 # Excess kurtosis: peakedness vs. normal distribution
Plot density of sales
plot(density(df$sales), # Density plot
main = "Density of Sales", xlab = "Sales")
abline(v = mean_sales, col = "red", lwd = 2) # Red = mean
abline(v = median_sales, col = "green", lwd = 2) # Green = median
abline(v = mode_sales, col = "blue", lwd = 2)
legend("topright", legend = c("Mean", "Median", "Mode"), col = c("red", "green", "blue"), lwd = 2, box.lty = 0)
comments:
looks like it is positively skewed from the density plot
—- volume —-
summary_volume <- summary(df$volume) # Summary statistics
mean_volume <- mean(df$volume) # Mean
median_volume <- median(df$volume) # Median
mode_volume <- mode_value(df$volume) # Mode
variance_volume <- var(df$volume) # Variance
sd_volume <- sd(df$volume) # Standard deviation
CoV_volume <- sd_volume / mean_volume # Coefficient of Variation
skewness_volume <- skewness(df$volume) # Skewness
kurtosis_volume <- kurtosis(df$volume) - 3 # Excess kurtosis
plot(density(df$volume), # Density plot
main = "Density of Volume", xlab = "Volume")
abline(v = mean_volume, col = "red", lwd = 2) # Red = mean
abline(v = median_volume, col = "green", lwd = 2) # Green = median
abline(v = mode_volume, col = "blue", lwd = 2) # Blue = mode
legend("topright", legend = c("Mean", "Median", "Mode"), col = c("red", "green", "blue"), lwd = 2, box.lty = 0)
comments:
looks like it is positively skewed from the density plot
—- median_price —-
summary_median_price <- summary(df$median_price) # Summary statistics
mean_median_price <- mean(df$median_price) # Mean
median_median_price <- median(df$median_price) # Median
mode_median_price <- mode_value(df$median_price) # Mode
variance_median_price <- var(df$median_price) # Variance
sd_median_price <- sd(df$median_price) # Standard deviation
CoV_median_price <- sd_median_price / mean_median_price # Coefficient of Variation
skewness_median_price <- skewness(df$median_price) # Skewness
kurtosis_median_price <- kurtosis(df$median_price) - 3 # Excess kurtosis
plot(density(df$median_price), # Density plot
main = "Density of Median Price", xlab = "Median Price")
abline(v = mean_median_price, col = "red", lwd = 2) # Red = mean
abline(v = median_median_price, col = "green", lwd = 2) # Green = median
abline(v = mode_median_price, col = "blue", lwd = 2) # Blue = mode
legend("topright", legend = c("Mean", "Median", "Mode"), col = c("red", "green", "blue"), lwd = 2, box.lty = 0)
comments: # looks like it is negatively skewed, also kurtosis is negative, so it is a platykurtic distribution.
—- listings —-
summary_listings <- summary(df$listings) # Summary statistics
mean_listings <- mean(df$listings) # Mean
median_listings <- median(df$listings) # Median
mode_listings <- mode_value(df$listings) # Mode
variance_listings <- var(df$listings) # Variance
sd_listings <- sd(df$listings) # Standard deviation
CoV_listings <- sd_listings / mean_listings # Coefficient of Variation
skewness_listings <- skewness(df$listings) # Skewness
kurtosis_listings <- kurtosis(df$listings) - 3 # Excess kurtosis
plot(density(df$listings), # Density plot
main = "Density of Listings", xlab = "Listings")
abline(v = mean_listings, col = "red", lwd = 2) # Red = mean
abline(v = median_listings, col = "green", lwd = 2) # Green = median
abline(v = mode_listings, col = "blue", lwd = 2) # Blue = mode
legend("topright", legend = c("Mean", "Median", "Mode"), col = c("red", "green", "blue"), lwd = 2, box.lty = 0)
comments:
it looks like there are three modes, anyway the bigger one is 1581. About the skewness, it is moderately positive.
anyway it’s clearly not a normal distribution, because of the three modes. data says that’s platykurtic and positively skewed.
—- months_inventory —-
summary_months_inventory <- summary(df$months_inventory) # Summary statistics
mean_months_inventory <- mean(df$months_inventory) # Mean
median_months_inventory <- median(df$months_inventory) # Median
mode_months_inventory <- mode_value(df$months_inventory) # Mode
variance_months_inventory <- var(df$months_inventory) # Variance
sd_months_inventory <- sd(df$months_inventory) # Standard deviation
CoV_months_inventory <- sd_months_inventory / mean_months_inventory # Coefficient of Variation
skewness_months_inventory <- skewness(df$months_inventory) # Skewness
kurtosis_months_inventory <- kurtosis(df$months_inventory) - 3 # Excess kurtosis
plot(density(df$months_inventory), # Density plot
main = "Density of Months Inventory", xlab = "Months of Inventory")
abline(v = mean_months_inventory, col = "red", lwd = 2) # Red = mean
abline(v = median_months_inventory, col = "green", lwd = 2) # Green = median
abline(v = mode_months_inventory, col = "blue", lwd = 2) # Blue = mode
legend("topright", legend = c("Mean", "Median", "Mode"), col = c("red", "green", "blue"), lwd = 2, box.lty = 0)
comments: # it is clearly not a perfect normal distribution, it looks
like there are three modes again # but it is not so bad like the
previous variable. # The skewness is close to 0, and the kurtosis is
very slightly negative (-0.17).
—- city (absolute and relative frequency) —-
table(df$city) # Absolute frequency of each city
##
## Beaumont Bryan-College Station Tyler
## 60 60 60
## Wichita Falls
## 60
prop.table(table(df$city)) # Relative frequency of each city
##
## Beaumont Bryan-College Station Tyler
## 0.25 0.25 0.25
## Wichita Falls
## 0.25
—- year (absolute and relative frequency) —-
table(df$year) # Absolute frequency of each year
##
## 2010 2011 2012 2013 2014
## 48 48 48 48 48
prop.table(table(df$year)) # Relative frequency of each year
##
## 2010 2011 2012 2013 2014
## 0.2 0.2 0.2 0.2 0.2
—- month (absolute and relative frequency) —-
table(df$month) # Absolute frequency of each month
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 20 20 20 20 20 20 20 20 20 20 20 20
prop.table(table(df$month)) # Relative frequency of each month
##
## 1 2 3 4 5 6 7
## 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
## 8 9 10 11 12
## 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
comments: each modality in city, year and month is equally represented.
so let’s find the variable with highest CoV:
Vector of CoV values (previously calculated)
CoV_values <- c(
sales = CoV_sales,
volume = CoV_volume,
median_price = CoV_median_price,
listings = CoV_listings,
months_inventory = CoV_months_inventory
)
print(sort(CoV_values, decreasing = TRUE))
## volume listings sales months_inventory
## 0.5370536 0.4330833 0.4142203 0.2506031
## median_price
## 0.1708218
The variable with the highest variability is ‘volume’ with a CoV of 0.53.
Vector of skewness values (previously calculated)
skewness_values <- c(
sales = skewness_sales,
volume = skewness_volume,
median_price = skewness_median_price,
listings = skewness_listings,
months_inventory = skewness_months_inventory
)
print(sort(abs(skewness_values), decreasing = TRUE))
## volume sales listings median_price
## 0.88474203 0.71810402 0.64949823 0.36455288
## months_inventory
## 0.04097527
The variable with the highest variability is ‘volume’ with a skewness of 0.88.
statistical considerations: Data indicates that, in most cases, sales occur in relatively low quantities
and are associated with moderate total volumes, while median housing prices remain relatively high.
This suggests that the market activity is driven by not so big amount of transactions,
each involving properties of relatively high individual value.
# Define 6 intervals using 7 breakpoints for median_price
breaks <- seq(min(df$median_price), max(df$median_price), length.out = 7)
labels <- c("Very Low", "Low", "Mid-Low", "Mid-High", "High", "Very High")
# Assign labeled classes based on median_price
df$price_class <- cut(df$median_price, breaks = breaks, labels = labels, include.lowest = TRUE)
# Frequency table
freq_table <- df %>%
group_by(price_class) %>%
summarise(frequency = n()) %>%
mutate(relative_freq = frequency / sum(frequency))
print(freq_table)
## # A tibble: 6 × 3
## price_class frequency relative_freq
## <fct> <int> <dbl>
## 1 Very Low 13 0.0542
## 2 Low 33 0.138
## 3 Mid-Low 37 0.154
## 4 Mid-High 73 0.304
## 5 High 68 0.283
## 6 Very High 16 0.0667
# Bar plot
ggplot(freq_table, aes(x = price_class, y = frequency)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Median Price Frequency Distribution",
x = "Price Class",
y = "Frequency") +
theme_minimal()
# Create frequency table for price_class (absolute counts)
freq_table <- df %>%
count(price_class) %>%
arrange(price_class) # Ensure proper class order
# Extract frequency counts
frequencies <- freq_table$n
# Original Gini function
gini_function <- function(x) {
x <- sort(x)
n <- length(x)
G <- (2 * sum(x * seq_along(x))) / (n * sum(x)) - (n + 1) / n
return(G)
}
# Calculate Gini index on class distribution
gini_val <- gini_function(frequencies)
cat("Gini Index for 'price_class' distribution:", round(gini_val, 3))
## Gini Index for 'price_class' distribution: 0.319
The median_price variable was divided into six labeled classes ranging from Very Low to Very High. The frequency table shows a clear concentration in the central-upper classes: Mid-High (30.4%) and High (28.3%) together account for 58,7% of the observations. In contrast, the extreme categories (Very Low at 5.4% and Very High at 6.7%) are sparsely represented.
This pattern is visually confirmed by the bar chart, which shows a peak in the mid-to-high range and a decline toward the extremes. This suggests that most houses fall within an intermediate or moderately high median price range, with fewer cases of very low or very high prices.
The Gini index calculated on the class frequency distribution is 0.319, indicating a moderate level of inequality. This value is consistent with the data: while a few classes dominate in terms of frequency, the imbalance is not extreme, as all six categories are represented to some degree.
In summary, the distribution of median_price is skewed toward mid-high price classes, reflecting a moderately uneven but not polarized housing market.
# Summary of ranges per price_class
class_ranges <- df %>%
group_by(price_class) %>%
summarise(
min_value = min(median_price),
max_value = max(median_price)
)
print(class_ranges)
## # A tibble: 6 × 3
## price_class min_value max_value
## <fct> <dbl> <dbl>
## 1 Very Low 73800 91200
## 2 Low 91700 109100
## 3 Mid-Low 109400 126200
## 4 Mid-High 127000 144600
## 5 High 144800 161600
## 6 Very High 163700 180000
calculation of probabilities:
probabilities to sample a particular city is 25%:
—- city (absolute and relative frequency) —-
table(df$city) # Absolute frequency of each city
##
## Beaumont Bryan-College Station Tyler
## 60 60 60
## Wichita Falls
## 60
prop.table(table(df$city)) # Relative frequency of each city
##
## Beaumont Bryan-College Station Tyler
## 0.25 0.25 0.25
## Wichita Falls
## 0.25
probability to sample a row with month 7 (July) is 8.33%:
—- month (absolute and relative frequency) —-
table(df$month) # Absolute frequency of each month
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 20 20 20 20 20 20 20 20 20 20 20 20
prop.table(table(df$month)) # Relative frequency of each month
##
## 1 2 3 4 5 6 7
## 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
## 8 9 10 11 12
## 0.08333333 0.08333333 0.08333333 0.08333333 0.08333333
probability to sample a row with month 12 pf 2012 os 1.66%:
—- year (absolute and relative frequency) —-
table(df$year) # Absolute frequency of each year
##
## 2010 2011 2012 2013 2014
## 48 48 48 48 48
prop.table(table(df$year)) # Relative frequency of each year
##
## 2010 2011 2012 2013 2014
## 0.2 0.2 0.2 0.2 0.2
then…
0.2*0.0833
## [1] 0.01666
or
4/240
## [1] 0.01666667
Create a new column that calculates the average property price using the available variables
df$avg_price <- (df$volume * 1e6) / df$sales
Calculate average house price by city
avg_price_by_city <- aggregate(avg_price ~ city, data = df, mean)
print(avg_price_by_city)
## city avg_price
## 1 Beaumont 146640.4
## 2 Bryan-College Station 183534.3
## 3 Tyler 167676.8
## 4 Wichita Falls 119430.0
Barplot of average prices
barplot(avg_price_by_city$avg_price,
names.arg = avg_price_by_city$city,
las = 2, # Rotate city names vertically
col = "blue",
main = "Average House Price by City",
ylab = "Average Price ($)")
— efficiency —:
Create a column that measures the effectiveness of property listings (sell-efficiency)
df$sell_efficiency <- df$listings / df$months_inventory
Normalize sell-efficiency (min-max scaling)
df$sell_efficiency_norm <- (
df$sell_efficiency - min(df$sell_efficiency)
) / (
max(df$sell_efficiency) - min(df$sell_efficiency)
)
Calculate average normalized sell-efficiency by city
efficiency_by_city_norm <- aggregate(sell_efficiency_norm ~ city, data = df, mean)
print(efficiency_by_city_norm)
## city sell_efficiency_norm
## 1 Beaumont 0.29320961
## 2 Bryan-College Station 0.41693448
## 3 Tyler 0.69298448
## 4 Wichita Falls 0.04451062
Barplot of average normalized sell-efficiency by city
barplot(efficiency_by_city_norm$sell_efficiency_norm,
names.arg = efficiency_by_city_norm$city,
las = 2, # Rotate city names vertically
col = "seagreen3",
main = "Average Normalized Sell Efficiency by City",
ylab = "Average Normalized Sell Efficiency")
comments:
Bryan-College Station
Highest average price (~$183.5k) with moderate efficiency (average normalized ~0.42).
A solid market with the most expensive properties, but sales occur at a relatively moderate pace.
Tyler
Lower average price (~$167.7k) than Bryan-College Station but highest efficiency (~0.69).
A dynamic market where properties are still quite expensive and sold very quickly. The company performs well here.
Beaumont
relatively low average price (~$146.6k) and low efficiency (~0.29).
A stagnant market with relatively inexpensive properties that are harder to sell. May indicate excess supply or weak demand.
Wichita Falls
Lowest average price (~$119.4k) and minimal efficiency (~0.04).
A depressed market with low-value properties and extremely low turnover. The company likely loses resources here.
Generate summaries (means, sd) and make charts.
Sales by city
sales_by_city <- df %>% # %>% means take the output of the previous function and use it as input for the next one
group_by(city) %>%
summarise(
count = n(), # Number of observations
total_sales = sum(sales), # Total sales
mean_sales = mean(sales), # Mean sales
sd_sales = sd(sales), # Standard deviation
min_sales = min(sales), # Minimum sales
max_sales = max(sales) # Maximum sales
)
print(sales_by_city)
## # A tibble: 4 × 7
## city count total_sales mean_sales sd_sales min_sales max_sales
## <chr> <int> <int> <dbl> <dbl> <int> <int>
## 1 Beaumont 60 10643 177. 41.5 83 273
## 2 Bryan-College Stati… 60 12358 206. 85.0 89 403
## 3 Tyler 60 16185 270. 62.0 143 423
## 4 Wichita Falls 60 6964 116. 22.2 79 167
Barplot of mean sales by city
barplot(
sales_by_city$mean_sales,
names.arg = sales_by_city$city,
col = "steelblue",
main = "Average Sales by City",
ylab = "Mean Sales",
xlab = "City",
las = 2, # Rotate city names vertically
cex.names = 0.8 # Adjust label size
)
Barplot of total sales by city
barplot(
sales_by_city$total_sales,
names.arg = sales_by_city$city,
col = "steelblue",
main = "Total Sales by City",
ylab = "Total Sales",
xlab = "City",
las = 2, # Rotate city names vertically
cex.names = 0.8 # Adjust label size
)
comments: Tyler is the city where there are the most of the sales (totally and in average), according to the considerations made in
the code above. BCStation is the second by mean sales and has the highest standard deviation. WFalls
is the city with the minimum amount of sales and also with the minimum mean of sales.
again these results accord with the previous ones, where Tyler is the most dynamic market, followed by BCStation
and WFalls is the most stagnant one, probably because it is closer to the desert and distant from main cities
like Dallas and Houston (maybe??).
Sales by year
sales_by_year <- df %>%
group_by(year) %>%
summarise(
count = n(), # Number of observations
total_sales = sum(sales), # Total sales
mean_sales = mean(sales), # Mean sales
sd_sales = sd(sales), # Standard deviation
min_sales = min(sales), # Minimum sales
max_sales = max(sales) # Maximum sales
)
print(sales_by_year)
## # A tibble: 5 × 7
## year count total_sales mean_sales sd_sales min_sales max_sales
## <int> <int> <int> <dbl> <dbl> <int> <int>
## 1 2010 48 8096 169. 60.5 83 316
## 2 2011 48 7878 164. 63.9 79 313
## 3 2012 48 8935 186. 70.9 90 322
## 4 2013 48 10172 212. 84.0 79 402
## 5 2014 48 11069 231. 95.5 89 423
Barplot of total sales by year
barplot(
sales_by_year$total_sales,
names.arg = sales_by_year$year,
col = "darkorange",
main = "N. of Sales by Year",
ylab = "Total Sales",
xlab = "Year"
)
comments: looking at the charts it looks like the total sales are
increasing over the years, indicating a general
growth of the company (probably they hired more people making more deals) or the market became more fluid. That’s a good
Sales by month
sales_by_month <- df %>%
group_by(month) %>%
summarise(
count = n(), # Number of observations
total_sales = sum(sales), # Total sales
mean_sales = mean(sales), # Mean sales
sd_sales = sd(sales), # Standard deviation
min_sales = min(sales), # Minimum sales
max_sales = max(sales) # Maximum sales
)
print(sales_by_month)
## # A tibble: 12 × 7
## month count total_sales mean_sales sd_sales min_sales max_sales
## <int> <int> <int> <dbl> <dbl> <int> <int>
## 1 1 20 2548 127. 43.4 79 238
## 2 2 20 2817 141. 51.1 79 244
## 3 3 20 3789 189. 59.2 102 298
## 4 4 20 4234 212. 65.4 111 323
## 5 5 20 4777 239. 83.1 102 388
## 6 6 20 4871 244. 95.0 111 423
## 7 7 20 4715 236. 96.3 104 403
## 8 8 20 4629 231. 79.2 123 357
## 9 9 20 3647 182. 72.5 95 361
## 10 10 20 3598 180. 75.0 97 369
## 11 11 20 3137 157. 55.5 93 300
## 12 12 20 3388 169. 60.7 81 332
Barplot of total sales by month
barplot(
sales_by_month$total_sales,
names.arg = sales_by_month$month,
col = "darkgreen",
main = "N. of Sales by Month",
ylab = "Total Sales",
xlab = "Month"
)
Housing market peaks in late spring and summer probably due to:
Families moving when schools are closed, so no interruption of school for children
Stable weather, ideal for visits and people to relocate.
part 1: use boxplot to compare median_price across cities.
boxplot of median_price by city
ggplot(df, aes(x = city, y = median_price)) +
geom_boxplot(fill = "skyblue") +
labs(
title = "Comparison of Median Price by City",
x = "City",
y = "Median Price"
)
comments: Tyler has the highest median price, followed by BCStation and Beaumont. Wichita Falls is the cheapest city.
we have already seen this pattern in the code above (with average price by city above)
but here is more clean and also median price is a more stable measure than average price to outliers.
even if outliers, we can see from the chart, here are not so a problem and boxplots are quite simmetric.
part 2: use ggplot2 to create a barplot to compare total sales across month and cities:
Total sales by city
Summarise total sales by city
sales_by_city <- df %>%
group_by(city) %>%
summarise(total_sales = sum(sales))
Barplot
ggplot(sales_by_city, aes(x = city, y = total_sales)) +
geom_col(fill = "steelblue") +
labs(title = "Total Sales by City", x = "City", y = "Total Sales")
Summarise total sales by month
sales_by_month <- df %>%
group_by(month) %>%
summarise(total_sales = sum(sales))
Barplot
ggplot(sales_by_month, aes(x = factor(month), y = total_sales)) +
geom_col(fill = "darkgreen") +
labs(title = "Total Sales by Month", x = "Month", y = "Total Sales")
Summarise total sales by month and city
sales_by_month_city <- df %>%
group_by(month, city) %>%
summarise(total_sales = sum(sales))
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
print(sales_by_month_city)
## # A tibble: 48 × 3
## # Groups: month [12]
## month city total_sales
## <int> <chr> <int>
## 1 1 Beaumont 608
## 2 1 Bryan-College Station 591
## 3 1 Tyler 907
## 4 1 Wichita Falls 442
## 5 2 Beaumont 677
## 6 2 Bryan-College Station 628
## 7 2 Tyler 1058
## 8 2 Wichita Falls 454
## 9 3 Beaumont 855
## 10 3 Bryan-College Station 949
## # ℹ 38 more rows
let’s create a variable that combines year and month to make a complete time series:
df$date <- as.Date(paste(df$year, df$month, "1", sep = "-")) # "1" means 1st of the month
Grouped barplot
df$year <- format(df$date, "%Y")
sales_by_month_city <- df %>%
group_by(year, month, city) %>%
summarise(total_sales = sum(sales), .groups = "drop")
df$month <- format(df$date, "%m")
ggplot(sales_by_month_city, aes(x = factor(month), y = total_sales, fill = city)) +
geom_col(position = "dodge") +
labs(
title = "Total Sales by Month and City",
x = "Month",
y = "Total Sales",
fill = "City"
) +
facet_wrap(~year)
comments: It looks like BCStation and Tyler are more sensible to seasons rather than the other two cities
above all WFalls but can’t understand why. Probably due to weather reasons but I won’t deep more into it.
part 3: line charts for time series:
Total sales by year
sales_by_year <- df %>%
group_by(year) %>%
summarise(total_sales = sum(sales))
print(sales_by_year)
## # A tibble: 5 × 2
## year total_sales
## <chr> <int>
## 1 2010 8096
## 2 2011 7878
## 3 2012 8935
## 4 2013 10172
## 5 2014 11069
Line-Chart:
# Ensure 'year' is numeric to allow line connection
sales_by_year$year <- as.numeric(as.character(sales_by_year$year))
# Create line plot of total sales per year
ggplot(sales_by_year, aes(x = year, y = total_sales)) +
geom_line(color = "blue", size = 1.2) + # Line connecting the points
geom_point(color = "blue") + # Points on each year
labs(
title = "Total Sales per Year", # Title of the plot
x = "Year", # X-axis label
y = "Total Sales" # Y-axis label
) +
scale_x_continuous(breaks = unique(sales_by_year$year)) + # Show all years on x-axis
theme_minimal() # Clean theme
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
comments: as we have seen before above in the code sales did increase
over years.
Total sales by month
sales_by_month <- df %>%
group_by(month) %>%
summarise(total_sales = sum(sales))
print(sales_by_month)
## # A tibble: 12 × 2
## month total_sales
## <chr> <int>
## 1 01 2548
## 2 02 2817
## 3 03 3789
## 4 04 4234
## 5 05 4777
## 6 06 4871
## 7 07 4715
## 8 08 4629
## 9 09 3647
## 10 10 3598
## 11 11 3137
## 12 12 3388
Create a line chart to show total sales by month
sales_by_month$month <- as.numeric(sales_by_month$month)
ggplot(sales_by_month, aes(x = month, y = total_sales)) +
geom_line(color = "darkgreen", size = 1.2) +
geom_point(color = "darkgreen") +
scale_x_continuous(breaks = 1:12) + # 12 months on x-axis
labs(
title = "Total Sales per Month",
x = "Month",
y = "Total Sales"
)
comments: as we have seen before above in the code sales increase during spring-summer and have a minimum in Genuary.
probably due to weather conditions.
Group data by time and calculate total sales globally (across all of cities)
sales_over_time <- df %>%
group_by(date) %>%
summarise(total_sales = sum(sales))
print(sales_over_time)
## # A tibble: 60 × 2
## date total_sales
## <date> <int>
## 1 2010-01-01 421
## 2 2010-02-01 487
## 3 2010-03-01 755
## 4 2010-04-01 916
## 5 2010-05-01 931
## 6 2010-06-01 866
## 7 2010-07-01 712
## 8 2010-08-01 738
## 9 2010-09-01 598
## 10 2010-10-01 565
## # ℹ 50 more rows
Create a time series line chart of total sales over time
ggplot(sales_over_time, aes(x = date, y = total_sales)) +
geom_line(color = "purple", size = 1.2) +
geom_point(color = "purple") +
labs(
title = "Total Sales Over Time",
x = "Date (Year-Month)",
y = "Total Sales"
)
comments: now it is clear, sales increase over seasons and over
years.
Total sales by city and time
sales_time_city <- df %>%
group_by(city, date) %>%
summarise(total_sales = sum(sales))
## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.
print(sales_time_city)
## # A tibble: 240 × 3
## # Groups: city [4]
## city date total_sales
## <chr> <date> <int>
## 1 Beaumont 2010-01-01 83
## 2 Beaumont 2010-02-01 108
## 3 Beaumont 2010-03-01 182
## 4 Beaumont 2010-04-01 200
## 5 Beaumont 2010-05-01 202
## 6 Beaumont 2010-06-01 189
## 7 Beaumont 2010-07-01 164
## 8 Beaumont 2010-08-01 174
## 9 Beaumont 2010-09-01 124
## 10 Beaumont 2010-10-01 150
## # ℹ 230 more rows
Line chart: one line per city
ggplot(sales_time_city, aes(x = date, y = total_sales, color = city)) +
geom_line(size = 0.3) +
geom_point() +
labs(
title = "Monthly Sales Over Time by City",
x = "Date",
y = "Total Sales"
)
comments: the numbers of sales are increasing over time but not thanks
to WFalls, only the other three cities
are developing a bigger market (or company did put more effort into them over time).
The very interesting thing is seeing that BCStation has the biggest variation over seasons in term of sales.
Doing some research online taking curiosity by the name I have seen that’s a very important university city
compared to the other ones, the logical problem is that students rent houses, they don’t buy/sell them.
this doesn’t explain then the oddity of the data (message to reader: if you know the reason please let me know).
let’s see the variable median price over time by city to see if there are some trends:
Group by city and date, then calculate the median_price for each point in time.
trend_price_city <- df %>%
group_by(city, date) %>%
summarise(median_price)
## `summarise()` has grouped output by 'city'. You can override using the
## `.groups` argument.
print(trend_price_city)
## # A tibble: 240 × 3
## # Groups: city [4]
## city date median_price
## <chr> <date> <dbl>
## 1 Beaumont 2010-01-01 163800
## 2 Beaumont 2010-02-01 138200
## 3 Beaumont 2010-03-01 122400
## 4 Beaumont 2010-04-01 123200
## 5 Beaumont 2010-05-01 123100
## 6 Beaumont 2010-06-01 122800
## 7 Beaumont 2010-07-01 124300
## 8 Beaumont 2010-08-01 136800
## 9 Beaumont 2010-09-01 121100
## 10 Beaumont 2010-10-01 138500
## # ℹ 230 more rows
Line chart: median_price over time, one line per city
ggplot(trend_price_city, aes(x = date, y = median_price, color = city)) +
geom_line(size = 0.3) +
geom_point() +
labs(
title = "Median Price Over Time by City",
x = "Date",
y = "Median Price"
)
comments:
Just by eyes it looks like after 2012’s low peaks, median prices of the cities but Beaumont are growing…
BCStation and Tyler median prices at the end reach values evidently higher than initial ones (before 2012), instead WFalls is more difficult
to be interpreted cause of the big variation it has over time.
Anyways, median prices of the two cities with the most number of sales (also growing over time, as we have seen before) are growing.
That means that the market for those cities is becoming more fluid, also houses more expensive.
The company is doing well in those cities so they should keep bet on them.