Load the “Real Estate Texas.csv” dataset into an R dataframe named df and display its head to verify the loading.
df <- read.csv('Real Estate Texas.csv')
head(df)
## city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010 1 83 14.162 163800 1533 9.5
## 2 Beaumont 2010 2 108 17.690 138200 1586 10.0
## 3 Beaumont 2010 3 182 28.701 122400 1689 10.6
## 4 Beaumont 2010 4 200 26.819 123200 1708 10.6
## 5 Beaumont 2010 5 202 28.833 123100 1771 10.9
## 6 Beaumont 2010 6 189 27.219 122800 1803 11.1
Inspect structure, data types and initial descriptive statistics of the loaded dataset, using str(df), summary(df), and dplyr::glimpse(df).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
str(df)
## 'data.frame': 240 obs. of 8 variables:
## $ city : chr "Beaumont" "Beaumont" "Beaumont" "Beaumont" ...
## $ year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ month : int 1 2 3 4 5 6 7 8 9 10 ...
## $ sales : int 83 108 182 200 202 189 164 174 124 150 ...
## $ volume : num 14.2 17.7 28.7 26.8 28.8 ...
## $ median_price : num 163800 138200 122400 123200 123100 ...
## $ listings : int 1533 1586 1689 1708 1771 1803 1857 1830 1829 1779 ...
## $ months_inventory: num 9.5 10 10.6 10.6 10.9 11.1 11.7 11.6 11.7 11.5 ...
summary(df)
## city year month sales
## Length:240 Min. :2010 Min. : 1.00 Min. : 79.0
## Class :character 1st Qu.:2011 1st Qu.: 3.75 1st Qu.:127.0
## Mode :character Median :2012 Median : 6.50 Median :175.5
## Mean :2012 Mean : 6.50 Mean :192.3
## 3rd Qu.:2013 3rd Qu.: 9.25 3rd Qu.:247.0
## Max. :2014 Max. :12.00 Max. :423.0
## volume median_price listings months_inventory
## Min. : 8.166 Min. : 73800 Min. : 743 Min. : 3.400
## 1st Qu.:17.660 1st Qu.:117300 1st Qu.:1026 1st Qu.: 7.800
## Median :27.062 Median :134500 Median :1618 Median : 8.950
## Mean :31.005 Mean :132665 Mean :1738 Mean : 9.193
## 3rd Qu.:40.893 3rd Qu.:150050 3rd Qu.:2056 3rd Qu.:10.950
## Max. :83.547 Max. :180000 Max. :3296 Max. :14.900
dplyr::glimpse(df)
## Rows: 240
## Columns: 8
## $ city <chr> "Beaumont", "Beaumont", "Beaumont", "Beaumont", "Beau…
## $ year <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,…
## $ month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5,…
## $ sales <int> 83, 108, 182, 200, 202, 189, 164, 174, 124, 150, 150,…
## $ volume <dbl> 14.162, 17.690, 28.701, 26.819, 28.833, 27.219, 22.70…
## $ median_price <dbl> 163800, 138200, 122400, 123200, 123100, 122800, 12430…
## $ listings <int> 1533, 1586, 1689, 1708, 1771, 1803, 1857, 1830, 1829,…
## $ months_inventory <dbl> 9.5, 10.0, 10.6, 10.6, 10.9, 11.1, 11.7, 11.6, 11.7, …
Categorize its variables by type (e.g., qualitative, quantitative).
Based on the data inspection, here's a classification of the variables:
city: A nominal qualitative variable representing different
cities in Texas. It will be useful for grouping data and comparing
trends across different locations.
year: An integer variable representing the year.
Although stored as an integer, it is inherently a time component,
indicating specific periods of observation. It can be treated as
discrete quantitative for some analyses (e.g., aggregations by year) or
as a time component when combined with month.
month: An integer variable representing the month of
the year. Similar to year, it’s a discrete quantitative
variable but carries significant temporal meaning. It’s crucial for
identifying seasonal trends.
sales: An integer variable representing the total
number of sales. This is a discrete quantitative variable suitable for
calculating sums, averages, and analyzing distribution.
volume: A numeric variable representing the total
value of sales (in millions of dollars). This is a continuous
quantitative variable, suitable for various statistical calculations
like mean, standard deviation, and for trend analysis.
median_price: A numeric variable representing the
median price of sales (in dollars). This is a continuous quantitative
variable, key for understanding price trends and distributions.
listings: An integer variable representing the total
number of active listings. This is a discrete quantitative variable,
useful for assessing market supply.
months_inventory: A numeric variable representing
the amount of time, in months, needed to sell all current listings. This is a
continuous quantitative variable, critical for understanding market
balance and demand.
Discuss how to handle time-related variables like ‘year’ and ‘month’.
Handling Time-Related Variables (year and month):
The year and
month variables, while currently integers, represent a temporal
dimension. For robust time-series analysis or more intuitive plotting
(e.g., chronological order on an axis), it’s often beneficial to combine
them into a single date/time object. This can be done by creating a new
column, for instance, date, which could represent the first day of each
month.
This will allow for:
Time Series Analysis: Easily plotting trends over time.
Seasonality Analysis: Grouping by month to observe seasonal
patterns, or by year to observe annual changes.
Filtering and Aggregation: Efficiently filtering data by specific
periods or aggregating data over longer timeframes (e.g., quarterly or
annually).
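The combination described above can be sketched as follows. This is a minimal illustration, assuming the first day of the month is used as the representative date; the toy data frame and the date, year_month, and quarter column names are hypothetical stand-ins, not part of the original dataset.

```r
# Hypothetical stand-in with the same year/month layout as df
toy <- data.frame(
  year  = c(2010L, 2010L, 2011L),
  month = c(1L, 12L, 7L)
)

# Build a Date for the first day of each month;
# "%02d" zero-pads the month so the string parses as YYYY-MM-DD
toy$date <- as.Date(sprintf("%d-%02d-01", toy$year, toy$month))

# Derived labels that are handy for grouping and plotting
toy$year_month <- format(toy$date, "%Y-%m")
toy$quarter    <- quarters(toy$date)

print(toy)
```

With a real Date column, plots sort chronologically without extra work, and quarterly or annual aggregation becomes a simple group-by on the derived labels.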
Calculate descriptive statistics (mean, median, standard deviation,
IQR, variance, skewness, and kurtosis) for the quantitative variables:
sales, volume, median_price,
listings, and months_inventory. Ensure the
e1071 package is installed and loaded for skewness and kurtosis.
library(e1071)
If the package is not installed yet: in R (for OSX) select Tools
-> Install Packages from the
menu. Then you can search for the package and install it via Install
selected.
Alternatively, go to the Packages pane in the right-hand window, click
Install, type e1071, and click Install.
Proceed with calculating the descriptive statistics.
For each variable, first ensure that it is numeric, then compute the
statistics and add them as a new row to the results dataframe.
variables_to_analyze <- c("sales", "volume", "median_price", "listings", "months_inventory")
results <- data.frame()
for (var_name in variables_to_analyze) {
  # Ensure the variable is numeric
  if (is.numeric(df[[var_name]])) {
    mean_val <- mean(df[[var_name]], na.rm = TRUE)
    median_val <- median(df[[var_name]], na.rm = TRUE)
    sd_val <- sd(df[[var_name]], na.rm = TRUE)
    var_val <- var(df[[var_name]], na.rm = TRUE)
    iqr_val <- IQR(df[[var_name]], na.rm = TRUE)
    skew_val <- skewness(df[[var_name]], na.rm = TRUE)
    kurt_val <- kurtosis(df[[var_name]], na.rm = TRUE)
    # Add to results dataframe
    results <- rbind(results, data.frame(
      Variable = var_name,
      Mean = mean_val,
      Median = median_val,
      SD = sd_val,
      Variance = var_val,
      IQR = iqr_val,
      Skewness = skew_val,
      Kurtosis = kurt_val
    ))
  } else {
    cat(paste0("Variable '", var_name, "' is not numeric and will be skipped.\n"))
  }
}
print("Descriptive Statistics for Quantitative Variables:")
## [1] "Descriptive Statistics for Quantitative Variables:"
print(results)
## Variable Mean Median SD Variance
## 1 sales 192.29167 175.5000 79.651111 6.344300e+03
## 2 volume 31.00519 27.0625 16.651447 2.772707e+02
## 3 median_price 132665.41667 134500.0000 22662.148687 5.135730e+08
## 4 listings 1738.02083 1618.5000 752.707756 5.665690e+05
## 5 months_inventory 9.19250 8.9500 2.303669 5.306889e+00
## IQR Skewness Kurtosis
## 1 120.0000 0.71362055 -0.3355200
## 2 23.2335 0.87921815 0.1505673
## 3 32750.0000 -0.36227680 -0.6427292
## 4 1029.5000 0.64544309 -0.8101534
## 5 3.1500 0.04071944 -0.1979448
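To make the Skewness and Kurtosis columns easier to interpret, the sketch below computes moment-based versions by hand in base R. It assumes the default type = 3 estimators described in the e1071 documentation (skewness as m3 / s^3 and excess kurtosis as m4 / s^4 - 3, with s the sample standard deviation). The helper names skew_manual and kurt_manual are hypothetical, and the input vector is made up for illustration.

```r
# Moment-based skewness: third central moment scaled by sd^3
skew_manual <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)
  m3 / sd(x)^3
}

# Excess kurtosis: fourth central moment scaled by sd^4, minus 3
# so that a normal distribution sits near 0
kurt_manual <- function(x) {
  x <- x[!is.na(x)]
  m4 <- mean((x - mean(x))^4)
  m4 / sd(x)^4 - 3
}

# Made-up vector with one large value on the right
x <- c(1, 2, 3, 4, 10)
cat("skewness:", skew_manual(x), "\n")
cat("excess kurtosis:", kurt_manual(x), "\n")
```

Because the kurtosis is reported as an excess over 3, negative values in the table point to flatter, lighter-tailed distributions and positive values to heavier tails.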
Generate frequency distributions for the qualitative and discrete
quantitative variables city, year, and
month.
The e1071 package was already loaded
for the skewness and kurtosis calculations; table() itself is part of base R.
cat("\nFrequency Distribution for 'city':\n")
##
## Frequency Distribution for 'city':
print(table(df$city))
##
## Beaumont Bryan-College Station Tyler
## 60 60 60
## Wichita Falls
## 60
cat("\nFrequency Distribution for 'year':\n")
##
## Frequency Distribution for 'year':
print(table(df$year))
##
## 2010 2011 2012 2013 2014
## 48 48 48 48 48
cat("\nFrequency Distribution for 'month':\n")
##
## Frequency Distribution for 'month':
print(table(df$month))
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 20 20 20 20 20 20 20 20 20 20 20 20
Finally, comment on the statistical results, highlighting significant
findings and patterns for each variable.
Data Analysis Key Findings:
Sales and volume show positive skewness: sales
(mean: 192.29, median: 175.5, skewness: 0.714) and volume
(mean: 31.01 million, median: 27.06 million, skewness: 0.879) both have
means higher than their medians, indicating a concentration of lower
values and occasional higher values pulling the mean up.
Median price is slightly negatively skewed: median_price
(mean: $132,665, median: $134,500, skewness:
-0.362) has its median slightly higher than its mean, suggesting a
concentration of prices at the higher end. Its standard deviation of
$22,662 is large in absolute terms but small relative to its mean
(SD / mean ≈ 0.17), the lowest relative variability among the
quantitative variables.
Listings are positively skewed with high variability:
listings (mean: 1738.02, median: 1618.5, skewness: 0.645)
also shows a positive skew and substantial variability (standard
deviation: 752.71, SD / mean ≈ 0.43).
Months inventory is nearly symmetrical:
months_inventory (mean: 9.19, median: 8.95, skewness:
0.041) has an almost symmetrical distribution. Its standard deviation
(2.30) is the smallest in absolute terms, but standard deviations
measured in different units are not directly comparable; relative to
its mean (SD / mean ≈ 0.25) its variability is moderate.
Platykurtic distributions are common: the kurtosis values reported are
excess kurtosis (0 for a normal distribution). sales,
median_price, and listings all exhibit
platykurtic distributions (kurtosis values of -0.336, -0.643, and -0.810
respectively), meaning they have lighter tails and are less peaked than
a normal distribution. volume is very slightly leptokurtic, close to
mesokurtic (kurtosis: 0.151), and months_inventory is nearly mesokurtic
(kurtosis: -0.198).
Dataset is perfectly balanced across the grouping variables:
Each of the four city values, five year
values, and twelve month values has an equal number of
observations (60 per city, 48 per year, and 20 per month), ensuring a
balanced dataset for comparative analyses across these dimensions.
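Standard deviations measured in different units are hard to compare directly. One common remedy is the coefficient of variation (SD divided by mean). The sketch below computes it from the means and standard deviations printed in the results table above; the stats data frame and the cv column are hypothetical names introduced only for this illustration.

```r
# Means and standard deviations copied from the printed results table
stats <- data.frame(
  variable = c("sales", "volume", "median_price", "listings", "months_inventory"),
  mean = c(192.29167, 31.00519, 132665.41667, 1738.02083, 9.19250),
  sd   = c(79.651111, 16.651447, 22662.148687, 752.707756, 2.303669)
)

# Coefficient of variation: unit-free measure of relative spread
stats$cv <- stats$sd / stats$mean

# Sort from least to most variable in relative terms
stats <- stats[order(stats$cv), ]
print(stats, digits = 3, row.names = FALSE)
```

On these figures median_price has the smallest relative spread and volume the largest.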
Insights or Next Steps:
The prevalent positive skew in sales,
volume, and listings suggests that while most
periods have moderate activity, there are occasional spikes that
significantly boost averages. Further investigation into the factors
driving these high-activity periods could be valuable.
The balanced distribution of observations across
city, year, and month variables
makes the dataset highly suitable for time-series analysis, seasonality
studies, and city-specific comparisons without the need for weighting or
normalization due to unequal representation.
Select the median_price variable, divide it into
meaningful classes (bins) based on its quartiles, create a new
categorical column named median_price_class in the
dataframe, and display the defined bin edges and a sample of the new
column.
First, calculate the quartiles of the median_price
variable to define the bin edges for classification. Then create
a new categorical column named median_price_class using the
cut() function, based on these quartiles, and display the
bin edges and a sample of the new column to verify its creation.
quartile_breaks <- quantile(df$median_price, probs = c(0, 0.25, 0.50, 0.75, 1), na.rm = TRUE)
labels <- c("Low", "Medium-Low", "Medium-High", "High")
df$median_price_class <- cut(df$median_price, breaks = quartile_breaks, labels = labels, include.lowest = TRUE)
cat("Defined Quartile Breaks (Bin Edges):\n")
## Defined Quartile Breaks (Bin Edges):
print(quartile_breaks)
## 0% 25% 50% 75% 100%
## 73800 117300 134500 150050 180000
cat("\nSample of new 'median_price_class' column:\n")
##
## Sample of new 'median_price_class' column:
print(head(df$median_price_class))
## [1] High Medium-High Medium-Low Medium-Low Medium-Low Medium-Low
## Levels: Low Medium-Low Medium-High High
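The include.lowest = TRUE argument matters when the breaks come from quantile(): cut() builds right-closed intervals by default, so the minimum value sits exactly on the first break and would otherwise be left out. A minimal sketch on a made-up vector (x and the two result names are hypothetical):

```r
# Made-up values; the smallest one equals the first quantile break
x <- c(10, 20, 30, 40, 50, 60, 70, 80)
breaks <- quantile(x, probs = c(0, 0.25, 0.50, 0.75, 1))
labels <- c("Low", "Medium-Low", "Medium-High", "High")

# With include.lowest = TRUE the first interval is closed on the left
with_lowest <- cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)

# Without it, the minimum falls outside (min, q1] and becomes NA
without_lowest <- cut(x, breaks = breaks, labels = labels)

cat("NAs with include.lowest:   ", sum(is.na(with_lowest)), "\n")
cat("NAs without include.lowest:", sum(is.na(without_lowest)), "\n")
```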
Generate and display the frequency distribution (table) for the newly
created median_price_class variable, showing the count of observations
falling into each price class.
Use the table() function on the df$median_price_class column and then print the resulting table.
print(table(df$median_price_class))
##
## Low Medium-Low Medium-High High
## 60 61 59 60
Create and display a bar chart using ggplot2 to visually represent
the frequency distribution of median_price_class. The chart will include
appropriate labels and a title.
Use ggplot2 to create a bar chart.
This involves loading the
library, mapping median_price_class to the x-axis, and adding
appropriate labels and a title.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:e1071':
##
## element
# Create a bar chart for median_price_class
ggplot(df, aes(x = median_price_class)) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Frequency Distribution of Median Price Classes",
    x = "Median Price Class",
    y = "Frequency"
  ) +
  theme_minimal()
Calculate the Gini heterogeneity index for the
median_price_class variable:
determine the
proportions of each category within the median_price_class
variable and apply the Gini index formula: Gini = 1 - sum(p_i^2), where
p_i is the proportion of observations in category i.
freq_table <- table(df$median_price_class)
proportions <- freq_table / sum(freq_table)
gini_index <- 1 - sum(proportions^2)
cat("Frequency Table for median_price_class:\n")
## Frequency Table for median_price_class:
print(freq_table)
##
## Low Medium-Low Medium-High High
## 60 61 59 60
cat("\nProportions for median_price_class:\n")
##
## Proportions for median_price_class:
print(proportions)
##
## Low Medium-Low Medium-High High
## 0.2500000 0.2541667 0.2458333 0.2500000
cat("\nGini Heterogeneity Index for median_price_class: ", gini_index, "\n")
##
## Gini Heterogeneity Index for median_price_class: 0.7499653
Discuss the interpretation of the calculated index in the context of
the price class distribution. The calculated Gini heterogeneity index
for median_price_class is approximately 0.750.
Discussion:
Gini Index Range:
The Gini index has a minimum of 0.
A value of 0 indicates perfect homogeneity, meaning all
observations fall into a single category.
Its maximum depends on the number of categories k: it equals (k - 1)/k
when observations are spread evenly across all categories, and it
approaches 1 only as k grows. With four categories the maximum is
3/4 = 0.75.
Context of median_price_class:
In this case, the
median_price_class variable was created by dividing median_price into
four classes (Low, Medium-Low, Medium-High, High) based on quartiles.
Ideally, if the division into quartiles is perfectly balanced, each
class would contain 25% of the observations.
This would yield a Gini index of 0.75, the maximum attainable with
four classes.
Result Interpretation:
A Gini index of 0.750 is essentially that maximum (dividing by 0.75
gives a normalized value of about 0.9999), indicating near-maximal
heterogeneity in the median_price_class variable.
This is expected given that the
classes were created using quartiles, aiming for an approximately equal
distribution of observations across the four categories.
The
frequency table shows that the classes are indeed very evenly
distributed (60, 61, 59, 60 observations respectively out of 240 total),
so no single price class dominates the dataset. Because the bin edges
are the quartiles, this balance is a consequence of the binning rather
than an independent finding about prices.
Implication:
The four classes are balanced segments of the price spectrum, which
makes them convenient for comparisons. The index confirms that the
classification splits the observations evenly; it does not by itself
measure how widely median housing prices vary.
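Since the maximum of the index depends on the number of categories ((k - 1)/k for k categories, so 0.75 for four), dividing by that maximum gives a version on a 0 to 1 scale that is easier to compare across variables. A minimal sketch using the class counts from the frequency table; gini_heterogeneity and its normalized argument are hypothetical names, not functions from a package.

```r
# Gini heterogeneity index from a vector of category counts;
# normalized = TRUE rescales by the maximum (k - 1) / k
gini_heterogeneity <- function(counts, normalized = FALSE) {
  p <- counts / sum(counts)
  g <- 1 - sum(p^2)
  if (normalized) {
    k <- length(counts)
    g <- g / ((k - 1) / k)
  }
  g
}

# Counts taken from the frequency table for median_price_class
counts <- c(Low = 60, `Medium-Low` = 61, `Medium-High` = 59, High = 60)

cat("Gini index:           ", gini_heterogeneity(counts), "\n")
cat("Normalized Gini index:", gini_heterogeneity(counts, normalized = TRUE), "\n")
```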
Data Analysis Key Findings:
The frequency distribution of the median_price_class variable was
visually represented using a bar chart, clearly showing the distribution
across its four categories.
The frequency table for median_price_class revealed a very even
distribution across the categories:
Low (60 observations),
Medium-Low (61 observations), Medium-High (59 observations), and High
(60 observations).
The calculated Gini heterogeneity index for median_price_class is
approximately 0.750.
With four classes the index cannot exceed 0.75, so this value indicates
near-maximal heterogeneity (diversity) within the median_price_class variable.
This is consistent with the design of the classes, which were
based on quartiles to ensure an approximately equal distribution of
observations across the categories.
Insights or Next Steps:
Because the classes are quartile-based, the high heterogeneity
reflects the binning itself: it confirms that the four segments are
well balanced, which makes them suitable for
further analysis.
This balanced classification allows for comparative studies of
other housing characteristics or market behaviors across these distinct
price segments.
Calculate the probability that a randomly selected row from the
dataset corresponds to the city ‘Beaumont’.
This involves counting
rows where ‘city’ is ‘Beaumont’ and dividing by the total number of
rows.
Filter the DataFrame to count rows for ‘Beaumont’, get the total
number of rows, and then divide the ‘Beaumont’ count by the total count.
count_beaumont <- sum(df$city == "Beaumont")
total_rows <- nrow(df)
probability_beaumont <- count_beaumont / total_rows
cat("Number of rows for 'Beaumont':", count_beaumont, "\n")
## Number of rows for 'Beaumont': 60
cat("Total number of rows:", total_rows, "\n")
## Total number of rows: 240
cat("Probability of selecting a row for 'Beaumont':", probability_beaumont, "\n")
## Probability of selecting a row for 'Beaumont': 0.25
Calculate the probability that a randomly selected row corresponds to the month of ‘July’ using a similar approach.
count_july <- sum(df$month == 7)
total_rows <- nrow(df)
probability_july <- count_july / total_rows
cat("Number of rows for 'July':", count_july, "\n")
## Number of rows for 'July': 20
cat("Total number of rows:", total_rows, "\n")
## Total number of rows: 240
cat("Probability of selecting a row for 'July':", probability_july, "\n")
## Probability of selecting a row for 'July': 0.08333333
Determine the probability of selecting a row that corresponds to ‘December 2012’ by filtering the data for both month 12 and year 2012, and then dividing this count by the total number of rows.
count_december_2012 <- sum(df$month == 12 & df$year == 2012)
total_rows <- nrow(df)
probability_december_2012 <- count_december_2012 / total_rows
cat("Number of rows for 'December 2012':", count_december_2012, "\n")
## Number of rows for 'December 2012': 4
cat("Total number of rows:", total_rows, "\n")
## Total number of rows: 240
cat("Probability of selecting a row for 'December 2012':", probability_december_2012, "\n")
## Probability of selecting a row for 'December 2012': 0.01666667
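The same probabilities can be obtained more compactly by taking the mean of a logical vector, which is the share of TRUE values. The sketch below rebuilds a balanced grid with the same structure as the dataset (4 cities, 5 years, 12 months) rather than reading the CSV; grid and the p_* names are hypothetical. It also checks that, on a fully balanced grid, the joint probability of ‘December’ and ‘2012’ equals the product of the two marginal probabilities.

```r
# Balanced stand-in: one row per city x year x month combination
grid <- expand.grid(
  city  = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
  year  = 2010:2014,
  month = 1:12,
  stringsAsFactors = FALSE
)

# mean() of a logical vector is the proportion of TRUE values
p_beaumont <- mean(grid$city == "Beaumont")
p_december <- mean(grid$month == 12)
p_2012     <- mean(grid$year == 2012)
p_joint    <- mean(grid$month == 12 & grid$year == 2012)

cat("P(Beaumont):     ", p_beaumont, "\n")
cat("P(December 2012):", p_joint, "\n")
cat("Joint equals product of marginals:",
    isTRUE(all.equal(p_joint, p_december * p_2012)), "\n")
```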
Probabilities:
Probability for ‘Beaumont’: The probability of
randomly selecting a row corresponding to the city ‘Beaumont’ is 0.25
(or 25%).
Probability for ‘July’: The probability of
randomly selecting a row corresponding to the month of ‘July’ is
approximately 0.0833 (or 8.33%).
Probability for ‘December 2012’: The probability
of randomly selecting a row corresponding to ‘December 2012’ is
approximately 0.0167 (or 1.67%).
Data Analysis Key Findings and Interpretation:
This probability of 0.25 indicates that 25% of the total
observations in the dataset belong to the city ‘Beaumont’.
Given that there are 4 unique cities in the dataset and each city
has an equal number of observations (60 out of 240 total rows), this
result is expected (1/4 = 0.25). This confirms the balanced
representation of cities in the dataset.
This probability indicates that approximately 8.33% of the total
observations correspond to the month of July.
Since there are 12 unique months in the dataset and each month
has an equal number of observations (20 out of 240 total rows), this
result is expected (20/240 = 1/12 ≈ 0.0833).
This confirms the
balanced representation of months across the dataset.
This probability indicates that approximately 1.67% of the total observations specifically correspond to December 2012.
This is an intersection of two conditions (month is December AND
year is 2012). The dataset contains 4 cities * 5 years * 12 months = 240
rows, and each month-year combination appears exactly once per
city, so ‘December 2012’ matches 4 rows out of 240
(4/240 = 1/60 ≈ 0.0167). This shows that the
probability of picking a very specific time slice is quite low, as
expected due to the granularity of the data.
Insights:
The calculated probabilities are consistent with the balanced
nature of the dataset across cities, years, and months, as identified in
previous steps. This reinforces confidence in using this dataset for
comparative analyses without concerns about skewed representation.
These probabilities serve as a fundamental understanding of the
dataset’s composition and can be used as baseline references for more
complex conditional probability calculations or hypothesis testing in
future analyses.
Create two new variables in the df dataframe:
average_price (calculated as (volume *
1,000,000) divided by sales) and
listing_effectiveness (calculated as sales
divided by listings). For each new variable, display the
head of the updated dataframe and summary statistics.
Then, provide
a commentary on their meaning, distributions, and initial insights into
the real estate market.
Finally, summarize the process of creating
these new variables and the insights gained from their analysis, linking
them to the project’s objectives.
Calculate the average_price as specified: multiply volume by
1,000,000 to convert it to dollars, then divide by sales.
Display
the head of the updated dataframe and summary statistics for the new
column to verify the operation.
df$average_price <- (df$volume * 1000000) / df$sales
cat("Head of DataFrame with new 'average_price' column:\n")
## Head of DataFrame with new 'average_price' column:
print(head(df))
## city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010 1 83 14.162 163800 1533 9.5
## 2 Beaumont 2010 2 108 17.690 138200 1586 10.0
## 3 Beaumont 2010 3 182 28.701 122400 1689 10.6
## 4 Beaumont 2010 4 200 26.819 123200 1708 10.6
## 5 Beaumont 2010 5 202 28.833 123100 1771 10.9
## 6 Beaumont 2010 6 189 27.219 122800 1803 11.1
## median_price_class average_price
## 1 High 170626.5
## 2 Medium-High 163796.3
## 3 Medium-Low 157697.8
## 4 Medium-Low 134095.0
## 5 Medium-Low 142737.6
## 6 Medium-Low 144015.9
cat("\nSummary statistics for 'average_price' column:\n")
##
## Summary statistics for 'average_price' column:
print(summary(df$average_price))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 97010 132939 156588 154320 173915 213234
Now that the average_price column has been
created and inspected, create the second
new column, listing_effectiveness, calculated as
sales divided by listings, and display the head
of the updated dataframe and summary statistics for this new column to
verify its creation and initial properties.
df$listing_effectiveness <- df$sales / df$listings
cat("Head of DataFrame with new 'listing_effectiveness' column:\n")
## Head of DataFrame with new 'listing_effectiveness' column:
print(head(df))
## city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010 1 83 14.162 163800 1533 9.5
## 2 Beaumont 2010 2 108 17.690 138200 1586 10.0
## 3 Beaumont 2010 3 182 28.701 122400 1689 10.6
## 4 Beaumont 2010 4 200 26.819 123200 1708 10.6
## 5 Beaumont 2010 5 202 28.833 123100 1771 10.9
## 6 Beaumont 2010 6 189 27.219 122800 1803 11.1
## median_price_class average_price listing_effectiveness
## 1 High 170626.5 0.05414220
## 2 Medium-High 163796.3 0.06809584
## 3 Medium-Low 157697.8 0.10775607
## 4 Medium-Low 134095.0 0.11709602
## 5 Medium-Low 142737.6 0.11405985
## 6 Medium-Low 144015.9 0.10482529
cat("\nSummary statistics for 'listing_effectiveness' column:\n")
##
## Summary statistics for 'listing_effectiveness' column:
print(summary(df$listing_effectiveness))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05014 0.08980 0.10963 0.11874 0.13492 0.38713
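Both new variables are ratios, so a zero denominator would produce Inf or NaN. That does not occur in this dataset (the minimum of sales is 79 and of listings 743), but a guarded helper is a cheap safeguard if the code is reused on other data. The sketch below applies it to the six Beaumont rows shown in the head output; safe_ratio and head_rows are hypothetical names.

```r
# First six rows of the dataset, copied from the head() output
head_rows <- data.frame(
  sales    = c(83, 108, 182, 200, 202, 189),
  volume   = c(14.162, 17.690, 28.701, 26.819, 28.833, 27.219),
  listings = c(1533, 1586, 1689, 1708, 1771, 1803)
)

# Element-wise ratio that returns NA instead of Inf/NaN when dividing by zero
safe_ratio <- function(num, den) {
  ifelse(den == 0, NA_real_, num / den)
}

# volume is in millions of dollars, so scale it before dividing
head_rows$average_price <- safe_ratio(head_rows$volume * 1e6, head_rows$sales)
head_rows$listing_effectiveness <- safe_ratio(head_rows$sales, head_rows$listings)

print(head_rows)
```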
Comments:
Meaning:
The average_price variable represents the average
price per real estate transaction, calculated as the total volume of
sales (in dollars) divided by the total sales count for a given period
and city. Unlike median_price, which indicates the midpoint of
individual property prices, average_price reflects the average value of
each transaction. This can be significantly influenced by a few
high-value sales.
Distribution (from Summary Statistics):
> Min.: $97,010
> 1st Qu.: $132,939
> Median: $156,588
> Mean: $154,320
> 3rd Qu.: $173,915
> Max.: $213,234
Analysis:
The average_price ranges from
approximately $97,010 to $213,234.
The mean ($154,320) is slightly lower than the median ($156,588),
suggesting a very slight
negative skew, meaning there are potentially more values concentrated
towards the higher end, or a few low values pulling the mean
down.
This mirrors median_price, which also showed a slight
negative skew, with its mean ($132,665) being slightly lower than its
median ($134,500).
However, the
range and quartiles of average_price are generally higher
than those of median_price, indicating that the average transaction value is
often higher than the typical (median) property price, possibly due to
higher-value properties contributing disproportionately to the total
volume. The difference between the mean and median is small, indicating
a relatively symmetrical distribution for average_price compared to some
other variables like sales or volume.
Initial Insights:
Revenue Perspective: average_price is a
direct indicator of the average revenue generated per sale. Monitoring
this alongside median_price provides a more complete picture of
price levels in the market. A growing average_price suggests increasing
revenue per transaction, which is positive for profitability.
Market Segmentation: Discrepancies between average_price and median_price can
highlight different market dynamics. For instance, if average_price is
significantly higher than median_price, it could imply a robust luxury
market or a few very expensive properties driving the overall
volume.
Meaning:
listing_effectiveness is calculated
as sales divided by listings. This variable measures how many sales are
generated per active listing. It can be interpreted as a proxy for
market efficiency or the success rate of listings in leading to a sale.
A higher value indicates that a greater proportion of listings are
resulting in sales, suggesting a more dynamic market or more effective
marketing.
Distribution (from Summary Statistics):
> Min.: 0.05014
> 1st Qu.: 0.08980
> Median: 0.10963
> Mean: 0.11874
> 3rd Qu.: 0.13492
> Max.: 0.38713
Analysis:
The listing_effectiveness ranges from
approximately 0.05 (5%) to 0.387 (38.7%).
This means that, at its
lowest, about 5% of active listings result in a sale in a given period,
while at its peak, almost 39% do. This represents a wide range of
effectiveness.
The mean (0.11874) is higher than the median
(0.10963), indicating a positive skew.
This suggests that while
most periods have lower to moderate listing effectiveness, there are
occasional instances where listings are remarkably effective, pulling
the average up.
Initial Insights:
Market Demand
vs. Supply: Higher listing_effectiveness generally points to stronger
buyer demand relative to the supply of listings, or a faster-moving
market.
Conversely, lower values might suggest an oversupply of
properties or weaker demand.
Marketing Success: This metric can
also reflect the overall success of marketing and listing strategies.
Periods or cities with consistently
higher listing_effectiveness might have more competitive pricing, better
property presentation, or more targeted advertising.
Variability: The significant range and positive skew highlight
that listing_effectiveness is not constant. Identifying the factors
contributing to high effectiveness (e.g., specific cities, seasons, or
market conditions) would be valuable for strategic decision-making.
Data Analysis Key Findings:
Average_price vs. median_price:
The average_price (mean: $154,320; median: $156,588) was found to be
generally higher than the median_price (mean: $132,665; median:
$134,500).
This suggests that higher-priced homes contribute more
significantly to the total sales volume.
The average_price showed a
slight negative skew, indicating a concentration of values towards the
higher end or fewer extremely low values pulling the mean down.
Listing_effectiveness Variability:
This metric
exhibited considerable variability, ranging from approximately 5% to
38.7% (mean: 0.11874; median: 0.10963).
The positive skew (mean
greater than median) suggests that while there is a baseline level of
effectiveness, some periods experience notably higher sales conversion
rates relative to active listings, indicating dynamic market conditions.
Insights:
Decision Support:
The newly created variables
provide quantitative metrics vital for understanding market dynamics and
supporting strategic sales decisions. average_price helps in assessing
revenue potential per transaction, while listing_effectiveness measures
the success rate of listings.
Optimization and Further Analysis:
Texas Realty
Insights can utilize listing_effectiveness to identify optimal periods
or cities for listing properties, thereby improving conversion rates.
Further analysis should explore factors influencing
high listing_effectiveness and the relationship
between average_price and median_price to gain deeper insights into
market value and segmentation.
Conditional Analysis by City:
Perform conditional statistical analysis (mean and standard
deviation) for key quantitative variables grouped by ‘city’ using
dplyr. Display the summary statistics and create
appropriate graphical representations (e.g., bar charts for means, box
plots for distributions) to visualize city-specific trends.
Calculate the mean and standard deviation for the specified
quantitative variables, grouped by ‘city’, as instructed.
library(dplyr)
city_summary <- df %>%
group_by(city) %>%
summarise(
mean_sales = mean(sales, na.rm = TRUE),
sd_sales = sd(sales, na.rm = TRUE),
mean_volume = mean(volume, na.rm = TRUE),
sd_volume = sd(volume, na.rm = TRUE),
mean_median_price = mean(median_price, na.rm = TRUE),
sd_median_price = sd(median_price, na.rm = TRUE),
mean_listings = mean(listings, na.rm = TRUE),
sd_listings = sd(listings, na.rm = TRUE),
mean_months_inventory = mean(months_inventory, na.rm = TRUE),
sd_months_inventory = sd(months_inventory, na.rm = TRUE),
mean_average_price = mean(average_price, na.rm = TRUE),
sd_average_price = sd(average_price, na.rm = TRUE),
mean_listing_effectiveness = mean(listing_effectiveness, na.rm = TRUE),
sd_listing_effectiveness = sd(listing_effectiveness, na.rm = TRUE)
)
print(city_summary)
## # A tibble: 4 × 15
## city mean_sales sd_sales mean_volume sd_volume mean_median_price
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Beaumont 177. 41.5 26.1 6.97 129988.
## 2 Bryan-College Sta… 206. 85.0 38.2 17.2 157488.
## 3 Tyler 270. 62.0 45.8 13.1 141442.
## 4 Wichita Falls 116. 22.2 13.9 3.24 101743.
## # ℹ 9 more variables: sd_median_price <dbl>, mean_listings <dbl>,
## # sd_listings <dbl>, mean_months_inventory <dbl>, sd_months_inventory <dbl>,
## # mean_average_price <dbl>, sd_average_price <dbl>,
## # mean_listing_effectiveness <dbl>, sd_listing_effectiveness <dbl>
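One possible graphical representation of the city-level means is a bar chart with one-standard-deviation error bars. The sketch below is only an illustration: it uses the rounded mean_sales and sd_sales values printed in the tibble above rather than the full city_summary object, and the names city_means and p are hypothetical.

```r
library(ggplot2)

# Rounded values copied from the printed city summary
city_means <- data.frame(
  city       = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
  mean_sales = c(177, 206, 270, 116),
  sd_sales   = c(41.5, 85.0, 62.0, 22.2)
)

# Bars ordered from highest to lowest mean, with +/- 1 SD error bars
p <- ggplot(city_means, aes(x = reorder(city, -mean_sales), y = mean_sales)) +
  geom_col(fill = "steelblue") +
  geom_errorbar(
    aes(ymin = mean_sales - sd_sales, ymax = mean_sales + sd_sales),
    width = 0.2
  ) +
  labs(
    title = "Mean Monthly Sales by City",
    x = "City",
    y = "Mean sales (+/- 1 SD)"
  ) +
  theme_minimal()

print(p)
```

For distributions rather than means, a box plot of the raw values by city (geom_boxplot on df) would be the natural counterpart.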
Conditional Analysis by Year:
Perform the conditional statistical analysis grouped by ‘year’.
Display the summary statistics and create graphical representations
(e.g., line charts, bar charts) to visualize annual trends.
Calculate the mean and standard deviation for the specified
quantitative variables.
library(dplyr)
year_summary <- df %>%
group_by(year) %>%
summarise(
mean_sales = mean(sales, na.rm = TRUE),
sd_sales = sd(sales, na.rm = TRUE),
mean_volume = mean(volume, na.rm = TRUE),
sd_volume = sd(volume, na.rm = TRUE),
mean_median_price = mean(median_price, na.rm = TRUE),
sd_median_price = sd(median_price, na.rm = TRUE),
mean_listings = mean(listings, na.rm = TRUE),
sd_listings = sd(listings, na.rm = TRUE),
mean_months_inventory = mean(months_inventory, na.rm = TRUE),
sd_months_inventory = sd(months_inventory, na.rm = TRUE),
mean_average_price = mean(average_price, na.rm = TRUE),
sd_average_price = sd(average_price, na.rm = TRUE),
mean_listing_effectiveness = mean(listing_effectiveness, na.rm = TRUE),
sd_listing_effectiveness = sd(listing_effectiveness, na.rm = TRUE)
)
print(year_summary)
## # A tibble: 5 × 15
## year mean_sales sd_sales mean_volume sd_volume mean_median_price
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2010 169. 60.5 25.7 10.8 130192.
## 2 2011 164. 63.9 25.2 12.2 127854.
## 3 2012 186. 70.9 29.3 14.5 130077.
## 4 2013 212. 84.0 35.2 17.9 135723.
## 5 2014 231. 95.5 39.8 21.2 139481.
## # ℹ 9 more variables: sd_median_price <dbl>, mean_listings <dbl>,
## # sd_listings <dbl>, mean_months_inventory <dbl>, sd_months_inventory <dbl>,
## # mean_average_price <dbl>, sd_average_price <dbl>,
## # mean_listing_effectiveness <dbl>, sd_listing_effectiveness <dbl>
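The printed tibbles hide nine of the fifteen columns, including values that the findings later quote. The sketch below shows two ways to reveal them, using a made-up tibble as a stand-in for year_summary.

```r
library(dplyr)

# Toy summary standing in for year_summary (made-up values)
toy_summary <- tibble(
  year = 2010:2012,
  mean_a = c(1, 2, 3), sd_a = c(0.1, 0.2, 0.3),
  mean_b = c(4, 5, 6), sd_b = c(0.4, 0.5, 0.6),
  mean_months_inventory = c(9, 8, 7),
  sd_months_inventory = c(1, 1, 1)
)

# Option 1: print every column regardless of console width
print(toy_summary, width = Inf)

# Option 2: keep only the grouping column and the columns of interest
inventory_cols <- toy_summary %>%
  select(year, ends_with("months_inventory"))
print(inventory_cols)
```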
Conditional Analysis by Month:
Perform conditional statistical analysis (mean and standard
deviation) for key quantitative variables grouped by ‘month’. Display
the summary statistics and create graphical representations (e.g., line
charts, bar charts) to visualize seasonal trends.
Calculate the mean and standard deviation for the specified
quantitative variables.
library(dplyr)
month_summary <- df %>%
group_by(month) %>%
summarise(
mean_sales = mean(sales, na.rm = TRUE),
sd_sales = sd(sales, na.rm = TRUE),
mean_volume = mean(volume, na.rm = TRUE),
sd_volume = sd(volume, na.rm = TRUE),
mean_median_price = mean(median_price, na.rm = TRUE),
sd_median_price = sd(median_price, na.rm = TRUE),
mean_listings = mean(listings, na.rm = TRUE),
sd_listings = sd(listings, na.rm = TRUE),
mean_months_inventory = mean(months_inventory, na.rm = TRUE),
sd_months_inventory = sd(months_inventory, na.rm = TRUE),
mean_average_price = mean(average_price, na.rm = TRUE),
sd_average_price = sd(average_price, na.rm = TRUE),
mean_listing_effectiveness = mean(listing_effectiveness, na.rm = TRUE),
sd_listing_effectiveness = sd(listing_effectiveness, na.rm = TRUE)
)
print(month_summary)
## # A tibble: 12 × 15
## month mean_sales sd_sales mean_volume sd_volume mean_median_price
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 127. 43.4 19.0 8.37 124250
## 2 2 141. 51.1 21.7 10.1 130075
## 3 3 189. 59.2 29.4 12.0 127415
## 4 4 212. 65.4 33.3 14.5 131490
## 5 5 239. 83.1 39.7 19.0 134485
## 6 6 244. 95.0 41.3 21.1 137620
## 7 7 236. 96.3 39.1 21.4 134750
## 8 8 231. 79.2 38.0 18.0 136675
## 9 9 182. 72.5 29.6 15.2 134040
## 10 10 180. 75.0 29.1 15.1 133480
## 11 11 157. 55.5 24.8 11.2 134305
## 12 12 169. 60.7 27.1 12.6 133400
## # ℹ 9 more variables: sd_median_price <dbl>, mean_listings <dbl>,
## # sd_listings <dbl>, mean_months_inventory <dbl>, sd_months_inventory <dbl>,
## # mean_average_price <dbl>, sd_average_price <dbl>,
## # mean_listing_effectiveness <dbl>, sd_listing_effectiveness <dbl>
Conditional Analysis by Month:
Data Analysis Key Findings:
Peak Sales Season:
Mean sales and volume show a
clear increase from March, peaking in June-July, and then declining
towards the end of the year. For example, mean sales are lowest in
January (127) and highest in June (244). This suggests a strong seasonal
demand in the warmer months.
Price Trends:
Mean median_price and average_price generally follow the sales trend,
with prices tending to be higher during the peak selling season. Mean
median price is lowest in January ($124,250) and peaks in June ($137,620),
indicating that properties sold during these months tend to fetch higher
prices.
Listing Activity:
Mean listings generally increase
from the beginning of the year, peaking around spring/summer (e.g., June
with 2110 listings) before slightly decreasing. This indicates that
sellers are more likely to put their homes on the market during periods
of high demand.
Market Efficiency:
Mean listing_effectiveness is
notably higher in the spring and summer months (e.g., peaks in June with
0.134), indicating that listings convert into sales more efficiently
during these periods. This is a critical insight for optimizing
marketing strategies.
Inventory Levels:
Mean months_inventory shows an
inverse relationship to sales and effectiveness, generally decreasing
from winter through summer (lowest in July at 7.7 months) and rising
again towards winter (highest in January at 10.1 months). This confirms
a tighter, faster-moving market in the warmer months.
Distribution Insights (from Box Plots):
The box
plots visually confirm these seasonal trends, showing shifts in the
median, interquartile range, and presence of outliers across months. For
instance, sales and volume box plots typically show higher medians and
potentially wider spreads during peak seasons,
while months_inventory shows lower medians and narrower spreads.
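The peak and trough months quoted above can be read off the summary programmatically rather than by eye. The sketch below uses only the two columns visible in the printed month_summary (rounded as shown); the helper peak_and_trough is a hypothetical name introduced for illustration.

```r
# Monthly means copied from the printed month_summary (rounded as displayed)
month_means <- data.frame(
  month = 1:12,
  mean_sales = c(127, 141, 189, 212, 239, 244, 236, 231, 182, 180, 157, 169),
  mean_median_price = c(124250, 130075, 127415, 131490, 134485, 137620,
                        134750, 136675, 134040, 133480, 134305, 133400)
)

# Hypothetical helper: month of the highest and lowest value of one column
peak_and_trough <- function(data, value_var) {
  data.frame(
    variable = value_var,
    peak_month = data$month[which.max(data[[value_var]])],
    trough_month = data$month[which.min(data[[value_var]])]
  )
}

vars <- c("mean_sales", "mean_median_price")
seasonal_extremes <- do.call(
  rbind,
  lapply(vars, function(v) peak_and_trough(month_means, v))
)
print(seasonal_extremes)
```

Applied to the full month_summary, the same helper would cover the hidden columns (listings, months_inventory, listing_effectiveness) as well.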
Insights or Next Steps:
Optimal Listing Periods:
Texas Realty Insights
should focus their marketing and listing efforts heavily on the spring
and summer months (March to August) to capitalize on peak sales
activity, higher prices, and improved listing_effectiveness. This
directly addresses the project objective of optimizing listing
strategies.
Pricing Strategy:
Dynamic pricing strategies could
be implemented to align with seasonal price fluctuations, potentially
maximizing revenue during peak periods.
Inventory Management:
Understanding
seasonal months_inventory trends can help agents advise sellers on the
best time to list their properties to achieve quicker sales or higher
prices.
Targeted Marketing:
Marketing campaigns can be
tailored to seasonal variations, perhaps focusing on buyer attraction in
spring/summer and seller education or strategic listings in fall/winter.
Conditional Analysis (by City, Year, and Month):
Data Analysis Key Findings:
City Performance Disparities:
Cities exhibit
distinct market characteristics. For instance, Tyler is a high-activity
market with robust sales, while Bryan-College Station commands higher
prices. Wichita Falls consistently shows lower activity and prices
across most metrics. Beaumont generally falls in the
middle.
Market Growth Over Time:
The Texas real estate
market grew and became more efficient between 2010 and 2014. Key
indicators like sales, volume, and prices rose over the period (after a
slight dip in 2011), while months_inventory, the number of months needed
to sell current listings at the current sales pace, decreased, signaling
a strengthening seller’s market.
Seasonal Influence:
The real estate market is
highly seasonal, with a strong preference for spring and summer sales.
This period is characterized by higher transaction volumes, greater
market efficiency (higher listing_effectiveness), and elevated prices,
followed by a slowdown in colder months.
Insights or Next Steps:
Targeted Strategies:
Texas Realty Insights should
develop city-specific marketing and sales strategies. High-volume
markets like Tyler might benefit from aggressive listing campaigns,
while higher-priced markets like Bryan-College Station could focus on
premium services. Wichita Falls may require strategies to stimulate
demand or differentiate listings.
Capitalize on Growth and Seasonality:
Leveraging
the observed annual growth and seasonal peaks is crucial. Marketing
efforts should be intensified from March to August to maximize sales and
revenue. Understanding these patterns allows for proactive inventory
management, staffing adjustments, and promotional timing.
Further Granular Analysis:
While this conditional
analysis provides a strong overview, combining these factors (e.g.,
city-year, city-month trends) could yield even more precise insights.
For example, investigating why a particular city might deviate from the
general annual or seasonal trend could uncover unique local market
dynamics.
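A combined city-year breakdown of the kind suggested here could be sketched as follows. The toy data and the column name change_vs_prev_year are made up for illustration; on the real data the same pipeline would run on df.

```r
library(dplyr)

# Toy stand-in for df; values are made up for illustration only
toy <- data.frame(
  city  = rep(c("A", "B"), each = 4),
  year  = rep(c(2010, 2010, 2011, 2011), times = 2),
  sales = c(100, 110, 120, 130, 200, 190, 180, 170)
)

city_year <- toy %>%
  group_by(city, year) %>%
  summarise(mean_sales = mean(sales, na.rm = TRUE), .groups = "drop") %>%
  group_by(city) %>%
  arrange(year, .by_group = TRUE) %>%
  # Year-over-year change within each city; NA for the first year
  mutate(change_vs_prev_year = mean_sales - dplyr::lag(mean_sales)) %>%
  ungroup()

print(city_year)
```

A city whose change has the opposite sign to the overall trend is a candidate for the closer look described above.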
Integrated Summary of Conditional Analyses and Strategic
Implications:
The conditional analyses conducted by city, year, and month provide a
comprehensive understanding of the dynamics shaping the Texas real
estate market. These insights highlight structural differences across
cities, a clear multi‑year growth trajectory, and strong seasonal
patterns that influence sales activity, pricing, inventory behavior, and
listing performance.
The analyses reveal substantial variation across local markets:
Tyler: stands out as the most active market, with the
highest mean sales and volume, as well as strong
listing_effectiveness, indicating efficient conversion of
listings into sales.
Bryan-College Station: leads in median_price
and average_price, positioning it as a premium-value market.
Beaumont: shows moderate performance across most
metrics.
Wichita Falls: consistently records the lowest values
in sales, prices, and listing effectiveness, along with higher and more
variable months_inventory, reflecting a slower and less
competitive market.
These differences underscore the need for city-specific strategies
tailored to each market’s structure and performance.
Across the five-year period, the Texas real estate market
shows strong overall growth:
Mean sales, volume, and median_price are all
higher in 2014 than in 2010, despite a slight dip in 2011 (mean sales
fall from 169 to 164 before climbing to 231); average_price and
listing_effectiveness also end the period higher.
Mean months_inventory declines from approximately 9.9 to 8.4 months, signaling a tightening seller’s market where properties sell more quickly.
Price appreciation is evident, with mean median price
rising from roughly $130,192 in 2010 to $139,481 in 2014.
These trends are consistent with improving market efficiency and
stronger demand; they may also reflect favorable economic conditions,
which this dataset cannot confirm on its own.
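To make the size of the multi-year growth concrete, the year-over-year and total percentage changes can be computed from the yearly means. The sketch below uses only the rounded values visible in the printed year_summary, so results are approximate.

```r
library(dplyr)

# Yearly means copied from the printed year_summary (rounded as displayed)
yearly <- data.frame(
  year = 2010:2014,
  mean_sales = c(169, 164, 186, 212, 231),
  mean_median_price = c(130192, 127854, 130077, 135723, 139481)
)

# Year-over-year percentage change; NA for the first year
growth <- yearly %>%
  arrange(year) %>%
  mutate(
    sales_yoy_pct = 100 * (mean_sales / dplyr::lag(mean_sales) - 1),
    price_yoy_pct = 100 * (mean_median_price / dplyr::lag(mean_median_price) - 1)
  )
print(growth)

# Total percentage change from the first to the last year
total_change <- yearly %>%
  arrange(year) %>%
  summarise(
    sales_total_pct = 100 * (last(mean_sales) / first(mean_sales) - 1),
    price_total_pct = 100 * (last(mean_median_price) / first(mean_median_price) - 1)
  )
print(total_change)
```

With these rounded inputs, both series show a negative change for 2011 and a positive total change over the period.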
The analyses also reveal pronounced seasonality:
Sales and volume peak between May and
July, with June typically showing the highest activity.
Median_price and average_price follow similar
seasonal peaks, reaching their highest levels in late spring and early
summer.
Listing_effectiveness is strongest during the same
period, indicating that marketing efforts convert more effectively when
demand is highest.
Months_inventory displays inverse seasonality, reaching
its lowest levels in summer and highest in winter, confirming faster
turnover during peak demand months.
Listings increase in spring and peak around May–July,
suggesting that sellers strategically enter the market when conditions
are most favorable.
Strategic Implications:
The combined insights from the conditional analyses support several
strategic recommendations for Texas Realty Insights:
Develop City-Specific Strategies:
Focus on
high-volume campaigns in Tyler, premium positioning in Bryan-College
Station, and targeted, conservative approaches in slower markets like
Wichita Falls.
Capitalize on Seasonality:
Concentrate marketing
investments, listing promotions, and staffing resources between March
and August to leverage peak demand, higher prices, and stronger listing
effectiveness.
Implement Dynamic Pricing:
Adjust pricing
strategies to reflect seasonal fluctuations, maximizing revenue during
high-activity months.
Monitor Inventory Conditions:
Use
months_inventory as a key indicator of market competitiveness
and to guide recommendations on optimal listing timing.
Leverage Multi-Year Growth Trends:
Invest in
expanding operations in high-performing markets and refine sales
forecasts using the consistent upward trends observed from 2010 to 2014.
This integrated analysis provides a clear, data-driven foundation for
strategic decision-making, enabling Texas Realty Insights to optimize
marketing efforts, improve forecasting, and tailor approaches to the
unique characteristics of each market.
Create a bar chart using ggplot2 to visualize the mean
of sales across different cities. This chart will help in comparing the
average sales performance of each city.
library(ggplot2)
Bar chart for mean sales by city
ggplot(city_summary, aes(x = city, y = mean_sales, fill = city)) +
geom_bar(stat = "identity") +
labs(
title = "Mean Sales by City",
x = "City",
y = "Mean Sales"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This will allow for easy comparison of average sales volume per city.
Bar chart for mean volume by city
ggplot(city_summary, aes(x = city, y = mean_volume, fill = city)) +
geom_bar(stat = "identity") +
labs(
title = "Mean Sales Volume by City",
x = "City",
y = "Mean Sales Volume (in millions)"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This will allow for easy comparison of the average median price per
city.
Bar chart for mean median_price by city
ggplot(city_summary, aes(x = city, y = mean_median_price, fill = city)) +
geom_bar(stat = "identity") +
labs(
title = "Mean Median Price by City",
x = "City",
y = "Mean Median Price (in dollars)"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This will allow for easy comparison of the average number of listings
per city.
Bar chart for mean listings by city
ggplot(city_summary, aes(x = city, y = mean_listings, fill = city)) +
geom_bar(stat = "identity") +
labs(
title = "Mean Listings by City",
x = "City",
y = "Mean Listings"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This will allow for easy comparison of the average months inventory per
city.
Bar chart for mean months_inventory by city
ggplot(city_summary, aes(x = city, y = mean_months_inventory, fill = city)) +
geom_bar(stat = "identity") +
labs(
title = "Mean Months Inventory by City",
x = "City",
y = "Mean Months Inventory"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This will allow for easy comparison of the average price per city.
Bar chart for mean average_price by city
ggplot(city_summary, aes(x = city, y = mean_average_price, fill = city)) +
geom_bar(stat = "identity") +
labs(
title = "Mean Average Price by City",
x = "City",
y = "Mean Average Price (in dollars)"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This will allow for easy comparison of the average listing effectiveness
per city.
Bar chart for mean listing_effectiveness by city
ggplot(city_summary, aes(x = city, y = mean_listing_effectiveness, fill = city)) +
geom_bar(stat = "identity") +
labs(
title = "Mean Listing Effectiveness by City",
x = "City",
y = "Mean Listing Effectiveness"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
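The seven bar charts above differ only in the column plotted and the labels. One way to avoid the repetition is a small plotting helper, sketched below. The function name plot_mean_bar and the toy summary are hypothetical; geom_col() is ggplot2's shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Toy stand-in for city_summary; values are made up for illustration only
toy_summary <- data.frame(
  city = c("A", "B", "C"),
  mean_sales = c(120, 220, 180),
  mean_listings = c(900, 1500, 1200)
)

# Hypothetical helper: bar chart of one summary column per group
plot_mean_bar <- function(data, group_var, value_var, y_label) {
  ggplot(data, aes(x = .data[[group_var]], y = .data[[value_var]],
                   fill = .data[[group_var]])) +
    geom_col() +
    labs(title = paste(y_label, "by", group_var), x = group_var, y = y_label) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

# Column name -> axis label
labels <- c(mean_sales = "Mean Sales", mean_listings = "Mean Listings")
plots <- lapply(names(labels), function(v) {
  plot_mean_bar(toy_summary, "city", v, labels[[v]])
})
names(plots) <- names(labels)

plots$mean_sales
```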
Create a box plot using
ggplot2 to visualize the
distribution of sales across different cities. This chart will provide
insights into the spread, central tendency, and outliers of sales for
each city.
Box plot for sales by city
ggplot(df, aes(x = city, y = sales, fill = city)) +
geom_boxplot() +
labs(
title = "Distribution of Sales by City",
x = "City",
y = "Sales"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This will provide insights into the spread, central tendency, and
outliers of sales volume for each city.
Box plot for volume by city
ggplot(df, aes(x = city, y = volume, fill = city)) +
geom_boxplot() +
labs(
title = "Distribution of Sales Volume by City",
x = "City",
y = "Sales Volume (in millions)"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This will provide insights into the spread, central tendency, and
outliers of median price for each city.
Box plot for median_price by city
ggplot(df, aes(x = city, y = median_price, fill = city)) +
geom_boxplot() +
labs(
title = "Distribution of Median Price by City",
x = "City",
y = "Median Price (in dollars)"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This will provide insights into the spread, central tendency, and
outliers of listings for each city.
Box plot for listings by city
ggplot(df, aes(x = city, y = listings, fill = city)) +
geom_boxplot() +
labs(
title = "Distribution of Listings by City",
x = "City",
y = "Listings"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This will provide insights into the spread, central tendency, and
outliers of months inventory for each city.
Box plot for months_inventory by city
ggplot(df, aes(x = city, y = months_inventory, fill = city)) +
geom_boxplot() +
labs(
title = "Distribution of Months Inventory by City",
x = "City",
y = "Months Inventory"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This will provide insights into the spread, central tendency, and
outliers of average price for each city.
Box plot for average_price by city
ggplot(df, aes(x = city, y = average_price, fill = city)) +
geom_boxplot() +
labs(
title = "Distribution of Average Price by City",
x = "City",
y = "Average Price (in dollars)"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This will provide insights into the spread, central tendency, and
outliers of listing effectiveness for each city.
Box plot for listing_effectiveness by city
ggplot(df, aes(x = city, y = listing_effectiveness, fill = city)) +
geom_boxplot() +
labs(
title = "Distribution of Listing Effectiveness by City",
x = "City",
y = "Listing Effectiveness"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This chart will help in comparing the average sales performance for each
year.
Bar chart for mean sales by year
ggplot(year_summary, aes(x = as.factor(year), y = mean_sales, fill = as.factor(year))) +
geom_bar(stat = "identity") +
labs(
title = "Mean Sales by Year",
x = "Year",
y = "Mean Sales"
) +
theme_minimal() +
theme(legend.position = "none")
This will allow for easy comparison of average sales volume per year.
Bar chart for mean volume by year
ggplot(year_summary, aes(x = as.factor(year), y = mean_volume, fill = as.factor(year))) +
geom_bar(stat = "identity") +
labs(
title = "Mean Sales Volume by Year",
x = "Year",
y = "Mean Sales Volume (in millions)"
) +
theme_minimal() +
theme(legend.position = "none")
This will allow for easy comparison of the average median price per
year.
Bar chart for mean median_price by year
ggplot(year_summary, aes(x = as.factor(year), y = mean_median_price, fill = as.factor(year))) +
geom_bar(stat = "identity") +
labs(
title = "Mean Median Price by Year",
x = "Year",
y = "Mean Median Price (in dollars)"
) +
theme_minimal() +
theme(legend.position = "none")
This will allow for easy comparison of the average number of listings
per year.
Bar chart for mean listings by year
ggplot(year_summary, aes(x = as.factor(year), y = mean_listings, fill = as.factor(year))) +
geom_bar(stat = "identity") +
labs(
title = "Mean Listings by Year",
x = "Year",
y = "Mean Listings"
) +
theme_minimal() +
theme(legend.position = "none")
This will allow for easy comparison of the average months inventory per
year.
Bar chart for mean months_inventory by year
ggplot(year_summary, aes(x = as.factor(year), y = mean_months_inventory, fill = as.factor(year))) +
geom_bar(stat = "identity") +
labs(
title = "Mean Months Inventory by Year",
x = "Year",
y = "Mean Months Inventory"
) +
theme_minimal() +
theme(legend.position = "none")
This will allow for easy comparison of the average price per year.
Bar chart for mean average_price by year
ggplot(year_summary, aes(x = as.factor(year), y = mean_average_price, fill = as.factor(year))) +
geom_bar(stat = "identity") +
labs(
title = "Mean Average Price by Year",
x = "Year",
y = "Mean Average Price (in dollars)"
) +
theme_minimal() +
theme(legend.position = "none")
This will allow for easy comparison of the average listing effectiveness
per year.
Bar chart for mean listing_effectiveness by year
ggplot(year_summary, aes(x = as.factor(year), y = mean_listing_effectiveness, fill = as.factor(year))) +
geom_bar(stat = "identity") +
labs(
title = "Mean Listing Effectiveness by Year",
x = "Year",
y = "Mean Listing Effectiveness"
) +
theme_minimal() +
theme(legend.position = "none")
Create a box plot using ggplot2 to visualize the distribution of sales
across different years. This will provide insights into the spread,
central tendency, and outliers of sales for each year.
Box plot for sales by year
ggplot(df, aes(x = as.factor(year), y = sales, fill = as.factor(year))) +
geom_boxplot() +
labs(
title = "Distribution of Sales by Year",
x = "Year",
y = "Sales"
) +
theme_minimal() +
theme(legend.position = "none")
This will provide insights into the spread, central tendency, and
outliers of sales volume for each year.
Box plot for volume by year
ggplot(df, aes(x = as.factor(year), y = volume, fill = as.factor(year))) +
geom_boxplot() +
labs(
title = "Distribution of Sales Volume by Year",
x = "Year",
y = "Sales Volume (in millions)"
) +
theme_minimal() +
theme(legend.position = "none")
This will provide insights into the spread, central tendency, and
outliers of median price for each year.
Box plot for median_price by year
ggplot(df, aes(x = as.factor(year), y = median_price, fill = as.factor(year))) +
geom_boxplot() +
labs(
title = "Distribution of Median Price by Year",
x = "Year",
y = "Median Price (in dollars)"
) +
theme_minimal() +
theme(legend.position = "none")
This will provide insights into the spread, central tendency, and
outliers of listings for each year.
Box plot for listings by year
ggplot(df, aes(x = as.factor(year), y = listings, fill = as.factor(year))) +
geom_boxplot() +
labs(
title = "Distribution of Listings by Year",
x = "Year",
y = "Listings"
) +
theme_minimal() +
theme(legend.position = "none")
This will provide insights into the spread, central tendency, and
outliers of months inventory for each year.
Box plot for months_inventory by year
ggplot(df, aes(x = as.factor(year), y = months_inventory, fill = as.factor(year))) +
geom_boxplot() +
labs(
title = "Distribution of Months Inventory by Year",
x = "Year",
y = "Months Inventory"
) +
theme_minimal() +
theme(legend.position = "none")
This will provide insights into the spread, central tendency, and
outliers of average price for each year.
Box plot for average_price by year
ggplot(df, aes(x = as.factor(year), y = average_price, fill = as.factor(year))) +
geom_boxplot() +
labs(
title = "Distribution of Average Price by Year",
x = "Year",
y = "Average Price (in dollars)"
) +
theme_minimal() +
theme(legend.position = "none")
This will provide insights into the spread, central tendency, and
outliers of listing effectiveness for each year.
Box plot for listing_effectiveness by year
ggplot(df, aes(x = as.factor(year), y = listing_effectiveness, fill = as.factor(year))) +
geom_boxplot() +
labs(
title = "Distribution of Listing Effectiveness by Year",
x = "Year",
y = "Listing Effectiveness"
) +
theme_minimal() +
theme(legend.position = "none")
This chart will help in comparing the average sales performance for each
month.
Bar chart for mean sales by month
ggplot(month_summary, aes(x = as.factor(month), y = mean_sales, fill = as.factor(month))) +
geom_bar(stat = "identity") +
labs(
title = "Mean Sales by Month",
x = "Month",
y = "Mean Sales"
) +
theme_minimal() +
theme(legend.position = "none")
This will allow for easy comparison of average sales volume per month.
Bar chart for mean volume by month
ggplot(month_summary, aes(x = as.factor(month), y = mean_volume, fill = as.factor(month))) +
geom_bar(stat = "identity") +
labs(
title = "Mean Sales Volume by Month",
x = "Month",
y = "Mean Sales Volume (in millions)"
) +
theme_minimal() +
theme(legend.position = "none")
This will allow for easy comparison of the average median price per
month.
Bar chart for mean median_price by month
ggplot(month_summary, aes(x = as.factor(month), y = mean_median_price, fill = as.factor(month))) +
geom_bar(stat = "identity") +
labs(
title = "Mean Median Price by Month",
x = "Month",
y = "Mean Median Price (in dollars)"
) +
theme_minimal() +
theme(legend.position = "none")
This will allow for easy comparison of the average number of listings
per month.
Bar chart for mean listings by month
ggplot(month_summary, aes(x = as.factor(month), y = mean_listings, fill = as.factor(month))) +
geom_bar(stat = "identity") +
labs(
title = "Mean Listings by Month",
x = "Month",
y = "Mean Listings"
) +
theme_minimal() +
theme(legend.position = "none")
This will allow for easy comparison of the average months inventory per
month.
Bar chart for mean months_inventory by month
ggplot(month_summary, aes(x = as.factor(month), y = mean_months_inventory, fill = as.factor(month))) +
geom_bar(stat = "identity") +
labs(
title = "Mean Months Inventory by Month",
x = "Month",
y = "Mean Months Inventory"
) +
theme_minimal() +
theme(legend.position = "none")
This will allow for easy comparison of the average price per month.
Bar chart for mean average_price by month
ggplot(month_summary, aes(x = as.factor(month), y = mean_average_price, fill = as.factor(month))) +
geom_bar(stat = "identity") +
labs(
title = "Mean Average Price by Month",
x = "Month",
y = "Mean Average Price (in dollars)"
) +
theme_minimal() +
theme(legend.position = "none")
This will allow for easy comparison of the average listing effectiveness
per month.
Bar chart for mean listing_effectiveness by month
ggplot(month_summary, aes(x = as.factor(month), y = mean_listing_effectiveness, fill = as.factor(month))) +
geom_bar(stat = "identity") +
labs(
title = "Mean Listing Effectiveness by Month",
x = "Month",
y = "Mean Listing Effectiveness"
) +
theme_minimal() +
theme(legend.position = "none")
This will provide insights into the spread, central tendency, and
outliers of sales for each month.
Box plot for sales by month
ggplot(df, aes(x = as.factor(month), y = sales, fill = as.factor(month))) +
geom_boxplot() +
labs(
title = "Distribution of Sales by Month",
x = "Month",
y = "Sales"
) +
theme_minimal() +
theme(legend.position = "none")
This will provide insights into the spread, central tendency, and
outliers of sales volume for each month.
Box plot for volume by month
ggplot(df, aes(x = as.factor(month), y = volume, fill = as.factor(month))) +
geom_boxplot() +
labs(
title = "Distribution of Sales Volume by Month",
x = "Month",
y = "Sales Volume (in millions)"
) +
theme_minimal() +
theme(legend.position = "none")
This will provide insights into the spread, central tendency, and
outliers of median price for each month.
Box plot for median_price by month
ggplot(df, aes(x = as.factor(month), y = median_price, fill = as.factor(month))) +
geom_boxplot() +
labs(
title = "Distribution of Median Price by Month",
x = "Month",
y = "Median Price (in dollars)"
) +
theme_minimal() +
theme(legend.position = "none")
This will provide insights into the spread, central tendency, and
outliers of listings for each month.
Box plot for listings by month
ggplot(df, aes(x = as.factor(month), y = listings, fill = as.factor(month))) +
geom_boxplot() +
labs(
title = "Distribution of Listings by Month",
x = "Month",
y = "Listings"
) +
theme_minimal() +
theme(legend.position = "none")
This will provide insights into the spread, central tendency, and
outliers of months inventory for each month.
Box plot for months_inventory by month
ggplot(df, aes(x = as.factor(month), y = months_inventory, fill = as.factor(month))) +
geom_boxplot() +
labs(
title = "Distribution of Months Inventory by Month",
x = "Month",
y = "Months Inventory"
) +
theme_minimal() +
theme(legend.position = "none")
This will provide insights into the spread, central tendency, and
outliers of average price for each month.
Box plot for average_price by month
ggplot(df, aes(x = as.factor(month), y = average_price, fill = as.factor(month))) +
geom_boxplot() +
labs(
title = "Distribution of Average Price by Month",
x = "Month",
y = "Average Price (in dollars)"
) +
theme_minimal() +
theme(legend.position = "none")
Box Plot for listing_effectiveness by Month:
Create a box plot to visualize the distribution of listing effectiveness across different months, to provide insights into the spread, central tendency, and outliers of listing effectiveness for each month.
library(ggplot2)
Box plot for listing_effectiveness by month
ggplot(df, aes(x = as.factor(month), y = listing_effectiveness, fill = as.factor(month))) +
geom_boxplot() +
labs(
title = "Distribution of Listing Effectiveness by Month",
x = "Month",
y = "Listing Effectiveness"
) +
theme_minimal() +
theme(legend.position = "none")
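As an alternative to one box plot per variable, the variables can be reshaped to long format and drawn as facets of a single figure. The sketch below assumes the tidyr package is available and uses simulated data as a stand-in for df.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Simulated stand-in for df; values are random, for illustration only
set.seed(1)
toy <- data.frame(
  month = rep(1:12, each = 5),
  sales = rpois(60, 180),
  listings = rpois(60, 1700)
)

# One row per (observation, variable) pair
long <- toy %>%
  pivot_longer(cols = c(sales, listings),
               names_to = "variable", values_to = "value")

p <- ggplot(long, aes(x = as.factor(month), y = value,
                      fill = as.factor(month))) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y") +  # separate y scale per variable
  labs(title = "Distributions by Month", x = "Month", y = NULL) +
  theme_minimal() +
  theme(legend.position = "none")
p
```

Free y scales matter here because the variables are measured in very different units.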
Data Analysis Key Findings:
Sales and Volume Growth:
Both mean sales and
mean volume rose over the period, apart from a slight dip in 2011. This
suggests a healthy and expanding real estate market in Texas during this
period. For example, mean sales rose from 169 in 2010 to 231 in 2014
(164 in 2011), and mean sales volume increased from 25.7 million to
39.8 million in the same period (25.2 million in 2011).
Price Appreciation:
Mean median_price and
mean average_price also trended upward, indicating
increasing property values. Mean median price increased
from $130,192 in 2010 to $139,481 in 2014 (with a dip to $127,854 in
2011), and mean average price also rose over the period.
Increased Listings:
The mean listings also
increased steadily from 1572 in 2010 to 2106 in 2014, suggesting more
properties were coming onto the market, potentially driven by the
improving market conditions.
Improved Market Efficiency:
Mean listing_effectiveness generally improved from 0.108 in 2010 to
0.123 in 2014, indicating that a higher proportion of listings were
resulting in sales each year. This suggests a more active and
demand-driven market.
Decreasing Inventory:
Mean months_inventory showed
a decreasing trend from 9.9 in 2010 to 8.4 in 2014. A declining months
of inventory generally signifies a seller’s market, where properties are
selling faster, reflecting strong demand.
Distribution Insights (from Box Plots):
The box
plots illustrate that while the median values for sales, volume, and
prices generally increased, the spread (IQR) and the presence of
outliers also varied by year. Some years show tighter distributions,
while others reveal a wider range of values, indicating periods of
higher market volatility or varied performance across different
segments.
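The spread and outlier observations read from the box plots can be checked numerically. The sketch below computes the median, IQR, and the number of points beyond 1.5 × IQR from the quartiles, the default rule geom_boxplot() uses to flag outliers. The toy data are made up for illustration.

```r
library(dplyr)

# Toy stand-in for df; values are made up for illustration only
toy <- data.frame(
  year = rep(c(2010, 2011), each = 6),
  sales = c(100, 110, 120, 130, 140, 400,   # 400 is a deliberate outlier
            150, 155, 160, 165, 170, 175)
)

spread_by_year <- toy %>%
  group_by(year) %>%
  summarise(
    median_sales = median(sales, na.rm = TRUE),
    iqr_sales = IQR(sales, na.rm = TRUE),
    # Count points outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
    n_outliers = sum(
      sales < quantile(sales, 0.25, na.rm = TRUE) - 1.5 * IQR(sales, na.rm = TRUE) |
        sales > quantile(sales, 0.75, na.rm = TRUE) + 1.5 * IQR(sales, na.rm = TRUE),
      na.rm = TRUE
    ),
    .groups = "drop"
  )

print(spread_by_year)
```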
Insights or Next Steps:
Confirmation of Growth:
The analysis clearly
confirms a positive growth trajectory in the Texas real estate market
between 2010 and 2014 across multiple key indicators. This information
is crucial for Texas Realty Insights to understand macro-level market
performance.
Strategic Planning:
The observed trends can inform
strategic decisions, such as anticipating future market conditions,
adjusting pricing strategies, and optimizing listing efforts based on
annual performance patterns.
Deeper Dive:
Further analysis could explore the
drivers behind the increasing listing_effectiveness and
decreasing months_inventory, such as economic indicators, population
growth, or interest rates, to provide more granular actionable insights.
Create a stacked bar chart to compare the total sales across months,
with cities stacked within each month.
First aggregate total sales by month and city; the resulting total_sales value for each month-city combination is the input for the stacked bar chart.
library(dplyr)
library(ggplot2)
Aggregate data to calculate total sales by month and city
sales_by_month_city <- df %>%
group_by(month, city) %>%
summarise(total_sales = sum(sales, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
print(sales_by_month_city)
## # A tibble: 48 × 3
## month city total_sales
## <int> <chr> <int>
## 1 1 Beaumont 608
## 2 1 Bryan-College Station 591
## 3 1 Tyler 907
## 4 1 Wichita Falls 442
## 5 2 Beaumont 677
## 6 2 Bryan-College Station 628
## 7 2 Tyler 1058
## 8 2 Wichita Falls 454
## 9 3 Beaumont 855
## 10 3 Bryan-College Station 949
## # ℹ 38 more rows
Use the aggregated sales_by_month_city data to create a stacked bar chart to visualize total sales by month and city.
library(dplyr)
library(ggplot2)
# Aggregate data to calculate total sales by month and city
# (.groups = "drop" silences the grouping message printed by summarise())
sales_by_month_city <- df %>%
  group_by(month, city) %>%
  summarise(total_sales = sum(sales, na.rm = TRUE), .groups = "drop")
# Create a stacked bar chart for total sales by month and city
ggplot(sales_by_month_city, aes(x = as.factor(month), y = total_sales, fill = city)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(
    title = "Total Sales by Month and City",
    x = "Month",
    y = "Total Sales"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Box Plot for median_price by City:
Create a box plot to compare the distribution of median prices across
cities, to highlight differences in central tendency, variability, and
potential outliers.
library(ggplot2)
# Create a boxplot to compare the distribution of median_price between cities
ggplot(df, aes(x = city, y = median_price, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Median Price by City",
    x = "City",
    y = "Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Boxplot Analysis: Distribution of Median Price by City
The boxplot illustrates the distribution of median_price across the four
cities: Beaumont, Bryan-College Station, Tyler, and Wichita Falls. Each
observation is one city-month, so the boxes describe how monthly median
prices vary, not how individual sale prices vary.
Central Tendency (Median):
Bryan-College Station shows the highest median price, indicating a markedly higher-priced market.
Tyler has the second-highest median, above Beaumont and Wichita Falls.
Beaumont displays a lower median price, suggesting more moderate property values.
Wichita Falls has the lowest median, making it the most affordable market.
Variability (IQR and Whiskers):
Bryan-College Station exhibits the widest IQR and longest whiskers, reflecting high dispersion in its monthly median prices.
Tyler shows moderate variability, lower than Bryan-College Station but higher than Beaumont and Wichita Falls.
Beaumont has a more compact box, indicating lower variability in prices.
Wichita Falls shows the narrowest IQR and shortest whiskers, suggesting tightly clustered prices.
Outliers:
Bryan-College Station contains several upper and lower outliers, indicating months with unusually high or unusually low median prices.
Tyler, Beaumont, and Wichita Falls show few or no outliers, suggesting more predictable and stable price ranges.
Preliminary Conclusions:
Bryan-College Station is the most expensive and heterogeneous market, with strong price dispersion.
Tyler represents a mid-range market with moderate variability.
Beaumont offers more moderate prices with limited dispersion.
Wichita Falls is the most affordable and homogeneous market, with consistently uniform pricing.
These differences are essential for Texas Realty Insights when
tailoring marketing strategies, positioning properties, and advising
clients based on each city’s specific market dynamics.
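The visual reading above (median, IQR, outliers) can be cross-checked numerically. The sketch below is one possible way to do it, not part of the original analysis: box_summary is a hypothetical helper, the toy data frame contains made-up values, and the outlier fences use the 1.5 × IQR rule that geom_boxplot() applies by default.

```r
# Hypothetical helper: the numbers behind one box of a box plot.
box_summary <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[3] - q[1]
  lower_fence <- q[1] - 1.5 * iqr
  upper_fence <- q[3] + 1.5 * iqr
  c(
    median = q[2],
    iqr = iqr,
    n_outliers = sum(x < lower_fence | x > upper_fence, na.rm = TRUE)
  )
}

# Toy data for illustration only (NOT values from the dataset).
toy <- data.frame(
  city = rep(c("A", "B"), each = 6),
  median_price = c(100, 102, 104, 106, 108, 200,
                   150, 151, 152, 153, 154, 155)
)

# One row per city: median, IQR, and number of points outside the fences.
city_summary <- t(sapply(split(toy$median_price, toy$city), box_summary))
print(city_summary)
```

With the report's data frame, the equivalent call would be t(sapply(split(df$median_price, df$city), box_summary)); comparing its rows against the plot is a quick way to confirm which city really has the widest IQR and how many points fall outside the whiskers.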
Boxplots for volume by City:
Create a boxplot to compare the distribution of total sales volume across cities, highlighting differences in spread, central tendency, and outliers.
library(ggplot2)
# Box plot for volume by city
ggplot(df, aes(x = city, y = volume, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales Volume by City",
    x = "City",
    y = "Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Boxplots for volume by Year:
Create a boxplot to compare the distribution of total sales volume across years, to identify temporal trends, variability, and unusual values.
library(ggplot2)
# Box plot for volume by year
ggplot(df, aes(x = as.factor(year), y = volume, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales Volume by Year",
    x = "Year",
    y = "Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
To create a normalized stacked bar chart, it is necessary to
calculate the proportion of sales for each city within each month. This
involves grouping the previously aggregated data by month and then
dividing each city's total sales by the total sales of that month.
library(dplyr)
library(ggplot2)
# Aggregate total sales by month and city (repeated so this chunk runs on its own)
sales_by_month_city <- df %>%
  group_by(month, city) %>%
  summarise(total_sales = sum(sales, na.rm = TRUE), .groups = "drop")
# Calculate the proportion of sales for each city within each month
sales_by_month_city_normalized <- sales_by_month_city %>%
  group_by(month) %>%
  mutate(proportion = total_sales / sum(total_sales)) %>%
  ungroup()
print(sales_by_month_city_normalized)
## # A tibble: 48 × 4
## month city total_sales proportion
## <int> <chr> <int> <dbl>
## 1 1 Beaumont 608 0.239
## 2 1 Bryan-College Station 591 0.232
## 3 1 Tyler 907 0.356
## 4 1 Wichita Falls 442 0.173
## 5 2 Beaumont 677 0.240
## 6 2 Bryan-College Station 628 0.223
## 7 2 Tyler 1058 0.376
## 8 2 Wichita Falls 454 0.161
## 9 3 Beaumont 855 0.226
## 10 3 Bryan-College Station 949 0.250
## # ℹ 38 more rows
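As a quick sanity check on the normalization, the shares within each month should add up to 1. The sketch below reproduces the calculation in base R for months 1 and 2 only, using the totals copied from the printed tibble; the variable names month_total and share_check are illustrative and do not appear in the original code.

```r
# Totals for months 1 and 2, copied from the printed output above.
sales_by_month_city <- data.frame(
  month = rep(1:2, each = 4),
  city = rep(c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
             times = 2),
  total_sales = c(608, 591, 907, 442, 677, 628, 1058, 454)
)

# ave() applies sum() within each month and returns one value per row.
month_total <- ave(sales_by_month_city$total_sales,
                   sales_by_month_city$month, FUN = sum)
sales_by_month_city$proportion <- sales_by_month_city$total_sales / month_total

# Each month's shares should sum to 1 (up to floating-point error).
share_check <- tapply(sales_by_month_city$proportion,
                      sales_by_month_city$month, sum)
print(share_check)
print(round(sales_by_month_city$proportion, 3))
```

The rounded proportions agree with the first eight rows printed above. Note also that position = "fill" in geom_bar() rescales every stack to a height of 1 on its own, so a chart built from total_sales with position = "fill" would show the same bars; the explicit proportion column is mainly useful for printing and checking the numbers.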
Generate the normalized stacked bar chart to visually represent the proportion of sales contributed by each city for every month.
ggplot(sales_by_month_city_normalized, aes(x = as.factor(month), y = proportion, fill = city)) +
  geom_bar(stat = "identity", position = "fill") +
  labs(
    title = "Normalized Total Sales by Month and City",
    x = "Month",
    y = "Proportion of Sales"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Line Chart for sales Over Time by City:
Generate a line chart using ggplot2 to visualize the trend of sales over
time, differentiated by city. This plot will help compare sales
trajectories across locations and highlight how each market evolves over
time.
Purpose: Visualize temporal sales patterns for each city to identify
growth trends, seasonal fluctuations, and deviations in performance.
Task: Plot date (or year/month) on the x-axis and sales on the y-axis.
library(dplyr)
# Build a Date column from year and month, using the first day of each month
df$date <- as.Date(paste0(df$year, "-", df$month, "-01"), format = "%Y-%m-%d")
cat("Head of DataFrame with new 'date' column:\n")
## Head of DataFrame with new 'date' column:
print(head(df))
## city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010 1 83 14.162 163800 1533 9.5
## 2 Beaumont 2010 2 108 17.690 138200 1586 10.0
## 3 Beaumont 2010 3 182 28.701 122400 1689 10.6
## 4 Beaumont 2010 4 200 26.819 123200 1708 10.6
## 5 Beaumont 2010 5 202 28.833 123100 1771 10.9
## 6 Beaumont 2010 6 189 27.219 122800 1803 11.1
## median_price_class average_price listing_effectiveness date
## 1 High 170626.5 0.05414220 2010-01-01
## 2 Medium-High 163796.3 0.06809584 2010-02-01
## 3 Medium-Low 157697.8 0.10775607 2010-03-01
## 4 Medium-Low 134095.0 0.11709602 2010-04-01
## 5 Medium-Low 142737.6 0.11405985 2010-05-01
## 6 Medium-Low 144015.9 0.10482529 2010-06-01
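The derived columns visible in this output can be sanity-checked by hand. The sketch below assumes that average_price is volume (in millions of dollars) divided by sales, and that listing_effectiveness is sales divided by listings. These definitions are inferred from the printed values, not quoted from the code that created the columns, so treat them as an assumption.

```r
# First two Beaumont rows, copied from the printed output above.
rows <- data.frame(
  sales = c(83, 108),
  volume = c(14.162, 17.690),  # millions of dollars
  listings = c(1533, 1586)
)

# Assumed definitions (inferred from the printed values).
rows$average_price <- rows$volume * 1e6 / rows$sales
rows$listing_effectiveness <- rows$sales / rows$listings

print(rows)
```

Both rows reproduce the printed average_price (170626.5 and 163796.3) and listing_effectiveness (0.05414220 and 0.06809584) to the displayed precision, which supports the assumed formulas.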
# Create a line chart for sales trend over time by city
library(ggplot2)
ggplot(df, aes(x = date, y = sales, color = city, group = city)) +
  geom_line() +
  labs(
    title = "Sales Trend by City Over Time",
    x = "Date",
    y = "Sales"
  ) +
  theme_minimal()
Comprehensive Summary of ggplot2 Visualizations (Project Step 8):
This summary consolidates insights from all ggplot2 visualizations,
showing how they enhance understanding of the Texas real estate market
and support strategic decisions for Texas Realty Insights.
Boxplots of volume by City and by Year
2.1. Volume by City
Tyler leads in total sales volume, confirming its role as the most active market.
Bryan-College Station shows substantial but more variable volume; Beaumont and Wichita Falls remain lower and more stable.
Findings align with conditional analyses and suggest high‑volume
strategies for Tyler and variability‑management approaches for
Bryan-College Station.
2.2. Volume by Year
Clear upward trend in sales volume from 2010 to 2014, with higher
medians and wider IQRs in later years.
Increased upper outliers in 2013–2014 indicate growing market
dynamism.
These visuals confirm annual growth patterns and support
expansion-oriented planning.
Overall Strategic Value:
Final Conclusions and Recommendations for Texas Realty Insights
This project delivered a comprehensive statistical and visual
analysis of the Texas real estate market from 2010 to 2014 using the
“Real Estate Texas.csv” dataset. The objectives (identifying trends,
evaluating marketing effectiveness, and providing visual insights for
strategic decision-making) were fully achieved through descriptive
statistics, probability calculations, new variable creation, conditional
analyses, and extensive ggplot2 visualizations.
Key Findings and Emerging Trends:
The dataset is perfectly balanced across city, year, and month,
ensuring reliable conditional comparisons.
Sales and volume show positive skewness,
reflecting occasional high‑activity periods.
Median_price shows slight negative skewness.
The Gini heterogeneity index for median_price_class
(~0.750) confirms strong price diversity and effective
segmentation.
Steady increases in sales, volume,
median_price, and average_price each year.
Months_inventory declines from 9.9 to 8.4 months,
indicating a tightening seller’s market.
Listing_effectiveness improves from 0.108 to 0.123,
showing more efficient listing‑to‑sale conversion.
Boxplots of volume by year reinforce this trend, with higher
medians and more upper outliers in later years.
Clear peaks in sales, volume,
median_price, average_price, and
listing_effectiveness from May to July.
Months_inventory reaches its lowest levels during these
high‑activity months and peaks in winter.
Line charts confirm recurring seasonal cycles across all cities.
Tyler: Highest activity and volume, excellent listing
effectiveness.
Bryan-College Station: Highest prices and greatest
variability, indicating a premium and segmented market.
Beaumont: Moderate and stable market profile.
Wichita Falls: Lowest prices and activity, most
homogeneous market.
Average_price (~$154,320) exceeds median_price
(~$132,665), showing that high‑value transactions disproportionately
influence total volume.
Listing_effectiveness reveals meaningful differences in
market efficiency across cities and seasons.
The analysis provides a statistically solid foundation for
strategic planning.
Findings are consistent across multiple analytical methods and
visualizations, reinforcing their reliability.
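To put the reported Gini heterogeneity value of about 0.750 in context, the sketch below computes the index as G = 1 - sum(p_i^2), where p_i is the relative frequency of class i. The function name gini_index, the example vectors, and the assumption that median_price_class has exactly four levels (the label "Low" does not appear in the rows printed earlier) are illustrative and may differ from the report's own implementation.

```r
# Gini heterogeneity index for a categorical vector: 1 - sum of squared shares.
gini_index <- function(x) {
  p <- table(x) / length(x)
  1 - sum(p^2)
}

# Assumed four price classes, each equally frequent (illustrative data).
balanced <- rep(c("Low", "Medium-Low", "Medium-High", "High"), each = 60)
g_balanced <- gini_index(balanced)

# A strongly unbalanced two-class example for contrast.
unbalanced <- c(rep("Low", 9), "High")
g_unbalanced <- gini_index(unbalanced)

print(c(balanced = g_balanced, unbalanced = g_unbalanced))
```

With k equally frequent classes the index reaches its maximum of (k - 1) / k, which is 0.75 for k = 4. Under the four-class assumption, a value of about 0.750 therefore says that observations are spread almost evenly across the price classes, which is what one would expect if the classes were built from quartiles of median_price.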
Actionable Recommendations for Texas Realty Insights:
7.1. City‑Specific Marketing and Sales Strategies:
Evidence: Strong differences in means, medians,
variability, and distributions across cities.
Implications:
7.2. Proactive Seasonal Optimization:
Evidence: Consistent peaks in sales, volume, and listing
effectiveness from May to July.
Implications:
7.3. Strategic Use of Listing Effectiveness and Months Inventory:
Evidence: Listing effectiveness improves annually and
seasonally; months inventory is inversely correlated with market
activity.
Implications:
7.4. Capitalizing on Multi‑Year Growth:
Evidence: Strong upward trends across all key
metrics.
Implications:
7.5. Dynamic Pricing Strategy Considerations:
Evidence: Statistically significant seasonal and annual
price fluctuations.
Implications:
Final Summary:
This analysis transforms raw data into a clear, actionable understanding of Texas real estate dynamics. The recommendations—grounded in strong statistical and visual evidence—equip Texas Realty Insights to make informed decisions, optimize strategies, and strengthen their position in a competitive and evolving market.