Task 1:

Load the “Real Estate Texas.csv” dataset into an R dataframe named df and display its head to verify the loading.

df <- read.csv('Real Estate Texas.csv')
head(df)
##       city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010     1    83 14.162       163800     1533              9.5
## 2 Beaumont 2010     2   108 17.690       138200     1586             10.0
## 3 Beaumont 2010     3   182 28.701       122400     1689             10.6
## 4 Beaumont 2010     4   200 26.819       123200     1708             10.6
## 5 Beaumont 2010     5   202 28.833       123100     1771             10.9
## 6 Beaumont 2010     6   189 27.219       122800     1803             11.1

Inspect structure, data types and initial descriptive statistics of the loaded dataset, using str(df), summary(df), and dplyr::glimpse(df).

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
str(df)
## 'data.frame':    240 obs. of  8 variables:
##  $ city            : chr  "Beaumont" "Beaumont" "Beaumont" "Beaumont" ...
##  $ year            : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ month           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sales           : int  83 108 182 200 202 189 164 174 124 150 ...
##  $ volume          : num  14.2 17.7 28.7 26.8 28.8 ...
##  $ median_price    : num  163800 138200 122400 123200 123100 ...
##  $ listings        : int  1533 1586 1689 1708 1771 1803 1857 1830 1829 1779 ...
##  $ months_inventory: num  9.5 10 10.6 10.6 10.9 11.1 11.7 11.6 11.7 11.5 ...
summary(df)
##      city                year          month           sales      
##  Length:240         Min.   :2010   Min.   : 1.00   Min.   : 79.0  
##  Class :character   1st Qu.:2011   1st Qu.: 3.75   1st Qu.:127.0  
##  Mode  :character   Median :2012   Median : 6.50   Median :175.5  
##                     Mean   :2012   Mean   : 6.50   Mean   :192.3  
##                     3rd Qu.:2013   3rd Qu.: 9.25   3rd Qu.:247.0  
##                     Max.   :2014   Max.   :12.00   Max.   :423.0  
##      volume        median_price       listings    months_inventory
##  Min.   : 8.166   Min.   : 73800   Min.   : 743   Min.   : 3.400  
##  1st Qu.:17.660   1st Qu.:117300   1st Qu.:1026   1st Qu.: 7.800  
##  Median :27.062   Median :134500   Median :1618   Median : 8.950  
##  Mean   :31.005   Mean   :132665   Mean   :1738   Mean   : 9.193  
##  3rd Qu.:40.893   3rd Qu.:150050   3rd Qu.:2056   3rd Qu.:10.950  
##  Max.   :83.547   Max.   :180000   Max.   :3296   Max.   :14.900
dplyr::glimpse(df)
## Rows: 240
## Columns: 8
## $ city             <chr> "Beaumont", "Beaumont", "Beaumont", "Beaumont", "Beau…
## $ year             <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,…
## $ month            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5,…
## $ sales            <int> 83, 108, 182, 200, 202, 189, 164, 174, 124, 150, 150,…
## $ volume           <dbl> 14.162, 17.690, 28.701, 26.819, 28.833, 27.219, 22.70…
## $ median_price     <dbl> 163800, 138200, 122400, 123200, 123100, 122800, 12430…
## $ listings         <int> 1533, 1586, 1689, 1708, 1771, 1803, 1857, 1830, 1829,…
## $ months_inventory <dbl> 9.5, 10.0, 10.6, 10.6, 10.9, 11.1, 11.7, 11.6, 11.7, …

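Categorize its variables by type (e.g., qualitative, quantitative). Based on the data inspection, here’s a classification of the variables:

  1. Qualitative Variables: city (nominal, stored as character).
  2. Quantitative Variables: sales and listings (discrete counts); volume, median_price, and months_inventory (continuous); year and month (discrete integers that encode time).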

Discuss how to handle time-related variables like ‘year’ and ‘month’. Handling Time-Related Variables (year and month):
The year and month variables, while currently integers, represent a temporal dimension. For robust time-series analysis or more intuitive plotting (e.g., chronological order on an axis), it’s often beneficial to combine them into a single date/time object. This can be done by creating a new column, for instance, date, which could represent the first day of each month.

This will allow for:

  1. Time Series Analysis: Easily plotting trends over time.

  2. Seasonality Analysis: Grouping by month to observe seasonal patterns, or by year to observe annual changes.

  3. Filtering and Aggregation: Efficiently filtering data by specific periods or aggregating data over longer timeframes (e.g., quarterly or annually).
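
As a minimal sketch of the approach described above, year and month can be combined into a Date with base R. The small data frame demo is a stand-in built from the year and month values in the head() output, and the column names date and quarter are assumptions introduced here, not part of the original dataset.

```r
# Hypothetical stand-in for df, using values from the head() output above
demo <- data.frame(
  year  = c(2010L, 2010L, 2010L),
  month = 1:3,
  sales = c(83, 108, 182)
)

# Build an ISO date string for the first day of each month, then parse it
demo$date <- as.Date(sprintf("%d-%02d-01", demo$year, demo$month))

# Once a Date exists, coarser periods can be derived from it
demo$quarter <- quarters(demo$date)

print(demo)
```

Using the first day of the month is only a convention; any fixed day would preserve chronological ordering.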


Task 2:

Calculate descriptive statistics (mean, median, standard deviation, IQR, variance, skewness, and kurtosis) for the quantitative variables: sales, volume, median_price, listings, and months_inventory. Ensure the e1071 package is installed and loaded for skewness and kurtosis.

library(e1071)

If the package is not yet installed: in R (for OSX), select Tools –> Install Packages from the menu, then search for the package and install it via Install selected.
Alternatively, go to the Packages pane in the right-hand window, click Install, type e1071, and click Install.

Proceed with calculating the descriptive statistics.
Ensure the variable is numeric.
Add to results dataframe.

variables_to_analyze <- c("sales", "volume", "median_price", "listings", "months_inventory")

results <- data.frame()

for (var_name in variables_to_analyze) {
  
  if (is.numeric(df[[var_name]])) {
    mean_val <- mean(df[[var_name]], na.rm = TRUE)
    median_val <- median(df[[var_name]], na.rm = TRUE)
    sd_val <- sd(df[[var_name]], na.rm = TRUE)
    var_val <- var(df[[var_name]], na.rm = TRUE)
    iqr_val <- IQR(df[[var_name]], na.rm = TRUE)
    skew_val <- skewness(df[[var_name]], na.rm = TRUE)
    kurt_val <- kurtosis(df[[var_name]], na.rm = TRUE)
    
    results <- rbind(results, data.frame(
      Variable = var_name,
      Mean = mean_val,
      Median = median_val,
      SD = sd_val,
      Variance = var_val,
      IQR = iqr_val,
      Skewness = skew_val,
      Kurtosis = kurt_val
    ))
  } else {
    cat(paste0("Variable '", var_name, "' is not numeric and will be skipped.\n"))
  }
}

print("Descriptive Statistics for Quantitative Variables:")
## [1] "Descriptive Statistics for Quantitative Variables:"
print(results)
##           Variable         Mean      Median           SD     Variance
## 1            sales    192.29167    175.5000    79.651111 6.344300e+03
## 2           volume     31.00519     27.0625    16.651447 2.772707e+02
## 3     median_price 132665.41667 134500.0000 22662.148687 5.135730e+08
## 4         listings   1738.02083   1618.5000   752.707756 5.665690e+05
## 5 months_inventory      9.19250      8.9500     2.303669 5.306889e+00
##          IQR    Skewness   Kurtosis
## 1   120.0000  0.71362055 -0.3355200
## 2    23.2335  0.87921815  0.1505673
## 3 32750.0000 -0.36227680 -0.6427292
## 4  1029.5000  0.64544309 -0.8101534
## 5     3.1500  0.04071944 -0.1979448
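
To make the Skewness and Kurtosis columns easier to interpret, the sketch below writes out moment-based estimators by hand. It assumes that the e1071 defaults (type = 3) were used above; the function names skew_by_hand and kurt_by_hand are hypothetical. The key point is that the kurtosis is an excess kurtosis, so a normal distribution scores 0.

```r
# Moment-based skewness: third central moment scaled by the sample SD cubed.
# Assumption: this mirrors the default estimator (type = 3) of e1071.
skew_by_hand <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)
  m3 / sd(x)^3
}

# Excess kurtosis: fourth central moment scaled by the sample SD to the
# fourth power, minus 3 so that a normal distribution scores 0.
kurt_by_hand <- function(x) {
  x <- x[!is.na(x)]
  m4 <- mean((x - mean(x))^4)
  m4 / sd(x)^4 - 3
}

x <- c(1, 2, 3, 4, 5)   # small symmetric example
skew_by_hand(x)          # symmetric data, so the skewness is zero
kurt_by_hand(x)          # negative: flatter than a normal distribution
```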

Generate frequency distributions for the qualitative and discrete quantitative variables city, year, and month.
The e1071 package has been previously installed for skewness and kurtosis calculations.

cat("\nFrequency Distribution for 'city':\n")
## 
## Frequency Distribution for 'city':
print(table(df$city))
## 
##              Beaumont Bryan-College Station                 Tyler 
##                    60                    60                    60 
##         Wichita Falls 
##                    60
cat("\nFrequency Distribution for 'year':\n")
## 
## Frequency Distribution for 'year':
print(table(df$year))
## 
## 2010 2011 2012 2013 2014 
##   48   48   48   48   48
cat("\nFrequency Distribution for 'month':\n")
## 
## Frequency Distribution for 'month':
print(table(df$month))
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 20 20 20 20 20 20 20 20 20 20 20 20

Finally, comment on the statistical results, highlighting significant findings and patterns for each variable.

Data Analysis Key Findings:

  1. Sales and Volume show positive skewness:
    sales (mean: 192.29, median: 175.5, skewness: 0.714) and volume (mean: 31.01 million, median: 27.06 million, skewness: 0.879) both have means higher than their medians, indicating a concentration of lower values and occasional higher values pulling the mean up.

  2. Median price is slightly negatively skewed:
    The median_price (mean: $132,665, median: $134,500, skewness: -0.362) has its median slightly higher than its mean, suggesting a concentration of prices at the higher end. Its standard deviation of $22,662 is large in absolute terms but only about 17% of the mean, the lowest relative variability among the quantitative variables.

  3. Listings are positively skewed with high variability:
    listings (mean: 1738.02, median: 1618.5, skewness: 0.645) also shows a positive skew and substantial variability (standard deviation: 752.71).

  4. Months inventory is nearly symmetrical:
    months_inventory (mean: 9.19, median: 8.95, skewness: 0.041) has an almost symmetrical distribution. Its standard deviation (2.30) is the smallest in absolute terms, but at about 25% of the mean its relative variability is higher than that of median_price.

  5. Platykurtic distributions are common:
    sales, median_price, and listings all exhibit platykurtic distributions (kurtosis values of -0.336, -0.643, and -0.810 respectively), meaning they have lighter tails and are less peaked than a normal distribution. volume is mesokurtic (kurtosis: 0.151) and months_inventory is nearly mesokurtic (kurtosis: -0.198).

  6. Dataset is perfectly balanced for categorical variables:
    Each of the four cities, five years, and twelve months has an equal number of observations (60 per city, 48 per year, and 20 per month), ensuring a balanced dataset for comparative analyses across these dimensions.
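
Because the variables are measured on very different scales, their standard deviations are not directly comparable. One scale-free check is the coefficient of variation (standard deviation divided by mean). The sketch below uses the means and standard deviations copied from the results table; the data frame name stats and the column cv are introduced here for illustration.

```r
# Means and standard deviations copied from the results table above
stats <- data.frame(
  variable = c("sales", "volume", "median_price", "listings", "months_inventory"),
  mean = c(192.29167, 31.00519, 132665.41667, 1738.02083, 9.19250),
  sd   = c(79.651111, 16.651447, 22662.148687, 752.707756, 2.303669)
)

# Coefficient of variation: SD expressed as a fraction of the mean
stats$cv <- stats$sd / stats$mean

# Sort from least to most variable in relative terms
print(stats[order(stats$cv), c("variable", "cv")])
```

On these figures median_price has the smallest coefficient of variation (about 0.17) and volume the largest (about 0.54).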

Insights or Next Steps:


Task 4:

Select the median_price variable, divide it into meaningful classes (bins) based on its quartiles, create a new categorical column named median_price_class in the dataframe, and display the defined bin edges and a sample of the new column.

First, calculate the quartiles for the median_price variable to define the bin edges for classification. Then create a new categorical column named median_price_class using the cut() function, based on these quartiles, and display the bin edges and a sample of the new column to verify its creation.

quartile_breaks <- quantile(df$median_price, probs = c(0, 0.25, 0.50, 0.75, 1), na.rm = TRUE)

labels <- c("Low", "Medium-Low", "Medium-High", "High")

df$median_price_class <- cut(df$median_price, breaks = quartile_breaks, labels = labels, include.lowest = TRUE)

cat("Defined Quartile Breaks (Bin Edges):\n")
## Defined Quartile Breaks (Bin Edges):
print(quartile_breaks)
##     0%    25%    50%    75%   100% 
##  73800 117300 134500 150050 180000
cat("\nSample of new 'median_price_class' column:\n")
## 
## Sample of new 'median_price_class' column:
print(head(df$median_price_class))
## [1] High        Medium-High Medium-Low  Medium-Low  Medium-Low  Medium-Low 
## Levels: Low Medium-Low Medium-High High
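
The include.lowest = TRUE argument matters here. By default cut() builds right-closed intervals, so the minimum value sits exactly on the first break and would otherwise be left unclassified. The sketch below illustrates this on the six median_price values from the head() output; the object names are hypothetical.

```r
# Six median_price values taken from the head() output
x <- c(163800, 138200, 122400, 123200, 123100, 122800)

breaks <- quantile(x, probs = c(0, 0.25, 0.50, 0.75, 1))
labels <- c("Low", "Medium-Low", "Medium-High", "High")

# Default: intervals are (a, b], so the minimum falls outside the first bin
without_lowest <- cut(x, breaks = breaks, labels = labels)

# include.lowest = TRUE closes the first interval on the left: [a, b]
with_lowest <- cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)

sum(is.na(without_lowest))  # one value (the minimum) is NA
sum(is.na(with_lowest))     # no NA values
```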

Generate and display the frequency distribution (table) for the newly created median_price_class variable, showing the count of observations falling into each price class.

Use the table() function on the df$median_price_class column and then print the resulting table.

print(table(df$median_price_class))
## 
##         Low  Medium-Low Medium-High        High 
##          60          61          59          60

Create and display a bar chart using ggplot2 to visually represent the frequency distribution of median_price_class. The chart will include appropriate labels and a title.

Use ggplot2 to create a bar chart.
This involves loading the library, mapping median_price_class to the x-axis, and adding appropriate labels and a title.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:e1071':
## 
##     element
# Create a bar chart for median_price_class
ggplot(df, aes(x = median_price_class)) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Frequency Distribution of Median Price Classes",
    x = "Median Price Class",
    y = "Frequency"
  ) +
  theme_minimal()

Calculate the Gini heterogeneity index for the median_price_class variable:
determine the proportion of each category within median_price_class and apply the Gini index formula: Gini = 1 - sum(p_i^2), where p_i is the proportion of observations in category i.

freq_table <- table(df$median_price_class)
proportions <- freq_table / sum(freq_table)
gini_index <- 1 - sum(proportions^2)

cat("Frequency Table for median_price_class:\n")
## Frequency Table for median_price_class:
print(freq_table)
## 
##         Low  Medium-Low Medium-High        High 
##          60          61          59          60
cat("\nProportions for median_price_class:\n")
## 
## Proportions for median_price_class:
print(proportions)
## 
##         Low  Medium-Low Medium-High        High 
##   0.2500000   0.2541667   0.2458333   0.2500000
cat("\nGini Heterogeneity Index for median_price_class: ", gini_index, "\n")
## 
## Gini Heterogeneity Index for median_price_class:  0.7499653

Discuss the interpretation of the calculated index in the context of the price class distribution. The calculated Gini heterogeneity index for median_price_class is approximately 0.750.

Discussion:

Gini Index Range:
With k categories, the Gini index as computed here ranges from 0 to (k - 1)/k; with four classes the maximum is 0.75.
A value of 0 indicates perfect homogeneity, meaning all observations fall into a single category.
The maximum indicates perfect heterogeneity, meaning observations are spread evenly across all categories. Dividing the index by (k - 1)/k gives a normalized version that ranges from 0 to 1.

Context of median_price_class:
In this case, the median_price_class variable was created by dividing median_price into four classes (Low, Medium-Low, Medium-High, High) based on quartiles.
Ideally, if the division into quartiles is perfectly balanced, each class would contain 25% of the observations.
This would lead to the maximum Gini index of 0.75, indicating the highest possible heterogeneity (diversity) among four classes.

Result Interpretation:
A Gini index of 0.750 is essentially the maximum attainable with four classes, indicating near-perfect heterogeneity in the median_price_class variable; the normalized index is about 0.9999.
This is expected given that the classes were created using quartiles, which aim for an approximately equal distribution of observations across the four categories.
The frequency table shows that the classes are indeed very evenly distributed (60, 61, 59, 60 observations respectively out of 240 total), so no single price class dominates the dataset.

Implication:
Because the bins are quartile-based, this high heterogeneity is a consequence of how the classes were built rather than an independent finding about the market. It confirms that the classification splits the observations into four groups of nearly equal size, which makes the classes well suited to comparisons across price levels.
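
To put the index on a scale that does not depend on the number of classes, it can be divided by its maximum, (k - 1)/k. The following sketch uses the class counts from the frequency table above; the helper name gini_heterogeneity is hypothetical.

```r
# Gini heterogeneity index with its theoretical maximum and normalized form
gini_heterogeneity <- function(counts) {
  p <- counts / sum(counts)       # category proportions
  k <- length(counts)             # number of categories
  gini <- 1 - sum(p^2)
  max_gini <- (k - 1) / k         # reached when all proportions equal 1/k
  c(gini = gini, max = max_gini, normalized = gini / max_gini)
}

# Class counts copied from the frequency table above
counts <- c(Low = 60, `Medium-Low` = 61, `Medium-High` = 59, High = 60)

res <- gini_heterogeneity(counts)
print(round(res, 7))
```

The normalized value is very close to 1, which is what quartile-based bins are expected to produce.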

Data Analysis Key Findings:

Insights or Next Steps:


Task 5:

Calculate the probability that a randomly selected row from the dataset corresponds to the city ‘Beaumont’.
This involves counting rows where ‘city’ is ‘Beaumont’ and dividing by the total number of rows.

Filter the DataFrame to count rows for ‘Beaumont’, get the total number of rows, and then divide the ‘Beaumont’ count by the total count.

count_beaumont <- sum(df$city == "Beaumont")
total_rows <- nrow(df)
probability_beaumont <- count_beaumont / total_rows

cat("Number of rows for 'Beaumont':", count_beaumont, "\n")
## Number of rows for 'Beaumont': 60
cat("Total number of rows:", total_rows, "\n")
## Total number of rows: 240
cat("Probability of selecting a row for 'Beaumont':", probability_beaumont, "\n")
## Probability of selecting a row for 'Beaumont': 0.25

Calculate the probability that a randomly selected row corresponds to the month of ‘July’ using a similar approach.

count_july <- sum(df$month == 7)
total_rows <- nrow(df)
probability_july <- count_july / total_rows

cat("Number of rows for 'July':", count_july, "\n")
## Number of rows for 'July': 20
cat("Total number of rows:", total_rows, "\n")
## Total number of rows: 240
cat("Probability of selecting a row for 'July':", probability_july, "\n")
## Probability of selecting a row for 'July': 0.08333333

Determine the probability of selecting a row that corresponds to ‘December 2012’ by filtering the data for both month 12 and year 2012, and then dividing this count by the total number of rows.

count_december_2012 <- sum(df$month == 12 & df$year == 2012)
total_rows <- nrow(df)
probability_december_2012 <- count_december_2012 / total_rows

cat("Number of rows for 'December 2012':", count_december_2012, "\n")
## Number of rows for 'December 2012': 4
cat("Total number of rows:", total_rows, "\n")
## Total number of rows: 240
cat("Probability of selecting a row for 'December 2012':", probability_december_2012, "\n")
## Probability of selecting a row for 'December 2012': 0.01666667
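
The same probabilities can be obtained more compactly, because the mean of a logical vector is the proportion of TRUE values. The sketch below rebuilds the dataset's layout as a full grid of 4 cities, 5 years, and 12 months. That layout is an assumption, although it is consistent with the frequency tables shown earlier. The object names are hypothetical.

```r
# Assumed layout: every city observed in every month of every year
grid <- expand.grid(
  city  = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
  year  = 2010:2014,
  month = 1:12,
  stringsAsFactors = FALSE
)

# mean() of a logical vector gives the share of rows meeting the condition
p_beaumont <- mean(grid$city == "Beaumont")
p_july     <- mean(grid$month == 7)
p_dec_2012 <- mean(grid$month == 12 & grid$year == 2012)

c(beaumont = p_beaumont, july = p_july, dec_2012 = p_dec_2012)
```

In a balanced grid like this, month and year are independent, so the probability of ‘December 2012’ equals (1/12) * (1/5) = 1/60.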

Probabilities:

Data Analysis Key Findings and Interpretation:

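  1. City ‘Beaumont’ (P = 0.25): 60 of the 240 rows, i.e. one of four equally represented cities.
  2. Month ‘July’ (P ≈ 0.0833): 20 of the 240 rows, i.e. one of twelve equally represented months.
  3. ‘December 2012’ (P ≈ 0.0167): 4 of the 240 rows, one per city.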

Insights:


Task 6:

Create two new variables in the df dataframe: average_price (calculated as (volume * 1,000,000) divided by sales) and listing_effectiveness (calculated as sales divided by listings). For each new variable, display the head of the updated dataframe and summary statistics.
Then, provide a commentary on their meaning, distributions, and initial insights into the real estate market.
Finally, summarize the process of creating these new variables and the insights gained from their analysis, linking them to the project’s objectives.

Calculate the average_price as specified, multiplying volume by 1,000,000 to convert it to dollars, then dividing by sales.
Display the head of the updated dataframe and summary statistics for the new column to verify the operation.

df$average_price <- (df$volume * 1000000) / df$sales

cat("Head of DataFrame with new 'average_price' column:\n")
## Head of DataFrame with new 'average_price' column:
print(head(df))
##       city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010     1    83 14.162       163800     1533              9.5
## 2 Beaumont 2010     2   108 17.690       138200     1586             10.0
## 3 Beaumont 2010     3   182 28.701       122400     1689             10.6
## 4 Beaumont 2010     4   200 26.819       123200     1708             10.6
## 5 Beaumont 2010     5   202 28.833       123100     1771             10.9
## 6 Beaumont 2010     6   189 27.219       122800     1803             11.1
##   median_price_class average_price
## 1               High      170626.5
## 2        Medium-High      163796.3
## 3         Medium-Low      157697.8
## 4         Medium-Low      134095.0
## 5         Medium-Low      142737.6
## 6         Medium-Low      144015.9
cat("\nSummary statistics for 'average_price' column:\n")
## 
## Summary statistics for 'average_price' column:
print(summary(df$average_price))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   97010  132939  156588  154320  173915  213234

Now that the average_price column has been created and inspected, create the second new column, listing_effectiveness, calculated as sales divided by listings, and display the head of the updated dataframe and summary statistics for this new column to verify its creation and initial properties.

df$listing_effectiveness <- df$sales / df$listings

cat("Head of DataFrame with new 'listing_effectiveness' column:\n")
## Head of DataFrame with new 'listing_effectiveness' column:
print(head(df))
##       city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010     1    83 14.162       163800     1533              9.5
## 2 Beaumont 2010     2   108 17.690       138200     1586             10.0
## 3 Beaumont 2010     3   182 28.701       122400     1689             10.6
## 4 Beaumont 2010     4   200 26.819       123200     1708             10.6
## 5 Beaumont 2010     5   202 28.833       123100     1771             10.9
## 6 Beaumont 2010     6   189 27.219       122800     1803             11.1
##   median_price_class average_price listing_effectiveness
## 1               High      170626.5            0.05414220
## 2        Medium-High      163796.3            0.06809584
## 3         Medium-Low      157697.8            0.10775607
## 4         Medium-Low      134095.0            0.11709602
## 5         Medium-Low      142737.6            0.11405985
## 6         Medium-Low      144015.9            0.10482529
cat("\nSummary statistics for 'listing_effectiveness' column:\n")
## 
## Summary statistics for 'listing_effectiveness' column:
print(summary(df$listing_effectiveness))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05014 0.08980 0.10963 0.11874 0.13492 0.38713
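
As a quick check of the two formulas, the sketch below recomputes both variables for the first three rows shown in the head() output. The data frame demo is a stand-in, and the zero-sales guard is an assumption about how one might handle such a month; no such month appears in this dataset, since the minimum of sales is 79.

```r
# First three rows from the head() output (volume is in millions of dollars)
demo <- data.frame(
  sales    = c(83, 108, 182),
  volume   = c(14.162, 17.690, 28.701),
  listings = c(1533, 1586, 1689)
)

# Both derived variables in one step
demo <- transform(
  demo,
  average_price         = volume * 1e6 / sales,
  listing_effectiveness = sales / listings
)

# Hypothetical guard: a month with zero sales would give NaN or Inf,
# so mark it as missing instead
demo$average_price[demo$sales == 0] <- NA

print(demo)
```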

Comments:

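  1. average_price: the mean sale price per transaction in dollars, ranging from $97,010 to $213,234 (mean $154,320, median $156,588).
  2. listing_effectiveness: the ratio of sales to listings in the month, ranging from 0.050 to 0.387 (mean 0.119, median 0.110); the maximum lies far above the third quartile (0.135), pointing to a long right tail.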

Data Analysis Key Findings:

Insights:


Task 7:

Conditional Analysis by City:

Perform conditional statistical analysis (mean and standard deviation) for key quantitative variables grouped by ‘city’ using dplyr. Display the summary statistics and create appropriate graphical representations (e.g., bar charts for means, box plots for distributions) to visualize city-specific trends.

Calculate the mean and standard deviation for the specified quantitative variables, grouped by ‘city’, as instructed.

library(dplyr)

city_summary <- df %>% 
  group_by(city) %>% 
  summarise(
    mean_sales = mean(sales, na.rm = TRUE),
    sd_sales = sd(sales, na.rm = TRUE),
    mean_volume = mean(volume, na.rm = TRUE),
    sd_volume = sd(volume, na.rm = TRUE),
    mean_median_price = mean(median_price, na.rm = TRUE),
    sd_median_price = sd(median_price, na.rm = TRUE),
    mean_listings = mean(listings, na.rm = TRUE),
    sd_listings = sd(listings, na.rm = TRUE),
    mean_months_inventory = mean(months_inventory, na.rm = TRUE),
    sd_months_inventory = sd(months_inventory, na.rm = TRUE),
    mean_average_price = mean(average_price, na.rm = TRUE),
    sd_average_price = sd(average_price, na.rm = TRUE),
    mean_listing_effectiveness = mean(listing_effectiveness, na.rm = TRUE),
    sd_listing_effectiveness = sd(listing_effectiveness, na.rm = TRUE)
  )

print(city_summary)
## # A tibble: 4 × 15
##   city               mean_sales sd_sales mean_volume sd_volume mean_median_price
##   <chr>                   <dbl>    <dbl>       <dbl>     <dbl>             <dbl>
## 1 Beaumont                 177.     41.5        26.1      6.97           129988.
## 2 Bryan-College Sta…       206.     85.0        38.2     17.2            157488.
## 3 Tyler                    270.     62.0        45.8     13.1            141442.
## 4 Wichita Falls            116.     22.2        13.9      3.24           101743.
## # ℹ 9 more variables: sd_median_price <dbl>, mean_listings <dbl>,
## #   sd_listings <dbl>, mean_months_inventory <dbl>, sd_months_inventory <dbl>,
## #   mean_average_price <dbl>, sd_average_price <dbl>,
## #   mean_listing_effectiveness <dbl>, sd_listing_effectiveness <dbl>
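
The summarise() call above repeats the same two functions for every variable. A more compact alternative, sketched here under the assumption that dplyr 1.0 or later is installed, uses across(). The data frame demo is a stand-in: the Beaumont rows come from the head() output, while the Tyler rows are made-up illustration values.

```r
library(dplyr)

# Stand-in data: Beaumont values from head(); Tyler values are invented
demo <- data.frame(
  city   = c("Beaumont", "Beaumont", "Tyler", "Tyler"),
  sales  = c(83, 108, 200, 300),
  volume = c(14.162, 17.690, 30, 50)
)

demo_summary <- demo %>%
  group_by(city) %>%
  summarise(
    across(
      c(sales, volume),
      list(mean = ~ mean(.x, na.rm = TRUE), sd = ~ sd(.x, na.rm = TRUE)),
      .names = "{.fn}_{.col}"   # yields mean_sales, sd_sales, ...
    )
  )

# width = Inf prints every column instead of truncating the tibble
print(demo_summary, width = Inf)
```

Printing with width = Inf also addresses the truncated output above, where nine of the summary columns are hidden.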

Conditional Analysis by Year:

Perform the conditional statistical analysis grouped by ‘year’. Display the summary statistics and create graphical representations (e.g., line charts, bar charts) to visualize annual trends.

Calculate the mean and standard deviation for the specified quantitative variables.

library(dplyr)

year_summary <- df %>%
  group_by(year) %>%
  summarise(
    mean_sales = mean(sales, na.rm = TRUE),
    sd_sales = sd(sales, na.rm = TRUE),
    mean_volume = mean(volume, na.rm = TRUE),
    sd_volume = sd(volume, na.rm = TRUE),
    mean_median_price = mean(median_price, na.rm = TRUE),
    sd_median_price = sd(median_price, na.rm = TRUE),
    mean_listings = mean(listings, na.rm = TRUE),
    sd_listings = sd(listings, na.rm = TRUE),
    mean_months_inventory = mean(months_inventory, na.rm = TRUE),
    sd_months_inventory = sd(months_inventory, na.rm = TRUE),
    mean_average_price = mean(average_price, na.rm = TRUE),
    sd_average_price = sd(average_price, na.rm = TRUE),
    mean_listing_effectiveness = mean(listing_effectiveness, na.rm = TRUE),
    sd_listing_effectiveness = sd(listing_effectiveness, na.rm = TRUE)
  )

print(year_summary)
## # A tibble: 5 × 15
##    year mean_sales sd_sales mean_volume sd_volume mean_median_price
##   <int>      <dbl>    <dbl>       <dbl>     <dbl>             <dbl>
## 1  2010       169.     60.5        25.7      10.8           130192.
## 2  2011       164.     63.9        25.2      12.2           127854.
## 3  2012       186.     70.9        29.3      14.5           130077.
## 4  2013       212.     84.0        35.2      17.9           135723.
## 5  2014       231.     95.5        39.8      21.2           139481.
## # ℹ 9 more variables: sd_median_price <dbl>, mean_listings <dbl>,
## #   sd_listings <dbl>, mean_months_inventory <dbl>, sd_months_inventory <dbl>,
## #   mean_average_price <dbl>, sd_average_price <dbl>,
## #   mean_listing_effectiveness <dbl>, sd_listing_effectiveness <dbl>

Conditional Analysis by Month:

Perform conditional statistical analysis (mean and standard deviation) for key quantitative variables grouped by ‘month’. Display the summary statistics and create graphical representations (e.g., line charts, bar charts) to visualize seasonal trends.

Calculate the mean and standard deviation for the specified quantitative variables.

library(dplyr)

month_summary <- df %>%
  group_by(month) %>%
  summarise(
    mean_sales = mean(sales, na.rm = TRUE),
    sd_sales = sd(sales, na.rm = TRUE),
    mean_volume = mean(volume, na.rm = TRUE),
    sd_volume = sd(volume, na.rm = TRUE),
    mean_median_price = mean(median_price, na.rm = TRUE),
    sd_median_price = sd(median_price, na.rm = TRUE),
    mean_listings = mean(listings, na.rm = TRUE),
    sd_listings = sd(listings, na.rm = TRUE),
    mean_months_inventory = mean(months_inventory, na.rm = TRUE),
    sd_months_inventory = sd(months_inventory, na.rm = TRUE),
    mean_average_price = mean(average_price, na.rm = TRUE),
    sd_average_price = sd(average_price, na.rm = TRUE),
    mean_listing_effectiveness = mean(listing_effectiveness, na.rm = TRUE),
    sd_listing_effectiveness = sd(listing_effectiveness, na.rm = TRUE)
  )

print(month_summary)
## # A tibble: 12 × 15
##    month mean_sales sd_sales mean_volume sd_volume mean_median_price
##    <int>      <dbl>    <dbl>       <dbl>     <dbl>             <dbl>
##  1     1       127.     43.4        19.0      8.37            124250
##  2     2       141.     51.1        21.7     10.1             130075
##  3     3       189.     59.2        29.4     12.0             127415
##  4     4       212.     65.4        33.3     14.5             131490
##  5     5       239.     83.1        39.7     19.0             134485
##  6     6       244.     95.0        41.3     21.1             137620
##  7     7       236.     96.3        39.1     21.4             134750
##  8     8       231.     79.2        38.0     18.0             136675
##  9     9       182.     72.5        29.6     15.2             134040
## 10    10       180.     75.0        29.1     15.1             133480
## 11    11       157.     55.5        24.8     11.2             134305
## 12    12       169.     60.7        27.1     12.6             133400
## # ℹ 9 more variables: sd_median_price <dbl>, mean_listings <dbl>,
## #   sd_listings <dbl>, mean_months_inventory <dbl>, sd_months_inventory <dbl>,
## #   mean_average_price <dbl>, sd_average_price <dbl>,
## #   mean_listing_effectiveness <dbl>, sd_listing_effectiveness <dbl>
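
The seasonal pattern in the table above is easier to see as a line chart. The following is a minimal sketch with ggplot2, using the rounded mean_sales values printed above; the data frame month_sales is a stand-in for month_summary.

```r
library(ggplot2)

# Rounded monthly means copied from the month_summary output above
month_sales <- data.frame(
  month = 1:12,
  mean_sales = c(127, 141, 189, 212, 239, 244, 236, 231, 182, 180, 157, 169)
)

p <- ggplot(month_sales, aes(x = month, y = mean_sales)) +
  geom_line(color = "steelblue") +
  geom_point(color = "steelblue") +
  # month.abb is the built-in vector of abbreviated month names
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  labs(
    title = "Mean Sales by Month",
    x = "Month",
    y = "Mean Sales"
  ) +
  theme_minimal()

print(p)
```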

Conditional Analysis by Month:

Data Analysis Key Findings:

Insights or Next Steps:

Conditional Analysis (by City, Year, and Month):

Data Analysis Key Findings:

Insights or Next Steps:

Integrated Summary of Conditional Analyses and Strategic Implications:

The conditional analyses conducted by city, year, and month provide a comprehensive understanding of the dynamics shaping the Texas real estate market. These insights highlight structural differences across cities, a clear multi‑year growth trajectory, and strong seasonal patterns that influence sales activity, pricing, inventory behavior, and listing performance.

The analyses reveal substantial variation across local markets:

  1. Tyler: stands out as the most active market, with the highest mean sales and volume, as well as strong listing_effectiveness, indicating efficient conversion of listings into sales.

  2. Bryan-College Station: leads in median_price and average_price, positioning it as a premium-value market.

  3. Beaumont: shows moderate performance across most metrics.

  4. Wichita Falls: consistently records the lowest values in sales, prices, and listing effectiveness, along with higher and more variable months_inventory, reflecting a slower and less competitive market.

These differences underscore the need for city-specific strategies tailored to each market’s structure and performance.

Across the five-year period, the Texas real estate market demonstrates strong and sustained growth:

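  1. Mean sales, volume, and median_price dip slightly in 2011 and then increase every year through 2014 (mean sales: 169 in 2010, 164 in 2011, 231 in 2014); average_price and listing_effectiveness are also reported as rising over the period.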

  2. Mean months_inventory declines from approximately 9.9 to 8.4 months, signaling a tightening seller’s market where properties sell more quickly.

  3. Price appreciation is evident, with mean median price rising from roughly $130,192 in 2010 to $139,481 in 2014.

These trends reflect improving market efficiency, stronger demand, and favorable economic conditions.

The analyses also reveal pronounced seasonality:

  1. Sales and volume peak between May and July, with June typically showing the highest activity.

  2. Median_price and average_price follow similar seasonal peaks, reaching their highest levels in late spring and early summer.

  3. Listing_effectiveness is strongest during the same period, indicating that marketing efforts convert more effectively when demand is highest.

  4. Months_inventory displays inverse seasonality, reaching its lowest levels in summer and highest in winter, confirming faster turnover during peak demand months.

  5. Listings increase in spring and peak around May–July, suggesting that sellers strategically enter the market when conditions are most favorable.

Strategic Implications:

The combined insights from the conditional analyses support several strategic recommendations for Texas Realty Insights:

  1. Develop City-Specific Strategies:
    Focus on high-volume campaigns in Tyler, premium positioning in Bryan-College Station, and targeted, conservative approaches in slower markets like Wichita Falls.

  2. Capitalize on Seasonality:
    Concentrate marketing investments, listing promotions, and staffing resources between March and August to leverage peak demand, higher prices, and stronger listing effectiveness.

  3. Implement Dynamic Pricing:
    Adjust pricing strategies to reflect seasonal fluctuations, maximizing revenue during high-activity months.

  4. Monitor Inventory Conditions:
    Use months_inventory as a key indicator of market competitiveness and to guide recommendations on optimal listing timing.

  5. Leverage Multi-Year Growth Trends:
    Invest in expanding operations in high-performing markets and refine sales forecasts using the consistent upward trends observed from 2010 to 2014.

This integrated analysis provides a clear, data-driven foundation for strategic decision-making, enabling Texas Realty Insights to optimize marketing efforts, improve forecasting, and tailor approaches to the unique characteristics of each market.


Task 8 –> ggplot2 Visualizations:

Create a bar chart using ggplot2 to visualize the mean of sales across different cities. This chart will help in comparing the average sales performance of each city.

library(ggplot2)

Bar chart for mean sales by city

ggplot(city_summary, aes(x = city, y = mean_sales, fill = city)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Sales by City",
    x = "City",
    y = "Mean Sales"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will allow for easy comparison of average sales volume per city.

Bar chart for mean volume by city

ggplot(city_summary, aes(x = city, y = mean_volume, fill = city)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Sales Volume by City",
    x = "City",
    y = "Mean Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will allow for easy comparison of the average median price per city.

Bar chart for mean median_price by city

ggplot(city_summary, aes(x = city, y = mean_median_price, fill = city)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Median Price by City",
    x = "City",
    y = "Mean Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will allow for easy comparison of the average number of listings per city.

Bar chart for mean listings by city

ggplot(city_summary, aes(x = city, y = mean_listings, fill = city)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Listings by City",
    x = "City",
    y = "Mean Listings"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will allow for easy comparison of the average months inventory per city.

Bar chart for mean months_inventory by city

ggplot(city_summary, aes(x = city, y = mean_months_inventory, fill = city)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Months Inventory by City",
    x = "City",
    y = "Mean Months Inventory"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will allow for easy comparison of the average price per city.

Bar chart for mean average_price by city

ggplot(city_summary, aes(x = city, y = mean_average_price, fill = city)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Average Price by City",
    x = "City",
    y = "Mean Average Price (in dollars)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will allow for easy comparison of the average listing effectiveness per city.

Bar chart for mean listing_effectiveness by city

ggplot(city_summary, aes(x = city, y = mean_listing_effectiveness, fill = city)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Listing Effectiveness by City",
    x = "City",
    y = "Mean Listing Effectiveness"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
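
The seven bar charts above differ only in the column plotted and in the labels. As a minimal sketch, the repetition could be folded into a small helper. The function name plot_bar_by_city and the column names jan_sales and feb_sales are hypothetical; the toy data frame reuses the January and February totals per city printed by the aggregation later in this task, purely as a stand-in for city_summary.

```r
library(ggplot2)

# Hypothetical helper: one bar per city for the column named in `var`.
plot_bar_by_city <- function(data, var, label) {
  ggplot(data, aes(x = city, y = .data[[var]], fill = city)) +
    geom_col() +  # equivalent to geom_bar(stat = "identity")
    labs(title = paste(label, "by City"), x = "City", y = label) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}

# Stand-in for city_summary, built from the printed month 1 and month 2 totals.
toy_summary <- data.frame(
  city = c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"),
  jan_sales = c(608, 591, 907, 442),
  feb_sales = c(677, 628, 1058, 454)
)

# Map each column to its axis label and build one plot per column.
chart_labels <- c(jan_sales = "January Total Sales", feb_sales = "February Total Sales")
plots <- lapply(names(chart_labels), function(v) {
  plot_bar_by_city(toy_summary, v, chart_labels[[v]])
})
names(plots) <- names(chart_labels)
```

Printing plots$jan_sales draws the first chart. With city_summary in place of toy_summary and its mean_* columns in chart_labels, the same loop would produce the charts shown above.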

Create a box plot using ggplot2 to visualize the distribution of sales across different cities. This chart will provide insights into the spread, central tendency, and outliers of sales for each city.

Box plot for sales by city

ggplot(df, aes(x = city, y = sales, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales by City",
    x = "City",
    y = "Sales"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will provide insights into the spread, central tendency, and outliers of sales volume for each city.

Box plot for volume by city

ggplot(df, aes(x = city, y = volume, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales Volume by City",
    x = "City",
    y = "Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will provide insights into the spread, central tendency, and outliers of median price for each city.

Box plot for median_price by city

ggplot(df, aes(x = city, y = median_price, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Median Price by City",
    x = "City",
    y = "Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will provide insights into the spread, central tendency, and outliers of listings for each city.

Box plot for listings by city

ggplot(df, aes(x = city, y = listings, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Listings by City",
    x = "City",
    y = "Listings"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will provide insights into the spread, central tendency, and outliers of months inventory for each city.

Box plot for months_inventory by city

ggplot(df, aes(x = city, y = months_inventory, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Months Inventory by City",
    x = "City",
    y = "Months Inventory"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will provide insights into the spread, central tendency, and outliers of average price for each city.

Box plot for average_price by city

ggplot(df, aes(x = city, y = average_price, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Average Price by City",
    x = "City",
    y = "Average Price (in dollars)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will provide insights into the spread, central tendency, and outliers of listing effectiveness for each city.

Box plot for listing_effectiveness by city

ggplot(df, aes(x = city, y = listing_effectiveness, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Listing Effectiveness by City",
    x = "City",
    y = "Listing Effectiveness"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
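
To make "spread, central tendency, and outliers" concrete, the sketch below computes by hand the quantities a box plot draws, assuming the default rule that points more than 1.5 times the IQR beyond the quartiles are shown as outliers. It uses only the six sales values printed by head(df) (Beaumont, January to June 2010), so it illustrates the calculation rather than any result for the full data.

```r
# Sales for the first six rows printed by head(df).
sales <- c(83, 108, 182, 200, 202, 189)

# Quartiles with quantile()'s default method (type 7).
q <- quantile(sales, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences: values outside these limits would be drawn as outlier points.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- sales[sales < lower_fence | sales > upper_fence]

print(q)
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```

Base R's boxplot.stats() uses Tukey hinges instead of type 7 quartiles, so its box edges can differ slightly on small samples.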

This chart will help in comparing the average sales performance for each year.

Bar chart for mean sales by year

ggplot(year_summary, aes(x = as.factor(year), y = mean_sales, fill = as.factor(year))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Sales by Year",
    x = "Year",
    y = "Mean Sales"
  ) +
  theme_minimal() + 
  theme(legend.position = "none")

This will allow for easy comparison of average sales volume per year.

Bar chart for mean volume by year

ggplot(year_summary, aes(x = as.factor(year), y = mean_volume, fill = as.factor(year))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Sales Volume by Year",
    x = "Year",
    y = "Mean Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average median price per year.

Bar chart for mean median_price by year

ggplot(year_summary, aes(x = as.factor(year), y = mean_median_price, fill = as.factor(year))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Median Price by Year",
    x = "Year",
    y = "Mean Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average number of listings per year.

Bar chart for mean listings by year

ggplot(year_summary, aes(x = as.factor(year), y = mean_listings, fill = as.factor(year))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Listings by Year",
    x = "Year",
    y = "Mean Listings"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average months inventory per year.

Bar chart for mean months_inventory by year

ggplot(year_summary, aes(x = as.factor(year), y = mean_months_inventory, fill = as.factor(year))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Months Inventory by Year",
    x = "Year",
    y = "Mean Months Inventory"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average price per year.

Bar chart for mean average_price by year

ggplot(year_summary, aes(x = as.factor(year), y = mean_average_price, fill = as.factor(year))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Average Price by Year",
    x = "Year",
    y = "Mean Average Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average listing effectiveness per year.

Bar chart for mean listing_effectiveness by year

ggplot(year_summary, aes(x = as.factor(year), y = mean_listing_effectiveness, fill = as.factor(year))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Listing Effectiveness by Year",
    x = "Year",
    y = "Mean Listing Effectiveness"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Create a box plot using ggplot2 to visualize the distribution of sales across different years. This will provide insights into the spread, central tendency, and outliers of sales for each year.

Box plot for sales by year

ggplot(df, aes(x = as.factor(year), y = sales, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales by Year",
    x = "Year",
    y = "Sales"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of sales volume for each year.

Box plot for volume by year

ggplot(df, aes(x = as.factor(year), y = volume, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales Volume by Year",
    x = "Year",
    y = "Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of median price for each year.

Box plot for median_price by year

ggplot(df, aes(x = as.factor(year), y = median_price, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Median Price by Year",
    x = "Year",
    y = "Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of listings for each year.

Box plot for listings by year

ggplot(df, aes(x = as.factor(year), y = listings, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Listings by Year",
    x = "Year",
    y = "Listings"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of months inventory for each year.

Box plot for months_inventory by year

ggplot(df, aes(x = as.factor(year), y = months_inventory, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Months Inventory by Year",
    x = "Year",
    y = "Months Inventory"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of average price for each year.

Box plot for average_price by year

ggplot(df, aes(x = as.factor(year), y = average_price, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Average Price by Year",
    x = "Year",
    y = "Average Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of listing effectiveness for each year.

Box plot for listing_effectiveness by year

ggplot(df, aes(x = as.factor(year), y = listing_effectiveness, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Listing Effectiveness by Year",
    x = "Year",
    y = "Listing Effectiveness"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This chart will help in comparing the average sales performance for each month.

Bar chart for mean sales by month

ggplot(month_summary, aes(x = as.factor(month), y = mean_sales, fill = as.factor(month))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Sales by Month",
    x = "Month",
    y = "Mean Sales"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of average sales volume per month.

Bar chart for mean volume by month

ggplot(month_summary, aes(x = as.factor(month), y = mean_volume, fill = as.factor(month))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Sales Volume by Month",
    x = "Month",
    y = "Mean Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average median price per month.

Bar chart for mean median_price by month

ggplot(month_summary, aes(x = as.factor(month), y = mean_median_price, fill = as.factor(month))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Median Price by Month",
    x = "Month",
    y = "Mean Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average number of listings per month.

Bar chart for mean listings by month

ggplot(month_summary, aes(x = as.factor(month), y = mean_listings, fill = as.factor(month))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Listings by Month",
    x = "Month",
    y = "Mean Listings"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average months inventory per month.

Bar chart for mean months_inventory by month

ggplot(month_summary, aes(x = as.factor(month), y = mean_months_inventory, fill = as.factor(month))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Months Inventory by Month",
    x = "Month",
    y = "Mean Months Inventory"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average price per month.

Bar chart for mean average_price by month

ggplot(month_summary, aes(x = as.factor(month), y = mean_average_price, fill = as.factor(month))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Average Price by Month",
    x = "Month",
    y = "Mean Average Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will allow for easy comparison of the average listing effectiveness per month.

Bar chart for mean listing_effectiveness by month

ggplot(month_summary, aes(x = as.factor(month), y = mean_listing_effectiveness, fill = as.factor(month))) +
  geom_bar(stat = "identity") +
  labs(
    title = "Mean Listing Effectiveness by Month",
    x = "Month",
    y = "Mean Listing Effectiveness"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of sales for each month.

Box plot for sales by month

ggplot(df, aes(x = as.factor(month), y = sales, fill = as.factor(month))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales by Month",
    x = "Month",
    y = "Sales"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of sales volume for each month.

Box plot for volume by month

ggplot(df, aes(x = as.factor(month), y = volume, fill = as.factor(month))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales Volume by Month",
    x = "Month",
    y = "Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of median price for each month.

Box plot for median_price by month

ggplot(df, aes(x = as.factor(month), y = median_price, fill = as.factor(month))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Median Price by Month",
    x = "Month",
    y = "Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of listings for each month.

Box plot for listings by month

ggplot(df, aes(x = as.factor(month), y = listings, fill = as.factor(month))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Listings by Month",
    x = "Month",
    y = "Listings"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of months inventory for each month.

Box plot for months_inventory by month

ggplot(df, aes(x = as.factor(month), y = months_inventory, fill = as.factor(month))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Months Inventory by Month",
    x = "Month",
    y = "Months Inventory"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This will provide insights into the spread, central tendency, and outliers of average price for each month.

Box plot for average_price by month

ggplot(df, aes(x = as.factor(month), y = average_price, fill = as.factor(month))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Average Price by Month",
    x = "Month",
    y = "Average Price (in dollars)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Box Plot for listing_effectiveness by Month:

Create a box plot to visualize the distribution of listing effectiveness across different months, to provide insights into the spread, central tendency, and outliers of listing effectiveness for each month.

library(ggplot2)

Box plot for listing_effectiveness by month

ggplot(df, aes(x = as.factor(month), y = listing_effectiveness, fill = as.factor(month))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Listing Effectiveness by Month",
    x = "Month",
    y = "Listing Effectiveness"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
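
The monthly charts above label the x axis with the numbers 1 to 12. As an optional sketch, the month column could be turned into a factor with abbreviated month names using base R's built-in month.abb constant; the variable name month_f is hypothetical.

```r
# Month numbers as they appear in the dataset.
month <- 1:12

# Ordered factor with "Jan" ... "Dec" as labels, keeping calendar order.
month_f <- factor(month, levels = 1:12, labels = month.abb)
print(month_f)
```

In the plots, as.factor(month) could then be replaced by factor(month, levels = 1:12, labels = month.abb) inside aes().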

Data Analysis Key Findings:

Insights or Next Steps:

Create a stacked bar chart to compare the total sales across months, with cities stacked within each month.

First aggregate total sales by month and city. The resulting total_sales value for each month and city combination is the input for the stacked bar chart.

library(dplyr)
library(ggplot2)

Aggregate data to calculate total sales by month and city

sales_by_month_city <- df %>%
  group_by(month, city) %>%
  summarise(total_sales = sum(sales, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
print(sales_by_month_city)
## # A tibble: 48 × 3
##    month city                  total_sales
##    <int> <chr>                       <int>
##  1     1 Beaumont                      608
##  2     1 Bryan-College Station         591
##  3     1 Tyler                         907
##  4     1 Wichita Falls                 442
##  5     2 Beaumont                      677
##  6     2 Bryan-College Station         628
##  7     2 Tyler                        1058
##  8     2 Wichita Falls                 454
##  9     3 Beaumont                      855
## 10     3 Bryan-College Station         949
## # ℹ 38 more rows

Use the aggregated sales_by_month_city data to create a stacked bar chart to visualize total sales by month and city.

library(dplyr)
library(ggplot2)

Aggregate data to calculate total sales by month and city

sales_by_month_city <- df %>%
  group_by(month, city) %>%
  summarise(total_sales = sum(sales, na.rm = TRUE), .groups = 'drop')

Adding .groups = 'drop' suppresses the grouping message that summarise() printed in the previous chunk.

Create a stacked bar chart for total sales by month and city

ggplot(sales_by_month_city, aes(x = as.factor(month), y = total_sales, fill = city)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(
    title = "Total Sales by Month and City",
    x = "Month",
    y = "Total Sales"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Box Plot for median_price by City:

Create a box plot to compare the distribution of median prices across cities, to highlight differences in central tendency, variability, and potential outliers.

library(ggplot2)

Create a boxplot to compare the distribution of median_price between cities

ggplot(df, aes(x = city, y = median_price, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Median Price by City",
    x = "City",
    y = "Median Price (in dollars)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Boxplot Analysis: Distribution of Median Price by City

The boxplot illustrates the distribution of median_price across the four cities: Beaumont, Bryan-College Station, Tyler, and Wichita Falls.

Central Tendency (Median):

Variability (IQR and Whiskers):

Outliers:

Preliminary Conclusions:

These differences are essential for Texas Realty Insights when tailoring marketing strategies, positioning properties, and advising clients based on each city’s specific market dynamics.

Boxplots for volume by City:

Create a boxplot to compare the distribution of total sales volume across cities, highlighting differences in spread, central tendency, and outliers.

library(ggplot2)

Box plot for volume by city

ggplot(df, aes(x = city, y = volume, fill = city)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales Volume by City",
    x = "City",
    y = "Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Boxplots for volume by Year:

Create a boxplot to compare the distribution of total sales volume across years, to identify temporal trends, variability, and unusual values.

library(ggplot2)

Box plot for volume by year

ggplot(df, aes(x = as.factor(year), y = volume, fill = as.factor(year))) +
  geom_boxplot() +
  labs(
    title = "Distribution of Sales Volume by Year",
    x = "Year",
    y = "Sales Volume (in millions)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

A normalized stacked bar chart requires the proportion of sales for each city within each month. To obtain it, group the previously aggregated data by month and divide each city's total_sales by that month's total across all cities.

library(dplyr)
library(ggplot2)

Aggregate data to calculate total sales by month and city (repeated here so that this chunk runs on its own)

sales_by_month_city <- df %>%
  group_by(month, city) %>%
  summarise(total_sales = sum(sales, na.rm = TRUE), .groups = 'drop')

Calculate the proportion of sales for each city within each month

sales_by_month_city_normalized <- sales_by_month_city %>%
  group_by(month) %>%
  mutate(proportion = total_sales / sum(total_sales)) %>%
  ungroup()

print(sales_by_month_city_normalized)
## # A tibble: 48 × 4
##    month city                  total_sales proportion
##    <int> <chr>                       <int>      <dbl>
##  1     1 Beaumont                      608      0.239
##  2     1 Bryan-College Station         591      0.232
##  3     1 Tyler                         907      0.356
##  4     1 Wichita Falls                 442      0.173
##  5     2 Beaumont                      677      0.240
##  6     2 Bryan-College Station         628      0.223
##  7     2 Tyler                        1058      0.376
##  8     2 Wichita Falls                 454      0.161
##  9     3 Beaumont                      855      0.226
## 10     3 Bryan-College Station         949      0.250
## # ℹ 38 more rows
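
The proportions printed above can be checked by hand. The sketch below rebuilds only the eight rows shown for months 1 and 2 and repeats the division in base R; the name sample_sales is hypothetical.

```r
# The eight rows printed above (months 1 and 2).
sample_sales <- data.frame(
  month = rep(1:2, each = 4),
  city = rep(c("Beaumont", "Bryan-College Station", "Tyler", "Wichita Falls"), times = 2),
  total_sales = c(608, 591, 907, 442, 677, 628, 1058, 454)
)

# ave() returns each month's total aligned with the rows, like mutate() after group_by().
month_total <- ave(sample_sales$total_sales, sample_sales$month, FUN = sum)
sample_sales$proportion <- sample_sales$total_sales / month_total

# Within each month the proportions should add up to 1.
check <- tapply(sample_sales$proportion, sample_sales$month, sum)
print(round(sample_sales$proportion, 3))
print(check)
```

For month 1 the total is 608 + 591 + 907 + 442 = 2548, so Beaumont's share is 608 / 2548, which rounds to the 0.239 shown in the tibble.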

Generate the normalized stacked bar chart to visually represent the proportion of sales contributed by each city for every month.

ggplot(sales_by_month_city_normalized, aes(x = as.factor(month), y = proportion, fill = city)) +
  geom_bar(stat = "identity", position = "fill") +
  labs(
    title = "Normalized Total Sales by Month and City",
    x = "Month",
    y = "Proportion of Sales"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Line Chart for sales Over Time by City:

Generate a line chart using ggplot2 to visualize the trend of sales over time, differentiated by city.
This plot will help compare sales trajectories across locations during different historical periods and highlight how each market evolves over time.

  1. Purpose:
    Visualize temporal sales patterns for each city to identify growth trends, seasonal fluctuations, and deviations in performance.

  2. Task:

  3. Interpretation:
    Comment on the observed trends, noting:

library(dplyr)

df$date <- as.Date(paste0(df$year, "-", df$month, "-01"), format = "%Y-%m-%d")

cat("Head of DataFrame with new 'date' column:\n")
## Head of DataFrame with new 'date' column:
print(head(df))
##       city year month sales volume median_price listings months_inventory
## 1 Beaumont 2010     1    83 14.162       163800     1533              9.5
## 2 Beaumont 2010     2   108 17.690       138200     1586             10.0
## 3 Beaumont 2010     3   182 28.701       122400     1689             10.6
## 4 Beaumont 2010     4   200 26.819       123200     1708             10.6
## 5 Beaumont 2010     5   202 28.833       123100     1771             10.9
## 6 Beaumont 2010     6   189 27.219       122800     1803             11.1
##   median_price_class average_price listing_effectiveness       date
## 1               High      170626.5            0.05414220 2010-01-01
## 2        Medium-High      163796.3            0.06809584 2010-02-01
## 3         Medium-Low      157697.8            0.10775607 2010-03-01
## 4         Medium-Low      134095.0            0.11709602 2010-04-01
## 5         Medium-Low      142737.6            0.11405985 2010-05-01
## 6         Medium-Low      144015.9            0.10482529 2010-06-01
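
The date column is built by pasting year and month together with a fixed day of 01. As a small sketch, the same dates can be produced from a zero-padded string, which avoids relying on the parser accepting single-digit months; the vectors below mirror the six rows printed above, and the names date_paste and date_padded are hypothetical.

```r
# Year and month values of the six rows shown above.
year <- rep(2010L, 6)
month <- 1:6

# Approach used in the report: "2010-1-01", parsed with an explicit format.
date_paste <- as.Date(paste0(year, "-", month, "-01"), format = "%Y-%m-%d")

# Zero-padded variant: "2010-01-01", already in ISO form.
date_padded <- as.Date(sprintf("%d-%02d-01", year, month))

print(date_padded)
print(identical(date_paste, date_padded))
```

Both vectors have class Date, so ggplot2 treats the x axis as continuous time rather than as discrete labels.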

Create a line chart for sales trend over time by city

library(ggplot2)

ggplot(df, aes(x = date, y = sales, color = city, group = city)) +
  geom_line() +
  labs(
    title = "Sales Trend by City Over Time",
    x = "Date",
    y = "Sales"
  ) +
  theme_minimal()

Comprehensive Summary of ggplot2 Visualizations (Project Step 8):

This summary consolidates insights from all ggplot2 visualizations, showing how they enhance understanding of the Texas real estate market and support strategic decisions for Texas Realty Insights.

  1. Boxplot of median_price by City
  2. Boxplots of volume by City and by Year

    2.1. Volume by City

  3. Stacked and Normalized Bar Charts of sales by Month and City
  4. Line Chart of sales Over Time by City

Overall Strategic Value:


Task 9:

Final Conclusions and Recommendations for Texas Realty Insights

This project delivered a comprehensive statistical and visual analysis of the Texas real estate market from 2010 to 2014 using the “Real Estate Texas.csv” dataset.
The objectives—identifying trends, evaluating marketing effectiveness, and providing visual insights for strategic decision‑making—were fully achieved through descriptive statistics, probability calculations, new variable creation, conditional analyses, and extensive ggplot2 visualizations.

Key Findings and Emerging Trends:

  1. Data Robustness and Statistical Validity:
  2. Multi‑Year Growth (2010–2014):
  3. Pronounced Seasonal Patterns:
  4. City‑Specific Dynamics:
  5. Insights from New Variables:
  6. Strategic Value and Practical Implications:
  7. Actionable Recommendations for Texas Realty Insights:

    7.1. City‑Specific Marketing and Sales Strategies:

Final Summary:

This analysis transforms raw data into a clear, actionable understanding of Texas real estate dynamics. The recommendations—grounded in strong statistical and visual evidence—equip Texas Realty Insights to make informed decisions, optimize strategies, and strengthen their position in a competitive and evolving market.