Phase 1 : Business Problem

Introduction

Source Image : Pexels.com

Connecticut is one of the states in the United States that has consistently ranked the best quality of life among states. Connecticut has highly-rated schools, a low crime rate, a healthy population, and more. pristine lakes and ponds of various sizes, perfect for fishing, hiking, and exploring. This is what makes many people from other countries want to live there.

To live comfortably in the city, owning property is everyone’s wish. Of course, buying a property shouldn’t be excruciating or coercive. Knowing the estimated price of the property will make it easier for us to choose the right property.Estimated property prices can be known from the transaction history.

In the case study, we will explore a real estate dataset from Kaggle to get an insight into property prices, towns suitable for investment, and fair prices in Connecticut based on descriptive statistics.

Business Problem

Why are property prices in Connecticut increasing in 2020 ?
Which town is suitable for investment ?
Case Study 1 : Determine the most suitable type of residential based on price ?
Case Study 2 : Determine the estimated costs incurred in buying a particular type of property ?

Phase 2 : Prepare Data

Dataset

About Dataset
All data compiled into this dataset is available under public domain. This set is designed to provide some insight into sales trends across the state of Connecticut as well as the individual towns within. It is also specifically structured to highlight changes in trends due to the COVID-19 pandemic.

Variables
list_year : grand list year of the property (grand list years run from Oct. 1 through Sept. 30).
town : name of the town that the property was sold in.
population : population of the town that the property was sold in.
residential_type : single family, two family, three family, four family, or condo.
month : the month the sale was recorded.
year : the year the sale was recorded.
in_pandemic : boolean value indicating whether the selling date was after March 11, 2020.
assessed_value : tax assessed value of the property at the time of the sale.
sale_amount : final closing sale amount of the property.
price_index : the Consumer Price Index (CPI) for that month/year. Used to normalize dollar values.
norm_assessed_value : CPI-normalized assessed value (assessed_value / price_index * 100).
norm_sale_amount : CPI-normalized sale amount (sale_amount / price_index * 100).
norm_sales_ratio : CPI-normalized assessment to sale ratio (norm_assessed_value / norm_sale_amount).
latitude : latitude for the property’s town.
longitude : longitude for the property’s town.

Data Sources

Kaggle : https://www.kaggle.com/datasets/asasherwyn/ctrre-2011-2021
Real estate records: https://data.ct.gov/Housing-and-Development/Real-Estate-Sales-2001-2020-GL/5mzw-sjtu
Township shapes: https://data.ct.gov/Government/Town-Boundary-Index-Map/evyv-fqzg
Consumer price index: https://www.bls.gov/regions/new-england/data/consumerpriceindex_us_table.htm
Town populations: https://www.connecticut-demographics.com/cities_by_population

Phase 3 : Process

Setting Up the Enviroment >

Packages

1. Loading the Packages

There are 3 packages used in this case study:
dplyr : to dataframe manipulation
ggplot2 : to data visualization
lubridate : to date time manupulation

# install.packages('dplyr') 
# install.packages('ggplot')
# install.packages('lubridate')

2. Activate the Packages

After installed, then activate the packages,

library (dplyr)
library(ggplot2)
library(GGally)
library(lubridate)

options(scipen=999) # to switch scientific notation

Data Collection >

1. Read CSV File

Read csv file from the folder connecticut. Make sure that the file directory is the same as .Rproj file. After the read, you can input into variable real estate. you can use stringAsFactors, if you want to default switch character data type to factor data type.

real_estate <- read.csv('connecticut/CTRRE 2011-2021.csv', stringsAsFactors = T)

2. Check the top three data

Check top three data to view and understand the data in each row and column.

head(real_estate, n=3)

#>   list_year    town population residential_type month year in_pandemic
#> 1      2020 Ashford       4193    Single Family    10 2020           1
#> 2      2020    Avon      18821            Condo     3 2021           1
#> 3      2020    Avon      18821    Single Family     4 2021           1
#>   assessed_value sale_amount price_index norm_assessed_value norm_sale_amount
#> 1         253000      430000     254.076            99576.50         169240.7
#> 2         130400      179900     258.935            50360.13          69476.9
#> 3         619290      890000     261.237           237060.60         340686.8
#>   norm_sales_ratio latitude longitude
#> 1        0.5883721  41.8731  -72.1216
#> 2        0.7248471  41.8096  -72.8305
#> 3        0.6958315  41.8096  -72.8305

3. Check Structure the Data

Check the structure to know total variable, size the data, and data types. Make sure the data type is appropriate, because it relates to behavior in data processing. In this case, we must know the unique value of all variable to determine whether the variable is numeric or categoric using function n_distinct from library dplyr

str(real_estate)

#> 'data.frame':    430492 obs. of  15 variables:
#>  $ list_year          : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
#>  $ town               : Factor w/ 169 levels "Andover","Ansonia",..: 3 4 4 4 4 7 9 10 11 14 ...
#>  $ population         : int  4193 18821 18821 18821 18821 20107 20287 3408 21399 28230 ...
#>  $ residential_type   : Factor w/ 5 levels "Condo","Four Family",..: 3 1 3 3 3 3 3 3 1 3 ...
#>  $ month              : int  10 3 4 7 12 7 12 8 9 1 ...
#>  $ year               : int  2020 2021 2021 2021 2020 2021 2020 2021 2021 2021 ...
#>  $ in_pandemic        : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ assessed_value     : int  253000 130400 619290 862330 847520 412000 171360 168900 163730 530500 ...
#>  $ sale_amount        : num  430000 179900 890000 1447500 1250000 ...
#>  $ price_index        : num  254 259 261 268 254 ...
#>  $ norm_assessed_value: num  99577 50360 237061 322018 333563 ...
#>  $ norm_sale_amount   : num  169241 69477 340687 540538 491969 ...
#>  $ norm_sales_ratio   : num  0.588 0.725 0.696 0.596 0.678 ...
#>  $ latitude           : num  41.9 41.8 41.8 41.8 41.8 ...
#>  $ longitude          : num  -72.1 -72.8 -72.8 -72.8 -72.8 ...

sapply(real_estate, n_distinct)

#>           list_year                town          population    residential_type 
#>                  11                 169                 168                   5 
#>               month                year         in_pandemic      assessed_value 
#>                  12                  12                   2               67203 
#>         sale_amount         price_index norm_assessed_value    norm_sale_amount 
#>               26144                 135              375514              150018 
#>    norm_sales_ratio            latitude           longitude 
#>              423960                 165                 167

Insight :
1. real_estate dataframe consists of 15 variables
2. change variables list_year,resindential_type, in_pandemic as factor
3. Other variables make as numeric

Data Cleansing

1. Change Data Types and Data Description

real_estate <- real_estate |> 
  mutate(
    
    #create new variable name for easier reading
    month_name = case_when(
      month == 1 ~ 'January',
      month == 2 ~ 'February',
      month == 3 ~ 'March',
      month == 4 ~ 'April',
      month == 5 ~ 'May',
      month == 6 ~ 'June',
      month == 7 ~ 'July',
      month == 8 ~ 'August',
      month == 9 ~ 'September',
      month == 10 ~ 'October',
      month == 11 ~ 'November',
      month == 12 ~ 'December'
    ),
    
    #create new variable name for easier reading
    pandemic_status = case_when(
      in_pandemic == 1 ~ 'Pandemic Periode',
      in_pandemic == 0 ~ 'Not Pandemic Periode'
    ),
    
    #change data type to factor with levels
    month_name = factor ( x= month_name,
                        levels= c('January','February','March','April','May','June','July','August','September','October','November','December')
    ) )|> 
  mutate_at(
    #change many data type to factor 
    vars(list_year,residential_type,pandemic_status, month_name, year), as.factor
  )

2. Remove Column

From the 15 variables that are not all variables are used. only related variables are used. the latitude and longitude variables are coordinate variables, and this study does not perform map visualization.

real_estate <- real_estate |> 
  select(-latitude,-longitude)

3. Missing Value check

Using anyNA() function, if the result is FALSE you can proceed to the next step

real_estate |> 
  anyNA()

#> [1] FALSE

4. Null Value Check

Using filter function to manipulate the dataframe. The method aims to eliminate null values especially in variable sale_amount. The null value makes the data nonsensical.

real_estate <- real_estate |> 
  filter(sale_amount != 0)

5. Remove Duplicated Data

Delete data that has the same value because it will reduce data variation. After the data cleansing is complete, save the data into the new variable real_estate_clean

real_estate_clean <- real_estate |> 
  distinct()

6. Make Sure the Data is Clean

Data can be clean if all data types are considered clean, not missing values, redundant and unreasonable data.

glimpse(real_estate_clean)

#> Rows: 428,305
#> Columns: 15
#> $ list_year           <fct> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 20…
#> $ town                <fct> Ashford, Avon, Avon, Avon, Avon, Berlin, Bethel, B…
#> $ population          <int> 4193, 18821, 18821, 18821, 18821, 20107, 20287, 34…
#> $ residential_type    <fct> Single Family, Condo, Single Family, Single Family…
#> $ month               <int> 10, 3, 4, 7, 12, 7, 12, 8, 9, 1, 9, 10, 12, 3, 8, …
#> $ year                <fct> 2020, 2021, 2021, 2021, 2020, 2021, 2020, 2021, 20…
#> $ in_pandemic         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ assessed_value      <int> 253000, 130400, 619290, 862330, 847520, 412000, 17…
#> $ sale_amount         <dbl> 430000, 179900, 890000, 1447500, 1250000, 677500, …
#> $ price_index         <dbl> 254.076, 258.935, 261.237, 267.789, 254.081, 267.7…
#> $ norm_assessed_value <dbl> 99576.50, 50360.13, 237060.60, 322018.45, 333562.9…
#> $ norm_sale_amount    <dbl> 169240.70, 69476.90, 340686.81, 540537.51, 491969.…
#> $ norm_sales_ratio    <dbl> 0.5883721, 0.7248471, 0.6958315, 0.5957375, 0.6780…
#> $ month_name          <fct> October, March, April, July, December, July, Decem…
#> $ pandemic_status     <fct> Pandemic Periode, Pandemic Periode, Pandemic Perio…

summary(real_estate_clean)

#>    list_year                 town          population    
#>  2020   : 58919   Stamford     : 13986   Min.   :   932  
#>  2019   : 52115   Norwalk      : 10525   1st Qu.: 18354  
#>  2018   : 44076   Waterbury    : 10334   Median : 30002  
#>  2016   : 42393   Bridgeport   :  9827   Mean   : 44927  
#>  2014   : 40970   West Hartford:  9200   3rd Qu.: 60790  
#>  2017   : 39310   Greenwich    :  8811   Max.   :148529  
#>  (Other):150522   (Other)      :365622                   
#>       residential_type      month             year         in_pandemic    
#>  Condo        : 81538   Min.   : 1.000   2020   : 55766   Min.   :0.0000  
#>  Four Family  :  1531   1st Qu.: 4.000   2019   : 45930   1st Qu.:0.0000  
#>  Single Family:318462   Median : 7.000   2021   : 42576   Median :0.0000  
#>  Three Family :  8439   Mean   : 6.886   2017   : 41484   Mean   :0.2124  
#>  Two Family   : 18335   3rd Qu.: 9.000   2018   : 40604   3rd Qu.:0.0000  
#>                         Max.   :12.000   2015   : 40512   Max.   :1.0000  
#>                                          (Other):161433                   
#>  assessed_value      sale_amount        price_index    norm_assessed_value
#>  Min.   :     500   Min.   :    2000   Min.   :213.9   Min.   :     233   
#>  1st Qu.:  112700   1st Qu.:  161000   1st Qu.:231.1   1st Qu.:   46765   
#>  Median :  162900   Median :  245000   Median :238.6   Median :   68090   
#>  Mean   :  246045   Mean   :  367749   Mean   :240.7   Mean   :  102608   
#>  3rd Qu.:  252520   3rd Qu.:  381000   3rd Qu.:250.3   3rd Qu.:  105606   
#>  Max.   :26609830   Max.   :48000000   Max.   :269.1   Max.   :10599246   
#>                                                                           
#>  norm_sale_amount   norm_sales_ratio      month_name    
#>  Min.   :     799   Min.   :0.04668   July     : 46478  
#>  1st Qu.:   67516   1st Qu.:0.58860   August   : 46071  
#>  Median :  101487   Median :0.67856   June     : 45166  
#>  Mean   :  152752   Mean   :0.70978   September: 38673  
#>  3rd Qu.:  158303   3rd Qu.:0.79935   May      : 37689  
#>  Max.   :19494919   Max.   :1.40841   October  : 37515  
#>                                       (Other)  :176713  
#>              pandemic_status  
#>  Not Pandemic Periode:337335  
#>  Pandemic Periode    : 90970  
#>                               
#>                               
#>                               
#>                               
#>

Phase 4a : Descriptive Analysis

1. Frequency Distribution

The first step in an analysis work is the knowledge of the behavior of the variables involved in the study. Using statistical techniques such as frequency tables, histograms and bar graphs we can better understand the way in which the phenomena under study are distributed.

1. Distribution of The Number of Sales per Year

rs_per_year <- real_estate_clean |> 
  group_by(year) |> 
  summarise( n = n())


ggplot(data = rs_per_year, aes( x = year)) +
  geom_col( aes(y = n,), position = 'dodge') +
  labs( title = 'Distribution of The Number of Sales per Year',
        x = NULL,
        y = 'Frequency of Sales'
        ) +
  theme_minimal() +
  theme(legend.position = "none",
            plot.title = element_text(hjust = 0.5))

insight = From the year frequency distribution, it can be seen that the lowest property sales occurred in 2010 and the highest in 2020. Sales in 2010 were low because data collection was conducted starting in October 2010. Meanwhile, data for 2020 will be a business question.

2. Distribution of The Number of Sales per Residential Type

rs_type <- real_estate_clean |> 
  group_by(residential_type) |> 
  summarise(n = n()) |> 
  arrange(-n)

ggplot(rs_type, aes( x = n )) +
  geom_col(aes(y = reorder(residential_type,n)))+
   labs( title = 'Distribution of The Number of Sales per Residential Type',
        x = 'Frequency of Sales',
        y = NULL
        ) +
  theme_minimal() +
  theme(legend.position = "none",
            plot.title = element_text(hjust = 0.5))

insight : single-family residential type sold the most compared to other types

3. Distribution of Total House Price per Year

rs_price <- real_estate_clean |>  
  group_by(year) |> 
  summarise( n = sum(sale_amount))

ggplot(rs_price, aes( x = year )) +
  geom_col(aes(y = n/1000000))+
   labs( title = 'Distribution of Total House Price per Year',
        x = NULL,
        y = 'USD (in millions)'
        ) +
  theme_minimal() +
  theme(legend.position = "none",
            plot.title = element_text(hjust = 0.5))

insight : 2020 was the year with the highest total transactions compared to other years. The total price of sales is equivalent to the value of the total number of sales.

2. Measure of Central Tendency

1. Mean

mean(real_estate_clean$sale_amount)

#> [1] 367748.6

insight : the mean value of all property sales in connecticut is 367748.6 USD, but we need to know that the mean value can be affected by outliers. so we need a formula to remove outliers, using median.

2. Median

median(real_estate_clean$sale_amount)

#> [1] 245000

insight : the results obtained from the median are different from the mean, meaning that there are outliers in this data

3. Measure of Spread

1. Standar Deviation

rs_std <- real_estate_clean |> 
  group_by(town) |> 
  summarise(std = sd(sale_amount)) |> 
  arrange(-std)


ggplot(head(rs_std, n=10), aes(x = std)) + 
  geom_col(aes(y = reorder(town, std))) +
  
     labs( title = 'Standar Deviation Value of Sale Price per Town',
        x = 'USD',
        y = NULL
        ) +
  theme_minimal() +
  theme(legend.position = "none",
            plot.title = element_text(hjust = 0.5))

insight : Based on standar deviation value, Greenwich Town is a city with a very large data variation compared to other cities. but the standard deviation is related to the mean, the model is less robust. In the case study, we will use the median value to generate a robust model

2. IQR

rs_boxplot <- real_estate_clean |> 
  filter( sale_amount <= 2000000) # delete outliers


ggplot(rs_boxplot, aes(x = year, y = sale_amount)) +
  geom_boxplot() +
  
       labs( title = 'Boxplot of The Number of Total Sales per Year',
        x = NULL,
        y = 'USD'
        ) +
  theme_minimal() +
  theme(legend.position = "none",
            plot.title = element_text(hjust = 0.5))

insight : from the boxplot above it can be seen that the median value of property prices increases slowly

4. Measure of Variable Relationship

1. Corelation

ggcorr(real_estate_clean)

insight :
- there is a similar positive pattern between the in-pandemic variable and the price_index variable
- there is a similar positive pattern between the price_index variable and the sale_amount variable

Phase 4b : Exploratory Analysis

Business Question 1: Why are property prices in Connecticut increasing in 2020 ?

cor( real_estate_clean$price_index, real_estate_clean$in_pandemic,)

#> [1] 0.724508

if you look at the variation of relationship, price_index increase has the same pattern as the in_pandemic variable. Even though the correlation does not causality, when viewed from the economy, the price_index or inflation value has relationship with pandemic conditions. Where almost countries around the world experienced an increase in inflation during the pandemic.

real_estate_clean$month_year <- myd(paste(real_estate_clean$month, real_estate_clean$year,'1'))

total_med_per_year <- real_estate_clean |>
group_by( price_index, month_year) |>
summarise(price_sale = median(sale_amount)/1000, price_index = median(price_index) )

total_med_per_year <- total_med_per_year |>
filter(
price_sale <= 500 & price_sale >=170
)

# ggplot(data = total_med_per_year) +
# geom_line(aes(x = month_year,y = price_index), color = 'red' ) +
# geom_line(aes(x = month_year,y = price_sale), color = 'blue') +
#    labs( title = 'Trend between Price Index and Sale Amount',
#         x = NULL,
#         y = NULL
#         ) +
#   theme_minimal() +
#   theme( legend.position = c(2020,300),
#     plot.title = element_text(hjust = 0.5)) 
  




ggplot(data = total_med_per_year, mapping = aes(x=month_year)) +
  geom_line(aes(y = price_index, color = 'Price Index'), size = 0.8 ) +
  geom_line(aes(y = price_sale, color = 'Sale Amount (in Thousand USD)'), size = 0.8) +
  theme_minimal() +
  scale_color_manual(name= 'Value',
                     values = c('Price Index'='red',
                                'Sale Amount (in Thousand USD)'='blue')) +
   labs( title = 'Trend between Price Index and Sale Amount',
        x = NULL,
        y = NULL
        ) +
  theme( legend.position = 'top',
    plot.title = element_text(hjust = 0.5))

In Connecticut, the price index value trend has the same trend as sale_amount

Business Question 2: Which town is suitable for investment ?

This question will depend on investors’ preferences in taking on investment risk.

If investors are people who dare to take big risks. Several cities that were recommended included Greenwich, Darein, New Canaan, West Canaan, and Westport. The city was suggested because property prices there have a large price variance as indicated by the IQR values. a large variance, it will allow someone to get cheap prices in cities that have expensive property prices

iqr_per_town <- real_estate_clean |> 
  group_by(town) |> 
  summarise( med = median(sale_amount), q1 = quantile(sale_amount, 0.25), q3 = quantile(sale_amount, 0.75), IQR = IQR(sale_amount)) |> 
  mutate( 
    low_w = q1 - (1.5*IQR),
    upper_w = q3 + (1.5*IQR))|>
  arrange(-med)

head(iqr_per_town, n=5)

#> # A tibble: 5 × 7
#>   town           med     q1      q3     IQR    low_w upper_w
#>   <fct>        <dbl>  <dbl>   <dbl>   <dbl>    <dbl>   <dbl>
#> 1 Greenwich  1450000 805000 2561250 1756250 -1829375 5195625
#> 2 Darien     1372500 899000 2000000 1101000  -752500 3651500
#> 3 New Canaan 1285000 873750 1900000 1026250  -665625 3439375
#> 4 Westport   1129800 750000 1700000  950000  -675000 3125000
#> 5 Weston      814000 619750 1114000  494250  -121625 1855375

tail(iqr_per_town, n=5)

#> # A tibble: 5 × 7
#>   town          med     q1     q3     IQR   low_w upper_w
#>   <fct>       <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
#> 1 Hartford   140000 79855. 195000 115145. -92863. 367718.
#> 2 Norwich    140000 96000  181000  85000  -31500  308500 
#> 3 Torrington 135000 97925  175000  77075  -17688. 290612.
#> 4 Windham    135000 89900  175000  85100  -37750  302650 
#> 5 Waterbury  117000 75000  160000  85000  -52500  287500

However, if someone wants a low risk investment risk, then the cities above are a good choice because they have relatively stable prices.

Business Question 3: Determine the most suitable type of residential based on price ?

If a man wants to buy property in Connecticut, of course, he needs to know the property prices in each town. By using descriptive statistics through the mean/median function because it is able to represent prices. In this case study, the median value is used because the median measurement is more robust and not affected by outliers.

rs_town_price <- real_estate_clean |> 
  group_by(town) |> 
  summarise(price = median(sale_amount)) |> 
  arrange(-price)


head(rs_town_price)

#> # A tibble: 6 × 2
#>   town         price
#>   <fct>        <dbl>
#> 1 Greenwich  1450000
#> 2 Darien     1372500
#> 3 New Canaan 1285000
#> 4 Westport   1129800
#> 5 Weston      814000
#> 6 Wilton      765000

Insight: from the table above it is known that Greenwich is the most expensive city in terms of average property prices

rs_town_5sum <- real_estate_clean |> 
  group_by(town) |> 
  summarise( med = median(sale_amount), q1 = quantile(sale_amount, 0.25), q3 = quantile(sale_amount, 0.75), IQR = q3-q1 ) |> 
  mutate( 
    low_w = q1 - (1.5*IQR),
    upper_w = q3 + (1.5*IQR)
    
  ) |> arrange(-med)

rs_town_5sum

#> # A tibble: 169 × 7
#>    town           med     q1      q3     IQR     low_w  upper_w
#>    <fct>        <dbl>  <dbl>   <dbl>   <dbl>     <dbl>    <dbl>
#>  1 Greenwich  1450000 805000 2561250 1756250 -1829375  5195625 
#>  2 Darien     1372500 899000 2000000 1101000  -752500  3651500 
#>  3 New Canaan 1285000 873750 1900000 1026250  -665625  3439375 
#>  4 Westport   1129800 750000 1700000  950000  -675000  3125000 
#>  5 Weston      814000 619750 1114000  494250  -121625  1855375 
#>  6 Wilton      765000 565015 1010000  444985  -102462. 1677478.
#>  7 Roxbury     653500 439750 1056250  616500  -485000  1981000 
#>  8 Ridgefield  619000 428410  850000  421590  -203975  1482385 
#>  9 Easton      590000 467250  740000  272750    58125  1149125 
#> 10 Washington  554500 320650 1323750 1003100 -1184000  2828400 
#> # ℹ 159 more rows

Case: if a man decides to settle in the city of New Canaan because it is close to NYC. What costs need to be prepared? the cost that needs to be prepared is around 1,285,000 USD. But what if the man only has 900,000 USD? Of course, it doesn’t matter because the range of property prices in the city of New Canaan is from 873,750 USD to 1,900,000 USD

real_estate_clean |> 
  filter(town == 'New Canaan' & sale_amount <= 900000)  |> 
  group_by(residential_type) |> 
  summarise(n = n())

#> # A tibble: 5 × 2
#>   residential_type     n
#>   <fct>            <int>
#> 1 Condo              419
#> 2 Four Family          2
#> 3 Single Family      528
#> 4 Three Family         2
#> 5 Two Family          14

a price of 900,000 USD buying a property with a Single Family or Condominium type is the right option compared to buying other types because based on the transaction data that has been collected, fewer transactions occur with 900.00 USD money to buy another type of house

Business Question 4: Determine the estimated costs incurred in buying a particular type of property ?

If a Yale University student wants to buy a condominium-type property in the city of New Haven, what is the price range that needs to be spent?

rs_new_haven <- real_estate_clean |> 
  filter(town == 'New Haven') |> 
  group_by(residential_type, sale_amount) |> 
  summarise(price = mean(sale_amount)) |>
  mutate(
    q1 = quantile(sale_amount, 0.25),
    med = median(sale_amount),
    q3 = quantile(sale_amount, 0.75)
    
  )

rs_new_haven |> 
  group_by(residential_type) |> 
  summarise( mean = mean(sale_amount), med = mean(med), q1 = mean(q1), q3 = mean(q3) )

#> # A tibble: 5 × 5
#>   residential_type    mean    med      q1      q3
#>   <fct>              <dbl>  <dbl>   <dbl>   <dbl>
#> 1 Condo            233465. 178500  99450  312000 
#> 2 Four Family      324224. 285000 172362. 434000 
#> 3 Single Family    309478. 221000 132500  368250 
#> 4 Three Family     269072. 217750 138625  355456.
#> 5 Two Family       230299. 180000 105600  305250

The estimated cost required to buy a condominium type property is 99,450 USD to 312,000 USD

Phase 5 : Conclusion

The increase in the average price of housing costs in 2020 was influenced by the pandemic that occurred at that time. The pandemic has increased the price of inflation (price_index) which has affected various price increases for various economic sectors including property.
If an investor is a high-risk taker, buying property in areas with high price variations can be a profitable option. Greenwich, Darein, New Canaan, West Canaan, and Westport are the top 5 cities based on IQR value
Property prices for residential type single family around New Canaan from 873,750 USD to 1,900,000 USD
The estimated cost required to buy a condominium-type property in New Haven is 99450 USD to 312,000 USD

Analysis Connecticut Real Estate Dataset using Descriptive Statistics

Akmal Hasan

2023-09-22