Connecticut is one of the states in the United States that has consistently ranked the best quality of life among states. Connecticut has highly-rated schools, a low crime rate, a healthy population, and more. pristine lakes and ponds of various sizes, perfect for fishing, hiking, and exploring. This is what makes many people from other countries want to live there.
To live comfortably in the city, owning property is everyone’s wish. Of course, buying a property shouldn’t be excruciating or coercive. Knowing the estimated price of the property will make it easier for us to choose the right property.Estimated property prices can be known from the transaction history.
In the case study, we will explore a real estate dataset from Kaggle to get an insight into property prices, towns suitable for investment, and fair prices in Connecticut based on descriptive statistics.
About Dataset
All data compiled into this
dataset is available under public domain. This set is designed to
provide some insight into sales trends across the state of Connecticut
as well as the individual towns within. It is also specifically
structured to highlight changes in trends due to the COVID-19
pandemic.
Variables
list_year : grand
list year of the property (grand list years run from Oct. 1 through
Sept. 30).
town : name of the town that the
property was sold in.
population : population of
the town that the property was sold in.
residential_type : single family, two family, three
family, four family, or condo.
month : the month
the sale was recorded.
year : the year the sale was
recorded.
in_pandemic : boolean value indicating
whether the selling date was after March 11, 2020.
assessed_value : tax assessed value of the property at
the time of the sale.
sale_amount : final closing
sale amount of the property.
price_index : the
Consumer Price Index (CPI) for that month/year. Used to normalize dollar
values.
norm_assessed_value : CPI-normalized
assessed value (assessed_value / price_index * 100).
norm_sale_amount : CPI-normalized sale amount
(sale_amount / price_index * 100).
norm_sales_ratio
: CPI-normalized assessment to sale ratio (norm_assessed_value /
norm_sale_amount).
latitude : latitude for the
property’s town.
longitude : longitude for the
property’s town.
Kaggle : https://www.kaggle.com/datasets/asasherwyn/ctrre-2011-2021
Real estate records: https://data.ct.gov/Housing-and-Development/Real-Estate-Sales-2001-2020-GL/5mzw-sjtu
Township shapes: https://data.ct.gov/Government/Town-Boundary-Index-Map/evyv-fqzg
Consumer price index: https://www.bls.gov/regions/new-england/data/consumerpriceindex_us_table.htm
Town populations: https://www.connecticut-demographics.com/cities_by_population
Packages
1. Loading the Packages
There are 3 packages used in this case study:
dplyr : to dataframe manipulation
ggplot2 : to data visualization
lubridate : to date time manupulation
2. Activate the Packages
After installed, then activate the packages,
1. Read CSV File
Read csv file from the folder connecticut. Make sure that the file directory is the same as .Rproj file. After the read, you can input into variable real estate. you can use stringAsFactors, if you want to default switch character data type to factor data type.
2. Check the top three data
Check top three data to view and understand the data in each row and column.
#> list_year town population residential_type month year in_pandemic
#> 1 2020 Ashford 4193 Single Family 10 2020 1
#> 2 2020 Avon 18821 Condo 3 2021 1
#> 3 2020 Avon 18821 Single Family 4 2021 1
#> assessed_value sale_amount price_index norm_assessed_value norm_sale_amount
#> 1 253000 430000 254.076 99576.50 169240.7
#> 2 130400 179900 258.935 50360.13 69476.9
#> 3 619290 890000 261.237 237060.60 340686.8
#> norm_sales_ratio latitude longitude
#> 1 0.5883721 41.8731 -72.1216
#> 2 0.7248471 41.8096 -72.8305
#> 3 0.6958315 41.8096 -72.8305
3. Check Structure the Data
Check the structure to know total variable, size the data, and data types. Make sure the data type is appropriate, because it relates to behavior in data processing. In this case, we must know the unique value of all variable to determine whether the variable is numeric or categoric using function n_distinct from library dplyr
#> 'data.frame': 430492 obs. of 15 variables:
#> $ list_year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
#> $ town : Factor w/ 169 levels "Andover","Ansonia",..: 3 4 4 4 4 7 9 10 11 14 ...
#> $ population : int 4193 18821 18821 18821 18821 20107 20287 3408 21399 28230 ...
#> $ residential_type : Factor w/ 5 levels "Condo","Four Family",..: 3 1 3 3 3 3 3 3 1 3 ...
#> $ month : int 10 3 4 7 12 7 12 8 9 1 ...
#> $ year : int 2020 2021 2021 2021 2020 2021 2020 2021 2021 2021 ...
#> $ in_pandemic : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ assessed_value : int 253000 130400 619290 862330 847520 412000 171360 168900 163730 530500 ...
#> $ sale_amount : num 430000 179900 890000 1447500 1250000 ...
#> $ price_index : num 254 259 261 268 254 ...
#> $ norm_assessed_value: num 99577 50360 237061 322018 333563 ...
#> $ norm_sale_amount : num 169241 69477 340687 540538 491969 ...
#> $ norm_sales_ratio : num 0.588 0.725 0.696 0.596 0.678 ...
#> $ latitude : num 41.9 41.8 41.8 41.8 41.8 ...
#> $ longitude : num -72.1 -72.8 -72.8 -72.8 -72.8 ...
#> list_year town population residential_type
#> 11 169 168 5
#> month year in_pandemic assessed_value
#> 12 12 2 67203
#> sale_amount price_index norm_assessed_value norm_sale_amount
#> 26144 135 375514 150018
#> norm_sales_ratio latitude longitude
#> 423960 165 167
Insight :
1. real_estate dataframe
consists of 15 variables
2. change variables
list_year,resindential_type, in_pandemic as
factor
3. Other variables make as numeric
1. Change Data Types and Data Description
real_estate <- real_estate |>
mutate(
#create new variable name for easier reading
month_name = case_when(
month == 1 ~ 'January',
month == 2 ~ 'February',
month == 3 ~ 'March',
month == 4 ~ 'April',
month == 5 ~ 'May',
month == 6 ~ 'June',
month == 7 ~ 'July',
month == 8 ~ 'August',
month == 9 ~ 'September',
month == 10 ~ 'October',
month == 11 ~ 'November',
month == 12 ~ 'December'
),
#create new variable name for easier reading
pandemic_status = case_when(
in_pandemic == 1 ~ 'Pandemic Periode',
in_pandemic == 0 ~ 'Not Pandemic Periode'
),
#change data type to factor with levels
month_name = factor ( x= month_name,
levels= c('January','February','March','April','May','June','July','August','September','October','November','December')
) )|>
mutate_at(
#change many data type to factor
vars(list_year,residential_type,pandemic_status, month_name, year), as.factor
)2. Remove Column
From the 15 variables that are not all variables are used. only related variables are used. the latitude and longitude variables are coordinate variables, and this study does not perform map visualization.
3. Missing Value check
Using anyNA() function, if the result is FALSE you can proceed to the next step
#> [1] FALSE
4. Null Value Check
Using filter function to manipulate the dataframe. The method aims to eliminate null values especially in variable sale_amount. The null value makes the data nonsensical.
5. Remove Duplicated Data
Delete data that has the same value because it will reduce data variation. After the data cleansing is complete, save the data into the new variable real_estate_clean
6. Make Sure the Data is Clean
Data can be clean if all data types are considered clean, not missing values, redundant and unreasonable data.
#> Rows: 428,305
#> Columns: 15
#> $ list_year <fct> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 20…
#> $ town <fct> Ashford, Avon, Avon, Avon, Avon, Berlin, Bethel, B…
#> $ population <int> 4193, 18821, 18821, 18821, 18821, 20107, 20287, 34…
#> $ residential_type <fct> Single Family, Condo, Single Family, Single Family…
#> $ month <int> 10, 3, 4, 7, 12, 7, 12, 8, 9, 1, 9, 10, 12, 3, 8, …
#> $ year <fct> 2020, 2021, 2021, 2021, 2020, 2021, 2020, 2021, 20…
#> $ in_pandemic <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ assessed_value <int> 253000, 130400, 619290, 862330, 847520, 412000, 17…
#> $ sale_amount <dbl> 430000, 179900, 890000, 1447500, 1250000, 677500, …
#> $ price_index <dbl> 254.076, 258.935, 261.237, 267.789, 254.081, 267.7…
#> $ norm_assessed_value <dbl> 99576.50, 50360.13, 237060.60, 322018.45, 333562.9…
#> $ norm_sale_amount <dbl> 169240.70, 69476.90, 340686.81, 540537.51, 491969.…
#> $ norm_sales_ratio <dbl> 0.5883721, 0.7248471, 0.6958315, 0.5957375, 0.6780…
#> $ month_name <fct> October, March, April, July, December, July, Decem…
#> $ pandemic_status <fct> Pandemic Periode, Pandemic Periode, Pandemic Perio…
#> list_year town population
#> 2020 : 58919 Stamford : 13986 Min. : 932
#> 2019 : 52115 Norwalk : 10525 1st Qu.: 18354
#> 2018 : 44076 Waterbury : 10334 Median : 30002
#> 2016 : 42393 Bridgeport : 9827 Mean : 44927
#> 2014 : 40970 West Hartford: 9200 3rd Qu.: 60790
#> 2017 : 39310 Greenwich : 8811 Max. :148529
#> (Other):150522 (Other) :365622
#> residential_type month year in_pandemic
#> Condo : 81538 Min. : 1.000 2020 : 55766 Min. :0.0000
#> Four Family : 1531 1st Qu.: 4.000 2019 : 45930 1st Qu.:0.0000
#> Single Family:318462 Median : 7.000 2021 : 42576 Median :0.0000
#> Three Family : 8439 Mean : 6.886 2017 : 41484 Mean :0.2124
#> Two Family : 18335 3rd Qu.: 9.000 2018 : 40604 3rd Qu.:0.0000
#> Max. :12.000 2015 : 40512 Max. :1.0000
#> (Other):161433
#> assessed_value sale_amount price_index norm_assessed_value
#> Min. : 500 Min. : 2000 Min. :213.9 Min. : 233
#> 1st Qu.: 112700 1st Qu.: 161000 1st Qu.:231.1 1st Qu.: 46765
#> Median : 162900 Median : 245000 Median :238.6 Median : 68090
#> Mean : 246045 Mean : 367749 Mean :240.7 Mean : 102608
#> 3rd Qu.: 252520 3rd Qu.: 381000 3rd Qu.:250.3 3rd Qu.: 105606
#> Max. :26609830 Max. :48000000 Max. :269.1 Max. :10599246
#>
#> norm_sale_amount norm_sales_ratio month_name
#> Min. : 799 Min. :0.04668 July : 46478
#> 1st Qu.: 67516 1st Qu.:0.58860 August : 46071
#> Median : 101487 Median :0.67856 June : 45166
#> Mean : 152752 Mean :0.70978 September: 38673
#> 3rd Qu.: 158303 3rd Qu.:0.79935 May : 37689
#> Max. :19494919 Max. :1.40841 October : 37515
#> (Other) :176713
#> pandemic_status
#> Not Pandemic Periode:337335
#> Pandemic Periode : 90970
#>
#>
#>
#>
#>
The first step in an analysis work is the knowledge of the behavior of the variables involved in the study. Using statistical techniques such as frequency tables, histograms and bar graphs we can better understand the way in which the phenomena under study are distributed.
1. Distribution of The Number of Sales per Year
rs_per_year <- real_estate_clean |>
group_by(year) |>
summarise( n = n())
ggplot(data = rs_per_year, aes( x = year)) +
geom_col( aes(y = n,), position = 'dodge') +
labs( title = 'Distribution of The Number of Sales per Year',
x = NULL,
y = 'Frequency of Sales'
) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
insight = From the year frequency distribution, it can
be seen that the lowest property sales occurred in 2010 and the highest
in 2020. Sales in 2010 were low because data collection was conducted
starting in October 2010. Meanwhile, data for 2020 will be a business
question.
2. Distribution of The Number of Sales per Residential Type
rs_type <- real_estate_clean |>
group_by(residential_type) |>
summarise(n = n()) |>
arrange(-n)
ggplot(rs_type, aes( x = n )) +
geom_col(aes(y = reorder(residential_type,n)))+
labs( title = 'Distribution of The Number of Sales per Residential Type',
x = 'Frequency of Sales',
y = NULL
) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
insight : single-family residential type sold the most
compared to other types
3. Distribution of Total House Price per Year
rs_price <- real_estate_clean |>
group_by(year) |>
summarise( n = sum(sale_amount))
ggplot(rs_price, aes( x = year )) +
geom_col(aes(y = n/1000000))+
labs( title = 'Distribution of Total House Price per Year',
x = NULL,
y = 'USD (in millions)'
) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
insight : 2020 was the year with the highest total
transactions compared to other years. The total price of sales is
equivalent to the value of the total number of sales.
1. Mean
#> [1] 367748.6
insight : the mean value of all property sales in connecticut is 367748.6 USD, but we need to know that the mean value can be affected by outliers. so we need a formula to remove outliers, using median.
2. Median
#> [1] 245000
insight : the results obtained from the median are different from the mean, meaning that there are outliers in this data
1. Standar Deviation
rs_std <- real_estate_clean |>
group_by(town) |>
summarise(std = sd(sale_amount)) |>
arrange(-std)
ggplot(head(rs_std, n=10), aes(x = std)) +
geom_col(aes(y = reorder(town, std))) +
labs( title = 'Standar Deviation Value of Sale Price per Town',
x = 'USD',
y = NULL
) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
insight : Based on standar deviation value, Greenwich
Town is a city with a very large data variation compared to other
cities. but the standard deviation is related to the mean, the model is
less robust. In the case study, we will use the median value to generate
a robust model
2. IQR
rs_boxplot <- real_estate_clean |>
filter( sale_amount <= 2000000) # delete outliers
ggplot(rs_boxplot, aes(x = year, y = sale_amount)) +
geom_boxplot() +
labs( title = 'Boxplot of The Number of Total Sales per Year',
x = NULL,
y = 'USD'
) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
insight : from the boxplot above it can be seen that
the median value of property prices increases slowly
#> [1] 0.724508
if you look at the variation of relationship, price_index increase has the same pattern as the in_pandemic variable. Even though the correlation does not causality, when viewed from the economy, the price_index or inflation value has relationship with pandemic conditions. Where almost countries around the world experienced an increase in inflation during the pandemic.
real_estate_clean$month_year <- myd(paste(real_estate_clean$month, real_estate_clean$year,'1'))
total_med_per_year <- real_estate_clean |>
group_by( price_index, month_year) |>
summarise(price_sale = median(sale_amount)/1000, price_index = median(price_index) )
total_med_per_year <- total_med_per_year |>
filter(
price_sale <= 500 & price_sale >=170
)
# ggplot(data = total_med_per_year) +
# geom_line(aes(x = month_year,y = price_index), color = 'red' ) +
# geom_line(aes(x = month_year,y = price_sale), color = 'blue') +
# labs( title = 'Trend between Price Index and Sale Amount',
# x = NULL,
# y = NULL
# ) +
# theme_minimal() +
# theme( legend.position = c(2020,300),
# plot.title = element_text(hjust = 0.5))
ggplot(data = total_med_per_year, mapping = aes(x=month_year)) +
geom_line(aes(y = price_index, color = 'Price Index'), size = 0.8 ) +
geom_line(aes(y = price_sale, color = 'Sale Amount (in Thousand USD)'), size = 0.8) +
theme_minimal() +
scale_color_manual(name= 'Value',
values = c('Price Index'='red',
'Sale Amount (in Thousand USD)'='blue')) +
labs( title = 'Trend between Price Index and Sale Amount',
x = NULL,
y = NULL
) +
theme( legend.position = 'top',
plot.title = element_text(hjust = 0.5))
In Connecticut, the price index value trend has the same trend as
sale_amount
This question will depend on investors’ preferences in taking on investment risk.
If investors are people who dare to take big risks. Several cities that were recommended included Greenwich, Darein, New Canaan, West Canaan, and Westport. The city was suggested because property prices there have a large price variance as indicated by the IQR values. a large variance, it will allow someone to get cheap prices in cities that have expensive property prices
iqr_per_town <- real_estate_clean |>
group_by(town) |>
summarise( med = median(sale_amount), q1 = quantile(sale_amount, 0.25), q3 = quantile(sale_amount, 0.75), IQR = IQR(sale_amount)) |>
mutate(
low_w = q1 - (1.5*IQR),
upper_w = q3 + (1.5*IQR))|>
arrange(-med)
head(iqr_per_town, n=5)#> # A tibble: 5 × 7
#> town med q1 q3 IQR low_w upper_w
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Greenwich 1450000 805000 2561250 1756250 -1829375 5195625
#> 2 Darien 1372500 899000 2000000 1101000 -752500 3651500
#> 3 New Canaan 1285000 873750 1900000 1026250 -665625 3439375
#> 4 Westport 1129800 750000 1700000 950000 -675000 3125000
#> 5 Weston 814000 619750 1114000 494250 -121625 1855375
#> # A tibble: 5 × 7
#> town med q1 q3 IQR low_w upper_w
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Hartford 140000 79855. 195000 115145. -92863. 367718.
#> 2 Norwich 140000 96000 181000 85000 -31500 308500
#> 3 Torrington 135000 97925 175000 77075 -17688. 290612.
#> 4 Windham 135000 89900 175000 85100 -37750 302650
#> 5 Waterbury 117000 75000 160000 85000 -52500 287500
However, if someone wants a low risk investment risk, then the cities above are a good choice because they have relatively stable prices.
If a man wants to buy property in Connecticut, of course, he needs to know the property prices in each town. By using descriptive statistics through the mean/median function because it is able to represent prices. In this case study, the median value is used because the median measurement is more robust and not affected by outliers.
rs_town_price <- real_estate_clean |>
group_by(town) |>
summarise(price = median(sale_amount)) |>
arrange(-price)
head(rs_town_price)#> # A tibble: 6 × 2
#> town price
#> <fct> <dbl>
#> 1 Greenwich 1450000
#> 2 Darien 1372500
#> 3 New Canaan 1285000
#> 4 Westport 1129800
#> 5 Weston 814000
#> 6 Wilton 765000
Insight: from the table above it is known that Greenwich is the most expensive city in terms of average property prices
rs_town_5sum <- real_estate_clean |>
group_by(town) |>
summarise( med = median(sale_amount), q1 = quantile(sale_amount, 0.25), q3 = quantile(sale_amount, 0.75), IQR = q3-q1 ) |>
mutate(
low_w = q1 - (1.5*IQR),
upper_w = q3 + (1.5*IQR)
) |> arrange(-med)
rs_town_5sum#> # A tibble: 169 × 7
#> town med q1 q3 IQR low_w upper_w
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Greenwich 1450000 805000 2561250 1756250 -1829375 5195625
#> 2 Darien 1372500 899000 2000000 1101000 -752500 3651500
#> 3 New Canaan 1285000 873750 1900000 1026250 -665625 3439375
#> 4 Westport 1129800 750000 1700000 950000 -675000 3125000
#> 5 Weston 814000 619750 1114000 494250 -121625 1855375
#> 6 Wilton 765000 565015 1010000 444985 -102462. 1677478.
#> 7 Roxbury 653500 439750 1056250 616500 -485000 1981000
#> 8 Ridgefield 619000 428410 850000 421590 -203975 1482385
#> 9 Easton 590000 467250 740000 272750 58125 1149125
#> 10 Washington 554500 320650 1323750 1003100 -1184000 2828400
#> # ℹ 159 more rows
Case: if a man decides to settle in the city of New Canaan because it is close to NYC. What costs need to be prepared? the cost that needs to be prepared is around 1,285,000 USD. But what if the man only has 900,000 USD? Of course, it doesn’t matter because the range of property prices in the city of New Canaan is from 873,750 USD to 1,900,000 USD
real_estate_clean |>
filter(town == 'New Canaan' & sale_amount <= 900000) |>
group_by(residential_type) |>
summarise(n = n())#> # A tibble: 5 × 2
#> residential_type n
#> <fct> <int>
#> 1 Condo 419
#> 2 Four Family 2
#> 3 Single Family 528
#> 4 Three Family 2
#> 5 Two Family 14
a price of 900,000 USD buying a property with a Single Family or Condominium type is the right option compared to buying other types because based on the transaction data that has been collected, fewer transactions occur with 900.00 USD money to buy another type of house
If a Yale University student wants to buy a condominium-type property in the city of New Haven, what is the price range that needs to be spent?
rs_new_haven <- real_estate_clean |>
filter(town == 'New Haven') |>
group_by(residential_type, sale_amount) |>
summarise(price = mean(sale_amount)) |>
mutate(
q1 = quantile(sale_amount, 0.25),
med = median(sale_amount),
q3 = quantile(sale_amount, 0.75)
)
rs_new_haven |>
group_by(residential_type) |>
summarise( mean = mean(sale_amount), med = mean(med), q1 = mean(q1), q3 = mean(q3) )#> # A tibble: 5 × 5
#> residential_type mean med q1 q3
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 Condo 233465. 178500 99450 312000
#> 2 Four Family 324224. 285000 172362. 434000
#> 3 Single Family 309478. 221000 132500 368250
#> 4 Three Family 269072. 217750 138625 355456.
#> 5 Two Family 230299. 180000 105600 305250
The estimated cost required to buy a condominium type property is 99,450 USD to 312,000 USD
The increase in the average price of housing costs in 2020 was
influenced by the pandemic that occurred at that time. The pandemic has
increased the price of inflation (price_index) which has affected
various price increases for various economic sectors including property.
If an investor is a high-risk taker, buying property in areas
with high price variations can be a profitable option. Greenwich,
Darein, New Canaan, West Canaan, and Westport are the top 5 cities based
on IQR value
Property prices for residential type single family around New
Canaan from 873,750 USD to 1,900,000 USD
The estimated cost required to buy a condominium-type property in
New Haven is 99450 USD to 312,000 USD