Question 1: Cleaning, Reshaping, Summarizing, and Joining Datasets & Visualization
The Owned Housing and Rented Housing datasets were loaded into R using read_csv(). The head() function was used to inspect the first few rows of each dataset to verify that the data was imported correctly and to understand the structure of the variables.
library(readr)
owned <- read_csv("~/AdvDSAProject/OwnedHousingbyCity_GA_Final.csv")
## Rows: 555 Columns: 77
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): RegionName, State, CountyName
## dbl (74): RegionID, SizeRank, 2020-01-31, 2020-02-29, 2020-03-31, 2020-04-30...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
rented <- read_csv("~/AdvDSAProject/RentedHousing_GA_Final.csv")
## Rows: 23 Columns: 77
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): RegionName, RegionType
## dbl (75): RegionID, SizeRank, StateName, 2020-01-31, 2020-02-29, 2020-03-31,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To focus the analysis on a single market, the datasets were filtered to include only data for the city of Atlanta. This allows for a direct comparison between owned housing prices and rental prices within the same geographic area.
# Example: pick Atlanta city
city_owned <- owned %>% filter(RegionName == "Atlanta")
city_rented <- rented %>% filter(RegionName == "Atlanta, GA")
The datasets were originally in a wide format, with dates stored as column names. These columns were converted into a long format using pivot_longer() so that dates could be treated as a time variable.
The date values were then converted to proper Date objects. Monthly average prices were calculated for both owned and rented housing by grouping the data by date.
Finally, the owned and rented datasets were merged by date using a left join to create a combined dataset suitable for comparison and visualization.
city_owned_long <- city_owned %>%
pivot_longer(
cols = matches("^20"),
names_to = "Date",
values_to = "OwnedPrice"
) %>%
mutate(Date = as.Date(Date))
city_rented_long <- city_rented %>%
pivot_longer(
cols = matches("^20"),
names_to = "Date",
values_to = "RentPrice"
) %>%
mutate(Date = as.Date(Date))
count(city_owned_long, Date) %>% filter(n > 1)
## # A tibble: 0 × 2
## # ℹ 2 variables: Date <date>, n <int>
count(city_rented_long, Date) %>% filter(n > 1)
## # A tibble: 0 × 2
## # ℹ 2 variables: Date <date>, n <int>
city_owned_monthly <- city_owned_long %>%
group_by(Date) %>%
summarise(OwnedPrice = mean(OwnedPrice, na.rm = TRUE), .groups = "drop")
city_rented_monthly <- city_rented_long %>%
group_by(Date) %>%
summarise(RentPrice = mean(RentPrice, na.rm = TRUE), .groups = "drop")
city_combined <- left_join(
city_owned_monthly,
city_rented_monthly,
by = "Date"
)
city_combined <- city_combined %>%
drop_na(OwnedPrice, RentPrice)
Used head to confirm the join was successful.
str(city_combined)
## tibble [72 × 3] (S3: tbl_df/tbl/data.frame)
## $ Date : Date[1:72], format: "2020-01-31" "2020-02-29" ...
## $ OwnedPrice: num [1:72] 331814 334567 337343 338886 338871 ...
## $ RentPrice : num [1:72] 1434 1443 1456 1466 1477 ...
head(city_combined)
## # A tibble: 6 × 3
## Date OwnedPrice RentPrice
## <date> <dbl> <dbl>
## 1 2020-01-31 331814 1434
## 2 2020-02-29 334567 1443
## 3 2020-03-31 337343 1456
## 4 2020-04-30 338886 1466
## 5 2020-05-31 338871 1477
## 6 2020-06-30 337707 1489
Visualize data Line plot of Owned vs Rented prices over time: This graph shows the trend of owned home prices and rental prices in Atlanta over time. Owned housing prices increased sharply after 2020, while rental prices increased more gradually.
ggplot(city_combined, aes(x=Date)) +
geom_line(aes(y=OwnedPrice, color="Owned")) +
geom_line(aes(y=RentPrice, color="Rented")) +
labs(title="Housing Prices vs Rent in Atlanta (GA)",
x="Date", y="Price (USD)", color="Legend") +
theme_minimal()
Although rent prices increased steadily over time, they appear flat when plotted alongside home prices due to the large difference in magnitude between the two variables. Home prices range in the hundreds of thousands of dollars, while rent prices are in the low thousands. To make trends comparable, the data were normalized (or plotted using a secondary axis).
city_combined_scaled <- city_combined %>%
mutate(
Owned_scaled = OwnedPrice / max(OwnedPrice),
Rent_scaled = RentPrice / max(RentPrice)
)
ggplot(city_combined_scaled, aes(x = Date)) +
geom_line(aes(y = Owned_scaled, color = "Owned")) +
geom_line(aes(y = Rent_scaled, color = "Rented")) +
labs(
title = "Trends in Home Prices and Rent in Atlanta (GA)",
x = "Date",
y = "Normalized Price",
color = "Legend"
) +
theme_minimal()
Question 2: Simple Calculation
The mean owned home price in Atlanta during the study period was substantially higher than the mean rent price, reflecting the large difference in scale between ownership and renting. This supports the hypothesis that home prices increased at a faster rate than rental prices.
mean(city_combined$OwnedPrice, na.rm = TRUE)
## [1] 418639.8
mean(city_combined$RentPrice, na.rm = TRUE)
## [1] 1897.625
Question 3: Regression
The regression analysis shows a positive relationship between owned housing prices and rent prices. The R² value indicates how much of the variation in rent prices is explained by changes in home prices, and the statistically significant p-value suggests this relationship is unlikely due to chance.
model <- lm(RentPrice ~ OwnedPrice, data = city_combined)
summary(model)
##
## Call:
## lm(formula = RentPrice ~ OwnedPrice, data = city_combined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -168.50 -43.99 -17.88 32.76 209.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.575e+02 1.054e+02 -2.443 0.0171 *
## OwnedPrice 5.148e-03 2.505e-04 20.549 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 87.8 on 70 degrees of freedom
## Multiple R-squared: 0.8578, Adjusted R-squared: 0.8558
## F-statistic: 422.3 on 1 and 70 DF, p-value: < 2.2e-16
ggplot(city_combined, aes(x = OwnedPrice, y = RentPrice)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Relationship Between Home Prices and Rent Prices",
x = "Owned Home Price",
y = "Rent Price"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Question 4: Histogram & distribution
The histogram shows that rent prices are approximately right-skewed, with most values concentrated in the mid-range and fewer high-price observations.
ggplot(city_combined, aes(x = RentPrice)) +
geom_histogram(bins = 20) +
labs(
title = "Distribution of Rent Prices in Atlanta",
x = "Rent Price",
y = "Frequency"
) +
theme_minimal()
Question 5: Group Comparison
A two-sample t-test was conducted to compare average rent prices in 2020 versus after 2020. The results show a statistically significant difference in rent prices (t = -19.33, p < 0.001). Average rent increased from approximately $1,501 in 2020 to $1,977 after 2020. The 95% confidence interval indicates that rent prices increased by between $427 and $526, suggesting a substantial rise in rental costs following 2020.
city_combined <- city_combined %>%
mutate(
Period = ifelse(
format(Date, "%Y") == "2020",
"2020",
"After 2020"
)
)
t.test(RentPrice ~ Period, data = city_combined)
##
## Welch Two Sample t-test
##
## data: RentPrice by Period
## t = -19.332, df = 60.392, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 2020 and group After 2020 is not equal to 0
## 95 percent confidence interval:
## -525.7409 -427.1591
## sample estimates:
## mean in group 2020 mean in group After 2020
## 1500.583 1977.033