The aim of this project is to perform exploratory data analysis on the Ames housing data. The data describes the characteristics of houses in Ames, Iowa, and the prices at which they sold.
Exploratory data analysis is a necessary first step in building a reliable model.
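library(tidyverse)  # assumed: provides glimpse(), the dplyr verbs and ggplot2
library(modeldata)  # assumed source of the ames data set
data(ames)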
glimpse(ames)
## Rows: 2,930
## Columns: 74
## $ MS_SubClass <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
## $ MS_Zoning <fct> Residential_Low_Density, Residential_High_Density, …
## $ Lot_Frontage <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
## $ Lot_Area <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
## $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
## $ Alley <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
## $ Lot_Shape <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
## $ Land_Contour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
## $ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
## $ Lot_Config <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
## $ Land_Slope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ Neighborhood <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
## $ Condition_1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
## $ Condition_2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
## $ Bldg_Type <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
## $ House_Style <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
## $ Overall_Cond <fct> Average, Above_Average, Above_Average, Average, Ave…
## $ Year_Built <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
## $ Year_Remod_Add <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
## $ Roof_Style <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
## $ Roof_Matl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
## $ Exterior_1st <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Exterior_2nd <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Mas_Vnr_Type <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
## $ Mas_Vnr_Area <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
## $ Exter_Cond <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Foundation <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
## $ Bsmt_Cond <fct> Good, Typical, Typical, Typical, Typical, Typical, …
## $ Bsmt_Exposure <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
## $ BsmtFin_Type_1 <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
## $ BsmtFin_SF_1 <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
## $ BsmtFin_Type_2 <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
## $ BsmtFin_SF_2 <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
## $ Bsmt_Unf_SF <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
## $ Total_Bsmt_SF <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
## $ Heating_QC <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
## $ Central_Air <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
## $ First_Flr_SF <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
## $ Second_Flr_SF <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
## $ Gr_Liv_Area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
## $ Bsmt_Full_Bath <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
## $ Bsmt_Half_Bath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Full_Bath <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
## $ Half_Bath <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ Bedroom_AbvGr <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
## $ Kitchen_AbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotRms_AbvGrd <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
## $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
## $ Fireplaces <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
## $ Garage_Type <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
## $ Garage_Finish <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
## $ Garage_Cars <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
## $ Garage_Area <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
## $ Garage_Cond <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Paved_Drive <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
## $ Wood_Deck_SF <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
## $ Open_Porch_SF <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
## $ Enclosed_Porch <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Screen_Porch <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
## $ Pool_Area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Pool_QC <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
## $ Fence <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
## $ Misc_Feature <fct> None, None, Gar2, None, None, None, None, None, Non…
## $ Misc_Val <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
## $ Mo_Sold <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
## $ Year_Sold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
## $ Sale_Type <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
## $ Sale_Condition <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
## $ Sale_Price <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
## $ Longitude <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638…
## $ Latitude <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4…
What is the relationship between sale price and the other numeric variables?
ames |>
# Select all Numeric Variables
select(where(is.numeric)) |> cor() |>
# Convert the correlation matrix to data.frame
as.data.frame() |>
# Convert row names to columns
rownames_to_column(var = "rowname") |>
# Convert data frame into a tibble
as_tibble() |>
select(rowname, Sale_Price) |>
filter(rowname != "Sale_Price") |>
arrange(desc(Sale_Price)) |>
ggplot(aes(x = Sale_Price, y = fct_reorder(rowname, Sale_Price))) +
geom_col(aes(fill = Sale_Price > 0), color = "white", show.legend = FALSE) + # Use different colors for positive and negative correlations
labs(y = NULL,
fill = NULL,
x = "Sale Price Correlation",
title = "Sale Price Correlation with other numeric variables") +
scale_fill_manual(values = c("darkred", "darkgreen"), labels = c("Negative", "Positive")) + # Customize fill colors
theme_minimal() # Use a minimal theme for better readability
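The ranking in this plot comes down to the Pearson correlation between `Sale_Price` and each numeric column. As a minimal sketch, assuming a small made-up data frame `toy` in place of `ames` (the column names and values are illustrative only), the same kind of ranking can be computed without building the full correlation matrix:

```r
# Toy data standing in for ames; the values are invented for illustration.
toy <- data.frame(
  price = c(100, 150, 200, 250, 300),
  area  = c(50, 70, 95, 120, 160),
  age   = c(40, 30, 25, 10, 5)
)

# Correlate every column except the target with the target itself.
predictors <- setdiff(names(toy), "price")
cors <- sapply(toy[predictors], function(x) cor(x, toy$price))

# Strongest positive correlation first, strongest negative last.
sort(cors, decreasing = TRUE)
```

Applied to the real data, `toy` would be replaced by the numeric columns of `ames` and `price` by `Sale_Price`.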
What is the distribution of our target variable, Sale Price?
ames |>
ggplot(aes(x = Sale_Price)) +
geom_histogram(color = "white", fill = "darkorchid", bins = 30) +
scale_x_continuous(labels = scales::dollar) +
labs(x = "Sale Price in Dollars",
y = "Frequency",
title = "Sale Price Histogram")
The sale price is skewed to the right, which means inexpensive houses are far more common than expensive ones.
Only a few homes sold for more than $600,000.
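A quick numeric check for right skew is to compare the mean with the median: a long right tail pulls the mean above the median. The sketch below uses a made-up vector `prices` (not values from the data set) and is a rule of thumb rather than a formal test:

```r
# Invented prices with one very expensive house forming a right tail.
prices <- c(120000, 135000, 150000, 160000, 175000, 190000, 620000)

price_mean   <- mean(prices)
price_median <- median(prices)

# TRUE suggests a right-skewed distribution.
price_mean > price_median
```

The same comparison could be run on `ames$Sale_Price`.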
House prices tend to rise with garage capacity, measured in number of cars. In the plot, n represents the number of houses in each category; garages with room for two cars are the most common.
ames |>
add_count(Garage_Cars) |>
mutate(Garage_Cars = fct_reorder(factor(Garage_Cars), Sale_Price)) |>
ggplot(aes(x = Garage_Cars, y = Sale_Price, fill = n)) +
geom_boxplot(color = "grey") +
scale_fill_viridis_c() +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Sale Price Vs Garage Cars",
subtitle = "House prices tend to increase with garage car capacity",
y = "House selling Price",
x = "Size of garage in car capacity",
caption = "n represents the number of houses per garage capacity")
There is a strong, positive relationship between garage area and sale price: houses with larger garages tend to sell for more.
ames |>
ggplot(aes(x = Garage_Area, y = Sale_Price)) +
geom_point(aes(color = factor(Garage_Cars)), alpha = 0.5) +
scale_y_continuous(labels = scales::dollar) +
geom_smooth(method = "lm", se = FALSE, lty = 2, color = "gray",
linewidth = 0.8) +
labs(title = "Garage Area and Sale Price relationship",
subtitle = "Sale price tends to increase with garage area",
x = "Garage area in square feet",
y = "House selling price",
color = "Garage Car Capacity")
The plot above also shows, as expected, a strong correlation between garage area and garage car capacity.
More houses in the dataset were built in the 2000s than in any other decade.
How many houses were built per decade?
ames |>
select(Year_Built, Sale_Price) |>
mutate(
decade = case_when(between(Year_Built, 1870, 1879) ~ "1870s",
between(Year_Built, 1880, 1889) ~ "1880s",
between(Year_Built, 1890, 1899) ~ "1890s",
between(Year_Built, 1900, 1909) ~ "1900s",
between(Year_Built, 1910, 1919) ~ "1910s",
between(Year_Built, 1920, 1929) ~ "1920s",
between(Year_Built, 1930, 1939) ~ "1930s",
between(Year_Built, 1940, 1949) ~ "1940s",
between(Year_Built, 1950, 1959) ~ "1950s",
between(Year_Built, 1960, 1969) ~ "1960s",
between(Year_Built, 1970, 1979) ~ "1970s",
between(Year_Built, 1980, 1989) ~ "1980s",
between(Year_Built, 1990, 1999) ~ "1990s",
between(Year_Built, 2000, 2009) ~ "2000s",
between(Year_Built, 2010, 2019) ~ "2010s"
)
) |>
add_count(decade) |>
ggplot(aes( y = decade)) +
geom_bar(fill = "darkorchid") +
labs(y = NULL,
x = "Number of houses",
title = "Number of houses per decade",
subtitle = "The 2000s saw the most houses built")
The 2000s account for more houses than any other decade.
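The decade labels produced by the long `case_when()` call can also be derived arithmetically. The sketch below is one possible alternative, not the approach used in this analysis; `year_to_decade()` is a hypothetical helper name and the sample years are illustrative:

```r
# Hypothetical helper: floor a year to the start of its decade and append "s".
year_to_decade <- function(year) {
  paste0((year %/% 10) * 10, "s")
}

# A few sample construction years.
houses <- data.frame(Year_Built = c(1872L, 1950L, 1999L, 2010L))
houses$decade <- year_to_decade(houses$Year_Built)
houses
```

Inside the pipelines in this post, the equivalent step would be `mutate(decade = year_to_decade(Year_Built))`, which would avoid repeating the same fifteen conditions in every chunk.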
What is the relationship between sale price and the year the house was built?
ames |>
select(Year_Built, Sale_Price) |>
mutate(
decade = case_when(between(Year_Built, 1870, 1879) ~ "1870s",
between(Year_Built, 1880, 1889) ~ "1880s",
between(Year_Built, 1890, 1899) ~ "1890s",
between(Year_Built, 1900, 1909) ~ "1900s",
between(Year_Built, 1910, 1919) ~ "1910s",
between(Year_Built, 1920, 1929) ~ "1920s",
between(Year_Built, 1930, 1939) ~ "1930s",
between(Year_Built, 1940, 1949) ~ "1940s",
between(Year_Built, 1950, 1959) ~ "1950s",
between(Year_Built, 1960, 1969) ~ "1960s",
between(Year_Built, 1970, 1979) ~ "1970s",
between(Year_Built, 1980, 1989) ~ "1980s",
between(Year_Built, 1990, 1999) ~ "1990s",
between(Year_Built, 2000, 2009) ~ "2000s",
between(Year_Built, 2010, 2019) ~ "2010s"
)
) |>
add_count(decade) |>
ggplot(aes(x = Sale_Price, y = decade, fill = n)) +
geom_boxplot() +
scale_fill_viridis_b() +
scale_x_continuous(labels = scales::dollar) +
labs(y = NULL,
x = "House selling price",
title = "Exploring the Correlation Between House Prices and Construction Decades")
Newer houses tend to cost more than older ones, and the 2000s contributed more houses than any other decade.
sale_price_summary_per_decade <- ames |>
select(Year_Built, Sale_Price) |>
mutate(
decade = case_when(between(Year_Built, 1870, 1879) ~ "1870s",
between(Year_Built, 1880, 1889) ~ "1880s",
between(Year_Built, 1890, 1899) ~ "1890s",
between(Year_Built, 1900, 1909) ~ "1900s",
between(Year_Built, 1910, 1919) ~ "1910s",
between(Year_Built, 1920, 1929) ~ "1920s",
between(Year_Built, 1930, 1939) ~ "1930s",
between(Year_Built, 1940, 1949) ~ "1940s",
between(Year_Built, 1950, 1959) ~ "1950s",
between(Year_Built, 1960, 1969) ~ "1960s",
between(Year_Built, 1970, 1979) ~ "1970s",
between(Year_Built, 1980, 1989) ~ "1980s",
between(Year_Built, 1990, 1999) ~ "1990s",
between(Year_Built, 2000, 2009) ~ "2000s",
between(Year_Built, 2010, 2019) ~ "2010s"
)
) |>
group_by(decade) |>
summarise(
avg_price = round(mean(Sale_Price),2),
max_price = max(Sale_Price),
min_price = min(Sale_Price), .groups = "drop"
)
sale_price_summary_per_decade |> knitr::kable()
| decade | avg_price | max_price | min_price |
|---|---|---|---|
| 1870s | 133666.7 | 185000 | 94000 |
| 1880s | 165497.4 | 295000 | 100000 |
| 1890s | 162675.9 | 475000 | 50138 |
| 1900s | 122217.4 | 240000 | 44000 |
| 1910s | 126310.6 | 239000 | 37900 |
| 1920s | 121855.4 | 256000 | 12789 |
| 1930s | 142260.9 | 415000 | 52000 |
| 1940s | 125016.3 | 266500 | 35311 |
| 1950s | 139892.1 | 335000 | 13100 |
| 1960s | 151393.8 | 375000 | 62383 |
| 1970s | 151884.1 | 345000 | 71000 |
| 1980s | 187142.5 | 385000 | 112000 |
| 1990s | 223266.6 | 755000 | 93500 |
| 2000s | 248388.1 | 615000 | 84500 |
| 2010s | 283116.0 | 394432 | 187000 |
On average, newer houses are priced higher than older ones: the decade averages rise steadily from the 1940s onward. The newer the house, the pricier it is likely to be.
sale_price_summary_per_decade |>
pivot_longer(cols = avg_price:min_price, values_to = "price", names_to = "summary") |>
ggplot(aes(x = price, y = decade, fill = summary)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = c("darkorange","darkorchid","cyan4")) +
facet_grid(~ summary, scales = "free_x") +
scale_x_continuous(labels = scales::dollar) +
labs(
x = NULL,
y = NULL,
title = "House price summary by decade"
)
ames |>
select(Year_Built, Sale_Price) |>
mutate(
decade = case_when(between(Year_Built, 1870, 1879) ~ "1870s",
between(Year_Built, 1880, 1889) ~ "1880s",
between(Year_Built, 1890, 1899) ~ "1890s",
between(Year_Built, 1900, 1909) ~ "1900s",
between(Year_Built, 1910, 1919) ~ "1910s",
between(Year_Built, 1920, 1929) ~ "1920s",
between(Year_Built, 1930, 1939) ~ "1930s",
between(Year_Built, 1940, 1949) ~ "1940s",
between(Year_Built, 1950, 1959) ~ "1950s",
between(Year_Built, 1960, 1969) ~ "1960s",
between(Year_Built, 1970, 1979) ~ "1970s",
between(Year_Built, 1980, 1989) ~ "1980s",
between(Year_Built, 1990, 1999) ~ "1990s",
between(Year_Built, 2000, 2009) ~ "2000s",
between(Year_Built, 2010, 2019) ~ "2010s"
)
) |>
add_count(decade) |>
ggplot(aes(x = Year_Built, y = Sale_Price, color = decade)) +
geom_point() +
scale_y_continuous(labels = scales::dollar) +
labs(x = NULL,
y = NULL,
color = "Decade",
title = "House prices vs year built",
subtitle = "Newly built houses are more expensive than old houses") +
scale_color_viridis_d()
The relationship between price and total basement area
summary(ames$Total_Bsmt_SF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 793 990 1051 1302 6110
We have a positive and strong relationship between basement area and house selling price.
ames |>
ggplot(aes(x = Total_Bsmt_SF, y = Sale_Price)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, lty = 2, color = "gray") +
scale_y_continuous(labels = scales::dollar) +
theme_minimal() +
labs(x = "Total basement area in square feet",
y = "House selling price",
title = "Total Basement area vs House selling price",
subtitle = "There is a strong and positive correlation between basement area and house selling price")
A few houses have more than 4,000 square feet of basement area yet sold for less than $250,000. Most houses in our data set have a basement; the summary above puts the median basement area at 990 square feet.
ames |>
ggplot(aes(x = Total_Bsmt_SF)) +
geom_histogram(bins = round(sqrt(nrow(ames))), color = "white", fill = "darkorchid") +
labs(
x = "Basement area in Square Feet",
y = "Frequency",
title = "Basement area in square feet histogram"
) +
theme_minimal()
Basement area is skewed to the right: houses with smaller basements are far more common than houses with very large ones.
`Year_Remod_Add` records the remodel date (the same as the construction date if there was no remodeling or addition).
Most houses in the 1950s were remodeled.
ames |>
# To check if the house was remodeled
mutate(
remodeled = if_else(Year_Built < Year_Remod_Add, "Remodeled", "Not Remodeled")
) |>
ggplot(aes(x = Year_Remod_Add, y = Sale_Price, color = Overall_Cond)) +
geom_point(alpha = 0.5) +
theme_minimal() +
scale_y_continuous(labels = scales::dollar) +
labs(color = "House Condition",
y = NULL,
x = "Year Modified") +
facet_wrap(~ remodeled)
The plot does not show any clear difference in price between remodeled and non-remodeled houses.
Since we don’t have the price of the houses before remodeling, we can only assume that the price of a remodeled house increased after remodeling.
Among remodeled houses, more are in Above Average, Good, Very Good or Excellent condition than among houses that were not remodeled. We also don’t have the condition of the houses before remodeling, so we can only suppose that remodeling improved house condition, which in turn would have a positive effect on price.
All houses built from the 1870s to the 1940s were remodeled.
ames |>
# To check if the house was remodeled
mutate(
remodeled = if_else(Year_Built < Year_Remod_Add, "Remodeled", "Not Remodeled")
) |>
group_by(remodeled) |>
summarise(avg_price = mean(Sale_Price),
min_price = min(Sale_Price),
max_price = max(Sale_Price)) |> knitr::kable()
| remodeled | avg_price | min_price | max_price |
|---|---|---|---|
| Not Remodeled | 184324.7 | 13100 | 745000 |
| Remodeled | 176722.6 | 12789 | 755000 |
ames |>
mutate(
decade = case_when(between(Year_Built, 1870, 1879) ~ "1870s",
between(Year_Built, 1880, 1889) ~ "1880s",
between(Year_Built, 1890, 1899) ~ "1890s",
between(Year_Built, 1900, 1909) ~ "1900s",
between(Year_Built, 1910, 1919) ~ "1910s",
between(Year_Built, 1920, 1929) ~ "1920s",
between(Year_Built, 1930, 1939) ~ "1930s",
between(Year_Built, 1940, 1949) ~ "1940s",
between(Year_Built, 1950, 1959) ~ "1950s",
between(Year_Built, 1960, 1969) ~ "1960s",
between(Year_Built, 1970, 1979) ~ "1970s",
between(Year_Built, 1980, 1989) ~ "1980s",
between(Year_Built, 1990, 1999) ~ "1990s",
between(Year_Built, 2000, 2009) ~ "2000s",
between(Year_Built, 2010, 2019) ~ "2010s"
),
remodeled = if_else(Year_Built < Year_Remod_Add, "Remodeled", "Not Remodeled")
) |>
group_by(decade) |>
count(remodeled) |>
pivot_wider(names_from = remodeled, values_from = n) |> knitr::kable()
| decade | Remodeled | Not Remodeled |
|---|---|---|
| 1870s | 3 | NA |
| 1880s | 8 | NA |
| 1890s | 15 | NA |
| 1900s | 40 | NA |
| 1910s | 110 | NA |
| 1920s | 196 | NA |
| 1930s | 109 | NA |
| 1940s | 151 | NA |
| 1950s | 112 | 228 |
| 1960s | 85 | 272 |
| 1970s | 65 | 299 |
| 1980s | 31 | 89 |
| 1990s | 160 | 174 |
| 2000s | 275 | 505 |
| 2010s | NA | 3 |
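The `NA` cells in the table above mark decade and remodeling combinations with no houses. As a minimal sketch, assuming the tidyr package and a three-row excerpt of the counts shown above, `pivot_wider()` can fill those gaps with zeros through its `values_fill` argument:

```r
library(tidyr)

# Excerpt of the long-format counts; the numbers are taken from the table above.
counts <- data.frame(
  decade    = c("1940s", "1950s", "1950s"),
  remodeled = c("Remodeled", "Remodeled", "Not Remodeled"),
  n         = c(151L, 112L, 228L)
)

# values_fill replaces missing combinations with 0 instead of NA.
wide <- pivot_wider(
  counts,
  names_from  = remodeled,
  values_from = n,
  values_fill = 0L
)
wide
```

With this option the 1940s row would show 0 under "Not Remodeled" rather than `NA`.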
The majority of houses lack a masonry veneer, while those with stone masonry veneer stand out as the most expensive among the houses.
ames |>
add_count(Mas_Vnr_Type) |>
ggplot(aes(x = fct_reorder(Mas_Vnr_Type, Sale_Price), y = Sale_Price, fill = n)) +
geom_boxplot() +
scale_fill_viridis_b() +
scale_y_continuous(labels = scales::dollar) +
labs(x = "Masonry Veneer Type",
y = "House Selling Price (USD)",
title = "Sale Price by Masonry Veneer Type",
subtitle = "Houses with Stone Veneer Are Priced Higher",
caption = "n indicates the number of houses in each category")
ames |>
ggplot(aes(x = Mas_Vnr_Area, y = Sale_Price)) +
geom_point(aes(color = Mas_Vnr_Type)) +
geom_smooth(method = "lm", se = FALSE, lty = 2, color = "gray")
There is a strong, positive relationship between above-ground living area and house selling price. Houses with a large above-ground living area tend to be more expensive than houses with a smaller one.
ames |>
ggplot(aes(x = Gr_Liv_Area, y = Sale_Price)) +
geom_point(color = "darkorchid") +
geom_smooth(method = "lm", se = FALSE, color = "gray", lty = 2) +
scale_y_continuous(labels = scales::dollar) +
scale_x_continuous(labels = scales::number) +
labs(
x = "Above grade (ground) living area square feet",
y = "House selling price in USD",
title = "Relationship between ground living area and sale price"
)
Only 13 houses in our dataset have a pool. We compare the prices of houses with a pool against those without to see whether there is a price difference.
Although the two groups are heavily imbalanced, houses with a pool are, on average, more expensive than those without.
However, there are several outliers: houses without a pool that are nonetheless very expensive. This might be due to the presence of other luxury features.
ames |>
mutate(Pool_Area = ifelse(Pool_Area > 0, "Yes", "No")) |>
add_count(Pool_Area) |>
ggplot(aes(x = Pool_Area, y = Sale_Price, fill = n)) +
geom_boxplot() +
scale_y_continuous(labels = scales::dollar) +
scale_fill_viridis_b() +
labs(x = "Pool Area",
y = "House selling price in USD",
fill = "No. of Houses",
title = "House has a pool? Vs Sale Price")