Introduction

The aim of this project is to perform exploratory data analysis on the Ames housing data. The dataset describes the characteristics of houses in Ames, Iowa, together with their sale prices.

Exploratory data analysis is a necessary first step toward building a reliable model.

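library(tidyverse)  # glimpse(), dplyr verbs, ggplot2, forcats
library(modeldata)  # assumed source of the `ames` data (74 columns)
data(ames)
glimpse(ames)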
## Rows: 2,930
## Columns: 74
## $ MS_SubClass        <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
## $ MS_Zoning          <fct> Residential_Low_Density, Residential_High_Density, …
## $ Lot_Frontage       <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
## $ Lot_Area           <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
## $ Street             <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
## $ Alley              <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
## $ Lot_Shape          <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
## $ Land_Contour       <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
## $ Utilities          <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
## $ Lot_Config         <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
## $ Land_Slope         <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ Neighborhood       <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
## $ Condition_1        <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
## $ Condition_2        <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
## $ Bldg_Type          <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
## $ House_Style        <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
## $ Overall_Cond       <fct> Average, Above_Average, Above_Average, Average, Ave…
## $ Year_Built         <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
## $ Year_Remod_Add     <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
## $ Roof_Style         <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
## $ Roof_Matl          <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
## $ Exterior_1st       <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Exterior_2nd       <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Mas_Vnr_Type       <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
## $ Mas_Vnr_Area       <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
## $ Exter_Cond         <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Foundation         <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
## $ Bsmt_Cond          <fct> Good, Typical, Typical, Typical, Typical, Typical, …
## $ Bsmt_Exposure      <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
## $ BsmtFin_Type_1     <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
## $ BsmtFin_SF_1       <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
## $ BsmtFin_Type_2     <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
## $ BsmtFin_SF_2       <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
## $ Bsmt_Unf_SF        <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
## $ Total_Bsmt_SF      <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
## $ Heating            <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
## $ Heating_QC         <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
## $ Central_Air        <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ Electrical         <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
## $ First_Flr_SF       <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
## $ Second_Flr_SF      <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
## $ Gr_Liv_Area        <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
## $ Bsmt_Full_Bath     <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
## $ Bsmt_Half_Bath     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Full_Bath          <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
## $ Half_Bath          <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ Bedroom_AbvGr      <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
## $ Kitchen_AbvGr      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotRms_AbvGrd      <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
## $ Functional         <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
## $ Fireplaces         <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
## $ Garage_Type        <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
## $ Garage_Finish      <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
## $ Garage_Cars        <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
## $ Garage_Area        <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
## $ Garage_Cond        <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Paved_Drive        <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
## $ Wood_Deck_SF       <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
## $ Open_Porch_SF      <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
## $ Enclosed_Porch     <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Screen_Porch       <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
## $ Pool_Area          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Pool_QC            <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
## $ Fence              <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
## $ Misc_Feature       <fct> None, None, Gar2, None, None, None, None, None, Non…
## $ Misc_Val           <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
## $ Mo_Sold            <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
## $ Year_Sold          <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
## $ Sale_Type          <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
## $ Sale_Condition     <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
## $ Sale_Price         <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
## $ Longitude          <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638…
## $ Latitude           <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4…

Sale Price correlation with other numeric variables

How does Sale Price relate to the other numeric variables?

ames |> 
    # Select all Numeric Variables
    select(where(is.numeric)) |> cor() |> 
    
    # Convert the correlation matrix to data.frame
    as.data.frame() |> 
    # Convert row names to columns
    rownames_to_column(var = "rowname") |> 
    # Convert data frame into a tibble
    as_tibble() |> 
    select(rowname, Sale_Price) |> 
    filter(rowname != "Sale_Price") |> 
    arrange(desc(Sale_Price)) |> 
    ggplot(aes(x = Sale_Price, y = fct_reorder(rowname, Sale_Price))) +
    geom_col(aes(fill = Sale_Price > 0), color = "white", show.legend = FALSE) +  # Use different colors for positive and negative correlations
    labs(y = NULL,
         fill = NULL,
         x = "Sale Price Correlation",
         title = "Sale Price Correlation with other numeric variables") +
    scale_fill_manual(values = c("darkred", "darkgreen"), labels = c("Negative", "Positive")) +  # Customize fill colors
    theme_minimal()  # Use a minimal theme for better readability
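
The core of the pipeline above is taking one column of the correlation matrix. A minimal base R sketch of that step, using a small hypothetical data frame (the values and the `Age` column are illustrative assumptions, not taken from `ames`):

```r
# Hypothetical toy data (illustrative only, not taken from ames)
toy <- data.frame(
  Sale_Price  = c(100000, 150000, 200000, 250000),
  Gr_Liv_Area = c(800, 1100, 1500, 1900),
  Age         = c(50, 30, 20, 5)
)

# Keep numeric columns and compute the full correlation matrix
num <- toy[sapply(toy, is.numeric)]
cor_matrix <- cor(num)

# Take the Sale_Price column and drop its correlation with itself
cors <- cor_matrix[, "Sale_Price"]
cors <- cors[names(cors) != "Sale_Price"]

# Largest positive correlations first
sort(cors, decreasing = TRUE)
```

Sorting the resulting vector gives the same ordering that `fct_reorder()` applies to the bars in the plot.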

How is Sale Price Distributed

What is the distribution of our target variable, Sale Price?

 ames |> 
  ggplot(aes(x = Sale_Price)) +
  geom_histogram(color = "white", fill = "darkorchid", bins = 30) +
  scale_x_continuous(labels = scales::dollar) +
  labs(x = "Sale Price in Dollars",
       y = "Frequency",
       title = "Sale Price Histogram")

Sale Price is skewed to the right, which implies that inexpensive houses are far more common than expensive ones.

Only a few homes sold for more than $600,000.
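
Right skew can also be checked numerically. The sketch below uses hypothetical prices (illustrative only, not taken from `ames`) and a simple moment-based skewness formula, which is one of several common definitions:

```r
# Hypothetical sale prices in dollars (illustrative only, not taken from ames)
price <- c(100000, 120000, 130000, 150000, 160000, 180000, 600000)

# A mean well above the median signals a long right tail
mean_price <- mean(price)
median_price <- median(price)

# Moment-based sample skewness: positive values indicate right skew
skewness <- mean((price - mean_price)^3) / sd(price)^3

c(mean = mean_price, median = median_price, skewness = skewness)
```

Applying the same comparison to `ames$Sale_Price` would show whether the mean exceeds the median there as well.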

Car Garage Capacity and Sale Price

As garage size (measured in car capacity) increases, so does the price of the house. Here n represents the number of houses in each category. Two-car garages are the most common.

ames |>
  add_count(Garage_Cars) |> 
  mutate(Garage_Cars = fct_reorder(factor(Garage_Cars), Sale_Price)) |> 
  ggplot(aes(x = Garage_Cars, y = Sale_Price, fill = n)) +
  geom_boxplot(color = "grey") +
  scale_fill_viridis_c() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Sale Price Vs Garage Cars",
       subtitle = "The price of the house increases with the increase in Car Garage Capacity",
       y = "House selling Price",
       x = "Size of garage in car capacity",
       caption = "n represents the number of houses per garage capacity")

Garage Area and Sale Price relationship

There is a strong, positive relationship between garage area and sale price: houses with larger garages tend to sell for more.

ames |> 
  ggplot(aes(x = Garage_Area, y = Sale_Price)) +
  geom_point(aes(color = factor(Garage_Cars)), alpha = 0.5) +
  scale_y_continuous(labels = scales::dollar) +
  geom_smooth(method = "lm", se = FALSE, lty = 2, color = "gray", 
              linewidth = 0.8) +
  labs(title = "Garage Area and Sale Price relationship",
       subtitle = "Increase in Garage area increases the sale price",
       x = "Garage Area in Square feet",
       y = "House selling price",
       color = "Garage Car Capacity") 

The above plot also shows there is a strong correlation between garage area and garage car capacity as expected.

Year Built and Sale Price

Most of the houses in the dataset were built in the 2000s.

How many houses were built per decade?

ames |> 
  select(Year_Built, Sale_Price) |> 
  mutate(
    decade = case_when(between(Year_Built, 1870, 1879) ~ "1870s",
                       between(Year_Built, 1880, 1889) ~ "1880s",
                       between(Year_Built, 1890, 1899) ~ "1890s",
                       between(Year_Built, 1900, 1909) ~ "1900s",
                       between(Year_Built, 1910, 1919) ~ "1910s",
                       between(Year_Built, 1920, 1929) ~ "1920s",
                       between(Year_Built, 1930, 1939) ~ "1930s",
                       between(Year_Built, 1940, 1949) ~ "1940s",
                       between(Year_Built, 1950, 1959) ~ "1950s",
                       between(Year_Built, 1960, 1969) ~ "1960s",
                       between(Year_Built, 1970, 1979) ~ "1970s",
                       between(Year_Built, 1980, 1989) ~ "1980s",
                       between(Year_Built, 1990, 1999) ~ "1990s",
                       between(Year_Built, 2000, 2009) ~ "2000s",
                       between(Year_Built, 2010, 2019) ~ "2010s"
                       
                       )
  ) |> 
  add_count(decade) |> 
  ggplot(aes( y = decade)) +
  geom_bar(fill = "darkorchid") +
  labs(y = NULL,
       x = "Number of houses",
       title = "Number of houses per decade",
       subtitle = "Most houses were built in the 2000s") 

Most houses were built in the 2000s.
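
The long `case_when()` mapping is repeated in several code blocks. As a possible simplification, the decade label can be computed arithmetically. This is a sketch, not the method used in this analysis; `decade_label()` is a hypothetical helper name and the years are illustrative:

```r
# Hypothetical helper: label a year with its decade, e.g. 1998 -> "1990s"
decade_label <- function(year) {
  # Integer-divide by 10 to drop the last digit, then scale back up
  paste0(year %/% 10 * 10, "s")
}

# Illustrative construction years (not taken from ames)
years <- c(1872, 1958, 1999, 2000, 2009, 2010)
decades <- decade_label(years)
decades

# Number of houses per decade, analogous to the bar chart above
table(decades)
```

Inside a pipeline this would be used as `mutate(decade = decade_label(Year_Built))`, which yields the same labels as the `case_when()` version for years between 1870 and 2019.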

What is the relationship between sale price and the year the house was built?

ames |> 
  select(Year_Built, Sale_Price) |> 
  mutate(
    decade = case_when(between(Year_Built, 1870, 1879) ~ "1870s",
                       between(Year_Built, 1880, 1889) ~ "1880s",
                       between(Year_Built, 1890, 1899) ~ "1890s",
                       between(Year_Built, 1900, 1909) ~ "1900s",
                       between(Year_Built, 1910, 1919) ~ "1910s",
                       between(Year_Built, 1920, 1929) ~ "1920s",
                       between(Year_Built, 1930, 1939) ~ "1930s",
                       between(Year_Built, 1940, 1949) ~ "1940s",
                       between(Year_Built, 1950, 1959) ~ "1950s",
                       between(Year_Built, 1960, 1969) ~ "1960s",
                       between(Year_Built, 1970, 1979) ~ "1970s",
                       between(Year_Built, 1980, 1989) ~ "1980s",
                       between(Year_Built, 1990, 1999) ~ "1990s",
                       between(Year_Built, 2000, 2009) ~ "2000s",
                       between(Year_Built, 2010, 2019) ~ "2010s"
                       
                       )
  ) |> 
  add_count(decade) |> 
  ggplot(aes(x = Sale_Price, y = decade, fill = n)) +
  geom_boxplot() +
  scale_fill_viridis_b() +
  scale_x_continuous(labels = scales::dollar) +
  labs(y = NULL,
       x = "House selling price",
       title = "Exploring the Correlation Between House Prices and Construction Decades") 

Newer houses generally cost more than older ones, and most houses were built in the 2000s.

sale_price_summary_per_decade <- ames |> 
  select(Year_Built, Sale_Price) |> 
  mutate(
    decade = case_when(between(Year_Built, 1870, 1879) ~ "1870s",
                       between(Year_Built, 1880, 1889) ~ "1880s",
                       between(Year_Built, 1890, 1899) ~ "1890s",
                       between(Year_Built, 1900, 1909) ~ "1900s",
                       between(Year_Built, 1910, 1919) ~ "1910s",
                       between(Year_Built, 1920, 1929) ~ "1920s",
                       between(Year_Built, 1930, 1939) ~ "1930s",
                       between(Year_Built, 1940, 1949) ~ "1940s",
                       between(Year_Built, 1950, 1959) ~ "1950s",
                       between(Year_Built, 1960, 1969) ~ "1960s",
                       between(Year_Built, 1970, 1979) ~ "1970s",
                       between(Year_Built, 1980, 1989) ~ "1980s",
                       between(Year_Built, 1990, 1999) ~ "1990s",
                       between(Year_Built, 2000, 2009) ~ "2000s",
                       between(Year_Built, 2010, 2019) ~ "2010s"
                       
                       )
  ) |> 
  group_by(decade) |> 
  summarise(
    avg_price = round(mean(Sale_Price),2),
    max_price = max(Sale_Price),
    min_price = min(Sale_Price), .groups = "drop"
  ) 

sale_price_summary_per_decade |> knitr::kable()
|decade | avg_price| max_price| min_price|
|:------|---------:|---------:|---------:|
|1870s  |  133666.7|    185000|     94000|
|1880s  |  165497.4|    295000|    100000|
|1890s  |  162675.9|    475000|     50138|
|1900s  |  122217.4|    240000|     44000|
|1910s  |  126310.6|    239000|     37900|
|1920s  |  121855.4|    256000|     12789|
|1930s  |  142260.9|    415000|     52000|
|1940s  |  125016.3|    266500|     35311|
|1950s  |  139892.1|    335000|     13100|
|1960s  |  151393.8|    375000|     62383|
|1970s  |  151884.1|    345000|     71000|
|1980s  |  187142.5|    385000|    112000|
|1990s  |  223266.6|    755000|     93500|
|2000s  |  248388.1|    615000|     84500|
|2010s  |  283116.0|    394432|    187000|

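Houses built from the 1980s onward have the highest average prices. In general, the newer the house, the more expensive it is likely to be, although the few houses from the 1880s and 1890s average more than those built between the 1900s and the 1950s.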

sale_price_summary_per_decade |> 
  pivot_longer(cols = avg_price:min_price, values_to = "price", names_to = "summary") |> 
  ggplot(aes(x = price, y = decade, fill = summary)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c("darkorange","darkorchid","cyan4")) +
  facet_grid(~ summary, scales = "free_x") +
  scale_x_continuous(labels = scales::dollar) +
  labs(
    x = NULL,
    y = NULL,
    title = "House price summary by decade"
  )

ames |> 
  select(Year_Built, Sale_Price) |> 
  mutate(
    decade = case_when(between(Year_Built, 1870, 1879) ~ "1870s",
                       between(Year_Built, 1880, 1889) ~ "1880s",
                       between(Year_Built, 1890, 1899) ~ "1890s",
                       between(Year_Built, 1900, 1909) ~ "1900s",
                       between(Year_Built, 1910, 1919) ~ "1910s",
                       between(Year_Built, 1920, 1929) ~ "1920s",
                       between(Year_Built, 1930, 1939) ~ "1930s",
                       between(Year_Built, 1940, 1949) ~ "1940s",
                       between(Year_Built, 1950, 1959) ~ "1950s",
                       between(Year_Built, 1960, 1969) ~ "1960s",
                       between(Year_Built, 1970, 1979) ~ "1970s",
                       between(Year_Built, 1980, 1989) ~ "1980s",
                       between(Year_Built, 1990, 1999) ~ "1990s",
                       between(Year_Built, 2000, 2009) ~ "2000s",
                       between(Year_Built, 2010, 2019) ~ "2010s"
                       
                       )
  ) |> 
  add_count(decade) |> 
  ggplot(aes(x = Year_Built, y = Sale_Price, color = decade)) +
  geom_point() +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = NULL,
       y = NULL,
       color = "Decade",
       title = "House prices vs Year Built",
       subtitle = "Newly built houses are more expensive than old houses") +
  scale_color_viridis_d() 

Total square feet of basement area

The relationship between price and total basement area

summary(ames$Total_Bsmt_SF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     793     990    1051    1302    6110

We have a positive and strong relationship between basement area and house selling price.
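
A claim like this can be quantified with a correlation coefficient and the slope of a fitted line. The sketch below uses hypothetical values (illustrative only, not taken from `ames`) to show both quantities:

```r
# Hypothetical basement areas (sq ft) and sale prices (USD), illustrative only
toy <- data.frame(
  Total_Bsmt_SF = c(0, 500, 1000, 1500, 2000),
  Sale_Price    = c(90000, 140000, 175000, 230000, 280000)
)

# Pearson correlation: values near +1 indicate a strong positive linear relationship
r <- cor(toy$Total_Bsmt_SF, toy$Sale_Price)

# Least-squares line, the same kind of fit that geom_smooth(method = "lm") draws
fit <- lm(Sale_Price ~ Total_Bsmt_SF, data = toy)
slope <- coef(fit)[["Total_Bsmt_SF"]]  # estimated dollars per extra square foot

c(correlation = r, slope = slope)
```

Running `cor(ames$Total_Bsmt_SF, ames$Sale_Price)` on the real data would give the actual strength of the relationship.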

ames |> 
  ggplot(aes(x = Total_Bsmt_SF, y = Sale_Price)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, lty = 2, color = "gray") +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal() +
  labs(x = "Total basement area in square feet",
       y = "House selling price",
       title = "Total Basement area vs House selling price",
       subtitle = "There is a strong and positive correlation between basement area and house selling price")

A few houses have more than 4,000 square feet of basement area but cost less than $250,000. Most houses in our dataset have a basement, and a typical basement area is about 1,000 square feet (the median is 990).

ames |> 
  ggplot(aes(x = Total_Bsmt_SF)) +
  geom_histogram(bins = round(sqrt(nrow(ames))), color = "white", fill = "darkorchid") +
  labs(
    x = "Basement area in Square Feet",
    y = "Frequency",
    title = "Basement area in square feet histogram"
  ) +
  theme_minimal()

Basement area is skewed to the right: houses with small basements are more common than houses with large ones.

Sale Price and whether the house was remodeled

Year_Remod_Add records the remodel date, which is the same as the construction date if there was no remodeling or additions.

About a third of the houses built in the 1950s (112 of 340) were remodeled.

ames |>
  # To check if the house was remodeled
  mutate(
    remodeled = if_else(Year_Built < Year_Remod_Add, "Remodeled", "Not Remodeled")
  ) |> 
  ggplot(aes(x = Year_Remod_Add, y = Sale_Price, color = Overall_Cond)) +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  scale_y_continuous(labels = scales::dollar) +
  labs(color = "House Condition",
       y = NULL,
       x = "Year Modified") +
  facet_wrap(~ remodeled)

The plot does not show any significant difference between the price of remodeled and not remodeled houses.

Since we don’t have the price of the houses before remodeling, we can only assume that the price of remodeled house increased after remodeling.

Among remodeled houses, more are in Above Average, Good, Very Good, or Excellent condition than among houses that were not remodeled. We do not have the condition of the houses before remodeling, so we cannot confirm it, but this suggests that remodeling may have improved house condition, which in turn may have had a positive effect on price.

All houses built from the 1870s through the 1940s are recorded as remodeled.

ames |>
  # To check if the house was remodeled
  mutate(
    remodeled = if_else(Year_Built < Year_Remod_Add, "Remodeled", "Not Remodeled")
  ) |>
  group_by(remodeled) |> 
  summarise(avg_price = mean(Sale_Price),
            min_price = min(Sale_Price),
            max_price = max(Sale_Price)) |> knitr::kable()
|remodeled     | avg_price| min_price| max_price|
|:-------------|---------:|---------:|---------:|
|Not Remodeled |  184324.7|     13100|    745000|
|Remodeled     |  176722.6|     12789|    755000|

ames |> 
    mutate(
        decade = case_when(between(Year_Built, 1870, 1879) ~ "1870s",
                           between(Year_Built, 1880, 1889) ~ "1880s",
                           between(Year_Built, 1890, 1899) ~ "1890s",
                           between(Year_Built, 1900, 1909) ~ "1900s",
                           between(Year_Built, 1910, 1919) ~ "1910s",
                           between(Year_Built, 1920, 1929) ~ "1920s",
                           between(Year_Built, 1930, 1939) ~ "1930s",
                           between(Year_Built, 1940, 1949) ~ "1940s",
                           between(Year_Built, 1950, 1959) ~ "1950s",
                           between(Year_Built, 1960, 1969) ~ "1960s",
                           between(Year_Built, 1970, 1979) ~ "1970s",
                           between(Year_Built, 1980, 1989) ~ "1980s",
                           between(Year_Built, 1990, 1999) ~ "1990s",
                           between(Year_Built, 2000, 2009) ~ "2000s",
                           between(Year_Built, 2010, 2019) ~ "2010s"
                           
        ),
        remodeled = if_else(Year_Built < Year_Remod_Add, "Remodeled", "Not Remodeled")
    ) |> 
    group_by(decade) |> 
    count(remodeled) |> 
    pivot_wider(names_from = remodeled, values_from = n) |> knitr::kable()
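|decade | Remodeled| Not Remodeled|
|:------|---------:|-------------:|
|1870s  |         3|            NA|
|1880s  |         8|            NA|
|1890s  |        15|            NA|
|1900s  |        40|            NA|
|1910s  |       110|            NA|
|1920s  |       196|            NA|
|1930s  |       109|            NA|
|1940s  |       151|            NA|
|1950s  |       112|           228|
|1960s  |        85|           272|
|1970s  |        65|           299|
|1980s  |        31|            89|
|1990s  |       160|           174|
|2000s  |       275|           505|
|2010s  |        NA|             3|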
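
The NA entries appear because `pivot_wider()` leaves a cell empty when a decade has no houses in that category; passing `values_fill = 0` is one way to show zeros instead. As an alternative sketch, a base R cross-tabulation reports zero counts directly. The data below are hypothetical and only illustrate the mechanics:

```r
# Hypothetical construction and remodel years (illustrative only, not from ames)
toy <- data.frame(
  Year_Built     = c(1920, 1935, 1955, 1955, 2005, 2005),
  Year_Remod_Add = c(1950, 1950, 1955, 1990, 2005, 2006)
)

# Decade label computed arithmetically, e.g. 1955 -> "1950s"
toy$decade <- paste0(toy$Year_Built %/% 10 * 10, "s")

# Same rule as in the analysis: remodeled if the remodel year is after the build year
toy$remodeled <- ifelse(toy$Year_Built < toy$Year_Remod_Add,
                        "Remodeled", "Not Remodeled")

# Cross-tabulate; combinations with no houses show 0 rather than NA
tab <- table(toy$decade, toy$remodeled)
tab
```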

Sale Price and Masonry veneer

The majority of houses lack a masonry veneer, while those with stone masonry veneer stand out as the most expensive among the houses.

ames |>
  add_count(Mas_Vnr_Type) |>
  ggplot(aes(x = fct_reorder(Mas_Vnr_Type, Sale_Price), y = Sale_Price, fill = n)) +
  geom_boxplot() +
  scale_fill_viridis_b() +
  scale_y_continuous(labels = scales::dollar)  +
  labs(x = "Masonry Veneer Type",
       y = "House Selling Price (USD)",
       title = "Sale Price by Masonry Veneer Type",
       subtitle = "Houses with Stone Veneer Are Priced Higher",
       caption = "n indicates the number of houses in each category")

ames |> 
  ggplot(aes(x = Mas_Vnr_Area, y = Sale_Price)) +
  geom_point(aes(color = Mas_Vnr_Type)) +
  geom_smooth(method = "lm", se = FALSE, lty = 2, color = "gray") 

Ground living area and sale price

There is a strong, positive relationship between above-ground living area and house selling price. Houses with a large above-ground living area are more expensive than houses with a smaller one.

ames |> 
  ggplot(aes(x = Gr_Liv_Area, y = Sale_Price)) +
  geom_point(color = "darkorchid") +
  geom_smooth(method = "lm", se = FALSE, color = "gray", lty = 2) +
  scale_y_continuous(labels = scales::dollar) +
  scale_x_continuous(labels = scales::number) +
  labs(
    x = "Above grade (ground) living area square feet",
    y = "House selling price in USD",
    title = "Relationship between ground living area and sale price"
  )

Pool Area and Sale Price

Only 13 houses in our dataset have a pool. We compare the prices of houses with and without a pool to see whether there is a price difference.

Although the two groups are highly imbalanced, houses with a pool are more expensive on average than those without.

However, several houses without a pool are very expensive outliers. This might be due to other luxurious features.
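
With groups this imbalanced and outliers present, it can help to report group sizes and medians next to any average. A minimal sketch with hypothetical values (illustrative only, not taken from `ames`):

```r
# Hypothetical pool areas (sq ft) and sale prices (USD), illustrative only
toy <- data.frame(
  Pool_Area  = c(0, 0, 0, 0, 0, 512, 0, 648),
  Sale_Price = c(120000, 150000, 180000, 200000, 600000, 250000, 160000, 300000)
)

# Same rule as in the analysis: a positive pool area means the house has a pool
toy$has_pool <- ifelse(toy$Pool_Area > 0, "Yes", "No")

# Group sizes make the imbalance explicit
counts <- table(toy$has_pool)

# Medians are less sensitive than means to a few very expensive houses
medians <- tapply(toy$Sale_Price, toy$has_pool, median)

counts
medians
```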

ames |> 
  mutate(Pool_Area = ifelse(Pool_Area > 0, "Yes", "No")) |> 
  add_count(Pool_Area) |> 
  ggplot(aes(x = Pool_Area, y = Sale_Price, fill = n)) +
  geom_boxplot() +
  scale_y_continuous(labels = scales::dollar) +
  scale_fill_viridis_b() +
  labs(x = "Pool Area",
       y = "House selling price in USD",
       fill = "No. of Houses",
       title = "House has a pool? Vs Sale Price")