```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```

Summary

Dataset Description: The dataset used for this project is the Ames Housing dataset, which contains detailed information on 2930 homes in Ames, Iowa. The dataset has 82 columns representing various aspects such as lot size, building features, and sale prices.

The dataset is available on GitHub, and detailed documentation is available at the links below:

https://github.com/leontoddjohnson/datasets/blob/main/data/ames.csv

https://jse.amstat.org/v19n3/decock/DataDocumentation.txt

Main Question/Goal:

The primary goal of this project is to analyze the relationships between key numeric variables, such as Sale Price, Lot Area, and Above Ground Living Area. This includes understanding the influence of house features on sale prices and identifying patterns, such as the impact of price per square foot on overall value.
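As a rough illustration of the price-per-square-foot idea, the sketch below derives such a column in base R. The data frame `toy` holds made-up values, not actual Ames rows, and the column name `Price.Per.SqFt` is a hypothetical choice; dividing by `Gr.Liv.Area` rather than another area measure is also an assumption.

```r
# Toy data frame with made-up values (NOT actual Ames rows); column names
# follow the read.csv() naming used in this report.
toy <- data.frame(
  SalePrice   = c(200000, 150000, 320000),
  Gr.Liv.Area = c(1600, 1000, 2560)
)

# Hypothetical derived column: sale price per square foot of living area
toy$Price.Per.SqFt <- toy$SalePrice / toy$Gr.Liv.Area

print(toy)
```

The same expression could be applied to the full dataset once it is loaded, for example inside a `mutate()` call.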

Visualizations:

  • SalePrice vs Lot Area:

    # Load necessary libraries
    library(ggplot2)
    library(dplyr)
    # Load the dataset
    ames <- read.csv('D:/Stats for DS/ames.csv', header = TRUE)
    
    # Selecting relevant columns for analysis
    variables <- ames %>%
      select(SalePrice, Lot.Area, Gr.Liv.Area) %>%
      na.omit()
    
    # Plot 1: SalePrice vs Lot Area
    ggplot(variables, aes(x = Lot.Area, y = SalePrice)) +
      geom_point(color = "blue", alpha = 0.6) +
      labs(title = "Sale Price vs Lot Area", 
           x = "Lot Area (sq ft)", 
           y = "Sale Price ($)") +
      theme_minimal()

  1. The scatter plot shows a weak positive relationship between sale price and lot area (r ≈ 0.27 in the correlation matrix), with some outliers.

  2. This relationship is worth investigating further to see whether certain lot sizes are associated with disproportionately high sale prices.
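One possible way to follow up on the lot-size question is to group lots into size bins and compare the median sale price per bin. The sketch below uses made-up values rather than Ames rows, and the bin edges and labels are hypothetical; quantile-based edges may suit the real data better.

```r
# Made-up values for illustration only (not actual Ames rows)
toy <- data.frame(
  Lot.Area  = c(5000, 7000, 9000, 11000, 15000, 60000),
  SalePrice = c(120000, 140000, 160000, 180000, 210000, 230000)
)

# Hypothetical bin edges; intervals are closed on the right, e.g. (0, 8000]
toy$Lot.Bin <- cut(toy$Lot.Area,
                   breaks = c(0, 8000, 12000, Inf),
                   labels = c("small", "medium", "large"))

# Median sale price within each lot-size bin
median_by_bin <- aggregate(SalePrice ~ Lot.Bin, data = toy, FUN = median)
print(median_by_bin)
```

If the median price jumps sharply between two adjacent bins, that bin boundary is a candidate for closer inspection.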

  • SalePrice vs Above Ground Living Area:

    ggplot(variables, aes(x = Gr.Liv.Area, y = SalePrice)) +
      geom_point(color = "green", alpha = 0.6) +
      labs(title = "Sale Price vs Above Ground Living Area", 
           x = "Above Ground Living Area (sq ft)", 
           y = "Sale Price ($)") +
      theme_minimal()

  1. A much stronger positive correlation is evident between sale price and above-ground living area (r ≈ 0.71 in the correlation matrix).

  2. Investigating this further will help determine whether larger homes provide better value for buyers, and whether there are diminishing returns as living area increases.
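A simple check for diminishing returns is to compare a straight-line fit with one that adds a squared living-area term. The sketch below is only a demonstration of the mechanics: the data are synthetic and generated with a concave shape on purpose, so the result says nothing about the Ames data itself.

```r
# Synthetic data generated with a built-in concave relationship, purely to
# show the mechanics; it is not derived from the Ames data.
set.seed(1)
area  <- seq(600, 4000, length.out = 50)
price <- 20000 + 150 * area - 0.015 * area^2 + rnorm(50, sd = 5000)
toy   <- data.frame(Gr.Liv.Area = area, SalePrice = price)

linear_fit    <- lm(SalePrice ~ Gr.Liv.Area, data = toy)
quadratic_fit <- lm(SalePrice ~ Gr.Liv.Area + I(Gr.Liv.Area^2), data = toy)

# A negative coefficient on the squared term is consistent with diminishing returns
print(coef(quadratic_fit))

# F-test comparing the two nested models
print(anova(linear_fit, quadratic_fit))
```

On the real data, a squared-term coefficient near zero or a non-significant F-test would suggest that a straight line describes the relationship adequately.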

    # Calculate the correlation matrix
    correlation_matrix <- cor(variables)
    
    # Print the correlation matrix
    print(correlation_matrix)
    ##             SalePrice  Lot.Area Gr.Liv.Area
    ## SalePrice   1.0000000 0.2665492   0.7067799
    ## Lot.Area    0.2665492 1.0000000   0.2855992
    ## Gr.Liv.Area 0.7067799 0.2855992   1.0000000
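The matrix above uses `cor()`'s default Pearson method, which can be pulled around by a few extreme points such as unusually large lots. As a hedged sketch, the example below compares Pearson with the rank-based Spearman method on made-up values (not Ames rows) where one lot is deliberately extreme.

```r
# Made-up values for illustration only (not actual Ames rows);
# the last lot is deliberately extreme
toy <- data.frame(
  SalePrice = c(120000, 140000, 160000, 180000, 200000, 150000),
  Lot.Area  = c(5000, 6000, 7000, 8000, 9000, 200000)
)

pearson  <- cor(toy$SalePrice, toy$Lot.Area, method = "pearson")
spearman <- cor(toy$SalePrice, toy$Lot.Area, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

In this toy case the single extreme lot drags the Pearson value below the Spearman value. Running both methods on the real `variables` data frame would show whether the outliers noted in the scatter plot have a similar effect.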

Plan Moving Forward:

Initial Findings

Hypothesis 1:

Hypothesis 2:

To-do List: