Loading packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading dataset

ames = read_csv("ames (1).csv")
## Rows: 2930 Columns: 72
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (35): MS.Zoning, Street, Lot.Shape, Land.Contour, Utilities, Lot.Config,...
## dbl (37): Order, PID, area, price, MS.SubClass, Lot.Area, Overall.Qual, Over...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(ames)
## Rows: 2,930
## Columns: 72
## $ Order           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ PID             <dbl> 526301100, 526350040, 526351010, 526353030, 527105010,…
## $ area            <dbl> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 1…
## $ price           <dbl> 215000, 105000, 172000, 244000, 189900, 195500, 213500…
## $ MS.SubClass     <dbl> 20, 20, 20, 20, 60, 60, 120, 120, 120, 60, 60, 20, 60,…
## $ MS.Zoning       <chr> "RL", "RH", "RL", "RL", "RL", "RL", "RL", "RL", "RL", …
## $ Lot.Area        <dbl> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005, 5…
## $ Street          <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave"…
## $ Lot.Shape       <chr> "IR1", "Reg", "IR1", "Reg", "IR1", "IR1", "Reg", "IR1"…
## $ Land.Contour    <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "HLS"…
## $ Utilities       <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "All…
## $ Lot.Config      <chr> "Corner", "Inside", "Corner", "Corner", "Inside", "Ins…
## $ Land.Slope      <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl"…
## $ Neighborhood    <chr> "NAmes", "NAmes", "NAmes", "NAmes", "Gilbert", "Gilber…
## $ Condition.1     <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm…
## $ Condition.2     <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm"…
## $ Bldg.Type       <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "Twnhs…
## $ House.Style     <chr> "1Story", "1Story", "1Story", "1Story", "2Story", "2St…
## $ Overall.Qual    <dbl> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8, 8, 9, …
## $ Overall.Cond    <dbl> 5, 6, 6, 5, 5, 6, 5, 5, 5, 5, 5, 7, 5, 5, 5, 5, 7, 2, …
## $ Year.Built      <dbl> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995, …
## $ Year.Remod.Add  <dbl> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 1996, …
## $ Roof.Style      <chr> "Hip", "Gable", "Hip", "Hip", "Gable", "Gable", "Gable…
## $ Roof.Matl       <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg",…
## $ Exterior.1st    <chr> "BrkFace", "VinylSd", "Wd Sdng", "BrkFace", "VinylSd",…
## $ Exterior.2nd    <chr> "Plywood", "VinylSd", "Wd Sdng", "BrkFace", "VinylSd",…
## $ Mas.Vnr.Type    <chr> "Stone", "None", "BrkFace", "None", "None", "BrkFace",…
## $ Mas.Vnr.Area    <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 603,…
## $ Exter.Qual      <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "Gd", "Gd", "Gd", …
## $ Exter.Cond      <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
## $ Foundation      <chr> "CBlock", "CBlock", "CBlock", "CBlock", "PConc", "PCon…
## $ Bsmt.Qual       <chr> "TA", "TA", "TA", "TA", "Gd", "TA", "Gd", "Gd", "Gd", …
## $ Bsmt.Cond       <chr> "Gd", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
## $ Bsmt.Exposure   <chr> "Gd", "No", "No", "No", "No", "No", "Mn", "No", "No", …
## $ BsmtFin.Type.1  <chr> "BLQ", "Rec", "ALQ", "ALQ", "GLQ", "GLQ", "GLQ", "ALQ"…
## $ BsmtFin.SF.1    <dbl> 639, 468, 923, 1065, 791, 602, 616, 263, 1180, 0, 0, 9…
## $ BsmtFin.Type.2  <chr> "Unf", "LwQ", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf"…
## $ BsmtFin.SF.2    <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0, 0…
## $ Bsmt.Unf.SF     <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994, 76…
## $ Total.Bsmt.SF   <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, 994…
## $ Heating         <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA"…
## $ Heating.QC      <chr> "Fa", "TA", "TA", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", …
## $ Central.Air     <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
## $ Electrical      <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", …
## $ X1st.Flr.SF     <dbl> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, 102…
## $ X2nd.Flr.SF     <dbl> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0, 0,…
## $ Low.Qual.Fin.SF <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Bsmt.Full.Bath  <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, …
## $ Bsmt.Half.Bath  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Full.Bath       <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, 1, …
## $ Half.Bath       <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, …
## $ Bedroom.AbvGr   <dbl> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, 1, …
## $ Kitchen.AbvGr   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ Kitchen.Qual    <chr> "TA", "TA", "Gd", "Ex", "TA", "Gd", "Gd", "Gd", "Gd", …
## $ TotRms.AbvGrd   <dbl> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8, 8,…
## $ Functional      <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ"…
## $ Fireplaces      <dbl> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, …
## $ Garage.Type     <chr> "Attchd", "Attchd", "Attchd", "Attchd", "Attchd", "Att…
## $ Garage.Cars     <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, …
## $ Garage.Area     <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 440,…
## $ Paved.Drive     <chr> "P", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y",…
## $ Wood.Deck.SF    <dbl> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 483, …
## $ Open.Porch.SF   <dbl> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0, 5…
## $ Enclosed.Porch  <dbl> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ X3Ssn.Porch     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Screen.Porch    <dbl> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, 210…
## $ Pool.Area       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Misc.Val        <dbl> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, 0, …
## $ Mo.Sold         <dbl> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, 6, …
## $ Yr.Sold         <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, …
## $ Sale.Type       <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", …
## $ Sale.Condition  <chr> "Normal", "Normal", "Normal", "Normal", "Normal", "Nor…
  1. (9 points) Load and explore the dataset assigned to you for your project to answer the following questions. You can use the glimpse function from the tidyverse package to peek into the dataset and find answers to these questions.
  1. (3 points) How many cases (instances/rows) are in your dataset?

There are 2,930 rows in my dataset

  1. (3 points) How many variables (attributes/columns) are in your dataset?

There are 72 variables in my dataset

  1. (3 points) Does your dataset contain missing values? Which variables contain missing values?
anyNA(ames)
## [1] TRUE

Yes, my dataset contains missing values.

colSums(is.na(ames))
##           Order             PID            area           price     MS.SubClass 
##               0               0               0               0               0 
##       MS.Zoning        Lot.Area          Street       Lot.Shape    Land.Contour 
##               0               0               0               0               0 
##       Utilities      Lot.Config      Land.Slope    Neighborhood     Condition.1 
##               0               0               0               0               0 
##     Condition.2       Bldg.Type     House.Style    Overall.Qual    Overall.Cond 
##               0               0               0               0               0 
##      Year.Built  Year.Remod.Add      Roof.Style       Roof.Matl    Exterior.1st 
##               0               0               0               0               0 
##    Exterior.2nd    Mas.Vnr.Type    Mas.Vnr.Area      Exter.Qual      Exter.Cond 
##               0              23              23               0               0 
##      Foundation       Bsmt.Qual       Bsmt.Cond   Bsmt.Exposure  BsmtFin.Type.1 
##               0              80              80              83              80 
##    BsmtFin.SF.1  BsmtFin.Type.2    BsmtFin.SF.2     Bsmt.Unf.SF   Total.Bsmt.SF 
##               1              81               1               1               1 
##         Heating      Heating.QC     Central.Air      Electrical     X1st.Flr.SF 
##               0               0               0               1               0 
##     X2nd.Flr.SF Low.Qual.Fin.SF  Bsmt.Full.Bath  Bsmt.Half.Bath       Full.Bath 
##               0               0               2               2               0 
##       Half.Bath   Bedroom.AbvGr   Kitchen.AbvGr    Kitchen.Qual   TotRms.AbvGrd 
##               0               0               0               0               0 
##      Functional      Fireplaces     Garage.Type     Garage.Cars     Garage.Area 
##               0               0             157               1               1 
##     Paved.Drive    Wood.Deck.SF   Open.Porch.SF  Enclosed.Porch     X3Ssn.Porch 
##               0               0               0               0               0 
##    Screen.Porch       Pool.Area        Misc.Val         Mo.Sold         Yr.Sold 
##               0               0               0               0               0 
##       Sale.Type  Sale.Condition 
##               0               0

The variables with missing values are Mas.Vnr.Type, Mas.Vnr.Area, Bsmt.Qual, Bsmt.Cond, Bsmt.Exposure, BsmtFin.Type.1, BsmtFin.SF.1, BsmtFin.Type.2, BsmtFin.SF.2, Bsmt.Unf.SF, Total.Bsmt.SF, Electrical, Bsmt.Full.Bath, Bsmt.Half.Bath, Garage.Type, Garage.Cars, and Garage.Area.
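Reading the variables off the full `colSums()` printout is error-prone, so a filtered version can help. The sketch below is only an illustration: it uses a small made-up data frame called `toy` (not the real `ames` data) and base R only, and keeps just the columns whose missing-value count is greater than zero.

```r
# Toy data frame standing in for `ames` (values are made up for illustration).
toy <- data.frame(
  price       = c(215000, 105000, 172000, 244000),
  Garage.Type = c("Attchd", NA, "Detchd", NA),
  Electrical  = c("SBrkr", "SBrkr", NA, "SBrkr"),
  stringsAsFactors = FALSE
)

# Count missing values per column, then keep only columns with at least one NA.
na_counts <- colSums(is.na(toy))
cols_with_na <- na_counts[na_counts > 0]

# Sort so the columns with the most missing values come first.
cols_with_na <- sort(cols_with_na, decreasing = TRUE)
print(cols_with_na)
```

Replacing `toy` with `ames` should give the same list as above without scanning the zeros by eye.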

  1. (9 points) Identify one research question that you plan to answer using your dataset.
  1. (3 points) Write your research question and the corresponding hypothesis to be tested.

RQ1: Does the presence of Mas.Vnr.Type, Mas.Vnr.Area, Bsmt.Cond, BsmtFin.SF.1, Bsmt.Exposure, BsmtFin.Type.1, Bsmt.Qual, BsmtFin.Type.2, BsmtFin.SF.2, Bsmt.Unf.SF, Total.Bsmt.SF, Bsmt.Full.Bath, Bsmt.Half.Bath, Garage.Type, Garage.Cars, Garage.Area affect the price of the house?

H0: Mas.Vnr.Type, Mas.Vnr.Area, Bsmt.Cond, BsmtFin.SF.1, Bsmt.Exposure, BsmtFin.Type.1, Bsmt.Qual, BsmtFin.Type.2, BsmtFin.SF.2, Bsmt.Unf.SF, Total.Bsmt.SF, Bsmt.Full.Bath, Bsmt.Half.Bath, Garage.Type, Garage.Cars, Garage.Area are not associated with the price of the house.

H1: At least one of Mas.Vnr.Type, Mas.Vnr.Area, Bsmt.Cond, BsmtFin.SF.1, Bsmt.Exposure, BsmtFin.Type.1, Bsmt.Qual, BsmtFin.Type.2, BsmtFin.SF.2, Bsmt.Unf.SF, Total.Bsmt.SF, Bsmt.Full.Bath, Bsmt.Half.Bath, Garage.Type, Garage.Cars, Garage.Area is associated with the price of the house.

  1. (3 points) Identify and list the relevant variables that will be used in the analysis.

The relevant variable I will be using is foundation.

  1. (3 points) Identify the main response (also known as dependent) variable in the data.

My main response (target) variable is the price of a house (price).

  1. (12 points) Create at least three graphs displaying the distributions of three different variables including the response variable identified in 2(c) and other relevant variables identified in 2(b). For example, if the variable is categorical, report a bar chart, while for quantitative variables, report histogram, dotplot or boxplot. Note: You need to make a graph for each variable. You shouldn’t use the same variable for different graphs.

Association plots

ames %>%
  ggplot(aes(x = price)) +
  geom_histogram(color = "black", fill = "pink") +
  labs(x = "Price of a House", y = "Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ames %>%
  ggplot(aes(x = price)) +
  geom_density(color = "black", fill = "purple") +
  labs(x = "Price ($)", y = "Density")

### Categorical (independent variable) vs Numeric (response)
ames %>%
  ggplot(aes(x = Garage.Type, y = price, fill = Garage.Type)) +
  geom_boxplot() +
  labs(x = "Garage Type", y = "Price ($)")

1. (6 points) Compute and report summary statistics (e.g., mean, standard deviation, and five number summary) for summarizing the distribution of the response variable identified in Part I 2(c).

ames %>%
  filter(!is.na(price))%>%
  summarise(Mean = mean(price),
            SD = sd(price),
            Min = min(price),
            Q1 = quantile(price, 0.25),
            Median = median(price), 
            Q3 = quantile(price, 0.75), 
            Max = max(price))

The average house price is about $180,796, with a minimum of $12,789 and a maximum of $755,000.

  1. Construct confidence interval for estimating the population mean of the main response variable (e.g.,annual income, house price, body weight, etc.).
  1. (4 points) Check the assumptions needed in order to perform the procedure of constructing a confidence interval for the population mean of response variable. For example, make a histogram for the response variable and check if it is approximately normally distributed. Also report your sample size to show that the central limit theorem approximation is valid.
ames %>%
  filter(!is.na(price))%>%
  ggplot(aes(x=price))+geom_histogram(color = "black", fill = "pink")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

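The sample contains 2,930 houses, far more than the usual n ≥ 30 guideline. Even if the histogram of price is not perfectly normal, the Central Limit Theorem approximation for the sample mean is therefore reasonable.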
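As a rough check on the Central Limit Theorem argument, the sketch below simulates repeated samples of size 2,930 from a right-skewed population. Everything in it is an assumption made for illustration: the log-normal population, its parameters, the seed, and the number of repetitions are made up and are not taken from the Ames data.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Hypothetical right-skewed "population" (log-normal), not the Ames data.
population <- rlnorm(100000, meanlog = 12, sdlog = 0.4)

n <- 2930        # sample size reported above
n_reps <- 2000   # number of repeated samples (an assumption)

# Draw many samples of size n and record each sample mean.
sample_means <- replicate(n_reps, mean(sample(population, n)))

# CLT: the sample means should centre on the population mean, with a
# spread close to sigma / sqrt(n).
check <- c(
  mean_of_means   = mean(sample_means),
  population_mean = mean(population),
  sd_of_means     = sd(sample_means),
  theoretical_se  = sd(population) / sqrt(n)
)
print(check)
```

A histogram of `sample_means` (for example with `hist(sample_means)`) should look roughly bell-shaped even though the population itself is skewed.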

  1. (5 points) Compute and report the 95% confidence interval for the population mean of the main response variable. Use R function t_test() to compute the confidence interval. Show your code and output.
library(infer)
ames %>%
  filter(!is.na(price)) %>%
  t_test(
    response = price,
    mu = 180796.1,      # hypothesized mean; affects only the test, not the interval
    conf_int = TRUE,
    conf_level = 0.95
  )
  1. (2 points) Interpret the confidence interval obtained in the context of your dataset.

I am 95% confident that the true population mean price of houses is between $177,902.3 and $183,689.9.

The value supplied as mu in the test, $180,796.1, is the sample mean itself, so it necessarily falls inside the confidence interval. The test therefore gives no evidence against that value, and we fail to reject the null hypothesis that the population mean price equals $180,796.1.
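To show where an interval of this kind comes from, the sketch below computes a 95% t-interval by hand as mean ± t* × s/√n and compares it with base R's `t.test()`. It uses only the first seven prices visible in the `glimpse()` output, so the numbers it prints are illustrative and will not match the interval for the full dataset.

```r
# First seven prices shown in the glimpse() output (illustration only).
price <- c(215000, 105000, 172000, 244000, 189900, 195500, 213500)

n     <- length(price)
xbar  <- mean(price)             # sample mean
se    <- sd(price) / sqrt(n)     # standard error of the mean
tstar <- qt(0.975, df = n - 1)   # critical value for 95% confidence

# Interval built by hand: mean +/- t* x standard error.
manual_ci <- c(lower = xbar - tstar * se, upper = xbar + tstar * se)

# Base R's t.test() reports the same interval.
builtin_ci <- t.test(price, conf.level = 0.95)$conf.int

print(manual_ci)
print(builtin_ci)
```

The interval is always centred on the sample mean, which is why testing against the sample mean itself can never lead to a rejection.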

  1. (6 points) Compute summary statistics (mean and standard deviation) to summarize the response variable in your data by groups defined by the levels of one categorical variable from your dataset. For example, you can compare the mean body weight for men versus women. Comment on the patterns seen from those summaries.
ames %>%
  group_by(House.Style) %>%
  summarise(Mean = mean(price),
            SD = sd(price))
  1. (7 points) Construct confidence interval for estimating the difference in the population mean of the main response variable across the levels of one categorical explanatory variable (e.g., the difference in mean body weight between men versus women).
  1. (5 points) Compute and report the 90% confidence interval for the difference in population means. Use R function t_test() to compute the confidence interval. Show your code and output.
library(infer)
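ames %>%
  filter(!is.na(price)) %>%
  # keep only the two house styles being compared
  filter(House.Style %in% c("1.5Fin", "1.5Unf")) %>%
  t_test(
    response = price,
    explanatory = House.Style,
    order = c("1.5Fin", "1.5Unf"),  # difference = mean(1.5Fin) - mean(1.5Unf)
    mu = 0,                         # hypothesized difference in means
    conf_int = TRUE,
    conf_level = 0.90
  )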
  1. (2 points) Interpret the confidence interval obtained in the context of your dataset.

I am 90% confident that the mean price of 1.5Fin houses exceeds the mean price of 1.5Unf houses by between $18,679 and $37,054.53. Because this interval does not contain zero, the difference in mean price between these two house styles is statistically significant at the 10% level.
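The same kind of interval can be obtained in base R with `t.test()`, which performs Welch's two-sample t-test by default. The sketch below is an illustration on made-up prices for the two house styles (the data frame `toy` and its values are hypothetical) and shows how to check whether zero lies inside the 90% interval.

```r
# Hypothetical prices for two house styles (made-up numbers).
toy <- data.frame(
  price = c(140000, 125000, 150000, 133000, 160000,
            105000, 112000,  98000, 120000, 110000),
  House.Style = rep(c("1.5Fin", "1.5Unf"), each = 5)
)

# Welch two-sample t interval for mean(1.5Fin) - mean(1.5Unf) at 90% confidence.
# Groups are ordered alphabetically, so 1.5Fin comes first.
res <- t.test(price ~ House.Style, data = toy, conf.level = 0.90)
ci  <- res$conf.int

# The difference is significant at the 10% level when zero is outside the interval.
excludes_zero <- ci[1] > 0 || ci[2] < 0

print(ci)
print(excludes_zero)
```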

  1. (25 points) Use correlation and linear regression to describe and model the associations between the outcome (response) variable that you identified in Part I of the project and the explanatory (predictor) variables you identified in the same part of project. Make sure to address the following:
  1. (6 points) Identify all numerical (quantitative) variables in the dataset. Then construct a Scatter Plot to show the relationship between the outcome (response) variable (e.g., house price, body weight, income, etc.) and each explanatory variable. First use select() to select only the outcome variable and the numerical explanatory variables from the dataset. Then use the function ggpairs() from package GGally (you need to install this package first) to create the scatter plots and compute corresponding correlation coefficients. Describe the associations between the pairs of variables based on the scatter plots. Describe the associations between the pairs of variables based on the correlation values. Which pair has the strongest linear association compared to the others?
glimpse(ames)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ames %>%
  dplyr::select(price, Fireplaces, Garage.Cars, Garage.Area, Enclosed.Porch, Overall.Cond, Overall.Qual) %>%
  ggpairs()
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density()`).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_density()`).
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

There is a strong, statistically significant positive correlation between Garage.Area and price.
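To identify the pair with the strongest linear association numerically rather than by eye, the correlations with price can be ranked. The sketch below is a minimal illustration using only the first seven rows visible in the `glimpse()` output for four of the columns, so its result says nothing reliable about the full dataset. The object names `toy`, `cors`, and `strongest` are made up for this example.

```r
# First seven rows of four columns, copied from the glimpse() output.
toy <- data.frame(
  price        = c(215000, 105000, 172000, 244000, 189900, 195500, 213500),
  Overall.Qual = c(6, 5, 6, 7, 5, 6, 8),
  Garage.Area  = c(528, 730, 312, 522, 482, 470, 582),
  Fireplaces   = c(2, 0, 0, 2, 1, 1, 0)
)

# Pearson correlation of every column with price;
# use = "complete.obs" drops rows that contain missing values.
cors <- cor(toy, use = "complete.obs")["price", ]
cors <- cors[names(cors) != "price"]

# Predictor with the largest absolute correlation with price.
strongest <- names(which.max(abs(cors)))

print(round(cors, 3))
print(strongest)
```

Running the same steps on the full selection passed to `ggpairs()` would give the ranking for all six predictors.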

  1. (8 points) Develop a multiple linear regression model to predict the outcome (response) variable using all the relevant explanatory variables. Write the model equation. For example, Car Price = β0 + β1 × City MPG + β2 × Horsepower + ... + βp × Engine Size. Use the lm() function to estimate the model using your dataset and report the summary of the model using the function summary().
full.model = lm(price~ Fireplaces + Garage.Cars + Garage.Area + Enclosed.Porch + Overall.Cond + Overall.Qual, data = ames)
summary(full.model)
## 
## Call:
## lm(formula = price ~ Fireplaces + Garage.Cars + Garage.Area + 
##     Enclosed.Porch + Overall.Cond + Overall.Qual, data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -276134  -26062   -2473   19403  381278 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -79252.791   5621.943 -14.097   <2e-16 ***
## Fireplaces      20594.247   1334.240  15.435   <2e-16 ***
## Garage.Cars      5485.027   2370.718   2.314   0.0208 *  
## Garage.Area        81.040      8.053  10.063   <2e-16 ***
## Enclosed.Porch    -22.599     12.491  -1.809   0.0705 .  
## Overall.Cond      186.610    721.622   0.259   0.7960    
## Overall.Qual    32678.606    728.625  44.850   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42610 on 2922 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7161, Adjusted R-squared:  0.7155 
## F-statistic:  1229 on 6 and 2922 DF,  p-value: < 2.2e-16
  1. (6 points) Write the estimated regression equation.

\(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_p X_p\)

Predicted price = -79252.791 + 20594.247 × Fireplaces + 5485.027 × Garage.Cars + 81.040 × Garage.Area - 22.599 × Enclosed.Porch + 186.610 × Overall.Cond + 32678.606 × Overall.Qual
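As a worked example of using the estimated equation, the sketch below plugs one hypothetical house into the coefficients reported by `summary(full.model)`. The coefficient values are copied from that output. The house's characteristics are made up purely for illustration.

```r
# Coefficients copied from the summary(full.model) output above.
coefs <- c(
  "(Intercept)"  = -79252.791,
  Fireplaces     = 20594.247,
  Garage.Cars    = 5485.027,
  Garage.Area    = 81.040,
  Enclosed.Porch = -22.599,
  Overall.Cond   = 186.610,
  Overall.Qual   = 32678.606
)

# Hypothetical house (values chosen only for illustration).
house <- c(
  "(Intercept)"  = 1,    # multiplies the intercept
  Fireplaces     = 1,
  Garage.Cars    = 2,
  Garage.Area    = 500,
  Enclosed.Porch = 0,
  Overall.Cond   = 5,
  Overall.Qual   = 6
)

# Predicted price = sum of coefficient x value, matching terms by name.
predicted_price <- sum(coefs * house[names(coefs)])
print(predicted_price)
```

For this hypothetical house the equation gives a predicted price of about $189,836. With the fitted model available, `predict(full.model, newdata = ...)` would do the same calculation.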

From the summary table above, only Enclosed.Porch has a negative estimate (-22.599). Fireplaces, Garage.Cars, Garage.Area, Overall.Cond, and Overall.Qual have positive estimates. The F-statistic is 1229 on 6 and 2922 DF with a p-value < 2.2e-16, so the overall model is statistically significant.

Fireplaces, Garage.Area, and Overall.Qual are all statistically significant (p < 0.001), indicating that each has a strong positive association with house price.

Garage.Cars is also significant at the 5% level (p = 0.0208). Enclosed.Porch (p = 0.0705) and Overall.Cond (p = 0.7960) show no statistically significant relationship with price at the 5% level.

Although two of the six predictors are not individually significant, the overall model is significant. Garage.Cars and Garage.Area measure closely related quantities, so some multicollinearity between them is possible.

  1. (5 points) Describe the quality of the model by reporting and commenting on the percentage of total variation in the outcome (response) variable that is explained by the explanatory variables collectively. Is the whole regression model statistically significant?

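Multiple R-squared: 0.7161. About 71.61% of the total variation in price is explained collectively by Fireplaces, Garage.Cars, Garage.Area, Enclosed.Porch, Overall.Cond, and Overall.Qual (adjusted R-squared: 0.7155). The overall F-test (F = 1229 on 6 and 2922 DF, p-value < 2.2e-16) indicates that the regression model as a whole is statistically significant.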
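To make the meaning of R-squared concrete, the sketch below recomputes it by hand as 1 - SSE/SST and compares the result with the value reported by `summary()`. It uses simulated data, not the Ames model: the variables `x1`, `x2`, `y`, the seed, and the true coefficients are all assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated data standing in for a real regression problem.
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# R-squared: share of total variation in y explained by the model.
sse <- sum(residuals(fit)^2)   # unexplained variation
sst <- sum((y - mean(y))^2)    # total variation
r2  <- 1 - sse / sst

# Adjusted R-squared penalizes the number of predictors p.
p      <- 2
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2))
print(c(summary(fit)$r.squared, summary(fit)$adj.r.squared))
```

Applying the adjusted formula to the reported values (R-squared 0.7161, 2,929 observations used, 6 predictors) gives about 0.7155, which agrees with the model output.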

Results

Output from R

House.Style       Mean         SD
1.5Fin        137529.9   47225.67
1.5Unf        109663.2   20569.59
1Story        178699.9   81066.94
2.5Fin        220000.0  118211.98
2.5Unf        177158.3   76114.76
2Story        206990.2   85349.91
SFoyer        143472.7   31220.08
SLvl          165527.4   34348.13

Interpretation

Inference of the results

I am 95% confident that the true population mean price of houses is between $177,902.3 and $183,689.9.

Discussion

Possible improvements to the project

Keep raw data, scripts, and outputs in clearly defined folders to maintain clarity. Also, use functions and scripts to streamline repetitive tasks and reduce manual intervention.

Reference

TA Abiodun Joseph helped me during this DAP.