Capstone Quiz I

Complete all Exercises, and submit answers to Questions on the Coursera platform.

This initial quiz will concern exploratory data analysis (EDA) of the Ames Housing dataset. EDA is essential when working with any source of data and helps inform modeling.

First, let us load the data:

load("~/Desktop/R Programming/Statistics_Coursera/Capstone/ames_train.Rdata")

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(magrittr)
library(ggplot2)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ tibble  3.0.4     ✓ purrr   0.3.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::extract()   masks magrittr::extract()
## x dplyr::filter()    masks stats::filter()
## x dplyr::lag()       masks stats::lag()
## x purrr::set_names() masks magrittr::set_names()

library(moments)

Which of the following are the three variables with the highest number of missing observations?
1. Misc.Feature, Fence, Pool.QC
2. **Misc.Feature, Alley, Pool.QC**
3. Pool.QC, Alley, Fence
4. Fireplace.Qu, Pool.QC, Lot.Frontage

# Count missing values per column, then show the three largest counts
sum_na <- colSums(is.na(ames_train))
head(sort(sum_na, decreasing = TRUE), 3)

##      Pool.QC Misc.Feature        Alley 
##          997          971          933

How many categorical variables are coded in R as having type int? Change them to factors when conducting your analysis.
1. 0
2. 1
3. 2
4. 3

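In total, 38 variables are coded as integer. This count alone does not answer the question: the categorical variables among them have to be identified from the codebook. Overall.Cond is one of them and is converted to a factor below.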

sum(sapply(ames_train, is.integer))

## [1] 38

ames_train$Overall.Cond <- as.factor(ames_train$Overall.Cond)
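Counting integer columns does not by itself single out the categorical ones. As a rough screening heuristic, integer columns with only a few distinct values can be listed and then checked against the codebook. The sketch below is an illustration, not part of the original analysis: the helper name candidate_factors, the max_levels threshold, and the toy data frame are all assumptions.

```r
# Hypothetical helper: list integer columns with few distinct values,
# which are candidates (not certainties) for conversion to factors.
candidate_factors <- function(df, max_levels = 20) {
  int_cols <- names(df)[vapply(df, is.integer, logical(1))]
  n_distinct <- vapply(
    df[int_cols],
    function(col) length(unique(na.omit(col))),
    integer(1)
  )
  sort(n_distinct[n_distinct <= max_levels])
}

# Toy data standing in for ames_train (values are made up).
toy <- data.frame(
  Overall.Cond = c(5L, 7L, 5L, 6L, 8L, 5L),
  area         = c(856L, 1049L, 1001L, 1039L, 1665L, 1922L),
  price        = c(126000, 139500, 124900, 114000, 227000, 198500)
)
candidate_factors(toy, max_levels = 5)
```

A low distinct-value count only flags a column for inspection. Counts such as the number of bedrooms also have few distinct values but are genuinely numeric, so the final decision still rests on the variable descriptions.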

In terms of price, which neighborhood has the highest standard deviation?
1. StoneBr
2. Timber
3. Veenker
4. NridgHt

ames_train %>%
  dplyr::select(Neighborhood, price) %>%
  group_by(Neighborhood) %>%
  summarise(sd = sd(price, na.rm = TRUE)) %>%
  arrange(desc(sd))

## # A tibble: 27 x 2
##    Neighborhood      sd
##    <fct>          <dbl>
##  1 StoneBr      123459.
##  2 NridgHt      105089.
##  3 Timber        84030.
##  4 Veenker       72545.
##  5 Crawfor       71268.
##  6 GrnHill       70711.
##  7 Somerst       65199.
##  8 Edwards       54852.
##  9 CollgCr       52786.
## 10 SawyerW       48354.
## # … with 17 more rows

Using scatter plots or other graphical displays, which of the following variables appears to be the best single predictor of price?
1. Lot.Area
2. Bedroom.AbvGr
3. Overall.Qual
4. Year.Built

ggplot(ames_train, aes(x = Lot.Area, y = price)) +
  geom_point() +
  stat_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

ggplot(ames_train, aes(x = Bedroom.AbvGr, y = price)) +
  geom_point() +
  stat_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

ggplot(ames_train, aes(x = Overall.Qual, y = price)) +
  geom_point() +
  stat_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

ggplot(ames_train, aes(x = Year.Built, y = price)) +
  geom_point() +
  stat_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

Suppose you are examining the relationship between price and area. Which of the following variable transformations makes the relationship appear to be the most linear?
1. Do not transform either price or area
2. Log-transform price but not area
3. Log-transform area but not price
4. Log-transform both price and area

The skewness function from the moments package was used to measure the skewness of each variable. Both price and area are right-skewed (positive skewness), so log-transforming both variables gives the most linear relationship.

skewness(ames_train$area)

## [1] 0.9881214

skewness(ames_train$price)

## [1] 1.628719

ggplot(ames_train, aes(x = log(area), y = log(price))) +
  geom_point() +
  stat_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'
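To compare all four transformation choices side by side rather than plotting only the log-log case, one option is to fit a simple regression for each and look at the R-squared values. This is a minimal sketch on simulated data, not the Ames data; the helper name compare_log_fits and the simulation parameters are assumptions made for illustration.

```r
# Hypothetical helper: R^2 of a simple linear fit under each of the four
# log-transformation choices for x and y.
compare_log_fits <- function(x, y) {
  fits <- list(
    "neither" = lm(y ~ x),
    "log y"   = lm(log(y) ~ x),
    "log x"   = lm(y ~ log(x)),
    "both"    = lm(log(y) ~ log(x))
  )
  sapply(fits, function(fit) summary(fit)$r.squared)
}

# Simulated data: a power-law relationship with multiplicative noise.
set.seed(1)
area  <- exp(rnorm(200, mean = 7.2, sd = 0.3))
price <- exp(5 + 0.9 * log(area) + rnorm(200, sd = 0.2))
round(compare_log_fits(area, price), 3)
```

R-squared values computed on different response scales (price versus log(price)) are only a rough guide, so the scatter plots and residual plots remain the main evidence for linearity.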

Suppose that your prior for the proportion of houses that have at least one garage is Beta(9, 1). What is your posterior? Assume a beta-binomial model for this proportion.
1. Beta(954, 46)
2. Beta(963, 46)
3. Beta(954, 47)
4. Beta(963, 47)

n <- ames_train %>%
  dplyr::select(Garage.Type) %>%
  nrow()

x <- ames_train %>%
  dplyr::select(Garage.Type) %>%
  filter(!is.na(Garage.Type)) %>%
  nrow()

alpha <- 9 + x
beta <- 1 + n - x
paste(alpha, beta)

## [1] "963 47"
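The calculation above is the standard conjugate update for a beta-binomial model. Writing the prior as Beta(a, b), with x houses that have a garage out of n houses in total, the values x = 954 and n - x = 46 follow from the output shown:

\[ p \sim \text{Beta}(a, b), \quad x \mid p \sim \text{Binomial}(n, p) \;\Rightarrow\; p \mid x \sim \text{Beta}(a + x,\; b + n - x) \]
\[ p \mid x \sim \text{Beta}(9 + 954,\; 1 + 46) = \text{Beta}(963, 47) \]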

Which of the following statements is true about the dataset?
1. Over 30 percent of houses were built after the year 1999. FALSE
2. The median housing price is greater than the mean housing price. FALSE
3. 21 houses do not have a basement.
4. 4 houses are located on gravel streets. FALSE

sum_n <- nrow(ames_train)
ames_train %>%
  dplyr::select(Year.Built) %>%
  group_by(Year.Built > 1999) %>%
  count() %>%
  summarise(pct = n/sum_n * 100)

## # A tibble: 2 x 2
##   `Year.Built > 1999`   pct
## * <lgl>               <dbl>
## 1 FALSE                72.8
## 2 TRUE                 27.2

summary(ames_train$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12789  129762  159467  181190  213000  615000

summary(ames_train$Bsmt.Qual)

##        Ex   Fa   Gd   Po   TA NA's 
##    1   87   28  424    1  438   21

summary(ames_train$Street)

## Grvl Pave 
##    3  997

Test, at the \(\alpha = 0.05\) level, whether homes with a garage have larger square footage than those without a garage.
1. With a p-value near 0.000, we reject the null hypothesis of no difference.
2. With a p-value of approximately 0.032, we reject the null hypothesis of no difference.
3. With a p-value of approximately 0.135, we fail to reject the null hypothesis of no difference.
4. With a p-value of approximately 0.343, we fail to reject the null hypothesis of no difference.

\[ H_0: \text{Homes with and without a garage have no difference in size.} \]
\[ H_a: \text{Homes with a garage are larger than those without.} \]

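# Recode Garage.Finish into a two-level garage indicator (NA means no garage)
ames_train$Garage.Finish <- ifelse(is.na(ames_train$Garage.Finish), "NoGarage", "Garage")

# Groups are ordered alphabetically, so "greater" tests mean(Garage) - mean(NoGarage) > 0
t.test(area ~ Garage.Finish, data = ames_train,
       var.equal = TRUE,
       alternative = "greater")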

## 
##  Two Sample t-test
## 
## data:  area by Garage.Finish
## t = 4.6035, df = 998, p-value = 2.345e-06
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  223.2586      Inf
## sample estimates:
##   mean in group Garage mean in group NoGarage 
##               1492.603               1145.043

For homes with square footage greater than 2000, assume that the number of bedrooms above ground follows a Poisson distribution with rate \(\lambda\). Your prior on \(\lambda\) follows a Gamma distribution with mean 3 and standard deviation 1. What is your posterior mean and standard deviation for the average number of bedrooms in houses with square footage greater than 2000 square feet?
1. Mean: 3.61, SD: 0.11
2. Mean: 3.62, SD: 0.16
3. Mean: 3.63, SD: 0.09
4. Mean: 3.63, SD: 0.91

bedroom <- ames_train %>%
  dplyr::select(area, Bedroom.AbvGr) %>%
  filter(area > 2000 & !is.na(Bedroom.AbvGr)) %>%
  summarise(rcount = n(), 
            total = sum(Bedroom.AbvGr, na.rm=TRUE))

# Gamma prior in shape/scale form: mean = k * theta = 3, sd = sqrt(k) * theta = 1
k <- (3/1)^2
theta <- 1/3
sumX <- bedroom$total
k_star <- k + sumX
n <- bedroom$rcount
theta_star <- theta / (n * theta + 1)
post_mean <- k_star * theta_star
post_sd <- theta_star * sqrt(k_star)
paste(post_mean, post_sd)

## [1] "3.61702127659574 0.160164394193421"
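The code above uses the shape-scale form of the Gamma distribution, with shape k and scale \(\theta\). Matching the prior mean and standard deviation gives the prior parameters, and the Gamma-Poisson conjugate update gives the posterior, where n is the number of qualifying houses and \(x_i\) is the number of bedrooms in house i:

\[ k\theta = 3, \quad \sqrt{k}\,\theta = 1 \;\Rightarrow\; k = \left(\tfrac{3}{1}\right)^2 = 9, \quad \theta = \tfrac{1}{3} \]
\[ k^* = k + \sum_{i=1}^{n} x_i, \qquad \theta^* = \frac{\theta}{n\theta + 1} \]
\[ \text{E}[\lambda \mid x] = k^*\theta^*, \qquad \text{SD}[\lambda \mid x] = \sqrt{k^*}\,\theta^* \]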

When regressing \(\log\)(price) on \(\log\)(area), there are some outliers. Which of the following do the three most outlying points have in common?
1. They had abnormal sale conditions.
2. They have only two bedrooms.
3. They have an overall quality of less than 3.
4. They were built before 1930.

lm_q10 <- lm(log(price) ~ log(area), ames_train)
plot(lm_q10, 1)

# Rows 206, 428 and 741 are the three points labeled in the residual plot
ames_train$Sale.Condition[c(206, 428, 741)]

## [1] Abnorml Abnorml Normal 
## Levels: Abnorml AdjLand Alloca Family Normal Partial

ames_train$Bedroom.AbvGr[c(206, 428, 741)]

## [1] 3 2 3

ames_train$Overall.Qual[c(206, 428, 741)]

## [1] 4 2 4

ames_train$Year.Built[c(206, 428, 741)]

## [1] 1910 1923 1920
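The row numbers used above were read off the residual plot. As an alternative, the most outlying points can be pulled directly from the fitted model by ranking the absolute residuals. This is a minimal sketch on simulated data with three planted outliers, not the Ames data; the helper name top_residuals and the toy data are assumptions, and the returned positions match row numbers only when no rows were dropped for missing values.

```r
# Hypothetical helper: positions of the n largest absolute residuals.
top_residuals <- function(fit, n = 3) {
  res <- residuals(fit)
  head(order(abs(res), decreasing = TRUE), n)
}

# Simulated data with three planted outliers (prices cut by a factor of 20).
set.seed(42)
toy <- data.frame(area = exp(rnorm(100, mean = 7.2, sd = 0.3)))
toy$price <- exp(5 + 0.9 * log(toy$area) + rnorm(100, sd = 0.1))
planted <- c(10, 50, 90)
toy$price[planted] <- toy$price[planted] / 20

fit <- lm(log(price) ~ log(area), data = toy)
idx <- top_residuals(fit)
toy[idx, ]
```

Indexing the data frame with the returned positions shows all columns of the outlying rows at once, which makes it easier to spot what they have in common.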

Which of the following are reasons to log-transform price if used as a dependent variable in a linear regression?
1. price is right-skewed.
2. price cannot take on negative values.
3. price can only take on integer values.
4. Both 1 and 2.

par(mfrow=c(1,2))
hist(ames_train$price)
hist(log(ames_train$price))

How many neighborhoods consist of only single-family homes? (i.e., Bldg.Type = 1Fam)
1. 0
2. 1
3. 2
4. 3

ames_train %>%
  filter(!is.na(Neighborhood), !is.na(Bldg.Type)) %>%
  dplyr::select(Neighborhood, Bldg.Type) %>%
  group_by(Neighborhood) %>%
  summarise(FamilyMem = mean(Bldg.Type == "1Fam")) %>%
  arrange(-FamilyMem)

## # A tibble: 27 x 2
##    Neighborhood FamilyMem
##    <fct>            <dbl>
##  1 ClearCr          1    
##  2 NoRidge          1    
##  3 Timber           1    
##  4 Gilbert          0.980
##  5 BrkSide          0.976
##  6 NWAmes           0.976
##  7 CollgCr          0.965
##  8 IDOTRR           0.943
##  9 NAmes            0.923
## 10 Sawyer           0.918
## # … with 17 more rows

Using color, different plotting symbols, conditioning plots, etc., does there appear to be an association between \(\log\)(area) and the number of bedrooms above ground (Bedroom.AbvGr)?
1. Yes
2. No

ggplot(ames_train, aes(Bedroom.AbvGr, log(area))) + 
  geom_point() +
  geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

Of the homes that have unfinished basements, what is the average square footage of the unfinished basement?
1. 590.36
2. 595.25
3. 614.37
4. 681.94

ames_train %>%
  filter(!is.na(Bsmt.Unf.SF) & Bsmt.Unf.SF != 0) %>%
  dplyr::select(Bsmt.Unf.SF) %>%
  summarise(mean(Bsmt.Unf.SF))

## # A tibble: 1 x 1
##   `mean(Bsmt.Unf.SF)`
##                 <dbl>
## 1                595.