Complete all Exercises, and submit answers to Questions on the Coursera platform.
This initial quiz will concern exploratory data analysis (EDA) of the Ames Housing dataset. EDA is essential when working with any source of data and helps inform modeling.
First, let us load the data:
Misc.Feature, Fence, Pool.QC
Misc.Feature, Alley, Pool.QC
Pool.QC, Alley, Fence
Fireplace.Qu, Pool.QC, Lot.Frontage
## Pool.QC Misc.Feature Alley
## 997 971 933
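Counts like those above can be produced by tallying missing values per column and sorting. The following is a minimal base-R sketch, not the quiz's own code: the `toy` data frame and its values are invented for illustration, and the quiz itself works on `ames_train`.

```r
# Toy stand-in for ames_train; column names mirror the quiz, values are invented.
toy <- data.frame(
  Pool.QC = c(NA, NA, NA, "Gd"),
  Alley   = c(NA, "Grvl", NA, "Pave"),
  Fence   = c("MnPrv", NA, "GdWo", "MnPrv"),
  stringsAsFactors = TRUE
)

# Count NAs in each column, then sort so the most incomplete columns come first.
na_count <- sort(colSums(is.na(toy)), decreasing = TRUE)
top3 <- head(na_count, 3)
print(top3)
```

Applied to the real data, the same two lines would use `ames_train` in place of `toy`.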
int? Change them to factors when conducting your analysis.
There are 38 variables coded as integer.
## [1] 38
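One way to count integer-coded columns, and to convert a categorical one to a factor, is sketched below. The `toy` data frame and the column name `zone_code` are hypothetical; the source does not say which of the 38 variables should be converted.

```r
# Toy data: two integer columns and one double column (all values invented).
toy <- data.frame(
  zone_code  = c(20L, 60L, 20L),        # hypothetical categorical code stored as integer
  Year.Built = c(1961L, 2003L, 1999L),
  area       = c(1256.5, 1710.0, 1262.0)
)

# is.integer() is TRUE only for plain integer vectors (it is FALSE for factors).
n_int <- sum(vapply(toy, is.integer, logical(1)))

# Treat the code as categorical rather than numeric.
toy$zone_code <- factor(toy$zone_code)
n_int_after <- sum(vapply(toy, is.integer, logical(1)))
print(c(before = n_int, after = n_int_after))
```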
StoneBr
Timber
Veenker
NridgHt
ames_train %>%
  dplyr::select(Neighborhood, price) %>%
  group_by(Neighborhood) %>%
  summarise(sd = sd(price, na.rm = TRUE)) %>%
  arrange(desc(sd))
## # A tibble: 27 x 2
## Neighborhood sd
## <fct> <dbl>
## 1 StoneBr 123459.
## 2 NridgHt 105089.
## 3 Timber 84030.
## 4 Veenker 72545.
## 5 Crawfor 71268.
## 6 GrnHill 70711.
## 7 Somerst 65199.
## 8 Edwards 54852.
## 9 CollgCr 52786.
## 10 SawyerW 48354.
## # … with 17 more rows
price?
Lot.Area
Bedroom.AbvGr
Overall.Qual
Year.Built
price and area. Which of the following variable transformations makes the relationship appear to be the most linear?
price or area
price but not area
area but not price
price and area
The skewness function was used to compute the skewness of each variable. Both price and area have positive skewness, meaning they are right-skewed, so both variables need to be transformed.
## [1] 0.9881214
## [1] 1.628719
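The source does not show which package supplied `skewness`, so the sketch below defines its own moment-based version (`skew` is a name chosen here, and packaged implementations may apply a bias correction that changes the value slightly). It illustrates, on invented data, how a log transformation reduces right skew.

```r
# Moment-based sample skewness: third central moment over variance^(3/2).
skew <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Invented right-skewed data: most values are small, a few are large.
x <- c(1, 2, 2, 3, 3, 3, 4, 5, 9, 20)

skew_raw <- skew(x)       # positive for right-skewed data
skew_log <- skew(log(x))  # closer to zero after the log transform
print(c(raw = skew_raw, log = skew_log))
```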
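# n: total number of homes in the training data
n <- nrow(ames_train)
# x: number of homes with a recorded garage type (NA means no garage)
x <- ames_train %>%
  filter(!is.na(Garage.Type)) %>%
  nrow()
# Posterior Beta parameters; the constants 9 and 1 are the prior's alpha and beta
alpha <- 9 + x
beta <- 1 + n - x
paste(alpha, beta)
## [1] "963 47"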
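The update in the code above is the standard Beta-Binomial conjugate result. Assuming, as the constants 9 and 1 in the code suggest, a Beta(9, 1) prior on the proportion \(p\) of homes with a garage, and \(x\) homes with a garage out of \(n\):

\[ p \sim \text{Beta}(\alpha, \beta), \quad x \mid p \sim \text{Binomial}(n, p) \;\Longrightarrow\; p \mid x \sim \text{Beta}(\alpha + x,\; \beta + n - x) \]

Working backwards from the printed result "963 47", the counts are \(x = 954\) and \(n - x = 46\):

\[ \alpha^* = 9 + 954 = 963, \qquad \beta^* = 1 + 46 = 47 \]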
sum_n <- nrow(ames_train)
ames_train %>%
  dplyr::select(Year.Built) %>%
  group_by(Year.Built > 1999) %>%
  count() %>%
  summarise(pct = n / sum_n * 100)
## # A tibble: 2 x 2
## `Year.Built > 1999` pct
## * <lgl> <dbl>
## 1 FALSE 72.8
## 2 TRUE 27.2
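The same percentage can be computed more directly, because the mean of a logical vector is the proportion of `TRUE` values. A minimal sketch on invented years (`year_built` is a stand-in for `ames_train$Year.Built`):

```r
# Invented build years; two of the four are after 1999.
year_built <- c(1950L, 1978L, 2001L, 2005L)

# TRUE counts as 1 and FALSE as 0, so the mean is the share of TRUE values.
pct_after_1999 <- mean(year_built > 1999) * 100
print(pct_after_1999)
```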
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12789 129762 159467 181190 213000 615000
##        Ex   Fa   Gd   Po   TA NA's
##    1   87   28  424    1  438   21
## Grvl Pave
## 3 997
\[ H_0: \text{Homes with and without a garage have no difference in size.} \] \[ H_a: \text{Homes with a garage are larger than those without.} \]
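# Recode Garage.Finish into two groups, treating NA as "no garage" (overwrites the column)
ames_train$Garage.Finish <- ifelse(is.na(ames_train$Garage.Finish), "NoGarage", "Garage")
# One-sided test: "greater" refers to the first group alphabetically ("Garage")
t.test(area ~ Garage.Finish, data = ames_train,
       var.equal = TRUE,
       alternative = "greater")
##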
## Two Sample t-test
##
## data: area by Garage.Finish
## t = 4.6035, df = 998, p-value = 2.345e-06
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 223.2586 Inf
## sample estimates:
## mean in group Garage mean in group NoGarage
## 1492.603 1145.043
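With the formula interface, the direction of a one-sided test depends on the order of the group levels, which is why "Garage" sorting before "NoGarage" matters in the call above. The sketch below shows this on invented data; `toy`, `group`, and `group_rev` are names made up for the illustration.

```r
# Invented toy data; group labels mirror the quiz.
toy <- data.frame(
  area  = c(10, 11, 12, 13, 14, 5, 6, 7, 8, 9),
  group = rep(c("Garage", "NoGarage"), each = 5)
)

# Character groups become a factor with alphabetical levels,
# so the tested difference is mean(Garage) - mean(NoGarage).
res <- t.test(area ~ group, data = toy,
              var.equal = TRUE, alternative = "greater")

# Reversing the level order flips the direction of the one-sided test.
toy$group_rev <- factor(toy$group, levels = c("NoGarage", "Garage"))
res_rev <- t.test(area ~ group_rev, data = toy,
                  var.equal = TRUE, alternative = "greater")

print(c(garage_first = res$p.value, nogarage_first = res_rev$p.value))
```

The two p-values sum to one, so checking the level order before reading a one-sided result is a cheap safeguard.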
# Count homes with area above 2000 and total their above-ground bedrooms
bedroom <- ames_train %>%
  dplyr::select(area, Bedroom.AbvGr) %>%
  filter(area > 2000 & !is.na(Bedroom.AbvGr)) %>%
  summarise(rcount = n(),
            total = sum(Bedroom.AbvGr, na.rm = TRUE))
# Gamma(k, theta) prior in shape/scale form: mean k * theta = 3, sd theta * sqrt(k) = 1
k <- (3/1)^2
theta <- 1/3
# Conjugate Poisson-Gamma update
sumX <- bedroom$total
k_star <- k + sumX
n <- bedroom$rcount
theta_star <- theta / (n * theta + 1)
post_mean <- k_star * theta_star
post_sd <- theta_star * sqrt(k_star)
paste(post_mean, post_sd)
## [1] "3.61702127659574 0.160164394193421"
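The code above applies the Poisson-Gamma conjugate update with the Gamma distribution in its shape/scale form. For a prior \(\lambda \sim \text{Gamma}(k, \theta)\) and counts \(x_1, \dots, x_n \mid \lambda \sim \text{Poisson}(\lambda)\):

\[ \lambda \mid x_1, \dots, x_n \sim \text{Gamma}(k^*, \theta^*), \qquad k^* = k + \sum_{i=1}^{n} x_i, \qquad \theta^* = \frac{\theta}{n\theta + 1} \]

\[ \text{E}[\lambda \mid x] = k^* \theta^*, \qquad \text{SD}[\lambda \mid x] = \theta^* \sqrt{k^*} \]

The sample size and bedroom total are not printed in the source. Working backwards from the output with \(k = 9\) and \(\theta = 1/3\), the values consistent with it are \(n = 138\) and \(\sum x_i = 501\):

\[ k^* = 9 + 501 = 510, \qquad \theta^* = \frac{1/3}{138/3 + 1} = \frac{1}{141}, \qquad \frac{510}{141} \approx 3.617, \qquad \frac{\sqrt{510}}{141} \approx 0.160 \]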
price) on \(\log\)(area), there are some outliers. Which of the following do the three most outlying points have in common?
## [1] Abnorml Abnorml Normal
## Levels: Abnorml AdjLand Alloca Family Normal Partial
## [1] 3 2 3
## [1] 4 2 4
## [1] 1910 1923 1920
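One way to find the most outlying points of a log-log regression is to rank observations by absolute residual. The sketch below is an assumption about the approach, not the quiz's own code: it uses the built-in `mtcars` data as a stand-in for `ames_train`, with `mpg` and `wt` playing the roles of `price` and `area`.

```r
# mtcars ships with R and stands in for ames_train here.
fit <- lm(log(mpg) ~ log(wt), data = mtcars)

# Rank observations by absolute residual and keep the three largest.
worst <- order(abs(resid(fit)), decreasing = TRUE)[1:3]

# Inspect other attributes of those rows to see what they have in common.
outliers <- mtcars[worst, c("mpg", "wt", "cyl")]
print(outliers)
```

With the housing data, the inspected columns would be those shown in the output above, such as the sale condition and the year built.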
price if used as a dependent variable in a linear regression?
price is right-skewed.
price cannot take on negative values.
price can only take on integer values.
Bldg.Type = 1Fam)
ames_train %>%
  filter(!is.na(Neighborhood), !is.na(Bldg.Type)) %>%
  dplyr::select(Neighborhood, Bldg.Type) %>%
  group_by(Neighborhood) %>%
  summarise(FamilyMem = mean(Bldg.Type == "1Fam")) %>%
  arrange(-FamilyMem)
## # A tibble: 27 x 2
## Neighborhood FamilyMem
## <fct> <dbl>
## 1 ClearCr 1
## 2 NoRidge 1
## 3 Timber 1
## 4 Gilbert 0.980
## 5 BrkSide 0.976
## 6 NWAmes 0.976
## 7 CollgCr 0.965
## 8 IDOTRR 0.943
## 9 NAmes 0.923
## 10 Sawyer 0.918
## # … with 17 more rows
area) and the number of bedrooms above ground (Bedroom.AbvGr)?
ames_train %>%
  filter(!is.na(Bsmt.Unf.SF) & Bsmt.Unf.SF != 0) %>%
  dplyr::select(Bsmt.Unf.SF) %>%
  summarise(mean(Bsmt.Unf.SF))
## # A tibble: 1 x 1
## `mean(Bsmt.Unf.SF)`
## <dbl>
## 1 595.