---
title: "HW 6"
author: "Penelope Pooler Eisenbies"
date: last-modified
toc: true
toc-depth: 3
toc-location: left
toc-title: "Table of Contents"
toc-expand: 1
format:
  html:
    code-line-numbers: true
    code-fold: true
    code-tools: true
execute:
  echo: fenced
editor: visual
---
## Setup
Run the following chunk of R code to install and load the packages needed for this assignment.
Click the green triangle in the upper right corner of the setup chunk to run the setup code.
Note that the `setup` code will not appear in the rendered HTML file.
```{r echo=F, include=F}
#|label: Setup
# set a session-wide option that applies to all R chunks:
# suppress scientific notation
options(scipen=100)
# install helper package that loads and installs other packages, if needed
if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/")
# install and load required packages
pacman::p_load(pacman,tidyverse, magrittr, gridExtra, ggthemes,
knitr, kableExtra, olsrr, ggiraphExtra)
# verify packages
p_loaded()
```
## House Remodel Data - Questions 1 - 6
## Import and Examine Data
### Question 1
Examine the R output from the chunk below to answer these questions on Blackboard.
- The `house_remodel_hw6` dataset has `____` observations.
- There are `____` remodeled houses and `____` un-remodeled houses in this dataset.
```{r}
#|label: Question 1 - Import and examine house_remodel_hw6 data
# import and examine data
houses_hw6 <- read_csv("data/house_remodel_hw6.csv", show_col_types = F) |>
glimpse(width=75)
# examine counts for each category
houses_hw6 |> select(Remodeled) |> table()
```
## Examine Correlations within Data
### Question 2
Examine the R output from the chunk below to answer these questions.
- The overall correlation between Price and Square Feet is `____`.
- The correlation between Price and Square Feet in un-remodeled houses is `____`.
- The correlation between Price and Square Feet in remodeled houses is `____`.
```{r}
#|label: Question 2 - Examine Correlations
# examine correlation between Price and Square_Feet
houses_hw6 |> select(Price, Square_Feet) |> cor() |> round(2)
# examine correlation between price and square feet in un-remodeled houses
houses_hw6 |> filter(Remodeled=="No") |>
select(Price, Square_Feet) |> cor() |> round(2)
# examine correlation between price and square feet in remodeled houses
houses_hw6 |> filter(Remodeled=="Yes") |>
select(Price, Square_Feet) |> cor() |> round(2)
```
## Modeling the House Remodel Data
### Questions 3 - 6
- Below are two chunks of R code.
- The first chunk creates the interactive plot.
- Copy and paste the `ggPredict` command into the R Console to view the plot more clearly in the RStudio Viewer (Lower Left Pane).
- The second chunk creates the model and prints it.
- Use the interactive plot and model output to answer Questions 3 - 6.
Question 3. What is the SLR model equation for un-remodeled houses (Remodeled = No)?
- Round values to two decimal places.
- Est. Price = `___` + `___` \* Square_Feet.
- Hint for Question 3:
  - The un-remodeled houses (Remodeled = No) are the baseline category (not listed in output).
  - The baseline `Intercept` Beta and `Square_Feet` Beta are the coefficients for the baseline category SLR.
Question 4. What is the SLR model equation for remodeled houses (Remodeled = Yes)?
- Round values to two decimal places.
- Est. Price = `___` + `___` \* Square_Feet.
Hint for Question 4:
- The intercept for the remodeled houses (Remodeled = Yes) is calculated as: baseline `Intercept` Beta + `RemodeledYes` Beta
Hint for Questions 3 and 4:
- You can check your work by examining the model equations for each line in the interactive plot.
- The slope is the same for both `Remodeled` categories, but the intercepts differ.
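The hints above can be sketched in code. This is a minimal illustration, assuming a small made-up data frame named `toy` (not the homework data); only the column names and the model form are taken from the assignment.

```r
# Made-up data for illustration only (NOT the house_remodel_hw6 data)
toy <- data.frame(
  Price       = c(200, 260, 310, 250, 300, 365),
  Square_Feet = c(1000, 1500, 2000, 1000, 1500, 2000),
  Remodeled   = c("No", "No", "No", "Yes", "Yes", "Yes")
)

# same model form as the assignment: one common slope, separate intercepts
toy_lm <- lm(Price ~ Square_Feet + Remodeled, data = toy)
b <- coef(toy_lm)

# baseline line (Remodeled = No) uses the Intercept term as-is
int_no <- b[["(Intercept)"]]

# remodeled line shifts the intercept by the RemodeledYes term
int_yes <- b[["(Intercept)"]] + b[["RemodeledYes"]]

# the Square_Feet slope is shared by both lines
slope <- b[["Square_Feet"]]

round(c(int_no = int_no, int_yes = int_yes, slope = slope), 2)
```

The same arithmetic applies to the Beta values in the formal model output: only the intercept changes between the two equations.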
#### Interactive Categorical Model Plot
```{r}
#|label: Questions 3-6 - Categorical Regression Model Plot
# mlr categorical model
house_rem_cat_lm <- lm(Price ~ Square_Feet + Remodeled, data=houses_hw6)
# create interactive plot of model
# copy and paste this into console to view interactive plot
ggPredict(house_rem_cat_lm, interactive=T)
```
Question 5. Fill in the blanks. Round values to 2 decimal places.
- If a house is remodeled, the estimated price increase will be `___`.
- For both remodeled houses and un-remodeled houses, the price increase for each additional square foot is `___`.
Hints for Question 5:
- The difference due to remodeling is the `RemodeledYes` Beta in the model output.
- The price increase for each additional square foot is the slope, `Square_Feet` Beta, common to both models.
Question 6. Based on the P-value (`Sig`) for the difference due to remodeling (`RemodeledYes`), copy and paste the correct phrase to complete this sentence:
- After accounting for the relationship between Price and Square Feet, we see that there is `___` in price between un-remodeled and remodeled houses.
- Copy and paste the correct phrase from these options:
- not a significant difference
- suggestive evidence of a significant difference
- definitely a significant difference
#### Categorical Regression Model output
```{r}
#|label: Questions 3-6 - Categorical Regression Model Formal Output
# formatted regression output
# model is saved and printed to screen
(house_rem_cat_ols <- ols_regress(Price ~ Square_Feet + Remodeled, data=houses_hw6))
```
## Diamonds Data - Questions 7 - 16
## Import and Examine Data
### Question 7
Examine the R output below to answer these questions on Blackboard.
- The `diamonds` dataset has `____` observations.
- In the `diamonds` dataset for HW6 there are:
- `____ Colorless` diamonds.
- `____ Faint yellow` diamonds.
- `____ Nearly colorless` diamonds.
#### Import and Examine Diamonds Dataset
```{r}
#|label: Question 7 - Import and Examine Diamonds Data
diamonds <- read_csv("data/diamonds_hw6.csv", show_col_types = F) |>
glimpse(width = 75)
diamonds |> select(Color) |> table()
```
### Question 8
Use the formal regression output and/or the interactive plot below to answer this question.
- Recall that the baseline category is the first category alphabetically, `Colorless`.
- The `beta` values for Intercept and Weight are the coefficients of the SLR model for this baseline category.
- For `Colorless` diamonds, the `SLR` model is (round terms to 2 decimal places):
- `Est. Price = ____ + ____ Weight`
### Question 9
- What is the estimated price in dollars of a colorless diamond that weighs 0.75 carats?
- Round estimate to closest whole dollar.
- DO NOT include dollar sign.
- This calculation can be done in the R Console using values found in Question 8.
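The Console calculation described above might look like the sketch below. The coefficient values `b0` and `b1` are hypothetical placeholders, not the answers to Question 8; replace them with the values from your own model output.

```r
# HYPOTHETICAL coefficients for illustration only;
# substitute the intercept and Weight slope you found in Question 8
b0 <- -250.5    # intercept (made up)
b1 <- 4000.25   # slope for Weight (made up)

# weight of the diamond in carats, from Question 9
weight <- 0.75

# estimated price = intercept + slope * weight
est_price <- b0 + b1 * weight

# round to the closest whole dollar
round(est_price)
```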
## Questions 10-15
- The first code block below creates the linear model and the interactive plot (run `ggPredict` in Console) for the `diamonds` data.
- The second code block below saves the full model output but only prints the abridged output to avoid text-wrapping.
### Questions 10-12
- Use the formal regression output to answer Questions 10-11 about `Faint yellow` diamonds.
- Use the formal regression output and/or the interactive plot to answer Question 12 about `Faint yellow` diamonds.
#### Question 10
The difference in intercept from the baseline Intercept (`Colorless`) to the Intercept for the `Faint yellow` category is `____`.
#### Question 11
The difference in slope from the baseline slope (`Colorless`) to the slope for the `Faint yellow` category is `____`.
- Hint: The numerical variable in this model is Weight so all slope terms will include Weight in their label.
#### Question 12
Use the answers from Questions 10 and 11 and/or the interactive plot to answer this question.
For `Faint yellow` diamonds, the `SLR` model is (round terms to 2 decimal places):
- `Est. Price = ____ + ____ Weight`
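As a rough sketch of how the intercept and slope differences combine into a category-specific SLR, the example below fits an interaction model to made-up data. The data frame `toy`, its columns `x` and `y`, and the groups `A` and `B` are all hypothetical; only the model form mirrors the assignment.

```r
# Made-up data for illustration only (NOT the diamonds data)
toy <- data.frame(
  y     = c(10, 21, 29, 12, 27, 44),
  x     = c(1, 2, 3, 1, 2, 3),
  group = c("A", "A", "A", "B", "B", "B")
)

# interaction model: each group gets its own intercept and its own slope
toy_int_lm <- lm(y ~ x + group + x*group, data = toy)
b <- coef(toy_int_lm)

# baseline group "A" (first alphabetically) uses the Intercept and x terms as-is
int_A   <- b[["(Intercept)"]]
slope_A <- b[["x"]]

# group "B": add the intercept difference and the slope difference to the baseline
int_B   <- b[["(Intercept)"]] + b[["groupB"]]
slope_B <- b[["x"]] + b[["x:groupB"]]

round(c(int_A = int_A, slope_A = slope_A, int_B = int_B, slope_B = slope_B), 2)

# check: these match an SLR fit to group "B" alone
coef(lm(y ~ x, data = subset(toy, group == "B")))
```

Note that the slope-difference term carries the numerical variable in its label (`x:groupB` here), which is the pattern the hint for Question 11 describes.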
### Questions 13-15
- Use the formal regression output to answer Questions 13-14 about `Nearly colorless` diamonds.
- Use the formal regression output and/or the interactive plot to answer Question 15 about `Nearly colorless` diamonds.
#### Question 13
The difference in intercept from the baseline intercept (`Colorless`) to the intercept for `Nearly colorless` category is `____`.
#### Question 14
The difference in slope from the baseline slope (`Colorless`) to the slope for the `Nearly colorless` category is `____`.
#### Question 15
Use the answers from Questions 13 and 14 and/or the interactive plot to answer this question.
For `Nearly colorless` diamonds, the `SLR` model is (round terms to 2 decimal places):
- `Est. Price = ____ + ____ Weight`
### Question 16
Select the correct text to fill in the blanks to complete these sentences.
Based on all of the P-values (`Sig` column) in the model output, we can determine that:
- The model intercepts for the three diamond color categories are `____`.
- The model slopes for the three diamond color categories are `____`.
- Copy and paste the correct phrase from these options:
- not significantly different from each other
- show some suggestive differences from each other
- significantly different from each other
#### Interactive Model Plot
```{r}
#|label: Questions 8-15 - Categorical Regression Interaction Model Plot
# mlr interaction model
diamonds_int_lm <- lm(Price ~ Weight + Color + Weight*Color, data=diamonds)
# create interactive plot of model
# copy and paste this into console to view interactive plot
ggPredict(diamonds_int_lm, interactive=T)
```
#### Abridged Model Output
```{r}
#|label: Questions 8-15 - Categorical Regression Interaction Model Abridged Output
# abridged formatted regression output
diamonds_int_ols <- ols_regress(Price ~ Weight + Color + Weight*Color, data=diamonds, iterm=T)
# build the abridged table directly: name and round each column as it is created
(model_out <- tibble(
  model        = diamonds_int_ols$mvars,
  Beta         = round(diamonds_int_ols$betas, 2),
  `Std. Error` = round(diamonds_int_ols$std_errors, 2),
  t            = round(diamonds_int_ols$tvalues, 2),
  Sig          = round(diamonds_int_ols$pvalues, 4)
))
```
### Formatted Abridged Model Output
- Below I use R code to format the output so that it is easier to read in the HTML file.
- The values are IDENTICAL to the unformatted output above.
- Note: Formatted Output will differ in appearance depending on where it is viewed, i.e. slides, html file, or `.qmd` file.
- This output will not show up well on a dark screen but will appear in the HTML file.
- Output on a Quiz is likely to look like this.
```{r}
#|label: Questions 8-15 - Categorical Regression Interaction Model Abridged Formatted Output
model_out |> kable() |> kable_styling(full_width = F)
```
## Helpful Links for Reviewing Hypothesis Test Concepts:
- [**Khan Academy - The Idea of Significance tests**](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/idea-of-significance-tests/v/simple-hypothesis-testing)
- [**Khan Academy - Comparing a P-value to a Significance level**](https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/tests-about-population-mean/v/comparing-p-value-from-t-statistic-to-significance-level?modal=1)
## When you are done...
1. Save your changes to this file (Ctrl + S or Cmd + S).
2. OPTIONAL: Click the `Render` button to update the HTML file with your changes.
3. Close R/RStudio on your laptop or close the Posit Cloud browser.