BUA 345 - Lecture 13

Categorical Regression - Parallel Lines Model

Author

Penelope Pooler Eisenbies

Published

February 24, 2025

Housekeeping

HW 5 is due 2/26/2025 - 2 day grace period

Demo videos are posted

Today’s plan

Review of
- SLR and MLR
- Hypothesis testing in Regression
Categorical Parallel Lines Model

In-class Polling (Session ID: bua345s25)

Review Question - Import data

Recall the Tennessee Real Estate data:

Bedrooms	Bathrooms	Price	Year_built	Address	City	State
4	4.5	1365000	2006	5442 GRANNY WHITE PIKE	BRENTWOOD	TN
3	3.0	300000	1987	5608 HEARTHSTONE LN	BRENTWOOD	TN
3	4.5	566920	1968	1706 OLD HICKORY BLVD	BRENTWOOD	TN
5	5.0	650000	1977	1428 OLD HICKORY BLVD	BRENTWOOD	TN

Review Question - Natural Log transformation

We will build an MLR model using the natural log of Price (ln_Price)
This transformation is needed because Price is RIGHT-SKEWED.

Code

```{r ln transformation of Price, echo=T}
tn_houses <- tn_houses |>
  mutate(ln_Price = log(Price)) # ln_Price = the natural log of price
```

Price	ln_Price	Bedrooms	Bathrooms	Year_built	Address	City	State
1365000	14.12666	4	4.5	2006	5442 GRANNY WHITE PIKE	BRENTWOOD	TN
300000	12.61154	3	3.0	1987	5608 HEARTHSTONE LN	BRENTWOOD	TN
566920	13.24797	3	4.5	1968	1706 OLD HICKORY BLVD	BRENTWOOD	TN

Review Question - Histograms

Histogram of Price shows that the distribution of the raw data is right-skewed with high outliers.
Histogram of ln_Price shows that the distribution of the transformed data is approximately symmetric and close to normally distributed.
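
The effect of the log transformation can also be checked numerically. The sketch below is only an illustration: it uses simulated, made-up prices (not the Tennessee data), and `skewness()` is a hypothetical helper defined here rather than a function from the course packages.

```r
# Simulated right-skewed "prices" (made-up data, NOT the Tennessee houses)
set.seed(345)
price    <- rlnorm(500, meanlog = 13, sdlog = 0.6)
ln_price <- log(price)               # natural log, as in the lecture

# Hypothetical helper: sample skewness
# (near 0 = roughly symmetric, clearly above 0 = right-skewed)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(price)
skew_log <- skewness(ln_price)
round(c(raw = skew_raw, logged = skew_log), 2)
```

Calling `hist(price)` and `hist(ln_price)` on the same simulated vectors would show the same contrast visually.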

Lecture 13 In-class Exercises - Q1 - Review

Session ID: bua345s25

Back-transforming Model Estimates

Based on the model output, what is the estimated price of a house with 4 bedrooms and 3 bathrooms (rounded to the closest $1000)?

Code

```{r review question incomplete R code, eval=F, echo=T}
(y_est <- ___ + 0.056*4 + 0.375*3)            # fill in intercept from R output

(est_dollars <- exp(y_est))                   # back_transform y estimate

# -3 is correct input to round to closest thousand
round(est_dollars,-3)                        # without piping
est_dollars |> round(-3)                     # with piping
```

NOTE: All 3 steps above could be done with one line but it is helpful to break it down when learning.
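
As a sketch of the same steps wrapped in a function: the two slopes below are the ones shown in the exercise, but `b0_hypothetical` is a made-up placeholder and NOT the intercept from the R output, so the result is not the answer to the poll question. The function name `estimate_price()` is also hypothetical.

```r
# Slopes come from the exercise above; the intercept is a HYPOTHETICAL
# placeholder, not the value from the R output.
b0_hypothetical <- 11.5
b_bedrooms      <- 0.056
b_bathrooms     <- 0.375

# Estimate on the natural-log scale, then back-transform to dollars
estimate_price <- function(bedrooms, bathrooms, b0 = b0_hypothetical) {
  ln_est <- b0 + b_bedrooms * bedrooms + b_bathrooms * bathrooms
  exp(ln_est)
}

est_dollars <- estimate_price(bedrooms = 4, bathrooms = 3)
round(est_dollars, -3)   # rounded to the closest $1000

# The one-line version mentioned in the note
one_line <- round(exp(b0_hypothetical + 0.056 * 4 + 0.375 * 3), -3)
```

Replacing `b0_hypothetical` with the intercept from the actual output gives the estimate asked for in the question.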

Regression Terms - $R^2$ and Adjusted $R^2$

R is the correlation coefficient, $R_{XY}$
Regression Output only shows absolute value of R.

$R^2$ is $R_{XY}^2$, the square of the correlation coefficient.
$R^2$ is also called the coefficient of determination.
Meaning of $R^2$ in SLR: Proportion of variability in y explained by X

Adjusted $R^2$ adjusts $R^2$ for number of explanatory (X) variables in model.
- Much more to come about this.
- Meaning of Adjusted $R^2$ in MLR is a little less specific but it is similar to $R^2$.
Other values will be covered in upcoming lectures.
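
The relationship between R and $R^2$ in SLR can be verified directly. This is a minimal sketch on simulated, made-up data (the variable names and numbers are assumptions for illustration only): the correlation is negative here, while its square matches the $R^2$ reported by `lm`.

```r
set.seed(13)
x <- runif(40, 20, 60)
y <- 130 - 2 * x + rnorm(40, sd = 8)   # made-up negative relationship

slr <- lm(y ~ x)

r_xy      <- cor(x, y)                   # correlation coefficient (keeps its sign)
r_squared <- summary(slr)$r.squared      # R-squared from the regression
adj_r_sq  <- summary(slr)$adj.r.squared  # adjusted for number of X variables

c(R = r_xy, R_squared = r_squared, Adj_R_squared = adj_r_sq)
```

Because regression output lists only the absolute value of R, the sign of the relationship has to be read from the slope or from `cor()`.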

Review of Parameter Estimates Output

model column lists intercept and X variables in model
Beta column shows the estimate of the $\beta$ coefficients for each variable in model.
Std. Error shows the variability of each Beta coefficient estimate.
t = Beta/Std. Error, the test statistic for each Beta coefficient estimate.
Sig is P-value for Hypothesis test for each Beta coefficient estimate:

Recall Hypotheses being tested:
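
For each row of the Parameter Estimates table, the standard test is $H_0: \beta_j = 0$ against $H_A: \beta_j \neq 0$. The sketch below uses simulated, made-up data (not a course data set) to show how the `t` and `Sig` columns follow from `Beta` and `Std. Error`; base R's `summary()` table is used as a stand-in for the formatted output.

```r
set.seed(345)
n  <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 5 + 2 * x1 + rnorm(n)    # x2 has no true effect in this made-up data

fit <- lm(y ~ x1 + x2)
est <- coef(summary(fit))      # columns: Estimate, Std. Error, t value, Pr(>|t|)

# t = Beta / Std. Error
t_by_hand <- est[, "Estimate"] / est[, "Std. Error"]

# Two-sided P-value from the t distribution with the residual degrees of freedom
p_by_hand <- 2 * pt(-abs(t_by_hand), df = df.residual(fit))

round(cbind(est, t_by_hand, p_by_hand), 4)
```

The hand-computed columns should match the `t value` and `Pr(>|t|)` columns that R reports.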

Review of Parameter Estimates Output

Reminder of Example Output (Sig is P-value column):

Recall Interpretation guidelines for P-value:

Types of Data - Review

Types of Data - More on Categorical Data

Categorical variables assign each data observation to a category.
- Gender, Location, Hair Color, Eye Color, etc.
Ordinal Categories have an OBJECTIVE order:
- Grades: A, B, C, D
- College year: Freshman, Sophomore, Junior, Senior
Nominal Categories don’t have an objective order:
- Location
- Hair color
- Gender
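
In R, the distinction can be made explicit with factors. The following is a small sketch with made-up values: an ordinal variable is stored as an ordered factor with its levels listed from lowest to highest, while a nominal variable is a plain factor.

```r
# Ordinal: levels are listed from lowest to highest and ordered = TRUE
college_year <- factor(
  c("Junior", "Freshman", "Senior", "Sophomore", "Junior"),
  levels  = c("Freshman", "Sophomore", "Junior", "Senior"),
  ordered = TRUE
)
college_year < "Senior"   # comparisons are meaningful for ordered factors
table(college_year)       # counts are listed in level order

# Nominal: no order is declared, so R simply sorts the levels alphabetically
hair_color <- factor(c("brown", "blond", "black", "brown"))
is.ordered(hair_color)
levels(hair_color)
```

The alphabetical level order of a nominal factor is only a display convention; it carries no meaning about rank.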

Data Examples - R Star Wars Dataset

Dataset of characters from Star Wars franchise
Type ?starwars in the console to review data documentation.

Code

```{r example from R starwars data, echo=T}
my_starwars <- starwars           # save R starwars dataset to Global Environment                        
my_starwars |> glimpse(width=75)  # examine data
```

Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Le…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 1…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brow…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "ligh…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "b…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 5…
$ sex        <chr> "male", "none", "none", "male", "female", "male", "fem…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "f…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", …
$ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of …
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>,…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced…

Examining Data

A good way to examine categorical variables is to count how many observations are in each category.
For example, we can examine the Star Wars character data by species and gender
- There are WAY TOO MANY species…

                gender
species          feminine masculine
  Aleena                0         1
  Besalisk              0         1
  Cerean                0         1
  Chagrian              0         1
  Clawdite              1         0
  Droid                 1         5
  Dug                   0         1
  Ewok                  0         1
  Geonosian             0         1
  Gungan                0         3
  Human                 9        26
  Hutt                  0         1
  Iktotchi              0         1
  Kaleesh               0         1
  Kaminoan              1         1
  Kel Dor               0         1
  Mirialan              2         0
  Mon Calamari          0         1
  Muun                  0         1
  Nautolan              0         1
  Neimodian             0         1
  Pau'an                0         1
  Quermian              0         1
  Rodian                0         1
  Skakoan               0         1
  Sullustan             0         1
  Tholothian            1         0
  Togruta               1         0
  Toong                 0         1
  Toydarian             0         1
  Trandoshan            0         1
  Twi'lek               1         1
  Vulptereen            0         1
  Wookiee               0         2
  Xexto                 0         1
  Yoda's species        0         1
  Zabrak                0         2

Lecture 13 In-class Exercises - Q2

Session ID: bua345s25

Is species a nominal or ordinal variable?

Star Wars Example - Examining categorical Data

Human is the most common species.
We can filter the data to look at those characters only.
For example, we can examine prevalence of each gender and eye color among the human characters.

Code

```{r summary of gender and eye color, echo=T}
my_starwars |> filter(species=="Human") |> # filter to humans only
  dplyr::select(gender, eye_color) |>      # select gender and eye_color variables
  table()                                  # create table of counts
```

           eye_color
gender      blue blue-gray brown dark hazel unknown yellow
  feminine     3         0     4    0     1       1      0
  masculine    9         1    12    1     1       0      2

Lecture 13 In-class Exercises - Q3

Session ID: bua345s25

Which R command is used to summarize the number of observations in each gender x eye_color combination?

NOTE: This useful command will also be used in HW 6.

Code

```{r summary of gender and eye color poll, echo=T}
my_starwars |> filter(species=="Human") |> # filter to humans only
  dplyr::select(gender, eye_color) |>      # select gender and eye_color variables
  table()                                  # create table of counts
```

           eye_color
gender      blue blue-gray brown dark hazel unknown yellow
  feminine     3         0     4    0     1       1      0
  masculine    9         1    12    1     1       0      2

Data Examples - GT cars dataset

Deluxe automobiles from the 2014-2017 period
Type ?gt::gtcars in the console to see data documentation.

Code

```{r import gtcars data, echo=T}
gt_cars <- gtcars |> # import data
  glimpse(width=75) 
```

Rows: 47
Columns: 15
$ mfr         <chr> "Ford", "Ferrari", "Ferrari", "Ferrari", "Ferrari", "…
$ model       <chr> "GT", "458 Speciale", "458 Spider", "458 Italia", "48…
$ year        <dbl> 2017, 2015, 2015, 2014, 2016, 2015, 2017, 2015, 2015,…
$ trim        <chr> "Base Coupe", "Base Coupe", "Base", "Base Coupe", "Ba…
$ bdy_style   <chr> "coupe", "coupe", "convertible", "coupe", "coupe", "c…
$ hp          <dbl> 647, 597, 562, 562, 661, 553, 680, 652, 731, 949, 573…
$ hp_rpm      <dbl> 6250, 9000, 9000, 9000, 8000, 7500, 8250, 8000, 8250,…
$ trq         <dbl> 550, 398, 398, 398, 561, 557, 514, 504, 509, 664, 476…
$ trq_rpm     <dbl> 5900, 6000, 6000, 6000, 3000, 4750, 5750, 6000, 6000,…
$ mpg_c       <dbl> 11, 13, 13, 13, 15, 16, 12, 11, 11, 12, 21, 16, 11, 1…
$ mpg_h       <dbl> 18, 17, 17, 17, 22, 23, 17, 16, 16, 16, 22, 22, 18, 2…
$ drivetrain  <chr> "rwd", "rwd", "rwd", "rwd", "rwd", "rwd", "awd", "awd…
$ trsmn       <chr> "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a",…
$ ctry_origin <chr> "United States", "Italy", "Italy", "Italy", "Italy", …
$ msrp        <dbl> 447000, 291744, 263553, 233509, 245400, 198973, 29800…

Lecture 13 In-class Exercises - Q4

Session ID: bua345s25

Which variable in gt_cars, body style (bdy_style) or year, could be treated as ordinal?

Code

```{r examine categories within gt data, echo=T}
gt_cars |> dplyr::select(bdy_style, year) |> table()
```

             year
bdy_style     2014 2015 2016 2017
  convertible    0    2    2    1
  coupe          2    7   16    7
  hatchback      0    0    2    0
  sedan          0    0    7    1

Categorical Regression

Categorical variables can (and should) be used in linear regression models
If categories exist in the data and we ignore them, then we assume that the linear relationship is the SAME for all categories.
The following two examples illustrate the importance of adding a categorical variable to a regression model when needed.
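
Behind the scenes, R adds a categorical variable to a regression by converting it to a 0/1 indicator column. The sketch below uses a made-up mini data set (the values are assumptions; the column names mirror the celebrity example that follows) and `model.matrix()` to show the columns R would build.

```r
# Made-up mini data set; column names mirror the celebrity example that follows
toy <- data.frame(
  Age    = c(27, 35, 44, 30, 41, 52),
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male")
)

# Design matrix R builds for a model with Age and Gender
X <- model.matrix(~ Age + Gender, data = toy)
X
```

The `GenderMale` column is 0 for the baseline category and 1 otherwise, which is why its coefficient acts as a shift in the intercept.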

Data Example - Celebrity Salaries Data

Many (not all) celebrities see a decrease in their annual income as they age.
There is a negative relationship between earnings and age.
Is this relationship the same for males and females?

Celebrity	Earnings	Age	Gender
Taylor Swift	67	27	Female
Lady Gaga	59	31	Female
Gisele Bundchen	54	32	Female
Beyonce	54	35	Female
Kim Kardashian	51	36	Female
Sofia Vergara	28	44	Female

Celebrity Salaries

Examining categories and Correlations

Code

```{r echo=T}
celeb |> dplyr::select(Gender) |> table()  # examine counts for each category

celeb |> dplyr::select(Earnings, Age) |> cor() |> round(2) # examine correlation between earnings and age
```

Gender
Female   Male 
     8      8 
         Earnings   Age
Earnings     1.00 -0.86
Age         -0.86  1.00

Celebrity Salaries Data - Examining Model Options

Option 1: SLR

Model assumes no difference between males and females.
In this case we use Base R command for regression, lm.
Model created with lm can be used to create an interactive plot.
The interactive plot shows the model equation when the cursor is on the line.

$\hat{y} = 136.89 - 2.23\times Age$

Celebrity Salaries Data - Examining Model Options

Option 2: Categorical Regression Model

SLR model is okay, but we can do better.
It is logical to create a model that specifies a difference in earnings between males and females.
We add Gender to the model to test if this difference is significant.
The interactive plot shows each model equation when the cursor is on the line.

Females: $\hat{y} = 134.11 - 2.37\times Age$

Males: $\hat{y} = 149.5 - 2.37\times Age$
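
The two model options can be compared side by side. This is a sketch on simulated, made-up data (not the celebrity data set) in which both groups share a slope but differ in intercept. It fits both models with `lm` and recovers one line per category from the categorical fit.

```r
set.seed(13)
n_per_group <- 8
sim <- data.frame(
  Age    = round(runif(2 * n_per_group, 25, 60)),
  Gender = rep(c("Female", "Male"), each = n_per_group)
)
# Made-up "truth": common slope, intercept 15 units higher for males
sim$Earnings <- 135 - 2.4 * sim$Age + 15 * (sim$Gender == "Male") +
  rnorm(nrow(sim), sd = 2.5)

slr_fit <- lm(Earnings ~ Age, data = sim)            # Option 1: one line
cat_fit <- lm(Earnings ~ Age + Gender, data = sim)   # Option 2: parallel lines

# One intercept per category, one shared slope
b <- coef(cat_fit)
lines_tbl <- data.frame(
  Gender    = c("Female", "Male"),
  intercept = c(b[["(Intercept)"]], b[["(Intercept)"]] + b[["GenderMale"]]),
  slope     = b[["Age"]]
)
lines_tbl

# Adjusted R-squared for each option
c(slr         = summary(slr_fit)$adj.r.squared,
  categorical = summary(cat_fit)$adj.r.squared)
```

When the categories really do differ, as they do by construction here, the categorical model explains noticeably more of the variability than the single line.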

Celebrity Salaries Data - MLR Model Output

We see the model equation (poorly formatted) for each gender, in the plot.
We can also get these equations from the model output, but it requires a little work.
Examine the model output:

                        Model Summary                          
--------------------------------------------------------------
R                       0.987       RMSE                2.545 
R-Squared               0.975       MSE                 6.477 
Adj. R-Squared          0.971       Coef. Var           4.959 
Pred R-Squared          0.962       AIC                83.299 
MAE                     2.197       SBC                86.389 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares        DF    Mean Square       F         Sig. 
---------------------------------------------------------------------
Regression    4007.300         2       2003.650    251.332    0.0000 
Residual       103.637        13          7.972                      
Total         4110.937        15                                     
---------------------------------------------------------------------

                                    Parameter Estimates                                      
--------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
--------------------------------------------------------------------------------------------
(Intercept)    134.112         4.141                  32.384    0.000    125.165    143.059 
        Age     -2.370         0.114       -0.918    -20.710    0.000     -2.617     -2.123 
 GenderMale     15.383         1.420        0.480     10.830    0.000     12.315     18.452 
--------------------------------------------------------------------------------------------
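
Several Model Summary values can be reproduced from the ANOVA table. The sketch below copies the sums of squares and degrees of freedom from the output above and applies the usual formulas; the results should agree with the reported R-Squared, Adj. R-Squared, and F up to rounding.

```r
# Values copied from the ANOVA table above
ss_regression <- 4007.300
ss_residual   <- 103.637
ss_total      <- 4110.937
df_regression <- 2
df_residual   <- 13
df_total      <- 15

# Proportion of total variability explained by the model
r_squared <- ss_regression / ss_total

# Same idea, but each sum of squares is divided by its degrees of freedom
adj_r_squared <- 1 - (ss_residual / df_residual) / (ss_total / df_total)

# F = Mean Square Regression / Mean Square Residual
f_stat <- (ss_regression / df_regression) / (ss_residual / df_residual)

round(c(R_squared = r_squared, Adj_R_squared = adj_r_squared, F = f_stat), 3)
```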

Getting Model Equations from Regression Output

By default R chooses baseline categories alphabetically
- Female is before Male so Female is the baseline
- Female SLR Model:
  - Est. Earnings = 134.112 - 2.37 * Age
- Male SLR Model:
  - Est. Earnings = 134.112 - 2.37 * Age + 15.383
  - Est. Earnings = 134.112 + 15.383 - 2.37*Age
  - Est. Earnings = 149.495 - 2.37 * Age
The difference between the intercepts for Females and Males is shown in the model output.
- Difference in intercepts is labeled with name of categorical variable and category
- Difference (Increase) for Males is labeled GenderMale and equals 15.383
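
The arithmetic above can be sketched as a small function. The coefficients are copied from the Parameter Estimates table; the function name `est_earnings()` is hypothetical and only illustrates how the 0/1 indicator for Male shifts the intercept.

```r
# Coefficients copied from the Parameter Estimates table
b_intercept  <- 134.112
b_age        <- -2.370
b_gendermale <- 15.383

# Gender enters as a 0/1 indicator: 0 = Female (baseline), 1 = Male
est_earnings <- function(age, gender) {
  b_intercept + b_age * age + b_gendermale * (gender == "Male")
}

# Intercepts are the estimates at Age = 0
female_intercept <- est_earnings(0, "Female")
male_intercept   <- est_earnings(0, "Male")
c(Female = female_intercept, Male = male_intercept)

# At any shared age the gap between the two lines is the GenderMale coefficient
gap_at_40 <- est_earnings(40, "Male") - est_earnings(40, "Female")
gap_at_40
```

Because the lines are parallel, the gap is the same at every age.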

Lecture 13 In-class Exercises - Q5

Session ID: bua345s25

Based on our categorical regression model, is the difference between male and female earnings (approx. $15M) statistically significant?

HINT: Look at the p-value for the GenderMale term in the model to answer this question.

Parameter Estimates Table

Data Example - House Remodeling Data

What is the effect of remodeling on house selling price?

Price	Square_Feet	Remodeled
554000	2702	No
484000	2378	No
391000	1846	No
354000	1820	No
410000	1794	No
349000	1768	No

House Remodeling Data

Examine Categories and Correlations

What is the effect of remodeling on house selling price?

Code

```{r  examine categories and correlations, echo=T}
# examine number of obs in each category                                             
house_remodel |> select(Remodeled) |> table() 

# correlation between price and square feet
house_remodel |> select(Price, Square_Feet) |> cor() |> round(2)
```

Remodeled
 No Yes 
 29  28 
            Price Square_Feet
Price        1.00        0.75
Square_Feet  0.75        1.00

House Remodeling Data - Examining Model Options

Option 1: SLR

SLR model assumes no difference due to remodeling
Again, we use Base R command for regression, lm
Model created with lm can be used to create an interactive plot.
Interactive plot shows the model equation when the cursor is on the line.

House Remodeling Data - Examining Model Options

Option 2: Categorical Regression Model

SLR model is okay, but there is probably a difference between Remodeled and un-Remodeled houses.
To test for that difference we add the categorical variable Remodeled to the model.
The interactive plot shows each model equation when the cursor is on the line.

House Remodel data - MLR Model Output

We can see the model equation (poorly formatted) for each category, in the plot.
We can also get these equations from the model output, but it requires a little work.
Examine the model output:

                              Model Summary                                
--------------------------------------------------------------------------
R                           0.924       RMSE                    32079.481 
R-Squared                   0.854       MSE                1029093133.260 
Adj. R-Squared              0.848       Coef. Var                   8.223 
Pred R-Squared              0.836       AIC                      1352.620 
MAE                     28012.693       SBC                      1360.792 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                      ANOVA                                        
----------------------------------------------------------------------------------
                        Sum of                                                    
                       Squares        DF         Mean Square       F         Sig. 
----------------------------------------------------------------------------------
Regression    341892090457.686         2    170946045228.843     157.37    0.0000 
Residual       58658308595.823        54      1086264973.997                      
Total         400550399053.509        56                                          
----------------------------------------------------------------------------------

                                         Parameter Estimates                                          
-----------------------------------------------------------------------------------------------------
       model          Beta    Std. Error    Std. Beta      t        Sig          lower         upper 
-----------------------------------------------------------------------------------------------------
 (Intercept)    137549.093     17620.447                  7.806    0.000    102222.224    172875.963 
 Square_Feet       137.879        10.836        0.670    12.725    0.000       116.155       159.602 
RemodeledYes     90917.216      8834.268        0.542    10.291    0.000     73205.575    108628.858 
-----------------------------------------------------------------------------------------------------

Getting Model Equations from Regression Output

By default R chooses baseline categories alphabetically
- No is before Yes so un-Remodeled houses are the baseline
- un-Remodeled SLR Model:
  - Est. Price = 137549.093 + 137.879 * Square_Feet
- Remodeled SLR Model:
  - Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216
  - Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet
  - Est. Price = _____ + 137.879 * Square_Feet
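
The same approach works for the house data. In this sketch the coefficients are copied from the Parameter Estimates table, while the function name `est_price()` and the 2000 square foot house are assumptions chosen for illustration.

```r
# Coefficients copied from the Parameter Estimates table
b_intercept    <- 137549.093
b_square_feet  <- 137.879
b_remodeledyes <- 90917.216

# Remodeled enters as a 0/1 indicator: 0 = No (baseline), 1 = Yes
est_price <- function(square_feet, remodeled) {
  b_intercept + b_square_feet * square_feet +
    b_remodeledyes * (remodeled == "Yes")
}

# Two hypothetical houses of the same size
price_no  <- est_price(2000, "No")
price_yes <- est_price(2000, "Yes")
round(c(No = price_no, Yes = price_yes, gap = price_yes - price_no))
```

For two houses of the same size, the estimated price gap equals the `RemodeledYes` coefficient, whatever the square footage.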

Lecture 13 In-class Exercises - Q6

Session ID: bua345s25

The difference between the intercepts for Remodeled and un-Remodeled houses is shown in the model output.

Difference in intercepts is labeled with name of categorical variable and category
Difference for remodeling is labeled RemodeledYes and equals 90917.216

What is the intercept for the prices of Remodeled houses in the Categorical Regression model (rounded to the closest thousand, $K)?

Parameter Estimates Table

Lecture 13 In-class Exercises - Q7

Session ID: bua345s25

Based on our categorical regression model, is the difference in selling price between remodeled (Remodeled = Yes) and un-remodeled (Remodeled = No) homes statistically significant?

Parameter Estimates Table

Key Points from Today

Categorical Parallel Lines Model
- Separate SLR model for each category.
- Modeling categories simultaneously with one model is
  - more efficient
  - more accurate
Lecture 14: Similar model BUT each category has a different slope
HW 5 is due on 2/26
HW 6 will be posted on 2/27 and due on 3/5.
First set of data in HW 6 is almost identical to house_remodel data.

To submit an Engagement Question or Comment about material from Lecture 13: Submit it by midnight today (day of lecture).

--- title: "BUA 345 - Lecture 13" subtitle: "Categorical Regression - Parallel Lines Model" author: "Penelope Pooler Eisenbies" date: last-modified lightbox: true toc: true toc-depth: 3 toc-location: left toc-title: "Table of Contents" toc-expand: 1 format: html: code-line-numbers: true code-fold: true code-tools: true execute: echo: fenced --- ## Housekeeping ```{r setup, echo=FALSE, warning=F, message=F, include=F} #| include: false # this line specifies options for default options for all R Chunks knitr::opts_chunk$set(echo=F) # suppress scientific notation options(scipen=100) # install helper package that loads and installs other packages, if needed if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/") # install and load required packages pacman::p_load(pacman,tidyverse, magrittr, olsrr, shadowtext, mapproj, knitr, kableExtra, countrycode, usdata, maps, RColorBrewer, gridExtra, ggthemes, gt, tools, ggiraphExtra) # verify packages # p_loaded() ``` **HW 5 is due 2/26/2025** - 2 day grace period ::: nonincremental - Demo videos are posted ::: ### Today's plan - Review of - SLR and MLR - Hypothesis testing in Regression - Categorical Parallel Lines Model ::: fragment **In-class Polling (Session ID: bua345s25)** ::: ## Review Question - Import data - Recall The Tennessee Real Estate data: :::: fragment ::: r-fit-text ```{r import data } # import and examine data tn_houses <- read_csv("data/TN_houses.csv", show_col_types = F) head(tn_houses, 4) |> kable() ``` ::: :::: ## ### Review Question - Natural Log transformation - We will build an MLR model using the natural log of Price (ln_Price) - This transformation is needed because Price is **RIGHT-SKEWED**. 
::: fragment ```{r ln transformation of Price, echo=T} tn_houses <- tn_houses |> mutate(ln_Price = log(Price)) # ln_Price = the natural log of price ``` ```{r} tn_houses <- tn_houses |> select(Price, ln_Price, Bedrooms, Bathrooms, Year_built:State) head(tn_houses, 3) |> kable() ``` ::: ## Review Question - Histograms - Histogram of `Price` shows distribution of raw data is right-skewed with high outliers. - Histogram of `ln_Price` shows distribution of transformed data is symmetric and normally distributed. ::: fragment ```{r display of histograms, fig.width=15, fig.height=6, fig.align='center', echo=F, message=F} # histogram of original price data hist_Price <- tn_houses |> ggplot() + geom_histogram(aes(x=Price), color="darkblue", fill="lightblue") + labs(x="Selling Price", y="Frequency") + theme_classic() + theme(axis.title = element_text(size=18), axis.text = element_text(size=15), plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2)) # histogram of ln_Price hist_ln_Price <- tn_houses |> ggplot() + geom_histogram(aes(x=ln_Price), color="darkgreen", fill="lightgreen") + labs(x="Natural Log of Selling Price", y="Frequency") + theme_classic()+ theme(axis.title = element_text(size=18), axis.text = element_text(size=15), plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2)) # display of these two histograms side by side grid.arrange(hist_Price, hist_ln_Price, ncol=2) ``` ::: ## ### Lecture 13 In-class Exercises - Q1 - Review ***Session ID: bua345s25*** **Back-transforming Model Estimates** **Based on the model output, What is the estimated price of a house with 4 bedrooms and 3 bathrooms (rounded to closest \$1000)?** ::: fragment ```{r review question incomplete R code, eval=F, echo=T} (y_est <- ___ + 0.056*4 + 0.375*3) # fill in intercept from R output (est_dollars <- exp(y_est)) # back_transform y estimate # -3 is correct input to round to closest thousand round(est_dollars,-3) # withot piping est_dollars |> round(-3) # 
with piping ``` ::: - NOTE: All 3 steps above could be done with one line but it is helpful to break it down when learning. ## ### Regression Terms - $R^2$ and Adjusted $R^2$ - ***R*** is the correlation coefficient, $R_{XY}$ - Regression Output only shows absolute value of R. - $R^2$ is $R_{XY}^2$ the square of the correlation coefficient. - $R^2$ is also called coefficient of determination. - **Meaning of** $R^2$ in SLR: Proportion of variability in y explained by X - **Adjusted** $R^2$ adjusts $R^2$ for number of explanatory (X) variables in model. - Much more to come about this. - Meaning of **Adjusted** $R^2$ in MLR is a little less specific but it is similar to $R^2$. - Other values will be covered in upcoming lectures. ## Review of Parameter Estimates Output - **`model`** column lists intercept and X variables in model - **`Beta`** column shows the estimate of the $\beta$ coefficients for each variable in model. - **`Std. Error`** shows variability of each estimated Beta coefficient estimate. - **`t`** = `Beta/Std. Error`, the test statistic for each Beta coefficient estimate. - **`Sig`** is P-value for Hypothesis test for each Beta coefficient estimate: ::: fragment **Recall Hypotheses being tested:** ![](img/regression_hypotheses.png){fig-align="center" height="2.5in"} ::: ## Review of Parameter Estimates Output - Reminder of Example Output (`Sig` is P-value column): ::: fragment ![](img/reviewQ_mlr2.jpg){fig-align="center"} ::: - Recall Interpretation guidelines for P-value: ::: fragment ![](img/pv_decisions.png){fig-align="center"} ::: ## Types of Data - Review ![](img/types_of_data.png){fig-align="center"} ## Types of Data - More on Categorical Data - Categorical variables are categories that describe data observations - Gender, Location, Hair Color, Eye Color, Location, etc. 
- Ordinal Categories have an **OBJECTIVE** order: - Grades: A, B, C, D - College year: Freshman, Sophomore, Junior, Senior - Nominal Categories don't have an objective order: - Location - Hair color - Gender ## Data Examples - R Star Wars Dataset - Dataset of characters from Star Wars franchise - Type `?starwars` in the console to review data documentation. :::: fragment ::: r-fit-text ```{r example from R starwars data, echo=T} my_starwars <- starwars # save R starwars dataset to Global Environment my_starwars |> glimpse(width=75) # examine data ``` ::: :::: ## Examining Data - A good way to examine categorical data variables is to examine how many observations are in each category. - For example, we can examine the Star Wars character data by species and gender - There are WAY TOO MANY species... ::: fragment ```{r summary of starwars species} # summarizing starwars character data by species and gender my_starwars |> dplyr::select(species, gender) |> table() ``` ::: ## ### Lecture 13 In-class Exercises - Q2 ***Session ID: bua345s25*** **Is species a nominal or ordinal variable?** ![](img/types_of_data.png){fig-align="center"} ## Star Wars Example - Examining categorical Data - `Human` is the most common species. - We can filter the data to look at those characters only. - For example, we can examine prevalence of each gender and eye color among the human characters. ::: fragment ```{r summary of gender and eye color, echo=T} my_starwars |> filter(species=="Human") |> # filter to humans only dplyr::select(gender, eye_color) |> # select gender and eye_color variables table() # create table of counts ``` ::: ## ### Lecture 13 In-class Exercises - Q3 ***Session ID: bua345s25*** **Which R command is used to summarize the number of observations in each gender x eye_color combination?** NOTE: This useful command will also be used in HW 6. 
```{r summary of gender and eye color poll, echo=T} my_starwars |> filter(species=="Human") |> # filter to humans only dplyr::select(gender, eye_color) |> # select gender and eye_color variables table() # create table of counts ``` ## Data Examples - GT cars dataset - Deluxe automobiles from the 2014-2017 period - Type `?gt::gtcars` in the console to see data documentation. ::: fragment ```{r import gtcars dats, echo=T} gt_cars <- gtcars |> # import data glimpse(width=75) ``` ::: ## ### Lecture 13 In-class Exercises - Q4 ***Session ID: bua345s25*** **Which variable in gt_cars, body style (`bdy_style`) or `year` could be treated as ordinal?** ```{r examine categories within gt data, echo=T} gt_cars |> dplyr::select(bdy_style, year) |> table() ``` ## Categorical Regression - Categorical variables can (and should) be used in linear regression models - If categories exist in the data and we ignore them, then we assume that the linear relationship is the SAME FOR all categories. - The following two examples illustrate the importance of adding a categorical variable to a regression model when needed. ## Data Example - Celebrity Salaries Data - Many (not all) celebrities see a decrease in their annual income as they age. - There is a negative relationship between wages and ages. - Is this relationship the same for males a females? ::: fragment ```{r import and examine celebrity data} # import celebrity data celeb <- read_csv("data/celeb.csv", show_col_types = F) head(celeb) |> kable() ``` ::: ## ### Celebrity Salaries #### Examining categories and Correlations ::: fragment ```{r echo=T} celeb |> dplyr::select(Gender) |> table() # examine counts for each category celeb |> dplyr::select(Earnings, Age) |> cor() |> round(2) # examine correlation between earnings and age ``` ::: ## ### Celebrity Salaries Data - Examining Model Options :::::::: columns :::: {.column width="48%"} ::: fragment #### Option 1: SLR - Model assumes no difference between males and females. 
- In this case we use Base R command for regression, `lm`. - Model created with `lm` can be used to create an interactive plot. - The interactive plot shows the model equation when the cursor is on the line. ::: ```{r celeb slr model, echo=F} # slr model created using lm celeb_slr <- lm(Earnings ~ Age, data=celeb) # create and save interactive plot int_clb_slr <- ggPredict(celeb_slr, interactive=T) ``` :::: ::: {.column width="2%"} ::: :::: {.column width="50%"} ::: fragment ```{r int_plot1, echo=F} int_clb_slr ``` $\hat{y} = 136.89 - 2.23\times Age$ ::: :::: :::::::: ## ### Celebrity Salaries Data - Examining Model Options ::::::::: columns :::: {.column width="48%"} ::: fragment #### Option 2: Categorical Regression Model ::: - SLR model is okay, but we can do better. - It is logical to create a model that specifies a difference in earnings between males and females. - We add `Gender` to the model to test if this difference is significant. - The interactive plot shows each model equation when the cursor is on the line. ```{r celeb mlr model, echo=F} # mlr model created using lm celeb_cat_lm <- lm(Earnings ~ Age + Gender, data=celeb) # create and save interactive plot int_clb_mlr <- ggPredict(celeb_cat_lm, interactive=T) ``` :::: ::: {.column width="2%"} ::: ::::: {.column width="50%"} ::: fragment ```{r int_plot2, echo=F} int_clb_mlr ``` ::: ::: fragment Females: $\hat{y} = 134.11 - 2.37\times Age$ Males: $\hat{y} = 149.5 - 2.37\times Age$ ::: ::::: ::::::::: ## ### Celebrity Salaries Data - MLR Model Output - We see the model equation (poorly formatted) for each gender, in the plot. - We can also get these equations from the model output, but it requires a little work. 
- Examine the model output:

::: fragment
```{r celeb cat mlr formal output}
# formatted regression output
# model is saved and printed to screen
(celeb_cat_ols<- ols_regress(Earnings ~ Age + Gender, data=celeb))
```
:::

##

### Getting Model Equations from Regression Output

- By default R chooses baseline categories alphabetically
  - `Female` is before `Male` so `Female` is the baseline
- Female SLR Model:
  - `Est. Earnings = 134.112 - 2.37 * Age`
- Male SLR Model:
  - `Est. Earnings = 134.112 - 2.37 * Age + 15.383`
  - `Est. Earnings = 134.112 + 15.383 - 2.37 * Age`
  - `Est. Earnings = 149.495 - 2.37 * Age`
- The difference between the intercepts for Females and Males is shown in the model output.
  - Difference in intercepts is labeled with name of categorical variable and category
  - Difference (Increase) for Males is labeled **`GenderMale`** and equals **`15.383`**

##

### Lecture 13 In-class Exercises - Q5

***Session ID: bua345s25***

**Based on our categorical regression model, is the difference between male and female earnings (approx. 15 \$M) statistically significant?**

HINT: Look at the p-value for the **`GenderMale`** term in the model to answer this question.

#### Parameter Estimates Table

![](img/l13_celeb_cat_model_output.png){fig-align="center"}

##

### Data Example - House Remodeling Data

- What is the effect of remodeling on house selling price?

::: fragment
```{r import and examine house remodel data}
# import house remodeling data
house_remodel <- read_csv("data/house_remodel.csv", show_col_types = F)
head(house_remodel) |> kable()
```
:::

## House Remodeling Data

### Examine Categories and Correlations

- What is the effect of remodeling on house selling price?
:::: fragment
::: r-fit-text
```{r examine categories and correlations, echo=T}
# examine number of obs in each category
house_remodel |> select(Remodeled) |> table()

# correlation between price and square feet
house_remodel |> select(Price, Square_Feet) |> cor() |> round(2)
```
:::
::::

##

### House Remodeling Data - Examining Model Options

::::::: columns
:::: {.column width="50%"}
::: fragment
#### Option 1: SLR

- SLR model assumes no difference due to remodeling
- Again, we use Base R command for regression, `lm`
- Model created with `lm` can be used to create an interactive plot.
- Interactive plot shows the model equation when the cursor is on the line.
:::

```{r house remodel slr model, echo=F}
# slr model created using lm
rem_slr <- lm(Price ~ Square_Feet, data=house_remodel)

# create and save interactive plot
int_rem_slr <- ggPredict(rem_slr, interactive=T)
```
::::

:::: {.column width="50%"}
::: fragment
```{r echo=F}
int_rem_slr
```
:::
::::
:::::::

##

### House Remodeling Data - Examining Model Options

::::::: columns
:::: {.column width="50%"}
::: fragment
#### Option 2: Categorical Regression Model

- SLR model is okay, but there is probably a difference between `Remodeled` and un-`Remodeled` houses.
- To test for that difference we add the categorical variable `Remodeled` to the model.
- The interactive plot shows each model equation when the cursor is on the line.
:::

```{r house remodel mlr model, echo=F}
# mlr model created using lm
rem_cat_lm <- lm(Price ~ Square_Feet + Remodeled, data=house_remodel)

# create and save interactive plot
int_rem_mlr <- ggPredict(rem_cat_lm, interactive=T)
```
::::

:::: {.column width="50%"}
::: fragment
```{r echo=F}
int_rem_mlr
```
:::
::::
:::::::

## House Remodel data - MLR Model Output

- We can see the model equation (poorly formatted) for each category, in the plot.
- We can also get these equations from the model output, but it requires a little work.
- Examine the model output:

::: fragment
```{r house rem cat mlr formal output}
# formatted regression output
# model is saved and printed to screen
(house_rem_ols<- ols_regress(Price ~ Square_Feet + Remodeled, data=house_remodel))
```
:::

##

### Getting Model Equations from Regression Output

- By default R chooses baseline categories alphabetically
  - `No` is before `Yes` so un-`Remodeled` houses are the baseline
- un-Remodeled SLR Model:
  - `Est. Price = 137549.093 + 137.879 * Square_Feet`
- Remodeled SLR Model:
  - `Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216`
  - `Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet`
  - `Est. Price = _____ + 137.879 * Square_Feet`

##

### Lecture 13 In-class Exercises - Q6

***Session ID: bua345s25***

The difference between the intercepts for `Remodeled` and un-`Remodeled` houses is shown in the model output.

- Difference in intercepts is labeled with name of categorical variable and category
- Difference for remodeling is labeled **`RemodeledYes`** and equals **`90917.216`**

::: fragment
**What is the intercept for the prices of `Remodeled` houses in the Categorical Regression model (rounded to the closest thousand, \$K)?**
:::

::: fragment
#### Parameter Estimates Table

![](img/house_rem_cat_mlr2.jpg){fig-align="center"}
:::

##

### Lecture 13 In-class Exercises - Q7

***Session ID: bua345s25***

**Based on our categorical regression model, is the difference in selling price between remodeled (Remodeled = `Yes`) and un-remodeled (Remodeled = `No`) homes statistically significant?**

#### Parameter Estimates Table

![](img/house_rem_cat_mlr2.jpg){fig-align="center"}

## {background-image="img/tired_panda_faded.png"}

### Key Points from Today

- Categorical Parallel Lines Model
  - Separate SLR model for each category.
  - Modeling categories simultaneously with one model is
    - more efficient
    - more accurate
- Lecture 14: Similar model BUT each category has a different slope
- HW 5 is due on 2/26
- HW 6 will be posted on 2/27 and due on 3/5.
  - First set of data in HW 6 is **almost** identical to `house_remodel` data.

::: fragment
**To submit an Engagement Question or Comment about material from Lecture 13:** Submit it by midnight today (day of lecture).
:::
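
## Sketch - Parallel Lines Equations from Coefficients

The intercept arithmetic from the celebrity example can be sketched in a few lines of R. This is a minimal illustration, not part of the lecture code: it assumes the rounded coefficients reported in the celebrity model output (134.112, -2.37, 15.383), and `parallel_lines()` is a hypothetical helper written only for this sketch.

```{r parallel lines equations sketch, echo=T}
# Rounded coefficients copied from the celebrity model output (assumed values)
celeb_coefs <- c("(Intercept)" = 134.112, Age = -2.37, GenderMale = 15.383)

# Hypothetical helper: returns the intercept and slope of each parallel line
# coefs      = named coefficient vector
# x_name     = name of the quantitative predictor
# dummy_name = name of the indicator term for the non-baseline category
parallel_lines <- function(coefs, x_name, dummy_name) {
  base_int <- coefs[["(Intercept)"]]          # baseline category intercept
  data.frame(
    category  = c("baseline", dummy_name),
    intercept = c(base_int, base_int + coefs[[dummy_name]]),  # shift by dummy term
    slope     = rep(coefs[[x_name]], 2)       # same slope for both lines
  )
}

celeb_lines <- parallel_lines(celeb_coefs, "Age", "GenderMale")
celeb_lines
```

With a fitted model, `coef(celeb_cat_lm)` returns a named vector of the same form, so it could be passed to the helper in place of the hand-typed values.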
