Lecture 13 - Categorical Regression - Parallel Lines Model

Penelope Pooler Eisenbies
BUA 345

2024-02-27

Housekeeping

Today’s plan 📋
- Review

Review Question - Import data

Recall The Tennessee Real Estate data:

# import and examine data
tn_houses <- read_csv("data/TN_houses.csv", show_col_types = F) |>
  glimpse(width=75)

Rows: 2,430
Columns: 7
$ Bedrooms   <dbl> 4, 3, 3, 5, 4, 4, 4, 3, 4, 4, 4, 3, 3, 4, 3, 4, 5, 4, …
$ Bathrooms  <dbl> 4.5, 3.0, 4.5, 5.0, 4.0, 3.0, 2.0, 2.0, 3.5, 3.5, 3.0,…
$ Price      <dbl> 1365000, 300000, 566920, 650000, 456500, 380500, 31815…
$ Year_built <dbl> 2006, 1987, 1968, 1977, 1974, 1987, 1978, 1976, 1966, …
$ Address    <chr> "5442  GRANNY WHITE PIKE", "5608  HEARTHSTONE LN", "17…
$ City       <chr> "BRENTWOOD", "BRENTWOOD", "BRENTWOOD", "BRENTWOOD", "B…
$ State      <chr> "TN", "TN", "TN", "TN", "TN", "TN", "TN", "TN", "TN", …

Review Question - Natural Log transformation

We will build an MLR model using the natural log of Price (ln_Price)
This transformation is needed because Price is RIGHT-SKEWED.
Review of how to create this transformation

# create new variable in tn_houses dataset
# ln_price = the natural log of price
tn_houses <- tn_houses |>
  mutate(ln_price = log(Price)) |>
  glimpse( width=75)

Rows: 2,430
Columns: 8
$ Bedrooms   <dbl> 4, 3, 3, 5, 4, 4, 4, 3, 4, 4, 4, 3, 3, 4, 3, 4, 5, 4, …
$ Bathrooms  <dbl> 4.5, 3.0, 4.5, 5.0, 4.0, 3.0, 2.0, 2.0, 3.5, 3.5, 3.0,…
$ Price      <dbl> 1365000, 300000, 566920, 650000, 456500, 380500, 31815…
$ Year_built <dbl> 2006, 1987, 1968, 1977, 1974, 1987, 1978, 1976, 1966, …
$ Address    <chr> "5442  GRANNY WHITE PIKE", "5608  HEARTHSTONE LN", "17…
$ City       <chr> "BRENTWOOD", "BRENTWOOD", "BRENTWOOD", "BRENTWOOD", "B…
$ State      <chr> "TN", "TN", "TN", "TN", "TN", "TN", "TN", "TN", "TN", …
$ ln_price   <dbl> 14.12666, 12.61154, 13.24797, 13.38473, 13.03134, 12.8…

Review Question - Examination Variable Distrbutions

Histogram of Price (hist_price), shows distribution of raw data is right-skewed with high outliers.
Histogram of ln_price (hist_ln_price) shows distribution of ln transformed data is symmetric and normally distributed.

Review Question - Side by Side Comparison of Histograms

💥 Lecture 13 In-class Exercises - Q1 - Review 💥

Session ID: bua345s24

Back-transforming Model Estimates

Based on the model output, What is the estimated price of a house with 4 bedrooms and 3 bathrooms (rounded to closest $1000)?

# incomplete model equation
# fill in intercept and values from question
(y_est <- ___ + 0.056*4 + 0.375*3)

# back-transform y_est using exp function
# two versions of same code
(est_dollars <- exp(y_est)) 
(est_dollars <- y_est |> exp()) # with piping

# -3 is correct input to round to closest thousand
# two versions of same code
round(est_dollars,-3)
est_dollars |> round(-3)        # with piping

NOTE: All 3 steps above could be done with one line but it is helpful to break it down when learning.

Review of Regression Terms

$R^2$ and Adjusted $R^2$

R is the correlation coefficient, $R_{XY}$
- Regression Output only shows absolute value of R.
$R^2$ is $R_{XY}^2$ the square of the correlation coefficient.
$R^2$ is also called coefficient of determination.
Meaning of $R^2$ in SLR: Proportion of variability in y explained by X

Adjusted $R^2$ adjusts $R^2$ for number of explanatory (X) variables in model.
- Much more to come about this.
- Meaning of Adjusted $R^2$ in MLR is a little less specific but it issimilar to $R^2$.
Other values will be covered in upcoming lectures.

Review of Parameter Estimates Output

model column lists intercept and X variables in model
Beta column shows the estimate of the $\beta$ coefficients for each variable in model.
Std. Error shows variability of each estimated Beta coefficient estimate.
t = Beta/Std. Error, the test statistic for each Beta coefficient estimate.
Sig is P-value for Hypothesis test for each Beta coefficient estimate:

Recall Hypotheses being tested:

Review of Parameter Estimates Output

Recall Interpretation guidelines for P-value:

Reminder of Example Output:

Types of Data - Review (Previous Stat Course)

Types of Data - More on Categorical Data

Categorical variables are categories that describe data observations
- Gender, Location, Hair Color, Eye Color, Location, etc.
Ordinal Categories have an OBJECTIVE order:
- Grades: A, B, C, D
- College year: Freshman, Sophomore, Junior, Senior
Nominal Categories don’t have an objective order:
- Location
- Hair color
- Gender

Data Examples - R Star Wars Dataset

Dataset of characters from Star Wars franchise
Type ?starwars in the console to review data documentation.

my_starwars <- starwars           # save R starwars dataset to Global Environment
my_starwars |> glimpse(width=75)  # examine data

Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Le…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 1…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brow…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "ligh…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "b…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 5…
$ sex        <chr> "male", "none", "none", "male", "female", "male", "fem…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "f…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", …
$ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of …
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>,…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced…

Examining Data

A good way to examine categorical data variables is to examine how many observations are in each category.
For example, we can examine the Star Wars character data by species and gender
- There are WAY TOO MANY species…

# summarizing starwars character data by species and gender
my_starwars |> select(species, gender) |> table()

                gender
species          feminine masculine
  Aleena                0         1
  Besalisk              0         1
  Cerean                0         1
  Chagrian              0         1
  Clawdite              1         0
  Droid                 1         5
  Dug                   0         1
  Ewok                  0         1
  Geonosian             0         1
  Gungan                0         3
  Human                 9        26
  Hutt                  0         1
  Iktotchi              0         1
  Kaleesh               0         1
  Kaminoan              1         1
  Kel Dor               0         1
  Mirialan              2         0
  Mon Calamari          0         1
  Muun                  0         1
  Nautolan              0         1
  Neimodian             0         1
  Pau'an                0         1
  Quermian              0         1
  Rodian                0         1
  Skakoan               0         1
  Sullustan             0         1
  Tholothian            1         0
  Togruta               1         0
  Toong                 0         1
  Toydarian             0         1
  Trandoshan            0         1
  Twi'lek               1         1
  Vulptereen            0         1
  Wookiee               0         2
  Xexto                 0         1
  Yoda's species        0         1
  Zabrak                0         2

💥 Lecture 13 In-class Exercises - Q2 💥

Session ID: bua345s24

Is species a nominal or ordinal variable?

Star wars Example - Examining categorical Data

Human is the most common species.
We can filter the data to look at those characters only.
For example, we can examine prevalence of each gender and eye color among the human characters.

my_starwars |> filter(species=="Human") |> # filter to humans only
  select(gender, eye_color) |>    # examine genders and eye colors
  table()

           eye_color
gender      blue blue-gray brown dark hazel unknown yellow
  feminine     3         0     4    0     1       1      0
  masculine    9         1    12    1     1       0      2

💥 Lecture 13 In-class Exercises - Q3 💥

Session ID: bua345s24

Which R command is used to summarize the number of observations in each gender x eye_color combination?

NOTE: This useful command will also be used in HW 6.

my_starwars |> filter(species=="Human") |> # filter to humans only
  select(gender, eye_color) |>    # examine genders and eye colors
  table()

           eye_color
gender      blue blue-gray brown dark hazel unknown yellow
  feminine     3         0     4    0     1       1      0
  masculine    9         1    12    1     1       0      2

Data Examples - GT cars dataset

Deluxe automobiles from the 2014-2017 period
Type ?gt::gtcars in the console to see data documentation.

gt_cars <- gtcars |> # import data
  glimpse(width=75)

Rows: 47
Columns: 15
$ mfr         <chr> "Ford", "Ferrari", "Ferrari", "Ferrari", "Ferrari", "…
$ model       <chr> "GT", "458 Speciale", "458 Spider", "458 Italia", "48…
$ year        <dbl> 2017, 2015, 2015, 2014, 2016, 2015, 2017, 2015, 2015,…
$ trim        <chr> "Base Coupe", "Base Coupe", "Base", "Base Coupe", "Ba…
$ bdy_style   <chr> "coupe", "coupe", "convertible", "coupe", "coupe", "c…
$ hp          <dbl> 647, 597, 562, 562, 661, 553, 680, 652, 731, 949, 573…
$ hp_rpm      <dbl> 6250, 9000, 9000, 9000, 8000, 7500, 8250, 8000, 8250,…
$ trq         <dbl> 550, 398, 398, 398, 561, 557, 514, 504, 509, 664, 476…
$ trq_rpm     <dbl> 5900, 6000, 6000, 6000, 3000, 4750, 5750, 6000, 6000,…
$ mpg_c       <dbl> 11, 13, 13, 13, 15, 16, 12, 11, 11, 12, 21, 16, 11, 1…
$ mpg_h       <dbl> 18, 17, 17, 17, 22, 23, 17, 16, 16, 16, 22, 22, 18, 2…
$ drivetrain  <chr> "rwd", "rwd", "rwd", "rwd", "rwd", "rwd", "awd", "awd…
$ trsmn       <chr> "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a",…
$ ctry_origin <chr> "United States", "Italy", "Italy", "Italy", "Italy", …
$ msrp        <dbl> 447000, 291744, 263553, 233509, 245400, 198973, 29800…

💥 Lecture 13 In-class Exercises - Q4 💥

Session ID: bua345s24

Which variable in gt_cars, body style (bdy_style) or year could be treated as ordinal?

gt_cars |> select(bdy_style, year) |> table()

             year
bdy_style     2014 2015 2016 2017
  convertible    0    2    2    1
  coupe          2    7   16    7
  hatchback      0    0    2    0
  sedan          0    0    7    1

Categorical Regression

Categorical variables can (and should) be used in linear regression models
If categories exist in the data and we ignore them, then we assume that the linear relationship is the SAME FOR all categories.
The following two examples illustrate the importance of adding a categorical variable to a regression model when needed.

Data Example - Celebrity Salaries Data

Many (not all) celebrities see a decrease in their annual income as they age.
There is a negative relationship between wages and ages.
Is this relationship the same for males a females?

# import celebrity data
celeb <- read_csv("data/celeb.csv", show_col_types = F) |>
  glimpse(width=75)

Rows: 16
Columns: 4
$ Celebrity <chr> "Taylor Swift", "Lady Gaga", "Gisele B\xfcndchen", "Bey…
$ Earnings  <dbl> 67, 59, 54, 54, 51, 28, 24, 70, 81, 70, 77, 68, 64, 56,…
$ Age       <dbl> 27, 31, 32, 35, 36, 44, 47, 29, 29, 32, 32, 35, 36, 39,…
$ Gender    <chr> "Female", "Female", "Female", "Female", "Female", "Fema…

Celebrity Salaries Data - Examining categories and Correlations

celeb |> select(Gender) |> table()  # examine counts for each category

Gender
Female   Male 
     8      8

celeb |> select(Earnings, Age) |> cor() |> round(2) # examine correlation between earnings and age

         Earnings   Age
Earnings     1.00 -0.86
Age         -0.86  1.00

Celebrity Salaries Data - Examining Model Options

Model Option 1: SLR

Model assumes no difference between males and females.
In this case we use Base R command for regression, lm.
Model created with lm can be used to create an interactive plot.
The interactive plot shows the model equation when the cursor is on the line.

Celebrity Salaries Data - Examining Model Options

Model Option 2: Categorical Regression Model

SLR model is okay, but we can do better.
It is logical to create a model that specifies a difference in earnings between males and females.
We add Gender to the model to test if this difference is significant.
The interactive plot shows each model equation when the cursor is on the line.

Celebrity Salaries Data - MLR Model Output

We see the model equation (poorly formatted) for each gender, in the plot.
We can also get these equations from the model output, but it requires a little work.
Examine the model output:

# formatted regression output
# model is saved and printed to screen
(celeb_cat_ols<- ols_regress(Earnings ~ Age + Gender, data=celeb))

                        Model Summary                          
--------------------------------------------------------------
R                       0.987       RMSE                2.545 
R-Squared               0.975       MSE                 7.972 
Adj. R-Squared          0.971       Coef. Var           4.959 
Pred R-Squared          0.962       AIC                83.299 
MAE                     2.197       SBC                86.389 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares        DF    Mean Square       F         Sig. 
---------------------------------------------------------------------
Regression    4007.300         2       2003.650    251.332    0.0000 
Residual       103.637        13          7.972                      
Total         4110.937        15                                     
---------------------------------------------------------------------

                                    Parameter Estimates                                      
--------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
--------------------------------------------------------------------------------------------
(Intercept)    134.112         4.141                  32.384    0.000    125.165    143.059 
        Age     -2.370         0.114       -0.918    -20.710    0.000     -2.617     -2.123 
 GenderMale     15.383         1.420        0.480     10.830    0.000     12.315     18.452 
--------------------------------------------------------------------------------------------

Getting Model Equations from Regression Output

By default R chooses baseline categories alphabetically
- Female is before Male so Female is the baseline
- Female SLR Model: Est. Earnings = 134.112 - 2.37 * Age
- Male SLR Model:
  - Est. Earnings = 134.112 - 2.37 * Age + 15.383
  - Est. Earnings = 134.112 + 15.383 - 2.37*Age
  - Est. Earnings = 149.505 - 2.37 * Age
The difference between the intercepts for Females and Males is shown in the model output.
- Difference in intercepts is labeled with name of categorical variable and category
- Difference (Increase) for Males is labeled GenderMale and equals 15.383

💥 Lecture 13 In-class Exercises - Q5 💥

Session ID: bua345s24

Based on our categorical regression model, is the difference between male and female earnings (approx. 15 $M), statistically significant?

HINT: Look at the p-value for the GenderMale term in the model to answer this question.

Data Example - House Remodeling Data

What is the effect of remodeling on house selling price?

# import house remodeling data
house_remodel <- read_csv("data/house_remodel.csv", show_col_types = F) |>
  glimpse(width=75)

Rows: 57
Columns: 3
$ Price       <dbl> 554000, 484000, 391000, 354000, 410000, 349000, 40900…
$ Square_Feet <dbl> 2702, 2378, 1846, 1820, 1794, 1768, 1752, 1719, 1676,…
$ Remodeled   <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",…

House Remodeling Data - Examine Categories and Correlations

What is the effect of remodeling on house selling price?

# examine number of obs in each category
house_remodel |> select(Remodeled) |> table()

Remodeled
 No Yes 
 29  28

# correlation between price and square feet
house_remodel |> select(Price, Square_Feet) |> cor() |> round(2)

            Price Square_Feet
Price        1.00        0.75
Square_Feet  0.75        1.00

House Remodeling Data - Examining Model Options

Model Option 1: SLR

SLR model assumes no difference due to remodeling
Again, we use Base R command for regression, lm
Model created with lm can be used to create an interactive plot.
Interactive plot shows the model equation when the cursor is on the line.

House Remodeling Data - Examining Model Options

Model Option 2: Categorical Regression Model

SLR model is okay, but there is probably a difference between Remodeled and un-Remodeled houses.
To test for that difference we add the categorical variable Remodeled to the model.
The interactive plot shows each model equation when the cursor is on the line.

House Remodel data - MLR Model Output

We can see the model equation (poorly formatted) for each category, in the plot.
We can also get these equations from the model output, but it requires a little work.
Examine the model output:

# formatted regression output
# model is saved and printed to screen
(house_rem_ols<- ols_regress(Price ~ Square_Feet + Remodeled, data=house_remodel))

                              Model Summary                                
--------------------------------------------------------------------------
R                           0.924       RMSE                    32079.481 
R-Squared                   0.854       MSE                1086264973.997 
Adj. R-Squared              0.848       Coef. Var                   8.223 
Pred R-Squared              0.836       AIC                      1352.620 
MAE                     28012.693       SBC                      1360.792 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                      ANOVA                                        
----------------------------------------------------------------------------------
                        Sum of                                                    
                       Squares        DF         Mean Square       F         Sig. 
----------------------------------------------------------------------------------
Regression    341892090457.686         2    170946045228.843     157.37    0.0000 
Residual       58658308595.823        54      1086264973.997                      
Total         400550399053.509        56                                          
----------------------------------------------------------------------------------

                                         Parameter Estimates                                          
-----------------------------------------------------------------------------------------------------
       model          Beta    Std. Error    Std. Beta      t        Sig          lower         upper 
-----------------------------------------------------------------------------------------------------
 (Intercept)    137549.093     17620.447                  7.806    0.000    102222.224    172875.963 
 Square_Feet       137.879        10.836        0.670    12.725    0.000       116.155       159.602 
RemodeledYes     90917.216      8834.268        0.542    10.291    0.000     73205.575    108628.858 
-----------------------------------------------------------------------------------------------------

Getting Model Equations from Regression Output

By default R chooses baseline categories alphabetically
- No is before Yes so un-Remodeled houses are the baseline
- un-Remodeled SLR Model: Est. Price = 137549.093 + 137.879 * Square_Feet
- Remodeled SLR Model:
  - Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216
  - Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet
  - Est. Price = _____ + 137.879 * Square_Feet

💥 Lecture 13 In-class Exercises - Q6 💥

Session ID: bua345s24

The difference between the intercepts for Remodeled and un-Remodeled houses is shown in the model output.

Difference in intercepts is labeled with name of categorical variable and category
Difference (Increase) for remodeling is labeled RemodeledYes and equals 90917.216

What is the intercept for the prices of Remodeled houses in the Categorical Regression model (Round to closest thousand ($K).

💥 Lecture 13 In-class Exercises - Q7 💥

Session ID: bua345s24

Based on our categorical regression model, is the difference in selling price between remodeled (Remodeled = Yes) and un-remodeled (Remodeled = No) homes statistically significant?

Parameter Estimates Table

Key Points from Today

Categorical Parallel Lines Model
- Separate SLR model for each category.
- Modeling categories simultaneously with one mode is
  - more efficient
  - more accurate
Next Similar Model BUT each category has a different slope
HW 5 is due on 2/28
HW 6 will be posted on 2/29 and due on 3/6.
First set of data in HW 6 almost identical to house_remodel data.

To submit an Engagement Question or Comment about material from Lecture 12: Submit by midnight today (day of lecture). Click on Link next to the ❓ under Lecture 12