Lecture 13 - Categorical Regression - Parallel Lines Model

Penelope Pooler Eisenbies
BUA 345

2024-02-27

Housekeeping

  • Today’s plan 📋

    • Review

Review Question - Import data

  • Recall The Tennessee Real Estate data:
# import and examine data
tn_houses <- read_csv("data/TN_houses.csv", show_col_types = F) |>
  glimpse(width=75)
Rows: 2,430
Columns: 7
$ Bedrooms   <dbl> 4, 3, 3, 5, 4, 4, 4, 3, 4, 4, 4, 3, 3, 4, 3, 4, 5, 4, …
$ Bathrooms  <dbl> 4.5, 3.0, 4.5, 5.0, 4.0, 3.0, 2.0, 2.0, 3.5, 3.5, 3.0,…
$ Price      <dbl> 1365000, 300000, 566920, 650000, 456500, 380500, 31815…
$ Year_built <dbl> 2006, 1987, 1968, 1977, 1974, 1987, 1978, 1976, 1966, …
$ Address    <chr> "5442  GRANNY WHITE PIKE", "5608  HEARTHSTONE LN", "17…
$ City       <chr> "BRENTWOOD", "BRENTWOOD", "BRENTWOOD", "BRENTWOOD", "B…
$ State      <chr> "TN", "TN", "TN", "TN", "TN", "TN", "TN", "TN", "TN", …

Review Question - Natural Log transformation

  • We will build an MLR model using the natural log of Price (ln_Price)
  • This transformation is needed because Price is RIGHT-SKEWED.
  • Review of how to create this transformation
# create new variable in tn_houses dataset
# ln_price = the natural log of price
tn_houses <- tn_houses |>
  mutate(ln_price = log(Price)) |>
  glimpse( width=75)
Rows: 2,430
Columns: 8
$ Bedrooms   <dbl> 4, 3, 3, 5, 4, 4, 4, 3, 4, 4, 4, 3, 3, 4, 3, 4, 5, 4, …
$ Bathrooms  <dbl> 4.5, 3.0, 4.5, 5.0, 4.0, 3.0, 2.0, 2.0, 3.5, 3.5, 3.0,…
$ Price      <dbl> 1365000, 300000, 566920, 650000, 456500, 380500, 31815…
$ Year_built <dbl> 2006, 1987, 1968, 1977, 1974, 1987, 1978, 1976, 1966, …
$ Address    <chr> "5442  GRANNY WHITE PIKE", "5608  HEARTHSTONE LN", "17…
$ City       <chr> "BRENTWOOD", "BRENTWOOD", "BRENTWOOD", "BRENTWOOD", "B…
$ State      <chr> "TN", "TN", "TN", "TN", "TN", "TN", "TN", "TN", "TN", …
$ ln_price   <dbl> 14.12666, 12.61154, 13.24797, 13.38473, 13.03134, 12.8…

Review Question - Examination Variable Distrbutions

  • Histogram of Price (hist_price), shows distribution of raw data is right-skewed with high outliers.

  • Histogram of ln_price (hist_ln_price) shows distribution of ln transformed data is symmetric and normally distributed.

Review Question - Side by Side Comparison of Histograms

💥 Lecture 13 In-class Exercises - Q1 - Review 💥

Session ID: bua345s24

Back-transforming Model Estimates

Based on the model output, What is the estimated price of a house with 4 bedrooms and 3 bathrooms (rounded to closest $1000)?

# incomplete model equation
# fill in intercept and values from question
(y_est <- ___ + 0.056*4 + 0.375*3)

# back-transform y_est using exp function
# two versions of same code
(est_dollars <- exp(y_est)) 
(est_dollars <- y_est |> exp()) # with piping

# -3 is correct input to round to closest thousand
# two versions of same code
round(est_dollars,-3)
est_dollars |> round(-3)        # with piping

NOTE: All 3 steps above could be done with one line but it is helpful to break it down when learning.

Review of Regression Terms

\(R^2\) and Adjusted \(R^2\)

  • R is the correlation coefficient, \(R_{XY}\)

    • Regression Output only shows absolute value of R.
  • \(R^2\) is \(R_{XY}^2\) the square of the correlation coefficient.

  • \(R^2\) is also called coefficient of determination.

  • Meaning of \(R^2\) in SLR: Proportion of variability in y explained by X


  • Adjusted \(R^2\) adjusts \(R^2\) for number of explanatory (X) variables in model.

    • Much more to come about this.

    • Meaning of Adjusted \(R^2\) in MLR is a little less specific but it issimilar to \(R^2\).

  • Other values will be covered in upcoming lectures.

Review of Parameter Estimates Output

  • model column lists intercept and X variables in model

  • Beta column shows the estimate of the \(\beta\) coefficients for each variable in model.

  • Std. Error shows variability of each estimated Beta coefficient estimate.

  • t = Beta/Std. Error, the test statistic for each Beta coefficient estimate.

  • Sig is P-value for Hypothesis test for each Beta coefficient estimate:

Recall Hypotheses being tested:

Review of Parameter Estimates Output

Recall Interpretation guidelines for P-value:


Reminder of Example Output:

Types of Data - Review (Previous Stat Course)

Types of Data - More on Categorical Data

  • Categorical variables are categories that describe data observations

    • Gender, Location, Hair Color, Eye Color, Location, etc.
  • Ordinal Categories have an OBJECTIVE order:

    • Grades: A, B, C, D

    • College year: Freshman, Sophomore, Junior, Senior

  • Nominal Categories don’t have an objective order:

    • Location

    • Hair color

    • Gender

Data Examples - R Star Wars Dataset

  • Dataset of characters from Star Wars franchise

  • Type ?starwars in the console to review data documentation.

my_starwars <- starwars           # save R starwars dataset to Global Environment
my_starwars |> glimpse(width=75)  # examine data
Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Le…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 1…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brow…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "ligh…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "b…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 5…
$ sex        <chr> "male", "none", "none", "male", "female", "male", "fem…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "f…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", …
$ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of …
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>,…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced…

Examining Data

  • A good way to examine categorical data variables is to examine how many observations are in each category.

  • For example, we can examine the Star Wars character data by species and gender

    • There are WAY TOO MANY species…
# summarizing starwars character data by species and gender
my_starwars |> select(species, gender) |> table()
                gender
species          feminine masculine
  Aleena                0         1
  Besalisk              0         1
  Cerean                0         1
  Chagrian              0         1
  Clawdite              1         0
  Droid                 1         5
  Dug                   0         1
  Ewok                  0         1
  Geonosian             0         1
  Gungan                0         3
  Human                 9        26
  Hutt                  0         1
  Iktotchi              0         1
  Kaleesh               0         1
  Kaminoan              1         1
  Kel Dor               0         1
  Mirialan              2         0
  Mon Calamari          0         1
  Muun                  0         1
  Nautolan              0         1
  Neimodian             0         1
  Pau'an                0         1
  Quermian              0         1
  Rodian                0         1
  Skakoan               0         1
  Sullustan             0         1
  Tholothian            1         0
  Togruta               1         0
  Toong                 0         1
  Toydarian             0         1
  Trandoshan            0         1
  Twi'lek               1         1
  Vulptereen            0         1
  Wookiee               0         2
  Xexto                 0         1
  Yoda's species        0         1
  Zabrak                0         2

💥 Lecture 13 In-class Exercises - Q2 💥

Session ID: bua345s24

Is species a nominal or ordinal variable?

Star wars Example - Examining categorical Data

  • Human is the most common species.
  • We can filter the data to look at those characters only.
  • For example, we can examine prevalence of each gender and eye color among the human characters.
my_starwars |> filter(species=="Human") |> # filter to humans only
  select(gender, eye_color) |>    # examine genders and eye colors
  table()
           eye_color
gender      blue blue-gray brown dark hazel unknown yellow
  feminine     3         0     4    0     1       1      0
  masculine    9         1    12    1     1       0      2

💥 Lecture 13 In-class Exercises - Q3 💥

Session ID: bua345s24

Which R command is used to summarize the number of observations in each gender x eye_color combination?


NOTE: This useful command will also be used in HW 6.

my_starwars |> filter(species=="Human") |> # filter to humans only
  select(gender, eye_color) |>    # examine genders and eye colors
  table()
           eye_color
gender      blue blue-gray brown dark hazel unknown yellow
  feminine     3         0     4    0     1       1      0
  masculine    9         1    12    1     1       0      2

Data Examples - GT cars dataset

  • Deluxe automobiles from the 2014-2017 period

  • Type ?gt::gtcars in the console to see data documentation.

gt_cars <- gtcars |> # import data
  glimpse(width=75) 
Rows: 47
Columns: 15
$ mfr         <chr> "Ford", "Ferrari", "Ferrari", "Ferrari", "Ferrari", "…
$ model       <chr> "GT", "458 Speciale", "458 Spider", "458 Italia", "48…
$ year        <dbl> 2017, 2015, 2015, 2014, 2016, 2015, 2017, 2015, 2015,…
$ trim        <chr> "Base Coupe", "Base Coupe", "Base", "Base Coupe", "Ba…
$ bdy_style   <chr> "coupe", "coupe", "convertible", "coupe", "coupe", "c…
$ hp          <dbl> 647, 597, 562, 562, 661, 553, 680, 652, 731, 949, 573…
$ hp_rpm      <dbl> 6250, 9000, 9000, 9000, 8000, 7500, 8250, 8000, 8250,…
$ trq         <dbl> 550, 398, 398, 398, 561, 557, 514, 504, 509, 664, 476…
$ trq_rpm     <dbl> 5900, 6000, 6000, 6000, 3000, 4750, 5750, 6000, 6000,…
$ mpg_c       <dbl> 11, 13, 13, 13, 15, 16, 12, 11, 11, 12, 21, 16, 11, 1…
$ mpg_h       <dbl> 18, 17, 17, 17, 22, 23, 17, 16, 16, 16, 22, 22, 18, 2…
$ drivetrain  <chr> "rwd", "rwd", "rwd", "rwd", "rwd", "rwd", "awd", "awd…
$ trsmn       <chr> "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a",…
$ ctry_origin <chr> "United States", "Italy", "Italy", "Italy", "Italy", …
$ msrp        <dbl> 447000, 291744, 263553, 233509, 245400, 198973, 29800…

💥 Lecture 13 In-class Exercises - Q4 💥

Session ID: bua345s24

Which variable in gt_cars, body style (bdy_style) or year could be treated as ordinal?


gt_cars |> select(bdy_style, year) |> table()
             year
bdy_style     2014 2015 2016 2017
  convertible    0    2    2    1
  coupe          2    7   16    7
  hatchback      0    0    2    0
  sedan          0    0    7    1

Categorical Regression

  • Categorical variables can (and should) be used in linear regression models

  • If categories exist in the data and we ignore them, then we assume that the linear relationship is the SAME FOR all categories.

  • The following two examples illustrate the importance of adding a categorical variable to a regression model when needed.

Data Example - Celebrity Salaries Data

  • Many (not all) celebrities see a decrease in their annual income as they age.

  • There is a negative relationship between wages and ages.

  • Is this relationship the same for males a females?

# import celebrity data
celeb <- read_csv("data/celeb.csv", show_col_types = F) |>
  glimpse(width=75) 
Rows: 16
Columns: 4
$ Celebrity <chr> "Taylor Swift", "Lady Gaga", "Gisele B\xfcndchen", "Bey…
$ Earnings  <dbl> 67, 59, 54, 54, 51, 28, 24, 70, 81, 70, 77, 68, 64, 56,…
$ Age       <dbl> 27, 31, 32, 35, 36, 44, 47, 29, 29, 32, 32, 35, 36, 39,…
$ Gender    <chr> "Female", "Female", "Female", "Female", "Female", "Fema…

Celebrity Salaries Data - Examining categories and Correlations

celeb |> select(Gender) |> table()  # examine counts for each category
Gender
Female   Male 
     8      8 
celeb |> select(Earnings, Age) |> cor() |> round(2) # examine correlation between earnings and age
         Earnings   Age
Earnings     1.00 -0.86
Age         -0.86  1.00

Celebrity Salaries Data - Examining Model Options

Model Option 1: SLR

  • Model assumes no difference between males and females.

  • In this case we use Base R command for regression, lm.

  • Model created with lm can be used to create an interactive plot.

  • The interactive plot shows the model equation when the cursor is on the line.

Celebrity Salaries Data - Examining Model Options

Model Option 2: Categorical Regression Model

  • SLR model is okay, but we can do better.

  • It is logical to create a model that specifies a difference in earnings between males and females.

  • We add Gender to the model to test if this difference is significant.

  • The interactive plot shows each model equation when the cursor is on the line.

Celebrity Salaries Data - MLR Model Output

  • We see the model equation (poorly formatted) for each gender, in the plot.

  • We can also get these equations from the model output, but it requires a little work.

  • Examine the model output:

# formatted regression output
# model is saved and printed to screen
(celeb_cat_ols<- ols_regress(Earnings ~ Age + Gender, data=celeb))
                        Model Summary                          
--------------------------------------------------------------
R                       0.987       RMSE                2.545 
R-Squared               0.975       MSE                 7.972 
Adj. R-Squared          0.971       Coef. Var           4.959 
Pred R-Squared          0.962       AIC                83.299 
MAE                     2.197       SBC                86.389 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares        DF    Mean Square       F         Sig. 
---------------------------------------------------------------------
Regression    4007.300         2       2003.650    251.332    0.0000 
Residual       103.637        13          7.972                      
Total         4110.937        15                                     
---------------------------------------------------------------------

                                    Parameter Estimates                                      
--------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
--------------------------------------------------------------------------------------------
(Intercept)    134.112         4.141                  32.384    0.000    125.165    143.059 
        Age     -2.370         0.114       -0.918    -20.710    0.000     -2.617     -2.123 
 GenderMale     15.383         1.420        0.480     10.830    0.000     12.315     18.452 
--------------------------------------------------------------------------------------------

Getting Model Equations from Regression Output

  • By default R chooses baseline categories alphabetically

    • Female is before Male so Female is the baseline

    • Female SLR Model: Est. Earnings = 134.112 - 2.37 * Age

    • Male SLR Model:

      • Est. Earnings = 134.112 - 2.37 * Age + 15.383

      • Est. Earnings = 134.112 + 15.383 - 2.37*Age

      • Est. Earnings = 149.505 - 2.37 * Age

  • The difference between the intercepts for Females and Males is shown in the model output.

    • Difference in intercepts is labeled with name of categorical variable and category

    • Difference (Increase) for Males is labeled GenderMale and equals 15.383

💥 Lecture 13 In-class Exercises - Q5 💥

Session ID: bua345s24

Based on our categorical regression model, is the difference between male and female earnings (approx. 15 $M), statistically significant?

HINT: Look at the p-value for the GenderMale term in the model to answer this question.

Data Example - House Remodeling Data

  • What is the effect of remodeling on house selling price?


# import house remodeling data
house_remodel <- read_csv("data/house_remodel.csv", show_col_types = F) |>
  glimpse(width=75) 
Rows: 57
Columns: 3
$ Price       <dbl> 554000, 484000, 391000, 354000, 410000, 349000, 40900…
$ Square_Feet <dbl> 2702, 2378, 1846, 1820, 1794, 1768, 1752, 1719, 1676,…
$ Remodeled   <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",…

House Remodeling Data - Examine Categories and Correlations

  • What is the effect of remodeling on house selling price?


# examine number of obs in each category
house_remodel |> select(Remodeled) |> table() 
Remodeled
 No Yes 
 29  28 
# correlation between price and square feet
house_remodel |> select(Price, Square_Feet) |> cor() |> round(2)
            Price Square_Feet
Price        1.00        0.75
Square_Feet  0.75        1.00

House Remodeling Data - Examining Model Options

Model Option 1: SLR

  • SLR model assumes no difference due to remodeling

  • Again, we use Base R command for regression, lm

  • Model created with lm can be used to create an interactive plot.

  • Interactive plot shows the model equation when the cursor is on the line.

House Remodeling Data - Examining Model Options

Model Option 2: Categorical Regression Model

  • SLR model is okay, but there is probably a difference between Remodeled and un-Remodeled houses.

  • To test for that difference we add the categorical variable Remodeled to the model.

  • The interactive plot shows each model equation when the cursor is on the line.

House Remodel data - MLR Model Output

  • We can see the model equation (poorly formatted) for each category, in the plot.

  • We can also get these equations from the model output, but it requires a little work.

  • Examine the model output:

# formatted regression output
# model is saved and printed to screen
(house_rem_ols<- ols_regress(Price ~ Square_Feet + Remodeled, data=house_remodel))
                              Model Summary                                
--------------------------------------------------------------------------
R                           0.924       RMSE                    32079.481 
R-Squared                   0.854       MSE                1086264973.997 
Adj. R-Squared              0.848       Coef. Var                   8.223 
Pred R-Squared              0.836       AIC                      1352.620 
MAE                     28012.693       SBC                      1360.792 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                      ANOVA                                        
----------------------------------------------------------------------------------
                        Sum of                                                    
                       Squares        DF         Mean Square       F         Sig. 
----------------------------------------------------------------------------------
Regression    341892090457.686         2    170946045228.843     157.37    0.0000 
Residual       58658308595.823        54      1086264973.997                      
Total         400550399053.509        56                                          
----------------------------------------------------------------------------------

                                         Parameter Estimates                                          
-----------------------------------------------------------------------------------------------------
       model          Beta    Std. Error    Std. Beta      t        Sig          lower         upper 
-----------------------------------------------------------------------------------------------------
 (Intercept)    137549.093     17620.447                  7.806    0.000    102222.224    172875.963 
 Square_Feet       137.879        10.836        0.670    12.725    0.000       116.155       159.602 
RemodeledYes     90917.216      8834.268        0.542    10.291    0.000     73205.575    108628.858 
-----------------------------------------------------------------------------------------------------

Getting Model Equations from Regression Output

  • By default R chooses baseline categories alphabetically

    • No is before Yes so un-Remodeled houses are the baseline

    • un-Remodeled SLR Model: Est. Price = 137549.093 + 137.879 * Square_Feet

    • Remodeled SLR Model:

      • Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216

      • Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet

      • Est. Price = _____ + 137.879 * Square_Feet

💥 Lecture 13 In-class Exercises - Q6 💥

Session ID: bua345s24

The difference between the intercepts for Remodeled and un-Remodeled houses is shown in the model output.

  • Difference in intercepts is labeled with name of categorical variable and category

  • Difference (Increase) for remodeling is labeled RemodeledYes and equals 90917.216


What is the intercept for the prices of Remodeled houses in the Categorical Regression model (Round to closest thousand ($K).

💥 Lecture 13 In-class Exercises - Q7 💥

Session ID: bua345s24

Based on our categorical regression model, is the difference in selling price between remodeled (Remodeled = Yes) and un-remodeled (Remodeled = No) homes statistically significant?


Parameter Estimates Table

Key Points from Today

  • Categorical Parallel Lines Model

    • Separate SLR model for each category.
    • Modeling categories simultaneously with one mode is
      • more efficient
      • more accurate
  • Next Similar Model BUT each category has a different slope

  • HW 5 is due on 2/28

  • HW 6 will be posted on 2/29 and due on 3/6.

  • First set of data in HW 6 almost identical to house_remodel data.

To submit an Engagement Question or Comment about material from Lecture 12: Submit by midnight today (day of lecture). Click on Link next to the under Lecture 12