BUA 345 - Lecture 13

Categorical Regression - Parallel Lines Model

Author

Penelope Pooler Eisenbies

Published

February 24, 2025

Housekeeping

HW 5 is due 2/26/2025 - 2 day grace period

  • Demo videos are posted

Today’s plan

  • Review of

    • SLR and MLR

    • Hypothesis testing in Regression

  • Categorical Parallel Lines Model

In-class Polling (Session ID: bua345s25)

Review Question - Import data

  • Recall The Tennessee Real Estate data:
Bedrooms  Bathrooms    Price  Year_built  Address                 City       State
       4        4.5  1365000        2006  5442 GRANNY WHITE PIKE  BRENTWOOD  TN
       3        3.0   300000        1987  5608 HEARTHSTONE LN     BRENTWOOD  TN
       3        4.5   566920        1968  1706 OLD HICKORY BLVD   BRENTWOOD  TN
       5        5.0   650000        1977  1428 OLD HICKORY BLVD   BRENTWOOD  TN

Review Question - Natural Log transformation

  • We will build an MLR model using the natural log of Price (ln_Price)
  • This transformation is needed because Price is RIGHT-SKEWED.
Code
```{r ln transformation of Price, echo=T}
tn_houses <- tn_houses |>
  mutate(ln_Price = log(Price)) # ln_Price = the natural log of price
```
  Price  ln_Price  Bedrooms  Bathrooms  Year_built  Address                 City       State
1365000  14.12666         4        4.5        2006  5442 GRANNY WHITE PIKE  BRENTWOOD  TN
 300000  12.61154         3        3.0        1987  5608 HEARTHSTONE LN     BRENTWOOD  TN
 566920  13.24797         3        4.5        1968  1706 OLD HICKORY BLVD   BRENTWOOD  TN
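
As a quick check of the transformation, the sketch below recomputes ln_Price for the three prices shown in the table and then reverses it. The vector names `price` and `ln_price` are made up for this illustration and are not part of the tn_houses data.

```{r check ln transformation sketch, echo=T}
price <- c(1365000, 300000, 566920)   # first three prices from the table above
ln_price <- log(price)                # log() in R is the natural log (base e)
round(ln_price, 5)                    # compare with the ln_Price column
exp(ln_price)                         # exp() undoes log() and returns the original prices
```

Because exp() reverses log(), any estimate made on the ln_Price scale can be converted back to dollars.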

Review Question - Histograms

  • Histogram of Price shows distribution of raw data is right-skewed with high outliers.

  • Histogram of ln_Price shows distribution of transformed data is approximately symmetric and approximately normally distributed.

Lecture 13 In-class Exercises - Q1 - Review

Session ID: bua345s25

Back-transforming Model Estimates

Based on the model output, what is the estimated price of a house with 4 bedrooms and 3 bathrooms (rounded to closest $1000)?

Code
```{r review question incomplete R code, eval=F, echo=T}
(y_est <- ___ + 0.056*4 + 0.375*3)            # fill in intercept from R output

(est_dollars <- exp(y_est))                   # back_transform y estimate

# -3 is correct input to round to closest thousand
round(est_dollars,-3)                        # without piping
est_dollars |> round(-3)                     # with piping
```
  • NOTE: All 3 steps above could be done with one line but it is helpful to break it down when learning.
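
One possible way to collapse the three steps into a single piped expression is sketched below. The function name `est_price` and its arguments are hypothetical, and the intercept is deliberately left as the argument `b0` so that it still has to be read from the R output.

```{r one-line back-transform sketch, echo=T}
# b0 is the intercept read from the R output (left as an argument on purpose)
est_price <- function(b0, bedrooms, bathrooms) {
  (b0 + 0.056 * bedrooms + 0.375 * bathrooms) |>  # estimate on the ln(Price) scale
    exp() |>                                      # back-transform to dollars
    round(-3)                                     # round to closest thousand
}
```

Calling `est_price()` with the intercept from the output, 4 bedrooms, and 3 bathrooms should reproduce the step-by-step result.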

Regression Terms - \(R^2\) and Adjusted \(R^2\)

  • R is the correlation coefficient, \(R_{XY}\)

  • Regression Output only shows absolute value of R.


  • \(R^2\) is \(R_{XY}^2\), the square of the correlation coefficient.

  • \(R^2\) is also called coefficient of determination.

  • Meaning of \(R^2\) in SLR: Proportion of variability in y explained by X


  • Adjusted \(R^2\) adjusts \(R^2\) for number of explanatory (X) variables in model.

    • Much more to come about this.

    • Meaning of Adjusted \(R^2\) in MLR is a little less specific but it is similar to \(R^2\).

  • Other values will be covered in upcoming lectures.

Review of Parameter Estimates Output

  • model column lists intercept and X variables in model

  • Beta column shows the estimate of the \(\beta\) coefficients for each variable in model.

  • Std. Error shows the variability of each Beta coefficient estimate.

  • t = Beta/Std. Error, the test statistic for each Beta coefficient estimate.

  • Sig is P-value for Hypothesis test for each Beta coefficient estimate:

Recall Hypotheses being tested:
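
For the t-test reported on each row of the Parameter Estimates table, the standard two-sided hypotheses are written below. The subscript \(i\) is notation assumed here to index the terms in the model.

\[
H_0: \beta_i = 0 \qquad \text{vs.} \qquad H_A: \beta_i \neq 0
\]

A small P-value (Sig) is evidence against \(H_0\), i.e. evidence that the term is useful in the model.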

Review of Parameter Estimates Output

  • Reminder of Example Output (Sig is P-value column):

  • Recall Interpretation guidelines for P-value:

Types of Data - Review

Types of Data - More on Categorical Data

  • Categorical variables are categories that describe data observations

    • Gender, Location, Hair Color, Eye Color, etc.
  • Ordinal Categories have an OBJECTIVE order:

    • Grades: A, B, C, D

    • College year: Freshman, Sophomore, Junior, Senior

  • Nominal Categories don’t have an objective order:

    • Location

    • Hair color

    • Gender
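
In R, the ordinal versus nominal distinction can be stored in the data itself. The sketch below is one way to do that with base R factors. The vectors are small made-up examples, not course data.

```{r ordered factor sketch, echo=T}
# hypothetical vector of class years (illustrative values only)
class_year <- c("Junior", "Freshman", "Senior", "Sophomore", "Freshman")

# ordinal: levels are listed in their objective order and ordered = TRUE
class_year_ord <- factor(class_year,
                         levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                         ordered = TRUE)

# nominal: a plain factor; levels default to alphabetical order, which has no meaning
hair_color <- factor(c("brown", "blond", "black", "brown"))

is.ordered(class_year_ord)      # TRUE
is.ordered(hair_color)          # FALSE
class_year_ord < "Junior"       # comparisons are allowed for ordered factors
table(class_year_ord)           # counts are listed in level order
```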

Data Examples - R Star Wars Dataset

  • Dataset of characters from Star Wars franchise

  • Type ?starwars in the console to review data documentation.

Code
```{r example from R starwars data, echo=T}
my_starwars <- starwars           # save R starwars dataset to Global Environment                        
my_starwars |> glimpse(width=75)  # examine data
```
Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Le…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 1…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brow…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "ligh…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "b…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 5…
$ sex        <chr> "male", "none", "none", "male", "female", "male", "fem…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "f…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", …
$ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of …
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>,…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced…

Examining Data

  • A good way to examine categorical data variables is to examine how many observations are in each category.

  • For example, we can examine the Star Wars character data by species and gender

    • There are WAY TOO MANY species…
                gender
species          feminine masculine
  Aleena                0         1
  Besalisk              0         1
  Cerean                0         1
  Chagrian              0         1
  Clawdite              1         0
  Droid                 1         5
  Dug                   0         1
  Ewok                  0         1
  Geonosian             0         1
  Gungan                0         3
  Human                 9        26
  Hutt                  0         1
  Iktotchi              0         1
  Kaleesh               0         1
  Kaminoan              1         1
  Kel Dor               0         1
  Mirialan              2         0
  Mon Calamari          0         1
  Muun                  0         1
  Nautolan              0         1
  Neimodian             0         1
  Pau'an                0         1
  Quermian              0         1
  Rodian                0         1
  Skakoan               0         1
  Sullustan             0         1
  Tholothian            1         0
  Togruta               1         0
  Toong                 0         1
  Toydarian             0         1
  Trandoshan            0         1
  Twi'lek               1         1
  Vulptereen            0         1
  Wookiee               0         2
  Xexto                 0         1
  Yoda's species        0         1
  Zabrak                0         2

Lecture 13 In-class Exercises - Q2

Session ID: bua345s25

Is species a nominal or ordinal variable?

Star Wars Example - Examining categorical Data

  • Human is the most common species.
  • We can filter the data to look at those characters only.
  • For example, we can examine prevalence of each gender and eye color among the human characters.
Code
```{r summary of gender and eye color, echo=T}
my_starwars |> filter(species=="Human") |> # filter to humans only
  dplyr::select(gender, eye_color) |>      # select gender and eye_color variables
  table()                                  # create table of counts
```
           eye_color
gender      blue blue-gray brown dark hazel unknown yellow
  feminine     3         0     4    0     1       1      0
  masculine    9         1    12    1     1       0      2

Lecture 13 In-class Exercises - Q3

Session ID: bua345s25

Which R command is used to summarize the number of observations in each gender x eye_color combination?


NOTE: This useful command will also be used in HW 6.

Code
```{r summary of gender and eye color poll, echo=T}
my_starwars |> filter(species=="Human") |> # filter to humans only
  dplyr::select(gender, eye_color) |>      # select gender and eye_color variables
  table()                                  # create table of counts
```
           eye_color
gender      blue blue-gray brown dark hazel unknown yellow
  feminine     3         0     4    0     1       1      0
  masculine    9         1    12    1     1       0      2

Data Examples - GT cars dataset

  • Deluxe automobiles from the 2014-2017 period

  • Type ?gt::gtcars in the console to see data documentation.

Code
```{r import gtcars data, echo=T}
gt_cars <- gtcars |> # import data
  glimpse(width=75) 
```
Rows: 47
Columns: 15
$ mfr         <chr> "Ford", "Ferrari", "Ferrari", "Ferrari", "Ferrari", "…
$ model       <chr> "GT", "458 Speciale", "458 Spider", "458 Italia", "48…
$ year        <dbl> 2017, 2015, 2015, 2014, 2016, 2015, 2017, 2015, 2015,…
$ trim        <chr> "Base Coupe", "Base Coupe", "Base", "Base Coupe", "Ba…
$ bdy_style   <chr> "coupe", "coupe", "convertible", "coupe", "coupe", "c…
$ hp          <dbl> 647, 597, 562, 562, 661, 553, 680, 652, 731, 949, 573…
$ hp_rpm      <dbl> 6250, 9000, 9000, 9000, 8000, 7500, 8250, 8000, 8250,…
$ trq         <dbl> 550, 398, 398, 398, 561, 557, 514, 504, 509, 664, 476…
$ trq_rpm     <dbl> 5900, 6000, 6000, 6000, 3000, 4750, 5750, 6000, 6000,…
$ mpg_c       <dbl> 11, 13, 13, 13, 15, 16, 12, 11, 11, 12, 21, 16, 11, 1…
$ mpg_h       <dbl> 18, 17, 17, 17, 22, 23, 17, 16, 16, 16, 22, 22, 18, 2…
$ drivetrain  <chr> "rwd", "rwd", "rwd", "rwd", "rwd", "rwd", "awd", "awd…
$ trsmn       <chr> "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a", "7a",…
$ ctry_origin <chr> "United States", "Italy", "Italy", "Italy", "Italy", …
$ msrp        <dbl> 447000, 291744, 263553, 233509, 245400, 198973, 29800…

Lecture 13 In-class Exercises - Q4

Session ID: bua345s25

Which variable in gt_cars, body style (bdy_style) or year, could be treated as ordinal?


Code
```{r examine categories within gt data, echo=T}
gt_cars |> dplyr::select(bdy_style, year) |> table()
```
             year
bdy_style     2014 2015 2016 2017
  convertible    0    2    2    1
  coupe          2    7   16    7
  hatchback      0    0    2    0
  sedan          0    0    7    1

Categorical Regression

  • Categorical variables can (and should) be used in linear regression models

  • If categories exist in the data and we ignore them, then we assume that the linear relationship is the SAME FOR all categories.

  • The following two examples illustrate the importance of adding a categorical variable to a regression model when needed.

Data Example - Celebrity Salaries Data

  • Many (not all) celebrities see a decrease in their annual income as they age.

  • There is a negative relationship between earnings and age.

  • Is this relationship the same for males and females?

Celebrity        Earnings  Age  Gender
Taylor Swift           67   27  Female
Lady Gaga              59   31  Female
Gisele Bundchen        54   32  Female
Beyonce                54   35  Female
Kim Kardashian         51   36  Female
Sofia Vergara          28   44  Female

Celebrity Salaries

Examining categories and Correlations

Code
```{r echo=T}
celeb |> dplyr::select(Gender) |> table()  # examine counts for each category

celeb |> dplyr::select(Earnings, Age) |> cor() |> round(2) # examine correlation between earnings and age
```
Gender
Female   Male 
     8      8 
         Earnings   Age
Earnings     1.00 -0.86
Age         -0.86  1.00

Celebrity Salaries Data - Examining Model Options

Option 1: SLR

  • Model assumes no difference between males and females.

  • In this case we use the base R regression command, lm.

  • Model created with lm can be used to create an interactive plot.

  • The interactive plot shows the model equation when the cursor is on the line.

\(\hat{y} = 136.89 - 2.23\times Age\)

Celebrity Salaries Data - Examining Model Options

Option 2: Categorical Regression Model

  • SLR model is okay, but we can do better.

  • It is logical to create a model that specifies a difference in earnings between males and females.

  • We add Gender to the model to test if this difference is significant.

  • The interactive plot shows each model equation when the cursor is on the line.

Females: \(\hat{y} = 134.11 - 2.37\times Age\)

Males: \(\hat{y} = 149.5 - 2.37\times Age\)
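
The name "parallel lines" comes from the two equations sharing one slope. A minimal sketch, using the rounded coefficients shown above and a few ages picked only for illustration, is given below. The function names `est_female` and `est_male` are hypothetical.

```{r parallel lines sketch, echo=T}
# fitted lines as reported by the plot (rounded coefficients)
est_female <- function(age) 134.11 - 2.37 * age
est_male   <- function(age) 149.50 - 2.37 * age

ages <- c(30, 40, 50)              # hypothetical ages chosen for illustration
est_male(ages) - est_female(ages)  # same gap at every age because the slopes match
```

The gap between the two lines is the difference in intercepts, so it does not depend on Age.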

Celebrity Salaries Data - MLR Model Output

  • We see the model equation (poorly formatted) for each gender, in the plot.

  • We can also get these equations from the model output, but it requires a little work.

  • Examine the model output:

                        Model Summary                          
--------------------------------------------------------------
R                       0.987       RMSE                2.545 
R-Squared               0.975       MSE                 6.477 
Adj. R-Squared          0.971       Coef. Var           4.959 
Pred R-Squared          0.962       AIC                83.299 
MAE                     2.197       SBC                86.389 
--------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares        DF    Mean Square       F         Sig. 
---------------------------------------------------------------------
Regression    4007.300         2       2003.650    251.332    0.0000 
Residual       103.637        13          7.972                      
Total         4110.937        15                                     
---------------------------------------------------------------------

                                    Parameter Estimates                                      
--------------------------------------------------------------------------------------------
      model       Beta    Std. Error    Std. Beta       t        Sig       lower      upper 
--------------------------------------------------------------------------------------------
(Intercept)    134.112         4.141                  32.384    0.000    125.165    143.059 
        Age     -2.370         0.114       -0.918    -20.710    0.000     -2.617     -2.123 
 GenderMale     15.383         1.420        0.480     10.830    0.000     12.315     18.452 
--------------------------------------------------------------------------------------------
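
Several values in the Model Summary can be recovered from the ANOVA table. The sketch below does this with the standard formulas for \(R^2\), Adjusted \(R^2\), and F. The object names are made up for this check, and n = 16 is taken from the Gender counts shown earlier (8 Female + 8 Male).

```{r check model summary sketch, echo=T}
# values copied from the ANOVA table above
ss_reg <- 4007.300; ss_res <- 103.637; ss_tot <- 4110.937
n <- 16   # 8 Female + 8 Male observations
k <- 2    # X variables in the model: Age and Gender

r_sq     <- ss_reg / ss_tot                           # R-Squared
adj_r_sq <- 1 - (1 - r_sq) * (n - 1) / (n - k - 1)    # Adj. R-Squared (standard formula)
f_stat   <- (ss_reg / k) / (ss_res / (n - k - 1))     # F = Mean Square Regression / Mean Square Residual

round(c(r_sq = r_sq, adj_r_sq = adj_r_sq), 3)
round(f_stat, 1)

# for comparison: R-squared of the SLR with Age only is the squared correlation
round((-0.86)^2, 2)
```

Comparing the squared correlation from the Age-only model with the R-Squared above gives a rough sense of how much adding Gender improves the fit. The correlation is rounded, so that comparison is approximate.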

Getting Model Equations from Regression Output

  • By default R chooses baseline categories alphabetically

    • Female is before Male so Female is the baseline

    • Female SLR Model:

      • Est. Earnings = 134.112 - 2.37 * Age
    • Male SLR Model:

      • Est. Earnings = 134.112 - 2.37 * Age + 15.383

      • Est. Earnings = 134.112 + 15.383 - 2.37*Age

      • Est. Earnings = 149.495 - 2.37 * Age

  • The difference between the intercepts for Females and Males is shown in the model output.

    • Difference in intercepts is labeled with name of categorical variable and category

    • Difference (Increase) for Males is labeled GenderMale and equals 15.383
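
To see how R builds the GenderMale term, the sketch below fits the same model form to a small SYNTHETIC data frame. The data frame `toy` is not the celeb data: its Earnings values are generated exactly from the coefficients reported above, so lm simply recovers them. All object names are hypothetical.

```{r dummy coding sketch, echo=T}
# SYNTHETIC data (not the celeb dataset), built from the reported coefficients
toy <- data.frame(
  Age    = rep(c(30, 40, 50, 60), times = 2),
  Gender = rep(c("Female", "Male"), each = 4)
)
toy$Earnings <- 134.112 - 2.370 * toy$Age + 15.383 * (toy$Gender == "Male")

toy_mlr <- lm(Earnings ~ Age + Gender, data = toy)  # numeric X plus categorical X

coef(toy_mlr) |> round(3)          # (Intercept), Age, GenderMale
model.matrix(toy_mlr) |> head(8)   # GenderMale is a 0/1 indicator; Female rows are 0

b <- coef(toy_mlr)
b[["(Intercept)"]]                       # Female (baseline) intercept
b[["(Intercept)"]] + b[["GenderMale"]]   # Male intercept
```

The model matrix shows why Female is the baseline: the GenderMale column is 0 for Female rows, so only the intercept and Age terms apply to them.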

Lecture 13 In-class Exercises - Q5

Session ID: bua345s25

Based on our categorical regression model, is the difference between male and female earnings (approx. $15 million) statistically significant?

HINT: Look at the p-value for the GenderMale term in the model to answer this question.


Parameter Estimates Table

Data Example - House Remodeling Data

  • What is the effect of remodeling on house selling price?
 Price  Square_Feet  Remodeled
554000         2702  No
484000         2378  No
391000         1846  No
354000         1820  No
410000         1794  No
349000         1768  No

House Remodeling Data

Examine Categories and Correlations

  • What is the effect of remodeling on house selling price?
Code
```{r  examine categories and correlations, echo=T}
# examine number of obs in each category                                             
house_remodel |> select(Remodeled) |> table() 

# correlation between price and square feet
house_remodel |> select(Price, Square_Feet) |> cor() |> round(2)
```
Remodeled
 No Yes 
 29  28 
            Price Square_Feet
Price        1.00        0.75
Square_Feet  0.75        1.00

House Remodeling Data - Examining Model Options

Option 1: SLR

  • SLR model assumes no difference due to remodeling

  • Again, we use the base R regression command, lm

  • Model created with lm can be used to create an interactive plot.

  • Interactive plot shows the model equation when the cursor is on the line.

House Remodeling Data - Examining Model Options

Option 2: Categorical Regression Model

  • SLR model is okay, but there is probably a difference between Remodeled and un-Remodeled houses.

  • To test for that difference we add the categorical variable Remodeled to the model.

  • The interactive plot shows each model equation when the cursor is on the line.

House Remodel data - MLR Model Output

  • We can see the model equation (poorly formatted) for each category, in the plot.

  • We can also get these equations from the model output, but it requires a little work.

  • Examine the model output:

                              Model Summary                                
--------------------------------------------------------------------------
R                           0.924       RMSE                    32079.481 
R-Squared                   0.854       MSE                1029093133.260 
Adj. R-Squared              0.848       Coef. Var                   8.223 
Pred R-Squared              0.836       AIC                      1352.620 
MAE                     28012.693       SBC                      1360.792 
--------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                      ANOVA                                        
----------------------------------------------------------------------------------
                        Sum of                                                    
                       Squares        DF         Mean Square       F         Sig. 
----------------------------------------------------------------------------------
Regression    341892090457.686         2    170946045228.843     157.37    0.0000 
Residual       58658308595.823        54      1086264973.997                      
Total         400550399053.509        56                                          
----------------------------------------------------------------------------------

                                         Parameter Estimates                                          
-----------------------------------------------------------------------------------------------------
       model          Beta    Std. Error    Std. Beta      t        Sig          lower         upper 
-----------------------------------------------------------------------------------------------------
 (Intercept)    137549.093     17620.447                  7.806    0.000    102222.224    172875.963 
 Square_Feet       137.879        10.836        0.670    12.725    0.000       116.155       159.602 
RemodeledYes     90917.216      8834.268        0.542    10.291    0.000     73205.575    108628.858 
-----------------------------------------------------------------------------------------------------
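
The t, lower, and upper columns can be checked by hand from Beta and Std. Error. The sketch below does this for the Square_Feet row, assuming that lower and upper are the limits of a 95% confidence interval (the output does not state the level). The object names are made up for this check.

```{r check t and interval sketch, echo=T}
# values copied from the Square_Feet row of the Parameter Estimates table
beta     <- 137.879
std_err  <- 10.836
df_resid <- 54                       # Residual DF from the ANOVA table

t_stat <- beta / std_err                                   # t = Beta / Std. Error
ci_95  <- beta + c(-1, 1) * qt(0.975, df_resid) * std_err  # lower and upper, assuming a 95% interval

round(t_stat, 2)
round(ci_95, 2)
```

Small differences from the printed table are expected because Beta and Std. Error are rounded in the output.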

Getting Model Equations from Regression Output

  • By default R chooses baseline categories alphabetically

    • No is before Yes so un-Remodeled houses are the baseline

    • un-Remodeled SLR Model:

      • Est. Price = 137549.093 + 137.879 * Square_Feet
    • Remodeled SLR Model:

      • Est. Price = 137549.093 + 137.879 * Square_Feet + 90917.216

      • Est. Price = 137549.093 + 90917.216 + 137.879 * Square_Feet

      • Est. Price = _____ + 137.879 * Square_Feet

Lecture 13 In-class Exercises - Q6

Session ID: bua345s25

The difference between the intercepts for Remodeled and un-Remodeled houses is shown in the model output.

  • Difference in intercepts is labeled with name of categorical variable and category

  • Difference for remodeling is labeled RemodeledYes and equals 90917.216

What is the intercept for the prices of Remodeled houses in the Categorical Regression model? Round to the closest thousand ($K).

Parameter Estimates Table

Lecture 13 In-class Exercises - Q7

Session ID: bua345s25

Based on our categorical regression model, is the difference in selling price between remodeled (Remodeled = Yes) and un-remodeled (Remodeled = No) homes statistically significant?


Parameter Estimates Table