Data Dive 10

Select an interesting binary column of data, or one which can be reasonably converted into a binary variable. This should be something worth modeling. Build a logistic regression model for this variable, using between 1-4 explanatory variables. Interpret the coefficients, and explain what they mean in your notebook. Using the standard error for at least one coefficient, build a confidence interval for that coefficient, and translate its meaning. Consider a transformation for any explanatory variable, and illustrate why you need the transformation (or why you do not), e.g., using scatterplots.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
##    city     year month sales   volume median listings inventory  date
##    <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
##  1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
##  2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
##  3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
##  4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
##  5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
##  6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
##  7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.
##  8 Abilene  2000     8   131 10710000  75000      765       6.4 2001.
##  9 Abilene  2000     9   104  7615000  64500      771       6.5 2001.
## 10 Abilene  2000    10   101  7040000  59300      764       6.6 2001.
## # ℹ 8,592 more rows

Converting our ‘month’ column into binary format

Since our dataset does not have any binary columns, we will use our month column: all months up to and including June (1-6) are converted to 0, and all months after June (7-12) are given the value 1.
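
The conversion step itself is not echoed above; a minimal sketch of the mapping just described, using dplyr's mutate() and if_else() (the original conversion code may differ slightly), is:

txhousing <- txhousing |>
  mutate(month_binary = if_else(month <= 6, 0, 1))  # assumed cutoff: months 1-6 -> 0

table(txhousing$month_binary)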

## 
##    0    1 
## 3680 4922

As seen above, we now have a binary column named month_binary with 3,680 values of 0 (representing months from January to June) and 4,922 values of 1 (representing months from July to December).

Building a Logistic Regression Model:

Predicting whether house sales occurred before or after June:

library(ggthemes)
library(ggrepel)
library(ggplot2)

txhousing |>
  ggplot(mapping = aes(x = sales, y = month_binary)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(title = "Modeling a Binary Response with OLS") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 568 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 568 rows containing missing values or values outside the scale range
## (`geom_point()`).
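
Notice that the straight OLS line above is not constrained to lie between 0 and 1. For comparison, geom_smooth() can draw the fitted logistic curve instead by requesting a binomial GLM; this is a sketch using the same jittered points:

txhousing |>
  ggplot(mapping = aes(x = sales, y = month_binary)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  # fit a logistic (binomial GLM) curve rather than a straight line
  geom_smooth(method = 'glm', method.args = list(family = binomial), se = FALSE) +
  labs(title = "Modeling a Binary Response with Logistic Regression") +
  theme_minimal()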

model <- glm(month_binary ~ sales + inventory, data = txhousing,
             family = binomial(link = 'logit'))

model$coefficients
##  (Intercept)        sales    inventory 
## 1.385654e-01 6.107456e-05 1.426608e-02
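
To make the fitted log-odds more tangible, predict() with type = 'response' returns predicted probabilities. The sales and inventory values below are hypothetical, chosen purely for illustration:

# Hypothetical observation: 500 monthly sales, 6 months of inventory
new_obs <- tibble(sales = 500, inventory = 6)

# Predicted probability that this observation falls in July-December
predict(model, newdata = new_obs, type = 'response')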

Interpreting our Coefficients:

Our intercept value is 0.1386. It represents the log-odds of an observation falling between July and December when both sales and inventory are equal to zero. Since exp(0.1386) ≈ 1.15, this corresponds to odds slightly above even (a probability of about 53%).

The coefficient of our sales column is 0.0000611. This is a positive value, which implies that for every unit increase in sales, the log-odds of the observation being between July and December go up by 0.0000611. It is important to note that the magnitude of this value is extremely small, implying that the relationship is fairly weak.

The coefficient of our inventory column is 0.0143. This is a positive value, which implies that for every unit increase in inventory, the log-odds of the observation being between July and December go up by 0.0143 (roughly a 1.4% increase in the odds, since exp(0.0143) ≈ 1.014). Its magnitude is also fairly small, but it is larger than the sales coefficient, indicating that the inventory column has a stronger relationship with month_binary (keeping in mind that the two variables are measured on very different scales).
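
Exponentiating the coefficients converts them from log-odds to odds ratios, which are often easier to interpret; a quick check (values derived from the coefficients printed above):

# Odds ratios: the multiplicative change in odds per unit increase
exp(coef(model))
# e.g. exp(0.0143) is roughly 1.014, so each additional unit of inventory
# multiplies the odds of falling in July-December by about 1.014 (a ~1.4% increase)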

Standard Errors and Confidence Intervals

standard_errors <- summary(model)$coef[, "Std. Error"]
standard_errors
##  (Intercept)        sales    inventory 
## 0.0491755917 0.0000214657 0.0054251025

Calculating the critical z-value for a 95% confidence level:

confidence_level <- 0.95

# Compute Z-value based on confidence level
z_value <- qnorm((1 + confidence_level) / 2)

# Pull estimates and standard errors from the model summary
sales_coef = summary(model)$coefficients["sales", "Estimate"]
inventory_coef = summary(model)$coefficients["inventory", "Estimate"]

sales_se = summary(model)$coefficients["sales", "Std. Error"]
inventory_se = summary(model)$coefficients["inventory", "Std. Error"]

sales_margin_error = z_value * sales_se
inventory_margin_error = z_value * inventory_se

# Calculate confidence interval for Sales (centered on the estimate)
sales_lower_bound = sales_coef - sales_margin_error
sales_upper_bound = sales_coef + sales_margin_error

# Calculate confidence interval for Inventory (centered on the estimate)
inventory_lower_bound = inventory_coef - inventory_margin_error
inventory_upper_bound = inventory_coef + inventory_margin_error

cat("Confidence Interval for Sales Column:", sales_lower_bound, "-", sales_upper_bound, "\n")
## Confidence Interval for Sales Column: -2.06063e-05 - 6.35377e-05
cat("Confidence Interval for Inventory Column:", inventory_lower_bound, "-", inventory_upper_bound, "\n")
## Confidence Interval for Inventory Column: -0.005207903 - 0.01605811

Transforming Explanatory Variables:

Our logistic regression model has two explanatory variables: sales and inventory.

Comparing Sales vs. Inventory without Any Transformations

txhousing |>
  ggplot(mapping = aes(x = sales, y = inventory)) +
  geom_point(shape = 'O', size = 3) +
  geom_smooth(se = FALSE) +
  labs(title = "Sales vs. Inventory") +
  theme_minimal()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 1468 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1468 rows containing missing values or values outside the scale range
## (`geom_point()`).

As we can see above, the relationship is not exactly linear, and the inventory values seem to be much higher for sales values near 0.

Log Transforming Sales

txhousing |>
  ggplot(mapping = aes(x = log(sales), y = inventory)) +
  geom_point(shape = 'O', size = 3) +
  geom_smooth(se = FALSE) +
  labs(title = "log(Sales) vs. Inventory") +
  theme_minimal()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 1468 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1468 rows containing missing values or values outside the scale range
## (`geom_point()`).

Taking the log of sales does not help with reducing the issues of linearity, so I would not recommend this transformation.

Log Transforming Inventory

txhousing |>
  ggplot(mapping = aes(x = sales, y = log(inventory))) +
  geom_point(shape = 'O', size = 3) +
  geom_smooth(se = FALSE) +
  labs(title = "Sales vs. log(Inventory)") +
  theme_minimal()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 1469 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1468 rows containing missing values or values outside the scale range
## (`geom_point()`).

Taking the log of inventory seems to make the relationship even less linear. This indicates that transformations of our explanatory variables are not necessary for this dataset.