Select an interesting binary column of data, or one which can be reasonably converted into a binary variable This should be something worth modeling Build a logistic regression model for this variable, using between 1-4 explanatory variables Interpret the coefficients, and explain what they mean in your notebook Using the Standard Error for at least one coefficient, build a C.I. for that coefficient, and translate its meaning Consider a transformation for any explanatory variable, and illustrate why you need the transformation (or why you do not) e.g., use scatterplots
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
## 7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.
## 8 Abilene 2000 8 131 10710000 75000 765 6.4 2001.
## 9 Abilene 2000 9 104 7615000 64500 771 6.5 2001.
## 10 Abilene 2000 10 101 7040000 59300 764 6.6 2001.
## # ℹ 8,592 more rows
Since our dataset does not have any binary columns, we will use our month column to convert all months before June (1-6) to 0 and all those after June(7-12) to have the value 1.
##
## 0 1
## 3680 4922
As seen above, we now have a binary column named month_binary with 3071 values of 0 (representing months from January to June) and 4054 values of 1 (representing months from Julyto December)
library(ggthemes)
library(ggrepel)
library(ggplot2)
txhousing |>
ggplot(mapping = aes(x = sales, y = month_binary)) +
geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
geom_smooth(method = 'lm', se = FALSE) +
labs(title = "Modeling a Binary Response with OLS") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 568 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 568 rows containing missing values or values outside the scale range
## (`geom_point()`).
model <- glm(month_binary ~ sales + inventory , data = txhousing,
family = binomial(link = 'logit'))
model$coefficients
## (Intercept) sales inventory
## 1.385654e-01 6.107456e-05 1.426608e-02
Our intercept value is 0.1410 which represent the 50% decision threshold for a variable. I.e it represents the log-odds of selling before June when all the sales and inventory values are equal to zero.
The coefficient of our sales column is 0.0000606867 -This is a positive value, which implies that the for every unit increase in sales, the log-odds of the sale being between July and December goes up by 0.0000606867. It is important to note that the magnitude of this value is extremely small - implying that the relationship is fairly weak.
The coefficient of our inventory column is 0.01402 -This is a positive value, which implies that the for every unit increase in sales, the log-odds of the sale being between July and December goes up by roughly 1.5%. It is important to note that the magnitude of this value is also fairly small - however it is larger than the sales-column’s coefficient indicating that the inventory column has a stronger relationship with the month_binary column
standard_errors <- summary(model)$coef[, "Std. Error"]
standard_errors
## (Intercept) sales inventory
## 0.0491755917 0.0000214657 0.0054251025
confidence_level <- 0.95
# Compute Z-value based on confidence level
z_value <- qnorm((1 + confidence_level) / 2)
sales_se = summary(model)$coeff["sales", "Std. Error"]
inventory_se = summary(model)$coeff["inventory", "Std. Error"]
sales_margin_error = z_value * sales_se
inventory_margin_error = z_value* inventory_se
# Calculate confidence interval for Sales
sales_lower_bound = sales_se - sales_margin_error
sales_upper_bound = sales_se + sales_margin_error
# Calculate confidence interval for Inventory
inventory_lower_bound = inventory_se - inventory_margin_error
inventory_upper_bound = inventory_se + inventory_margin_error
cat("Confidence Interval for Sales Column:", sales_lower_bound, "-", sales_upper_bound, "\n")
## Confidence Interval for Sales Column: -2.06063e-05 - 6.35377e-05
cat("Confidence Interval for Inventory Column:", inventory_lower_bound, "-", inventory_upper_bound, "\n")
## Confidence Interval for Inventory Column: -0.005207903 - 0.01605811
Our Logisitc Regression model has 2 explanatory variables - sales and inventory:
txhousing |>
ggplot(mapping = aes(x = sales, y = inventory)) +
geom_point(shape = 'O', size = 3) +
geom_smooth(se = FALSE) +
labs(title = "Sales vs. Inventory") +
theme_minimal()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 1468 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1468 rows containing missing values or values outside the scale range
## (`geom_point()`).
As we can see above, the relationship isnot exactly linear and the Inventory values seems to be much higher for sales values near 0.
txhousing |>
ggplot(mapping = aes(x = log(sales), y = inventory)) +
geom_point(shape = 'O', size = 3) +
geom_smooth(se = FALSE) +
labs(title = "log(Sales) vs. Inventory") +
theme_minimal()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 1468 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1468 rows containing missing values or values outside the scale range
## (`geom_point()`).
Taking the log(sales) doesnt help with reducing the issues of linarity
and so I would not reccomend this as a desired transformation
txhousing |>
ggplot(mapping = aes(x = sales, y = log(inventory))) +
geom_point(shape = 'O', size = 3) +
geom_smooth(se = FALSE) +
labs(title = "Sales vs. log(Inventory)") +
theme_minimal()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 1469 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1468 rows containing missing values or values outside the scale range
## (`geom_point()`).
Taking the log(inventory) seems to affect our linear relationship even more. This indicates that transformations of our models are not necesary for this dataset.