# read.csv() is base R, so loading readr is not needed here
housing <- read.csv("/Users/sharmistaroy/Downloads/housing.csv")
head(housing)
##   longitude latitude housing_median_age total_rooms total_bedrooms population
## 1   -122.23    37.88                 41         880            129        322
## 2   -122.22    37.86                 21        7099           1106       2401
## 3   -122.24    37.85                 52        1467            190        496
## 4   -122.25    37.85                 52        1274            235        558
## 5   -122.25    37.85                 52        1627            280        565
## 6   -122.25    37.85                 52         919            213        413
##   households median_income median_house_value ocean_proximity
## 1        126        8.3252             452600        NEAR BAY
## 2       1138        8.3014             358500        NEAR BAY
## 3        177        7.2574             352100        NEAR BAY
## 4        219        5.6431             341300        NEAR BAY
## 5        259        3.8462             342200        NEAR BAY
## 6        193        4.0368             269700        NEAR BAY
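# Flag areas whose median income exceeds the threshold
threshold <- 5
housing$highIncome <- ifelse(housing$median_income > threshold, 1, 0)
# Logistic regression of the high-income flag on four predictors
model <- glm(highIncome ~ housing_median_age + total_rooms + population + median_house_value,
             data = housing, family = binomial)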
summary(model)
##
## Call:
## glm(formula = highIncome ~ housing_median_age + total_rooms +
##     population + median_house_value, family = binomial, data = housing)
##
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)        -3.000e+00  7.479e-02  -40.12   <2e-16 ***
## housing_median_age -5.393e-02  1.980e-03  -27.23   <2e-16 ***
## total_rooms         3.958e-04  2.532e-05   15.63   <2e-16 ***
## population         -8.366e-04  5.386e-05  -15.53   <2e-16 ***
## median_house_value  1.343e-05  2.300e-07   58.38   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21619 on 20639 degrees of freedom
## Residual deviance: 13977 on 20635 degrees of freedom
## AIC: 13987
##
## Number of Fisher Scoring iterations: 5
The significance codes show that every coefficient is statistically significant (p < 0.001), so each predictor is associated with the likelihood of an area being classed as ‘highIncome’. With more than 20,000 observations even small effects reach this level, so significance alone says little about the size of an effect.
Each coefficient is the change in the log-odds of an area being ‘highIncome’ for a one-unit increase in that predictor, holding the other predictors constant. Exponentiating a coefficient gives an odds ratio, which is usually easier to interpret: it is the factor by which the odds of the outcome are multiplied for that one-unit increase.
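As a minimal sketch of the exponentiation step, the block below turns the printed estimates into odds ratios. The values are typed in from the summary output rather than taken from the fitted object (with the model in memory, `exp(coef(model))` gives the same thing), and the 10,000 USD rescaling is an illustrative choice, not part of the original analysis.

```r
# Estimates copied from the summary(model) output above (rounded as printed)
coefs <- c(
  housing_median_age = -5.393e-02,
  total_rooms        =  3.958e-04,
  population         = -8.366e-04,
  median_house_value =  1.343e-05
)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(coefs)
print(round(odds_ratios, 6))

# A one-dollar step is tiny, so rescale the house-value effect to a
# 10,000 USD increase (illustrative step size)
or_per_10k <- exp(coefs[["median_house_value"]] * 10000)
print(round(or_per_10k, 3))
```

On this scale, a 10,000 USD increase in median house value multiplies the odds of ‘highIncome’ by about 1.14, and each additional unit of housing median age multiplies them by about 0.95.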
ci <- confint(model, "median_house_value", level = 0.95)
## Waiting for profiling to be done...
print(ci)
##        2.5 %       97.5 %
## 1.297831e-05 1.387999e-05
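The interval above is on the log-odds scale per one USD, which is hard to read. A small sketch of converting it to an odds-ratio interval follows; the bounds are copied from the printed output, and the 10,000 USD step is an assumption chosen only for readability.

```r
# 95% profile-likelihood bounds for median_house_value, copied from print(ci)
ci_log_odds <- c(lower = 1.297831e-05, upper = 1.387999e-05)

# Scale to a 10,000 USD increase, then exponentiate to get odds ratios
step_usd <- 10000
ci_or_per_10k <- exp(ci_log_odds * step_usd)
print(round(ci_or_per_10k, 3))
```

The interval for the odds ratio per 10,000 USD runs from roughly 1.14 to 1.15 and excludes 1, which is consistent with the very small p-value for this coefficient.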
library(ggplot2)
ggplot(housing, aes(x = median_house_value, y = highIncome)) +
  geom_point()
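Examine the scatter plot of median_house_value against highIncome. Because the response is binary, the points fall on two horizontal bands at 0 and 1, so the plot shows how the range of house values differs between the two groups rather than a straight-line trend. What logistic regression assumes is that the log-odds of the outcome are linear in the explanatory variable. If the proportion of ‘highIncome’ areas, taken on the log-odds scale, rises steadily with house value, no transformation is required. A clearly curved pattern suggests that a transformation (for example, the log of median_house_value) may be needed.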
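One common way to look at linearity on the log-odds scale is a binned empirical logit plot. The sketch below uses simulated data so that it runs on its own; the variable names `x` and `y`, the simulated coefficients, and the choice of ten bins are all assumptions for illustration. With the real data, `x` would be `housing$median_house_value` and `y` would be `housing$highIncome`.

```r
set.seed(1)

# Simulated stand-in data (hypothetical): x plays the role of house value,
# y is a 0/1 outcome whose log-odds are linear in x
n <- 5000
x <- runif(n, 50000, 500000)
y <- rbinom(n, 1, plogis(-3 + 1.3e-05 * x))

# Split x into 10 equal-count bins (deciles)
bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.1)), include.lowest = TRUE)

# Per-bin mean of x, bin size, and number of 1s
bin_mid <- as.numeric(tapply(x, bins, mean))
bin_n   <- as.numeric(tapply(y, bins, length))
bin_pos <- as.numeric(tapply(y, bins, sum))

# Empirical logit with a 0.5 correction so bins with all 0s or all 1s stay finite
emp_logit <- log((bin_pos + 0.5) / (bin_n - bin_pos + 0.5))

# Roughly straight points support the linearity assumption; a curve argues
# for transforming the predictor
plot(bin_mid, emp_logit, xlab = "Bin mean of x", ylab = "Empirical logit")
abline(lm(emp_logit ~ bin_mid), lty = 2)
```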
### Interpret Coefficients
coefficients <- coef(model)
print(coefficients)
##        (Intercept) housing_median_age        total_rooms         population
##      -3.000333e+00      -5.393415e-02       3.957877e-04      -8.365638e-04
## median_house_value
##       1.342574e-05
The model coefficients give the effect of each explanatory variable on the log-odds of an area being ‘highIncome’:
Housing Median Age: For every one-unit increase in the median age of housing, the log-odds of an area being ‘highIncome’ fall by about 0.054, all else being equal.
Total Rooms: A one-unit increase in the total number of rooms raises the log-odds of an area being ‘highIncome’ by about 0.000396, all else being equal.
Population: For every one-unit increase in population, the log-odds of an area being ‘highIncome’ fall by about 0.000837, all else being equal.
Median House Value: For every one-unit increase in the median house value (in USD), the log-odds of an area being ‘highIncome’ rise by about 0.0000134, all else being equal.
These interpretations describe the direction and size of each effect on the log-odds scale, not directly on the probability. Because the predictors are measured in very different units, the raw magnitudes of the coefficients are not comparable with one another.
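To see how the log-odds translate into a probability, the sketch below evaluates the fitted equation for one made-up area. The coefficients are copied from the `coef(model)` output; the predictor values in `area` are hypothetical and chosen only for illustration. With the fitted object available, `predict(model, newdata, type = "response")` does the same calculation.

```r
# Coefficients copied from print(coefficients) above
coefs <- c(
  intercept          = -3.000333e+00,
  housing_median_age = -5.393415e-02,
  total_rooms        =  3.957877e-04,
  population         = -8.365638e-04,
  median_house_value =  1.342574e-05
)

# Hypothetical area (values invented for this example); the leading 1
# multiplies the intercept
area <- c(
  intercept          = 1,
  housing_median_age = 30,
  total_rooms        = 2000,
  population         = 1200,
  median_house_value = 300000
)

# Linear predictor (log-odds), then the inverse logit to get a probability
log_odds <- sum(coefs * area)
prob <- plogis(log_odds)
print(round(c(log_odds = log_odds, probability = prob), 3))
```

For this invented area the log-odds come out at about -0.80, which corresponds to a predicted probability of roughly 0.31.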
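# Shrink the margins, then draw the standard diagnostic plots
# (residuals vs fitted, Q-Q, scale-location, residuals vs leverage)
par(mar = c(2, 2, 2, 2))
plot(model)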
library(car)
## Loading required package: carData
vif(model)
## housing_median_age        total_rooms         population median_house_value
##           1.392481           8.039720           7.879817           1.288243
The VIF values for housing_median_age (1.39) and median_house_value (1.29) are close to 1, so neither shows meaningful multicollinearity: each is only weakly linearly related to the other predictors.
However, total_rooms (8.04) and population (7.88) both have VIF values above 5, a common rule-of-thumb threshold, indicating substantial multicollinearity. Since both values are high at once, the two variables are most likely correlated with each other.
A VIF near 8 means the standard errors of these two coefficients are roughly 2.8 times larger than they would be without the collinearity (the square root of the VIF), which makes their individual contributions to the response difficult to interpret.
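The idea behind the VIF can be sketched by hand: regress each predictor on the others and compute 1 / (1 - R²). The block below does this on simulated data in which two predictors are deliberately tied together; the data, sample size, and coefficients are all invented for illustration. This is the textbook definition for linear predictors, and `car::vif()` applied to a `glm` works from the fitted model instead, so its values may not match this calculation exactly.

```r
set.seed(42)

# Simulated predictors (hypothetical): total_rooms is built to track
# population closely, housing_median_age is independent of both
n <- 1000
population         <- rnorm(n, mean = 1400, sd = 400)
total_rooms        <- 1.8 * population + rnorm(n, mean = 0, sd = 250)
housing_median_age <- rnorm(n, mean = 29, sd = 12)
predictors <- data.frame(population, total_rooms, housing_median_age)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
vif_by_hand <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 2))
```

In the simulation the two linked variables get large, nearly equal VIFs while the independent one stays close to 1, the same pattern seen for total_rooms and population in the output above.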
### Highlight any issues with the model
A logistic regression model was fitted to classify whether an area falls into the “highIncome” group based on several explanatory variables. All coefficients are highly significant, indicating an association between each predictor and the likelihood of an area being high income. The residual deviance (13977) is well below the null deviance (21619), so the predictors improve substantially on an intercept-only model. The AIC (13987) is only meaningful when compared with the AIC of competing models. The main issue is multicollinearity between ‘total_rooms’ and ‘population’, which must be addressed before their individual effects can be interpreted. The model’s predictive performance still needs to be evaluated, and its assumptions checked. Overall, the model shows promise, but further analysis and testing are required before drawing firm conclusions or applying it in practice.
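The deviance figures in the summary output can be turned into two rough measures of overall fit: a likelihood-ratio test against the intercept-only model and McFadden's pseudo-R². The sketch below uses only the numbers printed by `summary(model)`; it says nothing about out-of-sample predictive accuracy, which would need a separate check such as a held-out test set.

```r
# Deviances and degrees of freedom copied from the summary(model) output
null_dev  <- 21619
null_df   <- 20639
resid_dev <- 13977
resid_df  <- 20635

# Likelihood-ratio test: drop in deviance is chi-squared distributed
# under the null hypothesis that all slopes are zero
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo-R^2; for a 0/1 response the deviance equals
# -2 * log-likelihood, so the ratio of deviances can be used directly
pseudo_r2 <- 1 - resid_dev / null_dev

print(c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value,
        pseudo_r2 = round(pseudo_r2, 3)))
```

The drop in deviance is 7642 on 4 degrees of freedom, and the pseudo-R² is about 0.35.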