Data Dive Week 11

# read.csv() is base R, so no extra package needs to be loaded
housing <- read.csv("/Users/sharmistaroy/Downloads/housing.csv")

head(housing)

##   longitude latitude housing_median_age total_rooms total_bedrooms population
## 1   -122.23    37.88                 41         880            129        322
## 2   -122.22    37.86                 21        7099           1106       2401
## 3   -122.24    37.85                 52        1467            190        496
## 4   -122.25    37.85                 52        1274            235        558
## 5   -122.25    37.85                 52        1627            280        565
## 6   -122.25    37.85                 52         919            213        413
##   households median_income median_house_value ocean_proximity
## 1        126        8.3252             452600        NEAR BAY
## 2       1138        8.3014             358500        NEAR BAY
## 3        177        7.2574             352100        NEAR BAY
## 4        219        5.6431             341300        NEAR BAY
## 5        259        3.8462             342200        NEAR BAY
## 6        193        4.0368             269700        NEAR BAY

threshold <- 5  

housing$highIncome <- ifelse(housing$median_income > threshold, 1, 0)

Build a logistic regression model:

model <- glm(highIncome ~ housing_median_age + total_rooms + population + median_house_value, data = housing, family = binomial)

summary(model)

## 
## Call:
## glm(formula = highIncome ~ housing_median_age + total_rooms + 
##     population + median_house_value, family = binomial, data = housing)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -3.000e+00  7.479e-02  -40.12   <2e-16 ***
## housing_median_age -5.393e-02  1.980e-03  -27.23   <2e-16 ***
## total_rooms         3.958e-04  2.532e-05   15.63   <2e-16 ***
## population         -8.366e-04  5.386e-05  -15.53   <2e-16 ***
## median_house_value  1.343e-05  2.300e-07   58.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21619  on 20639  degrees of freedom
## Residual deviance: 13977  on 20635  degrees of freedom
## AIC: 13987
## 
## Number of Fisher Scoring iterations: 5

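The significance codes show that every coefficient is statistically significant (p-value < 0.001), so each variable is associated with the likelihood of being in a ‘highIncome’ area after adjusting for the others. With more than 20,000 observations, however, even small effects reach significance, so the p-values alone do not show how strong each association is.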

The coefficients represent how each explanatory variable changes the log-odds of an area being ‘highIncome’. Exponentiating a coefficient gives an odds ratio, which is usually easier to interpret: it is the multiplicative change in the odds of the binary outcome for a one-unit increase in the corresponding variable, holding the other variables constant.
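As a minimal sketch of that exponentiation, the snippet below applies `exp()` to the estimates printed in the summary table. The values are copied by hand so the block runs without the data file; with the fitted object available, `exp(coef(model))` would be the equivalent call. The rescaling to a $10,000 step is an illustrative choice, not part of the original analysis.

```r
# Coefficient estimates copied from the summary(model) output
coefs <- c(
  housing_median_age = -5.393e-02,
  total_rooms        =  3.958e-04,
  population         = -8.366e-04,
  median_house_value =  1.343e-05
)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(coefs)
print(round(odds_ratios, 6))

# A one-dollar step is tiny, so also report the odds ratio
# for a $10,000 increase in median_house_value (illustrative scale)
or_per_10k <- exp(coefs[["median_house_value"]] * 10000)
print(or_per_10k)
```

An odds ratio below 1 (here housing_median_age and population) means the odds fall as the variable rises; a ratio above 1 means the odds rise.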

ci <- confint(model, "median_house_value", level = 0.95)

## Waiting for profiling to be done...

print(ci)

##        2.5 %       97.5 % 
## 1.297831e-05 1.387999e-05
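The "Waiting for profiling" message indicates that `confint()` computed a profile-likelihood interval. As a rough cross-check, a Wald interval can be computed by hand from the estimate and standard error in the summary table. The sketch below uses the rounded values printed there and also moves the interval to the odds-ratio scale for a $10,000 increase, which is an illustrative rescaling rather than something the original analysis does.

```r
est <- 1.343e-05   # estimate for median_house_value (from the summary)
se  <- 2.300e-07   # its standard error (from the summary)

z <- qnorm(0.975)  # about 1.96 for a 95% interval
wald_ci <- c(lower = est - z * se, upper = est + z * se)
print(wald_ci)

# Same interval expressed as an odds ratio per $10,000 increase
or_ci_10k <- exp(wald_ci * 10000)
print(or_ci_10k)
```

Both limits are above zero on the log-odds scale (above 1 as odds ratios), which is consistent with the positive, significant coefficient, and the Wald limits land very close to the profile limits printed above.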

Scatter plot.

library(ggplot2)
ggplot(housing, aes(x = median_house_value, y = highIncome)) +
  geom_point()

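Examine the scatter plot for the relationship between the explanatory variable and the binary outcome. Because the outcome takes only the values 0 and 1, the points fall in two horizontal bands, so the relevant question is whether the log-odds of the outcome change roughly linearly with the explanatory variable. If they do, no transformation is required. If the pattern is clearly non-linear (e.g., a curve), a transformation of the explanatory variable may be needed.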
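One common way to check linearity on the log-odds scale is an empirical-logit plot: bin the predictor, compute the proportion of 1s in each bin, and look at the logit of those proportions. The sketch below uses simulated data (the variable names, coefficients, and sample size are assumptions made for illustration, not the housing data) so that it runs on its own.

```r
set.seed(1)
n <- 5000

# Simulated data (hypothetical): outcome is linear in the log-odds of x
x <- runif(n, min = 50000, max = 500000)
y <- rbinom(n, size = 1, prob = plogis(-3 + 1.3e-05 * x))

# Split x into ten equal-count bins
bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, by = 0.1)),
            include.lowest = TRUE)

events  <- as.numeric(tapply(y, bins, sum))     # number of 1s per bin
totals  <- as.numeric(tapply(y, bins, length))  # observations per bin
bin_mid <- as.numeric(tapply(x, bins, mean))    # mean x per bin

# Empirical logit; the 0.5 offsets keep it finite if a bin is all 0s or all 1s
emp_logit <- log((events + 0.5) / (totals - events + 0.5))

print(data.frame(bin_mid = round(bin_mid), emp_logit = round(emp_logit, 2)))
```

Plotting `bin_mid` against `emp_logit` should give points close to a straight line when the linearity assumption holds; a pronounced bend would point toward transforming the predictor.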

### Interpret Coefficients

coefficients <- coef(model)
print(coefficients)

##        (Intercept) housing_median_age        total_rooms         population 
##      -3.000333e+00      -5.393415e-02       3.957877e-04      -8.365638e-04 
## median_house_value 
##       1.342574e-05

The logistic regression coefficients show the effect of each explanatory variable on the log-odds of a property being in a ‘highIncome’ neighborhood. Here’s a quick rundown of the interpretations:

Housing Median Age: For every one-year increase in the median age of housing, the log-odds of a property being in a ‘highIncome’ neighborhood fall by about 0.054 units, assuming all else remains equal.

Total Rooms: A one-unit increase in the total number of rooms increases the log-odds of a house being in a ‘highIncome’ neighborhood by about 0.000396 units, all else being constant.

Population: For every one-unit increase in population, the log-odds of a property being in a ‘highIncome’ region reduce by around 0.000837 units, assuming all other factors remain constant.

Median House Value: For every one-unit increase in the median house value (in USD), the log-odds of a house being in a ‘highIncome’ area increase by approximately 0.0000134 units, all else being constant.

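These interpretations describe how changes in the explanatory variables shift the log-odds, and therefore the probability, of a residence being in a ‘highIncome’ neighborhood according to the model. The sign of each coefficient gives the direction of the effect and its magnitude gives the size of the effect per unit of the variable. Because the variables are measured on very different scales (years, counts, dollars), the raw magnitudes are not directly comparable with one another.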
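To connect log-odds to an actual probability, the sketch below plugs a hypothetical area into the fitted equation using the printed coefficients and applies the inverse logit (`plogis()`). The predictor values are invented for illustration and are not taken from the data.

```r
# Coefficients copied from the coef(model) output
b <- c(
  intercept          = -3.000333e+00,
  housing_median_age = -5.393415e-02,
  total_rooms        =  3.957877e-04,
  population         = -8.365638e-04,
  median_house_value =  1.342574e-05
)

# Hypothetical area (values invented for illustration)
area <- c(intercept = 1, housing_median_age = 30, total_rooms = 2000,
          population = 1000, median_house_value = 300000)

log_odds <- sum(b * area)     # linear predictor
prob     <- plogis(log_odds)  # inverse logit: 1 / (1 + exp(-log_odds))

# Same area with a median house value that is $100,000 higher
area_hi <- area
area_hi[["median_house_value"]] <- 400000
prob_hi <- plogis(sum(b * area_hi))

print(c(log_odds = log_odds, prob = prob, prob_hi = prob_hi))
```

Because the inverse logit is non-linear, the same $100,000 step changes the probability by different amounts depending on the starting point, even though its effect on the log-odds is constant.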

par(mar = c(2, 2, 2, 2))  
plot(model)

library(car)

## Loading required package: carData

vif(model)

## housing_median_age        total_rooms         population median_house_value 
##           1.392481           8.039720           7.879817           1.288243

The VIF values for housing_median_age and median_house_value are close to one, suggesting that they do not display severe multicollinearity with other variables. In other words, neither is well predicted by a linear combination of the model’s other variables.

However, both total_rooms and population have VIF values of about 8, well above the common rule-of-thumb threshold of 5, indicating possible multicollinearity. These high values imply that the two variables are strongly correlated with one another or with other explanatory variables, plausibly because both scale with the size of the area.

As a result, total_rooms and population carry overlapping information. This inflates the standard errors of their coefficients and makes it difficult to interpret each variable’s unique contribution to the response.
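For a linear model, the VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated predictors (hypothetical names and values, not the housing data). For a `glm`, `car::vif()` works from the estimated coefficient covariance matrix instead, so its numbers need not match this formula exactly; treat the last two lines as an approximate reading of the reported VIFs.

```r
set.seed(42)
n <- 1000

# Simulated predictors (hypothetical; not the housing data)
rooms <- rnorm(n, mean = 2600, sd = 1000)
pop   <- 0.5 * rooms + rnorm(n, mean = 0, sd = 200)  # built to track rooms
age   <- rnorm(n, mean = 28, sd = 12)                # unrelated to both

# VIF for a predictor: regress it on the others, then 1 / (1 - R^2)
vif_by_hand <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vifs <- c(
  rooms = vif_by_hand(rooms, data.frame(pop, age)),
  pop   = vif_by_hand(pop,   data.frame(rooms, age)),
  age   = vif_by_hand(age,   data.frame(rooms, pop))
)
print(round(vifs, 2))

# Approximate R^2 implied by the VIFs reported for total_rooms and population
implied_r2 <- 1 - 1 / c(total_rooms = 8.039720, population = 7.879817)
print(round(implied_r2, 3))
```

Read this way, a VIF near 8 corresponds to roughly 87% of a predictor’s variance being explained by the other predictors, which is why dropping one of the pair or replacing them with a ratio (for example, rooms per person) is a common remedy.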

### Highlight any issues with the model

A logistic regression model has been built to determine whether a home area belongs to the “highIncome” group based on several explanatory variables. The model shows highly significant coefficients for all variables, indicating a link between them and the chance of being a high-income location. The residual deviance (13977) is well below the null deviance (21619), so the explanatory variables improve substantially on an intercept-only model. The AIC (13987) is only meaningful relative to competing models, so by itself it does not show that the fit is good. To improve the interpretation of individual variable effects, the multicollinearity between ‘total_rooms’ and ‘population’ must be addressed. It is also critical to evaluate the model’s predictive performance and to check that the model assumptions are satisfied. Overall, the model shows promise, but further analysis and testing are required before drawing firm conclusions or applying it in practice.
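The deviance and AIC figures printed by `summary(model)` can be turned into more interpretable fit statistics with a few lines of arithmetic. The sketch below copies those figures by hand and computes a likelihood-ratio test against the intercept-only model and McFadden’s pseudo R²; neither statistic appears in the original analysis, so they are offered as one possible way to quantify the fit.

```r
# Figures copied from the summary(model) output
null_dev  <- 21619; null_df  <- 20639
resid_dev <- 13977; resid_df <- 20635

# Likelihood-ratio test against the intercept-only model
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R^2: share of the null deviance removed by the predictors
mcfadden <- 1 - resid_dev / null_dev

# For 0/1 responses the residual deviance equals -2 * log-likelihood,
# so AIC = residual deviance + 2 * (number of estimated coefficients)
aic_check <- resid_dev + 2 * 5

print(c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value,
        mcfadden = mcfadden, aic_check = aic_check))
```

The reconstructed AIC matches the reported 13987, which confirms that the AIC here is just the residual deviance plus a penalty for five coefficients; it becomes informative only when compared with the AIC of an alternative model fitted to the same data.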

Sharmista Kothavadla

2023-11-06
