# read.csv() is part of base R, so no extra package is needed to load the data
housing <- read.csv("/Users/sharmistaroy/Downloads/housing.csv")
head(housing)
## longitude latitude housing_median_age total_rooms total_bedrooms population
## 1 -122.23 37.88 41 880 129 322
## 2 -122.22 37.86 21 7099 1106 2401
## 3 -122.24 37.85 52 1467 190 496
## 4 -122.25 37.85 52 1274 235 558
## 5 -122.25 37.85 52 1627 280 565
## 6 -122.25 37.85 52 919 213 413
## households median_income median_house_value ocean_proximity
## 1 126 8.3252 452600 NEAR BAY
## 2 1138 8.3014 358500 NEAR BAY
## 3 177 7.2574 352100 NEAR BAY
## 4 219 5.6431 341300 NEAR BAY
## 5 259 3.8462 342200 NEAR BAY
## 6 193 4.0368 269700 NEAR BAY
summary(housing)
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## median_house_value ocean_proximity
## Min. : 14999 Length:20640
## 1st Qu.:119600 Class :character
## Median :179700 Mode :character
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
Create a binary target variable, ‘highIncome’, from the ‘median_income’ column: it is 1 when a house is in an area whose median income is above a chosen threshold (5 here) and 0 otherwise.
threshold <- 5
housing$highIncome <- ifelse(housing$median_income > threshold, 1, 0)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Count how many rows fall in each class of the new target
count <- table(housing$highIncome)
print(count)
##
## 0 1
## 16151 4489
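The counts above can also be read as proportions, which makes the class imbalance easier to see. This is a minimal sketch using the two counts copied from the printed table; the names `counts` and `shares` are introduced here for illustration. With the data loaded, `prop.table(table(housing$highIncome))` should give the same shares.

```r
# Counts copied from the table() output above
counts <- c("0" = 16151, "1" = 4489)

# Share of each class: roughly 78% of rows are 0 and 22% are 1
shares <- counts / sum(counts)
print(round(shares, 3))
```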
model <- glm(highIncome ~ housing_median_age + total_rooms + population + median_house_value, data = housing, family = binomial)
summary(model)
##
## Call:
## glm(formula = highIncome ~ housing_median_age + total_rooms +
## population + median_house_value, family = binomial, data = housing)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.000e+00 7.479e-02 -40.12 <2e-16 ***
## housing_median_age -5.393e-02 1.980e-03 -27.23 <2e-16 ***
## total_rooms 3.958e-04 2.532e-05 15.63 <2e-16 ***
## population -8.366e-04 5.386e-05 -15.53 <2e-16 ***
## median_house_value 1.343e-05 2.300e-07 58.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21619 on 20639 degrees of freedom
## Residual deviance: 13977 on 20635 degrees of freedom
## AIC: 13987
##
## Number of Fisher Scoring iterations: 5
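The significance codes show that all four coefficients are statistically significant (p-value < 0.001), so each variable's association with the likelihood of being in a ‘highIncome’ area is unlikely to be due to chance alone. Significance is not the same as effect size: with 20,640 observations, even small effects can produce very small p-values. The signs give the direction of each association: ‘total_rooms’ and ‘median_house_value’ have positive coefficients, while ‘housing_median_age’ and ‘population’ have negative ones.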
In summary, the coefficients represent how each independent variable affects the log-odds of a house being in a ‘highIncome’ area. Exponentiating a coefficient gives an odds ratio, which is usually easier to interpret: it is the factor by which the odds of the outcome are multiplied for a one-unit increase in the corresponding independent variable, holding the other variables constant.
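To make the odds-ratio interpretation concrete, the sketch below exponentiates the estimates copied from the `summary(model)` output. The names `coefs`, `odds_ratios`, and `or_per_10k` are illustrative, and the 10,000 USD step is an assumed unit chosen only for readability. With the fitted model available, `exp(coef(model))` should give the same ratios up to rounding.

```r
# Estimates copied from the summary(model) output above
coefs <- c(
  housing_median_age = -5.393e-02,
  total_rooms        =  3.958e-04,
  population         = -8.366e-04,
  median_house_value =  1.343e-05
)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(coefs)
print(odds_ratios)

# One dollar is a tiny step for house values, so rescale the
# coefficient to a 10,000 USD increase (an illustrative unit)
or_per_10k <- exp(coefs[["median_house_value"]] * 10000)
print(or_per_10k)  # about 1.14
```

Read this way, a 10,000 USD higher median house value multiplies the odds of being in a ‘highIncome’ area by roughly 1.14, holding the other variables constant.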
ci <- confint(model, "median_house_value", level = 0.95)
## Waiting for profiling to be done...
print(ci)
## 2.5 % 97.5 %
## 1.297831e-05 1.387999e-05
The 95% confidence interval for the ‘median_house_value’ coefficient is approximately [1.30e-05, 1.39e-05]. This interval provides a range of plausible values for the effect of the ‘median_house_value’ variable on the log-odds of being in a ‘highIncome’ area.
At the 95% confidence level, the true population coefficient is estimated to lie within this range. That is, for every one-unit (1 USD) rise in median house value, the log-odds of being in a ‘highIncome’ area are estimated to increase by between 1.30e-05 and 1.39e-05, holding all other variables constant.
Because the whole interval lies above zero, the ‘median_house_value’ variable has a statistically significant positive association with the chance of being in a ‘highIncome’ area. In other words, as the median house value rises, the likelihood of being in a ‘highIncome’ area rises as well.
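As a rough cross-check, the interval can be approximated by hand from the printed estimate and standard error. Note that this is a Wald interval (estimate plus or minus a normal quantile times the standard error), not the profile-likelihood interval that `confint()` computed above, so the two are expected to agree only approximately. The variable names are illustrative, and the 10,000 USD rescaling is an assumed unit.

```r
# Values copied from the summary(model) output above
estimate  <- 1.343e-05
std_error <- 2.300e-07

# Wald 95% interval: estimate +/- z * standard error
z <- qnorm(0.975)
wald_ci <- c(lower = estimate - z * std_error,
             upper = estimate + z * std_error)
print(wald_ci)

# The same interval expressed as odds ratios per 10,000 USD increase
or_ci_per_10k <- exp(wald_ci * 10000)
print(or_ci_per_10k)
```

Both limits land close to the profile-likelihood limits printed by `confint()`, which is typical when the sample is large.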
library(ggplot2)
# Plot the binary outcome against median house value
ggplot(housing, aes(x = median_house_value, y = highIncome)) +
  geom_point()
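Examine the scatter plot for the relationship between the explanatory variable and the binary outcome. Because ‘highIncome’ takes only the values 0 and 1, the points fall into two horizontal bands, so the thing to look at is how the share of 1s changes along the x-axis. Logistic regression assumes that the explanatory variable is linearly related to the log-odds of the outcome; if that holds, no transformation is required. If the log-odds clearly follow a non-linear pattern (e.g., a curve), it may suggest that a transformation of the explanatory variable is required.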
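One common way to check linearity on the log-odds scale is to bin the explanatory variable and plot the empirical logit of each bin. The sketch below uses simulated stand-in data, not the housing data, so that it runs on its own; the variable names, the choice of 10 bins, and the 0.5 continuity correction are all assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-in data whose true log-odds are linear in x
n <- 5000
x <- runif(n, 0, 10)
y <- rbinom(n, 1, plogis(-3 + 0.6 * x))

# Split x into 10 bins with roughly equal numbers of observations
breaks <- quantile(x, probs = seq(0, 1, by = 0.1))
bin <- cut(x, breaks = breaks, include.lowest = TRUE)

bin_mean   <- as.vector(tapply(x, bin, mean))    # mean x in each bin
bin_n      <- as.vector(tapply(y, bin, length))  # observations per bin
bin_events <- as.vector(tapply(y, bin, sum))     # number of 1s per bin

# Empirical logit; adding 0.5 keeps bins with no 0s or no 1s finite
emp_logit <- log((bin_events + 0.5) / (bin_n - bin_events + 0.5))

# Points lying near a straight line support linearity in the log-odds
plot(bin_mean, emp_logit,
     xlab = "Mean of x within bin", ylab = "Empirical logit")
```

To apply the same idea to the report's data, `x` and `y` would be replaced by `housing$median_house_value` and `housing$highIncome`.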