Data Dive Week 10

# read.csv() is base R, so no extra package is needed to load the file
housing <- read.csv("/Users/sharmistaroy/Downloads/housing.csv")

head(housing)

##   longitude latitude housing_median_age total_rooms total_bedrooms population
## 1   -122.23    37.88                 41         880            129        322
## 2   -122.22    37.86                 21        7099           1106       2401
## 3   -122.24    37.85                 52        1467            190        496
## 4   -122.25    37.85                 52        1274            235        558
## 5   -122.25    37.85                 52        1627            280        565
## 6   -122.25    37.85                 52         919            213        413
##   households median_income median_house_value ocean_proximity
## 1        126        8.3252             452600        NEAR BAY
## 2       1138        8.3014             358500        NEAR BAY
## 3        177        7.2574             352100        NEAR BAY
## 4        219        5.6431             341300        NEAR BAY
## 5        259        3.8462             342200        NEAR BAY
## 6        193        4.0368             269700        NEAR BAY

summary(housing)

##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:20640      
##  1st Qu.:119600     Class :character  
##  Median :179700     Mode  :character  
##  Mean   :206856                       
##  3rd Qu.:264725                       
##  Max.   :500001                       
##

Select an interesting binary column of data, or one which can be reasonably converted into a binary variable

Create a binary target variable from the ‘median_income’ column that flags whether each observation is in a ‘highIncome’ area, defined here as a median_income above a threshold of 5.

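# Rows with median_income above this cutoff are labelled high income
threshold <- 5

# 1 = high income, 0 = otherwise
housing$highIncome <- ifelse(housing$median_income > threshold, 1, 0)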

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

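# Tabulate the two classes of the new target
# (named class_counts so it does not shadow dplyr::count)
class_counts <- table(housing$highIncome)
print(class_counts)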

## 
##     0     1 
## 16151  4489
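
The counts above can also be read as proportions, which makes the class imbalance explicit: 4489 of 20640 rows (about 22%) are labelled 1. The following is a minimal sketch in base R; `class_share` is a hypothetical helper name, not part of the original analysis, and the vector `y` simply rebuilds the printed counts so the snippet runs without the CSV file.

```r
# Hypothetical helper: share of each class in a 0/1 vector.
class_share <- function(x) {
  prop.table(table(x))
}

# Rebuild the counts reported above (16151 zeros, 4489 ones).
y <- rep(c(0, 1), times = c(16151, 4489))
round(class_share(y), 3)
```

With the data loaded, the same helper could be called as `class_share(housing$highIncome)`.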

Build a logistic regression model:

# Logistic regression of highIncome on four predictors
model <- glm(
  highIncome ~ housing_median_age + total_rooms + population + median_house_value,
  data = housing,
  family = binomial
)

summary(model)

## 
## Call:
## glm(formula = highIncome ~ housing_median_age + total_rooms + 
##     population + median_house_value, family = binomial, data = housing)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -3.000e+00  7.479e-02  -40.12   <2e-16 ***
## housing_median_age -5.393e-02  1.980e-03  -27.23   <2e-16 ***
## total_rooms         3.958e-04  2.532e-05   15.63   <2e-16 ***
## population         -8.366e-04  5.386e-05  -15.53   <2e-16 ***
## median_house_value  1.343e-05  2.300e-07   58.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21619  on 20639  degrees of freedom
## Residual deviance: 13977  on 20635  degrees of freedom
## AIC: 13987
## 
## Number of Fisher Scoring iterations: 5

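The significance codes show that every coefficient is highly significant (p-value < 0.001), so each predictor is associated with the log-odds of being in a ‘highIncome’ area after adjusting for the others. Significance alone does not measure the size of an effect. The signs of the estimates show the direction: higher median_house_value and total_rooms raise the log-odds, while higher housing_median_age and population lower them.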

In summary, each coefficient is the change in the log-odds of being in a ‘highIncome’ area for a one-unit increase in the corresponding independent variable, holding the other variables constant. Exponentiating a coefficient gives an odds ratio, which is easier to interpret: it is the factor by which the odds of the binary outcome are multiplied for that one-unit increase.
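
As a sketch of the exponentiation step, the helper below turns a fitted coefficient into an odds ratio. The name `odds_ratio` and its `unit` argument are hypothetical, introduced only for illustration. Because a one-dollar change in median_house_value is tiny, rescaling to a 10,000 USD change is easier to read: exp(1.343e-05 * 10000) is roughly 1.14, i.e. about 14% higher odds per 10,000 USD, other predictors held constant.

```r
# Hypothetical helper: odds ratio for one term of a fitted glm.
# `unit` rescales the predictor, e.g. unit = 10000 gives the odds ratio
# for a 10,000-unit increase instead of a single-unit increase.
odds_ratio <- function(fit, term, unit = 1) {
  unname(exp(coef(fit)[term] * unit))
}

# Arithmetic check using the estimate printed in the summary above.
exp(1.343e-05)          # odds ratio per 1 USD
exp(1.343e-05 * 10000)  # odds ratio per 10,000 USD
```

With the fitted model in memory, the call would be `odds_ratio(model, "median_house_value", unit = 10000)`.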

ci <- confint(model, "median_house_value", level = 0.95)

## Waiting for profiling to be done...

print(ci)

##        2.5 %       97.5 % 
## 1.297831e-05 1.387999e-05

The 95% confidence interval for the ‘median_house_value’ coefficient is approximately [1.30e-05, 1.39e-05]. This interval gives a range of plausible values for the effect of ‘median_house_value’ on the log-odds of being in a ‘highIncome’ area.

We can be 95% confident that the true population coefficient lies within this range. For every one-unit rise in median house value (in USD), the log-odds of being in a ‘highIncome’ area are estimated to increase by between 1.30e-05 and 1.39e-05, holding all other predictors constant.

Because the whole interval lies above zero, ‘median_house_value’ has a statistically significant positive association with being in a ‘highIncome’ area at the 5% level. In other words, as the median house value rises, the estimated probability of being in a ‘highIncome’ area rises as well. This describes an association in the fitted model, not a causal effect.
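
The interval is easier to read on the odds-ratio scale. The sketch below exponentiates the endpoints printed by `confint()` after rescaling to a 10,000 USD change; `ci_to_odds_ratio` is a hypothetical helper name. Under that rescaling the interval is roughly 1.14 to 1.15, meaning the odds of being in a ‘highIncome’ area are estimated to be about 14% to 15% higher per additional 10,000 USD of median house value.

```r
# Hypothetical helper: convert a log-odds interval into an odds-ratio
# interval for a `unit`-sized increase in the predictor.
ci_to_odds_ratio <- function(ci, unit = 1) {
  exp(ci * unit)
}

# Endpoints copied from the confint() output above.
ci_mhv <- c(lower = 1.297831e-05, upper = 1.387999e-05)
ci_to_odds_ratio(ci_mhv, unit = 10000)
```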

Scatter plot.

library(ggplot2)
ggplot(housing, aes(x = median_house_value, y = highIncome)) +
  geom_point()

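Examine the scatter plot for the relationship between the explanatory variable and the binary result. Because highIncome takes only the values 0 and 1, the points fall on two horizontal bands, so the plot cannot show a straight line directly. What logistic regression assumes is that the log-odds of the outcome are linear in the explanatory variable. If the share of 1s rises steadily with median_house_value, no transformation is required. If the pattern is clearly non-linear (e.g., the share rises and then falls), a transformation may be required.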
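
One way to check linearity on the log-odds scale, rather than on the raw 0/1 values, is to bin the explanatory variable and compute an empirical logit per bin. The sketch below is an illustration under stated assumptions, not part of the original analysis: `binned_logit` is a hypothetical helper, the bins are quantile-based, and 0.5 is added to the counts so that bins containing only 0s or only 1s still give a finite value.

```r
# Hypothetical helper: empirical logit of a 0/1 outcome within
# quantile-based bins of a numeric predictor.
binned_logit <- function(x, y, bins = 10) {
  # unique() guards against repeated quantiles when x has many ties
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = bins + 1),
                            na.rm = TRUE))
  grp <- cut(x, breaks = breaks, include.lowest = TRUE)
  n <- tapply(y, grp, length)  # observations per bin
  s <- tapply(y, grp, sum)     # number of 1s per bin
  data.frame(
    x_mean = as.numeric(tapply(x, grp, mean)),
    n      = as.numeric(n),
    logit  = as.numeric(log((s + 0.5) / (n - s + 0.5)))
  )
}
```

With the data loaded, `plot(logit ~ x_mean, data = binned_logit(housing$median_house_value, housing$highIncome))` would draw the binned values; points lying close to a straight line are consistent with the linearity assumption.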

Sharmista Kothavadla

2023-10-30
