The California Housing dataset contains information about housing attributes in various regions of California. It is often used for regression analysis and predictive modeling tasks. The dataset consists of the following columns:

median_house_value: Median house value for California districts (target variable).
housing_median_age: Median age of housing units in a district.
total_rooms: Total number of rooms in a district.
total_bedrooms: Total number of bedrooms in a district.
population: Total population in a district.
households: Total number of households in a district.
median_income: Median income of households in a district.
latitude: Latitude coordinate of the district’s location.
longitude: Longitude coordinate of the district’s location.
ocean_proximity: Proximity of the district to the ocean (categorical variable).

This dataset is used for various analyses, including understanding housing market trends, predicting house prices, and studying the impact of socioeconomic factors on housing.

The dataset has a few missing values: 207 in total_bedrooms and none in the other columns.

Goal Proposal:

The goal of this project is to analyze the California housing dataset to gain insights into housing prices and factors influencing them. We aim to provide valuable information to potential homebuyers in California, as well as to understand the relationships between housing attributes.

# read.csv() is base R, so no extra package is needed to load the file
housing <- read.csv("/Users/sharmistaroy/Downloads/housing.csv")

head(housing)
##   longitude latitude housing_median_age total_rooms total_bedrooms population
## 1   -122.23    37.88                 41         880            129        322
## 2   -122.22    37.86                 21        7099           1106       2401
## 3   -122.24    37.85                 52        1467            190        496
## 4   -122.25    37.85                 52        1274            235        558
## 5   -122.25    37.85                 52        1627            280        565
## 6   -122.25    37.85                 52         919            213        413
##   households median_income median_house_value ocean_proximity
## 1        126        8.3252             452600        NEAR BAY
## 2       1138        8.3014             358500        NEAR BAY
## 3        177        7.2574             352100        NEAR BAY
## 4        219        5.6431             341300        NEAR BAY
## 5        259        3.8462             342200        NEAR BAY
## 6        193        4.0368             269700        NEAR BAY
summary(housing)
##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:20640      
##  1st Qu.:119600     Class :character  
##  Median :179700     Mode  :character  
##  Mean   :206856                       
##  3rd Qu.:264725                       
##  Max.   :500001                       
## 
str(housing)
## 'data.frame':    20640 obs. of  10 variables:
##  $ longitude         : num  -122 -122 -122 -122 -122 ...
##  $ latitude          : num  37.9 37.9 37.9 37.9 37.9 ...
##  $ housing_median_age: num  41 21 52 52 52 52 52 52 42 52 ...
##  $ total_rooms       : num  880 7099 1467 1274 1627 ...
##  $ total_bedrooms    : num  129 1106 190 235 280 ...
##  $ population        : num  322 2401 496 558 565 ...
##  $ households        : num  126 1138 177 219 259 ...
##  $ median_income     : num  8.33 8.3 7.26 5.64 3.85 ...
##  $ median_house_value: num  452600 358500 352100 341300 342200 ...
##  $ ocean_proximity   : chr  "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...

###Exploratory Data Analysis

##What is the distribution of housing prices?

hist(housing$median_house_value, 
     main = "Distribution of Housing Prices",
     xlab = "Median House Value",
     col = "skyblue",  # You can choose your preferred color
     border = "black",
     breaks = 30  # Adjust the number of breaks/bins as needed
)

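The histogram shows that the distribution of median house values is right-skewed: the median is $179,700 and the mean is $206,856, so most districts have values around or below $200,000 while a long tail extends toward expensive districts. The maximum of $500,001 suggests that values may be capped at that level, which would appear as a spike in the last bin.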
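The skew and a possible ceiling can also be checked numerically. The following is a minimal sketch on simulated data: the `values` vector is a synthetic stand-in (an assumption for illustration), and with the real data one would use housing$median_house_value instead.

```r
# Synthetic stand-in for housing$median_house_value (illustration only):
# log-normal values with a ceiling at 500001.
set.seed(1)
values <- pmin(rlnorm(5000, meanlog = 12.1, sdlog = 0.55), 500001)

# In a right-skewed distribution the mean exceeds the median.
mean_minus_median <- mean(values) - median(values)

# Share of observations sitting exactly at the maximum value.
share_at_cap <- mean(values == max(values))

print(mean_minus_median)
print(share_at_cap)
```

A clearly positive difference points to right skew, and a non-trivial share of observations at the maximum hints at capping rather than a few genuine outliers.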

library(ggplot2)


options(repr.plot.width=11.7, repr.plot.height=8.27)


ggplot(housing, aes(x = ocean_proximity, fill = ocean_proximity)) +
  geom_bar() +
  labs(
    title = "Frequency of Ocean Proximity",
    x = "Ocean Proximity",
    y = "Frequency"
  ) +
  scale_fill_viridis_d() + 
  theme_minimal()

‘<1H OCEAN’ is the most common category, appearing 9136 times. ‘INLAND’ is the second most frequent category with 6551 occurrences. ‘NEAR OCEAN’ and ‘NEAR BAY’ have intermediate frequencies. ‘ISLAND’ is the least common category, occurring only 5 times.

library(ggplot2)


options(repr.plot.width=28.7, repr.plot.height=10.27)

# Plot: one bar per median age; a single fill color avoids a 52-entry legend
ggplot(housing, aes(x = housing_median_age)) +
  geom_bar(fill = "skyblue") +
  labs(
    title = "Frequency of Median Housing Age",
    x = "Median Housing Age (years)",
    y = "Number of Districts"
  ) +
  theme_minimal()

The bar chart shows the distribution of housing_median_age across California districts. Each bar counts districts, not individual houses. The distribution is multi-modal, with notable peaks at 15-17 years, which may reflect a construction boom in that period. The number of districts generally declines beyond a median age of about 30 years, which could indicate a shrinking older housing stock or renewal efforts in those areas. Few districts have a very low median age (1-5 years), suggesting that districts dominated by recent construction are rare. Values stop at 52 years, the maximum in the summary above, so the oldest districts may be grouped at that value. Overall, the districts span a wide range of housing ages, consistent with an established residential landscape and varied housing conditions.

Checking for and removing NA values.

na<-colSums(is.na(housing))
print(na)
##          longitude           latitude housing_median_age        total_rooms 
##                  0                  0                  0                  0 
##     total_bedrooms         population         households      median_income 
##                207                  0                  0                  0 
## median_house_value    ocean_proximity 
##                  0                  0

There are 207 NA values in total_bedrooms; no other column has missing values.
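Dropping incomplete rows is one way to handle the missing total_bedrooms values; imputing the median is a common alternative that keeps every row. The sketch below compares the two on a small toy data frame with hypothetical values (`toy`, `toy_dropped`, and `toy_imputed` are names introduced here for illustration only).

```r
# Toy data frame (hypothetical values) with missing total_bedrooms.
toy <- data.frame(
  total_rooms    = c(880, 7099, 1467, 1274, 1627),
  total_bedrooms = c(129, NA, 190, 235, NA)
)

# Option 1: drop incomplete rows, as na.omit() does.
toy_dropped <- na.omit(toy)

# Option 2: replace each NA with the median of the observed values.
toy_imputed <- toy
bedroom_median <- median(toy$total_bedrooms, na.rm = TRUE)
toy_imputed$total_bedrooms[is.na(toy_imputed$total_bedrooms)] <- bedroom_median

print(nrow(toy_dropped))
print(toy_imputed)
```

With only 207 of 20640 rows affected, either choice is unlikely to change the overall picture much; this analysis drops the rows.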

housing_na<-na.omit(housing)
cor_matrix <- cor(housing_na[,-10])


options(repr.plot.width=10, repr.plot.height=8)


heatmap(cor_matrix, 
        symm = TRUE,     # the correlation matrix is symmetric
        scale = "none",  # plot raw correlations; the default rescales each row
        col = colorRampPalette(c("blue", "white", "red"))(50),  # Color scale
        main = "Correlation Heatmap",
        xlab = "Variables",
        ylab = "Variables",
        margins = c(8,8),  # Adjust margins for labels
        cexRow = 0.7,  # Adjust row label size
        cexCol = 0.7   # Adjust column label size
)

The correlation heatmap reveals several relationships across California districts. Longitude and latitude are strongly negatively correlated, which reflects the geographic layout of the state rather than any housing effect. Total rooms, total bedrooms, population, and households are highly positively correlated with one another, because all four are district totals that grow with the size of the district. Median income and median house value show a moderate positive correlation, in line with economic expectations. Housing age is only weakly correlated with the other variables, with a slight negative correlation with room count and population, hinting that districts with older housing tend to be somewhat smaller. Latitude and longitude show little linear correlation with income and house value. This does not rule out location effects, since a linear correlation with coordinates cannot capture regional clusters.
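A heatmap gives the overall pattern, but the correlations with the target are often easier to read as sorted numbers. The sketch below uses simulated columns (assumptions for illustration, not the real data); on the real data the same two lines would be applied to cor_matrix.

```r
set.seed(42)
n <- 500
# Simulated columns (illustration only, not the housing data).
sim <- data.frame(median_income = rnorm(n, mean = 4, sd = 2))
sim$households <- rpois(n, lambda = 500)
sim$population <- sim$households * 3 + rnorm(n, sd = 100)
sim$median_house_value <- 40000 * sim$median_income + rnorm(n, sd = 80000)

cor_sim <- cor(sim)

# Correlation of every other variable with the target, largest first.
target_cor <- sort(cor_sim[, "median_house_value"], decreasing = TRUE)
target_cor <- target_cor[names(target_cor) != "median_house_value"]
print(round(target_cor, 2))
```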

Let's explore the relationship between population and median_house_value.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
ggplot(housing, aes(x = population, y = median_house_value, color = ocean_proximity)) +
  geom_point(alpha = 0.4) + 
  scale_color_viridis_d(option = "rocket") +  
  labs(
    title = "How does price of house change with population and ocean proximity",
    x = "Population",
    y = "House Price"
  ) +
  theme_minimal()

We observe that most INLAND districts have a median house value between $30,000 and $160,000. Most NEAR BAY, <1H OCEAN, and NEAR OCEAN districts are around $300,000-$500,000. ISLAND has only 5 districts, so the group is too small to support conclusions about its price range.

library(ggplot2)
library(ggExtra)


p <- ggplot(housing, aes(x = median_income, y = median_house_value, color = ocean_proximity)) +
  geom_point(alpha = 0.6) +
  scale_color_viridis_d(option = "rocket") +
  theme_minimal() +
  labs(x = "Median Income", y = "Median House Value", title = "How does income and ocean proximity affect the house price")


p_final <- ggExtra::ggMarginal(p, type = "histogram", fill = "grey")


print(p_final)

A positive association exists between median income and house value; as income grows, so does house value. The marginal histogram at the top of the plot shows the distribution of median income (the x variable), and the right-hand marginal histogram shows the distribution of median house value (the y variable), which is right-skewed: lower-value districts are concentrated and high-value districts are fewer. Ocean proximity appears only as point color, not in the marginal histograms. The scatter plot does not show a clear pattern in which ocean proximity alone has a substantial influence on house value; income appears to be the stronger predictor among the variables shown. This suggests that, while proximity to the water may have an effect on house values, median income is the more important factor.

###Hypothesis 1: Median Household Income and House Value Correlation

correlation_result <- cor.test(housing$median_income, housing$median_house_value)


print(correlation_result)
## 
##  Pearson's product-moment correlation
## 
## data:  housing$median_income and housing$median_house_value
## t = 136.22, df = 20638, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6808236 0.6951920
## sample estimates:
##       cor 
## 0.6880752

The results of the Pearson correlation test provide strong evidence to reject the null hypothesis of zero correlation. The extremely low p-value (< 2.2e-16) indicates a significant correlation between median household income and median house value in California. The positive correlation coefficient (r ≈ 0.69) indicates that as median household income increases, median house value tends to increase as well. The 95% confidence interval [0.681, 0.695] supports this finding, as it excludes zero.
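The t statistic in the output follows directly from the correlation coefficient and the degrees of freedom, via t = r * sqrt(df / (1 - r^2)). As a quick check, this sketch recomputes it from the two values printed by cor.test() above; the variable names are chosen here for illustration.

```r
r  <- 0.6880752   # sample correlation reported by cor.test()
df <- 20638       # degrees of freedom reported by cor.test() (n - 2)

# t statistic for testing the null hypothesis of zero correlation
t_stat <- r * sqrt(df / (1 - r^2))

# Share of variance in house value linearly associated with income
r_squared <- r^2

print(round(t_stat, 2))
print(round(r_squared, 3))
```

The squared correlation is a useful companion to the p-value: with more than 20,000 districts even a weak correlation would be significant, so the size of r matters more than the significance level.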

###Hypothesis 2: Proximity to the Ocean Affects House Prices

contingency_table <- table(housing$ocean_proximity, housing$median_house_value)


chi_squared_result <- chisq.test(contingency_table)
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
print(chi_squared_result)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 20213, df = 15364, p-value < 2.2e-16

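The chi-squared test for independence returns an extremely low p-value, which nominally rejects the null hypothesis of no association between ocean proximity and median house value. The result should be read with caution, however. median_house_value is continuous, so the contingency table has one column per distinct value (df = 15364 = 4 × 3841 implies 3,842 distinct values across 5 categories) and most cells are nearly empty, which is why R warns that the chi-squared approximation may be incorrect. A test designed for comparing a numeric variable across groups, such as one-way ANOVA or the Kruskal-Wallis test, would be better suited to this hypothesis.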
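When the outcome is numeric and the predictor is categorical, a rank-based comparison across groups is one common alternative to a contingency-table test. The sketch below runs a Kruskal-Wallis test on simulated data; the `sim` data frame and its group medians are assumptions for illustration, not results from the housing data. On the real data the same call would use data = housing.

```r
set.seed(7)
# Synthetic stand-in for the housing data (illustration only):
# three proximity groups with different typical house values.
sim <- data.frame(
  ocean_proximity = factor(rep(c("INLAND", "NEAR BAY", "<1H OCEAN"), each = 200)),
  median_house_value = c(
    rlnorm(200, meanlog = log(110000), sdlog = 0.4),
    rlnorm(200, meanlog = log(230000), sdlog = 0.4),
    rlnorm(200, meanlog = log(215000), sdlog = 0.4)
  )
)

# Rank-based test of whether the groups share the same distribution.
kw <- kruskal.test(median_house_value ~ ocean_proximity, data = sim)
print(kw)

# Group medians show the direction and size of the differences.
group_medians <- tapply(sim$median_house_value, sim$ocean_proximity, median)
print(round(group_medians))
```

Reporting group medians alongside the test keeps the focus on how large the differences are, not only on whether they are statistically detectable.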

###Predictive Analysis(forecasting)

##Logistic regression model

Let's split median income into two classes with a binary highIncome indicator: 1 when median_income is above the threshold of 5, and 0 otherwise.

threshold <- 5  

housing$highIncome <- ifelse(housing$median_income > threshold, 1, 0)
count<- table(housing$highIncome)
print(count)
## 
##     0     1 
## 16151  4489
model <- glm(highIncome ~ housing_median_age + total_rooms + population + median_house_value, data = housing, family = binomial)
summary(model)
## 
## Call:
## glm(formula = highIncome ~ housing_median_age + total_rooms + 
##     population + median_house_value, family = binomial, data = housing)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -3.000e+00  7.479e-02  -40.12   <2e-16 ***
## housing_median_age -5.393e-02  1.980e-03  -27.23   <2e-16 ***
## total_rooms         3.958e-04  2.532e-05   15.63   <2e-16 ***
## population         -8.366e-04  5.386e-05  -15.53   <2e-16 ***
## median_house_value  1.343e-05  2.300e-07   58.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21619  on 20639  degrees of freedom
## Residual deviance: 13977  on 20635  degrees of freedom
## AIC: 13987
## 
## Number of Fisher Scoring iterations: 5
ci <- confint(model, "median_house_value", level = 0.95)
## Waiting for profiling to be done...
print(ci)
##        2.5 %       97.5 % 
## 1.297831e-05 1.387999e-05

The 95% confidence interval for the ‘median_house_value’ coefficient is [1.30e-05, 1.39e-05]. This interval provides a range of plausible values for the effect of ‘median_house_value’ on the log-odds of a district being classified as ‘highIncome’.

With 95% confidence, we may say that the true population coefficient falls within this range. This indicates that for every one-unit rise in median house value (one USD), the log-odds of a district being ‘highIncome’ are estimated to increase by between 1.30e-05 and 1.39e-05, with all other predictors held constant.

The fact that the whole interval is above zero indicates that ‘median_house_value’ has a statistically significant positive association with the chance of a district being ‘highIncome’. In other words, districts with higher median house values are more likely to be ‘highIncome’ districts, and this association is statistically significant.
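A per-dollar coefficient is hard to interpret, so it can help to rescale it to a larger step and convert it to an odds ratio. The sketch below does this for a $100,000 increase, using the estimate and interval printed above; the step size and the variable names are choices made here for illustration.

```r
beta    <- 1.343e-05                      # estimate from summary(model)
ci_beta <- c(1.297831e-05, 1.387999e-05)  # 95% interval from confint()
step    <- 100000                         # rescale to a $100,000 increase

# exp(coefficient * step) is the multiplicative change in the odds
odds_ratio    <- exp(beta * step)
odds_ratio_ci <- exp(ci_beta * step)

print(round(odds_ratio, 2))
print(round(odds_ratio_ci, 2))
```

Under this model, a $100,000 higher median house value multiplies the odds of a district being ‘highIncome’ by a factor of roughly 3.8, holding the other predictors constant.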

Multiple linear regression:

multi_var_model <- lm(median_house_value ~ median_income + population + total_bedrooms, data = housing)

summary(multi_var_model)
## 
## Call:
## lm(formula = median_house_value ~ median_income + population + 
##     total_bedrooms, data = housing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -542542  -54418  -13494   37880  744280 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    41074.909   1493.752   27.50   <2e-16 ***
## median_income  42105.204    299.954  140.37   <2e-16 ***
## population       -34.227      1.049  -32.62   <2e-16 ***
## total_bedrooms    95.868      2.822   33.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81410 on 20429 degrees of freedom
##   (207 observations deleted due to missingness)
## Multiple R-squared:  0.5028, Adjusted R-squared:  0.5027 
## F-statistic:  6885 on 3 and 20429 DF,  p-value: < 2.2e-16
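
The summary above describes the fit on the same rows used to estimate the model. One common way to gauge how well such a model predicts unseen districts is to hold out part of the data and compute the root mean squared error (RMSE) there. The sketch below illustrates the idea on simulated data; `sim`, its coefficients, and the 80/20 split are assumptions for illustration, not part of the original analysis.

```r
set.seed(123)
n <- 1000
# Simulated data (illustration only; not the housing data).
sim <- data.frame(
  median_income = runif(n, 0.5, 15),
  population    = rpois(n, 1400)
)
sim$median_house_value <- 45000 + 42000 * sim$median_income -
  30 * sim$population + rnorm(n, sd = 80000)

# 80/20 train/test split
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit  <- lm(median_house_value ~ median_income + population, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on rows the model has not seen
rmse <- sqrt(mean((test$median_house_value - pred)^2))
print(round(rmse))
```

A hold-out RMSE close to the residual standard error suggests the model is not overfitting, which is expected for a linear model with few predictors.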

###Conclusion :

In summary, the multiple linear regression model is statistically significant and indicates that median income, population, and total number of bedrooms are all significant predictors of median house value. However, the model explains only about half of the variance (R-squared = 0.50), and the residual standard error of about $81,410 shows that a large share of the variation in house values remains unexplained. Further research and model refinement are needed to account for other factors that affect house prices.

For further analysis and more accurate prediction, we can consider random forest and decision tree regression.