The California Housing dataset contains information about housing attributes in various regions of California. It is often used for regression analysis and predictive modeling tasks. The dataset consists of the following columns:
median_house_value: Median house value for California districts (target variable).
housing_median_age: Median age of housing units in a district.
total_rooms: Total number of rooms in a district.
total_bedrooms: Total number of bedrooms in a district.
population: Total population in a district.
households: Total number of households in a district.
median_income: Median income of households in a district.
latitude: Latitude coordinate of the district’s location.
longitude: Longitude coordinate of the district’s location.
ocean_proximity: Proximity of the district to the ocean (categorical variable).
This dataset is used for various analyses, including understanding housing market trends, predicting house prices, and studying the impact of socioeconomic factors on housing.
The dataset has a small number of missing values: 207, all in total_bedrooms.
Goal Proposal:
The goal of this project is to analyze the California housing dataset to gain insights into housing prices and factors influencing them. We aim to provide valuable information to potential homebuyers in California, as well as to understand the relationships between housing attributes.
# read.csv() is part of base R, so no extra package is needed to load the file
housing <- read.csv("/Users/sharmistaroy/Downloads/housing.csv")
head(housing)
## longitude latitude housing_median_age total_rooms total_bedrooms population
## 1 -122.23 37.88 41 880 129 322
## 2 -122.22 37.86 21 7099 1106 2401
## 3 -122.24 37.85 52 1467 190 496
## 4 -122.25 37.85 52 1274 235 558
## 5 -122.25 37.85 52 1627 280 565
## 6 -122.25 37.85 52 919 213 413
## households median_income median_house_value ocean_proximity
## 1 126 8.3252 452600 NEAR BAY
## 2 1138 8.3014 358500 NEAR BAY
## 3 177 7.2574 352100 NEAR BAY
## 4 219 5.6431 341300 NEAR BAY
## 5 259 3.8462 342200 NEAR BAY
## 6 193 4.0368 269700 NEAR BAY
summary(housing)
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## median_house_value ocean_proximity
## Min. : 14999 Length:20640
## 1st Qu.:119600 Class :character
## Median :179700 Mode :character
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
str(housing)
## 'data.frame': 20640 obs. of 10 variables:
## $ longitude : num -122 -122 -122 -122 -122 ...
## $ latitude : num 37.9 37.9 37.9 37.9 37.9 ...
## $ housing_median_age: num 41 21 52 52 52 52 52 52 42 52 ...
## $ total_rooms : num 880 7099 1467 1274 1627 ...
## $ total_bedrooms : num 129 1106 190 235 280 ...
## $ population : num 322 2401 496 558 565 ...
## $ households : num 126 1138 177 219 259 ...
## $ median_income : num 8.33 8.3 7.26 5.64 3.85 ...
## $ median_house_value: num 452600 358500 352100 341300 342200 ...
## $ ocean_proximity : chr "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...
###Exploratory Data Analysis
##What is the distribution of housing prices?
hist(housing$median_house_value,
main = "Distribution of Housing Prices",
xlab = "Median House Value",
col = "skyblue", # You can choose your preferred color
border = "black",
breaks = 30 # Adjust the number of breaks/bins as needed
)
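The histogram shows that the distribution of median house values is right-skewed. The median is about $179,700 and the mean about $206,900, so most districts have median values at or below roughly $200,000, with a long tail of more expensive districts. The maximum of $500,001 in the summary suggests that the highest values may be capped.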
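A quick numeric check can back up the visual impression of right skew. The sketch below uses a small made-up vector (`values` is hypothetical, not taken from housing.csv) and compares the mean with the median. A mean well above the median is an informal sign of a right tail, and the share of observations sitting exactly at the maximum hints at possible capping.

```r
# Hypothetical values for illustration only; in the analysis this would be
# housing$median_house_value
values <- c(90000, 120000, 150000, 180000, 210000, 260000, 500001)

mean_value   <- mean(values)
median_value <- median(values)

# Mean above median is an informal indicator of right skew
right_skewed <- mean_value > median_value

# Share of observations sitting exactly at the maximum (a possible sign of capping)
share_at_max <- mean(values == max(values))

print(c(mean = mean_value, median = median_value))
print(right_skewed)
print(share_at_max)
```

This is only a rough diagnostic; it does not replace looking at the histogram itself.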
library(ggplot2)
options(repr.plot.width=11.7, repr.plot.height=8.27)
ggplot(housing, aes(x = ocean_proximity, fill = ocean_proximity)) +
geom_bar() +
labs(
title = "Frequency of Ocean Proximity",
x = "Ocean Proximity",
y = "Frequency"
) +
scale_fill_viridis_d() +
theme_minimal()
‘<1H OCEAN’ is the most common category, appearing 9136 times.
‘INLAND’ is the second most frequent category with 6551 occurrences.
‘NEAR OCEAN’ and ‘NEAR BAY’ have intermediate frequencies. ‘ISLAND’ is
the least common category, occurring only 5 times.
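The category counts quoted here can also be obtained as a table instead of being read off the bar chart. A minimal sketch, assuming a short hypothetical vector `proximity` in place of housing$ocean_proximity:

```r
# Hypothetical category labels for illustration; in the analysis this would be
# housing$ocean_proximity
proximity <- c("INLAND", "<1H OCEAN", "<1H OCEAN", "NEAR BAY",
               "INLAND", "<1H OCEAN", "NEAR OCEAN", "ISLAND")

# Counts per category, sorted from most to least frequent
counts <- sort(table(proximity), decreasing = TRUE)

# Convert counts to percentages of all rows
shares <- round(100 * prop.table(counts), 1)

print(counts)
print(shares)
```

Reporting percentages alongside counts makes it easier to see how rare a category such as ‘ISLAND’ is relative to the others.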
library(ggplot2)
options(repr.plot.width=28.7, repr.plot.height=10.27)
# Plot
ggplot(housing, aes(x = housing_median_age, fill = factor(housing_median_age))) +
geom_bar() +
labs(
title = "Frequency of the Age of the House",
x = "Age of House"
) +
theme_minimal()
The bar chart shows the distribution of housing_median_age across California districts. It appears multi-modal, with notable peaks at 15-17 years, which may point to a construction boom in that period. The number of districts declines progressively beyond a median age of about 30 years, which could reflect a shrinking older housing stock or renewal efforts, although the plot alone cannot distinguish between these. Few districts have very new housing (median age 1-5 years), suggesting that recent construction has been limited. The values top out at 52 (the maximum in the summary), so the oldest housing may be grouped at that value. Overall, the wide range of housing ages is consistent with an established residential stock.
Checking for and removing NA values.
na<-colSums(is.na(housing))
print(na)
## longitude latitude housing_median_age total_rooms
## 0 0 0 0
## total_bedrooms population households median_income
## 207 0 0 0
## median_house_value ocean_proximity
## 0 0
There are 207 NA values, all in total_bedrooms.
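Rows with missing values can either be dropped or filled in. The sketch below contrasts the two options on a toy data frame (`toy` and its values are made up for illustration). Median imputation is shown as one possible alternative to dropping rows, not as the method used in this analysis.

```r
# Toy data frame for illustration only (not the real housing data)
toy <- data.frame(
  total_bedrooms = c(129, NA, 190, 235, NA, 213),
  households     = c(126, 1138, 177, 219, 259, 193)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))

# Option 1: drop incomplete rows
toy_dropped <- na.omit(toy)

# Option 2: replace NAs with the column median, keeping every row
toy_imputed <- toy
bedroom_median <- median(toy$total_bedrooms, na.rm = TRUE)
toy_imputed$total_bedrooms[is.na(toy_imputed$total_bedrooms)] <- bedroom_median

print(na_counts)
print(nrow(toy_dropped))
print(toy_imputed)
```

With only 207 of 20640 rows affected, dropping rows loses little data, so either choice is defensible here.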
housing_na<-na.omit(housing)
cor_matrix <- cor(housing_na[,-10])
options(repr.plot.width=10, repr.plot.height=8)
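heatmap(cor_matrix,
  symm = TRUE, # Treat the matrix as symmetric so rows and columns share one ordering
  scale = "none", # Plot the raw correlations; heatmap() otherwise rescales each row
  col = colorRampPalette(c("blue", "white", "red"))(50), # Color scale
  main = "Correlation Heatmap",
  xlab = "Variables",
  ylab = "Variables",
  margins = c(8,8), # Adjust margins for labels
  cexRow = 0.7, # Adjust row label size
  cexCol = 0.7 # Adjust column label size
)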
The correlation heatmap reveals several relationships across California districts. Longitude and latitude are strongly negatively correlated, which reflects the state's northwest-to-southeast geographic layout. Total rooms, total bedrooms, population, and households are highly positively correlated with one another; because these are district-level totals, this mainly indicates that larger districts have more of everything, not that larger houses accommodate more people. Median income and median house value show a moderate positive correlation, in line with economic expectations. Housing age is only weakly correlated with the other variables, with a slight negative correlation with room count and population, hinting that districts with older housing tend to be somewhat smaller. Latitude and longitude show little linear correlation with the socio-economic indicators; this rules out a simple linear trend only, not location effects in general.
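Reading exact values off a heatmap is difficult, so it can help to list the variable pairs ranked by correlation strength. A minimal sketch on simulated data follows; the data frame `toy`, its column relationships, and the noise levels are all assumptions made for illustration.

```r
# Simulated data for illustration only; column names mirror the housing data
set.seed(1)
n <- 200
households <- rnorm(n, mean = 500, sd = 150)
toy <- data.frame(
  households    = households,
  population    = 3 * households + rnorm(n, sd = 100),
  median_income = rnorm(n, mean = 4, sd = 1.5)
)
toy$median_house_value <- 40000 * toy$median_income + rnorm(n, sd = 50000)

cor_matrix <- cor(toy)

# Keep each pair once by taking the upper triangle of the matrix
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)

# Sort by absolute correlation, strongest first
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs)
```

Applied to the real correlation matrix, the same steps would give a ranked table to cite next to the heatmap.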
Let's explore the relationship between population and median_house_value.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ggplot(housing, aes(x = population, y = median_house_value, color = ocean_proximity)) +
geom_point(alpha = 0.4) +
scale_color_viridis_d(option = "rocket") +
labs(
title = "How does price of house change with population and ocean proximity",
x = "Population",
y = "House Price"
) +
theme_minimal()
We observe that most INLAND districts have a median house value between
roughly 30,000 and 160,000, whereas many NEAR BAY, <1H OCEAN, and NEAR
OCEAN districts lie around 300,000-500,000. ISLAND has only 5 districts,
so no reliable price pattern can be read from it.
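Price ranges per category are hard to judge from overlapping points, so a per-group summary can complement the scatter plot. The sketch below uses a toy data frame with made-up values and computes the median house value and the number of rows in each ocean_proximity category.

```r
# Toy data for illustration only (values are made up)
toy <- data.frame(
  ocean_proximity    = c("INLAND", "INLAND", "INLAND", "NEAR BAY",
                         "NEAR BAY", "<1H OCEAN", "<1H OCEAN"),
  median_house_value = c(80000, 120000, 150000, 340000,
                         450000, 210000, 300000)
)

# Median and count of house values within each proximity category
group_median <- aggregate(median_house_value ~ ocean_proximity, data = toy, FUN = median)
group_n      <- aggregate(median_house_value ~ ocean_proximity, data = toy, FUN = length)

summary_table <- data.frame(
  ocean_proximity = group_median$ocean_proximity,
  median_value    = group_median$median_house_value,
  n               = group_n$median_house_value
)
print(summary_table)
```

Showing the group size next to each median flags categories that are too small to interpret.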
library(ggplot2)
library(ggExtra)
p <- ggplot(housing, aes(x = median_income, y = median_house_value, color = ocean_proximity)) +
geom_point(alpha = 0.6) +
scale_color_viridis_d(option = "rocket") +
theme_minimal() +
labs(x = "Median Income", y = "Median House Value", title = "How does income and ocean proximity affect the house price")
p_final <- ggExtra::ggMarginal(p, type = "histogram", fill = "grey")
print(p_final)
A positive association exists between median income and house value; as
income grows, so does house value. The marginal histogram at the top of
the plot shows the distribution of median income, which is right-skewed:
most districts have lower incomes, with few high-income outliers. The
right-hand marginal histogram shows the distribution of median house
values, which is also right-skewed. Ocean proximity appears only as the
point color, not in the marginal histograms. The scatter plot does not
show a clear pattern in which ocean proximity alone has a substantial
influence on property value; income appears to be the stronger predictor
among the variables shown. This suggests that, while proximity to the
water may have an effect on house values, median income is a more
important factor.
###Hypothesis 1: Median Household Income and House Value Correlation
correlation_result <- cor.test(housing$median_income, housing$median_house_value)
print(correlation_result)
##
## Pearson's product-moment correlation
##
## data: housing$median_income and housing$median_house_value
## t = 136.22, df = 20638, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6808236 0.6951920
## sample estimates:
## cor
## 0.6880752
The results of the Pearson correlation test provide strong evidence to reject the null hypothesis of no correlation. The extremely low p-value indicates a statistically significant correlation between median household income and median house value in California. The correlation coefficient of 0.69 is positive, indicating that as median household income increases, median house value tends to increase as well. The 95% confidence interval for the correlation coefficient, [0.68, 0.70], supports this finding, as it excludes zero.
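The reported test statistic can be cross-checked by hand from the correlation and the degrees of freedom, using t = r * sqrt(df) / sqrt(1 - r^2). The sketch below plugs in the values printed in the cor.test() output and also computes r squared, the share of variance in one variable that is linearly associated with the other.

```r
# Values copied from the cor.test() output above
r  <- 0.6880752
df <- 20638

# Share of variance in one variable linearly associated with the other
r_squared <- r^2

# t statistic for the null hypothesis that the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

print(round(r_squared, 3))
print(round(t_stat, 2))
```

The result should match the reported t of about 136.22, and an r squared of roughly 0.47 indicates that income alone accounts for a little under half of the variation in house values.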
###Hypothesis 2: Proximity to the Ocean Affects House Prices
contingency_table <- table(housing$ocean_proximity, housing$median_house_value)
chi_squared_result <- chisq.test(contingency_table)
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
print(chi_squared_result)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 20213, df = 15364, p-value < 2.2e-16
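The Chi-squared Test for Independence returns an extremely low p-value, which on its face suggests an association between ocean proximity and median house prices in California. This result should be read with caution, however. median_house_value is a continuous variable, so the contingency table has one column per distinct price (df = 15364 across 5 proximity categories implies 3,842 distinct values), most cells contain very few observations, and R warns that the chi-squared approximation may be incorrect. A test designed for comparing a continuous variable across groups, such as one-way ANOVA or the Kruskal-Wallis test, would be more appropriate for this hypothesis.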
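Because median_house_value is continuous, a test that compares its distribution across categories may suit this hypothesis better than a chi-squared test on a contingency table. The sketch below runs a one-way ANOVA and a Kruskal-Wallis test on simulated data. The data frame `toy`, its group means, and its spreads are assumptions for illustration, and neither test was part of the original analysis.

```r
# Simulated data for illustration only; group means and spreads are made up
set.seed(42)
toy <- data.frame(
  ocean_proximity = rep(c("INLAND", "<1H OCEAN", "NEAR BAY"), each = 50),
  median_house_value = c(
    rnorm(50, mean = 120000, sd = 40000),
    rnorm(50, mean = 240000, sd = 60000),
    rnorm(50, mean = 260000, sd = 70000)
  )
)

# One-way ANOVA: do mean house values differ across categories?
anova_fit <- aov(median_house_value ~ ocean_proximity, data = toy)
print(summary(anova_fit))

# Kruskal-Wallis: rank-based alternative that does not assume normality
kw_result <- kruskal.test(median_house_value ~ ocean_proximity, data = toy)
print(kw_result)
```

Given the right-skewed house values, the rank-based Kruskal-Wallis test is arguably the safer of the two.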
###Predictive Analysis(forecasting)
##Logistic regression model
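Let's split median income into two classes: districts with a median_income above the threshold of 5 are labelled highIncome (1), and all others are labelled 0.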
threshold <- 5
housing$highIncome <- ifelse(housing$median_income > threshold, 1, 0)
count<- table(housing$highIncome)
print(count)
##
## 0 1
## 16151 4489
model <- glm(highIncome ~ housing_median_age + total_rooms + population + median_house_value, data = housing, family = binomial)
summary(model)
##
## Call:
## glm(formula = highIncome ~ housing_median_age + total_rooms +
## population + median_house_value, family = binomial, data = housing)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.000e+00 7.479e-02 -40.12 <2e-16 ***
## housing_median_age -5.393e-02 1.980e-03 -27.23 <2e-16 ***
## total_rooms 3.958e-04 2.532e-05 15.63 <2e-16 ***
## population -8.366e-04 5.386e-05 -15.53 <2e-16 ***
## median_house_value 1.343e-05 2.300e-07 58.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21619 on 20639 degrees of freedom
## Residual deviance: 13977 on 20635 degrees of freedom
## AIC: 13987
##
## Number of Fisher Scoring iterations: 5
ci <- confint(model, "median_house_value", level = 0.95)
## Waiting for profiling to be done...
print(ci)
## 2.5 % 97.5 %
## 1.297831e-05 1.387999e-05
The 95% confidence interval for the ‘median_house_value’ coefficient is [1.30e-05, 1.39e-05]. This interval provides a range of plausible values for the effect of ‘median_house_value’ on the log-odds of a district being classified as ‘highIncome’.
For every one-dollar rise in median house value, the log-odds of a district being ‘highIncome’ are estimated to increase by between 1.30e-05 and 1.39e-05, holding all other predictors constant.
Because the whole interval lies above zero, ‘median_house_value’ has a statistically significant positive association with the probability of a district being ‘highIncome’. In other words, districts with higher median house values are more likely to be high-income districts. This is an association, not evidence that house values cause higher incomes.
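A per-dollar change in log-odds is hard to interpret, so it can help to rescale the coefficient and convert it to an odds ratio. The sketch below uses the estimate and interval printed above. The 100,000 USD step is an arbitrary choice made here for readability.

```r
# Estimates copied from the model summary and confint() output above
beta <- 1.343e-05
ci   <- c(lower = 1.297831e-05, upper = 1.387999e-05)

# Rescale from a 1 USD step to a 100,000 USD step (an arbitrary, readable unit)
step <- 1e5

# exp() turns a change in log-odds into a multiplicative change in odds
odds_ratio    <- exp(beta * step)
odds_ratio_ci <- exp(ci * step)

print(round(odds_ratio, 2))
print(round(odds_ratio_ci, 2))
```

Under this model, a 100,000 USD higher median house value corresponds to odds of being ‘highIncome’ that are roughly 3.8 times larger, holding the other predictors constant.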
Linear Regression / Multiple linear regression :
multi_var_model <- lm(median_house_value ~ median_income + population + total_bedrooms, data = housing)
summary(multi_var_model)
##
## Call:
## lm(formula = median_house_value ~ median_income + population +
## total_bedrooms, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -542542 -54418 -13494 37880 744280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41074.909 1493.752 27.50 <2e-16 ***
## median_income 42105.204 299.954 140.37 <2e-16 ***
## population -34.227 1.049 -32.62 <2e-16 ***
## total_bedrooms 95.868 2.822 33.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81410 on 20429 degrees of freedom
## (207 observations deleted due to missingness)
## Multiple R-squared: 0.5028, Adjusted R-squared: 0.5027
## F-statistic: 6885 on 3 and 20429 DF, p-value: < 2.2e-16
###Conclusion :
In summary, this multiple linear regression model is statistically significant and indicates that median income, population, and total number of bedrooms are all significant predictors of median house value. However, the model explains only about half of the variance (R-squared of 0.50), and the residual standard error of about 81,410 shows that a large share of the variation in house values remains unexplained. Further work on the model, including additional predictors such as location and ocean proximity, would be needed to capture more of the factors that affect house prices.
For further analysis and more accurate prediction, we could consider tree-based models such as decision tree regression and random forests.
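Any comparison between models needs an out-of-sample error measure, since R-squared on the training data tends to favour more flexible models. The sketch below shows a simple hold-out evaluation of a linear model on simulated data. The data frame `toy`, its coefficients, and the 80/20 split are assumptions for illustration, not results from housing.csv.

```r
# Simulated data for illustration only; coefficients and noise are made up
set.seed(123)
n <- 500
toy <- data.frame(
  median_income = runif(n, min = 0.5, max = 15),
  population    = runif(n, min = 3, max = 5000)
)
toy$median_house_value <- 40000 + 42000 * toy$median_income -
  30 * toy$population + rnorm(n, sd = 80000)

# Hold out 20% of the rows for testing
test_idx <- sample(seq_len(n), size = round(0.2 * n))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

fit <- lm(median_house_value ~ median_income + population, data = train)

# Root mean squared error on rows the model has not seen
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$median_house_value - pred)^2))
print(rmse)
```

The same train and test rows could then be reused for a tree-based model, so that the RMSE values are directly comparable.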