# Data Documentation:
The California Housing dataset contains information about housing attributes in various regions of California. It is often used for regression analysis and predictive modeling tasks. The dataset consists of 10 columns: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity.
This dataset is used for various analyses, including understanding housing market trends, predicting house prices, and studying the impact of socioeconomic factors on housing.
The dataset has a small number of missing values: 207 of the 20,640 rows have no total_bedrooms value.
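Missing values can be counted per column before any modeling. The following is a minimal sketch on a toy data frame (the name `toy` and its values are made up for illustration, not taken from housing.csv); the same calls should apply to `housing` once it is loaded.

```r
# Toy data frame standing in for the housing data (illustrative values only).
toy <- data.frame(
  total_rooms    = c(880, 7099, 1467, 1274),
  total_bedrooms = c(129, NA, 190, NA),
  households     = c(126, 1138, 177, 219)
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows that have no missing values in any column.
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Dropping incomplete rows is only one option; whether to drop or impute depends on the analysis.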
Goal Proposal:
The goal of this project is to analyze the California housing dataset to gain insights into housing prices and the factors influencing them. We aim to provide valuable information to potential homebuyers in California, as well as to understand the relationships between housing attributes.
Task 1:
library(readr)
housing <- read_csv("housing.csv",show_col_types = FALSE)
View(housing)
str(housing)
## spc_tbl_ [20,640 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ longitude : num [1:20640] -122 -122 -122 -122 -122 ...
## $ latitude : num [1:20640] 37.9 37.9 37.9 37.9 37.9 ...
## $ housing_median_age: num [1:20640] 41 21 52 52 52 52 52 52 42 52 ...
## $ total_rooms : num [1:20640] 880 7099 1467 1274 1627 ...
## $ total_bedrooms : num [1:20640] 129 1106 190 235 280 ...
## $ population : num [1:20640] 322 2401 496 558 565 ...
## $ households : num [1:20640] 126 1138 177 219 259 ...
## $ median_income : num [1:20640] 8.33 8.3 7.26 5.64 3.85 ...
## $ median_house_value: num [1:20640] 452600 358500 352100 341300 342200 ...
## $ ocean_proximity : chr [1:20640] "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...
## - attr(*, "spec")=
## .. cols(
## .. longitude = col_double(),
## .. latitude = col_double(),
## .. housing_median_age = col_double(),
## .. total_rooms = col_double(),
## .. total_bedrooms = col_double(),
## .. population = col_double(),
## .. households = col_double(),
## .. median_income = col_double(),
## .. median_house_value = col_double(),
## .. ocean_proximity = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
The dataset consists of 9 numeric columns and 1 categorical column (ocean_proximity).
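The split into numeric and categorical columns can also be derived from the column types rather than read off the str() output. A small sketch, assuming a toy data frame (`toy` and its values are illustrative stand-ins for `housing`):

```r
# Toy data frame with one numeric and one character column (illustrative values).
toy <- data.frame(
  median_income   = c(8.33, 8.30, 7.26),
  ocean_proximity = c("NEAR BAY", "NEAR BAY", "INLAND"),
  stringsAsFactors = FALSE
)

# Flag each column as numeric or not, then split the column names.
is_num <- vapply(toy, is.numeric, logical(1))
numeric_names <- names(toy)[is_num]
categorical_names <- names(toy)[!is_num]

print(numeric_names)
print(categorical_names)
```

Selecting by type avoids hard-coding column positions, which would break if the column order changed.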
Numeric Summary for at Least 10 Columns
summary(housing)
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## median_house_value ocean_proximity
## Min. : 14999 Length:20640
## 1st Qu.:119600 Class :character
## Median :179700 Mode :character
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
# R indexes from 1; columns 1 to 9 are the numeric ones
numeric_col <- housing[, 1:9]
summary(numeric_col)
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## median_house_value
## Min. : 14999
## 1st Qu.:119600
## Median :179700
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
categorical_col<- housing[,10]
summary(categorical_col)
## ocean_proximity
## Length:20640
## Class :character
## Mode :character
count<-table(categorical_col)
print(count)
## ocean_proximity
## <1H OCEAN INLAND ISLAND NEAR BAY NEAR OCEAN
## 9136 6551 5 2290 2658
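Raw counts are easier to compare as shares of the total. The sketch below re-enters the counts printed above by hand (the vector name `counts` is introduced here for illustration) and converts them to percentages with prop.table().

```r
# Category counts copied from the table() output above.
counts <- c("<1H OCEAN" = 9136, "INLAND" = 6551, "ISLAND" = 5,
            "NEAR BAY" = 2290, "NEAR OCEAN" = 2658)

# Convert counts to percentages of all districts, rounded to one decimal.
shares <- round(100 * prop.table(counts), 1)
print(shares)
```

With the loaded data, `prop.table(table(housing$ocean_proximity))` should give the same shares without retyping the counts.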
Task 2:
1. What is the distribution of housing prices?
hist(housing$median_house_value, main="Distribution of Housing Prices", xlab="Median House Value")
The histogram shows that the distribution of median house values is
right-skewed: three quarters of the districts have median values below
about $265,000 (the third quartile), and the mean ($206,856) exceeds the
median ($179,700). A long upper tail extends to the maximum of $500,001.
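The skew can be checked numerically as well as visually. This is a rough sketch on a made-up vector (`values` is illustrative, not the real column): a mean above the median is consistent with right skew, and the share of observations sitting exactly at the maximum shows whether the upper tail piles up at a single value.

```r
# Made-up house values for illustration only.
values <- c(85000, 120000, 150000, 180000, 210000, 260000, 500001, 500001)

# A mean larger than the median is consistent with a right-skewed distribution.
right_skew_hint <- mean(values) > median(values)
print(right_skew_hint)

# Share of observations that sit exactly at the maximum value.
share_at_top <- mean(values == max(values))
print(share_at_top)
```

Applied to `housing$median_house_value`, the second check would show how many districts share the reported maximum of 500,001.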
2. How does the total number of rooms relate to the median house value?
library(ggplot2)
ggplot(housing, aes(x = total_rooms, y = median_house_value)) +
  geom_point() +
  labs(title = "Relationship between Total Rooms and Median House Value", x = "Total Rooms", y = "Median House Value")
The scatterplot indicates only a weak positive relationship between the
total number of rooms in a district and its median house value; the
correlation computed in question 3 is about 0.13. Districts with more
rooms tend to have slightly higher median values, but the variability is
large.
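Because total_rooms counts every room in a district, it largely tracks district size. A per-household average may be a more direct measure of "number of rooms". The sketch below uses the first three rows shown in the str() output; the column name `rooms_per_household` is introduced here as an assumption and is not part of the original data.

```r
# First three rows of the relevant columns, as shown in the str() output.
toy <- data.frame(
  total_rooms        = c(880, 7099, 1467),
  households         = c(126, 1138, 177),
  median_house_value = c(452600, 358500, 352100)
)

# Average number of rooms per household in each district (hypothetical derived column).
toy$rooms_per_household <- toy$total_rooms / toy$households
print(toy$rooms_per_household)

# Correlation of the derived feature with house value (three rows is far too few to interpret).
r <- cor(toy$rooms_per_household, toy$median_house_value)
print(r)
```

On the full data, the derived column could replace total_rooms on the x-axis of the scatterplot.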
3. Are there any correlations between median income (median_income) and other features?
correlation_matrix <- cor(housing[, c("median_income", "median_house_value", "housing_median_age", "total_rooms", "population")])
print(correlation_matrix)
## median_income median_house_value housing_median_age
## median_income 1.000000000 0.68807521 -0.1190340
## median_house_value 0.688075208 1.00000000 0.1056234
## housing_median_age -0.119033990 0.10562341 1.0000000
## total_rooms 0.198049645 0.13415311 -0.3612622
## population 0.004834346 -0.02464968 -0.2962442
## total_rooms population
## median_income 0.1980496 0.004834346
## median_house_value 0.1341531 -0.024649679
## housing_median_age -0.3612622 -0.296244240
## total_rooms 1.0000000 0.857125973
## population 0.8571260 1.000000000
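# Plot the variables themselves; pairs() on the correlation matrix would plot the coefficients
pairs(housing[, c("median_income", "median_house_value", "housing_median_age", "total_rooms", "population")])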
The correlation matrix shows how median income (median_income) relates to the other features. Its strongest relationship is a positive correlation with median house value (r ≈ 0.69), suggesting that areas with higher incomes tend to have more expensive houses. Its correlations with housing median age (-0.12), total rooms (0.20), and population (0.005) are weak.
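The columns passed to cor() above contain no missing values. If total_bedrooms were included, cor() would return NA for it by default. A minimal sketch of the difference, using a toy data frame in which one total_bedrooms value has been set to NA for illustration:

```r
# Toy data; the NA is inserted on purpose to show the behaviour.
toy <- data.frame(
  median_income  = c(8.33, 8.30, 7.26, 5.64, 3.85),
  total_bedrooms = c(129, 1106, NA, 235, 280)
)

# Default behaviour: any NA in the inputs makes the result NA.
r_default <- cor(toy$median_income, toy$total_bedrooms)
print(r_default)

# Use only the rows where both values are present.
r_complete <- cor(toy$median_income, toy$total_bedrooms, use = "complete.obs")
print(r_complete)
```

The `use = "pairwise.complete.obs"` option is an alternative when a whole matrix is computed and different column pairs have different missing rows.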
ggplot(housing, aes(x = households, y = median_house_value)) +
  geom_point() +
  labs(title = "Relationship between Households and Median House Value", x = "Households", y = "Median House Value")
The scatterplot suggests, at most, a weak positive relationship between
the number of households and median house value: districts with more
households tend to have only slightly higher median house values.
ggplot(housing, aes(x = ocean_proximity, y = median_house_value)) +
  geom_boxplot() +
  labs(title = "Impact of Ocean Proximity on Median House Value", x = "Ocean Proximity", y = "Median House Value")
The box plot displays how median house values vary with ocean proximity.
Districts located closer to the ocean generally have higher median values
than INLAND districts.
Task 3:
Calculate the median house value for each category of ocean proximity
agg_result <- aggregate(housing$median_house_value, by=list(housing$ocean_proximity), FUN=median)
print(agg_result)
## Group.1 x
## 1 <1H OCEAN 214850
## 2 INLAND 108500
## 3 ISLAND 414700
## 4 NEAR BAY 233800
## 5 NEAR OCEAN 229450
To determine the median house value for each category of ocean proximity, we used the aggregate() function. The result summarizes how median house value varies with ocean proximity. ISLAND has the highest median ($414,700), although it contains only 5 districts, followed by NEAR BAY ($233,800), NEAR OCEAN ($229,450), and <1H OCEAN ($214,850); INLAND has by far the lowest ($108,500).
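A group median is easier to interpret next to the group size, since a category with very few districts gives an unstable estimate. The sketch below uses the formula interface of aggregate(), which also keeps the original column names instead of Group.1 and x. The data frame `toy` and the column name `n_districts` are made up for illustration.

```r
# Toy data with two proximity categories (illustrative values only).
toy <- data.frame(
  ocean_proximity    = c("INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "NEAR BAY"),
  median_house_value = c(100000, 120000, 300000, 350000, 340000)
)

# Median house value per category, with readable column names.
med <- aggregate(median_house_value ~ ocean_proximity, data = toy, FUN = median)

# Number of districts per category.
n <- aggregate(median_house_value ~ ocean_proximity, data = toy, FUN = length)
names(n)[2] <- "n_districts"

# Combine both summaries and sort from the highest median to the lowest.
result <- merge(med, n, by = "ocean_proximity")
result <- result[order(-result$median_house_value), ]
print(result)
```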
Task 4:
Create visualizations to explore data
ggplot(housing, aes(x = median_house_value)) +
  geom_histogram(binwidth = 50000, fill = "blue", color = "black") +
  labs(title = "Distribution of Median House Value", x = "Median House Value")
The histogram of median house values reveals the overall distribution.
It shows that the majority of districts have median house values
concentrated around $200,000.
ggplot(housing, aes(x = housing_median_age, y = median_house_value)) +
  geom_point() +
  labs(title = "Relationship between Housing Median Age and Median House Value", x = "Housing Median Age", y = "Median House Value")
The scatterplot shows the relationship between housing median age and
median house value. There is no clear linear relationship (the
correlation matrix in Task 2 gives r ≈ 0.11): districts of every age
span a wide range of median values.
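When a scatterplot shows no clear trend, comparing group medians across age bands can make any pattern easier to see. This is a sketch on made-up values; the band boundaries (0, 20, 40, 60) and the column name `age_band` are arbitrary choices for illustration.

```r
# Made-up ages and house values for illustration only.
toy <- data.frame(
  housing_median_age = c(5, 12, 18, 25, 33, 41, 52, 52),
  median_house_value = c(150000, 180000, 160000, 200000, 190000, 240000, 350000, 300000)
)

# Bin the ages into three bands: (0,20], (20,40], (40,60].
toy$age_band <- cut(toy$housing_median_age, breaks = c(0, 20, 40, 60))

# Median house value within each age band.
band_medians <- tapply(toy$median_house_value, toy$age_band, median)
print(band_medians)
```

The same binning could feed a box plot of median_house_value by age band for the full dataset.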
Correlation plot: