Data Documentation:

The California Housing dataset contains information about housing attributes in various regions of California. It is often used for regression analysis and predictive modeling tasks. The dataset consists of the following columns:

  1. median_house_value: Median house value for California districts (target variable).
  2. housing_median_age: Median age of housing units in a district.
  3. total_rooms: Total number of rooms in a district.
  4. total_bedrooms: Total number of bedrooms in a district.
  5. population: Total population in a district.
  6. households: Total number of households in a district.
  7. median_income: Median income of households in a district.
  8. latitude: Latitude coordinate of the district’s location.
  9. longitude: Longitude coordinate of the district’s location.
  10. ocean_proximity: Proximity of the district to the ocean (categorical variable).

This dataset is used for various analyses, including understanding housing market trends, predicting house prices, and studying the impact of socioeconomic factors on housing.

The dataset has a few missing values: total_bedrooms has 207 NA entries (see the summary output under Task 1), and no other column has any.
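A minimal sketch of how missing values can be counted per column. The `toy` data frame is a hypothetical stand-in with made-up NA positions; with the real data, the same call would be applied to `housing`.

```r
# Toy stand-in for the housing data (NA positions are made up for illustration).
toy <- data.frame(
  total_rooms    = c(880, 7099, 1467, 1274),
  total_bedrooms = c(129, NA, 190, NA),
  households     = c(126, 1138, 177, 219)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the columns that actually contain missing values.
print(na_counts[na_counts > 0])
```

On the real data, `colSums(is.na(housing))` should report a non-zero count only for total_bedrooms, consistent with the summary output.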

Goal Proposal:

The goal of this project is to analyze the California housing dataset to gain insights into housing prices and the factors that influence them. We aim to provide useful information to potential homebuyers in California and to understand the relationships between housing attributes.

Task 1:

library(readr)
housing <- read_csv("housing.csv", show_col_types = FALSE)
View(housing)
str(housing)
## spc_tbl_ [20,640 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ longitude         : num [1:20640] -122 -122 -122 -122 -122 ...
##  $ latitude          : num [1:20640] 37.9 37.9 37.9 37.9 37.9 ...
##  $ housing_median_age: num [1:20640] 41 21 52 52 52 52 52 52 42 52 ...
##  $ total_rooms       : num [1:20640] 880 7099 1467 1274 1627 ...
##  $ total_bedrooms    : num [1:20640] 129 1106 190 235 280 ...
##  $ population        : num [1:20640] 322 2401 496 558 565 ...
##  $ households        : num [1:20640] 126 1138 177 219 259 ...
##  $ median_income     : num [1:20640] 8.33 8.3 7.26 5.64 3.85 ...
##  $ median_house_value: num [1:20640] 452600 358500 352100 341300 342200 ...
##  $ ocean_proximity   : chr [1:20640] "NEAR BAY" "NEAR BAY" "NEAR BAY" "NEAR BAY" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   longitude = col_double(),
##   ..   latitude = col_double(),
##   ..   housing_median_age = col_double(),
##   ..   total_rooms = col_double(),
##   ..   total_bedrooms = col_double(),
##   ..   population = col_double(),
##   ..   households = col_double(),
##   ..   median_income = col_double(),
##   ..   median_house_value = col_double(),
##   ..   ocean_proximity = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

The dataset consists of 9 numeric columns and 1 categorical column (ocean_proximity, stored as character).
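The split into numeric and categorical columns can also be derived from the column types rather than from column positions. The sketch below uses a small hypothetical `toy` data frame; the variable names `is_num`, `numeric_cols`, and `categorical_cols` are illustrative only.

```r
# Toy stand-in with two numeric columns and one character column.
toy <- data.frame(
  median_income   = c(8.33, 8.30, 7.26),
  population      = c(322, 2401, 496),
  ocean_proximity = c("NEAR BAY", "NEAR BAY", "INLAND"),
  stringsAsFactors = FALSE
)

# TRUE for numeric columns, FALSE otherwise.
is_num <- vapply(toy, is.numeric, logical(1))

# drop = FALSE keeps the result a data frame even when one column is selected.
numeric_cols     <- toy[, is_num, drop = FALSE]
categorical_cols <- toy[, !is_num, drop = FALSE]

print(names(numeric_cols))
print(names(categorical_cols))
```

Selecting by type keeps the code correct if the column order in the CSV ever changes.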

Numeric Summary of All 10 Columns

summary(housing)
##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:20640      
##  1st Qu.:119600     Class :character  
##  Median :179700     Mode  :character  
##  Mean   :206856                       
##  3rd Qu.:264725                       
##  Max.   :500001                       
## 
numeric_col <- housing[, 1:9]  # columns 1 to 9 are numeric (R indexing starts at 1)

summary(numeric_col)
##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value
##  Min.   : 14999    
##  1st Qu.:119600    
##  Median :179700    
##  Mean   :206856    
##  3rd Qu.:264725    
##  Max.   :500001    
## 
categorical_col <- housing[, 10]
summary(categorical_col)
##  ocean_proximity   
##  Length:20640      
##  Class :character  
##  Mode  :character
count <- table(categorical_col)
print(count)
## ocean_proximity
##  <1H OCEAN     INLAND     ISLAND   NEAR BAY NEAR OCEAN 
##       9136       6551          5       2290       2658
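Raw counts are easier to compare as shares of the total. The sketch below re-enters the counts printed above by hand (so it runs without the CSV) and converts them to percentages; `counts` and `shares` are illustrative names.

```r
# Category counts copied from the table() output above.
counts <- c("<1H OCEAN" = 9136, "INLAND" = 6551, "ISLAND" = 5,
            "NEAR BAY" = 2290, "NEAR OCEAN" = 2658)

# Share of districts in each category, in percent.
shares <- 100 * counts / sum(counts)
print(round(shares, 2))
```

ISLAND accounts for only 5 of the 20,640 districts, so any statistic computed for that category rests on very little data.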

Task 2:

1. What is the distribution of housing prices?

hist(housing$median_house_value, main="Distribution of Housing Prices", xlab="Median House Value")

The histogram shows that the distribution of median house values is right-skewed: the median ($179,700) is below the mean ($206,856), so most districts sit at or below roughly $200,000, with a long tail of expensive districts reaching the maximum of $500,001.

2. How does the total number of rooms relate to the median house value?

library(ggplot2)
ggplot(housing, aes(x=total_rooms, y=median_house_value)) +
  geom_point() +
  labs(title="Relationship between Total Rooms and Median House Value", x="Total Rooms", y="Median House Value")

The scatterplot indicates a weak positive correlation between the total number of rooms and median house value (r ≈ 0.13 in the correlation matrix under question 3). Districts with more rooms tend to have slightly higher median values, but the variability is large.
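Because total_rooms is a district-wide total, it largely tracks district size (its correlation with population is about 0.86 in the matrix under question 3). A per-household average may be a more informative feature. The sketch below derives one from the first five rows shown in the str() output; the column name `rooms_per_household` is an assumption introduced here, not part of the dataset.

```r
# First five rows of the relevant columns, copied from the str() output above.
toy <- data.frame(
  total_rooms        = c(880, 7099, 1467, 1274, 1627),
  households         = c(126, 1138, 177, 219, 259),
  median_house_value = c(452600, 358500, 352100, 341300, 342200)
)

# Average number of rooms per household in each district.
toy$rooms_per_household <- toy$total_rooms / toy$households
print(toy)

# On the full data, the same column could be plotted against house value, e.g.
# plot(housing$total_rooms / housing$households, housing$median_house_value)
```

Five rows are far too few to draw conclusions from; the point is only the derivation of the ratio.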

3. Are there any correlations between median income (median_income) and other features?

correlation_matrix <- cor(housing[, c("median_income", "median_house_value", "housing_median_age", "total_rooms", "population")])
print(correlation_matrix)
##                    median_income median_house_value housing_median_age
## median_income        1.000000000         0.68807521         -0.1190340
## median_house_value   0.688075208         1.00000000          0.1056234
## housing_median_age  -0.119033990         0.10562341          1.0000000
## total_rooms          0.198049645         0.13415311         -0.3612622
## population           0.004834346        -0.02464968         -0.2962442
##                    total_rooms   population
## median_income        0.1980496  0.004834346
## median_house_value   0.1341531 -0.024649679
## housing_median_age  -0.3612622 -0.296244240
## total_rooms          1.0000000  0.857125973
## population           0.8571260  1.000000000
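# Plot the variables themselves; pairs() on the correlation matrix would only plot the coefficients.
pairs(housing[, c("median_income", "median_house_value", "housing_median_age", "total_rooms", "population")])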

The correlation matrix shows the relationships between median income (median_income) and the other features. Median income has a fairly strong positive correlation with median house value (r ≈ 0.69), suggesting that districts with higher incomes tend to have more expensive houses. Its correlations with total_rooms (0.20), housing_median_age (-0.12), and population (0.005) are weak.
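The matrix above leaves out total_bedrooms, the one column with missing values. With its default settings, cor() returns NA for any pair that involves an NA. A sketch of one way around this, using the `use = "complete.obs"` argument, follows; the second total_bedrooms value is set to NA here purely for illustration.

```r
# First five rows from the str() output; one value replaced by NA for illustration.
toy <- data.frame(
  median_income  = c(8.33, 8.30, 7.26, 5.64, 3.85),
  total_bedrooms = c(129, NA, 190, 235, 280)
)

# Default behaviour: the NA propagates into the off-diagonal entries.
cor_default <- cor(toy)
print(cor_default)

# Drop rows with any NA before computing the correlations.
cor_complete <- cor(toy, use = "complete.obs")
print(cor_complete)
```

`use = "pairwise.complete.obs"` is an alternative that drops rows per pair of columns instead of across the whole table.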

4. Is there a relationship between the number of households (households) and median house value?

ggplot(housing, aes(x=households, y=median_house_value)) +
  geom_point() +
  labs(title="Relationship between Households and Median House Value", x="Households", y="Median House Value")

The scatterplot shows, at most, a weak positive relationship between the number of households and median house value. Districts with more households tend to have only slightly higher median house values.

5. Does the proximity to the ocean (ocean_proximity) affect the median house value?

ggplot(housing, aes(x=ocean_proximity, y=median_house_value)) +
  geom_boxplot() +
  labs(title="Impact of Ocean Proximity on Median House Value", x="Ocean Proximity", y="Median House Value")

The box plot shows how median house values vary with ocean proximity. Districts closer to the ocean generally have higher median values than inland districts.

Task 3:

Calculate the median house value for each category of ocean proximity

agg_result <- aggregate(housing$median_house_value, by=list(housing$ocean_proximity), FUN=median)
print(agg_result)
##      Group.1      x
## 1  <1H OCEAN 214850
## 2     INLAND 108500
## 3     ISLAND 414700
## 4   NEAR BAY 233800
## 5 NEAR OCEAN 229450

To determine the median house value for each category of ocean proximity, we used the aggregate() function. The result summarizes how median house values differ by ocean proximity: ISLAND ($414,700) and NEAR BAY ($233,800) have the highest medians, while INLAND ($108,500) has the lowest. The ISLAND figure is based on only 5 districts, so it should be read with caution.
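Reporting the group size next to each median makes small groups such as ISLAND easy to spot. The following is a minimal sketch on a hypothetical `toy` data frame with made-up values; `by_group`, `median_value`, and `n` are illustrative names.

```r
# Toy stand-in (values are made up for illustration).
toy <- data.frame(
  ocean_proximity    = c("INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "NEAR BAY", "ISLAND"),
  median_house_value = c(100000, 120000, 230000, 240000, 250000, 410000)
)

# Median house value and number of districts per category.
medians <- tapply(toy$median_house_value, toy$ocean_proximity, median)
sizes   <- table(toy$ocean_proximity)

by_group <- data.frame(
  ocean_proximity = names(medians),
  median_value    = as.vector(medians),
  n               = as.vector(sizes[names(medians)])
)

# Sort from the most to the least expensive category.
by_group <- by_group[order(-by_group$median_value), ]
print(by_group)
```

On the real data, the same pattern would use `housing$median_house_value` and `housing$ocean_proximity`.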

Task 4:

Create visualizations to explore data

ggplot(housing, aes(x=median_house_value)) +
  geom_histogram(binwidth=50000, fill="blue", color="black") +
  labs(title="Distribution of Median House Value", x="Median House Value")

The histogram of median house values, drawn with $50,000 bins, shows the overall distribution. Values are concentrated at or below roughly $200,000: half of the districts fall between $119,600 and $264,725 (the first and third quartiles from Task 1).

ggplot(housing, aes(x=housing_median_age, y=median_house_value)) +
  geom_point() +
  labs(title="Relationship between Housing Median Age and Median House Value", x="Housing Median Age", y="Median House Value")

The scatterplot shows the relationship between housing median age and median house value. There is no clear linear relationship: the correlation is only about 0.11 (see the correlation matrix in Task 2), and districts of every age span a wide range of median values.
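When a scatterplot shows no clear trend, summarizing the response within bands of the predictor can make any pattern easier to see. The sketch below bins housing age with cut() and takes the median value per band. The `toy` values and the break points are assumptions chosen for illustration, not results from the dataset.

```r
# Toy stand-in (values are made up for illustration).
toy <- data.frame(
  housing_median_age = c(5, 12, 18, 25, 33, 41, 52, 52),
  median_house_value = c(150000, 210000, 180000, 160000, 220000, 452600, 341300, 342200)
)

# Group ages into bands: (0,15], (15,30], (30,45], (45,60].
toy$age_band <- cut(toy$housing_median_age, breaks = c(0, 15, 30, 45, 60))

# Median house value within each age band.
band_medians <- tapply(toy$median_house_value, toy$age_band, median)
print(band_medians)
```

The upper break of 60 is chosen to cover the maximum housing_median_age of 52 reported in the summary.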

Correlation plot: