library(readr)
housing_data <- read.csv("/Users/sharmistaroy/Downloads/housing.csv")
View(housing_data)
Response variable: median_house_value
set1 <- housing_data[, c("median_house_value", "median_income", "housing_median_age", "total_rooms")]
# People per household; this report calls it populationdensity
housing_data$populationdensity <- housing_data$population / housing_data$households
set2 <- housing_data[, c("median_house_value", "populationdensity", "total_bedrooms", "households")]
# Rooms per bedroom
housing_data$roomsperbedroom <- housing_data$total_rooms / housing_data$total_bedrooms
set3 <- housing_data[, c("median_house_value", "roomsperbedroom", "median_income", "housing_median_age")]
In Set 2, we calculated a new variable, populationdensity, by dividing population by households, so it measures the average number of people per household. In Set 3, we calculated a new variable, roomsperbedroom, by dividing total_rooms by total_bedrooms.
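The two derived ratios can be checked on a small example. The sketch below uses a toy data frame with made-up values (not rows from housing.csv); only the column names mirror the data set. It also shows that a missing total_bedrooms value carries over into roomsperbedroom.

```r
# Toy data with made-up values; column names mirror the housing data set.
toy <- data.frame(
  population     = c(1200, 800, 1500),
  households     = c(400, 320, 500),
  total_rooms    = c(2000, 1500, 2600),
  total_bedrooms = c(400, NA, 650)
)

# Same ratios as in the report.
toy$populationdensity <- toy$population / toy$households
toy$roomsperbedroom   <- toy$total_rooms / toy$total_bedrooms

# Count missing values per column: the NA in total_bedrooms
# also produces an NA in roomsperbedroom.
colSums(is.na(toy))
```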
For set1:
library(ggplot2)
# Plot 1: Median House Value vs. Median Income
ggplot(set1, aes(x = median_income, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Median Income")
# Plot 2: Median House Value vs. Housing Median Age
ggplot(set1, aes(x = housing_median_age, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Housing Median Age")
# Plot 3: Median House Value vs. Total Rooms
ggplot(set1, aes(x = total_rooms, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Total Rooms")
The plot of “Median House Value vs. Median Income” shows a clear
positive relationship: median house value tends to rise with median
income. The “Median House Value vs. Housing Median Age” plot is widely
dispersed and shows no clear linear relationship. The points in the
“Median House Value vs. Total Rooms” plot are similarly scattered, with
no discernible linear relationship.
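One way to make a linear trend (or its absence) easier to judge by eye is to overlay a fitted line on the scatterplot. The following is a minimal sketch on simulated data; the coefficients and noise level are arbitrary assumptions chosen for illustration, not estimates from housing.csv. It uses ggplot2's geom_smooth() with method = "lm", plus semi-transparent points to reduce overplotting.

```r
library(ggplot2)

# Simulated data (assumption): a noisy linear relationship.
set.seed(1)
toy <- data.frame(median_income = runif(200, 0.5, 15))
toy$median_house_value <- 45000 + 40000 * toy$median_income +
  rnorm(200, sd = 60000)

# Scatterplot with a least-squares line drawn on top.
p <- ggplot(toy, aes(x = median_income, y = median_house_value)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  ggtitle("Median House Value vs. Median Income (toy data)")
p
```

The same two extra layers could be added to any of the plots in this report by replacing toy with set1, set2, or set3.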
For set2:
# Median House Value vs. Population Density
ggplot(set2, aes(x = populationdensity, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Population Density")
# Median House Value vs. Total Bedrooms
ggplot(set2, aes(x = total_bedrooms, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Total Bedrooms")
## Warning: Removed 207 rows containing missing values (`geom_point()`).
# Median House Value vs. Households
ggplot(set2, aes(x = households, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Households")
The “Median House Value vs. Population Density” plot does not show a
strong linear relationship, and there are some outliers with very high
population densities. The “Median House Value vs. Total Bedrooms” plot
has scattered points and doesn’t show a strong linear relationship. The
“Median House Value vs. Households” plot also has scattered points with
no clear linear relationship.
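The outliers noted in the population density plot could be flagged numerically. The sketch below applies the common 1.5 × IQR rule to a made-up vector of people-per-household values. Both the values and the choice of rule are assumptions for illustration; the report itself does not define an outlier threshold.

```r
# Made-up people-per-household values with one extreme entry.
x <- c(2.1, 2.8, 3.0, 3.2, 2.5, 3.6, 2.9, 1243.3)

# Upper fence: third quartile plus 1.5 times the interquartile range.
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])

# Logical vector marking values above the fence.
is_outlier <- x > upper_fence
x[is_outlier]
```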
For set3:
# Median House Value vs. Rooms per Bedroom
ggplot(set3, aes(x = roomsperbedroom, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Rooms per Bedroom")
## Warning: Removed 207 rows containing missing values (`geom_point()`).
# Median House Value vs. Median Income
ggplot(set3, aes(x = median_income, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Median Income")
# Median House Value vs. Housing Median Age
ggplot(set3, aes(x = housing_median_age, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Housing Median Age")
The “Median House Value vs. Rooms per Bedroom” plot contains outliers
with very high rooms-per-bedroom values and does not indicate a clear
linear relationship. The “Median House Value vs. Median Income” plot
shows a positive correlation, as in Set 1. The “Median House Value vs.
Housing Median Age” plot shows widely scattered points and no clear
linear relationship.
cor1 <- cor(set1)
cor2 <- cor(set2)
cor3 <- cor(set3)
cor1
##                    median_house_value median_income housing_median_age
## median_house_value          1.0000000     0.6880752          0.1056234
## median_income               0.6880752     1.0000000         -0.1190340
## housing_median_age          0.1056234    -0.1190340          1.0000000
## total_rooms                 0.1341531     0.1980496         -0.3612622
##                    total_rooms
## median_house_value   0.1341531
## median_income        0.1980496
## housing_median_age  -0.3612622
## total_rooms          1.0000000
cor2
##                    median_house_value populationdensity total_bedrooms
## median_house_value         1.00000000       -0.02373741             NA
## populationdensity         -0.02373741        1.00000000             NA
## total_bedrooms                     NA                NA              1
## households                 0.06584265       -0.02730936             NA
##                     households
## median_house_value  0.06584265
## populationdensity  -0.02730936
## total_bedrooms              NA
## households          1.00000000
cor3
##                    median_house_value roomsperbedroom median_income
## median_house_value          1.0000000              NA     0.6880752
## roomsperbedroom                    NA               1            NA
## median_income               0.6880752              NA     1.0000000
## housing_median_age          0.1056234              NA    -0.1190340
##                    housing_median_age
## median_house_value          0.1056234
## roomsperbedroom                    NA
## median_income              -0.1190340
## housing_median_age          1.0000000
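The correlation matrices for Sets 2 and 3 are incomplete because total_bedrooms contains missing values (207 rows, per the plot warnings), so every correlation involving total_bedrooms or the derived roomsperbedroom is NA. The missing values can be handled by removing the incomplete rows or by imputing them, for example with the column mean.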
These findings show a positive correlation between median_income and median_house_value in Sets 1 and 3 (about 0.69), indicating that areas with higher median incomes typically have higher median house values. The other variables in these sets have much smaller correlations with median_house_value. In Set 2, the correlations that can be computed are close to zero, and those involving total_bedrooms are NA because of the missing values.
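The two ways of handling missing values can be compared on a small example. The sketch below uses a toy data frame with made-up values; only the column names come from the data set. It contrasts the default behaviour of cor(), the use = "complete.obs" option (which drops incomplete rows, like na.omit()), and a simple mean imputation.

```r
# Toy data with made-up values and one missing total_bedrooms entry.
toy <- data.frame(
  median_house_value = c(150000, 220000, 310000, 180000, 400000),
  total_bedrooms     = c(300, NA, 520, 410, 610),
  households         = c(280, 350, 500, 390, 590)
)

# Default: any pair involving a column with NA yields NA.
cor_default <- cor(toy)

# Option 1: use only rows with no missing values.
cor_complete <- cor(toy, use = "complete.obs")

# Option 2: replace missing values with the column mean, then correlate.
imputed <- toy
imputed$total_bedrooms[is.na(imputed$total_bedrooms)] <-
  mean(toy$total_bedrooms, na.rm = TRUE)
cor_imputed <- cor(imputed)

cor_default
cor_complete
cor_imputed
```

Dropping rows and imputing generally give different correlations, so it is worth stating which approach a reported matrix uses.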
set1 <- na.omit(set1)
set2 <- na.omit(set2)
set3 <- na.omit(set3)
cor1 <- cor(set1)
cor2 <- cor(set2)
cor3 <- cor(set3)
cor1
##                    median_house_value median_income housing_median_age
## median_house_value          1.0000000     0.6880752          0.1056234
## median_income               0.6880752     1.0000000         -0.1190340
## housing_median_age          0.1056234    -0.1190340          1.0000000
## total_rooms                 0.1341531     0.1980496         -0.3612622
##                    total_rooms
## median_house_value   0.1341531
## median_income        0.1980496
## housing_median_age  -0.3612622
## total_rooms          1.0000000
cor2
##                    median_house_value populationdensity total_bedrooms
## median_house_value         1.00000000       -0.02363940     0.04968618
## populationdensity         -0.02363940        1.00000000    -0.02835486
## total_bedrooms             0.04968618       -0.02835486     1.00000000
## households                 0.06489355       -0.02733581     0.97972827
##                     households
## median_house_value  0.06489355
## populationdensity  -0.02733581
## total_bedrooms      0.97972827
## households          1.00000000
cor3
##                    median_house_value roomsperbedroom median_income
## median_house_value          1.0000000       0.3839203     0.6883555
## roomsperbedroom             0.3839203       1.0000000     0.7665784
## median_income               0.6883555       0.7665784     1.0000000
## housing_median_age          0.1064320      -0.1608531    -0.1182777
##                    housing_median_age
## median_house_value          0.1064320
## roomsperbedroom            -0.1608531
## median_income              -0.1182777
## housing_median_age          1.0000000
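Among the predictors, median_income has the strongest correlation with median_house_value (0.69). roomsperbedroom is also positively correlated with median_house_value, though less strongly (0.38), and it is itself strongly correlated with median_income (0.77). housing_median_age is only weakly related to median_house_value (0.11), median_income (-0.12), and roomsperbedroom (-0.16). In Set 2, total_bedrooms and households are almost perfectly correlated with each other (0.98) but only weakly with median_house_value.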
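The reading of the Set 3 matrix can be reproduced programmatically. The sketch below re-enters the printed cor3 values by hand (so it runs without housing.csv), ranks the predictors by the absolute size of their correlation with median_house_value, and lists predictor pairs whose correlation exceeds 0.7 in absolute value. The 0.7 cutoff is an arbitrary assumption for illustration.

```r
# Correlation matrix for Set 3, copied from the printed output above.
vars <- c("median_house_value", "roomsperbedroom",
          "median_income", "housing_median_age")
cor3_printed <- matrix(
  c(1.0000000,  0.3839203,  0.6883555,  0.1064320,
    0.3839203,  1.0000000,  0.7665784, -0.1608531,
    0.6883555,  0.7665784,  1.0000000, -0.1182777,
    0.1064320, -0.1608531, -0.1182777,  1.0000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Correlations of each predictor with the response, largest |r| first.
resp <- cor3_printed["median_house_value", -1]
resp[order(abs(resp), decreasing = TRUE)]

# Predictor pairs with |r| above the (assumed) 0.7 cutoff.
pred <- cor3_printed[-1, -1]
high <- which(abs(pred) > 0.7 & upper.tri(pred), arr.ind = TRUE)
data.frame(
  var1 = rownames(pred)[high[, "row"]],
  var2 = colnames(pred)[high[, "col"]],
  r    = pred[high]
)
```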
confidence_interval <- t.test(housing_data$median_house_value,conf.level = 0.95)
confidence_interval
##
## One Sample t-test
##
## data: housing_data$median_house_value
## t = 257.53, df = 20639, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 205281.4 208430.2
## sample estimates:
## mean of x
## 206855.8
The extremely small p-value (p < 2.2e-16) is strong evidence against the null hypothesis, so we reject the hypothesis that the true mean of median_house_value is 0. With 95% confidence, the true population mean of median_house_value lies between 205,281.4 and 208,430.2. The sample mean, 206,855.8, is the center of this interval.
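The interval reported by t.test() is the sample mean plus or minus a t critical value times the standard error. As a minimal sketch, the code below computes that interval by hand on simulated data (the values are made up, not drawn from housing.csv) and checks that it agrees with t.test().

```r
# Simulated sample (assumption): 50 made-up house values.
set.seed(42)
x <- rnorm(50, mean = 200000, sd = 100000)

n      <- length(x)
se     <- sd(x) / sqrt(n)           # standard error of the mean
t_crit <- qt(0.975, df = n - 1)     # two-sided 95% critical value

# Manual interval: mean +/- t_crit * se
manual_ci <- mean(x) + c(-1, 1) * t_crit * se

# Interval from t.test(), stripped of its attributes for comparison.
t_ci <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

all.equal(manual_ci, t_ci)
```

By default t.test() tests against a mean of 0; its mu argument sets a different hypothesized mean without changing the confidence interval.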