• Build at least three sets of variable combinations

# read.csv() is base R, so no extra package is needed to load the file
housing_data <- read.csv("/Users/sharmistaroy/Downloads/housing.csv")

View(housing_data)

Response Variable: median_house_value

# Set 1: income, housing age, and total rooms
set1 <- housing_data[, c("median_house_value", "median_income", "housing_median_age", "total_rooms")]
# Set 2: derived people-per-household ratio, total bedrooms, and households
housing_data$populationdensity <- housing_data$population / housing_data$households
set2 <- housing_data[, c("median_house_value", "populationdensity", "total_bedrooms", "households")]
# Set 3: derived rooms-per-bedroom ratio, income, and housing age
housing_data$roomsperbedroom <- housing_data$total_rooms / housing_data$total_bedrooms
set3 <- housing_data[, c("median_house_value", "roomsperbedroom", "median_income", "housing_median_age")]

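In Set 2, we calculated a new variable, populationdensity, by dividing population by households; despite its name, it measures the average number of people per household rather than density per unit area. In Set 3, we calculated a new variable, roomsperbedroom, by dividing total_rooms by total_bedrooms.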
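A small sketch on a made-up data frame may help show how these ratios behave when an input is missing. The values below are invented for illustration and are not taken from housing.csv; the point is only that an NA in total_bedrooms carries over into roomsperbedroom.

```r
# Toy data; all values are hypothetical, not from housing.csv
toy <- data.frame(
  population     = c(300, 1200, 800),
  households     = c(100,  400, 200),
  total_rooms    = c(500, 2000, 900),
  total_bedrooms = c(100,   NA, 300)
)

# Same derived variables as in Set 2 and Set 3
toy$populationdensity <- toy$population / toy$households
toy$roomsperbedroom   <- toy$total_rooms / toy$total_bedrooms

toy$populationdensity  # 3 3 4
toy$roomsperbedroom    # 5 NA 3
```

Any row with a missing total_bedrooms therefore also has a missing roomsperbedroom, which matters for the plots and correlations that use it.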

• Plot a visualization for each response-explanatory relationship, and draw some conclusions based on the plot

For set1:

library(ggplot2)

# Plot 1: Median House Value vs. Median Income
ggplot(set1, aes(x = median_income, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Median Income")

# Plot 2: Median House Value vs. Housing Median Age
ggplot(set1, aes(x = housing_median_age, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Housing Median Age")

# Plot 3: Median House Value vs. Total Rooms
ggplot(set1, aes(x = total_rooms, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Total Rooms")

The “Median House Value vs. Median Income” plot shows a clear positive relationship: median house value tends to rise with median income. The “Median House Value vs. Housing Median Age” plot is widely dispersed and shows no clear linear relationship. The points in the “Median House Value vs. Total Rooms” plot are likewise scattered, with no discernible linear trend.

For set2:

# Median House Value vs. Population Density
ggplot(set2, aes(x = populationdensity, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Population Density")

#  Median House Value vs. Total Bedrooms
ggplot(set2, aes(x = total_bedrooms, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Total Bedrooms")
## Warning: Removed 207 rows containing missing values (`geom_point()`).

#  Median House Value vs. Households
ggplot(set2, aes(x = households, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Households")

The “Median House Value vs. Population Density” plot does not show a strong linear relationship, and it contains some outliers with very high populationdensity values (many people per household). The “Median House Value vs. Total Bedrooms” plot has scattered points and does not show a strong linear relationship. The “Median House Value vs. Households” plot also has scattered points with no clear linear relationship.

For set3:

#  Median House Value vs. Rooms per Bedroom
ggplot(set3, aes(x = roomsperbedroom, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Rooms per Bedroom")
## Warning: Removed 207 rows containing missing values (`geom_point()`).

#  Median House Value vs. Median Income
ggplot(set3, aes(x = median_income, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Median Income")

#  Median House Value vs. Housing Median Age
ggplot(set3, aes(x = housing_median_age, y = median_house_value)) +
  geom_point() +
  ggtitle("Median House Value vs. Housing Median Age")

The “Median House Value vs. Rooms per Bedroom” plot contains outliers with extremely high rooms-per-bedroom values and does not show a clear linear relationship. The “Median House Value vs. Median Income” plot shows the same positive relationship as in Set 1. The “Median House Value vs. Housing Median Age” plot shows an uneven spread of points and does not demonstrate a clear linear relationship.
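The nine plots above all follow the same pattern, so they could be generated from one helper. The sketch below is only one possible way to do that: plot_vs_response is a hypothetical function name, the toy data frame contains invented values, and the alpha setting is an arbitrary choice to reduce overplotting.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of one predictor against the response
plot_vs_response <- function(data, predictor, response = "median_house_value") {
  ggplot(data, aes(x = .data[[predictor]], y = .data[[response]])) +
    geom_point(alpha = 0.3, na.rm = TRUE) +  # na.rm = TRUE drops NA rows silently
    ggtitle(paste(response, "vs.", predictor))
}

# Toy data with invented values, standing in for set1, set2, or set3
toy <- data.frame(
  median_house_value = c(100000, 150000, 220000, 310000),
  median_income      = c(2.1, 3.0, 4.2, 6.5),
  housing_median_age = c(40, 12, 25, 30)
)

# One plot per explanatory variable
predictors <- setdiff(names(toy), "median_house_value")
plots <- lapply(predictors, function(p) plot_vs_response(toy, p))
names(plots) <- predictors

plots$median_income  # displays the plot in an interactive session
```

Calling the same helper on set1, set2, and set3 should reproduce the plots above, apart from the titles, the point transparency, and the suppressed missing-value warnings.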

• Calculate the appropriate correlation coefficient for each of these combinations

cor1 <- cor(set1)
cor2 <- cor(set2)
cor3 <- cor(set3)

cor1
##                    median_house_value median_income housing_median_age
## median_house_value          1.0000000     0.6880752          0.1056234
## median_income               0.6880752     1.0000000         -0.1190340
## housing_median_age          0.1056234    -0.1190340          1.0000000
## total_rooms                 0.1341531     0.1980496         -0.3612622
##                    total_rooms
## median_house_value   0.1341531
## median_income        0.1980496
## housing_median_age  -0.3612622
## total_rooms          1.0000000
cor2
##                    median_house_value populationdensity total_bedrooms
## median_house_value         1.00000000       -0.02373741             NA
## populationdensity         -0.02373741        1.00000000             NA
## total_bedrooms                     NA                NA              1
## households                 0.06584265       -0.02730936             NA
##                     households
## median_house_value  0.06584265
## populationdensity  -0.02730936
## total_bedrooms              NA
## households          1.00000000
cor3
##                    median_house_value roomsperbedroom median_income
## median_house_value          1.0000000              NA     0.6880752
## roomsperbedroom                    NA               1            NA
## median_income               0.6880752              NA     1.0000000
## housing_median_age          0.1056234              NA    -0.1190340
##                    housing_median_age
## median_house_value          0.1056234
## roomsperbedroom                    NA
## median_income              -0.1190340
## housing_median_age          1.0000000

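These correlation matrices are incomplete. The plot warnings above report 207 rows with missing total_bedrooms, so every correlation that involves total_bedrooms, or roomsperbedroom, which is derived from it, is returned as NA. The missing values can be handled by removing the affected rows or by imputing them, for example with the column mean.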
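As a rough illustration of the two options, the sketch below uses a toy data frame with invented numbers. It compares dropping incomplete rows through the use argument of cor() with a simple mean imputation; neither is presented as the method used in this report.

```r
# Toy data with one missing value; numbers are invented
toy <- data.frame(
  y = c(10, 20, 30, 40, 50),
  x = c(1, 2, NA, 4, 6)
)

# Default behaviour: an NA in x makes its correlation with y NA
r_default <- cor(toy)["y", "x"]

# Option 1: drop incomplete rows inside cor()
r_complete <- cor(toy, use = "complete.obs")["y", "x"]

# Option 2: replace NA with the column mean, then correlate
imputed <- toy
imputed$x[is.na(imputed$x)] <- mean(imputed$x, na.rm = TRUE)
r_imputed <- cor(imputed)["y", "x"]

c(default = r_default, complete = r_complete, imputed = r_imputed)
```

When only one variable is imputed with its mean, the resulting correlation can be no larger in magnitude than the complete-case value, so mean imputation tends to understate the strength of a relationship.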

These results show a moderately strong positive correlation (about 0.69) between median_income and median_house_value in Sets 1 and 3, indicating that areas with higher median incomes typically have higher median house values. The other variables in these sets correlate only weakly with median_house_value (0.11 for housing_median_age and 0.13 for total_rooms). In Set 2, the correlations that can be computed are close to zero (-0.02 for populationdensity and 0.07 for households), and those involving total_bedrooms are NA because of its missing values.

Correlations for the three combinations after removing rows with missing values:

set1 <- na.omit(set1)
set2 <- na.omit(set2)
set3 <- na.omit(set3)
cor1 <- cor(set1)
cor2 <- cor(set2)
cor3 <- cor(set3)

cor1
##                    median_house_value median_income housing_median_age
## median_house_value          1.0000000     0.6880752          0.1056234
## median_income               0.6880752     1.0000000         -0.1190340
## housing_median_age          0.1056234    -0.1190340          1.0000000
## total_rooms                 0.1341531     0.1980496         -0.3612622
##                    total_rooms
## median_house_value   0.1341531
## median_income        0.1980496
## housing_median_age  -0.3612622
## total_rooms          1.0000000
cor2
##                    median_house_value populationdensity total_bedrooms
## median_house_value         1.00000000       -0.02363940     0.04968618
## populationdensity         -0.02363940        1.00000000    -0.02835486
## total_bedrooms             0.04968618       -0.02835486     1.00000000
## households                 0.06489355       -0.02733581     0.97972827
##                     households
## median_house_value  0.06489355
## populationdensity  -0.02733581
## total_bedrooms      0.97972827
## households          1.00000000
cor3
##                    median_house_value roomsperbedroom median_income
## median_house_value          1.0000000       0.3839203     0.6883555
## roomsperbedroom             0.3839203       1.0000000     0.7665784
## median_income               0.6883555       0.7665784     1.0000000
## housing_median_age          0.1064320      -0.1608531    -0.1182777
##                    housing_median_age
## median_house_value          0.1064320
## roomsperbedroom            -0.1608531
## median_income              -0.1182777
## housing_median_age          1.0000000

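With missing values removed, median income has the strongest correlation with median house value (about 0.69). Rooms per bedroom is also positively correlated with median house value, but less strongly (0.38), while housing median age (0.11), households (0.06), total bedrooms (0.05), and population density (-0.02) show little linear association with it. Housing median age is weakly negatively correlated with both median income (-0.12) and rooms per bedroom (-0.16). Two pairs of explanatory variables are strongly correlated with each other: total_bedrooms and households (0.98), and roomsperbedroom and median_income (0.77), so each pair carries largely overlapping information.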
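One way to read a correlation matrix systematically is to rank the explanatory variables by their correlation with the response and then look for explanatory variables that are highly correlated with each other. The sketch below rebuilds the Set 3 matrix from the values printed above; the 0.7 cutoff is an arbitrary illustrative threshold, not a rule from the assignment.

```r
# Set 3 correlation matrix, copied from the printed output above
vars <- c("median_house_value", "roomsperbedroom",
          "median_income", "housing_median_age")
cor3_printed <- matrix(
  c(1.0000000,  0.3839203,  0.6883555,  0.1064320,
    0.3839203,  1.0000000,  0.7665784, -0.1608531,
    0.6883555,  0.7665784,  1.0000000, -0.1182777,
    0.1064320, -0.1608531, -0.1182777,  1.0000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Correlation of each explanatory variable with the response, strongest first
with_response <- cor3_printed[-1, "median_house_value"]
ranked <- sort(abs(with_response), decreasing = TRUE)
ranked

# Pairs of explanatory variables above the illustrative |r| > 0.7 cutoff
pred <- cor3_printed[-1, -1]
high <- which(abs(pred) > 0.7 & upper.tri(pred), arr.ind = TRUE)
data.frame(
  var1 = rownames(pred)[high[, "row"]],
  var2 = colnames(pred)[high[, "col"]],
  r    = pred[high]
)
```

For Set 3 this ranks median_income first and flags the roomsperbedroom and median_income pair.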

• Build a confidence interval for each of the response variables. Provide a detailed conclusion of the response variable (i.e., the population) based on your confidence interval.

confidence_interval <- t.test(housing_data$median_house_value, conf.level = 0.95)

confidence_interval
## 
##  One Sample t-test
## 
## data:  housing_data$median_house_value
## t = 257.53, df = 20639, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  205281.4 208430.2
## sample estimates:
## mean of x 
##  206855.8

By default, t.test() tests the null hypothesis that the true mean of median_house_value is 0. The p-value is extremely small (p < 2.2e-16), so we reject that null hypothesis, although this is expected because house values cannot be negative. The more informative result is the interval itself: we are 95% confident that the true population mean of median_house_value lies between 205,281.4 and 208,430.2. The sample mean, 206,855.8, is the midpoint of this interval by construction. The interval is narrow (about ±1,574, under 1% of the mean) because the sample is large (df = 20639, i.e., n = 20,640).
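The interval reported by t.test() is the sample mean plus or minus a t critical value times the standard error. The sketch below reproduces that calculation by hand on a simulated vector; the simulated values and the seed are assumptions for illustration and only stand in for a numeric column such as median_house_value.

```r
set.seed(1)
# Simulated sample; not the real housing data
x <- rnorm(200, mean = 200000, sd = 100000)

n     <- length(x)
xbar  <- mean(x)               # sample mean
se    <- sd(x) / sqrt(n)       # standard error of the mean
tcrit <- qt(0.975, df = n - 1) # two-sided 95% critical value

manual_ci <- c(xbar - tcrit * se, xbar + tcrit * se)
ttest_ci  <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

all.equal(manual_ci, ttest_ci)  # the two intervals agree
```

Because the standard error shrinks with the square root of n, a sample of more than 20,000 observations gives a much narrower interval than this 200-observation example.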