Response Variable: We use the “median_house_value” column, the median house value of each housing district. It is a continuous variable and the quantity this analysis sets out to explain.
Explanatory Variable: We use “ocean_proximity”, a categorical variable that describes how close each housing district is to the ocean, which might be associated with house prices.
library(readr)
library(dplyr)
library(ggplot2)
housing_data <- read.csv("/Users/sharmistaroy/Downloads/housing.csv")
# Response variable: median_house_value
# Categorical explanatory variable: ocean_proximity
unique(housing_data$ocean_proximity)
## [1] "NEAR BAY" "<1H OCEAN" "INLAND" "NEAR OCEAN" "ISLAND"
# Perform ANOVA test
anova_result <- aov(median_house_value ~ ocean_proximity, data = housing_data)
# Summary of the ANOVA test
summary(anova_result)
##                    Df    Sum Sq   Mean Sq F value Pr(>F)
## ocean_proximity     4 6.544e+13 1.636e+13    1612 <2e-16 ***
## Residuals       20635 2.094e+14 1.015e+10
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The extremely small p-value (<2e-16) leads us to reject the null hypothesis that the mean of median house value is the same across all five “ocean_proximity” categories (F = 1612 on 4 and 20635 degrees of freedom). The data therefore give strong evidence that median house values differ with proximity to the ocean. The test does not show which categories differ from one another, and it establishes an association rather than a causal effect.
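Since the ANOVA only says that at least one category differs, a post hoc comparison is a natural next step. The following is a minimal sketch using Tukey's HSD from base R. The `demo` data frame is simulated stand-in data (an assumption so the block runs on its own, not values from housing.csv); with the real data, `housing_data` would take its place.

```r
# Simulated stand-in data (hypothetical values, NOT from housing.csv).
set.seed(1)
demo <- data.frame(
  ocean_proximity = factor(rep(c("INLAND", "NEAR BAY", "NEAR OCEAN"), each = 30)),
  median_house_value = c(rnorm(30, mean = 120000, sd = 20000),
                         rnorm(30, mean = 250000, sd = 20000),
                         rnorm(30, mean = 240000, sd = 20000))
)

# One-way ANOVA, same formula as in the analysis above.
fit <- aov(median_house_value ~ ocean_proximity, data = demo)

# Tukey's HSD: all pairwise differences in group means,
# with confidence intervals and p-values adjusted for multiple comparisons.
tukey <- TukeyHSD(fit)
print(tukey)
```

Each row of the printed table is one pair of categories; pairs whose adjusted p-value is small are the ones that plausibly drive the overall ANOVA result.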
# Build a linear regression model
linear_model <- lm(median_house_value ~ median_income, data = housing_data)
# Summary of the linear regression model
summary(linear_model)
##
## Call:
## lm(formula = median_house_value ~ median_income, data = housing_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -540697  -55950  -16979   36978  434023
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)    45085.6     1322.9   34.08   <2e-16 ***
## median_income  41793.8      306.8  136.22   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83740 on 20638 degrees of freedom
## Multiple R-squared: 0.4734, Adjusted R-squared: 0.4734
## F-statistic: 1.856e+04 on 1 and 20638 DF, p-value: < 2.2e-16
In summary, the linear regression model shows that median income is a statistically significant predictor of median house value: each one-unit increase in median_income is associated with an increase of about 41,794 in median_house_value (p < 2e-16). The R-squared of 0.4734 means the model explains about 47% of the variance, so roughly half of the variation in house values remains unexplained, and the residual standard error of 83,740 is large. Accounting for the other factors that affect property prices would require a richer model.
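To make the "share of variance explained" reading of R-squared concrete, here is a small sketch that recomputes it by hand as 1 - RSS/TSS and checks it against the value reported by `summary()`. The `demo` data are simulated (an assumption for illustration only, not the housing data), so the numbers it prints will not match the output above.

```r
# Simulated stand-in data (hypothetical values, NOT from housing.csv).
set.seed(2)
demo <- data.frame(median_income = runif(200, min = 1, max = 10))
demo$median_house_value <- 45000 + 42000 * demo$median_income +
  rnorm(200, sd = 80000)

fit <- lm(median_house_value ~ median_income, data = demo)

# Residual sum of squares: variation the model leaves unexplained.
rss <- sum(residuals(fit)^2)
# Total sum of squares: variation around the overall mean.
tss <- sum((demo$median_house_value - mean(demo$median_house_value))^2)

r_squared <- 1 - rss / tss
print(r_squared)
print(summary(fit)$r.squared)  # same quantity as computed by lm's summary
```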
# Build a linear regression model with additional variables
multi_var_model <- lm(median_house_value ~ median_income + population + total_bedrooms, data = housing_data)
# Summary of the multi-variable regression model
summary(multi_var_model)
##
## Call:
## lm(formula = median_house_value ~ median_income + population +
##     total_bedrooms, data = housing_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -542542  -54418  -13494   37880  744280
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    41074.909   1493.752   27.50   <2e-16 ***
## median_income  42105.204    299.954  140.37   <2e-16 ***
## population       -34.227      1.049  -32.62   <2e-16 ***
## total_bedrooms    95.868      2.822   33.98   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81410 on 20429 degrees of freedom
## (207 observations deleted due to missingness)
## Multiple R-squared: 0.5028, Adjusted R-squared: 0.5027
## F-statistic: 6885 on 3 and 20429 DF, p-value: < 2.2e-16
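In summary, this multiple linear regression model is statistically significant overall (F = 6885 on 3 and 20429 degrees of freedom), and median income, population, and total bedrooms each have significant coefficients. Holding the other predictors fixed, median house value is estimated to rise by about 42,105 per unit of median_income and by about 96 per unit of total_bedrooms, and to fall by about 34 per unit of population. The R-squared rises only modestly, from 0.4734 to 0.5028, so about half of the variation in house values is still unexplained. Because 207 observations with missing values were dropped, this model is fit to a slightly smaller sample than the single-predictor model. Further predictors and model refinement would be needed to account for the remaining variation.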
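One way to check whether the extra predictors add anything beyond median income is a nested-model F-test, with both models fit to the same complete rows so the comparison is like for like. The sketch below uses simulated stand-in data, with a few missing values injected on purpose; `demo`, `small`, and `large` are illustrative names, and `housing_data` would replace `demo` in the real analysis.

```r
# Simulated stand-in data (hypothetical values, NOT from housing.csv).
set.seed(3)
n <- 300
demo <- data.frame(
  median_income = runif(n, min = 1, max = 10),
  population = round(runif(n, min = 200, max = 5000))
)
demo$total_bedrooms <- round(demo$population / 3 + rnorm(n, sd = 100))
demo$median_house_value <- 40000 + 42000 * demo$median_income -
  30 * demo$population + 90 * demo$total_bedrooms + rnorm(n, sd = 80000)
demo$total_bedrooms[sample(n, 10)] <- NA  # mimic missing values

# Keep only rows with no missing values so both models use the same sample.
complete <- demo[complete.cases(demo), ]

small <- lm(median_house_value ~ median_income, data = complete)
large <- lm(median_house_value ~ median_income + population + total_bedrooms,
            data = complete)

# F-test of the null hypothesis that the added coefficients are all zero.
comparison <- anova(small, large)
print(comparison)
```

A small p-value in the second row of the table suggests the added predictors improve the fit beyond what median income alone achieves; it does not by itself say the improvement is practically large.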