Data Dive Week 8

Select Response and Explanatory Variables:

Response Variable: We choose the “median_house_value” column, which represents the median house value in a given area. It is a continuous variable, which makes it suitable as the response in both ANOVA and linear regression.

Explanatory Variable: Let’s select “ocean_proximity” as the categorical variable. It describes the proximity of the housing district to the ocean, which might influence house prices.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

housing_data <- read.csv("/Users/sharmistaroy/Downloads/housing.csv")

# Response variable: median_house_value

# Categorical explanatory variable: ocean_proximity
unique(housing_data$ocean_proximity)

## [1] "NEAR BAY"   "<1H OCEAN"  "INLAND"     "NEAR OCEAN" "ISLAND"

# Perform ANOVA test
anova_result <- aov(median_house_value ~ ocean_proximity, data = housing_data)

# Summary of the ANOVA test
summary(anova_result)

##                    Df    Sum Sq   Mean Sq F value Pr(>F)    
## ocean_proximity     4 6.544e+13 1.636e+13    1612 <2e-16 ***
## Residuals       20635 2.094e+14 1.015e+10                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

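The extremely small p-value (<2e-16) leads us to reject the null hypothesis that the mean of “median_house_value” is the same across all “ocean_proximity” categories. There is strong evidence that house values differ by proximity to the ocean, and the sums of squares show that the grouping accounts for about 24% of the total variance (6.544e13 / (6.544e13 + 2.094e14) ≈ 0.24). The F-test alone does not tell us which categories differ from each other, and it shows an association rather than a causal effect.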
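A natural follow-up to a significant ANOVA is a post-hoc comparison of the individual groups. The block below is a minimal sketch using Tukey's HSD from base R. It runs on a small simulated data frame (`toy`, a hypothetical stand-in with made-up values and only three categories), so the numbers it prints say nothing about the real dataset; in the report the same calls would be applied to `housing_data`.

```r
# Hypothetical stand-in data (NOT the real housing data): three categories
# with simulated house values, only so the example runs on its own.
set.seed(1)
toy <- data.frame(
  ocean_proximity = factor(rep(c("INLAND", "NEAR BAY", "NEAR OCEAN"), each = 30)),
  median_house_value = c(rnorm(30, mean = 125000, sd = 40000),
                         rnorm(30, mean = 260000, sd = 40000),
                         rnorm(30, mean = 250000, sd = 40000))
)

# Same model form as in the report
fit <- aov(median_house_value ~ ocean_proximity, data = toy)

# Mean house value per category, to see the direction of the differences
group_means <- aggregate(median_house_value ~ ocean_proximity, data = toy, FUN = mean)
print(group_means)

# Tukey's HSD: every pairwise difference with a family-wise adjusted p-value
tukey <- TukeyHSD(fit)
print(tukey)
```

Each row of the Tukey table is one pair of categories, with the estimated difference in means, a confidence interval, and an adjusted p-value.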

Linear Regression:

# Build a linear regression model
linear_model <- lm(median_house_value ~ median_income, data = housing_data)

# Summary of the linear regression model
summary(linear_model)

## 
## Call:
## lm(formula = median_house_value ~ median_income, data = housing_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -540697  -55950  -16979   36978  434023 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    45085.6     1322.9   34.08   <2e-16 ***
## median_income  41793.8      306.8  136.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83740 on 20638 degrees of freedom
## Multiple R-squared:  0.4734, Adjusted R-squared:  0.4734 
## F-statistic: 1.856e+04 on 1 and 20638 DF,  p-value: < 2.2e-16

In summary, the simple linear regression shows that median income is a statistically significant predictor of median house value (p < 2e-16): each one-unit increase in median_income is associated with an increase of about 41,794 in the predicted median house value. The model explains about 47% of the variance (R-squared = 0.4734), and the residual standard error of 83,740 shows that a substantial amount of variation in house values remains unexplained. Additional explanatory variables are needed to account for it.
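To make the fitted line concrete, the sketch below plugs a few values into the estimated equation. The intercept and slope are copied from the summary output above; the `income` values are hypothetical inputs chosen only for illustration.

```r
# Coefficients copied from the summary(linear_model) output
intercept <- 45085.6
slope <- 41793.8

# Hypothetical median_income values (illustrative, not taken from the data)
income <- c(2, 5, 8)

# Fitted line: median_house_value = intercept + slope * median_income
predicted <- intercept + slope * income
print(data.frame(median_income = income, predicted_value = predicted))
```

With the model object available, `predict(linear_model, newdata = data.frame(median_income = income))` gives the same values, up to the rounding of the printed coefficients.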

Including Additional Variables in the Regression Model:

# Build a linear regression model with additional variables
multi_var_model <- lm(median_house_value ~ median_income + population + total_bedrooms, data = housing_data)

# Summary of the multi-variable regression model
summary(multi_var_model)

## 
## Call:
## lm(formula = median_house_value ~ median_income + population + 
##     total_bedrooms, data = housing_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -542542  -54418  -13494   37880  744280 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    41074.909   1493.752   27.50   <2e-16 ***
## median_income  42105.204    299.954  140.37   <2e-16 ***
## population       -34.227      1.049  -32.62   <2e-16 ***
## total_bedrooms    95.868      2.822   33.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81410 on 20429 degrees of freedom
##   (207 observations deleted due to missingness)
## Multiple R-squared:  0.5028, Adjusted R-squared:  0.5027 
## F-statistic:  6885 on 3 and 20429 DF,  p-value: < 2.2e-16

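In summary, this multiple linear regression model is statistically significant overall (F = 6885, p < 2.2e-16), and median income, population, and total bedrooms each have a statistically significant coefficient. Holding the other predictors fixed, population has a negative coefficient (about -34.2) and total_bedrooms a positive one (about 95.9). Adding the two variables raises the adjusted R-squared only from 0.4734 to 0.5027, so roughly half of the variation in house values remains unexplained. Note also that 207 observations were dropped because of missing values, so this model was fitted on fewer rows than the single-predictor model. Further predictors, such as ocean_proximity, would be a natural next step.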
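Because the two models were fitted on different numbers of rows, a formal comparison needs both fitted on the same observations. The sketch below shows one way to do that with a nested-model F-test via `anova()`. It uses a simulated data frame (`toy`, a hypothetical stand-in whose coefficients and missing-value pattern are invented), so its output does not describe the real dataset.

```r
# Hypothetical stand-in data (NOT the real housing data)
set.seed(42)
n <- 200
toy <- data.frame(
  median_income  = runif(n, 1, 10),
  population     = round(runif(n, 200, 5000)),
  total_bedrooms = round(runif(n, 50, 1500))
)
toy$median_house_value <- 40000 + 42000 * toy$median_income -
  30 * toy$population + 90 * toy$total_bedrooms + rnorm(n, sd = 50000)
toy$total_bedrooms[sample(n, 10)] <- NA  # mimic missing values

# Keep only rows complete in every variable used, so both models see the same data
vars <- c("median_house_value", "median_income", "population", "total_bedrooms")
complete <- toy[complete.cases(toy[, vars]), ]

small <- lm(median_house_value ~ median_income, data = complete)
large <- lm(median_house_value ~ median_income + population + total_bedrooms,
            data = complete)

# F-test of whether the two extra predictors jointly improve the fit
comparison <- anova(small, large)
print(comparison)
```

A small p-value in the second row indicates that the added predictors reduce the residual sum of squares by more than chance would explain.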

Sharmista Kothavadla

2023-10-23
