The data set used in this project contains information gathered from block groups in California during the 1990 Census. Each block group contains an average of 1425.5 individuals living in a geographically compact area. The data set comprises 20640 observations of 9 independent variables and one dependent variable, median house value, which is expressed as ln(median house value). The variables cover geographical location, housing characteristics, demographic details, and economic indicators.
The primary objective of this analysis is to discern patterns and relationships within the data set, with a focus on the factors that influence median house values. Pairwise scatter plots will be used to determine the factor most strongly associated with median house value, and a simple linear regression on that factor will then be applied to explore the underlying dynamics of the housing market in California.
This data set was found on Kaggle. It consists of data built on the 1990 California census and originally published in 1997 in the journal Statistics and Probability Letters.
The block group is the smallest geographical unit for which the U.S. Census Bureau publishes data (typically 600 to 3000 people). Each observation in the data set represents one block group.
The data set is slightly modified from the original. A total of 207 values were removed from the total_bedrooms variable because they were missing or nonsensical. A categorical variable, oceanProximity, was also added as a rough indicator of whether each block group is near the ocean, near the Bay Area, inland, or on an island.
The data set contains 9 independent variables and 1 dependent variable. This analysis is primarily concerned with the factor that best explains the response in a simple linear regression. To that end, pairwise scatter plots will be produced to find the correlation between each independent variable and median house value. If the conditions for simple linear regression are violated, the bootstrap method will be used to alleviate possible issues.
The pairwise scatter plot is used to identify the independent variable that correlates best with the response variable. Categorical and ordinal variables were excluded because this method does not apply to them. Variables that are logically correlated with each other were also excluded from the final scatter plot. Longitude, latitude, and ocean proximity were grouped together. Total rooms, total bedrooms, population, and households were also grouped together. As a result, 4 variables were in the final scatter plot: median housing age, total bedrooms, median income, and the response, median house value.
Bootstrapping, a statistical technique that creates many alternative data sets by resampling the observed data with replacement, is a potent tool for simple linear regression. This resampling method sidesteps rigid assumptions about the population distribution and is often used in scenarios with small sample sizes.
In the realm of simple linear regression, bootstrapping is invaluable, providing reliable estimates and confidence intervals without the constraints imposed by traditional parametric methods.
A simple pairwise scatter plot was made to show the association between each variable and the response, in order to choose which variable to use in the simple linear regression. As mentioned previously, independent variables that correlate with one another were removed, since they likely have similar correlations with the response variable.
In this pairwise plot, median income has the strongest correlation with median house value, so it is the variable on which the simple linear regression will focus.
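The selection step can also be expressed numerically. The following is a minimal sketch in Python on synthetic data (not the census data); the column names, coefficients, and sample size are illustrative assumptions. It computes the Pearson correlation of each candidate predictor with the response and keeps the one with the largest absolute value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # toy sample size; the real data set has 20640 block groups

# Synthetic stand-ins for the candidate predictors (NOT the census data).
median_income = rng.gamma(shape=4.0, scale=1.0, size=n)
housing_age = rng.uniform(1, 52, size=n)
total_bedrooms = rng.gamma(shape=2.0, scale=250.0, size=n)

# Synthetic response driven mostly by income, plus noise (assumed coefficients).
house_value = (45000 + 40000 * median_income + 300 * housing_age
               + rng.normal(0, 60000, size=n))

predictors = {
    "median_income": median_income,
    "housing_median_age": housing_age,
    "total_bedrooms": total_bedrooms,
}

# Pearson correlation of each predictor with the response.
correlations = {
    name: float(np.corrcoef(x, house_value)[0, 1])
    for name, x in predictors.items()
}

# Keep the predictor with the largest absolute correlation.
best = max(correlations, key=lambda name: abs(correlations[name]))
print(correlations)
print("strongest predictor:", best)
```

The absolute value is used so that a strong negative association would also be selected.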
To fit a parametric model, several conditions must be satisfied: normality of the residuals, constant variance, linearity, and no influential points.
The Versus Fits plot mostly satisfies the condition of linearity, although the residuals vs. fits plot shows a somewhat concerning straight line. The normality condition is strongly violated, as the Q-Q plot does not follow a straight line. The scale-location plot shows a concerning pattern, indicating that constant variance is violated. There are no significant influential points. As a result, a parametric model is not appropriate for these data, and a bootstrap regression will be conducted.
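The diagnostic plots described above have rough numeric counterparts. The following is a minimal sketch in Python on synthetic data (not the census data); the variable names and the two summary statistics are illustrative choices, not the procedure used in this analysis. It fits a least-squares line, checks whether the spread of the residuals trends with the fitted values, and compares the sorted standardized residuals with normal quantiles, as a Q-Q plot does.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
n = 1000

# Synthetic data whose noise grows with x (NOT the census data).
x = rng.uniform(1, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=n) * x

# Ordinary least squares fit of y = b0 + b1 * x.
b1, b0 = np.polyfit(x, y, deg=1)
fitted = b0 + b1 * x
residuals = y - fitted

# Constant variance: |residuals| should not trend with the fitted values.
spread_trend = float(np.corrcoef(fitted, np.abs(residuals))[0, 1])

# Normality: sorted standardized residuals should line up with normal quantiles.
std_resid = np.sort((residuals - residuals.mean()) / residuals.std(ddof=1))
probs = (np.arange(1, n + 1) - 0.5) / n
normal_q = np.array([NormalDist().inv_cdf(p) for p in probs])
qq_corr = float(np.corrcoef(normal_q, std_resid)[0, 1])

print(f"corr(fitted, |residuals|) = {spread_trend:.3f}")
print(f"Q-Q correlation           = {qq_corr:.3f}")
```

A clearly positive value of the first statistic suggests that the variance grows with the fitted values, and a Q-Q correlation well below 1 suggests non-normal residuals. These numbers supplement the plots rather than replace them.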
A bootstrap regression is performed by repeatedly drawing bootstrap samples and fitting a regression model to each sample. As a result, a distribution and a 95% confidence interval for the coefficients of these regression models can be obtained.
A total of 1000 bootstrap samples, each of size N = 20640, were drawn, and the regression was refit on each. When plotted, the resulting bootstrap distribution appears approximately normal.
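The resampling procedure can be sketched as follows. This is a minimal illustration in Python on synthetic data (not the census data), assuming case resampling, where whole (x, y) pairs are drawn with replacement. The function name `bootstrap_slopes` and all parameter values are hypothetical.

```python
import numpy as np

def bootstrap_slopes(x, y, n_boot=1000, seed=0):
    """Refit a simple linear regression on n_boot case-resampled data sets."""
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # draw n row indices with replacement
        slopes[b] = np.polyfit(x[idx], y[idx], deg=1)[0]  # slope of the refit
    return slopes

# Synthetic data with non-constant variance (NOT the census data).
rng = np.random.default_rng(42)
x = rng.uniform(1, 10, size=2000)
y = 5.0 + 2.0 * x + rng.normal(0, 1, size=2000) * x

slopes = bootstrap_slopes(x, y, n_boot=1000)
print(f"mean of bootstrap slopes: {slopes.mean():.3f}")
print(f"bootstrap standard error: {slopes.std(ddof=1):.3f}")
```

Resampling whole pairs, rather than residuals, does not rely on constant variance, which is why it is a natural choice when that condition is in doubt.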
The 95% bootstrap confidence interval for the slope is [4.11 × 10^{4}, 4.25 × 10^{4}]. Since 0 is not in the confidence interval, the regression indicates that there is a relationship between median income and median house value per block group. The two variables are statistically correlated.
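One common way to obtain such an interval is the percentile method, which takes the 2.5th and 97.5th percentiles of the bootstrap slopes. The source does not state which interval type was used, so the following Python sketch is an assumption. The helper `percentile_ci` is hypothetical, and the slopes are a stand-in array of toy values, not the reported results.

```python
import numpy as np

def percentile_ci(estimates, level=0.95):
    """Percentile bootstrap interval: the middle `level` share of the estimates."""
    alpha = 1.0 - level
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

# Hypothetical stand-in for 1000 bootstrap slopes (NOT the reported results).
rng = np.random.default_rng(7)
toy_slopes = rng.normal(loc=2.0, scale=0.1, size=1000)

lower, upper = percentile_ci(toy_slopes)
excludes_zero = not (lower <= 0.0 <= upper)
print(f"95% CI: [{lower:.3f}, {upper:.3f}], excludes zero: {excludes_zero}")
```

If the interval excludes zero, the data are inconsistent with a slope of zero at the chosen level.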
In this scenario, bootstrapping generated a sampling distribution for the simple linear regression, since the assumptions about the residuals were not met. In this way, bootstrap simple linear regression is a non-parametric way to estimate and validate the relationship between the explanatory and response variables.
Bootstrapping offers a robust alternative for simple linear regression, particularly when parametric assumptions may be violated. When model assumptions are seriously breached and the sample size is not overly restricted, bootstrap confidence intervals for regression coefficients are a more dependable option than parametric p-values, as bootstrapping is inherently non-parametric.
Notably, even when the regression function takes a misspecified form, bootstrap confidence intervals remain valid. The benefits of bootstrapping are its applicability to modest sample sizes, its ability to accommodate misspecified models, and its capacity to handle uncertainty about parametric assumptions. This makes it a versatile approach for robust inference in linear regression.
Wang, H. (2018). housing.csv [Data set]. https://www.kaggle.com/code/harrywang/housing-price-prediction/input?select=housing.csv
O’Reilly Media (2017). California Housing. GitHub. https://github.com/ageron/handson-ml/blob/master/datasets/housing/housing.csv