STA631_Project Portfolio
Portfolio
My name is Sanjay Akula, and I am currently a Master’s student in data science and analytics. After receiving my bachelor’s degree in Data Science and Analytics, I worked for a service-based company. During that period I realized how limited my modeling and machine learning skills were. Now that I am enrolled in a master’s program, I can apply regression analysis to improve my analytical skills, and I want to use these techniques on a practical problem to advance my knowledge. I have decided to apply a Generalized Linear Model (GLM) to a dataset of home prices. My objective is to predict home prices and to produce visualizations that help explain these predictions.
Objective
This project’s main goal is to gain a solid grasp of linear regression analysis by using it to predict real estate values. To do this, a dataset must be thoroughly examined and preprocessed, relevant features must be chosen, and statistical models must be used to produce accurate predictions. By bridging theoretical knowledge with real-world application, the project aims to help students understand the subtleties and complexity of data analysis in practice. Students should expect to improve their analytical abilities, become proficient with statistical tools, and develop the capacity to interpret and present their findings properly. Furthermore, by asking students to evaluate the precision and reliability of their models, recognize the limitations of their analysis, and consider possible improvements, the project aims to foster critical thinking. In the end, the project gives students a chance to apply statistical ideas and techniques in a relevant setting, equipping them for further study or careers requiring data-driven decision-making.
Project Overview
The goal of the project “Using Linear Regression to Predict the Prices of Houses” is to develop a statistical model that can predict the sale prices of residential real estate. A fundamental technique in statistical modeling and machine learning, linear regression assumes a linear relationship between independent variables (like square footage, number of bedrooms, and other house features) and the dependent variable (the sale price of the house). The project applies this technique to a dataset of 4,600 house sales with 18 columns. The dataset covers a range of features thought to affect house prices, from lot size and location to physical aspects like condition and year built. The analysis starts with exploratory data analysis to identify the underlying distributions and correlations, followed by data preprocessing to handle missing values, outliers, and categorical variables. The project then selects features to determine the most significant predictors, fits a linear regression model, and validates the model to evaluate its predictive power and accuracy. In the end, the project aims to provide useful information about the real estate market to help investors, buyers, and sellers make informed choices based on important property attributes. The project is an example of how statistical theory and practical application can be combined through the management, analysis, and modeling of data sets in R.
Loading Packages
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading Dataset
# Load the dataset
data <- read_csv("C:\\Users\\A Sanjay\\OneDrive\\Desktop\\STA 635 Project\\archive (6)\\data.csv")
## Rows: 4600 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): street, city, statezip, country
## dbl (13): price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterf...
## dttm (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# read_csv: Reads the CSV file containing the dataset into an R data frame.
# head: Displays the first few rows of the dataset to verify it was loaded correctly.
# Display the first few rows of the dataset
print(head(data))
## # A tibble: 6 × 18
## date price bedrooms bathrooms sqft_living sqft_lot floors
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-05-02 00:00:00 313000 3 1.5 1340 7912 1.5
## 2 2014-05-02 00:00:00 2384000 5 2.5 3650 9050 2
## 3 2014-05-02 00:00:00 342000 3 2 1930 11947 1
## 4 2014-05-02 00:00:00 420000 3 2.25 2000 8030 1
## 5 2014-05-02 00:00:00 550000 4 2.5 1940 10500 1
## 6 2014-05-02 00:00:00 490000 2 1 880 6380 1
## # ℹ 11 more variables: waterfront <dbl>, view <dbl>, condition <dbl>,
## # sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
## # street <chr>, city <chr>, statezip <chr>, country <chr>
Summary Statistics
## date price bedrooms
## Min. :2014-05-02 00:00:00.00 Min. : 0 Min. :0.000
## 1st Qu.:2014-05-21 00:00:00.00 1st Qu.: 322875 1st Qu.:3.000
## Median :2014-06-09 00:00:00.00 Median : 460943 Median :3.000
## Mean :2014-06-07 03:14:42.77 Mean : 551963 Mean :3.401
## 3rd Qu.:2014-06-24 00:00:00.00 3rd Qu.: 654962 3rd Qu.:4.000
## Max. :2014-07-10 00:00:00.00 Max. :26590000 Max. :9.000
## bathrooms sqft_living sqft_lot floors
## Min. :0.000 Min. : 370 Min. : 638 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1460 1st Qu.: 5001 1st Qu.:1.000
## Median :2.250 Median : 1980 Median : 7683 Median :1.500
## Mean :2.161 Mean : 2139 Mean : 14852 Mean :1.512
## 3rd Qu.:2.500 3rd Qu.: 2620 3rd Qu.: 11001 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1074218 Max. :3.500
## waterfront view condition sqft_above
## Min. :0.000000 Min. :0.0000 Min. :1.000 Min. : 370
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:1190
## Median :0.000000 Median :0.0000 Median :3.000 Median :1590
## Mean :0.007174 Mean :0.2407 Mean :3.452 Mean :1827
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:2300
## Max. :1.000000 Max. :4.0000 Max. :5.000 Max. :9410
## sqft_basement yr_built yr_renovated street
## Min. : 0.0 Min. :1900 Min. : 0.0 Length:4600
## 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.0 Class :character
## Median : 0.0 Median :1976 Median : 0.0 Mode :character
## Mean : 312.1 Mean :1971 Mean : 808.6
## 3rd Qu.: 610.0 3rd Qu.:1997 3rd Qu.:1999.0
## Max. :4820.0 Max. :2014 Max. :2014.0
## city statezip country
## Length:4600 Length:4600 Length:4600
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
# summary: Provides summary statistics (min, quartiles, median, mean, max) for each numeric column in the dataset.
# Summary Statistics:
# Min, 1st Quartile, Median, Mean, 3rd Quartile, Max: These provide an overview of the data distribution for each feature. (summary does not report the standard deviation.)
# For example:
# Price: Mean = $551,963, the average house price. The median ($460,943) is lower, and the range runs from 0 to 26,590,000, so the column contains zero-price rows and extreme values.
# Bedrooms: Mean = 3.4; the quartiles show that the middle half of houses have 3 to 4 bedrooms.
# Bathrooms: Mean = 2.16; the middle half of houses have between 1.75 and 2.5 bathrooms.
# Sqft Living: Mean = 2139 sqft, the average living area size.
## date price bedrooms bathrooms sqft_living
## 0 0 0 0 0
## sqft_lot floors waterfront view condition
## 0 0 0 0 0
## sqft_above sqft_basement yr_built yr_renovated street
## 0 0 0 0 0
## city statezip country
## 0 0 0
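The zero counts above are per-column missing-value tallies. The chunk that produced them is not shown, so the following is only a minimal sketch of one common way to obtain such counts, run on a made-up toy data frame (`toy`, its values, and the names `missing_counts` and `suspect_rows` are illustrative assumptions, not part of the project data). It also flags non-positive prices, since the summary above reports a minimum price of 0.

```r
# Toy data frame standing in for the housing data (all values are made up)
toy <- data.frame(
  price    = c(313000, 0, 420000, NA),
  bedrooms = c(3, 5, NA, 2),
  city     = c("A", "B", NA, "C")
)

# Count missing values in each column
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Flag rows whose price is zero or negative; these cannot be real sale prices
suspect_rows <- which(!is.na(toy$price) & toy$price <= 0)
print(suspect_rows)
```

A check like this complements the missing-value count: a column can have no `NA` values and still contain placeholder zeros that distort a regression.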
Exploratory Data Analysis
# Exploratory Data Analysis (EDA)
# Histograms for key features
data %>%
gather(key = "variable", value = "value", price, bedrooms, bathrooms, sqft_living, floors, yr_built) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~variable, scales = "free_x") +
theme_minimal()
# gather: Converts wide-format data to long-format, which is useful for plotting multiple histograms in a single graph.
# ggplot: Initializes the plot.
# aes(x = value): Maps the value to the x-axis.
# geom_histogram: Creates histograms.
# facet_wrap: Creates separate plots for each variable.
# theme_minimal: Applies a minimal theme to the plot.
Histograms & Scatter Plots
# Distributions:
# Price Distribution: Right-skewed, indicating a few very expensive properties.
# Bedrooms and Bathrooms: Most properties have 3-4 bedrooms and roughly 2-2.5 bathrooms.
# Sqft Living: Peaks around 1500-2500 sqft, indicating the common range for living space.
# Floors and Year Built: Most properties have 1-2 floors; per the summary statistics, the middle half were built between 1951 and 1997.
# Scatter plots to examine relationships with price
ggplot(data, aes(x = sqft_living, y = price)) +
geom_point(color = "blue") +
theme_minimal() +
ggtitle("Price vs. Sqft Living")
ggplot(data, aes(x = bedrooms, y = price)) +
geom_point(color = "blue") +
theme_minimal() +
ggtitle("Price vs. Bedrooms")
ggplot(data, aes(x = bathrooms, y = price)) +
geom_point(color = "blue") +
theme_minimal() +
ggtitle("Price vs. Bathrooms")
# Scatter Plots: Visualize the relationships between price and other features (sqft_living, bedrooms, bathrooms).
# geom_point: Adds points to the scatter plot.
# ggtitle: Adds a title to each plot.
# Scatter Plots:
#
# Price vs. Sqft Living: Positive correlation; larger houses tend to have higher prices.
# Price vs. Bedrooms/Bathrooms: Some positive correlation, but less pronounced than sqft living.
Correlation Matrix
# Correlation matrix
correlation_matrix <- data %>%
select_if(is.numeric) %>%
cor()
# Correlation Matrix:
# Highest Correlations with Price:
# Sqft Living and Price: Positive correlation; as the living area increases, the price tends to increase. It is moderate rather than strong: the full model's R-squared is only 0.216, so no single included predictor can correlate with price above about 0.46.
# Sqft Above and Price: Also a positive correlation.
# Other Features: Waterfront, view, and condition also show some correlation with price.
# Plot the heatmap
correlation_matrix %>%
as.data.frame() %>%
rownames_to_column(var = "Var1") %>%
gather(key = "Var2", value = "value", -Var1) %>%
ggplot(aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
geom_text(aes(label = round(value, 2))) +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), space = "Lab", name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
coord_fixed()
# Correlation Matrix: Calculates and visualizes the correlation between numeric features.
# select_if(is.numeric): Selects only numeric columns.
# cor: Computes the correlation matrix.
# geom_tile: Creates a heatmap.
# geom_text: Adds correlation values as text.
# scale_fill_gradient2: Sets the color gradient for the heatmap.
Building Regression Model
# Building the linear regression model
model <- lm(price ~ bedrooms + bathrooms + sqft_living + floors + waterfront + view + condition + sqft_above + sqft_basement + yr_built + yr_renovated, data = data)
summary(model)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors +
## waterfront + view + condition + sqft_above + sqft_basement +
## yr_built + yr_renovated, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2105681 -128343 -16711 89700 26335249
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.609e+06 6.859e+05 6.720 2.04e-11 ***
## bedrooms -5.618e+04 1.049e+04 -5.356 8.92e-08 ***
## bathrooms 5.919e+04 1.701e+04 3.479 0.000508 ***
## sqft_living 2.292e+02 2.169e+01 10.565 < 2e-16 ***
## floors 4.598e+04 1.863e+04 2.468 0.013606 *
## waterfront 3.605e+05 9.386e+04 3.841 0.000124 ***
## view 4.480e+04 1.097e+04 4.082 4.54e-05 ***
## condition 3.125e+04 1.306e+04 2.393 0.016741 *
## sqft_above 2.208e+01 2.149e+01 1.027 0.304248
## sqft_basement NA NA NA NA
## yr_built -2.395e+03 3.419e+02 -7.006 2.82e-12 ***
## yr_renovated 6.772e+00 8.643e+00 0.784 0.433354
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 499800 on 4589 degrees of freedom
## Multiple R-squared: 0.216, Adjusted R-squared: 0.2143
## F-statistic: 126.4 on 10 and 4589 DF, p-value: < 2.2e-16
# lm: Fits a linear regression model.
# price ~ ...: Specifies the response variable (price) and the predictor variables (bedrooms, bathrooms, etc.).
# summary: Provides detailed information about the fitted model, including coefficients, R-squared value, and p-values.
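# Linear Regression Model:
#
# Coefficients: Provide the estimated change in price for a one-unit change in each feature, holding all other features constant.
# Positive Coefficients:
# Sqft Living: Each additional square foot is associated with about $229 more in price.
# Bathrooms, Floors, Waterfront, View, Condition: All positive; waterfront is the largest, at about $360,500.
# Negative Coefficients:
# Bedrooms: About -$56,180 per additional bedroom once living area and the other features are held constant.
# Year Built: About -$2,395 per year, so newer houses are predicted to be slightly cheaper than older ones with the same other features.
# Not Estimated or Not Significant:
# Sqft Basement: NA because of a singularity; the column means (1827 + 312.1 ≈ 2139) are consistent with sqft_living = sqft_above + sqft_basement.
# Sqft Above and Yr Renovated: Not statistically significant (p = 0.30 and p = 0.43).
# Intercept: The predicted price when all features are zero; it has no practical interpretation here.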
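The model output above reports "(1 not defined because of singularities)" and an `NA` row for `sqft_basement`. A likely cause, assumed here rather than verified on the project data, is that living area equals above-ground area plus basement area, so the third column carries no new information. The sketch below reproduces that behaviour on simulated data; every value and the names `fit_all` and `fit_ok` are illustrative.

```r
set.seed(42)
n <- 100

# Simulated areas; sqft_living is constructed as an exact sum of the other two
sqft_above    <- round(runif(n, 800, 3000))
sqft_basement <- round(runif(n, 0, 1000))
sqft_living   <- sqft_above + sqft_basement
price <- 50000 + 200 * sqft_living + rnorm(n, sd = 20000)

# With all three area columns, lm cannot estimate the last one
fit_all <- lm(price ~ sqft_living + sqft_above + sqft_basement)
print(coef(fit_all))

# Dropping one of the three removes the exact linear dependence
fit_ok <- lm(price ~ sqft_above + sqft_basement)
print(coef(fit_ok))
```

Under this assumption, removing either `sqft_basement` or `sqft_living` from the formula would give the same fitted values without the undefined coefficient.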
# Model evaluation
# Predicting on the dataset
predictions <- predict(model, data)
Model Evaluation
# Model Evaluation:
# R-squared:
# Value: Measures the proportion of variance in the dependent variable (price) explained by the independent variables (features).
# Example: An R-squared value of 0.65 would indicate that 65% of the variability in house prices is explained by the model. This model's value is about 0.216, so it explains roughly 22%.
# Calculating R-squared and Mean Squared Error
r_squared <- summary(model)$r.squared
mse <- mean((data$price - predictions)^2)
# Mean Squared Error (MSE):
# Value: The average of the squares of the errors (difference between actual and predicted prices).
# Example: The MSE here is about 2.49e11 (in squared dollars); its square root, the RMSE, is about $499,000, which is the typical size of a prediction error.
# Printing the evaluation metrics
print(paste("R-squared: ", r_squared))
## [1] "R-squared: 0.215998743639482"
print(paste("Mean Squared Error: ", mse))
## [1] "Mean Squared Error: 249187320761.844"
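The metrics above are computed on the same rows used to fit the model, which tends to flatter the fit. As a hedged sketch of an alternative, the example below holds out part of a simulated dataset and reports RMSE, which is in the same units as price. The data frame `sim`, the 80/20 split, and the helper `rmse` are assumptions made for illustration, not part of the original script.

```r
set.seed(123)
n <- 500

# Simulated stand-in for the housing data
sim <- data.frame(sqft_living = runif(n, 500, 4000))
sim$price <- 100000 + 250 * sim$sqft_living + rnorm(n, sd = 80000)

# Hold out 20% of the rows for testing
test_idx <- sample(n, size = n / 5)
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

# Fit on the training rows only
fit <- lm(price ~ sqft_living, data = train)

# Root mean squared error: square root of the MSE, in price units
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

train_rmse <- rmse(train$price, predict(fit, newdata = train))
test_rmse  <- rmse(test$price,  predict(fit, newdata = test))
print(c(train = train_rmse, test = test_rmse))
```

A test RMSE much larger than the training RMSE would suggest overfitting; similar values suggest the error estimate generalizes.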
# predict: Uses the fitted model to predict price for the dataset.
# R-squared: Measures the proportion of variance in the dependent variable that is predictable from the independent variables.
# Mean Squared Error (MSE): Measures the average squared difference between observed and predicted values.
Summary of Findings
# Summary of Findings
# Most Influential Features:
# Sqft Living: Strongest predictor of house price (largest t value in the model).
# Bedrooms and Bathrooms: Bathrooms add to the predicted price; bedrooms have a negative coefficient once living area is held constant.
# Waterfront and View: Premium features that significantly affect price.
# Year Built: Holding other features constant, newer homes are predicted to be slightly cheaper (negative coefficient).
# Model Performance:
# R-squared: About 0.216, a weak fit; the model leaves roughly 78% of the variability in house prices unexplained.
# MSE: About 2.49e11, which corresponds to an RMSE of roughly $499,000.
# Practical Implications:
# For Home Buyers: Understanding which features most influence price can guide purchasing decisions.
# For Home Sellers: Highlighting features like living space, waterfront, and view can help justify higher prices.
# For Real Estate Agents: Insights can aid in pricing strategies and advising clients on property improvements.
# Summary of Packages and Functions Used
# tidyverse: A collection of packages for data science tasks.
# read_csv: Reads CSV files.
# ggplot2: Creates graphics and visualizations.
# dplyr: Provides functions for data manipulation (e.g., select_if, summarize).
# tidyr: Helps in tidying data (e.g., gather).
# lm: Fits linear models.
# predict: Makes predictions based on the fitted model.
# summary: Provides summary statistics for data and models.
#
# cor: Computes correlation matrices.
# write_csv: Writes data to CSV files.
# Interpretation of the Results
# Summary Statistics and Missing Values: Confirm that the data is complete and provide an overview of the key features.
# EDA: Visualize data distributions and relationships between variables to identify patterns and insights.
# Correlation Matrix: Understand the relationships between numeric variables and identify potential multicollinearity.
# Linear Regression Model: Quantify the effect of each predictor on the response variable (price).
# Model Evaluation: Assess model performance using R-squared and MSE.
# Predictions: Save the predicted values for further analysis or reporting.
# Save the results
write_csv(data.frame(predictions), "predictions.csv")
Describe probability as a foundation of statistical modeling, including inference and maximum likelihood estimation
The mathematical basis of statistical modeling is probability theory, which allows us to quantify uncertainty, predict outcomes we have not yet observed, and draw conclusions about a population from sample data. Probability provides the foundation for determining how likely different outcomes are, given a set of assumptions and observed data.

Statistical inference is the process of drawing conclusions about a population from sample data. Typically, it entails testing hypotheses, making predictions, and estimating population parameters (such as means and variances). Probability theory supports inference by letting us compute how probable our sample data would be under various assumptions about the population. For instance, if we assume that a dataset follows a normal distribution, probability theory lets us compute the probability that a sample mean falls within a certain range. This computation quantifies how much confidence we can place in the sample mean as an estimate of the population mean.

Maximum likelihood estimation (MLE) is used to estimate the parameters of a statistical model. The probability (or density) of observing the sample data given a set of parameter values, viewed as a function of those parameters, is called the “likelihood.” MLE selects the parameter values that maximize this likelihood, that is, the values under which the observed data are most probable. The method applies to many kinds of statistical models, from simple linear regression to more complex models.
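To make the idea of maximum likelihood concrete, the following minimal sketch estimates the mean and standard deviation of a simulated normal sample by numerically minimizing the negative log-likelihood, then compares the result with the closed-form estimates. The sample `x`, the starting values, and the choice of `optim` with BFGS are illustrative assumptions, not part of the project analysis.

```r
set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)   # simulated sample

# Negative log-likelihood of a normal model. The standard deviation is
# parameterised on the log scale so the optimiser cannot propose a
# negative value.
neg_loglik <- function(par) {
  -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
}

# Numerical maximisation of the likelihood (minimisation of its negative)
fit <- optim(par = c(median(x), log(IQR(x))), fn = neg_loglik, method = "BFGS")
mle_mean <- fit$par[1]
mle_sd   <- exp(fit$par[2])

# Closed-form maximum likelihood estimates (note the divisor n, not n - 1)
closed_mean <- mean(x)
closed_sd   <- sqrt(mean((x - mean(x))^2))

print(c(mle_mean = mle_mean, closed_mean = closed_mean))
print(c(mle_sd = mle_sd, closed_sd = closed_sd))
```

The two sets of estimates should agree up to the optimiser's tolerance. For linear regression with normally distributed errors, the same principle yields the ordinary least squares coefficients.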
Determine and apply the appropriate generalized linear model for a specific data context
In this case, the selling price of homes (price) is predicted using a Generalized Linear Model (GLM) with predictors such as living area (sqft_living), condition (condition), and year built (yr_built). This method uses linear regression, a special case of the GLM for continuous outcomes, to model the relationship between the target variable and the chosen features. The model is fitted with lm, which is equivalent to R's glm function with the Gaussian family and an identity link function. The GLM approach provides a flexible framework for investigating linear relationships between a continuous target variable and a set of predictors. Both theoretical reasoning and exploratory data analysis should inform the choice of predictors, and interpreting the model summary gives insight into the factors that determine home prices.
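As a small check of the Gaussian family and identity link described above, this sketch fits the same formula with `lm` and with `glm` on simulated data and compares the coefficients. The data frame `sim` and its generating values are made up for illustration; only the column names mirror the project dataset.

```r
set.seed(7)
n <- 300

# Simulated stand-in for the housing data
sim <- data.frame(
  sqft_living = runif(n, 500, 4000),
  yr_built    = sample(1900:2014, n, replace = TRUE)
)
sim$price <- 200000 + 230 * sim$sqft_living -
  500 * (sim$yr_built - 1900) + rnorm(n, sd = 50000)

# Ordinary linear regression
fit_lm <- lm(price ~ sqft_living + yr_built, data = sim)

# The same model written as a GLM: Gaussian errors, identity link
fit_glm <- glm(price ~ sqft_living + yr_built,
               family = gaussian(link = "identity"), data = sim)

# Compare the two sets of coefficient estimates
print(all.equal(coef(fit_lm), coef(fit_glm)))
```

Writing the model with `glm` makes it easy to try other families later, for example a family suited to strictly positive, right-skewed responses, without changing the rest of the code.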
Conduct model selection for a set of candidate models
To pick the best statistical model for predicting the target variable, several models must be compared on factors such as simplicity, predictive performance, and underlying assumptions. For house price prediction, candidate models may use different predictor types (polynomial terms, interaction terms) or contain different subsets of predictors. Model selection is a crucial and iterative part of the modeling process, because it allows the approach to be refined in light of empirical evidence. My R script applies theoretical principles that I learned in class in a practical way, weighing candidate models against objective comparison criteria. It focuses on selecting and developing generalized linear models (GLMs) for predicting housing values.
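One way such a comparison could look is sketched below: three nested candidate models are fitted to simulated data and ranked by AIC, with adjusted R-squared alongside. The candidates, the simulated data frame `sim`, and the pure-noise column `noise` are assumptions chosen for illustration, not the models compared in the project.

```r
set.seed(2024)
n <- 400

# Simulated stand-in for the housing data; `noise` has no effect on price
sim <- data.frame(
  sqft_living = runif(n, 500, 4000),
  bedrooms    = sample(1:6, n, replace = TRUE),
  noise       = rnorm(n)
)
sim$price <- 100000 + 220 * sim$sqft_living +
  15000 * sim$bedrooms + rnorm(n, sd = 60000)

# Candidate models of increasing complexity
m_null <- lm(price ~ 1, data = sim)
m_size <- lm(price ~ sqft_living, data = sim)
m_full <- lm(price ~ sqft_living + bedrooms + noise, data = sim)

# AIC penalises extra parameters; lower values are preferred
comparison <- AIC(m_null, m_size, m_full)
comparison$adj_r2 <- sapply(
  list(m_null, m_size, m_full),
  function(m) summary(m)$adj.r.squared
)
print(comparison[order(comparison$AIC), ])
```

AIC and adjusted R-squared both trade fit against complexity, so they usually rank models similarly; cross-validated prediction error is a common complement when prediction is the main goal.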
Communicate the results of statistical models to a general audience
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors +
## waterfront + view + condition + sqft_above + sqft_basement +
## yr_built + yr_renovated, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2105681 -128343 -16711 89700 26335249
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.609e+06 6.859e+05 6.720 2.04e-11 ***
## bedrooms -5.618e+04 1.049e+04 -5.356 8.92e-08 ***
## bathrooms 5.919e+04 1.701e+04 3.479 0.000508 ***
## sqft_living 2.292e+02 2.169e+01 10.565 < 2e-16 ***
## floors 4.598e+04 1.863e+04 2.468 0.013606 *
## waterfront 3.605e+05 9.386e+04 3.841 0.000124 ***
## view 4.480e+04 1.097e+04 4.082 4.54e-05 ***
## condition 3.125e+04 1.306e+04 2.393 0.016741 *
## sqft_above 2.208e+01 2.149e+01 1.027 0.304248
## sqft_basement NA NA NA NA
## yr_built -2.395e+03 3.419e+02 -7.006 2.82e-12 ***
## yr_renovated 6.772e+00 8.643e+00 0.784 0.433354
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 499800 on 4589 degrees of freedom
## Multiple R-squared: 0.216, Adjusted R-squared: 0.2143
## F-statistic: 126.4 on 10 and 4589 DF, p-value: < 2.2e-16
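Raw regression output like the table above can be hard for a general audience to read. The sketch below shows one possible way to turn coefficients into a compact table with 95% confidence intervals and plain-language sentences. It runs on simulated data; the data frame `sim`, the table `report`, and the sentence template are illustrative assumptions.

```r
set.seed(99)
n <- 300

# Simulated stand-in for the housing data
sim <- data.frame(
  sqft_living = runif(n, 500, 4000),
  waterfront  = rbinom(n, 1, 0.1)
)
sim$price <- 100000 + 230 * sim$sqft_living +
  300000 * sim$waterfront + rnorm(n, sd = 80000)

fit <- lm(price ~ sqft_living + waterfront, data = sim)

# Coefficient table with 95% confidence intervals, rounded to whole dollars
ci <- confint(fit, level = 0.95)
report <- data.frame(
  term     = names(coef(fit)),
  estimate = round(coef(fit)),
  lower_95 = round(ci[, 1]),
  upper_95 = round(ci[, 2]),
  row.names = NULL
)
print(report)

# Format a number as whole dollars with thousands separators
dollars <- function(v) formatC(v, format = "f", digits = 0, big.mark = ",")

# One plain-language sentence per predictor (skipping the intercept)
for (i in 2:nrow(report)) {
  cat(sprintf(
    "A one-unit increase in %s is associated with a change of about $%s in predicted price (95%% CI: $%s to $%s), holding the other predictor fixed.\n",
    report$term[i], dollars(report$estimate[i]),
    dollars(report$lower_95[i]), dollars(report$upper_95[i])
  ))
}
```

Reporting an interval alongside each estimate lets a non-technical reader see how precisely each effect is known, not only its size.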
Use programming software (i.e., R) to fit and assess statistical models
## [1] "R-squared: 0.215998743639482"
## [1] "Mean Squared Error: 249187320761.844"
Reflection
I frequently participated in online class discussions and asked questions to advance both my own learning and that of my peers. This also increased interaction in our classes. My enthusiasm extended outside of the classroom. Our small competitions were a lot of fun, and I felt that I could apply what we had learned in a constructive and enjoyable way. These exercises improved my ability to solve problems and work with others. I also contributed to, and learned from, others on GitHub during the course. My guidance and my familiarity with the platform helped us all study more collaboratively. In short, I gave my full attention to all aspects of the course, including homework, GitHub assistance, online discussions, and competitions. By demonstrating what it means to participate actively, I wanted to help everyone in our course group learn, as well as myself.
Conclusion
The analysis provides an overview of the factors associated with house prices. The dataset, loaded and explored using various R packages, revealed key insights through summary statistics and visualizations. The data had no missing values, although the price column includes zero values and extreme outliers. Exploratory data analysis showed that features like square footage of living space, the number of bathrooms, and premium features like waterfront and view are positively associated with house prices. The correlation matrix highlighted positive correlations between these features and price, particularly for square footage. The linear regression model quantified these relationships: living space had the largest t value, waterfront and view carried large positive coefficients, and bedrooms and year built had negative coefficients once the other features were held constant. The model’s R-squared of about 0.216 indicates a weak fit, explaining roughly 22% of the variability in house prices, and the mean squared error corresponds to a typical prediction error of roughly $499,000. These findings can still guide home buyers, sellers, and real estate agents by indicating which property features matter most for pricing, but the predictions should be used with caution until the model is improved. The results, including model predictions, were saved for further analysis and practical application.