STA631_Project Portfolio

Portfolio

My name is Sanjay Akula, and I am a Master’s student in data science and analytics. After earning my bachelor’s degree in Data Science and Analytics, I worked for a service-based company, and it was during that period that I realized how limited my modeling and machine learning skills were. Now that I am enrolled in a master’s program, I can use regression analysis to strengthen my analytical skills, and I want to apply these methods to a practical problem to deepen my understanding. I have chosen to fit a Generalized Linear Model (GLM) to a data set of home prices. My objective is to predict home sale prices and to produce visualizations that help explain and interpret those predictions.

Objective

This project’s main goal is to gain a solid grasp of linear regression analysis by using it to predict real estate values. To do this, the data set must be thoroughly examined and preprocessed, relevant features must be chosen, and statistical models must be fitted to provide accurate predictions. By connecting theoretical knowledge with real-world application, the project aims to help students understand the subtleties and complexity of practical data analysis. Students should expect to improve their analytical abilities, become proficient with statistical tools, and develop the capacity to interpret and present their findings properly. Furthermore, by asking students to evaluate the precision and reliability of their models, recognize the limitations of their analysis, and consider possible improvements, the project aims to foster critical thinking. In the end, the project gives students a chance to apply statistical ideas and techniques in a relevant setting, preparing them for further study or careers requiring data-driven decision-making.

Project Overview

The goal of the project “Using Linear Regression to Predict the Prices of Houses” is to develop a statistical model that predicts the sale prices of residential real estate. Linear regression, a fundamental technique in statistical modeling and machine learning, assumes a linear relationship between the independent variables (such as square footage, number of bedrooms, and other house features) and the dependent variable (the sale price of the house). The project applies this technique to a data set of 4,600 records and 18 columns. The data set covers a range of attributes thought to influence house prices, from lot size and location (city and ZIP code) to physical aspects such as condition and year built. The analysis starts with exploratory data analysis to identify the underlying distributions and correlations, followed by data preprocessing to check for missing values, outliers, and categorical variables. The project then moves on to selecting the predictors, fitting a linear regression model, and evaluating the model’s predictive power and accuracy. In the end, the project aims to provide information about the real estate market that can help investors, buyers, and sellers make decisions based on important property attributes. The project is also an example of how statistical theory and practical application can be combined, using R programming to manage, analyze, and model the data.

Loading Packages

# Install tidyverse first if it is not already available, then load the libraries
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading Dataset

# Load the dataset
data <- read_csv("C:\\Users\\A Sanjay\\OneDrive\\Desktop\\STA 635 Project\\archive (6)\\data.csv")
## Rows: 4600 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (4): street, city, statezip, country
## dbl  (13): price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterf...
## dttm  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# read_csv: Reads the CSV file containing the dataset into an R data frame.
# head: Displays the first few rows of the dataset to verify it was loaded correctly.

# Display the first few rows of the dataset
print(head(data))
## # A tibble: 6 × 18
##   date                  price bedrooms bathrooms sqft_living sqft_lot floors
##   <dttm>                <dbl>    <dbl>     <dbl>       <dbl>    <dbl>  <dbl>
## 1 2014-05-02 00:00:00  313000        3      1.5         1340     7912    1.5
## 2 2014-05-02 00:00:00 2384000        5      2.5         3650     9050    2  
## 3 2014-05-02 00:00:00  342000        3      2           1930    11947    1  
## 4 2014-05-02 00:00:00  420000        3      2.25        2000     8030    1  
## 5 2014-05-02 00:00:00  550000        4      2.5         1940    10500    1  
## 6 2014-05-02 00:00:00  490000        2      1            880     6380    1  
## # ℹ 11 more variables: waterfront <dbl>, view <dbl>, condition <dbl>,
## #   sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
## #   street <chr>, city <chr>, statezip <chr>, country <chr>

Summary Statistics

# Summary statistics
summary_stats <- summary(data)
print(summary_stats)
##       date                            price             bedrooms    
##  Min.   :2014-05-02 00:00:00.00   Min.   :       0   Min.   :0.000  
##  1st Qu.:2014-05-21 00:00:00.00   1st Qu.:  322875   1st Qu.:3.000  
##  Median :2014-06-09 00:00:00.00   Median :  460943   Median :3.000  
##  Mean   :2014-06-07 03:14:42.77   Mean   :  551963   Mean   :3.401  
##  3rd Qu.:2014-06-24 00:00:00.00   3rd Qu.:  654962   3rd Qu.:4.000  
##  Max.   :2014-07-10 00:00:00.00   Max.   :26590000   Max.   :9.000  
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.000   Min.   :  370   Min.   :    638   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1460   1st Qu.:   5001   1st Qu.:1.000  
##  Median :2.250   Median : 1980   Median :   7683   Median :1.500  
##  Mean   :2.161   Mean   : 2139   Mean   :  14852   Mean   :1.512  
##  3rd Qu.:2.500   3rd Qu.: 2620   3rd Qu.:  11001   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1074218   Max.   :3.500  
##    waterfront            view          condition       sqft_above  
##  Min.   :0.000000   Min.   :0.0000   Min.   :1.000   Min.   : 370  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:1190  
##  Median :0.000000   Median :0.0000   Median :3.000   Median :1590  
##  Mean   :0.007174   Mean   :0.2407   Mean   :3.452   Mean   :1827  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:2300  
##  Max.   :1.000000   Max.   :4.0000   Max.   :5.000   Max.   :9410  
##  sqft_basement       yr_built     yr_renovated       street         
##  Min.   :   0.0   Min.   :1900   Min.   :   0.0   Length:4600       
##  1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.0   Class :character  
##  Median :   0.0   Median :1976   Median :   0.0   Mode  :character  
##  Mean   : 312.1   Mean   :1971   Mean   : 808.6                     
##  3rd Qu.: 610.0   3rd Qu.:1997   3rd Qu.:1999.0                     
##  Max.   :4820.0   Max.   :2014   Max.   :2014.0                     
##      city             statezip           country         
##  Length:4600        Length:4600        Length:4600       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
## 
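# summary: Provides summary statistics (min, quartiles, median, mean, max) for each numeric column in the dataset.
# Summary Statistics:
# Min, 1st Quartile, Median, Mean, 3rd Quartile, Max: These provide an overview of the data distribution for each feature. (summary() does not report the standard deviation; sd() would be needed for that.)
# For example:
# Price: Mean = $551,963 and median = $460,943. The mean exceeds the median, which points to a right-skewed distribution. The minimum of 0 and the maximum of $26,590,000 are worth checking as possible data-quality issues or extreme outliers.
# Bedrooms: Mean = 3.4, with the first and third quartiles at 3 and 4, so at least half of the houses have 3 to 4 bedrooms.
# Bathrooms: Mean = 2.16, with the middle half of houses between 1.75 and 2.5 bathrooms.
# Sqft Living: Mean = 2139 sqft, indicating the average living area size.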
# Check for missing values
missing_values <- colSums(is.na(data))
print(missing_values)
##          date         price      bedrooms     bathrooms   sqft_living 
##             0             0             0             0             0 
##      sqft_lot        floors    waterfront          view     condition 
##             0             0             0             0             0 
##    sqft_above sqft_basement      yr_built  yr_renovated        street 
##             0             0             0             0             0 
##          city      statezip       country 
##             0             0             0
# colSums(is.na(data)): Checks for missing values by summing the NA values in each column.

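# Missing Values:
# No NA values: This is good because it means we don't need to impute or drop missing data before proceeding with the analysis. A value of 0 can still stand in for missing information (for example, the minimum price of 0 in the summary above), so zero-valued rows deserve a separate check.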
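
Because is.na() only catches explicit NA values, a quick count of zeros can reveal placeholder values such as the minimum price of 0 seen in the summary. The sketch below uses a small made-up data frame (`toy`, with hypothetical values) rather than the project data; the same colSums() pattern should carry over to `data`, assuming a zero price is not a legitimate sale.

```r
# Hypothetical toy data standing in for the housing data (values are made up)
toy <- data.frame(
  price        = c(313000, 0, 342000, 420000),
  bedrooms     = c(3, 5, 3, 0),
  yr_renovated = c(2005, 0, 0, 0)
)

# Count explicit missing values per column (mirrors colSums(is.na(data)))
na_counts <- colSums(is.na(toy))

# Count zeros per column; a zero price or zero bedrooms may be a placeholder
zero_counts <- colSums(toy == 0)

print(na_counts)
print(zero_counts)

# Keep only rows with a positive sale price before modeling
toy_clean <- toy[toy$price > 0, ]
print(nrow(toy_clean))
```

Whether to drop or investigate such rows is a judgment call that depends on how the data were recorded.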

Exploratory Data Analysis

# Exploratory Data Analysis (EDA)
# Histograms for key features
data %>%
  gather(key = "variable", value = "value", price, bedrooms, bathrooms, sqft_living, floors, yr_built) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~variable, scales = "free_x") +
  theme_minimal()

# gather: Converts wide-format data to long-format, which is useful for plotting multiple histograms in a single graph.
# ggplot: Initializes the plot.
# aes(x = value): Maps the value to the x-axis.
# geom_histogram: Creates histograms.
# facet_wrap: Creates separate plots for each variable.
# theme_minimal: Applies a minimal theme to the plot.

Histograms & Scatter Plots

# Distributions:
# Price Distribution: Right-skewed, indicating a few very expensive properties.
# Bedrooms and Bathrooms: Most properties have 3-4 bedrooms and 2-3 bathrooms.
# Sqft Living: Peaks around 1500-2500 sqft, indicating the common range for living space.
# Floors and Year Built: Most properties have 1-2 floors. Construction years are spread widely; per the summary statistics, the middle half of the houses were built between 1951 and 1997.

# Scatter plots to examine relationships with price
ggplot(data, aes(x = sqft_living, y = price)) +
  geom_point(color = "blue") +
  theme_minimal() +
  ggtitle("Price vs. Sqft Living")

ggplot(data, aes(x = bedrooms, y = price)) +
  geom_point(color = "blue") +
  theme_minimal() +
  ggtitle("Price vs. Bedrooms")

ggplot(data, aes(x = bathrooms, y = price)) +
  geom_point(color = "blue") +
  theme_minimal() +
  ggtitle("Price vs. Bathrooms")

# Scatter Plots: Visualize the relationships between price and other features (sqft_living, bedrooms, bathrooms).
# geom_point: Adds points to the scatter plot.
# ggtitle: Adds a title to each plot.
# Scatter Plots:
# 
# Price vs. Sqft Living: Positive correlation; larger houses tend to have higher prices.
# Price vs. Bedrooms/Bathrooms: Some positive correlation, but less pronounced than sqft living.

Correlation Matrix

# Correlation matrix
correlation_matrix <- data %>%
  select_if(is.numeric) %>%
  cor()

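# Correlation Matrix:
 
# Notable Correlations:
# Sqft Living and Price: Positive correlation, indicating that as the living area increases, the price tends to increase.
# Sqft Above and Price: Also a positive correlation.
# Other Features: Waterfront, view, and condition also show some correlation with price.
# Note: The regression model fitted later in this report reaches an R-squared of only 0.216, so no single predictor in it can correlate with price beyond about 0.46 in absolute value. These correlations are moderate at best rather than strong.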

# Plot the heatmap
correlation_matrix %>%
  as.data.frame() %>%
  rownames_to_column(var = "Var1") %>%
  gather(key = "Var2", value = "value", -Var1) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2))) +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1, 1), space = "Lab", name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1)) +
  coord_fixed()

# Correlation Matrix: Calculates and visualizes the correlation between numeric features.
# select_if(is.numeric): Selects only numeric columns.
# cor: Computes the correlation matrix.
# geom_tile: Creates a heatmap.
# geom_text: Adds correlation values as text.
# scale_fill_gradient2: Sets the color gradient for the heatmap.

Building Regression Model

# Building the linear regression model
model <- lm(price ~ bedrooms + bathrooms + sqft_living + floors + waterfront + view + condition + sqft_above + sqft_basement + yr_built + yr_renovated, data = data)
summary(model)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors + 
##     waterfront + view + condition + sqft_above + sqft_basement + 
##     yr_built + yr_renovated, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2105681  -128343   -16711    89700 26335249 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.609e+06  6.859e+05   6.720 2.04e-11 ***
## bedrooms      -5.618e+04  1.049e+04  -5.356 8.92e-08 ***
## bathrooms      5.919e+04  1.701e+04   3.479 0.000508 ***
## sqft_living    2.292e+02  2.169e+01  10.565  < 2e-16 ***
## floors         4.598e+04  1.863e+04   2.468 0.013606 *  
## waterfront     3.605e+05  9.386e+04   3.841 0.000124 ***
## view           4.480e+04  1.097e+04   4.082 4.54e-05 ***
## condition      3.125e+04  1.306e+04   2.393 0.016741 *  
## sqft_above     2.208e+01  2.149e+01   1.027 0.304248    
## sqft_basement         NA         NA      NA       NA    
## yr_built      -2.395e+03  3.419e+02  -7.006 2.82e-12 ***
## yr_renovated   6.772e+00  8.643e+00   0.784 0.433354    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 499800 on 4589 degrees of freedom
## Multiple R-squared:  0.216,  Adjusted R-squared:  0.2143 
## F-statistic: 126.4 on 10 and 4589 DF,  p-value: < 2.2e-16
# lm: Fits a linear regression model.
# price ~ ...: Specifies the response variable (price) and the predictor variables (bedrooms, bathrooms, etc.).
# summary: Provides detailed information about the fitted model, including coefficients, R-squared value, and p-values.

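# Linear Regression Model:
# 
# Coefficients: Provide the estimated change in price for a one-unit change in each feature, holding all other features constant.
# Positive Coefficients:
# Sqft Living: Each additional square foot is associated with about $229 more in price.
# Bathrooms, Floors, Waterfront, View, Condition: All positive; waterfront has the largest estimate (about $360,500).
# Negative Coefficients:
# Bedrooms: About -$56,180 per additional bedroom once living area and the other features are held constant.
# Year Built: About -$2,395 per year, so newer houses are estimated to be cheaper (and older houses more expensive), other features held constant.
# Not Significant: sqft_above (p = 0.30) and yr_renovated (p = 0.43).
# Not Estimated: sqft_basement is NA because of a singularity. It is most likely an exact linear combination of sqft_living and sqft_above (the means 1827 + 312.1 = 2139.1 match the sqft_living mean of 2139).
# Intercept: The baseline price when all features are zero, which has no practical meaning here (no house has yr_built = 0).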
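
The NA row for sqft_basement is what lm() reports when one predictor is an exact linear combination of the others. The following minimal sketch reproduces that behaviour on simulated data (all variable names and values are made up for illustration, and it only assumes that total living area equals above-ground plus basement area) and uses alias() to show the dependency.

```r
set.seed(1)
n <- 200

# Simulated areas: total living area is the exact sum of two parts
above    <- round(runif(n, 800, 3000))
basement <- round(runif(n, 0, 1000))
living   <- above + basement
price    <- 50000 + 200 * living + rnorm(n, sd = 20000)

# Including all three area variables makes the design matrix rank-deficient
fit_all <- lm(price ~ living + above + basement)
print(coef(fit_all))            # the last area term is reported as NA

# alias() lists which term is a linear combination of the others
print(alias(fit_all)$Complete)

# Dropping the redundant term gives a full-rank model
fit_ok <- lm(price ~ above + basement)
print(coef(fit_ok))
```

Dropping one of the three area variables from the formula would remove the NA row without changing the fitted values.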

# Model evaluation
# Predicting on the dataset
predictions <- predict(model, data)

Model Evaluation

# Model Evaluation:
 
# R-squared:
# Value: Measures the proportion of variance in the dependent variable (price) explained by the independent variables (features).
# Here: The R-squared of 0.216 means the model explains about 21.6% of the variability in house prices.

# Calculating R-squared and Mean Squared Error
r_squared <- summary(model)$r.squared
mse <- mean((data$price - predictions)^2)

# Mean Squared Error (MSE):
# Value: The average of the squares of the errors (difference between actual and predicted prices).
# Here: The MSE printed below (about 2.49e11) corresponds to a root mean squared error of about $499,000, which is close to the mean price of $551,963. It is computed on the same data used to fit the model, so it is an in-sample measure.

# Printing the evaluation metrics
print(paste("R-squared: ", r_squared))
## [1] "R-squared:  0.215998743639482"
print(paste("Mean Squared Error: ", mse))
## [1] "Mean Squared Error:  249187320761.844"
# predict: Uses the fitted model to predict price for the dataset.
# R-squared: Measures the proportion of variance in the dependent variable that is predictable from the independent variables.
# Mean Squared Error (MSE): Measures the average squared difference between observed and predicted values.
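
The MSE above is expressed in squared dollars and is computed on the same rows used to fit the model. A common complement is the root mean squared error (RMSE) on a held-out test set. The sketch below uses simulated data and an arbitrary 80/20 split; the data, names, and split ratio are illustrative assumptions, not part of the original analysis.

```r
set.seed(42)
n <- 500

# Simulated data (made up): price depends linearly on living area plus noise
sim <- data.frame(sqft_living = runif(n, 500, 4000))
sim$price <- 100000 + 200 * sim$sqft_living + rnorm(n, sd = 80000)

# Random 80/20 train/test split
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training rows only
fit <- lm(price ~ sqft_living, data = train)

# Helper: root mean squared error, in the same units as price
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_train <- rmse(train$price, predict(fit, newdata = train))
rmse_test  <- rmse(test$price,  predict(fit, newdata = test))
print(c(train = rmse_train, test = rmse_test))

# The MSE reported in this section converts to an RMSE in dollars
print(sqrt(249187320761.844))
```

Reporting the RMSE alongside the MSE makes the error easier to compare with actual house prices.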

Summary of Findings

# Summary of Findings
# Most Influential Features:
 
# Sqft Living: The predictor with the largest t-statistic (10.6) in the model.
# Bedrooms and Bathrooms: Bathrooms has a positive coefficient; bedrooms has a negative one once living area is held constant.
# Waterfront and View: Premium features that significantly affect price.
# Year Built: Negative coefficient, so newer homes are estimated to have lower prices, other features held constant.
# Model Performance:
 
# R-squared: 0.216, a weak fit; the model leaves about 78% of the variability in house prices unexplained.
# MSE: About 2.49e11 (RMSE of about $499,000), a large prediction error relative to typical prices.
# Practical Implications:
 
# For Home Buyers: Understanding which features most influence price can guide purchasing decisions.
# For Home Sellers: Highlighting features like living space, waterfront, and view can help justify higher prices.
# For Real Estate Agents: Insights can aid in pricing strategies and advising clients on property improvements.

# Summary of Packages and Functions Used
# tidyverse: A collection of packages for data science tasks.
 
# read_csv: Reads CSV files.
# ggplot2: Creates graphics and visualizations.
# dplyr: Provides functions for data manipulation (e.g., select_if, summarize).
# tidyr: Helps in tidying data (e.g., gather).
# lm: Fits linear models.
 
# predict: Makes predictions based on the fitted model.
 
# summary: Provides summary statistics for data and models.
# 
# cor: Computes correlation matrices.
 
# write_csv: Writes data to CSV files.
 
# Interpretation of the Results
# Summary Statistics and Missing Values: Confirm that the data is complete and provide an overview of the key features.
 
# EDA: Visualize data distributions and relationships between variables to identify patterns and insights.
 
# Correlation Matrix: Understand the relationships between numeric variables and identify potential multicollinearity.
 
# Linear Regression Model: Quantify the effect of each predictor on the response variable (price).
 
# Model Evaluation: Assess model performance using R-squared and MSE.
 
# Predictions: Save the predicted values for further analysis or reporting.

# Save the results
write_csv(data.frame(predictions), "predictions.csv")

Describe probability as a foundation of statistical modeling, including inference and maximum likelihood estimation

The mathematical basis of statistical modeling is probability theory, which allows us to quantify uncertainty, predict outcomes we have not yet observed, and draw conclusions about a population from sample data. Given a set of assumptions and observed data, probability provides the framework for determining how likely different outcomes are.

Statistical inference is the process of drawing conclusions about a population from sample data. Typically, it involves estimating population parameters (such as means and variances), testing hypotheses, and making predictions. Probability theory supports inference by letting us compute the probability of observing our sample data under various assumptions about the population. For instance, by assuming that the data follow a normal distribution, we can determine the probability that a sample mean falls within a certain range, which tells us how much confidence to place in the sample mean as an estimate of the population mean.

Maximum likelihood estimation (MLE) is a technique for estimating the parameters of a statistical model. The “likelihood” is the probability (or probability density) of the observed sample data, viewed as a function of the parameters. MLE selects the parameter values that maximize this likelihood, that is, the values under which the observed data are most probable. The method is applied to many kinds of statistical models, from simple linear regression to more intricate models. In linear regression with normally distributed errors, the maximum likelihood estimates of the coefficients coincide with the ordinary least squares estimates.
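
To make the idea of maximizing a likelihood concrete, the following sketch fits a straight line to simulated data in two ways: by numerically minimizing the Gaussian negative log-likelihood with optim(), and by ordinary least squares with lm(). The data, starting values, and parameterization (log of the error standard deviation) are assumptions chosen for illustration.

```r
set.seed(123)
n <- 300

# Simulated data (made up): y = 2 + 3x + Gaussian noise
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 1.5)

# Negative log-likelihood of a Gaussian linear model.
# par = (intercept, slope, log of the error standard deviation)
neg_log_lik <- function(par) {
  mu    <- par[1] + par[2] * x
  sigma <- exp(par[3])          # exp() keeps sigma positive
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

# Start from a flat line at the mean of y and the overall spread of y
start <- c(mean(y), 0, log(sd(y)))

# Numerically minimize the negative log-likelihood
mle <- optim(par = start, fn = neg_log_lik, method = "BFGS",
             control = list(maxit = 500, reltol = 1e-12))

# Ordinary least squares fit for comparison
ols <- lm(y ~ x)

print(mle$par[1:2])   # maximum likelihood estimates of intercept and slope
print(coef(ols))      # least squares estimates
```

The two sets of estimates should agree up to the optimizer's numerical tolerance, which is the sense in which least squares is a maximum likelihood method under normal errors.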

Determine and apply the appropriate generalized linear model for a specific data context

In this project, the selling price of homes (price) is modeled as a function of variables such as living area (sqft_living), condition (condition), and year built (yr_built). The model is a linear regression, which is the special case of a Generalized Linear Model (GLM) for continuous outcomes: a Gaussian family with an identity link function. The script fits it with R’s lm function; calling glm with the Gaussian family and the identity link yields the same coefficient estimates. The GLM framework is a versatile way to investigate linear relationships between a continuous target variable and a set of predictors. Predictors should be chosen using both theoretical reasoning and exploratory data analysis, and the model summary then provides insight into the factors associated with home prices.
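
A linear regression is the Gaussian, identity-link case of a GLM, so lm() and glm() should return the same coefficients for the same formula. The sketch below checks this on simulated data; the column names echo the project, but the values and the data-generating formula are made up.

```r
set.seed(7)
n <- 250

# Simulated housing-like data (made-up values, names chosen to echo the project)
sim <- data.frame(
  sqft_living = runif(n, 500, 4000),
  condition   = sample(1:5, n, replace = TRUE),
  yr_built    = sample(1900:2014, n, replace = TRUE)
)
sim$price <- 50000 + 220 * sim$sqft_living + 30000 * sim$condition -
  100 * (sim$yr_built - 1900) + rnorm(n, sd = 60000)

# Ordinary linear regression
fit_lm <- lm(price ~ sqft_living + condition + yr_built, data = sim)

# The same model expressed as a GLM: Gaussian family, identity link
fit_glm <- glm(price ~ sqft_living + condition + yr_built,
               data = sim, family = gaussian(link = "identity"))

# The two sets of coefficient estimates agree up to numerical precision
print(cbind(lm = coef(fit_lm), glm = coef(fit_glm)))
print(all.equal(coef(fit_lm), coef(fit_glm)))
```

Other outcome types would call for a different family, for example binomial for a yes/no outcome or poisson for counts.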

Conduct model selection for a set of candidate models

To pick the best statistical model for predicting the target variable, one must compare several candidate models and weigh factors such as model simplicity, predictive performance, and underlying assumptions. For predicting home prices, the candidate models may contain different subsets of predictors or different kinds of terms (polynomial terms, interaction terms). Model selection is a crucial and iterative part of the modeling process, because it lets you refine your approach in light of empirical evidence by weighing the candidates against objective comparison criteria. My R script applies the theoretical principles that I learned in class in a practical way. It focuses on developing and assessing a linear model, a special case of generalized linear models (GLMs), as a starting point for comparing candidate models of housing values.
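
One way to compare candidate models against objective criteria is to tabulate AIC, BIC, and adjusted R-squared, and to use an F-test for nested pairs. The sketch below does this on simulated data with four hypothetical candidates; the data and the candidate set are assumptions for illustration, not the models fitted in this project.

```r
set.seed(2024)
n <- 300

# Simulated data (made up): price depends on living area and waterfront only
sim <- data.frame(
  sqft_living = runif(n, 500, 4000),
  waterfront  = rbinom(n, 1, 0.1),
  noise_var   = rnorm(n)            # unrelated to price by construction
)
sim$price <- 80000 + 210 * sim$sqft_living + 300000 * sim$waterfront +
  rnorm(n, sd = 70000)

# Candidate models with different predictor subsets and an interaction term
candidates <- list(
  area_only   = lm(price ~ sqft_living, data = sim),
  area_water  = lm(price ~ sqft_living + waterfront, data = sim),
  with_noise  = lm(price ~ sqft_living + waterfront + noise_var, data = sim),
  interaction = lm(price ~ sqft_living * waterfront, data = sim)
)

# Compare candidates: lower AIC/BIC is better; adjusted R-squared for reference
comparison <- data.frame(
  AIC    = sapply(candidates, AIC),
  BIC    = sapply(candidates, BIC),
  adj_r2 = sapply(candidates, function(m) summary(m)$adj.r.squared)
)
print(comparison[order(comparison$AIC), ])

# Nested models can also be compared with an F-test
print(anova(candidates$area_only, candidates$area_water))
```

AIC and BIC penalize extra parameters, so a model that adds an unrelated predictor is not rewarded merely for a slightly higher R-squared.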

Communicate the results of statistical models to a general audience

summary(model)
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + floors + 
##     waterfront + view + condition + sqft_above + sqft_basement + 
##     yr_built + yr_renovated, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2105681  -128343   -16711    89700 26335249 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.609e+06  6.859e+05   6.720 2.04e-11 ***
## bedrooms      -5.618e+04  1.049e+04  -5.356 8.92e-08 ***
## bathrooms      5.919e+04  1.701e+04   3.479 0.000508 ***
## sqft_living    2.292e+02  2.169e+01  10.565  < 2e-16 ***
## floors         4.598e+04  1.863e+04   2.468 0.013606 *  
## waterfront     3.605e+05  9.386e+04   3.841 0.000124 ***
## view           4.480e+04  1.097e+04   4.082 4.54e-05 ***
## condition      3.125e+04  1.306e+04   2.393 0.016741 *  
## sqft_above     2.208e+01  2.149e+01   1.027 0.304248    
## sqft_basement         NA         NA      NA       NA    
## yr_built      -2.395e+03  3.419e+02  -7.006 2.82e-12 ***
## yr_renovated   6.772e+00  8.643e+00   0.784 0.433354    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 499800 on 4589 degrees of freedom
## Multiple R-squared:  0.216,  Adjusted R-squared:  0.2143 
## F-statistic: 126.4 on 10 and 4589 DF,  p-value: < 2.2e-16
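
For a general audience, the coefficient table can be restated as plain sentences. The sketch below copies four estimates from the summary output above into a small data frame and formats each one as a sentence; the helper name `describe` and the wording of the phrases are illustrative choices.

```r
# Estimates copied from the summary(model) output above
coefs <- data.frame(
  term     = c("sqft_living", "bathrooms", "waterfront", "bedrooms"),
  estimate = c(229.2, 59190, 360500, -56180),
  phrase   = c("Each additional square foot of living area",
               "Each additional bathroom",
               "A waterfront location",
               "Each additional bedroom"),
  stringsAsFactors = FALSE
)

# Turn each estimate into a sentence a non-specialist can read
describe <- function(estimate, phrase) {
  direction <- ifelse(estimate >= 0, "higher", "lower")
  amount <- format(round(abs(estimate)), big.mark = ",", trim = TRUE)
  paste0(phrase, " is associated with a price about $", amount, " ",
         direction, ", holding the other features constant.")
}

sentences <- describe(coefs$estimate, coefs$phrase)
writeLines(sentences)
```

Phrasing the results as associations rather than causes keeps the summary honest, since the model explains only about 22% of the variation in price.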

Use programming software (i.e., R) to fit and assess statistical models

print(paste("R-squared: ", r_squared))
## [1] "R-squared:  0.215998743639482"
print(paste("Mean Squared Error: ", mse))
## [1] "Mean Squared Error:  249187320761.844"

Reflection

I frequently participated in online class discussions and asked questions to advance both my own learning and that of my peers, which also made our classes more interactive. My enthusiasm extended beyond the classroom. Our small competitions were a lot of fun, and they let me apply what we had learned in a constructive and enjoyable way. These exercises improved my ability to solve problems and work with others. I also contributed to, and learned from, others on GitHub throughout the course. My guidance and my familiarity with the platform helped us all learn more collaboratively. In short, I gave my full attention to all aspects of the course, including homework, GitHub assistance, online discussions, and competitions. By demonstrating what it means to participate actively, I wanted to help everyone in our course group learn, not just myself.

Conclusion

The analysis provides an overview of the factors associated with house prices. The dataset, loaded and explored using various R packages, was examined through summary statistics and visualizations. The data contained no missing (NA) values, although the minimum price of 0 suggests that some records need checking. Exploratory data analysis showed that living-space square footage, the number of bathrooms, and premium features such as waterfront and view are positively related to price. The linear regression model quantified these relationships: living area had the largest t-statistic, waterfront had the largest coefficient, and, with the other features held constant, the coefficients for bedrooms and year built were negative. The model’s R-squared of 0.216 indicates a weak fit, explaining only about 22% of the variability in house prices, and the root mean squared error of about $499,000 is close to the mean price, so the predictions are imprecise. These findings can still point home buyers, sellers, and real estate agents toward the features most associated with price, but the model would need improvement before being used for pricing decisions. The results, including model predictions, were saved for further analysis.