         Purchasing_Price  Square_Feet    Bedrooms   Bathrooms
Min.                  1.0      100.000    1.000000    1.000000
1st Qu.          200000.0     1389.000    3.000000    2.000000
Median           331778.5     1864.000    3.000000    2.000000
Mean             516683.7     2205.478    3.450354    2.588604
3rd Qu.          525000.0     2588.000    4.000000    3.000000
Max.          169000000.0  1560780.000  190.000000  198.000000
BAIS 462 Final Project
The Housing Market
Introduction
For a while now, I have been interested in exploring the housing market and its trends. As someone who will be graduating soon and has friends and parents also navigating the housing market, I realize the challenges that come with finding a suitable place to live in a competitive market. The housing market is influenced by broad factors such as economic conditions, population demographics, and local policies, as well as house-specific variables like the number of bedrooms and bathrooms, square footage, and lot size. In recent years, the real estate market has experienced fluctuations driven by factors like job opportunities, interest rates, and urban development initiatives. Understanding the dynamics of this market is crucial for prospective home buyers, especially those looking to purchase for the first time. Throughout this analysis, I want to explore the key trends and characteristics of houses for sale and how they affect the purchasing price, looking specifically at square footage, location, land size, and various other factors.
The Data
The dataset I will be using was found on Kaggle here. This dataset contains a little over 1 million entries of real estate listings throughout the United States. The variables that I will be using in my analysis are as follows:
-Brokered by: The agency/broker for the listing (coded into it)
-Status: The status of the listing (For sale or Ready to build)
-Price: The price of the listing
-Bedrooms: The number of bedrooms the listing has
-Bathrooms: The number of bathrooms the listing has
-Acre_lot: The size of the property/land in acres
-Street: The street address of the listing
-City: The city the listing is in
-State: The state the listing is in (Also includes the Virgin Islands and Puerto Rico)
-Zip_Code: The zip code the listing is in
-Square_footage: The size of the house in square feet
Note: The original dataset included a variable called Prev_sold_date, which listed the date the property was previously sold. I have excluded that variable from my analysis.
If you are interested in looking at the dataset, you can access it here.
I removed every row containing an NA value from my dataset to allow for easier analysis. This brought the total number of housing listings down to 523,746, which is more than enough to get some interesting insights.
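As a rough illustration of that cleaning step, here is a minimal base R sketch. The small `listings` data frame and its values are made up for the example (the column names follow the regression output later in the report), and the original analysis may have used a different function to drop the rows.

```r
# Toy stand-in for the listings data; values are invented for illustration.
listings <- data.frame(
  price      = c(250000, NA, 410000, 325000),
  bed        = c(3, 2, NA, 4),
  bath       = c(2, 1, 2, 3),
  house_size = c(1500, 900, 2100, NA)
)

# Count missing values per column before dropping anything.
na_counts <- colSums(is.na(listings))

# Keep only the rows that have no missing value in any column.
complete_listings <- na.omit(listings)

print(na_counts)
print(nrow(complete_listings))
```

Checking the per-column counts first shows which variables drive the row loss, which matters when roughly half of the listings are removed.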
Summary Statistics
I first wanted to look at some basic summary statistics to get a feel for the data before digging deeper.
These statistics provide a good starting point for understanding the distribution of each variable in the dataset. The wide range in values, especially for price and square footage, suggests significant diversity in the properties. The typical property has 3 bedrooms (the median) and 2 to 3 bathrooms (a median of 2 and a mean of about 2.6). Some of the extremes, such as a $1 listing or a house with 190 bedrooms, look more like data-entry errors than real properties, which is why I filter out extreme values in several of the charts below.
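A table like this can be produced with `summary()`. The sketch below uses a tiny made-up data frame, and the cutoffs used to flag suspicious rows are hypothetical choices for illustration, not thresholds taken from this analysis.

```r
# Toy data that mimics the kind of extremes seen in the summary table.
listings <- data.frame(
  price = c(1, 200000, 331778, 525000, 169000000),
  bed   = c(1, 3, 3, 4, 190),
  bath  = c(1, 2, 2, 3, 198)
)

# Min, quartiles, median, mean, and max for every numeric column.
print(summary(listings))

# Flag rows with implausible values (cutoffs are illustrative assumptions).
suspicious <- subset(listings, price < 1000 | bed > 20 | bath > 20)
print(suspicious)
```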
What Is the Average Price of a House in Each State?
Analyzing the average price of houses in each state offers insights into regional affordability, economic disparities, and housing market trends. It can highlight differences in economic prosperity, and urbanization dynamics. Additionally, it can indicate which states may have higher economic activity or stronger economic prospects, as higher average prices often correlate with areas of greater economic opportunity and growth potential.
This graph provides an overview of the average house price in every state and territory. Notably, the Virgin Islands stand out with an average price exceeding $2,000,000, followed by the District of Columbia at just under $1,500,000. Interestingly, states like Idaho and Montana also show higher average home prices than almost any other state. This could be attributed in part to the substantial land area that comes with properties in these states, a factor that may inflate average prices compared to states with more urbanized or densely populated areas. Overall, the graph makes it easy to see how different states and territories compare.
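The averages behind a chart like this come from grouping the listings by state. Here is a minimal base R sketch on made-up data; the `state` and `price` column names are assumptions based on the variable list above.

```r
# Toy listings; states and prices are invented for illustration.
listings <- data.frame(
  state = c("Idaho", "Idaho", "Iowa", "Iowa", "Montana"),
  price = c(600000, 800000, 200000, 300000, 750000)
)

# Mean price per state.
avg_by_state <- aggregate(price ~ state, data = listings, FUN = mean)

# Sort from most to least expensive so the ranking is easy to read.
avg_by_state <- avg_by_state[order(-avg_by_state$price), ]
print(avg_by_state)
```

Because a mean is sensitive to a handful of very expensive listings, swapping `FUN = mean` for `FUN = median` is a quick way to check whether a state's ranking depends on a few outliers.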
More Square Feet, or More Land?
Square Feet
I wanted to see if houses with more square footage were priced higher than those with less. When creating this scatter plot, I chose to exclude any listing priced above $20 million, as well as any house with more than 25,000 square feet. Looking at price versus square footage can offer more insight into which factors affect the price.
Wow. This is a lot to look at. We notice that the majority of houses fall within the 0-10,000 square feet and $0-$10,000,000 price range. While the scatter plot doesn’t show a clear trend of increasing price with square footage, the upward slope of the line of best fit suggests a positive relationship between the two variables. To explore this further, I calculated the correlation coefficient between square footage and price, which came out to 0.54. This indicates a moderate positive linear relationship: as square footage increases, price tends to increase as well. It’s important to remember that correlation does not imply causation, and that square footage, while relevant, is not the sole determinant of a house’s price.
[1] 0.5410897
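The filtering and correlation steps described above can be sketched as follows. The data are made up, the column names `price` and `house_size` follow the regression output later in the report, and the cutoffs match the ones stated in the text.

```r
# Toy listings, including one oversized, overpriced outlier.
listings <- data.frame(
  house_size = c(1000, 1500, 2000, 2500, 3000, 30000),
  price      = c(150000, 240000, 290000, 410000, 450000, 25000000)
)

# Apply the same exclusions as the scatter plot.
plot_data <- subset(listings, price <= 20e6 & house_size <= 25000)

# Pearson correlation between size and price.
r <- cor(plot_data$house_size, plot_data$price)

# Line of best fit: the slope is the estimated price change per square foot.
fit <- lm(price ~ house_size, data = plot_data)

print(r)
print(coef(fit))
```

The same pattern applies to the acreage comparison by swapping `house_size` for `acre_lot` and changing the size cutoff.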
Acres of Land
I also wanted to see if houses with more land were priced higher than those with less. As with the square footage plot above, I chose to exclude any listing priced above $20 million, as well as any listing with more than 1,000 acres.
This graph is very similar to the square footage graph. We notice that the majority of houses fall below 250 acres of land, which aligns with expectations. While the scatter plot doesn’t show a clear trend of increasing price with acreage, the upward trend of the line of best fit suggests a potential positive relationship between the two variables. However, the correlation coefficient between acreage of land and price is very low, with a value of 0.007. This indicates a very weak linear relationship between the two variables, suggesting that other factors may have a stronger influence on house prices.
[1] 0.00725063
Are There Noticeable Differences in Average House Prices Based on Different Combinations of Bedrooms and Bathrooms?
I wanted to look at the average prices of houses based on the number of bedrooms and bathrooms because these factors are crucial determinants of a property’s value and appeal. By examining how prices vary across different bedroom and bathroom configurations, we can gain insight into the housing market’s dynamics, including trends in demand and buyer preferences. This analysis can inform decisions for both buyers and sellers, helping them understand the relative value of properties based on their specific attributes. I chose to filter out listings that had more than 10 bedrooms or more than 10 bathrooms, as well as listings priced above $15,000,000.
What stands out to me is that houses with 10 bathrooms consistently have the highest average price, except when they have only 2 or 3 bedrooms. The highest average price of all belongs to houses with 10 bathrooms and 6 bedrooms. Another interesting observation is that, with the number of bathrooms fixed at 5, the average price declines as the number of bedrooms goes from 3 to 10.
A regression with price as the dependent variable and bedrooms and bathrooms as the independent variables uncovers some interesting findings. Each additional bedroom is associated with a surprising decrease in house price of $67,840.80, holding the number of bathrooms constant. On the other hand, the model suggests that each additional bathroom is associated with a substantial increase in house price, estimated at $381,235.80, holding the number of bedrooms constant. While we would expect higher prices with more bathrooms, an effect of $381,235.80 per bathroom seems a little too high. I think it could be due to some of the houses in my dataset having extraordinary prices.
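The average price for each bedroom and bathroom combination can be computed by grouping on both variables at once. This is a minimal sketch on made-up data, using the `bed`, `bath`, and `price` column names from the regression output and the filter thresholds stated above.

```r
# Toy listings; the last row has too many bedrooms and should be filtered out.
listings <- data.frame(
  bed   = c(3, 3, 3, 4, 4, 12),
  bath  = c(2, 2, 3, 3, 3, 2),
  price = c(300000, 340000, 420000, 500000, 540000, 900000)
)

# Drop listings with more than 10 bedrooms, more than 10 bathrooms,
# or a price above $15 million.
filtered <- subset(listings, bed <= 10 & bath <= 10 & price <= 15e6)

# One row per (bed, bath) combination with its mean price.
avg_price <- aggregate(price ~ bed + bath, data = filtered, FUN = mean)
print(avg_price)
```

With real data it is worth also counting the listings in each combination, since an average based on only a few houses (for example, 10 bathrooms and 2 bedrooms) can swing widely.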
Call:
lm(formula = price ~ bed + bath, data = filtered_data)

Residuals:
     Min       1Q   Median       3Q      Max
-3364861  -267589   -86453   112201 15027042

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -209755.4     2532.0  -82.84   <2e-16 ***
bed          -67840.8      925.5  -73.30   <2e-16 ***
bath         381235.8      903.2  422.10   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 703500 on 714978 degrees of freedom
Multiple R-squared:  0.2491,  Adjusted R-squared:  0.2491
F-statistic: 1.186e+05 on 2 and 714978 DF,  p-value: < 2.2e-16
This analysis serves as a starting point for understanding how different variables influence house prices and the extent of their impact. With an R-squared of about 0.25, bedrooms and bathrooms alone explain only a quarter of the variation in price, so a regression that includes more of my variables would be very interesting to see.
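To make the coefficients concrete, the sketch below plugs the reported estimates into the fitted equation, price = intercept + (bed coefficient x bedrooms) + (bath coefficient x bathrooms). The helper name `predict_price` is hypothetical; the numbers are copied from the output above.

```r
# Coefficients copied from the regression output for price ~ bed + bath.
coefs <- c(intercept = -209755.4, bed = -67840.8, bath = 381235.8)

# Hypothetical helper: predicted price for a given bedroom/bathroom count.
predict_price <- function(bed, bath) {
  coefs[["intercept"]] + coefs[["bed"]] * bed + coefs[["bath"]] * bath
}

# A 3-bedroom, 2-bathroom house.
print(predict_price(3, 2))   # 349193.8

# Adding a bedroom while holding bathrooms fixed changes the prediction
# by exactly the bedroom coefficient.
print(predict_price(4, 2) - predict_price(3, 2))   # -67840.8
```

The predicted $349,193.80 for a 3-bedroom, 2-bathroom house sits close to the median price in the summary table, which suggests the model is reasonable for typical houses even though the individual coefficients look extreme.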
What Region Has the Highest Average House Price?
Exploring the region with the highest average house price provides valuable insights into economic conditions, housing affordability, and investment potential. Higher average prices often indicate strong local economies, attractive amenities, and affluent demographics. By identifying the region with the highest average price, we can better understand the distribution of wealth and resources across different areas.
We see in this graph that the Northeast has a higher average house price than any other region, by about $200,000. This aligns with the region’s reputation for having some of the wealthiest zip codes in the United States. The Northeast also has a concentration of prestigious urban centers, cultural landmarks, and high-end amenities, which collectively drive up property values. Additionally, factors such as proximity to major financial hubs and access to top-tier educational institutions contribute to the region’s appeal to affluent residents. As a result, the Northeast’s higher average house prices reflect the cost of upscale housing and the region’s status as a hub of wealth.
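One way to assign a region to each listing is through R's built-in `state.name` and `state.region` datasets. This is only a sketch on made-up data; the report does not say how regions were actually defined, so the mapping here is an assumption. Note that the built-in data label the Midwest as "North Central" and do not cover the District of Columbia, Puerto Rico, or the Virgin Islands.

```r
# Toy listings; prices are invented for illustration.
listings <- data.frame(
  state = c("New York", "Illinois", "Texas", "California", "Puerto Rico"),
  price = c(900000, 300000, 350000, 800000, 250000)
)

# Named lookup vector: state name -> Census region (from base R datasets).
region_lookup <- setNames(as.character(state.region), state.name)

# Territories are not in the lookup, so they come back as NA.
listings$region <- unname(region_lookup[listings$state])

# Rename "North Central" to the more familiar "Midwest".
listings$region[listings$region %in% "North Central"] <- "Midwest"

# Mean price per region; rows with an NA region are dropped by aggregate().
avg_by_region <- aggregate(price ~ region, data = listings, FUN = mean)
print(avg_by_region)
```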
Do Bigger Houses Have More Bedrooms and Bathrooms?
I wanted to explore whether there is a correlation between the size of houses and the number of bedrooms and bathrooms they contain. I aim to investigate whether larger houses tend to have more bedrooms and bathrooms. By examining this relationship, we can gain valuable insights into the housing market and understand the preferences of homeowners regarding space and functionality.
The graph confirms my initial assumption to a certain extent. It illustrates a general trend where the average square footage tends to increase with the number of bedrooms and bathrooms. However, an intriguing pattern emerges for houses with 13 bathrooms: beyond 7 bedrooms, the average square footage begins to decline with each additional bedroom. Similarly, for houses with 11 bathrooms, we notice a decrease in average square footage from 6 to 10 bedrooms, followed by a significant spike at 11 bedrooms. These fluctuations challenge the conventional notion that more bedrooms or bathrooms always mean a larger house, and they call for further exploration into what is driving these patterns.
A regression with house_size as the dependent variable and bedrooms and bathrooms as the independent variables uncovers some interesting findings. Each additional bedroom is associated with an increase in house size of about 294 square feet, holding the number of bathrooms constant. Each additional bathroom is associated with an even larger increase of about 625.4 square feet, holding the number of bedrooms constant. As in the regression on price, bathrooms have the larger effect on the dependent variable. I think this may be because bathrooms with showers and tubs, as well as guest bathrooms, take up substantial space. However, there may be a better explanation.
Call:
lm(formula = house_size ~ bed + bath, data = filtered_data)

Residuals:
    Min      1Q  Median      3Q     Max
  -7640    -382     -70     249 1558514

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -454.435     11.789  -38.55   <2e-16 ***
bed          293.995      4.319   68.08   <2e-16 ***
bath         625.409      4.124  151.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3058 on 652940 degrees of freedom
  (62038 observations deleted due to missingness)
Multiple R-squared:  0.09137,  Adjusted R-squared:  0.09136
F-statistic: 3.283e+04 on 2 and 652940 DF,  p-value: < 2.2e-16
Full Regression
I am interested in running a regression with bedrooms, bathrooms, acre_lot, and house_size to see how these variables together affect price. I would have liked to include state as well, but with so many categories I wasn’t completely sure how to interpret the resulting coefficients.
Call:
lm(formula = price ~ bed + bath + acre_lot + house_size, data = filtered_data)

Residuals:
      Min        1Q    Median        3Q       Max
-18422346   -222711    -60060    114108  14988102

Coefficients:
                Estimate Std. Error  t value             Pr(>|t|)
(Intercept) -304697.0782  2791.4814 -109.152 < 0.0000000000000002 ***
bed          -21833.4703   991.1639  -22.028 < 0.0000000000000002 ***
bath         329789.1990   889.9640  370.565 < 0.0000000000000002 ***
acre_lot          4.9428     0.9907    4.989          0.000000607 ***
house_size       11.8317     0.2438   48.529 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 599100 on 524771 degrees of freedom
  (190205 observations deleted due to missingness)
Multiple R-squared:  0.3035,  Adjusted R-squared:  0.3035
F-statistic: 5.717e+04 on 4 and 524771 DF,  p-value: < 0.00000000000000022
The regression model shows us a lot. The coefficient for bedrooms is -21,833.47, indicating that each additional bedroom is associated with a decrease in the predicted price of $21,833.47, holding all other variables constant. As stated earlier, this doesn’t necessarily make sense, as we would generally expect more bedrooms to increase the price of a property. The coefficient for bathrooms is 329,789.20, suggesting that each additional bathroom is associated with an increase in the predicted price of $329,789.20, holding all other variables constant. This does align with the expectation that additional bathrooms add value to a property, but why is it so much? The coefficient for acre_lot is 4.9428, meaning that each additional acre of lot increases the predicted price by about $4.94, holding all other variables constant. This seems very low. I would have assumed that lot size would play a larger role in the price of a house. The coefficient for house_size is 11.8317, indicating that each additional square foot increases the predicted price by about $11.83, holding all other variables constant. Once again, this seems very low, as I would have assumed that house size would play a larger role in the price of a house.
I definitely think there is more work to do with the regressions before they can accurately predict the price of a house and show how each variable affects the price.
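On the question of how to read a regression that includes state: wrapping the column in `factor()` makes R fit one coefficient per state, each measured relative to a baseline state (the first one alphabetically by default). The sketch below uses made-up data built so the answer is known in advance; it is an illustration of the technique, not a result from the housing dataset.

```r
# Toy data constructed as: price = 50000 + 100 * house_size + state premium,
# with premiums of 0 (Iowa), 20000 (Ohio), and 100000 (Utah).
listings <- data.frame(
  state      = rep(c("Iowa", "Ohio", "Utah"), each = 3),
  house_size = rep(c(1000, 1500, 2000), times = 3),
  price      = c(150000, 200000, 250000,
                 170000, 220000, 270000,
                 250000, 300000, 350000)
)

# factor(state) adds one indicator variable per non-baseline state.
fit <- lm(price ~ house_size + factor(state), data = listings)
print(round(coef(fit), 2))
```

Here Iowa is the baseline, so the coefficient labeled `factor(state)Utah` reads as "a Utah house is predicted to cost this much more than an Iowa house of the same size". With more than 50 states and territories the output gets long, but each state coefficient is interpreted the same way.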
Sentiment Analysis
Something that has always interested me is the differences between major cities. Comparing cities can offer valuable insights into why people prefer one place over another, or what draws them to a new location. For this analysis, I have taken reviews of Chicago and New York from their respective Yelp pages. I want to explore what people say about these two cities, whether there is a trend in the timing of positive and negative reviews, and whether one city has more positive reviews than the other.
The Data
As mentioned in the introduction, I have scraped these reviews from Chicago’s Yelp page (https://www.yelp.com/biz/city-of-chicago-chicago-6) and New York’s Yelp page (https://www.yelp.com/biz/city-of-new-york-new-york-49). All of the reviews were scraped on April 24th, 2024. The variables I collected were:
-Name: The name of the reviewer
-Date: The date the user posted the review
-Content: The text the user wrote in their review about their respective city
-Page_id: The URL to identify what page the review came from
I also created a new “City” column containing either Chicago or New York to easily identify which city each review is about.
Data: City_reviews
Analysis
1. What Are the Most Frequent Emotions Associated with Each City?
To answer this question, I used the NRC lexicon. This lexicon allowed me to tag the most commonly used emotive words and then group them by city to visualize how the two cities compare.
This graph provides a clear visual of the emotional sentiments expressed about both cities. We see that Chicago has a very high count of positive words, at just under 2,000, while New York is a little lower at roughly 1,500. The next most common category is negative. This makes sense, since not everyone is going to love these cities, so there will always be plenty of negative reviews. The next most common emotions are joy, trust, and anticipation. Joy makes a lot of sense to me, since these are major cities with a lot to do that will bring joy to people. Anticipation also makes sense: in a city you never know what’s going to happen or what you are going to do, so looking forward to things can explain why this score is so high. Trust is the interesting one to me. Is it trust in the city, or something else?
Overall, reviews of both cities express more positive emotions than negative ones, with Chicago having higher counts for joy and positivity than New York. The graph shows us that while both cities share a generally positive emotional profile, Chicago’s is slightly more positive than New York’s.
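The counting step works by splitting each review into words, matching those words against the lexicon, and tallying the matches by city and emotion. The base R sketch below uses two made-up reviews and a tiny hand-written lexicon as a stand-in for NRC (which, like this toy version, can assign more than one emotion to the same word). The original analysis may have used different tooling for the tokenizing and joining.

```r
# Two invented reviews standing in for the scraped Yelp data.
reviews <- data.frame(
  city    = c("Chicago", "New York"),
  content = c("I love the lake and trust the food",
              "Crowded and dirty but I love the energy")
)

# Tiny stand-in lexicon; NOT the real NRC word list.
toy_lexicon <- data.frame(
  word      = c("love", "love", "trust", "dirty", "dirty", "crowded"),
  sentiment = c("positive", "joy", "trust", "negative", "disgust", "negative")
)

# Tidy format: one row per word, keeping track of the city.
tokens_list <- strsplit(tolower(reviews$content), "[^a-z']+")
tokens <- data.frame(
  city = rep(reviews$city, lengths(tokens_list)),
  word = unlist(tokens_list)
)
tokens <- tokens[tokens$word != "", ]

# Keep only words found in the lexicon, then count by city and emotion.
matched <- merge(tokens, toy_lexicon, by = "word")
emotion_counts <- table(matched$city, matched$sentiment)
print(emotion_counts)
```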
2. What City Has a Higher Positivity Score?
To answer this question, I calculated the total sentiment scores for each city after putting my reviews into tidy format. By taking each word from each review and tagging it with the NRC lexicon, I can calculate the positivity score for each city as the total number of positive words minus the total number of negative words.
This graph shows us that both Chicago and New York have very high positivity scores. This makes sense, as both are major cities in the United States. We can see that Chicago has a higher positivity score by just 8 points (789 versus 781). This offers good insight into how many positive and negative things people said in their reviews of each city.
# A tibble: 2 × 4
  city     positive_score negative_score positivity_score
  <chr>             <dbl>          <dbl>            <dbl>
1 Chicago            1725            936              789
2 New York           1480            699              781
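The positivity score in this table is simply the positive count minus the negative count. A short sketch that reproduces the last column from the first two, using the counts reported above:

```r
# Positive and negative word counts copied from the table above.
scores <- data.frame(
  city           = c("Chicago", "New York"),
  positive_score = c(1725, 1480),
  negative_score = c(936, 699)
)

# Positivity score = positive words minus negative words.
scores$positivity_score <- scores$positive_score - scores$negative_score
print(scores)

# Gap between the two cities.
gap <- diff(rev(scores$positivity_score))
print(gap)   # 8
```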
3. How Does the Positivity Score Change by Month for Each City?
I did this by using the Bing lexicon. After getting my reviews into tidy format, I joined the Bing lexicon on the word column. I then grouped by city, month, and sentiment to produce a bar chart that makes it easy to see the differences in scores by month for each city.
This graph compares the positivity scores for Chicago and New York on a monthly basis. In Chicago, the positivity score peaks in June and dips to its lowest in February. In New York, on the other hand, it reaches its highest in October and its lowest in April. Chicago’s highest months (June, September, and October) make a lot of sense to me: the weather is very nice and football is starting up, so it is no surprise that the positivity scores are high. February in Chicago actually sucks… it is freezing cold and windy, so that score does not surprise me at all. For New York, I don’t quite understand why October has the highest positivity score. Maybe it is because the weather is neither blazing hot and humid nor freezing cold yet. What did make sense to me for New York is that December and January had positive scores. There is so much to do there for Christmas and New Year’s, and people will be writing reviews about the time they had.
Overall, this graph provides valuable insight into the emotional climate of each city and how it changes over the year. It also allows for an easy comparison of the cities’ trends across different months, and it gives us a little insight into how the time of year can affect the positivity score associated with a city.
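The monthly scores can be sketched by scoring each matched word as +1 (positive) or -1 (negative) and summing by city and month. The reviews and the small lexicon below are made up and only stand in for the scraped data and the real Bing word list; the original analysis may have been written differently.

```r
# Invented reviews with dates, standing in for the scraped Yelp data.
reviews <- data.frame(
  city    = c("Chicago", "Chicago", "New York", "New York"),
  date    = as.Date(c("2023-06-10", "2023-02-03", "2023-10-21", "2023-04-15")),
  content = c("beautiful summer amazing lakefront",
              "freezing miserable wind",
              "great fall walks",
              "dirty crowded subway")
)

# Tiny stand-in for a positive/negative lexicon; NOT the real Bing list.
toy_bing <- data.frame(
  word      = c("beautiful", "amazing", "great",
                "freezing", "miserable", "dirty", "crowded"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative", "negative")
)

# One row per word, carrying along the city and the two-digit month.
tokens_list <- strsplit(tolower(reviews$content), "[^a-z']+")
tokens <- data.frame(
  city  = rep(reviews$city, lengths(tokens_list)),
  month = rep(format(reviews$date, "%m"), lengths(tokens_list)),
  word  = unlist(tokens_list)
)

# Join on word, score each match, and sum by city and month.
matched <- merge(tokens, toy_bing, by = "word")
matched$value <- ifelse(matched$sentiment == "positive", 1, -1)
monthly <- aggregate(value ~ city + month, data = matched, FUN = sum)
print(monthly)
```

Because months with more reviews naturally collect more matched words, dividing each monthly score by the number of reviews in that month would make the months easier to compare.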
Conclusion
In conclusion, the sentiment analysis shows Chicago to be a slightly more positive environment than New York, and the housing data indicates that the Midwest has the lowest average home price of any region. Together, these results offer valuable insights for new graduates and first-time home buyers. For individuals starting their housing journey, particularly those seeking affordability and a positive living experience, the Midwest, highlighted by cities like Chicago, emerges as an attractive option.
The sentiment analysis points to the welcoming and optimistic atmosphere of cities like Chicago, an environment conducive to personal and professional growth. The real estate data, in turn, shows how affordable the Midwest is, as well as which factors to consider when looking for a house on a budget.
In essence, for new graduates or first-time home buyers looking for an affordable yet positive living experience, the Midwest offers a promising destination. With Chicago as a beacon of opportunity, individuals can find a vibrant urban lifestyle, a strong sense of community, and housing options that meet their budget constraints and also their lifestyle preferences. By utilizing these insights, individuals can make better informed decisions.