Getting the right answer is only a small part of the grade
Good quality interpretation of your results is the name of the game
If you see something that looks unusual in your data (outlier, some unusual distribution type) - investigate it!
When explaining your results, say something interesting about them. Did it match your expectations? Why or why not?
Brief explanations that simply repeat what I can visually see myself will not receive a good score
On the other hand, filling the homework with pages of not very interesting description is not valuable either. The goal isn’t to write the most words, but find the most interesting things in the data.
You do not need to be an expert in art for a good score, but I will expect you to look up basic information, such as “what does art typically sell for?” and “what is a standard size for paintings?”, to help you understand and set expectations for your data.
The information requested in the question prompts is only a starting point. If you find other interesting information along the way, please report it. You don’t need to look at the data forever, but if there is obviously something else interesting in the data, you should report it.
Technical
Make sure your graphs are produced using ggplot(), are well labeled, and are easy to read.
Make sure your tables (including regression tables) are produced with the kable() function from the knitr package, are well labeled, and are easy to read. You can make your tables prettier with the kableExtra package.
Make sure you do not have anything rendered in your HTML file besides your results and, when asked for by a question, your code. That means no warnings, messages, or other output should appear in your final rendered HTML file.
Convert your HTML file to PDF using the Microsoft Print to PDF option in the Print menu (PC) or the PDF button option from the Print menu (Mac)
Make sure to accurately mark each page a question answer appears on when submitting on GradeScope.
Delete the Scoring Guide section of the instructions before final rendering and submission.
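One possible way to meet the "no warnings, messages, or other output" requirement above is to set chunk options once in a setup chunk. This is a minimal sketch, assuming the report is knitted with the knitr package; the specific defaults chosen here (for example hiding code everywhere) are an assumption and should be overridden per chunk when a question asks for code.

```r
# Typically placed in a setup chunk near the top of the .Rmd file.
# The guard only makes the sketch safe to run where knitr is not installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo    = FALSE,  # hide code unless a question asks for it
    warning = FALSE,  # keep warnings out of the rendered HTML
    message = FALSE   # keep package start-up and other messages out
  )
}
```

A chunk that must show its code can still set `echo = TRUE` in its own header.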
Shanghai used house price analysis
Introduction
Question 1: Describing your data (10 points)
1a. Where is this data from?
For this dataset, describe the data according to the five Ws & how defined in the textbook Chapter 1.2. What are some possible problems with the who and what of the dataset?
The dataset you are using for this assignment is a subset of the original dataset that can be found here.
Who: This dataset contains information on second-hand properties listed on the Anjuke website, representing the houses available on the platform. Collected by b2eeze, the data is useful for individuals looking to purchase a home and understand how their preferences impact property prices.
What: The dataset from Anjuke examines how the price of a second-hand house is influenced by various factors, including ID, Title, Title Link, Number of Rooms, Number of Living/Dining Rooms, Number of Bathrooms, Total Area, Year Built, and other details describing the listed properties.
Where: The data was collected from the Anjuke website, an online real estate platform in China, by scraping second-hand housing listings for the Shanghai area.
When: The listings cover apartments constructed between 1983 and 2023. The data itself was gathered in 2024 and reflects house prices from that year.
Why: The data was collected to analyze the second-hand housing market, including trends in pricing and housing features. The website mentions that the data was created to establish a regression prediction model for analysis and aims to offer users a safe and convenient house-hunting experience.
How: The dataset was obtained from shanghai.anjuke.com, where data on various apartments was collected and analyzed to assess how different factors influence pricing. Since the information was scraped from the Anjuke website, it was sourced directly from online listings.
Possible Problems with the Who and What
Who: The dataset does not cover the entire housing market. Houses that are not listed on the Anjuke website, such as those listed only on other real estate platforms, are excluded, which could lead to selection bias.
What: The year built is usually a whole number, but some records contain decimals, which does not fit the format of a year. Also, the variable that records whether a property is close to the subway is only true or false, so it is unclear what distance counts as close or far.
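The decimal-year problem described above can be checked directly. The sketch below uses a made-up vector as a stand-in for the 建造年份 (Year Built) column; the values are hypothetical and only illustrate the check.

```r
# Toy stand-in for the 建造年份 (Year Built) column; values are invented.
year_built <- c(1998, 2005, 2010.5, 2015, 2003.3, NA)

# A year is suspicious if it is present but not a whole number.
has_decimal <- !is.na(year_built) & year_built != floor(year_built)

n_decimal <- sum(has_decimal)   # how many records have a fractional year
year_built[has_decimal]         # the offending values themselves
```

Listing the offending values, rather than only counting them, makes it easier to decide whether they should be rounded, dropped, or investigated.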
1b. What are the variable types?
For the following variables, please make a table.
One column should be the variable name, the second should be the variable type as defined in the textbook Chapter 1.3, and the third the units of the variable (if applicable). Note that you can find the units in the sh.house.raw.sample.hw2.csv file, though that dataset has different cases than the one you are given for your homework.
id
标题链接
居室数
总面积
居民楼总层数
小区均价
物业费用
建造年份
楼层分布
区
近地铁
价格
Variable Name | Variable Type | Units
--- | --- | ---
id (ID) | Identifier | None
标题链接 (Title Link) | Identifier | None (Link/URL)
居室数 (Number of Rooms) | Quantitative | Rooms
总面积 (Total Area) | Quantitative | Square meters (㎡)
居民楼总层数 (Total Floors in Residential Building) | Quantitative | Floors
小区均价 (Average Price in the Community) | Quantitative | CNY per square meter
物业费用 (Property Management Fee) | Quantitative | CNY per square meter per month
建造年份 (Year Built) | Quantitative | Year
楼层分布 (Floor Distribution) | Categorical | None (Unknown, Entire, Low, Middle, High)
区 (District) | Categorical | None
近地铁 (Close to Subway) | Categorical | None (Yes or No)
价格 (Price) | Quantitative | CNY
Question 2: Association (20 points)
2a. Investigating average community price vs. green area: association
Using the Think-Show-Tell framework from the textbook (example on page 213), please examine the relationship in association terms between average price of a community and the percent of green area a community has. How strongly are they associated?
Note: for this question and all other Think sections in the homework, you do not need to report the W’s of the data (you have already completed this in Q1)
Think
I want to examine the association between the average price and the green area to assess how strongly these two variables are related. Communities with more green area might have a higher average community price because of better living conditions and quality of life. The explanatory variable (x-axis) is the green area and the response variable (y-axis) is the average price of a community. A larger green area could correspond to a higher average price, so a positive association is expected.
Show
Tell
This graph shows the relationship between green rate (percentage) and average community price (CNY per square meter). The red dashed trendline suggests a slight positive relationship, meaning that communities with higher green rates tend to have higher average prices. Most of the data points fall between 30% and 40% green rate, with average community prices ranging from under 50,000 CNY to over 200,000 CNY per square meter. There are also some outliers, particularly communities with extremely high prices (above 150,000 CNY/m²) regardless of their green rate. These might be luxury developments, where other factors drive up prices. The trendline’s upward slope of 831.05 CNY/m² per 1% increase in green rate means that, for every 1 percentage point increase in green rate, the average community price is about 831.05 CNY per square meter higher. Given that community prices range from below 50,000 CNY to over 200,000 CNY per square meter, this increase is quite small in comparison. Also, the R² value is only 0.067, so green rate explains just 6.7% of the variation in price. The small slope of the trendline and the low R² value suggest that green rate has a weak positive association with community prices and is not a key factor. To better understand what affects community prices, it would be helpful to include other factors like location, district, and subway access.
2b. Investigating average community price vs. year built
Using the Think-Show-Tell framework from the textbook, please examine the relationship in association terms between the average price for community and the year built. How strongly are they associated?
Think
I want to know the relationship between the year a community was built and its average price, and to explore potential trends and outliers, in order to understand how property age influences the market. I want to understand whether average prices increase, decrease, or remain constant as the year built changes. Communities built in more recent years are predicted to have higher prices due to new and modern facilities. However, it is also possible that old communities have high prices due to historical value. Some communities may have unusually high or low prices that could skew the analysis. The explanatory variable (x-axis) is the year built, and the response variable (y-axis) is the average community price.
Show
Tell
This graph shows the relationship between the year a community was built and its average community price per square meter in CNY. The red trendline shows a slight downward slope, suggesting that newer communities tend to have slightly lower prices compared to older ones. Most of the data falls between 1995 and 2020, with prices ranging from under 25,000 CNY to over 150,000 CNY per square meter. The trendline indicates that for each additional year, the average price decreases by about 396.97 CNY per square meter. Given that prices range from below 25,000 CNY to over 150,000 CNY per square meter, a yearly drop of 396.97 CNY is relatively small. The R² value is just 0.014, meaning the year built explains only 1.4% of the variation in prices. While the trendline suggests a pattern, the low R² value shows that the influence of the construction year on price is weak. Older communities might have higher prices due to factors like location or historical value, but the construction year alone isn’t a strong predictor of price. Some communities with exceptionally high prices, above 150,000 CNY per square meter, exist regardless of when they were built, likely representing luxury developments. To get a clearer picture of what truly drives prices, additional variables would be needed.
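Since the questions ask how strongly the variables are associated, it can help to convert the reported R² values from 2a and 2b into correlation coefficients. In a simple linear regression, r has the sign of the slope and r² equals R². The sketch below only re-uses the slopes and R² values reported above; the variable names are illustrative.

```r
# Reported values from the two simple regressions (2a and 2b).
slope_green <- 831.05;  r2_green <- 0.067
slope_year  <- -396.97; r2_year  <- 0.014

# In simple linear regression, r takes the sign of the slope and r^2 = R^2.
r_green <- sign(slope_green) * sqrt(r2_green)
r_year  <- sign(slope_year)  * sqrt(r2_year)

round(c(green = r_green, year = r_year), 2)  # about 0.26 and -0.12
```

Both correlations are far from ±1, which is consistent with describing the two associations as weak.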
2c. Thinking about your results
Consider the results of 2a. and 2b. together. What can we understand about average community price in Shanghai? Why do you think you observe these results?
Apartments with more green space tend to be a bit more expensive, which suggests that buyers in Shanghai appreciate having greener surroundings. This makes sense, given the city’s dense urban environment and the growing demand for more visually appealing and eco-friendly living spaces. That said, while green spaces do add some value, they don’t play the biggest role in determining apartment prices, as shown by the low R² value. One surprising trend is that older apartments, on average, are priced slightly higher than newer ones. This could be because many of Shanghai’s older residential communities are in prime locations where demand remains high. These areas often have well-developed infrastructure, strong business districts, and convenient public transport, all of which help keep property values up. Some older apartments may also have historical or cultural significance, making them even more desirable. Meanwhile, newer developments are more likely to be built in suburban areas, simply because there’s less land available in the city center. Shanghai is known for its high-end real estate market, so it’s possible that buyers who care about quality of life are willing to pay extra for homes with more green space. However, in a fast-paced city, things like location and accessibility often matter more. Being close to business hubs and subway lines probably has a much bigger impact on prices than green spaces alone. At the end of the day, while green space and the age of a building do play a role in pricing, factors like location, amenities, and transportation access are the real drivers of Shanghai’s housing market.
Question 3: Simple regression (20 points)
3a. Investigating price vs. management fee
Using the Think-Show-Tell framework from the textbook, please examine how the management fee of an apartment is related to the price of the apartment.
Think
I want to understand whether the management fee of an apartment is related to its price. A possible theory is that apartments with higher management fees have higher prices, as higher fees often indicate better services. There might be outliers, such as luxury apartments with exceptionally high prices, that skew the results. Understanding this relationship helps identify whether management fees can serve as a reliable indicator of apartment value, which could be helpful for buyers and investors.
Show
Tell
This graph looks at the relationship between management fees (CNY per square meter per month) and property prices (CNY). The red trendline shows a slight positive relationship, meaning properties with higher management fees tend to have slightly higher prices. However, the R² value of 0.054 suggests that management fees only explain about 5.4% of the variation in prices, so they aren’t a major factor in determining property values. The trendline indicates that for every unit increase in log(Management Fee + 1), the predicted property price goes up by about 867.95 CNY. This could suggest that higher management fees come with better amenities or services, making the properties more desirable. However, the weak correlation means other factors, like location, district, and access to public transportation, likely have a much bigger impact on property prices. Most properties in the dataset have relatively low management fees, as seen in the way data points are clustered toward the lower end of the x-axis. There is one outlier, a particularly high-priced property, which could be a luxury development where higher management fees reflect premium services and facilities. This outlier will be excluded in the following Think again section by setting appropriate limits for both the x and y axes.
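A slope on log(Management Fee + 1) is easier to read when translated back to the fee scale. The sketch below assumes the natural logarithm was used (the source does not say which base); under that assumption, doubling (fee + 1) raises the predictor by log(2), so the predicted price changes by the slope times log(2).

```r
slope <- 867.95  # reported slope on log(Management Fee + 1)

# Assumption: natural log. Doubling (fee + 1) adds log(2) to the predictor,
# so the predicted price changes by slope * log(2).
change_per_doubling <- slope * log(2)

# Example: a fee going from 1 to 3 doubles (fee + 1) from 2 to 4.
step <- log1p(3) - log1p(1)   # equals log(2)

round(change_per_doubling, 1)  # about 601.6
```

Because of the log scale, the same absolute fee increase matters less at high fees than at low fees.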
3b. Checking model fit
Make use of all the tools described in the textbook to assess model fit in the Think again section - if it is necessary to revise your model, do it in the Think again section. Then state any updated conclusions in the Revising conclusions section.
Think again
I think it would be best to remove the outliers so the graph highlights the main cluster of data points without being skewed by extreme values. This will make it easier to see any meaningful trends. Also, it’s a good idea to check the residuals to evaluate how well the model fits the data and whether any patterns stand out. Additionally, calculating the correlation coefficient will give a clearer picture of the strength and direction of the relationship between the variables.
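The checks described above (residuals and the correlation coefficient) could be sketched as follows. The data frame `toy` and its values are invented purely for illustration and are not taken from the homework dataset.

```r
# Hypothetical toy data: log-fee and price (values invented for illustration).
toy <- data.frame(
  log_fee = c(0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0, 2.3),
  price   = c(210, 260, 240, 330, 300, 420, 380, 520)
)

fit <- lm(price ~ log_fee, data = toy)

res <- resid(fit)                    # observed price minus fitted price
r   <- cor(toy$log_fee, toy$price)   # strength and direction of association

# Residuals paired with fitted values; a curve or fan shape here would
# suggest the straight-line model is a poor fit.
check <- data.frame(fitted = fitted(fit), residual = res)
check
```

In a simple regression the squared correlation equals the model's R², so the two numbers should always agree.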
Revising conclusions
The excluded outlier is one of the most expensive listings in the dataset, featuring five rooms, a prime location near the subway, and a much larger area (1,167 m²) than most other properties. Located in 华侨城苏河湾(别墅), a luxury villa community in 静安/Jing’an District, it is likely a high-end development that justifies its premium price. With a property price of 25,600 CNY, it stands out as an extreme outlier, especially since the majority of properties in the dataset are considerably smaller and priced much lower. With the outlier excluded, the scatter plot indicates a positive correlation between management fees and property prices, with a slope of 764.68 CNY per unit increase in log(Management Fee + 1), suggesting that properties with higher management fees tend to have higher prices. However, the R² value of 0.262 shows that management fees account for only 26.2% of the variation in prices. The color gradient reveals that lower-priced properties are concentrated at the bottom, while higher-priced ones are more spread out, with a few notable outliers. Moreover, the increasing spread of residuals at higher price levels indicates that applying a log transformation to price could help stabilize variance. Excluding this luxury villa as an outlier was a logical choice since its price and size are significantly different from most properties in the dataset, making it an exception rather than a typical example. Keeping it in the model could distort the regression results, leading to a misleading trend that doesn’t accurately represent the overall housing market. By removing extreme values like this, the analysis offers a more accurate view of the relationship between management fees and property prices, ensuring the model is more applicable to standard residential properties rather than high-end luxury homes.
3c. Investigating price vs. apartment floor
Similar to 3a. and 3b., fully analyze the relationship between price and apartment floor.
Think
I want to examine how the categorical floor level of an apartment relates to its price. A possible hypothesis is that apartments on higher floors have higher prices. On the other hand, the inconvenience of waiting for elevators could limit demand among buyers who prefer accessibility and convenience. The analysis will focus on comparing prices across categories. A few unusually high-value apartments might skew the averages for specific categories. Understanding the relationship between floor and price provides insight into buyers’ preferences for floor levels in Shanghai and helps reveal how floor level influences apartment valuation.
Show
Tell
The box plot illustrates the relationship between apartment floor levels (Low, Middle, High) and average apartment prices, but due to the similar medians across all three categories, it is difficult to draw a clear conclusion. Interestingly, middle-floor apartments, which are often considered the most desirable, have a lower median price compared to both low- and high-floor apartments in this dataset. However, the differences between categories are not substantial, and the presence of outliers further complicates the analysis. To simplify the comparison, I focused on these three categories (Low, Middle, High), as they provide a more digestible overview for buyers and sellers looking to understand how floor level impacts pricing. A box plot was chosen over a line plot because it effectively displays price distribution, median values, and outliers. However, even with this visualization, the graph does not present a definitive trend, as the medians are too close to indicate a strong relationship between floor level and price. In the following Think Again, a more accurate analysis with the actual median values and the interquartile range (IQR) will be discussed to provide a clearer understanding of price variations across different floor levels.
Think again
The summary table below reports the exact median, quartiles, range, IQR, and count for each floor level.
# A tibble: 3 × 8
  楼层分布 Median    Q1    Q3   Min   Max   IQR Count
  <fct>     <dbl> <dbl> <dbl> <int> <int> <dbl> <int>
1 Low         380  220.  578.    69  4900   357    82
2 Middle      369  252   513     50  1650   261    83
3 High        378  280   582     75  3725   302   104
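A per-group summary like the one above could be computed along the following lines. This is a base R sketch on invented data, not the code that produced the table; `toy`, `summarise_price`, and the prices are all hypothetical.

```r
# Hypothetical toy data: floor level and price (values invented).
toy <- data.frame(
  floor = factor(rep(c("Low", "Middle"), each = 4), levels = c("Low", "Middle")),
  price = c(100, 200, 300, 400, 150, 250, 350, 450)
)

# Summary statistics for one group of prices.
summarise_price <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  c(Median = median(x), Q1 = q[1], Q3 = q[2],
    Min = min(x), Max = max(x), IQR = q[2] - q[1], Count = length(x))
}

# One row per floor level, one column per statistic.
summary_tbl <- t(sapply(split(toy$price, toy$floor), summarise_price))
summary_tbl
```

The IQR column is simply Q3 minus Q1, which is a quick consistency check on any table of this kind.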
Revising conclusions
The summary table provides a clearer view of how apartment floor levels relate to price. The median prices for low, middle, and high floor apartments are 380 CNY, 369 CNY, and 378 CNY, respectively, showing only slight differences between the categories. Although middle floor apartments are often considered the most desirable, they have the lowest median price, while low and high floor apartments have slightly higher medians. However, the interquartile range, which represents the spread of the middle 50% of values, differs across categories, with low floor apartments showing the greatest variability (IQR = 357), followed by high floors (302) and middle floors (261). This suggests that prices for low floor apartments are more spread out, possibly due to a mix of older, lower-priced units and newer, high-end developments. Examining the minimum and maximum prices, low-floor apartments have the widest price range, from 69 CNY to 4,900 CNY, likely influenced by a few exceptionally high-priced properties. Middle-floor apartments have a more constrained range, from 50 CNY to 1,650 CNY, suggesting greater consistency in pricing. High-floor apartments reach a maximum of 3,725 CNY, indicating the presence of some premium properties in this category. Additionally, the number of listings is highest for high-floor apartments, with 104 units, followed by middle floors with 83 units and low floors with 82 units, suggesting that high-floor apartments are the most common in this dataset. Overall, median prices are quite similar across floor levels, while the spread of prices differs, so floor level alone does not appear to be a strong determinant of price.
3d. Thinking about your results
What can we learn about the determinants of apartment prices in Shanghai from these two investigations? Do the results surprise you? What lurking variables do you think could be at work here, if any?
These investigations reveal that apartment prices in Shanghai are influenced by multiple factors, with management fees and floor levels playing only a minor role. The analysis of management fees and property prices indicates a weak positive correlation, implying that while higher management fees may be linked to better amenities, they do not significantly determine property values. Other factors, such as location, district, transportation access, and overall demand, likely have a much greater impact. Additionally, removing a high-priced luxury villa outlier helped refine the analysis, showing that luxury properties follow a different pricing trend than standard apartments. The relationship between apartment floor level and price is more complex than it initially seems. Although middle floors are typically considered the most desirable due to their balance of convenience and views, the data shows that middle-floor apartments actually have the lowest median price. This could be because they are the most competitive, as some buyers specifically avoid low floors due to noise and security concerns and high floors due to longer elevator wait times. Meanwhile, some buyers prefer low floors for easier access and quicker evacuation, while others favor high floors for better views, quieter living conditions, and reduced street-level pollution. Several lurking variables may be influencing these trends. Building age could be a major factor, as older buildings often have fewer floors, and if they are located in prime areas, their prices may be high due to location rather than floor level. Newer developments tend to have more floors with better views, but if they are located farther from the city center, their price growth might be limited. Proximity to subway stations is another crucial factor, as properties with easy access to public transportation tend to be priced higher, regardless of floor level or management fees.
Overall, while management fees and floor levels may have some influence on apartment prices, they are not the primary drivers.
Complete up to here for Homework Check - due January 26th at 11:59 pm.
Question 4: Multiple regression (30 points)
4a. Investigating area + floor and price
Using the Think-Show-Tell framework from the textbook, please examine how area and floor of an apartment are related to price. Make use of all the tools described in the textbook to assess model fit in the Think again section - if it is necessary to revise your model, do it in the Think again section. Then state any updated conclusions in the Revising conclusions section.
Think
I want to explore whether an apartment’s total area and the number of floors in its building are linked to its price. One possible expectation is that larger apartments tend to be more expensive, as additional space typically increases a property’s value. The number of floors in a building might also play a role, as taller buildings could offer better views, modern infrastructure, and more amenities, making them more desirable. Conversely, shorter buildings might be located in older, well-established neighborhoods where high land value could also contribute to higher prices. There may be outliers, such as luxury penthouses or exceptionally large apartments, that could distort the overall trend. Understanding these relationships can help determine whether apartment size and building height are reliable indicators of property value, offering useful insights for both buyers and investors looking to analyze pricing trends in the real estate market.
Show
Tell
The first graph explores the relationship between total area (square meters) and apartment price (CNY), with the number of floors in the building represented by color. There is a clear positive correlation between total area and price, as larger apartments generally have higher prices. The R² value of 0.769 indicates that approximately 76.9% of the variation in apartment prices can be explained by total area, making it a strong predictor of price. The color gradient shows that buildings with different numbers of floors appear across all price ranges, but there is no obvious trend suggesting that the number of floors significantly impacts price. One notable outlier stands out in the upper right corner of the graph. This is the same high-priced luxury villa that was identified in previous analyses, which has a significantly larger area than most other properties in the dataset. Its presence could be skewing the trend, as it does not represent the typical relationship between total area and price. The residual plot further highlights this, showing that larger apartments tend to deviate more from the model’s predictions. The residuals become more negative as total area increases. Since a residual is the observed price minus the predicted price, this means that bigger apartments are often priced lower than the model predicts, so the model overestimates prices for these properties. Additionally, residuals are more dispersed at higher total areas, indicating reduced accuracy for larger apartments, likely influenced by extreme values like this outlier. Overall, the results show that total area is a strong factor in determining apartment price, but the number of floors does not appear to have a significant impact. However, the presence of the outlier suggests that it may be distorting the trend, making it harder to accurately assess the relationship for standard properties. In the following Think again section, the outlier will be removed to provide a clearer analysis and improve the reliability of the model.
Additionally, incorporating other factors such as “Close to Subway” could further refine the model’s predictive accuracy.
Think again
As before, I still want to examine how total apartment area and building height influence apartment prices. Since extreme outliers, like luxury apartments, could skew the results, I will remove a high-priced outlier to ensure a more accurate analysis. This will help determine whether apartment size and building height are strong predictors of price or if other factors, such as location and neighborhood desirability, have a greater impact.
Revising conclusions
After removing the extreme outlier, the positive relationship between total area and apartment price remains clear, though some notable changes have occurred. The R² value has decreased from 0.769 to 0.714, meaning the model now accounts for a slightly smaller portion of price variation. While this may seem like a weaker model, it actually suggests that the previous R² was artificially inflated due to the influence of the outlier. Without this extreme luxury property, the regression now better reflects the overall housing market rather than being skewed by an unusually high-priced listing. The scatter plot still indicates that larger apartments tend to have higher prices, but without the outlier, the price distribution for mid-sized apartments appears more varied. This reinforces the idea that factors beyond just apartment size contribute to price differences. The color gradient, representing the number of floors, does not display a noticeable pattern, suggesting that building height does not strongly affect apartment prices. The residual plot appears more balanced, with fewer extreme deviations compared to the previous version. Some residuals remain negative for larger apartments, indicating that the model might slightly overestimate prices for bigger properties, but the overall distribution looks more reasonable. The remaining variation in residuals suggests that factors such as location, property condition, and neighborhood desirability, which are not captured by total area alone, likely play a role in price determination. Overall, although the decrease in the R² value suggests that total area alone is not a perfect predictor of price, removing the outlier has made the model more reliable for typical properties. This demonstrates the importance of identifying and filtering extreme values to prevent misleading trends.
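The claim that a single extreme point can inflate R² can be illustrated on a small example. The sketch below uses invented numbers, not the homework data: eight noisy "typical" listings plus one very large, very expensive listing that lies roughly along the same direction.

```r
# Hypothetical toy data: eight typical listings with noticeable scatter.
area  <- c(50, 60, 70, 80, 90, 100, 110, 120)
price <- c(300, 520, 380, 700, 450, 820, 600, 900)
typical <- data.frame(area, price)

# Add one invented high-leverage listing far from the main cluster.
with_outlier <- rbind(typical, data.frame(area = 1000, price = 9000))

# Helper: R-squared of a simple regression of price on area.
r2 <- function(d) summary(lm(price ~ area, data = d))$r.squared

r2_without <- r2(typical)       # moderate fit among typical listings
r2_with    <- r2(with_outlier)  # much higher, driven by the single point

c(without = r2_without, with = r2_with)
```

The extreme point stretches the total variation so much that the scatter among typical listings becomes negligible by comparison, which is why R² rises even though the fit for ordinary properties has not improved.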
4b. Interpreting coefficients of 4a. model
Carefully interpret your coefficients from 4a. What do they mean? Are there any lurking variables here?
Think
I want to interpret the coefficients from the regression model that examines how total apartment area and the number of floors in a building influence apartment prices. The coefficients will show how much the price is expected to change with a one-unit increase in total area or building height, assuming all other factors remain the same. If the coefficient for total area is positive, it confirms that larger apartments generally have higher prices, while a positive coefficient for the number of floors would suggest that buildings with more floors tend to have more expensive apartments. However, if the number of floors has a small or negative coefficient, it could mean that taller buildings do not necessarily lead to higher prices, possibly due to factors such as location, building age, or varying demand for high-rise versus low-rise living. Since an extreme outlier was removed, the model should now provide a more accurate reflection of the general housing market. However, there may still be lurking variables that affect the results. Factors such as location, subway accessibility, building age, or the reputation of the developer could influence apartment prices. Understanding the coefficients in this context will help determine whether total area and building height alone are strong predictors of price.
Show
Call:
lm(formula = 价格 ~ 总面积 + 居民楼总层数, data = sh.house_filtered)

Residuals:
     Min       1Q   Median       3Q      Max
-1267.55  -193.18    -0.76   151.18  2642.31

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -362.624     45.048  -8.050 2.03e-14 ***
总面积           8.170      0.308  26.526  < 2e-16 ***
居民楼总层数    10.674      2.326   4.588 6.62e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 338.8 on 296 degrees of freedom
Multiple R-squared:  0.7141, Adjusted R-squared:  0.7122
F-statistic: 369.7 on 2 and 296 DF,  p-value: < 2.2e-16
Tell
The multiple regression model examines the relationship between total area, the number of floors in a building, and apartment price. The results indicate that both variables are statistically significant predictors of price. The coefficient for total area is 8.17, meaning that for each additional square meter, the predicted price increases by approximately 8.17 CNY, holding the number of floors constant. Similarly, the coefficient for the total number of floors is 10.67, suggesting that for each additional floor in a building, the predicted apartment price increases by about 10.67 CNY, holding total area constant. Both variables have very low p-values, indicating strong statistical significance. The intercept of -362.62 does not have a meaningful interpretation in this context, as it represents the estimated price when both total area and total floors are zero, which is unrealistic. The model’s R² value is 0.714, meaning that approximately 71.4% of the variation in apartment prices can be explained by total area and the number of floors in the building. This suggests a strong fit, though other factors likely contribute to price differences. Overall, while total area is a stronger predictor than the number of floors, both variables contribute to apartment price.
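To make the "holding the other variable constant" interpretation concrete, the fitted equation can be evaluated by hand. The sketch below re-uses the three reported coefficients; the function name and the example apartment (100 m² in an 18-floor building) are hypothetical choices for illustration.

```r
# Coefficients as reported in the regression output above.
b0       <- -362.624
b_area   <- 8.170
b_floors <- 10.674

# Predicted price for a given total area (m^2) and building height (floors).
predict_price <- function(area_m2, total_floors) {
  b0 + b_area * area_m2 + b_floors * total_floors
}

p1 <- predict_price(100, 18)   # hypothetical apartment: 646.508
p2 <- predict_price(101, 18)   # one more square meter, same building height

p2 - p1                        # equals the area coefficient, 8.17
```

Changing only the area by one unit moves the prediction by exactly the area coefficient, which is what the coefficient interpretation states.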
4c. Add the variable near a subway
Now add the variable near a subway to your model and analyze the relationship similar to what you did in 4a.
Think
I want to examine whether proximity to a subway station influences apartment prices when considering total area and building height. Larger apartments typically cost more, but subway access could further impact pricing by making certain locations more desirable. However, this effect might not be uniform across all districts, as factors like neighborhood quality, building age, and amenities could also play a role. By adding subway proximity to the analysis, I aim to see if there is a noticeable difference in price trends between apartments near and far from subway stations.
Show
Tell
The scatter plot shows the relationship between total apartment area (square meters) and apartment price (CNY), distinguishing between properties near the subway (blue) and those farther from it (red). The dotted trendlines represent separate regression models for each category, with blue indicating properties close to the subway and red for those farther away. The solid black line represents the overall trend. The regression results indicate a strong positive correlation between total area and price, meaning that larger apartments generally have higher prices, regardless of their proximity to the subway. The R² values show that total area accounts for 85% of the price variation for apartments near the subway and 84.4% for those farther away, suggesting a similarly strong relationship in both cases. The overall R² is slightly lower at 0.769, which suggests that other factors may influence price when considering all properties together. One notable observation is that the blue dotted line (near subway) is positioned above the red dotted line (far from subway), indicating that, for apartments of the same size, those closer to the subway tend to have higher prices. This supports the idea that subway accessibility is an important factor in Shanghai’s housing market, likely due to the convenience of transportation and increased demand for well-connected locations. Additionally, the gap between the two trendlines widens as apartment size increases, suggesting that larger apartments near the subway carry an even greater price premium than smaller ones. This suggests that buyers are willing to pay more for well-located, spacious properties in Shanghai.
To better understand the relationship between subway proximity and apartment prices, I created a box plot comparing the prices of properties located near and far from subway stations, drawn on a log-scaled axis. The log scale compresses the long right tail of the price distribution, making it easier to compare values across different ranges and preventing extreme values from dominating the visualization. The statistics reported here are in original price units, not logs. The results show that apartments near subway stations tend to have higher prices. The median price for properties close to the subway is 405 CNY, while for those farther away it is 285 CNY. This suggests that being near public transportation may increase property value, likely due to higher demand for convenient locations. There is also greater variability in prices for apartments near subway stations, as shown by a wider interquartile range (IQR) of 381 compared to 275.5 for properties farther from the subway. This suggests that areas near subway stations may have a mix of both standard and high-end properties, while prices in areas farther away tend to be more consistent. Additionally, the maximum price for near-subway properties reaches 5,500 CNY, nearly double the 2,950 CNY maximum for those farther away, indicating that high-end developments are more concentrated around subway lines.
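One detail worth checking is why medians can be read in price units even though the box plot uses a log axis. A minimal sketch, using a small made-up price vector (not rows from the housing data): the median is an order statistic, so taking logs and transforming back returns the same value, whereas the mean does not survive the round trip.

```r
# Made-up prices for illustration only (not taken from the data set).
# An odd number of values is used so the median is a single observation.
prices <- c(120, 285, 405, 910, 5500)

# Median computed directly and via the log scale
median_raw <- median(prices)
median_via_log <- exp(median(log(prices)))

# The mean behaves differently: back-transforming the mean of the logs
# gives the geometric mean, which is smaller for right-skewed prices
mean_raw <- mean(prices)
geo_mean <- exp(mean(log(prices)))

print(c(median_raw = median_raw, median_via_log = median_via_log))
print(c(mean_raw = mean_raw, geometric_mean = geo_mean))
```

This is why a log axis changes how the boxes look without changing the medians and quartiles themselves.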
4d. Reinterpret your coefficients
Carefully interpret your coefficients from 4c. What do they mean? Any new lurking variables to consider?
Think
I want to analyze how total area and subway proximity affect apartment prices. The coefficients will show how much the price changes with a one-unit increase in total area. I expect the coefficients to be positive. Additionally, I expect the area coefficient for apartments near the subway to be higher than for those farther away, which would suggest that subway access adds value to a property. Beyond subway proximity, other lurking variables could be influencing prices. Neighborhood desirability and district-level differences might play a role, as some areas with subway access may already be in high-demand locations. Building age and condition could also be a factor: older buildings may have lower prices despite being near a subway, while new developments farther from subways might still command high prices due to modern amenities.
Show
[1] "Coefficients for Near Subway:"
                 Estimate  Std. Error    t value      Pr(>|t|)
(Intercept)  -1075.83153  70.7975525  -15.19589  7.335497e-36
总面积           18.07045   0.5195274   34.78248  5.079019e-90
[1] "Coefficients for Far from Subway:"
                 Estimate  Std. Error    t value      Pr(>|t|)
(Intercept)  -228.675237   40.926849  -5.587414  2.927050e-07
总面积           6.513735    0.308778  21.095201  7.101759e-35
[1] "Coefficients for Overall Model:"
                 Estimate  Std. Error    t value      Pr(>|t|)
(Intercept)   -936.12135  66.9563217  -13.98107  1.681469e-34
总面积           15.58889   0.4950948   31.48667  8.531568e-97
Tell
The regression results confirm that apartment size influences price, which isn’t surprising. For apartments near the subway, each additional square meter is associated with a price 18.07 CNY higher, while for those farther away the increase is only 6.51 CNY. This suggests that additional space is valued much more highly near subway stations than in less accessible areas. In the combined model, which includes all apartments regardless of subway proximity, the coefficient is 15.59 CNY per square meter. This value falls between the coefficients of the two separate models, as expected for a pooled fit; it is the gap between 18.07 and 6.51 that shows how proximity to a subway station changes the price impact of apartment size. All three intercepts are negative, the most extreme being -1075.83 for apartments near subway stations. Taken literally this is unrealistic, since apartment prices cannot be negative, and it indicates that the relationship between area and price may not be entirely linear for smaller properties. These findings indicate that subway access plays a role in shaping apartment prices, particularly for larger units, as buyers appear more willing to pay a higher price for convenient transportation options. Other elements, like school districts, proximity to business centers, and future development plans, could be pushing prices up in certain areas, making subway access just one part of the bigger picture. While the data suggests that being near a subway generally goes with higher prices, other factors that this model does not include likely matter as well.
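The two separate fits can also be read as one model with a subway indicator and an area-by-subway interaction, which is one way to put the near-subway variable into the regression itself. The sketch below only rearranges the point estimates printed above. The variable names are hypothetical, and the interaction form is an assumption about how such a model could be specified, not the model that produced this output.

```r
# Point estimates copied from the separate fits printed above
near <- c(intercept = -1075.83153, slope = 18.07045)
far  <- c(intercept = -228.675237, slope = 6.513735)

# Equivalent terms of a single model of the form
# 价格 ~ 总面积 * near_subway, with "far" as the baseline group
subway_shift <- near[["intercept"]] - far[["intercept"]]  # near-subway main effect
extra_slope  <- near[["slope"]] - far[["slope"]]          # interaction term

# Area at which the two fitted lines cross
crossover_area <- -subway_shift / extra_slope

# Fitted near-minus-far price gap at a few hypothetical sizes (square meters)
sizes <- c(50, 100, 150)
gap <- subway_shift + extra_slope * sizes

print(c(subway_shift = subway_shift, extra_slope = extra_slope,
        crossover_area = crossover_area))
print(setNames(gap, paste0(sizes, " m2")))
```

With these estimates the fitted lines cross at roughly 73 square meters, so the near-subway premium implied by the two fits applies to apartments above about that size and grows by about 11.56 CNY for each additional square meter.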
4e. Thinking about your results
Consider the results of 4a.-4d. together. What can we learn about used housing prices in Shanghai? How did your conclusions change from 3d.? Why do you think they changed?
From these analyses, it’s clear that many factors influence apartment prices in Shanghai, with total area being one of the biggest ones. At first, I thought management fees and floor levels might have an impact, but this investigation shows that being close to a subway station plays a major role, especially for larger apartments. Properties near subway stations tend to be more expensive, and the price gap grows as the apartment size increases. The data also reveals that prices vary more for apartments near subways, which might mean that these areas have a mix of both regular and high-end developments. This shows how important transportation access is to buyers: people are willing to pay more for convenience and easy commuting. Compared to my earlier conclusions, this analysis shifts the focus from building height and management fees to location and accessibility. A big takeaway is that while apartment size still affects price, its impact is even stronger when paired with good transportation access. Removing outliers, like luxury apartments, helped make this trend clearer by preventing them from skewing the data. These findings suggest that both buyers and investors should prioritize location just as much as apartment size when evaluating a property’s worth. Future research could explore other location-based factors, like neighborhood reputation or proximity to business districts, to get an even better understanding of Shanghai’s housing market.
Question 5: Your own investigation (20 points)
5a. Selecting your own question
Develop your own model of used housing price in Shanghai. Use the Think-Show-Tell procedure to conduct your investigation. Think deeply about what your result means and interpret your coefficients carefully.
Think
I want to examine how apartment prices differ across districts in Shanghai. Some districts, like Huangpu and Jing’an, are expected to have higher prices due to their central locations, while suburban areas may be more affordable. However, some districts are harder to predict. For example, Pudong could have expensive apartments near the airport, but prices might drop in areas affected by airplane noise. A box plot will help compare median prices, variability, and outliers across districts, with a log scale making price differences clearer. This information will be very helpful for people who want to invest or buy houses in Shanghai.
The box plot and summary statistics provide a clear comparison of apartment prices across different districts in Shanghai. The log scale helps visualize the wide range of prices, highlighting significant variations in median values, interquartile ranges, and outliers.
Highest and Lowest Median Prices: Jing’an (910 CNY), Changning (675 CNY), and Putuo (634 CNY) have the highest median apartment prices, likely due to their central locations and well-established infrastructure, making them more attractive. On the other hand, Fengxian (125 CNY) has the lowest median price. Its minimum and maximum are also 125 CNY, which most likely means the sample contains only one Fengxian listing (or listings with identical prices), so the figure hints at a budget-friendly district but says little about the range of prices there.
Price Spread and Variability: Jing’an (IQR: 780 CNY), Changning (IQR: 1311 CNY), and Huangpu (IQR: 2308 CNY) have the widest interquartile ranges, indicating a diverse range of apartment prices, from mid-range to high-end properties. In contrast, districts like Songjiang (IQR: 212 CNY) and Jiading (IQR: 241 CNY) have much smaller IQRs, suggesting that most apartments in these areas are priced closer to the median, with fewer luxury properties. Huangpu also has a maximum price of 4900 CNY, reflecting the presence of expensive apartments, but with only three data points, the limited sample size makes it difficult to generalize trends for this district.
Outliers and Maximum/Minimum Prices: Jing’an has the highest maximum apartment price at 25,600 CNY, highlighting the presence of luxury properties in the district. Minhang, with a maximum price of 5,500 CNY, stands out because it is significantly higher than its median price of 435 CNY, suggesting a mix of both affordable and high-end housing options. Meanwhile, Fengxian’s minimum, median, and maximum prices are all 125 CNY. This most likely reflects a single listing (or identical prices) in the sample rather than a genuine lack of market diversity, so no conclusion about price variation in Fengxian can be drawn.
Pudong’s Mixed Pricing: Pudong has a relatively low median apartment price of 375 CNY, but its maximum price reaches 2,268 CNY. This variation aligns with its location: some areas are highly sought after due to their proximity to the airport and financial hubs, while others may be less desirable due to noise pollution from air traffic.
This analysis highlights how district characteristics play a major role in apartment prices. Central areas like Jing’an and Changning tend to have the highest prices and the greatest variation, likely driven by demand from professionals and wealthy buyers. In contrast, outer districts such as Fengxian and Songjiang are more affordable with less price fluctuation. The presence of high-price outliers in some districts suggests a combination of standard and luxury properties, emphasizing the importance of location, transportation, and local amenities in shaping real estate values.
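Because some districts have very few listings, it helps to report the sample size next to each median and IQR. A rough sketch in base R, using a toy data frame with hypothetical column names (district, price) and made-up values; the cutoff of five listings for flagging a small sample is an arbitrary assumption.

```r
# Toy data for illustration only: district "B" has a single listing
toy <- data.frame(
  district = c("A", "A", "A", "A", "A", "A", "B"),
  price    = c(100, 200, 300, 400, 500, 600, 125)
)

# Split prices by district and summarise each group
groups <- split(toy$price, toy$district)
by_district <- data.frame(
  district = names(groups),
  n        = vapply(groups, length, integer(1)),
  median   = vapply(groups, median, numeric(1)),
  iqr      = vapply(groups, IQR, numeric(1)),
  min      = vapply(groups, min, numeric(1)),
  max      = vapply(groups, max, numeric(1)),
  row.names = NULL
)

# Flag districts whose statistics rest on very few listings
# (the threshold of 5 is an assumption, not a rule from the course)
by_district$small_sample <- by_district$n < 5

print(by_district)
```

A district with one listing shows identical minimum, median, and maximum values and an IQR of zero, which is a sign of too little data rather than of a uniform market.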
Think again
I want to better understand whether certain districts have more apartments near subway stations and how that impacts prices. By using facet graphs, I can compare price trends across districts while distinguishing between properties that are close to or far from subway stations. However, overlapping points might make it hard to see patterns clearly, especially in districts with more data points. To address this, I will adjust the transparency of the points to 30% so that overlapping data remains visible. This should help reveal whether districts with more subway access tend to have higher apartment prices and if the price difference between near and far subway properties varies by district. If the trends still appear unclear, I may consider alternative visualizations.
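The plotting plan described above could look something like the sketch below. It assumes the ggplot2 package and uses simulated data with hypothetical English column names (district, near_subway, total_area, price) standing in for the real data set, so the output shows only the layout, not actual results.

```r
library(ggplot2)

# Simulated stand-in data; none of these values come from the housing data
set.seed(42)
toy <- data.frame(
  district    = rep(c("District A", "District B"), each = 40),
  near_subway = rep(c("Near", "Far"), times = 40),
  total_area  = runif(80, min = 40, max = 200)
)
toy$price <- 5 * toy$total_area +
  ifelse(toy$near_subway == "Near", 150, 0) +
  rnorm(80, sd = 60)

# One panel per district; alpha = 0.3 keeps overlapping points visible
p <- ggplot(toy, aes(x = total_area, y = price, colour = near_subway)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ district) +
  scale_colour_manual(values = c(Near = "blue", Far = "red")) +
  labs(
    title  = "Price versus total area by district (simulated data)",
    x      = "Total area (square meters)",
    y      = "Price",
    colour = "Subway proximity"
  )

print(p)
```

If panels with many listings are still hard to read at 30% opacity, lowering alpha further or reducing the point size are simple next steps.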
Revising conclusions
The scatter plot explores the relationship between apartment price, total area, and subway proximity across different districts. Each facet represents a district, with red points showing apartments far from the subway and blue points representing those near the subway. Across most districts, larger apartments tend to be more expensive. However, the strength of this trend varies by district. In places like Jing’an and Changning, where housing is in high demand, bigger apartments tend to be noticeably more expensive, especially if they’re close to a subway station. But in districts like Songjiang and Qingpu, the price difference between homes near and far from the subway isn’t as obvious, which might be because demand is lower. The presence of red points at lower price levels across most districts indicates that apartments farther from subway stations are generally more affordable, supporting the idea that proximity to public transportation plays a significant role in shaping real estate prices in Shanghai. The mix of red and blue points also differs by district. In Huangpu, Putuo, Yangpu, Xuhui, and Jing’an, most listings are located near subway stations, whereas Qingpu and Songjiang have a greater proportion of properties farther from subway access. This suggests that central districts generally have better subway coverage, while outer districts tend to have more developments in areas with limited transit connectivity.
Additional Graph to Analyze
Analysis for this box chart
This box plot does a great job of showing how being near a subway station affects apartment prices in different parts of Shanghai. In every district, homes close to a subway station tend to be more expensive than those farther away, confirming that easy access to public transportation adds value. The price gap is especially noticeable in central areas like Jing’an, Changning, and Huangpu, where demand for convenient commuting is high. In contrast, in districts like Songjiang, Qingpu, and Jiading, the difference isn’t as dramatic, suggesting that subway access might not be as big of a factor there. Some districts, such as Huangpu and Jing’an, have a wide range of prices for homes near subway stations, likely because these areas offer both mid-range and high-end properties. On the other hand, places like Qingpu and Songjiang have more homes located farther from subway stations, which could mean public transit isn’t as essential for homebuyers there. Overall, this box plot provides a clear and useful comparison of how subway access influences real estate prices across Shanghai.
5b. In summary
Sum up everything that you have learned in this investigation. Do not simply repeat/rephrase your previous results but try to say something larger that synthesizes the results together to draw a more meaningful general conclusion.
This analysis highlights how Shanghai’s real estate prices are influenced by a mix of factors, including district characteristics, subway accessibility, and overall urban development. Central districts like Jing’an, Huangpu, and Xuhui tend to have the highest apartment prices, which makes sense given their roles as commercial, financial, and cultural hubs. Huangpu, home to major tourist attractions like The Bund and Nanjing Road, sees strong demand from both investors and luxury buyers. Jing’an, known for its high-end shopping centers and corporate offices, attracts professionals willing to pay a premium for convenience. In contrast, districts like Qingpu and Songjiang have a higher share of properties located farther from subway stations, as they are more suburban and cater to families or students. Songjiang, for example, is home to multiple universities and historical sites, leading to more stable mid-range property prices. Pudong shows significant variation in prices, reflecting its diverse landscape: apartments near the financial hubs are highly sought after, while those near the airport may be priced lower due to noise pollution. Subway access plays a crucial role in shaping property values, particularly in Shanghai’s busiest areas. Locations with major transit and commercial hubs, such as People’s Square, tend to have higher prices due to the convenience of transportation and the steady demand from commuters and businesses. However, in more suburban areas like Jiading and Fengxian, people rely more on buses and highways. A district’s economic activity, tourism appeal, and future infrastructure projects also shape real estate trends. As Shanghai continues to grow and develop, new subway lines, commercial projects, and shifting urban priorities will likely redefine property values, making it essential for investors and buyers to consider these broader factors when analyzing the housing market.