房屋价格探索分析

Author: Haochun Qi

1.背景

此章节介绍数据集相关背景。

参考文献:Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project

数据集:ames_houseprice.csv

变量说明数据集:ames_houseprice_explanation.xlsx

1.1 数据集信息

  • “ames_houseprice”是美国爱荷华州艾姆斯市(Ames, Iowa)的一组住宅房产数据集。
  • 该数据集是由Dean De Cock在2011年整理并发布的,目的是为了提供一个适合教学和研究用的、真实的房地产数据集。
  • 最初的数据集包含113个变量,描述了2006年至2010年间发生在爱荷华州艾姆斯的3970笔房产销售。
  • 作者从房产行业外行人的角度,为能快速理解变量包含信息,删除了所有需要特殊知识或之前计算的变量。

1.2部分变量枚举值解释说明

只介绍了后续问题探究所用变量

1.2.1 MSSubClass

描述 注释
20 1-STORY 1946 & NEWER ALL STYLES 单层住宅(1946年及以后建造的所有风格)
30 1-STORY 1945 & OLDER 单层住宅(1945年及以前建造)
40 1-STORY W/FINISHED ATTIC ALL AGES 单层住宅,带完工的阁楼
45 1-1/2 STORY UNFINISHED ALL AGES 一层半住宅,未完工的上层
50 1-1/2 STORY FINISHED ALL AGES 一层半住宅,完工的上层
60 2-STORY 1946 & NEWER 两层住宅(1946年及以后建造)
70 2-STORY 1945 & OLDER 两层住宅(1945年及以前建造)
75 2-1/2 STORY ALL AGES 两层半住宅
80 SPLIT OR MULTI-LEVEL 分层或多层住宅
85 SPLIT FOYER 分层门厅住宅
90 DUPLEX 双拼别墅
120 1-STORY PUD (Planned Unit Development) 单层规划单元开发住宅
150 1-1/2 STORY PUD 一层半规划单元开发住宅
160 2-STORY PUD 两层规划单元开发住宅
180 PUD (Planned Unit Development) 规划单元开发住宅(多层)
190 2 FAMILY CONVERSION 两户家庭转换住宅(通常由旧建筑改造而成)

1.2.2 数值型变量单位

变量 单位
Lot_Frontage 英尺
Lot_Area 平方英尺
Gr_Liv_Area 平方英尺

注:1英尺 = 0.3048米
1平方英尺 = 0.093平方米

1.2.3 SaleCondition

描述 注释
Normal 正常销售 买方和卖方之间的标准交易,没有特殊条件或限制。
Abnorml 异常销售 销售价格低于市场价值,可能是因为房屋有缺陷、急售等。
AdjLand 相邻土地的销售 房屋与相邻的土地一起出售,或者通过购买相邻土地获得。
Alloca 分配销售 通过分配或重新分配的方式出售,通常发生在政府或机构之间。
Family 家庭内部销售 房屋在家庭成员之间出售,基于家庭关系,可能不遵循市场价。
Partial 部分完工的房屋销售 房屋在出售时尚未完全完工,买方可能会继续完成剩余工程。

1.2.4 Land_Slope

描述 注释
Gtl 轻微倾斜 地形有轻微的坡度,通常不影响建筑物的建设或使用。
Mod 中等倾斜 地形有明显的坡度,可能需要额外的工程措施来适应建筑。
Sev 严重倾斜 地形非常陡峭,可能会对建筑物的建设和安全性造成重大影响。

1.2.5 MSZoning

描述 注释
A 农业用地 主要用于农业生产,不适合大规模的住宅开发。
C 商业用地 主要用于商业活动,如商店、办公楼等。虽然可以建造住宅,但通常是公寓或多户住宅,而不是单户住宅。
FV 浮动村庄 通常位于水边,允许建造水上住宅或其他特殊类型的住宅。
I 工业用地 主要用于工业活动,如工厂、仓库等。通常不允许建造住宅。
RH 高密度住宅区 允许建造多户住宅或多层公寓,适合人口密集的居住环境。
RL 低密度住宅区 主要用于单户住宅开发,通常要求较大的地块和较低的建筑密度,适合独栋别墅或联排别墅。
RP 公园住宅区 通常与公园或绿地相邻,允许建造低密度住宅,可能有特殊的景观或环保要求。
RM 中密度住宅区 允许建造双拼别墅或多户住宅,密度介于低密度和高密度之间。

1.2.6 BldgType

描述 注释
1Fam 单户独立住宅 最常见的住宅类型,指一栋房屋只供一个家庭居住,通常有自己的院子和独立的入口。这类房屋通常是独栋别墅或联排别墅。
2fmCon 两户家庭转换住宅 原本可能是单户住宅,后来被改造为两户家庭共享的住宅,通常有两套独立的居住单元,适合出租或投资。
Duplex 双拼别墅 由两个独立的居住单元组成,每个单元都有自己的入口和生活空间,通常共享一面墙壁。适合两个家庭共同居住或出租。
Twnhs 联排别墅(端头单元) 联排别墅的一部分,但位于一排房屋的两端,因此有一侧与外界相邻,通常有更多的隐私和户外空间。
TwnhsE 联排别墅(中间单元) 联排别墅的一部分,位于一排房屋的中间,两侧都与其他房屋相连,通常共享墙壁,隐私相对较少。

2.数据探查

本练习主要针对房屋售价影响因素进行分析,因此本章节在读取数据后,首先对数据集的重复情况、缺失情况进行探查,之后完成对房屋销售价格分布进行展示。

2.1数据读取

Code
library(tidyverse)
library(openxlsx)
library(RColorBrewer)
library(data.table)
library(DT)
library(magrittr)
library(ggplot2)
library(scales)
library(plotly)
library(janitor)

data <- read.csv("C:/Users/TST/Desktop/统计学_学习/周末作业数据集/ames_houseprice.csv") %>%
  setDT() #读取数据集并转为数据框

glimpse(data)
Rows: 1,460
Columns: 81
$ Id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20,…
$ MSZoning      <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "R…
$ LotFrontage   <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, …
$ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612…
$ Street        <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", …
$ Alley         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ LotShape      <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg", "IR1", …
$ LandContour   <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", …
$ Utilities     <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllPu…
$ LotConfig     <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Inside", "I…
$ LandSlope     <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", …
$ Neighborhood  <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "…
$ Condition1    <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm",…
$ Condition2    <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", …
$ BldgType      <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", …
$ HouseStyle    <chr> "2Story", "1Story", "2Story", "2Story", "2Story", "1.5Fi…
$ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5,…
$ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5,…
$ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19…
$ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 19…
$ RoofStyle     <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Gable", "G…
$ RoofMatl      <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg", "…
$ Exterior1st   <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "VinylSd", "…
$ Exterior2nd   <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "VinylSd", "…
$ MasVnrType    <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace", "None",…
$ MasVnrArea    <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 306, …
$ ExterQual     <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", "TA", "T…
$ ExterCond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
$ Foundation    <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "Wood", "…
$ BsmtQual      <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", "TA", "T…
$ BsmtCond      <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", "TA", "T…
$ BsmtExposure  <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", "No", "N…
$ BsmtFinType1  <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ", "ALQ", …
$ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906, 99…
$ BsmtFinType2  <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "BLQ", …
$ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134, 17…
$ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10…
$ Heating       <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", …
$ HeatingQC     <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", "Gd", "E…
$ CentralAir    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
$ Electrical    <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "S…
$ X1stFlrSF     <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, …
$ X2ndFlrSF     <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0,…
$ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 10…
$ BsmtFullBath  <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,…
$ BsmtHalfBath  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ FullBath      <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1,…
$ HalfBath      <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
$ BedroomAbvGr  <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,…
$ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
$ KitchenQual   <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", "TA", "T…
$ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6…
$ Functional    <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", …
$ Fireplaces    <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0,…
$ FireplaceQu   <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA", "TA", …
$ GarageType    <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd", "Attch…
$ GarageYrBlt   <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931, 19…
$ GarageFinish  <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn", "RFn", …
$ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,…
$ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7…
$ GarageQual    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "Fa", "G…
$ GarageCond    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T…
$ PavedDrive    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "…
$ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, 160…
$ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213,…
$ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, 0, …
$ X3SsnPorch    <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, 0, …
$ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ PoolQC        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ Fence         <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA, NA, NA,…
$ MiscFeature   <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, NA, NA, …
$ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 700,…
$ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3, 10…
$ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20…
$ SaleType      <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "W…
$ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal", "Norm…
$ SalePrice     <int> 208500, 181500, 223500, 140000, 250000, 143000, 307000, …

规范变量名

Code
data_clean <- data %>%
  clean_names()

读取变量说明数据集

Code
explain_data <- read.xlsx("C:/Users/TST/Desktop/统计学_学习/周末作业数据集/ames_houseprice_explanation.xlsx") %>%
  setDT() 

看一下clean_names()如何处理中文列名

Code
explain_data <- explain_data %>%
  clean_names()
print(colnames(explain_data))
[1] "lie_ming"    "description" "jie_shi"    

更换列名

Code
explain_data <- explain_data %>%
  rename(column = lie_ming,
         explaination = jie_shi) 

explain_data %>% 
  knitr::kable()
column description explaination
MSSubClass Identifies the type of dwelling involved in the sale. 住宅概况
MSZoning Identifies the general zoning classification of the sale. 建筑性质(农业、商业、高/低密度住宅)
LotFrontage Linear feet of street connected to property 建筑离街道的距离
LotArea Lot size in square feet 占地面积
Street Type of road access to property 建筑附近的路面材质
Alley Type of alley access to property 建筑附近小巷的修建材质
LotShape General shape of property 建筑物的形状
LandContour Flatness of the property 地面平坦程度
Utilities Type of utilities available 可用公用设施类型
LotConfig Lot configuration 房屋哪里配置多
LandSlope Slope of property 建筑的斜率
Neighborhood Physical locations within Ames city limits 建筑在Ames城市的位置
Condition1 Proximity to various conditions 建筑附近的交通网络
Condition2 Proximity to various conditions (if more than one is present) 建筑附近的交通网络
BldgType Type of dwelling 住宅类别(联排别墅、独栋别墅…)
HouseStyle Style of dwelling 建筑风格
OverallQual Rates the overall material and finish of the house 房屋装饰材质水平
OverallCond Rates the overall condition of the house 房屋整体状况评估
YearBuilt Original construction date 房屋修建日期
YearRemodAdd Remodel date (same as construction date if no remodeling or additions) 房屋改建日期
RoofStyle Type of roof 屋顶类型
RoofMatl Roof material 屋顶材质
Exterior1st Exterior covering on house 建筑外立面材质
Exterior2nd Exterior covering on house (if more than one material) 建筑外立面材质
MasVnrType Masonry veneer type 建筑表层砌体类型
MasVnrArea Masonry veneer area in square feet 每平方英尺的砌体面积
ExterQual Evaluates the quality of the material on the exterior 建筑表层砌体材料质量评估
ExterCond Evaluates the present condition of the material on the exterior 建筑表层砌体材料现状评估
Foundation Type of foundation 建筑基础的类型
BsmtQual Evaluates the height of the basement 地下室高度评估
BsmtCond Evaluates the general condition of the basement 地下室总体状况评估
BsmtExposure Refers to walkout or garden level walls 走廊/花园外墙的评估
BsmtFinType1 Rating of basement finished area 地下室完工区域的等级评价
BsmtFinSF1 Type 1 finished square feet 地下室完工区域的面积
BsmtFinType2 Rating of basement finished area (if multiple types) 其他地下室完工区域的等级评价
BsmtFinSF2 Type 2 finished square feet 其他地下室完工区域的面积
BsmtUnfSF Unfinished square feet of basement area 地下室未完工部分的面积
TotalBsmtSF Total square feet of basement area 地下室总面积
Heating Type of heating 房屋暖气类型(地暖、墙暖….)
HeatingQC Heating quality and condition 暖气设施的质量和条件
CentralAir Central air conditioning 是否有中央空调
Electrical Electrical system 电器系统配置标准
1stFlrSF First Floor square feet 一楼面积
2ndFlrSF Second floor square feet 二楼面积
LowQualFinSF Low quality finished square feet (all floors) 所有楼层中低质量施工面积
GrLivArea Above grade (ground) living area square feet 地上居住面积
BsmtFullBath Basement full bathrooms 地下室标准卫生间个数
BsmtHalfBath Basement half bathrooms 地下室简易卫生间个数
FullBath Full bathrooms above grade 地上楼层标准卫生间个数
HalfBath Half baths above grade 地上楼层简易卫生间个数
BedroomAbvGr Bedrooms above grade (does NOT include basement bedrooms) 地上楼层卧室个数
KitchenAbvGr Kitchens above grade 地上楼层厨房个数
KitchenQual Kitchen quality 厨房质量评估
TopRmsAbvGrd Total rooms above grade (does not include bathrooms) 地上楼层房间总数(除去卧室)
Functional Home functionality (Assume typical unless deductions are warranted) 房屋功能情况
Fireplaces Number of fireplaces 壁炉个数
FireplaceQu Fireplace quality 壁炉质量
GarageType Garage location 车库位置
GarageYrBlt Year garage was built 车库建成年份
GarageFinish Interior finish of the garage 车库内部装饰情况
GarageCars Size of garage in car capacity 车库容量
GarageArea Size of garage in square feet 车库占地面积
GarageQual Garage quality 车库质量
GarageCond Garage condition 车库条件
PavedDrive Paved driveway 车道施工方式
WoodDeckSF Wood deck area in square feet 木甲板面积
OpenPorchSF Open porch area in square feet 开放式门廊面积
EnclosedPorch Enclosed porch area in square feet 封闭式门廊面积
3SsnPorch Three season porch area in square feet 三季门廊面积
ScreenPorch Screen porch area in square feet 纱窗门廊面积
PoolArea Pool area in square feet 游泳池面积
PoolQC Pool quality 游泳池质量
Fence Fence quality 栅栏质量
MiscFeature Miscellaneous feature not covered in other categories 其他配套设施(网球场、电梯…)
MiscVal $Value of miscellaneous feature 其他配套设施的费用
MoSold Month Sold (MM) 销售月份
YrSold Year Sold (YYYY) 销售年份
SaleType Type of sale 支付方式
SaleCondition Condition of sale 房屋出售的情况

2.2.是否存在重复数据

2.2.1 id是否重复

Code
id_num <- data_clean %>% 
  group_by(id) %>% 
  nrow()
print(id_num) 
[1] 1460

2.2.2 除id外数据是否重复

Code
expect_id <- data_clean %>%
  select(-id)
print(anyDuplicated(expect_id))
[1] 0

综合2.1和2.2结果,可知数据集没有重复数据

2.3 缺失值

Code
# 缺失值及比例
missing_infor <- data %>%
  summarise(across(everything(), list(
    miss_count = ~ sum(is.na(.x)),
    miss_pct = ~ sum(is.na(.x)) / nrow(data_clean)
  ))) %>%
  pivot_longer(everything(), names_to = c("column", ".value"), names_pattern = "(.*)_(miss_count|miss_pct)") %>%
  arrange(desc(miss_pct)) %>% 
  mutate(miss_pct = scales::percent(miss_pct, accuracy = 0.01)) 

knitr::kable(missing_infor)
column miss_count miss_pct
PoolQC 1453 99.52%
MiscFeature 1406 96.30%
Alley 1369 93.77%
Fence 1179 80.75%
FireplaceQu 690 47.26%
LotFrontage 259 17.74%
GarageType 81 5.55%
GarageYrBlt 81 5.55%
GarageFinish 81 5.55%
GarageQual 81 5.55%
GarageCond 81 5.55%
BsmtExposure 38 2.60%
BsmtFinType2 38 2.60%
BsmtQual 37 2.53%
BsmtCond 37 2.53%
BsmtFinType1 37 2.53%
MasVnrType 8 0.55%
MasVnrArea 8 0.55%
Electrical 1 0.07%
Id 0 0.00%
MSSubClass 0 0.00%
MSZoning 0 0.00%
LotArea 0 0.00%
Street 0 0.00%
LotShape 0 0.00%
LandContour 0 0.00%
Utilities 0 0.00%
LotConfig 0 0.00%
LandSlope 0 0.00%
Neighborhood 0 0.00%
Condition1 0 0.00%
Condition2 0 0.00%
BldgType 0 0.00%
HouseStyle 0 0.00%
OverallQual 0 0.00%
OverallCond 0 0.00%
YearBuilt 0 0.00%
YearRemodAdd 0 0.00%
RoofStyle 0 0.00%
RoofMatl 0 0.00%
Exterior1st 0 0.00%
Exterior2nd 0 0.00%
ExterQual 0 0.00%
ExterCond 0 0.00%
Foundation 0 0.00%
BsmtFinSF1 0 0.00%
BsmtFinSF2 0 0.00%
BsmtUnfSF 0 0.00%
TotalBsmtSF 0 0.00%
Heating 0 0.00%
HeatingQC 0 0.00%
CentralAir 0 0.00%
X1stFlrSF 0 0.00%
X2ndFlrSF 0 0.00%
LowQualFinSF 0 0.00%
GrLivArea 0 0.00%
BsmtFullBath 0 0.00%
BsmtHalfBath 0 0.00%
FullBath 0 0.00%
HalfBath 0 0.00%
BedroomAbvGr 0 0.00%
KitchenAbvGr 0 0.00%
KitchenQual 0 0.00%
TotRmsAbvGrd 0 0.00%
Functional 0 0.00%
Fireplaces 0 0.00%
GarageCars 0 0.00%
GarageArea 0 0.00%
PavedDrive 0 0.00%
WoodDeckSF 0 0.00%
OpenPorchSF 0 0.00%
EnclosedPorch 0 0.00%
X3SsnPorch 0 0.00%
ScreenPorch 0 0.00%
PoolArea 0 0.00%
MiscVal 0 0.00%
MoSold 0 0.00%
YrSold 0 0.00%
SaleType 0 0.00%
SaleCondition 0 0.00%
SalePrice 0 0.00%

游泳池质量、其他配套设施(网球场、电梯…)、建筑附近小巷的修建材质、栅栏质量、壁炉质量五个变量缺失比例较高

2.4 出售房屋的年份范围

Code
year_range <- range(data_clean$yr_sold, na.rm = TRUE)
print(year_range)
[1] 2006 2010

数据集包含2006 - 2010年的出售房屋信息

2.5 出售价格分布

Code
format_y_axis <- function(x) {
  # 将数值除以10000,然后保留两位小数,并加上 "w" 表示 "万"
  paste0(formatC(x / 10000, format = "f", digits = 0), "w")
}

price_box <- ggplot(data_clean, aes(x = "", y = sale_price)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  labs(title = "价格箱线图", y = "价格(万)") +
  theme_minimal() + 
  scale_y_continuous(labels = format_y_axis) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold")
  )

p_interactive <- ggplotly(price_box, tooltip = c("y"))
p_interactive <- p_interactive %>% 
  style(hovertemplate = "Sale Price: %{y:d}<extra></extra>")
p_interactive

由箱线图可知,75%的房屋售价低于22w。

房屋售价上界为34w,统计超过上界的房屋数量

Code
beyond_upper <- data_clean %>% 
  filter(sale_price > 340000) %>% 
  nrow()
beyond_upper
[1] 61

房屋销售价格存在一定的右偏

2.6 不同年份不同区域的售价分布

Code
avg_price <- data_clean %>%
  group_by(yr_sold, neighborhood) %>%
  summarize(avg_saleprice = mean(sale_price, na.rm = TRUE), .groups = 'drop')

format_sale_price <- function(x) {
  paste0(formatC(x / 10000, format = "f", digits = 2), "w")
}

ggplot(avg_price, aes(x = yr_sold, y = fct_reorder(neighborhood, avg_saleprice), fill = avg_saleprice)) +
  geom_tile() +
  geom_text(aes(label = format_sale_price(avg_saleprice)), color = "white", size = 3, fontface = "bold") +
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "平均售价", labels = format_sale_price ) +
  labs(title = "房屋均价分布_年份_区域", x = "售卖年份", y = "区域") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),  # 旋转 x 轴标签以避免重叠
    axis.text.y = element_text(size = 8)  # 调整 y 轴标签大小以避免重叠
  )

NoRidge, StoneBr, NridgHt三个地区房屋售价在不同年份均处于较高水平

在样本分类比较多,且样本分布相对均匀,热力图很难一眼定位到全局极值,因此展示不同年份不同区域的售价均值的极值情况。

Code
max_avgprice_infor <- avg_price %>%
  slice(which.max(avg_saleprice)) %>%
  print()
# A tibble: 1 × 3
  yr_sold neighborhood avg_saleprice
    <int> <chr>                <dbl>
1    2007 NoRidge            399731.
Code
min_avgprice_infor <- avg_price %>%
  slice(which.min(avg_saleprice)) %>%
  print()
# A tibble: 1 × 3
  yr_sold neighborhood avg_saleprice
    <int> <chr>                <dbl>
1    2010 MeadowV             81333.

2.7 本章小节

  • 此数据集不存在重复数据,但是存在部分变量缺失比例较高的情况(缺失比例范围47% ~ 99%);
  • 61套房屋的销售价格偏高,后续探索需对销售价格进行处理;
  • ames_houseprice数据集包含2006年 - 2010年房屋销售数据;
  • 不同地区房屋售价存在较大差异,最高和最低值差31w左右。

3.房屋售价影响因素探索

3.1 数据预处理

3.1.1 房屋售价

由2.5小节可知房屋售价存在一定程度的右偏,因此对房屋售价进行log处理

Code
data_clean <- data_clean %>%
  mutate(log_price = log(sale_price))

3.1.2 变量选取

在2.6节中发现不同区域的房屋售价均值差异较大,在区域这一变量基础上,选择 住宅概况,建筑性质,建筑离街道的距离,占地面积, 住宅类别, 房屋售卖时间, 房屋斜率,地上居住面积房屋售卖时间, 房屋售卖情况 11个影响因素进行分析 (ms_sub_class, ms_zoning, lot_frontage, lot_area, bldg_type, land_slope, neighborhood, gr_liv_area, yr_sold, sale_condition)

Code
curr_data <- data_clean %>% 
  select(id,
         log_price,
         ms_sub_class, 
         ms_zoning,
         lot_frontage, 
         lot_area, 
         bldg_type,
         land_slope,
         year_built, 
         neighborhood, 
         gr_liv_area, 
         yr_sold, 
         sale_condition)

3.1.2 影响因素数据处理

由2.3小节可知, lot_frontage变量存在缺失值,且缺失比例为17.74%,该变量能有效反应其交通便利性,保留该变量去除缺失数据 (注:condition1 变量表示房屋附近是否由主干道、社区等,并不能直接替代距离表示信息)

Code
# 去掉缺失值数据
curr_data <- curr_data %>% 
  drop_na()

# 合成变量:将房屋售卖时间 修改为房屋售卖前已建造时间,这样更符合购买者在买房的实际场景。 
curr_data <- curr_data %>% 
  mutate(already_built_year = yr_sold - year_built) %>%
  select(-c(yr_sold, year_built))

# 对所有分类数据进行因子转化
curr_data <- curr_data %>% 
  mutate(across(where(is.character), as.factor))

curr_data$ms_sub_class <- factor(curr_data$ms_sub_class) 

3.2 影响因素分析

3.2.1 数值型影响因素分析

离散型: already_built_year 连续型:lot_frontage, lot_area, gr_liv_area

3.2.1.1 房屋价格与已建造年份

Code
ggplot(curr_data, aes(x = already_built_year, y = log_price, color = neighborhood)) +
  geom_point(alpha = 0.5, position = position_jitter(width = 0.3, height = 0), size = 1.5) +
  geom_smooth(method = "loess", se = FALSE, color = "blue", size = 1.2) +
  labs(
    title = "房屋价格与已建造年份",
    x = "已建造年份",
    y = "价格(对数)",
    color = "区域"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.title = element_text(face = "bold"),
    legend.position = "right",
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "grey80")
  ) +
  scale_x_continuous(breaks = seq(0, max(curr_data$already_built_year, na.rm = TRUE), by = 10), 
                     limits = c(min(curr_data$already_built_year, na.rm = TRUE), 
                                max(curr_data$already_built_year, na.rm = TRUE))) +
  scale_y_continuous(limits = c(min(curr_data$log_price, na.rm = TRUE), 
                                max(curr_data$log_price, na.rm = TRUE)))

  • 由图可得,房屋已建造年份在0 - 30年时,房屋价格和已建造时间成反比例关系,即房屋越新,价格越高;
  • 随着房屋已建造年份增加, 图中拟合曲线的斜率绝对值在逐渐减少,即40 - 100年的建造年份,对房屋价格影响不大。

建造年份在100年及以上,房屋价格与建造年份出现正比例关系,检查数据量。

Code
beyond_100_num <- curr_data %>% 
  filter(already_built_year >= 100) %>% 
  nrow() %>% 
  print()
[1] 31

100年及以上的房屋只有31套,样本量较少。

在2.6节中已知道区域对房屋价格存在影响,因此探究每个区域的建造年份与房屋价格的相关性。

Code
ggplot(curr_data, aes(x = already_built_year, y = log_price)) +
  geom_point(color = "steelblue", alpha = 0.5, size = 1.5) +  # 绘制散点,统一颜色
  geom_smooth(method = "lm", se = FALSE, color = "darkred", size = 1.2) +  # 拟合线性回归线,统一颜色
  labs(
    title = "房屋价格与已建造年份",
    x = "年份",
    y = "售价"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.title = element_text(face = "bold"),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "grey80")
  ) +
  facet_wrap(~ neighborhood, scales = "fixed") +  # 共用同一个y轴
  coord_cartesian(ylim = c(min(curr_data$log_price, na.rm = TRUE), 
                           max(curr_data$log_price, na.rm = TRUE)),  # 设置y轴范围
                  xlim = c(min(curr_data$already_built_year, na.rm = TRUE), 
                           max(curr_data$already_built_year, na.rm = TRUE))) +
  scale_x_continuous(breaks = c(0, 30, 65, 100))

  • 可发现不同区域房屋已建造年份差异较大,Blmngtn, CollgCr, NoRidge, NridgHt,Somerst, StoneBr房屋大多30年以内,
  • 而IDOTRR, SWISU都在30年以后;
  • 不同区域的房屋基本符合30年内房屋越新,价格越高的规律;当房屋为30年以上时,房屋建造年份对房屋售价影响降低。

3.2.1.2 房屋价格与街道距离

采用相同方法,探究房屋价格与街道距离关系

Code
summary(curr_data$lot_frontage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21.00   59.00   69.00   70.05   80.00  313.00 

距离大于300英尺房屋数量

Code
beyond_300_num <- curr_data %>% 
  filter(lot_frontage >= 300) %>% 
  nrow() %>% 
  print()
[1] 2

为更好的显示数据分布,去除大于300的数据

Code
curr_data %>%
  filter(lot_frontage < 300) %>% 
  ggplot(aes(x = lot_frontage, y = log_price)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE, formula = "y ~ x", fullrange = TRUE) +
  facet_wrap(~neighborhood) +
  theme(strip.background = element_blank())

由散点图得街道距离与房屋价格相关性较低

3.2.1.2 房屋价格与占地面积、地上居住面积

使用皮尔逊相关系数验证分布验证占地面积、地上居住面积与房屋价格关系相关性

占地面积

Code
cor_test_lotarea <- cor.test(curr_data$lot_area, curr_data$log_price, method = "pearson")
print(cor_test_lotarea)

    Pearson's product-moment correlation

data:  curr_data$lot_area and curr_data$log_price
t = 10.633, df = 1199, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2409742 0.3443868
sample estimates:
      cor 
0.2935391 

p值远小于0.05,表明建造面积与房屋价格显著相关,相关系数0.293

地上居住面积

Code
cor_test_grlivarea <- cor.test(curr_data$gr_liv_area, curr_data$log_price, method = "pearson")
print(cor_test_grlivarea)

    Pearson's product-moment correlation

data:  curr_data$gr_liv_area and curr_data$log_price
t = 33.45, df = 1199, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6643236 0.7229351
sample estimates:
      cor 
0.6947811 

p值远小于0.05,表明建造面积与房屋价格显著相关,相关系数0.694 相对于占地面积,地上居住面积与房屋价格相关性更高。

3.2.2 分类型影响因素分析

采用方差分析判断分类型影响因素与房屋价格的相关性

3.2.2.1 房屋价格与住宅概况

Code
anova_result_mssubclass <- aov(log_price ~ ms_sub_class, data = curr_data)
summary(anova_result_mssubclass)
               Df Sum Sq Mean Sq F value Pr(>F)    
ms_sub_class   14   74.0   5.286   46.89 <2e-16 ***
Residuals    1186  133.7   0.113                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

方差分析结果表明住宅概况与房屋价格显著相关

3.2.2.2 房屋价格与建筑性质

Code
anova_result_mszoning <- aov(log_price ~ ms_zoning, data = curr_data)
summary(anova_result_mszoning)
              Df Sum Sq Mean Sq F value Pr(>F)    
ms_zoning      4  39.12   9.781   69.39 <2e-16 ***
Residuals   1196 168.59   0.141                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

方差分析结果表明住宅概况与房屋价格显著相关

3.2.2.3 房屋价格与 住宅类别

Code
anova_result_bldgtype <- aov(log_price ~ bldg_type, data = curr_data)
summary(anova_result_bldgtype)
              Df Sum Sq Mean Sq F value   Pr(>F)    
bldg_type      4   9.47  2.3679   14.29 2.14e-11 ***
Residuals   1196 198.24  0.1658                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

方差分析结果表明住宅类别与房屋价格显著相关

3.2.2.4 房屋价格与住宅类别

Code
anova_result_bldgtype <- aov(log_price ~ bldg_type, data = curr_data)
summary(anova_result_bldgtype)
              Df Sum Sq Mean Sq F value   Pr(>F)    
bldg_type      4   9.47  2.3679   14.29 2.14e-11 ***
Residuals   1196 198.24  0.1658                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

方差分析结果表明住宅类别与房屋价格显著相关

3.2.2.5 房屋价格与建筑斜率

Code
anova_result_landslope <- aov(log_price ~ land_slope, data = curr_data)
summary(anova_result_landslope)
              Df Sum Sq Mean Sq F value Pr(>F)
land_slope     2   0.04  0.0187   0.108  0.898
Residuals   1198 207.67  0.1734               
  • 方差分析结果表明住宅概况与房屋价格不相关
  • 但是房屋倾斜程度肯定会影响购买者的判断,方差分析结果可能是因为代表房屋斜率非常大Sev占比非常低

下面绘制条形图查看不同斜率的数据分布

Code
slope_counts <- curr_data %>%
  group_by(land_slope) %>%
  summarise(count = n()) %>%
  arrange(count)

ggplot(slope_counts, aes(x = count, y = reorder(land_slope, count), fill = land_slope)) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(label = count), hjust = -0.1, color = "black", size = 3) +  
  labs(
    title = "房屋斜率取值分布",
    x = "观察值",
    y = "数量"
  ) +
  scale_fill_brewer(palette = "Set3") +  
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank(),
    legend.position = "right" 
  )

在选定影响因素之前,应先查看数据分布现状,再决定是否对当前影响因素进行分析。

3.2.2.6 房屋价格与售卖情况

Code
anova_result_salecondition <- aov(log_price ~ sale_condition, data = curr_data)
summary(anova_result_salecondition)
                 Df Sum Sq Mean Sq F value Pr(>F)    
sale_condition    5  31.75   6.351   43.13 <2e-16 ***
Residuals      1195 175.96   0.147                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

方差分析结果表明售卖情况与房屋价格显著相关

绘制小提琴图查看数据分布

Code
num_levels <- nlevels(curr_data$sale_condition)

pastel_palette <- brewer.pal(n = num_levels, name = "Pastel1")

ggplot(curr_data, aes(x = sale_condition, y = log_price, fill = sale_condition)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75), alpha = 0.6, scale = "width") +
  geom_jitter(aes(color = after_scale("blue")),  
              position = position_jitter(width = 0.1, height = 0),
              alpha = 0.4, size = 1) +  
  scale_fill_manual(values = pastel_palette) +
  scale_color_manual(values = setNames("blue", "after_scale")) + 
  theme_minimal() +
  labs(title = "售卖情况与房屋价格",
       x = "售卖情况", 
       y = "售价")

4.建模

由第3章节内容可知,地上居住面积与房屋售价相关度高,使用地上居住面积和区域分析两个因素对房屋售价影响

Code
lm(log_price ~ 1 + gr_liv_area + neighborhood, data = curr_data) %>% 
  broom::tidy()
# A tibble: 26 × 5
   term                 estimate std.error statistic   p.value
   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)         11.6      0.0571      203.    0        
 2 gr_liv_area          0.000390 0.0000129    30.2   5.12e-149
 3 neighborhoodBlueste -0.317    0.153        -2.07  3.84e-  2
 4 neighborhoodBrDale  -0.498    0.0741       -6.73  2.71e- 11
 5 neighborhoodBrkSide -0.390    0.0611       -6.38  2.59e- 10
 6 neighborhoodClearCr -0.124    0.0781       -1.59  1.12e-  1
 7 neighborhoodCollgCr -0.0148   0.0570       -0.260 7.95e-  1
 8 neighborhoodCrawfor -0.108    0.0627       -1.73  8.42e-  2
 9 neighborhoodEdwards -0.411    0.0580       -7.08  2.41e- 12
10 neighborhoodGilbert -0.0623   0.0613       -1.02  3.10e-  1
# ℹ 16 more rows
  • 包含截距项 (1): 这是最常见的做法。截距项允许模型在没有任何预测变量影响的情况下,估计一个基础水平的因变量值。
  • 截距项 (Intercept) 截距项表示当所有预测变量为0时,log_price 的预期值。在这个模型中,截距项的p值极小(< 0.001),表明它在统计上非常显著。这意味着即使没有考虑其他因素,模型仍然能够提供一个基础的销售价格估计。
  • 地上居住面积 (gr_liv_area)的系数为0.638,表示每增加1个单位的对数转换后的地上居住面积,log_price 预计会增加0.000390个单位。这个系数的p值也非常小(< 0.001),表明地上居住面积对销售价格有显著的正向影响。