4.0 MAIN ANALYSIS
Next month, your boss is moving to Sindian District in New
Taipei City, Taiwan. They want to buy a house and have asked you to
figure out what most impacts house price.
4.1 IMPORT DATA
df.house <- readxl::read_excel("Lab03_house.xlsx")
4.2 DATA SUMMARY
This data set is from UC Irvine’s Machine Learning Repository. It contains historical market real estate valuations from Sindian District, New Taipei City, Taiwan.
Object of Analysis (i.e., what my boss cares about): Houses currently for sale in New Taipei City, Taiwan
Object of Observation: Houses that were on sale in Sindian District, New Taipei City, Taiwan in 2012-2013
Population: All houses in New Taipei City, Taiwan
Variables:
- house age (years)
- distance to MRT station (m)
- number of stores within walking distance of the house, i.e. in its living circle (integer)
- latitude (degree)
- longitude (degree)
- house price per unit area (10k New Taiwan Dollars (TWD)/ping)
  - 1 ping is about 3.3 m²
Response Variable: House price
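Because prices are quoted in units of 10k TWD per ping, it can help to see them in a more familiar unit. The following is a minimal sketch, not part of the original lab code: `price_per_m2` and `PING_M2` are hypothetical names, the 3.3 m² figure is the approximation quoted in the variable list, and the input value is illustrative rather than taken from the data.

```r
# Approximate area of one ping in square metres (figure quoted in the variable list)
PING_M2 <- 3.3

# Convert a price in 10k TWD per ping to TWD per square metre.
# Hypothetical helper, not part of the lab code.
price_per_m2 <- function(price_10k_twd_per_ping) {
  price_10k_twd_per_ping * 10000 / PING_M2
}

# Illustrative input only (not a value from the data set)
example_price <- price_per_m2(33)
example_price
```

Under this approximation, a reported price of 33 corresponds to roughly 100,000 TWD per square metre.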
4.3 FILTER DATA
Explore Distances
#-------------------------------------------------------------------------------------
# HISTOGRAM OF DISTANCES
#-------------------------------------------------------------------------------------
p.histo_dist <- ggplot(df.house)+
  geom_histogram(aes(Distance.Station), fill = "#7C4981")+ #fill = purple
  labs(title = "Distance from Home to Nearest MRT Station (m)",
       subtitle = "Sindian District, New Taipei City, Taiwan (2012-2013)")+
  ylab("Count")+
  xlab("Distance to Nearest MRT Station (m)")
p.histo_dist #print graph
#-------------------------------------------------------------------------------------
# SUMMARY STATISTICS
# displayed in HTML using inline code
#-------------------------------------------------------------------------------------
summary.distance <- unlist(round(summary(df.house$Distance.Station)))
# add the sd of the raw distances (not of the six summary values) to the stats
summary.distance <- append(summary.distance, round(sd(df.house$Distance.Station)))
# ~~~ INLINE CODE ~~~
# summary.distance index:
# [1]min [2]Q1 [3]median [4]mean [5]Q3 [6]max [7]standard deviation
#
# examples of inline code:
# `r summary.distance[1]` prints minimum value in meters
Mean: 1084
Median: 492
Min: 23
Max: 6488
St.Dev.: 1262
The distribution of distances to the nearest station is irregular, with a definite rightward skew (the mean is more than twice the median) and a peak around 450-550 meters.
Filter out distances of 3 km or more
df.sub_house <- df.house %>%
  filter(Distance.Station < 3000)
Observations Removed: 41
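The count of removed observations can be checked by comparing row counts before and after the filter. This is a minimal sketch on made-up distances, written in base R so that it runs without dplyr; `toy`, `toy_sub`, and `n_removed` are hypothetical names, and the values are not from the lab data.

```r
# Toy distances in metres (hypothetical values, not the lab data)
toy <- data.frame(Distance.Station = c(120, 450, 2999, 3000, 6488))

# Same condition as the filter above, written as a base R row subset
toy_sub <- toy[toy$Distance.Station < 3000, , drop = FALSE]

# Number of rows dropped by the filter
n_removed <- nrow(toy) - nrow(toy_sub)
n_removed
```

Because the condition is a strict `<`, a house at exactly 3000 m is dropped along with those farther away. Applied to the lab data, the same subtraction would be `nrow(df.house) - nrow(df.sub_house)`.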
4.4 EXPLORATORY ANALYSIS
#-------------------------------------------------------------------------------------
# HISTOGRAM OF PRICES
#-------------------------------------------------------------------------------------
p.histo_price <- ggplot(df.house)+
  geom_histogram(aes(House.Price), fill = "#7C4981")+ #fill = purple
  labs(title = "Prices of Homes (10k New Taiwan Dollars (TWD) per Ping)",
       subtitle = "Sindian District, New Taipei City, Taiwan (2012-2013)")+
  ylab("Count")+
  xlab("House Price (10k New Taiwan Dollars/Ping)")
p.histo_price #print graph
#--------------------------------------------------------------------------------------
# SUMMARY STATS: PRICE
# displayed in HTML with inline code
#--------------------------------------------------------------------------------------
summary.price <- unlist(round(summary(df.house$House.Price)))
# add the sd of the raw prices (not of the six summary values) to the summary stats
summary.price <- append(summary.price, round(sd(df.house$House.Price)))
pricesort <- sort(as.numeric(df.house$House.Price))
pricefix <- pricesort[-length(pricesort)] #drop the largest price (the possible outlier)
summary.pricefix <- round(summary(pricefix))
# ~~~ INLINE CODE ~~~
# summary.price index:
# [1]min [2]Q1 [3]median [4]mean [5]Q3 [6]max [7]Standard Dev
# examples:
# `r summary.price[4]` displays mean
# `r summary.pricefix[6]` displays max after outlier is removed
Mean: 38
Median: 38
Min: 8
Max: 118
St.Dev.: 14
The distribution of house prices appears approximately normal, with one possible outlier at 117.5 (10k TWD/ping).
With the outlier removed:
Mean: 38
Median: 38
Min: 8
Max: 78
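The summary above drops only the single largest price. As an alternative sketch, and not the method used in this lab, a common rule of thumb flags values that lie more than 1.5 times the interquartile range beyond the quartiles. The prices below are made up for illustration, and `toy_price`, `is_outlier`, and `price_clean` are hypothetical names.

```r
# Toy prices in 10k TWD/ping (hypothetical values, not the lab data)
toy_price <- c(8, 25, 30, 38, 40, 45, 52, 78, 117.5)

# Quartiles and interquartile range
q <- unname(quantile(toy_price, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the first and third quartiles
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Flag values outside the fences, then keep the rest
is_outlier <- toy_price < lower_fence | toy_price > upper_fence
toy_price[is_outlier]
price_clean <- toy_price[!is_outlier]
summary(price_clean)
```

Unlike dropping the maximum, this rule can flag several values or none at all, so the flagged points are worth inspecting before removing them.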
p.price_age <- ggplot(df.house) +
  geom_point(aes(House.Age, House.Price), shape = 4, color = "#068618", size=2)+
  labs(title = "Price of House by Age",
       subtitle = "Sindian District, New Taipei City, Taiwan (2012-2013)",
       caption = "Source: UC Irvine Machine Learning Repository")+
  ylab("Price (10k TWD/ping)")+
  xlab("Age of House (years)")+
  scale_x_continuous()+
  scale_y_continuous()
p.price_age
4.5 SPATIAL
The data set is vector point data: each point represents a single house, located by its longitude and latitude.
MAKING MAPS
house.spatial <- st_as_sf(df.house, coords = c("Longitude", "Latitude"), crs = 4326)
#Darker color = higher value
plot(house.spatial)
#--------------------------Better Plot---------------------------------------
tmap_mode("view") #interactive
#tmap_mode("plot") #static
# Command from the tmap library and plot
tm_basemap("Esri.WorldTopoMap") +
  qtm(house.spatial,                # data
      symbols.col="House.Price",    # column for the symbols
      symbols.alpha=0.9,            # transparency
      symbols.size=.2,              # how big
      symbols.palette="Spectral",   #colors from https://colorbrewer2.org
      symbols.style="fisher")       # color breaks