2.0 SET UP

R

#===================================================================================
#                                     SET UP
#===================================================================================
knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE,
    fig.align = "center"
)

options(scipen=999) #scientific notation = OFF

#----------------------------------------------------------------------------------
#                                    LIBRARY
#----------------------------------------------------------------------------------
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(geosphere)
library(maps)
## 
## Attaching package: 'maps'
## 
## The following object is masked from 'package:purrr':
## 
##     map
library(readxl)
library(sf)
## Linking to GEOS 3.9.1, GDAL 3.4.3, PROJ 7.2.1; sf_use_s2() is TRUE
library(sp)
library(tmap)

#---------------------------------------------------------------------------------
#                                 SET GRAPH THEME
#---------------------------------------------------------------------------------
theme_set(theme_classic()+
            theme(plot.title = element_text(hjust=0.5),    #center title
                  plot.subtitle = element_text(hjust=0.5), #center subtitle
                  axis.ticks = element_blank()))           #remove axis ticks

CSS

#=========================================================================================
#                                 HTML OUTPUT FORMATTING
#                       Copy of embedded code from original .Rmd
#=========================================================================================
<style type='text/css'>

  h1, h4, h5, h6 {
    text-align: center;
    color: black;
    font-size: 36px;
  }

  h2 {
    text-align: center;
    color: black;
    font-size: 24px;
  }

  h3 {
    text-align: center;
    color: black;
    font-size: 20px;
  }

</style>

3.0 CODE SHOWCASE

3.1 DATA WRANGLING & DATA CAMP

#-------------------------------------------------------------------------------
#                            PROOF OF DATA CAMP COMPLETION
#-------------------------------------------------------------------------------
knitr::include_graphics("364DataCampProof.jpeg")


4.0 MAIN ANALYSIS


Next month, your boss is moving to Sindian District in New Taipei City, Taiwan. They want to buy a house and have asked you to figure out what most impacts house price.

4.1 IMPORT DATA

df.house <- readxl::read_excel("Lab03_house.xlsx")

4.2 DATA SUMMARY

This data set is from UC Irvine’s Machine Learning Repository. It contains historical market real estate valuations from Sindian District, New Taipei City, Taiwan.

Object of Analysis (i.e., what my boss cares about): Houses currently for sale in New Taipei City, Taiwan
Object of Observation: Houses that were on sale in Sindian District, New Taipei City, Taiwan in 2012-2013.
Population: All houses in New Taipei City, Taiwan
Variables:

  • house age (years)
  • distance to MRT station (m)
  • number of stores in living circle walking distance (integer number)
  • latitude (degree)
  • longitude (degree)
  • house price by unit area (10k New Taiwan Dollar (TWD)/Ping)
    • 1 ping is about 3.3 m²

Response Variable: House price

4.3 FILTER DATA

Explore Distances

#-------------------------------------------------------------------------------------
#                             HISTOGRAM OF DISTANCES
#-------------------------------------------------------------------------------------
p.histo_dist <- ggplot(df.house)+
  geom_histogram(aes(Distance.Station), fill = "#7C4981")+  #fill = purple
  labs(title = "Distance from Home to Nearest MRT Station (m)",
       subtitle = "Sindian District, New Taipei City, Taiwan (2012-2013)")+
  ylab("Count")+
  xlab("Distance to Nearest MRT Station (m)")

p.histo_dist  #print graph

#-------------------------------------------------------------------------------------
#                             SUMMARY STATISTICS
#                      displayed in HTML using inline code
#-------------------------------------------------------------------------------------
summary.distance <- unlist(round(summary(df.house$Distance.Station)))
#add sd to stats: computed from the raw distances, not from the six summary values
summary.distance <- append(summary.distance, round(sd(df.house$Distance.Station)))

                       
#                              ~~~  INLINE CODE  ~~
#    summary.distance index:
#             [1]min [2]Q1 [3]median [4]mean [5]Q3 [6]max [7]standard deviation
# 
#    examples of inline code: 
#           `r summary.distance[1]` prints minimum value in meters

Mean: 1084
Median: 492
Min: 23
Max: 6488
St.Dev.: 1262

The distribution of distances to the nearest station is irregular, with a pronounced right skew and a peak around 450-550 meters.
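As a minimal sketch of how such a summary vector can be built and indexed for inline code, the block below uses a small made-up vector (`toy.distance`, a hypothetical stand-in for the `Distance.Station` column) and appends the standard deviation of the raw values as a seventh element. The names and values are assumptions for illustration only.

```r
# Toy distances in meters (illustrative values only, not the lab data)
toy.distance <- c(23, 90, 290, 490, 500, 1450, 6488)

# Six-number summary, rounded, kept as a plain named numeric vector
toy.summary <- round(unclass(summary(toy.distance)))

# Append the standard deviation of the raw values as element 7
toy.summary <- c(toy.summary, St.Dev. = round(sd(toy.distance)))

toy.summary[["Min."]]     # minimum, looked up by name
toy.summary[[3]]          # median, looked up by position
toy.summary[["St.Dev."]]  # standard deviation of the raw values
```

Looking values up by name is a little more robust than by position if the order of the vector ever changes.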

Filter out distances over 3 km

df.sub_house <- df.house %>% 
  filter(Distance.Station < 3000)

Observations Removed: 41
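
The number of removed observations can be derived from the row counts before and after filtering. A minimal sketch, using a hypothetical toy data frame (`toy.house`) in place of `df.house` and base R subsetting in place of `filter()`:

```r
# Toy data frame standing in for df.house (values are illustrative)
toy.house <- data.frame(Distance.Station = c(120, 450, 2999, 3000, 4100, 6488))

# Keep only rows strictly closer than 3 km to a station
toy.sub_house <- toy.house[toy.house$Distance.Station < 3000, , drop = FALSE]

# Observations removed = rows before minus rows after
n.removed <- nrow(toy.house) - nrow(toy.sub_house)
n.removed
```

Because the comparison is strict, a house exactly 3000 m from a station is dropped along with the more distant ones.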

4.4 EXPLORATORY ANALYSIS

#-------------------------------------------------------------------------------------
#                              HISTOGRAM OF PRICES
#-------------------------------------------------------------------------------------
p.histo_price <- ggplot(df.house)+
  geom_histogram(aes(House.Price), fill = "#7C4981")+  #fill = purple
  labs(title = "Prices of Homes in New Taiwan Dollars (TWD) per Ping",
       subtitle = "Sindian District, New Taipei City, Taiwan (2012-2013)")+
  ylab("Count")+
  xlab("House Price (10k New Taiwan Dollars/Ping)")

p.histo_price  #print graph

#--------------------------------------------------------------------------------------
#                               SUMMARY STATS: PRICE
#                          displayed in HTML with inline code
#--------------------------------------------------------------------------------------

summary.price <- unlist(round(summary(df.house$House.Price)))
#add sd to summary stats: computed from the raw prices, not from the six summary values
summary.price <- append(summary.price, round(sd(df.house$House.Price)))

pricesort <- sort(as.numeric(df.house$House.Price))
pricefix <- pricesort[1:(length(pricesort) - 1)]  #drop the largest price (the outlier)
summary.pricefix <- round(summary(pricefix))

#                         ~~~  INLINE CODE  ~~
#    summary.price index:
#             [1]min [2]Q1 [3]median [4]mean [5]Q3 [6]max [7]Standard Dev
#    examples:
#             `r summary.price[4]` displays mean
#             `r summary.pricefix[6]` displays max after outlier is removed

Mean: 38
Median: 38
Min: 8
Max: 118
St.Dev.: 14

House prices appear approximately normally distributed, with one possible outlier at 117.5 (10k TWD/ping).

With the outlier removed:

Mean: 38
Median: 38
Min: 8
Max: 78

p.price_age <- ggplot(df.house) +
  geom_point(aes(House.Age, House.Price), shape = 4, color = "#068618", size=2)+
  labs(title = "Price of House by Age",
       subtitle = "Sindian District, New Taipei City, Taiwan (2012-2013)",
       caption = "Source: UC Irvine Machine Learning Repository")+
  ylab("Price (10k TWD/ping)")+
  xlab("Age of House")+
  scale_x_continuous()+
  scale_y_continuous()
  
p.price_age

4.5 SPATIAL

The data set is vector point data: each point represents a single house.

MAKING MAPS

house.spatial <- st_as_sf(df.house,coords=c("Longitude","Latitude"),crs = 4326)

#Darker color = higher value
plot(house.spatial)

#--------------------------Better Plot---------------------------------------
tmap_mode("view")   #interactive
#tmap_mode("plot")  #static

# Command from the tmap library and plot
tm_basemap("Esri.WorldTopoMap") + 
     qtm(house.spatial, # data
         symbols.col="House.Price", # column for the symbols
         symbols.alpha=0.9, # transparency
         symbols.size=.2, # how big
         symbols.palette="Spectral", #colors from https://colorbrewer2.org
         symbols.style="fisher") # color breaks

4.6 RESPONSES TO QUESTIONS

  • The average house price is around 38 (10k TWD/ping), or roughly 380,000 TWD/ping.
  • Prices range from 8 to 118 (10k TWD/ping), or roughly 80,000 to 1,180,000 TWD/ping.
  • The relationship between house age and price is hard to pin down. Based on the scatter plot alone, there appears to be a moderately strong negative correlation until houses reach about 20 years of age, after which the trend switches to a moderate positive correlation.
  • At first glance, houses appear to be cheaper in the south. Plotting the data on a regional map, however, suggests that prices are more likely driven by distance from the city center, with larger distances corresponding to lower prices. The trend is noticeable in the south because mountains and rivers restrict where houses can be built in other directions. This spatial heterogeneity gives the false impression that latitude has a significant impact on home prices.
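
One way to check the apparent change in the age and price trend around 20 years is to compute the correlation separately on each side of that age. The sketch below uses simulated data (`toy.age`, `toy.price`) deliberately built with a V shape, so its numbers say nothing about the real data set. The 20-year cutoff is read off the scatter plot and is an assumption.

```r
set.seed(1)  # reproducible toy data

# Simulated ages (years) and prices with a dip near 20 years (illustrative only)
toy.age   <- runif(200, min = 0, max = 40)
toy.price <- 30 + 0.8 * abs(toy.age - 20) + rnorm(200, sd = 3)

# Split at the assumed cutoff of 20 years
young <- toy.age <= 20

cor.young <- cor(toy.age[young],  toy.price[young])   # expected to be negative
cor.old   <- cor(toy.age[!young], toy.price[!young])  # expected to be positive
cor.all   <- cor(toy.age, toy.price)                  # pooled correlation

round(c(young = cor.young, old = cor.old, all = cor.all), 2)
```

With a V-shaped relationship, the pooled correlation can sit near zero even though each half shows a clear trend, which is why a single correlation coefficient would be misleading here.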

5.0 ABOVE AND BEYOND

crs stands for Coordinate Reference System. The numbers are from a public Geodetic Parameter Dataset originally created by the European Petroleum Survey Group (EPSG) in 1985.

4326 is the EPSG code for WGS 84, a geographic (latitude/longitude) coordinate reference system rather than a projected one.
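
As a small illustration of what the `crs` argument controls, the sketch below builds one point with `sf`, checks that EPSG:4326 is treated as longitude/latitude, and reprojects it. The coordinates are an arbitrary point chosen for illustration, and EPSG:3826 (TWD97 / TM2 zone 121) is an assumed choice of projected system for Taiwan; neither comes from the lab.

```r
library(sf)

# One point as (longitude, latitude) in WGS 84; coordinates are illustrative
pt.wgs84 <- st_sfc(st_point(c(121.54, 24.97)), crs = 4326)
st_is_longlat(pt.wgs84)   # geographic CRS, units are degrees

# Reproject to a projected CRS (assumed here: EPSG:3826, units are meters)
pt.twd97 <- st_transform(pt.wgs84, 3826)
st_is_longlat(pt.twd97)   # no longer longitude/latitude

st_coordinates(pt.twd97)  # easting and northing in meters
```

Distances and areas are easier to reason about in a projected system, since its coordinates are in meters rather than degrees.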

Additional Information

Above and Beyond and Beyond

  1. Used inline code to display summary stats with minimal redundant code
     • Used indexes to call specific values (e.g., summary.price[6] is max price)
     • Also used unlist(), append(), and sort() functions to manage data
  2. Defined custom CSS in .Rmd file to align & resize HTML headers (see second tab at very top of document)
  3. Used theme_set() to align title & subtitle and remove tick marks in ggplots