Information

The real estate valuation data set contains historical market data collected from Sindian Dist., New Taipei City, Taiwan. It was donated on 8/17/2018. The purpose of this analysis is to model the house price per unit area in New Taipei City with a linear regression model.

Data Preparation

## # A tibble: 414 × 8
##       No `X1 transaction date` X2 hous…¹ X3 di…² X4 nu…³ X5 la…⁴ X6 lo…⁵ Y hou…⁶
##    <dbl>                 <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1     1                 2013.      32      84.9      10    25.0    122.    37.9
##  2     2                 2013.      19.5   307.        9    25.0    122.    42.2
##  3     3                 2014.      13.3   562.        5    25.0    122.    47.3
##  4     4                 2014.      13.3   562.        5    25.0    122.    54.8
##  5     5                 2013.       5     391.        5    25.0    122.    43.1
##  6     6                 2013.       7.1  2175.        3    25.0    122.    32.1
##  7     7                 2013.      34.5   623.        7    25.0    122.    40.3
##  8     8                 2013.      20.3   288.        6    25.0    122.    46.7
##  9     9                 2014.      31.7  5512.        1    25.0    121.    18.8
## 10    10                 2013.      17.9  1783.        3    25.0    122.    22.1
## # … with 404 more rows, and abbreviated variable names ¹​`X2 house age`,
## #   ²​`X3 distance to the nearest MRT station`,
## #   ³​`X4 number of convenience stores`, ⁴​`X5 latitude`, ⁵​`X6 longitude`,
## #   ⁶​`Y house price of unit area`

I will rename the columns to shorter names without spaces.

df1 <- df %>%
  rename("Transaction_date" = "X1 transaction date",
         "House_age" = "X2 house age",
         "Distance_MRTstation" = "X3 distance to the nearest MRT station", 
         "Convenience_stores" = "X4 number of convenience stores",
         "Latitude" = "X5 latitude",
         "Longitude" = "X6 longitude",
         "House_price" = "Y house price of unit area")

Now I will check for missing values.

which(is.na(df1))
## integer(0)

There are no missing values.
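
`which(is.na(df1))` reports positions in the flattened data, so a per-column count can be easier to read when values are missing. The following is a minimal sketch on a small made-up data frame; the `toy` object and its values are hypothetical and are not part of the data set.

```r
# Hypothetical toy data frame with two missing values (not the real data)
toy <- data.frame(
  House_age   = c(32, NA, 13.3),
  House_price = c(37.9, 42.2, NA)
)

# Count missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# TRUE only if the whole data frame is free of missing values
no_missing <- !anyNA(toy)
no_missing
```

Applied to `df1`, every per-column count should be zero, which agrees with the `integer(0)` result above.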

summary(df1)
##        No        Transaction_date   House_age      Distance_MRTstation
##  Min.   :  1.0   Min.   :2013     Min.   : 0.000   Min.   :  23.38    
##  1st Qu.:104.2   1st Qu.:2013     1st Qu.: 9.025   1st Qu.: 289.32    
##  Median :207.5   Median :2013     Median :16.100   Median : 492.23    
##  Mean   :207.5   Mean   :2013     Mean   :17.713   Mean   :1083.89    
##  3rd Qu.:310.8   3rd Qu.:2013     3rd Qu.:28.150   3rd Qu.:1454.28    
##  Max.   :414.0   Max.   :2014     Max.   :43.800   Max.   :6488.02    
##  Convenience_stores    Latitude       Longitude      House_price    
##  Min.   : 0.000     Min.   :24.93   Min.   :121.5   Min.   :  7.60  
##  1st Qu.: 1.000     1st Qu.:24.96   1st Qu.:121.5   1st Qu.: 27.70  
##  Median : 4.000     Median :24.97   Median :121.5   Median : 38.45  
##  Mean   : 4.094     Mean   :24.97   Mean   :121.5   Mean   : 37.98  
##  3rd Qu.: 6.000     3rd Qu.:24.98   3rd Qu.:121.5   3rd Qu.: 46.60  
##  Max.   :10.000     Max.   :25.01   Max.   :121.6   Max.   :117.50

Research question

Using this historical real estate valuation data, can we predict the price per unit area of a house in New Taipei City, Taiwan?

Cases

There are 414 instances in this data set.

Data collection

The data is hosted on the UC Irvine Machine Learning Repository and may be used for any purpose, provided that appropriate credit is given.

Type of study

This is an observational study.

Data Source

Data was donated on 8/17/2018 to UC Irvine and is available here: https://archive-beta.ics.uci.edu/dataset/477/real+estate+valuation+data+set

Dependent variable

House price is our dependent variable in this regression model. This variable is quantitative.

Independent variable

There are 3 quantitative independent variables: house age, distance to the nearest MRT station, and number of convenience stores.
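
With one dependent variable and three predictors, the regression could be set up roughly as follows. This is only a sketch under stated assumptions: the data frame `sim` is simulated so the example runs on its own, the coefficients used to generate it are arbitrary, and only the column names are taken from `df1`.

```r
# Simulated stand-in data (NOT the real data set); column names follow df1
set.seed(1)
n <- 50
sim <- data.frame(
  House_age           = runif(n, 0, 40),
  Distance_MRTstation = runif(n, 20, 6000),
  Convenience_stores  = sample(0:10, n, replace = TRUE)
)

# Arbitrary coefficients, chosen only to give the simulated price some structure
sim$House_price <- 45 - 0.2 * sim$House_age -
  0.005 * sim$Distance_MRTstation +
  1.0 * sim$Convenience_stores + rnorm(n, sd = 5)

# Multiple linear regression of price on the three predictors
fit <- lm(House_price ~ House_age + Distance_MRTstation + Convenience_stores,
          data = sim)

# Estimated intercept and slopes
coef(fit)
```

With the real data, `data = df1` would take the place of `data = sim`.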

Relevant statistics

av_house <- mean(df1$`House_price`)
av_house
## [1] 37.98019

The average house price per unit area in this sample is 37.98 units, where one unit equals 10,000 New Taiwan Dollars per ping.
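
To make the unit concrete, the mean can be converted to New Taiwan Dollars. The sketch below assumes that 1 ping is about 3.3 square metres, a figure that does not appear in this report and should be checked against the data set's documentation. The variable names are illustrative.

```r
# Mean price reported above, in units of 10,000 NTD per ping
mean_price_units <- 37.98019

# Assumption: 1 ping is about 3.3 square metres
sqm_per_ping <- 3.3

# Convert to NTD per ping, then to NTD per square metre
ntd_per_ping <- mean_price_units * 10000
ntd_per_sqm  <- ntd_per_ping / sqm_per_ping

round(c(ntd_per_ping = ntd_per_ping, ntd_per_sqm = ntd_per_sqm))
```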

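library(corrplot)

# Drop the row index and the transaction date before computing correlations
df2 <- df1 %>%
  select(-No, -Transaction_date)

# Plot the full correlation matrix of the remaining variables
corrplot(cor(df2),
         method = "shade",  # shaded squares
         type = "full",     # show both triangles of the matrix
         diag = TRUE,       # include the diagonal
         tl.col = "black",  # colour of the variable labels
         bg = "white",      # background colour
         title = "",        # no plot title
         col = NULL)        # default colour palette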

Several of the variables are correlated with one another. The plot suggests that we should decide which variables to keep for further analysis before fitting any model.
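
One way to support that decision is to read the correlations with the outcome directly from the matrix and sort them by strength. The sketch below uses R's built-in `mtcars` data as a stand-in so that it runs on its own. With the real data, `df2` and `"House_price"` would presumably replace `mtcars` and `"mpg"`.

```r
# Stand-in example on the built-in mtcars data (not the real estate data)
cor_matrix <- cor(mtcars)

# Correlation of every other variable with the outcome column
cor_with_outcome <- cor_matrix[rownames(cor_matrix) != "mpg", "mpg"]

# Sort by absolute strength, strongest first
cor_sorted <- cor_with_outcome[order(abs(cor_with_outcome), decreasing = TRUE)]
round(cor_sorted, 2)
```

Predictors that correlate strongly with each other, as well as with the outcome, are candidates for closer inspection before modelling.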