Linkedin: https://www.linkedin.com/in/jeffry-wijaya-087a191b5/
RPubs: https://rpubs.com/invokerarts/


1 Introduction

The aim of this report is to apply Exploratory Data Analysis (EDA) to house sales in King County, Washington State, USA. The data set consists of historical records of houses sold between May 2014 and May 2015.

  • The dataset consists of 21 variables and 21597 observations.
  • Variables, with description and data type:
    • id: Identifier for a house (Numeric)
    • date: Date the house was sold (String)
    • price: Sale price; this is the prediction target (Numeric)
    • bedrooms: Number of bedrooms per house (Numeric)
    • bathrooms: Number of bathrooms per house (Numeric)
    • sqft_living: Square footage of the home (Numeric)
    • sqft_lot: Square footage of the lot (Numeric)
    • floors: Total floors (levels) in the house (Numeric)
    • waterfront: Whether the house has a view to a waterfront, coded 0 or 1 (Numeric)
    • view: Has been viewed; values range from 0 to 4 (Numeric)
    • condition: Overall condition, from 1 (worn-out property) to 5 (excellent); see http://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r#g (Numeric)
    • grade: Overall grade given to the housing unit under the King County grading system, from 1 (poor) to 13 (excellent) (Numeric)
    • sqft_above: Square footage of the house excluding the basement (Numeric)
    • sqft_basement: Square footage of the basement (Numeric)
    • yr_built: Year the house was built (Numeric)
    • yr_renovated: Year the house was renovated (Numeric)
    • zipcode: ZIP code (Numeric)
    • lat: Latitude coordinate (Numeric)
    • long: Longitude coordinate (Numeric)
    • sqft_living15: Living room area in 2015 (implies some renovations); this might or might not have affected the lot size area (Numeric)
    • sqft_lot15: Lot size area in 2015 (implies some renovations) (Numeric)
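library(dplyr)  # provides %>% and select_if()
library(DT)     # provides datatable()
# Load the King County house sales data
data1 <- read.csv("kc_house_data.csv")
# List the row indices of missing values in each column
apply(is.na(data1), 2, which)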
## integer(0)
ncol(data1)
## [1] 21
data1 <- na.omit(data1)
datatable(data1)
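A per-column count of missing values is often easier to read than the index lists returned by apply(is.na(data1), 2, which). The following is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical and only stand in for data1.

```r
# Hypothetical stand-in for data1, with two missing values added on purpose
toy <- data.frame(
  price    = c(221900, NA, 180000),
  bedrooms = c(3, 3, NA),
  grade    = c(7, 7, 6)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Drop incomplete rows, as na.omit(data1) does in the report
toy_clean <- na.omit(toy)
print(nrow(toy_clean))
```

Applied to data1, the same colSums(is.na(...)) call would return one count per column, which makes it easy to see whether na.omit() removes any rows.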
Rank <- table(data1$grade)
Rank
## 
##    3    4    5    6    7    8    9   10   11   12   13 
##    1   27  242 2038 8974 6065 2615 1134  399   89   13
prop.table(table(data1$grade))
## 
##            3            4            5            6            7            8 
## 4.630273e-05 1.250174e-03 1.120526e-02 9.436496e-02 4.155207e-01 2.808260e-01 
##            9           10           11           12           13 
## 1.210816e-01 5.250729e-02 1.847479e-02 4.120943e-03 6.019355e-04
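The proportions above are printed in scientific notation, which is hard to scan. As a sketch, the counts from the table(data1$grade) output can be turned into rounded percentages; the vector name `grade_counts` is introduced here only for illustration, and its values are copied from the output in this report.

```r
# Grade counts copied from the table(data1$grade) output in this report
grade_counts <- c(`3` = 1, `4` = 27, `5` = 242, `6` = 2038, `7` = 8974,
                  `8` = 6065, `9` = 2615, `10` = 1134, `11` = 399,
                  `12` = 89, `13` = 13)

# Convert counts to percentages, rounded to two decimals
grade_pct <- round(100 * grade_counts / sum(grade_counts), 2)
print(grade_pct)

# Grades 7 and 8 together cover most of the houses
print(sum(grade_pct[c("7", "8")]))
```

With the real data, round(100 * prop.table(Rank), 2) should give the same figures directly from the Rank table.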
# Keep only the numeric columns (this drops the string column date)
DataSet <- data1 %>% select_if(is.numeric)
names(DataSet)
##  [1] "id"            "price"         "bedrooms"      "bathrooms"    
##  [5] "sqft_living"   "sqft_lot"      "floors"        "waterfront"   
##  [9] "view"          "condition"     "grade"         "sqft_above"   
## [13] "sqft_basement" "yr_built"      "yr_renovated"  "zipcode"      
## [17] "lat"           "long"          "sqft_living15" "sqft_lot15"
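The date column is missing from the list above because it is read as a string. If the sale date is needed later, it can be converted to a Date first. The sketch below assumes the strings look like "20141013T000000"; that layout is an assumption about the file, not something shown in this report, and `raw_dates` is a hypothetical example vector.

```r
# Hypothetical date strings; the "YYYYMMDDT000000" layout is an assumption
raw_dates <- c("20141013T000000", "20150225T000000", "20140502T000000")

# Keep the first eight characters (YYYYMMDD) and parse them as dates
sale_date <- as.Date(substr(raw_dates, 1, 8), format = "%Y%m%d")
print(sale_date)

# Earliest and latest sale dates in the example vector
print(range(sale_date))
```

If the assumption holds, the same two steps applied to data1$date would allow checking the May 2014 to May 2015 range stated in the introduction.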
summary(DataSet)
##        id                price            bedrooms        bathrooms    
##  Min.   :1.000e+06   Min.   :  78000   Min.   : 1.000   Min.   :0.500  
##  1st Qu.:2.123e+09   1st Qu.: 322000   1st Qu.: 3.000   1st Qu.:1.750  
##  Median :3.905e+09   Median : 450000   Median : 3.000   Median :2.250  
##  Mean   :4.580e+09   Mean   : 540297   Mean   : 3.373   Mean   :2.116  
##  3rd Qu.:7.309e+09   3rd Qu.: 645000   3rd Qu.: 4.000   3rd Qu.:2.500  
##  Max.   :9.900e+09   Max.   :7700000   Max.   :33.000   Max.   :8.000  
##   sqft_living       sqft_lot           floors        waterfront      
##  Min.   :  370   Min.   :    520   Min.   :1.000   Min.   :0.000000  
##  1st Qu.: 1430   1st Qu.:   5040   1st Qu.:1.000   1st Qu.:0.000000  
##  Median : 1910   Median :   7618   Median :1.500   Median :0.000000  
##  Mean   : 2080   Mean   :  15099   Mean   :1.494   Mean   :0.007547  
##  3rd Qu.: 2550   3rd Qu.:  10685   3rd Qu.:2.000   3rd Qu.:0.000000  
##  Max.   :13540   Max.   :1651359   Max.   :3.500   Max.   :1.000000  
##       view          condition        grade          sqft_above  
##  Min.   :0.0000   Min.   :1.00   Min.   : 3.000   Min.   : 370  
##  1st Qu.:0.0000   1st Qu.:3.00   1st Qu.: 7.000   1st Qu.:1190  
##  Median :0.0000   Median :3.00   Median : 7.000   Median :1560  
##  Mean   :0.2343   Mean   :3.41   Mean   : 7.658   Mean   :1789  
##  3rd Qu.:0.0000   3rd Qu.:4.00   3rd Qu.: 8.000   3rd Qu.:2210  
##  Max.   :4.0000   Max.   :5.00   Max.   :13.000   Max.   :9410  
##  sqft_basement       yr_built     yr_renovated        zipcode     
##  Min.   :   0.0   Min.   :1900   Min.   :   0.00   Min.   :98001  
##  1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.00   1st Qu.:98033  
##  Median :   0.0   Median :1975   Median :   0.00   Median :98065  
##  Mean   : 291.7   Mean   :1971   Mean   :  84.46   Mean   :98078  
##  3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.00   3rd Qu.:98118  
##  Max.   :4820.0   Max.   :2015   Max.   :2015.00   Max.   :98199  
##       lat             long        sqft_living15    sqft_lot15    
##  Min.   :47.16   Min.   :-122.5   Min.   : 399   Min.   :   651  
##  1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490   1st Qu.:  5100  
##  Median :47.57   Median :-122.2   Median :1840   Median :  7620  
##  Mean   :47.56   Mean   :-122.2   Mean   :1987   Mean   : 12758  
##  3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360   3rd Qu.: 10083  
##  Max.   :47.78   Max.   :-121.3   Max.   :6210   Max.   :871200
DataSet