Email: sherlytaurinsiri@gmail.com
LinkedIn: https://www.linkedin.com/in/sherly-taurin-8a50221b4/
RPubs: https://rpubs.com/sherlytaurin/
The aim of this report is to apply Exploratory Data Analysis (EDA) to house sales in King County, Washington State, USA. The data set consists of historical records of houses sold between May 2014 and May 2015.
#setwd("C:/Users/vferd/Documents/Sherly/Campus/Semester 3/Algoritma/week 11") # set the working directory
library(dplyr)
library(readr)
library(data.table)
library(tidyverse)
library(funModeling)
library(Hmisc)
library(skimr)
library(DataExplorer)
library(e1071)
## 'data.frame': 21597 obs. of 21 variables:
## $ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
## $ date : chr "10/13/2014" "12/9/2014" "2/25/2015" "12/9/2014" ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
## integer(0)
##
## 1 2 3 4 5
## 3 0 0 0 0 1
## 4 1 4 12 10 0
## 5 9 15 100 84 34
## 6 11 59 1035 685 248
## 7 6 75 5229 2831 833
## 8 2 13 4266 1394 390
## 9 0 2 2041 446 126
## 10 0 2 921 156 55
## 11 0 0 332 56 11
## 12 0 0 73 13 3
## 13 0 0 11 2 0
##
## 1 2 3 4 5
## 3 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 4.630273e-05
## 4 4.630273e-05 1.852109e-04 5.556327e-04 4.630273e-04 0.000000e+00
## 5 4.167245e-04 6.945409e-04 4.630273e-03 3.889429e-03 1.574293e-03
## 6 5.093300e-04 2.731861e-03 4.792332e-02 3.171737e-02 1.148308e-02
## 7 2.778164e-04 3.472705e-03 2.421170e-01 1.310830e-01 3.857017e-02
## 8 9.260545e-05 6.019355e-04 1.975274e-01 6.454600e-02 1.805806e-02
## 9 0.000000e+00 9.260545e-05 9.450387e-02 2.065102e-02 5.834144e-03
## 10 0.000000e+00 9.260545e-05 4.264481e-02 7.223225e-03 2.546650e-03
## 11 0.000000e+00 0.000000e+00 1.537251e-02 2.592953e-03 5.093300e-04
## 12 0.000000e+00 0.000000e+00 3.380099e-03 6.019355e-04 1.389082e-04
## 13 0.000000e+00 0.000000e+00 5.093300e-04 9.260545e-05 0.000000e+00
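The two tables above appear to be a cross-tabulation of `grade` (rows) against `condition` (columns), first as counts and then as shares of all 21,597 sales; the row totals agree with the grade frequencies reported later. The call that produced them is not shown, so the following is only a minimal sketch of how such tables can be built in base R. The `toy` data frame and its values are made up for illustration.

```r
# Toy stand-in for the grade and condition columns (values are made up)
toy <- data.frame(
  grade     = c(7, 7, 8, 6, 7, 8, 9, 7),
  condition = c(3, 4, 3, 3, 3, 5, 3, 4)
)

# Counts of each grade (rows) by condition (columns)
counts <- table(toy$grade, toy$condition)
print(counts)

# Share of all observations in each cell; the cells sum to 1
shares <- prop.table(counts)
print(round(shares, 3))

# Row-wise shares: distribution of condition within each grade
row_shares <- prop.table(counts, margin = 1)
print(round(row_shares, 3))
```

Row-wise shares (`margin = 1`) are often easier to read than overall shares when the question is how condition varies within a grade.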
## [1] "id" "price" "bedrooms" "bathrooms"
## [5] "sqft_living" "sqft_lot" "floors" "waterfront"
## [9] "view" "condition" "grade" "sqft_above"
## [13] "sqft_basement" "yr_built" "yr_renovated" "zipcode"
## [17] "lat" "long" "sqft_living15" "sqft_lot15"
## id price bedrooms bathrooms
## Min. :1.000e+06 Min. : 78000 Min. : 1.000 Min. :0.500
## 1st Qu.:2.123e+09 1st Qu.: 322000 1st Qu.: 3.000 1st Qu.:1.750
## Median :3.905e+09 Median : 450000 Median : 3.000 Median :2.250
## Mean :4.580e+09 Mean : 540297 Mean : 3.373 Mean :2.116
## 3rd Qu.:7.309e+09 3rd Qu.: 645000 3rd Qu.: 4.000 3rd Qu.:2.500
## Max. :9.900e+09 Max. :7700000 Max. :33.000 Max. :8.000
## sqft_living sqft_lot floors waterfront
## Min. : 370 Min. : 520 Min. :1.000 Min. :0.000000
## 1st Qu.: 1430 1st Qu.: 5040 1st Qu.:1.000 1st Qu.:0.000000
## Median : 1910 Median : 7618 Median :1.500 Median :0.000000
## Mean : 2080 Mean : 15099 Mean :1.494 Mean :0.007547
## 3rd Qu.: 2550 3rd Qu.: 10685 3rd Qu.:2.000 3rd Qu.:0.000000
## Max. :13540 Max. :1651359 Max. :3.500 Max. :1.000000
## view condition grade sqft_above
## Min. :0.0000 Min. :1.00 Min. : 3.000 Min. : 370
## 1st Qu.:0.0000 1st Qu.:3.00 1st Qu.: 7.000 1st Qu.:1190
## Median :0.0000 Median :3.00 Median : 7.000 Median :1560
## Mean :0.2343 Mean :3.41 Mean : 7.658 Mean :1789
## 3rd Qu.:0.0000 3rd Qu.:4.00 3rd Qu.: 8.000 3rd Qu.:2210
## Max. :4.0000 Max. :5.00 Max. :13.000 Max. :9410
## sqft_basement yr_built yr_renovated zipcode
## Min. : 0.0 Min. :1900 Min. : 0.00 Min. :98001
## 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.00 1st Qu.:98033
## Median : 0.0 Median :1975 Median : 0.00 Median :98065
## Mean : 291.7 Mean :1971 Mean : 84.46 Mean :98078
## 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.00 3rd Qu.:98118
## Max. :4820.0 Max. :2015 Max. :2015.00 Max. :98199
## lat long sqft_living15 sqft_lot15
## Min. :47.16 Min. :-122.5 Min. : 399 Min. : 651
## 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490 1st Qu.: 5100
## Median :47.57 Median :-122.2 Median :1840 Median : 7620
## Mean :47.56 Mean :-122.2 Mean :1987 Mean : 12758
## 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360 3rd Qu.: 10083
## Max. :47.78 Max. :-121.3 Max. :6210 Max. :871200
## [1] 0.7881271
basic_eda <- function(data) {
  glimpse(data)
  skim(data)
  df_status(data)
  freq(data)
  profiling_num(data)
  plot_num(data)
  describe(data)
}
basic_eda(df)
## Rows: 21,597
## Columns: 21
## $ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544...
## $ date <chr> "10/13/2014", "12/9/2014", "2/25/2015", "12/9/2014", ...
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1230000, 2575...
## $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4,...
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00,...
## $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, ...
## $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 74...
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0...
## $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ view <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,...
## $ condition <int> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4,...
## $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7...
## $ sqft_above <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, ...
## $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, ...
## $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960,...
## $ yr_renovated <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ zipcode <int> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 9819...
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561,...
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -12...
## $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780,...
## $ sqft_lot15 <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 811...
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 id 0 0.00 0 0 0 0 numeric 21420
## 2 date 0 0.00 0 0 0 0 character 372
## 3 price 0 0.00 0 0 0 0 numeric 3622
## 4 bedrooms 0 0.00 0 0 0 0 integer 12
## 5 bathrooms 0 0.00 0 0 0 0 numeric 29
## 6 sqft_living 0 0.00 0 0 0 0 integer 1034
## 7 sqft_lot 0 0.00 0 0 0 0 integer 9776
## 8 floors 0 0.00 0 0 0 0 numeric 6
## 9 waterfront 21434 99.25 0 0 0 0 integer 2
## 10 view 19475 90.17 0 0 0 0 integer 5
## 11 condition 0 0.00 0 0 0 0 integer 5
## 12 grade 0 0.00 0 0 0 0 integer 11
## 13 sqft_above 0 0.00 0 0 0 0 integer 942
## 14 sqft_basement 13110 60.70 0 0 0 0 integer 306
## 15 yr_built 0 0.00 0 0 0 0 integer 116
## 16 yr_renovated 20683 95.77 0 0 0 0 integer 70
## 17 zipcode 0 0.00 0 0 0 0 integer 70
## 18 lat 0 0.00 0 0 0 0 numeric 5033
## 19 long 0 0.00 0 0 0 0 numeric 751
## 20 sqft_living15 0 0.00 0 0 0 0 integer 777
## 21 sqft_lot15 0 0.00 0 0 0 0 integer 8682
## Warning in freq_logic(data = data, input = input, plot, na.rm, path_out =
## path_out): Skipping plot for variable 'date' (more than 100 categories)
## data
##
## 21 Variables 21597 Observations
## --------------------------------------------------------------------------------
## id
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 21420 1 4.58e+09 3.297e+09 5.127e+08 1.036e+09
## .25 .50 .75 .90 .95
## 2.123e+09 3.905e+09 7.309e+09 8.732e+09 9.297e+09
##
## lowest : 1000102 1200019 1200021 2800031 3600057
## highest: 9842300095 9842300485 9842300540 9895000040 9900000190
## --------------------------------------------------------------------------------
## date
## n missing distinct
## 21597 0 372
##
## lowest : 1/10/2015 1/12/2015 1/13/2015 1/14/2015 1/15/2015
## highest: 9/5/2014 9/6/2014 9/7/2014 9/8/2014 9/9/2014
## --------------------------------------------------------------------------------
## price
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 3622 1 540297 329526 210000 245000
## .25 .50 .75 .90 .95
## 322000 450000 645000 887000 1160000
##
## lowest : 78000 80000 81000 82000 82500
## highest: 5350000 5570000 6890000 7060000 7700000
## --------------------------------------------------------------------------------
## bedrooms
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 12 0.871 3.373 0.9427 2 2
## .25 .50 .75 .90 .95
## 3 3 4 4 5
##
## lowest : 1 2 3 4 5, highest: 8 9 10 11 33
##
## Value 1 2 3 4 5 6 7 8 9 10 11
## Frequency 196 2760 9824 6882 1601 272 38 13 6 3 1
## Proportion 0.009 0.128 0.455 0.319 0.074 0.013 0.002 0.001 0.000 0.000 0.000
##
## Value 33
## Frequency 1
## Proportion 0.000
## --------------------------------------------------------------------------------
## bathrooms
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 29 0.974 2.116 0.8432 1.00 1.00
## .25 .50 .75 .90 .95
## 1.75 2.25 2.50 3.00 3.50
##
## lowest : 0.50 0.75 1.00 1.25 1.50, highest: 6.50 6.75 7.50 7.75 8.00
## --------------------------------------------------------------------------------
## sqft_living
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 1034 1 2080 978 940 1090
## .25 .50 .75 .90 .95
## 1430 1910 2550 3254 3760
##
## lowest : 370 380 390 410 420, highest: 9640 9890 10040 12050 13540
## --------------------------------------------------------------------------------
## sqft_lot
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 9776 1 15099 17841 1801 3323
## .25 .50 .75 .90 .95
## 5040 7618 10685 21372 43307
##
## lowest : 520 572 600 609 635
## highest: 982998 1024068 1074218 1164794 1651359
## --------------------------------------------------------------------------------
## floors
## n missing distinct Info Mean Gmd
## 21597 0 6 0.823 1.494 0.5561
##
## lowest : 1.0 1.5 2.0 2.5 3.0, highest: 1.5 2.0 2.5 3.0 3.5
##
## Value 1.0 1.5 2.0 2.5 3.0 3.5
## Frequency 10673 1910 8235 161 611 7
## Proportion 0.494 0.088 0.381 0.007 0.028 0.000
## --------------------------------------------------------------------------------
## waterfront
## n missing distinct Info Sum Mean Gmd
## 21597 0 2 0.022 163 0.007547 0.01498
##
## --------------------------------------------------------------------------------
## view
## n missing distinct Info Mean Gmd
## 21597 0 5 0.267 0.2343 0.4322
##
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##
## Value 0 1 2 3 4
## Frequency 19475 332 961 510 319
## Proportion 0.902 0.015 0.044 0.024 0.015
## --------------------------------------------------------------------------------
## condition
## n missing distinct Info Mean Gmd
## 21597 0 5 0.708 3.41 0.6159
##
## lowest : 1 2 3 4 5, highest: 1 2 3 4 5
##
## Value 1 2 3 4 5
## Frequency 29 170 14020 5677 1701
## Proportion 0.001 0.008 0.649 0.263 0.079
## --------------------------------------------------------------------------------
## grade
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 11 0.903 7.658 1.229 6 6
## .25 .50 .75 .90 .95
## 7 7 8 9 10
##
## lowest : 3 4 5 6 7, highest: 9 10 11 12 13
##
## Value 3 4 5 6 7 8 9 10 11 12 13
## Frequency 1 27 242 2038 8974 6065 2615 1134 399 89 13
## Proportion 0.000 0.001 0.011 0.094 0.416 0.281 0.121 0.053 0.018 0.004 0.001
## --------------------------------------------------------------------------------
## sqft_above
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 942 1 1789 875.8 850 970
## .25 .50 .75 .90 .95
## 1190 1560 2210 2950 3400
##
## lowest : 370 380 390 410 420, highest: 7880 8020 8570 8860 9410
## --------------------------------------------------------------------------------
## sqft_basement
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 306 0.776 291.7 422.4 0 0
## .25 .50 .75 .90 .95
## 0 0 560 970 1190
##
## lowest : 0 10 20 40 50, highest: 3260 3480 3500 4130 4820
## --------------------------------------------------------------------------------
## yr_built
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 116 1 1971 33.38 1915 1926
## .25 .50 .75 .90 .95
## 1951 1975 1997 2007 2011
##
## lowest : 1900 1901 1902 1903 1904, highest: 2011 2012 2013 2014 2015
## --------------------------------------------------------------------------------
## yr_renovated
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 70 0.122 84.46 161.8 0 0
## .25 .50 .75 .90 .95
## 0 0 0 0 0
##
## lowest : 0 1934 1940 1944 1945, highest: 2011 2012 2013 2014 2015
##
## Value 0 1935 1940 1945 1950 1955 1960 1965 1970 1975 1980
## Frequency 20683 1 2 6 4 13 12 16 27 25 43
## Proportion 0.958 0.000 0.000 0.000 0.000 0.001 0.001 0.001 0.001 0.001 0.002
##
## Value 1985 1990 1995 2000 2005 2010 2015
## Frequency 88 99 84 112 156 82 144
## Proportion 0.004 0.005 0.004 0.005 0.007 0.004 0.007
##
## For the frequency table, variable is rounded to the nearest 5
## --------------------------------------------------------------------------------
## zipcode
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 70 1 98078 60.78 98004 98008
## .25 .50 .75 .90 .95
## 98033 98065 98118 98155 98177
##
## lowest : 98001 98002 98003 98004 98005, highest: 98177 98178 98188 98198 98199
## --------------------------------------------------------------------------------
## lat
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 5033 1 47.56 0.1573 47.31 47.35
## .25 .50 .75 .90 .95
## 47.47 47.57 47.68 47.73 47.75
##
## lowest : 47.1559 47.1593 47.1622 47.1647 47.1764
## highest: 47.7771 47.7772 47.7774 47.7775 47.7776
## --------------------------------------------------------------------------------
## long
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 751 1 -122.2 0.1557 -122.4 -122.4
## .25 .50 .75 .90 .95
## -122.3 -122.2 -122.1 -122.0 -122.0
##
## lowest : -122.519 -122.515 -122.514 -122.512 -122.511
## highest: -121.325 -121.321 -121.319 -121.316 -121.315
## --------------------------------------------------------------------------------
## sqft_living15
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 777 1 1987 743.1 1140 1258
## .25 .50 .75 .90 .95
## 1490 1840 2360 2930 3300
##
## lowest : 399 460 620 670 690, highest: 5600 5610 5790 6110 6210
## --------------------------------------------------------------------------------
## sqft_lot15
## n missing distinct Info Mean Gmd .05 .10
## 21597 0 8682 1 12758 13385 2002 3668
## .25 .50 .75 .90 .95
## 5100 7620 10083 17822 37045
##
## lowest : 651 659 660 748 750, highest: 434728 438213 560617 858132 871200
## --------------------------------------------------------------------------------
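The output of `basic_eda(df)` above shows results from `glimpse`, `df_status`, `freq`, and `describe`, but nothing that looks like `skim` or `profiling_num` output. A likely reason (an assumption, since the original chunk options are not visible) is that R does not auto-print values computed inside a function: only explicit `print()` calls and the function's return value are displayed. The sketch below illustrates this with base R only; `quiet_eda` and `loud_eda` are hypothetical names, not part of the report.

```r
# Hypothetical helper: intermediate results are computed but never displayed
quiet_eda <- function(data) {
  summary(data)   # result is discarded: no auto-printing inside a function
  nrow(data)      # last expression: returned to the caller
}

# Hypothetical helper: wrap each step in print() to force display
loud_eda <- function(data) {
  print(summary(data))
  print(nrow(data))
  invisible(data)  # return the data invisibly so nothing extra is printed
}

toy <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))

# capture.output() collects whatever each call prints
quiet_out <- capture.output(res <- quiet_eda(toy))
loud_out  <- capture.output(loud_eda(toy))

length(quiet_out)      # no lines were printed by quiet_eda
length(loud_out) > 0   # loud_eda printed both results
```

Applied to `basic_eda`, wrapping the intermediate calls in `print()` would make every step visible in the knitted report.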
In this part, I examine how several factors affect house prices in the kc_house_data data set.
\(H_0\): Price is not influenced by bedrooms, bathrooms, floors, grade, or sqft_living.
\(H_1\): Price is influenced by bedrooms, bathrooms, floors, grade, and sqft_living.
The model is \(price = a_0 + a_1\,bedrooms + a_2\,bathrooms + a_3\,floors + a_4\,grade + a_5\,sqft\_living\). I estimate it with OLS and examine the p-values, R-squared, residuals, and coefficients to assess what drives price.
Model1 <- price~sqft_living+floors+bedrooms+bathrooms+grade
fit1 <- lm(Model1, data = df)
summary(fit1)
##
## Call:
## lm(formula = Model1, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1061981 -136474 -23050 97388 4631488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.818e+05 1.494e+04 -32.261 < 2e-16 ***
## sqft_living 2.215e+02 3.622e+00 61.156 < 2e-16 ***
## floors -3.844e+04 3.745e+03 -10.265 < 2e-16 ***
## bedrooms -4.087e+04 2.302e+03 -17.757 < 2e-16 ***
## bathrooms -1.405e+04 3.712e+03 -3.786 0.000153 ***
## grade 1.027e+05 2.389e+03 42.984 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 247600 on 21591 degrees of freedom
## Multiple R-squared: 0.5458, Adjusted R-squared: 0.5457
## F-statistic: 5189 on 5 and 21591 DF, p-value: < 2.2e-16
## Subset selection object
## Call: regsubsets.formula(Model1, data = df, nbest = 1)
## 5 Variables (and intercept)
## Forced in Forced out
## sqft_living FALSE FALSE
## floors FALSE FALSE
## bedrooms FALSE FALSE
## bathrooms FALSE FALSE
## grade FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
## sqft_living floors bedrooms bathrooms grade
## 1 ( 1 ) "*" " " " " " " " "
## 2 ( 1 ) "*" " " " " " " "*"
## 3 ( 1 ) "*" " " "*" " " "*"
## 4 ( 1 ) "*" "*" "*" " " "*"
## 5 ( 1 ) "*" "*" "*" "*" "*"
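Because the p-value of the overall F-test (< 2.2e-16) is below 0.05, we reject H0 in favor of H1: price is influenced by bedrooms, bathrooms, floors, grade, and sqft_living. Each coefficient is also individually significant at the 5% level. The R-squared of 0.5458 means the model explains roughly 55% of the variation in price, and the residual standard error of about 247,600 shows that the remaining error is still substantial rather than small. In conclusion, at the 5% significance level (95% confidence), the data support the claim that price is associated with the variables above.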
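As a cross-check on the regression output, the overall F-statistic and the adjusted R-squared can be recomputed from the reported R-squared, the number of observations, and the number of predictors. The sketch below uses only figures copied from the `summary(fit1)` output; because the R-squared is rounded to four digits, the results match the reported values only approximately.

```r
# Values copied from the summary(fit1) output
r2 <- 0.5458   # Multiple R-squared
n  <- 21597    # number of observations
k  <- 5        # number of predictors (excluding the intercept)

# Residual degrees of freedom, as reported in the output (21591)
df_resid <- n - k - 1

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Adjusted R-squared penalizes the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# p-value of the overall F-test (upper tail of the F distribution)
p_value <- pf(f_stat, k, df_resid, lower.tail = FALSE)

round(f_stat)      # close to the reported 5189
round(adj_r2, 4)   # close to the reported 0.5457
p_value < 0.05     # the overall test rejects H0 at the 5% level
```

The F-test only says that at least one coefficient is nonzero; the R-squared and residual standard error are what indicate how much of the price variation the model leaves unexplained.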