EXPLORATORY DATA ANALYSIS

House Data

Hello, I’d like to present an exploratory analysis of a house dataset from America, covering houses built from 1900 to 2015.

1 DATA INPUT

First we read the data. Make sure the file is in the data_input folder of the R project, as the path below expects.

house <- read.csv("data_input/house_data.csv")
head(house)
unique(house$yr_built)
#>   [1] 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 1942 1927 1977 1900 1979
#>  [16] 1994 1916 1921 1969 1947 1968 1985 1941 1915 1909 1948 2005 1929 1981 1930
#>  [31] 1904 1996 2000 1984 2014 1922 1959 1966 1953 1950 2008 1991 1954 1973 1925
#>  [46] 1989 1972 1986 1956 2002 1992 1964 1952 1961 2006 1988 1962 1939 1946 1967
#>  [61] 1975 1980 1910 1983 1978 1905 1971 2010 1945 1924 1990 1914 1926 2004 1923
#>  [76] 2007 1976 1949 1999 1901 1993 1920 1997 1943 1957 1940 1918 1928 1974 1911
#>  [91] 1936 1937 1982 1908 1931 1998 1913 2013 1907 1958 2012 1912 2011 1917 1932
#> [106] 1944 1902 2009 1903 1970 2015 1934 1938 1919 1906 1935
range(house$yr_built)
#> [1] 1900 2015

Next, let’s start the data inspection.

2 DATA INSPECTION

We look at the first 6 rows of the data.

head(house)

Then we look at the last rows and the dimensions.

tail(house)
dim(house)
#> [1] 21613     9

The data consists of 21,613 rows and 9 columns.

names(house)
#> [1] "price"       "bedrooms"    "bathrooms"   "sqft_living" "sqft_lot"   
#> [6] "floors"      "waterfront"  "grade"       "yr_built"

From our inspection we can conclude:

  1. the house data consists of 21,613 rows and 9 columns
  2. the column names are “price”, “bedrooms”, “bathrooms”, “sqft_living”, “sqft_lot”, “floors”, “waterfront”, “grade” and “yr_built”

3 DATA CLEANSING & COERCION

The first cleansing step is to check the data type of each column.

str(house)
#> 'data.frame':    21613 obs. of  9 variables:
#>  $ price      : int  221900 538000 180000 604000 510000 1225000 257500 291850 229500 323000 ...
#>  $ bedrooms   : int  3 3 2 4 3 4 3 3 3 3 ...
#>  $ bathrooms  : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
#>  $ sqft_living: int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
#>  $ sqft_lot   : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
#>  $ floors     : num  1 2 1 1 1 1 2 1 1 2 ...
#>  $ waterfront : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ grade      : int  7 7 6 7 8 11 7 7 7 7 ...
#>  $ yr_built   : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...

From this result, we find that some columns do not have the correct data type. We need to convert them to the correct type (data coercion).

house$bedrooms <- as.factor(house$bedrooms)
house$bathrooms <- as.factor(house$bathrooms)
house$grade <- as.factor(house$grade)

str(house)
#> 'data.frame':    21613 obs. of  9 variables:
#>  $ price      : int  221900 538000 180000 604000 510000 1225000 257500 291850 229500 323000 ...
#>  $ bedrooms   : Factor w/ 13 levels "0","1","2","3",..: 4 4 3 5 4 5 4 4 4 4 ...
#>  $ bathrooms  : Factor w/ 30 levels "0","0.5","0.75",..: 4 9 4 12 8 18 9 6 4 10 ...
#>  $ sqft_living: int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
#>  $ sqft_lot   : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
#>  $ floors     : num  1 2 1 1 1 1 2 1 1 2 ...
#>  $ waterfront : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ grade      : Factor w/ 12 levels "1","3","4","5",..: 6 6 5 6 7 10 6 6 6 6 ...
#>  $ yr_built   : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...

The columns bedrooms, bathrooms and grade are now factors. The columns floors and waterfront were not converted and remain numeric.
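When several columns need the same coercion, it can be done in one step. The sketch below is only an illustration: `sample_house` and `factor_cols` are names introduced here, and the values are the first ten rows shown in the str() output above.

```r
# First ten rows of three columns, copied from the str() output above
sample_house <- data.frame(
  bedrooms  = c(3, 3, 2, 4, 3, 4, 3, 3, 3, 3),
  bathrooms = c(1, 2.25, 1, 3, 2, 4.5, 2.25, 1.5, 1, 2.5),
  grade     = c(7, 7, 6, 7, 8, 11, 7, 7, 7, 7)
)

# Columns we want to treat as categories
factor_cols <- c("bedrooms", "bathrooms", "grade")

# lapply() applies as.factor() to each selected column
sample_house[factor_cols] <- lapply(sample_house[factor_cols], as.factor)

# Confirm the new type of every column
sapply(sample_house, class)

# Caution: to get the numbers back from a factor, go through as.character();
# as.numeric() directly on a factor returns the internal level codes
bathrooms_numeric <- as.numeric(as.character(sample_house$bathrooms))
```

The same pattern should work on the full `house` data frame by replacing `sample_house` with `house`.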

The next step is to check for missing values.

colSums(is.na(house))
#>       price    bedrooms   bathrooms sqft_living    sqft_lot      floors 
#>           0           0           0           0           0           0 
#>  waterfront       grade    yr_built 
#>           0           0           0
anyNA(house)
#> [1] FALSE

Great, there are no missing values.

Then we move on to the next step.

4 DATA EXPLANATION

Brief explanation

summary(house)
#>      price            bedrooms      bathrooms     sqft_living   
#>  Min.   :  75000   3      :9824   2.5    :5380   Min.   :  290  
#>  1st Qu.: 321950   4      :6882   1      :3852   1st Qu.: 1427  
#>  Median : 450000   2      :2760   1.75   :3048   Median : 1910  
#>  Mean   : 540088   5      :1601   2.25   :2047   Mean   : 2080  
#>  3rd Qu.: 645000   6      : 272   2      :1930   3rd Qu.: 2550  
#>  Max.   :7700000   1      : 199   1.5    :1446   Max.   :13540  
#>                    (Other):  75   (Other):3910                  
#>     sqft_lot           floors        waterfront           grade     
#>  Min.   :    520   Min.   :1.000   Min.   :0.000000   7      :8981  
#>  1st Qu.:   5040   1st Qu.:1.000   1st Qu.:0.000000   8      :6068  
#>  Median :   7618   Median :1.500   Median :0.000000   9      :2615  
#>  Mean   :  15107   Mean   :1.494   Mean   :0.007542   6      :2038  
#>  3rd Qu.:  10688   3rd Qu.:2.000   3rd Qu.:0.000000   10     :1134  
#>  Max.   :1651359   Max.   :3.500   Max.   :1.000000   11     : 399  
#>                                                       (Other): 378  
#>     yr_built   
#>  Min.   :1900  
#>  1st Qu.:1951  
#>  Median :1975  
#>  Mean   :1971  
#>  3rd Qu.:1997  
#>  Max.   :2015  
#> 

Summary:

  1. the most expensive house costs 7,700,000
  2. the average house price is 540,088
  3. the earliest construction year is 1900
  4. the most common houses have 3 bedrooms (9,824 houses) and 2.5 bathrooms (5,380 houses); floors is still numeric, so the summary shows only its median of 1.5
  5. the average year of construction is 1971
  6. the most common grade is 7 (8,981 houses)
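Because floors is stored as a number, summary() reports only its quartiles. A frequency table is one way to see the most common value directly. The sketch below uses only the first ten floors values from the str() output above, so the counts are for that small sample and not for the full data; `floor_counts` and `most_common` are names introduced here.

```r
# First ten values of floors, copied from the str() output above
floors <- c(1, 2, 1, 1, 1, 1, 2, 1, 1, 2)

# table() counts how often each distinct value occurs
floor_counts <- table(floors)
floor_counts

# which.max() gives the position of the largest count;
# names() turns that position back into the floors value
most_common <- names(floor_counts)[which.max(floor_counts)]
most_common
```

On the full data the equivalent call would be `table(house$floors)`.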

Check for outliers in price.

aggregate(price ~ floors, data = house, FUN = var)
aggregate(price ~ floors, data = house, FUN = sd)
boxplot(house$price)

boxplot(price ~ floors + waterfront, data = house)

The results above show possible outliers in price, but we keep these observations and continue with the analysis.
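One common way to put a number on the possible outliers is the 1.5 × IQR rule, the same convention box plots use for their whiskers (boxplot() computes its hinges slightly differently from quantile(), so the two can disagree at the margin). This is a sketch of that rule, not necessarily the calculation used in this analysis; the prices are the first ten values from the str() output above.

```r
# First ten prices, copied from the str() output above
price <- c(221900, 538000, 180000, 604000, 510000,
           1225000, 257500, 291850, 229500, 323000)

# First and third quartiles, with names dropped for cleaner arithmetic
q <- unname(quantile(price, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: anything beyond 1.5 * IQR from the quartiles is flagged
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values outside the fences are candidate outliers
outliers <- price[price < lower | price > upper]
outliers
```

In this ten-row sample only the 1,225,000 house lies above the upper fence. Whether to drop such values is a judgment call that depends on the question being asked.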

5 DATA MANIPULATION

Which house has the smallest sqft_lot, and what is its price?

house[house$sqft_lot == 520, ]

Answer: the house with the smallest sqft_lot has a price of 700,000.
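The filter above hardcodes the minimum value 520 taken from summary(). A sketch that looks the minimum up instead is shown below; `sample_house` and `smallest` are names introduced here, and the rows are the first ten from the str() output above, so the result differs from the full-data answer.

```r
# First ten rows of two columns, copied from the str() output above
sample_house <- data.frame(
  price    = c(221900, 538000, 180000, 604000, 510000,
               1225000, 257500, 291850, 229500, 323000),
  sqft_lot = c(5650, 7242, 10000, 5000, 8080,
               101930, 6819, 9711, 7470, 6560)
)

# which.min() returns the row position of the smallest lot
# (only the first one if several rows tie)
smallest <- sample_house[which.min(sample_house$sqft_lot), ]
smallest

# To keep every tied row, compare against min() instead
all_smallest <- sample_house[sample_house$sqft_lot == min(sample_house$sqft_lot), ]
```

This way the code keeps working if the data changes and the minimum is no longer 520.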

What is the average price by floors and waterfront?

house1 <- aggregate(price ~ floors + waterfront, data = house, FUN = mean)
house1

Plot Comparison of Floors and Average Price

plot(x = house1$floors, y = house1$price)

From the chart above, houses with 2.5 floors have the highest average price.
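Note that plot() with two numeric vectors draws a scatter plot. If a bar chart is wanted, barplot() is one option. The sketch below aggregates by floors only, so there is one bar per floors value; `sample_house` and `avg_by_floor` are names introduced here, and the rows are the first ten from the str() output above.

```r
# First ten rows of two columns, copied from the str() output above
sample_house <- data.frame(
  price  = c(221900, 538000, 180000, 604000, 510000,
             1225000, 257500, 291850, 229500, 323000),
  floors = c(1, 2, 1, 1, 1, 1, 2, 1, 1, 2)
)

# Mean price for each distinct number of floors
avg_by_floor <- aggregate(price ~ floors, data = sample_house, FUN = mean)

# One bar per floors value, labelled with that value
barplot(
  height    = avg_by_floor$price,
  names.arg = avg_by_floor$floors,
  xlab      = "Floors",
  ylab      = "Average price",
  main      = "Average price by number of floors"
)
```

With the full data, replacing `sample_house` with `house` should give one bar for each floors value from 1 to 3.5.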

Plot Comparison of Waterfront and Average Price

plot(x = house1$waterfront, y = house1$price)

From the chart above, houses with a waterfront have a higher average price than houses without one.

6 SUMMARY

  1. Over the 115 years from 1900 to 2015, house prices have increased, and the larger the land and house area, the higher the price.
  2. The most expensive house costs 7,700,000.
  3. The average house price is 540,088.
  4. The earliest construction year is 1900.
  5. The most common houses have 3 bedrooms and 2.5 bathrooms; the median number of floors is 1.5.
  6. The average year of construction is 1971.
  7. The most common grade is 7.
  8. The plot of average price by floors shows that houses with 2.5 floors have the highest average price.
  9. The plot of average price by waterfront shows that houses with a waterfront have a higher average price than houses without one.
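The first point is not checked directly by the steps above. One possible check is the correlation between price and the area and year columns. The sketch below uses only the first ten rows from the str() output above, so its values say little about the full data; `sample_house` and `price_cor` are names introduced here.

```r
# First ten rows of four columns, copied from the str() output above
sample_house <- data.frame(
  price       = c(221900, 538000, 180000, 604000, 510000,
                  1225000, 257500, 291850, 229500, 323000),
  sqft_living = c(1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890),
  sqft_lot    = c(5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6560),
  yr_built    = c(1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003)
)

# Pearson correlation of price with each candidate driver;
# values near 1 suggest price rises together with that column
price_cor <- sapply(
  sample_house[c("sqft_living", "sqft_lot", "yr_built")],
  function(column) cor(sample_house$price, column)
)
price_cor
```

Keep in mind that yr_built is the construction year and not the sale date, so a positive correlation with it would mean newer houses tend to cost more, not that prices rose over time.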