EXPLORATORY DATA ANALYSIS -
House Data
Hello, I'd like to present an exploratory analysis of a dataset of houses in America built between 1900 and 2015.
First we read the data. Make sure the data file is placed in the same folder as the R project.
#> [1] 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 1942 1927 1977 1900 1979
#> [16] 1994 1916 1921 1969 1947 1968 1985 1941 1915 1909 1948 2005 1929 1981 1930
#> [31] 1904 1996 2000 1984 2014 1922 1959 1966 1953 1950 2008 1991 1954 1973 1925
#> [46] 1989 1972 1986 1956 2002 1992 1964 1952 1961 2006 1988 1962 1939 1946 1967
#> [61] 1975 1980 1910 1983 1978 1905 1971 2010 1945 1924 1990 1914 1926 2004 1923
#> [76] 2007 1976 1949 1999 1901 1993 1920 1997 1943 1957 1940 1918 1928 1974 1911
#> [91] 1936 1937 1982 1908 1931 1998 1913 2013 1907 1958 2012 1912 2011 1917 1932
#> [106] 1944 1902 2009 1903 1970 2015 1934 1938 1919 1906 1935
#> [1] 1900 2015
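The output above looks like the result of `unique()` and `range()` on the `yr_built` column. A minimal sketch of that step follows, assuming the data is a CSV file; the file name `house_data.csv` is hypothetical, and a small stand-in file built from the first rows of this document is written to a temporary folder so the example runs on its own.

```r
# Toy stand-in for the real file: a few rows taken from the str() output
# in this document, written to a temporary CSV.
toy <- data.frame(
  price    = c(221900, 538000, 180000, 604000, 510000),
  yr_built = c(1955, 1951, 1933, 1965, 1987)
)
csv_path <- file.path(tempdir(), "house_data.csv")  # hypothetical file name
write.csv(toy, csv_path, row.names = FALSE)

# Read the data. With the real file, pass its path relative to the project folder.
house <- read.csv(csv_path)

years      <- unique(house$yr_built)  # distinct construction years
year_range <- range(house$yr_built)   # earliest and latest construction year
print(years)
print(year_range)
```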
Then let's start inspecting the data.
First we look at the top 6 rows of the data.
Then we look at the bottom rows.
#> [1] 21613 9
The data consists of 21613 rows and 9 columns.
#> [1] "price" "bedrooms" "bathrooms" "sqft_living" "sqft_lot"
#> [6] "floors" "waterfront" "grade" "yr_built"
From our inspection we can conclude:
* the house data consists of 21613 rows and 9 columns
* the column names are: price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, grade, yr_built
The first data cleansing step is to check the data type of each column.
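The inspection steps described above can be sketched with base R as follows. The small data frame is only a stand-in built from the first ten rows shown in this document, so the dimensions differ from the real 21613 x 9 data.

```r
# Stand-in data: three columns, first ten rows as shown in the str() output
house <- data.frame(
  price    = c(221900, 538000, 180000, 604000, 510000,
               1225000, 257500, 291850, 229500, 323000),
  bedrooms = c(3, 3, 2, 4, 3, 4, 3, 3, 3, 3),
  yr_built = c(1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003)
)

top_rows    <- head(house)   # first 6 rows by default
bottom_rows <- tail(house)   # last 6 rows by default
shape       <- dim(house)    # number of rows, number of columns
cols        <- names(house)  # column names

print(top_rows)
print(bottom_rows)
print(shape)
print(cols)
```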
#> 'data.frame': 21613 obs. of 9 variables:
#> $ price : int 221900 538000 180000 604000 510000 1225000 257500 291850 229500 323000 ...
#> $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
#> $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
#> $ sqft_living: int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
#> $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
#> $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
#> $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
#> $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
From this result, we find that some columns do not have the correct data type. We need to change them into the correct type (data coercion).
house$bedrooms <- as.factor(house$bedrooms)
house$bathrooms <- as.factor(house$bathrooms)
house$grade <- as.factor(house$grade)
str(house)
#> 'data.frame': 21613 obs. of 9 variables:
#> $ price : int 221900 538000 180000 604000 510000 1225000 257500 291850 229500 323000 ...
#> $ bedrooms : Factor w/ 13 levels "0","1","2","3",..: 4 4 3 5 4 5 4 4 4 4 ...
#> $ bathrooms : Factor w/ 30 levels "0","0.5","0.75",..: 4 9 4 12 8 18 9 6 4 10 ...
#> $ sqft_living: int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
#> $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
#> $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
#> $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ grade : Factor w/ 12 levels "1","3","4","5",..: 6 6 5 6 7 10 6 6 6 6 ...
#> $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
The columns bedrooms, bathrooms, and grade have now been converted to factors with as.factor. The output above shows that floors and waterfront are still numeric.
The next step is to check for missing values.
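If further columns such as floors and waterfront should also be treated as categories, the same coercion can be applied to several columns at once. This is only a sketch on stand-in data (the first ten rows shown in this document); the choice of columns in `factor_cols` is an assumption.

```r
# Stand-in data with the columns to be treated as categories
house <- data.frame(
  bedrooms   = c(3, 3, 2, 4, 3, 4, 3, 3, 3, 3),
  floors     = c(1, 2, 1, 1, 1, 1, 2, 1, 1, 2),
  waterfront = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  grade      = c(7, 7, 6, 7, 8, 11, 7, 7, 7, 7)
)

# Columns to convert (an assumed selection, adjust as needed)
factor_cols <- c("bedrooms", "floors", "waterfront", "grade")

# lapply() returns one factor per column; assigning back replaces them in place
house[factor_cols] <- lapply(house[factor_cols], as.factor)

str(house)
```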
#> price bedrooms bathrooms sqft_living sqft_lot floors
#> 0 0 0 0 0 0
#> waterfront grade yr_built
#> 0 0 0
#> [1] FALSE
Great, there are no missing values.
Then we move on to the next step.
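The two outputs above have the shape of a per-column count of missing values followed by a single TRUE/FALSE check. A sketch of how such a check could be written, using stand-in data from the first rows of this document:

```r
# Stand-in data without missing values
house <- data.frame(
  price    = c(221900, 538000, 180000, 604000, 510000),
  sqft_lot = c(5650, 7242, 10000, 5000, 8080)
)

na_per_column <- colSums(is.na(house))  # number of NA values in each column
has_na        <- anyNA(house)           # TRUE if any value at all is missing
print(na_per_column)
print(has_na)

# For comparison: the same check after one value is deliberately removed
with_gap <- house
with_gap$price[2] <- NA
gap_detected <- anyNA(with_gap)
print(gap_detected)
```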
Brief explanation
#> price bedrooms bathrooms sqft_living
#> Min. : 75000 3 :9824 2.5 :5380 Min. : 290
#> 1st Qu.: 321950 4 :6882 1 :3852 1st Qu.: 1427
#> Median : 450000 2 :2760 1.75 :3048 Median : 1910
#> Mean : 540088 5 :1601 2.25 :2047 Mean : 2080
#> 3rd Qu.: 645000 6 : 272 2 :1930 3rd Qu.: 2550
#> Max. :7700000 1 : 199 1.5 :1446 Max. :13540
#> (Other): 75 (Other):3910
#> sqft_lot floors waterfront grade
#> Min. : 520 Min. :1.000 Min. :0.000000 7 :8981
#> 1st Qu.: 5040 1st Qu.:1.000 1st Qu.:0.000000 8 :6068
#> Median : 7618 Median :1.500 Median :0.000000 9 :2615
#> Mean : 15107 Mean :1.494 Mean :0.007542 6 :2038
#> 3rd Qu.: 10688 3rd Qu.:2.000 3rd Qu.:0.000000 10 :1134
#> Max. :1651359 Max. :3.500 Max. :1.000000 11 : 399
#> (Other): 378
#> yr_built
#> Min. :1900
#> 1st Qu.:1951
#> Median :1975
#> Mean :1971
#> 3rd Qu.:1997
#> Max. :2015
#>
Summary:
1. The most expensive house costs 7,700,000.
2. The average house price is 540,088.
3. The earliest construction year is 1900.
4. The most frequently sold houses have 1 floor, 3 bedrooms, and 2.5 bathrooms.
5. The average year of construction is 1971.
6. The most frequently sold grade is 7.
Check for outliers in price.
The result above suggests possible outliers, but based on our calculation the process may continue.
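The summary points can also be read off programmatically instead of by eye. The sketch below uses stand-in data (the first ten rows shown in this document), so its numbers differ from the full-data results listed above.

```r
# Stand-in data; bedrooms and grade are factors, as after the coercion step
house <- data.frame(
  price    = c(221900, 538000, 180000, 604000, 510000,
               1225000, 257500, 291850, 229500, 323000),
  bedrooms = factor(c(3, 3, 2, 4, 3, 4, 3, 3, 3, 3)),
  grade    = factor(c(7, 7, 6, 7, 8, 11, 7, 7, 7, 7)),
  yr_built = c(1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003)
)

print(summary(house))  # five-number summaries and factor counts

max_price  <- max(house$price)     # most expensive house
mean_price <- mean(house$price)    # average price
first_year <- min(house$yr_built)  # earliest construction year

# Most frequent level of a factor: count with table(), take the largest count
top_bedrooms <- names(which.max(table(house$bedrooms)))
top_grade    <- names(which.max(table(house$grade)))
```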
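The document does not show which calculation was used for the outlier check. One common option, given here only as an assumed illustration, is the boxplot rule: values more than 1.5 times the hinge spread beyond the hinges are flagged. `boxplot.stats()` applies this rule without drawing anything, and `boxplot()` would draw the same picture.

```r
# Stand-in prices: the first ten values shown in this document
price <- c(221900, 538000, 180000, 604000, 510000,
           1225000, 257500, 291850, 229500, 323000)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
# out:   values lying beyond the whiskers (coef = 1.5 by default)
box <- boxplot.stats(price)
outlier_prices <- box$out

print(box$stats)
print(outlier_prices)
```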
Which house has the smallest sqft_lot, and what is its price?
Answer: the house with the smallest sqft_lot has a price of 700000.
What is the average price by floors and bedrooms?
Bar Plot Comparison of Floors and Price
From the chart above, the highest average price is for houses with 2.5 floors.
Bar Plot Comparison of Waterfront and Price
From the chart above, houses with a waterfront have a higher average price than houses without one.
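A question like this can be answered by locating the row with the minimum lot size. The sketch below uses stand-in data (the first ten rows shown in this document), so the price it finds is not the 700000 reported for the full data.

```r
# Stand-in data
house <- data.frame(
  price    = c(221900, 538000, 180000, 604000, 510000,
               1225000, 257500, 291850, 229500, 323000),
  sqft_lot = c(5650, 7242, 10000, 5000, 8080,
               101930, 6819, 9711, 7470, 6560)
)

# which.min() returns the position of the first minimum; use it as a row index
smallest <- house[which.min(house$sqft_lot), ]
print(smallest)
```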
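One way to compute an average per combination of two columns is `aggregate()` with a formula. This is a sketch on stand-in data (the first ten rows shown in this document), not necessarily the method used to produce the original result.

```r
# Stand-in data
house <- data.frame(
  price    = c(221900, 538000, 180000, 604000, 510000,
               1225000, 257500, 291850, 229500, 323000),
  bedrooms = c(3, 3, 2, 4, 3, 4, 3, 3, 3, 3),
  floors   = c(1, 2, 1, 1, 1, 1, 2, 1, 1, 2)
)

# Mean price for every floors/bedrooms combination that occurs in the data
avg_price <- aggregate(price ~ floors + bedrooms, data = house, FUN = mean)

# Sort so the most expensive combination comes first
avg_price <- avg_price[order(-avg_price$price), ]
print(avg_price)
```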
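A bar plot of this kind can be sketched in base R by averaging price per floor count and passing the result to `barplot()`. The data is a stand-in (the first ten rows shown in this document, which only contain 1 and 2 floors), so the winning category here differs from the 2.5 floors found in the full data. The `pdf(NULL)` line is only there so the sketch runs without a display; in an interactive session it and `dev.off()` can be dropped.

```r
# Stand-in data
house <- data.frame(
  price  = c(221900, 538000, 180000, 604000, 510000,
             1225000, 257500, 291850, 229500, 323000),
  floors = c(1, 2, 1, 1, 1, 1, 2, 1, 1, 2)
)

# Mean price per number of floors, as a named vector
avg_by_floors <- tapply(house$price, house$floors, mean)

pdf(NULL)  # null graphics device: nothing is written to disk
barplot(avg_by_floors,
        xlab = "Floors",
        ylab = "Average price",
        main = "Average price by number of floors")
invisible(dev.off())

# Category with the highest average price
best_floor <- names(which.max(avg_by_floors))
print(avg_by_floors)
print(best_floor)
```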