Home Prices Analysis Using R

Introduction

In this chapter, we use the R programming language to conduct inferential analysis of the home price data, examining it as thoroughly as we can to support decision making.

Data Set and Data types view

Data Set (first six rows):

datesold	price	propertyType	bedrooms
2007-02-07	525000	house	4
2007-02-27	290000	house	3
2007-03-07	328000	house	3
2007-03-09	380000	house	4
2007-03-21	310000	house	3
2007-04-04	465000	house	4

data set dimensions :

## [1] 29580     4

Data types and some details using the glimpse function

df %>% glimpse()

## Rows: 29,580
## Columns: 4
## $ datesold     <dttm> 2007-02-07, 2007-02-27, 2007-03-07, 2007-03-09, 2007-03-…
## $ price        <dbl> 525000, 290000, 328000, 380000, 310000, 465000, 399000, 1…
## $ propertyType <chr> "house", "house", "house", "house", "house", "house", "ho…
## $ bedrooms     <dbl> 4, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 4, 4, 4, 5, 3, 5, 4, …
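
The preview and the `glimpse()` output above can be reproduced on a small scale. The sketch below rebuilds the six preview rows by hand, because the text does not name the source file and so nothing is read from disk. The name `df_preview` is chosen here for illustration only, and `datesold` is stored as a plain `Date` rather than the date-time type shown above.

```r
# Rebuild the six preview rows shown above (values copied from the preview).
# Assumption: datesold is stored as Date here; the original column is <dttm>.
df_preview <- data.frame(
  datesold     = as.Date(c("2007-02-07", "2007-02-27", "2007-03-07",
                           "2007-03-09", "2007-03-21", "2007-04-04")),
  price        = c(525000, 290000, 328000, 380000, 310000, 465000),
  propertyType = rep("house", 6),
  bedrooms     = c(4, 3, 3, 4, 3, 4),
  stringsAsFactors = FALSE
)

dim(df_preview)  # number of rows and number of columns
str(df_preview)  # base-R counterpart of dplyr::glimpse()
```

On the full data set the same two calls would report 29,580 rows and 4 columns.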

Data summary

Numerical variables EDA -price variable

##      price        
##  Min.   :  56500  
##  1st Qu.: 440000  
##  Median : 550000  
##  Mean   : 609736  
##  3rd Qu.: 705000  
##  Max.   :8000000

Numerical variables distribution

Categorical variables EDA

We have two categorical variables: propertyType and bedrooms.

property Type

property Type variable in details

## Frequencies  
## df$propertyType  
## Type: Character  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##       house   24552     83.00          83.00     83.00          83.00
##        unit    5028     17.00         100.00     17.00         100.00
##        <NA>       0                               0.00         100.00
##       Total   29580    100.00         100.00    100.00         100.00

Insights:

1. Houses are far more frequent than units.
2. Houses make up 83% of the records, while units make up 17%.
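
The frequency table above can be checked with base R alone. The sketch below rebuilds the `propertyType` column from the reported counts (24,552 houses and 5,028 units) rather than from the original file, so `property_type` is a reconstructed stand-in for `df$propertyType`.

```r
# Reconstruct the column from the counts reported in the frequency table.
property_type <- rep(c("house", "unit"), times = c(24552, 5028))

freq <- table(property_type)             # absolute frequencies
pct  <- round(100 * prop.table(freq), 2) # percentages, two decimals

freq
pct
```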

property Type pie3D plot

bedrooms

bedrooms variable in details

## Frequencies  
## df$bedrooms  
## Type: Numeric  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##           0      30      0.10           0.10      0.10           0.10
##           1    1627      5.50           5.60      5.50           5.60
##           2    3598     12.16          17.77     12.16          17.77
##           3   11933     40.34          58.11     40.34          58.11
##           4   10442     35.30          93.41     35.30          93.41
##           5    1950      6.59         100.00      6.59         100.00
##        <NA>       0                               0.00         100.00
##       Total   29580    100.00         100.00    100.00         100.00

Insights:

1. 3 bedrooms has the highest frequency (40.34%), while 0 bedrooms has the lowest (0.10%).
2. Together, 3- and 4-bedroom properties account for most of the data set (75.64%).

bedrooms pie3D

Combining property Type and bedrooms variables

##        
##             0     1     2     3     4     5   Sum
##   house    19    95   806 11281 10404  1947 24552
##   unit     11  1532  2792   652    38     3  5028
##   Sum      30  1627  3598 11933 10442  1950 29580

Insights:

Among houses, 3 bedrooms is the most frequent count and 0 bedrooms is the least frequent.
Among units, 2 bedrooms is the most frequent count and 5 bedrooms is the least frequent.
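
The cross-tabulation and the two insights can be reproduced from the counts in the table. This is a minimal sketch that types the counts in directly; with the full data frame, `table(df$propertyType, df$bedrooms)` would presumably give the same table.

```r
# Counts copied from the cross-tabulation above (rows: type, columns: bedrooms).
counts <- matrix(
  c(19,   95,  806, 11281, 10404, 1947,
    11, 1532, 2792,   652,    38,    3),
  nrow = 2, byrow = TRUE,
  dimnames = list(propertyType = c("house", "unit"),
                  bedrooms     = as.character(0:5))
)

addmargins(as.table(counts))  # adds the "Sum" row and column

# Most and least frequent bedroom count within each property type.
most  <- colnames(counts)[apply(counts, 1, which.max)]
least <- colnames(counts)[apply(counts, 1, which.min)]
most
least
```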

mosaic plot

using ggplot2

Price By property Type

## df$propertyType
##       house        unit 
## 15908618866  2127379770

Visualizing Price By property Type using ggplot2

Price By bedrooms

## as.factor(df$bedrooms)
##          0          1          2          3          4          5 
##   16269000  546660458 1591118057 6590648741 7499143065 1792159315
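
The figures in this section are sums of price within each group: the two property-type totals add up to 18,035,998,636, which divided by 29,580 rows gives the mean of about 609,736 reported in the data summary. A grouped sum of this kind can be sketched with `tapply()`. The example below uses only the six preview rows from the start of the chapter, so its totals are far smaller than the ones printed above.

```r
# Six preview rows (all houses), copied from the data preview.
price    <- c(525000, 290000, 328000, 380000, 310000, 465000)
bedrooms <- c(4, 3, 3, 4, 3, 4)

# Total price per bedroom count, mirroring the output format above.
total_by_bedrooms <- tapply(price, as.factor(bedrooms), sum)
total_by_bedrooms

# Check that the full-data totals are consistent with the reported mean price.
full_totals <- c(house = 15908618866, unit = 2127379770)
mean_price  <- sum(full_totals) / 29580
round(mean_price)
```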

Visualizing Price By bedrooms using ggplot2

Combining property Type and bedrooms variables in respect of price variable

	0	1	2	3	4	5
house	12870500	33189450	391554640	6207526896	7474757065	1788720315
unit	3398500	513471008	1199563417	383121845	24386000	3439000

##                as.factor(df$bedrooms)
## df$propertyType          0          1          2          3          4
##           house   12870500   33189450  391554640 6207526896 7474757065
##           unit     3398500  513471008 1199563417  383121845   24386000
##                as.factor(df$bedrooms)
## df$propertyType          5
##           house 1788720315
##           unit     3439000

Visualizing Price By property Type and bedrooms using ggplot2

Inferential Analysis

choosing the Right statistics tests

To decide which statistical tests to apply, we need to check two properties of the target variable (price):

1. Normality

2. Homogeneity of variances

Price normality check by visualization

Price normality check using the Anderson-Darling normality test

## 
##  Anderson-Darling normality test
## 
## data:  df$price
## A = 1180.2, p-value < 2.2e-16

Price homogeneity-of-variances check using Levene’s test

## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group    11  112.69 < 2.2e-16 ***
##       29568                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the results above:

1. Normality: The Anderson-Darling normality test (p-value < 0.05) indicates that price is not normally distributed.
2. Homogeneity of variances: Levene’s test (p-value < 0.05) indicates that the variances are not homogeneous across groups.

Given these results, we will use non-parametric statistical tests.
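
The decision rule can be sketched in base R. The data below are simulated and purely hypothetical (two groups of right-skewed values), not the home price data. Two substitutions are made so that no extra package is needed: `shapiro.test()` stands in for the Anderson-Darling test, and Levene's test with `center = median` is computed by hand as a one-way ANOVA on absolute deviations from each group's median.

```r
set.seed(1)

# Hypothetical right-skewed "prices" in two groups with different spread.
group <- factor(rep(c("a", "b"), each = 250))
price <- c(rlnorm(250, meanlog = 13, sdlog = 0.3),
           rlnorm(250, meanlog = 13, sdlog = 1.0))

# Normality: Shapiro-Wilk (base R), used here instead of Anderson-Darling.
# shapiro.test() accepts between 3 and 5000 observations.
normal_p <- shapiro.test(price)$p.value

# Levene's test (center = median): ANOVA on |y - group median|.
abs_dev  <- abs(price - ave(price, group, FUN = median))
levene_p <- anova(lm(abs_dev ~ group))[["Pr(>F)"]][1]

# If either assumption is rejected, fall back to non-parametric tests.
use_nonparametric <- normal_p < 0.05 || levene_p < 0.05
use_nonparametric
```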

Applying non-parametric statistical tests

A. wilcox.test

A.1 wilcox.test on price against property Type

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  df$price by propertyType
## W = 102291684, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

After applying the test we found that:

The p-value is less than 0.05 –> the price distribution differs between property types

A.2 Visualization

Whether this difference is large or small can be judged from the rank-biserial correlation, which is 0.66:

## [1] "very large"
## (Rules: funder2019)

So the difference in price between property types is very large.

A.3 wilcox_effsize :

## # A tibble: 1 × 7
##   .y.   group1 group2 effsize    n1    n2 magnitude
## * <chr> <chr>  <chr>    <dbl> <int> <int> <ord>    
## 1 price house  unit     0.428 24552  5028 moderate
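
Both effect sizes reported for this test (0.66 and 0.428) can be recovered from the W statistic and the group sizes printed above. The sketch below uses the standard formulas; the Z-based value relies on the normal approximation without a tie correction, so it may differ slightly in the last digit from what `wilcox_effsize` computes.

```r
# Values copied from the Wilcoxon output and the effect-size table above.
W  <- 102291684  # W reported by wilcox.test (U statistic of the first group)
n1 <- 24552      # houses
n2 <- 5028       # units

# Rank-biserial correlation: share of favourable minus unfavourable pairs.
rank_biserial <- 2 * W / (n1 * n2) - 1

# r = Z / sqrt(N), with Z from the normal approximation (no tie correction).
z        <- (W - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
r_from_z <- z / sqrt(n1 + n2)

round(rank_biserial, 2)
round(r_from_z, 2)
```

The two numbers measure different things, which is why one is labelled "very large" and the other "moderate": each is read against its own set of thresholds.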

B. kruskal.test

B.1 kruskal.test on price against bedrooms as it has more than 2 factors

## 
##  Kruskal-Wallis rank sum test
## 
## data:  df$price and as.factor(bedrooms)
## Kruskal-Wallis chi-squared = 12050, df = 5, p-value < 2.2e-16

After applying the test we found that:

The p-value is less than 0.05 –> the price distribution differs between bedroom counts

B.2 Visualization

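Whether this difference is large or small can be judged from the effect size, which is 0.41. (Rank-biserial correlation applies only to two-group comparisons; with six groups the value shown is most likely a rank-based epsilon squared.)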

## [1] "very large"
## (Rules: funder2019)

So the difference in price between bedroom counts is very large.

B.3 kruskal_effsize :

## # A tibble: 1 × 5
##   .y.       n effsize method  magnitude
## * <chr> <int>   <dbl> <chr>   <ord>    
## 1 price 29580   0.407 eta2[H] large
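
The effect size in this table follows from the Kruskal-Wallis statistic. The sketch below applies the usual definitions, eta2[H] = (H - k + 1) / (n - k) and epsilon squared = H / (n - 1), to the rounded statistic printed above (H = 12050, k = 6 groups, n = 29,580), so the last digits are approximate.

```r
# Values copied from the Kruskal-Wallis output above.
H <- 12050   # Kruskal-Wallis chi-squared (as printed, rounded)
k <- 6       # number of bedroom groups (0 to 5)
n <- 29580   # number of observations

eta2_H   <- (H - k + 1) / (n - k)  # eta squared based on H
epsilon2 <- H / (n - 1)            # rank-based epsilon squared

round(eta2_H, 3)
round(epsilon2, 2)
```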

B.4 To examine these differences in detail, we perform Dunn’s test with Holm correction for the p-values:

##    Comparison           Z       P.unadj         P.adj
## 1       0 - 1   6.2532710  4.019437e-10  1.607775e-09
## 2       0 - 2   3.2246371  1.261325e-03  2.522649e-03
## 3       1 - 2 -18.7770577  1.163514e-78  6.981085e-78
## 4       0 - 3  -0.1353723  8.923175e-01  8.923175e-01
## 5       1 - 3 -44.5330056  0.000000e+00  0.000000e+00
## 6       2 - 3 -32.3845603 4.527976e-230 3.622381e-229
## 7       0 - 4  -4.5323069  5.834301e-06  1.750290e-05
## 8       1 - 4 -74.3186748  0.000000e+00  0.000000e+00
## 9       2 - 4 -73.4484829  0.000000e+00  0.000000e+00
## 10      3 - 4 -59.9929036  0.000000e+00  0.000000e+00
## 11      0 - 5  -7.1345697  9.709041e-13  4.854521e-12
## 12      1 - 5 -73.4043100  0.000000e+00  0.000000e+00
## 13      2 - 5 -67.7003098  0.000000e+00  0.000000e+00
## 14      3 - 5 -52.7238080  0.000000e+00  0.000000e+00
## 15      4 - 5 -19.6152541  1.145630e-85  8.019412e-85
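
The `P.adj` column can be reproduced from `P.unadj` with base R's `p.adjust()`. The sketch below copies the unadjusted p-values from the table in row order and applies the Holm method; the only comparison that stays non-significant is 0 versus 3 bedrooms.

```r
# Unadjusted p-values copied from the Dunn's test table, in row order.
comparison <- c("0 - 1", "0 - 2", "1 - 2", "0 - 3", "1 - 3", "2 - 3",
                "0 - 4", "1 - 4", "2 - 4", "3 - 4", "0 - 5", "1 - 5",
                "2 - 5", "3 - 5", "4 - 5")
p_unadj <- c(4.019437e-10, 1.261325e-03, 1.163514e-78, 8.923175e-01,
             0, 4.527976e-230, 5.834301e-06, 0, 0, 0,
             9.709041e-13, 0, 0, 0, 1.145630e-85)

# Holm: the i-th smallest p-value is multiplied by (m - i + 1),
# then the results are made monotone and capped at 1.
p_adj <- p.adjust(p_unadj, method = "holm")

data.frame(comparison, p_unadj, p_adj)
comparison[p_adj >= 0.05]  # comparisons that are not significant
```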

C. chisq.test

C.1 Is there a significant association between propertyType and as.factor(bedrooms)?

## 
##  Pearson's Chi-squared test
## 
## data:  table(propertyType, as.factor(bedrooms))
## X-squared = 19805, df = 5, p-value < 2.2e-16

After applying the test we found that:

The p-value is less than 0.05 –> there is a significant association between property type and number of bedrooms

C.2 If any cell of chi$expected (the expected frequencies, not the observed counts from the data) is less than 5, we should apply Fisher’s exact test instead of the chi-squared test

##             
## propertyType         0        1        2        3        4         5
##        house 24.900609 1350.443 2986.413 9904.632 8667.072 1618.5396
##        unit   5.099391  276.557  611.587 2028.368 1774.928  331.4604

None of the cells of chi$expected is less than 5 –> so the chisq.test we applied is sufficient
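
The expected-frequency check can be sketched from the observed cross-tabulation given earlier in the chapter. The counts are typed in directly, and the name `use_fisher` is introduced here only to make the decision explicit.

```r
# Observed counts (rows: property type, columns: bedrooms 0 to 5).
observed <- matrix(
  c(19,   95,  806, 11281, 10404, 1947,
    11, 1532, 2792,   652,    38,    3),
  nrow = 2, byrow = TRUE,
  dimnames = list(propertyType = c("house", "unit"),
                  bedrooms     = as.character(0:5))
)

chi <- chisq.test(observed)
chi$expected  # expected counts under independence

# Fall back to Fisher's exact test only if an expected count is below 5.
min_expected <- min(chi$expected)
use_fisher   <- any(chi$expected < 5)
min_expected
use_fisher
```

The smallest expected count (units with 0 bedrooms) is only just above 5, so the check passes narrowly.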

Exploratory factor analysis

1. Cronbach's alpha

A statistical method used to measure the internal consistency (reliability) of the data.
The alpha coefficient usually ranges from 0 to 1.
When alpha = 0, the scale items are entirely independent of one another –> the items are uncorrelated and share no covariance.
Alpha is higher when the scale items depend strongly on one another –> the items are correlated or have high covariances.
An alpha of .65 to .8 or higher is considered good; an alpha below .5 is usually not accepted, especially when assessing whether a scale is unidimensional.
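
The two boundary cases described above can be checked with a hand-written version of the raw alpha formula, alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score). The function name `cronbach_alpha` and the small item vectors are hypothetical illustrations; this is a sketch of the textbook formula, not a replacement for `psych::alpha`.

```r
# Raw Cronbach's alpha for a matrix with one column per item.
cronbach_alpha <- function(items) {
  items     <- as.matrix(items)
  k         <- ncol(items)
  item_vars <- apply(items, 2, var)   # variance of each item
  total_var <- var(rowSums(items))    # variance of the summed score
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Perfectly dependent items: three copies of the same item.
x <- c(1, 2, 3, 4, 5)
alpha_identical <- cronbach_alpha(cbind(x, x, x))

# Uncorrelated items: the covariance of a and b is exactly zero.
a <- c(1, -1,  1, -1)
b <- c(1,  1, -1, -1)
alpha_uncorrelated <- cronbach_alpha(cbind(a, b))

alpha_identical
alpha_uncorrelated
```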

## # A tibble: 6 × 8
##   propertyType_house propertyType_unit bedrooms_0 bedrooms_1 bedrooms_2
##                <int>             <int>      <int>      <int>      <int>
## 1                  1                 0          0          0          0
## 2                  1                 0          0          0          0
## 3                  1                 0          0          0          0
## 4                  1                 0          0          0          0
## 5                  1                 0          0          0          0
## 6                  1                 0          0          0          0
## # ℹ 3 more variables: bedrooms_3 <int>, bedrooms_4 <int>, bedrooms_5 <int>

## 
## Reliability analysis   
## Call: psych::alpha(x = factor_Analysis_data, check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r   S/N    ase mean   sd median_r
##       0.61      -5.5    -4.3     -0.12 -0.85 0.0032 0.34 0.18   -0.094
## 
##     95% confidence boundaries 
##          lower alpha upper
## Feldt      0.6  0.61  0.62
## Duhachek   0.6  0.61  0.61
## 
##  Reliability if an item is dropped:
##                     raw_alpha std.alpha G6(smc) average_r   S/N alpha se var.r
## propertyType_house-      0.35      -1.4   -0.78    -0.090 -0.58   0.0057 0.067
## propertyType_unit        0.35      -4.1   -2.68    -0.130 -0.80   0.0057 0.058
## bedrooms_0               0.62     -14.1   -2.84    -0.154 -0.93   0.0033 0.140
## bedrooms_1               0.56      -4.4   -1.65    -0.132 -0.82   0.0036 0.118
## bedrooms_2               0.49      -3.0   -1.33    -0.120 -0.75   0.0043 0.104
## bedrooms_3-              0.70      -1.5   -0.83    -0.093 -0.60   0.0022 0.122
## bedrooms_4-              0.69      -1.6   -0.89    -0.097 -0.62   0.0023 0.119
## bedrooms_5-              0.65      -4.1   -1.63    -0.130 -0.80   0.0031 0.141
##                      med.r
## propertyType_house- -0.099
## propertyType_unit   -0.090
## bedrooms_0          -0.196
## bedrooms_1          -0.099
## bedrooms_2          -0.064
## bedrooms_3-         -0.064
## bedrooms_4-         -0.064
## bedrooms_5-         -0.090
## 
##  Item statistics 
##                         n raw.r std.r r.cor  r.drop  mean    sd
## propertyType_house- 29580 0.951 -0.35   NaN  0.9119 0.170 0.376
## propertyType_unit   29580 0.951  0.35   NaN  0.9119 0.170 0.376
## bedrooms_0          29580 0.045  0.79   NaN  0.0230 0.001 0.032
## bedrooms_1          29580 0.533  0.39   NaN  0.4055 0.055 0.228
## bedrooms_2          29580 0.737  0.19   NaN  0.6028 0.122 0.327
## bedrooms_3-         29580 0.334 -0.30   NaN -0.0064 0.597 0.491
## bedrooms_4-         29580 0.352 -0.24   NaN  0.0224 0.647 0.478
## bedrooms_5-         29580 0.128  0.35   NaN -0.0449 0.934 0.248
## 
## Non missing response frequency for each item
##                       0    1 miss
## propertyType_house 0.17 0.83    0
## propertyType_unit  0.83 0.17    0
## bedrooms_0         1.00 0.00    0
## bedrooms_1         0.94 0.06    0
## bedrooms_2         0.88 0.12    0
## bedrooms_3         0.60 0.40    0
## bedrooms_4         0.65 0.35    0
## bedrooms_5         0.93 0.07    0

2. cortest.bartlett

The observed correlation matrix is compared with the identity matrix using cortest.bartlett.
H0: the observed matrix is an identity matrix
H1: the observed matrix is not an identity matrix

## $chisq
## [1] Inf
## 
## $p.value
## [1] 0
## 
## $df
## [1] 28

The p-value is less than 0.05 –> we can reject H0 and conclude that the observed correlation matrix is not an identity matrix
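
The output above (chi-square of `Inf`, 28 degrees of freedom) can be understood from the standard formula for Bartlett's test of sphericity: chi-square = -(n - 1 - (2p + 5) / 6) * log(det(R)) with p(p - 1) / 2 degrees of freedom. The function below is a hand-written sketch of that formula, named `bartlett_sphericity` for illustration; it is not the psych implementation. A singular correlation matrix has determinant 0, which makes the statistic infinite.

```r
# Bartlett's test of sphericity for a correlation matrix R and sample size n.
bartlett_sphericity <- function(R, n) {
  p       <- ncol(R)
  log_det <- as.numeric(determinant(R, logarithm = TRUE)$modulus)
  chisq   <- -(n - 1 - (2 * p + 5) / 6) * log_det
  df      <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Eight variables give 8 * 7 / 2 degrees of freedom.
identity_case <- bartlett_sphericity(diag(8), n = 29580)
identity_case$df

# Two perfectly (negatively) correlated variables: det(R) = 0.
R_singular    <- matrix(c(1, -1, -1, 1), nrow = 2)
singular_case <- bartlett_sphericity(R_singular, n = 29580)
singular_case$chisq
singular_case$p.value
```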

3. Measure of sampling adequacy

Provides an index between 0 and 1 of the proportion of variance among the variables that might be common variance.
See the KMO value table and its remarks for interpretation.
The value should not be less than 0.5, both overall and for each item.

## Error in solve.default(r) : 
##   Lapack routine dgesv: system is exactly singular: U[2,2] = 0

## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = factor_Analysis_data)
## Overall MSA =  0.5
## MSA for each item = 
## propertyType_house  propertyType_unit         bedrooms_0         bedrooms_1 
##                0.5                0.5                0.5                0.5 
##         bedrooms_2         bedrooms_3         bedrooms_4         bedrooms_5 
##                0.5                0.5                0.5                0.5
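
A likely cause of the `solve.default` error above is that the dummy columns are perfectly collinear: every row has `propertyType_house + propertyType_unit = 1`, so the correlation matrix cannot be inverted. The sketch below shows the effect on a tiny hypothetical pair of dummy columns; it is an illustration of the mechanism, not a reproduction of the KMO computation.

```r
# Hypothetical one-hot pair: unit is always 1 - house.
house <- c(1, 1, 1, 0, 1, 0)
unit  <- 1 - house

R <- cor(cbind(house, unit))
R  # off-diagonal entries are -1 (up to rounding)

# The matrix has rank 1 instead of 2, so it has no usable inverse.
rank_R <- qr(R)$rank
rank_R

# solve() either stops with a "singular" error or returns meaningless values;
# the error message is captured here instead of halting the script.
inverse_or_error <- tryCatch(solve(R), error = function(e) conditionMessage(e))
```

Dropping one dummy column per categorical variable is a common way to remove this kind of dependency.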

4. Determining the number of factors using parallel analysis and very simple structure

## Parallel analysis suggests that the number of factors =  6  and the number of components =  5

## 
## Very Simple Structure
## Call: psych::vss(x = factor_Analysis_data)
## Although the VSS complexity 1 shows  6  factors, it is probably more reasonable to think about  2  factors
## VSS complexity 2 achieves a maximimum of 0.92  with  6  factors
## 
## The Velicer MAP achieves a minimum of 0.11  with  1  factors 
## BIC achieves a minimum of  635693.4  with  2  factors
## Sample Size adjusted BIC achieves a minimum of  635734.8  with  2  factors
## 
## Statistics by number of factors 
##   vss1 vss2  map dof   chisq prob sqresid  fit RMSEA     BIC   SABIC complex
## 1 0.56 0.00 0.11  20  662092    0 6.37453 0.56   1.1  661886  661950     1.0
## 2 0.66 0.74 0.22  13  635827    0 3.80636 0.74   1.3  635693  635735     1.2
## 3 0.60 0.79 0.33   7 1154228    0 2.60192 0.82   2.4 1154156 1154178     1.5
## 4 0.66 0.82 0.72   2 1129301    0 1.39513 0.90   4.4 1129281 1129287     1.4
## 5 0.59 0.84 1.00  -2  974029   NA 1.00189 0.93    NA      NA      NA     1.5
## 6 0.72 0.92  NaN  -5  959575   NA 0.00015 1.00    NA      NA      NA     1.3
## 7 0.72 0.92  NaN  -7  959553   NA 0.00015 1.00    NA      NA      NA     1.3
## 8 0.72 0.92   NA  -8  959532   NA 0.00015 1.00    NA      NA      NA     1.3
##    eChisq    SRMR eCRMS  eBIC
## 1 4.4e+04 0.16313 0.193 43876
## 2 1.6e+04 0.09716 0.143 15504
## 3 7.3e+03 0.06658 0.133  7272
## 4 7.9e+02 0.02185 0.082   770
## 5 8.7e+01 0.00726    NA    NA
## 6 9.1e-01 0.00074    NA    NA
## 7 9.1e-01 0.00074    NA    NA
## 8 9.1e-01 0.00074    NA    NA

On the parallel analysis plot there are 4 “+” marks above the line –> 4 factors
The VSS (very simple structure) plot also points to 4 as the best number of factors

5. Extracting principal components from raw data

When cor = TRUE, the correlation matrix is used instead of the covariance matrix.
5.1 For reporting variance explained:

## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     1.7077973 1.2685701 1.0566932 1.0436648 1.0006335
## Proportion of Variance 0.3645715 0.2011588 0.1395751 0.1361545 0.1251584
## Cumulative Proportion  0.3645715 0.5657302 0.7053053 0.8414598 0.9666183
##                            Comp.6       Comp.7 Comp.8
## Standard deviation     0.51677259 1.684227e-07      0
## Proportion of Variance 0.03338174 3.545776e-15      0
## Cumulative Proportion  1.00000000 1.000000e+00      1

1. In the above output, the proportion of variance explained by component 1 is 0.365 (36.5%).
2. In the above output, the proportion of variance explained by component 2 is 0.20 (20%).
3. In the above output, the cumulative proportion of variance explained by the first 4 components is 0.8414598 (84%).
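
The proportions in this table follow directly from the standard deviations: each component's variance is its standard deviation squared, and with eight standardized variables the variances sum to 8. The sketch below recomputes the table from the printed standard deviations, so the results are accurate only to the printed precision.

```r
# Standard deviations copied from the "Importance of components" output.
sdev <- c(1.7077973, 1.2685701, 1.0566932, 1.0436648,
          1.0006335, 0.51677259, 1.684227e-07, 0)

eigenvalues <- sdev^2                         # variance of each component
prop_var    <- eigenvalues / sum(eigenvalues) # proportion of variance
cum_var     <- cumsum(prop_var)               # cumulative proportion

round(eigenvalues[1], 2)  # variance carried by component 1
round(prop_var[1:2], 3)   # proportions for components 1 and 2
round(cum_var[4], 2)      # cumulative proportion for the first 4 components
```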

5.2 for reporting principal component loadings :

princomp(factor_Analysis_data,cor = TRUE) %>% loadings() # for reporting principal component loadings

## 
## Loadings:
##                    Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## propertyType_house  0.571                              0.414         0.707
## propertyType_unit  -0.571                             -0.414         0.707
## bedrooms_0                                      0.999                     
## bedrooms_1         -0.317        -0.184 -0.745         0.483  0.274       
## bedrooms_2         -0.405         0.153  0.619         0.521  0.393       
## bedrooms_3          0.171  0.727 -0.214               -0.217  0.590       
## bedrooms_4          0.222 -0.684 -0.269               -0.280  0.575       
## bedrooms_5                        0.906 -0.242        -0.162  0.299       
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.125  0.125  0.125  0.125  0.125  0.125  0.125  0.125
## Cumulative Var  0.125  0.250  0.375  0.500  0.625  0.750  0.875  1.000

These are the loadings on each component. Since the SS loadings equal 1 for every component here, a more advanced package is used later.

This is the scree plot (elbow curve) –> the number of points above 1 gives the number of factors –> 4

Variance explained by each component

6. Principal components analysis:

The principal function produces output that is much easier to understand than that of the princomp function.

## Principal Components Analysis
## Call: psych::principal(r = factor_Analysis_data, nfactors = 8, rotate = "varimax", 
##     cor = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
##                      RC1   RC2   RC3   RC4   RC6   RC5 RC7 RC8 h2       u2 com
## propertyType_house -0.95  0.02  0.04 -0.22 -0.23 -0.01   0   0  1  0.0e+00 1.2
## propertyType_unit   0.95 -0.02 -0.04  0.22  0.23  0.01   0   0  1  1.3e-15 1.2
## bedrooms_0          0.01  0.00  0.00 -0.01 -0.01  1.00   0   0  1  4.4e-15 1.0
## bedrooms_1          0.33  0.01 -0.02  0.94 -0.12 -0.01   0   0  1 -1.6e-15 1.3
## bedrooms_2          0.46  0.02 -0.05 -0.15  0.87 -0.01   0   0  1  0.0e+00 1.6
## bedrooms_3         -0.19 -0.89 -0.24 -0.17 -0.27 -0.03   0   0  1  2.0e-15 1.5
## bedrooms_4         -0.24  0.90 -0.23 -0.15 -0.25 -0.03   0   0  1  2.6e-15 1.5
## bedrooms_5         -0.07  0.00  1.00 -0.02 -0.03  0.00   0   0  1  2.0e-15 1.0
## 
##                        RC1  RC2  RC3  RC4  RC6  RC5 RC7 RC8
## SS loadings           2.21 1.61 1.11 1.05 1.02 1.00   0   0
## Proportion Var        0.28 0.20 0.14 0.13 0.13 0.13   0   0
## Cumulative Var        0.28 0.48 0.62 0.75 0.87 1.00   1   1
## Proportion Explained  0.28 0.20 0.14 0.13 0.13 0.13   0   0
## Cumulative Proportion 0.28 0.48 0.62 0.75 0.87 1.00   1   1
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 8 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0 
##  with the empirical chi square  0  with prob <  NA 
## 
## Fit based upon off diagonal values = 1

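For the unrotated first component, the eigenvalue (SS loading) is 2.92 (= 1.708²), meaning that it accounts for 2.92 units of the total variance of 8 (one unit per standardized variable), which is 2.92/8 = 36% of the variance, and so on for the other components. After varimax rotation the variance is redistributed, which is why the output above shows RC1 with SS loadings of 2.21 (28%).
The cumulative variance explained by the first 4 unrotated components is 84% (75% for the first four rotated components above).
We now run the analysis again, limiting the number of components to 4.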

## Principal Components Analysis
## Call: psych::principal(r = factor_Analysis_data, nfactors = 4, rotate = "varimax", 
##     cor = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
##                      RC1   RC2   RC4   RC3     h2     u2 com
## propertyType_house -0.84  0.04 -0.49 -0.02 0.9541 0.0459 1.6
## propertyType_unit   0.84 -0.04  0.49  0.02 0.9541 0.0459 1.6
## bedrooms_0          0.01  0.00  0.02  0.03 0.0012 0.9988 2.0
## bedrooms_1          0.09  0.01  0.96  0.06 0.9368 0.0632 1.0
## bedrooms_2          0.93  0.03 -0.25  0.02 0.9272 0.0728 1.1
## bedrooms_3         -0.29 -0.90 -0.12 -0.29 0.9873 0.0127 1.5
## bedrooms_4         -0.31  0.90 -0.11 -0.26 0.9789 0.0211 1.5
## bedrooms_5         -0.13 -0.01 -0.10  0.98 0.9919 0.0081 1.1
## 
##                        RC1  RC2  RC4  RC3
## SS loadings           2.49 1.61 1.51 1.12
## Proportion Var        0.31 0.20 0.19 0.14
## Cumulative Var        0.31 0.51 0.70 0.84
## Proportion Explained  0.37 0.24 0.22 0.17
## Cumulative Proportion 0.37 0.61 0.83 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 4 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.03 
##  with the empirical chi square  1839.41  with prob <  0 
## 
## Fit based upon off diagonal values = 0.99

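The fit is based on the off-diagonal values, that is, the entries of the correlation matrix that lie off its main diagonal (the line joining the top-left and bottom-right corners). Compare the two outputs to see how the results change when nfactors is reduced to 4: the fit drops from 1 to 0.99, the RMSR rises from 0 to 0.03, and bedrooms_0 is left almost entirely unexplained (h2 = 0.0012).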

Author : Omar Soub

Date : 27-07/2024
