Introduction

  • In this chapter, we use the R programming language to run an inferential analysis of the data and extract as much as we can from it to support decision making.

Data Set and Data types view


Data set (first six rows):

datesold price propertyType bedrooms
2007-02-07 525000 house 4
2007-02-27 290000 house 3
2007-03-07 328000 house 3
2007-03-09 380000 house 4
2007-03-21 310000 house 3
2007-04-04 465000 house 4

Data set dimensions (rows, columns):

## [1] 29580     4

Data types and some details using the glimpse function

df %>% glimpse()
## Rows: 29,580
## Columns: 4
## $ datesold     <dttm> 2007-02-07, 2007-02-27, 2007-03-07, 2007-03-09, 2007-03-…
## $ price        <dbl> 525000, 290000, 328000, 380000, 310000, 465000, 399000, 1…
## $ propertyType <chr> "house", "house", "house", "house", "house", "house", "ho…
## $ bedrooms     <dbl> 4, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 4, 4, 4, 5, 3, 5, 4, …

Data summary


Numerical variables EDA: price variable

##      price        
##  Min.   :  56500  
##  1st Qu.: 440000  
##  Median : 550000  
##  Mean   : 609736  
##  3rd Qu.: 705000  
##  Max.   :8000000

Numerical variables distribution


Categorical variables EDA

We have two categorical variables: propertyType and bedrooms.

property Type

  • property Type variable in details
## Frequencies  
## df$propertyType  
## Type: Character  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##       house   24552     83.00          83.00     83.00          83.00
##        unit    5028     17.00         100.00     17.00         100.00
##        <NA>       0                               0.00         100.00
##       Total   29580    100.00         100.00    100.00         100.00

Insights:

  • 1. Houses are far more frequent than units.

  • 2. Houses make up 83% of the records, while units make up 17%.


  • property Type pie3D plot


bedrooms

  • bedrooms variable in details
## Frequencies  
## df$bedrooms  
## Type: Numeric  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##           0      30      0.10           0.10      0.10           0.10
##           1    1627      5.50           5.60      5.50           5.60
##           2    3598     12.16          17.77     12.16          17.77
##           3   11933     40.34          58.11     40.34          58.11
##           4   10442     35.30          93.41     35.30          93.41
##           5    1950      6.59         100.00      6.59         100.00
##        <NA>       0                               0.00         100.00
##       Total   29580    100.00         100.00    100.00         100.00

Insights:

  • 1. 3-bedroom properties have the highest frequency (40.34%), while 0-bedroom properties have the lowest (0.10%).

  • 2. Together, 3- and 4-bedroom properties account for most of the data set (75.64%).


  • bedrooms pie3D


Combining property Type and bedrooms variables

##        
##             0     1     2     3     4     5   Sum
##   house    19    95   806 11281 10404  1947 24552
##   unit     11  1532  2792   652    38     3  5028
##   Sum      30  1627  3598 11933 10442  1950 29580

Insights:

  • Among houses, 3 bedrooms is the most frequent combination (11,281) and 0 bedrooms the least frequent (19).

  • Among units, 2 bedrooms is the most frequent combination (2,792) and 5 bedrooms the least frequent (3).
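
The cross-tabulation and the percentages quoted earlier can be reproduced from the counts alone. The sketch below rebuilds the table from the printed counts (the original data frame is not needed); the object names `counts`, `with_margins`, `type_share` and `most_frequent` are illustrative and not taken from the original analysis.

```r
# Observed counts copied from the cross-tabulation printed above
counts <- matrix(
  c(19,   95,  806, 11281, 10404, 1947,
    11, 1532, 2792,   652,    38,    3),
  nrow = 2, byrow = TRUE,
  dimnames = list(propertyType = c("house", "unit"),
                  bedrooms = as.character(0:5))
)

# Add the "Sum" row and column, as in the printed table
with_margins <- addmargins(counts)

# Share of each property type in percent
type_share <- round(100 * prop.table(rowSums(counts)), 2)

# Most frequent bedroom count within each property type
most_frequent <- apply(counts, 1, function(row) names(which.max(row)))

print(with_margins)
print(type_share)
print(most_frequent)
```

On the full data frame, the same table would typically come from something like `addmargins(table(df$propertyType, df$bedrooms))`, though the call actually used is not shown in the report.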

mosaic plot

using ggplot2


Total price by property type

## df$propertyType
##       house        unit 
## 15908618866  2127379770

Visualizing Price By property Type using ggplot2


Total price by bedrooms

## as.factor(df$bedrooms)
##          0          1          2          3          4          5 
##   16269000  546660458 1591118057 6590648741 7499143065 1792159315
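
The figures above are group totals (sums of price), not averages. A minimal sketch of how such totals can be computed, assuming `tapply` with `sum` (the call actually used is not shown in the report), applied only to the six rows printed at the top of this chapter; `df_head` is an illustrative name:

```r
# The first six rows of the data set, as printed earlier
df_head <- data.frame(
  datesold = as.Date(c("2007-02-07", "2007-02-27", "2007-03-07",
                       "2007-03-09", "2007-03-21", "2007-04-04")),
  price = c(525000, 290000, 328000, 380000, 310000, 465000),
  propertyType = "house",
  bedrooms = c(4, 3, 3, 4, 3, 4)
)

# Total price per property type and per bedroom count
total_by_type <- tapply(df_head$price, df_head$propertyType, sum)
total_by_bedrooms <- tapply(df_head$price, as.factor(df_head$bedrooms), sum)

print(total_by_type)
print(total_by_bedrooms)
```

Replacing `sum` with `median` would give a per-group summary that is not driven by how many properties fall in each group.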

Visualizing Price By bedrooms using ggplot2


Total price by property type and bedrooms combined

##                as.factor(df$bedrooms)
## df$propertyType          0          1          2          3          4
##           house   12870500   33189450  391554640 6207526896 7474757065
##           unit     3398500  513471008 1199563417  383121845   24386000
##                as.factor(df$bedrooms)
## df$propertyType          5
##           house 1788720315
##           unit     3439000

Visualizing Price By property Type and bedrooms using ggplot2


Inferential Analysis

Choosing the right statistical tests

To decide which statistical tests to apply, we need to check two properties of the target variable (price):

1. Normality

2. Homogeneity of variances

  • Normality check of the price variable by visualization


  • Normality check of the price variable using the Anderson-Darling normality test
## 
##  Anderson-Darling normality test
## 
## data:  df$price
## A = 1180.2, p-value < 2.2e-16

  • Homogeneity-of-variances check of the price variable using Levene’s test
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group    11  112.69 < 2.2e-16 ***
##       29568                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the results above:

  • 1. Normality: the Anderson-Darling test (p-value < 0.05) rejects normality, so price is not normally distributed.

  • 2. Homogeneity of variances: Levene’s test (p-value < 0.05) rejects equal variances across the 12 groups, so the variances are not homogeneous.

Given these results, we will use non-parametric statistical tests.
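
Levene’s test with center = median, as reported above, is equivalent to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below shows this in base R on simulated data (not the housing data); the group labels, sample sizes and spreads are assumptions chosen only to make the variances clearly unequal.

```r
set.seed(1)

# Simulated prices for two groups with deliberately different spreads
price <- c(rnorm(500, mean = 500, sd = 50),
           rnorm(500, mean = 400, sd = 150))
group <- factor(rep(c("house", "unit"), each = 500))

# Absolute deviation of each observation from its own group median
abs_dev <- abs(price - ave(price, group, FUN = median))

# One-way ANOVA on the deviations is Levene's test (center = median)
levene_fit <- anova(lm(abs_dev ~ group))
levene_p <- levene_fit[["Pr(>F)"]][1]

print(levene_fit)
```

A small p-value here, as in the report, means the groups do not share a common variance, which is one of the two reasons for preferring rank-based tests.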


Applying non-parametric statistical tests

A. wilcox.test

  • A.1 wilcox.test on price against property Type
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  df$price by propertyType
## W = 102291684, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

After applying the test, we found that:

The p-value is less than 0.05, so the price distribution differs between the two property types.

  • A.2 Visualization

Whether this difference is large or small can be judged from the rank-biserial correlation, which is 0.66:

## [1] "very large"
## (Rules: funder2019)

So the difference in price between property types is very large under this rule.
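
The rank-biserial value of 0.66 can be recovered from the Wilcoxon output itself. The sketch below assumes that W is the Mann-Whitney U statistic of the first group (house), which is how R's `wilcox.test` reports it, and uses the usual formula r = 2U / (n1 × n2) − 1; the variable names are illustrative.

```r
# Values copied from the outputs in this chapter
W  <- 102291684   # test statistic from wilcox.test
n1 <- 24552       # number of houses
n2 <- 5028        # number of units

# Rank-biserial correlation: share of (house, unit) pairs in which the
# house is more expensive, minus the share in which the unit is
r_rb <- 2 * W / (n1 * n2) - 1

print(round(r_rb, 2))
```

Read this way, a house is priced above a unit in roughly 83% of all house-unit pairs, which is what the "very large" label summarises.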

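  • A.3 wilcox_effsize (r = Z/√N, a different effect-size measure from the rank-biserial correlation, which is why the magnitude label differs):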
## # A tibble: 1 × 7
##   .y.   group1 group2 effsize    n1    n2 magnitude
## * <chr> <chr>  <chr>    <dbl> <int> <int> <ord>    
## 1 price house  unit     0.428 24552  5028 moderate

B. kruskal.test

  • B.1 kruskal.test on price against bedrooms, since bedrooms has more than 2 levels
## 
##  Kruskal-Wallis rank sum test
## 
## data:  df$price and as.factor(bedrooms)
## Kruskal-Wallis chi-squared = 12050, df = 5, p-value < 2.2e-16

After applying the test, we found that:

The p-value is less than 0.05, so the price distribution differs between at least some of the bedroom groups.

  • B.2 Visualization

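Whether this difference is large or small can be judged from the reported effect size, 0.41. (The rank-biserial correlation applies only to two groups; with six bedroom groups the value is an eta-squared-type statistic, consistent with the 0.407 in B.3.)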

## [1] "very large"
## (Rules: funder2019)

So the difference in price between bedroom groups is very large under this rule.

  • B.3 kruskal_effsize :
## # A tibble: 1 × 5
##   .y.       n effsize method  magnitude
## * <chr> <int>   <dbl> <chr>   <ord>    
## 1 price 29580   0.407 eta2[H] large
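
The eta2[H] value can be checked by hand from the Kruskal-Wallis statistic with the formula eta2[H] = (H − k + 1) / (n − k), where k is the number of groups and n the number of observations. This sketch uses the rounded H printed above, so it matches the reported 0.407 only approximately.

```r
# Values copied from the Kruskal-Wallis output above
H <- 12050    # Kruskal-Wallis chi-squared (rounded as printed)
k <- 6        # bedroom groups: 0, 1, 2, 3, 4, 5
n <- 29580    # number of observations

# Eta-squared based on the H statistic
eta2_H <- (H - k + 1) / (n - k)

print(round(eta2_H, 3))
```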
  • B.4 To examine these differences in detail, we perform Dunn’s test with Holm correction of the p-values:
##    Comparison           Z       P.unadj         P.adj
## 1       0 - 1   6.2532710  4.019437e-10  1.607775e-09
## 2       0 - 2   3.2246371  1.261325e-03  2.522649e-03
## 3       1 - 2 -18.7770577  1.163514e-78  6.981085e-78
## 4       0 - 3  -0.1353723  8.923175e-01  8.923175e-01
## 5       1 - 3 -44.5330056  0.000000e+00  0.000000e+00
## 6       2 - 3 -32.3845603 4.527976e-230 3.622381e-229
## 7       0 - 4  -4.5323069  5.834301e-06  1.750290e-05
## 8       1 - 4 -74.3186748  0.000000e+00  0.000000e+00
## 9       2 - 4 -73.4484829  0.000000e+00  0.000000e+00
## 10      3 - 4 -59.9929036  0.000000e+00  0.000000e+00
## 11      0 - 5  -7.1345697  9.709041e-13  4.854521e-12
## 12      1 - 5 -73.4043100  0.000000e+00  0.000000e+00
## 13      2 - 5 -67.7003098  0.000000e+00  0.000000e+00
## 14      3 - 5 -52.7238080  0.000000e+00  0.000000e+00
## 15      4 - 5 -19.6152541  1.145630e-85  8.019412e-85
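
The P.adj column can be reproduced from P.unadj with base R's `p.adjust`. The sketch below copies the unadjusted p-values from the table above and applies the Holm method; `p_unadj`, `p_holm` and `not_significant` are illustrative names.

```r
# Unadjusted p-values copied from the Dunn's test table above
p_unadj <- c(
  "0 - 1" = 4.019437e-10, "0 - 2" = 1.261325e-03, "1 - 2" = 1.163514e-78,
  "0 - 3" = 8.923175e-01, "1 - 3" = 0,            "2 - 3" = 4.527976e-230,
  "0 - 4" = 5.834301e-06, "1 - 4" = 0,            "2 - 4" = 0,
  "3 - 4" = 0,            "0 - 5" = 9.709041e-13, "1 - 5" = 0,
  "2 - 5" = 0,            "3 - 5" = 0,            "4 - 5" = 1.145630e-85
)

# Holm: the i-th smallest p-value is multiplied by (m - i + 1), with m = 15,
# and the results are forced to be non-decreasing
p_holm <- p.adjust(p_unadj, method = "holm")

# Comparisons that are not significant at the 5% level after correction
not_significant <- names(p_holm)[p_holm > 0.05]

print(p_holm)
print(not_significant)
```

Only the 0 vs 3 bedrooms comparison fails to reach significance after correction; every other pair of bedroom groups differs in price distribution.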

C. chisq.test

  • C.1 Is there a significant association between property type and as.factor(bedrooms)?
## 
##  Pearson's Chi-squared test
## 
## data:  table(propertyType, as.factor(bedrooms))
## X-squared = 19805, df = 5, p-value < 2.2e-16

After applying the test, we found that:

The p-value is less than 0.05, so there is a significant association between property type and number of bedrooms.

  • C.2 If any cell of chi$expected (the expected frequencies, not the observed counts) is less than 5, Fisher’s exact test should be applied instead of the chi-squared test.
##             
## propertyType         0        1        2        3        4         5
##        house 24.900609 1350.443 2986.413 9904.632 8667.072 1618.5396
##        unit   5.099391  276.557  611.587 2028.368 1774.928  331.4604

None of the cells of chi$expected is less than 5 (the smallest is 5.10), so the chi-squared test we applied is sufficient.
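
Each expected frequency is (row total × column total) / grand total. The sketch below recomputes the expected table and the test from the observed counts printed earlier; it rebuilds the table by hand rather than from the original data frame, and the object names are illustrative.

```r
# Observed counts copied from the cross-tabulation earlier in this chapter
counts <- matrix(
  c(19,   95,  806, 11281, 10404, 1947,
    11, 1532, 2792,   652,    38,    3),
  nrow = 2, byrow = TRUE,
  dimnames = list(propertyType = c("house", "unit"),
                  bedrooms = as.character(0:5))
)

# Expected counts under independence, computed by hand
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# The same test as above, run on the table of counts
chi <- chisq.test(counts)

# Rule of thumb used in this chapter: every expected count should be >= 5
all_cells_ok <- all(expected >= 5)

print(round(expected, 3))
print(chi)
```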


Exploratory factor analysis

1. Cronbach’s alpha

  • A statistic used to measure the internal consistency (reliability) of a set of scale items.
  • The alpha coefficient normally ranges from 0 to 1.
  • alpha = 0 means the scale items are entirely independent of one another: they are uncorrelated and share no covariance.
  • alpha is higher the more the scale items depend on one another, i.e. when they are correlated and have high covariances.
  • An alpha of about .65 to .8 or higher is usually considered good; an alpha below .5 is usually not acceptable, especially for a scale meant to be unidimensional.
## # A tibble: 6 × 8
##   propertyType_house propertyType_unit bedrooms_0 bedrooms_1 bedrooms_2
##                <int>             <int>      <int>      <int>      <int>
## 1                  1                 0          0          0          0
## 2                  1                 0          0          0          0
## 3                  1                 0          0          0          0
## 4                  1                 0          0          0          0
## 5                  1                 0          0          0          0
## 6                  1                 0          0          0          0
## # ℹ 3 more variables: bedrooms_3 <int>, bedrooms_4 <int>, bedrooms_5 <int>
## 
## Reliability analysis   
## Call: psych::alpha(x = factor_Analysis_data, check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r   S/N    ase mean   sd median_r
##       0.61      -5.5    -4.3     -0.12 -0.85 0.0032 0.34 0.18   -0.094
## 
##     95% confidence boundaries 
##          lower alpha upper
## Feldt      0.6  0.61  0.62
## Duhachek   0.6  0.61  0.61
## 
##  Reliability if an item is dropped:
##                     raw_alpha std.alpha G6(smc) average_r   S/N alpha se var.r
## propertyType_house-      0.35      -1.4   -0.78    -0.090 -0.58   0.0057 0.067
## propertyType_unit        0.35      -4.1   -2.68    -0.130 -0.80   0.0057 0.058
## bedrooms_0               0.62     -14.1   -2.84    -0.154 -0.93   0.0033 0.140
## bedrooms_1               0.56      -4.4   -1.65    -0.132 -0.82   0.0036 0.118
## bedrooms_2               0.49      -3.0   -1.33    -0.120 -0.75   0.0043 0.104
## bedrooms_3-              0.70      -1.5   -0.83    -0.093 -0.60   0.0022 0.122
## bedrooms_4-              0.69      -1.6   -0.89    -0.097 -0.62   0.0023 0.119
## bedrooms_5-              0.65      -4.1   -1.63    -0.130 -0.80   0.0031 0.141
##                      med.r
## propertyType_house- -0.099
## propertyType_unit   -0.090
## bedrooms_0          -0.196
## bedrooms_1          -0.099
## bedrooms_2          -0.064
## bedrooms_3-         -0.064
## bedrooms_4-         -0.064
## bedrooms_5-         -0.090
## 
##  Item statistics 
##                         n raw.r std.r r.cor  r.drop  mean    sd
## propertyType_house- 29580 0.951 -0.35   NaN  0.9119 0.170 0.376
## propertyType_unit   29580 0.951  0.35   NaN  0.9119 0.170 0.376
## bedrooms_0          29580 0.045  0.79   NaN  0.0230 0.001 0.032
## bedrooms_1          29580 0.533  0.39   NaN  0.4055 0.055 0.228
## bedrooms_2          29580 0.737  0.19   NaN  0.6028 0.122 0.327
## bedrooms_3-         29580 0.334 -0.30   NaN -0.0064 0.597 0.491
## bedrooms_4-         29580 0.352 -0.24   NaN  0.0224 0.647 0.478
## bedrooms_5-         29580 0.128  0.35   NaN -0.0449 0.934 0.248
## 
## Non missing response frequency for each item
##                       0    1 miss
## propertyType_house 0.17 0.83    0
## propertyType_unit  0.83 0.17    0
## bedrooms_0         1.00 0.00    0
## bedrooms_1         0.94 0.06    0
## bedrooms_2         0.88 0.12    0
## bedrooms_3         0.60 0.40    0
## bedrooms_4         0.65 0.35    0
## bedrooms_5         0.93 0.07    0
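
The two extreme cases described in the bullets above can be checked with the textbook formula alpha = k/(k − 1) × (1 − Σ item variances / variance of the total score). This is a minimal base-R sketch on toy items, not a reimplementation of `psych::alpha`; the function name `cronbach_alpha` and the toy data are illustrative.

```r
# Raw Cronbach's alpha from a matrix with one column per item
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)                      # number of items
  item_vars <- apply(items, 2, var)     # variance of each item
  total_var <- var(rowSums(items))      # variance of the summed score
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Two identical items: entirely dependent on one another
identical_items <- cbind(a = c(1, 2, 3, 4), b = c(1, 2, 3, 4))

# Two items with zero covariance
uncorrelated_items <- cbind(a = c(1, -1, 1, -1), b = c(1, 1, -1, -1))

alpha_identical <- cronbach_alpha(identical_items)        # 1
alpha_uncorrelated <- cronbach_alpha(uncorrelated_items)  # 0

print(c(identical = alpha_identical, uncorrelated = alpha_uncorrelated))
```

When items are negatively correlated, as one-hot indicator columns of the same variable are, the sum of item variances can exceed the total variance and the formula goes negative, which is one plausible reading of the negative std.alpha values in the output above.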

2. cortest.bartlett

  • the observed correlation matrix is compared with the identity matrix using cortest.bartlett
  • H0:observed matrix is identity matrix
  • H1:observed matrix is not an identity matrix
## $chisq
## [1] Inf
## 
## $p.value
## [1] 0
## 
## $df
## [1] 28

The p-value is less than 0.05, so we reject H0 and conclude that the observed correlation matrix is not an identity matrix.


3. measure of sampling adequacy

  • Provides an index between 0 and 1 of the proportion of variance among the variables that might be common variance.
  • See the KMO value table and its remarks for interpretation.
  • The MSA should not be less than 0.5, both overall and for each item.
## Error in solve.default(r) : 
##   Lapack routine dgesv: system is exactly singular: U[2,2] = 0
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = factor_Analysis_data)
## Overall MSA =  0.5
## MSA for each item = 
## propertyType_house  propertyType_unit         bedrooms_0         bedrooms_1 
##                0.5                0.5                0.5                0.5 
##         bedrooms_2         bedrooms_3         bedrooms_4         bedrooms_5 
##                0.5                0.5                0.5                0.5
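
The "system is exactly singular" error above is what one would expect from one-hot encoded data: the indicator columns of a single variable always sum to 1, so they are linearly dependent and their correlation matrix cannot be inverted. The sketch below shows this on a small made-up sample (six rows, not the housing data); all names are illustrative.

```r
# Small made-up sample
type <- c("house", "house", "unit", "unit", "house", "unit")
beds <- c(3, 4, 2, 3, 4, 2)

# One indicator column per level, as in factor_Analysis_data
dummies <- cbind(
  house      = as.integer(type == "house"),
  unit       = as.integer(type == "unit"),
  bedrooms_2 = as.integer(beds == 2),
  bedrooms_3 = as.integer(beds == 3),
  bedrooms_4 = as.integer(beds == 4)
)

R <- cor(dummies)

# house and unit are exact complements, so their correlation is -1
house_unit_cor <- R["house", "unit"]

# The rank of R is below its dimension, so R has no inverse
rank_R <- qr(R)$rank

print(round(R, 2))
print(c(rank = rank_R, columns = ncol(R)))
```

Dropping one indicator column per variable (for example propertyType_unit and one bedrooms column) is a common way to remove this dependency; whether that suits the analysis here is a separate modelling decision.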

4. Determining the number of factors using parallel analysis and Very Simple Structure

## Parallel analysis suggests that the number of factors =  6  and the number of components =  5

## 
## Very Simple Structure
## Call: psych::vss(x = factor_Analysis_data)
## Although the VSS complexity 1 shows  6  factors, it is probably more reasonable to think about  2  factors
## VSS complexity 2 achieves a maximimum of 0.92  with  6  factors
## 
## The Velicer MAP achieves a minimum of 0.11  with  1  factors 
## BIC achieves a minimum of  635693.4  with  2  factors
## Sample Size adjusted BIC achieves a minimum of  635734.8  with  2  factors
## 
## Statistics by number of factors 
##   vss1 vss2  map dof   chisq prob sqresid  fit RMSEA     BIC   SABIC complex
## 1 0.56 0.00 0.11  20  662092    0 6.37453 0.56   1.1  661886  661950     1.0
## 2 0.66 0.74 0.22  13  635827    0 3.80636 0.74   1.3  635693  635735     1.2
## 3 0.60 0.79 0.33   7 1154228    0 2.60192 0.82   2.4 1154156 1154178     1.5
## 4 0.66 0.82 0.72   2 1129301    0 1.39513 0.90   4.4 1129281 1129287     1.4
## 5 0.59 0.84 1.00  -2  974029   NA 1.00189 0.93    NA      NA      NA     1.5
## 6 0.72 0.92  NaN  -5  959575   NA 0.00015 1.00    NA      NA      NA     1.3
## 7 0.72 0.92  NaN  -7  959553   NA 0.00015 1.00    NA      NA      NA     1.3
## 8 0.72 0.92   NA  -8  959532   NA 0.00015 1.00    NA      NA      NA     1.3
##    eChisq    SRMR eCRMS  eBIC
## 1 4.4e+04 0.16313 0.193 43876
## 2 1.6e+04 0.09716 0.143 15504
## 3 7.3e+03 0.06658 0.133  7272
## 4 7.9e+02 0.02185 0.082   770
## 5 8.7e+01 0.00726    NA    NA
## 6 9.1e-01 0.00074    NA    NA
## 7 9.1e-01 0.00074    NA    NA
## 8 9.1e-01 0.00074    NA    NA
  • On the parallel-analysis plot, 4 points marked “+” lie above the line, which we read as 4 factors (the printed suggestion above is 6 factors and 5 components).
  • On the VSS (Very Simple Structure) plot, 4 factors also appears to be the best choice.

5. Extracting principal components from raw data

  • When cor = TRUE, the correlation matrix is used instead of the covariance matrix.

  • 5.1 for reporting variance explained :

## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     1.7077973 1.2685701 1.0566932 1.0436648 1.0006335
## Proportion of Variance 0.3645715 0.2011588 0.1395751 0.1361545 0.1251584
## Cumulative Proportion  0.3645715 0.5657302 0.7053053 0.8414598 0.9666183
##                            Comp.6       Comp.7 Comp.8
## Standard deviation     0.51677259 1.684227e-07      0
## Proportion of Variance 0.03338174 3.545776e-15      0
## Cumulative Proportion  1.00000000 1.000000e+00      1
  • 1. In the output above, the proportion of variance explained by Comp.1 is 0.365 (36%).
  • 2. The proportion of variance explained by Comp.2 is 0.201 (20%).
  • 3. The cumulative proportion of variance explained by the first 4 components is 0.8414598 (84%).
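
These proportions follow directly from the standard deviations in the same output: each eigenvalue is the squared standard deviation, and with a correlation matrix of 8 variables the eigenvalues add up to 8. A short sketch using the printed values (variable names are illustrative):

```r
# Standard deviations copied from the "Importance of components" output
sdev <- c(1.7077973, 1.2685701, 1.0566932, 1.0436648, 1.0006335,
          0.51677259, 1.684227e-07, 0)

eigenvalues <- sdev^2            # variance carried by each component
p <- length(sdev)                # 8 variables, so total variance is 8
prop_var <- eigenvalues / p      # proportion of variance per component
cum_var <- cumsum(prop_var)      # cumulative proportion

# Components with eigenvalue above 1; note that Comp.5 is about 1.001,
# so it sits almost exactly on the cutoff
n_above_one <- sum(eigenvalues > 1)

print(round(eigenvalues, 3))
print(round(100 * cum_var, 1))
print(n_above_one)
```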

  • 5.2 for reporting principal component loadings :
princomp(factor_Analysis_data,cor = TRUE) %>% loadings() # for reporting principal component loadings
## 
## Loadings:
##                    Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## propertyType_house  0.571                              0.414         0.707
## propertyType_unit  -0.571                             -0.414         0.707
## bedrooms_0                                      0.999                     
## bedrooms_1         -0.317        -0.184 -0.745         0.483  0.274       
## bedrooms_2         -0.405         0.153  0.619         0.521  0.393       
## bedrooms_3          0.171  0.727 -0.214               -0.217  0.590       
## bedrooms_4          0.222 -0.684 -0.269               -0.280  0.575       
## bedrooms_5                        0.906 -0.242        -0.162  0.299       
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.125  0.125  0.125  0.125  0.125  0.125  0.125  0.125
## Cumulative Var  0.125  0.250  0.375  0.500  0.625  0.750  0.875  1.000

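These are the loadings on each component. Since princomp reports SS loadings equal to 1 for every component, we will use a more advanced package later.

This is the scree plot (the kinked, elbow-shaped curve): the components whose eigenvalue lies clearly above 1 give the number of factors, here 4 (Comp.5 sits almost exactly at 1).

Variance explained by each component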

6. Principal components analysis:

  • The principal function (psych package) produces output that is much easier to read than that of princomp.
## Principal Components Analysis
## Call: psych::principal(r = factor_Analysis_data, nfactors = 8, rotate = "varimax", 
##     cor = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
##                      RC1   RC2   RC3   RC4   RC6   RC5 RC7 RC8 h2       u2 com
## propertyType_house -0.95  0.02  0.04 -0.22 -0.23 -0.01   0   0  1  0.0e+00 1.2
## propertyType_unit   0.95 -0.02 -0.04  0.22  0.23  0.01   0   0  1  1.3e-15 1.2
## bedrooms_0          0.01  0.00  0.00 -0.01 -0.01  1.00   0   0  1  4.4e-15 1.0
## bedrooms_1          0.33  0.01 -0.02  0.94 -0.12 -0.01   0   0  1 -1.6e-15 1.3
## bedrooms_2          0.46  0.02 -0.05 -0.15  0.87 -0.01   0   0  1  0.0e+00 1.6
## bedrooms_3         -0.19 -0.89 -0.24 -0.17 -0.27 -0.03   0   0  1  2.0e-15 1.5
## bedrooms_4         -0.24  0.90 -0.23 -0.15 -0.25 -0.03   0   0  1  2.6e-15 1.5
## bedrooms_5         -0.07  0.00  1.00 -0.02 -0.03  0.00   0   0  1  2.0e-15 1.0
## 
##                        RC1  RC2  RC3  RC4  RC6  RC5 RC7 RC8
## SS loadings           2.21 1.61 1.11 1.05 1.02 1.00   0   0
## Proportion Var        0.28 0.20 0.14 0.13 0.13 0.13   0   0
## Cumulative Var        0.28 0.48 0.62 0.75 0.87 1.00   1   1
## Proportion Explained  0.28 0.20 0.14 0.13 0.13 0.13   0   0
## Cumulative Proportion 0.28 0.48 0.62 0.75 0.87 1.00   1   1
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 8 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0 
##  with the empirical chi square  0  with prob <  NA 
## 
## Fit based upon off diagonal values = 1
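  • SS loadings are the sums of squared loadings per component, out of a total of 8 (one unit per variable). The first unrotated component accounts for 2.92, i.e. 2.92/8 = 36% of the variance (Comp.1 in the princomp output). After varimax rotation, the table above shows 2.21 for RC1, i.e. 2.21/8 = 28%, and so on.

  • The first 4 unrotated components explain 84% of the variance cumulatively; the first 4 rotated components listed above (RC1, RC2, RC3, RC4) reach 75%.

  • Running the analysis again with the number of components limited to 4: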

## Principal Components Analysis
## Call: psych::principal(r = factor_Analysis_data, nfactors = 4, rotate = "varimax", 
##     cor = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
##                      RC1   RC2   RC4   RC3     h2     u2 com
## propertyType_house -0.84  0.04 -0.49 -0.02 0.9541 0.0459 1.6
## propertyType_unit   0.84 -0.04  0.49  0.02 0.9541 0.0459 1.6
## bedrooms_0          0.01  0.00  0.02  0.03 0.0012 0.9988 2.0
## bedrooms_1          0.09  0.01  0.96  0.06 0.9368 0.0632 1.0
## bedrooms_2          0.93  0.03 -0.25  0.02 0.9272 0.0728 1.1
## bedrooms_3         -0.29 -0.90 -0.12 -0.29 0.9873 0.0127 1.5
## bedrooms_4         -0.31  0.90 -0.11 -0.26 0.9789 0.0211 1.5
## bedrooms_5         -0.13 -0.01 -0.10  0.98 0.9919 0.0081 1.1
## 
##                        RC1  RC2  RC4  RC3
## SS loadings           2.49 1.61 1.51 1.12
## Proportion Var        0.31 0.20 0.19 0.14
## Cumulative Var        0.31 0.51 0.70 0.84
## Proportion Explained  0.37 0.24 0.22 0.17
## Cumulative Proportion 0.37 0.61 0.83 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 4 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.03 
##  with the empirical chi square  1839.41  with prob <  0 
## 
## Fit based upon off diagonal values = 0.99

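“Fit based upon off diagonal values” measures how well the retained components reproduce the off-diagonal entries of the correlation matrix, i.e. the correlations between variables. Compare the two runs to see what changed when nfactors was reduced to 4: the fit drops from 1 to 0.99, the RMSR rises from 0 to 0.03, and bedrooms_0 is barely represented (h2 = 0.0012).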
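
The h2 and SS loadings columns of the 4-component solution can be recovered from the loadings themselves: a communality (h2) is the sum of squared loadings across a row, and an SS loading is the sum of squared loadings down a column. The sketch below uses the loadings as printed (rounded to two decimals), so the results agree with the output only to about two decimals; the object names are illustrative.

```r
# Rotated loadings copied from the 4-component output above
loadings_4 <- matrix(c(
  -0.84,  0.04, -0.49, -0.02,
   0.84, -0.04,  0.49,  0.02,
   0.01,  0.00,  0.02,  0.03,
   0.09,  0.01,  0.96,  0.06,
   0.93,  0.03, -0.25,  0.02,
  -0.29, -0.90, -0.12, -0.29,
  -0.31,  0.90, -0.11, -0.26,
  -0.13, -0.01, -0.10,  0.98),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("propertyType_house", "propertyType_unit", "bedrooms_0", "bedrooms_1",
      "bedrooms_2", "bedrooms_3", "bedrooms_4", "bedrooms_5"),
    c("RC1", "RC2", "RC4", "RC3"))
)

h2 <- rowSums(loadings_4^2)           # communality of each variable
u2 <- 1 - h2                          # uniqueness
ss_loadings <- colSums(loadings_4^2)  # variance carried by each component
prop_var <- ss_loadings / nrow(loadings_4)

print(round(cbind(h2, u2), 3))
print(round(rbind(ss_loadings, prop_var), 2))
```

The very low h2 of bedrooms_0 shows that the four retained components carry almost none of that variable's variance, which fits with only 30 of the 29,580 records having 0 bedrooms.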