| datesold | price | propertyType | bedrooms |
|---|---|---|---|
| 2007-02-07 | 525000 | house | 4 |
| 2007-02-27 | 290000 | house | 3 |
| 2007-03-07 | 328000 | house | 3 |
| 2007-03-09 | 380000 | house | 4 |
| 2007-03-21 | 310000 | house | 3 |
| 2007-04-04 | 465000 | house | 4 |
## [1] 29580 4
df %>% glimpse()
## Rows: 29,580
## Columns: 4
## $ datesold <dttm> 2007-02-07, 2007-02-27, 2007-03-07, 2007-03-09, 2007-03-…
## $ price <dbl> 525000, 290000, 328000, 380000, 310000, 465000, 399000, 1…
## $ propertyType <chr> "house", "house", "house", "house", "house", "house", "ho…
## $ bedrooms <dbl> 4, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 4, 4, 4, 5, 3, 5, 4, …
## price
## Min. : 56500
## 1st Qu.: 440000
## Median : 550000
## Mean : 609736
## 3rd Qu.: 705000
## Max. :8000000
We have two categorical variables: propertyType and bedrooms.
## Frequencies
## df$propertyType
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
## house 24552 83.00 83.00 83.00 83.00
## unit 5028 17.00 100.00 17.00 100.00
## <NA> 0 0.00 100.00
## Total 29580 100.00 100.00 100.00 100.00
Insights:
1. Houses are far more frequent than units.
2. Houses account for 83% of the records, units for 17%.
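The percentages in the frequency table can be reproduced from the two counts alone. A minimal base R sketch; the vector name `counts` is only for illustration, and the numbers are copied from the table above:

```r
# Counts copied from the propertyType frequency table
counts <- c(house = 24552, unit = 5028)

# Share of each property type, in percent
pct <- round(100 * prop.table(counts), 2)
pct
```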
## Frequencies
## df$bedrooms
## Type: Numeric
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
## 0 30 0.10 0.10 0.10 0.10
## 1 1627 5.50 5.60 5.50 5.60
## 2 3598 12.16 17.77 12.16 17.77
## 3 11933 40.34 58.11 40.34 58.11
## 4 10442 35.30 93.41 35.30 93.41
## 5 1950 6.59 100.00 6.59 100.00
## <NA> 0 0.00 100.00
## Total 29580 100.00 100.00 100.00 100.00
Insights:
1. 3-bedroom properties are the most frequent (40.34%), while 0-bedroom properties are the least frequent (0.10%).
2. 3- and 4-bedroom properties together make up most of the data set (75.64%).
##
## 0 1 2 3 4 5 Sum
## house 19 95 806 11281 10404 1947 24552
## unit 11 1532 2792 652 38 3 5028
## Sum 30 1627 3598 11933 10442 1950 29580
Insights:
Houses with 3 bedrooms are the most frequent combination (11,281), while units with 5 bedrooms are the least frequent (3). Among houses, 0 bedrooms is the rarest category (19).
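To locate the largest and smallest cells without reading the table by eye, the cross-tabulation can be flattened and sorted. This is a sketch in base R that re-enters the counts printed above by hand; the object names `tab` and `cells` are assumptions for illustration:

```r
# Cross-tabulation copied from the output above (rows: propertyType, cols: bedrooms)
tab <- matrix(
  c(19,   95,  806, 11281, 10404, 1947,
    11, 1532, 2792,   652,    38,    3),
  nrow = 2, byrow = TRUE,
  dimnames = list(propertyType = c("house", "unit"),
                  bedrooms = as.character(0:5))
)

# One row per cell, so the extremes are easy to pick out
cells <- as.data.frame(as.table(tab))
cells[which.max(cells$Freq), ]
cells[which.min(cells$Freq), ]
```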
mosaic plot
using ggplot2
## df$propertyType
## house unit
## 15908618866 2127379770
Visualizing price by property type using ggplot2
## as.factor(df$bedrooms)
## 0 1 2 3 4 5
## 16269000 546660458 1591118057 6590648741 7499143065 1792159315
Visualizing price by bedrooms using ggplot2
| propertyType | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| house | 12870500 | 33189450 | 391554640 | 6207526896 | 7474757065 | 1788720315 |
| unit | 3398500 | 513471008 | 1199563417 | 383121845 | 24386000 | 3439000 |
## as.factor(df$bedrooms)
## df$propertyType 0 1 2 3 4
## house 12870500 33189450 391554640 6207526896 7474757065
## unit 3398500 513471008 1199563417 383121845 24386000
## as.factor(df$bedrooms)
## df$propertyType 5
## house 1788720315
## unit 3439000
Visualizing price by property type and bedrooms using ggplot2
To decide which statistical tests to apply, we need to check two properties of the target variable (price):
1. Normality
2. Homogeneity of variances
##
## Anderson-Darling normality test
##
## data: df$price
## A = 1180.2, p-value < 2.2e-16
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 11 112.69 < 2.2e-16 ***
## 29568
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the results above:
1. Normality: the Anderson-Darling normality test (p-value < 0.05) indicates that price is not normally distributed.
2. Homogeneity of variances: Levene's test (p-value < 0.05) indicates that the variances are not equal across groups.
Given these results, we will use non-parametric statistical tests.
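For readers who want to see what the median-centered Levene's test does, it amounts to a one-way ANOVA on the absolute deviations from each group's median. The sketch below is base R on simulated data, not the report's data; the helper `levene_median()` and the `toy` data frame are hypothetical names introduced only for this illustration:

```r
# Levene's test (center = median) as an ANOVA on absolute deviations
levene_median <- function(y, g) {
  g <- as.factor(g)
  dev <- abs(y - ave(y, g, FUN = median))  # |y - median of its own group|
  anova(lm(dev ~ g))                        # F test across groups
}

# Simulated example: same center, very different spread
set.seed(1)
toy <- data.frame(
  price = c(rnorm(50, mean = 500, sd = 20), rnorm(50, mean = 500, sd = 120)),
  group = rep(c("a", "b"), each = 50)
)

res <- levene_median(toy$price, toy$group)
res
```

A small p-value in the `Pr(>F)` column, as in the report's output, points to unequal variances and supports the choice of rank-based tests.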
A. wilcox.test
##
## Wilcoxon rank sum test with continuity correction
##
## data: df$price by propertyType
## W = 102291684, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
After applying the test, we found that:
The p-value is less than 0.05, so the price distribution differs between property types.
Whether this difference is large or small can be judged from the rank-biserial correlation, which is 0.66.
## [1] "very large"
## (Rules: funder2019)
So the difference in price between property types is very large.
## # A tibble: 1 × 7
## .y. group1 group2 effsize n1 n2 magnitude
## * <chr> <chr> <chr> <dbl> <int> <int> <ord>
## 1 price house unit 0.428 24552 5028 moderate
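The two effect sizes reported for this comparison (0.66 and 0.428) are different statistics, which is why they receive different labels. Both can be approximated from the printed W and group sizes. This is a sketch under stated assumptions: it uses the plain normal approximation for Z, with no tie or continuity correction, so the second value only approximately matches the tibble above:

```r
W  <- 102291684  # Wilcoxon statistic printed above
n1 <- 24552      # houses
n2 <- 5028       # units
N  <- n1 + n2

# Rank-biserial correlation: W / (n1 * n2) is the share of (house, unit)
# pairs in which the house has the higher price; rescale it to [-1, 1]
r_rank_biserial <- 2 * W / (n1 * n2) - 1

# r = Z / sqrt(N), with Z from the normal approximation to W
z   <- (W - n1 * n2 / 2) / sqrt(n1 * n2 * (N + 1) / 12)
r_z <- z / sqrt(N)

round(c(rank_biserial = r_rank_biserial, r_from_z = r_z), 3)
```

Because the two measures are on different scales, a "very large" rank-biserial value and a "moderate" Z-based r are not contradictory; the interpretation rules should match the statistic being reported.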
B. kruskal.test
##
## Kruskal-Wallis rank sum test
##
## data: df$price and as.factor(bedrooms)
## Kruskal-Wallis chi-squared = 12050, df = 5, p-value < 2.2e-16
After applying the test, we found that:
The p-value is less than 0.05, so the price distribution differs between bedroom groups.
Whether this difference is large or small can be judged from the effect size, which is 0.41. (With more than two groups this is an eta-squared-type statistic rather than a rank-biserial correlation.)
## [1] "very large"
## (Rules: funder2019)
So the difference in price between bedroom groups is very large.
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 price 29580 0.407 eta2[H] large
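The effect size of about 0.41 can be recovered from the Kruskal-Wallis statistic itself. The sketch below uses the rounded H printed above, so the last digits may differ slightly from the package output; the formulas are the usual eta-squared based on H and the rank epsilon-squared:

```r
H <- 12050   # Kruskal-Wallis chi-squared as printed (rounded)
k <- 6       # number of bedroom groups (0 to 5)
n <- 29580   # number of observations

eta2_H   <- (H - k + 1) / (n - k)  # eta-squared based on H
epsilon2 <- H / (n - 1)            # rank epsilon-squared

round(c(eta2_H = eta2_H, epsilon2 = epsilon2), 3)
```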
## Comparison Z P.unadj P.adj
## 1 0 - 1 6.2532710 4.019437e-10 1.607775e-09
## 2 0 - 2 3.2246371 1.261325e-03 2.522649e-03
## 3 1 - 2 -18.7770577 1.163514e-78 6.981085e-78
## 4 0 - 3 -0.1353723 8.923175e-01 8.923175e-01
## 5 1 - 3 -44.5330056 0.000000e+00 0.000000e+00
## 6 2 - 3 -32.3845603 4.527976e-230 3.622381e-229
## 7 0 - 4 -4.5323069 5.834301e-06 1.750290e-05
## 8 1 - 4 -74.3186748 0.000000e+00 0.000000e+00
## 9 2 - 4 -73.4484829 0.000000e+00 0.000000e+00
## 10 3 - 4 -59.9929036 0.000000e+00 0.000000e+00
## 11 0 - 5 -7.1345697 9.709041e-13 4.854521e-12
## 12 1 - 5 -73.4043100 0.000000e+00 0.000000e+00
## 13 2 - 5 -67.7003098 0.000000e+00 0.000000e+00
## 14 3 - 5 -52.7238080 0.000000e+00 0.000000e+00
## 15 4 - 5 -19.6152541 1.145630e-85 8.019412e-85
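The P.adj column above is consistent with a Holm adjustment of the unadjusted p-values; this is inferred from the numbers, since the call that produced the table is not shown. A quick check with base R's `p.adjust()`, using the values as printed:

```r
# Unadjusted p-values in the order printed above
p_unadj <- c(4.019437e-10, 1.261325e-03, 1.163514e-78, 8.923175e-01, 0,
             4.527976e-230, 5.834301e-06, 0, 0, 0,
             9.709041e-13, 0, 0, 0, 1.145630e-85)

# Adjusted p-values as printed above
p_printed <- c(1.607775e-09, 2.522649e-03, 6.981085e-78, 8.923175e-01, 0,
               3.622381e-229, 1.750290e-05, 0, 0, 0,
               4.854521e-12, 0, 0, 0, 8.019412e-85)

p_holm <- p.adjust(p_unadj, method = "holm")

# Compare on a relative scale, skipping the exact zeros
nz <- p_printed > 0
max(abs(p_holm[nz] / p_printed[nz] - 1))
```

The only pair that is not significant after adjustment is 0 versus 3 bedrooms (P.adj = 0.89).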
C. chisq.test
##
## Pearson's Chi-squared test
##
## data: table(propertyType, as.factor(bedrooms))
## X-squared = 19805, df = 5, p-value < 2.2e-16
After applying the test, we found that:
The p-value is less than 0.05, so property type and number of bedrooms are not independent.
##
## propertyType 0 1 2 3 4 5
## house 24.900609 1350.443 2986.413 9904.632 8667.072 1618.5396
## unit 5.099391 276.557 611.587 2028.368 1774.928 331.4604
None of the expected counts in chi$expected is less than 5 (the smallest is about 5.10), so the chi-squared approximation is adequate.
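The expected counts are simply row total times column total divided by the grand total. The sketch below rebuilds the observed table from the counts printed earlier and checks both the expected counts and the test statistic; `tab` and `expected_by_hand` are illustrative names:

```r
# Observed counts (rows: propertyType, cols: bedrooms), copied from the output above
tab <- matrix(
  c(19,   95,  806, 11281, 10404, 1947,
    11, 1532, 2792,   652,    38,    3),
  nrow = 2, byrow = TRUE,
  dimnames = list(propertyType = c("house", "unit"),
                  bedrooms = as.character(0:5))
)

chi <- chisq.test(tab)
chi$statistic   # Pearson chi-squared
chi$parameter   # degrees of freedom: (2 - 1) * (6 - 1)

# Expected counts under independence, computed by hand
expected_by_hand <- outer(rowSums(tab), colSums(tab)) / sum(tab)
min(chi$expected)
```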
## # A tibble: 6 × 8
## propertyType_house propertyType_unit bedrooms_0 bedrooms_1 bedrooms_2
## <int> <int> <int> <int> <int>
## 1 1 0 0 0 0
## 2 1 0 0 0 0
## 3 1 0 0 0 0
## 4 1 0 0 0 0
## 5 1 0 0 0 0
## 6 1 0 0 0 0
## # ℹ 3 more variables: bedrooms_3 <int>, bedrooms_4 <int>, bedrooms_5 <int>
##
## Reliability analysis
## Call: psych::alpha(x = factor_Analysis_data, check.keys = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.61 -5.5 -4.3 -0.12 -0.85 0.0032 0.34 0.18 -0.094
##
## 95% confidence boundaries
## lower alpha upper
## Feldt 0.6 0.61 0.62
## Duhachek 0.6 0.61 0.61
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r
## propertyType_house- 0.35 -1.4 -0.78 -0.090 -0.58 0.0057 0.067
## propertyType_unit 0.35 -4.1 -2.68 -0.130 -0.80 0.0057 0.058
## bedrooms_0 0.62 -14.1 -2.84 -0.154 -0.93 0.0033 0.140
## bedrooms_1 0.56 -4.4 -1.65 -0.132 -0.82 0.0036 0.118
## bedrooms_2 0.49 -3.0 -1.33 -0.120 -0.75 0.0043 0.104
## bedrooms_3- 0.70 -1.5 -0.83 -0.093 -0.60 0.0022 0.122
## bedrooms_4- 0.69 -1.6 -0.89 -0.097 -0.62 0.0023 0.119
## bedrooms_5- 0.65 -4.1 -1.63 -0.130 -0.80 0.0031 0.141
## med.r
## propertyType_house- -0.099
## propertyType_unit -0.090
## bedrooms_0 -0.196
## bedrooms_1 -0.099
## bedrooms_2 -0.064
## bedrooms_3- -0.064
## bedrooms_4- -0.064
## bedrooms_5- -0.090
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## propertyType_house- 29580 0.951 -0.35 NaN 0.9119 0.170 0.376
## propertyType_unit 29580 0.951 0.35 NaN 0.9119 0.170 0.376
## bedrooms_0 29580 0.045 0.79 NaN 0.0230 0.001 0.032
## bedrooms_1 29580 0.533 0.39 NaN 0.4055 0.055 0.228
## bedrooms_2 29580 0.737 0.19 NaN 0.6028 0.122 0.327
## bedrooms_3- 29580 0.334 -0.30 NaN -0.0064 0.597 0.491
## bedrooms_4- 29580 0.352 -0.24 NaN 0.0224 0.647 0.478
## bedrooms_5- 29580 0.128 0.35 NaN -0.0449 0.934 0.248
##
## Non missing response frequency for each item
## 0 1 miss
## propertyType_house 0.17 0.83 0
## propertyType_unit 0.83 0.17 0
## bedrooms_0 1.00 0.00 0
## bedrooms_1 0.94 0.06 0
## bedrooms_2 0.88 0.12 0
## bedrooms_3 0.60 0.40 0
## bedrooms_4 0.65 0.35 0
## bedrooms_5 0.93 0.07 0
## $chisq
## [1] Inf
##
## $p.value
## [1] 0
##
## $df
## [1] 28
The p-value is less than 0.05, so we can reject H0 and conclude that the correlation matrix is not an identity matrix.
## Error in solve.default(r) :
## Lapack routine dgesv: system is exactly singular: U[2,2] = 0
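The infinite chi-squared value and the "exactly singular" error are what one would expect when every level of a categorical variable gets its own dummy column: propertyType_house and propertyType_unit always sum to 1, so the correlation matrix has determinant zero and cannot be inverted. The sketch below shows the effect on a small made-up data set (the values are hypothetical, not taken from the report):

```r
# Toy data, dummy-coded with one column per level
property <- c("house", "house", "unit", "house", "unit", "house", "unit", "house")
bedrooms <- c(3, 4, 2, 3, 1, 4, 2, 3)

dummies <- cbind(
  propertyType_house = as.integer(property == "house"),
  propertyType_unit  = as.integer(property == "unit"),
  bedrooms_2         = as.integer(bedrooms == 2),
  bedrooms_3         = as.integer(bedrooms == 3)
)

R <- cor(dummies)
R["propertyType_house", "propertyType_unit"]  # the two columns are exact opposites
det(R)                                        # numerically zero

# Bartlett's statistic depends on log(det(R)), so it grows without bound
# as det(R) approaches zero; its df is p * (p - 1) / 2
n <- nrow(dummies)
p <- ncol(dummies)
bartlett_chisq <- -((n - 1) - (2 * p + 5) / 6) * log(abs(det(R)))
bartlett_chisq
```

With the report's 8 columns the degrees of freedom are 8 * 7 / 2 = 28, matching the output. Dropping one dummy column per variable is a common way to avoid the singularity, although that is a change to the analysis rather than something the report does.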
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = factor_Analysis_data)
## Overall MSA = 0.5
## MSA for each item =
## propertyType_house propertyType_unit bedrooms_0 bedrooms_1
## 0.5 0.5 0.5 0.5
## bedrooms_2 bedrooms_3 bedrooms_4 bedrooms_5
## 0.5 0.5 0.5 0.5
## Parallel analysis suggests that the number of factors = 6 and the number of components = 5
##
## Very Simple Structure
## Call: psych::vss(x = factor_Analysis_data)
## Although the VSS complexity 1 shows 6 factors, it is probably more reasonable to think about 2 factors
## VSS complexity 2 achieves a maximimum of 0.92 with 6 factors
##
## The Velicer MAP achieves a minimum of 0.11 with 1 factors
## BIC achieves a minimum of 635693.4 with 2 factors
## Sample Size adjusted BIC achieves a minimum of 635734.8 with 2 factors
##
## Statistics by number of factors
## vss1 vss2 map dof chisq prob sqresid fit RMSEA BIC SABIC complex
## 1 0.56 0.00 0.11 20 662092 0 6.37453 0.56 1.1 661886 661950 1.0
## 2 0.66 0.74 0.22 13 635827 0 3.80636 0.74 1.3 635693 635735 1.2
## 3 0.60 0.79 0.33 7 1154228 0 2.60192 0.82 2.4 1154156 1154178 1.5
## 4 0.66 0.82 0.72 2 1129301 0 1.39513 0.90 4.4 1129281 1129287 1.4
## 5 0.59 0.84 1.00 -2 974029 NA 1.00189 0.93 NA NA NA 1.5
## 6 0.72 0.92 NaN -5 959575 NA 0.00015 1.00 NA NA NA 1.3
## 7 0.72 0.92 NaN -7 959553 NA 0.00015 1.00 NA NA NA 1.3
## 8 0.72 0.92 NA -8 959532 NA 0.00015 1.00 NA NA NA 1.3
## eChisq SRMR eCRMS eBIC
## 1 4.4e+04 0.16313 0.193 43876
## 2 1.6e+04 0.09716 0.143 15504
## 3 7.3e+03 0.06658 0.133 7272
## 4 7.9e+02 0.02185 0.082 770
## 5 8.7e+01 0.00726 NA NA
## 6 9.1e-01 0.00074 NA NA
## 7 9.1e-01 0.00074 NA NA
## 8 9.1e-01 0.00074 NA NA
When cor = TRUE, the correlation matrix is used instead of the covariance matrix.
5.1 Reporting the variance explained:
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.7077973 1.2685701 1.0566932 1.0436648 1.0006335
## Proportion of Variance 0.3645715 0.2011588 0.1395751 0.1361545 0.1251584
## Cumulative Proportion 0.3645715 0.5657302 0.7053053 0.8414598 0.9666183
## Comp.6 Comp.7 Comp.8
## Standard deviation 0.51677259 1.684227e-07 0
## Proportion of Variance 0.03338174 3.545776e-15 0
## Cumulative Proportion 1.00000000 1.000000e+00 1
princomp(factor_Analysis_data,cor = TRUE) %>% loadings() # for reporting principal component loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## propertyType_house 0.571 0.414 0.707
## propertyType_unit -0.571 -0.414 0.707
## bedrooms_0 0.999
## bedrooms_1 -0.317 -0.184 -0.745 0.483 0.274
## bedrooms_2 -0.405 0.153 0.619 0.521 0.393
## bedrooms_3 0.171 0.727 -0.214 -0.217 0.590
## bedrooms_4 0.222 -0.684 -0.269 -0.280 0.575
## bedrooms_5 0.906 -0.242 -0.162 0.299
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
## Cumulative Var 0.125 0.250 0.375 0.500 0.625 0.750 0.875 1.000
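These are the loadings on each component. Because the SS loadings equal 1 for every component in this output, we will use a more advanced package later.
This is the scree (elbow) plot: the points with an eigenvalue above one indicate the number of components to keep, here 4 (a fifth component sits almost exactly at one).
Variance explained by each component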
## Principal Components Analysis
## Call: psych::principal(r = factor_Analysis_data, nfactors = 8, rotate = "varimax",
## cor = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 RC3 RC4 RC6 RC5 RC7 RC8 h2 u2 com
## propertyType_house -0.95 0.02 0.04 -0.22 -0.23 -0.01 0 0 1 0.0e+00 1.2
## propertyType_unit 0.95 -0.02 -0.04 0.22 0.23 0.01 0 0 1 1.3e-15 1.2
## bedrooms_0 0.01 0.00 0.00 -0.01 -0.01 1.00 0 0 1 4.4e-15 1.0
## bedrooms_1 0.33 0.01 -0.02 0.94 -0.12 -0.01 0 0 1 -1.6e-15 1.3
## bedrooms_2 0.46 0.02 -0.05 -0.15 0.87 -0.01 0 0 1 0.0e+00 1.6
## bedrooms_3 -0.19 -0.89 -0.24 -0.17 -0.27 -0.03 0 0 1 2.0e-15 1.5
## bedrooms_4 -0.24 0.90 -0.23 -0.15 -0.25 -0.03 0 0 1 2.6e-15 1.5
## bedrooms_5 -0.07 0.00 1.00 -0.02 -0.03 0.00 0 0 1 2.0e-15 1.0
##
## RC1 RC2 RC3 RC4 RC6 RC5 RC7 RC8
## SS loadings 2.21 1.61 1.11 1.05 1.02 1.00 0 0
## Proportion Var 0.28 0.20 0.14 0.13 0.13 0.13 0 0
## Cumulative Var 0.28 0.48 0.62 0.75 0.87 1.00 1 1
## Proportion Explained 0.28 0.20 0.14 0.13 0.13 0.13 0 0
## Cumulative Proportion 0.28 0.48 0.62 0.75 0.87 1.00 1 1
##
## Mean item complexity = 1.3
## Test of the hypothesis that 8 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0
## with the empirical chi square 0 with prob < NA
##
## Fit based upon off diagonal values = 1
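An SS loading of 2.92 (the first unrotated component, 1.708 squared) means that 2.92 units of variance are explained out of a total of 8, i.e. 2.92/8 = 36% of the variance is explained by PC1, and so on. In the varimax-rotated output above the variance is redistributed across components, so RC1 shows 2.21 (28%).
The cumulative variance explained by 4 components is 84%.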
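The 36% and 84% figures follow directly from the component standard deviations printed by princomp: squaring each one gives its eigenvalue, and dividing by the number of variables (8) gives the proportion of variance. A short sketch using the printed values:

```r
# Standard deviations of the 8 components, copied from the princomp summary
sdev <- c(1.7077973, 1.2685701, 1.0566932, 1.0436648,
          1.0006335, 0.51677259, 1.684227e-07, 0)

eigenvalues <- sdev^2                     # variance carried by each component
prop_var    <- eigenvalues / length(sdev) # share of the total (8 standardized variables)
cum_var     <- cumsum(prop_var)

round(data.frame(eigenvalue = eigenvalues,
                 proportion = prop_var,
                 cumulative = cum_var), 3)
```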
Running the analysis again, limiting the number of components to 4:
## Principal Components Analysis
## Call: psych::principal(r = factor_Analysis_data, nfactors = 4, rotate = "varimax",
## cor = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 RC4 RC3 h2 u2 com
## propertyType_house -0.84 0.04 -0.49 -0.02 0.9541 0.0459 1.6
## propertyType_unit 0.84 -0.04 0.49 0.02 0.9541 0.0459 1.6
## bedrooms_0 0.01 0.00 0.02 0.03 0.0012 0.9988 2.0
## bedrooms_1 0.09 0.01 0.96 0.06 0.9368 0.0632 1.0
## bedrooms_2 0.93 0.03 -0.25 0.02 0.9272 0.0728 1.1
## bedrooms_3 -0.29 -0.90 -0.12 -0.29 0.9873 0.0127 1.5
## bedrooms_4 -0.31 0.90 -0.11 -0.26 0.9789 0.0211 1.5
## bedrooms_5 -0.13 -0.01 -0.10 0.98 0.9919 0.0081 1.1
##
## RC1 RC2 RC4 RC3
## SS loadings 2.49 1.61 1.51 1.12
## Proportion Var 0.31 0.20 0.19 0.14
## Cumulative Var 0.31 0.51 0.70 0.84
## Proportion Explained 0.37 0.24 0.22 0.17
## Cumulative Proportion 0.37 0.61 0.83 1.00
##
## Mean item complexity = 1.4
## Test of the hypothesis that 4 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.03
## with the empirical chi square 1839.41 with prob < 0
##
## Fit based upon off diagonal values = 0.99
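Off-diagonal values are the entries of the correlation matrix outside its main diagonal, i.e. the correlations between different variables. Compare the results after reducing nfactors to 4: the fit based upon off-diagonal values drops from 1 to 0.99, and the RMSR rises from 0 to 0.03.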