Data wrangling
As you can see, not all variables are stored in an appropriate format yet. We need to convert each variable to a type that matches its measurement scale and its later use: factors for categorical variables and numeric for prices.
mieszkania$district <- as.factor(mieszkania$district)
mieszkania$building_type <- as.factor(mieszkania$building_type)
mieszkania$price_PLN <- as.numeric(mieszkania$price_PLN)
mieszkania$price_EUR <- as.numeric(mieszkania$price_EUR)
Frequency table
## Price.in.PLN Number.of.flats Proportion
## 1 (3.5e+05,4.5e+05] 9 0.045
## 2 (4.5e+05,5.5e+05] 21 0.105
## 3 (5.5e+05,6.5e+05] 33 0.165
## 4 (6.5e+05,7.5e+05] 36 0.180
## 5 (7.5e+05,8.5e+05] 31 0.155
## 6 (8.5e+05,9.5e+05] 36 0.180
## 7 (9.5e+05,1.05e+06] 21 0.105
## 8 (1.05e+06,1.15e+06] 10 0.050
## 9 (1.15e+06,1.25e+06] 2 0.010
## 10 (1.25e+06,1.35e+06] 1 0.005
TAI
## # classes Goodness of fit Tabular accuracy
## 10.0000000 0.9780872 0.8508467
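The frequency table and the two accuracy measures above can be reproduced along the following lines. This is a minimal base R sketch under stated assumptions, not the report's own code: it runs on synthetic stand-in prices (not the `mieszkania` data), and it uses the usual definitions of goodness of variance fit and the tabular accuracy index. The output labels above resemble those of `classInt::jenks.tests()`, but that is a guess.

```r
# Synthetic stand-in prices (NOT the report's `mieszkania` data set)
set.seed(1)
price <- round(runif(200, min = 360000, max = 1280000))

# Class limits every 100 000 PLN, matching the intervals printed above
breaks <- seq(350000, 1350000, by = 100000)
price_class <- cut(price, breaks = breaks)  # right-closed intervals (a, b]

# Frequency table: counts and proportions per price class
freq <- as.data.frame(table(price_class))
names(freq) <- c("Price.in.PLN", "Number.of.flats")
freq$Proportion <- freq$Number.of.flats / sum(freq$Number.of.flats)
print(freq)

# Mean of the class each observation belongs to
class_mean <- ave(price, price_class)

# Goodness of variance fit: 1 - (within-class sum of squares) / (total sum of squares)
gvf <- 1 - sum((price - class_mean)^2) / sum((price - mean(price))^2)

# Tabular accuracy index: the same idea with absolute deviations
tai <- 1 - sum(abs(price - class_mean)) / sum(abs(price - mean(price)))

print(c("# classes" = nlevels(price_class),
        "Goodness of fit" = gvf,
        "Tabular accuracy" = tai))
```

Both measures lie between 0 and 1; values close to 1 mean that the chosen classes describe the data well.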
Basic plots
In this section we should present our data using basic (pre-installed with R) graphics. Choose the most appropriate plots according to the scale of the chosen variables. Investigate the heterogeneity of the distribution by presenting the data by groups (e.g. by district, building type, etc.). Do not forget about main titles, labels and a legend.
The first plot is a histogram. It shows the spread of flat prices in Wroclaw: values range from roughly 300 000 to 1 300 000 PLN, and the vast majority of flats cost around 600 000 to 900 000 PLN. The density lines show the distribution of prices in the separate districts of Wroclaw. They follow a similar pattern, although Biskupin shows a higher density around 900 000 PLN than the other districts.
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
The boxplot of prices in each district of Wroclaw gives a more precise picture of the price distribution. It is easier to read off the prices in each district separately, whereas the histogram is better at showing the general pattern. The mean prices are similar across districts, as are the medians (marked with a blue star). Biskupin has a slightly higher mean and median. It also has a larger spread: flats there reach higher prices than in the other districts. None of the districts shows any outliers. Even the higher prices in Biskupin fall within the top quarter of prices covered by the upper whisker.
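Because the plotting code is hidden in this report, here is a minimal base R sketch of how plots of this kind could be produced. It is not the original code. The data frame `flats_demo` is synthetic stand-in data, the colours are arbitrary, and the stars here mark the group means (an assumption made for this sketch).

```r
# Synthetic stand-in data (NOT the report's `mieszkania` data set)
set.seed(1)
flats_demo <- data.frame(
  district  = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE)),
  price_PLN = round(runif(200, 360000, 1280000))
)
cols <- c("red", "darkgreen", "blue")

# Histogram on the density scale so that density lines can be overlaid
hist(flats_demo$price_PLN, freq = FALSE, breaks = 10, col = "grey90",
     main = "Flat prices (synthetic data)", xlab = "Price [PLN]")

# One density line per district
by_district <- split(flats_demo$price_PLN, flats_demo$district)
for (i in seq_along(by_district)) {
  lines(density(by_district[[i]]), col = cols[i], lwd = 2)
}
legend("topright", legend = names(by_district), col = cols, lwd = 2)

# Boxplot by district; the box already draws the median as a thick line
boxplot(price_PLN ~ district, data = flats_demo,
        main = "Flat prices by district (synthetic data)",
        xlab = "District", ylab = "Price [PLN]")

# Add the group means as blue stars (pch = 8)
group_means <- tapply(flats_demo$price_PLN, flats_demo$district, mean)
points(seq_along(group_means), group_means, pch = 8, col = "blue")
```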
ggplot2 plots
The ggplot2 package makes it easier to create well-designed plots. First, a histogram of prices is shown. It shows the number of flats in each price bin. The automatically chosen bin width is good enough to see the pattern in the distribution, although it can be changed. This plot leads to conclusions similar to those from the previous plots. Biskupin is the only district with prices over 1 100 000 PLN; Krzyki is the only one with prices below 400 000 PLN.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Next is a boxplot made with ggplot2. It shows the price distribution in the districts of Wroclaw. The boxes span the middle 50% of prices, and the whiskers cover the lowest and the highest quarter. Additionally, geom_jitter() draws each observed price as a separate dot.
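A sketch of how the two ggplot2 plots described above could be built is given here. It is an illustration on synthetic stand-in data (`flats_demo` is not the report's data set), and the bin width of 50 000 PLN is an arbitrary choice. Setting `binwidth` explicitly also avoids the `stat_bin()` message about the default of 30 bins.

```r
library(ggplot2)

# Synthetic stand-in data (NOT the report's `mieszkania` data set)
set.seed(1)
flats_demo <- data.frame(
  district  = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE)),
  price_PLN = round(runif(200, 360000, 1280000))
)

# Histogram with an explicit bin width instead of the default 30 bins
p_hist <- ggplot(flats_demo, aes(x = price_PLN, fill = district)) +
  geom_histogram(binwidth = 50000, colour = "white") +
  labs(title = "Flat prices (synthetic data)",
       x = "Price [PLN]", y = "Number of flats", fill = "District")

# Boxplot with every observation overlaid as a jittered dot
p_box <- ggplot(flats_demo, aes(x = district, y = price_PLN)) +
  geom_boxplot(outlier.shape = NA) +        # outliers would be drawn twice otherwise
  geom_jitter(width = 0.15, alpha = 0.5) +  # horizontal jitter only separates the dots
  labs(title = "Flat prices by district (synthetic data)",
       x = "District", y = "Price [PLN]")

print(p_hist)
print(p_box)
```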
Using facets
Faceting generates small multiples each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different. Read more about facets here.
The first plot groups the flats in Wroclaw by district and then shows the distribution of prices by number of rooms. Thanks to facet_wrap() we can see all three variables clearly. Again, we can see slightly higher prices in Biskupin and lower prices in Krzyki.
The next plot shows the number of flats with each room count in all three districts. The pie chart shows the proportion of each number of rooms. For example, in Biskupin most flats have 3 rooms, while in Krzyki flats are bigger and most have 4. This is interesting because flats in Biskupin are in general more expensive than those in Krzyki.
## `summarise()` has grouped output by 'district', 'rooms'. You can override using the `.groups` argument.
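The faceted plots described above could be sketched as follows. This is an illustration on synthetic stand-in data, not the report's code. The counts are computed with base R `table()` here; with dplyr, passing `.groups = "drop"` to `summarise()` would silence the message printed above.

```r
library(ggplot2)

# Synthetic stand-in data (NOT the report's `mieszkania` data set)
set.seed(1)
flats_demo <- data.frame(
  district  = sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE),
  rooms     = sample(1:4, 200, replace = TRUE),
  price_PLN = round(runif(200, 360000, 1280000))
)

# One panel per district: price distribution by number of rooms
p_facet <- ggplot(flats_demo, aes(x = factor(rooms), y = price_PLN)) +
  geom_boxplot() +
  facet_wrap(~ district) +
  labs(title = "Price by number of rooms (synthetic data)",
       x = "Number of rooms", y = "Price [PLN]")

# Number of flats per district and room count
room_counts <- as.data.frame(
  table(district = flats_demo$district, rooms = flats_demo$rooms),
  responseName = "n"
)

# One pie per district: stacked proportions drawn in polar coordinates
p_pie <- ggplot(room_counts, aes(x = "", y = n, fill = rooms)) +
  geom_col(position = "fill", width = 1) +
  coord_polar(theta = "y") +
  facet_wrap(~ district) +
  labs(title = "Share of flats by number of rooms (synthetic data)",
       fill = "Rooms") +
  theme_void()

print(p_facet)
print(p_pie)
```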
Descriptive statistics #1
Before automatically reporting the full summary table of descriptive statistics, this time your goal is to measure the central tendency of the distribution of prices. Compare mean, median and mode together with positional measures - quantiles - by districts and building types or no. of rooms per apartment.
## [1] 359769
## [1] 1277691
## [1] 760035
## [1] 755719.5
## [1] 186099.8
## [1] 34633125960
## [1] 282686.5
## 25%
## 619073.8
## 75%
## 901760.2
## [1] 141343.2
## [1] 0.2448568
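The values printed above are not labelled. Judging by their order and by simple arithmetic checks (for example 186099.8 squared is about 34633125960, 901760.2 minus 619073.8 is about 282686.5, and 186099.8 divided by 760035 is about 0.2449), they appear to be the minimum, maximum, mean, median, standard deviation, variance, interquartile range, first and third quartile, quartile deviation (IQR / 2) and coefficient of variation of price. A minimal base R sketch that computes the same set of measures on synthetic stand-in prices (not the report's data):

```r
# Synthetic stand-in prices (NOT the report's `mieszkania` data set)
set.seed(1)
price <- round(runif(200, min = 360000, max = 1280000))

q <- quantile(price, probs = c(0.25, 0.75))

stats <- c(
  min                = min(price),
  max                = max(price),
  mean               = mean(price),
  median             = median(price),
  sd                 = sd(price),
  variance           = var(price),
  IQR                = IQR(price),
  Q1                 = q[["25%"]],
  Q3                 = q[["75%"]],
  quartile_deviation = IQR(price) / 2,         # half of the interquartile range
  coef_variation     = sd(price) / mean(price) # relative spread, unit-free
)
print(stats)

# Base R has no function for the statistical mode. For a continuous variable
# such as price, one option is the midpoint of the fullest histogram class.
h <- hist(price, breaks = seq(350000, 1350000, by = 100000), plot = FALSE)
mode_estimate <- h$mids[which.max(h$counts)]
print(mode_estimate)
```

The mode estimate depends on the chosen class width, so it should be reported together with the classes used.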
After calculating these simple measures we can put them into a table so that they are easy to compare. The table below shows, for each district and type of building (kamienica = tenement house, niski blok = low-rise block, wiezowiec = high-rise), the mean, minimum, maximum and mode of price, the three quartiles and the number of flats.
Q1 is the middle value of the lower half of prices, Q2 is the median, and Q3 is the middle value of the upper half of prices.
## `summarise()` has grouped output by 'district'. You can override using the `.groups` argument.
Column groups: Flat, Price, Size

| District | Type of building | Mean price | Min price | Max price | Mode price | Q1 | Q2 = median | Q3 | Num of flats |
|---|---|---|---|---|---|---|---|---|---|
| Biskupin | kamienica | 824 274.9 | 542 009 | 1 230 848 | 683 279.8 | 683 279.8 | 834 185.5 | 903 926.2 | 26 |
| Biskupin | niski blok | 892 939.9 | 595 868 | 1 277 691 | 812 670.0 | 812 670.0 | 908 455.0 | 946 435.0 | 17 |
| Biskupin | wiezowiec | 754 490.4 | 519 652 | 1 050 898 | 605 074.2 | 605 074.2 | 753 906.5 | 895 419.0 | 22 |
| Krzyki | kamienica | 736 690.6 | 415 834 | 1 027 142 | 605 450.0 | 605 450.0 | 800 693.0 | 838 242.0 | 21 |
| Krzyki | niski blok | 787 056.9 | 496 390 | 1 082 279 | 689 064.5 | 689 064.5 | 746 181.0 | 910 260.0 | 24 |
| Krzyki | wiezowiec | 677 476.5 | 359 769 | 1 090 444 | 515 879.2 | 515 879.2 | 628 939.5 | 787 077.8 | 34 |
| Srodmiescie | kamienica | 720 616.8 | 484 111 | 1 027 226 | 544 311.8 | 544 311.8 | 669 799.5 | 870 550.5 | 14 |
| Srodmiescie | niski blok | 786 908.4 | 522 604 | 1 062 054 | 673 736.5 | 673 736.5 | 786 715.5 | 926 902.8 | 22 |
| Srodmiescie | wiezowiec | 700 120.2 | 448 196 | 1 034 385 | 553 515.2 | 553 515.2 | 694 887.0 | 840 260.0 | 20 |
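A grouped table like the one above could be computed with dplyr roughly as follows. This is a sketch on synthetic stand-in data, so the numbers will not match the table; the output column names are illustrative. Passing `.groups = "drop"` returns an ungrouped result and avoids the `summarise()` message.

```r
library(dplyr)

# Synthetic stand-in data (NOT the report's `mieszkania` data set)
set.seed(1)
flats_demo <- data.frame(
  district      = sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE),
  building_type = sample(c("kamienica", "niski blok", "wiezowiec"), 200, replace = TRUE),
  price_PLN     = round(runif(200, 360000, 1280000))
)

price_summary <- flats_demo %>%
  group_by(district, building_type) %>%
  summarise(
    mean_price = mean(price_PLN),
    min_price  = min(price_PLN),
    max_price  = max(price_PLN),
    Q1         = quantile(price_PLN, 0.25, names = FALSE),
    median     = median(price_PLN),
    Q3         = quantile(price_PLN, 0.75, names = FALSE),
    n_flats    = n(),
    .groups    = "drop"   # return an ungrouped data frame
  )

print(price_summary)
```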
Summary tables with ‘kable’
Using the kable function and the kableExtra package we can easily create summary tables with graphics and/or statistics.
Below we can see three summary tables with plots showing the distribution of prices. The first one is grouped by number of rooms and shows three plots: a histogram, a boxplot and a line plot. The second is grouped by district, and the last one by type of building.
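| Rooms | Price.histogram | Price.boxplot | Price.lineplot |
|---|---|---|---|
| 1 | [plot] | [plot] | [plot] |
| 2 | [plot] | [plot] | [plot] |
| 3 | [plot] | [plot] | [plot] |
| 4 | [plot] | [plot] | [plot] |

| District | Price.histogram | Price.boxplot | Price.lineplot |
|---|---|---|---|
| Biskupin | [plot] | [plot] | [plot] |
| Krzyki | [plot] | [plot] | [plot] |
| Srodmiescie | [plot] | [plot] | [plot] |

| Building type | Price.histogram | Price.boxplot | Price.lineplot |
|---|---|---|---|
| Kamienica | [plot] | [plot] | [plot] |
| Niski blok | [plot] | [plot] | [plot] |
| Wiezowiec | [plot] | [plot] | [plot] |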
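Tables with inline plots of this kind can be built with kableExtra's `spec_hist()`, `spec_boxplot()` and `spec_plot()` helpers. The sketch below follows the pattern from the kableExtra documentation and uses synthetic stand-in data; it is an assumption that the report was built the same way. Depending on the kableExtra version, inline plots in HTML output may also require the svglite package.

```r
library(kableExtra)

# Synthetic stand-in data (NOT the report's `mieszkania` data set)
set.seed(1)
flats_demo <- data.frame(
  rooms     = sample(1:4, 200, replace = TRUE),
  price_PLN = round(runif(200, 360000, 1280000))
)

# One numeric vector of prices per room count
price_by_rooms <- split(flats_demo$price_PLN, flats_demo$rooms)

# Empty placeholder columns that the inline plots will fill
inline_tbl <- data.frame(
  Rooms           = names(price_by_rooms),
  Price.histogram = "",
  Price.boxplot   = "",
  Price.lineplot  = ""
)

tbl <- kbl(inline_tbl) %>%
  kable_paper(full_width = FALSE) %>%
  column_spec(2, image = spec_hist(price_by_rooms)) %>%
  column_spec(3, image = spec_boxplot(price_by_rooms)) %>%
  column_spec(4, image = spec_plot(price_by_rooms, same_lim = TRUE))

tbl
```

The same pattern applies to the other two tables: split the prices by district or by building type instead of by number of rooms.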
Finally, we summarize the basic central tendency measures for prices by district and building type using the kable packages. You can customize your final report. See some hints here.
Three summary tables are shown below. They report the following measures of price: minimum, maximum, mean, standard deviation, interquartile range, and Q1, Q2 (median) and Q3. Some measures of flat size are shown as well, to give a better picture of the flats we are looking at: minimum, maximum, and mean together with standard deviation. The tables are grouped by district, number of rooms and type of building, respectively.
FLATS IN DISTRICTS OF WROCLAW

| | Biskupin (n = 65) | Krzyki (n = 79) | Srodmiescie (n = 56) |
|---|---|---|---|
| Price of flat | |||
| Min | 519652 | 359769 | 448196 |
| Max | 1277691 | 1090444 | 1062054 |
| Mean | 818614 | 726507 | 739340 |
| Sd | 175598 | 195015 | 171428 |
| IQR | 249723 | 276126 | 278465 |
| Q1 | 676751 | 600180.5 | 592287.75 |
| Median | 817736 | 716726 | 727477.5 |
| Q3 | 926474 | 876306.5 | 870752.5 |
| Size of flat | |||
| Min Size | 17.1 | 17.4 | 17 |
| Max Size | 87.7 | 86.6 | 83.3 |
| Mean Size | 47.05 \(\pm\) 19.57 | 46.86 \(\pm\) 20.95 | 44.27 \(\pm\) 19.63 |

FLATS PER ROOMS IN WROCLAW

| | 1 Room (n = 44) | 2 Rooms (n = 50) | 3 Rooms (n = 58) | 4 Rooms (n = 48) |
|---|---|---|---|---|
| Price of flat | ||||
| Min | 359769 | 590286 | 632770 | 736669 |
| Max | 657146 | 888634 | 965829 | 1277691 |
| Mean | 515518 | 683568 | 833706 | 974810 |
| Sd | 66951 | 65073 | 86944 | 113819 |
| IQR | 75340 | 82971 | 131395 | 141605 |
| Q1 | 479684.75 | 634757.25 | 769683.75 | 909371.5 |
| Median | 520507 | 677260 | 846303.5 | 964338.5 |
| Q3 | 555024.75 | 717728.5 | 901078.75 | 1050976.75 |
| Size of flat | ||||
| Min Size | 17 | 29.6 | 41.2 | 53.3 |
| Max Size | 21.9 | 43.7 | 65.2 | 87.7 |
| Mean Size | 19.28 \(\pm\) 1.46 | 36.80 \(\pm\) 4.46 | 53.33 \(\pm\) 7.21 | 72.05 \(\pm\) 10.18 |

FLATS IN BUILDINGS OF WROCLAW

| | Kamienica (n = 61) | Niski blok (n = 63) | Wiezowiec (n = 76) |
|---|---|---|---|
| Price of flat | |||
| Min | 415834 | 496390 | 359769 |
| Max | 1230848 | 1277691 | 1090444 |
| Mean | 770333 | 815577 | 705729 |
| Sd | 184388 | 176390 | 182503 |
| IQR | 248430 | 246927 | 314954 |
| Q1 | 647756 | 692925.5 | 555798.25 |
| Median | 800693 | 807895 | 678704 |
| Q3 | 896186 | 939852.5 | 870752.5 |
| Size of flat | |||
| Min Size | 17 | 17.4 | 17.4 |
| Max Size | 87.5 | 87.7 | 85.7 |
| Mean Size | 48.37 \(\pm\) 20.92 | 49.13 \(\pm\) 18.99 | 42.02 \(\pm\) 19.82 |
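
The layout of the three tables above (one column per group with its size in the header, one row per measure) can be sketched in base R and printed with `knitr::kable()`. This is a minimal illustration on synthetic stand-in data, not the code that produced the tables; the helper `summarise_price()` is a hypothetical name introduced here.

```r
# Synthetic stand-in data (NOT the report's `mieszkania` data set)
set.seed(1)
flats_demo <- data.frame(
  district  = sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE),
  price_PLN = round(runif(200, 360000, 1280000))
)

# Hypothetical helper: the price measures reported in the tables above
summarise_price <- function(x) {
  c(Min    = min(x),
    Max    = max(x),
    Mean   = mean(x),
    Sd     = sd(x),
    IQR    = IQR(x),
    Q1     = quantile(x, 0.25, names = FALSE),
    Median = median(x),
    Q3     = quantile(x, 0.75, names = FALSE))
}

# Matrix with one column per district and one row per measure
by_district <- sapply(split(flats_demo$price_PLN, flats_demo$district),
                      summarise_price)

# Put the group sizes into the column headers, e.g. "Krzyki (n = 70)"
n <- table(flats_demo$district)
colnames(by_district) <- sprintf("%s (n = %d)", names(n), as.integer(n))

tbl_out <- knitr::kable(round(by_district),
                        caption = "Price of flat by district (synthetic data)")
print(tbl_out)
```

Replacing `district` with the number of rooms or the building type as the grouping variable gives the other two tables.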