DS Lab #1

Filip Ziętara

2021-03-29

Data wrangling

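As you can see, not all variables have the right type yet. We need to convert each variable to a format that matches its measurement scale and its later use: district and building type are nominal, so they become factors, and the number of rooms is ordinal, so it becomes an ordered factor.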

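# Nominal variables become unordered factors
mieszkania$district <- as.factor(mieszkania$district)
mieszkania$building_type <- as.factor(mieszkania$building_type)
# The number of rooms is ordinal, so it becomes an ordered factor
mieszkania$rooms <- factor(mieszkania$rooms, ordered = TRUE)
# Make the columns accessible by name in the code that follows
attach(mieszkania)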
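The effect of such a conversion can be checked on a small example. The sketch below applies the same three calls to a hypothetical toy data frame `flats` (a stand-in for `mieszkania`; all values are made up) and inspects the result with `str()`.

```r
# Toy stand-in for `mieszkania`; all values are made up for illustration
flats <- data.frame(
  district = c("Krzyki", "Biskupin", "Krzyki", "Srodmiescie"),
  building_type = c("kamienica", "wiezowiec", "niski blok", "kamienica"),
  rooms = c(2, 4, 1, 3),
  stringsAsFactors = FALSE
)

# Same conversions as in the report
flats$district <- as.factor(flats$district)
flats$building_type <- as.factor(flats$building_type)
flats$rooms <- factor(flats$rooms, ordered = TRUE)

# Inspect the resulting column types and factor levels
str(flats)
levels(flats$district)

# Order comparisons are defined only for ordered factors
flats$rooms[1] < flats$rooms[2]
```

An unordered factor such as `district` would give `NA` with a warning for the same comparison, which is one practical reason to mark `rooms` as ordered.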

Frequency table

##                 limits Freq Rel_Freq Cum_Freq
## 1    (3.5e+05,4.5e+05]    9    0.045        9
## 2    (4.5e+05,5.5e+05]   21    0.105       30
## 3    (5.5e+05,6.5e+05]   33    0.165       63
## 4    (6.5e+05,7.5e+05]   36    0.180       99
## 5    (7.5e+05,8.5e+05]   31    0.155      130
## 6    (8.5e+05,9.5e+05]   36    0.180      166
## 7   (9.5e+05,1.05e+06]   21    0.105      187
## 8  (1.05e+06,1.15e+06]   10    0.050      197
## 9  (1.15e+06,1.25e+06]    2    0.010      199
## 10 (1.25e+06,1.35e+06]    1    0.005      200
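The code that produced this table is not shown in the report. A minimal sketch of one way to build a table with the same columns, assuming the prices are binned with `cut()` into right-closed intervals of width 100,000 (the vector `price` below is hypothetical, not the lab data):

```r
# Hypothetical prices, used only to illustrate the construction
price <- c(360000, 420000, 480000, 510000, 560000,
           610000, 640000, 700000, 720000, 760000)

# Class limits; cut() produces right-closed intervals (a, b] by default
breaks <- seq(350000, 850000, by = 100000)
classes <- cut(price, breaks = breaks)

# Absolute, relative and cumulative frequencies per class
freq <- table(classes)
freq_table <- data.frame(
  limits   = names(freq),
  Freq     = as.vector(freq),
  Rel_Freq = as.vector(freq) / length(price),
  Cum_Freq = cumsum(as.vector(freq))
)
print(freq_table)
```

As in the table above, `Cum_Freq` accumulates counts rather than proportions, so its last value equals the number of observations.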

Tabular accuracy index (TAI)

##        # classes  Goodness of fit Tabular accuracy 
##        5.0000000        0.9982456        0.9655172
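The labels in this output resemble those printed by `jenks.tests()` from the classInt package, although the report does not show the call. As a rough illustration of what the tabular accuracy index measures, the sketch below computes it by hand as one minus the ratio of absolute deviations around the class means to absolute deviations around the overall mean. The prices and the three class limits are hypothetical.

```r
# Hypothetical prices and a hypothetical split into three classes
price <- c(360000, 420000, 480000, 700000, 720000, 760000, 1000000, 1100000)
class_id <- cut(price, breaks = c(350000, 500000, 800000, 1150000))

# ave() returns, for each value, the mean of the class it belongs to
class_mean <- ave(price, class_id)

# TAI: 1 - (absolute deviations within classes) / (absolute deviations overall)
tai <- 1 - sum(abs(price - class_mean)) / sum(abs(price - mean(price)))

# Goodness of variance fit uses squared deviations in the same way
gvf <- 1 - sum((price - class_mean)^2) / sum((price - mean(price))^2)

print(c(TAI = tai, GVF = gvf))
```

Both measures approach 1 when the classes are internally homogeneous, so they can be used to compare alternative numbers of classes.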

Basic plots

In this section we present the data using base R graphics (the graphics package that ships with R). Choose the plot type that suits the measurement scale of each variable. Investigate the heterogeneity of the distribution by presenting the data by groups (e.g. by district or building type). Do not forget main titles, axis labels and a legend.
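A minimal sketch of two such base R plots, using a hypothetical data frame `flats` with made-up values and an assumed column name `price`:

```r
# Toy data; values and the column name `price` are assumptions
flats <- data.frame(
  price = c(520000, 610000, 830000, 360000, 720000,
            950000, 450000, 700000, 1060000),
  district = factor(rep(c("Biskupin", "Krzyki", "Srodmiescie"), each = 3))
)
cols <- c("steelblue", "tomato", "seagreen")

# Grouped boxplot: quantitative variable split by a nominal one
bp <- boxplot(price ~ district, data = flats, col = cols,
              main = "Apartment prices by district",
              xlab = "District", ylab = "Price")
legend("topleft", legend = levels(flats$district), fill = cols,
       title = "District")

# Histogram: distribution of a single quantitative variable
hist(flats$price,
     main = "Distribution of apartment prices",
     xlab = "Price", ylab = "Number of apartments")
```

For a nominal variable on its own, `barplot(table(flats$district))` would be the corresponding choice.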

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

ggplot2 plots

Using facets

Faceting generates small multiples each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different. Read more about facets here.
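A short sketch of faceting with ggplot2, assuming the package is installed. The data frame `flats` and the column names `price` and `size` are hypothetical stand-ins for the lab data.

```r
library(ggplot2)

# Toy data; values and column names are made up for illustration
flats <- data.frame(
  size  = c(20, 45, 80, 25, 50, 85, 18, 40, 75),
  price = c(520000, 610000, 830000, 360000, 720000,
            950000, 450000, 700000, 1060000),
  district = factor(rep(c("Biskupin", "Krzyki", "Srodmiescie"), each = 3))
)

# One panel per district; all panels share the same axes by default
p <- ggplot(flats, aes(x = size, y = price)) +
  geom_point() +
  facet_wrap(~ district) +
  labs(title = "Price against size, by district",
       x = "Size", y = "Price")

print(p)
```

`facet_grid(building_type ~ district)` would lay the panels out by two grouping variables at once, given a `building_type` column.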

Descriptive statistics #1

Before automatically reporting the full summary table of descriptive statistics, the goal this time is to measure the central tendency of the price distribution. Compare the mean, median and mode, together with positional measures (quantiles), by district and by building type or number of rooms per apartment.

## [1] "central tendency of prices: "
## MEAN: 760035
## median: 755719.5
## sd:  186099.8
## var: 34633125960
## min:  359769
## max:  1277691
## cv:  0.2448568
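The last value in the output above, 0.2448568, equals the standard deviation divided by the mean (186099.8 / 760035), which is the coefficient of variation. The sketch below shows one way to compute such measures by group in base R. The data frame `flats` is hypothetical, and `stat_mode()` is a helper defined here, since base R has no function for the statistical mode.

```r
# Toy data; values and column names are made up for illustration
flats <- data.frame(
  price = c(500000, 600000, 700000, 400000, 800000, 900000),
  rooms = c(1, 2, 2, 3, 3, 3),
  district = factor(rep(c("Biskupin", "Krzyki"), each = 3))
)

# Most frequent value of a numeric vector (first one in case of ties)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Central tendency of prices by district
mean_by   <- tapply(flats$price, flats$district, mean)
median_by <- tapply(flats$price, flats$district, median)

# Quartiles by district (a list with one vector per group)
quart_by <- tapply(flats$price, flats$district, quantile,
                   probs = c(0.25, 0.5, 0.75))

# The mode is more informative for a discrete variable such as rooms
mode_rooms <- stat_mode(flats$rooms)

# Coefficient of variation of prices
cv <- sd(flats$price) / mean(flats$price)

print(mean_by)
print(median_by)
print(quart_by)
print(mode_rooms)
print(cv)
```

For a continuous variable such as price, nearly every value is unique, so the mode is usually read from a binned frequency table instead.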

Summary tables with ‘kable’

Using the kable function from the knitr package together with the kableExtra package, we can easily create summary tables with inline graphics and/or statistics.

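[Table: one row per number of rooms (1 to 4), with inline plots in the columns boxplot, histogram, line1, line2 and points1. The plots are not reproduced here.]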

Now we summarize the basic central tendency measures for prices by district and by building type using kable. You can customize your final report. See some hints here.
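A minimal sketch of how a grouped summary can be passed to `kable()`, assuming the knitr package is installed. The data frame `flats` and its values are hypothetical, and the tables below were evidently produced with a richer layout than this.

```r
library(knitr)

# Toy data; values and column names are made up for illustration
flats <- data.frame(
  price = c(500000, 600000, 700000, 400000, 800000, 900000),
  district = factor(rep(c("Biskupin", "Krzyki"), each = 3))
)

# One numeric vector of prices per district
by_district <- split(flats$price, flats$district)

# One row of summary statistics per district
summary_tab <- data.frame(
  district = names(by_district),
  N      = sapply(by_district, length),
  min    = sapply(by_district, min),
  median = sapply(by_district, median),
  max    = sapply(by_district, max),
  mean   = sapply(by_district, mean),
  sd     = sapply(by_district, sd),
  row.names = NULL
)

# Render the summary as a table
tab <- kable(summary_tab, digits = 2,
             caption = "Prices by district (toy data)")
print(tab)
```

In an R Markdown document the result of `kable()` can then be piped into kableExtra functions such as `kable_styling()` for further formatting.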

                    Biskupin (N = 65)           Krzyki (N = 79)             Srodmiescie (N = 56)
Prices:
min price           519652                      359769                      448196
median price        817736                      716726                      727477.5
max price           1277691                     1090444                     1062054
mean (sd) price     818,614.06 ± 175,597.94     726,507.20 ± 195,015.45     739,339.70 ± 171,428.11
skewness:           0.34                        0.07                        0.11
kurtosis:           -0.29                       -0.98                       -1.13
Size:
min size            17.1                        17.4                        17
median size         45.1                        44.1                        42.95
max size            87.7                        86.6                        83.3
mean (sd) size      47.05 ± 19.57               46.86 ± 20.95               44.27 ± 19.63
skewness:           0.19                        0.17                        0.3
kurtosis:           -0.78                       -1.07                       -1.06
Number of rooms, n (%):
One room            12 (18)                     18 (23)                     14 (25)
Two rooms           16 (25)                     19 (24)                     15 (27)
Three rooms         24 (37)                     18 (23)                     16 (29)
Four rooms          13 (20)                     24 (30)                     11 (20)

                    kamienica (N = 61)          niski blok (N = 63)         wiezowiec (N = 76)
Prices:
min price           415834                      496390                      359769
median price        800693                      807895                      678704
max price           1230848                     1277691                     1090444
mean (sd) price     770,332.52 ± 184,388.21     815,576.63 ± 176,390.35     705,728.87 ± 182,503.32
skewness:           0                           0.22                        0.21
kurtosis:           -0.61                       -0.45                       -0.97
Size:
min size            17                          17.4                        17.4
median size         46.2                        49.6                        41.85
max size            87.5                        87.7                        85.7
mean (sd) size      48.37 ± 20.92               49.13 ± 18.99               42.02 ± 19.82
skewness:           0.13                        0.15                        0.36
kurtosis:           -1.09                       -0.7                        -1.03
Number of rooms, n (%):
One room            11 (18)                     9 (14)                      24 (32)
Two rooms           17 (28)                     18 (29)                     15 (20)
Three rooms         17 (28)                     20 (32)                     21 (28)
Four rooms          16 (26)                     16 (25)                     16 (21)