DS Lab #1

Karol Flisikowski, Michał Rejmak

2021-03-30

Data wrangling

As you can see, not all variables are stored in an appropriate format. We need to convert each variable to a format that matches its measurement scale and its intended use.

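# Nominal variables: unordered factors
mieszkania$district <- as.factor(mieszkania$district)
mieszkania$building_type <- as.factor(mieszkania$building_type)
# Ordinal variable: ordered factor
mieszkania$rooms <- factor(mieszkania$rooms, ordered = TRUE)
# Put the columns on the search path so they can be referenced by name
attach(mieszkania)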
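
To check that such a conversion worked, it can help to inspect the structure of the data frame afterwards. The sketch below uses a small made-up data frame (`flats`, with invented values) instead of `mieszkania`, so every name and number in it is an assumption for illustration only.

```r
# Hypothetical toy data standing in for the real data set.
flats <- data.frame(
  district = c("Krzyki", "Biskupin", "Srodmiescie", "Krzyki"),
  building_type = c("kamienica", "wiezowiec", "niski blok", "kamienica"),
  rooms = c(2, 1, 3, 4),
  stringsAsFactors = FALSE
)

# Nominal variables become unordered factors.
flats$district <- as.factor(flats$district)
flats$building_type <- as.factor(flats$building_type)

# The number of rooms is ordinal, so it becomes an ordered factor.
flats$rooms <- factor(flats$rooms, ordered = TRUE)

str(flats)                         # shows the class of every column
room_levels <- levels(flats$rooms) # levels in increasing order
more_than_two <- flats$rooms > "2" # comparisons are defined for ordered factors
more_than_two
```

Referring to columns as `flats$rooms` keeps the example independent of `attach()`, which can hide which object a name comes from.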

Frequency table

##             limits Freq Percentage_of_all
## 1   (80000,100000]    6               3.0
## 2  (100000,120000]   14               7.0
## 3  (120000,140000]   25              12.5
## 4  (140000,160000]   34              17.0
## 5  (160000,180000]   27              13.5
## 6  (180000,200000]   27              13.5
## 7  (200000,220000]   34              17.0
## 8  (220000,240000]   18               9.0
## 9  (240000,260000]   12               6.0
## 10 (260000,280000]    1               0.5
## 11 (280000,300000]    2               1.0
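
A table of this shape can be built in base R with `cut()` and `table()`. The code that produced the table above is not shown in the report, so the following is only a sketch under assumptions: the `price` vector is randomly generated, and the class width of 20,000 EUR is read off the intervals above.

```r
set.seed(1)
# Hypothetical prices in EUR; the real analysis would use the price column.
price <- round(runif(200, min = 80001, max = 300000))

# Class limits of width 20,000 EUR, matching the intervals shown above.
limits <- seq(80000, 300000, by = 20000)

# Right-closed intervals (a, b]; dig.lab = 6 keeps labels out of
# scientific notation.
classes <- cut(price, breaks = limits, dig.lab = 6)

# Counts per class and their share of all observations.
freq_table <- as.data.frame(table(classes))
names(freq_table) <- c("limits", "Freq")
freq_table$Percentage_of_all <-
  round(100 * freq_table$Freq / sum(freq_table$Freq), 1)
freq_table
```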

TAI

                          x
# classes        11.0000000
Goodness of fit   0.9760388
Tabular accuracy  0.8799735
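
Both indices compare the spread inside the classes with the spread of the whole variable. Assuming the definitions usually attributed to Jenks (the report does not show how its values were computed), the tabular accuracy index is one minus the ratio of absolute deviations around the class means to absolute deviations around the overall mean, and the goodness of fit is the same ratio with squared deviations. The helper names `tai` and `gvf` and the six example values are made up for this sketch.

```r
# Tabular accuracy index: 1 - (absolute deviations around class means) /
# (absolute deviations around the overall mean).
tai <- function(x, classes) {
  within_class <- tapply(x, classes, function(v) sum(abs(v - mean(v))))
  # na.rm = TRUE skips classes that contain no observations.
  1 - sum(within_class, na.rm = TRUE) / sum(abs(x - mean(x)))
}

# Goodness of variance fit: the same idea with squared deviations.
gvf <- function(x, classes) {
  within_class <- tapply(x, classes, function(v) sum((v - mean(v))^2))
  1 - sum(within_class, na.rm = TRUE) / sum((x - mean(x))^2)
}

# Hypothetical example: six values split into two classes.
x <- c(1, 2, 3, 11, 12, 13)
classes <- factor(c("low", "low", "low", "high", "high", "high"))
tai(x, classes)
gvf(x, classes)
```

Values close to 1 mean that the classes are internally homogeneous, which is how the figures reported above can be read.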

Basic plots

In this section we present our data using base R graphics (the graphics functions that ship with R). Choose the most appropriate plots for the measurement scale of the chosen variables. Investigate the heterogeneity of the distribution by presenting the data by groups (e.g. by district, building type etc.). Do not forget about main titles, axis labels and a legend. Read more about graphical parameters here.
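
As a rough illustration of these recommendations, the sketch below draws a grouped box plot and a histogram with a title, axis labels and a legend. The data frame `flats` is randomly generated and only mimics the kind of variables used in this report, so the plots it produces are not the ones discussed below.

```r
set.seed(2)
# Hypothetical data; the real analysis would use the prepared data set.
flats <- data.frame(
  price = round(rnorm(90, mean = 180000, sd = 40000)),
  building_type = factor(rep(c("kamienica", "niski blok", "wiezowiec"),
                             each = 30))
)

# Box plot of price by building type, with a main title and axis labels.
bp <- boxplot(price ~ building_type, data = flats,
              main = "Apartment prices by building type",
              xlab = "Building type", ylab = "Price (EUR)",
              col = c("grey80", "lightblue", "salmon"))

# Histogram of all prices, with lines for the mean and the median.
hist(flats$price, breaks = 10,
     main = "Distribution of apartment prices",
     xlab = "Price (EUR)", col = "grey80")
abline(v = mean(flats$price), col = "red", lwd = 2)
abline(v = median(flats$price), col = "blue", lwd = 2, lty = 2)
legend("topright", legend = c("mean", "median"),
       col = c("red", "blue"), lwd = 2, lty = c(1, 2))
```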

Conclusions of the plot

The largest group of apartments costs between 150,000 and 200,000 EUR, and within this interval the building types are represented roughly equally.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Conclusions of the plot

Regardless of the district, low apartment blocks reach the highest prices at the upper end of the price range. Surprisingly, Krzyki is the only district where the median price of tenement houses is higher than that of low apartment blocks.

ggplot2 plots

Conclusions of the plot

As we can see, in every district except Srodmiescie, small one-room flats make up the largest share of all flats. In the city center, however, medium-sized flats with two or three rooms make up an equal share.

Conclusions of the plot

This plot shows that in almost every case the most expensive flats are in the Biskupin district. There is only one exception: skyscraper flats with the largest surface, for which prices in all districts seem to be quite similar.

Using facets

Faceting generates small multiples each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different. Read more about facets here.
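
A minimal sketch of faceting with `facet_wrap()` from ggplot2 is given below. It assumes that ggplot2 is installed, and it uses a randomly generated data frame `flats` whose columns and values are invented for the example.

```r
library(ggplot2)

set.seed(3)
# Hypothetical data with the same kind of variables as the report.
flats <- data.frame(
  surface = runif(120, min = 17, max = 88),
  district = factor(rep(c("Biskupin", "Krzyki", "Srodmiescie"), times = 40)),
  building_type = factor(rep(c("kamienica", "niski blok", "wiezowiec"),
                             each = 40))
)
flats$price <- round(60000 + 2500 * flats$surface + rnorm(120, sd = 20000))

# One panel per district; points are coloured by building type.
p <- ggplot(flats, aes(x = surface, y = price, colour = building_type)) +
  geom_point() +
  facet_wrap(~ district) +
  labs(title = "Price against surface, by district",
       x = "Surface", y = "Price (EUR)", colour = "Building type")

print(p)
```

Replacing `facet_wrap(~ district)` with `facet_grid(building_type ~ district)` would instead lay the panels out as a grid with one row per building type and one column per district.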

Descriptive statistics #1

Before automatically reporting the full summary table of descriptive statistics, your goal this time is to measure the central tendency of the distribution of prices. Compare the mean, median and mode, together with positional measures (quantiles), by district and building type or by number of rooms per apartment.
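
One way to obtain such grouped measures in base R is `tapply()`. Base R has no function for the statistical mode, so the sketch defines a small helper, `stat_mode`, which is a made-up name. The data are randomly generated, and the method behind the tables in this report may differ.

```r
# Returns the most frequent value; if several values are tied,
# the smallest of them is returned.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

set.seed(4)
# Hypothetical data: prices rounded to thousands so that values repeat.
flats <- data.frame(
  price = round(rnorm(90, mean = 180000, sd = 40000), -3),
  district = factor(rep(c("Biskupin", "Krzyki", "Srodmiescie"), each = 30)),
  building_type = factor(rep(c("kamienica", "niski blok", "wiezowiec"),
                             times = 30))
)

# Mean, median and mode of price for each district by building type cell.
groups <- list(flats$district, flats$building_type)
mean_tbl <- tapply(flats$price, groups, mean)
median_tbl <- tapply(flats$price, groups, median)
mode_tbl <- tapply(flats$price, groups, stat_mode)
mean_tbl

# Quartiles of price within each district.
quartiles <- tapply(flats$price, flats$district, quantile,
                    probs = c(0.25, 0.5, 0.75))
quartiles
```

For a continuous variable such as price, most values occur only once, so the mode depends heavily on rounding and is less informative than the mean or the median.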

Columns are the building types: kamienica (tenement house), niski blok (low apartment block), wiezowiec (skyscraper).

Mean

district     kamienica  niski blok  wiezowiec
Biskupin     190804.3   206699.1    174650.5
Krzyki       170530.2   182189.1    156823.4
Srodmiescie  166809.4   182154.7    162064.8

Median

district     kamienica  niski blok  wiezowiec
Biskupin     193098.5   210291.0    174515.5
Krzyki       185346.0   172727.0    145588.0
Srodmiescie  155046.0   182110.5    160853.5

Mode

district     kamienica  niski blok  wiezowiec
Biskupin     186732     214462      128121
Krzyki       98926      158609      226298
Srodmiescie  120514     245846      118333

Conclusions

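In every district, low apartment blocks have the highest mean price. Despite this, the median price of tenement houses in Krzyki is higher than that of low apartment blocks. The mean for tenement houses in Krzyki (170,530.2) lies well below their median (185,346.0), which indicates that a few cheap flats pull the mean down. The conclusion is that in Krzyki a larger share of tenement houses (data samples) than of low apartment blocks falls in the upper price range.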

Summary tables with ‘kable’

Using the kable function from the knitr package together with the kableExtra package, we can easily create summary tables with graphics and/or statistics.

[Table: one row per number of rooms (1 to 4), with inline graphics in the columns boxplot, histogram, line1, line2 and points1]

Now we can finally summarize the basic central tendency measures for prices by district or building type using kable. You can customize your final report. See some hints here.

                            Biskupin (N = 65)        Krzyki (N = 79)          Srodmiescie (N = 56)
Prices in EUR
  min                       120290                   83280                    103749
  max                       295762                   252418                   245846
  mean (sd)                 189,494.00 ± 40,647.70   168,173.00 ± 45,142.40   171,143.43 ± 39,682.41
Central Tendency of prices
  Skewness                  0.335196812077772        0.0718500007021854       0.106122298502577
  Kurtosis                  -0.294219001665212       -0.981311219320137       -1.12610324021581
Surface
  min                       17.1                     17.4                     17
  max                       87.7                     86.6                     83.3
  mean (sd)                 47.05 ± 19.57            46.86 ± 20.95            44.27 ± 19.63
Central Tendency of Surface
  Skewness                  0.1900099278084          0.169363280497817        0.296765067386592
  Kurtosis                  -0.778242543309705       -1.07136263028292        -1.05854133853727
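
The statistics in a table like the one above could be computed along the following lines. This is a sketch under assumptions: the data are randomly generated, and `skewness` and `excess_kurtosis` are hand-written moment-based estimators. The package actually used for this report is not shown, and other estimators apply slightly different bias corrections, so results may differ in the later decimals. Negative kurtosis values, as in the table, point to excess kurtosis (kurtosis minus 3).

```r
# Moment-based sample skewness.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis (0 for a normal distribution).
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

set.seed(5)
# Hypothetical prices for three districts.
flats <- data.frame(
  price = round(rnorm(60, mean = 180000, sd = 40000)),
  district = factor(rep(c("Biskupin", "Krzyki", "Srodmiescie"), each = 20))
)

# One row per district with the statistics used in the summary table.
by_district <- split(flats$price, flats$district)
summary_tbl <- data.frame(
  district = names(by_district),
  n = sapply(by_district, length),
  min = sapply(by_district, min),
  max = sapply(by_district, max),
  mean = sapply(by_district, mean),
  sd = sapply(by_district, sd),
  skewness = sapply(by_district, skewness),
  kurtosis = sapply(by_district, excess_kurtosis),
  row.names = NULL
)

# Render with kable if knitr is installed; otherwise print the data frame.
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(summary_tbl, digits = 2))
} else {
  print(summary_tbl)
}
```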