DS Lab #1 - Apartments in Wrocław

Karol Flisikowski, Jakub Sochacki

2021-03-31

Data wrangling

As you can see, not all variables have the right type yet. We need to convert each variable to a type that matches its measurement scale and its intended use.

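# Nominal variables become unordered factors
mieszkania$district <- as.factor(mieszkania$district)
mieszkania$building_type <- as.factor(mieszkania$building_type)
# The number of rooms is ordinal, so use an ordered factor
mieszkania$rooms <- factor(mieszkania$rooms, ordered = TRUE)
# Prices are on a ratio scale and must be numeric
mieszkania$price_PLN <- as.numeric(mieszkania$price_PLN)
mieszkania$price_EUR <- as.numeric(mieszkania$price_EUR)
# Attach only after all conversions, so the attached columns reflect them
attach(mieszkania)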

Frequency table

Prices frequency table
Price range (PLN) Frequency
(350000,400000] 2
(400000,450000] 7
(450000,500000] 9
(500000,550000] 12
(550000,600000] 14
(600000,650000] 19
(650000,700000] 20
(700000,750000] 16
(750000,800000] 12
(800000,850000] 19
(850000,900000] 17
(900000,950000] 19
(950000,1000000] 15
(1000000,1050000] 6
(1050000,1100000] 9
(1100000,1150000] 1
(1150000,1200000] 0
(1200000,1250000] 2
(1250000,1300000] 1
(1300000,1350000] 0

TAI

x
# classes 20.0000000
Goodness of fit 0.9944395
Tabular accuracy 0.9234661

Conclusion:

Dividing the prices into 20 classes yields a high tabular accuracy index (about 0.92).
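As a rough illustration of how such a frequency table and the tabular accuracy index can be obtained, the sketch below uses base R only. The prices are simulated (the original data set is not reproduced here), the names `price`, `price_class` and `tai` are illustrative, and the TAI formula assumes the usual definition based on absolute deviations from class means. The report itself may well have used a dedicated package instead.

```r
# Simulated prices stand in for mieszkania$price_PLN (assumption, not the real data)
set.seed(1)
price <- round(runif(200, min = 360000, max = 1280000))

# 20 right-closed classes of 50,000 PLN each, as in the table above
breaks <- seq(350000, 1350000, by = 50000)
price_class <- cut(price, breaks = breaks, dig.lab = 7)

# Frequency table: one row per class, empty classes included
freq_table <- as.data.frame(table(price_class))
names(freq_table) <- c("Price range", "Frequency")

# Tabular accuracy index (assumed definition):
# 1 - sum of |x - class mean| / sum of |x - overall mean|
within_dev <- sum(
  tapply(price, price_class, function(x) sum(abs(x - mean(x)))),
  na.rm = TRUE  # empty classes yield NA and contribute nothing
)
total_dev <- sum(abs(price - mean(price)))
tai <- 1 - within_dev / total_dev

print(freq_table)
print(tai)
```

Values of the index close to 1 mean that class means represent the individual observations well.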

Basic plots

In this section we should present our data using base R graphics (the ones pre-installed with R). Choose the most appropriate plots according to the scale of the chosen variables. Investigate the heterogeneity of the distribution by presenting the data by groups (e.g. by district, building type etc.). Do not forget about main titles, labels and legends.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Conclusion:

The box plot shows the price quartiles and median for each district in Wrocław. The main conclusion is that apartments in Biskupin are the most expensive: both the median price and the first quartile (25th percentile) are clearly the highest in that district. Another conclusion is that Krzyki has many cheap apartments.
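A grouped box plot of this kind could be drawn with base R graphics roughly as follows. This is only a sketch: the data frame `sim` is simulated from the group sizes, means and standard deviations reported later in this document, and the colours and titles are arbitrary choices.

```r
# Simulated stand-in for the mieszkania data (assumption, not the real data)
set.seed(42)
sim <- data.frame(
  district = factor(rep(c("Biskupin", "Krzyki", "Srodmiescie"),
                        times = c(65, 79, 56))),
  price_PLN = c(rnorm(65, mean = 818614, sd = 175598),
                rnorm(79, mean = 726507, sd = 195015),
                rnorm(56, mean = 739340, sd = 171428))
)

# One box per district, with a main title and axis labels
bp <- boxplot(price_PLN ~ district, data = sim,
              main = "Apartment prices by district (simulated data)",
              xlab = "District", ylab = "Price (PLN)",
              col = c("tomato", "steelblue", "gold"))

# bp$stats holds, per group: lower whisker, Q1, median, Q3, upper whisker
print(bp$stats)
```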

ggplot2 plots

The following frequency table and histogram show the distribution of apartment sizes.

Apartment size frequency table
Size range in m^2 Frequency
(17,24.1] 43
(24.1,31.1] 6
(31.1,38.2] 25
(38.2,45.3] 31
(45.3,52.4] 14
(52.4,59.4] 22
(59.4,66.5] 24
(66.5,73.6] 12
(73.6,80.6] 9
(80.6,87.7] 13

TAI

x
# classes 10.0000000
Goodness of fit 0.9922741
Tabular accuracy 0.9136083

Conclusion:

The tabular accuracy for 10 classes of apartment size (about 0.91) is slightly lower than for the 20 classes of price tested earlier (about 0.92), but it still allows for a good visual representation of the data.

Conclusion:

The histogram shows an unusually high frequency in the lowest class, that is, among the smallest apartments.

shape_measure value
Skewness 0.6073357
Kurtosis -0.8379891

Conclusion:

The positive skewness tells us that the distribution is asymmetric: most observations are concentrated on the left side of the histogram, and the longer tail extends to the right. In other words, the histogram is positively (right) skewed, which typically means mean > median. We should be aware that the skewness is influenced by the large peak of the smallest apartments on the left side of the plot. To evaluate the tailedness of the distribution, kurtosis comes in handy. Its negative value tells us that the distribution is platykurtic, that is, flatter and lighter-tailed than the normal distribution.
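The two shape measures can be computed by hand from central moments. The sketch below assumes the simple moment-based definitions (skewness as m3 / m2^1.5, excess kurtosis as m4 / m2^2 - 3); packages may apply small-sample corrections, so their results can differ slightly. The function name `shape_measures` and the toy vectors are illustrative only.

```r
# Moment-based skewness and excess kurtosis (assumed definitions)
shape_measures <- function(x) {
  d <- x - mean(x)
  m2 <- mean(d^2)  # second central moment
  m3 <- mean(d^3)  # third central moment
  m4 <- mean(d^4)  # fourth central moment
  c(skewness = m3 / m2^1.5, kurtosis = m4 / m2^2 - 3)
}

# A symmetric toy sample: skewness is zero, kurtosis is negative
symmetric <- c(1, 2, 3, 4, 5)
print(shape_measures(symmetric))

# A right-skewed toy sample: one large value pulls the mean above the median
right_skewed <- c(1, 1, 1, 2, 10)
print(shape_measures(right_skewed))
print(c(mean = mean(right_skewed), median = median(right_skewed)))
```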

Conclusion:

Unsurprisingly, the more rooms an apartment has, the more expensive it is. An interesting observation is that the interquartile range and the overall range of prices widen as the number of rooms grows. In other words, prices vary little among one- and two-room apartments and much more among larger flats.

Using facets

Faceting generates small multiples, each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different.
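A minimal faceting sketch with ggplot2 is shown below. It assumes the ggplot2 package is installed and uses simulated prices, so the panels illustrate only the mechanics (one panel per district, one box per building type), not the real pattern in the data.

```r
library(ggplot2)

# Simulated stand-in data: 20 apartments per district/building type pair
set.seed(7)
sim <- expand.grid(
  district = c("Biskupin", "Krzyki", "Srodmiescie"),
  building_type = c("kamienica", "niski blok", "wiezowiec"),
  id = 1:20
)
sim$price_PLN <- rnorm(nrow(sim), mean = 750000, sd = 180000)

# facet_wrap() draws one small multiple per district
p <- ggplot(sim, aes(x = building_type, y = price_PLN)) +
  geom_boxplot() +
  facet_wrap(~ district) +
  labs(title = "Prices by building type within each district (simulated)",
       x = "Building type", y = "Price (PLN)")

print(p)
```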

Conclusion:

Now we can see not only that Biskupin is the most expensive district, but also that in each district the mean price is highest in low blocks of flats, while the cheapest apartments are found in skyscrapers.

Descriptive statistics #1

Before automatically reporting the full summary table of descriptive statistics, your goal this time is to measure the central tendency of the distribution of prices. Compare the mean, median and mode together with positional measures (quantiles) by district, building type or number of rooms per apartment.

Mean, median and mode of apartment prices in Wrocław

The following plots present three central tendency measures (mean, median and mode) for each district in Wrocław.

Central tendency measures of price in PLN by districts
district mean median mode
Biskupin 818614.1 817736.0 553484
Krzyki 726507.2 716726.0 977608
Srodmiescie 739339.7 727477.5 511200

Conclusion:

Judging by both the mean and the median, Biskupin is clearly the most expensive district and Krzyki the cheapest. Note how easy it is to mislead with the mode in this case. The mode is the most frequently occurring value, but two apartments rarely have exactly the same price, so the mode of such a variable is essentially arbitrary. Taken at face value, the mode suggests that Krzyki is almost twice as expensive as Śródmieście (977,608 vs 511,200 PLN), which contradicts the mean and the median. The mode plot was included to show how unreliable this measure is here. The mode is better suited to small integer counts or qualitative data.
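Base R has no built-in function for the statistical mode, so a small helper is usually written by hand. The sketch below uses a hypothetical helper `stat_mode` and toy vectors to show why the mode works for discrete data but becomes arbitrary when every value is unique.

```r
# Hypothetical helper: the most frequent value of a numeric vector.
# With ties, the smallest of the tied values is returned, because
# table() sorts the values and which.max() picks the first maximum.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Discrete data: the mode is meaningful
rooms <- c(1, 2, 2, 3, 2, 4, 1)
print(stat_mode(rooms))

# Unique prices: every value occurs once, so the "mode" is just the minimum
prices <- c(553484, 977608, 511200)
print(stat_mode(prices))

# Binning the prices first gives a more useful modal class
price_class <- cut(prices, breaks = seq(500000, 1000000, by = 100000), dig.lab = 7)
print(names(which.max(table(price_class))))
```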

Quantiles and interquartile range:

Conclusion:

The first quartile (Q1) is the value that splits the observations into the 25% with smaller values and the 75% with larger values. The second quartile equals the median and is marked with the red line on the box plot. The third quartile (Q3) is the value exceeded by only 25% of observations. The left border marks the 5th percentile and the right border the 95th percentile. The interquartile range is calculated as IQR = Q3 - Q1. It spans the middle 50% of all observations, so the wider the IQR, the more dispersed the variable.
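These positional measures are available directly in base R through `quantile()` and `IQR()`. The toy prices below (in thousands of PLN) are made up for illustration, and the default quantile algorithm of R (type 7) is assumed.

```r
# Toy prices in thousands of PLN (illustrative values, not the real data)
toy_prices <- c(400, 450, 500, 550, 600, 650, 700, 750, 800)

# Quartiles plus the 5th and 95th percentiles
q <- quantile(toy_prices, probs = c(0.05, 0.25, 0.50, 0.75, 0.95))
print(q)

# Interquartile range computed two ways: Q3 - Q1 and the built-in IQR()
iqr_manual <- unname(q["75%"] - q["25%"])
iqr_builtin <- IQR(toy_prices)
print(c(manual = iqr_manual, builtin = iqr_builtin))
```

For grouped results, the same call can be applied per district with `tapply()`.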

Summary tables with ‘kable’

Using kable (from the knitr package) and the kableExtra package we can easily create summary tables with graphics and/or statistics.

Now we will summarize the basic central tendency measures for prices by district and building type using the kable packages. You can customize your final report.
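As a minimal sketch of the idea, the central tendency table shown earlier in this report can be typed into a data frame and rendered with `knitr::kable()`. This assumes the knitr package is installed; the object names `central` and `tbl` are illustrative, and the caption text is an arbitrary choice.

```r
library(knitr)

# Values copied from the "Central tendency measures" table in this report
central <- data.frame(
  district = c("Biskupin", "Krzyki", "Srodmiescie"),
  mean = c(818614.1, 726507.2, 739339.7),
  median = c(817736.0, 716726.0, 727477.5),
  mode = c(553484, 977608, 511200)
)

# Render the data frame as a table; digits = 0 rounds prices to whole PLN
tbl <- kable(central, digits = 0,
             caption = "Central tendency measures of price in PLN by district")
print(tbl)
```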

Summary:

Prices and rooms by district in Wrocław
Biskupin (N = 65) Krzyki (N = 79) Srodmiescie (N = 56)
PRICE in PLN
min 519652 359769 448196
max 1277691 1090444 1062054
mean ± SD 818,614.06 ± 175,597.94 726,507.20 ± 195,015.45 739,339.70 ± 171,428.11
Shape of the distribution
Skewness 0.3352 0.07185 0.10612
Kurtosis -0.29421 -0.98131 -1.12611
PRICE PER m^2 in PLN
min 13145.36 11251.33 12331.64
max 37337.84 30559.62 32775.63
median 17739.17 15435.97 17320.75
SIZE
min 17.1 17.4 17
max 87.7 86.6 83.3
mean ± SD 47.05 ± 19.57 46.86 ± 20.95 44.27 ± 19.63

Conclusion:

The cheapest apartment is found in Krzyki and the most expensive in Biskupin. The minimum and maximum prices in the summary table agree with those shown on the box plot. On the histograms we can compare the shape and central tendency measures. Indeed, Biskupin shows the most positive skewness and Śródmieście the lowest kurtosis.

[Table with a price histogram and a box plot for each district: Biskupin, Krzyki, Srodmiescie]

The next summary table is grouped by building type.

Prices and rooms by building type in Wrocław
kamienica (N = 61) niski blok (N = 63) wiezowiec (N = 76)
PRICE in PLN
min 415834 496390 359769
max 1230848 1277691 1090444
mean ± SD 770,332.52 ± 184,388.21 815,576.63 ± 176,390.35 705,728.87 ± 182,503.32
Shape of the distribution
Skewness -0.00376 0.22343 0.20718
Kurtosis -0.61148 -0.44766 -0.97383
PRICE PER m^2 in PLN
min 11251.33 12851.01 12576.28
max 37337.84 32775.63 32248.67
median 16175.94 16546.07 17164.4
SIZE
min 17 17.4 17.4
max 87.5 87.7 85.7
mean ± SD 48.37 ± 20.92 49.13 ± 18.99 42.02 ± 19.82
ROOMS
One 11 9 24
Two 17 18 15
Three 17 20 21
Four 16 16 16

Conclusion:

The cheapest apartment is found in a skyscraper, while the most expensive is in a low block of flats. An interesting observation is that skyscrapers have by far the most one-room apartments (24, compared with 11 in tenement houses and 9 in low blocks). This strongly affects the mean price in this building type, since one-room flats are relatively cheap.
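The room counts from the table above can be turned into within-building-type shares, which makes the over-representation of one-room flats in skyscrapers easier to see. The sketch below only re-enters the counts from the ROOMS rows; the object names are illustrative.

```r
# Room counts by building type, copied from the ROOMS rows of the table above
rooms_by_type <- matrix(
  c(11,  9, 24,
    17, 18, 15,
    17, 20, 21,
    16, 16, 16),
  nrow = 4, byrow = TRUE,
  dimnames = list(rooms = c("One", "Two", "Three", "Four"),
                  building_type = c("kamienica", "niski blok", "wiezowiec"))
)

# Column totals should match the group sizes N = 61, 63 and 76
print(colSums(rooms_by_type))

# Share of each room count within a building type (columns sum to 1)
room_shares <- prop.table(rooms_by_type, margin = 2)
print(round(room_shares, 3))
```

Roughly 32% of the skyscraper apartments have one room, compared with about 18% in tenement houses and 14% in low blocks.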