DS Lab #1

Anna Pająk

2021-03-31

Data wrangling

As you can see, not all variables have an appropriate type yet. We need to convert each variable to a type that matches its measurement scale and its intended use.

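# Nominal variables become unordered factors
mieszkania$district <- as.factor(mieszkania$district)
mieszkania$building_type <- as.factor(mieszkania$building_type)
# The number of rooms is ordinal, so it becomes an ordered factor
mieszkania$rooms <- factor(mieszkania$rooms, ordered = TRUE)
# Prices are numeric
mieszkania$price_PLN <- as.numeric(mieszkania$price_PLN)
mieszkania$price_EUR <- as.numeric(mieszkania$price_EUR)
# Attach only after all conversions, so the attached copies have the final types
attach(mieszkania)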

Frequency table

In the first stage of our analysis, we group the data into a simple frequency table.

First, let’s take a look at the distribution of prices of apartments in our sample:

Apartments in Wroclaw - prices in kPLN
x        label            Freq  Percent  Valid Percent  Cumulative Percent
Valid    350-450 kPLN        9      4.5            4.5                 4.5
         450-550 kPLN       21     10.5           10.5                15.0
         550-650 kPLN       33     16.5           16.5                31.5
         650-750 kPLN       36     18.0           18.0                49.5
         750-850 kPLN       31     15.5           15.5                65.0
         850-950 kPLN       36     18.0           18.0                83.0
         950-1050 kPLN      21     10.5           10.5                93.5
         1050-1150 kPLN     10      5.0            5.0                98.5
         1150-1250 kPLN      2      1.0            1.0                99.5
         1250-1350 kPLN      1      0.5            0.5               100.0
         Total             200    100.0          100.0
Missing  <blank>             0      0.0
         <NA>                0      0.0
Total                      200    100.0
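
A table of this kind can be sketched in base R with cut() and table(). The snippet below is only a minimal illustration on simulated prices: the vector `prices`, the seed and the data frame `freq_table` are assumptions made for this example, not the lab data or the code that produced the table above. It uses the same 100 kPLN class width.

```r
# Simulated prices in PLN (illustrative only, not the lab data)
set.seed(1)
prices <- round(runif(200, min = 350000, max = 1350000))

# Class limits every 100 kPLN, expressed in kPLN
breaks <- seq(350, 1350, by = 100)
labels <- paste0(head(breaks, -1), "-", tail(breaks, -1), " kPLN")

# Assign each price to a class; intervals are closed on the left
classes <- cut(prices / 1000, breaks = breaks, labels = labels,
               right = FALSE, include.lowest = TRUE)

# Counts, percentages and cumulative percentages per class
freq <- table(classes)
freq_table <- data.frame(
  label = names(freq),
  Freq = as.vector(freq),
  Percent = as.vector(round(100 * prop.table(freq), 1)),
  Cumulative = as.vector(round(100 * cumsum(prop.table(freq)), 1))
)
print(freq_table)
```

With right = FALSE a price of exactly 450 kPLN falls into the 450-550 class; the opposite convention would need right = TRUE.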

TAI

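Now let’s check the tabular accuracy index (TAI), which measures how well the chosen class intervals represent the data. Values closer to 1 indicate a better fit.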

##        # classes  Goodness of fit Tabular accuracy 
##       10.0000000        0.9780872        0.8508467
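
To make the two indices concrete, here is a minimal base R sketch of how they are commonly defined (following Jenks): both compare deviations within classes to deviations in the whole sample, using absolute deviations for the tabular accuracy and squared deviations for the goodness of fit. The data are simulated and the helper names `sum_abs_dev` and `sum_sq_dev` are hypothetical. The labels in the output above look like those reported by jenks.tests() in the classInt package, but whether that function was used here is an assumption.

```r
# Simulated prices in kPLN (illustrative only, not the lab data)
set.seed(1)
x <- runif(200, min = 350, max = 1350)

# Fixed class limits every 100 kPLN
breaks <- seq(350, 1350, by = 100)
cls <- cut(x, breaks = breaks, include.lowest = TRUE)

# Sum of absolute deviations from the mean
sum_abs_dev <- function(v) sum(abs(v - mean(v)))
# Sum of squared deviations from the mean
sum_sq_dev <- function(v) sum((v - mean(v))^2)

# TAI: 1 - (within-class absolute deviations / total absolute deviations)
tai <- 1 - sum(tapply(x, cls, sum_abs_dev), na.rm = TRUE) / sum_abs_dev(x)
# Goodness of variance fit: the same idea with squared deviations
gvf <- 1 - sum(tapply(x, cls, sum_sq_dev), na.rm = TRUE) / sum_sq_dev(x)

print(c("# classes" = nlevels(cls),
        "Goodness of fit" = gvf,
        "Tabular accuracy" = tai))
```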

Basic plots

In this section we present our data using base R graphics (installed with R by default). Choose the most appropriate plots according to the measurement scale of the chosen variables. Investigate the heterogeneity of the distribution by presenting the data by groups (e.g. by district or building type). Do not forget about main titles, axis labels and legends. Read more about graphical parameters here.

The histogram below gives a general picture of flat prices in Wroclaw. The price intervals with the most observations are 600,000 - 700,000 PLN and 800,000 - 1,000,000 PLN. The first interval is similarly well represented in all analyzed districts, whereas the second is driven mostly by the Biskupin district. This follows directly from the density lines, which show the distribution of prices in each district separately. We can also see that prices in Śródmieście come closest to a symmetric distribution.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

The information about the distribution within districts can be extended with boxplots like those below. A boxplot splits the data at the three quartiles. The filled rectangles represent the interquartile range (the middle 50% of the data) and the thick vertical line marks the median. The lowest prices in the data set were recorded in Krzyki and the highest in Biskupin. Again, we can see that prices in Śródmieście are similarly dispersed above and below the median.
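
The two kinds of plots described above can be sketched with base R graphics as follows. This is only an illustration on simulated data: the data frame `flats`, the group sizes, the colours and the number of bins are assumptions made for this example, not the code behind the original figures.

```r
# Simulated data (illustrative only, not the lab data)
set.seed(1)
flats <- data.frame(
  district = factor(rep(c("Biskupin", "Krzyki", "Śródmieście"),
                        times = c(65, 80, 55))),
  price_PLN = c(rnorm(65, 850000, 180000),
                rnorm(80, 700000, 170000),
                rnorm(55, 760000, 160000))
)
cols <- c("firebrick", "darkgreen", "steelblue")

# Histogram on the density scale, so that density lines can be overlaid
h <- hist(flats$price_PLN, freq = FALSE, breaks = 12, col = "grey90",
          main = "Prices of flats", xlab = "Price (PLN)", ylab = "Density")

# One density line per district
for (i in seq_along(levels(flats$district))) {
  in_district <- flats$district == levels(flats$district)[i]
  lines(density(flats$price_PLN[in_district]), col = cols[i], lwd = 2)
}
legend("topright", legend = levels(flats$district), col = cols, lwd = 2)

# Horizontal boxplots of price by district
b <- boxplot(price_PLN ~ district, data = flats, horizontal = TRUE,
             col = cols, main = "Prices by district",
             xlab = "Price (PLN)", ylab = "District")
```

Setting freq = FALSE matters here: density() returns values on the density scale, so the lines would be nearly invisible on a histogram of raw counts.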

ggplot2 plots

The next step is to analyze the number of rooms per apartment. The bar plot shows it for each district separately. We can also see that the total number of observations differs between districts (largest for Krzyki and smallest for Śródmieście). In terms of the number of flats in each room category, Śródmieście is again the most balanced district, while Biskupin varies the most.

The plot below presents the distribution of the price per square meter. A violin plot combined with a box plot is quite informative. The violin plots show the distribution of all observations within a single category. At the same time, the box plots show the location of the middle 50% of the data, the median and, additionally, the mean.
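
A violin plot with an overlaid box plot and a marked mean can be sketched in ggplot2 roughly as below. The data are simulated, and grouping by number of rooms is an assumption made for this example, as are the object names `flats` and `p`.

```r
library(ggplot2)

# Simulated prices per square meter (illustrative only, not the lab data)
set.seed(1)
flats <- data.frame(
  rooms = factor(rep(1:4, each = 50), ordered = TRUE),
  price_m2 = c(rnorm(50, 26000, 3900), rnorm(50, 18700, 1900),
               rnorm(50, 15800, 1700), rnorm(50, 13600, 1000))
)

p <- ggplot(flats, aes(x = rooms, y = price_m2, fill = rooms)) +
  # Violin: shape of the whole distribution within each category
  geom_violin(alpha = 0.4) +
  # Narrow box plot: quartiles and median
  geom_boxplot(width = 0.15, outlier.shape = NA) +
  # Diamond: mean of each category
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3) +
  labs(title = "Price per square meter by number of rooms",
       x = "Number of rooms", y = "Price per m2 (PLN)") +
  theme(legend.position = "none")
print(p)
```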

Using facets

Faceting generates small multiples each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different. Read more about facets here.
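
As a minimal sketch of faceting, the example below draws one histogram panel per district with facet_wrap(). The data are simulated, and the object names, bin width and layout are assumptions made for illustration only.

```r
library(ggplot2)

# Simulated data (illustrative only, not the lab data)
set.seed(1)
flats <- data.frame(
  district = factor(rep(c("Biskupin", "Krzyki", "Śródmieście"),
                        times = c(65, 80, 55))),
  price_PLN = c(rnorm(65, 850000, 180000),
                rnorm(80, 700000, 170000),
                rnorm(55, 760000, 160000))
)

# One panel per district, stacked vertically so the x axes line up
p <- ggplot(flats, aes(x = price_PLN / 1000)) +
  geom_histogram(binwidth = 100, boundary = 350,
                 fill = "steelblue", colour = "white") +
  facet_wrap(~ district, ncol = 1) +
  labs(title = "Prices of flats by district",
       x = "Price (kPLN)", y = "Number of flats")
print(p)
```

Stacking the panels in one column keeps a common price axis, which makes shifts between districts easy to compare.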

Descriptive statistics #1

Before automatically reporting the full summary table of descriptive statistics, your goal this time is to measure the central tendency of the price distribution. Compare the mean, median and mode, together with positional measures (quantiles), by district and by building type or number of rooms per apartment.

## [1] 760035
## [1] 755719.5
## [1] 186099.8
## [1] 34633125960
## [1] 0.2448568
##      75% 
## 282686.5
##       75% 
## 0.1870314
## [1] 359769
## [1] 1277691
##        0%        5%       25%       50%       75%       95%      100% 
##  359769.0  477175.4  619073.8  755719.5  901760.2 1054250.8 1277691.0
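
The values printed above are unlabeled. Read in order, they are consistent with the mean, median, standard deviation, variance, coefficient of variation, interquartile range, quartile coefficient of variation, minimum, maximum and selected quantiles (for instance 186099.8 / 760035 ≈ 0.2449 and (282686.5 / 2) / 755719.5 ≈ 0.1870), although this mapping is an inference rather than something the output states. A minimal sketch that computes the same measures with labels, on simulated prices and with hypothetical object names:

```r
# Simulated prices in PLN (illustrative only, not the lab data)
set.seed(1)
price <- rnorm(200, mean = 760000, sd = 186000)

iqr <- IQR(price)

stats <- c(
  mean = mean(price),
  median = median(price),
  sd = sd(price),
  variance = var(price),
  cv = sd(price) / mean(price),             # coefficient of variation
  iqr = iqr,                                # Q3 - Q1
  quartile_cv = (iqr / 2) / median(price),  # quartile deviation / median
  min = min(price),
  max = max(price)
)
print(round(stats, 4))

# Selected quantiles, including both extremes
print(quantile(price, probs = c(0, 0.05, 0.25, 0.5, 0.75, 0.95, 1)))
```

Collecting the results in a named vector avoids the anonymous `[1]` lines and keeps every number next to its label.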

Summary tables with ‘kable’

Using the kable function from the knitr package together with the kableExtra package, we can easily create summary tables with graphics and/or statistics.

[Table with one row per number of rooms (1-4) and inline graphics in the columns: boxplot, histogram, line1, line2, points1.]

Now we will summarize the basic measures of central tendency for prices by number of rooms, district and building type using kable. You can customize your final report. See some hints here.

Table 1. Apartments in Wroclaw - prices per m2 in PLN (by number of rooms).
              1 room   2 rooms   3 rooms   4 rooms
Min         17988.45  15130.74  13111.37  11251.33
Max         37337.84  22488.14  20737.01  15599.75
Q1          24221.49  17669.84  14605.45  12906.62
Median      26444.25  18564.91  15399.33  13606.95
Q3          29551.30  19992.93  16646.70  14275.04
Mean        26863.91  18725.02  15772.20  13611.19
Sd           3912.35   1866.75   1657.25    970.05
IQR          5329.80   2323.09   2041.25   1368.41
Sx           2664.90   1161.55   1020.63    684.21
Var %           0.15      0.10      0.11      0.07
IQR Var %       0.20      0.13      0.13      0.10
Skewness        0.19      0.04      0.83     -0.03
Kurtosis       -0.24     -0.61      0.35     -0.51

Table 2. Apartments in Wroclaw - prices per m2 in PLN (by district).
               Biskupin       Krzyki  Śródmieście
Min            13145.36     11251.33     12331.64
Max            37337.84     30559.62     32775.63
Q1             14910.96     13798.74     14488.64
Median         17739.17     15435.97     17320.75
Q3             21228.79     19417.96     21646.04
Mean           19537.32     17314.91     18724.71
Sd              5961.08      4721.67      5208.71
IQR             6317.83      5619.22      7157.40
Sx              3158.92      2809.61      3578.70
Var %              0.31         0.27         0.28
IQR Var %          0.36         0.36         0.41
Skewness           1.20         1.08         0.90
Kurtosis           0.34         0.13        -0.04

Table 3. Apartments in Wroclaw - prices per m2 in PLN (by building type: kamienica = tenement house, niski blok = low-rise block, wieżowiec = high-rise).
              kamienica   niski blok    wieżowiec
Min            13145.36     11251.33     12331.64
Max            37337.84     30559.62     32775.63
Q1             14910.96     13798.74     14488.64
Median         17739.17     15435.97     17320.75
Q3             21228.79     19417.96     21646.04
Mean           19537.32     17314.91     18724.71
Sd              5961.08      4721.67      5208.71
IQR             6317.83      5619.22      7157.40
Sx              3158.92      2809.61      3578.70
Var %              0.31         0.27         0.28
IQR Var %          0.36         0.36         0.41
Skewness           1.20         1.08         0.90
Kurtosis           0.34         0.13        -0.04
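
Summary tables of this shape can be sketched by computing the statistics per group and passing the resulting matrix to knitr::kable(). The example below uses simulated data and covers only part of the statistics listed above; the helper `describe` and the objects `flats` and `summary_tab` are hypothetical names introduced for this illustration.

```r
library(knitr)

# Simulated prices per square meter (illustrative only, not the lab data)
set.seed(1)
flats <- data.frame(
  rooms = factor(rep(1:4, each = 50), ordered = TRUE),
  price_m2 = c(rnorm(50, 26000, 3900), rnorm(50, 18700, 1900),
               rnorm(50, 15800, 1700), rnorm(50, 13600, 1000))
)

# Selected descriptive statistics for one numeric vector
describe <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))
  c(Min = min(x), Max = max(x),
    Q1 = q[1], Median = q[2], Q3 = q[3],
    Mean = mean(x), Sd = sd(x), IQR = q[3] - q[1])
}

# sapply() returns a matrix: statistics in rows, groups in columns
summary_tab <- sapply(split(flats$price_m2, flats$rooms), describe)
colnames(summary_tab) <- c("1 room", "2 rooms", "3 rooms", "4 rooms")

kable(summary_tab, digits = 2,
      caption = "Prices per m2 in PLN by number of rooms (simulated data)")
```

Replacing `flats$rooms` in split() with another grouping factor gives the analogous table for districts or building types.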