DS Lab #1

Adam Pocheć

2021-03-31

Data wrangling

As you can see, not all of our variables are stored in a suitable format. We need to convert each variable to a type that matches its measurement scale and its later use in the analysis.

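# Nominal variables become unordered factors
mieszkania$district <- as.factor(mieszkania$district)
mieszkania$building_type <- as.factor(mieszkania$building_type)
# The number of rooms is ordinal, so use an ordered factor
mieszkania$rooms <- factor(mieszkania$rooms, ordered = TRUE)
# Prices are measured on a ratio scale, so store them as numeric
mieszkania$price_PLN <- as.numeric(mieszkania$price_PLN)
mieszkania$price_EUR <- as.numeric(mieszkania$price_EUR)
# Attach only after all conversions, so the attached copy has the final types
attach(mieszkania)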

Frequency table

In the first stage of our analysis, we group the data into a simple frequency table.

First, let’s take a look at the distribution of prices of apartments in our sample:

##                limits Freq
## 1     (8e+04,8.5e+04]    1
## 2     (8.5e+04,9e+04]    1
## 3     (9e+04,9.5e+04]    1
## 4     (9.5e+04,1e+05]    3
## 5    (1e+05,1.05e+05]    3
## 6  (1.05e+05,1.1e+05]    1
## 7  (1.1e+05,1.15e+05]    7
## 8  (1.15e+05,1.2e+05]    3
## 9  (1.2e+05,1.25e+05]    7
## 10 (1.25e+05,1.3e+05]    7
## 11 (1.3e+05,1.35e+05]    2
## 12 (1.35e+05,1.4e+05]    9
## 13 (1.4e+05,1.45e+05]    8
## 14 (1.45e+05,1.5e+05]   10
## 15 (1.5e+05,1.55e+05]    3
## 16 (1.55e+05,1.6e+05]   13
## 17 (1.6e+05,1.65e+05]    6
## 18 (1.65e+05,1.7e+05]    7
## 19 (1.7e+05,1.75e+05]    8
## 20 (1.75e+05,1.8e+05]    6
## 21 (1.8e+05,1.85e+05]    5
## 22 (1.85e+05,1.9e+05]   12
## 23 (1.9e+05,1.95e+05]    7
## 24   (1.95e+05,2e+05]    3
## 25   (2e+05,2.05e+05]   10
## 26 (2.05e+05,2.1e+05]    9
## 27 (2.1e+05,2.15e+05]    7
## 28 (2.15e+05,2.2e+05]    8
## 29 (2.2e+05,2.25e+05]   11
## 30 (2.25e+05,2.3e+05]    3
## 31 (2.3e+05,2.35e+05]    0
## 32 (2.35e+05,2.4e+05]    4
## 33 (2.4e+05,2.45e+05]    6
## 34 (2.45e+05,2.5e+05]    3
## 35 (2.5e+05,2.55e+05]    2
## 36 (2.55e+05,2.6e+05]    1
## 37 (2.6e+05,2.65e+05]    0
## 38 (2.65e+05,2.7e+05]    0
## 39 (2.7e+05,2.75e+05]    0
## 40 (2.75e+05,2.8e+05]    1
## 41 (2.8e+05,2.85e+05]    1
## 42 (2.85e+05,2.9e+05]    0
## 43 (2.9e+05,2.95e+05]    0
## 44   (2.95e+05,3e+05]    1
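The code that produced this table is hidden in the report. As a minimal sketch of one way to build such a table in base R, the example below assumes right-closed intervals of width 5,000 running from 80,000 to 300,000 (which is what the limits column suggests) and uses simulated prices in place of the real mieszkania data:

```r
# Simulated stand-in for the price variable (NOT the real data)
set.seed(1)
prices <- runif(200, min = 80001, max = 300000)

# Class limits: 44 right-closed intervals of width 5,000
breaks <- seq(80000, 300000, by = 5000)

# cut() assigns each price to an interval; table() counts the cases per interval
freq_table <- as.data.frame(table(limits = cut(prices, breaks = breaks)))

head(freq_table)
```

Naming the argument of table() gives the first column the name limits, and as.data.frame() adds the Freq column, matching the layout shown above.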

TAI

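Now let’s check the tabular accuracy index (TAI) of this grouping. It measures how well the class intervals represent the raw data: values close to 1 mean that little information is lost by grouping.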

##        # classes  Goodness of fit Tabular accuracy 
##       44.0000000        0.9989782        0.9679021
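The report does not show how these values were computed; the three labels look like the output of jenks.tests() from the classInt package, but that is a guess. The sketch below computes the TAI by hand in base R, assuming one common definition (one minus the ratio of within-class to overall absolute deviations from the mean) and simulated prices rather than the real sample:

```r
# Simulated stand-in for the price variable (NOT the real data)
set.seed(1)
prices <- runif(200, min = 80001, max = 300000)

# Same class limits as assumed for the frequency table: 44 intervals of width 5,000
breaks <- seq(80000, 300000, by = 5000)
classes <- cut(prices, breaks = breaks)

# Sum of absolute deviations from the mean
abs_dev <- function(x) sum(abs(x - mean(x)))

# TAI = 1 - (within-class absolute deviations / overall absolute deviations)
# droplevels() removes empty classes, which would otherwise produce NA
tai <- 1 - sum(tapply(prices, droplevels(classes), abs_dev)) / abs_dev(prices)
tai
```

With many narrow classes the within-class deviations are small, so the index is close to 1; with a single class it would be exactly 0.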

Basic plots

In this section we present our data using base R graphics (the graphics package that ships with R). Choose the most appropriate plots according to the measurement scale of the chosen variables. Investigate the heterogeneity of the distribution by presenting the data by groups (e.g. by district or building type). Do not forget about main titles, axis labels and legends. Read more about graphical parameters here.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
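Because the plotting code is hidden, here is a minimal sketch of what such base R plots could look like. The data frame flats is simulated and only borrows the column names and category labels of mieszkania; the plot choices are illustrative, not the ones used in the report:

```r
# Simulated stand-in for the apartment data (NOT the real sample)
set.seed(1)
flats <- data.frame(
  district = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE)),
  building_type = factor(sample(c("kamienica", "niski blok", "wiezowiec"), 200, replace = TRUE)),
  price_EUR = runif(200, min = 80001, max = 300000)
)

# Ratio-scale variable split by a nominal one: one boxplot per district
bp <- boxplot(price_EUR ~ district, data = flats,
              main = "Apartment prices by district",
              xlab = "District", ylab = "Price (EUR)",
              col = c("lightblue", "lightgreen", "lightyellow"))

# Two nominal variables: grouped bar chart of counts with a legend
counts <- table(flats$building_type, flats$district)
barplot(counts, beside = TRUE, legend.text = TRUE,
        main = "Building types by district",
        xlab = "District", ylab = "Number of apartments")
```

The boxplot suits a numeric variable compared across groups, while the bar chart suits counts of categories; legend.text = TRUE builds the legend from the row names of the table.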

ggplot2 plots

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
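The message above appears when geom_histogram() is called without an explicit bin specification. A minimal sketch of a histogram that sets the bin width itself, using simulated prices (not the real sample) and an assumed width of 5,000:

```r
library(ggplot2)

# Simulated stand-in for the price variable (NOT the real data)
set.seed(1)
flats <- data.frame(price_EUR = runif(200, min = 80001, max = 300000))

# An explicit binwidth replaces the default of 30 bins and avoids the message;
# boundary aligns the bin edges with multiples of 5,000 starting at 80,000
p <- ggplot(flats, aes(x = price_EUR)) +
  geom_histogram(binwidth = 5000, boundary = 80000,
                 fill = "steelblue", colour = "white") +
  labs(title = "Distribution of apartment prices",
       x = "Price (EUR)", y = "Number of apartments")
p
```

Choosing the same width as the frequency table keeps the plot and the table comparable; a wider bin would give a smoother but coarser picture.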

Using facets

Faceting generates small multiples each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different. Read more about facets here.
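A minimal sketch of faceting with facet_wrap(), again on simulated data that only reuses the district labels from the report; the bin width of 10,000 is an arbitrary choice for illustration:

```r
library(ggplot2)

# Simulated stand-in for the apartment data (NOT the real sample)
set.seed(1)
flats <- data.frame(
  district = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE)),
  price_EUR = runif(200, min = 80001, max = 300000)
)

# facet_wrap() draws one panel per district; by default all panels share the same axes
p_facets <- ggplot(flats, aes(x = price_EUR)) +
  geom_histogram(binwidth = 10000, fill = "steelblue", colour = "white") +
  facet_wrap(~ district, ncol = 1) +
  labs(title = "Apartment prices by district",
       x = "Price (EUR)", y = "Number of apartments")
p_facets
```

Stacking the panels in one column (ncol = 1) lines up the price axis, which makes shifts in location between districts easier to see.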

Descriptive statistics #1

Before reporting the full summary table of descriptive statistics automatically, your goal this time is to measure the central tendency of the distribution of prices. Compare the mean, median and mode, together with positional measures (quantiles), by district and building type or by number of rooms per apartment.

## `summarise()` has grouped output by 'district'. You can override using the `.groups` argument.
## # A tibble: 9 x 5
## # Groups:   district [3]
##   district    building_type    mean  median   mode
##   <fct>       <fct>           <dbl>   <dbl>  <dbl>
## 1 Biskupin    kamienica     190804. 193098. 186732
## 2 Biskupin    niski blok    206699. 210291  214462
## 3 Biskupin    wiezowiec     174651. 174516. 128121
## 4 Krzyki      kamienica     170530. 185346   98926
## 5 Krzyki      niski blok    182189. 172727  158609
## 6 Krzyki      wiezowiec     156823. 145588  226298
## 7 Srodmiescie kamienica     166809. 155046  120514
## 8 Srodmiescie niski blok    182155. 182110. 245846
## 9 Srodmiescie wiezowiec     162065. 160854. 118333
##    0% 
## 83280
##      25% 
## 143304.2
##    50% 
## 174935
##      75% 
## 208740.5
##   100% 
## 295762
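The output above has the shape of a grouped dplyr summary followed by separate quantile() calls. A minimal sketch of how such results could be obtained, on simulated data; the helper stat_mode() is hypothetical (R has no built-in function for the statistical mode), and the report does not say which definition of the mode it used:

```r
library(dplyr)

# Simulated stand-in for the apartment data (NOT the real sample)
set.seed(1)
flats <- data.frame(
  district = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE)),
  building_type = factor(sample(c("kamienica", "niski blok", "wiezowiec"), 200, replace = TRUE)),
  price_EUR = round(runif(200, min = 80001, max = 300000))
)

# Hypothetical helper: most frequent value; ties go to the value that appears first
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Central tendency by district and building type;
# .groups = "drop" returns an ungrouped result and avoids the grouping message
central <- flats %>%
  group_by(district, building_type) %>%
  summarise(mean = mean(price_EUR),
            median = median(price_EUR),
            mode = stat_mode(price_EUR),
            .groups = "drop")
central

# Positional measures for the whole sample: minimum, quartiles, maximum
quantile(flats$price_EUR, probs = c(0, 0.25, 0.5, 0.75, 1))
```

For a nearly continuous variable such as price, most values occur only once, so a mode computed this way is often just the first observation in the group. Binning the prices first and reporting the modal class is a common alternative.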

Summary tables with ‘kable’

Using the kable() function from the knitr package together with the kableExtra package, we can easily create summary tables that embed graphics and/or statistics.

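[Table: one row per number of rooms (1 to 4), with columns boxplot, histogram, line1, line2 and points1; the inline graphics in these cells are not reproduced in this text version.]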

Finally, we summarize basic descriptive measures of prices (location, spread, skewness and kurtosis) by district, building type and number of rooms using kable. You can customize your final report. See some hints here.

Prices by district:

Statistic   Biskupin (N = 65)  Krzyki (N = 79)  Srodmiescie (N = 56)
min                 519652.00        359769.00             448196.00
max                1277691.00       1090444.00            1062054.00
mean                818614.06        726507.20             739339.70
Q0                  519652.00        359769.00             448196.00
Q025                676751.00        600180.50             592287.75
Q050                817736.00        716726.00             727477.50
Q075                926474.00        876306.50             870752.50
Q1                 1277691.00       1090444.00            1062054.00
IQR                 249723.00        276126.00             278464.75
SD                  175597.94        195015.45             171428.11
Skewness                 0.34             0.07                  0.11
Kurtosis                -0.29            -0.98                 -1.13

Prices by building type:

Statistic   kamienica (N = 61)  niski blok (N = 63)  wiezowiec (N = 76)
min                  415834.00            496390.00           359769.00
max                 1230848.00           1277691.00          1090444.00
mean                 770332.52            815576.63           705728.87
Q0                   415834.00            496390.00           359769.00
Q025                 647756.00            692925.50           555798.25
Q050                 800693.00            807895.00           678704.00
Q075                 896186.00            939852.50           870752.50
Q1                  1230848.00           1277691.00          1090444.00
IQR                  248430.00            246927.00           314954.25
SD                   184388.21            176390.35           182503.32
Skewness                  0.00                 0.22                0.21
Kurtosis                 -0.61                -0.45               -0.97

Prices by number of rooms:

Statistic   1 (N = 44)  2 (N = 50)  3 (N = 58)  4 (N = 48)
min          359769.00   590286.00   632770.00   736669.00
max          657146.00   888634.00   965829.00  1277691.00
mean         515518.05   683567.70   833706.02   974809.96
Q0           359769.00   590286.00   632770.00   736669.00
Q025         479684.75   634757.25   769683.75   909371.50
Q050         520507.00   677260.00   846303.50   964338.50
Q075         555024.75   717728.50   901078.75  1050976.75
Q1           657146.00   888634.00   965829.00  1277691.00
IQR           75340.00    82971.25   131395.00   141605.25
SD            66951.03    65072.66    86943.90   113819.21
Skewness         -0.20        0.80       -0.42        0.33
Kurtosis         -0.38        0.48       -0.83        0.05
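
The code behind these summary tables is not shown in the report. As a minimal sketch, a table of this kind could be assembled in base R and rendered with kable() from knitr. The data are simulated, the helper price_summary() is hypothetical, and skewness and kurtosis are left out because the report does not say which package or definition it used for them:

```r
library(knitr)

# Simulated stand-in for the apartment data (NOT the real sample)
set.seed(1)
flats <- data.frame(
  district = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE)),
  price_PLN = runif(200, min = 350000, max = 1300000)
)

# Hypothetical helper: location and spread measures for one group of prices
price_summary <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), names = FALSE)
  c(min = min(x), max = max(x), mean = mean(x),
    Q0 = q[1], Q025 = q[2], Q050 = q[3], Q075 = q[4], Q1 = q[5],
    IQR = IQR(x), SD = sd(x))
}

# One column of statistics per district, with the group size in the column header
groups <- split(flats$price_PLN, flats$district)
tab <- sapply(groups, price_summary)
colnames(tab) <- paste0(names(groups), " (N = ", lengths(groups), ")")

# Render the matrix as a table with two decimal places
summary_table <- kable(tab, digits = 2)
summary_table
```

Replacing district with building_type or rooms in split() gives the other two tables; styling functions from kableExtra can then be applied to the kable() result.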