DS Lab #1

Monika Sokołowska

2021-03-31

Data wrangling

As you can see, not all variables have the appropriate format yet. We need to convert each variable to the type that matches its measurement scale and its intended use.


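```r
# Nominal variables become unordered factors
mieszkania$district <- as.factor(mieszkania$district)
mieszkania$building_type <- as.factor(mieszkania$building_type)
# The number of rooms is ordinal, so it becomes an ordered factor
mieszkania$rooms <- factor(mieszkania$rooms, ordered = TRUE)
# Uncomment only if the price columns were read as text
# mieszkania$price_PLN <- as.numeric(mieszkania$price_PLN)
# mieszkania$price_EUR <- as.numeric(mieszkania$price_EUR)
# Attach last, so the attached copy reflects the conversions above
attach(mieszkania)
```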

Frequency table

##                 limits Freq
## 1    (3.5e+05,4.5e+05]    9
## 2    (4.5e+05,5.5e+05]   21
## 3    (5.5e+05,6.5e+05]   33
## 4    (6.5e+05,7.5e+05]   36
## 5    (7.5e+05,8.5e+05]   31
## 6    (8.5e+05,9.5e+05]   36
## 7   (9.5e+05,1.05e+06]   21
## 8  (1.05e+06,1.15e+06]   10
## 9  (1.15e+06,1.25e+06]    2
## 10 (1.25e+06,1.35e+06]    1
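
A frequency table with these class limits could be built roughly as follows. This is only a sketch: the prices are simulated stand-ins (the vector `price_PLN` below is not the lab data), and the break points are assumed from the limits printed above.

```r
# Hypothetical stand-in for mieszkania$price_PLN (simulated, not the lab data)
set.seed(1)
price_PLN <- round(runif(200, min = 360000, max = 1340000))

# Class limits every 100,000 PLN, from 350,000 to 1,350,000
breaks <- seq(350000, 1350000, by = 100000)

# cut() assigns each price to a right-closed interval (a, b]
classes <- cut(price_PLN, breaks = breaks, dig.lab = 7)

# Tabulate the classes; naming the argument gives the "limits" column
freq_table <- as.data.frame(table(limits = classes))
print(freq_table)
```

The `dig.lab` argument only changes how the interval labels are printed; with the default the labels appear in scientific notation, as in the table above.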

TAI

##        # classes  Goodness of fit Tabular accuracy 
##        8.0000000        0.9601404        0.8351219
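
The two indices above can be illustrated in base R. The sketch below assumes the usual definitions: goodness of (variance) fit compares squared deviations from class means with squared deviations from the overall mean, and the tabular accuracy index (TAI) does the same with absolute deviations. The data are simulated and the equal-width classification into 8 classes is an assumption; the report does not say which classification method produced the values above.

```r
# Hypothetical prices (simulated; not the lab data)
set.seed(1)
price_PLN <- round(runif(200, min = 360000, max = 1340000))

# Eight equal-width classes (the classification method is an assumption)
k <- 8
breaks <- seq(min(price_PLN), max(price_PLN), length.out = k + 1)
cls <- cut(price_PLN, breaks = breaks, include.lowest = TRUE)

class_mean <- ave(price_PLN, cls)   # mean of the class each price falls in
overall_mean <- mean(price_PLN)

# Goodness of variance fit: based on squared deviations
gvf <- 1 - sum((price_PLN - class_mean)^2) / sum((price_PLN - overall_mean)^2)

# Tabular accuracy index: based on absolute deviations
tai <- 1 - sum(abs(price_PLN - class_mean)) / sum(abs(price_PLN - overall_mean))

print(c(classes = k, goodness_of_fit = gvf, tabular_accuracy = tai))
```

Both indices approach 1 when the classes are internally homogeneous, so values such as 0.96 and 0.84 suggest that 8 classes already describe the prices well.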

Basic plots

In this section we present our data using base R graphics (the graphics functions installed with R). Choose the most appropriate plots according to the measurement scale of the chosen variables. Investigate the heterogeneity of the distribution by presenting the data by groups (e.g. by district, building type etc.). Do not forget about main titles, axis labels and a legend.
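
As an illustration of such a plot, a grouped box plot in base R might look like the sketch below. The data frame `flats` is a simulated stand-in for `mieszkania`, and the colours and titles are arbitrary choices.

```r
# Hypothetical stand-in for the mieszkania data frame (simulated values)
set.seed(1)
flats <- data.frame(
  district  = factor(rep(c("Biskupin", "Krzyki", "Srodmiescie"),
                         times = c(65, 79, 56))),
  price_PLN = round(runif(200, min = 360000, max = 1280000))
)
cols <- c("lightblue", "lightgreen", "lightpink")

# Box plot of price by district, with a main title and axis labels
bp <- boxplot(price_PLN ~ district, data = flats,
              main = "Flat prices by district",
              xlab = "District", ylab = "Price (PLN)",
              col = cols)

# A legend matters most when colour encodes a second variable;
# here it simply repeats the district names
legend("topright", legend = levels(flats$district), fill = cols, bty = "n")
```

A box plot suits a ratio-scale variable such as price split by a nominal variable such as district; for two factors, a bar plot of counts would be the more natural choice.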

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

The highest prices are found in the Biskupin district and the lowest in Krzyki. Śródmieście and Krzyki have flats in a similar price range, while the price range in Biskupin is noticeably higher.

ggplot2 plots

The most expensive flats are in tenement houses in Biskupin and the least expensive are in skyscrapers in Krzyki. Prices do not differ much for low blocks of flats.

Using facets

Faceting generates small multiples each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different. Read more about facets here.
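
A minimal faceting sketch with ggplot2 is shown below, assuming the package is installed. The data frame `flats` and the linear relation between size and price are simulated for illustration only.

```r
library(ggplot2)

# Hypothetical stand-in data (simulated; not the lab data)
set.seed(1)
flats <- data.frame(
  district      = sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE),
  building_type = sample(c("kamienica", "niski blok", "wiezowiec"), 200, replace = TRUE),
  size          = round(runif(200, min = 20, max = 90), 1)
)
flats$price_PLN <- round(250000 + 9000 * flats$size + rnorm(200, sd = 80000))

# One panel per district; colour distinguishes building types
p <- ggplot(flats, aes(x = size, y = price_PLN, colour = building_type)) +
  geom_point() +
  facet_wrap(~ district) +
  labs(title = "Price against size, by district",
       x = "Size (m^2)", y = "Price (PLN)", colour = "Building type")
print(p)
```

By default `facet_wrap()` keeps the same axis scales in every panel, which makes the districts directly comparable.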

Descriptive statistics #1

Before automatically reporting the full summary table of descriptive statistics, this time your goal is to measure the central tendency of the distribution of prices. Compare mean, median and mode together with positional measures - quantiles - by districts and building types or no. of rooms per apartment.

## [1] 760035
## [1] 755719.5
## [1] 186099.8
## [1] 34633125960
## [1] 0.2448568
## [1] 282686.5
## [1] 0.1870314
## [1] 359769
## [1] 1277691
##        0%       10%       25%       50%       75%       95%      100% 
##  359769.0  518806.8  619073.8  755719.5  901760.2 1054250.8 1277691.0
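
The values above appear to be, in order: mean, median, standard deviation, variance, coefficient of variation, interquartile range, a quartile-based coefficient of variation, minimum, maximum and selected quantiles. The printed ratios are consistent with that reading (for example 186099.8 / 760035 ≈ 0.2449). A base R sketch of these measures follows, using simulated data and a hypothetical helper `stat_mode()`, since base R has no built-in function for the statistical mode.

```r
# Hypothetical stand-in data (simulated; not the lab data)
set.seed(1)
flats <- data.frame(
  district  = factor(rep(c("Biskupin", "Krzyki", "Srodmiescie"),
                         times = c(65, 79, 56))),
  rooms     = sample(1:4, 200, replace = TRUE),
  price_PLN = round(runif(200, min = 360000, max = 1280000))
)

# Mode of a discrete variable: the most frequent value (returned as text)
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

x <- flats$price_PLN
print(c(mean = mean(x), median = median(x), sd = sd(x), variance = var(x),
        cv = sd(x) / mean(x), iqr = IQR(x),
        quartile_cv = (IQR(x) / 2) / median(x),
        min = min(x), max = max(x)))
print(quantile(x, probs = c(0, 0.10, 0.25, 0.50, 0.75, 0.95, 1)))

# The same measures by group
median_by_district <- tapply(flats$price_PLN, flats$district, median)
mode_rooms_by_district <- tapply(flats$rooms, flats$district, stat_mode)
print(median_by_district)
print(mode_rooms_by_district)
```

The mode is computed for the number of rooms rather than for price, because a continuous variable rarely repeats exact values.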

Summary tables with ‘kable’

Using kable (from the knitr package) and the kableExtra package, we can easily create summary tables that include graphics and/or statistics.

[Table with inline graphics: one row per number of rooms (1, 2, 3, 4), with columns boxplot, histogram, line1, line2 and points1.]

Now we summarize the basic central tendency measures for prices by district and building type using the kable packages. You can customize your final report. See some hints here.
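
A minimal sketch of such a summary with `knitr::kable()` is given below, assuming knitr is installed. The data are simulated, and the choice of statistics and column names is illustrative; the table that follows was not necessarily produced this way.

```r
library(knitr)

# Hypothetical stand-in data (simulated; not the lab data)
set.seed(1)
flats <- data.frame(
  district  = factor(rep(c("Biskupin", "Krzyki", "Srodmiescie"),
                         times = c(65, 79, 56))),
  price_PLN = round(runif(200, min = 360000, max = 1280000))
)

# Mean, median and SD of price for each district
summary_tab <- aggregate(price_PLN ~ district, data = flats,
                         FUN = function(v) c(Mean = mean(v),
                                             Median = median(v),
                                             SD = sd(v)))
# aggregate() returns the statistics as a matrix column; flatten it
summary_tab <- do.call(data.frame, summary_tab)

tab <- kable(summary_tab, digits = 0,
             format.args = list(big.mark = ","),
             col.names = c("District", "Mean", "Median", "SD"),
             caption = "Price (PLN) by district")
print(tab)
```

The kableExtra package can then style such a table further, for example with striped rows or grouped headers.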

| Characteristic | Biskupin, N = 65 | Krzyki, N = 79 | Srodmiescie, N = 56 |
| --- | --- | --- | --- |
| price_PLN | | | |
| Median | 817,736.00 | 716,726.00 | 727,477.50 |
| Mean | 818,614.06 | 726,507.20 | 739,339.70 |
| Maximum | 1,277,691.00 | 1,090,444.00 | 1,062,054.00 |
| Minimum | 519,652.00 | 359,769.00 | 448,196.00 |
| SD | 175,597.94 | 195,015.45 | 171,428.11 |
| skew | 0.34 | 0.07 | 0.11 |
| kurtosi | -0.29 | -0.98 | -1.13 |
| IQR | 249,723.00 | 276,126.00 | 278,464.75 |
| price_EUR | | | |
| Median | 189,291.00 | 165,909.00 | 168,397.50 |
| Mean | 189,494.00 | 168,173.00 | 171,143.43 |
| Maximum | 295,762.00 | 252,418.00 | 245,846.00 |
| Minimum | 120,290.00 | 83,280.00 | 103,749.00 |
| SD | 40,647.70 | 45,142.40 | 39,682.41 |
| skew | 0.34 | 0.07 | 0.11 |
| kurtosi | -0.29 | -0.98 | -1.13 |
| IQR | 57,807.00 | 63,918.00 | 64,459.50 |
| rooms | | | |
| 1 | 12 (18%) | 18 (23%) | 14 (25%) |
| 2 | 16 (25%) | 19 (24%) | 15 (27%) |
| 3 | 24 (37%) | 18 (23%) | 16 (29%) |
| 4 | 13 (20%) | 24 (30%) | 11 (20%) |
| size | | | |
| Median (IQR) | 45.10 (35.20, 61.20) | 44.10 (31.60, 63.00) | 42.95 (27.68, 59.77) |
| building_type | | | |
| kamienica | 26 (40%) | 21 (27%) | 14 (25%) |
| niski blok | 17 (26%) | 24 (30%) | 22 (39%) |
| wiezowiec | 22 (34%) | 34 (43%) | 20 (36%) |

Table summary:

The table is split by three districts of Wrocław: Biskupin, Krzyki and Śródmieście, with 65, 79 and 56 flats respectively. The median price is highest in Biskupin (817,736.00 PLN) and lowest in Krzyki (716,726.00 PLN). The mean follows the same pattern: it is highest in Biskupin (818,614.06 PLN) and lowest in Krzyki (726,507.20 PLN). The most expensive flat is in Biskupin (1,277,691.00 PLN), while the cheapest is in Krzyki (359,769.00 PLN). The standard deviation is largest in Krzyki (195,015.45 PLN). Prices can therefore be summarized as mean ± standard deviation (in PLN):
  • Biskupin: 818,614.06 ± 175,597.94
  • Krzyki: 726,507.20 ± 195,015.45
  • Śródmieście: 739,339.70 ± 171,428.11
The skewness of prices is positive in every district, which means the right tail of the distribution is longer. The highest skewness is observed for Biskupin (0.34), while in Krzyki (0.07) and Śródmieście (0.11) the distributions are close to symmetric.

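The kurtosis, on the other hand, is negative in all three districts. Assuming the reported value is excess kurtosis (kurtosis minus 3, which is what the kurtosi function in the psych package returns), the distributions are flatter than the normal distribution and have lighter tails. The lowest kurtosis is found for Śródmieście (-1.13).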
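
To make the interpretation of both measures concrete, here is a small base R sketch of moment-based skewness and excess kurtosis. The helper names `skewness_m()` and `excess_kurtosis_m()` are hypothetical, and the psych package applies a slightly different small-sample scaling by default, so its results will not match these exactly.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second
skewness_m <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over the squared second, minus 3
# (3 is the kurtosis of the normal distribution)
excess_kurtosis_m <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)       # evenly spread, no long tail
right_skewed <- c(1, 1, 2, 2, 3, 10)   # one large value stretches the right tail

print(skewness_m(symmetric))           # zero up to rounding: symmetric data
print(skewness_m(right_skewed))        # positive: longer right tail
print(excess_kurtosis_m(symmetric))    # negative: flatter than normal
```

Positive skewness together with negative excess kurtosis, as in the table above, describes a fairly flat distribution with a slightly longer right tail.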

In Biskupin and Śródmieście three-room flats are the most common, while in Krzyki four-room flats dominate.
The median flat size is largest in Biskupin (45.10 m^2) and smallest in Śródmieście (42.95 m^2).
The most common building type for each district is:
  • for Biskupin: tenement house
  • for Krzyki: skyscraper
  • for Śródmieście: low block of flats