---
title: 'DS Lab #1'
author: "Monika Dyczewska"
date: "2021-03-31"
output:
  rmdformats::downcute:
    self_contained: true
    thumbnails: false
    lightbox: true
    gallery: true
    highlight: pygments
---

Data wrangling

As you can see, not all variables have the right type yet. We need to convert each variable to a type that matches its measurement scale and the way we will use it later.

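# Nominal variables become unordered factors
mieszkania$district <- as.factor(mieszkania$district)
mieszkania$building_type <- as.factor(mieszkania$building_type)
# The number of rooms is ordinal, so it becomes an ordered factor
mieszkania$rooms <- factor(mieszkania$rooms, ordered = TRUE)
# Prices are measured on a ratio scale and must be numeric
mieszkania$price_PLN <- as.numeric(mieszkania$price_PLN)
mieszkania$price_EUR <- as.numeric(mieszkania$price_EUR)
# Attach only after all conversions: attach() copies the data frame as it is now
attach(mieszkania)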

Frequency table

##                 limits Freq
## 1    (3.5e+05,4.5e+05]    9
## 2    (4.5e+05,5.5e+05]   21
## 3    (5.5e+05,6.5e+05]   33
## 4    (6.5e+05,7.5e+05]   36
## 5    (7.5e+05,8.5e+05]   31
## 6    (8.5e+05,9.5e+05]   36
## 7   (9.5e+05,1.05e+06]   21
## 8  (1.05e+06,1.15e+06]   10
## 9  (1.15e+06,1.25e+06]    2
## 10 (1.25e+06,1.35e+06]    1
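
A frequency table with the same layout (class limits plus counts) can be sketched with `cut()` and `table()`. The code that produced the output above is not shown, so this is only one plausible way to get it. The prices below are synthetic stand-ins, and the names `breaks`, `classes` and `freq_table` are assumptions.

```r
# Synthetic stand-in for mieszkania$price_PLN (hypothetical data, not the lab's)
set.seed(123)
price_PLN <- round(runif(200, min = 360000, max = 1340000))

# Class limits every 100,000 PLN, matching the limits in the table above
breaks <- seq(350000, 1350000, by = 100000)

# cut() assigns each price to a right-closed interval (a, b];
# dig.lab = 7 asks for the limits in full rather than in scientific notation
classes <- cut(price_PLN, breaks = breaks, dig.lab = 7)

# table() counts the observations per class; as.data.frame() gives two columns
freq_table <- as.data.frame(table(classes))
names(freq_table) <- c("limits", "Freq")
print(freq_table)
```

Raising `dig.lab` is a presentation choice: the limits then read as plain numbers instead of values such as `3.5e+05`.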

TAI

Basic plots

In this section we present our data using base graphics (the ones pre-installed with R). Choose the most appropriate plot for the measurement scale of each chosen variable. Investigate the heterogeneity of the distribution by presenting the data in groups (e.g. by district or building type). Do not forget about main titles, axis labels and a legend.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
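
As a minimal sketch of the base-graphics task described in this section, the code below draws a boxplot for a ratio-scale variable split by a nominal one, and a grouped bar plot for an ordinal variable. The data frame `flats` is a synthetic stand-in for `mieszkania`. Its values, the colours and the titles are all assumptions.

```r
# Hypothetical stand-in for the mieszkania data frame (synthetic values)
set.seed(42)
flats <- data.frame(
  price_PLN = round(rnorm(200, mean = 760000, sd = 186000)),
  district  = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE)),
  rooms     = factor(sample(1:4, 200, replace = TRUE), ordered = TRUE)
)

# Ratio-scale variable split by a nominal one: side-by-side boxplots
bp <- boxplot(price_PLN ~ district, data = flats,
              main = "Apartment prices by district",
              xlab = "District", ylab = "Price (PLN)",
              col = c("lightblue", "lightgreen", "lightsalmon"))

# Ordinal variable by district: grouped bar plot with a legend
counts <- table(flats$rooms, flats$district)
barplot(counts, beside = TRUE,
        main = "Number of rooms by district",
        xlab = "District", ylab = "Number of apartments",
        col = gray.colors(4),
        legend.text = paste(rownames(counts), "room(s)"),
        args.legend = list(x = "topright", title = "Rooms"))
```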

## ggplot2 plots

## Using facets

Faceting generates small multiples each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different. Read more about facets here.
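
To make the idea of small multiples concrete, here is a short sketch that uses `facet_wrap()` to draw one histogram panel per district. It assumes the ggplot2 package is installed. The data frame `flats` is synthetic, and the bin count and colours are arbitrary choices.

```r
library(ggplot2)

# Hypothetical stand-in for the mieszkania data frame (synthetic values)
set.seed(1)
flats <- data.frame(
  price_PLN = round(rnorm(200, mean = 760000, sd = 186000)),
  district  = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE))
)

# One histogram panel per district; all panels share the same axes,
# which makes the distributions directly comparable
p <- ggplot(flats, aes(x = price_PLN)) +
  geom_histogram(bins = 15, fill = "steelblue", colour = "white") +
  facet_wrap(~ district) +
  labs(title = "Apartment prices by district",
       x = "Price (PLN)", y = "Number of apartments")
print(p)
```

Replacing `facet_wrap(~ district)` with `facet_grid(rooms ~ district)` would arrange the panels by two grouping variables, provided the data frame has a `rooms` column.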

## Descriptive statistics #1

Before reporting the full summary table of descriptive statistics, your goal this time is to measure the central tendency of the price distribution. Compare the mean, median and mode, together with positional measures (quantiles), by district and building type or by number of rooms per apartment.

## The following objects are masked from mieszkania (pos = 3):
## 
##     building_type, district, price_EUR, price_PLN, rooms, size
## [1] 760035
## [1] 175934
## [1] 186099.8
## [1] 43078.62
## [1] 34633125960
## [1] 1855767906
## [1] 0.2448568
## [1] 0.2448567
## [1] 282686.5
## [1] 65436.25
## [1] 0.1870314
## [1] 359769
## [1] 83280
## [1] 1277691
## [1] 295762
##        0%        5%       25%       50%       75%       96%      100% 
##  359769.0  477175.4  619073.8  755719.5  901760.2 1062580.6 1277691.0
##       0%       5%      25%      50%      75%      96%     100% 
##  83280.0 110457.8 143304.2 174935.0 208740.5 245967.9 295762.0
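
The values above are printed without labels. Judging by how they relate to each other, they appear to be the mean, standard deviation, variance, coefficient of variation, interquartile range, minimum, maximum and quantiles, each for price_PLN and price_EUR. The sketch below computes the same kinds of measures with names attached and adds a by-district comparison. The data are synthetic, and the helper `stat_mode` and the object names are assumptions.

```r
# Hypothetical stand-in for mieszkania (synthetic values, not the lab data)
set.seed(2021)
flats <- data.frame(
  price_PLN = round(rnorm(200, mean = 760000, sd = 186000)),
  district  = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE))
)

# Base R has no function for the statistical mode; this helper returns the
# most frequent value (the first one encountered in case of ties)
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

x <- flats$price_PLN
overall <- c(
  mean   = mean(x),
  median = median(x),
  mode   = stat_mode(round(x, -5)),  # mode of prices rounded to 100,000 PLN
  sd     = sd(x),
  var    = var(x),
  cv     = sd(x) / mean(x),          # coefficient of variation
  iqr    = IQR(x),
  min    = min(x),
  max    = max(x)
)
print(round(overall, 3))

# Positional measures
print(quantile(x, probs = c(0, 0.05, 0.25, 0.5, 0.75, 0.95, 1)))

# Central tendency by district
by_district <- aggregate(price_PLN ~ district, data = flats,
                         FUN = function(v) c(mean = mean(v), median = median(v)))
print(by_district)
```

For a continuous variable such as price, the mode of the raw values is rarely informative, which is why the sketch rounds prices before counting them.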
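[Table with inline plots (columns: rooms, boxplot, histogram, line1, line2, points1), one row for each number of rooms from 1 to 4; the plot images are not reproduced here.]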

OK, now we will finally summarize the basic measures of central tendency for prices by district and building type using the kable packages. You can customize your final report. See some hints here.

## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## The following object is masked from 'package:qwraps2':
## 
##     logit
## Loading required package: xtable

| Characteristic | N = 200¹ |
|:---|:---|
| price_PLN | 755,720 (619,074, 901,760) |
| price_EUR | 174,935 (143,304, 208,740) |
| rooms | |
| 1 | 44 (22%) |
| 2 | 50 (25%) |
| 3 | 58 (29%) |
| 4 | 48 (24%) |
| size | 44 (31, 61) |
| district | |
| Biskupin | 65 (32%) |
| Krzyki | 79 (40%) |
| Srodmiescie | 56 (28%) |
| building_type | |
| kamienica | 61 (30%) |
| niski blok | 63 (32%) |
| wiezowiec | 76 (38%) |

¹ Median (IQR); n (%)

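
A grouped summary of central tendency rendered with `kable()` could look roughly like the sketch below. It assumes the knitr package is installed. The data frame `flats` is synthetic, and the column names and caption are illustrative choices, not taken from the lab.

```r
library(knitr)

# Hypothetical stand-in for mieszkania (synthetic values, not the lab data)
set.seed(7)
flats <- data.frame(
  price_PLN = round(rnorm(200, mean = 760000, sd = 186000)),
  district  = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE))
)

# Mean and median price per district, collected in one data frame
central <- aggregate(price_PLN ~ district, data = flats, FUN = mean)
central$median <- aggregate(price_PLN ~ district, data = flats, FUN = median)$price_PLN

# kable() turns the data frame into a table; digits and big.mark control
# how the numbers are printed
tab <- kable(central, format = "pipe", digits = 0,
             col.names = c("District", "Mean price (PLN)", "Median price (PLN)"),
             format.args = list(big.mark = ","),
             caption = "Central tendency of prices by district")
print(tab)
```

Swapping `district` for `building_type` or `rooms` in the two `aggregate()` calls gives the other groupings, provided the data frame has those columns.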
##      Price in PLN.min      Price in PLN.max    Price in PLN.media 
##             359769.00            1277691.00             755719.50 
##     Price in PLN.mean   Price in PLN.q1.25%   Price in PLN.q3.75% 
##             760035.03             619073.75             901760.25 
##       Price in PLN.sd     Price in PLN.Var% Price in PLN.skewness 
##             186099.77                  0.24                  0.11 
## Price in PLN.kurtosis 
##                 -0.61
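
A named summary vector like the one above can be assembled in base R. The sketch below is not the code behind that output. It uses synthetic prices, and the function name `describe_price` is an assumption. Skewness and excess kurtosis are computed as standardized moments, so packages that use other estimators may return slightly different values.

```r
# Synthetic stand-in for mieszkania$price_PLN (hypothetical data)
set.seed(99)
price_PLN <- round(rnorm(200, mean = 760000, sd = 186000))

# Returns location, dispersion and shape measures as one named vector
describe_price <- function(x) {
  m <- mean(x)
  s <- sd(x)
  z <- (x - m) / s                     # standardized values
  c(
    min      = min(x),
    max      = max(x),
    median   = median(x),
    mean     = m,
    q1       = unname(quantile(x, 0.25)),
    q3       = unname(quantile(x, 0.75)),
    sd       = s,
    cv       = s / m,                  # coefficient of variation
    skewness = mean(z^3),              # third standardized moment
    kurtosis = mean(z^4) - 3           # excess kurtosis (0 for a normal curve)
  )
}

print(round(describe_price(price_PLN), 2))
```

Calling `tapply(flats$price_PLN, flats$district, describe_price)` on a data frame with a `district` column would give one such vector per group.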