DS Lab #1

Data wrangling

As you can see, not all of our variables have the appropriate type yet. We need to set the type of each variable according to its measurement scale and intended use.

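# Nominal variables become unordered factors
mieszkania$district <- as.factor(mieszkania$district)
mieszkania$building_type <- as.factor(mieszkania$building_type)
# The number of rooms is ordinal, so it becomes an ordered factor
mieszkania$rooms <- factor(mieszkania$rooms, ordered = TRUE)
# Make the columns of the data frame accessible by name
attach(mieszkania)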

Frequency table

##             limits Freq Percentage_of_all
## 1   (80000,100000]    6               3.0
## 2  (100000,120000]   14               7.0
## 3  (120000,140000]   25              12.5
## 4  (140000,160000]   34              17.0
## 5  (160000,180000]   27              13.5
## 6  (180000,200000]   27              13.5
## 7  (200000,220000]   34              17.0
## 8  (220000,240000]   18               9.0
## 9  (240000,260000]   12               6.0
## 10 (260000,280000]    1               0.5
## 11 (280000,300000]    2               1.0
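
A frequency table with this layout can be built in base R with cut() and table(). The following is a minimal sketch, not necessarily the code that produced the table above: the price vector is simulated, and the class limits (20 000-wide intervals from 80 000 to 300 000) are read off the printed output.

```r
# Simulated stand-in for the price column (assumption: the real data are not shown)
set.seed(1)
price <- round(runif(200, min = 80001, max = 300000))

# Class limits: 20 000-wide intervals, open on the left and closed on the right
breaks <- seq(80000, 300000, by = 20000)
classes <- cut(price, breaks = breaks, dig.lab = 6)

# Counts per class and their share of all observations
freq <- as.data.frame(table(classes))
names(freq) <- c("limits", "Freq")
freq$Percentage_of_all <- round(100 * freq$Freq / sum(freq$Freq), 1)
print(freq)
```

The argument dig.lab = 6 asks cut() to write the limits with up to six digits instead of switching to scientific notation.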

TAI (tabular accuracy index)

Measure	Value
Number of classes	11
Goodness of fit	0.9760388
Tabular accuracy	0.8799735
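
Both indices compare the variation left inside the classes with the total variation, so values close to 1 mean that the grouping loses little information. The sketch below uses one common definition (squared deviations from class means for goodness of fit, absolute deviations from class means for tabular accuracy). The row labels resemble the output of jenks.tests() from the classInt package, but that is a guess, and the data here are simulated.

```r
# Simulated stand-in for the price column (assumption)
set.seed(1)
price <- round(runif(200, min = 80001, max = 300000))
breaks <- seq(80000, 300000, by = 20000)
cls <- droplevels(cut(price, breaks = breaks))  # drop empty classes, if any

# Sums of squared and of absolute deviations from the mean
sum_sq  <- function(x) sum((x - mean(x))^2)
sum_abs <- function(x) sum(abs(x - mean(x)))

# Goodness of variance fit: 1 - within-class SS / total SS
gvf <- 1 - sum(tapply(price, cls, sum_sq)) / sum_sq(price)

# Tabular accuracy index: the same idea with absolute deviations
tai <- 1 - sum(tapply(price, cls, sum_abs)) / sum_abs(price)

print(c("# classes" = nlevels(cls),
        "Goodness of fit" = gvf,
        "Tabular accuracy" = tai))
```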

Basic plots

In this section we present our data using base R graphics (those pre-installed with R). Choose the most appropriate plots according to the measurement scale of the chosen variables. Investigate the heterogeneity of the distribution by presenting the data by groups (e.g. by district or building type). Do not forget main titles, axis labels and a legend. Read more about graphical parameters here.
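
As an illustration of these guidelines, the sketch below draws a boxplot of a numeric variable by group and a stacked bar chart of counts with a legend. The data frame flats and its price column are simulated for this example; only the building type labels are taken from the report.

```r
# Simulated data (assumption): 120 flats with a price and a building type
set.seed(2)
flats <- data.frame(
  price = round(rnorm(120, mean = 180000, sd = 40000)),
  building_type = factor(sample(c("kamienica", "niski blok", "wiezowiec"),
                                120, replace = TRUE))
)
cols <- c("grey70", "steelblue", "tomato")

# Numeric variable split by a nominal one: boxplot with title and axis labels
boxplot(price ~ building_type, data = flats, col = cols,
        main = "Price by building type",
        xlab = "Building type", ylab = "Price (EUR)")

# Counts per price class, stacked by building type, with a legend
tab <- table(flats$building_type, cut(flats$price, breaks = 5, dig.lab = 6))
barplot(tab, col = cols,
        main = "Number of flats by price class",
        xlab = "Price class (EUR)", ylab = "Number of flats",
        legend.text = rownames(tab),
        args.legend = list(x = "topright", title = "Building type"))
```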

Conclusions of the plot

Apartments are most commonly priced between 150 000 and 200 000 EUR, and within this interval they are spread evenly across building types.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Conclusions of the plot

In every district, low apartment blocks reach the highest prices at the upper end of the range. Surprisingly, Krzyki is the only district where the median price of tenement houses is higher than that of low apartment blocks.

ggplot2 plots

Conclusions of the plot

As we can see, in every district except Srodmiescie, small one-room flats make up the largest share. In the city center, however, medium flats with 2 or 3 rooms account for the same percentage.

Conclusions of the plot

In this plot, we can observe that in almost every case the most expensive flats are in the Biskupin district. There is only one exception: skyscraper flats with the largest surface, where all prices seem to be quite similar.

Using facets

Faceting generates small multiples each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different. Read more about facets here.
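
A minimal faceting sketch with ggplot2 follows. The data are simulated, and the column names price and surface are assumptions made for this example; facet_wrap() draws one panel per building type so the price and surface pattern can be compared across panels.

```r
library(ggplot2)

# Simulated data (assumption): surface, district, building type and a price
set.seed(3)
flats <- data.frame(
  surface = runif(150, min = 17, max = 88),
  district = sample(c("Biskupin", "Krzyki", "Srodmiescie"), 150, replace = TRUE),
  building_type = sample(c("kamienica", "niski blok", "wiezowiec"),
                         150, replace = TRUE)
)
flats$price <- round(60000 + 2500 * flats$surface + rnorm(150, sd = 20000))

# One panel per building type, points coloured by district
p <- ggplot(flats, aes(x = surface, y = price, colour = district)) +
  geom_point() +
  facet_wrap(~ building_type) +
  labs(title = "Price against surface by building type",
       x = "Surface", y = "Price (EUR)", colour = "District")
print(p)
```

Replacing facet_wrap(~ building_type) with facet_grid(district ~ building_type) would give one panel per combination of district and building type.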

Descriptive statistics #1

Before generating the full summary table of descriptive statistics automatically, measure the central tendency of the price distribution. Compare the mean, median and mode, together with positional measures (quantiles), by district and building type or by number of rooms per apartment.

Mean
district	kamienica (tenement house)	niski blok (low apartment block)	wiezowiec (skyscraper)
Biskupin	190804.3	206699.1	174650.5
Krzyki	170530.2	182189.1	156823.4
Srodmiescie	166809.4	182154.7	162064.8

Median
district	kamienica	niski blok	wiezowiec
Biskupin	193098.5	210291.0	174515.5
Krzyki	185346.0	172727.0	145588.0
Srodmiescie	155046.0	182110.5	160853.5

Mode
district	kamienica	niski blok	wiezowiec
Biskupin	186732	214462	128121
Krzyki	98926	158609	226298
Srodmiescie	120514	245846	118333
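
Cross-tables of this kind can be produced with tapply() over two grouping variables. The sketch below is one possible approach on simulated data, with price as an assumed column name. Base R has no function for the statistical mode, so stat_mode() is a small helper written for this example; for a nearly continuous variable such as price, most values occur only once, so the mode should be read with caution.

```r
# Simulated data (assumption)
set.seed(4)
flats <- data.frame(
  price = round(rnorm(180, mean = 180000, sd = 40000)),
  district = sample(c("Biskupin", "Krzyki", "Srodmiescie"), 180, replace = TRUE),
  building_type = sample(c("kamienica", "niski blok", "wiezowiec"),
                         180, replace = TRUE)
)

# Helper: most frequent value; ties go to the value that appears first
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# District x building type tables of mean, median and mode
groups <- list(flats$district, flats$building_type)
mean_tab   <- tapply(flats$price, groups, mean)
median_tab <- tapply(flats$price, groups, median)
mode_tab   <- tapply(flats$price, groups, stat_mode)

# Positional measures: quartiles of price per district
quartiles <- aggregate(price ~ district, data = flats,
                       FUN = quantile, probs = c(0.25, 0.5, 0.75))

print(round(mean_tab, 1))
print(median_tab)
print(quartiles)
```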

Conclusions

In every district, low apartment blocks have the highest mean price. Despite this, the median price of tenement houses in Krzyki (185346.0) is higher than that of low apartment blocks (172727.0). The conclusion is that in Krzyki a larger share of tenement houses (data samples) than of low apartment blocks falls in the upper price range; the tenement mean (170530.2) lies below the median, so a few cheap tenement houses pull the mean down.

Summary tables with ‘kable’

Using the kable function (from the knitr package) and the kableExtra package, we can easily create summary tables with graphics and/or statistics.

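[Table: one row per number of rooms (1 to 4), with inline graphics in the columns boxplot, histogram, line1, line2 and points1; the graphics are not reproduced here.]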
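
A plain statistical summary table can be rendered with knitr::kable() alone. The sketch below is a minimal example on simulated data, with price as an assumed column name; it does not reproduce the inline graphics, which need kableExtra.

```r
library(knitr)

# Simulated data (assumption): number of rooms and price
set.seed(5)
flats <- data.frame(
  rooms = sample(1:4, 100, replace = TRUE),
  price = round(rnorm(100, mean = 180000, sd = 40000))
)

# One row per number of rooms with basic price statistics
summary_tab <- data.frame(
  rooms = sort(unique(flats$rooms)),
  min   = as.vector(tapply(flats$price, flats$rooms, min)),
  mean  = as.vector(tapply(flats$price, flats$rooms, mean)),
  max   = as.vector(tapply(flats$price, flats$rooms, max))
)

# Render as a pipe table; digits = 0 rounds the mean to whole EUR
tbl <- kable(summary_tab, format = "pipe", digits = 0,
             caption = "Price (EUR) by number of rooms")
print(tbl)
```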

Finally, we summarize basic central tendency measures for prices by district and building type using kable and kableExtra. You can customize your final report. See some hints here.

	Biskupin (N = 65)	Krzyki (N = 79)	Srodmiescie (N = 56)
Prices in EUR
min	120290	83280	103749
max	295762	252418	245846
mean ± sd	189,494.00 ± 40,647.70	168,173.00 ± 45,142.40	171,143.43 ± 39,682.41
Shape of the price distribution
Skewness	0.335	0.072	0.106
Kurtosis	-0.294	-0.981	-1.126
Surface
min	17.1	17.4	17
max	87.7	86.6	83.3
mean ± sd	47.05 ± 19.57	46.86 ± 20.95	44.27 ± 19.63
Shape of the surface distribution
Skewness	0.190	0.169	0.297
Kurtosis	-0.778	-1.071	-1.059
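
The kurtosis values reported in the table are negative, which is only possible for excess kurtosis (raw kurtosis minus 3), because raw kurtosis is never below 1. The sketch below computes moment-based skewness and excess kurtosis in base R. The exact estimator behind the table is not stated, so small differences from package implementations are to be expected; the data are simulated and price is an assumed column name.

```r
# Moment-based skewness: third central moment over sd cubed
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over variance squared, minus 3
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Simulated data (assumption)
set.seed(6)
flats <- data.frame(
  price = round(rnorm(200, mean = 175000, sd = 42000)),
  district = sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE)
)

# Shape measures of price per district
shape <- rbind(
  Skewness = tapply(flats$price, flats$district, skewness),
  Kurtosis = tapply(flats$price, flats$district, excess_kurtosis)
)
print(round(shape, 3))
```

Skewness near 0 indicates a roughly symmetric distribution, and negative excess kurtosis indicates lighter tails than those of a normal distribution.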