Data wrangling
As you can see, not all variables are stored in an appropriate format. We need to convert each variable to a type that matches its measurement scale and intended use: nominal variables become factors, the number of rooms becomes an ordered factor, and prices become numeric.
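# Nominal variables become factors
mieszkania$district <- as.factor(mieszkania$district)
mieszkania$building_type <- as.factor(mieszkania$building_type)
# Number of rooms is ordinal, so use an ordered factor
mieszkania$rooms <- factor(mieszkania$rooms, ordered = TRUE)
# Prices are quantitative
mieszkania$price_PLN <- as.numeric(mieszkania$price_PLN)
mieszkania$price_EUR <- as.numeric(mieszkania$price_EUR)
# attach() copies the data frame as it is at this moment, so call it only after all conversions
attach(mieszkania)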
Frequency table and TAI
In the first stage of our analysis we are going to group our data into a simple frequency table.
First, let’s take a look at the distribution of apartment prices in our sample and verify the accuracy of the grouping using the tabular accuracy index (TAI):
##        # classes  Goodness of fit Tabular accuracy
##       11.0000000        0.9817075        0.8630187
As we can see, the TAI is quite high: a value of about 0.86 means that we can accept the proposed design of the frequency table with 11 classes.
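The code that produced the output above is hidden in this report, so the following is only a minimal base R sketch of the idea. It assumes simulated prices in place of `mieszkania$price_EUR`, equal-width classes (the breaks actually used are not shown), and the usual definition of TAI as one minus the ratio of absolute deviations around class means to absolute deviations around the overall mean.

```r
# Simulated prices standing in for mieszkania$price_EUR (assumption, not the real data)
set.seed(123)
price <- round(rnorm(200, mean = 176000, sd = 43000))

# Equal-width classes (assumption); 11 mirrors the number of classes reported above
n_classes <- 11
breaks <- seq(min(price), max(price), length.out = n_classes + 1)
classes <- cut(price, breaks = breaks, include.lowest = TRUE, dig.lab = 7)

# Simple frequency table: class, count and relative frequency
freq <- table(classes)
freq_table <- data.frame(
  class = names(freq),
  n     = as.vector(freq),
  share = round(as.vector(freq) / length(price), 3)
)
print(freq_table)

# TAI = 1 - (absolute deviations around class means) / (absolute deviations around overall mean)
# Empty classes yield NA in tapply(), hence na.rm = TRUE
abs_dev_within <- sum(tapply(price, classes, function(x) sum(abs(x - mean(x)))), na.rm = TRUE)
abs_dev_total  <- sum(abs(price - mean(price)))
tai <- 1 - abs_dev_within / abs_dev_total
print(tai)
```

A TAI close to 1 indicates that the class means represent the individual observations well; adding more classes generally pushes the value up, so it should be read together with the number of classes.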
Basic plots
In this section we should present our data using base R graphics (the graphics package that ships with R). Choose the most appropriate plots according to the measurement scale of the chosen variables. Investigate the heterogeneity of the distribution by presenting the data by groups (e.g. by district or building type). Do not forget about main titles, axis labels and legends. Read more about graphical parameters here.
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
## Loading required package: sm
## Package 'sm', version 2.2-5.6: type help(sm) for summary information
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
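Since the plotting code is hidden by echo = FALSE, here is a small sketch of what a grouped base R plot with a title, axis labels and a legend might look like. The data frame `sim` is simulated and only borrows the column names and district labels of `mieszkania`; the colours are arbitrary.

```r
# Simulated stand-in for the mieszkania data (assumption, not the real sample)
set.seed(123)
sim <- data.frame(
  district  = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE)),
  price_EUR = round(rnorm(200, mean = 176000, sd = 43000))
)

cols <- c("steelblue", "tomato", "darkseagreen")

# Boxplot of prices by district with a main title and axis labels
bp <- boxplot(price_EUR ~ district, data = sim,
              main = "Apartment prices by district (simulated data)",
              xlab = "District", ylab = "Price (EUR)",
              col = cols)

# Legend mapping fill colours to districts
legend("topright", legend = levels(sim$district), fill = cols, title = "District")
```

A boxplot suits a quantitative variable split by a nominal one; for a single qualitative variable a bar plot (`barplot(table(...))`) would be the more natural choice.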
ggplot2 plots
In this section we will present the same plots but with the use of ggplot2 and ggpubr packages.
ggplot2 can show the average value of each group using the stat_summary() function, so there is no need to calculate group means before plotting.
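As an illustration, the sketch below overlays group means on a boxplot with stat_summary(). It assumes simulated data with the same column names as `mieszkania` and a ggplot2 version of at least 3.3.0, where the argument is called `fun` (older versions used `fun.y`).

```r
library(ggplot2)

# Simulated stand-in for the mieszkania data (assumption, not the real sample)
set.seed(123)
sim <- data.frame(
  district  = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE)),
  price_EUR = round(rnorm(200, mean = 176000, sd = 43000))
)

p <- ggplot(sim, aes(x = district, y = price_EUR, fill = district)) +
  geom_boxplot(alpha = 0.6) +
  # stat_summary() computes the mean of each group on the fly
  stat_summary(fun = mean, geom = "point", shape = 23, size = 3, fill = "white") +
  labs(title = "Apartment prices by district (simulated data)",
       x = "District", y = "Price (EUR)", fill = "District")

print(p)
```

The white diamonds mark the group means, while the line inside each box marks the median, so the two can be compared at a glance.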
Using facets
Faceting generates small multiples each showing a different subset of the data. Small multiples are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different. Read more about facets here.
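A minimal faceting sketch, again on simulated data that only mimics the structure of `mieszkania`: one histogram of prices per district, produced with facet_wrap().

```r
library(ggplot2)

# Simulated stand-in for the mieszkania data (assumption, not the real sample)
set.seed(123)
sim <- data.frame(
  district  = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE)),
  price_EUR = round(rnorm(200, mean = 176000, sd = 43000))
)

p_facets <- ggplot(sim, aes(x = price_EUR)) +
  geom_histogram(bins = 15, fill = "steelblue", colour = "white") +
  # One panel per district; all panels share the same axes by default
  facet_wrap(~ district) +
  labs(title = "Distribution of prices by district (simulated data)",
       x = "Price (EUR)", y = "Number of apartments")

print(p_facets)
```

Shared axes make the panels directly comparable; facet_grid() could be used instead when splitting by two variables at once, for example district and building type.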
Descriptive statistics #1
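Before automatically reporting the full summary table of descriptive statistics, this time your goal is to measure the central tendency of the distribution of prices. Compare the mean, median and mode together with positional measures (quantiles) by district and by building type or number of rooms per apartment. The outputs below refer to price_EUR for the whole sample and show, in order: the mean, the median, the standard deviation, the variance, the coefficient of variation (standard deviation divided by the mean), the IQR, the quartile deviation (half the IQR) divided by the median, the minimum, the maximum, and selected quantiles.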
## [1] 175934
## [1] 174935
## [1] 43078.62
## [1] 1855767906
## [1] 0.2448567
## [1] 65436.25
## [1] 0.1870302
## [1] 83280
## [1] 295762
## 0% 10% 25% 50% 75% 95% 100%
## 83280.0 120094.3 143304.2 174935.0 208740.5 244039.9 295762.0
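The measures listed above can be computed with base R functions. The sketch below uses a simulated price vector rather than the real `price_EUR`, so its numbers will differ from the output shown; the density-based mode is only one possible estimate for a continuous variable.

```r
# Simulated prices standing in for mieszkania$price_EUR (assumption, not the real data)
set.seed(123)
x <- round(rnorm(200, mean = 176000, sd = 43000))

m    <- mean(x)                      # arithmetic mean
med  <- median(x)                    # median
s    <- sd(x)                        # standard deviation
v    <- var(x)                       # variance
cv   <- s / m                        # coefficient of variation
iqr  <- IQR(x)                       # interquartile range
qcv  <- (iqr / 2) / med              # quartile deviation relative to the median
rng  <- range(x)                     # minimum and maximum
q    <- quantile(x, probs = c(0, 0.10, 0.25, 0.50, 0.75, 0.95, 1))

# Rough mode estimate: the location of the peak of a kernel density estimate
d        <- density(x)
mode_est <- d$x[which.max(d$y)]

print(c(mean = m, median = med, mode = mode_est, sd = s, cv = cv, iqr = iqr, qcv = qcv))
print(q)
```

To repeat any of these by group, a call such as `tapply(price_EUR, district, median)` or `aggregate(price_EUR ~ district, data = mieszkania, FUN = median)` could be used on the real data.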
Summary tables with ‘kable’
Using the kable() function from knitr together with the kableExtra package, we can easily create summary tables with graphics and/or statistics.
[Table: one row per number of rooms (1 to 4), with inline graphics in the columns boxplot, histogram, line1, line2 and points1]
Now we will finally summarize the basic measures of central tendency, dispersion and shape for prices (in EUR) by district, using kable(). The same can be done by building type. You can customize your final report. See some hints here.
Statistic | Biskupin | Krzyki | Śródmieście |
---|---|---|---|
Min | 120290.00 | 83280.00 | 103749.00 |
Max | 295762.00 | 252418.00 | 245846.00 |
Q1 | 156655.00 | 138930.50 | 137103.50 |
Median | 189291.00 | 165909.00 | 168397.50 |
Q3 | 214462.00 | 202848.50 | 201563.00 |
Mean | 189494.00 | 168173.00 | 171143.43 |
Sd | 40647.70 | 45142.40 | 39682.41 |
IQR | 57807.00 | 63918.00 | 64459.50 |
Quartile deviation (IQR/2) | 28903.50 | 31959.00 | 32229.75 |
Coefficient of variation (Sd/Mean) | 0.21 | 0.27 | 0.23 |
IQR/Median | 0.31 | 0.39 | 0.38 |
Skewness | 0.34 | 0.07 | 0.11 |
Kurtosis | -0.29 | -0.98 | -1.13 |
As we can see, the most expensive district is Biskupin: it has the highest minimum and maximum prices. Its mean and median are also the highest, so it is the most expensive district not merely because of outliers. Moreover, it has the smallest IQR, which means that the middle 50% of its prices are the least spread out. Biskupin also has the largest positive skewness (0.34), which indicates a longer right tail: apartments that are relatively cheap by Biskupin standards are more common, while a few much more expensive ones stretch the upper end of the distribution. Kurtosis is negative in all three districts, which indicates lighter tails than in a normal distribution; the value for Biskupin (-0.29) is the closest to zero, so its price distribution is the closest to normal in this respect.
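A table of this kind might be built by computing the statistics per district and passing the result to knitr::kable(). The sketch below is an assumption about the approach, not the hidden code of this report: it uses simulated data and covers only the position and dispersion measures.

```r
# Simulated stand-in for the mieszkania data (assumption, not the real sample)
set.seed(123)
sim <- data.frame(
  district  = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE)),
  price_EUR = round(rnorm(200, mean = 176000, sd = 43000))
)

# One column per district, one row per statistic
stats_by_district <- sapply(split(sim$price_EUR, sim$district), function(x) {
  c(Min    = min(x),
    Max    = max(x),
    Q1     = unname(quantile(x, 0.25)),
    Median = median(x),
    Q3     = unname(quantile(x, 0.75)),
    Mean   = mean(x),
    Sd     = sd(x),
    IQR    = IQR(x))
})

# Render as a table; digits = 2 matches the rounding used in the report
print(knitr::kable(stats_by_district, digits = 2))
```

The result of kable() can then be piped into kableExtra styling functions to adjust the appearance of the final report.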
Descriptive statistics - summary table for quantitative variable using reporttools
We can also produce summary tables in LaTeX format (rendered only in PDF reports). Let’s see the summary table by district for both prices: in PLN and in EUR.
## % latex table generated in R 4.0.4 by xtable 1.8-4 package
## % Thu Apr 01 13:26:37 2021
## \begin{table}[ht]
## \centering
## \begingroup\footnotesize
## \begin{tabular}{llrrrrrrrrrr}
## \textbf{Variable} & \textbf{Levels} & $\mathbf{n}$ & \textbf{Min} & $\mathbf{q_1}$ & $\mathbf{\widetilde{x}}$ & $\mathbf{\bar{x}}$ & $\mathbf{q_3}$ & \textbf{Max} & $\mathbf{s}$ & \textbf{IQR} & \textbf{\#NA} \\
## \hline
## price\_PLN & Biskupin & 65 & 519652 & 676751.0 & 817736.0 & 818614.1 & 926474.0 & 1277691 & 175597.9 & 249723.0 & 0 \\
## & Krzyki & 79 & 359769 & 600180.5 & 716726.0 & 726507.2 & 876306.5 & 1090444 & 195015.5 & 276126.0 & 0 \\
## & Srodmiescie & 56 & 448196 & 592287.8 & 727477.5 & 739339.7 & 870752.5 & 1062054 & 171428.1 & 278464.8 & 0 \\
## \hline
## & all & 200 & 359769 & 619073.8 & 755719.5 & 760035.0 & 901760.2 & 1277691 & 186099.8 & 282686.5 & 0 \\
## \hline
## price\_EUR & Biskupin & 65 & 120290 & 156655.0 & 189291.0 & 189494.0 & 214462.0 & 295762 & 40647.7 & 57807.0 & 0 \\
## & Krzyki & 79 & 83280 & 138930.5 & 165909.0 & 168173.0 & 202848.5 & 252418 & 45142.4 & 63918.0 & 0 \\
## & Srodmiescie & 56 & 103749 & 137103.5 & 168397.5 & 171143.4 & 201563.0 & 245846 & 39682.4 & 64459.5 & 0 \\
## \hline
## & all & 200 & 83280 & 143304.2 & 174935.0 & 175934.0 & 208740.5 & 295762 & 43078.6 & 65436.2 & 0 \\
## \hline
## \end{tabular}
## \endgroup
## \caption{}
## \label{}
## \end{table}
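LaTeX output of this shape is what tableContinuous() from reporttools prints with its default statistics. The call below is a hedged sketch: the data are simulated, the 4.32 PLN/EUR rate is only a rough illustrative figure, and the call is skipped when reporttools is not installed. In an R Markdown chunk it would normally be combined with the chunk option results = "asis" and PDF output.

```r
# Simulated stand-in for the mieszkania data (assumption, not the real sample)
set.seed(123)
sim <- data.frame(
  district  = factor(sample(c("Biskupin", "Krzyki", "Srodmiescie"), 200, replace = TRUE)),
  price_EUR = round(rnorm(200, mean = 176000, sd = 43000))
)
sim$price_PLN <- round(sim$price_EUR * 4.32)  # illustrative exchange rate (assumption)

if (requireNamespace("reporttools", quietly = TRUE)) {
  library(reporttools)
  # vars: data frame of quantitative variables; group: factor defining the rows
  tableContinuous(vars = sim[, c("price_PLN", "price_EUR")],
                  group = sim$district,
                  prec = 1, cap = "", lab = "",
                  longtable = FALSE)
}
```

Setting longtable = FALSE yields a regular table environment like the one shown above; the default uses the longtable environment, which requires the corresponding LaTeX package.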
Contingency tables using reporttools
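We can also easily construct contingency tables in LaTeX. Let’s print the summary table of districts and building types by number of rooms per apartment. Percentages are computed within each number-of-rooms column, and the third column of each block gives the cumulative percentage: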
## % latex table generated in R 4.0.4 by xtable 1.8-4 package
## % Thu Apr 01 13:26:37 2021
## \begin{table}[ht]
## \centering
## \begingroup\footnotesize
## \begin{tabular}{ll|rrr|rrr|rrr|rrr|rrr}
## \textbf{Variable} & \textbf{Levels} & $\mathbf{n_{\mathrm{1}}}$ & $\mathbf{\%_{\mathrm{1}}}$ & $\mathbf{\sum \%_{\mathrm{1}}}$ & $\mathbf{n_{\mathrm{2}}}$ & $\mathbf{\%_{\mathrm{2}}}$ & $\mathbf{\sum \%_{\mathrm{2}}}$ & $\mathbf{n_{\mathrm{3}}}$ & $\mathbf{\%_{\mathrm{3}}}$ & $\mathbf{\sum \%_{\mathrm{3}}}$ & $\mathbf{n_{\mathrm{4}}}$ & $\mathbf{\%_{\mathrm{4}}}$ & $\mathbf{\sum \%_{\mathrm{4}}}$ & $\mathbf{n_{\mathrm{all}}}$ & $\mathbf{\%_{\mathrm{all}}}$ & $\mathbf{\sum \%_{\mathrm{all}}}$ \\
## \hline
## district & Biskupin & 12 & 27.3 & 27.3 & 16 & 32.0 & 32.0 & 24 & 41.4 & 41.4 & 13 & 27.1 & 27.1 & 65 & 32.5 & 32.5 \\
## & Krzyki & 18 & 40.9 & 68.2 & 19 & 38.0 & 70.0 & 18 & 31.0 & 72.4 & 24 & 50.0 & 77.1 & 79 & 39.5 & 72.0 \\
## & Srodmiescie & 14 & 31.8 & 100.0 & 15 & 30.0 & 100.0 & 16 & 27.6 & 100.0 & 11 & 22.9 & 100.0 & 56 & 28.0 & 100.0 \\
## \hline
## & all & 44 & 100.0 & & 50 & 100.0 & & 58 & 100.0 & & 48 & 100.0 & & 200 & 100.0 & \\
## \hline
## \hline
## building\_type & kamienica & 11 & 25.0 & 25.0 & 17 & 34.0 & 34.0 & 17 & 29.3 & 29.3 & 16 & 33.3 & 33.3 & 61 & 30.5 & 30.5 \\
## & niski blok & 9 & 20.4 & 45.5 & 18 & 36.0 & 70.0 & 20 & 34.5 & 63.8 & 16 & 33.3 & 66.7 & 63 & 31.5 & 62.0 \\
## & wiezowiec & 24 & 54.5 & 100.0 & 15 & 30.0 & 100.0 & 21 & 36.2 & 100.0 & 16 & 33.3 & 100.0 & 76 & 38.0 & 100.0 \\
## \hline
## & all & 44 & 100.0 & & 50 & 100.0 & & 58 & 100.0 & & 48 & 100.0 & & 200 & 100.0 & \\
## \hline
## \hline
## \end{tabular}
## \endgroup
## \caption{}
## \label{}
## \end{table}
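The column percentages in the district block can be cross-checked in base R. The sketch below re-enters the counts of apartments by district and number of rooms from the table above and recomputes the percentages and cumulative percentages with prop.table(); on the real data the counts would come from `table(mieszkania$district, mieszkania$rooms)`.

```r
# Counts copied from the district block of the table above (rows: district, columns: rooms)
counts <- matrix(c(12, 16, 24, 13,
                   18, 19, 18, 24,
                   14, 15, 16, 11),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(district = c("Biskupin", "Krzyki", "Srodmiescie"),
                                 rooms    = c("1", "2", "3", "4")))
tab <- as.table(counts)

# Counts with row and column totals
print(addmargins(tab))

# Percentages within each number-of-rooms column
col_pct <- round(100 * prop.table(tab, margin = 2), 1)
print(col_pct)

# Cumulative percentages down each column
cum_pct <- round(apply(100 * prop.table(tab, margin = 2), 2, cumsum), 1)
print(cum_pct)
```

For example, 12 of the 44 one-room apartments are in Biskupin, which gives the 27.3% reported in the table. The LaTeX version itself is the kind of output produced by tableNominal() from reporttools.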