In our example this week, we are going to use the fake data - about real estates in Wroclaw - prices by districts, size of apartments and many more.
As you can see, not all formats of our variables are adapted. We need to prepare appropriate formats of our variables according to their measurement scale and future application.
head(apartments)
## price_PLN price_EUR rooms size district building_type
## 1 977608 226298 4 76.4 Krzyki wiezowiec
## 2 553484 128121 1 19.4 Biskupin wiezowiec
## 3 511200 118333 1 21.5 Srodmiescie wiezowiec
## 4 872917 202064 4 65.2 Srodmiescie wiezowiec
## 5 562909 130303 1 21.4 Krzyki wiezowiec
## 6 674965 156242 2 35.8 Biskupin wiezowiec
apartments$district<-as.factor(apartments$district)
apartments$building_type<-as.factor(apartments$building_type)
apartments$rooms<-factor(apartments$rooms,ordered=TRUE)
attach(apartments)
apartments$price_PLN<-as.numeric(apartments$price_PLN)
apartments$price_EUR<-as.numeric(apartments$price_EUR)
In the first step of our analysis, we will group our data into a simple frequency table.
First, let’s look at the distribution of housing prices in our sample and verify tabular validity using the TAI measure:
Ok, it looks quite ugly, so let’s wrap it up using the ‘kable’ package:
|
## # classes Goodness of fit Tabular accuracy
## 10.0000000 0.9780872 0.8508467
As we can see - the TAI index is quite high. 0.85 means that we can accept the proposed construction of the frequency table.
In this section, we should represent our data using basic (pre-installed in R) graphics. Select the most appropriate graphs depending on the scale of the selected variables. Explore the heterogeneity of the distribution by presenting the data by group (e.g., by neighborhood, building type, etc.). Don’t forget about main titles, labels and legends. Read more about graphical parameters here.
Note that the echo = FALSE parameter has been added to
the code snippet to prevent printing the R code that generated the
graph.
Now, let’s use the ggplot2 and ggpubr libraries to plot.
Ggplot2 allows you to show the average value for each group using the stat_summary() function. You no longer need to calculate average values before creating a graph!
Faceting generates small multiples, each showing a different subset of the data. They are a powerful tool for exploratory data analysis: you can quickly compare patterns in different parts of the data and see if they are the same or different. Read more here.
Before automatically reporting the full summary table of descriptive statistics, this time your goal is to measure the central tendency of the price distribution. Compare the mean, median, and mode along with positional measures - quantiles - by district and building type or number of rooms in the apartment.
mean(price_PLN)
## [1] 760035
median(price_PLN)
## [1] 755719.5
sd(price_PLN) #standard deviation
## [1] 186099.8
var(price_PLN) #variance
## [1] 34633125960
coeff_var<-sd(price_PLN)/mean(price_PLN) #coefficient of variability %
coeff_var
## [1] 0.2448568
IQR(price_PLN)# difference between quartiles =Q3-Q1
## 75%
## 282686.5
sx<-IQR(price_PLN)/2 #interquartile deviation
coeff_varx<-sx/median(price_PLN) #IQR coefficient of variability %
coeff_varx
## 75%
## 0.1870314
min(price_PLN)
## [1] 359769
max(price_PLN)
## [1] 1277691
quantile(price_PLN,probs=c(0,0.1,0.25,0.5,0.75,0.95,1),na.rm=TRUE)
## 0% 10% 25% 50% 75% 95% 100%
## 359769.0 518806.8 619073.8 755719.5 901760.2 1054250.8 1277691.0
Ok, we have calculated all of the basic summary statistics above. Let’s wrap them up together now.
| rooms | boxplot | histogram | line1 | line2 | points1 |
|---|---|---|---|---|---|
| 1 | |||||
| 2 | |||||
| 3 | |||||
| 4 |
Ok, now we will finally summarize the basic measures of central tendency for prices by district/building type using the ‘kable’ package. Feel free to customize your final report. See some hints here.
We can calculate easily descriptive statistics also using gtsummary package:
apartments %>%
select(price_PLN,rooms) %>%
tbl_summary(label= price_PLN ~ "Price",digits=c(price_PLN)~2,by=rooms,type = all_continuous() ~ "continuous2", statistic = all_continuous() ~ c("{N_nonmiss}", "{median} ({p25}, {p75})", "{min}, {max}"),missing = "no")
| Characteristic | 1, N = 44 | 2, N = 50 | 3, N = 58 | 4, N = 48 |
|---|---|---|---|---|
| Price | ||||
| N | 44.00 | 50.00 | 58.00 | 48.00 |
| Median (IQR) | 520,507.00 (479,684.75, 555,024.75) | 677,260.00 (634,757.25, 717,728.50) | 846,303.50 (769,683.75, 901,078.75) | 964,338.50 (909,371.50, 1,050,976.75) |
| Range | 359,769.00, 657,146.00 | 590,286.00, 888,634.00 | 632,770.00, 965,829.00 | 736,669.00, 1,277,691.00 |
dfSummary() creates a summary table with statistics, frequencies and graphs for all variables in a data frame. The information displayed is type-specific (character, factor, numeric, date) and also varies according to the number of distinct values.
When using dfSummary() in R Markdown documents, it is generally a good idea to exclude a column or two to avoid margin overflow. Since the Valid and Missing columns are redundant, we can drop either one of them.
dfSummary(apartments,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.75,
valid.col = FALSE,
tmp.img.dir = "/tmp")
Dimensions: 200 x 6
Duplicates: 0
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
|---|---|---|---|---|---|
| 1 | price_PLN [numeric] |
Mean (sd) : 760035 (186099.8) min < med < max: 359769 < 755719.5 < 1277691 IQR (CV) : 282686.5 (0.2) |
200 distinct values | 0 (0.0%) |
|
| 2 | price_EUR [numeric] |
Mean (sd) : 175934 (43078.6) min < med < max: 83280 < 174935 < 295762 IQR (CV) : 65436.2 (0.2) |
200 distinct values | 0 (0.0%) |
|
| 3 | rooms [ordered, factor] |
1. 1 2. 2 3. 3 4. 4 |
44 (22.0%) 50 (25.0%) 58 (29.0%) 48 (24.0%) |
0 (0.0%) |
|
| 4 | size [numeric] |
Mean (sd) : 46.2 (20.1) min < med < max: 17 < 43.7 < 87.7 IQR (CV) : 30.2 (0.4) |
162 distinct values | 0 (0.0%) |
|
| 5 | district [factor] |
1. Biskupin 2. Krzyki 3. Srodmiescie |
65 (32.5%) 79 (39.5%) 56 (28.0%) |
0 (0.0%) |
|
| 6 | building_type [factor] |
1. kamienica 2. niski blok 3. wiezowiec |
61 (30.5%) 63 (31.5%) 76 (38.0%) |
0 (0.0%) |
To produce optimal results, summarytools has its own version of the base by() function. It’s called stby(), and we use it exactly as we would by():
(stats_by_rooms <- stby(data = apartments, INDICES = apartments$rooms, FUN = descr, stats = "common", transpose = TRUE))
## Non-numerical variable(s) ignored: rooms, district, building_type
Descriptive Statistics
apartments
Group: rooms = 1
N: 44
| Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid | |
|---|---|---|---|---|---|---|---|
| price_EUR | 119332.95 | 15497.90 | 83280.00 | 120488.00 | 152117.00 | 44.00 | 100.00 |
| price_PLN | 515518.05 | 66951.03 | 359769.00 | 520507.00 | 657146.00 | 44.00 | 100.00 |
| size | 19.28 | 1.46 | 17.00 | 19.10 | 21.90 | 44.00 | 100.00 |
Group: rooms = 2
N: 50
| Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid | |
|---|---|---|---|---|---|---|---|
| price_EUR | 158233.22 | 15063.13 | 136640.00 | 156773.00 | 205702.00 | 50.00 | 100.00 |
| price_PLN | 683567.70 | 65072.66 | 590286.00 | 677260.00 | 888634.00 | 50.00 | 100.00 |
| size | 36.80 | 4.46 | 29.60 | 35.95 | 43.70 | 50.00 | 100.00 |
Group: rooms = 3
N: 58
| Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid | |
|---|---|---|---|---|---|---|---|
| price_EUR | 192987.55 | 20125.88 | 146475.00 | 195904.00 | 223572.00 | 58.00 | 100.00 |
| price_PLN | 833706.02 | 86943.90 | 632770.00 | 846303.50 | 965829.00 | 58.00 | 100.00 |
| size | 53.33 | 7.21 | 41.20 | 53.45 | 65.20 | 58.00 | 100.00 |
Group: rooms = 4
N: 48
| Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid | |
|---|---|---|---|---|---|---|---|
| price_EUR | 225650.42 | 26347.03 | 170525.00 | 223226.50 | 295762.00 | 48.00 | 100.00 |
| price_PLN | 974809.96 | 113819.21 | 736669.00 | 964338.50 | 1277691.00 | 48.00 | 100.00 |
| size | 72.05 | 10.18 | 53.30 | 70.85 | 87.70 | 48.00 | 100.00 |
When generating freq() or descr() tables, it is possible to turn the results into “tidy” tables with the use of the tb() function (think of tb as a diminutive for tibble). For example:
apartments %>%
descr(stats = "common") %>%
tb()
## # A tibble: 3 × 8
## variable mean sd min med max n.valid pct.valid
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 price_EUR 175934. 43079. 83280 174935 295762 200 100
## 2 price_PLN 760035. 186100. 359769 755720. 1277691 200 100
## 3 size 46.2 20.1 17 43.7 87.7 200 100
Here are some examples showing how lists created using stby() or group_by() can be transformed into tidy tibbles.
grouped_descr <- stby(data = apartments,INDICES = apartments$rooms, FUN = descr, stats = "common")
grouped_descr %>% tb()
## # A tibble: 12 × 9
## rooms variable mean sd min med max n.valid pct.valid
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 price_EUR 119333. 15498. 83280 120488 1.52e5 44 100
## 2 1 price_PLN 515518. 66951. 359769 520507 6.57e5 44 100
## 3 1 size 19.3 1.46 17 19.1 2.19e1 44 100
## 4 2 price_EUR 158233. 15063. 136640 156773 2.06e5 50 100
## 5 2 price_PLN 683568. 65073. 590286 677260 8.89e5 50 100
## 6 2 size 36.8 4.46 29.6 36.0 4.37e1 50 100
## 7 3 price_EUR 192988. 20126. 146475 195904 2.24e5 58 100
## 8 3 price_PLN 833706. 86944. 632770 846304. 9.66e5 58 100
## 9 3 size 53.3 7.21 41.2 53.4 6.52e1 58 100
## 10 4 price_EUR 225650. 26347. 170525 223226. 2.96e5 48 100
## 11 4 price_PLN 974810. 113819. 736669 964338. 1.28e6 48 100
## 12 4 size 72.0 10.2 53.3 70.8 8.77e1 48 100
stby(data = apartments,
INDICES = apartments$rooms,
FUN = descr,
stats = "fivenum") %>%
tb(order = 3) %>%
kable(format = "html", digits = 2) %>%
collapse_rows(columns = 1, valign = "top")
| variable | rooms | min | q1 | med | q3 | max |
|---|---|---|---|---|---|---|
| price_EUR | 1 | 83280.0 | 110881.0 | 120488.00 | 128568.00 | 152117.0 |
| 2 | 136640.0 | 146754.0 | 156773.00 | 166259.00 | 205702.0 | |
| 3 | 146475.0 | 177478.0 | 195904.00 | 208599.00 | 223572.0 | |
| 4 | 170525.0 | 209827.5 | 223226.50 | 243300.00 | 295762.0 | |
| price_PLN | 1 | 359769.0 | 479005.5 | 520507.00 | 555411.50 | 657146.0 |
| 2 | 590286.0 | 633978.0 | 677260.00 | 718237.00 | 888634.0 | |
| 3 | 632770.0 | 766707.0 | 846303.50 | 901149.00 | 965829.0 | |
| 4 | 736669.0 | 906455.0 | 964338.50 | 1051055.50 | 1277691.0 | |
| size | 1 | 17.0 | 18.1 | 19.10 | 20.60 | 21.9 |
| 2 | 29.6 | 32.9 | 35.95 | 40.50 | 43.7 | |
| 3 | 41.2 | 47.9 | 53.45 | 59.70 | 65.2 | |
| 4 | 53.3 | 64.2 | 70.85 | 82.15 | 87.7 |
Your task this week is to: prepare your own descriptive analysis for the “CreditCard” dataset (AER package). It is a cross-sectional dataframe on the credit history for a sample of applicants for a type of credit card.
Are the yearly incomes (in USD 10,000), credit card expenditures, age, ratio of monthly credit card expenditure to yearly income - significantly different for applicants for customers with different credit risk (“card” variable - factor)?
Prepare a professional data visualizations, descriptive statistics’ tables and interpret them.
# your code here
head(CreditCard)
## card reports age income share expenditure owner selfemp dependents
## 1 yes 0 37.66667 4.5200 0.033269910 124.983300 yes no 3
## 2 yes 0 33.25000 2.4200 0.005216942 9.854167 no no 3
## 3 yes 0 33.66667 4.5000 0.004155556 15.000000 yes no 4
## 4 yes 0 30.50000 2.5400 0.065213780 137.869200 no no 0
## 5 yes 0 32.16667 9.7867 0.067050590 546.503300 yes no 2
## 6 yes 0 23.25000 2.5000 0.044438400 91.996670 no no 0
## months majorcards active
## 1 54 1 12
## 2 34 1 13
## 3 58 1 5
## 4 25 1 7
## 5 64 1 5
## 6 54 1 1
boxplot(CreditCard$age)
mean_age<-mean(CreditCard$age)
CreditCard$age[CreditCard$age < 1] <- mean_age
vis_miss(CreditCard, cluster = TRUE)
CreditCard%>%
miss_case_table()
## # A tibble: 1 × 3
## n_miss_in_case n_cases pct_cases
## <int> <int> <dbl>
## 1 0 1319 100
No missing values are observed, so no cleansing concerning them is needed. As it could be seen from documentation, variable age has some values lower than one, which are a little bit suspicious, boxplot also proves their existence. The best way is to replace them with mean value.
CreditCard$card_num <- ifelse(CreditCard$card == 'yes', 0, 1)
CreditCard$owner_num <- ifelse(CreditCard$owner == 'yes', 1, 0)
CreditCard$selfemp_num <- ifelse(CreditCard$selfemp == 'yes', 1, 0)
CreditCard$credit_risk <- CreditCard$card_num - CreditCard$owner_num - CreditCard$selfemp_num
hist(CreditCard$credit_risk)
library(dplyr)
CreditCard <- mutate(CreditCard,
risk = case_when(
credit_risk == -2 ~ "very low",
credit_risk == -1 ~ "low",
credit_risk == 0 ~ "high",
credit_risk == 1 ~ "very high",
))
CreditCard$risk <- factor(CreditCard$risk)
head(CreditCard)
## card reports age income share expenditure owner selfemp dependents
## 1 yes 0 37.66667 4.5200 0.033269910 124.983300 yes no 3
## 2 yes 0 33.25000 2.4200 0.005216942 9.854167 no no 3
## 3 yes 0 33.66667 4.5000 0.004155556 15.000000 yes no 4
## 4 yes 0 30.50000 2.5400 0.065213780 137.869200 no no 0
## 5 yes 0 32.16667 9.7867 0.067050590 546.503300 yes no 2
## 6 yes 0 23.25000 2.5000 0.044438400 91.996670 no no 0
## months majorcards active card_num owner_num selfemp_num credit_risk risk
## 1 54 1 12 0 1 0 -1 low
## 2 34 1 13 0 0 0 0 high
## 3 58 1 5 0 1 0 -1 low
## 4 25 1 7 0 0 0 0 high
## 5 64 1 5 0 1 0 -1 low
## 6 54 1 1 0 0 0 0 high
Credit risk was calculated for every person by the formula (card-owner-selfemp). Card variable was used to create new numerical variable, as well as owner and selfemp variables.
pie(table(CreditCard$risk))
freq_table <- table(CreditCard$risk)
percentage_table <- prop.table(freq_table) * 100
# Combine the frequency and percentage tables into a single table
result_table <- cbind(freq_table, percentage_table)
result_kable <- kable(result_table, format = "markdown", align = "c", caption = "Table with Frequencies and Percentages")
# Print the result table
print(result_kable)
##
##
## Table: Table with Frequencies and Percentages
##
## | | freq_table | percentage_table |
## |:---------|:----------:|:----------------:|
## |high | 610 | 46.247157 |
## |low | 484 | 36.694465 |
## |very high | 186 | 14.101592 |
## |very low | 39 | 2.956785 |
plot1 <- ggplot(CreditCard, aes(income,risk)) +
geom_abline() +
geom_jitter(width = 0.1, height = 0.1)
plot1
(stats_by <- stby(data = CreditCard, INDICES = CreditCard$risk, FUN = descr, stats = "common", transpose = TRUE))
## Non-numerical variable(s) ignored: card, owner, selfemp, risk
## Descriptive Statistics
## CreditCard
## Group: risk = high
## N: 610
##
## Mean Std.Dev Min Median Max N.Valid Pct.Valid
## ----------------- -------- --------- ------- -------- --------- --------- -----------
## active 6.03 5.76 0.00 5.00 46.00 610.00 100.00
## age 30.66 8.74 18.17 28.17 69.75 610.00 100.00
## card_num 0.17 0.37 0.00 0.00 1.00 610.00 100.00
## credit_risk 0.00 0.00 0.00 0.00 0.00 610.00 100.00
## dependents 0.71 1.09 0.00 0.00 6.00 610.00 100.00
## expenditure 185.58 261.14 0.00 108.50 2291.17 610.00 100.00
## income 2.94 1.24 0.49 2.60 11.00 610.00 100.00
## majorcards 0.83 0.38 0.00 1.00 1.00 610.00 100.00
## months 42.05 54.58 1.00 24.00 528.00 610.00 100.00
## owner_num 0.13 0.34 0.00 0.00 1.00 610.00 100.00
## reports 0.34 1.11 0.00 0.00 14.00 610.00 100.00
## selfemp_num 0.03 0.18 0.00 0.00 1.00 610.00 100.00
## share 0.08 0.11 0.00 0.05 0.91 610.00 100.00
##
## Group: risk = low
## N: 484
##
## Mean Std.Dev Min Median Max N.Valid Pct.Valid
## ----------------- -------- --------- ------- -------- --------- --------- -----------
## active 8.85 6.55 0.00 8.00 44.00 484.00 100.00
## age 37.12 9.90 18.67 35.71 74.17 484.00 100.00
## card_num 0.02 0.13 0.00 0.00 1.00 484.00 100.00
## credit_risk -1.00 0.00 -1.00 -1.00 -1.00 484.00 100.00
## dependents 1.39 1.36 0.00 1.00 6.00 484.00 100.00
## expenditure 256.30 307.62 0.00 158.90 3099.50 484.00 100.00
## income 4.01 1.96 0.21 3.50 13.50 484.00 100.00
## majorcards 0.84 0.37 0.00 1.00 1.00 484.00 100.00
## months 72.98 76.52 0.00 43.00 540.00 484.00 100.00
## owner_num 0.95 0.22 0.00 1.00 1.00 484.00 100.00
## reports 0.18 0.66 0.00 0.00 10.00 484.00 100.00
## selfemp_num 0.07 0.25 0.00 0.00 1.00 484.00 100.00
## share 0.08 0.08 0.00 0.06 0.54 484.00 100.00
##
## Group: risk = very high
## N: 186
##
## Mean Std.Dev Min Median Max N.Valid Pct.Valid
## ----------------- ------- --------- ------- -------- -------- --------- -----------
## active 4.85 6.01 0.00 3.00 33.00 186.00 100.00
## age 31.17 8.74 19.00 29.12 80.17 186.00 100.00
## card_num 1.00 0.00 1.00 1.00 1.00 186.00 100.00
## credit_risk 1.00 0.00 1.00 1.00 1.00 186.00 100.00
## dependents 0.83 1.14 0.00 0.00 6.00 186.00 100.00
## expenditure 0.00 0.00 0.00 0.00 0.00 186.00 100.00
## income 2.81 1.43 1.20 2.49 10.00 186.00 100.00
## majorcards 0.70 0.46 0.00 1.00 1.00 186.00 100.00
## months 47.66 60.73 0.00 28.00 408.00 186.00 100.00
## owner_num 0.00 0.00 0.00 0.00 0.00 186.00 100.00
## reports 1.62 2.46 0.00 0.00 12.00 186.00 100.00
## selfemp_num 0.00 0.00 0.00 0.00 0.00 186.00 100.00
## share 0.00 0.00 0.00 0.00 0.00 186.00 100.00
##
## Group: risk = very low
## N: 39
##
## Mean Std.Dev Min Median Max N.Valid Pct.Valid
## ----------------- -------- --------- ------- -------- -------- --------- -----------
## active 9.28 6.30 0.00 9.00 26.00 39.00 100.00
## age 40.25 12.69 20.83 38.83 83.50 39.00 100.00
## card_num 0.00 0.00 0.00 0.00 0.00 39.00 100.00
## credit_risk -2.00 0.00 -2.00 -2.00 -2.00 39.00 100.00
## dependents 1.26 1.25 0.00 1.00 4.00 39.00 100.00
## expenditure 175.25 205.33 0.00 91.41 747.67 39.00 100.00
## income 4.61 2.22 1.92 4.00 12.00 39.00 100.00
## majorcards 0.90 0.31 0.00 1.00 1.00 39.00 100.00
## months 78.56 67.67 5.00 56.00 288.00 39.00 100.00
## owner_num 1.00 0.00 1.00 1.00 1.00 39.00 100.00
## reports 0.15 0.43 0.00 0.00 2.00 39.00 100.00
## selfemp_num 1.00 0.00 1.00 1.00 1.00 39.00 100.00
## share 0.05 0.05 0.00 0.03 0.20 39.00 100.00
CreditCard <- mutate(CreditCard, risk = fct_relevel(risk,
"very low", "low", "high", "very high"))
ggplot(CreditCard, aes(x=risk, y=age)) +
geom_boxplot(alpha=0.7) +
stat_summary(fun="mean", geom="point", shape=20, size=5, color="red", fill="red")
As we can see, age values are higher for groups with lower risk, but they are very similar for high and very high risk.
ggplot(CreditCard, aes(x=risk, y=income)) +
geom_boxplot(alpha=0.7) +
stat_summary(fun="mean", geom="point", shape=20, size=5, color="red", fill="red")
Similarly to age values, the lower the risk, the higher the income values are, except for high and very high risk - there’s no difference between these two groups.
ggplot(CreditCard, aes(x=risk, y=expenditure)) +
geom_boxplot(alpha=0.7) +
stat_summary(fun="mean", geom="point", shape=20, size=5, color="red", fill="red")
Expenditure levels are the highest for the low risk group, similar in very low and high risk and zero for very high risk (because of the lack of any credit cards in this group).
More adequate would be the ratio of credit card expenditure to income:
ggplot(CreditCard, aes(x=risk, y=(expenditure/income))) +
geom_boxplot(alpha=0.7) +
stat_summary(fun="mean", geom="point", shape=20, size=5, color="red", fill="red")
This ratio is the lowest for the very low risk group. For low and high risk the levels are pretty similar, but high risk group has a bigger spread. The very high risk has zero levels because expenditure levels are zero, as on the previous plot.
As could be observed, mean value of age for low credit risk group is greater, than in high risk groups. In low risk groups values are more dispersed. These values indicate, that older individuals are less likely to have problems with their credit history.
Mean value of income is significantly higher in low-risk groups (4.61 in very low risk against 2.81 in very high), values again are more dispersed in groups with low risk.
Months living at current address also differ - low-risk individuals tend to live longer on their address, which could indicate more stable financial situation.
Number of active accounts in high-risk groups is lower, than in low-risk (4.85 against 9.28).
Overall, given results indicate, that individuals with low credit risk tend to be older, with higher income, better, more stable financial situation and life quality in general. High credit risk individuals are young people with low income and unstable finances.