In our example this week, we are going to use the fake data - about real estates in Wroclaw - prices by districts, size of apartments and many more.
As you can see, not all formats of our variables are adapted. We need to prepare appropriate formats of our variables according to their measurement scale and future application.
apartments$district<-as.factor(apartments$district)
apartments$building_type<-as.factor(apartments$building_type)
apartments$rooms<-factor(apartments$rooms,ordered=TRUE)
attach(apartments)
apartments$price_PLN<-as.numeric(apartments$price_PLN)
apartments$price_EUR<-as.numeric(apartments$price_EUR)
In the first step of our analysis, we will group our data into a simple frequency table.
First, let’s look at the distribution of housing prices in our sample and verify tabular validity using the TAI measure:
Ok, it looks quite ugly, so let’s wrap it up using the ‘kable’ package:
|
## # classes Goodness of fit Tabular accuracy
## 10.0000000 0.9780872 0.8508467
As we can see - the TAI index is quite high. 0.85 means that we can accept the proposed construction of the frequency table.
In this section, we should represent our data using basic (pre-installed in R) graphics. Select the most appropriate graphs depending on the scale of the selected variables. Explore the heterogeneity of the distribution by presenting the data by group (e.g., by neighborhood, building type, etc.). Don’t forget about main titles, labels and legends. Read more about graphical parameters here.
Note that the echo = FALSE parameter has been added to
the code snippet to prevent printing the R code that generated the
graph.
Now, let’s use the ggplot2 and ggpubr libraries to plot.
Ggplot2 allows you to show the average value for each group using the stat_summary() function. You no longer need to calculate average values before creating a graph!
Faceting generates small multiples, each showing a different subset of the data. They are a powerful tool for exploratory data analysis: you can quickly compare patterns in different parts of the data and see if they are the same or different. Read more here.
Before automatically reporting the full summary table of descriptive statistics, this time your goal is to measure the central tendency of the price distribution. Compare the mean, median, and mode along with positional measures - quantiles - by district and building type or number of rooms in the apartment.
mean(price_PLN)
## [1] 760035
median(price_PLN)
## [1] 755719.5
sd(price_PLN) #standard deviation
## [1] 186099.8
var(price_PLN) #variance
## [1] 34633125960
coeff_var<-sd(price_PLN)/mean(price_PLN) #coefficient of variability %
coeff_var
## [1] 0.2448568
IQR(price_PLN)# difference between quartiles =Q3-Q1
## 75%
## 282686.5
sx<-IQR(price_PLN)/2 #interquartile deviation
coeff_varx<-sx/median(price_PLN) #IQR coefficient of variability %
coeff_varx
## 75%
## 0.1870314
min(price_PLN)
## [1] 359769
max(price_PLN)
## [1] 1277691
quantile(price_PLN,probs=c(0,0.1,0.25,0.5,0.75,0.95,1),na.rm=TRUE)
## 0% 10% 25% 50% 75% 95% 100%
## 359769.0 518806.8 619073.8 755719.5 901760.2 1054250.8 1277691.0
Ok, we have calculated all of the basic summary statistics above. Let’s wrap them up together now.
| rooms | boxplot | histogram | line1 | line2 | points1 |
|---|---|---|---|---|---|
| 1 | |||||
| 2 | |||||
| 3 | |||||
| 4 |
Ok, now we will finally summarize the basic measures of central tendency for prices by district/building type using the ‘kable’ package. Feel free to customize your final report. See some hints here.
We can calculate easily descriptive statistics also using gtsummary package:
apartments %>%
select(price_PLN,rooms) %>%
tbl_summary(label= price_PLN ~ "Price",digits=c(price_PLN)~2,by=rooms,type = all_continuous() ~ "continuous2", statistic = all_continuous() ~ c("{N_nonmiss}", "{median} ({p25}, {p75})", "{min}, {max}"),missing = "no")
| Characteristic | 1, N = 44 | 2, N = 50 | 3, N = 58 | 4, N = 48 |
|---|---|---|---|---|
| Price | ||||
| N | 44.00 | 50.00 | 58.00 | 48.00 |
| Median (IQR) | 520,507.00 (479,684.75, 555,024.75) | 677,260.00 (634,757.25, 717,728.50) | 846,303.50 (769,683.75, 901,078.75) | 964,338.50 (909,371.50, 1,050,976.75) |
| Range | 359,769.00, 657,146.00 | 590,286.00, 888,634.00 | 632,770.00, 965,829.00 | 736,669.00, 1,277,691.00 |
dfSummary() creates a summary table with statistics, frequencies and graphs for all variables in a data frame. The information displayed is type-specific (character, factor, numeric, date) and also varies according to the number of distinct values.
When using dfSummary() in R Markdown documents, it is generally a good idea to exclude a column or two to avoid margin overflow. Since the Valid and Missing columns are redundant, we can drop either one of them.
dfSummary(apartments,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.75,
valid.col = FALSE,
tmp.img.dir = "/tmp")
## temporary images written to 'C:\tmp'
Dimensions: 200 x 6
Duplicates: 0
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
|---|---|---|---|---|---|
| 1 | price_PLN [numeric] |
Mean (sd) : 760035 (186099.8) min < med < max: 359769 < 755719.5 < 1277691 IQR (CV) : 282686.5 (0.2) |
200 distinct values | 0 (0.0%) |
|
| 2 | price_EUR [numeric] |
Mean (sd) : 175934 (43078.6) min < med < max: 83280 < 174935 < 295762 IQR (CV) : 65436.2 (0.2) |
200 distinct values | 0 (0.0%) |
|
| 3 | rooms [ordered, factor] |
1. 1 2. 2 3. 3 4. 4 |
44 (22.0%) 50 (25.0%) 58 (29.0%) 48 (24.0%) |
0 (0.0%) |
|
| 4 | size [numeric] |
Mean (sd) : 46.2 (20.1) min < med < max: 17 < 43.7 < 87.7 IQR (CV) : 30.2 (0.4) |
162 distinct values | 0 (0.0%) |
|
| 5 | district [factor] |
1. Biskupin 2. Krzyki 3. Srodmiescie |
65 (32.5%) 79 (39.5%) 56 (28.0%) |
0 (0.0%) |
|
| 6 | building_type [factor] |
1. kamienica 2. niski blok 3. wiezowiec |
61 (30.5%) 63 (31.5%) 76 (38.0%) |
0 (0.0%) |
To produce optimal results, summarytools has its own version of the base by() function. It’s called stby(), and we use it exactly as we would by():
(stats_by_rooms <- stby(data = apartments, INDICES = apartments$rooms, FUN = descr, stats = "common", transpose = TRUE))
## Non-numerical variable(s) ignored: rooms, district, building_type
Descriptive Statistics
apartments
Group: rooms = 1
N: 44
| Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid | |
|---|---|---|---|---|---|---|---|
| price_EUR | 119332.95 | 15497.90 | 83280.00 | 120488.00 | 152117.00 | 44.00 | 100.00 |
| price_PLN | 515518.05 | 66951.03 | 359769.00 | 520507.00 | 657146.00 | 44.00 | 100.00 |
| size | 19.28 | 1.46 | 17.00 | 19.10 | 21.90 | 44.00 | 100.00 |
Group: rooms = 2
N: 50
| Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid | |
|---|---|---|---|---|---|---|---|
| price_EUR | 158233.22 | 15063.13 | 136640.00 | 156773.00 | 205702.00 | 50.00 | 100.00 |
| price_PLN | 683567.70 | 65072.66 | 590286.00 | 677260.00 | 888634.00 | 50.00 | 100.00 |
| size | 36.80 | 4.46 | 29.60 | 35.95 | 43.70 | 50.00 | 100.00 |
Group: rooms = 3
N: 58
| Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid | |
|---|---|---|---|---|---|---|---|
| price_EUR | 192987.55 | 20125.88 | 146475.00 | 195904.00 | 223572.00 | 58.00 | 100.00 |
| price_PLN | 833706.02 | 86943.90 | 632770.00 | 846303.50 | 965829.00 | 58.00 | 100.00 |
| size | 53.33 | 7.21 | 41.20 | 53.45 | 65.20 | 58.00 | 100.00 |
Group: rooms = 4
N: 48
| Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid | |
|---|---|---|---|---|---|---|---|
| price_EUR | 225650.42 | 26347.03 | 170525.00 | 223226.50 | 295762.00 | 48.00 | 100.00 |
| price_PLN | 974809.96 | 113819.21 | 736669.00 | 964338.50 | 1277691.00 | 48.00 | 100.00 |
| size | 72.05 | 10.18 | 53.30 | 70.85 | 87.70 | 48.00 | 100.00 |
When generating freq() or descr() tables, it is possible to turn the results into “tidy” tables with the use of the tb() function (think of tb as a diminutive for tibble). For example:
apartments %>%
descr(stats = "common") %>%
tb()
## # A tibble: 3 × 8
## variable mean sd min med max n.valid pct.valid
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 price_EUR 175934. 43079. 83280 174935 295762 200 100
## 2 price_PLN 760035. 186100. 359769 755720. 1277691 200 100
## 3 size 46.2 20.1 17 43.7 87.7 200 100
Here are some examples showing how lists created using stby() or group_by() can be transformed into tidy tibbles.
grouped_descr <- stby(data = apartments,INDICES = apartments$rooms, FUN = descr, stats = "common")
grouped_descr %>% tb()
## # A tibble: 12 × 9
## rooms variable mean sd min med max n.valid pct.valid
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 price_EUR 119333. 15498. 83280 120488 1.52e5 44 100
## 2 1 price_PLN 515518. 66951. 359769 520507 6.57e5 44 100
## 3 1 size 19.3 1.46 17 19.1 2.19e1 44 100
## 4 2 price_EUR 158233. 15063. 136640 156773 2.06e5 50 100
## 5 2 price_PLN 683568. 65073. 590286 677260 8.89e5 50 100
## 6 2 size 36.8 4.46 29.6 36.0 4.37e1 50 100
## 7 3 price_EUR 192988. 20126. 146475 195904 2.24e5 58 100
## 8 3 price_PLN 833706. 86944. 632770 846304. 9.66e5 58 100
## 9 3 size 53.3 7.21 41.2 53.4 6.52e1 58 100
## 10 4 price_EUR 225650. 26347. 170525 223226. 2.96e5 48 100
## 11 4 price_PLN 974810. 113819. 736669 964338. 1.28e6 48 100
## 12 4 size 72.0 10.2 53.3 70.8 8.77e1 48 100
stby(data = apartments,
INDICES = apartments$rooms,
FUN = descr,
stats = "fivenum") %>%
tb(order = 3) %>%
kable(format = "html", digits = 2) %>%
collapse_rows(columns = 1, valign = "top")
| variable | rooms | min | q1 | med | q3 | max |
|---|---|---|---|---|---|---|
| price_EUR | 1 | 83280.0 | 110881.0 | 120488.00 | 128568.00 | 152117.0 |
| 2 | 136640.0 | 146754.0 | 156773.00 | 166259.00 | 205702.0 | |
| 3 | 146475.0 | 177478.0 | 195904.00 | 208599.00 | 223572.0 | |
| 4 | 170525.0 | 209827.5 | 223226.50 | 243300.00 | 295762.0 | |
| price_PLN | 1 | 359769.0 | 479005.5 | 520507.00 | 555411.50 | 657146.0 |
| 2 | 590286.0 | 633978.0 | 677260.00 | 718237.00 | 888634.0 | |
| 3 | 632770.0 | 766707.0 | 846303.50 | 901149.00 | 965829.0 | |
| 4 | 736669.0 | 906455.0 | 964338.50 | 1051055.50 | 1277691.0 | |
| size | 1 | 17.0 | 18.1 | 19.10 | 20.60 | 21.9 |
| 2 | 29.6 | 32.9 | 35.95 | 40.50 | 43.7 | |
| 3 | 41.2 | 47.9 | 53.45 | 59.70 | 65.2 | |
| 4 | 53.3 | 64.2 | 70.85 | 82.15 | 87.7 |
Your task this week is to: prepare your own descriptive analysis for the “CreditCard” dataset (AER package). It is a cross-sectional dataframe on the credit history for a sample of applicants for a type of credit card.
Are the yearly incomes (in USD 10,000), credit card expenditures, age, ratio of monthly credit card expenditure to yearly income - significantly different for applicants for customers with different credit risk (“card” variable - factor)?
Prepare a professional data visualizations, descriptive statistics’ tables and interpret them.
# this analysis shows us that credit risk differs in terms of age, income etc. for example most the highest density of credit risk is within 25 to 35 years. By looking at these graphs we can make a conclusions.
library(AER)
library(dplyr)
library(scales)
##
## Dołączanie pakietu: 'scales'
## Następujące obiekty zostały zakryte z 'package:psych':
##
## alpha, rescale
## Następujący obiekt został zakryty z 'package:desctable':
##
## percent
## Następujący obiekt został zakryty z 'package:purrr':
##
## discard
## Następujący obiekt został zakryty z 'package:readr':
##
## col_factor
## Następujący obiekt został zakryty z 'package:arsenal':
##
## ordinal
library(ggplot2)
library(ggdist)
data(CreditCard)
head(CreditCard)
## card reports age income share expenditure owner selfemp dependents
## 1 yes 0 37.66667 4.5200 0.033269910 124.983300 yes no 3
## 2 yes 0 33.25000 2.4200 0.005216942 9.854167 no no 3
## 3 yes 0 33.66667 4.5000 0.004155556 15.000000 yes no 4
## 4 yes 0 30.50000 2.5400 0.065213780 137.869200 no no 0
## 5 yes 0 32.16667 9.7867 0.067050590 546.503300 yes no 2
## 6 yes 0 23.25000 2.5000 0.044438400 91.996670 no no 0
## months majorcards active
## 1 54 1 12
## 2 34 1 13
## 3 58 1 5
## 4 25 1 7
## 5 64 1 5
## 6 54 1 1
summary(CreditCard)
## card reports age income
## no : 296 Min. : 0.0000 Min. : 0.1667 Min. : 0.210
## yes:1023 1st Qu.: 0.0000 1st Qu.:25.4167 1st Qu.: 2.244
## Median : 0.0000 Median :31.2500 Median : 2.900
## Mean : 0.4564 Mean :33.2131 Mean : 3.365
## 3rd Qu.: 0.0000 3rd Qu.:39.4167 3rd Qu.: 4.000
## Max. :14.0000 Max. :83.5000 Max. :13.500
## share expenditure owner selfemp dependents
## Min. :0.0001091 Min. : 0.000 no :738 no :1228 Min. :0.0000
## 1st Qu.:0.0023159 1st Qu.: 4.583 yes:581 yes: 91 1st Qu.:0.0000
## Median :0.0388272 Median : 101.298 Median :1.0000
## Mean :0.0687322 Mean : 185.057 Mean :0.9939
## 3rd Qu.:0.0936168 3rd Qu.: 249.036 3rd Qu.:2.0000
## Max. :0.9063205 Max. :3099.505 Max. :6.0000
## months majorcards active
## Min. : 0.00 Min. :0.0000 Min. : 0.000
## 1st Qu.: 12.00 1st Qu.:1.0000 1st Qu.: 2.000
## Median : 30.00 Median :1.0000 Median : 6.000
## Mean : 55.27 Mean :0.8173 Mean : 6.997
## 3rd Qu.: 72.00 3rd Qu.:1.0000 3rd Qu.:11.000
## Max. :540.00 Max. :1.0000 Max. :46.000
summary_table <- CreditCard %>%
group_by(card) %>%
summarise(
mean_income = mean(income),
mean_expenditure = mean(expenditure),
median_age = median(age),
median_ratio = median(expenditure / income)
)
levels <- c("poor", "fair", "good", "excellent")
summary_table_formatted <- summary_table %>%
mutate(
mean_income = scales::dollar(mean_income * 10000, prefix = "$"),
mean_expenditure = scales::dollar(mean_expenditure, prefix = "$"),
median_age = paste0(round(median_age, 2), " years"),
median_ratio = scales::percent(median_ratio / 100)
)
kable(summary_table_formatted, align = "c", caption = "Summary Statistics by Credit Risk")
| card | mean_income | mean_expenditure | median_age | median_ratio |
|---|---|---|---|---|
| no | $30,685.09 | $0.00 | 31.83 years | 0% |
| yes | $34,512.73 | $238.60 | 31.08 years | 50% |
ggplot(CreditCard, aes(x = income, fill = card)) +
geom_histogram(binwidth = 2, position = "identity", alpha = 0.7) +
labs(title = "Distribution of Income by Credit Risk",
x = "Income (in USD 10,000)",
y = "Frequency") +
theme_minimal() +
facet_wrap(~ card)
ggplot(CreditCard, aes(x = card, y = income, fill = card)) +
geom_violin(trim = FALSE, alpha = 0.7) +
geom_point(aes(y = income, color = card), position = position_jitterdodge(dodge.width = 0.75, jitter.height = 0), size = 0.5, alpha = 0.7) +
labs(title = "Rain Cloud Plot of Income by Credit Risk",
x = "Credit Risk",
y = "Income (in USD 10,000)",
fill = "Credit Risk") +
theme_minimal() +
scale_fill_manual(values = c("poor" = "red", "fair" = "orange", "good" = "green", "excellent" = "blue")) +
scale_color_manual(values = c("poor" = "red", "fair" = "orange", "good" = "green", "excellent" = "blue")) +
guides(fill = FALSE) +
geom_point(aes(y = income, color = card), position = position_jitterdodge(dodge.width = 0.75, jitter.height = 0), size = 0.5, alpha = 0.7)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's fill values.
## No shared levels found between `names(values)` of the manual scale and the
## data's fill values.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's fill values.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's colour values.
ggplot(CreditCard, aes(x = card, y = income, fill = card)) +
geom_boxplot() +
geom_jitter(width = 0.2, alpha = 0.7) +
labs(title = "Boxplot with Jitter of Income by Credit Risk",
x = "Credit Risk",
y = "Income (in USD 10,000)",
fill = "Credit Risk") +
theme_minimal()
ggplot(CreditCard, aes(x = card, y = age, fill = card)) +
geom_violin(trim = FALSE, alpha = 0.7) +
labs(title = "Violin Plot of Age by Credit Risk",
x = "Credit Risk",
y = "Age",
fill = "Credit Risk") +
theme_minimal()
income_test <- t.test(income ~ card, data = CreditCard)
expenditure_test <- t.test(expenditure ~ card, data = CreditCard)
age_test <- t.test(age ~ card, data = CreditCard)
ratio_test <- t.test(expenditure / income ~ card, data = CreditCard)
print(income_test)
##
## Welch Two Sample t-test
##
## data: income by card
## t = -3.5441, df = 501.36, p-value = 0.0004308
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -0.5949520 -0.1705758
## sample estimates:
## mean in group no mean in group yes
## 3.068509 3.451273
print(expenditure_test)
##
## Welch Two Sample t-test
##
## data: expenditure by card
## t = -26.525, df = 1022, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -256.2538 -220.9510
## sample estimates:
## mean in group no mean in group yes
## 0.0000 238.6024
print(age_test)
##
## Welch Two Sample t-test
##
## data: age by card
## t = -0.019794, df = 490.24, p-value = 0.9842
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -1.308120 1.282026
## sample estimates:
## mean in group no mean in group yes
## 33.20298 33.21603
print(ratio_test)
##
## Welch Two Sample t-test
##
## data: expenditure/income by card
## t = -28.513, df = 1022, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -78.70527 -68.56985
## sample estimates:
## mean in group no mean in group yes
## 0.00000 73.63756