In our example this week, we are going to use the fake data - about real estates in Wroclaw - prices by districts, size of apartments and many more.
As you can see, not all formats of our variables are adapted. We need to prepare appropriate formats of our variables according to their measurement scale and future application.
apartments$district<-as.factor(apartments$district)
apartments$building_type<-as.factor(apartments$building_type)
apartments$rooms<-factor(apartments$rooms,ordered=TRUE)
attach(apartments)
apartments$price_PLN<-as.numeric(apartments$price_PLN)
apartments$price_EUR<-as.numeric(apartments$price_EUR)
In the first step of our analysis, we will group our data into a simple frequency table.
First, let’s look at the distribution of housing prices in our sample and verify tabular validity using the TAI measure:
Ok, it looks quite ugly, so let’s wrap it up using the ‘kable’ package:
|
## # classes Goodness of fit Tabular accuracy
## 10.0000000 0.9780872 0.8508467
As we can see - the TAI index is quite high. 0.85 means that we can accept the proposed construction of the frequency table.
In this section, we should represent our data using basic (pre-installed in R) graphics. Select the most appropriate graphs depending on the scale of the selected variables. Explore the heterogeneity of the distribution by presenting the data by group (e.g., by neighborhood, building type, etc.). Don’t forget about main titles, labels and legends. Read more about graphical parameters here.
Note that the echo = FALSE parameter has been added to
the code snippet to prevent printing the R code that generated the
graph.
Now, let’s use the ggplot2 and ggpubr libraries to plot.
Ggplot2 allows you to show the average value for each group using the stat_summary() function. You no longer need to calculate average values before creating a graph!
Faceting generates small multiples, each showing a different subset of the data. They are a powerful tool for exploratory data analysis: you can quickly compare patterns in different parts of the data and see if they are the same or different. Read more here.
Before automatically reporting the full summary table of descriptive statistics, this time your goal is to measure the central tendency of the price distribution. Compare the mean, median, and mode along with positional measures - quantiles - by district and building type or number of rooms in the apartment.
mean(price_PLN)
## [1] 760035
median(price_PLN)
## [1] 755719.5
sd(price_PLN) #standard deviation
## [1] 186099.8
var(price_PLN) #variance
## [1] 34633125960
coeff_var<-sd(price_PLN)/mean(price_PLN) #coefficient of variability %
coeff_var
## [1] 0.2448568
IQR(price_PLN)# difference between quartiles =Q3-Q1
## 75%
## 282686.5
sx<-IQR(price_PLN)/2 #interquartile deviation
coeff_varx<-sx/median(price_PLN) #IQR coefficient of variability %
coeff_varx
## 75%
## 0.1870314
min(price_PLN)
## [1] 359769
max(price_PLN)
## [1] 1277691
quantile(price_PLN,probs=c(0,0.1,0.25,0.5,0.75,0.95,1),na.rm=TRUE)
## 0% 10% 25% 50% 75% 95% 100%
## 359769.0 518806.8 619073.8 755719.5 901760.2 1054250.8 1277691.0
Ok, we have calculated all of the basic summary statistics above. Let’s wrap them up together now.
| rooms | boxplot | histogram | line1 | line2 | points1 |
|---|---|---|---|---|---|
| 1 | |||||
| 2 | |||||
| 3 | |||||
| 4 |
Ok, now we will finally summarize the basic measures of central tendency for prices by district/building type using the ‘kable’ package. Feel free to customize your final report. See some hints here.
We can calculate easily descriptive statistics also using gtsummary package:
apartments %>%
select(price_PLN,rooms) %>%
tbl_summary(label= price_PLN ~ "Price",digits=c(price_PLN)~2,by=rooms,type = all_continuous() ~ "continuous2", statistic = all_continuous() ~ c("{N_nonmiss}", "{median} ({p25}, {p75})", "{min}, {max}"),missing = "no")
| Characteristic | 1, N = 44 | 2, N = 50 | 3, N = 58 | 4, N = 48 |
|---|---|---|---|---|
| Price | ||||
| N | 44.00 | 50.00 | 58.00 | 48.00 |
| Median (IQR) | 520,507.00 (479,684.75, 555,024.75) | 677,260.00 (634,757.25, 717,728.50) | 846,303.50 (769,683.75, 901,078.75) | 964,338.50 (909,371.50, 1,050,976.75) |
| Range | 359,769.00, 657,146.00 | 590,286.00, 888,634.00 | 632,770.00, 965,829.00 | 736,669.00, 1,277,691.00 |
dfSummary() creates a summary table with statistics, frequencies and graphs for all variables in a data frame. The information displayed is type-specific (character, factor, numeric, date) and also varies according to the number of distinct values.
When using dfSummary() in R Markdown documents, it is generally a good idea to exclude a column or two to avoid margin overflow. Since the Valid and Missing columns are redundant, we can drop either one of them.
dfSummary(apartments,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.75,
valid.col = FALSE,
tmp.img.dir = "/tmp")
## temporary images written to 'C:\tmp'
Dimensions: 200 x 6
Duplicates: 0
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
|---|---|---|---|---|---|
| 1 | price_PLN [numeric] |
Mean (sd) : 760035 (186099.8) min < med < max: 359769 < 755719.5 < 1277691 IQR (CV) : 282686.5 (0.2) |
200 distinct values | 0 (0.0%) |
|
| 2 | price_EUR [numeric] |
Mean (sd) : 175934 (43078.6) min < med < max: 83280 < 174935 < 295762 IQR (CV) : 65436.2 (0.2) |
200 distinct values | 0 (0.0%) |
|
| 3 | rooms [ordered, factor] |
1. 1 2. 2 3. 3 4. 4 |
44 (22.0%) 50 (25.0%) 58 (29.0%) 48 (24.0%) |
0 (0.0%) |
|
| 4 | size [numeric] |
Mean (sd) : 46.2 (20.1) min < med < max: 17 < 43.7 < 87.7 IQR (CV) : 30.2 (0.4) |
162 distinct values | 0 (0.0%) |
|
| 5 | district [factor] |
1. Biskupin 2. Krzyki 3. Srodmiescie |
65 (32.5%) 79 (39.5%) 56 (28.0%) |
0 (0.0%) |
|
| 6 | building_type [factor] |
1. kamienica 2. niski blok 3. wiezowiec |
61 (30.5%) 63 (31.5%) 76 (38.0%) |
0 (0.0%) |
To produce optimal results, summarytools has its own version of the base by() function. It’s called stby(), and we use it exactly as we would by():
(stats_by_rooms <- stby(data = apartments, INDICES = apartments$rooms, FUN = descr, stats = "common", transpose = TRUE))
## Non-numerical variable(s) ignored: rooms, district, building_type
Descriptive Statistics
apartments
Group: rooms = 1
N: 44
| Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid | |
|---|---|---|---|---|---|---|---|
| price_EUR | 119332.95 | 15497.90 | 83280.00 | 120488.00 | 152117.00 | 44.00 | 100.00 |
| price_PLN | 515518.05 | 66951.03 | 359769.00 | 520507.00 | 657146.00 | 44.00 | 100.00 |
| size | 19.28 | 1.46 | 17.00 | 19.10 | 21.90 | 44.00 | 100.00 |
Group: rooms = 2
N: 50
| Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid | |
|---|---|---|---|---|---|---|---|
| price_EUR | 158233.22 | 15063.13 | 136640.00 | 156773.00 | 205702.00 | 50.00 | 100.00 |
| price_PLN | 683567.70 | 65072.66 | 590286.00 | 677260.00 | 888634.00 | 50.00 | 100.00 |
| size | 36.80 | 4.46 | 29.60 | 35.95 | 43.70 | 50.00 | 100.00 |
Group: rooms = 3
N: 58
| Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid | |
|---|---|---|---|---|---|---|---|
| price_EUR | 192987.55 | 20125.88 | 146475.00 | 195904.00 | 223572.00 | 58.00 | 100.00 |
| price_PLN | 833706.02 | 86943.90 | 632770.00 | 846303.50 | 965829.00 | 58.00 | 100.00 |
| size | 53.33 | 7.21 | 41.20 | 53.45 | 65.20 | 58.00 | 100.00 |
Group: rooms = 4
N: 48
| Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid | |
|---|---|---|---|---|---|---|---|
| price_EUR | 225650.42 | 26347.03 | 170525.00 | 223226.50 | 295762.00 | 48.00 | 100.00 |
| price_PLN | 974809.96 | 113819.21 | 736669.00 | 964338.50 | 1277691.00 | 48.00 | 100.00 |
| size | 72.05 | 10.18 | 53.30 | 70.85 | 87.70 | 48.00 | 100.00 |
When generating freq() or descr() tables, it is possible to turn the results into “tidy” tables with the use of the tb() function (think of tb as a diminutive for tibble). For example:
apartments %>%
descr(stats = "common") %>%
tb()
## # A tibble: 3 × 8
## variable mean sd min med max n.valid pct.valid
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 price_EUR 175934. 43079. 83280 174935 295762 200 100
## 2 price_PLN 760035. 186100. 359769 755720. 1277691 200 100
## 3 size 46.2 20.1 17 43.7 87.7 200 100
Here are some examples showing how lists created using stby() or group_by() can be transformed into tidy tibbles.
grouped_descr <- stby(data = apartments,INDICES = apartments$rooms, FUN = descr, stats = "common")
grouped_descr %>% tb()
## # A tibble: 12 × 9
## rooms variable mean sd min med max n.valid pct.valid
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 price_EUR 119333. 15498. 83280 120488 1.52e5 44 100
## 2 1 price_PLN 515518. 66951. 359769 520507 6.57e5 44 100
## 3 1 size 19.3 1.46 17 19.1 2.19e1 44 100
## 4 2 price_EUR 158233. 15063. 136640 156773 2.06e5 50 100
## 5 2 price_PLN 683568. 65073. 590286 677260 8.89e5 50 100
## 6 2 size 36.8 4.46 29.6 36.0 4.37e1 50 100
## 7 3 price_EUR 192988. 20126. 146475 195904 2.24e5 58 100
## 8 3 price_PLN 833706. 86944. 632770 846304. 9.66e5 58 100
## 9 3 size 53.3 7.21 41.2 53.4 6.52e1 58 100
## 10 4 price_EUR 225650. 26347. 170525 223226. 2.96e5 48 100
## 11 4 price_PLN 974810. 113819. 736669 964338. 1.28e6 48 100
## 12 4 size 72.0 10.2 53.3 70.8 8.77e1 48 100
stby(data = apartments,
INDICES = apartments$rooms,
FUN = descr,
stats = "fivenum") %>%
tb(order = 3) %>%
kable(format = "html", digits = 2) %>%
collapse_rows(columns = 1, valign = "top")
| variable | rooms | min | q1 | med | q3 | max |
|---|---|---|---|---|---|---|
| price_EUR | 1 | 83280.0 | 110881.0 | 120488.00 | 128568.00 | 152117.0 |
| 2 | 136640.0 | 146754.0 | 156773.00 | 166259.00 | 205702.0 | |
| 3 | 146475.0 | 177478.0 | 195904.00 | 208599.00 | 223572.0 | |
| 4 | 170525.0 | 209827.5 | 223226.50 | 243300.00 | 295762.0 | |
| price_PLN | 1 | 359769.0 | 479005.5 | 520507.00 | 555411.50 | 657146.0 |
| 2 | 590286.0 | 633978.0 | 677260.00 | 718237.00 | 888634.0 | |
| 3 | 632770.0 | 766707.0 | 846303.50 | 901149.00 | 965829.0 | |
| 4 | 736669.0 | 906455.0 | 964338.50 | 1051055.50 | 1277691.0 | |
| size | 1 | 17.0 | 18.1 | 19.10 | 20.60 | 21.9 |
| 2 | 29.6 | 32.9 | 35.95 | 40.50 | 43.7 | |
| 3 | 41.2 | 47.9 | 53.45 | 59.70 | 65.2 | |
| 4 | 53.3 | 64.2 | 70.85 | 82.15 | 87.7 |
Your task this week is to: prepare your own descriptive analysis for the “CreditCard” dataset (AER package). It is a cross-sectional dataframe on the credit history for a sample of applicants for a type of credit card.
library(AER)
library(dplyr)
library(ggplot2)
data(CreditCard)
# analysis by credit risk
summary_by_card <- CreditCard %>%
group_by(card) %>%
summarise(
mean_yearly_income = mean(income),
median_yearly_income = median(income),
mean_expenditure = mean(expenditure),
median_expenditure = median(expenditure),
mean_age = mean(age),
median_age = median(age),
mean_expenditure_income_ratio = mean(expenditure / income),
median_expenditure_income_ratio = median(expenditure / income)
)
# Histogram of yearly income by credit risk
income_histogram <- ggplot(CreditCard, aes(x = income, fill = card)) +
geom_histogram(binwidth = 2, alpha = 0.7, position = "identity") +
labs(title = "Relation between yearly income and credit risk",
x = "Yearly Income (in 10k dollars)",
y = "Frequency",
fill = "Credit Risk")
expenditure_boxplot <- ggplot(CreditCard, aes(x = card, y = expenditure, fill = card)) +
geom_boxplot() +
labs(title = "Expenditure Distribution by Credit Risk",
x = "Credit Risk",
y = "Expenditure",
fill = "Credit Risk")
# graphs
print(income_histogram)
print(expenditure_boxplot)
# your code here
# Load required libraries
library(AER) # for accessing the CreditCard dataset
library(dplyr) # for data manipulation
library(ggplot2)# for data visualization
library(tidyr)
# View the structure of the dataset
str(CreditCard)
## 'data.frame': 1319 obs. of 12 variables:
## $ card : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ reports : num 0 0 0 0 0 0 0 0 0 0 ...
## $ age : num 37.7 33.2 33.7 30.5 32.2 ...
## $ income : num 4.52 2.42 4.5 2.54 9.79 ...
## $ share : num 0.03327 0.00522 0.00416 0.06521 0.06705 ...
## $ expenditure: num 124.98 9.85 15 137.87 546.5 ...
## $ owner : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 2 1 ...
## $ selfemp : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ dependents : num 3 3 4 0 2 0 2 0 0 0 ...
## $ months : num 54 34 58 25 64 54 7 77 97 65 ...
## $ majorcards : num 1 1 1 1 1 1 1 1 1 1 ...
## $ active : num 12 13 5 7 5 1 5 3 6 18 ...
# Define a function to calculate mode
get_mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
#Data cleansing
# Calculate the first and third quartiles
Q1 <- quantile(CreditCard$income, 0.25)
Q3 <- quantile(CreditCard$income, 0.75)
# Calculate the IQR
IQR <- Q3 - Q1
# Define the lower and upper bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Filter the dataset to remove outliers
CreditCard_filtered <- CreditCard %>%
filter(income >= lower_bound & income <= upper_bound)
# Summary statistics an making one table for better visualization
summary_by_card <- CreditCard %>%
group_by(card) %>%
summarise(
mean_reports = round(mean(reports,na.rm = TRUE),3),
median_reports = round(median(reports,na.rm = TRUE),3),
mode_reports = get_mode(reports),
mean_age = round(mean(age,na.rm = TRUE),3),
median_age = round(median(age,na.rm = TRUE),3),
mode_age = get_mode(age),
mean_income = round(mean(income,na.rm = TRUE),3),
median_income = round(median(income,na.rm = TRUE),3),
mode_income = get_mode(income),
mean_share = round(mean(share,na.rm = TRUE),3),
median_share = round(median(share,na.rm = TRUE),3),
mode_share = get_mode(share),
mean_expenditure = round(mean(expenditure,na.rm = TRUE),3),
median_expenditure = round(median(expenditure,na.rm = TRUE),3),
mode_expenditure = get_mode(expenditure)
)%>%
pivot_wider(names_from = card, values_from = everything(), names_glue = "{.value}-{card}")
summary_by_card_wide <- as.data.frame(summary_by_card)
# Visualization: Histogram of Age by Credit Risk
age_histogram <- ggplot(CreditCard, aes(x = age, fill = card)) +
geom_histogram(binwidth = 2, alpha = 0.7, position = "identity") +
labs(title = "Distribution of Age by Credit Risk",
x = "Age",
y = "Frequency",
fill = "Credit Risk")
# Visualization: Boxplot of Income by Credit Risk
income_boxplot <- ggplot(CreditCard, aes(x = card, y = income, fill = card)) +
geom_boxplot() +
labs(title = "Income Distribution by Credit Risk",
x = "Credit Risk",
y = "Income",
fill = "Credit Risk")
# Visualization: Scatterplot of Share vs. Expenditure by Credit Risk
share_expenditure_scatter <- ggplot(CreditCard, aes(x = share, y = expenditure, color = card)) +
geom_point() +
labs(title = "Share vs. Expenditure by Credit Risk",
x = "Share",
y = "Expenditure",
color = "Credit Risk")
# Distribution of income by credit risk (card)
income_distribution <- ggplot(CreditCard, aes(x = income, fill = card)) +
geom_density(alpha = 0.5) +
labs(title = "Distribution of Income by Credit Risk",
x = "Income",
y = "Density",
fill = "Credit Risk")
# Histogram of income by credit risk
income_credit_histogram <- ggplot(CreditCard, aes(x = income, fill = card)) +
geom_histogram(binwidth = 1000, alpha = 0.7, position = "identity") +
labs(title = "Histogram of Income by Credit Risk",
x = "Income",
y = "Frequency",
fill = "Credit Risk")
# Boxplot of age by credit risk
age_boxplot <- ggplot(CreditCard, aes(x = card, y = age, fill = card)) +
geom_boxplot() +
labs(title = "Boxplot of Age by Credit Risk",
x = "Credit Risk",
y = "Age",
fill = "Credit Risk")
# Frequency distribution of expenditures by credit risk
expenditure_histogram <- ggplot(CreditCard, aes(x = expenditure, fill = card)) +
geom_histogram(binwidth = 50, alpha = 0.7, position = "identity") +
labs(title = "Frequency Distribution of Expenditures by Credit Risk",
x = "Expenditure",
y = "Frequency",
fill = "Credit Risk") +
theme_minimal()
# Frequency distribution of shares by credit risk
share_histogram <- ggplot(CreditCard, aes(x = share, fill = card)) +
geom_histogram(binwidth = 0.05, alpha = 0.7, position = "identity") +
labs(title = "Frequency Distribution of Shares by Credit Risk",
x = "Share",
y = "Frequency",
fill = "Credit Risk") +
theme_minimal()
# Display the visualizations
print(income_distribution)
print(expenditure_histogram)
print(share_histogram)
print(income_credit_histogram)
print(age_boxplot)
print(age_histogram)
print(income_boxplot)
print(share_expenditure_scatter)
# Making one summary for both card types
print_summary_side_by_side <- function(summary_df) {
cat("\n------------------------------------------------------------\n")
cat(sprintf("%-30s %-20s %s\n", "Statistic", "Credit Risk: no", "Credit Risk: yes"))
cat("------------------------------------------------------------\n")
for (stat in c("mean_reports", "median_reports", "mode_reports",
"mean_age", "median_age", "mode_age",
"mean_income", "median_income", "mode_income",
"mean_share", "median_share", "mode_share",
"mean_expenditure", "median_expenditure", "mode_expenditure")) {
no_label <- paste(stat, "no", sep = "-")
yes_label <- paste(stat, "yes", sep = "-")
cat(sprintf("%-30s %-20s %s\n", stat,
format(summary_df[[no_label]], nsmall = 3),
format(summary_df[[yes_label]], nsmall = 3)))
}
cat("------------------------------------------------------------\n")
}
CreditCard %>%
descr(stats = "all") %>%
tb()
## # A tibble: 9 × 16
## variable mean sd min q1 med q3 max mad
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 active 7.00 6.31 0 2 e+0 6 e+0 1.1 e+1 4.6 e+1 5.93e+0
## 2 age 33.2 10.1 0.167 2.54e+1 3.12e+1 3.94e+1 8.35e+1 9.51e+0
## 3 dependents 0.994 1.25 0 0 1 e+0 2 e+0 6 e+0 1.48e+0
## 4 expenditure 185. 272. 0 4.58e+0 1.01e+2 2.49e+2 3.10e+3 1.50e+2
## 5 income 3.37 1.69 0.21 2.24e+0 2.9 e+0 4 e+0 1.35e+1 1.19e+0
## 6 majorcards 0.817 0.387 0 1 e+0 1 e+0 1 e+0 1 e+0 0
## 7 months 55.3 66.3 0 1.2 e+1 3 e+1 7.2 e+1 5.4 e+2 3.11e+1
## 8 reports 0.456 1.35 0 0 0 0 1.4 e+1 0
## 9 share 0.0687 0.0947 0.000109 2.23e-3 3.88e-2 9.36e-2 9.06e-1 5.66e-2
## # ℹ 7 more variables: iqr <dbl>, cv <dbl>, skewness <dbl>, se.skewness <dbl>,
## # kurtosis <dbl>, n.valid <dbl>, pct.valid <dbl>
print_summary_side_by_side(summary_by_card_wide)
##
## ------------------------------------------------------------
## Statistic Credit Risk: no Credit Risk: yes
## ------------------------------------------------------------
## mean_reports 1.588 0.129
## median_reports 1.000 0.000
## mode_reports 0.000 0.000
## mean_age 33.203 33.216
## median_age 31.833 31.083
## mode_age 25.83333 26.08333
## mean_income 3.069 3.451
## median_income 2.590 3.000
## mode_income 2.500 3.000
## mean_share 0.000 0.088
## median_share 0.000 0.060
## mode_share 0.00048 0.0032
## mean_expenditure 0.000 238.602
## median_expenditure 0.000 150.177
## mode_expenditure 0.000 0.000
## ------------------------------------------------------------
Based on the visualizations provided, we can see some major differences between for customers with different credit risk. The biggest difference we can see in incomes. Those with cards (card=yes), tend to have higher income.They also have bigger and way more varied expenditures. The age, however, doesn’t really differ with both groups. Monthly credit card expenditure to yearly income is distinct between the two groups, those having a credit card using a more significant proportion of their income.