In our example this week, we are going to use the fake data - about
real estates in Wroclaw - prices by districts, size of apartments and
many more.
Preprocessing
As you can see, not all formats of our variables are adapted. We need
to prepare appropriate formats of our variables according to their
measurement scale and future application.
In the first step of our analysis, we will group our data into a
simple frequency table.
First, let’s look at the distribution of housing prices in our sample
and verify tabular validity using the TAI measure:
Ok, it looks quite ugly, so let’s wrap it up using the ‘kable’
package:
Apartments in Wroclaw - prices in kPLN
x
label
Freq
Percent
Valid Percent
Cumulative Percent
Valid
350-450 kPLN
9
4.5
4.5
4.5
450-550 kPLN
21
10.5
10.5
15.0
550-650 kPLN
33
16.5
16.5
31.5
650-750 kPLN
36
18.0
18.0
49.5
750-850 kPLN
31
15.5
15.5
65.0
850-950 kPLN
36
18.0
18.0
83.0
950-1050 kPLN
21
10.5
10.5
93.5
1050-1150 kPLN
10
5.0
5.0
98.5
1150-1250 kPLN
2
1.0
1.0
99.5
1250-1350 kPLN
1
0.5
0.5
100.0
Total
200
100.0
100.0
Missing
<blank>
0
0.0
<NA>
0
0.0
Total
200
100.0
## # classes Goodness of fit Tabular accuracy
## 10.0000000 0.9780872 0.8508467
As we can see - the TAI index is quite high. 0.85 means that we can
accept the proposed construction of the frequency table.
Basic plots
In this section, we should represent our data using basic
(pre-installed in R) graphics. Select the most appropriate graphs
depending on the scale of the selected variables. Explore the
heterogeneity of the distribution by presenting the data by group (e.g.,
by neighborhood, building type, etc.). Don’t forget about main titles,
labels and legends. Read more about graphical parameters here.
Note that the echo = FALSE parameter has been added to
the code snippet to prevent printing the R code that generated the
graph.
ggplot2 plots
Now, let’s use the ggplot2 and
ggpubr libraries to plot.
Ggplot2 allows you to show the average value for each group using the
stat_summary() function. You no longer need to
calculate average values before creating a graph!
RainCloud Plot
Faceting
Faceting generates small multiples, each showing a different subset
of the data. They are a powerful tool for exploratory data analysis: you
can quickly compare patterns in different parts of the data and see if
they are the same or different. Read more here.
Univariate Statistics
Before automatically reporting the full summary table of descriptive
statistics, this time your goal is to measure the central tendency of
the price distribution. Compare the mean, median, and mode along with
positional measures - quantiles - by district and building type or
number of rooms in the apartment.
mean(price_PLN)
## [1] 760035
median(price_PLN)
## [1] 755719.5
sd(price_PLN) #standard deviation
## [1] 186099.8
var(price_PLN) #variance
## [1] 34633125960
coeff_var<-sd(price_PLN)/mean(price_PLN) #coefficient of variability %
coeff_var
## [1] 0.2448568
IQR(price_PLN)# difference between quartiles =Q3-Q1
## 75%
## 282686.5
sx<-IQR(price_PLN)/2 #interquartile deviation
coeff_varx<-sx/median(price_PLN) #IQR coefficient of variability %
coeff_varx
Ok, we have calculated all of the basic summary statistics above.
Let’s wrap them up together now.
rooms
boxplot
histogram
line1
line2
points1
1
2
3
4
Summary tables
Ok, now we will finally summarize the basic measures of central
tendency for prices by district/building type using the
‘kable’ package. Feel free to customize your
final report. See some hints here.
gtsummary
We can calculate easily descriptive statistics also using gtsummary
package:
dfSummary() creates a summary table with statistics, frequencies and
graphs for all variables in a data frame. The information displayed is
type-specific (character, factor, numeric, date) and also varies
according to the number of distinct values.
When using dfSummary() in R Markdown documents, it is generally a
good idea to exclude a column or two to avoid margin overflow. Since the
Valid and Missing columns are redundant, we can drop either one of
them.
When generating freq() or descr() tables, it is possible to turn the
results into “tidy” tables with the use of the tb() function (think of
tb as a diminutive for tibble). For example:
Your task this week is to: prepare your own descriptive analysis for
the “CreditCard” dataset (AER package). It is a cross-sectional
dataframe on the credit history for a sample of applicants for a type of
credit card.
Are the yearly incomes (in USD 10,000), credit card expenditures,
age, ratio of monthly credit card expenditure to yearly income -
significantly different for applicants for customers with different
credit risk (“card” variable - factor)?
Prepare a professional data visualizations, descriptive statistics’
tables and interpret them.
data(CreditCard)
# your code here
CreditCard_subset <- CreditCard[, c("card", "income", "expenditure", "age", "share")]
CreditCard_subset$card<-as.factor(CreditCard_subset$card)
CreditCard_subset$income<-as.numeric(CreditCard_subset$income)
CreditCard_subset$expenditure<-as.numeric(CreditCard_subset$expenditure)
CreditCard_subset$age<-as.numeric(CreditCard_subset$age)
CreditCard_subset$share<-as.numeric(CreditCard_subset$share)
# Create a frequency table for the 'card' variable
tableL <- table(CreditCard_subset$card)
table_df <- as.data.frame(tableL)
colnames(table_df) <- c("Card", "Frequency")
kbl(table_df, caption = "Acceptance of credit card") %>%
kable_material(c("striped", "hover"))
## Warning: Values from `Freq` are not uniquely identified; output will contain list-cols.
## • Use `values_fn = list` to suppress this warning.
## • Use `values_fn = {summary_fun}` to summarise duplicates.
## • Use the following dplyr code to identify duplicates.
## {data} |>
## dplyr::summarise(n = dplyr::n(), .by = c(Var1, Var2)) |>
## dplyr::filter(n > 1L)
## Warning: Values from `Freq` are not uniquely identified; output will contain list-cols.
## • Use `values_fn = list` to suppress this warning.
## • Use `values_fn = {summary_fun}` to summarise duplicates.
## • Use the following dplyr code to identify duplicates.
## {data} |>
## dplyr::summarise(n = dplyr::n(), .by = c(Var1, Var2)) |>
## dplyr::filter(n > 1L)
Min. : 0.210 , 1st Qu.: 2.350 , Median : 3.000 , Mean : 3.451 , 3rd Qu.:
4.000 , Max. :13.500
Min. : 0.00 , 1st Qu.: 66.52 , Median : 150.18 , Mean : 238.60 , 3rd
Qu.: 313.96 , Max. :3099.51
Min. : 0.1667 , 1st Qu.:25.3333 , Median :31.0833 , Mean :33.2160 , 3rd
Qu.:39.7083 , Max. :83.5000
Min. :0.000186 , 1st Qu.:0.026635 , Median :0.060209 , Mean :0.088482 ,
3rd Qu.:0.113858 , Max. :0.906320
summary_kable_no
Summary Statistics Table for card declined
Var1
income
expenditure
age
share
Min. : 0.490 , 1st Qu.: 2.000 , Median : 2.590 , Mean : 3.069 , 3rd Qu.:
3.625 , Max. :11.000
Min. :0 , 1st Qu.:0 , Median :0 , Mean :0 , 3rd Qu.:0 , Max. :0
Min. : 0.75 , 1st Qu.:25.65 , Median :31.83 , Mean :33.20 , 3rd
Qu.:38.06 , Max. :80.17
Min. :0.0001091 , 1st Qu.:0.0003311 , Median :0.0004633 , Mean
:0.0004768 , 3rd Qu.:0.0006000 , Max. :0.0024490
boxplot(CreditCard$income, main = "Boxplot of Income" ,ylab = "Income")
boxplot(CreditCard$expenditure, main = "Boxplot of Expenditure", ylab = "Expenditure")
boxplot(CreditCard$age, main = "Boxplot of Age", ylab = "Age" )
#We see form boxplots, that there are outliers, that we have handle with
outliersIncome <- CreditCard$income[CreditCard$income > quantile(CreditCard$income, 0.75) + 1.5 * (quantile(CreditCard$income, 0.75) - quantile(CreditCard$income, 0.25)) |
CreditCard$income < quantile(CreditCard$income, 0.25) - 1.5 * (quantile(CreditCard$income, 0.75) - quantile(CreditCard$income, 0.25))]
CreditCard_clean <- CreditCard[!CreditCard$income %in% outliersIncome, ]
outliersExpenditure <- CreditCard$expenditure[CreditCard$expenditure > quantile(CreditCard$expenditure, 0.75) + 1.5 * (quantile(CreditCard$expenditure, 0.75) - quantile(CreditCard$expenditure, 0.25)) | CreditCard$expenditure < quantile(CreditCard$expenditure, 0.25) - 1.5 * (quantile(CreditCard$expenditure, 0.75) - quantile(CreditCard$expenditure, 0.25))]
CreditCard_clean <- CreditCard[!CreditCard$expenditure %in% outliersExpenditure, ]
outliersAge <- CreditCard$age[CreditCard$age > quantile(CreditCard$age, 0.75) + 1.5 * (quantile(CreditCard$age, 0.75) - quantile(CreditCard$age, 0.25)) |
CreditCard$age < quantile(CreditCard$age, 0.25) - 1.5 * (quantile(CreditCard$age, 0.75) - quantile(CreditCard$age, 0.25))]
CreditCard_clean <- CreditCard[!CreditCard$age %in% outliersAge, ]
densityIncome <- ggdensity(CreditCard_clean, x = "income",
fill = "card", palette = "jco")
densityAge <- ggdensity(CreditCard_clean, x = "age",
fill = "card", palette = "jco")
densityExpenditure <- ggdensity(CreditCard_clean, x = "expenditure",
fill = "card", palette = "jco")
CreditCard_yes <- subset(CreditCard_clean, owner == "yes")
densityIncome <- ggplot(CreditCard_yes, aes(x = income, fill = owner)) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("yes" = "yellow")) +
labs(title = "Density Plot of Income for 'yes' Owners")
CreditCard_clean$ratio <- CreditCard_clean$expenditure / (CreditCard_clean$income) /12
# histrogram
ggplot(CreditCard_clean, aes(x = ratio)) +
geom_histogram(binwidth = 0.1, fill = "yellow", color = "blue") +
labs(title = "Breakdown of the ratio of monthly credit card expenses to annual income",
x = "Ratio (monthly expenses / annual income)",
y = "How many observations") +
scale_y_continuous(limits = c(0, 20), breaks = seq(0, 20, by = 4)) +
theme_minimal()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_bar()`).
yes_data <- CreditCard_clean[CreditCard_clean$card == "yes", ]
no_data <- CreditCard_clean[CreditCard_clean$card == "no", ]
mean_income_yes <- aggregate(income ~ age, data = yes_data, FUN = mean)
mean_income_no <- aggregate(income ~ age, data = no_data, FUN = mean)
overall_mean_income <- mean(CreditCard_clean$income)
ggplot(mean_income_yes, aes(x = age, y = income)) +
geom_point() +
geom_hline(yintercept = overall_mean_income, linetype = "dashed", color = "red", size=1.5) +
geom_line(stat = "smooth", aes(group = 1), color = "blue", size=1.5) +
labs(title = "Average income for customers with credit risk",
x = "Age",
y = "Income")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(mean_income_no, aes(x = age, y = income)) +
geom_point() +
geom_hline(yintercept = overall_mean_income, linetype = "dashed", color = "red", size=1.5) +
geom_line(stat = "smooth", aes(group = 1), color = "blue", size=1.5) +
labs(title = "Average income for customers without credit risk",
x = "Age",
y = "Income")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggdensity(CreditCard_clean, x = "income",
fill = "card", palette = "jco")
ggdensity(CreditCard_clean, x = "age",
fill = "card", palette = "jco")
From the table, we observe that the mean and median of income,
expenditure, and the share of individuals with credit card risk are
slightly higher than those without risk. However, the age is greater
among individuals without risk.
The plots reveal that the income range is wider for individuals with
risk, and income tends to increase with age. Jitter plots indicate that
there are more values clustered around the mean for individuals with
risk, although the values are similar in both cases. Density plots
suggest that the distribution of individuals based on income is similar
in both cases, as well as according to age, except for individuals aged
35-37, who are more likely to be without risk