Lab 5 (Ismael Hdz)

Author

Ismael Hernandez

Introduction

In this lab, I choose two variables from the “german” dataset in the datasetsICR package. The purpose of the lab is to examine the relationship between these two variables. An interpretation is provided after completing the 7 steps.

Part 1

## THIS CODE LOADS THE GERMAN CREDIT DATASET FROM THE datasetsICR PACKAGE AND PREVIEWS THE FIRST ROWS
library(datasetsICR)
data(german)
head(german)

  Age Gender Housing Saving accounts Checking account Credit amount Duration
1  67   male     own            <NA>           little          1169        6
2  22 female     own          little         moderate          5951       48
3  49   male     own          little             <NA>          2096       12
4  45   male    free          little           little          7882       42
5  53   male    free          little           little          4870       24
6  35   male    free            <NA>             <NA>          9055       36
              Purpose Class Risk
1            radio/TV          1
2            radio/TV          2
3           education          1
4 furniture/equipment          1
5                 car          2
6           education          1

Part 1: Practice using pipes (dplyr) to summarize the data: Two categorical values

For this lab, I am going to use the Housing and Purpose variables.
Use dplyr to summarize the data by the two categorical variables and get the frequency and percent.

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

pip1 <- german %>%         
  group_by(`Housing`, Purpose) %>%
  summarize(N = n()) %>% 
  mutate(freq = N/sum(N),
         pct = round((freq*100),0))

`summarise()` has grouped output by 'Housing'. You can override using the
`.groups` argument.

pip1

# A tibble: 22 × 5
# Groups:   Housing [3]
   Housing Purpose                 N   freq   pct
   <chr>   <chr>               <int>  <dbl> <dbl>
 1 free    business                5 0.0463     5
 2 free    car                    55 0.509     51
 3 free    education              15 0.139     14
 4 free    furniture/equipment    11 0.102     10
 5 free    radio/TV               15 0.139     14
 6 free    repairs                 3 0.0278     3
 7 free    vacation/others         4 0.0370     4
 8 own     business               76 0.107     11
 9 own     car                   219 0.307     31
10 own     domestic appliances    10 0.0140     1
# ℹ 12 more rows
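
The percentages in pip1 are computed within each housing type, not over the whole dataset, because summarize() drops only the last grouping level (Purpose) and leaves the result grouped by Housing. The following is a minimal sketch of that behavior on a small toy data frame; the toy values and the names toy, within_housing, and overall are assumptions for illustration and are not taken from the german dataset.

```r
library(dplyr)

# Toy data standing in for german (hypothetical values)
toy <- data.frame(
  Housing = c("own", "own", "own", "rent", "rent"),
  Purpose = c("car", "car", "education", "car", "repairs")
)

# Percent within each housing type: the result of summarize() is still
# grouped by Housing, so sum(N) is the total for that housing type
within_housing <- toy %>%
  group_by(Housing, Purpose) %>%
  summarize(N = n(), .groups = "drop_last") %>%
  mutate(pct = round(100 * N / sum(N), 0)) %>%
  ungroup()

# Percent of the whole dataset: count() returns ungrouped data,
# so sum(N) is the overall total
overall <- toy %>%
  count(Housing, Purpose, name = "N") %>%
  mutate(pct = round(100 * N / sum(N), 0))

within_housing
overall
```

Setting .groups explicitly also silences the message that summarize() prints about grouped output.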

Use subset() inside the data= argument of ggplot() to remove rows with missing values

library(ggplot2)

p_title <- "Housing type and their purposes"
p_caption <- "german dataset"

# AS STACKED BAR CHART
p <- ggplot(data = subset(pip1, !is.na(Housing) & !is.na(Purpose)), 
                        aes(x=Housing, y=pct, fill = Purpose))

p + geom_col(position = "stack") +
    labs(x="Housing", y="Percent", fill = "Purpose",
         title = p_title, caption = p_caption, 
         subtitle = "As a stacked bar chart") +
    geom_text(aes(label=pct), position = position_stack(vjust=.5))
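
The subset() call used in data= above keeps only rows where both variables are present. As a rough sketch, the same filtering can be written with dplyr's filter(); the toy data frame and the names base_clean and dplyr_clean are assumptions for illustration.

```r
library(dplyr)

# Toy summary table with some missing categories (hypothetical values)
toy <- data.frame(
  Housing = c("own", NA, "rent", "free"),
  Purpose = c("car", "car", NA, "repairs"),
  pct = c(40, 20, 20, 20)
)

# Base R: keep rows where neither variable is NA
base_clean <- subset(toy, !is.na(Housing) & !is.na(Purpose))

# dplyr: conditions separated by commas are combined with AND
dplyr_clean <- toy %>%
  filter(!is.na(Housing), !is.na(Purpose))

base_clean
dplyr_clean
```

Either result can be passed to ggplot(data = ...). Note that filtering after the percentages are computed does not rescale them, so the remaining bars may no longer sum to 100.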

Interpretation:

In the stacked bar chart, the two variables pair well and the result is easy to interpret. Because Housing and Purpose are both categorical, dividing each housing type by purpose yields meaningful percentages, which would not be the case with a continuous variable such as Credit amount. Because borrowers in each housing type take out loans for different reasons, the Purpose variable fits well with the data provided.

Part 2: Create stacked and dodged bar charts: Two Categorical Variables

p_title <- "Housing type and their purposes"
p_caption <- "german dataset"

# AS STACKED BAR CHART
p <- ggplot(data = subset(pip1, !is.na(Housing) & !is.na(Purpose)), 
                        aes(x=Housing, y=pct, fill = Purpose))

p + geom_col(position = "stack") +
    labs(x="Housing", y="Percent", fill = "Purpose",
         title = p_title, caption = p_caption, 
         subtitle = "As a stacked bar chart") +
    geom_text(aes(label=pct), position = position_stack(vjust=.5))

# AS DODGED BAR CHART
p + geom_col(position = "dodge2") +
    labs(x="Housing", y="Percent", fill = "Purpose",
         title = p_title, caption = p_caption, 
         subtitle = "As a dodged bar chart") + 
    # Use the same dodge2 position as the bars so each label sits over its own bar
    geom_text(aes(label = pct), position = position_dodge2(width = .9), vjust = -0.3)

# AS FACETED HORIZONTAL BAR CHART
p + geom_col(position = "dodge2") +
    labs(x=NULL, y="Percent", fill = "Purpose",
         title = p_title, caption = p_caption, 
         subtitle = "As a faceted horizontal bar chart") +
         guides(fill = "none") +
         coord_flip() +
         facet_grid(~ Housing) +
    geom_text(aes(label = pct), position = position_dodge2(width = 1))
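
In the dodged charts above, the labels follow their bars only when geom_text() uses a position that matches the one used by geom_col(). One way to keep the two in sync, sketched here on toy data, is to create a single position object and pass it to both layers; the data values and the names toy, dodge, and p_dodged are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical percentages, used only to illustrate label placement
toy <- data.frame(
  Housing = rep(c("own", "rent"), each = 2),
  Purpose = rep(c("car", "education"), times = 2),
  pct = c(67, 33, 50, 50)
)

# One position object shared by the bars and the labels
dodge <- position_dodge2(width = 0.9)

p_dodged <- ggplot(toy, aes(x = Housing, y = pct, fill = Purpose)) +
  geom_col(position = dodge) +
  # vjust = -0.3 nudges each label just above the top of its bar
  geom_text(aes(label = pct), position = dodge, vjust = -0.3) +
  labs(x = "Housing", y = "Percent", fill = "Purpose")

p_dodged
```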

Part 3: Practice using pipes (dplyr) to summarize data: Two Continuous Variables and One Categorical

pip2 <- german %>%         
  group_by(Purpose) %>%
  summarize(N = n(),
            credit_mean = mean(`Credit amount`, na.rm=TRUE), 
            duration_mean = mean(Duration, na.rm=TRUE)) %>% 
  mutate(freq = N/sum(N),
         pct = round((freq*100),0))
pip2

# A tibble: 8 × 6
  Purpose                 N credit_mean duration_mean  freq   pct
  <chr>               <int>       <dbl>         <dbl> <dbl> <dbl>
1 business               97       4158.          26.9 0.097    10
2 car                   337       3768.          20.8 0.337    34
3 domestic appliances    12       1498           16.8 0.012     1
4 education              59       2879.          19.7 0.059     6
5 furniture/equipment   181       3067.          19.3 0.181    18
6 radio/TV              280       2488.          20.0 0.28     28
7 repairs                22       2728.          19.1 0.022     2
8 vacation/others        12       8209.          32.3 0.012     1
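
Before interpreting the summary, it can help to rank the purposes by frequency and by average credit amount. The sketch below re-enters the values printed in the pip2 output above (means rounded as displayed) into a data frame; the name pip2_tbl and the helper objects are assumptions for illustration, and the same calls could be applied to pip2 directly.

```r
library(dplyr)

# Values copied from the printed pip2 table (means rounded as displayed)
pip2_tbl <- data.frame(
  Purpose = c("business", "car", "domestic appliances", "education",
              "furniture/equipment", "radio/TV", "repairs", "vacation/others"),
  N = c(97, 337, 12, 59, 181, 280, 22, 12),
  credit_mean = c(4158, 3768, 1498, 2879, 3067, 2488, 2728, 8209),
  duration_mean = c(26.9, 20.8, 16.8, 19.7, 19.3, 20.0, 19.1, 32.3)
)

# Most frequent purpose
most_frequent <- pip2_tbl %>% slice_max(N, n = 1)

# Purposes with the largest and smallest average credit amount
largest_credit <- pip2_tbl %>% slice_max(credit_mean, n = 1)
smallest_credit <- pip2_tbl %>% slice_min(credit_mean, n = 1)

# Full ranking by average credit amount, largest first
by_credit <- pip2_tbl %>% arrange(desc(credit_mean))

most_frequent
largest_credit
smallest_credit
by_credit
```

Keeping N next to the means makes it easier to see when an average rests on only a handful of loans.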

Part 4: Create a scatterplot: Two Continuous Variables and One Categorical

p <- ggplot(pip2, aes(x=credit_mean, y=duration_mean, color=Purpose))
p + geom_point(size=5) +
    # annotate() coordinates are in data units, so they are placed inside the plotted range
    annotate(geom = "text", x = 1500, y = 34, 
                     label = "Average credit amount and average duration \n for each loan purpose", hjust=0) +
    labs(y="Duration in months", x="Credit amount", 
         title="Credit amount and duration", 
         subtitle = "Average loan duration by average credit amount for each purpose",
         caption = "german dataset {datasetsICR}")
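
Because annotate() takes its x and y positions in data units, a note is easiest to place when its coordinates are derived from the plotted values. The following is a minimal sketch under that assumption, using the group means printed in the pip2 output (rounded as displayed); the names pip2_tbl, note_x, note_y, and p_note are hypothetical.

```r
library(ggplot2)

# Group means copied from the printed pip2 table (rounded as displayed)
pip2_tbl <- data.frame(
  Purpose = c("business", "car", "domestic appliances", "education",
              "furniture/equipment", "radio/TV", "repairs", "vacation/others"),
  credit_mean = c(4158, 3768, 1498, 2879, 3067, 2488, 2728, 8209),
  duration_mean = c(26.9, 20.8, 16.8, 19.7, 19.3, 20.0, 19.1, 32.3)
)

# Anchor the note at the left edge of the data, slightly above the highest point
note_x <- min(pip2_tbl$credit_mean)
note_y <- max(pip2_tbl$duration_mean) + 2

p_note <- ggplot(pip2_tbl, aes(x = credit_mean, y = duration_mean, color = Purpose)) +
  geom_point(size = 5) +
  annotate(geom = "text", x = note_x, y = note_y, hjust = 0,
           label = "Average credit amount and duration by loan purpose") +
  labs(x = "Credit amount", y = "Duration in months")

p_note
```

Coordinates far outside the data, such as a y value well above the longest duration, stretch the axes and compress the points into one corner of the plot.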

Part 5: Legends and guides

p <- ggplot(pip2, aes(x=credit_mean, y=duration_mean, color=Purpose))
p + geom_point(size=5) +
    # annotate() coordinates are in data units, so they are placed inside the plotted range
    annotate(geom = "text", x = 1500, y = 34, 
                     label = "Average credit amount and average duration \n for each loan purpose", hjust=0) +
    labs(y="Credit duration in months", x="Credit amount", 
         title="Credit amount in Deutsche Mark and its duration", 
         subtitle = "Average loan duration by average credit amount for each purpose",
         caption = "german dataset {datasetsICR}")

Part 6: Data Labels

p <- ggplot(pip2, aes(x=credit_mean, y=duration_mean, color=Purpose))
p + geom_point(size=5) +
    geom_text(mapping = aes(label=Purpose), hjust=1.2, size=3) +
    # annotate() coordinates are in data units, so they are placed inside the plotted range
    annotate(geom = "text", x = 1500, y = 34, 
                     label = "Average credit amount and average duration \n for each loan purpose.", hjust=0) +
    labs(y="Credit duration in months", x="Credit amount", 
         title="Credit amount in Deutsche Mark and its duration",
         subtitle = "Average loan duration by average credit amount for each purpose",
         caption = "german dataset {datasetsICR}",
         color = "Purpose") +
    # The points are labeled directly, so the legend is hidden
    theme(legend.position = "none")

Interpretation.

The graph shows the average credit amount and average duration for each loan purpose in the german dataset. Loans in the vacation/others category have the highest average credit amount and the longest average duration, about 8,200 Deutsche Marks repaid over about 32 months, but they are also among the least frequent, with only 12 loans. Car loans are the most frequent (337 loans, 34% of the data). In contrast, loans for domestic appliances have the lowest average credit amount (about 1,500 Deutsche Marks) and the shortest average duration (about 17 months). For furniture/equipment, cars, education, radio/TV, and repairs, the average credit amounts and durations fall between these two extremes, with only minor variations. Business loans sit somewhat higher, at about 4,200 Deutsche Marks over about 27 months. Overall, the data suggest that borrowers in this sample take on larger, longer-term debt for vacation/others than for basic household purchases. This could indicate that discretionary expenses such as travel are considered worth long-term financing, but the conclusion is tentative: the category contains only 12 loans and also includes purposes other than vacations, so the graph alone cannot establish German cultural values or priorities.