Quiz 1

The Credit dataset is in the ISLR package. Use the help system to get information about the dataset.

if (!require(tidyverse)) {
  install.packages("tidyverse")
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
if (!require(ISLR)) {
  install.packages("ISLR")
}
## Loading required package: ISLR
library("tidyverse")
library("ISLR")
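The task asks for the help page, but the transcript shows no call that opens it. A minimal sketch of one way to do so, assuming the ISLR package is installed (the `requireNamespace()` guard is an addition for safety, not part of the original solution):

```r
# Open the documentation page for the Credit dataset.
# The guard makes the snippet a no-op if ISLR is not installed.
if (requireNamespace("ISLR", quietly = TRUE)) {
  help("Credit", package = "ISLR")  # same page as ?Credit once ISLR is attached
}
```

The help page describes each column, which is useful context before reading the `str()` output.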

Display the structure of the dataset

str(Credit)
## 'data.frame':    400 obs. of  12 variables:
##  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Income   : num  14.9 106 104.6 148.9 55.9 ...
##  $ Limit    : int  3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
##  $ Rating   : int  283 483 514 681 357 569 259 512 266 491 ...
##  $ Cards    : int  2 3 4 3 2 4 2 2 5 3 ...
##  $ Age      : int  34 82 71 36 68 77 37 87 66 41 ...
##  $ Education: int  11 15 11 11 16 10 12 9 13 19 ...
##  $ Gender   : Factor w/ 2 levels " Male","Female": 1 2 1 2 1 1 2 1 2 2 ...
##  $ Student  : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
##  $ Married  : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 2 ...
##  $ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1 2 3 1 ...
##  $ Balance  : int  333 903 580 964 331 1151 203 872 279 1350 ...

How many observations are there? How many categorical variables?

data("Credit", package = "ISLR")

# Number of observations (rows)
Osservazioni <- nrow(Credit)
Osservazioni
## [1] 400
# Number of categorical variables (factor columns)
VariabiliCategoriche <- sum(sapply(Credit, is.factor))
VariabiliCategoriche
## [1] 4
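Counting with `is.factor()` works here because, according to the `str()` output, every categorical column in Credit is stored as a factor. On data where some categorical columns are plain character vectors the count would come out too low. The sketch below illustrates the difference on a small made-up data frame (`toy` and its columns are hypothetical, not part of Credit):

```r
# Toy data frame (hypothetical) mixing integer, factor, and character columns.
toy <- data.frame(
  id      = 1:3,
  student = factor(c("No", "Yes", "No")),
  city    = c("Rome", "Milan", "Turin"),
  stringsAsFactors = FALSE  # keep `city` as character on older R versions too
)

# Counts factor columns only, as in the quiz solution.
n_factor <- sum(sapply(toy, is.factor))

# Also counts character columns, which is.factor() ignores.
n_categorical <- sum(sapply(toy, function(x) is.factor(x) || is.character(x)))

n_factor       # 1
n_categorical  # 2
```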

Compute at-a-glance descriptive statistics for all variables in the dataset

summary(Credit)
##        ID            Income           Limit           Rating     
##  Min.   :  1.0   Min.   : 10.35   Min.   :  855   Min.   : 93.0  
##  1st Qu.:100.8   1st Qu.: 21.01   1st Qu.: 3088   1st Qu.:247.2  
##  Median :200.5   Median : 33.12   Median : 4622   Median :344.0  
##  Mean   :200.5   Mean   : 45.22   Mean   : 4736   Mean   :354.9  
##  3rd Qu.:300.2   3rd Qu.: 57.47   3rd Qu.: 5873   3rd Qu.:437.2  
##  Max.   :400.0   Max.   :186.63   Max.   :13913   Max.   :982.0  
##      Cards            Age          Education        Gender    Student  
##  Min.   :1.000   Min.   :23.00   Min.   : 5.00    Male :193   No :360  
##  1st Qu.:2.000   1st Qu.:41.75   1st Qu.:11.00   Female:207   Yes: 40  
##  Median :3.000   Median :56.00   Median :14.00                         
##  Mean   :2.958   Mean   :55.67   Mean   :13.45                         
##  3rd Qu.:4.000   3rd Qu.:70.00   3rd Qu.:16.00                         
##  Max.   :9.000   Max.   :98.00   Max.   :20.00                         
##  Married              Ethnicity      Balance       
##  No :155   African American: 99   Min.   :   0.00  
##  Yes:245   Asian           :102   1st Qu.:  68.75  
##            Caucasian       :199   Median : 459.50  
##                                   Mean   : 520.01  
##                                   3rd Qu.: 863.00  
##                                   Max.   :1999.00

Bonus: install the skimr package, load it, and try the skim() function on the dataset

if (!require("skimr")) {
  install.packages("skimr")
}
## Loading required package: skimr
library("skimr")

skim(Credit)
Data summary
Name                     Credit
Number of rows           400
Number of columns        12
_______________________
Column type frequency:
factor                   4
numeric                  8
________________________
Group variables          None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Gender         0          1              FALSE    2         Fem: 207, Ma: 193
Student        0          1              FALSE    2         No: 360, Yes: 40
Married        0          1              FALSE    2         Yes: 245, No: 155
Ethnicity      0          1              FALSE    3         Cau: 199, Asi: 102, Afr: 99

Variable type: numeric

skim_variable  n_missing  complete_rate  mean     sd       p0      p25      p50      p75      p100      hist
ID             0          1              200.50   115.61   1.00    100.75   200.50   300.25   400.00    ▇▇▇▇▇
Income         0          1              45.22    35.24    10.35   21.01    33.12    57.47    186.63    ▇▂▁▁▁
Limit          0          1              4735.60  2308.20  855.00  3088.00  4622.50  5872.75  13913.00  ▆▇▃▁▁
Rating         0          1              354.94   154.72   93.00   247.25   344.00   437.25   982.00    ▆▇▃▁▁
Cards          0          1              2.96     1.37     1.00    2.00     3.00     4.00     9.00      ▇▇▂▁▁
Age            0          1              55.67    17.25    23.00   41.75    56.00    70.00    98.00     ▆▇▇▇▁
Education      0          1              13.45    3.13     5.00    11.00    14.00    16.00    20.00     ▂▅▇▇▂
Balance        0          1              520.02   459.76   0.00    68.75    459.50   863.00   1999.00   ▇▅▃▂▁

Create a dataset containing only the students who have at least 2 cards

Credit2 <- Credit %>%
  filter(Student == "Yes" & Cards >= 2)
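In dplyr, `filter()` keeps the rows for which the condition is TRUE, and several conditions passed as separate arguments are combined with `&`, so `filter(Student == "Yes", Cards >= 2)` would select the same rows. For comparison, here is a base R sketch of the same row selection on a small made-up data frame (`toy` and `toy2` are hypothetical names used only for illustration):

```r
# Toy data frame (hypothetical) with the two columns used in the filter.
toy <- data.frame(
  Student = factor(c("Yes", "No", "Yes", "Yes")),
  Cards   = c(1L, 3L, 2L, 5L)
)

# Keep rows where both conditions hold: a student with at least 2 cards.
toy2 <- toy[toy$Student == "Yes" & toy$Cards >= 2, ]

nrow(toy2)  # 2
```

Only the third and fourth rows satisfy both conditions; the first row is a student with a single card and the second is not a student.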

Which of the two datasets contains more women?

Credit %>%
  filter(Gender == "Female") %>%
  summarize(count = n())
##   count
## 1   207
Credit2 %>%
  filter(Gender == "Female") %>%
  summarize(count = n())
##   count
## 1    24
print("The Credit dataset contains more women")
## [1] "The Credit dataset contains more women"
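One detail worth keeping in mind when filtering on Gender: the `str()` output lists its levels as `" Male"` (with a leading space) and `"Female"`. The comparison with `"Female"` works as written, but an analogous comparison with `"Male"` would match nothing. The sketch below reproduces the issue on a small made-up factor (`gender` is a hypothetical stand-in for `Credit$Gender`) and shows one possible fix with `trimws()`:

```r
# Toy factor (hypothetical) with the same level labels reported by str(Credit).
gender <- factor(c(" Male", "Female", " Male", "Female", "Female"))

n_before <- sum(gender == "Male")   # no level is exactly "Male"
n_exact  <- sum(gender == " Male")  # matches only with the leading space

# Trim the level labels once so later comparisons behave as expected.
levels(gender) <- trimws(levels(gender))
n_after <- sum(gender == "Male")

n_before  # 0
n_exact   # 2
n_after   # 2
```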

Print the categories of the Ethnicity variable

CategorieEthnicity <- levels(Credit$Ethnicity)
CategorieEthnicity
## [1] "African American" "Asian"            "Caucasian"