Quiz 1
The Credit dataset is in the ISLR package. Use the help to get information about the dataset.
if (!require(tidyverse)) {
  install.packages("tidyverse")
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
if (!require(ISLR)) {
  install.packages("ISLR")
}
## Loading required package: ISLR
library("tidyverse")
library("ISLR")
?Credit  # open the help page for the dataset
Display the structure of the dataset
str(Credit)
## 'data.frame': 400 obs. of 12 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Income : num 14.9 106 104.6 148.9 55.9 ...
## $ Limit : int 3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
## $ Rating : int 283 483 514 681 357 569 259 512 266 491 ...
## $ Cards : int 2 3 4 3 2 4 2 2 5 3 ...
## $ Age : int 34 82 71 36 68 77 37 87 66 41 ...
## $ Education: int 11 15 11 11 16 10 12 9 13 19 ...
## $ Gender : Factor w/ 2 levels " Male","Female": 1 2 1 2 1 1 2 1 2 2 ...
## $ Student : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
## $ Married : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 2 ...
## $ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1 2 3 1 ...
## $ Balance : int 333 903 580 964 331 1151 203 872 279 1350 ...
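One detail in the output above is easy to miss: the first level of Gender is printed as " Male", with a leading space. A comparison such as Gender == "Male" would therefore match no rows. The sketch below reproduces the issue on a small made-up vector (not the Credit data) and shows one possible fix with trimws(); the names gender, before and after are assumptions used only for illustration.

```r
# Toy factor mimicking the two levels shown by str(); not the real data
gender <- factor(c(" Male", "Female", " Male"),
                 levels = c(" Male", "Female"))

# The naive comparison finds nothing because of the leading space
before <- sum(gender == "Male")

# Strip surrounding whitespace from the level labels
levels(gender) <- trimws(levels(gender))

# Now the comparison matches the two male entries
after <- sum(gender == "Male")

before
after
levels(gender)
```

The same levels() assignment could be applied to Credit$Gender before filtering on "Male"; filtering on "Female", as done later in this quiz, is not affected.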
How many observations are there? How many categorical variables?
data("Credit",package="ISLR")
Osservazioni<-nrow(Credit)
Osservazioni
## [1] 400
VariabiliCategoriche<-sum(sapply(Credit,is.factor))
VariabiliCategoriche
## [1] 4
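Counting with is.factor works here because every categorical column of Credit is stored as a factor. As a hedged sketch of the more general case, the toy data frame below (hypothetical, not the Credit data) also has a character column, which is.factor alone would miss; the names toy, n_factor and n_categorical are assumptions for illustration.

```r
# Hypothetical toy data frame: one factor column and one character column
toy <- data.frame(
  income  = c(14.9, 106.0, 104.6),
  cards   = c(2L, 3L, 4L),
  student = factor(c("No", "Yes", "No")),
  city    = c("Rome", "Milan", "Turin"),
  stringsAsFactors = FALSE
)

# Same approach as in the quiz: count only factor columns
n_factor <- sum(sapply(toy, is.factor))

# Broader count: treat character columns as categorical too
n_categorical <- sum(sapply(toy, function(col) {
  is.factor(col) || is.character(col)
}))

n_factor
n_categorical
```

On this toy data the first count sees only student, while the second also picks up city.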
Get the descriptive statistics for all variables in the dataset at a glance
summary(Credit)
## ID Income Limit Rating
## Min. : 1.0 Min. : 10.35 Min. : 855 Min. : 93.0
## 1st Qu.:100.8 1st Qu.: 21.01 1st Qu.: 3088 1st Qu.:247.2
## Median :200.5 Median : 33.12 Median : 4622 Median :344.0
## Mean :200.5 Mean : 45.22 Mean : 4736 Mean :354.9
## 3rd Qu.:300.2 3rd Qu.: 57.47 3rd Qu.: 5873 3rd Qu.:437.2
## Max. :400.0 Max. :186.63 Max. :13913 Max. :982.0
## Cards Age Education Gender Student
## Min. :1.000 Min. :23.00 Min. : 5.00 Male :193 No :360
## 1st Qu.:2.000 1st Qu.:41.75 1st Qu.:11.00 Female:207 Yes: 40
## Median :3.000 Median :56.00 Median :14.00
## Mean :2.958 Mean :55.67 Mean :13.45
## 3rd Qu.:4.000 3rd Qu.:70.00 3rd Qu.:16.00
## Max. :9.000 Max. :98.00 Max. :20.00
## Married Ethnicity Balance
## No :155 African American: 99 Min. : 0.00
## Yes:245 Asian :102 1st Qu.: 68.75
## Caucasian :199 Median : 459.50
## Mean : 520.01
## 3rd Qu.: 863.00
## Max. :1999.00
Bonus: install the skimr package, load it, and try the skim() function on the dataset
if (!require("skimr")) {
  install.packages("skimr")
}
## Loading required package: skimr
library("skimr")
skim(Credit)
| Name | Credit |
| Number of rows | 400 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| factor | 4 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Gender | 0 | 1 | FALSE | 2 | Fem: 207, Ma: 193 |
| Student | 0 | 1 | FALSE | 2 | No: 360, Yes: 40 |
| Married | 0 | 1 | FALSE | 2 | Yes: 245, No: 155 |
| Ethnicity | 0 | 1 | FALSE | 3 | Cau: 199, Asi: 102, Afr: 99 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 200.50 | 115.61 | 1.00 | 100.75 | 200.50 | 300.25 | 400.00 | ▇▇▇▇▇ |
| Income | 0 | 1 | 45.22 | 35.24 | 10.35 | 21.01 | 33.12 | 57.47 | 186.63 | ▇▂▁▁▁ |
| Limit | 0 | 1 | 4735.60 | 2308.20 | 855.00 | 3088.00 | 4622.50 | 5872.75 | 13913.00 | ▆▇▃▁▁ |
| Rating | 0 | 1 | 354.94 | 154.72 | 93.00 | 247.25 | 344.00 | 437.25 | 982.00 | ▆▇▃▁▁ |
| Cards | 0 | 1 | 2.96 | 1.37 | 1.00 | 2.00 | 3.00 | 4.00 | 9.00 | ▇▇▂▁▁ |
| Age | 0 | 1 | 55.67 | 17.25 | 23.00 | 41.75 | 56.00 | 70.00 | 98.00 | ▆▇▇▇▁ |
| Education | 0 | 1 | 13.45 | 3.13 | 5.00 | 11.00 | 14.00 | 16.00 | 20.00 | ▂▅▇▇▂ |
| Balance | 0 | 1 | 520.02 | 459.76 | 0.00 | 68.75 | 459.50 | 863.00 | 1999.00 | ▇▅▃▂▁ |
Create a dataset with only the students who have at least 2 cards
Credit2 <- Credit %>%
  filter(Student == "Yes" & Cards >= 2)
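For comparison, the same kind of row selection can be sketched in base R with logical indexing, without dplyr. The data frame toy below is made up for illustration (it is not the Credit data), and toy2 is a hypothetical name.

```r
# Hypothetical toy data with the two columns used in the condition
toy <- data.frame(
  Student = factor(c("Yes", "No", "Yes", "Yes")),
  Cards   = c(1L, 3L, 2L, 5L)
)

# Keep rows where both conditions hold: student AND at least 2 cards
keep <- toy$Student == "Yes" & toy$Cards >= 2
toy2 <- toy[keep, ]

nrow(toy2)
```

One difference worth keeping in mind: if a condition evaluates to NA, logical indexing returns a row of NAs, whereas filter() drops that row. The toy data has no missing values, so both approaches would agree here.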
Which of the two datasets contains more women?
Credit %>%
  filter(Gender == "Female") %>%
  summarize(count = n())
## count
## 1 207
Credit2 %>%
  filter(Gender == "Female") %>%
  summarize(count = n())
## count
## 1 24
print("The Credit dataset contains more women")
## [1] "The Credit dataset contains more women"
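The answer above is written by hand after reading the two counts. As a minimal sketch, the comparison can also be computed, so the conclusion follows from the data. The example uses made-up data frames named full and sub (assumptions, not the Credit data), with sub built as a subset of full in the same way Credit2 is a subset of Credit.

```r
# Hypothetical full dataset and a subset of it
full <- data.frame(
  Gender  = factor(c("Female", "Female", "Male", "Female", "Male")),
  Student = factor(c("Yes", "No", "Yes", "No", "No"))
)
sub <- full[full$Student == "Yes", ]

# Count women in each dataset and store the counts in a named vector
counts <- c(
  full = sum(full$Gender == "Female"),
  sub  = sum(sub$Gender == "Female")
)

# Name of the dataset with the larger count
winner <- names(which.max(counts))

counts
winner
```

Since a subset can never contain more rows of any kind than the data it was taken from, the full dataset always has at least as many women. Comparing proportions instead of raw counts would be a more informative variant.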
Print the categories of the Ethnicity variable
CategorieEthnicity <- levels(Credit$Ethnicity)
CategorieEthnicity
## [1] "African American" "Asian" "Caucasian"
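levels() reports the categories a factor is declared to have, not the ones actually present in the data. The sketch below illustrates the difference on a small made-up factor (the names eth, eth_sub and the values are assumptions, not the Credit data): after dropping all rows of one category, the level remains until droplevels() is applied.

```r
# Toy factor with three categories
eth <- factor(c("Asian", "Caucasian", "Asian", "African American"))

n_levels <- nlevels(eth)   # number of declared categories
freq <- table(eth)         # how many observations fall in each category

# Remove every "Asian" entry: the level is still declared
eth_sub <- eth[eth != "Asian"]
levels_kept <- levels(eth_sub)

# droplevels() removes levels with no observations
levels_used <- levels(droplevels(eth_sub))

n_levels
freq
levels_kept
levels_used
```

This matters for filtered data such as Credit2: a category with no rows left would still be listed by levels() unless unused levels are dropped first.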