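This post shows how to find missing values in a dataset with R and Python.
The data describe heroes from DC Comics, Marvel Comics, and other publishers, and come from the article linked below.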
DC Comics vs Marvel Comics — Exploratory Data Analysis and Data Visualization with R
Load the required libraries and read in the data.
# import libraries
library(tidyverse)
library(visdat)
library(naniar)
# load the data as tibble
characters <- read_csv("https://raw.githubusercontent.com/cosmoduende/r-marvel-vs-dc/main/dataset_shdb/heroesInformation.csv")
## Warning: Missing column names filled in: 'X1' [1]
# show the data
characters
## # A tibble: 734 x 11
## X1 name Gender `Eye color` Race `Hair color` Height Publisher
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 0 A-Bo… Male yellow Human No Hair 203 Marvel C…
## 2 1 Abe … Male blue Icth… No Hair 191 Dark Hor…
## 3 2 Abin… Male blue Unga… No Hair 185 DC Comics
## 4 3 Abom… Male green Huma… No Hair 203 Marvel C…
## 5 4 Abra… Male blue Cosm… Black -99 Marvel C…
## 6 5 Abso… Male blue Human No Hair 193 Marvel C…
## 7 6 Adam… Male blue - Blond -99 NBC - He…
## 8 7 Adam… Male blue Human Blond 185 DC Comics
## 9 8 Agen… Female blue - Blond 173 Marvel C…
## 10 9 Agen… Male brown Human Brown 178 Marvel C…
## # … with 724 more rows, and 3 more variables: `Skin color` <chr>,
## # Alignment <chr>, Weight <dbl>
Display the column names. There are 11 columns. The first column has no name in the file, so it was given the name "X1".
colnames(characters)
## [1] "X1" "name" "Gender" "Eye color" "Race"
## [6] "Hair color" "Height" "Publisher" "Skin color" "Alignment"
## [11] "Weight"
Plot the missing values with visdat. The plot shows that Publisher and Weight contain missing values (NA). When read_csv loaded the file, blank cells were converted to NA.
vis_miss(characters)
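Check whether anything other than blank cells stands for a missing value. It is not an exhaustive check, but glimpse gives a quick overview. The values "-99" and "-" appear to be used as missing values.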
glimpse(characters)
## Rows: 734
## Columns: 11
## $ X1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ name <chr> "A-Bomb", "Abe Sapien", "Abin Sur", "Abomination", "Abra…
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", …
## $ `Eye color` <chr> "yellow", "blue", "blue", "green", "blue", "blue", "blue…
## $ Race <chr> "Human", "Icthyo Sapien", "Ungaran", "Human / Radiation"…
## $ `Hair color` <chr> "No Hair", "No Hair", "No Hair", "No Hair", "Black", "No…
## $ Height <dbl> 203, 191, 185, 203, -99, 193, -99, 185, 173, 178, 191, 1…
## $ Publisher <chr> "Marvel Comics", "Dark Horse Comics", "DC Comics", "Marv…
## $ `Skin color` <chr> "-", "blue", "red", "-", "-", "-", "-", "-", "-", "-", "…
## $ Alignment <chr> "good", "good", "good", "bad", "bad", "bad", "good", "go…
## $ Weight <dbl> 441, 65, 90, 441, -99, 122, -99, 88, 61, 81, 104, 108, 9…
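These values can be replaced with NA using naniar. Alternatively, re-read the file with read_csv("filename", na = c("", "NA", "-99", "-")), which is simpler, provided the strings match how the values are written in the file. Keeping "" and "NA" in the vector preserves the default handling of blank cells.
# replace all instances of -99 or "-" with NA
characters_na <- replace_with_na_all(characters,
                                     condition = ~ .x %in% c(-99, "-"))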
characters_na
## # A tibble: 734 x 11
## X1 name Gender `Eye color` Race `Hair color` Height Publisher
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 0 A-Bo… Male yellow Human No Hair 203 Marvel C…
## 2 1 Abe … Male blue Icth… No Hair 191 Dark Hor…
## 3 2 Abin… Male blue Unga… No Hair 185 DC Comics
## 4 3 Abom… Male green Huma… No Hair 203 Marvel C…
## 5 4 Abra… Male blue Cosm… Black NA Marvel C…
## 6 5 Abso… Male blue Human No Hair 193 Marvel C…
## 7 6 Adam… Male blue <NA> Blond NA NBC - He…
## 8 7 Adam… Male blue Human Blond 185 DC Comics
## 9 8 Agen… Female blue <NA> Blond 173 Marvel C…
## 10 9 Agen… Male brown Human Brown 178 Marvel C…
## # … with 724 more rows, and 3 more variables: `Skin color` <chr>,
## # Alignment <chr>, Weight <dbl>
Plot the missing values again.
vis_miss(characters_na)
Check the counts and percentages as numbers.
miss_var_summary(characters_na)
## # A tibble: 11 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 Skin color 662 90.2
## 2 Race 304 41.4
## 3 Weight 239 32.6
## 4 Height 217 29.6
## 5 Eye color 172 23.4
## 6 Hair color 172 23.4
## 7 Gender 29 3.95
## 8 Publisher 15 2.04
## 9 Alignment 7 0.954
## 10 X1 0 0
## 11 name 0 0
Next, in Python, load the same data as in the R part. While reading, blank cells, "-", and "-99" are converted to NaN.
# import libraries
import pandas as pd
from matplotlib import pyplot as plt
import klib
# load the data with NaN
missing_values = ['-', "-99"]
df0 = pd.read_csv('https://raw.githubusercontent.com/cosmoduende/r-marvel-vs-dc/main/dataset_shdb/heroesInformation.csv', na_values = missing_values)
print(df0)
## Unnamed: 0 name Gender ... Skin color Alignment Weight
## 0 0 A-Bomb Male ... NaN good 441.0
## 1 1 Abe Sapien Male ... blue good 65.0
## 2 2 Abin Sur Male ... red good 90.0
## 3 3 Abomination Male ... NaN bad 441.0
## 4 4 Abraxas Male ... NaN bad NaN
## .. ... ... ... ... ... ... ...
## 729 729 Yellowjacket II Female ... NaN good 52.0
## 730 730 Ymir Male ... white good NaN
## 731 731 Yoda Male ... green good 17.0
## 732 732 Zatanna Female ... NaN good 57.0
## 733 733 Zoom Male ... NaN bad 81.0
##
## [734 rows x 11 columns]
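If the data have already been loaded without `na_values`, the placeholders can also be replaced afterwards, much like `replace_with_na_all` in R. The following is a minimal sketch on a tiny inline sample. Apart from the A-Bomb row, the rows and the names `Hero-X` and `Hero-Y` are made up for illustration and are not taken from the real file.

```python
import io

import numpy as np
import pandas as pd

# Tiny inline sample in the same shape as the hero data.
# Hero-X and Hero-Y are hypothetical rows, used only for illustration.
csv_text = """name,Race,Height,Weight
A-Bomb,Human,203.0,441.0
Hero-X,-,-99.0,-99.0
Hero-Y,Human,-99.0,81.0
"""

# Option 1: declare the placeholders while reading
df_read = pd.read_csv(io.StringIO(csv_text), na_values=["-", "-99"])

# Option 2: read the file as-is, then replace the placeholders afterwards
df_raw = pd.read_csv(io.StringIO(csv_text))
df_replaced = df_raw.replace(["-", -99], np.nan)

# Both approaches should give the same number of NaN values per column
print(df_read.isna().sum())
print(df_replaced.isna().sum())
```

Only cells that are exactly "-" are replaced, so a name such as "A-Bomb" that merely contains a hyphen is left untouched.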
Display the number of missing values in each column.
# count missing values
df0.isnull().sum()
## Unnamed: 0 0
## name 0
## Gender 29
## Eye color 172
## Race 304
## Hair color 172
## Height 217
## Publisher 15
## Skin color 662
## Alignment 7
## Weight 239
## dtype: int64
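To get a table closer to the `miss_var_summary` output in the R part, the counts can be combined with percentages and sorted. This is a rough sketch on a small made-up frame. The rows and the column name choices `n_miss` and `pct_miss` simply mirror the naniar output and are not part of any pandas API.

```python
import numpy as np
import pandas as pd

# Small made-up frame (illustrative only, not the real hero data)
df = pd.DataFrame({
    "name": ["hero_a", "hero_b", "hero_c", "hero_d"],
    "Race": ["Human", "Human", np.nan, np.nan],
    "Height": [203.0, np.nan, np.nan, 173.0],
    "Skin color": [np.nan, np.nan, np.nan, "blue"],
})

# Count and percentage of missing values per column, largest first
summary = pd.DataFrame({
    "n_miss": df.isna().sum(),
    "pct_miss": df.isna().mean() * 100,
}).sort_values("n_miss", ascending=False)

print(summary)
```

On the real data, replacing `df` with `df0` should give the same counts as `df0.isnull().sum()`, with the percentage of missing rows alongside.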
Plot the missing values with klib.
# create a graph for missing values
klib.missingval_plot(df0)
## GridSpec(6, 6)
plt.show()
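If klib is not available, a simple picture of where values are missing can be drawn with matplotlib alone, roughly in the spirit of `vis_miss`. This is only a sketch on a small made-up frame. The output file name `missing_values.png` is an arbitrary choice, and the plot is much plainer than the klib one.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Small made-up frame (illustrative only, not the real hero data)
df = pd.DataFrame({
    "name": ["hero_a", "hero_b", "hero_c", "hero_d"],
    "Race": ["Human", np.nan, np.nan, "Human"],
    "Height": [203.0, np.nan, np.nan, 173.0],
})

# 1 where a cell is missing, 0 otherwise
mask = df.isna().to_numpy().astype(int)

fig, ax = plt.subplots(figsize=(4, 3))
# Rows on the y-axis, columns on the x-axis; missing cells are drawn dark
ax.imshow(mask, aspect="auto", cmap="gray_r", interpolation="none")
ax.set_xticks(range(df.shape[1]))
ax.set_xticklabels(df.columns, rotation=45, ha="right")
ax.set_ylabel("row")
fig.tight_layout()
fig.savefig("missing_values.png")  # arbitrary file name
```

In an interactive session, `plt.show()` can be called instead of saving the figure.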
To be continued.