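I. Finding missing values

This post shows how to find missing values in a dataset using R and Python.

The data, linked below, describes heroes from DC Comics, Marvel Comics, and other publishers.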

DC Comics vs Marvel Comics — Exploratory Data Analysis and Data Visualization with R

II. In R

Import the required libraries and load the data.

# import libraries
library(tidyverse)
library(visdat)
library(naniar)
# load the data as tibble
characters <- read_csv("https://raw.githubusercontent.com/cosmoduende/r-marvel-vs-dc/main/dataset_shdb/heroesInformation.csv")
## Warning: Missing column names filled in: 'X1' [1]
# show the data
characters
## # A tibble: 734 x 11
##       X1 name  Gender `Eye color` Race  `Hair color` Height Publisher
##    <dbl> <chr> <chr>  <chr>       <chr> <chr>         <dbl> <chr>    
##  1     0 A-Bo… Male   yellow      Human No Hair         203 Marvel C…
##  2     1 Abe … Male   blue        Icth… No Hair         191 Dark Hor…
##  3     2 Abin… Male   blue        Unga… No Hair         185 DC Comics
##  4     3 Abom… Male   green       Huma… No Hair         203 Marvel C…
##  5     4 Abra… Male   blue        Cosm… Black           -99 Marvel C…
##  6     5 Abso… Male   blue        Human No Hair         193 Marvel C…
##  7     6 Adam… Male   blue        -     Blond           -99 NBC - He…
##  8     7 Adam… Male   blue        Human Blond           185 DC Comics
##  9     8 Agen… Female blue        -     Blond           173 Marvel C…
## 10     9 Agen… Male   brown       Human Brown           178 Marvel C…
## # … with 724 more rows, and 3 more variables: `Skin color` <chr>,
## #   Alignment <chr>, Weight <dbl>

Display the column names. There are 11 columns. The first column has no name in the file, so read_csv named it "X1".

colnames(characters)
##  [1] "X1"         "name"       "Gender"     "Eye color"  "Race"      
##  [6] "Hair color" "Height"     "Publisher"  "Skin color" "Alignment" 
## [11] "Weight"

Plot the missing values with visdat. The plot shows that Publisher and Weight contain missing values (NA). Blank cells were converted to NA when the file was read with read_csv.

vis_miss(characters)

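Next, check whether missing values are encoded in some way other than blank cells. glimpse is not an exhaustive check, because it shows only the first few values of each column, but it is enough here: "-99" and "-" appear to stand for missing values.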

glimpse(characters)
## Rows: 734
## Columns: 11
## $ X1           <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ name         <chr> "A-Bomb", "Abe Sapien", "Abin Sur", "Abomination", "Abra…
## $ Gender       <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", …
## $ `Eye color`  <chr> "yellow", "blue", "blue", "green", "blue", "blue", "blue…
## $ Race         <chr> "Human", "Icthyo Sapien", "Ungaran", "Human / Radiation"…
## $ `Hair color` <chr> "No Hair", "No Hair", "No Hair", "No Hair", "Black", "No…
## $ Height       <dbl> 203, 191, 185, 203, -99, 193, -99, 185, 173, 178, 191, 1…
## $ Publisher    <chr> "Marvel Comics", "Dark Horse Comics", "DC Comics", "Marv…
## $ `Skin color` <chr> "-", "blue", "red", "-", "-", "-", "-", "-", "-", "-", "…
## $ Alignment    <chr> "good", "good", "good", "bad", "bad", "bad", "good", "go…
## $ Weight       <dbl> 441, 65, 90, 441, -99, 122, -99, 88, 61, 81, 104, 108, 9…
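
A more systematic check than reading the first few values is to count how often suspected placeholders occur in each column. The following is a minimal sketch in Python with pandas (the library used in Section III), run on a tiny made-up sample rather than the real file. The helper name `sentinel_counts` and the list of candidate placeholders are assumptions for illustration.

```python
import io
import pandas as pd

# Tiny made-up sample mimicking the layout of the heroes file (not real rows).
SAMPLE_CSV = """name,Race,Height,Weight
Hero A,Human,203,441
Hero B,-,-99,-99
Hero C,Alien,185,
Hero D,-,-99,90
"""

def sentinel_counts(df, candidates=("-", "-99", "-99.0")):
    """Count, per column, how many cells equal one of the candidate placeholders."""
    # Compare on the string form so numeric and text columns are handled alike.
    as_text = df.astype(str)
    return as_text.isin(list(candidates)).sum()

# dtype=str with keep_default_na=False loads every cell as raw text,
# so nothing is silently converted to NaN before we look at it.
raw = pd.read_csv(io.StringIO(SAMPLE_CSV), dtype=str, keep_default_na=False)
counts = sentinel_counts(raw)
print(counts)
```

Columns with a nonzero count are the ones where a placeholder is likely hiding missing data.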

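These values can be replaced with NA using naniar. Alternatively, reload the file with read_csv("filename", na = c("", "NA", "-99", "-")), which is simpler. Keeping "" and "NA" in the vector ensures that blank cells are still read as NA.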

# replace all instances of -99 or "-" with NA
replace_with_na_all(characters, 
                    condition = ~.x %in% c(-99, "-")) -> characters_na
characters_na
## # A tibble: 734 x 11
##       X1 name  Gender `Eye color` Race  `Hair color` Height Publisher
##    <dbl> <chr> <chr>  <chr>       <chr> <chr>         <dbl> <chr>    
##  1     0 A-Bo… Male   yellow      Human No Hair         203 Marvel C…
##  2     1 Abe … Male   blue        Icth… No Hair         191 Dark Hor…
##  3     2 Abin… Male   blue        Unga… No Hair         185 DC Comics
##  4     3 Abom… Male   green       Huma… No Hair         203 Marvel C…
##  5     4 Abra… Male   blue        Cosm… Black            NA Marvel C…
##  6     5 Abso… Male   blue        Human No Hair         193 Marvel C…
##  7     6 Adam… Male   blue        <NA>  Blond            NA NBC - He…
##  8     7 Adam… Male   blue        Human Blond           185 DC Comics
##  9     8 Agen… Female blue        <NA>  Blond           173 Marvel C…
## 10     9 Agen… Male   brown       Human Brown           178 Marvel C…
## # … with 724 more rows, and 3 more variables: `Skin color` <chr>,
## #   Alignment <chr>, Weight <dbl>

Plot the missing values again.

vis_miss(characters_na)

Confirm the counts numerically.

miss_var_summary(characters_na)
## # A tibble: 11 x 3
##    variable   n_miss pct_miss
##    <chr>       <int>    <dbl>
##  1 Skin color    662   90.2  
##  2 Race          304   41.4  
##  3 Weight        239   32.6  
##  4 Height        217   29.6  
##  5 Eye color     172   23.4  
##  6 Hair color    172   23.4  
##  7 Gender         29    3.95 
##  8 Publisher      15    2.04 
##  9 Alignment       7    0.954
## 10 X1              0    0    
## 11 name            0    0

III. In Python

Load the same data as in Section II. While loading, blank cells, "-", and "-99" are converted to NaN.

# import libraries
import pandas as pd
from matplotlib import pyplot as plt
import klib
# load the data with NaN
missing_values = ['-', '-99']
df0 = pd.read_csv('https://raw.githubusercontent.com/cosmoduende/r-marvel-vs-dc/main/dataset_shdb/heroesInformation.csv', na_values = missing_values)
print(df0)
##      Unnamed: 0             name  Gender  ... Skin color Alignment Weight
## 0             0           A-Bomb    Male  ...        NaN      good  441.0
## 1             1       Abe Sapien    Male  ...       blue      good   65.0
## 2             2         Abin Sur    Male  ...        red      good   90.0
## 3             3      Abomination    Male  ...        NaN       bad  441.0
## 4             4          Abraxas    Male  ...        NaN       bad    NaN
## ..          ...              ...     ...  ...        ...       ...    ...
## 729         729  Yellowjacket II  Female  ...        NaN      good   52.0
## 730         730             Ymir    Male  ...      white      good    NaN
## 731         731             Yoda    Male  ...      green      good   17.0
## 732         732          Zatanna  Female  ...        NaN      good   57.0
## 733         733             Zoom    Male  ...        NaN       bad   81.0
## 
## [734 rows x 11 columns]
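
If the data has already been loaded without `na_values`, the placeholders can also be replaced afterwards, much like the naniar approach in Section II. The following is a minimal sketch on a made-up DataFrame; the names `loaded` and `cleaned` and the sample rows are assumptions for illustration, not part of the original workflow.

```python
import pandas as pd

# Made-up stand-in for an already-loaded table (not rows from the real file).
loaded = pd.DataFrame({
    "name": ["Hero A", "Hero B", "Hero C"],
    "Race": ["Human", "-", "Alien"],
    "Height": [203.0, -99.0, 185.0],
})

# Replace the text placeholder "-" and the numeric placeholder -99
# with NaN in every column where they occur.
cleaned = loaded.replace(["-", -99], float("nan"))
print(cleaned.isna().sum())
```

Setting `na_values` at load time remains the shorter route; replacing afterwards is mainly useful when the raw values need to be inspected first.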

Display the number of missing values in each column.

# count missing values
df0.isnull().sum()
## Unnamed: 0      0
## name            0
## Gender         29
## Eye color     172
## Race          304
## Hair color    172
## Height        217
## Publisher      15
## Skin color    662
## Alignment       7
## Weight        239
## dtype: int64
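
The R output from `miss_var_summary` also reports percentages and sorts the columns by how much is missing. A rough pandas equivalent might look like the sketch below. The helper name `missing_summary` and the toy data are assumptions for illustration; the column labels `n_miss` and `pct_miss` are borrowed from the naniar output.

```python
import pandas as pd

# Made-up stand-in for df0; the values are illustrative only.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "Race": ["Human", None, None, "Alien"],
    "Height": [203.0, None, 185.0, 178.0],
})

def missing_summary(df):
    """Return per-column missing counts and percentages, most-missing first."""
    summary = pd.DataFrame({
        "n_miss": df.isna().sum(),
        "pct_miss": df.isna().mean() * 100,
    })
    return summary.sort_values("n_miss", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df0)` on the loaded heroes data should give the same counts as `df0.isnull().sum()`, with the percentage alongside.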

Plot the missing values with klib.

# create a graph for missing values
klib.missingval_plot(df0)
## GridSpec(6, 6)
plt.show()

To be continued.