This project provides a comprehensive analysis involving data cleaning, transformation, and visualization for three datasets: Pokémon competitive analysis, Dungeons & Dragons (DND) characters, and NYC Gifted and Talented Grades.

Load packages

library(readr)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(janitor)
## Warning: package 'janitor' was built under R version 4.4.3
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ purrr     1.0.4
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)
library(readxl)
library(ggplot2)

Creating paths for the datasets

# One variable per dataset. The original reused the name white.csv twice and
# assigned a string to write_tsv, which is also the name of a readr function.
pokemon_path <- "C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\pokemon_competitive_analysis.csv"

dnd_path <- "C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\dnd_chars_all.tsv.txt"

nyc_path <- "C:\\Users\\wduro\\Downloads\\NYC Gifted and Talented Grades 2018-19 - Sheet5.csv"
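Absolute Windows paths tie the script to one machine. As a minimal sketch of an alternative, the paths can be built with base R's file.path() from a single base folder. The folder name `data_dir` is an assumption for illustration, not a location used in this project.

```r
# Hypothetical base folder; point this at wherever the three files are stored.
data_dir <- "data"

# file.path() joins the pieces with "/" which R accepts on Windows as well.
paths <- c(
  pokemon = file.path(data_dir, "pokemon_competitive_analysis.csv"),
  dnd     = file.path(data_dir, "dnd_chars_all.tsv.txt"),
  nyc     = file.path(data_dir, "NYC Gifted and Talented Grades 2018-19 - Sheet5.csv")
)

# Report which files are not found before attempting to read them.
missing_files <- names(paths)[!file.exists(paths)]
if (length(missing_files) > 0) {
  message("Files not found: ", paste(missing_files, collapse = ", "))
}
```

Each entry can then be passed to read_csv() or read_tsv() in place of the literal path.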

Reading the data

The three datasets were loaded into R using the read_csv() and read_tsv() functions, each with its own file path.

pokemon_data <- read_csv("C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\pokemon_competitive_analysis.csv", show_col_types = FALSE)

dnd_data <- read_tsv("C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\dnd_chars_all.tsv.txt", show_col_types = FALSE)

nyc_gifted_data <- read_csv("C:\\Users\\wduro\\Downloads\\NYC Gifted and Talented Grades 2018-19 - Sheet5.csv", show_col_types = FALSE)
## New names:
## • `Timestamp` -> `Timestamp...1`
## • `` -> `...11`
## • `Timestamp` -> `Timestamp...14`
colnames(nyc_gifted_data)
##  [1] "Timestamp...1"              "Entering Grade Level"      
##  [3] "District"                   "Birth Month"               
##  [5] "OLSAT Verbal Score"         "OLSAT Verbal Percentile"   
##  [7] "NNAT Non Verbal Raw Score"  "NNAT Non Verbal Percentile"
##  [9] "Overall Score"              "School Preferences"        
## [11] "...11"                      "School Assigned"           
## [13] "Will you enroll there?"     "Timestamp...14"
colnames(nyc_gifted_data) <- make.names(colnames(nyc_gifted_data), unique = TRUE)

head(nyc_gifted_data)
## # A tibble: 6 × 14
##   Timestamp...1 Entering.Grade.Level District Birth.Month OLSAT.Verbal.Score
##   <chr>         <chr>                   <dbl> <chr>                    <dbl>
## 1 2/14/2018     K                          30 October                     24
## 2 3/1/2018      K                          30 February                    25
## 3 3/27/2018     1                           2 January                     25
## 4 3/27/2018     1                           2 May                         25
## 5 3/27/2018     1                           2 July                        25
## 6 3/27/2018     1                           2 November                    25
## # ℹ 9 more variables: OLSAT.Verbal.Percentile <dbl>,
## #   NNAT.Non.Verbal.Raw.Score <dbl>, NNAT.Non.Verbal.Percentile <dbl>,
## #   Overall.Score <dbl>, School.Preferences <chr>, ...11 <lgl>,
## #   School.Assigned <chr>, Will.you.enroll.there. <chr>, Timestamp...14 <chr>
nyc_gifted_data <- nyc_gifted_data[!duplicated(colnames(nyc_gifted_data))]
head(pokemon_data)
## # A tibble: 6 × 23
##   index name   type1 type2 ability1 ability2 hidden_ability    hp attack defense
##   <dbl> <chr>  <chr> <chr> <chr>    <chr>    <chr>          <dbl>  <dbl>   <dbl>
## 1     1 bulba… grass pois… overgrow No_abil… chlorophyll       45     49      49
## 2     2 ivysa… grass pois… overgrow No_abil… chlorophyll       60     62      63
## 3     3 venus… grass pois… overgrow No_abil… chlorophyll       80     82      83
## 4     3 venus… grass pois… thick-f… No_abil… None              80    100     123
## 5     3 venus… grass pois… overgrow No_abil… chlorophyll       80     82      83
## 6     4 charm… fire  No_t… blaze    No_abil… solar-power       39     52      43
## # ℹ 13 more variables: sp_atk <dbl>, sp_def <dbl>, speed <dbl>,
## #   total_stats <dbl>, legendary <lgl>, mythical <lgl>, generation <chr>,
## #   Smogon_VGC_Usage_2022 <chr>, Smogon_VGC_Usage_2023 <chr>,
## #   Smogon_VGC_Usage_2024 <chr>, Worlds_VGC_Usage_2022 <chr>,
## #   Worlds_VGC_Usage_2023 <chr>, Worlds_VGC_Usage_2024 <chr>
head(dnd_data)
## # A tibble: 6 × 35
##   ip     finger hash  name  race  background date                class justClass
##   <chr>  <chr>  <chr> <chr> <chr> <chr>      <dttm>              <chr> <chr>    
## 1 <NA>   ed15f… fe3e… ee1e… Hill… Guild Mem… 2022-08-23 20:02:11 Sorc… Sorcerer…
## 2 <NA>   ed15f… aa65… ee1e… Hill… Guild Mem… 2022-08-23 19:43:25 Sorc… Sorcerer…
## 3 6b5d3… d9226… 04b9… f1f6… Human Noble      2022-08-22 14:57:09 Figh… Fighter  
## 4 9b721… b5d19… ba92… f92b… Fall… Outlander  2022-08-22 12:12:53 Sorc… Sorcerer…
## 5 9b721… b5d19… 2f4a… f92b… Fall… Outlander  2022-08-22 12:07:21 Sorc… Sorcerer…
## 6 bf084… 6594c… de16… 5b8c… Vari… Entertain… 2022-08-22 03:19:30 Bard… Bard     
## # ℹ 26 more variables: subclass <chr>, level <dbl>, feats <chr>, HP <dbl>,
## #   AC <dbl>, Str <dbl>, Dex <dbl>, Con <dbl>, Int <dbl>, Wis <dbl>, Cha <dbl>,
## #   alignment <chr>, skills <chr>, weapons <chr>, spells <chr>,
## #   castingStat <chr>, choices <chr>, country <chr>, countryCode <chr>,
## #   processedAlignment <chr>, good <lgl>, lawful <lgl>, processedRace <chr>,
## #   processedSpells <chr>, processedWeapons <chr>, alias <chr>
head(nyc_gifted_data)
## # A tibble: 6 × 14
##   Timestamp...1 Entering.Grade.Level District Birth.Month OLSAT.Verbal.Score
##   <chr>         <chr>                   <dbl> <chr>                    <dbl>
## 1 2/14/2018     K                          30 October                     24
## 2 3/1/2018      K                          30 February                    25
## 3 3/27/2018     1                           2 January                     25
## 4 3/27/2018     1                           2 May                         25
## 5 3/27/2018     1                           2 July                        25
## 6 3/27/2018     1                           2 November                    25
## # ℹ 9 more variables: OLSAT.Verbal.Percentile <dbl>,
## #   NNAT.Non.Verbal.Raw.Score <dbl>, NNAT.Non.Verbal.Percentile <dbl>,
## #   Overall.Score <dbl>, School.Preferences <chr>, ...11 <lgl>,
## #   School.Assigned <chr>, Will.you.enroll.there. <chr>, Timestamp...14 <chr>
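The renaming step above relies on make.names(), which turns arbitrary strings into syntactically valid and unique R names. A small sketch of what it does, using a hand-written vector of names rather than the real file:

```r
# Illustrative column names with spaces, punctuation, and a duplicate.
raw_names <- c("Timestamp", "Entering Grade Level", "Will you enroll there?", "Timestamp")

# Invalid characters become "." and duplicates get a numeric suffix.
syntactic_names <- make.names(raw_names, unique = TRUE)
print(syntactic_names)
```

Names produced this way can be used with `$` without backticks, at the cost of less readable labels such as a trailing dot where a question mark used to be.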

Inspecting and Cleaning the Data:

Pokemon Dataset: Missing values were handled using drop_na(); the check below reports zero missing values in every column, so no rows are dropped. A variable total_stats was then created as the sum of several existing statistics.

Check column names and missing values for pokemon_data

colnames(pokemon_data)
##  [1] "index"                 "name"                  "type1"                
##  [4] "type2"                 "ability1"              "ability2"             
##  [7] "hidden_ability"        "hp"                    "attack"               
## [10] "defense"               "sp_atk"                "sp_def"               
## [13] "speed"                 "total_stats"           "legendary"            
## [16] "mythical"              "generation"            "Smogon_VGC_Usage_2022"
## [19] "Smogon_VGC_Usage_2023" "Smogon_VGC_Usage_2024" "Worlds_VGC_Usage_2022"
## [22] "Worlds_VGC_Usage_2023" "Worlds_VGC_Usage_2024"
missing_pokemon <- colSums(is.na(pokemon_data))
missing_pokemon
##                 index                  name                 type1 
##                     0                     0                     0 
##                 type2              ability1              ability2 
##                     0                     0                     0 
##        hidden_ability                    hp                attack 
##                     0                     0                     0 
##               defense                sp_atk                sp_def 
##                     0                     0                     0 
##                 speed           total_stats             legendary 
##                     0                     0                     0 
##              mythical            generation Smogon_VGC_Usage_2022 
##                     0                     0                     0 
## Smogon_VGC_Usage_2023 Smogon_VGC_Usage_2024 Worlds_VGC_Usage_2022 
##                     0                     0                     0 
## Worlds_VGC_Usage_2023 Worlds_VGC_Usage_2024 
##                     0                     0

Handling missing values

#Dropping rows with missing values for simplicity, then checking that the data are clean.

pokemon_data_cleaned <- pokemon_data %>%
  drop_na()

head(pokemon_data_cleaned)
## # A tibble: 6 × 23
##   index name   type1 type2 ability1 ability2 hidden_ability    hp attack defense
##   <dbl> <chr>  <chr> <chr> <chr>    <chr>    <chr>          <dbl>  <dbl>   <dbl>
## 1     1 bulba… grass pois… overgrow No_abil… chlorophyll       45     49      49
## 2     2 ivysa… grass pois… overgrow No_abil… chlorophyll       60     62      63
## 3     3 venus… grass pois… overgrow No_abil… chlorophyll       80     82      83
## 4     3 venus… grass pois… thick-f… No_abil… None              80    100     123
## 5     3 venus… grass pois… overgrow No_abil… chlorophyll       80     82      83
## 6     4 charm… fire  No_t… blaze    No_abil… solar-power       39     52      43
## # ℹ 13 more variables: sp_atk <dbl>, sp_def <dbl>, speed <dbl>,
## #   total_stats <dbl>, legendary <lgl>, mythical <lgl>, generation <chr>,
## #   Smogon_VGC_Usage_2022 <chr>, Smogon_VGC_Usage_2023 <chr>,
## #   Smogon_VGC_Usage_2024 <chr>, Worlds_VGC_Usage_2022 <chr>,
## #   Worlds_VGC_Usage_2023 <chr>, Worlds_VGC_Usage_2024 <chr>

Cleaning the datasets

#DND Dataset:

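Checked for missing values and removed rows with NAs. Note that drop_na() called without column arguments removes a row if any of its columns is missing. Several columns here are heavily missing (for example, good and lawful have 10894 NAs each), and the cleaned result below has 0 rows.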

colnames(dnd_data)
##  [1] "ip"                 "finger"             "hash"              
##  [4] "name"               "race"               "background"        
##  [7] "date"               "class"              "justClass"         
## [10] "subclass"           "level"              "feats"             
## [13] "HP"                 "AC"                 "Str"               
## [16] "Dex"                "Con"                "Int"               
## [19] "Wis"                "Cha"                "alignment"         
## [22] "skills"             "weapons"            "spells"            
## [25] "castingStat"        "choices"            "country"           
## [28] "countryCode"        "processedAlignment" "good"              
## [31] "lawful"             "processedRace"      "processedSpells"   
## [34] "processedWeapons"   "alias"
missing_dnd <- colSums(is.na(dnd_data))
missing_dnd
##                 ip             finger               hash               name 
##               1957                830                  0                234 
##               race         background               date              class 
##                 24                 20                  0                  0 
##          justClass           subclass              level              feats 
##                  0               2618                  0               7460 
##                 HP                 AC                Str                Dex 
##                  0                  0                  0                  0 
##                Con                Int                Wis                Cha 
##                  0                  0                  0                  0 
##          alignment             skills            weapons             spells 
##               7735                  8                163               6030 
##        castingStat            choices            country        countryCode 
##                  0               5098               3262               3262 
## processedAlignment               good             lawful      processedRace 
##               7807              10894              10894                231 
##    processedSpells   processedWeapons              alias 
##               6030                211                234
dnd_data_cleaned <- dnd_data %>%
  drop_na()


head(dnd_data_cleaned)
## # A tibble: 0 × 35
## # ℹ 35 variables: ip <chr>, finger <chr>, hash <chr>, name <chr>, race <chr>,
## #   background <chr>, date <dttm>, class <chr>, justClass <chr>,
## #   subclass <chr>, level <dbl>, feats <chr>, HP <dbl>, AC <dbl>, Str <dbl>,
## #   Dex <dbl>, Con <dbl>, Int <dbl>, Wis <dbl>, Cha <dbl>, alignment <chr>,
## #   skills <chr>, weapons <chr>, spells <chr>, castingStat <chr>,
## #   choices <chr>, country <chr>, countryCode <chr>, processedAlignment <chr>,
## #   good <lgl>, lawful <lgl>, processedRace <chr>, processedSpells <chr>, …
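Because the result above is empty, it may help to restrict the missing-value filter to the columns an analysis actually needs. The following is a minimal base R sketch on a toy data frame. The values and the choice of `level` and `HP` as required columns are assumptions for illustration only.

```r
# Toy stand-in for dnd_data (hypothetical values, not taken from the file).
characters <- data.frame(
  name  = c("a", "b", "c"),
  level = c(1, 5, 3),
  HP    = c(10, 42, 27),
  feats = c(NA, "Lucky", NA),
  good  = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Count missing values per column, as in the report.
na_counts <- colSums(is.na(characters))

# Requiring every column to be complete removes all rows here,
# because `good` is missing everywhere.
all_complete <- characters[complete.cases(characters), ]

# Requiring only the columns needed for the analysis keeps the rows.
# tidyr equivalent: drop_na(characters, level, HP)
needed <- c("level", "HP")
kept <- characters[complete.cases(characters[, needed]), ]

print(nrow(all_complete))
print(nrow(kept))
```

Which columns count as required depends on the question being asked, so the vector `needed` would have to be chosen per analysis.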

#NYC Gifted Dataset:

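Dropped rows with missing values and converted the Entering Grade Level column into a factor variable named grade. Then used pivot_longer to reshape the dataset for easier analysis of the school-related columns. Note that drop_na() is applied across all columns, including ...11, which has 104 missing values, and the cleaned result below has 0 rows.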

colnames(nyc_gifted_data)
##  [1] "Timestamp...1"              "Entering.Grade.Level"      
##  [3] "District"                   "Birth.Month"               
##  [5] "OLSAT.Verbal.Score"         "OLSAT.Verbal.Percentile"   
##  [7] "NNAT.Non.Verbal.Raw.Score"  "NNAT.Non.Verbal.Percentile"
##  [9] "Overall.Score"              "School.Preferences"        
## [11] "...11"                      "School.Assigned"           
## [13] "Will.you.enroll.there."     "Timestamp...14"
missing_nyc <- colSums(is.na(nyc_gifted_data))
missing_nyc
##              Timestamp...1       Entering.Grade.Level 
##                          0                          0 
##                   District                Birth.Month 
##                          0                          0 
##         OLSAT.Verbal.Score    OLSAT.Verbal.Percentile 
##                          0                          0 
##  NNAT.Non.Verbal.Raw.Score NNAT.Non.Verbal.Percentile 
##                          0                          0 
##              Overall.Score         School.Preferences 
##                          0                         28 
##                      ...11            School.Assigned 
##                        104                         83 
##     Will.you.enroll.there.             Timestamp...14 
##                         61                         44
nyc_gifted_data_cleaned <- nyc_gifted_data %>%
  drop_na() %>%
  mutate(grade = as.factor(`Entering.Grade.Level`))  


head(nyc_gifted_data_cleaned)
## # A tibble: 0 × 15
## # ℹ 15 variables: Timestamp...1 <chr>, Entering.Grade.Level <chr>,
## #   District <dbl>, Birth.Month <chr>, OLSAT.Verbal.Score <dbl>,
## #   OLSAT.Verbal.Percentile <dbl>, NNAT.Non.Verbal.Raw.Score <dbl>,
## #   NNAT.Non.Verbal.Percentile <dbl>, Overall.Score <dbl>,
## #   School.Preferences <chr>, ...11 <lgl>, School.Assigned <chr>,
## #   Will.you.enroll.there. <chr>, Timestamp...14 <chr>, grade <fct>
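The output above shows a logical column `...11` with 104 missing values, which suggests a mostly or entirely empty column. One possible approach, sketched here in base R on made-up data, is to remove columns that contain no values before filtering rows, and only then create the factor. The data frame and the level order are assumptions for illustration.

```r
# Toy stand-in for nyc_gifted_data (hypothetical values).
survey <- data.frame(
  Entering.Grade.Level = c("K", "1", "1", "2"),
  Overall.Score        = c(99, 97, 98, 99),
  School.Assigned      = c("PS 1", NA, "PS 2", "PS 3"),
  empty_col            = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Keep only columns that hold at least one non-missing value.
has_data <- colSums(!is.na(survey)) > 0
survey_trimmed <- survey[, has_data, drop = FALSE]

# Now dropping incomplete rows no longer removes everything.
survey_complete <- survey_trimmed[complete.cases(survey_trimmed), , drop = FALSE]

# Explicit levels keep the grades in school order rather than alphabetical order.
survey_complete$grade <- factor(
  survey_complete$Entering.Grade.Level,
  levels = c("K", "1", "2")
)

print(survey_complete)
```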

Pivoting the data and checking the transformed long format

nyc_long_data <- nyc_gifted_data_cleaned %>%
  pivot_longer(cols = starts_with("School"), 
               names_to = "school_type", 
               values_to = "school_value")


head(nyc_long_data)
## # A tibble: 0 × 15
## # ℹ 15 variables: Timestamp...1 <chr>, Entering.Grade.Level <chr>,
## #   District <dbl>, Birth.Month <chr>, OLSAT.Verbal.Score <dbl>,
## #   OLSAT.Verbal.Percentile <dbl>, NNAT.Non.Verbal.Raw.Score <dbl>,
## #   NNAT.Non.Verbal.Percentile <dbl>, Overall.Score <dbl>, ...11 <lgl>,
## #   Will.you.enroll.there. <chr>, Timestamp...14 <chr>, grade <fct>,
## #   school_type <chr>, school_value <chr>
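Since the pivoted table above has no rows, the effect of pivot_longer() is hard to see. A small sketch on a hypothetical two-row table shows how the columns selected by starts_with("School") are stacked into a name column and a value column:

```r
library(tidyr)

# Toy table with two "School" columns (hypothetical values).
applicants <- data.frame(
  id                 = c(1, 2),
  School.Preferences = c("PS 1, PS 2", "PS 3"),
  School.Assigned    = c("PS 2", "PS 3"),
  stringsAsFactors = FALSE
)

# Each input row yields one output row per selected column.
applicants_long <- pivot_longer(
  applicants,
  cols      = starts_with("School"),
  names_to  = "school_type",
  values_to = "school_value"
)

print(applicants_long)
```

Two input rows and two selected columns give four output rows, with the former column names stored in `school_type`.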

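New column

Create the total_stats column and check the dataset. Note that the formula below omits hp and overwrites the total_stats column that already exists in the source data.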

pokemon_data_cleaned <- pokemon_data_cleaned %>%
  mutate(total_stats = attack + defense + speed + sp_atk + sp_def)


head(pokemon_data_cleaned)
## # A tibble: 6 × 23
##   index name   type1 type2 ability1 ability2 hidden_ability    hp attack defense
##   <dbl> <chr>  <chr> <chr> <chr>    <chr>    <chr>          <dbl>  <dbl>   <dbl>
## 1     1 bulba… grass pois… overgrow No_abil… chlorophyll       45     49      49
## 2     2 ivysa… grass pois… overgrow No_abil… chlorophyll       60     62      63
## 3     3 venus… grass pois… overgrow No_abil… chlorophyll       80     82      83
## 4     3 venus… grass pois… thick-f… No_abil… None              80    100     123
## 5     3 venus… grass pois… overgrow No_abil… chlorophyll       80     82      83
## 6     4 charm… fire  No_t… blaze    No_abil… solar-power       39     52      43
## # ℹ 13 more variables: sp_atk <dbl>, sp_def <dbl>, speed <dbl>,
## #   total_stats <dbl>, legendary <lgl>, mythical <lgl>, generation <chr>,
## #   Smogon_VGC_Usage_2022 <chr>, Smogon_VGC_Usage_2023 <chr>,
## #   Smogon_VGC_Usage_2024 <chr>, Worlds_VGC_Usage_2022 <chr>,
## #   Worlds_VGC_Usage_2023 <chr>, Worlds_VGC_Usage_2024 <chr>
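The formula above sums five of the six base statistics. To make the difference explicit, here is a minimal base R sketch that computes the total with and without hp on two example rows. The hp, attack, and defense values match the rows printed above; the sp_atk, sp_def, and speed values are filled in for illustration and should be treated as assumptions. The new column names are hypothetical.

```r
# Two example rows; sp_atk, sp_def and speed are illustrative values.
stats <- data.frame(
  name    = c("bulbasaur", "charmander"),
  hp      = c(45, 39),
  attack  = c(49, 52),
  defense = c(49, 43),
  sp_atk  = c(65, 60),
  sp_def  = c(65, 50),
  speed   = c(45, 65),
  stringsAsFactors = FALSE
)

five_stats <- c("attack", "defense", "speed", "sp_atk", "sp_def")

# Same sum as the mutate() call in the report (hp excluded).
stats$total_without_hp <- as.numeric(rowSums(stats[, five_stats]))

# Sum over all six base statistics.
stats$total_with_hp <- as.numeric(rowSums(stats[, c("hp", five_stats)]))

print(stats[, c("name", "total_without_hp", "total_with_hp")])
```

Writing the result to a new column, rather than to total_stats, keeps the original value available for comparison.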

Summary statistics

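For the NYC Gifted, Pokemon, and DND data, basic summary statistics were generated using the summary() function. The NYC Gifted and DND summaries are empty (Length:0, NA) because the cleaned tables contain no rows.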

summary(nyc_gifted_data_cleaned)
##  Timestamp...1      Entering.Grade.Level    District   Birth.Month       
##  Length:0           Length:0             Min.   : NA   Length:0          
##  Class :character   Class :character     1st Qu.: NA   Class :character  
##  Mode  :character   Mode  :character     Median : NA   Mode  :character  
##                                          Mean   :NaN                     
##                                          3rd Qu.: NA                     
##                                          Max.   : NA                     
##  OLSAT.Verbal.Score OLSAT.Verbal.Percentile NNAT.Non.Verbal.Raw.Score
##  Min.   : NA        Min.   : NA             Min.   : NA              
##  1st Qu.: NA        1st Qu.: NA             1st Qu.: NA              
##  Median : NA        Median : NA             Median : NA              
##  Mean   :NaN        Mean   :NaN             Mean   :NaN              
##  3rd Qu.: NA        3rd Qu.: NA             3rd Qu.: NA              
##  Max.   : NA        Max.   : NA             Max.   : NA              
##  NNAT.Non.Verbal.Percentile Overall.Score School.Preferences  ...11        
##  Min.   : NA                Min.   : NA   Length:0           Mode:logical  
##  1st Qu.: NA                1st Qu.: NA   Class :character                 
##  Median : NA                Median : NA   Mode  :character                 
##  Mean   :NaN                Mean   :NaN                                    
##  3rd Qu.: NA                3rd Qu.: NA                                    
##  Max.   : NA                Max.   : NA                                    
##  School.Assigned    Will.you.enroll.there. Timestamp...14      grade 
##  Length:0           Length:0               Length:0           NULL:  
##  Class :character   Class :character       Class :character          
##  Mode  :character   Mode  :character       Mode  :character          
##                                                                      
##                                                                      
## 
summary(pokemon_data_cleaned)
##      index            name              type1              type2          
##  Min.   :   1.0   Length:1303        Length:1303        Length:1303       
##  1st Qu.: 234.5   Class :character   Class :character   Class :character  
##  Median : 511.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 507.0                                                           
##  3rd Qu.: 774.0                                                           
##  Max.   :1025.0                                                           
##    ability1           ability2         hidden_ability           hp        
##  Length:1303        Length:1303        Length:1303        Min.   :  1.00  
##  Class :character   Class :character   Class :character   1st Qu.: 54.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 70.00  
##                                                           Mean   : 71.35  
##                                                           3rd Qu.: 85.00  
##                                                           Max.   :255.00  
##      attack          defense           sp_atk           sp_def      
##  Min.   :  5.00   Min.   :  5.00   Min.   : 10.00   Min.   : 20.00  
##  1st Qu.: 58.00   1st Qu.: 53.00   1st Qu.: 50.00   1st Qu.: 52.00  
##  Median : 80.00   Median : 70.00   Median : 65.00   Median : 70.00  
##  Mean   : 81.59   Mean   : 75.24   Mean   : 73.68   Mean   : 72.97  
##  3rd Qu.:100.00   3rd Qu.: 95.00   3rd Qu.: 95.00   3rd Qu.: 90.00  
##  Max.   :190.00   Max.   :250.00   Max.   :194.00   Max.   :250.00  
##      speed         total_stats    legendary        mythical      
##  Min.   :  5.00   Min.   :120.0   Mode :logical   Mode :logical  
##  1st Qu.: 47.50   1st Qu.:285.0   FALSE:1186      FALSE:1269     
##  Median : 70.00   Median :390.0   TRUE :117       TRUE :34       
##  Mean   : 71.13   Mean   :374.6                                  
##  3rd Qu.: 92.00   3rd Qu.:440.0                                  
##  Max.   :200.00   Max.   :870.0                                  
##   generation        Smogon_VGC_Usage_2022 Smogon_VGC_Usage_2023
##  Length:1303        Length:1303           Length:1303          
##  Class :character   Class :character      Class :character     
##  Mode  :character   Mode  :character      Mode  :character     
##                                                                
##                                                                
##                                                                
##  Smogon_VGC_Usage_2024 Worlds_VGC_Usage_2022 Worlds_VGC_Usage_2023
##  Length:1303           Length:1303           Length:1303          
##  Class :character      Class :character      Class :character     
##  Mode  :character      Mode  :character      Mode  :character     
##                                                                   
##                                                                   
##                                                                   
##  Worlds_VGC_Usage_2024
##  Length:1303          
##  Class :character     
##  Mode  :character     
##                       
##                       
## 
summary(dnd_data_cleaned)
##       ip               finger              hash               name          
##  Length:0           Length:0           Length:0           Length:0          
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      race            background             date        class          
##  Length:0           Length:0           Min.   :NA    Length:0          
##  Class :character   Class :character   1st Qu.:NA    Class :character  
##  Mode  :character   Mode  :character   Median :NA    Mode  :character  
##                                        Mean   :NaN                     
##                                        3rd Qu.:NA                      
##                                        Max.   :NA                      
##   justClass           subclass             level        feats          
##  Length:0           Length:0           Min.   : NA   Length:0          
##  Class :character   Class :character   1st Qu.: NA   Class :character  
##  Mode  :character   Mode  :character   Median : NA   Mode  :character  
##                                        Mean   :NaN                     
##                                        3rd Qu.: NA                     
##                                        Max.   : NA                     
##        HP            AC           Str           Dex           Con     
##  Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA   Min.   : NA  
##  1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   1st Qu.: NA  
##  Median : NA   Median : NA   Median : NA   Median : NA   Median : NA  
##  Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN   Mean   :NaN  
##  3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA  
##  Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA   Max.   : NA  
##       Int           Wis           Cha       alignment        
##  Min.   : NA   Min.   : NA   Min.   : NA   Length:0          
##  1st Qu.: NA   1st Qu.: NA   1st Qu.: NA   Class :character  
##  Median : NA   Median : NA   Median : NA   Mode  :character  
##  Mean   :NaN   Mean   :NaN   Mean   :NaN                     
##  3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA                     
##  Max.   : NA   Max.   : NA   Max.   : NA                     
##     skills            weapons             spells          castingStat       
##  Length:0           Length:0           Length:0           Length:0          
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    choices            country          countryCode        processedAlignment
##  Length:0           Length:0           Length:0           Length:0          
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    good          lawful        processedRace      processedSpells   
##  Mode:logical   Mode:logical   Length:0           Length:0          
##                                Class :character   Class :character  
##                                Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##  processedWeapons      alias          
##  Length:0           Length:0          
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Visualizations:

#Pokemon Dataset:

A histogram was plotted to show the distribution of the total_stats variable.

ggplot(pokemon_data_cleaned, aes(x = total_stats)) +
  geom_histogram(binwidth = 10, fill = "orange", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Total Stats for Pokémon", 
       x = "Total Stats", 
       y = "Frequency") +
  theme_minimal()

Saving the cleaned data

The cleaned datasets were saved into new CSV files for further use or sharing.

write_csv(pokemon_data_cleaned, "C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\pokemon_data_cleaned.csv")

write_csv(nyc_gifted_data_cleaned, "C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\nyc_gifted_data_cleaned.csv")

write_csv(dnd_data_cleaned, "C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\dnd_data_cleaned.csv")

#Accessing the cleaned data for analysis

pokemon_data <- read_csv ("C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\pokemon_data_cleaned.csv")
## Rows: 1303 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): name, type1, type2, ability1, ability2, hidden_ability, generation...
## dbl  (8): index, hp, attack, defense, sp_atk, sp_def, speed, total_stats
## lgl  (2): legendary, mythical
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(pokemon_data)
## # A tibble: 6 × 23
##   index name   type1 type2 ability1 ability2 hidden_ability    hp attack defense
##   <dbl> <chr>  <chr> <chr> <chr>    <chr>    <chr>          <dbl>  <dbl>   <dbl>
## 1     1 bulba… grass pois… overgrow No_abil… chlorophyll       45     49      49
## 2     2 ivysa… grass pois… overgrow No_abil… chlorophyll       60     62      63
## 3     3 venus… grass pois… overgrow No_abil… chlorophyll       80     82      83
## 4     3 venus… grass pois… thick-f… No_abil… None              80    100     123
## 5     3 venus… grass pois… overgrow No_abil… chlorophyll       80     82      83
## 6     4 charm… fire  No_t… blaze    No_abil… solar-power       39     52      43
## # ℹ 13 more variables: sp_atk <dbl>, sp_def <dbl>, speed <dbl>,
## #   total_stats <dbl>, legendary <lgl>, mythical <lgl>, generation <chr>,
## #   Smogon_VGC_Usage_2022 <chr>, Smogon_VGC_Usage_2023 <chr>,
## #   Smogon_VGC_Usage_2024 <chr>, Worlds_VGC_Usage_2022 <chr>,
## #   Worlds_VGC_Usage_2023 <chr>, Worlds_VGC_Usage_2024 <chr>
nyc_gifted_data <- read_csv ("C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\nyc_gifted_data_cleaned.csv")
## Rows: 0 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (15): Timestamp...1, Entering.Grade.Level, District, Birth.Month, OLSAT....
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nyc_gifted_data)
## # A tibble: 0 × 15
## # ℹ 15 variables: Timestamp...1 <chr>, Entering.Grade.Level <chr>,
## #   District <chr>, Birth.Month <chr>, OLSAT.Verbal.Score <chr>,
## #   OLSAT.Verbal.Percentile <chr>, NNAT.Non.Verbal.Raw.Score <chr>,
## #   NNAT.Non.Verbal.Percentile <chr>, Overall.Score <chr>,
## #   School.Preferences <chr>, ...11 <chr>, School.Assigned <chr>,
## #   Will.you.enroll.there. <chr>, Timestamp...14 <chr>, grade <chr>
dnd_data <- read_csv("C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\dnd_data_cleaned.csv")
## Rows: 0 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (35): ip, finger, hash, name, race, background, date, class, justClass, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dnd_data)
## # A tibble: 0 × 35
## # ℹ 35 variables: ip <chr>, finger <chr>, hash <chr>, name <chr>, race <chr>,
## #   background <chr>, date <chr>, class <chr>, justClass <chr>, subclass <chr>,
## #   level <chr>, feats <chr>, HP <chr>, AC <chr>, Str <chr>, Dex <chr>,
## #   Con <chr>, Int <chr>, Wis <chr>, Cha <chr>, alignment <chr>, skills <chr>,
## #   weapons <chr>, spells <chr>, castingStat <chr>, choices <chr>,
## #   country <chr>, countryCode <chr>, processedAlignment <chr>, good <chr>,
## #   lawful <chr>, processedRace <chr>, processedSpells <chr>, …

Data Cleaning and Transformation:

In this step we make sure each dataset is tidy and apply the transformations needed for analysis.

Pokémon Dataset

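pokemon_data_cleaned <- pokemon_data %>%
  drop_na() %>%  # keep only rows with no missing value in any column
  # overwrite total_stats with the sum of five stats; note that hp is not part of this sum
  mutate(total_stats = attack + defense + speed + sp_atk + sp_def)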


head(pokemon_data_cleaned)
## # A tibble: 6 × 23
##   index name   type1 type2 ability1 ability2 hidden_ability    hp attack defense
##   <dbl> <chr>  <chr> <chr> <chr>    <chr>    <chr>          <dbl>  <dbl>   <dbl>
## 1     1 bulba… grass pois… overgrow No_abil… chlorophyll       45     49      49
## 2     2 ivysa… grass pois… overgrow No_abil… chlorophyll       60     62      63
## 3     3 venus… grass pois… overgrow No_abil… chlorophyll       80     82      83
## 4     3 venus… grass pois… thick-f… No_abil… None              80    100     123
## 5     3 venus… grass pois… overgrow No_abil… chlorophyll       80     82      83
## 6     4 charm… fire  No_t… blaze    No_abil… solar-power       39     52      43
## # ℹ 13 more variables: sp_atk <dbl>, sp_def <dbl>, speed <dbl>,
## #   total_stats <dbl>, legendary <lgl>, mythical <lgl>, generation <chr>,
## #   Smogon_VGC_Usage_2022 <chr>, Smogon_VGC_Usage_2023 <chr>,
## #   Smogon_VGC_Usage_2024 <chr>, Worlds_VGC_Usage_2022 <chr>,
## #   Worlds_VGC_Usage_2023 <chr>, Worlds_VGC_Usage_2024 <chr>
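
As a minimal sketch of the step above, the toy tibble below shows how drop_na() removes any row with a missing value before mutate() adds a summed column. The values are made up and are not rows from the Pokémon file. The column total_with_hp is a hypothetical name introduced only for illustration; it adds hp to the five stats summed above.

```r
library(dplyr)
library(tidyr)

# Toy data with made-up values; row "b" has a missing speed
toy_stats <- tibble::tibble(
  name    = c("a", "b", "c"),
  hp      = c(10, 20, 30),
  attack  = c(1, 2, 3),
  defense = c(1, 2, 3),
  sp_atk  = c(1, 2, 3),
  sp_def  = c(1, 2, 3),
  speed   = c(1, NA, 3)
)

toy_cleaned <- toy_stats %>%
  drop_na() %>%  # drops row "b" because speed is NA
  mutate(
    # same five-stat formula as in the report
    total_stats = attack + defense + speed + sp_atk + sp_def,
    # hypothetical variant that also counts hp
    total_with_hp = total_stats + hp
  )

toy_cleaned
```

Comparing the two columns side by side makes it easy to see how much the choice of formula changes the total.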

DND Dataset

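We remove every row that contains an NA with drop_na(). Further transformations (e.g., on specific character attributes) could be added at this point. Note that dnd_data was re-read above from dnd_data_cleaned.csv, which contains 0 rows, so the cleaned result is empty as well.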

dnd_data_cleaned <- dnd_data %>%
  drop_na()

head(dnd_data_cleaned)
## # A tibble: 0 × 35
## # ℹ 35 variables: ip <chr>, finger <chr>, hash <chr>, name <chr>, race <chr>,
## #   background <chr>, date <chr>, class <chr>, justClass <chr>, subclass <chr>,
## #   level <chr>, feats <chr>, HP <chr>, AC <chr>, Str <chr>, Dex <chr>,
## #   Con <chr>, Int <chr>, Wis <chr>, Cha <chr>, alignment <chr>, skills <chr>,
## #   weapons <chr>, spells <chr>, castingStat <chr>, choices <chr>,
## #   country <chr>, countryCode <chr>, processedAlignment <chr>, good <chr>,
## #   lawful <chr>, processedRace <chr>, processedSpells <chr>, …
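
Before dropping rows, it can help to see where the missing values are. The sketch below uses a made-up tibble (toy_chars is hypothetical, not the DND file) to count NAs per column and to compare drop_na() on all columns with drop_na() restricted to one column. It assumes some columns may legitimately be empty for many characters, in which case dropping on every column could discard most of the data.

```r
library(dplyr)
library(tidyr)

# Toy character data with made-up values
toy_chars <- tibble::tibble(
  name     = c("x", "y", "z"),
  class    = c("Wizard", NA, "Rogue"),
  subclass = c(NA, NA, "Thief")
)

# Count missing values in every column before dropping anything
na_counts <- toy_chars %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# drop_na() with no arguments keeps only fully complete rows
complete_rows <- toy_chars %>% drop_na()

# Restricting drop_na() to chosen columns keeps rows that are missing values elsewhere
class_known <- toy_chars %>% drop_na(class)

na_counts
nrow(complete_rows)
nrow(class_known)
```

Here the unrestricted call keeps one row, while restricting to class keeps two.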

NYC Gifted Dataset

We drop rows with missing values and add a grade column that stores Entering.Grade.Level as a factor.

nyc_gifted_data_cleaned <- nyc_gifted_data %>%
  drop_na() %>%
  mutate(grade = as.factor(`Entering.Grade.Level`))

head(nyc_gifted_data_cleaned)
## # A tibble: 0 × 15
## # ℹ 15 variables: Timestamp...1 <chr>, Entering.Grade.Level <chr>,
## #   District <chr>, Birth.Month <chr>, OLSAT.Verbal.Score <chr>,
## #   OLSAT.Verbal.Percentile <chr>, NNAT.Non.Verbal.Raw.Score <chr>,
## #   NNAT.Non.Verbal.Percentile <chr>, Overall.Score <chr>,
## #   School.Preferences <chr>, ...11 <chr>, School.Assigned <chr>,
## #   Will.you.enroll.there. <chr>, Timestamp...14 <chr>, grade <fct>
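
The factor conversion can be sketched on toy data as follows. The grade labels "K", "1", "2", "3" are an assumption made for illustration and may not match the values in the actual file. as.factor() sorts the levels, whereas factor() with an explicit levels argument lets us control their order, which matters for plots and summaries.

```r
library(dplyr)

# Toy responses with made-up values
toy_grades <- tibble::tibble(
  Entering.Grade.Level = c("1", "K", "2", "K")
)

toy_grades <- toy_grades %>%
  mutate(
    # default conversion: levels are the sorted unique values
    grade_default = as.factor(Entering.Grade.Level),
    # hypothetical explicit ordering; "3" is a level with no observations here
    grade_ordered = factor(Entering.Grade.Level,
                           levels = c("K", "1", "2", "3"))
  )

levels(toy_grades$grade_default)
levels(toy_grades$grade_ordered)
```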

Conclusion

The goal of this project was to clean and transform three datasets. The work involved handling missing values, transforming variables, and performing some basic analysis and visualization. The cleaned datasets were saved for future reference and additional analysis.

Key Learnings:

Data Cleaning: We handled missing data, standardized column names, and reshaped data for easier analysis.

Transformation: We created new variables (e.g., total_stats) to enhance the analysis.

Visualizations: We generated visualizations, particularly for the Pokémon dataset, to understand the distribution of key variables.

Tidy Data Principles: We applied key tidying techniques to put the datasets in a “long” format when needed.