This project cleans, transforms, and visualizes three datasets: Pokémon competitive analysis, Dungeons & Dragons (DND) characters, and NYC Gifted and Talented Grades.
library(readr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(janitor)
## Warning: package 'janitor' was built under R version 4.4.3
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ purrr 1.0.4
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)
library(readxl)
library(ggplot2)
# Store each input file path in a descriptively named variable.
pokemon_path <- "C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\pokemon_competitive_analysis.csv"
dnd_path <- "C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\dnd_chars_all.tsv.txt"
nyc_path <- "C:\\Users\\wduro\\Downloads\\NYC Gifted and Talented Grades 2018-19 - Sheet5.csv"
The three datasets were loaded into R using read_csv and read_tsv functions. Each dataset was read in with its specific file path.
pokemon_data <- read_csv("C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\pokemon_competitive_analysis.csv", show_col_types = FALSE)
dnd_data <- read_tsv ("C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\dnd_chars_all.tsv.txt", show_col_types = FALSE)
nyc_gifted_data <- read_csv("C:\\Users\\wduro\\Downloads\\NYC Gifted and Talented Grades 2018-19 - Sheet5.csv", show_col_types = FALSE)
## New names:
## • `Timestamp` -> `Timestamp...1`
## • `` -> `...11`
## • `Timestamp` -> `Timestamp...14`
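The absolute Windows paths used above tie the script to one machine. A minimal sketch of a more portable setup follows, assuming the input files are copied into a project subfolder. The folder name `data` and the variable names are hypothetical and do not appear in the original analysis.

```r
# Hypothetical project layout: all input files sit in one folder.
# "data" is an assumed directory name, not one used in the original analysis.
data_dir <- "data"

# file.path() joins the pieces with "/" on every platform.
pokemon_file <- file.path(data_dir, "pokemon_competitive_analysis.csv")
dnd_file     <- file.path(data_dir, "dnd_chars_all.tsv.txt")
nyc_file     <- file.path(data_dir, "NYC Gifted and Talented Grades 2018-19 - Sheet5.csv")

# The read calls would then take these variables instead of repeated literals, e.g.
# pokemon_data <- readr::read_csv(pokemon_file, show_col_types = FALSE)
print(pokemon_file)
```

With relative paths like these, the script runs unchanged for anyone who opens the project folder as the working directory.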
colnames(nyc_gifted_data)
## [1] "Timestamp...1" "Entering Grade Level"
## [3] "District" "Birth Month"
## [5] "OLSAT Verbal Score" "OLSAT Verbal Percentile"
## [7] "NNAT Non Verbal Raw Score" "NNAT Non Verbal Percentile"
## [9] "Overall Score" "School Preferences"
## [11] "...11" "School Assigned"
## [13] "Will you enroll there?" "Timestamp...14"
colnames(nyc_gifted_data) <- make.names(colnames(nyc_gifted_data), unique = TRUE)
head(nyc_gifted_data)
## # A tibble: 6 × 14
## Timestamp...1 Entering.Grade.Level District Birth.Month OLSAT.Verbal.Score
## <chr> <chr> <dbl> <chr> <dbl>
## 1 2/14/2018 K 30 October 24
## 2 3/1/2018 K 30 February 25
## 3 3/27/2018 1 2 January 25
## 4 3/27/2018 1 2 May 25
## 5 3/27/2018 1 2 July 25
## 6 3/27/2018 1 2 November 25
## # ℹ 9 more variables: OLSAT.Verbal.Percentile <dbl>,
## # NNAT.Non.Verbal.Raw.Score <dbl>, NNAT.Non.Verbal.Percentile <dbl>,
## # Overall.Score <dbl>, School.Preferences <chr>, ...11 <lgl>,
## # School.Assigned <chr>, Will.you.enroll.there. <chr>, Timestamp...14 <chr>
nyc_gifted_data <- nyc_gifted_data[!duplicated(colnames(nyc_gifted_data))]
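The name repair above can be reproduced on a plain character vector. This sketch uses a handful of the column names printed earlier and shows that, once `make.names(unique = TRUE)` has run, no duplicate names remain, so the `!duplicated()` filter keeps every column.

```r
# A few of the column names reported by colnames(nyc_gifted_data) above.
raw_names <- c("Timestamp...1", "Entering Grade Level", "...11",
               "Will you enroll there?", "Timestamp...14")

# Spaces and punctuation are replaced with "." to give syntactic names.
fixed_names <- make.names(raw_names, unique = TRUE)
print(fixed_names)

# anyDuplicated() returns 0 when every name is already unique.
print(anyDuplicated(fixed_names))
```

If snake_case names are preferred, `janitor::clean_names()` (janitor is already loaded) is a common alternative, although the original analysis does not use it.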
head(pokemon_data)
## # A tibble: 6 × 23
## index name type1 type2 ability1 ability2 hidden_ability hp attack defense
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 bulba… grass pois… overgrow No_abil… chlorophyll 45 49 49
## 2 2 ivysa… grass pois… overgrow No_abil… chlorophyll 60 62 63
## 3 3 venus… grass pois… overgrow No_abil… chlorophyll 80 82 83
## 4 3 venus… grass pois… thick-f… No_abil… None 80 100 123
## 5 3 venus… grass pois… overgrow No_abil… chlorophyll 80 82 83
## 6 4 charm… fire No_t… blaze No_abil… solar-power 39 52 43
## # ℹ 13 more variables: sp_atk <dbl>, sp_def <dbl>, speed <dbl>,
## # total_stats <dbl>, legendary <lgl>, mythical <lgl>, generation <chr>,
## # Smogon_VGC_Usage_2022 <chr>, Smogon_VGC_Usage_2023 <chr>,
## # Smogon_VGC_Usage_2024 <chr>, Worlds_VGC_Usage_2022 <chr>,
## # Worlds_VGC_Usage_2023 <chr>, Worlds_VGC_Usage_2024 <chr>
head(dnd_data)
## # A tibble: 6 × 35
## ip finger hash name race background date class justClass
## <chr> <chr> <chr> <chr> <chr> <chr> <dttm> <chr> <chr>
## 1 <NA> ed15f… fe3e… ee1e… Hill… Guild Mem… 2022-08-23 20:02:11 Sorc… Sorcerer…
## 2 <NA> ed15f… aa65… ee1e… Hill… Guild Mem… 2022-08-23 19:43:25 Sorc… Sorcerer…
## 3 6b5d3… d9226… 04b9… f1f6… Human Noble 2022-08-22 14:57:09 Figh… Fighter
## 4 9b721… b5d19… ba92… f92b… Fall… Outlander 2022-08-22 12:12:53 Sorc… Sorcerer…
## 5 9b721… b5d19… 2f4a… f92b… Fall… Outlander 2022-08-22 12:07:21 Sorc… Sorcerer…
## 6 bf084… 6594c… de16… 5b8c… Vari… Entertain… 2022-08-22 03:19:30 Bard… Bard
## # ℹ 26 more variables: subclass <chr>, level <dbl>, feats <chr>, HP <dbl>,
## # AC <dbl>, Str <dbl>, Dex <dbl>, Con <dbl>, Int <dbl>, Wis <dbl>, Cha <dbl>,
## # alignment <chr>, skills <chr>, weapons <chr>, spells <chr>,
## # castingStat <chr>, choices <chr>, country <chr>, countryCode <chr>,
## # processedAlignment <chr>, good <lgl>, lawful <lgl>, processedRace <chr>,
## # processedSpells <chr>, processedWeapons <chr>, alias <chr>
head(nyc_gifted_data)
## # A tibble: 6 × 14
## Timestamp...1 Entering.Grade.Level District Birth.Month OLSAT.Verbal.Score
## <chr> <chr> <dbl> <chr> <dbl>
## 1 2/14/2018 K 30 October 24
## 2 3/1/2018 K 30 February 25
## 3 3/27/2018 1 2 January 25
## 4 3/27/2018 1 2 May 25
## 5 3/27/2018 1 2 July 25
## 6 3/27/2018 1 2 November 25
## # ℹ 9 more variables: OLSAT.Verbal.Percentile <dbl>,
## # NNAT.Non.Verbal.Raw.Score <dbl>, NNAT.Non.Verbal.Percentile <dbl>,
## # Overall.Score <dbl>, School.Preferences <chr>, ...11 <lgl>,
## # School.Assigned <chr>, Will.you.enroll.there. <chr>, Timestamp...14 <chr>
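Pokemon Dataset: colSums(is.na()) reports zero missing values in every column, so drop_na() leaves the data unchanged. The total_stats column, which already exists in the raw file, is later recomputed as the sum of attack, defense, speed, sp_atk, and sp_def.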
colnames(pokemon_data)
## [1] "index" "name" "type1"
## [4] "type2" "ability1" "ability2"
## [7] "hidden_ability" "hp" "attack"
## [10] "defense" "sp_atk" "sp_def"
## [13] "speed" "total_stats" "legendary"
## [16] "mythical" "generation" "Smogon_VGC_Usage_2022"
## [19] "Smogon_VGC_Usage_2023" "Smogon_VGC_Usage_2024" "Worlds_VGC_Usage_2022"
## [22] "Worlds_VGC_Usage_2023" "Worlds_VGC_Usage_2024"
missing_pokemon <- colSums(is.na(pokemon_data))
missing_pokemon
## index name type1
## 0 0 0
## type2 ability1 ability2
## 0 0 0
## hidden_ability hp attack
## 0 0 0
## defense sp_atk sp_def
## 0 0 0
## speed total_stats legendary
## 0 0 0
## mythical generation Smogon_VGC_Usage_2022
## 0 0 0
## Smogon_VGC_Usage_2023 Smogon_VGC_Usage_2024 Worlds_VGC_Usage_2022
## 0 0 0
## Worlds_VGC_Usage_2023 Worlds_VGC_Usage_2024
## 0 0
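The missing-value check above relies on `is.na()` returning a logical matrix and `colSums()` counting the `TRUE` entries per column. A small sketch on made-up data (the values are illustrative only) shows the idea:

```r
# Toy data with one missing name and one missing hp value (illustrative only).
toy <- data.frame(
  name  = c("a", "b", NA),
  hp    = c(45, NA, 80),
  speed = c(45, 60, 80)
)

# Count of missing values per column.
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Share of missing values per column, often easier to compare across columns.
missing_share <- colMeans(is.na(toy))
print(missing_share)
```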
# Drop rows with missing values for simplicity, then check that the data are clean.
pokemon_data_cleaned <- pokemon_data %>%
  drop_na()
head(pokemon_data_cleaned)
## # A tibble: 6 × 23
## index name type1 type2 ability1 ability2 hidden_ability hp attack defense
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 bulba… grass pois… overgrow No_abil… chlorophyll 45 49 49
## 2 2 ivysa… grass pois… overgrow No_abil… chlorophyll 60 62 63
## 3 3 venus… grass pois… overgrow No_abil… chlorophyll 80 82 83
## 4 3 venus… grass pois… thick-f… No_abil… None 80 100 123
## 5 3 venus… grass pois… overgrow No_abil… chlorophyll 80 82 83
## 6 4 charm… fire No_t… blaze No_abil… solar-power 39 52 43
## # ℹ 13 more variables: sp_atk <dbl>, sp_def <dbl>, speed <dbl>,
## # total_stats <dbl>, legendary <lgl>, mythical <lgl>, generation <chr>,
## # Smogon_VGC_Usage_2022 <chr>, Smogon_VGC_Usage_2023 <chr>,
## # Smogon_VGC_Usage_2024 <chr>, Worlds_VGC_Usage_2022 <chr>,
## # Worlds_VGC_Usage_2023 <chr>, Worlds_VGC_Usage_2024 <chr>
#DND Dataset:
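Missing values were counted per column, then rows containing any NA were removed with drop_na(). Because several columns are missing in thousands of rows (good and lawful have 10,894 NAs each), this leaves zero rows, as the output below shows.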
colnames(dnd_data)
## [1] "ip" "finger" "hash"
## [4] "name" "race" "background"
## [7] "date" "class" "justClass"
## [10] "subclass" "level" "feats"
## [13] "HP" "AC" "Str"
## [16] "Dex" "Con" "Int"
## [19] "Wis" "Cha" "alignment"
## [22] "skills" "weapons" "spells"
## [25] "castingStat" "choices" "country"
## [28] "countryCode" "processedAlignment" "good"
## [31] "lawful" "processedRace" "processedSpells"
## [34] "processedWeapons" "alias"
missing_dnd <- colSums(is.na(dnd_data))
missing_dnd
## ip finger hash name
## 1957 830 0 234
## race background date class
## 24 20 0 0
## justClass subclass level feats
## 0 2618 0 7460
## HP AC Str Dex
## 0 0 0 0
## Con Int Wis Cha
## 0 0 0 0
## alignment skills weapons spells
## 7735 8 163 6030
## castingStat choices country countryCode
## 0 5098 3262 3262
## processedAlignment good lawful processedRace
## 7807 10894 10894 231
## processedSpells processedWeapons alias
## 6030 211 234
dnd_data_cleaned <- dnd_data %>%
  drop_na()
head(dnd_data_cleaned)
## # A tibble: 0 × 35
## # ℹ 35 variables: ip <chr>, finger <chr>, hash <chr>, name <chr>, race <chr>,
## # background <chr>, date <dttm>, class <chr>, justClass <chr>,
## # subclass <chr>, level <dbl>, feats <chr>, HP <dbl>, AC <dbl>, Str <dbl>,
## # Dex <dbl>, Con <dbl>, Int <dbl>, Wis <dbl>, Cha <dbl>, alignment <chr>,
## # skills <chr>, weapons <chr>, spells <chr>, castingStat <chr>,
## # choices <chr>, country <chr>, countryCode <chr>, processedAlignment <chr>,
## # good <lgl>, lawful <lgl>, processedRace <chr>, processedSpells <chr>, …
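The empty result above follows from how `drop_na()` works: called without column arguments, it drops a row if any column is NA. One possible alternative, sketched here on made-up data, is to name only the columns the analysis actually needs. The toy columns mirror a few DND column names, but the values and the choice of `name` and `level` as required columns are assumptions for illustration.

```r
library(tidyr)

# Toy stand-in for the DND table (values are invented for illustration).
toy_dnd <- data.frame(
  name  = c("Aria", "Borin", NA),
  level = c(3, 5, 1),
  feats = c(NA, "Lucky", NA),
  good  = c(NA, NA, NA)   # a column that is missing in every row
)

# No column arguments: any NA anywhere removes the row.
all_cols <- drop_na(toy_dnd)

# Named columns: only NAs in name or level remove a row.
key_cols <- drop_na(toy_dnd, name, level)

print(nrow(all_cols))
print(nrow(key_cols))
```

Which columns count as required depends on the question being asked, so the selection should be made per analysis rather than once for the whole table.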
#NYC Gifted Dataset:
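Rows with missing values were dropped and the Entering Grade Level column was copied into a new factor column, grade. pivot_longer() was then used to stack the two School columns (School.Preferences and School.Assigned) into school_type and school_value pairs. Note that drop_na() leaves zero rows here, because every row has an NA in at least one column (for example, ...11 has 104 missing values).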
colnames(nyc_gifted_data)
## [1] "Timestamp...1" "Entering.Grade.Level"
## [3] "District" "Birth.Month"
## [5] "OLSAT.Verbal.Score" "OLSAT.Verbal.Percentile"
## [7] "NNAT.Non.Verbal.Raw.Score" "NNAT.Non.Verbal.Percentile"
## [9] "Overall.Score" "School.Preferences"
## [11] "...11" "School.Assigned"
## [13] "Will.you.enroll.there." "Timestamp...14"
missing_nyc <- colSums(is.na(nyc_gifted_data))
missing_nyc
## Timestamp...1 Entering.Grade.Level
## 0 0
## District Birth.Month
## 0 0
## OLSAT.Verbal.Score OLSAT.Verbal.Percentile
## 0 0
## NNAT.Non.Verbal.Raw.Score NNAT.Non.Verbal.Percentile
## 0 0
## Overall.Score School.Preferences
## 0 28
## ...11 School.Assigned
## 104 83
## Will.you.enroll.there. Timestamp...14
## 61 44
nyc_gifted_data_cleaned <- nyc_gifted_data %>%
  drop_na() %>%
  mutate(grade = as.factor(`Entering.Grade.Level`))
head(nyc_gifted_data_cleaned)
## # A tibble: 0 × 15
## # ℹ 15 variables: Timestamp...1 <chr>, Entering.Grade.Level <chr>,
## # District <dbl>, Birth.Month <chr>, OLSAT.Verbal.Score <dbl>,
## # OLSAT.Verbal.Percentile <dbl>, NNAT.Non.Verbal.Raw.Score <dbl>,
## # NNAT.Non.Verbal.Percentile <dbl>, Overall.Score <dbl>,
## # School.Preferences <chr>, ...11 <lgl>, School.Assigned <chr>,
## # Will.you.enroll.there. <chr>, Timestamp...14 <chr>, grade <fct>
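The zero-row tibble above is what `drop_na()` produces when a column such as `...11` contains no usable values. A hedged sketch of one way around this, on made-up data: remove columns that are entirely NA first, then require only the columns of interest. The column `empty_col` stands in for `...11`, and treating `Overall.Score` as the only required column is an assumption.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the NYC table (values are invented for illustration).
toy_nyc <- data.frame(
  Entering.Grade.Level = c("K", "1", "1"),
  Overall.Score        = c(99, 97, 98),
  empty_col            = c(NA, NA, NA),        # stands in for `...11`
  School.Assigned      = c("PS 1", NA, "PS 2")
)

cleaned <- toy_nyc %>%
  select(where(~ !all(is.na(.x)))) %>%  # drop columns with no data at all
  drop_na(Overall.Score) %>%            # require only the column under study
  mutate(grade = factor(Entering.Grade.Level, levels = c("K", "1")))

print(cleaned)
```

Setting the factor levels explicitly keeps "K" ahead of "1" in plots and summaries, which `as.factor()` would otherwise order alphabetically.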
nyc_long_data <- nyc_gifted_data_cleaned %>%
  pivot_longer(cols = starts_with("School"),
               names_to = "school_type",
               values_to = "school_value")
head(nyc_long_data)
## # A tibble: 0 × 15
## # ℹ 15 variables: Timestamp...1 <chr>, Entering.Grade.Level <chr>,
## # District <dbl>, Birth.Month <chr>, OLSAT.Verbal.Score <dbl>,
## # OLSAT.Verbal.Percentile <dbl>, NNAT.Non.Verbal.Raw.Score <dbl>,
## # NNAT.Non.Verbal.Percentile <dbl>, Overall.Score <dbl>, ...11 <lgl>,
## # Will.you.enroll.there. <chr>, Timestamp...14 <chr>, grade <fct>,
## # school_type <chr>, school_value <chr>
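Since the reshaped table above has no rows, the effect of `pivot_longer()` is hard to see. The sketch below applies the same call to a two-row toy table (invented values) so the before and after shapes are visible.

```r
library(tidyr)

# Toy table with the two School columns (values are invented for illustration).
toy <- data.frame(
  District           = c(2, 30),
  School.Preferences = c("PS 1; PS 2", "PS 3"),
  School.Assigned    = c("PS 1", "PS 3")
)

# Each School.* column becomes one row per original row:
# the old column name goes to school_type, its value to school_value.
long <- pivot_longer(toy,
                     cols      = starts_with("School"),
                     names_to  = "school_type",
                     values_to = "school_value")
print(long)
```

Two input rows with two School columns give four output rows, and `District` is repeated for each pair.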
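Recompute total_stats as the sum of attack, defense, speed, sp_atk, and sp_def (hp is not included), overwriting the existing column, then inspect the result.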
pokemon_data_cleaned <- pokemon_data_cleaned %>%
  mutate(total_stats = attack + defense + speed + sp_atk + sp_def)
head(pokemon_data_cleaned)
## # A tibble: 6 × 23
## index name type1 type2 ability1 ability2 hidden_ability hp attack defense
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 bulba… grass pois… overgrow No_abil… chlorophyll 45 49 49
## 2 2 ivysa… grass pois… overgrow No_abil… chlorophyll 60 62 63
## 3 3 venus… grass pois… overgrow No_abil… chlorophyll 80 82 83
## 4 3 venus… grass pois… thick-f… No_abil… None 80 100 123
## 5 3 venus… grass pois… overgrow No_abil… chlorophyll 80 82 83
## 6 4 charm… fire No_t… blaze No_abil… solar-power 39 52 43
## # ℹ 13 more variables: sp_atk <dbl>, sp_def <dbl>, speed <dbl>,
## # total_stats <dbl>, legendary <lgl>, mythical <lgl>, generation <chr>,
## # Smogon_VGC_Usage_2022 <chr>, Smogon_VGC_Usage_2023 <chr>,
## # Smogon_VGC_Usage_2024 <chr>, Worlds_VGC_Usage_2022 <chr>,
## # Worlds_VGC_Usage_2023 <chr>, Worlds_VGC_Usage_2024 <chr>
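The `mutate()` call above sums five of the six base stats and leaves out `hp`. The sketch below, on two invented rows, computes the total both ways so the difference is explicit. The names `stats_without_hp` and `stats_with_hp` are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Two invented rows (names and values are for illustration only).
toy_pokemon <- data.frame(
  name    = c("mon_a", "mon_b"),
  hp      = c(45, 80),
  attack  = c(49, 100),
  defense = c(49, 123),
  sp_atk  = c(65, 122),
  sp_def  = c(65, 120),
  speed   = c(45, 80)
)

toy_pokemon <- toy_pokemon %>%
  mutate(
    # Same formula as the analysis: hp is excluded.
    stats_without_hp = attack + defense + speed + sp_atk + sp_def,
    # Variant that includes hp, for comparison.
    stats_with_hp    = hp + stats_without_hp
  )

print(toy_pokemon[, c("name", "stats_without_hp", "stats_with_hp")])
```

Whether `hp` belongs in the total depends on the intended meaning of the variable, so it is worth stating the choice next to the code.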
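Basic summary statistics were generated for the NYC Gifted, Pokémon, and DND data using the summary() function. Because the NYC and DND tables have zero rows after drop_na(), only the Pokémon summary contains values.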
summary(nyc_gifted_data_cleaned)
## Timestamp...1 Entering.Grade.Level District Birth.Month
## Length:0 Length:0 Min. : NA Length:0
## Class :character Class :character 1st Qu.: NA Class :character
## Mode :character Mode :character Median : NA Mode :character
## Mean :NaN
## 3rd Qu.: NA
## Max. : NA
## OLSAT.Verbal.Score OLSAT.Verbal.Percentile NNAT.Non.Verbal.Raw.Score
## Min. : NA Min. : NA Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA
## NNAT.Non.Verbal.Percentile Overall.Score School.Preferences ...11
## Min. : NA Min. : NA Length:0 Mode:logical
## 1st Qu.: NA 1st Qu.: NA Class :character
## Median : NA Median : NA Mode :character
## Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA
## School.Assigned Will.you.enroll.there. Timestamp...14 grade
## Length:0 Length:0 Length:0 NULL:
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
summary(pokemon_data_cleaned)
## index name type1 type2
## Min. : 1.0 Length:1303 Length:1303 Length:1303
## 1st Qu.: 234.5 Class :character Class :character Class :character
## Median : 511.0 Mode :character Mode :character Mode :character
## Mean : 507.0
## 3rd Qu.: 774.0
## Max. :1025.0
## ability1 ability2 hidden_ability hp
## Length:1303 Length:1303 Length:1303 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.: 54.00
## Mode :character Mode :character Mode :character Median : 70.00
## Mean : 71.35
## 3rd Qu.: 85.00
## Max. :255.00
## attack defense sp_atk sp_def
## Min. : 5.00 Min. : 5.00 Min. : 10.00 Min. : 20.00
## 1st Qu.: 58.00 1st Qu.: 53.00 1st Qu.: 50.00 1st Qu.: 52.00
## Median : 80.00 Median : 70.00 Median : 65.00 Median : 70.00
## Mean : 81.59 Mean : 75.24 Mean : 73.68 Mean : 72.97
## 3rd Qu.:100.00 3rd Qu.: 95.00 3rd Qu.: 95.00 3rd Qu.: 90.00
## Max. :190.00 Max. :250.00 Max. :194.00 Max. :250.00
## speed total_stats legendary mythical
## Min. : 5.00 Min. :120.0 Mode :logical Mode :logical
## 1st Qu.: 47.50 1st Qu.:285.0 FALSE:1186 FALSE:1269
## Median : 70.00 Median :390.0 TRUE :117 TRUE :34
## Mean : 71.13 Mean :374.6
## 3rd Qu.: 92.00 3rd Qu.:440.0
## Max. :200.00 Max. :870.0
## generation Smogon_VGC_Usage_2022 Smogon_VGC_Usage_2023
## Length:1303 Length:1303 Length:1303
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Smogon_VGC_Usage_2024 Worlds_VGC_Usage_2022 Worlds_VGC_Usage_2023
## Length:1303 Length:1303 Length:1303
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Worlds_VGC_Usage_2024
## Length:1303
## Class :character
## Mode :character
##
##
##
summary(dnd_data_cleaned)
## ip finger hash name
## Length:0 Length:0 Length:0 Length:0
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## race background date class
## Length:0 Length:0 Min. :NA Length:0
## Class :character Class :character 1st Qu.:NA Class :character
## Mode :character Mode :character Median :NA Mode :character
## Mean :NaN
## 3rd Qu.:NA
## Max. :NA
## justClass subclass level feats
## Length:0 Length:0 Min. : NA Length:0
## Class :character Class :character 1st Qu.: NA Class :character
## Mode :character Mode :character Median : NA Mode :character
## Mean :NaN
## 3rd Qu.: NA
## Max. : NA
## HP AC Str Dex Con
## Min. : NA Min. : NA Min. : NA Min. : NA Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA Max. : NA Max. : NA
## Int Wis Cha alignment
## Min. : NA Min. : NA Min. : NA Length:0
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA Class :character
## Median : NA Median : NA Median : NA Mode :character
## Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA
## skills weapons spells castingStat
## Length:0 Length:0 Length:0 Length:0
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## choices country countryCode processedAlignment
## Length:0 Length:0 Length:0 Length:0
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## good lawful processedRace processedSpells
## Mode:logical Mode:logical Length:0 Length:0
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## processedWeapons alias
## Length:0 Length:0
## Class :character Class :character
## Mode :character Mode :character
##
##
##
#Pokemon Dataset:
A histogram was plotted to show the distribution of the total_stats variable.
ggplot(pokemon_data_cleaned, aes(x = total_stats)) +
  geom_histogram(binwidth = 10, fill = "orange", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Total Stats for Pokémon",
       x = "Total Stats",
       y = "Frequency") +
  theme_minimal()
The cleaned datasets were saved to new CSV files for further use or sharing.
write_csv(pokemon_data_cleaned, "C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\pokemon_data_cleaned.csv")
write_csv(nyc_gifted_data_cleaned, "C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\nyc_gifted_data_cleaned.csv")
write_csv(dnd_data_cleaned, "C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\dnd_data_cleaned.csv")
#Accessing the cleaned data for analysis
pokemon_data <- read_csv ("C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\pokemon_data_cleaned.csv")
## Rows: 1303 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): name, type1, type2, ability1, ability2, hidden_ability, generation...
## dbl (8): index, hp, attack, defense, sp_atk, sp_def, speed, total_stats
## lgl (2): legendary, mythical
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(pokemon_data)
## # A tibble: 6 × 23
## index name type1 type2 ability1 ability2 hidden_ability hp attack defense
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 bulba… grass pois… overgrow No_abil… chlorophyll 45 49 49
## 2 2 ivysa… grass pois… overgrow No_abil… chlorophyll 60 62 63
## 3 3 venus… grass pois… overgrow No_abil… chlorophyll 80 82 83
## 4 3 venus… grass pois… thick-f… No_abil… None 80 100 123
## 5 3 venus… grass pois… overgrow No_abil… chlorophyll 80 82 83
## 6 4 charm… fire No_t… blaze No_abil… solar-power 39 52 43
## # ℹ 13 more variables: sp_atk <dbl>, sp_def <dbl>, speed <dbl>,
## # total_stats <dbl>, legendary <lgl>, mythical <lgl>, generation <chr>,
## # Smogon_VGC_Usage_2022 <chr>, Smogon_VGC_Usage_2023 <chr>,
## # Smogon_VGC_Usage_2024 <chr>, Worlds_VGC_Usage_2022 <chr>,
## # Worlds_VGC_Usage_2023 <chr>, Worlds_VGC_Usage_2024 <chr>
nyc_gifted_data <- read_csv ("C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\nyc_gifted_data_cleaned.csv")
## Rows: 0 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (15): Timestamp...1, Entering.Grade.Level, District, Birth.Month, OLSAT....
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nyc_gifted_data)
## # A tibble: 0 × 15
## # ℹ 15 variables: Timestamp...1 <chr>, Entering.Grade.Level <chr>,
## # District <chr>, Birth.Month <chr>, OLSAT.Verbal.Score <chr>,
## # OLSAT.Verbal.Percentile <chr>, NNAT.Non.Verbal.Raw.Score <chr>,
## # NNAT.Non.Verbal.Percentile <chr>, Overall.Score <chr>,
## # School.Preferences <chr>, ...11 <chr>, School.Assigned <chr>,
## # Will.you.enroll.there. <chr>, Timestamp...14 <chr>, grade <chr>
dnd_data <- read_csv("C:\\Users\\wduro\\OneDrive - City University of New York\\DATA607\\dnd_data_cleaned.csv")
## Rows: 0 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (35): ip, finger, hash, name, race, background, date, class, justClass, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dnd_data)
## # A tibble: 0 × 35
## # ℹ 35 variables: ip <chr>, finger <chr>, hash <chr>, name <chr>, race <chr>,
## # background <chr>, date <chr>, class <chr>, justClass <chr>, subclass <chr>,
## # level <chr>, feats <chr>, HP <chr>, AC <chr>, Str <chr>, Dex <chr>,
## # Con <chr>, Int <chr>, Wis <chr>, Cha <chr>, alignment <chr>, skills <chr>,
## # weapons <chr>, spells <chr>, castingStat <chr>, choices <chr>,
## # country <chr>, countryCode <chr>, processedAlignment <chr>, good <chr>,
## # lawful <chr>, processedRace <chr>, processedSpells <chr>, …
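The reloaded NYC and DND tables above come back with every column typed as `chr`, because a CSV with no data rows gives `read_csv()` nothing to guess types from. A minimal sketch of one remedy is to declare the types with `col_types`. The two-column table and the temporary file are illustrative assumptions, not the original files.

```r
library(readr)

# An empty table with known column types (illustrative stand-in).
empty <- data.frame(District = numeric(0), Birth.Month = character(0))

# Write it to a temporary file so the sketch leaves no files behind.
path <- tempfile(fileext = ".csv")
write_csv(empty, path)

# Declaring the types up front means they no longer depend on the data rows.
typed <- read_csv(path,
                  col_types = cols(District    = col_double(),
                                   Birth.Month = col_character()))

print(class(typed$District))
```

Another option, if the file only needs to be read back into R, is `saveRDS()` and `readRDS()`, which store column types (including factors) along with the data.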
Ensuring data are tidy and transformed.
pokemon_data_cleaned <- pokemon_data %>%
  drop_na() %>%
  mutate(total_stats = attack + defense + speed + sp_atk + sp_def)
head(pokemon_data_cleaned)
## # A tibble: 6 × 23
## index name type1 type2 ability1 ability2 hidden_ability hp attack defense
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 bulba… grass pois… overgrow No_abil… chlorophyll 45 49 49
## 2 2 ivysa… grass pois… overgrow No_abil… chlorophyll 60 62 63
## 3 3 venus… grass pois… overgrow No_abil… chlorophyll 80 82 83
## 4 3 venus… grass pois… thick-f… No_abil… None 80 100 123
## 5 3 venus… grass pois… overgrow No_abil… chlorophyll 80 82 83
## 6 4 charm… fire No_t… blaze No_abil… solar-power 39 52 43
## # ℹ 13 more variables: sp_atk <dbl>, sp_def <dbl>, speed <dbl>,
## # total_stats <dbl>, legendary <lgl>, mythical <lgl>, generation <chr>,
## # Smogon_VGC_Usage_2022 <chr>, Smogon_VGC_Usage_2023 <chr>,
## # Smogon_VGC_Usage_2024 <chr>, Worlds_VGC_Usage_2022 <chr>,
## # Worlds_VGC_Usage_2023 <chr>, Worlds_VGC_Usage_2024 <chr>
We’ll check for missing values and remove rows with NA. Additionally, we can transform the dataset (e.g., for specific character attributes).
dnd_data_cleaned <- dnd_data %>%
  drop_na()
head(dnd_data_cleaned)
## # A tibble: 0 × 35
## # ℹ 35 variables: ip <chr>, finger <chr>, hash <chr>, name <chr>, race <chr>,
## # background <chr>, date <chr>, class <chr>, justClass <chr>, subclass <chr>,
## # level <chr>, feats <chr>, HP <chr>, AC <chr>, Str <chr>, Dex <chr>,
## # Con <chr>, Int <chr>, Wis <chr>, Cha <chr>, alignment <chr>, skills <chr>,
## # weapons <chr>, spells <chr>, castingStat <chr>, choices <chr>,
## # country <chr>, countryCode <chr>, processedAlignment <chr>, good <chr>,
## # lawful <chr>, processedRace <chr>, processedSpells <chr>, …
We’ll ensure that the Entering Grade Level column is properly converted to a factor and handle any other transformations needed.
nyc_gifted_data_cleaned <- nyc_gifted_data %>%
  drop_na() %>%
  mutate(grade = as.factor(`Entering.Grade.Level`))
head(nyc_gifted_data_cleaned)
## # A tibble: 0 × 15
## # ℹ 15 variables: Timestamp...1 <chr>, Entering.Grade.Level <chr>,
## # District <chr>, Birth.Month <chr>, OLSAT.Verbal.Score <chr>,
## # OLSAT.Verbal.Percentile <chr>, NNAT.Non.Verbal.Raw.Score <chr>,
## # NNAT.Non.Verbal.Percentile <chr>, Overall.Score <chr>,
## # School.Preferences <chr>, ...11 <chr>, School.Assigned <chr>,
## # Will.you.enroll.there. <chr>, Timestamp...14 <chr>, grade <fct>
The goal of this project was to clean and transform three datasets. The project involved data cleaning, handling missing values, transforming variables, and performing some basic analysis and visualization. The cleaned datasets were saved for future reference and additional analysis.
#Key Learnings:
Data Cleaning: Missing data were handled, column names were standardized, and data were reshaped for easier analysis.
Transformation: A new variable (total_stats) was computed to support the analysis.
Visualizations: A histogram of the Pokémon dataset shows the distribution of total_stats.
Tidy Data Principles: pivot_longer was used to put the NYC school columns into a “long” format.