Preparing different datasets for downstream analysis work

Task

Overview

Seen below is the initial shape of data on load. Unsparingly, it is exactly the same as the original image. With it now loaded into R, we can begin discussing and then fixing the issues with our data

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!

  3. Perform the analysis requested in the discussion item

  4. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

  • The URL to the .Rmd file in your GitHub repository. and
  • The URL for your rpubs.com web page.

Packages

#Packages used

library(tidytext)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(plotly)
library(stringr)
library(DT)

Dataset

Dataset #1

Maria A Ginorio

Undefeated Boxers Data

boxing <-read.csv("https://raw.githubusercontent.com/mgino11/Tidy_project2/main/Project2.csv",stringsAsFactors = FALSE)

datatable(boxing)
Dataset #2

Euclides Rodriguez

study <- read.csv("https://raw.githubusercontent.com/engine2031/Data-607/main/Project%202_Privacy%20Suppresion%20Education%20Data2.csv")

datatable(study)
Dataset #3

Peter Gatica

nba_stats <- read.csv("https://github.com/audiorunner13/Masters-Coursework/blob/main/DATA607%20Spring%202021/Week6/Project2/Data/NBA_Player_Stats.txt")

datatable(nba_stats)

Tidying Data

Boxing

Clean data

Glimpse the data we can see it has 31 rows and 8 columns all in chr

glimpse(boxing)
## Rows: 31
## Columns: 8
## $ Country       <chr> "United States", "United Kingdom ", "", "", "", "", "Sou~
## $ Boxer         <chr> "Jimmy Barry", "Joe Calzaghe", "", "", "", "", "Kim Ji-w~
## $ Record        <chr> "58–0–10", "46–0–0", "", "", "", "", "16–0–2", "28–0–0",~
## $ Weight.Class  <chr> "Bantamweight", "Super middleweight", "", "", "", "Light~
## $ Title         <chr> "World", "WBO", "IBF", "The Ring, lineal", "WBA, WBC", "~
## $ Champ.Year    <chr> "1894-1919", "1997–2008", "2006", "2006–2008", "2007–200~
## $ Title.Defense <chr> "4 withdraws by opponent", "21", "1", "3", "0", "1", "4"~
## $ Notes         <chr> "", "Vacated title to move up to light heavyweight.", "V~

Lets start by selecting only the columns we are interested in working on. In this case we will create a subset of the data and drop Champ year, title defense and Notes.

tidy_box <- boxing %>%
  select(-Title, -Title.Defense, -Notes, -Champ.Year)

datatable(tidy_box)

Cleaning Records

  1. The first task turning empty spaces into NA
tidy_box %>%
  select(Country:Weight.Class) %>%
  na_if("")
##            Country                 Boxer  Record        Weight.Class
## 1    United States           Jimmy Barry 58–0–10        Bantamweight
## 2  United Kingdom           Joe Calzaghe  46–0–0  Super middleweight
## 3             <NA>                  <NA>    <NA>                <NA>
## 4             <NA>                  <NA>    <NA>                <NA>
## 5             <NA>                  <NA>    <NA>                <NA>
## 6             <NA>                  <NA>    <NA>   Light heavyweight
## 7      South Korea            Kim Ji-won  16–0–2  Super bantamweight
## 8          Romania             Mihai Leu  28–0–0        Welterweight
## 9           Mexico         Ricardo López  51–0–1       Minimumweight
## 10            <NA>                  <NA>    <NA>                <NA>
## 11            <NA>                  <NA>    <NA>                <NA>
## 12            <NA>                  <NA>    <NA>     Light flyweight
## 13   United States        Rocky Marciano  49–0–0         Heavyweight
## 14 United Kingdom            Terry Marsh  26–0–1  Light welterweight
## 15   United States  Floyd Mayweather Jr.  50–0–0 Super featherweight
## 16            <NA>                  <NA>    <NA>         Lightweight
## 17            <NA>                  <NA>    <NA>  Light welterweight
## 18            <NA>                  <NA>    <NA>        Welterweight
## 19            <NA>                  <NA>    <NA>  Light middleweight
## 20   United States        Jack McAuliffe 28–0–10         Lightweight
## 21         Germany            Sven Ottke  34–0–0  Super middleweight
## 22            <NA>                  <NA>    <NA>                <NA>
## 23          Russia          Dmitry Pirog  20–0–0        Middleweight
## 24         Namibia           Harry Simon  31–0–0  Light middleweight
## 25            <NA>                  <NA>    <NA>        Middleweight
## 26       Thailand  Pichit Sitbangprachan  24–0–0           Flyweight
## 27      Venezuela           Edwin Valero  27–0–0 Super featherweight
## 28            <NA>                  <NA>    <NA>         Lightweight
## 29   United States            Andre Ward  32–0–0  Super middleweight
## 30            <NA>                  <NA>    <NA>                <NA>
## 31            <NA>                  <NA>    <NA>   Light heavyweight

2 ) Fix the column Record - format wins/losses/draws

58 = wins 0 = losses 10 = draws

box_df <-tidy_box %>%
  separate(Record, c("wins","loss", "draws"), sep = "[^0-9]", convert = TRUE, remove = TRUE)
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 16 rows [3, 4, 5,
## 6, 10, 11, 12, 16, 17, 18, 19, 22, 25, 28, 30, 31].
box_df
##            Country                 Boxer wins loss draws        Weight.Class
## 1    United States           Jimmy Barry   58    0    10        Bantamweight
## 2  United Kingdom           Joe Calzaghe   46    0     0  Super middleweight
## 3                                          NA   NA    NA                    
## 4                                          NA   NA    NA                    
## 5                                          NA   NA    NA                    
## 6                                          NA   NA    NA   Light heavyweight
## 7      South Korea            Kim Ji-won   16    0     2  Super bantamweight
## 8          Romania             Mihai Leu   28    0     0        Welterweight
## 9           Mexico         Ricardo López   51    0     1       Minimumweight
## 10                                         NA   NA    NA                    
## 11                                         NA   NA    NA                    
## 12                                         NA   NA    NA     Light flyweight
## 13   United States        Rocky Marciano   49    0     0         Heavyweight
## 14 United Kingdom            Terry Marsh   26    0     1  Light welterweight
## 15   United States  Floyd Mayweather Jr.   50    0     0 Super featherweight
## 16                                         NA   NA    NA         Lightweight
## 17                                         NA   NA    NA  Light welterweight
## 18                                         NA   NA    NA        Welterweight
## 19                                         NA   NA    NA  Light middleweight
## 20   United States        Jack McAuliffe   28    0    10         Lightweight
## 21         Germany            Sven Ottke   34    0     0  Super middleweight
## 22                                         NA   NA    NA                    
## 23          Russia          Dmitry Pirog   20    0     0        Middleweight
## 24         Namibia           Harry Simon   31    0     0  Light middleweight
## 25                                         NA   NA    NA        Middleweight
## 26       Thailand  Pichit Sitbangprachan   24    0     0           Flyweight
## 27      Venezuela           Edwin Valero   27    0     0 Super featherweight
## 28                                         NA   NA    NA         Lightweight
## 29   United States            Andre Ward   32    0     0  Super middleweight
## 30                                         NA   NA    NA                    
## 31                                         NA   NA    NA   Light heavyweight
  1. Now we reorganize the data using pivot longer in order to have a better view the country, the boxer, the score and the count of wins, losses or draws in the category of world champion.
box1 <- box_df %>%
  pivot_longer(box_df,
               cols = c(wins:draws),
               names_to = "score",
               values_to = "times",
               values_drop_na = T)
box1
## # A tibble: 45 x 5
##    Country         Boxer        Weight.Class       score times
##    <chr>           <chr>        <chr>              <chr> <int>
##  1 United States   Jimmy Barry  Bantamweight       wins     58
##  2 United States   Jimmy Barry  Bantamweight       loss      0
##  3 United States   Jimmy Barry  Bantamweight       draws    10
##  4 United Kingdom  Joe Calzaghe Super middleweight wins     46
##  5 United Kingdom  Joe Calzaghe Super middleweight loss      0
##  6 United Kingdom  Joe Calzaghe Super middleweight draws     0
##  7 South Korea     Kim Ji-won   Super bantamweight wins     16
##  8 South Korea     Kim Ji-won   Super bantamweight loss      0
##  9 South Korea     Kim Ji-won   Super bantamweight draws     2
## 10 Romania         Mihai Leu    Welterweight       wins     28
## # ... with 35 more rows
  1. Now that we have the data organized and tidy we can proceed to investigate more about the data we have
  • How Many boxers come from the US?
box1 %>%
  count(Country, sort = T)
## # A tibble: 10 x 2
##    Country             n
##    <chr>           <int>
##  1 United States      15
##  2 United Kingdom      6
##  3 Germany             3
##  4 Mexico              3
##  5 Namibia             3
##  6 Romania             3
##  7 Russia              3
##  8 South Korea         3
##  9 Thailand            3
## 10 Venezuela           3

what weight class is the most prominent among world champions?

box1 %>%
  count(Weight.Class, sort = T)
## # A tibble: 12 x 2
##    Weight.Class            n
##    <chr>               <int>
##  1 Super middleweight      9
##  2 Super featherweight     6
##  3 Bantamweight            3
##  4 Flyweight               3
##  5 Heavyweight             3
##  6 Light middleweight      3
##  7 Light welterweight      3
##  8 Lightweight             3
##  9 Middleweight            3
## 10 Minimumweight           3
## 11 Super bantamweight      3
## 12 Welterweight            3

In order to obtain arithmetic calculations we need to use pivot wider to be able to compute the information from the data.

undefeated_boxers <- box1 %>%
 pivot_wider(names_from = score, values_from = times)

undefeated_boxers
## # A tibble: 15 x 6
##    Country         Boxer                 Weight.Class         wins  loss draws
##    <chr>           <chr>                 <chr>               <int> <int> <int>
##  1 United States   Jimmy Barry           Bantamweight           58     0    10
##  2 United Kingdom  Joe Calzaghe          Super middleweight     46     0     0
##  3 South Korea     Kim Ji-won            Super bantamweight     16     0     2
##  4 Romania         Mihai Leu             Welterweight           28     0     0
##  5 Mexico          Ricardo López         Minimumweight          51     0     1
##  6 United States   Rocky Marciano        Heavyweight            49     0     0
##  7 United Kingdom  Terry Marsh           Light welterweight     26     0     1
##  8 United States   Floyd Mayweather Jr.  Super featherweight    50     0     0
##  9 United States   Jack McAuliffe        Lightweight            28     0    10
## 10 Germany         Sven Ottke            Super middleweight     34     0     0
## 11 Russia          Dmitry Pirog          Middleweight           20     0     0
## 12 Namibia         Harry Simon           Light middleweight     31     0     0
## 13 Thailand        Pichit Sitbangprachan Flyweight              24     0     0
## 14 Venezuela       Edwin Valero          Super featherweight    27     0     0
## 15 United States   Andre Ward            Super middleweight     32     0     0

We calculate the average wins per boxer and the average draws.

undefeated_boxers %>%
  mutate(avg_wins = wins/wins+loss+draws,
         avg_draws = draws/wins+loss+draws) %>%
  arrange(desc(avg_wins))
## # A tibble: 15 x 8
##    Country     Boxer        Weight.Class     wins  loss draws avg_wins avg_draws
##    <chr>       <chr>        <chr>           <int> <int> <int>    <dbl>     <dbl>
##  1 United Sta~ Jimmy Barry  Bantamweight       58     0    10       11     10.2 
##  2 United Sta~ Jack McAuli~ Lightweight        28     0    10       11     10.4 
##  3 South Korea Kim Ji-won   Super bantamwe~    16     0     2        3      2.12
##  4 Mexico      Ricardo Lóp~ Minimumweight      51     0     1        2      1.02
##  5 United Kin~ Terry Marsh  Light welterwe~    26     0     1        2      1.04
##  6 United Kin~ Joe Calzaghe Super middlewe~    46     0     0        1      0   
##  7 Romania     Mihai Leu    Welterweight       28     0     0        1      0   
##  8 United Sta~ Rocky Marci~ Heavyweight        49     0     0        1      0   
##  9 United Sta~ Floyd Maywe~ Super featherw~    50     0     0        1      0   
## 10 Germany     Sven Ottke   Super middlewe~    34     0     0        1      0   
## 11 Russia      Dmitry Pirog Middleweight       20     0     0        1      0   
## 12 Namibia     Harry Simon  Light middlewe~    31     0     0        1      0   
## 13 Thailand    Pichit Sitb~ Flyweight          24     0     0        1      0   
## 14 Venezuela   Edwin Valero Super featherw~    27     0     0        1      0   
## 15 United Sta~ Andre Ward   Super middlewe~    32     0     0        1      0

Graphs

Boxing

I want to interpret graphically the amount of wins by boxer and filled by country

undefeated_boxers %>%
  ggplot(aes(Boxer,wins,
             color = Country))+ 
  geom_point()

2) Interpret graphically the wins by weight class we can see how light weight and minimum weight are the ones with the most wins however, most undefeated champions do not belong to middleweight

undefeated_boxers %>%
  ggplot(aes(wins, color = Weight.Class))+
  geom_histogram(binwidth = 2, size = 1)+
  labs(title = "Wins by Weightclass")

Conclusion

Boxing

Thanks to the data wrangling I was able to determine that: 1) Most undefeated boxers come from the US followed by the UK 2) the most prominent weight class for undefeated boxers is middle weight and super feather weight. 3)The average win for an undefeated boxer is 11 wins and 10 draws.