Seen below is the initial shape of data on load. Unsparingly, it is exactly the same as the original image. With it now loaded into R, we can begin discussing and then fixing the issues with our data
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!
Perform the analysis requested in the discussion item
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
#Packages used
library(tidytext)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(plotly)
library(stringr)
library(DT)
Maria A Ginorio
boxing <-read.csv("https://raw.githubusercontent.com/mgino11/Tidy_project2/main/Project2.csv",stringsAsFactors = FALSE)
datatable(boxing)
Euclides Rodriguez
study <- read.csv("https://raw.githubusercontent.com/engine2031/Data-607/main/Project%202_Privacy%20Suppresion%20Education%20Data2.csv")
datatable(study)
Peter Gatica
nba_stats <- read.csv("https://github.com/audiorunner13/Masters-Coursework/blob/main/DATA607%20Spring%202021/Week6/Project2/Data/NBA_Player_Stats.txt")
datatable(nba_stats)
Clean data
Glimpse the data we can see it has 31 rows and 8 columns all in chr
glimpse(boxing)
## Rows: 31
## Columns: 8
## $ Country <chr> "United States", "United Kingdom ", "", "", "", "", "Sou~
## $ Boxer <chr> "Jimmy Barry", "Joe Calzaghe", "", "", "", "", "Kim Ji-w~
## $ Record <chr> "58–0–10", "46–0–0", "", "", "", "", "16–0–2", "28–0–0",~
## $ Weight.Class <chr> "Bantamweight", "Super middleweight", "", "", "", "Light~
## $ Title <chr> "World", "WBO", "IBF", "The Ring, lineal", "WBA, WBC", "~
## $ Champ.Year <chr> "1894-1919", "1997–2008", "2006", "2006–2008", "2007–200~
## $ Title.Defense <chr> "4 withdraws by opponent", "21", "1", "3", "0", "1", "4"~
## $ Notes <chr> "", "Vacated title to move up to light heavyweight.", "V~
Lets start by selecting only the columns we are interested in working on. In this case we will create a subset of the data and drop Champ year, title defense and Notes.
tidy_box <- boxing %>%
select(-Title, -Title.Defense, -Notes, -Champ.Year)
datatable(tidy_box)
Cleaning Records
tidy_box %>%
select(Country:Weight.Class) %>%
na_if("")
## Country Boxer Record Weight.Class
## 1 United States Jimmy Barry 58–0–10 Bantamweight
## 2 United Kingdom Joe Calzaghe 46–0–0 Super middleweight
## 3 <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA>
## 6 <NA> <NA> <NA> Light heavyweight
## 7 South Korea Kim Ji-won 16–0–2 Super bantamweight
## 8 Romania Mihai Leu 28–0–0 Welterweight
## 9 Mexico Ricardo López 51–0–1 Minimumweight
## 10 <NA> <NA> <NA> <NA>
## 11 <NA> <NA> <NA> <NA>
## 12 <NA> <NA> <NA> Light flyweight
## 13 United States Rocky Marciano 49–0–0 Heavyweight
## 14 United Kingdom Terry Marsh 26–0–1 Light welterweight
## 15 United States Floyd Mayweather Jr. 50–0–0 Super featherweight
## 16 <NA> <NA> <NA> Lightweight
## 17 <NA> <NA> <NA> Light welterweight
## 18 <NA> <NA> <NA> Welterweight
## 19 <NA> <NA> <NA> Light middleweight
## 20 United States Jack McAuliffe 28–0–10 Lightweight
## 21 Germany Sven Ottke 34–0–0 Super middleweight
## 22 <NA> <NA> <NA> <NA>
## 23 Russia Dmitry Pirog 20–0–0 Middleweight
## 24 Namibia Harry Simon 31–0–0 Light middleweight
## 25 <NA> <NA> <NA> Middleweight
## 26 Thailand Pichit Sitbangprachan 24–0–0 Flyweight
## 27 Venezuela Edwin Valero 27–0–0 Super featherweight
## 28 <NA> <NA> <NA> Lightweight
## 29 United States Andre Ward 32–0–0 Super middleweight
## 30 <NA> <NA> <NA> <NA>
## 31 <NA> <NA> <NA> Light heavyweight
2 ) Fix the column Record - format wins/losses/draws
58 = wins 0 = losses 10 = draws
box_df <-tidy_box %>%
separate(Record, c("wins","loss", "draws"), sep = "[^0-9]", convert = TRUE, remove = TRUE)
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 16 rows [3, 4, 5,
## 6, 10, 11, 12, 16, 17, 18, 19, 22, 25, 28, 30, 31].
box_df
## Country Boxer wins loss draws Weight.Class
## 1 United States Jimmy Barry 58 0 10 Bantamweight
## 2 United Kingdom Joe Calzaghe 46 0 0 Super middleweight
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA Light heavyweight
## 7 South Korea Kim Ji-won 16 0 2 Super bantamweight
## 8 Romania Mihai Leu 28 0 0 Welterweight
## 9 Mexico Ricardo López 51 0 1 Minimumweight
## 10 NA NA NA
## 11 NA NA NA
## 12 NA NA NA Light flyweight
## 13 United States Rocky Marciano 49 0 0 Heavyweight
## 14 United Kingdom Terry Marsh 26 0 1 Light welterweight
## 15 United States Floyd Mayweather Jr. 50 0 0 Super featherweight
## 16 NA NA NA Lightweight
## 17 NA NA NA Light welterweight
## 18 NA NA NA Welterweight
## 19 NA NA NA Light middleweight
## 20 United States Jack McAuliffe 28 0 10 Lightweight
## 21 Germany Sven Ottke 34 0 0 Super middleweight
## 22 NA NA NA
## 23 Russia Dmitry Pirog 20 0 0 Middleweight
## 24 Namibia Harry Simon 31 0 0 Light middleweight
## 25 NA NA NA Middleweight
## 26 Thailand Pichit Sitbangprachan 24 0 0 Flyweight
## 27 Venezuela Edwin Valero 27 0 0 Super featherweight
## 28 NA NA NA Lightweight
## 29 United States Andre Ward 32 0 0 Super middleweight
## 30 NA NA NA
## 31 NA NA NA Light heavyweight
box1 <- box_df %>%
pivot_longer(box_df,
cols = c(wins:draws),
names_to = "score",
values_to = "times",
values_drop_na = T)
box1
## # A tibble: 45 x 5
## Country Boxer Weight.Class score times
## <chr> <chr> <chr> <chr> <int>
## 1 United States Jimmy Barry Bantamweight wins 58
## 2 United States Jimmy Barry Bantamweight loss 0
## 3 United States Jimmy Barry Bantamweight draws 10
## 4 United Kingdom Joe Calzaghe Super middleweight wins 46
## 5 United Kingdom Joe Calzaghe Super middleweight loss 0
## 6 United Kingdom Joe Calzaghe Super middleweight draws 0
## 7 South Korea Kim Ji-won Super bantamweight wins 16
## 8 South Korea Kim Ji-won Super bantamweight loss 0
## 9 South Korea Kim Ji-won Super bantamweight draws 2
## 10 Romania Mihai Leu Welterweight wins 28
## # ... with 35 more rows
box1 %>%
count(Country, sort = T)
## # A tibble: 10 x 2
## Country n
## <chr> <int>
## 1 United States 15
## 2 United Kingdom 6
## 3 Germany 3
## 4 Mexico 3
## 5 Namibia 3
## 6 Romania 3
## 7 Russia 3
## 8 South Korea 3
## 9 Thailand 3
## 10 Venezuela 3
what weight class is the most prominent among world champions?
box1 %>%
count(Weight.Class, sort = T)
## # A tibble: 12 x 2
## Weight.Class n
## <chr> <int>
## 1 Super middleweight 9
## 2 Super featherweight 6
## 3 Bantamweight 3
## 4 Flyweight 3
## 5 Heavyweight 3
## 6 Light middleweight 3
## 7 Light welterweight 3
## 8 Lightweight 3
## 9 Middleweight 3
## 10 Minimumweight 3
## 11 Super bantamweight 3
## 12 Welterweight 3
In order to obtain arithmetic calculations we need to use pivot wider to be able to compute the information from the data.
undefeated_boxers <- box1 %>%
pivot_wider(names_from = score, values_from = times)
undefeated_boxers
## # A tibble: 15 x 6
## Country Boxer Weight.Class wins loss draws
## <chr> <chr> <chr> <int> <int> <int>
## 1 United States Jimmy Barry Bantamweight 58 0 10
## 2 United Kingdom Joe Calzaghe Super middleweight 46 0 0
## 3 South Korea Kim Ji-won Super bantamweight 16 0 2
## 4 Romania Mihai Leu Welterweight 28 0 0
## 5 Mexico Ricardo López Minimumweight 51 0 1
## 6 United States Rocky Marciano Heavyweight 49 0 0
## 7 United Kingdom Terry Marsh Light welterweight 26 0 1
## 8 United States Floyd Mayweather Jr. Super featherweight 50 0 0
## 9 United States Jack McAuliffe Lightweight 28 0 10
## 10 Germany Sven Ottke Super middleweight 34 0 0
## 11 Russia Dmitry Pirog Middleweight 20 0 0
## 12 Namibia Harry Simon Light middleweight 31 0 0
## 13 Thailand Pichit Sitbangprachan Flyweight 24 0 0
## 14 Venezuela Edwin Valero Super featherweight 27 0 0
## 15 United States Andre Ward Super middleweight 32 0 0
We calculate the average wins per boxer and the average draws.
undefeated_boxers %>%
mutate(avg_wins = wins/wins+loss+draws,
avg_draws = draws/wins+loss+draws) %>%
arrange(desc(avg_wins))
## # A tibble: 15 x 8
## Country Boxer Weight.Class wins loss draws avg_wins avg_draws
## <chr> <chr> <chr> <int> <int> <int> <dbl> <dbl>
## 1 United Sta~ Jimmy Barry Bantamweight 58 0 10 11 10.2
## 2 United Sta~ Jack McAuli~ Lightweight 28 0 10 11 10.4
## 3 South Korea Kim Ji-won Super bantamwe~ 16 0 2 3 2.12
## 4 Mexico Ricardo Lóp~ Minimumweight 51 0 1 2 1.02
## 5 United Kin~ Terry Marsh Light welterwe~ 26 0 1 2 1.04
## 6 United Kin~ Joe Calzaghe Super middlewe~ 46 0 0 1 0
## 7 Romania Mihai Leu Welterweight 28 0 0 1 0
## 8 United Sta~ Rocky Marci~ Heavyweight 49 0 0 1 0
## 9 United Sta~ Floyd Maywe~ Super featherw~ 50 0 0 1 0
## 10 Germany Sven Ottke Super middlewe~ 34 0 0 1 0
## 11 Russia Dmitry Pirog Middleweight 20 0 0 1 0
## 12 Namibia Harry Simon Light middlewe~ 31 0 0 1 0
## 13 Thailand Pichit Sitb~ Flyweight 24 0 0 1 0
## 14 Venezuela Edwin Valero Super featherw~ 27 0 0 1 0
## 15 United Sta~ Andre Ward Super middlewe~ 32 0 0 1 0
I want to interpret graphically the amount of wins by boxer and filled by country
undefeated_boxers %>%
ggplot(aes(Boxer,wins,
color = Country))+
geom_point()
2) Interpret graphically the wins by weight class we can see how light weight and minimum weight are the ones with the most wins however, most undefeated champions do not belong to middleweight
undefeated_boxers %>%
ggplot(aes(wins, color = Weight.Class))+
geom_histogram(binwidth = 2, size = 1)+
labs(title = "Wins by Weightclass")
Thanks to the data wrangling I was able to determine that: 1) Most undefeated boxers come from the US followed by the UK 2) the most prominent weight class for undefeated boxers is middle weight and super feather weight. 3)The average win for an undefeated boxer is 11 wins and 10 draws.