Download and load the Superheroes Dataset (heroes_information.csv) into a data frame.

Use any of the tools you currently have from Base R, the tidyverse and ggplot2 to answer the questions below.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
heroes_info <- read.csv("heroes_information.csv")
#heroes_info
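  1. Examine the data set with str, dim, etc. What do you notice about the data? Are there any potential problems?

The data set has 734 rows and 11 columns. Missing values are coded as “-” in the character columns, and the first column (X) is only a row index that we do not need. Negative values of Height and Weight (-99) are another potential problem, since they appear to be placeholders for missing measurements.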
class(heroes_info)
## [1] "data.frame"
dim(heroes_info)
## [1] 734  11
summary(heroes_info)
##        X             name              Gender           Eye.color        
##  Min.   :  0.0   Length:734         Length:734         Length:734        
##  1st Qu.:183.2   Class :character   Class :character   Class :character  
##  Median :366.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :366.5                                                           
##  3rd Qu.:549.8                                                           
##  Max.   :733.0                                                           
##                                                                          
##      Race            Hair.color            Height       Publisher        
##  Length:734         Length:734         Min.   :-99.0   Length:734        
##  Class :character   Class :character   1st Qu.:-99.0   Class :character  
##  Mode  :character   Mode  :character   Median :175.0   Mode  :character  
##                                        Mean   :102.3                     
##                                        3rd Qu.:185.0                     
##                                        Max.   :975.0                     
##                                                                          
##   Skin.color         Alignment             Weight      
##  Length:734         Length:734         Min.   :-99.00  
##  Class :character   Class :character   1st Qu.:-99.00  
##  Mode  :character   Mode  :character   Median : 62.00  
##                                        Mean   : 43.86  
##                                        3rd Qu.: 90.00  
##                                        Max.   :900.00  
##                                        NA's   :2
str(heroes_info)
## 'data.frame':    734 obs. of  11 variables:
##  $ X         : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ name      : chr  "A-Bomb" "Abe Sapien" "Abin Sur" "Abomination" ...
##  $ Gender    : chr  "Male" "Male" "Male" "Male" ...
##  $ Eye.color : chr  "yellow" "blue" "blue" "green" ...
##  $ Race      : chr  "Human" "Icthyo Sapien" "Ungaran" "Human / Radiation" ...
##  $ Hair.color: chr  "No Hair" "No Hair" "No Hair" "No Hair" ...
##  $ Height    : num  203 191 185 203 -99 193 -99 185 173 178 ...
##  $ Publisher : chr  "Marvel Comics" "Dark Horse Comics" "DC Comics" "Marvel Comics" ...
##  $ Skin.color: chr  "-" "blue" "red" "-" ...
##  $ Alignment : chr  "good" "good" "good" "bad" ...
##  $ Weight    : num  441 65 90 441 -99 122 -99 88 61 81 ...
heroes_info <- as_tibble(heroes_info)

heroes_info <- select(heroes_info, -X)

sum(is.na(heroes_info))
## [1] 2
heroes_info[heroes_info == "-"] <- NA
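As a minimal sketch of an alternative to replacing “-” after loading, the `na.strings` argument of `read.csv` can turn the placeholder into `NA` at read time. The three rows below are hypothetical stand-ins for heroes_information.csv, not real records, and the object names are made up for illustration.

```r
# Toy stand-in for heroes_information.csv (hypothetical rows, not real data)
csv_text <- "name,Race,Skin.color,Height
Hero A,Human,-,180
Hero B,-,blue,-99
Hero C,Mutant,-,175"

# na.strings = "-" reads a field that is exactly "-" as NA;
# the numeric sentinel -99 is not affected by this setting
toy <- read.csv(text = csv_text, na.strings = "-", stringsAsFactors = FALSE)

# Missing values per column, which is more informative than one overall sum
na_per_column <- colSums(is.na(toy))
na_per_column

# Count the -99 sentinel separately, since it is a number rather than a string
n_sentinel_height <- sum(toy$Height == -99, na.rm = TRUE)
n_sentinel_height
```

Counting per column makes it easier to see which variables are affected most before deciding how to handle them.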
  2. How many superheroes are Human? What race has the second highest number of superheroes? Do not count rows where Race is not given.

208 superheroes are Human. Mutant has the second-highest count, with 63 superheroes. The 304 rows where Race is missing are not counted.
group_race <- heroes_info %>%
  group_by(Race) %>%
  summarise(n = n()) 
group_race %>% arrange(desc(n))
## # A tibble: 62 x 2
##    Race                  n
##    <chr>             <int>
##  1 <NA>                304
##  2 Human               208
##  3 Mutant               63
##  4 God / Eternal        14
##  5 Cyborg               11
##  6 Human / Radiation    11
##  7 Android               9
##  8 Symbiote              9
##  9 Alien                 7
## 10 Kryptonian            7
## # ... with 52 more rows
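The ranking above still lists the missing-Race rows first. One way to leave them out, sketched here in base R on a small hypothetical vector, relies on the fact that `table()` drops `NA` values by default.

```r
# Hypothetical Race values; NA marks rows where Race is not given
race <- c("Human", "Human", "Mutant", NA, "Human", "Mutant", "Android", NA)

# table() excludes NA by default, so missing rows never enter the ranking
race_counts <- sort(table(race), decreasing = TRUE)
race_counts

# The race with the second highest count
second_race <- names(race_counts)[2]
second_race
```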
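  3. Find the frequency of Gender values. How many female superheroes are there?

There are 200 female superheroes and 505 male superheroes. The remaining 29 of the 734 rows have no Gender recorded and are dropped by table().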
table(heroes_info$Gender)
## 
## Female   Male 
##    200    505
  4. Create a contingency table for Gender and Alignment. What do you notice about this analysis?

I noticed that there are more good superheroes than bad ones among both males and females. Comparing the two genders, males outnumber females in every alignment category.

Gender_x_Alignment <- table(heroes_info$Gender, heroes_info$Alignment)
Gender_x_Alignment
##         
##          bad good neutral
##   Female  35  161       4
##   Male   165  316      18
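Because males far outnumber females overall, raw counts make the two rows hard to compare. A minimal sketch on hypothetical values shows two options: `useNA = "ifany"` keeps the rows that `table()` would otherwise drop silently, and `prop.table(..., margin = 1)` converts each row to proportions.

```r
# Hypothetical values; NA marks a missing Gender or Alignment
gender    <- c("Male", "Female", "Male", NA, "Female", "Male")
alignment <- c("good", "good", "bad", "good", NA, "neutral")

# Keep missing values visible as their own row and column
tab_with_na <- table(gender, alignment, useNA = "ifany")
tab_with_na

# Default table: rows with any NA are dropped
tab <- table(gender, alignment)

# Row proportions: each gender's alignments sum to 1
row_props <- prop.table(tab, margin = 1)
row_props
```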
  5. Filter the data to remove rows where Height is less than 0. Do the same for Weight. Use this filtered data for questions 6 through 10 below.
heroes_info <- heroes_info %>% filter(Height > 0 )


heroes_info <- heroes_info %>% filter(Weight > 0 )
heroes_info
## # A tibble: 490 x 10
##    name  Gender Eye.color Race  Hair.color Height Publisher Skin.color Alignment
##    <chr> <chr>  <chr>     <chr> <chr>       <dbl> <chr>     <chr>      <chr>    
##  1 A-Bo~ Male   yellow    Human No Hair       203 Marvel C~ <NA>       good     
##  2 Abe ~ Male   blue      Icth~ No Hair       191 Dark Hor~ blue       good     
##  3 Abin~ Male   blue      Unga~ No Hair       185 DC Comics red        good     
##  4 Abom~ Male   green     Huma~ No Hair       203 Marvel C~ <NA>       bad      
##  5 Abso~ Male   blue      Human No Hair       193 Marvel C~ <NA>       bad      
##  6 Adam~ Male   blue      Human Blond         185 DC Comics <NA>       good     
##  7 Agen~ Female blue      <NA>  Blond         173 Marvel C~ <NA>       good     
##  8 Agen~ Male   brown     Human Brown         178 Marvel C~ <NA>       good     
##  9 Agen~ Male   <NA>      <NA>  <NA>          191 Marvel C~ <NA>       good     
## 10 Air-~ Male   blue      <NA>  White         188 Marvel C~ <NA>       bad      
## # ... with 480 more rows, and 1 more variable: Weight <dbl>
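The two filters can also be written as one condition. This base R sketch uses a hypothetical four-row data frame and `subset()`, which treats an `NA` result in the condition as `FALSE`, so rows with a missing Weight are dropped along with the -99 rows.

```r
# Hypothetical rows: -99 is the placeholder value, NA is a true missing value
toy <- data.frame(
  name   = c("A", "B", "C", "D"),
  Height = c(180, -99, 175, 160),
  Weight = c(80, -99, -99, NA)
)

# Keep only rows where both measurements are positive
kept <- subset(toy, Height > 0 & Weight > 0)
kept

# How many rows the filter removed
n_removed <- nrow(toy) - nrow(kept)
n_removed
```

Checking the number of removed rows is a quick sanity test on how much data the filter costs.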
  6. Calculate the mean, median and standard deviation of superhero Weight and Height.
#Weight
mean(heroes_info$Weight, na.rm = TRUE)
## [1] 112.1796
median(heroes_info$Weight, na.rm = TRUE)
## [1] 81
sd(heroes_info$Weight, na.rm = TRUE)
## [1] 104.4227
#Height
mean(heroes_info$Height, na.rm = TRUE)
## [1] 187.1239
median(heroes_info$Height, na.rm = TRUE)
## [1] 183
sd(heroes_info$Height, na.rm = TRUE)
## [1] 58.99002
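The six calls above can be collapsed into one, sketched here with `sapply()` on a hypothetical data frame. The helper name `summarise_column` is made up for illustration.

```r
# Hypothetical measurements
toy <- data.frame(
  Height = c(170, 180, 190),
  Weight = c(60, 80, 100)
)

# Return the three statistics for one numeric vector
summarise_column <- function(x) {
  c(mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE))
}

# One column of results per variable, one row per statistic
stats <- sapply(toy, summarise_column)
stats
```

Seeing mean and median side by side is useful here: in the real output the mean Weight (112.18) is well above the median (81), which is what a right-skewed distribution would produce.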
  7. Create a histogram of Weight. Describe the distribution and what it tells us.

The histogram is unimodal and right-skewed, with outliers in the 600-700 and 800-900 bins. Most superheroes are relatively light, weighing between 0 and 200. Only a couple of superheroes are very heavy, at 600-700 or 800-900.

hist(heroes_info$Weight, main = "Distribution of Superhero Weight",
     xlab = "Weight", xlim = c(0, 1100),
     col = "lightskyblue")
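To back a description of the histogram with numbers, the bin counts can be read directly from `hist()` with `plot = FALSE`. This is a sketch on hypothetical weights with fixed bins of width 100, which are an assumption and not necessarily the bins R chose for the real plot.

```r
# Hypothetical weights with two extreme values
weight <- c(50, 60, 80, 90, 150, 650, 850)

# Compute the histogram without drawing it; bins are (0,100], (100,200], ...
h <- hist(weight, breaks = seq(0, 1000, by = 100), plot = FALSE)

# Counts per bin, labelled by the upper edge of each bin
bin_counts <- setNames(h$counts, h$breaks[-1])
bin_counts

# A mean well above the median is consistent with right skew
c(mean = mean(weight), median = median(weight))
```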

  8. Plot Height vs. Weight. Do you see any correlation? Describe what you see.

The fitted line has a positive gradient, so height tends to increase as weight increases. This is a positive correlation between height and weight, but it appears weak, because the line rises only slightly. In other words, taller superheroes seem to be a bit heavier than shorter ones. The slope alone does not measure the strength of the relationship, though, which a correlation coefficient from cor() would quantify.
plot(heroes_info$Weight, heroes_info$Height , main = "Height vs. Weight",
     xlab = "Weight", ylab = "Height")
abline(lm(Height ~ Weight, data = heroes_info), col = "blue")
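A correlation coefficient puts a number on what the scatter plot suggests. The sketch below uses hypothetical paired values and reports both Pearson's r from `cor()` and the slope of the same kind of `lm()` fit used for the line above.

```r
# Hypothetical paired measurements
weight <- c(50, 60, 70, 80, 90)
height <- c(160, 172, 169, 181, 178)

# Pearson correlation: sign gives direction, magnitude gives strength
r <- cor(weight, height)
r

# Slope of the fitted line: expected change in height per unit of weight
fit <- lm(height ~ weight)
slope <- coef(fit)[["weight"]]
slope
```

The slope and r always share a sign, but a shallow slope does not by itself imply a small r, because the slope depends on the units of both variables.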

  9. Calculate the five-number summary values for Height and Weight. Create a box plot of Weight for Alignment. What does this chart tell us?

The chart tells us that the median weight of good superheroes is slightly lower than that of bad and neutral superheroes. Neutral superheroes show the greatest variability in weight and good superheroes the least, although good superheroes have the largest outliers. The minimum weight of bad superheroes is higher than that of the other groups.

fivenum(heroes_info$Height)
## [1]  15.2 173.0 183.0 188.0 975.0
fivenum(heroes_info$Weight)
## [1]   4  61  81 106 900
boxplot(heroes_info$Weight ~ heroes_info$Alignment)
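The medians and spreads read off a box plot can be checked numerically per group. This is a minimal base R sketch with `tapply()` on hypothetical rows.

```r
# Hypothetical weights for two alignment groups
toy <- data.frame(
  Alignment = c("good", "good", "good", "bad", "bad", "bad"),
  Weight    = c(55, 60, 200, 80, 90, 100)
)

# Median per group: the thick line inside each box
group_median <- tapply(toy$Weight, toy$Alignment, median)
group_median

# Interquartile range per group: the height of each box
group_iqr <- tapply(toy$Weight, toy$Alignment, IQR)
group_iqr
```

In this toy data the good group has the lower median but the wider box, which shows why median and spread are worth reporting separately.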

  10. Filter the data to include only the two largest publishers. Which publisher has a higher proportion of bad aligned superheroes? Hint: One way to do this is to use dplyr and group by the two columns Publisher and Alignment rather than a single column.

The two largest publishers are Marvel Comics and DC Comics. Marvel Comics has a slightly higher proportion of bad aligned superheroes (0.291 versus 0.282), a difference of about 0.01.
#heroes_info %>% group_by(Publisher)
heroes_info %>% count(Publisher, sort = TRUE) 
## # A tibble: 10 x 2
##    Publisher               n
##    <chr>               <int>
##  1 "Marvel Comics"       318
##  2 "DC Comics"           143
##  3 "Dark Horse Comics"    11
##  4 "George Lucas"          5
##  5 "Shueisha"              4
##  6 "Team Epic TV"          4
##  7 "Star Trek"             2
##  8 ""                      1
##  9 "Image Comics"          1
## 10 "Sony Pictures"         1
# filter() has no na.rm argument; rows with a missing Publisher never match %in%
data_group <- filter(heroes_info, Publisher %in% c("Marvel Comics", "DC Comics"))
con_tab <- table(data_group$Publisher, data_group$Alignment)
con_tab
##                
##                 bad good neutral
##   DC Comics      40   94       8
##   Marvel Comics  92  215       9
prop.table(con_tab, 1) 
##                
##                        bad       good    neutral
##   DC Comics     0.28169014 0.66197183 0.05633803
##   Marvel Comics 0.29113924 0.68037975 0.02848101
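The hint in the question suggests grouping by both Publisher and Alignment. A minimal dplyr sketch of that approach follows, using hypothetical publishers P1 to P3 rather than the real data, and assuming dplyr 1.0.0 or later for `slice_head()`.

```r
library(dplyr)

# Hypothetical rows: P1 and P2 are the two largest publishers
toy <- data.frame(
  Publisher = c("P1", "P1", "P1", "P1", "P2", "P2", "P3"),
  Alignment = c("bad", "good", "good", "good", "bad", "good", "good")
)

# Names of the two publishers with the most rows
top_two <- toy %>%
  count(Publisher, sort = TRUE) %>%
  slice_head(n = 2) %>%
  pull(Publisher)

# Count each Publisher/Alignment pair, then divide by the publisher total
props <- toy %>%
  filter(Publisher %in% top_two) %>%
  count(Publisher, Alignment) %>%
  group_by(Publisher) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Proportion of bad aligned characters per publisher
bad_props <- filter(props, Alignment == "bad")
bad_props
```

Deriving the top two publishers from the counts avoids typing their names by hand, at the cost of an arbitrary choice if two publishers tie.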