Assignment 1 - Titanic and Baseball Players

The titanic data

1.27. The Titanic and class

(a) Make a bar graph of these data.

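# load the packages used throughout this assignment
library(dplyr)     # %>%, group_by, summarise, mutate
library(ggplot2)   # plots
library(ggthemes)  # theme_tufte
library(psych)     # describe
library(rvest)     # read_html, html_table
library(stringr)   # str_replace_all

# read the titanic dataset (read.csv already returns a data frame)
titanic <- read.csv("ex01-029titanic.csv")

# count the number of passengers in each class
titanicClass <- titanic %>% 
  group_by(class = pclass) %>% 
  summarise(frequency = n())
titanicClass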

## # A tibble: 3 x 2
##   class frequency
##   <int>     <int>
## 1     1       323
## 2     2       277
## 3     3       709

# draw a bar graph of the number of passengers in each class
ggplot(titanicClass, aes(x = class, y = frequency)) + 
  labs(title = "Number of Passengers based on Class",
       x = "Class",
       y = "Number of Passengers") +
  geom_col(alpha = .5) +
  theme_tufte()

# add a column with the proportion of passengers in each class
# (stored under the name percentage)
percentClassTitanic <- titanicClass %>% 
                       mutate(percentage = frequency/(sum(frequency)))
percentClassTitanic

## # A tibble: 3 x 3
##   class frequency percentage
##   <int>     <int>      <dbl>
## 1     1       323  0.2467532
## 2     2       277  0.2116119
## 3     3       709  0.5416348

# graph the barplot based on the percentage of passengers.
ggplot(percentClassTitanic, 
      aes(x = class, y = percentage)) +
      labs(title = "Percentages of Passengers based on Class",
           y = "Percentages of Passengers") +
      geom_col(alpha = .5) +
      theme_tufte()

(b) Give a short summary of how the number of passengers varied with class

Answer: Class 3, the least luxurious class, has by far the largest number of passengers (709). Classes 1 and 2, the more luxurious classes, have fairly similar numbers of passengers: 323 in class 1 and 277 in class 2.

(c) If you made a bar graph of the percent of passengers in each class, would the general features of the graph differ from the one you made in part (a)?

Answer: No. Each count is divided by the same total (1309 passengers), so the relative heights of the bars stay the same. Only the scale of the vertical axis changes.

1.28: (a) Make a pie chart to display the data.
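
One possible way to draw the pie chart is sketched below. It uses base R's pie() rather than ggplot2, and it enters the class counts by hand from the table in 1.27(a) instead of reading the CSV file. The names counts, shares, and slice_labels are introduced here only for illustration.

```r
# Passenger counts per class, copied from the summary table in 1.27(a)
counts <- c("Class 1" = 323, "Class 2" = 277, "Class 3" = 709)

# Share of all passengers in each class, as a percentage
shares <- round(100 * counts / sum(counts), 1)

# Label each slice with its class and its percentage
slice_labels <- paste0(names(counts), " (", shares, "%)")

# Draw the pie chart
pie(counts, labels = slice_labels,
    main = "Share of Passengers by Class")
```

Putting the percentage in each label partly compensates for the fact that slice angles are harder to compare than bar heights.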

(b) Compare the pie chart with the bar graph. Which do you prefer? Give reasons for your answer.

Answer: I prefer the bar graph because it shows the number of passengers in each class and makes the classes easy to compare by bar height. The pie chart, on the other hand, shows only each class’s share of the total.

1.29. Create a graphical summary that shows how the survival of passengers depended on class.

# keep the survival and class columns,
# then count the survivors in each class
titanicSurvival <- titanic %>% 
                   select(survived, pclass) %>% 
                   filter(survived == 1) %>% 
                   group_by(pclass) %>% 
                   summarise(survive = n())
        
titanicSurvival

## # A tibble: 3 x 2
##   pclass survive
##    <int>   <int>
## 1      1     200
## 2      2     119
## 3      3     181

# plot a bar graph of the number of survivors in each class
titanicSurvival %>% ggplot(aes(x = pclass, y = survive)) +
                    labs(title = "Number of Survivors based on Class",
                         x = "Class",
                         y = "Number of Survivors") +
                    geom_col(alpha = .5) +
                    theme_tufte()

Answer: This graph shows the number of survivors in each class. Class 1 had the most survivors (200), followed by class 3 (181) and class 2 (119). Although class 3 had more survivors than class 2, it also had far more passengers (709 versus 277), so its survival rate was lower: about 26% in class 3, compared with about 43% in class 2 and about 62% in class 1.
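
To compare the classes fairly, the survivor counts can be divided by the class sizes from 1.27(a). The sketch below enters both sets of counts by hand from the tables printed earlier. The data frame name survival and its column names are chosen here for illustration.

```r
# Counts copied from the two summary tables printed earlier
survival <- data.frame(
  pclass    = c(1, 2, 3),
  total     = c(323, 277, 709),   # passengers per class, from 1.27(a)
  survivors = c(200, 119, 181)    # survivors per class, from 1.29
)

# Survival rate = survivors divided by passengers in the same class
survival$rate <- survival$survivors / survival$total
survival
```

Plotting rate instead of survivors would show how survival depended on class without the distortion caused by the unequal class sizes.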

Part 2: Potassium in potatoes data

1.30. Potassium from potatoes.

(d) Describe the shape, center, and spread of the distribution.

(a) Make a stemplot of the data.

# read and convert potassium data into data frames
potassium <- read.csv("ex01-057kpot40.csv")
potassium <- as.data.frame(potassium)

# create a stemplot
stem(potassium$Potassium_mg)

## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##   26 | 69
##   28 | 5688
##   30 | 357702235
##   32 | 336689
##   34 | 9148
##   36 | 1
##   38 | 
##   40 | 
##   42 | 1

1.59

describe(potassium$Potassium_mg)

##    vars  n    mean     sd  median trimmed    mad     min     max   range
## X1    1 27 3208.44 306.68 3130.37  3189.3 220.63 2664.38 4213.49 1549.11
##    skew kurtosis    se
## X1 1.16     2.14 59.02

# five-number summary
summary(potassium$Potassium_mg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2664    3037    3130    3208    3283    4213       2

(a) Compute the standard deviation for these data.

Answer: The standard deviation of the potassium measurements is 306.68 mg.

(b) Compute the quartiles for these data.

Answer: The first quartile is 3037 mg, the second quartile (the median) is 3130 mg, and the third quartile is 3283 mg.

(c) Give the five-number summary and explain the meaning of each of the five numbers.

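Answer: The five-number summary is minimum 2664 mg, first quartile 3037 mg, median 3130 mg, third quartile 3283 mg, and maximum 4213 mg. The minimum and maximum are the smallest and largest observations. About 25% of the observations fall below the first quartile, about half fall below the median, and about 75% fall below the third quartile. The mean (3208 mg) is the arithmetic average and is not part of the five-number summary.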

(a) Make a histogram and use it to describe the distribution of potassium absorption.

ggplot(potassium, aes(x = Potassium_mg)) + 
                     labs(title = "Histogram of Potassium absorptions",
                          y = "Frequency",
                          x = "Potassium Absorptions (mg)") +
                     geom_histogram(binwidth = 100, 
                                    alpha = .5) +
                     theme_tufte()

## Warning: Removed 2 rows containing non-finite values (stat_bin).

(b) Describe the pattern of the distribution.

The distribution is roughly bell-shaped, with a single high value near 4200 mg that stands apart from the rest and pulls the distribution to the right. Most observations fall between about 3000 and 3300 mg, and a few lie near 3500 mg.

1.61.

# graph the boxplot
ggplot(potassium, aes(x = ID, y = Potassium_mg)) + 
                     labs(title = "Potassium absorptions based on ID",
                          y = "Potassium Absorptions (mg)",
                          x = "ID Number") +
                     geom_boxplot(alpha = 0.7) +
                     geom_jitter(color = "brown", alpha = .7) +
                     theme_tufte()

(b) Make a boxplot and use it to describe the distribution of potassium absorption.

1.30 (c) Are there any outliers? If yes, describe them and explain why you have declared them to be outliers.

This boxplot shows the distribution of potassium absorption and also plots the individual value for each ID number. The middle half of the data lies between the first quartile (about 3037 mg) and the third quartile (about 3283 mg), with a median near 3130 mg. Points farther than 1.5 × IQR from the quartiles are flagged as outliers. With IQR = 3283 - 3037 = 246 mg, the upper fence is about 3652 mg, so the maximum of about 4213 mg is a clear high outlier.
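
The usual 1.5 × IQR rule for declaring outliers can be checked by hand. The sketch below uses the rounded quartiles and maximum printed by summary() above, so the fences are approximate. The variable names are chosen here for illustration.

```r
# Rounded quartiles and maximum taken from the summary() output above
q1 <- 3037
q3 <- 3283
max_value <- 4213

# Interquartile range and the 1.5 * IQR fences
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)

# TRUE means the maximum lies beyond the upper fence
max_value > upper_fence
```

geom_boxplot() applies the same 1.5 × IQR rule by default when it decides which points to draw beyond the whiskers, although its quartile estimates can differ slightly from those of summary().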

(c) Compare the stemplot, the histogram, and the boxplot as graphical summaries of this distribution. Which do you prefer? Give reasons for your answer.

In this case, the boxplot reveals the most information, because each individual point is plotted on top of it. The median, the first and third quartiles, and the outliers are easy to read off. The histogram shows the shape of the distribution and the frequency in each bin, but it does not mark the center directly. The stemplot shows the outliers clearly and keeps the individual values, but it is harder to interpret and less visually appealing.

[D] 1.77. Mean versus median. A small accounting firm pays each of its seven clerks $55,000, three junior accountants $80,000 each, and the firm’s owner $650,000.

# read the accounting firm's salaries into a data frame
firm <- read.csv("salary.csv")
describe(firm)

##        vars  n     mean     sd median  trimmed mad   min    max  range
## salary    1 11 115909.1 177508  55000 63333.33   0 55000 650000 595000
##        skew kurtosis       se
## salary 2.45     4.46 53520.68

firm$salary < 115909.1

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

What is the mean salary paid at this firm? How many of the employees earn less than the mean? What is the median salary?

The mean salary at this firm is $115,909, and the median salary is $55,000. Interestingly, only one person (the owner) earns more than the mean, while the other ten employees earn less than the mean.
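
The same figures can be reproduced without salary.csv by building the salary vector directly from the exercise statement. This is a minimal sketch; the vector name salary is an assumption about what the CSV column contains.

```r
# Seven clerks, three junior accountants, and the owner, as stated in 1.77
salary <- c(rep(55000, 7), rep(80000, 3), 650000)

mean(salary)                  # average salary
median(salary)                # middle value of the 11 sorted salaries
sum(salary < mean(salary))    # number of employees earning less than the mean
```

Because the owner’s salary is so far above the others, it pulls the mean well above what a typical employee earns, while the median is unaffected by it.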

1.79 The firm in Exercise 1.77 gives no raises to the clerks and junior accountants, while the owner’s take increases to $900,000.

newFirm <-read.csv("newsalary.csv")
describe(newFirm)

##        vars  n     mean       sd median  trimmed mad   min    max  range
## salary    1 11 143181.8 267836.5  55000 63333.33   0 55000 950000 895000
##        skew kurtosis       se
## salary 2.46     4.49 80755.73

How does the median change? How does this change affect the mean? How does it affect the median?

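When the owner’s salary increases, the mean salary rises from about $115,909 to $143,182, while the median stays at $55,000, because the median depends only on the middle value and not on the largest one. Note that the output above reflects an owner’s salary of $950,000 (the maximum in newsalary.csv), not the $900,000 stated in the exercise; with $900,000 the mean would be about $138,636.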
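
The effect of the raise can also be checked without newsalary.csv. The exercise states $900,000 for the owner, while the describe() output above shows a maximum of 950000, so the sketch below computes the mean and median for both values as well as for the original $650,000. The names base, owner_take, and result are chosen here for illustration.

```r
# Salaries that do not change: seven clerks and three junior accountants
base <- c(rep(55000, 7), rep(80000, 3))

# Owner's salary: original value, value stated in 1.79, value in the output above
owner_take <- c(original = 650000, stated = 900000, in_output = 950000)

# Mean and median of all 11 salaries for each owner salary
result <- sapply(owner_take, function(x) {
  all_salaries <- c(base, x)
  c(mean = mean(all_salaries), median = median(all_salaries))
})
result
```

In every case the median stays at 55000, because the sixth of the eleven sorted salaries is always a clerk’s salary, while the mean moves with the owner’s salary.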

[F] Get Messy! How do the mean and median salaries compare? Are there a few “star” players that skew the data? Yes? No? How do you know? Are there modalities to the data? If so, why might this be the case?

# read the table using html_table in package rvest
link <- "https://www.usatoday.com/sports/mlb/salaries/"
salary <- read_html(link)
salary <- html_table(salary)
salary <- as.data.frame(salary)

# the Salary column contains "$", commas, and spaces,
# so strip them before converting the values to numeric
salary$cleanSalary <- salary$Salary %>% 
  str_replace_all("\\$|\\,| ", "") %>% 
  as.numeric()
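
The cleaning step can be tried on a few strings without downloading the page. The sketch below uses base R’s gsub() in place of str_replace_all() so that it runs without extra packages. The strings in raw are made-up examples in the same format as the Salary column, not rows copied from the scraped table.

```r
# Hypothetical salary strings in the same format as the scraped column
raw <- c("$33,000,000", "$1,562,500", "$535,000 ")

# Remove dollar signs, commas, and spaces, then convert to numeric
clean <- as.numeric(gsub("[$, ]", "", raw))
clean
```

Inside a character class the dollar sign is literal, so no escaping is needed. Any string that still contains other non-numeric characters would become NA with a warning, which is a useful signal that the pattern needs extending.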

# histogram of salaries on a log2 scale
# (binwidth = 1 means each bin spans one doubling of salary)
ggplot(salary, aes(x = cleanSalary)) + 
      geom_histogram(binwidth = 1, alpha = 0.5) + 
      scale_x_continuous(name = "Salary", trans = "log2") +
      theme_tufte() +
      labs(title = "Histogram showing Salary of Baseball Players",
           x = "Salary",
           y = "Frequency")

describe(salary$cleanSalary)

##    vars   n    mean      sd  median trimmed     mad    min     max
## X1    1 868 4468069 5948459 1562500 3137099 1521148 535000 3.3e+07
##       range skew kurtosis       se
## X1 32465000 1.93     3.42 201903.9

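The histogram, drawn on a log2 scale, is unimodal and skewed right. The mean salary (about $4.47 million) is nearly three times the median (about $1.56 million), which indicates that a few “star” players with very high salaries pull the mean upward. Half of the players earn less than about $1.56 million a year, while very few earn more than 16 million dollars.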

We can also plot the relationship between players’ positions and their salary.

ggplot(salary, aes(x = POS, y = cleanSalary)) + 
      geom_jitter(aes(color =POS)) +
      scale_y_continuous(trans = "log2") +
      coord_flip() +
      labs(title = "Salary of baseball players based on Position",
           y = "Salary",
           x = "Position")

This jitter plot shows each player’s salary by position. Players in the SP position earn the highest salaries. Players in the P position all earn more than 4.2 million dollars per year. Players in the OP position earn either very high salaries (more than 5 million dollars) or very low ones (about 1 million dollars). Salaries for the other positions are spread across the whole range. Not surprisingly, most players earn around one million dollars, and the minimum salary is about 500,000 dollars.