library(devtools)
## Loading required package: usethis
library(LearnEDAfunctions)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: ggplot2
library(aplpack)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(remotes)
## 
## Attaching package: 'remotes'
## 
## The following objects are masked from 'package:devtools':
## 
##     dev_package_deps, install_bioc, install_bitbucket, install_cran,
##     install_deps, install_dev, install_git, install_github,
##     install_gitlab, install_local, install_svn, install_url,
##     install_version, update_packages
## 
## The following object is masked from 'package:usethis':
## 
##     git_credentials

1. The dataset baseball.attendance in the LearnEDA package contains the average (mean)
home attendance of the 30 Major League Baseball teams in 2010. For this exercise,
you don’t have to understand baseball. All you have to know is that there are 30
baseball teams and that the names of the teams correspond to their city locations.
There are two leagues: the National League has 16 teams and the American League
has 14 teams. The dataset has eight columns:

Team – the name of the team (city)

League – the league that this team belongs to

N.Home – the number of home games

Avg.Home – the mean attendance (number of people) at games played at home

Pct.Home – the percentage of park capacity for home games

N.Away – the number of away games

Avg.Away – the mean attendance (number of people) at games played away from home

Pct.Away – the percentage of park capacity for away games

For this assignment, we focus on the average home attendance (variable Avg.Home).

(a) Construct a stemplot where

- the breakpoint is between the 10,000 and 1000 places

- you have 10 leaves per stem

aplpack::stem.leaf(baseball.attendance$Avg.Home, unit = 1000, m = 1)
## 1 | 2: represents 12000
##  leaf unit: 1000
##             n: 30
##     4    1 | 6689
##   (14)   2 | 00223335566799
##    12    3 | 344678999
##     3    4 | 455

Discuss what you see in this distribution of attendance numbers, including shape,
“average”, spread, and any unusual characteristics. If there are any unusually high
or low attendances, identify the cities that have these unusual mean attendances.

The distribution of attendance numbers is roughly symmetric, although it is slightly
skewed right. The “average” is in the 20,000’s. Regarding the spread, most of the
attendance numbers fall roughly between 20,000 and 40,000. There are no obvious
unusual characteristics. One thing to note is that Florida and Cleveland have the
lowest mean attendance at about 16,000, whereas NY_Yankees, Philadelphia, and
LA_Dodgers have the highest mean attendance at about 44,000 to 45,000.

(b) Redraw the stemplot using the same breakpoint and 5 leaves per stem.

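# Note: m is meant to be 1, 2, or 5 (lines per stem); 5 leaves per line corresponds to m = 2.
# The output shown below comes from the call with m = 1.5.
aplpack::stem.leaf(baseball.attendance$Avg.Home, unit = 1000, m = 1.5)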
## 1 | 2: represents 12000
##  leaf unit: 1000
##             n: 30
##    2    1 | 66
##   10    1 | 00223389
##   (8)   2 | 35566799
##   12    3 | 3446
##    8    3 | 78999
##    3    4 | 455

(c) Redraw the stemplot using the same breakpoint and 2 leaves per stem.

aplpack::stem.leaf(baseball.attendance$Avg.Home, unit = 1000, m = 5)
## 1 | 2: represents 12000
##  leaf unit: 1000
##             n: 30
##    2     s | 66
##    4    1. | 89
##    6    2* | 00
##   11     t | 22333
##   13     f | 55
##   (3)    s | 667
##   14    2. | 99
##         3* | 
##   12     t | 3
##   11     f | 44
##    9     s | 67
##    7    3. | 8999
##         4* | 
##          t | 
##    3     f | 455

(d) Discuss any features that you’ve learned about this dataset by constructing
these two additional stemplots. What is the best choice of stemplot? Why?

In the stemplot with 5 leaves per stem, we can see two modes (or peaks) in the
attendance numbers. In both the 5-leaf and the 2-leaf stemplots, we can also see
more clearly that the three mean attendances at roughly 44,000 and 45,000 are
unusually high values. The stemplot with 2 leaves per stem looks too spread out
and contains gaps. This is why I think the stemplot with 5 leaves per stem is the
best choice: it shows both the two modes and the unusually high attendance means,
neither of which was visible in the stemplot with 10 leaves per stem.
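As a cross-check on how the number of lines per stem changes the picture, the sketch below counts the leaves that fall on each stemplot line using base R only. It is an illustration, not the aplpack implementation: the helper name line_counts is hypothetical, and the input is the attendance in thousands, truncated to the leaf unit, as read off the stemplot in part (a) rather than the raw Avg.Home values.

```r
# Attendance in thousands, truncated to the leaf unit (read off the part (a) stemplot).
att_k <- c(16, 16, 18, 19,
           20, 20, 22, 22, 23, 23, 23, 25, 25, 26, 26, 27, 29, 29,
           33, 34, 34, 36, 37, 38, 39, 39, 39,
           44, 45, 45)

# Hypothetical helper: number of leaves on each line when every stem is split
# into m lines (m = 1: 10 leaf digits per line, m = 2: 5, m = 5: 2).
line_counts <- function(x, m) {
  stopifnot(m %in% c(1, 2, 5))
  stem <- x %/% 10                 # tens digit = stem
  leaf <- x %% 10                  # units digit = leaf
  part <- leaf %/% (10 / m)        # which line within the stem (0 .. m - 1)
  key  <- stem * m + part          # running line index
  keys <- seq(min(key), max(key))  # include empty lines between the extremes
  counts <- tabulate(match(key, keys), nbins = length(keys))
  first_leaf <- (keys %% m) * (10 / m)
  last_leaf  <- first_leaf + 10 / m - 1
  setNames(counts, paste0(keys %/% m, "|", first_leaf, "-", last_leaf))
}

line_counts(att_k, 1)   # one line per stem
line_counts(att_k, 2)   # two lines per stem (5 leaf digits each)
line_counts(att_k, 5)   # five lines per stem (2 leaf digits each)
```

With m = 1 and m = 5 the counts agree with the leaves printed in parts (a) and (c). With m = 2 the two lines of the 20,000’s stem hold seven leaves each, and the 30,000’s stem splits into three and six leaves.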

2. In Assignment 1, you found two datasets.

For each dataset …

(a) Read the data into R.

largest.us.cities.by.population <- read.csv("C:/Users/eclai/Downloads/largest.us.cities.by.population.csv")
vegetarianism.by.country <- read.csv("C:/Users/eclai/Downloads/vegetarianism.by.country.csv")

Note: for the purpose of this assignment, we will focus on the percentage of
vegetarians and the population density to construct the stemplots.

(b) Construct two different stemplots using R.

largest US cities by population: density

aplpack::stem.leaf(largest.us.cities.by.population$density, unit = 100, m = 2)
## 1 | 2: represents 1200
##  leaf unit: 100
##             n: 300
##     2    0* | 13
##     7    0. | 56779
##    34    1* | 000011112222233333334444444
##    61    1. | 556666677788888889999999999
##    97    2* | 000000011122222222233333333444444444
##   129    2. | 55555555666666667777778888899999
##   (31)   3* | 0000000000000111122222333344444
##   140    3. | 55555666666667777788888899999999999
##   105    4* | 000011111111112223333333344
##    78    4. | 555666777777899
##    63    5* | 0011222344
##    53    5. | 556777778
##    44    6* | 0022
##    40    6. | 568888899
##    31    7* | 113
##    28    7. | 557
##    25    8* | 01
##    23    8. | 666
##    20    9* | 23
## HI: 9795 10064 10262 10333 10755 10761 11277 11378 11472 11475 12110 12469 12774 15260 18126 18216 18376 26260
aplpack::stem.leaf(largest.us.cities.by.population$density, unit = 100, m = 1)
## 1 | 2: represents 1200
##  leaf unit: 100
##             n: 300
##     7    0 | 1356779
##    61    1 | 000011112222233333334444444556666677788888889999999999
##   129    2 | 00000001112222222223333333344444444455555555666666667777778888899999
##   (66)   3 | 000000000000011112222233334444455555666666667777788888899999999999
##   105    4 | 000011111111112223333333344555666777777899
##    63    5 | 0011222344556777778
##    44    6 | 0022568888899
##    31    7 | 113557
##    25    8 | 01666
##    20    9 | 237
## HI: 10064 10262 10333 10755 10761 11277 11378 11472 11475 12110 12469 12774 15260 18126 18216 18376 26260
aplpack::stem.leaf(largest.us.cities.by.population$density, unit = 100, m = 3)
## 1 | 2: represents 1200
##  leaf unit: 100
##             n: 300
##     2    0 | 13
##     4    0 | 56
##     7    0 | 779
##    25    1 | 000011112222233333
##    39    1 | 33444444455666
##    61    1 | 6677788888889999999999
##    81    2 | 00000001112222222223
##   112    2 | 3333333444444444555555556666666
##   129    2 | 67777778888899999
##   (23)   3 | 00000000000001111222223
##   148    3 | 333444445555566666666
##   127    3 | 7777788888899999999999
##   105    4 | 00001111111111222333
##    85    4 | 3333344555666
##    72    4 | 777777899
##    63    5 | 0011222
##    56    5 | 344556
##    50    5 | 777778
##    44    6 | 0022
##    40    6 | 5
##    39    6 | 68888899
##    31    7 | 11
##    29    7 | 355
##    26    7 | 7
##    25    8 | 01
##    23    8 | 666
##          8 | 
##    20    9 | 2
## HI: 9347 9795 10064 10262 10333 10755 10761 11277 11378 11472 11475 12110 12469 12774 15260 18126 18216 18376 26260

vegetarianism by country: percentage of vegetarians

aplpack::stem.leaf(vegetarianism.by.country$percent.vegetarian, unit = 0.001, m = 1)
## 1 | 2: represents 0.012
##  leaf unit: 0.001
##             n: 40
##    4     1 | 0024
##          2 | 
##    6     3 | 03
##    8     4 | 03
##   17     5 | 000000022
##   (3)    6 | 000
##   (3)    7 | 000
##   17     8 | 49
##   15     9 | 00
##   13    10 | 00000
##         11 | 
##    8    12 | 000
##    5    13 | 0
##    4    14 | 00
## HI: 0.19 0.24
aplpack::stem.leaf(vegetarianism.by.country$percent.vegetarian, unit = 0.001, m = 0.5)
## 1 | 2: represents 0.012
##  leaf unit: 0.001
##             n: 40
##     4     1 | 0024
##     8     3 | 0033
##   (12)    5 | 000000000022
##    (5)    7 | 00049
##    15     9 | 0000000
##     8    11 | 000
##     5    13 | 000
## HI: 0.19 0.24

(c) Discuss the data (shape, average, spread, unusual characteristics, etc.) and

discuss the best choice of stemplot.

largest US cities by population: density

The distribution of the population density numbers is skewed right. The “average”
is roughly in the 3,000’s. Regarding the spread, most of the population density
numbers fall roughly between 1,000 and 8,000. As far as unusual characteristics,
there are a number of noticeable outliers. Some unusually high numbers include 9347,
9795, 10064, 10262, 10333, 10755, 10761, 11277, 11378, 11472, 11475, 12110, 12469,
12774, 15260, 18126, 18216, 18376, and 26260. Here, I would say that the stemplot
with 4 leaves per stem is the best choice, although the stemplot with 5 leaves per
stem also works. The stemplot with 4 leaves per stem clearly shows the spread and
the peaks of the data as well as the shape (right skewed). It also shows more
structure than the stemplot with 10 leaves per stem.

vegetarianism by country: percentage of vegetarians

The distribution of the percentage of vegetarians is almost symmetrical, although
it is slightly skewed right because of two high values. The “average” is roughly
0.07. Regarding the spread, most of the percentages fall roughly between 0.01 and
0.13. As far as unusual characteristics, there are two noticeable outliers: the
unusually high values 0.19 and 0.24. Here, I would say that the best choice of
stemplot is the first one (m = 1). It does contain two gaps, but compared to the
second stemplot (m = 0.5), the data is more spread out and not as clustered.

PART B: Working with a Single Batch – Summaries

For each of the three datasets (the baseball attendance data and the two interesting datasets that you found)

(a) compute the letter values (median, fourths, eighths, and extremes)

baseball:

fivenum(baseball.attendance$Avg.Home)
## [1] 16230.0 22850.0 27102.5 37592.0 45715.0

median (M) = 27102.5

lower fourth (Fl): 22850.0

upper fourth (Fu): 37592.0

smallest observation (LO): 16230.0

highest observation (HI): 45715.0

population:

fivenum(largest.us.cities.by.population$density)
## [1]   166.0  2242.0  3293.0  4598.5 26260.0

median (M) = 3293.0

lower fourth (Fl): 2242.0

upper fourth (Fu): 4598.5

smallest observation (LO): 166.0

highest observation (HI): 26260.0

vegetarianism:

fivenum(vegetarianism.by.country$percent.vegetarian)
## [1] 0.010 0.050 0.065 0.100 0.240

median (M) = 0.065

lower fourth (Fl): 0.050

upper fourth (Fu): 0.100

smallest observation (LO): 0.010

highest observation (HI): 0.240
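The fivenum() calls above give the median, fourths, and extremes, but not the eighths. The sketch below shows one way to get all four letter values with base R, using the usual depth rule (depth of the median is (n + 1) / 2, and each further depth is (floor(previous depth) + 1) / 2). The helper names at_depth and letter_values are hypothetical, and the input is the attendance in thousands, truncated to the leaf unit, as read off the part (a) stemplot, so the results are in thousands rather than exact attendance figures.

```r
# Attendance in thousands, truncated to the leaf unit (read off the part (a) stemplot).
att_k <- c(16, 16, 18, 19,
           20, 20, 22, 22, 23, 23, 23, 25, 25, 26, 26, 27, 29, 29,
           33, 34, 34, 36, 37, 38, 39, 39, 39,
           44, 45, 45)

# Values at a given depth counted in from the low end and from the high end.
# A half-integer depth averages the two neighbouring order statistics.
at_depth <- function(sorted, d) {
  n <- length(sorted)
  idx <- c(floor(d), ceiling(d))
  c(lo = mean(sorted[idx]), hi = mean(sorted[n + 1 - idx]))
}

# Hypothetical helper: median (M), fourths (F), eighths (E), and extremes (X).
letter_values <- function(x) {
  s <- sort(x)
  n <- length(s)
  d_m <- (n + 1) / 2
  d_f <- (floor(d_m) + 1) / 2
  d_e <- (floor(d_f) + 1) / 2
  rbind(
    M = c(depth = d_m, at_depth(s, d_m)),
    F = c(depth = d_f, at_depth(s, d_f)),
    E = c(depth = d_e, at_depth(s, d_e)),
    X = c(depth = 1,   at_depth(s, 1))
  )
}

lv <- letter_values(att_k)
lv
```

On this truncated batch, the M, F, and X rows match fivenum(att_k), and the E row adds the eighths. Applying letter_values() to the full columns (for example baseball.attendance$Avg.Home) would give the letter values in the original units.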

(b) find the mean and median. If these two measures of average are different, explain

why (look at the shape of the dataset).

baseball:

fivenum(baseball.attendance$Avg.Home)
## [1] 16230.0 22850.0 27102.5 37592.0 45715.0
# median (M) = 27102.5 
mean(baseball.attendance$Avg.Home)
## [1] 29514.8

mean = 29514.8

Here, the mean is greater than the median, which makes sense when looking at the
shape of the data set. The upper half of the data stretches farther from the median
(up to about 45,000) than the lower half does (down to about 16,000), which pulls
the mean above the median. The median is more resistant to these higher values.

population:

fivenum(largest.us.cities.by.population$density)
## [1]   166.0  2242.0  3293.0  4598.5 26260.0

median (M) = 3293.0

mean(largest.us.cities.by.population$density)
## [1] 4030.74

mean = 4030.74

Here, the mean is greater than the median, which makes sense when looking at the
shape of the data set. The data is right skewed, with many unusually large values
in the upper tail of the stemplot, and these pull the mean above the median. The
median is more resistant to these higher values.

vegetarianism:

fivenum(vegetarianism.by.country$percent.vegetarian)
## [1] 0.010 0.050 0.065 0.100 0.240

median (M) = 0.065

mean(vegetarianism.by.country$percent.vegetarian)
## [1] 0.077225

mean = 0.077225

Here, the mean is yet again greater than the median, which makes sense when looking
at the shape of the data set. We have two high outliers, which contribute to the
mean being larger, while the median is more resistant to these higher values. Looking
at the shape of the data, we would expect the mean to fall roughly between 0.07 and
0.09, so it is reasonable that the mean is indeed greater than the median.
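The explanations above rely on the median being resistant to large values while the mean is not. A minimal sketch of that idea follows, using the attendance in thousands (truncated to the leaf unit, as read off the part (a) stemplot). The inflated value of 450 is hypothetical and chosen only to exaggerate the effect.

```r
# Attendance in thousands, truncated to the leaf unit (read off the part (a) stemplot).
att_k <- c(16, 16, 18, 19,
           20, 20, 22, 22, 23, 23, 23, 25, 25, 26, 26, 27, 29, 29,
           33, 34, 34, 36, 37, 38, 39, 39, 39,
           44, 45, 45)

mean(att_k)     # about 29.07
median(att_k)   # 26.5

# Hypothetical change: replace one of the largest values with an extreme one.
inflated <- att_k
inflated[which.max(inflated)] <- 450

mean(inflated)    # pulled upward by the single extreme value
median(inflated)  # unchanged, because the middle of the batch did not move
```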

(c) find the fourth spread

baseball:

fourth spread: 37592.0 - 22850.0 = 14742

population:

fourth spread: 4598.5 - 2242.0 = 2356.5

vegetarianism:

fourth spread: 0.100 - 0.050 = 0.05

(d) find the step, the inner and outer lower and upper fences, and identify any outliers

(mild and extreme) in your dataset. Can you offer any explanation for these outliers?

baseball:

step: 1.5 * 14742 = 22113

inner lower fence: Fl - step = 22850.0 - 22113 = 737

inner upper fence: Fu + step = 37592.0 + 22113 = 59705

outer lower fence: Fl - 2 * step = 22850.0 - 2 * 22113 = -21376

outer upper fence: Fu + 2 * step = 37592.0 + 2 * 22113 = 81818

Do we have any outliers? No. All values lie between 16230 and 45715, which is
inside the inner fences (737 and 59705), so there are no mild or extreme outliers.

population:

step: 1.5 * 2356.5 = 3534.75

inner lower fence: Fl - step = 2242.0 - 3534.75 = -1292.75

inner upper fence: Fu + step = 4598.5 + 3534.75 = 8133.25

outer lower fence: Fl - 2 * step = 2242.0 - 2 * 3534.75 = -4827.5

outer upper fence: Fu + 2 * step = 4598.5 + 2 * 3534.75 = 11668

Do we have any outliers? Yes. Looking at the stemplot, there are several values that
are beyond the inner upper fence but still within the outer upper fence (mild
outliers): Lowell, Downey, Seattle, Long Beach, Bridgeport, Garden Grove, Alexandria,
Hialeah, Providence, Washington, Berkeley, Elizabeth, Santa Ana, Philadelphia, Chicago,
and Yonkers. There are also cities whose population density is beyond the outer upper
fence (extreme outliers): Miami, Newark, Boston, San Francisco, Cambridge, Jersey City,
Paterson, and New York City. These values can be explained by the fact that most of
these cities have a large population relative to their area, so the population density
is high. For instance, New York City is a highly populated city where apartments and
high-rise buildings are common, so more people live there per square mile.

vegetarianism:

step: 1.5 * 0.050 = 0.075

inner lower fence: Fl - step = 0.050 - 0.075 = -0.025

inner upper fence: Fu + step = 0.100 + 0.075 = 0.175

outer lower fence: Fl - 2 * step = 0.050 - 2 * 0.075 = -0.1

outer upper fence: Fu + 2 * step = 0.100 + 2 * 0.075 = 0.25

Do we have any outliers? Yes. Looking at the stemplot, two values (0.19 and 0.24) are
beyond the inner upper fence (0.175) but still within the outer upper fence (0.25):
Mexico and India. India’s high percentage of vegetarians could possibly be explained
by a cultural attitude towards consuming meat as well as religious traditions that
promote vegetarianism. For Mexico, there may be several reasons for the high
percentage of vegetarians. These could include cultural beliefs and attitudes, or a
broader definition of vegetarianism in which fish still counts as vegetarian. If the
latter is the case, people who eat no red meat but still eat chicken and fish would
also be counted as vegetarians.
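The fence arithmetic in part (d) can be checked with a short base R sketch. The helper names tukey_fences and classify_outliers are hypothetical. The fourths are taken from the fivenum() output above, and the density values being classified are the ones printed on the HI line of the m = 3 stemplot, not the full data column.

```r
# Step and fences from the lower and upper fourths:
# step = 1.5 * fourth spread; inner fences at one step, outer fences at two steps.
tukey_fences <- function(fl, fu) {
  step <- 1.5 * (fu - fl)
  list(step  = step,
       inner = c(fl - step, fu + step),
       outer = c(fl - 2 * step, fu + 2 * step))
}

# Label each value: inside the inner fences, mild (between the inner and outer
# fences), or extreme (beyond the outer fences).
classify_outliers <- function(x, f) {
  out <- rep("inside", length(x))
  out[x < f$inner[1] | x > f$inner[2]] <- "mild"
  out[x < f$outer[1] | x > f$outer[2]] <- "extreme"
  out
}

baseball <- tukey_fences(22850.0, 37592.0)  # fourths from fivenum(Avg.Home)
dens     <- tukey_fences(2242.0, 4598.5)    # fourths from fivenum(density)
veg      <- tukey_fences(0.050, 0.100)      # fourths from fivenum(percent.vegetarian)

# High density values listed on the HI line of the m = 3 stemplot.
hi_density <- c(9347, 9795, 10064, 10262, 10333, 10755, 10761, 11277, 11378,
                11472, 11475, 12110, 12469, 12774, 15260, 18126, 18216, 18376,
                26260)
dens_labels <- classify_outliers(hi_density, dens)
table(dens_labels)

# Baseball extremes and the two high vegetarian percentages.
classify_outliers(c(16230, 45715), baseball)
classify_outliers(c(0.19, 0.24), veg)
```

In this sketch, the baseball step comes out as 22113, so both attendance extremes fall inside the inner fences. Eight of the listed density values fall beyond the outer upper fence, and both high vegetarian percentages are labelled mild.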