library(devtools)
## Loading required package: usethis
library(LearnEDAfunctions)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: ggplot2
library(aplpack)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(remotes)
##
## Attaching package: 'remotes'
##
## The following objects are masked from 'package:devtools':
##
## dev_package_deps, install_bioc, install_bitbucket, install_cran,
## install_deps, install_dev, install_git, install_github,
## install_gitlab, install_local, install_svn, install_url,
## install_version, update_packages
##
## The following object is masked from 'package:usethis':
##
## git_credentials
1. The dataset baseball.attendance in the LearnEDA package contains
the average (mean)
home attendance of the 30 Major League baseball teams in 2010. For
this exercise,
you don’t have to understand baseball. All you have to know is that
there are 30
baseball teams and the names of the teams correspond to their city
locations.
There are two leagues – the National League has 16 teams and the
American League
has 14 teams. The dataset has eight columns:
Team – the name of the team (city)
League – the league that this team belongs to
N.Home – the number of home games
Avg.Home – the mean attendance (number of people) at games played at
home
Pct.Home – the percentage of park capacity for home games
N.Away – the number of away games
Avg.Away – the mean attendance (number of people) at games played
away from home
Pct.Away – the percentage of park capacity for away games
For this assignment, we focus on the average home attendance
(variable Avg.Home).
(a) Construct a stemplot where
- the breakpoint is between the 10,000 and 1000 places
- you have 10 leaves per stem
aplpack::stem.leaf(baseball.attendance$Avg.Home, unit = 1000, m = 1)
## 1 | 2: represents 12000
## leaf unit: 1000
## n: 30
## 4 1 | 6689
## (14) 2 | 00223335566799
## 12 3 | 344678999
## 3 4 | 455
Discuss what you see in this distribution of attendance numbers,
including shape,
“average”, spread, and any unusual characteristics. If there are any
unusual high
or low attendances, identify the cities that have these unusual mean
attendances.
The distribution of attendance numbers is roughly symmetric, with a slight
skew to the right. The “average” falls in the 20,000s. As for spread, most of
the attendance numbers fall roughly between 20,000 and 40,000. There are no
obvious unusual characteristics. One thing to note is that Florida and
Cleveland have the lowest mean attendance at about 16,000, whereas NY_Yankees,
Philadelphia, and LA_Dodgers have the highest mean attendance at about 44,000
to 45,000 each.
(b) Redraw the stemplot using the same breakpoint and 5 leaves per
stem.
aplpack::stem.leaf(baseball.attendance$Avg.Home, unit = 1000, m = 1.5)
## 1 | 2: represents 12000
## leaf unit: 1000
## n: 30
## 2 1 | 66
## 10 1 | 00223389
## (8) 2 | 35566799
## 12 3 | 3446
## 8 3 | 78999
## 3 4 | 455
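As far as I understand the `aplpack::stem.leaf` documentation, the `m` argument is the number of lines each stem is split into (typically 1, 2, or 5), not the number of leaves per line. Under that reading, 5 leaves per stem corresponds to `m = 2`, and the non-standard `m = 1.5` used above is probably why two rows of the display share the stem label 1. The sketch below shows the conversion; `lines_per_stem()` is a hypothetical helper, and `attendance_thousands` holds the truncated values read off the first stemplot, not the raw `Avg.Home` column.

```r
# Hypothetical helper: convert "leaves per stem line" into the number of
# lines per stem (the assumed meaning of the m argument of stem.leaf).
lines_per_stem <- function(leaves_per_line) {
  m <- 10 / leaves_per_line
  stopifnot(m %in% c(1, 2, 5))  # the standard ways to split a ten-digit stem
  m
}

lines_per_stem(10)  # 1
lines_per_stem(5)   # 2
lines_per_stem(2)   # 5

# Attendance in thousands, read off the 10-leaves-per-stem display above
# (truncated values, not the raw data).
attendance_thousands <- c(16, 16, 18, 19,
                          20, 20, 22, 22, 23, 23, 23, 25, 25, 26, 26, 27, 29, 29,
                          33, 34, 34, 36, 37, 38, 39, 39, 39,
                          44, 45, 45)

# Draw the 5-leaves-per-stem display only if aplpack is installed.
if (requireNamespace("aplpack", quietly = TRUE)) {
  aplpack::stem.leaf(attendance_thousands, unit = 1, m = lines_per_stem(5))
}
```

With the real data, the equivalent call would be the one from part (a) with `m = 2` in place of `m = 1`.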
(c) Redraw the stemplot using the same breakpoint and 2 leaves per
stem.
aplpack::stem.leaf(baseball.attendance$Avg.Home, unit = 1000, m = 5)
## 1 | 2: represents 12000
## leaf unit: 1000
## n: 30
## 2 s | 66
## 4 1. | 89
## 6 2* | 00
## 11 t | 22333
## 13 f | 55
## (3) s | 667
## 14 2. | 99
## 3* |
## 12 t | 3
## 11 f | 44
## 9 s | 67
## 7 3. | 8999
## 4* |
## t |
## 3 f | 455
(d) Discuss any features that you’ve learned about this dataset by
constructing
these two additional stemplots. What is the best choice of stemplot?
Why?
In the stemplot with 5 leaves per stem, we can see that there are two modes
(or peaks) in the attendance numbers. In both the 5-leaf and the 2-leaf
stemplots, we can now see more clearly that the three mean attendances at
roughly 44,000 and 45,000 are unusually high values. Overall, the stemplot
with 2 leaves per stem looks too spread out and contains gaps. This is why I
think the stemplot with 5 leaves per stem is the best choice: it lets us see
the two modes as well as the unusually high attendance means, neither of which
was visible in the stemplot with 10 leaves per stem.
2. In Assignment 1, you found two datasets.
For each dataset …
(a) Read the data into R.
largest.us.cities.by.population <- read.csv("C:/Users/eclai/Downloads/largest.us.cities.by.population.csv")
vegetarianism.by.country <- read.csv("C:/Users/eclai/Downloads/vegetarianism.by.country.csv")
Note: for the purposes of this assignment, we focus on the percentage of
vegetarians and the population density to construct the stemplots.
(b) Construct two different stemplots using R.
largest US cities by population: density
aplpack::stem.leaf(largest.us.cities.by.population$density, unit = 100, m = 2)
## 1 | 2: represents 1200
## leaf unit: 100
## n: 300
## 2 0* | 13
## 7 0. | 56779
## 34 1* | 000011112222233333334444444
## 61 1. | 556666677788888889999999999
## 97 2* | 000000011122222222233333333444444444
## 129 2. | 55555555666666667777778888899999
## (31) 3* | 0000000000000111122222333344444
## 140 3. | 55555666666667777788888899999999999
## 105 4* | 000011111111112223333333344
## 78 4. | 555666777777899
## 63 5* | 0011222344
## 53 5. | 556777778
## 44 6* | 0022
## 40 6. | 568888899
## 31 7* | 113
## 28 7. | 557
## 25 8* | 01
## 23 8. | 666
## 20 9* | 23
## HI: 9795 10064 10262 10333 10755 10761 11277 11378 11472 11475 12110 12469 12774 15260 18126 18216 18376 26260
aplpack::stem.leaf(largest.us.cities.by.population$density, unit = 100, m = 1)
## 1 | 2: represents 1200
## leaf unit: 100
## n: 300
## 7 0 | 1356779
## 61 1 | 000011112222233333334444444556666677788888889999999999
## 129 2 | 00000001112222222223333333344444444455555555666666667777778888899999
## (66) 3 | 000000000000011112222233334444455555666666667777788888899999999999
## 105 4 | 000011111111112223333333344555666777777899
## 63 5 | 0011222344556777778
## 44 6 | 0022568888899
## 31 7 | 113557
## 25 8 | 01666
## 20 9 | 237
## HI: 10064 10262 10333 10755 10761 11277 11378 11472 11475 12110 12469 12774 15260 18126 18216 18376 26260
aplpack::stem.leaf(largest.us.cities.by.population$density, unit = 100, m = 3)
## 1 | 2: represents 1200
## leaf unit: 100
## n: 300
## 2 0 | 13
## 4 0 | 56
## 7 0 | 779
## 25 1 | 000011112222233333
## 39 1 | 33444444455666
## 61 1 | 6677788888889999999999
## 81 2 | 00000001112222222223
## 112 2 | 3333333444444444555555556666666
## 129 2 | 67777778888899999
## (23) 3 | 00000000000001111222223
## 148 3 | 333444445555566666666
## 127 3 | 7777788888899999999999
## 105 4 | 00001111111111222333
## 85 4 | 3333344555666
## 72 4 | 777777899
## 63 5 | 0011222
## 56 5 | 344556
## 50 5 | 777778
## 44 6 | 0022
## 40 6 | 5
## 39 6 | 68888899
## 31 7 | 11
## 29 7 | 355
## 26 7 | 7
## 25 8 | 01
## 23 8 | 666
## 8 |
## 20 9 | 2
## HI: 9347 9795 10064 10262 10333 10755 10761 11277 11378 11472 11475 12110 12469 12774 15260 18126 18216 18376 26260
vegetarianism by country: percentage of vegetarians
aplpack::stem.leaf(vegetarianism.by.country$percent.vegetarian, unit = 0.001, m = 1)
## 1 | 2: represents 0.012
## leaf unit: 0.001
## n: 40
## 4 1 | 0024
## 2 |
## 6 3 | 03
## 8 4 | 03
## 17 5 | 000000022
## (3) 6 | 000
## (3) 7 | 000
## 17 8 | 49
## 15 9 | 00
## 13 10 | 00000
## 11 |
## 8 12 | 000
## 5 13 | 0
## 4 14 | 00
## HI: 0.19 0.24
aplpack::stem.leaf(vegetarianism.by.country$percent.vegetarian, unit = 0.001, m = 0.5)
## 1 | 2: represents 0.012
## leaf unit: 0.001
## n: 40
## 4 1 | 0024
## 8 3 | 0033
## (12) 5 | 000000000022
## (5) 7 | 00049
## 15 9 | 0000000
## 8 11 | 000
## 5 13 | 000
## HI: 0.19 0.24
(c) Discuss the data (shape, average, spread, unusual
characteristics, etc.) and
discuss the best choice of stemplot.
largest US cities by population: density
The distribution of the population density numbers is skewed right. The
“average” is roughly in the 3,000s. As for spread, most of the population
density numbers fall roughly between 1,000 and 8,000. As for unusual
characteristics, there are a number of noticeably high values: 9347, 9795,
10064, 10262, 10333, 10755, 10761, 11277, 11378, 11472, 11475, 12110, 12469,
12774, 15260, 18126, 18216, 18376, and 26260. Here, I would say that the
stemplot drawn with m = 3 is the best choice, although the stemplot drawn
with m = 2 (5 leaf digits per line) also works. The m = 3 stemplot clearly
shows the spread and the peaks of the data as well as the shape (right
skewed). We also see more structure than in the stemplot drawn with m = 1
(all 10 leaf digits on one line).
vegetarianism by country: percentage of vegetarians
The distribution of the percentage of vegetarians is almost symmetrical,
although it is skewed right slightly because of the two high values. The
“average” is roughly 0.07. As for spread, most of the percentages fall roughly
between 0.01 and 0.13. As for unusual characteristics, there are two
noticeably high values, 0.19 and 0.24. Here, I would say that the best choice
is the first stemplot (m = 1). It does contain two gaps, but compared to the
second stemplot (m = 0.5), the data is more spread out and not as clustered.
PART B: Working with a Single Batch – Summaries
For each of the three datasets (the baseball attendance data and the
two interesting datasets that you found)
(a) compute the letter values (median, fourths, eighths, and
extremes)
baseball:
fivenum(baseball.attendance$Avg.Home)
## [1] 16230.0 22850.0 27102.5 37592.0 45715.0
median (M) = 27102.5
lower fourth (Fl): 22850.0
upper fourth (Fu): 37592.0
smallest observation (LO): 16230.0
highest observation (HI): 45715.0
population:
fivenum(largest.us.cities.by.population$density)
## [1] 166.0 2242.0 3293.0 4598.5 26260.0
median (M) = 3293.0
lower fourth (Fl): 2242.0
upper fourth (Fu): 4598.5
smallest observation (LO): 166.0
highest observation (HI): 26260.0
vegetarianism:
fivenum(vegetarianism.by.country$percent.vegetarian)
## [1] 0.010 0.050 0.065 0.100 0.240
median (M) = 0.065
lower fourth (Fl): 0.050
upper fourth (Fu): 0.100
smallest observation (LO): 0.010
highest observation (HI): 0.240
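The question also asks for the eighths, which `fivenum()` does not return. A minimal sketch of how they could be computed with Tukey's depth rule is given below. `letter_values()` is a hypothetical helper written for illustration, not a function from any of the loaded packages, and it is checked on `1:9` rather than on the assignment data. Its median, fourths, and extremes should agree with `fivenum()`, which uses the same depth rule for the fourths.

```r
# Hypothetical helper: letter values (median, fourths, eighths, extremes)
# using Tukey's depth rule: next depth = (floor(previous depth) + 1) / 2.
letter_values <- function(x) {
  x <- sort(x[!is.na(x)])
  n <- length(x)

  # Value at a (possibly half-integer) depth counted from the start of `sorted`.
  at_depth <- function(sorted, d) {
    mean(sorted[c(floor(d), ceiling(d))])
  }

  depth_m <- (n + 1) / 2
  depth_f <- (floor(depth_m) + 1) / 2
  depth_e <- (floor(depth_f) + 1) / 2
  depths  <- c(M = depth_m, F = depth_f, E = depth_e, X = 1)

  data.frame(
    depth = depths,
    lower = sapply(depths, function(d) at_depth(x, d)),       # count from the bottom
    upper = sapply(depths, function(d) at_depth(rev(x), d)),  # count from the top
    row.names = names(depths)
  )
}

# Small check on 1:9, where the answers are easy to verify by hand:
# median 5, fourths 3 and 7, eighths 2 and 8, extremes 1 and 9.
letter_values(1:9)
```

Applied to `baseball.attendance$Avg.Home`, the `E` row would give the lower and upper eighths.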
(b) find the mean and median. If these two measures of average are
different, explain
why (look at the shape of the dataset).
baseball:
fivenum(baseball.attendance$Avg.Home)
## [1] 16230.0 22850.0 27102.5 37592.0 45715.0
# median (M) = 27102.5
mean(baseball.attendance$Avg.Home)
## [1] 29514.8
mean = 29514.8
Here, the mean is greater than the median, which makes sense given the shape
of the data set. The values above the median extend into the 30,000s and
40,000s, farther from the center than the values below it, which pulls the
mean above the median. The median is more resistant to these higher values.
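The resistance of the median can be illustrated with a small experiment. The sketch below uses the attendance values in thousands as read off the first stemplot (truncated, so the summaries only approximate those computed from `Avg.Home`), and then replaces the largest value with a made-up figure of 90 to see which summary reacts.

```r
# Attendance in thousands, read off the first stemplot (truncated values).
attendance_thousands <- c(16, 16, 18, 19,
                          20, 20, 22, 22, 23, 23, 23, 25, 25, 26, 26, 27, 29, 29,
                          33, 34, 34, 36, 37, 38, 39, 39, 39,
                          44, 45, 45)

mean(attendance_thousands)    # a little above the median
median(attendance_thousands)  # 26.5

# Inflate the largest value; 90 is a hypothetical figure, not in the data.
inflated <- attendance_thousands
inflated[which.max(inflated)] <- 90

mean(inflated) - mean(attendance_thousands)      # positive: the mean is pulled up
median(inflated) - median(attendance_thousands)  # 0: the median does not move
```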
population:
fivenum(largest.us.cities.by.population$density)
## [1] 166.0 2242.0 3293.0 4598.5 26260.0
mean = 4030.74
Here, the mean is greater than the median, which makes sense given the shape
of the data set. The data is right skewed, and the many unusually large
values in the upper tail pull the mean up, while the median is more resistant
to these higher values.
vegetarianism:
fivenum(vegetarianism.by.country$percent.vegetarian)
## [1] 0.010 0.050 0.065 0.100 0.240
mean = 0.077225
Here, the mean is yet again greater than the median, which makes sense given
the shape of the data set. The two high outliers (0.19 and 0.24) pull the mean
up, while the median is more resistant to these higher values. Looking at the
shape of the data, we would expect the mean to fall roughly between 0.07 and
0.09, so it is reasonable that the mean is greater than the median.
(c) find the fourth spread
baseball:
fourth spread: 37592.0 - 22850.0 = 14742
population:
fourth spread: 4598.5 - 2242.0 = 2356.5
vegetarianism:
fourth spread: 0.100 - 0.050 = 0.05
(d) find the step, the inner and outer lower and upper fences, and
identify any outliers
(mild and extreme) in your dataset. Can you offer any explanation
for these outliers?
baseball:
step: 1.5 * 14742 = 22113
inner lower fence: Fl - step = 22850.0 - 22113 = 737
inner upper fence: Fu + step = 37592.0 + 22113 = 59705
outer lower fence: Fl - 2 * step = 22850.0 - 2 * 22113 = -21376
outer upper fence: Fu + 2 * step = 37592.0 + 2 * 22113 = 81818
Do we have any outliers? No. The smallest observation (16230.0) and the
largest (45715.0) both lie inside the inner fences, so there are no outliers.
population:
step: 1.5 * 2356.5 = 3534.75
inner lower fence: Fl - step = 2242.0 - 3534.75 = -1292.75
inner upper fence: Fu + step = 4598.5 + 3534.75 = 8133.25
outer lower fence: Fl - 2 * step = 2242.0 - 2 * 3534.75 = -4827.5
outer upper fence: Fu + 2 * step = 4598.5 + 2 * 3534.75 = 11668
Do we have any outliers? Yes. Several values are beyond the inner upper fence
but still within the outer fence: Lowell, Downey, Seattle, Long Beach,
Bridgeport, Garden Grove, Alexandria, Hialeah, Providence, Washington,
Berkeley, Elizabeth, Santa Ana, Philadelphia, Chicago, and Yonkers. Some
cities also have a population density beyond the outer upper fence: Miami,
Newark, Boston, San Francisco, Cambridge, Jersey City, Paterson, and New York
City. These values can be explained by the fact that most of these cities have
large populations relative to their area, so the population density is high.
For instance, New York City is highly populated, and apartments and high-rise
buildings are common there, so more people live there per square mile.
vegetarianism:
step: 1.5 * 0.050 = 0.075
inner lower fence: Fl - step = 0.050 - 0.075 = -0.025
inner upper fence: Fu + step = 0.100 + 0.075 = 0.175
outer lower fence: Fl - 2 * step = 0.050 - 2 * 0.075 = -0.1
outer upper fence: Fu + 2 * step = 0.100 + 2 * 0.075 = 0.25
Do we have any outliers? Yes. Two values are beyond the inner upper fence but
still within the outer fence: Mexico and India. India’s high percentage of
vegetarians could possibly be explained by a cultural attitude towards
consuming meat as well as religious traditions that promote vegetarianism.
When it comes to Mexico, there may be several reasons for the high percentage
of vegetarians. Some of them include cultural beliefs and attitudes, recent
vegetarianism trends, and the fact that in some countries people who still
eat chicken or fish are counted as vegetarian. If the latter is the case,
people who eat no red meat but still eat chicken and fish would also be
counted in the number of vegetarians.
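The fence arithmetic can be cross-checked by recomputing it directly from the fourths returned by `fivenum()`. The sketch below is one way to do that; `fences()` and `classify()` are hypothetical helpers written for this check, not functions from the loaded packages. Upper fences add the step to Fu and lower fences subtract it from Fl.

```r
# Hypothetical helper: step and fences from the lower and upper fourths.
fences <- function(fl, fu) {
  step <- 1.5 * (fu - fl)  # step = 1.5 * fourth spread
  c(step        = step,
    inner_lower = fl - step,
    inner_upper = fu + step,
    outer_lower = fl - 2 * step,
    outer_upper = fu + 2 * step)
}

# Hypothetical helper: label values as "ok", "mild" (beyond an inner fence)
# or "extreme" (beyond an outer fence).
classify <- function(x, fl, fu) {
  f <- fences(fl, fu)
  out <- rep("ok", length(x))
  out[x < f["inner_lower"] | x > f["inner_upper"]] <- "mild"
  out[x < f["outer_lower"] | x > f["outer_upper"]] <- "extreme"
  out
}

# Fourths taken from the fivenum() output reported above.
fences(22850.0, 37592.0)  # baseball attendance
fences(2242.0, 4598.5)    # population density
fences(0.050, 0.100)      # percent vegetarian

# Extremes of the baseball data, and the two HI values of the vegetarian data.
classify(c(16230, 45715), 22850.0, 37592.0)
classify(c(0.19, 0.24), 0.050, 0.100)
```

Passing the full `density` column to `classify()` together with its fourths would label every city at once, which avoids reading the outliers off the stemplot by eye.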