library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## Warning: package 'dplyr' was built under R version 4.0.3
## -- Conflicts -------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ggplot2::mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manual~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manual~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto(a~ f 21 30 p comp~
## 5 audi a4 2.8 1999 6 auto(l~ f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manual~ f 18 26 p comp~
## 7 audi a4 3.1 2008 6 auto(a~ f 18 27 p comp~
## 8 audi a4 quat~ 1.8 1999 4 manual~ 4 18 26 p comp~
## 9 audi a4 quat~ 1.8 1999 4 auto(l~ 4 16 25 p comp~
## 10 audi a4 quat~ 2 2008 4 manual~ 4 20 28 p comp~
## # ... with 224 more rows
str(mpg)
## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
1. Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)
This code produces a blank plot. No geom function was added, so ggplot() only sets up the plotting area; nothing tells it how to draw the data.
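For contrast, adding any geom makes the data appear (a minimal sketch):
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))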
2. How many rows are in mpg? How many columns?
To answer this question I ran:
str(mpg)
## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
str(mpg) shows a tibble that is 234 x 11, indicating 234 rows and 11 columns.
I also could have run nrow(mpg) or ncol(mpg) for the same results.
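For example (a minimal sketch; dim() reports both at once):
nrow(mpg) # 234
ncol(mpg) # 11
dim(mpg)  # 234 11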
3. What does the drv variable describe? Read the help for ?mpg to find out.
?mpg
## starting httpd help server ... done
I ran the above and a help file opened in the Help tab explaining the abbreviations in the mpg data frame. It shows that drv describes "the type of drive train, where f = front-wheel drive, r = rear wheel drive, 4 = 4wd". drv is a categorical variable.
4. Make a scatterplot of hwy vs cyl.
ggplot(data = mpg) +
geom_point(mapping = aes(x = cyl, y = hwy))
5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = drv))
This plot shows only a handful of visible points. Both class and drv are categorical, so many observations land on exactly the same spot and overplot one another; the scatterplot can't show how many rows sit at each combination. A bar chart, or a geom that counts overlaps, would be more useful here; continuous variables suit scatterplots better.
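One quick way to recover the hidden overlap is geom_count(), which sizes each point by the number of rows it represents (a sketch):
ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))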
1. What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
The points are not blue because color = "blue" sits inside aes(). Aesthetic mappings map data to visual properties, so ggplot treats the string "blue" as a one-level categorical variable, assigns it the default color, and adds it to the legend as a label for the points.
We need to move color = "blue" outside of aes() so it is set as a fixed property rather than mapped.
This is how we’d change it so the points are blue:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l~ f 18 29 p comp~
## 2 audi a4 1.8 1999 4 manual~ f 21 29 p comp~
## 3 audi a4 2 2008 4 manual~ f 20 31 p comp~
## 4 audi a4 2 2008 4 auto(a~ f 21 30 p comp~
## 5 audi a4 2.8 1999 6 auto(l~ f 16 26 p comp~
## 6 audi a4 2.8 1999 6 manual~ f 18 26 p comp~
## 7 audi a4 3.1 2008 6 auto(a~ f 18 27 p comp~
## 8 audi a4 quat~ 1.8 1999 4 manual~ 4 18 26 p comp~
## 9 audi a4 quat~ 1.8 1999 4 auto(l~ 4 16 25 p comp~
## 10 audi a4 quat~ 2 2008 4 manual~ 4 20 28 p comp~
## # ... with 224 more rows
Manufacturer, model, transmission (trans), drive train (drv), fuel type (fl), and class are categorical variables. Displacement (displ), city mpg (cty), highway mpg (hwy), year, and cyl are continuous by storage type, though year and cyl only take a few discrete values in practice.
You can see this information when you run mpg by looking at the type code printed under each column name: <chr> marks a categorical column, while <int> and <dbl> mark continuous ones.
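glimpse() from the tidyverse shows the same type codes compactly (a sketch):
glimpse(mpg) # <chr> columns are categorical; <int> and <dbl> columns are continuous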
3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cty))
This code (^) maps the continuous cty variable to color. It works!
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
This code (^) maps cty to shape. It does not work! The error message reads: Error: A continuous variable can not be mapped to shape.
This makes sense: shapes have no inherent order, so there is no sensible way to assign shapes to an ordered, continuous variable.
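If you want shapes anyway, one workaround is to bin the continuous variable into a categorical one first (a sketch using base R's cut(); shape handles at most six discrete levels by default):
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = cut(cty, 3))) # 3 bins -> 3 shapes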
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = cty))
This code maps cty to size. It works! Point size is ordered, so it can represent a continuous variable.
4. What happens if you map the same variable to multiple aesthetics?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = cty, size = cty))
I tried this with cty mapped to both size and color, and it worked! The two aesthetics encode the same information redundantly, and ggplot merges them into a single legend.
5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
The stroke aesthetic controls the width of the border on shapes that have both a border color and a fill, i.e. shapes 21 through 24. It lets you, for instance, outline a filled point in a different color. After much trial and error (and some googling), I tried it out below.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), shape = 21, color = "black", fill = "lightsalmon", size = 4, stroke = 2)
6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.
This colors the points by whether the expression is TRUE or FALSE: displ < 5 is evaluated for every row, and the resulting logical is mapped to color, splitting the data at the indicated threshold.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))
1. What happens if you facet on a continuous variable?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(~ cty)
This code (^), which facets on cty, produces a separate facet for every unique value, far too many to read for a continuous variable.
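Binning the variable first keeps the facet count manageable (a sketch using ggplot2's cut_width() helper):
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ cut_width(cty, 5)) # facet on 5-mpg bins instead of every raw value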
2. What do the empty cells in a plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))
The empty cells indicate combinations of drv and cyl that contain no observations; in the scatterplot above they correspond to the missing points at those same (drv, cyl) pairs.
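For reference, a sketch of the faceted plot the question refers to; its empty cells are those same missing combinations:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)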
3. What plots does the following code make? What does . do?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
The . means "do not facet on this dimension": facet_grid(drv ~ .) facets by drv in rows only, and facet_grid(. ~ cyl) facets by cyl in columns only.
4. Take the first faceted plot in this section:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
Faceting is useful because you see the exact data for each group separately, which helps the viewer understand how each class fits within the overall trend.
Its disadvantages are that you don't see all the data together, and it's slightly boring.
If the dataset were larger and shown in color, there would be too many overlapping colors to distinguish and the plot would be overwhelming. As a dataset grows, color gets harder to read, so faceting becomes the better choice.
5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?
?facet_wrap
In facet_wrap(), nrow and ncol set the number of rows and columns of panels in the layout (not of a data frame).
facet_wrap() needs them because it wraps a single faceting variable into a grid of its own choosing; facet_grid() doesn't have them because its layout is fixed by the levels of the two faceting variables.
According to the help file, facet_wrap() also controls panel layout through scales, labeller, drop, dir, and as.table.
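A sketch exercising two of those layout options:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, ncol = 4, dir = "v") # four columns of panels, filled vertically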
6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?
Plots are usually wider than they are tall, so there is more horizontal space; putting the variable with more unique levels in the columns makes better use of it.
1 What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
Line chart: geom_line()
Boxplot: geom_boxplot()
Histogram: geom_histogram()
Area chart: geom_area()
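Minimal sketches of each, using mpg:
ggplot(mpg, aes(x = displ, y = hwy)) + geom_line() # line chart
ggplot(mpg, aes(x = class, y = hwy)) + geom_boxplot() # boxplot
ggplot(mpg, aes(x = hwy)) + geom_histogram(binwidth = 2) # histogram
ggplot(mpg, aes(x = hwy)) + geom_area(stat = "bin", binwidth = 2) # area chart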
2 Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
My prediction was that the colors would be grouped by drv, and that both a scatterplot and a smooth line would be displayed. I was not sure what the standard error would do.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
After plotting it, I was surprised that the smooth lines came out in the same three colors the points were grouped by. On reviewing this section it made sense: the mapping was set in ggplot() rather than in an individual geom, so both geoms inherit the color grouping. I didn't figure out what the standard error did here, but I worked it out in question 4.
3 What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?
show.legend = FALSE removes the layer's legend from the plot. If you remove the argument (or set it to TRUE), the legend appears. It was used earlier in the chapter because the neighbouring plots had no legends, and adding one here would have looked inconsistent.
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, color = drv),
show.legend = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# If you remove show.legend = FALSE, the legend comes back:
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, color = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
4. What does the se argument to geom_smooth() do?
?se
## No documentation for 'se' in specified packages and libraries:
## you could try '??se'
?geom_smooth
se stands for standard error. With se = TRUE (the default), geom_smooth() draws a grey ribbon around the smooth line: a confidence interval computed from the standard error of the fit. se = FALSE removes the ribbon.
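A sketch making the defaults explicit (se = TRUE and level = 0.95 are geom_smooth()'s defaults):
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(se = TRUE, level = 0.95) # smooth line plus its 95% confidence ribbon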
5. Will these two graphs look different? Why/Why not?
No. Both graphs use the same data and the same mappings; they are just written differently. The first passes the data and mapping through ggplot(), where both geoms inherit them; the second repeats the data and mapping in each geom. The first method is more concise and less error-prone.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
6. Recreate the R code necessary to generate the following graphs.
a.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
b.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, group = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
c
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
d
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(aes(color = drv)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
e
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(aes(color = drv)) +
geom_smooth(aes(linetype = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
f
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(size = 5, color = "white") +
geom_point(aes(color = drv))
I’m a big fan of Wes Anderson films…so here we go.
library(wesanderson)
## Warning: package 'wesanderson' was built under R version 4.0.3
1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
library(tidyverse)
You could totally make three separate histograms, one each for x, y, and z. But you can also overlay the three to compare them directly using geom_freqpoly().
?diamonds
str(diamonds)
## tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
# x = length, y = width, z = depth
ggplot(data = diamonds) +
geom_freqpoly(binwidth = 0.1, aes(x = x), color = wes_palette(n = 1, name = "Moonrise3")) +
geom_freqpoly(binwidth = 0.1, aes(x = y), color = wes_palette(n = 1, name = "BottleRocket1")) +
geom_freqpoly(binwidth = 0.1, aes(x = z), color = wes_palette(n = 1, name = "IsleofDogs1"))
These colors aren't distinct enough for a visually busy graph like this.
ggplot(data = diamonds) +
geom_freqpoly(binwidth = 0.1, aes(x = x), color = "red") +
geom_freqpoly(binwidth = 0.1, aes(x = y), color = "blue") +
geom_freqpoly(binwidth = 0.1, aes(x = z), color = "green")
The issue here is that the x and y distributions are too similar to stand out from one another on a single graph.
So I'll show the individual histograms to demonstrate each a little better.
ggplot(data = diamonds, aes(x = x)) +
geom_histogram(binwidth = 0.1, color = wes_palette(n = 1, name = "Moonrise3"))
ggplot(data = diamonds, aes(x = y)) +
geom_histogram(binwidth = 0.1, color = wes_palette(n = 1, name = "BottleRocket1"))
ggplot(data = diamonds, aes(x = z)) +
geom_histogram(binwidth = 0.1, color = wes_palette(n = 1, name = "IsleofDogs1"))
mean(diamonds$carat)
## [1] 0.7979397
Basically, the three measurements look a lot alike. Across the board, very few diamonds measure much beyond the mean of the respective dimension.
x and y are very similar, both in their counts and in running higher than z overall. z, however, has far more counts in the lower ranges, along with more outliers.
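For the record, pivoting the three columns into one would give a single legend instead of hand-picked colors (a sketch using tidyr::pivot_longer(), which the tidyr version loaded above provides):
diamonds %>%
  pivot_longer(cols = c(x, y, z), names_to = "dimension", values_to = "mm") %>%
  ggplot(aes(x = mm, color = dimension)) +
  geom_freqpoly(binwidth = 0.1) # one curve per dimension, distinguished in the legend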
2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
Okay, this is a little simpler than the previous question. We can just make a straightforward histogram or two.
ggplot(data = diamonds, aes(price)) +
geom_histogram(binwidth = 2, color = wes_palette(n = 1, name = "Moonrise1"))
ggplot(data = diamonds, aes(price)) +
geom_histogram(binwidth = 10, color = wes_palette(n = 1, name = "GrandBudapest2"))
ggplot(data = diamonds, aes(price)) +
geom_histogram(binwidth = 20, color = wes_palette(n = 1, name = "Zissou1"))
Counts are concentrated heavily at the low end, with a huge jump from the under-$1,000 prices to everything above them; from around $2,000 onward there is a gradual drop-off. However, there is a gap around $1,500 with no diamonds at all, which suggests some kind of market control. Binwidths of 10 and 20 were far more revealing than 2.
3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
I try to avoid it like the plague, but for this you have to use some tidyverse verbs.
We don't really need a graph; we can just summarize the information. Because we are only looking at two carat values, we can filter them out of the larger dataset.
diamonds %>% group_by(carat) %>%
filter(between(carat, 0.99, 1.0)) %>%
summarize(count = n())
## # A tibble: 2 x 2
## carat count
## <dbl> <int>
## 1 0.99 23
## 2 1 1558
0.99 had 23 and 1 had 1,558. This is very interesting, and probably tells us that carat weights just below a round number tend to get rounded up to it.
Let us expand this to include 0.90 to 1.10.
diamond.prices <- diamonds %>% group_by(carat) %>%
filter(between(carat, 0.89, 1.10)) %>%
summarize(count = n())
str(diamond.prices)
## tibble [22 x 2] (S3: tbl_df/tbl/data.frame)
## $ carat: num [1:22] 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 ...
## $ count: int [1:22] 21 1485 570 226 142 59 65 103 59 31 ...
This is strange: you see the same pattern at 0.90 as at 1.00, each with a far higher count than the values just below it. (I originally ran it from 0.90 to 1.10 but widened it to compare 0.89 against 0.90.) The same jump happens between 0.89 and 0.90, and there is a small uptick again at 0.91.
You get an even bigger boost at 1.01, suggesting that many of these reported weights are somewhat arbitrary, or at least that a difference of 0.01 carat is.
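A quick bar chart of the diamond.prices summary makes those spikes visible (a sketch):
ggplot(diamond.prices, aes(x = carat, y = count)) +
  geom_col() # tall bars at the 'round' carat weights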
4. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
ggplot(diamonds) +
geom_histogram(aes(x = price), binwidth = 20) +
xlim(c(0, 3000)) +
ylim(c(0, 600))
## Warning: Removed 23604 rows containing non-finite values (stat_bin).
## Warning: Removed 6 rows containing missing values (geom_bar).
ggplot(diamonds) +
geom_histogram(aes(x = price), binwidth = 20) +
coord_cartesian(xlim = c(0, 3000), ylim = c(0, 600))
The main difference is that xlim() and ylim() drop the data outside the limits before the histogram is computed (hence the warnings about removed rows), while coord_cartesian() keeps all the data and merely zooms the view, so bars that extend past the range are clipped rather than removed.
When you do not include binwidth:
ggplot(diamonds) +
geom_histogram(aes(x = price)) +
xlim(c(0, 3000)) +
ylim(c(0, 600))
With binwidth unset, geom_histogram() falls back to its default of 30 bins and prints a message suggesting you pick a better binwidth yourself.
Nanxiong <- read.csv("C:/Users/Cameron/Desktop/UMD 2020-2021/Quant/MyRProject/data/Nanxiong.csv")
1. Continue to explore the areas of sites near Nanxiong given in Table 3.5 by making box-and-dot plots of Early and Late Bronze Age site areas. How do the levels of the two batches compare?
There are at least three ways you can approach this.
Nanxiong.Early <- subset(x = Nanxiong,
subset = Period == "Early Bronze Age")
Nanxiong.Late <- subset(x = Nanxiong,
subset = Period == "Late Bronze Age")
boxplot(Nanxiong.Late$Area, ylim = c(0, 15), las = 1)
boxplot(Nanxiong.Early$Area, ylim = c(0, 15), las = 1)
Early.Area <- Nanxiong.Early$Area
fivenum(Early.Area)
## [1] 0.60 1.15 1.90 2.30 4.20
boxplot(Early.Area, ylim = c(0, 15), las = 1)
Late.Area <- Nanxiong.Late$Area
fivenum(Late.Area)
## [1] 2.60 4.15 5.40 7.25 12.80
boxplot(Late.Area, ylim = c(0, 15), las = 1)
boxplot(list(Early.Area, Late.Area), ylim = c(0, 15), las = 1, ylab = "Area", names = c("Early Bronze Age", "Late Bronze Age"))
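The same comparison works in ggplot2 too, with the sites overlaid as jittered dots (a sketch, assuming Nanxiong contains only the two Bronze Age periods used above):
ggplot(Nanxiong, aes(x = Period, y = Area)) +
  geom_boxplot() +
  geom_jitter(width = 0.1) # jitter keeps overlapping sites visible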
This for sure lines up with our observations from last week. Not only are there more recorded sites from the Late Bronze Age, but all five summary levels (the extremes, the quartiles, and the median) are higher for that period, and Late Bronze Age sites are far more varied and widespread in size.
According to this data, there is seemingly little variance in area size at Early Bronze Age sites.
2. Now compare Early and Late Bronze Age site areas by drawing box and dot plots with the levels removed. How do the spreads of the two batches compare?
Early.Area
## [1] 1.8 1.0 1.9 0.6 2.3 1.2 0.8 4.2 1.5 2.6 2.1 1.7 2.3 2.4 0.6 2.9 2.0 2.2 1.9
## [20] 1.1 2.6 2.2 1.7 1.1
(EB.1 <- Early.Area - median(Early.Area))
## [1] -0.1 -0.9 0.0 -1.3 0.4 -0.7 -1.1 2.3 -0.4 0.7 0.2 -0.2 0.4 0.5 -1.3
## [16] 1.0 0.1 0.3 0.0 -0.8 0.7 0.3 -0.2 -0.8
fivenum(EB.1)
## [1] -1.30 -0.75 0.00 0.40 2.30
Late.Area
## [1] 10.4 5.9 12.8 4.6 7.8 4.1 2.6 8.4 5.2 4.5 4.1 4.0 11.2 6.7 5.8
## [16] 3.9 9.2 5.6 5.4 4.8 4.2 3.0 6.1 5.1 6.3 12.3 3.9
(LB.1 <- Late.Area - median(Late.Area))
## [1] 5.0 0.5 7.4 -0.8 2.4 -1.3 -2.8 3.0 -0.2 -0.9 -1.3 -1.4 5.8 1.3 0.4
## [16] -1.5 3.8 0.2 0.0 -0.6 -1.2 -2.4 0.7 -0.3 0.9 6.9 -1.5
fivenum(LB.1)
## [1] -2.80 -1.25 0.00 1.85 7.40
boxplot(list(EB.1, LB.1), ylim = c(-5, 10), las = 1, ylab = "Area", main = "Areas of Bronze Age Sites", sub = "Level Removed", names = c("Early Bronze Age", "Late Bronze Age"))
By standardizing the two batches we can compare them more closely. You essentially see the same outcome as before: smaller areas in the earlier contexts, larger areas in the later ones. With the medians removed, the batches sit at the same level and you can see some overlap in site sizes. If anything, it shows that some Late Bronze Age sites are smaller than the rest of the sites of that period, while the smaller Early Bronze Age sites are not much smaller than their contemporaries. Perhaps this lends further credence to the idea that there is far more variation during the Late Bronze Age: larger urban contexts do not simply come in and take over, and multiple settlement sizes persist side by side.
3. Now compare Early and Late Bronze Age site areas by drawing box and dot plots with the levels and spreads removed. How do the two batches compare in terms of symmetry?
EB.ls <- EB.1 / IQR(Early.Area)
fivenum(EB.ls)
## [1] -1.1555556 -0.6666667 0.0000000 0.3555556 2.0444444
LB.ls <- LB.1 / IQR(Late.Area)
fivenum(LB.ls)
## [1] -0.9032258 -0.4032258 0.0000000 0.5967742 2.3870968
boxplot(list(EB.ls, LB.ls), ylim = c(-5, 10), las = 1, ylab = "Area", main = "Areas of Bronze Age Sites", sub = "Level and Spread Removed", names = c("Early Bronze Age", "Late Bronze Age"))
This better represents the difference between the two datasets, again showing that sites are much smaller during the Early Bronze Age. In terms of symmetry, the Early Bronze Age batch has more small sites than average or large ones, while the Late Bronze Age batch is essentially flipped.
4. The largest Early Bronze Age site is 4.2 ha; the largest Late Bronze Age site is 12.8 ha. Which of these sites is more unusual in terms of its batch? Why? Use the median and the midspread of each batch to provide a score for the unusualness of each of these sites. Use the mean and standard deviation of each batch to do the same thing. Do these scores confirm your assessment of which site is more unusual in its batch?
If I had to make an uneducated guess, I would say the largest Late Bronze Age site is more "unusual" in terms of its batch. But, as I have been saying, this is probably because that period saw a great deal of cultural change, variation, and technological/demographic shifts.
Median and midspread
scale(Early.Area, center = median(Early.Area), scale = IQR(Early.Area))
## [,1]
## [1,] -0.08888889
## [2,] -0.80000000
## [3,] 0.00000000
## [4,] -1.15555556
## [5,] 0.35555556
## [6,] -0.62222222
## [7,] -0.97777778
## [8,] 2.04444444
## [9,] -0.35555556
## [10,] 0.62222222
## [11,] 0.17777778
## [12,] -0.17777778
## [13,] 0.35555556
## [14,] 0.44444444
## [15,] -1.15555556
## [16,] 0.88888889
## [17,] 0.08888889
## [18,] 0.26666667
## [19,] 0.00000000
## [20,] -0.71111111
## [21,] 0.62222222
## [22,] 0.26666667
## [23,] -0.17777778
## [24,] -0.71111111
## attr(,"scaled:center")
## [1] 1.9
## attr(,"scaled:scale")
## [1] 1.125
scale(Late.Area, center = median(Late.Area), scale = IQR(Late.Area))
## [,1]
## [1,] 1.61290323
## [2,] 0.16129032
## [3,] 2.38709677
## [4,] -0.25806452
## [5,] 0.77419355
## [6,] -0.41935484
## [7,] -0.90322581
## [8,] 0.96774194
## [9,] -0.06451613
## [10,] -0.29032258
## [11,] -0.41935484
## [12,] -0.45161290
## [13,] 1.87096774
## [14,] 0.41935484
## [15,] 0.12903226
## [16,] -0.48387097
## [17,] 1.22580645
## [18,] 0.06451613
## [19,] 0.00000000
## [20,] -0.19354839
## [21,] -0.38709677
## [22,] -0.77419355
## [23,] 0.22580645
## [24,] -0.09677419
## [25,] 0.29032258
## [26,] 2.22580645
## [27,] -0.48387097
## attr(,"scaled:center")
## [1] 5.4
## attr(,"scaled:scale")
## [1] 3.1
Mean and standard deviation.
scale(Early.Area)
## [,1]
## [1,] -0.07624158
## [2,] -1.05213376
## [3,] 0.04574495
## [4,] -1.54007985
## [5,] 0.53369104
## [6,] -0.80816071
## [7,] -1.29610680
## [8,] 2.85143496
## [9,] -0.44220114
## [10,] 0.89965060
## [11,] 0.28971799
## [12,] -0.19822810
## [13,] 0.53369104
## [14,] 0.65567756
## [15,] -1.54007985
## [16,] 1.26561017
## [17,] 0.16773147
## [18,] 0.41170451
## [19,] 0.04574495
## [20,] -0.93014723
## [21,] 0.89965060
## [22,] 0.41170451
## [23,] -0.19822810
## [24,] -0.93014723
## attr(,"scaled:center")
## [1] 1.8625
## attr(,"scaled:scale")
## [1] 0.8197627
scale(Late.Area)
## [,1]
## [1,] 1.49876058
## [2,] -0.11416600
## [3,] 2.35898808
## [4,] -0.58012256
## [5,] 0.56684745
## [6,] -0.75933663
## [7,] -1.29697882
## [8,] 0.78190432
## [9,] -0.36506569
## [10,] -0.61596537
## [11,] -0.75933663
## [12,] -0.79517944
## [13,] 1.78550308
## [14,] 0.17257651
## [15,] -0.15000881
## [16,] -0.83102225
## [17,] 1.06864682
## [18,] -0.22169443
## [19,] -0.29338006
## [20,] -0.50843694
## [21,] -0.72349381
## [22,] -1.15360757
## [23,] -0.04248037
## [24,] -0.40090850
## [25,] 0.02920525
## [26,] 2.17977402
## [27,] -0.83102225
## attr(,"scaled:center")
## [1] 6.218519
## attr(,"scaled:scale")
## [1] 2.78996
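Pulling out just the two maxima gives the scores directly (a sketch; the values match rows 8 and 3 of the scale() outputs above):
# robust scores: (max - median) / midspread
(max(Early.Area) - median(Early.Area)) / IQR(Early.Area) # 2.04
(max(Late.Area) - median(Late.Area)) / IQR(Late.Area) # 2.39
# classical scores: (max - mean) / standard deviation
(max(Early.Area) - mean(Early.Area)) / sd(Early.Area) # 2.85
(max(Late.Area) - mean(Late.Area)) / sd(Late.Area) # 2.36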
By the median/midspread scores, yes: the Late Bronze Age maximum scores 2.39 against 2.04 for the Early Bronze Age maximum, confirming my hypothesis. Note, though, that the mean/standard-deviation scores flip the order (2.85 for the Early site vs. 2.36 for the Late one), because the Early batch's single large site pulls hard against its small standard deviation.