Stat451 Final

STAT451 Final Exam

Assignment 1

(please refer to test for questions)

  1. Based on several resources, the first palette on the exam is a Sequential palette. Within each option, every column uses a single color at different intensities. This type is used with quantitative data to show a progression from low to high. An example is a heat map of crime levels in a suburban area: low-crime areas sit at one end of the color range and high-crime areas at the opposite end.
  2. The second is a Diverging palette. The two ends of each column are different colors, and they meet at a neutral white in the middle. This type expresses quantitative values that run in two opposite directions and/or emphasizes the point where they meet. It could be used to show how ages compare to a specific university's average student age, or to show the undecided states in a political race. A more practical example is data with both positive and negative values, such as the price changes on a stock ticker.
  3. Finally, the last is a Qualitative palette. The colors in each column are clearly different from one another. This type expresses variety and helps outliers stand out, which makes it easier to visualize categorical variables. It could be used for data covering many different animal, insect, or plant species, or to highlight a small number of cases in a large population, such as a rare blood type or eye color. A specific use is a pie chart of a company's annual budget, where each slice is one department's share and each department gets its own color. Distinct colors also make it easier to match each slice to its department in the legend.
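
The three palette types above can be previewed side by side. This is a minimal sketch using only base R, assuming R 3.6.0 or later for `hcl.colors()`. The palette names ("Blues", "Blue-Red", "Dark 3") are illustrative picks and are not necessarily the palettes shown on the exam.

```r
# Base R only (hcl.colors() needs R >= 3.6.0).
# Palette names are illustrative assumptions, not the exam's palettes.
n <- 7
sequential  <- hcl.colors(n, palette = "Blues")     # one hue, varying lightness
diverging   <- hcl.colors(n, palette = "Blue-Red")  # two hues meeting at a light neutral
qualitative <- hcl.colors(n, palette = "Dark 3")    # distinct hues, similar intensity

# Draw one row of swatches per palette type.
palettes <- list(Sequential = sequential,
                 Diverging = diverging,
                 Qualitative = qualitative)
op <- par(mfrow = c(3, 1), mar = c(1, 1, 2, 1))
for (name in names(palettes)) {
  barplot(rep(1, n), col = palettes[[name]], border = NA,
          axes = FALSE, main = name)
}
par(op)  # restore the previous plotting layout
```

In the diverging row the middle swatch is the lightest, which is the neutral meeting point described in item 2.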

References and Resources:

R Color Cheatsheet https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf
When to Use Sequential and Diverging Palettes https://everydayanalytics.ca/2017/03/when-to-use-sequential-and-diverging-palettes.html
Understanding Sequential and Diverging Palettes in Tableau https://interworks.com/blog/rrouse/2014/12/15/understanding-sequential-and-diverging-color-palettes-tableau/
How to Pick the Perfect Color Combination for Your Data Visualization https://blog.hubspot.com/marketing/color-combination-data-visualization
Statistical Language - What are Variables http://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language+-+what+are+variables
Wiki Categorical Variable https://en.wikipedia.org/wiki/Categorical_variable

Assignment 2

Data from https://earthquake.usgs.gov/earthquakes/search/: global records for 30 days with a minimum magnitude of 1.0. Create a CSV query on the site or load the CSV file from the email, then set the working directory to the folder containing it.

# Read the USGS export and keep only the columns needed for mapping
data <- read.csv("query.csv")
keeps <- c("time", "latitude", "longitude", "mag")
data1 <- data[keeps]


library(maps)
library(ggplot2)

world_map <- map_data("world")

p1 <- ggplot() + coord_fixed() +
  xlab("") + ylab("") + labs(title = "Global Earthquakes Over 30 Days")



base_world1 <- p1 + geom_polygon(data=world_map, aes(x=long, y=lat, group=group), 
                               colour="light green", fill="light green")


etqk_data <- 
  base_world1 +
  geom_point(data=data1, 
             aes(x=longitude, y=latitude), colour="Deep Pink", 
             fill="Pink",pch=21, alpha=I(0.7))
etqk_data

p2 <- ggplot() + coord_fixed() +
  xlab("") + ylab("") + labs(title = "Global Earthquakes and Magnitude (approx.) Over 30 Days")

base_world2 <- p2 + geom_polygon(data=world_map, aes(x=long, y=lat, group=group), 
                                 colour="light green", fill="light green")

# Black background with no grid lines, ticks, or axis labels
cleanup <- 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = 'black', colour = 'black'),
        axis.line = element_line(colour = "black"),
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        legend.key = element_rect(fill = 'black', colour = 'black'))

base_world_clean <- base_world2 + cleanup

etqk_data_enhanced <- 
  base_world_clean +
  geom_point(data=data1, 
             aes(x=longitude, y=latitude, size=mag),colour="Deep Pink", 
             fill="Pink",pch=21, alpha=I(0.2))

etqk_data_enhanced

In the end, the last graphic was a nice way to see how often earthquakes hit the same areas and at what magnitudes. The coasts bordering the Pacific and Indian Oceans are clearly being hit by large earthquakes, whereas the North American Pacific coast seems to be feeling smaller earthquakes at a much greater frequency.
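
The frequency-versus-magnitude reading above could also be checked with a simple count table. The following is a minimal sketch in base R. The data frame `quakes_demo` is a hypothetical stand-in for `data1` with made-up values (not real USGS records), and the magnitude bins and the hemisphere split by longitude sign are illustrative assumptions.

```r
# Hypothetical stand-in for data1: made-up values, NOT real USGS records.
quakes_demo <- data.frame(
  latitude  = c(37.5, 36.9, 38.2, 35.7, -6.1, -7.3, 61.2, 60.8),
  longitude = c(-121.9, -121.4, -122.6, -117.5, 130.4, 128.9, -150.1, -151.3),
  mag       = c(1.2, 1.8, 2.1, 1.4, 5.6, 6.1, 2.4, 1.1)
)

# Bin magnitudes; right = FALSE gives the intervals [1,3), [3,5), [5,10).
quakes_demo$mag_class <- cut(quakes_demo$mag,
                             breaks = c(1, 3, 5, 10),
                             labels = c("minor", "moderate", "strong"),
                             right = FALSE)

# Crude region flag (an assumption for illustration): split by longitude sign.
quakes_demo$hemisphere <- ifelse(quakes_demo$longitude < 0, "west", "east")

# Count earthquakes per region and magnitude class.
counts <- table(quakes_demo$hemisphere, quakes_demo$mag_class)
print(counts)
```

With the real `data1`, the same `cut()` and `table()` calls would show whether one region has many small events while another has fewer but stronger ones.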

References and Resources:

Plotting Data Points on Maps with R https://sarahleejane.github.io/learning/r/2014/09/21/plotting-data-points-on-maps-with-r.html
Plotting Beautiful Clear Maps with R http://sarahleejane.github.io/learning/r/2014/09/20/plotting-beautiful-clear-maps-with-r.html
Modify Components of a Theme https://ggplot2.tidyverse.org/reference/theme.html
R Plot PCH Symbols Chart http://www.endmemo.com/program/R/pchsymbols.php

Assignment 3

When choosing my data, it was hard to find a dataset I was interested in. I asked my girlfriend what data she would be interested in, and she mentioned “Titanic”, as in the movie. When I asked why, she replied that her students had been obsessing over the movie. She is a 3rd and 4th grade special education teacher with some awesome kids in her class, so I figured that was as good a reason as any. I looked up data on the Titanic and found a .CSV dataset at:

https://vincentarelbundock.github.io/Rdatasets/datasets.html

The dataset is titled TitanicSurvival. It is a bit morbid, but I was interested at that point.

a) The first question I asked was: who survived more, males or females? Before making any visualization I assumed more males died, purely because of the movie, in which all the lifeboats were calling for “women and children”! I decided that was a good thing to investigate.
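# Set the working directory to the folder containing TitanicSurvival.csv first
data2 <- read.csv("TitanicSurvival.csv")

# Rename columns to shorter labels for the plot legends
colnames(data2)[colnames(data2) == "sex"] <- "gender"
colnames(data2)[colnames(data2) == "passengerClass"] <- "class"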


ggplot(data2, aes(x = survived, fill = gender)) + geom_bar()

b) After the first plot it’s pretty obvious that more males than females died. However, no other details were easy to infer, so I asked a second question: which classes died the most? From watching the movie, I assumed third class would be the worst off, as it was closest to the iceberg damage.
ggplot(data2, aes(x = survived, fill = class)) + geom_bar()
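
Because the classes carried different numbers of passengers, raw counts can be hard to compare, so a per-class share may help. This is a minimal sketch in base R. The data frame `toy` is hypothetical, with made-up rows that only borrow the column names used above (not the real TitanicSurvival records).

```r
# Hypothetical toy rows: made-up values, NOT the real TitanicSurvival records.
toy <- data.frame(
  survived = c("yes", "no", "yes", "no", "no", "yes", "no", "no"),
  class    = c("1st", "1st", "1st", "2nd", "2nd", "3rd", "3rd", "3rd")
)

# Cross-tabulate class by outcome, then convert each row to proportions.
counts <- table(toy$class, toy$survived)
shares <- prop.table(counts, margin = 1)  # margin = 1: each class row sums to 1
print(round(shares, 2))
```

On the real data, swapping `toy` for `data2` would give the share of each class that died. In ggplot2, `geom_bar(position = "fill")` draws a comparable proportional view.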

c) Again my movie knowledge and its historical accuracy hold, so I asked a last question: what was the average age of the people who died in each class?

ggplot(data2, aes(x=age, y=class, shape=gender, color=survived)) +
  geom_point(aes(size=survived))
## Warning: Using size for a discrete variable is not advised.
## Warning: Removed 263 rows containing missing values (geom_point).

d) The above plot showed me the range of ages, and it showed that more males died. However, it was hard to judge the average age of males versus females. The last plot is a boxplot, and I think it best represents the typical age of the people who died in each class (strictly, the line inside each box marks the median rather than the mean).
p <- qplot(class, age, data = data2, geom = "boxplot",
           shape = gender, color = survived, fill = gender,
           main = "Survival Average by Gender and Age",
           xlab = "Class", ylab = "Age")

p2 <- p + theme_classic()

p2
## Warning: Removed 263 rows containing non-finite values (stat_boxplot).

Honestly, I think the final plot takes a second to grasp, as it holds all the data from the dataset. I struggled with the overall look, but I think it tells the story well, and it points out details I would not have assumed. For instance, the average age of males who died in first class was around 45 to 50, whereas for males in third class it was about 25 to 30. More intriguing still, the males who died were on average a bit older than the women who survived. That is interesting, seeing as Jack Dawson was 3 years older than Rose in the movie. I don’t know if James Cameron meant to get that right, but it was pretty cool to find that out through the data visualization.
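
Ages read off a boxplot are approximate, so a numeric summary can back them up. This is a minimal sketch using base R's `aggregate()`. The data frame `toy_ages` is hypothetical, with made-up values (not the real TitanicSurvival records) that only reuse the column names from above.

```r
# Hypothetical toy rows: made-up values, NOT the real TitanicSurvival records.
toy_ages <- data.frame(
  survived = c("no", "no", "no", "yes", "no", "no", "yes", "yes"),
  gender   = c("male", "male", "male", "female", "male", "male", "female", "female"),
  class    = c("1st", "1st", "1st", "1st", "3rd", "3rd", "3rd", "3rd"),
  age      = c(40, 60, NA, 36, 20, 34, 22, NA)
)

# The formula interface drops rows with a missing age (na.action = na.omit),
# in the same way the plots above removed rows with missing values.
median_age <- aggregate(age ~ class + gender + survived, data = toy_ages, FUN = median)
mean_age   <- aggregate(age ~ class + gender + survived, data = toy_ages, FUN = mean)

print(median_age)
print(mean_age)
```

Running the same two `aggregate()` calls on `data2` would give one median and one mean age per class, gender, and outcome, which can then be compared against the boxes in the plot.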

References and Resources:

Datasets https://vincentarelbundock.github.io/Rdatasets/datasets.html
Chapter 2 R ggplot2 Examples http://www.stat.wisc.edu/~larget/stat302/chap2.pdf
Ggplot2 Scatter Plots http://www.sthda.com/english/wiki/ggplot2-scatter-plots-quick-start-guide-r-software-and-data-visualization
Advanced Data Visualization with Ggplot2 https://4va.github.io/biodatasci/r-viz-gapminder.html
Understanding Interpreting Boxplots https://www.wellbeingatschool.org.nz/information-sheet/understanding-and-interpreting-box-plots
Plot multiple boxplots https://stackoverflow.com/questions/14604439/plot-multiple-boxplot-in-one-graph
Rose Bukater https://jamescameronstitanic.fandom.com/wiki/Rose_DeWitt_Bukater
Jack Dawson https://jamescameronstitanic.fandom.com/wiki/Jack_Dawson