Exercise 1

library(mosaic)

ggplot(data=Galton, aes(x=height, y=father, color=sex, shape=sex)) +
  geom_point() +
  facet_wrap(~sex) +
  geom_smooth(method=lm, se=FALSE) +
  xlab("height") + 
  ylab("father")

Exercise 2

Misused Axis

In chart 4: Heights Infographic, the graph is trying to show the average heights of women in different countries, but because the y-axis doesn’t start at 0, it is misleading. There is only a 5-inch difference between the tallest and shortest, but the graph makes it look like women from Latvia are five times taller than women from India. To improve it, the creator should start the y-axis at 0.
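A minimal sketch of the suggested fix is below. The two height values are hypothetical placeholders chosen only to reproduce the 5-inch gap described above; they are not taken from the infographic. Because geom_col() draws every bar from 0, the bar lengths stay proportional to the values.

```r
library(ggplot2)

# Hypothetical average heights in inches (placeholders, not the chart's data)
heights <- data.frame(
  country = c("Latvia", "India"),
  height_in = c(65, 60)
)

# geom_col() anchors each bar at 0, so the visual ratio matches the real ratio
zero_based <- ggplot(data=heights, aes(x=country, y=height_in)) +
  geom_col() +
  xlab("Country") +
  ylab("Average height (inches)")

# Ratio of the two bar lengths when the axis starts at 0
ratio <- heights$height_in[1] / heights$height_in[2]
```

With these placeholder values the taller bar is only about 8% longer than the shorter one, which is what a 5-inch gap should look like.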

In chart 6: New COVID-19 Cases per day, the y-axis makes the graph misleading again. The dates on the x-axis are equally spaced and in order, but the y-axis does not start at 0 and is not equally spaced, which makes it difficult to see the trend. To improve it, the creator might want to start the y-axis at 0, and it should definitely be equally spaced.

In chart 2: Napoleon’s March on Moscow, the axes are not clearly defined or visible, which makes the large amount of information on the chart even harder to read. There are lines and axes for temperature and location, but they are hard to read and to connect to the map. To improve it, the creator should not layer so much information in one place. It might have been helpful to have a separate graph of Temp v. Time or Temp v. Distance.

Pie Charts

Pie charts shouldn’t be used because they’re hard to read. Our eyes are very good at differentiating between lengths in something like a bar graph, but they’re not very good at differentiating angles or areas. Pie charts are also not useful when the categories are not mutually exclusive, for example, if the chart shows the percentages of people who have pain in their backs, knees, and/or hips. Since someone could have both back and knee pain, the percentages might add up to over 100%.
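The overlap problem can be sketched with a few made-up numbers. The percentages below are hypothetical and only illustrate the arithmetic; a bar graph handles them without trouble because each bar is read against the axis on its own.

```r
library(ggplot2)

# Hypothetical survey results; each person could report more than one site
pain <- data.frame(
  site = c("Back", "Knee", "Hip"),
  percent = c(45, 40, 30)
)

# Overlapping categories can total more than 100%, which a pie cannot represent
total_percent <- sum(pain$percent)

# Each bar is compared by length, independent of the other categories
pain_bars <- ggplot(data=pain, aes(x=site, y=percent)) +
  geom_col() +
  xlab("Pain site") +
  ylab("Percent of respondents")
```

Here the placeholder percentages sum to 115, so they could not be drawn as slices of a single whole.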

Good Data Visualization

Chart 1 uses box plots to compare the weights of chickens on different feeds. Although units are missing, the axes are labeled and reasonably spaced. The data is also easy to understand and compare because box plots are a good choice for looking at medians and spread. The violin plots and colors also make the chart more interesting to look at.

Chart 10: Vaccination Rates By State is also well done. Although specific states and data points are hard to pick out, the overall trend is clear. The chart has a title and labeled, appropriately spaced axes.

Exercise 3

library(mdsr)

ggplot(data=MLB_teams, aes(x=WPct, y=payroll, color=lgID)) +
  geom_text(aes(label=teamID, size=attendance)) +
  facet_wrap(~lgID) +
  xlab("Win Percentage") + 
  ylab("Payroll")

The above graphic has a lot of interesting information on it, but it’s not legible. Most of the team IDs overlap, making them impossible to read. One way to make this graphic better would be to either reduce the amount of information or split it across several plots. The names could be replaced by simple points and a trend line could be added. Or the graphs could be separated by year, team, or a different baseball category.

library(mdsr)

# Keep only the 2008 and 2014 seasons (name matches the years filtered)
MLB_08_14 <- MLB_teams %>% filter(yearID == 2008 | yearID == 2014)

ggplot(data=MLB_08_14, aes(x=WPct, y=payroll, color=yearID)) +
  geom_point() +
  geom_smooth(method=lm, se=FALSE) +
  facet_wrap(~yearID) +
  xlab("Win Percentage") + 
  ylab("Payroll") +
  ggtitle("Payroll vs Win Percentage")

The above graph shows that in both 2008 and 2014, the earliest and latest years in the data set, Win Percentage and Payroll were positively correlated. If we were to run this for all of the years in between, we would expect to find a similar correlation.
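One way to check the years in between is to compute the correlation for every season rather than plotting each one. This is a minimal sketch, assuming dplyr is installed alongside mdsr; cor_by_year is a name introduced here for illustration.

```r
library(mdsr)
library(dplyr)

# One correlation between win percentage and payroll per season
cor_by_year <- MLB_teams %>%
  group_by(yearID) %>%
  summarize(r = cor(WPct, payroll, use = "complete.obs"))

# Print the table; positive values of r support the claim for that year
cor_by_year
```

If every value of r is positive, the pattern seen in 2008 and 2014 holds across the whole data set; any negative or near-zero value would mark a season where it does not.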

library(mdsr)
MLB_2014_AL <- MLB_teams %>% filter(yearID == 2014) %>% filter(lgID == "AL")

ggplot(data=MLB_2014_AL, aes(x=teamID, y=attendance)) +
  geom_col() +
  xlab("Team") + 
  ylab("Attendance") + 
  ggtitle("Attendance for AL Teams in 2014")

The above bar graph displays the attendance for each team in the AL in 2014. This is much easier to read than the first graph, where attendance was represented by font size.

Exercise 4

titanic <- read.csv("/Users/oliviast.marie/Desktop/DS 325/titanic.csv")
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Sex <- as.factor(titanic$Sex)

ggplot(data=titanic, aes(x=Pclass, fill = Pclass)) +
  geom_bar(color = "black")

ggplot(data=titanic, aes(x=Sex, fill = Sex)) +
  geom_bar(color = "black")

ggplot(data=titanic, aes(x=Age, fill = Sex)) +
  geom_density(color = "black", alpha = 0.3)

Exercise 5

library(mdsr)

ggplot(data=RailTrail, aes(x=hightemp, y=volume, alpha = lowtemp, size = cloudcover, color=dayType, shape=dayType)) +
  geom_point() +
  geom_smooth(method=lm, se=FALSE) +
  facet_wrap(~dayType) + 
  ggtitle("Original Plot")

RailTrail$spring <- as.factor(RailTrail$spring)

ggplot(data=RailTrail, aes(x=lowtemp, y=hightemp, color=volume, shape=spring, size = precip, alpha = cloudcover)) +
  geom_point() +
  geom_smooth(method=lm, se=FALSE) +
  facet_wrap(~summer) +
  ggtitle("Nonsense Plot")

Homework 1

Olivia St. Marie

09/01/2024