lab 8

Quarto

8.1 Refinement

options(scipen=999)

library(socviz)
Warning: package 'socviz' was built under R version 4.3.2
load("C:/Users/Kate/Downloads/Murders96_viz (2).RData")
library(caret)
Warning: package 'caret' was built under R version 4.3.2
Loading required package: ggplot2
Warning: package 'ggplot2' was built under R version 4.3.2
Loading required package: lattice
Warning: package 'lattice' was built under R version 4.3.2
set.seed(1)
index <- createDataPartition(Murders96_viz$arrests, p=0.6, list=FALSE)
train <- Murders96_viz[index,]
test  <- Murders96_viz[-index,]
library(ggplot2)
library(DataExplorer)
plot(test$arrests, test$execrate, col = test$arrests, 
     xlab = "Arrests", ylab = "Execution Rate")

I added color to this plot so it would be easier to differentiate between the individual points. This plot shows that the relationship between executions and arrests is nonlinear. As the number of arrests increases, the number of executions doesn’t necessarily increase with it.

8.1 Refinement 2

load("C:/Users/Kate/Downloads/FIFA_viz.RData")
FIFA <- FIFA_viz
p <- ggplot(data = FIFA,
            mapping = aes(x = Weight, y = Age, color = Preferred.Foot))
p + geom_point(size = 2) + scale_color_brewer(palette = "Set2") +
    theme(legend.position = "right")

I added the preferred foot variable to this plot and created a legend with colors. I thought this was helpful because these colors pop against the dark background so it is easy to see which foot players prefer. Most players prefer their right foot and are younger than 40. 

8.2 Refinement

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.3
Warning: package 'stringr' was built under R version 4.3.3
Warning: package 'lubridate' was built under R version 4.3.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ purrr::lift()   masks caret::lift()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data_long <- train %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(data_long, aes(x = Variable, y = Value)) +
  geom_boxplot(fill = 'green') +
  coord_flip() + 
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(title = "Horizontal Boxplot for Each Numeric Variable")

I changed the color of these boxplots to green so they would stand out more against the black of individual observations. These boxplots give visual representation of the variables in a murder dataset.

8.2 Refinement 2

library(ggplot2)
library(dplyr)
library(gapminder)
Warning: package 'gapminder' was built under R version 4.3.2
subset <- gapminder %>% filter (year == 2007)
mysubset <- subset

plot3 <- ggplot2::ggplot(data = mysubset, aes(x = lifeExp, y = gdpPercap, color = continent)) +
  geom_density_2d() +
  geom_point() +
  labs(title = "Density plot with points colored by continent", subtitle = "Life expectancy generally increases as GDP increases.", caption = "Source- Gapminder Dataset")

I assigned the individual continents each different colors. I also changed the background to be dark. Having the darker background allows the individual points to pop and show up more. This plot shows us that as GDP increases, life expecancy generally increases too. There is a linear relationship between these variables.

8.3 Refinement

options(scipen=999) 
library(tidyverse)
library(socviz)
library(datasetsICR)
data(FIFA)
pip1 <- FIFA %>%         
  group_by(Nationality, Preferred.Foot) %>%
  summarize(N = n()) %>% 
  mutate(freq = N/sum(N),
         pct = round((freq*100),0)) %>%
  filter(Nationality %in% c("Spain", "Brazil", "France", "Italy", "Argentina"))
`summarise()` has grouped output by 'Nationality'. You can override using the
`.groups` argument.
pip1
# A tibble: 15 × 5
# Groups:   Nationality [5]
   Nationality Preferred.Foot     N     freq   pct
   <fct>       <fct>          <int>    <dbl> <dbl>
 1 Argentina   ""                 1 0.00107      0
 2 Argentina   "Left"           214 0.228       23
 3 Argentina   "Right"          722 0.771       77
 4 Brazil      ""                 2 0.00242      0
 5 Brazil      "Left"           245 0.296       30
 6 Brazil      "Right"          580 0.701       70
 7 France      ""                 3 0.00328      0
 8 France      "Left"           236 0.258       26
 9 France      "Right"          675 0.739       74
10 Italy       ""                 3 0.00427      0
11 Italy       "Left"           189 0.269       27
12 Italy       "Right"          510 0.726       73
13 Spain       ""                 1 0.000933     0
14 Spain       "Left"           298 0.278       28
15 Spain       "Right"          773 0.721       72
pip2 <- FIFA %>%         
  group_by(Nationality) %>%
  summarize(N = n(),
            age_mean = mean(Age, na.rm=TRUE),
            weight_mean = mean(Weight, na.rm=TRUE)) %>%
  mutate(freq = N/sum(N),
         pct = round((freq*100),0)) %>%
         filter(Nationality %in% c("Spain", "Brazil", "France", "Italy", "Argentina"))
pip2
# A tibble: 5 × 6
  Nationality     N age_mean weight_mean   freq   pct
  <fct>       <int>    <dbl>       <dbl>  <dbl> <dbl>
1 Argentina     937     26.2        75.1 0.0515     5
2 Brazil        827     27.6        76.0 0.0454     5
3 France        914     24.6        75.6 0.0502     5
4 Italy         702     25.9        76.0 0.0386     4
5 Spain        1072     25.3        74.0 0.0589     6
p <- ggplot(pip2, aes(x=weight_mean, y=age_mean, color=Nationality))
p2 <- p + geom_point(size=5) +
    annotate(geom = "text", x = 75, y=25, 
                     label = "France has the lowest average age.", hjust=0) +
    labs(y="Average Age", x="Average Weight", 
         title="Age and Weight by Nationality", 
         subtitle = "Spain has the lowest average weight.",
         caption = "FIFA dataset{datasetsICR}")
p2

theme_set(theme_dark())
p2 + theme(legend.position="left")

I changed the theme to dark because it helps the colors to pop and stand out against the dark background more. This plot shows the average weight and age of FIFA players by nationality. Spain has the lowest average weight, and France has the lowest average age.

8.3 Refinement 2

p2 + theme(legend.position = "top",
           plot.title = element_text(size=rel(2),
                                     lineheight=.5,
                                     family="TT Courier New",
                                     face="bold.italic",
                                     colour="orange"),
           axis.text.x = element_text(size=rel(1.1),
                                      family="TT Times New Roman",
                                      face="bold",
                                      color="green"))
Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
not found in Windows font database

Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family
not found in Windows font database
Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database

Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database

Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database

Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database

Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
family not found in Windows font database

I changed the fonts and the colors of the title text. I also changed the size and location of the text. This helped me emphasize the specific words and information that are most important. This plot shows the average weight and age of FIFA players by nationality. Spain has the lowest average weight, and France has the lowest average age. 

8.4 Refinement

p3 <- ggplot(data = subset(FIFA, Age %in% Age),
            mapping = aes(x = Weight))

p1 <- p3 + geom_density(fill = "gray20", color = FALSE,
                       alpha = 0.9, mapping = aes(y = ..scaled..)) +
    geom_vline(data = subset(FIFA, Age %in% Age),
               aes(xintercept = Weight), color = "white", size = 0.5) +
    geom_text(data = subset(FIFA, Age %in% Age),
              aes(x = Weight, y = Age, label = Weight), nudge_x = 7.5,
              color = "white", size = 3.5, hjust = 1) +
    geom_text(data = subset(FIFA, Age %in% Age),
              aes(x = Weight, y = Age, label = Age)) +
    facet_grid(Age ~ ., switch = "y")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
p1
Warning: The dot-dot notation (`..scaled..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(scaled)` instead.
Warning: Removed 48 rows containing non-finite values (`stat_density()`).
Warning: Groups with fewer than two data points have been dropped.
Groups with fewer than two data points have been dropped.
Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf

Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf
Warning: Removed 48 rows containing missing values (`geom_vline()`).
Warning: Removed 48 rows containing missing values (`geom_text()`).
Removed 48 rows containing missing values (`geom_text()`).

I created subsets by age in order to see the distribution of weight better. This plot shows that weight generally increase until middle age and then starts to decrease. 

8.4 Refinement 2

library(gapminder)
library(dplyr)
subset <- gapminder %>% filter (year == 2007)
library(ggplot2)
plot1 <- ggplot(subset, aes(x = lifeExp, y = gdpPercap, color = continent)) +
  geom_density_2d() +
  geom_point() +
  labs(title = "Density plot with points colored by continent", subtitle = "Life expectancy generally increases as GDP increases.", caption = "Source- Gapminder Dataset") +
facet_wrap(~continent) +
  labs(title = "Faceted by continent", subtitle = "Life expectancy generally increases as GDP increases.", caption = "Source- Gapminder Dataset")
plot1
Warning: `stat_contour()`: Zero contours were generated
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf

I subset each individual continent so I could see the distribution for each one. Through this plot, we see that similar positive correlation between life expectancy and GDP per capita and highlights the variation within and between continents. African countries generally have lower GDP per capita and life expectancy. European countries mostly fall into higher ranges for both metrics.