DS 2870: Homework 4 - data wrangling

Data Description

The data set has information about 1538 skeletons kept in different locations across the world.

There are 3 categorical variables:

sex: The sex of the skeleton (“Male” = known male, “Female” = known female, “uMale” = probably male, “uFemale” = probably female)
age: the age group of the skeleton (18-24, 25-29, 30-39, 40-49, 50+)
location: The country of the skeleton (if in the United States, it will be State, United States)

The remaining 8 variables are the lengths of 8 limb bones (in mm). The limb bones are

rhum and lhum = right and left humerus (upper arm)
rrad and lrad = right and left radius (lower arm)
rfem and lfem = right and left femur (upper leg)
rtib and ltib = right and left tibia (lower leg)

Note: Most skeletons have at least 1 missing measurement.

`str_detect()` function

For a couple of questions, you’ll be using the str_detect() function in the stringr package, which takes 2 or 3 arguments:

string = the character (or column of characters) that we want to check contains certain character of characters
pattern = the character(s) we want to check if it is in the value given to the string argument
negate = F if set to FALSE, it will return TRUE if the character contains the pattern. If set to TRUE, it will return FALSE if the character contains the pattern

For example, str_detect("abcdef", pattern = "a", negate = F) will return TRUE since “a” is in “abcdef”. If you change negate = T, it will return FALSE instead

Question 1: Skeletons by age and sex

Part 1a: Skeletons with known sex

Using the correct dplyr verb and str_detect(), only keep the rows where the sex of the skeleton is known (no “0u” or “1u”). Save the resulting data frame as bones1 and use count(bones1, sex) to display the results in the knitted document.

bones1 <- 
  bones |> 
  filter(str_detect(sex, "u", negate = T))

count(bones1, sex)

##      sex   n
## 1 Female 540
## 2   Male 981

Part 1b: Bar chart for sex by age (30 and older)

Create the bar chart seen in Brightspace using the bones1 data set and two additional dplyr verbs. Make sure to place the order of the age ranges in the proper order!

bones1 |> 
  # Only keeping the age groups for 30+
  filter(
    age %in% c("30-39", "40-49", "50+")
  #  age >= 30
  ) |> 
  # Reordering the age group to be in numeric order
  mutate(
  #  age = factor(age, levels = c("30-39", "40-49", "50+"))
  ) |> 
  # Creating the conditional bar chart
  ggplot(
    mapping = aes(
      x = age,
      fill = sex
    )
  ) +
  
  geom_bar(
    position = "fill"
  ) +

  # Adding a title, removing the labels for y and fill
  labs(
    x = "Age Range",
    y = NULL,
    fill = NULL,
    title = "Age and Sex of Skeletons in Museums"
  ) + 
  # Changing the tickmarks to percentages and removing the buffer space
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0),
    labels = scales::label_percent()
  ) + 
  # Changing the theme to theme_classic()
  theme_classic() +
  # Centering the title
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Question 2: Keeping Specific Columns

The code chunk below will change the format of the data from a wide format to a long format and save the data set as boness. To get the code chunk below to work, fill in the space to keep the ID, age, and the 8 bone columns only. We’ll be looking at what the pivot_longer() function does later

If done correctly, the data set should have 11,032 rows and 5 columns

bones2 <- 
  bones |> 
  ### Enter your code below:
  # Use the correct dplyr verb in the line below to keep the specified columns
  dplyr::select(ID, age, where(is.numeric)) |> 
  
  # Changing the format from wide to long
  pivot_longer(
    cols = where(is.numeric),
    names_to = "bone",
    values_to = "length",
    values_drop_na = T     # Removing the new row if the value is NA
  ) |> 
  # Separating the bone column into two columns: limb bone and the side
  mutate(
    side = str_trunc(bone, 1, ellipsis = ""),
    bone = str_trunc(bone, 3, side = "left", ellipsis = ""),
    bone = factor(bone, levels = c("hum", "rad", "fem", "tib"))
  )


tibble(bones2)

## # A tibble: 11,032 × 5
##    ID    age   bone  length side 
##    <chr> <chr> <fct>  <dbl> <chr>
##  1 1     40-49 hum     308. l    
##  2 1     40-49 rad     229  l    
##  3 1     40-49 fem     443  l    
##  4 1     40-49 fem     452  r    
##  5 1     40-49 tib     366. r    
##  6 2     50+   fem     386  l    
##  7 2     50+   fem     383  r    
##  8 2     50+   tib     283  l    
##  9 2     50+   tib     282. r    
## 10 3     40-49 hum     311  l    
## # ℹ 11,022 more rows

Question 3: Skeletons by Country

Part 3a: Adding the country column

Create a column in the bones data set called country that has just the country the skeleton is currently located in (aka, for any row with location in the United States, it has the state, United States). You should use the appropriate dplyr verb to add a new column along with the if_else() and str_detect() functions, and no others.

Name the resulting data set bones3a.

bones3a <-
  bones |>
  ## Your code here ###
  # Remove the state from the column
  mutate(
    country = if_else(str_detect(location, "United States"), 
                      "United States", 
                      location)
  )

tibble(bones3a)

## # A tibble: 1,531 × 13
##    ID    sex    age   location    lhum  rhum  lrad  rrad  lfem  rfem  ltib  rtib
##    <chr> <chr>  <chr> <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 1     Male   40-49 Alaska, U…  308.   NA   229    NA   443   452     NA  366.
##  2 2     Female 50+   Alaska, U…   NA    NA    NA    NA   386   383    283  282.
##  3 3     Male   40-49 Alaska, U…  311   310    NA   222.  415   416.   330  332.
##  4 4     Male   25-29 Alaska, U…  289   298   224.  227   398   400    312  320.
##  5 5     Female 50+   Alaska, U…  295   302   208   205   395   396    306  305 
##  6 6     Female 30-39 Alaska, U…  270.  281   196.   NA    NA   375    298  294.
##  7 7     Male   40-49 Alaska, U…  314.  306.  237   226   430   431    351  352 
##  8 8     Male   30-39 Alaska, U…  318   322   234   238   434.  437    361  364 
##  9 9     Male   30-39 Alaska, U…  304.  307   233   235   402.  407    338  338 
## 10 10    Female 50+   Alaska, U…  287   292.  210.  214   390   386    311  315 
## # ℹ 1,521 more rows
## # ℹ 1 more variable: country <chr>

Part 3b: Country proportion and top 10 countries only

Using the bones3a data set, create a data set named bones3b that has the following columns:

1) Country: Each country listed only once 2) count: How many times each country appears in the bones3a data set 3) prop: The proportion of skeletons in the bones3a data that are in each country

Then keep only the 10 countries with the highest proportion and arrange the rows from highest proportion to lowest proportoin. Make sure it appears in the knitted document

bones3b <- 
  bones3a |> 
  # Counting how frequently each country appears
  count(
    country,
    name = "count"
  )  |> 
  # Calculating the proportion with mutate()
  mutate(
    prop = count/sum(count)
  ) |> 
  
  # Keeping the top 10 countries
  slice_max(prop, n = 10) |> 
  # arranging them from highest to lowest prop
  arrange(-prop)

bones3b

##               country count       prop
## 1       United States   722 0.47158720
## 2             Germany   108 0.07054213
## 3               Japan    94 0.06139778
## 4      United Kingdom    85 0.05551927
## 5               Egypt    81 0.05290660
## 6             Austria    78 0.05094709
## 7               Sudan    53 0.03461790
## 8               Italy    51 0.03331156
## 9             Belgium    41 0.02677988
## 10 Philippine Islands    31 0.02024820

If done correctly, the code to create the graph below should run

# Creating the bar chart
ggplot(
  data = bones3b,
  mapping = aes(
    y = fct_reorder(country, prop), # Ordering largest to smallest prop
    x = prop
  )
) +
  
  geom_col(
    fill = "wheat1"
  ) + 

  # Removing the x and y-axis label  
  labs(
    y = NULL,
    x = NULL
  ) +
  
  # Having the bars touch the y-axis and display percent on the x-axis
  scale_x_continuous(
    expand = c(0, 0, 0.05, 0),
    labels = scales::label_percent()
  )

Question 4: Correlation between left arm and left leg

Calculate the correlation between the length of the left arm and left leg by using the appropriate dplyr verbs in the following order:

1) Add two columns to the bones data set that are: a) larm = length of the left arm b) lleg = length of the left leg

2) Remove a row if larm or lleg is missing

3) Calculate the number of remaining skeletons and correlation between larm and lleg rounded to 3 decimal places. Both should be in the same data frame that has 1 row

bones |> 
  # Adding the larm and lleg columns
  mutate(
    larm = lhum + lrad,
    lleg = lfem + ltib
  ) |> 
  
  # Remove any rows that have missing values
  filter(
    !is.na(larm),
    !is.na(lleg)
  ) |> 
  # Counting the number of skeletons with both left bones and the correlation
  summarize(
    skeletons = n(),
    left_limb_cors = round(cor(larm, lleg), digits = 3)
  )

##   skeletons left_limb_cors
## 1      1093          0.928

Question 5: Medians of Bone Lengths by Age Range

To create the graph seen in Brightspace in the second code chunk, create a data set named bones5 using the bones2 data set created in question 4. It will need 4 columns:

1) The age range (18-24, 25-29, 30-39, 40-49, 50+)

2) The bone (hum/rad/fem/tib)

3) The side of the body (l/r)

4) The median of the length column for each combination of age, bone, and side

Make sure the data frame appears in your knitted document by using tibble(bones5) at the bottom of the first code chunk. Then check that it is correct by running the second code chunk, which should create the graph

bones5 <- 
  bones2 |> 
  # Calculating the median of bone length for each age and bone combo
  summarize(
    .by = c(age, bone, side),
    length_med = median(length)
  )

tibble(bones5)

## # A tibble: 40 × 4
##    age   bone  side  length_med
##    <chr> <fct> <chr>      <dbl>
##  1 40-49 hum   l           302 
##  2 40-49 rad   l           232.
##  3 40-49 fem   l           425 
##  4 40-49 fem   r           424 
##  5 40-49 tib   r           350.
##  6 50+   fem   l           428 
##  7 50+   fem   r           426 
##  8 50+   tib   l           351 
##  9 50+   tib   r           352.
## 10 40-49 hum   r           306.
## # ℹ 30 more rows

# Creating the graph
ggplot(
  data = bones5,
  mapping = aes(
    x = age,
    y = length_med
  )
) + 
  
  # Adding the bars
  geom_col(
    fill = "steelblue"
  ) + 
  
  # Creating a separate graph for each bone
  facet_grid(
    rows = vars(side),
    cols = vars(bone),
    # Changing the names that appear in the facet name plates
    labeller = labeller(
      side = c("l" = "Left Side", "r" = "Right Side"),
      bone = c("fem" = "Femur", "rad" = "Radius", 
               "tib" = "Tibia", "hum" = "Humerus")
    )
  ) +
  
  # Changing the labels and adding a title
  labs(
    x = NULL,
    title = "Medians of Bone Lengths by Age",
    y = "Median length (mm)"
  ) + 
  
  # Centering the title and rotating the labels on the x-axis
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  ) + 
  
  # Removing the buffer space on the bottom of the graph
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0)
  )

DS 2870: Homework 4 - data wrangling

Your Name

2024-02-27

Data Description

`str_detect()` function

Question 1: Skeletons by age and sex

Part 1a: Skeletons with known sex

Part 1b: Bar chart for sex by age (30 and older)

Question 2: Keeping Specific Columns

Question 3: Skeletons by Country

Part 3a: Adding the country column

Part 3b: Country proportion and top 10 countries only

Question 4: Correlation between left arm and left leg

Question 5: Medians of Bone Lengths by Age Range

DS 2870: Homework 4 - data wrangling

Your Name

2024-02-27

Data Description

str_detect() function

Question 1: Skeletons by age and sex

Part 1a: Skeletons with known sex

Part 1b: Bar chart for sex by age (30 and older)

Question 2: Keeping Specific Columns

Question 3: Skeletons by Country

Part 3a: Adding the country column

Part 3b: Country proportion and top 10 countries only

Question 4: Correlation between left arm and left leg

Question 5: Medians of Bone Lengths by Age Range

`str_detect()` function