Assignment 3

Author

Zachary Howe

We will be using two files compiled from the American Community Survey five-year estimates for 2022 (which are built from a sample of surveys sent out over the five-year period from 2018 to 2022). Both files contain data from towns in Cumberland County, Maine. The first file loaded is cumberland_places.csv and contains the following columns:

town - town name
county - Cumberland for all records
state - ME for all records
pop - an estimate of the population
top5 - median income for the top 5% of earners in the town
median_income - median income for all adults in the town
median_age - median age for the town

The second file, cumberland_edu.csv, contains:

town, county, state - same values as the first file
n_pop_over_25 - the number of people over 25 in the town
n_masters_above - the number of people over 25 that have a Master’s degree or higher in the town.

We’ll start by loading the tidyverse family of packages, ggrepel for our graph, and gt (optional) for making pretty tables, and then read the data into two tibbles (cumberland_places and cumberland_edu). We’ll be using the message: false option to suppress the output messages from loading tidyverse and gt.

```{r}
#| message: false
library(tidyverse)
library(ggrepel)
library(gt)
cumberland_places <- read_csv("https://jsuleiman.com/datasets/cumberland_places.csv")
cumberland_edu <- read_csv("https://jsuleiman.com/datasets/cumberland_edu.csv")
```

1 Exercises

There are five exercises in this assignment. The Grading Rubric is available at the end of this document. You will need to create your own code chunks for this assignment. Remember, you can do this with Insert -> Executable Cell -> R while in Visual mode.

1.1 Exercise 1

Create a tibble called maine_data that joins the two tibbles. Think about what matching columns you will be using in your join by clause. Review the joined data and state explicitly in the narrative whether you see any NA values and why you think they might exist.

```{r}
#| message: false

# Load required libraries
library(tidyverse)
library(ggrepel)
library(gt)

# Read the datasets
cumberland_places <- read_csv("https://jsuleiman.com/datasets/cumberland_places.csv")
cumberland_edu <- read_csv("https://jsuleiman.com/datasets/cumberland_edu.csv")

# Identify common column(s)
common_cols <- intersect(names(cumberland_places), names(cumberland_edu))
print(common_cols)  # Print common columns to confirm the join key

# Join the datasets
maine_data <- left_join(cumberland_places, cumberland_edu, by = common_cols)

# Print the resulting tibble
print(maine_data)
```

[1] "town"   "county" "state" 
# A tibble: 28 × 9
   town         county state   pop   top5 median_income median_age n_pop_over_25
   <chr>        <chr>  <chr> <dbl>  <dbl>         <dbl>      <dbl>         <dbl>
 1 Baldwin      Cumbe… ME     1180 212856         68625       50.3           927
 2 Bridgton     Cumbe… ME     5471 315914         78546       47.6          4170
 3 Brunswick    Cumbe… ME    21691 418666         71236       41.3         14910
 4 Cape Elizab… Cumbe… ME     9519 833724        144250       48.6          6809
 5 Casco        Cumbe… ME     3657 392392         60708       51.1          3006
 6 Chebeague I… Cumbe… ME      597 668189         58571       55.6           430
 7 Cumberland   Cumbe… ME     8443 998035        144167       40.6          5782
 8 Falmouth     Cumbe… ME    12504 919545        144118       47.7          8800
 9 Freeport     Cumbe… ME     8700 535344         95398       47.5          6516
10 Frye Island  Cumbe… ME       20     NA        101250       68.7            20
# ℹ 18 more rows
# ℹ 1 more variable: n_masters_above <dbl>

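The joined data contains NA values. In the rows displayed, Frye Island has NA for top5; with an estimated population of only 20, the survey sample was probably too small to produce a reliable estimate for the top 5% of earners. NA values could also appear in the education columns if a town in cumberland_places has no matching record in cumberland_edu, for example because the town name is recorded differently in the two files.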
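
One way to check where NA values come from is to count them per column and to list any rows of the first table that found no match in the second. The sketch below uses two small made-up tibbles (places_demo and edu_demo are hypothetical names with invented values, not the assignment data); the same calls should work on cumberland_places and cumberland_edu.

```r
library(dplyr)

# Hypothetical stand-ins for the two assignment tibbles (values are made up)
places_demo <- tibble(
  town   = c("A", "B", "C"),
  county = "Cumberland",
  state  = "ME",
  top5   = c(212856, NA, 315914)   # NA already present before the join
)
edu_demo <- tibble(
  town   = c("A", "B"),            # town "C" is deliberately missing
  county = "Cumberland",
  state  = "ME",
  n_pop_over_25 = c(927, 20)
)

joined_demo <- left_join(places_demo, edu_demo, by = c("town", "county", "state"))

# Count NA values in every column of the joined table
na_counts <- joined_demo |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Rows of the first table that found no match in the second
unmatched <- anti_join(places_demo, edu_demo, by = c("town", "county", "state"))
```

If the anti_join() result is empty, any NA values were already in the source files and were not introduced by the join.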

1.2 Exercise 2

Since the dataset only has 28 towns, you don’t need to show code to answer the questions in Exercise 2; you can simply look at the table and add the answers to your narrative. Make sure you specify the town name and the actual value that answers the question.

What town has the most people?

Portland has the most people, with a population of 66,215.

What town has the highest median age?

Of the towns displayed in the output above, Frye Island has the highest median age, at 68.7.

What town has the highest median income?

Of the towns displayed in the output above, Cape Elizabeth has the highest median income, at $144,250.
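
Although code is not required here, answers like these can be double-checked with slice_max(). The sketch below is only an illustration: towns_demo is a hypothetical name, and it holds four rows copied from the printed output above rather than all 28 towns, so the results only demonstrate the call.

```r
library(dplyr)

# Four rows copied from the printed output above (not the full 28-town table)
towns_demo <- tibble(
  town          = c("Baldwin", "Cape Elizabeth", "Falmouth", "Frye Island"),
  pop           = c(1180, 9519, 12504, 20),
  median_income = c(68625, 144250, 144118, 101250),
  median_age    = c(50.3, 48.6, 47.7, 68.7)
)

# Row with the largest value in each column of interest
oldest  <- towns_demo |> slice_max(median_age, n = 1)
richest <- towns_demo |> slice_max(median_income, n = 1)
largest <- towns_demo |> slice_max(pop, n = 1)
```

Running the same three calls on maine_data would answer the questions for the full table.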

1.3 Exercise 3

Add a column to maine_data called pct_grad_degree that shows the percentage of graduate degrees for the town, which is defined as n_masters_above / n_pop_over_25.

# Add a column to `maine_data` called `pct_grad_degree`: the share of people over 25 with a graduate degree
maine_data <- maine_data |> 
  mutate(pct_grad_degree = n_masters_above / n_pop_over_25)
# Print results
print(maine_data)

# A tibble: 28 × 10
   town         county state   pop   top5 median_income median_age n_pop_over_25
   <chr>        <chr>  <chr> <dbl>  <dbl>         <dbl>      <dbl>         <dbl>
 1 Baldwin      Cumbe… ME     1180 212856         68625       50.3           927
 2 Bridgton     Cumbe… ME     5471 315914         78546       47.6          4170
 3 Brunswick    Cumbe… ME    21691 418666         71236       41.3         14910
 4 Cape Elizab… Cumbe… ME     9519 833724        144250       48.6          6809
 5 Casco        Cumbe… ME     3657 392392         60708       51.1          3006
 6 Chebeague I… Cumbe… ME      597 668189         58571       55.6           430
 7 Cumberland   Cumbe… ME     8443 998035        144167       40.6          5782
 8 Falmouth     Cumbe… ME    12504 919545        144118       47.7          8800
 9 Freeport     Cumbe… ME     8700 535344         95398       47.5          6516
10 Frye Island  Cumbe… ME       20     NA        101250       68.7            20
# ℹ 18 more rows
# ℹ 2 more variables: n_masters_above <dbl>, pct_grad_degree <dbl>

1.4 Exercise 4

What town has the lowest percentage of graduate degrees for people over 25?

The town with the lowest percentage of graduate degrees for people over 25 is Baldwin with a percentage of 0.0%.
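
The lookup can also be done in code with slice_min(). In the sketch below, grad_demo is a hypothetical name; the n_pop_over_25 values are copied from the printed output, but n_masters_above is not visible there, so those counts are invented purely for illustration.

```r
library(dplyr)

# n_pop_over_25 is copied from the printed output; n_masters_above is NOT
# visible there, so the counts below are made up for illustration
grad_demo <- tibble(
  town            = c("Baldwin", "Bridgton", "Brunswick"),
  n_pop_over_25   = c(927, 4170, 14910),
  n_masters_above = c(50, 400, 3000)
)

# Compute the share, then keep the row with the smallest value
lowest <- grad_demo |>
  mutate(pct_grad_degree = n_masters_above / n_pop_over_25) |>
  slice_min(pct_grad_degree, n = 1)
```

Note that pct_grad_degree as defined in Exercise 3 is a proportion between 0 and 1; multiply by 100 to report it as a percentage.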

1.5 Exercise 5

Replicate this graph. Note: use geom_label_repel() just like you would use geom_label(). Discuss any patterns you see in the narrative.

# Replicate the graph
maine_data |> 
  ggplot(aes(x = median_income, y = median_age)) +
  geom_point() +
  geom_label_repel(aes(label = town), color = "blue")

#Print results
print(maine_data)

# A tibble: 28 × 10
   town         county state   pop   top5 median_income median_age n_pop_over_25
   <chr>        <chr>  <chr> <dbl>  <dbl>         <dbl>      <dbl>         <dbl>
 1 Baldwin      Cumbe… ME     1180 212856         68625       50.3           927
 2 Bridgton     Cumbe… ME     5471 315914         78546       47.6          4170
 3 Brunswick    Cumbe… ME    21691 418666         71236       41.3         14910
 4 Cape Elizab… Cumbe… ME     9519 833724        144250       48.6          6809
 5 Casco        Cumbe… ME     3657 392392         60708       51.1          3006
 6 Chebeague I… Cumbe… ME      597 668189         58571       55.6           430
 7 Cumberland   Cumbe… ME     8443 998035        144167       40.6          5782
 8 Falmouth     Cumbe… ME    12504 919545        144118       47.7          8800
 9 Freeport     Cumbe… ME     8700 535344         95398       47.5          6516
10 Frye Island  Cumbe… ME       20     NA        101250       68.7            20
# ℹ 18 more rows
# ℹ 2 more variables: n_masters_above <dbl>, pct_grad_degree <dbl>

One pattern I see is that towns with higher median incomes tend to have higher median ages, possibly because older people tend to earn more than younger people.
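
A visual impression like this can be checked numerically with a correlation coefficient. The sketch below uses only the ten rows printed above, since the remaining 18 towns are not shown, so its result should not be read as the answer for the whole county; on the real data the same cor() call would be applied to the maine_data columns.

```r
# The ten rows printed above (the remaining 18 towns are not shown)
median_income <- c(68625, 78546, 71236, 144250, 60708,
                   58571, 144167, 144118, 95398, 101250)
median_age    <- c(50.3, 47.6, 41.3, 48.6, 51.1,
                   55.6, 40.6, 47.7, 47.5, 68.7)

# Pearson correlation; "complete.obs" drops any pair containing NA
r_all <- cor(median_income, median_age, use = "complete.obs")

# Same calculation without the last row (Frye Island, population 20),
# to see how much one very small town moves the result
r_without_frye <- cor(median_income[-10], median_age[-10], use = "complete.obs")
```

A value near +1 would support the pattern, a value near 0 would suggest no linear relationship, and a negative value would contradict it.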

2 Submission

To submit your assignment:

Change the author name to your name in the YAML portion at the top of this document
Render your document to HTML and publish it to RPubs.
Submit the link to your RPubs document in the Brightspace comments section for this assignment.
Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.

3 Grading Rubric

| Item (percent overall) | 67% - minor issues | 33% - moderate issues | 0% - major issues or not attempted |
|---|---|---|---|
| Narrative: typos and grammatical errors (8%) | | | |
| Document formatting: correctly implemented instructions (8%) | | | |
| Exercises (15% each) | | | |
| Submitted properly to Brightspace (9%) | NA | NA | You must submit according to instructions to receive any credit for this portion. |
