Assignment 3

Go to the shared posit.cloud workspace for this class and open the lab03_assign03 project. Open the assign03.qmd file and complete the exercises.

We will be using two files compiled from the American Community Survey (ACS) five-year estimates for 2022, which are built from surveys sent out over the five-year period from 2018 to 2022. Both files contain data from towns in Cumberland County, Maine. The first file loaded is cumberland_places.csv, which contains the following columns: town, county, state, pop, top5, median_income, and median_age.

The second file, cumberland_edu.csv, contains the columns town, county, state, n_pop_over_25, and n_masters_above.

We’ll start by loading the tidyverse family of packages, ggrepel for our graph, and gt (optional) for making pretty tables, and then read the data into two tibbles (cumberland_places and cumberland_edu). We’ll use the message: false option to suppress the output messages from loading tidyverse and gt.

```{r}
#| message: false
library(tidyverse)
library(ggrepel)
library(gt)
cumberland_places <- read_csv("https://jsuleiman.com/datasets/cumberland_places.csv")
cumberland_edu <- read_csv("https://jsuleiman.com/datasets/cumberland_edu.csv")
```

1 Exercises

There are five exercises in this assignment. The Grading Rubric is available at the end of this document. You will need to create your own code chunks for this assignment. Remember, you can do this with Insert -> Executable Cell -> R while in Visual mode.

1.1 Exercise 1

Create a tibble called maine_data that joins the two tibbles. Think about what matching columns you will be using in your join by clause. Review the joined data and state explicitly in the narrative whether you see any NA values and why you think they might exist.
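
As a minimal sketch of the idea in this exercise, the example below joins two small tibbles on every column they share and then checks for NA values. The tibbles `places` and `edu` and their values are hypothetical stand-ins, not the course data. Joining on all shared columns is one reasonable choice here; it keeps a single copy of each key instead of producing `.x` and `.y` duplicates.

```r
library(dplyr)

# Hypothetical toy tibbles standing in for the two course files
places <- tibble(
  town   = c("Alpha", "Beta", "Gamma"),
  county = "Cumberland",
  state  = "ME",
  pop    = c(1200, 5400, 30),
  top5   = c(210000, 320000, NA)
)
edu <- tibble(
  town            = c("Alpha", "Beta", "Gamma"),
  county          = "Cumberland",
  state           = "ME",
  n_pop_over_25   = c(900, 4000, 25),
  n_masters_above = c(90, 800, 5)
)

# Join on every shared column so each key appears only once in the result
joined <- left_join(places, edu, by = c("town", "county", "state"))

# Count NA values per column, then keep only the rows that contain any NA
na_per_column <- colSums(is.na(joined))
rows_with_na  <- filter(joined, if_any(everything(), is.na))

names(joined)
na_per_column
rows_with_na
```

Looking at `rows_with_na` shows which town and which column are affected, which is the information the narrative needs when explaining why a value might be missing.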

library(tidyverse)
library(ggrepel)
library(gt)

cumberland_places <- read_csv("https://jsuleiman.com/datasets/cumberland_places.csv")
Rows: 28 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): town, county, state
dbl (4): pop, top5, median_income, median_age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cumberland_edu <- read_csv("https://jsuleiman.com/datasets/cumberland_edu.csv")
Rows: 28 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): town, county, state
dbl (2): n_pop_over_25, n_masters_above

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
maine_data <- left_join(cumberland_places, cumberland_edu, by = "town")

sum(is.na(maine_data))
[1] 1
head(maine_data)
# A tibble: 6 × 11
  town   county.x state.x   pop   top5 median_income median_age county.y state.y
  <chr>  <chr>    <chr>   <dbl>  <dbl>         <dbl>      <dbl> <chr>    <chr>  
1 Baldw… Cumberl… ME       1180 212856         68625       50.3 Cumberl… ME     
2 Bridg… Cumberl… ME       5471 315914         78546       47.6 Cumberl… ME     
3 Bruns… Cumberl… ME      21691 418666         71236       41.3 Cumberl… ME     
4 Cape … Cumberl… ME       9519 833724        144250       48.6 Cumberl… ME     
5 Casco  Cumberl… ME       3657 392392         60708       51.1 Cumberl… ME     
6 Chebe… Cumberl… ME        597 668189         58571       55.6 Cumberl… ME     
# ℹ 2 more variables: n_pop_over_25 <dbl>, n_masters_above <dbl>
summary(maine_data)
     town             county.x           state.x               pop       
 Length:28          Length:28          Length:28          Min.   :   20  
 Class :character   Class :character   Class :character   1st Qu.: 3367  
 Mode  :character   Mode  :character   Mode  :character   Median : 6994  
                                                          Mean   :10834  
                                                          3rd Qu.:13878  
                                                          Max.   :68280  
                                                                         
      top5        median_income      median_age      county.y        
 Min.   :212856   Min.   : 55020   Min.   :36.50   Length:28         
 1st Qu.:321835   1st Qu.: 71432   1st Qu.:41.30   Class :character  
 Median :418666   Median : 91408   Median :45.20   Mode  :character  
 Mean   :482965   Mean   : 91901   Mean   :46.34                     
 3rd Qu.:588422   3rd Qu.:101726   3rd Qu.:49.02                     
 Max.   :998035   Max.   :144250   Max.   :68.70                     
 NA's   :1                                                           
   state.y          n_pop_over_25   n_masters_above  
 Length:28          Min.   :   20   Min.   :    4.0  
 Class :character   1st Qu.: 2785   1st Qu.:  363.2  
 Mode  :character   Median : 4997   Median :  752.5  
                    Mean   : 7939   Mean   : 1621.0  
                    3rd Qu.: 9447   3rd Qu.: 1983.5  
                    Max.   :51898   Max.   :12233.0  
                                                     
ggplot(maine_data, aes(x = pop, y = median_income)) +
  geom_point() +
  labs(title = "Population vs. Median Income")

ggplot(maine_data, aes(x = town, y = n_masters_above)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of People with Master's Degrees or Higher")

# Perform statistical analysis
cor(maine_data$pop, maine_data$median_income)
[1] -0.03018879
model <- lm(median_income ~ pop, data = maine_data)
summary(model)

Call:
lm(formula = median_income ~ pop, data = maine_data)

Residuals:
   Min     1Q Median     3Q    Max 
-37342 -17935   -575   9486  52309 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.250e+04  6.166e+03  15.003 2.57e-14 ***
pop         -5.533e-02  3.593e-01  -0.154    0.879    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25300 on 26 degrees of freedom
Multiple R-squared:  0.0009114, Adjusted R-squared:  -0.03752 
F-statistic: 0.02372 on 1 and 26 DF,  p-value: 0.8788

1.2 Exercise 2

Since the dataset only has 28 towns, you don’t need to show code to answer the questions in Exercise 2; you can simply look at the table and add the answers to your narrative. Make sure you specify the town name and the actual value that answers the question.

maine_data
# A tibble: 28 × 11
   town  county.x state.x   pop   top5 median_income median_age county.y state.y
   <chr> <chr>    <chr>   <dbl>  <dbl>         <dbl>      <dbl> <chr>    <chr>  
 1 Bald… Cumberl… ME       1180 212856         68625       50.3 Cumberl… ME     
 2 Brid… Cumberl… ME       5471 315914         78546       47.6 Cumberl… ME     
 3 Brun… Cumberl… ME      21691 418666         71236       41.3 Cumberl… ME     
 4 Cape… Cumberl… ME       9519 833724        144250       48.6 Cumberl… ME     
 5 Casco Cumberl… ME       3657 392392         60708       51.1 Cumberl… ME     
 6 Cheb… Cumberl… ME        597 668189         58571       55.6 Cumberl… ME     
 7 Cumb… Cumberl… ME       8443 998035        144167       40.6 Cumberl… ME     
 8 Falm… Cumberl… ME      12504 919545        144118       47.7 Cumberl… ME     
 9 Free… Cumberl… ME       8700 535344         95398       47.5 Cumberl… ME     
10 Frye… Cumberl… ME         20     NA        101250       68.7 Cumberl… ME     
# ℹ 18 more rows
# ℹ 2 more variables: n_pop_over_25 <dbl>, n_masters_above <dbl>
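
Because a tibble prints only its first 10 rows by default, reading answers off the table requires showing every row. The sketch below uses a hypothetical four-row tibble `towns` (not the course data) to show one way to print everything, and uses "highest median income" as an assumed example question to cross-check an answer read from the table.

```r
library(dplyr)

# Hypothetical toy data; the real table has 28 rows
towns <- tibble(
  town          = c("Alpha", "Beta", "Gamma", "Delta"),
  median_income = c(68000, 144000, 58000, 101000),
  median_age    = c(50.3, 48.6, 55.6, 68.7)
)

# Show every row and every column instead of the default 10-row preview
print(towns, n = Inf, width = Inf)

# Optional cross-check of an eyeballed answer: the row with the largest median_income
highest_income <- slice_max(towns, median_income, n = 1)
highest_income
```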

1.3 Exercise 3

Add a column to maine_data called pct_grad_degree that shows the percentage of graduate degrees for the town, which is defined as n_masters_above / n_pop_over_25.

maine_data <- maine_data %>%
  mutate(pct_grad_degree = n_masters_above / n_pop_over_25 * 100)
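
The ratio n_masters_above / n_pop_over_25 is a proportion between 0 and 1, and multiplying by 100 expresses it as a percent. The sketch below makes the two scales explicit on a hypothetical tibble `edu`; the column name `prop_grad_degree` is an illustrative assumption and not part of the assignment.

```r
library(dplyr)

# Hypothetical toy data, not the course file
edu <- tibble(
  town            = c("Alpha", "Beta"),
  n_pop_over_25   = c(1000, 400),
  n_masters_above = c(150, 100)
)

edu <- edu %>%
  mutate(
    prop_grad_degree = n_masters_above / n_pop_over_25, # proportion between 0 and 1
    pct_grad_degree  = prop_grad_degree * 100           # same quantity as a percent
  )

edu
```

Whichever scale you report, state it in the narrative so that a value such as 9.49 is read as a percent and not as a proportion.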

1.4 Exercise 4

What town has the lowest percentage of graduate degrees for people over 25?

lowest_grad_town <- maine_data %>%
  filter(!is.na(pct_grad_degree)) %>%
  arrange(pct_grad_degree) %>%
  slice(1) %>%
  select(town, pct_grad_degree)

lowest_grad_town
# A tibble: 1 × 2
  town    pct_grad_degree
  <chr>             <dbl>
1 Baldwin            9.49
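
An alternative to arrange() followed by slice(1) is dplyr's slice_min(), sketched below on a hypothetical tibble `grads`. One difference worth knowing: by default slice_min() keeps all rows tied for the minimum, whereas slice(1) silently keeps only the first.

```r
library(dplyr)

# Hypothetical toy data, including a missing value
grads <- tibble(
  town            = c("Alpha", "Beta", "Gamma"),
  pct_grad_degree = c(22.5, 9.5, NA)
)

lowest <- grads %>%
  filter(!is.na(pct_grad_degree)) %>%  # drop towns with no computed percentage
  slice_min(pct_grad_degree, n = 1) %>% # keep the smallest value (ties included)
  select(town, pct_grad_degree)

lowest
```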

1.5 Exercise 5

Replicate this graph. Note: use geom_label_repel() just like you would use geom_label(). Discuss any patterns you see in the narrative.

library(ggrepel)

ggplot(maine_data, aes(x = pop, y = median_income, label = town)) +
  geom_point() +
  geom_label_repel(max.overlaps = Inf) +
  labs(title = "Population vs. Median Income (with Town Labels)")

2 Submission

To submit your assignment:

  • Change the author name to your name in the YAML portion at the top of this document.
  • Render your document to HTML and publish it to RPubs.
  • Submit the link to your RPubs document in the Brightspace comments section for this assignment.
  • Click on the “Add a File” button and upload your .qmd file for this assignment to Brightspace.

3 Grading Rubric

| Item (percent overall) | 100% - flawless | 67% - minor issues | 33% - moderate issues | 0% - major issues or not attempted |
|---|---|---|---|---|
| Narrative: typos and grammatical errors (8%) | | | | |
| Document formatting: correctly implemented instructions (8%) | | | | |
| Exercises (15% each) | | | | |
| Submitted properly to Brightspace (9%) | | NA | NA | You must submit according to instructions to receive any credit for this portion. |