DS 2870 - Homework 6

Data description and cleaning:

We’ll be using the county_complete data set from the openintro package. It has 188 variables, but we’ll only use the following columns:

name: County name
state: State name
fips: An identifier that uniquely IDs each county
households_2019: The number of households in the county
households_speak_spanish_2019: Percent of households speaking Spanish
households_speak_other_indo_euro_lang_2019: Percent of households speaking another Indo-European language
households_speak_asian_or_pac_isl_2019: Percent of households speaking an Asian or Pacific Island language
households_speak_other_2019: Percent of households speaking a language that is neither European nor Asian/Pacific Island
households_speak_limited_english_2019: Percent of limited English-speaking households

In the code chunk below:

keep only the 9 columns listed above,
rename the last 5 columns as Spanish, European, Asian_Pacific, Other, and limited, respectively,
remove rows corresponding to counties in Hawaii, Alaska, and the District of Columbia,
save the results as counties6.

It should have a total of 3107 rows. To clean the data, use 2 or 3 dplyr verbs we’ve seen in class so far.
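
One way the steps above could be written is sketched below. It assumes the dplyr and openintro packages are installed and that the state column spells the excluded values exactly as "Hawaii", "Alaska", and "District of Columbia". Treat it as one possible approach, not necessarily the intended solution.

```r
library(dplyr)
library(openintro)  # provides county_complete

counties6 <- county_complete %>%
  # select() can keep and rename columns in a single step
  select(
    name, state, fips, households_2019,
    Spanish       = households_speak_spanish_2019,
    European      = households_speak_other_indo_euro_lang_2019,
    Asian_Pacific = households_speak_asian_or_pac_isl_2019,
    Other         = households_speak_other_2019,
    limited       = households_speak_limited_english_2019
  ) %>%
  # assumption: these strings match the values stored in `state`
  filter(!state %in% c("Hawaii", "Alaska", "District of Columbia"))

dim(counties6)  # compare against the 3107 rows and 9 columns stated above
```

Renaming inside select() keeps the pipeline to two verbs; a separate rename() call after select() would work as well.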

## # A tibble: 3,107 × 9
##    name         state  fips households_2019 Spanish European Asian_Pacific Other
##    <chr>        <chr> <dbl>           <dbl>   <dbl>    <dbl>         <dbl> <dbl>
##  1 Autauga Cou… Alab…  1001           21397     2.9      0.3           1.8   0.2
##  2 Baldwin Cou… Alab…  1003           80930     4.6      1.8           0.6   0  
##  3 Barbour Cou… Alab…  1005            9345     5.2      1.1           0.6   0  
##  4 Bibb County  Alab…  1007            6891     1.9      0.5           0     0  
##  5 Blount Coun… Alab…  1009           20847     6.6      0.9           0.1   0.2
##  6 Bullock Cou… Alab…  1011            3521     3.6      1.6           0.4   0  
##  7 Butler Coun… Alab…  1013            6506     1.4      0.6           0.2   0  
##  8 Calhoun Cou… Alab…  1015           44605     3.1      0.8           0.8   0.1
##  9 Chambers Co… Alab…  1017           13448     1.4      0.5           1     0  
## 10 Cherokee Co… Alab…  1019           10737     2        0.1           0.5   0.1
## # ℹ 3,097 more rows
## # ℹ 1 more variable: limited <dbl>

Question 1: State Level Data

In Question 1, you’ll make a graph of the percentage of households that speak limited English, by state.

Question 1A) Creating the state-level data set

Create a data set with 48 rows (one for each continental state) and the following three columns:

state: The name of the state
households: The number of all households in the state
limited_per: The percentage of all households in the state that speak limited English

Note: limited_per is not the average of limited for all the counties in the state!
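
To see why the note above matters, here is a small sketch with made-up numbers (the state "A" and its two counties are hypothetical, not taken from county_complete). The state-level percentage weights each county by its number of households, so a large county with few limited-English households pulls the figure down much more than a simple average suggests.

```r
library(dplyr)

# Hypothetical toy data: one small county at 10%, one large county at 0%
toy <- tibble(
  state           = c("A", "A"),
  households_2019 = c(1000, 9000),
  limited         = c(10, 0)
)

toy_states <- toy %>%
  group_by(state) %>%
  summarize(
    households  = sum(households_2019),
    # household-weighted percentage: total limited households / total households
    limited_per = sum(households_2019 * limited) / sum(households_2019),
    # the simple county average, shown only for comparison
    naive_mean  = mean(limited),
    .groups = "drop"
  )

toy_states
# limited_per is 1 (100 limited households out of 10000), naive_mean is 5
```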

Save the results as states_q1a. Display the results sorted from highest to lowest limited_per in the knitted document (make sure to use tibble() as well so that not all 48 states are printed). You can check your results in Brightspace.

## # A tibble: 48 × 3
##    state         households limited_per
##    <chr>              <dbl>       <dbl>
##  1 California      13044266        8.95
##  2 New York         7343234        8.00
##  3 Texas            9691647        7.73
##  4 New Jersey       3231874        7.02
##  5 Florida          7736311        6.92
##  6 Nevada           1098602        5.98
##  7 Massachusetts    2617497        5.84
##  8 Rhode Island      410489        5.47
##  9 Connecticut      1370746        5.22
## 10 New Mexico        780249        5.21
## # ℹ 38 more rows

Question 1B) State lines data set

Create a data set with the three columns from Question 1A, plus the latitude and longitude of the outline for each state. Save the results as state_lines and use the code provided at the bottom to display the same columns as seen in Brightspace.
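
A minimal sketch of the join is shown below, assuming the outlines come from ggplot2's map_data("state") (which needs the maps package installed). The two-row states_demo tibble is a stand-in built from values in the 1A output so the snippet runs on its own; in the homework you would use your own states_q1a.

```r
library(dplyr)
library(ggplot2)  # map_data(); requires the maps package to be installed

# Stand-in for states_q1a (values copied from the 1A output)
states_demo <- tibble(
  state       = c("California", "Texas"),
  households  = c(13044266, 9691647),
  limited_per = c(8.95, 7.73)
)

state_lines_demo <- map_data("state") %>%
  # map_data() stores state names in lowercase in the `region` column
  inner_join(
    states_demo %>% mutate(state = tolower(state)),
    by = c("region" = "state")
  )

head(state_lines_demo)
```

inner_join() drops any map region without a match in the state-level data, while left_join() would keep such regions with missing values. Which one reproduces the Brightspace output is for you to check.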

##          long      lat group order         region subregion households
## 1  -122.37806 37.40841     4   941     california      <NA>   13044266
## 2   -81.25114 31.55852    10  2376        georgia      <NA>    3758798
## 3  -114.39675 46.66168    11  2832          idaho      <NA>     630008
## 4   -90.31534 30.32665    17  4500      louisiana      <NA>    1739497
## 5   -68.77785 44.52455    18  5249          maine      <NA>     559921
## 6   -75.21790 38.02721    19  5479       maryland      <NA>    2205204
## 7   -76.58727 39.23615    19  5733       maryland      <NA>    2205204
## 8   -79.24007 33.36906    47 11508 south carolina      <NA>    1921862
## 9   -98.04454 33.99932    50 13146          texas      <NA>    9691647
## 10  -73.34433 44.94281    52 13478        vermont      <NA>     260029
##    limited_per
## 1    8.9457728
## 2    2.8300776
## 3    2.0222857
## 4    1.9297767
## 5    0.9558641
## 6    3.2835147
## 7    3.2835147
## 8    1.4331921
## 9    7.7322729
## 10   0.7482015

Question 1C) Map of Limited English Percentage

Create the graph seen in Brightspace. To get the color guide to be similar to what is in the solutions, use scale_fill_fermenter() (you’ll need to use some specific arguments!)
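
The sketch below shows the general shape of such a map. The palette and direction arguments passed to scale_fill_fermenter() are guesses, not the arguments used in the solution, and the five-state demo tibble (values copied from the 1B output) only stands in for the full state_lines data.

```r
library(dplyr)
library(ggplot2)  # map_data() requires the maps package to be installed

# Stand-in values copied from the 1B output; the real data covers all 48 states
demo <- tibble(
  region      = c("california", "georgia", "idaho", "texas", "vermont"),
  limited_per = c(8.9457728, 2.8300776, 2.0222857, 7.7322729, 0.7482015)
)

map_demo <- map_data("state") %>%
  inner_join(demo, by = "region")

p <- ggplot(map_demo, aes(x = long, y = lat, group = group, fill = limited_per)) +
  geom_polygon(color = "white") +
  # assumption: palette and direction are placeholders to adjust
  scale_fill_fermenter(palette = "YlOrRd", direction = 1) +
  coord_quickmap() +
  theme_void() +
  labs(fill = "Limited English (%)")

p
```

scale_fill_fermenter() bins a continuous fill into discrete color steps, which is why the legend looks stepped rather than smooth.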

Question 2: Most common non-English language by county

Next, you’ll create a graph that displays the most spoken language group (Spanish, Other European, Asian/Pacific Islander, Other) by county.

Question 2A) Finding the most common non-English language per county

For each county, find the column with the largest percentage among Spanish, European, Asian_Pacific, and Other and save the result in most_common. The resulting data set should have 5 columns:

name, state, fips: the name of the county, the state, and FIPS code
language: The most common non-English language spoken
lang_per: The percentage of households in the county that speak the most common language

In the appropriate dplyr verb, include with_ties = F to keep only 1 row per county (in case of a tie).

You’ll need to use the counties6 data set created at the beginning. Save the results as counties_2a. Once you’ve finished, uncomment the code at the bottom of the code chunk to display the counties seen in the solutions in Brightspace.

Hint: you’ll only need to use two (max three) dplyr verbs, and none of them is mutate().
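
One possible approach is sketched below on a two-row stand-in: reshape the four language columns to long format with tidyr's pivot_longer(), then keep the largest value per county with slice_max(). "Bibb County" is copied from the counties6 preview; "Example County" is hypothetical and added only to show a tie. The number of verbs used here may differ from what the hint has in mind.

```r
library(dplyr)
library(tidyr)

counties_demo <- tibble(
  name  = c("Bibb County", "Example County"),  # second county is hypothetical
  state = c("Alabama", "Alabama"),
  fips  = c(1007, 99999),                      # 99999 is a made-up code
  households_2019 = c(6891, 1000),
  Spanish       = c(1.9, 1.0),
  European      = c(0.5, 1.0),                 # ties with Spanish in row 2
  Asian_Pacific = c(0, 0.2),
  Other         = c(0, 0)
)

counties_2a_demo <- counties_demo %>%
  select(name, state, fips, Spanish:Other) %>%
  # one row per county-language pair
  pivot_longer(Spanish:Other, names_to = "language", values_to = "lang_per") %>%
  group_by(name, state, fips) %>%
  # with_ties = FALSE returns exactly one row per group, even when values tie
  slice_max(lang_per, n = 1, with_ties = FALSE) %>%
  ungroup()

counties_2a_demo
```

Without with_ties = FALSE, the hypothetical tied county would come back with two rows, and the result would no longer have one row per county.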

## # A tibble: 3,107 × 5
##    name                state       fips language      lang_per
##    <chr>               <chr>      <dbl> <chr>            <dbl>
##  1 Rock County         Wisconsin  55105 Spanish            5.2
##  2 Hancock County      Kentucky   21091 Spanish            1.3
##  3 Mason County        Illinois   17125 Spanish            0.9
##  4 Wahkiakum County    Washington 53069 Asian_Pacific      2.3
##  5 San Patricio County Texas      48409 Spanish           45.5
##  6 Riley County        Kansas     20161 Spanish            5.7
##  7 Decatur County      Georgia    13087 Spanish            2.1
##  8 Skamania County     Washington 53059 Spanish            3.1
##  9 Murray County       Oklahoma   40099 Spanish            4.3
## 10 Caledonia County    Vermont    50005 European           3.4
## # ℹ 3,097 more rows

Question 2B) Forming the counties data set to create the map

Similar to Question 1B), merge the data set created in Question 2A) with a data set that has the longitude and latitude outlines of each of the counties. Save it as county_lines. Make sure that the final data set has exactly 87,949 rows and 8 columns.

See the results in Brightspace. It will probably take multiple steps!
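
The multi-step merge might look like the sketch below. It assumes the outlines come from map_data("county") and that FIPS codes are attached through maps::county.fips, whose polyname column uses the same "state,county" form as the state_county column in the expected output. Both are assumptions about the intended approach. The one-row counties_2a_demo is a stand-in taken from the expected output.

```r
library(dplyr)
library(ggplot2)  # map_data(); requires the maps package to be installed

# Stand-in for counties_2a: a single county from the expected output
counties_2a_demo <- tibble(fips = 1001, language = "Spanish", lang_per = 2.9)

county_lines_demo <- map_data("county") %>%
  # build a "state,county" key in the same form as maps::county.fips$polyname
  mutate(state_county = paste(region, subregion, sep = ",")) %>%
  # attach FIPS codes; keys with no match in county.fips get NA
  left_join(maps::county.fips, by = c("state_county" = "polyname")) %>%
  # attach the language columns by FIPS code
  left_join(counties_2a_demo, by = "fips") %>%
  select(long, lat, group, order, fips, state_county, language, lang_per)

dim(county_lines_demo)  # compare with the 87,949 rows and 8 columns required
```

Using left_join() at both steps keeps every outline point from the map data, so the row count stays tied to the map data as long as the join keys are unique on the right-hand side.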

## # A tibble: 87,949 × 8
##     long   lat group order  fips state_county    language lang_per
##    <dbl> <dbl> <dbl> <int> <dbl> <chr>           <chr>       <dbl>
##  1 -86.5  32.3     1     1  1001 alabama,autauga Spanish       2.9
##  2 -86.5  32.4     1     2  1001 alabama,autauga Spanish       2.9
##  3 -86.5  32.4     1     3  1001 alabama,autauga Spanish       2.9
##  4 -86.6  32.4     1     4  1001 alabama,autauga Spanish       2.9
##  5 -86.6  32.4     1     5  1001 alabama,autauga Spanish       2.9
##  6 -86.6  32.4     1     6  1001 alabama,autauga Spanish       2.9
##  7 -86.6  32.4     1     7  1001 alabama,autauga Spanish       2.9
##  8 -86.6  32.4     1     8  1001 alabama,autauga Spanish       2.9
##  9 -86.6  32.4     1     9  1001 alabama,autauga Spanish       2.9
## 10 -86.6  32.4     1    10  1001 alabama,autauga Spanish       2.9
## # ℹ 87,939 more rows

Question 2C) Most common non-English language by county graph

Create the graph seen in Brightspace. The hex codes for the colors in the graph are:

Spanish = #FABD00
European = #003399
Asian_Pacific = #EE1C25
Other = #007749
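
The colors above can be supplied as a named vector to scale_fill_manual(), as in this sketch. The demo data is a stand-in for county_lines and contains only the outline of Autauga County, Alabama (labelled Spanish in the 2B output); the border color and theme are arbitrary choices.

```r
library(dplyr)
library(ggplot2)  # map_data() requires the maps package to be installed

# Names must match the values in the `language` column
lang_colors <- c(
  Spanish       = "#FABD00",
  European      = "#003399",
  Asian_Pacific = "#EE1C25",
  Other         = "#007749"
)

# Stand-in for county_lines: one county outline only
demo_lines <- map_data("county") %>%
  filter(region == "alabama", subregion == "autauga") %>%
  mutate(language = "Spanish")

p <- ggplot(demo_lines, aes(x = long, y = lat, group = group, fill = language)) +
  geom_polygon(color = "grey30") +
  scale_fill_manual(values = lang_colors) +
  coord_quickmap() +
  theme_void()

p
```

Because the vector is named, each color is matched to its language by name rather than by position, so the mapping stays correct even if a language is missing from the data.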

DS 2870 - Homework 6 - Key

Jacob Martin

2023-09-25