Midterm

Task 1: Using the code I have provided above, install and load the tidyverse package, and create the Inc2022 and Inc2017 data frames. Accompany the code with a brief explanation of how the code works.

Response:

if(!require(tidyverse))install.packages("tidyverse")

## Loading required package: tidyverse

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(tidyverse)

Inc2022 <- read.csv("https://drkblake.com/wp-content/uploads/2024/02/Inc2022.csv")

Inc2017 <- read.csv("https://drkblake.com/wp-content/uploads/2024/02/Inc2017.csv")

The first line checks whether the tidyverse package is available and installs it if it is not; library(tidyverse) then makes sure it is loaded. Each read.csv() call reads a CSV file from the given URL and stores it as a data frame: Inc2022 holds the 2022 income estimates and Inc2017 holds the 2017 income estimates.

Task 2: Write R code that uses the left_join() function to merge the Inc2022 and Inc2017 data frames into a new data frame called Income, then shows the first 10 rows of the Income data frame. Explain which data frame was your “to” data frame, which was your “from” data frame, and which variables served as your “key column” variables.

Response:

Income <- left_join(Inc2022,
                        Inc2017,
                        by = join_by(GEOID == GEOID))
head(Income,10)

##         GEOID   District          County HHInc2022   Significance HHInc2017
## 1  4702190022 District 1 Cheatham County     59741    Significant     44091
## 2  4702190212 District 2 Cheatham County     73634    Significant     55247
## 3  4702190402 District 3 Cheatham County     89861 Nonsignificant     73802
## 4  4702190592 District 4 Cheatham County     73293    Significant     49731
## 5  4702190782 District 5 Cheatham County     78380    Significant     60793
## 6  4702190972 District 6 Cheatham County     92305    Significant     66750
## 7  4703790038 District 1 Davidson County     75038    Significant     56258
## 8  4703790228 District 2 Davidson County     54346    Significant     32487
## 9  4703790418 District 3 Davidson County     67953    Significant     49402
## 10 4703790608 District 4 Davidson County    106260    Significant     79381

The left_join() function merges the two data frames, keeping every row of Inc2022 and adding the matching columns from Inc2017. The head(Income, 10) call displays only the first ten rows. My “to” data frame is Inc2022 and my “from” data frame is Inc2017. My “key column” variable is GEOID in both data frames; R uses it to match each district’s 2017 row to its 2022 row.
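To illustrate how the “to” data frame, the “from” data frame, and the key column interact, here is a minimal sketch with made-up toy data. The names to_df and from_df and all values are hypothetical and are not the real Inc2022 and Inc2017 files.

```r
library(dplyr)

# Hypothetical toy data: GEOID values and incomes are invented for illustration
to_df   <- data.frame(GEOID = c(1, 2, 3), HHInc2022 = c(60000, 75000, 90000))
from_df <- data.frame(GEOID = c(1, 2),    HHInc2017 = c(45000, 55000))

# Keep every row of to_df and pull in HHInc2017 wherever GEOID matches
joined <- left_join(to_df, from_df, by = join_by(GEOID))
print(joined)
```

Because GEOID 3 has no match in from_df, its HHInc2017 value comes back as NA, while the row itself is kept.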

Task 3: Write R code that adds a variable called “Change” to the Income data frame showing the difference between each district’s 2022 and 2017 income estimates and displays the first 10 rows of the Income. Explain which function you used to create the “Change” variable.

Response:

Income <- Income %>% 
  mutate(Change = HHInc2022 - HHInc2017)

The mutate() function adds a new column, Change, calculated as HHInc2022 minus HHInc2017, so it shows how much each district’s income estimate changed between 2017 and 2022.
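As a small sketch of the same idea on made-up numbers (the toy data frame below is hypothetical, not the real Income data), mutate() can be followed by head() to display the first rows of the result:

```r
library(dplyr)

# Hypothetical toy data: values are invented for illustration
toy <- data.frame(HHInc2022 = c(60000, 75000), HHInc2017 = c(45000, 80000))

# Add the difference between the two years as a new column
toy <- toy %>%
  mutate(Change = HHInc2022 - HHInc2017)

# Show up to the first 10 rows
print(head(toy, 10))
```

A negative Change, as in the second toy row, means the estimate went down between the two years.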

Task 4: Write code that adds a variable called “Level” to the Income data frame indicating whether each district’s Inc2022 figure is $100,000 or more, or less than $100,000. The “Level” variable should read, “$100k+” for districts where Inc2022 is 100,000 or more, and “<$100k” for districts where Inc2022 is less than 100,000. Explain which functions you used to create the “Level” variable.

Response:

Income <- Income %>% 
  mutate(Level = case_when(HHInc2022 < 100000 ~ "<$100k",
                           HHInc2022 >= 100000 ~ "$100k+",
                           .default = "Error"))

The mutate() function adds a column named Level, and the case_when() function fills it in by sorting each district into “<$100k” or “$100k+” based on its HHInc2022 value.
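One way to check the boundary the task describes (a value of exactly 100,000 should be labeled “$100k+”) is to run the rule on a few made-up values. The toy data frame below is hypothetical and only a sketch.

```r
library(dplyr)

# Hypothetical values just below, at, and just above the cutoff
toy <- data.frame(HHInc2022 = c(99999, 100000, 100001))

toy <- toy %>%
  mutate(Level = case_when(HHInc2022 < 100000 ~ "<$100k",
                           HHInc2022 >= 100000 ~ "$100k+",
                           .default = "Error"))
print(toy)
```

The .default value is only used for rows that match neither condition, such as a missing income.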

Task 5: Create a data frame called “LevelbyCounty” that aggregates the data to show how many districts in each county are in the “Level” variable’s “$100k+” and “<$100k” categories. Show the “LevelbyCounty” data frame’s first 10 rows. Explain the functions you used. Your code’s output should include a table that looks like this one, except that every county should have a value in the “<$100k” and “$100k+” columns, not just Cheatham County. I left the Cheatham County figures visible to help you make sure you set the process up correctly.

Response:

LevelByCounty <- Income
LevelByCounty <- group_by(LevelByCounty, County, Level)
LevelByCounty <- summarize(LevelByCounty, Count = n())

## `summarise()` has grouped output by 'County'. You can override using the
## `.groups` argument.

LevelByCounty <- pivot_wider(LevelByCounty,
                           names_from = Level,
                           values_from = Count)

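The LevelByCounty data frame shows how many districts in each county fall in the “$100k+” and “<$100k” categories. The group_by() function groups the rows by County and Level, summarize() with n() counts the districts in each group, and pivot_wider() spreads the two Level categories into separate columns.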
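By default, pivot_wider() leaves NA for a county that has no districts in one of the categories. A minimal sketch on made-up data (the county names and the toy_counts name are hypothetical) shows how the values_fill argument could put a 0 there instead:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: "B County" has no "$100k+" districts
toy <- data.frame(
  County = c("A County", "A County", "B County"),
  Level  = c("<$100k", "$100k+", "<$100k")
)

toy_counts <- toy %>%
  group_by(County, Level) %>%
  summarize(Count = n(), .groups = "drop") %>%  # .groups = "drop" silences the grouping message
  pivot_wider(names_from = Level,
              values_from = Count,
              values_fill = 0)                  # fill missing combinations with 0
print(toy_counts)
```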

Task 6: Write code that creates a “RichDistricts” data frame containing rows for only those districts in the “Level” variable’s “$100k+” category and shows the first 10 rows of the “RichDistricts” data frame. Explain which function you used.

Response:

RichDistricts <- Income %>% 
  filter(Level == "$100k+")
head(RichDistricts, 10)

##         GEOID    District            County HHInc2022   Significance HHInc2017
## 1  4703790608  District 4   Davidson County    106260    Significant     79381
## 2  4703794218 District 23   Davidson County    117474    Significant     86768
## 3  4703794408 District 24   Davidson County    110739    Significant     79959
## 4  4703794598 District 25   Davidson County    120206    Significant     90998
## 5  4703796460 District 34   Davidson County    161370    Significant    128421
## 6  4703796650 District 35   Davidson County    117294    Significant     99666
## 7  4714791098  District 6  Robertson County    106058    Significant     65179
## 8  4714991480  District 8 Rutherford County    102862    Significant     75875
## 9  4714991670  District 9 Rutherford County    103006    Significant     77481
## 10 4714992620 District 14 Rutherford County    105750 Nonsignificant     95708
##    Change  Level
## 1   26879 $100k+
## 2   30706 $100k+
## 3   30780 $100k+
## 4   29208 $100k+
## 5   32949 $100k+
## 6   17628 $100k+
## 7   40879 $100k+
## 8   26987 $100k+
## 9   25525 $100k+
## 10  10042 $100k+

I used the filter() function to create a RichDistricts data frame that keeps only the districts whose Level is “$100k+”. The head() function then shows only the first 10 rows.

Task 7: Write code that sorts the “RichDistricts” data frame by the “HHInc2022” variable, in descending order (biggest value at the top), and displays the first 10 rows of the data frame. Explain which function you used.

Response:

RichDistricts <- RichDistricts %>% 
  arrange(desc(HHInc2022))
head(RichDistricts,10)

##         GEOID    District            County HHInc2022   Significance HHInc2017
## 1  4718791328  District 7 Williamson County    181709    Significant    150106
## 2  4718791138  District 6 Williamson County    178665    Significant    154149
## 3  4703796460 District 34   Davidson County    161370    Significant    128421
## 4  4718790948  District 5 Williamson County    159737    Significant    121622
## 5  4718791708  District 9 Williamson County    144924    Significant    117526
## 6  4718993040 District 16     Wilson County    125785    Significant     74528
## 7  4718990570  District 3     Wilson County    125324    Significant     80577
## 8  4718790758  District 4 Williamson County    124237 Nonsignificant    127552
## 9  4718790378  District 2 Williamson County    123609    Significant    101687
## 10 4718990380  District 2     Wilson County    120302    Significant     98846
##    Change  Level
## 1   31603 $100k+
## 2   24516 $100k+
## 3   32949 $100k+
## 4   38115 $100k+
## 5   27398 $100k+
## 6   51257 $100k+
## 7   44747 $100k+
## 8   -3315 $100k+
## 9   21922 $100k+
## 10  21456 $100k+

Lastly, I sorted the RichDistricts data frame by HHInc2022 in descending order. I accomplished this with the arrange() function, wrapping the variable in desc() so the biggest value comes first.
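The filtering and sorting steps can also be chained in a single pipeline. This is only a sketch on made-up data; the toy data frame and the rich_sorted name are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: district names and incomes are invented
toy <- data.frame(
  District  = c("D1", "D2", "D3", "D4"),
  HHInc2022 = c(95000, 120000, 101000, 150000)
)

rich_sorted <- toy %>%
  filter(HHInc2022 >= 100000) %>%   # keep only the high-income rows
  arrange(desc(HHInc2022))          # biggest value at the top
print(head(rich_sorted, 10))
```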

Task 8: Look back over the results you got and summarize, in a few sentences, at least one interesting pattern you discovered pertaining to how income levels vary across districts and/or counties in the Nashville area.

Response:

One interesting pattern is that Williamson County’s districts have noticeably higher incomes than those in the other counties. Davidson and Wilson counties also have districts well above $100,000, but Williamson is consistently at the top: it accounts for 4 of the 5 highest-income districts and 6 of the top 10 in the sorted RichDistricts data frame.

Alison Wich

2024-03-01