Task 1: Using the code I have provided above, install and load the tidyverse package, and create the Inc2022 and Inc2017 data frames. Accompany the code with a brief explanation of how the code works.
Response:
if(!require(tidyverse))install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyverse)
Inc2022 <- read.csv("https://drkblake.com/wp-content/uploads/2024/02/Inc2022.csv")
Inc2017 <- read.csv("https://drkblake.com/wp-content/uploads/2024/02/Inc2017.csv")
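The first line tries to load the tidyverse package with require() and installs it if it is not already installed, and library(tidyverse) then makes sure it is loaded. The two read.csv() calls read the CSV files at the given URLs and store them as the Inc2022 and Inc2017 data frames, which hold the income estimates for 2022 and 2017 that I’ll be working with.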
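To see what read.csv() returns without downloading anything, here is a minimal sketch that reads a few lines of made-up CSV text. The name csv_text and the values in it are hypothetical and only stand in for the real files.

```r
# Made-up CSV text standing in for one of the downloaded files (hypothetical values)
csv_text <- "GEOID,HHInc2017
1,40000
2,45000"

# read.csv() turns the header row into column names and each later line into a row
toy <- read.csv(text = csv_text)
str(toy)
```

The result is an ordinary data frame, which is the same kind of object that Inc2022 and Inc2017 are.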
Task 2: Write R code that uses the left_join() function to merge the Inc2022 and Inc2017 data frames into a new data frame called Income, then shows the first 10 rows of the Income data frame. Explain which data frame was your “to” data frame, which was your “from” data frame, and which variables served as your “key column” variables.
Response:
Income <- left_join(Inc2022,
                    Inc2017,
                    by = join_by(GEOID == GEOID))
head(Income,10)
## GEOID District County HHInc2022 Significance HHInc2017
## 1 4702190022 District 1 Cheatham County 59741 Significant 44091
## 2 4702190212 District 2 Cheatham County 73634 Significant 55247
## 3 4702190402 District 3 Cheatham County 89861 Nonsignificant 73802
## 4 4702190592 District 4 Cheatham County 73293 Significant 49731
## 5 4702190782 District 5 Cheatham County 78380 Significant 60793
## 6 4702190972 District 6 Cheatham County 92305 Significant 66750
## 7 4703790038 District 1 Davidson County 75038 Significant 56258
## 8 4703790228 District 2 Davidson County 54346 Significant 32487
## 9 4703790418 District 3 Davidson County 67953 Significant 49402
## 10 4703790608 District 4 Davidson County 106260 Significant 79381
The left_join() function merges the two data frames into one, and head(Income, 10) shows only the first 10 rows. My “to” data frame is Inc2022 and my “from” data frame is Inc2017. My “key column” variable is GEOID in both data frames, which tells R which rows in Inc2017 match which rows in Inc2022.
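A small sketch of how the “to” and “from” roles work, using two made-up data frames. The names to_df and from_df and all of their values are hypothetical, not the assignment data.

```r
library(dplyr)

# Hypothetical "to" data frame: 2022 values for three districts
to_df <- data.frame(GEOID = c(1, 2, 3),
                    HHInc2022 = c(50000, 60000, 70000))

# Hypothetical "from" data frame: 2017 values, with no row for GEOID 3
from_df <- data.frame(GEOID = c(1, 2),
                      HHInc2017 = c(40000, 45000))

# left_join() keeps every row of to_df and adds the matching columns from from_df
joined <- left_join(to_df, from_df, by = join_by(GEOID == GEOID))
joined
```

Every row of the “to” data frame survives the join. GEOID 3 has no match in the “from” data frame, so its HHInc2017 value comes back as NA.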
Task 3: Write R code that adds a variable called “Change” to the Income data frame showing the difference between each district’s 2022 and 2017 income estimates and displays the first 10 rows of the Income. Explain which function you used to create the “Change” variable.
Response:
Income <- Income %>%
  mutate(Change = HHInc2022 - HHInc2017)
I used the mutate() function to create the Change variable, which is each district’s HHInc2022 value minus its HHInc2017 value.
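As a rough check on the subtraction, the sketch below rebuilds the first three Cheatham County rows from the Task 2 output and applies the same mutate() call. The name check_df is hypothetical.

```r
library(dplyr)

# First three rows copied from the head(Income, 10) output in Task 2
check_df <- data.frame(
  District  = c("District 1", "District 2", "District 3"),
  HHInc2022 = c(59741, 73634, 89861),
  HHInc2017 = c(44091, 55247, 73802))

# Same calculation as in my answer: 2022 estimate minus 2017 estimate
check_df <- check_df %>%
  mutate(Change = HHInc2022 - HHInc2017)
check_df
```

A positive Change means the 2022 estimate is higher than the 2017 estimate.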
Task 4: Write code that adds a variable called “Level” to the Income data frame indicating whether each district’s Inc2022 figure is $100,000 or more, or less than $100,000. The “Level” variable should read, “$100k+” for districts where Inc2022 is 100,000 or more, and “<$100k” for districts where Inc2022 is less than 100,000. Explain which functions you used to create the “Level” variable.
Response:
Income <- Income %>%
  mutate(Level = case_when(HHInc2022 >= 100000 ~ "$100k+",
                           HHInc2022 < 100000 ~ "<$100k",
                           .default = "Error"))
I used mutate() to add a column named Level and case_when() to fill it in. Districts where HHInc2022 is 100,000 or more get “$100k+”, and districts where it is less than 100,000 get “<$100k”.
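To check how case_when() treats the boundary, here is a small sketch with three made-up incomes around 100,000. The toy data frame is hypothetical. Following the task wording, a value of exactly 100,000 counts as “$100k+”.

```r
library(dplyr)

# Made-up incomes just below, exactly at, and just above the cutoff
toy <- data.frame(HHInc2022 = c(99999, 100000, 100001))

# case_when() checks the conditions in order and uses the first one that is TRUE
toy <- toy %>%
  mutate(Level = case_when(HHInc2022 >= 100000 ~ "$100k+",
                           HHInc2022 < 100000 ~ "<$100k",
                           .default = "Error"))
toy
```

Only the 99,999 row falls in “<$100k”. The “Error” label would only appear for a row that matches neither condition, such as a missing income.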
Task 5: Create a data frame called “LevelbyCounty” that aggregates the data to show how many districts in each county are in the “Level” variable’s “$100k+” and “<$100k” categories. Show the “LevelbyCounty” data frame’s first 10 rows. Explain the functions you used. Your code’s output should include a table that looks like this one, except that every county should have a value in the “<$100k” and “$100k+” columns, not just Cheatham County. I left the Cheatham County figures visible to help you make sure you set the process up correctly.
Response:
LevelByCounty <- Income
LevelByCounty <- group_by(LevelByCounty, County, Level)
LevelByCounty <- summarize(LevelByCounty, Count = n())
## `summarise()` has grouped output by 'County'. You can override using the
## `.groups` argument.
LevelByCounty <- pivot_wider(LevelByCounty,
                             names_from = Level,
                             values_from = Count)
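The LevelByCounty code counts how many districts in each county fall into the “$100k+” and “<$100k” categories. I used group_by() to group the rows by County and Level, summarize() with n() to count the districts in each group, and pivot_wider() to spread the Level categories into their own columns.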
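The three steps can be sketched on a tiny made-up data set. The county names and rows below are hypothetical. The values_fill = 0 argument is an optional addition that is not in my code above: it shows a 0 instead of NA when a county has no districts in one category.

```r
library(dplyr)
library(tidyr)

# Made-up districts: "B County" has no districts under $100k
toy <- data.frame(
  County = c("A County", "A County", "A County", "B County", "B County"),
  Level  = c("<$100k", "<$100k", "$100k+", "$100k+", "$100k+"))

toy_counts <- toy %>%
  group_by(County, Level) %>%                   # one group per county/level pair
  summarize(Count = n(), .groups = "drop") %>%  # count the districts in each group
  pivot_wider(names_from = Level,               # one column per Level category
              values_from = Count,
              values_fill = 0)                  # 0 rather than NA for empty pairs
toy_counts
```

Setting .groups = "drop" also avoids the grouping message that summarize() prints.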
Task 6: Write code that creates a “RichDistricts” data frame containing rows for only those districts in the “Level” variable’s “$100k+” category and shows the first 10 rows of the “RichDistricts” data frame. Explain which function you used.
Response:
RichDistricts <- Income %>%
  filter(Level == "$100k+")
head(RichDistricts, 10)
## GEOID District County HHInc2022 Significance HHInc2017
## 1 4703790608 District 4 Davidson County 106260 Significant 79381
## 2 4703794218 District 23 Davidson County 117474 Significant 86768
## 3 4703794408 District 24 Davidson County 110739 Significant 79959
## 4 4703794598 District 25 Davidson County 120206 Significant 90998
## 5 4703796460 District 34 Davidson County 161370 Significant 128421
## 6 4703796650 District 35 Davidson County 117294 Significant 99666
## 7 4714791098 District 6 Robertson County 106058 Significant 65179
## 8 4714991480 District 8 Rutherford County 102862 Significant 75875
## 9 4714991670 District 9 Rutherford County 103006 Significant 77481
## 10 4714992620 District 14 Rutherford County 105750 Nonsignificant 95708
## Change Level
## 1 26879 $100k+
## 2 30706 $100k+
## 3 30780 $100k+
## 4 29208 $100k+
## 5 32949 $100k+
## 6 17628 $100k+
## 7 40879 $100k+
## 8 26987 $100k+
## 9 25525 $100k+
## 10 10042 $100k+
I created a RichDistricts data frame that contains only the districts in the “$100k+” category. I used the filter() function to keep those rows, and the head() function shows only the first 10 rows.
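A minimal sketch of the same filter() step on made-up rows. The toy data frame and its values are hypothetical.

```r
library(dplyr)

# Made-up districts with a Level label already assigned
toy <- data.frame(
  District = c("District 1", "District 2", "District 3"),
  Level    = c("<$100k", "$100k+", "$100k+"))

# filter() keeps only the rows where the condition is TRUE
toy_rich <- toy %>%
  filter(Level == "$100k+")
toy_rich
```

The double equals sign tests for equality, so only rows whose Level is exactly “$100k+” are kept.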
Task 7: Write code that sorts the “RichDistricts” data frame by the “HHInc2022” variable, in descending order (biggest value at the top), and displays the first 10 rows of the data frame. Explain which function you used.
Response:
RichDistricts <- RichDistricts %>%
  arrange(desc(HHInc2022))
head(RichDistricts,10)
## GEOID District County HHInc2022 Significance HHInc2017
## 1 4718791328 District 7 Williamson County 181709 Significant 150106
## 2 4718791138 District 6 Williamson County 178665 Significant 154149
## 3 4703796460 District 34 Davidson County 161370 Significant 128421
## 4 4718790948 District 5 Williamson County 159737 Significant 121622
## 5 4718791708 District 9 Williamson County 144924 Significant 117526
## 6 4718993040 District 16 Wilson County 125785 Significant 74528
## 7 4718990570 District 3 Wilson County 125324 Significant 80577
## 8 4718790758 District 4 Williamson County 124237 Nonsignificant 127552
## 9 4718790378 District 2 Williamson County 123609 Significant 101687
## 10 4718990380 District 2 Wilson County 120302 Significant 98846
## Change Level
## 1 31603 $100k+
## 2 24516 $100k+
## 3 32949 $100k+
## 4 38115 $100k+
## 5 27398 $100k+
## 6 51257 $100k+
## 7 44747 $100k+
## 8 -3315 $100k+
## 9 21922 $100k+
## 10 21456 $100k+
Lastly, I sorted the RichDistricts data frame by HHInc2022 in descending order. I used the arrange() function together with desc() to put the biggest value at the top.
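The sorting step can be sketched with three income values taken from the RichDistricts output in Task 6, placed in a small data frame. The name toy is hypothetical.

```r
library(dplyr)

# Three HHInc2022 values copied from the Task 6 output
toy <- data.frame(
  District  = c("District 4", "District 23", "District 34"),
  HHInc2022 = c(106260, 117474, 161370))

# arrange() sorts ascending by default; wrapping the column in desc() reverses that
toy_sorted <- toy %>%
  arrange(desc(HHInc2022))
toy_sorted
```

Without desc(), the same call would put the smallest income at the top.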
Task 8: Look back over the results you got and summarize, in a few sentences, at least one interesting pattern you discovered pertaining to how income levels vary across districts and/or counties in the Nashville area.
Response:
An interesting pattern I discovered is that districts in Williamson County have noticeably higher incomes than districts in the other counties in the Nashville area. Davidson and Wilson counties also have districts among the 10 highest, but Williamson shows up most often: it accounts for 4 of the top 5 rows and 6 of the top 10.