Task 1: This code checks whether the tidyverse package is installed, installs it if it is missing, and then loads it. It then uses the read.csv() function to load the Inc2022.csv data into a data frame named Inc2022, and repeats the process to load the Inc2017.csv data into the Inc2017 data frame.
#Task 1
if(!require(tidyverse))install.packages("tidyverse")
library(tidyverse)
Inc2022 <- read.csv("https://drkblake.com/wp-content/uploads/2024/02/Inc2022.csv")
Inc2017 <- read.csv("https://drkblake.com/wp-content/uploads/2024/02/Inc2017.csv")
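A small aside on the read step, offered as a sketch rather than part of the assignment: read.csv() guesses each column's type, so an ID column such as GEOID is read as a number. The Tennessee GEOIDs here start with 47, so nothing is lost, but an ID with a leading zero would be changed. The CSV text and the GEOID values below are invented for illustration; only read.csv() and its colClasses argument are real.

```r
# Toy CSV text standing in for a downloaded file (all values are invented).
csv_text <- "GEOID,HHInc2022
0100000001,50000
0100000002,120000"

# Default behavior: GEOID is parsed as a number, so the leading zero is dropped.
as_number <- read.csv(text = csv_text)

# Assumption: GEOID should be treated as an identifier, so read it as text.
as_text <- read.csv(text = csv_text, colClasses = c(GEOID = "character"))

str(as_number$GEOID)
str(as_text$GEOID)
```

Either version would work as a join key in this assignment, as long as both data frames store GEOID as the same type.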
Task 2: This code uses the left_join() function to add the columns of the Inc2022 data frame to the Inc2017 data frame, matching rows on GEOID, which serves as the “key column” in both data frames. The joined data frame is stored under the name Income. Finally, the head() function is used to display the first 10 rows of Income.
#Task 2
Income <- left_join(Inc2017,
                    Inc2022,
                    by = join_by(GEOID == GEOID))
head(Income, 10)
## GEOID HHInc2017 District County HHInc2022 Significance
## 1 4702190022 44091 District 1 Cheatham County 59741 Significant
## 2 4702190212 55247 District 2 Cheatham County 73634 Significant
## 3 4702190402 73802 District 3 Cheatham County 89861 Nonsignificant
## 4 4702190592 49731 District 4 Cheatham County 73293 Significant
## 5 4702190782 60793 District 5 Cheatham County 78380 Significant
## 6 4702190972 66750 District 6 Cheatham County 92305 Significant
## 7 4703790038 56258 District 1 Davidson County 75038 Significant
## 8 4703790228 32487 District 2 Davidson County 54346 Significant
## 9 4703790418 49402 District 3 Davidson County 67953 Significant
## 10 4703790608 79381 District 4 Davidson County 106260 Significant
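To make the join behavior concrete, here is a minimal sketch on two made-up tables. The names toy2017, toy2022, and toy_income and all of their values are hypothetical; they only mimic the shape of Inc2017 and Inc2022.

```r
library(dplyr)

# Toy stand-ins for Inc2017 and Inc2022 (all values invented).
toy2017 <- data.frame(GEOID = c(1, 2, 3),
                      HHInc2017 = c(40000, 55000, 70000))
toy2022 <- data.frame(GEOID = c(1, 2, 4),
                      HHInc2022 = c(52000, 68000, 99000))

# left_join() keeps every row of the first (left) table and adds the
# matching columns from the second table.
toy_income <- left_join(toy2017, toy2022, by = join_by(GEOID))

print(toy_income)
```

In this sketch, GEOID 3 has no match in toy2022, so its HHInc2022 is NA, and GEOID 4 is dropped because it does not appear in the left table. In the real data every district shown has both income values, so neither case is visible in the output above.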
Task 3: This code adds a new variable, Change, to the Income data frame via the mutate() function. Change equals the value in the HHInc2022 column minus the value in the HHInc2017 column, so a positive Change means median household income rose between 2017 and 2022. Lastly, the head() function mentioned above is reused to display the first 10 rows of the updated Income data frame.
#Task 3
Income <- Income %>%
  mutate(Change = HHInc2022 - HHInc2017)
head(Income, 10)
## GEOID HHInc2017 District County HHInc2022 Significance
## 1 4702190022 44091 District 1 Cheatham County 59741 Significant
## 2 4702190212 55247 District 2 Cheatham County 73634 Significant
## 3 4702190402 73802 District 3 Cheatham County 89861 Nonsignificant
## 4 4702190592 49731 District 4 Cheatham County 73293 Significant
## 5 4702190782 60793 District 5 Cheatham County 78380 Significant
## 6 4702190972 66750 District 6 Cheatham County 92305 Significant
## 7 4703790038 56258 District 1 Davidson County 75038 Significant
## 8 4703790228 32487 District 2 Davidson County 54346 Significant
## 9 4703790418 49402 District 3 Davidson County 67953 Significant
## 10 4703790608 79381 District 4 Davidson County 106260 Significant
## Change
## 1 15650
## 2 18387
## 3 16059
## 4 23562
## 5 17587
## 6 25555
## 7 18780
## 8 21859
## 9 18551
## 10 26879
Task 4: This code reuses the mutate() function to add a new variable called Level. Level is assigned by the case_when() function, which gives one of two possible values. If a district's HHInc2022 is greater than or equal to 100,000, the district is labeled “$100k+”; if it is less than 100,000, the district is labeled “<$100k”.
#Task 4
Income <- Income %>%
  mutate(Level = case_when(HHInc2022 >= 100000 ~ "$100k+",
                           HHInc2022 < 100000 ~ "<$100k"))
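The labeling rule can be checked on a few boundary values. This is only a sketch: the data frame toy and its incomes are invented, but the case_when() call mirrors the Task 4 code.

```r
library(dplyr)

# Invented incomes chosen to sit just below, exactly at, and above the cutoff,
# plus one missing value.
toy <- data.frame(HHInc2022 = c(99999, 100000, 150000, NA))

toy <- toy %>%
  mutate(Level = case_when(HHInc2022 >= 100000 ~ "$100k+",
                           HHInc2022 < 100000 ~ "<$100k"))

print(toy)
```

An income of exactly 100,000 lands in “$100k+” because the first condition uses >=. A missing income matches neither condition, so case_when() leaves its Level as NA.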
Task 5: This code uses four functions, three of which transform the Income data frame into a new data frame named LevelByCounty. First, the group_by() function groups Income by the County variable and then, within each County, by the Level variable. Next, the summarize() function counts how many rows (districts) fall in each County and Level group. After that comes the pivot_wider() function, which reorganizes the counts into a table with one row per county and one column per Level value. A county with no districts at a given Level shows NA in that column, as Cheatham County does for “$100k+”. Finally, the fourth function is the aforementioned head() function, now being used to display the first 10 rows of LevelByCounty; because there are only seven counties, all seven rows appear.
#Task 5
LevelByCounty <- Income %>%
  group_by(County, Level) %>%
  summarize(Count = n()) %>%
  pivot_wider(names_from = Level,
              values_from = Count)
head(LevelByCounty, 10)
## # A tibble: 7 × 3
## # Groups: County [7]
## County `<$100k` `$100k+`
## <chr> <int> <int>
## 1 Cheatham County 6 NA
## 2 Davidson County 29 6
## 3 Robertson County 11 1
## 4 Rutherford County 18 3
## 5 Sumner County 10 2
## 6 Williamson County 2 10
## 7 Wilson County 15 10
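The NA for Cheatham County means "no rows in that group", which for a count is the same as zero. As a hedged variation, not what the assignment code does, pivot_wider() accepts a values_fill argument that writes 0 instead. The toy data frame below is invented; "A County" and "B County" are placeholder names.

```r
library(dplyr)
library(tidyr)

# Invented district labels: A County has no "$100k+" districts.
toy <- data.frame(
  County = c("A County", "A County", "B County", "B County", "B County"),
  Level  = c("<$100k", "<$100k", "<$100k", "$100k+", "$100k+")
)

toy_counts <- toy %>%
  group_by(County, Level) %>%
  summarize(Count = n(), .groups = "drop") %>%  # drop grouping after counting
  pivot_wider(names_from = Level,
              values_from = Count,
              values_fill = 0)                  # empty cells become 0, not NA

print(toy_counts)
```

Here .groups = "drop" only removes the leftover grouping and the message summarize() would otherwise print; the counts themselves are the same as with the original call.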
Task 6: The code for Task 6 is a “return to form” of sorts, going back to making new data frames out of the Income data frame. Income is filtered via the filter() function to keep only the districts whose Level value is “$100k+”, and the result is stored in a new data frame: RichDistricts. The code ends with a head() call that displays the first 10 rows of the RichDistricts data frame.
#Task 6
RichDistricts <- filter(Income, Level == "$100k+")
head(RichDistricts, 10)
## GEOID HHInc2017 District County HHInc2022 Significance
## 1 4703790608 79381 District 4 Davidson County 106260 Significant
## 2 4703794218 86768 District 23 Davidson County 117474 Significant
## 3 4703794408 79959 District 24 Davidson County 110739 Significant
## 4 4703794598 90998 District 25 Davidson County 120206 Significant
## 5 4703796460 128421 District 34 Davidson County 161370 Significant
## 6 4703796650 99666 District 35 Davidson County 117294 Significant
## 7 4714791098 65179 District 6 Robertson County 106058 Significant
## 8 4714991480 75875 District 8 Rutherford County 102862 Significant
## 9 4714991670 77481 District 9 Rutherford County 103006 Significant
## 10 4714992620 95708 District 14 Rutherford County 105750 Nonsignificant
## Change Level
## 1 26879 $100k+
## 2 30706 $100k+
## 3 30780 $100k+
## 4 29208 $100k+
## 5 32949 $100k+
## 6 17628 $100k+
## 7 40879 $100k+
## 8 26987 $100k+
## 9 25525 $100k+
## 10 10042 $100k+
Task 7: This code takes the previously mentioned RichDistricts data frame and uses the arrange() function to sort its rows by one of its variables. In this case, the desc() function indicates that the rows should be arranged in descending order of the HHInc2022 variable, so the highest-income district comes first. Finally, the head() function is reused to display the first 10 rows of the newly arranged RichDistricts.
#Task 7
RichDistricts <- arrange(RichDistricts, desc(HHInc2022))
head(RichDistricts, 10)
## GEOID HHInc2017 District County HHInc2022 Significance
## 1 4718791328 150106 District 7 Williamson County 181709 Significant
## 2 4718791138 154149 District 6 Williamson County 178665 Significant
## 3 4703796460 128421 District 34 Davidson County 161370 Significant
## 4 4718790948 121622 District 5 Williamson County 159737 Significant
## 5 4718791708 117526 District 9 Williamson County 144924 Significant
## 6 4718993040 74528 District 16 Wilson County 125785 Significant
## 7 4718990570 80577 District 3 Wilson County 125324 Significant
## 8 4718790758 127552 District 4 Williamson County 124237 Nonsignificant
## 9 4718790378 101687 District 2 Williamson County 123609 Significant
## 10 4718990380 98846 District 2 Wilson County 120302 Significant
## Change Level
## 1 31603 $100k+
## 2 24516 $100k+
## 3 32949 $100k+
## 4 38115 $100k+
## 5 27398 $100k+
## 6 51257 $100k+
## 7 44747 $100k+
## 8 -3315 $100k+
## 9 21922 $100k+
## 10 21456 $100k+
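Tasks 6 and 7 can also be written as one pipeline. The sketch below runs on an invented four-row table (toy and toy_rich are hypothetical names) and is only meant to show the filter-then-sort order, not to replace the code above.

```r
library(dplyr)

# Invented districts: three are at or above 100,000 and one is below.
toy <- data.frame(
  District  = c("District 1", "District 2", "District 3", "District 4"),
  HHInc2022 = c(101000, 95000, 150000, 120000),
  Level     = c("$100k+", "<$100k", "$100k+", "$100k+")
)

toy_rich <- toy %>%
  filter(Level == "$100k+") %>%   # keep only the "$100k+" rows
  arrange(desc(HHInc2022))        # sort the remaining rows, highest income first

print(toy_rich)
```

District 2 is removed by filter(), and the remaining rows come out in the order District 3, District 4, District 1.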
Task 8: I see two notable patterns in this data. First, the only county without at least one district in the RichDistricts data frame is Cheatham County, which suggests, though it does not prove, that Cheatham is the least affluent county in the area. Second, and more interestingly, only one district in all of RichDistricts had a negative Change value: District 4 in Williamson County, with a Change of -3315. It is also one of four districts whose Change was “Nonsignificant”. Despite the decline, it still had the eighth-highest HHInc2022 value.
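Both observations can be checked with code rather than by scanning the tables. The sketch below shows one way to do that on invented data; toy_income, toy_rich, and the county names are hypothetical, and the same two checks would be run on Income and RichDistricts.

```r
library(dplyr)

# Invented stand-in for Income: B County has no "$100k+" district,
# and one "$100k+" district has a negative Change.
toy_income <- data.frame(
  County = c("A County", "A County", "B County", "C County"),
  Level  = c("$100k+", "<$100k", "<$100k", "$100k+"),
  Change = c(12000, 8000, 5000, -3000)
)
toy_rich <- filter(toy_income, Level == "$100k+")

# Counties that appear in the full table but not among the rich districts.
counties_without_rich <- base::setdiff(unique(toy_income$County),
                                       unique(toy_rich$County))

# Number of rich districts whose income fell between the two periods.
n_negative <- toy_rich %>%
  filter(Change < 0) %>%
  nrow()

print(counties_without_rich)
print(n_negative)
```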