Task 1: Accompany the code with a brief explanation of how the code works.
Response: This code installs the tidyverse package if it is not already installed, then loads it into the current R session so that its functions are available to the rest of the code. Next, it uses the read.csv() function to read the Inc2022.csv file from its URL into a data frame called Inc2022. Finally, it uses the same function to read the Inc2017.csv file into a data frame called Inc2017.
if(!require(tidyverse))install.packages("tidyverse")
library(tidyverse)
Inc2022 <- read.csv("https://drkblake.com/wp-content/uploads/2024/02/Inc2022.csv")
Inc2017 <- read.csv("https://drkblake.com/wp-content/uploads/2024/02/Inc2017.csv")
Task 2: Explain which data frame was your “to” data frame, which was your “from” data frame, and which variables served as your “key column” variables.
Response: Inc2022 was my “to” data frame, because it is listed first in left_join() and all of its rows are kept. Inc2017 was my “from” data frame, because its HHInc2017 column is pulled in as the 2017 baseline. GEOID served as the key column variable in both data frames, since the join matches rows on it.
Income <- left_join(Inc2022,Inc2017,
by = join_by(GEOID == GEOID))
head(Income, 10)
## GEOID District County HHInc2022 Significance HHInc2017
## 1 4702190022 District 1 Cheatham County 59741 Significant 44091
## 2 4702190212 District 2 Cheatham County 73634 Significant 55247
## 3 4702190402 District 3 Cheatham County 89861 Nonsignificant 73802
## 4 4702190592 District 4 Cheatham County 73293 Significant 49731
## 5 4702190782 District 5 Cheatham County 78380 Significant 60793
## 6 4702190972 District 6 Cheatham County 92305 Significant 66750
## 7 4703790038 District 1 Davidson County 75038 Significant 56258
## 8 4703790228 District 2 Davidson County 54346 Significant 32487
## 9 4703790418 District 3 Davidson County 67953 Significant 49402
## 10 4703790608 District 4 Davidson County 106260 Significant 79381
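As a hedged sketch of how the “to” and “from” roles play out, the example below uses two small made-up data frames (to_df and from_df are hypothetical stand-ins, not the real Inc2022 and Inc2017 files). It assumes dplyr 1.1.0 or later, which is what join_by() requires. A left join keeps every “to” row, and anti_join() is one way to list the “to” rows that found no match.

```r
library(dplyr)

# Hypothetical toy data standing in for Inc2022 ("to") and Inc2017 ("from")
to_df   <- data.frame(GEOID = c(1, 2, 3), HHInc2022 = c(60000, 75000, 90000))
from_df <- data.frame(GEOID = c(1, 2),    HHInc2017 = c(45000, 55000))

# left_join() keeps every row of the first ("to") data frame;
# GEOID 3 has no 2017 match, so its HHInc2017 is NA
joined <- left_join(to_df, from_df, by = join_by(GEOID))

# anti_join() returns the "to" rows with no match in the "from" data frame
unmatched <- anti_join(to_df, from_df, by = join_by(GEOID))
```

If unmatched has zero rows, every key in the “to” data frame was found in the “from” data frame.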
Task 3: Explain which function you used to create the “Change” variable.
Response: I used the mutate() function to add a new variable, Change, to the Income data frame. It is calculated as HHInc2022 minus HHInc2017.
Income <- Income %>%
mutate(Change = HHInc2022 - HHInc2017)
head(Income, 10)
## GEOID District County HHInc2022 Significance HHInc2017
## 1 4702190022 District 1 Cheatham County 59741 Significant 44091
## 2 4702190212 District 2 Cheatham County 73634 Significant 55247
## 3 4702190402 District 3 Cheatham County 89861 Nonsignificant 73802
## 4 4702190592 District 4 Cheatham County 73293 Significant 49731
## 5 4702190782 District 5 Cheatham County 78380 Significant 60793
## 6 4702190972 District 6 Cheatham County 92305 Significant 66750
## 7 4703790038 District 1 Davidson County 75038 Significant 56258
## 8 4703790228 District 2 Davidson County 54346 Significant 32487
## 9 4703790418 District 3 Davidson County 67953 Significant 49402
## 10 4703790608 District 4 Davidson County 106260 Significant 79381
## Change
## 1 15650
## 2 18387
## 3 16059
## 4 23562
## 5 17587
## 6 25555
## 7 18780
## 8 21859
## 9 18551
## 10 26879
Task 4: Explain which functions you used to create the “Level” variable.
Response: I used the mutate() function together with the case_when() function. mutate() adds the new Level variable to the Income data frame, and case_when() assigns “$100k+” when HHInc2022 is above 99999 and “<$100k” when it is below 100000. Any row that meets neither condition is labeled “Error”.
Income <- Income %>%
mutate(Level = case_when(HHInc2022 > 99999 ~ "$100k+",
HHInc2022 < 100000 ~ "<$100k",
.default = "Error"))
head(Income,10)
## GEOID District County HHInc2022 Significance HHInc2017
## 1 4702190022 District 1 Cheatham County 59741 Significant 44091
## 2 4702190212 District 2 Cheatham County 73634 Significant 55247
## 3 4702190402 District 3 Cheatham County 89861 Nonsignificant 73802
## 4 4702190592 District 4 Cheatham County 73293 Significant 49731
## 5 4702190782 District 5 Cheatham County 78380 Significant 60793
## 6 4702190972 District 6 Cheatham County 92305 Significant 66750
## 7 4703790038 District 1 Davidson County 75038 Significant 56258
## 8 4703790228 District 2 Davidson County 54346 Significant 32487
## 9 4703790418 District 3 Davidson County 67953 Significant 49402
## 10 4703790608 District 4 Davidson County 106260 Significant 79381
## Change Level
## 1 15650 <$100k
## 2 18387 <$100k
## 3 16059 <$100k
## 4 23562 <$100k
## 5 17587 <$100k
## 6 25555 <$100k
## 7 18780 <$100k
## 8 21859 <$100k
## 9 18551 <$100k
## 10 26879 $100k+
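To illustrate when the “Error” label would actually appear, here is a minimal sketch on a made-up vector (incomes is hypothetical, not taken from the data). It writes the first condition as >= 100000, which is equivalent to > 99999 for whole-dollar values, and it assumes dplyr 1.1.0 or later for the .default argument.

```r
library(dplyr)

# Hypothetical incomes, including a boundary value and a missing value
incomes <- c(59741, 100000, 106260, NA)

level <- case_when(
  incomes >= 100000 ~ "$100k+",
  incomes <  100000 ~ "<$100k",
  .default = "Error"   # used only when no condition is TRUE, such as for NA
)
```

Because a comparison with NA is never TRUE, a missing income is the kind of value that would fall through to “Error”.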
Task 5: Create a data frame called “LevelbyCounty” that aggregates the data to show how many districts in each county are in the “Level” variable’s “$100k+” and “<$100k” categories. Show the “LevelbyCounty” data frame’s first 10 rows. Explain the functions you used.
Response: To create the LevelbyCounty data frame, I used the group_by() function to group the rows of the Income data frame by County and Level, then the summarize() function with n() to count the districts in each group. Next, I used the pivot_wider() function to turn each Level category into its own column, with the counts as the values. Finally, I used the head() function to show the first 10 rows, as I have throughout this assignment.
LevelbyCounty <- Income %>%
group_by(County, Level) %>%
summarize( Count = n()) %>%
pivot_wider( names_from = Level,
values_from = Count)
## `summarise()` has grouped output by 'County'. You can override using the
## `.groups` argument.
head(LevelbyCounty, 10)
## # A tibble: 7 × 3
## # Groups: County [7]
## County `<$100k` `$100k+`
## <chr> <int> <int>
## 1 Cheatham County 6 NA
## 2 Davidson County 29 6
## 3 Robertson County 11 1
## 4 Rutherford County 18 3
## 5 Sumner County 10 2
## 6 Williamson County 2 10
## 7 Wilson County 15 10
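The NA for Cheatham County means that no rows existed for that County and Level combination, so it can be read as zero districts. As a hedged sketch on made-up data (toy and level_by_county are hypothetical names), pivot_wider()'s values_fill argument can fill such gaps with 0, and count() is a shorthand for group_by() plus summarize(n()) that returns ungrouped output.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: county A has no "$100k+" districts
toy <- data.frame(
  County = c("A", "A", "B", "B", "B"),
  Level  = c("<$100k", "<$100k", "<$100k", "$100k+", "$100k+")
)

level_by_county <- toy %>%
  count(County, Level, name = "Count") %>%  # one row per County/Level pair
  pivot_wider(names_from = Level,
              values_from = Count,
              values_fill = 0)              # missing pairs become 0, not NA
```

Whether to show 0 or NA is a presentation choice; 0 makes the table easier to total, while NA makes the absent combination explicit.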
Task 6: Write code that creates a “RichDistricts” data frame containing rows for only those districts in the “Level” variable’s “$100k+” category and shows the first 10 rows of the “RichDistricts” data frame. Explain which function you used.
Response: I used the filter() function to create the new RichDistricts data frame, keeping only the rows where Level is “$100k+”. I also used the arrange() function with desc() so that the districts are shown in descending order of HHInc2022.
RichDistricts <- Income %>%
filter(Level == "$100k+")
RichDistricts <- RichDistricts %>%
arrange(desc(HHInc2022))
Task 7: Write code that sorts the “RichDistricts” data frame by the “HHInc2022” variable, in descending order (biggest value at the top), and displays the first 10 rows of the data frame. Explain which function you used.
Response: I reused the code from Task 6, where I created the RichDistricts data frame. The arrange() function, combined with desc(), sorts RichDistricts by HHInc2022 in descending order, so the district with the highest 2022 income is at the top.
RichDistricts <- RichDistricts %>%
arrange(desc(HHInc2022))
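For reference, the filtering and sorting steps from Tasks 6 and 7 can also be chained in one pipeline. The sketch below runs on a made-up data frame (toy and rich are hypothetical names, and the district labels are invented) and ends with head() to display the first rows.

```r
library(dplyr)

# Hypothetical toy data standing in for the Income data frame
toy <- data.frame(
  District  = c("D1", "D2", "D3", "D4"),
  HHInc2022 = c(106260, 59741, 120500, 101000),
  Level     = c("$100k+", "<$100k", "$100k+", "$100k+")
)

rich <- toy %>%
  filter(Level == "$100k+") %>%  # keep only the $100k+ rows
  arrange(desc(HHInc2022))       # largest 2022 income first

head(rich, 10)                   # shows up to 10 rows (here, all 3)
```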
Task 8: Look back over the results you got and summarize, in a few sentences, at least one interesting pattern you discovered pertaining to how income levels vary across districts and/or counties in the Nashville area.
Response: One interesting pattern is that Rutherford County is thinly represented among the $100k+ districts. Only 3 of its 21 districts fall in that category, although Cheatham County (none) and Robertson County (1) have even fewer. One possible explanation, which this data cannot confirm, is Rutherford County's large student population. By contrast, Williamson County (10 of 12 districts) and Wilson County (10 of 25) account for most of the rich districts, and Davidson County adds 6 of its 35.