Midterm

Task 1: Accompany the code with a brief explanation of how the code works.

Response: This code installs the tidyverse package if it is not already installed, then loads it into the R session so that its functions are available to the rest of the code. Next, it uses the read.csv() function to read the Inc2022.csv data from its URL into a data frame called Inc2022. Finally, it uses the same function to read the Inc2017.csv data into a data frame called Inc2017.

if (!require(tidyverse)) install.packages("tidyverse")
library(tidyverse)

Inc2022 <- read.csv("https://drkblake.com/wp-content/uploads/2024/02/Inc2022.csv")

Inc2017 <- read.csv("https://drkblake.com/wp-content/uploads/2024/02/Inc2017.csv")

Task 2: Explain which data frame was your “to” data frame, which was your “from” data frame, and which variables served as your “key column” variables.

Response: The 2022 data frame (Inc2022) was my "to" data frame because it is the first argument of left_join, so the 2017 column was added to it. The 2017 data frame (Inc2017) was my "from" data frame because it supplies the baseline income values. GEOID served as the key column variable in both data frames, as specified in the by argument of the join.

Income <- left_join(Inc2022,Inc2017,
                          by = join_by(GEOID == GEOID))
head(Income, 10)

##         GEOID   District          County HHInc2022   Significance HHInc2017
## 1  4702190022 District 1 Cheatham County     59741    Significant     44091
## 2  4702190212 District 2 Cheatham County     73634    Significant     55247
## 3  4702190402 District 3 Cheatham County     89861 Nonsignificant     73802
## 4  4702190592 District 4 Cheatham County     73293    Significant     49731
## 5  4702190782 District 5 Cheatham County     78380    Significant     60793
## 6  4702190972 District 6 Cheatham County     92305    Significant     66750
## 7  4703790038 District 1 Davidson County     75038    Significant     56258
## 8  4703790228 District 2 Davidson County     54346    Significant     32487
## 9  4703790418 District 3 Davidson County     67953    Significant     49402
## 10 4703790608 District 4 Davidson County    106260    Significant     79381
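
To illustrate the "to" and "from" roles, here is a minimal sketch on toy data. The data frames to_df and from_df and their values are hypothetical, not taken from the real files. It assumes dplyr 1.1.0 or later, which the join_by() call above already requires.

```r
library(dplyr)

# Hypothetical toy data, not the real Inc2022/Inc2017 files
to_df   <- data.frame(GEOID = c(1, 2, 3), HHInc2022 = c(60000, 75000, 110000))
from_df <- data.frame(GEOID = c(1, 2, 4), HHInc2017 = c(45000, 55000, 80000))

# left_join keeps every row of the "to" data frame (first argument)
# and adds matching columns from the "from" data frame (second argument)
joined <- left_join(to_df, from_df, by = join_by(GEOID))
print(joined)
```

In this sketch, GEOID 3 has no match in from_df, so its HHInc2017 is NA, and GEOID 4 is dropped because it does not appear in to_df.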

Task 3: Explain which function you used to create the “Change” variable.

Response: I used the mutate function, which adds a new variable to the Income data frame. Here the new variable is Change, calculated as HHInc2022 minus HHInc2017.

  Income <- Income %>% 
    mutate(Change = HHInc2022 - HHInc2017)
  head(Income, 10)

##         GEOID   District          County HHInc2022   Significance HHInc2017
## 1  4702190022 District 1 Cheatham County     59741    Significant     44091
## 2  4702190212 District 2 Cheatham County     73634    Significant     55247
## 3  4702190402 District 3 Cheatham County     89861 Nonsignificant     73802
## 4  4702190592 District 4 Cheatham County     73293    Significant     49731
## 5  4702190782 District 5 Cheatham County     78380    Significant     60793
## 6  4702190972 District 6 Cheatham County     92305    Significant     66750
## 7  4703790038 District 1 Davidson County     75038    Significant     56258
## 8  4703790228 District 2 Davidson County     54346    Significant     32487
## 9  4703790418 District 3 Davidson County     67953    Significant     49402
## 10 4703790608 District 4 Davidson County    106260    Significant     79381
##    Change
## 1   15650
## 2   18387
## 3   16059
## 4   23562
## 5   17587
## 6   25555
## 7   18780
## 8   21859
## 9   18551
## 10  26879

Task 4: Explain which functions you used to create the “Level” variable.

Response: I used the mutate function together with the case_when function. The mutate function adds the new Level variable to the Income data frame, and case_when assigns each district a category based on HHInc2022: "$100k+" when the value is above 99,999 and "<$100k" when it is below 100,000. Any row that meets neither condition, such as one with a missing income value, is labeled "Error".

  Income <- Income %>% 
    mutate(Level = case_when(HHInc2022 > 99999 ~ "$100k+",
                             HHInc2022 < 100000 ~ "<$100k",
                             .default = "Error"))
  head(Income,10)

##         GEOID   District          County HHInc2022   Significance HHInc2017
## 1  4702190022 District 1 Cheatham County     59741    Significant     44091
## 2  4702190212 District 2 Cheatham County     73634    Significant     55247
## 3  4702190402 District 3 Cheatham County     89861 Nonsignificant     73802
## 4  4702190592 District 4 Cheatham County     73293    Significant     49731
## 5  4702190782 District 5 Cheatham County     78380    Significant     60793
## 6  4702190972 District 6 Cheatham County     92305    Significant     66750
## 7  4703790038 District 1 Davidson County     75038    Significant     56258
## 8  4703790228 District 2 Davidson County     54346    Significant     32487
## 9  4703790418 District 3 Davidson County     67953    Significant     49402
## 10 4703790608 District 4 Davidson County    106260    Significant     79381
##    Change  Level
## 1   15650 <$100k
## 2   18387 <$100k
## 3   16059 <$100k
## 4   23562 <$100k
## 5   17587 <$100k
## 6   25555 <$100k
## 7   18780 <$100k
## 8   21859 <$100k
## 9   18551 <$100k
## 10  26879 $100k+
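
As a quick check of how the case_when conditions behave at the boundary, here is a small sketch on a toy data frame. The toy data frame and the values 99999, 100000, and NA are hypothetical test inputs; the other two values appear in the output above.

```r
library(dplyr)

# Toy incomes: two real values from the output plus hypothetical edge cases
toy <- data.frame(HHInc2022 = c(59741, 99999, 100000, 106260, NA))

toy <- toy %>%
  mutate(Level = case_when(HHInc2022 > 99999 ~ "$100k+",
                           HHInc2022 < 100000 ~ "<$100k",
                           .default = "Error"))
print(toy)
```

The missing value matches neither condition, so it falls through to the .default label, which makes such rows easy to spot.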

Task 5: Create a data frame called “LevelbyCounty” that aggregates the data to show how many districts in each county are in the “Level” variable’s “$100k+” and “<$100k” categories. Show the “LevelbyCounty” data frame’s first 10 rows. Explain the functions you used.

Response: To create the LevelbyCounty data frame, I used the group_by function to group the Income data by County and Level, then the summarize function with n() to count the districts in each group. I then used the pivot_wider function to spread the two Level categories into their own columns, leaving one row per county. Finally, I used the head function to show the first 10 rows, as I have throughout this assignment. Only seven rows appear because the data covers seven counties.

LevelbyCounty <- Income %>%
   group_by(County, Level) %>%
   summarize( Count = n()) %>%
   pivot_wider( names_from = Level,
                values_from = Count)

## `summarise()` has grouped output by 'County'. You can override using the
## `.groups` argument.

  head(LevelbyCounty, 10)

## # A tibble: 7 × 3
## # Groups:   County [7]
##   County            `<$100k` `$100k+`
##   <chr>                <int>    <int>
## 1 Cheatham County          6       NA
## 2 Davidson County         29        6
## 3 Robertson County        11        1
## 4 Rutherford County       18        3
## 5 Sumner County           10        2
## 6 Williamson County        2       10
## 7 Wilson County           15       10
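
The NA for Cheatham County means no district there fell in the "$100k+" category. One possible variation, sketched below on hypothetical toy data, fills such gaps with 0 using pivot_wider's values_fill argument and uses .groups = "drop" to avoid the grouping message.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: "A County" has no "$100k+" districts
toy <- data.frame(
  County = c("A County", "A County", "B County", "B County", "B County"),
  Level  = c("<$100k", "<$100k", "<$100k", "$100k+", "$100k+")
)

counts <- toy %>%
  group_by(County, Level) %>%
  summarize(Count = n(), .groups = "drop") %>%   # drop grouping, no message
  pivot_wider(names_from = Level,
              values_from = Count,
              values_fill = 0)                   # missing combinations become 0
print(counts)
```

Whether 0 or NA is preferable depends on the assignment; this is only one way to present the same counts.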

Task 6: Write code that creates a “RichDistricts” data frame containing rows for only those districts in the “Level” variable’s “$100k+” category and shows the first 10 rows of the “RichDistricts” data frame. Explain which function you used.

Response: I used the filter function to create the new RichDistricts data frame, keeping only the rows of Income where Level is "$100k+". I also used the arrange function so that the districts are shown in descending order of income.

RichDistricts <- Income %>% 
  filter(Level == "$100k+")
  
RichDistricts <- RichDistricts %>% 
  arrange(desc(HHInc2022))

Task 7: Write code that sorts the “RichDistricts” data frame by the “HHInc2022” variable, in descending order (biggest value at the top), and displays the first 10 rows of the data frame. Explain which function you used.

Response: I used this code in tandem with Task 6, where the RichDistricts data frame was created. Here, the arrange function with desc(HHInc2022) sorts RichDistricts by 2022 household income in descending order, so the highest-income district is at the top.

RichDistricts <- RichDistricts %>% 
  arrange(desc(HHInc2022))
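
For reference, the filtering and sorting steps from Tasks 6 and 7 can also be chained into one pipeline and displayed with head. The sketch below runs on a hypothetical toy data frame; only the values 59741 and 106260 come from the output earlier in this document.

```r
library(dplyr)

# Hypothetical toy data standing in for the Income data frame
toy <- data.frame(
  District  = c("District 1", "District 2", "District 3", "District 4"),
  HHInc2022 = c(59741, 106260, 125000, 101500),
  Level     = c("<$100k", "$100k+", "$100k+", "$100k+")
)

rich <- toy %>%
  filter(Level == "$100k+") %>%   # keep only the $100k+ districts
  arrange(desc(HHInc2022))        # biggest income at the top

head(rich, 10)                    # show up to the first 10 rows
```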

Task 8: Look back over the results you got and summarize, in a few sentences, at least one interesting pattern you discovered pertaining to how income levels vary across districts and/or counties in the Nashville area.

Response: I noticed that Rutherford County is thinly represented among the rich districts: only 3 of its 21 districts fall in the "$100k+" category. This may be related to the student population in Rutherford County, although the data here cannot confirm that. By contrast, Williamson County (10 of 12 districts) and Davidson County (6 of 35 districts) show up more often in the RichDistricts data.
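
One way to compare counties of different sizes is to look at the share of districts in the "$100k+" category rather than the raw count. The sketch below re-enters the counts from the LevelbyCounty output by hand, with Cheatham's NA entered as 0. The names LevelCounts, Under100k, Over100k, and ShareOver100k are hypothetical, chosen for this example.

```r
library(dplyr)

# Counts copied from the LevelbyCounty output; Cheatham's NA entered as 0
LevelCounts <- data.frame(
  County = c("Cheatham County", "Davidson County", "Robertson County",
             "Rutherford County", "Sumner County", "Williamson County",
             "Wilson County"),
  Under100k = c(6, 29, 11, 18, 10, 2, 15),
  Over100k  = c(0, 6, 1, 3, 2, 10, 10)
)

Shares <- LevelCounts %>%
  mutate(Total = Under100k + Over100k,
         ShareOver100k = round(Over100k / Total, 2)) %>%
  arrange(desc(ShareOver100k))   # highest share of $100k+ districts first
print(Shares)
```

Sorting by share rather than count puts each county's rich districts in proportion to its total number of districts.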

William Wright

2024-03-01