Challenge 3_Tidy Data

Author

Jingyi Yang

1. Start Up

knitr::opts_chunk$set(echo = TRUE)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(here)

here() starts at C:/8-601

library(readr)
library(readxl)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

2. Import the data

2.1 Import the “eggs_tidy.csv” data set

setwd("C:\\8-601\\challenge_datasets")
egg <- read_csv("eggs_tidy.csv")

Rows: 120 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): month
dbl (5): year, large_half_dozen, large_dozen, extra_large_half_dozen, extra_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

egg  %>% print(n = 10, width = Inf)

# A tibble: 120 × 6
   month      year large_half_dozen large_dozen extra_large_half_dozen
   <chr>     <dbl>            <dbl>       <dbl>                  <dbl>
 1 January    2004             126         230                    132 
 2 February   2004             128.        226.                   134.
 3 March      2004             131         225                    137 
 4 April      2004             131         225                    137 
 5 May        2004             131         225                    137 
 6 June       2004             134.        231.                   137 
 7 July       2004             134.        234.                   137 
 8 August     2004             134.        234.                   137 
 9 September  2004             130.        234.                   136.
10 October    2004             128.        234.                   136.
   extra_large_dozen
               <dbl>
 1              230 
 2              230 
 3              230 
 4              234.
 5              236 
 6              241 
 7              241 
 8              241 
 9              241 
10              241 
# ℹ 110 more rows

2.2 Import the “australian_marriage*.xls” data set

Importing the Excel file with several sheets into R is hard, so reference the resource through the internet and used it below.

setwd("C:\\8-601\\challenge_datasets")

marriage <- here("australian_marriage_law_postal_survey_2017_-_response_final.xls") 
excel_sheets("australian_marriage_law_postal_survey_2017_-_response_final.xls")

[1] "Contents"          "Table 1"           "Table 2"          
[4] "Explanatory Notes"

marriage_1 <-read_excel("australian_marriage_law_postal_survey_2017_-_response_final.xls",  
                       sheet = "Table 1", range = "A8:P16", col_names = c("State and Territory", "Response clear_yes_number", "Response clear_yes_percentage", "Response clear_no_number", "Response clear_no_percentage", "Response clear_total_number","Response clear_total_percentage","delete", "Eligible Participants_Response clear_number","Eligible Participants_Response clear_percentage", "Eligible Participants_Response not clear(a)_number", "Eligible Participants_Response not clear(a)_percentage","Eligible Participants_Non-responding_number", "Eligible Participants_Non-responding_percentage", "Eligible Participants_total_number", "Eligible Participants_total_percentage"))

marriage_1 %>% print(width = Inf)

# A tibble: 9 × 16
  `State and Territory`           `Response clear_yes_number`
  <chr>                                                 <dbl>
1 New South Wales                                     2374362
2 Victoria                                            2145629
3 Queensland                                          1487060
4 South Australia                                      592528
5 Western Australia                                    801575
6 Tasmania                                             191948
7 Northern Territory(b)                                 48686
8 Australian Capital Territory(c)                      175459
9 Australia                                           7817247
  `Response clear_yes_percentage` `Response clear_no_number`
                            <dbl>                      <dbl>
1                            57.8                    1736838
2                            64.9                    1161098
3                            60.7                     961015
4                            62.5                     356247
5                            63.7                     455924
6                            63.6                     109655
7                            60.6                      31690
8                            74                        61520
9                            61.6                    4873987
  `Response clear_no_percentage` `Response clear_total_number`
                           <dbl>                         <dbl>
1                           42.2                       4111200
2                           35.1                       3306727
3                           39.3                       2448075
4                           37.5                        948775
5                           36.3                       1257499
6                           36.4                        301603
7                           39.4                         80376
8                           26                          236979
9                           38.4                      12691234
  `Response clear_total_percentage` delete
                              <dbl> <lgl> 
1                               100 NA    
2                               100 NA    
3                               100 NA    
4                               100 NA    
5                               100 NA    
6                               100 NA    
7                               100 NA    
8                               100 NA    
9                               100 NA    
  `Eligible Participants_Response clear_number`
                                          <dbl>
1                                       4111200
2                                       3306727
3                                       2448075
4                                        948775
5                                       1257499
6                                        301603
7                                         80376
8                                        236979
9                                      12691234
  `Eligible Participants_Response clear_percentage`
                                              <dbl>
1                                              79.2
2                                              81.4
3                                              77.7
4                                              79.5
5                                              78.3
6                                              79.5
7                                              58.2
8                                              82.3
9                                              79.3
  `Eligible Participants_Response not clear(a)_number`
                                                 <dbl>
1                                                11036
2                                                11028
3                                                 7088
4                                                 2778
5                                                 3188
6                                                  805
7                                                  229
8                                                  534
9                                                36686
  `Eligible Participants_Response not clear(a)_percentage`
                                                     <dbl>
1                                                      0.2
2                                                      0.3
3                                                      0.2
4                                                      0.2
5                                                      0.2
6                                                      0.2
7                                                      0.2
8                                                      0.2
9                                                      0.2
  `Eligible Participants_Non-responding_number`
                                          <dbl>
1                                       1065445
2                                        743634
3                                        695710
4                                        242027
5                                        346333
6                                         77020
7                                         57496
8                                         50595
9                                       3278260
  `Eligible Participants_Non-responding_percentage`
                                              <dbl>
1                                              20.5
2                                              18.3
3                                              22.1
4                                              20.3
5                                              21.6
6                                              20.3
7                                              41.6
8                                              17.6
9                                              20.5
  `Eligible Participants_total_number` `Eligible Participants_total_percentage`
                                 <dbl>                                    <dbl>
1                              5187681                                      100
2                              4061389                                      100
3                              3150873                                      100
4                              1193580                                      100
5                              1607020                                      100
6                               379428                                      100
7                               138101                                      100
8                               288108                                      100
9                             16006180                                      100

marriage_2 <-read_excel("australian_marriage_law_postal_survey_2017_-_response_final.xls",  
                       sheet = "Table 2", range = "A8:P183", col_names = c("places_regions", "Response clear_yes_number", "Response clear_yes_percentage", "Response clear_no_number", "Response clear_no_percentage", "Response clear_total_number","Response clear_total_percentage","delete", "Eligible Participants_Response clear_number","Eligible Participants_Response clear_percentage", "Eligible Participants_Response not clear(b)_number", "Eligible Participants_Response not clear(b)_percentage","Eligible Participants_Non-responding_number", "Eligible Participants_Non-responding_percentage", "Eligible Participants_total_number", "Eligible Participants_total_percentage"))

marriage_2  %>% print(n = 10, width = Inf)

# A tibble: 176 × 16
   places_regions            `Response clear_yes_number`
   <chr>                                           <dbl>
 1 New South Wales Divisions                          NA
 2 Banks                                           37736
 3 Barton                                          37153
 4 Bennelong                                       42943
 5 Berowra                                         48471
 6 Blaxland                                        20406
 7 Bradfield                                       53681
 8 Calare                                          54091
 9 Chifley                                         32871
10 Cook                                            47505
   `Response clear_yes_percentage` `Response clear_no_number`
                             <dbl>                      <dbl>
 1                            NA                           NA
 2                            44.9                      46343
 3                            43.6                      47984
 4                            49.8                      43215
 5                            54.6                      40369
 6                            26.1                      57926
 7                            60.6                      34927
 8                            60.2                      35779
 9                            41.3                      46702
10                            55                        38804
   `Response clear_no_percentage` `Response clear_total_number`
                            <dbl>                         <dbl>
 1                           NA                              NA
 2                           55.1                         84079
 3                           56.4                         85137
 4                           50.2                         86158
 5                           45.4                         88840
 6                           73.9                         78332
 7                           39.4                         88608
 8                           39.8                         89870
 9                           58.7                         79573
10                           45                           86309
   `Response clear_total_percentage` delete
                               <dbl> <lgl> 
 1                                NA NA    
 2                               100 NA    
 3                               100 NA    
 4                               100 NA    
 5                               100 NA    
 6                               100 NA    
 7                               100 NA    
 8                               100 NA    
 9                               100 NA    
10                               100 NA    
   `Eligible Participants_Response clear_number`
                                           <dbl>
 1                                            NA
 2                                         84079
 3                                         85137
 4                                         86158
 5                                         88840
 6                                         78332
 7                                         88608
 8                                         89870
 9                                         79573
10                                         86309
   `Eligible Participants_Response clear_percentage`
                                               <dbl>
 1                                              NA  
 2                                              79.9
 3                                              77.8
 4                                              81  
 5                                              84.5
 6                                              75  
 7                                              83.5
 8                                              77.8
 9                                              73.7
10                                              82  
   `Eligible Participants_Response not clear(b)_number`
                                                  <dbl>
 1                                                   NA
 2                                                  247
 3                                                  226
 4                                                  244
 5                                                  212
 6                                                  220
 7                                                  202
 8                                                  285
 9                                                  263
10                                                  229
   `Eligible Participants_Response not clear(b)_percentage`
                                                      <dbl>
 1                                                     NA  
 2                                                      0.2
 3                                                      0.2
 4                                                      0.2
 5                                                      0.2
 6                                                      0.2
 7                                                      0.2
 8                                                      0.2
 9                                                      0.2
10                                                      0.2
   `Eligible Participants_Non-responding_number`
                                           <dbl>
 1                                            NA
 2                                         20928
 3                                         24008
 4                                         19973
 5                                         16038
 6                                         25883
 7                                         17261
 8                                         25342
 9                                         28180
10                                         18713
   `Eligible Participants_Non-responding_percentage`
                                               <dbl>
 1                                              NA  
 2                                              19.9
 3                                              22  
 4                                              18.8
 5                                              15.3
 6                                              24.8
 7                                              16.3
 8                                              21.9
 9                                              26.1
10                                              17.8
   `Eligible Participants_total_number` `Eligible Participants_total_percentage`
                                  <dbl>                                    <dbl>
 1                                   NA                                       NA
 2                               105254                                      100
 3                               109371                                      100
 4                               106375                                      100
 5                               105090                                      100
 6                               104435                                      100
 7                               106071                                      100
 8                               115497                                      100
 9                               108016                                      100
10                               105251                                      100
# ℹ 166 more rows

3. clean the data

3.1 clean the “egg” data set

The egg data set cleaning is mainly regarding the “month” column, which is made the “month” column arranged from January to December, which makes the data set more focused on the month column, not the year column, for some research purposes like analyzing the change in different years based on the same month. This data set also includes “year,” “large_half_dozen,” “large_dozen extra_large_half_dozen,” and “extra_large_dozen.”

egg_clean <- egg %>%
  mutate(
  month = factor(month, levels = month.name)
  ) %>%
  arrange(month)

egg_clean  %>% print(n = 10, width = Inf)

# A tibble: 120 × 6
   month    year large_half_dozen large_dozen extra_large_half_dozen
   <fct>   <dbl>            <dbl>       <dbl>                  <dbl>
 1 January  2004             126         230                    132 
 2 January  2005             128.        234.                   136.
 3 January  2006             128.        234.                   136.
 4 January  2007             128.        234.                   136.
 5 January  2008             132         237                    139 
 6 January  2009             174.        278.                   186.
 7 January  2010             174.        272.                   186.
 8 January  2011             174.        268.                   186.
 9 January  2012             174.        268.                   186.
10 January  2013             178         268.                   188.
   extra_large_dozen
               <dbl>
 1              230 
 2              241 
 3              241 
 4              242.
 5              245 
 6              286.
 7              286.
 8              286.
 9              286.
10              290 
# ℹ 110 more rows

3.2 clean the “marriage” data set

Cleaning Table 1 in the marriage data set mainly deletes the columns containing no information. Cleaning Table 2 in the marriage data set mainly deletes the column containing no information and the rows that only contain the names of the divisions in Australia and the total numbers of values. The cleaning also includes changing the information regarding the name of divisions into new categories called divisions. It is hard to organize the places into different divisions and make a new column, so reference the resources for organizing and making a new data set is needed. Both of these two tables contain information about “response clear,” which refers to the answer in the survey being clearly recorded based on “yes” or “no.” They also contain information about “Eligible participants,” which means the people eligible for enrollment on the Commonwealth Electoral Roll and elections. The values for these two variables are further divided through “yes,” “no,” and “total, and presented through numbers and percentages.

3.2.1 clean the “marriage” data set sheet 1 “Table 1” (or marriage 1)

marriage_1_clean <- marriage_1 %>%
  select(!contains("delete"))

marriage_1_clean %>% print(width = Inf)

# A tibble: 9 × 15
  `State and Territory`           `Response clear_yes_number`
  <chr>                                                 <dbl>
1 New South Wales                                     2374362
2 Victoria                                            2145629
3 Queensland                                          1487060
4 South Australia                                      592528
5 Western Australia                                    801575
6 Tasmania                                             191948
7 Northern Territory(b)                                 48686
8 Australian Capital Territory(c)                      175459
9 Australia                                           7817247
  `Response clear_yes_percentage` `Response clear_no_number`
                            <dbl>                      <dbl>
1                            57.8                    1736838
2                            64.9                    1161098
3                            60.7                     961015
4                            62.5                     356247
5                            63.7                     455924
6                            63.6                     109655
7                            60.6                      31690
8                            74                        61520
9                            61.6                    4873987
  `Response clear_no_percentage` `Response clear_total_number`
                           <dbl>                         <dbl>
1                           42.2                       4111200
2                           35.1                       3306727
3                           39.3                       2448075
4                           37.5                        948775
5                           36.3                       1257499
6                           36.4                        301603
7                           39.4                         80376
8                           26                          236979
9                           38.4                      12691234
  `Response clear_total_percentage`
                              <dbl>
1                               100
2                               100
3                               100
4                               100
5                               100
6                               100
7                               100
8                               100
9                               100
  `Eligible Participants_Response clear_number`
                                          <dbl>
1                                       4111200
2                                       3306727
3                                       2448075
4                                        948775
5                                       1257499
6                                        301603
7                                         80376
8                                        236979
9                                      12691234
  `Eligible Participants_Response clear_percentage`
                                              <dbl>
1                                              79.2
2                                              81.4
3                                              77.7
4                                              79.5
5                                              78.3
6                                              79.5
7                                              58.2
8                                              82.3
9                                              79.3
  `Eligible Participants_Response not clear(a)_number`
                                                 <dbl>
1                                                11036
2                                                11028
3                                                 7088
4                                                 2778
5                                                 3188
6                                                  805
7                                                  229
8                                                  534
9                                                36686
  `Eligible Participants_Response not clear(a)_percentage`
                                                     <dbl>
1                                                      0.2
2                                                      0.3
3                                                      0.2
4                                                      0.2
5                                                      0.2
6                                                      0.2
7                                                      0.2
8                                                      0.2
9                                                      0.2
  `Eligible Participants_Non-responding_number`
                                          <dbl>
1                                       1065445
2                                        743634
3                                        695710
4                                        242027
5                                        346333
6                                         77020
7                                         57496
8                                         50595
9                                       3278260
  `Eligible Participants_Non-responding_percentage`
                                              <dbl>
1                                              20.5
2                                              18.3
3                                              22.1
4                                              20.3
5                                              21.6
6                                              20.3
7                                              41.6
8                                              17.6
9                                              20.5
  `Eligible Participants_total_number` `Eligible Participants_total_percentage`
                                 <dbl>                                    <dbl>
1                              5187681                                      100
2                              4061389                                      100
3                              3150873                                      100
4                              1193580                                      100
5                              1607020                                      100
6                               379428                                      100
7                               138101                                      100
8                               288108                                      100
9                             16006180                                      100

3.2.1 clean the “marriage” data set sheet 2 “Table 2” (or marriage 2)

marriage_2_clean <- marriage_2 %>%
  select(!contains("delete"))%>%
  na.omit()%>%
  filter(!str_detect(places_regions, "Total"))
  
  marriage_2_clean %>% print(n = 10, width = Inf)

# A tibble: 150 × 15
   places_regions `Response clear_yes_number` `Response clear_yes_percentage`
   <chr>                                <dbl>                           <dbl>
 1 Banks                                37736                            44.9
 2 Barton                               37153                            43.6
 3 Bennelong                            42943                            49.8
 4 Berowra                              48471                            54.6
 5 Blaxland                             20406                            26.1
 6 Bradfield                            53681                            60.6
 7 Calare                               54091                            60.2
 8 Chifley                              32871                            41.3
 9 Cook                                 47505                            55  
10 Cowper                               57493                            60  
   `Response clear_no_number` `Response clear_no_percentage`
                        <dbl>                          <dbl>
 1                      46343                           55.1
 2                      47984                           56.4
 3                      43215                           50.2
 4                      40369                           45.4
 5                      57926                           73.9
 6                      34927                           39.4
 7                      35779                           39.8
 8                      46702                           58.7
 9                      38804                           45  
10                      38317                           40  
   `Response clear_total_number` `Response clear_total_percentage`
                           <dbl>                             <dbl>
 1                         84079                               100
 2                         85137                               100
 3                         86158                               100
 4                         88840                               100
 5                         78332                               100
 6                         88608                               100
 7                         89870                               100
 8                         79573                               100
 9                         86309                               100
10                         95810                               100
   `Eligible Participants_Response clear_number`
                                           <dbl>
 1                                         84079
 2                                         85137
 3                                         86158
 4                                         88840
 5                                         78332
 6                                         88608
 7                                         89870
 8                                         79573
 9                                         86309
10                                         95810
   `Eligible Participants_Response clear_percentage`
                                               <dbl>
 1                                              79.9
 2                                              77.8
 3                                              81  
 4                                              84.5
 5                                              75  
 6                                              83.5
 7                                              77.8
 8                                              73.7
 9                                              82  
10                                              79  
   `Eligible Participants_Response not clear(b)_number`
                                                  <dbl>
 1                                                  247
 2                                                  226
 3                                                  244
 4                                                  212
 5                                                  220
 6                                                  202
 7                                                  285
 8                                                  263
 9                                                  229
10                                                  315
   `Eligible Participants_Response not clear(b)_percentage`
                                                      <dbl>
 1                                                      0.2
 2                                                      0.2
 3                                                      0.2
 4                                                      0.2
 5                                                      0.2
 6                                                      0.2
 7                                                      0.2
 8                                                      0.2
 9                                                      0.2
10                                                      0.3
   `Eligible Participants_Non-responding_number`
                                           <dbl>
 1                                         20928
 2                                         24008
 3                                         19973
 4                                         16038
 5                                         25883
 6                                         17261
 7                                         25342
 8                                         28180
 9                                         18713
10                                         25197
   `Eligible Participants_Non-responding_percentage`
                                               <dbl>
 1                                              19.9
 2                                              22  
 3                                              18.8
 4                                              15.3
 5                                              24.8
 6                                              16.3
 7                                              21.9
 8                                              26.1
 9                                              17.8
10                                              20.8
   `Eligible Participants_total_number` `Eligible Participants_total_percentage`
                                  <dbl>                                    <dbl>
 1                               105254                                      100
 2                               109371                                      100
 3                               106375                                      100
 4                               105090                                      100
 5                               104435                                      100
 6                               106071                                      100
 7                               115497                                      100
 8                               108016                                      100
 9                               105251                                      100
10                               121322                                      100
# ℹ 140 more rows

NewSouthWalesDivisions <- c("Banks", "Barton", "Bennelong", "Berowra", "Blaxland","Bradfield","Calare", "Chifley", "Cook", "Cowper", "Cunningham", "Dobell", "Eden-Monaro", "Farrer", "Fowler", "Gilmore", "Grayndler", "Greenway", "Hughes", "Hume" , "Hunter", "Kingsford Smith", "Lindsay", "Lyne", "Macarthur", "Mackellar", "Macquarie", "McMahon", "Mitchell", "Newcastle", "New England", "North Sydney", "Page", "Parkes", "Parramatta", "Paterson", "Reid", "Richmond", "Riverina", "Robertson", "Shortland", "Sydney", "Warringah", "Watson", "Wentworth", "Werriwa", "Whitlam")

VictoriaDivisions <- c("Aston","Ballarat","Batman", "Bendigo", "Bruce", "Calwell", "Casey", "Chisholm", "Corangamite", "Corio", "Deakin", "Dunkley", "Flinders", "Gellibrand", "Gippsland", "Goldstein", "Gorton", "Higgins", "Holt", "Hotham", "Indi", "Isaacs", "Jagajaga","Kooyong", "Lalor", "La Trobe", "Mallee","Maribyrnong", "McEwen", "McMillan", "Melbourne", "Melbourne Ports", "Menzies", "Murray", "Scullin", "Wannon", "Wills")
 
QueenslandDivisions <- c("Blair", "Bonner","Bowman", "Brisbane", "Capricornia", "Dawson", "Dickson", "Fadden", "Fairfax", "Fisher", "Flynn", "Forde", "Griffith", "Groom", "Herbert", "Hinkler", "Kennedy", "Leichhardt", "Lilley", "Longman", "Maranoa", "McPherson", "Moncrieff", "Moreton", "Oxley", "Petrie","Rankin", "Ryan", "Wide Bay", "Wright")

SouthAustraliaDivisions <- c("Adelaide", "Barker", "Boothby", "Grey", "Hindmarsh", "Kingston", "Makin", "Mayo", "Port Adelaide", "Sturt", "Wakefield")

WesternAustraliaDivisions <- c("Brand", "Burt", "Canning", "Cowan", "Curtin", "Durack", "Forrest", "Fremantle", "Hasluck", "Moore", "O'Connor", "Pearce", "Perth", "Stirling", "Swan", "Tangney")

TasmaniaDivisions <- c("Bass", "Braddon", "Denison", "Franklin", "Lyons")

NorthernTerritoryDivisions <- c("Lingiari(c)", "Solomon")

AustralianCapitalTerritoryDivisions <- c("Canberra(d)", "Fenner(e)")


NewSouthWales_division <- tibble( regions=NewSouthWalesDivisions)%>%
mutate(division= "New South Wales")

Victoria_divisions <- tibble(regions=VictoriaDivisions)%>%
mutate(division= "Victoria")

Queensland_divisions <- tibble(regions=QueenslandDivisions)%>%
mutate(division= "Queensland")

SouthAustralia_divisions <- tibble(regions=SouthAustraliaDivisions)%>%
mutate(division= "South Australia")

WesternAustralia_divisions <- tibble(regions=WesternAustraliaDivisions)%>%
mutate(division= "Western Australia")

Tasmania_divisions <- tibble(regions=TasmaniaDivisions)%>%
mutate(division= "Tasmania")

NorthernTerritory_divisions <- tibble(regions=NorthernTerritoryDivisions)%>%
mutate(division= "Northern Territory")

AustralianCapitalTerritory_divisions <- tibble(regions=AustralianCapitalTerritoryDivisions)%>%
mutate(division= "Australian Capital Territory")

marriage_2_clean_final <- bind_rows(NewSouthWales_division, Victoria_divisions,Queensland_divisions,SouthAustralia_divisions, WesternAustralia_divisions,Tasmania_divisions,NorthernTerritory_divisions,AustralianCapitalTerritory_divisions)%>%
  mutate(marriage_2_clean)%>%
select(!places_regions)

marriage_2_clean_final %>% print(n = 10, width = Inf)

# A tibble: 150 × 16
   regions   division        `Response clear_yes_number`
   <chr>     <chr>                                 <dbl>
 1 Banks     New South Wales                       37736
 2 Barton    New South Wales                       37153
 3 Bennelong New South Wales                       42943
 4 Berowra   New South Wales                       48471
 5 Blaxland  New South Wales                       20406
 6 Bradfield New South Wales                       53681
 7 Calare    New South Wales                       54091
 8 Chifley   New South Wales                       32871
 9 Cook      New South Wales                       47505
10 Cowper    New South Wales                       57493
   `Response clear_yes_percentage` `Response clear_no_number`
                             <dbl>                      <dbl>
 1                            44.9                      46343
 2                            43.6                      47984
 3                            49.8                      43215
 4                            54.6                      40369
 5                            26.1                      57926
 6                            60.6                      34927
 7                            60.2                      35779
 8                            41.3                      46702
 9                            55                        38804
10                            60                        38317
   `Response clear_no_percentage` `Response clear_total_number`
                            <dbl>                         <dbl>
 1                           55.1                         84079
 2                           56.4                         85137
 3                           50.2                         86158
 4                           45.4                         88840
 5                           73.9                         78332
 6                           39.4                         88608
 7                           39.8                         89870
 8                           58.7                         79573
 9                           45                           86309
10                           40                           95810
   `Response clear_total_percentage`
                               <dbl>
 1                               100
 2                               100
 3                               100
 4                               100
 5                               100
 6                               100
 7                               100
 8                               100
 9                               100
10                               100
   `Eligible Participants_Response clear_number`
                                           <dbl>
 1                                         84079
 2                                         85137
 3                                         86158
 4                                         88840
 5                                         78332
 6                                         88608
 7                                         89870
 8                                         79573
 9                                         86309
10                                         95810
   `Eligible Participants_Response clear_percentage`
                                               <dbl>
 1                                              79.9
 2                                              77.8
 3                                              81  
 4                                              84.5
 5                                              75  
 6                                              83.5
 7                                              77.8
 8                                              73.7
 9                                              82  
10                                              79  
   `Eligible Participants_Response not clear(b)_number`
                                                  <dbl>
 1                                                  247
 2                                                  226
 3                                                  244
 4                                                  212
 5                                                  220
 6                                                  202
 7                                                  285
 8                                                  263
 9                                                  229
10                                                  315
   `Eligible Participants_Response not clear(b)_percentage`
                                                      <dbl>
 1                                                      0.2
 2                                                      0.2
 3                                                      0.2
 4                                                      0.2
 5                                                      0.2
 6                                                      0.2
 7                                                      0.2
 8                                                      0.2
 9                                                      0.2
10                                                      0.3
   `Eligible Participants_Non-responding_number`
                                           <dbl>
 1                                         20928
 2                                         24008
 3                                         19973
 4                                         16038
 5                                         25883
 6                                         17261
 7                                         25342
 8                                         28180
 9                                         18713
10                                         25197
   `Eligible Participants_Non-responding_percentage`
                                               <dbl>
 1                                              19.9
 2                                              22  
 3                                              18.8
 4                                              15.3
 5                                              24.8
 6                                              16.3
 7                                              21.9
 8                                              26.1
 9                                              17.8
10                                              20.8
   `Eligible Participants_total_number` `Eligible Participants_total_percentage`
                                  <dbl>                                    <dbl>
 1                               105254                                      100
 2                               109371                                      100
 3                               106375                                      100
 4                               105090                                      100
 5                               104435                                      100
 6                               106071                                      100
 7                               115497                                      100
 8                               108016                                      100
 9                               105251                                      100
10                               121322                                      100
# ℹ 140 more rows

4. Pivot the data

4.1 Pivot the “egg” data set

Pivot the “egg” data set using the “pivot_longer” function, changing the column titles from “large_half_dozen” to “extra_large_dozen” into a column “Key” and making the values in these columns into one column called “values.” It will make the table longer but less wide and also make the values in the various columns easier to analyze based on the month, year, and various dozen categories.

egg_clean_pivot_final <- pivot_longer(egg_clean, cols = "large_half_dozen": "extra_large_dozen", names_to = "key", values_to = "Values")

egg_clean_pivot_final %>% print(n = 10, width = Inf)

# A tibble: 480 × 4
   month    year key                    Values
   <fct>   <dbl> <chr>                   <dbl>
 1 January  2004 large_half_dozen         126 
 2 January  2004 large_dozen              230 
 3 January  2004 extra_large_half_dozen   132 
 4 January  2004 extra_large_dozen        230 
 5 January  2005 large_half_dozen         128.
 6 January  2005 large_dozen              234.
 7 January  2005 extra_large_half_dozen   136.
 8 January  2005 extra_large_dozen        241 
 9 January  2006 large_half_dozen         128.
10 January  2006 large_dozen              234.
# ℹ 470 more rows

4.2 Pivot the “marriage” data set

Pivot the two tables in the marriage data set by 1) dividing the table through “clear response” and “eligible participants” and using “pivot_longer” function to make the column title contain “clear response” and “eligible participants” into a column called key so that their value can become one column, which will be easier for further analysis by different groups, 2) merging these two tables through “bind_rows” function so that it creates a new table and will not be impacted by the “eligible participants have more observations than the”clear response,” and 3) dividing down the key column by using the “separate” function to three columns: “Element,” which mainly refers to the “clear response” and “eligible participants” this two categories, “status&total” refers to the yes, no, and total in the original tables, and “Counting ways” refers to two ways the table present the data- numbers and percentage. Doing this will make the table easier to understand and do further analysis.

4.2.1 Pivot the “marriage Table 1” data set

marriage_1_clean_tidy_response <- pivot_longer(marriage_1_clean, cols= c("Response clear_yes_number", "Response clear_yes_percentage", "Response clear_no_number", "Response clear_no_percentage",  "Response clear_total_number","Response clear_total_percentage"), names_to = "key", values_to= "values") %>%
 select(!contains("Eligible Participants"))

marriage_1_clean_tidy_response %>% print(n = 10, width = Inf)

# A tibble: 54 × 3
   `State and Territory` key                                values
   <chr>                 <chr>                               <dbl>
 1 New South Wales       Response clear_yes_number       2374362  
 2 New South Wales       Response clear_yes_percentage        57.8
 3 New South Wales       Response clear_no_number        1736838  
 4 New South Wales       Response clear_no_percentage         42.2
 5 New South Wales       Response clear_total_number     4111200  
 6 New South Wales       Response clear_total_percentage     100  
 7 Victoria              Response clear_yes_number       2145629  
 8 Victoria              Response clear_yes_percentage        64.9
 9 Victoria              Response clear_no_number        1161098  
10 Victoria              Response clear_no_percentage         35.1
# ℹ 44 more rows

marriage_1_clean_tidy_eligible <- pivot_longer(marriage_1_clean, cols= c( "Eligible Participants_Response clear_number",    "Eligible Participants_Response clear_percentage",  "Eligible Participants_Response not clear(a)_number",   "Eligible Participants_Response not clear(a)_percentage",   "Eligible Participants_Non-responding_number",  "Eligible Participants_Non-responding_percentage",  "Eligible Participants_total_number",   "Eligible Participants_total_percentage"),  names_to= "key", values_to= "values" )%>%
 select(!contains("Response clear"))

marriage_1_clean_tidy_eligible %>% print(n = 10, width = Inf)

# A tibble: 72 × 3
   `State and Territory` key                                                   
   <chr>                 <chr>                                                 
 1 New South Wales       Eligible Participants_Response clear_number           
 2 New South Wales       Eligible Participants_Response clear_percentage       
 3 New South Wales       Eligible Participants_Response not clear(a)_number    
 4 New South Wales       Eligible Participants_Response not clear(a)_percentage
 5 New South Wales       Eligible Participants_Non-responding_number           
 6 New South Wales       Eligible Participants_Non-responding_percentage       
 7 New South Wales       Eligible Participants_total_number                    
 8 New South Wales       Eligible Participants_total_percentage                
 9 Victoria              Eligible Participants_Response clear_number           
10 Victoria              Eligible Participants_Response clear_percentage       
      values
       <dbl>
 1 4111200  
 2      79.2
 3   11036  
 4       0.2
 5 1065445  
 6      20.5
 7 5187681  
 8     100  
 9 3306727  
10      81.4
# ℹ 62 more rows

marriage_1_clean_final_tidy<- bind_rows(marriage_1_clean_tidy_response,marriage_1_clean_tidy_eligible)%>%
   separate(key, into = c("Element", "status&total", "Counting ways"), sep = "_")

marriage_1_clean_final_tidy %>% print(n = 10, width = Inf)

# A tibble: 126 × 5
   `State and Territory` Element        `status&total` `Counting ways`    values
   <chr>                 <chr>          <chr>          <chr>               <dbl>
 1 New South Wales       Response clear yes            number          2374362  
 2 New South Wales       Response clear yes            percentage           57.8
 3 New South Wales       Response clear no             number          1736838  
 4 New South Wales       Response clear no             percentage           42.2
 5 New South Wales       Response clear total          number          4111200  
 6 New South Wales       Response clear total          percentage          100  
 7 Victoria              Response clear yes            number          2145629  
 8 Victoria              Response clear yes            percentage           64.9
 9 Victoria              Response clear no             number          1161098  
10 Victoria              Response clear no             percentage           35.1
# ℹ 116 more rows

4.2.2 pivot the the “marriage” dara set sheet 1 “Table 2” (or marriage 2)

marriage_2_clean_final_2_tidy_response <- pivot_longer(marriage_2_clean_final, cols= c("Response clear_yes_number", "Response clear_yes_percentage", "Response clear_no_number", "Response clear_no_percentage",    "Response clear_total_number","Response clear_total_percentage"), names_to= "key", values_to= "values")%>%
 select(!contains("Eligible Participants"))

marriage_2_clean_final_2_tidy_response %>% print(n = 10, width = Inf)

# A tibble: 900 × 4
   regions division        key                              values
   <chr>   <chr>           <chr>                             <dbl>
 1 Banks   New South Wales Response clear_yes_number       37736  
 2 Banks   New South Wales Response clear_yes_percentage      44.9
 3 Banks   New South Wales Response clear_no_number        46343  
 4 Banks   New South Wales Response clear_no_percentage       55.1
 5 Banks   New South Wales Response clear_total_number     84079  
 6 Banks   New South Wales Response clear_total_percentage   100  
 7 Barton  New South Wales Response clear_yes_number       37153  
 8 Barton  New South Wales Response clear_yes_percentage      43.6
 9 Barton  New South Wales Response clear_no_number        47984  
10 Barton  New South Wales Response clear_no_percentage       56.4
# ℹ 890 more rows

marriage_2_clean_final_tidy_eligible <- pivot_longer( marriage_2_clean_final, cols= c( "Eligible Participants_Response clear_number",   "Eligible Participants_Response clear_percentage",  "Eligible Participants_Response not clear(b)_number",   "Eligible Participants_Response not clear(b)_percentage",   "Eligible Participants_Non-responding_number",  "Eligible Participants_Non-responding_percentage",  "Eligible Participants_total_number",   "Eligible Participants_total_percentage"),  names_to= "key", values_to= "values" )%>%
 select(!contains("Response clear"))

marriage_2_clean_final_tidy_eligible %>% print(n = 10, width = Inf)

# A tibble: 1,200 × 4
   regions division       
   <chr>   <chr>          
 1 Banks   New South Wales
 2 Banks   New South Wales
 3 Banks   New South Wales
 4 Banks   New South Wales
 5 Banks   New South Wales
 6 Banks   New South Wales
 7 Banks   New South Wales
 8 Banks   New South Wales
 9 Barton  New South Wales
10 Barton  New South Wales
   key                                                      values
   <chr>                                                     <dbl>
 1 Eligible Participants_Response clear_number             84079  
 2 Eligible Participants_Response clear_percentage            79.9
 3 Eligible Participants_Response not clear(b)_number        247  
 4 Eligible Participants_Response not clear(b)_percentage      0.2
 5 Eligible Participants_Non-responding_number             20928  
 6 Eligible Participants_Non-responding_percentage            19.9
 7 Eligible Participants_total_number                     105254  
 8 Eligible Participants_total_percentage                    100  
 9 Eligible Participants_Response clear_number             85137  
10 Eligible Participants_Response clear_percentage            77.8
# ℹ 1,190 more rows

marriage_2_clean_final_tidy<- bind_rows(marriage_2_clean_final_2_tidy_response,marriage_2_clean_final_tidy_eligible)%>%
   separate(key, into = c("Element", "status&total", "Counting ways"), sep = "_") %>%
  unite(regions_divisions, division, regions, sep = ",")

marriage_2_clean_final_tidy %>% print(n = 10, width = Inf)

# A tibble: 2,100 × 5
   regions_divisions      Element        `status&total` `Counting ways`  values
   <chr>                  <chr>          <chr>          <chr>             <dbl>
 1 New South Wales,Banks  Response clear yes            number          37736  
 2 New South Wales,Banks  Response clear yes            percentage         44.9
 3 New South Wales,Banks  Response clear no             number          46343  
 4 New South Wales,Banks  Response clear no             percentage         55.1
 5 New South Wales,Banks  Response clear total          number          84079  
 6 New South Wales,Banks  Response clear total          percentage        100  
 7 New South Wales,Barton Response clear yes            number          37153  
 8 New South Wales,Barton Response clear yes            percentage         43.6
 9 New South Wales,Barton Response clear no             number          47984  
10 New South Wales,Barton Response clear no             percentage         56.4
# ℹ 2,090 more rows