MATH2405 Data Wrangling

RPubs URL: https://rpubs.com/Chuver/714636

Setup

# The following packages are loaded for the purpose of Assignment 1  


library(readr) # Useful for importing data
library(tidyr) # Useful for creating tidy data
library(dplyr) # Useful for data manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(magrittr) # Useful for code readability

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:tidyr':
## 
##     extract

Step 1 - Locate Data

With the imminent start of the new school year, as well as the fact that I have two girls attending the same school, I was curious to find out more about the number of school enrolments across schools in Victoria and consequently have used this as the data set for my assignment.

The motivation behind choosing the All Schools FTE enrolments 2020 data was as a result of searching for the answers to:

Which school in Victoria has the highest number of year 12 enrolments?
What is the breakdown at year 12 level between boys and girls?

The data is published by the Department of Education and Training and was sourced using https://www.data.vic.gov.au/.

The actual URL of the data is http://www.education.vic.gov.au/Documents/about/research/datavic/dv300-allschoolsFTEenrolmentsFeb2020.csv. The data itself is in CSV format.

The data is from a census taken in February 2020 of Victorian schools. The data set provides information on the full time equivalent (FTE) enrolments of students by school, school type, year level and sex. A student attending school on a full time basis is counted as 1.0 FTE. In the event where a student has attended more than 1 school, their time is apportioned accordingly, e.g. 0.6 FTE at school A and 0.4 FTE at school B.

Step 2 - Read/Import Data

From the URL for All Schools FTE enrolments - Feb2020 http://www.education.vic.gov.au/Documents/about/research/datavic/dv300-allschoolsFTEenrolmentsFeb2020.csv, the file was downloaded to the local laptop directory.

The downloaded file was then read into R using the read_csv function by assigning it to “schools_fte01”.

# Step 2 - Read CSV file from local directory in R and save as a data frame
schools_fte01 <- read_csv("C:/Users/verno/OneDrive/Documents/RMIT/MATH2405/Data/Schools_FTE_2020.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   Education_Sector = col_character(),
##   School_Name = col_character(),
##   School_Type = col_character(),
##   School_Status = col_character(),
##   CENSUS_TYPE = col_logical()
## )
## i Use `spec()` for the full column specifications.

# Display contents of read_csv
head(schools_fte01)

# Save as dataframe
schools_fte01 %>%
  saveRDS(file = "C:/Users/verno/OneDrive/Documents/RMIT/MATH2405/Data/schools_fte01.RDS")

Step 3 - Data description

Data source: The data can be found at the following URL: http://www.education.vic.gov.au/Documents/about/research/datavic/dv300-allschoolsFTEenrolmentsFeb2020.csv

The data comprises of the following 2263 observations with 56 variables. The variables are broken down as follows:

Four qualitative (categorical) variables:
- Education_Sector - indicates whether a school is Catholic, Government or Independent.
- School_Name - is the name of the school.
- School_Type - indicates whether the school is a Language, Pri/Sec, Primary, Secondary or Special school.
- School_Status - indicates the status of a school Open (O) or Closed (C).
one logical variable:
- CENSUS_TYPE - indicates that the census is complete (F)
51 numeric variables:
- Entity Type - indicates whether the school is Government (1) or (Catholic / Independent)(2).
- School_No - represents the school ID
- *Total - represents the FTE by year (there are 48 of these variables)
- Year - indicates the census year - 2020

Step 4 - Inspect dataset and variables

In order to obtain greater understanding of the data set, the following was performed:

4.1 Checked the dimensions of the data frame using dim(schools_fte01).

4.2 Checked the variable data types using glimpse(schools_fte01).

4.3a Checked the column names and determined that some names had 2 backslashes as a suffix.

4.3b Removed the 2 backslashes as a suffix using gsub(schools_fte01)

4.4 Created new Year 12 data set for the purpose of summary statistics.

# Check structure of data set
# 4.1 Check data frame dimensions
dim(schools_fte01)

## [1] 2263   56

The output confirms that there are 2263 rows and 56 variables

# 4.2 Use glimpse() to check variable data types
glimpse(schools_fte01)

## Rows: 2,263
## Columns: 56
## $ Education_Sector                     <chr> "Catholic", "Catholic", "Catho...
## $ Entity_Type                          <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ School_No                            <dbl> 20, 25, 26, 28, 29, 30, 33, 35...
## $ School_Name                          <chr> "Parade College", "Simonds Cat...
## $ School_Type                          <chr> "Secondary", "Secondary", "Sec...
## $ School_Status                        <chr> "O", "O", "O", "O", "O", "O", ...
## $ `Prep Males Total""`                 <dbl> 0, 0, 0, 0, 30, 18, 0, 0, 11, ...
## $ `Prep Females Total""`               <dbl> 0.0, 0.0, 0.0, 0.0, 12.0, 18.0...
## $ `Prep Total""`                       <dbl> 0.0, 0.0, 0.0, 0.0, 42.0, 36.0...
## $ `Year 1 Males Total""`               <dbl> 0, 0, 0, 0, 14, 9, 0, 0, 9, 23...
## $ `Year 1 Females Total""`             <dbl> 0, 0, 0, 0, 18, 20, 0, 0, 11, ...
## $ `Year 1 Total""`                     <dbl> 0, 0, 0, 0, 32, 29, 0, 0, 20, ...
## $ `Year 2 Males total""`               <dbl> 0.0, 0.0, 0.0, 0.0, 15.0, 19.0...
## $ `Year 2 Females Total""`             <dbl> 0, 0, 0, 0, 16, 14, 0, 0, 14, ...
## $ `Year 2 Total""`                     <dbl> 0.0, 0.0, 0.0, 0.0, 31.0, 33.0...
## $ `Year 3 Males Total""`               <dbl> 0.0, 0.0, 0.0, 0.0, 18.0, 18.0...
## $ `Year 3 Females Total""`             <dbl> 0, 0, 0, 0, 15, 25, 0, 0, 10, ...
## $ `Year 3 Total""`                     <dbl> 0.0, 0.0, 0.0, 0.0, 33.0, 43.0...
## $ `Year 4 Males Total""`               <dbl> 0, 0, 0, 0, 13, 20, 0, 0, 14, ...
## $ `Year 4 Females Total""`             <dbl> 0, 0, 0, 0, 21, 29, 0, 0, 17, ...
## $ `Year 4 Total""`                     <dbl> 0, 0, 0, 0, 34, 49, 0, 0, 31, ...
## $ `Year 5 Males Total""`               <dbl> 0.0, 0.0, 0.0, 0.0, 16.0, 22.0...
## $ `Year 5 Females Total""`             <dbl> 0, 0, 0, 0, 22, 24, 0, 0, 14, ...
## $ `Year 5 Total""`                     <dbl> 0.0, 0.0, 0.0, 0.0, 38.0, 46.0...
## $ `Year 6 Males Total""`               <dbl> 0, 0, 0, 0, 18, 16, 0, 0, 14, ...
## $ `Year 6 Females Total""`             <dbl> 0, 0, 0, 0, 27, 15, 0, 0, 20, ...
## $ `Year 6 Total""`                     <dbl> 0, 0, 0, 0, 45, 31, 0, 0, 34, ...
## $ `Primary Ungraded Males Total""`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ `Primary Ungraded Females Total""`   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ `Primary Ungraded Total""`           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ `Primary Total""`                    <dbl> 0.0, 0.0, 0.0, 0.0, 255.0, 267...
## $ `Year 7 Males Total""`               <dbl> 333, 64, 54, 229, 0, 0, 104, 0...
## $ `Year 7 Females Total""`             <dbl> 0, 0, 0, 0, 0, 0, 118, 156, 0,...
## $ `Year 7 Total""`                     <dbl> 333, 64, 54, 229, 0, 0, 222, 1...
## $ `Year 8 Males Total""`               <dbl> 337, 85, 74, 233, 0, 0, 110, 0...
## $ `Year 8 Females Total""`             <dbl> 0, 0, 0, 0, 0, 0, 110, 159, 0,...
## $ `Year 8 Total""`                     <dbl> 337, 85, 74, 233, 0, 0, 220, 1...
## $ `Year 9 Males Total""`               <dbl> 351, 69, 94, 224, 0, 0, 85, 0,...
## $ `Year 9 Females Total""`             <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
## $ `Year 9 Total""`                     <dbl> 351.0, 69.0, 94.0, 224.0, 0.0,...
## $ `Year 10 Males Total""`              <dbl> 319, 73, 87, 236, 0, 0, 92, 0,...
## $ `Year 10 Females Total""`            <dbl> 0, 0, 0, 0, 0, 0, 103, 162, 0,...
## $ `Year 10 Total""`                    <dbl> 319, 73, 87, 236, 0, 0, 195, 1...
## $ `Year 11 Males Total""`              <dbl> 332, 68, 97, 228, 0, 0, 89, 0,...
## $ `Year 11 Females Total""`            <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
## $ `Year 11 Total""`                    <dbl> 332.0, 68.0, 97.0, 228.0, 0.0,...
## $ `Year 12 Males""`                    <dbl> 281.0, 62.0, 71.0, 190.0, 0.0,...
## $ `Year 12 Females""`                  <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
## $ `Year 12 Total""`                    <dbl> 281.0, 62.0, 71.0, 190.0, 0.0,...
## $ `Secondary Ungraded Males Total""`   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ `Secondary Ungraded Females Total""` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ `Secondary Ungraded Total""`         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ `Secondary Total""`                  <dbl> 1953.0, 421.0, 477.0, 1340.0, ...
## $ `Grand Total""`                      <dbl> 1953.0, 421.0, 477.0, 1340.0, ...
## $ Year                                 <dbl> 2020, 2020, 2020, 2020, 2020, ...
## $ CENSUS_TYPE                          <lgl> FALSE, FALSE, FALSE, FALSE, FA...

As can be seen above all the variables are in the correct data type and do not require amendment

# 4.3a Use names() to check column names
names(schools_fte01)

##  [1] "Education_Sector"                    
##  [2] "Entity_Type"                         
##  [3] "School_No"                           
##  [4] "School_Name"                         
##  [5] "School_Type"                         
##  [6] "School_Status"                       
##  [7] "Prep Males Total\"\""                
##  [8] "Prep Females Total\"\""              
##  [9] "Prep Total\"\""                      
## [10] "Year 1 Males Total\"\""              
## [11] "Year 1 Females Total\"\""            
## [12] "Year 1 Total\"\""                    
## [13] "Year 2 Males total\"\""              
## [14] "Year 2 Females Total\"\""            
## [15] "Year 2 Total\"\""                    
## [16] "Year 3 Males Total\"\""              
## [17] "Year 3 Females Total\"\""            
## [18] "Year 3 Total\"\""                    
## [19] "Year 4 Males Total\"\""              
## [20] "Year 4 Females Total\"\""            
## [21] "Year 4 Total\"\""                    
## [22] "Year 5 Males Total\"\""              
## [23] "Year 5 Females Total\"\""            
## [24] "Year 5 Total\"\""                    
## [25] "Year 6 Males Total\"\""              
## [26] "Year 6 Females Total\"\""            
## [27] "Year 6 Total\"\""                    
## [28] "Primary Ungraded Males Total\"\""    
## [29] "Primary Ungraded Females Total\"\""  
## [30] "Primary Ungraded Total\"\""          
## [31] "Primary Total\"\""                   
## [32] "Year 7 Males Total\"\""              
## [33] "Year 7 Females Total\"\""            
## [34] "Year 7 Total\"\""                    
## [35] "Year 8 Males Total\"\""              
## [36] "Year 8 Females Total\"\""            
## [37] "Year 8 Total\"\""                    
## [38] "Year 9 Males Total\"\""              
## [39] "Year 9 Females Total\"\""            
## [40] "Year 9 Total\"\""                    
## [41] "Year 10 Males Total\"\""             
## [42] "Year 10 Females Total\"\""           
## [43] "Year 10 Total\"\""                   
## [44] "Year 11 Males Total\"\""             
## [45] "Year 11 Females Total\"\""           
## [46] "Year 11 Total\"\""                   
## [47] "Year 12 Males\"\""                   
## [48] "Year 12 Females\"\""                 
## [49] "Year 12 Total\"\""                   
## [50] "Secondary Ungraded Males Total\"\""  
## [51] "Secondary Ungraded Females Total\"\""
## [52] "Secondary Ungraded Total\"\""        
## [53] "Secondary Total\"\""                 
## [54] "Grand Total\"\""                     
## [55] "Year"                                
## [56] "CENSUS_TYPE"

As can be seen from the output of column names above, many columns contain 2 backslashes in their name.

# 4.3b Use gsub() to remove "\"\"" from some column names
colnames(schools_fte01) = gsub("\"\"", "", colnames(schools_fte01))
colnames(schools_fte01)

##  [1] "Education_Sector"                 "Entity_Type"                     
##  [3] "School_No"                        "School_Name"                     
##  [5] "School_Type"                      "School_Status"                   
##  [7] "Prep Males Total"                 "Prep Females Total"              
##  [9] "Prep Total"                       "Year 1 Males Total"              
## [11] "Year 1 Females Total"             "Year 1 Total"                    
## [13] "Year 2 Males total"               "Year 2 Females Total"            
## [15] "Year 2 Total"                     "Year 3 Males Total"              
## [17] "Year 3 Females Total"             "Year 3 Total"                    
## [19] "Year 4 Males Total"               "Year 4 Females Total"            
## [21] "Year 4 Total"                     "Year 5 Males Total"              
## [23] "Year 5 Females Total"             "Year 5 Total"                    
## [25] "Year 6 Males Total"               "Year 6 Females Total"            
## [27] "Year 6 Total"                     "Primary Ungraded Males Total"    
## [29] "Primary Ungraded Females Total"   "Primary Ungraded Total"          
## [31] "Primary Total"                    "Year 7 Males Total"              
## [33] "Year 7 Females Total"             "Year 7 Total"                    
## [35] "Year 8 Males Total"               "Year 8 Females Total"            
## [37] "Year 8 Total"                     "Year 9 Males Total"              
## [39] "Year 9 Females Total"             "Year 9 Total"                    
## [41] "Year 10 Males Total"              "Year 10 Females Total"           
## [43] "Year 10 Total"                    "Year 11 Males Total"             
## [45] "Year 11 Females Total"            "Year 11 Total"                   
## [47] "Year 12 Males"                    "Year 12 Females"                 
## [49] "Year 12 Total"                    "Secondary Ungraded Males Total"  
## [51] "Secondary Ungraded Females Total" "Secondary Ungraded Total"        
## [53] "Secondary Total"                  "Grand Total"                     
## [55] "Year"                             "CENSUS_TYPE"

The backslashes now no longer appear after execution of the gsub() function.

# 4.4 Create new dataset for Year 12 only the purpose of summary statistics
# This is achieved by selecting variables which contain "Year 12" in their name.


schools_year12 <- schools_fte01
schools_year12 %<>% 
  select(School_Name,`Year 12 Males`,`Year 12 Females`, `Year 12 Total`) 
schools_year12

The revised data set above only shows Year 12 data.

Step 5 - Tidy data

In order for data to meet tidy data principles, tidy data needs to adhere to the following rules (Wickham and Grolemund (2016)):

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

The schools_year12 data set does not fully conform with all the tidy data principles. It passes principles 2 and 3 but fails principle 1, as every column header is a value rather than a variable. In order to tidy the data it is necessary to clean it by pivoting the Year 12 Boys, Year 12 Girls and Year 12 Total columns using the pivot_longer() function. The resulting new variables are Y12 and FTE respectively.

# Step 5 - reshape data using pivot_longer()
tidy1 <- schools_year12
tidy1 %<>% 
  pivot_longer(-School_Name, names_to = "Y12", values_to = "FTE") %>% 
  group_by(Y12) %>% 
  arrange(desc(FTE))
tidy1

As can be seen above the tidy1 version of the schools_year12 data set is in a tidy format. Each variable now has its own column.

Step 6 - Summary statistics

The summary statistics will focus solely on Year 12 data and is grouped by School_Name using the tidied tidy1 data set from step 5.

# Step 6 - Summary statistics using tidy data as per step 5

Y12_Summary <- tidy1 
Y12_Summary %<>%
  group_by(School_Name) %>% 
  summarise(Mean = mean(FTE, na.rm = TRUE),
            Median = median(FTE, na.rm = TRUE),
            Max = max(FTE, na.rm = TRUE),
            Min = min(FTE, na.rm = TRUE),
            Std_Dev = sd(FTE, na.rm = TRUE)) %>% 
  arrange(desc(Mean))

## `summarise()` ungrouping output (override with `.groups` argument)

Y12_Summary

The Summary Statistics above show the data for Year 12 by School in descending order of Mean values.

Step 7 - Create a list

A list of Local Government IDs (LGA_ID) for each school will be generated from the “Schools_Loc_2020.csv” data set. This file contains location specific information for all schools registered in Victoria. It will be joined to the Y12_Summary data set and added after School Name.

The URL for this separate data set is as follows: https://www.education.vic.gov.au/Documents/about/research/datavic/dv296-schoollocations2020.csv

Steps undertaken:

7.1 Used read_csv to read in downloaded file.

7.2 Used names() to determine relevant name for LGA ID look up

7.3 Used pull() on “Schools_Loc_2020.csv” (column 18) to extract LGA_ID

# Step 7.1  - Read file into R
lga_id <- read_csv("C:/Users/verno/OneDrive/Documents/RMIT/MATH2405/Data/Schools_Loc_2020.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_character(),
##   Entity_Type = col_double(),
##   School_No = col_double(),
##   Address_Postcode = col_double(),
##   Postal_Postcode = col_double(),
##   LGA_ID = col_double(),
##   X = col_double(),
##   Y = col_double()
## )
## i Use `spec()` for the full column specifications.

lga_id

str(lga_id)

## tibble [2,263 x 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Education_Sector     : chr [1:2263] "Catholic" "Catholic" "Catholic" "Catholic" ...
##  $ Entity_Type          : num [1:2263] 2 2 2 2 2 2 2 2 2 2 ...
##  $ School_No            : num [1:2263] 20 25 26 28 29 30 33 35 60 77 ...
##  $ School_Name          : chr [1:2263] "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
##  $ School_Type          : chr [1:2263] "Secondary" "Secondary" "Secondary" "Secondary" ...
##  $ School_Status        : chr [1:2263] "O" "O" "O" "O" ...
##  $ Address_Line_1       : chr [1:2263] "1436 Plenty Road" "273 Victoria Street" "11 Westbury Street" "1431 Sturt Street" ...
##  $ Address_Line_2       : chr [1:2263] NA NA NA NA ...
##  $ Address_Town         : chr [1:2263] "BUNDOORA" "WEST MELBOURNE" "ST KILDA EAST" "BALLARAT" ...
##  $ Address_State        : chr [1:2263] "VIC" "VIC" "VIC" "VIC" ...
##  $ Address_Postcode     : num [1:2263] 3083 3003 3183 3350 3350 ...
##  $ Postal_Address_Line_1: chr [1:2263] "1436 Plenty Road" "273 Victoria Street" "PO Box 258" "Locked Bag 31" ...
##  $ Postal_Address_Line_2: chr [1:2263] NA NA NA NA ...
##  $ Postal_Town          : chr [1:2263] "BUNDOORA" "WEST MELBOURNE" "ST KILDA" "BALLARAT" ...
##  $ Postal_State         : chr [1:2263] "VIC" "VIC" "VIC" "VIC" ...
##  $ Postal_Postcode      : num [1:2263] 3083 3003 3182 3350 3350 ...
##  $ Full_Phone_No        : chr [1:2263] "03 9468 3300" "03 9321 9200" "03 9529 6611" "03 5331 1688" ...
##  $ LGA_ID               : num [1:2263] 66 460 590 57 57 57 673 57 425 311 ...
##  $ LGA_Name             : chr [1:2263] "Banyule (C)" "Melbourne (C)" "Port Phillip (C)" "Ballarat (C)" ...
##  $ X                    : num [1:2263] 145 145 145 144 144 ...
##  $ Y                    : num [1:2263] -37.7 -37.8 -37.9 -37.5 -37.6 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Education_Sector = col_character(),
##   ..   Entity_Type = col_double(),
##   ..   School_No = col_double(),
##   ..   School_Name = col_character(),
##   ..   School_Type = col_character(),
##   ..   School_Status = col_character(),
##   ..   Address_Line_1 = col_character(),
##   ..   Address_Line_2 = col_character(),
##   ..   Address_Town = col_character(),
##   ..   Address_State = col_character(),
##   ..   Address_Postcode = col_double(),
##   ..   Postal_Address_Line_1 = col_character(),
##   ..   Postal_Address_Line_2 = col_character(),
##   ..   Postal_Town = col_character(),
##   ..   Postal_State = col_character(),
##   ..   Postal_Postcode = col_double(),
##   ..   Full_Phone_No = col_character(),
##   ..   LGA_ID = col_double(),
##   ..   LGA_Name = col_character(),
##   ..   X = col_double(),
##   ..   Y = col_double()
##   .. )

As shown above is the structure for the location data file.

# Step 7.2 Look up names for "Schools_Loc_2020.csv" data set
names(lga_id)

##  [1] "Education_Sector"      "Entity_Type"           "School_No"            
##  [4] "School_Name"           "School_Type"           "School_Status"        
##  [7] "Address_Line_1"        "Address_Line_2"        "Address_Town"         
## [10] "Address_State"         "Address_Postcode"      "Postal_Address_Line_1"
## [13] "Postal_Address_Line_2" "Postal_Town"           "Postal_State"         
## [16] "Postal_Postcode"       "Full_Phone_No"         "LGA_ID"               
## [19] "LGA_Name"              "X"                     "Y"

As shown above LGA_ID is the name for the numeric variable to be output. It is located in column 18 of the data set.

# Step 7.3 Generate list for LGA_ID (Local Government Authority ID) using pull() and lga_id data set
#
# 
lga <- pull(lga_id, LGA_ID)
lga

##    [1]  66 460 590  57  57  57 673  57 425 311 311 460 670 635 590 485 506 262
##   [19] 167 211  57 111 635 189 189 381 381 283 485 127 681 485 515 515 413 291
##   [37] 262 101 515 275 581 478  91  91 231 262  57 543 413 413 127 283 525 726
##   [55] 413 327 335 335 394 101 145 490 111 626  26 241 515 549 735 735 745 433
##   [73] 433 319 175 175 343 534 283 490 735 490 673 275 275 275 506 745 571 735
##   [91] 763 137 111  26 137 549 137 460 241 681 137 111 262 335 626 183 183 506
##  [109] 291 626 262 111 717 581 343 183  83 262  66  66 525 343  57 525 231 267
##  [127]  91  74 506 225 635 525  57 617 617 745 145 635 460 608 189 635 681 262
##  [145] 590 413 661 534 175 118 189 673 319  91 643 433 490  66 643 635 211  11
##  [163] 506 562 111 231 343  91 635 111  91 262 225 534 525 698 311 571 478 217
##  [181] 698 189 231  91 327 681  83 231 635 111 441  66 534 111 231 478 707 189
##  [199] 497 525 367 698 283 189 635 145 506 275 267 343 189 111 525 698 525 433
##  [217] 275 506 421 111 515 231 111 497 298 275 381 421  83 118  66 549  91 343
##  [235] 441 689 562 506 189 118 698 137 681 275  66 311 626 497 267 497 763 506
##  [253] 275 275  66 118 118 433 525 267 118 394 275  91 262 118  57 478 433 381
##  [271] 497 267 283 698 343 497 381 127 745 189 211  66 667 497 506 267 211 661
##  [289] 670 698 327 137 111 497 698 497 343 421 490 161 441 525 497 707 525 745
##  [307] 673 506 175 571 161 506 189 698 670 763 118 311  57 421 367 667 745 367
##  [325] 367 311 343 525 497 698 327 311 745 707 707 275 525 698 681 534 217 367
##  [343] 217 217 497 506 421 421 698 497 726 231 681 311 506 673  11 311 367 717
##  [361] 262 343 118 478 726 707 327 161 267 465 275 670 441 381 698 707 327 421
##  [379] 217 118 267 189 118 118 145 275 118 465 571 367 118 717 433  57 745 421
##  [397] 381 327 217 707  66 465 161 343 534 490 726 367 311 161 275 118 118 367
##  [415] 707 275 707 421 118 465 745 534 497 745 367 161 726 267 649 617 217  66
##  [433] 327 661 161 161 161  83  57 262 590 534 161 726 327 327 413 161 118 726
##  [451] 275 275 465 726 161 145 381 525 465 283 717 707 726 735 262 465  57 327
##  [469] 327 726 465 707 670 145 161 118 275 726 460 707 726 161 726 726 145 707
##  [487] 262  57 327 249 485 649  74 283 327 707 275 275 681 673 599 643 421 275
##  [505] 515  57 717 161 599  91 549 311 626 175 543 543 291 262 571 673 534 421
##  [523] 137 571 262 543 161 506 433 413  91  11 275 111  66 626 262 275 262 649
##  [541] 335 413 413 249 262 262 413  57 465 735 543 291 698 413 175 506 525 626
##  [559] 515 707 241 581 335 413 249 275 681 183 460 681 543 549 413 707 670 335
##  [577] 413 673 726 543 335 485 549 291 681  57 562 413 515 175 497 670 262 167
##  [595] 211 515 291 275 681  11 137  26 425 189 211  11 343 745  26 381 599 581
##  [613] 681 101 515 183 275 745 262 413 249  57 667 111 327 562 465 249 571 167
##  [631] 745  26 262 327 327 571 413 262 698 394 111 367 167 745 745 241 327 394
##  [649] 543 413 249 111 249 549 534  57 394 534 335 643 681 343 425 335 681 211
##  [667] 485 283 571 515 581 211 661  11 275 249 111 599 262  57 681 183 590 311
##  [685] 534 608 241 167 689 262 217 485 211 485 735 543 745  26  57 562 543 485
##  [703]  74 291 183  66 262 241 581 763 670 599 735 571 667 283 534  74 298 670
##  [721] 425 249 670 189 460 267 413 311 225  83 249 670 262 283 118 515  57 283
##  [739] 283 283 335 217 127 485 590 249 745 485 335 735 275 189 161 111 262 283
##  [757]  91 319 763 262 291 581 335 283 485 549 275 118 167 670 745 291  57 262
##  [775] 763 262 497 275 635 291 283 211 599 649 275 581 549 413 571 534 562 599
##  [793] 543 681 161 534 670 643 394 283 211 283  57 670 673 101 283  11 262 745
##  [811]  57 137 262 291 485 599 367 101 283 667 735 275 635 161 211 275 275 433
##  [829] 413 707 262 515  83 490  26  83 667 515 726 681 707 262 485 649  83  57
##  [847] 175 534 626  57 291  57  91 543 571 241 490 275  66 249 161 490  26 225
##  [865] 707  57  57  83 571 381 626 225  83 490 381 275 367 763 497  83 381  83
##  [883] 562  83 211 717 490 145 137 145 617 745 225 490  57  11  83  83 667  83
##  [901] 534 367 211 262 460 381 617 381  83  66 275 745 590  83 465  83 145 681
##  [919]  83 745 145 283 262 635 745 460 506 381 534 145 635 534 534  26 137 211
##  [937] 670 643 175 189  83 145 381 145 525 211  74 698 590 490 319 667 590 745
##  [955] 433 525  83  26  91 381 231 145 441 506 698 617 478 698 211 590 111 343
##  [973] 735 745 571 161 211  83 118 617 298 635 534 617 145 211  74 581 283 231
##  [991] 617 534 525 562 275 763 735 118 534 189 335 735 111 478 478 525 211 745
## [1009] 745 745 562 421 562 939 745 661 745 745 127 745 562 608 617 343 343 275
## [1027] 745 649 145 343 617 465 661 497 211 231 745 478 441 485 745 175 262 211
## [1045] 217 145 534 745 118 111 478 381 525 525 735  83 217 707 745  91 745 745
## [1063]  74 311 161 267 111 211 262 145 283 478 231 137 367  74 465 343 211  91
## [1081] 460 515 343 283 189 478 231 217 698 225 189 189 433 745  91  91 283 211
## [1099] 311 161 367 562 534 137 137 571 525 590 698 283 571 745 189 381 211 137
## [1117] 763 745 478 506 433 745 506 506 111 478 571 161 175 231  83  66 649 441
## [1135] 478 635 441 525 726 433 681 111 343 497 189 111 343 571 673 267 441 506
## [1153] 275 211 343 478 118 283 617 217 525 189 111 231 111 231 681 649  66 283
## [1171] 231 343  66 681 478 275 525 283 161  91 311 534 745 525  66 367 506 111
## [1189] 111 670  11 189 381 381 283 534 381 275 311 283 635 381 111 189 275 217
## [1207] 745 189 231 490  57 717 111  83 319 381 381 441 525 707 189 231 698 497
## [1225] 698 367  26 525 267 211 262 267 525 111 275 381 118 283 661 118 118  66
## [1243] 241  66 343 161 670  57 189 211 275 327 673 175 626 231 217 433 433 343
## [1261] 267  91 275 267 698 525 267 118 506 421 717 217 698 441 506 497 707 497
## [1279] 231  66 343 698 189 231 118 343 327 681 421 118 698 698 275 421 275 497
## [1297] 327 525 506 441 698 275 441 161  66  66 745 183 490 571 217 497 367 367
## [1315] 367 343 441 525 441 267 275 217 497 497  66  91 311 698 581  66  57 745
## [1333] 217 283  66 118 118 225 327 343 421 275 367 497 381 275 367 217 381 707
## [1351] 118 460 275 421 497  66 267 367 421  26 217 707 367 726 421 217 327 327
## [1369] 745 497 745 367  66 421 283 681 698  66 118 327 707 465 571 497 367 534
## [1387] 717 217 735 441 118 745 726 118 217 267 745 745 118 745 939 506 161 441
## [1405] 275 381 217 367 118 267 283 465 726 327 497 497 327 421 343 421 497 343
## [1423] 506 497 698 381 118 267 441 275 745 421 161 367 441 534 707 217 161 118
## [1441] 534 717 343 217  66 590 490 497 726 283 698 707 571 465 267 726 421 534
## [1459] 311 745 681 497 118 327 534 571 275 327 161 343 217 745 745  74 367 327
## [1477] 515  57 534 161 465 670 441 421 161 118 311 327 189 381 717 161 698 670
## [1495] 327 241 534 161 161 367 161 118 161 534 343 745 327 707 681 745 175 327
## [1513] 478 231 726 441 465 217 118  66 525 327 118 661 735 735 581 319 183 673
## [1531] 433 275 367 241 626 707 698 707 311 111  83 311 763 161 161 161 367 118
## [1549] 262 137 343 161 161 367 726 217 726 217 707 118 311 327 726 367 478 161
## [1567] 327 275 726 161 726 643 275 515 145 161 534 183 549 626 275 549 241  57
## [1585]  57 161 763  83 681  83  83 681 617 670 670 670 490 478 127 543 661 506
## [1603]  57 465 497 217 698 617 145 343  74 497 367 497 497 367 497 698 689 478
## [1621] 549 267 497  66 661 249 745 681 327 275 433 525 421 661 161 161 441 161
## [1639] 465 161 327 726 534 231 211 327 161 617 707 726 465 707 726 465 161 145
## [1657] 726 145 465 161 698 118 707 137 161  57 161 327 189 327 707 118 145 726
## [1675] 698 262 707 267 267 267 267 267 267 118 726 262 525 189 327 327 319 726
## [1693] 145 707 707 161 327 327 649 465 161 590 161 145 590 145 275 275 649 327
## [1711] 381 707 161 275 267 175 175 681 490 225 735 626 394 617 689 241 335 175
## [1729] 137 661 183 241 137  66 145 617 549 534 745 137 478 183 183 189 394 217
## [1747] 698 707 327 183 161 291 534 275 478 262 707 745 465 478 231  57 506 189
## [1765] 111 460 673 726 262 707 562 111  57 111 367 599  91 335 275 262 161 698
## [1783] 698 433 497  91 485 506 111 111 241 343 490 225 161 441 506 298 421 127
## [1801] 421  83 262 571 590 707 643 367 262 433 617 217 275 275 413 327  66 283
## [1819] 534 745 241 497 225 460 111 698 145 617 137 413 707 707 745 649 267 735
## [1837] 681 425 433 275 635 465 343 478 617 745  66 745 343  11 534 497 231 590
## [1855] 490  83 275 189 441 490 275 211 478 145 343 525 635 189 735 726 275 298
## [1873] 478 441 735 661 137 534 335 726 367 118 581 231 506 327 525 667 707  83
## [1891] 726 460 327 745 745 698 161 217 670 367 763 421 497 726 211 343 497 311
## [1909] 717 681  57 562 231 327 145 189 161 726 118 381 465 211 217 327 698 217
## [1927]  66 581 367  74  91 735 497 698 617 571 698 735 267  66  26 707 515 726
## [1945] 465 118 118 485 189 241 118 311 497 661 381 534 506 525 497 217 673  66
## [1963] 267 626 745 441 319 735 275 381 421 543  83  57 127 127 127 298 681 217
## [1981] 433 161 275 689 667 167 707 726 726 525 717 327 726 137 497 311 267  74
## [1999] 311 327 175 161 726 267 161 161 161 626  11 161 534 726 394 490 485 478
## [2017] 590 189 118 211 525 525 189 707 161 726  57 275 649 763 161 145 465 465
## [2035] 465 465 441 367 726 101 460 460 635 111 231 167 626 111  57 460 111 635
## [2053] 275 111 460 275 241 698 590  91  57 626 311  66 111  91 441 534 635 506
## [2071] 506 231 343  66 111 231 267 111 343 681 111  91 497 698 161 111 111 698
## [2089] 478 698 161 231 231 111 231 298 231 534 275 111 441 111 441 161 745 441
## [2107] 635 698 571 571 421 698 231 413 465 497 298 319  66 367 635 726 707 189
## [2125] 441 745 543 275 745  74 241 275 745 735 262 626 571 441 217 111 478 283
## [2143] 367 726 161  83 118 497 145 745 137 745 497 745 225 367  83 525 534 717
## [2161] 465 425 231 673 735 726 118 745 515 267 735 745 543 485 327 425 267 717
## [2179] 661 262 161 262 327 275 111 275 726 327 327 189 283 525 145 211 189 145
## [2197] 327 111  57 275 649 413 267 460 571 635 275 478 267 460 267 590 460 717
## [2215] 145 335 534  57 735 101 327 161 275 291 707  83 217 283 367 460 726 465
## [2233] 267 231 525 670 617  91 211 101 745 490 343 460 413 267 465 745 111 161
## [2251] 311  74 534 189 460 217 460 111 590 145 343 161  74

str(lga)

##  num [1:2263] 66 460 590 57 57 57 673 57 425 311 ...

The full list of all 2263 LGA_ID values can be seen above.

Step 8 - Join the list

Join LGA_ID with Y12_Summary data set using School_Name for the join.

# Step 8 - Join LGA_ID with Y12 Summary using School_Name
Y12_LGA <- Y12_Summary %>% left_join(lga_id, by = "School_Name") %>% 
  select(School_Name, LGA_ID, Mean, Median, Max, Min, Std_Dev)
Y12_LGA

LGA_ID has now been added as an additional variable after School_Name in the Y12_LGA data set.

Step 9 - Subsetting I

In this step the Y12_LGA data frame was subset using first 10 observations (include all variables).

# Step 9 - Subset data frame using the first 10 observations (including all variables). 
j10 <- Y12_LGA
j20 <-j10[c(1:10),]
j20

Only the first 10 observations can be seen from the output above.

str(j20)

## tibble [10 x 7] (S3: tbl_df/tbl/data.frame)
##  $ School_Name: chr [1:10] "Bendigo Senior Secondary College" "Virtual School Victoria" "St Francis Xavier College" "Caulfield Grammar School" ...
##  $ LGA_ID     : num [1:10] 262 189 145 231 478 118 460 111 111 717
##  $ Mean       : num [1:10] 544 366 298 295 295 ...
##  $ Median     : num [1:10] 419 323 236 236 224 ...
##  $ Max        : num [1:10] 816 549 447 442 442 ...
##  $ Min        : num [1:10] 397 226 211 206 218 ...
##  $ Std_Dev    : num [1:10] 236 166 130 129 128 ...

As seen above all the Summary Statistical variables are numeric.

# Convert to a matrix
mtrx <- as.matrix(j20)
mtrx

##       School_Name                                       LGA_ID Mean      
##  [1,] "Bendigo Senior Secondary College"                "262"  "543.8667"
##  [2,] "Virtual School Victoria"                         "189"  "366.0667"
##  [3,] "St Francis Xavier College"                       "145"  "298.0000"
##  [4,] "Caulfield Grammar School"                        "231"  "294.8000"
##  [5,] "Mildura Senior College"                          "478"  "294.6667"
##  [6,] "Catholic Regional College Institute of Training" "118"  "272.0000"
##  [7,] "Wesley College"                                  "460"  "254.6667"
##  [8,] "SEDA College (Victoria)"                         "111"  "252.0000"
##  [9,] "Balwyn High School"                              "111"  "248.0000"
## [10,] "Wodonga Senior Secondary College"                "717"  "239.5333"
##       Median  Max     Min     Std_Dev   
##  [1,] "418.7" "815.8" "397.1" "235.7487"
##  [2,] "322.8" "549.1" "226.3" "165.6924"
##  [3,] "236.0" "447.0" "211.0" "129.6418"
##  [4,] "236.0" "442.2" "206.2" "128.5188"
##  [5,] "224.0" "442.0" "218.0" "127.6297"
##  [6,] "215.0" "408.0" "193.0" "118.2920"
##  [7,] "228.0" "382.0" "154.0" "116.3157"
##  [8,] "259.0" "378.0" "119.0" "129.6418"
##  [9,] "210.5" "372.0" "161.5" "110.1465"
## [10,] "182.1" "359.3" "177.2" "103.7499"

Once data frame j20 has been converted to a matrix the Summary Statistical variables become character in nature. This can be seen by the inverted commas around the numbers.

# Check if matrix.

is.matrix(mtrx)

## [1] TRUE

The check using function is.matrix() confirms that “mtrx” is indeed a matrix.

# Check whether matrix is char, logical,integer, factor, or logical.
str(mtrx)

##  chr [1:10, 1:7] "Bendigo Senior Secondary College" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:7] "School_Name" "LGA_ID" "Mean" "Median" ...

Using function str() provides further confirmation that all the summary statistic values are now character in nature. The reason for this is that all elements of the matrix must be of the same type, e.g. numeric or character.

Step 10 - Subsetting II

Only the first and the last(7th) variable in the Y12_LGA data set from step 9 have been subset and then saved as an R object (RDS) file.

# Step 10 - Subset data frame including only 1st and last variable in the data set.
# Save as an R.Data file
j30 <- j10[, c(1,7)]
j30

As shown above only School_Name and Std_Dev have been output as the first and last variables respectively.

j30 %>% 
saveRDS(file = "C:/Users/verno/OneDrive/Documents/RMIT/MATH2405/Data/j30.RDS")

As shown above, the output has been saved to a local directory as a RDS file.

Step 11 - Create a new Data Frame

Data frame df1 has been created comprising of:

An integer variable with values from “996” to “998” (Model_no)
An ordinal variable comprising of “Small”, “Medium” and “Large” (Size)

# Step 11 - Create new Data Frame df1 from scratch

df1 <- data.frame(col1 = 996:998,
                  col2 = factor(c("Small", "Medium", "Large"), 
                                levels = c("Small", "Medium", "Large"), ordered = TRUE))
# Assign names variables in data frame df1.
# The integer variable has been named "Model_no".
# The ordinal variable has been named "Size".
colnames(df1) <- c("Model_no", "Size")
# Assign row names
rownames(df1) <- c("Row 1", "Row 2", "Row 3")
# Verify structure and levels of variables in data frame df1
df1

str(df1)

## 'data.frame':    3 obs. of  2 variables:
##  $ Model_no: int  996 997 998
##  $ Size    : Ord.factor w/ 3 levels "Small"<"Medium"<..: 1 2 3

As shown a data frame with 3 observations and 2 variables has been created. Model_no is the integer variable, whereas Size is the ordinal variable.

# Create an additional numeric vector called Price
Price <- c(1349, 1649, 1949)
Price

## [1] 1349 1649 1949

str(Price)

##  num [1:3] 1349 1649 1949

Numeric vector Price has been created with 3 values.

# Use cbind() to bind Price to df1 above to create combined data frame df2
df2 <- cbind(df1, Price)
# Output data frame df2 to show 3 variables ("Model_no", "Size" and "Price")
df2

# Check attributes of data frame df2
attributes(df2)

## $names
## [1] "Model_no" "Size"     "Price"   
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] "Row 1" "Row 2" "Row 3"

# Check dimensions of data frame df2
dim(df2)

## [1] 3 3

As can be seen above, data frame df2 has been created with 3 variables, along with its attributes and dimensions.

Step 12 - Create another Data Frame

Data frame df3 has been created as follows:

An integer variable with values from “996” to “998” (Model_no).
A categorical variable with values as “Galaxy S21”, “Galaxy S21+”, “Galaxy S21 Ultra” (Model_name).

Data frame df3 was joined to df2 using a left join by Model_no.

Once joined categorical variable Model_name will form part of the new consolidated data frame df4

# Step 12a - Create another data frame (df3) and join with data frame df2 using "Model_no"
df3 <- data.frame(col1 = 996:998,
                  col2 = c("Galaxy S21", "Galaxy S21+", "Galaxy S21 Ultra"))
# Assign column names to df3
colnames(df3) <- c("Model_no", "Model_name")
df3

Data frame df3 has been created with Model_no and Model_name

# Step 12b - Join df4 to df3 using a left join by Model_no
df4 <- df3 %>% left_join(df2, by = "Model_no")
df4

Once joined, data frame df4 has been created with 4 variables.

References

All Schools FTE enrolments - Feb 2020, Department of Education and Training, viewed 12 January 2021, https://discover.data.vic.gov.au/dataset/all-schools-fte-enrolments-feb-2020
Boehmke, BC, 2016, Data Wrangling with R, Springer International Publishing, Cham, Switzerland.
School Locations - 2020, Department of Education and Training, viewed 12 January 2021, https://discover.data.vic.gov.au/dataset/school-locations-2020
Wickham, H, & Grolemund, G 2016, R for Data Science : Import, Tidy, Transform, Visualize, and Model Data, O’Reilly Media, Incorporated, Sebastopol, CA.