RPubs URL: https://rpubs.com/Chuver/714636
# The following packages are loaded for the purpose of Assignment 1
library(readr) # Useful for importing data
library(tidyr) # Useful for creating tidy data
library(dplyr) # Useful for data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr) # Useful for code readability
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
With the imminent start of the new school year, as well as the fact that I have two girls attending the same school, I was curious to find out more about the number of school enrolments across schools in Victoria and consequently have used this as the data set for my assignment.
The motivation behind choosing the All Schools FTE enrolments 2020 data was as a result of searching for the answers to:
The data is published by the Department of Education and Training and was sourced using https://www.data.vic.gov.au/.
The actual URL of the data is http://www.education.vic.gov.au/Documents/about/research/datavic/dv300-allschoolsFTEenrolmentsFeb2020.csv. The data itself is in CSV format.
The data is from a census taken in February 2020 of Victorian schools. The data set provides information on the full time equivalent (FTE) enrolments of students by school, school type, year level and sex. A student attending school on a full time basis is counted as 1.0 FTE. In the event where a student has attended more than 1 school, their time is apportioned accordingly, e.g. 0.6 FTE at school A and 0.4 FTE at school B.
From the URL for All Schools FTE enrolments - Feb2020 http://www.education.vic.gov.au/Documents/about/research/datavic/dv300-allschoolsFTEenrolmentsFeb2020.csv, the file was downloaded to the local laptop directory.
The downloaded file was then read into R using the read_csv function by assigning it to “schools_fte01”.
# Step 2 - Read CSV file from local directory in R and save as a data frame
schools_fte01 <- read_csv("C:/Users/verno/OneDrive/Documents/RMIT/MATH2405/Data/Schools_FTE_2020.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## Education_Sector = col_character(),
## School_Name = col_character(),
## School_Type = col_character(),
## School_Status = col_character(),
## CENSUS_TYPE = col_logical()
## )
## i Use `spec()` for the full column specifications.
# Display contents of read_csv
head(schools_fte01)
# Save as dataframe
schools_fte01 %>%
saveRDS(file = "C:/Users/verno/OneDrive/Documents/RMIT/MATH2405/Data/schools_fte01.RDS")
Data source: The data can be found at the following URL: http://www.education.vic.gov.au/Documents/about/research/datavic/dv300-allschoolsFTEenrolmentsFeb2020.csv
The data comprises of the following 2263 observations with 56 variables. The variables are broken down as follows:
In order to obtain greater understanding of the data set, the following was performed:
4.1 Checked the dimensions of the data frame using dim(schools_fte01).
4.2 Checked the variable data types using glimpse(schools_fte01).
4.3a Checked the column names and determined that some names had 2 backslashes as a suffix.
4.3b Removed the 2 backslashes as a suffix using gsub(schools_fte01)
4.4 Created new Year 12 data set for the purpose of summary statistics.
# Check structure of data set
# 4.1 Check data frame dimensions
dim(schools_fte01)
## [1] 2263 56
The output confirms that there are 2263 rows and 56 variables
# 4.2 Use glimpse() to check variable data types
glimpse(schools_fte01)
## Rows: 2,263
## Columns: 56
## $ Education_Sector <chr> "Catholic", "Catholic", "Catho...
## $ Entity_Type <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ School_No <dbl> 20, 25, 26, 28, 29, 30, 33, 35...
## $ School_Name <chr> "Parade College", "Simonds Cat...
## $ School_Type <chr> "Secondary", "Secondary", "Sec...
## $ School_Status <chr> "O", "O", "O", "O", "O", "O", ...
## $ `Prep Males Total""` <dbl> 0, 0, 0, 0, 30, 18, 0, 0, 11, ...
## $ `Prep Females Total""` <dbl> 0.0, 0.0, 0.0, 0.0, 12.0, 18.0...
## $ `Prep Total""` <dbl> 0.0, 0.0, 0.0, 0.0, 42.0, 36.0...
## $ `Year 1 Males Total""` <dbl> 0, 0, 0, 0, 14, 9, 0, 0, 9, 23...
## $ `Year 1 Females Total""` <dbl> 0, 0, 0, 0, 18, 20, 0, 0, 11, ...
## $ `Year 1 Total""` <dbl> 0, 0, 0, 0, 32, 29, 0, 0, 20, ...
## $ `Year 2 Males total""` <dbl> 0.0, 0.0, 0.0, 0.0, 15.0, 19.0...
## $ `Year 2 Females Total""` <dbl> 0, 0, 0, 0, 16, 14, 0, 0, 14, ...
## $ `Year 2 Total""` <dbl> 0.0, 0.0, 0.0, 0.0, 31.0, 33.0...
## $ `Year 3 Males Total""` <dbl> 0.0, 0.0, 0.0, 0.0, 18.0, 18.0...
## $ `Year 3 Females Total""` <dbl> 0, 0, 0, 0, 15, 25, 0, 0, 10, ...
## $ `Year 3 Total""` <dbl> 0.0, 0.0, 0.0, 0.0, 33.0, 43.0...
## $ `Year 4 Males Total""` <dbl> 0, 0, 0, 0, 13, 20, 0, 0, 14, ...
## $ `Year 4 Females Total""` <dbl> 0, 0, 0, 0, 21, 29, 0, 0, 17, ...
## $ `Year 4 Total""` <dbl> 0, 0, 0, 0, 34, 49, 0, 0, 31, ...
## $ `Year 5 Males Total""` <dbl> 0.0, 0.0, 0.0, 0.0, 16.0, 22.0...
## $ `Year 5 Females Total""` <dbl> 0, 0, 0, 0, 22, 24, 0, 0, 14, ...
## $ `Year 5 Total""` <dbl> 0.0, 0.0, 0.0, 0.0, 38.0, 46.0...
## $ `Year 6 Males Total""` <dbl> 0, 0, 0, 0, 18, 16, 0, 0, 14, ...
## $ `Year 6 Females Total""` <dbl> 0, 0, 0, 0, 27, 15, 0, 0, 20, ...
## $ `Year 6 Total""` <dbl> 0, 0, 0, 0, 45, 31, 0, 0, 34, ...
## $ `Primary Ungraded Males Total""` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ `Primary Ungraded Females Total""` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ `Primary Ungraded Total""` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ `Primary Total""` <dbl> 0.0, 0.0, 0.0, 0.0, 255.0, 267...
## $ `Year 7 Males Total""` <dbl> 333, 64, 54, 229, 0, 0, 104, 0...
## $ `Year 7 Females Total""` <dbl> 0, 0, 0, 0, 0, 0, 118, 156, 0,...
## $ `Year 7 Total""` <dbl> 333, 64, 54, 229, 0, 0, 222, 1...
## $ `Year 8 Males Total""` <dbl> 337, 85, 74, 233, 0, 0, 110, 0...
## $ `Year 8 Females Total""` <dbl> 0, 0, 0, 0, 0, 0, 110, 159, 0,...
## $ `Year 8 Total""` <dbl> 337, 85, 74, 233, 0, 0, 220, 1...
## $ `Year 9 Males Total""` <dbl> 351, 69, 94, 224, 0, 0, 85, 0,...
## $ `Year 9 Females Total""` <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
## $ `Year 9 Total""` <dbl> 351.0, 69.0, 94.0, 224.0, 0.0,...
## $ `Year 10 Males Total""` <dbl> 319, 73, 87, 236, 0, 0, 92, 0,...
## $ `Year 10 Females Total""` <dbl> 0, 0, 0, 0, 0, 0, 103, 162, 0,...
## $ `Year 10 Total""` <dbl> 319, 73, 87, 236, 0, 0, 195, 1...
## $ `Year 11 Males Total""` <dbl> 332, 68, 97, 228, 0, 0, 89, 0,...
## $ `Year 11 Females Total""` <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
## $ `Year 11 Total""` <dbl> 332.0, 68.0, 97.0, 228.0, 0.0,...
## $ `Year 12 Males""` <dbl> 281.0, 62.0, 71.0, 190.0, 0.0,...
## $ `Year 12 Females""` <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
## $ `Year 12 Total""` <dbl> 281.0, 62.0, 71.0, 190.0, 0.0,...
## $ `Secondary Ungraded Males Total""` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ `Secondary Ungraded Females Total""` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ `Secondary Ungraded Total""` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ `Secondary Total""` <dbl> 1953.0, 421.0, 477.0, 1340.0, ...
## $ `Grand Total""` <dbl> 1953.0, 421.0, 477.0, 1340.0, ...
## $ Year <dbl> 2020, 2020, 2020, 2020, 2020, ...
## $ CENSUS_TYPE <lgl> FALSE, FALSE, FALSE, FALSE, FA...
As can be seen above all the variables are in the correct data type and do not require amendment
# 4.3a Use names() to check column names
names(schools_fte01)
## [1] "Education_Sector"
## [2] "Entity_Type"
## [3] "School_No"
## [4] "School_Name"
## [5] "School_Type"
## [6] "School_Status"
## [7] "Prep Males Total\"\""
## [8] "Prep Females Total\"\""
## [9] "Prep Total\"\""
## [10] "Year 1 Males Total\"\""
## [11] "Year 1 Females Total\"\""
## [12] "Year 1 Total\"\""
## [13] "Year 2 Males total\"\""
## [14] "Year 2 Females Total\"\""
## [15] "Year 2 Total\"\""
## [16] "Year 3 Males Total\"\""
## [17] "Year 3 Females Total\"\""
## [18] "Year 3 Total\"\""
## [19] "Year 4 Males Total\"\""
## [20] "Year 4 Females Total\"\""
## [21] "Year 4 Total\"\""
## [22] "Year 5 Males Total\"\""
## [23] "Year 5 Females Total\"\""
## [24] "Year 5 Total\"\""
## [25] "Year 6 Males Total\"\""
## [26] "Year 6 Females Total\"\""
## [27] "Year 6 Total\"\""
## [28] "Primary Ungraded Males Total\"\""
## [29] "Primary Ungraded Females Total\"\""
## [30] "Primary Ungraded Total\"\""
## [31] "Primary Total\"\""
## [32] "Year 7 Males Total\"\""
## [33] "Year 7 Females Total\"\""
## [34] "Year 7 Total\"\""
## [35] "Year 8 Males Total\"\""
## [36] "Year 8 Females Total\"\""
## [37] "Year 8 Total\"\""
## [38] "Year 9 Males Total\"\""
## [39] "Year 9 Females Total\"\""
## [40] "Year 9 Total\"\""
## [41] "Year 10 Males Total\"\""
## [42] "Year 10 Females Total\"\""
## [43] "Year 10 Total\"\""
## [44] "Year 11 Males Total\"\""
## [45] "Year 11 Females Total\"\""
## [46] "Year 11 Total\"\""
## [47] "Year 12 Males\"\""
## [48] "Year 12 Females\"\""
## [49] "Year 12 Total\"\""
## [50] "Secondary Ungraded Males Total\"\""
## [51] "Secondary Ungraded Females Total\"\""
## [52] "Secondary Ungraded Total\"\""
## [53] "Secondary Total\"\""
## [54] "Grand Total\"\""
## [55] "Year"
## [56] "CENSUS_TYPE"
As can be seen from the output of column names above, many columns contain 2 backslashes in their name.
# 4.3b Use gsub() to remove "\"\"" from some column names
colnames(schools_fte01) = gsub("\"\"", "", colnames(schools_fte01))
colnames(schools_fte01)
## [1] "Education_Sector" "Entity_Type"
## [3] "School_No" "School_Name"
## [5] "School_Type" "School_Status"
## [7] "Prep Males Total" "Prep Females Total"
## [9] "Prep Total" "Year 1 Males Total"
## [11] "Year 1 Females Total" "Year 1 Total"
## [13] "Year 2 Males total" "Year 2 Females Total"
## [15] "Year 2 Total" "Year 3 Males Total"
## [17] "Year 3 Females Total" "Year 3 Total"
## [19] "Year 4 Males Total" "Year 4 Females Total"
## [21] "Year 4 Total" "Year 5 Males Total"
## [23] "Year 5 Females Total" "Year 5 Total"
## [25] "Year 6 Males Total" "Year 6 Females Total"
## [27] "Year 6 Total" "Primary Ungraded Males Total"
## [29] "Primary Ungraded Females Total" "Primary Ungraded Total"
## [31] "Primary Total" "Year 7 Males Total"
## [33] "Year 7 Females Total" "Year 7 Total"
## [35] "Year 8 Males Total" "Year 8 Females Total"
## [37] "Year 8 Total" "Year 9 Males Total"
## [39] "Year 9 Females Total" "Year 9 Total"
## [41] "Year 10 Males Total" "Year 10 Females Total"
## [43] "Year 10 Total" "Year 11 Males Total"
## [45] "Year 11 Females Total" "Year 11 Total"
## [47] "Year 12 Males" "Year 12 Females"
## [49] "Year 12 Total" "Secondary Ungraded Males Total"
## [51] "Secondary Ungraded Females Total" "Secondary Ungraded Total"
## [53] "Secondary Total" "Grand Total"
## [55] "Year" "CENSUS_TYPE"
The backslashes now no longer appear after execution of the gsub() function.
# 4.4 Create new dataset for Year 12 only the purpose of summary statistics
# This is achieved by selecting variables which contain "Year 12" in their name.
schools_year12 <- schools_fte01
schools_year12 %<>%
select(School_Name,`Year 12 Males`,`Year 12 Females`, `Year 12 Total`)
schools_year12
The revised data set above only shows Year 12 data.
In order for data to meet tidy data principles, tidy data needs to adhere to the following rules (Wickham and Grolemund (2016)):
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
The schools_year12 data set does not fully conform with all the tidy data principles. It passes principles 2 and 3 but fails principle 1, as every column header is a value rather than a variable. In order to tidy the data it is necessary to clean it by pivoting the Year 12 Boys, Year 12 Girls and Year 12 Total columns using the pivot_longer() function. The resulting new variables are Y12 and FTE respectively.
# Step 5 - reshape data using pivot_longer()
tidy1 <- schools_year12
tidy1 %<>%
pivot_longer(-School_Name, names_to = "Y12", values_to = "FTE") %>%
group_by(Y12) %>%
arrange(desc(FTE))
tidy1
As can be seen above the tidy1 version of the schools_year12 data set is in a tidy format. Each variable now has its own column.
The summary statistics will focus solely on Year 12 data and is grouped by School_Name using the tidied tidy1 data set from step 5.
# Step 6 - Summary statistics using tidy data as per step 5
Y12_Summary <- tidy1
Y12_Summary %<>%
group_by(School_Name) %>%
summarise(Mean = mean(FTE, na.rm = TRUE),
Median = median(FTE, na.rm = TRUE),
Max = max(FTE, na.rm = TRUE),
Min = min(FTE, na.rm = TRUE),
Std_Dev = sd(FTE, na.rm = TRUE)) %>%
arrange(desc(Mean))
## `summarise()` ungrouping output (override with `.groups` argument)
Y12_Summary
The Summary Statistics above show the data for Year 12 by School in descending order of Mean values.
A list of Local Government IDs (LGA_ID) for each school will be generated from the “Schools_Loc_2020.csv” data set. This file contains location specific information for all schools registered in Victoria. It will be joined to the Y12_Summary data set and added after School Name.
The URL for this separate data set is as follows: https://www.education.vic.gov.au/Documents/about/research/datavic/dv296-schoollocations2020.csv
Steps undertaken:
7.1 Used read_csv to read in downloaded file.
7.2 Used names() to determine relevant name for LGA ID look up
7.3 Used pull() on “Schools_Loc_2020.csv” (column 18) to extract LGA_ID
# Step 7.1 - Read file into R
lga_id <- read_csv("C:/Users/verno/OneDrive/Documents/RMIT/MATH2405/Data/Schools_Loc_2020.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_character(),
## Entity_Type = col_double(),
## School_No = col_double(),
## Address_Postcode = col_double(),
## Postal_Postcode = col_double(),
## LGA_ID = col_double(),
## X = col_double(),
## Y = col_double()
## )
## i Use `spec()` for the full column specifications.
lga_id
str(lga_id)
## tibble [2,263 x 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Education_Sector : chr [1:2263] "Catholic" "Catholic" "Catholic" "Catholic" ...
## $ Entity_Type : num [1:2263] 2 2 2 2 2 2 2 2 2 2 ...
## $ School_No : num [1:2263] 20 25 26 28 29 30 33 35 60 77 ...
## $ School_Name : chr [1:2263] "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
## $ School_Type : chr [1:2263] "Secondary" "Secondary" "Secondary" "Secondary" ...
## $ School_Status : chr [1:2263] "O" "O" "O" "O" ...
## $ Address_Line_1 : chr [1:2263] "1436 Plenty Road" "273 Victoria Street" "11 Westbury Street" "1431 Sturt Street" ...
## $ Address_Line_2 : chr [1:2263] NA NA NA NA ...
## $ Address_Town : chr [1:2263] "BUNDOORA" "WEST MELBOURNE" "ST KILDA EAST" "BALLARAT" ...
## $ Address_State : chr [1:2263] "VIC" "VIC" "VIC" "VIC" ...
## $ Address_Postcode : num [1:2263] 3083 3003 3183 3350 3350 ...
## $ Postal_Address_Line_1: chr [1:2263] "1436 Plenty Road" "273 Victoria Street" "PO Box 258" "Locked Bag 31" ...
## $ Postal_Address_Line_2: chr [1:2263] NA NA NA NA ...
## $ Postal_Town : chr [1:2263] "BUNDOORA" "WEST MELBOURNE" "ST KILDA" "BALLARAT" ...
## $ Postal_State : chr [1:2263] "VIC" "VIC" "VIC" "VIC" ...
## $ Postal_Postcode : num [1:2263] 3083 3003 3182 3350 3350 ...
## $ Full_Phone_No : chr [1:2263] "03 9468 3300" "03 9321 9200" "03 9529 6611" "03 5331 1688" ...
## $ LGA_ID : num [1:2263] 66 460 590 57 57 57 673 57 425 311 ...
## $ LGA_Name : chr [1:2263] "Banyule (C)" "Melbourne (C)" "Port Phillip (C)" "Ballarat (C)" ...
## $ X : num [1:2263] 145 145 145 144 144 ...
## $ Y : num [1:2263] -37.7 -37.8 -37.9 -37.5 -37.6 ...
## - attr(*, "spec")=
## .. cols(
## .. Education_Sector = col_character(),
## .. Entity_Type = col_double(),
## .. School_No = col_double(),
## .. School_Name = col_character(),
## .. School_Type = col_character(),
## .. School_Status = col_character(),
## .. Address_Line_1 = col_character(),
## .. Address_Line_2 = col_character(),
## .. Address_Town = col_character(),
## .. Address_State = col_character(),
## .. Address_Postcode = col_double(),
## .. Postal_Address_Line_1 = col_character(),
## .. Postal_Address_Line_2 = col_character(),
## .. Postal_Town = col_character(),
## .. Postal_State = col_character(),
## .. Postal_Postcode = col_double(),
## .. Full_Phone_No = col_character(),
## .. LGA_ID = col_double(),
## .. LGA_Name = col_character(),
## .. X = col_double(),
## .. Y = col_double()
## .. )
As shown above is the structure for the location data file.
# Step 7.2 Look up names for "Schools_Loc_2020.csv" data set
names(lga_id)
## [1] "Education_Sector" "Entity_Type" "School_No"
## [4] "School_Name" "School_Type" "School_Status"
## [7] "Address_Line_1" "Address_Line_2" "Address_Town"
## [10] "Address_State" "Address_Postcode" "Postal_Address_Line_1"
## [13] "Postal_Address_Line_2" "Postal_Town" "Postal_State"
## [16] "Postal_Postcode" "Full_Phone_No" "LGA_ID"
## [19] "LGA_Name" "X" "Y"
As shown above LGA_ID is the name for the numeric variable to be output. It is located in column 18 of the data set.
# Step 7.3 Generate list for LGA_ID (Local Government Authority ID) using pull() and lga_id data set
#
#
lga <- pull(lga_id, LGA_ID)
lga
## [1] 66 460 590 57 57 57 673 57 425 311 311 460 670 635 590 485 506 262
## [19] 167 211 57 111 635 189 189 381 381 283 485 127 681 485 515 515 413 291
## [37] 262 101 515 275 581 478 91 91 231 262 57 543 413 413 127 283 525 726
## [55] 413 327 335 335 394 101 145 490 111 626 26 241 515 549 735 735 745 433
## [73] 433 319 175 175 343 534 283 490 735 490 673 275 275 275 506 745 571 735
## [91] 763 137 111 26 137 549 137 460 241 681 137 111 262 335 626 183 183 506
## [109] 291 626 262 111 717 581 343 183 83 262 66 66 525 343 57 525 231 267
## [127] 91 74 506 225 635 525 57 617 617 745 145 635 460 608 189 635 681 262
## [145] 590 413 661 534 175 118 189 673 319 91 643 433 490 66 643 635 211 11
## [163] 506 562 111 231 343 91 635 111 91 262 225 534 525 698 311 571 478 217
## [181] 698 189 231 91 327 681 83 231 635 111 441 66 534 111 231 478 707 189
## [199] 497 525 367 698 283 189 635 145 506 275 267 343 189 111 525 698 525 433
## [217] 275 506 421 111 515 231 111 497 298 275 381 421 83 118 66 549 91 343
## [235] 441 689 562 506 189 118 698 137 681 275 66 311 626 497 267 497 763 506
## [253] 275 275 66 118 118 433 525 267 118 394 275 91 262 118 57 478 433 381
## [271] 497 267 283 698 343 497 381 127 745 189 211 66 667 497 506 267 211 661
## [289] 670 698 327 137 111 497 698 497 343 421 490 161 441 525 497 707 525 745
## [307] 673 506 175 571 161 506 189 698 670 763 118 311 57 421 367 667 745 367
## [325] 367 311 343 525 497 698 327 311 745 707 707 275 525 698 681 534 217 367
## [343] 217 217 497 506 421 421 698 497 726 231 681 311 506 673 11 311 367 717
## [361] 262 343 118 478 726 707 327 161 267 465 275 670 441 381 698 707 327 421
## [379] 217 118 267 189 118 118 145 275 118 465 571 367 118 717 433 57 745 421
## [397] 381 327 217 707 66 465 161 343 534 490 726 367 311 161 275 118 118 367
## [415] 707 275 707 421 118 465 745 534 497 745 367 161 726 267 649 617 217 66
## [433] 327 661 161 161 161 83 57 262 590 534 161 726 327 327 413 161 118 726
## [451] 275 275 465 726 161 145 381 525 465 283 717 707 726 735 262 465 57 327
## [469] 327 726 465 707 670 145 161 118 275 726 460 707 726 161 726 726 145 707
## [487] 262 57 327 249 485 649 74 283 327 707 275 275 681 673 599 643 421 275
## [505] 515 57 717 161 599 91 549 311 626 175 543 543 291 262 571 673 534 421
## [523] 137 571 262 543 161 506 433 413 91 11 275 111 66 626 262 275 262 649
## [541] 335 413 413 249 262 262 413 57 465 735 543 291 698 413 175 506 525 626
## [559] 515 707 241 581 335 413 249 275 681 183 460 681 543 549 413 707 670 335
## [577] 413 673 726 543 335 485 549 291 681 57 562 413 515 175 497 670 262 167
## [595] 211 515 291 275 681 11 137 26 425 189 211 11 343 745 26 381 599 581
## [613] 681 101 515 183 275 745 262 413 249 57 667 111 327 562 465 249 571 167
## [631] 745 26 262 327 327 571 413 262 698 394 111 367 167 745 745 241 327 394
## [649] 543 413 249 111 249 549 534 57 394 534 335 643 681 343 425 335 681 211
## [667] 485 283 571 515 581 211 661 11 275 249 111 599 262 57 681 183 590 311
## [685] 534 608 241 167 689 262 217 485 211 485 735 543 745 26 57 562 543 485
## [703] 74 291 183 66 262 241 581 763 670 599 735 571 667 283 534 74 298 670
## [721] 425 249 670 189 460 267 413 311 225 83 249 670 262 283 118 515 57 283
## [739] 283 283 335 217 127 485 590 249 745 485 335 735 275 189 161 111 262 283
## [757] 91 319 763 262 291 581 335 283 485 549 275 118 167 670 745 291 57 262
## [775] 763 262 497 275 635 291 283 211 599 649 275 581 549 413 571 534 562 599
## [793] 543 681 161 534 670 643 394 283 211 283 57 670 673 101 283 11 262 745
## [811] 57 137 262 291 485 599 367 101 283 667 735 275 635 161 211 275 275 433
## [829] 413 707 262 515 83 490 26 83 667 515 726 681 707 262 485 649 83 57
## [847] 175 534 626 57 291 57 91 543 571 241 490 275 66 249 161 490 26 225
## [865] 707 57 57 83 571 381 626 225 83 490 381 275 367 763 497 83 381 83
## [883] 562 83 211 717 490 145 137 145 617 745 225 490 57 11 83 83 667 83
## [901] 534 367 211 262 460 381 617 381 83 66 275 745 590 83 465 83 145 681
## [919] 83 745 145 283 262 635 745 460 506 381 534 145 635 534 534 26 137 211
## [937] 670 643 175 189 83 145 381 145 525 211 74 698 590 490 319 667 590 745
## [955] 433 525 83 26 91 381 231 145 441 506 698 617 478 698 211 590 111 343
## [973] 735 745 571 161 211 83 118 617 298 635 534 617 145 211 74 581 283 231
## [991] 617 534 525 562 275 763 735 118 534 189 335 735 111 478 478 525 211 745
## [1009] 745 745 562 421 562 939 745 661 745 745 127 745 562 608 617 343 343 275
## [1027] 745 649 145 343 617 465 661 497 211 231 745 478 441 485 745 175 262 211
## [1045] 217 145 534 745 118 111 478 381 525 525 735 83 217 707 745 91 745 745
## [1063] 74 311 161 267 111 211 262 145 283 478 231 137 367 74 465 343 211 91
## [1081] 460 515 343 283 189 478 231 217 698 225 189 189 433 745 91 91 283 211
## [1099] 311 161 367 562 534 137 137 571 525 590 698 283 571 745 189 381 211 137
## [1117] 763 745 478 506 433 745 506 506 111 478 571 161 175 231 83 66 649 441
## [1135] 478 635 441 525 726 433 681 111 343 497 189 111 343 571 673 267 441 506
## [1153] 275 211 343 478 118 283 617 217 525 189 111 231 111 231 681 649 66 283
## [1171] 231 343 66 681 478 275 525 283 161 91 311 534 745 525 66 367 506 111
## [1189] 111 670 11 189 381 381 283 534 381 275 311 283 635 381 111 189 275 217
## [1207] 745 189 231 490 57 717 111 83 319 381 381 441 525 707 189 231 698 497
## [1225] 698 367 26 525 267 211 262 267 525 111 275 381 118 283 661 118 118 66
## [1243] 241 66 343 161 670 57 189 211 275 327 673 175 626 231 217 433 433 343
## [1261] 267 91 275 267 698 525 267 118 506 421 717 217 698 441 506 497 707 497
## [1279] 231 66 343 698 189 231 118 343 327 681 421 118 698 698 275 421 275 497
## [1297] 327 525 506 441 698 275 441 161 66 66 745 183 490 571 217 497 367 367
## [1315] 367 343 441 525 441 267 275 217 497 497 66 91 311 698 581 66 57 745
## [1333] 217 283 66 118 118 225 327 343 421 275 367 497 381 275 367 217 381 707
## [1351] 118 460 275 421 497 66 267 367 421 26 217 707 367 726 421 217 327 327
## [1369] 745 497 745 367 66 421 283 681 698 66 118 327 707 465 571 497 367 534
## [1387] 717 217 735 441 118 745 726 118 217 267 745 745 118 745 939 506 161 441
## [1405] 275 381 217 367 118 267 283 465 726 327 497 497 327 421 343 421 497 343
## [1423] 506 497 698 381 118 267 441 275 745 421 161 367 441 534 707 217 161 118
## [1441] 534 717 343 217 66 590 490 497 726 283 698 707 571 465 267 726 421 534
## [1459] 311 745 681 497 118 327 534 571 275 327 161 343 217 745 745 74 367 327
## [1477] 515 57 534 161 465 670 441 421 161 118 311 327 189 381 717 161 698 670
## [1495] 327 241 534 161 161 367 161 118 161 534 343 745 327 707 681 745 175 327
## [1513] 478 231 726 441 465 217 118 66 525 327 118 661 735 735 581 319 183 673
## [1531] 433 275 367 241 626 707 698 707 311 111 83 311 763 161 161 161 367 118
## [1549] 262 137 343 161 161 367 726 217 726 217 707 118 311 327 726 367 478 161
## [1567] 327 275 726 161 726 643 275 515 145 161 534 183 549 626 275 549 241 57
## [1585] 57 161 763 83 681 83 83 681 617 670 670 670 490 478 127 543 661 506
## [1603] 57 465 497 217 698 617 145 343 74 497 367 497 497 367 497 698 689 478
## [1621] 549 267 497 66 661 249 745 681 327 275 433 525 421 661 161 161 441 161
## [1639] 465 161 327 726 534 231 211 327 161 617 707 726 465 707 726 465 161 145
## [1657] 726 145 465 161 698 118 707 137 161 57 161 327 189 327 707 118 145 726
## [1675] 698 262 707 267 267 267 267 267 267 118 726 262 525 189 327 327 319 726
## [1693] 145 707 707 161 327 327 649 465 161 590 161 145 590 145 275 275 649 327
## [1711] 381 707 161 275 267 175 175 681 490 225 735 626 394 617 689 241 335 175
## [1729] 137 661 183 241 137 66 145 617 549 534 745 137 478 183 183 189 394 217
## [1747] 698 707 327 183 161 291 534 275 478 262 707 745 465 478 231 57 506 189
## [1765] 111 460 673 726 262 707 562 111 57 111 367 599 91 335 275 262 161 698
## [1783] 698 433 497 91 485 506 111 111 241 343 490 225 161 441 506 298 421 127
## [1801] 421 83 262 571 590 707 643 367 262 433 617 217 275 275 413 327 66 283
## [1819] 534 745 241 497 225 460 111 698 145 617 137 413 707 707 745 649 267 735
## [1837] 681 425 433 275 635 465 343 478 617 745 66 745 343 11 534 497 231 590
## [1855] 490 83 275 189 441 490 275 211 478 145 343 525 635 189 735 726 275 298
## [1873] 478 441 735 661 137 534 335 726 367 118 581 231 506 327 525 667 707 83
## [1891] 726 460 327 745 745 698 161 217 670 367 763 421 497 726 211 343 497 311
## [1909] 717 681 57 562 231 327 145 189 161 726 118 381 465 211 217 327 698 217
## [1927] 66 581 367 74 91 735 497 698 617 571 698 735 267 66 26 707 515 726
## [1945] 465 118 118 485 189 241 118 311 497 661 381 534 506 525 497 217 673 66
## [1963] 267 626 745 441 319 735 275 381 421 543 83 57 127 127 127 298 681 217
## [1981] 433 161 275 689 667 167 707 726 726 525 717 327 726 137 497 311 267 74
## [1999] 311 327 175 161 726 267 161 161 161 626 11 161 534 726 394 490 485 478
## [2017] 590 189 118 211 525 525 189 707 161 726 57 275 649 763 161 145 465 465
## [2035] 465 465 441 367 726 101 460 460 635 111 231 167 626 111 57 460 111 635
## [2053] 275 111 460 275 241 698 590 91 57 626 311 66 111 91 441 534 635 506
## [2071] 506 231 343 66 111 231 267 111 343 681 111 91 497 698 161 111 111 698
## [2089] 478 698 161 231 231 111 231 298 231 534 275 111 441 111 441 161 745 441
## [2107] 635 698 571 571 421 698 231 413 465 497 298 319 66 367 635 726 707 189
## [2125] 441 745 543 275 745 74 241 275 745 735 262 626 571 441 217 111 478 283
## [2143] 367 726 161 83 118 497 145 745 137 745 497 745 225 367 83 525 534 717
## [2161] 465 425 231 673 735 726 118 745 515 267 735 745 543 485 327 425 267 717
## [2179] 661 262 161 262 327 275 111 275 726 327 327 189 283 525 145 211 189 145
## [2197] 327 111 57 275 649 413 267 460 571 635 275 478 267 460 267 590 460 717
## [2215] 145 335 534 57 735 101 327 161 275 291 707 83 217 283 367 460 726 465
## [2233] 267 231 525 670 617 91 211 101 745 490 343 460 413 267 465 745 111 161
## [2251] 311 74 534 189 460 217 460 111 590 145 343 161 74
str(lga)
## num [1:2263] 66 460 590 57 57 57 673 57 425 311 ...
The full list of all 2263 LGA_ID values can be seen above.
Join LGA_ID with Y12_Summary data set using School_Name for the join.
# Step 8 - Join LGA_ID with Y12 Summary using School_Name
Y12_LGA <- Y12_Summary %>% left_join(lga_id, by = "School_Name") %>%
select(School_Name, LGA_ID, Mean, Median, Max, Min, Std_Dev)
Y12_LGA
LGA_ID has now been added as an additional variable after School_Name in the Y12_LGA data set.
In this step the Y12_LGA data frame was subset using first 10 observations (include all variables).
# Step 9 - Subset data frame using the first 10 observations (including all variables).
j10 <- Y12_LGA
j20 <-j10[c(1:10),]
j20
Only the first 10 observations can be seen from the output above.
str(j20)
## tibble [10 x 7] (S3: tbl_df/tbl/data.frame)
## $ School_Name: chr [1:10] "Bendigo Senior Secondary College" "Virtual School Victoria" "St Francis Xavier College" "Caulfield Grammar School" ...
## $ LGA_ID : num [1:10] 262 189 145 231 478 118 460 111 111 717
## $ Mean : num [1:10] 544 366 298 295 295 ...
## $ Median : num [1:10] 419 323 236 236 224 ...
## $ Max : num [1:10] 816 549 447 442 442 ...
## $ Min : num [1:10] 397 226 211 206 218 ...
## $ Std_Dev : num [1:10] 236 166 130 129 128 ...
As seen above all the Summary Statistical variables are numeric.
# Convert to a matrix
mtrx <- as.matrix(j20)
mtrx
## School_Name LGA_ID Mean
## [1,] "Bendigo Senior Secondary College" "262" "543.8667"
## [2,] "Virtual School Victoria" "189" "366.0667"
## [3,] "St Francis Xavier College" "145" "298.0000"
## [4,] "Caulfield Grammar School" "231" "294.8000"
## [5,] "Mildura Senior College" "478" "294.6667"
## [6,] "Catholic Regional College Institute of Training" "118" "272.0000"
## [7,] "Wesley College" "460" "254.6667"
## [8,] "SEDA College (Victoria)" "111" "252.0000"
## [9,] "Balwyn High School" "111" "248.0000"
## [10,] "Wodonga Senior Secondary College" "717" "239.5333"
## Median Max Min Std_Dev
## [1,] "418.7" "815.8" "397.1" "235.7487"
## [2,] "322.8" "549.1" "226.3" "165.6924"
## [3,] "236.0" "447.0" "211.0" "129.6418"
## [4,] "236.0" "442.2" "206.2" "128.5188"
## [5,] "224.0" "442.0" "218.0" "127.6297"
## [6,] "215.0" "408.0" "193.0" "118.2920"
## [7,] "228.0" "382.0" "154.0" "116.3157"
## [8,] "259.0" "378.0" "119.0" "129.6418"
## [9,] "210.5" "372.0" "161.5" "110.1465"
## [10,] "182.1" "359.3" "177.2" "103.7499"
Once data frame j20 has been converted to a matrix the Summary Statistical variables become character in nature. This can be seen by the inverted commas around the numbers.
# Check if matrix.
is.matrix(mtrx)
## [1] TRUE
The check using function is.matrix() confirms that “mtrx” is indeed a matrix.
# Check whether matrix is char, logical,integer, factor, or logical.
str(mtrx)
## chr [1:10, 1:7] "Bendigo Senior Secondary College" ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:7] "School_Name" "LGA_ID" "Mean" "Median" ...
Using function str() provides further confirmation that all the summary statistic values are now character in nature. The reason for this is that all elements of the matrix must be of the same type, e.g. numeric or character.
Only the first and the last(7th) variable in the Y12_LGA data set from step 9 have been subset and then saved as an R object (RDS) file.
# Step 10 - Subset data frame including only 1st and last variable in the data set.
# Save as an R.Data file
j30 <- j10[, c(1,7)]
j30
As shown above only School_Name and Std_Dev have been output as the first and last variables respectively.
j30 %>%
saveRDS(file = "C:/Users/verno/OneDrive/Documents/RMIT/MATH2405/Data/j30.RDS")
As shown above, the output has been saved to a local directory as a RDS file.
Data frame df1 has been created comprising of:
An integer variable with values from “996” to “998” (Model_no)
An ordinal variable comprising of “Small”, “Medium” and “Large” (Size)
# Step 11 - Create new Data Frame df1 from scratch
df1 <- data.frame(col1 = 996:998,
col2 = factor(c("Small", "Medium", "Large"),
levels = c("Small", "Medium", "Large"), ordered = TRUE))
# Assign names variables in data frame df1.
# The integer variable has been named "Model_no".
# The ordinal variable has been named "Size".
colnames(df1) <- c("Model_no", "Size")
# Assign row names
rownames(df1) <- c("Row 1", "Row 2", "Row 3")
# Verify structure and levels of variables in data frame df1
df1
str(df1)
## 'data.frame': 3 obs. of 2 variables:
## $ Model_no: int 996 997 998
## $ Size : Ord.factor w/ 3 levels "Small"<"Medium"<..: 1 2 3
As shown a data frame with 3 observations and 2 variables has been created. Model_no is the integer variable, whereas Size is the ordinal variable.
# Create an additional numeric vector called Price
Price <- c(1349, 1649, 1949)
Price
## [1] 1349 1649 1949
str(Price)
## num [1:3] 1349 1649 1949
Numeric vector Price has been created with 3 values.
# Use cbind() to bind Price to df1 above to create combined data frame df2
df2 <- cbind(df1, Price)
# Output data frame df2 to show 3 variables ("Model_no", "Size" and "Price")
df2
# Check attributes of data frame df2
attributes(df2)
## $names
## [1] "Model_no" "Size" "Price"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] "Row 1" "Row 2" "Row 3"
# Check dimensions of data frame df2
dim(df2)
## [1] 3 3
As can be seen above, data frame df2 has been created with 3 variables, along with its attributes and dimensions.
Data frame df3 has been created as follows:
An integer variable with values from “996” to “998” (Model_no).
A categorical variable with values as “Galaxy S21”, “Galaxy S21+”, “Galaxy S21 Ultra” (Model_name).
Data frame df3 was joined to df2 using a left join by Model_no.
Once joined categorical variable Model_name will form part of the new consolidated data frame df4
# Step 12a - Create another data frame (df3) and join with data frame df2 using "Model_no"
df3 <- data.frame(col1 = 996:998,
col2 = c("Galaxy S21", "Galaxy S21+", "Galaxy S21 Ultra"))
# Assign column names to df3
colnames(df3) <- c("Model_no", "Model_name")
df3
Data frame df3 has been created with Model_no and Model_name
# Step 12b - Join df4 to df3 using a left join by Model_no
df4 <- df3 %>% left_join(df2, by = "Model_no")
df4
Once joined, data frame df4 has been created with 4 variables.
All Schools FTE enrolments - Feb 2020, Department of Education and Training, viewed 12 January 2021, https://discover.data.vic.gov.au/dataset/all-schools-fte-enrolments-feb-2020
Boehmke, BC, 2016, Data Wrangling with R, Springer International Publishing, Cham, Switzerland.
School Locations - 2020, Department of Education and Training, viewed 12 January 2021, https://discover.data.vic.gov.au/dataset/school-locations-2020
Wickham, H, & Grolemund, G 2016, R for Data Science : Import, Tidy, Transform, Visualize, and Model Data, O’Reilly Media, Incorporated, Sebastopol, CA.