1.Locate Data.

I was intrigued to compare Victorian school enrollments between Grade 3, Grade 5, Grade 7 and Grade 9 and if Naplan Results contributed to enrollements in certain LGA. My intention was to obtain the Naplan results by LGA for 2021, as this is yet to be published , I have used data from 2019 (due to Napaln being cancelled in 2020).

Key objective will be to understand which school in Victoria had the highest number of enrollments for Grade 3, Grade 5, Grade 7 and Grade 9 in Year 2020, and if the data can confirm the school was located in a LGA with higher Naplan Results .

I have choosen the following datasets for my assignemnet:

All Schools FTE ((Full Time Equivalent)enrolments 2020 data and VCAMS percentage of students achieving national benchmark in literacy/numeracy data was was published by the Department of Education and Training and was sourced using https://www.data.vic.gov.au/
URL of the All Schools FTE enrolments 2020 data is http://www.education.vic.gov.au/Documents/about/research/datavic/dv300-allschoolsFTEenrolmentsFeb2020.csv The was obtained in CSV format.
The NAPLAN data is collected annually (May).It is a full cohort common test for Literacy and Numeracy, of students in all school sectors in Years 3, 5, 7 and 9, unless exempted for disability or recently arrived (one year or less) non-English speaking migrants.

Breakdown of data between the state level and for Local Government Areas(LGA) have been separated as they are calculated differently. State level data include plausible values for absent children not officially exempt from testing; LGA data exclude these records. For the purpose of this assignments data from LGA has been used.

URL of the VCAMS percentage of students achieving national benchmark in literacy https://discover.data.vic.gov.au/dataset/vcams-percentage-of-students-achieving-national-benchmark-in-literacy This was obtained in excel format.
URL of the VCAMS percentage of students achieving national benchmark in numeracy https://discover.data.vic.gov.au/dataset/vcams-percentage-of-students-achieving-national-benchmark-in-numeracy This was obtained in excel format.

2.Read/Import Data

The file for All Schools FTE enrollments 2020 was downloaded to my laptop directory.The read_csv function in R as been used to read the file,have assigned it to “2020_schoolsfte”.
- To display contents of “schoolsfte_2020”, I have used str() function to see the structure of the internal data.
- converting to data.frame()
- confirming its converted to a dataframe

# Importing CSV file from local directory in R
schoolsfte_2020 <- read_csv("dv300-allschoolsFTEenrolmentsFeb2020.csv")

## Rows: 2263 Columns: 56

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): Education_Sector, School_Name, School_Type, School_Status
## dbl (51): Entity_Type, School_No, Prep Males Total", Prep Females Total", Pr...
## lgl  (1): CENSUS_TYPE

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(schoolsfte_2020)

## spec_tbl_df [2,263 × 56] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Education_Sector                 : chr [1:2263] "Catholic" "Catholic" "Catholic" "Catholic" ...
##  $ Entity_Type                      : num [1:2263] 2 2 2 2 2 2 2 2 2 2 ...
##  $ School_No                        : num [1:2263] 20 25 26 28 29 30 33 35 60 77 ...
##  $ School_Name                      : chr [1:2263] "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
##  $ School_Type                      : chr [1:2263] "Secondary" "Secondary" "Secondary" "Secondary" ...
##  $ School_Status                    : chr [1:2263] "O" "O" "O" "O" ...
##  $ Prep Males Total"                : num [1:2263] 0 0 0 0 30 18 0 0 11 20 ...
##  $ Prep Females Total"              : num [1:2263] 0 0 0 0 12 18 0 0 7 24 ...
##  $ Prep Total"                      : num [1:2263] 0 0 0 0 42 36 0 0 18 44 ...
##  $ Year 1 Males Total"              : num [1:2263] 0 0 0 0 14 9 0 0 9 23 ...
##  $ Year 1 Females Total"            : num [1:2263] 0 0 0 0 18 20 0 0 11 25 ...
##  $ Year 1 Total"                    : num [1:2263] 0 0 0 0 32 29 0 0 20 48 ...
##  $ Year 2 Males total"              : num [1:2263] 0 0 0 0 15 19 0 0 11.3 20 ...
##  $ Year 2 Females Total"            : num [1:2263] 0 0 0 0 16 14 0 0 14 30 ...
##  $ Year 2 Total"                    : num [1:2263] 0 0 0 0 31 33 0 0 25.3 50 ...
##  $ Year 3 Males Total"              : num [1:2263] 0 0 0 0 18 18 0 0 17 23 ...
##  $ Year 3 Females Total"            : num [1:2263] 0 0 0 0 15 25 0 0 10 24 ...
##  $ Year 3 Total"                    : num [1:2263] 0 0 0 0 33 43 0 0 27 47 ...
##  $ Year 4 Males Total"              : num [1:2263] 0 0 0 0 13 20 0 0 14 18 ...
##  $ Year 4 Females Total"            : num [1:2263] 0 0 0 0 21 29 0 0 17 19 ...
##  $ Year 4 Total"                    : num [1:2263] 0 0 0 0 34 49 0 0 31 37 ...
##  $ Year 5 Males Total"              : num [1:2263] 0 0 0 0 16 22 0 0 19 17 ...
##  $ Year 5 Females Total"            : num [1:2263] 0 0 0 0 22 24 0 0 14 26 ...
##  $ Year 5 Total"                    : num [1:2263] 0 0 0 0 38 46 0 0 33 43 ...
##  $ Year 6 Males Total"              : num [1:2263] 0 0 0 0 18 16 0 0 14 24 ...
##  $ Year 6 Females Total"            : num [1:2263] 0 0 0 0 27 15 0 0 20 24 ...
##  $ Year 6 Total"                    : num [1:2263] 0 0 0 0 45 31 0 0 34 48 ...
##  $ Primary Ungraded Males Total"    : num [1:2263] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Primary Ungraded Females Total"  : num [1:2263] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Primary Ungraded Total"          : num [1:2263] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Primary Total"                   : num [1:2263] 0 0 0 0 255 ...
##  $ Year 7 Males Total"              : num [1:2263] 333 64 54 229 0 0 104 0 0 0 ...
##  $ Year 7 Females Total"            : num [1:2263] 0 0 0 0 0 0 118 156 0 0 ...
##  $ Year 7 Total"                    : num [1:2263] 333 64 54 229 0 0 222 156 0 0 ...
##  $ Year 8 Males Total"              : num [1:2263] 337 85 74 233 0 0 110 0 0 0 ...
##  $ Year 8 Females Total"            : num [1:2263] 0 0 0 0 0 0 110 159 0 0 ...
##  $ Year 8 Total"                    : num [1:2263] 337 85 74 233 0 0 220 159 0 0 ...
##  $ Year 9 Males Total"              : num [1:2263] 351 69 94 224 0 0 85 0 0 0 ...
##  $ Year 9 Females Total"            : num [1:2263] 0 0 0 0 0 ...
##  $ Year 9 Total"                    : num [1:2263] 351 69 94 224 0 ...
##  $ Year 10 Males Total"             : num [1:2263] 319 73 87 236 0 0 92 0 0 0 ...
##  $ Year 10 Females Total"           : num [1:2263] 0 0 0 0 0 0 103 162 0 0 ...
##  $ Year 10 Total"                   : num [1:2263] 319 73 87 236 0 0 195 162 0 0 ...
##  $ Year 11 Males Total"             : num [1:2263] 332 68 97 228 0 0 89 0 0 0 ...
##  $ Year 11 Females Total"           : num [1:2263] 0 0 0 0 0 ...
##  $ Year 11 Total"                   : num [1:2263] 332 68 97 228 0 ...
##  $ Year 12 Males"                   : num [1:2263] 281 62 71 190 0 0 76.4 0 0 0 ...
##  $ Year 12 Females"                 : num [1:2263] 0 0 0 0 0 0 85.6 151 0 0 ...
##  $ Year 12 Total"                   : num [1:2263] 281 62 71 190 0 0 162 151 0 0 ...
##  $ Secondary Ungraded Males Total"  : num [1:2263] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Secondary Ungraded Females Total": num [1:2263] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Secondary Ungraded Total"        : num [1:2263] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Secondary Total"                 : num [1:2263] 1953 421 477 1340 0 ...
##  $ Grand Total"                     : num [1:2263] 1953 421 477 1340 255 ...
##  $ Year                             : num [1:2263] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ CENSUS_TYPE                      : logi [1:2263] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Education_Sector = col_character(),
##   ..   Entity_Type = col_double(),
##   ..   School_No = col_double(),
##   ..   School_Name = col_character(),
##   ..   School_Type = col_character(),
##   ..   School_Status = col_character(),
##   ..   `Prep Males Total"` = col_double(),
##   ..   `Prep Females Total"` = col_double(),
##   ..   `Prep Total"` = col_double(),
##   ..   `Year 1 Males Total"` = col_double(),
##   ..   `Year 1 Females Total"` = col_double(),
##   ..   `Year 1 Total"` = col_double(),
##   ..   `Year 2 Males total"` = col_double(),
##   ..   `Year 2 Females Total"` = col_double(),
##   ..   `Year 2 Total"` = col_double(),
##   ..   `Year 3 Males Total"` = col_double(),
##   ..   `Year 3 Females Total"` = col_double(),
##   ..   `Year 3 Total"` = col_double(),
##   ..   `Year 4 Males Total"` = col_double(),
##   ..   `Year 4 Females Total"` = col_double(),
##   ..   `Year 4 Total"` = col_double(),
##   ..   `Year 5 Males Total"` = col_double(),
##   ..   `Year 5 Females Total"` = col_double(),
##   ..   `Year 5 Total"` = col_double(),
##   ..   `Year 6 Males Total"` = col_double(),
##   ..   `Year 6 Females Total"` = col_double(),
##   ..   `Year 6 Total"` = col_double(),
##   ..   `Primary Ungraded Males Total"` = col_double(),
##   ..   `Primary Ungraded Females Total"` = col_double(),
##   ..   `Primary Ungraded Total"` = col_double(),
##   ..   `Primary Total"` = col_double(),
##   ..   `Year 7 Males Total"` = col_double(),
##   ..   `Year 7 Females Total"` = col_double(),
##   ..   `Year 7 Total"` = col_double(),
##   ..   `Year 8 Males Total"` = col_double(),
##   ..   `Year 8 Females Total"` = col_double(),
##   ..   `Year 8 Total"` = col_double(),
##   ..   `Year 9 Males Total"` = col_double(),
##   ..   `Year 9 Females Total"` = col_double(),
##   ..   `Year 9 Total"` = col_double(),
##   ..   `Year 10 Males Total"` = col_double(),
##   ..   `Year 10 Females Total"` = col_double(),
##   ..   `Year 10 Total"` = col_double(),
##   ..   `Year 11 Males Total"` = col_double(),
##   ..   `Year 11 Females Total"` = col_double(),
##   ..   `Year 11 Total"` = col_double(),
##   ..   `Year 12 Males"` = col_double(),
##   ..   `Year 12 Females"` = col_double(),
##   ..   `Year 12 Total"` = col_double(),
##   ..   `Secondary Ungraded Males Total"` = col_double(),
##   ..   `Secondary Ungraded Females Total"` = col_double(),
##   ..   `Secondary Ungraded Total"` = col_double(),
##   ..   `Secondary Total"` = col_double(),
##   ..   `Grand Total"` = col_double(),
##   ..   Year = col_double(),
##   ..   CENSUS_TYPE = col_logical()
##   .. )
##  - attr(*, "problems")=<externalptr>

schoolsfte_2020 <-as.data.frame(schoolsfte_2020)

glimpse(schoolsfte_2020)

## Rows: 2,263
## Columns: 56
## $ Education_Sector                    <chr> "Catholic", "Catholic", "Catholic"…
## $ Entity_Type                         <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ School_No                           <dbl> 20, 25, 26, 28, 29, 30, 33, 35, 60…
## $ School_Name                         <chr> "Parade College", "Simonds Catholi…
## $ School_Type                         <chr> "Secondary", "Secondary", "Seconda…
## $ School_Status                       <chr> "O", "O", "O", "O", "O", "O", "O",…
## $ `Prep Males Total"`                 <dbl> 0, 0, 0, 0, 30, 18, 0, 0, 11, 20, …
## $ `Prep Females Total"`               <dbl> 0.0, 0.0, 0.0, 0.0, 12.0, 18.0, 0.…
## $ `Prep Total"`                       <dbl> 0.0, 0.0, 0.0, 0.0, 42.0, 36.0, 0.…
## $ `Year 1 Males Total"`               <dbl> 0, 0, 0, 0, 14, 9, 0, 0, 9, 23, 25…
## $ `Year 1 Females Total"`             <dbl> 0, 0, 0, 0, 18, 20, 0, 0, 11, 25, …
## $ `Year 1 Total"`                     <dbl> 0, 0, 0, 0, 32, 29, 0, 0, 20, 48, …
## $ `Year 2 Males total"`               <dbl> 0.0, 0.0, 0.0, 0.0, 15.0, 19.0, 0.…
## $ `Year 2 Females Total"`             <dbl> 0, 0, 0, 0, 16, 14, 0, 0, 14, 30, …
## $ `Year 2 Total"`                     <dbl> 0.0, 0.0, 0.0, 0.0, 31.0, 33.0, 0.…
## $ `Year 3 Males Total"`               <dbl> 0.0, 0.0, 0.0, 0.0, 18.0, 18.0, 0.…
## $ `Year 3 Females Total"`             <dbl> 0, 0, 0, 0, 15, 25, 0, 0, 10, 24, …
## $ `Year 3 Total"`                     <dbl> 0.0, 0.0, 0.0, 0.0, 33.0, 43.0, 0.…
## $ `Year 4 Males Total"`               <dbl> 0, 0, 0, 0, 13, 20, 0, 0, 14, 18, …
## $ `Year 4 Females Total"`             <dbl> 0, 0, 0, 0, 21, 29, 0, 0, 17, 19, …
## $ `Year 4 Total"`                     <dbl> 0, 0, 0, 0, 34, 49, 0, 0, 31, 37, …
## $ `Year 5 Males Total"`               <dbl> 0.0, 0.0, 0.0, 0.0, 16.0, 22.0, 0.…
## $ `Year 5 Females Total"`             <dbl> 0, 0, 0, 0, 22, 24, 0, 0, 14, 26, …
## $ `Year 5 Total"`                     <dbl> 0.0, 0.0, 0.0, 0.0, 38.0, 46.0, 0.…
## $ `Year 6 Males Total"`               <dbl> 0, 0, 0, 0, 18, 16, 0, 0, 14, 24, …
## $ `Year 6 Females Total"`             <dbl> 0, 0, 0, 0, 27, 15, 0, 0, 20, 24, …
## $ `Year 6 Total"`                     <dbl> 0, 0, 0, 0, 45, 31, 0, 0, 34, 48, …
## $ `Primary Ungraded Males Total"`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `Primary Ungraded Females Total"`   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `Primary Ungraded Total"`           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `Primary Total"`                    <dbl> 0.0, 0.0, 0.0, 0.0, 255.0, 267.0, …
## $ `Year 7 Males Total"`               <dbl> 333, 64, 54, 229, 0, 0, 104, 0, 0,…
## $ `Year 7 Females Total"`             <dbl> 0, 0, 0, 0, 0, 0, 118, 156, 0, 0, …
## $ `Year 7 Total"`                     <dbl> 333, 64, 54, 229, 0, 0, 222, 156, …
## $ `Year 8 Males Total"`               <dbl> 337, 85, 74, 233, 0, 0, 110, 0, 0,…
## $ `Year 8 Females Total"`             <dbl> 0, 0, 0, 0, 0, 0, 110, 159, 0, 0, …
## $ `Year 8 Total"`                     <dbl> 337, 85, 74, 233, 0, 0, 220, 159, …
## $ `Year 9 Males Total"`               <dbl> 351, 69, 94, 224, 0, 0, 85, 0, 0, …
## $ `Year 9 Females Total"`             <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 122.…
## $ `Year 9 Total"`                     <dbl> 351.0, 69.0, 94.0, 224.0, 0.0, 0.0…
## $ `Year 10 Males Total"`              <dbl> 319, 73, 87, 236, 0, 0, 92, 0, 0, …
## $ `Year 10 Females Total"`            <dbl> 0, 0, 0, 0, 0, 0, 103, 162, 0, 0, …
## $ `Year 10 Total"`                    <dbl> 319, 73, 87, 236, 0, 0, 195, 162, …
## $ `Year 11 Males Total"`              <dbl> 332, 68, 97, 228, 0, 0, 89, 0, 0, …
## $ `Year 11 Females Total"`            <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 106.…
## $ `Year 11 Total"`                    <dbl> 332.0, 68.0, 97.0, 228.0, 0.0, 0.0…
## $ `Year 12 Males"`                    <dbl> 281.0, 62.0, 71.0, 190.0, 0.0, 0.0…
## $ `Year 12 Females"`                  <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 85.6…
## $ `Year 12 Total"`                    <dbl> 281.0, 62.0, 71.0, 190.0, 0.0, 0.0…
## $ `Secondary Ungraded Males Total"`   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `Secondary Ungraded Females Total"` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `Secondary Ungraded Total"`         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `Secondary Total"`                  <dbl> 1953.0, 421.0, 477.0, 1340.0, 0.0,…
## $ `Grand Total"`                      <dbl> 1953.0, 421.0, 477.0, 1340.0, 255.…
## $ Year                                <dbl> 2020, 2020, 2020, 2020, 2020, 2020…
## $ CENSUS_TYPE                         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,…

head() and tail() functions as been used to see first five rows and last five rowes accordingly.It displays the contents of the data frame as well.

head(schoolsfte_2020)

tail(schoolsfte_2020)

The file for Naplan Literacy was downloaded to my laptop directory in excel format.The read_excel function in R as been used to read the file,have assigned it to “naplan_literacy”, of the excel file the following sheet “NAPLAN_reading_LGA” has been extracted for only Year 2018,2019 results. The file contained data from Year 2008 to Year 2019, but for the purpouse of this anlaysis I will onlycompare the Year 2018,2019 Naplan results and Year 2020 enrollments.

naplan_literacy <- read_excel("~/Desktop/practice/data/VCAMS_Indicator_Literacy.xlsx", sheet = "NAPLAN_reading_LGA",range = cell_rows(3167:3812), col_names = FALSE)

## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...

R has automatically allocated new column names when the data was imported, which will be amended later on.
To display contents of “naplan_literacy”, I have used str() function to see the structure of the internal data.
renaming only the column names that I will reqcuire for the analysis
converting to data.frame()
confirming its converted to a dataframe

str(naplan_literacy)

## tibble [646 × 8] (S3: tbl_df/tbl/data.frame)
##  $ ...1: num [1:646] 2018 2018 2018 2018 2018 ...
##  $ ...2: chr [1:646] "Year 3" "Year 3" "Year 3" "Year 3" ...
##  $ ...3: num [1:646] 20110 20260 20570 20660 20740 ...
##  $ ...4: chr [1:646] "Alpine (S)" "Ararat (Rc)" "Ballarat (C)" "Banyule (C)" ...
##  $ ...5: num [1:646] 11.1 11.1 11.1 11.1 11.1 11.1 11.1 11.1 11.1 11.1 ...
##  $ ...6: chr [1:646] "126" "111" "1270" "1520" ...
##  $ ...7: chr [1:646] "132" "125" "1364" "1571" ...
##  $ ...8: chr [1:646] "0.95454545454545459" "0.88800000000000001" "0.93108504398826974" "0.96753660089115212" ...

# renaming the col.names
colnames(naplan_literacy)[c(1:4, 8)] <- c("year", "grade","LGA_id","LGA_Name","results_literacy_percentage")

# converting to data.frame()
naplan_literacy <-as.data.frame(naplan_literacy)

# confirming its converted to a dataframe
str(naplan_literacy)

## 'data.frame':    646 obs. of  8 variables:
##  $ year                       : num  2018 2018 2018 2018 2018 ...
##  $ grade                      : chr  "Year 3" "Year 3" "Year 3" "Year 3" ...
##  $ LGA_id                     : num  20110 20260 20570 20660 20740 ...
##  $ LGA_Name                   : chr  "Alpine (S)" "Ararat (Rc)" "Ballarat (C)" "Banyule (C)" ...
##  $ ...5                       : num  11.1 11.1 11.1 11.1 11.1 11.1 11.1 11.1 11.1 11.1 ...
##  $ ...6                       : chr  "126" "111" "1270" "1520" ...
##  $ ...7                       : chr  "132" "125" "1364" "1571" ...
##  $ results_literacy_percentage: chr  "0.95454545454545459" "0.88800000000000001" "0.93108504398826974" "0.96753660089115212" ...

head(naplan_literacy)

The file for Naplan Numeracy was downloaded to my laptop directory in excel format.The read_excel function in R as been used to read the file,have assigned it to “naplan_numeracy”, of the excel file the following sheet “NAPLAN_numeracy_LGA” has been extracted for only Year 2018,2019 results. The file contained data from Year 2008 to Year 2019, but for the purpouse of this anlaysis I will onlycompare the Year 2018,2019 Naplan results and Year 2020 enrollments.

naplan_numeracy <- read_excel("~/Desktop/practice/data/VCAMS_Indicator_Numeracy.xlsx", 
                              sheet = "NAPLAN_numeracy_LGA",range = cell_rows(3167:3812), col_names = FALSE)

## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...

# renaming only the column names that I will reqcuire for the analysis
colnames(naplan_numeracy)[c(1:4, 8)] <- c("year", "grade","LGA_id","LGA_Name","results_numeracy_percentage")

# convering to data.frame()
naplan_numeracy <-as.data.frame(naplan_numeracy)

# To display contents of "naplan_numeracy", I have used str() function to see the structure of the internal data.
str(naplan_numeracy)

## 'data.frame':    646 obs. of  8 variables:
##  $ year                       : num  2018 2018 2018 2018 2018 ...
##  $ grade                      : chr  "Year 3" "Year 3" "Year 3" "Year 3" ...
##  $ LGA_id                     : num  20110 20260 20570 20660 20740 ...
##  $ LGA_Name                   : chr  "Alpine (S)" "Ararat (Rc)" "Ballarat (C)" "Banyule (C)" ...
##  $ ...5                       : num  11.2 11.2 11.2 11.2 11.2 11.2 11.2 11.2 11.2 11.2 ...
##  $ ...6                       : chr  "135" "117" "1311" "1541" ...
##  $ ...7                       : chr  "137" "124" "1369" "1577" ...
##  $ results_numeracy_percentage: chr  "0.98540145985401462" "0.94354838709677424" "0.95763330898466037" "0.97717184527584022" ...

# confirming the its data.frame and viewing the first five rows of "naplan_numeracy"
head(naplan_numeracy)

3.Data Description

Data source: The data for All Schools FTE Enrollments can be found at the following URL:http://www.education.vic.gov.au/Documents/about/research/datavic/dv300-allschoolsFTEenrolments Feb2020.csv

The data consists information Prep - Year 12 FTE enrollments for Year 2020, for Goverment, Catholic and Independent(Private Schools) schools.

This dataset contains 56 Variables and 2263 observations. • One logical variable: o CENSUS_TYPE - confirmation on Census being completed.

• Four qualitative (categorical) variables: o Education_Sector – Categorizing schools by Catholic, Government or Independent. o School_Name - Name of the school. o School_Type – Type of school, example whether the school is a, combine school (Primary/Secondary), Primary only, Secondary or Special school only . o School_Status – shows if the school is Open (O) or Closed (C).

• 51 numeric variables: o Entity Type – Column shows if the school is either Government -1 or Catholic / Independent- 2. o School_No – indicates the school Identification Number. o *Total - represents the FTE (Full Time Equivalent) by year,(48 of these variables can be found. o Year - Census year - 2020

Data source: VCAMS percentage of students achieving national benchmark in literacy https://discover.data.vic.gov.au/dataset/vcams-percentage-of-students-achieving-national-benchmark-in-literacy

This dataset contains 8 Variables and 3811 observations and it comntained infor from Year 2008 to 2019, however for the purpouse of the assognment Year 2018 and Year 2019 were extracted.

• Five qualitative (categorical) variables: Year_Level – Grades by Year 3,Year 5, Year 7, Year 9. LGA_DESC– Local Government Area Numerator- Number of students in Years 3, 5, 7 and 9 who meet or exceed the benchmarks for
literacy. Denominator– Number of students in the reference years (Years 3, 5, 7 and 9) who in participated for testing or were officially exempted *Indicator_Calc - Percentage of students achiving national benchmark for literacy

• Three numeric variable Year – Census Year. LGA_KEY – LGA ID No. *INDICATOR_NUM_FULL - Proportion of students in Years 3, 5, 7 and 9 who meet or exceed the benchmarks for literacy (Government and Non Government)

URL of the VCAMS percentage of students achieving national benchmark in numeracy https://discover.data.vic.gov.au/dataset/vcams-percentage-of-students-achieving-national-benchmark-in-numeracy

This dataset contains 8 Variables and 3811 observations.

• Five qualitative (categorical) variables: Year_Level – Grades by Year 3,Year 5, Year 7, Year 9. LGA_DESC– Local Government Area Numerator- Number of students in Years 3, 5, 7 and 9 who meet or exceed the benchmarks for numeracy Denominator– Number of students in the reference years (Years 3, 5, 7 and 9) who participated in testing or were officially exempted *Indicator_Calc - Percentage of students achiving national benchmark for numeracy

• Three numeric variable Year – Census Year. LGA_KEY – LGA ID N. *INDICATOR_NUM_FULL - Proportion of students in Years 3, 5, 7 and 9 who meet or exceed the benchmarks for literacy (Government and Non Government)

4.Inspect dataset and variables

Step 4a - Check the dimensions of the dataframe:
To check the overall dimension of the dataframe dim function was used(shows both rows and columns)
To check the number of rows the nrow function was used.
To check the number of columns the ncol function was used

4a.

# All Schools Enrollment
dim(schoolsfte_2020)

## [1] 2263   56

nrow(schoolsfte_2020)

## [1] 2263

ncol(schoolsfte_2020)

## [1] 56

# Naplan Literacy
dim(naplan_literacy)

## [1] 646   8

nrow(naplan_literacy)

## [1] 646

ncol(naplan_literacy)

## [1] 8

# Naplan Numeracy
dim(naplan_numeracy)

## [1] 646   8

nrow(naplan_numeracy)

## [1] 646

ncol(naplan_numeracy)

## [1] 8

##4b Checked the variable data types by using glimpse()

summarise the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct data type, apply proper type conversions.

glimpse(schoolsfte_2020)

## Rows: 2,263
## Columns: 56
## $ Education_Sector                    <chr> "Catholic", "Catholic", "Catholic"…
## $ Entity_Type                         <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ School_No                           <dbl> 20, 25, 26, 28, 29, 30, 33, 35, 60…
## $ School_Name                         <chr> "Parade College", "Simonds Catholi…
## $ School_Type                         <chr> "Secondary", "Secondary", "Seconda…
## $ School_Status                       <chr> "O", "O", "O", "O", "O", "O", "O",…
## $ `Prep Males Total"`                 <dbl> 0, 0, 0, 0, 30, 18, 0, 0, 11, 20, …
## $ `Prep Females Total"`               <dbl> 0.0, 0.0, 0.0, 0.0, 12.0, 18.0, 0.…
## $ `Prep Total"`                       <dbl> 0.0, 0.0, 0.0, 0.0, 42.0, 36.0, 0.…
## $ `Year 1 Males Total"`               <dbl> 0, 0, 0, 0, 14, 9, 0, 0, 9, 23, 25…
## $ `Year 1 Females Total"`             <dbl> 0, 0, 0, 0, 18, 20, 0, 0, 11, 25, …
## $ `Year 1 Total"`                     <dbl> 0, 0, 0, 0, 32, 29, 0, 0, 20, 48, …
## $ `Year 2 Males total"`               <dbl> 0.0, 0.0, 0.0, 0.0, 15.0, 19.0, 0.…
## $ `Year 2 Females Total"`             <dbl> 0, 0, 0, 0, 16, 14, 0, 0, 14, 30, …
## $ `Year 2 Total"`                     <dbl> 0.0, 0.0, 0.0, 0.0, 31.0, 33.0, 0.…
## $ `Year 3 Males Total"`               <dbl> 0.0, 0.0, 0.0, 0.0, 18.0, 18.0, 0.…
## $ `Year 3 Females Total"`             <dbl> 0, 0, 0, 0, 15, 25, 0, 0, 10, 24, …
## $ `Year 3 Total"`                     <dbl> 0.0, 0.0, 0.0, 0.0, 33.0, 43.0, 0.…
## $ `Year 4 Males Total"`               <dbl> 0, 0, 0, 0, 13, 20, 0, 0, 14, 18, …
## $ `Year 4 Females Total"`             <dbl> 0, 0, 0, 0, 21, 29, 0, 0, 17, 19, …
## $ `Year 4 Total"`                     <dbl> 0, 0, 0, 0, 34, 49, 0, 0, 31, 37, …
## $ `Year 5 Males Total"`               <dbl> 0.0, 0.0, 0.0, 0.0, 16.0, 22.0, 0.…
## $ `Year 5 Females Total"`             <dbl> 0, 0, 0, 0, 22, 24, 0, 0, 14, 26, …
## $ `Year 5 Total"`                     <dbl> 0.0, 0.0, 0.0, 0.0, 38.0, 46.0, 0.…
## $ `Year 6 Males Total"`               <dbl> 0, 0, 0, 0, 18, 16, 0, 0, 14, 24, …
## $ `Year 6 Females Total"`             <dbl> 0, 0, 0, 0, 27, 15, 0, 0, 20, 24, …
## $ `Year 6 Total"`                     <dbl> 0, 0, 0, 0, 45, 31, 0, 0, 34, 48, …
## $ `Primary Ungraded Males Total"`     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `Primary Ungraded Females Total"`   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `Primary Ungraded Total"`           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `Primary Total"`                    <dbl> 0.0, 0.0, 0.0, 0.0, 255.0, 267.0, …
## $ `Year 7 Males Total"`               <dbl> 333, 64, 54, 229, 0, 0, 104, 0, 0,…
## $ `Year 7 Females Total"`             <dbl> 0, 0, 0, 0, 0, 0, 118, 156, 0, 0, …
## $ `Year 7 Total"`                     <dbl> 333, 64, 54, 229, 0, 0, 222, 156, …
## $ `Year 8 Males Total"`               <dbl> 337, 85, 74, 233, 0, 0, 110, 0, 0,…
## $ `Year 8 Females Total"`             <dbl> 0, 0, 0, 0, 0, 0, 110, 159, 0, 0, …
## $ `Year 8 Total"`                     <dbl> 337, 85, 74, 233, 0, 0, 220, 159, …
## $ `Year 9 Males Total"`               <dbl> 351, 69, 94, 224, 0, 0, 85, 0, 0, …
## $ `Year 9 Females Total"`             <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 122.…
## $ `Year 9 Total"`                     <dbl> 351.0, 69.0, 94.0, 224.0, 0.0, 0.0…
## $ `Year 10 Males Total"`              <dbl> 319, 73, 87, 236, 0, 0, 92, 0, 0, …
## $ `Year 10 Females Total"`            <dbl> 0, 0, 0, 0, 0, 0, 103, 162, 0, 0, …
## $ `Year 10 Total"`                    <dbl> 319, 73, 87, 236, 0, 0, 195, 162, …
## $ `Year 11 Males Total"`              <dbl> 332, 68, 97, 228, 0, 0, 89, 0, 0, …
## $ `Year 11 Females Total"`            <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 106.…
## $ `Year 11 Total"`                    <dbl> 332.0, 68.0, 97.0, 228.0, 0.0, 0.0…
## $ `Year 12 Males"`                    <dbl> 281.0, 62.0, 71.0, 190.0, 0.0, 0.0…
## $ `Year 12 Females"`                  <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 85.6…
## $ `Year 12 Total"`                    <dbl> 281.0, 62.0, 71.0, 190.0, 0.0, 0.0…
## $ `Secondary Ungraded Males Total"`   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `Secondary Ungraded Females Total"` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `Secondary Ungraded Total"`         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ `Secondary Total"`                  <dbl> 1953.0, 421.0, 477.0, 1340.0, 0.0,…
## $ `Grand Total"`                      <dbl> 1953.0, 421.0, 477.0, 1340.0, 255.…
## $ Year                                <dbl> 2020, 2020, 2020, 2020, 2020, 2020…
## $ CENSUS_TYPE                         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE,…

Based on above , it can be determined the variables are assigned to correct data type, therefore we do not require any amendments.

To illustrate how to change the variable type the following codes have been used. Using a variable that will be be not recquired for the analysis.

The change of type test conducted as per below steps:
1. Convert from logical to Character 2. Convert from Character to Number 3. Convert from Number to logical (original type)

schoolsfte_2020$CENSUS_TYPE <-as.character(schoolsfte_2020$CENSUS_TYPE)
# confirming change 1
str(schoolsfte_2020)

## 'data.frame':    2263 obs. of  56 variables:
##  $ Education_Sector                 : chr  "Catholic" "Catholic" "Catholic" "Catholic" ...
##  $ Entity_Type                      : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ School_No                        : num  20 25 26 28 29 30 33 35 60 77 ...
##  $ School_Name                      : chr  "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
##  $ School_Type                      : chr  "Secondary" "Secondary" "Secondary" "Secondary" ...
##  $ School_Status                    : chr  "O" "O" "O" "O" ...
##  $ Prep Males Total"                : num  0 0 0 0 30 18 0 0 11 20 ...
##  $ Prep Females Total"              : num  0 0 0 0 12 18 0 0 7 24 ...
##  $ Prep Total"                      : num  0 0 0 0 42 36 0 0 18 44 ...
##  $ Year 1 Males Total"              : num  0 0 0 0 14 9 0 0 9 23 ...
##  $ Year 1 Females Total"            : num  0 0 0 0 18 20 0 0 11 25 ...
##  $ Year 1 Total"                    : num  0 0 0 0 32 29 0 0 20 48 ...
##  $ Year 2 Males total"              : num  0 0 0 0 15 19 0 0 11.3 20 ...
##  $ Year 2 Females Total"            : num  0 0 0 0 16 14 0 0 14 30 ...
##  $ Year 2 Total"                    : num  0 0 0 0 31 33 0 0 25.3 50 ...
##  $ Year 3 Males Total"              : num  0 0 0 0 18 18 0 0 17 23 ...
##  $ Year 3 Females Total"            : num  0 0 0 0 15 25 0 0 10 24 ...
##  $ Year 3 Total"                    : num  0 0 0 0 33 43 0 0 27 47 ...
##  $ Year 4 Males Total"              : num  0 0 0 0 13 20 0 0 14 18 ...
##  $ Year 4 Females Total"            : num  0 0 0 0 21 29 0 0 17 19 ...
##  $ Year 4 Total"                    : num  0 0 0 0 34 49 0 0 31 37 ...
##  $ Year 5 Males Total"              : num  0 0 0 0 16 22 0 0 19 17 ...
##  $ Year 5 Females Total"            : num  0 0 0 0 22 24 0 0 14 26 ...
##  $ Year 5 Total"                    : num  0 0 0 0 38 46 0 0 33 43 ...
##  $ Year 6 Males Total"              : num  0 0 0 0 18 16 0 0 14 24 ...
##  $ Year 6 Females Total"            : num  0 0 0 0 27 15 0 0 20 24 ...
##  $ Year 6 Total"                    : num  0 0 0 0 45 31 0 0 34 48 ...
##  $ Primary Ungraded Males Total"    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Primary Ungraded Females Total"  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Primary Ungraded Total"          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Primary Total"                   : num  0 0 0 0 255 ...
##  $ Year 7 Males Total"              : num  333 64 54 229 0 0 104 0 0 0 ...
##  $ Year 7 Females Total"            : num  0 0 0 0 0 0 118 156 0 0 ...
##  $ Year 7 Total"                    : num  333 64 54 229 0 0 222 156 0 0 ...
##  $ Year 8 Males Total"              : num  337 85 74 233 0 0 110 0 0 0 ...
##  $ Year 8 Females Total"            : num  0 0 0 0 0 0 110 159 0 0 ...
##  $ Year 8 Total"                    : num  337 85 74 233 0 0 220 159 0 0 ...
##  $ Year 9 Males Total"              : num  351 69 94 224 0 0 85 0 0 0 ...
##  $ Year 9 Females Total"            : num  0 0 0 0 0 ...
##  $ Year 9 Total"                    : num  351 69 94 224 0 ...
##  $ Year 10 Males Total"             : num  319 73 87 236 0 0 92 0 0 0 ...
##  $ Year 10 Females Total"           : num  0 0 0 0 0 0 103 162 0 0 ...
##  $ Year 10 Total"                   : num  319 73 87 236 0 0 195 162 0 0 ...
##  $ Year 11 Males Total"             : num  332 68 97 228 0 0 89 0 0 0 ...
##  $ Year 11 Females Total"           : num  0 0 0 0 0 ...
##  $ Year 11 Total"                   : num  332 68 97 228 0 ...
##  $ Year 12 Males"                   : num  281 62 71 190 0 0 76.4 0 0 0 ...
##  $ Year 12 Females"                 : num  0 0 0 0 0 0 85.6 151 0 0 ...
##  $ Year 12 Total"                   : num  281 62 71 190 0 0 162 151 0 0 ...
##  $ Secondary Ungraded Males Total"  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Secondary Ungraded Females Total": num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Secondary Ungraded Total"        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Secondary Total"                 : num  1953 421 477 1340 0 ...
##  $ Grand Total"                     : num  1953 421 477 1340 255 ...
##  $ Year                             : num  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ CENSUS_TYPE                      : chr  "FALSE" "FALSE" "FALSE" "FALSE" ...

# change 2
schoolsfte_2020$CENSUS_TYPE <- as.numeric(schoolsfte_2020$CENSUS_TYPE)

## Warning: NAs introduced by coercion

# change 3(back to original data type)
schoolsfte_2020$CENSUS_TYPE <- as.logical(schoolsfte_2020$CENSUS_TYPE)

In Colnames some variable have backslashes as a suffix, removed the backlashes using gsub()

colnames(schoolsfte_2020) = gsub( "\"", " ", colnames(schoolsfte_2020))

colnames(schoolsfte_2020)

##  [1] "Education_Sector"                  "Entity_Type"                      
##  [3] "School_No"                         "School_Name"                      
##  [5] "School_Type"                       "School_Status"                    
##  [7] "Prep Males Total "                 "Prep Females Total "              
##  [9] "Prep Total "                       "Year 1 Males Total "              
## [11] "Year 1 Females Total "             "Year 1 Total "                    
## [13] "Year 2 Males total "               "Year 2 Females Total "            
## [15] "Year 2 Total "                     "Year 3 Males Total "              
## [17] "Year 3 Females Total "             "Year 3 Total "                    
## [19] "Year 4 Males Total "               "Year 4 Females Total "            
## [21] "Year 4 Total "                     "Year 5 Males Total "              
## [23] "Year 5 Females Total "             "Year 5 Total "                    
## [25] "Year 6 Males Total "               "Year 6 Females Total "            
## [27] "Year 6 Total "                     "Primary Ungraded Males Total "    
## [29] "Primary Ungraded Females Total "   "Primary Ungraded Total "          
## [31] "Primary Total "                    "Year 7 Males Total "              
## [33] "Year 7 Females Total "             "Year 7 Total "                    
## [35] "Year 8 Males Total "               "Year 8 Females Total "            
## [37] "Year 8 Total "                     "Year 9 Males Total "              
## [39] "Year 9 Females Total "             "Year 9 Total "                    
## [41] "Year 10 Males Total "              "Year 10 Females Total "           
## [43] "Year 10 Total "                    "Year 11 Males Total "             
## [45] "Year 11 Females Total "            "Year 11 Total "                   
## [47] "Year 12 Males "                    "Year 12 Females "                 
## [49] "Year 12 Total "                    "Secondary Ungraded Males Total "  
## [51] "Secondary Ungraded Females Total " "Secondary Ungraded Total "        
## [53] "Secondary Total "                  "Grand Total "                     
## [55] "Year"                              "CENSUS_TYPE"

# Names of elements within an object can be checked using the below code
names(schoolsfte_2020)

##  [1] "Education_Sector"                  "Entity_Type"                      
##  [3] "School_No"                         "School_Name"                      
##  [5] "School_Type"                       "School_Status"                    
##  [7] "Prep Males Total "                 "Prep Females Total "              
##  [9] "Prep Total "                       "Year 1 Males Total "              
## [11] "Year 1 Females Total "             "Year 1 Total "                    
## [13] "Year 2 Males total "               "Year 2 Females Total "            
## [15] "Year 2 Total "                     "Year 3 Males Total "              
## [17] "Year 3 Females Total "             "Year 3 Total "                    
## [19] "Year 4 Males Total "               "Year 4 Females Total "            
## [21] "Year 4 Total "                     "Year 5 Males Total "              
## [23] "Year 5 Females Total "             "Year 5 Total "                    
## [25] "Year 6 Males Total "               "Year 6 Females Total "            
## [27] "Year 6 Total "                     "Primary Ungraded Males Total "    
## [29] "Primary Ungraded Females Total "   "Primary Ungraded Total "          
## [31] "Primary Total "                    "Year 7 Males Total "              
## [33] "Year 7 Females Total "             "Year 7 Total "                    
## [35] "Year 8 Males Total "               "Year 8 Females Total "            
## [37] "Year 8 Total "                     "Year 9 Males Total "              
## [39] "Year 9 Females Total "             "Year 9 Total "                    
## [41] "Year 10 Males Total "              "Year 10 Females Total "           
## [43] "Year 10 Total "                    "Year 11 Males Total "             
## [45] "Year 11 Females Total "            "Year 11 Total "                   
## [47] "Year 12 Males "                    "Year 12 Females "                 
## [49] "Year 12 Total "                    "Secondary Ungraded Males Total "  
## [51] "Secondary Ungraded Females Total " "Secondary Ungraded Total "        
## [53] "Secondary Total "                  "Grand Total "                     
## [55] "Year"                              "CENSUS_TYPE"

# The below code retrieves the internal class of an object
class(schoolsfte_2020)

## [1] "data.frame"

4c. Check the levels of factor variables, rename/rearrange them if required.

When reviewing the data frame for factors I have identified 1 variables which could be converted to factors.

1.Education_Sector 2.School_name

schoolsfte_2020$Education_Sector <- as.factor(schoolsfte_2020$Education_Sector)
levels( schoolsfte_2020$Education_Sector)

## [1] "Catholic"    "Government"  "Independent"

is.factor(schoolsfte_2020$Education_Sector)

## [1] TRUE

# 
schoolsfte_2020$School_Name <- as.factor(schoolsfte_2020$School_Name)

4d Created new dataset for summary statistics containing Year 3 Total, Year 5 Total ,Year 7 Total and Year 9 Total (To review only NAPLAN Grades 3,5,7,9).

schools_naplan_grades <- schoolsfte_2020
schools_naplan_grades %<>% select(School_Name,School_Type,`Year 3 Total `, `Year 5 Total `, `Year 7 Total `, `Year 9 Total `)
schools_naplan_grades

commencing checks from 4b to 4d on the other two datasets

4b.Checking variable data types on naplan_literacy and naplan_numeracy

glimpse(naplan_literacy)

## Rows: 646
## Columns: 8
## $ year                        <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
## $ grade                       <chr> "Year 3", "Year 3", "Year 3", "Year 3", "Y…
## $ LGA_id                      <dbl> 20110, 20260, 20570, 20660, 20740, 20830, …
## $ LGA_Name                    <chr> "Alpine (S)", "Ararat (Rc)", "Ballarat (C)…
## $ ...5                        <dbl> 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, …
## $ ...6                        <chr> "126", "111", "1270", "1520", "343", "647"…
## $ ...7                        <chr> "132", "125", "1364", "1571", "355", "681"…
## $ results_literacy_percentage <chr> "0.95454545454545459", "0.8880000000000000…

glimpse(naplan_numeracy)

## Rows: 646
## Columns: 8
## $ year                        <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
## $ grade                       <chr> "Year 3", "Year 3", "Year 3", "Year 3", "Y…
## $ LGA_id                      <dbl> 20110, 20260, 20570, 20660, 20740, 20830, …
## $ LGA_Name                    <chr> "Alpine (S)", "Ararat (Rc)", "Ballarat (C)…
## $ ...5                        <dbl> 11.2, 11.2, 11.2, 11.2, 11.2, 11.2, 11.2, …
## $ ...6                        <chr> "135", "117", "1311", "1541", "343", "657"…
## $ ...7                        <chr> "137", "124", "1369", "1577", "353", "681"…
## $ results_numeracy_percentage <chr> "0.98540145985401462", "0.9435483870967742…

Based on above , it can be determined variable “grade” and results_percentage" need to be converted to factor and numeric accordingly, therefore we do require the below amendments.

Please not column “6”, “7” have been assigned has character and needs to be amended to numeric however as these variables are not required for my analysis , it will not be converted.

# converting grade to factor
naplan_literacy$grade <- as.factor(naplan_literacy$grade)
is.factor(naplan_literacy$grade)

## [1] TRUE

naplan_numeracy$grade <- as.factor(naplan_numeracy$grade)
is.factor(naplan_numeracy$grade)

## [1] TRUE

# converting "results_percentage" to numeric (amending only the columns I need to subset)

naplan_literacy$results_literacy_percentage <- as.numeric(naplan_literacy$results_literacy_percentage)

## Warning: NAs introduced by coercion

glimpse(naplan_literacy$results_literacy_percentage)

##  num [1:646] 0.955 0.888 0.931 0.968 0.966 ...

is.numeric(naplan_literacy$results_literacy_percentage)

## [1] TRUE

naplan_numeracy$results_numeracy_percentage <- as.numeric(naplan_numeracy$results_numeracy_percentage)

## Warning: NAs introduced by coercion

glimpse(naplan_numeracy$results_numeracy_percentage)

##  num [1:646] 0.985 0.944 0.958 0.977 0.972 ...

is.numeric(naplan_numeracy$results_numeracy_percentage)

## [1] TRUE

Using str() will check the structure of both datasets to confirm the changed has occurred or can use is. function - This function tests for the given data type and return a logical value (TRUE or FALSE).

4d. Created new dataset by merging or naplan_literacy and naplan_numeracy and removing unwanted columns. Renaming the columns. changing the results from decimals to percentage.

# merging naplan_literacy and naplan_numeracy
naplan_literacy_numeracy_df <- merge(naplan_literacy, naplan_numeracy, 'by' = 'LGA_Name')

# subsetting the data with only the recquired columns for analysis
naplan_literacy_numeracy_df1 <- naplan_literacy_numeracy_df[c(1,8,9,10,15)]

5. Tidy data

For the data to meet tidy data principles, tidy data needs to adhere to the following rules (Wickham and Grolemund (2016)):

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Data Frame - schools_naplan_grades: Passes principles 2 and 3 but it fails principle 1. It as column header as a values rather than a variable. In order to tidy the data it is necessary to clean it by pivoting the Grade 3 ,5 ,7 ,9 (Grades with total numbers combines as one column) columns using the pivot_longer() function.

This will result in new variables called grades (for columns with total enrollments) and FTE respectively.

tidy_FTE1 <- schools_naplan_grades %>%
     pivot_longer(
     cols = starts_with("Year"),
     names_to = "grade",
     names_prefix = "Year",
     values_to = "enrollments",
    values_drop_na = TRUE) %>%
    group_by(grade) %>%
    arrange(desc(enrollments)
   )

# Removing blank data to enable clean grouping and analysis, as the observations have reduced , it confirms blank cells have been removed.
naplan_literacy_numeracy_df2 <- na.omit(naplan_literacy_numeracy_df1)

With new data frame created for naplan_literacy_numeracy_df2 , there was no tidy pricipales recquired as the data is in the tidy format.

Based upon the review of tidy_FTE1 and naplan_literacy_numeracy_df2,it can be confirmed the two datasets conforms with all 3 data tidy principles.

Each variable has been assigned to a column with data separated correctly,
The columns have headers assigned correctly for each variable.
Each value has been assigned to a cell.
Each observation has been allocated a row.

Further to the above , to tidy the date , columns were renamed and unnecessary/unknown data columns were removed. Remove observation rows which have incomplete/ blank data cells.

6. Summary statistics

The aim of this summary statistics will by grouped by School_Name using the tidy_FTE1 dataset to calculate Mean, Median,Max,Min and Std_Dev.

1.The aim of this summary statistics will by grouped by School_Name using the tidy_FTE1 data set to calculate Mean, Median,Max,Min and Std_Dev.

school_grades_enrollments_summary <- tidy_FTE1 
school_grades_enrollments_summary %<>%
  group_by(School_Name) %>% 
  summarise(Mean = mean(enrollments, na.rm = TRUE),
    Median = median(enrollments, na.rm = TRUE),
    Max = max(enrollments, na.rm = TRUE),
    Min = min(enrollments, na.rm = TRUE),
    Std_Dev = sd(enrollments, na.rm = TRUE)) %>% 
  arrange(desc(Mean))


school_grades_enrollments_summary

## 2. The aim of this summary statistics will by grouped by LGA_Name and Mean, Median,Max,Min and Std_Dev have been calculated for the results_literacy_percdntage.

literacy_results_summary <- naplan_literacy_numeracy_df2 %>% 
  group_by(LGA_Name) %>% 
  summarise(Mean = mean(results_literacy_percentage),
            Median = median(results_literacy_percentage),
            Max = max(results_literacy_percentage),
            Min = min(results_literacy_percentage),
            Std_Dev = sd(results_literacy_percentage)) %>% 
  arrange(desc(Mean))

literacy_results_summary

## 3. The aim of this summary statistics will by grouped by LGA_Name and Mean, Median,Max,Min and Std_Dev have been calculated for the results_numeracy_percentage.

numeracy_results_summary <- naplan_literacy_numeracy_df2 %>% 
  group_by(LGA_Name) %>% 
  summarise(Mean = mean(results_numeracy_percentage),
            Median = median(results_numeracy_percentage),
            Max = max(results_numeracy_percentage),
            Min = min(results_numeracy_percentage),
            Std_Dev = sd(results_numeracy_percentage)) %>% 
  arrange(desc(Mean))

numeracy_results_summary

# The Summary Statistics above show the order of Mean values in descending order.

#converting "results_percentage" column from decimals to percentage

naplan_literacy_numeracy_df2$results__literacy_percentage <- 
  percent(naplan_literacy_numeracy_df2$results_literacy_percentage, accuracy= 0.01)

naplan_literacy_numeracy_df2$results_numeracy_percentage <- 
  percent(naplan_literacy_numeracy_df2$results_numeracy_percentage, accuracy= 0.01)

7. Create a list

Create a list that contains a numeric value for each response to the categorical variable. Typically, they are numbered from 1-n.

## Assign the Numeric Value to School_Name categories in school_grades_enrollments_summary

ListSchool_Name <- list(school_grades_enrollments_summary$School_Name)

## Generate a standard numeric list for assignment to the categories

ListNum <- list(1:2067)

## Convert the lists to a set of values

School_Name <- unlist(ListSchool_Name)
NUM <- unlist(ListNum)

### Use the values to create a new dataframe

School_NameDF <- data.frame(School_Name, NUM)

View(School_NameDF)

8. Join the list

Join this list on using a join of your choice. Remember that this has to keep the numeric variable, as well as matching to your categorical variable.

Two data frames ( school_grades_enrollments_summary and School_NameDF) have been merged to create a new Dataframe (school_grades_enrollments_summary_DF1) which now contains a number assigned to each School_Name(the categorical variable)

school_grades_enrollments_summary_DF1 <- merge(school_grades_enrollments_summary,School_NameDF , 'by' = 'School_Name')


#check
View(school_grades_enrollments_summary_DF1)

Further to abalys the high enrollents with LGA_Name’s obtained the below: A list of Local Government IDs (LGA_ID) for each school, dataset “Schools_LGA_2021.csv” has been used, the data was obtained from www.data.gov.au

All schools registered in Victoria under certain LGA can be obtained through LGA_ID and LGA_Name It will be joined to the tidy_FTE1 data set to compare results.

he URL for LGA_ID data set is as follows: https://data.gov.au/dataset/ds-vic-cebdeac5-00af-4fe7-9537-b1597ef7ffea/distribution/dist-vic-97c05fd1-8671-4f0a-9f91-e8d57a1c1135/details?q=

The following steps were to taken to create a list called LGA_Name_ID Combines with school_grades_enrollments_summary with Schools_LGA_2021 to creat new dataset called tidy_FTE1_LGA

Schools_LGA_2021 <- read_csv("Schools_LGA_2021.csv")

## Rows: 2291 Columns: 21

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): Education_Sector, School_Name, School_Type, School_Status, Address...
## dbl  (7): Entity_Type, SCHOOL_NO, Address_Postcode, Postal_Postcode, LGA_ID,...

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Used str() to determine the structure  of the dataset and names() to see the columns 
str(Schools_LGA_2021)

## spec_tbl_df [2,291 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Education_Sector     : chr [1:2291] "Government" "Government" "Government" "Government" ...
##  $ Entity_Type          : num [1:2291] 1 1 1 1 1 1 1 1 1 1 ...
##  $ SCHOOL_NO            : num [1:2291] 1 3 4 8 12 26 28 33 37 40 ...
##  $ School_Name          : chr [1:2291] "Alberton Primary School" "Allansford and District Primary School" "Avoca Primary School" "Avenel Primary School" ...
##  $ School_Type          : chr [1:2291] "Primary" "Primary" "Primary" "Primary" ...
##  $ School_Status        : chr [1:2291] "O" "O" "O" "O" ...
##  $ Address_Line_1       : chr [1:2291] "21 Thomson Street" "Frank Street" "118 Barnett Street" "40 Anderson Street" ...
##  $ Address_Line_2       : chr [1:2291] NA NA NA NA ...
##  $ Address_Town         : chr [1:2291] "Alberton" "Allansford" "Avoca" "Avenel" ...
##  $ Address_State        : chr [1:2291] "VIC" "VIC" "VIC" "VIC" ...
##  $ Address_Postcode     : num [1:2291] 3971 3277 3467 3664 3113 ...
##  $ Postal_Address_Line_1: chr [1:2291] "21 Thomson Street" "Frank Street" "P O Box 12" "40 Anderson Street" ...
##  $ Postal_Address_Line_2: chr [1:2291] NA NA NA NA ...
##  $ Postal_Town          : chr [1:2291] "ALBERTON" "ALLANSFORD" "AVOCA" "AVENEL" ...
##  $ Postal_State         : chr [1:2291] "VIC" "VIC" "VIC" "VIC" ...
##  $ Postal_Postcode      : num [1:2291] 3971 3277 3467 3664 3113 ...
##  $ Full_Phone_No        : chr [1:2291] "03 5183 2412" "03 5565 1382" "03 5465 3176" "03 5796 2264" ...
##  $ LGA_ID               : num [1:2291] 681 673 599 643 421 275 515 57 717 161 ...
##  $ LGA_Name             : chr [1:2291] "Wellington (S)" "Warrnambool (C)" "Pyrenees (S)" "Strathbogie (S)" ...
##  $ X                    : num [1:2291] 147 143 143 145 145 ...
##  $ Y                    : num [1:2291] -38.6 -38.4 -37.1 -36.9 -37.7 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Education_Sector = col_character(),
##   ..   Entity_Type = col_double(),
##   ..   SCHOOL_NO = col_double(),
##   ..   School_Name = col_character(),
##   ..   School_Type = col_character(),
##   ..   School_Status = col_character(),
##   ..   Address_Line_1 = col_character(),
##   ..   Address_Line_2 = col_character(),
##   ..   Address_Town = col_character(),
##   ..   Address_State = col_character(),
##   ..   Address_Postcode = col_double(),
##   ..   Postal_Address_Line_1 = col_character(),
##   ..   Postal_Address_Line_2 = col_character(),
##   ..   Postal_Town = col_character(),
##   ..   Postal_State = col_character(),
##   ..   Postal_Postcode = col_double(),
##   ..   Full_Phone_No = col_character(),
##   ..   LGA_ID = col_double(),
##   ..   LGA_Name = col_character(),
##   ..   X = col_double(),
##   ..   Y = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

names(Schools_LGA_2021)

##  [1] "Education_Sector"      "Entity_Type"           "SCHOOL_NO"            
##  [4] "School_Name"           "School_Type"           "School_Status"        
##  [7] "Address_Line_1"        "Address_Line_2"        "Address_Town"         
## [10] "Address_State"         "Address_Postcode"      "Postal_Address_Line_1"
## [13] "Postal_Address_Line_2" "Postal_Town"           "Postal_State"         
## [16] "Postal_Postcode"       "Full_Phone_No"         "LGA_ID"               
## [19] "LGA_Name"              "X"                     "Y"

#merging school_grades_enrollments_summary and Schools_LGA_2021 by School_Name
tidy_FTE1_LGA <-  school_grades_enrollments_summary %>% left_join(Schools_LGA_2021, by = "School_Name") %>% 
  select(School_Name,LGA_ID,LGA_Name, Mean, Median, Max, Min, Std_Dev)
tidy_FTE1_LGA

9 Subsetting

Subsetting I

#In this step the tidy_FTE1_LGA data frame was subset first to include  School_Name, LGA_Name,LGA_ID,Max)
highenrollments_obs <- tidy_FTE1_LGA[c(1:3,6)]


#Arranging Max enrollements in descending order and then selecting first 10 observation
highenrollments_obs1 <- highenrollments_obs %>% arrange(desc(Max))
highenrollments_obs1_Sub <- highenrollments_obs1 %>% filter(row(highenrollments_obs1) > 0 & row(highenrollments_obs1) < 11)

str(highenrollments_obs1_Sub)

## tibble [10 × 4] (S3: tbl_df/tbl/data.frame)
##  $ School_Name: chr [1:10] "St Francis Xavier College" "Emmanuel College" "Padua College" "McKinnon Secondary College" ...
##  $ LGA_ID     : num [1:10] 145 311 534 231 118 283 231 460 111 262
##  $ LGA_Name   : chr [1:10] "Cardinia (S)" "Hobsons Bay (C)" "Mornington Peninsula (S)" "Glen Eira (C)" ...
##  $ Max        : num [1:10] 612 459 446 437 401 ...

#create a matrix
highenrollments_obs1_Sub_mtrx <- as.matrix(highenrollments_obs1_Sub)

# by conducting the below argument, it can be confirmed the data frame is converted to matrix and as all the variables are character.  All elements of the matrix data frame has to be of the same type, e.g. numeric or character.
glimpse(highenrollments_obs1_Sub_mtrx)

##  chr [1:10, 1:4] "St Francis Xavier College" "Emmanuel College" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:4] "School_Name" "LGA_ID" "LGA_Name" "Max"

#confirming the data frame is matric
is.matrix(highenrollments_obs1_Sub_mtrx)

## [1] TRUE

10.Subsetting

Only the first and the last variable in the data set from step 9 have been subset and then saved as an R object (RDS) file.

# ### Using the highenrollments_obs1_Sub dataframe to create the subset - 
subset_rds <- head(highenrollments_obs1_Sub[,c("School_Name","Max")],10)
subset_rds

## Create a RDD data file of the dataframe - Use the save function and identify the dataframe and assign a file name. 
subset_rds %>% 
saveRDS(file = "~/Desktop/practice/data/subset_rds.RDS")

Step 11 - Steps to create a new Data Frame

#Create a data frame with 3 variables. Your data frame has to contain one integer variable and one ordinal variable,

df1 <- data.frame(col1 = c("Jo", "Nick","Tina","Ria","Max","Lee","Sam"),
                  col2 = factor(c("Male","Male","Female","Female","Male","Male","Male"), 
                                levels = c("Male", "Female"), ordered = TRUE),
                  col3 = c(1:7))


# The integer variable has been named "ID" and The ordinal variable has been named "Sex".

colnames(df1) <- c("Names", "Sex","ID")       

# Assign row names

rownames(df1) <- c("Row 1", "Row 2", "Row 3","Row 4", "Row 5", "Row 6","Row 7") 

# Verify structure and levels of variables in data frame df1

df1

# Checking the structure of the data frame confirm the data frame consists of 7 observations and 3 variables.

str(df1)

## 'data.frame':    7 obs. of  3 variables:
##  $ Names: chr  "Jo" "Nick" "Tina" "Ria" ...
##  $ Sex  : Ord.factor w/ 2 levels "Male"<"Female": 1 1 2 2 1 1 1
##  $ ID   : int  1 2 3 4 5 6 7

# Creating an additional numeric vector called Age with 7 values.
Age <- c('15','25','36','40','45','57','67')
Age

## [1] "15" "25" "36" "40" "45" "57" "67"

## [1] "15" "25" "36" "40" "45" "57" "67"

### Create a new numeric variable
population <- c(25000, 75000, 95000, 115000, 130000, 80000, 50000)

str(Age)

##  chr [1:7] "15" "25" "36" "40" "45" "57" "67"

## chr [1:7] "15" "25" "36" "40" "45" "57" "67"

str(population)

##  num [1:7] 25000 75000 95000 115000 130000 80000 50000

##num [1:7] 25000 75000 95000 115000 130000 80000 50000

# Use cbind() to bind Age and population to df1,to create a combined data frame called df2.
df2 <- cbind(df1, Age,population)

# Output data frame df2 to show 4 variables "Names", "Sex" ,"ID", "Age"
df2

#data frame df2 has been created with 4 variables, along with its attributes and dimensions

# Check attributes of data frame NewDF6

attributes(df2)

## $names
## [1] "Names"      "Sex"        "ID"         "Age"        "population"
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] "Row 1" "Row 2" "Row 3" "Row 4" "Row 5" "Row 6" "Row 7"

##$names
##[1] "Names"      "Sex"        "ID"         "Age"        "population"

##$class
##[1] "data.frame"

##$row.names
## [1] "Row 1" "Row 2" "Row 3" "Row 4" "Row 5" "Row 6" "Row 7"

Step 12 - Create another Data Frame

#Creating two new values in the form of “ID2” and “Age”

ID2 <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,6,7,7,7)
Age <- c(7,8,10,11,13,11,16,17,21,23,26,24,32,50,44,77,47,59,67,51,66,77,60,49,67,68,74,56,71,67)

## Create new dataframe from new variable vectors
df3 <- data.frame(Age, ID2)

## Rename the ID2 variable to ID to allow for merge based upon naming convention.
df3 <- df3 %>% 
  rename(
    ID = ID2)

## Merge the two dataframe using ID as the reference. 
df3 <- merge(df2, df3, 'by' = 'ID')

View(df3)

By creating two values Age and ID2 in a new dataframe, by doing so “Age” gets assigned to “ID2”.

The purpose of renaming “ID2” variable to “ID” in the “df3” dataframe is to be consistent and to allow for the merge to occur using the common variable.

#References

All Schools FTE enrolments - Feb 2020, Department of Education and Training, viewed 16 September 2021, http://www.education.vic.gov.au/Documents/about/research/datavic/dv300-allschoolsFTEenrolmentsFeb2020.csv

School Locations - 2021, Department of Education and Training, viewed 16 September 2021, https://data.gov.au/dataset/ds-vic-cebdeac5-00af-4fe7-9537-b1597ef7ffea/distribution/dist-vic-97c05fd1-8671-4f0a-9f91-e8d57a1c1135/details?q=

Boehmke, BC 2016, Data Wrangling with R (Links to an external site.), Springer International Publishing, Cham, Switzerland.

Wickham, H 2019, Advanced R (Links to an external site.), CRC press.

VCAMS percentage of students achieving national benchmark in literacy , viewed 16 September 2021, * https://discover.data.vic.gov.au/dataset/vcams-percentage-of-students-achieving-national-benchmark-in-literacy This was obtained in excel format.

VCAMS percentage of students achieving national benchmark in numeracy, viewed 16 September 2021 * https://discover.data.vic.gov.au/dataset/vcams-percentage-of-students-achieving-national-benchmark-in-numeracy

Data Wrangling,Assinment 1, 2021

Assignment 1

Josephine Elayaperuma and S3910607