*This dataset is about about combination of attendance rate for Semester 1 in SA Government Schools by year level and Index of Educational disadvantage by Government Schools in SA for year 2000.
*Combination of this two datasets are to analyse the relationship between the attendance levels of students and the index of educational disadvantage.The most disadvantage schools have an index of 1 and the least disadvantaged have an index of 7.
*Detailed explanation regarding the two datasets are done below separately.
*Both these datasets have a common variable of School Number and will be using this for the merge.
*Understanding the dataset, inspecting the variables and converting them to proper data type conversions are done in the Understand section.
#Variable Description
*School Number - Each school has a number in South Austrlia
*School Name - Name of the school
*Type of School - If it’s a Primary/Secondary/Aboroginal/Special school
*Suburb/Postcode - Suburb and Postcode of the school
*Attendance Rates from Reception to Secondary
*Index of Educational Disadvantage - Most disadvantage schools have an index of 1 and the least disadvantaged have an index of 7
#Dataset 1#
*Attendance rate for semester 1 in SA Government schools by school and year level in year 2000
*Calculation for Attendance Rate = (No of days attending school/ No of days student enrolled) x 100
*This dataset contians untidy data and will be dealt with this data in the Tidy data section.
*Lots of zeros can be seen in this dataset and they are for the year levels who doesn’t have such classes Example: Coromandel Valley Primary School doesn’t have attendaces after Year 7 as it’s a primary school.
*Also this dataset has missing values with NA and that will be dealt separately in the Scanning section.
#Dataset 2#
*The Index of Educational disadvantage for SA Government school in year 2000
*The most disadvantage schools have an index of 1 and the least disadvantaged have an index of 7
*Index is calculated using the measures of
* Parental economic resource
* Parental ecucation and occupation
* Aboriginality
* Student mobility
#Sample of Dataset 1
mydata0 <-read.csv ("C:/Data/attendance-rate-by-school-2000.csv")
mydata4 <- head(mydata0,10)
#Styling the Table
knitr:: kable(mydata4, booktabs=T) %>%
kable_styling(latex_options = c("striped","scale down","hold_position"))
| Census_Year | School_Number | School_Name | Type_of_School | Suburb | Postcode | Reception | Year_1 | Year_2 | Year_3 | Year_4 | Year_5 | Year_6 | Year_7 | Primary_Other | Year_8 | Year_9 | Year_10 | Year_11 | Year_12 | Secondary_Other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020 | 104 | Coromandel Valley Primary School | Primary Education | Coromandel Valley | 5051 | 92.3 | 92.3 | 93.5 | 90.7 | 92.2 | 92.7 | 87.9 | 92.2 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020 | 106 | Cowandilla Primary School | Primary Education | Cowandilla | 5033 | 86.6 | 84.3 | 86.4 | 84.2 | 86.7 | 88.1 | 84.4 | 88.3 | 85.5 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020 | 107 | Crafers Primary School | Primary Education | Crafers | 5152 | 95.0 | 92.7 | 95.6 | 95.4 | 95.1 | 94.4 | 94.8 | 94.2 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020 | 111 | Crystal Brook Primary School | Primary Education | Crystal Brook | 5523 | 90.2 | 93.4 | 92.7 | 92.9 | 91.6 | 92.0 | 93.8 | 93.4 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020 | 114 | Curramulka Primary School | Primary Education | Curramulka | 5580 | 97.4 | 90.6 | 97.9 | 93.0 | 87.5 | 0.0 | 81.4 | 99.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020 | 116 | Murray Bridge Special School | Special Education | Murray Bridge | 5253 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 84.6 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020 | 120 | Gilles Street Primary School | Primary Education | Adelaide | 5000 | 87.7 | 84.7 | 85.9 | 85.2 | 88.3 | 88.9 | 86.3 | 87.8 | 84.3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020 | 123 | Echunga Primary School | Primary Education | Echunga | 5153 | 93.3 | 93.3 | 89.8 | 92.1 | 95.7 | 94.7 | 90.7 | 94.7 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020 | 124 | Eden Hills Primary School | Primary Education | Eden Hills | 5050 | 90.3 | 92.2 | 91.0 | 92.6 | 92.8 | 92.0 | 92.8 | 91.4 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2020 | 127 | Edwardstown Primary School | Primary Education | Melrose Park | 5039 | 86.6 | 88.0 | 87.2 | 89.1 | 83.7 | 87.2 | 83.7 | 85.7 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 |
#Sample of Dataset 2
mydata1 <-read.csv ("C:/Data/index-of-disadvantage-by-school-2020.csv")
mydata5 <- sample_n(mydata1,10)
#Styling the Table
kable(mydata5, booktabs=T) %>%
kable_styling(latex_options = c("striped","scale down","hold_position"))
| ID | School_Number | School_Name | Suburb | Post_Code | Type_of_Schooling | Index_of_Educational_Disadvantage |
|---|---|---|---|---|---|---|
| 440 | 1100 | Surrey Downs R-7 School | Surrey Downs | 5126 | Primary Education | 6 |
| 37 | 770 | Birdwood High School | Birdwood | 5234 | Secondary Education | 6 |
| 237 | 895 | Loxton High School | Loxton | 5333 | Secondary Education | 4 |
| 452 | 810 | Thebarton Senior College | Torrensville | 5031 | Secondary Education | 3 |
| 56 | 752 | Brinkworth Primary School | Brinkworth | 5464 | Primary Education | 4 |
| 370 | 1801 | Port Lincoln Special School | Port Lincoln | 5606 | Special Education | 3 |
| 364 | 355 | Port Augusta West Primary School | Port Augusta West | 5700 | Primary Education | 4 |
| 186 | 777 | Jamestown Community School | Jamestown | 5491 | Primary/Secondary Combined | 6 |
| 310 | 936 | Naracoorte South Primary School | Naracoorte | 5271 | Primary Education | 4 |
| 326 | 762 | Oakbank Area School | Oakbank | 5243 | Primary/Secondary Combined | 6 |
#Sample of Merged Dataset
mydata2 <- merge(mydata0,mydata1,by="School_Number")
mydata6 <- sample_n(mydata2,10)
#Styling the Table
kable(mydata6, booktabs=T) %>%
kable_styling(latex_options = c("striped","scale down","hold_position"))
| School_Number | Census_Year | School_Name.x | Type_of_School | Suburb.x | Postcode | Reception | Year_1 | Year_2 | Year_3 | Year_4 | Year_5 | Year_6 | Year_7 | Primary_Other | Year_8 | Year_9 | Year_10 | Year_11 | Year_12 | Secondary_Other | ID | School_Name.y | Suburb.y | Post_Code | Type_of_Schooling | Index_of_Educational_Disadvantage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1858 | 2020 | Greenwith Primary School | Primary Education | Greenwith | 5125 | 93.8 | 94.1 | 92.3 | 92.2 | 92.4 | 90.4 | 90.2 | 90.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 157 | Greenwith Primary School | Greenwith | 5125 | Primary Education | 6 |
| 935 | 2020 | Para Hills School P-7 | Primary Education | Para Hills | 5096 | 81.4 | 85.3 | 87.5 | 84.5 | 88.1 | 85.4 | 90.3 | 89.0 | 87.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 337 | Para Hills School P-7 | Para Hills | 5096 | Primary Education | 4 |
| 1399 | 2020 | Port Augusta Special School | Special Education | Port Augusta West | 5700 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 55.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 79.3 | 363 | Port Augusta Special School | Port Augusta West | 5700 | Special Education | 1 |
| 1001 | 2020 | Amata Anangu School | Aboriginal/Anangu Schools | Amata | 872 | 56.0 | 52.7 | 49.6 | 37.2 | 43.2 | 56.3 | 52.6 | 45.7 | 0.0 | 34.7 | 40.3 | 21.4 | 33.2 | 27.5 | 0.0 | 15 | Amata Anangu School | Amata | 872 | Aboriginal/Anangu Schools | 1 |
| 391 | 2020 | St Leonards Primary School | Primary Education | Glenelg North | 5045 | 87.3 | 89.6 | 88.7 | 87.0 | 87.6 | 86.2 | 87.5 | 87.5 | 95.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 432 | St Leonards Primary School | Glenelg North | 5045 | Primary Education | 6 |
| 962 | 2020 | Riverland Special School | Special Education | Berri | 5343 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 77.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 84.4 | 395 | Riverland Special School | Berri | 5343 | Special Education | 3 |
| 1183 | 2020 | North Ingle School | Primary Education | Ingle Farm | 5098 | 90.6 | 89.2 | 88.4 | 90.8 | 88.6 | 91.0 | 92.8 | 90.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 316 | North Ingle School | Ingle Farm | 5098 | Primary Education | 4 |
| 475 | 2020 | Westbourne Park Primary School | Primary Education | Westbourne Park | 5041 | 92.4 | 91.6 | 93.0 | 93.3 | 92.3 | 92.6 | 91.9 | 92.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 487 | Westbourne Park Primary School | Westbourne Park | 5041 | Primary Education | 7 |
| 447 | 2020 | Unley Primary School | Primary Education | Unley | 5061 | 90.7 | 89.0 | 90.7 | 90.0 | 92.0 | 89.2 | 90.3 | 87.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 464 | Unley Primary School | Unley | 5061 | Primary Education | 7 |
| 1696 | 2020 | Kenmore Park Anangu School | Aboriginal/Anangu Schools | Kenmore Park | 872 | 9.1 | 69.8 | 33.3 | 87.5 | 58.1 | 79.9 | 68.3 | 0.0 | 75.0 | 67.9 | 87.9 | 100.0 | 0.0 | 0.0 | 0.0 | 204 | Kenmore Park Anangu School | Kenmore Park | 872 | Aboriginal/Anangu Schools | 1 |
#Removing the unnecessary columns from the merged dataset
library(dplyr)
mydata7 <- select(mydata2, -ID, -School_Name.y,-Suburb.y ,-Post_Code,-Type_of_Schooling)
mydata7a <- sample_n(mydata7,5)
#Styling the Table
kable(mydata7a, booktabs=T) %>%
kable_styling(latex_options = c("striped","scale down","hold_position"))
| School_Number | Census_Year | School_Name.x | Type_of_School | Suburb.x | Postcode | Reception | Year_1 | Year_2 | Year_3 | Year_4 | Year_5 | Year_6 | Year_7 | Primary_Other | Year_8 | Year_9 | Year_10 | Year_11 | Year_12 | Secondary_Other | Index_of_Educational_Disadvantage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 239 | 2020 | Macclesfield Primary School | Primary Education | Macclesfield | 5153 | 91.5 | 94.8 | 92.8 | 84.0 | 90.9 | 90.9 | 93.1 | 92.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |
| 270 | 2020 | Mitcham Primary School | Primary Education | Kingswood | 5062 | 91.3 | 91.4 | 92.6 | 92.2 | 91.9 | 90.9 | 91.0 | 89.9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |
| 647 | 2020 | Linden Park Primary School | Primary Education | Linden Park | 5065 | 90.7 | 89.6 | 91.3 | 91.3 | 93.5 | 93.4 | 91.4 | 90.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 |
| 1481 | 2020 | Stirling North Primary School | Primary Education | Stirling North | 5710 | 90.9 | 91.5 | 91.1 | 92.2 | 89.5 | 90.5 | 88.8 | 89.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 340 | 2020 | Ascot Park Primary School | Primary Education | Park Holme | 5043 | 87.2 | 80.7 | 90.3 | 86.1 | 88.1 | 87.4 | 84.6 | 82.6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
#Identifying the number of variables and observations.
# No of Observations in the Dataset
nrow(mydata7)
## [1] 505
# No of Variables in the Dataset
ncol(mydata7)
## [1] 22
#Summary of the Dataset
summary(mydata7)
## School_Number Census_Year School_Name.x Type_of_School
## Min. : 104.0 Min. :2020 Length:505 Length:505
## 1st Qu.: 387.0 1st Qu.:2020 Class :character Class :character
## Median : 754.0 Median :2020 Mode :character Mode :character
## Mean : 770.9 Mean :2020
## 3rd Qu.:1018.0 3rd Qu.:2020
## Max. :1919.0 Max. :2020
##
## Suburb.x Postcode Reception Year_1
## Length:505 Min. : 872 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.:5086 1st Qu.: 79.30 1st Qu.:80.30
## Mode :character Median :5164 Median : 87.50 Median :87.30
## Mean :5177 Mean : 71.71 Mean :72.33
## 3rd Qu.:5351 3rd Qu.: 90.50 3rd Qu.:90.70
## Max. :5734 Max. :100.00 Max. :98.60
##
## Year_2 Year_3 Year_4 Year_5
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 82.05 1st Qu.:81.10 1st Qu.: 81.30 1st Qu.:80.45
## Median : 88.00 Median :88.40 Median : 88.20 Median :87.80
## Mean : 72.75 Mean :73.33 Mean : 72.78 Mean :72.50
## 3rd Qu.: 91.03 3rd Qu.:91.10 3rd Qu.: 91.15 3rd Qu.:90.80
## Max. :100.00 Max. :98.30 Max. :100.00 Max. :98.30
## NA's :1 NA's :2 NA's :2
## Year_6 Year_7 Primary_Other Year_8
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:80.45 1st Qu.: 78.45 1st Qu.: 0.00 1st Qu.: 0.00
## Median :87.60 Median : 87.35 Median : 0.00 Median : 0.00
## Mean :71.99 Mean : 71.91 Mean : 20.56 Mean : 22.26
## 3rd Qu.:90.60 3rd Qu.: 90.58 3rd Qu.: 33.85 3rd Qu.: 57.20
## Max. :97.20 Max. :100.00 Max. :100.00 Max. :100.00
## NA's :2 NA's :3 NA's :3
## Year_9 Year_10 Year_11 Year_12
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.0 Median : 0.00 Median : 0.00
## Mean : 21.73 Mean : 21.4 Mean : 20.48 Mean : 21.19
## 3rd Qu.: 53.40 3rd Qu.: 52.8 3rd Qu.: 22.80 3rd Qu.: 33.40
## Max. :100.00 Max. :100.0 Max. :100.00 Max. :100.00
##
## Secondary_Other Index_of_Educational_Disadvantage
## Min. : 0.000 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:3.000
## Median : 0.000 Median :5.000
## Mean : 9.438 Mean :4.327
## 3rd Qu.: 0.000 3rd Qu.:6.000
## Max. :100.000 Max. :7.000
##
#Checking the types of Variables
str(mydata7)
## 'data.frame': 505 obs. of 22 variables:
## $ School_Number : int 104 106 107 111 114 116 120 123 124 127 ...
## $ Census_Year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ School_Name.x : chr "Coromandel Valley Primary School" "Cowandilla Primary School" "Crafers Primary School" "Crystal Brook Primary School" ...
## $ Type_of_School : chr "Primary Education" "Primary Education" "Primary Education" "Primary Education" ...
## $ Suburb.x : chr "Coromandel Valley" "Cowandilla" "Crafers" "Crystal Brook" ...
## $ Postcode : int 5051 5033 5152 5523 5580 5253 5000 5153 5050 5039 ...
## $ Reception : num 92.3 86.6 95 90.2 97.4 0 87.7 93.3 90.3 86.6 ...
## $ Year_1 : num 92.3 84.3 92.7 93.4 90.6 0 84.7 93.3 92.2 88 ...
## $ Year_2 : num 93.5 86.4 95.6 92.7 97.9 0 85.9 89.8 91 87.2 ...
## $ Year_3 : num 90.7 84.2 95.4 92.9 93 0 85.2 92.1 92.6 89.1 ...
## $ Year_4 : num 92.2 86.7 95.1 91.6 87.5 100 88.3 95.7 92.8 83.7 ...
## $ Year_5 : num 92.7 88.1 94.4 92 0 0 88.9 94.7 92 87.2 ...
## $ Year_6 : num 87.9 84.4 94.8 93.8 81.4 0 86.3 90.7 92.8 83.7 ...
## $ Year_7 : num 92.2 88.3 94.2 93.4 99 0 87.8 94.7 91.4 85.7 ...
## $ Primary_Other : num 0 85.5 0 0 0 84.6 84.3 0 0 0 ...
## $ Year_8 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Year_9 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Year_10 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Year_11 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Year_12 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Secondary_Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Index_of_Educational_Disadvantage: int 7 4 7 5 5 3 4 7 7 6 ...
#Converting a Character variable to a Factor Variable and Labeling
mydata8 <-mydata7
mydata8$Type_of_school <- factor (c(mydata7$Type_of_School),levels = c("Aboriginal/Anangu Schools","Primary Education","Primary/Secondary Combined","Secondary Education","Special Education"),labels=c("ABO","PRIM","PRIM/SEC","SECONDRY","SPECIAL"))
mydata8$Type_of_school
## [1] PRIM PRIM PRIM PRIM PRIM SPECIAL PRIM PRIM
## [9] PRIM PRIM PRIM/SEC PRIM PRIM PRIM PRIM PRIM
## [17] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [25] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM/SEC
## [33] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [41] PRIM PRIM PRIM PRIM PRIM PRIM PRIM SPECIAL
## [49] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM/SEC
## [57] PRIM PRIM PRIM PRIM PRIM PRIM ABO PRIM
## [65] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [73] PRIM PRIM PRIM PRIM PRIM PRIM PRIM/SEC PRIM
## [81] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [89] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [97] ABO PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [105] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [113] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [121] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [129] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [137] PRIM PRIM PRIM PRIM/SEC PRIM PRIM PRIM PRIM
## [145] PRIM PRIM/SEC PRIM PRIM PRIM PRIM PRIM PRIM
## [153] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [161] PRIM PRIM PRIM PRIM SPECIAL PRIM PRIM PRIM
## [169] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [177] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [185] SPECIAL PRIM PRIM PRIM PRIM SECONDRY PRIM PRIM
## [193] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [201] SECONDRY PRIM/SEC PRIM/SEC PRIM PRIM PRIM PRIM PRIM
## [209] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [217] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [225] PRIM PRIM PRIM PRIM PRIM PRIM PRIM/SEC PRIM
## [233] PRIM/SEC PRIM PRIM/SEC PRIM/SEC PRIM/SEC PRIM/SEC PRIM/SEC PRIM/SEC
## [241] PRIM PRIM/SEC PRIM PRIM/SEC PRIM/SEC PRIM PRIM/SEC PRIM/SEC
## [249] PRIM/SEC PRIM/SEC PRIM PRIM/SEC PRIM/SEC PRIM/SEC PRIM/SEC PRIM/SEC
## [257] PRIM/SEC PRIM PRIM/SEC PRIM/SEC PRIM PRIM PRIM/SEC PRIM/SEC
## [265] PRIM/SEC SECONDRY SECONDRY SECONDRY PRIM/SEC PRIM/SEC PRIM/SEC SECONDRY
## [273] SECONDRY PRIM/SEC PRIM/SEC SECONDRY SECONDRY PRIM/SEC SECONDRY SECONDRY
## [281] SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY
## [289] PRIM/SEC SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY
## [297] SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY PRIM
## [305] SPECIAL ABO SECONDRY SECONDRY SECONDRY SECONDRY PRIM PRIM
## [313] PRIM SECONDRY SECONDRY PRIM/SEC PRIM/SEC PRIM/SEC SECONDRY SECONDRY
## [321] PRIM PRIM PRIM/SEC PRIM PRIM PRIM SECONDRY SECONDRY
## [329] SECONDRY SECONDRY PRIM PRIM PRIM PRIM PRIM PRIM
## [337] SECONDRY PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [345] PRIM/SEC SPECIAL SECONDRY SECONDRY PRIM PRIM PRIM PRIM
## [353] PRIM PRIM SECONDRY SECONDRY PRIM ABO PRIM PRIM
## [361] PRIM PRIM PRIM PRIM PRIM PRIM SPECIAL ABO
## [369] ABO ABO PRIM PRIM ABO SECONDRY SECONDRY SPECIAL
## [377] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM/SEC
## [385] SECONDRY ABO PRIM PRIM PRIM PRIM PRIM PRIM
## [393] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [401] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [409] PRIM PRIM/SEC PRIM PRIM SECONDRY PRIM/SEC PRIM/SEC PRIM
## [417] PRIM PRIM PRIM PRIM SECONDRY PRIM PRIM PRIM
## [425] PRIM PRIM PRIM PRIM PRIM/SEC ABO PRIM PRIM
## [433] SECONDRY PRIM PRIM SECONDRY ABO PRIM PRIM PRIM
## [441] PRIM PRIM PRIM PRIM PRIM PRIM PRIM PRIM
## [449] SECONDRY SPECIAL PRIM PRIM PRIM SPECIAL SPECIAL SPECIAL
## [457] SPECIAL PRIM/SEC PRIM PRIM/SEC PRIM/SEC PRIM PRIM PRIM
## [465] PRIM/SEC ABO ABO PRIM PRIM PRIM SPECIAL PRIM
## [473] PRIM SECONDRY SECONDRY ABO PRIM PRIM ABO SECONDRY
## [481] SPECIAL PRIM/SEC SECONDRY PRIM PRIM PRIM PRIM PRIM
## [489] PRIM PRIM ABO PRIM PRIM PRIM PRIM/SEC PRIM/SEC
## [497] PRIM/SEC PRIM SECONDRY PRIM PRIM PRIM PRIM/SEC PRIM
## [505] SECONDRY
## Levels: ABO PRIM PRIM/SEC SECONDRY SPECIAL
#Below Variables are included in the dataset
*Integer
*Factor with Labels
*Character
*Numerical
str(mydata8)
## 'data.frame': 505 obs. of 23 variables:
## $ School_Number : int 104 106 107 111 114 116 120 123 124 127 ...
## $ Census_Year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ School_Name.x : chr "Coromandel Valley Primary School" "Cowandilla Primary School" "Crafers Primary School" "Crystal Brook Primary School" ...
## $ Type_of_School : chr "Primary Education" "Primary Education" "Primary Education" "Primary Education" ...
## $ Suburb.x : chr "Coromandel Valley" "Cowandilla" "Crafers" "Crystal Brook" ...
## $ Postcode : int 5051 5033 5152 5523 5580 5253 5000 5153 5050 5039 ...
## $ Reception : num 92.3 86.6 95 90.2 97.4 0 87.7 93.3 90.3 86.6 ...
## $ Year_1 : num 92.3 84.3 92.7 93.4 90.6 0 84.7 93.3 92.2 88 ...
## $ Year_2 : num 93.5 86.4 95.6 92.7 97.9 0 85.9 89.8 91 87.2 ...
## $ Year_3 : num 90.7 84.2 95.4 92.9 93 0 85.2 92.1 92.6 89.1 ...
## $ Year_4 : num 92.2 86.7 95.1 91.6 87.5 100 88.3 95.7 92.8 83.7 ...
## $ Year_5 : num 92.7 88.1 94.4 92 0 0 88.9 94.7 92 87.2 ...
## $ Year_6 : num 87.9 84.4 94.8 93.8 81.4 0 86.3 90.7 92.8 83.7 ...
## $ Year_7 : num 92.2 88.3 94.2 93.4 99 0 87.8 94.7 91.4 85.7 ...
## $ Primary_Other : num 0 85.5 0 0 0 84.6 84.3 0 0 0 ...
## $ Year_8 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Year_9 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Year_10 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Year_11 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Year_12 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Secondary_Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Index_of_Educational_Disadvantage: int 7 4 7 5 5 3 4 7 7 6 ...
## $ Type_of_school : Factor w/ 5 levels "ABO","PRIM","PRIM/SEC",..: 2 2 2 2 2 5 2 2 2 2 ...
#Replacing the zeros
This data set consists of zeros in a messy format (0.0) and replacing them with a tidier formatting option “_”
Please note missing values are not dealt here
library(dplyr)
mydata9 <- mydata8
mydata9 [mydata9 == 0] <- "_"
mydata13 <-mydata9
mydata13a <- sample_n(mydata9,6)
knitr:: kable(mydata13a, booktabs=T) %>%
kable_styling(latex_options = c("striped","scale down","hold_position"))
| School_Number | Census_Year | School_Name.x | Type_of_School | Suburb.x | Postcode | Reception | Year_1 | Year_2 | Year_3 | Year_4 | Year_5 | Year_6 | Year_7 | Primary_Other | Year_8 | Year_9 | Year_10 | Year_11 | Year_12 | Secondary_Other | Index_of_Educational_Disadvantage | Type_of_school |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 323 | 2020 | Coorara Primary School | Primary Education | Morphett Vale | 5162 | 88.8 | 91.3 | 85.9 | 85.8 | 87.1 | 84.2 | 84.3 | 78 | 83.1 | _ | _ | _ | _ | _ | _ | 3 | PRIM |
| 449 | 2020 | Uraidla Primary School | Primary Education | Uraidla | 5142 | 93.2 | 89.8 | 89.9 | 93.1 | 92.4 | 94.7 | 92.7 | 92.9 | _ | _ | _ | _ | _ | _ | _ | 7 | PRIM |
| 1060 | 2020 | O’Sullivan Beach Primary School | Primary Education | O’Sullivan Beach | 5166 | 75 | 85.1 | 87.7 | 83.8 | 74.2 | 87.5 | 84.5 | 71.4 | _ | _ | _ | _ | _ | _ | _ | 2 | PRIM |
| 253 | 2020 | Melrose Primary School | Primary Education | Melrose | 5483 | _ | 95.9 | 88.8 | 95.5 | 85.1 | 92.3 | 79.9 | 72.5 | _ | _ | _ | _ | _ | _ | _ | 7 | PRIM |
| 1903 | 2020 | Blair Athol North School B-7 | Primary Education | Blair Athol | 5084 | 80.1 | 76.1 | 83.1 | 80.9 | 82 | 84.4 | 81.1 | 84.9 | 87.6 | _ | _ | _ | _ | _ | _ | 2 | PRIM |
| 822 | 2020 | Whyalla High School | Secondary Education | Whyalla | 5600 | _ | _ | _ | _ | _ | _ | _ | _ | _ | 81.6 | 78.5 | 75.3 | _ | _ | _ | 2 | SECONDRY |
#Dropping/ Renaming and Ordering the Variables
library(dplyr)
mydata10 <- select(mydata9, -Census_Year, -Type_of_School)
mydata11 <- (rename(mydata10,School_No =School_Number,School=School_Name.x,Suburb=Suburb.x,Index_Edu_Disadvantage=Index_of_Educational_Disadvantage
))
col_order <- c("School_No","School","Type_of_school","Index_Edu_Disadvantage","Suburb","Postcode","Reception","Year_1","Year_2","Year_3","Year_4","Year_5","Year_6","Year_7","Primary_Other","Year_8","Year_9","Year_10","Year_11","Year_12","Secondary_Other")
mydata12 <- mydata11 %>% select(col_order)
mydata14 <- sample_n(mydata12,5)
#Styling the Table
kable(mydata14, booktabs=T) %>%
kable_styling(latex_options = c("striped","scale down","hold_position"))
| School_No | School | Type_of_school | Index_Edu_Disadvantage | Suburb | Postcode | Reception | Year_1 | Year_2 | Year_3 | Year_4 | Year_5 | Year_6 | Year_7 | Primary_Other | Year_8 | Year_9 | Year_10 | Year_11 | Year_12 | Secondary_Other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 424 | Tailem Bend Primary School | PRIM | 3 | Tailem Bend | 5260 | 88.2 | 84.7 | 87.9 | 86.3 | 92 | 87.9 | 87.7 | 87.6 | _ | _ | _ | _ | _ | _ | _ |
| 237 | Lyndoch Primary School | PRIM | 6 | Lyndoch | 5351 | 91.7 | 90.1 | 92.8 | 90.8 | 89.1 | 89.9 | 86.2 | 92.6 | _ | _ | _ | _ | _ | _ | _ |
| 990 | Augusta Park Primary School | PRIM | 1 | Port Augusta | 5700 | 74.1 | 76 | 77.3 | 76.7 | 70.1 | 80.8 | 68 | 74.4 | 66.6 | _ | _ | _ | _ | _ | _ |
| 740 | Orroroo Area School | PRIM/SEC | 6 | Orroroo | 5431 | 91.9 | 98.3 | 94.7 | 95.4 | 89.7 | 95 | 89.3 | 95.4 | _ | 91.4 | 85.5 | 91.5 | 86.5 | 94 | _ |
| 928 | Grant High School | SECONDRY | 3 | Mount Gambier | 5290 | _ | _ | _ | _ | _ | _ | _ | _ | _ | 92.2 | 87.9 | 89.4 | 84.5 | 88.9 | 87.6 |
#Rechecking the Variable Types after the clean up
*Some variables have been turned back to Character when replacing the zeros with “_“.
summary(mydata12)
## School_No School Type_of_school Index_Edu_Disadvantage
## Min. : 104.0 Length:505 ABO : 16 Min. :1.000
## 1st Qu.: 387.0 Class :character PRIM :350 1st Qu.:3.000
## Median : 754.0 Mode :character PRIM/SEC: 60 Median :5.000
## Mean : 770.9 SECONDRY: 64 Mean :4.327
## 3rd Qu.:1018.0 SPECIAL : 15 3rd Qu.:6.000
## Max. :1919.0 Max. :7.000
## Suburb Postcode Reception Year_1
## Length:505 Min. : 872 Length:505 Length:505
## Class :character 1st Qu.:5086 Class :character Class :character
## Mode :character Median :5164 Mode :character Mode :character
## Mean :5177
## 3rd Qu.:5351
## Max. :5734
## Year_2 Year_3 Year_4 Year_5
## Length:505 Length:505 Length:505 Length:505
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Year_6 Year_7 Primary_Other Year_8
## Length:505 Length:505 Length:505 Length:505
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Year_9 Year_10 Year_11 Year_12
## Length:505 Length:505 Length:505 Length:505
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Secondary_Other
## Length:505
## Class :character
## Mode :character
##
##
##
#Converting the Character Variables to Numeric
mydata15 <- mydata12
mydata15$Reception <- as.numeric (mydata12$Reception)
mydata15$Year_1 <- as.numeric (mydata12$Year_1)
mydata15$Year_2 <- as.numeric (mydata12$Year_2)
mydata15$Year_3 <- as.numeric (mydata12$Year_3)
mydata15$Year_4 <- as.numeric (mydata12$Year_4)
mydata15$Year_5 <- as.numeric (mydata12$Year_5)
mydata15$Year_6 <- as.numeric (mydata12$Year_6)
mydata15$Year_7 <- as.numeric (mydata12$Year_7)
mydata15$Primary_Other <- as.numeric (mydata14$Primary_Other)
mydata15$Year_8 <- as.numeric (mydata12$Year_8)
mydata15$Year_9 <- as.numeric (mydata12$Year_9)
mydata15$Year_10 <- as.numeric (mydata12$Year_10)
mydata15$Year_11 <- as.numeric (mydata12$Year_11)
mydata15$Year_12 <- as.numeric (mydata12$Year_12)
mydata15$Secondary_Other<- as.numeric (mydata12$Secondary_Other)
#Mutating Variables
Mutating below 3 variables
*Mean_Attendance_Rate
*Minimum_Attendance_Rate
*Maximum_Attendance_Rate
mydata16 <-mydata15
library(dplyr)
mydata17 <- mydata16 %>%
mutate(Mean_Attendance_Rate=rowMeans(.[ , c("Reception","Year_1","Year_2","Year_3","Year_4","Year_5","Year_6","Year_7","Primary_Other","Year_8","Year_9","Year_10","Year_11","Year_12","Secondary_Other")], na.rm=TRUE))
mydata18 <- mydata17 %>%
rowwise() %>%
mutate(Minimum_Attendance_Rate = min(Reception, Year_1,Year_2,Year_3,Year_4,Year_5,Year_6,Year_7,Primary_Other,Year_8,Year_9,Year_10,Year_11,Year_12,Secondary_Other, na.rm = T)) %>%
mutate(Maximum_Attendance_Rate = max(Reception, Year_1,Year_2,Year_3,Year_4,Year_5,Year_6,Year_7,Primary_Other,Year_8,Year_9,Year_10,Year_11,Year_12,Secondary_Other, na.rm = T))
mydata19 <- head(mydata18,5)
#Styling the Table
kable(mydata19, booktabs=T) %>%
kable_styling(latex_options = c("striped","scale_down","hold_position"))
| School_No | School | Type_of_school | Index_Edu_Disadvantage | Suburb | Postcode | Reception | Year_1 | Year_2 | Year_3 | Year_4 | Year_5 | Year_6 | Year_7 | Primary_Other | Year_8 | Year_9 | Year_10 | Year_11 | Year_12 | Secondary_Other | Mean_Attendance_Rate | Minimum_Attendance_Rate | Maximum_Attendance_Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 104 | Coromandel Valley Primary School | PRIM | 7 | Coromandel Valley | 5051 | 92.3 | 92.3 | 93.5 | 90.7 | 92.2 | 92.7 | 87.9 | 92.2 | NA | NA | NA | NA | NA | NA | NA | 91.72500 | 87.9 | 93.5 |
| 106 | Cowandilla Primary School | PRIM | 4 | Cowandilla | 5033 | 86.6 | 84.3 | 86.4 | 84.2 | 86.7 | 88.1 | 84.4 | 88.3 | NA | NA | NA | NA | NA | NA | NA | 86.12500 | 84.2 | 88.3 |
| 107 | Crafers Primary School | PRIM | 7 | Crafers | 5152 | 95.0 | 92.7 | 95.6 | 95.4 | 95.1 | 94.4 | 94.8 | 94.2 | 66.6 | NA | NA | NA | NA | NA | NA | 91.53333 | 66.6 | 95.6 |
| 111 | Crystal Brook Primary School | PRIM | 5 | Crystal Brook | 5523 | 90.2 | 93.4 | 92.7 | 92.9 | 91.6 | 92.0 | 93.8 | 93.4 | NA | NA | NA | NA | NA | NA | NA | 92.50000 | 90.2 | 93.8 |
| 114 | Curramulka Primary School | PRIM | 5 | Curramulka | 5580 | 97.4 | 90.6 | 97.9 | 93.0 | 87.5 | NA | 81.4 | 99.0 | NA | NA | NA | NA | NA | NA | NA | 92.40000 | 81.4 | 99.0 |
#Merging the datasets
*This additional step had to be done because mydata18 has got the actual missing values as well as the values should be missed as well. The reason behind this was when converting to the numeric variable, the character variable which was “_” has converted to NA as well and its mixing up with the actual NA values.
*Therefore I will be merging the dataset -mydata12 which should have the missing values “_” in this format and actual missing values in NA format with the dataset - mydata18 where I can get the mutated variables
*Dropping unwanted variables
*Renaming variables
mydata19 <- merge(mydata12,mydata18, by="School_No")
mydata20 <-select(mydata19,-School.y,-Type_of_school.y,-Index_Edu_Disadvantage.y,-Suburb.y,-Postcode.y,-Reception.y,-Year_1.y,-Year_2.y,-Year_3.y,-Year_4.y,-Year_5.y,-Year_6.y,-Year_7.y,-Primary_Other.y,-Year_8.y,-Year_9.y,-Year_10.y,-Year_11.y,-Year_12.y,-Secondary_Other.y)
mydata21 <- rename(mydata20,School=School.x,Type_of_school =Type_of_school.x,Index_Edu_Disadvantage=Index_Edu_Disadvantage.x,Suburb=Suburb.x,Postcode=Postcode.x,Reception=Reception.x,Year_1=Year_1.x,Year_2=Year_2.x,Year_3=Year_3.x,Year_4=Year_4.x,Year_5=Year_5.x,Year_6=Year_6.x,Year_7=Year_7.x,Primary_Other=Primary_Other.x,Year_8=Year_8.x,Year_9=Year_9.x,Year_10=Year_10.x,Year_11=Year_11.x,Year_12=Year_12.x,Secondary_Other=Secondary_Other.x)
mydata22<-head(mydata21,5)
#Styling the Table
kable(mydata22, booktabs=T) %>%
kable_styling(latex_options = c("striped","scale_down","hold_position"))
| School_No | School | Type_of_school | Index_Edu_Disadvantage | Suburb | Postcode | Reception | Year_1 | Year_2 | Year_3 | Year_4 | Year_5 | Year_6 | Year_7 | Primary_Other | Year_8 | Year_9 | Year_10 | Year_11 | Year_12 | Secondary_Other | Mean_Attendance_Rate | Minimum_Attendance_Rate | Maximum_Attendance_Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 104 | Coromandel Valley Primary School | PRIM | 7 | Coromandel Valley | 5051 | 92.3 | 92.3 | 93.5 | 90.7 | 92.2 | 92.7 | 87.9 | 92.2 | _ | _ | _ | _ | _ | _ | _ | 91.72500 | 87.9 | 93.5 |
| 106 | Cowandilla Primary School | PRIM | 4 | Cowandilla | 5033 | 86.6 | 84.3 | 86.4 | 84.2 | 86.7 | 88.1 | 84.4 | 88.3 | 85.5 | _ | _ | _ | _ | _ | _ | 86.12500 | 84.2 | 88.3 |
| 107 | Crafers Primary School | PRIM | 7 | Crafers | 5152 | 95 | 92.7 | 95.6 | 95.4 | 95.1 | 94.4 | 94.8 | 94.2 | _ | _ | _ | _ | _ | _ | _ | 91.53333 | 66.6 | 95.6 |
| 111 | Crystal Brook Primary School | PRIM | 5 | Crystal Brook | 5523 | 90.2 | 93.4 | 92.7 | 92.9 | 91.6 | 92 | 93.8 | 93.4 | _ | _ | _ | _ | _ | _ | _ | 92.50000 | 90.2 | 93.8 |
| 114 | Curramulka Primary School | PRIM | 5 | Curramulka | 5580 | 97.4 | 90.6 | 97.9 | 93 | 87.5 | _ | 81.4 | 99 | _ | _ | _ | _ | _ | _ | _ | 92.40000 | 81.4 | 99.0 |
#Identifying the Missing Values
mydata23 <- mydata21
options(max.print = 999999999)
#is.na(mydata23)#
#Count of Missing Values
sum(is.na(mydata23))
## [1] 14
#Calculating the Mean of Missing Value
mean(is.na(mydata23))
## [1] 0.001155116
#Finding where the data is missing
which(is.na(mydata23))
## [1] 4137 5113 5147 5592 5730 6079 6123 6612 6627 6654 7402 7405
## [13] 7406 10981
#Replacing the missing values with Mean
mydata24<- mydata23
for (i in 1:ncol(mydata23)){mydata24[ ,i][is.na(mydata24[ ,i])] <- mean(mydata24[ ,i],na.rm=TRUE)}
mydata25 <- head(mydata24,15)
#Styling the Table
kable(mydata25, booktabs=T) %>%
kable_styling(latex_options = c("striped","scale_down","hold_position"))
| School_No | School | Type_of_school | Index_Edu_Disadvantage | Suburb | Postcode | Reception | Year_1 | Year_2 | Year_3 | Year_4 | Year_5 | Year_6 | Year_7 | Primary_Other | Year_8 | Year_9 | Year_10 | Year_11 | Year_12 | Secondary_Other | Mean_Attendance_Rate | Minimum_Attendance_Rate | Maximum_Attendance_Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 104 | Coromandel Valley Primary School | PRIM | 7 | Coromandel Valley | 5051 | 92.3 | 92.3 | 93.5 | 90.7 | 92.2 | 92.7 | 87.9 | 92.2 | _ | _ | _ | _ | _ | _ | _ | 91.72500 | 87.9 | 93.5 |
| 106 | Cowandilla Primary School | PRIM | 4 | Cowandilla | 5033 | 86.6 | 84.3 | 86.4 | 84.2 | 86.7 | 88.1 | 84.4 | 88.3 | 85.5 | _ | _ | _ | _ | _ | _ | 86.12500 | 84.2 | 88.3 |
| 107 | Crafers Primary School | PRIM | 7 | Crafers | 5152 | 95 | 92.7 | 95.6 | 95.4 | 95.1 | 94.4 | 94.8 | 94.2 | _ | _ | _ | _ | _ | _ | _ | 91.53333 | 66.6 | 95.6 |
| 111 | Crystal Brook Primary School | PRIM | 5 | Crystal Brook | 5523 | 90.2 | 93.4 | 92.7 | 92.9 | 91.6 | 92 | 93.8 | 93.4 | _ | _ | _ | _ | _ | _ | _ | 92.50000 | 90.2 | 93.8 |
| 114 | Curramulka Primary School | PRIM | 5 | Curramulka | 5580 | 97.4 | 90.6 | 97.9 | 93 | 87.5 | _ | 81.4 | 99 | _ | _ | _ | _ | _ | _ | _ | 92.40000 | 81.4 | 99.0 |
| 116 | Murray Bridge Special School | SPECIAL | 3 | Murray Bridge | 5253 | _ | _ | _ | _ | 100 | _ | _ | _ | 84.6 | _ | _ | _ | _ | _ | _ | 100.00000 | 100.0 | 100.0 |
| 120 | Gilles Street Primary School | PRIM | 4 | Adelaide | 5000 | 87.7 | 84.7 | 85.9 | 85.2 | 88.3 | 88.9 | 86.3 | 87.8 | 84.3 | _ | _ | _ | _ | _ | _ | 86.85000 | 84.7 | 88.9 |
| 123 | Echunga Primary School | PRIM | 7 | Echunga | 5153 | 93.3 | 93.3 | 89.8 | 92.1 | 95.7 | 94.7 | 90.7 | 94.7 | _ | _ | _ | _ | _ | _ | _ | 90.10000 | 66.6 | 95.7 |
| 124 | Eden Hills Primary School | PRIM | 7 | Eden Hills | 5050 | 90.3 | 92.2 | 91 | 92.6 | 92.8 | 92 | 92.8 | 91.4 | _ | _ | _ | _ | _ | _ | _ | 91.88750 | 90.3 | 92.8 |
| 127 | Edwardstown Primary School | PRIM | 6 | Melrose Park | 5039 | 86.6 | 88 | 87.2 | 89.1 | 83.7 | 87.2 | 83.7 | 85.7 | _ | _ | _ | _ | _ | _ | _ | 86.40000 | 83.7 | 89.1 |
| 128 | Elliston Area School | PRIM/SEC | 5 | Elliston | 5670 | 81.6 | 80 | 93 | 87.3 | 91.9 | 96.6 | 92.1 | 89.4 | _ | 93.7 | 95.1 | 97.9 | _ | _ | _ | 90.78182 | 80.0 | 97.9 |
| 131 | Norwood Primary School | PRIM | 7 | Norwood | 5067 | 89.7 | 90.3 | 92.7 | 91.7 | 91.8 | 91.9 | 90.6 | 87.9 | _ | _ | _ | _ | _ | _ | _ | 90.82500 | 87.9 | 92.7 |
| 135 | Flinders Park Primary School | PRIM | 4 | Flinders Park | 5025 | 86.8 | 90.8 | 90.4 | 87.5 | 86.9 | 90.6 | 90.5 | 92.1 | 89.8 | _ | _ | _ | _ | _ | _ | 86.91111 | 66.6 | 92.1 |
| 136 | Lenswood Primary School | PRIM | 7 | Lenswood | 5240 | 94.6 | 86.1 | 91.4 | 93.2 | 91.2 | 93.4 | 91.6 | 87.6 | _ | _ | _ | _ | _ | _ | _ | 91.13750 | 86.1 | 94.6 |
| 138 | Frances Primary School | PRIM | 7 | Frances | 5262 | 96.1 | 94.5 | 95.4 | 91.5 | 88.1 | 87.6 | 95.4 | 89.9 | _ | _ | _ | _ | _ | _ | _ | 92.31250 | 87.6 | 96.1 |
#Detecting Outliers of numerical variables
Outliers are detected in the below variables
*Mean_Attendance_Rate
*Minimum_Attendance_Rate
*Maximum_Attendance_Rate
mydata26 <- select(mydata23,Mean_Attendance_Rate :Maximum_Attendance_Rate)
library(ggplot2)
par(mfrow = c(1, ncol(mydata26)))
invisible(lapply(1:ncol(mydata26), function(i) boxplot(mydata26[, i])))
#Finding the Outliers
*Finding the Outliers of the 3 variables seperatly.
Outliers1 <- boxplot(mydata26$Mean_Attendance_Rate,plot =FALSE)$out
Outliers1
## [1] 100.00000 68.03636 68.86000 74.30000 60.00000 62.15000 74.48750
## [8] 74.94615 72.78333 70.85000 64.12143 74.61250 68.12727 73.77778
## [15] 46.84286 42.33846 57.25000 68.68750 72.78333 47.11538 65.41250
## [22] 54.01538 53.64000 52.43846 52.88000 74.96250 74.04444 58.33750
## [29] 59.52308 58.60000 66.18000 53.43846 56.35000
Outliers2 <- boxplot(mydata26$Minimum_Attendance_Rate,plot =FALSE)$out
Outliers2
## [1] 26.5 33.3 12.5 11.9 21.4 22.2 Inf 22.8 32.6 21.4 28.4 33.5 9.1 12.1
Outliers3 <- boxplot(mydata26$Maximum_Attendance_Rate,plot =FALSE)$out
Outliers3
## [1] 74.3 71.7 79.0 78.6 76.2 80.5 79.1 80.8 80.8 72.8 56.3 78.5 -Inf 75.2 79.3
## [16] 63.1 72.6 61.6 81.0 79.3 72.4 71.7 75.3 66.9
#Summary of Outliers
summary(Outliers1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42.34 56.35 65.41 64.30 72.78 100.00
summary(Outliers2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.10 14.72 22.50 Inf 31.55 Inf
summary(Outliers3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -Inf 71.70 75.25 -Inf 79.15 81.00
#Remove Outliers - Mean_Attendance_Rate Variable
*Replacing the Outliers1 with the Median for Mean_Attendance_Rate Variable
*When comparing the Before & After boxplot charts, its clear that outliers have been controlled
mydata27 <- mydata26
mydata27$Mean_Attendance_Rate[mydata27$Mean_Attendance_Rate %in% boxplot(mydata27$Mean_Attendance_Rate)$out]<-median(mydata27$Mean_Attendance_Rate)
boxplot(mydata27$Mean_Attendance_Rate)
#Remove Outliers - Minimum_Attendance_Rate Variable
*Replacing the Outliers2 with the Median for Minimum_Attendance_Rate Variable
*When comparing the Before & After boxplot charts, its clear that outliers have been controlled
mydata28 <- mydata27
mydata28$Minimum_Attendance_Rate[mydata28$Minimum_Attendance_Rate %in% boxplot(mydata28$Minimum_Attendance_Rate)$out]<-median(mydata28$Minimum_Attendance_Rate)
boxplot(mydata28$Minimum_Attendance_Rate)
#Remove Outliers - Maximum_Attendance_Rate Variable
*Replacing the Outliers3 with the Median for Minimum_Attendance_Rate Variable
*When comparing the Before & After boxplot charts, its clear that outliers have been controlled
mydata29 <- mydata28
mydata29$Maximum_Attendance_Rate[mydata29$Maximum_Attendance_Rate %in% boxplot(mydata29$Maximum_Attendance_Rate)$out]<-median(mydata29$Maximum_Attendance_Rate)
boxplot(mydata29$Maximum_Attendance_Rate)
#Histogram for Mean_Attendance_Rate Variable
*Most appropriate Transformation is the Right skewed transformation which converts likely into a normal distribution `
#Applying the sqrt Transformation
sqrt_att <- sqrt (mydata29$Mean_Attendance_Rate)
hist(sqrt_att,
main = "Histogram of the Square Root of Mean Attendance Rate",
xlab = "Square Root of Mean Attendance Rate")
right_skew <- mydata29
right_skew <- right_skew$Mean_Attendance_Rate
hist(right_skew,
main = "Histogram of right-skewed dataset",
xlab = "Right-skewed Mean Attendance Rate")
right_skew_recip <- 1 / right_skew
hist(right_skew_recip)
#Histogram for Minimum_Attendance_Rate Variable
*Most appropriate Transformation is the Left skewed transformation which converts likely into a normal distribution. `
#Applying the sqrt Transformation
sqrt_min <- sqrt (mydata29$Minimum_Attendance_Rate)
hist(sqrt_min,
main = "Histogram of the Square Root of Minimum Attendance Rate",
xlab = "Square Root of Minimum Attendance Rate")
left_skewed <- mydata29
left_skewed <- left_skewed[,-1]
hist(left_skewed$Minimum_Attendance_Rate,
main= "Histogram of left-skewed dataset",
xlab="Left-skewed Minimum Attendance Rate")
right_skew <- mydata29
right_skew <- right_skew$Minimum_Attendance_Rate
hist(right_skew,
main = "Histogram of right-skewed dataset",
xlab = "Right-skewed Minimum Attendance Rate")
right_skew_recip <- 1 / right_skew
hist(right_skew_recip)
#Histogram for Maximum_Attendance_Rate Variable
*Most appropriate Transformation is the left skewed transformation which converts likely into a normal distribution
#Applying the sqrt Transformation
sqrt_max <- sqrt (mydata29$Maximum_Attendance_Rate)
hist(sqrt_max,
main = "Histogram of the Square Root of Maximum Attendance Rate",
xlab = "Square Root of Maximum Attendance Rate")
left_skewed <- mydata29
left_skewed <- left_skewed[,-1]
hist(left_skewed$Maximum_Attendance_Rate,
main= "Histogram of left-skewed dataset",
xlab="Left-skewed Maximum Attendance Rate")
right_skew <- mydata29
right_skew <- right_skew$Maximum_Attendance_Rate
hist(right_skew,
main = "Histogram of right-skewed dataset",
xlab = "Right-skewed Maximum Attendance Rate")
right_skew_recip <- 1 / right_skew
hist(right_skew_recip)
*Government of South Australia(2020)Attendance rate by school by year level 2020,Data SA,accessed 12 April 2022,https://data.sa.gov.au/data/dataset/attendance-percentage-by-school-by-year-level
*Government of South Australia(2020)Index of Educational Disadvantage by school,Data SA,accessed 15 April 2022,https://data.sa.gov.au/data/dataset/index-of-disadvantage-by-school
*Zhu, Hao. 2021. kableExtra: Construct Complex Table with Kable and Pipe Syntax.
The two datsets I found are very interesting and informative datasets where I could gather more insights and provide a very good analysis.
The attendance rate dataset was not a very tidy dataset and it looked messy but my initial plan was to merge them and see how it looks like and see if I can proceed further.
Merging the datsets was not a difficult task as they both had a common variable which is the school number.
After merging the datsets i wanted to clean and tidy the dataset.
There were two sets of missing values in my dataset.
#Missing values with ‘zeros’ and they should be missing values because primary schools have attendances till grade 7 and grades above should be 0.
#Missing values with NA as they should have got attendance rate but missing for some reason.My intention was to replace the zeros with the “_” sign, so that it will look lot more tidier in the dataset.
There was no issue converting the zero’s to “_” but I didn’t realize the damage it has done to the dataset as all the attendance rate variables have got converted to Character variables and I couldn’t mutate variables as its giving an error message saying the variable has to be numeric.
Therefore I had to convert the Character variable to Numeric variable in order to mutate variables.
When I went to scan the variables, the actual missing value count was incorrect because all the missing values have converted to NA and could not find the actual missing values.
To overcome this situation I had to merge two datsets where one after converting the zeros to “_” and the dataset i subset with the mutating variables. So that I will have the actual missing values where I can scan the data and apply a solution correctly.
One thing I have learnt is you cannot go as you planned as there will be hiccups/shows stoppers/ turning points and the moment you think of giving up you’ll somehow find a workaround because you are doing this subject with real passion. Therefore I would say every hiccup/shows stopper/ turning point was a real learning curve for me as the workarounds and the knowledge you try to gain to fix the issue will be rarely forgotten.