Data Wrangling - Dataset challenge

Data Description

*This dataset is about about combination of attendance rate for Semester 1 in SA Government Schools by year level and Index of Educational disadvantage by Government Schools in SA for year 2000.

*Combination of this two datasets are to analyse the relationship between the attendance levels of students and the index of educational disadvantage.The most disadvantage schools have an index of 1 and the least disadvantaged have an index of 7.

*Detailed explanation regarding the two datasets are done below separately.

*Both these datasets have a common variable of School Number and will be using this for the merge.

*Understanding the dataset, inspecting the variables and converting them to proper data type conversions are done in the Understand section.

#Variable Description

*School Number - Each school has a number in South Austrlia

*School Name - Name of the school

*Type of School - If it’s a Primary/Secondary/Aboroginal/Special school

*Suburb/Postcode - Suburb and Postcode of the school

*Attendance Rates from Reception to Secondary

*Index of Educational Disadvantage - Most disadvantage schools have an index of 1 and the least disadvantaged have an index of 7

#Dataset 1#

*Attendance rate for semester 1 in SA Government schools by school and year level in year 2000

*Calculation for Attendance Rate = (No of days attending school/ No of days student enrolled) x 100

*This dataset contians untidy data and will be dealt with this data in the Tidy data section.

*Lots of zeros can be seen in this dataset and they are for the year levels who doesn’t have such classes Example: Coromandel Valley Primary School doesn’t have attendaces after Year 7 as it’s a primary school.

*Also this dataset has missing values with NA and that will be dealt separately in the Scanning section.

#Dataset 2#

*The Index of Educational disadvantage for SA Government school in year 2000

*The most disadvantage schools have an index of 1 and the least disadvantaged have an index of 7

*Index is calculated using the measures of

  * Parental economic resource
  
  * Parental ecucation and occupation
  
  * Aboriginality
  
  * Student mobility

#Sample of Dataset 1

mydata0 <-read.csv ("C:/Data/attendance-rate-by-school-2000.csv")

mydata4 <- head(mydata0,10)


#Styling the Table


knitr:: kable(mydata4, booktabs=T) %>%
  kable_styling(latex_options = c("striped","scale down","hold_position"))

Census_Year	School_Number	School_Name	Type_of_School	Suburb	Postcode	Reception	Year_1	Year_2	Year_3	Year_4	Year_5	Year_6	Year_7	Primary_Other
2020	104	Coromandel Valley Primary School	Primary Education	Coromandel Valley	5051	92.3	92.3	93.5	90.7	92.2	92.7	87.9	92.2	0.0
2020	106	Cowandilla Primary School	Primary Education	Cowandilla	5033	86.6	84.3	86.4	84.2	86.7	88.1	84.4	88.3	85.5
2020	107	Crafers Primary School	Primary Education	Crafers	5152	95.0	92.7	95.6	95.4	95.1	94.4	94.8	94.2	0.0
2020	111	Crystal Brook Primary School	Primary Education	Crystal Brook	5523	90.2	93.4	92.7	92.9	91.6	92.0	93.8	93.4	0.0
2020	114	Curramulka Primary School	Primary Education	Curramulka	5580	97.4	90.6	97.9	93.0	87.5	0.0	81.4	99.0	0.0
2020	116	Murray Bridge Special School	Special Education	Murray Bridge	5253	0.0	0.0	0.0	0.0	100.0	0.0	0.0	0.0	84.6
2020	120	Gilles Street Primary School	Primary Education	Adelaide	5000	87.7	84.7	85.9	85.2	88.3	88.9	86.3	87.8	84.3
2020	123	Echunga Primary School	Primary Education	Echunga	5153	93.3	93.3	89.8	92.1	95.7	94.7	90.7	94.7	0.0
2020	124	Eden Hills Primary School	Primary Education	Eden Hills	5050	90.3	92.2	91.0	92.6	92.8	92.0	92.8	91.4	0.0
2020	127	Edwardstown Primary School	Primary Education	Melrose Park	5039	86.6	88.0	87.2	89.1	83.7	87.2	83.7	85.7	0.0

#Sample of Dataset 2

mydata1 <-read.csv ("C:/Data/index-of-disadvantage-by-school-2020.csv")

mydata5 <- sample_n(mydata1,10)


#Styling the Table


kable(mydata5, booktabs=T) %>%
  kable_styling(latex_options = c("striped","scale down","hold_position"))

ID	School_Number	School_Name	Suburb	Post_Code	Type_of_Schooling	Index_of_Educational_Disadvantage
440	1100	Surrey Downs R-7 School	Surrey Downs	5126	Primary Education	6
37	770	Birdwood High School	Birdwood	5234	Secondary Education	6
237	895	Loxton High School	Loxton	5333	Secondary Education	4
452	810	Thebarton Senior College	Torrensville	5031	Secondary Education	3
56	752	Brinkworth Primary School	Brinkworth	5464	Primary Education	4
370	1801	Port Lincoln Special School	Port Lincoln	5606	Special Education	3
364	355	Port Augusta West Primary School	Port Augusta West	5700	Primary Education	4
186	777	Jamestown Community School	Jamestown	5491	Primary/Secondary Combined	6
310	936	Naracoorte South Primary School	Naracoorte	5271	Primary Education	4
326	762	Oakbank Area School	Oakbank	5243	Primary/Secondary Combined	6

#Sample of Merged Dataset

mydata2 <- merge(mydata0,mydata1,by="School_Number")

mydata6 <- sample_n(mydata2,10)


#Styling the Table


kable(mydata6, booktabs=T) %>%
  kable_styling(latex_options = c("striped","scale down","hold_position"))

School_Number	Census_Year	School_Name.x	Type_of_School	Suburb.x	Postcode	Reception	Year_1	Year_2	Year_3	Year_4	Year_5	Year_6	Year_7	Primary_Other	Year_8	Year_9	Year_10	Year_11	Year_12	Secondary_Other	ID	School_Name.y	Suburb.y	Post_Code	Type_of_Schooling	Index_of_Educational_Disadvantage
1858	2020	Greenwith Primary School	Primary Education	Greenwith	5125	93.8	94.1	92.3	92.2	92.4	90.4	90.2	90.7	0.0	0.0	0.0	0.0	0.0	0.0	0.0	157	Greenwith Primary School	Greenwith	5125	Primary Education	6
935	2020	Para Hills School P-7	Primary Education	Para Hills	5096	81.4	85.3	87.5	84.5	88.1	85.4	90.3	89.0	87.4	0.0	0.0	0.0	0.0	0.0	0.0	337	Para Hills School P-7	Para Hills	5096	Primary Education	4
1399	2020	Port Augusta Special School	Special Education	Port Augusta West	5700	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	55.6	0.0	0.0	0.0	0.0	0.0	79.3	363	Port Augusta Special School	Port Augusta West	5700	Special Education	1
1001	2020	Amata Anangu School	Aboriginal/Anangu Schools	Amata	872	56.0	52.7	49.6	37.2	43.2	56.3	52.6	45.7	0.0	34.7	40.3	21.4	33.2	27.5	0.0	15	Amata Anangu School	Amata	872	Aboriginal/Anangu Schools	1
391	2020	St Leonards Primary School	Primary Education	Glenelg North	5045	87.3	89.6	88.7	87.0	87.6	86.2	87.5	87.5	95.0	0.0	0.0	0.0	0.0	0.0	0.0	432	St Leonards Primary School	Glenelg North	5045	Primary Education	6
962	2020	Riverland Special School	Special Education	Berri	5343	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	77.0	0.0	0.0	0.0	0.0	0.0	84.4	395	Riverland Special School	Berri	5343	Special Education	3
1183	2020	North Ingle School	Primary Education	Ingle Farm	5098	90.6	89.2	88.4	90.8	88.6	91.0	92.8	90.7	0.0	0.0	0.0	0.0	0.0	0.0	0.0	316	North Ingle School	Ingle Farm	5098	Primary Education	4
475	2020	Westbourne Park Primary School	Primary Education	Westbourne Park	5041	92.4	91.6	93.0	93.3	92.3	92.6	91.9	92.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	487	Westbourne Park Primary School	Westbourne Park	5041	Primary Education	7
447	2020	Unley Primary School	Primary Education	Unley	5061	90.7	89.0	90.7	90.0	92.0	89.2	90.3	87.8	0.0	0.0	0.0	0.0	0.0	0.0	0.0	464	Unley Primary School	Unley	5061	Primary Education	7
1696	2020	Kenmore Park Anangu School	Aboriginal/Anangu Schools	Kenmore Park	872	9.1	69.8	33.3	87.5	58.1	79.9	68.3	0.0	75.0	67.9	87.9	100.0	0.0	0.0	0.0	204	Kenmore Park Anangu School	Kenmore Park	872	Aboriginal/Anangu Schools	1

#Removing the unnecessary columns from the merged dataset

library(dplyr)
mydata7 <- select(mydata2, -ID, -School_Name.y,-Suburb.y ,-Post_Code,-Type_of_Schooling)

mydata7a <- sample_n(mydata7,5)

#Styling the Table


kable(mydata7a, booktabs=T) %>%
  kable_styling(latex_options = c("striped","scale down","hold_position"))

School_Number	Census_Year	School_Name.x	Type_of_School	Suburb.x	Postcode	Reception	Year_1	Year_2	Year_3	Year_4	Year_5	Year_6	Year_7	Index_of_Educational_Disadvantage
239	2020	Macclesfield Primary School	Primary Education	Macclesfield	5153	91.5	94.8	92.8	84.0	90.9	90.9	93.1	92.3	6
270	2020	Mitcham Primary School	Primary Education	Kingswood	5062	91.3	91.4	92.6	92.2	91.9	90.9	91.0	89.9	7
647	2020	Linden Park Primary School	Primary Education	Linden Park	5065	90.7	89.6	91.3	91.3	93.5	93.4	91.4	90.0	7
1481	2020	Stirling North Primary School	Primary Education	Stirling North	5710	90.9	91.5	91.1	92.2	89.5	90.5	88.8	89.4	3
340	2020	Ascot Park Primary School	Primary Education	Park Holme	5043	87.2	80.7	90.3	86.1	88.1	87.4	84.6	82.6	3

Understand

#Identifying the number of variables and observations.

# No of Observations in the Dataset

nrow(mydata7)

## [1] 505

# No of Variables in the Dataset

ncol(mydata7)

## [1] 22

#Summary of the Dataset

summary(mydata7)

##  School_Number     Census_Year   School_Name.x      Type_of_School    
##  Min.   : 104.0   Min.   :2020   Length:505         Length:505        
##  1st Qu.: 387.0   1st Qu.:2020   Class :character   Class :character  
##  Median : 754.0   Median :2020   Mode  :character   Mode  :character  
##  Mean   : 770.9   Mean   :2020                                        
##  3rd Qu.:1018.0   3rd Qu.:2020                                        
##  Max.   :1919.0   Max.   :2020                                        
##                                                                       
##    Suburb.x            Postcode      Reception          Year_1     
##  Length:505         Min.   : 872   Min.   :  0.00   Min.   : 0.00  
##  Class :character   1st Qu.:5086   1st Qu.: 79.30   1st Qu.:80.30  
##  Mode  :character   Median :5164   Median : 87.50   Median :87.30  
##                     Mean   :5177   Mean   : 71.71   Mean   :72.33  
##                     3rd Qu.:5351   3rd Qu.: 90.50   3rd Qu.:90.70  
##                     Max.   :5734   Max.   :100.00   Max.   :98.60  
##                                                                    
##      Year_2           Year_3          Year_4           Year_5     
##  Min.   :  0.00   Min.   : 0.00   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 82.05   1st Qu.:81.10   1st Qu.: 81.30   1st Qu.:80.45  
##  Median : 88.00   Median :88.40   Median : 88.20   Median :87.80  
##  Mean   : 72.75   Mean   :73.33   Mean   : 72.78   Mean   :72.50  
##  3rd Qu.: 91.03   3rd Qu.:91.10   3rd Qu.: 91.15   3rd Qu.:90.80  
##  Max.   :100.00   Max.   :98.30   Max.   :100.00   Max.   :98.30  
##  NA's   :1                        NA's   :2        NA's   :2      
##      Year_6          Year_7       Primary_Other        Year_8      
##  Min.   : 0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:80.45   1st Qu.: 78.45   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median :87.60   Median : 87.35   Median :  0.00   Median :  0.00  
##  Mean   :71.99   Mean   : 71.91   Mean   : 20.56   Mean   : 22.26  
##  3rd Qu.:90.60   3rd Qu.: 90.58   3rd Qu.: 33.85   3rd Qu.: 57.20  
##  Max.   :97.20   Max.   :100.00   Max.   :100.00   Max.   :100.00  
##  NA's   :2       NA's   :3        NA's   :3                        
##      Year_9          Year_10         Year_11          Year_12      
##  Min.   :  0.00   Min.   :  0.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.0   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median :  0.00   Median :  0.0   Median :  0.00   Median :  0.00  
##  Mean   : 21.73   Mean   : 21.4   Mean   : 20.48   Mean   : 21.19  
##  3rd Qu.: 53.40   3rd Qu.: 52.8   3rd Qu.: 22.80   3rd Qu.: 33.40  
##  Max.   :100.00   Max.   :100.0   Max.   :100.00   Max.   :100.00  
##                                                                    
##  Secondary_Other   Index_of_Educational_Disadvantage
##  Min.   :  0.000   Min.   :1.000                    
##  1st Qu.:  0.000   1st Qu.:3.000                    
##  Median :  0.000   Median :5.000                    
##  Mean   :  9.438   Mean   :4.327                    
##  3rd Qu.:  0.000   3rd Qu.:6.000                    
##  Max.   :100.000   Max.   :7.000                    
##

#Checking the types of Variables

str(mydata7)

## 'data.frame':    505 obs. of  22 variables:
##  $ School_Number                    : int  104 106 107 111 114 116 120 123 124 127 ...
##  $ Census_Year                      : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ School_Name.x                    : chr  "Coromandel Valley Primary School" "Cowandilla Primary School" "Crafers Primary School" "Crystal Brook Primary School" ...
##  $ Type_of_School                   : chr  "Primary Education" "Primary Education" "Primary Education" "Primary Education" ...
##  $ Suburb.x                         : chr  "Coromandel Valley" "Cowandilla" "Crafers" "Crystal Brook" ...
##  $ Postcode                         : int  5051 5033 5152 5523 5580 5253 5000 5153 5050 5039 ...
##  $ Reception                        : num  92.3 86.6 95 90.2 97.4 0 87.7 93.3 90.3 86.6 ...
##  $ Year_1                           : num  92.3 84.3 92.7 93.4 90.6 0 84.7 93.3 92.2 88 ...
##  $ Year_2                           : num  93.5 86.4 95.6 92.7 97.9 0 85.9 89.8 91 87.2 ...
##  $ Year_3                           : num  90.7 84.2 95.4 92.9 93 0 85.2 92.1 92.6 89.1 ...
##  $ Year_4                           : num  92.2 86.7 95.1 91.6 87.5 100 88.3 95.7 92.8 83.7 ...
##  $ Year_5                           : num  92.7 88.1 94.4 92 0 0 88.9 94.7 92 87.2 ...
##  $ Year_6                           : num  87.9 84.4 94.8 93.8 81.4 0 86.3 90.7 92.8 83.7 ...
##  $ Year_7                           : num  92.2 88.3 94.2 93.4 99 0 87.8 94.7 91.4 85.7 ...
##  $ Primary_Other                    : num  0 85.5 0 0 0 84.6 84.3 0 0 0 ...
##  $ Year_8                           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Year_9                           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Year_10                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Year_11                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Year_12                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Secondary_Other                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Index_of_Educational_Disadvantage: int  7 4 7 5 5 3 4 7 7 6 ...

#Converting a Character variable to a Factor Variable and Labeling

mydata8 <-mydata7
mydata8$Type_of_school <- factor (c(mydata7$Type_of_School),levels = c("Aboriginal/Anangu Schools","Primary Education","Primary/Secondary Combined","Secondary Education","Special Education"),labels=c("ABO","PRIM","PRIM/SEC","SECONDRY","SPECIAL"))
mydata8$Type_of_school

##   [1] PRIM     PRIM     PRIM     PRIM     PRIM     SPECIAL  PRIM     PRIM    
##   [9] PRIM     PRIM     PRIM/SEC PRIM     PRIM     PRIM     PRIM     PRIM    
##  [17] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
##  [25] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM/SEC
##  [33] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
##  [41] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     SPECIAL 
##  [49] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM/SEC
##  [57] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     ABO      PRIM    
##  [65] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
##  [73] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM/SEC PRIM    
##  [81] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
##  [89] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
##  [97] ABO      PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [105] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [113] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [121] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [129] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [137] PRIM     PRIM     PRIM     PRIM/SEC PRIM     PRIM     PRIM     PRIM    
## [145] PRIM     PRIM/SEC PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [153] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [161] PRIM     PRIM     PRIM     PRIM     SPECIAL  PRIM     PRIM     PRIM    
## [169] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [177] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [185] SPECIAL  PRIM     PRIM     PRIM     PRIM     SECONDRY PRIM     PRIM    
## [193] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [201] SECONDRY PRIM/SEC PRIM/SEC PRIM     PRIM     PRIM     PRIM     PRIM    
## [209] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [217] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [225] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM/SEC PRIM    
## [233] PRIM/SEC PRIM     PRIM/SEC PRIM/SEC PRIM/SEC PRIM/SEC PRIM/SEC PRIM/SEC
## [241] PRIM     PRIM/SEC PRIM     PRIM/SEC PRIM/SEC PRIM     PRIM/SEC PRIM/SEC
## [249] PRIM/SEC PRIM/SEC PRIM     PRIM/SEC PRIM/SEC PRIM/SEC PRIM/SEC PRIM/SEC
## [257] PRIM/SEC PRIM     PRIM/SEC PRIM/SEC PRIM     PRIM     PRIM/SEC PRIM/SEC
## [265] PRIM/SEC SECONDRY SECONDRY SECONDRY PRIM/SEC PRIM/SEC PRIM/SEC SECONDRY
## [273] SECONDRY PRIM/SEC PRIM/SEC SECONDRY SECONDRY PRIM/SEC SECONDRY SECONDRY
## [281] SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY
## [289] PRIM/SEC SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY
## [297] SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY SECONDRY PRIM    
## [305] SPECIAL  ABO      SECONDRY SECONDRY SECONDRY SECONDRY PRIM     PRIM    
## [313] PRIM     SECONDRY SECONDRY PRIM/SEC PRIM/SEC PRIM/SEC SECONDRY SECONDRY
## [321] PRIM     PRIM     PRIM/SEC PRIM     PRIM     PRIM     SECONDRY SECONDRY
## [329] SECONDRY SECONDRY PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [337] SECONDRY PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [345] PRIM/SEC SPECIAL  SECONDRY SECONDRY PRIM     PRIM     PRIM     PRIM    
## [353] PRIM     PRIM     SECONDRY SECONDRY PRIM     ABO      PRIM     PRIM    
## [361] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     SPECIAL  ABO     
## [369] ABO      ABO      PRIM     PRIM     ABO      SECONDRY SECONDRY SPECIAL 
## [377] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM/SEC
## [385] SECONDRY ABO      PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [393] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [401] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [409] PRIM     PRIM/SEC PRIM     PRIM     SECONDRY PRIM/SEC PRIM/SEC PRIM    
## [417] PRIM     PRIM     PRIM     PRIM     SECONDRY PRIM     PRIM     PRIM    
## [425] PRIM     PRIM     PRIM     PRIM     PRIM/SEC ABO      PRIM     PRIM    
## [433] SECONDRY PRIM     PRIM     SECONDRY ABO      PRIM     PRIM     PRIM    
## [441] PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM     PRIM    
## [449] SECONDRY SPECIAL  PRIM     PRIM     PRIM     SPECIAL  SPECIAL  SPECIAL 
## [457] SPECIAL  PRIM/SEC PRIM     PRIM/SEC PRIM/SEC PRIM     PRIM     PRIM    
## [465] PRIM/SEC ABO      ABO      PRIM     PRIM     PRIM     SPECIAL  PRIM    
## [473] PRIM     SECONDRY SECONDRY ABO      PRIM     PRIM     ABO      SECONDRY
## [481] SPECIAL  PRIM/SEC SECONDRY PRIM     PRIM     PRIM     PRIM     PRIM    
## [489] PRIM     PRIM     ABO      PRIM     PRIM     PRIM     PRIM/SEC PRIM/SEC
## [497] PRIM/SEC PRIM     SECONDRY PRIM     PRIM     PRIM     PRIM/SEC PRIM    
## [505] SECONDRY
## Levels: ABO PRIM PRIM/SEC SECONDRY SPECIAL

#Below Variables are included in the dataset

*Integer

*Factor with Labels

*Character

*Numerical

str(mydata8)

## 'data.frame':    505 obs. of  23 variables:
##  $ School_Number                    : int  104 106 107 111 114 116 120 123 124 127 ...
##  $ Census_Year                      : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ School_Name.x                    : chr  "Coromandel Valley Primary School" "Cowandilla Primary School" "Crafers Primary School" "Crystal Brook Primary School" ...
##  $ Type_of_School                   : chr  "Primary Education" "Primary Education" "Primary Education" "Primary Education" ...
##  $ Suburb.x                         : chr  "Coromandel Valley" "Cowandilla" "Crafers" "Crystal Brook" ...
##  $ Postcode                         : int  5051 5033 5152 5523 5580 5253 5000 5153 5050 5039 ...
##  $ Reception                        : num  92.3 86.6 95 90.2 97.4 0 87.7 93.3 90.3 86.6 ...
##  $ Year_1                           : num  92.3 84.3 92.7 93.4 90.6 0 84.7 93.3 92.2 88 ...
##  $ Year_2                           : num  93.5 86.4 95.6 92.7 97.9 0 85.9 89.8 91 87.2 ...
##  $ Year_3                           : num  90.7 84.2 95.4 92.9 93 0 85.2 92.1 92.6 89.1 ...
##  $ Year_4                           : num  92.2 86.7 95.1 91.6 87.5 100 88.3 95.7 92.8 83.7 ...
##  $ Year_5                           : num  92.7 88.1 94.4 92 0 0 88.9 94.7 92 87.2 ...
##  $ Year_6                           : num  87.9 84.4 94.8 93.8 81.4 0 86.3 90.7 92.8 83.7 ...
##  $ Year_7                           : num  92.2 88.3 94.2 93.4 99 0 87.8 94.7 91.4 85.7 ...
##  $ Primary_Other                    : num  0 85.5 0 0 0 84.6 84.3 0 0 0 ...
##  $ Year_8                           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Year_9                           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Year_10                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Year_11                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Year_12                          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Secondary_Other                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Index_of_Educational_Disadvantage: int  7 4 7 5 5 3 4 7 7 6 ...
##  $ Type_of_school                   : Factor w/ 5 levels "ABO","PRIM","PRIM/SEC",..: 2 2 2 2 2 5 2 2 2 2 ...

Tidy & Manipulate Data I

#Replacing the zeros

This data set consists of zeros in a messy format (0.0) and replacing them with a tidier formatting option “_”
Please note missing values are not dealt here

library(dplyr)
mydata9 <- mydata8

mydata9 [mydata9 == 0] <- "_"



mydata13 <-mydata9


mydata13a <- sample_n(mydata9,6)


knitr:: kable(mydata13a, booktabs=T) %>%
  kable_styling(latex_options = c("striped","scale down","hold_position"))

School_Number	Census_Year	School_Name.x	Type_of_School	Suburb.x	Postcode	Reception	Year_1	Year_2	Year_3	Year_4	Year_5	Year_6	Year_7	Primary_Other	Year_8	Year_9	Year_10	Year_11	Year_12	Secondary_Other	Index_of_Educational_Disadvantage	Type_of_school
323	2020	Coorara Primary School	Primary Education	Morphett Vale	5162	88.8	91.3	85.9	85.8	87.1	84.2	84.3	78	83.1	_	_	_	_	_	_	3	PRIM
449	2020	Uraidla Primary School	Primary Education	Uraidla	5142	93.2	89.8	89.9	93.1	92.4	94.7	92.7	92.9	_	_	_	_	_	_	_	7	PRIM
1060	2020	O’Sullivan Beach Primary School	Primary Education	O’Sullivan Beach	5166	75	85.1	87.7	83.8	74.2	87.5	84.5	71.4	_	_	_	_	_	_	_	2	PRIM
253	2020	Melrose Primary School	Primary Education	Melrose	5483	_	95.9	88.8	95.5	85.1	92.3	79.9	72.5	_	_	_	_	_	_	_	7	PRIM
1903	2020	Blair Athol North School B-7	Primary Education	Blair Athol	5084	80.1	76.1	83.1	80.9	82	84.4	81.1	84.9	87.6	_	_	_	_	_	_	2	PRIM
822	2020	Whyalla High School	Secondary Education	Whyalla	5600	_	_	_	_	_	_	_	_	_	81.6	78.5	75.3	_	_	_	2	SECONDRY

#Dropping/ Renaming and Ordering the Variables

library(dplyr)
mydata10 <- select(mydata9, -Census_Year, -Type_of_School)

mydata11 <- (rename(mydata10,School_No =School_Number,School=School_Name.x,Suburb=Suburb.x,Index_Edu_Disadvantage=Index_of_Educational_Disadvantage
))



col_order <- c("School_No","School","Type_of_school","Index_Edu_Disadvantage","Suburb","Postcode","Reception","Year_1","Year_2","Year_3","Year_4","Year_5","Year_6","Year_7","Primary_Other","Year_8","Year_9","Year_10","Year_11","Year_12","Secondary_Other")

mydata12 <- mydata11 %>% select(col_order)
mydata14 <- sample_n(mydata12,5)



#Styling the Table


kable(mydata14, booktabs=T) %>%
  kable_styling(latex_options = c("striped","scale down","hold_position"))

School_No	School	Type_of_school	Index_Edu_Disadvantage	Suburb	Postcode	Reception	Year_1	Year_2	Year_3	Year_4	Year_5	Year_6	Year_7	Primary_Other	Year_8	Year_9	Year_10	Year_11	Year_12	Secondary_Other
424	Tailem Bend Primary School	PRIM	3	Tailem Bend	5260	88.2	84.7	87.9	86.3	92	87.9	87.7	87.6	_	_	_	_	_	_	_
237	Lyndoch Primary School	PRIM	6	Lyndoch	5351	91.7	90.1	92.8	90.8	89.1	89.9	86.2	92.6	_	_	_	_	_	_	_
990	Augusta Park Primary School	PRIM	1	Port Augusta	5700	74.1	76	77.3	76.7	70.1	80.8	68	74.4	66.6	_	_	_	_	_	_
740	Orroroo Area School	PRIM/SEC	6	Orroroo	5431	91.9	98.3	94.7	95.4	89.7	95	89.3	95.4	_	91.4	85.5	91.5	86.5	94	_
928	Grant High School	SECONDRY	3	Mount Gambier	5290	_	_	_	_	_	_	_	_	_	92.2	87.9	89.4	84.5	88.9	87.6

#Rechecking the Variable Types after the clean up

*Some variables have been turned back to Character when replacing the zeros with “_“.

summary(mydata12)

##    School_No         School           Type_of_school Index_Edu_Disadvantage
##  Min.   : 104.0   Length:505         ABO     : 16    Min.   :1.000         
##  1st Qu.: 387.0   Class :character   PRIM    :350    1st Qu.:3.000         
##  Median : 754.0   Mode  :character   PRIM/SEC: 60    Median :5.000         
##  Mean   : 770.9                      SECONDRY: 64    Mean   :4.327         
##  3rd Qu.:1018.0                      SPECIAL : 15    3rd Qu.:6.000         
##  Max.   :1919.0                                      Max.   :7.000         
##     Suburb             Postcode     Reception            Year_1         
##  Length:505         Min.   : 872   Length:505         Length:505        
##  Class :character   1st Qu.:5086   Class :character   Class :character  
##  Mode  :character   Median :5164   Mode  :character   Mode  :character  
##                     Mean   :5177                                        
##                     3rd Qu.:5351                                        
##                     Max.   :5734                                        
##     Year_2             Year_3             Year_4             Year_5         
##  Length:505         Length:505         Length:505         Length:505        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Year_6             Year_7          Primary_Other         Year_8         
##  Length:505         Length:505         Length:505         Length:505        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     Year_9            Year_10            Year_11            Year_12         
##  Length:505         Length:505         Length:505         Length:505        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Secondary_Other   
##  Length:505        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

#Converting the Character Variables to Numeric

mydata15 <- mydata12

mydata15$Reception <- as.numeric (mydata12$Reception)

mydata15$Year_1 <- as.numeric (mydata12$Year_1)
mydata15$Year_2 <- as.numeric (mydata12$Year_2)
mydata15$Year_3 <- as.numeric (mydata12$Year_3)
mydata15$Year_4 <- as.numeric (mydata12$Year_4)
mydata15$Year_5 <- as.numeric (mydata12$Year_5)
mydata15$Year_6 <- as.numeric (mydata12$Year_6)
mydata15$Year_7 <- as.numeric (mydata12$Year_7)
mydata15$Primary_Other <- as.numeric (mydata14$Primary_Other)
mydata15$Year_8 <- as.numeric (mydata12$Year_8)
mydata15$Year_9 <- as.numeric (mydata12$Year_9)
mydata15$Year_10 <- as.numeric (mydata12$Year_10)
mydata15$Year_11 <- as.numeric (mydata12$Year_11)
mydata15$Year_12 <- as.numeric (mydata12$Year_12)
mydata15$Secondary_Other<- as.numeric (mydata12$Secondary_Other)

Tidy & Manipulate Data II

#Mutating Variables

Mutating below 3 variables

*Mean_Attendance_Rate

*Minimum_Attendance_Rate

*Maximum_Attendance_Rate

mydata16 <-mydata15

library(dplyr)

mydata17 <- mydata16 %>%
  
  
  mutate(Mean_Attendance_Rate=rowMeans(.[ , c("Reception","Year_1","Year_2","Year_3","Year_4","Year_5","Year_6","Year_7","Primary_Other","Year_8","Year_9","Year_10","Year_11","Year_12","Secondary_Other")], na.rm=TRUE))

mydata18 <- mydata17 %>%


  
  rowwise() %>% 
  

  
  mutate(Minimum_Attendance_Rate = min(Reception, Year_1,Year_2,Year_3,Year_4,Year_5,Year_6,Year_7,Primary_Other,Year_8,Year_9,Year_10,Year_11,Year_12,Secondary_Other, na.rm = T))  %>%

 
  
 mutate(Maximum_Attendance_Rate = max(Reception, Year_1,Year_2,Year_3,Year_4,Year_5,Year_6,Year_7,Primary_Other,Year_8,Year_9,Year_10,Year_11,Year_12,Secondary_Other, na.rm = T))

mydata19 <- head(mydata18,5)



#Styling the Table


kable(mydata19, booktabs=T) %>%
  kable_styling(latex_options = c("striped","scale_down","hold_position"))

School_No	School	Type_of_school	Index_Edu_Disadvantage	Suburb	Postcode	Reception	Year_1	Year_2	Year_3	Year_4	Year_5	Year_6	Year_7	Primary_Other	Year_8	Year_9	Year_10	Year_11	Year_12	Secondary_Other	Mean_Attendance_Rate	Minimum_Attendance_Rate	Maximum_Attendance_Rate
104	Coromandel Valley Primary School	PRIM	7	Coromandel Valley	5051	92.3	92.3	93.5	90.7	92.2	92.7	87.9	92.2	NA	NA	NA	NA	NA	NA	NA	91.72500	87.9	93.5
106	Cowandilla Primary School	PRIM	4	Cowandilla	5033	86.6	84.3	86.4	84.2	86.7	88.1	84.4	88.3	NA	NA	NA	NA	NA	NA	NA	86.12500	84.2	88.3
107	Crafers Primary School	PRIM	7	Crafers	5152	95.0	92.7	95.6	95.4	95.1	94.4	94.8	94.2	66.6	NA	NA	NA	NA	NA	NA	91.53333	66.6	95.6
111	Crystal Brook Primary School	PRIM	5	Crystal Brook	5523	90.2	93.4	92.7	92.9	91.6	92.0	93.8	93.4	NA	NA	NA	NA	NA	NA	NA	92.50000	90.2	93.8
114	Curramulka Primary School	PRIM	5	Curramulka	5580	97.4	90.6	97.9	93.0	87.5	NA	81.4	99.0	NA	NA	NA	NA	NA	NA	NA	92.40000	81.4	99.0

#Merging the datasets

*This additional step had to be done because mydata18 has got the actual missing values as well as the values should be missed as well. The reason behind this was when converting to the numeric variable, the character variable which was “_” has converted to NA as well and its mixing up with the actual NA values.

*Therefore I will be merging the dataset -mydata12 which should have the missing values “_” in this format and actual missing values in NA format with the dataset - mydata18 where I can get the mutated variables

*Dropping unwanted variables

*Renaming variables

mydata19 <- merge(mydata12,mydata18, by="School_No")

mydata20 <-select(mydata19,-School.y,-Type_of_school.y,-Index_Edu_Disadvantage.y,-Suburb.y,-Postcode.y,-Reception.y,-Year_1.y,-Year_2.y,-Year_3.y,-Year_4.y,-Year_5.y,-Year_6.y,-Year_7.y,-Primary_Other.y,-Year_8.y,-Year_9.y,-Year_10.y,-Year_11.y,-Year_12.y,-Secondary_Other.y)

mydata21 <- rename(mydata20,School=School.x,Type_of_school =Type_of_school.x,Index_Edu_Disadvantage=Index_Edu_Disadvantage.x,Suburb=Suburb.x,Postcode=Postcode.x,Reception=Reception.x,Year_1=Year_1.x,Year_2=Year_2.x,Year_3=Year_3.x,Year_4=Year_4.x,Year_5=Year_5.x,Year_6=Year_6.x,Year_7=Year_7.x,Primary_Other=Primary_Other.x,Year_8=Year_8.x,Year_9=Year_9.x,Year_10=Year_10.x,Year_11=Year_11.x,Year_12=Year_12.x,Secondary_Other=Secondary_Other.x)

mydata22<-head(mydata21,5)


#Styling the Table


kable(mydata22, booktabs=T) %>%
  kable_styling(latex_options = c("striped","scale_down","hold_position"))

School_No	School	Type_of_school	Index_Edu_Disadvantage	Suburb	Postcode	Reception	Year_1	Year_2	Year_3	Year_4	Year_5	Year_6	Year_7	Primary_Other	Year_8	Year_9	Year_10	Year_11	Year_12	Secondary_Other	Mean_Attendance_Rate	Minimum_Attendance_Rate	Maximum_Attendance_Rate
104	Coromandel Valley Primary School	PRIM	7	Coromandel Valley	5051	92.3	92.3	93.5	90.7	92.2	92.7	87.9	92.2	_	_	_	_	_	_	_	91.72500	87.9	93.5
106	Cowandilla Primary School	PRIM	4	Cowandilla	5033	86.6	84.3	86.4	84.2	86.7	88.1	84.4	88.3	85.5	_	_	_	_	_	_	86.12500	84.2	88.3
107	Crafers Primary School	PRIM	7	Crafers	5152	95	92.7	95.6	95.4	95.1	94.4	94.8	94.2	_	_	_	_	_	_	_	91.53333	66.6	95.6
111	Crystal Brook Primary School	PRIM	5	Crystal Brook	5523	90.2	93.4	92.7	92.9	91.6	92	93.8	93.4	_	_	_	_	_	_	_	92.50000	90.2	93.8
114	Curramulka Primary School	PRIM	5	Curramulka	5580	97.4	90.6	97.9	93	87.5	_	81.4	99	_	_	_	_	_	_	_	92.40000	81.4	99.0

Scan I

#Identifying the Missing Values

mydata23 <- mydata21
options(max.print = 999999999)

#is.na(mydata23)#

#Count of Missing Values

sum(is.na(mydata23))

## [1] 14

#Calculating the Mean of Missing Value

mean(is.na(mydata23))

## [1] 0.001155116

#Finding where the data is missing

which(is.na(mydata23))

##  [1]  4137  5113  5147  5592  5730  6079  6123  6612  6627  6654  7402  7405
## [13]  7406 10981

#Replacing the missing values with Mean

Final output shows that missing values with NA are replaced with Mean values and the missing values with “_” are not treated as they were meant to be missing values. Explained this in the data description area.

mydata24<- mydata23
for (i in 1:ncol(mydata23)){mydata24[ ,i][is.na(mydata24[ ,i])] <- mean(mydata24[ ,i],na.rm=TRUE)}
mydata25 <- head(mydata24,15)


#Styling the Table


kable(mydata25, booktabs=T) %>%
  kable_styling(latex_options = c("striped","scale_down","hold_position"))

School_No	School	Type_of_school	Index_Edu_Disadvantage	Suburb	Postcode	Reception	Year_1	Year_2	Year_3	Year_4	Year_5	Year_6	Year_7	Primary_Other	Year_8	Year_9	Year_10	Year_11	Year_12	Secondary_Other	Mean_Attendance_Rate	Minimum_Attendance_Rate	Maximum_Attendance_Rate
104	Coromandel Valley Primary School	PRIM	7	Coromandel Valley	5051	92.3	92.3	93.5	90.7	92.2	92.7	87.9	92.2	_	_	_	_	_	_	_	91.72500	87.9	93.5
106	Cowandilla Primary School	PRIM	4	Cowandilla	5033	86.6	84.3	86.4	84.2	86.7	88.1	84.4	88.3	85.5	_	_	_	_	_	_	86.12500	84.2	88.3
107	Crafers Primary School	PRIM	7	Crafers	5152	95	92.7	95.6	95.4	95.1	94.4	94.8	94.2	_	_	_	_	_	_	_	91.53333	66.6	95.6
111	Crystal Brook Primary School	PRIM	5	Crystal Brook	5523	90.2	93.4	92.7	92.9	91.6	92	93.8	93.4	_	_	_	_	_	_	_	92.50000	90.2	93.8
114	Curramulka Primary School	PRIM	5	Curramulka	5580	97.4	90.6	97.9	93	87.5	_	81.4	99	_	_	_	_	_	_	_	92.40000	81.4	99.0
116	Murray Bridge Special School	SPECIAL	3	Murray Bridge	5253	_	_	_	_	100	_	_	_	84.6	_	_	_	_	_	_	100.00000	100.0	100.0
120	Gilles Street Primary School	PRIM	4	Adelaide	5000	87.7	84.7	85.9	85.2	88.3	88.9	86.3	87.8	84.3	_	_	_	_	_	_	86.85000	84.7	88.9
123	Echunga Primary School	PRIM	7	Echunga	5153	93.3	93.3	89.8	92.1	95.7	94.7	90.7	94.7	_	_	_	_	_	_	_	90.10000	66.6	95.7
124	Eden Hills Primary School	PRIM	7	Eden Hills	5050	90.3	92.2	91	92.6	92.8	92	92.8	91.4	_	_	_	_	_	_	_	91.88750	90.3	92.8
127	Edwardstown Primary School	PRIM	6	Melrose Park	5039	86.6	88	87.2	89.1	83.7	87.2	83.7	85.7	_	_	_	_	_	_	_	86.40000	83.7	89.1
128	Elliston Area School	PRIM/SEC	5	Elliston	5670	81.6	80	93	87.3	91.9	96.6	92.1	89.4	_	93.7	95.1	97.9	_	_	_	90.78182	80.0	97.9
131	Norwood Primary School	PRIM	7	Norwood	5067	89.7	90.3	92.7	91.7	91.8	91.9	90.6	87.9	_	_	_	_	_	_	_	90.82500	87.9	92.7
135	Flinders Park Primary School	PRIM	4	Flinders Park	5025	86.8	90.8	90.4	87.5	86.9	90.6	90.5	92.1	89.8	_	_	_	_	_	_	86.91111	66.6	92.1
136	Lenswood Primary School	PRIM	7	Lenswood	5240	94.6	86.1	91.4	93.2	91.2	93.4	91.6	87.6	_	_	_	_	_	_	_	91.13750	86.1	94.6
138	Frances Primary School	PRIM	7	Frances	5262	96.1	94.5	95.4	91.5	88.1	87.6	95.4	89.9	_	_	_	_	_	_	_	92.31250	87.6	96.1

Scan II

#Detecting Outliers of numerical variables

Outliers are detected in the below variables

*Mean_Attendance_Rate

*Minimum_Attendance_Rate

*Maximum_Attendance_Rate

mydata26 <- select(mydata23,Mean_Attendance_Rate :Maximum_Attendance_Rate)


library(ggplot2)
par(mfrow = c(1, ncol(mydata26)))
invisible(lapply(1:ncol(mydata26), function(i) boxplot(mydata26[, i])))

#Finding the Outliers

*Finding the Outliers of the 3 variables seperatly.

Outliers1 <- boxplot(mydata26$Mean_Attendance_Rate,plot =FALSE)$out
Outliers1

##  [1] 100.00000  68.03636  68.86000  74.30000  60.00000  62.15000  74.48750
##  [8]  74.94615  72.78333  70.85000  64.12143  74.61250  68.12727  73.77778
## [15]  46.84286  42.33846  57.25000  68.68750  72.78333  47.11538  65.41250
## [22]  54.01538  53.64000  52.43846  52.88000  74.96250  74.04444  58.33750
## [29]  59.52308  58.60000  66.18000  53.43846  56.35000

Outliers2 <- boxplot(mydata26$Minimum_Attendance_Rate,plot =FALSE)$out
Outliers2

##  [1] 26.5 33.3 12.5 11.9 21.4 22.2  Inf 22.8 32.6 21.4 28.4 33.5  9.1 12.1

Outliers3 <- boxplot(mydata26$Maximum_Attendance_Rate,plot =FALSE)$out
Outliers3

##  [1] 74.3 71.7 79.0 78.6 76.2 80.5 79.1 80.8 80.8 72.8 56.3 78.5 -Inf 75.2 79.3
## [16] 63.1 72.6 61.6 81.0 79.3 72.4 71.7 75.3 66.9

#Summary of Outliers

summary(Outliers1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   42.34   56.35   65.41   64.30   72.78  100.00

summary(Outliers2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.10   14.72   22.50     Inf   31.55     Inf

summary(Outliers3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -Inf   71.70   75.25    -Inf   79.15   81.00

#Remove Outliers - Mean_Attendance_Rate Variable

*Replacing the Outliers1 with the Median for Mean_Attendance_Rate Variable

*When comparing the Before & After boxplot charts, its clear that outliers have been controlled

mydata27 <- mydata26

mydata27$Mean_Attendance_Rate[mydata27$Mean_Attendance_Rate %in% boxplot(mydata27$Mean_Attendance_Rate)$out]<-median(mydata27$Mean_Attendance_Rate)

boxplot(mydata27$Mean_Attendance_Rate)

#Remove Outliers - Minimum_Attendance_Rate Variable

*Replacing the Outliers2 with the Median for Minimum_Attendance_Rate Variable

*When comparing the Before & After boxplot charts, its clear that outliers have been controlled

mydata28 <- mydata27

mydata28$Minimum_Attendance_Rate[mydata28$Minimum_Attendance_Rate %in% boxplot(mydata28$Minimum_Attendance_Rate)$out]<-median(mydata28$Minimum_Attendance_Rate)

boxplot(mydata28$Minimum_Attendance_Rate)

#Remove Outliers - Maximum_Attendance_Rate Variable

*Replacing the Outliers3 with the Median for Minimum_Attendance_Rate Variable

*When comparing the Before & After boxplot charts, its clear that outliers have been controlled

mydata29 <- mydata28

mydata29$Maximum_Attendance_Rate[mydata29$Maximum_Attendance_Rate %in% boxplot(mydata29$Maximum_Attendance_Rate)$out]<-median(mydata29$Maximum_Attendance_Rate)

boxplot(mydata29$Maximum_Attendance_Rate)

Transform

#Histogram for Mean_Attendance_Rate Variable

*Most appropriate Transformation is the Right skewed transformation which converts likely into a normal distribution `

#Applying the sqrt Transformation

sqrt_att <- sqrt (mydata29$Mean_Attendance_Rate)



hist(sqrt_att, 
     main = "Histogram of the Square Root of Mean Attendance Rate", 
     xlab = "Square Root of Mean Attendance Rate")

right_skew <- mydata29
right_skew <- right_skew$Mean_Attendance_Rate

hist(right_skew, 
     main = "Histogram of right-skewed dataset", 
     xlab = "Right-skewed Mean Attendance Rate")

right_skew_recip <- 1 / right_skew

hist(right_skew_recip)

#Histogram for Minimum_Attendance_Rate Variable

*Most appropriate Transformation is the Left skewed transformation which converts likely into a normal distribution. `

#Applying the sqrt Transformation

sqrt_min <- sqrt (mydata29$Minimum_Attendance_Rate)



hist(sqrt_min, 
     main = "Histogram of the Square Root of Minimum Attendance Rate", 
     xlab = "Square Root of Minimum Attendance Rate")

left_skewed <- mydata29
left_skewed <- left_skewed[,-1]

hist(left_skewed$Minimum_Attendance_Rate,
     
     main= "Histogram of left-skewed dataset",
     xlab="Left-skewed Minimum Attendance Rate")

right_skew <- mydata29
right_skew <- right_skew$Minimum_Attendance_Rate

hist(right_skew, 
     main = "Histogram of right-skewed dataset", 
     xlab = "Right-skewed Minimum Attendance Rate")

right_skew_recip <- 1 / right_skew

hist(right_skew_recip)

#Histogram for Maximum_Attendance_Rate Variable

*Most appropriate Transformation is the left skewed transformation which converts likely into a normal distribution

#Applying the sqrt Transformation

sqrt_max <- sqrt (mydata29$Maximum_Attendance_Rate)



hist(sqrt_max, 
     main = "Histogram of the Square Root of Maximum Attendance Rate", 
     xlab = "Square Root of Maximum Attendance Rate")

left_skewed <- mydata29
left_skewed <- left_skewed[,-1]

hist(left_skewed$Maximum_Attendance_Rate,
     
     main= "Histogram of left-skewed dataset",
     xlab="Left-skewed Maximum Attendance Rate")

right_skew <- mydata29
right_skew <- right_skew$Maximum_Attendance_Rate

hist(right_skew, 
     main = "Histogram of right-skewed dataset", 
     xlab = "Right-skewed Maximum Attendance Rate")

right_skew_recip <- 1 / right_skew

hist(right_skew_recip)

References

*Government of South Australia(2020)Attendance rate by school by year level 2020,Data SA,accessed 12 April 2022,https://data.sa.gov.au/data/dataset/attendance-percentage-by-school-by-year-level

*Government of South Australia(2020)Index of Educational Disadvantage by school,Data SA,accessed 15 April 2022,https://data.sa.gov.au/data/dataset/index-of-disadvantage-by-school

*Zhu, Hao. 2021. kableExtra: Construct Complex Table with Kable and Pipe Syntax.

Reflective journal

The two datsets I found are very interesting and informative datasets where I could gather more insights and provide a very good analysis.
The attendance rate dataset was not a very tidy dataset and it looked messy but my initial plan was to merge them and see how it looks like and see if I can proceed further.
Merging the datsets was not a difficult task as they both had a common variable which is the school number.
After merging the datsets i wanted to clean and tidy the dataset.

There were two sets of missing values in my dataset.

#Missing values with ‘zeros’ and they should be missing values because primary schools have attendances till grade 7 and grades above should be 0.

#Missing values with NA as they should have got attendance rate but missing for some reason.

My intention was to replace the zeros with the “_” sign, so that it will look lot more tidier in the dataset.
There was no issue converting the zero’s to “_” but I didn’t realize the damage it has done to the dataset as all the attendance rate variables have got converted to Character variables and I couldn’t mutate variables as its giving an error message saying the variable has to be numeric.
Therefore I had to convert the Character variable to Numeric variable in order to mutate variables.
When I went to scan the variables, the actual missing value count was incorrect because all the missing values have converted to NA and could not find the actual missing values.
To overcome this situation I had to merge two datsets where one after converting the zeros to “_” and the dataset i subset with the mutating variables. So that I will have the actual missing values where I can scan the data and apply a solution correctly.
One thing I have learnt is you cannot go as you planned as there will be hiccups/shows stoppers/ turning points and the moment you think of giving up you’ll somehow find a workaround because you are doing this subject with real passion. Therefore I would say every hiccup/shows stopper/ turning point was a real learning curve for me as the workarounds and the knowledge you try to gain to fix the issue will be rarely forgotten.

Presentation link

https://www.loom.com/share/7eb012f06c2749aabe45e5cce3df05f1?sharedAppSource=personal_library

Data Wrangling - Dataset challenge

Attendance Rate Vs Index of Educational Disadvantage for SA Government School - Semester 1 - Year 2000

Report by Ashanthi Panaluwa

Data Description

Understand

Tidy & Manipulate Data I

Tidy & Manipulate Data II

Scan I

Scan II

Transform

References

Reflective journal

Presentation link