1 . Prepare Assignment 2 report using this R Markdown template. Feel free to DELETE the instructional text provided in the template.
Once you finalise your report, run all R chunks and Preview your notebook in HTML (by clicking Preview). Make sure your code and outputs are visible.
2 . Upload the report as a PDF file via the File Upload tab under the Assignment 2 page in CANVAS (see instructions file for details) after you attach the file click Submit assignment.
The easiest way to produce a PDF file from the RMarkdown is to Run all R chunks, then Preview your notebook in HTML (by clicking Preview) → Open in Browser (Chrome) → Right click on the report in Chrome → Click Print and Select the Destination Option to Save as PDF.
3 . Publish the report to RPubs (see here) and enter your report’s RPubs URL into the Website URL tab under Assignment 2 RPubs Link Submission page in Canvas (see instructions file for details) and submit this too. This online version of the report will be used for marking. Failure to submit your link will delay your feedback and risk late penalties.
If you have any questions regarding the assignment instructions and the R Markdown template, please post it on discussion board.
Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.
# This is the R chunk for the required packages
library(tidyr)
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(editrules)
Loading required package: igraph
Attaching package: ‘igraph’
The following objects are masked from ‘package:dplyr’:
as_data_frame, groups, union
The following object is masked from ‘package:tidyr’:
crossing
The following objects are masked from ‘package:stats’:
decompose, spectrum
The following object is masked from ‘package:base’:
union
Attaching package: ‘editrules’
The following objects are masked from ‘package:igraph’:
blocks, normalize
The following object is masked from ‘package:dplyr’:
contains
The following objects are masked from ‘package:tidyr’:
contains, separate
library(deducorrect)
library(outliers)
In your own words, provide a brief summary of the preprocessing. Explain the steps that you have taken to preprocess your data. Write this section last after you have performed all data preprocessing. (Word count Max: 300 words)
A clear description of data sets, their sources, and variable descriptions should be provided. In this section, you must also provide the R codes with outputs (head of data sets) that you used to import/read/scrape the data set. You need to fulfil the minimum requirement #1 and merge at least two data sets to create the one you are going to work on. In addition to the R codes and outputs, you need to explain the steps that you have taken.
# This is the R chunk for the Data Section
## Executive Summary
#This report provides the preprocessing of 2 datasets, which contains methods to merge datasets, convert variable type, label factor variable, reshape data into a tidy format, mutate new variable, scan outliers for numeric variables, transform data, scan errors (missing values, special values, obvious errors, and inconsistency) and use the technique to deal with errors (missing values, special values, obvious errors, and inconsistency) by only using packages of tidyr, dplyr, editrules, deducorrect, and outliers. This report aims to provide a sample of how to preprocess datasets, the source of datasets is all come from the government of VIC https://www.vic.gov.au/, they are all open and have a Creative Commons Licence.
## Data
#Import dataset from website, this dataset is from the government of victoria and it's source website is: https://discover.data.vic.gov.au/dataset/all-schools-fte-enrolments-feb-2015. This dataset contains information collected from the February 2014 school census of Victorian schools until February 2015. This dataset contains 54 variables but we are going to subset to 9 variables otherwise it will be too large and when we merge variables together or merge two datasets together, R will reports Error: vector memory exhausted (limit reached?). After we subset the data set, there will be 9 variables left:
#"Education_Sector":It refers to educational sector.
#"Entity_Type":It refers to type of school district.
#"School_No":It refers to the code that represent the school.
#"School_Name":It refers to the name of school.
#"School_Type":It refers to the type of school.
#"School_Status":It refers to the status of school.
#"Prep.Males.Total..":It refers to the total number of preparatory male student.
#"Prep.Females.Total..":It refers to the total number of preparatory female student.
#"Prep.Total..":It refers to the total number of preparatory student.
#Import dataset from website.
url <- "http://www.education.vic.gov.au/Documents/about/research/datavic/dv171-Feb2015FTEenrolments.csv"
All_Schools_FTE_enrolments_Feb_2015 <- read.csv(url, stringsAsFactors = FALSE)
#We need to subset data because this dataset is too big, if we don't subset dataset, then when we merge variables together or merge two datasets together, R will reports Error: vector memory exhausted (limit reached?).
All_Schools_FTE_enrolments_Feb_2015 <- subset(All_Schools_FTE_enrolments_Feb_2015, select = Education_Sector:Prep.Total..)
head(All_Schools_FTE_enrolments_Feb_2015)
#Import dataset from another website, this dataset is from the government of victoria and it's source website is: https://discover.data.vic.gov.au/dataset/all-schools-fte-enrolments-feb-2016. This dataset contains information collected from the February 2015 school census of Victorian schools until February 2016. This dataset also contains 54 variables but we are going to subset to 9 variables due to the problem of large dataset. And these 9 variables are the same as the previous dataset, it means that they have the same variable and also those variables have the same meaning.
#Import dataset from website.
url <- "http://www.education.vic.gov.au/Documents/about/research/datavic/dv214-allschoolsfetenrolfeb2016.csv"
All_Schools_FTE_enrolments_Feb_2016 <- read.csv(url, stringsAsFactors = FALSE)
#We use anti_join() function to obtain the different data from student enrolments in 2015 (New data), because we woudl like merge additional data from student enrolments in 2016 with student enrolments in 2016 in next step.
New_All_Schools_FTE_enrolments_Feb_2016 <- anti_join(All_Schools_FTE_enrolments_Feb_2016, All_Schools_FTE_enrolments_Feb_2015, by = "School_Name")
#The same situation in this one, we have to minimize data otherwise, R will generate error when merge 2 dataset because they are too large.
New_All_Schools_FTE_enrolments_Feb_2016 <- subset(New_All_Schools_FTE_enrolments_Feb_2016, select = Education_Sector:Prep.Total..)
head(New_All_Schools_FTE_enrolments_Feb_2016)
#We use bind_rows() function to merge two datasets because they have the same variables.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students <- bind_rows(All_Schools_FTE_enrolments_Feb_2015, New_All_Schools_FTE_enrolments_Feb_2016)
head(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students)
Explain why your data (or one of the data sets) doesn’t conform the tidy data principles (minimum requirement #5). Apply the required steps to reshape the data into a tidy format. In addition to the R codes and outputs, explain everything that you do in this step.
# This is the R chunk for the Tidy & Manipulate Data I
#We put this part (Tidy & Manipulate Data I) prior to the (Understand) part is because there will be some variables created after we tidy this dataset and it is better to convert variable type, rename variable, rearrange variables after we set down the dataset.
#The problem of this dataset is that the column names are not names of variables, but values of a variable. In this case, the column names "Prep.Males.Total..", and "Prep.Females.Total.." represent values of the gender variable, and each row represents two observations, not one.
head(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students)
#Therefore, we use gather() function to merge two variables into a factor variable and also create a new variable which is called "Number_of_students".
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students %>%
gather(`Prep.Males.Total..`, `Prep.Females.Total..`, key = "Gender", value = "Number_of_students")
head(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)
NA
NA
Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.
# This is the R chunk for the Understand Section
#Rearrange and rename and rearrange variables
#We first check the order of variables.
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)
'data.frame': 4502 obs. of 9 variables:
$ Education_Sector : chr "Catholic" "Catholic" "Catholic" "Catholic" ...
$ Entity_Type : int 2 2 2 2 2 2 2 2 2 2 ...
$ School_No : int 20 25 26 28 29 30 33 35 60 77 ...
$ School_Name : chr "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
$ School_Type : chr "Secondary" "Secondary" "Secondary" "Secondary" ...
$ School_Status : chr "O" "O" "O" "O" ...
$ Prep.Total.. : num 0 0 0 0 34 44 0 0 38 39 ...
$ Gender : chr "Prep.Males.Total.." "Prep.Males.Total.." "Prep.Males.Total.." "Prep.Males.Total.." ...
$ Number_of_students: num 0 0 0 0 15 19 0 0 22 15 ...
#We use select() function to rearrange the order of variables.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 %>% select(School_Name, School_No, School_Type, School_Status, Education_Sector, Entity_Type, Gender, Number_of_students, Prep.Total..)
#Given that we gather two variables together to build up a factor variable "Gender". We should rename the levels of "Gender".
#Then, use levels() function to check the level name of this variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Gender)
NULL
#We use label() function to re-label the name of levels in a factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Gender <- factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Gender,levels = c("Prep.Females.Total..","Prep.Males.Total.."),
labels = c("Female","Male"))
#We also notice that the variable name of "Prep.Total.." doesn't look good (because there are some ".." in the name) and "Number_of_students" should actually be "Number_of_pre.student".
#We use rename() function to rename variables.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 %>%
rename(
Total_number_of_pre.student_for_school = Prep.Total..
, Number_of_pre.student = Number_of_students
)
#Check the variables and order again, this also prove that this dataset contains multiple data types (numerics, characters, factors).
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)
'data.frame': 4502 obs. of 9 variables:
$ School_Name : chr "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
$ School_No : int 20 25 26 28 29 30 33 35 60 77 ...
$ School_Type : chr "Secondary" "Secondary" "Secondary" "Secondary" ...
$ School_Status : chr "O" "O" "O" "O" ...
$ Education_Sector : chr "Catholic" "Catholic" "Catholic" "Catholic" ...
$ Entity_Type : int 2 2 2 2 2 2 2 2 2 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ Number_of_pre.student : num 0 0 0 0 15 19 0 0 22 15 ...
$ Total_number_of_pre.student_for_school: num 0 0 0 0 34 44 0 0 38 39 ...
#Convert variable type
#Notice that "Education_Sector" has some values are the same, so try to covert it into factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Education_Sector <- as.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Education_Sector)
is.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Education_Sector)
[1] TRUE
#Check the levels of this variable, and the result is that there are only 3 levels. This confirms my guess and this variable should actually be a factor variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Education_Sector)
[1] "Catholic" "Government" "Independent"
#Entity_Type is obviously a factor variable but R indicate that as int. So, we need to convert it into factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Entity_Type <- as.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Entity_Type)
is.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Entity_Type)
[1] TRUE
#To ensure that this variable is a factor variable, we check the levels for this variable and the result is that there are only 2 levels. This confirms my guess and this variable should actually be a factor variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Entity_Type)
[1] "1" "2"
#School_Type is obviously a factor variable but R indicate that as chr. So, we need to convert it into factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Type <- as.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Type)
is.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Type)
[1] TRUE
#To ensure that this variable is a factor variable, we check the levels for this variable and the result is that there are only 5 levels. This confirms my guess and this variable should actually be a factor variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Type)
[1] "Language" "Pri/Sec" "Primary" "Secondary" "Special"
#Notice that "School_Status" has some values are the same, so try to covert it into factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Status <- as.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Status)
is.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Status)
[1] TRUE
#Check the levels of this variable, and the result is that there are only 2 levels. This confirms my guess and this variable should actually be a factor variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Status)
[1] "O" "U"
#This variable has to be integer because it can't be half student.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Number_of_pre.student <- as.integer(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Number_of_pre.student)
is.integer(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Number_of_pre.student)
[1] TRUE
#This variable has to be integer because it can't be half student.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Total_number_of_pre.student_for_school <- as.integer(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Total_number_of_pre.student_for_school)
is.integer(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Total_number_of_pre.student_for_school)
[1] TRUE
#Check the varibale type again to make sure everything is good.
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)
'data.frame': 4502 obs. of 9 variables:
$ School_Name : chr "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
$ School_No : int 20 25 26 28 29 30 33 35 60 77 ...
$ School_Type : Factor w/ 5 levels "Language","Pri/Sec",..: 4 4 4 4 3 3 4 4 3 3 ...
$ School_Status : Factor w/ 2 levels "O","U": 1 1 1 1 1 1 1 1 1 1 ...
$ Education_Sector : Factor w/ 3 levels "Catholic","Government",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Entity_Type : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ Number_of_pre.student : int 0 0 0 0 15 19 0 0 22 15 ...
$ Total_number_of_pre.student_for_school: int 0 0 0 0 34 44 0 0 38 39 ...
Create/mutate at least one variable from the existing variables (minimum requirement #6). In addition to the R codes and outputs, explain everything that you do in this step.
# This is the R chunk for the Tidy & Manipulate Data II
#Mutate new variable
#Use mutate() function to create a new variable. This new variable is the sex proportion in terms of male and female for each school. And it's formula is: Sex_proportion_for_the_pre.student_in_school = Number_of_pre.student / Total_number_of_pre.student_for_school.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 <- mutate(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2,
Sex_proportion_for_the_pre.student_in_school = Number_of_pre.student / Total_number_of_pre.student_for_school
)
#Check if we create "Sex_proportion_for_the_pre.student_in_school" successfully.
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)
'data.frame': 4502 obs. of 10 variables:
$ School_Name : chr "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
$ School_No : int 20 25 26 28 29 30 33 35 60 77 ...
$ School_Type : Factor w/ 5 levels "Language","Pri/Sec",..: 4 4 4 4 3 3 4 4 3 3 ...
$ School_Status : Factor w/ 2 levels "O","U": 1 1 1 1 1 1 1 1 1 1 ...
$ Education_Sector : Factor w/ 3 levels "Catholic","Government",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Entity_Type : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ Number_of_pre.student : int 0 0 0 0 15 19 0 0 22 15 ...
$ Total_number_of_pre.student_for_school : int 0 0 0 0 34 44 0 0 38 39 ...
$ Sex_proportion_for_the_pre.student_in_school: num NaN NaN NaN NaN 0.441 ...
Scan the data for missing values, special values and obvious errors (i.e. inconsistencies). In this step, you should fulfil the minimum requirement #7. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
# This is the R chunk for the Scan I
#Sometimes dataset can have obious errors. For example, sex proportion cannot be negative or beyond 1. For checking these inconsistency, we would like to apply rules to check whether this dataset violates the rule by using editfile() function. And also, we could also apply rules to correct errors by using correctionRules() function. When the number of rules increase, it is better to manage the rules separate from the data. The function of editrules() can check which rules are obeyed or not and allows one to find the minimal set of variables to adapt so that all rules can be obeyed. In addition,correctionRules() function force variables to follow it's constraints so that, it is very helpful when correcting data. In this part, we are going to using editrules() and correctionRules() to deal with inconsistency.
#The first thing is to check the consistency of "Total_number_of_pre.student_for_school" = "Number_of_pre.student" for male + "Number_of_pre.student" for female.
#It is more convenient to check the consistency of "Total_number_of_pre.student_for_school" = "Number_of_pre.student" for male + "Number_of_pre.student" for female by spreading the "Gender" out.
#We find out that there is a variable prevent us to spread "Gender" out in "All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2", which is called "Sex_proportion_for_the_pre.student_in_school".
#That is because "Sex_proportion_for_the_pre.student_in_school" = "Number_of_pre.student"/"Total_number_of_pre.student_for_school". And when we spread "Gender" out, it will produce many NaN values.
#And this is why we should exclude "Sex_proportion_for_the_pre.student_in_school" first and then spread "Gender" out.
Data_for_checking_inconsistency1 = subset(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2, select = -Sex_proportion_for_the_pre.student_in_school )
#And then we use spread() function to spread it out.
Data_for_checking_inconsistency2 <- spread(Data_for_checking_inconsistency1, key = Gender, value = Number_of_pre.student)
#Import rules into R
Rules1 <- editfile("/Users/macbook/Documents/RMIT-Master of Analytics/Data wrangling/editrules2.txt", type = "all")
Rules1
Edit set:
num1 : Female + Male == Total_number_of_pre.student_for_school
#Check which value violates the rules, but the weakness is that the data size is too large and R automatically omit some rows.
violatedEdits(Rules1, Data_for_checking_inconsistency2)
edit
record num1
1 FALSE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
6 FALSE
7 FALSE
8 FALSE
9 FALSE
10 FALSE
11 FALSE
12 FALSE
13 FALSE
14 FALSE
15 FALSE
16 FALSE
17 FALSE
18 FALSE
19 FALSE
20 FALSE
21 FALSE
22 FALSE
23 FALSE
24 FALSE
25 FALSE
26 FALSE
27 FALSE
28 FALSE
29 FALSE
30 FALSE
31 FALSE
32 FALSE
33 FALSE
34 FALSE
35 FALSE
36 FALSE
37 FALSE
38 FALSE
39 FALSE
40 FALSE
41 FALSE
42 FALSE
43 FALSE
44 FALSE
45 FALSE
46 FALSE
47 FALSE
48 FALSE
49 FALSE
50 FALSE
51 FALSE
52 FALSE
53 FALSE
54 FALSE
55 FALSE
56 FALSE
57 FALSE
58 FALSE
59 FALSE
60 FALSE
61 FALSE
62 FALSE
63 FALSE
64 FALSE
65 FALSE
66 FALSE
67 FALSE
68 FALSE
69 FALSE
70 FALSE
71 FALSE
72 FALSE
73 FALSE
74 FALSE
75 FALSE
76 FALSE
77 FALSE
78 FALSE
79 FALSE
80 FALSE
81 FALSE
82 FALSE
83 FALSE
84 FALSE
85 FALSE
86 FALSE
87 FALSE
88 FALSE
89 FALSE
90 FALSE
91 FALSE
92 FALSE
93 FALSE
94 FALSE
95 FALSE
96 FALSE
97 FALSE
98 FALSE
99 FALSE
100 FALSE
101 FALSE
102 FALSE
103 FALSE
104 FALSE
105 FALSE
106 FALSE
107 FALSE
108 FALSE
109 FALSE
110 FALSE
111 FALSE
112 FALSE
113 FALSE
114 FALSE
115 FALSE
116 FALSE
117 FALSE
118 FALSE
119 FALSE
120 FALSE
121 FALSE
122 FALSE
123 FALSE
124 FALSE
125 FALSE
126 FALSE
127 FALSE
128 FALSE
129 FALSE
130 FALSE
131 FALSE
132 FALSE
133 FALSE
134 FALSE
135 FALSE
136 FALSE
137 FALSE
138 FALSE
139 FALSE
140 FALSE
141 FALSE
142 FALSE
143 FALSE
144 FALSE
145 FALSE
146 FALSE
147 FALSE
148 FALSE
149 FALSE
150 FALSE
151 FALSE
152 FALSE
153 FALSE
154 FALSE
155 FALSE
156 FALSE
157 FALSE
158 FALSE
159 FALSE
160 FALSE
161 FALSE
162 FALSE
163 FALSE
164 FALSE
165 FALSE
166 FALSE
167 FALSE
168 FALSE
169 FALSE
170 FALSE
171 FALSE
172 FALSE
173 FALSE
174 FALSE
175 FALSE
176 FALSE
177 FALSE
178 FALSE
179 FALSE
180 FALSE
181 FALSE
182 FALSE
183 FALSE
184 FALSE
185 FALSE
186 FALSE
187 FALSE
188 FALSE
189 FALSE
190 FALSE
191 FALSE
192 FALSE
193 FALSE
194 FALSE
195 FALSE
196 FALSE
197 FALSE
198 FALSE
199 FALSE
200 FALSE
201 FALSE
202 FALSE
203 FALSE
204 FALSE
205 FALSE
206 FALSE
207 FALSE
208 FALSE
209 FALSE
210 FALSE
211 FALSE
212 FALSE
213 FALSE
214 FALSE
215 FALSE
216 FALSE
217 FALSE
218 FALSE
219 FALSE
220 FALSE
221 FALSE
222 FALSE
223 FALSE
224 FALSE
225 FALSE
226 FALSE
227 FALSE
228 FALSE
229 FALSE
230 FALSE
231 FALSE
232 FALSE
233 FALSE
234 FALSE
235 FALSE
236 FALSE
237 FALSE
238 FALSE
239 FALSE
240 FALSE
241 FALSE
242 FALSE
243 FALSE
244 FALSE
245 FALSE
246 FALSE
247 FALSE
248 FALSE
249 FALSE
250 FALSE
251 FALSE
252 FALSE
253 FALSE
254 FALSE
255 FALSE
256 FALSE
257 FALSE
258 FALSE
259 FALSE
260 FALSE
261 FALSE
262 FALSE
263 FALSE
264 FALSE
265 FALSE
266 FALSE
267 FALSE
268 FALSE
269 FALSE
270 FALSE
271 FALSE
272 FALSE
273 FALSE
274 FALSE
275 FALSE
276 FALSE
277 FALSE
278 FALSE
279 FALSE
280 FALSE
281 FALSE
282 FALSE
283 FALSE
284 FALSE
285 FALSE
286 FALSE
287 FALSE
288 FALSE
289 FALSE
290 FALSE
291 FALSE
292 FALSE
293 FALSE
294 FALSE
295 FALSE
296 FALSE
297 FALSE
298 FALSE
299 FALSE
300 FALSE
301 FALSE
302 FALSE
303 FALSE
304 FALSE
305 FALSE
306 FALSE
307 FALSE
308 FALSE
309 FALSE
310 FALSE
311 FALSE
312 FALSE
313 FALSE
314 FALSE
315 FALSE
316 FALSE
317 FALSE
318 FALSE
319 FALSE
320 FALSE
321 FALSE
322 FALSE
323 FALSE
324 FALSE
325 FALSE
326 FALSE
327 FALSE
328 FALSE
329 FALSE
330 FALSE
331 FALSE
332 FALSE
333 FALSE
334 FALSE
335 FALSE
336 FALSE
337 FALSE
338 FALSE
339 FALSE
340 FALSE
341 FALSE
342 FALSE
343 FALSE
344 FALSE
345 FALSE
346 FALSE
347 FALSE
348 FALSE
349 FALSE
350 FALSE
351 FALSE
352 FALSE
353 FALSE
354 FALSE
355 FALSE
356 FALSE
357 FALSE
358 FALSE
359 FALSE
360 FALSE
361 FALSE
362 FALSE
363 FALSE
364 FALSE
365 FALSE
366 FALSE
367 FALSE
368 FALSE
369 FALSE
370 FALSE
371 FALSE
372 FALSE
373 FALSE
374 FALSE
375 FALSE
376 FALSE
377 FALSE
378 FALSE
379 FALSE
380 FALSE
381 FALSE
382 FALSE
383 FALSE
384 FALSE
385 FALSE
386 FALSE
387 FALSE
388 FALSE
389 FALSE
390 FALSE
391 FALSE
392 FALSE
393 FALSE
394 FALSE
395 FALSE
396 FALSE
397 FALSE
398 FALSE
399 FALSE
400 FALSE
401 FALSE
402 FALSE
403 FALSE
404 FALSE
405 FALSE
406 FALSE
407 FALSE
408 FALSE
409 FALSE
410 FALSE
411 FALSE
412 FALSE
413 FALSE
414 FALSE
415 FALSE
416 FALSE
417 FALSE
418 FALSE
419 FALSE
420 FALSE
421 FALSE
422 FALSE
423 FALSE
424 FALSE
425 FALSE
426 FALSE
427 FALSE
428 FALSE
429 FALSE
430 FALSE
431 FALSE
432 FALSE
433 FALSE
434 FALSE
435 FALSE
436 FALSE
437 FALSE
438 FALSE
439 FALSE
440 FALSE
441 FALSE
442 FALSE
443 FALSE
444 FALSE
445 FALSE
446 FALSE
447 FALSE
448 FALSE
449 FALSE
450 FALSE
451 FALSE
452 FALSE
453 FALSE
454 FALSE
455 FALSE
456 FALSE
457 FALSE
458 FALSE
459 FALSE
460 FALSE
461 FALSE
462 FALSE
463 FALSE
464 FALSE
465 FALSE
466 FALSE
467 FALSE
468 FALSE
469 FALSE
470 FALSE
471 FALSE
472 FALSE
473 FALSE
474 FALSE
475 FALSE
476 FALSE
477 FALSE
478 FALSE
479 FALSE
480 FALSE
481 FALSE
482 FALSE
483 FALSE
484 FALSE
485 FALSE
486 FALSE
487 FALSE
488 FALSE
489 FALSE
490 FALSE
491 FALSE
492 FALSE
493 FALSE
494 FALSE
495 FALSE
496 FALSE
497 FALSE
498 FALSE
499 FALSE
500 FALSE
501 FALSE
502 FALSE
503 FALSE
504 FALSE
505 FALSE
506 FALSE
507 FALSE
508 FALSE
509 FALSE
510 FALSE
511 FALSE
512 FALSE
513 FALSE
514 FALSE
515 FALSE
516 FALSE
517 FALSE
518 FALSE
519 FALSE
520 FALSE
521 FALSE
522 FALSE
523 FALSE
524 FALSE
525 FALSE
526 FALSE
527 FALSE
528 FALSE
529 FALSE
530 FALSE
531 FALSE
532 FALSE
533 FALSE
534 FALSE
535 FALSE
536 FALSE
537 FALSE
538 FALSE
539 FALSE
540 TRUE
541 FALSE
542 FALSE
543 FALSE
544 FALSE
545 FALSE
546 FALSE
547 FALSE
548 FALSE
549 FALSE
550 FALSE
551 FALSE
552 FALSE
553 FALSE
554 FALSE
555 FALSE
556 FALSE
557 FALSE
558 FALSE
559 FALSE
560 FALSE
561 FALSE
562 FALSE
563 FALSE
564 FALSE
565 FALSE
566 FALSE
567 FALSE
568 FALSE
569 FALSE
570 FALSE
571 FALSE
572 FALSE
573 FALSE
574 FALSE
575 FALSE
576 FALSE
577 FALSE
578 FALSE
579 FALSE
580 FALSE
581 FALSE
582 FALSE
583 FALSE
584 FALSE
585 FALSE
586 FALSE
587 FALSE
588 FALSE
589 FALSE
590 FALSE
591 FALSE
592 FALSE
593 FALSE
594 FALSE
595 FALSE
596 FALSE
597 FALSE
598 FALSE
599 FALSE
600 FALSE
601 FALSE
602 FALSE
603 FALSE
604 FALSE
605 FALSE
606 FALSE
607 FALSE
608 FALSE
609 FALSE
610 FALSE
611 FALSE
612 FALSE
613 FALSE
614 FALSE
615 FALSE
616 FALSE
617 FALSE
618 FALSE
619 FALSE
620 FALSE
621 FALSE
622 FALSE
623 FALSE
624 FALSE
625 FALSE
626 FALSE
627 FALSE
628 FALSE
629 FALSE
630 FALSE
631 FALSE
632 FALSE
633 FALSE
634 FALSE
635 FALSE
636 FALSE
637 FALSE
638 FALSE
639 FALSE
640 FALSE
641 FALSE
642 FALSE
643 FALSE
644 FALSE
645 FALSE
646 FALSE
647 FALSE
648 FALSE
649 FALSE
650 FALSE
651 FALSE
652 FALSE
653 FALSE
654 FALSE
655 FALSE
656 FALSE
657 FALSE
658 FALSE
659 FALSE
660 FALSE
661 FALSE
662 FALSE
663 FALSE
664 FALSE
665 FALSE
666 FALSE
667 FALSE
668 FALSE
669 FALSE
670 FALSE
671 FALSE
672 FALSE
673 FALSE
674 FALSE
675 FALSE
676 FALSE
677 FALSE
678 FALSE
679 FALSE
680 FALSE
681 FALSE
682 FALSE
683 FALSE
684 FALSE
685 FALSE
686 FALSE
687 FALSE
688 FALSE
689 FALSE
690 FALSE
691 FALSE
692 FALSE
693 FALSE
694 FALSE
695 FALSE
696 FALSE
697 FALSE
698 FALSE
699 FALSE
700 FALSE
701 FALSE
702 FALSE
703 FALSE
704 FALSE
705 FALSE
706 FALSE
707 FALSE
708 FALSE
709 FALSE
710 FALSE
711 FALSE
712 FALSE
713 FALSE
714 FALSE
715 FALSE
716 FALSE
717 FALSE
718 FALSE
719 FALSE
720 FALSE
721 FALSE
722 FALSE
723 FALSE
724 FALSE
725 FALSE
726 FALSE
727 FALSE
728 FALSE
729 FALSE
730 FALSE
731 FALSE
732 FALSE
733 FALSE
734 FALSE
735 FALSE
736 FALSE
737 FALSE
738 FALSE
739 FALSE
740 FALSE
741 FALSE
742 FALSE
743 FALSE
744 FALSE
745 FALSE
746 FALSE
747 FALSE
748 FALSE
749 FALSE
750 FALSE
751 FALSE
752 FALSE
753 FALSE
754 FALSE
755 FALSE
756 FALSE
757 FALSE
758 FALSE
759 FALSE
760 FALSE
761 FALSE
762 FALSE
763 FALSE
764 FALSE
765 FALSE
766 FALSE
767 FALSE
768 FALSE
769 FALSE
770 FALSE
771 FALSE
772 FALSE
773 FALSE
774 FALSE
775 FALSE
776 FALSE
777 FALSE
778 FALSE
779 FALSE
780 FALSE
781 FALSE
782 FALSE
783 FALSE
784 FALSE
785 FALSE
786 FALSE
787 FALSE
788 FALSE
789 FALSE
790 FALSE
791 FALSE
792 FALSE
793 FALSE
794 FALSE
795 FALSE
796 FALSE
797 FALSE
798 FALSE
799 FALSE
800 FALSE
801 FALSE
802 FALSE
803 FALSE
804 FALSE
805 FALSE
806 FALSE
807 FALSE
808 FALSE
809 FALSE
810 FALSE
811 FALSE
812 FALSE
813 FALSE
814 FALSE
815 FALSE
816 FALSE
817 FALSE
818 FALSE
819 FALSE
820 FALSE
821 FALSE
822 FALSE
823 FALSE
824 FALSE
825 FALSE
826 FALSE
827 FALSE
828 FALSE
829 FALSE
830 FALSE
831 FALSE
832 FALSE
833 FALSE
834 FALSE
835 FALSE
836 FALSE
837 FALSE
838 FALSE
839 FALSE
840 FALSE
841 FALSE
842 FALSE
843 FALSE
844 FALSE
845 FALSE
846 FALSE
847 FALSE
848 FALSE
849 FALSE
850 FALSE
851 FALSE
852 FALSE
853 FALSE
854 FALSE
855 FALSE
856 FALSE
857 FALSE
858 FALSE
859 FALSE
860 FALSE
861 FALSE
862 FALSE
863 FALSE
864 FALSE
865 FALSE
866 FALSE
867 FALSE
868 FALSE
869 FALSE
870 FALSE
871 FALSE
872 FALSE
873 FALSE
874 FALSE
875 FALSE
876 FALSE
877 FALSE
878 FALSE
879 FALSE
880 FALSE
881 FALSE
882 FALSE
883 FALSE
884 FALSE
885 FALSE
886 FALSE
887 FALSE
888 FALSE
889 FALSE
890 FALSE
891 FALSE
892 FALSE
893 FALSE
894 FALSE
895 FALSE
896 FALSE
897 FALSE
898 FALSE
899 FALSE
900 FALSE
901 FALSE
902 FALSE
903 FALSE
904 FALSE
905 FALSE
906 FALSE
907 FALSE
908 FALSE
909 FALSE
910 FALSE
911 FALSE
912 FALSE
913 FALSE
914 FALSE
915 FALSE
916 FALSE
917 FALSE
918 FALSE
919 FALSE
920 FALSE
921 FALSE
922 FALSE
923 FALSE
924 FALSE
925 FALSE
926 FALSE
927 FALSE
928 FALSE
929 FALSE
930 FALSE
931 FALSE
932 FALSE
933 FALSE
934 FALSE
935 FALSE
936 FALSE
937 FALSE
938 FALSE
939 FALSE
940 FALSE
941 FALSE
942 FALSE
943 FALSE
944 FALSE
945 FALSE
946 FALSE
947 FALSE
948 FALSE
949 FALSE
950 FALSE
951 FALSE
952 FALSE
953 FALSE
954 FALSE
955 FALSE
956 FALSE
957 FALSE
958 FALSE
959 FALSE
960 FALSE
961 FALSE
962 FALSE
963 FALSE
964 FALSE
965 FALSE
966 FALSE
967 FALSE
968 FALSE
969 FALSE
970 FALSE
971 FALSE
972 FALSE
973 FALSE
974 FALSE
975 FALSE
976 FALSE
977 FALSE
978 FALSE
979 FALSE
980 FALSE
981 FALSE
982 FALSE
983 FALSE
984 FALSE
985 FALSE
986 FALSE
987 FALSE
988 FALSE
989 FALSE
990 FALSE
991 FALSE
992 FALSE
993 FALSE
994 FALSE
995 FALSE
996 FALSE
997 FALSE
998 FALSE
999 FALSE
1000 FALSE
[ reached getOption("max.print") -- omitted 1251 rows ]
#To deal with the problem of "large data size", we could use summary() function.
Violated1 <- violatedEdits(Rules1, Data_for_checking_inconsistency2)
summary(Violated1)
Edit violations, 2251 observations, 0 completely missing (0%):
Edit violations per record:
#The result shows that this data does have some errors.
#To correct those errors, we import rules for correction.
Rules2 <- correctionRules("/Users/macbook/Documents/RMIT-Master of Analytics/Data wrangling/editrules3.txt")
#Apply rules into the data that contains errors.
cor <- correctWithRules(Rules2, Data_for_checking_inconsistency2)
cor
$corrected
$corrections
#This is the data after correction.
cor$corrected
#And then, check it again whether we successfully correct errors.
violatedEdits(Rules1, cor$corrected)
edit
record num1
1 FALSE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
6 FALSE
7 FALSE
8 FALSE
9 FALSE
10 FALSE
11 FALSE
12 FALSE
13 FALSE
14 FALSE
15 FALSE
16 FALSE
17 FALSE
18 FALSE
19 FALSE
20 FALSE
21 FALSE
22 FALSE
23 FALSE
24 FALSE
25 FALSE
26 FALSE
27 FALSE
28 FALSE
29 FALSE
30 FALSE
31 FALSE
32 FALSE
33 FALSE
34 FALSE
35 FALSE
36 FALSE
37 FALSE
38 FALSE
39 FALSE
40 FALSE
41 FALSE
42 FALSE
43 FALSE
44 FALSE
45 FALSE
46 FALSE
47 FALSE
48 FALSE
49 FALSE
50 FALSE
51 FALSE
52 FALSE
53 FALSE
54 FALSE
55 FALSE
56 FALSE
57 FALSE
58 FALSE
59 FALSE
60 FALSE
61 FALSE
62 FALSE
63 FALSE
64 FALSE
65 FALSE
66 FALSE
67 FALSE
68 FALSE
69 FALSE
70 FALSE
71 FALSE
72 FALSE
73 FALSE
74 FALSE
75 FALSE
76 FALSE
77 FALSE
78 FALSE
79 FALSE
80 FALSE
81 FALSE
82 FALSE
83 FALSE
84 FALSE
85 FALSE
86 FALSE
87 FALSE
88 FALSE
89 FALSE
90 FALSE
91 FALSE
92 FALSE
93 FALSE
94 FALSE
95 FALSE
96 FALSE
97 FALSE
98 FALSE
99 FALSE
100 FALSE
101 FALSE
102 FALSE
103 FALSE
104 FALSE
105 FALSE
106 FALSE
107 FALSE
108 FALSE
109 FALSE
110 FALSE
111 FALSE
112 FALSE
113 FALSE
114 FALSE
115 FALSE
116 FALSE
117 FALSE
118 FALSE
119 FALSE
120 FALSE
121 FALSE
122 FALSE
123 FALSE
124 FALSE
125 FALSE
126 FALSE
127 FALSE
128 FALSE
129 FALSE
130 FALSE
131 FALSE
132 FALSE
133 FALSE
134 FALSE
135 FALSE
136 FALSE
137 FALSE
138 FALSE
139 FALSE
140 FALSE
141 FALSE
142 FALSE
143 FALSE
144 FALSE
145 FALSE
146 FALSE
147 FALSE
148 FALSE
149 FALSE
150 FALSE
151 FALSE
152 FALSE
153 FALSE
154 FALSE
155 FALSE
156 FALSE
157 FALSE
158 FALSE
159 FALSE
160 FALSE
161 FALSE
162 FALSE
163 FALSE
164 FALSE
165 FALSE
166 FALSE
167 FALSE
168 FALSE
169 FALSE
170 FALSE
171 FALSE
172 FALSE
173 FALSE
174 FALSE
175 FALSE
176 FALSE
177 FALSE
178 FALSE
179 FALSE
180 FALSE
181 FALSE
182 FALSE
183 FALSE
184 FALSE
185 FALSE
186 FALSE
187 FALSE
188 FALSE
189 FALSE
190 FALSE
191 FALSE
192 FALSE
193 FALSE
194 FALSE
195 FALSE
196 FALSE
197 FALSE
198 FALSE
199 FALSE
200 FALSE
201 FALSE
202 FALSE
203 FALSE
204 FALSE
205 FALSE
206 FALSE
207 FALSE
208 FALSE
209 FALSE
210 FALSE
211 FALSE
212 FALSE
213 FALSE
214 FALSE
215 FALSE
216 FALSE
217 FALSE
218 FALSE
219 FALSE
220 FALSE
221 FALSE
222 FALSE
223 FALSE
224 FALSE
225 FALSE
226 FALSE
227 FALSE
228 FALSE
229 FALSE
230 FALSE
231 FALSE
232 FALSE
233 FALSE
234 FALSE
235 FALSE
236 FALSE
237 FALSE
238 FALSE
239 FALSE
240 FALSE
241 FALSE
242 FALSE
243 FALSE
244 FALSE
245 FALSE
246 FALSE
247 FALSE
248 FALSE
249 FALSE
250 FALSE
251 FALSE
252 FALSE
253 FALSE
254 FALSE
255 FALSE
256 FALSE
257 FALSE
258 FALSE
259 FALSE
260 FALSE
261 FALSE
262 FALSE
263 FALSE
264 FALSE
265 FALSE
266 FALSE
267 FALSE
268 FALSE
269 FALSE
270 FALSE
271 FALSE
272 FALSE
273 FALSE
274 FALSE
275 FALSE
276 FALSE
277 FALSE
278 FALSE
279 FALSE
280 FALSE
281 FALSE
282 FALSE
283 FALSE
284 FALSE
285 FALSE
286 FALSE
287 FALSE
288 FALSE
289 FALSE
290 FALSE
291 FALSE
292 FALSE
293 FALSE
294 FALSE
295 FALSE
296 FALSE
297 FALSE
298 FALSE
299 FALSE
300 FALSE
301 FALSE
302 FALSE
303 FALSE
304 FALSE
305 FALSE
306 FALSE
307 FALSE
308 FALSE
309 FALSE
310 FALSE
311 FALSE
312 FALSE
313 FALSE
314 FALSE
315 FALSE
316 FALSE
317 FALSE
318 FALSE
319 FALSE
320 FALSE
321 FALSE
322 FALSE
323 FALSE
324 FALSE
325 FALSE
326 FALSE
327 FALSE
328 FALSE
329 FALSE
330 FALSE
331 FALSE
332 FALSE
333 FALSE
334 FALSE
335 FALSE
336 FALSE
337 FALSE
338 FALSE
339 FALSE
340 FALSE
341 FALSE
342 FALSE
343 FALSE
344 FALSE
345 FALSE
346 FALSE
347 FALSE
348 FALSE
349 FALSE
350 FALSE
351 FALSE
352 FALSE
353 FALSE
354 FALSE
355 FALSE
356 FALSE
357 FALSE
358 FALSE
359 FALSE
360 FALSE
361 FALSE
362 FALSE
363 FALSE
364 FALSE
365 FALSE
366 FALSE
367 FALSE
368 FALSE
369 FALSE
370 FALSE
371 FALSE
372 FALSE
373 FALSE
374 FALSE
375 FALSE
376 FALSE
377 FALSE
378 FALSE
379 FALSE
380 FALSE
381 FALSE
382 FALSE
383 FALSE
384 FALSE
385 FALSE
386 FALSE
387 FALSE
388 FALSE
389 FALSE
390 FALSE
391 FALSE
392 FALSE
393 FALSE
394 FALSE
395 FALSE
396 FALSE
397 FALSE
398 FALSE
399 FALSE
400 FALSE
401 FALSE
402 FALSE
403 FALSE
404 FALSE
405 FALSE
406 FALSE
407 FALSE
408 FALSE
409 FALSE
410 FALSE
411 FALSE
412 FALSE
413 FALSE
414 FALSE
415 FALSE
416 FALSE
417 FALSE
418 FALSE
419 FALSE
420 FALSE
421 FALSE
422 FALSE
423 FALSE
424 FALSE
425 FALSE
426 FALSE
427 FALSE
428 FALSE
429 FALSE
430 FALSE
431 FALSE
432 FALSE
433 FALSE
434 FALSE
435 FALSE
436 FALSE
437 FALSE
438 FALSE
439 FALSE
440 FALSE
441 FALSE
442 FALSE
443 FALSE
444 FALSE
445 FALSE
446 FALSE
447 FALSE
448 FALSE
449 FALSE
450 FALSE
451 FALSE
452 FALSE
453 FALSE
454 FALSE
455 FALSE
456 FALSE
457 FALSE
458 FALSE
459 FALSE
460 FALSE
461 FALSE
462 FALSE
463 FALSE
464 FALSE
465 FALSE
466 FALSE
467 FALSE
468 FALSE
469 FALSE
470 FALSE
471 FALSE
472 FALSE
473 FALSE
474 FALSE
475 FALSE
476 FALSE
477 FALSE
478 FALSE
479 FALSE
480 FALSE
481 FALSE
482 FALSE
483 FALSE
484 FALSE
485 FALSE
486 FALSE
487 FALSE
488 FALSE
489 FALSE
490 FALSE
491 FALSE
492 FALSE
493 FALSE
494 FALSE
495 FALSE
496 FALSE
497 FALSE
498 FALSE
499 FALSE
500 FALSE
501 FALSE
502 FALSE
503 FALSE
504 FALSE
505 FALSE
506 FALSE
507 FALSE
508 FALSE
509 FALSE
510 FALSE
511 FALSE
512 FALSE
513 FALSE
514 FALSE
515 FALSE
516 FALSE
517 FALSE
518 FALSE
519 FALSE
520 FALSE
521 FALSE
522 FALSE
523 FALSE
524 FALSE
525 FALSE
526 FALSE
527 FALSE
528 FALSE
529 FALSE
530 FALSE
531 FALSE
532 FALSE
533 FALSE
534 FALSE
535 FALSE
536 FALSE
537 FALSE
538 FALSE
539 FALSE
540 FALSE
541 FALSE
542 FALSE
543 FALSE
544 FALSE
545 FALSE
546 FALSE
547 FALSE
548 FALSE
549 FALSE
550 FALSE
551 FALSE
552 FALSE
553 FALSE
554 FALSE
555 FALSE
556 FALSE
557 FALSE
558 FALSE
559 FALSE
560 FALSE
561 FALSE
562 FALSE
563 FALSE
564 FALSE
565 FALSE
566 FALSE
567 FALSE
568 FALSE
569 FALSE
570 FALSE
571 FALSE
572 FALSE
573 FALSE
574 FALSE
575 FALSE
576 FALSE
577 FALSE
578 FALSE
579 FALSE
580 FALSE
581 FALSE
582 FALSE
583 FALSE
584 FALSE
585 FALSE
586 FALSE
587 FALSE
588 FALSE
589 FALSE
590 FALSE
591 FALSE
592 FALSE
593 FALSE
594 FALSE
595 FALSE
596 FALSE
597 FALSE
598 FALSE
599 FALSE
600 FALSE
601 FALSE
602 FALSE
603 FALSE
604 FALSE
605 FALSE
606 FALSE
607 FALSE
608 FALSE
609 FALSE
610 FALSE
611 FALSE
612 FALSE
613 FALSE
614 FALSE
615 FALSE
616 FALSE
617 FALSE
618 FALSE
619 FALSE
620 FALSE
621 FALSE
622 FALSE
623 FALSE
624 FALSE
625 FALSE
626 FALSE
627 FALSE
628 FALSE
629 FALSE
630 FALSE
631 FALSE
632 FALSE
633 FALSE
634 FALSE
635 FALSE
636 FALSE
637 FALSE
638 FALSE
639 FALSE
640 FALSE
641 FALSE
642 FALSE
643 FALSE
644 FALSE
645 FALSE
646 FALSE
647 FALSE
648 FALSE
649 FALSE
650 FALSE
651 FALSE
652 FALSE
653 FALSE
654 FALSE
655 FALSE
656 FALSE
657 FALSE
658 FALSE
659 FALSE
660 FALSE
661 FALSE
662 FALSE
663 FALSE
664 FALSE
665 FALSE
666 FALSE
667 FALSE
668 FALSE
669 FALSE
670 FALSE
671 FALSE
672 FALSE
673 FALSE
674 FALSE
675 FALSE
676 FALSE
677 FALSE
678 FALSE
679 FALSE
680 FALSE
681 FALSE
682 FALSE
683 FALSE
684 FALSE
685 FALSE
686 FALSE
687 FALSE
688 FALSE
689 FALSE
690 FALSE
691 FALSE
692 FALSE
693 FALSE
694 FALSE
695 FALSE
696 FALSE
697 FALSE
698 FALSE
699 FALSE
700 FALSE
701 FALSE
702 FALSE
703 FALSE
704 FALSE
705 FALSE
706 FALSE
707 FALSE
708 FALSE
709 FALSE
710 FALSE
711 FALSE
712 FALSE
713 FALSE
714 FALSE
715 FALSE
716 FALSE
717 FALSE
718 FALSE
719 FALSE
720 FALSE
721 FALSE
722 FALSE
723 FALSE
724 FALSE
725 FALSE
726 FALSE
727 FALSE
728 FALSE
729 FALSE
730 FALSE
731 FALSE
732 FALSE
733 FALSE
734 FALSE
735 FALSE
736 FALSE
737 FALSE
738 FALSE
739 FALSE
740 FALSE
741 FALSE
742 FALSE
743 FALSE
744 FALSE
745 FALSE
746 FALSE
747 FALSE
748 FALSE
749 FALSE
750 FALSE
751 FALSE
752 FALSE
753 FALSE
754 FALSE
755 FALSE
756 FALSE
757 FALSE
758 FALSE
759 FALSE
760 FALSE
761 FALSE
762 FALSE
763 FALSE
764 FALSE
765 FALSE
766 FALSE
767 FALSE
768 FALSE
769 FALSE
770 FALSE
771 FALSE
772 FALSE
773 FALSE
774 FALSE
775 FALSE
776 FALSE
777 FALSE
778 FALSE
779 FALSE
780 FALSE
781 FALSE
782 FALSE
783 FALSE
784 FALSE
785 FALSE
786 FALSE
787 FALSE
788 FALSE
789 FALSE
790 FALSE
791 FALSE
792 FALSE
793 FALSE
794 FALSE
795 FALSE
796 FALSE
797 FALSE
798 FALSE
799 FALSE
800 FALSE
801 FALSE
802 FALSE
803 FALSE
804 FALSE
805 FALSE
806 FALSE
807 FALSE
808 FALSE
809 FALSE
810 FALSE
811 FALSE
812 FALSE
813 FALSE
814 FALSE
815 FALSE
816 FALSE
817 FALSE
818 FALSE
819 FALSE
820 FALSE
821 FALSE
822 FALSE
823 FALSE
824 FALSE
825 FALSE
826 FALSE
827 FALSE
828 FALSE
829 FALSE
830 FALSE
831 FALSE
832 FALSE
833 FALSE
834 FALSE
835 FALSE
836 FALSE
837 FALSE
838 FALSE
839 FALSE
840 FALSE
841 FALSE
842 FALSE
843 FALSE
844 FALSE
845 FALSE
846 FALSE
847 FALSE
848 FALSE
849 FALSE
850 FALSE
851 FALSE
852 FALSE
853 FALSE
854 FALSE
855 FALSE
856 FALSE
857 FALSE
858 FALSE
859 FALSE
860 FALSE
861 FALSE
862 FALSE
863 FALSE
864 FALSE
865 FALSE
866 FALSE
867 FALSE
868 FALSE
869 FALSE
870 FALSE
871 FALSE
872 FALSE
873 FALSE
874 FALSE
875 FALSE
876 FALSE
877 FALSE
878 FALSE
879 FALSE
880 FALSE
881 FALSE
882 FALSE
883 FALSE
884 FALSE
885 FALSE
886 FALSE
887 FALSE
888 FALSE
889 FALSE
890 FALSE
891 FALSE
892 FALSE
893 FALSE
894 FALSE
895 FALSE
896 FALSE
897 FALSE
898 FALSE
899 FALSE
900 FALSE
901 FALSE
902 FALSE
903 FALSE
904 FALSE
905 FALSE
906 FALSE
907 FALSE
908 FALSE
909 FALSE
910 FALSE
911 FALSE
912 FALSE
913 FALSE
914 FALSE
915 FALSE
916 FALSE
917 FALSE
918 FALSE
919 FALSE
920 FALSE
921 FALSE
922 FALSE
923 FALSE
924 FALSE
925 FALSE
926 FALSE
927 FALSE
928 FALSE
929 FALSE
930 FALSE
931 FALSE
932 FALSE
933 FALSE
934 FALSE
935 FALSE
936 FALSE
937 FALSE
938 FALSE
939 FALSE
940 FALSE
941 FALSE
942 FALSE
943 FALSE
944 FALSE
945 FALSE
946 FALSE
947 FALSE
948 FALSE
949 FALSE
950 FALSE
951 FALSE
952 FALSE
953 FALSE
954 FALSE
955 FALSE
956 FALSE
957 FALSE
958 FALSE
959 FALSE
960 FALSE
961 FALSE
962 FALSE
963 FALSE
964 FALSE
965 FALSE
966 FALSE
967 FALSE
968 FALSE
969 FALSE
970 FALSE
971 FALSE
972 FALSE
973 FALSE
974 FALSE
975 FALSE
976 FALSE
977 FALSE
978 FALSE
979 FALSE
980 FALSE
981 FALSE
982 FALSE
983 FALSE
984 FALSE
985 FALSE
986 FALSE
987 FALSE
988 FALSE
989 FALSE
990 FALSE
991 FALSE
992 FALSE
993 FALSE
994 FALSE
995 FALSE
996 FALSE
997 FALSE
998 FALSE
999 FALSE
1000 FALSE
[ reached getOption("max.print") -- omitted 1251 rows ]
#Use summary() function to avoid the problem of large data size.
Violated2 <- violatedEdits(Rules1, cor$corrected)
summary(Violated2)
No violations detected, 0 checks evaluated to NA
NULL
#And then, after correcting the data, we put "Sex_proportion_for_the_pre.student_in_school" back into the dataset.
#The first thing is to reshape the dataset of "Data_for_checking_inconsistency2".
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 <- Data_for_checking_inconsistency2 %>%
gather(`Male`, `Female`, key = "Gender", value = "Number_of_pre.student")
#And then, put "Sex_proportion_for_the_pre.student_in_school" back into the dataset.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 <- mutate(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3,
Sex_proportion_for_the_pre.student_in_school = Number_of_pre.student / Total_number_of_pre.student_for_school
)
#And then, rearrange the sequence of variables.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 %>% select(School_Name, School_No, School_Type, School_Status, Education_Sector, Entity_Type, Gender, Number_of_pre.student, Total_number_of_pre.student_for_school, Sex_proportion_for_the_pre.student_in_school)
#The second thing is to check inifite value, special value, and missing value. For convenience, we first build up a function to better deal with missing values, infinite values, and NaN values. We acutally would like to use sapply() function to look for errors but this function could not deal with NaN values. Thus, we should write up a function to deal with NaN values.
is.specialorNA <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}
#There are 932 na value for column "Sex_proportion_for_the_pre.student_in_school"
sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.na(x) ))
School_Name
0
School_No
0
School_Type
0
School_Status
0
Education_Sector
0
Entity_Type
0
Gender
0
Number_of_pre.student
0
Total_number_of_pre.student_for_school
0
Sex_proportion_for_the_pre.student_in_school
932
#There are 0 infinite value for this dataset
sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.infinite(x) ))
School_Name
0
School_No
0
School_Type
0
School_Status
0
Education_Sector
0
Entity_Type
0
Gender
0
Number_of_pre.student
0
Total_number_of_pre.student_for_school
0
Sex_proportion_for_the_pre.student_in_school
0
#There are 932 NaN value for column "Sex_proportion_for_the_pre.student_in_school"
sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.nan(x) ))
School_Name
0
School_No
0
School_Type
0
School_Status
0
Education_Sector
0
Entity_Type
0
Gender
0
Number_of_pre.student
0
Total_number_of_pre.student_for_school
0
Sex_proportion_for_the_pre.student_in_school
932
#NaN means "not a number". When R generate NaN, it means that the value just makes a little sense. In this case, we find out a lot of NaNs in the variable of “Sex_proportion_for_the_pre.student_in_school”, which is calculated by the formula of:
#Sex_proportion_for_the_pre.student_in_school = Number_of_pre.student / Total_number_of_pre.student_for_school.
#Number_of_pre.student can be 0 and Total_number_of_pre.student_for_school can be 0 either. The is the reason of NaN because when 0 divided by 0, it is meaningless and that’s why R generate NaN as the result. To fix this special value, we can just covert NaN to 0 directly.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school[All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school == "NaN" ] <- 0
#And then, check it again whether we eliminate all NaN values and Na values.
sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.na(x) ))
School_Name
0
School_No
0
School_Type
0
School_Status
0
Education_Sector
0
Entity_Type
0
Gender
0
Number_of_pre.student
0
Total_number_of_pre.student_for_school
0
Sex_proportion_for_the_pre.student_in_school
0
sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.nan(x) ))
School_Name
0
School_No
0
School_Type
0
School_Status
0
Education_Sector
0
Entity_Type
0
Gender
0
Number_of_pre.student
0
Total_number_of_pre.student_for_school
0
Sex_proportion_for_the_pre.student_in_school
0
#We successfully remove all NaN values and Na values.
#The third thing is to check incoensistency, and we do the same thing as what we done previously.
#Firts, we import rules into R.
Rules3 <- editfile("/Users/macbook/Documents/RMIT-Master of Analytics/Data wrangling/editrules.txt", type = "all")
Rules3
Data model:
dat1 : Education_Sector %in% c('Catholic', 'Government', 'Independent')
dat2 : Entity_Type %in% c('1', '2')
dat3 : Gender %in% c('Female', 'Male')
dat4 : School_Status %in% c('O', 'U')
dat5 : School_Type %in% c('Language', 'Pri/Sec', 'Primary', 'Secondary', 'Special')
Edit set:
num1 : Number_of_pre.student <= Total_number_of_pre.student_for_school
num2 : Sex_proportion_for_the_pre.student_in_school <= 1
num3 : 0 <= Sex_proportion_for_the_pre.student_in_school
#Check which value violates the rules
violatedEdits(Rules3, All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3)
edit
record num1 num2 num3 dat1 dat2 dat3 dat4 dat5
1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
7 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
9 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
10 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
11 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
12 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
13 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
14 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
15 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
16 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
17 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
18 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
19 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
20 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
21 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
22 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
23 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
24 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
25 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
26 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
27 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
28 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
29 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
30 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
31 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
32 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
33 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
34 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
35 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
36 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
37 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
38 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
39 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
40 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
41 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
42 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
43 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
44 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
45 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
46 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
47 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
48 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
49 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
50 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
51 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
52 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
53 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
54 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
55 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
56 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
57 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
58 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
59 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
60 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
61 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
62 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
63 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
64 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
65 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
66 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
67 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
68 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
69 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
70 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
71 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
72 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
73 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
74 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
75 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
76 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
77 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
78 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
79 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
80 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
81 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
82 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
83 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
84 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
85 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
86 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
87 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
88 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
89 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
90 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
91 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
92 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
93 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
94 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
95 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
96 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
97 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
98 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
99 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
100 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
101 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
102 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
103 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
104 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
105 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
106 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
107 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
108 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
109 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
110 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
111 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
112 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
113 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
114 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
115 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
116 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
117 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
118 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
119 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
120 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
121 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
122 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
123 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
124 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
125 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[ reached getOption("max.print") -- omitted 4377 rows ]
#To deal with the problem of "large data size", we could use summary() function.
Violated3 <- violatedEdits(Rules3, All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3)
summary(Violated3)
No violations detected, 0 checks evaluated to NA
NULL
#Fortunately, we don't have any inconsistencies.
Apply an appropriate transformation for at least one of the variables. In addition to the R codes and outputs, explain everything that you do in this step. In this step, you should fulfil the minimum requirement #9.
# This is the R chunk for the Transform Section
#7 Data transformation
#The reaosn to put "data transformation" prior to the "Scan II" is that we have data who has right-skewed distribution. And we decide to use z-scores method in order to locate where are the outliers.
#Therefore, we have to transform data from right-skewed distribution to approximately normal distribution first.
hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school)
#Based on the variable "Sex_proportion_for_the_pre.student_in_school", it contains 0 values and it is right-skewed.
#We first try to use square root transformation to deal with this data because this data has 0 values and it is reight-skewed.
sqrt_salary <- sqrt(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school)
hist(sqrt_salary)
#As result, it is still right skewed and still need to be improved.
#We note that this data contrains 0 values or 1 values, and it is because there are some all-male school or all-female school.
#When investigating sex porporion, it is meaningless to investigate what percentage of male student in a all-female school or what percentage of female student in all-male school.
#Therefore, we should remove 0 value and 1 value.
subset_no_0_1 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 %>% filter(Sex_proportion_for_the_pre.student_in_school != "0" & Sex_proportion_for_the_pre.student_in_school !="1")
#Now, let's produce a histogram again.
hist(subset_no_0_1$Sex_proportion_for_the_pre.student_in_school)
#As result, it is approximately normal distributed now.
Scan the numeric data for outliers. In this step, you should fulfil the minimum requirement #8. In addition to the R codes and outputs, explain your methodology (i.e. explain why you have chosen that methodology and the actions that you have taken to handle these values) and communicate your results clearly.
# This is the R chunk for the Scan II
#We first check out which variable is numeric.
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3)
'data.frame': 4502 obs. of 10 variables:
$ School_Name : chr "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
$ School_No : int 20 25 26 28 29 30 33 35 60 77 ...
$ School_Type : Factor w/ 5 levels "Language","Pri/Sec",..: 4 4 4 4 3 3 4 4 3 3 ...
$ School_Status : Factor w/ 2 levels "O","U": 1 1 1 1 1 1 1 1 1 1 ...
$ Education_Sector : Factor w/ 3 levels "Catholic","Government",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Entity_Type : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ Number_of_pre.student : int 0 0 0 0 15 19 0 0 22 15 ...
$ Total_number_of_pre.student_for_school : int 0 0 0 0 34 44 0 0 38 39 ...
$ Sex_proportion_for_the_pre.student_in_school: num 0 0 0 0 0.441 ...
#We could see that there are 4 numeric variables:
#School_No, Number_of_pre.student, Total_number_of_pre.student_for_school, and Sex_proportion_for_the_pre.student_in_school
#However, School_No is not necessary to be scanned because it is just a code to represent a school and not a numeric number technically.
#First, we are going to investigate outliers of "Sex_proportion_for_the_pre.student_in_school".
hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school)
#And also, they are non-symmetric/ non-normal data distributions, it is right-skewed distribution.
#As we known, there are all-male school and all-female school, so we expect that there are some outliers is 0 or 1 in terms of sex proportion.
#So, we should use z-scores method to look for outliers. The reason is that we could know which values are outliers, and knowing how many outliers are there, and if there are just a samll acount of outliers we may choose to delete them.
#Given that we already normalize the right-skewed distribution of "Sex_proportion_for_the_pre.student_in_school"
#Now, we could directly apply z-scores method to look for outliers.
#We now calculate z scores:
z.scores <- subset_no_0_1$Sex_proportion_for_the_pre.student_in_school %>% scores(type = "z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.571547 -0.558508 -0.000538 0.000000 0.557432 3.570471
#we can see that the maximum is 12.04. And we can find the locations of the z-scores whose absolute value is greater than 3 by using which().
which( abs(z.scores) >3 )
[1] 24 120 154 641 701 708 774 1171 1287 1361 1374 1451 1489
[14] 1491 1708 1804 1838 2325 2385 2392 2458 2855 2971 3045 3058 3135
[27] 3173 3175 3349
#By using length() function, we can see that there are 29 outliers.
length (which( abs(z.scores) >3 ))
[1] 29
#It is still possible that those outliers are not the result of data entry errors, measurement errors, experimental errors, intentional errors, data processing errors or due to the sampling (i.e., sampling error) because very low sex proportion or very high sex proportion does happened in some school.
#But it is also possible that those outliers are actually errors.
#In this case, we should use capping technique to deal with outliers because we are not sure about whether there are errors.
#Thus, we should use the safe way to deal with this data: Replacing the outliers with the nearest neighbours that are not outliers.
#First, we need to create our own function: cap. This means that we need to first find the value that are less than (quantiles 0.25)-1.5*IQR and then repalce it as (quantiles 0.05). Second, find the value that are bigger than (quantiles 0.75)+1.5*IQR and then repalce it as (quantiles 0.95).
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
#And then, cap the outliers.
Sex_proportion_capped <- subset_no_0_1$Sex_proportion_for_the_pre.student_in_school %>% cap()
#Let's check outliers again.
z.scores <- Sex_proportion_capped %>% scores(type = "z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.5208407 -0.6305411 -0.0004412 0.0000000 0.6296587 2.5199584
#we can see that the maximum is 2.5199584. And we can find whcih value whose absolute Z-value is greater than 3 by using which().
which( abs(z.scores) >3 )
integer(0)
#By using length() function, we can see that there are 0 outliers.
length (which( abs(z.scores) >3 ))
[1] 0
#Second, we are going to ivestigate outliers of "Number_of_pre.student" in terms of "gender".
#Given that they are one factor variable and one numeric variable. And also, they are non-symmetric/ non-normal data distributions.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_male <-All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 %>% filter(Gender == "Male")
hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_male$Number_of_pre.student)
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_female <-All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 %>% filter(Gender == "Female")
hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_female$Number_of_pre.student)
#Thus, in this case, using the technique of "Bivariate box plot" is highly encouraged.
boxplot(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Number_of_pre.student ~ All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Gender, main="Number of pre.student by gender", ylab = "Number of pre.student", xlab = "Gender")
#Using the bivariate box plot approach, we can easily detect outliers in "Number_of_pre.student" for a given "Gender".
#However, these outliers are valuable becasue this variable (Number_of_pre.student) is used for collect information of the number of stduent for different gender in different school.
#And, it is okay that some school have a lot of student (note that outliers are all beyond the upper fence for both male and female) because schools can be like all-male school or all-female school.
#Moreover, school type is divided by "Language", "Pri/Sec", "Primary", "Secondary", and "Special", it is expected that there should be many pre-student for language school or special school, and less pre-student for Pri/Sec, Primary, and Secondary school.
#And, we could see that the Q1 is just around 0 and Q3 is around 30 for both male and female. Obviously, there are a lot of primary and secondary school who has 0 pre-student or only very samll number of pre-student in this datset.
#Therefore, it is possible that some "Language" and "Special" school have a lot of students and are defined as outliers in this plot. So, these outliers are explainable.
#However, it is still possible that some outliers are errors such as measurement error or sampleing error.
#Thus, we should use the safe way to deal with this data: Replacing the outliers with the nearest neighbours that are not outliers.
#Given that we already have our own function: cap and we could just directly apply cap function into dataset.
#We first deal with the outliers for male.
Student_male_capped <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_male$Number_of_pre.student %>% cap()
#Let's check outliers again.
z.scores2 <- Student_male_capped %>% scores(type = "z")
z.scores2 %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.0015 -0.9418 -0.2859 0.0000 0.6086 2.9342
#we can see that the maximum is 2.9342. And we can find whcih value whose absolute Z-value is greater than 3 by using which().
which( abs(z.scores2) >3 )
integer(0)
#By using length() function, we can see that there are 0 outliers.
length (which( abs(z.scores2) >3 ))
[1] 0
#And then, we deal with the outliers for female.
Student_female_capped <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_female$Number_of_pre.student %>% cap()
#Let's check outliers again.
z.scores3 <- Student_female_capped %>% scores(type = "z")
z.scores3 %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.0039 -0.9410 -0.3118 0.0000 0.6320 2.9602
#we can see that the maximum is 2.9602. And we can find whcih value whose absolute Z-value is greater than 3 by using which().
which( abs(z.scores3) >3 )
integer(0)
#By using length() function, we can see that there are 0 outliers.
length (which( abs(z.scores3) >3 ))
[1] 0
#Third, we are going to investigate outliers of "Total_number_of_pre.student_for_school".
hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Total_number_of_pre.student_for_school)
#And also, they are non-symmetric/ non-normal data distributions.
#Thus, in this case, using the technique of "Univariate box plot" is highly encouraged.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Total_number_of_pre.student_for_school %>% boxplot(main="Box Plot of total number of pre.student for school", ylab="Total number of pre.student", col = "grey")
#This case is very similar to the previous case, outliers are all beyond the upper fence. As we can see in the plot, outliers are all in around 140 students to 250 student. It seems okay because some big school can have 250 pre-student.
#And, we could see that the Q1 is just around 0 and Q3 is around 50. Obviously, this plot has the same situation with the previous one, there are a lot of primary and secondary school who has 0 pre-student only very samll number of pre-student in this datset.
#Therefore, it is possible that some "Language" and "Special" school have a lot of students and are defined as outliers in this plot. So, these outliers are explainable.
#Outliers are valuable and explainable.
#However, it is still possible that some outliers are errors such as measurement error or sampleing error.
#Thus, we should use the safe way to deal with this data: Replacing the outliers with the nearest neighbours that are not outliers.
#Given that we already have our own function: cap and we could just directly apply cap function into dataset.
Total_tudent_capped <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Total_number_of_pre.student_for_school %>% cap()
#Let's check outliers again.
z.scores4 <- Total_tudent_capped %>% scores(type = "z")
z.scores4 %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.0269 -0.9327 -0.2739 0.0000 0.6046 2.8948
#we can see that the maximum is 2.8948. And we can find whcih value whose absolute Z-value is greater than 3 by using which().
which( abs(z.scores4) >3 )
integer(0)
#By using length() function, we can see that there are 0 outliers.
length (which( abs(z.scores4) >3 ))
[1] 0
NOTE: Note that sometimes the order of the tasks may be different than the order given here. For example, you may need to tidy the data sets first to be able to create the common key to merge. Therefore, for such cases you may have a different ordering of the sections.
Any further or optional pre-processing tasks can be added to the template using an additional section in the R Markdown file. Make sure your code is visible (within the margin of the page). Do not use View() to show your data, instead give headers (using head() )