Question 2
# I am going to start by sorting the data in a few different ways that might be more manageable for me to visualize and comprehend. I realize that this isn't always necessary for cleaning data, and it can look messy and overwhelming. I won't include this in my final R Markdown files in the future; I just want to give you a window into what coding with ADHD looks like and the ways that I make it work for me. :)
newgrades<-grades[order(grades$grade), ]
newgrades
## studentID year grade
## 3 3 sophomor 40 (late)
## 21 20 Junior 44
## 5 5 sophomore 50 (late)
## 15 15 Senior 56 (late)
## 16 15 Senior 56 (late)
## 22 21 Sophomore 57 (late)
## 23 22 freshman 60
## 14 14 Senior 62
## 17 16 Senior 66
## 11 11 Senior 68?
## 4 4 freshman 69
## 2 2 freshman 70
## 19 18 sophomore 72
## 6 6 freshman 74
## 24 23 junior 77
## 12 12 Senir 79
## 26 25 freshman 80
## 7 7 freshman 81
## 1 1 freshman 82
## 13 13 senior 83
## 25 24 Junior 83
## 20 19 sophomore 84
## 10 10 junior 86
## 9 9 Junior 89
## 8 8 Freshman <NA>
## 18 17 sophomore <NA>
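# A side note on the sort above: the grade column is still text at this point, so order sorts it as text. That works here because every grade has two digits. The block below is a minimal sketch of sorting on the numeric part instead, assuming a made-up data frame (demo and its values are hypothetical, not the assignment data).

```r
# Hypothetical mini data frame, not the assignment data
demo <- data.frame(
  studentID = 1:4,
  grade = c("82", "100", "9", "40 (late)"),
  stringsAsFactors = FALSE
)

# order() on a character column compares text, so "100" lands before "40 (late)"
text_order <- demo[order(demo$grade), ]

# Keep only the leading digits of each entry and sort on those as numbers
grade_num <- as.numeric(sub("^([0-9]+).*$", "\\1", demo$grade))
num_order <- demo[order(grade_num), ]
num_order
```

# The original column is left untouched; grade_num is only a helper used for ordering.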
# I am not even going to attempt to sort the years, because I can see that there are too many inconsistencies in the names in that column. I will start by finding out how many distinct values the column contains. Here 'year' is the variable (the column), and each specific entry, such as sophomore, is a value of that variable (it would be called a level if the column were a factor). The unique function lists every distinct value, and wrapping it in sort alphabetizes the results for me.
sort(unique(grades$year))
## [1] "freshman" "freshman " "Freshman " "junior" "Junior" "junior "
## [7] "senior" "Senior" "Senir" "sophomor" "sophomore" "Sophomore"
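# A related check, sketched on a made-up vector (demo_year is hypothetical, not the assignment data): table counts how many entries use each spelling, which shows how widespread each variation is.

```r
# Hypothetical year values with a trailing space and mixed capitalization
demo_year <- c("freshman", "freshman ", "freshman", "Junior", "junior")

# table() tallies each distinct spelling; "freshman" and "freshman " count separately
counts <- table(demo_year)
counts
```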
# Then I'm going to use the class function to see if the year column contains characters or factors.
class(grades$year)
## [1] "character"
# Since the column already contains characters, I don't have to convert anything. If it had been a factor, I would have converted it with grades$year<-as.character(grades$year).
# Now I will work on finding and replacing all the different variations of Freshman, Sophomore, Junior, and Senior using the grepl function. I'll double-check my work after each year just for my own peace of mind. I won't normally include that part in my R Markdown file.
grades$year[grepl("freshman",grades$year,ignore.case=TRUE)]<-"Freshman"
grades$year[grepl("freshman ",grades$year,ignore.case=TRUE)]<-"Freshman"
grades$year[grepl("Freshman ",grades$year,ignore.case=TRUE)]<-"Freshman"
sort(unique(grades$year))
## [1] "Freshman" "junior" "Junior" "junior " "senior" "Senior"
## [7] "Senir" "sophomor" "sophomore" "Sophomore"
# Sophomore:
grades$year[grepl("sophomore",grades$year,ignore.case=TRUE)]<-"Sophomore"
grades$year[grepl("sophomor",grades$year,ignore.case=TRUE)]<-"Sophomore"
sort(unique(grades$year))
## [1] "Freshman" "junior" "Junior" "junior " "senior" "Senior"
## [7] "Senir" "Sophomore"
# Junior:
grades$year[grepl("junior",grades$year,ignore.case=TRUE)]<-"Junior"
grades$year[grepl("junior ",grades$year,ignore.case=TRUE)]<-"Junior"
sort(unique(grades$year))
## [1] "Freshman" "Junior" "senior" "Senior" "Senir" "Sophomore"
# Senior:
grades$year[grepl("senior",grades$year,ignore.case=TRUE)]<-"Senior"
grades$year[grepl("Senir",grades$year,ignore.case=TRUE)]<-"Senior"
sort(unique(grades$year))
## [1] "Freshman" "Junior" "Senior" "Sophomore"
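# The same cleanup can probably be done in one pass. This is a minimal sketch on a made-up vector (demo_year, key, and lookup are hypothetical names), and it assumes the first two letters of every entry are spelled correctly, which holds for the variations listed above but may not hold for other data.

```r
# Hypothetical year values covering the kinds of variation seen above
demo_year <- c("freshman ", "Freshman", "sophomor", "Sophomore",
               "junior ", "Senir", "senior")

# Trim stray spaces and lower-case everything first
clean <- tolower(trimws(demo_year))

# Use the first two letters as a key (assumption: typos never hit these letters)
key <- substr(clean, 1, 2)
lookup <- c(fr = "Freshman", so = "Sophomore", ju = "Junior", se = "Senior")

# Look up the standard spelling for each key; unname() drops the key labels
fixed <- unname(lookup[key])
sort(unique(fixed))
```

# Any entry whose key is not in lookup would come back as NA, so it is worth checking sum(is.na(fixed)) afterwards.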
# Now I will check for duplicate rows using the duplicated function, which returns TRUE for any duplicates.
duplicated(grades)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE
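# Scanning a long vector of TRUE/FALSE by eye is easy to get wrong. A minimal sketch of asking R for the positions directly, using a made-up data frame (demo is hypothetical, not the assignment data):

```r
# Hypothetical data frame in which rows 2 and 3 are identical
demo <- data.frame(
  studentID = c(14, 15, 15, 16),
  year = c("Senior", "Senior", "Senior", "Senior"),
  grade = c("62", "56 (late)", "56 (late)", "66"),
  stringsAsFactors = FALSE
)

# Positions of rows that repeat an earlier row
dup_rows <- which(duplicated(demo))

# How many rows are flagged as duplicates
n_dups <- sum(duplicated(demo))

# Show every copy, including the first occurrence, to compare them side by side
both <- demo[duplicated(demo) | duplicated(demo, fromLast = TRUE), ]
both
```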
# Since row 16 is flagged as a duplicate, I'll use [!duplicated(grades), ] to remove it. The ! reverses the logical vector, turning TRUE (duplicates) into FALSE, so when I subset the data using brackets, only the unique rows are kept. I placed it in the row position of the brackets because I am only deleting duplicate rows, not columns.
grades<-grades[!duplicated(grades), ]
grades
## studentID year grade
## 1 1 Freshman 82
## 2 2 Freshman 70
## 3 3 Sophomore 40 (late)
## 4 4 Freshman 69
## 5 5 Sophomore 50 (late)
## 6 6 Freshman 74
## 7 7 Freshman 81
## 8 8 Freshman <NA>
## 9 9 Junior 89
## 10 10 Junior 86
## 11 11 Senior 68?
## 12 12 Senior 79
## 13 13 Senior 83
## 14 14 Senior 62
## 15 15 Senior 56 (late)
## 17 16 Senior 66
## 18 17 Sophomore <NA>
## 19 18 Sophomore 72
## 20 19 Sophomore 84
## 21 20 Junior 44
## 22 21 Sophomore 57 (late)
## 23 22 Freshman 60
## 24 23 Junior 77
## 25 24 Junior 83
## 26 25 Freshman 80
# Now I'm going to check for missing data, using the is.na and sum functions to find out how many values are missing. I'm not going to remove any of the rows with missing data until I know whether I need to. If I do, I will use grades<-grades[complete.cases(grades), ] or grades<-grades[!is.na(grades$grade), ].
sum(is.na(grades))
## [1] 2
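# The single total does not say where the missing values are. A minimal sketch of breaking it down by column and by row, using a made-up data frame (demo is hypothetical, not the assignment data):

```r
# Hypothetical data frame with two missing grades
demo <- data.frame(
  studentID = 1:4,
  year = c("Freshman", "Freshman", "Sophomore", "Junior"),
  grade = c("82", NA, NA, "89"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_col <- colSums(is.na(demo))
na_per_col

# The rows that contain at least one missing value
na_rows <- demo[!complete.cases(demo), ]
na_rows
```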
# Some of the entries in the grade column have (late) in them. I am going to remove that, because it keeps the column from being numeric. Late isn't really necessary information for me to know, and it will make getting averages more difficult. (I am now actually going insane trying to do this part because the ways listed in the tutorial are not meshing with my brain.) So I found the rows in the grade column containing the "(late)" text using grep, then deleted the last 6 characters using substr so that only the number is left (a trailing space remains, which as.numeric ignores), which lets me convert the column to numeric. I was getting a "Warning: NAs introduced by coercion" message, which comes from the "68?" entry for student 11: it cannot be read as a number, so it becomes NA. I added suppressWarnings to my code to silence it; I hope that's okay.
late_rows<-grep("\\(late\\)",grades$grade)
grades$grade[late_rows]<-substr(grades$grade[late_rows],1,nchar(grades$grade[late_rows])-6)
grades$grade<-suppressWarnings(as.numeric(grades$grade))
grades
## studentID year grade
## 1 1 Freshman 82
## 2 2 Freshman 70
## 3 3 Sophomore 40
## 4 4 Freshman 69
## 5 5 Sophomore 50
## 6 6 Freshman 74
## 7 7 Freshman 81
## 8 8 Freshman NA
## 9 9 Junior 89
## 10 10 Junior 86
## 11 11 Senior NA
## 12 12 Senior 79
## 13 13 Senior 83
## 14 14 Senior 62
## 15 15 Senior 56
## 17 16 Senior 66
## 18 17 Sophomore NA
## 19 18 Sophomore 72
## 20 19 Sophomore 84
## 21 20 Junior 44
## 22 21 Sophomore 57
## 23 22 Freshman 60
## 24 23 Junior 77
## 25 24 Junior 83
## 26 25 Freshman 80
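# Silencing the warning also hides which entries failed to convert. The block below is a minimal sketch of one way to find them, using a made-up vector (demo_grade, stripped, and problem are hypothetical names, not part of the assignment data).

```r
# Hypothetical grade entries, including a late tag, a stray "?" and a missing value
demo_grade <- c("82", "40 (late)", "68?", NA, "57 (late)")

# Remove the " (late)" tag; fixed = TRUE treats the pattern as literal text
stripped <- sub(" (late)", "", demo_grade, fixed = TRUE)

# Convert, silencing the coercion warning only for this one call
converted <- suppressWarnings(as.numeric(stripped))

# Entries that were not missing to begin with but still became NA
problem <- which(!is.na(stripped) & is.na(converted))
stripped[problem]
```

# Looking at stripped[problem] makes it a deliberate choice whether an entry like "68?" should become NA or be cleaned up to 68.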
# Okay, I just realized that my row numbers are not sequential after I deleted the duplicate row. I'm not sure if this is necessary, but it looks odd skipping from row 15 to 17, so I am going to renumber the rows. Setting row.names to NULL resets them to 1 through 25 and keeps the rows in their current order.
row.names(grades)<-NULL
grades
## studentID year grade
## 1 1 Freshman 82
## 2 2 Freshman 70
## 3 3 Sophomore 40
## 4 4 Freshman 69
## 5 5 Sophomore 50
## 6 6 Freshman 74
## 7 7 Freshman 81
## 8 8 Freshman NA
## 9 9 Junior 89
## 10 10 Junior 86
## 11 11 Senior NA
## 12 12 Senior 79
## 13 13 Senior 83
## 14 14 Senior 62
## 15 15 Senior 56
## 16 16 Senior 66
## 17 17 Sophomore NA
## 18 18 Sophomore 72
## 19 19 Sophomore 84
## 20 20 Junior 44
## 21 21 Sophomore 57
## 22 22 Freshman 60
## 23 23 Junior 77
## 24 24 Junior 83
## 25 25 Freshman 80
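# A note on how row names behave: they are stored as text, so sorting on them puts "10" before "2". The block below is a minimal sketch using a made-up data frame (demo is hypothetical, not the assignment data); it compares sorting the row names with simply resetting them.

```r
# Hypothetical data frame with 12 rows, then drop row 5 so the names skip a number
demo <- data.frame(x = 1:12)
demo <- demo[-5, , drop = FALSE]

# Row names are character strings, so they sort as text: "1", "10", "11", "12", "2", ...
text_sorted <- sort(row.names(demo))
head(text_sorted, 4)

# Resetting the row names renumbers them 1 to 11 and leaves the row order alone
row.names(demo) <- NULL
demo
```

# To put rows in studentID order, sorting on the numeric column itself, for example demo[order(demo$x), , drop = FALSE], avoids the text-sorting problem.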