Week 2 Homework

Question 1

grades<-read.csv("C:/Users/kelse/OneDrive/Documents/Research Design Analysis/Week 2_HW_gradesData.csv", header=TRUE)
grades
##    studentID      year     grade
## 1          1 freshman         82
## 2          2 freshman         70
## 3          3  sophomor 40 (late)
## 4          4 freshman         69
## 5          5 sophomore 50 (late)
## 6          6 freshman         74
## 7          7 freshman         81
## 8          8 Freshman       <NA>
## 9          9    Junior        89
## 10        10    junior        86
## 11        11    Senior       68?
## 12        12     Senir        79
## 13        13    senior        83
## 14        14    Senior        62
## 15        15    Senior 56 (late)
## 16        15    Senior 56 (late)
## 17        16    Senior        66
## 18        17 sophomore      <NA>
## 19        18 sophomore        72
## 20        19 sophomore        84
## 21        20    Junior        44
## 22        21 Sophomore 57 (late)
## 23        22 freshman         60
## 24        23   junior         77
## 25        24    Junior        83
## 26        25  freshman        80

Question 2

# I am going to start by sorting the data in a few different ways that might be more manageable for me to visualize and comprehend. I realize that this isn't always necessary for cleaning data, and it can look messy and overwhelming. I won't include this in my final R Markdown files in the future; I just want to give you a window into what coding with ADHD looks like and the ways that I make it work for me. :) 
newgrades<-grades[order(grades$grade), ]
newgrades
##    studentID      year     grade
## 3          3  sophomor 40 (late)
## 21        20    Junior        44
## 5          5 sophomore 50 (late)
## 15        15    Senior 56 (late)
## 16        15    Senior 56 (late)
## 22        21 Sophomore 57 (late)
## 23        22 freshman         60
## 14        14    Senior        62
## 17        16    Senior        66
## 11        11    Senior       68?
## 4          4 freshman         69
## 2          2 freshman         70
## 19        18 sophomore        72
## 6          6 freshman         74
## 24        23   junior         77
## 12        12     Senir        79
## 26        25  freshman        80
## 7          7 freshman         81
## 1          1 freshman         82
## 13        13    senior        83
## 25        24    Junior        83
## 20        19 sophomore        84
## 10        10    junior        86
## 9          9    Junior        89
## 8          8 Freshman       <NA>
## 18        17 sophomore      <NA>
# I am not going to attempt to sort by year, because I can see that there are too many inconsistencies in that column. I will start by finding out how many distinct values it contains. As I understand it, 'year' is the variable (the column) and each specific entry, such as "sophomore", is a value of that variable (not sure if you can just verify that for me or not). The unique function returns each distinct value once, and wrapping it in the sort function alphabetizes the results for me. 

sort(unique(grades$year))
##  [1] "freshman"  "freshman " "Freshman " "junior"    "Junior"    "junior "  
##  [7] "senior"    "Senior"    "Senir"     "sophomor"  "sophomore" "Sophomore"
# Then I'm going to use the class function to see if the year column contains characters or factors.
class(grades$year)
## [1] "character"
# Since the column already contains characters, I don't have to convert anything. If it had been a factor, I would have converted it with grades$year<-as.character(grades$year). 

# Now I will work on finding and replacing all the different variations of Freshman, Sophomore, Junior, and Senior using the grepl function. I'll double-check my work after each year just for my own peace of mind. I won't normally include that part in my R Markdown file. 
# Freshman: grepl matches the pattern anywhere in the string, so one case-insensitive call covers "freshman", "freshman ", and "Freshman ".
grades$year[grepl("freshman",grades$year,ignore.case=TRUE)]<-"Freshman"
sort(unique(grades$year))
##  [1] "Freshman"  "junior"    "Junior"    "junior "   "senior"    "Senior"   
##  [7] "Senir"     "sophomor"  "sophomore" "Sophomore"
# Sophomore: 
grades$year[grepl("sophomor",grades$year,ignore.case=TRUE)]<-"Sophomore" # "sophomor" also matches "sophomore" and "Sophomore"
sort(unique(grades$year))
## [1] "Freshman"  "junior"    "Junior"    "junior "   "senior"    "Senior"   
## [7] "Senir"     "Sophomore"
# Junior: 
grades$year[grepl("junior",grades$year,ignore.case=TRUE)]<-"Junior" # also matches "junior " with the trailing space
sort(unique(grades$year))
## [1] "Freshman"  "Junior"    "senior"    "Senior"    "Senir"     "Sophomore"
# Senior: 
grades$year[grepl("senior",grades$year,ignore.case=TRUE)]<-"Senior"
grades$year[grepl("Senir",grades$year,ignore.case=TRUE)]<-"Senior"
sort(unique(grades$year))
## [1] "Freshman"  "Junior"    "Senior"    "Sophomore"
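# A possible alternative to one grepl call per spelling: trim whitespace and lowercase everything first, then map the few remaining misspellings. This is only a sketch on a made-up vector (year_raw, fixes, and hit are names invented for illustration), not the tutorial's method.

```r
# Hypothetical sample of messy values like the ones in grades$year
year_raw <- c("freshman ", "Freshman ", "sophomor", "junior ", "Senir", "senior")

# Trim whitespace and lowercase so "Freshman " and "freshman" become identical
year_clean <- tolower(trimws(year_raw))

# Map the remaining misspellings to the correct spelling
fixes <- c(sophomor = "sophomore", senir = "senior")
hit <- year_clean %in% names(fixes)
year_clean[hit] <- unname(fixes[year_clean[hit]])

# Capitalize the first letter of each value
year_clean <- paste0(toupper(substring(year_clean, 1, 1)), substring(year_clean, 2))
sort(unique(year_clean))
```

# The lookup vector makes every correction visible in one place, which may be easier to audit than a series of pattern matches.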
# Now I will check for duplicate rows using the duplicated function, which returns TRUE for any duplicates. 
duplicated(grades)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE
#Since there are duplicates in the data, I'll use [!duplicated(grades), ] to remove them. The ! reverses the logical vector, turning TRUE (duplicates) into FALSE, so when I subset the data using brackets, only the unique rows are kept. I placed it in the row spot of the brackets because I am only deleting duplicate rows, not columns. 

grades<-grades[!duplicated(grades), ]
grades
##    studentID      year     grade
## 1          1  Freshman        82
## 2          2  Freshman        70
## 3          3 Sophomore 40 (late)
## 4          4  Freshman        69
## 5          5 Sophomore 50 (late)
## 6          6  Freshman        74
## 7          7  Freshman        81
## 8          8  Freshman      <NA>
## 9          9    Junior        89
## 10        10    Junior        86
## 11        11    Senior       68?
## 12        12    Senior        79
## 13        13    Senior        83
## 14        14    Senior        62
## 15        15    Senior 56 (late)
## 17        16    Senior        66
## 18        17 Sophomore      <NA>
## 19        18 Sophomore        72
## 20        19 Sophomore        84
## 21        20    Junior        44
## 22        21 Sophomore 57 (late)
## 23        22  Freshman        60
## 24        23    Junior        77
## 25        24    Junior        83
## 26        25  Freshman        80
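# To convince myself how duplicated behaves, here is a small sketch on a made-up data frame (df and df_unique are invented names). It shows that only the second copy of a repeated row is flagged, so the first copy survives the subsetting.

```r
# Made-up data frame in which row 3 repeats row 2
df <- data.frame(id = c(1, 2, 2, 3), score = c(90, 85, 85, 70))

duplicated(df)        # TRUE only for the later copy of a repeated row
sum(duplicated(df))   # number of duplicate rows

# Keep every row that is not flagged as a duplicate
df_unique <- df[!duplicated(df), ]
nrow(df_unique)
```

# unique(df) should give the same rows as df[!duplicated(df), ].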
# Now I'm going to check for missing data, using the is.na and sum functions to count how many values are missing. I'm not going to remove any rows with missing data until I know whether I need to. If I do, I will use grades<-grades[complete.cases(grades), ] OR grades<-grades[!is.na(grades$grade), ]
sum(is.na(grades))
## [1] 2
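# sum(is.na(...)) gives one total for the whole data frame. A sketch of how the count could be broken down by column and by row, using a made-up data frame (df, na_rows, and df_complete are invented names):

```r
# Made-up data with two missing grades and one unparseable entry
df <- data.frame(id = 1:4, grade = c("82", NA, "68?", NA), stringsAsFactors = FALSE)

colSums(is.na(df))                 # missing values per column
na_rows <- which(is.na(df$grade))  # which rows are missing a grade
na_rows

# Rows with no missing values in any column
df_complete <- df[complete.cases(df), ]
df_complete
```

# Note that "68?" is not counted here: it is text, not NA, until the column is converted to numeric.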
# Some of the entries in the grade column have (late) in them. I am going to remove that text because it prevents the column from being treated as numeric, and whether a grade was late isn't information I need for computing averages. (I am now actually going insane trying to do this part because the ways listed in the tutorial are not meshing with my brain.) So I found the entries containing "(late)" using grep, then deleted the last 6 characters using substr. That removes "(late)" but leaves the space before it, which as.numeric ignores, so I can then convert the column to numeric. I was getting a "Warning: NAs introduced by coercion" message, which comes from the "68?" entry: it can't be read as a number, so it becomes NA. I added suppressWarnings to my code; I hope that's okay. 
late_rows<-grep("\\(late\\)",grades$grade)
grades$grade[late_rows]<-substr(grades$grade[late_rows],1,nchar(grades$grade[late_rows])-6)
grades$grade<-suppressWarnings(as.numeric(grades$grade))
grades
##    studentID      year grade
## 1          1  Freshman    82
## 2          2  Freshman    70
## 3          3 Sophomore    40
## 4          4  Freshman    69
## 5          5 Sophomore    50
## 6          6  Freshman    74
## 7          7  Freshman    81
## 8          8  Freshman    NA
## 9          9    Junior    89
## 10        10    Junior    86
## 11        11    Senior    NA
## 12        12    Senior    79
## 13        13    Senior    83
## 14        14    Senior    62
## 15        15    Senior    56
## 17        16    Senior    66
## 18        17 Sophomore    NA
## 19        18 Sophomore    72
## 20        19 Sophomore    84
## 21        20    Junior    44
## 22        21 Sophomore    57
## 23        22  Freshman    60
## 24        23    Junior    77
## 25        24    Junior    83
## 26        25  Freshman    80
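# Another way this step might be done, sketched on a made-up vector (raw, no_late, num, and failed are invented names): strip the "(late)" suffix with sub, then list exactly which entries fail to convert, so that suppressWarnings doesn't hide anything unexpected.

```r
# Made-up grade entries resembling the ones in grades$grade
raw <- c("82", "40 (late)", "68?", NA)

# Remove an optional run of spaces followed by "(late)" at the end of the string
no_late <- sub(" *\\(late\\)$", "", raw)

# Convert to numeric; entries that are not numbers become NA
num <- suppressWarnings(as.numeric(no_late))

# Entries that had a value but could not be converted
failed <- no_late[is.na(num) & !is.na(no_late)]
failed
```

# Checking failed before moving on shows which values the coercion warning was about.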
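# Okay, I just realized that my row numbers are not sequential after I deleted the duplicate row. So I am going to look up how to renumber the rows. I'm not sure if this is necessary, but it looks odd skipping from row 15 to 17. Note that row names are text, so order(row.names(grades)) sorts them alphabetically ("1", "10", "11", ...), which is why the rows below are no longer in studentID order; row.names(grades)<-NULL on its own would have renumbered the rows without reordering them.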
grades<-grades[order(row.names(grades)), ]
row.names(grades)<-NULL
grades 
##    studentID      year grade
## 1          1  Freshman    82
## 2         10    Junior    86
## 3         11    Senior    NA
## 4         12    Senior    79
## 5         13    Senior    83
## 6         14    Senior    62
## 7         15    Senior    56
## 8         16    Senior    66
## 9         17 Sophomore    NA
## 10        18 Sophomore    72
## 11         2  Freshman    70
## 12        19 Sophomore    84
## 13        20    Junior    44
## 14        21 Sophomore    57
## 15        22  Freshman    60
## 16        23    Junior    77
## 17        24    Junior    83
## 18        25  Freshman    80
## 19         3 Sophomore    40
## 20         4  Freshman    69
## 21         5 Sophomore    50
## 22         6  Freshman    74
## 23         7  Freshman    81
## 24         8  Freshman    NA
## 25         9    Junior    89
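# A small sketch of renumbering rows without reordering them, on a made-up data frame (df is an invented name). It also shows why sorting row names as text puts "10" before "2".

```r
# Made-up data frame with 12 rows, then drop row 3 to leave a gap
df <- data.frame(id = 1:12)
df <- df[-3, , drop = FALSE]
row.names(df)            # "1" "2" "4" ... "12": the gap at 3 is visible

# Row names are character strings, so they sort alphabetically, not numerically
sort(c("1", "2", "10"))

# Resetting the row names renumbers 1..n and leaves the row order alone
row.names(df) <- NULL
row.names(df)
df$id
```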

Question 3

studentsperyear<-table(grades$year)
studentsperyear
## 
##  Freshman    Junior    Senior Sophomore 
##         8         5         6         6
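# table lists the years alphabetically, which puts Junior before Sophomore. If class-standing order is wanted, one option is to turn the column into a factor with explicit levels. A sketch on a made-up vector (year, year_f, and counts are invented names):

```r
# Made-up year values
year <- c("Freshman", "Senior", "Junior", "Sophomore", "Freshman")

# A factor with explicit levels controls the order used by table()
year_f <- factor(year, levels = c("Freshman", "Sophomore", "Junior", "Senior"))
counts <- table(year_f)
counts
```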

Question 4

a

total_average<-mean(grades$grade,na.rm=TRUE)
total_average
## [1] 70.18182

b

year_average<-by(grades$grade,grades$year,mean,na.rm=TRUE)
year_average
## grades$year: Freshman
## [1] 73.71429
## ------------------------------------------------------------ 
## grades$year: Junior
## [1] 75.8
## ------------------------------------------------------------ 
## grades$year: Senior
## [1] 69.2
## ------------------------------------------------------------ 
## grades$year: Sophomore
## [1] 60.6
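# The by output is a bit long to read. Two other base R ways to get a mean per group, sketched on a made-up data frame (df, means, and agg are invented names): tapply returns a compact named result, and aggregate returns a data frame.

```r
# Made-up grades with one missing value
df <- data.frame(year  = c("Junior", "Junior", "Senior", "Senior"),
                 grade = c(80, 90, 70, NA))

# Named result: one mean per year, ignoring NA
means <- tapply(df$grade, df$year, mean, na.rm = TRUE)
means

# Data frame result; the formula interface drops rows with NA by default
agg <- aggregate(grade ~ year, data = df, FUN = mean)
agg
```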

Question 5

a

newaverage<-75-total_average
curved_grades<-grades$grade+newaverage
curved_grades
##  [1] 86.81818 90.81818       NA 83.81818 87.81818 66.81818 60.81818 70.81818
##  [9]       NA 76.81818 74.81818 88.81818 48.81818 61.81818 64.81818 81.81818
## [17] 87.81818 84.81818 44.81818 73.81818 54.81818 78.81818 85.81818       NA
## [25] 93.81818
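# Adding the same constant to every grade shifts the mean by that constant, so the curved mean should come out to 75. A quick check of that reasoning on made-up numbers (g, shift, and curved are invented names):

```r
# Made-up grades with one missing value
g <- c(82, 70, NA, 60)

# Amount needed to move the mean to 75
shift <- 75 - mean(g, na.rm = TRUE)
curved <- g + shift

mean(curved, na.rm = TRUE)   # should equal 75, up to floating-point error
curved                       # the NA stays NA
```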

b

grades$curved_grades<-curved_grades
grades
##    studentID      year grade curved_grades
## 1          1  Freshman    82      86.81818
## 2         10    Junior    86      90.81818
## 3         11    Senior    NA            NA
## 4         12    Senior    79      83.81818
## 5         13    Senior    83      87.81818
## 6         14    Senior    62      66.81818
## 7         15    Senior    56      60.81818
## 8         16    Senior    66      70.81818
## 9         17 Sophomore    NA            NA
## 10        18 Sophomore    72      76.81818
## 11         2  Freshman    70      74.81818
## 12        19 Sophomore    84      88.81818
## 13        20    Junior    44      48.81818
## 14        21 Sophomore    57      61.81818
## 15        22  Freshman    60      64.81818
## 16        23    Junior    77      81.81818
## 17        24    Junior    83      87.81818
## 18        25  Freshman    80      84.81818
## 19         3 Sophomore    40      44.81818
## 20         4  Freshman    69      73.81818
## 21         5 Sophomore    50      54.81818
## 22         6  Freshman    74      78.81818
## 23         7  Freshman    81      85.81818
## 24         8  Freshman    NA            NA
## 25         9    Junior    89      93.81818

Question 6

grades[order(-grades$curved_grades), ]
##    studentID      year grade curved_grades
## 25         9    Junior    89      93.81818
## 2         10    Junior    86      90.81818
## 12        19 Sophomore    84      88.81818
## 5         13    Senior    83      87.81818
## 17        24    Junior    83      87.81818
## 1          1  Freshman    82      86.81818
## 23         7  Freshman    81      85.81818
## 18        25  Freshman    80      84.81818
## 4         12    Senior    79      83.81818
## 16        23    Junior    77      81.81818
## 22         6  Freshman    74      78.81818
## 10        18 Sophomore    72      76.81818
## 11         2  Freshman    70      74.81818
## 20         4  Freshman    69      73.81818
## 8         16    Senior    66      70.81818
## 6         14    Senior    62      66.81818
## 15        22  Freshman    60      64.81818
## 14        21 Sophomore    57      61.81818
## 7         15    Senior    56      60.81818
## 21         5 Sophomore    50      54.81818
## 13        20    Junior    44      48.81818
## 19         3 Sophomore    40      44.81818
## 3         11    Senior    NA            NA
## 9         17 Sophomore    NA            NA
## 24         8  Freshman    NA            NA
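# The minus sign works for numeric columns. order also has decreasing and na.last arguments, which control where the NA rows go. A sketch on a made-up data frame (df, ranked, and ranked_no_na are invented names):

```r
# Made-up curved grades with one missing value
df <- data.frame(id = 1:4, curved = c(70.5, NA, 93.8, 61.2))

# Highest first, NA rows at the bottom
ranked <- df[order(df$curved, decreasing = TRUE, na.last = TRUE), ]
ranked

# na.last = NA drops the NA rows from the ordering entirely
ranked_no_na <- df[order(df$curved, decreasing = TRUE, na.last = NA), ]
ranked_no_na
```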

Side Quest

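# I am going to look into how to round these decimal numbers just to clean up the look of things for my own personal sanity. Okay, it seems like curved_grades<-ceiling(curved_grades) and then grades$curved_grades<-curved_grades does the trick. Note that ceiling always rounds up to the next whole number, so round(curved_grades) or round(curved_grades,1) would be the choice for rounding to the nearest value instead. I'm leaving this in for future reference for myself. 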
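# For future reference, a quick comparison of ceiling and round on made-up numbers (x is an invented name):

```r
# Made-up curved grades
x <- c(86.81818, 74.2, 60.4)

ceiling(x)     # always rounds up:            87 75 61
round(x)       # rounds to the nearest whole: 87 74 60
round(x, 1)    # keeps one decimal place:     86.8 74.2 60.4
```

# Since ceiling can only raise values, the mean of the rounded grades would end up at or above the target.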

Question 7

write.csv(grades,file="grades_take2.csv",row.names=FALSE)
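# A way to double-check the export is to read the file back in and compare. This sketch uses a made-up data frame and a temporary file (df, path, and back are invented names) so that it doesn't touch grades_take2.csv.

```r
# Made-up data frame with a missing value
df <- data.frame(id = 1:3, grade = c(82.5, NA, 70))

# Write to a temporary file without the row-name column
path <- tempfile(fileext = ".csv")
write.csv(df, file = path, row.names = FALSE)

# Read it back and compare with the original
back <- read.csv(path)
names(back)
all.equal(df$grade, back$grade)
```

# Missing values are written as NA and come back as NA when the file is read.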