---
title: "Project 1: ACC Distance Learning"
output: 
  html_notebook: 
    code_folding: hide
---


```{r}
library(dplyr)
library(tidyr)
library(rvest)
library(stringr) # needed for str_detect() used below
setwd('/Users/Zenodrakos/Workspaces/R_workspace/SampleProjects/Projects')
```

## Goal:
Provide an overview of the distance learning classes currently offered at Austin Community College (ACC).
The analysis proceeds in several steps:

#### Step 1:
ACC's distance learning webpage was scraped in order to acquire:

1) The broad categories of subjects that are being taught and 
2) The complete list of courses offered.

```{r,results="hide"}
# Scraping the broad categories 
acc2 <- read_html('http://www6.austincc.edu/schedule/index.php?op=browse&opclass=ViewSched_location&term=218S000&ct=CC&locationid=DIL&reporting_year=2018'
                  ,options = "HUGE")
cat <- acc2 %>%
  html_nodes("h3") %>%
  html_text()
View(cat)




```
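
The scraping pattern in this step can be tried offline. What follows is a minimal sketch, assuming a small made-up HTML fragment: `demo_html`, its headings and its course titles are hypothetical and are not ACC's actual markup. It uses the same `read_html()`, `html_nodes()` and `html_text()` calls as the chunk above:

```r
library(rvest)

# Hypothetical HTML fragment standing in for the schedule page (an assumption)
demo_html <- "<html><body>
  <h3>Accounting</h3>
  <h4><a href='#'>ACCT 0001 Example Accounting Course</a></h4>
  <h3>Biology</h3>
  <h4><a href='#'>BIOL 0001 Example Biology Course</a></h4>
</body></html>"

demo_page <- read_html(demo_html)

# Category headings are taken from <h3>, course links from <h4><a>
demo_categories <- demo_page %>% html_nodes("h3") %>% html_text()
demo_courses <- demo_page %>% html_nodes("h4 a") %>% html_text()

demo_categories
demo_courses
```

On the real page, some `<h3>` nodes are not categories, which is why the next chunk drops a handful of rows by index.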

```{r,results="hide"}

Categories <- cat[-c(1,5, 10, 12,14,37,60)] # dropping the rows that were incorrectly scraped
#categClean <- gsub("\r\n\tAre you a new student and interested in our programs and trying to register for classes? If so please visit our website to learn more about our program http://www.austincc.edu/tcm Also, schedule an appointment to meet with the Departmental Advisor  at davidm@austincc.edu.
#\r\n\t","",categClean) # and a last one from row 11
##\r\n\tSocial Media Communication: Degree & Certificate Courses
#View(categClean)
Categories <- gsub("[\r\n\t]","",Categories)


```
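
The whitespace cleanup can be illustrated on its own. This is a minimal sketch, assuming hypothetical raw strings (`raw_demo` is made up for illustration); it combines the `gsub()` call used above with `trimws()` to also remove leading and trailing spaces:

```r
# Hypothetical raw headings with the kind of whitespace debris scraping leaves behind
raw_demo <- c("\r\n\tAccounting", "Biology\r\n", "\tComputer Science ")

# Remove carriage returns, newlines and tabs, then trim leading/trailing spaces
clean_demo <- trimws(gsub("[\r\n\t]", "", raw_demo), which = "both")
clean_demo
```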

#### Findings:
Approximately 55 course categories are currently offered. The complete list of categories in alphabetical order is shown below (click the Next button to page through all the categories):


```{r}
#NEED TO MAKE IT WORK
categClean2 <- as.data.frame(Categories)
categClean2

```



#### Step 2:
Taking the exploration one step further, ACC's website is scraped again in more detail.


```{r,results="hide"}
#### Scraping the courses with their (sub)-sections

# read_html() replaces the deprecated html() function from older rvest versions
acc <- read_html('http://www6.austincc.edu/schedule/index.php?op=browse&opclass=ViewSched_location&term=218S000&ct=CC&locationid=DIL&reporting_year=2018',
            options="HUGE")


classes <- acc%>%
  html_nodes("h4 a")%>%
  html_text()

#classes
head(classes)
```

```{r}
# writing this to a CSV file so I have a stored dataset
#write.table(classes, file="classes.csv",sep=",",row.names=F) # exported the data under the name registration.csv

# Importing the dataset - it is included in the zip folder under the name registration.csv
registration <- read.csv("/Users/Zenodrakos/Workspaces/R_workspace/SampleProjects/Projects/registration.csv")

```

```{r}
names(registration) <- ("Text")
counter <- 0

for(i in 1:nrow(registration)) {
  if(str_detect(registration$Text[i], "[A-Z]") == TRUE){
    counter <- counter + 1
    registration$id[i] <- counter
  } else {
    registration$id[i] <- counter
  }
}
View(registration)

course_text <- registration %>% filter(str_detect(Text, "[A-Z]"))
course_seat <- registration %>% filter(!str_detect(Text, "[A-Z]"))

course_info <- course_text %>% left_join(course_seat, by = "id")
names(course_info) <- c("Text", "id", "course_seat")
head(course_info)

course_info <- course_info %>% separate(course_seat, c("enrolled", "class_size", "random"), sep = "/")
View(course_info)

course_info$enrolled <- gsub(" ", "", course_info$enrolled)
course_info$class_size <- gsub(" ", "", course_info$class_size)
course_info$random <- gsub(" ", "", course_info$random)

course_info$enrolled <- as.numeric(substr(course_info$enrolled, 2, nchar(course_info$enrolled)))
View(course_info)

course_info_2 <- course_info %>% filter(!is.na(enrolled)) # keep only rows where a seat count was parsed

for(i in 1:nrow(course_info_2)) {
  if(str_detect(course_info_2$class_size[i], "]") == TRUE){
    course_info_2$class_size[i] <- as.numeric(substr(course_info_2$class_size[i], 1, nchar(course_info_2$class_size[i]) - 1))
  } else {
    course_info_2$class_size[i] <- as.numeric(course_info_2$class_size[i])
    course_info_2$random[i] <- as.numeric(substr(course_info_2$random[i], 1, nchar(course_info_2$random[i]) - 1))
  }
}

course_info_2$class_size <- as.numeric(course_info_2$class_size)
course_info_2$random <- as.numeric(course_info_2$random)

View(course_info_2)

require( data.table )

course_info_dt <- data.table(course_info_2)

course_info_dt[ , Index := 1:.N , by = c("id") ]
View(course_info_dt)

course_section <- course_info_dt %>%
  group_by(id) %>%
  summarise(number_section = n())

course_info_dt2 <- course_info_dt %>% left_join(course_section, by = "id")
View(course_info_dt2) #there are 437 ID's which means 437 diff subjects (I get 440, so there is an inconsistency of 3)

####3 2nd step - Work with Text row
def <- course_info_dt2$Text 
View(def) 


course_prefix <- substr(def, 1, 4)
#course_code <- substr(def, 5, 9) # 
course_code <- substr(def, 6, 9) 
course_name <- substr(def, 10, nchar(def)) # nchar() is each string's length; length(def) is the number of rows

head(course_prefix)
head(course_code)
head(course_name)
View(def)

final <- tibble(course_prefix, course_code, course_name) # data_frame() is deprecated in favour of tibble()
head(final)
View(final) 

final$def <- def
final <- dplyr::rename(final, Text = def)
final$id <- course_info_dt2$id
final$enrolled <- course_info_dt2$enrolled
final$class_size <- course_info_dt2$class_size
final$waitlisted <- course_info_dt2$random
final$Index <- course_info_dt2$Index
final$number_section <- course_info_dt2$number_section

#Now the dataset is  clean and ready for further inspection:
```
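
The core idea of the cleaning chunk is that title rows contain capital letters while seat-count rows do not, so every seat row belongs to the most recent title above it. The following is an alternative base-R sketch of that idea, not the method used above. It assumes seat strings look like `[12 / 30]` or `[25 / 30 / 2]` (enrolled / class size / optional third number); that format is inferred from the parsing code rather than confirmed, and `lines_demo` is hypothetical data:

```r
# Hypothetical scraped lines: a course title followed by its section seat counts
lines_demo <- c("ENGL 1302 English Composition II", "[12 / 30]", "[25 / 30 / 2]",
                "MATH 0001 Example Math Course", "[30 / 30]")

# Title rows contain a capital letter; a running count of titles gives the course id
is_title <- grepl("[A-Z]", lines_demo)
course_id <- cumsum(is_title)

# Extract every run of digits from the seat rows
seats <- lines_demo[!is_title]
nums <- regmatches(seats, gregexpr("[0-9]+", seats))

parsed_demo <- data.frame(
  id         = course_id[!is_title],
  enrolled   = as.numeric(sapply(nums, `[`, 1)),
  class_size = as.numeric(sapply(nums, `[`, 2)),
  waitlisted = as.numeric(sapply(nums, `[`, 3))  # NA when no third number is present
)
parsed_demo
```

Using `cumsum()` on the logical vector avoids the explicit counter loop, and extracting digit runs avoids handling the brackets separately.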

#### Findings:


There are about 430 different courses with a total of 754 course sections.
The largest number of sections for a single course is 11, for "English Composition II".
```{r,results="hide"}
nrow(final) # the total number of sections (from all courses) is 754
# Number of distinct courses
d1 <- final %>% 
  distinct(id)
nrow(d1)

#min and max number of sections for a course:
#min(final$number_section) #min number of sections  # It is very logical that it is one
max(final$number_section) # max number of sections

class_with_most_sections <- final %>%
  select(course_name,number_section) %>% 
  filter(number_section == max(number_section)) # the course(s) with the most sections (11)

```

```{r}
d3 <- as.data.frame(final %>%
                group_by (course_name,id) %>% 
                summarise(total_enrolled= sum(enrolled),total_capacity = sum(class_size)))
View(d3)

# d4= d3 but sorted from largest total enrolment to smallest total enrolment
d4 <- d3[order(-d3$total_enrolled),]
View(d4)  #

```
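
The per-course aggregation can be checked on a toy table. This is a minimal sketch, assuming hypothetical section-level data (`sections_demo` and its values are made up); it groups only by course name for brevity, whereas the chunk above also groups by `id`:

```r
library(dplyr)

# Hypothetical section-level data: one row per course section
sections_demo <- tibble(
  course_name = c("Course A", "Course A", "Course B", "Course C", "Course C"),
  enrolled    = c(20, 25, 10, 30, 28),
  class_size  = c(30, 30, 25, 30, 30)
)

# Sum enrollment and capacity over all sections of each course,
# then sort from largest to smallest total enrollment
totals_demo <- sections_demo %>%
  group_by(course_name) %>%
  summarise(
    number_section = n(),
    total_enrolled = sum(enrolled),
    total_capacity = sum(class_size)
  ) %>%
  arrange(desc(total_enrolled))

totals_demo
```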


The top 10% of courses (43 courses) by total enrollment across all sections are listed below (click the Next button to page through all 43):
```{r}
#keeping the top 10% or the 43 classes with the most registered students in total
head(d4,n=430*10/100)
s <- head(d4,n=430*10/100)
```
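
The cutoff of 43 comes from the rounded figure of 430 courses. As a hedged alternative, the number of rows could be derived from the table itself. This is a minimal sketch on hypothetical data (`totals_sorted_demo` is made up), and rounding up with `ceiling()` is a choice made here, not one taken from the analysis above:

```r
# Hypothetical totals for 20 courses, already sorted by enrollment
totals_sorted_demo <- data.frame(
  course_name    = paste("Course", 1:20),
  total_enrolled = seq(200, 10, by = -10)
)

# Number of rows that make up the top 10%, rounded up
n_top <- ceiling(nrow(totals_sorted_demo) / 10)
top_demo <- head(totals_sorted_demo, n = n_top)
top_demo
```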

The plot below shows the 15 courses with the largest total capacity, displaying the total enrollment and the total capacity for each:


```{r}
library(reshape2) # provides melt() for data frames
library(ggplot2)

# the 15 courses with the largest total capacity
d3a <- head(d3[order(d3$total_capacity, decreasing = TRUE),], 15)

# keep course_name, total_enrolled and total_capacity, then reshape to long format
d3.plottable <- d3a[, c(1,3,4)]
d3.plottable <- melt(d3.plottable, id.vars = "course_name")

g <- ggplot(d3.plottable, aes(x = course_name, y = value))
g <- g + geom_bar(aes(fill = variable), position = position_dodge(), stat = "identity") + 
  coord_flip() + theme(legend.position = "top")
g <- g + labs(x = "Course Name")
g <- g + labs(y = "Number of Students")
g


```
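
The reshape-then-plot step could also be written with `tidyr::pivot_longer()`, since `tidyr` is already loaded in this notebook. This is a minimal sketch on hypothetical data (`totals_wide_demo` is made up) and swaps in `pivot_longer()` and `geom_col()` for the `melt()` and `geom_bar(stat = "identity")` calls used above:

```r
library(tidyr)
library(ggplot2)

# Hypothetical per-course totals in wide format
totals_wide_demo <- data.frame(
  course_name    = c("Course A", "Course B"),
  total_enrolled = c(45, 10),
  total_capacity = c(60, 25)
)

# One row per (course, measure) pair, which is the shape ggplot2 needs for dodged bars
totals_long_demo <- pivot_longer(
  totals_wide_demo,
  cols      = c(total_enrolled, total_capacity),
  names_to  = "variable",
  values_to = "value"
)

p_demo <- ggplot(totals_long_demo, aes(x = course_name, y = value, fill = variable)) +
  geom_col(position = position_dodge()) +
  coord_flip() +
  labs(x = "Course Name", y = "Number of Students") +
  theme(legend.position = "top")
p_demo
```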


