1. Read the data

Use str() to check the data structure. The data has 2 columns ($xml_url and $full_text) and 107 rows, each row representing one version of the regulation:

df <- readRDS("immigration_reg_PIT.rds")
str(df)

## 'data.frame':    107 obs. of  2 variables:
##  $ xml_url  : Factor w/ 107 levels "ftp://205.193.86.89/PITXML/regulations/SOR-2002-227/20060322/SOR-2002-227.xml",..: 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "names")= chr [1:107] "xml_url" "xml_url" "xml_url" "xml_url" ...
##  $ full_text: chr  "part 1interpretation and applicationdivision 1interpretationdefinitions1 the definitions this subsection apply "| __truncated__ "part 1interpretation and applicationdivision 1interpretationdefinitions1 the definitions this subsection apply "| __truncated__ "part 1interpretation and applicationdivision 1interpretationdefinitions1 the definitions this subsection apply "| __truncated__ "part 1interpretation and applicationdivision 1interpretationdefinitions1 the definitions this subsection apply "| __truncated__ ...

Check the first column of the first three rows:

df[1:3,1] # check the first three rows and first column

##                                                                       xml_url 
## ftp://205.193.86.89/PITXML/regulations/SOR-2002-227/20060322/SOR-2002-227.xml 
##                                                                       xml_url 
## ftp://205.193.86.89/PITXML/regulations/SOR-2002-227/20060511/SOR-2002-227.xml 
##                                                                       xml_url 
## ftp://205.193.86.89/PITXML/regulations/SOR-2002-227/20060601/SOR-2002-227.xml 
## 107 Levels: ftp://205.193.86.89/PITXML/regulations/SOR-2002-227/20060322/SOR-2002-227.xml ...

2. Questions

Q1

Each amendment is associated with a date, which is stored in the “xml_url” column. Use a regular expression to extract the date and add it as a separate numeric column to your dataframe.

# use a regular expression to pull the 8-digit date (YYYYMMDD) out of each URL
urls <- as.character(df$xml_url)  # xml_url is a factor, so work on its strings
# regexpr() locates the first run of 8 digits; regmatches() returns the matched text
df$date <- as.Date(regmatches(urls, regexpr("\\d{8}", urls)), format = "%Y%m%d")

str(df)

## 'data.frame':    107 obs. of  3 variables:
##  $ xml_url  : Factor w/ 107 levels "ftp://205.193.86.89/PITXML/regulations/SOR-2002-227/20060322/SOR-2002-227.xml",..: 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "names")= chr [1:107] "xml_url" "xml_url" "xml_url" "xml_url" ...
##  $ full_text: chr  "part 1interpretation and applicationdivision 1interpretationdefinitions1 the definitions this subsection apply "| __truncated__ "part 1interpretation and applicationdivision 1interpretationdefinitions1 the definitions this subsection apply "| __truncated__ "part 1interpretation and applicationdivision 1interpretationdefinitions1 the definitions this subsection apply "| __truncated__ "part 1interpretation and applicationdivision 1interpretationdefinitions1 the definitions this subsection apply "| __truncated__ ...
##  $ date     : Date, format: "2006-03-22" "2006-05-11" ...
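
As a small self-contained check of the extraction step, the sketch below applies the same pattern to two of the URLs shown in the output above. It assumes the date is the only run of eight consecutive digits in each URL. The names `urls`, `dates` and `date_num` are illustrative, and the numeric form is only one possible reading of the "numeric column" the question asks for.

```r
# Two URLs copied from the output above
urls <- c(
  "ftp://205.193.86.89/PITXML/regulations/SOR-2002-227/20060322/SOR-2002-227.xml",
  "ftp://205.193.86.89/PITXML/regulations/SOR-2002-227/20060511/SOR-2002-227.xml"
)

# regexpr() finds the first run of 8 digits; regmatches() extracts that text
raw <- regmatches(urls, regexpr("[0-9]{8}", urls))

# Parse as Date objects, which format() and plot() understand
dates <- as.Date(raw, format = "%Y%m%d")

# Assumption: a plain numeric version such as 20060322, if one is wanted
date_num <- as.numeric(format(dates, "%Y%m%d"))

print(dates)
print(date_num)
```

Keeping the column as a Date makes the later year extraction and time-axis plots straightforward; the numeric form is mainly useful for sorting or exporting.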

Q2

Extract the year information from the date column. How many amendments were passed each year? Use a barplot() to visualize the yearly counts.

df$year <- as.numeric(format(df$date, "%Y"))  # extract the year from the date
counts <- table(df$year)                      # number of versions per year
# years go on the x-axis, counts on the y-axis
barplot(counts, main="Amendments per Year",
   xlab="Year", ylab="Number of Amendments")
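
To make clear what barplot() receives, here is a minimal sketch of the tabulation step on a handful of made-up dates (the `dates` vector is hypothetical and stands in for df$date). The names of the table become the bar labels and its values become the bar heights.

```r
# Hypothetical dates standing in for df$date
dates <- as.Date(c("2006-03-22", "2006-05-11", "2007-01-01",
                   "2008-06-30", "2008-12-15"))

# Extract the year and count how many dates fall in each one
years <- as.numeric(format(dates, "%Y"))
counts <- table(years)

print(names(counts))      # bar labels (the years, as character strings)
print(as.vector(counts))  # bar heights (number of dates per year)
```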

Q3

Turn the regulation text into data. Ignoring digits, punctuation and stopwords, how many words are included in each regulation? Visualize the word counts through a line chart. Have the regulations gotten longer or shorter?

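library(tm)
# drop any bytes that are not valid ASCII (sub='' replaces them with nothing)
df$full_text <- iconv(df$full_text, "ASCII", "UTF-8", sub='')
corpus <- VCorpus(VectorSource(df$full_text))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# one row per version, one column per term
# note: by default DocumentTermMatrix() keeps only terms of 3 or more characters
dtm <- DocumentTermMatrix(corpus)
len <- rowSums(as.matrix(dtm))  # total word count of each version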

df$len <- len  # add the word count of each version as a column
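
The counting logic can also be illustrated without tm. The following is a minimal base-R sketch, not the tm pipeline itself: `count_words` is a hypothetical helper and `toy_stop_words` is a tiny made-up list, not tm's English stopword list, so the numbers it produces will differ from the ones tm gives on the real data.

```r
# Hypothetical helper: count words after dropping digits, punctuation and stopwords
count_words <- function(text, stop_words) {
  text <- tolower(text)
  text <- gsub("[[:digit:]]+", "", text)   # remove digits
  text <- gsub("[[:punct:]]+", "", text)   # remove punctuation
  tokens <- strsplit(text, "[[:space:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]         # drop empty strings left by the removals
  sum(!tokens %in% stop_words)             # count the remaining non-stopwords
}

# Illustrative stopword list and texts (assumptions, not taken from the data)
toy_stop_words <- c("the", "of", "in", "a")
texts <- c("The definitions in section 12 apply.",
           "Part 1: interpretation of the Act")

lens <- vapply(texts, count_words, integer(1),
               stop_words = toy_stop_words, USE.NAMES = FALSE)
print(lens)
```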

Plot the word counts:

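# plot against the full date rather than the year, so that several versions
# from the same year do not stack up on a single x position
plot(df$date, len, type="l", col="green", lwd=5, xlab="Date", ylab="Word count", main="Word Counts of Each Version")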

Q4

Calculate the similarity of all amendments (use the distance between words, not 5-character grams) and plot it using a heat map.

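library(gplots)  # provides heatmap.2()

# pairwise binary distance between versions, based on which words each one contains;
# a smaller distance means the two versions are more similar
distance_matrix <- as.matrix(dist(as.matrix(dtm), method="binary"))

# Rowv=FALSE and Colv=FALSE keep the chronological order, so no dendrogram is drawn
heatmap.2(distance_matrix, dendrogram='none', Rowv=FALSE, Colv=FALSE, symm=TRUE, trace='none', density.info='none', main="Distance Between Amendments", labCol=paste("SOR", df$year, sep="-"), labRow=paste("SOR", df$year, sep="-"))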
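
To see what method="binary" measures, here is a minimal sketch on a made-up document-term matrix (the row names, terms and counts are all hypothetical). For each pair of rows, dist() looks only at the terms present in at least one of the two documents and returns the share of those terms that appear in just one of them. Counts are treated as present or absent.

```r
# Toy document-term matrix: rows are documents, columns are term counts
m <- matrix(c(2, 1, 0,
              1, 0, 0,
              0, 0, 3),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("v1", "v2", "v3"),
                            c("permit", "visa", "refugee")))

# Binary distance: 0 = same set of terms, 1 = no terms in common
d <- as.matrix(dist(m, method = "binary"))

# A similarity score can be obtained as 1 minus the distance
similarity <- 1 - d

print(d)
print(similarity)
```

Here v1 and v2 share "permit" but only v1 contains "visa", so one of the two relevant terms differs and the distance is 0.5. v3 shares no terms with the others, so its distance to both is 1.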

Q5

Three broad groupings are visible in the similarity heat map. Select an amendment date from each of the groupings. Search for the regulations on CanLII. Among the CanLII navigation menu items you will find “Versions”, which shows earlier versions of the same regulation and lets you compare texts. Use that functionality to compare your selected versions and identify one (major) substantive amendment each that distinguishes groups 1 and 2 and groups 2 and 3.
