Source file ⇒ Assignment_9.Rmd

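library(rvest)      # read_html(), html_nodes(), html_table(); re-exports %>%
library(dplyr)      # filter(), row_number()
library(ggplot2)    # plots below
library(lubridate)  # dmy()

page <- "http://en.wikipedia.org/wiki/List_of_nuclear_reactors"
xpath <- '//*[@id="mw-content-text"]/table'
table_list <- page %>%
  read_html() %>%   # html() is deprecated in rvest; read_html() replaces it
  html_nodes(xpath = xpath) %>%
  html_table(fill = TRUE)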

1. Find the table element.

Start with head(table_list[[5]]) and go down the list until you find the table for Japan. Keep in mind that the tables are listed by number in the same order that they appear on the page.

table <- table_list[[23]]  # table 23 is the one for Japan
names(table)
##  [1] "Name"                      "Reactor No."              
##  [3] "Reactor"                   NA                         
##  [5] "Status"                    "Capacity in MW"           
##  [7] NA                          "Construction Start Date"  
##  [9] "Commercial Operation Date" "Closure"

2. Look at it using View()

The contents of row 1 don’t refer to a case but to the variable names. To clean this table, you will want to create meaningful variable names and then delete row 1. You may need to refer to the original HTML document to figure out what are appropriate names.

Here are some examples of the types of statements you might find helpful for fixing the variable names.

new_names <- c("Name", "Reactor No.", "Type", "Model", "Status", "Net", "Gross",
               "Construction_Start_Date", "Commercial_Operation_Date", "Closure")
names(table) <- new_names # reset the variable names
table <- table %>% filter(row_number() != 1) # drop the first row

A quick visualization

plot1 <- ggplot(data = table, aes(x = dmy(Construction_Start_Date),colour = Type, y = Net)) 

plot1 + geom_point() + xlab("Construction Start Date")
## Warning: Removed 3 rows containing missing values (geom_point).

Construction delays

plot2 <- ggplot(data = table, aes(x = dmy(Construction_Start_Date), y = Name )) 

plot2 + geom_segment(aes(x = dmy(Construction_Start_Date), xend = dmy(Commercial_Operation_Date), y = Name, yend = Name)) + labs(x = "Construction start", y = "Reactor site")
## Warning: Removed 5 rows containing missing values (geom_segment).

Part II

1. From the command line, create a new directory called lifespan somewhere within your home directory, or wherever you want to put it. Change your working directory to be this lifespan directory.

I created a new working directory named lifeexpectancy:

oski@BCE:~/practice$ mkdir lifeexpectancy

Change your working directory to be this lifespan directory:

oski@BCE:~/practice$ cd lifeexpectancy

How many countries are represented in this file (i.e. how many lines are in the file)?

Write a UNIX command to find out. (Remember the top line of the file contains years, not country data, so we don’t want to count it. For what to turn in, you can just write the UNIX command you’d use to get the total number from which you’d subtract one.)

input: wc -l lifeexpectancy.csv

output: 250 lifeexpectancy.csv
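The header row can also be skipped inside the pipeline rather than subtracted by hand. The following is a minimal sketch on a stand-in file; the name sample.csv and its values are made up for illustration and are not the real data.

```shell
# Build a tiny stand-in file; the file name and values are hypothetical.
tmpdir=$(mktemp -d)
printf '%s\n' ',1800,1801' 'Albania,35.4,35.4' 'Algeria,28.8,28.8' 'Angola,27.0,27.0' > "$tmpdir/sample.csv"

# tail -n +2 starts printing at line 2, so the header is dropped before counting.
n_countries=$(tail -n +2 "$tmpdir/sample.csv" | wc -l)
n_countries=$((n_countries))   # normalize padding that some wc implementations add
echo "$n_countries"            # prints 3 for this stand-in file

rm -rf "$tmpdir"
```

Run against lifeexpectancy.csv, the same pipeline should report one less than the 250 lines counted above, that is 249.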

Suppose we want to do some analysis for the years 1950, 1975, and 2000. Use head to look at only the first line of this file again. Based on the fact that the first column is missing and subsequent columns start from 1800 and increase by one each time, figure out what column numbers correspond to 1950, 1975, and 2000. (You don’t need to write a UNIX command to do this, just do the math.)

head -1 lifeexpectancy.csv

In total, there are (2010 - 1800) + 2 = 212 columns. We add 2 because the first column (the country names) has no header and the year 1800 sits in column 2, so a year Y is in column (Y - 1800) + 2.

We want 1950, so count from 1800 -> (1950 - 1800) + 2 = 152.

We want 1975, so count from 1800 -> (1975 - 1800) + 2 = 177.

We want 2000, so count from 1800 -> (2000 - 1800) + 2 = 202.
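The arithmetic can be cross-checked in the shell. The sketch below assumes the layout described in the question (an empty first field followed by the years 1800 to 2010) and builds a stand-in header with seq instead of reading the real file, so the header variable is an illustration only.

```shell
# Column 1 is the unnamed country column and 1800 is column 2,
# so a year Y sits in column (Y - 1800) + 2.
for year in 1950 1975 2000; do
  echo "$year -> column $(( year - 1800 + 2 ))"
done

# Cross-check against a generated header in the assumed layout (hypothetical
# stand-in for the real first line): empty first field, then 1800..2010.
header=",$(seq 1800 2010 | paste -sd, -)"

# Put one field per line, then let grep -n report the line (= column) number.
col_1975=$(printf '%s\n' "$header" | tr ',' '\n' | grep -n -x '1975' | cut -d: -f1)
echo "$col_1975"   # prints 177
```

On the real file the same check would start from head -1 lifeexpectancy.csv instead of the generated header. If the years are quoted in the file, the grep pattern would need to include the quotes.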

Write a UNIX command to keep only these columns of lifeexpectancy.csv.

For now, just print this to the screen. (Hint: take the results of cat lifeexpectancy.csv and pipe it into cut. In addition to the columns you found above, also be sure to keep the first column with the country names.) If you’ve done this step correctly, the first few lines of what gets printed to the screen should look like this:

cat lifeexpectancy.csv | cut -f 1,152,177,202 -d ','

7. Create a clean data file that contains only the data for these years, also removing any line that contains no data for these years.

(Hint: Use the filter egrep "[0-9]" to keep lines that contain numbers.) Carry out this step in one line of UNIX code, using pipes, and redirect the output into a new file called lifeexpectancy.clean.csv.

Use less again to check that lifeexpectancy.clean.csv contains what you want. If not, go back and modify your last command. This is what the first few lines of my file look like:

cut -f 1,152,177,202 -d ',' lifeexpectancy.csv | egrep "[0-9]" > lifeexpectancy.clean.csv
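To see what the digit filter keeps and drops, here is a small sketch on a made-up four-column file. The file toy.csv, its country rows, and its values are hypothetical. It uses grep -E, the equivalent of egrep, which newer GNU grep releases flag as obsolescent.

```shell
tmpdir=$(mktemp -d)
# Hypothetical stand-in: a country column plus three year columns; two rows have no data.
printf '%s\n' ',1950,1975,2000' 'Abkhazia,,,' 'Albania,54.5,69.5,74.7' 'Akrotiri,,,' > "$tmpdir/toy.csv"

# Select the columns (1-4 here; 1,152,177,202 in the real file), then keep
# only the lines that contain at least one digit.
cut -d ',' -f 1-4 "$tmpdir/toy.csv" | grep -E '[0-9]' > "$tmpdir/toy.clean.csv"

kept=$(cat "$tmpdir/toy.clean.csv")
echo "$kept"
rm -rf "$tmpdir"
```

The header row survives because the years themselves are digits, and the two rows with only empty fields are dropped. A line is kept if any of its selected fields contains a digit, so a country with data for only one of the three years still passes the filter.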

The file makemaps.R contains some code to visualize the data file you just created.

cat makemaps-1.R

Type “R” into the terminal:

oski@BCE:~/practice/lifeexpectancy$ R

install.packages("maps")
install.packages("fields")

Press Ctrl+D to exit R.

cp * /vagrant/

Extra Credit

  1. Rosling uses a scatter plot. More specifically, ggplot and geom_point()

  2. We have everything except the data set for population.

  3. Using geom_point(), he also sets col = country. He also changes the size of each point depending on the population. Lastly, he combines different graphs within a single presentation through panel data.