Note: If you see have an SSL ERROR when you are getting data from Github, you can try this:
options(width = 320, RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL",
"cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
getOption("RCurlOptions") # just to confirm ssl status!
## $verbose
## [1] FALSE
##
## $capath
## [1] "/Library/Frameworks/R.framework/Versions/3.0/Resources/library/RCurl/CurlSSL/cacert.pem"
##
## $ssl.verifypeer
## [1] FALSE
BUT first try to do it without the lines above
In a previous post we had a table from wikipedia with information of interest. In this code, I will show some variations in data collection and cleaning.
Following from the previous tutorial we had a table from the link:
Link = "http://en.wikipedia.org/wiki/List_of_border_wars"
As it can be seen below, this wikipage has many tables:
First, let's get the table as in the previous post:
library(XML)
whichTable = 3
dirtyTable = getNodeSet(htmlParse(Link), "//table")[[whichTable]]
Then, let's get the table already known, and see its first rows(using head):
## V1 V2 V3 V4 V5
## 1 Start Finish Conflict Combatants Fatalities
## 2 1947 1948 Indo-Pakistani War of 1947 Pakistan â\u0080\u0093 India ~3,000
## 3 1950 Present (CF signed in 1953) Korean War South Korea - North Korea 2 419 010+
## 4 1962 1962 Sino-Indian War PRC â\u0080\u0093 India ~4,000
## 5 1965 1965 Indo-Pakistani War of 1965 Pakistan â\u0080\u0093 India ~6,800
## 6 1966 1989 South African Border War Angola â\u0080\u0093 South Africa Unknown
## 7 1967 1967 Chola incident India â\u0080\u0093 People's Republic of China ~5
## 8 1969 1969 Sino-Soviet border conflict PRC â\u0080\u0093 Soviet Union Unknown
## 9 1971 1971 Indo-Pakistani War of 1971 Pakistan â\u0080\u0093 India ~4,000+
## 10 1979 1990 Sino-Vietnamese conflicts 1979-1990 Vietnam â\u0080\u0093 China Unclear
## 11 1980 1988 Iranâ\u0080\u0093Iraq War Iran â\u0080\u0093 Iraq 1-2 million[1]
## 12 1981 1981 Paquisha War Ecuador â\u0080\u0093 Peru ~10
## 13 1982 1982 1982 Ethiopianâ\u0080\u0093Somali Border War Ethiopia â\u0080\u0093 Somalia Unknown
## 14 1984 1987 Siachen Conflict Pakistan â\u0080\u0093 India ~2,400.
## 15 1985 1985 Agacher Strip War Burkina Faso â\u0080\u0093 Mali 179
## 16 1987 1987 1987 Sino-Indian skirmish India â\u0080\u0093 People's Republic of China Unknown
## 17 1987 1988 Thaiâ\u0080\u0093Laotian Border War Thailand â\u0080\u0093 Laos ~1,000
## 18 1989 1991 Mauritaniaâ\u0080\u0093Senegal Border War Mauritania â\u0080\u0093 Senegal Unknown
## 19 1995 1995 Cenepa War Ecuador â\u0080\u0093 Peru ~450
## 20 1998 2000 Eritrean-Ethiopian War Ethiopia â\u0080\u0093 Eritrea ~70,000
## 21 1999 1999 Kargil War India â\u0080\u0093 Pakistan ~4,500
The code so far is not recognising the first row as the names of the columns, so we solve that setting header as True:
cleanTable = readHTMLTable(dirtyTable, header = T)
cleanTable
## Start Finish Conflict Combatants Fatalities
## 1 1947 1948 Indo-Pakistani War of 1947 Pakistan â\u0080\u0093 India ~3,000
## 2 1950 Present (CF signed in 1953) Korean War South Korea - North Korea 2 419 010+
## 3 1962 1962 Sino-Indian War PRC â\u0080\u0093 India ~4,000
## 4 1965 1965 Indo-Pakistani War of 1965 Pakistan â\u0080\u0093 India ~6,800
## 5 1966 1989 South African Border War Angola â\u0080\u0093 South Africa Unknown
## 6 1967 1967 Chola incident India â\u0080\u0093 People's Republic of China ~5
## 7 1969 1969 Sino-Soviet border conflict PRC â\u0080\u0093 Soviet Union Unknown
## 8 1971 1971 Indo-Pakistani War of 1971 Pakistan â\u0080\u0093 India ~4,000+
## 9 1979 1990 Sino-Vietnamese conflicts 1979-1990 Vietnam â\u0080\u0093 China Unclear
## 10 1980 1988 Iranâ\u0080\u0093Iraq War Iran â\u0080\u0093 Iraq 1-2 million[1]
## 11 1981 1981 Paquisha War Ecuador â\u0080\u0093 Peru ~10
## 12 1982 1982 1982 Ethiopianâ\u0080\u0093Somali Border War Ethiopia â\u0080\u0093 Somalia Unknown
## 13 1984 1987 Siachen Conflict Pakistan â\u0080\u0093 India ~2,400.
## 14 1985 1985 Agacher Strip War Burkina Faso â\u0080\u0093 Mali 179
## 15 1987 1987 1987 Sino-Indian skirmish India â\u0080\u0093 People's Republic of China Unknown
## 16 1987 1988 Thaiâ\u0080\u0093Laotian Border War Thailand â\u0080\u0093 Laos ~1,000
## 17 1989 1991 Mauritaniaâ\u0080\u0093Senegal Border War Mauritania â\u0080\u0093 Senegal Unknown
## 18 1995 1995 Cenepa War Ecuador â\u0080\u0093 Peru ~450
## 19 1998 2000 Eritrean-Ethiopian War Ethiopia â\u0080\u0093 Eritrea ~70,000
## 20 1999 1999 Kargil War India â\u0080\u0093 Pakistan ~4,500
Now we are concern with the strange characters in the fourth column, we will try to see if indicating UTF-8 as encoding can help solve this situation:
dirtyTable = getNodeSet(htmlParse(Link, encoding = "UTF-8"), "//table")[[whichTable]]
cleanTable = readHTMLTable(dirtyTable, header = T)
cleanTable
## Start Finish Conflict Combatants Fatalities
## 1 1947 1948 Indo-Pakistani War of 1947 Pakistan – India ~3,000
## 2 1950 Present (CF signed in 1953) Korean War South Korea - North Korea 2 419 010+
## 3 1962 1962 Sino-Indian War PRC – India ~4,000
## 4 1965 1965 Indo-Pakistani War of 1965 Pakistan – India ~6,800
## 5 1966 1989 South African Border War Angola – South Africa Unknown
## 6 1967 1967 Chola incident India – People's Republic of China ~5
## 7 1969 1969 Sino-Soviet border conflict PRC – Soviet Union Unknown
## 8 1971 1971 Indo-Pakistani War of 1971 Pakistan – India ~4,000+
## 9 1979 1990 Sino-Vietnamese conflicts 1979-1990 Vietnam – China Unclear
## 10 1980 1988 Iran–Iraq War Iran – Iraq 1-2 million[1]
## 11 1981 1981 Paquisha War Ecuador – Peru ~10
## 12 1982 1982 1982 Ethiopian–Somali Border War Ethiopia – Somalia Unknown
## 13 1984 1987 Siachen Conflict Pakistan – India ~2,400.
## 14 1985 1985 Agacher Strip War Burkina Faso – Mali 179
## 15 1987 1987 1987 Sino-Indian skirmish India – People's Republic of China Unknown
## 16 1987 1988 Thai–Laotian Border War Thailand – Laos ~1,000
## 17 1989 1991 Mauritania–Senegal Border War Mauritania – Senegal Unknown
## 18 1995 1995 Cenepa War Ecuador – Peru ~450
## 19 1998 2000 Eritrean-Ethiopian War Ethiopia – Eritrea ~70,000
## 20 1999 1999 Kargil War India – Pakistan ~4,500
This is so much better. Now that we have got rid of the most obvious issues, let's see if we other things are hidding somewhere.Let's simply get a summary and the structure of this table:
summary(cleanTable)
str(cleanTable)
## Start Finish Conflict Combatants Fatalities
## 1987 : 2 1987 : 2 1982 Ethiopian–Somali Border War: 1 Pakistan – India :4 Unknown: 5
## 1947 : 1 1988 : 2 1987 Sino-Indian skirmish : 1 Ecuador – Peru :2 ~1,000 : 1
## 1950 : 1 1948 : 1 Agacher Strip War : 1 India – People's Republic of China:2 ~10 : 1
## 1962 : 1 1962 : 1 Cenepa War : 1 Angola – South Africa :1 ~2,400.: 1
## 1965 : 1 1965 : 1 Chola incident : 1 Burkina Faso – Mali :1 ~3,000 : 1
## 1966 : 1 1967 : 1 Eritrean-Ethiopian War : 1 Ethiopia – Eritrea :1 ~4,000 : 1
## (Other):13 (Other):12 (Other) :14 (Other) :9 (Other):10
## 'data.frame': 20 obs. of 5 variables:
## $ Start : Factor w/ 19 levels "1947","1950",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Finish : Factor w/ 18 levels "1948","1962",..: 1 18 2 3 12 4 5 6 13 11 ...
## $ Conflict : Factor w/ 20 levels "1982 Ethiopian–Somali Border War",..: 7 12 16 8 19 5 17 9 18 10 ...
## $ Combatants: Factor w/ 15 levels "Angola – South Africa",..: 10 13 11 10 1 7 12 10 15 8 ...
## $ Fatalities: Factor w/ 16 levels "~1,000","~10",..: 4 14 5 10 16 9 16 6 15 12 ...
The last column has obvious problems, and we can try to replace the wrong symbols. However, from the last result we see it is a factor instead of a string, we need to correct that (if it is not a string, we can not operate on them with string functions).
dirtyTable = getNodeSet(htmlParse(Link, encoding = "UTF-8"), "//table")[[whichTable]]
cleanTable = readHTMLTable(dirtyTable, header = T, trim = T, stringsAsFactors = F)
indexmillion = grep("million", cleanTable$Fatalities)
cleanTable$Fatalities[indexmillion] = "2000000"
indexunknown = grep("Un", cleanTable$Fatalities)
cleanTable$Fatalities[c(indexunknown)] = NA
cleanTable
## Start Finish Conflict Combatants Fatalities
## 1 1947 1948 Indo-Pakistani War of 1947 Pakistan – India ~3,000
## 2 1950 Present (CF signed in 1953) Korean War South Korea - North Korea 2 419 010+
## 3 1962 1962 Sino-Indian War PRC – India ~4,000
## 4 1965 1965 Indo-Pakistani War of 1965 Pakistan – India ~6,800
## 5 1966 1989 South African Border War Angola – South Africa <NA>
## 6 1967 1967 Chola incident India – People's Republic of China ~5
## 7 1969 1969 Sino-Soviet border conflict PRC – Soviet Union <NA>
## 8 1971 1971 Indo-Pakistani War of 1971 Pakistan – India ~4,000+
## 9 1979 1990 Sino-Vietnamese conflicts 1979-1990 Vietnam – China <NA>
## 10 1980 1988 Iran–Iraq War Iran – Iraq 2000000
## 11 1981 1981 Paquisha War Ecuador – Peru ~10
## 12 1982 1982 1982 Ethiopian–Somali Border War Ethiopia – Somalia <NA>
## 13 1984 1987 Siachen Conflict Pakistan – India ~2,400.
## 14 1985 1985 Agacher Strip War Burkina Faso – Mali 179
## 15 1987 1987 1987 Sino-Indian skirmish India – People's Republic of China <NA>
## 16 1987 1988 Thai–Laotian Border War Thailand – Laos ~1,000
## 17 1989 1991 Mauritania–Senegal Border War Mauritania – Senegal <NA>
## 18 1995 1995 Cenepa War Ecuador – Peru ~450
## 19 1998 2000 Eritrean-Ethiopian War Ethiopia – Eritrea ~70,000
## 20 1999 1999 Kargil War India – Pakistan ~4,500
We see a better table. Now we need to get rid of those problematic symbols and empty spaces.We will extra columns to show the conversion process.
cleanTable$Fatalities2 = gsub("[[:punct:]]", "", cleanTable$Fatalities)
cleanTable$Fatalities3 = gsub(" ", "", cleanTable$Fatalities2)
cleanTable
## Start Finish Conflict Combatants Fatalities Fatalities2 Fatalities3
## 1 1947 1948 Indo-Pakistani War of 1947 Pakistan – India ~3,000 3000 3000
## 2 1950 Present (CF signed in 1953) Korean War South Korea - North Korea 2 419 010+ 2 419 010 2419010
## 3 1962 1962 Sino-Indian War PRC – India ~4,000 4000 4000
## 4 1965 1965 Indo-Pakistani War of 1965 Pakistan – India ~6,800 6800 6800
## 5 1966 1989 South African Border War Angola – South Africa <NA> <NA> <NA>
## 6 1967 1967 Chola incident India – People's Republic of China ~5 5 5
## 7 1969 1969 Sino-Soviet border conflict PRC – Soviet Union <NA> <NA> <NA>
## 8 1971 1971 Indo-Pakistani War of 1971 Pakistan – India ~4,000+ 4000 4000
## 9 1979 1990 Sino-Vietnamese conflicts 1979-1990 Vietnam – China <NA> <NA> <NA>
## 10 1980 1988 Iran–Iraq War Iran – Iraq 2000000 2000000 2000000
## 11 1981 1981 Paquisha War Ecuador – Peru ~10 10 10
## 12 1982 1982 1982 Ethiopian–Somali Border War Ethiopia – Somalia <NA> <NA> <NA>
## 13 1984 1987 Siachen Conflict Pakistan – India ~2,400. 2400 2400
## 14 1985 1985 Agacher Strip War Burkina Faso – Mali 179 179 179
## 15 1987 1987 1987 Sino-Indian skirmish India – People's Republic of China <NA> <NA> <NA>
## 16 1987 1988 Thai–Laotian Border War Thailand – Laos ~1,000 1000 1000
## 17 1989 1991 Mauritania–Senegal Border War Mauritania – Senegal <NA> <NA> <NA>
## 18 1995 1995 Cenepa War Ecuador – Peru ~450 450 450
## 19 1998 2000 Eritrean-Ethiopian War Ethiopia – Eritrea ~70,000 70000 70000
## 20 1999 1999 Kargil War India – Pakistan ~4,500 4500 4500
cleanTable$Fatalities = as.numeric(cleanTable$Fatalities3)
cleanTable$Fatalities2 = NULL
cleanTable$Fatalities3 = NULL
cleanTable
## Start Finish Conflict Combatants Fatalities
## 1 1947 1948 Indo-Pakistani War of 1947 Pakistan – India 3000
## 2 1950 Present (CF signed in 1953) Korean War South Korea - North Korea 2419010
## 3 1962 1962 Sino-Indian War PRC – India 4000
## 4 1965 1965 Indo-Pakistani War of 1965 Pakistan – India 6800
## 5 1966 1989 South African Border War Angola – South Africa NA
## 6 1967 1967 Chola incident India – People's Republic of China 5
## 7 1969 1969 Sino-Soviet border conflict PRC – Soviet Union NA
## 8 1971 1971 Indo-Pakistani War of 1971 Pakistan – India 4000
## 9 1979 1990 Sino-Vietnamese conflicts 1979-1990 Vietnam – China NA
## 10 1980 1988 Iran–Iraq War Iran – Iraq 2000000
## 11 1981 1981 Paquisha War Ecuador – Peru 10
## 12 1982 1982 1982 Ethiopian–Somali Border War Ethiopia – Somalia NA
## 13 1984 1987 Siachen Conflict Pakistan – India 2400
## 14 1985 1985 Agacher Strip War Burkina Faso – Mali 179
## 15 1987 1987 1987 Sino-Indian skirmish India – People's Republic of China NA
## 16 1987 1988 Thai–Laotian Border War Thailand – Laos 1000
## 17 1989 1991 Mauritania–Senegal Border War Mauritania – Senegal NA
## 18 1995 1995 Cenepa War Ecuador – Peru 450
## 19 1998 2000 Eritrean-Ethiopian War Ethiopia – Eritrea 70000
## 20 1999 1999 Kargil War India – Pakistan 4500
The final step would the to clean the cell where a problematic text is input for finish year:
indexyear = grep("CF", cleanTable$Finish)
cleanTable$Finish[indexyear] = "1953"
We can now save the file to our github repository (once we push it from Rstudio):
setwd("~/Documents/GITHUBrepositories/Tutorials/TemplatesR")
write.csv(cleanTable, "wikitable.csv")
You can work with this file now:
library(RCurl)
## Loading required package: bitops
DATA_GithubCleaned <- getURL("https://raw.github.com/MAGALLANESJoseManuel/SocialScienceDataTools/master/TemplatesR/wikitable.csv")
DATA_gitClean <- read.csv(text = DATA_GithubCleaned)
head(DATA_gitClean)
## X Start Finish Conflict Combatants Fatalities
## 1 1 1947 1948 Indo-Pakistani War of 1947 Pakistan – India 3000
## 2 2 1950 1953 Korean War South Korea - North Korea 2419010
## 3 3 1962 1962 Sino-Indian War PRC – India 4000
## 4 4 1965 1965 Indo-Pakistani War of 1965 Pakistan – India 6800
## 5 5 1966 1989 South African Border War Angola – South Africa NA
## 6 6 1967 1967 Chola incident India – People's Republic of China 5