GUIDED PRACTICE 6

http://rpubs.com/Brayan_Milla/527650

library(htmltab)
linkPage='https://www.nationsonline.org/oneworld/corruption.htm'
linkTabla="/html/body/table[3]"
corruption = htmltab(doc = linkPage, 
               which =linkTabla)

## Neither <thead> nor <th> information found. Taking first table row for the header. If incorrect, specifiy header argument.

## Warning: Columns [ ] seem to have no data and are removed. Use
## rm_nodata_cols = F to suppress this behavior

From the web page you already knew that the table covers several years:

names(corruption)

## [1] "Rank"        "Country"     "2016  Score" "2015  Score" "2014  Score"
## [6] "2013  Score" "2012  Score" "Region"

Keep the 2016 scores and, of course, Country and Region:

corruption=corruption[,c(2,3,8)]
names(corruption)

## [1] "Country"     "2016  Score" "Region"
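Selecting by position works here, but it would pick the wrong columns if the page ever added or reordered them. A minimal sketch of the same selection by name, on a small hand-made data frame (`toy`, `keep`, `toy_sel` and all values are illustrative, not the scraped data):

```r
# Toy data frame mimicking the scraped column names (values are made up).
toy <- data.frame(
  Rank = c("1", "2"),
  Country = c("A", "B"),
  `2016  Score` = c("90", "89"),
  `2015  Score` = c("91", "88"),
  Region = c("Europe", "Europe"),
  check.names = FALSE,       # keep the names with spaces exactly as written
  stringsAsFactors = FALSE
)

# Select by column name instead of by position.
keep <- c("Country", "2016  Score", "Region")
toy_sel <- toy[, keep]
names(toy_sel)
```

Note that the score headers contain two consecutive spaces, so the names have to be copied exactly from the output of names().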

Rename the column to “score2016” to avoid blank spaces:

names(corruption)[2]="score2016"

Check the type of each variable; note that all of them are character (chr):

str(corruption)

## 'data.frame':    177 obs. of  3 variables:
##  $ Country  : chr  "Denmark" "New Zealand" "Finland" "Sweden" ...
##  $ score2016: chr  "90" "90" "89" "88" ...
##  $ Region   : chr  "Europe" "Asia Pacific" "Europe" "Europe" ...

Converting to numeric: the score must be a number, so let's convert it:

corruption$score2016=as.numeric(corruption$score2016)

## Warning: NAs introduced by coercion

The conversion produced NA (missing values); this happens because R found a non-numeric value. Let's see which row it is:

corruption[!complete.cases(corruption$score2016),]

##                                                          Country score2016
## 178 To get in-depth information visit:Transparency International        NA
##                                                           Region
## 178 To get in-depth information visit:Transparency International

This confirms that the row held no country information; it was only a reference (a pointer to the original website). So we drop that row:

corruption=corruption[complete.cases(corruption$score2016),]
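The diagnosis and clean-up above can be reproduced on a toy vector. This is only a sketch: `x`, `x_num`, `bad` and `clean` are hypothetical names, and the strings are made up to imitate the stray footer row.

```r
# Toy character vector: two valid scores and one stray footer text (made up).
x <- c("90", "89", "To get in-depth information visit")

# as.numeric() returns NA for text it cannot parse;
# suppressWarnings() only hides the "NAs introduced by coercion" warning.
x_num <- suppressWarnings(as.numeric(x))

# Entries that were not missing before coercion but are NA afterwards.
bad <- is.na(x_num) & !is.na(x)
x[bad]

# Keep only the values that converted cleanly.
clean <- x_num[!bad]
clean
```

Checking `bad` before dropping rows helps tell a footer or note apart from a real country whose score was simply written in an unexpected format.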

So far we have:

head(corruption)

##       Country score2016       Region
## 2     Denmark        90       Europe
## 3 New Zealand        90 Asia Pacific
## 4     Finland        89       Europe
## 5      Sweden        88       Europe
## 6 Switzerland        86       Europe
## 7      Norway        85       Europe

We have no ordinal variables yet, but we can turn the numeric one (the score) into an ordinal by splitting the 2016 score into 10 intervals:

corruption$nivel=cut(corruption$score2016,
                     breaks = 10,
                     labels = 1:10,
                     ordered_result = TRUE)
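To see what `breaks = 10` does, here is a small sketch on made-up scores (`scores` and `grp` are hypothetical names). When `breaks` is a single number, cut() splits the observed range of the data into that many equal-width intervals, so the lowest value lands in group 1 and the highest in group 10.

```r
# Toy scores (made up), not the real index values.
scores <- c(10, 25, 40, 55, 70, 90)

# breaks = 10: ten equal-width intervals over the observed range of scores.
grp <- cut(scores, breaks = 10, labels = 1:10, ordered_result = TRUE)

is.ordered(grp)   # the result is an ordered factor
nlevels(grp)      # all ten levels exist, even the empty ones
table(grp)        # number of scores falling in each interval
```

Because the intervals depend on the observed minimum and maximum, the same score can fall in a different group if the data change; passing explicit cut points to `breaks` avoids that.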

Now we have:

head(corruption)

##       Country score2016       Region nivel
## 2     Denmark        90       Europe    10
## 3 New Zealand        90 Asia Pacific    10
## 4     Finland        89       Europe    10
## 5      Sweden        88       Europe    10
## 6 Switzerland        86       Europe    10
## 7      Norway        85       Europe    10

Note that the higher the group number, the less corruption. Let's explore the variable corruption$nivel, the ordinal variable we just created.
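As a rough sketch of what such an exploration might look like, the block below builds a toy ordered factor standing in for corruption$nivel (the values and the names `nivel`, `freq`, `cum_share` are made up) and computes a frequency table, the cumulative share, the mode and the median level.

```r
# Toy ordered factor standing in for corruption$nivel (values are made up).
nivel <- factor(c(1, 3, 3, 5, 10, 10, 10), levels = 1:10, ordered = TRUE)

# Absolute frequency of each level.
freq <- table(nivel)
freq

# Cumulative share of cases; meaningful because the levels are ordered.
cum_share <- cumsum(prop.table(freq))
cum_share

# Mode: the most frequent level.
names(freq)[which.max(freq)]

# Median level, taken from the integer codes of the ordered factor.
# With an odd number of cases the median is always an existing level.
levels(nivel)[median(as.integer(nivel))]
```

For an ordinal variable, the mode and the median are the natural summaries; the mean of the group labels is harder to interpret because the labels only encode order.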