Data visualization with R - IBM

Bar chart, histogram, pie chart

For visualization we will use the ggplot2 library. If it is not installed, you can install it with:

# install.packages('ggplot2')

To use the library, load it with the following command:

library(rlang)

library(ggplot2)

We will use the built-in mtcars dataset, which we can call directly:

mtcars
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
 [ reached 'max' / getOption("max.print") -- omitted 26 rows ]

Let's see which data types appear in our dataset:

str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

We have 32 observations (car models) of 11 variables (characteristics).

Bar chart with qplot

Next, we will make a bar chart of the variable cyl. Note that qplot() is deprecated as of ggplot2 3.4.0; it still works, but recent versions emit a deprecation warning.

qplot(mtcars$cyl, geom = "bar", fill = I("blue"), xlab = "Cylinders", ylab = "Number of vehicles", 
    main = "Cylinders in mtcars")
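Because qplot() is deprecated in recent ggplot2 releases, the same bar chart can also be sketched with ggplot() and geom_bar(). This is an equivalent added for illustration, not part of the original course code; the object name bar_cyl is our own choice.

```r
library(ggplot2)

# Bar chart of the number of cars per cylinder count.
# factor(cyl) makes the x axis discrete (one bar per distinct value).
bar_cyl <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar(fill = "blue") +
  labs(x = "Cylinders", y = "Number of vehicles", title = "Cylinders in mtcars")
print(bar_cyl)
```

Here geom_bar() counts the rows for each value of cyl, which is what geom = "bar" does in the qplot() call.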

Histograms with qplot

We will use the same mtcars dataset, take the column hp (horsepower), and use the same qplot function.

qplot(mtcars$hp, geom = "histogram")

Let's improve our histogram:

  • With the parameter binwidth = 25, the range of values is divided into bins of width 25.
  • With the parameter colour we assign a color to the bar borders.
qplot(mtcars$hp, geom = "histogram", binwidth = 25, colour = I("black"))

With the parameter xlim = c(50, 350) we can set the range of the x axis shown in the histogram.

qplot(mtcars$hp, geom = "histogram", binwidth = 25, colour = I("black"), xlim = c(50, 
    350))

Finally, we can label the x and y axes and add a title:

qplot(mtcars$hp, geom = "histogram", binwidth = 25, colour = I("black"), xlab = "Horsepower", 
    ylab = "Number of cars", main = "Histogram")
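As a sketch of the same histogram without qplot(), the following uses ggplot() with geom_histogram(). The object name hist_hp is our own, and the labels simply repeat the ones used above.

```r
library(ggplot2)

# Histogram of horsepower with bins of width 25 and black bar borders
hist_hp <- ggplot(mtcars, aes(x = hp)) +
  geom_histogram(binwidth = 25, colour = "black") +
  labs(x = "Horsepower", y = "Number of cars", title = "Histogram")
print(hist_hp)

# Every car falls into exactly one bin, so the bin counts add up to nrow(mtcars)
sum(layer_data(hist_hp)$count)
```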

Pie chart

We will use the mtcars dataset and the column carb (carburetors). Since ggplot2 has no function that creates a pie chart directly, we first create a stacked bar chart.

barp <- ggplot(mtcars, aes(x = 1, y = sort(mtcars$carb), fill = sort(mtcars$carb))) + 
    geom_bar(stat = "identity")
print(barp)

To turn it into a pie chart, we add coord_polar(theta = "y"):

barp <- barp + coord_polar(theta = "y")
print(barp)

To clean up the image we can use the theme() function:

barp <- barp + theme(axis.line = element_blank(), axis.text.x = element_blank(), 
    axis.text.y = element_blank(), axis.ticks = element_blank(), panel.background = element_blank()) + 
    labs(y = "Carburetors")
print(barp)
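In the chart above each car contributes a segment as tall as its carb value, so the slices reflect the total number of carburetors rather than the number of cars. If the goal is instead to show how many cars have each carburetor count, one possible sketch is to tabulate the counts first. The names carb_counts and pie_carb are our own, and theme_void() is used here as a shortcut for removing axes and background.

```r
library(ggplot2)

# Count how many cars have each number of carburetors
carb_counts <- as.data.frame(table(carb = mtcars$carb))

# One stacked bar with a segment per carb value, then bent into a circle
pie_carb <- ggplot(carb_counts, aes(x = "", y = Freq, fill = carb)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  labs(fill = "Carburetors")
print(pie_carb)
```

Since carb is a factor in carb_counts, each slice gets its own discrete color and legend entry.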

Scatter plot, regression plots

Scatter plot

We will use the mtcars dataset and make a scatter plot of the variables mpg (miles per gallon) and wt (weight).

qplot(mpg, wt, data = mtcars)

To customize the scatter plot further we can use geom_point(shape = 2); changing the value of shape changes the symbol used for the points.

ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point(shape = 2)

We can introduce a new variable into the scatter plot. For example, consider the variable cyl (cylinders); that is, we will group the points by number of cylinders.

First we convert cyl to a factor and add it to the data frame as a new variable:

mtcars$cylFactor <- factor(mtcars$cyl)

ggplot(mtcars, aes(x = mpg, y = wt, shape = cylFactor)) + geom_point()

We can also use color instead of shape. Mapping color to the numeric variable cyl produces a continuous color gradient:

ggplot(mtcars, aes(x = mpg, y = wt, color = cyl)) + geom_point()

Mapping color to the factor cylFactor instead gives each cylinder count its own discrete color:

ggplot(mtcars, aes(x = mpg, y = wt, color = cylFactor)) + geom_point()

The difference between the last two plots is the type of variable used for the color: cyl is numeric, so ggplot2 uses a continuous color scale, whereas cylFactor is a factor, so it uses a discrete one.
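The distinction can be checked directly in base R. The short sketch below uses a local variable cyl_factor (our own name) so that it does not depend on the column added earlier.

```r
# cyl is stored as a number; factor() turns it into a categorical variable
cyl_factor <- factor(mtcars$cyl)

class(mtcars$cyl)    # "numeric"
class(cyl_factor)    # "factor"
levels(cyl_factor)   # "4" "6" "8"

# Number of cars in each category
table(cyl_factor)
```

Three levels mean three discrete colors (or shapes) in the legend, instead of a gradient running from 4 to 8.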

Now we can add axis labels, a legend title, and a plot title:

ggplot(mtcars, aes(x = mpg, y = wt, color = cylFactor)) + geom_point() + xlab("Miles per Gallon") + 
    ylab("Weight") + labs(colour = "Cylinders") + ggtitle("Scatter plot")

Line and regression plots

For the line plot we will use the built-in EuStockMarkets dataset, which contains historical data for four European stock indices.

EuStockDF <- as.data.frame(EuStockMarkets)
head(EuStockDF)
      DAX    SMI    CAC   FTSE
1 1628.75 1678.1 1772.8 2443.6
2 1613.63 1688.5 1750.5 2460.2
3 1606.51 1678.6 1718.0 2448.2
4 1621.04 1684.1 1708.1 2470.4
5 1618.16 1686.6 1723.1 2484.7
6 1610.61 1671.6 1714.3 2466.8

We will use the variable DAX for our line plot:

ggplot(EuStockDF, aes(x = c(1:nrow(EuStockDF)), y = DAX)) + geom_line()

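We can change the thickness of the line with geom_line(size = 1.5) and its color with colour = "light blue" inside geom_line. In ggplot2 3.4.0 and later, linewidth is the preferred name for the size argument of lines; size still works but produces a deprecation warning.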

ggplot(EuStockDF, aes(x = c(1:nrow(EuStockDF)), y = DAX)) + geom_line(size = 1.5, 
    colour = "light blue") + labs(x = "Stocks")
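The x axis above is just the row index. Since EuStockMarkets is a time series object, its time stamps can be recovered with time() and used on the x axis instead. The following is a sketch under that assumption; the column name year and the object name dax_plt are our own.

```r
library(ggplot2)

EuStockDF <- as.data.frame(EuStockMarkets)

# time() returns the observation times as decimal years (e.g. 1991.5)
EuStockDF$year <- as.numeric(time(EuStockMarkets))

dax_plt <- ggplot(EuStockDF, aes(x = year, y = DAX)) +
  geom_line(colour = "steelblue") +
  labs(x = "Year", y = "DAX")
print(dax_plt)
```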

Multiple line plots

Let's plot the variables DAX, SMI and CAC together in a single figure:

dax_smi_plt <- ggplot() + geom_line(data = EuStockDF, aes(x = c(1:nrow(EuStockDF)), 
    y = DAX), size = 1.2, colour = "light blue") + geom_line(data = EuStockDF, aes(x = c(1:nrow(EuStockDF)), 
    y = SMI), size = 1.2, colour = "red") + geom_line(data = EuStockDF, aes(x = c(1:nrow(EuStockDF)), 
    y = CAC), size = 1.2, colour = "green") + labs(x = "Time", y = "Stocks") + ggtitle("Eu Stocks")
print(dax_smi_plt)
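With one geom_line() per index the colors are fixed by hand and no legend is drawn. An alternative sketch is to reshape the data to long format (one row per day and index) and map colour to the index name, which lets ggplot2 build the legend. The names long_df, day, index, value and multi_plt are our own.

```r
library(ggplot2)

EuStockDF <- as.data.frame(EuStockMarkets)
n <- nrow(EuStockDF)

# Long format: stack the three index columns on top of each other
long_df <- data.frame(
  day   = rep(seq_len(n), times = 3),
  index = rep(c("DAX", "SMI", "CAC"), each = n),
  value = c(EuStockDF$DAX, EuStockDF$SMI, EuStockDF$CAC)
)

# colour = index draws one line per index and adds a legend automatically
multi_plt <- ggplot(long_df, aes(x = day, y = value, colour = index)) +
  geom_line() +
  labs(x = "Time", y = "Stocks", colour = "Index", title = "Eu Stocks")
print(multi_plt)
```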

Linear regression

We will take the mtcars dataset and plot a linear regression between the variables mpg and wt, using geom_smooth:

ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point() + geom_smooth(method = "lm", 
    se = FALSE, color = "red")
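With method = "lm", geom_smooth fits the same model as lm() with y regressed on x, here wt on mpg. A minimal sketch for inspecting the fitted line numerically (the name fit is our own):

```r
# Same model that geom_smooth(method = "lm") draws: wt as a function of mpg
fit <- lm(wt ~ mpg, data = mtcars)

coef(fit)               # intercept and slope of the red line
summary(fit)$r.squared  # share of the variance in wt explained by mpg
```

The slope is negative: cars that travel more miles per gallon tend to be lighter.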

If we add a third variable, the number of cylinders cylFactor (a factor), a separate line is fitted for each cylinder group. Setting se = TRUE additionally displays the confidence interval around each fitted line.

ggplot(mtcars, aes(x = mpg, y = wt, color = cylFactor)) + geom_point() + geom_smooth(method = "lm", 
    se = TRUE, color = "red") + xlab("Miles per Gallon") + ylab("Weight") + labs(colour = "Cylinders") + 
    ggtitle("Linear regression")

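Smoothed regression (LOESS)

Setting method = "auto" lets geom_smooth choose the smoother from the size of the data. For fewer than 1,000 observations it uses LOESS (local regression), so the fitted curve is no longer forced to be a straight line.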

ggplot(mtcars, aes(x = mpg, y = wt, color = cylFactor)) + geom_point() + geom_smooth(method = "auto", 
    se = TRUE, color = "red") + xlab("Miles per Gallon") + ylab("Weight") + labs(colour = "Cylinders") + 
    ggtitle("LOESS smoothing")

Local smoothers such as LOESS are more flexible than a linear model and can follow nonlinear patterns in the data, although with few observations they may also end up following noise.
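For a dataset as small as mtcars, method = "auto" selects stats::loess(). To get a rough sense of how the two fits differ, the sketch below fits both models to the whole dataset (without grouping by cylinders) and compares their residual sums of squares. The names lin_fit, loess_fit and rss are our own, and loess() is called with its default settings.

```r
# Straight-line fit and local regression fit of wt on mpg
lin_fit   <- lm(wt ~ mpg, data = mtcars)
loess_fit <- loess(wt ~ mpg, data = mtcars)

# Residual sum of squares for each model (smaller = closer to the points)
rss <- c(
  linear = sum(residuals(lin_fit)^2),
  loess  = sum(residuals(loess_fit)^2)
)
print(rss)
```

A lower residual sum of squares for the smoother is typical but not guaranteed, and it does not by itself mean the model generalizes better.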

Specialized visualization tools

Word cloud

A word cloud is an image made up of the words that appear in a particular text or topic. The size of each word indicates its frequency or importance.

We will use the following libraries:

# install.packages('tm')
# install.packages('wordcloud')
library(tm)
library(wordcloud)

We will create a folder named files_path in our working directory and then use the function download.file to download the file into it.

# Creates a folder with the given name (files_path) in the current
# working directory
dir.create("files_path")

download.file("https://ibm.box.com/shared/static/cmid70rpa7xe4ocitcga1bve7r0kqnia.txt", 
    destfile = "files_path/churchill_speeches.txt", quiet = TRUE)  # destination folder + file name

Now we read the downloaded data. The inspect function shows the contents of the corpus, which here is a single text document.

dirPath <- "files_path/"
speech <- Corpus(DirSource(dirPath))
inspect(speech)
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1

churchill_speeches.txt
At present we lie within a few minutes’ striking distance of the French, Dutch and Belgian coasts, and within a few hours of the great aerodromes of Central Europe. We are even within canon-shot of the Continent.\n\nSo close as that! Is it prudent, is it possible, however much we might desire it, to turn our backs upon Europe and ignore whatever may happen there? I have come to the conclusion – reluctantly I admit – that we cannot get away. Here we are and we must make the best of it. But do not underrate the risks – the grevious risks – we have to run.\n\n\nThis is only the beginning of the reckoning. This is only the first sip, the first foretaste of a bitter cup which will be proffered to us year by year unless, by a supreme recovery of moral health and martial vigour, we arise again and take our stand for freedom as in the olden time.\n\n\nI would say to the House, as I said to those who have joined this Government: I have nothing to offer but blood, toil, tears and sweat. We have before us an ordeal of the most grievous kind. We have before us many, many long months of struggle and of suffering. You ask, what is our policy? I can say: It is to wage war, by sea, land and air, with all our might and with all the strength  that God can give us; to wage war against a monstrous tyranny, never surpassed in the dark, lamentable catalogue of human crime. This is our policy. You ask, what is our aim?\n\nI can answer in one word: It is victory, victory at all costs, victory in spite of all terror, victory, however long and hard the road may be, for without victory, there is no survival.\n\n\nEven though large tracts of Europe and many old and famous states have fallen or may fall into the grip of the Gestapo and all the odious apparatus of Nazi rule, we shall not flag or fail. 
We shall go on to the end, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing confidence and growing strength in the air, we shall defend our island, whatever the cost may be, we shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the streets, we shall fight in the hills; we shall never surrender.\n\n\nThe battle of France is over. I expect that the Battle of Britain is about to begin. Upon this battle depends the survival of Christian civilisation. Upon it depends our own British life, and the long continuity of our institutions and our Empire. The whole fury and might of the enemy must very soon be turned upon us. Hitler knows that he will have to break us in this island or lose the war.\n\nIf we can stand up to him, all Europe may be free and the life of the world may move forward into broad, sunlit uplands. But if we fail, then the whole world, including the United States, including all that we have known and cared for, will sink into the abyss of a new Dark Age made more sinister, and perhaps more protracted, by the lights of perverted science. Let us therefore brace ourselves to our duties, and so bear ourselves that, if the British Empire and its Commonwealth last for a thousand years, men will still say, ‘This was their finest hour.\n\n\nThe gratitude of every home in our Island, in our Empire, and indeed throughout the world, except in the abodes of the guilty, goes out to the British airmen who, undaunted by odds, unwearied in their constant challenge and mortal danger, are turning the tide of the World War by their prowess and by their devotion. Never in the field of human conflict was so much owed by so many to so few.\n\n\nFrom Stettin in the Baltic to Trieste in the Adriatic, an iron curtain has descended across the Continent. Behind that line lie all the capitals of the ancient states of Central and Eastern Europe. 
Warsaw, Berlin, Prague, Vienna, Budapest, Belgrade, Bucharest and Sofia, all these famous cities and the populations around them lie in what I must call the Soviet sphere.\n\n\nI am very glad that Mr Attlee described my speeches in the war as expressing the will not only of Parliament but of the whole nation. Their will was resolute and remorseless and, as it proved, unconquerable. It fell to me to express it, and if I found the right words you must remember that I have always earned my living by my pen and by my tongue. It was the nation and race dwelling all round the globe that had the lion heart. I had the luck to be called upon to give the roar. 

To clean the data we will follow these steps:

  1. Convert the text to lowercase with tolower.
  2. Remove numbers with removeNumbers.
  3. Remove common words such as "the" and "we" with removeWords and stopwords("english").
  4. Remove punctuation with removePunctuation.
  5. Collapse extra whitespace with stripWhitespace.
speech <- tm_map(speech, content_transformer(tolower))
speech <- tm_map(speech, removeNumbers)
speech <- tm_map(speech, removeWords, stopwords("english"))
# to remove specific words, pass them as a character vector
speech <- tm_map(speech, removeWords, c("squirrelled"))
speech <- tm_map(speech, removePunctuation)
speech <- tm_map(speech, stripWhitespace)
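To see what each cleaning step does, the sketch below applies roughly equivalent base R operations to a single sentence, without using tm. The sample sentence and the tiny stop-word list are our own assumptions; stopwords("english") in tm is a much longer list, and the tm functions may differ in detail from these regular expressions.

```r
text <- "We shall fight on the beaches in 1940, we shall NEVER surrender!"

# 1. lowercase
clean <- tolower(text)
# 2. remove numbers
clean <- gsub("[[:digit:]]+", "", clean)
# 3. remove stop words (hypothetical short list, matched as whole words)
stop_words <- c("we", "on", "the", "in")
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
clean <- gsub(pattern, "", clean, perl = TRUE)
# 4. remove punctuation
clean <- gsub("[[:punct:]]+", "", clean)
# 5. collapse repeated whitespace and trim the ends
clean <- trimws(gsub("\\s+", " ", clean))

clean  # "shall fight beaches shall never surrender"
```

The order matters: stop words are removed before punctuation so that words followed by a comma or period are still matched as whole words.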

Now we convert the text into a term-document matrix:

dtm <- TermDocumentMatrix(speech)
m <- as.matrix(dtm)

Let's sort the words by frequency, from highest to lowest:

v <- sort(rowSums(m), decreasing = TRUE)
v
     shall      fight        may       will     europe       upon    victory 
        11          7          6          6          5          5          5 
       war        can       many       must      world          –     battle 
         5          4          4          4          4          4          3 
   british     empire     island        lie       long      might      never 
         3          3          3          3          3          3          3 
       say     states      whole     within        air        ask    central 
         3          3          3          3          2          2          2 
 continent       dark    depends       even       fail     famous      first 
         2          2          2          2          2          2          2 
    france       give    growing    however      human  including       life 
         2          2          2          2          2          2          2 
      much     nation     policy      risks      stand   strength   survival 
         2          2          2          2          2          2          2 
      wage   whatever       year     abodes      abyss     across      admit 
         2          2          2          1          1          1          1 
  adriatic aerodromes        age        aim     airmen     always    ancient 
         1          1          1          1          1          1          1 
    answer  apparatus      arise     around     attlee       away      backs 
         1          1          1          1          1          1          1 
    baltic    beaches       bear      begin  beginning 
         1          1          1          1          1 
 [ reached getOption("max.print") -- omitted 218 entries ]

Now we turn it into a data frame:

d <- data.frame(word = names(v), freq = v)
head(d, 10)
           word freq
shall     shall   11
fight     fight    7
may         may    6
will       will    6
europe   europe    5
upon       upon    5
victory victory    5
war         war    5
can         can    4
many       many    4
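The same word/frequency table can be sketched in base R for a short cleaned string, which makes it easier to see what the term-document matrix and rowSums are computing. The sample string and the names words, freq and freq_df are our own.

```r
# A short, already cleaned text (assumed input for illustration)
cleaned <- "shall fight beaches shall never surrender"

# Split into words and count how often each one occurs
words <- strsplit(cleaned, " ")[[1]]
freq  <- sort(table(words), decreasing = TRUE)

# Same layout as the data frame d above: one row per word
freq_df <- data.frame(word = names(freq), freq = as.vector(freq))
head(freq_df)
```

With a single document, the row sums of the term-document matrix are simply these per-word counts.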

Now let's plot the word cloud:

wordcloud(words = d$word, freq = d$freq)

We can choose the minimum frequency a word needs in order to appear in the word cloud:

wordcloud(words = d$word, freq = d$freq, min.freq = 1)

We can limit the number of words shown and add colors for better visualization.

wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words = 250, colors = brewer.pal(8, 
    "Dark2"))

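Finally, setting random.order = FALSE plots the words in decreasing order of frequency, so the most frequent words are placed first, which usually makes the plot easier to read: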

wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words = 250, colors = brewer.pal(8, 
    "Dark2"), random.order = FALSE)

Creating maps in R

We will use the leaflet library:

library(leaflet)

Let's locate the Virgin of El Panecillo in Quito, Ecuador. If you do not know how to find the longitude and latitude, you can do the following:

  • On your computer, open Google Maps.
  • Right-click the place or area on the map. Select the latitude and longitude to copy the coordinates automatically.
map <- leaflet() %>% addTiles() %>% addMarkers(lng = -78.5181, lat = -0.22906, popup = "El Panecillo - QUITO")
map

We can also enclose the location in a rectangle for better visualization, for example:

map2 <- leaflet() %>% addTiles() %>% addMarkers(lng = -78.5181, lat = -0.22906, popup = "El Panecillo - QUITO") %>% 
    addRectangles(-78.5081, -0.21906, -78.5282, -0.23906)
map2
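The four numbers passed to addRectangles are two opposite corners of the rectangle, given as lng1, lat1, lng2, lat2. Rather than typing them by hand, they can be derived from the marker position. The sketch below assumes a half side of 0.01 degrees, which roughly matches the rectangle above; the names center_lng, center_lat, half_side and map_box are our own.

```r
library(leaflet)

# Marker position used in the example above
center_lng <- -78.5181
center_lat <- -0.22906
half_side  <- 0.01  # half the rectangle side, in degrees (assumed value)

map_box <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng = center_lng, lat = center_lat, popup = "El Panecillo - QUITO") %>%
  addRectangles(
    lng1 = center_lng - half_side, lat1 = center_lat - half_side,
    lng2 = center_lng + half_side, lat2 = center_lat + half_side
  )
# Type map_box in an interactive session to display the map
```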

We will use the quakes dataset, which contains longitude and latitude data:

head(quakes)
     lat   long depth mag stations
1 -20.42 181.62   562 4.8       41
2 -20.62 181.03   650 4.2       15
3 -26.00 184.10    42 5.4       43
4 -17.97 181.66   626 4.1       19
5 -20.42 181.96   649 4.0       11
6 -19.68 184.31   195 4.0       12
map7 <- leaflet(quakes) %>% addTiles() %>% addMarkers(clusterOptions = markerClusterOptions())
map7
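In the call above leaflet infers the coordinate columns from their names (lat and long). The columns can also be mapped explicitly with the formula syntax, which is useful when they are named differently. As a sketch, the following keeps only the stronger earthquakes and draws circle markers whose radius depends on the magnitude; the threshold 5.5 and the names strong and map_strong are our own choices.

```r
library(leaflet)

# Keep only earthquakes of magnitude 5.5 or higher (arbitrary threshold)
strong <- subset(quakes, mag >= 5.5)

# ~long and ~lat refer to columns of the data passed to leaflet()
map_strong <- leaflet(strong) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~long, lat = ~lat,
    radius = ~mag,                      # marker radius in pixels
    popup = ~paste("Magnitude:", mag)
  )
# Type map_strong in an interactive session to display the map
```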