Data Visualization with R - IBM
Bar chart, histogram, pie chart
We will use the ggplot2 library for the visualizations. If it is not installed, you can install it with:
# install.packages('ggplot2')
To use the library, load it with the following commands:
library(rlang)
library(ggplot2)
We will use the mtcars dataset, which we can call directly:
mtcars
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1  4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1  4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1  4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0  3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0  3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0  3    1
 [ reached 'max' / getOption("max.print") -- omitted 26 rows ]
Let's see which data types appear in the dataset:
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
There are 32 observations (one per car model) and 11 variables (the characteristics of each car).
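The same figures can be read programmatically instead of from the str() printout. This is a minimal sketch, assuming a fresh R session in which mtcars has not been modified yet; the names n_obs, n_vars and col_types are illustrative only.

```r
# Number of rows (car models) and columns (characteristics) in mtcars
n_obs <- nrow(mtcars)
n_vars <- ncol(mtcars)

# Class of every column, as a named character vector
col_types <- sapply(mtcars, class)

print(c(observations = n_obs, variables = n_vars))
print(table(col_types))
```

Every column is stored as numeric, including categorical-looking ones such as cyl, am and gear, which matters later when a variable is mapped to color or shape.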
Bar chart with qplot
Next, we make a bar chart of the cyl variable:
qplot(mtcars$cyl, geom = "bar", fill = I("blue"), xlab = "Cylinders", ylab = "Number of vehicles",
    main = "Cylinders in mtcars")
Histograms with qplot
We use the same mtcars dataset, take the hp (horsepower) column, and call the same qplot function:
qplot(mtcars$hp, geom = "histogram")
We can improve the histogram with two parameters:
- binwidth = 25 divides the range of values into bins of width 25.
- colour sets the color of the bar borders.
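To make the effect of a bin width of 25 concrete, the counts can be sketched in base R. The names bin_width, breaks and bin_counts are illustrative, and the bin edges here are chosen as multiples of 25; ggplot2 may place its edges differently by default, so the bar heights in the plot need not match these counts exactly.

```r
# Horsepower values from the built-in mtcars dataset
hp <- mtcars$hp

# Bin edges every 25 units, extended so they cover the full range of hp
bin_width <- 25
breaks <- seq(floor(min(hp) / bin_width) * bin_width,
              ceiling(max(hp) / bin_width) * bin_width,
              by = bin_width)

# Assign each value to a bin and count how many cars fall into each one
bin_counts <- table(cut(hp, breaks = breaks, include.lowest = TRUE))
print(bin_counts)
```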
qplot(mtcars$hp, geom = "histogram", binwidth = 25, colour = I("black"))
The xlim = c(50, 350) parameter sets the range of the x axis shown in the histogram:
qplot(mtcars$hp, geom = "histogram", binwidth = 25, colour = I("black"), xlim = c(50,
    350))
Finally, we can label the x and y axes and add a title:
qplot(mtcars$hp, geom = "histogram", binwidth = 25, colour = I("black"), xlab = "Horsepower",
    ylab = "Number of cars", main = "Histogram")
Pie chart
We use the mtcars dataset and the carb (carburetors) column. Since ggplot2 has no function that creates a pie chart directly, we first create a stacked bar chart:
barp <- ggplot(mtcars, aes(x = 1, y = sort(mtcars$carb), fill = sort(mtcars$carb))) +
    geom_bar(stat = "identity")
print(barp)
To turn the stacked bar into a pie chart, we add coord_polar(theta = "y"):
barp <- barp + coord_polar(theta = "y")
print(barp)
To clean up the image, we can use the theme function to remove the axes and the background:
barp <- barp + theme(axis.line = element_blank(), axis.text.x = element_blank(),
    axis.text.y = element_blank(), axis.ticks = element_blank(), panel.background = element_blank()) +
    labs(y = "Carburetors")
print(barp)
Scatter plot, regression plots
Scatter plot
We use the mtcars dataset and make a scatter plot of the variables mpg (miles per gallon) and wt (weight):
qplot(mpg, wt, data = mtcars)
To customize the scatter plot further, we can use geom_point(shape = 2); changing the value of shape changes the shape of the points.
ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point(shape = 2)
We can bring a third variable into the scatter plot. For example, take cyl (cylinders) and group the points by number of cylinders.
First we convert cyl to a factor and add it to the data frame as a new column:
mtcars$cylFactor <- factor(mtcars$cyl)
ggplot(mtcars, aes(x = mpg, y = wt, shape = cylFactor)) + geom_point()
We can also use color instead of shape to distinguish the values of cyl. Mapping color to the numeric column cyl gives a continuous color gradient:
ggplot(mtcars, aes(x = mpg, y = wt, color = cyl)) + geom_point()
Mapping color to cylFactor gives one distinct color per level:
ggplot(mtcars, aes(x = mpg, y = wt, color = cylFactor)) + geom_point()
The difference between the last two plots is the type of the variable mapped to color: cyl is numeric, while cylFactor is a factor.
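The distinction can be checked directly in base R. This is a small sketch with illustrative variable names (cyl_numeric, cyl_factor) that leaves mtcars itself untouched.

```r
# The original column is numeric, so ggplot2 treats it as continuous
cyl_numeric <- mtcars$cyl

# The factor version has a fixed set of levels, so ggplot2 treats it as discrete
cyl_factor <- factor(mtcars$cyl)

print(class(cyl_numeric))
print(class(cyl_factor))
print(levels(cyl_factor))

# Number of cars in each level
print(table(cyl_factor))
```

Because the factor has only three levels, a discrete legend with three entries is enough, whereas the numeric version is drawn with a color bar.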
Now we can add axis labels, a legend title, and a plot title:
ggplot(mtcars, aes(x = mpg, y = wt, color = cylFactor)) + geom_point() + xlab("Miles per Gallon") +
    ylab("Weight") + labs(colour = "Cylinders") + ggtitle("Scatter plot")
Line and regression plots
For the line plot we use the built-in EuStockMarkets dataset, which contains historical data for four European stock indices:
EuStockDF <- as.data.frame(EuStockMarkets)
head(EuStockDF)
      DAX    SMI    CAC   FTSE
1 1628.75 1678.1 1772.8 2443.6
2 1613.63 1688.5 1750.5 2460.2
3 1606.51 1678.6 1718.0 2448.2
4 1621.04 1684.1 1708.1 2470.4
5 1618.16 1686.6 1723.1 2484.7
6 1610.61 1671.6 1714.3 2466.8
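EuStockMarkets is a time series object, and as.data.frame keeps only the index values, not the dates. If a real time axis is wanted, one option is to recover the observation times with time(). The sketch below uses a hypothetical copy named eu_with_time and a hypothetical Year column, so the EuStockDF used in the rest of the text is unaffected.

```r
# Hypothetical copy of the data, so EuStockDF stays as shown in the text
eu_with_time <- as.data.frame(EuStockMarkets)

# time() returns the observation times of a ts object as decimal years
eu_with_time$Year <- as.numeric(time(EuStockMarkets))

print(head(eu_with_time[, c("Year", "DAX")]))
print(range(eu_with_time$Year))
```

The Year column could then be used as the x aesthetic in place of the row index.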
We use the DAX variable for the line plot:
ggplot(EuStockDF, aes(x = c(1:nrow(EuStockDF)), y = DAX)) + geom_line()
We can change the thickness of the line with size = 1.5 and its color with colour = "light blue", both passed to geom_line:
ggplot(EuStockDF, aes(x = c(1:nrow(EuStockDF)), y = DAX)) + geom_line(size = 1.5,
    colour = "light blue") + labs(x = "Stocks")
Multiple line plots
We now plot the DAX, SMI and CAC variables on a single chart:
dax_smi_plt <- ggplot() + geom_line(data = EuStockDF, aes(x = c(1:nrow(EuStockDF)),
    y = DAX), size = 1.2, colour = "light blue") + geom_line(data = EuStockDF, aes(x = c(1:nrow(EuStockDF)),
    y = SMI), size = 1.2, colour = "red") + geom_line(data = EuStockDF, aes(x = c(1:nrow(EuStockDF)),
    y = CAC), size = 1.2, colour = "green") + labs(x = "Time", y = "Stocks") + ggtitle("Eu Stocks")
print(dax_smi_plt)
Linear regression
We take the mtcars dataset and plot a linear regression between the variables mpg and wt using geom_smooth:
ggplot(mtcars, aes(x = mpg, y = wt)) + geom_point() + geom_smooth(method = "lm",
    se = FALSE, color = "red")
If we add a third variable, the number of cylinders cylFactor (a factor), one line is fitted per group, and setting se = TRUE displays the confidence interval around each fitted line.
ggplot(mtcars, aes(x = mpg, y = wt, color = cylFactor)) + geom_point() + geom_smooth(method = "lm",
    se = TRUE, color = "red") + xlab("Miles per Gallon") + ylab("Weight") + labs(colour = "Cylinders") +
    ggtitle("Linear Regression")
Gaussian regression
Setting method = "auto" lets geom_smooth choose the smoother. For a dataset this small it uses loess (local regression), which the course calls the Gaussian model:
ggplot(mtcars, aes(x = mpg, y = wt, color = cylFactor)) + geom_point() + geom_smooth(method = "auto",
    se = TRUE, color = "red") + xlab("Miles per Gallon") + ylab("Weight") + labs(colour = "Cylinders") +
    ggtitle("Gaussian Regression")
A smoother of this kind is more flexible than a straight line and can follow the data more closely than a linear regression model.
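One way to compare how closely each model follows the data is the residual sum of squares. The sketch below fits both models with base R on the whole dataset rather than per cylinder group, and assumes that loess with its default settings is a fair stand-in for what geom_smooth picks automatically here. The names fit_lm, fit_loess, rss_lm and rss_loess are illustrative.

```r
# Straight-line fit, the model drawn by geom_smooth(method = "lm")
fit_lm <- lm(wt ~ mpg, data = mtcars)

# Local regression with the default loess settings
fit_loess <- loess(wt ~ mpg, data = mtcars)

# Residual sum of squares: a lower value means the curve is closer to the points
rss_lm <- sum(residuals(fit_lm)^2)
rss_loess <- sum(residuals(fit_loess)^2)

print(c(lm = rss_lm, loess = rss_loess))
```

A closer fit on the same data is not automatically a better model, since a more flexible curve can also follow noise.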
Specialized visualization tools
Word cloud
A word cloud is an image made up of the words that appear in a particular text or topic. The size of each word indicates its frequency or importance.
We will use the following libraries:
# install.packages('tm')
# install.packages('wordcloud')
library(tm)
library(wordcloud)
We create a folder named files_path in the working directory and then use the download.file function to download the file into it.
# Creates a folder with the given name (files_path) in the current working directory
dir.create("files_path")
# destfile is the path where the download is saved, including the file name
download.file("https://ibm.box.com/shared/static/cmid70rpa7xe4ocitcga1bve7r0kqnia.txt",
    destfile = "files_path/churchill_speeches.txt", quiet = TRUE)
Now we read the downloaded data and use the inspect function to look at its contents, which are plain text.
dirPath <- "files_path/"
speech <- Corpus(DirSource(dirPath))
inspect(speech)
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 1
churchill_speeches.txt
At present we lie within a few minutes’ striking distance of the French, Dutch and Belgian coasts, and within a few hours of the great aerodromes of Central Europe. We are even within canon-shot of the Continent.\n\nSo close as that! Is it prudent, is it possible, however much we might desire it, to turn our backs upon Europe and ignore whatever may happen there? I have come to the conclusion – reluctantly I admit – that we cannot get away. Here we are and we must make the best of it. But do not underrate the risks – the grevious risks – we have to run.\n\n\nThis is only the beginning of the reckoning. This is only the first sip, the first foretaste of a bitter cup which will be proffered to us year by year unless, by a supreme recovery of moral health and martial vigour, we arise again and take our stand for freedom as in the olden time.\n\n\nI would say to the House, as I said to those who have joined this Government: I have nothing to offer but blood, toil, tears and sweat. We have before us an ordeal of the most grievous kind. We have before us many, many long months of struggle and of suffering. You ask, what is our policy? I can say: It is to wage war, by sea, land and air, with all our might and with all the strength that God can give us; to wage war against a monstrous tyranny, never surpassed in the dark, lamentable catalogue of human crime. This is our policy. You ask, what is our aim?\n\nI can answer in one word: It is victory, victory at all costs, victory in spite of all terror, victory, however long and hard the road may be, for without victory, there is no survival.\n\n\nEven though large tracts of Europe and many old and famous states have fallen or may fall into the grip of the Gestapo and all the odious apparatus of Nazi rule, we shall not flag or fail. 
We shall go on to the end, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing confidence and growing strength in the air, we shall defend our island, whatever the cost may be, we shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and in the streets, we shall fight in the hills; we shall never surrender.\n\n\nThe battle of France is over. I expect that the Battle of Britain is about to begin. Upon this battle depends the survival of Christian civilisation. Upon it depends our own British life, and the long continuity of our institutions and our Empire. The whole fury and might of the enemy must very soon be turned upon us. Hitler knows that he will have to break us in this island or lose the war.\n\nIf we can stand up to him, all Europe may be free and the life of the world may move forward into broad, sunlit uplands. But if we fail, then the whole world, including the United States, including all that we have known and cared for, will sink into the abyss of a new Dark Age made more sinister, and perhaps more protracted, by the lights of perverted science. Let us therefore brace ourselves to our duties, and so bear ourselves that, if the British Empire and its Commonwealth last for a thousand years, men will still say, ‘This was their finest hour.\n\n\nThe gratitude of every home in our Island, in our Empire, and indeed throughout the world, except in the abodes of the guilty, goes out to the British airmen who, undaunted by odds, unwearied in their constant challenge and mortal danger, are turning the tide of the World War by their prowess and by their devotion. Never in the field of human conflict was so much owed by so many to so few.\n\n\nFrom Stettin in the Baltic to Trieste in the Adriatic, an iron curtain has descended across the Continent. Behind that line lie all the capitals of the ancient states of Central and Eastern Europe. 
Warsaw, Berlin, Prague, Vienna, Budapest, Belgrade, Bucharest and Sofia, all these famous cities and the populations around them lie in what I must call the Soviet sphere.\n\n\nI am very glad that Mr Attlee described my speeches in the war as expressing the will not only of Parliament but of the whole nation. Their will was resolute and remorseless and, as it proved, unconquerable. It fell to me to express it, and if I found the right words you must remember that I have always earned my living by my pen and by my tongue. It was the nation and race dwelling all round the globe that had the lion heart. I had the luck to be called upon to give the roar.
To clean the data, we follow these steps:
- Convert the text to lowercase with tolower.
- Remove numbers with removeNumbers.
- Remove common words such as "the" and "we" with removeWords and stopwords('english').
- Remove punctuation with removePunctuation.
- Collapse extra whitespace with stripWhitespace.
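To see what each step does, the same sequence can be sketched on a single sentence using only base R. This is an illustration rather than the tm implementation: sample_text is made up, demo_stopwords is a tiny hypothetical stop-word list standing in for stopwords('english'), and the final trimws call is an extra step that stripWhitespace does not perform.

```r
# Made-up sample sentence and a tiny hypothetical stop-word list
sample_text <- "We shall fight on the beaches in 1940, we shall NEVER surrender!"
demo_stopwords <- c("we", "on", "the", "in")

# 1. Lowercase
step_lower <- tolower(sample_text)

# 2. Remove digits
step_numbers <- gsub("[[:digit:]]+", "", step_lower)

# 3. Remove stop words, matching whole words only
stop_pattern <- paste0("\\b(", paste(demo_stopwords, collapse = "|"), ")\\b")
step_stopwords <- gsub(stop_pattern, "", step_numbers, perl = TRUE)

# 4. Remove punctuation
step_punct <- gsub("[[:punct:]]+", "", step_stopwords)

# 5. Collapse runs of whitespace, then trim the ends
cleaned <- trimws(gsub("\\s+", " ", step_punct))
print(cleaned)
```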
speech <- tm_map(speech, content_transformer(tolower))
speech <- tm_map(speech, removeNumbers)
speech <- tm_map(speech, removeWords, stopwords("english"))
# to remove specific words of our own, we pass them to removeWords
speech <- tm_map(speech, removeWords, c("squirrelled"))
speech <- tm_map(speech, removePunctuation)
speech <- tm_map(speech, stripWhitespace)
Now we convert the text into a term-document matrix:
dtm <- TermDocumentMatrix(speech)
m <- as.matrix(dtm)
We sort the words by frequency, from highest to lowest:
v <- sort(rowSums(m), decreasing = TRUE)
v
shall fight may will europe upon victory
11 7 6 6 5 5 5
war can many must world – battle
5 4 4 4 4 4 3
british empire island lie long might never
3 3 3 3 3 3 3
say states whole within air ask central
3 3 3 3 2 2 2
continent dark depends even fail famous first
2 2 2 2 2 2 2
france give growing however human including life
2 2 2 2 2 2 2
much nation policy risks stand strength survival
2 2 2 2 2 2 2
wage whatever year abodes abyss across admit
2 2 2 1 1 1 1
adriatic aerodromes age aim airmen always ancient
1 1 1 1 1 1 1
answer apparatus arise around attlee away backs
1 1 1 1 1 1 1
baltic beaches bear begin beginning
1 1 1 1 1
[ reached getOption("max.print") -- omitted 218 entries ]
Now we turn it into a data frame:
d <- data.frame(word = names(v), freq = v)
head(d, 10)
           word freq
shall     shall   11
fight     fight    7
may         may    6
will       will    6
europe   europe    5
upon       upon    5
victory victory    5
war         war    5
can         can    4
many       many    4
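The step from matrix to frequency table can be followed on a tiny example. The matrix below is hypothetical (three terms, two documents, names m_demo, v_demo and d_demo are illustrative), but the calls are the same ones used above.

```r
# Hypothetical term-document matrix: rows are terms, columns are documents
m_demo <- matrix(c(2, 0, 1,
                   3, 1, 1),
                 nrow = 3,
                 dimnames = list(c("fight", "shall", "war"), c("doc1", "doc2")))

# Total count of each term across all documents, most frequent first
v_demo <- sort(rowSums(m_demo), decreasing = TRUE)

# One row per term, with its name and its total frequency
d_demo <- data.frame(word = names(v_demo), freq = v_demo)
print(d_demo)
```

With a single document, as in the Churchill corpus, rowSums simply returns that document's counts.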
Now we plot the word cloud:
wordcloud(words = d$word, freq = d$freq)
We can set the minimum frequency a word needs in order to appear in the cloud:
wordcloud(words = d$word, freq = d$freq, min.freq = 1)
We can also cap the number of words shown and add colors for a clearer picture:
wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words = 250, colors = brewer.pal(8,
    "Dark2"))
Finally, we can make the plot more attractive by setting random.order = FALSE, which plots the words in decreasing order of frequency:
wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words = 250, colors = brewer.pal(8,
    "Dark2"), random.order = FALSE)
Creating maps in R
We will use the leaflet library:
library(leaflet)
We will locate the Virgin of El Panecillo in Quito, Ecuador. If you do not know how to find the longitude and latitude, you can do the following:
- On your computer, open Google Maps.
- Right-click the place or area on the map. Select the latitude and longitude to copy the coordinates automatically.
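The copied text lists latitude first and longitude second, while addMarkers takes lng and lat as separate named arguments, so the two are easy to swap. This is a small sketch for splitting the copied pair, assuming it arrives as a comma-separated string; the variable names copied, parts, lat and lng are illustrative.

```r
# Coordinates as copied from the map: latitude first, then longitude
copied <- "-0.22906, -78.5181"

# Split on the comma, drop surrounding spaces, and convert to numbers
parts <- as.numeric(trimws(strsplit(copied, ",")[[1]]))
lat <- parts[1]
lng <- parts[2]

# Sanity check: valid latitudes and longitudes fall within these ranges
stopifnot(lat >= -90, lat <= 90, lng >= -180, lng <= 180)

print(c(lng = lng, lat = lat))
```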
map <- leaflet() %>% addTiles() %>% addMarkers(lng = -78.5181, lat = -0.22906, popup = "El Panecillo - QUITO")
map
We can also draw a rectangle around the marker with addRectangles:
map2 <- leaflet() %>% addTiles() %>% addMarkers(lng = -78.5181, lat = -0.22906, popup = "El Panecillo - QUITO") %>%
    addRectangles(-78.5081, -0.21906, -78.5282, -0.23906)
map2
We will use the quakes dataset, which contains longitude and latitude data:
head(quakes)
     lat   long depth mag stations
1 -20.42 181.62   562 4.8       41
2 -20.62 181.03   650 4.2       15
3 -26.00 184.10    42 5.4       43
4 -17.97 181.66   626 4.1       19
5 -20.42 181.96   649 4.0       11
6 -19.68 184.31   195 4.0       12
With this many points, clusterOptions groups nearby markers into clusters:
map7 <- leaflet(quakes) %>% addTiles() %>% addMarkers(clusterOptions = markerClusterOptions())
map7
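In the map above, leaflet works out which columns hold the coordinates from their names. The mapping can also be stated explicitly with formulas, which additionally lets each marker carry information from the data. The following is a sketch of that variant, assuming the leaflet package is installed; the name quake_map and the choice of magnitude for radius and popup are illustrative.

```r
library(leaflet)

# Map the coordinate columns explicitly instead of relying on name guessing
quake_map <- leaflet(quakes) %>%
    addTiles() %>%
    addCircleMarkers(lng = ~long, lat = ~lat,
                     radius = ~mag,                       # circle radius in pixels
                     popup = ~paste("Magnitude:", mag),   # text shown on click
                     clusterOptions = markerClusterOptions())

# Typing quake_map at the console displays the map
```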