Source file ⇒ The_Analytics_Edge_edX_MIT15.071x_June2015_6.rmd
These are my notes for the lectures of "The Analytics Edge" (edX MIT 15.071x, June 2015) by Professor Dimitris Bertsimas. The goal of these notes is to provide reproducible R code for all the lectures.
A good list of resources on using R for ggplot2 and base R graphics is given below. (The ggplot2 package, created by Hadley Wickham (Chief Scientist @ RStudio), deserves a special mention here: the popularity of R, and of visualisation in R in particular, owes much to this great package.)
Why Visualization?
QUICK QUESTION
Normally, a scatterplot only allows us to visualize two dimensions - one on the x-axis, and one on the y-axis. In the previous video we were able to show a third dimension on the scatterplot using what attribute?
Ans:Color
EXPLANATION:On slide 3, we show the scatterplot from slide 2, but with the number of cylinders shown by the color of the points. This allows us to visualize a third dimension of our data.
Initial Exploration Shows a Relationship (slide 2)
slide2
Explore Further: Color by Factor (slide 3)
slide3
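To reproduce this idea in base R, here is a minimal sketch (not from the lecture) using the built-in mtcars data: displacement vs. mileage, with the number of cylinders, a third dimension, encoded as point color.
# A sketch with base R and the built-in mtcars data
plot(mtcars$disp, mtcars$mpg, col = factor(mtcars$cyl), pch = 19,
     xlab = "Displacement", ylab = "Miles per Gallon")
legend("topright", legend = levels(factor(mtcars$cyl)),
       col = 1:3, pch = 19, title = "Cylinders")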
The World Health Organization
“WHO is the authority for health within the United Nations system. It is responsible for providing leadership on global health matters, shaping the health research agenda, setting norms and standards, articulating evidence-based policy options, providing technical support to countries and monitoring and assessing health trends.”
who
The World Health Report
better
QUICK QUESTION
Why is it particularly helpful for WHO to provide data visualizations? Select all that apply.
Ans:When communicating information to the general public, a visualization like the Energy Consumption one is much easier to absorb than a table of numbers would be.
Visualizations can easily be used by policymakers and others who wish to present data from WHO.
EXPLANATION:While there are other ways to display the data given in many visualizations (like tables), visualizations help to better communicate data to the public and can easily be used by others in presentations.
What is a Data Visualization?
Anscombe’s Quartet
AnscombeQuartet
AnscombeQuartet1
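Since R ships with the anscombe dataset, the quartet is easy to verify yourself. A quick sketch (not from the lecture) showing that the four x/y pairs share nearly identical summary statistics yet plot very differently:
# Nearly identical means and correlations across the four pairs...
sapply(1:4, function(i) c(mean = mean(anscombe[[paste0("y", i)]]),
                          cor = cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])))
# ...but four very different scatterplots
par(mfrow = c(2, 2))
for (i in 1:4) plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
                    xlab = paste0("x", i), ylab = paste0("y", i))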
ggplot2
Graphics in Base R vs ggplot
Grammar of Graphics
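In ggplot2's grammar, a plot is assembled from data, an aesthetic mapping, and one or more geometry layers. A minimal sketch (using the WHO data that is loaded later in these notes):
# ggplot(data = WHO, aes(x = GNI, y = FertilityRate)) + # data + aesthetic mapping
#   geom_point()                                        # geometry layer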
QUICK QUESTION
In this quick question, we’ll be asking you questions about the following three plots, which we saw in Video 1. We’ll refer to them as the “Scatterplot”, the “Histogram”, and the “US Map”.
The Scatterplot:
Scatterplot_Week8
In the Scatterplot, what are the geometric objects?
Ans:Points
The Histogram:
Histogram_Week8
In the Histogram, what are the geometric objects?
Ans:Bars
The US Map:
USmap_Week8
In the US Map, what are the geometric objects?
Ans:Polygons
EXPLANATION:The geometric objects for the Scatterplot are points, for the Histogram are bars, and for the US Map are polygons (the states). All three plots also define colors.
COLORS AND SHAPES IN R
ptShapes
# VIDEO 4 - A BASIC SCATTERPLOT
# Read in data
WHO = read.csv("WHO.csv")
str(WHO)
## 'data.frame': 194 obs. of 13 variables:
## $ Country : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Region : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
## $ Population : int 29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
## $ Under15 : num 47.4 21.3 27.4 15.2 47.6 ...
## $ Over60 : num 3.82 14.93 7.17 22.86 3.84 ...
## $ FertilityRate : num 5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
## $ LifeExpectancy : int 60 74 73 82 51 75 76 71 82 81 ...
## $ ChildMortality : num 98.5 16.7 20 3.2 163.5 ...
## $ CellularSubscribers : num 54.3 96.4 99 75.5 48.4 ...
## $ LiteracyRate : num NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
## $ GNI : num 1140 8820 8310 NA 5230 ...
## $ PrimarySchoolEnrollmentMale : num NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
## $ PrimarySchoolEnrollmentFemale: num NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
# Plot from Week 1
plot(WHO$GNI, WHO$FertilityRate)
# Let's redo this using ggplot
# Install and load the ggplot2 library:
#install.packages("ggplot2")
suppressPackageStartupMessages(library(ggplot2))
# Create the ggplot object with the data and the aesthetic mapping:
scatterplot = ggplot(WHO, aes(x = GNI, y = FertilityRate))
# Add the geom_point geometry
scatterplot + geom_point()
## Warning: Removed 35 rows containing missing values (geom_point).
# Make a line graph instead:
scatterplot + geom_line()
## Warning: Removed 32 rows containing missing values (geom_path).
# Switch back to our points:
scatterplot + geom_point()
## Warning: Removed 35 rows containing missing values (geom_point).
# Redo the plot with blue triangles instead of circles:
scatterplot + geom_point(color = "blue", size = 3, shape = 17)
## Warning: Removed 35 rows containing missing values (geom_point).
# Another option:
scatterplot + geom_point(color = "darkred", size = 3, shape = 8)
## Warning: Removed 35 rows containing missing values (geom_point).
#COLORS AND SHAPES IN R
#If you want to see all of the available colors in R, type in your R console:
colors()
## [1] "white" "aliceblue" "antiquewhite"
## [4] "antiquewhite1" "antiquewhite2" "antiquewhite3"
## [7] "antiquewhite4" "aquamarine" "aquamarine1"
## [10] "aquamarine2" "aquamarine3" "aquamarine4"
## [13] "azure" "azure1" "azure2"
## [16] "azure3" "azure4" "beige"
## [19] "bisque" "bisque1" "bisque2"
## [22] "bisque3" "bisque4" "black"
## [25] "blanchedalmond" "blue" "blue1"
## [28] "blue2" "blue3" "blue4"
## [31] "blueviolet" "brown" "brown1"
## [34] "brown2" "brown3" "brown4"
## [37] "burlywood" "burlywood1" "burlywood2"
## [40] "burlywood3" "burlywood4" "cadetblue"
## [43] "cadetblue1" "cadetblue2" "cadetblue3"
## [46] "cadetblue4" "chartreuse" "chartreuse1"
## [49] "chartreuse2" "chartreuse3" "chartreuse4"
## [52] "chocolate" "chocolate1" "chocolate2"
## [55] "chocolate3" "chocolate4" "coral"
## [58] "coral1" "coral2" "coral3"
## [61] "coral4" "cornflowerblue" "cornsilk"
## [64] "cornsilk1" "cornsilk2" "cornsilk3"
## [67] "cornsilk4" "cyan" "cyan1"
## [70] "cyan2" "cyan3" "cyan4"
## [73] "darkblue" "darkcyan" "darkgoldenrod"
## [76] "darkgoldenrod1" "darkgoldenrod2" "darkgoldenrod3"
## [79] "darkgoldenrod4" "darkgray" "darkgreen"
## [82] "darkgrey" "darkkhaki" "darkmagenta"
## [85] "darkolivegreen" "darkolivegreen1" "darkolivegreen2"
## [88] "darkolivegreen3" "darkolivegreen4" "darkorange"
## [91] "darkorange1" "darkorange2" "darkorange3"
## [94] "darkorange4" "darkorchid" "darkorchid1"
## [97] "darkorchid2" "darkorchid3" "darkorchid4"
## [100] "darkred" "darksalmon" "darkseagreen"
## [103] "darkseagreen1" "darkseagreen2" "darkseagreen3"
## [106] "darkseagreen4" "darkslateblue" "darkslategray"
## [109] "darkslategray1" "darkslategray2" "darkslategray3"
## [112] "darkslategray4" "darkslategrey" "darkturquoise"
## [115] "darkviolet" "deeppink" "deeppink1"
## [118] "deeppink2" "deeppink3" "deeppink4"
## [121] "deepskyblue" "deepskyblue1" "deepskyblue2"
## [124] "deepskyblue3" "deepskyblue4" "dimgray"
## [127] "dimgrey" "dodgerblue" "dodgerblue1"
## [130] "dodgerblue2" "dodgerblue3" "dodgerblue4"
## [133] "firebrick" "firebrick1" "firebrick2"
## [136] "firebrick3" "firebrick4" "floralwhite"
## [139] "forestgreen" "gainsboro" "ghostwhite"
## [142] "gold" "gold1" "gold2"
## [145] "gold3" "gold4" "goldenrod"
## [148] "goldenrod1" "goldenrod2" "goldenrod3"
## [151] "goldenrod4" "gray" "gray0"
## [154] "gray1" "gray2" "gray3"
## [157] "gray4" "gray5" "gray6"
## [160] "gray7" "gray8" "gray9"
## [163] "gray10" "gray11" "gray12"
## [166] "gray13" "gray14" "gray15"
## [169] "gray16" "gray17" "gray18"
## [172] "gray19" "gray20" "gray21"
## [175] "gray22" "gray23" "gray24"
## [178] "gray25" "gray26" "gray27"
## [181] "gray28" "gray29" "gray30"
## [184] "gray31" "gray32" "gray33"
## [187] "gray34" "gray35" "gray36"
## [190] "gray37" "gray38" "gray39"
## [193] "gray40" "gray41" "gray42"
## [196] "gray43" "gray44" "gray45"
## [199] "gray46" "gray47" "gray48"
## [202] "gray49" "gray50" "gray51"
## [205] "gray52" "gray53" "gray54"
## [208] "gray55" "gray56" "gray57"
## [211] "gray58" "gray59" "gray60"
## [214] "gray61" "gray62" "gray63"
## [217] "gray64" "gray65" "gray66"
## [220] "gray67" "gray68" "gray69"
## [223] "gray70" "gray71" "gray72"
## [226] "gray73" "gray74" "gray75"
## [229] "gray76" "gray77" "gray78"
## [232] "gray79" "gray80" "gray81"
## [235] "gray82" "gray83" "gray84"
## [238] "gray85" "gray86" "gray87"
## [241] "gray88" "gray89" "gray90"
## [244] "gray91" "gray92" "gray93"
## [247] "gray94" "gray95" "gray96"
## [250] "gray97" "gray98" "gray99"
## [253] "gray100" "green" "green1"
## [256] "green2" "green3" "green4"
## [259] "greenyellow" "grey" "grey0"
## [262] "grey1" "grey2" "grey3"
## [265] "grey4" "grey5" "grey6"
## [268] "grey7" "grey8" "grey9"
## [271] "grey10" "grey11" "grey12"
## [274] "grey13" "grey14" "grey15"
## [277] "grey16" "grey17" "grey18"
## [280] "grey19" "grey20" "grey21"
## [283] "grey22" "grey23" "grey24"
## [286] "grey25" "grey26" "grey27"
## [289] "grey28" "grey29" "grey30"
## [292] "grey31" "grey32" "grey33"
## [295] "grey34" "grey35" "grey36"
## [298] "grey37" "grey38" "grey39"
## [301] "grey40" "grey41" "grey42"
## [304] "grey43" "grey44" "grey45"
## [307] "grey46" "grey47" "grey48"
## [310] "grey49" "grey50" "grey51"
## [313] "grey52" "grey53" "grey54"
## [316] "grey55" "grey56" "grey57"
## [319] "grey58" "grey59" "grey60"
## [322] "grey61" "grey62" "grey63"
## [325] "grey64" "grey65" "grey66"
## [328] "grey67" "grey68" "grey69"
## [331] "grey70" "grey71" "grey72"
## [334] "grey73" "grey74" "grey75"
## [337] "grey76" "grey77" "grey78"
## [340] "grey79" "grey80" "grey81"
## [343] "grey82" "grey83" "grey84"
## [346] "grey85" "grey86" "grey87"
## [349] "grey88" "grey89" "grey90"
## [352] "grey91" "grey92" "grey93"
## [355] "grey94" "grey95" "grey96"
## [358] "grey97" "grey98" "grey99"
## [361] "grey100" "honeydew" "honeydew1"
## [364] "honeydew2" "honeydew3" "honeydew4"
## [367] "hotpink" "hotpink1" "hotpink2"
## [370] "hotpink3" "hotpink4" "indianred"
## [373] "indianred1" "indianred2" "indianred3"
## [376] "indianred4" "ivory" "ivory1"
## [379] "ivory2" "ivory3" "ivory4"
## [382] "khaki" "khaki1" "khaki2"
## [385] "khaki3" "khaki4" "lavender"
## [388] "lavenderblush" "lavenderblush1" "lavenderblush2"
## [391] "lavenderblush3" "lavenderblush4" "lawngreen"
## [394] "lemonchiffon" "lemonchiffon1" "lemonchiffon2"
## [397] "lemonchiffon3" "lemonchiffon4" "lightblue"
## [400] "lightblue1" "lightblue2" "lightblue3"
## [403] "lightblue4" "lightcoral" "lightcyan"
## [406] "lightcyan1" "lightcyan2" "lightcyan3"
## [409] "lightcyan4" "lightgoldenrod" "lightgoldenrod1"
## [412] "lightgoldenrod2" "lightgoldenrod3" "lightgoldenrod4"
## [415] "lightgoldenrodyellow" "lightgray" "lightgreen"
## [418] "lightgrey" "lightpink" "lightpink1"
## [421] "lightpink2" "lightpink3" "lightpink4"
## [424] "lightsalmon" "lightsalmon1" "lightsalmon2"
## [427] "lightsalmon3" "lightsalmon4" "lightseagreen"
## [430] "lightskyblue" "lightskyblue1" "lightskyblue2"
## [433] "lightskyblue3" "lightskyblue4" "lightslateblue"
## [436] "lightslategray" "lightslategrey" "lightsteelblue"
## [439] "lightsteelblue1" "lightsteelblue2" "lightsteelblue3"
## [442] "lightsteelblue4" "lightyellow" "lightyellow1"
## [445] "lightyellow2" "lightyellow3" "lightyellow4"
## [448] "limegreen" "linen" "magenta"
## [451] "magenta1" "magenta2" "magenta3"
## [454] "magenta4" "maroon" "maroon1"
## [457] "maroon2" "maroon3" "maroon4"
## [460] "mediumaquamarine" "mediumblue" "mediumorchid"
## [463] "mediumorchid1" "mediumorchid2" "mediumorchid3"
## [466] "mediumorchid4" "mediumpurple" "mediumpurple1"
## [469] "mediumpurple2" "mediumpurple3" "mediumpurple4"
## [472] "mediumseagreen" "mediumslateblue" "mediumspringgreen"
## [475] "mediumturquoise" "mediumvioletred" "midnightblue"
## [478] "mintcream" "mistyrose" "mistyrose1"
## [481] "mistyrose2" "mistyrose3" "mistyrose4"
## [484] "moccasin" "navajowhite" "navajowhite1"
## [487] "navajowhite2" "navajowhite3" "navajowhite4"
## [490] "navy" "navyblue" "oldlace"
## [493] "olivedrab" "olivedrab1" "olivedrab2"
## [496] "olivedrab3" "olivedrab4" "orange"
## [499] "orange1" "orange2" "orange3"
## [502] "orange4" "orangered" "orangered1"
## [505] "orangered2" "orangered3" "orangered4"
## [508] "orchid" "orchid1" "orchid2"
## [511] "orchid3" "orchid4" "palegoldenrod"
## [514] "palegreen" "palegreen1" "palegreen2"
## [517] "palegreen3" "palegreen4" "paleturquoise"
## [520] "paleturquoise1" "paleturquoise2" "paleturquoise3"
## [523] "paleturquoise4" "palevioletred" "palevioletred1"
## [526] "palevioletred2" "palevioletred3" "palevioletred4"
## [529] "papayawhip" "peachpuff" "peachpuff1"
## [532] "peachpuff2" "peachpuff3" "peachpuff4"
## [535] "peru" "pink" "pink1"
## [538] "pink2" "pink3" "pink4"
## [541] "plum" "plum1" "plum2"
## [544] "plum3" "plum4" "powderblue"
## [547] "purple" "purple1" "purple2"
## [550] "purple3" "purple4" "red"
## [553] "red1" "red2" "red3"
## [556] "red4" "rosybrown" "rosybrown1"
## [559] "rosybrown2" "rosybrown3" "rosybrown4"
## [562] "royalblue" "royalblue1" "royalblue2"
## [565] "royalblue3" "royalblue4" "saddlebrown"
## [568] "salmon" "salmon1" "salmon2"
## [571] "salmon3" "salmon4" "sandybrown"
## [574] "seagreen" "seagreen1" "seagreen2"
## [577] "seagreen3" "seagreen4" "seashell"
## [580] "seashell1" "seashell2" "seashell3"
## [583] "seashell4" "sienna" "sienna1"
## [586] "sienna2" "sienna3" "sienna4"
## [589] "skyblue" "skyblue1" "skyblue2"
## [592] "skyblue3" "skyblue4" "slateblue"
## [595] "slateblue1" "slateblue2" "slateblue3"
## [598] "slateblue4" "slategray" "slategray1"
## [601] "slategray2" "slategray3" "slategray4"
## [604] "slategrey" "snow" "snow1"
## [607] "snow2" "snow3" "snow4"
## [610] "springgreen" "springgreen1" "springgreen2"
## [613] "springgreen3" "springgreen4" "steelblue"
## [616] "steelblue1" "steelblue2" "steelblue3"
## [619] "steelblue4" "tan" "tan1"
## [622] "tan2" "tan3" "tan4"
## [625] "thistle" "thistle1" "thistle2"
## [628] "thistle3" "thistle4" "tomato"
## [631] "tomato1" "tomato2" "tomato3"
## [634] "tomato4" "turquoise" "turquoise1"
## [637] "turquoise2" "turquoise3" "turquoise4"
## [640] "violet" "violetred" "violetred1"
## [643] "violetred2" "violetred3" "violetred4"
## [646] "wheat" "wheat1" "wheat2"
## [649] "wheat3" "wheat4" "whitesmoke"
## [652] "yellow" "yellow1" "yellow2"
## [655] "yellow3" "yellow4" "yellowgreen"
# Add a title to the plot:
scatterplot + geom_point(colour = "blue", size = 3, shape = 17) + ggtitle("Fertility Rate vs. Gross National Income")
## Warning: Removed 35 rows containing missing values (geom_point).
# Save our plot:
fertilityGNIplot = scatterplot + geom_point(colour = "blue", size = 3, shape = 17) + ggtitle("Fertility Rate vs. Gross National Income")
pdf("MyPlot.pdf")
print(fertilityGNIplot)
## Warning: Removed 35 rows containing missing values (geom_point).
dev.off()
## png
## 2
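#An alternative (a sketch; ggsave() comes with ggplot2 but was not shown in the lecture):
#it infers the file type from the extension, avoiding the pdf()/print()/dev.off() sequence.
#ggsave("MyPlot.pdf", plot = fertilityGNIplot)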
####################################
#QUICK QUESTION
#In R, change the shape of your points to the number 15. What shape are the points now?
scatterplot + geom_point(shape = 15)
## Warning: Removed 35 rows containing missing values (geom_point).
#or
scatterplot + geom_point(color = "darkred", size = 3, shape = 15)
## Warning: Removed 35 rows containing missing values (geom_point).
#Ans:Squares
suppressPackageStartupMessages(library(ggplot2))
# Color the points by region (which is a factor variable):
ggplot(WHO, aes(x = GNI, y = FertilityRate, color = Region)) + geom_point()
## Warning: Removed 35 rows containing missing values (geom_point).
# Color the points according to the country's life expectancy (a numerical variable, hence we get a gradient legend):
ggplot(WHO, aes(x = GNI, y = FertilityRate, color = LifeExpectancy)) + geom_point()
## Warning: Removed 35 rows containing missing values (geom_point).
# Is the fertility rate of a country a good predictor of the percentage of the population under 15?
ggplot(WHO, aes(x = FertilityRate, y = Under15)) + geom_point()
## Warning: Removed 11 rows containing missing values (geom_point).
#The scatter plot shows that the variables are certainly correlated, but as the fertility rate increases, Under15 increases less and less. So this does not look like a linear relationship; we suspect that a log transformation of FertilityRate would be better.
# Let's try a log transformation:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point()
## Warning: Removed 11 rows containing missing values (geom_point).
# Simple linear regression model to predict the percentage of the population under 15, using the log of the fertility rate:
mod = lm(Under15 ~ log(FertilityRate), data = WHO)
summary(mod)
##
## Call:
## lm(formula = Under15 ~ log(FertilityRate), data = WHO)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3131 -1.7742 0.0446 1.7440 7.7174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6540 0.4478 17.09 <2e-16 ***
## log(FertilityRate) 22.0547 0.4175 52.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 181 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.9391, Adjusted R-squared: 0.9387
## F-statistic: 2790 on 1 and 181 DF, p-value: < 2.2e-16
#From the summary output we can see that the coefficients are highly significant and R-squared is 0.9391, implying that the log of FertilityRate is a great predictor of Under15.
#Visualisation was a great way to realise that the log transformation would be better. If we had just used FertilityRate, we would have got an R-squared of about 0.87, a significant decrease.
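#To verify that claim (a quick check, not in the lecture; mod0 is a hypothetical name):
mod0 = lm(Under15 ~ FertilityRate, data = WHO)
summary(mod0)$r.squared #about 0.87, versus 0.9391 with log(FertilityRate)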
# Add this regression line to our plot as another layer:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm")
## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## Warning: Removed 11 rows containing missing values (geom_point).
# 99% confidence interval (By default ggplot2 will draw a 95% shaded region)
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", level = 0.99)
## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## Warning: Removed 11 rows containing missing values (geom_point).
# No confidence interval in the plot
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", se = FALSE)
## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## Warning: Removed 11 rows containing missing values (geom_point).
# Change the color of the regression line:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", colour = "orange")
## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## Warning: Removed 11 rows containing missing values (geom_point).
######################################
#QUICK QUESTION
#Create the fertility rate versus population under 15 plot again:
ggplot(WHO, aes(x = FertilityRate, y = Under15)) + geom_point()
## Warning: Removed 11 rows containing missing values (geom_point).
#Now, color the points by the Region variable.
ggplot(WHO, aes(x = FertilityRate, y = Under15, color = Region)) + geom_point() + scale_color_brewer(palette="Dark2")
## Warning: Removed 11 rows containing missing values (geom_point).
#or
ggplot(WHO, aes(x = FertilityRate, y = Under15, color=Region)) + geom_point()
## Warning: Removed 11 rows containing missing values (geom_point).
#Note: You can add scale_color_brewer(palette="Dark2") to your plot if you are having a hard time distinguishing the colors (this color palette is often better if you are colorblind). To use this option, you should just add scale_color_brewer(palette="Dark2") to your plotting command right after geom_point().
#One region in particular has a lot of countries with a very low fertility rate and a very low percentage of the population under 15. Which region is it?
#Ans:Europe
#EXPLANATION:Most of the countries in Europe have a very low fertility rate and a very low percentage of the population under 15.
The Analytical Policeman
Example: Los Angeles Police Dept.
“I’m not going to get more money. I’m not going to get more cops. I have to be better at using what I have, and that’s what predictive policing is about. If this old street cop can change the way that he thinks about this stuff, then I know that my [officers] can do the same.” - Los Angeles Police Chief Charlie Beck
Role of Analytics
QUICK QUESTION
The Los Angeles Police Department sees the benefits of predictive policing as which of the following? Select all that apply.
Ans:Allowing more intelligent officer deployment
Preventing crime
Using resources more effectively
EXPLANATION:According to the Los Angeles Police Department, predictive policing does not eliminate the need for police officers or increase the rate at which they catch criminals. It does, however, allow more intelligent officer deployment, prevent crime, and help officers use resources more effectively.
Understanding the Past
Crime Over Time
Line graph:
line
Table:
table
Heatmaps:
heat
heat2
A Chicago Crime Heatmap
QUICK QUESTION
For which of the following situations would a heat map be an appropriate visualization choice? Select all that apply.
Ans:Visualizing the areas on a geographical map with the most crime
Comparing crime counts by police district and time throughout a city
EXPLANATION:A heatmap would be useful for the middle two options, because they are trying to visualize crime counts relative to two variables. For the first option, you could use a basic scatterplot with time on the x-axis and amount of crime on the y-axis. For the last option, you could use a bar plot with a bar for each month and the height being the average amount of crime in that month.
# VIDEO 3 - A Basic Line Plot
# Load our data:
mvt = read.csv("mvt.csv", stringsAsFactors=FALSE) #since we have a text field and we want it to be read in properly, we use stringsAsFactors=FALSE
str(mvt)
## 'data.frame': 191641 obs. of 3 variables:
## $ Date : chr "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" ...
## $ Latitude : num 41.8 41.9 42 41.8 41.8 ...
## $ Longitude: num -87.6 -87.7 -87.8 -87.7 -87.6 ...
# Convert the Date variable to a format that R will recognize using strptime():
mvt$Date = strptime(mvt$Date, format="%m/%d/%y %H:%M")
# Extract the hour and the day of the week:
mvt$Weekday = weekdays(mvt$Date)
mvt$Hour = mvt$Date$hour
# Let's take a look at the structure of our data again:
str(mvt)
## 'data.frame': 191641 obs. of 5 variables:
## $ Date : POSIXlt, format: "2012-12-31 23:15:00" "2012-12-31 22:00:00" ...
## $ Latitude : num 41.8 41.9 42 41.8 41.8 ...
## $ Longitude: num -87.6 -87.7 -87.8 -87.7 -87.6 ...
## $ Weekday : chr "Monday" "Monday" "Monday" "Monday" ...
## $ Hour : int 23 22 22 22 21 20 20 20 19 18 ...
#We can see the two new variables added to the existing dataset
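#Aside (a sketch, not from the lecture): strptime() returns POSIXlt, a list-like date
#type, so other components can be extracted the same way as $hour, e.g.:
#mvt$Date$mday # day of the month
#mvt$Date$mon # month, counted from 0 (January is 0)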
#Create a simple line plot - need the total number of crimes on each day of the week. We can get this information by creating a table:
table(mvt$Weekday)
##
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 29284 27397 27118 26316 27319 26791 27416
# Save this table as a data frame:
WeekdayCounts = as.data.frame(table(mvt$Weekday))
str(WeekdayCounts)
## 'data.frame': 7 obs. of 2 variables:
## $ Var1: Factor w/ 7 levels "Friday","Monday",..: 1 2 3 4 5 6 7
## $ Freq: int 29284 27397 27118 26316 27319 26791 27416
# Load the ggplot2 library:
library(ggplot2)
# Create our plot
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1)) #group=1 groups all the data into a single series, since we want one line in our plot
#But the days of the week are slightly out of order, since ggplot2 plots the factor levels in alphabetical order. What we want is chronological order
# Make the "Var1" variable an ORDERED factor variable
WeekdayCounts$Var1 = factor(WeekdayCounts$Var1, ordered=TRUE, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday","Saturday"))
# Try again:
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1))
# Change our x and y labels:
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1)) + xlab("Day of the Week") + ylab("Total Motor Vehicle Thefts")
#############################
#QUICK QUESTION
#Create a new line plot, like the one in Video 3, but add the argument "linetype=2". So the geom_line part of the plotting command should look like:
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1), linetype=2) + xlab("Day of the Week") + ylab("Total Motor Vehicle Thefts")
#What does this do?
#Ans: Makes the line dashed
#Now, change the alpha parameter to 0.3 by replacing "linetype=2" with "alpha=0.3" in the plot command. What does this do?
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1), alpha=0.3) + xlab("Day of the Week") + ylab("Total Motor Vehicle Thefts")
#Ans:Makes the line lighter in color
#EXPLANATION:The linetype parameter makes the line dashed, and the alpha parameter makes the line lighter in color, or more transparent.
# VIDEO 4 - Adding the Hour of the Day
#Break down by week days and hours
# Create a counts table for the weekday and hour:
table(mvt$Weekday, mvt$Hour)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## Friday 1873 932 743 560 473 602 839 1203 1268 1286 938 822
## Monday 1900 825 712 527 415 542 772 1123 1323 1235 971 737
## Saturday 2050 1267 985 836 652 508 541 650 858 1039 946 789
## Sunday 2028 1236 1019 838 607 461 478 483 615 864 884 787
## Thursday 1856 816 696 508 400 534 799 1135 1298 1301 932 731
## Tuesday 1691 777 603 464 414 520 845 1118 1175 1174 948 786
## Wednesday 1814 790 619 469 396 561 862 1140 1329 1237 947 763
##
## 12 13 14 15 16 17 18 19 20 21 22 23
## Friday 1207 857 937 1140 1165 1318 1623 1652 1736 1881 2308 1921
## Monday 1129 824 958 1059 1136 1252 1518 1503 1622 1815 2009 1490
## Saturday 1204 767 963 1086 1055 1084 1348 1390 1570 1702 2078 1750
## Sunday 1192 789 959 1037 1083 1160 1389 1342 1706 1696 2079 1584
## Thursday 1093 752 831 1044 1131 1258 1510 1537 1668 1776 2134 1579
## Tuesday 1108 762 908 1071 1090 1274 1553 1496 1696 1816 2044 1458
## Wednesday 1225 804 863 1075 1076 1289 1580 1507 1718 1748 2093 1511
# Save this to a data frame so that we can use it for visualisation:
DayHourCounts = as.data.frame(table(mvt$Weekday, mvt$Hour))
str(DayHourCounts)
## 'data.frame': 168 obs. of 3 variables:
## $ Var1: Factor w/ 7 levels "Friday","Monday",..: 1 2 3 4 5 6 7 1 2 3 ...
## $ Var2: Factor w/ 24 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ Freq: int 1873 1900 2050 2028 1856 1691 1814 932 825 1267 ...
# Convert the second variable, Var2, to numbers and call it Hour:
DayHourCounts$Hour = as.numeric(as.character(DayHourCounts$Var2))
str(DayHourCounts)
## 'data.frame': 168 obs. of 4 variables:
## $ Var1: Factor w/ 7 levels "Friday","Monday",..: 1 2 3 4 5 6 7 1 2 3 ...
## $ Var2: Factor w/ 24 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ Freq: int 1873 1900 2050 2028 1856 1691 1814 932 825 1267 ...
## $ Hour: num 0 0 0 0 0 0 0 1 1 1 ...
# Create our plot:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1))
#We get 7 lines, one for each day of the week
# Change the colors to differentiate the days of the week
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Var1), size=2)
# Separate the weekends from the weekdays:
DayHourCounts$Type = ifelse((DayHourCounts$Var1 == "Sunday") | (DayHourCounts$Var1 == "Saturday"), "Weekend", "Weekday")
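#An equivalent sketch: the %in% operator expresses the same test more compactly:
#DayHourCounts$Type = ifelse(DayHourCounts$Var1 %in% c("Saturday", "Sunday"), "Weekend", "Weekday")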
# Redo our plot, this time coloring by Type:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Type), size=2)
# Make the lines a little transparent:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Type), size=2, alpha=0.5)
# Fix the order of the days:
DayHourCounts$Var1 = factor(DayHourCounts$Var1, ordered=TRUE, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
# Make a heatmap:
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq))
# Change the label on the legend, and get rid of the y-label:
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq)) + scale_fill_gradient(name="Total MV Thefts") + theme(axis.title.y = element_blank())
# Change the color scheme of the heat map
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq)) + scale_fill_gradient(name="Total MV Thefts", low="white", high="red") + theme(axis.title.y = element_blank())
QUICK QUESTION
In this quick question, we’ll ask you questions about the following plots. Plot (1) is the heat map we generated at the end of Video 4. Plot (2) and Plot (3) were generated by changing argument values of the command used to generate Plot (1).
Plot (1)
Week8_Crime_QQ4_1
Plot (2)
Week8_Crime_QQ4_2
Plot (3)
Week8_Crime_QQ4_3
#Generating the above plots
#Q:Which argument(s) did we change to get Plot (2)? Select all that apply.
# Plot2
ggplot(DayHourCounts, aes(x =Var1 , y = Hour)) + geom_tile(aes(fill = Freq)) + scale_fill_gradient(name="Total MV Thefts", low="white", high="red") + theme(axis.title.y = element_blank())
#Ans:x & y
#Q:Which argument(s) did we change to get Plot (3)? Select all that apply.
#plot3
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill=Freq)) + scale_fill_gradient(name="Total MV Thefts", low="white", high="black") + theme(axis.title.y=element_blank())
#Ans:high
# VIDEO 5 - Maps
# Install and load two new packages:
#install.packages("maps")
#install.packages("ggmap")
library(maps)
library(ggmap)
# Load a map of Chicago into R:
chicago = get_map(location = "chicago", zoom = 11)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=chicago&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=chicago&sensor=false
# Look at the map
ggmap(chicago)
# Plot the first 100 motor vehicle thefts on the chicago map:
ggmap(chicago) + geom_point(data = mvt[1:100,], aes(x = Longitude, y = Latitude))
## Warning: Removed 7 rows containing missing values (geom_point).
# Round our latitude and longitude to 2 digits of accuracy, and create a crime counts data frame for each area:
LatLonCounts = as.data.frame(table(round(mvt$Longitude,2), round(mvt$Latitude,2)))
str(LatLonCounts)
## 'data.frame': 1638 obs. of 3 variables:
## $ Var1: Factor w/ 42 levels "-87.93","-87.92",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Var2: Factor w/ 39 levels "41.64","41.65",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq: int 0 0 0 0 0 0 0 0 0 0 ...
# Convert our Longitude and Latitude variables to numbers (converting a factor variable into a numeric variable):
LatLonCounts$Long = as.numeric(as.character(LatLonCounts$Var1))
LatLonCounts$Lat = as.numeric(as.character(LatLonCounts$Var2))
# Plot these points on our map:
ggmap(chicago) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size=Freq))
## Warning: Removed 615 rows containing missing values (geom_point).
# Change the color scheme:
ggmap(chicago) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size=Freq)) + scale_colour_gradient(low="yellow", high="red")
## Warning: Removed 615 rows containing missing values (geom_point).
# We can also use the geom_tile geometry, similar to a heat map
ggmap(chicago) + geom_tile(data = LatLonCounts, aes(x = Long, y = Lat, alpha = Freq), fill="red")
## Warning: Removed 615 rows containing missing values (geom_tile).
#QUICK QUESTION
#In the previous video, our heatmap was plotting squares out in the water, which seems a little strange. We can fix this by removing the observations from our data frame that have Freq = 0.
#Take a subset of LatLonCounts, only keeping the observations for which Freq > 0, and call it LatLonCounts2.
LatLonCounts2 <- subset(LatLonCounts, Freq > 0)
#Redo the heatmap from the end of Video 5, using LatLonCounts2 instead of LatLonCounts. You should no longer see any squares out in the water, or in any areas where there were no motor vehicle thefts.
ggmap(chicago) + geom_tile(data = LatLonCounts2, aes(x = Long, y = Lat, alpha = Freq), fill = "red")
## Warning: Removed 119 rows containing missing values (geom_tile).
#How many observations did we remove?
nrow(LatLonCounts) - nrow(LatLonCounts2)
## [1] 952
#Ans:952
#EXPLANATION:The number of observations in LatLonCounts2 is 686, and the number of observations in LatLonCounts is 1638. These numbers can be found by using nrow or str.
# VIDEO 6 - Geographical Map on US
# Load our data (provided by the FBI: the total number of murders, state by state):
murders = read.csv("murders.csv")
str(murders)
## 'data.frame': 51 obs. of 6 variables:
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Population : int 4779736 710231 6392017 2915918 37253956 5029196 3574097 897934 601723 19687653 ...
## $ PopulationDensity: num 94.65 1.26 57.05 56.43 244.2 ...
## $ Murders : int 199 31 352 130 1811 117 131 48 131 987 ...
## $ GunMurders : int 135 19 232 93 1257 65 97 38 99 669 ...
## $ GunOwnership : num 0.517 0.578 0.311 0.553 0.213 0.347 0.167 0.255 0.036 0.245 ...
# Load the map of the US which is already available in R
statesMap = map_data("state")
##
## # maps v3.1: updated 'world': all lakes moved to separate new #
## # 'lakes' database. Type '?world' or 'news(package="maps")'. #
str(statesMap)
## 'data.frame': 15537 obs. of 6 variables:
## $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
## $ lat : num 30.4 30.4 30.4 30.3 30.3 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ region : chr "alabama" "alabama" "alabama" "alabama" ...
## $ subregion: chr NA NA NA NA ...
# Plot the map:
ggplot(statesMap, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "black")
# Create a new variable called region with lowercase state names to MATCH statesMap: the State variable in the murders dataset is capitalized, while the region variable in statesMap is lowercase:
murders$region = tolower(murders$State)
# Join the statesMap data and the murders data into one new data frame using merge():
murderMap = merge(statesMap, murders, by="region") #by="region" is the identifier used to match rows of the two data frames
str(murderMap)
## 'data.frame': 15537 obs. of 12 variables:
## $ region : chr "alabama" "alabama" "alabama" "alabama" ...
## $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
## $ lat : num 30.4 30.4 30.4 30.3 30.3 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ subregion : chr NA NA NA NA ...
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Population : int 4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 ...
## $ PopulationDensity: num 94.7 94.7 94.7 94.7 94.7 ...
## $ Murders : int 199 199 199 199 199 199 199 199 199 199 ...
## $ GunMurders : int 135 135 135 135 135 135 135 135 135 135 ...
## $ GunOwnership : num 0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 ...
# Plot the number of murders on our map of the United States:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = Murders)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")
# Plot a map of the population:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = Population)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")
#Let's plot the same map with a different colour scheme
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = Population)) + geom_polygon(color = "black") + scale_fill_gradient(low="green", high="red", guide = "legend")
# Create a new variable that is the number of murders per 100,000 population:
murderMap$MurderRate = murderMap$Murders / murderMap$Population * 100000
# Redo our plot with murder rate:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")
#Let's plot the same map with a different colour scheme
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) + geom_polygon(color = "black") + scale_fill_gradient(low="green", high="red", guide = "legend")
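#A quick check (not in the lecture) of the outlier that is compressing the color scale:
head(sort(tapply(murderMap$MurderRate, murderMap$region, max), decreasing=TRUE), 3)
#the District of Columbia's murder rate dwarfs every state's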
# Redo the plot, limiting the scale to murder rates between 0 and 10, which removes Washington, DC (a tiny, dense district and an outlier):
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend", limits = c(0,10))
#Let's plot the same map with a different colour scheme
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) + geom_polygon(color = "black") + scale_fill_gradient(low="green", high="red", guide = "legend", limits = c(0,10))
###################
#QUICK QUESTION
#Redo the map from Video 6, but this time fill each state with the variable GunOwnership. This shows the percentage of people in each state who own a gun.
#Which of the following states has the highest gun ownership rate? To see the state labels, take a look at the World Atlas map here:http://www.worldatlas.com/webimage/testmaps/usanames.htm
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = GunOwnership)) + geom_polygon(color ="black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")
#Let's plot the same map with a different colour scheme
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = GunOwnership)) + geom_polygon(color ="black") + scale_fill_gradient(low="green", high="red", guide = "legend")
#Ans:Montana
#EXPLANATION:Of these five states, the one that is the most red is Montana.
Eye on Crime
Predictive Policing
Great Power, Great Responsibility
What is the difference?
What does this mean?
Bad Visualizations?
pie1
pie2
pie3
# Unit 7 - Recitation
# VIDEO 3 - Bar Charts
# Load ggplot library
library(ggplot2)
# Load our data, which lives in intl.csv
intl = read.csv("intl.csv")
str(intl)
## 'data.frame': 8 obs. of 2 variables:
## $ Region : Factor w/ 8 levels "Africa","Asia",..: 2 3 6 4 5 1 7 8
## $ PercentOfIntl: num 0.531 0.201 0.098 0.09 0.054 0.02 0.015 0.002
# We want to make a bar plot with region on the X axis
# and Percentage on the y-axis.
ggplot(intl, aes(x=Region, y=PercentOfIntl)) +
geom_bar(stat="identity") +
geom_text(aes(label=PercentOfIntl))
#stat="identity" arg means that use the value of y axis as it is
# Make Region an ordered factor
# We can do this with the reorder() and transform() functions.
intl = transform(intl, Region = reorder(Region, -PercentOfIntl)) #reordering in decreasing order
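#An equivalent sketch without transform(): reorder() alone re-sorts the factor levels
#by the negated percentage, giving the same decreasing order:
#intl$Region = reorder(intl$Region, -intl$PercentOfIntl)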
# Look at the structure
str(intl)
## 'data.frame': 8 obs. of 2 variables:
## $ Region : Factor w/ 8 levels "Asia","Europe",..: 1 2 3 4 5 6 7 8
## ..- attr(*, "scores")= num [1:8(1d)] -0.02 -0.531 -0.201 -0.09 -0.054 -0.098 -0.015 -0.002
## .. ..- attr(*, "dimnames")=List of 1
## .. .. ..$ : chr "Africa" "Asia" "Europe" "Latin Am. & Caribbean" ...
## $ PercentOfIntl: num 0.531 0.201 0.098 0.09 0.054 0.02 0.015 0.002
# Make the percentages out of 100 instead of fractions
intl$PercentOfIntl = intl$PercentOfIntl * 100
# Make the plot
ggplot(intl, aes(x=Region, y=PercentOfIntl)) +
geom_bar(stat="identity", fill="steelblue") +
geom_text(aes(label=PercentOfIntl), vjust=-0.4) + # vjust=-0.4 moves the value labels just above the bars
ylab("Percent of International Students") +
theme(axis.title.x = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1))
Better Visualization?
bar
# VIDEO 5 - World map
# Load the ggmap package
library(ggmap)
# Load in the international student data
intlall = read.csv("intlall.csv",stringsAsFactors=FALSE)
# Let's look at the first few rows
head(intlall)
## Citizenship UG G SpecialUG SpecialG ExhangeVisiting Total
## 1 Albania 3 1 0 0 0 4
## 2 Antigua and Barbuda NA NA NA 1 NA 1
## 3 Argentina NA 19 NA NA NA 19
## 4 Armenia 3 2 NA NA NA 5
## 5 Australia 6 32 NA NA 1 39
## 6 Austria NA 11 NA NA 5 16
# Those NAs are really 0s, and we can replace them easily
intlall[is.na(intlall)] = 0
# Now let's look again
head(intlall)
## Citizenship UG G SpecialUG SpecialG ExhangeVisiting Total
## 1 Albania 3 1 0 0 0 4
## 2 Antigua and Barbuda 0 0 0 1 0 1
## 3 Argentina 0 19 0 0 0 19
## 4 Armenia 3 2 0 0 0 5
## 5 Australia 6 32 0 0 1 39
## 6 Austria 0 11 0 0 5 16
# Load the world map
world_map = map_data("world")
str(world_map)
## 'data.frame': 99338 obs. of 6 variables:
## $ long : num -69.9 -69.9 -69.9 -70 -70.1 ...
## $ lat : num 12.5 12.4 12.4 12.5 12.5 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ region : chr "Aruba" "Aruba" "Aruba" "Aruba" ...
## $ subregion: chr NA NA NA NA ...
# Let's merge intlall into world_map using the merge command
world_map = merge(world_map, intlall, by.x ="region", by.y = "Citizenship")
str(world_map)
## 'data.frame': 63634 obs. of 12 variables:
## $ region : chr "Albania" "Albania" "Albania" "Albania" ...
## $ long : num 20.5 20.4 19.5 20.5 20.4 ...
## $ lat : num 41.3 39.8 42.5 40.1 41.5 ...
## $ group : num 6 6 6 6 6 6 6 6 6 6 ...
## $ order : int 789 822 870 815 786 821 818 779 879 795 ...
## $ subregion : chr NA NA NA NA ...
## $ UG : num 3 3 3 3 3 3 3 3 3 3 ...
## $ G : num 1 1 1 1 1 1 1 1 1 1 ...
## $ SpecialUG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SpecialG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ExhangeVisiting: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Total : int 4 4 4 4 4 4 4 4 4 4 ...
# Plot the map
ggplot(world_map, aes(x=long, y=lat, group=group)) +
geom_polygon(fill="white", color="black") +
coord_map("mercator")
# Reorder the data so that each polygon's points are drawn in the correct sequence
world_map = world_map[order(world_map$group, world_map$order),]
# Redo the plot
ggplot(world_map, aes(x=long, y=lat, group=group)) +
geom_polygon(fill="white", color="black") +
coord_map("mercator")
# Let's look for China
table(intlall$Citizenship)
##
## Albania Antigua and Barbuda
## 1 1
## Argentina Armenia
## 1 1
## Australia Austria
## 1 1
## Bahrain Bangladesh
## 1 1
## Belarus Belgium
## 1 1
## Bolivia Bosnia-Hercegovina
## 1 1
## Brazil Bulgaria
## 1 1
## Cambodia Cameroon
## 1 1
## Canada Chile
## 1 1
## China (People's Republic Of) Colombia
## 1 1
## Costa Rica Cote d'Ivoire
## 1 1
## Croatia Cyprus
## 1 1
## Czech Republic Denmark
## 1 1
## Ecuador Egypt
## 1 1
## El Salvador Estonia
## 1 1
## Ethiopia Finland
## 1 1
## France Georgia
## 1 1
## Germany Ghana
## 1 1
## Greece Guatemala
## 1 1
## Haiti Hong Kong
## 1 1
## Hungary Iceland
## 1 1
## India Indonesia
## 1 1
## Iran Iraq
## 1 1
## Ireland Israel
## 1 1
## Italy Jamaica
## 1 1
## Japan Jordan
## 1 1
## Kazakhstan Kenya
## 1 1
## Korea, South Kuwait
## 1 1
## Latvia Lebanon
## 1 1
## Lithuania Macedonia
## 1 1
## Malaysia Mauritius
## 1 1
## Mexico Moldova
## 1 1
## Mongolia Montenegro
## 1 1
## Morocco Nepal
## 1 1
## Netherlands New Zealand
## 1 1
## Nigeria Norway
## 1 1
## Pakistan Paraguay
## 1 1
## Peru Philippines
## 1 1
## Poland Portugal
## 1 1
## Qatar Romania
## 1 1
## Russia Rwanda
## 1 1
## Saudi Arabia Serbia
## 1 1
## Sierra Leone Singapore
## 1 1
## Slovakia Somalia
## 1 1
## South Africa Spain
## 1 1
## Sri Lanka St. Lucia
## 1 1
## St. Vincent & The Grenadines Sudan
## 1 1
## Sweden Switzerland
## 1 1
## Syria Taiwan
## 1 1
## Tanzania Thailand
## 1 1
## Trinidad & Tobago Tunisia
## 1 1
## Turkey Uganda
## 1 1
## Ukraine United Arab Emirates
## 1 1
## United Kingdom Unknown
## 1 1
## Uruguay Venezuela
## 1 1
## Vietnam West Bank
## 1 1
## Zambia Zimbabwe
## 1 1
# Lets "fix" that in the intlall dataset
intlall$Citizenship[intlall$Citizenship=="China (People's Republic Of)"] = "China"
# Let's look for China again
table(intlall$Citizenship)
##
## Albania Antigua and Barbuda
## 1 1
## Argentina Armenia
## 1 1
## Australia Austria
## 1 1
## Bahrain Bangladesh
## 1 1
## Belarus Belgium
## 1 1
## Bolivia Bosnia-Hercegovina
## 1 1
## Brazil Bulgaria
## 1 1
## Cambodia Cameroon
## 1 1
## Canada Chile
## 1 1
## China Colombia
## 1 1
## Costa Rica Cote d'Ivoire
## 1 1
## Croatia Cyprus
## 1 1
## Czech Republic Denmark
## 1 1
## Ecuador Egypt
## 1 1
## El Salvador Estonia
## 1 1
## Ethiopia Finland
## 1 1
## France Georgia
## 1 1
## Germany Ghana
## 1 1
## Greece Guatemala
## 1 1
## Haiti Hong Kong
## 1 1
## Hungary Iceland
## 1 1
## India Indonesia
## 1 1
## Iran Iraq
## 1 1
## Ireland Israel
## 1 1
## Italy Jamaica
## 1 1
## Japan Jordan
## 1 1
## Kazakhstan Kenya
## 1 1
## Korea, South Kuwait
## 1 1
## Latvia Lebanon
## 1 1
## Lithuania Macedonia
## 1 1
## Malaysia Mauritius
## 1 1
## Mexico Moldova
## 1 1
## Mongolia Montenegro
## 1 1
## Morocco Nepal
## 1 1
## Netherlands New Zealand
## 1 1
## Nigeria Norway
## 1 1
## Pakistan Paraguay
## 1 1
## Peru Philippines
## 1 1
## Poland Portugal
## 1 1
## Qatar Romania
## 1 1
## Russia Rwanda
## 1 1
## Saudi Arabia Serbia
## 1 1
## Sierra Leone Singapore
## 1 1
## Slovakia Somalia
## 1 1
## South Africa Spain
## 1 1
## Sri Lanka St. Lucia
## 1 1
## St. Vincent & The Grenadines Sudan
## 1 1
## Sweden Switzerland
## 1 1
## Syria Taiwan
## 1 1
## Tanzania Thailand
## 1 1
## Trinidad & Tobago Tunisia
## 1 1
## Turkey Uganda
## 1 1
## Ukraine United Arab Emirates
## 1 1
## United Kingdom Unknown
## 1 1
## Uruguay Venezuela
## 1 1
## Vietnam West Bank
## 1 1
## Zambia Zimbabwe
## 1 1
# We'll repeat our merge and order from before
world_map = merge(map_data("world"), intlall,
by.x ="region",
by.y = "Citizenship")
world_map = world_map[order(world_map$group, world_map$order),]
ggplot(world_map, aes(x=long, y=lat, group=group)) +
geom_polygon(aes(fill=Total), color="black") +
coord_map("mercator")
# We can try other projections, like the orthographic projection; this one is visually interesting
ggplot(world_map, aes(x=long, y=lat, group=group)) +
geom_polygon(aes(fill=Total), color="black") +
coord_map("ortho", orientation=c(20, 30, 0))
ggplot(world_map, aes(x=long, y=lat, group=group)) +
geom_polygon(aes(fill=Total), color="black") +
coord_map("ortho", orientation=c(-37, 175, 0))
Bad Scales
scale1
scale2
scale3
Two Rights Make A Wrong
scale4
Family Matters
scale5
# VIDEO 7 - Line Charts
# First, let's make sure we have ggplot2 loaded
library(ggplot2)
# Now let's load our dataframe
households = read.csv("households_2.csv")
str(households)
## 'data.frame': 8 obs. of 7 variables:
## $ Year : int 1970 1980 1990 1995 2000 2005 2010 2012
## $ MarriedWChild : num 40.3 30.9 26.3 25.5 24.1 22.9 20.9 19.6
## $ MarriedWOChild: num 30.3 29.9 29.8 28.9 28.7 28.3 28.8 29.1
## $ OtherFamily : num 10.6 12.9 14.8 15.6 16 16.7 17.4 17.8
## $ MenAlone : num 5.6 8.6 9.7 10.2 10.7 11.3 11.9 12.3
## $ WomenAlone : num 11.5 14 14.9 14.7 14.8 15.3 14.8 15.2
## $ OtherNonfamily: num 1.7 3.6 4.6 5 5.7 5.6 6.2 6.1
#From the above structure it is not clear what to put in the aes argument of ggplot()
#We have to convert the data so that it can be used by ggplot()
#ggplot2 requires one row per (year, group, fraction) combination
#That is, we need to convert the dataset from wide format to long format
# Load reshape2 package
library(reshape2)
# Let's look at the first two columns of our households dataframe
head(households[,1:2])
## Year MarriedWChild
## 1 1970 40.3
## 2 1980 30.9
## 3 1990 26.3
## 4 1995 25.5
## 5 2000 24.1
## 6 2005 22.9
head(households)
## Year MarriedWChild MarriedWOChild OtherFamily MenAlone WomenAlone
## 1 1970 40.3 30.3 10.6 5.6 11.5
## 2 1980 30.9 29.9 12.9 8.6 14.0
## 3 1990 26.3 29.8 14.8 9.7 14.9
## 4 1995 25.5 28.9 15.6 10.2 14.7
## 5 2000 24.1 28.7 16.0 10.7 14.8
## 6 2005 22.9 28.3 16.7 11.3 15.3
## OtherNonfamily
## 1 1.7
## 2 3.6
## 3 4.6
## 4 5.0
## 5 5.7
## 6 5.6
# First few rows of our melted households dataframe
head(melt(households, id="Year"))
## Year variable value
## 1 1970 MarriedWChild 40.3
## 2 1980 MarriedWChild 30.9
## 3 1990 MarriedWChild 26.3
## 4 1995 MarriedWChild 25.5
## 5 2000 MarriedWChild 24.1
## 6 2005 MarriedWChild 22.9
households[,1:3]
## Year MarriedWChild MarriedWOChild
## 1 1970 40.3 30.3
## 2 1980 30.9 29.9
## 3 1990 26.3 29.8
## 4 1995 25.5 28.9
## 5 2000 24.1 28.7
## 6 2005 22.9 28.3
## 7 2010 20.9 28.8
## 8 2012 19.6 29.1
melt(households, id="Year")[1:10,3]
## [1] 40.3 30.9 26.3 25.5 24.1 22.9 20.9 19.6 30.3 29.9
melt(households, id="Year")[1:10,]
## Year variable value
## 1 1970 MarriedWChild 40.3
## 2 1980 MarriedWChild 30.9
## 3 1990 MarriedWChild 26.3
## 4 1995 MarriedWChild 25.5
## 5 2000 MarriedWChild 24.1
## 6 2005 MarriedWChild 22.9
## 7 2010 MarriedWChild 20.9
## 8 2012 MarriedWChild 19.6
## 9 1970 MarriedWOChild 30.3
## 10 1980 MarriedWOChild 29.9
# Plot it
ggplot(melt(households, id="Year"),
aes(x=Year, y=value, color=variable)) +
geom_line(size=2) + geom_point(size=5) +
ylab("Percentage of Households")
ELECTION FORECASTING REVISITED
In the recitation from Unit 3, we used logistic regression on polling data in order to construct US presidential election predictions. We separated our data into a training set, containing data from 2004 and 2008 polls, and a test set, containing the data from 2012 polls. We then proceeded to develop a logistic regression model to forecast the 2012 US presidential election.
In this homework problem, we’ll revisit our logistic regression model from Unit 3, and learn how to plot the output on a map of the United States. Unlike what we did in the Crime lecture, this time we’ll be plotting predictions rather than data!
Note: The maps package contains other built-in maps, including a US county map, a world map, and maps for France and Italy.
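For example, these other maps load the same way as the state map (a quick sketch):
#map_data("county") # US counties
#map_data("world") # world map
#map_data("france") # France
#map_data("italy") # Italy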
library(ggplot2)
library(maps)
library(ggmap)
#load the US map and save it to the variable statesMap, like we did during the Crime lecture:
statesMap = map_data("state")
str(statesMap)
## 'data.frame': 15537 obs. of 6 variables:
## $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
## $ lat : num 30.4 30.4 30.4 30.3 30.3 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ region : chr "alabama" "alabama" "alabama" "alabama" ...
## $ subregion: chr NA NA NA NA ...
#PROBLEM 1.1 - DRAWING A MAP OF THE US
#If you look at the structure of the statesMap data frame using the str function, you should see that there are 6 variables. One of the variables, group, defines the different shapes or polygons on the map. Sometimes a state may have multiple groups, for example, if it includes islands. How many different groups are there?
table(statesMap$group)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 202 149 312 516 79 91 94 10 872 381 233 329 257 256 113
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 397 650 399 566 36 220 30 460 370 373 382 315 238 208 70
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 125 205 78 16 290 21 168 37 733 12 105 238 284 236 172
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 66 304 166 289 1088 59 129 96 15 623 17 17 19 44 448
## 61 62 63
## 373 388 68
#or
length(table(statesMap$group))
## [1] 63
#Ans:63
####################################
#PROBLEM 1.2 - DRAWING A MAP OF THE US (1 point possible)
#Drawing the map of the United States:
ggplot(statesMap, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "black")
#We specified two colors in geom_polygon -- fill and color. Which one defined the color of the outline of the states?
#Ans:color
#EXPLANATION:In our plot, the states are outlined in black, which is the color we specified for the option "color". To confirm that this is changing the outline color of the states, you can try re-running the command with a different color:
ggplot(statesMap, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "pink")
########################################
#PROBLEM 2.1 - COLORING THE STATES BY PREDICTIONS
#Now, let's color the map of the US according to our 2012 US presidential election predictions from the Unit 3 Recitation. We'll rebuild the model here, using the dataset PollingImputed.csv. Be sure to use this file so that you don't have to redo the imputation to fill in the missing values, like we did in the Unit 3 Recitation.
#Load election data
polling <- read.csv("PollingImputed.csv")
# split the data using the subset function into a training set called "Train" that has observations from 2004 and 2008, and a testing set called "Test" that has observations from 2012
Train = subset(polling, Year < 2012)
Test = subset(polling, Year == 2012)
str(Train)
## 'data.frame': 100 obs. of 7 variables:
## $ State : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 4 4 5 5 ...
## $ Year : int 2004 2008 2004 2008 2004 2008 2004 2008 2004 2008 ...
## $ Rasmussen : int 11 21 19 16 5 5 7 10 -11 -27 ...
## $ SurveyUSA : int 18 25 21 18 15 3 5 7 -11 -24 ...
## $ DiffCount : int 5 5 1 6 8 9 8 5 -8 -5 ...
## $ PropR : num 1 1 1 1 1 1 1 1 0 0 ...
## $ Republican: int 1 1 1 1 1 1 1 1 0 0 ...
str(Test)
## 'data.frame': 45 obs. of 7 variables:
## $ State : Factor w/ 50 levels "Alabama","Alaska",..: 3 4 5 6 7 9 10 11 12 13 ...
## $ Year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## $ Rasmussen : int 8 13 -12 3 -7 2 5 -22 31 -22 ...
## $ SurveyUSA : int 5 21 -14 -2 -13 0 8 -24 24 -16 ...
## $ DiffCount : int 4 2 -6 -5 -8 6 4 -2 1 -5 ...
## $ PropR : num 0.833 1 0 0.308 0 ...
## $ Republican: int 1 1 0 0 0 0 1 0 1 0 ...
#Note that we only have 45 states in our testing set, since we are missing observations for Alaska, Delaware, Alabama, Wyoming, and Vermont, so these states will not appear colored in our map.
#create a logistic regression model and make predictions on the test set:
mod2 = glm(Republican~SurveyUSA+DiffCount, data=Train, family="binomial")
TestPrediction = predict(mod2, newdata=Test, type="response")
#TestPrediction gives the predicted probabilities for each state, but let's also create a vector of Republican/Democrat predictions:
TestPredictionBinary = as.numeric(TestPrediction > 0.5)
#Now, put the predictions and state labels in a data.frame so that we can use ggplot:
predictionDataFrame = data.frame(TestPrediction, TestPredictionBinary, Test$State)
#For how many states is our binary prediction 1 (for 2012), corresponding to Republican?
sum(predictionDataFrame$TestPredictionBinary == 1)
## [1] 22
#or from
table(TestPredictionBinary)
## TestPredictionBinary
## 0 1
## 23 22
#Ans:22
#What is the average predicted probability of our model (on the Test set, for 2012)?
mean(predictionDataFrame$TestPrediction)
## [1] 0.4852626
#Ans:0.4852626
####################################################
#PROBLEM 2.2 - COLORING THE STATES BY PREDICTIONS
#Now, we need to merge "predictionDataFrame" with the map data "statesMap", like we did in lecture.
#Before doing so, we need to convert the Test.State variable to lowercase, so that it matches the region variable in statesMap:
predictionDataFrame$region = tolower(predictionDataFrame$Test.State)
#Now, merge the two data frames using the following command:
predictionMap = merge(statesMap, predictionDataFrame, by = "region")
#Lastly, we need to make sure the observations are in order so that the map is drawn properly, by typing the following:
predictionMap = predictionMap[order(predictionMap$order),]
#How many observations are there in predictionMap?
str(predictionMap)
## 'data.frame': 15034 obs. of 9 variables:
## $ region : chr "arizona" "arizona" "arizona" "arizona" ...
## $ long : num -115 -115 -115 -115 -115 ...
## $ lat : num 35 35.1 35.1 35.2 35.2 ...
## $ group : num 2 2 2 2 2 2 2 2 2 2 ...
## $ order : int 204 205 206 207 208 209 210 211 212 213 ...
## $ subregion : chr NA NA NA NA ...
## $ TestPrediction : num 0.974 0.974 0.974 0.974 0.974 ...
## $ TestPredictionBinary: num 1 1 1 1 1 1 1 1 1 1 ...
## $ Test.State : Factor w/ 50 levels "Alabama","Alaska",..: 3 3 3 3 3 3 3 3 3 3 ...
#or
nrow(predictionMap)
## [1] 15034
#Ans:15034
#How many observations are there in statesMap?
str(statesMap)
## 'data.frame': 15537 obs. of 6 variables:
## $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
## $ lat : num 30.4 30.4 30.4 30.3 30.3 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ region : chr "alabama" "alabama" "alabama" "alabama" ...
## $ subregion: chr NA NA NA NA ...
#or
nrow(statesMap)
## [1] 15537
#Ans:15537
########################################################
#PROBLEM 2.3 - COLORING THE STATES BY PREDICTIONS
#When we merged the data in the previous problem, it caused the number of observations to change. Why? Check out the help page for merge by typing ?merge to help you answer this question.
#?merge
#Ans:Because we only make predictions for 45 states, we no longer have observations for some of the states. These observations were removed in the merging process.
#EXPLANATION:When we merge data, it only merged the observations that exist in both data sets. So since we are merging based on the region variable, we will lose all observations that have a value of "region" that doesn't exist in both data frames. You can change this default behavior by using the all.x and all.y arguments of the merge function. For more information, look at the help page for the merge function by typing ?merge in your R console.
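#A minimal sketch (our own addition) of how all.x changes this behavior: keeping every row of statesMap, so the unmatched states get NA predictions instead of being dropped.
predictionMapAll = merge(statesMap, predictionDataFrame, by = "region", all.x = TRUE)
nrow(predictionMapAll)    #equals nrow(statesMap), since no map rows are dropped
sum(is.na(predictionMapAll$TestPrediction))    #rows belonging to the 5 states without predictions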
###############################################
#PROBLEM 2.4 - COLORING THE STATES BY PREDICTIONS
#Now we are ready to color the US map with our predictions! We can color the states according to our binary predictions:
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary)) + geom_polygon(color = "black")
#The states appear light blue and dark blue in this map. Which color represents a Republican prediction?
#Ans:Light blue
#EXPLANATION:Our logistic regression model assigned 1 to Republican and 0 to Democrat. As we can see from the legend, 1 corresponds to a light blue color on the map and 0 corresponds to a dark blue color on the map.
##############################################
#PROBLEM 2.5 - COLORING THE STATES BY PREDICTIONS
#We see that the legend displays a blue gradient for outcomes between 0 and 1. However, when plotting the binary predictions there are only two possible outcomes: 0 or 1. Let's replot the map with discrete outcomes. We can also change the color scheme to blue and red, to match the blue color associated with the Democratic Party in the US and the red color associated with the Republican Party in the US. This can be done with the following command:
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary))+ geom_polygon(color = "black") + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")
#Alternatively, we could plot the probabilities instead of the binary predictions. Change the plot command above to instead color the states by the variable TestPrediction. You should see a gradient of colors ranging from red to blue.
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black") + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")
#Do the colors of the states in the map for TestPrediction look different from the colors of the states in the map with TestPredictionBinary? Why or why not?
#Ans:The two maps look very similar. This is because most of our predicted probabilities are close to 0 or close to 1
#EXPLANATION:The only state that appears purple (the color between red and blue) is the state of Iowa, so the maps look very similar. If you take a look at TestPrediction, you can see that most of our predicted probabilities are very close to 0 or very close to 1. In fact, we don't have a single predicted probability between 0.065 and 0.93.
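#We can check this claim directly (our own addition; output not shown):
summary(TestPrediction)
sort(round(TestPrediction, 3))    #the sorted probabilities cluster near 0 and near 1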
#Let's change the color scheme by changing the arguments low = "blue" and high = "red" to colors of your choice (to see all of the color options in R, type colors() in your R console).
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black") + scale_fill_gradient(low = "green", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")
#Change the color scheme to grayscale by changing the low and high colors to "gray" and "black".
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black") + scale_fill_gradient(low = "gray", high = "black", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")
####################################################
#PROBLEM 3.1 - UNDERSTANDING THE PREDICTIONS
#In the 2012 election, the state of Florida ended up being a very close race. It was ultimately won by the Democratic party.
#Did we predict this state correctly or incorrectly? To see the names and locations of the different states, take a look at the World Atlas map here:http://www.worldatlas.com/webimage/testmaps/usanames.htm
#Ans:We incorrectly predicted this state by predicting that it would be won by the Republican party.
#EXPLANATION:In our prediction map, the state of Florida is colored red, meaning that we predicted Republican. So we incorrectly predicted this state.
######################################
#PROBLEM 3.2 - UNDERSTANDING THE PREDICTIONS
#What was our predicted probability for the state of Florida?
predictionDataFrame[which(predictionDataFrame$Test.State == "Florida"), ]
## TestPrediction TestPredictionBinary Test.State region
## 24 0.9640395 1 Florida florida
#Ans:Our prediction model did not do a very good job of correctly predicting the state of Florida, and we were very confident in our incorrect prediction.
#EXPLANATION:We predicted Republican for the state of Florida with high probability, meaning that we were very confident in our incorrect prediction! Historically, Florida is usually a close race, but our model doesn't know this. The model only uses polling results for the particular year. For Florida in 2012, Survey USA predicted a tie, but other polls predicted Republican, so our model predicted Republican.
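#To see the polling values that drove this prediction (our own inspection; output not shown):
subset(Test, State == "Florida")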
#####################################################
#PROBLEM 4 - PARAMETER SETTINGS
#In this part, we'll explore what the different parameter settings of geom_polygon do. Throughout the problem, use the help page for geom_polygon, which can be accessed by ?geom_polygon. To see more information about a certain parameter, just type a question mark and then the parameter name to get the help page for that parameter. Experiment with different parameter settings to try and replicate the plots!
#?geom_polygon
#We'll be asking questions about the following three plots:
#Plot (1)
#The first plot can be generated by setting the parameter linetype = 3:
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black",linetype=3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")
#Ans:linetype
#Plot (2)
#The second plot can be generated by setting the parameter size = 3:
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black",size=3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")
#Ans:size
###################################################
#PROBLEM 4.2 - PARAMETER SETTINGS
#Plot (3) was created by changing the value of a different geom_polygon parameter to have value 0.3. Which parameter did we use?
#Plot (3) can be created by changing the alpha parameter:
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black",alpha=0.3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")
#Ans:alpha
#The "alpha" parameter controls the transparency or darkness of the color. A smaller value of alpha will make the colors lighter.
The cliche goes that the world is an increasingly interconnected place, and the connections between different entities are often best represented with a graph. Graphs are composed of vertices (also often called “nodes”) and edges connecting those nodes. In this assignment, we will learn how to visualize networks using the igraph package in R.
For this assignment, we will visualize social networking data using anonymized data from Facebook; this data was originally curated in a recent paper about computing social circles in social networks. In our visualizations, the vertices in our network will represent Facebook users and the edges will represent these users being Facebook friends with each other.
The first dataset we will use, edges.csv, contains variables V1 and V2, which label the endpoints of edges in our network. Each row represents a pair of users in our graph who are Facebook friends. For a pair of friends A and B, edges.csv will only contain a single row – the smaller identifier will be listed first in this row. From this row, we will know that A is friends with B and B is friends with A.
The second dataset, users.csv, contains information about the Facebook users, who are the vertices in our network. This file contains the following variables:
id: A unique identifier for this user; this is the value that appears in the rows of edges.csv
gender: An identifier for the gender of a user taking the values A and B. Because the data is anonymized, we don’t know which value refers to males and which value refers to females.
school: An identifier for the school the user attended taking the values A and AB (users with AB attended school A as well as another school B). Because the data is anonymized, we don’t know the schools represented by A and B.
locale: An identifier for the locale of the user taking the values A and B. Because the data is anonymized, we don’t know which value refers to what locale.
#PROBLEM 1.1 - SUMMARIZING THE DATA
#Load the datasets
edges <- read.csv("edges.csv")
str(edges)
## 'data.frame': 146 obs. of 2 variables:
## $ V1: int 4019 4023 4023 4027 3988 3982 3994 3998 3993 3982 ...
## $ V2: int 4026 4031 4030 4032 4021 3986 3998 3999 3995 4021 ...
users <- read.csv("users.csv")
str(users)
## 'data.frame': 59 obs. of 4 variables:
## $ id : int 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 ...
## $ gender: Factor w/ 3 levels "","A","B": 2 3 3 3 3 3 2 3 3 2 ...
## $ school: Factor w/ 3 levels "","A","AB": 2 1 1 1 1 2 1 1 2 1 ...
## $ locale: Factor w/ 3 levels "","A","B": 3 3 3 3 3 3 2 3 3 2 ...
#How many Facebook users are there in our dataset?
nrow(users)
## [1] 59
#Ans:59
#In our dataset, what is the average number of friends per user? Hint: this question is tricky, and it might help to start by thinking about a small example with two users who are friends.
#Average number of friends per user
nrow(edges) * 2 / nrow(users)
## [1] 4.949153
#Ans:4.949153
#EXPLANATION:From str(edges) or nrow(edges), we see that there are 146 pairs of users in our dataset who are Facebook friends. However, each pair (A, B) must be counted twice, because B is a friend of A and A is a friend of B. To think of this in simpler terms, consider a network with just two people, A and B, and a single edge (A, B). Even though there are two vertices and one edge, each user has on average one friend.
#For our network, the average number of friends per user is 292/59=4.95.
#Finally, note that in all likelihood these users have many more Facebook friends overall. We are computing the average number of friends each user has within this dataset, not the average total number of Facebook friends.
############################################################
#PROBLEM 1.2 - SUMMARIZING THE DATA
#Out of all the students who listed a school, what was the most common locale?
table(users$locale, users$school)
##
## A AB
## 3 0 0
## A 6 0 0
## B 31 17 2
#Ans:Locale B
#EXPLANATION:From table(users$locale, users$school), we read that all students who listed a school (A or AB) listed their locale as B.
#####################################
#PROBLEM 1.3 - SUMMARIZING THE DATA
#Is it possible that either school A or B is an all-girls or all-boys school?
table(users$gender, users$school)
##
## A AB
## 1 1 0
## A 11 3 1
## B 28 13 1
#Ans:No
#EXPLANATION:We see from table(users$gender, users$school) that both genders A and B have attended schools A and B.
###################################
#PROBLEM 2.1 - CREATING A NETWORK
#We will be using the igraph package to visualize networks;
#install.packages("igraph")
library(igraph)
##
## Attaching package: 'igraph'
## The following object is masked from 'package:stringr':
##
## %>%
## The following objects are masked from 'package:tidyr':
##
## %>%, crossing
## The following objects are masked from 'package:lubridate':
##
## %--%, union
## The following objects are masked from 'package:dplyr':
##
## %>%, as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
#We can create a new graph object using the graph.data.frame() function. Based on ?graph.data.frame, which of the following commands will create a graph g describing our social network, with the attributes of each user correctly loaded?
#?graph.data.frame
g <- graph.data.frame(edges, FALSE, users)   #undirected: friendship is symmetric (the correct choice)
#g <- graph.data.frame(edges, TRUE, users)   #directed: incorrect here, and running it would overwrite g
plot(g, vertex.size=5, vertex.label=NA)
#Ans:g = graph.data.frame(edges, FALSE, users)
#EXPLANATION:From ?graph.data.frame, we can see that the function expects the first two columns of parameter d to specify the edges in the graph -- our edges object fits this description.
#Our edges are undirected -- if A is a Facebook friend of B then B is a Facebook friend of A. Therefore, we set the directed parameter to FALSE.
#The vertices parameter expects a data frame where the first column is a vertex id and the remaining columns are properties of vertices in our graph. This is the case with our users data frame.
#Note: A directed graph is one where the edges only go one way -- they point from one vertex to another. The other option is an undirected graph, which means that the relations between the vertices are symmetric.
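#A toy illustration (our own addition) of the difference: in an undirected graph an edge contributes to the degree of both endpoints, while in a directed graph it points one way only.
toyEdges = data.frame(V1 = "A", V2 = "B")
toyUndirected = graph.data.frame(toyEdges, directed = FALSE)
degree(toyUndirected)                #A and B each have degree 1
toyDirected = graph.data.frame(toyEdges, directed = TRUE)
degree(toyDirected, mode = "out")    #only A has an outgoing edge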
####################################
#PROBLEM 2.2 - CREATING A NETWORK
#Use the correct command from Problem 2.1 to load the graph g.
#Now, we want to plot our graph. By default, the vertices are large and have text labels of a user's identifier. Because this would clutter the output, we will plot with no text labels and smaller vertices:
g <- graph.data.frame(edges, FALSE, users)
plot(g, vertex.size=5, vertex.label=NA)
#In this graph, there are a number of groups of nodes where all the nodes in each group are connected but the groups are disjoint from one another, forming "islands" in the graph. Such groups are called "connected components," or "components" for short. How many connected components with at least 2 nodes are there in the graph?
#Ans:4
#EXPLANATION:In addition to the large connected component, there is a 4-node component and two 2-node components.
#How many users are there with no friends in the network?
#Ans:7
#EXPLANATION:There are 7 nodes that are not connected to any other nodes. Each forms a 1-node connected component.
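#We can confirm both counts programmatically (our own addition) with igraph's components() function:
comp = components(g)
sum(comp$csize >= 2)    #connected components with at least 2 nodes: 4
sum(comp$csize == 1)    #isolated users, i.e. users with no friends: 7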
####################################
#PROBLEM 2.3 - CREATING A NETWORK
#In our graph, the "degree" of a node is its number of friends. We have already seen that some nodes in our graph have degree 0 (these are the nodes with no friends), while others have much higher degree. We can use degree(g) to compute the degree of all the nodes in our graph g.
degree(g)
## 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 3991 3992 3993 3994 3995
## 7 13 1 0 5 8 1 6 5 3 2 2 5 10 8
## 594 3996 3997 3998 3999 4000 4001 4002 4003 4004 4005 4006 4007 4008 4009
## 3 3 10 13 3 8 1 6 4 9 2 1 3 0 9
## 4010 4011 4012 4013 4014 4015 4016 4017 4018 4019 4020 4021 4022 4023 4024
## 0 3 1 5 11 0 3 8 6 7 7 10 0 17 0
## 4025 4026 4027 4028 4029 4030 4031 4032 4033 4034 4035 4036 4037 4038
## 3 8 6 1 1 18 10 1 2 1 0 1 3 8
#How many users are friends with 10 or more other Facebook users in this network?
sum(degree(g) >= 10)
## [1] 9
#Ans:9
#EXPLANATION:From table(degree(g)) or table(degree(g) >= 10), we can see that there are 9 users with 10 or more friends in this network.
####################################
#PROBLEM 2.4 - CREATING A NETWORK
#In a network, it's often visually useful to draw attention to "important" nodes in the network. While this might mean different things in different contexts, in a social network we might consider a user with a large number of friends to be an important user. From the previous problem, we know this is the same as saying that nodes with a high degree are important users.
#To visually draw attention to these nodes, we will change the size of the vertices so the vertices with high degrees are larger. To do this, we will change the "size" attribute of the vertices of our graph to be an increasing function of their degrees:
#To make vertices with high degree (more friends) larger:
V(g)$size = degree(g)/2+2
#Now that we have specified the vertex size of each vertex, we will no longer use the vertex.size parameter when we plot our graph:
plot(g, vertex.label=NA)
table(V(g)$size)
##
## 2 2.5 3 3.5 4 4.5 5 5.5 6 6.5 7 7.5 8.5 10.5 11
## 7 10 4 9 1 4 4 3 6 2 4 1 2 1 1
#What is the largest size we assigned to any node in our graph?
#Ans:11
#What is the smallest size we assigned to any node in our graph?
#Ans:2
#EXPLANATION:From
table(degree(g))
##
## 0 1 2 3 4 5 6 7 8 9 10 11 13 17 18
## 7 10 4 9 1 4 4 3 6 2 4 1 2 1 1
#or
summary(degree(g))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 4.949 8.000 18.000
#we see that the maximum degree of any node in the graph is 18 and the minimum degree of any node is 0. Therefore, the maximum size of any point is 18/2+2=11, and the minimum size is 0/2+2=2.
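#A quick check of the assigned sizes (our own addition) confirms this range:
range(V(g)$size)
## [1]  2 11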
##########################
#PROBLEM 3.1 - COLORING VERTICES
#Thus far, we have changed the "size" attributes of our vertices. However, we can also change the colors of vertices to capture additional information about the Facebook users we are depicting.
#When changing the size of nodes, we first obtained the vertices of our graph with V(g) and then accessed the size attribute with V(g)$size. To change the color, we will update the attribute V(g)$color.
#To color the vertices based on the gender of the user, we will need access to that variable. When we created our graph g, we provided it with the data frame users, which had variables gender, school, and locale. These are now stored as attributes V(g)$gender, V(g)$school, and V(g)$locale.
#We can update the colors by setting the color to black for all vertices, then setting it to red for the vertices with gender A and setting it to gray for the vertices with gender B:
V(g)$color = "black"
V(g)$color[V(g)$gender == "A"] = "red"
V(g)$color[V(g)$gender == "B"] = "gray"
plot(g, vertex.label=NA)
#Plot the resulting graph. What is the gender of the users with the highest degree in the graph?
#Ans:Gender B
#EXPLANATION:After updating V(g)$color, run plot(g, vertex.label=NA) to plot the graph. All the largest nodes (the ones with the highest degree) are colored gray, which corresponds to Gender B.
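#We can confirm this without squinting at the plot (our own addition): tabulate the gender of every user with 10 or more friends (output not shown).
table(V(g)$gender[degree(g) >= 10])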
###########################
#PROBLEM 3.2 - COLORING VERTICES
#Now, color the vertices based on the school that each user in our network attended.
#Color vertex based on school
V(g)$color <- "black"
V(g)$color[V(g)$school == "A"] <- "red"
V(g)$color[V(g)$school == "AB"] <- "green"
plot(g, vertex.label=NA)
#Are the two users who attended both schools A and B Facebook friends with each other?
#Ans:Yes
#What best describes the users with highest degree?
#Ans:Some, but not all, of the high-degree users attended school A
#EXPLANATION:As with coloring by gender, we set the color for all vertices to black, then set the color for students from school A to red and the color for students from both schools (AB) to green. Finally, we plot the updated graph. The two students who attended both schools A and B are colored green; we can see from the graph that they are Facebook friends (i.e., they are connected by an edge). The high-degree users (depicted by the large nodes) are a mixture of red and black, meaning some of these users attended school A and others did not.
##################################
#PROBLEM 3.3 - COLORING VERTICES
#Now, color the vertices based on the locale of the user.
#The large connected component is most associated with which locale?
#Color vertex based on locale
V(g)$color <- "black"
V(g)$color[V(g)$locale == "A"] <- "red"
V(g)$color[V(g)$locale == "B"] <- "green"
plot(g, vertex.label=NA)
#Ans:Locale B
#The 4-user connected component is most associated with which locale?
#Ans:Locale A
#EXPLANATION:As with the other coloring tasks, we set the color for all vertices to black, then set the color for users from locale A to red and the color for users from locale B to green. Finally, we plot the updated graph. Nearly all of the vertices in the large connected component are colored green, indicating users from Locale B. Meanwhile, all the vertices in the 4-user connected component are colored red, indicating users from Locale A.
##############################
#PROBLEM 4 - OTHER PLOTTING OPTIONS
#The help page is a helpful tool when making visualizations. Answer the following questions with the help of ?igraph.plotting and experimentation in your R console.
#?igraph.plotting
#Making 3-D plots using rglplot
#Which igraph plotting function would enable us to plot our graph in 3-D?
library(rgl)
rglplot(g, vertex.label=NA)
#Ans:rglplot
#What parameter to the plot() function would we use to change the edge width when plotting g?
plot(g, edge.width=2, vertex.label=NA)
#Ans:edge.width
#EXPLANATION:The three functions to plot the igraph are plot.igraph (the function we used through the command "plot"), tkplot, and rglplot. rglplot makes 3-D plots -- you can try one with rglplot(g, vertex.label=NA). Once you've made the plot, you can click and drag to rotate the graph. To use this function, you will need to install and load the "rgl" package.
#To change the edge width, you need to change the edge parameter called "width". From ?igraph.plotting, we read that we need to append the prefix "edge." to the beginning for our call to plot, so the full parameter is called "edge.width". For instance, we could plot with edge width 2 with the command plot(g, edge.width=2, vertex.label=NA).
#experimentation with plotting options
plot(g)
#or
plot.igraph(g)
tkplot(g)
## [1] 1
#examples from igraph.plotting help page
# plotting a random graph, set the parameters in the command arguments
g1 <- barabasi.game(100)
plot(g1, layout=layout_with_fr, vertex.size=4,
vertex.label.dist=0.5, vertex.color="red", edge.arrow.size=0.5)
# plot a random graph, different color for each component
g2 <- sample_gnp(100, 1/100)
comps <- components(g2)$membership
colbar <- rainbow(max(comps)+1)
V(g2)$color <- colbar[comps+1]
plot(g2, layout=layout_with_fr, vertex.size=5, vertex.label=NA)
# plot communities in a graph
g3 <- make_full_graph(5) %du% make_full_graph(5) %du% make_full_graph(5)
g3 <- add_edges(g3, c(1,6, 1,11, 6,11))
com <- cluster_spinglass(g3, spins=5)
V(g3)$color <- com$membership+1
g3 <- set_graph_attr(g3, "layout", layout_with_kk(g3))
plot(g3, vertex.label.dist=1.5)
# draw a bunch of trees, fix layout
igraph_options(plot.layout=layout_as_tree)
plot(make_tree(20, 2))
plot(make_tree(50, 3), vertex.size=3, vertex.label=NA)
tkplot(make_tree(50, 2, mode="undirected"), vertex.size=10,
vertex.color="green")
## [1] 2
Earlier in the course, we used text analytics as a predictive tool, using word frequencies as independent variables in our models. However, sometimes our goal is to understand commonly occurring topics in text data rather than to predict the value of some dependent variable. In such cases, word clouds can be a visually appealing way to display the most frequent words in a body of text.
A word cloud arranges the most common words in some text, using size to indicate the frequency of a word. For instance, this is a word cloud for the complete works of Shakespeare, removing English stopwords:
shakespeare
While we could generate word clouds using free generators available on the Internet, we will have more flexibility and control over the process if we do so in R. We will visualize the text of tweets about Apple, a dataset we used earlier in the course. As a reminder, this dataset (which can be downloaded from tweets.csv) has the following variables:
Tweet – the text of the tweet
Avg – the sentiment of the tweet, as assigned by users of Amazon Mechanical Turk. The score ranges on a scale from -2 to 2, where 2 means highly positive sentiment, -2 means highly negative sentiment, and 0 means neutral sentiment.
#PROBLEM 1.1 - PREPARING THE DATA
#Load the dataset "tweets.csv" remembering to use stringsAsFactors=FALSE when loading the data.
tweets <- read.csv("tweets.csv", stringsAsFactors=FALSE)
#Next, perform the following pre-processing tasks (like we did in Unit 5), noting that we don't stem the words in the document or remove sparse terms:
#Clean up the data. We exclude stemming because it will be easier to read and understand the word cloud if it includes full words.
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
##
## Attaching package: 'tm'
## The following object is masked from 'package:mosaic':
##
## inspect
library(SnowballC)
#1) Create a corpus using the Tweet variable
corpus <- Corpus(VectorSource(tweets$Tweet))
#2) Convert the corpus to lowercase (don't forget to type "corpus = tm_map(corpus, PlainTextDocument)" in your R console right after this step)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
#3) Remove punctuation from the corpus
corpus <- tm_map(corpus, removePunctuation)
#4) Remove all English-language stopwords
corpus<- tm_map(corpus, removeWords, stopwords("english"))
#5) Build a document-term matrix out of the corpus
frequencies <- DocumentTermMatrix(corpus)
#6) Convert the document-term matrix to a data frame called allTweets
allTweets <- as.data.frame(as.matrix(frequencies))
#How many unique words are there across all the documents?
frequencies
## <<DocumentTermMatrix (documents: 1181, terms: 3780)>>
## Non-/sparse entries: 10273/4453907
## Sparsity : 100%
## Maximal term length: 115
## Weighting : term frequency (tf)
#or
ncol(allTweets)
## [1] 3780
#or
#str(allTweets)
#Ans:3780
#EXPLANATION:From the document-term matrix summary (or from ncol(allTweets)), we can read that there are 3780 unique words across all the tweets.
######################################
#PROBLEM 1.2 - PREPARING THE DATA
#Although we typically stem words during the text preprocessing step, we did not do so here. What is the most compelling rationale for skipping this step when visualizing text data?
#Ans:It will be easier to read and understand the word cloud if it includes full words instead of just the word stems
#EXPLANATION:We want to create an interpretable display of a document's contents, and our results will be easier to read if they include full words instead of just the stems. Stemming imposes a relatively minor computational burden, and we certainly could create a word cloud with a stemmed document.
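#For comparison (an illustration only, using the already-loaded SnowballC package), this is what stemming would do to a few words:
wordStem(c("iphone", "updates", "awesome"))    #e.g. "updates" becomes the stem "updat", which is harder to read in a cloud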
#####################################
#PROBLEM 2.1 - BUILDING A WORD CLOUD
#Install and load the "wordcloud" package, which is needed to build word clouds.
#install.packages("wordcloud")
library(wordcloud)
## Loading required package: RColorBrewer
#As we can read from ?wordcloud, we will need to provide the function with a vector of words and a vector of word frequencies. Which function can we apply to allTweets to get a vector of the words in our dataset, which we'll pass as the first argument to wordcloud()?
#Ans:colnames
#EXPLANATION:Each tweet represents a row of allTweets, and each word represents a column. We need the names of all the columns of allTweets, which is returned by colnames(allTweets). While str(allTweets) displays the names of the variables along with other information, it doesn't return a vector that we can use as the first argument to wordcloud().
######################################
#PROBLEM 2.2 - BUILDING A WORD CLOUD
head(colnames(allTweets))
## [1] "000" "075" "0909" "0910" "099" "100"
#Which function should we apply to allTweets to obtain the frequency of each word across all tweets?
head(colSums(allTweets))
## 000 075 0909 0910 099 100
## 1 3 1 1 1 2
#Ans: colSums
#EXPLANATION:Each tweet represents a row in allTweets, and each word represents a column. Therefore, we need to access the sums of each column in allTweets, which is returned by colSums(allTweets).
################################
#PROBLEM 2.3 - BUILDING A WORD CLOUD
#Use allTweets to build a word cloud. Make sure to check out the help page for wordcloud if you are not sure how to do this.
wordcloud(colnames(allTweets), colSums(allTweets))
## Warning in wordcloud(colnames(allTweets), colSums(allTweets)): apple could
## not be fit on page. It will not be plotted.
#Because we are plotting a large number of words, you might get warnings that some of the words could not be fit on the page and were therefore not plotted -- this is especially likely if you are using a smaller screen. You can address these warnings by plotting the words smaller. From ?wordcloud, we can see that the "scale" parameter controls the sizes of the plotted words. By default, the sizes range from 4 for the most frequent words to 0.5 for the least frequent, as denoted by the parameter "scale=c(4, 0.5)". We could obtain a much smaller plot with, for instance, parameter "scale=c(2, 0.25)".
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25))
#What is the most common word across all the tweets (it will be the largest in the outputted word cloud)? Please type the word exactly how you see it in the word cloud. The most frequent word might not be printed if you got a warning about words being cut off -- if this happened, be sure to follow the instructions in the paragraph above.
#Ans:apple
#EXPLANATION:"apple" is by far the largest, and therefore most common, word.
###################################
#PROBLEM 2.4 - BUILDING A WORD CLOUD
#In the previous subproblem, we noted that there is one word with a much higher frequency than the other words. Repeat the steps to load and pre-process the corpus, this time removing the most frequent word in addition to all elements of stopwords("english") in the call to tm_map with removeWords. For a refresher on how to remove this additional word, see the Twitter text analytics lecture.
#Remove the most frequent word 'apple' and regenerate graph
corpus <- Corpus(VectorSource(tweets$Tweet))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c("apple", stopwords("english")))
frequencies <- DocumentTermMatrix(corpus)
#Replace allTweets with the document-term matrix of this new corpus -- we will use this updated corpus for the remainder of the assignment.
allTweets <- as.data.frame(as.matrix(frequencies))
#Create a word cloud with the updated corpus.
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25))
#What is the most common word in this new corpus (the largest word in the outputted word cloud)? The most frequent word might not be printed if you got a warning about words being cut off -- if this happened, be sure to follow the instructions in the previous problem.
#Ans:iphone
#EXPLANATION:The most common (largest) word is now "iphone".
########################################################
#PROBLEM 3 - SIZE AND COLOR
#So far, the word clouds we've built have not been too visually appealing -- they are crowded by having too many words displayed, and they don't take advantage of color. One important step to building visually appealing visualizations is to experiment with the parameters available, which in this case can be viewed by typing ?wordcloud in your R console. In this problem, you should look through the help page and experiment with different parameters to answer the questions.
#?wordcloud
#Below are four word clouds, each of which uses different parameter settings in the call to the wordcloud() function:
#Word Cloud A:
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),rot.per=0.5)
#Word Cloud B:
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),min.freq=10,random.order=FALSE)
#Word Cloud C:
negativeTweets = subset(allTweets, tweets$Avg <= -1)
wordcloud(colnames(negativeTweets), colSums(negativeTweets),scale=c(2, 0.25),min.freq=4,colors=brewer.pal(9,"Purples")[5:9])
#Word Cloud D:
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),min.freq=10,random.order=FALSE,random.color=TRUE,colors=brewer.pal(9,"Purples")[5:9])
#PROBLEM 3.1 - SIZE AND COLOR
#Which word cloud is based only on the negative tweets (tweets with Avg value -1 or less)?
#Ans:Word Cloud C
#EXPLANATION:Word Cloud C is the only one with a different distribution of the most frequent words -- negative words (or censored versions of negative words) are much more common in this cloud. It is quite simple to obtain a word cloud that is limited to a subset of the tweets using the subset function.
#########################
#PROBLEM 3.2 - SIZE AND COLOR
#Only one word cloud was created without modifying parameters min.freq or max.words. Which word cloud is this?
#Ans:Word Cloud A
#EXPLANATION:min.freq and max.words are parameters that can be used to remove the least frequent words, resulting in a less cluttered word cloud. Word Cloud A is much more cluttered than the others because it did not use either of these parameters, and therefore is displaying every word that appears more than 3 times.
###########################
##PROBLEM 3.3 - SIZE AND COLOR
#Which word clouds were created with parameter random.order set to FALSE?
#Ans:Word Cloud B & Word Cloud D
#EXPLANATION:If random.order is set to FALSE, then the most frequent (largest) words will be plotted first, resulting in them being displayed together in the center of the word cloud. This is the case in Word Cloud B and Word Cloud D.
################################################
#PROBLEM 3.4 - SIZE AND COLOR
#Which word cloud was built with a non-default value for parameter rot.per?
#Ans:Word Cloud A
#EXPLANATION:rot.per controls the proportion of words that are rotated to be vertical in the word cloud. By default 10% of words are rotated. However in Word Cloud A a much higher proportion (50%) are rotated, which was achieved by setting rot.per=0.5.
#########################################
#PROBLEM 3.5 - SIZE AND COLOR
#In Word Cloud C and Word Cloud D, we provided a color palette ranging from light purple to dark purple as the parameter colors (you will learn how to make such a color palette later in this assignment). For which word cloud was the parameter random.color set to TRUE?
#Ans:Word Cloud D
#EXPLANATION:When random.color is set to TRUE, the words will be colored randomly. This is the case in Word Cloud D. Meanwhile, colors were assigned based on the number of appearances in Word Cloud C.
##########################################################
#PROBLEM 4.1 - SELECTING A COLOR PALETTE
#The use of a palette of colors can often improve the overall effect of a visualization. We can easily select our own colors when plotting; for instance, we could pass c("red", "green", "blue") as the colors parameter to wordcloud(). The RColorBrewer package, which is based on the ColorBrewer project (colorbrewer.org), provides pre-selected palettes that can lead to more visually appealing images. Though these palettes are designed specifically for coloring maps, we can also use them in our word clouds and other visualizations.
#The function brewer.pal() returns color palettes from the ColorBrewer project when provided with appropriate parameters, and the function display.brewer.all() displays the palettes we can choose from.
display.brewer.all()
#Which color palette would be most appropriate for use in a word cloud for which we want to use color to indicate word frequency?
#Ans:YlOrRd
#EXPLANATION:From ?brewer.pal we read that Accent and Set2 are both "qualitative palettes," which means color changes don't imply a change in magnitude (we can also see this in the output of display.brewer.all). As a result, the colors selected would not visually identify the least and most frequent words.
#On the other hand, YlOrRd is a "sequential palette," with earlier colors being lighter and later colors being darker. Therefore, it is a good palette choice for indicating low-frequency vs. high-frequency words.
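#To inspect the actual hex colors of a palette (our own exploration; output not shown):
brewer.pal(9, "YlOrRd")    #sequential: runs from light yellow to dark red
brewer.pal(8, "Accent")    #qualitative: no light-to-dark ordering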
###################################
#PROBLEM 4.2 - SELECTING A COLOR PALETTE
#Which RColorBrewer palette name would be most appropriate to use when preparing an image for a document that must be in grayscale?
#Displaying grey palette
display.brewer.pal(7, "Greys")
#Ans:Greys
#EXPLANATION:As we can see from display.brewer.all(), palette "Greys" is the only one completely in grayscale.
######################################
#PROBLEM 4.3 - SELECTING A COLOR PALETTE
#In sequential palettes, sometimes there is an undesirably large contrast between the lightest and darkest colors. You can see this effect when plotting a word cloud for allTweets with parameter colors=brewer.pal(9, "Blues"), which returns a sequential blue palette with 9 colors.
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),min.freq=10, colors=brewer.pal(9,"Blues"))
#Which of the following commands addresses this issue by removing the first 4 elements of the 9-color palette of blue colors? Select all that apply.
#Removing the light colors
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),min.freq=10, colors=brewer.pal(9,"Blues")[c(5,6,7,8,9)])
#or a shorthand for this indexing is:
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),min.freq=10, colors=brewer.pal(9,"Blues")[5:9])
#Alternately
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),min.freq=10,colors=brewer.pal(9, "Blues")[c(-1, -2, -3, -4)])
#or a shorthand for this indexing is:
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),min.freq=10, colors=brewer.pal(9, "Blues")[-1:-4])
#Ans:brewer.pal(9, "Blues")[c(-1, -2, -3, -4)]
#brewer.pal(9, "Blues")[c(5, 6, 7, 8, 9)]
#EXPLANATION:Indexing with [c(5, 6, 7, 8, 9)] (or the shorthand [5:9]) keeps only elements 5-9, which removes the first four. Indexing with [c(-1, -2, -3, -4)] (or [-1:-4]) uses negative indexes, which means remove elements 1-4.
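#The same indexing logic on a simple vector (an illustration only):
x = letters[1:9]
x[5:9]                  #keep elements 5 through 9
x[c(-1, -2, -3, -4)]    #equivalently, drop elements 1 through 4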
In the crime lecture, we saw how we can use heatmaps to give a 2-dimensional representation of 3-dimensional data: we made heatmaps of crime counts by time of the day and day of the week. In this problem, we’ll learn how to use histograms to show counts by one variable, and then how to visualize 3 dimensions by creating multiple histograms.
We’ll use the parole data parole.csv from Unit 3. Before, we used this data to predict parole violators. Now, let’s try to get a little more insight into this dataset using histograms. As a reminder, the variables in this dataset are:
male = 1 if the parolee is male, 0 if female
race = 1 if the parolee is white, 2 otherwise
age = the parolee’s age in years at the time of release from prison
state = a code for the parolee’s state. 2 is Kentucky, 3 is Louisiana, 4 is Virginia, and 1 is any other state. These three states were selected due to having a high representation in the dataset.
time.served = the number of months the parolee served in prison (limited by the inclusion criteria to not exceed 6 months).
max.sentence = the maximum sentence length for all charges, in months (limited by the inclusion criteria to not exceed 18 months).
multiple.offenses = 1 if the parolee was incarcerated for multiple offenses, 0 otherwise.
crime = a code for the parolee’s main crime leading to incarceration. 2 is larceny, 3 is drug-related crime, 4 is driving-related crime, and 1 is any other crime.
violator = 1 if the parolee violated the parole, and 0 if the parolee completed the parole without violation.
#PROBLEM 1.1
#Load the data
parole<-read.csv("parole.csv")
str(parole)
## 'data.frame': 675 obs. of 9 variables:
## $ male : int 1 0 1 1 1 1 1 0 0 1 ...
## $ race : int 1 1 2 1 2 2 1 1 1 2 ...
## $ age : num 33.2 39.7 29.5 22.4 21.6 46.7 31 24.6 32.6 29.1 ...
## $ state : int 1 1 1 1 1 1 1 1 1 1 ...
## $ time.served : num 5.5 5.4 5.6 5.7 5.4 6 6 4.8 4.5 4.7 ...
## $ max.sentence : int 18 12 12 18 12 18 18 12 13 12 ...
## $ multiple.offenses: int 0 0 0 0 0 0 0 0 0 0 ...
## $ crime : int 4 3 3 1 1 4 3 1 3 2 ...
## $ violator : int 0 0 0 0 0 0 0 0 0 0 ...
#Since male, state, and crime are all unordered factors, convert them to factor variables:
parole$male = as.factor(parole$male)
parole$state = as.factor(parole$state)
parole$crime = as.factor(parole$crime)
#see the structure of the dataset again
str(parole)
## 'data.frame': 675 obs. of 9 variables:
## $ male : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 1 1 2 ...
## $ race : int 1 1 2 1 2 2 1 1 1 2 ...
## $ age : num 33.2 39.7 29.5 22.4 21.6 46.7 31 24.6 32.6 29.1 ...
## $ state : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ time.served : num 5.5 5.4 5.6 5.7 5.4 6 6 4.8 4.5 4.7 ...
## $ max.sentence : int 18 12 12 18 12 18 18 12 13 12 ...
## $ multiple.offenses: int 0 0 0 0 0 0 0 0 0 0 ...
## $ crime : Factor w/ 4 levels "1","2","3","4": 4 3 3 1 1 4 3 1 3 2 ...
## $ violator : int 0 0 0 0 0 0 0 0 0 0 ...
#What fraction of parole violators are female?
prop.table(table(parole$male, parole$violator),2)
##
## 0 1
## 0 0.1943049 0.1794872
## 1 0.8056951 0.8205128
#Ans:0.1794872
#EXPLANATION:The total number of violators is 78, and 14 of them are female.
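#The raw counts behind these proportions (our own check):
table(parole$male, parole$violator)    #the violator column contains 14 females and 64 males
14/78
## [1] 0.1794872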
############################################
#PROBLEM 1.2
#In this dataset, which crime is the most common in Kentucky?
table(parole$crime[parole$state==2])
##
## 1 2 3 4
## 42 10 64 4
#Ans:Drug-related crime
#EXPLANATION:The code 2 corresponds to Kentucky, and the most common crime is 3, which corresponds to Drug-related crime.
########################################
#PROBLEM 2.1 - CREATING A BASIC HISTOGRAM
#Recall from lecture that in ggplot, we need to specify the dataset, the aesthetic, and the geometry. To create a histogram, the geometry will be geom_histogram. The data we'll use is parole, and the aesthetic will be the map from a variable to the x-axis of the histogram.
#Create a histogram to find out the distribution of the age of parolees:
library(ggplot2)
ggplot(data = parole, aes(x = age)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#By default, geom_histogram divides the data into 30 bins. Change the width of the bins to 5 years by adding the argument "binwidth = 5" to geom_histogram.
ggplot(data = parole, aes(x = age)) + geom_histogram(binwidth = 5)
#Note that by default, histograms create bins where the left endpoint is included in the bin, but the right endpoint isn't. So the first bin in this histogram represents parolees who are between 15 and 19 years old. The last bin in this histogram represents parolees who are between 65 and 69 years old.
#What is the age bracket with the most parolees?
#Ans:20-24
#EXPLANATION:The tallest bar corresponds to the age bracket with the most parolees, which is 20-24.
######################################
#PROBLEM 2.2 - CREATING A BASIC HISTOGRAM
#Redo the histogram, adding the following argument to the geom_histogram function: color="blue". What does this do? Select all that apply.
ggplot(data = parole, aes(x = age)) + geom_histogram(binwidth = 5,color="blue")
#Ans:Changes the outline color of the bars
#################################################
#PROBLEM 3.1 - ADDING ANOTHER DIMENSION
#Now suppose we are interested in seeing how the age distribution of male parolees compares to the age distribution of female parolees.
#One option would be to create a heatmap with age on one axis and male (a binary variable in our data set) on the other axis. Another option would be to stick with histograms, but to create a separate histogram for each gender. ggplot has the ability to do this automatically using the facet_grid command.
#To create separate histograms for male and female, type the following command into your R console:
ggplot(data = parole, aes(x = age)) + geom_histogram(binwidth = 5) + facet_grid(male ~ .)
#The histogram for female parolees is shown at the top, and the histogram for male parolees is shown at the bottom.
#What is the age bracket with the most female parolees?
#Ans:35-39
#EXPLANATION:Looking at the histogram at the top, we can see that the tallest bar corresponds to the age bracket 35-39.
#######################################
#PROBLEM 3.2 - ADDING ANOTHER DIMENSION
#Now change the facet_grid argument to be ".~male" instead of "male~.". What does this do?
ggplot(data = parole, aes(x = age)) + geom_histogram(binwidth = 5) + facet_grid(.~male)
#Ans:Puts the histograms side-by-side instead of on top of each other.
##########################################
#PROBLEM 3.3 - ADDING ANOTHER DIMENSION
#An alternative to faceting is to simply color the different groups differently. To color the data points by group, we need to tell ggplot that a property of the data (male or not male) should be translated to an aesthetic property of the histogram. We can do this by setting the fill parameter within the aesthetic to male.
#producing a histogram where data points are colored by group:
ggplot(data = parole, aes(x = age, fill = male)) + geom_histogram(binwidth = 5)
#Since we didn't specify colors to use, ggplot will use its default color selection. Let's change this by defining our own color palette.:
colorPalette = c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
#This is actually a colorblind-friendly palette, described on this Cookbook for R page. Now, generate your histogram again, using colorPalette:
ggplot(data = parole, aes(x = age, fill = male)) + geom_histogram(binwidth = 5) + scale_fill_manual(values=colorPalette)
#What color is the histogram for the female parolees?
#Ans:Black
#EXPLANATION:From the previous question, we saw that the female parolee histogram was much smaller than the male parolee histogram. So it looks like the female histogram is the black-colored one. We can also read this from the legend.
###################################
#PROBLEM 3.4 - ADDING ANOTHER DIMENSION
#Coloring the groups differently is a good way to see the breakdown of age by sex within the single, aggregated histogram. However, the bars here are stacked, meaning that the height of the orange bars in each age bin represents the total number of parolees in that age bin, not just the number of parolees in that group.
#An alternative to a single, stacked histogram is to create two histograms and overlay them on top of each other. This is a simple adjustment to our previous command.
#We just need to:
#1) Tell ggplot not to stack the histograms by adding the argument position="identity" to the geom_histogram function.
ggplot(parole, aes(x = age, fill = male)) + geom_histogram(binwidth = 5, position = "identity") + scale_fill_manual(values=colorPalette)
#2) Make the bars semi-transparent so we can see both colors by adding the argument alpha=0.5 to the geom_histogram function.
ggplot(parole, aes(x = age, fill = male)) + geom_histogram(binwidth = 5, position = "identity", alpha = 0.5) + scale_fill_manual(values=colorPalette)
#Which of the following buckets contain no female parolees? Select all that apply.
#Ans:15-19, 55-59, and 65-69
#####################################################
#PROBLEM 4.1 - TIME SERVED
#Now let's explore another aspect of the data: the amount of time served by parolees. Create a basic histogram like the one we created in Problem 2, but this time with time.served on the x-axis. Set the bin width to one month.
ggplot(data = parole, aes(x = time.served)) + geom_histogram(binwidth = 1)
#What is the most common length of time served, according to this histogram?
#Ans:Between 4 and 5 months
#EXPLANATION:The highest bar corresponds to between 4 and 5 months.
#################################
#PROBLEM 4.2 - TIME SERVED
#Change the binwidth to 0.1 months. Now what is the most common length of time served, according to the histogram?
ggplot(data = parole, aes(x = time.served)) + geom_histogram(binwidth = 0.1)
#Ans:Between 3.0 and 3.1 months
#EXPLANATION:Now, the highest bar corresponds to between 3.0 and 3.1 months.
#Be careful when choosing the binwidth - it can significantly affect the interpretation of a histogram! When visualizing histograms, it is always a good idea to vary the bin size in order to understand the data at various granularities.
#################################
#PROBLEM 4.3 - TIME SERVED
#Now, suppose we suspect that it is unlikely that each type of crime has the same distribution of time served. To visualize this, change the binwidth back to 1 month, and use facet_grid to create a separate histogram of time.served for each value of the variable crime.
ggplot(data = parole, aes(x = time.served)) + geom_histogram(binwidth = 1) + facet_grid(crime ~ .)
#Which crime type has no observations where time served is less than one month? Recall that crime type #2 is larceny, #3 is drug-related crime, #4 is driving-related crime, and #1 is any other crime.
#Ans:Driving-related
#For which crime does the frequency of 5-6 month prison terms exceed the frequencies of each other term length?
#Ans:Drug-related
###################################################
#PROBLEM 4.4 - TIME SERVED
#Now, instead of faceting the histograms, overlay them. Remember to set the position and alpha parameters so that the histograms are not stacked. Also, make sure to indicate that the fill aesthetic should be "crime".
ggplot(data=parole, aes(x=time.served, fill=crime)) + geom_histogram(binwidth=1, position="identity", alpha=0.5)
#In this case, faceting seems like a better alternative. Why?
#Ans:With four different groups, it can be hard to tell them apart when they are overlaid.
#EXPLANATION:While overlaying the plots is allowed and lets us observe some attributes of the plots, such as the most common crime type, it can be hard to tell the groups apart, and when they have similar values the plot becomes hard to read.
The wonder world of visualisation comes to an end... but this is just the beginning!