Source file ⇒ The_Analytics_Edge_edX_MIT15.071x_June2015_6.rmd

Unit 7: Visualization

Preliminaries

Visualizing the world_An Introduction to visualisation_1

VIDEO 1: THE POWER OF VISUALIZATIONS

Why Visualization?

  • “The picture-examining eye is the best finder we have of the wholly unanticipated” -John Tukey
  • Visualizing data allows us to discern relationships,structures, distributions, outliers, patterns, behaviors,dependencies, and outcomes
  • Useful for initial data exploration, for interpreting your model, and for communicating your results

QUICK QUESTION

Normally, a scatterplot only allows us to visualize two dimensions - one on the x-axis, and one on the y-axis. In the previous video we were able to show a third dimension on the scatterplot using what attribute?

Ans:Color

EXPLANATION:On slide 3, we show the scatterplot from slide 2, but with the number of cylinders shown by the color of the points. This allows us to visualize a third dimension of our data.

Initial Exploration Shows a Relationship (slide 2)

slide2

slide2

Explore Further: Color by Factor (slide 3)

slide3

slide3

VIDEO 2: THE WORLD HEALTH ORGANIZATION (WHO)

The World Health Organization

“WHO is the authority for health within the United Nations system. It is responsible for providing leadership on global health matters,shaping the health research agenda,setting norms and standards,articulating evidence-based policy options, providing technical support to countries and monitoring and assessing health trends.”

who

who

The World Health Report

  • WHO communicates information about global health in order to inform citizens, donors, policymakers and organizations across the world
  • Their primary publication is “World Health Report”
  • Each issue focuses on a specific aspect of global health, and includes statistics and experts’ assessments
better

better

QUICK QUESTION

Why is it particularly helpful for WHO to provide data visualizations? Select all that apply.

Ans:When communicating information to the general public, a visualization like the Energy Consumption one is much easier to absorb than a table of numbers would be.

Visualizations can easily be used by policymakers and others who wish to present data from WHO.

EXPLANATION:While there are other ways to display the data given in many visualizations (like tables), visualizations help to better communicate data to the public and can easily be used by others in presentations.

VIDEO 3: WHAT IS DATA VISUALIZATION?

What is a Data Visualization?

  • A mapping of data properties to visual properties
  • Data properties are usually numerical or categorical
  • Visual properties can be (x,y) coordinates, colors,sizes, shapes, heights, . . .

Anscombe’s Quartet

AnscombeQuartet

AnscombeQuartet

  • The 4 data sets show same mean & variance of X & Y.It also gives same correlation as well as same regression line.
AnscombeQuartet1

AnscombeQuartet1

  • Visualising each data set gives a completely different picture.It shows us how different they are.

ggplot2

  • “ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.” -Hadley Wickham, creator, www.ggplot2.org

Graphics in Base R vs ggplot

  • In base R, each mapping of data properties to visual properties is its own special case
    • Graphics composed of simple elements like points, lines
    • Difficult to add elements to existing plots
  • In ggplot, the mapping of data properties to visual properties is done by adding layers to the plot

Grammar of Graphics

  • ggplot graphics consist of at least 3 elements:
    1. Data, in a data frame
    2. Aesthetic mapping describing how variables in the data frame are mapped to graphical attributes
      • Color, shape, scale, x-y axes, subsets,.
    3. Geometric objects determine how values are rendered graphically
      • Points, lines, boxplots, bars, polygons,.

QUICK QUESTION

In this quick question, we’ll be asking you questions about the following three plots, that we saw in Video 1. We’ll refer to them as the “Scatterplot”, the “Histogram”, and the “US Map”.

The Scatterplot:

Scatterplot_Week8

Scatterplot_Week8

In the Scatterplot, what are the geometric objects?

Ans:Points

The Histogram:

Histogram_Week8

Histogram_Week8

In the Histogram, what are the geometric objects?

Ans:Bars

The US Map:

USmap_Week8

USmap_Week8

In the US Map, what are the geometric objects?

Ans:Polygons

EXPLANATION:The geometric objects for the Scatterplot are points, for the Histogram are bars, and for the US Map are polygons (the states). All three plots defined colors in the plot.

VIDEO 4 - A BASIC SCATTERPLOT (R script reproduced here)

  • In this lecture, we’ll be using the data file WHO.csv. This data comes from the Global Health Observatory Data Repository.

COLORS AND SHAPES IN R

ptShapes

ptShapes

# VIDEO 4 - A BASIC SCATTERPLOT

# Read in data
WHO = read.csv("WHO.csv")

str(WHO)
## 'data.frame':    194 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
##  $ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
##  $ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
##  $ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
##  $ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
##  $ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
##  $ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
##  $ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
##  $ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
##  $ GNI                          : num  1140 8820 8310 NA 5230 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
##  $ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
# Plot from Week 1
plot(WHO$GNI, WHO$FertilityRate)

# Let's redo this using ggplot 

# Install and load the ggplot2 library:
#install.packages("ggplot2")
suppressPackageStartupMessages(library(ggplot2))

# Create the ggplot object with the data and the aesthetic mapping:
scatterplot = ggplot(WHO, aes(x = GNI, y = FertilityRate))

# Add the geom_point geometry
scatterplot + geom_point()
## Warning: Removed 35 rows containing missing values (geom_point).

# Make a line graph instead:
scatterplot + geom_line()
## Warning: Removed 32 rows containing missing values (geom_path).

# Switch back to our points:
scatterplot + geom_point()
## Warning: Removed 35 rows containing missing values (geom_point).

# Redo the plot with blue triangles instead of circles:
scatterplot + geom_point(color = "blue", size = 3, shape = 17) 
## Warning: Removed 35 rows containing missing values (geom_point).

# Another option:
scatterplot + geom_point(color = "darkred", size = 3, shape = 8) 
## Warning: Removed 35 rows containing missing values (geom_point).

#COLORS AND SHAPES IN R
#If you want to see all of the available colors in R, type in your R console:
colors()
##   [1] "white"                "aliceblue"            "antiquewhite"        
##   [4] "antiquewhite1"        "antiquewhite2"        "antiquewhite3"       
##   [7] "antiquewhite4"        "aquamarine"           "aquamarine1"         
##  [10] "aquamarine2"          "aquamarine3"          "aquamarine4"         
##  [13] "azure"                "azure1"               "azure2"              
##  [16] "azure3"               "azure4"               "beige"               
##  [19] "bisque"               "bisque1"              "bisque2"             
##  [22] "bisque3"              "bisque4"              "black"               
##  [25] "blanchedalmond"       "blue"                 "blue1"               
##  [28] "blue2"                "blue3"                "blue4"               
##  [31] "blueviolet"           "brown"                "brown1"              
##  [34] "brown2"               "brown3"               "brown4"              
##  [37] "burlywood"            "burlywood1"           "burlywood2"          
##  [40] "burlywood3"           "burlywood4"           "cadetblue"           
##  [43] "cadetblue1"           "cadetblue2"           "cadetblue3"          
##  [46] "cadetblue4"           "chartreuse"           "chartreuse1"         
##  [49] "chartreuse2"          "chartreuse3"          "chartreuse4"         
##  [52] "chocolate"            "chocolate1"           "chocolate2"          
##  [55] "chocolate3"           "chocolate4"           "coral"               
##  [58] "coral1"               "coral2"               "coral3"              
##  [61] "coral4"               "cornflowerblue"       "cornsilk"            
##  [64] "cornsilk1"            "cornsilk2"            "cornsilk3"           
##  [67] "cornsilk4"            "cyan"                 "cyan1"               
##  [70] "cyan2"                "cyan3"                "cyan4"               
##  [73] "darkblue"             "darkcyan"             "darkgoldenrod"       
##  [76] "darkgoldenrod1"       "darkgoldenrod2"       "darkgoldenrod3"      
##  [79] "darkgoldenrod4"       "darkgray"             "darkgreen"           
##  [82] "darkgrey"             "darkkhaki"            "darkmagenta"         
##  [85] "darkolivegreen"       "darkolivegreen1"      "darkolivegreen2"     
##  [88] "darkolivegreen3"      "darkolivegreen4"      "darkorange"          
##  [91] "darkorange1"          "darkorange2"          "darkorange3"         
##  [94] "darkorange4"          "darkorchid"           "darkorchid1"         
##  [97] "darkorchid2"          "darkorchid3"          "darkorchid4"         
## [100] "darkred"              "darksalmon"           "darkseagreen"        
## [103] "darkseagreen1"        "darkseagreen2"        "darkseagreen3"       
## [106] "darkseagreen4"        "darkslateblue"        "darkslategray"       
## [109] "darkslategray1"       "darkslategray2"       "darkslategray3"      
## [112] "darkslategray4"       "darkslategrey"        "darkturquoise"       
## [115] "darkviolet"           "deeppink"             "deeppink1"           
## [118] "deeppink2"            "deeppink3"            "deeppink4"           
## [121] "deepskyblue"          "deepskyblue1"         "deepskyblue2"        
## [124] "deepskyblue3"         "deepskyblue4"         "dimgray"             
## [127] "dimgrey"              "dodgerblue"           "dodgerblue1"         
## [130] "dodgerblue2"          "dodgerblue3"          "dodgerblue4"         
## [133] "firebrick"            "firebrick1"           "firebrick2"          
## [136] "firebrick3"           "firebrick4"           "floralwhite"         
## [139] "forestgreen"          "gainsboro"            "ghostwhite"          
## [142] "gold"                 "gold1"                "gold2"               
## [145] "gold3"                "gold4"                "goldenrod"           
## [148] "goldenrod1"           "goldenrod2"           "goldenrod3"          
## [151] "goldenrod4"           "gray"                 "gray0"               
## [154] "gray1"                "gray2"                "gray3"               
## [157] "gray4"                "gray5"                "gray6"               
## [160] "gray7"                "gray8"                "gray9"               
## [163] "gray10"               "gray11"               "gray12"              
## [166] "gray13"               "gray14"               "gray15"              
## [169] "gray16"               "gray17"               "gray18"              
## [172] "gray19"               "gray20"               "gray21"              
## [175] "gray22"               "gray23"               "gray24"              
## [178] "gray25"               "gray26"               "gray27"              
## [181] "gray28"               "gray29"               "gray30"              
## [184] "gray31"               "gray32"               "gray33"              
## [187] "gray34"               "gray35"               "gray36"              
## [190] "gray37"               "gray38"               "gray39"              
## [193] "gray40"               "gray41"               "gray42"              
## [196] "gray43"               "gray44"               "gray45"              
## [199] "gray46"               "gray47"               "gray48"              
## [202] "gray49"               "gray50"               "gray51"              
## [205] "gray52"               "gray53"               "gray54"              
## [208] "gray55"               "gray56"               "gray57"              
## [211] "gray58"               "gray59"               "gray60"              
## [214] "gray61"               "gray62"               "gray63"              
## [217] "gray64"               "gray65"               "gray66"              
## [220] "gray67"               "gray68"               "gray69"              
## [223] "gray70"               "gray71"               "gray72"              
## [226] "gray73"               "gray74"               "gray75"              
## [229] "gray76"               "gray77"               "gray78"              
## [232] "gray79"               "gray80"               "gray81"              
## [235] "gray82"               "gray83"               "gray84"              
## [238] "gray85"               "gray86"               "gray87"              
## [241] "gray88"               "gray89"               "gray90"              
## [244] "gray91"               "gray92"               "gray93"              
## [247] "gray94"               "gray95"               "gray96"              
## [250] "gray97"               "gray98"               "gray99"              
## [253] "gray100"              "green"                "green1"              
## [256] "green2"               "green3"               "green4"              
## [259] "greenyellow"          "grey"                 "grey0"               
## [262] "grey1"                "grey2"                "grey3"               
## [265] "grey4"                "grey5"                "grey6"               
## [268] "grey7"                "grey8"                "grey9"               
## [271] "grey10"               "grey11"               "grey12"              
## [274] "grey13"               "grey14"               "grey15"              
## [277] "grey16"               "grey17"               "grey18"              
## [280] "grey19"               "grey20"               "grey21"              
## [283] "grey22"               "grey23"               "grey24"              
## [286] "grey25"               "grey26"               "grey27"              
## [289] "grey28"               "grey29"               "grey30"              
## [292] "grey31"               "grey32"               "grey33"              
## [295] "grey34"               "grey35"               "grey36"              
## [298] "grey37"               "grey38"               "grey39"              
## [301] "grey40"               "grey41"               "grey42"              
## [304] "grey43"               "grey44"               "grey45"              
## [307] "grey46"               "grey47"               "grey48"              
## [310] "grey49"               "grey50"               "grey51"              
## [313] "grey52"               "grey53"               "grey54"              
## [316] "grey55"               "grey56"               "grey57"              
## [319] "grey58"               "grey59"               "grey60"              
## [322] "grey61"               "grey62"               "grey63"              
## [325] "grey64"               "grey65"               "grey66"              
## [328] "grey67"               "grey68"               "grey69"              
## [331] "grey70"               "grey71"               "grey72"              
## [334] "grey73"               "grey74"               "grey75"              
## [337] "grey76"               "grey77"               "grey78"              
## [340] "grey79"               "grey80"               "grey81"              
## [343] "grey82"               "grey83"               "grey84"              
## [346] "grey85"               "grey86"               "grey87"              
## [349] "grey88"               "grey89"               "grey90"              
## [352] "grey91"               "grey92"               "grey93"              
## [355] "grey94"               "grey95"               "grey96"              
## [358] "grey97"               "grey98"               "grey99"              
## [361] "grey100"              "honeydew"             "honeydew1"           
## [364] "honeydew2"            "honeydew3"            "honeydew4"           
## [367] "hotpink"              "hotpink1"             "hotpink2"            
## [370] "hotpink3"             "hotpink4"             "indianred"           
## [373] "indianred1"           "indianred2"           "indianred3"          
## [376] "indianred4"           "ivory"                "ivory1"              
## [379] "ivory2"               "ivory3"               "ivory4"              
## [382] "khaki"                "khaki1"               "khaki2"              
## [385] "khaki3"               "khaki4"               "lavender"            
## [388] "lavenderblush"        "lavenderblush1"       "lavenderblush2"      
## [391] "lavenderblush3"       "lavenderblush4"       "lawngreen"           
## [394] "lemonchiffon"         "lemonchiffon1"        "lemonchiffon2"       
## [397] "lemonchiffon3"        "lemonchiffon4"        "lightblue"           
## [400] "lightblue1"           "lightblue2"           "lightblue3"          
## [403] "lightblue4"           "lightcoral"           "lightcyan"           
## [406] "lightcyan1"           "lightcyan2"           "lightcyan3"          
## [409] "lightcyan4"           "lightgoldenrod"       "lightgoldenrod1"     
## [412] "lightgoldenrod2"      "lightgoldenrod3"      "lightgoldenrod4"     
## [415] "lightgoldenrodyellow" "lightgray"            "lightgreen"          
## [418] "lightgrey"            "lightpink"            "lightpink1"          
## [421] "lightpink2"           "lightpink3"           "lightpink4"          
## [424] "lightsalmon"          "lightsalmon1"         "lightsalmon2"        
## [427] "lightsalmon3"         "lightsalmon4"         "lightseagreen"       
## [430] "lightskyblue"         "lightskyblue1"        "lightskyblue2"       
## [433] "lightskyblue3"        "lightskyblue4"        "lightslateblue"      
## [436] "lightslategray"       "lightslategrey"       "lightsteelblue"      
## [439] "lightsteelblue1"      "lightsteelblue2"      "lightsteelblue3"     
## [442] "lightsteelblue4"      "lightyellow"          "lightyellow1"        
## [445] "lightyellow2"         "lightyellow3"         "lightyellow4"        
## [448] "limegreen"            "linen"                "magenta"             
## [451] "magenta1"             "magenta2"             "magenta3"            
## [454] "magenta4"             "maroon"               "maroon1"             
## [457] "maroon2"              "maroon3"              "maroon4"             
## [460] "mediumaquamarine"     "mediumblue"           "mediumorchid"        
## [463] "mediumorchid1"        "mediumorchid2"        "mediumorchid3"       
## [466] "mediumorchid4"        "mediumpurple"         "mediumpurple1"       
## [469] "mediumpurple2"        "mediumpurple3"        "mediumpurple4"       
## [472] "mediumseagreen"       "mediumslateblue"      "mediumspringgreen"   
## [475] "mediumturquoise"      "mediumvioletred"      "midnightblue"        
## [478] "mintcream"            "mistyrose"            "mistyrose1"          
## [481] "mistyrose2"           "mistyrose3"           "mistyrose4"          
## [484] "moccasin"             "navajowhite"          "navajowhite1"        
## [487] "navajowhite2"         "navajowhite3"         "navajowhite4"        
## [490] "navy"                 "navyblue"             "oldlace"             
## [493] "olivedrab"            "olivedrab1"           "olivedrab2"          
## [496] "olivedrab3"           "olivedrab4"           "orange"              
## [499] "orange1"              "orange2"              "orange3"             
## [502] "orange4"              "orangered"            "orangered1"          
## [505] "orangered2"           "orangered3"           "orangered4"          
## [508] "orchid"               "orchid1"              "orchid2"             
## [511] "orchid3"              "orchid4"              "palegoldenrod"       
## [514] "palegreen"            "palegreen1"           "palegreen2"          
## [517] "palegreen3"           "palegreen4"           "paleturquoise"       
## [520] "paleturquoise1"       "paleturquoise2"       "paleturquoise3"      
## [523] "paleturquoise4"       "palevioletred"        "palevioletred1"      
## [526] "palevioletred2"       "palevioletred3"       "palevioletred4"      
## [529] "papayawhip"           "peachpuff"            "peachpuff1"          
## [532] "peachpuff2"           "peachpuff3"           "peachpuff4"          
## [535] "peru"                 "pink"                 "pink1"               
## [538] "pink2"                "pink3"                "pink4"               
## [541] "plum"                 "plum1"                "plum2"               
## [544] "plum3"                "plum4"                "powderblue"          
## [547] "purple"               "purple1"              "purple2"             
## [550] "purple3"              "purple4"              "red"                 
## [553] "red1"                 "red2"                 "red3"                
## [556] "red4"                 "rosybrown"            "rosybrown1"          
## [559] "rosybrown2"           "rosybrown3"           "rosybrown4"          
## [562] "royalblue"            "royalblue1"           "royalblue2"          
## [565] "royalblue3"           "royalblue4"           "saddlebrown"         
## [568] "salmon"               "salmon1"              "salmon2"             
## [571] "salmon3"              "salmon4"              "sandybrown"          
## [574] "seagreen"             "seagreen1"            "seagreen2"           
## [577] "seagreen3"            "seagreen4"            "seashell"            
## [580] "seashell1"            "seashell2"            "seashell3"           
## [583] "seashell4"            "sienna"               "sienna1"             
## [586] "sienna2"              "sienna3"              "sienna4"             
## [589] "skyblue"              "skyblue1"             "skyblue2"            
## [592] "skyblue3"             "skyblue4"             "slateblue"           
## [595] "slateblue1"           "slateblue2"           "slateblue3"          
## [598] "slateblue4"           "slategray"            "slategray1"          
## [601] "slategray2"           "slategray3"           "slategray4"          
## [604] "slategrey"            "snow"                 "snow1"               
## [607] "snow2"                "snow3"                "snow4"               
## [610] "springgreen"          "springgreen1"         "springgreen2"        
## [613] "springgreen3"         "springgreen4"         "steelblue"           
## [616] "steelblue1"           "steelblue2"           "steelblue3"          
## [619] "steelblue4"           "tan"                  "tan1"                
## [622] "tan2"                 "tan3"                 "tan4"                
## [625] "thistle"              "thistle1"             "thistle2"            
## [628] "thistle3"             "thistle4"             "tomato"              
## [631] "tomato1"              "tomato2"              "tomato3"             
## [634] "tomato4"              "turquoise"            "turquoise1"          
## [637] "turquoise2"           "turquoise3"           "turquoise4"          
## [640] "violet"               "violetred"            "violetred1"          
## [643] "violetred2"           "violetred3"           "violetred4"          
## [646] "wheat"                "wheat1"               "wheat2"              
## [649] "wheat3"               "wheat4"               "whitesmoke"          
## [652] "yellow"               "yellow1"              "yellow2"             
## [655] "yellow3"              "yellow4"              "yellowgreen"
# Add a title to the plot:
scatterplot + geom_point(colour = "blue", size = 3, shape = 17) + ggtitle("Fertility Rate vs. Gross National Income")
## Warning: Removed 35 rows containing missing values (geom_point).

# Save our plot:
fertilityGNIplot = scatterplot + geom_point(colour = "blue", size = 3, shape = 17) + ggtitle("Fertility Rate vs. Gross National Income")

pdf("MyPlot.pdf")

print(fertilityGNIplot)
## Warning: Removed 35 rows containing missing values (geom_point).
dev.off()
## png 
##   2
####################################

#QUICK QUESTION

#In R, change the shape of your points to the number 15. What shape are the points now?
scatterplot + geom_point(shape = 15)
## Warning: Removed 35 rows containing missing values (geom_point).

#or
scatterplot + geom_point(color = "darkred", size = 3, shape = 15) 
## Warning: Removed 35 rows containing missing values (geom_point).

#Ans:Squares

VIDEO 5 - MORE ADVANCED SCATTERPLOTS (R script reproduced here)

suppressPackageStartupMessages(library(ggplot2))

# Color the points by region (which is a factor variabl: 
ggplot(WHO, aes(x = GNI, y = FertilityRate, color = Region)) + geom_point()
## Warning: Removed 35 rows containing missing values (geom_point).

# Color the points according to country's life expectancy(which is a numerical variable and hence we get a gradient legend):
ggplot(WHO, aes(x = GNI, y = FertilityRate, color = LifeExpectancy)) + geom_point()
## Warning: Removed 35 rows containing missing values (geom_point).

# Is the fertility rate of a country was a good predictor of the percentage of the population under 15?
ggplot(WHO, aes(x = FertilityRate, y = Under15)) + geom_point()
## Warning: Removed 11 rows containing missing values (geom_point).

#The scatter plot shows that the variables are certainly correlated but as the fertility rates increases the var under 15 starts increasing less.So this does not look like alinear relationship but we suspect that a log transformation of fertility rate would be better.

# Let's try a log transformation:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point()
## Warning: Removed 11 rows containing missing values (geom_point).

# Simple linear regression model to predict the percentage of the population under 15, using the log of the fertility rate:
mod = lm(Under15 ~ log(FertilityRate), data = WHO)
summary(mod)
## 
## Call:
## lm(formula = Under15 ~ log(FertilityRate), data = WHO)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3131  -1.7742   0.0446   1.7440   7.7174 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.6540     0.4478   17.09   <2e-16 ***
## log(FertilityRate)  22.0547     0.4175   52.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 181 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.9391, Adjusted R-squared:  0.9387 
## F-statistic:  2790 on 1 and 181 DF,  p-value: < 2.2e-16
#From the summary output we can see that the coefficients are highly significant and R*2 is  0.9391 implying that log of the predictor FertilityRate is a great predictor of Under15.
#Visualisation was a great way to realise that the log transformation would be better.If we has just used FertilityRate , we would have got R*2 as 0.87 which is a significant decrease in R*2 value.

# Add this regression line to our plot as another layer:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm")
## Warning: Removed 11 rows containing non-finite values (stat_smooth).
## Warning: Removed 11 rows containing missing values (geom_point).

# 99% confidence interval (By default ggplot2 will draw a 95% shaded region)
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", level = 0.99)
## Warning: Removed 11 rows containing non-finite values (stat_smooth).

## Warning: Removed 11 rows containing missing values (geom_point).

# No confidence interval in the plot
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", se = FALSE)
## Warning: Removed 11 rows containing non-finite values (stat_smooth).

## Warning: Removed 11 rows containing missing values (geom_point).

# Change the color of the regression line:
ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", colour = "orange")
## Warning: Removed 11 rows containing non-finite values (stat_smooth).

## Warning: Removed 11 rows containing missing values (geom_point).

######################################

#QUICK QUESTION  

#Create the fertility rate versus population under 15 plot again:
ggplot(WHO, aes(x = FertilityRate, y = Under15)) + geom_point()
## Warning: Removed 11 rows containing missing values (geom_point).

#Now, color the points by the Region variable.
ggplot(WHO, aes(x = FertilityRate, y = Under15, color = Region)) + geom_point() + scale_color_brewer(palette="Dark2")
## Warning: Removed 11 rows containing missing values (geom_point).

#or

ggplot(WHO, aes(x = FertilityRate, y = Under15, color=Region)) + geom_point()
## Warning: Removed 11 rows containing missing values (geom_point).

#Note: You can add scale_color_brewer(palette="Dark2") to your plot if you are having a hard time distinguishing the colors (this color palette is often better if you are colorblind). To use this option, you should just add scale_color_brewer(palette="Dark2") to your plotting command right after geom_point(). 

#One region in particular has a lot of countries with a very low fertility rate and a very low percentage of the population under 15. Which region is it?
#Ans:Europe
#EXPLANATION:Most of the countries in Europe have a very low fertility rate and a very low percentage of the population under 15.

The Analytical Policeman_Visualization for law & order_2

VIDEO 1: PREDICTIVE POLICING

The Analytical Policeman

  • The explosion of computerized data affects all parts of society, including law and order
  • In the past, human judgment and experience was the only tool in identifying patterns in criminal behavior
  • Police forces around the US and the world are augmenting human judgment with analytics - sometimes described as “predictive policing”

Example: Los Angeles Police Dept.

“I’m not going to get more money. I’m not going to get more cops. I have to be better at using what I have, and that’s what predictive policing is about. If this old street cop can change the way that he thinks about this stuff, then I know that my [officers] can do the same.” - Los Angeles Police Chief Charlie Beck

Role of Analytics

  • The analytical tools you have learned in this class can be used to make these “predictive policing” models
  • However, communicating the results of these models is essential - a linear regression output table will not be of use to a policewoman on patrol
  • Visualization bridges the gap between the data and mathematics and the end user

QUICK QUESTION

The Los Angeles Police Department sees the benefits of predictive policing as which of the following? Select all that apply.

Ans:Allowing more intelligent officer deployment

Preventing crime

Using resources more effectively

EXPLANATION:According the the Los Angeles Police Department, predictive policing does not eliminate the need for police officers or increase the rate at which they catch criminals. It does, however, allow more intelligent officer deployment, prevents crime, and helps them use resources more effectively.

VIDEO 2: VISUALIZING CRIME OVER TIME

Understanding the Past

  • Before we even consider a predictive model, we should try to understand the historical data
  • Many cities in the US and around the world provide logs of reported crimes, usually including the time, location, and nature of the event
  • We will use data from Chicago about motor vehicle thefts

Crime Over Time

Line graph:

  • Suppose we wanted to communicate crime patterns over the course of an average week
  • We could display daily averages using a line graph,but this does not seem like it would be too useful
line

line

Table:

  • We can replace our x-axis with the hour of the day, and have a different line for every day of the week, but this would be a jumbled mess with 7 lines!
  • We could use no visualization at all, and instead present the information in a table
table

table

  • This is valid, but how can we make the table more interesting and usable?

Heatmaps:

  • Heatmaps are a way of visualizing data using three attributes. The x-axis and y-axis are typically displayed horizontally and vertically
  • The third attribute is represented by shades of color. For example,a low number might be blue, and a high number might be red.
heat

heat

  • We can pick different color schemes based on the type of data to convey different messages
heat2

heat2

  • The x-axis and y-axis don’t need to be continuous -they can be categorical
  • We could even combine a heatmap with a geographical map - we will discuss this later in the class

A Chicago Crime Heatmap

  • We will use Chicago motor vehicle theft data to explore patterns of crime:
    • Over days of the week
    • Over hours of the day
  • We’re interested in the total number of car thefts that occur in any particular hour of a day of the week over the whole data set

QUICK QUESTION

For which of the following situations would a heat map be an appropriate visualization choice? Select all that apply.

Ans:Visualizing the areas on a geographical map with the most crime

Comparing crime counts by police district and time throughout a city

EXPLANATION:A heatmap would be useful for the middle two options, because they are trying to visualize crime counts relative to two variables. For the first option, you could use a basic scatterplot with time on the x-axis and amount of crime on the y-axis. For the last option, you could use a bar plot with a bar for each month and the height being the average amount of crime in that month.

VIDEO 3: A LINE PLOT (R script reproduced here)

# VIDEO 3 - A Basic Line Plot

# Load our data:
mvt = read.csv("mvt.csv", stringsAsFactors=FALSE) #since we have a text field and we want it to be read in properly, we use stringsAsFactors=FALSE

str(mvt)
## 'data.frame':    191641 obs. of  3 variables:
##  $ Date     : chr  "12/31/12 23:15" "12/31/12 22:00" "12/31/12 22:00" "12/31/12 22:00" ...
##  $ Latitude : num  41.8 41.9 42 41.8 41.8 ...
##  $ Longitude: num  -87.6 -87.7 -87.8 -87.7 -87.6 ...
# Convert the Date variable to a format that R will recognize using strptime():
mvt$Date = strptime(mvt$Date, format="%m/%d/%y %H:%M")

# Extract the hour and the day of the week:
mvt$Weekday = weekdays(mvt$Date)
mvt$Hour = mvt$Date$hour

# Let's take a look at the structure of our data again:
str(mvt)
## 'data.frame':    191641 obs. of  5 variables:
##  $ Date     : POSIXlt, format: "2012-12-31 23:15:00" "2012-12-31 22:00:00" ...
##  $ Latitude : num  41.8 41.9 42 41.8 41.8 ...
##  $ Longitude: num  -87.6 -87.7 -87.8 -87.7 -87.6 ...
##  $ Weekday  : chr  "Monday" "Monday" "Monday" "Monday" ...
##  $ Hour     : int  23 22 22 22 21 20 20 20 19 18 ...
#we can see the 2  new variables added to the existing dataset

#Create a simple line plot - need the total number of crimes on each day of the week. We can get this information by creating a table:
table(mvt$Weekday)
## 
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##     29284     27397     27118     26316     27319     26791     27416
# Save this table as a data frame:
WeekdayCounts = as.data.frame(table(mvt$Weekday))

str(WeekdayCounts) 
## 'data.frame':    7 obs. of  2 variables:
##  $ Var1: Factor w/ 7 levels "Friday","Monday",..: 1 2 3 4 5 6 7
##  $ Freq: int  29284 27397 27118 26316 27319 26791 27416
# Load the ggplot2 library:
library(ggplot2)

# Create our plot
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1)) #group =1groups all date into one line as we want one line   in our plot

#But we can see that the days of the week are slightly out of order as ggplot2 plot the days in alphabetical order.What we want is chronological order

# Make the "Var1" variable an ORDERED factor variable
WeekdayCounts$Var1 = factor(WeekdayCounts$Var1, ordered=TRUE, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday","Saturday"))

# Try again:
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1))

# Change our x and y labels:
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1)) + xlab("Day of the Week") + ylab("Total Motor Vehicle Thefts")

#############################

#QUICK QUESTION  

#Create a new line plot, like the one in Video 3, but add the argument "linetype=2". So the geom_line part of the plotting command should look like:

ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1), linetype=2) + xlab("Day of the Week") + ylab("Total Motor Vehicle Thefts")

#What does this do?
#Ans: Makes the line dashed

#Now, change the alpha parameter to 0.3 by replacing "linetype=2" with "alpha=0.3" in the plot command. What does this do?
ggplot(WeekdayCounts, aes(x=Var1, y=Freq)) + geom_line(aes(group=1), alpha=0.3) + xlab("Day of the Week") + ylab("Total Motor Vehicle Thefts")

#Ans:Makes the line lighter in color

#EXPLANATION:The linetype parameter makes the line dashed, and the alpha parameter makes the line lighter in color, or more transparent.

VIDEO 4: A HEATMAP (R script reproduced here)

# VIDEO 4 - Adding the Hour of the Day

#Break down by week days and hours

# Create a counts table for the weekday and hour:
table(mvt$Weekday, mvt$Hour)
##            
##                0    1    2    3    4    5    6    7    8    9   10   11
##   Friday    1873  932  743  560  473  602  839 1203 1268 1286  938  822
##   Monday    1900  825  712  527  415  542  772 1123 1323 1235  971  737
##   Saturday  2050 1267  985  836  652  508  541  650  858 1039  946  789
##   Sunday    2028 1236 1019  838  607  461  478  483  615  864  884  787
##   Thursday  1856  816  696  508  400  534  799 1135 1298 1301  932  731
##   Tuesday   1691  777  603  464  414  520  845 1118 1175 1174  948  786
##   Wednesday 1814  790  619  469  396  561  862 1140 1329 1237  947  763
##            
##               12   13   14   15   16   17   18   19   20   21   22   23
##   Friday    1207  857  937 1140 1165 1318 1623 1652 1736 1881 2308 1921
##   Monday    1129  824  958 1059 1136 1252 1518 1503 1622 1815 2009 1490
##   Saturday  1204  767  963 1086 1055 1084 1348 1390 1570 1702 2078 1750
##   Sunday    1192  789  959 1037 1083 1160 1389 1342 1706 1696 2079 1584
##   Thursday  1093  752  831 1044 1131 1258 1510 1537 1668 1776 2134 1579
##   Tuesday   1108  762  908 1071 1090 1274 1553 1496 1696 1816 2044 1458
##   Wednesday 1225  804  863 1075 1076 1289 1580 1507 1718 1748 2093 1511
# Save this to a data frame so that we can used it for visualisation:
DayHourCounts = as.data.frame(table(mvt$Weekday, mvt$Hour))

str(DayHourCounts)
## 'data.frame':    168 obs. of  3 variables:
##  $ Var1: Factor w/ 7 levels "Friday","Monday",..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ Var2: Factor w/ 24 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Freq: int  1873 1900 2050 2028 1856 1691 1814 932 825 1267 ...
# Convert the second variable, Var2, to numbers and call it Hour:
DayHourCounts$Hour = as.numeric(as.character(DayHourCounts$Var2))

str(DayHourCounts)
## 'data.frame':    168 obs. of  4 variables:
##  $ Var1: Factor w/ 7 levels "Friday","Monday",..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ Var2: Factor w/ 24 levels "0","1","2","3",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Freq: int  1873 1900 2050 2028 1856 1691 1814 932 825 1267 ...
##  $ Hour: num  0 0 0 0 0 0 0 1 1 1 ...
# Create out plot:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1))

#we get 7 lines corresponding to each day of the week

# Change the colors to differentiate the days of the week
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Var1), size=2)

# Separate the weekends from the weekdays:
DayHourCounts$Type = ifelse((DayHourCounts$Var1 == "Sunday") | (DayHourCounts$Var1 == "Saturday"), "Weekend", "Weekday")

# Redo our plot, this time coloring by Type:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Type), size=2) 

# Make the lines a little transparent:
ggplot(DayHourCounts, aes(x=Hour, y=Freq)) + geom_line(aes(group=Var1, color=Type), size=2, alpha=0.5) 

# Fix the order of the days:
DayHourCounts$Var1 = factor(DayHourCounts$Var1, ordered=TRUE, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

# Make a heatmap:
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq))

# Change the label on the legend, and get rid of the y-label:

ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq)) + scale_fill_gradient(name="Total MV Thefts") + theme(axis.title.y = element_blank())

# Change the color scheme of the heat map
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill = Freq)) + scale_fill_gradient(name="Total MV Thefts", low="white", high="red") + theme(axis.title.y = element_blank())

QUICK QUESTION

In this quick question, we’ll ask you questions about the following plots. Plot (1) is the heat map we generated at the end of Video 4. Plot (2) and Plot (3) were generated by changing argument values of the command used to generate Plot (1).

Plot (1)

Week8_Crime_QQ4_1

Week8_Crime_QQ4_1

Plot (2)

Week8_Crime_QQ4_2

Week8_Crime_QQ4_2

Plot (3)

Week8_Crime_QQ4_3

Week8_Crime_QQ4_3

#Generating the above plots

#Q:Which argument(s) did we change to get Plot (2)? Select all that apply.

# Plot2
ggplot(DayHourCounts, aes(x =Var1 , y = Hour)) + geom_tile(aes(fill = Freq)) + scale_fill_gradient(name="Total MV Thefts", low="white", high="red") + theme(axis.title.y = element_blank())

#Ans:x & y

#Q:Which argument(s) did we change to get Plot (3)? Select all that apply.

#plot3
ggplot(DayHourCounts, aes(x = Hour, y = Var1)) + geom_tile(aes(fill=Freq)) + scale_fill_gradient(name="Total MV Thefts", low="white", high="black") + theme(axis.title.y=element_blank())

#Ans:high

VIDEO 5: A GEOGRAPHICAL HOT SPOT MAP (R script reproduced here)

# VIDEO 5 - Maps

# Install and load two new packages:
#install.packages("maps")
#install.packages("ggmap")
library(maps)
library(ggmap)

# Load a map of Chicago into R:
chicago = get_map(location = "chicago", zoom = 11)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=chicago&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=chicago&sensor=false
##Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=chicago&zoom=11&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
##Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=chicago&sensor=false


# Look at the map
ggmap(chicago)

# Plot the first 100 motor vehicle thefts on the chicago map:
ggmap(chicago) + geom_point(data = mvt[1:100,], aes(x = Longitude, y = Latitude))
## Warning: Removed 7 rows containing missing values (geom_point).

# Round our latitude and longitude to 2 digits of accuracy, and create a crime counts data frame for each area:
LatLonCounts = as.data.frame(table(round(mvt$Longitude,2), round(mvt$Latitude,2)))

str(LatLonCounts)
## 'data.frame':    1638 obs. of  3 variables:
##  $ Var1: Factor w/ 42 levels "-87.93","-87.92",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Var2: Factor w/ 39 levels "41.64","41.65",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq: int  0 0 0 0 0 0 0 0 0 0 ...
# Convert our Longitude and Latitude variable to numbers (coverting a factor var into a numeric var):
LatLonCounts$Long = as.numeric(as.character(LatLonCounts$Var1))
LatLonCounts$Lat = as.numeric(as.character(LatLonCounts$Var2))

# Plot these points on our map:
ggmap(chicago) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size=Freq))
## Warning: Removed 615 rows containing missing values (geom_point).

# Change the color scheme:
ggmap(chicago) + geom_point(data = LatLonCounts, aes(x = Long, y = Lat, color = Freq, size=Freq)) + scale_colour_gradient(low="yellow", high="red")
## Warning: Removed 615 rows containing missing values (geom_point).

# We can also use the geom_tile geometry similar to heat map
ggmap(chicago) + geom_tile(data = LatLonCounts, aes(x = Long, y = Lat, alpha = Freq), fill="red")
## Warning: Removed 615 rows containing missing values (geom_tile).

#QUICK QUESTION  

#In the previous video, our heatmap was plotting squares out in the water, which seems a little strange. We can fix this by removing the observations from our data frame that have Freq = 0.

#Take a subset of LatLonCounts, only keeping the observations for which Freq > 0, and call it LatLonCounts2.

LatLonCounts2 <- subset(LatLonCounts, Freq > 0)

#Redo the heatmap from the end of Video 5, using LatLonCounts2 instead of LatLonCounts. You should no longer see any squares out in the water, or in any areas where there were no motor vehicle thefts.

ggmap(chicago) + geom_tile(data = LatLonCounts2, aes(x = Long, y = Lat, alpha = Freq), fill = "red")
## Warning: Removed 119 rows containing missing values (geom_tile).

#How many observations did we remove?
nrow(LatLonCounts) - nrow(LatLonCounts2)
## [1] 952
#Ans:952
#EXPLANATION:The number of observations in LatLonCounts2 is 686, and the number of observations in LatLonCounts is 1638. These numbers can be found by using nrow or str.

VIDEO 6: A HEATMAP ON THE UNITED STATES (R script reproduced here)

# VIDEO 6 - Geographical Map on US

# Load our data (Data provided by FBI for the total number of murders state by state:
murders = read.csv("murders.csv")

str(murders)
## 'data.frame':    51 obs. of  6 variables:
##  $ State            : Factor w/ 51 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Population       : int  4779736 710231 6392017 2915918 37253956 5029196 3574097 897934 601723 19687653 ...
##  $ PopulationDensity: num  94.65 1.26 57.05 56.43 244.2 ...
##  $ Murders          : int  199 31 352 130 1811 117 131 48 131 987 ...
##  $ GunMurders       : int  135 19 232 93 1257 65 97 38 99 669 ...
##  $ GunOwnership     : num  0.517 0.578 0.311 0.553 0.213 0.347 0.167 0.255 0.036 0.245 ...
# Load the map of the US which is already available in R
statesMap = map_data("state")
## 
##  # maps v3.1: updated 'world': all lakes moved to separate new #
##  # 'lakes' database. Type '?world' or 'news(package="maps")'.  #
str(statesMap)
## 'data.frame':    15537 obs. of  6 variables:
##  $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ subregion: chr  NA NA NA NA ...
# Plot the map:
ggplot(statesMap, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "black") 

# Create a new variable called region with the lowercase names to MATCH the statesMap since we have state names in State var of the murders dataset which start with capital letter and statenames in statesMap dataset are in region var which start with lowercase:
murders$region = tolower(murders$State)  

# Join the statesMap data and the murders data into new one dataframe using the merge(:
murderMap = merge(statesMap, murders, by="region") #by="region is the identifier used to merge the 2 dataframes which were matched by region var
str(murderMap)
## 'data.frame':    15537 obs. of  12 variables:
##  $ region           : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ long             : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat              : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ subregion        : chr  NA NA NA NA ...
##  $ State            : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Population       : int  4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 4779736 ...
##  $ PopulationDensity: num  94.7 94.7 94.7 94.7 94.7 ...
##  $ Murders          : int  199 199 199 199 199 199 199 199 199 199 ...
##  $ GunMurders       : int  135 135 135 135 135 135 135 135 135 135 ...
##  $ GunOwnership     : num  0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 0.517 ...
# Plot the number of murder on our map of the United States:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = Murders)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")

# Plot a map of the population:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = Population)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")

#lets plot the same map with different colour scheme
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = Population)) + geom_polygon(color = "black") + scale_fill_gradient(low="green", high="red", guide = "legend")

# Create a new variable that is the number of murders per 100,000 population:
murderMap$MurderRate = murderMap$Murders / murderMap$Population * 100000

# Redo our plot with murder rate:
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")

#lets plot the same map with different colour scheme
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) + geom_polygon(color = "black") + scale_fill_gradient(low="green", high="red", guide = "legend")

# Redo the plot, removing any states with murder rates above 10 which removes Washington DC which is a very small state (an outlier):
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) + geom_polygon(color = "black") + scale_fill_gradient(low = "black", high = "red", guide = "legend", limits = c(0,10))

#lets plot the same map with different colour scheme
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) + geom_polygon(color = "black") + scale_fill_gradient(low="green", high="red", guide = "legend", limits = c(0,10))

###################

#QUICK QUESTION  

#Redo the map from Video 6, but this time fill each state with the variable GunOwnership. This shows the percentage of people in each state who own a gun.

#Which of the following states has the highest gun ownership rate? To see the state labels, take a look at the World Atlas map here:http://www.worldatlas.com/webimage/testmaps/usanames.htm

ggplot(murderMap, aes(x = long, y = lat, group = group, fill = GunOwnership)) + geom_polygon(color ="black") + scale_fill_gradient(low = "black", high = "red", guide = "legend")

#lets plot the same map with different colour scheme
ggplot(murderMap, aes(x = long, y = lat, group = group, fill = GunOwnership)) + geom_polygon(color ="black") + scale_fill_gradient(low="green", high="red", guide = "legend")

#Ans:Montana
#EXPLANATION:Of these five states, the one that is the most red is Montana.

VIDEO 7: THE ANALYTICS EDGE

Eye on Crime

  • Criminal activity-related data often has both components of time and location
  • Sometimes all that is required is a line chart, but heatmaps can visualize data that would be too big for a table
  • Plotting data on maps is much more effective than a table for location based data, and is eye-catching

Predictive Policing

  • Many police forces are exploiting their databases to focus finite resources on problem areas
  • Not only do analytics help improve policework, the outputs are also good communication tools to decision makers in government and to the wider public
  • The application of analytics to data like this is new and growing, with companies like PredPol and Palantir leading the effort

The Good, the bad and the Ugly_Visualization recitation_3

VIDEO 1: INTRODUCTION

Great Power, Great Responsibility

  • here are many ways to visualize the same data.
  • You have just seen how to make quite attractive visualizations with ggplot2, which has good default settings, but judgement is still required, e.g. do I vary the size, or do I vary the color?
  • Excel, etc. can also be used to make perfectly acceptable visualizations - or terrible ones.

What is the difference?

  • Good visualizations.
    • Clearly and accurately convey the key messages in the data
  • Bad visualizations.
    • Obfuscate the data (either through ignorance, or malice!)

What does this mean?

  • Visualizations can be used by an analyst for their own consumption, to gain insights.
  • Visualizations can also be used to provide information to a decision maker, and/or to convince someone.
  • Bad visualizations hide patterns that could give insight, or mislead decision makers.

VIDEO 2: PIE CHARTS

Bad Visualizations?

pie1

pie1

pie2

pie2

pie3

pie3

VIDEO 3: BAR CHARTS IN R (R script reproduced here)

# Unit 7 - Recitation

# VIDEO 3 - Bar Charts

# Load ggplot library
library(ggplot2)

# Load our data, which lives in intl.csv
intl = read.csv("intl.csv")
str(intl)
## 'data.frame':    8 obs. of  2 variables:
##  $ Region       : Factor w/ 8 levels "Africa","Asia",..: 2 3 6 4 5 1 7 8
##  $ PercentOfIntl: num  0.531 0.201 0.098 0.09 0.054 0.02 0.015 0.002
# We want to make a bar plot with region on the X axis
# and Percentage on the y-axis.
ggplot(intl, aes(x=Region, y=PercentOfIntl)) +
  geom_bar(stat="identity") +
  geom_text(aes(label=PercentOfIntl))

#stat="identity" arg means that use the value of y axis as it is

# Make Region an ordered factor
# We can do this with the re-order command and transform command. 
intl = transform(intl, Region = reorder(Region, -PercentOfIntl)) #reordering in decreasing order

# Look at the structure
str(intl)
## 'data.frame':    8 obs. of  2 variables:
##  $ Region       : Factor w/ 8 levels "Asia","Europe",..: 1 2 3 4 5 6 7 8
##   ..- attr(*, "scores")= num [1:8(1d)] -0.02 -0.531 -0.201 -0.09 -0.054 -0.098 -0.015 -0.002
##   .. ..- attr(*, "dimnames")=List of 1
##   .. .. ..$ : chr  "Africa" "Asia" "Europe" "Latin Am. & Caribbean" ...
##  $ PercentOfIntl: num  0.531 0.201 0.098 0.09 0.054 0.02 0.015 0.002
# Make the percentages out of 100 instead of fractions
intl$PercentOfIntl = intl$PercentOfIntl * 100

# Make the plot
ggplot(intl, aes(x=Region, y=PercentOfIntl)) +
geom_bar(stat="identity", fill="steelblue") +
geom_text(aes(label=PercentOfIntl), vjust=-0.4) + # vjust=-0.4 arg just moves the values just above the bars 
ylab("Percent of International Students") +
theme(axis.title.x = element_blank(), axis.text.x = element_text(angle = 45, hjust = 1))

VIDEO 4: A BETTER VISUALIZATION

Better Visualization?

bar

bar

  • All data is visible!
  • Don’t lose small regions.
  • Can easily compare relative sizes
  • Something to consider is that, for some people and applications, being not as “visually exciting” is a negative.

VIDEO 5: WORLD MAPS IN R (R script reproduced here)

# VIDEO 5 - World map

# Load the ggmap package
library(ggmap)

# Load in the international student data
intlall = read.csv("intlall.csv",stringsAsFactors=FALSE)

# Lets look at the first few rows
head(intlall)
##           Citizenship UG  G SpecialUG SpecialG ExhangeVisiting Total
## 1             Albania  3  1         0        0               0     4
## 2 Antigua and Barbuda NA NA        NA        1              NA     1
## 3           Argentina NA 19        NA       NA              NA    19
## 4             Armenia  3  2        NA       NA              NA     5
## 5           Australia  6 32        NA       NA               1    39
## 6             Austria NA 11        NA       NA               5    16
# Those NAs are really 0s, and we can replace them easily
intlall[is.na(intlall)] = 0

# Now lets look again
head(intlall) 
##           Citizenship UG  G SpecialUG SpecialG ExhangeVisiting Total
## 1             Albania  3  1         0        0               0     4
## 2 Antigua and Barbuda  0  0         0        1               0     1
## 3           Argentina  0 19         0        0               0    19
## 4             Armenia  3  2         0        0               0     5
## 5           Australia  6 32         0        0               1    39
## 6             Austria  0 11         0        0               5    16
# Load the world map
world_map = map_data("world")
str(world_map)
## 'data.frame':    99338 obs. of  6 variables:
##  $ long     : num  -69.9 -69.9 -69.9 -70 -70.1 ...
##  $ lat      : num  12.5 12.4 12.4 12.5 12.5 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "Aruba" "Aruba" "Aruba" "Aruba" ...
##  $ subregion: chr  NA NA NA NA ...
# Lets merge intlall into world_map using the merge command
world_map = merge(world_map, intlall, by.x ="region", by.y = "Citizenship")
str(world_map)
## 'data.frame':    63634 obs. of  12 variables:
##  $ region         : chr  "Albania" "Albania" "Albania" "Albania" ...
##  $ long           : num  20.5 20.4 19.5 20.5 20.4 ...
##  $ lat            : num  41.3 39.8 42.5 40.1 41.5 ...
##  $ group          : num  6 6 6 6 6 6 6 6 6 6 ...
##  $ order          : int  789 822 870 815 786 821 818 779 879 795 ...
##  $ subregion      : chr  NA NA NA NA ...
##  $ UG             : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ G              : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SpecialUG      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ExhangeVisiting: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Total          : int  4 4 4 4 4 4 4 4 4 4 ...
# Plot the map
ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(fill="white", color="black") +
  coord_map("mercator")

# Reorder the data
world_map = world_map[order(world_map$group, world_map$order),]

# Redo the plot
ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(fill="white", color="black") +
  coord_map("mercator")

# Lets look for China
table(intlall$Citizenship) 
## 
##                      Albania          Antigua and Barbuda 
##                            1                            1 
##                    Argentina                      Armenia 
##                            1                            1 
##                    Australia                      Austria 
##                            1                            1 
##                      Bahrain                   Bangladesh 
##                            1                            1 
##                      Belarus                      Belgium 
##                            1                            1 
##                      Bolivia           Bosnia-Hercegovina 
##                            1                            1 
##                       Brazil                     Bulgaria 
##                            1                            1 
##                     Cambodia                     Cameroon 
##                            1                            1 
##                       Canada                        Chile 
##                            1                            1 
## China (People's Republic Of)                     Colombia 
##                            1                            1 
##                   Costa Rica                Cote d'Ivoire 
##                            1                            1 
##                      Croatia                       Cyprus 
##                            1                            1 
##               Czech Republic                      Denmark 
##                            1                            1 
##                      Ecuador                        Egypt 
##                            1                            1 
##                  El Salvador                      Estonia 
##                            1                            1 
##                     Ethiopia                      Finland 
##                            1                            1 
##                       France                      Georgia 
##                            1                            1 
##                      Germany                        Ghana 
##                            1                            1 
##                       Greece                    Guatemala 
##                            1                            1 
##                        Haiti                    Hong Kong 
##                            1                            1 
##                      Hungary                      Iceland 
##                            1                            1 
##                        India                    Indonesia 
##                            1                            1 
##                         Iran                         Iraq 
##                            1                            1 
##                      Ireland                       Israel 
##                            1                            1 
##                        Italy                      Jamaica 
##                            1                            1 
##                        Japan                       Jordan 
##                            1                            1 
##                   Kazakhstan                        Kenya 
##                            1                            1 
##                 Korea, South                       Kuwait 
##                            1                            1 
##                       Latvia                      Lebanon 
##                            1                            1 
##                    Lithuania                    Macedonia 
##                            1                            1 
##                     Malaysia                    Mauritius 
##                            1                            1 
##                       Mexico                      Moldova 
##                            1                            1 
##                     Mongolia                   Montenegro 
##                            1                            1 
##                      Morocco                        Nepal 
##                            1                            1 
##                  Netherlands                  New Zealand 
##                            1                            1 
##                      Nigeria                       Norway 
##                            1                            1 
##                     Pakistan                     Paraguay 
##                            1                            1 
##                         Peru                  Philippines 
##                            1                            1 
##                       Poland                     Portugal 
##                            1                            1 
##                        Qatar                      Romania 
##                            1                            1 
##                       Russia                       Rwanda 
##                            1                            1 
##                 Saudi Arabia                       Serbia 
##                            1                            1 
##                 Sierra Leone                    Singapore 
##                            1                            1 
##                     Slovakia                      Somalia 
##                            1                            1 
##                 South Africa                        Spain 
##                            1                            1 
##                    Sri Lanka                    St. Lucia 
##                            1                            1 
## St. Vincent & The Grenadines                        Sudan 
##                            1                            1 
##                       Sweden                  Switzerland 
##                            1                            1 
##                        Syria                       Taiwan 
##                            1                            1 
##                     Tanzania                     Thailand 
##                            1                            1 
##            Trinidad & Tobago                      Tunisia 
##                            1                            1 
##                       Turkey                       Uganda 
##                            1                            1 
##                      Ukraine         United Arab Emirates 
##                            1                            1 
##               United Kingdom                      Unknown 
##                            1                            1 
##                      Uruguay                    Venezuela 
##                            1                            1 
##                      Vietnam                    West Bank 
##                            1                            1 
##                       Zambia                     Zimbabwe 
##                            1                            1
# Lets "fix" that in the intlall dataset
intlall$Citizenship[intlall$Citizenship=="China (People's Republic Of)"] = "China"

# Lets look for China
table(intlall$Citizenship) 
## 
##                      Albania          Antigua and Barbuda 
##                            1                            1 
##                    Argentina                      Armenia 
##                            1                            1 
##                    Australia                      Austria 
##                            1                            1 
##                      Bahrain                   Bangladesh 
##                            1                            1 
##                      Belarus                      Belgium 
##                            1                            1 
##                      Bolivia           Bosnia-Hercegovina 
##                            1                            1 
##                       Brazil                     Bulgaria 
##                            1                            1 
##                     Cambodia                     Cameroon 
##                            1                            1 
##                       Canada                        Chile 
##                            1                            1 
##                        China                     Colombia 
##                            1                            1 
##                   Costa Rica                Cote d'Ivoire 
##                            1                            1 
##                      Croatia                       Cyprus 
##                            1                            1 
##               Czech Republic                      Denmark 
##                            1                            1 
##                      Ecuador                        Egypt 
##                            1                            1 
##                  El Salvador                      Estonia 
##                            1                            1 
##                     Ethiopia                      Finland 
##                            1                            1 
##                       France                      Georgia 
##                            1                            1 
##                      Germany                        Ghana 
##                            1                            1 
##                       Greece                    Guatemala 
##                            1                            1 
##                        Haiti                    Hong Kong 
##                            1                            1 
##                      Hungary                      Iceland 
##                            1                            1 
##                        India                    Indonesia 
##                            1                            1 
##                         Iran                         Iraq 
##                            1                            1 
##                      Ireland                       Israel 
##                            1                            1 
##                        Italy                      Jamaica 
##                            1                            1 
##                        Japan                       Jordan 
##                            1                            1 
##                   Kazakhstan                        Kenya 
##                            1                            1 
##                 Korea, South                       Kuwait 
##                            1                            1 
##                       Latvia                      Lebanon 
##                            1                            1 
##                    Lithuania                    Macedonia 
##                            1                            1 
##                     Malaysia                    Mauritius 
##                            1                            1 
##                       Mexico                      Moldova 
##                            1                            1 
##                     Mongolia                   Montenegro 
##                            1                            1 
##                      Morocco                        Nepal 
##                            1                            1 
##                  Netherlands                  New Zealand 
##                            1                            1 
##                      Nigeria                       Norway 
##                            1                            1 
##                     Pakistan                     Paraguay 
##                            1                            1 
##                         Peru                  Philippines 
##                            1                            1 
##                       Poland                     Portugal 
##                            1                            1 
##                        Qatar                      Romania 
##                            1                            1 
##                       Russia                       Rwanda 
##                            1                            1 
##                 Saudi Arabia                       Serbia 
##                            1                            1 
##                 Sierra Leone                    Singapore 
##                            1                            1 
##                     Slovakia                      Somalia 
##                            1                            1 
##                 South Africa                        Spain 
##                            1                            1 
##                    Sri Lanka                    St. Lucia 
##                            1                            1 
## St. Vincent & The Grenadines                        Sudan 
##                            1                            1 
##                       Sweden                  Switzerland 
##                            1                            1 
##                        Syria                       Taiwan 
##                            1                            1 
##                     Tanzania                     Thailand 
##                            1                            1 
##            Trinidad & Tobago                      Tunisia 
##                            1                            1 
##                       Turkey                       Uganda 
##                            1                            1 
##                      Ukraine         United Arab Emirates 
##                            1                            1 
##               United Kingdom                      Unknown 
##                            1                            1 
##                      Uruguay                    Venezuela 
##                            1                            1 
##                      Vietnam                    West Bank 
##                            1                            1 
##                       Zambia                     Zimbabwe 
##                            1                            1
# We'll repeat our merge and order from before
world_map = merge(map_data("world"), intlall, 
                  by.x ="region",
                  by.y = "Citizenship")
world_map = world_map[order(world_map$group, world_map$order),]

ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(aes(fill=Total), color="black") +
  coord_map("mercator")

# We can try other projections -like orthographic projetion--- this one is visually interesting
ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(aes(fill=Total), color="black") +
  coord_map("ortho", orientation=c(20, 30, 0))

ggplot(world_map, aes(x=long, y=lat, group=group)) +
  geom_polygon(aes(fill=Total), color="black") +
  coord_map("ortho", orientation=c(-37, 175, 0))

VIDEO 6: SCALES

Bad Scales

scale1

scale1

scale2

scale2

scale3

scale3

Two Rights Make A Wrong

scale4

scale4

Family Matters

scale5

scale5

  • If we are interested in shares within a year, its good.
  • If we want to see rates of change, it is pretty much unusable!
  • If we want to compare year-to-year, its possible though imperfect.
  • Numbers are relative - absolute numbers may reveal, e.g. married couples without children is constant across years.

VIDEO 7: USING LINE CHARTS INSTEAD (R script reproduced here)

# VIDEO 7 - Line Charts

# First, lets make sure we have ggplot2 loaded
library(ggplot2)

# Now lets load our dataframe
households = read.csv("households_2.csv")
str(households)
## 'data.frame':    8 obs. of  7 variables:
##  $ Year          : int  1970 1980 1990 1995 2000 2005 2010 2012
##  $ MarriedWChild : num  40.3 30.9 26.3 25.5 24.1 22.9 20.9 19.6
##  $ MarriedWOChild: num  30.3 29.9 29.8 28.9 28.7 28.3 28.8 29.1
##  $ OtherFamily   : num  10.6 12.9 14.8 15.6 16 16.7 17.4 17.8
##  $ MenAlone      : num  5.6 8.6 9.7 10.2 10.7 11.3 11.9 12.3
##  $ WomenAlone    : num  11.5 14 14.9 14.7 14.8 15.3 14.8 15.2
##  $ OtherNonfamily: num  1.7 3.6 4.6 5 5.7 5.6 6.2 6.1
#from the above structure we dont know what to put in aes argument in ggplot
#We have to convert the data so that it can be used by ggplot()
#ggplot2 requires year,group & fraction
#What we need is to convert the Wide format of the given dataset to Long format

# Load reshape2 package
library(reshape2)

# Lets look at the first two columns of our households dataframe
head(households[,1:2])
##   Year MarriedWChild
## 1 1970          40.3
## 2 1980          30.9
## 3 1990          26.3
## 4 1995          25.5
## 5 2000          24.1
## 6 2005          22.9
head(households)
##   Year MarriedWChild MarriedWOChild OtherFamily MenAlone WomenAlone
## 1 1970          40.3           30.3        10.6      5.6       11.5
## 2 1980          30.9           29.9        12.9      8.6       14.0
## 3 1990          26.3           29.8        14.8      9.7       14.9
## 4 1995          25.5           28.9        15.6     10.2       14.7
## 5 2000          24.1           28.7        16.0     10.7       14.8
## 6 2005          22.9           28.3        16.7     11.3       15.3
##   OtherNonfamily
## 1            1.7
## 2            3.6
## 3            4.6
## 4            5.0
## 5            5.7
## 6            5.6
# First few rows of our melted households dataframe
head(melt(households, id="Year"))
##   Year      variable value
## 1 1970 MarriedWChild  40.3
## 2 1980 MarriedWChild  30.9
## 3 1990 MarriedWChild  26.3
## 4 1995 MarriedWChild  25.5
## 5 2000 MarriedWChild  24.1
## 6 2005 MarriedWChild  22.9
households[,1:3]
##   Year MarriedWChild MarriedWOChild
## 1 1970          40.3           30.3
## 2 1980          30.9           29.9
## 3 1990          26.3           29.8
## 4 1995          25.5           28.9
## 5 2000          24.1           28.7
## 6 2005          22.9           28.3
## 7 2010          20.9           28.8
## 8 2012          19.6           29.1
melt(households, id="Year")[1:10,3]
##  [1] 40.3 30.9 26.3 25.5 24.1 22.9 20.9 19.6 30.3 29.9
melt(households, id="Year")[1:10,]
##    Year       variable value
## 1  1970  MarriedWChild  40.3
## 2  1980  MarriedWChild  30.9
## 3  1990  MarriedWChild  26.3
## 4  1995  MarriedWChild  25.5
## 5  2000  MarriedWChild  24.1
## 6  2005  MarriedWChild  22.9
## 7  2010  MarriedWChild  20.9
## 8  2012  MarriedWChild  19.6
## 9  1970 MarriedWOChild  30.3
## 10 1980 MarriedWOChild  29.9
# Plot it
ggplot(melt(households, id="Year"),       
       aes(x=Year, y=value, color=variable)) +
  geom_line(size=2) + geom_point(size=5) +  
  ylab("Percentage of Households")

Assignment 7

ELECTION FORECASTING REVISITED

ELECTION FORECASTING REVISITED

In the recitation from Unit 3, we used logistic regression on polling data in order to construct US presidential election predictions. We separated our data into a training set, containing data from 2004 and 2008 polls, and a test set, containing the data from 2012 polls. We then proceeded to develop a logistic regression model to forecast the 2012 US presidential election.

In this homework problem, we’ll revisit our logistic regression model from Unit 3, and learn how to plot the output on a map of the United States. Unlike what we did in the Crime lecture, this time we’ll be plotting predictions rather than data!

Note:The maps package contains other built-in maps, including a US county map, a world map, and maps for France and Italy.

library(ggplot2)
library(maps)
library(ggmap)

#load the US map and save it to the variable statesMap, like we did during the Crime lecture:
statesMap = map_data("state")

str(statesMap)
## 'data.frame':    15537 obs. of  6 variables:
##  $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ subregion: chr  NA NA NA NA ...
#PROBLEM 1.1 - DRAWING A MAP OF THE US  

#If you look at the structure of the statesMap data frame using the str function, you should see that there are 6 variables. One of the variables, group, defines the different shapes or polygons on the map. Sometimes a state may have multiple groups, for example, if it includes islands. How many different groups are there?
table(statesMap$group)
## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
##  202  149  312  516   79   91   94   10  872  381  233  329  257  256  113 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
##  397  650  399  566   36  220   30  460  370  373  382  315  238  208   70 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
##  125  205   78   16  290   21  168   37  733   12  105  238  284  236  172 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
##   66  304  166  289 1088   59  129   96   15  623   17   17   19   44  448 
##   61   62   63 
##  373  388   68
#or 
length(table(statesMap$group)) 
## [1] 63
#Ans:63

####################################

#PROBLEM 1.2 - DRAWING A MAP OF THE US  (1 point possible)

#Drawing the map of the United States:

ggplot(statesMap, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "black")

#We specified two colors in geom_polygon -- fill and color. Which one defined the color of the outline of the states?
#Ans:color
#EXPLANATION:In our plot, the states are outlined in black, which is the color we specified for the option "color". To confirm that this is changing the outline color of the states, you can try re-running the command with a different color:

ggplot(statesMap, aes(x = long, y = lat, group = group)) + geom_polygon(fill = "white", color = "pink")

########################################

#PROBLEM 2.1 - COLORING THE STATES BY PREDICTIONS

#Now, let's color the map of the US according to our 2012 US presidential election predictions from the Unit 3 Recitation. We'll rebuild the model here, using the dataset PollingImputed.csv. Be sure to use this file so that you don't have to redo the imputation to fill in the missing values, like we did in the Unit 3 Recitation.

#Load election data
polling <- read.csv("PollingImputed.csv")

# split the data using the subset function into a training set called "Train" that has observations from 2004 and 2008, and a testing set called "Test" that has observations from 2012
Train = subset(polling, Year < 2012)
Test = subset(polling, Year == 2012)
str(Train)
## 'data.frame':    100 obs. of  7 variables:
##  $ State     : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 4 4 5 5 ...
##  $ Year      : int  2004 2008 2004 2008 2004 2008 2004 2008 2004 2008 ...
##  $ Rasmussen : int  11 21 19 16 5 5 7 10 -11 -27 ...
##  $ SurveyUSA : int  18 25 21 18 15 3 5 7 -11 -24 ...
##  $ DiffCount : int  5 5 1 6 8 9 8 5 -8 -5 ...
##  $ PropR     : num  1 1 1 1 1 1 1 1 0 0 ...
##  $ Republican: int  1 1 1 1 1 1 1 1 0 0 ...
str(Test)
## 'data.frame':    45 obs. of  7 variables:
##  $ State     : Factor w/ 50 levels "Alabama","Alaska",..: 3 4 5 6 7 9 10 11 12 13 ...
##  $ Year      : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  $ Rasmussen : int  8 13 -12 3 -7 2 5 -22 31 -22 ...
##  $ SurveyUSA : int  5 21 -14 -2 -13 0 8 -24 24 -16 ...
##  $ DiffCount : int  4 2 -6 -5 -8 6 4 -2 1 -5 ...
##  $ PropR     : num  0.833 1 0 0.308 0 ...
##  $ Republican: int  1 1 0 0 0 0 1 0 1 0 ...
#Note that we only have 45 states in our testing set, since we are missing observations for Alaska, Delaware, Alabama, Wyoming, and Vermont, so these states will not appear colored in our map.

#create a logistic regression model and make predictions on the test set:

mod2 = glm(Republican~SurveyUSA+DiffCount, data=Train, family="binomial")

TestPrediction = predict(mod2, newdata=Test, type="response")

#TestPrediction gives the predicted probabilities for each state, but let's also create a vector of Republican/Democrat predictions :
TestPredictionBinary = as.numeric(TestPrediction > 0.5)

#Now, put the predictions and state labels in a data.frame so that we can use ggplot:

predictionDataFrame = data.frame(TestPrediction, TestPredictionBinary, Test$State)

#For how many states is our binary prediction 1 (for 2012), corresponding to Republican?
sum(predictionDataFrame$TestPredictionBinary == 1)
## [1] 22
#or from
table(TestPredictionBinary)
## TestPredictionBinary
##  0  1 
## 23 22
#Ans:22

#What is the average predicted probability of our model (on the Test set, for 2012)?
mean(predictionDataFrame$TestPrediction)
## [1] 0.4852626
#Ans:0.4852626

####################################################

#PROBLEM 2.2 - COLORING THE STATES BY PREDICTIONS  

#Now, we need to merge "predictionDataFrame" with the map data "statesMap", like we did in lecture. 

#Before doing so, we need to convert the Test.State variable to lowercase, so that it matches the region variable in statesMap:
predictionDataFrame$region = tolower(predictionDataFrame$Test.State)

#Now, merge the two data frames using the following command:
predictionMap = merge(statesMap, predictionDataFrame, by = "region")

#Lastly, we need to make sure the observations are in order so that the map is drawn properly, by typing the following:

predictionMap = predictionMap[order(predictionMap$order),]

#How many observations are there in predictionMap?
str(predictionMap)
## 'data.frame':    15034 obs. of  9 variables:
##  $ region              : chr  "arizona" "arizona" "arizona" "arizona" ...
##  $ long                : num  -115 -115 -115 -115 -115 ...
##  $ lat                 : num  35 35.1 35.1 35.2 35.2 ...
##  $ group               : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ order               : int  204 205 206 207 208 209 210 211 212 213 ...
##  $ subregion           : chr  NA NA NA NA ...
##  $ TestPrediction      : num  0.974 0.974 0.974 0.974 0.974 ...
##  $ TestPredictionBinary: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Test.State          : Factor w/ 50 levels "Alabama","Alaska",..: 3 3 3 3 3 3 3 3 3 3 ...
#or
nrow(predictionMap)
## [1] 15034
#Ans:15034

#How many observations are there in statesMap?
str(statesMap)
## 'data.frame':    15537 obs. of  6 variables:
##  $ long     : num  -87.5 -87.5 -87.5 -87.5 -87.6 ...
##  $ lat      : num  30.4 30.4 30.4 30.3 30.3 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "alabama" "alabama" "alabama" "alabama" ...
##  $ subregion: chr  NA NA NA NA ...
#or
nrow(statesMap)
## [1] 15537
#Ans:15537

########################################################

#PROBLEM 2.3 - COLORING THE STATES BY PREDICTIONS  

#When we merged the data in the previous problem, it caused the number of observations to change. Why? Check out the help page for merge by typing ?merge to help you answer this question.

#?merge
#Ans:Because we only make predictions for 45 states, we no longer have observations for some of the states. These observations were removed in the merging process.
#EXPLANATION:When we merge data, it only merged the observations that exist in both data sets. So since we are merging based on the region variable, we will lose all observations that have a value of "region" that doesn't exist in both data frames. You can change this default behavior by using the all.x and all.y arguments of the merge function. For more information, look at the help page for the merge function by typing ?merge in your R console.

###############################################

#PROBLEM 2.4 - COLORING THE STATES BY PREDICTIONS  

#Now we are ready to color the US map with our predictions! We can color the states according to our binary predictions:

ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary)) + geom_polygon(color = "black")

#The states appear light blue and dark blue in this map. Which color represents a Republican prediction?
#Ans:Light blue
#EXPLANATION:Our logistic regression model assigned 1 to Republican and 0 to Democrat. As we can see from the legend, 1 corresponds to a light blue color on the map and 0 corresponds to a dark blue color on the map.

##############################################

#PROBLEM 2.5 - COLORING THE STATES BY PREDICTIONS  

#We see that the legend displays a blue gradient for outcomes between 0 and 1. However, when plotting the binary predictions there are only two possible outcomes: 0 or 1. Let's replot the map with discrete outcomes. We can also change the color scheme to blue and red, to match the blue color associated with the Democratic Party in the US and the red color associated with the Republican Party in the US. This can be done with the following command:

ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPredictionBinary))+ geom_polygon(color = "black") + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

#Alternatively, we could plot the probabilities instead of the binary predictions. Change the plot command above to instead color the states by the variable TestPrediction. You should see a gradient of colors ranging from red to blue. 

ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black") + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

#Do the colors of the states in the map for TestPrediction look different from the colors of the states in the map with TestPredictionBinary? Why or why not?
#Ans:The two maps look very similar. This is because most of our predicted probabilities are close to 0 or close to 1
#EXPLANATION:The only state that appears purple (the color between red and blue) is the state of Iowa, so the maps look very similar. If you take a look at TestPrediction, you can see that most of our predicted probabilities are very close to 0 or very close to 1. In fact, we don't have a single predicted probability between 0.065 and 0.93.


#Lets change the color scheme, by changing the arguments low = "blue" and high = "red" to colors of your choice (to see all of the color options in R, type colors() in your R console). 

ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black") + scale_fill_gradient(low = "green", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

#Changing the color scheme to gray scale, by changing the low and high colors to "gray" and "black".

ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black") + scale_fill_gradient(low = "gray", high = "black", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

####################################################

#PROBLEM 3.1 - UNDERSTANDING THE PREDICTIONS  

#In the 2012 election, the state of Florida ended up being a very close race. It was ultimately won by the Democratic party.

#Did we predict this state correctly or incorrectly? To see the names and locations of the different states, take a look at the World Atlas map here:http://www.worldatlas.com/webimage/testmaps/usanames.htm
#Ans:We incorrectly predicted this state by predicting that it would be won by the Republican party.
#EXPLANATION:In our prediction map, the state of Florida is colored red, meaning that we predicted Republican. So we incorrectly predicted this state.

######################################

#PROBLEM 3.2 - UNDERSTANDING THE PREDICTIONS

#What was our predicted probability for the state of Florida?

predictionDataFrame[which(predictionDataFrame$Test.State == "Florida"), ]
##    TestPrediction TestPredictionBinary Test.State  region
## 24      0.9640395                    1    Florida florida
#Ans:Our prediction model did not do a very good job of correctly predicting the state of Florida, and we were very confident in our incorrect prediction.
#EXPLANATION:We predicted Republican for the state of Florida with high probability, meaning that we were very confident in our incorrect prediction! Historically, Florida is usually a close race, but our model doesn't know this. The model only uses polling results for the particular year. For Florida in 2012, Survey USA predicted a tie, but other polls predicted Republican, so our model predicted Republican.

#####################################################

#PROBLEM 4 - PARAMETER SETTINGS

#In this part, we'll explore what the different parameter settings of geom_polygon do. Throughout the problem, use the help page for geom_polygon, which can be accessed by ?geom_polygon. To see more information about a certain parameter, just type a question mark and then the parameter name to get the help page for that parameter. Experiment with different parameter settings to try and replicate the plots!

#?geom_polygon

#We'll be asking questions about the following three plots:

#Plot (1)
#The first plot can be generated by setting the parameter linetype = 3:
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black",linetype=3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

#Ans:linetype

#Plot (2)
#The second plot can be generated by setting the parameter size = 3:
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black",size=3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

#Ans:size

###################################################

#PROBLEM 4.2 - PARAMETER SETTINGS  

#Plot (3) was created by changing the value of a different geom_polygon parameter to have value 0.3. Which parameter did we use?

#Plot (3) can be created by changing the alpha parameter:
ggplot(predictionMap, aes(x = long, y = lat, group = group, fill = TestPrediction))+ geom_polygon(color = "black",alpha=0.3) + scale_fill_gradient(low = "blue", high = "red", guide = "legend", breaks= c(0,1), labels = c("Democrat", "Republican"), name = "Prediction 2012")

#Ans:alpha
#The "alpha" parameter controls the transparency or darkness of the color. A smaller value of alpha will make the colors lighter.

VISUALIZING NETWORK DATA

The cliche goes that the world is an increasingly interconnected place, and the connections between different entities are often best represented with a graph. Graphs are comprised of vertices (also often called “nodes”) and edges connecting those nodes. In this assignment, we will learn how to visualize networks using the igraph package in R.

For this assignment, we will visualize social networking data using anonymized data from Facebook; this data was originally curated in a recent paper about computing social circles in social networks. In our visualizations, the vertices in our network will represent Facebook users and the edges will represent these users being Facebook friends with each other.

The first dataset we will use, edges.csv, contains variables V1 and V2, which label the endpoints of edges in our network. Each row represents a pair of users in our graph who are Facebook friends. For a pair of friends A and B, edges.csv will only contain a single row – the smaller identifier will be listed first in this row. From this row, we will know that A is friends with B and B is friends with A.

The second dataset, users.csv, contains information about the Facebook users, who are the vertices in our network. This file contains the following variables:

  • id: A unique identifier for this user; this is the value that appears in the rows of edges.csv

  • gender: An identifier for the gender of a user taking the values A and B. Because the data is anonymized, we don’t know which value refers to males and which value refers to females.

  • school: An identifier for the school the user attended taking the values A and AB (users with AB attended school A as well as another school B). Because the data is anonymized, we don’t know the schools represented by A and B.

  • locale: An identifier for the locale of the user taking the values A and B. Because the data is anonymized, we don’t know which value refers to what locale.

#PROBLEM 1.1 - SUMMARIZING THE DATA  

#Load the datasets

edges <- read.csv("edges.csv")
str(edges)
## 'data.frame':    146 obs. of  2 variables:
##  $ V1: int  4019 4023 4023 4027 3988 3982 3994 3998 3993 3982 ...
##  $ V2: int  4026 4031 4030 4032 4021 3986 3998 3999 3995 4021 ...
users <- read.csv("users.csv")
str(users)
## 'data.frame':    59 obs. of  4 variables:
##  $ id    : int  3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 ...
##  $ gender: Factor w/ 3 levels "","A","B": 2 3 3 3 3 3 2 3 3 2 ...
##  $ school: Factor w/ 3 levels "","A","AB": 2 1 1 1 1 2 1 1 2 1 ...
##  $ locale: Factor w/ 3 levels "","A","B": 3 3 3 3 3 3 2 3 3 2 ...
#How many Facebook users are there in our dataset?
nrow(users)
## [1] 59
#Ans:59

#In our dataset, what is the average number of friends per user? Hint: this question is tricky, and it might help to start by thinking about a small example with two users who are friends.

#Average number of friends per user
nrow(edges) * 2 / nrow(users)
## [1] 4.949153
#Ans:4.949153
#EXPLANATION:From str(edges) or nrow(edges), we see that there are 146 pairs of users in our dataset who are Facebook friends. However, each pair (A, B) must be counted twice, because B is a friend of A and A is a friend of B. To think of this in simpler terms, consider a network with just new people, A and B, and a single edge (A, B). Even though there are two vertices and one edge, each user has on average one friend.
#For our network, the average number of friends per user is 292/59=4.95.
#Finally, note that in all likelihood these users have a much higher number of Facebook friends. We are computing here the average number of people in this dataset who are their friends, instead of the average total number of Facebook friends.

############################################################

#PROBLEM 1.2 - SUMMARIZING THE DATA  

#Out of all the students who listed a school, what was the most common locale?
table(users$locale, users$school)
##    
##         A AB
##      3  0  0
##   A  6  0  0
##   B 31 17  2
#Ans:Locale B 
#EXPLANATION:From table(users$locale, users$school), we read that all students listed at schools A and B listed their locale as B.

#####################################

#PROBLEM 1.3 - SUMMARIZING THE DATA  

#Is it possible that either school A or B is an all-girls or all-boys school?
table(users$gender, users$school)
##    
##         A AB
##      1  1  0
##   A 11  3  1
##   B 28 13  1
#Ans:No
#EXPLANATION:We see from table(users$gender, users$school) that both genders A and B have attended schools A and B.

###################################

#PROBLEM 2.1 - CREATING A NETWORK 

#We will be using the igraph package to visualize networks;

#install.packages("igraph") 
library(igraph) 
## 
## Attaching package: 'igraph'
## The following object is masked from 'package:stringr':
## 
##     %>%
## The following objects are masked from 'package:tidyr':
## 
##     %>%, crossing
## The following objects are masked from 'package:lubridate':
## 
##     %--%, union
## The following objects are masked from 'package:dplyr':
## 
##     %>%, as_data_frame, groups, union
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union
#We can create a new graph object using the graph.data.frame() function. Based on ?graph.data.frame, which of the following commands will create a graph g describing our social network, with the attributes of each user correctly loaded?

#?graph.data.frame

g <- graph.data.frame(edges, FALSE, users)

g = graph.data.frame(edges, TRUE, users)
 
plot(g, vertex.size=5, vertex.label=NA)

#Ans:g = graph.data.frame(edges, FALSE, users) 
#EXPLANATION:From ?graph.data.frame, we can see that the function expects the first two columns of parameter d to specify the edges in the graph -- our edges object fits this description.
#Our edges are undirected -- if A is a Facebook friend of B then B is a Facebook friend of A. Therefore, we set the directed parameter to FALSE.
#The vertices parameter expects a data frame where the first column is a vertex id and the remaining columns are properties of vertices in our graph. This is the case with our users data frame.
#Note: A directed graph is one where the edges only go one way -- they point from one vertex to another. The other option is an undirected graph, which means that the relations between the vertices are symmetric.

####################################

#PROBLEM 2.2 - CREATING A NETWORK  

#Use the correct command from Problem 2.1 to load the graph g.

#Now, we want to plot our graph. By default, the vertices are large and have text labels of a user's identifier. Because this would clutter the output, we will plot with no text labels and smaller vertices:

g <- graph.data.frame(edges, FALSE, users)
plot(g, vertex.size=5, vertex.label=NA)

#In this graph, there are a number of groups of nodes where all the nodes in each group are connected but the groups are disjoint from one another, forming "islands" in the graph. Such groups are called "connected components," or "components" for short. How many connected components with at least 2 nodes are there in the graph?
#Ans:4
#EXPLANATION:In addition to the large connected component, there is a 4-node component and two 2-node components.

#How many users are there with no friends in the network?
#Ans:7
#EXPLANATION:There are 7 nodes that are not connected to any other nodes. Each forms a 1-node connected component.

####################################

#PROBLEM 2.3 - CREATING A NETWORK  

#In our graph, the "degree" of a node is its number of friends. We have already seen that some nodes in our graph have degree 0 (these are the nodes with no friends), while others have much higher degree. We can use degree(g) to compute the degree of all the nodes in our graph g.

degree(g)
## 3981 3982 3983 3984 3985 3986 3987 3988 3989 3990 3991 3992 3993 3994 3995 
##    7   13    1    0    5    8    1    6    5    3    2    2    5   10    8 
##  594 3996 3997 3998 3999 4000 4001 4002 4003 4004 4005 4006 4007 4008 4009 
##    3    3   10   13    3    8    1    6    4    9    2    1    3    0    9 
## 4010 4011 4012 4013 4014 4015 4016 4017 4018 4019 4020 4021 4022 4023 4024 
##    0    3    1    5   11    0    3    8    6    7    7   10    0   17    0 
## 4025 4026 4027 4028 4029 4030 4031 4032 4033 4034 4035 4036 4037 4038 
##    3    8    6    1    1   18   10    1    2    1    0    1    3    8
#How many users are friends with 10 or more other Facebook users in this network?
sum(degree(g) >= 10)
## [1] 9
#Ans:9
#EXPLANATION:From table(degree(g)) or table(degree(g) >= 10), we can see that there are 9 users with 10 or more friends in this network.

####################################

#PROBLEM 2.4 - CREATING A NETWORK  

#In a network, it's often visually useful to draw attention to "important" nodes in the network. While this might mean different things in different contexts, in a social network we might consider a user with a large number of friends to be an important user. From the previous problem, we know this is the same as saying that nodes with a high degree are important users.

#To visually draw attention to these nodes, we will change the size of the vertices so the vertices with high degrees are larger. To do this, we will change the "size" attribute of the vertices of our graph to be an increasing function of their degrees:

#To make vertices with high degree (more friends) larger:
V(g)$size = degree(g)/2+2

#Now that we have specified the vertex size of each vertex, we will no longer use the vertex.size parameter when we plot our graph:

plot(g, vertex.label=NA)

table(V(g)$size)
## 
##    2  2.5    3  3.5    4  4.5    5  5.5    6  6.5    7  7.5  8.5 10.5   11 
##    7   10    4    9    1    4    4    3    6    2    4    1    2    1    1
#What is the largest size we assigned to any node in our graph?
#Ans:11

#What is the smallest size we assigned to any node in our graph?
#Ans:2
#EXPLANATION:From 
table(degree(g)) 
## 
##  0  1  2  3  4  5  6  7  8  9 10 11 13 17 18 
##  7 10  4  9  1  4  4  3  6  2  4  1  2  1  1
#or
summary(degree(g))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   4.949   8.000  18.000
#we see that the maximum degree of any node in the graph is 18 and the minimum degree of any node is 0. Therefore, the maximum size of any point is 18/2+2=11, and the minimum size is 0/2+2=2.

##########################

#PROBLEM 3.1 - COLORING VERTICES  

#Thus far, we have changed the "size" attributes of our vertices. However, we can also change the colors of vertices to capture additional information about the Facebook users we are depicting.

#When changing the size of nodes, we first obtained the vertices of our graph with V(g) and then accessed the the size attribute with V(g)$size. To change the color, we will update the attribute V(g)$color.

#To color the vertices based on the gender of the user, we will need access to that variable. When we created our graph g, we provided it with the data frame users, which had variables gender, school, and locale. These are now stored as attributes V(g)$gender, V(g)$school, and V(g)$locale.

#We can update the colors by setting the color to black for all vertices, than setting it to red for the vertices with gender A and setting it to gray for the vertices with gender B:

V(g)$color = "black"

V(g)$color[V(g)$gender == "A"] = "red"

V(g)$color[V(g)$gender == "B"] = "gray"

plot(g, vertex.label=NA)

#Plot the resulting graph. What is the gender of the users with the highest degree in the graph?
#Ans:Gender B
#EXPLANATION:After updating V(g)$color, run plot(g, vertex.label=NA) to plot the graph. All the largest nodes (the ones with the highest degree) are colored gray, which corresponds to Gender B.

###########################

#PROBLEM 3.2 - COLORING VERTICES  

#Now, color the vertices based on the school that each user in our network attended.

#Color vertex based on school
V(g)$color <- "black"
V(g)$color[V(g)$school == "A"] <- "red"
V(g)$color[V(g)$school == "AB"] <- "green"
plot(g, vertex.label=NA)

#Are the two users who attended both schools A and B Facebook friends with each other?
#Ans:Yes

#What best describes the users with highest degree?
#Ans:Some, but not all, of the high-degree users attended school A 

#EXPLANATION:As with coloring by gender, we will set the color for all vertices to black, and then we will set the color for students from school A to red and the color for students from schools A and B to gray. Finally we will plot the updated graph.The two students who attended schools A and B are colored gray; we can see from the graph that they are Facebook friends (aka they are connected by an edge). The high-degree users (depicted by the large nodes) are a mixture of red and black color, meaning some of these users attended school A and other did not.

##################################

#PROBLEM 3.3 - COLORING VERTICES  

#Now, color the vertices based on the locale of the user.
#The large connected component is most associated with which locale?

#Color vertex based on locale
V(g)$color <- "black"
V(g)$color[V(g)$locale == "A"] <- "red"
V(g)$color[V(g)$locale == "B"] <- "green"
plot(g, vertex.label=NA)

#Ans:Locale B

#The 4-user connected component is most associated with which locale?
#Ans:Locale A

#EXPLANATION:As with the other coloring tasks, we will set the color for all vertices to black, and then we will set the color for users from locale A to red and the color for users from locale B to gray. Finally we will plot the updated graph.Nearly all of the vertices from the large connected component are colored gray, indicating users from Locale B. Meanwhile, all the vertices in the 4-user connected component are colored red, indicating users from Locale A.

##############################

#PROBLEM 4 - OTHER PLOTTING OPTIONS 

#The help page is a helpful tool when making visualizations. Answer the following questions with the help of ?igraph.plotting and experimentation in your R console.

#?igraph.plotting

#Making #D plots using rglplot
#Which igraph plotting function would enable us to plot our graph in 3-D?
library(rgl)
rglplot(g, vertex.label=NA)
#Ans:rglplot

#What parameter to the plot() function would we use to change the edge width when plotting g?
plot(g, edge.width=2, vertex.label=NA)

#Ans:edge.width

#EXPLANATION:The three functions to plot the igraph are plot.igraph (the function we used through the command "plot"), tkplot, and rglplot. rglplot makes 3-D plots -- you can try one with rglplot(g, vertex.label=NA). Once you've made the plot, you can click and drag to rotate the graph. To use this function, you will need to install and load the "rgl" package.

#To change the edge width, you need to change the edge parameter called "width". From ?igraph.plotting, we read that we need to append the prefix "edge." to the beginning for our call to plot, so the full parameter is called "edge.width". For instance, we could plot with edge width 2 with the command plot(g, edge.width=2, vertex.label=NA).

#experimentation with plotting options
plot(g)

#or
plot.igraph(g)

tkplot(g)
## [1] 1
#examples from igraph.plotting help page

# plotting a random graph, set the parameters in the command arguments
g1 <- barabasi.game(100)
plot(g1, layout=layout_with_fr, vertex.size=4,
     vertex.label.dist=0.5, vertex.color="red", edge.arrow.size=0.5)

# plot a random graph, different color for each component
g2 <- sample_gnp(100, 1/100)
comps <- components(g2)$membership
colbar <- rainbow(max(comps)+1)
V(g2)$color <- colbar[comps+1]
plot(g2, layout=layout_with_fr, vertex.size=5, vertex.label=NA)

# plot communities in a graph
g3 <- make_full_graph(5) %du% make_full_graph(5) %du% make_full_graph(5)
g3 <- add_edges(g3, c(1,6, 1,11, 6,11))
com <- cluster_spinglass(g3, spins=5)
V(g3)$color <- com$membership+1
g3 <- set_graph_attr(g3, "layout", layout_with_kk(g3))
plot(g3, vertex.label.dist=1.5)

# draw a bunch of trees, fix layout
igraph_options(plot.layout=layout_as_tree)
plot(make_tree(20, 2))

plot(make_tree(50, 3), vertex.size=3, vertex.label=NA)

tkplot(make_tree(50, 2, mode="undirected"), vertex.size=10,
vertex.color="green")
## [1] 2

VISUALIZING TEXT DATA USING WORD CLOUDS

Earlier in the course, we used text analytics as a predictive tool, using word frequencies as independent variables in our models. However, sometimes our goal is to understand commonly occurring topics in text data instead of to predict the value of some dependent variable. In such cases, word clouds can be a visually appealing way to display the most frequent words in a body of text.

A word cloud arranges the most common words in some text, using size to indicate the frequency of a word. For instance, this is a word cloud for the complete works of Shakespeare, removing English stopwords:

shakespeare

shakespeare

While we could generate word clouds using free generators available on the Internet, we will have more flexibility and control over the process if we do so in R. We will visualize the text of tweets about Apple, a dataset we used earlier in the course. As a reminder, this dataset (which can be downloaded from tweets.csv) has the following variables:

  • Tweet – the text of the tweet

  • Avg – the sentiment of the tweet, as assigned by users of Amazon Mechanical Turk. The score ranges on a scale from -2 to 2, where 2 means highly positive sentiment, -2 means highly negative sentiment, and 0 means neutral sentiment.

#PROBLEM 1.1 - PREPARING THE DATA  

#Load the dataset "tweets.csv" remembering to use stringsAsFactors=FALSE when loading the data.
tweets <- read.csv("tweets.csv", stringsAsFactors=FALSE)

#Next, perform the following pre-processing tasks (like we did in Unit 5), noting that we don't stem the words in the document or remove sparse terms:

#Clean up data. Excluding stemming because it will be easier to red and understand the word cloud if it includes full words.

library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## 
## Attaching package: 'tm'
## The following object is masked from 'package:mosaic':
## 
##     inspect
library(SnowballC)

#1) Create a corpus using the Tweet variable
corpus <- Corpus(VectorSource(tweets$Tweet))

#2) Convert the corpus to lowercase (don't forget to type "corpus = tm_map(corpus, PlainTextDocument)" in your R console right after this step)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)

#3) Remove punctuation from the corpus
corpus <- tm_map(corpus, removePunctuation)

#4) Remove all English-language stopwords
corpus<- tm_map(corpus, removeWords, stopwords("english"))

#5) Build a document-term matrix out of the corpus
frequencies <- DocumentTermMatrix(corpus)

#6) Convert the document-term matrix to a data frame called allTweets
allTweets <- as.data.frame(as.matrix(frequencies))

#How many unique words are there across all the documents?
frequencies 
## <<DocumentTermMatrix (documents: 1181, terms: 3780)>>
## Non-/sparse entries: 10273/4453907
## Sparsity           : 100%
## Maximal term length: 115
## Weighting          : term frequency (tf)
#or
ncol(allTweets)
## [1] 3780
#or
#str(allTweets)

#Ans:3780
#EXPLANATION: we can read that there are 3780 unique words across all the tweets.

######################################

#PROBLEM 1.2 - PREPARING THE DATA 

#Although we typically stem words during the text preprocessing step, we did not do so here. What is the most compelling rationale for skipping this step when visualizing text data?
#Ans:It will be easier to read and understand the word cloud if it includes full words instead of just the word stems 
#EXPLANATION:We want to create an interpretable display of a document's contents, and our results will be easier to read if they include full words instead of just the stems.Stemming has relatively minor computational burden, and we certainly could create a word cloud with a stemmed document.

#####################################

#PROBLEM 2.1 - BUILDING A WORD CLOUD  

#Install and load the "wordcloud" package, which is needed to build word clouds.
#install.packages("wordcloud")
library(wordcloud)
## Loading required package: RColorBrewer
#As we can read from ?wordcloud, we will need to provide the function with a vector of words and a vector of word frequencies. Which function can we apply to allTweets to get a vector of the words in our dataset, which we'll pass as the first argument to wordcloud()?
#Ans:colnames
#EXPLANATION:Each tweet represents a row of allTweets, and each word represents a column. We need the names of all the columns of allTweets, which is returned by colnames(allTweets). While str(allTweets) displays the names of the variables along with other information, it doesn't return a vector that we can use as the first argument to wordcloud().

######################################

#PROBLEM 2.2 - BUILDING A WORD CLOUD

head(colnames(allTweets))
## [1] "000"  "075"  "0909" "0910" "099"  "100"
#Which function should we apply to allTweets to obtain the frequency of each word across all tweets?
head(colSums(allTweets))
##  000  075 0909 0910  099  100 
##    1    3    1    1    1    2
#Ans: colSums
#EXPLANATION:Each tweet represents a row in allTweets, and each word represents a column. Therefore, we need to access the sums of each column in allTweets, which is returned by colSums(allTweets).

################################

#PROBLEM 2.3 - BUILDING A WORD CLOUD  

#Use allTweets to build a word cloud. Make sure to check out the help page for wordcloud if you are not sure how to do this.
wordcloud(colnames(allTweets), colSums(allTweets))
## Warning in wordcloud(colnames(allTweets), colSums(allTweets)): apple could
## not be fit on page. It will not be plotted.

#Because we are plotting a large number of words, you might get warnings that some of the words could not be fit on the page and were therefore not plotted -- this is especially likely if you are using a smaller screen. You can address these warnings by plotting the words smaller. From ?wordcloud, we can see that the "scale" parameter controls the sizes of the plotted words. By default, the sizes range from 4 for the most frequent words to 0.5 for the least frequent, as denoted by the parameter "scale=c(4, 0.5)". We could obtain a much smaller plot with, for instance, parameter "scale=c(2, 0.25)".

wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25))

#What is the most common word across all the tweets (it will be the largest in the outputted word cloud)? Please type the word exactly how you see it in the word cloud. The most frequent word might not be printed if you got a warning about words being cut off -- if this happened, be sure to follow the instructions in the paragraph above.
#Ans:apple
#EXPLANATION:"apple" is by far the largest, and therefore most common, word.

###################################

#PROBLEM 2.4 - BUILDING A WORD CLOUD  

#In the previous subproblem, we noted that there is one word with a much higher frequency than the other words. Repeat the steps to load and pre-process the corpus, this time removing the most frequent word in addition to all elements of stopwords("english") in the call to tm_map with removeWords. For a refresher on how to remove this additional word, see the Twitter text analytics lecture.

#Remove the most frequent word 'apple' and regenerate graph

corpus <- Corpus(VectorSource(tweets$Tweet))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c("apple", stopwords("english")))
frequencies <- DocumentTermMatrix(corpus)

#Replace allTweets with the document-term matrix of this new corpus -- we will use this updated corpus for the remainder of the assignment.
allTweets <- as.data.frame(as.matrix(frequencies))

#Create a word cloud with the updated corpus. 
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25))

#What is the most common word in this new corpus (the largest word in the outputted word cloud)? The most frequent word might not be printed if you got a warning about words being cut off -- if this happened, be sure to follow the instructions in the previous problem.
#Ans:iphone
#EXPLANATION:The most common (largest) word is now "iphone".

########################################################

#PROBLEM 3 - SIZE AND COLOR

#So far, the word clouds we've built have not been too visually appealing -- they are crowded by having too many words displayed, and they don't take advantage of color. One important step to building visually appealing visualizations is to experiment with the parameters available, which in this case can be viewed by typing ?wordcloud in your R console. In this problem, you should look through the help page and experiment with different parameters to answer the questions.

#?wordcloud

#Below are four word clouds, each of which uses different parameter settings in the call to the wordcloud() function:

#Word Cloud A:
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),rot.per=0.5)

#Word Cloud B:
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),min.freq=10,random.order=FALSE)

#Word Cloud C:
negativeTweets = subset(allTweets, tweets$Avg <= -1)
wordcloud(colnames(negativeTweets), colSums(negativeTweets),scale=c(2, 0.25),min.freq=4,colors=brewer.pal(9,"Purples")[5:9])

#Word Cloud D:
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),min.freq=10,random.order=FALSE,random.color=TRUE,colors=brewer.pal(9,"Purples")[5:9])

#PROBLEM 3.1 - SIZE AND COLOR 

#Which word cloud is based only on the negative tweets (tweets with Avg value -1 or less)?
#Ans:Word Cloud C
#EXPLANATION:Word Cloud C is the only one with a different distribution of the most frequent words -- negative words (or censored versions of negative words) are much more common in this cloud.It is quite simple to obtain a word cloud that is limited to a subset of the tweets using the subset function.

#########################

#PROBLEM 3.2 - SIZE AND COLOR 

#Only one word cloud was created without modifying parameters min.freq or max.words. Which word cloud is this?
#Ans:Word Cloud A
#EXPLANATION:min.freq and max.words are parameters that can be used to remove the least frequent words, resulting is a less cluttered word cloud. Word Cloud A is much more cluttered than the others because it did not use either of these parameters, and therefore is displaying every word that appears more than 3 times.

###########################

##PROBLEM 3.3 - SIZE AND COLOR  

#Which word clouds were created with parameter random.order set to FALSE?
#Ans:Word Cloud B &  Word Cloud D
#EXPLANATION:If random.order is set to FALSE, then the most frequent (largest) words will be plotted first, resulting in them being displayed together in the center of the word cloud. This is the case in Word Cloud B and Word Cloud D.

################################################

#PROBLEM 3.4 - SIZE AND COLOR  

#Which word cloud was built with a non-default value for parameter rot.per?
#Ans:Word Cloud A 
#EXPLANATION:rot.per controls the proportion of words that are rotated to be vertical in the word cloud. By default 10% of words are rotated. However in Word Cloud A a much higher proportion (50%) are rotated, which was achieved by setting rot.per=0.5.

#########################################

#PROBLEM 3.5 - SIZE AND COLOR  

#In Word Cloud C and Word Cloud D, we provided a color palette ranging from light purple to dark purple as the parameter colors (you will learn how to make such a color palette later in this assignment). For which word cloud was the parameter random.color set to TRUE?
#Ans:Word Cloud D
#EXPLANATION:When random.color is set to TRUE, the words will be colored randomly. This is the case in Word Cloud D. Meanwhile, colors were assigned based on the number of appearances in Word Cloud C.

##########################################################

#PROBLEM 4.1 - SELECTING A COLOR PALETTE  

#The use of a palette of colors can often improve the overall effect of a visualization. We can easily select our own colors when plotting; for instance, we could pass c("red", "green", "blue") as the colors parameter to wordcloud(). The RColorBrewer package, which is based on the ColorBrewer project (colorbrewer.org), provides pre-selected palettes that can lead to more visually appealing images. Though these palettes are designed specifically for coloring maps, we can also use them in our word clouds and other visualizations.

#The function brewer.pal() returns color palettes from the ColorBrewer project when provided with appropriate parameters, and the function display.brewer.all() displays the palettes we can choose from.

display.brewer.all()

#Which color palette would be most appropriate for use in a word cloud for which we want to use color to indicate word frequency?
#Ans:YlOrRd
#EXPLANATION:From ?brewer.pal we read that Accent and Set2 are both "qualitative palettes," which means color changes don't imply a change in magnitude (we can also see this in the output of display.brewer.all). As a result, the colors selected would not visually identify the least and most frequent words.
#On the other hand, YlOrRd is a "sequential palette," with earlier colors begin lighter and later colors being darker. Therefore, it is a good palette choice for indicating low-frequency vs. high-frequency words.

###################################

#PROBLEM 4.2 - SELECTING A COLOR PALETTE

#Which RColorBrewer palette name would be most appropriate to use when preparing an image for a document that must be in grayscale?
#Displaying grey palette
display.brewer.pal(7, "Greys")

#Ans:Greys
#EXPLANATION:As we can see from display.brewer.all(), palette "Greys" is the only one completely in grayscale.

######################################

#PROBLEM 4.3 - SELECTING A COLOR PALETTE

#In sequential palettes, sometimes there is an undesirably large contrast between the lightest and darkest colors. You can see this effect when plotting a word cloud for allTweets with parameter colors=brewer.pal(9, "Blues"), which returns a sequential blue palette with 9 colors.

wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),min.freq=10, colors=brewer.pal(9,"Blues"))

#Which of the following commands addresses this issue by removing the first 4 elements of the 9-color palette of blue colors? Select all that apply.

#Removing the light colors
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),min.freq=10, colors=brewer.pal(9,"Blues")[c(5,6,7,8,9)])

#or a shorthand for this indexing is:
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),min.freq=10, colors=brewer.pal(9,"Blues")[5:9])

#Alternately
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),min.freq=10,colors=brewer.pal(9, "Blues")[c(-1, -2, -3, -4)])

#or a shorthand for this indexing is:
wordcloud(colnames(allTweets), colSums(allTweets), scale=c(2, 0.25),min.freq=10, colors=brewer.pal(9, "Blues")[-1:-4])

#Ans:brewer.pal(9, "Blues")[c(-1, -2, -3, -4)]  
#brewer.pal(9, "Blues")[c(5, 6, 7, 8, 9)] 
#EXPLANATION:The fourth option limits to elements 5-9, which removes the first four. The second option uses negative indexes, which means remove elements 1-4.

VISUALIZING ATTRIBUTES OF PAROLE VIOLATORS (OPTIONAL)

In the crime lecture, we saw how we can use heatmaps to give a 2-dimensional representation of 3-dimensional data: we made heatmaps of crime counts by time of the day and day of the week. In this problem, we’ll learn how to use histograms to show counts by one variable, and then how to visualize 3 dimensions by creating multiple histograms.

We’ll use the parole data parole.csv from Unit 3. Before, we used this data to predict parole violators. Now, let’s try to get a little more insight into this dataset using histograms. As a reminder, the variables in this dataset are:

  • male = 1 if the parolee is male, 0 if female
  • race = 1 if the parolee is white, 2 otherwise

  • age_ = the parolee’s age in years at the time of release from prison

  • state = a code for the parolee’s state. 2 is Kentucky, 3 is Louisiana, 4 is Virginia, and 1 is any other state. These three states were selected due to having a high representation in the dataset.

  • time.served = the number of months the parolee served in prison (limited by the inclusion criteria to not exceed 6 months).

  • max.sentence = the maximum sentence length for all charges, in months (limited by the inclusion criteria to not exceed 18 months).

  • multiple.offenses = 1 if the parolee was incarcerated for multiple offenses, 0 otherwise.

  • crime = a code for the parolee’s main crime leading to incarceration. 2 is larceny, 3 is drug-related crime, 4 is driving-related crime, and 1 is any other crime.

  • violator = 1 if the parolee violated the parole, and 0 if the parolee completed the parole without violation.

#PROBLEM 1.1

#Load the data
parole<-read.csv("parole.csv")
str(parole)
## 'data.frame':    675 obs. of  9 variables:
##  $ male             : int  1 0 1 1 1 1 1 0 0 1 ...
##  $ race             : int  1 1 2 1 2 2 1 1 1 2 ...
##  $ age              : num  33.2 39.7 29.5 22.4 21.6 46.7 31 24.6 32.6 29.1 ...
##  $ state            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ time.served      : num  5.5 5.4 5.6 5.7 5.4 6 6 4.8 4.5 4.7 ...
##  $ max.sentence     : int  18 12 12 18 12 18 18 12 13 12 ...
##  $ multiple.offenses: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ crime            : int  4 3 3 1 1 4 3 1 3 2 ...
##  $ violator         : int  0 0 0 0 0 0 0 0 0 0 ...
#Since male, state, and crime are all unordered factors, convert them to factor variables:

parole$male = as.factor(parole$male)
parole$state = as.factor(parole$state)
parole$crime = as.factor(parole$crime)

#see the structure of the dataset again
str(parole)
## 'data.frame':    675 obs. of  9 variables:
##  $ male             : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 1 1 2 ...
##  $ race             : int  1 1 2 1 2 2 1 1 1 2 ...
##  $ age              : num  33.2 39.7 29.5 22.4 21.6 46.7 31 24.6 32.6 29.1 ...
##  $ state            : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  $ time.served      : num  5.5 5.4 5.6 5.7 5.4 6 6 4.8 4.5 4.7 ...
##  $ max.sentence     : int  18 12 12 18 12 18 18 12 13 12 ...
##  $ multiple.offenses: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ crime            : Factor w/ 4 levels "1","2","3","4": 4 3 3 1 1 4 3 1 3 2 ...
##  $ violator         : int  0 0 0 0 0 0 0 0 0 0 ...
#What fraction of parole violators are female?
prop.table(table(parole$male, parole$violator),2)
##    
##             0         1
##   0 0.1943049 0.1794872
##   1 0.8056951 0.8205128
#Ans:0.1794872
#EXPLANATION:The total number of violators is 78, and 14 of them are female.

############################################

#PROBLEM 1.2 

#In this dataset, which crime is the most common in Kentucky?
table(parole$crime[parole$state==2])
## 
##  1  2  3  4 
## 42 10 64  4
#Ans:Drug-related crime
#EXPLANATION:The code 2 corresponds to Kentucky, and the most common crime is 3, which corresponds to Drug-related crime.

########################################

#PROBLEM 2.1 - CREATING A BASIC HISTOGRAM  

#Recall from lecture that in ggplot, we need to specify the dataset, the aesthetic, and the geometry. To create a histogram, the geometry will be geom_histogram. The data we'll use is parole, and the aesthetic will be the map from a variable to the x-axis of the histogram.

#Create a histogram to find out the distribution of the age of parolees:

library(ggplot2)

ggplot(data = parole, aes(x = age)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#By default, geom_histogram divides the data into 30 bins. Change the width of the bins to 5 years by adding the argument "binwidth = 5" to geom_histogram.

ggplot(data = parole, aes(x = age)) + geom_histogram(binwidth = 5)

#Note that by default, histograms create bins where the left endpoint is included in the bin, but the right endpoint isn't. So the first bin in this histogram represents parolees who are between 15 and 19 years old. The last bin in this histogram represents parolees who are between 65 and 69 years old.

#What is the age bracket with the most parolees?
#Ans:20-24
#EXPLANATION:The tallest bar corresponds to the age bracket with the most parolees, which is 20-24.

######################################

#PROBLEM 2.2 - CREATING A BASIC HISTOGRAM  

#Redo the histogram, adding the following argument to the geom_histogram function: color="blue". What does this do? Select all that apply.
ggplot(data = parole, aes(x = age)) + geom_histogram(binwidth = 5,color="blue")

#Ans:Changes the outline color of the bars

#################################################

#PROBLEM 3.1 - ADDING ANOTHER DIMENSION  

#Now suppose we are interested in seeing how the age distribution of male parolees compares to the age distribution of female parolees.

#One option would be to create a heatmap with age on one axis and male (a binary variable in our data set) on the other axis. Another option would be to stick with histograms, but to create a separate histogram for each gender. ggplot has the ability to do this automatically using the facet_grid command.

#To create separate histograms for male and female, type the following command into your R console:

ggplot(data = parole, aes(x = age)) + geom_histogram(binwidth = 5) + facet_grid(male ~ .)

#The histogram for female parolees is shown at the top, and the histogram for male parolees is shown at the bottom.

#What is the age bracket with the most female parolees?
#Ans:35-39
#EXPLANATION:Looking at the histogram at the top, we can see that the tallest bar corresponds to the age bracket 35-39.

#######################################

#PROBLEM 3.2 - ADDING ANOTHER DIMENSION  

#Now change the facet_grid argument to be ".~male" instead of "male~.". What does this do?
ggplot(data = parole, aes(x = age)) + geom_histogram(binwidth = 5) + facet_grid(.~male)

#Puts the histograms side-by-side instead of on top of each other.

##########################################

#PROBLEM 3.3 - ADDING ANOTHER DIMENSION  

#An alternative to faceting is to simply color the different groups differently. To color the data points by group, we need to tell ggplot that a property of the data (male or not male) should be translated to an aesthetic property of the histogram. We can do this by setting the fill parameter within the aesthetic to male.

#producing a histogram where data points are colored by group:

ggplot(data = parole, aes(x = age, fill = male)) + geom_histogram(binwidth = 5)

#Since we didn't specify colors to use, ggplot will use its default color selection. Let's change this by defining our own color palette.:

colorPalette = c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

#This is actually a colorblind-friendly palette, desribed on this Cookbook for R page. Now, generate your histogram again, using colorPalette:

ggplot(data = parole, aes(x = age, fill = male)) + geom_histogram(binwidth = 5) + scale_fill_manual(values=colorPalette)

#What color is the histogram for the female parolees?
#Ans:Black
#EXPLANATION:From the previous question, we saw that the female parolee histogram was much smaller than the male parolee histogram. So it looks like the female histogram is the black-colored one. We can also read this from the legend.

###################################

#PROBLEM 3.4 - ADDING ANOTHER DIMENSION  

#Coloring the groups differently is a good way to see the breakdown of age by sex within the single, aggregated histogram. However, the bars here are stacked, meaning that the height of the orange bars in each age bin represents the total number of parolees in that age bin, not just the number of parolees in that group.

#An alternative to a single, stacked histogram is to create two histograms and overlay them on top of each other. This is a simple adjustment to our previous command.

#We just need to:

#1) Tell ggplot not to stack the histograms by adding the argument position="identity" to the geom_histogram function.
ggplot(parole, aes(x = age, fill = male)) + geom_histogram(binwidth = 5, position = "identity") + scale_fill_manual(values=colorPalette)

#2) Make the bars semi-transparent so we can see both colors by adding the argument alpha=0.5 to the geom_histogram function.
ggplot(parole, aes(x = age, fill = male)) + geom_histogram(binwidth = 5, position = "identity", alpha = 0.5) + scale_fill_manual(values=colorPalette)

#Which of the following buckets contain no female paroles? Select all that apply.
#Ans:15-19, 55-59, and 65-69 

#####################################################

#PROBLEM 4.1 - TIME SERVED  

#Now let's explore another aspect of the data: the amount of time served by parolees. Create a basic histogram like the one we created in Problem 2, but this time with time.served on the x-axis. Set the bin width to one month.

ggplot(data = parole, aes(x = time.served)) + geom_histogram(binwidth = 1)

#What is the most common length of time served, according to this histogram?
#Ans:Between 4 and 5 months
#EXPLANATION:The highest bar corresponds to between 4 and 5 months.

#################################

#PROBLEM 4.2 - TIME SERVED  

#Change the binwidth to 0.1 months. Now what is the most common length of time served, according to the histogram?
ggplot(data = parole, aes(x = time.served)) + geom_histogram(binwidth = 0.1)

#Ans:Between 3.0 and 3.1 months
#EXPLANATION:Now, the highest bar corresponds to between 3.0 and 3.1 months.

#Be careful when choosing the binwidth - it can significantly affect the interpretation of a histogram! When visualizing histograms, it is always a good idea to vary the bin size in order to understand the data at various granularities.

#################################

#PROBLEM 4.3 - TIME SERVED

#Now, suppose we suspect that it is unlikely that each type of crime has the same distribution of time served. To visualize this, change the binwidth back to 1 month, and use facet_grid to create a separate histogram of time.served for each value of the variable crime.

ggplot(data = parole, aes(x = time.served)) + geom_histogram(binwidth = 1) + facet_grid(crime ~ .)

#Which crime type has no observations where time served is less than one month? Recall that crime type #2 is larceny, #3 is drug-related crime, #4 is driving-related crime, and #1 is any other crime.
#Ans:Driving-related

#For which crime does the frequency of 5-6 month prison terms exceed the frequencies of each other term length?
#Ans:Drug-related

###################################################

#PROBLEM 4.4 - TIME SERVED  

#Now, instead of faceting the histograms, overlay them. Remember to set the position and alpha parameters so that the histograms are not stacked. Also, make sure to indicate that the fill aesthetic should be "crime".

ggplot(data=parole, aes(x=time.served, fill=crime)) + geom_histogram(binwidth=1, position="identity", alpha=0.5)

#In this case, faceting seems like a better alternative. Why?
#Ans:With four different groups, it can be hard to tell them apart when they are overlayed.
#EXPLANATION:While overlaying the plots is allowed and lets us observe some attributes of the plots like the most common crime type, it can be hard to tell them apart and if they have similar values it can be hard to read.

Wonder World of Visualisation comes to end…..this is just the begining!