Visualizing data

setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_7_Visualization")
WHO <- read.csv("WHO.csv")
str(WHO)

## 'data.frame':    194 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
##  $ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
##  $ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
##  $ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
##  $ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
##  $ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
##  $ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
##  $ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
##  $ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
##  $ GNI                          : num  1140 8820 8310 NA 5230 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
##  $ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...

Base R plot

plot(WHO$GNI, WHO$FertilityRate)

Use ggplot

Remember that we need at least three things to create a plot using ggplot: data, an aesthetic mapping of variables in the data frame to visual output, and a geometric object. So first, let’s create the ggplot object with the data and the aesthetic mapping, and save it to the variable scatterplot. The first argument to the ggplot function is the name of our data set, WHO, which specifies the data to use. The second argument is the aesthetic mapping, aes, where we decide what we want on the x-axis and what we want on the y-axis. We want the x-axis to be GNI and the y-axis to be FertilityRate.

Now, we need to tell ggplot what geometric objects to put in the plot. We could use bars, lines, points, or something else. This is a big difference between ggplot and regular plotting in R. You can build different types of graphs by using the same ggplot object. There’s no need to learn one function for bar graphs, a completely different function for line graphs, etc.

First, let’s create a straightforward scatterplot, so the geometry we want to add is points. We can do this by typing the name of our ggplot object, scatterplot, and then adding the function geom_point(). If you hit Enter, you should see a new plot in the Graphics window that looks similar to our original plot, but with a few nice improvements. One is that each axis label shows just the variable name, without the data set name and dollar sign in front of it. Another is that we have grid lines in the background and solid points that stand out against it.

library(ggplot2)
scatterplot <- ggplot(WHO, aes(x = GNI, y = FertilityRate))
scatterplot + geom_point()

## Warning in loop_apply(n, do.ply): Removed 35 rows containing missing
## values (geom_point).
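
The warning means that ggplot dropped the rows where GNI or FertilityRate is NA. As a minimal sketch of how one might count such rows before plotting, the example below uses a small made-up data frame called toy (a hypothetical stand-in, not the WHO values) and base R's complete.cases().

```r
# Hypothetical stand-in for the WHO data frame; the values are made up.
toy <- data.frame(
  GNI           = c(1140, 8820, NA, 5230, NA),
  FertilityRate = c(5.4, 1.75, 2.83, NA, NA)
)

# A row is incomplete if either plotted variable is NA.
incomplete <- !complete.cases(toy[, c("GNI", "FertilityRate")])

# Number of rows a scatterplot of these two columns would have to drop.
n_dropped <- sum(incomplete)
print(n_dropped)

# Keep only the rows that can actually be drawn.
toy_complete <- toy[!incomplete, ]
print(nrow(toy_complete))
```

Applied to the real data, the same expression with WHO in place of toy should give the count reported in the geom_point warning.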

We could have made a line graph just as easily by changing point to line. So in your R console, hit the up arrow, and then just delete “point” and type “line” and hit Enter.

scatterplot + geom_line()

## Warning in loop_apply(n, do.ply): Removed 32 rows containing missing
## values (geom_path).

Let’s switch back to points, because a scatterplot makes more sense for this data than a line graph.

scatterplot + geom_point()

## Warning in loop_apply(n, do.ply): Removed 35 rows containing missing
## values (geom_point).

In addition to specifying that the geometry we want is points, we can add other options, like the color, shape, and size of the points. Let’s redo our plot with larger blue triangles (shape = 17) instead of circles.

scatterplot + geom_point(color = "blue", size = 3, shape = 17)

## Warning in loop_apply(n, do.ply): Removed 35 rows containing missing
## values (geom_point).

Another option is dark red stars, which correspond to shape = 8.

scatterplot + geom_point(color = "darkred", size = 3, shape = 8)

## Warning in loop_apply(n, do.ply): Removed 35 rows containing missing
## values (geom_point).

Now, let’s add a title to the plot.

scatterplot + geom_point(color = "darkred", size = 3, shape = 8) + ggtitle("Fertility rate vs. gross national income")

## Warning in loop_apply(n, do.ply): Removed 35 rows containing missing
## values (geom_point).

Now, let’s save our plot to a file. We can do this by first saving our plot to a variable.

fertilityGNIPlot <- scatterplot + geom_point(color = "darkred", size = 3, shape = 8) + ggtitle("Fertility rate vs. gross national income")

Now, let’s create a file we want to save our plot to. We can do that with the pdf function.

pdf("Myplot.pdf")
print(fertilityGNIPlot)

## Warning in loop_apply(n, do.ply): Removed 35 rows containing missing
## values (geom_point).

dev.off()

## png 
##   2
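
As an alternative to opening and closing the pdf device by hand, ggplot2 also provides ggsave(), which picks the file type from the extension. The sketch below is only an illustration: the data frame toy, the plot toy_plot, and the temporary output path are hypothetical names, and the width and height (in inches) are arbitrary choices.

```r
library(ggplot2)

# Hypothetical toy data and plot, used only to demonstrate saving.
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 6))
toy_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  ggtitle("Toy plot")

# Write to a temporary directory so nothing in the working directory is touched.
out_file <- file.path(tempdir(), "toy_plot.pdf")

# ggsave opens the device, prints the plot, and closes the device for us.
ggsave(out_file, plot = toy_plot, width = 7, height = 5)

print(file.exists(out_file))
```

With the plot from this section, the equivalent call would pass fertilityGNIPlot as the plot argument.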

Now, let’s color the points by region. Because Region is a factor, each region gets its own distinct color.

ggplot(WHO, aes(x = GNI, y = FertilityRate, color = Region)) + geom_point()

## Warning in loop_apply(n, do.ply): Removed 35 rows containing missing
## values (geom_point).

Let’s now instead color the points according to the country’s life expectancy. Note that life expectancy is a numerical variable, so ggplot uses a continuous color gradient instead of a separate color per category.

ggplot(WHO, aes(x = GNI, y = FertilityRate, color = LifeExpectancy)) + geom_point()

## Warning in loop_apply(n, do.ply): Removed 35 rows containing missing
## values (geom_point).
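
The plots so far use color in two different ways: as a fixed option inside geom_point(), and as a variable mapped inside aes(). A minimal sketch of the difference follows, using a made-up data frame toy with hypothetical columns group (a factor) and score (numeric). Note that ggplot2 stores the color aesthetic under the British spelling colour.

```r
library(ggplot2)

# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  x     = c(1, 2, 3, 4),
  y     = c(2, 4, 3, 5),
  group = factor(c("a", "a", "b", "b")),
  score = c(10, 20, 30, 40)
)

# Fixed color: set outside aes(), so every point is blue and there is no legend.
p_fixed <- ggplot(toy, aes(x = x, y = y)) + geom_point(color = "blue")

# Mapped to a factor: one discrete color per level.
p_discrete <- ggplot(toy, aes(x = x, y = y, color = group)) + geom_point()

# Mapped to a numeric variable: a continuous color gradient.
p_continuous <- ggplot(toy, aes(x = x, y = y, color = score)) + geom_point()

# Only the mapped versions carry a colour entry in the plot's aesthetic mapping.
print(names(p_fixed$mapping))
print(names(p_discrete$mapping))
```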

Let’s take a look at a different plot now. Suppose we were interested in seeing whether the fertility rate of a country was a good predictor of the percentage of the population under 15.

ggplot(WHO, aes(x = FertilityRate, y = Under15)) + geom_point()

## Warning in loop_apply(n, do.ply): Removed 11 rows containing missing
## values (geom_point).

It looks like the variables are certainly correlated, but as the fertility rate increases, Under15 increases more slowly. So this doesn’t really look like a linear relationship. We suspect that a log transformation of FertilityRate will give a more linear one.

ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point()

## Warning in loop_apply(n, do.ply): Removed 11 rows containing missing
## values (geom_point).

Now build a linear regression model

model <- lm(Under15 ~ log(FertilityRate), data = WHO)
summary(model)

## 
## Call:
## lm(formula = Under15 ~ log(FertilityRate), data = WHO)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3131  -1.7742   0.0446   1.7440   7.7174 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.6540     0.4478   17.09   <2e-16 ***
## log(FertilityRate)  22.0547     0.4175   52.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 181 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.9391, Adjusted R-squared:  0.9387 
## F-statistic:  2790 on 1 and 181 DF,  p-value: < 2.2e-16
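
To read the fitted model, it can help to plug the coefficients back in by hand. The sketch below copies the two estimates from the summary output above; the helper predict_under15 is a hypothetical name introduced here for illustration. Because the predictor is log(FertilityRate), doubling the fertility rate adds the slope times log(2) to the predicted percentage, regardless of the starting value.

```r
# Coefficients copied from the summary(model) output.
intercept <- 7.6540
slope     <- 22.0547

# Hypothetical helper: predicted percentage of the population under 15.
predict_under15 <- function(fertility_rate) {
  intercept + slope * log(fertility_rate)
}

# Prediction for a fertility rate of 2 children per woman (roughly 22.9).
print(predict_under15(2))

# Effect of doubling the fertility rate: slope * log(2), roughly 15.3 points.
doubling_effect <- slope * log(2)
print(doubling_effect)
```

With the fitted object itself, predict(model, newdata = data.frame(FertilityRate = 2)) should give the same prediction up to rounding of the printed coefficients.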

Add regression line to the plot

ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm")

## Warning in loop_apply(n, do.ply): Removed 11 rows containing missing
## values (stat_smooth).

## Warning in loop_apply(n, do.ply): Removed 11 rows containing missing
## values (geom_point).

By default, ggplot will draw a 95% confidence interval shaded around the line. We can change this by specifying options within the statistics layer.

ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", level = 0.99)

## Warning in loop_apply(n, do.ply): Removed 11 rows containing missing
## values (stat_smooth).

## Warning in loop_apply(n, do.ply): Removed 11 rows containing missing
## values (geom_point).
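
To see what the level option changes, the band can be computed directly with base R's predict() on a fitted lm. This is a sketch on simulated data: the data frame toy and its coefficients are made up, and it is not the computation ggplot runs internally, only the same kind of interval. A higher confidence level gives a wider band at every x value.

```r
# Simulated toy data (hypothetical, not the WHO values).
set.seed(1)
toy <- data.frame(x = seq(1, 10, length.out = 30))
toy$y <- 2 + 3 * log(toy$x) + rnorm(30, sd = 0.5)

fit <- lm(y ~ log(x), data = toy)

# A few x values at which to compare the two intervals.
new_x <- data.frame(x = c(2, 5, 8))

ci95 <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
ci99 <- predict(fit, newdata = new_x, interval = "confidence", level = 0.99)

# Width of each interval: upper bound minus lower bound.
width95 <- ci95[, "upr"] - ci95[, "lwr"]
width99 <- ci99[, "upr"] - ci99[, "lwr"]

print(cbind(width95, width99))
```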

We could instead take away the confidence interval altogether by deleting level = 0.99 and typing se = FALSE.

ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", se = FALSE)

## Warning in loop_apply(n, do.ply): Removed 11 rows containing missing
## values (stat_smooth).

## Warning in loop_apply(n, do.ply): Removed 11 rows containing missing
## values (geom_point).

We could also change the color of the regression line by adding the option color = "orange".

ggplot(WHO, aes(x = log(FertilityRate), y = Under15)) + geom_point() + stat_smooth(method = "lm", se = FALSE, color = "orange")

## Warning in loop_apply(n, do.ply): Removed 11 rows containing missing
## values (stat_smooth).

## Warning in loop_apply(n, do.ply): Removed 11 rows containing missing
## values (geom_point).