library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
crime <- read_csv('http://datasets.flowingdata.com/crimeRatesByState2005.csv')
## Parsed with column specification:
## cols(
## state = col_character(),
## murder = col_double(),
## forcible_rape = col_double(),
## robbery = col_double(),
## aggravated_assault = col_double(),
## burglary = col_double(),
## larceny_theft = col_double(),
## motor_vehicle_theft = col_double(),
## population = col_double()
## )
head(crime)
## # A tibble: 6 x 9
## state murder forcible_rape robbery aggravated_assa… burglary larceny_theft
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Unit… 5.6 31.7 141. 291. 727. 2286.
## 2 Alab… 8.2 34.3 141. 248. 954. 2650
## 3 Alas… 4.8 81.1 80.9 465. 622. 2599.
## 4 Ariz… 7.5 33.8 144. 327. 948. 2965.
## 5 Arka… 6.7 42.9 91.1 387. 1085. 2711.
## 6 Cali… 6.9 26 176. 317. 693. 1916.
## # … with 2 more variables: motor_vehicle_theft <dbl>, population <dbl>
# map values in data to X and Y axes
ggplot(crime, aes(murder, burglary))
The brackets after the ggplot function define the data frame to be used, followed by the aes mapping of variables in the data to the chart’s X and Y axes
# Change the theme
ggplot(crime, aes(x = murder, y = burglary)) +
xlab("Murder rates in each state per 100,000") +
ylab("Burglary rates in each state per 100,000") +
theme_minimal(base_size = 12)
p1 <- ggplot(crime, aes(x = murder, y = burglary)) +
xlab("Murder rates in each state per 100,000") +
ylab("Burglary rates in each state per 100,000") +
theme_minimal(base_size = 12)
p1 + geom_point()
crime2 <- crime[crime$state != "District of Columbia",]
crime2 <- crime2[crime2$state != "United States",]
p2 <- ggplot(crime2, aes(x = murder, y = burglary)) +
xlab("Murder rates in each state per 100,000") +
ylab("Burglary rates in each state per 100,000") +
theme_minimal(base_size = 12)
p2 + geom_point()
Fix the axes to start at 0
p3 <- p2 + xlim(0, 10) + ylim(0, 1200)
p3 + geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).
Add a smoother in red with confidence interval
p4 <- p3 + geom_point() + geom_smooth(color = 'red')
p4
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Add a linear regression with confidence interval
p5 <- p3 + geom_point() + geom_smooth(method='lm', formula=y~x)
p5
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Add a title, make the line dashed, and remove the confidence interval (the comman se=FALSE takes away the CI band)
p6 <- p3 + geom_point() + geom_smooth(method = 'lm', formula = y~x, se = FALSE, linetype = 'dotdash', size = 0.3) +
ggtitle("MURDERS VERSUS BURGLARIES IN THE U.S.")
p6
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
In the form, y=mx + b, we use the command, lm(y~x), meaning, fit the predictor variable x into the model to predict y. Look at the values of (Intercept) and murder. The column, Estimate gives the value you need in your linear model. The column for Pr(>|t|) describes whether the predictor is useful to the model. The more asterisks, the more the variable contributes to the model.
cor(crime2$burglary, crime2$murder)
## [1] 0.6231757
fit1 <- lm(burglary ~ murder, data = crime2)
summary(fit1)
##
## Call:
## lm(formula = burglary ~ murder, data = crime2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -385.38 -132.23 2.97 138.78 386.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 398.26 59.28 6.718 1.99e-08 ***
## murder 62.17 11.26 5.521 1.34e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 186.5 on 48 degrees of freedom
## Multiple R-squared: 0.3883, Adjusted R-squared: 0.3756
## F-statistic: 30.48 on 1 and 48 DF, p-value: 1.342e-06
Cor stands for “correlation”. This is a value between (inclusively) -1 and 1. The correlation coefficient tells how strong or weak the correlation is. Values closer to +/- 1 are strong correlation (the sign is determined by the linear slope), values close to +/- 0.5 are weak correlation, and values close to zero have no correlation. The model has the equation: burglary = 62.17(murder) + 398.26 The slope may be interpreted in the following: For each additional murder per 100,000, there is a predicted increase of 62.17 burglaries per 100,00. The p-values on the right both have 3 asterisks so they suggest the model is meaningful, but we also need to look at the Adjusted R-Squared value. It states that about 38% of the variation in the observations may be explained by the model. In other words, 62% of the variation in the data is likely not explained by this model.
##What about more variables? Can a model with more predictors also be used? What would we be trying to predict? By adding additional predictors, does it improve the model?
Check out the pairwise comparisons with density curves and correlation output
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(crime2, columns = 2:8)
Robbery appears to be correlated with murder, burglary to larceny_theft and aggravated_assault, and burglary seems correlated with forcible_rate and larceny_theft. Some of the strongest correlations are with burglary/larceny_theft and burglary/aggravated_assault
#Another method: Use a correlation plot to explore the correlation among all variables This correlation plot shows similar pairwise results as above, but in a heatmap of correlation values.
library(DataExplorer)
plot_correlation(crime)
## Warning in dummify(data, maxcat = maxcat): Ignored all discrete features since
## `maxcat` set to 20 categories!
##What are we really trying to predict? If we are trying to predict murder rates, then we can see if any of the predictor variables contribute to this model. Note the adjusted R-squared value is 58.66% The only variable that does not appear to be as significant as the others is motor_vehicle_theft. So drop that and re-run the model.
fit3 <- lm(murder~burglary + forcible_rape + aggravated_assault + larceny_theft, data = crime2)
summary(fit3)
##
## Call:
## lm(formula = murder ~ burglary + forcible_rape + aggravated_assault +
## larceny_theft, data = crime2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1805 -1.0081 -0.3100 0.8769 3.4506
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8415937 1.0779713 2.636 0.011468 *
## burglary 0.0055794 0.0014826 3.763 0.000483 ***
## forcible_rape -0.0471186 0.0220236 -2.139 0.037858 *
## aggravated_assault 0.0096735 0.0022395 4.320 8.51e-05 ***
## larceny_theft -0.0012679 0.0005938 -2.135 0.038211 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.55 on 45 degrees of freedom
## Multiple R-squared: 0.6061, Adjusted R-squared: 0.5711
## F-statistic: 17.31 on 4 and 45 DF, p-value: 1.153e-08
The residuals plot looks worse once we drop motor_vehicle_theft, AND the adjusted R-squared value dropped to 57.11%. Maybe one last exploration is to change what we are predicting
fit4 <- lm(burglary~murder + forcible_rape + aggravated_assault + larceny_theft + motor_vehicle_theft,
data = crime2)
summary(fit4)
##
## Call:
## lm(formula = burglary ~ murder + forcible_rape + aggravated_assault +
## larceny_theft + motor_vehicle_theft, data = crime2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -271.30 -79.31 -7.40 94.15 387.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -138.27831 102.16469 -1.353 0.18281
## murder 40.97562 12.08935 3.389 0.00149 **
## forcible_rape 2.19565 2.01915 1.087 0.28278
## aggravated_assault 0.12375 0.23536 0.526 0.60168
## larceny_theft 0.22136 0.04716 4.693 2.64e-05 ***
## motor_vehicle_theft 0.06020 0.11702 0.514 0.60955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 137 on 44 degrees of freedom
## Multiple R-squared: 0.6975, Adjusted R-squared: 0.6631
## F-statistic: 20.29 on 5 and 44 DF, p-value: 1.907e-10
##Interesting!! The adjusted R-squared went up to 66.31%. The residuals plot looks worse, but we can drop all predictors that are no longer significant. We are left with murder and larceny_theft. Run that model.
fit5 <- lm(burglary ~ murder + larceny_theft, data = crime2)
summary(fit5)
##
## Call:
## lm(formula = burglary ~ murder + larceny_theft, data = crime2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -297.93 -101.83 2.21 98.54 352.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -117.85224 90.22758 -1.306 0.198
## murder 47.15078 8.55929 5.509 1.48e-06 ***
## larceny_theft 0.25566 0.03919 6.524 4.31e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 136.6 on 47 degrees of freedom
## Multiple R-squared: 0.679, Adjusted R-squared: 0.6654
## F-statistic: 49.72 on 2 and 47 DF, p-value: 2.522e-12
The model looks good. The residuals plot shows observations 20, 33, and 46 have an effect on the residuals plot, and 33 appears to have high leverage with Cooks distance.
Maryland is 20 North Carolina is 33 Virginia is 46
It is interesting that they are all very close in proximity.
p3 +
geom_point(mapping = aes(murder, burglary, size = population), color = 'red') +
xlim(0, 10) + ylim(0, 1200) +
ggtitle("MURDERS VERSUS BURGLARIES IN THE U.S.", subtitle = "Sizes of circles are proportional to state populations")
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Warning: Removed 1 rows containing missing values (geom_point).
##Finally, add some interactivity to the plot with plotly
p <- ggplot(crime2, aes(x = murder, y = burglary, size = population, text = paste("state: ", state))) +
theme_minimal(base_size = 12) +
geom_point(alpha = 0.5, color = 'red') + xlim(0, 10) + ylim(0, 1200) +
ggtitle("MURDERS VERSUS BURGLARIES IN THE U.S.", subtitle = "Sizes of circles are proportional to state populations")
p <- ggplotly(p)
p
Now we will explore a series of other geom functions using the food stamps data.
# load data
food_stamps <- read_csv("food_stamps.csv")
## Parsed with column specification:
## cols(
## year = col_double(),
## participants = col_double(),
## costs = col_double()
## )
# save basic chart template
food_stamps_chart <- ggplot(food_stamps, aes(x = year, y = participants)) +
xlab("Year")+
ylab("Participants (millions)") +
theme_minimal(base_size = 14)
food_stamps_chart
food_stamps_chart +
geom_line()
food_stamps_chart +
geom_line(size = 1.5, color = 'red') +
ggtitle("Line chart")
food_stamps_chart +
geom_line() +
geom_point() +
ggtitle("Dot-and-line chart")
# Make a column chart
food_stamps_chart +
geom_bar(stat = 'identity') +
ggtitle("Column chart") +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank())
geom_bar works a little differently to the geoms we have considered previously. If you have not mapped data values to the Y axis with aes, its default behavior is to set the heights of the bars by counting the number of records for values along the X axis. If you have mapped a variable to the Y axis, and want the heights of the bars to represent values in the data, use you must use stat=“identity”.
# Make a bar chart
food_stamps_chart +
geom_bar(stat = "identity") +
ggtitle("Bar chart") +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()) +
coord_flip()
For some geoms, notably geom_bar, you can set color for their outline as well as the interior of the shape.
When setting colors, color refers to the outline, fill to the interior of the shape.
# set color and fill
food_stamps_chart +
geom_bar(stat = "identity", color = "#888888", fill = "#CCCCCC", alpha = 0.5) +
ggtitle("Column chart")
# fill the bars according to values for the cost of the program
food_stamps_chart +
geom_bar(stat = "identity", color = "white", aes(fill = costs))
This code uses an aes mapping to color the bars according values for the costs of the program, in billions of dollars. ggplot2 recognizes that costs is a continuous variable, but its default sequential scheme applies more intense blues to lower values, which is counterintuitive.
# use a colorbrewer sequential palette
food_stamps_chart +
geom_bar(stat = "identity", color = "#888888", aes(fill = costs)) +
scale_fill_distiller(name = "Cost\n($ billion)", palette = "Reds", direction = 1)
scale_fill_distiller (and scale_color_distiller) work like scale_color_brewer, but set color gradients for ColorBrewer’s sequential and diverging color palettes. Direction = 1 ensures that larger numbers are mapped to more intense colors (direction = -1 reverses the color mapping). Notice also the in the title for the legend. This introduces a new line.
This code uses the theme function to move the legend from its default position to the right of the chart to use some empty space on the chart itself
food_stamps_chart +
geom_bar(stat = "identity", color = "#888888", aes(fill = costs)) +
scale_fill_distiller(name = "Cost\n($billions)", palette = "Reds", direction = 1) +
theme(legend.position = c(0.15, 0.8))
The coordinates for the legend are given as a list: The first number sets the horizontal position, from left to right, on a scale from 0 to 1; the second number sets the vertical position, from bottom to top, again on a scale from 0 to 1.