Install and load the packages dslabs and babynames to use the datasets. You will also want to load ggplot2 and dplyr to manipulate and visualize the data.
library(dslabs)
library(babynames)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(cowplot)
Requirements:
For questions 1-5, answer using dplyr and ggplot2.
Include both the PDF output and the R Markdown file for your submission to Canvas.
The dataset murders includes gun murder data for US states in 2012. The dataset can be obtained via the package dslabs. You can read about the dataset murders before proceeding.
?murders
head(murders)
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
murders
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
## 7 Connecticut CT Northeast 3574097 97
## 8 Delaware DE South 897934 38
## 9 District of Columbia DC South 601723 99
## 10 Florida FL South 19687653 669
## 11 Georgia GA South 9920000 376
## 12 Hawaii HI West 1360301 7
## 13 Idaho ID West 1567582 12
## 14 Illinois IL North Central 12830632 364
## 15 Indiana IN North Central 6483802 142
## 16 Iowa IA North Central 3046355 21
## 17 Kansas KS North Central 2853118 63
## 18 Kentucky KY South 4339367 116
## 19 Louisiana LA South 4533372 351
## 20 Maine ME Northeast 1328361 11
## 21 Maryland MD South 5773552 293
## 22 Massachusetts MA Northeast 6547629 118
## 23 Michigan MI North Central 9883640 413
## 24 Minnesota MN North Central 5303925 53
## 25 Mississippi MS South 2967297 120
## 26 Missouri MO North Central 5988927 321
## 27 Montana MT West 989415 12
## 28 Nebraska NE North Central 1826341 32
## 29 Nevada NV West 2700551 84
## 30 New Hampshire NH Northeast 1316470 5
## 31 New Jersey NJ Northeast 8791894 246
## 32 New Mexico NM West 2059179 67
## 33 New York NY Northeast 19378102 517
## 34 North Carolina NC South 9535483 286
## 35 North Dakota ND North Central 672591 4
## 36 Ohio OH North Central 11536504 310
## 37 Oklahoma OK South 3751351 111
## 38 Oregon OR West 3831074 36
## 39 Pennsylvania PA Northeast 12702379 457
## 40 Rhode Island RI Northeast 1052567 16
## 41 South Carolina SC South 4625364 207
## 42 South Dakota SD North Central 814180 8
## 43 Tennessee TN South 6346105 219
## 44 Texas TX South 25145561 805
## 45 Utah UT West 2763885 22
## 46 Vermont VT Northeast 625741 2
## 47 Virginia VA South 8001024 250
## 48 Washington WA West 6724540 93
## 49 West Virginia WV South 1852994 27
## 50 Wisconsin WI North Central 5686986 97
## 51 Wyoming WY West 563626 5
Construct a graph that summarizes the US murders dataset (murders). You may want to explore scaling the data or graphing logs of the data.
Requirements: In addition to displaying and labeling two quantitative variables on the x and y-axis, represent at least one more variable using shapes and/or colors of points. Give your graph a title. Also, label the (x, y) coordinates. After your graph, include a write-up describing the variables you chose, the way you manipulated the data, and describe your conclusion based on the visualization.
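# Scatter plot of total gun murders vs. population, colored by region and labeled by state abbreviation
murders %>% ggplot(aes(x = population, y = total, color = region, label = abb)) +
geom_point() +
geom_label() +
ggtitle("US Gun Murder Data") +
xlab("State Population") +
ylab("Total Murders") +
# The plot maps region to color (not fill), so the legend title must be set on color
labs(color = "Region")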
I chose to plot state population against total gun murders, using color to show region as a third variable. I did not transform the data; I put it directly into a ggplot2 scatter plot with state population on the x-axis, total murders on the y-axis, and points colored by region, and as a bonus I labeled each point with the state abbreviation. My conclusion is that total murders rise with state population: states with more people have more murders. California has the largest population in the dataset and Wyoming the smallest.
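The prompt suggests exploring scaled or logged data. The following is only a sketch of one way that could look, assuming the same dslabs, dplyr, and ggplot2 packages; the names `murders_rate`, `rate`, and `p_log` are introduced here for illustration and are not part of the original answer.

```r
library(dslabs)
library(dplyr)
library(ggplot2)

# Hypothetical helper column: gun murders per 100,000 residents
murders_rate <- murders %>%
  mutate(rate = total / population * 1e5)

# Same scatter plot as before, but with both axes on a log10 scale so that
# small states are not crowded into the lower-left corner
p_log <- murders_rate %>%
  ggplot(aes(x = population, y = total, color = region, label = abb)) +
  geom_point() +
  geom_text(nudge_y = 0.05, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "US Gun Murders vs. Population (log scale)",
       x = "State population (log10 scale)",
       y = "Total gun murders (log10 scale)",
       color = "Region")
p_log
```

On log axes the points spread out more evenly, and the `rate` column gives a population-adjusted measure that could be plotted or ranked instead of the raw totals.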
Use the data mtcars.
?mtcars
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Create a side-by-side bar chart.
Requirements: side-by-side bar chart representing two different categories of the measured variable. One of the categories should be created using dplyr based on either the type of car or the mpg variable. Add coloring, a key, and labels to the axes. Describe the visualization, including the variables you selected, the way you split up the data, and any conclusions based on your graphs.
df1 = mtcars %>%
group_by(am, cyl) %>%
filter(am == 0) %>%
summarize(xbar1 = mean(mpg))
## `summarise()` has grouped output by 'am'. You can override using the `.groups` argument.
df2 = mtcars %>%
group_by(am, cyl) %>%
filter(am == 1) %>%
summarize(xbar2 = mean(mpg))
## `summarise()` has grouped output by 'am'. You can override using the `.groups` argument.
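# Combine the automatic (xbar1) and manual (xbar2) averages; rows line up by cyl (4, 6, 8)
dfc = data.frame(df1, xbar2 = df2$xbar2)
i <- plot_ly(data = dfc, x = ~cyl, y = ~xbar1, type = 'bar', name = 'Automatic Transmission')
i <- i %>% add_trace(y = ~xbar2, name = 'Manual Transmission')
# Label both axes and place the two sets of bars side by side
i <- i %>% layout(xaxis = list(title = 'Number of Cylinders'), yaxis = list(title = 'Avg MPG'), barmode = 'group')
i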
I wanted to look at the average miles per gallon of a car based on its number of cylinders (4, 6, or 8) and its type of transmission (automatic or manual). I split the data into automatic and manual transmission, and then graphed the number of cylinders on the x-axis and average mpg on the y-axis. In this data, cars with manual transmission had a higher average mpg than cars with automatic transmission at every cylinder count.
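Because the requirements for this question mention dplyr and ggplot2, here is a hedged sketch of how the same comparison could be drawn as a ggplot2 grouped bar chart. The names `mpg_by_group`, `transmission`, `avg_mpg`, and `p_bar` are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(ggplot2)

# Create the transmission category with dplyr, then average mpg
# for each combination of cylinder count and transmission type
mpg_by_group <- mtcars %>%
  mutate(transmission = if_else(am == 0, "Automatic", "Manual")) %>%
  group_by(cyl, transmission) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop")

# position = "dodge" places the bars for each transmission side by side
p_bar <- ggplot(mpg_by_group,
                aes(x = factor(cyl), y = avg_mpg, fill = transmission)) +
  geom_col(position = "dodge") +
  labs(title = "Average MPG by Cylinders and Transmission",
       x = "Number of cylinders",
       y = "Average mpg",
       fill = "Transmission")
p_bar
```

Doing the grouping in a single summarise() call avoids building two separate data frames and binding them by row position.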
The dataset babynames contains data provided by the SSA. This includes all names with at least 5 uses. This data frame has five variables: year, sex, name, n and prop (n divided by total number of applicants in that year, which means proportions are of people of that gender with that name born in that year).
?babynames
head(babynames)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
tail(babynames)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2017 M Zyhier 5 0.00000255
## 2 2017 M Zykai 5 0.00000255
## 3 2017 M Zykeem 5 0.00000255
## 4 2017 M Zylin 5 0.00000255
## 5 2017 M Zylis 5 0.00000255
## 6 2017 M Zyrie 5 0.00000255
Create side-by-side plots visualizing this data using ggplot2.
Requirements: Find the 5 most popular baby boy names and the 5 most popular baby girl names in the data. Then, plot two side-by-side line plots of the number of babies given each of those names in the data. Plot year on the x-axis and count on the y-axis. Make the lines depend on the baby name. You should also make each of the lines a distinct color, with a legend relating each name to its color. Hint: You may wish to use the function plot_grid() in package cowplot to combine two graphs in the same image.
Include a write up after the visualization including the variables selected and why, the way you manipulated the data, and any conclusions based on the exercise.
babies = babynames %>%
group_by(sex, name) %>%
summarise(n = sum(n))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
baby = babies %>%
group_by(sex) %>%
slice_max(order_by = n, n = 5)
baby
## # A tibble: 10 x 3
## # Groups: sex [2]
## sex name n
## <chr> <chr> <int>
## 1 F Mary 4123200
## 2 F Elizabeth 1629679
## 3 F Patricia 1571692
## 4 F Jennifer 1466281
## 5 F Linda 1452249
## 6 M James 5150472
## 7 M John 5115466
## 8 M Robert 4814815
## 9 M Michael 4350824
## 10 M William 4102604
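# Use %in% (not ==) to keep every row whose name is one of the five;
# == would recycle the vector of names and silently drop most matching rows
boys = babynames %>%
filter(sex == "M", name %in% c("James", "John", "Robert", "Michael", "William"))
girls = babynames %>%
filter(sex == "F", name %in% c("Mary", "Elizabeth", "Patricia", "Jennifer", "Linda"))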
library(cowplot)
boiiiiii = ggplot(data = boys, aes(x = year, y = n, col = name)) +
geom_line()
guuuuurrrrrl = ggplot(data = girls, aes(x = year, y = n, col = name)) +
geom_line()
boiiiiii
guuuuurrrrrl
plot_grid(boiiiiii, guuuuurrrrrl, labels = "AUTO")
Since we want the top five baby names for each sex, I first used dplyr to find the five names with the highest all-time totals for each sex. I then filtered the data to those names, separately by sex, and plotted each set with ggplot2. I show each graph on its own before combining them into a side-by-side plot so the data is easier to read. For boys, all of the top five names were given to many babies at the same time, regardless of the year. The girls' names differ, as specific names were more popular at different times: Mary was the most popular until shortly after 1940, when Linda took over as the most popular. More recently, Jennifer was the most popular of these five names.
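When filtering rows against several names, the choice between `==` and `%in%` matters. The small base R sketch below uses made-up vectors (`name` and `top` are hypothetical) to show how `==` recycles the shorter vector and compares element by element, while `%in%` tests membership.

```r
# Toy data: six rows, two "top" names of interest
name <- c("James", "John", "Robert", "James", "John", "Robert")
top  <- c("James", "John")

# == recycles `top` to James, John, James, John, James, John and compares
# position by position, so matches depend on row order
matches_eq <- name == top

# %in% asks, for each element of `name`, whether it appears anywhere in `top`
matches_in <- name %in% top

sum(matches_eq)  # rows kept by ==
sum(matches_in)  # rows kept by %in%
```

Here `==` keeps only two of the four rows that hold a top name, which is why membership filters are usually written with `%in%`.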
Use the starwars data from package dplyr.
?starwars
head(starwars)
## # A tibble: 6 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Dart… 202 136 none white yellow 41.9 male mascu…
## 5 Leia… 150 49 brown light brown 19 fema… femin…
## 6 Owen… 178 120 brown, gr… light blue 52 male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Visualize one characteristic of Star Wars characters using at least 3 density graphs plotted on the same axis. Be sure to include titles, labels, and a key. Your data should be manipulated using dplyr and the graphs should be formulated using ggplot2. Following your graph, create a write-up on which variables you chose and why, and any conclusions that you have come to based on your analysis.
table(starwars$species)
##
## Aleena Besalisk Cerean Chagrian Clawdite
## 1 1 1 1 1
## Droid Dug Ewok Geonosian Gungan
## 6 1 1 1 3
## Human Hutt Iktotchi Kaleesh Kaminoan
## 35 1 1 1 2
## Kel Dor Mirialan Mon Calamari Muun Nautolan
## 1 2 1 1 1
## Neimodian Pau'an Quermian Rodian Skakoan
## 1 1 1 1 1
## Sullustan Tholothian Togruta Toong Toydarian
## 1 1 1 1 1
## Trandoshan Twi'lek Vulptereen Wookiee Xexto
## 1 2 1 2 1
## Yoda's species Zabrak
## 1 2
z = starwars %>%
mutate(cat = case_when(
species == "Human" ~ "Human",
species == "Droid" ~ "Droid",
TRUE ~ "Other"))
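ggplot(data = z, aes(x = height, fill = cat)) +
# alpha makes the overlapping density curves semi-transparent so all three stay visible
geom_density(alpha = 0.5) +
ggtitle("Density Graph of Height by Star Wars Species") +
xlab("Height") +
ylab("Density") +
labs(fill = "Species")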
## Warning: Removed 6 rows containing non-finite values (stat_density).
The characteristic I chose to visualize was height by species. There are three species categories: Human, Droid, and Other. Human and Droid had the most entries (as seen in the table before the graph), which is why I gave each of them its own category and combined all the others. After creating the category variable with mutate(), I plotted it with ggplot2. In conclusion, droids aren’t very tall, most of the humans are about the same height, and because Other contains so many different species, its density curve spans the entire x-axis (although there is a definite peak at about the same height as the peak for humans).
Consider the divorce_margarine data set to see if there is a linear relationship between divorce rates in Maine and per capita consumption of margarine.
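A numeric summary can back up what the density curves suggest. This is a sketch only, assuming the same three-way grouping; `height_summary` and its column names are illustrative, and rows with missing height are dropped just as the density plot drops them.

```r
library(dplyr)

# Rebuild the three categories, drop missing heights, and summarise each group
height_summary <- starwars %>%
  mutate(cat = case_when(
    species == "Human" ~ "Human",
    species == "Droid" ~ "Droid",
    TRUE ~ "Other")) %>%          # NA species also fall into "Other"
  filter(!is.na(height)) %>%
  group_by(cat) %>%
  summarise(n = n(),
            median_height = median(height),
            min_height = min(height),
            max_height = max(height),
            .groups = "drop")
height_summary
```

Comparing the medians and ranges across the three rows gives a quick check on the shapes seen in the plot.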
?divorce_margarine
head(divorce_margarine)
## divorce_rate_maine margarine_consumption_per_capita year
## 1 5.0 8.2 2000
## 2 4.7 7.0 2001
## 3 4.6 6.5 2002
## 4 4.4 5.3 2003
## 5 4.3 5.2 2004
## 6 4.1 4.0 2005
Find the least squares regression equation, using margarine_consumption_per_capita as the x variable and divorce_rate_maine as the y variable. Be sure to also find the p-value for the slope component and the correlation between the two variables. Then, using ggplot2, plot the points of each (x, y) coordinate pair, color-coded based on year. Then overlay the least squares regression equation. Include confidence bands. After your graph, include a write-up that explains the variables selected, methodology for setting up the data, and any conclusions based on the analysis.
data=divorce_margarine
head(data)
## divorce_rate_maine margarine_consumption_per_capita year
## 1 5.0 8.2 2000
## 2 4.7 7.0 2001
## 3 4.6 6.5 2002
## 4 4.4 5.3 2003
## 5 4.3 5.2 2004
## 6 4.1 4.0 2005
y=data$divorce_rate_maine
x=as.matrix(data$margarine_consumption_per_capita)
l=lm(y~x,data=data)
l
##
## Call:
## lm(formula = y ~ x, data = data)
##
## Coefficients:
## (Intercept) x
## 3.3086 0.2014
summary(l)
##
## Call:
## lm(formula = y ~ x, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.05583 -0.01816 -0.01452 0.03601 0.04625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.308626 0.048032 68.88 2.20e-12 ***
## x 0.201386 0.008735 23.05 1.33e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03841 on 8 degrees of freedom
## Multiple R-squared: 0.9852, Adjusted R-squared: 0.9833
## F-statistic: 531.5 on 1 and 8 DF, p-value: 1.33e-08
#Correlation between two variables
cor(y,x)
## [,1]
## [1,] 0.9925585
#Scatter plot for two variables
ggplot(data, aes(x=x, y=y))+geom_point()+geom_text(label=(data$year))
#adding regression line
ggplot(data, aes(x=x, y=y))+geom_point()+geom_smooth(method=lm,se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
#with confidence interval
ggplot(data, aes(x=x, y=y))+geom_point()+geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
I used the variables specified in the prompt (margarine_consumption_per_capita as the x variable and divorce_rate_maine as the y variable) and fit a linear regression model with them. Then I printed the model summary and computed the correlation between the two variables. Next I plotted the data, overlaid the least squares regression line, and added the confidence band. The fitted equation is divorce_rate_maine = 3.3086 + 0.2014 × margarine_consumption_per_capita; the p-value for the slope is 1.33e-08 and the correlation is about 0.993. According to the data, there is a strong positive linear relationship between margarine consumption and the divorce rate in Maine: years with higher per capita margarine consumption also had higher divorce rates. This is an association only and does not show that margarine consumption causes divorce.
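The prompt also asks for points color-coded by year with the regression line and confidence band overlaid. One possible way to put all of that in a single plot is sketched below, fitting the model on the named columns rather than on separate x and y vectors. The object names `fit`, `slope_p`, `r`, and `p_fit` are illustrative.

```r
library(dslabs)
library(ggplot2)

# Least squares fit using the column names directly
fit <- lm(divorce_rate_maine ~ margarine_consumption_per_capita,
          data = divorce_margarine)

# p-value for the slope and the correlation between the two variables
slope_p <- coef(summary(fit))["margarine_consumption_per_capita", "Pr(>|t|)"]
r <- cor(divorce_margarine$margarine_consumption_per_capita,
         divorce_margarine$divorce_rate_maine)

# Points colored by year, with the fitted line and its confidence band (se = TRUE)
p_fit <- ggplot(divorce_margarine,
                aes(x = margarine_consumption_per_capita,
                    y = divorce_rate_maine)) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "black") +
  geom_point(aes(color = year), size = 3) +
  labs(title = "Divorce Rate in Maine vs. Margarine Consumption",
       x = "Margarine consumption per capita",
       y = "Divorce rate in Maine",
       color = "Year")
p_fit
```

Mapping `year` to color only inside geom_point() keeps geom_smooth() fitting one line through all the points.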
Using a new dataset (either base R or imported), create a plot using plotly that visualizes at least 3 variables. Include a write-up on where the data is from, what the data is showing, and why you chose this particular graph and variables. Describe the process you used to manipulate the data and create the visualization.
cars <- plot_ly(mtcars, x = ~wt, y = ~hp, z = ~qsec,
marker = list(color = ~mpg, colorscale = c('#FFE1A1', '#683531'), showscale = TRUE))
cars <- cars %>% add_markers()
cars <- cars %>%
layout(scene = list(xaxis = list(title = 'Weight'),
yaxis = list(title = 'Gross horsepower'),
zaxis = list(title = '1/4 mile time')),
annotations = list(
x = 1.13,
y = 1.05,
text = 'Miles/(US) gallon',
xref = 'paper',
yref = 'paper',
showarrow = FALSE
))
cars
I wanted to see the three-dimensional relationship between the weight of a vehicle, its gross horsepower, and its quarter-mile time. Using a 3D scatter plot, I plotted these three variables and colored the points by the vehicle's mpg: the darker the dot, the higher the mpg. To summarize the results, the heavier the car, the lower its mpg, while its horsepower and quarter-mile time are about average. The lighter the car, the higher its mpg and the lower its horsepower. Note that qsec is a time, so the generally larger qsec values of the light, low-horsepower cars in the mtcars listing mean slower, not faster, quarter-mile runs.
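The pairwise relationships described in the write-up can be checked numerically with a correlation matrix. This is a minimal base R sketch; `vars` and `cor_mat` are names chosen here for illustration.

```r
# Keep only the four variables shown in the 3D plot
vars <- mtcars[, c("wt", "hp", "qsec", "mpg")]

# Pearson correlations, rounded for readability.
# qsec is a time in seconds, so a negative correlation with hp means
# that cars with more horsepower tend to finish the quarter mile sooner.
cor_mat <- round(cor(vars), 2)
cor_mat
```

The sign of each entry shows the direction of a relationship and its magnitude shows the strength, which helps confirm or correct what the eye picks up from the rotating 3D view.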