Lab 3

Install and load the packages dslabs and babynames to use the datasets. You will also want to load ggplot2 and dplyr to manipulate and visualize the data.

library(dslabs)
library(babynames)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(cowplot)

Requirements:

For questions 1-5, answer using dplyr and ggplot2.
Include both the PDF output and the R Markdown file in your submission to Canvas.

Question 1

The dataset murders includes gun murder data for US states in 2012. The dataset can be obtained via the package dslabs. You can read about the dataset murders before proceeding.

?murders
head(murders)

##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

murders

##                   state abb        region population total
## 1               Alabama  AL         South    4779736   135
## 2                Alaska  AK          West     710231    19
## 3               Arizona  AZ          West    6392017   232
## 4              Arkansas  AR         South    2915918    93
## 5            California  CA          West   37253956  1257
## 6              Colorado  CO          West    5029196    65
## 7           Connecticut  CT     Northeast    3574097    97
## 8              Delaware  DE         South     897934    38
## 9  District of Columbia  DC         South     601723    99
## 10              Florida  FL         South   19687653   669
## 11              Georgia  GA         South    9920000   376
## 12               Hawaii  HI          West    1360301     7
## 13                Idaho  ID          West    1567582    12
## 14             Illinois  IL North Central   12830632   364
## 15              Indiana  IN North Central    6483802   142
## 16                 Iowa  IA North Central    3046355    21
## 17               Kansas  KS North Central    2853118    63
## 18             Kentucky  KY         South    4339367   116
## 19            Louisiana  LA         South    4533372   351
## 20                Maine  ME     Northeast    1328361    11
## 21             Maryland  MD         South    5773552   293
## 22        Massachusetts  MA     Northeast    6547629   118
## 23             Michigan  MI North Central    9883640   413
## 24            Minnesota  MN North Central    5303925    53
## 25          Mississippi  MS         South    2967297   120
## 26             Missouri  MO North Central    5988927   321
## 27              Montana  MT          West     989415    12
## 28             Nebraska  NE North Central    1826341    32
## 29               Nevada  NV          West    2700551    84
## 30        New Hampshire  NH     Northeast    1316470     5
## 31           New Jersey  NJ     Northeast    8791894   246
## 32           New Mexico  NM          West    2059179    67
## 33             New York  NY     Northeast   19378102   517
## 34       North Carolina  NC         South    9535483   286
## 35         North Dakota  ND North Central     672591     4
## 36                 Ohio  OH North Central   11536504   310
## 37             Oklahoma  OK         South    3751351   111
## 38               Oregon  OR          West    3831074    36
## 39         Pennsylvania  PA     Northeast   12702379   457
## 40         Rhode Island  RI     Northeast    1052567    16
## 41       South Carolina  SC         South    4625364   207
## 42         South Dakota  SD North Central     814180     8
## 43            Tennessee  TN         South    6346105   219
## 44                Texas  TX         South   25145561   805
## 45                 Utah  UT          West    2763885    22
## 46              Vermont  VT     Northeast     625741     2
## 47             Virginia  VA         South    8001024   250
## 48           Washington  WA          West    6724540    93
## 49        West Virginia  WV         South    1852994    27
## 50            Wisconsin  WI North Central    5686986    97
## 51              Wyoming  WY          West     563626     5

Construct a graph that summarizes the US murders dataset (murders). You may want to explore scaling the data or graphing logs of the data.

Requirements: In addition to displaying and labeling two quantitative variables on the x and y-axis, represent at least one more variable using shapes and/or colors of points. Give your graph a title. Also, label the (x, y) coordinates. After your graph, include a write-up describing the variables you chose, the way you manipulated the data, and describe your conclusion based on the visualization.

murders %>% ggplot(aes(x = population, y = total, color = region, label = abb)) +
  geom_point() +
  geom_label() +
  ggtitle("US Gun Murder Data") +
  xlab("State Population") +
  ylab("Total Murders") +
  labs(color = "Region")

I chose to plot state population against total gun murders, using color to show region as a third variable. I did not transform the data; I put it in a ggplot scatter plot with state population on the x-axis, total murders on the y-axis, and region as the color, and, as an added bonus, labeled each point with its state abbreviation. My conclusion is that states with larger populations tend to have more total murders. California has the largest population in the dataset and Wyoming the smallest.
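The prompt suggests exploring scaled or logged data. As a hedged sketch, not part of the answer above, one option is to convert totals to a rate per 100,000 residents and to put both axes on a log scale. The names `murders_rate`, `rate`, and `p_rate` are illustrative choices, and the code assumes the dslabs package is installed.

```r
library(dslabs)
library(dplyr)
library(ggplot2)

# Murders per 100,000 residents, so large and small states are comparable.
murders_rate <- murders %>%
  mutate(rate = total / population * 1e5)

# Log10 axes spread out the many smaller states bunched near the origin.
p_rate <- ggplot(murders_rate,
                 aes(x = population, y = total, color = region, label = abb)) +
  geom_point() +
  geom_text(vjust = -0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "US Gun Murders (log scale)",
       x = "State Population (log10)",
       y = "Total Murders (log10)",
       color = "Region")

p_rate
```

Sorting `murders_rate` by `rate` rather than `total` gives a ranking that is not driven by population size alone.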

Question 2

Use the data mtcars.

?mtcars
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

mtcars

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Create a side-by-side bar chart.

Requirements: side-by-side bar chart representing two different categories of the measured variable. One of the categories should be created using dplyr based on either the type of car or the mpg variable. Add coloring, a key, and labels to the axes. Describe the visualization, including the variables you selected, the way you split up the data, and any conclusions based on your graphs.

df1 = mtcars %>% 
  group_by(am, cyl) %>%
  filter(am == 0) %>%
  summarize(xbar1 = mean(mpg))

## `summarise()` has grouped output by 'am'. You can override using the `.groups` argument.

df2 = mtcars %>% 
  group_by(am, cyl) %>%
  filter(am == 1) %>%
  summarize(xbar2 = mean(mpg))

## `summarise()` has grouped output by 'am'. You can override using the `.groups` argument.

dfc = data.frame(df1, xbar2 = df2$xbar2)

i <- plot_ly(data = dfc, x = ~cyl, y = ~xbar1, type = 'bar', name = 'Automatic Transmission')
i <- i %>% add_trace(y = ~xbar2, name = 'Manual Transmission')
i <- i %>% layout(xaxis = list(title = 'Number of Cylinders'), yaxis = list(title = 'Avg MPG'), barmode = 'group')
i

I wanted to compare the average miles per gallon of cars by number of cylinders (4, 6, or 8) and by transmission type (automatic or manual). I split the data into automatic and manual transmissions, computed the mean mpg for each cylinder count, and graphed the number of cylinders on the x-axis and average mpg on the y-axis. In this dataset, manual-transmission cars had a higher average mpg than automatic-transmission cars at every cylinder count.
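Since the requirements ask for dplyr and ggplot2, the same comparison can also be sketched without plotly. This is one possible version, assuming the `am` coding documented in `?mtcars` (0 = automatic, 1 = manual). The names `mpg_summary`, `transmission`, `avg_mpg`, and `p_bar` are introduced here for illustration.

```r
library(dplyr)
library(ggplot2)

# One pipeline: label the transmission type, then average mpg per group.
mpg_summary <- mtcars %>%
  mutate(transmission = if_else(am == 0, "Automatic", "Manual")) %>%
  group_by(cyl, transmission) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop")

# position = "dodge" places the two transmission bars side by side.
p_bar <- ggplot(mpg_summary,
                aes(x = factor(cyl), y = avg_mpg, fill = transmission)) +
  geom_col(position = "dodge") +
  labs(title = "Average MPG by Cylinders and Transmission",
       x = "Number of Cylinders",
       y = "Average MPG",
       fill = "Transmission")

p_bar
```

Keeping both transmission types in one long data frame avoids binding two separately filtered summaries by row position.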

Question 3

The dataset babynames contains data provided by the SSA. This includes all names with at least 5 uses. This data frame has five variables: year, sex, name, n and prop (n divided by total number of applicants in that year, which means proportions are of people of that gender with that name born in that year).

?babynames
head(babynames)

## # A tibble: 6 x 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162

tail(babynames)

## # A tibble: 6 x 5
##    year sex   name       n       prop
##   <dbl> <chr> <chr>  <int>      <dbl>
## 1  2017 M     Zyhier     5 0.00000255
## 2  2017 M     Zykai      5 0.00000255
## 3  2017 M     Zykeem     5 0.00000255
## 4  2017 M     Zylin      5 0.00000255
## 5  2017 M     Zylis      5 0.00000255
## 6  2017 M     Zyrie      5 0.00000255

Create side-by-side plots visualizing this data using ggplot2.

Requirements: Find the 5 most popular baby boy names and the 5 most popular baby girl names in the data. Then, plot two side-by-side line plots of the number of babies given each of those names in the data. Plot year on the x-axis and count on the y-axis. Make the lines depend on the baby name. You should also make each of the lines a distinct color with a legend to relate the name to the unique colors. Hint: You may wish to use the function plot_grid() in the package cowplot to combine two graphs in the same image.

Include a write up after the visualization including the variables selected and why, the way you manipulated the data, and any conclusions based on the exercise.

babies = babynames %>%
  group_by(sex, name) %>%
  summarise(n = sum(n))

## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.

baby = babies %>%
  group_by(sex) %>%
  slice_max(order_by = n, n = 5)

baby

## # A tibble: 10 x 3
## # Groups:   sex [2]
##    sex   name            n
##    <chr> <chr>       <int>
##  1 F     Mary      4123200
##  2 F     Elizabeth 1629679
##  3 F     Patricia  1571692
##  4 F     Jennifer  1466281
##  5 F     Linda     1452249
##  6 M     James     5150472
##  7 M     John      5115466
##  8 M     Robert    4814815
##  9 M     Michael   4350824
## 10 M     William   4102604

boys = babynames %>%
  filter(sex == "M", name %in% c("James", "John", "Robert", "Michael", "William"))

girls = babynames %>%
  filter(sex == "F", name %in% c("Mary", "Elizabeth", "Patricia", "Jennifer", "Linda"))
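
A note on filtering by several names: `==` compares element by element and recycles the shorter vector, while `%in%` tests membership in a set. A small base R sketch with made-up vectors (`names_seen` and `top` are hypothetical) shows the difference.

```r
# Hypothetical toy vectors, not taken from babynames.
names_seen <- c("James", "John", "Robert", "James", "John", "Robert")
top <- c("John", "James")

# `==` recycles `top` to John, James, John, James, John, James and
# compares position by position, so matches depend on row order.
eq_match <- names_seen == top

# `%in%` asks, for each element, whether it appears anywhere in `top`.
in_match <- names_seen %in% top

sum(eq_match)  # number of positional matches
sum(in_match)  # number of rows whose name is in the set
```

Inside `filter()`, the positional version silently drops rows whose name is in the list but does not line up with the recycled vector.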

library(cowplot)

boiiiiii = ggplot(data = boys, aes(x = year, y = n, col = name)) +
  geom_line()
guuuuurrrrrl = ggplot(data = girls, aes(x = year, y = n, col = name)) +
  geom_line()

boiiiiii

guuuuurrrrrl

plot_grid(boiiiiii, guuuuurrrrrl, labels = "AUTO")

Since we want the top five baby names for each sex, I first used dplyr to total the counts for every name across all years and kept the five largest totals for each sex. I then filtered the data to those names, separated by sex, and plotted each group with ggplot. I displayed each graph on its own before combining them into a side-by-side plot so it’s easier to read the data. For boys, all five names appear to be widely used at the same time, regardless of the year. That differs from the girls, where specific names were more popular at different times: Mary was the most popular until shortly after 1940, when Linda took over as most popular. More recently, Jennifer was the most popular of the five.
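The hard-coded name lists can also be derived from the data. The sketch below is one way to do that, assuming the babynames package is installed. The names `top_names`, `top_series`, and `p_names` are illustrative, and `facet_wrap()` is shown as an alternative to `plot_grid()` for the side-by-side layout.

```r
library(babynames)
library(dplyr)
library(ggplot2)

# Total count per (sex, name) over all years, then the five largest per sex.
top_names <- babynames %>%
  group_by(sex, name) %>%
  summarise(total = sum(n), .groups = "drop") %>%
  group_by(sex) %>%
  slice_max(order_by = total, n = 5) %>%
  ungroup()

# Keep only the yearly rows for those (sex, name) pairs.
top_series <- babynames %>%
  semi_join(top_names, by = c("sex", "name"))

# One panel per sex; free y scales because the counts differ by sex.
p_names <- ggplot(top_series, aes(x = year, y = n, color = name)) +
  geom_line() +
  facet_wrap(~ sex, scales = "free_y") +
  labs(title = "Five Most Popular Names by Sex",
       x = "Year",
       y = "Number of Babies",
       color = "Name")

p_names
```

Joining on both `sex` and `name` keeps, for example, the girls named James out of the boys' series.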

Question 4

Use the starwars data from package dplyr.

?starwars
head(starwars)

## # A tibble: 6 x 14
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke…    172    77 blond      fair       blue            19   male  mascu…
## 2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu…
## 3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu…
## 4 Dart…    202   136 none       white      yellow          41.9 male  mascu…
## 5 Leia…    150    49 brown      light      brown           19   fema… femin…
## 6 Owen…    178   120 brown, gr… light      blue            52   male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Visualize one characteristic of Star Wars characters using at least 3 density graphs plotted on the same axis. Be sure to include titles, labels, and a key. Your data should be manipulated using dplyr and the graphs should be formulated using ggplot2. Following your graph, create a write-up on which variables you chose and why, and any conclusions that you have come to based on your analysis.

table(starwars$species)

## 
##         Aleena       Besalisk         Cerean       Chagrian       Clawdite 
##              1              1              1              1              1 
##          Droid            Dug           Ewok      Geonosian         Gungan 
##              6              1              1              1              3 
##          Human           Hutt       Iktotchi        Kaleesh       Kaminoan 
##             35              1              1              1              2 
##        Kel Dor       Mirialan   Mon Calamari           Muun       Nautolan 
##              1              2              1              1              1 
##      Neimodian         Pau'an       Quermian         Rodian        Skakoan 
##              1              1              1              1              1 
##      Sullustan     Tholothian        Togruta          Toong      Toydarian 
##              1              1              1              1              1 
##     Trandoshan        Twi'lek     Vulptereen        Wookiee          Xexto 
##              1              2              1              2              1 
## Yoda's species         Zabrak 
##              1              2

z = starwars %>%
  mutate(cat = case_when(
    species == "Human" ~ "Human",
    species == "Droid" ~ "Droid",
    TRUE ~ "Other")) 

ggplot(data = z, aes(x = height, fill = cat)) +
  geom_density(alpha = 0.5) +
  ggtitle("Density Graph of Height by Star Wars Species") +
  xlab("Height (cm)") +
  ylab("Density") +
  labs(fill = "Species")

## Warning: Removed 6 rows containing non-finite values (stat_density).

The characteristic I chose to visualize was height by species. There are three species categories: Human, Droid, and Other. Human and Droid had the most entries (as seen in the table before the graph), which is why I gave each of those its own category and combined all the others. After creating the category variable with mutate, I plotted it with ggplot. In conclusion, droids aren’t very tall, most of the humans are about the same height, and since Other contains so many different species, its density curve spans the entire x-axis (although there is a definite peak at about the same height as the peak for the humans).
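The visual conclusions can be cross-checked with a small summary table. This is a sketch only, using the same three categories; `height_by_cat`, `n_missing_height`, and `median_height` are names introduced here, not part of the original answer.

```r
library(dplyr)

# Group size, missing heights, and median height for each category.
height_by_cat <- starwars %>%
  mutate(cat = case_when(
    species == "Human" ~ "Human",
    species == "Droid" ~ "Droid",
    TRUE ~ "Other")) %>%
  group_by(cat) %>%
  summarise(n = n(),
            n_missing_height = sum(is.na(height)),
            median_height = median(height, na.rm = TRUE),
            .groups = "drop")

height_by_cat
```

The `n_missing_height` column accounts for the rows that `geom_density()` drops with a warning, and characters with a missing species fall into "Other" through the `TRUE` branch.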

Question 5

Consider the divorce_margarine data set to see if there is a linear relationship between divorce rates in Maine and per capita consumption of margarine.

?divorce_margarine
head(divorce_margarine)

##   divorce_rate_maine margarine_consumption_per_capita year
## 1                5.0                              8.2 2000
## 2                4.7                              7.0 2001
## 3                4.6                              6.5 2002
## 4                4.4                              5.3 2003
## 5                4.3                              5.2 2004
## 6                4.1                              4.0 2005

Find the least squares regression equation, using margarine_consumption_per_capita as the x variable and divorce_rate_maine as the y variable. Be sure to also find the p-value for the slope component and the correlation between the two variables. Then, using ggplot2, plot the points of each (x, y) coordinate pair, color-coded based on year. Then overlay the least squares regression equation. Include confidence bands. After your graph, include a write-up that explains the variables selected, methodology for setting up the data, and any conclusions based on the analysis.

data=divorce_margarine
head(data)

##   divorce_rate_maine margarine_consumption_per_capita year
## 1                5.0                              8.2 2000
## 2                4.7                              7.0 2001
## 3                4.6                              6.5 2002
## 4                4.4                              5.3 2003
## 5                4.3                              5.2 2004
## 6                4.1                              4.0 2005

y=data$divorce_rate_maine
x=as.matrix(data$margarine_consumption_per_capita)
l=lm(y~x,data=data)
l

## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Coefficients:
## (Intercept)            x  
##      3.3086       0.2014

summary(l)

## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.05583 -0.01816 -0.01452  0.03601  0.04625 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.308626   0.048032   68.88 2.20e-12 ***
## x           0.201386   0.008735   23.05 1.33e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03841 on 8 degrees of freedom
## Multiple R-squared:  0.9852, Adjusted R-squared:  0.9833 
## F-statistic: 531.5 on 1 and 8 DF,  p-value: 1.33e-08

#Correlation between two variables
cor(y,x)

##           [,1]
## [1,] 0.9925585

#Scatter plot for two variables
ggplot(data, aes(x=x, y=y))+geom_point()+geom_text(label=(data$year))

#adding regression line
ggplot(data, aes(x=x, y=y))+geom_point()+geom_smooth(method=lm,se=FALSE)

## `geom_smooth()` using formula 'y ~ x'

#with confidence interval, points colored by year
ggplot(data, aes(x=x, y=y))+geom_point(aes(color=year))+geom_smooth(method=lm)

## `geom_smooth()` using formula 'y ~ x'

I used the variables asked of me (margarine_consumption_per_capita as the x variable and divorce_rate_maine as the y variable) and fit a linear regression model to them. Then I printed the summary of the model and computed the correlation between the two variables. The fitted least squares equation is divorce_rate_maine = 3.3086 + 0.2014 × margarine_consumption_per_capita, the p-value for the slope is 1.33e-08, and the correlation is about 0.993. Then I plotted the data, overlaid the least squares regression line, and included the confidence band on the graph. According to the data, there is a strong positive linear relationship between per capita margarine consumption and the divorce rate in Maine: years with higher margarine consumption also had higher divorce rates. This is an association between two yearly series, and it does not by itself show that consuming margarine makes divorce more likely.
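The slope, its p-value, and the correlation can also be pulled out of the fitted model programmatically. A minimal sketch, assuming the dslabs package is installed; it fits the model with column names in the formula, and `fit`, `coefs`, `slope`, `p_value`, and `r` are illustrative names.

```r
library(dslabs)

# Fit using column names so the coefficient table is labeled by variable.
fit <- lm(divorce_rate_maine ~ margarine_consumption_per_capita,
          data = divorce_margarine)

# The coefficient table is a matrix indexed by term and statistic.
coefs <- summary(fit)$coefficients
slope <- coefs["margarine_consumption_per_capita", "Estimate"]
p_value <- coefs["margarine_consumption_per_capita", "Pr(>|t|)"]

# Pearson correlation between the two variables.
r <- cor(divorce_margarine$margarine_consumption_per_capita,
         divorce_margarine$divorce_rate_maine)

c(slope = slope, p_value = p_value, r = r)
```

For a simple regression with one predictor, `r^2` equals the Multiple R-squared reported by `summary()`.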

Question 6

Using a new dataset (either base R or imported), create a plot using plotly that visualizes at least 3 variables. Include a write-up on where the data is from, what the data is showing, and why you chose this particular graph and variables. Describe the process you used to manipulate the data and create the visualization.

cars <- plot_ly(mtcars, x = ~wt, y = ~hp, z = ~qsec,
               marker = list(color = ~mpg, colorscale = c('#FFE1A1', '#683531'), showscale = TRUE))
cars <- cars %>% add_markers()
cars <- cars %>% 
  layout(scene = list(xaxis = list(title = 'Weight'),
                                   yaxis = list(title = 'Gross horsepower'),
                                   zaxis = list(title = '1/4 mile time')),
                      annotations = list(
                        x = 1.13,
                        y = 1.05,
                        text = 'Miles/(US) gallon',
                        xref = 'paper',
                        yref = 'paper',
                        showarrow = FALSE
                        ))
cars

The data is mtcars, a dataset included with R. I wanted to see the three-dimensional relationship between a vehicle’s weight, gross horsepower, and quarter-mile time. Using a 3D scatter plot, I plotted these three variables and colored the points by the vehicle’s mpg: the darker the dot, the better the mpg. To summarize the results, the heavier the car, the lower its mpg, while its horsepower and quarter-mile time are about average. The lighter the car, the higher its mpg and the lower its horsepower, but the faster its quarter-mile time.
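As a numeric cross-check on the impressions from the 3D plot, the pairwise correlations among the four plotted variables can be computed in base R. This is only a sketch; `vars` and `cor_mat` are names chosen here. A negative entry means one variable tends to fall as the other rises, and a larger `qsec` means a slower quarter mile.

```r
# The four variables shown in the plot (three axes plus the color scale).
vars <- c("wt", "hp", "qsec", "mpg")

# Pearson correlation matrix, rounded for readability.
cor_mat <- round(cor(mtcars[, vars]), 2)

cor_mat
```

Reading the `qsec` row against `wt` and `hp` is a quick way to confirm or revise the statements about quarter-mile time.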