Sariah Nokes - Intro to the Tidyverse

You are thinking of launching a new digital magazine subscription service, similar to Netflix but for magazines. To assess the viability of this idea and identify potential key segments, we will look at data on magazine subscribers and non-subscribers. The dataset used in this assignment is the Magazine Subscription Data.

Start by loading the required packages and the dataset.

install.packages("tidyverse")

## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/3.6'
## (as 'lib' is unspecified)

install.packages("ggplot2")

## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/3.6'
## (as 'lib' is unspecified)

library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ggplot2)

library(readr)
mag <- read_csv("Magazine Subscription Data.csv")

## Parsed with column specification:
## cols(
##   age = col_double(),
##   gender = col_character(),
##   income = col_double(),
##   kids = col_double(),
##   ownHome = col_character(),
##   subscribe = col_character(),
##   Segment = col_character()
## )

If you recall, we learned about five verbs to use when coding in the tidyverse: filter, arrange, mutate, summarize, and group_by. Which of these functions could be used to display only people with 0 children? filter___________
Use this function so that you see only the people in the magazine data with 0 children.

filter(mag, kids == 0)

## # A tibble: 121 x 7
##      age gender income  kids ownHome subscribe Segment   
##    <dbl> <chr>   <dbl> <dbl> <chr>   <chr>     <chr>     
##  1  43.2 Male   44169.     0 ownYes  subNo     Suburb mix
##  2  28.5 Male   47245.     0 ownNo   subNo     Suburb mix
##  3  35.2 Female 52568.     0 ownYes  subNo     Suburb mix
##  4  47.6 Male   47918.     0 ownYes  subNo     Suburb mix
##  5  37.6 Female 65767.     0 ownNo   subNo     Suburb mix
##  6  42.0 Female 53127.     0 ownYes  subNo     Suburb mix
##  7  44.0 Female 41255.     0 ownYes  subNo     Suburb mix
##  8  44.5 Male   57363.     0 ownNo   subNo     Suburb mix
##  9  45.3 Female 65170.     0 ownNo   subNo     Suburb mix
## 10  42.3 Male   49675.     0 ownYes  subNo     Suburb mix
## # … with 111 more rows
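As a minimal sketch of how filter keeps only the rows that satisfy a condition, the example below uses a small made-up tibble (the `toy` data and its values are hypothetical, not the real magazine dataset). Because `kids` is a numeric column, the comparison is written against the number 0 rather than the string "0".

```r
library(dplyr)

# Hypothetical toy data, not the real magazine dataset
toy <- tibble(
  age  = c(25, 40, 61, 33),
  kids = c(0, 2, 0, 1)
)

# Keep only the rows where the numeric column kids equals 0
no_kids <- filter(toy, kids == 0)
no_kids
```

The same call shape applies to any logical condition on one or more columns.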

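Now let’s arrange the people with 0 kids according to their age. What is the youngest age with 0 kids? 19.3____
Note that the call below sorts all 300 rows rather than only the rows with 0 kids; the answer can still be read from the first row because the youngest person in the data happens to have 0 kids.

arrange(mag, age)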

## # A tibble: 300 x 7
##      age gender income  kids ownHome subscribe Segment  
##    <dbl> <chr>   <dbl> <dbl> <chr>   <chr>     <chr>    
##  1  19.3 Female 18593.     0 ownNo   subNo     Urban hip
##  2  20.7 Male   22517.     3 ownNo   subNo     Urban hip
##  3  21.0 Female 27244.     1 ownNo   subNo     Urban hip
##  4  21.2 Male   18419.     1 ownYes  subYes    Urban hip
##  5  21.4 Male   16646.     3 ownNo   subNo     Urban hip
##  6  21.5 Female 17083.     2 ownNo   subNo     Urban hip
##  7  21.8 Male   27807.     2 ownNo   subYes    Urban hip
##  8  22.1 Male   21107.     0 ownNo   subNo     Urban hip
##  9  22.2 Female 20222.     2 ownYes  subYes    Urban hip
## 10  22.3 Female 24541.     1 ownNo   subNo     Urban hip
## # … with 290 more rows
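The call above sorts every row, so the answer is only right when the first row happens to have 0 kids. A safer pattern, sketched here on a small made-up tibble (the `toy` data is hypothetical), is to filter first and then arrange, so that every row in the result already has 0 kids.

```r
library(dplyr)

# Hypothetical toy data, not the real magazine dataset
toy <- tibble(
  age  = c(25.1, 19.3, 40.7, 22.8),
  kids = c(0, 1, 0, 0)
)

# Filter to 0 kids first, then sort ascending by age
youngest_no_kids <- toy %>%
  filter(kids == 0) %>%
  arrange(age)
youngest_no_kids
```

In this toy data the youngest person overall (19.3) has one child, so sorting without filtering would give the wrong answer.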

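Let’s now apply the arrange function in descending order. What is the oldest age with 0 kids? 80.5___
As before, the call below sorts all 300 rows; the oldest person in the data happens to have 0 kids, so the first row gives the answer.

arrange(mag, desc(age))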

## # A tibble: 300 x 7
##      age gender  income  kids ownHome subscribe Segment  
##    <dbl> <chr>    <dbl> <dbl> <chr>   <chr>     <chr>    
##  1  80.5 Male    82077.     0 ownYes  subYes    Travelers
##  2  78.2 Female  24604.     0 ownYes  subNo     Travelers
##  3  75.9 Female  23968.     0 ownYes  subNo     Travelers
##  4  71.9 Female  60279.     0 ownYes  subYes    Travelers
##  5  70.6 Male    48697.     0 ownNo   subNo     Travelers
##  6  68.1 Female  51535.     0 ownNo   subNo     Travelers
##  7  68.1 Female  25772.     0 ownYes  subNo     Travelers
##  8  68.1 Male   104312.     0 ownYes  subNo     Travelers
##  9  68.0 Female  69075.     0 ownNo   subNo     Travelers
## 10  66.9 Male    54061.     0 ownYes  subNo     Travelers
## # … with 290 more rows
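Instead of reading the answer off the top row of a sorted table, the youngest and oldest ages among people with 0 kids can be computed directly. This is a minimal sketch on a made-up tibble (the `toy` data and the column names `youngest` and `oldest` are hypothetical).

```r
library(dplyr)

# Hypothetical toy data, not the real magazine dataset
toy <- tibble(
  age  = c(25.1, 19.3, 40.7, 22.8),
  kids = c(0, 1, 0, 0)
)

# Restrict to 0 kids, then collapse to a single summary row
age_limits <- toy %>%
  filter(kids == 0) %>%
  summarise(youngest = min(age), oldest = max(age))
age_limits
```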

Let’s take a look at just females. What is the highest income among females? 106430.05

mag %>%
  filter(gender == "Female") %>%
  arrange(desc(income))

## # A tibble: 157 x 7
##      age gender  income  kids ownHome subscribe Segment   
##    <dbl> <chr>    <dbl> <dbl> <chr>   <chr>     <chr>     
##  1  54.9 Female 106430.     0 ownYes  subNo     Travelers 
##  2  57.8 Female 105538.     0 ownYes  subNo     Travelers 
##  3  66.4 Female 101174.     0 ownYes  subNo     Travelers 
##  4  55.2 Female  96509.     0 ownYes  subNo     Travelers 
##  5  47.8 Female  92431.     0 ownYes  subNo     Travelers 
##  6  56.3 Female  91509.     0 ownYes  subNo     Travelers 
##  7  53.7 Female  85770.     0 ownYes  subNo     Travelers 
##  8  62.4 Female  82349.     0 ownYes  subNo     Travelers 
##  9  37.3 Female  81042.     1 ownNo   subNo     Suburb mix
## 10  47.9 Female  79544.     1 ownYes  subNo     Suburb mix
## # … with 147 more rows

Next, let’s group the data by Segment. How many segments are there? 4

mag %>%
  group_by(Segment) %>%
  summarise(count = n())

## # A tibble: 4 x 2
##   Segment    count
##   <chr>      <int>
## 1 Moving up     70
## 2 Suburb mix   100
## 3 Travelers     80
## 4 Urban hip     50
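The group_by plus summarise(n()) pattern is common enough that dplyr offers `count()` as a shorthand, and `n_distinct()` returns the number of groups directly. The sketch below uses a made-up tibble (the `toy` data and segment labels "A", "B", "C" are hypothetical).

```r
library(dplyr)

# Hypothetical toy data, not the real magazine dataset
toy <- tibble(Segment = c("A", "B", "A", "C", "B", "A"))

# count() groups by Segment and tallies rows; the tally column is named n
seg_counts <- count(toy, Segment)
seg_counts

# Number of distinct segments as a single value
n_segments <- n_distinct(toy$Segment)
n_segments
```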

Now we will combine group_by with summarize to calculate the average age for each Segment. Name this output “segment_age” and name the new average age variable “avg_age”.

segment_age <- mag %>%
  group_by(Segment) %>%
  summarise(avg_age = mean(age))
head(segment_age)

## # A tibble: 4 x 2
##   Segment    avg_age
##   <chr>        <dbl>
## 1 Moving up     36.3
## 2 Suburb mix    39.9
## 3 Travelers     57.9
## 4 Urban hip     23.9

Let’s take a look at children alongside age. Add average kids named “avg_kids” as another variable. Name this output “segment_age_kids”. Which segment has the lowest number of kids on average? Travelers__________

segment_age_kids <- mag %>%
  group_by(Segment) %>%
  summarise(avg_age = mean(age), avg_kids = mean(kids))

head(segment_age_kids)

## # A tibble: 4 x 3
##   Segment    avg_age avg_kids
##   <chr>        <dbl>    <dbl>
## 1 Moving up     36.3     1.91
## 2 Suburb mix    39.9     1.92
## 3 Travelers     57.9     0   
## 4 Urban hip     23.9     1.1
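With only four segments the lowest average is easy to spot by eye, but with more groups it helps to sort the summary. A minimal sketch, assuming a made-up tibble (the `toy` data and its segment labels are hypothetical), adds arrange to the end of the pipeline so the segment with the fewest kids on average comes first.

```r
library(dplyr)

# Hypothetical toy data, not the real magazine dataset
toy <- tibble(
  Segment = c("A", "A", "B", "B", "C", "C"),
  age     = c(30, 34, 60, 64, 40, 44),
  kids    = c(2, 1, 0, 0, 3, 1)
)

# Summarise per segment, then sort so the lowest avg_kids is on top
toy_summary <- toy %>%
  group_by(Segment) %>%
  summarise(avg_age = mean(age), avg_kids = mean(kids)) %>%
  arrange(avg_kids)
toy_summary
```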

Now let’s make some visualizations. Use ggplot to make a bar chart out of our average age for each segment.

ggplot(data=segment_age_kids, aes(x=Segment, y=avg_age)) +
  geom_bar(stat="identity")
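When the bar heights are already computed, as they are in a summary table, ggplot2 also provides `geom_col()`, which draws the same chart as `geom_bar(stat="identity")`. The sketch below uses a made-up summary table (the `toy_summary` values are hypothetical).

```r
library(ggplot2)

# Hypothetical pre-computed averages, not the real magazine dataset
toy_summary <- data.frame(
  Segment = c("A", "B", "C"),
  avg_age = c(30, 45, 52)
)

# geom_col() uses the y values as bar heights directly
avg_age_plot <- ggplot(toy_summary, aes(x = Segment, y = avg_age)) +
  geom_col()
avg_age_plot
```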

Let’s look at what the distribution of income is like across our data. Make a histogram showing the distribution of income.

ggplot(mag, aes(x=income)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Note that the default of 30 bins used above is a poor fit for this data and distorts the graph. Let’s change the binwidth to 10,500 and re-graph.

ggplot(mag, aes(x=income)) + geom_histogram(binwidth=10500)
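To get a feel for what a given binwidth means, it can help to estimate how many bins it produces: roughly the range of the data divided by the binwidth. This is only a rough sketch on made-up income values (the `income` vector is hypothetical); ggplot2 aligns bin edges itself, so the plotted count may differ by one.

```r
# Hypothetical income values, not the real magazine dataset
income <- c(12000, 35000, 51000, 78000, 114000)
binwidth <- 10500

# Approximate number of bins: data range divided by bin width, rounded up
approx_bins <- ceiling(diff(range(income)) / binwidth)
approx_bins
```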

Let’s now take a look at how the distribution of income varies by segment. Make a boxplot of income by segment. Which segment has the largest range of income? _____Travelers______

 ggplot(mag, aes(x=Segment, y=income)) + 
  geom_boxplot()
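A boxplot answer can be double-checked numerically by computing the spread of income within each group. The sketch below assumes a made-up tibble (the `toy` data and the column names `income_range` and `income_iqr` are hypothetical) and sorts the result so the widest range comes first.

```r
library(dplyr)

# Hypothetical toy data, not the real magazine dataset
toy <- tibble(
  Segment = c("A", "A", "A", "B", "B", "B"),
  income  = c(20000, 50000, 90000, 40000, 45000, 52000)
)

# Range (max minus min) and interquartile range of income per segment
income_spread <- toy %>%
  group_by(Segment) %>%
  summarise(
    income_range = max(income) - min(income),
    income_iqr   = IQR(income)
  ) %>%
  arrange(desc(income_range))
income_spread
```

The same pattern with `median()` and `IQR()` grouped by gender would give numbers to back up a comparison of two boxplots.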

We also want to know what the distribution of age is like among genders. Make a new boxplot to show this. Are the age distributions similar or different for the genders? Explain. Similar, because their medians and IQRs are close.____________

 ggplot(mag, aes(x=gender, y=age)) + 
  geom_boxplot()

Let’s look at the relationship between age and income. Plot these two in a scatterplot. Paste your code below.

 ggplot(mag, aes(x=age, y=income)) + 
  geom_point()

This is a good graph, but it could be more useful. Let’s try adding some color differentiation. Re-graph and make color show gender. Are the trends of the two genders generally similar or different? ______________

 ggplot(mag, aes(x=age, y=income, color=gender)) + 
  geom_point()

Now let’s see what this graph looks like across Segments. Use the facet_wrap feature to plot the graphs for each segment. Note: This graph should feature all of the previous things we’ve done. We have been building this up with explaining income by age, then adding color for gender, and now we will facet_wrap by segment.

age_income_plot <- ggplot(mag, aes(x=age, y=income, color=gender)) + 
  geom_point()

age_income_plot <- age_income_plot + facet_wrap(~Segment, ncol=2)
age_income_plot
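Because ggplot objects can be extended with `+`, the three steps described above (income by age, then color for gender, then one panel per segment) can be stacked one at a time. This is a minimal sketch on made-up data (the `toy` data frame and the object names `base_plot`, `by_gender`, and `by_segment` are hypothetical).

```r
library(ggplot2)

# Hypothetical toy data, not the real magazine dataset
toy <- data.frame(
  age     = c(22, 35, 47, 58, 29, 63),
  income  = c(21000, 48000, 52000, 80000, 30000, 61000),
  gender  = c("Female", "Male", "Female", "Male", "Male", "Female"),
  Segment = c("A", "A", "B", "B", "A", "B")
)

# Step 1: income explained by age
base_plot <- ggplot(toy, aes(x = age, y = income)) +
  geom_point()

# Step 2: add a color mapping for gender to the existing plot
by_gender <- base_plot + aes(color = gender)

# Step 3: split into one panel per segment, two panels per row
by_segment <- by_gender + facet_wrap(~Segment, ncol = 2)
by_segment
```

Keeping each step in its own object makes it easy to print and compare the intermediate plots.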