I’ll get started by setting up my working directory and reading in the data.
setwd("C:/Users/Nathan/Desktop/Working Directory")
ad.df <- read.csv("Analyst_HW.csv", header = T, stringsAsFactors = F)
What is the structure of the data?
str(ad.df) #find out the structure at a glance
## 'data.frame': 14905865 obs. of 7 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 36 73 30 49 47 47 0 46 16 52 ...
## $ Gender : int 0 1 0 1 1 0 0 0 0 0 ...
## $ Impressions: int 3 3 3 3 11 11 7 5 3 4 ...
## $ Clicks : int 0 0 0 0 0 1 1 0 0 0 ...
## $ Signed_In : int 1 1 1 1 1 1 0 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
summary(ad.df) #basic summary statistics
## X Age Gender Impressions
## Min. : 1 Min. : 0.00 Min. :0.0000 Min. : 0
## 1st Qu.: 3726467 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.: 3
## Median : 7452933 Median : 26.00 Median :0.0000 Median : 5
## Mean : 7452933 Mean : 26.24 Mean :0.3231 Mean : 5
## 3rd Qu.:11179399 3rd Qu.: 46.00 3rd Qu.:1.0000 3rd Qu.: 6
## Max. :14905865 Max. :115.00 Max. :1.0000 Max. :21
## Clicks Signed_In day
## Min. :0.00000 Min. :0.0000 Min. : 1.00
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.: 8.00
## Median :0.00000 Median :1.0000 Median :16.00
## Mean :0.09773 Mean :0.6234 Mean :15.98
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:24.00
## Max. :6.00000 Max. :1.0000 Max. :31.00
From the summary stats I saw that the max age was 115, so I’ll set the upper bound of my age cuts just above it (116, since the intervals are left-closed).
breaks <- c(0,18,25,35,45,55,65,116)
labels <- c('<18','18-24','25-34','35-44', '45-54','55-64','65+')
ad.df$age_group <- cut(ad.df$Age,
breaks = breaks,
labels = labels,
right = F, ordered_result = T)
# Let's check to see if it's right (there should be no NA's)
summary(ad.df$age_group) #No NA values
## <18 18-24 25-34 35-44 45-54 55-64 65+
## 6016115 1176595 1673650 2044613 1859487 1299303 836102
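As a quick sanity check on the boundaries (a minimal sketch; with right = FALSE each interval is closed on the left), a few edge ages should land where expected:
# bins are [0,18), [18,25), ..., [65,116): an 18-year-old falls in '18-24'
# and the max age of 115 falls in '65+'
cut(c(17, 18, 64, 65, 115), breaks = breaks, labels = labels,
    right = F, ordered_result = T)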
# (CTR=# clicks/# impressions)
ad.df.day <- subset(ad.df, day == 10) # I'll pick the 10th, since it's my birthday
Let’s look at the impressions and CTRs for the seven age groups.
library(ggplot2)
library(plyr)
ggplot(ad.df.day, aes(x=Impressions, colour=age_group)) + geom_density()
They look approximately normally distributed, and the averages appear to be around 5 for each age group. It’s hard to see differences with the overlapping curves, though. Let’s plot the means for each.
#this we can use to plot the means of the different groups for comparison
mimp <- ddply(ad.df.day, "age_group", summarise, imp.mean=mean(Impressions))
mimp
## age_group imp.mean
## 1 <18 5.008970
## 2 18-24 5.010811
## 3 25-34 4.988482
## 4 35-44 4.998832
## 5 45-54 4.984716
## 6 55-64 4.989324
## 7 65+ 4.982199
ggplot(ad.df.day, aes(x=Impressions)) + geom_histogram(binwidth=.5, colour="black", fill="white") +
facet_grid(age_group ~ .) +
geom_vline(data=mimp, aes(xintercept=imp.mean),
linetype="dashed", size=1, colour="red")
Now we can see that the ‘<18’ group accounts for the most impression records, with the ‘35-44’ group second.
Now let’s look at the CTR. To analyze / visualize CTR we need to first create a CTR variable.
ad.df.day$ctr <- (ad.df.day$Clicks)/(ad.df.day$Impressions) # there were 3020 NA's due to '0/0';
ad.df.day$ctr[is.na(ad.df.day$ctr)] <- 0 # replace the NA's with '0'
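(For reference, 0/0 in R yields NaN, and is.na() is TRUE for NaN, which is why the replacement above catches the zero-impression records.)
0/0 # NaN
is.na(0/0) # TRUE, so the assignment above zeroes these records out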
mctr <- ddply(ad.df.day, "age_group", summarise, ctr.mean=mean(ctr))
mctr
## age_group ctr.mean
## 1 <18 0.028138911
## 2 18-24 0.011232519
## 3 25-34 0.010139070
## 4 35-44 0.010132852
## 5 45-54 0.009574999
## 6 55-64 0.020482843
## 7 65+ 0.029169597
ggplot(ad.df.day, aes(x=ctr, colour=age_group)) + geom_density() # again, very difficult to see
ggplot(ad.df.day, aes(x=ctr)) + geom_histogram(binwidth=.5, colour="black", fill="white") +
facet_grid(age_group ~ .) +
geom_vline(data=mctr, aes(xintercept=ctr.mean),
linetype="dashed", size=1, colour="red")
The means for the ‘65+’ and ‘<18’ are visibly higher than the other groups. We can see this better if we just graph the CTR means by age group.
ggplot(data=mctr, aes(x=age_group, y=ctr.mean, fill=age_group)) +
geom_bar(stat="identity")
Maybe we could segment users based on click status…
library(lattice)
summary(ad.df.day$Clicks) #Overall, not very many clicks
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09216 0.00000 4.00000
prop.table(table(ad.df.day$Clicks))*100 #91.3% of click data is '0', only 8.1% being 1 click.
##
## 0 1 2 3 4
## 91.343210400 8.131131755 0.494736796 0.028712403 0.002208646
I’ll create a grouping variable ‘have_clicked’; 1= have 1+ clicks, 0= no clicks
ad.df.day$have_clicked <- ifelse(ad.df.day$Clicks>=1, 1, 0)
prop.table(table(as.factor(ad.df.day$have_clicked))) #see proportions of new groups
##
## 0 1
## 0.9134321 0.0865679
histogram(~as.factor(ad.df.day$have_clicked) | ad.df.day$age_group,
data=ad.df.day, type = "count",layout=c(7,1),
col=c("light blue", "orange"), xlab= "Clicks by Age group, 0=No clicks, 1=At least 1 click",
ylab ="Count of records")
The data show an interesting pattern: a large share of users fall in the ‘<18’ age group, and overall far more people do not click than do. The ‘0’ series follows a fairly normal curve, suggesting the customer age-group distribution is roughly normal for day 10 if we set aside the ‘<18’ group.
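A raw cross-tabulation gives the same picture numerically (a quick sketch, using the have_clicked flag defined above):
with(ad.df.day, table(age_group, have_clicked)) # counts behind the histogram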
Let’s look at just those who have clicks - is their data normally distributed?
ClkA <- prop.table(table(ad.df.day$age_group, ad.df.day$have_clicked==1)) #'ClkA'...sounds like a rapper!
ClkA <- data.frame(ClkA)
ClkA <- subset(ClkA, Var2==TRUE) #keep only the rows for users who clicked
ClkA$Var2 <- NULL
s <- sum(ClkA$Freq)
ClkA$RelFrq <- ClkA$Freq/s #calculate relative frequency
library(data.table)
labl <- data.table(ClkA)[, per := sprintf("%.1f%%", RelFrq*100), by = Var1]
ggplot(ClkA, aes(x=Var1, y=RelFrq, fill=Var1))+geom_bar(stat="identity")+
labs(title="1+ Clicks by Age Group", x= "Age Groups",
y= "Relative Frequency of Clicks")+ guides(fill=FALSE)+
geom_text(data = labl, aes(x = Var1, y = RelFrq, label = per), vjust= -0.5)
Looks like the data for those with 1+ clicks are not normally distributed. The ‘<18’ age segment makes up almost half of all clicks for day 10, followed by the older groups (55-64 and 65+), which make up 21.1% of 1+ clicks.
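For a formal check that click status and age group are associated, one option (a hedged sketch; with this many rows even trivial differences will test as “significant”) is a chi-squared test of independence:
chisq.test(with(ad.df.day, table(age_group, have_clicked)))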
means <- with(ad.df.day, aggregate(x=list(Y=Impressions),
by=list(A=age_group, B=Gender),mean))
with(means, interaction.plot(x.factor=A, trace.factor=B, response=Y, type='b',
main = "Impressions by gender and age group, day 10",
xlab = "Age groups", ylab= "Average Impressions"))
This shows an interesting result: generally, the average number of impressions decreases as you move up in age group. Women (coded ‘0’) appear to average more impressions in every age group.
prop.table(table(ad.df.day$Gender)) #looks like there's more women than men in this data
##
## 0 1
## 0.6371879 0.3628121
# test to see if there are significant (with alpha = 0.05) differences between age groups and genders
TukeyHSD(aov(Y~(as.factor(A))+(as.factor(B)), means))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Y ~ (as.factor(A)) + (as.factor(B)), data = means)
##
## $`as.factor(A)`
## diff lwr upr p adj
## 18-24-<18 0.0023022891 -0.016064247 0.0206688256 0.9972371
## 25-34-<18 -0.0198334524 -0.038199989 -0.0014669159 0.0357225
## 35-44-<18 -0.0098559333 -0.028222470 0.0085106033 0.3923595
## 45-54-<18 -0.0237776653 -0.042144202 -0.0054111288 0.0152988
## 55-64-<18 -0.0191677647 -0.037534301 -0.0008012281 0.0415547
## 65+-<18 -0.0287184923 -0.047085029 -0.0103519558 0.0059187
## 25-34-18-24 -0.0221357415 -0.040502278 -0.0037692050 0.0215580
## 35-44-18-24 -0.0121582224 -0.030524759 0.0062083141 0.2265793
## 45-54-18-24 -0.0260799544 -0.044446491 -0.0077134179 0.0096830
## 55-64-18-24 -0.0214700538 -0.039836590 -0.0031035172 0.0248753
## 65+-18-24 -0.0310207814 -0.049387318 -0.0126542449 0.0039526
## 35-44-25-34 0.0099775191 -0.008389017 0.0283440556 0.3816150
## 45-54-25-34 -0.0039442129 -0.022310749 0.0144223236 0.9604502
## 55-64-25-34 0.0006656877 -0.017700849 0.0190322243 0.9999978
## 65+-25-34 -0.0088850399 -0.027251576 0.0094814966 0.4859155
## 45-54-35-44 -0.0139217320 -0.032288269 0.0044448045 0.1464780
## 55-64-35-44 -0.0093118314 -0.027678368 0.0090547052 0.4431463
## 65+-35-44 -0.0188625590 -0.037229096 -0.0004960225 0.0445721
## 55-64-45-54 0.0046099006 -0.013756636 0.0229764372 0.9238380
## 65+-45-54 -0.0049408270 -0.023307364 0.0134257095 0.9002983
## 65+-55-64 -0.0095507277 -0.027917264 0.0088158089 0.4203114
##
## $`as.factor(B)`
## diff lwr upr p adj
## 1-0 -0.01026019 -0.0160228 -0.004497566 0.0047878
It looks like there are significant differences between several groups, especially between 65+ and <18 (surprising?) and between 65+ and 18-24, among others. The genders are significantly different at the 0.05 alpha level.
Let’s check out if there are differences with gender and age group with clicks…
means1 <- with(ad.df.day, aggregate(x=list(Y=Clicks),
by=list(A=age_group, B=Gender),mean))
with(means1, interaction.plot(x.factor=A, trace.factor=B, response=Y, type='b',
main = "Clicks by gender and age group, day 10",
xlab = "Age groups", ylab= "Average Clicks"))
TukeyHSD(aov(Y~(as.factor(A))+(as.factor(B)), means1))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Y ~ (as.factor(A)) + (as.factor(B)), data = means1)
##
## $`as.factor(A)`
## diff lwr upr p adj
## 18-24-<18 -0.0921921597 -0.11331052 -0.07107379 0.0000161
## 25-34-<18 -0.0976283786 -0.11874674 -0.07651001 0.0000108
## 35-44-<18 -0.0979647600 -0.11908312 -0.07684640 0.0000106
## 45-54-<18 -0.1008209662 -0.12193933 -0.07970260 0.0000087
## 55-64-<18 -0.0462250931 -0.06734346 -0.02510673 0.0009664
## 65+-<18 -0.0001628079 -0.02128117 0.02095556 1.0000000
## 25-34-18-24 -0.0054362189 -0.02655458 0.01568215 0.9158447
## 35-44-18-24 -0.0057726003 -0.02689097 0.01534576 0.8941317
## 45-54-18-24 -0.0086288066 -0.02974717 0.01248956 0.6374858
## 55-64-18-24 0.0459670665 0.02484870 0.06708543 0.0009969
## 65+-18-24 0.0920293517 0.07091099 0.11314772 0.0000163
## 35-44-25-34 -0.0003363814 -0.02145475 0.02078198 1.0000000
## 45-54-25-34 -0.0031925877 -0.02431095 0.01792578 0.9926247
## 55-64-25-34 0.0514032854 0.03028492 0.07252165 0.0005333
## 65+-25-34 0.0974655706 0.07634721 0.11858394 0.0000109
## 45-54-35-44 -0.0028562062 -0.02397457 0.01826216 0.9958620
## 55-64-35-44 0.0517396669 0.03062130 0.07285803 0.0005140
## 65+-35-44 0.0978019520 0.07668359 0.11892032 0.0000107
## 55-64-45-54 0.0545958731 0.03347751 0.07571424 0.0003795
## 65+-45-54 0.1006581583 0.07953979 0.12177652 0.0000088
## 65+-55-64 0.0460622852 0.02494392 0.06718065 0.0009855
##
## $`as.factor(B)`
## diff lwr upr p adj
## 1-0 0.000898213 -0.00572781 0.007524236 0.7513962
This shows visually that average clicks differ little between genders, while there are big differences between age groups. The Tukey HSD confirms that there is no significant difference between genders and shows which age groups are statistically different from one another.
For the correlation analysis, let’s look at just the numeric data…
library(corrplot)
library(gplots)
new.df <- as.data.frame(sapply(ad.df.day[,c(2:6,8,10)], as.numeric)) #if needed coerce to numeric
corrplot.mixed(corr=cor(new.df), upper = "ellipse", tl.pos = "lt",
col= colorpanel(50, "red", "gray50", "blue4"))
This correlation plot shows that, beyond obvious correlations like Age with age_group, there is an 82% correlation between Age and Signed_In, suggesting a strong linear relationship between these variables, even though this is simulated data. If these were real data, I’d expect a stronger correlation between Impressions and Clicks than just 13%, among other things.
Along with the correlation matrix, it is useful to see the quantile breakdown for our numeric variables.
quant <- function(x) {quantile(x,probs=0:10/10)}
apply(new.df[,1:dim(new.df)[2]],2,quant)
## Age Gender Impressions Clicks Signed_In age_group have_clicked
## 0% 0 0 0 0 0 1 0
## 10% 0 0 2 0 0 1 0
## 20% 0 0 3 0 0 1 0
## 30% 0 0 4 0 0 1 0
## 40% 23 0 4 0 1 2 0
## 50% 31 0 5 0 1 3 0
## 60% 38 0 5 0 1 4 0
## 70% 45 1 6 0 1 5 0
## 80% 52 1 7 0 1 5 0
## 90% 61 1 8 0 1 6 0
## 100% 106 1 20 4 1 7 1
With the data in numeric form, I’ll build a matrix for k-means clustering (an experiment, since it mixes binary and interval data).
library(cluster)
set.seed(84105) # my zip code
new.df.seg <- kmeans(new.df, centers = 3) #choosing 3 clusters
# summary of clusters
seg.summ <- function(data, groups){
aggregate(data, list(groups), function(x) mean(as.numeric(x)))
} #thanks to "R for Marketing Research and Analytics" by Chapman and Feit for this function!
#Now we can see some differences in the clusters in the data
seg.summ(new.df, new.df.seg$cluster)
## Group.1 Age Gender Impressions Clicks Signed_In
## 1 1 0.9550272 0.04514474 5.008762 0.14134898 0.06567445
## 2 2 31.2184120 0.53113028 4.998626 0.05466881 1.00000000
## 3 3 57.6833979 0.49046132 4.985688 0.08670597 1.00000000
## age_group have_clicked
## 1 1.000000 0.12995275
## 2 3.121837 0.05303699
## 3 5.744824 0.08235087
From the k-means we got 3 groups. I notice the greatest differences in Age and age_group. We can visualize the distributions of different metrics of our clusters with boxplots.
boxplot(new.df$Impressions~ new.df.seg$cluster, xlab ="Clusters", ylab = "Impressions") #not very different
boxplot(new.df$Age~ new.df.seg$cluster, xlab ="Clusters", ylab = "Age") #differs by age
boxplot(new.df$have_clicked~ new.df.seg$cluster, xlab ="Clusters", ylab = "Have clicks") #not different
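One caveat on the binary/interval mix noted earlier: k-means is distance-based, so wide-range columns (especially Age, 0-115) dominate the 0/1 columns. A hedged alternative sketch is to standardize before clustering:
new.df.scaled <- scale(new.df) # z-score each column so no single scale dominates
seg.scaled <- kmeans(new.df.scaled, centers = 3) # same k, standardized inputs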
We can also plot the clusters, but for a k-means fit of this size (and my computer) it’s computationally prohibitive.
#plot of clusters (not run)
clusplot(new.df, new.df.seg$cluster, color =T, shade = T,
labels= 4, lines = 0, main = "K-means clusters for 10th day")
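A workaround (a sketch, assuming a random subsample is representative) is to plot the clusters for a sample of rows, which keeps clusplot’s distance computations manageable:
idx <- sample(nrow(new.df), 5000) # hypothetical 5k-row sample
clusplot(new.df[idx,], new.df.seg$cluster[idx], color = T, shade = T,
         labels = 4, lines = 0, main = "K-means clusters, 5k-row sample")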
Since these data represent May 2012 we can figure out with the help of a calendar exactly what weekdays are represented.
ad.df$weekday <- paste('2012','05',ad.df$day, sep = '-') #using the $day variable to build the date
ad.df$weekday <- weekdays(as.Date(ad.df$weekday)) #convert from date to weekday (takes a long time)
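An aside: since there are only 31 distinct days, an equivalent but much faster approach (a sketch) is to convert each unique day once and index into the result:
day.map <- weekdays(as.Date(paste('2012', '05', 1:31, sep = '-')))
# ad.df$weekday <- day.map[ad.df$day] # same result as the line above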
d.order <- c("Sunday", "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday") # I want the weekdays in this order
dt.imp <- aggregate(ad.df$Impressions, by=list(weekday=ad.df$weekday),sum) # Weekday impressions
dt.clk <- aggregate(ad.df$Clicks, by=list(weekday=ad.df$weekday),sum) # Weekday clicks
dt.sgn <- aggregate(ad.df$Signed_In, by=list(weekday=ad.df$weekday),sum) # Weekday sign-ins
mt.imp <- aggregate(ad.df$Impressions, by=list(weekday=ad.df$day),sum) # Daily impressions
mt.clk <- aggregate(ad.df$Clicks, by=list(weekday=ad.df$day),sum) # Daily clicks
mt.sgn <- aggregate(ad.df$Signed_In, by=list(weekday=ad.df$day),sum) # Daily sign-ins
dt.imp$weekday <- factor(as.character(dt.imp$weekday), levels = d.order) # order data by correct weekday order
dt.imp <- dt.imp[order(dt.imp$weekday),]
dt.clk$weekday <- factor(as.character(dt.clk$weekday), levels = d.order)
dt.clk <- dt.clk[order(dt.clk$weekday),]
dt.sgn$weekday <- factor(as.character(dt.sgn$weekday), levels = d.order)
dt.sgn <- dt.sgn[order(dt.sgn$weekday),]
I made this simple plotting function to help generate visuals quickly.
vizr <- function(data, ylab, title){
require(ggplot2)
p<-ggplot(data, aes(weekday, x, group=1)) + geom_line() +
labs(x="Days", y= ylab ,title = title) + geom_smooth(method=lm)
return(p)
}
vizr(dt.imp, "# of Impressions", "Impressions by day of week")
vizr(dt.clk, "# of Clicks", "Clicks by day of week")
vizr(dt.sgn, "# of Sign-ins", "Sign-ins by day of week")
vizr(mt.imp, "# of Impressions", "Impressions by day of May")
vizr(mt.clk, "# of Clicks", "Clicks by day of May")
vizr(mt.sgn, "# of Sign-ins", "Sign-ins by day of May")
There is a definite cyclical pattern in all metrics (Impressions, Clicks, and Sign-ins), with peaks on Sundays and troughs on Saturdays. Impressions show no linear trend, remaining relatively constant for the month. Clicks trend upward toward the end of the month. Sign-ins trend downward, with a distinct drop over the last two weeks of the month of more than 100,000 on peak days. Across the week, Sunday is the big day; the rest of the week trends downward to a low on Saturday, with Sign-ins seeing a local peak on Thursdays.
At first I tried a basic OLS linear model regressing Clicks against Age, Impressions, and Signed_In status. I was skeptical about the results, expecting at best a weak R-squared. The approach was worth a try, but in the end every predictor turned out to be statistically significant (alpha = 0.05) in predicting Clicks. From what I learned about the correlations with Clicks, Age was the only mildly correlated variable, and with data this big I decided that the OLS approach was not going to give me an idea of what’s really going on.
# First model (not run)
model <- lm(Clicks ~ Impressions + Signed_In + Age, data = ad.df.day)
summary(model)
This makes sense, but due to the sheer volume of data every predictor comes out highly significant, so the model doesn’t really tell us anything; it’s full of what amount to way too many “false positives”.
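One common sanity check (a sketch, assuming a simple random subsample is representative) is to refit on a modest sample so the p-values aren’t driven purely by n:
set.seed(84105)
sub <- ad.df.day[sample(nrow(ad.df.day), 5000),] # hypothetical 5k-row sample
summary(lm(Clicks ~ Impressions + Signed_In + Age, data = sub))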
Instead of using regression analysis to find drivers for clicks, I decided to use the k-nearest-neighbors (kNN) algorithm to classify observations on the binary ‘Signed_In’ variable. By feeding new observations to the algorithm, one can “predict” sign-in behavior.
library(car)
library(class)
library(scales)
set.seed(841052) # seed before sampling so the subset is reproducible
nDT <- some(ad.df, 100000) # Let's work with a subset of the data (i.e. 100k rows of random data)
nDT$rand <- runif(nrow(nDT))
After running the algorithm I found that, with almost any k, the kNN had too many ties. This seems to happen because there are so many similar records that the algorithm can’t “break the ties”. To help, the ‘rand’ column of random noise created above breaks up the ties.
nDT.new <- as.data.frame(sapply(nDT[,c(2,4,5,10)], as.numeric))
#normalize continuous data for better performance with KNN
normy <- function(x){
return((x - min(x)) / (max(x) - min(x)))
}
nDT.new <- data.frame(apply(nDT.new, 2, normy))
summary(nDT.new)
## Age Impressions Clicks rand
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.1500 1st Qu.:0.00000 1st Qu.:0.2476
## Median :0.2524 Median :0.2500 Median :0.00000 Median :0.5010
## Mean :0.2545 Mean :0.2501 Mean :0.02462 Mean :0.5006
## 3rd Qu.:0.4466 3rd Qu.:0.3000 3rd Qu.:0.00000 3rd Qu.:0.7520
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
Now I’ll set up my training and testing sets; I like to keep 10% for testing.
train <- nDT.new[1:90000,]
test <- nDT.new[90001:nrow(nDT.new),]
train_target <- nDT[1:90000,6]
test_target <- nDT[90001:nrow(nDT),6] # the 6th column has the actual sign-in data
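(Because nDT was drawn randomly by car::some(), this head/tail split is effectively random; with ordered data you’d want an explicit shuffle. An equivalent index-based sketch, not run here:)
# idx <- sample(nrow(nDT.new), 0.9 * nrow(nDT.new))
# train <- nDT.new[idx,]; test <- nDT.new[-idx,]
# train_target <- nDT[idx, 6]; test_target <- nDT[-idx, 6]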
# KNN model
k <- sqrt(nrow(nDT)) # choosing k as sqrt of length of observations
k # I'll round up to the next odd number
## [1] 316.2278
model1 <- knn(train=train, test=test, cl=train_target, k=317)
Awesome! So, how accurate was the model? I’ll use a basic confusion-matrix approach.
mod1 <-table(test_target, model1)
mod1
## model1
## test_target 0 1
## 0 4095 0
## 1 24 5881
sum(diag(mod1))/sum(mod1) #overall model accuracy
## [1] 0.9976
sum(mod1[1,1])/sum(mod1[,1]) #model precision
## [1] 0.9941733
sum(mod1[1,1])/sum(mod1[1,]) #model recall
## [1] 1
sum(mod1[2,1])/sum(mod1[2,]) #model false positive rate
## [1] 0.004064352
Overall this model did a great job of classifying ‘Signed_In’ behavior. When I ran it, it incorrectly categorized 24 records as ‘0’ (not signed in), for a false positive rate of 0.4%. It also achieved 100% recall (it found all of the ‘0’ records).
SELECT SupplierID,
       COUNT(*) AS count
FROM BikeSupplier JOIN Inventory ON (BikeSupplier.BikeID = Inventory.BikeID)
GROUP BY SupplierID
SELECT CustID
FROM Rentals
WHERE DURATION >= 5
SELECT name, COUNT(*) AS order_count
FROM Orders
WHERE name LIKE 'N%'
GROUP BY name
HAVING COUNT(*) > 4
SELECT * FROM (SELECT BikeID FROM Inventory) s1
FULL OUTER JOIN (SELECT BikeID, BIKE_NAME FROM Bikes) s2
ON s1.BikeID = s2.BikeID
WHERE s1.BikeID IS NULL