In this part of the course project, you will work with a subset of the Student Survey data set that contains information on students who watch either a lot or very little TV each week. This data set is from a survey of college student habits and includes the following variables:

ExerciseLevel: Whether the student exercises more than eight hours per week (HighExcs) or less than eight hours per week (LowExcs)

TVLevel: Whether the student watches TV for more than seven hours per week (HighTV) or less than three hours per week (LowTV)
Award: Type of award the student is most interested in: Nobel or Olympic

You will analyze these data through the lens of a data scientist working for the campus store in which the Student Survey data was collected.

Your team is thinking about running a TV ad marketing to college students and they want to know whether running a TV advertisement is worthwhile. If it is, they also want to know what their discount strategy should be. To answer these questions, you will use the Student Survey data set to determine whether it is worthwhile to run a TV ad, whether your store should advertise a discount on sports gear or science books, and whether there are any confounding factors that might influence the association between how much TV students watch and their interests. You will first determine the percentage of students in the HighTV category, then examine the association between TV level and preferred Award, and finally examine whether the exercise level of those students is a confounding factor.

The data set you will work with is loaded in the code chunk below. The variables in this version of the data set are TV (numerical), ExerciseLevel (categorical: HighExcs and LowExcs), TVLevel (categorical: HighTV and LowTV), and Award (categorical: Nobel or Olympic). Use the following command to load the data frame in R.

X1 <- c(1,4,5,7,8,10,13,14,16,17,18,19,20,22,23,24,25,28,32,33,
        34,35,44,45,46,47,48,49,50,51,52,53,54,55,58,60,
        62,63,64,66,67,68,69,70,76,77,78,79,84,85)
Year <- c("Senior","Junior","Sophomore","FirstYear",
          "Sophomore","FirstYear","Sophomore","Junior",
          "FirstYear","FirstYear","Sophomore","Sophomore",
          "Sophomore","Sophomore","FirstYear","Sophomore",
          "Sophomore","Junior","Sophomore","FirstYear",
          "Senior","Junior","Sophomore","FirstYear",
          "Sophomore","FirstYear","Sophomore","Junior",
          "FirstYear","FirstYear","Sophomore","Sophomore",
          "Sophomore","Sophomore","FirstYear","Sophomore",
          "Sophomore","Junior","Sophomore","FirstYear",
          "Sophomore","Sophomore",
          "Sophomore","Sophomore","FirstYear","Sophomore",
          "Sophomore","Junior","Sophomore","FirstYear")
Sex <- c("M","M","F","F","M","F","M","F","F","F","M",
         "F","F","F","M","M","M","M","F","M","M","M","M","M",
         "M","M","M","M","F","F","F","F","M",
         "F","M","M","M","M","F","F",
         "F","F","M",
         "F","M","M","M","M","F","F")
Smoke <- c("Non-Smoker","Non-Smoker","Non-Smoker","Non-Smoker",
           "Non-Smoker",
           "Non-Smoker","Non-Smoker","Non-Smoker",
           "Non-Smoker","Non-Smoker","Non-Smoker","Non-Smoker",
           "Non-Smoker",
           "Non-Smoker","Non-Smoker","Non-Smoker","Non-Smoker",
           "Non-Smoker","Non-Smoker","Non-Smoker","Smoker","Smoker",
           "Smoker","Smoker",
           "Smoker","Smoker","Smoker","Smoker","Smoker","Smoker","Smoker",
           "Smoker","Smoker","Smoker","Smoker",
           "Smoker","Smoker","Smoker","Smoker","Smoker",
           "Smoker","Smoker","Smoker","Smoker",
           "Non-Smoker","Non-Smoker","Non-Smoker","Non-Smoker","Smoker","Smoker")
Award <- c("Olympic","Nobel","Olympic","Olympic","Olympic","Nobel",
           "Nobel","Olympic","Nobel","Olympic","Olympic","Olympic",
           "Olympic","Olympic","Olympic","Olympic","Olympic","Olympic",
           "Nobel","Olympic","Olympic","Olympic","Olympic","Olympic","Nobel",
           "Nobel","Olympic","Olympic","Olympic","Olympic","Nobel","Nobel","Olympic",
           "Olympic","Olympic","Olympic","Olympic","Nobel","Nobel","Nobel","Olympic", "Nobel","Olympic","Nobel","Olympic","Olympic",
           "Olympic","Nobel","Nobel","Nobel")
TV <- c(1,1,3,10,8,1,8,1,2,15,3,1,2,2,20,10,3,2,2,20,10,2,8,2,14,20,20,10,3,1,14,0,20,2,3,1,7,3,12,3,20,8,0,12,3,3,8,8,3,2)
ExerciseLevel <- c("HighExcs","HighExcs","LowExcs","HighExcs","HighExcs",
                   "HighExcs","HighExcs","HighExcs","LowExcs","LowExcs",
                   "HighExcs","HighExcs","LowExcs","HighExcs","HighExcs",
                   "HighExcs","HighExcs","HighExcs","LowExcs","LowExcs",
                   "HighExcs","HighExcs","LowExcs","HighExcs","HighExcs",
                   "HighExcs","HighExcs","HighExcs","LowExcs","LowExcs",
                   "HighExcs","HighExcs","LowExcs","HighExcs","HighExcs",
                   "HighExcs","HighExcs","HighExcs","LowExcs","LowExcs",
                   "HighExcs","HighExcs","LowExcs","HighExcs","HighExcs",
                   "HighExcs","HighExcs","HighExcs","LowExcs","LowExcs")
TVLevel <- c("LowTV","LowTV","LowTV","HighTV","HighTV","LowTV",
             "LowTV","HighTV","LowTV","LowTV",
             "LowTV","LowTV","LowTV","HighTV","HighTV","LowTV",
             "LowTV","HighTV","LowTV","LowTV",
             "LowTV","LowTV","LowTV","HighTV","HighTV","LowTV",
             "LowTV","HighTV","LowTV","LowTV",
             "LowTV","LowTV","LowTV","HighTV","HighTV","LowTV",
             "LowTV","HighTV","LowTV","LowTV",
             "LowTV","LowTV","LowTV","HighTV","HighTV","LowTV",
             "LowTV","HighTV","LowTV","LowTV")

dat <- data.frame(X1,Year,Sex, Smoke, Award, TV, ExerciseLevel, TVLevel)

crimson = '#b31b1b'   # crimson
lightGray = '#cecece' # lightGray
darkGray = '#606366'  # darkGray
skyBlue = '#92b2c4'   # skyblue
gold = '#fbb040'      # gold
ecBlack = '#393f47'   # ecBlack

Step 1:

Your team assumes that if at least 40% of the students are in the HighTV category, a televised ad campaign would be worthwhile. Determine the proportion of students in the HighTV category.

TvFactor <- factor(dat$TVLevel,
                  levels = c("0","1"),
                  labels = c("LowTV","HighTV"))
dat$TvFactor <- TvFactor
dat$TvFactor <- dat$TVLevel
head(dat, n = 10)
##    X1      Year Sex      Smoke   Award TV ExerciseLevel TVLevel TvFactor
## 1   1    Senior   M Non-Smoker Olympic  1      HighExcs   LowTV    LowTV
## 2   4    Junior   M Non-Smoker   Nobel  1      HighExcs   LowTV    LowTV
## 3   5 Sophomore   F Non-Smoker Olympic  3       LowExcs   LowTV    LowTV
## 4   7 FirstYear   F Non-Smoker Olympic 10      HighExcs  HighTV   HighTV
## 5   8 Sophomore   M Non-Smoker Olympic  8      HighExcs  HighTV   HighTV
## 6  10 FirstYear   F Non-Smoker   Nobel  1      HighExcs   LowTV    LowTV
## 7  13 Sophomore   M Non-Smoker   Nobel  8      HighExcs   LowTV    LowTV
## 8  14    Junior   F Non-Smoker Olympic  1      HighExcs  HighTV   HighTV
## 9  16 FirstYear   F Non-Smoker   Nobel  2       LowExcs   LowTV    LowTV
## 10 17 FirstYear   F Non-Smoker Olympic 15       LowExcs   LowTV    LowTV
table(head(dat$TVLevel, n=11))
## 
## HighTV  LowTV 
##      3      8
4 / (4 + 7)
## [1] 0.3636364
7 / (4 + 7)
## [1] 0.6363636
#or we can do the similar thing by prop cmmnd
prop.table(table(head(dat$TVLevel,n = 11)))
## 
##    HighTV     LowTV 
## 0.2727273 0.7272727
#the proportion of HighTV is 0.2727273

Step 2:

Visualize the proportion of students in each TV level with a barplot. Based on your results, should your team run a televised ad campaign?

tbl.dat <- table(dat$TVLevel)
tbl.dat
## 
## HighTV  LowTV 
##     15     35
prop.table(tbl.dat)
## 
## HighTV  LowTV 
##    0.3    0.7
# We are going to give all the amount of
#the data to a variable names counts
counts = table(dat$TVLevel)
head(counts, n = 10)
## 
## HighTV  LowTV 
##     15     35
par(mar=2+c(5.1,4.1,4.1,2.1))
barplot(counts,
        main = "Proportion of studentn in Each TV Level",
        col = crimson,
        ylab = "HighTV",
        xlab = "LowTV?",
        cex.axis = 1.5,
        cex.names = 1.5,
        cex.lab = 1.5,
        cex.main = 1.5)

Step 3:

Your team assumes that if more students who watch a lot of TV are interested in Nobel Prizes, they will prefer science books. In contrast, your team assumes that if more students who watch a lot of TV are interested in the Olympics, they will prefer sports gear.

Create a table that shows the proportions of HighTV and LowTV students that prefer each award type. Your team will use this table to decide whether to run a sale on science books or sports gear based on their assumptions, above.

table(dat$TVLevel, dat$Award)
##         
##          Nobel Olympic
##   HighTV     4      11
##   LowTV     13      22
#After this, we are going to show this, in row axis

prop.table(table(dat$TVLevel, dat$Award), 1)
##         
##              Nobel   Olympic
##   HighTV 0.2666667 0.7333333
##   LowTV  0.3714286 0.6285714
tbl <- table(dat$TVLevel, dat$Award)
tbl
##         
##          Nobel Olympic
##   HighTV     4      11
##   LowTV     13      22
t(tbl)
##          
##           HighTV LowTV
##   Nobel        4    13
##   Olympic     11    22

Step 4:

Create a stacked barplot to present the results of the previous analysis to your team. Recall that your team assumes that if more students who watch a lot of TV are interested in Nobel Prizes, they will prefer science books, whereas if more students who watch a lot of TV are interested in the Olympics, they will prefer sports gear.

Based on your results, should your team discount science books or sports gear?

barplot(tbl, 
  main="Proportion Of Stuent's Choice Based On Award Types",
  cex.names = 1.5, col = c("red", "gray"),
  legend.text=c("HighTV","LowTV"), 
  args.legend=list(x="topleft",cex=1))

Step 5:

TV ads are expensive, so before your team runs the campaign you decide to check for confounding factors. Your team has assumed that students who prefer the Olympics will be interested in sports gear, but the amount of exercise students do could influence the association between the award a student prefers and the amount of TV they watch. For instance, if the students who prefer the Olympics and watch more TV are primarily from the HighExcs group, the team would consider featuring exercise gear instead of sports gear in the TV ad.

Check whether the difference in the level of TV watched by students who are more interested in different awards is influenced by their level of exercise. Summarize this information by creating a table of proportions.

TvFactor <- factor(dat$TVLevel,
                  levels = c("0","1"),
                  labels = c("LowTV","HighTV"))
dat$TvFactor <- TvFactor
dat$TvFactor <- dat$TVLevel
head(dat, n = 10)
##    X1      Year Sex      Smoke   Award TV ExerciseLevel TVLevel TvFactor
## 1   1    Senior   M Non-Smoker Olympic  1      HighExcs   LowTV    LowTV
## 2   4    Junior   M Non-Smoker   Nobel  1      HighExcs   LowTV    LowTV
## 3   5 Sophomore   F Non-Smoker Olympic  3       LowExcs   LowTV    LowTV
## 4   7 FirstYear   F Non-Smoker Olympic 10      HighExcs  HighTV   HighTV
## 5   8 Sophomore   M Non-Smoker Olympic  8      HighExcs  HighTV   HighTV
## 6  10 FirstYear   F Non-Smoker   Nobel  1      HighExcs   LowTV    LowTV
## 7  13 Sophomore   M Non-Smoker   Nobel  8      HighExcs   LowTV    LowTV
## 8  14    Junior   F Non-Smoker Olympic  1      HighExcs  HighTV   HighTV
## 9  16 FirstYear   F Non-Smoker   Nobel  2       LowExcs   LowTV    LowTV
## 10 17 FirstYear   F Non-Smoker Olympic 15       LowExcs   LowTV    LowTV
dat$Exlvl=ifelse(dat$ExerciseLevel=="HighExcs", 1, 0)



prop = aggregate(Exlvl ~ ExerciseLevel + Award, 
                 FUN = mean,
                 data = dat)
prop
##   ExerciseLevel   Award Exlvl
## 1      HighExcs   Nobel     1
## 2       LowExcs   Nobel     0
## 3      HighExcs Olympic     1
## 4       LowExcs Olympic     0

Step 6:

Create a side-by-side barplot that shows how student exercise levels influence the association between a student’s preferred award and the amount of TV they watch.

library(lattice) # Load the lattice package so you can make a side-by-side barplot
barchart(Exlvl~ExerciseLevel,
         groups=Award,    
         dat=prop,      
         pch=19, cex=1.5, ylim=c(0,1), 
         main="Proportion Of High & Low Excs Based On The Award Types",
         par.settings = list(superpose.polygon = 
                               list(col=c(crimson, darkGray))),             
         
         scales=list(cex=1.5), 
         xlab=list(label="HighExcs", cex = 1.5), 
       
      
         ylab=list(label="LowExcs", cex = 1.5), 

         auto.key=list(space="right",cex=1.5)) 

prop.Award=aggregate(Exlvl ~ Award,
                      FUN=mean, data=dat)
prop.Award
##     Award     Exlvl
## 1   Nobel 0.6470588
## 2 Olympic 0.7272727

Step 7:

Based on the side-by-side barplot and the table you made to understand how exercise level influences the relationship between award of interest and TV level, is exercise level a confounding factor? Why or why not?

Based on the bar chart information that is shown on the screen, both parties are on the same level. This means we need to see the relationship between the exercise level and whether it’s in the HighExcs or LowExcs. To bring an example, it gives us the privilege to assess the observed and unobserved confounders.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.