SDS Report

## magrittr    dplyr reshape2  ggplot2    RCurl 
##     TRUE     TRUE     TRUE     TRUE     TRUE

My sample subjects were students all over campus, and my sample size was 1875 obeservations at 11 different locations. The variables observed were the sex, race/ethnicity, and whether or not the student was listening to music at the time of observation. The units were individual people.

To collect this data, I observed Welch hall from 8:45 AM – 9:00 AM on Monday, and recorded the gender and ethnicity of each student that walked by, and whether or not they were listening to music. Then, from 10:45 – 11:00 AM, I observed Burdine hall, recording the same data. From 2:45 PM – 3:00 PM I observed RLM. On Tuesday I observed the GDC from 9:15 AM – 9:30 AM, then I observed Painter from 12:15 PM – 12:30 PM, finishing with the CLA from 3:15 PM – 3:30 PM. Wednesday, I observed the west mall from 8:45 AM – 9:00 AM, and then McCombs from 10:45 AM – 11:00 AM, finishing with the main mall from 12:45 PM – 1:00 PM. On Thursday, I observed the east mall from 9:15 AM -9:30 AM, the first floor of Jester from 12:15 PM – 12:30 PM, and the UTC from 1:45 PM – 2:00 PM.

I repeated this process three times, and observed these places on different days each new process.

url <- getURL("https://docs.google.com/spreadsheets/d/1Nwdb7NYm7uU6JtT_cbDf8gM1AImay-y55Z8VCKlxVFU/pub?gid=1089481875&single=true&output=csv")

data.sheet <- read.csv(textConnection(url)) %>%
  select(Location, Race, Sex, Yes, No)

# qplot(data=Sheet, x=Yes, y=No, color=Location)
# qplot(data=Sheet, x=Race, y=Yes-No, fill=Sex, geom="boxplot")

Response Variable

Statistics

Our Response Variable can be measured as a difference between the number listening to music (Yes) and not listening to music (No). To create this response variable, I subtracted the following two vectors:

data.sheet$Response <- data.sheet$Yes - data.sheet$No

Running a five number summary on this response vector, we get the following result:

fivenum(data.sheet$Response)

## [1] -10.0  -6.0  -3.0  -1.5   2.0

Furthermore, we can infer from the standard deviation (SD = 3.08) that our data appears to have a wide spread. To further investigate, let’s plot a histogram of the data to check for normality

Plots

suppressPackageStartupMessages(require(ggplot2))

data.sheet$Response <- as.numeric(data.sheet$Response)


ggplot(data.sheet, aes(x=Response)) +
  # geom_histogram(aes(x=Yes), alpha=0.25, color="black", fill="green", binwidth=1) +
  # geom_histogram(aes(x=No), alpha=0.25, color="black", fill="red", binwidth=1) +
  geom_histogram(alpha=1, color="black", fill="#AAAAFF", binwidth=1) +
  geom_density(fill=NA) + aes(y = ..count..) +
  labs(x="Net Number of People Wearing Headphones",
       title="Histogram of Response Variable")

A few inferences may be gathered from this these statistics/visualization.

The first inference is that we might have a binomial distribution (there is a peak at -3 and a smaller peak at -7). It’s possible that a disproportionate sample size of white students (and an under representation of black students) could be skewing the data.

However, if we disregard the possibility of a binomial distribution, our data looks to be normally distributed (without any obvious outliers).

Explanitory Variables

In this experiment, I attempted to explain the response variable on the basis of two Explanatory Variables: Sex and Race/Ethnicity.

Sex: Histogram

group_by(data.sheet, Sex) %>%
  ggplot(aes(x=Response, fill=Sex)) + 
  geom_histogram(alpha=1, color="black", binwidth=1) + 
  facet_wrap("Sex") + 
  labs(x="Net Number of People Wearing Headphones",
       title="Histogram of Response Variable\nSplit by Observed Sex")

Based on initial impressions from the histogram above, it appears that wearing headphones isn’t clearly influenced by sex.

Sex: Density

group_by(data.sheet, Sex) %>%
  ggplot(aes(x=Response, fill=Sex)) + 
  geom_density(alpha=0.5, color="black") + 
  # facet_wrap("Sex") + 
  labs(x="Net Number of People Wearing Headphones",
       title="Density of Response Variable\nSplit by Observed Sex")

This density plot represents that same distribution as the histogram above, but gives the edges a more smooth appearance. This plot strongly suggests that sex alone is not a good predictor of an individual’s preference to wear headphones.

Sex: Statistics

# Sum the 'Yes' & 'No' vectors to get our total sample size
data.sheet$Total.Responses <- data.sheet$Yes + data.sheet$No

# Make two new fata.frames, broken down by sex
Male   <- filter(data.sheet, Sex=="Men") %>% select(-Sex)
Female <- filter(data.sheet, Sex=="Women") %>% select(-Sex)

fivenum(Male$Response)
## [1] -9.0 -5.5 -3.0 -1.0  2.0
sd(Male$Response)
## [1] 2.9723

fivenum(Female$Response)
## [1] -10  -7  -3  -2   2
sd(Female$Response)
## [1] 3.20148

Again, it appears that there isn’t a significant distinction between females and males. These broad statistics support my initial inferences based on the plots

Race/Ethnicity: Histogram

group_by(data.sheet, Race) %>%
  ggplot(aes(x=Response, fill=Race)) + 
  geom_histogram(alpha=1, color="black", binwidth=1) + 
  facet_wrap("Sex") + 
  labs(x="Net Number of People Wearing Headphones",
       title="Histogram of Response Variable\nSplit by Observed Race/Ethnicity")

Based on initial impressions from the histogram above, it appears whites are less likley to wear headphones, and blacks appear to be more likley to wear headphones (this would explain the second peak around -7). Asians and hispanics seem to fall somewhere in the middle, making up the majority of the responses recorded around that mean/median.

Race/Ethnicity: Density

group_by(data.sheet, Race) %>%
  ggplot(aes(x=Response, fill=Race)) + 
  geom_density(alpha=0.5, color="black") + 
  # facet_wrap("Sex") + 
  labs(x="Net Number of People Wearing Headphones",
       title="Density of Response Variable\nSplit by Observed Race/Ethnicity")

Race/Ethnicity: Statistics

White <- filter(data.sheet, Race=="White") %>% select(-Race)
Hispanic <- filter(data.sheet, Race=="Hispanic") %>% select(-Race)
Asian <- filter(data.sheet, Race=="Asian") %>% select(-Race)
Black <- filter(data.sheet, Race=="Black")  %>% select(-Race)

fivenum(White$Response)
## [1] -10  -9  -8  -7  -2
sd(White$Response)
## [1] 1.87025

fivenum(Hispanic$Response)
## [1] -7 -4 -3 -2 -1
sd(Hispanic$Response)
## [1] 1.573592

fivenum(Asian$Response)
## [1] -7 -3 -3 -2  2
sd(Asian$Response)
## [1] 1.849301

fivenum(Black$Response)
## [1] -6 -2  0  0  2
sd(Black$Response)
## [1] 1.936771

These stats seem to suppourt my inference from the graphs, praticualry the variation between whites and blacks.