## googlesheets dplyr reshape2 ggplot2 ggthemes
## TRUE TRUE TRUE TRUE TRUE
My sample subjects were students all over campus, and my sample size was 1875 observations at 11 different locations. The variables observed were the sex, race/ethnicity, and whether or not the student was listening to music at the time of observation. The units were individual people.
To collect this data, I observed Welch hall from 8:45 AM – 9:00 AM on Monday, and recorded the gender and ethnicity of each student that walked by, and whether or not they were listening to music. Then, from 10:45 – 11:00 AM, I observed Burdine hall, recording the same data. From 2:45 PM – 3:00 PM I observed RLM. On Tuesday I observed the GDC from 9:15 AM – 9:30 AM, then I observed Painter from 12:15 PM – 12:30 PM, finishing with the CLA from 3:15 PM – 3:30 PM. Wednesday, I observed the west mall from 8:45 AM – 9:00 AM, and then McCombs from 10:45 AM – 11:00 AM, finishing with the main mall from 12:45 PM – 1:00 PM. On Thursday, I observed the east mall from 9:15 AM -9:30 AM, the first floor of Jester from 12:15 PM – 12:30 PM, and the UTC from 1:45 PM – 2:00 PM.
I repeated this process three times, and observed these places on different days each new process.
# url <- getURL("https://docs.google.com/spreadsheets/d/1Nwdb7NYm7uU6JtT_cbDf8gM1AImay-y55Z8VCKlxVFU/pub?gid=1089481875&single=true&output=csv")
# read.csv(textConnection(url))
data.sheet <- gs_title("Music Form (Responses)") %>% gs_read_csv("Dataset") %>%
select(Location, Race, Sex, Yes, No)
Our Response Variable can be measured as a difference between the number listening to music (Yes) and not listening to music (No). To create this response variable, I subtracted the following two vectors:
# This is how the class wants you to do it
data.sheet$Response <- data.sheet$Yes - data.sheet$No
# # This is how I'd do it (using dplyr)
# data.sheet %>% mutate(Response = Yes - No)
Running a five number summary on this response vector, we get the following result:
fivenum(data.sheet$Response)
## [1] -10.0 -6.0 -3.0 -1.5 2.0
Furthermore, we can infer from the standard deviation (SD = 3.08) that our data appears to have a wide spread. To further investigate, let’s plot a histogram of the data to check for normality
hisgrm <- data.sheet %>%
ggplot(aes(x=Response)) +
labs(x="Net Number of People Wearing Headphones", y="Count",
title="Histogram of Response Variable")
# Net Results
hisgrm +
geom_histogram(alpha=1, color="black", fill="#AAAAFF", binwidth=1) +
geom_density(fill=NA) + aes(y = ..count..)
A few inferences may be gathered from this these statistics/visualization.
The first inference is that we might have a binomial distribution (there is a peak at -3 and a smaller peak at -7). It’s possible that a disproportionate sample size of white students (and an under representation of black students) could be skewing the data.
However, if we disregard the possibility of a binomial distribution, our data looks to be normally distributed (without any obvious outliers).
# Disagregated results
hisgrm +
geom_histogram(aes(x=Yes), alpha=0.25, color="black", fill="green", binwidth=1) +
geom_histogram(aes(x=No), alpha=0.25, color="black", fill="red", binwidth=1) +
labs(x="Number of People Observed at Location",
title="Histogram of Response Variable \n(disagregated)")
In this visualization, we’ve dis-aggregated the Response variable into it’s two components, Yes (in green, representing the number of people who were wearing headphones) and No (in red, representing the number of people who were not wearing headphones).
Remember, the x-axis, the count observed for each demographic at each location, should vary by the total number of observations per demographic. For example, if we presume that there is a 50/50 split between Yes and No, then we’d expect that larger cohorts (e.g. white men at Welch) will have more No’s and more Yes’s than smaller cohorts (e.g. black women at GDC).
In essence, we don’t care as much about the absolute numbers, but instead the proportions of the Yes count to the No count (which is what our Response vector represents).
In this experiment, I attempted to explain the response variable on the basis of two Explanatory Variables: Sex and Race/Ethnicity.
group_by(data.sheet, Sex) %>%
ggplot(aes(x=Response, fill=Sex)) +
geom_histogram(alpha=1, color="black", binwidth=1) +
facet_wrap("Sex") +
labs(x="Net Number of People Wearing Headphones",
title="Histogram of Response Variable\nSplit by Observed Sex")
Based on initial impressions from the histogram above, it appears that wearing headphones isn’t clearly influenced by sex.
group_by(data.sheet, Sex) %>%
ggplot(aes(x=Response, fill=Sex)) +
geom_density(alpha=0.5, color="black") +
labs(x="Net Number of People Wearing Headphones",
title="Density of Response Variable\n Split by Observed Sex")
This density plot represents that same distribution as the histogram above, but gives the edges a more smooth appearance. This plot strongly suggests that sex alone is not a good predictor of an individual’s preference to wear headphones.
Sex: Statistics
# Sum the 'Yes' & 'No' vectors to get our total sample size
data.sheet$Total.Responses <- data.sheet$Yes + data.sheet$No
# Make two new fata.frames, broken down by sex
Male <- data.sheet %>% filter(Sex=="Men") %>% select(-Sex)
Female <- data.sheet %>% filter(Sex=="Women") %>% select(-Sex)
fivenum(Male$Response)
## [1] -9.0 -5.5 -3.0 -1.0 2.0
sd(Male$Response)
## [1] 2.9723
fivenum(Female$Response)
## [1] -10 -7 -3 -2 2
sd(Female$Response)
## [1] 3.20148
Again, it appears that there isn’t a significant distinction between females and males. These broad statistics support my initial inferences based on the plots
data.sheet %>%
group_by(Race) %>%
ggplot(aes(x=Response, fill=Race)) +
geom_histogram(alpha=1, color="black", binwidth=1) +
facet_wrap("Sex") +
labs(x="Net Number of People Wearing Headphones",
title="Histogram of Response Variable\nSplit by Sex; Colored by Observed Race/Ethnicity")
Based on initial impressions from the histogram above, it appears whites are less likely to wear headphones, and blacks appear to be more likely to wear headphones (this would explain the second peak around -7). Asians and Hispanics seem to fall somewhere in the middle, making up the majority of the responses recorded around that mean/median.
data.sheet %>%
group_by(Race) %>%
ggplot(aes(x=Response, fill=Race)) +
geom_density(alpha=0.5, color="black") +
# facet_wrap("Sex") +
labs(x="Net Number of People Wearing Headphones",
title="Density of Response Variable\nColored by Observed Race/Ethnicity")
White <- data.sheet %>% filter(Race=="White") %>% select(-Race)
Hispanic <- data.sheet %>% filter(Race=="Hispanic") %>% select(-Race)
Asian <- data.sheet %>% filter(Race=="Asian") %>% select(-Race)
Black <- data.sheet %>% filter(Race=="Black") %>% select(-Race)
fivenum(White$Response)
## [1] -10 -9 -8 -7 -2
sd(White$Response)
## [1] 1.87025
fivenum(Hispanic$Response)
## [1] -7 -4 -3 -2 -1
sd(Hispanic$Response)
## [1] 1.573592
fivenum(Asian$Response)
## [1] -7 -3 -3 -2 2
sd(Asian$Response)
## [1] 1.849301
fivenum(Black$Response)
## [1] -6 -2 0 0 2
sd(Black$Response)
## [1] 1.936771
These stats seem to support my inference from the graphs, practicality the variation between whites and blacks.
# Make a data.frame with summary totals
data.sheet_sum <- data.sheet %>%
# Group by explanitory variables
group_by(Location, Race, Sex) %>%
# Collapse table by groups, suming the response and sample size
summarise(
Yes = sum(Yes, na.rm=T),
Response = sum(Response, na.rm=T),
n = sum(Total.Responses, na.rm=T))
The heat maps above illustrates the percent of students who were wearing headphones, relative to the total size of the cohort. The color and number are scaled appropriately, and this view is useful for a visual deception of the variation among our dimensions
--- LICENSE ---
Copyright (C) 2016 Hunter Ratliff
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Source Code: Github: HunterRatliff1/Geni-Project
Published to: RPubs
Author: Hunter Ratliff @HunterRatliff1
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
In the spirit of Reproducible Research, below is the information About the R Session at the time it was compiled:
devtools::session_info()
## setting value
## version R version 3.2.3 (2015-12-10)
## system x86_64, darwin14.5.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2016-02-11
##
## package * version date source
## assertthat 0.1 2013-12-06 CRAN (R 3.2.0)
## cellranger 1.0.0 2015-06-20 CRAN (R 3.2.0)
## colorspace 1.2-6 2015-03-11 CRAN (R 3.2.0)
## curl 0.9.4 2015-11-20 CRAN (R 3.2.2)
## DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
## devtools 1.10.0 2016-01-23 CRAN (R 3.2.3)
## digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
## dplyr * 0.4.3 2015-09-01 CRAN (R 3.2.0)
## evaluate 0.8 2015-09-18 CRAN (R 3.2.0)
## formatR 1.2.1 2015-09-18 CRAN (R 3.2.0)
## ggplot2 * 2.0.0 2015-12-18 CRAN (R 3.2.3)
## ggthemes * 3.0.1 2016-01-10 CRAN (R 3.2.3)
## googlesheets * 0.1.0 2015-07-05 CRAN (R 3.2.0)
## gtable 0.1.2 2012-12-05 CRAN (R 3.2.0)
## htmltools 0.3 2015-12-29 CRAN (R 3.2.3)
## httr 1.0.0 2015-06-25 CRAN (R 3.2.0)
## jsonlite 0.9.19 2015-11-28 CRAN (R 3.2.2)
## knitr * 1.12.3 2016-01-22 CRAN (R 3.2.3)
## labeling 0.3 2014-08-23 CRAN (R 3.2.0)
## lazyeval 0.1.10 2015-01-02 CRAN (R 3.2.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
## memoise 1.0.0 2016-01-29 CRAN (R 3.2.3)
## munsell 0.4.2 2013-07-11 CRAN (R 3.2.0)
## plyr 1.8.3 2015-06-12 CRAN (R 3.2.0)
## R6 2.1.1 2015-08-19 CRAN (R 3.2.0)
## Rcpp 0.12.2 2015-11-15 CRAN (R 3.2.2)
## reshape2 * 1.4.1 2014-12-06 CRAN (R 3.2.0)
## rmarkdown 0.9.2 2016-01-01 CRAN (R 3.2.3)
## scales 0.3.0 2015-08-25 CRAN (R 3.2.0)
## stringi 1.0-1 2015-10-22 CRAN (R 3.2.0)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
## xml2 0.1.2 2015-09-01 CRAN (R 3.2.0)
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.0)