R Homework 5: Big Data Analysis ################################# #################################
In this assignment we will do our first big data analysis with a real dataset!
(Make sure to download the YRBSS Dataset on the Canvas Page)
This assignment will consist of five parts. Reference the R Lesson videos if you have any questions.
Dataset Information: Youth Risk Behavior Surveillance System (YRBSS) #####################
More information: https://www.cdc.gov/healthyyouth/data/yrbs/index.html
The Youth Risk Behavior Surveillance System (YRBSS) monitors six categories of health-related behaviors that contribute to the leading causes of death and disability among youth and adults, including:
- Behaviors that contribute to unintentional injuries and violence
- Sexual behaviors related to unintended pregnancy and sexually transmitted diseases, including HIV infection
- Alcohol and other drug use
- Tobacco use
- Unhealthy dietary behaviors
- Inadequate physical activity
YRBSS also measures the prevalence of obesity and asthma and other health-related behaviors plus sexual identity and sex of sexual contacts.
YRBSS is a system of surveys. It includes 1) a national school-based survey conducted by CDC and 2) state, territorial, tribal, and local surveys conducted by state, territorial, and local education and health agencies and tribal governments.
#UMD Resources (https://www.counseling.umd.edu/)
Counseling Services
Business Hours: 8:30 a.m. to 4:30 p.m. (Monday through Friday)
After-Hours Crisis Support Phone Services: 4:30 p.m. to 8:30 a.m. weekdays, 24 hours/day over the weekend
Phone Number: (301) 314-7651.
Accessibility and Disability Services
Business Hours: 8:30 a.m. to 4:30 p.m. (Monday through Friday)
Phone Number: (301) 314-7682.
#Part 1: Uploading Datasets ########################### ###########################
Just as in every assignment, we will first load the YRBSS dataset from our downloaded files.
#Uploading Dataset
library(readxl)
library(readr)
copy <- read.csv("/Users/sscoli/Downloads/CORRECT DATA hW 5.csv")
#Note: If readxl is not found, run install.packages("readxl") first to install the package.
#Note: Get in the habit of always keeping an original copy and doing your work on a working copy.
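A column name like `ï..ID`, which appears in the str() output later in this assignment, is usually a sign that the CSV file starts with a UTF-8 byte-order mark (BOM). The sketch below assumes that is the cause here. It uses a small temporary toy file (not the YRBSS download), and the names `original` and `working` are only illustrative.

```r
# Toy CSV written with a UTF-8 byte-order mark (BOM); a stand-in for the real download
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("ID,Age,Sex\n1,16,Male\n2,18,Female\n")), tmp)

# fileEncoding = "UTF-8-BOM" asks read.csv() to drop the BOM before parsing the header
original <- read.csv(tmp, fileEncoding = "UTF-8-BOM")

# Keep the original untouched and do all cleaning on the working copy
working <- original

names(original)
```

If the first column name comes back clean, the same fileEncoding argument may be worth trying on the real file.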
#Part 2: Data Cleaning ####################### #######################
After uploading the dataset, we will now do some data cleaning.
Since this dataset is already very large, we will use the “Listwise Deletion” method for cleaning.
mrclean <- na.omit(copy)
#Do this on the working dataset
How many rows of data did we lose from listwise deletion?
14765-7895
## [1] 6870
#Hint (compare the # rows of the original dataset with the working dataset)
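Instead of typing the two row counts by hand, the loss can be computed directly from the data frames. This is a minimal sketch on a made-up six-row data frame; `copy_demo` and `clean_demo` are hypothetical stand-ins for the original and working datasets.

```r
# Toy stand-in for the original data: 6 rows, 2 of which contain a missing value
copy_demo <- data.frame(
  Age = c(16, NA, 17, 18, 16, 15),
  Sex = c("Male", "Female", NA, "Male", "Female", "Male")
)

# Listwise deletion: drop every row that has at least one NA
clean_demo <- na.omit(copy_demo)

# Rows lost = rows before cleaning minus rows after cleaning
rows_lost <- nrow(copy_demo) - nrow(clean_demo)
rows_lost

# The same count can be read from the "na.action" attribute that na.omit() attaches
length(attr(clean_demo, "na.action"))
```

Both approaches should agree; the second one matches the `na.action` attribute that str() prints for the cleaned dataset.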
#Part 3: Describing Datasets ############################ ############################
How many Variables & Rows are there in the dataset? (Provide Code Below & Answer in ” “)
#Variables
ncol(mrclean)
## [1] 25
"25"
## [1] "25"
#Rows
nrow(mrclean)
## [1] 7895
"7895"
## [1] "7895"
Now Check if the Structure of the Variables is Correct in the Dataset
#Show Structure of Dataset
str(mrclean)
## 'data.frame': 7895 obs. of 25 variables:
## $ ï..ID : int 1 2 3 6 9 11 12 14 15 17 ...
## $ Age : int 16 18 16 16 17 16 16 17 17 17 ...
## $ Sex : chr "Male" "Male" "Female" "Female" ...
## $ Gun_Carrying : int 1 1 1 1 1 5 1 1 1 1 ...
## $ Gun_CarryingSchool : int 1 1 1 1 1 5 1 1 1 1 ...
## $ Unsafe_to_School : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Threat_at_School : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Forced_Sex : int 1 1 1 2 1 1 1 1 1 1 ...
## $ Sex_Violence : int 1 1 1 3 1 1 1 1 1 1 ...
## $ Bullied_School : int 1 1 1 2 2 2 1 1 1 1 ...
## $ Bullied_Online : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Suicide_Consider : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Suicide_Plan : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Suicide_Attempt : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Alcohol_Days : int 1 1 1 3 1 4 1 1 2 1 ...
## $ Alcohol_Binge : int 1 1 1 3 1 3 1 1 1 1 ...
## $ Alcohol_Max : int 1 1 1 7 1 6 1 1 2 1 ...
## $ Marijuana_Days : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ever_Cocaine : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ever_Inhalant : int 1 1 1 2 1 1 1 1 1 1 ...
## $ Ever_Heroin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ever_Meth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ever_Ecstasy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ever_Steroid : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ever_PrescriptionAbuse: int 1 1 1 2 1 1 1 1 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:6870] 4 5 7 8 10 13 16 23 28 29 ...
## ..- attr(*, "names")= chr [1:6870] "4" "5" "7" "8" ...
Convert the Sex Variable from a Character to a Factor Variable in the Working Dataset
mrclean$Sex <- as.factor(mrclean$Sex)
Check if you were able to successfully change the Sex Variable into a Factor Variable in the Working Dataset
str(mrclean)
## 'data.frame': 7895 obs. of 25 variables:
## $ ï..ID : int 1 2 3 6 9 11 12 14 15 17 ...
## $ Age : int 16 18 16 16 17 16 16 17 17 17 ...
## $ Sex : Factor w/ 3 levels " ","Female","Male": 3 3 2 2 2 3 2 3 2 2 ...
## $ Gun_Carrying : int 1 1 1 1 1 5 1 1 1 1 ...
## $ Gun_CarryingSchool : int 1 1 1 1 1 5 1 1 1 1 ...
## $ Unsafe_to_School : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Threat_at_School : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Forced_Sex : int 1 1 1 2 1 1 1 1 1 1 ...
## $ Sex_Violence : int 1 1 1 3 1 1 1 1 1 1 ...
## $ Bullied_School : int 1 1 1 2 2 2 1 1 1 1 ...
## $ Bullied_Online : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Suicide_Consider : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Suicide_Plan : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Suicide_Attempt : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Alcohol_Days : int 1 1 1 3 1 4 1 1 2 1 ...
## $ Alcohol_Binge : int 1 1 1 3 1 3 1 1 1 1 ...
## $ Alcohol_Max : int 1 1 1 7 1 6 1 1 2 1 ...
## $ Marijuana_Days : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ever_Cocaine : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ever_Inhalant : int 1 1 1 2 1 1 1 1 1 1 ...
## $ Ever_Heroin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ever_Meth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ever_Ecstasy : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ever_Steroid : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Ever_PrescriptionAbuse: int 1 1 1 2 1 1 1 1 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:6870] 4 5 7 8 10 13 16 23 28 29 ...
## ..- attr(*, "names")= chr [1:6870] "4" "5" "7" "8" ...
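#Note: as.factor() orders the levels alphabetically, and the str() output shows 3 levels with the blank level first:
#" " (blank) = 1
#Female = 2
#Male = 3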
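The str() output above shows Sex with three levels, one of them blank. na.omit() does not remove those rows because a blank string is not NA. One possible way to handle this, sketched on a made-up vector (`sex_raw` is hypothetical, not the real column), is to recode blanks to NA before converting to a factor.

```r
# Toy character vector with one blank entry, standing in for the Sex column
sex_raw <- c("Male", "Male", "Female", " ", "Female", "Male")

# Treat empty or whitespace-only strings as missing
sex_clean <- sex_raw
sex_clean[trimws(sex_clean) == ""] <- NA

# Now the factor only has the two real levels
sex_factor <- factor(sex_clean)
levels(sex_factor)

# Count each level, including the missing values
table(sex_factor, useNA = "ifany")
```

Whether to drop or keep those rows is a judgment call; the point of the sketch is only that the blank level disappears once blanks are marked as NA.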
#Part 4: Correlation Analysis ############################# #############################
Using the variables in the YRBSS dataset (listed on the Word Document titled YRBSS Variables), select appropriate variables for the correlation analysis:
- Correlation Analysis: Two Ordinal or Continuous Variables
Feel free to select any variables that you are interested in! We are doing our first real statistical analysis here.
#What Two Variables Did you Pick?
"Gun_Carrying & Unsafe_to_School"
## [1] "Gun_Carrying & Unsafe_to_School"
#Why did you pick them?
"These two stuck out to me because they're likely to be correlated. If someone brings a gun into school, that will make students feel unsafe."
## [1] "These two stuck out to me because they're likely to be correlated. If someone brings a gun into school, that will make students feel unsafe."
#Make a hypothesis of the relationship between the variables. Try to reference actual Psychological Theory and/or Empirical Research.
"If Gun_Carrying increases, then there will be an increase in Unsafe_to_School (positively correlated). I expect this because shootings in schools are not uncommon."
## [1] "If Gun_Carrying increases, then there will be an increase in Unsafe_to_School (positively correlated). I expect this because shootings in schools are not uncommon."
#What type of Correlation do you need to run? Pearson or Spearman?
"Spearman"
## [1] "Spearman"
Describe the Variables you Selected
#Are these two variables continuous or ordinal?
"These two variables are ordinal"
## [1] "These two variables are ordinal"
#What is the Mean & Median for Variable 1?
mean(mrclean$Gun_Carrying)
## [1] 1.380367
median(mrclean$Gun_Carrying)
## [1] 1
"The mean is 1.380367 & the median is 1 for variable 1."
## [1] "The mean is 1.380367 & the median is 1 for variable 1."
#What is the Mean & Median for Variable 2?
mean(mrclean$Unsafe_to_School)
## [1] 1.082711
median(mrclean$Unsafe_to_School)
## [1] 1
"The mean is 1.082711 & the median is 1 for variable 2."
## [1] "The mean is 1.082711 & the median is 1 for variable 2."
Statistical Analysis ######################
Now let’s use the cor.test() function to test our correlational analysis.
cor.test(mrclean$Gun_Carrying, mrclean$Unsafe_to_School, method = "spearman")
## Warning in cor.test.default(mrclean$Gun_Carrying, mrclean$Unsafe_to_School, :
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: mrclean$Gun_Carrying and mrclean$Unsafe_to_School
## S = 7.6928e+10, p-value = 3.441e-08
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.06204484
#Note: Make sure to designate pearson or spearman... If you do not, pearson will be the default.
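The "Cannot compute exact p-value with ties" warning shows up because ordinal survey items have many repeated values. A minimal sketch on simulated 1-to-5 responses (not the YRBSS data) shows that passing exact = FALSE requests the approximate p-value up front, which avoids the warning. The variable names here are hypothetical.

```r
set.seed(1)

# Simulated ordinal responses on a 1-5 scale, so there are many ties
x <- sample(1:5, 200, replace = TRUE)
# y tracks x with a little noise, clipped back to the 1-5 range
y <- pmin(5, pmax(1, x + sample(-1:1, 200, replace = TRUE)))

# exact = FALSE uses the approximate p-value, which is what R falls back to with ties
res <- cor.test(x, y, method = "spearman", exact = FALSE)

res$estimate  # Spearman's rho: sign gives the direction, size gives the strength
res$p.value   # p-value for the null hypothesis that rho equals 0
```

Reading rho and the p-value separately is useful: with thousands of rows, even a very weak correlation can have a tiny p-value.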
Was your hypothesis supported? How do you know?
"My hypothesis is supported because the p-value (3.441e-08) is very small, meaning we reject the null hypothesis of no correlation. The correlation is positive (rho = 0.062), although it is very weak."
## [1] "My hypothesis is supported because the p-value (3.441e-08) is very small, meaning we reject the null hypothesis of no correlation. The correlation is positive (rho = 0.062), although it is very weak."
#Note: Reference the p-value and directionality of the relationship
Provide an explanation for why your hypothesis was supported or not supported. Again try to reference psychological theory or any possible limitations of the survey design and/or sample of the dataset.
"My hypothesis was supported because if there is an increase in gun carrying, then more students will feel unsafe at school. Someone can possibly bring a gun into the school and do harm."
## [1] "My hypothesis was supported because if there is an increase in gun carrying, then more students will feel unsafe at school. Someone can possibly bring a gun into the school and do harm."
#Part 5: Two-Sample T-Test Analysis ################################## ##################################
Using the variables in the YRBSS dataset (listed on the Word Document titled YRBSS Variables), select appropriate variables for the two-sample t-test analysis:
- Two-Sample T-Test Analysis: One Binary Variable & One Continuous or Ordinal Variable
Again, feel free to select any variables that you are interested in!
#What Two Variables Did you Pick?
"I picked Bullied_Online & Suicide_Consider"
## [1] "I picked Bullied_Online & Suicide_Consider"
#Why did you pick them?
"I picked these two variables because there is a good chance of a positive correlation between the two. I also have been bullied online before."
## [1] "I picked these two variables because there is a good chance of a positive correlation between the two. I also have been bullied online before."
#Make a hypothesis of the relationship between the variables. Try to reference actual Psychological Theory and/or Empirical Research.
"If a person is bullied online, then they're more likely to have suicidal thoughts. There have been many research studies that support online bullying leading to suicidal thoughts."
## [1] "If a person is bullied online, then they're more likely to have suicidal thoughts. There have been many research studies that support online bullying leading to suicidal thoughts."
Describe the Variables you Selected
#First, Make sure that your binary factor is a factor variable! Use levels() to check if you were able to successfully convert the variable to a factor variable.
mrclean$Bullied_Online <- as.factor(mrclean$Bullied_Online)
levels(mrclean$Bullied_Online)
## [1] "1" "2"
#What is the sample size (i.e., # of rows) for each level of the Binary Factor Variable?
library(psych)
describeBy(mrclean, mrclean$Bullied_Online)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min
## ï..ID 1 6804 6033.31 3673.13 5859.5 5866.10 4218.00 1
## Age 2 6804 16.05 1.23 16.0 16.07 1.48 12
## Sex* 3 6804 2.51 0.50 3.0 2.51 0.00 1
## Gun_Carrying 4 6804 1.37 1.05 1.0 1.04 0.00 1
## Gun_CarryingSchool 5 6804 1.08 0.53 1.0 1.00 0.00 1
## Unsafe_to_School 6 6804 1.06 0.33 1.0 1.00 0.00 1
## Threat_at_School 7 6804 1.07 0.48 1.0 1.00 0.00 1
## Forced_Sex 8 6804 1.04 0.20 1.0 1.00 0.00 1
## Sex_Violence 9 6804 1.10 0.49 1.0 1.00 0.00 1
## Bullied_School 10 6804 1.11 0.31 1.0 1.01 0.00 1
## Bullied_Online* 11 6804 1.00 0.00 1.0 1.00 0.00 1
## Suicide_Consider 12 6804 1.13 0.34 1.0 1.04 0.00 1
## Suicide_Plan 13 6804 1.10 0.30 1.0 1.00 0.00 1
## Suicide_Attempt 14 6804 1.07 0.38 1.0 1.00 0.00 1
## Alcohol_Days 15 6804 1.43 0.92 1.0 1.18 0.00 1
## Alcohol_Binge 16 6804 1.28 0.88 1.0 1.03 0.00 1
## Alcohol_Max 17 6804 1.79 1.78 1.0 1.29 0.00 1
## Marijuana_Days 18 6804 1.40 1.09 1.0 1.08 0.00 1
## Ever_Cocaine 19 6804 1.06 0.40 1.0 1.00 0.00 1
## Ever_Inhalant 20 6804 1.08 0.46 1.0 1.00 0.00 1
## Ever_Heroin 21 6804 1.01 0.25 1.0 1.00 0.00 1
## Ever_Meth 22 6804 1.03 0.32 1.0 1.00 0.00 1
## Ever_Ecstasy 23 6804 1.04 0.32 1.0 1.00 0.00 1
## Ever_Steroid 24 6804 1.03 0.33 1.0 1.00 0.00 1
## Ever_PrescriptionAbuse 25 6804 1.22 0.74 1.0 1.01 0.00 1
## max range skew kurtosis se
## ï..ID 14741 14740 0.38 -0.47 44.53
## Age 18 6 -0.04 -0.98 0.01
## Sex* 3 2 -0.08 -1.87 0.01
## Gun_Carrying 5 4 2.82 6.48 0.01
## Gun_CarryingSchool 5 4 6.82 45.78 0.01
## Unsafe_to_School 5 4 7.14 60.16 0.00
## Threat_at_School 8 7 10.47 130.60 0.01
## Forced_Sex 2 1 4.55 18.74 0.00
## Sex_Violence 5 4 5.68 35.58 0.01
## Bullied_School 2 1 2.55 4.50 0.00
## Bullied_Online* 1 0 NaN NaN 0.00
## Suicide_Consider 2 1 2.15 2.64 0.00
## Suicide_Plan 2 1 2.67 5.11 0.00
## Suicide_Attempt 5 4 6.55 49.20 0.00
## Alcohol_Days 7 6 2.72 8.15 0.01
## Alcohol_Binge 7 6 3.59 13.57 0.01
## Alcohol_Max 8 7 2.37 4.46 0.02
## Marijuana_Days 6 5 3.06 8.76 0.01
## Ever_Cocaine 6 5 9.20 95.29 0.00
## Ever_Inhalant 6 5 7.74 67.37 0.01
## Ever_Heroin 6 5 18.50 355.80 0.00
## Ever_Meth 6 5 13.20 184.36 0.00
## Ever_Ecstasy 6 5 10.98 140.56 0.00
## Ever_Steroid 6 5 12.12 161.91 0.00
## Ever_PrescriptionAbuse 6 5 4.24 19.65 0.01
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min
## ï..ID 1 1091 6004.13 3598.48 5800 5845.18 4124.59 36
## Age 2 1091 15.87 1.23 16 15.85 1.48 12
## Sex* 3 1091 2.31 0.48 2 2.27 0.00 1
## Gun_Carrying 4 1091 1.46 1.14 1 1.13 0.00 1
## Gun_CarryingSchool 5 1091 1.13 0.65 1 1.00 0.00 1
## Unsafe_to_School 6 1091 1.24 0.72 1 1.05 0.00 1
## Threat_at_School 7 1091 1.33 1.14 1 1.04 0.00 1
## Forced_Sex 8 1091 1.17 0.38 1 1.09 0.00 1
## Sex_Violence 9 1091 1.48 0.99 1 1.24 0.00 1
## Bullied_School 10 1091 1.67 0.47 2 1.71 0.00 1
## Bullied_Online* 11 1091 2.00 0.00 2 2.00 0.00 2
## Suicide_Consider 12 1091 1.39 0.49 1 1.36 0.00 1
## Suicide_Plan 13 1091 1.31 0.46 1 1.26 0.00 1
## Suicide_Attempt 14 1091 1.37 0.86 1 1.13 0.00 1
## Alcohol_Days 15 1091 1.77 1.26 1 1.49 0.00 1
## Alcohol_Binge 16 1091 1.55 1.25 1 1.22 0.00 1
## Alcohol_Max 17 1091 2.31 2.16 1 1.84 0.00 1
## Marijuana_Days 18 1091 1.69 1.41 1 1.31 0.00 1
## Ever_Cocaine 19 1091 1.16 0.73 1 1.00 0.00 1
## Ever_Inhalant 20 1091 1.25 0.86 1 1.02 0.00 1
## Ever_Heroin 21 1091 1.07 0.57 1 1.00 0.00 1
## Ever_Meth 22 1091 1.09 0.60 1 1.00 0.00 1
## Ever_Ecstasy 23 1091 1.11 0.59 1 1.00 0.00 1
## Ever_Steroid 24 1091 1.11 0.61 1 1.00 0.00 1
## Ever_PrescriptionAbuse 25 1091 1.52 1.15 1 1.22 0.00 1
## max range skew kurtosis se
## ï..ID 14722 14686 0.39 -0.40 108.94
## Age 18 6 -0.01 -0.81 0.04
## Sex* 3 2 0.65 -1.10 0.01
## Gun_Carrying 5 4 2.37 4.16 0.03
## Gun_CarryingSchool 5 4 5.26 27.04 0.02
## Unsafe_to_School 5 4 3.55 13.26 0.02
## Threat_at_School 8 7 4.43 20.79 0.03
## Forced_Sex 2 1 1.74 1.03 0.01
## Sex_Violence 5 4 2.21 4.24 0.03
## Bullied_School 2 1 -0.71 -1.50 0.01
## Bullied_Online* 2 0 NaN NaN 0.00
## Suicide_Consider 2 1 0.47 -1.78 0.01
## Suicide_Plan 2 1 0.84 -1.30 0.01
## Suicide_Attempt 5 4 2.66 6.82 0.03
## Alcohol_Days 7 6 1.92 3.50 0.04
## Alcohol_Binge 7 6 2.46 5.66 0.04
## Alcohol_Max 8 7 1.55 1.00 0.07
## Marijuana_Days 6 5 2.09 3.18 0.04
## Ever_Cocaine 6 5 5.41 30.06 0.02
## Ever_Inhalant 6 5 4.16 18.03 0.03
## Ever_Heroin 6 5 8.24 67.13 0.02
## Ever_Meth 6 5 7.56 57.64 0.02
## Ever_Ecstasy 6 5 6.86 50.50 0.02
## Ever_Steroid 6 5 6.64 46.50 0.02
## Ever_PrescriptionAbuse 6 5 2.48 5.67 0.03
"6804 people were not bullied & 1091 were bullied."
## [1] "6804 people were not bullied & 1091 were bullied."
#Hint: try Googling the table() function to do this... This is a shortcut! Otherwise you can try other methods we have learned to do so. There are many ways to do this!
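As the hint suggests, table() gives the group sizes without printing the full descriptive output. Here is a minimal sketch on a made-up vector; the labels "No" and "Yes" assume that 1 means not bullied online and 2 means bullied online, and `bullied_demo` is hypothetical.

```r
# Toy binary variable coded 1/2, standing in for Bullied_Online
bullied_demo <- factor(c(1, 1, 2, 1, 2, 1, 1),
                       levels = c(1, 2),
                       labels = c("No", "Yes"))  # assumed coding: 1 = No, 2 = Yes

# Number of rows in each level
counts <- table(bullied_demo)
counts

# Share of the sample in each level
prop.table(counts)
```

Adding labels when creating the factor also makes later output easier to read than "group 1" and "group 2".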
#What is the Mean & Median for the Continuous or Ordinal Variable?
mean(mrclean$Suicide_Consider)
## [1] 1.168461
median(mrclean$Suicide_Consider)
## [1] 1
"mean = 1.168461 & median = 1"
## [1] "mean = 1.168461 & median = 1"
#Is this variable Continuous or Ordinal?
"This variable is ordinal"
## [1] "This variable is ordinal"
Statistical Analysis ######################
Now let’s use the t.test() function to run our two-sample t-test. Let’s make our t-test two-sided and assume that the variances of both groups are equal (i.e., pooled variance estimate, or Student’s t-test)!
Remember X will be our binary factor variable and y will be our ordinal/continuous variable.
t.test(mrclean$Suicide_Consider ~ mrclean$Bullied_Online, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mrclean$Suicide_Consider by mrclean$Bullied_Online
## t = -21.249, df = 7893, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -0.2755602 -0.2290131
## sample estimates:
## mean in group 1 mean in group 2
## 1.133598 1.385885
#Hint: t.test(y variable ~ x variable, alternative = "two.sided", "less", or "greater", var.equal = TRUE or FALSE)
#Hint: t.test(y variable ~ x variable, alternative = "two.sided", var.equal = TRUE)
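To see what var.equal changes, the sketch below runs both versions of the test on simulated data (not the YRBSS data) with unequal group sizes and spreads, and then computes Cohen's d as one possible effect size. The data frame `demo` and every name in it are hypothetical.

```r
set.seed(42)

# Simulated scores for two groups of different size and spread
demo <- data.frame(
  group = factor(rep(c("1", "2"), times = c(300, 60))),
  score = c(rnorm(300, mean = 1.1, sd = 0.3),
            rnorm(60,  mean = 1.4, sd = 0.5))
)

# Student's t-test: pooled variance, df = n1 + n2 - 2
pooled <- t.test(score ~ group, data = demo, var.equal = TRUE)
# Welch's t-test: no equal-variance assumption, df is adjusted downward
welch <- t.test(score ~ group, data = demo, var.equal = FALSE)

pooled$parameter
welch$parameter

# Cohen's d: difference in group means divided by the pooled standard deviation
m <- tapply(demo$score, demo$group, mean)
s <- tapply(demo$score, demo$group, sd)
n <- tapply(demo$score, demo$group, length)
sd_pooled <- sqrt(((n[1] - 1) * s[1]^2 + (n[2] - 1) * s[2]^2) / (n[1] + n[2] - 2))
cohens_d <- unname((m[2] - m[1]) / sd_pooled)
cohens_d
```

The p-value says whether a difference is likely to be real; an effect size such as Cohen's d says how large it is, which helps when describing the result in words.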
Was your hypothesis supported? How do you know?
"My hypothesis was supported because the p-value (< 2.2e-16) is very small, meaning we reject the null hypothesis of no difference in means. Though the difference is small numerically, group 2 (bullied online) has a higher mean on Suicide_Consider (1.386) than group 1 (not bullied online, 1.134), so the difference is in the predicted direction."
## [1] "My hypothesis was supported because the p-value (< 2.2e-16) is very small, meaning we reject the null hypothesis of no difference in means. Though the difference is small numerically, group 2 (bullied online) has a higher mean on Suicide_Consider (1.386) than group 1 (not bullied online, 1.134), so the difference is in the predicted direction."
#Note: Reference the p-value and directionality of the relationship (reference the means)
#Note: Remember what group 1 & group 2 refer to!
Provide an explanation for why your hypothesis was supported or not supported. Again try to reference psychological theory or any possible limitations of the survey design and/or sample of the dataset.
"My hypothesis was supported because if someone is bullied online, especially if it happens continuously, they're more likely to experience distressing thoughts such as thoughts of suicide."
## [1] "My hypothesis was supported because if someone is bullied online, especially if it happens continuously, they're more likely to experience distressing thoughts such as thoughts of suicide."