# Load any R Packages you may need
library(tidyverse)
library(fivethirtyeight)
library(mosaic)
Reminder: Knit early and often!
Head Start is a program that promotes school readiness for children of low-income families. This federally funded program is supposed to provide the students with skills that will help them through high school. In a scientific study to determine the effectiveness of this program, researchers collected data from 10,000 children. These children were split into two research groups based on whether they had participated in the Head Start program or not. This longitudinal study followed the children until they were adults and recorded whether they had graduated from high school (yes/no). The study demonstrated that the Head Start program has a positive impact on students’ graduation rates.
Identify the variables and their types (categorical or numerical).
Explanatory Variable: Participation in Head Start (yes/no). This is a categorical variable.
Response Variable: High school graduation status (yes/no). This is a categorical variable.
Identify the response/outcome and the explanatory variables.
Response/Outcome Variable: Whether they graduated from high school.
Explanatory Variable: Whether they participated in Head Start.
State the main research question in this study.
Does participating in the Head Start early childhood education program increase the likelihood of graduating from high school?
Researchers collected data to examine the relationship between pollutants and pre-term births in Southern California. During the study, air pollution levels were measured by air quality monitoring stations. Specifically, levels of carbon monoxide were recorded in parts per million, nitrogen dioxide and ozone in parts per hundred million, and coarse particulate matter (PM_10) in µg/m3. Length of gestation data were collected on 143,196 births between the years 1989 and 1993, and air pollution exposure during gestation was calculated for each birth. The analysis suggested that increased ambient PM_10 and, to a lesser degree, CO concentrations may be associated with the occurrence of preterm births.
Identify the variables and their types (categorical or numerical).
Explanatory Variables: Levels of pollutants in Southern California, specifically CO, NO2, ozone, and PM_10. These are numerical variables because they measure how much of each pollutant is present, not just whether it is present.
Outcome Variable: Length of gestation for births between 1989 and 1993. This is a numerical variable.
Identify the response/outcome and the explanatory variables.
Response/Outcome Variable: Length of gestation, which determines whether a birth is preterm. This is a numerical variable.
Explanatory Variable: Amount of pollutants in the air in Southern California.
State the main research question in this study.
Does the amount of pollutants in the air affect the length of gestation? That is, is there a relationship between the level of air pollutants in Southern California and the number of preterm births in 1989-1993?
Suppose that a statistics professor records the following for each student enrolled in their class:
For the following questions, identify the response/outcome variable and the explanatory variable(s). Also classify each variable as quantitative or categorical. For categorical variables, also indicate whether the variable is binary (i.e., has only two levels).
Does political inclination vary among majors?
Explanatory Variable: Major. This is categorical (nominal) and not binary, since there is a variety of possible majors.
Outcome Variable: Political inclination. This is categorical and not binary, since there are three options, not two.
Is a student’s score on the first exam useful for predicting their score on the final exam?
Outcome Variable: Score on the final exam. This is quantitative.
Explanatory Variable: Score on the first exam. This is quantitative.
Can we tell much about a student’s handedness by knowing their average sleeping time, major, and time spent on the final exam?
Outcome Variable: Handedness. This is categorical.
Explanatory Variables: Average sleeping time (quantitative), major (categorical), and time spent on the final exam (quantitative).
No data are given here to support a relationship between a student’s handedness and their average sleeping time, major, or time spent on the final exam. From the perspective of a psychology major, it seems unlikely that a student’s handedness could be determined from these three variables.
The flying dataset in the fivethirtyeight R
package contains data from a recent article on airplane etiquette (view
article here).
You may view the dataset by running the following:
View(flying)
How many rows are in the dataset? What does each row in the dataset represent?
There are 1040 rows (observations). Each row represents one survey participant, so 1040 people responded to the survey.
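The row count can also be checked in code rather than by scrolling the viewer. The sketch below uses a small made-up data frame, `toy_survey`, as a stand-in (its name and values are hypothetical); with the fivethirtyeight package loaded, the same calls should work on `flying`.

```r
# Hypothetical stand-in for a survey dataset: one row per respondent.
toy_survey <- data.frame(
  respondent_id = 1:4,
  talk_stranger = c("No", "Somewhat", "No", NA)
)

n_rows <- nrow(toy_survey)  # number of rows (respondents)
n_cols <- ncol(toy_survey)  # number of columns (variables)
dim(toy_survey)             # both at once: rows first, then columns

# With library(fivethirtyeight) loaded, nrow(flying) would give
# the row count of the real dataset in the same way.
```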
Using the help page for the flying dataset
(?flying), identify four variables in the dataset and
describe what they mean.
1.) frequency: How often do you travel by plane?
2.) recline_obligation: Under normal circumstances, does a person who reclines their seat during a flight have any obligation to the person sitting behind them?
3.) switch_seats_friends: Is it rude to ask someone to switch seats with you in order to be closer to friends?
4.) smoked: Have you ever smoked a cigarette in an airplane bathroom when it was against the rules?
Use the summary() function with the
talk_stranger variable (Hint: Use the $
operator). Explain why the result is different from when we used the
summary() function in the Module 1 slides.
summary(flying$talk_stranger)
## No Somewhat Very NA's
## 675 153 27 185
summary(flying)
## respondent_id gender age height
## Min. :3.432e+09 Length:1040 18-29:220 5'4" : 79
## 1st Qu.:3.432e+09 Class :character 30-44:254 5'7" : 76
## Median :3.433e+09 Mode :character 45-60:275 5'8" : 76
## Mean :3.433e+09 > 60 :258 5'6" : 75
## 3rd Qu.:3.433e+09 NA's : 33 5'9" : 72
## Max. :3.436e+09 (Other):480
## NA's :182
## children_under_18 household_income
## Mode :logical $0 - $24,999 : 99
## FALSE:662 $25,000 - $49,999 :159
## TRUE :189 $50,000 - $99,999 :294
## NA's :189 $100,000 - $149,999:159
## $150,000+ : 0
## NA's :329
##
## education location
## Less than high school degree : 8 Length:1040
## High school degree : 98 Class :character
## Some college or Associate degree:286 Mode :character
## Bachelor degree :325
## Graduate degree :284
## NA's : 39
##
## frequency recline_frequency recline_obligation
## Never :166 Never :171 Mode :logical
## Once a year or less :633 Once in a while :257 FALSE:311
## Once a month or less :205 About half the time:118 TRUE :543
## A few times per month: 29 Usually :175 NA's :186
## A few times per week : 4 Always :137
## Every day : 3 NA's :182
##
## recline_rude recline_eliminate switch_seats_friends switch_seats_family
## No :502 Mode :logical No :631 No :705
## Somewhat:281 FALSE:595 Somewhat:184 Somewhat:125
## Very : 71 TRUE :259 Very : 35 Very : 20
## NA's :186 NA's :186 NA's :190 NA's :190
##
##
##
## wake_up_bathroom wake_up_walk baby unruly_child
## No :535 No :226 No :592 No :147
## Somewhat:281 Somewhat:446 Somewhat:182 Somewhat:351
## Very : 34 Very :178 Very : 75 Very :351
## NA's :190 NA's :190 NA's :191 NA's :191
##
##
##
## two_arm_rests middle_arm_rest shade unsold_seat
## Length:1040 Length:1040 Length:1040 No :690
## Class :character Class :character Class :character Somewhat:128
## Mode :character Mode :character Mode :character Very : 37
## NA's :185
##
##
##
## talk_stranger get_up electronics
## No :675 It is not okay to get up during flight: 13 Mode :logical
## Somewhat:153 Once : 67 FALSE:713
## Very : 27 Twice :277 TRUE :136
## NA's :185 Three times :296 NA's :191
## Four times :111
## More than five times times : 91
## NA's :185
## smoked
## Mode :logical
## FALSE:842
## TRUE :7
## NA's :191
##
##
##
Typing “summary(flying$talk_stranger)” summarizes only that one variable. Because talk_stranger is categorical, the result is the count in each of its three categories plus the count of missing values (NA's). Typing “summary(flying)” summarizes all 27 variables at once, which is a lot of output to read when only one variable is of interest, so selecting the variable with the $ operator is the better choice. Finally, typing “summary(talk_stranger)” produces an error: R looks for a standalone object named talk_stranger and does not find one, because the variable exists only as a column inside flying.
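The three cases can be compared side by side on a small example. This is a minimal sketch with made-up data; the data frame `answers` and its column `rating` are hypothetical names, not part of `flying`.

```r
# Hypothetical toy data with one categorical column.
answers <- data.frame(
  id = 1:5,
  rating = factor(c("No", "Very", "No", NA, "Somewhat"),
                  levels = c("No", "Somewhat", "Very"))
)

# One column selected with $: counts per category, plus NA's.
one_var <- summary(answers$rating)
print(one_var)

# Whole data frame: one summary per column.
print(summary(answers))

# Bare column name: R looks for a standalone object called `rating`
# and signals an error because none exists outside the data frame.
bare <- tryCatch(summary(rating), error = function(e) conditionMessage(e))
print(bare)
```

Wrapping the last call in `tryCatch()` is only there so the example keeps running and the error message can be inspected as text.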
The dataset infants.csv was obtained some years ago from
a study conducted at the South End Community Health Center in Boston.
The study was designed to measure the impact of intense nutritional
counseling on pregnant women and their new-born infants. You will find
this dataset in the Files tab in the bottom-right panel. The variables
include: Race, age at delivery, smoking status of mother, mother’s
pre-pregnancy weight (in pounds), mother’s weight at delivery (in
pounds), whether the mother breastfed her infant, infant’s birth weight
(in grams), infant’s length at birth (in centimeters), and time (in
minutes) spent with the nutritionist.
Import this dataset into R. How many mothers were included in this dataset?
summary(infants)
## race age smoke preweight
## Length:68 Min. :16.00 Length:68 Min. : 85.0
## Class :character 1st Qu.:21.00 Class :character 1st Qu.:112.8
## Mode :character Median :24.00 Mode :character Median :125.0
## Mean :24.75 Mean :130.8
## 3rd Qu.:27.25 3rd Qu.:140.0
## Max. :41.00 Max. :264.0
## delweight breastfed bwt bwl
## Min. :117.0 Length:68 Min. :2000 Min. :41.00
## 1st Qu.:140.0 Class :character 1st Qu.:2950 1st Qu.:48.00
## Median :156.0 Mode :character Median :3245 Median :50.00
## Mean :160.5 Mean :3234 Mean :49.87
## 3rd Qu.:173.0 3rd Qu.:3515 3rd Qu.:52.00
## Max. :255.0 Max. :4458 Max. :60.00
## timenut
## Min. : 32.00
## 1st Qu.: 71.75
## Median : 85.50
## Mean : 86.18
## 3rd Qu.:100.00
## Max. :128.00
There are 68 mothers in this dataset. In the summary() output, the character variables race, smoke, and breastfed each report a length of 68, so the dataset has 68 observations, one per mother.
Obtain the names of the variables.
The variables include: “race”, “age”, “smoke”, “preweight”, “delweight”, “breastfed”, “bwt”, “bwl”, “timenut”
names(infants)
## [1] "race" "age" "smoke" "preweight" "delweight" "breastfed"
## [7] "bwt" "bwl" "timenut"
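The import step for this problem is not shown above. A minimal sketch of it follows, assuming a CSV file with a header row. Because `infants.csv` is not available here, the example first writes a tiny made-up file to a temporary path (the values are invented for illustration) and reads that instead; `toy_infants` is a hypothetical name. `nrow()` gives the number of observations directly, without reading it off the `summary()` output.

```r
# Write a tiny made-up CSV so the example runs on its own.
# For the real data, the path would be the location of infants.csv (assumed).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("race,age,smoke",
             "white,24,no",
             "black,31,yes",
             "white,19,no"), csv_path)

# Base R import; readr::read_csv() is the tidyverse alternative.
toy_infants <- read.csv(csv_path)

nrow(toy_infants)   # number of rows, i.e., one per mother
names(toy_infants)  # variable names
```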
Calculate the average age of the mothers in the dataset.
mean(infants$age)
## [1] 24.75
The average age of the mothers in the data set is 24.75 years old.
The following code will create a new variable, wtgain,
which represents each mother’s weight gain during pregnancy.
infants$wtgain = infants$delweight - infants$preweight
What is the average weight gain?
mean(infants$delweight)- mean(infants$preweight)
## [1] 29.77941
mean(infants$wtgain)
## [1] 29.77941
The average weight gain is about 29.78 pounds.
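Both calculations above agree because the mean is linear: the mean of the differences equals the difference of the means when every mother has both weights recorded. A small check with made-up weights (not values from infants.csv; `pre` and `del` are hypothetical names):

```r
# Made-up pre-pregnancy and delivery weights in pounds (illustrative only).
pre <- c(120, 135, 110, 150)
del <- c(150, 160, 145, 172)

gain <- del - pre                        # per-mother weight gain
mean_of_gain  <- mean(gain)              # mean of the differences
diff_of_means <- mean(del) - mean(pre)   # difference of the means

all.equal(mean_of_gain, diff_of_means)

# If either column had missing values, the two approaches could differ
# when na.rm = TRUE is used, because each mean would drop different rows.
```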
The treadmill.csv dataset contains observations on 31
males that volunteered for a study on methods for measuring fitness. The
variables contain information on the subject number
(Subject), subjects’ maximum treadmill oxygen consumption
(TreadMillOx, in ml per kg per minute) and maximum pulse
rate (TreadMillMaxPulse, in beats per minute), time to run
1.5 miles (RunTime, in minutes), maximum pulse during 1.5
mile run (RunPulse, in beats per minute), resting pulse
rate (RestPulse, beats per minute), Body Weight
(BodyWeight, in kg), and Age (in years).
Import this dataset into R, and calculate the mean body weight.
#View(treadmill)
mean(treadmill$BodyWeight)
## [1] 77.44452
The mean body weight is about 77.44 kilograms.
The weight responses are in kilograms and you might prefer to see
them in pounds. The conversion to pounds is lbs=2.205*kgs. We will
create a new variable from the treadmill dataset called
WeightLBs using this code:
WeightLBs = 2.205*treadmill$BodyWeight
Find the mean and standard deviation of WeightLBs.
#view(treadmill)
mean(WeightLBs)
## [1] 170.7652
sd(WeightLBs)
## [1] 18.36449
The mean is about 170.77 pounds. The standard deviation, which measures the typical spread of the weights around the mean, is about 18.36 pounds.
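Multiplying every value by a constant multiplies both the mean and the standard deviation by that constant, so the pound summaries can be cross-checked against the kilogram ones (2.205 × 77.44452 ≈ 170.7652). The sketch below uses made-up weights and a hypothetical data frame `toy_tread`. It also stores the converted values as a column with `$`, which keeps them inside the data frame rather than in a separate vector.

```r
# Made-up body weights in kg (illustrative only).
toy_tread <- data.frame(BodyWeight = c(70, 80, 90, 75))

# Add the converted weights as a new column of the data frame.
toy_tread$WeightLBs <- 2.205 * toy_tread$BodyWeight

# Mean and standard deviation both scale by the conversion factor.
mean_ok <- all.equal(mean(toy_tread$WeightLBs), 2.205 * mean(toy_tread$BodyWeight))
sd_ok   <- all.equal(sd(toy_tread$WeightLBs),   2.205 * sd(toy_tread$BodyWeight))
print(c(mean_ok, sd_ok))
```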
State a research question that could be associated with this dataset.
Does the weight of the runner affect the time it takes them to run the 1.5 miles?
All done!
Knit the completed R Markdown file as an HTML document (click the “Knit” button at the top of the script editor window) and upload it to Moodle.