STAT 227: Assignment 1

# Load any R Packages you may need
library(tidyverse)
library(fivethirtyeight)
library(mosaic)

Reminder: Knit early and often!

Exercise 1

Head Start is a program that promotes school readiness for children of low-income families. This federally funded program is supposed to provide the students with skills that will help them through high school. In a scientific study to determine the effectiveness of this program, researchers collected data from 10,000 children. These children were split into two research groups based on whether they had participated in the Head Start program. This longitudinal study followed the children until they were adults and recorded whether they had graduated from high school (yes/no). The study demonstrated that the Head Start program has a positive impact on students’ graduation rates.

(a)

Identify the variables and their types (categorical or numerical).

Explanatory Variable: Participation in Head Start (yes/no). This is a categorical variable.

Response Variable: High school graduation status (yes/no). This is a categorical variable.

(b)

Identify the response/outcome and the explanatory variables.

Response/Outcome Variable: Whether they graduated from high school

Explanatory Variable: Whether they participated in Head Start

(c)

State the main research question in this study.

Does participating in the Head Start early childhood education program increase the likelihood of high school graduation?
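
Since both variables here are categorical, a question like this is usually explored with a two-way table of counts and within-group proportions. The sketch below uses eight made-up records (not data from the study) and hypothetical object names, just to show the shape of the comparison.

```r
# Made-up illustration only: these 8 records are NOT from the study.
head_start <- c("yes", "yes", "yes", "yes", "no", "no", "no", "no")
graduated  <- c("yes", "yes", "yes", "no", "yes", "no", "no", "yes")

# Two-way table of counts: rows = explanatory, columns = response
counts <- table(head_start, graduated)
counts

# Proportion graduating within each Head Start group (each row sums to 1)
rates <- prop.table(counts, margin = 1)
rates
```

Comparing the "yes" column across the two rows is the sample version of the research question.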

Exercise 2

Researchers collected data to examine the relationship between pollutants and pre-term births in Southern California. During the study, air pollution levels were measured by air quality monitoring stations. Specifically, levels of carbon monoxide were recorded in parts per million, nitrogen dioxide and ozone in parts per hundred million, and coarse particulate matter (PM_10) in µg/m3. Length of gestation data were collected on 143,196 births between the years 1989 and 1993, and air pollution exposure during gestation was calculated for each birth. The analysis suggested that increased ambient PM_10 and, to a lesser degree, CO concentrations may be associated with the occurrence of preterm births.

(a)

Identify the variables and their types (categorical or numerical).

Explanatory Variable: Level of pollutants in Southern California, specifically CO, NO2, ozone, and PM_10. These are numerical variables because we are measuring how much of each pollutant is present, not just whether it is present.

Outcome Variable: Length of gestation for births between 1989 and 1993. This is a numerical variable.

(b)

Identify the response/outcome and the explanatory variables.

Response/Outcome Variable: Occurrence of preterm births, determined from the length of gestation. Length of gestation is numerical; whether a birth is preterm (yes/no) is categorical.

Explanatory Variable: Amount of pollutants in the air in Southern California

(c)

State the main research question in this study.

Does the amount of pollutants in the air affect the length of gestation? That is, is there a relationship between the amount of air pollutants in Southern California and the number of preterm births in 1989-1993?
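
The difference between length of gestation (numerical) and preterm status (categorical) can be seen in a short sketch. The gestation values are invented for illustration, and the cutoff of 37 weeks is the common clinical definition of preterm, assumed here because the exercise does not say how the study defined it.

```r
# Hypothetical gestation lengths in weeks (illustrative values, not study data)
gestation_weeks <- c(40, 36, 39, 41, 34, 38)

# Numerical variable: summarized with a mean
mean(gestation_weeks)

# Derived binary categorical variable: preterm assumed to mean under 37 weeks
preterm <- ifelse(gestation_weeks < 37, "yes", "no")

# Categorical variable: summarized with counts
table(preterm)
```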

Exercise 3

Suppose that a statistics professor records the following for each student enrolled in their class:

Major
Score on first exam
Attendance rate
Time spent sleeping the previous night
Handedness (left- or right-handed)
Political inclination (liberal, moderate, or conservative)
Time spent on the final exam
Score on the final exam

For the following questions, identify the response/outcome variable and the explanatory variable(s). Also classify each variable as quantitative or categorical. For categorical variables, also indicate whether the variable is binary (i.e., has only two levels).

(a)

Does political inclination vary among majors?

Explanatory Variable: Major. This is categorical and not binary, since there are many possible majors.

Outcome Variable: Political inclination. This is categorical and not binary, since there are three options, not two.

(b)

Is a student’s score on the first exam useful for predicting their score on the final exam?

Outcome Variable: Score on the final exam. This is quantitative.

Explanatory Variable: Score on the first exam. This is quantitative.

(c)

Can we tell much about a student’s handedness by knowing their average sleeping time, major, and time spent on the final exam?

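Outcome Variable: Handedness. This is categorical and binary (left- or right-handed).

Explanatory Variables: Average sleeping time (quantitative), major (categorical, not binary), and time spent on the final exam (quantitative).

No data are given to support a relationship between a student’s handedness and their average sleeping time, major, or time spent on the final exam. From the perspective of a psychology major, you cannot figure out a student’s handedness from these variables.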

Exercise 4

The flying dataset in the fivethirtyeight R package contains data from a recent article on airplane etiquette (view article here). You may view the dataset by running the following:

View(flying)

(a)

How many rows are in the dataset? What does each row in the dataset represent?

There are 1040 rows, or observations. Each row represents a different participant, so 1040 people filled out the survey.

(b)

Using the help page for the flying dataset (?flying), identify four variables in the dataset and describe what they mean.

1. frequency: How often do you travel by plane?
2. recline_obligation: Under normal circumstances, does a person who reclines their seat during a flight have any obligation to the person sitting behind them?
3. switch_seats_friends: Is it rude to ask someone to switch seats with you in order to be closer to friends?
4. smoked: Have you ever smoked a cigarette in an airplane bathroom when it was against the rules?

(c)

Use the summary() function with the talk_stranger variable (Hint: Use the $ operator). Explain why the result is different from when we used the summary() function in the Module 1 slides.

summary(flying$talk_stranger)

##       No Somewhat     Very     NA's 
##      675      153       27      185

summary(flying)

##  respondent_id          gender             age          height   
##  Min.   :3.432e+09   Length:1040        18-29:220   5'4"   : 79  
##  1st Qu.:3.432e+09   Class :character   30-44:254   5'7"   : 76  
##  Median :3.433e+09   Mode  :character   45-60:275   5'8"   : 76  
##  Mean   :3.433e+09                      > 60 :258   5'6"   : 75  
##  3rd Qu.:3.433e+09                      NA's : 33   5'9"   : 72  
##  Max.   :3.436e+09                                  (Other):480  
##                                                     NA's   :182  
##  children_under_18            household_income
##  Mode :logical     $0 - $24,999       : 99    
##  FALSE:662         $25,000 - $49,999  :159    
##  TRUE :189         $50,000 - $99,999  :294    
##  NA's :189         $100,000 - $149,999:159    
##                    $150,000+          :  0    
##                    NA's               :329    
##                                               
##                             education     location        
##  Less than high school degree    :  8   Length:1040       
##  High school degree              : 98   Class :character  
##  Some college or Associate degree:286   Mode  :character  
##  Bachelor degree                 :325                     
##  Graduate degree                 :284                     
##  NA's                            : 39                     
##                                                           
##                  frequency             recline_frequency recline_obligation
##  Never                :166   Never              :171     Mode :logical     
##  Once a year or less  :633   Once in a while    :257     FALSE:311         
##  Once a month or less :205   About half the time:118     TRUE :543         
##  A few times per month: 29   Usually            :175     NA's :186         
##  A few times per week :  4   Always             :137                       
##  Every day            :  3   NA's               :182                       
##                                                                            
##    recline_rude recline_eliminate switch_seats_friends switch_seats_family
##  No      :502   Mode :logical     No      :631         No      :705       
##  Somewhat:281   FALSE:595         Somewhat:184         Somewhat:125       
##  Very    : 71   TRUE :259         Very    : 35         Very    : 20       
##  NA's    :186   NA's :186         NA's    :190         NA's    :190       
##                                                                           
##                                                                           
##                                                                           
##  wake_up_bathroom   wake_up_walk       baby       unruly_child
##  No      :535     No      :226   No      :592   No      :147  
##  Somewhat:281     Somewhat:446   Somewhat:182   Somewhat:351  
##  Very    : 34     Very    :178   Very    : 75   Very    :351  
##  NA's    :190     NA's    :190   NA's    :191   NA's    :191  
##                                                               
##                                                               
##                                                               
##  two_arm_rests      middle_arm_rest       shade             unsold_seat 
##  Length:1040        Length:1040        Length:1040        No      :690  
##  Class :character   Class :character   Class :character   Somewhat:128  
##  Mode  :character   Mode  :character   Mode  :character   Very    : 37  
##                                                           NA's    :185  
##                                                                         
##                                                                         
##                                                                         
##   talk_stranger                                    get_up    electronics    
##  No      :675   It is not okay to get up during flight: 13   Mode :logical  
##  Somewhat:153   Once                                  : 67   FALSE:713      
##  Very    : 27   Twice                                 :277   TRUE :136      
##  NA's    :185   Three times                           :296   NA's :191      
##                 Four times                            :111                  
##                 More than five times times            : 91                  
##                 NA's                                  :185                  
##    smoked       
##  Mode :logical  
##  FALSE:842      
##  TRUE :7        
##  NA's :191      
##                 
##                 
##

Typing “summary(flying$talk_stranger)” summarizes only the talk_stranger variable. For this dataset, that means the counts for its three levels (No, Somewhat, Very) plus the number of NA's. Typing “summary(flying)” summarizes all 27 variables at once, which is a lot to process when only one variable is of interest, so selecting the variable with the dollar sign is better. Lastly, typing “summary(talk_stranger)” produces an error: talk_stranger exists only as a column of flying, so R cannot find an object with that name on its own.
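
The three cases can be reproduced on a small scale. This is only a sketch: `toy` is a hypothetical stand-in with made-up values, used so the code runs without the fivethirtyeight package.

```r
# Toy stand-in for flying (made-up values; the real data come from fivethirtyeight)
toy <- data.frame(
  respondent_id = 1:5,
  talk_stranger = factor(c("No", "Very", NA, "No", "Somewhat"),
                         levels = c("No", "Somewhat", "Very"))
)

# One column selected with $: counts per level, plus NA's
one_var <- summary(toy$talk_stranger)
one_var

# Whole data frame: one summary block per column
summary(toy)

# A bare column name is not an object in the workspace, so this call fails;
# try() catches the error so the rest of the script keeps running
bare <- try(summary(talk_stranger), silent = TRUE)
inherits(bare, "try-error")
```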

Exercise 5

The dataset infants.csv was obtained some years ago from a study conducted at the South End Community Health Center in Boston. The study was designed to measure the impact of intense nutritional counseling on pregnant women and their newborn infants. You will find this dataset in the Files tab in the bottom-right panel. The variables include: race, age at delivery, smoking status of mother, mother’s pre-pregnancy weight (in pounds), mother’s weight at delivery (in pounds), whether the mother breastfed her infant, infant’s birth weight (in grams), infant’s length at birth (in centimeters), and time (in minutes) spent with the nutritionist.

(a)

Import this dataset into R. How many mothers were included in this dataset?

summary(infants)

##      race                age           smoke             preweight    
##  Length:68          Min.   :16.00   Length:68          Min.   : 85.0  
##  Class :character   1st Qu.:21.00   Class :character   1st Qu.:112.8  
##  Mode  :character   Median :24.00   Mode  :character   Median :125.0  
##                     Mean   :24.75                      Mean   :130.8  
##                     3rd Qu.:27.25                      3rd Qu.:140.0  
##                     Max.   :41.00                      Max.   :264.0  
##    delweight      breastfed              bwt            bwl       
##  Min.   :117.0   Length:68          Min.   :2000   Min.   :41.00  
##  1st Qu.:140.0   Class :character   1st Qu.:2950   1st Qu.:48.00  
##  Median :156.0   Mode  :character   Median :3245   Median :50.00  
##  Mean   :160.5                      Mean   :3234   Mean   :49.87  
##  3rd Qu.:173.0                      3rd Qu.:3515   3rd Qu.:52.00  
##  Max.   :255.0                      Max.   :4458   Max.   :60.00  
##     timenut      
##  Min.   : 32.00  
##  1st Qu.: 71.75  
##  Median : 85.50  
##  Mean   : 86.18  
##  3rd Qu.:100.00  
##  Max.   :128.00
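
The import step itself is not shown in the output. A minimal sketch of reading a CSV and counting its rows might look like the following; it writes a tiny made-up file first so that it runs anywhere, and `infants_demo` is a hypothetical name. For the real data, the call would presumably be something like read.csv("infants.csv"), assuming the file sits in the working directory.

```r
# Stand-in file with made-up rows so the sketch runs anywhere;
# in the assignment the real infants.csv would be read instead
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,smoke,preweight",
             "24,no,125",
             "31,yes,140",
             "19,no,112"), tmp)

infants_demo <- read.csv(tmp)

nrow(infants_demo)   # number of rows, one per mother
dim(infants_demo)    # rows and columns together
```

nrow() gives the number of observations directly, without having to read it off the summary output.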

There are 68 mothers in the dataset. I figured this out from the summary output: for the variables breastfed, smoke, and race, the length is 68. Each row is one mother, so 68 observations means 68 mothers.

(b)

Obtain the names of the variables.

The variables are: “race”, “age”, “smoke”, “preweight”, “delweight”, “breastfed”, “bwt”, “bwl”, “timenut”

names(infants)

## [1] "race"      "age"       "smoke"     "preweight" "delweight" "breastfed"
## [7] "bwt"       "bwl"       "timenut"

(c)

Calculate the average age of the mothers in the dataset.

mean(infants$age)

## [1] 24.75

The average age of the mothers in the data set is 24.75 years old.

(d)

The following code will create a new variable, wtgain, which represents each mother’s weight gain during pregnancy.

infants$wtgain = infants$delweight - infants$preweight

What is the average weight gain?

mean(infants$delweight) - mean(infants$preweight)

## [1] 29.77941

mean(infants$wtgain)

## [1] 29.77941

Average weight gain: about 29.78 pounds
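
Both calculations give the same number because the mean of the differences equals the difference of the means when every mother has both weights recorded. A small check with made-up weights (not the infants data; `pre`, `del`, and `gain` are hypothetical names):

```r
# Made-up weights in pounds for three mothers (not the infants data)
pre <- c(120, 135, 150)   # pre-pregnancy
del <- c(150, 160, 185)   # at delivery

gain <- del - pre          # per-mother gains: 30, 25, 35

mean(gain)                 # mean of the differences
mean(del) - mean(pre)      # difference of the means
```

If some weights were missing and na.rm = TRUE were used, the two calculations could disagree, because each mean would drop different rows.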

Exercise 6

The treadmill.csv dataset contains observations on 31 males who volunteered for a study on methods for measuring fitness. The variables contain information on the subject number (Subject), subjects’ maximum treadmill oxygen consumption (TreadMillOx, in ml per kg per minute) and maximum pulse rate (TreadMillMaxPulse, in beats per minute), time to run 1.5 miles (RunTime, in minutes), maximum pulse during the 1.5-mile run (RunPulse, in beats per minute), resting pulse rate (RestPulse, in beats per minute), body weight (BodyWeight, in kg), and age (Age, in years).

(a)

Import this dataset into R, and calculate the mean body weight.

#View(treadmill)

mean(treadmill$BodyWeight)

## [1] 77.44452

The mean body weight is about 77.44 kilograms.

(b)

The weight responses are in kilograms and you might prefer to see them in pounds. The conversion to pounds is lbs=2.205*kgs. We will create a new variable called WeightLBs using this code:

WeightLBs = 2.205*treadmill$BodyWeight

Find the mean and standard deviation of WeightLBs.

#view(treadmill)

mean(WeightLBs)

## [1] 170.7652

sd(WeightLBs)

## [1] 18.36449

The mean is about 170.77 pounds. The standard deviation, the typical amount of variation or spread around the mean, is about 18.36 pounds.
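
Multiplying every value by a positive constant multiplies both the mean and the standard deviation by that same constant, so the pound summaries are just 2.205 times the kilogram ones. A quick check with made-up weights (not the treadmill data; `kg` and `lbs` are hypothetical names):

```r
# Hypothetical body weights in kilograms (not the treadmill data)
kg  <- c(60, 70, 80, 90)
lbs <- 2.205 * kg

# Mean scales by the conversion factor
mean(lbs)
2.205 * mean(kg)

# Standard deviation scales by the same factor
sd(lbs)
2.205 * sd(kg)
```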

(c)

State a research question that could be associated with this dataset.

Does the weight of the runner affect the time it takes them to run the 1.5 miles?

All done!

Knit the completed R Markdown file as an HTML document (click the “Knit” button at the top of the script editor window) and upload it to Moodle.