This homework assignment consists of two sections. The first section deals with data structures and the second section is a small data analysis project. You will use your data wrangling and tidying knowledge for the second section.
General Instructions:
echo = TRUE is specified in the setup code chunk at the beginning of the document. Please do not override this in your code chunks.
If a code chunk produces an error that you cannot fix, one possibility is to use the eval = FALSE option. Another possibility is to use the error = TRUE option in the code chunk.
This section focuses on some basic manipulations of vectors in R.
Create three vectors in R: One called evennums which contains the even integers from 1 through 15. One called charnums which contains character representations of the numbers 4 through 8, namely, “4”, “5”, “6”, “7”, “8”. And one called mixed which contains the same values as in charnums but which also contains the letters “a”, “b” and “c”. No commentary or explanations are necessary.
evennums <- c(2,4,6,8,10,12,14)
print(evennums)
[1] 2 4 6 8 10 12 14
charnums <- c("4","5","6","7","8")
print(charnums)
[1] "4" "5" "6" "7" "8"
mixed <- c(charnums,"a","b","c")
print(mixed)
[1] "4" "5" "6" "7" "8" "a" "b" "c"
Investigate what happens when you try to convert evennums to character and to logical. Investigate what happens when you convert charnums to numeric. Investigate what happens when you convert mixed to numeric. Comment on each of these conversions.
#character
#evennums converted to character representations from integers
evennums <- c(2,4,6,8,10,12,14)
y <- as.character(evennums)
print(y)
[1] "2" "4" "6" "8" "10" "12" "14"
#logical
#converts evennums into logical (TRUE or FALSE) values
#all TRUE, because every nonzero number converts to TRUE
evennums <- c(2,4,6,8,10,12,14)
x <- as.logical(evennums)
print(x)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#charnums to numeric
#converts charnums from character values to numeric values
charnums <- c("4","5","6","7","8")
x = as.numeric(charnums)
print(x)
[1] 4 5 6 7 8
#mixed to numeric
#the character representations of numbers were converted to numeric values,
#but the letters ("a", "b", "c") cannot be parsed as numbers, so they become NA
#(R also warns that NAs were introduced by coercion)
mixed <- c(charnums,"a","b","c")
z = as.numeric(mixed)
print(z)
[1] 4 5 6 7 8 NA NA NA
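The coercion behaviour above can be inspected a little further. The following is a minimal sketch, not part of the original answer, that picks out the elements of mixed that fail to convert and shows how numbers map to logical values; the variable name converted is an assumption introduced for illustration.

```r
mixed <- c("4", "5", "6", "7", "8", "a", "b", "c")

# as.numeric() warns "NAs introduced by coercion"; the warning is suppressed
# on purpose here because the NAs are exactly what we want to inspect.
converted <- suppressWarnings(as.numeric(mixed))

# Elements that could not be parsed as numbers
mixed[is.na(converted)]

# as.logical() on numbers: zero becomes FALSE, any nonzero value becomes TRUE
as.logical(c(0, 2, -1))
```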
No commentary is necessary on this part.
Extract the first element of evennums.
#extract first element of 'evennums'
evennums <- c(2,4,6,8,10,12,14)
print(evennums[1])
[1] 2
Extract the last element of evennums. In this case you are NOT allowed to use the fact that evennums has seven elements, rather, you must give code which would work no matter how many elements evennums has.
#extract the last element of evennums
#using tail(vector, n=1)
evennums <- c(2,4,6,8,10,12,14)
x = tail(evennums,1)
print(x)
[1] 14
Extract all but the first element of evennums.
#extract all but the first element
#using tail(evennums,n=-1)
evennums <- c(2,4,6,8,10,12,14)
x = tail(evennums,-1)
print(x)
[1] 4 6 8 10 12 14
Extract all but the first two and last two elements of evennums.
#extract all but the first two and last two elements
evennums <- c(2,4,6,8,10,12,14)
x = evennums[3:5]
print(x)
[1] 6 8 10
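The index 3:5 used above relies on evennums having seven elements. A minimal sketch of a length-independent alternative follows; the helper name trim_ends is hypothetical and not part of the assignment.

```r
# Hypothetical helper: drop the first k and the last k elements of a vector.
trim_ends <- function(v, k = 2) {
  # head(v, -k) drops the last k elements; tail(..., -k) then drops the first k.
  tail(head(v, -k), -k)
}

evennums <- c(2, 4, 6, 8, 10, 12, 14)
trim_ends(evennums)

# The same idea with negative indexing on positions
n <- length(evennums)
evennums[-c(1, 2, n - 1, n)]
```

Both forms return 6 8 10 for the seven-element vector, and keep working if elements are appended to evennums.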
#sequence 'y' of 50 evenly spaced values from 0 to 1, using seq(from, to, length.out)
y <- seq(0,1,length.out = 50)
print(y)
[1] 0.00000000 0.02040816 0.04081633 0.06122449 0.08163265 0.10204082
[7] 0.12244898 0.14285714 0.16326531 0.18367347 0.20408163 0.22448980
[13] 0.24489796 0.26530612 0.28571429 0.30612245 0.32653061 0.34693878
[19] 0.36734694 0.38775510 0.40816327 0.42857143 0.44897959 0.46938776
[25] 0.48979592 0.51020408 0.53061224 0.55102041 0.57142857 0.59183673
[31] 0.61224490 0.63265306 0.65306122 0.67346939 0.69387755 0.71428571
[37] 0.73469388 0.75510204 0.77551020 0.79591837 0.81632653 0.83673469
[43] 0.85714286 0.87755102 0.89795918 0.91836735 0.93877551 0.95918367
[49] 0.97959184 1.00000000
#calculate mean of sequence
y <- seq(0,1,length.out = 50)
#store the result under a name that does not shadow the mean() function
y_mean <- mean(y)
print(y_mean)
[1] 0.5
The dataset contains information about births in the United States. The full data set is from the Centers for Disease Control. The data for this homework assignment is a “small” sample (chosen at random) of slightly over one million records from the full data set. The data for this homework assignment also only contain a subset of the variables in the full data set.
Load tidyverse, which includes dplyr, tidyr, and other packages, and then load knitr.
library(tidyverse)
library(knitr)
Read in the data and convert the data frame to a tibble.
birth_data <- read.csv("BirthData.csv", header = TRUE)
birth_data <- as_tibble(birth_data)
A glimpse of the data:
glimpse(birth_data)
Rows: 1,103,629
Columns: 8
$ year <int> 1969, 1969, 1969, 1969, 1969, 1969, 1969, 1969, 1969,...
$ month <int> 9, 8, 9, 2, 3, 5, 5, 5, 6, 8, 8, 11, 11, 11, 1, 12, 3...
$ state <chr> "AL", "AZ", "AZ", "CA", "CA", "CA", "CA", "CA", "CA",...
$ is_male <lgl> FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TR...
$ weight_pounds <dbl> 1.624807, 7.500126, 8.937540, 6.999677, 6.876218, 7.1...
$ mother_age <int> 20, 35, 17, 20, 25, 30, 17, 22, 26, 26, 19, 25, 26, 2...
$ child_race <int> 2, 1, 1, 1, 2, 1, 1, 4, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1,...
$ plurality <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
The variables in the data set are:
| Variable | Description |
|---|---|
| year | the year of the birth |
| month | the month of the birth |
| state | the state where the birth occurred, including “DC” for Washington D.C. |
| is_male | TRUE if the child is male, FALSE otherwise |
| weight_pounds | the child’s birth weight in pounds |
| mother_age | the age of the mother |
| child_race | race of the child |
| plurality | the number of children born as a result of the pregnancy, with 1 representing a single birth, 2 representing twins, etc. |
For both Questions 1 and 2 you should show the R code used and the output of the str and glimpse functions applied to the data frame. Use of dplyr functions and the pipe operator is highly recommended.
Create a variable called region in the data frame birth_data which takes the values Northeast, Midwest, South, and West. The first two Steps have been done for you.
Here are the states in each region:
Northeast: Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont, New Jersey, New York, and Pennsylvania
Midwest: Illinois, Indiana, Michigan, Ohio, Wisconsin, Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, and South Dakota
South: Delaware, District of Columbia, Florida, Georgia, Maryland, North Carolina, South Carolina, Virginia, West Virginia, Alabama, Kentucky, Mississippi, Tennessee, Arkansas, Louisiana, Oklahoma, and Texas
West: Arizona, Colorado, Idaho, Montana, Nevada, New Mexico, Utah, Wyoming, Alaska, California, Hawaii, Oregon, and Washington
#Step 1: Assign the regions.
NE <- c("CT", "ME", "MA", "NH", "RI", "VT", "NJ", "NY", "PA")
MW <- c("IL", "IN", "MI", "OH", "WI", "IA", "KS", "MN", "MO", "NE", "ND", "SD")
SO <- c("DE", "DC", "FL", "GA", "MD", "NC", "SC", "VA", "WV", "AL", "KY", "MS", "TN", "AR", "LA", "OK", "TX")
WE <- c("AZ", "CO", "ID", "MT", "NV", "NM", "UT", "WY", "AK", "CA", "HI", "OR", "WA")
## Step 2 Create a blank vector
birth_data$region <- rep(NA, length(birth_data$state))
## Hint: use ifelse and %in% to fill in the regions, one region at a time.
## Test state against the region vectors (NE, MW, SO, WE), not against the
## strings "NE", "MW", and so on, and assign with <- rather than ==. For example:
## birth_data$region <- ifelse(birth_data$state %in% NE,
##                             "Northeast", birth_data$region)
## birth_data$region <- ifelse(birth_data$state %in% MW,
##                             "Midwest", birth_data$region)
## Repeat for SO ("South") and WE ("West").
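The ifelse and %in% hint can be tried out on a small example before running it on the full data. The sketch below is illustrative only: the data frame toy and the shortened region vectors are assumptions that stand in for birth_data and the full vectors from Step 1.

```r
# Toy data standing in for birth_data (values are illustrative only)
toy <- data.frame(state = c("CT", "NE", "TX", "CA", NA),
                  stringsAsFactors = FALSE)

# Abbreviated region vectors; the full vectors are defined in Step 1
NE <- c("CT", "NY")
MW <- c("NE", "OH")
SO <- c("TX", "FL")
WE <- c("CA", "WA")

# Nested ifelse: test membership with %in%, fall through to NA if no match
toy$region <- ifelse(toy$state %in% NE, "Northeast",
              ifelse(toy$state %in% MW, "Midwest",
              ifelse(toy$state %in% SO, "South",
              ifelse(toy$state %in% WE, "West", NA_character_))))
toy$region
```

Note that the state code "NE" (Nebraska) is matched against the contents of the MW vector, which is unrelated to the variable named NE. A missing state is not in any vector, so its region stays NA.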
Create a variable in birth_data called state_color which takes the values red, blue, and purple, using the following divisions.
state_color <- c('red','blue','purple')
birth_data$state_color <- rep(NA, length(birth_data$state))
head(birth_data)
Red: Alaska, Idaho, Kansas, Nebraska, North Dakota, Oklahoma, South Dakota, Utah, Wyoming, Texas, Alabama, Mississippi, South Carolina, Montana, Georgia, Missouri, Louisiana, Tennessee, Arkansas, Kentucky, Arizona, West Virginia.
Purple: North Carolina, Virginia, Florida, Ohio, Colorado, Nevada, Indiana, Iowa, New Mexico.
Blue: New Hampshire, Pennsylvania, California, Michigan, Illinois, Maryland, Delaware, New Jersey, Connecticut, Vermont, Maine, Washington, Oregon, Wisconsin, New York, Massachusetts, Rhode Island, Hawaii, Minnesota, District of Columbia.
RED <- c("AK", "ID", "KS", "NE", "ND", "OK", "SD", "UT", "WY", "TX", "AL", "MS", "SC", "MT", "GA", "MO", "LA", "TN", "AR", "KY", "AZ", "WV")
PURPLE <- c("NC", "VA", "FL", "OH", "CO", "NV", "IN", "IA", "NM")
BLUE <- c("NH", "PA", "CA", "MI", "IL", "MD", "DE", "NJ", "CT", "VT", "ME", "WA", "OR", "WI", "NY", "MA", "RI", "HI", "MN", "DC")
## try using mutate
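One way to follow the mutate suggestion is to combine it with dplyr's case_when. This is a sketch under stated assumptions, not the required solution: the tibble toy and the shortened colour vectors are made up for illustration, and the real code would use birth_data with the full RED, PURPLE and BLUE vectors.

```r
library(dplyr)

# Toy data standing in for birth_data (values are illustrative only)
toy <- tibble(state = c("TX", "FL", "NY", NA))

# Abbreviated vectors; the full RED, PURPLE and BLUE vectors are defined above
RED <- c("TX", "AL")
PURPLE <- c("FL", "OH")
BLUE <- c("NY", "CA")

toy <- toy %>%
  mutate(state_color = case_when(
    state %in% RED    ~ "red",
    state %in% PURPLE ~ "purple",
    state %in% BLUE   ~ "blue",
    TRUE              ~ NA_character_  # no match, including a missing state
  ))
toy$state_color
```

case_when evaluates the conditions in order and uses the first one that is TRUE, which avoids the nesting needed with ifelse.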
Create two new objects perc_male and perc_female that calculate the percentile ranking of a baby’s weight with respect to the baby’s sex.
## The dataset to find the male percentiles
birth_data1 <- birth_data %>%
  filter(is_male == TRUE) # %>%
  # select(is_male, weight_pounds, plurality)
glimpse(birth_data1)
Rows: 566,380
Columns: 10
$ year <int> 1969, 1969, 1969, 1969, 1969, 1969, 1969, 1969, 1969,...
$ month <int> 8, 9, 2, 5, 5, 6, 1, 3, 6, 7, 10, 2, 3, 5, 7, 8, 10, ...
$ state <chr> "AZ", "AZ", "CA", "CA", "CA", "CA", "CO", "CT", "CT",...
$ is_male <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,...
$ weight_pounds <dbl> 7.500126, 8.937540, 6.999677, 7.187070, 7.374463, 9.6...
$ mother_age <int> 35, 17, 20, 30, 22, 26, 27, 28, 24, 20, 23, 26, 23, 1...
$ child_race <int> 1, 1, 1, 1, 4, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 9,...
$ plurality <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ region <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ state_color <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## Hint: use the quantile function to find the percentiles.
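To make the hint concrete, here is a minimal sketch on made-up weights for a single sex. It uses quantile for cut points and, as one possible way to turn a weight into a percentile rank, the base R ecdf function. The vector weights and the name weight_ecdf are assumptions, and the definition of rank used here (percentage of babies weighing no more than this one) is one reasonable choice among several.

```r
# Toy weights for one sex (illustrative values, not from the data set)
weights <- c(5.5, 6.0, 7.0, 7.5, 8.0, NA)

# Cut points at the 25th, 50th and 75th percentiles
quantile(weights, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

# ecdf() returns a function giving the proportion of observations <= x;
# missing weights are dropped when the ecdf is built and map to NA afterwards
weight_ecdf <- ecdf(weights)
perc <- 100 * weight_ecdf(weights)
perc
```

Applying the same steps to the filtered male and female data would give candidates for perc_male and perc_female.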
Create another new variable that records the percentile ranking of a baby’s weight with respect to the baby’s plurality (i.e., whether it was a single child, twin, triplet, etc.). [i.e., if a baby is a twin (plurality = 2), the variable should record the percentile ranking of the baby’s weight relative only to all other twins.]
## The dataset for plurality = 1 ; do the same for the other pluralities
birth_data1 <- birth_data %>%
  filter(plurality == 1)
glimpse(birth_data1)
Rows: 1,046,856
Columns: 10
$ year <int> 1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971,...
$ month <int> 5, 8, 9, 10, 12, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 6, ...
$ state <chr> "AL", "AL", "AL", "AL", "AL", "CA", "CA", "CA", "CA",...
$ is_male <lgl> FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, ...
$ weight_pounds <dbl> 6.000983, 8.313632, 5.500533, 6.437498, 6.499227, 7.3...
$ mother_age <int> 32, 38, 20, 25, 38, 24, 38, 20, 20, 24, 28, 25, 20, 2...
$ child_race <int> 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1,...
$ plurality <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ region <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ state_color <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## Hint: use the quantile function to find the percentiles.
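Rather than filtering one plurality at a time, the ranking can be computed within every group in a single step. The following sketch assumes dplyr's group_by with an ecdf-based rank; the tibble toy and the column name perc_plurality are hypothetical. It would fail for a group in which every weight is missing, so such groups would need to be handled separately.

```r
library(dplyr)

# Toy data standing in for birth_data (values are illustrative only)
toy <- tibble(
  plurality     = c(1, 1, 1, 2, 2, NA),
  weight_pounds = c(6.5, 7.5, 8.5, 4.0, 5.0, 7.0)
)

toy <- toy %>%
  group_by(plurality) %>%
  # Percentage of babies in the same plurality group weighing no more than this one
  mutate(perc_plurality = 100 * ecdf(weight_pounds)(weight_pounds)) %>%
  ungroup()
toy$perc_plurality
```

Rows with a missing plurality form their own group here, so their ranking is relative only to other rows with a missing plurality.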
Provide an example case in which these two percentile rankings in Question 3 and Question 4 (gender vs plurality) would be quite similar. Provide another example case in which these two percentile rankings would be quite different.
Agree or disagree with this claim. If you agree, provide a rationale for why it is correct. If you disagree, provide a counter-example that reveals the error in its thinking:
“If these two percentile rankings are very different from one another, we should suspect that the baby in question is more likely to be a twin/triplet/etc., than a single-birth.”
Some of the variables have missing values, and these may be related to different data collection choices during different years. For example, possibly plurality wasn’t recorded during some years, or state of birth wasn’t recorded during some years. In this exercise we investigate this using some dplyr functions. Hint: The group_by and summarize functions will help.
Count the number of missing values in each variable in the data frame.
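A minimal base R sketch of counting missing values per column, shown on a made-up data frame (toy and na_counts are illustrative names, not part of the assignment):

```r
# Toy data standing in for birth_data (values are illustrative only)
toy <- data.frame(
  year      = c(1969, 1969, 1970),
  state     = c("AL", NA, NA),
  plurality = c(NA, NA, 1)
)

# is.na() gives a logical matrix; column sums count the TRUE values
na_counts <- colSums(is.na(toy))
na_counts
```

The result is a named vector with one count per variable.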
Use group_by and summarize to count the number of missing values of the two variables, state and child_race, for each year, and to also count the total number of observations per year.
Are there particular years when these two variables are either not available, or of limited availability?
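The per-year counts described above could be sketched as follows, assuming dplyr and a made-up tibble; the output column names n_obs, missing_state and missing_race are hypothetical.

```r
library(dplyr)

# Toy data standing in for birth_data (values are illustrative only)
toy <- tibble(
  year       = c(1969, 1969, 1970, 1970, 1970),
  state      = c("AL", NA, NA, NA, "CA"),
  child_race = c(1, 2, NA, 1, 1)
)

missing_by_year <- toy %>%
  group_by(year) %>%
  summarize(
    n_obs         = n(),                     # total observations in the year
    missing_state = sum(is.na(state)),       # missing values of state
    missing_race  = sum(is.na(child_race))   # missing values of child_race
  )
missing_by_year
```

Comparing each missing count with n_obs shows whether a variable is entirely unavailable in a year (the two are equal) or only partly missing.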
Create the following data frame which gives the counts, the mean weight of babies and the mean age of mothers for the six levels of plurality. Comment on what you notice about the relationship of plurality and birth weight, and the relationship of plurality and age of the mother.
Create a data frame which gives the counts, the mean weight of babies and the mean age of mothers for each combination of the four levels of state_color and the two levels of is_male.
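Both summary tables above follow the same group_by and summarize pattern, with one or two grouping variables. The sketch below uses a made-up tibble; the names toy, by_plurality and by_color_sex are assumptions for illustration.

```r
library(dplyr)

# Toy data standing in for birth_data (values are illustrative only)
toy <- tibble(
  plurality     = c(1, 1, 2, 2, NA),
  state_color   = c("red", "blue", "red", "red", NA),
  is_male       = c(TRUE, FALSE, TRUE, FALSE, FALSE),
  weight_pounds = c(7.5, 7.0, 5.0, NA, 6.5),
  mother_age    = c(25, 30, 32, 34, 28)
)

# One grouping variable: one row per level of plurality (NA is its own level)
by_plurality <- toy %>%
  group_by(plurality) %>%
  summarize(
    count       = n(),
    mean_weight = mean(weight_pounds, na.rm = TRUE),
    mean_age    = mean(mother_age, na.rm = TRUE)
  )

# Two grouping variables: one row per observed combination
by_color_sex <- toy %>%
  group_by(state_color, is_male) %>%
  summarize(
    count       = n(),
    mean_weight = mean(weight_pounds, na.rm = TRUE),
    mean_age    = mean(mother_age, na.rm = TRUE),
    .groups     = "drop"
  )
```

Setting na.rm = TRUE keeps a single missing weight from turning a whole group's mean into NA, while count still includes every row in the group.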
The deadline to submit Homework 1 is 11:00pm on Saturday, February 13th. This is an individual assignment. Submit your work by uploading your RMD and HTML/PDF files through D2L. Kindly double check your submission to confirm that everything is displayed in the uploaded version of the output in D2L. If submitting HTML outputs, please zip the files for submission. Late work will not be accepted except under certain extraordinary circumstances.
Post general questions in the Teams HW 1 channel. If you are trying to get help on a code error, explain your error in detail.
Feel free to visit us during our virtual office hours or make an appointment.
Communicate with your classmates, but do not share snippets of code.
The instructional team will not answer any questions within the first 24 hours of this homework being assigned, and we will not answer any questions after 6 P.M. on the due date.
This is an individual assignment. You may discuss ideas, how to debug code, and how to approach a problem with your classmates in the discussion board forum. You may not copy-and-paste another’s code from this class. As a reminder, below is the policy on sharing and using others’ code.
Similar reproducible examples (reprex) exist online that will help you answer many of the questions posed on group assignments, and homework assignments. Use of these resources is allowed unless it is written explicitly on the assignment. You must always cite any code you copy or use as inspiration. Copied code without citation is plagiarism and will result in a 0 for the assignment.
You must use R Markdown. Formatting is at your discretion but is graded. Use the in-class assignments and resources available online for inspiration. Another useful resource for R Markdown formatting is available at: https://holtzy.github.io/Pimp-my-rmd/
| Topic | Points |
|---|---|
| Questions 1-4 (Sec 1) and 1-10 (Sec 2) | 70 |
| R Markdown formatting | 5 |
| Communication of results | 10 |
| Rmd file compilation | 5 |
| Code style and named code chunks | 10 |
| Total | 100 |