Rivers School Science Researchers
Famous Historical Examples
Can be Excellent for Planning
The coding language = R
The Integrated Development Environment (IDE) = Posit Cloud
How to think of this: You write an English paper in Google Docs. In that case, English is the language, and Google Docs is the environment you use to type and edit the paper.
Library function = Getting the tools you’ll need out of the shed
You do this by running the library function for each package you are going to use.
For this tutorial, we will use two packages: tidyverse and openintro.
Tidyverse is a collection of packages whose tools make it easier to visualize, summarize, and wrangle data.
Openintro is a package referenced in our textbook, Introduction to Modern Statistics. Among other things, it includes a number of datasets. During this tutorial, you will generate visualizations similar to those seen in the reading.
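A minimal sketch of loading the two packages, assuming both are already installed in your Posit Cloud project (the commented install.packages line is a one-time step if they are not):

```r
# One-time setup, only if the packages are not installed yet:
# install.packages(c("tidyverse", "openintro"))

# Get the tools out of the shed: run library() once per package, per session.
library(tidyverse)  # tools for visualizing, summarizing, and wrangling data
library(openintro)  # datasets used in Introduction to Modern Statistics
```

You need to rerun the library lines each time you start a new R session.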
The code you type and run
What are the observational units/cases?
What are the variables?
How is my data organized?
You want Tidy Data
Each row is an observation
Each column is a variable
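As an illustration of what tidy data looks like, here is a small sketch that reshapes a table into tidy form. The values and column names are made up for this example and do not come from any real dataset; pivot_longer is a tidyverse function.

```r
library(tidyverse)

# Hypothetical, made-up data: one row per baby, one column per weigh-in.
# This is NOT tidy if the observational unit is a single weigh-in.
wide <- tibble(
  baby = c("A", "B"),
  weight_week1 = c(7.1, 6.4),
  weight_week2 = c(7.4, 6.8)
)

# Reshape so each row is one weigh-in (an observation)
# and each column is one variable (baby, week, weight).
tidy <- wide |>
  pivot_longer(
    cols = starts_with("weight_"),
    names_to = "week",
    names_prefix = "weight_week",
    values_to = "weight"
  )

tidy
```

After reshaping, the table has four rows (two babies times two weigh-ins) and three columns.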
Throughout the rest of this tutorial we will explore the births14 dataset.
This dataset is included in the openintro package.
Since it is part of an R package, you can type ?births14 to access its documentation and learn more about it.
Description
Births14 dataset
Running the head function allows you to see the first 6 rows of the data frame.
Type and run this code
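A minimal sketch of the call, assuming tidyverse and openintro are installed; the name first_rows is only for illustration. It should produce output like the tibble that follows.

```r
library(tidyverse)
library(openintro)

# head() returns the first 6 rows of a data frame by default.
first_rows <- head(births14)
first_rows
```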
# A tibble: 6 × 13
fage mage mature weeks premie visits gained weight lowbirthweight sex
<int> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 34 34 younger mom 37 full … 14 28 6.96 not low male
2 36 31 younger mom 41 full … 12 41 8.86 not low fema…
3 37 36 mature mom 37 full … 10 28 7.51 not low fema…
4 NA 16 younger mom 38 full … NA 29 6.19 not low male
5 32 31 younger mom 36 premie 12 48 6.75 not low fema…
6 32 26 younger mom 39 full … 14 45 6.69 not low fema…
# ℹ 3 more variables: habit <chr>, marital <chr>, whitemom <chr>
Births14 dataset
The glimpse function allows you to see how many variables, what type each is, and how many observations are in the data frame.
Type and Run this code
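A minimal sketch of the call, assuming tidyverse and openintro are installed. It should produce output like the listing that follows.

```r
library(tidyverse)
library(openintro)

# glimpse() prints the number of rows and columns,
# then each variable with its type and first few values.
glimpse(births14)
```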
Rows: 1,000
Columns: 13
$ fage <int> 34, 36, 37, NA, 32, 32, 37, 29, 30, 29, 30, 34, 28, 28,…
$ mage <dbl> 34, 31, 36, 16, 31, 26, 36, 24, 32, 26, 34, 27, 22, 31,…
$ mature <chr> "younger mom", "younger mom", "mature mom", "younger mo…
$ weeks <dbl> 37, 41, 37, 38, 36, 39, 36, 40, 39, 39, 42, 40, 40, 39,…
$ premie <chr> "full term", "full term", "full term", "full term", "pr…
$ visits <dbl> 14, 12, 10, NA, 12, 14, 10, 13, 15, 11, 14, 16, 20, 15,…
$ gained <dbl> 28, 41, 28, 29, 48, 45, 20, 65, 25, 22, 40, 30, 31, NA,…
$ weight <dbl> 6.96, 8.86, 7.51, 6.19, 6.75, 6.69, 6.13, 6.74, 8.94, 9…
$ lowbirthweight <chr> "not low", "not low", "not low", "not low", "not low", …
$ sex <chr> "male", "female", "female", "male", "female", "female",…
$ habit <chr> "nonsmoker", "nonsmoker", "nonsmoker", "nonsmoker", "no…
$ marital <chr> "married", "married", "married", "not married", "marrie…
$ whitemom <chr> "white", "white", "not white", "white", "white", "white…
Keep it simple to start.
Know that you’ll go around the wheel many times, and new questions will arise from your conclusions, but you have to start somewhere.
Examples:
How much do babies in this sample tend to weigh?
How many weeks did pregnancies tend to last?
What percent of babies were born to mothers who were married?
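As one possible way to answer the last question, the sketch below counts the babies in each marital category and converts the counts to percentages. The name marital_summary is only for illustration, and count and mutate are tidyverse functions not otherwise covered in this tutorial.

```r
library(tidyverse)
library(openintro)

marital_summary <- births14 |>
  count(marital) |>                    # number of babies in each category
  mutate(percent = 100 * n / sum(n))   # convert counts to percentages

marital_summary
```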
Start by visualizing the distribution of the weight variable.
Notes:
|> is a pipe symbol and should be read as “AND THEN”
+ is used to add layers to a ggplot.
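A minimal sketch of one way to visualize the distribution of weight, assuming a histogram is the plot you want; the binwidth, title, and labels are illustrative choices, not settings taken from the tutorial.

```r
library(tidyverse)
library(openintro)

weight_hist <- births14 |>                 # take the births14 data, AND THEN
  filter(!is.na(weight)) |>                # drop babies with no recorded weight, AND THEN
  ggplot(aes(x = weight)) +                # start a plot with weight on the x-axis
  geom_histogram(binwidth = 0.5) +         # add a histogram layer (0.5-pound bins)
  labs(
    title = "Distribution of Birth Weights",
    x = "Weight (pounds)",
    y = "Number of babies"
  )

weight_hist
```

Notice that |> connects the data steps, while + connects the plot layers once ggplot has been called.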
We now need to generate statistics to better describe the weights.
Central Tendency
Mean
Median
Spread of the data
Standard Deviation: Roughly the typical distance of the observations from the mean
Interquartile Range: The width of the middle 50% of the data (third quartile minus first quartile)
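To make these definitions concrete, here is a small sketch that computes each statistic by hand and compares it with R's built-in functions. The vector x is made up for illustration. Note that sd divides by n - 1 rather than n.

```r
# Made-up values for illustration only.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Mean: add up the values and divide by how many there are.
manual_mean <- sum(x) / n

# Standard deviation: square root of the sum of squared
# distances from the mean, divided by n - 1.
manual_sd <- sqrt(sum((x - manual_mean)^2) / (n - 1))

# Interquartile range: third quartile minus first quartile.
manual_iqr <- unname(quantile(x, 0.75) - quantile(x, 0.25))

# Compare with the built-in functions.
c(manual_mean, mean(x))
c(manual_sd, sd(x))
c(manual_iqr, IQR(x))
```

Each pair printed at the end should match, which shows what mean, sd, and IQR are doing behind the scenes.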
births14 |>
  filter(!is.na(weight)) |>
  summarize(
    Mean = mean(weight),
    Median = median(weight),
    Standard_Deviation = sd(weight),
    IQR = IQR(weight)
  )
births14 is the data frame we start with.
The filter function removes any observations where the weight was not recorded.
summarize is the function we will use to generate the statistics.
mean is the function that does the calculation on the weight variable, while Mean is the name we are giving the result.
median is the function that does the calculation on the weight variable, while Median is the name we are giving the result.
sd is the function that does the calculation on the weight variable, while Standard_Deviation is the name we are giving the result.
IQR (right of the equals sign) is the function that does the calculation on the weight variable, while IQR (left of the equals sign) is the name we are giving the result.
# A tibble: 1 × 4
Mean Median Standard_Deviation IQR
<dbl> <dbl> <dbl> <dbl>
1 7.20 7.31 1.31 1.46
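The group_by function does as it says. In this case, it puts babies into groups based on the habit variable: one group for those with moms who smoked and one for those with moms who did not smoke. Babies whose mother's smoking habit was not recorded form a third group, labeled <NA>.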
births14 |>
  filter(!is.na(weight)) |>
  group_by(habit) |>
  summarize(
    Mean = mean(weight),
    Median = median(weight),
    Standard_Deviation = sd(weight),
    IQR = IQR(weight)
  )
# A tibble: 3 × 5
habit Mean Median Standard_Deviation IQR
<chr> <dbl> <dbl> <dbl> <dbl>
1 nonsmoker 7.27 7.35 1.23 1.49
2 smoker 6.68 7.03 1.60 1.85
3 <NA> 7.05 7.1 1.91 1.99
geom_point will make the scatter plot.
Use the labs function to set the title and axis labels; change the text inside the quotation marks to customize them.
births14 |>
  ggplot(aes(x = weeks, y = weight)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(
    title = "Weight vs Length of Pregnancy",
    x = "Length of Pregnancy (weeks)",
    y = "Weight (pounds)"
  )
method = 'lm' adds a straight line of best fit (lm stands for linear model); we use it here for purposes of instruction.
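To see the numbers behind that line, the sketch below fits the same kind of linear model directly with R's lm function and prints its intercept and slope. This is an optional extra, not part of the plot code; the name fit is only for illustration.

```r
library(openintro)

# Fit weight as a linear function of weeks.
# lm() drops rows with missing weeks or weight by default.
fit <- lm(weight ~ weeks, data = births14)

# The intercept and the slope (change in pounds per additional week).
coef(fit)
```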
Rochette, R., & Lê, S. (2020). Grammar of graphics: gg basics. QCBS R Workshop Series.