This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
cereal: This is a data frame containing information about different cereals. Each row represents one cereal and each column represents an attribute of that cereal.
cereal[,1]: This part of the code selects the entire first column of the cereal data frame. The comma , is used to separate row and column indices. By leaving the row index empty before the comma, it selects all rows. The 1 after the comma indicates that only the first column should be selected.
rownames(cereal): This part of the code accesses the row names of the cereal data frame. In R, data frames have row names that can be used to label each row. By default, row names are set to the row numbers, but they can be changed to more meaningful values.
rownames(cereal) = cereal[,1]: This line of code sets the row names of the cereal data frame to the values found in its first column. This can be useful when the first column of a data frame contains unique identifiers for each row, such as cereal names, and you want to use these identifiers as row names for easier access and readability.
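The indexing and row-name steps described above can be sketched on a small stand-in data frame. The data frame `toy` and its values are made up for illustration and are not part of the cereal data.

```r
# Hypothetical stand-in for `cereal`; the values are invented for illustration.
toy <- data.frame(
  name = c("A_Flakes", "B_Puffs", "C_Bran"),
  sugars = c(6, -1, 10),
  stringsAsFactors = FALSE
)

first_col <- toy[, 1]        # empty row index: all rows of the first column
default_names <- rownames(toy)  # default row names are the row numbers as text

rownames(toy) <- toy[, 1]    # use the first column as row labels

# Rows can now be looked up by name instead of by position.
toy["B_Puffs", "sugars"]
```

Row names must be unique, so this only works when the first column has no repeated values.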
summary(cereal[,"sugars"]): The summary() function computes summary statistics for the "sugars" column of the cereal data frame. It returns the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values for this column. At this point the reported minimum is -1, because missing values are still coded as -1.
cereal[cereal==-1] = NA: This line of code replaces all “-1” values in the cereal data frame with NA (missing values). The expression cereal==-1 creates a logical matrix with the same dimensions as cereal, where each element is TRUE if the corresponding element in cereal is equal to -1 and FALSE otherwise. By using this logical matrix to index cereal, we can selectively assign NA values to all elements of cereal that are equal to -1.
summary(cereal[,"sugars"]): After the -1 values are recoded as NA, summary() is called again on the "sugars" column of the cereal data frame. The statistics now exclude the former -1 entry, which is treated as a missing value, and the output gains an NA's field that counts the missing values (one in this case).
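A minimal sketch of the recoding step, again on a made-up data frame (`toy` and its values are assumptions for illustration). It shows the logical matrix produced by the comparison and the effect on a column mean.

```r
# Hypothetical data; -1 plays the role of the "missing" placeholder.
toy <- data.frame(
  name = c("A_Flakes", "B_Puffs", "C_Bran", "D_Oats"),
  sugars = c(6, -1, 10, 3),
  potass = c(280, 135, -1, 70),
  stringsAsFactors = FALSE
)

mean_before <- mean(toy$sugars)   # the -1 placeholder drags the mean down

mask <- toy == -1                 # logical matrix with the same dimensions as toy
toy[mask] <- NA                   # assign NA wherever the mask is TRUE

mean_after <- mean(toy$sugars, na.rm = TRUE)  # NA is skipped, not counted as -1
summary(toy[, "sugars"])          # now reports an NA's count
```

The comparison also runs over the character column `name`, where it is simply FALSE for every row, so only the numeric placeholders are replaced.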
library(ggplot2)
library(tidyr)
## Reading in the data
cereal = read.csv("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/Cereal.csv")
head(cereal)
## name mfr type calories protein fat sodium fiber carbo
## 1 100%_Bran N C 70 4 1 130 10.0 5.0
## 2 100%_Natural_Bran Q C 120 3 5 15 2.0 8.0
## 3 All-Bran K C 70 4 1 260 9.0 7.0
## 4 All-Bran_with_Extra_Fiber K C 50 4 0 140 14.0 8.0
## 5 Almond_Delight R C 110 2 2 200 1.0 14.0
## 6 Apple_Cinnamon_Cheerios G C 110 2 2 180 1.5 10.5
## sugars potass vitamins shelf weight cups rating
## 1 6 280 25 3 1 0.33 68.40297
## 2 8 135 0 3 1 1.00 33.98368
## 3 5 320 25 3 1 0.33 59.42551
## 4 0 330 25 3 1 0.50 93.70491
## 5 8 -1 25 3 1 0.75 34.38484
## 6 10 70 25 1 1 0.75 29.50954
#summary(cereal)
rownames(cereal)= cereal[, 1]
print(rownames(cereal))
## [1] "100%_Bran"
## [2] "100%_Natural_Bran"
## [3] "All-Bran"
## [4] "All-Bran_with_Extra_Fiber"
## [5] "Almond_Delight"
## [6] "Apple_Cinnamon_Cheerios"
## [7] "Apple_Jacks"
## [8] "Basic_4"
## [9] "Bran_Chex"
## [10] "Bran_Flakes"
## [11] "Cap'n'Crunch"
## [12] "Cheerios"
## [13] "Cinnamon_Toast_Crunch"
## [14] "Clusters"
## [15] "Cocoa_Puffs"
## [16] "Corn_Chex"
## [17] "Corn_Flakes"
## [18] "Corn_Pops"
## [19] "Count_Chocula"
## [20] "Cracklin'_Oat_Bran"
## [21] "Cream_of_Wheat_(Quick)"
## [22] "Crispix"
## [23] "Crispy_Wheat_&_Raisins"
## [24] "Double_Chex"
## [25] "Froot_Loops"
## [26] "Frosted_Flakes"
## [27] "Frosted_Mini-Wheats"
## [28] "Fruit_&_Fibre_Dates,_Walnuts,_and_Oats"
## [29] "Fruitful_Bran"
## [30] "Fruity_Pebbles"
## [31] "Golden_Crisp"
## [32] "Golden_Grahams"
## [33] "Grape_Nuts_Flakes"
## [34] "Grape-Nuts"
## [35] "Great_Grains_Pecan"
## [36] "Honey_Graham_Ohs"
## [37] "Honey_Nut_Cheerios"
## [38] "Honey-comb"
## [39] "Just_Right_Crunchy__Nuggets"
## [40] "Just_Right_Fruit_&_Nut"
## [41] "Kix"
## [42] "Life"
## [43] "Lucky_Charms"
## [44] "Maypo"
## [45] "Muesli_Raisins,_Dates,_&_Almonds"
## [46] "Muesli_Raisins,_Peaches,_&_Pecans"
## [47] "Mueslix_Crispy_Blend"
## [48] "Multi-Grain_Cheerios"
## [49] "Nut&Honey_Crunch"
## [50] "Nutri-Grain_Almond-Raisin"
## [51] "Nutri-grain_Wheat"
## [52] "Oatmeal_Raisin_Crisp"
## [53] "Post_Nat._Raisin_Bran"
## [54] "Product_19"
## [55] "Puffed_Rice"
## [56] "Puffed_Wheat"
## [57] "Quaker_Oat_Squares"
## [58] "Quaker_Oatmeal"
## [59] "Raisin_Bran"
## [60] "Raisin_Nut_Bran"
## [61] "Raisin_Squares"
## [62] "Rice_Chex"
## [63] "Rice_Krispies"
## [64] "Shredded_Wheat"
## [65] "Shredded_Wheat_'n'Bran"
## [66] "Shredded_Wheat_spoon_size"
## [67] "Smacks"
## [68] "Special_K"
## [69] "Strawberry_Fruit_Wheats"
## [70] "Total_Corn_Flakes"
## [71] "Total_Raisin_Bran"
## [72] "Total_Whole_Grain"
## [73] "Triples"
## [74] "Trix"
## [75] "Wheat_Chex"
## [76] "Wheaties"
## [77] "Wheaties_Honey_Gold"
colnames(cereal)
## [1] "name" "mfr" "type" "calories" "protein" "fat"
## [7] "sodium" "fiber" "carbo" "sugars" "potass" "vitamins"
## [13] "shelf" "weight" "cups" "rating"
summary(cereal[,"sugars"])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.000 3.000 7.000 6.922 11.000 15.000
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ purrr 1.0.1
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
cereal %>%
filter(carbo==min(carbo, na.rm = TRUE)) %>%
select(name)
## name
## Quaker_Oatmeal Quaker_Oatmeal
## redefine "-1" as NA
cereal[cereal==-1] = NA
summary(cereal[,"sugars"])
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 7.000 7.026 11.000 15.000 1
ggplot(cereal, aes(x=mfr, y = carbo, fill=mfr)) + geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (`stat_boxplot()`).
## We can do this by grouping by manufacturer and then summarising the sugar variable with mean.
cereal %>%
group_by(mfr) %>%
summarise(meanSugars = mean(sugars, na.rm = TRUE))
## # A tibble: 7 × 2
## mfr meanSugars
## <chr> <dbl>
## 1 A 3
## 2 G 7.95
## 3 K 7.57
## 4 N 1.83
## 5 P 8.78
## 6 Q 6.14
## 7 R 6.12
cereal %>%
group_by(mfr) %>%
summarise(meanSugars = mean(sugars, na.rm = TRUE), medianSugars = median(sugars, na.rm = TRUE)) %>%
pivot_longer(cols=c("meanSugars", "medianSugars"), names_to="summary", values_to="sugars") %>%
ggplot(aes(x=mfr, y=sugars, colour=summary, group=summary)) + geom_point() + geom_line()
cereal %>%
filter(shelf == 2) %>%
select(name)
## name
## Apple_Jacks Apple_Jacks
## Cap'n'Crunch Cap'n'Crunch
## Cinnamon_Toast_Crunch Cinnamon_Toast_Crunch
## Cocoa_Puffs Cocoa_Puffs
## Corn_Pops Corn_Pops
## Count_Chocula Count_Chocula
## Cream_of_Wheat_(Quick) Cream_of_Wheat_(Quick)
## Froot_Loops Froot_Loops
## Frosted_Mini-Wheats Frosted_Mini-Wheats
## Fruity_Pebbles Fruity_Pebbles
## Golden_Grahams Golden_Grahams
## Honey_Graham_Ohs Honey_Graham_Ohs
## Kix Kix
## Life Life
## Lucky_Charms Lucky_Charms
## Maypo Maypo
## Nut&Honey_Crunch Nut&Honey_Crunch
## Raisin_Bran Raisin_Bran
## Smacks Smacks
## Strawberry_Fruit_Wheats Strawberry_Fruit_Wheats
## Trix Trix
respiratory <- read.delim("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/respiratory.txt", sep = "\t")
tab <- table(respiratory$treatment, respiratory$status)
tab
##
## good poor
## placebo 127 158
## treatment 172 98
chisq.test(tab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab
## X-squared = 19.682, df = 1, p-value = 9.148e-06
test = chisq.test(tab)
test$expected >= 5
##
## good poor
## placebo TRUE TRUE
## treatment TRUE TRUE
OR <- (tab[1,1]*tab[2,2])/(tab[2,1]*tab[1,2])
OR
## [1] 0.4579776
You can also embed plots, for example:
library(tidyverse): This line loads the tidyverse, a collection of R packages designed for data manipulation and visualization. It includes packages such as dplyr, ggplot2, tidyr, and readr.
cereal %>%: The %>% operator is a pipe from the magrittr package, which is also part of the tidyverse. It is used to chain together multiple functions. In this case, it takes the cereal data frame as input and passes it as the first argument to the next function, filter().
filter(carbo==min(carbo, na.rm = TRUE)): The filter() function from the dplyr package is used to subset the rows of the cereal data frame based on a condition. In this case, the condition is carbo == min(carbo, na.rm = TRUE), which selects the rows where the “carbo” value is equal to the minimum “carbo” value in the data frame. The na.rm = TRUE argument within the min() function is used to remove NA values when calculating the minimum.
select(name): The select() function from the dplyr package is used to choose a specific column from the filtered data frame. In this case, the “name” column is selected, which contains the names of the cereals.
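In the run above, this filter is applied before the -1 values are recoded as NA, so min() can pick up the -1 placeholder rather than a genuine minimum. The following sketch, which assumes dplyr is installed and uses a made-up data frame `toy`, shows how the order of the two steps changes the answer.

```r
library(dplyr)

# Hypothetical data; -1 marks a missing carbohydrate value.
toy <- data.frame(
  name = c("A_Flakes", "B_Puffs", "C_Bran", "D_Oats"),
  carbo = c(5, 14, -1, 8),
  stringsAsFactors = FALSE
)

# Before recoding: -1 is treated as the smallest real value.
before <- toy %>%
  filter(carbo == min(carbo, na.rm = TRUE)) %>%
  select(name)

# After recoding: na.rm = TRUE makes min() skip the missing entry,
# and filter() drops rows where the comparison is NA.
toy$carbo[toy$carbo == -1] <- NA
after <- toy %>%
  filter(carbo == min(carbo, na.rm = TRUE)) %>%
  select(name)
```

Here `before` contains the row with the placeholder, while `after` contains the cereal with the lowest recorded value.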
cereal %>%: The %>% operator is used to pass the cereal data frame as input to the following functions.
group_by(mfr): The group_by() function is used to group the data frame by the “mfr” (manufacturer) column.
summarise(meanSugars = mean(sugars, na.rm = TRUE), medianSugars = median(sugars, na.rm = TRUE)): The summarise() function is used to calculate summary statistics for each group (manufacturer). Here, the mean and median of the “sugars” column are calculated, ignoring NA values with na.rm = TRUE.
pivot_longer(cols=c("meanSugars", "medianSugars"), names_to="summary", values_to="sugars"): The pivot_longer() function is used to reshape the data from wide format to long format, which is suitable for plotting with ggplot2. The "meanSugars" and "medianSugars" columns are converted to a single "sugars" column, and a new "summary" column is created to indicate whether the value is a mean or median.
ggplot(aes(x=mfr, y=sugars, colour=summary, group=summary)): The ggplot() function initializes a new ggplot object with the specified aesthetics (x, y, color, and group). The x-axis represents the manufacturer, the y-axis represents the sugar values, the color represents the summary statistic (mean or median), and the group aesthetic is used to draw lines connecting points with the same summary statistic.
geom_point() + geom_line(): The geom_point() function adds a scatterplot layer, and the geom_line() function adds lines connecting the points. The resulting plot displays the mean and median sugar values for each manufacturer, with distinct colors for each summary statistic and lines connecting points of the same statistic.
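The reshaping part of this pipeline can be sketched without the plot. The example assumes dplyr and tidyr are installed; the data frame `toy` and its values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two manufacturers, one missing sugar value.
toy <- data.frame(
  mfr = c("K", "K", "G", "G", "G"),
  sugars = c(4, 8, 1, NA, 9),
  stringsAsFactors = FALSE
)

# Wide format: one row per manufacturer, one column per statistic.
wide <- toy %>%
  group_by(mfr) %>%
  summarise(meanSugars = mean(sugars, na.rm = TRUE),
            medianSugars = median(sugars, na.rm = TRUE))

# Long format: one row per manufacturer and statistic.
long <- wide %>%
  pivot_longer(cols = c("meanSugars", "medianSugars"),
               names_to = "summary", values_to = "sugars")
```

In `long`, each manufacturer appears twice, once per statistic, which is the shape ggplot2 needs to map the `summary` column to colour.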
respiratory <- read.delim("https://wimr-genomics.vip.sydney.edu.au/AMED3002/data/respiratory.txt", sep = "\t"): The read.delim() function is used to read a tab-separated file from the provided URL. The sep = "\t" argument specifies that the delimiter between values in the file is a tab character. The resulting data frame is assigned to the variable respiratory.
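To see the tab delimiter at work without downloading anything, read.delim() can parse an in-memory string through its text argument. The string `raw` below is a made-up two-row example and does not reflect the contents of the real file.

```r
# Hypothetical tab-separated text; "\t" is the tab character, "\n" ends a line.
raw <- "treatment\tstatus\nplacebo\tgood\ntreatment\tpoor\n"

# The first line becomes the header; sep = "\t" splits each line on tabs.
small <- read.delim(text = raw, sep = "\t")

names(small)
small
```

Tab is already the default separator for read.delim(), so spelling out sep = "\t" mainly documents the file format.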
tab <- table(respiratory$treatment, respiratory$status): The table() function is used to create a contingency table based on two categorical variables, "treatment" and "status", in the respiratory data frame. The resulting table shows the frequency distribution of the "status" variable across different "treatment" groups.
tab: This line of code prints the contingency table to the console.
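The contingency table and the odds ratio from the code above can be reproduced from the printed counts alone. The sketch below rebuilds the two variables with rep() instead of reading the file, which is an assumption made to keep the example self-contained. The counts themselves are the ones shown in the table output.

```r
# Counts taken from the printed table: placebo 127 good / 158 poor,
# treatment 172 good / 98 poor.
counts <- c(127, 158, 172, 98)
treatment <- rep(c("placebo", "placebo", "treatment", "treatment"), times = counts)
status <- rep(c("good", "poor", "good", "poor"), times = counts)

tab <- table(treatment, status)   # cross-tabulate the two categorical variables

test <- chisq.test(tab)           # Pearson's test with Yates' correction for 2 x 2
all(test$expected >= 5)           # check the expected-count rule of thumb

# Odds of a good status under placebo divided by the odds under treatment.
OR <- (tab[1, 1] * tab[2, 2]) / (tab[2, 1] * tab[1, 2])
OR
```

An odds ratio below 1 here means the odds of a good status are lower in the placebo group than in the treatment group.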