Select a dataset from http://vincentarelbundock.github.io/Rdatasets/, download it, and perform the following tasks:
library(readr)
iceCore <- read_csv("https://raw.githubusercontent.com/justinm0rgan/bridge-workshop/main/R/hw2/edcCO2.csv?token=GHSAT0AAAAAABPMFD5D4X2BPT6CJAYMB3F6YPAN2QA", col_select = c(2,3),
show_col_types = FALSE)
## New names:
## * `` -> ...1
summary(iceCore)
## age co2
## Min. : 137 Min. :171.6
## 1st Qu.:137134 1st Qu.:207.5
## Median :423206 Median :231.4
## Mean :390906 Mean :230.8
## 3rd Qu.:627408 3rd Qu.:251.5
## Max. :798512 Max. :298.6
cat("Age mean:", round(mean(iceCore$age),2), "and median:", median(iceCore$age))
## Age mean: 390906 and median: 423206.5
cat("\nCO2 mean: ", round(mean(iceCore$co2),2), "and median:",
median(iceCore$co2))
##
## CO2 mean: 230.84 and median: 231.45
iceCoreSub <- iceCore[1:10,]
colnames(iceCoreSub) <- c("AGE", "CO2")
summary(iceCoreSub)
## AGE CO2
## Min. :137.0 Min. :274.9
## 1st Qu.:308.0 1st Qu.:278.0
## Median :444.5 Median :279.6
## Mean :483.0 Mean :279.4
## 3rd Qu.:643.8 3rd Qu.:280.9
## Max. :877.0 Max. :282.2
cat("\nAge mean:", round(mean(iceCoreSub$AGE),2), "and median:", median(iceCoreSub$AGE))
##
## Age mean: 483 and median: 444.5
cat("\nCO2 mean:", round(mean(iceCoreSub$CO2),2), "and median:",
median(iceCoreSub$CO2))
##
## CO2 mean: 279.37 and median: 279.6
The mean and median for age dropped sharply in the 10-row subset (from ~390,000 to ~480 and from ~423,000 to ~444, respectively) because the subset contains only young samples, with ages between 137 and 877, and none of the old ones. This indicates a wide spread of ages in the full data frame (which we could confirm with measures of dispersion such as range, variance, and standard deviation). Additionally, the mean being higher than the median suggests the subset's age distribution is positively skewed.
The CO2 mean and median were both ~48 higher in the subset (279.37 vs. 230.84 and 279.6 vs. 231.45). CO2 levels also varied less in the subset (274.9 to 282.2) than in the original data frame (171.6 to 298.6). Additionally, the near equality of mean and median suggests the subset CO2 data is roughly symmetric, with little skew. On its own, this does not establish that the data is normally distributed.
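The dispersion and skew claims above can be checked directly. The following is a minimal sketch in base R, assuming a toy vector `x` whose values are made up for illustration and are not taken from edcCO2.csv. The helper `skewness()` is a hypothetical name defined here, not a base R function. Swap in `iceCore$age` or `iceCoreSub$CO2` to run it on the real columns.

```r
# Toy values, made up for illustration; replace x with iceCore$age,
# iceCoreSub$CO2, etc. to check the real columns.
x <- c(2, 3, 3, 4, 5, 5, 6, 9, 14, 21)

# Measures of dispersion
x_range <- diff(range(x))  # max minus min
x_var   <- var(x)          # sample variance (n - 1 denominator)
x_sd    <- sd(x)           # sample standard deviation

# Moment-based skewness (hypothetical helper, not part of base R):
# positive values suggest a longer right tail, values near 0 suggest symmetry
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

cat("Range:", x_range, " Variance:", round(x_var, 2),
    " SD:", round(x_sd, 2), "\n")
cat("Mean:", mean(x), " Median:", median(x),
    " Skewness:", round(skewness(x), 3), "\n")

# Shapiro-Wilk test: a small p-value is evidence against normality.
# Equal mean and median alone cannot show that data is normal.
print(shapiro.test(x))
```

With only 10 rows in the subset, any skewness estimate or normality test has little power, so the result is a rough indication rather than a firm conclusion.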
categorical variables prior to executing the next process. I decided to create an ageQuart column, which would label each value in the age column with its respective quartile.
summary(iceCore)
## age co2
## Min. : 137 Min. :171.6
## 1st Qu.:137134 1st Qu.:207.5
## Median :423206 Median :231.4
## Mean :390906 Mean :230.8
## 3rd Qu.:627408 3rd Qu.:251.5
## Max. :798512 Max. :298.6
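# Cutoffs are the quartile values reported by summary(iceCore) above.
# The last test uses >= so an age exactly at the 3rd-quartile cutoff is labelled instead of NA.
iceCore$ageQuart <- ifelse(iceCore$age < 137134, "First",
ifelse((iceCore$age >= 137134) & (iceCore$age < 423206), "Second",
ifelse((iceCore$age >= 423206) & (iceCore$age < 627408), "Third",
ifelse(iceCore$age >= 627408, "Fourth", NA))))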
iceCore$ageQuart[iceCore$ageQuart == "First"] <- "1st"
iceCore$ageQuart[iceCore$ageQuart == "Second"] <- "2nd"
iceCore$ageQuart[iceCore$ageQuart == "Third"] <- "3rd"
iceCore$ageQuart[iceCore$ageQuart == "Fourth"] <- "4th"
iceCore[order(iceCore$co2),][21:30,]
## # A tibble: 10 × 3
## age co2 ageQuart
## <dbl> <dbl> <chr>
## 1 659524 182. 4th
## 2 749401 183. 4th
## 3 746643 183. 4th
## 4 163698 184. 2nd
## 5 660084 184. 4th
## 6 750476 184 4th
## 7 718779 184. 4th
## 8 22015 184. 1st
## 9 21257 185. 1st
## 10 271256 185. 2nd
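The quartile labelling above relies on cutoffs typed in by hand from the summary() output. As an alternative sketch (not the method used above), the same column could be built with base R's `quantile()` and `cut()`, which compute the break points from the data. The data frame `df` below is a toy stand-in for iceCore with made-up ages.

```r
# Toy data frame standing in for iceCore; the ages are made up for illustration.
df <- data.frame(age = c(100, 250, 400, 550, 700, 850, 1000, 1150))

# Quartile break points computed from the data instead of typed in by hand
breaks <- quantile(df$age, probs = seq(0, 1, by = 0.25))

# right = FALSE makes each bin closed on the left, matching the >= / <
# comparisons in the nested ifelse() version; with right = FALSE,
# include.lowest = TRUE keeps the maximum value in the last bin.
df$ageQuart <- cut(df$age, breaks = breaks,
                   labels = c("1st", "2nd", "3rd", "4th"),
                   include.lowest = TRUE, right = FALSE)

# Count rows per quartile; an NA row here would reveal an unlabelled value
print(table(df$ageQuart, useNA = "ifany"))
```

Note that `cut()` returns a factor rather than a character vector, so the levels keep their 1st-to-4th order when tabulated or plotted. It also requires the break points to be unique, which may not hold for columns with many tied values.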