R Bridge Course Week 2 Assignment

Select a dataset from http://vincentarelbundock.github.io/Rdatasets/, download it, and perform the following tasks:

library(readr)
iceCore <- read_csv("https://raw.githubusercontent.com/justinm0rgan/bridge-workshop/main/R/hw2/edcCO2.csv?token=GHSAT0AAAAAABPMFD5D4X2BPT6CJAYMB3F6YPAN2QA", col_select = c(2,3),
                    show_col_types = FALSE)
## New names:
## * `` -> ...1
  1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes of your data.
summary(iceCore)
##       age              co2       
##  Min.   :   137   Min.   :171.6  
##  1st Qu.:137134   1st Qu.:207.5  
##  Median :423206   Median :231.4  
##  Mean   :390906   Mean   :230.8  
##  3rd Qu.:627408   3rd Qu.:251.5  
##  Max.   :798512   Max.   :298.6
cat("Age mean:", round(mean(iceCore$age),2), "and  median:", median(iceCore$age))
## Age mean: 390906 and  median: 423206.5
cat("\nCO2 mean: ", round(mean(iceCore$co2),2), "and median:",
median(iceCore$co2))
## 
## CO2 mean:  230.84 and median: 231.45
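The mean and median can also be computed for every column in one pass instead of one `cat()` call per attribute. A minimal base R sketch, assuming a small made-up data frame named `toy` (hypothetical values, not the ice core data):

```r
# Toy stand-in for iceCore; the values are made up for illustration only.
toy <- data.frame(age = c(100, 200, 300, 1000),
                  co2 = c(280, 275, 270, 200))

# Apply mean() and median() to each column. The result is a matrix with
# one row per statistic and one column per attribute.
stats <- sapply(toy, function(x) c(mean = mean(x), median = median(x)))
print(round(stats, 2))
```

Rows of `stats` are indexed by statistic name and columns by attribute name, for example `stats["median", "co2"]`.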
  2. Create a new data frame with a subset of the columns AND rows. There are several ways to do this, so feel free to try a couple if you want. Make sure to give the new data set its own name so it does not simply overwrite the original.
iceCoreSub <- iceCore[1:10,]
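The line above keeps every column and only limits the rows. As a hedged illustration of subsetting rows and columns together, here are two common base R approaches on a made-up data frame (`toy`, `sub1`, and `sub2` are hypothetical names, and the values are not from the ice core data):

```r
# Toy stand-in with an extra column; values are made up for illustration.
toy <- data.frame(id  = 1:6,
                  age = c(100, 200, 300, 400, 500, 600),
                  co2 = c(280, 275, 278, 279, 276, 277))

# Bracket indexing: first four rows, two columns chosen by name.
sub1 <- toy[1:4, c("age", "co2")]

# subset(): rows chosen by a condition, columns chosen via select.
sub2 <- subset(toy, age < 350, select = c(age, co2))

print(sub1)
print(sub2)
```

Both results are assigned to new names, so `toy` itself is left unchanged.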
  3. Create new column names for each column in the new data frame created in step 2.
colnames(iceCoreSub) <- c("AGE", "CO2") 
  4. Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare (i.e. tell me how the values changed and why).
summary(iceCoreSub)
##       AGE             CO2       
##  Min.   :137.0   Min.   :274.9  
##  1st Qu.:308.0   1st Qu.:278.0  
##  Median :444.5   Median :279.6  
##  Mean   :483.0   Mean   :279.4  
##  3rd Qu.:643.8   3rd Qu.:280.9  
##  Max.   :877.0   Max.   :282.2
cat("\nAge mean:", round(mean(iceCoreSub$AGE),2), "and  median:", median(iceCoreSub$AGE))
## 
## Age mean: 483 and  median: 444.5
cat("\nCO2 mean:", round(mean(iceCoreSub$CO2),2), "and median:",
median(iceCoreSub$CO2))
## 
## CO2 mean: 279.37 and median: 279.6

The mean and median for age dropped sharply in the 10-row subset (mean from ~390,906 to 483, median from ~423,206 to 444.5) because the first 10 rows contain only very young ice core samples (ages 137 to 877), while ages in the full data frame reach 798,512. Age therefore varies widely across the full data set, which measures of dispersion such as range, variance, and standard deviation would quantify. Additionally, the mean being higher than the median suggests that the age distribution in the subset is positively skewed.

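The CO2 mean and median were both about 48 higher in the subset (279.37 vs. 230.84 and 279.6 vs. 231.45). The subset's CO2 values also fall in a much narrower range (274.9 to 282.2) than those of the original data frame (171.6 to 298.6), so CO2 varies less in the subset. Additionally, the near-equality of the mean and median suggests the subset's CO2 values are roughly symmetric with little skew, although this alone does not show that they are normally distributed.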
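The dispersion measures mentioned above (range, variance, standard deviation) and the mean-versus-median comparison are all available in base R. A minimal sketch on a made-up vector `x` (hypothetical values, not the ice core measurements):

```r
# Toy vector; made-up values for illustration only.
x <- c(2, 4, 4, 5, 7, 20)

# Measures of dispersion: width of the range, sample variance, and
# sample standard deviation.
spread <- c(range = diff(range(x)), var = var(x), sd = sd(x))

# A positive gap (mean above median) hints at positive skew; it is a
# rough heuristic, not a formal test.
gap <- mean(x) - median(x)

print(round(spread, 2))
print(gap)
```

Applied to the real columns, the same calls would put numbers on how much more the full data frame varies than the subset.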

summary(iceCore)
##       age              co2       
##  Min.   :   137   Min.   :171.6  
##  1st Qu.:137134   1st Qu.:207.5  
##  Median :423206   Median :231.4  
##  Mean   :390906   Mean   :230.8  
##  3rd Qu.:627408   3rd Qu.:251.5  
##  Max.   :798512   Max.   :298.6
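# Label each age by quartile, using the cut points printed by summary(iceCore).
# The last test uses >= so that an age equal to the third quartile is not left as NA.
iceCore$ageQuart <- ifelse(iceCore$age < 137134, "First",
                           ifelse((iceCore$age >= 137134) & (iceCore$age < 423206), "Second",
                                  ifelse((iceCore$age >= 423206) & (iceCore$age < 627408), "Third",
                                         ifelse(iceCore$age >= 627408, "Fourth", NA))))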
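An alternative to typing the cut points by hand is to let `quantile()` compute them and `cut()` assign the labels. This is a sketch of that technique on made-up ages, not the method used in this assignment; the vector `age` and its values are hypothetical:

```r
# Toy ages; made-up values for illustration only.
age <- c(10, 20, 30, 40, 50, 60, 70, 80)

# Quartile boundaries computed from the data rather than typed in by hand.
breaks <- quantile(age, probs = seq(0, 1, 0.25))

# include.lowest = TRUE keeps the minimum value in the first bin
# instead of turning it into NA.
ageQuart <- cut(age, breaks = breaks,
                labels = c("First", "Second", "Third", "Fourth"),
                include.lowest = TRUE)

print(table(ageQuart))
```

Note that `cut()` closes intervals on the right by default, so a value exactly on a boundary falls in the lower bin, whereas the nested `ifelse()` above puts it in the upper one. Passing `right = FALSE` switches that behavior.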
  5. For at least 3 different/distinct values in a column please rename so that every value in that column is renamed. For example, change the letter “e” to “excellent”, the letter “a” to “average” and the word “bad” to “terrible”.
iceCore$ageQuart[iceCore$ageQuart == "First"] <- "1st"
iceCore$ageQuart[iceCore$ageQuart == "Second"] <- "2nd"
iceCore$ageQuart[iceCore$ageQuart == "Third"] <- "3rd"
iceCore$ageQuart[iceCore$ageQuart == "Fourth"] <- "4th"
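The four assignments above can also be collapsed into a single lookup. A minimal sketch using a named character vector, with a made-up stand-in for the `ageQuart` column (`lookup` and `recoded` are hypothetical names):

```r
# Toy labels; a stand-in for the ageQuart column.
ageQuart <- c("First", "Third", "Second", "Fourth", "First")

# Named vector: names are the old values, elements are the new ones.
lookup <- c(First = "1st", Second = "2nd", Third = "3rd", Fourth = "4th")

# Index the lookup by the old values; unname() drops the leftover names.
recoded <- unname(lookup[ageQuart])
print(recoded)
```

Any value missing from `lookup` would come back as `NA`, which makes unmapped categories easy to spot.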
  6. Display enough rows to see examples of all of steps 1-5 above. This means use a function to show me enough row values that I can see the changes.
iceCore[order(iceCore$co2),][21:30,]
## # A tibble: 10 × 3
##       age   co2 ageQuart
##     <dbl> <dbl> <chr>   
##  1 659524  182. 4th     
##  2 749401  183. 4th     
##  3 746643  183. 4th     
##  4 163698  184. 2nd     
##  5 660084  184. 4th     
##  6 750476  184  4th     
##  7 718779  184. 4th     
##  8  22015  184. 1st     
##  9  21257  185. 1st     
## 10 271256  185. 2nd
  7. BONUS – place the original .csv in a GitHub file and have R read from the link. This should be your own GitHub – not the file source. This will be a very useful skill as you progress in your data science education and career.