Note: we have not covered all of these functions yet, but try to challenge yourself on ones like filter and mutate – we will be getting into them more in Thursday’s session.
Another note: you may need to run install.packages() for these packages first.
## you can add more, or change...these are suggestions
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
If you get stuck and have a chunk of code that just isn’t working, you can set it to eval=FALSE and it won’t be evaluated when you knit the Markdown file. See example below.
ghudhbg + 7
1. Create the following two objects.
bday = "12-Jun"
name = "James"
2. Make an object “me” that is “bday” and “name” combined.
me <- c(name, bday)
me
## [1] "James" "12-Jun"
3. Determine the data class for “me”.
class(me)
## [1] "character"
4. If I want to do me / 2 I get the following error:
Error in me/2 : non-numeric argument to binary operator.
Why? Write your answer as a comment inside the R chunk below.
# You cannot use an arithmetic operator on a character vector: "/" is only defined for numeric values, and "me" holds text.
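As a small illustration of the answer to question 4, the sketch below shows that division works once a number stored as text is converted with as.numeric(), while dividing the character value directly fails. The object names (`num_text`, `half`, `result`) are made up for this example and are not part of the lab.

```r
# Hypothetical objects, not part of the lab data
num_text <- "10"      # a number stored as character
class(num_text)       # the class is character, just like "me"

# Converting to numeric first makes division possible
half <- as.numeric(num_text) / 2
half

# Dividing the character value directly raises an error, as with me / 2;
# tryCatch() captures the error message instead of stopping the script
result <- tryCatch(num_text / 2, error = function(e) conditionMessage(e))
result
```

Note that as.numeric() only helps when the text really is a number; a value such as "James" would become NA with a warning.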
The following questions involve an outside dataset.
We will be working with a dataset from the “Kaggle” website, which hosts competitions for prediction and machine learning. This particular dataset contains information about temperature measures from the Rover Environmental Monitoring Station (REMS) on Mars. These data are collected by Spain and Finland. More details on this dataset are here: https://www.kaggle.com/datasets/deepcontractor/mars-rover-environmental-monitoring-station/data.
5. Bring the dataset into R. The dataset is located at: https://daseh.org/data/kaggleMars_Dataset.csv. You can
use the link, download it, or use whatever method you like for getting
the file. Once you get the file, read the dataset in using
read_csv() and assign it the name mars.
mars <- read.csv("kaggleMars_Dataset.csv")
6. Import the data “dictionary” from https://daseh.org/data/kaggleMars_dictionary.txt. Use
the read_tsv() function and assign it the name “key”.
key <- read_tsv("dictionary.txt")
## Rows: 12 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): earth_year, Year on Earth
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
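The column specification above lists `earth_year` and `Year on Earth` as the column names, which suggests the dictionary file may have no header row, so its first entry was read as the header. If that is the case, one possible adjustment is to supply column names yourself. The sketch below uses a two-line stand-in for the file; the second line's text and the names `variable` and `description` are assumptions made for illustration.

```r
library(readr)

# Two-line stand-in for the dictionary file; I() marks the string as literal data
toy_dictionary <- I("earth_year\tYear on Earth\nsolar_day\tSol number\n")

key_demo <- read_tsv(
  toy_dictionary,
  col_names = c("variable", "description"),  # hypothetical column names
  show_col_types = FALSE                     # quiets the column specification message
)

# With names supplied, every line of the input is kept as a data row
nrow(key_demo)
key_demo
```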
7. You should now be ready to work with the “mars” dataset. Preview the column names, then determine the class of each column using str() or glimpse(). Write your answer as a comment inside the R chunk below.
# 1.
names(mars)
## [1] "earth_year" "earth_date" "mars_date" "solar_day"
## [5] "max_ground_temp" "min_ground_temp" "max_air_temp" "min_air_temp"
## [9] "mean_pressure" "sunrise" "sunset" "UV_Radiation"
## [13] "weather"
# 2.
#integer
str(mars[1])
## 'data.frame': 3197 obs. of 1 variable:
## $ earth_year: int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
#char
str(mars[2])
## 'data.frame': 3197 obs. of 1 variable:
## $ earth_date: chr "01-26 UTC" "01-25 UTC" "01-24 UTC" "01-23 UTC" ...
#char
str(mars[3])
## 'data.frame': 3197 obs. of 1 variable:
## $ mars_date: chr "Mars, Month 6 - LS 163deg" "Mars, Month 6 - LS 163deg" "Mars, Month 6 - LS 162deg" "Mars, Month 6 - LS 162deg" ...
#integer
str(mars[4])
## 'data.frame': 3197 obs. of 1 variable:
## $ solar_day: int 3368 3367 3366 3365 3364 3363 3362 3361 3360 3359 ...
#integer
str(mars[5])
## 'data.frame': 3197 obs. of 1 variable:
## $ max_ground_temp: int -3 -3 -4 -6 -7 -8 -4 -6 -6 -9 ...
#integer
str(mars[6])
## 'data.frame': 3197 obs. of 1 variable:
## $ min_ground_temp: int -71 -72 -70 -70 -71 -71 -72 -70 -71 -71 ...
#integer
str(mars[7])
## 'data.frame': 3197 obs. of 1 variable:
## $ max_air_temp: int 10 10 8 9 8 8 5 5 3 5 ...
#integer
str(mars[8])
## 'data.frame': 3197 obs. of 1 variable:
## $ min_air_temp: int -84 -87 -81 -91 -92 -80 -84 -73 -89 -80 ...
#character
str(mars[9])
## 'data.frame': 3197 obs. of 1 variable:
## $ mean_pressure: chr "707" "707" "708" "707" ...
#character
str(mars[10])
## 'data.frame': 3197 obs. of 1 variable:
## $ sunrise: chr "5:25" "5:25" "5:25" "5:26" ...
#character
str(mars[11])
## 'data.frame': 3197 obs. of 1 variable:
## $ sunset: chr "17:20" "17:20" "17:21" "17:21" ...
#character
str(mars[12])
## 'data.frame': 3197 obs. of 1 variable:
## $ UV_Radiation: chr "moderate" "moderate" "moderate" "moderate" ...
#character
str(mars[13])
## 'data.frame': 3197 obs. of 1 variable:
## $ weather: chr "Sunny" "Sunny" "Sunny" "Sunny" ...
8. How many data points (rows) are in the dataset? How many variables (columns) are recorded for each data point?
library(glue)
#a little function for later on
datasize = function(arg1, arg2){
  rows <- nrow(arg2)      # number of rows, taken from the second argument
  columns <- ncol(arg1)   # number of columns, taken from the first argument
  glue("This dataset has, {columns} columns and {rows} rows")
  # Output: "This dataset has, 13 columns and 3197 rows"
}
datasize(mars, mars)
## This dataset has, 13 columns and 3197 rows
9. Filter out (i.e., remove) measurements from earlier than 2015 (according to the Earth year), as well as any rows with missing data (NA). Replace the original “mars” object by reassigning the new filtered dataset to “mars”. How many data points are left after filtering?
Hint: use drop_na() to remove rows with missing values.
mars_old <- mars
mars <- drop_na(mars)
datasize(mars, mars)
## This dataset has, 13 columns and 3168 rows
glue("the data set has changed by a value of {nrow(mars_old)-nrow(mars)}")
## the data set has changed by a value of 29
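Question 9 also asks for measurements from before 2015 to be removed, while the chunk above applies only drop_na(). A minimal sketch of how both steps might be chained is shown here on a made-up data frame (`toy_mars` and its values are assumptions, so the counts reported in this document are not affected).

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the mars data (values are invented for illustration)
toy_mars <- data.frame(
  earth_year   = c(2013, 2014, 2015, 2016, 2017),
  max_air_temp = c(-5, 2, NA, 4, 1)
)

toy_filtered <- toy_mars %>%
  filter(earth_year >= 2015) %>%  # keep measurements from 2015 onward
  drop_na()                       # then drop rows with any missing value

# Number of rows left after both steps
nrow(toy_filtered)
```

Applied to the real data, the same two steps would be assigned back to "mars", as the question asks.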
10. From this point on, work with the filtered “mars” dataset from the above question. A Martian year is equivalent to 668.6 sols (or solar days). Create a new variable (column) called “years_since_landing” that shows how many Martian years the Curiosity rover had been on Mars for each measurement (divide “solar_day” by 668.6). Check to make sure the new column is there.
Hint: use the mutate() function.
library(dplyr)
mars <- mars %>%
  mutate(years_since_landing = solar_day / 668.6)
11. What is the range of the maximum ground temperature (“max_ground_temp”) of the dataset?
glue("The range of max ground temps is {min(mars$max_ground_temp)} to {max(mars$max_ground_temp)}")
## The range of max ground temps is -67 to 11
glue("i.e. {abs(min(mars$max_ground_temp)-max(mars$max_ground_temp))}")
## i.e. 78
12. Create a random sample of atmospheric pressure readings from
mars. To determine the column that corresponds to
atmospheric pressure, check the “key” corresponding to the data
dictionary that you imported above in question 6. Use
sample() and pull(). Remember that by default
random samples differ each time you run the code.
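# column 9 ("mean_pressure") holds atmospheric pressure; column 8 is "min_air_temp"
mars %>% pull(mean_pressure) %>% sample(1)
# the sampled value differs on each run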
13. How many data points are from days where the maximum ground temperature got above 0 degrees Celsius? What percent/proportion do these represent? Use:
filter() and nrow()
group_by() and summarize() or
sum()
above_zero <- mars %>% pull(5) %>% {. > 0} %>% sum()
glue("There were {above_zero} days where the ground temp was above zero")
## There were 242 days where the ground temp was above zero
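Question 13 also asks for the percent or proportion. One common way to get it is to take the mean() of a logical vector, since TRUE counts as 1 and FALSE as 0. The sketch below uses invented temperatures (`toy_temps` is an assumption, not the mars data).

```r
# Invented temperatures, for illustration only
toy_temps <- c(-3, 5, -8, 2, 0, -1)

n_above    <- sum(toy_temps > 0)    # count of values above zero
prop_above <- mean(toy_temps > 0)   # proportion of values above zero

n_above
round(100 * prop_above, 1)          # the same proportion expressed as a percent
```

With the real data, the same idea would be applied to the max_ground_temp column.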
14. How many different UV radiation levels (“UV_Radiation”) are there?
Hint: use length() with
unique() or table(). Remember to
pull() the right column.
UV_class <- mars %>% pull(12) %>% unique() %>% length()
glue("This data set reports {UV_class} levels of UV radiation")
## This data set reports 4 levels of UV radiation
# reports the number of UV classifications
15. How many different weather conditions (“weather”) are reported?
weather_discretion <- mars %>% pull(13) %>% unique() %>% length()
glue("This data set reports {weather_discretion} categories of weather")
## This data set reports 1 categories of weather
# reports the number of weather categories
16. Which UV radiation level had the highest maximum air temperature, and what was it?
Hint: Use group_by() with
summarize().
max_index <- which.max(mars[[7]])
air_max <- mars[[7]][max_index]
glue("The max air temp is {air_max} celsius at UV: {mars$UV_Radiation[max_index]}")
## The max air temp is 24 celsius at UV: very_high
17. Extend on the code you wrote for question 16. Use the
arrange() function to sort the output by maximum air
temperature.
mars_sorted <- mars %>% arrange(desc(max_air_temp))
head(mars_sorted, 10)
## earth_year earth_date mars_date solar_day max_ground_temp
## 1 2016 08-12 UTC Mars, Month 7 - LS 202deg 1428 4
## 2 2020 06-14 UTC Mars, Month 8 - LS 219deg 2793 6
## 3 2020 06-04 UTC Mars, Month 8 - LS 213deg 2783 -5
## 4 2017 01-17 UTC Mars, Month 11 - LS 300deg 1582 -1
## 5 2016 11-10 UTC Mars, Month 9 - LS 258deg 1516 4
## 6 2020 04-23 UTC Mars, Month 7 - LS 188deg 2742 -3
## 7 2020 04-22 UTC Mars, Month 7 - LS 187deg 2741 -1
## 8 2020 04-19 UTC Mars, Month 7 - LS 185deg 2738 0
## 9 2020 04-18 UTC Mars, Month 7 - LS 185deg 2737 0
## 10 2020 04-17 UTC Mars, Month 7 - LS 184deg 2736 0
## min_ground_temp max_air_temp min_air_temp mean_pressure sunrise sunset
## 1 -72 24 -69 808 5:18 17:24
## 2 -68 22 -85 832 5:22 17:33
## 3 -69 21 -83 817 5:20 17:29
## 4 -74 20 -75 862 6:34 18:48
## 5 -71 20 -75 909 5:52 18:09
## 6 -68 19 -77 752 5:18 17:21
## 7 -69 19 -70 752 5:18 17:21
## 8 -69 19 -84 745 5:19 17:20
## 9 -67 19 -81 745 5:19 17:20
## 10 -70 19 -83 745 5:19 17:20
## UV_Radiation weather years_since_landing
## 1 very_high Sunny 2.135806
## 2 high Sunny 4.177386
## 3 high Sunny 4.162429
## 4 high Sunny 2.366138
## 5 high Sunny 2.267424
## 6 high Sunny 4.101107
## 7 high Sunny 4.099611
## 8 high Sunny 4.095124
## 9 high Sunny 4.093628
## 10 high Sunny 4.092133
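The hints for questions 16 and 17 point to group_by(), summarize(), and arrange(). A minimal sketch of that grouped approach is shown here on a made-up data frame (`toy_mars`, its values, and the column name `highest_air_temp` are assumptions for illustration).

```r
library(dplyr)

# Toy stand-in for the mars data (values are invented for illustration)
toy_mars <- data.frame(
  UV_Radiation = c("low", "high", "high", "moderate", "low", "very_high"),
  max_air_temp = c(1, 12, 19, 7, -2, 24)
)

uv_summary <- toy_mars %>%
  group_by(UV_Radiation) %>%                               # one group per UV level
  summarize(highest_air_temp = max(max_air_temp)) %>%      # highest value within each group
  arrange(desc(highest_air_temp))                          # warmest group first

uv_summary
```

This gives one row per UV level, so the top row answers "which level had the highest maximum air temperature" directly.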
18. How many measurements were taken on days when the UV radiation was “low” and the maximum air temperature was above freezing? Use:
filter() and count()
filter() and tally() or
sum()
# Logical statement inside of a pipeline
UV_low_nfreeze <- mars %>%
filter(UV_Radiation == "low", max_air_temp > 0) %>%
nrow()
#report to the terminal
glue("There were {UV_low_nfreeze} measurements taken when UV was Low and temperatures were above 0 Celsius")
## There were 13 measurements taken when UV was Low and temperatures were above 0 Celsius
19. On how many days was the UV radiation “high” or “very high”? Use:
filter() and count()
filter() and tally() or
sum()
# Logical statement inside of a pipeline
UV_HIGHS <- mars %>%
filter(UV_Radiation == "high" | UV_Radiation == "very_high") %>%
nrow()
glue("There were {UV_HIGHS} days where the UV Radiation levels were high or very high")
## There were 1635 days where the UV Radiation levels were high or very high
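When a column is compared against several values, the `%in%` operator is a compact alternative to chaining `==` with `|`. The sketch below shows it with filter() and nrow(), using a made-up data frame (`toy_mars` is an assumption, not the mars data).

```r
library(dplyr)

# Toy stand-in for the mars data (values are invented for illustration)
toy_mars <- data.frame(
  UV_Radiation = c("low", "high", "very_high", "moderate", "high")
)

# Keep rows whose UV level is any of the listed values, then count them
n_high <- toy_mars %>%
  filter(UV_Radiation %in% c("high", "very_high")) %>%
  nrow()

n_high
```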
20. Select all columns in “mars” where the column name starts with
“min” (using select() and starts_with()). Then,
use colMeans() to summarize across these columns.
mars_min <- mars %>% select(starts_with("min"))
min_means <- colMeans(mars_min)
# view result
min_means
## min_ground_temp min_air_temp
## -75.01515 -80.31755
21. Using “mars”, create a new binary (TRUEs and FALSEs) column to indicate if the day’s maximum air temperature was above freezing. Call the new column “above_freezing”.
mars <- mars %>%
mutate(above_freezing = max_air_temp > 0)
22. What is the average atmospheric pressure for days that have an air temperature above freezing and UV radiation level of “moderate”? How does this compare with days that do NOT fit these criteria?
# add a logical column
mars <- mars %>%
mutate(temp_uv_group = max_air_temp > 0 & UV_Radiation == "moderate")
mars <- mars %>%
mutate(mean_pressure = as.numeric(mean_pressure))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `mean_pressure = as.numeric(mean_pressure)`.
## Caused by warning:
## ! NAs introduced by coercion
# for some reason one of the pressure rows had an S in it... I guess i will leave it for now?
# nevermind, removing
mars_new <- mars
mars_new <- drop_na(mars_new)
datasize(mars_new, mars_new)
## This dataset has, 16 columns and 3167 rows
glue("the data set has changed by a value of {nrow(mars)-nrow(mars_new)}")
## the data set has changed by a value of 1
# now doing pressure summary
pressure_summary <- mars_new %>%
group_by(temp_uv_group) %>%
summarize(avg_pressure = mean(mean_pressure, na.rm = TRUE))
pressure_summary
## # A tibble: 2 × 2
## temp_uv_group avg_pressure
## <lgl> <dbl>
## 1 FALSE 829.
## 2 TRUE 827.
23. Among days with a “moderate” UV level that are above freezing, what is the distribution of the earth year in which these days occurred?
# Logical statement inside of a pipeline
UV_MOD_ANTIFREEZE <- mars %>%
filter(UV_Radiation == "moderate" & above_freezing == TRUE)
UV_MOD_ANTIFREEZE %>%
count(earth_year)
## earth_year n
## 1 2014 72
## 2 2015 41
## 3 2016 31
## 4 2017 6
## 5 2018 74
## 6 2019 72
## 7 2020 152
## 8 2021 126
## 9 2022 17
24. How many days (using filter() or sum()) have a maximum ground or air temperature above zero and have a UV
level of “high” or “very_high”?
# Logical statement inside of a pipeline
LOGICAL_MADNESS <- mars %>%
filter((UV_Radiation == "high" | UV_Radiation == "very_high") & (max_ground_temp > 0 | max_air_temp > 0))
sum_logic <- nrow(LOGICAL_MADNESS)
glue("The condition is true for {sum_logic} days")
## The condition is true for 1273 days
25. Make a boxplot (boxplot()) that looks at earth year
(“earth_year”) on the x-axis and minimum air temperature
(“min_air_temp”) on the y-axis.
boxplot(min_air_temp ~ earth_year, data = mars_new,
main = "Minimum Air Temperature by Earth Year",
xlab = "Earth Year",
ylab = "Minimum Air Temperature (C)",
col = "lightblue",
border = "gray40",
las = 2)
26. Knit your document into a report.
You use the knit button to do this. Make sure all your code is working first!