Instructions

Note: we have not covered all of these functions yet, but try to challenge yourself on ones like filter and mutate – we will be getting into them more in Thursday’s session.

Another note: you may need to run install.packages() to install these packages first.

## you can add more, or change...these are suggestions
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)

If you get stuck and have a chunk of code that just isn’t working, you can set it to eval=FALSE and it won’t be evaluated when you knit the Markdown file. See example below.

ghudhbg + 7

Problem Set

1. Create the following two objects.

  1. Make an object “bday”. Assign it your birthday in day-month format (1-Jan).
  2. Make another object “name”. Assign it your name. Make sure to use quotation marks for anything with text!
bday = "12-Jun"
name = "James"

2. Make an object “me” that is “bday” and “name” combined.

me <- c(name, bday)
me
## [1] "James"  "12-Jun"

3. Determine the data class for “me”.

class(me)
## [1] "character"

4. If I want to do me / 2 I get the following error: Error in me/2 : non-numeric argument to binary operator. Why? Write your answer as a comment inside the R chunk below.

# "me" is a character vector, and arithmetic operators such as / only work on numeric values, so R throws this error.
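As a small illustration of the answer above, the sketch below catches the error message rather than letting it stop the script, and then shows that division works once a value is truly numeric. The values are toy stand-ins, not part of the assignment data.

```r
# Toy stand-in for the "me" object from question 2
me <- c("James", "12-Jun")

# tryCatch() captures the error message instead of halting the script
msg <- tryCatch(me / 2, error = function(e) conditionMessage(e))
msg

# Arithmetic works once the value really is numeric
half <- as.numeric("10") / 2
half
```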

The following questions involve an outside dataset.

We will be working with a dataset from the “Kaggle” website, which hosts competitions for prediction and machine learning. This particular dataset contains information about temperature measures from the Rover Environmental Monitoring Station (REMS) on Mars. These data are collected by Spain and Finland. More details on this dataset are here: https://www.kaggle.com/datasets/deepcontractor/mars-rover-environmental-monitoring-station/data.

5. Bring the dataset into R. The dataset is located at: https://daseh.org/data/kaggleMars_Dataset.csv. You can use the link, download it, or use whatever method you like for getting the file. Once you get the file, read the dataset in using read_csv() and assign it the name mars.

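# Note: the prompt asks for read_csv() (readr); base R's read.csv() is used here,
# so "mars" is a plain data.frame rather than a tibble.
mars <- read.csv("kaggleMars_Dataset.csv")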
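The two readers behave slightly differently, which matters for the column classes reported later. The sketch below writes a tiny made-up CSV to a temporary file (the values are hypothetical, only the column names echo the Mars data) and reads it both ways, assuming the readr package is installed.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("earth_year,mean_pressure", "2022,707", "2021,S"), tmp)

base_df <- read.csv(tmp)                         # base R: plain data.frame
tidy_df <- read_csv(tmp, show_col_types = FALSE) # readr: tibble

class(base_df)
class(tidy_df)

# Whole numbers come back as integer from read.csv() and as double from read_csv()
class(base_df$earth_year)
class(tidy_df$earth_year)
```

In both cases a single non-numeric entry is enough to make the whole mean_pressure column character.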

6. Import the data “dictionary” from https://daseh.org/data/kaggleMars_dictionary.txt. Use the read_tsv() function and assign it the name “key”.

key <- read_tsv("dictionary.txt")
## Rows: 12 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): earth_year, Year on Earth
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

7. You should now be ready to work with the “mars” dataset.

  1. Preview the data so that you can see the names of the columns. There are several possible functions to do this.
  2. Determine the class of the columns using str() or glimpse(). Write your answer as a comment inside the R chunk below.
# 1.
names(mars)
##  [1] "earth_year"      "earth_date"      "mars_date"       "solar_day"      
##  [5] "max_ground_temp" "min_ground_temp" "max_air_temp"    "min_air_temp"   
##  [9] "mean_pressure"   "sunrise"         "sunset"          "UV_Radiation"   
## [13] "weather"
# 2.

#integer
str(mars[1])
## 'data.frame':    3197 obs. of  1 variable:
##  $ earth_year: int  2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
#char
str(mars[2])
## 'data.frame':    3197 obs. of  1 variable:
##  $ earth_date: chr  "01-26 UTC" "01-25 UTC" "01-24 UTC" "01-23 UTC" ...
#char
str(mars[3])
## 'data.frame':    3197 obs. of  1 variable:
##  $ mars_date: chr  "Mars, Month 6 - LS 163deg" "Mars, Month 6 - LS 163deg" "Mars, Month 6 - LS 162deg" "Mars, Month 6 - LS 162deg" ...
#integer
str(mars[4])
## 'data.frame':    3197 obs. of  1 variable:
##  $ solar_day: int  3368 3367 3366 3365 3364 3363 3362 3361 3360 3359 ...
#integer
str(mars[5])
## 'data.frame':    3197 obs. of  1 variable:
##  $ max_ground_temp: int  -3 -3 -4 -6 -7 -8 -4 -6 -6 -9 ...
#integer
str(mars[6])
## 'data.frame':    3197 obs. of  1 variable:
##  $ min_ground_temp: int  -71 -72 -70 -70 -71 -71 -72 -70 -71 -71 ...
#integer
str(mars[7])
## 'data.frame':    3197 obs. of  1 variable:
##  $ max_air_temp: int  10 10 8 9 8 8 5 5 3 5 ...
#integer
str(mars[8])
## 'data.frame':    3197 obs. of  1 variable:
##  $ min_air_temp: int  -84 -87 -81 -91 -92 -80 -84 -73 -89 -80 ...
#character
str(mars[9])
## 'data.frame':    3197 obs. of  1 variable:
##  $ mean_pressure: chr  "707" "707" "708" "707" ...
#character
str(mars[10])
## 'data.frame':    3197 obs. of  1 variable:
##  $ sunrise: chr  "5:25" "5:25" "5:25" "5:26" ...
#character
str(mars[11])
## 'data.frame':    3197 obs. of  1 variable:
##  $ sunset: chr  "17:20" "17:20" "17:21" "17:21" ...
#character
str(mars[12])
## 'data.frame':    3197 obs. of  1 variable:
##  $ UV_Radiation: chr  "moderate" "moderate" "moderate" "moderate" ...
#character
str(mars[13])
## 'data.frame':    3197 obs. of  1 variable:
##  $ weather: chr  "Sunny" "Sunny" "Sunny" "Sunny" ...
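The same information can be gathered in one call instead of one call per column. A minimal sketch on a made-up data frame (the name "toy" and its values are hypothetical):

```r
# Hypothetical miniature of the mars data
toy <- data.frame(
  earth_year   = c(2022L, 2021L),
  weather      = c("Sunny", "Sunny"),
  max_air_temp = c(10L, 8L),
  stringsAsFactors = FALSE
)

# str() on the whole data frame lists every column with its class
str(toy)

# sapply() with class() returns the classes as a named character vector
col_classes <- sapply(toy, class)
col_classes
```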

8. How many data points (rows) are in the dataset? How many variables (columns) are recorded for each data point?

library(glue)

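# A little helper for later on: report the dimensions of a data frame.
# arg1 supplies the column count and arg2 the row count; the calls below pass the same object twice.

datasize = function(arg1, arg2){
  rows <- nrow(arg2)
  columns <- ncol(arg1)
  # glue() interpolates numbers directly, so no conversion is needed
  glue("This dataset has, {columns} columns and {rows} rows")
}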
datasize(mars, mars)
## This dataset has, 13 columns and 3197 rows

9. Filter out (i.e., remove) measurements from earlier than 2015 (according to the Earth year), as well as any rows with missing data (NA). Replace the original “mars” object by reassigning the new filtered dataset to “mars”. How many data points are left after filtering?

Hint: use drop_na() to remove rows with missing values.

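mars_old <- mars
# Only rows with NA are removed here; the earth_year filter from the prompt is not applied,
# so the counts below still include measurements from before 2015.
mars <- drop_na(mars)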

datasize(mars, mars)
## This dataset has, 13 columns and 3168 rows
glue("the data set has changed by a value of {nrow(mars_old)-nrow(mars)}")
## the data set has changed by a value of 29
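Question 9 asks for two steps: keeping only measurements from 2015 onward and dropping rows with missing values. A minimal sketch of both steps chained together, using a made-up data frame (the name "toy" and its values are hypothetical) and assuming dplyr and tidyr are installed:

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of the mars data
toy <- data.frame(
  earth_year   = c(2013L, 2014L, 2015L, 2016L, 2017L),
  max_air_temp = c(5L, NA, 3L, NA, -2L)
)

toy_filtered <- toy %>%
  filter(earth_year >= 2015) %>% # keep 2015 and later
  drop_na()                      # then remove rows with any NA

nrow(toy_filtered)
```

Applied to the real data, this would change the row counts reported in this answer and in the later questions.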

10. From this point on, work with the filtered “mars” dataset from the above question. A Martian year is equivalent to 668.6 sols (or solar days). Create a new variable (column) called “years_since_landing” that shows how many Martian years the Curiosity rover had been on Mars for each measurement (divide “solar_day” by 668.6). Check to make sure the new column is there.

Hint: use the mutate() function.

library(dplyr)

mars <- mars %>%
  mutate(years_since_landing = solar_day / 668.6)
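The question also asks to check that the new column is there. One way to do that, sketched on a made-up data frame (the name "toy" and its values are hypothetical), assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical solar_day values: landing day, one Martian year, two Martian years
toy <- data.frame(solar_day = c(0, 668.6, 1337.2))

toy <- toy %>%
  mutate(years_since_landing = solar_day / 668.6)

# TRUE if the column was created
has_column <- "years_since_landing" %in% names(toy)
has_column

toy$years_since_landing
```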

11. What is the range of the maximum ground temperature (“max_ground_temp”) of the dataset?

glue("The range of max ground temps is {min(mars$max_ground_temp)} to {max(mars$max_ground_temp)}")
## The range of max ground temps is -67 to 11
glue("i.e. {abs(min(mars$max_ground_temp)-max(mars$max_ground_temp))}")
## i.e. 78
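Base R also has range(), which returns the minimum and maximum together. A minimal sketch on a made-up vector whose extremes match the values reported above:

```r
# Made-up temperatures with the same extremes as the answer above
temps <- c(-67, -10, 0, 11)

temp_range <- range(temps) # minimum and maximum in one call
temp_range

temp_span <- diff(temp_range) # width of the range: 11 - (-67) = 78
temp_span
```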

12. Create a random sample of atmospheric pressure readings from mars. To determine the column that corresponds to atmospheric pressure, check the “key” corresponding to the data dictionary that you imported above in question 6. Use sample() and pull(). Remember that by default random samples differ each time you run the code.

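# mean_pressure (column 9) holds atmospheric pressure; column 8 is min_air_temp.
# The sampled value differs on every run.
mars %>% pull(mean_pressure) %>% sample(1)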
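Because samples differ from run to run, setting a seed first makes a result repeatable. A minimal sketch on made-up pressure values (the name "toy", the seed, and the sample size of 3 are all arbitrary choices for illustration), assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical pressure readings
toy <- data.frame(mean_pressure = c(707, 708, 745, 752, 808, 832))

set.seed(123) # fix the random number generator state
first_draw <- toy %>% pull(mean_pressure) %>% sample(3)

set.seed(123) # same seed, so the same sample comes back
second_draw <- toy %>% pull(mean_pressure) %>% sample(3)

identical(first_draw, second_draw)
```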

13. How many data points are from days where the maximum ground temperature got above 0 degrees Celsius? What percent/proportion do these represent? Use:

above_zero <- mars %>% pull(max_ground_temp) %>% {. > 0} %>% sum()
glue("There were {above_zero} days where the ground temp was above zero")
## There were 242 days where the ground temp was above zero
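The second half of the question asks for a percent or proportion. Since TRUE counts as 1 and FALSE as 0, sum() gives the count and mean() gives the proportion. A minimal sketch on made-up temperatures:

```r
# Made-up maximum ground temperatures
temps <- c(-3, 4, 0, 6, -8)

n_above    <- sum(temps > 0)  # number of values above zero
prop_above <- mean(temps > 0) # proportion of values above zero
pct_above  <- 100 * prop_above

n_above
prop_above
pct_above
```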

14. How many different UV radiation levels (“UV_Radiation”) are there?

Hint: use length() with unique() or table(). Remember to pull() the right column.

UV_class <- mars %>% pull(UV_Radiation) %>% unique() %>% length()
glue("This data set reports {UV_class} levels of UV radiation")
## This data set reports 4 levels of UV radiation
# reports the number of UV classifications
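The hint also mentions table(), which shows how often each level occurs in addition to how many levels there are. A minimal sketch on a made-up vector of UV levels:

```r
# Made-up UV levels
uv <- c("moderate", "high", "high", "low", "very_high", "moderate")

n_levels <- length(unique(uv)) # number of distinct levels
n_levels

uv_counts <- table(uv) # frequency of each level
uv_counts
```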

15. How many different weather conditions (“weather”) are reported?

weather_discretion <- mars %>% pull(weather) %>% unique() %>% length()
glue("This data set reports {weather_discretion} categories of weather")
## This data set reports 1 categories of weather
# reports the number of weather categories

16. Which UV radiation level had the highest maximum air temperature, and what was it?

Hint: Use group_by() with summarize().

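max_index <- which.max(mars$max_air_temp)
air_max <- mars$max_air_temp[max_index]

# Look up UV_Radiation (column 12), not weather (column 13), for the hottest row
glue("The max air temp is {air_max} celsius at UV: {mars$UV_Radiation[max_index]}")
## The max air temp is 24 celsius at UV: very_high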
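The hint for this question suggests group_by() with summarize(), which gives the highest air temperature for every UV level at once. A minimal sketch on a made-up data frame (the name "toy" and its values are hypothetical), assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical miniature of the mars data
toy <- data.frame(
  UV_Radiation = c("low", "high", "high", "very_high", "moderate"),
  max_air_temp = c(2, 19, 22, 24, 10)
)

uv_max <- toy %>%
  group_by(UV_Radiation) %>%
  summarize(max_temp = max(max_air_temp)) %>% # one row per UV level
  arrange(desc(max_temp))                     # hottest level first

uv_max
```

Sorting with arrange() puts the level with the highest temperature in the first row, which is also what question 17 asks for.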

17. Extend on the code you wrote for question 16. Use the arrange() function to sort the output by maximum air temperature.

mars_sorted <- mars %>% arrange(desc(max_air_temp))
head(mars_sorted, 10)
##    earth_year earth_date                  mars_date solar_day max_ground_temp
## 1        2016  08-12 UTC  Mars, Month 7 - LS 202deg      1428               4
## 2        2020  06-14 UTC  Mars, Month 8 - LS 219deg      2793               6
## 3        2020  06-04 UTC  Mars, Month 8 - LS 213deg      2783              -5
## 4        2017  01-17 UTC Mars, Month 11 - LS 300deg      1582              -1
## 5        2016  11-10 UTC  Mars, Month 9 - LS 258deg      1516               4
## 6        2020  04-23 UTC  Mars, Month 7 - LS 188deg      2742              -3
## 7        2020  04-22 UTC  Mars, Month 7 - LS 187deg      2741              -1
## 8        2020  04-19 UTC  Mars, Month 7 - LS 185deg      2738               0
## 9        2020  04-18 UTC  Mars, Month 7 - LS 185deg      2737               0
## 10       2020  04-17 UTC  Mars, Month 7 - LS 184deg      2736               0
##    min_ground_temp max_air_temp min_air_temp mean_pressure sunrise sunset
## 1              -72           24          -69           808    5:18  17:24
## 2              -68           22          -85           832    5:22  17:33
## 3              -69           21          -83           817    5:20  17:29
## 4              -74           20          -75           862    6:34  18:48
## 5              -71           20          -75           909    5:52  18:09
## 6              -68           19          -77           752    5:18  17:21
## 7              -69           19          -70           752    5:18  17:21
## 8              -69           19          -84           745    5:19  17:20
## 9              -67           19          -81           745    5:19  17:20
## 10             -70           19          -83           745    5:19  17:20
##    UV_Radiation weather years_since_landing
## 1     very_high   Sunny            2.135806
## 2          high   Sunny            4.177386
## 3          high   Sunny            4.162429
## 4          high   Sunny            2.366138
## 5          high   Sunny            2.267424
## 6          high   Sunny            4.101107
## 7          high   Sunny            4.099611
## 8          high   Sunny            4.095124
## 9          high   Sunny            4.093628
## 10         high   Sunny            4.092133

18. How many measurements were taken on days when the UV radiation was “low” and the maximum air temperature was above freezing? Use:

# Logical statement inside of a pipeline
UV_low_nfreeze <- mars %>%
  filter(UV_Radiation == "low", max_air_temp > 0) %>%
  nrow()

#report to the terminal
glue("There were {UV_low_nfreeze} measurements taken when UV was Low and temperatures were above 0 Celsius")
## There were 13 measurements taken when UV was Low and temperatures were above 0 Celsius

19. On how many days was the UV radiation “high” or “very high”? Use:

# Logical statement inside of a pipeline
UV_HIGHS <- mars %>%
  filter(UV_Radiation == "high" | UV_Radiation == "very_high") %>%
  nrow()

glue("There were {UV_HIGHS} days where the UV Radiation levels were high or very high")
## There were 1635 days where the UV Radiation levels were high or very high
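When a column is compared against several values, %in% is a shorter alternative to chaining == with |. A minimal sketch showing that both forms agree on a made-up data frame (the name "toy" and its values are hypothetical), assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical UV levels
toy <- data.frame(UV_Radiation = c("high", "low", "very_high", "moderate", "high"))

# Two comparisons joined with "or"
n_or <- toy %>%
  filter(UV_Radiation == "high" | UV_Radiation == "very_high") %>%
  nrow()

# Same condition written with %in%
n_in <- toy %>%
  filter(UV_Radiation %in% c("high", "very_high")) %>%
  nrow()

n_or
n_in
```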

20. Select all columns in “mars” where the column name starts with “min” (using select() and starts_with()). Then, use colMeans() to summarize across these columns.

mars_min <- mars %>% select(starts_with("min"))


min_means <- colMeans(mars_min)

# view result
min_means
## min_ground_temp    min_air_temp 
##       -75.01515       -80.31755
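The same two steps can be written as one pipeline. A minimal sketch on a made-up data frame (the name "toy" and its values are hypothetical), assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical miniature of the mars data
toy <- data.frame(
  min_ground_temp = c(-70, -72),
  min_air_temp    = c(-80, -84),
  max_air_temp    = c(10, 8)
)

toy_means <- toy %>%
  select(starts_with("min")) %>% # keeps only the two "min" columns
  colMeans()                     # mean of each remaining column

toy_means
```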

21. Using “mars”, create a new binary (TRUEs and FALSEs) column to indicate if the day’s maximum air temperature was above freezing. Call the new column “above_freezing”.

mars <- mars %>%
  mutate(above_freezing = max_air_temp > 0)

22. What is the average atmospheric pressure for days that have an air temperature above freezing and UV radiation level of “moderate”? How does this compare with days that do NOT fit these criteria?

# add a logical column
mars <- mars %>%
  mutate(temp_uv_group = max_air_temp > 0 & UV_Radiation == "moderate")

mars <- mars %>%
  mutate(mean_pressure = as.numeric(mean_pressure))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `mean_pressure = as.numeric(mean_pressure)`.
## Caused by warning:
## ! NAs introduced by coercion
# One of the mean_pressure values contains an "S", so as.numeric() turns it into NA.

# That row is removed below.

mars_new <- mars
mars_new <- drop_na(mars_new)

datasize(mars_new, mars_new)
## This dataset has, 16 columns and 3167 rows
glue("the data set has changed by a value of {nrow(mars)-nrow(mars_new)}")
## the data set has changed by a value of 1
# now doing pressure summary
pressure_summary <- mars_new %>%
  group_by(temp_uv_group) %>%
  summarize(avg_pressure = mean(mean_pressure, na.rm = TRUE))

pressure_summary
## # A tibble: 2 × 2
##   temp_uv_group avg_pressure
##   <lgl>                <dbl>
## 1 FALSE                 829.
## 2 TRUE                  827.
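When as.numeric() warns about NAs introduced by coercion, it can help to look up which original entries failed to convert. A minimal sketch on a made-up character vector (the values are hypothetical):

```r
# Made-up pressure readings stored as text, one of them not a number
pressure_raw <- c("707", "708", "S", "745")

# Convert, silencing the expected coercion warning
pressure_num <- suppressWarnings(as.numeric(pressure_raw))

# Entries that were not NA before but are NA after conversion
bad_values <- pressure_raw[is.na(pressure_num) & !is.na(pressure_raw)]
bad_values

# na.rm = TRUE skips the NA when averaging
avg_pressure <- mean(pressure_num, na.rm = TRUE)
avg_pressure
```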

23. Among days with a “moderate” UV level that are above freezing, what is the distribution of the earth year in which these days occurred?

# Logical statement inside of a pipeline
UV_MOD_ANTIFREEZE <- mars %>%
  filter(UV_Radiation == "moderate" & above_freezing == TRUE)

UV_MOD_ANTIFREEZE %>%
  count(earth_year)
##   earth_year   n
## 1       2014  72
## 2       2015  41
## 3       2016  31
## 4       2017   6
## 5       2018  74
## 6       2019  72
## 7       2020 152
## 8       2021 126
## 9       2022  17

24. How many days (using filter() or sum() ) have a maximum ground or air temperature above zero and have a UV level of “high” or “very_high”?

# Logical statement inside of a pipeline
LOGICAL_MADNESS <- mars %>%
  filter((UV_Radiation == "high" | UV_Radiation == "very_high") & (max_ground_temp > 0 | max_air_temp > 0))

sum_logic <- nrow(LOGICAL_MADNESS)

glue("The condition is true for {sum_logic} days")
## The condition is true for 1273 days
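The question also allows sum(), which counts the TRUE values of the combined condition without creating a filtered data frame. A minimal sketch on a made-up data frame (the name "toy" and its values are hypothetical):

```r
# Hypothetical miniature of the mars data
toy <- data.frame(
  UV_Radiation    = c("high", "very_high", "low", "high"),
  max_ground_temp = c(-3, 2, 5, -1),
  max_air_temp    = c(4, -2, 6, -5)
)

# High or very high UV, and at least one of the two temperatures above zero
meets_condition <- toy$UV_Radiation %in% c("high", "very_high") &
  (toy$max_ground_temp > 0 | toy$max_air_temp > 0)

n_days <- sum(meets_condition)
n_days
```

The parentheses around the temperature test matter: without them, & would bind more tightly than | and the condition would mean something different.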

25. Make a boxplot (boxplot()) that looks at earth year (“earth_year”) on the x-axis and minimum air temperature (“min_air_temp”) on the y-axis.

boxplot(min_air_temp ~ earth_year, data = mars_new,
        main = "Minimum Air Temperature by Earth Year",
        xlab = "Earth Year",
        ylab = "Minimum Air Temperature (C)",
        col = "lightblue",     
        border = "gray40",        
        las = 2)                  

26. Knit your document into a report.

You use the knit button to do this. Make sure all your code is working first!