Markdown Author: Jessie Bell, 2023
Libraries Used: none
Answers: orange
Growing season is defined as the number of days between the last frost of the spring and the first frost of the fall. A frost is defined as a minimum temperature below 32°F. The data below give the lengths of the growing seasons for 30 consecutive years (in order from the 1970s-1990s) in Leavenworth, WA and Mazama, WA.
growingSeason <- read.csv("DataExam1_growing_season (1).csv")
# Using Excel is great, but it can get super difficult to change MASSIVE amounts of data. Practicing these skills in R will help you later if you are ever dealing with large buoy, temperature, salinity, etc. datasets.
# you can change your data from "wide" to "long" in R by doing the following steps:
#pull each column out of the dataset
year <- growingSeason$Year
decade <- growingSeason$Decade
Leavenworth <- growingSeason$Leavenworth
LeavenworthLabel <- rep(c("Leavenworth"), times=length(Leavenworth)) #Make a new column that labels each value in Leavenworth with the category "Leavenworth"
Mazama <- growingSeason$Mazama
MazamaLabel <- rep(c("Mazama"), times=length(Mazama))
#bind your data together using cbind
Leavenworthdata <- cbind(year, decade, Leavenworth, LeavenworthLabel) #combine all Leavenworth data
Mazamadata <- cbind(year, decade, Mazama, MazamaLabel)
head(Mazamadata)
## year decade Mazama MazamaLabel
## [1,] "1970" "70" "200" "Mazama"
## [2,] "1971" "70" "202" "Mazama"
## [3,] "1972" "70" "205" "Mazama"
## [4,] "1973" "70" "219" "Mazama"
## [5,] "1974" "70" "226" "Mazama"
## [6,] "1975" "70" "217" "Mazama"
#bind your data together in longform using rbind
long_data <- rbind(Leavenworthdata, Mazamadata) #rbind is similar to cbind: cbind binds data in wide format (columns on columns) and rbind stacks it in long form (row on row)
#Rename the column headers
long_data <- data.frame(long_data)
colnames(long_data) <- c("Year", "Decade", "GrowingSeason", "Location")
#and voila!
Look at the data to get a sense of what’s there, then describe each variable. (2pts)
Answer: Year is a discrete numerical identifier, Decade is effectively ordinal categorical, and Leavenworth and Mazama (GrowingSeason in the long data) are discrete numerical counts of days. The data include the column headers shown below.
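A quick way to verify these types is a minimal str() call (a sketch; output not reproduced here), which prints each column's class:
str(growingSeason)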
head(growingSeason) #both datasets can get you to where you need to be in the end.
## Year Decade Leavenworth Mazama
## 1 1970 70 191 200
## 2 1971 70 198 202
## 3 1972 70 223 205
## 4 1973 70 224 219
## 5 1974 70 217 226
## 6 1975 70 208 217
head(long_data)
## Year Decade GrowingSeason Location
## 1 1970 70 191 Leavenworth
## 2 1971 70 198 Leavenworth
## 3 1972 70 223 Leavenworth
## 4 1973 70 224 Leavenworth
## 5 1974 70 217 Leavenworth
## 6 1975 70 208 Leavenworth
What kinds of questions can be asked with these data? (2pts)
Answer: In which year was the growing season longest on average? Which site has the longer growing season on average, Mazama or Leavenworth? In which decade was the growing season longest on average? etc…
Construct histograms of the growing season data for both sites (across all years). Save your histograms in a document and describe their shape. (2pts)
Answer: Both distributions seem to have a bit of a right skew, though Mazama could also be bimodal (two modes show in its histogram).
hist(growingSeason$Leavenworth, col="#e68613", main="Growing Season Distribution", xlab="Leavenworth")
hist(growingSeason$Mazama, col="#ed68ed", main="Growing Season Distribution", xlab="Mazama")
#another way to do this:
library(ggplot2)
ggplot(long_data, aes(as.numeric(GrowingSeason), fill=Location))+
geom_histogram(bins=15, alpha = 0.7, position = "identity")+ #alpha makes the colors a bit transparent (0 is clear and 1 is bold); position="identity" overlays the two sites instead of stacking them
labs(title="Growing Season Distribution", x="", y="Frequency")+
scale_fill_manual(values=c("#e68613", "#ed68ed"))
#as.numeric() forces GrowingSeason from category into numeric data
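Since the answer above floats the idea that Mazama might be bimodal, a kernel density plot is a quick check (a minimal sketch using base R's density()):
plot(density(growingSeason$Mazama, na.rm=TRUE), main="Mazama Growing Season Density", xlab="Growing season (days)", col="#ed68ed")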
Summarize the growing season by site across the whole dataset (include a measure of center and spread). Then visualize this summary. (4pts) Save your summary table and graphs in a document.
Answer: Leavenworth has a longer growing season (mean of 194 days) and a smaller spread (sd of 16.5 days); it is more consistent there than Mazama, which has a mean of 179 and an sd of 23.
#the first way with wide data "growingSeason"
#YOU HAVE TO SEPARATE YOUR DATA FIRST. This has already been done above but we can do it again if it helps
Mazama1 <- growingSeason$Mazama
Leavenworth1 <- growingSeason$Leavenworth
meanLeaven <- mean(Leavenworth1)
meanMaza <- mean(Mazama1)
sdLeaven <- sd(Leavenworth1)
sdMaza <- sd(Mazama1)
#the second way with long_data dataset
mean <- tapply(as.numeric(long_data$GrowingSeason), long_data$Location, "mean", na.rm=T) #note: reusing the names mean and sd for objects works, but it can be confusing in longer scripts
sd <- tapply(as.numeric(long_data$GrowingSeason), long_data$Location, "sd", na.rm=T)
summary <- data.frame(mean, sd)
summary
## mean sd
## Leavenworth 193.9667 16.53311
## Mazama 179.2000 22.64935
#the first way with growingSeason dataset
#set up a display that will print both boxplots side by side using the par() function.
par(mfrow=c(1,2))
boxplot(growingSeason$Leavenworth, ylim=c(135, 230), col="#e68613")
boxplot(growingSeason$Mazama, ylim=c(135, 230), col="#ed68ed")
#The second way with the long_data dataset
ggplot(long_data, aes(Location, as.numeric(GrowingSeason), fill=Location))+
geom_boxplot()+
labs(title="Growing Season", y="Length of season (days)", x="")+
scale_fill_manual(values=c("#e68613", "#ed68ed"))
Then summarize the growing season by decade across the whole data set (by a measure of center and spread). Then visualize this summary. (4pts)
Answer: The growing season shortened decade over decade: a mean of about 206 days in the 70s, 184 in the 80s, and 170 in the 90s, with the spread growing from an sd of about 11.7 to 16.5 days (see the table below).
seventy <- subset(growingSeason, Decade=="70")
eighty <- subset(growingSeason, Decade=="80")
ninety <- subset(growingSeason, Decade=="90")
#you can do this the long way with the growingSeason dataset: take the mean of both locations for each decade, then average the two (see the sketch below).
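#For example, a minimal sketch of that long way for the 70s (the 80s and 90s work the same);
#each site has ten years per decade, so averaging the two site means equals the pooled mean:
mean70s <- (mean(seventy$Leavenworth) + mean(seventy$Mazama)) / 2 #mean70s is a new name used just for this sketch
mean70s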
#Using long_data dataset
c <- tapply(as.numeric(long_data$GrowingSeason), long_data$Decade, "mean", na.rm=T) #short names like c work here, but they shadow base objects; more descriptive names are safer
d <- tapply(as.numeric(long_data$GrowingSeason), long_data$Decade, "sd", na.rm=T)
e <- c("70s", "80s", "90s")
#Rename the column headers
summarybydecade <- data.frame(e, c, d)
summarybydecade
## e c d
## 70 70s 205.95 11.69559
## 80 80s 184.10 16.16005
## 90 90s 169.70 16.52462
colnames(summarybydecade) <- c("Decade","Mean", "SD")
#and voila!
summarybydecade
## Decade Mean SD
## 70 70s 205.95 11.69559
## 80 80s 184.10 16.16005
## 90 90s 169.70 16.52462
#turn it into a prettier table:
#install.packages("gt")
library(gt)
table1 <- gt(summarybydecade)|>
opt_stylize(style=6, color="cyan")|>
tab_header(title = "Table 1: Summary Stats") |>
opt_table_font(google_font("Caveat"))
table1
Table 1: Summary Stats

| Decade | Mean   | SD       |
|--------|--------|----------|
| 70s    | 205.95 | 11.69559 |
| 80s    | 184.10 | 16.16005 |
| 90s    | 169.70 | 16.52462 |
# see! You can do SO MUCH IN R! You don't need any other program.
Show the comparison of the growing seasons by site and decade on one graph of your choice (4pts).
Answer: Because growing season is numeric and both site & decade are categories, we really only have a few ways to display this. Numerical-vs-categorical data are displayed using: 1) boxplots, 2) stripcharts, 3) violin plots. The easiest way to do this is with ggplot (a boxplot version below, plus a violin sketch after it).
ggplot(long_data, aes(Location, as.numeric(GrowingSeason), fill=Decade))+
labs(title= "Decadal growing seasons", x="", y="Growing Season (days)")+
geom_boxplot()
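#since violin plots were listed as an option above, here is a minimal sketch of that variant:
ggplot(long_data, aes(Location, as.numeric(GrowingSeason), fill=Decade))+
geom_violin()+
labs(title="Decadal growing seasons", x="", y="Growing Season (days)")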
#I really can't imagine using anything else except ggplot for this. Good luck friends!
Discuss your plot: how do the growing seasons in these two places compare? (2pts)
Answer: Over the decades the two sites show a similar trend, with the 70s having the longest growing season and the 90s the shortest. Variation in the 90s looks much lower at Mazama than at Leavenworth, and Mazama’s growing season is much shorter in comparison. I wonder what happened?
Make a probability distribution of the expected length of growing season in Mazama based on the data from the 1990s. Then make a probability distribution based on the expected growing season in Mazama based on the data from the 1970s-1990s. (2pts)
Answer: See distributions below. Look at the summary function to quickly view the data range. You should also look back at Table 1 to see the mean and sd by decade.
mean(ninety$Mazama) # we already subset this in previous quesion
## [1] 155.1
sd(ninety$Mazama)
## [1] 6.297266
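#as noted in the Answer, summary() shows the data range at a glance (a sketch):
summary(ninety$Mazama)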
seq1 <- seq(100, 228, .1) #seq creates a sequence of numbers from 100 to 228 in 0.1 increments (wide enough that the whole bell curve shows)
plot(seq1, dnorm(seq1, 155, 6.3), type="l", col="#00abd6", ylab = "Probability Density", xlab="The 90's",
main="PD for Mazama in the 90's X~N(155, 6.3)")
mean(growingSeason$Mazama) #Mazama only, across all three decades (the question asks about Mazama, so don't pool in Leavenworth)
## [1] 179.2
sd(growingSeason$Mazama)
## [1] 22.64935
plot(seq1, dnorm(seq1, 179, 22.6), type="l", col="#00abd6", ylab = "Probability Density", xlab="The 70's-90's",
main="PD for Mazama, 1970s-90s X~N(179, 22.6)")
#you can widen seq1 if you need to see more of the distribution (changing the x range)
What kind of values for length of growing season would we need to see in 2024 to find something statistically significant given the probability distribution from the data from the 1990s? If we include all the data from the past 3 decades, would it require that the growing season from 2024 be shorter or longer to be considered extreme? (3pts)
Answer: You should answer by comparing the tails of the two distributions above: against the narrow 1990s distribution (mean ≈ 155, sd ≈ 6.3), a 2024 season roughly two standard deviations from the mean would be extreme, while the all-decades Mazama distribution is centered higher and much wider (mean ≈ 179, sd ≈ 22.6), so check where its cutoffs land. I will accept a well thought out answer; you can see how I would have answered in Long Form Question 2, problem 10.
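A minimal sketch of two-sided 5% cutoffs under each distribution, using qnorm() with the means and sds found above:
qnorm(c(0.025, 0.975), mean = 155.1, sd = 6.3) #1990s only: extreme below ~143 or above ~167 days
qnorm(c(0.025, 0.975), mean = 179.2, sd = 22.6) #all three decades: extreme below ~135 or above ~223 days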
Given that WA state is currently in a drought in Fall 2023, there is great interest around temperature at WA’s ski areas, as temperature determines whether precipitation falls as rain or snow. The Northwest Avalanche Center manages weather stations across the state, supported by the WDOT and the US Forest Service. Two stations near WWU are the Mt Baker Heather Meadows station (approx. 4100 ft) and the Stevens Pass station (approx. 4800 ft).
tempData <- read.csv("DataExam1_temperature.csv")
Look at the data. Describe each variable. (2pts)
Answer: Date_Time._PST is a character string containing date and time stamps. Month: ordinal categorical; MtBaker_Temperature_deg.F: continuous numerical; StevensPass_Temperature_deg.F: continuous numerical.
head(tempData)
## Date_Time._PST Month MtBaker_Temperature_deg.F
## 1 10/23/23 11:00 October 35.93
## 2 10/23/23 10:00 October 36.37
## 3 10/23/23 9:00 October 37.46
## 4 10/23/23 8:00 October 37.28
## 5 10/23/23 7:00 October 37.04
## 6 10/23/23 6:00 October 36.70
## StevensPass_Temperature_deg.F
## 1 41.11
## 2 41.01
## 3 40.58
## 4 40.07
## 5 39.98
## 6 40.09
Discuss the dataset in terms of what kinds of comparisons/analyses can be made with this data. (2pts)
Answer: We can compare temperature differences between the two locations. We can separate the temperature data by month and look into those differences. We can also find out which passes are cold enough to develop a snowpack if it were to precipitate.
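For instance, a minimal sketch of that last idea, assuming 32°F as the freezing cutoff: the share of readings at or below freezing at each site.
mean(tempData$MtBaker_Temperature_deg.F <= 32, na.rm=TRUE) #proportion of Mt. Baker readings at or below freezing
mean(tempData$StevensPass_Temperature_deg.F <= 32, na.rm=TRUE) #same for Stevens Pass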
Construct histograms of the temperature data for both sites (across all days). Save your histograms in a document and describe their shape. (2pts)
Answer: Both temperature distributions look much the same: bell-shaped with a slight skew to the right.
hist(tempData$MtBaker_Temperature_deg.F, col="#00b8e7", main="Mt. Baker Temperature Distribution")
hist(tempData$StevensPass_Temperature_deg.F, col="#f8766d", main="Stevens Pass Temperature Distribution")
Summarize the temperature by site across the whole dataset (include a measure of center and spread). Then visualize this summary. (4pts)
Answer: Both locations have similar mean and spread. Both Mt. Baker and Stevens Pass are on average about 49 °F, with a standard deviation of about 8 degrees. Stevens Pass has more extreme values in high temperatures. That might be interesting to look into.
meanBaker <- mean(tempData$MtBaker_Temperature_deg.F, na.rm = T)
meanStevens <- mean(tempData$StevensPass_Temperature_deg.F, na.rm=T)
sdBaker <- sd(tempData$MtBaker_Temperature_deg.F, na.rm = T)
sdStevens <- sd(tempData$StevensPass_Temperature_deg.F, na.rm = T)
summarytable <- cbind(meanBaker, sdBaker, meanStevens, sdStevens)
summarytable
## meanBaker sdBaker meanStevens sdStevens
## [1,] 49.27052 8.526778 49.41775 8.234449
Baker <- tempData$MtBaker_Temperature_deg.F
Stevens <- (tempData$StevensPass_Temperature_deg.F)
both <- data.frame(Baker, Stevens)
summary(both)
## Baker Stevens
## Min. :32.66 Min. :32.60
## 1st Qu.:42.88 1st Qu.:44.31
## Median :47.51 Median :47.88
## Mean :49.27 Mean :49.42
## 3rd Qu.:54.21 3rd Qu.:52.83
## Max. :73.54 Max. :79.60
## NA's :5832 NA's :5841
boxplot(both, main="Mt. Baker & Stevens Pass Temps", vertical=T, col=c("#00b8e7", "#f8766d"), pch=21, ylab = "Temperature (F)", ylim = c(31, 85))
Below I show you another way to answer the first part of question 4. It seems to take longer to wrangle the data though.
Temperature <- as.numeric(tempData$MtBaker_Temperature_deg.F)
Location <- rep(c("Baker"), times=length(Temperature))
Stevens <- as.numeric(tempData$StevensPass_Temperature_deg.F)
StevensPass <- rep(c("StevensPass"), times=length(Stevens)) #rep() has no na.rm argument, so it is dropped here
baker2 <- cbind(Temperature, Location)
stevens2 <- cbind(Stevens, StevensPass)
newdata <- rbind(baker2, stevens2)
df <- data.frame(newdata) #data.frame() has no na.rm argument either; passing it would just create a junk na.rm column
#now you can use tapply() but this way seems a bit longer than the way shown above.
head(df)
## Temperature Location
## 1 35.93 Baker
## 2 36.37 Baker
## 3 37.46 Baker
## 4 37.28 Baker
## 5 37.04 Baker
## 6 36.7 Baker
a <- tapply(as.numeric(df$Temperature), df$Location, "mean", na.rm=T)
b <- tapply(as.numeric(df$Temperature), df$Location, "sd", na.rm=T)
data.frame(a,b)
## a b
## Baker 49.27052 8.526778
## StevensPass 49.41775 8.234449
#I don't think this way is faster than the first.
Then summarize the temperature by month across the whole data set (by a measure of center and spread). Then visualize this summary. (4pts)
Answer: You should have both a data summary and a data visualization below.
#data summary
october <- subset(tempData, Month == "October")
september <- subset(tempData, Month == "September")
BmeanOct <- mean(october$MtBaker_Temperature_deg.F, na.rm=T)
BsdOct <- sd(october$MtBaker_Temperature_deg.F, na.rm=T)
BmeanSept <- mean(september$MtBaker_Temperature_deg.F, na.rm=T)
BsdSept <- sd(september$MtBaker_Temperature_deg.F, na.rm=T)
SmeanOct <- mean(october$StevensPass_Temperature_deg.F, na.rm=T)
SsdOct <- sd(october$StevensPass_Temperature_deg.F, na.rm=T)
SmeanSept <- mean(september$StevensPass_Temperature_deg.F, na.rm=T)
SsdSept <- sd(september$StevensPass_Temperature_deg.F, na.rm=T)
summarybymonth <- cbind(BmeanOct, BsdOct, BmeanSept, BsdSept, SmeanOct, SsdOct, SmeanSept, SsdSept)
#print data summary
summarybymonth
## BmeanOct BsdOct BmeanSept BsdSept SmeanOct SsdOct SmeanSept SsdSept
## [1,] 48.07381 7.618898 50.16804 9.050607 47.61998 6.259151 50.78315 9.236373
You should use a boxplot to show your data because it displays both the center & the spread of your data. If you choose to display a bar graph, it only shows the mean.
par(mfrow=c(1,2))
boxplot(tempData$MtBaker_Temperature_deg.F ~ tempData$Month, ylab="Temperature (F)", xlab="", col="#00abd6", main="Mt. Baker")
boxplot(tempData$StevensPass_Temperature_deg.F ~ tempData$Month, ylab="", xlab="", col="#f8766d", main="Stevens Pass")
Show the comparison of the temperature by site and month on one graph of your choice (4pts).
Answer: You can show these data in any way that you choose, so I will accept anything here, as long as you chose a different graph than in problem 6 and it makes sense with the type of data you are showing. One possibility is sketched below.
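For example, a minimal sketch (tempLong and Site are new names introduced here) that reshapes to long form and draws grouped boxplots by site and month:
tempLong <- data.frame(Month = rep(tempData$Month, 2),
Temperature = c(tempData$MtBaker_Temperature_deg.F, tempData$StevensPass_Temperature_deg.F),
Site = rep(c("MtBaker", "StevensPass"), each = nrow(tempData)))
ggplot(tempLong, aes(Month, Temperature, fill=Site))+
geom_boxplot(na.rm=TRUE)+
labs(title="Temperature by site and month", x="", y="Temperature (F)")+
scale_fill_manual(values=c("#00b8e7", "#f8766d"))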
Discuss your plot: how do the temperatures in these two places compare? (2pts)
Answer: You should have real numbers in this summary. You should talk about the extreme values in the Stevens Pass data and discuss the spread and the mean of both locations.
Make a probability distribution of the expected temperature at Mt Baker based on the data from October. Then make a probability distribution based on the expected temperature at Mt Baker based on the data from September and October. (2pts)
Answer: The distributions are shown below.
summary(october$MtBaker_Temperature_deg.F)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.90 42.48 46.44 48.07 53.37 70.79
#this tells you the min of the October Mt. Baker data. I will start the seq below near the min and go to the max, 71, with a mean of 48. Previously we found the standard deviation for Baker in Oct is 7.6.
seq <- seq(25, 71, .1) #seq creates a sequence from 25 to 71 in 0.1 increments (I chose 35 at first and didn't like the bell shape, so I lowered it to see the whole curve)
plot(seq, dnorm(seq, 48, 7.6), type="l", col="#00abd6", ylab = "Probability Density", xlab="October",
main="PD for Mt. Baker X~N(48, 7.6)")
#for fun - not required
my.mean <- BmeanOct
my.mean
## [1] 48.07381
my.sd <- BsdOct
x = seq(-4, 4, length=100)*my.sd+my.mean
y = dnorm(x, my.mean, my.sd)
#now create reverse x so shading can occur
xrev <- c(x, rev(x))
#create a vector of y
yzers <- c(y, rep(0, length(y)))
plot(x, y, type="l",
ylab = 'Probability Density',
xlab = 'October',
main = 'PD for Mt. Baker X~N(48, 7.6)', col="#00abd6")+
polygon(xrev,yzers, col="#00abd6", border=NA)
## integer(0)
#for the September & October curve we reuse the same x-range, with the meanBaker (~49.3) and sdBaker (~8.5) values found earlier
seq <- seq(25, 71, .1)
plot(seq, dnorm(seq, meanBaker, sdBaker), type="l", col="#c178f7", ylab = "Probability Density", xlab="September & October",
main="PD for Mt. Baker X~N(49, 8.5)")
#for fun - not required
my.mean2 <- meanBaker
my.mean2
## [1] 49.27052
my.sd2 <- sdBaker
my.sd2
## [1] 8.526778
x = seq(-4, 4, length=100)*my.sd2+my.mean2
y = dnorm(x, my.mean2, my.sd2)
#now create reverse x so shading can occur
xrev <- c(x, rev(x))
#create a vector of y
yzers <- c(y, rep(0, length(y)))
plot(x, y, type="l",
ylab = 'Probability Density',
xlab = 'September & October',
main = 'PD for Mt. Baker X~N(49, 8.5)', col="#c178f7")+
polygon(xrev,yzers, col="#c178f7", border=NA)
## integer(0)
What kind of values for temperature would we need to see in October 2024 to find something statistically significantly different from this year, given the probability distribution from the data from October 2023? If we include all the data from September and October, how does it change the outcome? (3pts)
Answer: Both questions are asking about the extreme tails. The 2.5% upper tail sits about two standard deviations (1.96) above the mean; the code below uses three sds, an even more conservative cutoff. For October alone (mean ≈ 48, sd ≈ 7.6), a 2024 October averaging above roughly 71°F would clearly be statistically significant. Pooling September and October raises both the mean (≈ 49) and the sd (≈ 8.5), pushing the cutoff up to about 74.5°F, so an even warmer October would be required.
48+7.6*3
## [1] 70.8
49+8.5*3
## [1] 74.5
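For comparison, the two-sd (2.5% upper tail) cutoffs come straight from qnorm():
qnorm(0.975, 48, 7.6) #about 62.9 for October alone
qnorm(0.975, 49, 8.5) #about 65.7 for September & October combined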