Markdown Author: Jessie Bell, 2023
Libraries Used: none
Answers: orange
Growing season is defined as the number of days between the last frost of the spring and the first frost of the fall. A frost is defined as a minimum temperature below 32°F. The data below give the lengths of the growing seasons for 30 consecutive years (in order from the 1970s-1990s) in Leavenworth, WA and Mazama, WA.
growingSeason <- read.csv("DataExam1_growing_season (1).csv")
# Using Excel is great, but it can get super difficult to change MASSIVE amounts of data. Practicing these skills in R will help you later if you are ever dealing with large buoy, temperature, salinity, etc. datasets.
# you can change your data from "wide" to "long" in R by doing the following steps:
#pull each column out of the dataset
year <- growingSeason$Year
decade <- growingSeason$Decade
Leavenworth <- growingSeason$Leavenworth
LeavenworthLabel <- rep(c("Leavenworth"), times=length(Leavenworth)) #Make a new column that labels each value in Leavenworth with the category "Leavenworth"
Mazama <- growingSeason$Mazama
MazamaLabel <- rep(c("Mazama"), times=length(Mazama))
#bind your data together using cbind
Leavenworthdata <- cbind(year, decade, Leavenworth, LeavenworthLabel) #combine all Leavenworth data
Mazamadata <- cbind(year, decade, Mazama, MazamaLabel)
head(Mazamadata)
## year decade Mazama MazamaLabel
## [1,] "1970" "70" "200" "Mazama"
## [2,] "1971" "70" "202" "Mazama"
## [3,] "1972" "70" "205" "Mazama"
## [4,] "1973" "70" "219" "Mazama"
## [5,] "1974" "70" "226" "Mazama"
## [6,] "1975" "70" "217" "Mazama"
#bind your data together in longform using rbind
long_data <- rbind(Leavenworthdata, Mazamadata) #rbind is similar to cbind: cbind binds data in wide format (columns on columns) and rbind stacks it in long form (row on row)
#Rename the column headers
long_data <- data.frame(long_data)
colnames(long_data) <- c("Year", "Decade", "GrowingSeason", "Location")
#and voila!
Look at the data to get a sense of what’s there, then describe each variable. (2pts)
Answer: Year is a discrete numerical identifier, Decade is effectively ordinal categorical, and Leavenworth and Mazama (GrowingSeason in the long data) are discrete numerical counts of days. The data include the column headers shown below.
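A quick way to verify these types is a minimal str() call (a sketch; output not reproduced here), which prints each column's class:
str(growingSeason)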
head(growingSeason) #both datasets can get you to where you need to be in the end.
## Year Decade Leavenworth Mazama
## 1 1970 70 191 200
## 2 1971 70 198 202
## 3 1972 70 223 205
## 4 1973 70 224 219
## 5 1974 70 217 226
## 6 1975 70 208 217
head(long_data)
## Year Decade GrowingSeason Location
## 1 1970 70 191 Leavenworth
## 2 1971 70 198 Leavenworth
## 3 1972 70 223 Leavenworth
## 4 1973 70 224 Leavenworth
## 5 1974 70 217 Leavenworth
## 6 1975 70 208 Leavenworth
What kinds of questions can be asked with these data? (2pts)
Answer: In which year was the growing season longest on average? Which site has the longer growing season on average, Mazama or Leavenworth? In which decade was the growing season longest on average? etc…
Construct histograms of the growing season data for both sites (across all years). Save your histograms in a document and describe their shape. (2pts)
Answer: Both distributions seem to have a bit of a right skew, though Mazama could also be bimodal (two modes show in its histogram).
hist(growingSeason$Leavenworth, col="#e68613", main="Growing Season Distribution", xlab="Leavenworth")
hist(growingSeason$Mazama, col="#ed68ed", main="Growing Season Distribution", xlab="Mazama")
#another way to do this:
library(ggplot2)
ggplot(long_data, aes(as.numeric(GrowingSeason), fill=Location))+
geom_histogram(bins=15, alpha = 0.7, position = "identity")+ #alpha makes the colors a bit transparent (0 is clear and 1 is bold); position="identity" overlays the two sites instead of stacking them
labs(title="Growing Season Distribution", x="", y="Frequency")+
scale_fill_manual(values=c("#e68613", "#ed68ed"))
#as.numeric() forces GrowingSeason from category into numeric data
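Since the answer above floats the idea that Mazama might be bimodal, a kernel density plot is a quick check (a minimal sketch using base R's density()):
plot(density(growingSeason$Mazama, na.rm=TRUE), main="Mazama Growing Season Density", xlab="Growing season (days)", col="#ed68ed")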
Summarize the growing season by site across the whole dataset (include a measure of center and spread). Then visualize this summary. (4pts) Save your summary table and graphs in a document.
Answer: Leavenworth has a longer growing season (mean of 194 days) and a smaller spread (sd of 16.5 days); it is more consistent there than Mazama, which has a mean of 179 and an sd of 23.
#the first way with wide data "growingSeason"
#YOU HAVE TO SEPARATE YOUR DATA FIRST. This has already been done above but we can do it again if it helps
Mazama1 <- growingSeason$Mazama
Leavenworth1 <- growingSeason$Leavenworth
meanLeaven <- mean(Leavenworth1)
meanMaza <- mean(Mazama1)
sdLeaven <- sd(Leavenworth1)
sdMaza <- sd(Mazama1)
#the second way with long_data dataset
mean <- tapply(as.numeric(long_data$GrowingSeason), long_data$Location, "mean", na.rm=T) #note: reusing the names mean and sd for objects works, but it can be confusing in longer scripts
sd <- tapply(as.numeric(long_data$GrowingSeason), long_data$Location, "sd", na.rm=T)
summary <- data.frame(mean, sd)
summary
## mean sd
## Leavenworth 193.9667 16.53311
## Mazama 179.2000 22.64935
#the first way with growingSeason dataset
#set up a display that will print both boxplots side by side using the par() function.
par(mfrow=c(1,2))
boxplot(growingSeason$Leavenworth, ylim=c(135, 230), col="#e68613")
boxplot(growingSeason$Mazama, ylim=c(135, 230), col="#ed68ed")
#The second way with the long_data dataset
ggplot(long_data, aes(Location, as.numeric(GrowingSeason), fill=Location))+
geom_boxplot()+
labs(title="Growing Season", y="Length of season (days)", x="")+
scale_fill_manual(values=c("#e68613", "#ed68ed"))
Then summarize the growing season by decade across the whole data set (by a measure of center and spread). Then visualize this summary. (4pts)
Answer: The growing season shortened decade over decade: a mean of about 206 days in the 70s, 184 in the 80s, and 170 in the 90s, with the spread growing from an sd of about 11.7 to 16.5 days (see the table below).
seventy <- subset(growingSeason, Decade=="70")
eighty <- subset(growingSeason, Decade=="80")
ninety <- subset(growingSeason, Decade=="90")
#you can do this the long way with the growingSeason dataset: take the mean of both locations for each decade, then average the two (see the sketch below).
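#For example, a minimal sketch of that long way for the 70s (the 80s and 90s work the same);
#each site has ten years per decade, so averaging the two site means equals the pooled mean:
mean70s <- (mean(seventy$Leavenworth) + mean(seventy$Mazama)) / 2 #mean70s is a new name used just for this sketch
mean70s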
#Using long_data dataset
c <- tapply(as.numeric(long_data$GrowingSeason), long_data$Decade, "mean", na.rm=T) #short names like c work here, but they shadow base objects; more descriptive names are safer
d <- tapply(as.numeric(long_data$GrowingSeason), long_data$Decade, "sd", na.rm=T)
e <- c("70s", "80s", "90s")
#Rename the column headers
summarybydecade <- data.frame(e, c, d)
summarybydecade
## e c d
## 70 70s 205.95 11.69559
## 80 80s 184.10 16.16005
## 90 90s 169.70 16.52462
colnames(summarybydecade) <- c("Decade","Mean", "SD")
#and voila!
summarybydecade
## Decade Mean SD
## 70 70s 205.95 11.69559
## 80 80s 184.10 16.16005
## 90 90s 169.70 16.52462
#turn it into a prettier table:
#install.packages("gt")
library(gt)
table1 <- gt(summarybydecade)|>
opt_stylize(style=6, color="cyan")|>
tab_header(title = "Table 1: Summary Stats") |>
opt_table_font(google_font("Caveat"))
table1
Table 1: Summary Stats

| Decade | Mean   | SD       |
|--------|--------|----------|
| 70s    | 205.95 | 11.69559 |
| 80s    | 184.10 | 16.16005 |
| 90s    | 169.70 | 16.52462 |
# see! You can do SO MUCH IN R! You don't need any other program.
Show the comparison of the growing seasons by site and decade on one graph of your choice (4pts).
Answer: Because growing season is numeric and both site & decade are categories, we really only have a few ways to display this. Numerical-vs-categorical data are displayed using: 1) boxplots, 2) stripcharts, 3) violin plots. The easiest way to do this is with ggplot (a boxplot version below, plus a violin sketch after it).
ggplot(long_data, aes(Location, as.numeric(GrowingSeason), fill=Decade))+
labs(title= "Decadal growing seasons", x="", y="Growing Season (days)")+
geom_boxplot()
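#since violin plots were listed as an option above, here is a minimal sketch of that variant:
ggplot(long_data, aes(Location, as.numeric(GrowingSeason), fill=Decade))+
geom_violin()+
labs(title="Decadal growing seasons", x="", y="Growing Season (days)")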
#I really can't imagine using anything else except ggplot for this. Good luck friends!
Discuss your plot: how do the growing seasons in these two places compare? (2pts)
Answer: Over the decades the two sites show a similar trend, with the 70s having the longest growing season and the 90s the shortest. Variation in the 90s looks much lower at Mazama than at Leavenworth, and Mazama’s growing season is much shorter in comparison. I wonder what happened?
Make a probability distribution of the expected length of growing season in Mazama based on the data from the 1990s. Then make a probability distribution based on the expected growing season in Mazama based on the data from the 1970s-1990s. (2pts)
Answer: See distributions below. Look at the summary function to quickly view the data range. You should also look back at Table 1 to see the mean and sd by decade.
mean(ninety$Mazama) # we already subset this in previous quesion
## [1] 155.1
sd(ninety$Mazama)
## [1] 6.297266
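#as noted in the Answer, summary() shows the data range at a glance (a sketch):
summary(ninety$Mazama)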
seq1 <- seq(100, 228, .1) #seq creates a sequence of numbers from 100 to 228 in 0.1 increments (wide enough that the whole bell curve shows)
plot(seq1, dnorm(seq1, 155, 6.3), type="l", col="#00abd6", ylab = "Probability Density", xlab="The 90's",
main="PD for Mazama in the 90's X~N(155, 6.3)")
mean(growingSeason$Mazama) #Mazama only, across all three decades (the question asks about Mazama, so don't pool in Leavenworth)
## [1] 179.2
sd(growingSeason$Mazama)
## [1] 22.64935
plot(seq1, dnorm(seq1, 179, 22.6), type="l", col="#00abd6", ylab = "Probability Density", xlab="The 70's-90's",
main="PD for Mazama, 1970s-90s X~N(179, 22.6)")
#you can widen seq1 if you need to see more of the distribution (changing the x range)
What kind of values for length of growing season would we need to see in 2024 to find something statistically significant given the probability distribution from the data from the 1990s? If we include all the data from the past 3 decades, would it require that the growing season from 2024 be shorter or longer to be considered extreme? (3pts)
Answer: You should answer by comparing the tails of the two distributions above: against the narrow 1990s distribution (mean ≈ 155, sd ≈ 6.3), a 2024 season roughly two standard deviations from the mean would be extreme, while the all-decades Mazama distribution is centered higher and much wider (mean ≈ 179, sd ≈ 22.6), so check where its cutoffs land. I will accept a well thought out answer; you can see how I would have answered in Long Form Question 2, problem 10.
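A minimal sketch of two-sided 5% cutoffs under each distribution, using qnorm() with the means and sds found above:
qnorm(c(0.025, 0.975), mean = 155.1, sd = 6.3) #1990s only: extreme below ~143 or above ~167 days
qnorm(c(0.025, 0.975), mean = 179.2, sd = 22.6) #all three decades: extreme below ~135 or above ~223 days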
Given that WA state is currently in a drought in Fall 2023, there is great interest around temperature at WA’s ski areas, as temperature determines whether precipitation falls as rain or snow. The Northwest Avalanche Center manages weather stations across the state, supported by the WDOT and the US Forest Service. Two stations near WWU are the Mt Baker Heather Meadows station (approx. 4100 ft) and the Stevens Pass station (approx. 4800 ft).
tempData <- read.csv("DataExam1_temperature.csv")
Look at the data. Describe each variable. (2pts)
Answer: Date_Time._PST is a character string containing date and time stamps. Month: ordinal categorical; MtBaker_Temperature_deg.F: continuous numerical; StevensPass_Temperature_deg.F: continuous numerical.
head(tempData)
## Date_Time._PST Month MtBaker_Temperature_deg.F
## 1 10/23/23 11:00 October 35.93
## 2 10/23/23 10:00 October 36.37
## 3 10/23/23 9:00 October 37.46
## 4 10/23/23 8:00 October 37.28
## 5 10/23/23 7:00 October 37.04
## 6 10/23/23 6:00 October 36.70
## StevensPass_Temperature_deg.F
## 1 41.11
## 2 41.01
## 3 40.58
## 4 40.07
## 5 39.98
## 6 40.09
Discuss the dataset in terms of what kinds of comparisons/analyses can be made with this data. (2pts)
Answer: We can compare temperature differences between the two locations. We can separate the temperature data by month and look into those differences. We can also find out which passes are cold enough to develop a snowpack if it were to precipitate.
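For instance, a minimal sketch of that last idea, assuming 32°F as the freezing cutoff: the share of readings at or below freezing at each site.
mean(tempData$MtBaker_Temperature_deg.F <= 32, na.rm=TRUE) #proportion of Mt. Baker readings at or below freezing
mean(tempData$StevensPass_Temperature_deg.F <= 32, na.rm=TRUE) #same for Stevens Pass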
Construct histograms of the temperature data for both sites (across all days). Save your histograms in a document and describe their shape. (2pts)
Answer: Both temperature distributions look much the same: bell-shaped with a slight skew to the right.
hist(tempData$MtBaker_Temperature_deg.F, col="#00b8e7", main="Mt. Baker Temperature Distribution")
hist(tempData$StevensPass_Temperature_deg.F, col="#f8766d", main="Stevens Pass Temperature Distribution")
Summarize the temperature by site across the whole dataset (include a measure of center and spread). Then visualize this summary. (4pts)
Answer: Both locations have similar mean and spread. Both Mt. Baker and Stevens Pass are on average about 49 °F, with a standard deviation of about 8 degrees. Stevens Pass has more extreme values in high temperatures. That might be interesting to look into.
meanBaker <- mean(tempData$MtBaker_Temperature_deg.F, na.rm = T)
meanStevens <- mean(tempData$StevensPass_Temperature_deg.F, na.rm=T)
sdBaker <- sd(tempData$MtBaker_Temperature_deg.F, na.rm = T)
sdStevens <- sd(tempData$StevensPass_Temperature_deg.F, na.rm = T)
summarytable <- cbind(meanBaker, sdBaker, meanStevens, sdStevens)
summarytable
## meanBaker sdBaker meanStevens sdStevens
## [1,] 49.27052 8.526778 49.41775 8.234449
Baker <- tempData$MtBaker_Temperature_deg.F
Stevens <- (tempData$StevensPass_Temperature_deg.F)
both <- data.frame(Baker, Stevens)
summary(both)
## Baker Stevens
## Min. :32.66 Min. :32.60
## 1st Qu.:42.88 1st Qu.:44.31
## Median :47.51 Median :47.88
## Mean :49.27 Mean :49.42
## 3rd Qu.:54.21 3rd Qu.:52.83
## Max. :73.54 Max. :79.60
## NA's :5832 NA's :5841
boxplot(both, main="Mt. Baker & Stevens Pass Temps", vertical=T, col=c("#00b8e7", "#f8766d"), pch=21, ylab = "Temperature (F)", ylim = c(31, 85))
Below I show you another way to answer the first part of question 4. It seems to take longer to wrangle the data though.
Temperature <- as.numeric(tempData$MtBaker_Temperature_deg.F)
Location <- rep(c("Baker"), times=length(Temperature))
Stevens <- as.numeric(tempData$StevensPass_Temperature_deg.F)
StevensPass <- rep(c("StevensPass"), times=length(Stevens)) #rep() has no na.rm argument, so it is dropped here
baker2 <- cbind(Temperature, Location)
stevens2 <- cbind(Stevens, StevensPass)
newdata <- rbind(baker2, stevens2)
df <- data.frame(newdata) #data.frame() has no na.rm argument either; passing it would just create a junk na.rm column
#now you can use tapply() but this way seems a bit longer than the way shown above.
head(df)
## Temperature Location
## 1 35.93 Baker
## 2 36.37 Baker
## 3 37.46 Baker
## 4 37.28 Baker
## 5 37.04 Baker
## 6 36.7 Baker
a <- tapply(as.numeric(df$Temperature), df$Location, "mean", na.rm=T)
b <- tapply(as.numeric(df$Temperature), df$Location, "sd", na.rm=T)
data.frame(a,b)
## a b
## Baker 49.27052 8.526778
## StevensPass 49.41775 8.234449
#I don't think this way is faster than the first.
Then summarize the temperature by month across the whole data set (by a measure of center and spread). Then visualize this summary. (4pts)
Answer: You should have both a data summary and a data visualization below.
#data summary
october <- subset(tempData, Month == "October")
september <- subset(tempData, Month == "September")
BmeanOct <- mean(october$MtBaker_Temperature_deg.F, na.rm=T)
BsdOct <- sd(october$MtBaker_Temperature_deg.F, na.rm=T)
BmeanSept <- mean(september$MtBaker_Temperature_deg.F, na.rm=T)
BsdSept <- sd(september$MtBaker_Temperature_deg.F, na.rm=T)
SmeanOct <- mean(october$StevensPass_Temperature_deg.F, na.rm=T)
SsdOct <- sd(october$StevensPass_Temperature_deg.F, na.rm=T)
SmeanSept <- mean(september$StevensPass_Temperature_deg.F, na.rm=T)
SsdSept <- sd(september$StevensPass_Temperature_deg.F, na.rm=T)
summarybymonth <- cbind(BmeanOct, BsdOct, BmeanSept, BsdSept, SmeanOct, SsdOct, SmeanSept, SsdSept)
#print data summary
summarybymonth
## BmeanOct BsdOct BmeanSept BsdSept SmeanOct SsdOct SmeanSept SsdSept
## [1,] 48.07381 7.618898 50.16804 9.050607 47.61998 6.259151 50.78315 9.236373
You should use a boxplot to show your data because it displays both the center & the spread of your data. If you choose to display a bar graph, it only shows the mean.
par(mfrow=c(1,2))
boxplot(tempData$MtBaker_Temperature_deg.F ~ tempData$Month, ylab="Temperature (F)", xlab="", col="#00abd6", main="Mt. Baker")
boxplot(tempData$StevensPass_Temperature_deg.F ~ tempData$Month, ylab="", xlab="", col="#f8766d", main="Stevens Pass")
Show the comparison of the temperature by site and month on one graph of your choice (4pts).
Answer: You can show these data in any way that you choose, so I will accept anything here, as long as you chose a different graph than in problem 6 and it makes sense with the type of data you are showing. One possibility is sketched below.
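For example, a minimal sketch (tempLong and Site are new names introduced here) that reshapes to long form and draws grouped boxplots by site and month:
tempLong <- data.frame(Month = rep(tempData$Month, 2),
Temperature = c(tempData$MtBaker_Temperature_deg.F, tempData$StevensPass_Temperature_deg.F),
Site = rep(c("MtBaker", "StevensPass"), each = nrow(tempData)))
ggplot(tempLong, aes(Month, Temperature, fill=Site))+
geom_boxplot(na.rm=TRUE)+
labs(title="Temperature by site and month", x="", y="Temperature (F)")+
scale_fill_manual(values=c("#00b8e7", "#f8766d"))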
Discuss your plot: how do the temperatures in these two places compare? (2pts)
Answer: You should have real numbers in this summary. You should talk about the extreme values in the Stevens Pass data and discuss the spread and the mean of both locations.
Make a probability distribution of the expected temperature at Mt Baker based on the data from October. Then make a probability distribution based on the expected temperature at Mt Baker based on the data from September and October. (2pts)
Answer: The distributions are shown below.
summary(october$MtBaker_Temperature_deg.F)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.90 42.48 46.44 48.07 53.37 70.79
#this tells you the min of the October Mt. Baker data. I will start the seq below near the min and go to the max, 71, with a mean of 48. Previously we found the standard deviation for Baker in Oct is 7.6.
seq <- seq(25, 71, .1) #seq creates a sequence from 25 to 71 in 0.1 increments (I chose 35 at first and didn't like the bell shape, so I lowered it to see the whole curve)
plot(seq, dnorm(seq, 48, 7.6), type="l", col="#00abd6", ylab = "Probability Density", xlab="October",
main="PD for Mt. Baker X~N(48, 7.6)")
#for fun - not required
my.mean <- BmeanOct
my.mean
## [1] 48.07381
my.sd <- BsdOct
x = seq(-4, 4, length=100)*my.sd+my.mean
y = dnorm(x, my.mean, my.sd)
#now create reverse x so shading can occur
xrev <- c(x, rev(x))
#create a vector of y
yzers <- c(y, rep(0, length(y)))
plot(x, y, type="l",
ylab = 'Probability Density',
xlab = 'October',
main = 'PD for Mt. Baker X~N(48, 7.6)', col="#00abd6")+
polygon(xrev,yzers, col="#00abd6", border=NA)
## integer(0)
#for the September & October curve we reuse the same x-range, with the meanBaker (~49.3) and sdBaker (~8.5) values found earlier
seq <- seq(25, 71, .1)
plot(seq, dnorm(seq, meanBaker, sdBaker), type="l", col="#c178f7", ylab = "Probability Density", xlab="September & October",
main="PD for Mt. Baker X~N(49, 8.5)")
#for fun - not required
my.mean2 <- meanBaker
my.mean2
## [1] 49.27052
my.sd2 <- sdBaker
my.sd2
## [1] 8.526778
x = seq(-4, 4, length=100)*my.sd2+my.mean2
y = dnorm(x, my.mean2, my.sd2)
#now create reverse x so shading can occur
xrev <- c(x, rev(x))
#create a vector of y
yzers <- c(y, rep(0, length(y)))
plot(x, y, type="l",
ylab = 'Probability Density',
xlab = 'September & October',
main = 'PD for Mt. Baker X~N(49, 8.5)', col="#c178f7")+
polygon(xrev,yzers, col="#c178f7", border=NA)
## integer(0)
What kind of values for temperature would we need to see in October 2024 to find something statistically significantly different from this year, given the probability distribution from the data from October 2023? If we include all the data from September and October, how does it change the outcome? (3pts)
Answer: Both questions are asking about the extreme tails. The 2.5% upper tail sits about two standard deviations (1.96) above the mean; the code below uses three sds, an even more conservative cutoff. For October alone (mean ≈ 48, sd ≈ 7.6), a 2024 October averaging above roughly 71°F would clearly be statistically significant. Pooling September and October raises both the mean (≈ 49) and the sd (≈ 8.5), pushing the cutoff up to about 74.5°F, so an even warmer October would be required.
48+7.6*3
## [1] 70.8
49+8.5*3
## [1] 74.5
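For comparison, the two-sd (2.5% upper tail) cutoffs come straight from qnorm():
qnorm(0.975, 48, 7.6) #about 62.9 for October alone
qnorm(0.975, 49, 8.5) #about 65.7 for September & October combined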