Project 1 (Data 110)
Introduction
The data in this project come from the National Weather Service’s 122 Weather Forecast Offices, which operate under the National Oceanic and Atmospheric Administration. The data set includes the date (year, month, day), city, state, average, maximum, and minimum temperature, and wind direction and speed. I plan to explore the relationship between wind speed and temperature.
Load the Libraries & Data
library(tidyverse)
library(dplyr)
library(ggplot2)
library(forcats)
library(broom)
library(ggthemes)
library(RColorBrewer)
setwd("C:/Users/wesle/Downloads/Data 110")
weatherds <- readr::read_csv("weather.csv")
Cleaning Data
names(weatherds) <- tolower(names(weatherds)) # Change Uppercase to Lowercase
names(weatherds) <- gsub(" ",".",names(weatherds)) # Replace spaces with .
names(weatherds) <- gsub("\\.","_",names(weatherds)) # Replace . with _
weatherds1 <- weatherds |>
  mutate(station_state = fct_recode(station_state, "Delaware" = "DE", "Virginia" = "VA")) # Code from Data 101 to fix the values in station_state
Grouping and Getting the Average
wds1 <- weatherds1 |>
  group_by(station_state) |>
  summarize(avgtemp = mean(data_temperature_avg_temp)) # Makes a dataset grouped by state that has the average temperature of each state
wds2 <- weatherds1 |>
  group_by(station_state) |>
  summarize(avgws = mean(data_wind_speed)) # Makes a dataset grouped by state that has the average wind speed of each state
Joining the Datasets
tempws <- inner_join(wds1, wds2, by = "station_state") # Joining the two datasets by their state
Linear Regression
ggplot(tempws, aes(x = avgws, y = avgtemp)) +
  geom_point(alpha = 0.75, aes(color = station_state)) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  geom_text(aes(label = station_state), size = 2.25) +
  labs(x = "Average Wind Speeds (MPH)", y = "Average Temperature (F)", title = "Average Temperature vs. Average Wind Speeds in each US State and Puerto Rico", caption = "The National Weather Service (2016)") +
  scale_color_manual(values = c("red", "orange", "yellow", "green", "blue", "purple", "pink", "brown", "black", "maroon", "maroon1", "mediumseagreen", "dodgerblue4", "aquamarine", "skyblue1", "violet", "orangered3", "firebrick4", "orchid4", "seagreen1", "gold4", "gold", "darkslategrey", "deeppink2", "hotpink3", "indianred1", "khaki", "khaki4", "magenta", "tomato4", "tomato", "thistle", "turquoise1", "turquoise4", "plum1", "olivedrab", "olivedrab1", "chartreuse", "chartreuse4", "chocolate", "chocolate4", "chocolate1", "brown1", "darkolivegreen", "darkolivegreen1", "rosybrown", "rosybrown1", "seashell4", "salmon", "salmon4", "sienna1")) +
  theme_economist_white()
`geom_smooth()` using formula = 'y ~ x'
cor(tempws$avgtemp, tempws$avgws) # Checking correlation
[1] -0.1851887
lm1 <- lm(avgtemp ~ avgws, data = tempws)
tidy(lm1)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   61.2       3.77      16.2  2.35e-21
2 avgws         -0.731     0.554     -1.32 1.93e- 1
glance(lm1)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0343        0.0146  8.47      1.74   0.193     1  -180.  367.  372.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Code from Math 217 to get the equation, adjusted R squared, and p-value
Linear Regression/Model Equation
avgtemp = 61.2062 - 0.7305(avgws)
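As a quick illustration of how the equation can be used, the sketch below plugs a few wind speeds into it. The function name `predict_avgtemp` and the chosen wind speeds are made up for this example; only the two coefficients come from the model output above.

```r
# Hypothetical helper: evaluates the fitted equation from this report.
# The coefficients 61.2062 and -0.7305 are the values reported above.
predict_avgtemp <- function(avgws) {
  61.2062 - 0.7305 * avgws
}

# Predicted average temperature (F) at 0, 5, and 10 MPH average wind speed
predict_avgtemp(c(0, 5, 10))
# 61.2062 57.5537 53.9012

# With the fitted model object available, the equivalent call would be
# predict(lm1, newdata = data.frame(avgws = c(0, 5, 10)))
```

Each additional 1 MPH of average wind speed lowers the predicted average temperature by about 0.73 degrees, although the rest of the analysis shows that this slope is not statistically distinguishable from zero.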
Adjusted R Squared & p-value Meaning
The adjusted R squared is 0.0146, meaning that average wind speed explains only about 1.46% of the variance in average temperature across states after adjusting for the number of predictors (the unadjusted R squared is 0.0343). The p-value is 0.1933, which is greater than 0.05, so the null hypothesis, that average wind speed has no linear relationship with average temperature, cannot be rejected. In other words, the data do not provide evidence that average wind speeds have an impact on average temperatures.
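The decision rule described above can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated because weather.csv is not bundled here, and the helper `decide()`, the data frame `toy`, and the 50-row sample size are all hypothetical, not part of the original analysis.

```r
# Hypothetical helper (not from the report): turn a p-value into a decision.
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

# Simulated stand-in for tempws; the real data are not available in this sketch.
set.seed(110)
toy <- data.frame(avgws = runif(50, min = 2, max = 12))
toy$avgtemp <- 60 + rnorm(50, sd = 8) # temperature generated independently of wind speed

fit <- lm(avgtemp ~ avgws, data = toy)
fit_summary <- summary(fit)

slope_p <- coef(fit_summary)["avgws", "Pr(>|t|)"] # p-value for H0: slope = 0
adj_r2 <- fit_summary$adj.r.squared               # adjusted R squared

decide(slope_p) # decision for the simulated data
decide(0.1933)  # decision for the p-value reported above: "fail to reject H0"
```

The same two quantities are what `tidy()` and `glance()` from broom return in the `p.value` and `adj.r.squared` columns shown earlier.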
Plot Analysis
The points are widely scattered around the line of best fit from the linear model, which shows that the relationship between average wind speed and average temperature is weak. This is supported by their correlation coefficient of -0.1852, which is close to zero.
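The correlation coefficient and the model output can be cross-checked against each other: in a simple linear regression with one predictor, R squared equals the square of the correlation coefficient. The sketch below applies this to the values reported above and then confirms the identity on simulated data (the vectors `x` and `y` are made up for the check).

```r
# Values reported in this report
r <- -0.1851887      # output of cor(tempws$avgtemp, tempws$avgws)
r_squared <- r^2
round(r_squared, 4)  # 0.0343, matching r.squared from glance(lm1)

# The same identity on simulated data (x and y are hypothetical)
set.seed(1)
x <- rnorm(40)
y <- 2 - 0.5 * x + rnorm(40)
all.equal(cor(x, y)^2, summary(lm(y ~ x))$r.squared) # TRUE
```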
Additional Plot
ggplot(weatherds1, aes(x = data_wind_speed, y = data_temperature_avg_temp)) + # x and y now match the axis labels below
  geom_point(alpha = 0.75, aes(color = station_state)) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Average Wind Speeds (MPH)", y = "Average Temperature (F)", title = "Average Temperature vs. Average Wind Speeds in each US State/City and Puerto Rico", caption = "The National Weather Service (2016)") +
  scale_color_manual(values = c("red", "orange", "yellow", "green", "blue", "purple", "pink", "brown", "black", "maroon", "maroon1", "mediumseagreen", "dodgerblue4", "aquamarine", "skyblue1", "violet", "orangered3", "firebrick4", "orchid4", "seagreen1", "gold4", "gold", "darkslategrey", "deeppink2", "hotpink3", "indianred1", "khaki", "khaki4", "magenta", "tomato4", "tomato", "thistle", "turquoise1", "turquoise4", "plum1", "olivedrab", "olivedrab1", "chartreuse", "chartreuse4", "chocolate", "chocolate4", "chocolate1", "brown1", "darkolivegreen", "darkolivegreen1", "rosybrown", "rosybrown1", "seashell4", "salmon", "salmon4", "sienna1")) +
  theme_economist()
`geom_smooth()` using formula = 'y ~ x'
cor(weatherds1$data_temperature_avg_temp, weatherds1$data_wind_speed) # Checking correlation
[1] -0.1623299
Conclusion/Essay
The dataset didn’t have much that needed to be cleaned; the only things that needed to be changed were the column names and a value in two observations. For the column names, I first set all of the characters to lowercase, as there were a few uppercase characters in each column name. I then replaced all spaces with periods, and later replaced all of the periods with underscores. I did this in that order because originally I wasn’t able to figure out how to replace the periods that were already present in the column names with underscores. After reading the help page for gsub using ?gsub, I figured out how to replace those periods with underscores, so I added that line of code afterwards. I then changed the value in two observations that was inconsistent with the rest of the dataset: the station_state column contained DE and VA instead of Delaware and Virginia. This needed to be fixed because the other observations from Delaware and Virginia weren’t entered as DE or VA, and the averages for each state would otherwise have been split across two groups.
The second visualization was created to see how the graph would look when the values were still based on the city the data was taken from and not the overall average for the state. I also computed the correlation, as it was hard to tell how correlated the variables were with so many data points shown on the graph. Since that value was -0.1623, which is closer to zero than -0.1852, the correlation at the city level was slightly weaker than at the state level.
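The cleaning and grouping steps described in this essay can be sketched on a small made-up data frame. Everything in `toy` (the column names, states, and numbers) is hypothetical and only modeled on the columns used in this report. The sketch also shows two optional simplifications that are not what the report's code does: a single gsub call with the character class "[ .]" replaces spaces and periods in one step, and one summarise call computes both averages so no join is needed.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data with messy column names and inconsistent state values
toy <- data.frame(
  `Station.State` = c("DE", "Delaware", "VA", "Virginia"),
  `Data.Temperature.Avg Temp` = c(50, 54, 60, 62),
  `Data.Wind.Speed` = c(8, 10, 6, 4),
  check.names = FALSE
)

# Lowercase the names, then replace spaces and periods with underscores in one pass
names(toy) <- gsub("[ .]", "_", tolower(names(toy)))

# Recode the abbreviations, then compute both state averages in a single summarise()
state_means <- toy |>
  mutate(station_state = fct_recode(station_state, "Delaware" = "DE", "Virginia" = "VA")) |>
  group_by(station_state) |>
  summarise(
    avgtemp = mean(data_temperature_avg_temp),
    avgws = mean(data_wind_speed)
  )

state_means # one row per state: Delaware and Virginia
```

Without the recoding step, the toy data would produce four groups instead of two, which is the problem the essay describes for DE and VA.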
Dataset Link
https://corgis-edu.github.io/corgis/csv/weather/