# Load libraries and data files
library(tidyverse)
library(arrow)
library(leaflet)
setwd("C:/Users/enomc/OneDrive - montgomerycollege.edu/Documents/Data Science")
weather <- read_parquet("daily_weather.parquet")
countries <- read_csv("countries.csv")
cities <- read_csv("cities.csv")

finalprojectchoieno
Image source: Pexels (also linked)
For my dataset, I chose one focusing on weather. The data originates from NOAA. I plan to use the city name, date, country, season, latitude, longitude, and population variables. I chose this dataset because, while I was deciding on a topic, Maryland actually got a decent amount of snow! Weather is also something I check very often, whether I am deciding how to dress or whether to bring an umbrella or other accessories to stay warm. The data was collected by NOAA through weather monitoring stations and made available through the Meteostat API, a web service that provides weather observations.
Background information
The data used in this project comes from the Meteostat API, a web service that provides weather and climate data from many weather stations around the world. Meteostat collects information such as temperature, location, and dates from official weather stations and makes it available for people to use. The data can be filtered by location and time period, which made it useful for this project. Meteostat shares its data under an open license and also uses data from NOAA.
Meteostat. (n.d.). Meteostat Historical Weather & Climate Data API. Retrieved December 15, 2025, from https://rapidapi.com/meteostat/api/meteostat
I loaded the required libraries and datasets needed for data cleaning and analysis.
I made a new dataset with fewer observations, including only dates from 2020 onward.
# Make a new weather dataset with only dates on or after January 1, 2020
weather2 <- weather |>
  mutate(date2 = ymd(as.Date(date))) |>
  filter(date2 >= "2020-01-01")

I made a new dataset to include observations from 1970 onward.
# Make a new weather dataset with only dates on or after January 1, 1970
weather3 <- weather |>
  mutate(date2 = ymd(as.Date(date))) |>
  filter(date2 >= "1970-01-01")

I used the distinct() function to remove duplicate city name and station ID combinations from the cities dataset.
#Did with Mrs. Saidi
cities <- cities |> distinct(city_name, station_id, .keep_all = TRUE)
# I found this reference while troubleshooting how to correct my join:
# https://dplyr.tidyverse.org/reference/distinct.html

I joined the weather and cities datasets on their shared variables, city name and station ID. Next, I joined the result with the countries dataset to make one dataset. (2020)
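The reason for running distinct() before joining can be seen in a small sketch. The tables and values below are made up for illustration (weather_toy, cities_toy, and their contents are hypothetical, not taken from the project files); the point is only that a duplicated key in the lookup table repeats rows in a left join, while a de-duplicated lookup table keeps the row count unchanged.

```r
library(dplyr)

# Toy data with hypothetical values, not taken from the project files
weather_toy <- tibble(
  city_name  = c("A", "B"),
  station_id = c("s1", "s2"),
  avg_temp_c = c(10, 20)
)
cities_toy <- tibble(
  city_name  = c("A", "A", "B"),   # city A / station s1 appears twice
  station_id = c("s1", "s1", "s2"),
  latitude   = c(39, 39, 50)
)

# Joining against the duplicated lookup table repeats the weather row for city A
dup_join <- left_join(weather_toy, cities_toy,
                      by = c("city_name", "station_id"))
nrow(dup_join)    # 3 rows, although weather_toy has only 2

# After distinct(), each key appears once, so the row count is preserved
cities_unique <- distinct(cities_toy, city_name, station_id, .keep_all = TRUE)
clean_join <- left_join(weather_toy, cities_unique,
                        by = c("city_name", "station_id"))
nrow(clean_join)  # 2 rows, same as weather_toy
```

Comparing nrow() before and after a join, as done here, is one simple way to confirm that a join did not add rows.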
#join weather2 and cities dataset which both use city_name, and station_id
join1 <- left_join(weather2, cities, by = c("city_name", "station_id"))
# Join join1 with countries so that all three datasets are combined
choidataset <- left_join(join1, countries, by = c("country", "iso2", "iso3"))

Joining the weather, cities, and countries datasets (1970).
#join weather3 and cities dataset which both use city_name, and station_id
join2 <- left_join(weather3, cities, by = c("city_name", "station_id"))
# Join join2 with countries so that all three datasets are combined
joined_1970 <- left_join(join2, countries, by = c("country", "iso2", "iso3"))

Export dataset
# Write to CSV in order to submit on dropbox
write_csv(choidataset, "weather_NOAA.csv")

Multiple linear regression analysis
# Turn off scientific notation for readability (also used in project 1). Source:
# https://stackoverflow.com/questions/25946047/how-to-prevent-scientific-notation-in-r
options(scipen = 999)
lin_reg <- lm(avg_temp_c ~ latitude + season, data = choidataset)
plot(lin_reg)



summary(lin_reg)
Call:
lm(formula = avg_temp_c ~ latitude + season, data = choidataset)

Residuals:
    Min      1Q  Median      3Q     Max
-60.978  -4.936   0.748   5.991  26.453

Coefficients:
               Estimate Std. Error  t value             Pr(>|t|)
(Intercept)  22.2727059  0.0154865 1438.204 < 0.0000000000000002 ***
latitude     -0.1793000  0.0002607 -687.886 < 0.0000000000000002 ***
seasonSpring  0.1496316  0.0195642    7.648   0.0000000000000204 ***
seasonSummer  6.5034694  0.0194376  334.581 < 0.0000000000000002 ***
seasonWinter -6.7806967  0.0197087 -344.045 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.556 on 1616453 degrees of freedom
  (4646 observations deleted due to missingness)
Multiple R-squared:  0.3727,  Adjusted R-squared:  0.3727
F-statistic: 2.401e+05 on 4 and 1616453 DF,  p-value: < 0.00000000000000022
I chose average temperature in Celsius as my response variable and used latitude and season as predictors. Latitude affects how warm a location is because it reflects how near the location is to the equator, and seasons bring temperature differences throughout the year. The model equation is: average temperature (°C) = 22.273 - 0.179(latitude) + 0.15(seasonSpring) + 6.50(seasonSummer) - 6.78(seasonWinter). Latitude has a negative slope: for each 1-degree increase in latitude, the average temperature goes down by about 0.18°C, holding season constant. Compared with fall, spring is about 0.15°C warmer, summer is about 6.50°C warmer, and winter is about 6.78°C colder. Fall is not included because it is the baseline season, so these values are all relative to fall. All the predictors have three asterisks, meaning they are statistically significant (p < 0.001). The adjusted R² is about 0.37, so the model explains about 37% of the variation in average temperature. The remaining 63% or so is not explained by this model.
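To make the model equation concrete, here is a minimal sketch that plugs values into it by hand. The coefficients are copied from the summary(lin_reg) output; the helper predict_temp() and the example latitude of 39 (roughly Maryland's latitude) are illustrative assumptions and were not part of the original analysis.

```r
# Coefficients copied from the summary(lin_reg) output
coefs <- c(
  intercept = 22.2727059,
  latitude  = -0.1793000,
  Spring    = 0.1496316,
  Summer    = 6.5034694,
  Winter    = -6.7806967
)

# Hypothetical helper: predicted average temperature in °C for one latitude
# and one season. Fall is the baseline season, so it adds no seasonal offset.
predict_temp <- function(latitude, season) {
  offset <- if (season == "Fall") 0 else coefs[[season]]
  coefs[["intercept"]] + coefs[["latitude"]] * latitude + offset
}

predict_temp(39, "Winter")  # about 8.5
predict_temp(39, "Summer")  # about 21.8
```

With the fitted model object available, predict(lin_reg, newdata = ...) should give the same numbers without copying coefficients by hand.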
Creating a dataset for recent visualization, starting from 2023
enodataset <- choidataset |>
  mutate(date2 = ymd(as.Date(date))) |>
  filter(date2 >= "2023-01-01")

Visualization 1, latitude vs average temp
eplot <- ggplot(
  enodataset,
  aes(
    x = latitude,
    y = avg_temp_c,
    color = season,
    # Tooltip text used by the interactive version of the plot
    text = paste(
      "City:", city_name,
      "<br>Season:", season,
      "<br>Avg Temp C:", avg_temp_c,
      "<br>Latitude:", latitude
    )
  )
) +
  geom_point(alpha = 0.05, size = 2) +
  geom_smooth(se = FALSE) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Relationship Between Latitude and Average Temperature",
    x = "Latitude (degrees)",
    y = "Average Temperature (°C)",
    color = "Season",
    size = "Population",  # no size aesthetic is mapped, so ggplot2 ignores this label
    caption = "Source: NOAA via Meteostat API"
  ) +
  theme_minimal(base_size = 11)
eplot

Ignoring unknown labels:
• size : "Population"
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 18 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 18 rows containing missing values or values outside the scale range
(`geom_point()`).

# Make visualization 1 interactive
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:arrow':
    schema
The following object is masked from 'package:ggplot2':
    last_plot
The following object is masked from 'package:stats':
    filter
The following object is masked from 'package:graphics':
    layout

ggplotly(eplot, tooltip = "text")

Ignoring unknown labels:
• size : "Population"
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 18 rows containing non-finite outside the scale range
(`stat_smooth()`).