finalprojectchoieno

Author

E Choi

Image source: Pexels (also linked)

For my dataset, I chose one focusing on weather. The data comes from NOAA by way of Meteostat. I plan to use the city name, date, country, season, latitude, longitude, and population variables. I chose this dataset because, while I was wondering what topic to pick, Maryland actually got a decent amount of snow! Not only that, but weather is something I check very often, whether I am deciding how to dress or whether to bring an umbrella or other accessories to stay warm. The observations were collected by NOAA through weather monitoring stations and are distributed by the Meteostat API, a web service that provides weather observations.

Background information

The data used in this project comes from the Meteostat API, a web service that provides weather and climate data from many weather stations around the world. Meteostat collects information like temperature, location, and dates from official weather stations and makes it available for people to use. The data can be filtered by location and time period, which made it useful for this project. Meteostat shares its data under an open license and also uses data from NOAA.

Meteostat. (n.d.). Meteostat Historical Weather & Climate Data API. Retrieved December 15, 2025, from https://rapidapi.com/meteostat/api/meteostat

I loaded the required libraries and datasets needed for data cleaning and analysis.

#load libraries and csvs
library(tidyverse)
library("arrow")
library(leaflet)
setwd("C:/Users/enomc/OneDrive - montgomerycollege.edu/Documents/Data Science")
weather<-read_parquet("daily_weather.parquet")
countries<-read_csv("countries.csv")
cities<-read_csv("cities.csv")

I made a new dataset with fewer observations by keeping only dates from January 1, 2020 onward.

#make a new weather dataset only including data from dates on or after January 1, 2020
weather2 <- weather |>
  mutate(date2 = as.Date(date)) |>
  filter(date2 >= as.Date("2020-01-01"))

I made another dataset that includes observations from January 1, 1970 onward.

#make a new weather dataset only including data from dates on or after January 1, 1970
weather3 <- weather |>
  mutate(date2 = as.Date(date)) |>
  filter(date2 >= as.Date("1970-01-01"))

I used the distinct() function to remove duplicate combinations of city name and station ID from the cities dataset.

#Did with Mrs. Saidi
cities <- cities |> distinct(city_name, station_id, .keep_all = TRUE)
#I got my source from here when I was troubleshooting how to correct my join
#https://dplyr.tidyverse.org/reference/distinct.html
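
To illustrate why the duplicates matter for the join, here is a minimal sketch on made-up data. The city names, station IDs, and values are hypothetical and are not taken from the Meteostat files. When the lookup table repeats a key, left_join() returns one row per match, so the weather rows get multiplied; distinct() brings the lookup table back to one row per key. The relationship argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy tables, for illustration only
toy_weather <- tibble(
  city_name  = c("Alpha", "Alpha", "Beta"),
  station_id = c("S1", "S1", "S2"),
  avg_temp_c = c(10.5, 11.0, 20.1)
)
toy_cities <- tibble(
  city_name  = c("Alpha", "Alpha", "Beta"),
  station_id = c("S1", "S1", "S2"),
  latitude   = c(39.0, 39.0, -12.5)
)

# Joining against duplicated keys repeats the matching weather rows
dup_join <- left_join(toy_weather, toy_cities,
                      by = c("city_name", "station_id"),
                      relationship = "many-to-many")

# Keeping one row per (city_name, station_id) preserves the weather row count
toy_cities_unique <- distinct(toy_cities, city_name, station_id, .keep_all = TRUE)
clean_join <- left_join(toy_weather, toy_cities_unique,
                        by = c("city_name", "station_id"))

nrow(dup_join)    # more rows than toy_weather
nrow(clean_join)  # same number of rows as toy_weather
```

Comparing nrow() before and after a join is a quick way to check that a lookup table did not add rows.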

I joined the weather and cities datasets on their shared variables, city name and station ID. Next, I joined the result with the countries dataset to make one dataset. (2020)

#join weather2 and cities, which share city_name and station_id
join1 <- left_join(weather2, cities, by = c("city_name", "station_id"))
#join join1 with countries in order to combine all three datasets
choidataset <- left_join(join1, countries, by = c("country", "iso2", "iso3"))

Joining the weather, cities, and countries datasets (1970).

#join weather3 and cities, which share city_name and station_id
join2 <- left_join(weather3, cities, by = c("city_name", "station_id"))
#join join2 with countries in order to combine all three datasets
joined_1970 <- left_join(join2, countries, by = c("country", "iso2", "iso3"))

Export dataset

#change to csv in order to submit on dropbox
write_csv(choidataset, "weather_NOAA.csv")

Multiple linear regression analysis

# Turn off scientific notation for readability (also used in project 1)
# Source: https://stackoverflow.com/questions/25946047/how-to-prevent-scientific-notation-in-r
options(scipen = 999)
lin_reg <- lm(avg_temp_c ~ latitude + season, data = choidataset)
plot(lin_reg)

summary(lin_reg)


Call:
lm(formula = avg_temp_c ~ latitude + season, data = choidataset)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.978  -4.936   0.748   5.991  26.453 

Coefficients:
               Estimate Std. Error  t value             Pr(>|t|)    
(Intercept)  22.2727059  0.0154865 1438.204 < 0.0000000000000002 ***
latitude     -0.1793000  0.0002607 -687.886 < 0.0000000000000002 ***
seasonSpring  0.1496316  0.0195642    7.648   0.0000000000000204 ***
seasonSummer  6.5034694  0.0194376  334.581 < 0.0000000000000002 ***
seasonWinter -6.7806967  0.0197087 -344.045 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.556 on 1616453 degrees of freedom
  (4646 observations deleted due to missingness)
Multiple R-squared:  0.3727,    Adjusted R-squared:  0.3727 
F-statistic: 2.401e+05 on 4 and 1616453 DF,  p-value: < 0.00000000000000022

#I chose average temperature in Celsius as my response variable and used latitude and season as predictors. Latitude affects how warm a location is because it reflects how near the location is to the equator, and the seasons bring temperature differences throughout the year.

The fitted model is: average temperature (°C) = 22.273 - 0.179(latitude) + 0.15(seasonSpring) + 6.50(seasonSummer) - 6.78(seasonWinter). Latitude has a negative slope: for each 1-degree increase in latitude, the predicted average temperature goes down by about 0.18°C, holding season constant. Fall is not listed because it is the baseline season, so the season coefficients are all comparisons to fall: spring is about 0.15°C warmer, summer is about 6.50°C warmer, and winter is about 6.78°C colder. All the predictors have three asterisks, meaning they are statistically significant (p < 0.001). The adjusted R² is about 0.37, so the model explains about 37% of the variation in average temperature, and about 63% of the variation is not explained by this model.
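
As a worked example of reading the equation, the sketch below plugs a latitude and a season into the coefficients from the summary output. The helper name predict_temp is hypothetical, and latitude 39 is used only as a rough stand-in for Maryland.

```r
# Coefficients copied from the summary(lin_reg) output; fall is the baseline season
intercept     <- 22.2727059
b_latitude    <- -0.1793000
season_effect <- c(Fall = 0, Spring = 0.1496316, Summer = 6.5034694, Winter = -6.7806967)

# Hypothetical helper: plug a latitude and a season into the fitted equation
predict_temp <- function(latitude, season) {
  intercept + b_latitude * latitude + season_effect[[season]]
}

predict_temp(39, "Winter")  # about 8.5 °C
predict_temp(39, "Summer")  # about 21.8 °C
```

With the fitted model object available, predict(lin_reg, newdata = data.frame(latitude = 39, season = "Winter")) should give the same value up to rounding.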

Creating a dataset for the recent visualization, starting from 2023

enodataset <- choidataset |>
  mutate(date2 = as.Date(date)) |>
  filter(date2 >= as.Date("2023-01-01"))

Visualization 1, latitude vs average temp

eplot <- ggplot(
  enodataset,
  aes(
    x = latitude,
    y = avg_temp_c,
    color = season,
    # text is used only for the plotly hover tooltip
    text = paste(
      "City:", city_name,
      "<br>Season:", season,
      "<br>Avg Temp C:", avg_temp_c,
      "<br>Latitude:", latitude
    )
  )
) +
  geom_point(alpha = 0.05, size = 2) +
  geom_smooth(se = FALSE) +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Relationship Between Latitude and Average Temperature",
    x = "Latitude (degrees)",
    y = "Average Temperature (°C)",
    color = "Season",
    size = "Population",  # no size aesthetic is mapped, so ggplot2 ignores this label
    caption = "Source: NOAA via Meteostat API"
  ) +
  theme_minimal(base_size = 11)
eplot

Ignoring unknown labels:
• size : "Population"
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: Removed 18 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 18 rows containing missing values or values outside the scale range
(`geom_point()`).

#make visualization 1 interactive
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:arrow':

    schema

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

ggplotly(eplot, tooltip="text")

Ignoring unknown labels:
• size : "Population"

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Warning: Removed 18 rows containing non-finite outside the scale range
(`stat_smooth()`).

My visualization shows the relationship between latitude and average temperature! I used enodataset, which only includes data from 2023 onward, because the full dataset had too many observations to plot. The color legend shows which season the dots and the lines represent. The x-axis is latitude in degrees, and the y-axis is the average temperature in Celsius.

I thought this visualization was interesting because, as you can see, there is much less variation at negative latitudes than at positive latitudes. As expected, the equator seems to have the least change in temperature over the seasons. Moving right toward higher positive latitudes, we can see a lot more variation, with winter temperatures dropping much further than they do at the negative latitudes. Also, one of the graphs had to be interactive or annotated, and I decided to make this one interactive: if you hover over a point, it will tell you the latitude, the average temperature in Celsius, the season, and the city.
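
One way to put a number on the spread described here is to compare the standard deviation of temperature in each hemisphere. This is only a sketch on made-up values: the column names latitude and avg_temp_c match the project data, but the numbers and the hemisphere helper column are hypothetical.

```r
# Made-up values for illustration; only the column names match the project data
toy <- data.frame(
  latitude   = c(-35, -30, -25, -20, 40, 45, 50, 55),
  avg_temp_c = c(18, 20, 22, 24, -5, 10, 25, 2)
)

# Hypothetical helper column: label each row by the sign of its latitude
toy$hemisphere <- ifelse(toy$latitude < 0, "Southern", "Northern")

# Standard deviation of temperature within each hemisphere
spread <- tapply(toy$avg_temp_c, toy$hemisphere, sd)
spread
```

Running the same two steps on enodataset would show whether the difference seen in the plot also holds in the numbers.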

Region categories

unique(joined_1970$region)

 [1] "Southern and Central Asia" "Southern Europe"          
 [3] "Northern Africa"           "Polynesia"                
 [5] NA                          "Central Africa"           
 [7] "Caribbean"                 "South America"            
 [9] "Middle East"               "Australia and New Zealand"
[11] "Western Europe"            "Eastern Europe"           
[13] "Central America"           "Western Africa"           
[15] "North America"             "Southern Africa"          
[17] "Southeast Asia"            "Eastern Africa"           
[19] "Eastern Asia"              "Nordic Countries"         
[21] "Baltic Countries"          "Melanesia"                
[23] "Antarctica"                "Micronesia"               
[25] "British Isles"

Combining regions into broader categories

#Attempt to combine the multiple variations of Africa, Asia, and Europe (tried with professor)
africa <- c("Central Africa", "Northern Africa", "Eastern Africa", "Southern Africa", "Western Africa")
temp <- joined_1970 |>
  mutate(region2 = case_when(
    region %in% africa ~ "Africa",
    region %in% c("Eastern Europe", "Western Europe") ~ "Europe",
    region %in% c("Southeast Asia", "Southern and Central Asia", "Polynesia", "Melanesia") ~ "Asia",
    region %in% "Antarctica" ~ "Antarctica",
    region %in% "British Isles" ~ "British Isles",
    region %in% "Caribbean" ~ "Caribbean",
    region %in% "Middle East" ~ "Middle East",
    region %in% "North America" ~ "North America",
    region %in% "South America" ~ "South America",
    TRUE ~ "Others"
  ))

Creating yearly regional dataset

#most of this was done working with the professor; North America and Caribbean are removed because they did not provide valuable data
yeardf<-temp|>
  mutate(year=year(date2)) |>
  group_by(year,region2) |>
  summarize(mean_temp=mean(avg_temp_c)) |>
  filter(!is.na(mean_temp)) |>
  filter(!region2 %in% c("North America", "Caribbean"))

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

Visualization 2: regional temperature over time

plottwo <- ggplot(yeardf, aes(x = year, y = mean_temp, color = region2)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Change in average region temperature over time",
    x = "Year",
    y = "Average Temperature (°C)",
    color = "Region",
    caption = "Source: NOAA via Meteostat API"
  ) +
  theme_minimal(base_size = 12)
plottwo

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

This visualization shows how the average temperature of each region changes over time, measured in years. The x-axis is the year, the y-axis is the average temperature in Celsius, and the color legend shows the region. I really wanted to include broader versions of Asia and Africa, but unfortunately I could not get them to appear in the plotted dataset, so I went with only five regions. Additionally, I removed North America and the Caribbean because their records covered too short a period and didn’t contribute much. For the most part, the regions stay close to their starting average temperatures over the years. However, for Europe the average temperature increases by around five to six degrees Celsius and still appears to be rising. This raises questions about climate change, because average temperatures rising that steadily isn’t exactly normal.
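
A possible explanation for the regions that did not appear, offered as an assumption and not checked against the real data: mean() returns NA when any value in a group is missing, and a filter on !is.na(mean_temp) then drops that whole group. The sketch below shows this behavior on made-up values; the region labels and temperatures are hypothetical.

```r
# Made-up values for illustration; the region names are just labels here
toy <- data.frame(
  region2    = c("Africa", "Africa", "Europe", "Europe"),
  avg_temp_c = c(25, NA, 9, 11)
)

# Without na.rm, one missing value makes the whole group mean NA
means_default <- tapply(toy$avg_temp_c, toy$region2, mean)

# With na.rm = TRUE, missing values are skipped and the group keeps a mean
means_na_rm <- tapply(toy$avg_temp_c, toy$region2, mean, na.rm = TRUE)

means_default
means_na_rm
```

In the yearly pipeline the equivalent change would be mean(avg_temp_c, na.rm = TRUE) inside summarize(). Whether that brings Africa and Asia back would need to be checked on the actual data.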

Conclusion

In this project, I analyzed weather data from NOAA, accessed through Meteostat, to better understand how average temperature changes by latitude, season, and region. The linear regression results showed that both latitude and season have a strong relationship with temperature. The first visualization showed how temperature varies across latitudes and seasons, and the interactive features allow users to see information for individual cities. The second visualization showed how average temperatures have changed over the years in different regions, with some regions showing interesting trends. Overall, this project shows patterns in global temperature data and highlights how weather can vary across different parts of the world, seasons, and time.