My project is about wind turbines. I found the dataset on the United States Geological Survey website, and it is also part of our class Google Drive selection. The topic fascinates me because wind turbines improve quality of life worldwide, from creating jobs and empowering local communities to providing clean air. Wind turbines produce renewable electricity and therefore reduce both reliance on fossil fuels and greenhouse gas emissions.
The dataset includes variables for state, county, year, turbine capacity, turbine hub height, turbine rotor diameter, turbine swept area, turbine total height, project capacity, number of turbines per project, latitude, and longitude. For this assignment, I focused on seven of those variables: state, county, year, turbine hub height, turbine rotor diameter, latitude, and longitude.
After finding my working directory, loading the libraries, and loading the dataset, I cleaned the data. I converted all variable names to lowercase using names and replaced every "." with "_" using gsub. I also renamed the variables site_longitude and site_latitude to lon and lat, respectively, using rename. After that, I filtered for the state CA, the years 1985, 1995, 2005, and 2015, and the counties Kern and Riverside. Then I removed all NA values from the variables I wanted to focus on, dropped the variables I would not work with, and renamed CA to California.
Finally, I created my plot and my map to show the turbine hub height and turbine rotor diameter in the two counties of California across the selected years.
Describe your actions in each chunk
Load the necessary libraries
In these two chunks, I am loading and installing the necessary libraries that will help me do this assignment.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(leaflet)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Find my working directory
My first step was to find my working directory using getwd.
getwd()
[1] "/Users/doriashima/Downloads"
Load your dataset
Then I was able to load my dataset using setwd and read_csv.
Rows: 63961 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Site.State, Site.County
dbl (10): Year, Turbine.Capacity, Turbine.Hub_Height, Turbine.Rotor_Diameter...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(windturbinesdataset)
# A tibble: 6 × 12
Site.State Site.County Year Turbine.Capacity Turbine.Hub_Height
<chr> <chr> <dbl> <dbl> <dbl>
1 IA Story County 2017 3000 87.5
2 IA Hardin County 2017 3000 87.5
3 IA Story County 2017 3000 87.5
4 IA Story County 2017 3000 87.5
5 IA Story County 2017 3000 87.5
6 IA Story County 2017 3000 87.5
# ℹ 7 more variables: Turbine.Rotor_Diameter <dbl>, Turbine.Swept_Area <dbl>,
# Turbine.Total_Height <dbl>, Project.Capacity <dbl>,
# Project.Number_Turbines <dbl>, Site.Latitude <dbl>, Site.Longitude <dbl>
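The chunk that read the file is not echoed in the rendered output above. A minimal sketch of that step follows, assuming a hypothetical file name `wind_turbines.csv` (the real name is not shown) and a two-row toy file with made-up values, so the example runs on its own:

```r
library(readr)

# Toy stand-in for the USGS file: two made-up rows and three of the real column names.
toy_path <- file.path(tempdir(), "wind_turbines.csv")  # hypothetical file name
writeLines(
  c("Site.State,Site.County,Year",
    "IA,Story County,2017",
    "CA,Kern County,2015"),
  toy_path
)

# read_csv() guesses the column types; show_col_types = FALSE silences the message.
windturbines_toy <- read_csv(toy_path, show_col_types = FALSE)
head(windturbines_toy)
```

Passing a full path to read_csv is an alternative to calling setwd first; either approach should load the same data.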
Clean the dataset
The next chunks clean the dataset. I converted all variable names to lowercase using names and replaced every "." with "_" using gsub. I also renamed the variables site_longitude and site_latitude to lon and lat, respectively, using rename. After that, I filtered for the state CA, the years 1985, 1995, 2005, and 2015, and the counties Kern and Riverside. Then I removed all NA values from the variables I wanted to focus on, dropped the variables I would not work with, and renamed CA to California.
# A tibble: 6 × 12
site.state site.county year turbine.capacity turbine.hub_height
<chr> <chr> <dbl> <dbl> <dbl>
1 IA Story County 2017 3000 87.5
2 IA Hardin County 2017 3000 87.5
3 IA Story County 2017 3000 87.5
4 IA Story County 2017 3000 87.5
5 IA Story County 2017 3000 87.5
6 IA Story County 2017 3000 87.5
# ℹ 7 more variables: turbine.rotor_diameter <dbl>, turbine.swept_area <dbl>,
# turbine.total_height <dbl>, project.capacity <dbl>,
# project.number_turbines <dbl>, site.latitude <dbl>, site.longitude <dbl>
summary(windturbinesdataset$year)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1983 2008 2012 2012 2017 2021
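# Replace every "." in the column names with "_"
names(windturbinesdataset) <- gsub("\\.", "_", names(windturbinesdataset))
# Rename the coordinate columns: longitude becomes lon, latitude becomes lat
wind <- windturbinesdataset |>
  rename(lon = site_longitude, lat = site_latitude)
head(wind)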
# A tibble: 6 × 12
site_state site_county year turbine_capacity turbine_hub_height
<chr> <chr> <dbl> <dbl> <dbl>
1 IA Story County 2017 3000 87.5
2 IA Hardin County 2017 3000 87.5
3 IA Story County 2017 3000 87.5
4 IA Story County 2017 3000 87.5
5 IA Story County 2017 3000 87.5
6 IA Story County 2017 3000 87.5
# ℹ 7 more variables: turbine_rotor_diameter <dbl>, turbine_swept_area <dbl>,
# turbine_total_height <dbl>, project_capacity <dbl>,
# project_number_turbines <dbl>, lon <dbl>, lat <dbl>
# A tibble: 6 × 9
site_state site_county year turbine_capacity turbine_hub_height
<chr> <chr> <dbl> <dbl> <dbl>
1 California Riverside County 1995 225 40
2 California Riverside County 1995 225 40
3 California Riverside County 1995 225 40
4 California Riverside County 1995 225 40
5 California Riverside County 1995 225 40
6 California Riverside County 1995 225 40
# ℹ 4 more variables: turbine_rotor_diameter <dbl>, turbine_swept_area <dbl>,
# lon <dbl>, lat <dbl>
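The chunk that produced the filtered table is not echoed in the rendered output. A minimal sketch of the cleaning steps described in this section follows, run on a toy tibble with made-up values. The object names `toy` and `clean_toy` are hypothetical, and the exact calls the original chunk used are an assumption:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data with made-up values; column names follow the raw table printed earlier.
toy <- tibble(
  Site.State = c("CA", "CA", "CA", "IA", "CA"),
  Site.County = c("Kern County", "Riverside County", "Kern County",
                  "Story County", "Los Angeles County"),
  Year = c(2015, 1995, 2015, 2017, 2005),
  Turbine.Capacity = c(2000, 225, 2000, 3000, 660),
  Turbine.Hub_Height = c(80, 40, NA, 87.5, 50),
  Turbine.Rotor_Diameter = c(100, 29, 100, 125, 47),
  Site.Latitude = c(35.0, 33.9, 35.1, 42.0, 34.7),
  Site.Longitude = c(-118.3, -116.6, -118.2, -93.5, -118.4)
)

# Lowercase the column names, then swap "." for "_".
names(toy) <- tolower(names(toy))
names(toy) <- gsub("\\.", "_", names(toy))

clean_toy <- toy |>
  rename(lon = site_longitude, lat = site_latitude) |>
  # Keep California, the four chosen years, and the two counties.
  filter(
    site_state == "CA",
    year %in% c(1985, 1995, 2005, 2015),
    site_county %in% c("Kern County", "Riverside County")
  ) |>
  # Drop rows with missing values in the variables of interest.
  drop_na(turbine_hub_height, turbine_rotor_diameter, lon, lat) |>
  # Remove a variable that is not used, then spell out the state name.
  select(-turbine_capacity) |>
  mutate(site_state = if_else(site_state == "CA", "California", site_state))

clean_toy
```

Note that the county values in the data include the word "County", so the filter has to match "Kern County" and "Riverside County" rather than "Kern" and "Riverside".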
Plot 1:
My first plot compares turbine hub height and turbine rotor diameter between Riverside County and Kern County in California for the years 1985, 1995, 2005, and 2015. Hub height is on the y-axis, point size shows rotor diameter, and color shows the county.
# Hub height by year, colored by county
p1 <- ggplot(clean_CA, aes(x = year, y = turbine_hub_height, color = site_county)) +
  # Point size encodes rotor diameter
  geom_point(aes(size = turbine_rotor_diameter), alpha = 0.7) +
  # Small jittered points reveal overlapping turbines
  geom_jitter(size = 1) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "Turbine Hub Height in Kern County and Riverside County in California",
    caption = "USGS",
    y = "Turbine Hub Height",
    x = "Year"
  ) +
  theme_bw(base_family = "serif")
p1
I then made this graph interactive using plotly.
ggplotly(p1)
Correlation
I examined the correlation between turbine rotor diameter and turbine hub height and found a strong positive correlation of 0.991. The three asterisks next to the p-value in the model summary indicate significance at the 0.001 level, and the adjusted R-squared of 0.98 means that about 98% of the variation in the data is explained by the model.
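The correlation chunk is not echoed in the rendered output. A minimal sketch of how such values can be obtained follows, using made-up numbers and assuming a simple linear model with hub height as the response. The choice of response variable and the name `toy` are assumptions:

```r
# Made-up hub heights and rotor diameters, for illustration only.
toy <- data.frame(
  turbine_hub_height = c(25, 40, 50, 65, 80, 80, 90),
  turbine_rotor_diameter = c(17, 29, 47, 70, 100, 103, 117)
)

# Pearson correlation coefficient between the two measurements.
r <- cor(toy$turbine_rotor_diameter, toy$turbine_hub_height)

# Simple linear regression of hub height on rotor diameter.
fit <- lm(turbine_hub_height ~ turbine_rotor_diameter, data = toy)
fit_summary <- summary(fit)

r
fit_summary$r.squared      # equals r^2 when there is a single predictor
fit_summary$adj.r.squared  # R-squared adjusted for the number of predictors
fit_summary$coefficients   # estimates, standard errors, t values, p-values
```

With a single predictor, R-squared is the square of the correlation coefficient, which is why a correlation of 0.991 goes together with an R-squared of about 0.98.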
Assuming "lon" and "lat" are longitude and latitude, respectively
map2
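The chunk that built map2 is not echoed in the rendered output. A minimal sketch of a comparable leaflet map follows, with two made-up turbine locations. The marker styling, the colors, and the names `toy`, `pal`, and `map_sketch` are assumptions and not the original code:

```r
library(leaflet)

# Made-up turbine locations; the real coordinates come from the cleaned dataset.
toy <- data.frame(
  site_county = c("Kern County", "Riverside County"),
  turbine_hub_height = c(80, 40),
  turbine_rotor_diameter = c(100, 29),
  lon = c(-118.3, -116.6),
  lat = c(35.0, 33.9)
)

# One color per county.
pal <- colorFactor(c("red", "blue"), domain = toy$site_county)

map_sketch <- leaflet(toy) |>
  addTiles() |>
  addCircleMarkers(
    lng = ~lon, lat = ~lat,                 # name the coordinate columns explicitly
    radius = ~turbine_rotor_diameter / 10,  # larger rotor, larger marker
    color = ~pal(site_county),
    popup = ~paste0(site_county, "<br>Hub height: ", turbine_hub_height)
  )
map_sketch
```

Naming `lng` and `lat` explicitly avoids the "Assuming ..." message, which leaflet prints when it has to guess the coordinate columns from their names.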
Concluding Essay
While researching wind turbines, I learned that the taller the hub height, the better a turbine can reach the faster, more consistent winds found at higher altitudes, which increases energy production. As for rotor diameter, I learned that a larger diameter lets the blades sweep more area, so the turbine captures more energy from the wind.
Both Plot 1 and the map show that wind turbines in Kern County have greater hub heights and rotor diameters than those in Riverside County, which suggests that Kern County's energy production and efficiency are higher than Riverside County's.
We can also observe that there are no data for Kern County between 1990 and 2000 and none for Riverside County between 2000 and 2010. Only between 2010 and 2020 are there data for both counties. Unfortunately, I could not find a historical event that explains the missing data in the earlier years, except perhaps that wind energy only started in the 1980s. The data, and the funding to manage them, may not yet have been in place in those early years.