For this project, I picked a data-set that showcases the hydro-power facilities that were in use as of 2011. The dimension variables describe the crest of the hydro-power facility, which is the top of the structure that regulates the water level of the dam. These variables are the crest elevation, crest length, and structural height. The remaining variables define the facility's attributes, including its name and the county and state it is located in. The longitude and latitude variables define the facility's exact location. The project organization variable identifies the organization that runs the facility, and the project year defines when production for the project began. Finally, the watercourse variable describes the body of flowing water that the hydro-power facility uses. For this project, I will explore the relationship between the dimensions of hydro-power facilities' crests and their development in the United States. The source for my data-set is the 2011 Active Hydro-power Reclamation Report from the United States Bureau of Reclamation.
Loading necessary libraries and data-set
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
# Loads the necessary data visualization libraries
hydropower <- readr::read_csv('hydropower.csv')
Rows: 336 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Identity.Name, Identity.Watercourse, Location.County, Location.Stat...
dbl (6): Dimensions.Crest Elevation, Dimensions.Crest Length, Dimensions.Str...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Reads the data for the visualization
head(hydropower)
The data-set met the three criteria for its data to be considered tidy (each variable in its own column, each observation in its own row, and each value in its own cell), so it didn't need specific data-wrangling. The rows, columns, and cells were all correctly stated and placed in the data-set, with no overlap or errors. In my opinion, any graph-specific data wrangling, like filtering for outliers, should be placed in the graph's R chunk itself rather than done beforehand, to prevent errors from surfacing later.
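As a minimal sketch of filtering inside the plotting step rather than beforehand, the example below uses a small made-up data frame (`toy` and its values are hypothetical, not the hydropower data) and base R's `subset()`, so that it runs without extra packages. The point is only that the filtered copy is separate and the original data stays untouched for later chunks.

```r
# Hypothetical stand-in for the real data; values are invented for illustration
toy <- data.frame(
  year   = c(1910, 1935, 1960, 1985),
  height = c(120, 95, 2500, 140)  # 2500 plays the role of an outlier
)

# Filter only for this plot; `toy` itself is not modified
plot_data <- subset(toy, height < 500)

nrow(toy)        # all rows are still available to later chunks
nrow(plot_data)  # the plot-specific copy has the outlier removed
```

In the report's own chunks, `filter()` from dplyr plays the same role as `subset()` does here.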
Linear Regression Analysis
The purpose of these linear regression analyses is to check for any correlation between the crest variables and the year of production. This lets us see whether there were any significant relationships, ideally positive to show growth, in the hydro-power facilities of the United States.
Regression 1 Analysis:
The plot, which excludes outliers, shows a faintly positive trend between the crest's elevation and the year of production, which would suggest a slight increase in crest elevation over time. The regression summary, however, is fitted to the full, unfiltered data-set and estimates a slightly negative slope of about -0.045 feet per year. In either case the relationship is weak: the p-value is 0.93, so the slope is far from statistically significant, and the negative adjusted R-squared shows that the year explains essentially none of the variability in crest elevation. Model: Elevation = B0 + B1 * Year
# Regular ggplot statement for plot 1, showing data, x variable and y variable
# The data is filtered for graphing purposes: crest elevation under 10,000 feet and year of production after 1900, which removes outliers
p1 <- ggplot(
  filter(hydropower, `Dimensions.Crest Elevation` < 10000, `Identity.Project.Year` > 1900),
  aes(x = `Identity.Project.Year`, y = `Dimensions.Crest Elevation`)
) +
  # Labels for the title and the caption
  labs(title = "Crest Elevation by Year of Production",
       caption = "Source: 2011 US Bureau of Reclamation Hydropower Assessment Report") +
  # Both X-axis and Y-axis labels
  xlab("Year of Production") +
  ylab("Crest Elevation in Feet") +
  # Sets the theme for the graph
  theme_linedraw(base_size = 15) +
  # Plots the points on the graph
  geom_point() +
  # Creates a blue linear regression line of y on x
  geom_smooth(color = 'blue', method = 'lm', formula = y ~ x)
p1
summary(lm(`Dimensions.Crest Elevation`~`Identity.Project.Year`, data = hydropower))
Call:
lm(formula = `Dimensions.Crest Elevation` ~ Identity.Project.Year,
data = hydropower)
Residuals:
Min 1Q Median 3Q Max
-4251 -2111 -182 1385 41930
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4409.86137 1035.94558 4.257 2.7e-05 ***
Identity.Project.Year -0.04466 0.54229 -0.082 0.934
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3467 on 334 degrees of freedom
Multiple R-squared: 2.031e-05, Adjusted R-squared: -0.002974
F-statistic: 0.006783 on 1 and 334 DF, p-value: 0.9344
# Gives summary statistics for further analysis
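The plot above drops outliers before drawing its trend line, while the summary is fitted to every row, so the two can disagree about the slope. The sketch below illustrates this with a small invented data frame (`toy`, `fit_all`, and `fit_filtered` are hypothetical names, and the numbers are not from the hydropower data).

```r
# Hypothetical data: a gentle upward trend plus one extreme early outlier
toy <- data.frame(
  year      = c(1905, 1910, 1930, 1950, 1970, 1990),
  elevation = c(40000, 1000, 1120, 1190, 1310, 1400)
)

# Fit on every row, as summary(lm(...)) on the full data does
fit_all <- lm(elevation ~ year, data = toy)

# Fit on the same subset that a filtered plot would draw
fit_filtered <- lm(elevation ~ year, data = subset(toy, elevation < 10000))

slope_all      <- unname(coef(fit_all)["year"])
slope_filtered <- unname(coef(fit_filtered)["year"])

# The outlier drags the full-data slope negative; the filtered slope is positive
c(all_rows = slope_all, filtered = slope_filtered)
```

Fitting the model on the same filtered subset used for the plot would keep the written interpretation, the trend line, and the summary table consistent with each other.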
Regression 2 Analysis:
In the plot for regression 2, which excludes outliers, the trend line between the length of the crest per facility and the year of production slopes slightly downward. The regression summary, fitted to the full, unfiltered data-set, instead estimates a slightly positive slope of about 0.89 feet per year. Either way, the slope is small, which indicates that the relationship is not strong. The p-value is around 0.42, so the result is far from statistically significant. Furthermore, the model does not explain variability around the mean, as the adjusted R-squared is negative. Model: Length = B0 + B1 * Year
# For this plot, p2, the crest length was filtered to less than 5000 feet to remove outliers. This allows the relationship to be presented more clearly.
p2 <- ggplot(
  filter(hydropower, `Dimensions.Crest Length` < 5000, `Identity.Project.Year` > 1900),
  aes(x = `Identity.Project.Year`, y = `Dimensions.Crest Length`)
) +
  labs(title = "Crest Length by Year of Production",
       caption = "Source: 2011 US Bureau of Reclamation Hydropower Assessment Report") +
  xlab("Year of Production") +
  ylab("Crest Length in Feet") +
  theme_linedraw(base_size = 15) +
  geom_point() +
  geom_smooth(color = 'red', method = 'lm', formula = y ~ x)
p2
summary(lm(`Dimensions.Crest Length`~`Identity.Project.Year`, data = hydropower))
Call:
lm(formula = `Dimensions.Crest Length` ~ Identity.Project.Year,
data = hydropower)
Residuals:
Min 1Q Median 3Q Max
-3075 -2342 -1632 -286 73587
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1335.3646 2088.0057 0.640 0.523
Identity.Project.Year 0.8905 1.0930 0.815 0.416
Residual standard error: 6988 on 334 degrees of freedom
Multiple R-squared: 0.001983, Adjusted R-squared: -0.001005
F-statistic: 0.6638 on 1 and 334 DF, p-value: 0.4158
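A negative adjusted R-squared is possible because the adjustment penalizes R-squared for the number of predictors. As a quick check, the standard formula can be applied to the values reported in the summary above (336 rows, one predictor, multiple R-squared of 0.001983); the variable names here are illustrative only.

```r
r_squared <- 0.001983  # Multiple R-squared from the summary
n <- 336               # rows in the data-set (334 residual degrees of freedom + 2)
p <- 1                 # number of predictors (Identity.Project.Year)

# Standard adjusted R-squared formula
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 6)
```

The result rounds to -0.001005, matching the adjusted R-squared in the summary: when R-squared is this close to zero, the penalty for the single predictor is enough to push the adjusted value below zero.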
Regression 3 Analysis:
The third regression shows a faintly positive slope, yet greater than plot 1's, for the structural height of the hydro-power facility against its year of production: about 0.033 feet per year. This indicates a positive relationship between the two variables. The positive slope is faint, however, so the impact of the year on the structural height would not be considered strong. The p-value is about 0.074, slightly above 0.05, so the result is not statistically significant at the 5% level. The adjusted R-squared is very small, around 0.00657, which means that the line is a very poor fit. Model: Height = B0 + B1 * Year
# For plot 3, the structural height was filtered to be less than 500 feet to remove outliers and make the regression line clearer
p3 <- ggplot(
  filter(hydropower, `Dimensions.Structural Height` < 500, `Identity.Project.Year` > 1900),
  aes(x = `Identity.Project.Year`, y = `Dimensions.Structural Height`)
) +
  labs(title = "Structural Height by Year of Production",
       caption = "Source: 2011 US Bureau of Reclamation Hydropower Assessment Report") +
  xlab("Year of Production") +
  ylab("Structural Height in Feet") +
  theme_linedraw(base_size = 15) +
  geom_point() +
  geom_smooth(color = 'green', method = 'lm', formula = y ~ x)
p3
summary(lm(`Dimensions.Structural Height`~`Identity.Project.Year`, data = hydropower))
Call:
lm(formula = `Dimensions.Structural Height` ~ Identity.Project.Year,
data = hydropower)
Residuals:
Min 1Q Median 3Q Max
-137.10 -81.67 -12.38 47.42 584.60
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.42482 35.50865 2.180 0.0299 *
Identity.Project.Year 0.03334 0.01859 1.793 0.0738 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 118.8 on 334 degrees of freedom
Multiple R-squared: 0.009538, Adjusted R-squared: 0.006572
F-statistic: 3.216 on 1 and 334 DF, p-value: 0.07381
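Reading the significance off the printed table is easy to get backwards, so it can help to pull the slope's p-value out of the model object and compare it to the threshold directly. The sketch below does this on a small invented data frame (`toy` and its values are hypothetical); `summary(fit)$coefficients` and `summary(fit)$adj.r.squared` are the standard components of an `lm` summary.

```r
# Hypothetical data, invented for illustration only
toy <- data.frame(
  year   = c(1910, 1925, 1940, 1955, 1970, 1985, 2000),
  height = c(150, 90, 210, 120, 260, 100, 230)
)

fit   <- lm(height ~ year, data = toy)
coefs <- summary(fit)$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)

slope_p <- coefs["year", "Pr(>|t|)"]   # p-value for the slope
adj_r2  <- summary(fit)$adj.r.squared  # adjusted R-squared

# TRUE only when the slope is significant at the 5% level
slope_p < 0.05
```

Applied to the summary above, the slope's p-value of 0.0738 is greater than 0.05, so the same comparison would be FALSE.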
Data Visualization
# Standard ggplot statement for plot 4, utilizing point color for the location state and point size for the crest elevation
# The data is filtered on the X and Y variables for clarity in the graph. Size didn't need to be filtered here, as filtering it wasn't necessary for clarity.
p4 <- ggplot(
  filter(hydropower, `Dimensions.Crest Length` < 5000, `Dimensions.Structural Height` < 500),
  aes(x = `Dimensions.Crest Length`, y = `Dimensions.Structural Height`,
      color = `Location.State`, size = `Dimensions.Crest Elevation`)
) +
  # Plots the points, with transparency for overlapping points
  geom_point(alpha = 0.5) +
  # Labels for the title, graph caption, color legend, and size legend
  labs(title = "Relationship between Hydro-power facility dimensions and Location in the United States",
       caption = "Source: 2011 US Bureau of Reclamation Hydropower Assessment Report",
       color = "State(s) of Hydro-power facility",
       size = "Crest Elevation in Feet") +
  # Individual labels for the X axis and Y axis. While these could be put in the labs statement, I had been using this method earlier for programming clarity, so I kept it.
  xlab("Crest Length in Feet") +
  ylab("Structural Height in Feet") +
  # Sets the theme for the graph design
  theme_grey(base_size = 12)
p4
Conclusion
To start the conclusion, I cleaned up the data-set where appropriate by filtering it to ranges that would be clear for graphing purposes. This involved setting limits on my numeric variables. That was the only cleaning I did, however, as the data-set came organized both for tidyverse work and for other data visualization.
The data visualization showcased the dimensions of the hydro-power facilities as they relate to the state each facility is located in. The earlier linear regressions looked for a relationship with the year, and this graph looks for a relationship with location. We can see that the states are in alphabetical order, which makes the contrasts between the colors easier to follow. The graph also shows that facility abundance and size are associated with the West. Facilities in the South and Midwest also appear, but the West of the continental United States has the most hydro-power facilities. I was surprised by how consistent the facility sizes were, as I had expected them to vary more.
With the visualization, however, came missing context that I felt could have helped give a bigger picture of the overall question. I would have liked to show the impact of the budget in the graphs, as well as a measure of public support for these facilities. Both of these variables could help explain the strength of the relationships found in the graph. They would also allow us to come to more thorough conclusions on the question of why there is such a lack of development in hydro-power facility funding.