Project 1

Author

Jaiden Soto

Introduction

For this project, I picked a data-set that describes the hydro-power facilities that were in use as of 2011. The dimension variables describe the crest of each facility, the top of the structure that regulates the water level of the dam. They are the crest elevation, crest length, and structural height. The remaining variables are attributes: the facility's name, the county and state it is located in, and its latitude and longitude, which give the facility's exact location. The project organization variable identifies the organization that runs the facility, and the project year gives when production for the project began. Finally, the watercourse variable names the body of flowing water that the hydro-power facility uses. For this project, I explore the relationship between the crest dimensions of hydro-power facilities and their development over time in the United States. The source for my data-set is the 2011 Active Hydro-power Reclamation Report from the United States Bureau of Reclamation.

Loading necessary libraries and data-set

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

# Loads the necessary data visualization libraries

hydropower <- readr::read_csv('hydropower.csv')

Rows: 336 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Identity.Name, Identity.Watercourse, Location.County, Location.Stat...
dbl (6): Dimensions.Crest Elevation, Dimensions.Crest Length, Dimensions.Str...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Reads the data for the visualization

head(hydropower)

# A tibble: 6 × 11
  `Dimensions.Crest Elevation` `Dimensions.Crest Length` Dimensions.Structural…¹
                         <dbl>                     <dbl>                   <dbl>
1                        1520                      3800                      86 
2                        3348                      1850                     110 
3                        5510                     76300                     265 
4                        2172.                     2784.                    132.
5                        1564                      1104                     110 
6                        3684                      2784.                     18 
# ℹ abbreviated name: ¹`Dimensions.Structural Height`
# ℹ 8 more variables: Identity.Name <chr>, Identity.Watercourse <chr>,
#   Location.County <chr>, Location.Latitude <dbl>, Location.Longitude <dbl>,
#   Location.State <chr>, Identity.Project.Organization <chr>,
#   Identity.Project.Year <dbl>

Data Wrangling

The data-set already met the three criteria for tidy data (each variable in its own column, each observation in its own row, and each value in its own cell), so it didn't need any up-front data wrangling. The rows, columns, and cells were all correctly laid out, with no overlap or errors. In my opinion, any graph-specific data wrangling, like filtering out outliers, should be placed in the graph's R chunk itself rather than done beforehand, so that the filtering doesn't cause errors later.
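As a rough illustration of the kind of checks that support a "no wrangling needed" decision, the sketch below counts missing values, duplicated rows, and column types. The `toy` data frame and its column names are made up for the example; they are not rows from the hydro-power data-set.

```r
# Toy stand-in for the real data (hypothetical values and column names)
toy <- data.frame(
  name   = c("A Dam", "B Dam", "C Dam"),
  year   = c(1935, 1950, NA),
  height = c(110, 86, 265),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))

# Number of rows that exactly repeat an earlier row
duplicate_rows <- sum(duplicated(toy))

# First class of each column, to confirm numbers were read as numbers
column_types <- vapply(toy, function(col) class(col)[1], character(1))

print(missing_per_column)
print(duplicate_rows)
print(column_types)
```

Running the same three checks on `hydropower` would be one way to confirm the tidy assessment before moving on to the regressions.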

Linear Regression Analysis

The purpose of these linear regression analyses is to check for any correlation between the crest variables and the year of production. This lets us see whether there were any significant relationships, ideally positive ones that show growth, in the hydro-power facilities of the United States.

Regression 1 Analysis:

Regression 1 shows an essentially flat relationship between crest elevation and the year of production. The plotted line is drawn from the filtered data, while the summary below is fit to all 336 rows and estimates a slope of about -0.04 feet per year, so there is no evidence of an improvement in elevation over time. The p-value is 0.93, so the result is far from statistically significant. The negative adjusted R-squared also shows that year explains essentially none of the variability in elevation. Elevation = B0 + B1 * Year

# Regular ggplot statement for plot 1, showing data, x variable and y variable.
# The data is filtered for graphing purposes: crest elevation under 10,000 feet and year of production after 1900, which removes outliers
p1 <- ggplot(filter(hydropower, `Dimensions.Crest Elevation` < 10000, `Identity.Project.Year` > 1900),
             aes(x = `Identity.Project.Year`, y = `Dimensions.Crest Elevation`)) +

# Label for the title and the caption
  labs(title = "Crest Elevation by Year of Production",
       caption = "Source: 2011 US Bureau of Reclamation Hydropower Assessment Report") +

# Both X-axis and Y-axis labels
  xlab("Year of Production") +
  ylab("Crest Elevation in Feet") +

# Sets the theme for the graph
  theme_linedraw(base_size = 15) +

# Plots the points on the graph
  geom_point() +

# Creates a blue linear regression line using the standard formula relating y to x
  geom_smooth(color = 'blue', method = 'lm', formula = 'y~x')
p1

summary(lm(`Dimensions.Crest Elevation`~`Identity.Project.Year`, data = hydropower))


Call:
lm(formula = `Dimensions.Crest Elevation` ~ Identity.Project.Year, 
    data = hydropower)

Residuals:
   Min     1Q Median     3Q    Max 
 -4251  -2111   -182   1385  41930 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)           4409.86137 1035.94558   4.257  2.7e-05 ***
Identity.Project.Year   -0.04466    0.54229  -0.082    0.934    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3467 on 334 degrees of freedom
Multiple R-squared:  2.031e-05, Adjusted R-squared:  -0.002974 
F-statistic: 0.006783 on 1 and 334 DF,  p-value: 0.9344

# Gives summary statistics for further analysis
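The summary above is fit to the full data-set, while the plot uses a filtered subset. A minimal sketch of fitting the model to the same subset as the plot, and pulling out the slope, p-value, and adjusted R-squared by name, is shown here. The `toy` data frame and the short column names `year` and `elevation` are hypothetical stand-ins, not values from the report.

```r
# Toy stand-in for the real data (hypothetical values and column names)
toy <- data.frame(
  year      = c(1895, 1910, 1925, 1940, 1955, 1970, 1985),
  elevation = c(3000, 4200, 3900, 12000, 4100, 4500, 4300)
)

# Apply the same limits the plot uses, so the model and the graph agree
plot_data <- subset(toy, elevation < 10000 & year > 1900)

# Fit elevation = B0 + B1 * year on the filtered rows
fit <- lm(elevation ~ year, data = plot_data)
fit_summary <- summary(fit)

# Extract the statistics discussed in the analysis paragraphs
slope   <- fit_summary$coefficients["year", "Estimate"]
p_value <- fit_summary$coefficients["year", "Pr(>|t|)"]
adj_r2  <- fit_summary$adj.r.squared

print(c(slope = slope, p_value = p_value, adj_r2 = adj_r2))
```

Reporting the statistics from the filtered fit alongside the plot would keep the written interpretation and the drawn regression line based on the same rows.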

Regression 2 Analysis:

Regression 2 looks at the length of the crest per facility against the year of production. The summary below, fit to all 336 rows, estimates a slightly positive slope of about 0.89 feet per year; the trend line in the plot can look different because the plot excludes the longest crests. In either case the slope is small, which indicates that the relationship is not strong. The p-value is around 0.42, so the result is not statistically significant. Furthermore, the model does not explain variability around the mean, as the adjusted R-squared is negative. Length = B0 + B1 * Year

# For this plot, p2, the crest length is filtered to less than 5,000 feet to remove outliers. This presents the relationship more clearly.
p2 <- ggplot(filter(hydropower, `Dimensions.Crest Length` < 5000, `Identity.Project.Year` > 1900),
             aes(x = `Identity.Project.Year`, y = `Dimensions.Crest Length`)) +

  labs(title = "Crest Length by Year of Production",
       caption = "Source: 2011 US Bureau of Reclamation Hydropower Assessment Report")+
  xlab("Year of Production") +
  ylab("Crest Length in Feet") +
  theme_linedraw(base_size = 15)+
  geom_point()+
  geom_smooth(color = 'red', method = 'lm', formula = 'y~x')
p2

summary(lm(`Dimensions.Crest Length`~`Identity.Project.Year`, data = hydropower))


Call:
lm(formula = `Dimensions.Crest Length` ~ Identity.Project.Year, 
    data = hydropower)

Residuals:
   Min     1Q Median     3Q    Max 
 -3075  -2342  -1632   -286  73587 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)           1335.3646  2088.0057   0.640    0.523
Identity.Project.Year    0.8905     1.0930   0.815    0.416

Residual standard error: 6988 on 334 degrees of freedom
Multiple R-squared:  0.001983,  Adjusted R-squared:  -0.001005 
F-statistic: 0.6638 on 1 and 334 DF,  p-value: 0.4158
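The residuals above have a maximum of 73587, which suggests a few very long crests dominate the full-data fit. The sketch below, using made-up numbers rather than the real data-set, shows how a single extreme value can flip the sign of a fitted slope, which is one reason a filtered plot and an unfiltered summary can disagree.

```r
# Toy data (hypothetical): crest length falls by 1 foot per year,
# except for one very large outlier in the final year
toy <- data.frame(
  year         = seq(1900, 2000, by = 10),
  crest_length = c(seq(990, 900, by = -10), 50000)
)

# Slope using every row, outlier included
fit_full   <- lm(crest_length ~ year, data = toy)
slope_full <- unname(coef(fit_full)["year"])

# Slope after applying a plot-style limit on crest length
fit_filtered   <- lm(crest_length ~ year, data = subset(toy, crest_length < 5000))
slope_filtered <- unname(coef(fit_filtered)["year"])

print(c(full = slope_full, filtered = slope_filtered))
```

In this toy case the filtered slope is negative while the full-data slope is positive, so it matters which version of the data an interpretation refers to.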

Regression 3 Analysis:

The third regression shows a faintly positive slope, about 0.03 feet per year, between the structural height of the hydro-power facility and its year of production. This indicates a positive relationship between the two variables, and it is the closest of the three regressions to significance. The slope is faint, however, so the impact of the year on the structural height would not be considered strong. The p-value is about 0.074, slightly above 0.05, so the result is not statistically significant at the 5% level. The adjusted R-squared is very small, around 0.00657, which means the line is a very poor fit. Height = B0 + B1 * Year

# For plot 3, the structural height is filtered to less than 500 feet to remove outliers and make the regression line clearer
p3 <- ggplot(filter(hydropower, `Dimensions.Structural Height` < 500, `Identity.Project.Year` > 1900),
             aes(x = `Identity.Project.Year`, y = `Dimensions.Structural Height`)) +

  labs(title = "Structural Height by Year of Production",
       caption = "Source: 2011 US Bureau of Reclamation Hydropower Assessment Report")+
  xlab("Year of Production") +
  ylab("Structural Height in Feet") +
  theme_linedraw(base_size = 15)+
  geom_point()+
  geom_smooth(color = 'green', method = 'lm', formula = 'y~x')
p3

summary(lm(`Dimensions.Structural Height`~`Identity.Project.Year`, data = hydropower))


Call:
lm(formula = `Dimensions.Structural Height` ~ Identity.Project.Year, 
    data = hydropower)

Residuals:
    Min      1Q  Median      3Q     Max 
-137.10  -81.67  -12.38   47.42  584.60 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)  
(Intercept)           77.42482   35.50865   2.180   0.0299 *
Identity.Project.Year  0.03334    0.01859   1.793   0.0738 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 118.8 on 334 degrees of freedom
Multiple R-squared:  0.009538,  Adjusted R-squared:  0.006572 
F-statistic: 3.216 on 1 and 334 DF,  p-value: 0.07381
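To make the significance reading concrete, the sketch below recomputes the t value, the two-sided p-value, and an approximate 95% confidence interval for the year slope from the numbers printed in the regression 3 summary. The variable names are only for illustration, and because the inputs are rounded the results match the printed output only approximately.

```r
# Values copied from the regression 3 summary output above
estimate  <- 0.03334   # slope for Identity.Project.Year
std_error <- 0.01859   # standard error of that slope
df_resid  <- 334       # residual degrees of freedom

# t value is the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the slope
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = estimate - t_crit * std_error,
        upper = estimate + t_crit * std_error)

# TRUE only if the slope is significant at the 5% level
significant_at_5pct <- p_value < 0.05

print(c(t_value = t_value, p_value = p_value))
print(ci)
print(significant_at_5pct)
```

The interval includes zero, which is another way of seeing that the slope is not significant at the 5% level.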

Data Visualization

# Standard ggplot statement for plot 4, utilizing point color for the location state and point size for the crest elevation.
# The data is filtered on the X and Y variables for clarity in the graph. Size didn't need to be filtered here, as filtering it wasn't necessary for clarity.
p4 <- ggplot(filter(hydropower, `Dimensions.Crest Length` < 5000, `Dimensions.Structural Height` < 500),
             aes(x = `Dimensions.Crest Length`, y = `Dimensions.Structural Height`, color = `Location.State`, size = `Dimensions.Crest Elevation`)) +

# Plots the points, with transparency for overlapping points
  geom_point(alpha = 0.5) +

# Labels for the title, graph caption, color legend, and size legend
  labs(title = "Relationship between Hydro-power facility dimensions and Location in the United States",
       caption = "Source: 2011 US Bureau of Reclamation Hydropower Assessment Report",
       color = "State(s) of Hydro-power facility",
       size = "Crest Elevation in Feet") +

# Individual labels for the X axis and Y axis. While these could be put in the labs statement, I had been using this method earlier for programming clarity, so I kept it.
  xlab("Crest Length in Feet") +
  ylab("Structural Height in Feet") +

# Sets the theme for the graph design
  theme_grey(base_size = 12)
p4

Conclusion

To start the conclusion, I cleaned up the data-set only where appropriate, by filtering it to ranges that keep the graphs clear. This meant setting limits on the numeric (float and integer) variables. That was the only cleaning I did, however, as the data-set came already organized both for tidyverse work and for data visualization.

The data visualization shows the dimensions of the hydro-power facilities in relation to the state each one is located in. The earlier linear regressions looked for a relationship with year, while this graph looks for a relationship with location. The states appear in alphabetical order, which makes the contrasts between the colors easier to follow. The graph also suggests some relationship between the West and both the number and the size of facilities. The South and Midwest appear as well, but the West of the continental United States has the most hydro-power facilities. I was surprised by how consistent the facility sizes were, as I had expected them to vary more.
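A claim such as "the West has the most facilities" can be backed with a simple count. The sketch below tallies facilities by state and by region on made-up rows. The `toy` data frame, the use of state abbreviations, and the `region_lookup` vector are assumptions for illustration; the real `Location.State` values may be coded differently.

```r
# Toy stand-in for the state column (hypothetical values)
toy <- data.frame(
  state = c("CA", "CA", "WY", "CO", "CA", "WY", "NE"),
  stringsAsFactors = FALSE
)

# Facilities per state, most common first
state_counts <- sort(table(toy$state), decreasing = TRUE)

# Hypothetical lookup from state to region, covering only the toy states
region_lookup <- c(CA = "West", WY = "West", CO = "West", NE = "Midwest")

# Facilities per region
region_counts <- table(region_lookup[toy$state])

print(state_counts)
print(region_counts)
```

Applying the same tally to the real data-set would turn the visual impression from the scatter plot into numbers that can be reported directly.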

The visualization, however, is missing some context that I felt could have helped give a bigger picture of the overall question. I would have liked to show the impact of the budget in the graphs, as well as a measure of public support for these facilities. Both of these variables could help explain the strength of the relationships found in the graph. They would also allow more thorough conclusions about why there is such a lack of development in hydro-power facility funding.