This analysis examines a dataset from the Solar Photovoltaic Incentive Program,
which covers projects completed in counties and cities across New York State
beginning in August 2010. The program supports the installation of grid-connected
solar electric (photovoltaic) systems on residential and commercial buildings.
The dataset includes:
- Project Installation Year
- Project Cost
- City and/or County location
- Expected kWh Annual Production/Output
Get the data from the website
# get URL of where the dataset is located and read it for analysis
URL <- "https://data.ny.gov/api/views/3pzs-2zsk/rows.csv?accessType=DOWNLOAD"
solar_PV <- read.csv(URL)
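The `str()` output in this document lists text columns as `Factor`, which suggests the analysis was run with an R version older than 4.0.0, where `read.csv` converted strings to factors by default. The sketch below shows one way to get the same behaviour on a newer R. It uses a one-row inline sample instead of the download; the values come from the first row printed by `str()`, while the raw header spelling and the name `sample_PV` are assumptions made for illustration.

```r
# Tiny inline sample standing in for the downloaded file. The values are the
# first row shown by str(); the raw header spelling is an assumption.
csv_text <- "Project Install Year,County,Project Cost\n2010,Allegany,34440\n"

# stringsAsFactors = TRUE reproduces the Factor columns shown by str().
# Since R 4.0.0 the default is FALSE, so text columns are read as character.
sample_PV <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# read.csv also rewrites header names into syntactically valid R names,
# which is where the "." separators in the column names come from.
str(sample_PV)
names(sample_PV)
```

The same `stringsAsFactors = TRUE` argument can be passed to `read.csv(URL, ...)` if factor columns are wanted for the full dataset.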
Get a basic overview of the data: its structure and summary statistics
# show the structure of the dataset
str(solar_PV)
## 'data.frame': 1529 obs. of 11 variables:
## $ Project.Install.Year : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Contractor : Factor w/ 92 levels "1st Light Energy Inc.",..: 58 58 58 58 58 58 58 58 58 58 ...
## $ County : Factor w/ 62 levels "","Albany","Allegany",..: 3 7 12 15 16 17 18 20 23 29 ...
## $ City : Factor w/ 364 levels "Airmont","Akron",..: 236 236 236 236 236 236 236 236 236 236 ...
## $ Project.Count.by.City : int 1 1 1 4 2 2 1 1 1 2 ...
## $ Project.Cost : num 34440 39000 20198 163678 70272 ...
## $ Incentive..Dollars : num 8050 8820 4428 45080 21070 ...
## $ Total.Nameplate.KW : num 4.6 5.04 2.53 25.76 12.24 ...
## $ Expected.KWh.Annual.Production: num 5400 5916 2970 30238 14368 ...
## $ Solicitation : Factor w/ 1 level "PON 2112": 1 1 1 1 1 1 1 1 1 1 ...
## $ Location.1 : Factor w/ 364 levels "Airmont, NY\n(41.11093304700006, -74.09858787699994)",..: 236 236 236 236 236 236 236 236 236 236 ...
# show some basic summary statistics on the data
summary(solar_PV)
## Project.Install.Year Contractor
## Min. :2010 SolarCity(CS) :344
## 1st Qu.:2013 Other :318
## Median :2014 SunRun Inc. (E) :124
## Mean :2014 Sungevity Development LLC (E) : 87
## 3rd Qu.:2015 NRG Residential Solar Solutions LLC (E)(CS): 65
## Max. :2015 Solar Liberty Energy Systems Inc. (CS) : 61
## (Other) :530
## County City Project.Count.by.City
## Orange :161 Other : 318 Min. : 1.00
## Westchester:119 Staten Island: 34 1st Qu.: 3.00
## Ulster :104 Brooklyn : 28 Median : 5.00
## Rockland :102 Ithaca : 26 Mean : 11.32
## Dutchess : 90 Schenectady : 20 3rd Qu.: 10.00
## Tompkins : 67 Albany : 18 Max. :703.00
## (Other) :886 (Other) :1085
## Project.Cost Incentive..Dollars Total.Nameplate.KW
## Min. : 9375 Min. : 3600 Min. : 2.40
## 1st Qu.: 125632 1st Qu.: 26675 1st Qu.: 26.93
## Median : 222322 Median : 45500 Median : 46.98
## Mean : 582777 Mean : 133045 Mean : 121.80
## 3rd Qu.: 484340 3rd Qu.: 108840 3rd Qu.: 106.56
## Max. :31372807 Max. :6578670 Max. :6090.06
##
## Expected.KWh.Annual.Production Solicitation
## Min. : 2817 PON 2112:1529
## 1st Qu.: 31609
## Median : 55141
## Mean : 142964
## 3rd Qu.: 125061
## Max. :7148279
##
## Location.1
## Other, NY\n : 318
## Staten Island, NY\n(40.64244049200005, -74.07527883899996): 34
## Brooklyn, NY\n(42.43316414800006, -78.74751623799995) : 28
## Ithaca, NY\n(42.44051296300006, -76.49545767599994) : 26
## Schenectady, NY\n(42.81225278100004, -73.94101735499999) : 20
## Albany, NY\n(42.65155245500006, -73.75520746399997) : 18
## (Other) :1085
Inspect the column names and clean them up if necessary
names(solar_PV)
## [1] "Project.Install.Year" "Contractor"
## [3] "County" "City"
## [5] "Project.Count.by.City" "Project.Cost"
## [7] "Incentive..Dollars" "Total.Nameplate.KW"
## [9] "Expected.KWh.Annual.Production" "Solicitation"
## [11] "Location.1"
# read.csv replaces characters that are not valid in R names, such as spaces, with "."
# Let's replace each run of one or more "." with a single "_" instead
colnames(solar_PV) <- gsub("\\.+", "_", colnames(solar_PV))
# to display the updated (cleaner) names
colnames(solar_PV)
## [1] "Project_Install_Year" "Contractor"
## [3] "County" "City"
## [5] "Project_Count_by_City" "Project_Cost"
## [7] "Incentive_Dollars" "Total_Nameplate_KW"
## [9] "Expected_KWh_Annual_Production" "Solicitation"
## [11] "Location_1"
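To see what the substitution does in isolation, here is a minimal sketch that applies the same kind of pattern to three of the column names listed above. The vector names `cols` and `clean` are illustrative only.

```r
# Three of the names printed by names(solar_PV) before cleaning.
cols <- c("Project.Install.Year", "Incentive..Dollars", "Location.1")

# "\\.+" matches one or more consecutive dots, so the double dot in
# "Incentive..Dollars" collapses to a single underscore.
clean <- gsub("\\.+", "_", cols)

print(clean)
# [1] "Project_Install_Year" "Incentive_Dollars"    "Location_1"
```

Matching a run of dots rather than a single dot is what prevents a name like `Incentive__Dollars` with a doubled underscore.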
Examine the data with some basic plots: a histogram of the log of
Expected kWh Annual Production, and a box plot of the same quantity grouped by
the year the project was installed
# show the frequency histogram of log(Expected KWh Annual Production);
# taking the log compresses the long right tail, so the shape of the
# distribution is easier to see
with(solar_PV, hist(log(Expected_KWh_Annual_Production)))

# now show a box plot of log(Expected_KWh_Annual_Production) per year
# color each group for easier reading and visualization
with(solar_PV, boxplot(log(Expected_KWh_Annual_Production) ~ Project_Install_Year,
     xlab = "Project Install Year",
     ylab = "log(Expected kWh Annual Production)",
     col = c("red", "blue", "green", "yellow", "purple", "orange"),
     main = "log(Expected kWh Annual Production) grouped by Project Year"))

Show a scatterplot of Project Count by City vs Project Cost
with(solar_PV, plot(Project_Count_by_City, Project_Cost,
     xlab = "Project Count by City",
     ylab = "Project Cost",
     main = "Project Count by City vs Project Cost", pch = 8))
# draw a least-squares fit line to see whether there is a strong relationship
abline(lm(Project_Cost ~ Project_Count_by_City, data = solar_PV), col = "red")

Some things to note about these plots
- Expected kWh Annual Production is right-skewed: the summary above shows a
mean (142,964) well above the median (55,141), which is why the plots use
a log scale
- Based on the box plot, the median Expected kWh Annual Production was
about the same in 2011, 2012, 2014 and 2015. In 2014 and 2015,
many projects were also expected to produce much more solar energy (in kWh)
than in previous years
- The scatterplot of Project Count by City vs Project Cost shows a strong
positive linear relationship. Cities with few projects
are heavily clustered at low project costs.
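
The observations above are read off the plots by eye. A rough sketch of how they could be checked numerically follows. It uses only the first five rows as printed by `str(solar_PV)` (possibly rounded in that display), so the numbers it produces say nothing about the full 1529-row dataset; the names `sample_PV`, `skew_gap`, `r` and `r_squared` are illustrative.

```r
# First five rows as printed by str(solar_PV); the full data has 1529 rows,
# so this only demonstrates the calculations, not the actual results.
sample_PV <- data.frame(
  Project_Count_by_City          = c(1, 1, 1, 4, 2),
  Project_Cost                   = c(34440, 39000, 20198, 163678, 70272),
  Expected_KWh_Annual_Production = c(5400, 5916, 2970, 30238, 14368)
)

# Right skew: a mean above the median gives a positive gap.
kwh <- sample_PV$Expected_KWh_Annual_Production
skew_gap <- mean(kwh) - median(kwh)

# Strength of the linear relationship: Pearson correlation, plus the
# R-squared of the same least-squares fit that abline() draws.
r <- with(sample_PV, cor(Project_Count_by_City, Project_Cost))
fit <- lm(Project_Cost ~ Project_Count_by_City, data = sample_PV)
r_squared <- summary(fit)$r.squared

print(c(skew_gap = skew_gap, r = r, r_squared = r_squared))
```

Replacing `sample_PV` with `solar_PV` runs the same checks on the full data. For a simple regression with one predictor, `r_squared` equals `r^2`, so either number can be reported.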