This analysis looks at a dataset from the Solar Photovoltaic Incentive Program,
which covers projects completed in various counties and cities in New York State
beginning in August 2010. The program is for installing grid-connected solar
electric (photovoltaic) systems in residential and commercial buildings.

The data includes, among other fields:

- Project installation year

- Project cost

- City and/or county location

- Expected annual production/output (kWh)

Get the data from the New York State open data site (data.ny.gov)

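# URL of the dataset; read it directly into a data frame for analysis
URL <- "https://data.ny.gov/api/views/3pzs-2zsk/rows.csv?accessType=DOWNLOAD"
# stringsAsFactors = TRUE gives the factor columns shown in the output below
# (it was the default before R 4.0.0 and is FALSE in later versions)
solar_PV <- read.csv(URL, stringsAsFactors = TRUE)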

Inspect the structure of the data and some basic summary statistics

# show the structure of the dataset
str(solar_PV)
## 'data.frame':    1529 obs. of  11 variables:
##  $ Project.Install.Year          : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Contractor                    : Factor w/ 92 levels "1st Light Energy Inc.",..: 58 58 58 58 58 58 58 58 58 58 ...
##  $ County                        : Factor w/ 62 levels "","Albany","Allegany",..: 3 7 12 15 16 17 18 20 23 29 ...
##  $ City                          : Factor w/ 364 levels "Airmont","Akron",..: 236 236 236 236 236 236 236 236 236 236 ...
##  $ Project.Count.by.City         : int  1 1 1 4 2 2 1 1 1 2 ...
##  $ Project.Cost                  : num  34440 39000 20198 163678 70272 ...
##  $ Incentive..Dollars            : num  8050 8820 4428 45080 21070 ...
##  $ Total.Nameplate.KW            : num  4.6 5.04 2.53 25.76 12.24 ...
##  $ Expected.KWh.Annual.Production: num  5400 5916 2970 30238 14368 ...
##  $ Solicitation                  : Factor w/ 1 level "PON 2112": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Location.1                    : Factor w/ 364 levels "Airmont, NY\n(41.11093304700006, -74.09858787699994)",..: 236 236 236 236 236 236 236 236 236 236 ...
# show some basic summary statistics on the data
summary(solar_PV)
##  Project.Install.Year                                       Contractor 
##  Min.   :2010         SolarCity(CS)                              :344  
##  1st Qu.:2013         Other                                      :318  
##  Median :2014         SunRun Inc. (E)                            :124  
##  Mean   :2014         Sungevity Development LLC (E)              : 87  
##  3rd Qu.:2015         NRG Residential Solar Solutions LLC (E)(CS): 65  
##  Max.   :2015         Solar Liberty Energy Systems Inc. (CS)     : 61  
##                       (Other)                                    :530  
##          County               City      Project.Count.by.City
##  Orange     :161   Other        : 318   Min.   :  1.00       
##  Westchester:119   Staten Island:  34   1st Qu.:  3.00       
##  Ulster     :104   Brooklyn     :  28   Median :  5.00       
##  Rockland   :102   Ithaca       :  26   Mean   : 11.32       
##  Dutchess   : 90   Schenectady  :  20   3rd Qu.: 10.00       
##  Tompkins   : 67   Albany       :  18   Max.   :703.00       
##  (Other)    :886   (Other)      :1085                        
##   Project.Cost      Incentive..Dollars Total.Nameplate.KW
##  Min.   :    9375   Min.   :   3600    Min.   :   2.40   
##  1st Qu.:  125632   1st Qu.:  26675    1st Qu.:  26.93   
##  Median :  222322   Median :  45500    Median :  46.98   
##  Mean   :  582777   Mean   : 133045    Mean   : 121.80   
##  3rd Qu.:  484340   3rd Qu.: 108840    3rd Qu.: 106.56   
##  Max.   :31372807   Max.   :6578670    Max.   :6090.06   
##                                                          
##  Expected.KWh.Annual.Production   Solicitation 
##  Min.   :   2817                PON 2112:1529  
##  1st Qu.:  31609                               
##  Median :  55141                               
##  Mean   : 142964                               
##  3rd Qu.: 125061                               
##  Max.   :7148279                               
##                                                
##                                                      Location.1   
##  Other, NY\n                                               : 318  
##  Staten Island, NY\n(40.64244049200005, -74.07527883899996):  34  
##  Brooklyn, NY\n(42.43316414800006, -78.74751623799995)     :  28  
##  Ithaca, NY\n(42.44051296300006, -76.49545767599994)       :  26  
##  Schenectady, NY\n(42.81225278100004, -73.94101735499999)  :  20  
##  Albany, NY\n(42.65155245500006, -73.75520746399997)       :  18  
##  (Other)                                                   :1085

List the column names and clean them up if necessary

names(solar_PV)
##  [1] "Project.Install.Year"           "Contractor"                    
##  [3] "County"                         "City"                          
##  [5] "Project.Count.by.City"          "Project.Cost"                  
##  [7] "Incentive..Dollars"             "Total.Nameplate.KW"            
##  [9] "Expected.KWh.Annual.Production" "Solicitation"                  
## [11] "Location.1"
# read.csv() replaces characters that are not valid in R names
# (such as commas and spaces) with "."
# Let's replace each run of one or more dots with a single "_" instead
colnames(solar_PV) <- gsub("\\.+", "_", colnames(solar_PV))

# to display the updated (cleaner) names
colnames(solar_PV)
##  [1] "Project_Install_Year"           "Contractor"                    
##  [3] "County"                         "City"                          
##  [5] "Project_Count_by_City"          "Project_Cost"                  
##  [7] "Incentive_Dollars"              "Total_Nameplate_KW"            
##  [9] "Expected_KWh_Annual_Production" "Solicitation"                  
## [11] "Location_1"
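
As a small self-contained illustration of the renaming step, the sketch below runs the clean-up on three names. The raw headers are assumptions made up for illustration (the original CSV header row is not shown here). `make.names()` is the base R function that `read.csv()` applies when `check.names = TRUE`, and the pattern `\\.+` matches the same runs of dots as the pattern used above.

```r
# Hypothetical raw CSV headers (assumed for illustration only)
raw_headers <- c("Project Install Year", "Incentive, Dollars", "Location 1")

# make.names() turns every character that is invalid in an R name into "."
r_names <- make.names(raw_headers)
print(r_names)

# Collapse each run of one or more dots into a single underscore
clean_names <- gsub("\\.+", "_", r_names)
print(clean_names)
```

Matching a whole run of dots at once is what keeps `Incentive..Dollars` from becoming `Incentive__Dollars`.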

Examine the data with some basic plots: a histogram of the log of Expected KWh
Annual Production, and a box plot of the same variable grouped by the year the
project was installed

# show the frequency histogram of log(Expected KWh Annual Production)
# taking the log compresses the very large values, which makes the shape
# of the distribution easier to see
with(solar_PV, hist(log(Expected_KWh_Annual_Production)))

# now show a box plot of log(Expected_KWh_Annual_Production) per year
# one color per install year for easier reading and visualization
with(solar_PV, boxplot(log(Expected_KWh_Annual_Production) ~ Project_Install_Year,
                       xlab = "Project Install Year",
                       ylab = "log(Expected kWh Annual Production)",
                       col = c("red","blue","green","yellow","purple","orange"),
                       main = "Expected kWh Annual Production grouped by Project Year"))
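
The plots can be backed up with a few numbers. The following is a minimal sketch on a toy data frame: `toy` and its values are made up for illustration and only borrow the column names of `solar_PV`. It compares the mean and median before and after the log transform, and computes the per-year medians that the center lines of a box plot represent.

```r
# Toy data standing in for solar_PV (values are made up for illustration)
toy <- data.frame(
  Project_Install_Year = c(2010, 2010, 2010, 2011, 2011, 2011),
  Expected_KWh_Annual_Production = c(3000, 5000, 40000, 4000, 6000, 90000)
)

# On the raw scale a few large values pull the mean far above the median
raw <- toy$Expected_KWh_Annual_Production
print(c(mean = mean(raw), median = median(raw)))

# On the log scale the mean and median are much closer together
logged <- log(raw)
print(c(mean = mean(logged), median = median(logged)))

# Median of the logged values per install year
medians_by_year <- tapply(logged, toy$Project_Install_Year, median)
print(medians_by_year)
```

Replacing `toy` with `solar_PV` would give the corresponding figures for the real data.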

Show a scatterplot of Project Count by City vs Project Cost

with(solar_PV,plot(Project_Count_by_City,Project_Cost,
                   xlab = "Project Count by City",
                   ylab = "Project Cost",
                   main = "Project Count by City vs Project Cost", pch = 8))

# draw a best-fit line so we can see whether there is a strong relationship
abline(lm(Project_Cost ~ Project_Count_by_City, data = solar_PV), col = "red")
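
One way to put a number on the strength of the relationship, rather than judging it from the plot alone, is the Pearson correlation and the R-squared of the fitted line. This is a minimal sketch on a toy data frame: `toy` and its values are made up for illustration and only borrow the column names of `solar_PV`.

```r
# Toy data standing in for solar_PV (values are made up for illustration)
toy <- data.frame(
  Project_Count_by_City = c(1, 2, 3, 5, 10, 40),
  Project_Cost = c(30000, 65000, 90000, 160000, 310000, 1250000)
)

# Pearson correlation: values near +1 indicate a strong positive linear relationship
r <- cor(toy$Project_Count_by_City, toy$Project_Cost)

# Fit the same kind of model that abline() draws and extract its R-squared
fit <- lm(Project_Cost ~ Project_Count_by_City, data = toy)
r_squared <- summary(fit)$r.squared

print(c(correlation = r, r_squared = r_squared))
```

With a single predictor, R-squared is the square of the correlation, so either value can be reported.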

Some things to note on these plots:

- Expected kWh Annual Production is right-skewed: its mean (142,964) is far
above its median (55,141), which is why the plots use the log of the values.

- Based on the box plot, the median expected annual kWh production was about the
same in 2011, 2012, 2014 and 2015. Also, in 2014 and 2015 many projects were
expected to produce much more solar energy (kWh) than in previous years.

- The scatterplot of Project Count by City vs Project Cost shows a strong
positive linear relationship. Cities with few projects are heavily clustered at
low project costs.