INTRODUCTION

Solar power has become more prominent in the United States (U.S.), largely because of its affordability and environmental benefits. In response to this growth, Stanford started a project (DeepSolar) to map solar panel installations across the U.S. using satellite imagery. To generate more insight, the project merged the solar data with environmental, socioeconomic, and demographic data from the American Community Survey, 2015 (U.S. Census Bureau) and NASA Surface Meteorology and Solar Energy. The dataset we intend to use for this project was retrieved from Kaggle and contains 168 variables (166 numerical and 2 categorical) and 72,537 observations. Each row gives the number of solar power systems in one census tract, identified by its FIPS number, within a county and state. The solar installations in the dataset cover selected urban areas in 48 U.S. states. Variables include solar system count, population, county, state, average household income, education, and occupation, among others. The dataset also contains some missing values. For more details on the Stanford dataset, see http://web.stanford.edu/group/deepsolar/home.

The motivation for studying this dataset is to understand the factors that influence the purchase or installation of solar power in the U.S. We approach this by examining the correlations between the variables (columns) and the relative closeness of different units (rows) in the dataset. These analyses are meant to address the following questions:

● Which socioeconomic factors (e.g., education, income, occupation) correlate with solar installations?

● Can the solar panel data improve the quality of inference about socioeconomic parameters?
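As a rough illustration of the correlation analysis that the first question calls for, the R sketch below computes pairwise correlations between a solar installation count and two socioeconomic variables. The data are synthetic, and the column names (solar_system_count, average_household_income, education_years) are hypothetical stand-ins rather than the actual DeepSolar columns. Spearman correlation is our choice for the sketch, not a method stated in this report.

```r
# Synthetic stand-in data; column names are hypothetical, not the DeepSolar schema.
set.seed(1)
n <- 200
tracts <- data.frame(
  average_household_income = rnorm(n, mean = 60000, sd = 15000),
  education_years          = rnorm(n, mean = 13, sd = 2)
)

# Make the toy solar count depend on income so there is a signal to find.
tracts$solar_system_count <- rpois(n, lambda = exp(tracts$average_household_income / 30000))

# Introduce a few missing values, since the real dataset also has some.
tracts$education_years[c(5, 17)] <- NA

# Spearman rank correlation is used because counts are skewed;
# pairwise.complete.obs skips missing values pair by pair.
cor_matrix <- cor(tracts, method = "spearman", use = "pairwise.complete.obs")

# Correlation of each variable with the solar count, sorted by absolute strength.
solar_cor <- cor_matrix[, "solar_system_count"]
solar_cor <- solar_cor[names(solar_cor) != "solar_system_count"]
print(sort(abs(solar_cor), decreasing = TRUE))
```

With the real data, the same call would be applied to the selected numerical columns. A rank-based method is one reasonable option when the count variable is heavily skewed.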

Data Description

As described in the introduction, the DeepSolar dataset contains 168 variables (166 numerical and 2 categorical) and 72,537 observations, one row per census tract, and includes some missing values. For this project we reduced the number of observations to 4,000 and selected 43 variables that we considered most relevant for the multivariate analysis. We kept variables that we considered socioeconomic factors and dropped variables that appeared to be constant across the dataset. The map below shows the point locations where solar systems are installed in the U.S., based on their FIPS numbers.
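The reduction step can be sketched in R as follows. This is a minimal sketch on toy data: the column names are hypothetical, and simple random sampling is an assumption, since the report does not state how the 4,000 observations were chosen.

```r
# Toy stand-in for the full dataset; column names are hypothetical.
set.seed(42)
full <- data.frame(
  fips               = sprintf("%05d", sample(1:99999, 10000)),
  solar_system_count = rpois(10000, lambda = 3),
  population         = round(runif(10000, 500, 8000)),
  constant_flag      = 1  # a column with no variation
)

# Assumption: a simple random sample of rows; the report does not say how rows were chosen.
subset_rows <- full[sample(nrow(full), 4000), ]

# Flag columns that take a single value (ignoring NAs); they carry no information.
is_constant <- vapply(
  subset_rows,
  function(col) length(unique(na.omit(col))) <= 1,
  logical(1)
)

# Keep only the columns that vary.
reduced <- subset_rows[, !is_constant, drop = FALSE]
dim(reduced)
```

Choosing the socioeconomic variables themselves is a judgment call and would be done by naming the columns explicitly, for example `reduced[, c("population", "solar_system_count")]` in this toy setup.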

## Warning in validateCoords(lng, lat, funcName): Data contains 5802 rows with
## either missing or invalid lat/lon values and will be ignored
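The warning above reports rows whose coordinates are missing or invalid. One way to handle such rows explicitly before mapping, rather than letting the mapping function skip them, is to filter them out first. The sketch below uses made-up data and assumes the coordinate columns are named lat and lon. The range check is an extra assumption of this sketch and may not match exactly what the mapping function treats as invalid.

```r
# Hypothetical coordinates; `lat` and `lon` are assumed column names.
sites <- data.frame(
  fips = c("01001", "01003", "01005", "01007"),
  lat  = c(32.5, NA, 31.8, 95.0),   # NA and 95 are not usable latitudes
  lon  = c(-86.6, -87.7, NA, -87.1)
)

# Keep rows whose coordinates are present and within valid ranges.
valid <- !is.na(sites$lat) & !is.na(sites$lon) &
  abs(sites$lat) <= 90 & abs(sites$lon) <= 180
mappable <- sites[valid, ]

# Report how many rows were excluded so the loss is visible in the output.
cat(sum(!valid), "rows dropped;", nrow(mappable), "rows kept\n")
```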

Data Cleaning

Below is a summary of the new dataset, based on the number of solar system installations.
