Introduction

The primary objective of this project is going to be Exploratory data analysis. The data I am considering for this purpose is ‘Open data 500’ which contains information about a list of 500+ U.S. based companies.

The purpose of this data is to generate new business and develop new products and services. For this, I intend on understanding and analysing the data by importing the data into R, clean the data and building reports by certain categories. By this approach, I will be able to suggest which sector/industry makes more revenue and which sector/industry can be considered for development.

Packages Required

To analyze this data, we will use the following R packages:

library(tidyverse)
library(data.table)
library(DT)

Package tidyverse is a combination of packagees which helps us with functions to clean and manipulate date, plot and perform some explanatory data analysis which includes plots.

Package data.table provides us with the function ‘fread’ which is faster in terms of reading data from files into R.

Package DT provides us with a function datatable to print the data in the form of a table to the html pages.

Data Preparation

As stated in the introduction, the data we are using for this project is Open Data 500 which contains information about a list of 500+ U.S. based companies from 2014. This list is complied by GovLab through

Below is more information regarding the above processes:

Outreach Campaign

  • Mass email to over 3,000 contacts in the GovLab network
  • Mass email to over 2,000 contacts OpenDataNow.com
  • Blog posts on TheGovLab.org and OpenDataNow.com
  • Social media recommendations
  • Media coverage of the Open Data 500
  • Attending presentations and conferences

Expert Advice

  • Recommendations from government and non-governmental organizations
  • Guidance and feedback from Open Data 500 advisors

Research

  • Companies identified for the book, Open Data Now
  • Companies using datasets from Data.gov
  • Directory of open data companies developed by Deloitte
  • Online Open Data Userbase created by Socrata
  • General research from publicly available sources

Study Goals

The primary study goals of the data were as below:

  • Provide a basis for assessing the economic value of government open data
  • Encourage the development of new open data companies
  • Foster a dialogue between government and business on how government data can be made more useful

My primary objective is do a Exploratory data analysis to understand the data and clean the data and build reports by certain categories.

Here, we start with the data cleaning process:

#Read the date into R
data_companies <- fread("us_companies.csv")


#Viewing the dataframe data_companies in a table format
View(data_companies)

#Checking the dimensions of the data_companies
dim(data_companies)
## [1] 529  23

The raw data has 529observations and 23 variables. As per my understanding of the dataset, I have decided to drop the below variables due to following reasons:

  • comapny_name_id: This vaiable is same as the variable company_name.
  • social_impact: This variable is irrelevant to the analysis as it has 98% of its values missing
  • data_types,example_uses,data_impacts,financial_info: These variables are irrelevant to the analysis
# Deleting the variable "company_name_id" bcoz it is same as "company_name"
data_companies <- data_companies[,-c(1,14,18:21)]  

 #converting the variable "zip_code" into a character and "Revenue" into numeric
data_companies$zip_code <- as.character(data_companies$zip_code) 
data_companies$Revenue <- as.numeric(data_companies$Revenue)

#Correcting faulty values in full_time_employees and source_count
data_companies$full_time_employees[data_companies$full_time_employees == "10-Jan"] <- "1-10"
data_companies$full_time_employees[data_companies$full_time_employees == "Nov-50"] <- "11-50"
data_companies$source_count[data_companies$source_count == "10-Jan"] <- "1-10"
data_companies$source_count[data_companies$source_count == "Nov-50"] <- "11-50"

#converting the variable full_time_employees and source_count into a factor
data_companies$full_time_employees <- as.factor(data_companies$full_time_employees)
data_companies$source_count <- as.factor(data_companies$source_count)

Also, we have converted the variable zip_code in the data to a character in the above code. Along with that, we have changed the faulty values of “10-Jan” and “Nov-50” in variables full_time_employees and source_count to “01-10” and “11-50”. After that, we have converted the variables full_time_employees and source_count into categorical variables.

attach(data_companies)

revenue_bin <- vector()
revenue_bin <- ifelse(is.na(Revenue),"NA", ifelse(Revenue < 1, "0-1M",ifelse(Revenue < 100,"0-100M", ifelse(Revenue < 1000,"100M-1B","More than 1B"))))
data_companies <- cbind(data_companies,revenue_bin)

detach(data_companies)

In the above segment, we are trying to create a new columnrevenue_bin which is a categorical variable for segmenting the Revenue columm.

Here, I am trying to find the number of missing values in each Variable.

colSums(is.na(data_companies))
##        company_name                 url        year_founded 
##                   0                   0                   0 
##                city               state             country 
##                   0                   0                   0 
##            zip_code full_time_employees        company_type 
##                  39                  29                  16 
##    company_category      revenue_source      business_model 
##                   3                   0                  76 
##         description   description_short        source_count 
##                   0                   0                 303 
##        last_updated             Revenue         revenue_bin 
##                   0                  60                   0

I would like to take the below actions for the missing values:

  • For zip_code, full_time_employees, company_type, company_category the NA values are acutally the string “NA” in the data which is being read as NA value by R. Also, these are categorial values and ‘NA’ values will help us in understanding the data more. So I wouldnt be imputing them with anything.
  • For business_model,source_count, I would leave these missing values as is, as these variables are not of interest.
  • Revenue is a numeric column. The imputation of these missing values would be done on the basis of the explanatory analysis we are going to perform. For this explanatory analysis, we can keep these values and the summary functions can be run by ignoring these values.
datatable(head(data_companies),class = 'cell-border stripe')
#Checking the dimensions of the data_companies
dim(data_companies)
## [1] 529  18

The data that we obtain after clenaning has 529 observations and 18 variables in total. Out of the 18 columns, the columns of interest are year_founded, state, full_time_employees, company_type, company_category, Revenue, revemnue_bin . ‘Revenue’ is the numeric column and it is in Million dollars and the rest of the columns are categorical in nature. The minimum and maximum values of Revenue variable are 0.094 and 9.510^{5}. The mean value and median of the Revenue are 4489.8038678 and 10.

Explanatory Data Analysis

In this section, we are going to plot various frequency tables and bar graphs of revenue vs some categorial variable and try to draw some conclusions fromt these plots.

#Frequency table of the year_column variable
data.frame(table(data_companies$year_founded))
##    Var1 Freq
## 1  1799    1
## 2  1812    1
## 3  1836    1
## 4  1841    1
## 5  1845    1
## 6  1856    1
## 7  1860    1
## 8  1864    1
## 9  1868    1
## 10 1876    1
## 11 1880    1
## 12 1882    2
## 13 1885    1
## 14 1891    1
## 15 1895    1
## 16 1896    1
## 17 1899    1
## 18 1900    2
## 19 1901    1
## 20 1905    1
## 21 1907    1
## 22 1909    1
## 23 1912    3
## 24 1913    1
## 25 1914    2
## 26 1919    3
## 27 1922    2
## 28 1926    2
## 29 1928    2
## 30 1931    1
## 31 1933    1
## 32 1935    2
## 33 1936    2
## 34 1937    3
## 35 1941    1
## 36 1945    2
## 37 1946    2
## 38 1948    1
## 39 1950    1
## 40 1951    1
## 41 1953    1
## 42 1954    1
## 43 1955    1
## 44 1960    1
## 45 1961    1
## 46 1962    1
## 47 1963    1
## 48 1964    1
## 49 1965    2
## 50 1966    1
## 51 1967    1
## 52 1968    3
## 53 1969    4
## 54 1971    1
## 55 1972    1
## 56 1973    2
## 57 1975    2
## 58 1976    3
## 59 1977    2
## 60 1978    3
## 61 1979    4
## 62 1980    3
## 63 1981    1
## 64 1982    2
## 65 1983    3
## 66 1984    4
## 67 1986    1
## 68 1987    6
## 69 1988    2
## 70 1989    5
## 71 1990    2
## 72 1992    2
## 73 1993    5
## 74 1994    4
## 75 1995    6
## 76 1996    4
## 77 1997    4
## 78 1998   10
## 79 1999   10
## 80 2000   16
## 81 2001   10
## 82 2002    7
## 83 2003   11
## 84 2004   13
## 85 2005   17
## 86 2006   21
## 87 2007   28
## 88 2008   26
## 89 2009   32
## 90 2010   50
## 91 2011   51
## 92 2012   41
## 93 2013   30
## 94 2014    8
## 95 2015    1

From the above frequency table, we can see that out of the 529 companies started in 314 years, almost 290 companies where founded in 10 years(2003-2012).

#Frequency table of the year_column variable
ggplot(data_companies,aes(year_founded)) + geom_bar(aes(fill = full_time_employees))

In the above plot, it is interesting to find companies that were founded before 1900 and 1950 and still have ‘51-200’ employees. Also, many companies that started after 2000 have less than 200 employees.

#Frequency table of the year_column variable
ggplot(data_companies,aes(year_founded)) + geom_bar(aes(fill = company_category))

In the above plot, we can see that most of the companies that started in the laste 1990s and 200os are “Data/Technology”, “Finance & Investment”, “Housing/Real Estate”, “Research & Consulting”

#Frequency table of the year_column variable
ggplot(data_companies,aes(year_founded)) + geom_bar(aes(fill = company_type))

In the above plot, we can see that the “private” companies dominated the early 1990s and post 2000 years.

Now we start looking at the categorical variable State

Below is the frequency table of the state:

#Frequency table of the State variable
data.frame(table(data_companies$state))
##    Var1 Freq
## 1    AL    1
## 2    AR    1
## 3    AZ    3
## 4    CA  132
## 5    CO   11
## 6    CT    8
## 7    DC   25
## 8    FL    7
## 9    GA    7
## 10   IA    1
## 11   IL   26
## 12   IN    7
## 13   KA    1
## 14   KY    1
## 15   MA   42
## 16   MD   13
## 17   ME    2
## 18   MI    5
## 19   MN    3
## 20   MO    6
## 21   MT    1
## 22   NC    5
## 23   NE    1
## 24   NH    2
## 25   NJ   15
## 26   NV    1
## 27   NY  106
## 28   OH    7
## 29   OK    1
## 30   OR    2
## 31   PA   11
## 32   RI    1
## 33   TN    3
## 34   TX   17
## 35   UT    2
## 36   VA   22
## 37   VT    1
## 38   WA   25
## 39   WI    4
ggplot(data_companies,aes(state)) + geom_bar()

From the above bar plot, it is evident that California has the maximum number of companies. The states of California, New York, Massachusetts consitute almost 50% of the companies.

ggplot(data_companies,aes(x = state,y = Revenue)) + 
  geom_bar(stat = "sum",aes(fill = year_founded)) 

From the Bar plot of Revenue vs State, we can see that Illinois is 3rd in terms of Revenue whereas Massachusetts was 3rd in terms of number of companies. Also, most of the companies in Illinois and New York were started beofore 2000 where as the companies in California are relatively new i.e. after 2000.

Now we are going to look at the categorical variable full_time_employees

data.frame(table(data_companies$full_time_employees))
##           Var1 Freq
## 1         1-10  143
## 2  1,001-5,000   30
## 3      10,001+   56
## 4        11-50  115
## 5      201-500   25
## 6 5,001-10,000   16
## 7    501-1,000   22
## 8       51-200   93
ggplot(data_companies,aes(full_time_employees)) + geom_bar()

From the above plot, we can see that the number of companies with no of employees less than 200 is 351 which is more than 60% of the toal companies.

ggplot(data_companies,aes(x = full_time_employees,y = Revenue)) + geom_bar(stat = "sum") 

If we combine the above two plots, we find that the companies with strength ‘51-200’ are 3rd in number, but the total revenue of these companies is almost negligible in comparison to other groups or total revenue.

Now we move on to the categorical variable company_type

data.frame(table(data_companies$company_type))
##                             Var1 Freq
## 1                      Nonprofit   15
## 2 nonprofit + commercial spinoff    1
## 3                    partnership    1
## 4                    Partnership    5
## 5          Permitting Assistance    1
## 6                        Private  396
## 7                         Public   92
## 8          Public (Federal-USGS)    1
## 9            sole proprietorship    1

More than 60% of the companies are private comapnies.

ggplot(data_companies,aes(x = company_type,y = Revenue)) + 
  geom_bar(stat = "sum",aes(fill = year_founded)) +
  theme(axis.text.x = element_text(angle = 90))

The interesting observation from the above plot is that the public companies are distributed over the years, whereas the private companies increased during the late 1990s to 2000s.

Now we would like to look at the categorical variable company_category.

Below is the Frequendy table of company_category

#the Frequency table of company_category
data.frame(table(data_companies$company_category))
##                         Var1 Freq
## 1      Aerospace and Defense    1
## 2  Business & Legal Services   44
## 3            Data/Technology   98
## 4                  Education   19
## 5                     Energy   28
## 6      Environment & Weather   11
## 7       Finance & Investment   75
## 8         Food & Agriculture    6
## 9         Geospatial/Mapping   30
## 10                Governance   43
## 11                Healthcare   40
## 12       Housing/Real Estate   21
## 13                 Insurance   11
## 14      Lifestyle & Consumer   25
## 15                     Media    1
## 16     Research & Consulting   28
## 17       Scientific Research   17
## 18            Transportation   28
ggplot(data_companies,aes(company_category)) + 
    geom_bar(aes(fill = company_type)) + 
    theme(axis.text.x = element_text(angle = 90))

In terms of number of companies in a category, “Data/Technology” has the maximum number of companies.

ggplot(data_companies,aes(x = company_category,y = Revenue)) + 
  geom_bar(stat = "sum",aes(fill = company_type)) +
  theme(axis.text.x = element_text(angle = 90))

By combining the above two plots we can see that, that the revenue of the “Scientific Research” Category is the highest among all Company categories whereas the number of companies is highest in the “Data/Technology” category. Also, the “scientific Research” Category is completely “Private”.

#Frequency table of the revenue_bin
data.frame(table(revenue_bin))
##    revenue_bin Freq
## 1       0-100M  341
## 2         0-1M   19
## 3      100M-1B   50
## 4 More than 1B   59
## 5           NA   60

More than 50% of the companies falls in the bin “0-100M”.

ggplot(data_companies,aes(revenue_bin)) + geom_bar(aes(fill = company_type))

From the above plot, we can see that the proprotion of “Private” companies in the category “0-100M” is really high in comparison to other bins.

ggplot(data_companies,aes(revenue_bin)) + geom_bar(aes(fill = full_time_employees))

The companies with less than 200 employess constitute more than 2/3rds of the “0-100M” companies.

Summary

The primary objective for this project is to perform an exploratory analysis ona dataset and try to apply some/all the techniques learned in the class to do the same. We have used the Open 500 Companies data for this purpose.Firstly,we imported the data into R, cleaned the data, added some additonal columns and plotted some graphs to make some deductions out of them.

Following are the interesting insights that I oculd deduct out of the analysis:

  1. Out of the 529 companies started in 314 years, almost 290 companies where founded in 10 years(2003-2012).
  2. There are companies that were founded before 1950 and still have ‘51-200’ employees.
  3. Many companies that started after 2000 have less than 200 employees.
  4. Most of the companies that started in the laste 1990s and 2000s belong to “Data/Technology” or “Finance & Investment” or “Housing/Real Estate” or “Research & Consulting” categories
  5. “Private” companies dominated the early 1990s and post 2000 years.
  6. California has the maximum number of companies. The states of California, New York, Massachusetts consitute almost 50% of the companies.
  7. Illinois is 3rd in terms of Revenue whereas Massachusetts was 3rd in terms of number of companies. Also, most of the companies in Illinois and New York were started beofore 2000 where as the companies in California are relatively new i.e. after 2000.
  8. The companies with strength ‘51-200’ are 3rd in number, but the total revenue of these companies is almost negligible in comparison to other groups or total revenue.
  9. The public companies are distributed over the years, whereas the private companies increased during the late 1990s to 2000s.
  10. In terms of number of companies in a category, “Data/Technology” has the maximum number of companies.
  11. The revenue of the “Scientific Research” Category is the highest among all Company categories. Also, the “scientific Research” category is completely “Private”.
  12. More than 50% of the companies falls in the bin “0-100M”.
  13. The proprotion of “Private” companies in the category “0-100M” is really high in comparison to other bins.
  14. The companies with less than 200 employess constitute more than 2/3rds of the “0-100M” companies.

Following are the implications/business ideas I deduce from my insights:

  1. Since 200, many small companies have been founded that are competing in the market.
  2. “Data/Technology”, “Finance & Investment”, “housing/Real Estate” or “Research & Consulting” are the growing fields.
  3. Since late 1990s, Private Companies are taking over the market. SO investing in private companies would be beneficial.
  4. The states of California, New York, Massachusetts offer a good market to new companies.
  5. But there are states like Illinois that are better in terms of revenue.
  6. Most of the Companies in the state of Illinois are pre 2000 founded. Thus, the state can be taken into consideration for investment because of less competition.
  7. Data/Technology is the most emerging fieldwith most competition.
  8. SCientific Research Is the field with less companies involved than the Data/Technology but more revenue.
  9. The number of small companies i.e with revenue less than 100M Constitute more proprotion of the toal revenue because of their large numbers.

Limitations

  1. The selection of the 529 companies was not random. It is through an Outreach Campaign, Expert Advise and Research. As a result, our analysis may result in a bias.The criterion for the inclusion of the companies in the campaign/Research is unknown.
  2. The company categories dont include all the possible categories, beacuse of which our results cant be generalised.
  3. The data was collected in 2014 and might not be valid for such analysis at a current date.
  4. The data is not complete. A better complete data can be accumulated for such analysis.
  5. ALso, 500+ companies is not large enough data to perform an analysis to recognise some pattern/trend.