Introduction

The primary objective of this project is exploratory data analysis. The data set I am using for this purpose is Open Data 500, which contains information about 500+ U.S.-based companies.

The data is intended to help generate new business and develop new products and services. To that end, I intend to understand and analyse the data by importing it into R, cleaning it, and building reports by selected categories. With this approach, I will be able to suggest which sectors/industries generate more revenue and which can be considered for development.

Packages Required

To analyze this data, we will use the following R packages:

library(lubridate)
library(tidyverse)
library(data.table)

Package lubridate provides functions for working with date-times and time spans.

Package tidyverse is a collection of packages that provide functions to clean and manipulate data, create plots, and perform exploratory data analysis.

Package data.table provides the function ‘fread’, which reads data from files into R faster than base R functions such as read.csv.

Data Preparation

As stated in the introduction, the data we are using for this project is Open Data 500, which contains information about 500+ U.S.-based companies from 2014. This list was compiled by GovLab through three processes: an outreach campaign, expert advice, and research.

Below is more information regarding the above processes:

Outreach Campaign

  • Mass email to over 3,000 contacts in the GovLab network
  • Mass email to over 2,000 contacts from OpenDataNow.com
  • Blog posts on TheGovLab.org and OpenDataNow.com
  • Social media recommendations
  • Media coverage of the Open Data 500
  • Attending presentations and conferences

Expert Advice

  • Recommendations from government and non-governmental organizations
  • Guidance and feedback from Open Data 500 advisors

Research

  • Companies identified for the book, Open Data Now
  • Companies using datasets from Data.gov
  • Directory of open data companies developed by Deloitte
  • Online Open Data Userbase created by Socrata
  • General research from publicly available sources

Study Goals

The primary goals of the Open Data 500 study were as follows:

  • Provide a basis for assessing the economic value of government open data
  • Encourage the development of new open data companies
  • Foster a dialogue between government and business on how government data can be made more useful

My primary objective is to do an exploratory data analysis: understand the data, clean it, and build reports by selected categories.

Here, we start with the data cleaning process:

#Read the data into R
data_companies <- fread("us_companies.csv")

#Looking at the structure of data_companies
str(data_companies)
## Classes 'data.table' and 'data.frame':   529 obs. of  22 variables:
##  $ company_name_id    : chr  "3-round-stones-inc" "48-factoring-inc" "5psolutions" "abt-associates" ...
##  $ company_name       : chr  "3 Round Stones, Inc." "48 Factoring Inc." "5PSolutions" "Abt Associates" ...
##  $ url                : chr  "http://3RoundStones.com" "https://www.48factoring.com" "www.5psolutions.com" "abtassoc.com" ...
##  $ year_founded       : int  2010 2014 2007 1965 1999 1989 1962 1969 2001 2009 ...
##  $ city               : chr  "Washington" "Philadelphia" "Fairfax" "Cambridge" ...
##  $ state              : chr  "DC" "PA" "VA" "MA" ...
##  $ country            : chr  "us" "us" "us" "us" ...
##  $ zip_code           : int  20004 19087 22003 2138 94583 60601 16803 72201 92618 95510 ...
##  $ full_time_employees: chr  "10-Jan" "51-200" "10-Jan" "1,001-5,000" ...
##  $ company_type       : chr  "Private" "Private" "Private" "Private" ...
##  $ company_category   : chr  "Data/Technology" "Finance & Investment" "Data/Technology" "Research & Consulting" ...
##  $ revenue_source     : chr  "Data analysis for clients, Database licensing, Subscriptions" "Financial Services" "Subscriptions, User fees for web or mobile access" "Data analysis for clients, Database licensing" ...
##  $ business_model     : chr  "Business to Business, Business to Consumer" "Business to Business" "Business to Business, Business to Consumer, Business to Government" "" ...
##  $ social_impact      : chr  "" "Small Business Owners" "" "" ...
##  $ description        : chr  "3 Round Stones produces a platform for publishing data on the Web. 3 Round Stones provides commercial support f"| __truncated__ "The company mission is to provide finance to small business. We also provide financing to small business with b"| __truncated__ "At 5PSolutions, we wish to make all basic information of different categories easily available to via tablets or phones." "Abt Associates is a mission-driven, international company conducting research and program implementation in the"| __truncated__ ...
##  $ description_short  : chr  "Our Open Source platform is used by the Fortune2000 and US Government Agencies to collect, publish and reuse da"| __truncated__ "48 Factoring Inc. is one of the best financial services company using unique factoring 2.0 financial product wh"| __truncated__ "5PSolutions are artisans of mobile platforms." "Abt Associates is a mission-driven, global leader in research and program implementation in the fields of healt"| __truncated__ ...
##  $ source_count       : chr  NA "Nov-50" NA "101+" ...
##  $ data_types         : chr  "" "Business" "" "" ...
##  $ example_uses       : chr  "" "" "" "" ...
##  $ data_impacts       : chr  "[]" "[u'Cost efficiency', u'Job growth', u'Revenue growth']" "[]" "[]" ...
##  $ financial_info     : chr  "3 Round Stones is a profitable, self-funded, woman-owned start-up.  Our team has several successful serial entr"| __truncated__ "" "" "Employee-owned company. $552M/year." ...
##  $ last_updated       : chr  "44:26.0" "36:39.9" "09:35.5" "23:21.4" ...
##  - attr(*, ".internal.selfref")=<externalptr>
#Viewing the dataframe data_companies in a table format
View(data_companies)

#Checking the dimensions of the data_companies
dim(data_companies)
## [1] 529  22

The raw data has 529 observations and 22 variables. Based on my understanding of the data set, I have decided to drop the following variables for the reasons given:

  • company_name_id: This variable is an identifier derived from company_name (e.g. “3-round-stones-inc”), so it duplicates that variable.
  • social_impact: This variable is not useful for the analysis, as 98% of its values are missing.
  • data_types, example_uses, data_impacts, financial_info: These variables are not relevant to the analysis.

# Dropping company_name_id (column 1), social_impact (column 14) and
# data_types, example_uses, data_impacts, financial_info (columns 18 to 21)
data_companies <- data_companies[,-c(1,14,18:21)]  

#Converting the variable "zip_code" into a character
data_companies$zip_code <- as.character(data_companies$zip_code) 

#Correcting values in full_time_employees and source_count that were mangled into
#dates ("10-Jan", "Nov-50"), most likely by spreadsheet auto-formatting
data_companies$full_time_employees[data_companies$full_time_employees == "10-Jan"] <- "1-10"
data_companies$full_time_employees[data_companies$full_time_employees == "Nov-50"] <- "11-50"
data_companies$source_count[data_companies$source_count == "10-Jan"] <- "1-10"
data_companies$source_count[data_companies$source_count == "Nov-50"] <- "11-50"

#converting the variable full_time_employees and source_count into a factor
data_companies$full_time_employees <- as.factor(data_companies$full_time_employees)
data_companies$source_count <- as.factor(data_companies$source_count)

In the code above, we converted the variable “zip_code” to a character. We also replaced the faulty values “10-Jan” and “Nov-50” in the variables full_time_employees and source_count with “1-10” and “11-50”, and then converted both variables into categorical variables (factors). Note that because zip_code was first read as an integer, leading zeros were already lost (for example “2138” instead of “02138”), and converting to character does not restore them.
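
One way to restore the leading zeros is to pad each code to five digits. The snippet below is a minimal sketch on toy values rather than the project data; the vector names are hypothetical, and it assumes every ZIP code is a plain five-digit U.S. code.

```r
# Toy ZIP codes as they look after being read as integers (leading zero lost).
# The values mirror ones visible in the str() output; NA stands for a missing ZIP.
zip_int <- c(20004L, 2138L, NA)

# Pad to five digits, keeping missing values as real NA instead of the text "NA"
zip_chr <- ifelse(is.na(zip_int), NA_character_, sprintf("%05d", zip_int))

zip_chr
```

It may also be possible to avoid the problem at read time by asking fread to treat the column as text through its colClasses argument, although that has not been tried here.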

Now, I would like to check the structure of the data once again.

str(data_companies)
## Classes 'data.table' and 'data.frame':   529 obs. of  16 variables:
##  $ company_name       : chr  "3 Round Stones, Inc." "48 Factoring Inc." "5PSolutions" "Abt Associates" ...
##  $ url                : chr  "http://3RoundStones.com" "https://www.48factoring.com" "www.5psolutions.com" "abtassoc.com" ...
##  $ year_founded       : int  2010 2014 2007 1965 1999 1989 1962 1969 2001 2009 ...
##  $ city               : chr  "Washington" "Philadelphia" "Fairfax" "Cambridge" ...
##  $ state              : chr  "DC" "PA" "VA" "MA" ...
##  $ country            : chr  "us" "us" "us" "us" ...
##  $ zip_code           : chr  "20004" "19087" "22003" "2138" ...
##  $ full_time_employees: Factor w/ 8 levels "1-10","1,001-5,000",..: 1 8 1 2 7 3 5 6 4 3 ...
##  $ company_type       : chr  "Private" "Private" "Private" "Private" ...
##  $ company_category   : chr  "Data/Technology" "Finance & Investment" "Data/Technology" "Research & Consulting" ...
##  $ revenue_source     : chr  "Data analysis for clients, Database licensing, Subscriptions" "Financial Services" "Subscriptions, User fees for web or mobile access" "Data analysis for clients, Database licensing" ...
##  $ business_model     : chr  "Business to Business, Business to Consumer" "Business to Business" "Business to Business, Business to Consumer, Business to Government" "" ...
##  $ description        : chr  "3 Round Stones produces a platform for publishing data on the Web. 3 Round Stones provides commercial support f"| __truncated__ "The company mission is to provide finance to small business. We also provide financing to small business with b"| __truncated__ "At 5PSolutions, we wish to make all basic information of different categories easily available to via tablets or phones." "Abt Associates is a mission-driven, international company conducting research and program implementation in the"| __truncated__ ...
##  $ description_short  : chr  "Our Open Source platform is used by the Fortune2000 and US Government Agencies to collect, publish and reuse da"| __truncated__ "48 Factoring Inc. is one of the best financial services company using unique factoring 2.0 financial product wh"| __truncated__ "5PSolutions are artisans of mobile platforms." "Abt Associates is a mission-driven, global leader in research and program implementation in the fields of healt"| __truncated__ ...
##  $ source_count       : Factor w/ 5 levels "","1-10","101+",..: NA 4 NA 3 3 NA NA 3 NA 3 ...
##  $ last_updated       : chr  "44:26.0" "36:39.9" "09:35.5" "23:21.4" ...
##  - attr(*, ".internal.selfref")=<externalptr>
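
The factor levels in the output above are sorted alphabetically ("1-10" is followed by "1,001-5,000"), which would put the size bands in a misleading order in tables and plots. A possible remedy, sketched here on toy values, is to declare the level order explicitly. Only the bands that are visible in this report are used; the full set of eight levels would need to be listed in the same way.

```r
# Toy employee-size bands, using only values that appear in this report
sizes <- c("1-10", "51-200", "1,001-5,000", "11-50", "10,001+", "1-10")

# Desired order from smallest to largest band (partial list, for illustration)
size_levels <- c("1-10", "11-50", "51-200", "1,001-5,000", "10,001+")

# An ordered factor keeps this order in summaries and on plot axes
sizes_f <- factor(sizes, levels = size_levels, ordered = TRUE)

levels(sizes_f)
table(sizes_f)
```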

Here, I count the number of missing values in each variable.

colSums(is.na(data_companies))
##        company_name                 url        year_founded 
##                   0                   0                   0 
##                city               state             country 
##                   0                   0                   0 
##            zip_code full_time_employees        company_type 
##                  37                  29                   1 
##    company_category      revenue_source      business_model 
##                   0                   0                   0 
##         description   description_short        source_count 
##                   0                   0                 299 
##        last_updated 
##                   0

I would like to take the following actions for the missing values:

  • For zip_code, I would like to look up each company and fill in the missing values.
  • For full_time_employees, I would like to replace the missing values with the string “NA”.
  • For company_type, I would like to replace the missing values with the string “NA”.
  • For source_count, I would like to replace the missing values with the string “NA”.
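
For the factor variables, the replacement needs one extra step: assigning a label that is not an existing level only produces a warning and leaves the value missing. The sketch below shows one way to do it on a toy factor standing in for full_time_employees; the variable name is hypothetical.

```r
# Toy factor with one missing value, standing in for full_time_employees
employees <- factor(c("1-10", NA, "11-50", "1-10"))

# Add the new level first, then fill the missing entries with it
levels(employees) <- c(levels(employees), "NA")
employees[is.na(employees)] <- "NA"

# The missing value is now counted as its own category
table(employees)
```

For a character variable such as company_type, the plain assignment x[is.na(x)] <- "NA" is enough.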

(One interesting observation: values that were “NA” in the CSV file are read as NA in R, but blank values in the CSV file are not. I would like to look into fixing this.)
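
One possible fix, sketched below on a toy data frame, is to convert blank strings to NA after reading. The helper blank_to_na is a hypothetical name, not part of any package. fread also has an na.strings argument that may handle this at read time, but its treatment of empty fields should be checked against the installed data.table version.

```r
# Toy data mimicking the mix of NA and blank strings seen in data_companies
toy <- data.frame(
  business_model = c("Business to Business", "", NA),
  source_count   = c("", "101+", NA),
  year_founded   = c(2010L, 2014L, 2007L),
  stringsAsFactors = FALSE
)

# Turn blank or whitespace-only strings into NA; non-character columns are untouched
blank_to_na <- function(x) {
  if (is.character(x)) x[!is.na(x) & trimws(x) == ""] <- NA_character_
  x
}
toy[] <- lapply(toy, blank_to_na)

# Blank entries are now counted together with the original NA values
colSums(is.na(toy))
```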

Now, we will look at the structure and summary of the data frame data_companies:

str(data_companies)
## Classes 'data.table' and 'data.frame':   529 obs. of  16 variables:
##  $ company_name       : chr  "3 Round Stones, Inc." "48 Factoring Inc." "5PSolutions" "Abt Associates" ...
##  $ url                : chr  "http://3RoundStones.com" "https://www.48factoring.com" "www.5psolutions.com" "abtassoc.com" ...
##  $ year_founded       : int  2010 2014 2007 1965 1999 1989 1962 1969 2001 2009 ...
##  $ city               : chr  "Washington" "Philadelphia" "Fairfax" "Cambridge" ...
##  $ state              : chr  "DC" "PA" "VA" "MA" ...
##  $ country            : chr  "us" "us" "us" "us" ...
##  $ zip_code           : chr  "20004" "19087" "22003" "2138" ...
##  $ full_time_employees: Factor w/ 8 levels "1-10","1,001-5,000",..: 1 8 1 2 7 3 5 6 4 3 ...
##  $ company_type       : chr  "Private" "Private" "Private" "Private" ...
##  $ company_category   : chr  "Data/Technology" "Finance & Investment" "Data/Technology" "Research & Consulting" ...
##  $ revenue_source     : chr  "Data analysis for clients, Database licensing, Subscriptions" "Financial Services" "Subscriptions, User fees for web or mobile access" "Data analysis for clients, Database licensing" ...
##  $ business_model     : chr  "Business to Business, Business to Consumer" "Business to Business" "Business to Business, Business to Consumer, Business to Government" "" ...
##  $ description        : chr  "3 Round Stones produces a platform for publishing data on the Web. 3 Round Stones provides commercial support f"| __truncated__ "The company mission is to provide finance to small business. We also provide financing to small business with b"| __truncated__ "At 5PSolutions, we wish to make all basic information of different categories easily available to via tablets or phones." "Abt Associates is a mission-driven, international company conducting research and program implementation in the"| __truncated__ ...
##  $ description_short  : chr  "Our Open Source platform is used by the Fortune2000 and US Government Agencies to collect, publish and reuse da"| __truncated__ "48 Factoring Inc. is one of the best financial services company using unique factoring 2.0 financial product wh"| __truncated__ "5PSolutions are artisans of mobile platforms." "Abt Associates is a mission-driven, global leader in research and program implementation in the fields of healt"| __truncated__ ...
##  $ source_count       : Factor w/ 5 levels "","1-10","101+",..: NA 4 NA 3 3 NA NA 3 NA 3 ...
##  $ last_updated       : chr  "44:26.0" "36:39.9" "09:35.5" "23:21.4" ...
##  - attr(*, ".internal.selfref")=<externalptr>
summary(data_companies)
##  company_name           url             year_founded      city          
##  Length:529         Length:529         Min.   :1799   Length:529        
##  Class :character   Class :character   1st Qu.:1994   Class :character  
##  Mode  :character   Mode  :character   Median :2007   Mode  :character  
##                                        Mean   :1993                     
##                                        3rd Qu.:2010                     
##                                        Max.   :2015                     
##                                                                         
##     state             country            zip_code        
##  Length:529         Length:529         Length:529        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   full_time_employees company_type       company_category  
##  1-10       :143      Length:529         Length:529        
##  11-50      :115      Class :character   Class :character  
##  51-200     : 93      Mode  :character   Mode  :character  
##  10,001+    : 56                                           
##  1,001-5,000: 30                                           
##  (Other)    : 63                                           
##  NA's       : 29                                           
##  revenue_source     business_model     description       
##  Length:529         Length:529         Length:529        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  description_short  source_count last_updated      
##  Length:529               :  4   Length:529        
##  Class :character   1-10  : 57   Class :character  
##  Mode  :character   101+  :126   Mode  :character  
##                     11-50 : 32                     
##                     51-100: 11                     
##                     NA's  :299                     
## 

Proposed Exploratory Data Analysis

From here, I plan to work through the following ideas:

  1. As mentioned in Data Preparation, replace the missing values of certain categorical variables with the string “NA”.
  2. Investigate why blank values in the data are not read into R as NA, and fix the issue.
  3. Look up the zip_code of the companies that have a missing zip_code and fill in these values.
  4. Look up the revenue of all these companies and add a new column “Revenue”, which would be the key parameter of the analysis.
  5. Restructure the multi-valued columns so that they can be used for analysis.
  6. Analyse revenue against categorical variables such as full_time_employees, company_type and company_category.
  7. Build a better understanding of the data as the analysis progresses.
  8. Use pie charts, frequency bar charts, line charts, or a combination of these for the plots.
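
As an illustration of how the multi-valued columns might be restructured, the sketch below splits a comma-separated revenue_source column into one row per company and revenue source. It uses base R on three rows copied from the output above; the object names are hypothetical. It assumes that a comma never occurs inside an individual value, which should be verified on the full data.

```r
# Three rows copied from the str() output, standing in for data_companies
toy <- data.frame(
  company_name = c("3 Round Stones, Inc.", "48 Factoring Inc.", "5PSolutions"),
  revenue_source = c(
    "Data analysis for clients, Database licensing, Subscriptions",
    "Financial Services",
    "Subscriptions, User fees for web or mobile access"
  ),
  stringsAsFactors = FALSE
)

# Split each entry on commas (plus any following spaces)
parts <- strsplit(toy$revenue_source, ",\\s*")

# Repeat each company name once per revenue source to get a long table
long <- data.frame(
  company_name   = rep(toy$company_name, lengths(parts)),
  revenue_source = unlist(parts),
  stringsAsFactors = FALSE
)

# Frequency of each revenue source across companies
sort(table(long$revenue_source), decreasing = TRUE)
```

The long table can then feed a frequency bar chart directly, and the same approach would apply to business_model.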