The primary objective of this project is exploratory data analysis. The data I am considering for this purpose is ‘Open Data 500’, which contains information about a list of 500+ U.S.-based companies.
The purpose of this data is to generate new business and develop new products and services. To that end, I intend to understand and analyse the data by importing it into R, cleaning it and building reports by certain categories. With this approach, I will be able to suggest which sector/industry makes more revenue and which sector/industry can be considered for development.
To analyze this data, we will use the following R packages:
library(lubridate)
library(tidyverse)
library(data.table)
Package lubridate provides functions for working with date-times and time-spans.
Package tidyverse is a collection of packages that provides functions to clean and manipulate data and to perform exploratory data analysis, including plots.
Package data.table provides the function ‘fread’, which is a fast way to read data from files into R.
As stated in the introduction, the data we are using for this project is Open Data 500, which contains information about a list of 500+ U.S.-based companies from 2014. This list is compiled by GovLab through
Below is more information regarding the above processes:
The primary study goals of the data were as below:
My primary objective is to do an exploratory data analysis: understand the data, clean it and build reports by certain categories.
Here, we start with the data cleaning process:
#Read the data into R
data_companies <- fread("us_companies.csv")
#Looking at the structure of data_companies
str(data_companies)
## Classes 'data.table' and 'data.frame': 529 obs. of 22 variables:
## $ company_name_id : chr "3-round-stones-inc" "48-factoring-inc" "5psolutions" "abt-associates" ...
## $ company_name : chr "3 Round Stones, Inc." "48 Factoring Inc." "5PSolutions" "Abt Associates" ...
## $ url : chr "http://3RoundStones.com" "https://www.48factoring.com" "www.5psolutions.com" "abtassoc.com" ...
## $ year_founded : int 2010 2014 2007 1965 1999 1989 1962 1969 2001 2009 ...
## $ city : chr "Washington" "Philadelphia" "Fairfax" "Cambridge" ...
## $ state : chr "DC" "PA" "VA" "MA" ...
## $ country : chr "us" "us" "us" "us" ...
## $ zip_code : int 20004 19087 22003 2138 94583 60601 16803 72201 92618 95510 ...
## $ full_time_employees: chr "10-Jan" "51-200" "10-Jan" "1,001-5,000" ...
## $ company_type : chr "Private" "Private" "Private" "Private" ...
## $ company_category : chr "Data/Technology" "Finance & Investment" "Data/Technology" "Research & Consulting" ...
## $ revenue_source : chr "Data analysis for clients, Database licensing, Subscriptions" "Financial Services" "Subscriptions, User fees for web or mobile access" "Data analysis for clients, Database licensing" ...
## $ business_model : chr "Business to Business, Business to Consumer" "Business to Business" "Business to Business, Business to Consumer, Business to Government" "" ...
## $ social_impact : chr "" "Small Business Owners" "" "" ...
## $ description : chr "3 Round Stones produces a platform for publishing data on the Web. 3 Round Stones provides commercial support f"| __truncated__ "The company mission is to provide finance to small business. We also provide financing to small business with b"| __truncated__ "At 5PSolutions, we wish to make all basic information of different categories easily available to via tablets or phones." "Abt Associates is a mission-driven, international company conducting research and program implementation in the"| __truncated__ ...
## $ description_short : chr "Our Open Source platform is used by the Fortune2000 and US Government Agencies to collect, publish and reuse da"| __truncated__ "48 Factoring Inc. is one of the best financial services company using unique factoring 2.0 financial product wh"| __truncated__ "5PSolutions are artisans of mobile platforms." "Abt Associates is a mission-driven, global leader in research and program implementation in the fields of healt"| __truncated__ ...
## $ source_count : chr NA "Nov-50" NA "101+" ...
## $ data_types : chr "" "Business" "" "" ...
## $ example_uses : chr "" "" "" "" ...
## $ data_impacts : chr "[]" "[u'Cost efficiency', u'Job growth', u'Revenue growth']" "[]" "[]" ...
## $ financial_info : chr "3 Round Stones is a profitable, self-funded, woman-owned start-up. Our team has several successful serial entr"| __truncated__ "" "" "Employee-owned company. $552M/year." ...
## $ last_updated : chr "44:26.0" "36:39.9" "09:35.5" "23:21.4" ...
## - attr(*, ".internal.selfref")=<externalptr>
#Viewing the dataframe data_companies in a table format
View(data_companies)
#Checking the dimensions of the data_companies
dim(data_companies)
## [1] 529 22
The raw data has 529 observations and 22 variables. Based on my understanding of the dataset, I have decided to drop the variables below for the following reasons:
# Deleting the variable "company_name_id" because it duplicates "company_name",
# along with columns 14 and 18:21 (social_impact, data_types, example_uses, data_impacts, financial_info)
data_companies <- data_companies[,-c(1,14,18:21)]
#converting the variable "zip_code" into a character
data_companies$zip_code <- as.character(data_companies$zip_code)
#Correcting faulty values in full_time_employees and source_count
data_companies$full_time_employees[data_companies$full_time_employees == "10-Jan"] <- "1-10"
data_companies$full_time_employees[data_companies$full_time_employees == "Nov-50"] <- "11-50"
data_companies$source_count[data_companies$source_count == "10-Jan"] <- "1-10"
data_companies$source_count[data_companies$source_count == "Nov-50"] <- "11-50"
#converting the variables full_time_employees and source_count into factors
data_companies$full_time_employees <- as.factor(data_companies$full_time_employees)
data_companies$source_count <- as.factor(data_companies$source_count)
In the code above, we have also converted the variable “zip_code” to a character. Along with that, we have changed the faulty values “10-Jan” and “Nov-50” in the variables full_time_employees and source_count to “1-10” and “11-50”. After that, we have converted the variables full_time_employees and source_count into categorical variables (factors).
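The recoding step above can also be sketched with a lookup vector, which keeps all the corrections in one place. The values “10-Jan” and “Nov-50” look like ranges that a spreadsheet program turned into dates, although that is an assumption on my part. The sketch also builds an ordered factor, because as.factor sorts levels alphabetically (which is why “1,001-5,000” appears as the second level in the str output). The vector employees, the lookup fixes and the level order size_levels are illustrative names and values, not part of the original analysis.

```r
# Toy vector standing in for data_companies$full_time_employees (illustrative values)
employees <- c("10-Jan", "51-200", "Nov-50", "1,001-5,000", NA, "10-Jan")

# Lookup of mangled labels and the ranges they presumably stand for
fixes <- c("10-Jan" = "1-10", "Nov-50" = "11-50")

# Replace only the values found in the lookup; everything else is kept as-is
mangled <- employees %in% names(fixes)
employees[mangled] <- unname(fixes[employees[mangled]])

# Assumed size ordering, restricted to the labels present in this toy vector
size_levels <- c("1-10", "11-50", "51-200", "1,001-5,000")
employees <- factor(employees, levels = size_levels, ordered = TRUE)

# Counts per size band, with missing values shown separately
table(employees, useNA = "ifany")
```

With an ordered factor, summaries and plots list the size bands from smallest to largest instead of alphabetically. For the real column, the full set of eight levels would have to be listed in size_levels.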
Now, I would like to check the structure of the data once again.
str(data_companies)
## Classes 'data.table' and 'data.frame': 529 obs. of 16 variables:
## $ company_name : chr "3 Round Stones, Inc." "48 Factoring Inc." "5PSolutions" "Abt Associates" ...
## $ url : chr "http://3RoundStones.com" "https://www.48factoring.com" "www.5psolutions.com" "abtassoc.com" ...
## $ year_founded : int 2010 2014 2007 1965 1999 1989 1962 1969 2001 2009 ...
## $ city : chr "Washington" "Philadelphia" "Fairfax" "Cambridge" ...
## $ state : chr "DC" "PA" "VA" "MA" ...
## $ country : chr "us" "us" "us" "us" ...
## $ zip_code : chr "20004" "19087" "22003" "2138" ...
## $ full_time_employees: Factor w/ 8 levels "1-10","1,001-5,000",..: 1 8 1 2 7 3 5 6 4 3 ...
## $ company_type : chr "Private" "Private" "Private" "Private" ...
## $ company_category : chr "Data/Technology" "Finance & Investment" "Data/Technology" "Research & Consulting" ...
## $ revenue_source : chr "Data analysis for clients, Database licensing, Subscriptions" "Financial Services" "Subscriptions, User fees for web or mobile access" "Data analysis for clients, Database licensing" ...
## $ business_model : chr "Business to Business, Business to Consumer" "Business to Business" "Business to Business, Business to Consumer, Business to Government" "" ...
## $ description : chr "3 Round Stones produces a platform for publishing data on the Web. 3 Round Stones provides commercial support f"| __truncated__ "The company mission is to provide finance to small business. We also provide financing to small business with b"| __truncated__ "At 5PSolutions, we wish to make all basic information of different categories easily available to via tablets or phones." "Abt Associates is a mission-driven, international company conducting research and program implementation in the"| __truncated__ ...
## $ description_short : chr "Our Open Source platform is used by the Fortune2000 and US Government Agencies to collect, publish and reuse da"| __truncated__ "48 Factoring Inc. is one of the best financial services company using unique factoring 2.0 financial product wh"| __truncated__ "5PSolutions are artisans of mobile platforms." "Abt Associates is a mission-driven, global leader in research and program implementation in the fields of healt"| __truncated__ ...
## $ source_count : Factor w/ 5 levels "","1-10","101+",..: NA 4 NA 3 3 NA NA 3 NA 3 ...
## $ last_updated : chr "44:26.0" "36:39.9" "09:35.5" "23:21.4" ...
## - attr(*, ".internal.selfref")=<externalptr>
Here, I am trying to find the number of missing values in each variable.
colSums(is.na(data_companies))
## company_name url year_founded
## 0 0 0
## city state country
## 0 0 0
## zip_code full_time_employees company_type
## 37 29 1
## company_category revenue_source business_model
## 0 0 0
## description description_short source_count
## 0 0 299
## last_updated
## 0
I would like to take the following actions for the missing values:
zip_code: I would like to search for each company and fill in the missing values.
full_time_employees: I would like to replace the missing values with the string “NA”.
company_type: I would like to replace the missing values with the string “NA”.
source_count: I would like to replace the missing values with the string “NA”.
(One interesting thing I have observed is that values that were “NA” in the csv file are read as NA values in R, while blank values in the csv file are not shown as NA in R. I would like to look into fixing this issue.)
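The blank-versus-NA behaviour noted above can probably be handled at read time. This is a minimal sketch, assuming the blanks in the csv file are truly empty fields: fread accepts an na.strings argument, and listing both "" and "NA" asks it to treat empty fields and the literal text NA as missing. The inline CSV text and the name companies are made up for illustration and are not taken from us_companies.csv.

```r
library(data.table)

# Made-up three-row CSV: one literal "NA" and two empty fields
csv_text <- "company_name,company_type,source_count
Alpha,Private,1-10
Beta,,NA
Gamma,Public,
"

# Treat both empty fields and the literal string "NA" as missing
companies <- fread(text = csv_text, na.strings = c("", "NA"))

# Count missing values per column, as done for data_companies above
colSums(is.na(companies))
```

If the same argument were passed when reading us_companies.csv, the blank level "" seen in source_count would be expected to show up as NA instead, which would change the counts reported by colSums(is.na(data_companies)).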
Now, we will look at the structure and summary of the dataframe data_companies
str(data_companies)
## Classes 'data.table' and 'data.frame': 529 obs. of 16 variables:
## $ company_name : chr "3 Round Stones, Inc." "48 Factoring Inc." "5PSolutions" "Abt Associates" ...
## $ url : chr "http://3RoundStones.com" "https://www.48factoring.com" "www.5psolutions.com" "abtassoc.com" ...
## $ year_founded : int 2010 2014 2007 1965 1999 1989 1962 1969 2001 2009 ...
## $ city : chr "Washington" "Philadelphia" "Fairfax" "Cambridge" ...
## $ state : chr "DC" "PA" "VA" "MA" ...
## $ country : chr "us" "us" "us" "us" ...
## $ zip_code : chr "20004" "19087" "22003" "2138" ...
## $ full_time_employees: Factor w/ 8 levels "1-10","1,001-5,000",..: 1 8 1 2 7 3 5 6 4 3 ...
## $ company_type : chr "Private" "Private" "Private" "Private" ...
## $ company_category : chr "Data/Technology" "Finance & Investment" "Data/Technology" "Research & Consulting" ...
## $ revenue_source : chr "Data analysis for clients, Database licensing, Subscriptions" "Financial Services" "Subscriptions, User fees for web or mobile access" "Data analysis for clients, Database licensing" ...
## $ business_model : chr "Business to Business, Business to Consumer" "Business to Business" "Business to Business, Business to Consumer, Business to Government" "" ...
## $ description : chr "3 Round Stones produces a platform for publishing data on the Web. 3 Round Stones provides commercial support f"| __truncated__ "The company mission is to provide finance to small business. We also provide financing to small business with b"| __truncated__ "At 5PSolutions, we wish to make all basic information of different categories easily available to via tablets or phones." "Abt Associates is a mission-driven, international company conducting research and program implementation in the"| __truncated__ ...
## $ description_short : chr "Our Open Source platform is used by the Fortune2000 and US Government Agencies to collect, publish and reuse da"| __truncated__ "48 Factoring Inc. is one of the best financial services company using unique factoring 2.0 financial product wh"| __truncated__ "5PSolutions are artisans of mobile platforms." "Abt Associates is a mission-driven, global leader in research and program implementation in the fields of healt"| __truncated__ ...
## $ source_count : Factor w/ 5 levels "","1-10","101+",..: NA 4 NA 3 3 NA NA 3 NA 3 ...
## $ last_updated : chr "44:26.0" "36:39.9" "09:35.5" "23:21.4" ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(data_companies)
## company_name url year_founded city
## Length:529 Length:529 Min. :1799 Length:529
## Class :character Class :character 1st Qu.:1994 Class :character
## Mode :character Mode :character Median :2007 Mode :character
## Mean :1993
## 3rd Qu.:2010
## Max. :2015
##
## state country zip_code
## Length:529 Length:529 Length:529
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## full_time_employees company_type company_category
## 1-10 :143 Length:529 Length:529
## 11-50 :115 Class :character Class :character
## 51-200 : 93 Mode :character Mode :character
## 10,001+ : 56
## 1,001-5,000: 30
## (Other) : 63
## NA's : 29
## revenue_source business_model description
## Length:529 Length:529 Length:529
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## description_short source_count last_updated
## Length:529 : 4 Length:529
## Class :character 1-10 : 57 Class :character
## Mode :character 101+ :126 Mode :character
## 11-50 : 32
## 51-100: 11
## NA's :299
##
I plan to work through the following ideas from here:
Search for the zip_code of the companies that have a missing zip_code and fill in these values.
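One detail may matter when filling in zip codes. Because zip_code was first read as an integer, leading zeros were dropped, so the Cambridge, MA entry appears as "2138" in the str output rather than "02138". The sketch below restores them, assuming every entry is a five-digit U.S. zip code. The vector and the names zip_int and zip_padded are illustrative only.

```r
# Toy values mirroring the zip_code column after as.character(); NA marks a missing entry
zip_code <- c("20004", "19087", "22003", "2138", NA)

# Convert back to integer so sprintf can left-pad with zeros to five digits
zip_int <- as.integer(zip_code)

# Keep missing entries as NA instead of letting sprintf turn them into text
zip_padded <- ifelse(is.na(zip_int), NA_character_, sprintf("%05d", zip_int))

zip_padded
```

An alternative would be to read the column as character from the start, for example through the colClasses argument of fread, so that the zeros are never lost.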