The primary objective of this project is exploratory data analysis. The data I am considering for this purpose is 'Open Data 500', which contains information about 500+ U.S.-based companies.
The dataset exists to help generate new business and develop new products and services. To that end, I intend to understand and analyse the data by importing it into R, cleaning it, and building reports by certain categories. This approach will let me suggest which sector/industry generates more revenue and which sector/industry can be considered for development.
To analyze this data, we will use the following R packages:
library(tidyverse)
library(data.table)
library(DT)
The tidyverse package is a collection of packages that provides functions to clean and manipulate data, plot, and perform exploratory data analysis.
The data.table package provides the function fread, which reads data from files into R much faster than the base readers.
The DT package provides the function datatable, which renders a data frame as an interactive table in HTML pages.
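As a rough illustration of the fread speed claim (a sketch; timings vary by machine and file size):
# Compare the base R reader with data.table's fread on the same file
system.time(read.csv("us_companies.csv"))  # base reader
system.time(fread("us_companies.csv"))     # fread is typically several times faster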
As stated in the introduction, the data we are using for this project is Open Data 500, which contains information about 500+ U.S.-based companies as of 2014. The list was compiled by GovLab.
My primary objective is to perform exploratory data analysis: understand the data, clean it, and build reports by certain categories.
Here, we start with the data cleaning process:
#Read the data into R
data_companies <- fread("us_companies.csv")
#Viewing the dataframe data_companies in a table format
View(data_companies)
#Checking the dimensions of the data_companies
dim(data_companies)
## [1] 529 23
The raw data has 529 observations and 23 variables. Based on my understanding of the dataset, I have decided to drop the following variables, for the reasons given:
- company_name_id: this variable duplicates company_name.
- social_impact: irrelevant to the analysis, as about 98% of its values are missing.
- data_types, example_uses, data_impacts, financial_info: irrelevant to the analysis.
# Dropping the six columns listed above, by position
data_companies <- data_companies[,-c(1,14,18:21)]
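As an alternative to positional indexing, the same columns could be dropped by name, which is more robust to column reordering (a sketch; the names are taken from the list above and assumed to match the raw file):
# Drop the same six columns by name instead of position
drop_cols <- c("company_name_id", "social_impact", "data_types",
               "example_uses", "data_impacts", "financial_info")
data_companies[, (drop_cols) := NULL]  # data.table deletes these columns by reference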
#converting the variable "zip_code" into a character and "Revenue" into numeric
data_companies$zip_code <- as.character(data_companies$zip_code)
data_companies$Revenue <- as.numeric(data_companies$Revenue)
#Correcting faulty values in full_time_employees and source_count
data_companies$full_time_employees[data_companies$full_time_employees == "10-Jan"] <- "1-10"
data_companies$full_time_employees[data_companies$full_time_employees == "Nov-50"] <- "11-50"
data_companies$source_count[data_companies$source_count == "10-Jan"] <- "1-10"
data_companies$source_count[data_companies$source_count == "Nov-50"] <- "11-50"
#converting the variables full_time_employees and source_count into factors
data_companies$full_time_employees <- as.factor(data_companies$full_time_employees)
data_companies$source_count <- as.factor(data_companies$source_count)
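The value corrections and factor conversions above could also be combined into a single recode through a named lookup vector (a minimal sketch):
# Recode both columns via one lookup table, then re-factor
fix_map <- c("10-Jan" = "1-10", "Nov-50" = "11-50")
for (col in c("full_time_employees", "source_count")) {
  vals <- as.character(data_companies[[col]])
  bad  <- vals %in% names(fix_map)
  vals[bad] <- fix_map[vals[bad]]
  data_companies[[col]] <- factor(vals)
}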
In the code above, we also converted the variable zip_code to a character. Along with that, we corrected the faulty values "10-Jan" and "Nov-50" in the variables full_time_employees and source_count to "1-10" and "11-50"; these look like artifacts of a spreadsheet auto-converting the ranges to dates. After that, we converted full_time_employees and source_count into categorical variables (factors).
# Bin Revenue (in millions of dollars) into a categorical variable
data_companies$revenue_bin <- with(data_companies,
  ifelse(is.na(Revenue), "NA",
  ifelse(Revenue < 1, "0-1M",
  ifelse(Revenue < 100, "0-100M",
  ifelse(Revenue < 1000, "100M-1B", "More than 1B")))))
In the segment above, we create a new column revenue_bin, a categorical variable that segments the Revenue column.
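The same binning can be written more compactly with cut() (a sketch; breaks are in millions of dollars, and unlike the ifelse() chain, cut() leaves missing Revenue as a true NA rather than the string "NA"):
# Equivalent binning with cut(); intervals are left-closed to match the ifelse() chain
revenue_bin_alt <- cut(data_companies$Revenue,
                       breaks = c(0, 1, 100, 1000, Inf),
                       labels = c("0-1M", "0-100M", "100M-1B", "More than 1B"),
                       right = FALSE)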
Here, I find the number of missing values in each variable.
colSums(is.na(data_companies))
## company_name url year_founded
## 0 0 0
## city state country
## 0 0 0
## zip_code full_time_employees company_type
## 39 29 16
## company_category revenue_source business_model
## 3 0 76
## description description_short source_count
## 0 0 303
## last_updated Revenue revenue_bin
## 0 60 0
I take the following actions for the missing values:
- zip_code, full_time_employees, company_type, company_category: the NA values are actually the string "NA" in the raw data, which R reads as missing. These are categorical variables and the 'NA' level helps us understand the data, so I do not impute them.
- business_model, source_count: I leave these missing values as-is, since these variables are not of interest.
- Revenue: this is a numeric column. Imputation of its missing values would depend on the exploratory analysis we are about to perform, so for now we keep them and run the summary functions while ignoring them.
datatable(head(data_companies), class = 'cell-border stripe')
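If we instead wanted R to keep those literal "NA" strings as their own category, the read step could restrict what counts as missing (a sketch; fread treats "NA" as missing by default):
# Re-read the file treating only empty fields as missing, so "NA" stays literal text
data_raw <- fread("us_companies.csv", na.strings = "")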
#Checking the dimensions of the data_companies
dim(data_companies)
## [1] 529 18
The data we obtain after cleaning has 529 observations and 18 variables in total. Of these 18 columns, the columns of interest are year_founded, state, full_time_employees, company_type, company_category, Revenue, and revenue_bin. Revenue is the numeric column, expressed in millions of dollars; the rest are categorical in nature. The minimum and maximum values of Revenue are 0.094 and 9.5 × 10^5, and its mean and median are about 4489.80 and 10.
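These figures come from a standard summary of the Revenue column (summary() reports the missing values separately):
# Five-number summary plus mean of Revenue, in millions of dollars
summary(data_companies$Revenue)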
In this section, we plot various frequency tables and bar graphs of Revenue against categorical variables and try to draw some conclusions from these plots.
#Frequency table of the year_founded variable
data.frame(table(data_companies$year_founded))
## Var1 Freq
## 1 1799 1
## 2 1812 1
## 3 1836 1
## 4 1841 1
## 5 1845 1
## 6 1856 1
## 7 1860 1
## 8 1864 1
## 9 1868 1
## 10 1876 1
## 11 1880 1
## 12 1882 2
## 13 1885 1
## 14 1891 1
## 15 1895 1
## 16 1896 1
## 17 1899 1
## 18 1900 2
## 19 1901 1
## 20 1905 1
## 21 1907 1
## 22 1909 1
## 23 1912 3
## 24 1913 1
## 25 1914 2
## 26 1919 3
## 27 1922 2
## 28 1926 2
## 29 1928 2
## 30 1931 1
## 31 1933 1
## 32 1935 2
## 33 1936 2
## 34 1937 3
## 35 1941 1
## 36 1945 2
## 37 1946 2
## 38 1948 1
## 39 1950 1
## 40 1951 1
## 41 1953 1
## 42 1954 1
## 43 1955 1
## 44 1960 1
## 45 1961 1
## 46 1962 1
## 47 1963 1
## 48 1964 1
## 49 1965 2
## 50 1966 1
## 51 1967 1
## 52 1968 3
## 53 1969 4
## 54 1971 1
## 55 1972 1
## 56 1973 2
## 57 1975 2
## 58 1976 3
## 59 1977 2
## 60 1978 3
## 61 1979 4
## 62 1980 3
## 63 1981 1
## 64 1982 2
## 65 1983 3
## 66 1984 4
## 67 1986 1
## 68 1987 6
## 69 1988 2
## 70 1989 5
## 71 1990 2
## 72 1992 2
## 73 1993 5
## 74 1994 4
## 75 1995 6
## 76 1996 4
## 77 1997 4
## 78 1998 10
## 79 1999 10
## 80 2000 16
## 81 2001 10
## 82 2002 7
## 83 2003 11
## 84 2004 13
## 85 2005 17
## 86 2006 21
## 87 2007 28
## 88 2008 26
## 89 2009 32
## 90 2010 50
## 91 2011 51
## 92 2012 41
## 93 2013 30
## 94 2014 8
## 95 2015 1
From the frequency table above, we can see that the 529 companies were founded across 95 distinct years spanning 1799 to 2015, yet about 290 of them were founded in just ten years (2003-2012).
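A quick check of that count (a sketch, assuming year_founded was read as a number):
# Companies founded between 2003 and 2012
sum(data_companies$year_founded >= 2003 &
    data_companies$year_founded <= 2012, na.rm = TRUE)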
#Bar plot of year_founded, colored by full_time_employees
ggplot(data_companies,aes(year_founded)) + geom_bar(aes(fill = full_time_employees))
In the plot above, it is interesting to find companies founded before 1950 (some even before 1900) that still have only '51-200' employees. Also, many companies started after 2000 have fewer than 200 employees.
#Bar plot of year_founded, colored by company_category
ggplot(data_companies,aes(year_founded)) + geom_bar(aes(fill = company_category))
In the plot above, we can see that most of the companies started in the late 1990s and 2000s are in "Data/Technology", "Finance & Investment", "Housing/Real Estate", and "Research & Consulting".
#Bar plot of year_founded, colored by company_type
ggplot(data_companies,aes(year_founded)) + geom_bar(aes(fill = company_type))
In the plot above, we can see that "Private" companies dominated the early 1990s and the post-2000 years.
Now we look at the categorical variable state.
Below is the frequency table of state:
#Frequency table of the State variable
data.frame(table(data_companies$state))
## Var1 Freq
## 1 AL 1
## 2 AR 1
## 3 AZ 3
## 4 CA 132
## 5 CO 11
## 6 CT 8
## 7 DC 25
## 8 FL 7
## 9 GA 7
## 10 IA 1
## 11 IL 26
## 12 IN 7
## 13 KA 1
## 14 KY 1
## 15 MA 42
## 16 MD 13
## 17 ME 2
## 18 MI 5
## 19 MN 3
## 20 MO 6
## 21 MT 1
## 22 NC 5
## 23 NE 1
## 24 NH 2
## 25 NJ 15
## 26 NV 1
## 27 NY 106
## 28 OH 7
## 29 OK 1
## 30 OR 2
## 31 PA 11
## 32 RI 1
## 33 TN 3
## 34 TX 17
## 35 UT 2
## 36 VA 22
## 37 VT 1
## 38 WA 25
## 39 WI 4
ggplot(data_companies,aes(state)) + geom_bar()
From the bar plot above, it is evident that California has the largest number of companies. California, New York, and Massachusetts together account for more than half of the companies (280 of 529).
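A quick verification of that share, using the counts from the frequency table above:
# Share of companies in the three largest states: (132 + 106 + 42) / 529
sum(data_companies$state %in% c("CA", "NY", "MA")) / nrow(data_companies)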
ggplot(data_companies,aes(x = state,y = Revenue)) +
geom_bar(stat = "sum",aes(fill = year_founded))
From the bar plot of Revenue vs. state, we can see that Illinois is 3rd in terms of revenue, whereas Massachusetts was 3rd in terms of number of companies. Also, most of the companies in Illinois and New York were started before 2000, whereas the companies in California are relatively new, i.e., founded after 2000.
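The state revenue ranking can be checked directly (a sketch; aggregate() drops the missing Revenue values):
# Total revenue by state, top five
rev_by_state <- aggregate(Revenue ~ state, data = data_companies, FUN = sum)
head(rev_by_state[order(-rev_by_state$Revenue), ], 5)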
Now we look at the categorical variable full_time_employees.
data.frame(table(data_companies$full_time_employees))
## Var1 Freq
## 1 1-10 143
## 2 1,001-5,000 30
## 3 10,001+ 56
## 4 11-50 115
## 5 201-500 25
## 6 5,001-10,000 16
## 7 501-1,000 22
## 8 51-200 93
ggplot(data_companies,aes(full_time_employees)) + geom_bar()
From the plot above, we can see that 351 companies have 200 or fewer employees, which is roughly two-thirds of the total.
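Verifying that count against the frequency table above:
# Companies with 200 or fewer full-time employees: 143 + 115 + 93 = 351
sum(data_companies$full_time_employees %in% c("1-10", "11-50", "51-200"))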
ggplot(data_companies,aes(x = full_time_employees,y = Revenue)) + geom_bar(stat = "sum")
Combining the two plots above, we find that the companies with '51-200' employees are 3rd in number, but their total revenue is almost negligible compared to the other groups and to the total revenue.
Now we move on to the categorical variable company_type.
data.frame(table(data_companies$company_type))
## Var1 Freq
## 1 Nonprofit 15
## 2 nonprofit + commercial spinoff 1
## 3 partnership 1
## 4 Partnership 5
## 5 Permitting Assistance 1
## 6 Private 396
## 7 Public 92
## 8 Public (Federal-USGS) 1
## 9 sole proprietorship 1
Almost 75% of the companies (396 of 529) are private.
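The exact shares can be read off a proportion table (a sketch):
# Percentage of companies by company type
round(100 * prop.table(table(data_companies$company_type)), 1)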
ggplot(data_companies,aes(x = company_type,y = Revenue)) +
geom_bar(stat = "sum",aes(fill = year_founded)) +
theme(axis.text.x = element_text(angle = 90))
The interesting observation from the plot above is that public companies are spread across the founding years, whereas private companies surged from the late 1990s into the 2000s.
Now we look at the categorical variable company_category.
Below is the frequency table of company_category:
#the Frequency table of company_category
data.frame(table(data_companies$company_category))
## Var1 Freq
## 1 Aerospace and Defense 1
## 2 Business & Legal Services 44
## 3 Data/Technology 98
## 4 Education 19
## 5 Energy 28
## 6 Environment & Weather 11
## 7 Finance & Investment 75
## 8 Food & Agriculture 6
## 9 Geospatial/Mapping 30
## 10 Governance 43
## 11 Healthcare 40
## 12 Housing/Real Estate 21
## 13 Insurance 11
## 14 Lifestyle & Consumer 25
## 15 Media 1
## 16 Research & Consulting 28
## 17 Scientific Research 17
## 18 Transportation 28
ggplot(data_companies,aes(company_category)) +
geom_bar(aes(fill = company_type)) +
theme(axis.text.x = element_text(angle = 90))
"Data/Technology" has the largest number of companies among all categories.
ggplot(data_companies,aes(x = company_category,y = Revenue)) +
geom_bar(stat = "sum",aes(fill = company_type)) +
theme(axis.text.x = element_text(angle = 90))
Combining the two plots above, we can see that the revenue of the "Scientific Research" category is the highest among all company categories, whereas the number of companies is highest in the "Data/Technology" category. Also, the "Scientific Research" category consists entirely of "Private" companies.
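The category revenue ranking can be verified the same way as the state ranking (a sketch):
# Total revenue by company category, top five
rev_by_cat <- aggregate(Revenue ~ company_category, data = data_companies, FUN = sum)
head(rev_by_cat[order(-rev_by_cat$Revenue), ], 5)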
#Frequency table of the revenue_bin
data.frame(table(data_companies$revenue_bin))
## revenue_bin Freq
## 1 0-100M 341
## 2 0-1M 19
## 3 100M-1B 50
## 4 More than 1B 59
## 5 NA 60
More than half of the companies (341 of 529, about 64%) fall in the "0-100M" bin.
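In percentage terms (a sketch):
# Share of companies in each revenue bin, in percent
round(100 * prop.table(table(data_companies$revenue_bin)), 1)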
ggplot(data_companies,aes(revenue_bin)) + geom_bar(aes(fill = company_type))
From the plot above, we can see that the proportion of "Private" companies in the "0-100M" bin is very high compared to the other bins.
ggplot(data_companies,aes(revenue_bin)) + geom_bar(aes(fill = full_time_employees))
Companies with 200 or fewer employees constitute more than two-thirds of the "0-100M" bin.
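A cross-tabulation makes this relationship explicit (a sketch):
# Revenue bin by employee-count bin
table(data_companies$revenue_bin, data_companies$full_time_employees)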
The primary objective of this project was to perform an exploratory analysis on a dataset and apply some of the techniques learned in class. We used the Open Data 500 companies data for this purpose. First, we imported the data into R, cleaned it, added some additional columns, and plotted graphs to draw deductions from them.
The following are the interesting insights I could deduce from the analysis:
- About 290 of the 529 companies were founded in just ten years (2003-2012).
- California, New York, and Massachusetts together account for more than half of the companies.
- Roughly two-thirds of the companies have 200 or fewer full-time employees, and almost 75% are private.
- "Data/Technology" has the most companies, while "Scientific Research" generates the highest total revenue.
The following are the implications/business ideas I deduce from my insights:
- "Scientific Research", the highest-revenue category, is the sector that makes the most revenue.
- "Data/Technology", with the largest number of mostly young, small, private companies, is the sector that can be considered for further development.