*** Data set “Forbes2000.csv” from http://vincentarelbundock.github.io/Rdatasets/ ***

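# Base R's read.csv() is sufficient here; stringsAsFactors = TRUE keeps the factor columns shown below on R >= 4.0
Forbes2000.data <- read.csv("Forbes2000.csv", stringsAsFactors = TRUE)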
str(Forbes2000.data)
## 'data.frame':    2000 obs. of  9 variables:
##  $ X          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ name       : Factor w/ 2000 levels "Aareal Bank",..: 438 747 100 659 311 219 870 1827 663 1921 ...
##  $ country    : Factor w/ 61 levels "Africa","Australia",..: 60 60 60 60 56 60 56 28 60 60 ...
##  $ category   : Factor w/ 27 levels "Aerospace & defense",..: 2 6 16 19 19 2 2 8 9 20 ...
##  $ sales      : num  94.7 134.2 76.7 222.9 232.6 ...
##  $ profits    : num  17.85 15.59 6.46 20.96 10.27 ...
##  $ assets     : num  1264 627 648 167 178 ...
##  $ marketvalue: num  255 329 195 277 174 ...

1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

summary(Forbes2000.data)
##        X               rank                              name     
##  Min.   :   1.0   Min.   :   1.0   Aareal Bank             :   1  
##  1st Qu.: 500.8   1st Qu.: 500.8   ABB Group               :   1  
##  Median :1000.5   Median :1000.5   Abbey National          :   1  
##  Mean   :1000.5   Mean   :1000.5   Abbott Laboratories     :   1  
##  3rd Qu.:1500.2   3rd Qu.:1500.2   Abercrombie & Fitch     :   1  
##  Max.   :2000.0   Max.   :2000.0   Abertis Infraestructuras:   1  
##                                    (Other)                 :1994  
##            country                      category        sales        
##  United States :751   Banking               : 313   Min.   :  0.010  
##  Japan         :316   Diversified financials: 158   1st Qu.:  2.018  
##  United Kingdom:137   Insurance             : 112   Median :  4.365  
##  Germany       : 65   Utilities             : 110   Mean   :  9.697  
##  France        : 63   Materials             :  97   3rd Qu.:  9.547  
##  Canada        : 56   Oil & gas operations  :  90   Max.   :256.330  
##  (Other)       :612   (Other)               :1120                    
##     profits             assets          marketvalue    
##  Min.   :-25.8300   Min.   :   0.270   Min.   :  0.02  
##  1st Qu.:  0.0800   1st Qu.:   4.025   1st Qu.:  2.72  
##  Median :  0.2000   Median :   9.345   Median :  5.15  
##  Mean   :  0.3811   Mean   :  34.042   Mean   : 11.88  
##  3rd Qu.:  0.4400   3rd Qu.:  22.793   3rd Qu.: 10.60  
##  Max.   : 20.9600   Max.   :1264.030   Max.   :328.54  
##  NA's   :5

-> Conclusion: The United States tops the Forbes 2000 list with 751 companies. By category, Banking (313), Diversified financials (158) and Insurance (112) lead, together accounting for 583 of the 2000 companies.
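The summary above reports 5 missing values in `profits`, so functions such as `mean()` return `NA` for that column unless the missing values are dropped. The following is a minimal sketch on a small made-up data frame (`toy` and all of its values are hypothetical, not taken from the Forbes data) showing `na.rm = TRUE` together with a sorted frequency table of categories:

```r
# Hypothetical toy data standing in for Forbes2000.data (values are made up)
toy <- data.frame(
  category = factor(c("Banking", "Insurance", "Banking", "Utilities", "Banking")),
  profits  = c(1.2, NA, 0.4, -0.1, 0.9)
)

# A single NA makes the plain mean NA
mean(toy$profits)

# na.rm = TRUE drops missing values before computing the statistic
mean_profit   <- mean(toy$profits, na.rm = TRUE)
median_profit <- median(toy$profits, na.rm = TRUE)

# Frequency table of categories, most common first
category_counts <- sort(table(toy$category), decreasing = TRUE)
category_counts

# Percentage of rows that fall in a chosen set of categories
financial_share <- 100 * mean(toy$category %in% c("Banking", "Insurance"))
financial_share
```

The same calls should apply unchanged to `Forbes2000.data$profits` and `Forbes2000.data$category`.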

2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together):

# Subset of the data to analyze the Forbes 2000 companies based in Mexico:
Forbes2000.Mexico.data <- subset(Forbes2000.data,
                                 grepl("Mexico", Forbes2000.data$country),
                                 select = c("rank", "name", "category", "marketvalue", "profits"),
                                 drop = FALSE)
summary(Forbes2000.Mexico.data)
##       rank                        name                           category
##  Min.   : 375   ALFA                : 1   Food drink & tobacco       :4  
##  1st Qu.: 829   America Telecom     : 1   Banking                    :2  
##  Median :1440   Carso Global Telecom: 1   Conglomerates              :2  
##  Mean   :1240   Cemex               : 1   Telecommunications services:2  
##  3rd Qu.:1612   Coca-Cola Femsa     : 1   Capital goods              :1  
##  Max.   :1953   Femsa               : 1   Construction               :1  
##                 (Other)             :11   (Other)                    :5  
##   marketvalue      profits       
##  Min.   :0.62   Min.   :-0.0300  
##  1st Qu.:2.21   1st Qu.: 0.1300  
##  Median :3.70   Median : 0.2000  
##  Mean   :4.10   Mean   : 0.2006  
##  3rd Qu.:5.47   3rd Qu.: 0.2500  
##  Max.   :9.93   Max.   : 0.6300  
## 
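# Renaming columns (the new names are assigned by position). Note: the source data set documents these amounts in billion USD, so the "Trillions" suffix is a misnomer: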
names(Forbes2000.Mexico.data) <- c("rank" = "Rank","name"="CompanyName","category"="Category","marketvalue"="MarketValue","profits"="ProfitsInTrillions")

# Adding a new factor column for the direction of profits (named as a year-on-year, YoY, trend): "Up" if positive, "Down" if negative, "Same" if zero:
Forbes2000.Mexico.data$ProfitsYoY <- factor(
  ifelse(Forbes2000.Mexico.data$ProfitsInTrillions > 0, "Up",
         ifelse(Forbes2000.Mexico.data$ProfitsInTrillions < 0, "Down", "Same")),
  levels = c("Up", "Down", "Same"))

summary(Forbes2000.Mexico.data)
##       Rank                    CompanyName                        Category
##  Min.   : 375   ALFA                : 1   Food drink & tobacco       :4  
##  1st Qu.: 829   America Telecom     : 1   Banking                    :2  
##  Median :1440   Carso Global Telecom: 1   Conglomerates              :2  
##  Mean   :1240   Cemex               : 1   Telecommunications services:2  
##  3rd Qu.:1612   Coca-Cola Femsa     : 1   Capital goods              :1  
##  Max.   :1953   Femsa               : 1   Construction               :1  
##                 (Other)             :11   (Other)                    :5  
##   MarketValue   ProfitsInTrillions ProfitsYoY
##  Min.   :0.62   Min.   :-0.0300    Up  :15   
##  1st Qu.:2.21   1st Qu.: 0.1300    Down: 1   
##  Median :3.70   Median : 0.2000    Same: 1   
##  Mean   :4.10   Mean   : 0.2006              
##  3rd Qu.:5.47   3rd Qu.: 0.2500              
##  Max.   :9.93   Max.   : 0.6300              
## 
# Showing the subset of companies with a downward YoY profit trend:
subset(Forbes2000.Mexico.data, grepl("Down", Forbes2000.Mexico.data$ProfitsYoY) , select = c("Rank","CompanyName","Category","MarketValue", "ProfitsInTrillions"), drop = FALSE )
##      Rank    CompanyName Category MarketValue ProfitsInTrillions
## 1490 1490 Grupo Televisa    Media        6.42              -0.03

-> Conclusion: The best-ranked Mexican company sits at rank 375, and Food drink & tobacco is the most common category in Mexico, with 4 of the 17 companies. Only one company (Grupo Televisa) reports a negative profit.
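The `ProfitsYoY` column above is derived from the sign of the single profits column, since the data shown contain only one profits value per company. As a possible alternative to the nested `ifelse()` calls, the same labelling can be sketched with `sign()`; the `profits` vector below is made up for illustration:

```r
# Hypothetical profit values (made up for illustration)
profits <- c(0.25, -0.03, 0, 0.63)

# sign() returns 1, -1 or 0; map those values onto labelled factor levels
profit_sign <- factor(sign(profits),
                      levels = c(1, -1, 0),
                      labels = c("Up", "Down", "Same"))

# Count how many values fall into each level
sign_counts <- table(profit_sign)
sign_counts
```

Missing profits stay `NA` in the resulting factor, which matches the behaviour of `ifelse()`.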

3. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2:

# Box plot of Rank, and a ggplot scatter plot of Rank by company, for all companies in Mexico:
library(ggplot2)
boxplot(Forbes2000.Mexico.data$Rank , main = 'Forbes2000 Mexico Rank Data')

ggplot(Forbes2000.Mexico.data,  aes(y=CompanyName, x=Rank) ) +  geom_point()

# Pie chart of MarketValue for companies in Mexico:
pie(Forbes2000.Mexico.data$MarketValue, paste(Forbes2000.Mexico.data$CompanyName, " - " ,Forbes2000.Mexico.data$MarketValue, " T." , sep="" ), main = "Market Value Pie Chart for Mexico", radius=1, col = rainbow(length(Forbes2000.Mexico.data$MarketValue)))

# 3D pie chart of the ProfitsYoY counts for companies in Mexico:
library(plotrix)
profit.counts <- summary(Forbes2000.Mexico.data$ProfitsYoY)
profit.levels <- levels(Forbes2000.Mexico.data$ProfitsYoY)
pie3D(profit.counts, labels = paste(profit.levels, profit.counts, sep = ' - '), explode = 0.03, start = pi/2, main = "ProfitsYoY Pie Chart for Mexico", col = c("brown", "#ddaa00", "#dd00dd"))

# Bar graph of total market value by country:
library(dplyr) 
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Forbes2000.data %>%
  group_by(country) %>%
  summarise(sum.marketvalue = sum(marketvalue)) %>%
  ggplot(aes(x = country, y = sum.marketvalue)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
    labs(
        x = "Country",
        y = "Market Value",
        title = "Country wise Market holding"
    )
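With 61 countries on the x axis, an alphabetically ordered bar chart can be hard to read. One possible variation, sketched here in base R on made-up data, is to aggregate the totals and sort them before plotting. The data frame `toy` and its values are hypothetical:

```r
# Hypothetical toy data (country codes and values are made up)
toy <- data.frame(
  country     = c("A", "B", "A", "C", "B", "A"),
  marketvalue = c(5, 2, 3, 10, 1, 4)
)

# Sum market value per country
totals <- aggregate(marketvalue ~ country, data = toy, FUN = sum)

# Sort so the largest total comes first
totals <- totals[order(totals$marketvalue, decreasing = TRUE), ]

# Keep only the largest groups for a more readable chart
top2 <- head(totals, 2)
top2
```

In a ggplot2 pipeline, mapping `x = reorder(country, sum.marketvalue)` should give a similarly ordered axis without a separate sorting step.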

# Showing the countries in the Forbes 2000 list on a world map:
library(maptools)
## Loading required package: sp
## Checking rgeos availability: TRUE
data(wrld_simpl)
myCountries <- wrld_simpl@data$NAME %in% unique(as.character(Forbes2000.data$country))
plot(wrld_simpl, col = c(gray(.80), "blue")[myCountries+1])

4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end:

# Question: How does the USA compare with the rest of the world in terms of company count, assets and market value?
Forbes2000.data$country <- factor(Forbes2000.data$country)
Usa.data <- filter(Forbes2000.data, country == "United States" ) 
Rest.data <- filter(Forbes2000.data, country != "United States"  ) 
x <- matrix(NA_real_, nrow = 3, ncol = 4, dimnames = list(c("Count","Asset","Market Value"), c("USA","Rest", "USA %", "Rest %")))

x["Count", c("USA","Rest")] <- c(nrow(Usa.data), nrow(Rest.data))
# Use the full column name: `$asset` only worked through partial matching of `assets`
x["Asset", c("USA","Rest")] <- c(sum(Usa.data$assets), sum(Rest.data$assets))
x["Market Value", c("USA","Rest")] <- c(sum(Usa.data$marketvalue), sum(Rest.data$marketvalue))

x["Count", c("USA %","Rest %")] <- c( ((x["Count","USA"] / (x["Count","USA"] + x["Count","Rest"] )) *100 ), ((x["Count","Rest"] / (x["Count","USA"] + x["Count","Rest"] )) *100 )  )
x["Asset", c("USA %","Rest %")] <- c( ((x["Asset","USA"] / (x["Asset","USA"] + x["Asset","Rest"] )) *100 ), ((x["Asset","Rest"] / (x["Asset","USA"] + x["Asset","Rest"] )) *100 )  )
x["Market Value", c("USA %","Rest %")] <-  c( ((x["Market Value","USA"] / (x["Market Value","USA"] + x["Market Value","Rest"] )) *100 ), ((x["Market Value","Rest"] / (x["Market Value","USA"] + x["Market Value","Rest"] )) *100 )  )

analysis.data <- x
analysis.data
##                   USA     Rest    USA %   Rest %
## Count          751.00  1249.00 37.55000 62.45000
## Asset        22781.89 45301.81 33.46159 66.53841
## Market Value 11575.58 12179.73 48.72839 51.27161

-> Conclusion: The USA is home to about 38% of the Forbes 2000 companies (751 of 2000). These companies hold about 33% of the total assets but almost 49% of the total market value, so US companies account for a larger share of market value than of assets.
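The percentage columns can also be cross-checked with `prop.table()`, which divides each entry by its row total when `margin = 1`. This is a minimal sketch that starts from the totals printed in the `analysis.data` output above; the object names `totals`, `shares` and `result` are illustrative only:

```r
# Totals copied from the analysis.data output above
totals <- matrix(
  c(  751.00,  1249.00,
    22781.89, 45301.81,
    11575.58, 12179.73),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Count", "Asset", "Market Value"), c("USA", "Rest"))
)

# margin = 1 divides every entry by the sum of its row
shares <- round(100 * prop.table(totals, margin = 1), 2)
colnames(shares) <- c("USA %", "Rest %")

# Combine totals and percentage shares into one table
result <- cbind(totals, shares)
result
```

Each row of `shares` sums to 100, which gives a quick sanity check on the hand-written percentage formulas.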

5. BONUS – place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career:

# Access the CSV file via GitHub:
csvFile.URL <- "https://raw.githubusercontent.com/kamathvk1982/CunyBridgeR/master/Forbes2000.csv"
git.Forbes2000.data <- read.csv(csvFile.URL, stringsAsFactors = TRUE)
summary(git.Forbes2000.data)
##        X               rank                              name     
##  Min.   :   1.0   Min.   :   1.0   Aareal Bank             :   1  
##  1st Qu.: 500.8   1st Qu.: 500.8   ABB Group               :   1  
##  Median :1000.5   Median :1000.5   Abbey National          :   1  
##  Mean   :1000.5   Mean   :1000.5   Abbott Laboratories     :   1  
##  3rd Qu.:1500.2   3rd Qu.:1500.2   Abercrombie & Fitch     :   1  
##  Max.   :2000.0   Max.   :2000.0   Abertis Infraestructuras:   1  
##                                    (Other)                 :1994  
##            country                      category        sales        
##  United States :751   Banking               : 313   Min.   :  0.010  
##  Japan         :316   Diversified financials: 158   1st Qu.:  2.018  
##  United Kingdom:137   Insurance             : 112   Median :  4.365  
##  Germany       : 65   Utilities             : 110   Mean   :  9.697  
##  France        : 63   Materials             :  97   3rd Qu.:  9.547  
##  Canada        : 56   Oil & gas operations  :  90   Max.   :256.330  
##  (Other)       :612   (Other)               :1120                    
##     profits             assets          marketvalue    
##  Min.   :-25.8300   Min.   :   0.270   Min.   :  0.02  
##  1st Qu.:  0.0800   1st Qu.:   4.025   1st Qu.:  2.72  
##  Median :  0.2000   Median :   9.345   Median :  5.15  
##  Mean   :  0.3811   Mean   :  34.042   Mean   : 11.88  
##  3rd Qu.:  0.4400   3rd Qu.:  22.793   3rd Qu.: 10.60  
##  Max.   : 20.9600   Max.   :1264.030   Max.   :328.54  
##  NA's   :5
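Reading from a URL can fail when the network or the repository is unavailable. A hedged sketch of a small wrapper that returns `NULL` instead of stopping the script is shown below. The helper `read_forbes()` is hypothetical and not part of the original script, and the demonstration uses a temporary local file with made-up rows so that it runs offline:

```r
# Hypothetical helper (not part of the original script): read a CSV from a
# local path or a URL and return NULL with a message if the read fails
read_forbes <- function(path_or_url) {
  tryCatch(
    read.csv(path_or_url, stringsAsFactors = TRUE),
    error = function(e) {
      message("Could not read ", path_or_url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Demonstrate on a small temporary file (rows are made up)
tmp <- tempfile(fileext = ".csv")
writeLines(c("rank,name,country",
             "1,Company A,United States",
             "2,Company B,Japan"), tmp)
demo <- read_forbes(tmp)
str(demo)

# A path that does not exist yields NULL rather than an error
absent <- read_forbes(file.path(tempdir(), "does-not-exist.csv"))
```

Passing `csvFile.URL` to the same helper should behave like the direct call above when the link is reachable.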