Part 1: Analysis of Fish Data (40 total points)

 

1.1 Import the data

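# Packages used below (assumed to be loaded in a setup chunk that is not shown)
library(dplyr)    # filter(), %>%, group_by(), summarize(), mutate()
library(ggplot2)  # ggplot()
library(readr)    # read_csv()

fish <- read.table("http://www.amstat.org/publications/jse/datasets/fishcatch.dat.txt",
                   header = FALSE,
                   col.names = c("Obs", "Species", "Weight",
                                 "Len1", "Len2", "Len3",
                                 "Height.Pct", "Width.Pct", "Sex"))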

 

1.2 Change ‘sex’ to be a factor

fish$Sex <- factor(fish$Sex, levels = c(0,1), labels = c("Male", "Female"))

 

1.3 Change ‘species’ to be a factor and add labels

fish$Species <- factor(fish$Species, levels = c(1,2,3,4,5,6,7), 
                       labels = c("Common Bream", "Whitefish",
                                  "Roach", "Silver Bream",
                                  "Smelt", "Pike","Perch"))
# Do not modify the following code:
fish.sub <- filter(fish, Sex != "NA")
knitr::kable(head(fish.sub), format = "markdown")
| Obs|Species      | Weight| Len1| Len2| Len3| Height.Pct| Width.Pct|Sex    |
|---:|:------------|------:|----:|----:|----:|----------:|---------:|:------|
|  14|Common Bream |     NA| 29.5|   32| 37.3|       37.3|      13.6|Female |
|  15|Common Bream |    600| 29.4|   32| 37.2|       40.2|      13.9|Female |
|  17|Common Bream |    700| 30.4|   33| 38.3|       38.8|      13.8|Female |
|  21|Common Bream |    575| 31.3|   34| 39.5|       38.3|      14.1|Female |
|  26|Common Bream |    725| 31.8|   35| 40.9|       40.0|      14.8|Female |
|  30|Common Bream |   1000| 33.5|   37| 42.6|       44.5|      15.5|Male   |

 

1.4 Determine mean weight for each species

mean.wt <- fish %>%
  select(Weight, Species) %>%
  group_by(Species) %>%
  dplyr::summarize(count = n(),
                   mean_weight = mean(Weight, na.rm = TRUE)) %>%
  arrange(desc(mean_weight))
# Do not modify the following code:
knitr::kable(mean.wt, format = "markdown")
|Species      | count| mean_weight|
|:------------|-----:|-----------:|
|Pike         |    17|   718.70588|
|Common Bream |    35|   626.00000|
|Whitefish    |     6|   531.00000|
|Perch        |    56|   382.23929|
|Silver Bream |    11|   154.81818|
|Roach        |    20|   152.05000|
|Smelt        |    14|    11.17857|

The species with the smallest mean weight is Smelt, with a mean weight of about 11.18 g.
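
As a quick check on that reading of the table, the smallest mean weight can also be pulled out programmatically. The sketch below rebuilds the summary by hand, with values copied from the table output, so that it runs on its own; in the actual analysis `mean.wt` already exists and the first step is unnecessary. The name `smallest` is introduced here purely for illustration.

```r
# Summary values copied from the table output so the example is self-contained
mean.wt <- data.frame(
  Species = c("Pike", "Common Bream", "Whitefish", "Perch",
              "Silver Bream", "Roach", "Smelt"),
  mean_weight = c(718.70588, 626.00000, 531.00000, 382.23929,
                  154.81818, 152.05000, 11.17857)
)

# Row with the smallest mean weight
smallest <- mean.wt[which.min(mean.wt$mean_weight), ]

as.character(smallest$Species)   # species name of that row
round(smallest$mean_weight, 2)   # mean weight rounded to two decimals
```

Using `which.min()` avoids relying on the sort order produced by `arrange()`.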

 

1.5 Plot the mean weights for each species

ggplot(mean.wt, aes(x = Species, y = mean_weight)) +
  geom_bar(stat = "identity", color = "black", fill = "light gray") +
  labs(x = "Species", y = "Weight", title = "Mean weight per Species")

 

Part 2: Analysis of Forbes Global 2000 Data (60 total points)

 

2.1 Import the dataset

Forbes <- read_csv("2014 Forbes Global 2000.csv")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Company = col_character(),
##   Sector = col_character(),
##   Industry = col_character(),
##   Continent = col_character(),
##   Country = col_character(),
##   Sales = col_double(),
##   Profits = col_double(),
##   Assets = col_double(),
##   Market_Value = col_double()
## )

 

2.2 Exclude specified records

Forbes <- filter(Forbes, !is.na(Sector) & Sales != 0)

 

2.3 Convert four variables to factors

Forbes$Sector <- factor(Forbes$Sector)
Forbes$Industry <- factor(Forbes$Industry)
Forbes$Continent <- factor(Forbes$Continent)
Forbes$Country <- factor(Forbes$Country)

 

2.4 Create scatterplot of Market Value by Sales

ggplot(subset(Forbes, Continent %in% c("Asia", "Europe", "North America")), aes(y=Market_Value, x=Sales)) +
  geom_point() +
  facet_grid(~ Continent)+
  geom_smooth(lwd=0.75, color="deepskyblue3") 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

[The graphs for all three continents show the data concentrated close to the origin. For Asia, market value initially increases as sales increase. The same trend is seen for North America, although there market value rises at a higher rate (steeper slope) than in Asia. A small number of outliers have high sales but low market value, and these cause the smoothed curve to decline once sales exceed about 300 billion US dollars (the dataset reports sales in billions). Europe shows the same initial increase in market value with sales. Once sales for European firms pass about 100 billion, market value continues to rise, but at a slower rate. However, Europe does not show a declining trend in market value after a certain sales threshold is crossed.]

 

2.5 Create Profit Margin variable

Forbes <- mutate(Forbes, ProfMgn = Profits / Sales)
# Do not modify the following code:
knitr::kable(head(Forbes), format = "markdown")
| Rank|Company                    |Sector     |Industry             |Continent     |Country       | Sales| Profits| Assets| Market_Value|   ProfMgn|
|----:|:--------------------------|:----------|:--------------------|:-------------|:-------------|-----:|-------:|------:|------------:|---------:|
|    1|ICBC                       |Financials |Major Banks          |Asia          |China         | 148.7|    42.7| 3124.9|        215.6| 0.2871553|
|    2|China Construction Bank    |Financials |Regional Banks       |Asia          |China         | 121.3|    34.2| 2449.5|        174.4| 0.2819456|
|    3|Agricultural Bank of China |Financials |Regional Banks       |Asia          |China         | 136.4|    27.0| 2405.4|        141.1| 0.1979472|
|    4|JPMorgan Chase             |Financials |Major Banks          |North America |United States | 105.7|    17.3| 2435.3|        229.7| 0.1636708|
|    5|Berkshire Hathaway         |Financials |Investment Services  |North America |United States | 178.8|    19.5|  493.4|        309.1| 0.1090604|
|    6|Exxon Mobil                |Energy     |Oil & Gas Operations |North America |United States | 394.0|    32.6|  346.8|        422.3| 0.0827411|

 

2.6 Create boxplot of Profit Margin by Sector

ggplot(Forbes, aes(x = Sector, y = ProfMgn)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot(outlier.size = 0.75, aes(fill = Sector)) +
  coord_flip()

The sectors that appear to have the greatest standard deviation are Materials, Information Technology, Financials, and Consumer Discretionary.

 

2.7 Calculate the SD for each sector

Forbes.SD <-  Forbes %>%
  select(Sector, ProfMgn) %>%
  group_by(Sector) %>%
  dplyr::summarize(Sector_STD = paste(round(100*sd(ProfMgn),1),"%"))
# Do not modify the following code:
knitr::kable(Forbes.SD, format = "markdown")
|Sector                     |Sector_STD |
|:--------------------------|:----------|
|Consumer Discretionary     |62.9 %     |
|Consumer Staples           |10 %       |
|Energy                     |10.6 %     |
|Financials                 |40.5 %     |
|Health Care                |10.7 %     |
|Industrials                |9.9 %      |
|Information Technology     |23.5 %     |
|Materials                  |21.5 %     |
|Telecommunication Services |10.2 %     |
|Utilities                  |16.2 %     |

The sector that has the greatest standard deviation is Consumer Discretionary. [The results confirm the initial analysis from the box plots for each sector. The box plots for Materials, Information Technology, Financials, and Consumer Discretionary showed a wide spread of data below the 25th percentile and above the 75th percentile, including many points beyond the box plot whiskers. Since the standard deviation is the square root of the average squared difference between each observation and the mean (with n - 1 in the denominator for the sample version that sd() computes), a large number of outliers causes the standard deviation to increase. In this case, the calculations support what the box plots suggested.]
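
The effect of outliers on the standard deviation can be illustrated with a small sketch. The profit margins below are made-up values chosen for illustration, not taken from the Forbes data, and the names `margins`, `manual_sd`, and `with_outlier` are hypothetical. The example computes the sample standard deviation by hand, compares it with `sd()`, and then adds one extreme value.

```r
# Hypothetical profit margins for one sector (illustrative values only)
margins <- c(0.08, 0.10, 0.11, 0.09, 0.12)

# Sample standard deviation by hand: squared deviations from the mean,
# summed, divided by n - 1, then square-rooted
manual_sd <- sqrt(sum((margins - mean(margins))^2) / (length(margins) - 1))

# sd() uses the same n - 1 formula, so the two values agree
all.equal(manual_sd, sd(margins))

# A single extreme observation inflates the standard deviation
with_outlier <- c(margins, -2.5)
sd(margins)
sd(with_outlier)
```

Because each deviation is squared before averaging, one observation far from the mean contributes disproportionately, which is consistent with the sectors that have many points beyond the whiskers also having the largest standard deviations.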