Part 1: Analysis of Fish Data (35 total points)

 

1.1 Import the data

library(tidyverse)  # provides %>%, rename(), fct_recode(), read_csv(), and ggplot()
fish <- read.delim("http://jse.amstat.org/datasets/fishcatch.dat.txt", header = FALSE, sep = "", row.names = NULL)
fish <- fish %>% rename(obs = V1, species = V2, weight = V3, len1 = V4, len2 = V5, len3 = V6, height.pct = V7, width.pct = V8, sex = V9)

 

1.2 Change ‘sex’ to be a factor and rename

fish$sex <- as.factor(fish$sex)
fish$sex <- fct_recode(fish$sex, female = "0", male = "1")
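
For comparison, the same relabeling can be sketched in base R with `factor()`, which sets the levels and their labels in one step. The vector `sex_codes` below is a hypothetical stand-in for `fish$sex`, not the actual column.

```r
# Hypothetical stand-in for fish$sex: 0/1 codes with some missing values
sex_codes <- c(NA, 1, 1, 0, NA, 0)

# levels = the raw codes, labels = the names to display; NA stays NA
sex <- factor(sex_codes, levels = c(0, 1), labels = c("female", "male"))

levels(sex)                   # "female" "male"
table(sex, useNA = "ifany")   # counts per label, including missing values
```

The `fct_recode()` approach used above has the advantage of naming each old/new pair explicitly, which may be easier to audit when there are many levels.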

 

1.3 Change ‘species’ to be a factor and rename

fish$species <- as.factor(fish$species)
fish$species <- fct_recode(fish$species, "Common Bream" = "1", "Whitefish" = "2", "Roach" = "3", "Silver Bream" = "4", "Smelt" = "5", "Pike" = "6", "Perch" = "7")
# Do not modify the following code:
fish.sub <- filter(fish, !is.na(sex)) # only display rows where sex is not NA
knitr::kable(head(fish.sub), format = "markdown")
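
| obs | species | weight | len1 | len2 | len3 | height.pct | width.pct | sex |
|---|---|---|---|---|---|---|---|---|
| 14 | Common Bream | NA | 29.5 | 32 | 37.3 | 37.3 | 13.6 | male |
| 15 | Common Bream | 600 | 29.4 | 32 | 37.2 | 40.2 | 13.9 | male |
| 17 | Common Bream | 700 | 30.4 | 33 | 38.3 | 38.8 | 13.8 | male |
| 21 | Common Bream | 575 | 31.3 | 34 | 39.5 | 38.3 | 14.1 | male |
| 26 | Common Bream | 725 | 31.8 | 35 | 40.9 | 40.0 | 14.8 | male |
| 30 | Common Bream | 1000 | 33.5 | 37 | 42.6 | 44.5 | 15.5 | female |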

 

1.4 Determine mean weight for each species

mean.wt <- fish %>% group_by(species) %>% summarize(weight = mean(weight, na.rm = TRUE)) %>% arrange(weight)
# Do not modify the following code:
knitr::kable(mean.wt, format = "markdown")

| species | weight |
|---|---|
| Smelt | 11.17857 |
| Roach | 152.05000 |
| Silver Bream | 154.81818 |
| Perch | 382.23929 |
| Whitefish | 531.00000 |
| Common Bream | 626.00000 |
| Pike | 718.70588 |

The species with the smallest mean weight is Smelt, with a mean weight of approximately 11.18 (11.17857).
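
The smallest mean can also be pulled out programmatically rather than read off the table. The following is a minimal base R sketch; the data frame is rebuilt here from the values printed in 1.4 so that the snippet runs on its own, and the name `lightest` is introduced only for illustration.

```r
# Mean weights copied from the table in 1.4
mean.wt <- data.frame(
  species = c("Smelt", "Roach", "Silver Bream", "Perch",
              "Whitefish", "Common Bream", "Pike"),
  weight  = c(11.17857, 152.05000, 154.81818, 382.23929,
              531.00000, 626.00000, 718.70588)
)

# which.min() returns the row index of the smallest weight
lightest <- mean.wt[which.min(mean.wt$weight), ]
lightest$species            # "Smelt"
round(lightest$weight, 2)   # 11.18
```

Because `which.min()` does not depend on row order, this works whether or not `mean.wt` has been sorted with `arrange()`.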

 

1.5 Plot the mean weights for each species

ggplot(mean.wt, aes(species, weight)) + geom_col() + xlab("Species") + ylab("Weight") + ggtitle("Mean Weight Per Species") + theme(plot.title = element_text(hjust = 0.5))
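
Because `species` is a factor, `geom_col()` places the bars in factor-level order, not in the sorted row order of `mean.wt`. If bars in ascending order of weight are wanted, one option is to wrap the x aesthetic in `reorder()`. The sketch below rebuilds `mean.wt` from the printed table so it is self-contained; `ordered_levels` is a name introduced here only to show the resulting order.

```r
library(ggplot2)

# Mean weights copied from the table in 1.4; levels follow the recoding order in 1.3
mean.wt <- data.frame(
  species = factor(
    c("Smelt", "Roach", "Silver Bream", "Perch", "Whitefish", "Common Bream", "Pike"),
    levels = c("Common Bream", "Whitefish", "Roach", "Silver Bream",
               "Smelt", "Pike", "Perch")
  ),
  weight = c(11.17857, 152.05000, 154.81818, 382.23929,
             531.00000, 626.00000, 718.70588)
)

# reorder() re-levels the factor by weight, lightest species first
ordered_levels <- levels(reorder(mean.wt$species, mean.wt$weight))
ordered_levels

# Same plot as above, with bars drawn in ascending order of mean weight
ggplot(mean.wt, aes(reorder(species, weight), weight)) +
  geom_col() +
  xlab("Species") + ylab("Weight") +
  ggtitle("Mean Weight Per Species") +
  theme(plot.title = element_text(hjust = 0.5))
```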

 

Part 2: Analysis of Forbes Global 2000 Data (50 total points)

 

2.1 Import the dataset

Forbes <- read_csv("2019_Forbes_Global_2000.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Rank = col_double(),
##   Company = col_character(),
##   Sector = col_character(),
##   Industry = col_character(),
##   Continent = col_character(),
##   Country = col_character(),
##   Revenue = col_double(),
##   Profits = col_double(),
##   Assets = col_double(),
##   Market_Value = col_double()
## )

 

2.2 Exclude specified records

Forbes <- Forbes %>% filter(!is.na(Sector), Revenue >= 0)

 

2.3 Convert four variables to factors

cols <- c("Sector","Industry","Continent","Country")
Forbes[cols] <- lapply(Forbes[cols], factor)

 

2.4 Create a scatterplot of Market Value by Sales

ggplot(subset(Forbes, Continent %in% c("Asia","Europe","North America")), aes(x=Revenue,y=Market_Value)) + geom_point() + geom_smooth(method = "lm") +
  facet_grid(. ~ Continent)
## `geom_smooth()` using formula 'y ~ x'

The first thing that stands out is the set of outliers in North America: six companies appear to have a higher market value than any company in Asia or Europe. Their revenues are not noticeably higher than those of their counterparts, so I would be interested in digging into other factors that might explain their much higher market valuations. Even with the outliers included, there appears to be at least a modest positive correlation between a company’s revenue and its market value.
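
The visual impression of a positive relationship could be checked numerically with a per-continent correlation. The sketch below only illustrates the mechanics: it uses the six rows printed by `head(Forbes)` in 2.5, which are far too few for a meaningful estimate and include no European company. On the full data, `top6` (a name introduced here) would be replaced by the filtered `Forbes` data frame.

```r
# Six rows copied from the head(Forbes) table in 2.5; illustration only
top6 <- data.frame(
  Continent    = c("Asia", "North America", "Asia", "Asia",
                   "North America", "North America"),
  Revenue      = c(175.874, 132.912, 150.313, 137.456, 111.904, 261.705),
  Market_Value = c(305.057, 368.502, 224.988, 197.045, 287.339, 961.257)
)

# Pearson correlation between Revenue and Market_Value within each continent
cors <- sapply(split(top6, top6$Continent),
               function(d) cor(d$Revenue, d$Market_Value))
cors
```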

 

2.5 Create Profit Margin variable

Forbes <- Forbes %>% mutate(ProfMgn = Profits / Revenue) %>% filter(ProfMgn <= 5)
# Do not modify the following code:
knitr::kable(head(Forbes), format = "markdown")

| Rank | Company | Sector | Industry | Continent | Country | Revenue | Profits | Assets | Market_Value | ProfMgn |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ICBC | Financials | Major Banks | Asia | China | 175.874 | 45.223 | 4034.482 | 305.057 | 0.2571329 |
| 2 | JPMorgan Chase | Financials | Major Banks | North America | United States | 132.912 | 32.738 | 2737.188 | 368.502 | 0.2463134 |
| 3 | China Construction Bank | Financials | Major Banks | Asia | China | 150.313 | 38.841 | 3382.422 | 224.988 | 0.2584008 |
| 4 | Agricultural Bank of China | Financials | Regional Banks | Asia | China | 137.456 | 30.894 | 3293.105 | 197.045 | 0.2247556 |
| 5 | Bank of America | Financials | Major Banks | North America | United States | 111.904 | 28.540 | 2377.164 | 287.339 | 0.2550400 |
| 6 | Apple | Information Technology | Computer Hardware | North America | United States | 261.705 | 59.431 | 373.719 | 961.257 | 0.2270916 |
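
As a quick arithmetic check, the `ProfMgn` values can be reproduced from the `Profits` and `Revenue` columns of the six rows shown. The sketch also includes one hypothetical edge case that is not taken from the data: because 2.2 keeps rows with `Revenue >= 0`, a zero-revenue company with a loss would get a margin of `-Inf`, which passes the `ProfMgn <= 5` filter.

```r
# Profits, Revenue, and ProfMgn copied from the six rows shown in 2.5
profits <- c(45.223, 32.738, 38.841, 30.894, 28.540, 59.431)
revenue <- c(175.874, 132.912, 150.313, 137.456, 111.904, 261.705)
shown   <- c(0.2571329, 0.2463134, 0.2584008, 0.2247556, 0.2550400, 0.2270916)

# Recompute the margin and compare against the displayed (rounded) values
prof_mgn <- profits / revenue
all(abs(prof_mgn - shown) < 1e-6)

# Hypothetical edge case (not from the data): a loss with zero revenue
edge <- -1 / 0   # -Inf
edge <= 5        # TRUE, so filter(ProfMgn <= 5) would keep such a row
```

If such rows existed, filtering on `Revenue > 0` or on `is.finite(ProfMgn)` would be one way to exclude them; whether that matches the assignment's intent is an open question.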

 

2.6 Create boxplot of Profit Margin by Sector

ggplot(Forbes, aes(x=Sector, y=ProfMgn)) + 
  geom_boxplot() + coord_flip()

The sector that appears to have the greatest standard deviation is Financials.

 

2.7 Calculate the SD for each sector

Forbes.SD <- Forbes %>% group_by(Sector) %>% summarize(sd = sd(ProfMgn)) %>% arrange(desc(sd))
# Do not modify the following code:
knitr::kable(Forbes.SD, format = "markdown")

| Sector | sd |
|---|---|
| Utilities | 0.4912375 |
| Telecommunication Services | 0.4698143 |
| Financials | 0.3285288 |
| Information Technology | 0.2303487 |
| Industrials | 0.1611717 |
| Consumer Discretionary | 0.1605991 |
| Materials | 0.1566098 |
| Health Care | 0.1488790 |
| Consumer Staples | 0.1282274 |
| Energy | 0.1089204 |

The sector with the greatest standard deviation is Utilities. This differs from my initial reading of the boxplots, where Financials seemed to have the greatest standard deviation. Financials has the most outliers, but standard deviation depends on how far points lie from the mean, not on how many points are flagged as outliers, so many moderate outliers can contribute less than a single extreme one. Utilities has a much smaller spread in its box, so its one extreme outlier probably accounts for much of the sector’s large standard deviation in profit margin.
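
The effect of a single extreme value on the standard deviation can be illustrated with two made-up vectors. Both are hypothetical and chosen only to mimic the pattern described above (several moderate outliers versus one extreme outlier); they are not drawn from the Forbes data.

```r
# Hypothetical profit margins: 20 typical values plus five moderate outliers
many_moderate <- c(rep(0.1, 20), 0.5, 0.6, -0.3, 0.55, -0.35)

# Hypothetical profit margins: 24 typical values plus one extreme outlier
one_extreme <- c(rep(0.1, 24), -4)

# Deviations are squared, so the single extreme value dominates
sd(many_moderate)
sd(one_extreme)
sd(one_extreme) > sd(many_moderate)   # TRUE

# The IQR ignores the tails, so it is the same for both vectors
IQR(many_moderate)
IQR(one_extreme)
```

A boxplot's box reflects the interquartile range, which is why a sector can look tight in 2.6 and still rank first in 2.7.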