library(tidyverse) # provides %>%, rename(), filter(), fct_recode(), read_csv(), and ggplot()
fish <- read.delim("http://jse.amstat.org/datasets/fishcatch.dat.txt", header = FALSE, sep = "", row.names = NULL)
fish <- fish %>% rename(obs = V1, species = V2, weight = V3, len1 = V4, len2 = V5, len3 = V6, height.pct = V7, width.pct = V8, sex = V9)
fish$sex <- as.factor(fish$sex)
fish$sex <- fct_recode(fish$sex, female = "0", male = "1")
fish$species <- as.factor(fish$species)
fish$species <- fct_recode(fish$species, "Common Bream" = "1", "Whitefish" = "2", "Roach" = "3", "Silver Bream" = "4", "Smelt" = "5", "Pike" = "6", "Perch" = "7")
# Do not modify the following code:
fish.sub <- filter(fish, !is.na(sex)) # only display rows where sex is not NA
knitr::kable(head(fish.sub), format = "markdown")
| obs | species | weight | len1 | len2 | len3 | height.pct | width.pct | sex |
|---|---|---|---|---|---|---|---|---|
| 14 | Common Bream | NA | 29.5 | 32 | 37.3 | 37.3 | 13.6 | male |
| 15 | Common Bream | 600 | 29.4 | 32 | 37.2 | 40.2 | 13.9 | male |
| 17 | Common Bream | 700 | 30.4 | 33 | 38.3 | 38.8 | 13.8 | male |
| 21 | Common Bream | 575 | 31.3 | 34 | 39.5 | 38.3 | 14.1 | male |
| 26 | Common Bream | 725 | 31.8 | 35 | 40.9 | 40.0 | 14.8 | male |
| 30 | Common Bream | 1000 | 33.5 | 37 | 42.6 | 44.5 | 15.5 | female |
mean.wt <- fish %>% group_by(species) %>% summarize(weight = mean(weight, na.rm = TRUE)) %>% arrange(weight)
# Do not modify the following code:
knitr::kable(mean.wt, format = "markdown")
| species | weight |
|---|---|
| Smelt | 11.17857 |
| Roach | 152.05000 |
| Silver Bream | 154.81818 |
| Perch | 382.23929 |
| Whitefish | 531.00000 |
| Common Bream | 626.00000 |
| Pike | 718.70588 |
The species with the smallest mean weight is Smelt, with a mean weight of about 11.18.
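Rather than reading the answer off the printed table, the lightest species can also be extracted programmatically. The sketch below is only illustrative: it rebuilds the `mean.wt` table by hand from the values printed above so that it runs on its own, and the name `lightest` is a hypothetical helper, not part of the original analysis.

```r
# Recreate the summary table from the printed output above
# (values copied from the table; the real analysis computes them from `fish`).
mean.wt <- data.frame(
  species = c("Smelt", "Roach", "Silver Bream", "Perch",
              "Whitefish", "Common Bream", "Pike"),
  weight  = c(11.17857, 152.05000, 154.81818, 382.23929,
              531.00000, 626.00000, 718.70588)
)

# which.min() returns the row index of the smallest mean weight
lightest <- mean.wt[which.min(mean.wt$weight), ]
print(lightest)
```

Because `mean.wt` is already sorted with `arrange(weight)` in the analysis, `head(mean.wt, 1)` would return the same row there; `which.min()` simply does not depend on the sort order.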
ggplot(mean.wt, aes(species, weight)) + geom_col() + xlab("Species") + ylab("Weight") + ggtitle("Mean Weight Per Species") + theme(plot.title = element_text(hjust = 0.5))
Forbes <- read_csv("2019_Forbes_Global_2000.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Rank = col_double(),
## Company = col_character(),
## Sector = col_character(),
## Industry = col_character(),
## Continent = col_character(),
## Country = col_character(),
## Revenue = col_double(),
## Profits = col_double(),
## Assets = col_double(),
## Market_Value = col_double()
## )
Forbes <- Forbes %>% filter(!is.na(Sector), Revenue >= 0)
cols <- c("Sector","Industry","Continent","Country")
Forbes[cols] <- lapply(Forbes[cols], factor)
ggplot(subset(Forbes, Continent %in% c("Asia","Europe","North America")), aes(x=Revenue,y=Market_Value)) + geom_point() + geom_smooth(method = "lm") +
facet_grid(. ~ Continent)
## `geom_smooth()` using formula 'y ~ x'
The first thing that stands out is the set of outliers in the North America panel. There seem to be six North American companies with a higher market value than any company in Asia or Europe. Their revenues are not noticeably higher than those of their counterparts, so I would be interested in digging deeper into other factors that might have contributed to their much higher market valuations. Even with the outliers included, there appears to be a positive correlation between a company’s revenue and its market value.
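The visual impression of a positive relationship could be backed up with a correlation coefficient per continent. The following is a minimal sketch, not part of the original analysis: the data frame `toy` contains made-up numbers chosen only to show the mechanics, and `r_by_continent` is a hypothetical name. With the real data, `toy` would be replaced by the filtered `Forbes` data frame.

```r
# Made-up illustration data (NOT taken from the Forbes file)
toy <- data.frame(
  Continent    = rep(c("Asia", "Europe"), each = 4),
  Revenue      = c(10, 20, 30, 40, 10, 20, 30, 40),
  Market_Value = c(12, 25, 31, 48, 40, 15, 35, 20)
)

# Split the rows by continent and compute the Pearson correlation
# between Revenue and Market_Value within each group
r_by_continent <- sapply(
  split(toy, toy$Continent),
  function(d) cor(d$Revenue, d$Market_Value)
)
print(r_by_continent)
```

Note that the Pearson correlation is itself sensitive to outliers, so `cor(..., method = "spearman")` may be a useful cross-check when a few companies sit far from the rest.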
Forbes <- Forbes %>% mutate(ProfMgn = Profits / Revenue) %>% filter(ProfMgn <= 5)
# Do not modify the following code:
knitr::kable(head(Forbes), format = "markdown")
| Rank | Company | Sector | Industry | Continent | Country | Revenue | Profits | Assets | Market_Value | ProfMgn |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ICBC | Financials | Major Banks | Asia | China | 175.874 | 45.223 | 4034.482 | 305.057 | 0.2571329 |
| 2 | JPMorgan Chase | Financials | Major Banks | North America | United States | 132.912 | 32.738 | 2737.188 | 368.502 | 0.2463134 |
| 3 | China Construction Bank | Financials | Major Banks | Asia | China | 150.313 | 38.841 | 3382.422 | 224.988 | 0.2584008 |
| 4 | Agricultural Bank of China | Financials | Regional Banks | Asia | China | 137.456 | 30.894 | 3293.105 | 197.045 | 0.2247556 |
| 5 | Bank of America | Financials | Major Banks | North America | United States | 111.904 | 28.540 | 2377.164 | 287.339 | 0.2550400 |
| 6 | Apple | Information Technology | Computer Hardware | North America | United States | 261.705 | 59.431 | 373.719 | 961.257 | 0.2270916 |
ggplot(Forbes, aes(x=Sector, y=ProfMgn)) +
geom_boxplot() + coord_flip()
The sector that appears to have the greatest standard deviation is Financials.
Forbes.SD <- Forbes %>% group_by(Sector) %>% summarize(sd = sd(ProfMgn)) %>% arrange(desc(sd))
# Do not modify the following code:
knitr::kable(Forbes.SD, format = "markdown")
| Sector | sd |
|---|---|
| Utilities | 0.4912375 |
| Telecommunication Services | 0.4698143 |
| Financials | 0.3285288 |
| Information Technology | 0.2303487 |
| Industrials | 0.1611717 |
| Consumer Discretionary | 0.1605991 |
| Materials | 0.1566098 |
| Health Care | 0.1488790 |
| Consumer Staples | 0.1282274 |
| Energy | 0.1089204 |
The sector with the greatest standard deviation is Utilities. This differs from my initial guess: judging from the boxplots alone, Financials seemed to have the greatest standard deviation. Financials had the most outliers in the boxplots, but outliers on opposite sides do not cancel in a standard deviation; more likely, their effect is diluted if the sector contains many companies clustered near the median. Utilities seems to have a much smaller spread, so its one extreme outlier probably plays a large role in the sector’s large standard deviation for profit margin.
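The gap between the boxplot impression and the computed values comes from the fact that a boxplot draws the interquartile range, which is robust to outliers, while the standard deviation is not. The sketch below illustrates this with made-up numbers (they are not taken from the Forbes data, and the names `small_group` and `large_group` are hypothetical): a small, tight group with one extreme value ends up with a larger standard deviation than a large, wider group with several outliers.

```r
# Made-up profit margins for illustration only
# Small, tightly clustered group with a single extreme outlier
small_group <- c(0.08, 0.09, 0.10, 0.11, 0.12, 2.50)

# Large group with a wider box and four outliers, two on each side
large_group <- c(rep(c(0.05, 0.10, 0.15, 0.20, 0.25), times = 20),
                 -0.9, -0.8, 1.1, 1.2)

# Compare a non-robust spread measure (sd) with a robust one (IQR)
spread <- data.frame(
  group = c("small_group", "large_group"),
  n     = c(length(small_group), length(large_group)),
  sd    = c(sd(small_group), sd(large_group)),
  iqr   = c(IQR(small_group), IQR(large_group))
)
print(spread)
```

In this toy example the small group has the narrower box but the larger standard deviation, because its single outlier is averaged over only six observations. Adding `n = n()` and `iqr = IQR(ProfMgn)` to the `summarize()` call would show whether the same thing is happening for Utilities.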