fish <- read.table("http://www.amstat.org/publications/jse/datasets/fishcatch.dat.txt",
                   header = FALSE,
                   col.names = c("Obs", "Species", "Weight",
                                 "Len1", "Len2", "Len3",
                                 "Height.Pct", "Width.Pct", "Sex"))
fish$Sex <- factor(fish$Sex, levels = c(0,1), labels = c("Male", "Female"))
fish$Species <- factor(fish$Species, levels = c(1, 2, 3, 4, 5, 6, 7),
                       labels = c("Common Bream", "Whitefish",
                                  "Roach", "Silver Bream",
                                  "Smelt", "Pike", "Perch"))
# Do not modify the following code:
fish.sub <- filter(fish, Sex != "NA")
knitr::kable(head(fish.sub), format = "markdown")
| Obs | Species | Weight | Len1 | Len2 | Len3 | Height.Pct | Width.Pct | Sex |
|---|---|---|---|---|---|---|---|---|
| 14 | Common Bream | NA | 29.5 | 32 | 37.3 | 37.3 | 13.6 | Female |
| 15 | Common Bream | 600 | 29.4 | 32 | 37.2 | 40.2 | 13.9 | Female |
| 17 | Common Bream | 700 | 30.4 | 33 | 38.3 | 38.8 | 13.8 | Female |
| 21 | Common Bream | 575 | 31.3 | 34 | 39.5 | 38.3 | 14.1 | Female |
| 26 | Common Bream | 725 | 31.8 | 35 | 40.9 | 40.0 | 14.8 | Female |
| 30 | Common Bream | 1000 | 33.5 | 37 | 42.6 | 44.5 | 15.5 | Male |
mean.wt <- fish %>%
  select(Weight, Species) %>%
  group_by(Species) %>%
  dplyr::summarize(count = n(),
                   mean_weight = mean(Weight, na.rm = TRUE)) %>%
  arrange(desc(mean_weight))
# Do not modify the following code:
knitr::kable(mean.wt, format = "markdown")
| Species | count | mean_weight |
|---|---|---|
| Pike | 17 | 718.70588 |
| Common Bream | 35 | 626.00000 |
| Whitefish | 6 | 531.00000 |
| Perch | 56 | 382.23929 |
| Silver Bream | 11 | 154.81818 |
| Roach | 20 | 152.05000 |
| Smelt | 14 | 11.17857 |
The species with the smallest mean weight is Smelt, with a mean weight of about 11.18 g.
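As a quick check on that answer, the lightest species can be pulled out of the summary programmatically rather than read off the table. The sketch below is only an illustration: it rebuilds the summary from the values printed above so that it runs on its own, and the names `mean.wt.tbl` and `lightest` are made up for this example.

```r
library(dplyr)

# Species means copied from the summary table printed above
# (mean.wt.tbl is a hypothetical stand-in for mean.wt)
mean.wt.tbl <- tibble(
  Species = c("Pike", "Common Bream", "Whitefish", "Perch",
              "Silver Bream", "Roach", "Smelt"),
  mean_weight = c(718.70588, 626.00000, 531.00000, 382.23929,
                  154.81818, 152.05000, 11.17857)
)

# Keep only the row with the smallest mean weight, rounded to 2 decimals
lightest <- mean.wt.tbl %>%
  filter(mean_weight == min(mean_weight)) %>%
  mutate(mean_weight = round(mean_weight, 2))

lightest
```

With the real data, the same `filter()` step could be applied directly to `mean.wt`.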
ggplot(mean.wt, aes(x = Species, y = mean_weight)) +
  geom_bar(stat = "identity", color = "black", fill = "light gray") +
  labs(x = "Species", y = "Mean weight (g)", title = "Mean weight per species")
Forbes <- read_csv("2014 Forbes Global 2000.csv")
## Parsed with column specification:
## cols(
## Rank = col_double(),
## Company = col_character(),
## Sector = col_character(),
## Industry = col_character(),
## Continent = col_character(),
## Country = col_character(),
## Sales = col_double(),
## Profits = col_double(),
## Assets = col_double(),
## Market_Value = col_double()
## )
Forbes <- filter(Forbes, !is.na(Sector) & Sales != 0)
Forbes$Sector <- factor(Forbes$Sector)
Forbes$Industry <- factor(Forbes$Industry)
Forbes$Continent <- factor(Forbes$Continent)
Forbes$Country <- factor(Forbes$Country)
ggplot(subset(Forbes, Continent %in% c("Asia", "Europe", "North America")),
       aes(x = Sales, y = Market_Value)) +
  geom_point() +
  facet_grid(~ Continent) +
  geom_smooth(lwd = 0.75, color = "deepskyblue3")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
[In all three panels the data are concentrated close to the origin. For Asia, market value initially increases as sales increase. North America shows the same trend, but with a steeper slope: market value rises faster with sales than it does in Asia. A small number of North American outliers combine high sales with comparatively low market value, which pulls the smoothed curve into a declining trend once sales exceed roughly $300 billion (Sales appears to be recorded in billions of US dollars; Exxon Mobil, for example, is listed at 394.0). Europe shows the same initial increase in market value with sales. Once sales for European firms pass about $100 billion, market value continues to rise, but at a slower rate. Unlike North America, Europe does not show a declining trend in market value beyond any sales threshold.]
Forbes <- mutate(Forbes, ProfMgn = Profits / Sales)
# Do not modify the following code:
knitr::kable(head(Forbes), format = "markdown")
| Rank | Company | Sector | Industry | Continent | Country | Sales | Profits | Assets | Market_Value | ProfMgn |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ICBC | Financials | Major Banks | Asia | China | 148.7 | 42.7 | 3124.9 | 215.6 | 0.2871553 |
| 2 | China Construction Bank | Financials | Regional Banks | Asia | China | 121.3 | 34.2 | 2449.5 | 174.4 | 0.2819456 |
| 3 | Agricultural Bank of China | Financials | Regional Banks | Asia | China | 136.4 | 27.0 | 2405.4 | 141.1 | 0.1979472 |
| 4 | JPMorgan Chase | Financials | Major Banks | North America | United States | 105.7 | 17.3 | 2435.3 | 229.7 | 0.1636708 |
| 5 | Berkshire Hathaway | Financials | Investment Services | North America | United States | 178.8 | 19.5 | 493.4 | 309.1 | 0.1090604 |
| 6 | Exxon Mobil | Energy | Oil & Gas Operations | North America | United States | 394.0 | 32.6 | 346.8 | 422.3 | 0.0827411 |
ggplot(Forbes, aes(x = Sector, y = ProfMgn)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot(outlier.size = 0.75, aes(fill = Sector)) +
  coord_flip()
Judging from the box plots, the sectors that appear to have the greatest standard deviation are Materials, Information Technology, Financials, and Consumer Discretionary.
Forbes.SD <- Forbes %>%
  select(Sector, ProfMgn) %>%
  group_by(Sector) %>%
  dplyr::summarize(Sector_STD = paste(round(100 * sd(ProfMgn), 1), "%"))
# Do not modify the following code:
knitr::kable(Forbes.SD, format = "markdown")
| Sector | Sector_STD |
|---|---|
| Consumer Discretionary | 62.9 % |
| Consumer Staples | 10 % |
| Energy | 10.6 % |
| Financials | 40.5 % |
| Health Care | 10.7 % |
| Industrials | 9.9 % |
| Information Technology | 23.5 % |
| Materials | 21.5 % |
| Telecommunication Services | 10.2 % |
| Utilities | 16.2 % |
The sector with the greatest standard deviation is Consumer Discretionary (62.9%), followed by Financials (40.5%), Information Technology (23.5%), and Materials (21.5%). [The results confirm the initial reading of the box plots. The box plots for Materials, Information Technology, Financials, and Consumer Discretionary showed a wide spread of data below the 25th percentile and above the 75th percentile, including many points beyond the whiskers. Because the standard deviation is the square root of the average squared deviation of the observations from the mean, a large number of outliers increases it. In this case, the calculations agree with what the box plots suggested.]
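To illustrate why outliers inflate the standard deviation, here is a minimal sketch using made-up profit margins (these values are hypothetical, not taken from the Forbes data). It also checks the hand-computed formula against R's `sd()`, which divides the sum of squared deviations by n - 1 rather than n.

```r
# Hypothetical profit margins for a tightly clustered sector
# (illustrative values only, not from the Forbes data)
margins <- c(0.08, 0.10, 0.12, 0.09, 0.11)

# Sample standard deviation by hand: sqrt(sum((x - mean)^2) / (n - 1))
manual_sd <- sqrt(sum((margins - mean(margins))^2) / (length(margins) - 1))

# Should agree with the built-in sd()
all.equal(manual_sd, sd(margins))

# Add a single extreme loss and compare the two standard deviations
with_outlier <- c(margins, -3.00)
sd_ratio <- sd(with_outlier) / sd(margins)
sd_ratio
```

Because each deviation is squared before averaging, one extreme value contributes far more to the total than many values near the mean, which is the pattern the box plots suggest for the high-variance sectors.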