R Exercise

Part 1: Analysis of Fish Data (40 total points)

1.1 Import the data

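# Packages used in this exercise (presumably loaded in a setup chunk that is not shown)
library(dplyr)    # filter(), mutate(), group_by(), summarize(), %>%
library(ggplot2)  # ggplot() and geoms
library(readr)    # read_csv()

fish <- read.table("http://www.amstat.org/publications/jse/datasets/fishcatch.dat.txt",
                   header = FALSE,
                   col.names = c("Obs", "Species","Weight", 
                                "Len1","Len2", "Len3",
                                "Height.Pct", "Width.Pct","Sex"))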

1.2 Change ‘sex’ to be a factor

fish$Sex <- factor(fish$Sex, levels = c(0,1), labels = c("Male", "Female"))

1.3 Change ‘species’ to be a factor and add labels

fish$Species <- factor(fish$Species, levels = c(1,2,3,4,5,6,7), 
                       labels = c("Common Bream", "Whitefish",
                                  "Roach", "Silver Bream",
                                  "Smelt", "Pike","Perch"))
# Do not modify the following code:
fish.sub <- filter(fish, Sex != "NA")
knitr::kable(head(fish.sub), format = "markdown")

Obs	Species	Weight	Len1	Len2	Len3	Height.Pct	Width.Pct	Sex
14	Common Bream	NA	29.5	32	37.3	37.3	13.6	Female
15	Common Bream	600	29.4	32	37.2	40.2	13.9	Female
17	Common Bream	700	30.4	33	38.3	38.8	13.8	Female
21	Common Bream	575	31.3	34	39.5	38.3	14.1	Female
26	Common Bream	725	31.8	35	40.9	40.0	14.8	Female
30	Common Bream	1000	33.5	37	42.6	44.5	15.5	Male
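
The recoding in 1.2 and the `Sex != "NA"` condition can be illustrated on a tiny vector. The sketch below uses made-up codes (not rows from the fish data) and base R only. It mirrors the 0/1 labels used above and shows that comparing a missing value with anything yields `NA`, which is why fish with unrecorded sex drop out of `fish.sub` (assuming `filter()` here is `dplyr::filter()`, which keeps only rows where the condition is `TRUE`).

```r
# Hypothetical sex codes for illustration only; NA marks an unrecorded value
sex_code <- c(0, 1, NA, 1)

# Same mapping as in 1.2: 0 -> "Male", 1 -> "Female"
sex <- factor(sex_code, levels = c(0, 1), labels = c("Male", "Female"))

# Comparing a missing value with any string gives NA, not TRUE or FALSE
keep <- sex != "NA"

# Keeping only the entries where the comparison is TRUE drops the missing one
sex_known <- sex[which(keep)]
print(sex_known)
```

A more explicit way to express the same intent would be `!is.na(Sex)`, though the graded code above is left unchanged.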

1.4 Determine mean weight for each species

mean.wt <- fish %>%
  select(Weight, Species) %>%
  group_by(Species) %>%
  dplyr::summarize(count = n(),
                   mean_weight = mean(Weight, na.rm = TRUE)) %>%
  arrange(desc(mean_weight))
# Do not modify the following code:
knitr::kable(mean.wt, format = "markdown")

Species	count	mean_weight
Pike	17	718.70588
Common Bream	35	626.00000
Whitefish	6	531.00000
Perch	56	382.23929
Silver Bream	11	154.81818
Roach	20	152.05000
Smelt	14	11.17857
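
`na.rm = TRUE` matters because at least one weight is missing (observation 14 in the `fish.sub` preview above). As a small check, the sketch below takes the six weights shown in that preview, all Common Bream, and compares the mean with and without `na.rm`. It uses base R only, and the variable names are just for illustration.

```r
# Weights of the six fish listed in the fish.sub preview; the first is missing
w <- c(NA, 600, 700, 575, 725, 1000)

# Without na.rm, a single missing value makes the whole mean missing
mean_with_na <- mean(w)

# With na.rm = TRUE, the mean is taken over the five recorded weights
mean_no_na <- mean(w, na.rm = TRUE)

print(c(with_na = mean_with_na, na_removed = mean_no_na))
```

Note that `n()` in the summary still counts rows with a missing weight, so `count` can be larger than the number of values that went into `mean_weight`.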

The species with the smallest mean weight is Smelt, with a mean weight of about 11.18 g.
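
The same answer can be pulled out of the summary programmatically instead of being read off the table. This is a minimal sketch in base R that rebuilds `mean.wt` from the values printed above so it runs on its own; `lightest` is an illustrative name.

```r
# Summary values copied from the table in 1.4
mean.wt <- data.frame(
  Species = c("Pike", "Common Bream", "Whitefish", "Perch",
              "Silver Bream", "Roach", "Smelt"),
  count = c(17, 35, 6, 56, 11, 20, 14),
  mean_weight = c(718.70588, 626.00000, 531.00000, 382.23929,
                  154.81818, 152.05000, 11.17857)
)

# which.min() returns the row index of the smallest mean weight
lightest <- mean.wt[which.min(mean.wt$mean_weight), ]

print(lightest$Species)
print(round(lightest$mean_weight, 2))
```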

1.5 Plot the mean weights for each species

ggplot(mean.wt, aes(x = Species, y = mean_weight)) +
  geom_bar(stat = "identity", color = "black", fill = "light gray") +
  labs(x = "Species", y = "Mean weight (g)", title = "Mean weight per species")
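
The bars are drawn in factor-level order rather than by height. One possible refinement, sketched here under the assumption that sorted bars are wanted, is to reorder the `Species` factor by mean weight before plotting. The data frame is rebuilt from the table in 1.4 so the example is self-contained, and the plot object `p` (an illustrative name) is only built if ggplot2 is installed.

```r
# Summary values copied from the table in 1.4
mean.wt <- data.frame(
  Species = c("Pike", "Common Bream", "Whitefish", "Perch",
              "Silver Bream", "Roach", "Smelt"),
  mean_weight = c(718.70588, 626.00000, 531.00000, 382.23929,
                  154.81818, 152.05000, 11.17857)
)

# reorder() sorts levels by increasing score, so negate to put the heaviest first
mean.wt$Species <- reorder(mean.wt$Species, -mean.wt$mean_weight)
print(levels(mean.wt$Species))

# Build the bar chart only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(mean.wt, ggplot2::aes(x = Species, y = mean_weight)) +
    ggplot2::geom_col(color = "black", fill = "lightgray")
}
```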

Part 2: Analysis of Forbes Global 2000 Data (60 total points)

2.1 Import the dataset

Forbes <- read_csv("2014 Forbes Global 2000.csv")

## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   Company = col_character(),
##   Sector = col_character(),
##   Industry = col_character(),
##   Continent = col_character(),
##   Country = col_character(),
##   Sales = col_double(),
##   Profits = col_double(),
##   Assets = col_double(),
##   Market_Value = col_double()
## )

2.2 Exclude specified records

Forbes <- filter(Forbes, !is.na(Sector) & Sales != 0)

2.3 Convert four variables to factors

Forbes$Sector <- factor(Forbes$Sector)
Forbes$Industry <- factor(Forbes$Industry)
Forbes$Continent <- factor(Forbes$Continent)
Forbes$Country <- factor(Forbes$Country)
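
The four conversions follow the same pattern, so they could also be written as a single step. A minimal sketch in base R, using three rows taken from the dataset in place of the full file (the names `forbes_demo` and `cols` are illustrative):

```r
# Three rows from the Forbes data (ICBC, China Construction Bank, Exxon Mobil)
forbes_demo <- data.frame(
  Sector = c("Financials", "Financials", "Energy"),
  Industry = c("Major Banks", "Regional Banks", "Oil & Gas Operations"),
  Continent = c("Asia", "Asia", "North America"),
  Country = c("China", "China", "United States"),
  stringsAsFactors = FALSE
)

# Convert every listed column to a factor in one assignment
cols <- c("Sector", "Industry", "Continent", "Country")
forbes_demo[cols] <- lapply(forbes_demo[cols], factor)

str(forbes_demo)
```

The column-by-column version above is equally valid; the loop form mainly helps when the list of columns grows.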

2.4 Create scatterplot of Market Value by Sales

ggplot(subset(Forbes, Continent %in% c("Asia", "Europe", "North America")),
       aes(x = Sales, y = Market_Value)) +
  geom_point() +
  facet_grid(~ Continent) +
  geom_smooth(lwd = 0.75, color = "deepskyblue3")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

[In all three panels the data are concentrated close to the origin. For Asia, market value initially increases as sales increase. The same trend is seen for North America, but there market value rises at a higher rate (steeper slope) than in Asia. A small number of outliers with high sales but low market value cause the smoothed curve to decline once sales exceed about 300 billion USD. Europe shows the same initial increase in market value with sales. Once sales for European firms pass about 100 billion USD, market value continues to rise, but at a slower rate. Unlike the declining case, Europe shows no drop in market value beyond any sales threshold.]

2.5 Create Profit Margin variable

Forbes <- mutate(Forbes, ProfMgn = Profits / Sales)
# Do not modify the following code:
knitr::kable(head(Forbes), format = "markdown")

Rank	Company	Sector	Industry	Continent	Country	Sales	Profits	Assets	Market_Value	ProfMgn
1	ICBC	Financials	Major Banks	Asia	China	148.7	42.7	3124.9	215.6	0.2871553
2	China Construction Bank	Financials	Regional Banks	Asia	China	121.3	34.2	2449.5	174.4	0.2819456
3	Agricultural Bank of China	Financials	Regional Banks	Asia	China	136.4	27.0	2405.4	141.1	0.1979472
4	JPMorgan Chase	Financials	Major Banks	North America	United States	105.7	17.3	2435.3	229.7	0.1636708
5	Berkshire Hathaway	Financials	Investment Services	North America	United States	178.8	19.5	493.4	309.1	0.1090604
6	Exxon Mobil	Energy	Oil & Gas Operations	North America	United States	394.0	32.6	346.8	422.3	0.0827411
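
As a quick arithmetic check, the profit margin is simply profits divided by sales. The sketch below recomputes it for two of the rows shown above (ICBC and Exxon Mobil) and compares the result with the `ProfMgn` column. The vector names are illustrative.

```r
# Profits and Sales copied from the preview table above
profits <- c(ICBC = 42.7, `Exxon Mobil` = 32.6)
sales   <- c(ICBC = 148.7, `Exxon Mobil` = 394.0)

# Profit margin = profits / sales
margin <- profits / sales

# ProfMgn values as printed in the table
printed <- c(ICBC = 0.2871553, `Exxon Mobil` = 0.0827411)

print(round(margin, 7))
print(all(abs(margin - printed) < 1e-7))
```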

2.6 Create boxplot of Profit Margin by Sector

ggplot(Forbes, aes(x = Sector, y = ProfMgn)) +
  stat_boxplot(geom = "errorbar", width = 0.5) +
  geom_boxplot(outlier.size = 0.75, aes(fill = Sector)) +
  coord_flip()

The sectors that appear to have the greatest standard deviation are Materials, Information Technology, Financials, and Consumer Discretionary.

2.7 Calculate the SD for each sector

Forbes.SD <-  Forbes %>%
  select(Sector, ProfMgn) %>%
  group_by(Sector) %>%
  dplyr::summarize(Sector_STD = paste(round(100*sd(ProfMgn),1),"%"))
# Do not modify the following code:
knitr::kable(Forbes.SD, format = "markdown")

Sector	Sector_STD
Consumer Discretionary	62.9 %
Consumer Staples	10 %
Energy	10.6 %
Financials	40.5 %
Health Care	10.7 %
Industrials	9.9 %
Information Technology	23.5 %
Materials	21.5 %
Telecommunication Services	10.2 %
Utilities	16.2 %
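
Because `Sector_STD` is built with `paste()`, it is a character column and cannot be sorted numerically as it stands. A minimal sketch of ranking the sectors, assuming the standard deviations are kept as numbers (here copied from the table above into an illustrative data frame `sector_sd`):

```r
# Standard deviations (in percent) copied from the table in 2.7
sector_sd <- data.frame(
  Sector = c("Consumer Discretionary", "Consumer Staples", "Energy",
             "Financials", "Health Care", "Industrials",
             "Information Technology", "Materials",
             "Telecommunication Services", "Utilities"),
  sd_pct = c(62.9, 10, 10.6, 40.5, 10.7, 9.9, 23.5, 21.5, 10.2, 16.2)
)

# Sort from largest to smallest spread
ranked <- sector_sd[order(-sector_sd$sd_pct), ]

# The four most variable sectors
top4 <- head(ranked$Sector, 4)
print(top4)
```

The top four match the sectors picked out by eye from the box plots in 2.6.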

The sector that has the greatest standard deviation is Consumer Discretionary. [The results confirm the initial reading of the box plots. The box plots for Materials, Information Technology, Financials, and Consumer Discretionary showed a wide spread of data below the 25th percentile and above the 75th percentile, including many points beyond the whiskers. Because the standard deviation is the square root of the average squared difference between each observation and the mean, a large number of outliers inflates it. In this case, the calculations confirm what the box plots suggested.]
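
The effect of outliers on the standard deviation can be seen on a toy example. The values below are made up for illustration and are not taken from the Forbes data. The sketch also spells out the formula behind `sd()`, which divides the sum of squared deviations by n - 1 rather than n.

```r
# Hypothetical profit margins, tightly clustered around 0.10
x <- c(0.08, 0.10, 0.12, 0.09, 0.11)

# The same values plus one extreme loss
x_out <- c(x, -5)

# Sample standard deviation written out by hand (n - 1 in the denominator)
sd_manual <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))

# Compare the spread measures with and without the outlier
print(c(sd = sd(x), sd_with_outlier = sd(x_out)))
print(c(iqr = IQR(x), iqr_with_outlier = IQR(x_out)))
```

A single extreme value inflates the standard deviation far more than the interquartile range, which is consistent with sectors that have many points beyond the whiskers also having the largest standard deviations.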