R Exercise

Part 1: Analysis of Fish Data (35 total points)

1.1 Import the data

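library(dplyr)    # filter(), group_by(), summarize(), mutate(), %>%
library(ggplot2)  # ggplot() and the geoms used below
fish <- read.csv("Fish Data.csv", header = FALSE, col.names = c("species", "weight", "len1", "len2", "len3", "height.pct", "width.pct", "sex"))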

1.2 Change ‘sex’ to be a factor

fish$sex <- factor(fish$sex, levels=c(0,1),  labels=c("Female", "Male"))
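As a minimal sketch of what the `levels`/`labels` pairing does, the snippet below applies the same call to a small made-up vector. The vector `sex_codes` is an illustrative assumption and is not taken from "Fish Data.csv".

```r
# Toy vector of coded values (illustrative only, not from "Fish Data.csv")
sex_codes <- c(0, 1, 1, NA, 0)

# levels = the codes as stored; labels = the display names, in the same order
sex_factor <- factor(sex_codes, levels = c(0, 1), labels = c("Female", "Male"))

print(sex_factor)                          # codes replaced by labels; NA stays NA
print(table(sex_factor, useNA = "ifany"))  # counts per label, including missing

# A code that is not listed in `levels` also becomes NA
unlisted <- factor(c(0, 1, 2), levels = c(0, 1), labels = c("Female", "Male"))
print(is.na(unlisted))
```

Because unlisted codes silently turn into NA, it can be worth checking `table(..., useNA = "ifany")` after any recoding step.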

1.3 Change ‘species’ to be a factor and add labels

fish$species <- factor(fish$species, levels=c(1, 2, 3, 4, 5, 6, 7),  labels=c("Bream", "Whitefish", "Roach", "Silver Bream", "Smelt", "Pike", "Perch"))
# Do not modify the following code:
fish.sub <- filter(fish, sex != "NA")
knitr::kable(head(fish.sub), format = "markdown")

species	weight	len1	len2	len3	height.pct	width.pct	sex
Bream	NA	29.5	32	37.3	37.3	13.6	Male
Bream	600	29.4	32	37.2	40.2	13.9	Male
Bream	700	30.4	33	38.3	38.8	13.8	Male
Bream	575	31.3	34	39.5	38.3	14.1	Male
Bream	725	31.8	35	40.9	40.0	14.8	Male
Bream	1000	33.5	37	42.6	44.5	15.5	Female
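The filter `sex != "NA"` removes rows with a missing `sex` because a comparison involving NA evaluates to NA, and row filters keep only rows where the condition is TRUE. The sketch below illustrates this on a made-up three-row data frame (`toy` is hypothetical), using base R's `subset()`, which treats NA conditions the same way as `dplyr::filter()`.

```r
# Toy factor with one missing value (illustrative only)
sex <- factor(c("Male", NA, "Female"), levels = c("Female", "Male"))

# Comparing a missing value with anything yields NA, not TRUE or FALSE
cmp <- sex != "NA"
print(cmp)

# Rows whose condition is NA are dropped, so both filters keep the same rows
toy <- data.frame(id = 1:3, sex = sex)
kept_implicit <- subset(toy, sex != "NA")
kept_explicit <- subset(toy, !is.na(sex))  # states the intent directly

print(identical(kept_implicit, kept_explicit))
```

Writing the condition as `!is.na(sex)` gives the same rows here and makes the purpose of the filter explicit.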

1.4 Determine mean weight for each species

mean.wt <- fish %>%
    group_by(species) %>%
    dplyr::summarize(Mean = mean(weight, na.rm=TRUE))
# Do not modify the following code:
knitr::kable(mean.wt, format = "markdown")

species	Mean
Bream	626.00000
Whitefish	531.00000
Roach	152.05000
Silver Bream	154.81818
Smelt	11.17857
Pike	718.70588
Perch	382.23929

The species with the smallest mean weight is Smelt, with a mean weight of 11.17857 g.
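The smallest mean can also be picked out programmatically rather than by eye. The sketch below re-enters the means from the printed table into a new data frame (the name `mean.wt.tbl` is an assumption, chosen so as not to overwrite `mean.wt`) and uses `which.min()`.

```r
# Mean weights copied from the table printed above (grams)
mean.wt.tbl <- data.frame(
  species = c("Bream", "Whitefish", "Roach", "Silver Bream", "Smelt", "Pike", "Perch"),
  Mean    = c(626.00000, 531.00000, 152.05000, 154.81818, 11.17857, 718.70588, 382.23929)
)

# which.min() returns the position of the smallest value
lightest <- mean.wt.tbl[which.min(mean.wt.tbl$Mean), ]
print(lightest)
```

With the real objects, the same idea would be `mean.wt[which.min(mean.wt$Mean), ]`.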

1.5 Plot the mean weights for each species

ggplot(data = mean.wt)+
  geom_bar(mapping = aes(y = Mean, x = species, fill = species), stat= "identity") +
  labs(title="Mean Weight per Species", x="Species", y="Weight in Grams")

Part 2: Analysis of Forbes Global 2000 Data (50 total points)

2.1 Import the dataset

Forbes <- read.csv("2014 Forbes Global 2000.csv", header = TRUE)

2.2 Exclude specified records

Forbes <- filter(Forbes, Sector !="" & Sales >0)

2.3 Convert four variables to factors

Forbes$Sector <- factor(Forbes$Sector)
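# Note: this assigns to a new column, SIndustry (visible in the 2.5 output); Industry itself is not converted
Forbes$SIndustry <- factor(Forbes$Industry)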
Forbes$Continent <- factor(Forbes$Continent)
Forbes$Country <- factor(Forbes$Country)

2.4 Create scatterplot of Market Value by Sales

ggplot(subset(Forbes, Continent %in% c("Asia","Europe","North America"))) +
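  # alpha is a fixed setting, so it belongs outside aes(); inside aes() it is treated as a mapped variable
  geom_point(mapping = aes(x = Sales, y = Market_Value, color = Continent, shape = Continent), alpha = 0.3) +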
  facet_grid(~ Continent) +
  geom_smooth(method = "lm", aes(x = Sales, y = Market_Value))

European and Asian company market values are closely correlated with company sales, and the two continents show a similar, consistent relationship between the two variables. North American company market values, however, do not follow the same trajectory as their European and Asian counterparts. In North America, an increase in sales is associated with a more dramatic increase in market value, so market value rises with sales much faster there than in the European or Asian markets.
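One way to put numbers behind this visual comparison is to fit the same linear model separately for each continent and compare the slopes and correlations. The sketch below does this on a small invented data frame: `toy` and every value in it are hypothetical and are not drawn from "2014 Forbes Global 2000.csv", so the printed figures say nothing about the real data.

```r
# Hypothetical toy data: invented for illustration, NOT values from the Forbes file
toy <- data.frame(
  Continent    = rep(c("Asia", "Europe", "North America"), each = 4),
  Sales        = c(10, 20, 30, 40,  10, 20, 30, 40,  10, 20, 30, 40),
  Market_Value = c(12, 19, 33, 41,  11, 22, 29, 42,  15, 41, 58, 83)
)

by_continent <- split(toy, toy$Continent)

# Slope of Market_Value ~ Sales per continent, the line that
# geom_smooth(method = "lm") draws within each facet
slopes <- sapply(by_continent, function(d) {
  unname(coef(lm(Market_Value ~ Sales, data = d))["Sales"])
})

# Pearson correlation per continent: how tightly points follow a line
cors <- sapply(by_continent, function(d) cor(d$Sales, d$Market_Value))

print(round(slopes, 2))
print(round(cors, 2))
```

Applied to the real data, `split(Forbes, Forbes$Continent)` restricted to the three plotted continents would take the place of `by_continent`.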

2.5 Create Profit Margin variable

Forbes <- Forbes %>%
  mutate(ProfMgn = Profits/Sales)


# Do not modify the following code:
knitr::kable(head(Forbes), format = "markdown")

Rank	Company	Sector	Industry	Continent	Country	Sales	Profits	Assets	Market_Value	SIndustry	ProfMgn
1	ICBC	Financials	Major Banks	Asia	China	148.7	42.7	3124.9	215.6	Major Banks	0.2871553
2	China Construction Bank	Financials	Regional Banks	Asia	China	121.3	34.2	2449.5	174.4	Regional Banks	0.2819456
3	Agricultural Bank of China	Financials	Regional Banks	Asia	China	136.4	27.0	2405.4	141.1	Regional Banks	0.1979472
4	JPMorgan Chase	Financials	Major Banks	North America	United States	105.7	17.3	2435.3	229.7	Major Banks	0.1636708
5	Berkshire Hathaway	Financials	Investment Services	North America	United States	178.8	19.5	493.4	309.1	Investment Services	0.1090604
6	Exxon Mobil	Energy	Oil & Gas Operations	North America	United States	394.0	32.6	346.8	422.3	Oil & Gas Operations	0.0827411
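As a quick arithmetic check, the profit margins in the table can be recomputed by hand from the Sales and Profits columns. The sketch below re-enters the six printed rows into a new data frame (`top6` is an assumed name) and compares the result with the printed `ProfMgn` values.

```r
# Sales and Profits copied from the six rows printed above
top6 <- data.frame(
  Company = c("ICBC", "China Construction Bank", "Agricultural Bank of China",
              "JPMorgan Chase", "Berkshire Hathaway", "Exxon Mobil"),
  Sales   = c(148.7, 121.3, 136.4, 105.7, 178.8, 394.0),
  Profits = c(42.7, 34.2, 27.0, 17.3, 19.5, 32.6)
)

# Profit margin = profits divided by sales
top6$ProfMgn <- top6$Profits / top6$Sales

# Values reported in the ProfMgn column of the table above
reported <- c(0.2871553, 0.2819456, 0.1979472, 0.1636708, 0.1090604, 0.0827411)

# TRUE when every recomputed margin matches the printed one up to rounding
print(all(abs(top6$ProfMgn - reported) < 1e-7))
```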

2.6 Create boxplot of Profit Margin by Sector

ggplot(Forbes, aes(x = Sector, y = ProfMgn)) +
  stat_boxplot(geom='errorbar', width=0.5) +
  geom_boxplot(outlier.size = 1, aes(fill=Sector)) +
  scale_y_continuous(limits=c(-2, 10)) +
  coord_flip() +
  stat_summary(fun = mean, color="yellow", geom="point", size=2, shape=18)  # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)

The sector that appears to have the greatest standard deviation is Consumer Discretionary.

2.7 Calculate the SD for each sector

Forbes.SD <- Forbes %>% 
  group_by(Sector) %>%
  summarize(standard_deviation = sd(ProfMgn))

# Do not modify the following code:
knitr::kable(Forbes.SD, format = "markdown")

Sector	standard_deviation
Consumer Discretionary	0.6289455
Consumer Staples	0.1000578
Energy	0.1058560
Financials	0.4052307
Health Care	0.1074421
Industrials	0.0993903
Information Technology	0.2345233
Materials	0.2154817
Telecommunication Services	0.1022383
Utilities	0.1623265

The sector that has the greatest standard deviation is Consumer Discretionary (0.6289455). This large standard deviation is attributable to a single outlying ProfMgn value of 10. The next-highest outlying value, 6.5, gives Financials the second-highest standard deviation (0.4052307). With the majority of ProfMgn values being less than 2, an outlier of 10 greatly inflates the standard deviation for that sector.
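The ranking can be confirmed by sorting the printed table, and the effect of a single extreme value on a standard deviation can be illustrated with a few made-up numbers. In the sketch below, `sd.tbl` re-enters the values from the table above, while `typical` and `with_outlier` are hypothetical margins invented for illustration only.

```r
# Standard deviations copied from the table printed above
sd.tbl <- data.frame(
  Sector = c("Consumer Discretionary", "Consumer Staples", "Energy", "Financials",
             "Health Care", "Industrials", "Information Technology", "Materials",
             "Telecommunication Services", "Utilities"),
  standard_deviation = c(0.6289455, 0.1000578, 0.1058560, 0.4052307, 0.1074421,
                         0.0993903, 0.2345233, 0.2154817, 0.1022383, 0.1623265)
)

# Sort from largest to smallest spread and show the top two sectors
ranked <- sd.tbl[order(-sd.tbl$standard_deviation), ]
print(head(ranked, 2))

# Hypothetical margins (invented) with and without one extreme value
typical      <- c(0.05, 0.08, 0.10, 0.12, 0.07, 0.09)
with_outlier <- c(typical, 10)
print(sd(typical))
print(sd(with_outlier))
```

Because the standard deviation squares each deviation from the mean, one value far from the rest can dominate the result, which is consistent with the pattern described for Consumer Discretionary and Financials.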