R Exercise

Part 1: Analysis of Fish Data (35 total points)

1.1 Import the data

library(tidyverse)  # provides %>%, mutate(), fct_recode(), filter(), and ggplot() used below

fish <- read.table(file = "http://www.amstat.org/publications/jse/datasets/fishcatch.dat.txt")
names(fish) <- c("obs", "species", "weight", "len1", "len2", "len3", "height.pct", "width.pct", "sex")

1.2 Change ‘sex’ to be a factor and rename

fish <- fish %>% mutate(sex = factor(sex))
fish <- fish %>% mutate(sex = fct_recode(sex, female = "0", male = "1"))
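
The two-step conversion above can also be written as a single base R call. This is a minimal sketch, assuming the raw column holds the codes 0 and 1 with missing values stored as NA; the vector `sex_raw` is made-up illustration data, not values from the fish file.

```r
# Hypothetical raw codes standing in for fish$sex (0 = female, 1 = male)
sex_raw <- c(1, 0, NA, 1, 0)

# factor() maps each level to its label in one step; NA stays NA
sex_fct <- factor(sex_raw, levels = c(0, 1), labels = c("female", "male"))

levels(sex_fct)
table(sex_fct, useNA = "ifany")
```

Listing the levels explicitly also fixes their order, which can be useful when the factor is later used in a plot or table.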

1.3 Change ‘species’ to be a factor and rename

fish <- fish %>% mutate(species = factor(species))
fish <- fish %>% mutate(species = fct_recode(species,
  "Common Bream" = "1", "Whitefish" = "2", "Roach" = "3", "Silver Bream" = "4",
  "Smelt" = "5", "Pike" = "6", "Perch" = "7"))
# Do not modify the following code:
fish.sub <- filter(fish, !is.na(sex)) # only display rows where sex is not NA
knitr::kable(head(fish.sub), format = "markdown")

obs	species	weight	len1	len2	len3	height.pct	width.pct	sex
14	Common Bream	NA	29.5	32	37.3	37.3	13.6	male
15	Common Bream	600	29.4	32	37.2	40.2	13.9	male
17	Common Bream	700	30.4	33	38.3	38.8	13.8	male
21	Common Bream	575	31.3	34	39.5	38.3	14.1	male
26	Common Bream	725	31.8	35	40.9	40.0	14.8	male
30	Common Bream	1000	33.5	37	42.6	44.5	15.5	female

1.4 Determine mean weight for each species

mean.wt <- fish %>%
  group_by(species) %>%
  summarize(mean_wt = mean(weight, na.rm = TRUE)) %>%
  arrange(mean_wt)
# Do not modify the following code:
knitr::kable(mean.wt, format = "markdown")

species	mean_wt
Smelt	11.17857
Roach	152.05000
Silver Bream	154.81818
Perch	382.23929
Whitefish	531.00000
Common Bream	626.00000
Pike	718.70588

The species with the smallest mean weight is Smelt, at about 11.2 grams.
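
The `na.rm = TRUE` argument matters here because at least one weight is missing (observation 14 in the first table). A small sketch on made-up numbers shows what happens without it. It uses base R's `tapply` in place of the `group_by`/`summarize` pipeline, and the data frame `toy` is purely illustrative.

```r
# Made-up weights (grams) for two species; one value is missing
toy <- data.frame(
  species = c("A", "A", "A", "B", "B"),
  weight  = c(100, NA, 300, 10, 20)
)

# Without na.rm, any group containing an NA gets an NA mean
mean_with_na <- tapply(toy$weight, toy$species, mean)

# With na.rm = TRUE, missing values are dropped before averaging
mean_no_na <- tapply(toy$weight, toy$species, mean, na.rm = TRUE)

mean_with_na
mean_no_na
```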

1.5 Plot the mean weights for each species

fish %>%
  group_by(species) %>%
  summarize(mean_wt = mean(weight, na.rm = TRUE)) %>%
  ggplot(aes(x = species, y = mean_wt, fill = species)) +
  geom_col() +
  theme(legend.position = "none")

Part 2: Analysis of Forbes Global 2000 Data (50 total points)

2.1 Import the dataset

Forbes <- read.csv("~/Downloads/R Exercise_Grace/2019_Forbes_Global_2000.csv")

2.2 Exclude specified records

Forbes = Forbes %>% filter(Revenue >= 0, !is.na(Sector))

2.3 Convert four variables to factors

Forbes$Sector = as.factor(Forbes$Sector)
Forbes$Industry = as.factor(Forbes$Industry)
Forbes$Continent = as.factor(Forbes$Continent)
Forbes$Country = as.factor(Forbes$Country)
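
One thing worth checking around this conversion: by default `read.csv` keeps blank text fields as empty strings rather than `NA`, so `!is.na()` does not filter them out and `as.factor` turns them into a level of their own. A small sketch on an invented vector (not the Forbes data) shows the behavior and one possible way to handle it.

```r
# Invented sector labels; the third entry is an empty string, not NA
sector_raw <- c("Energy", "Utilities", "", "Energy")

is.na(sector_raw)              # all FALSE: "" is not treated as missing
levels(as.factor(sector_raw))  # the empty string becomes its own level

# One way to treat blanks as missing before converting
sector_clean <- sector_raw
sector_clean[sector_clean == ""] <- NA
levels(as.factor(sector_clean))
```

Whether blank sectors should be dropped or kept depends on what the exercise specifies, so this is only a check to consider, not a required step.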

2.4 Create a scatterplot of Market Value by Sales

Subset_Forbes = subset(Forbes, Continent %in% c("Asia", "Europe", "North America"))
ggplot(Subset_Forbes, aes(Revenue, Market_Value)) +
  geom_point() +
  geom_line() +
  facet_grid(. ~ Continent)

Based on these graphs, it is hard to judge the correlation: none of the three panels shows a strong relationship in either direction. In each panel there is a sharp increase in market value at about 100 billion in revenue. In Europe and Asia most of the data points are concentrated below 100 billion in revenue, while in North America most lie below 200 billion. It seems that companies in Asia and Europe can have lower revenue and higher market value than companies in North America. The North American scatterplot also has the greatest range and appears to have more outliers than the other two continents.

I found it interesting that in North America the company with the highest recorded revenue has a lower market value than the company with the second-highest revenue. The same is true for Asia. Europe differs from both: past the 210 billion mark, market value increases as revenue increases. I also found it interesting that at around 100 billion in revenue there is a company whose market value is eight times its revenue.
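
Since the correlation is hard to judge by eye, one option is to compute it for each continent. This is a rough sketch using a small invented data frame `toy` in place of `Subset_Forbes` (the numbers are illustrative only, not Forbes data), built on base R's `split` and `sapply`.

```r
# Invented revenue / market value pairs for two continents
toy <- data.frame(
  Continent    = rep(c("Asia", "Europe"), each = 4),
  Revenue      = c(10, 20, 30, 40, 10, 20, 30, 40),
  Market_Value = c(12, 18, 33, 41, 40, 15, 35, 10)
)

# Pearson correlation of Revenue and Market_Value within each continent
cor_by_continent <- sapply(
  split(toy, toy$Continent),
  function(d) cor(d$Revenue, d$Market_Value)
)

round(cor_by_continent, 2)
```

With the real data, passing `Subset_Forbes` instead of `toy` would give one coefficient per panel. Adding `method = "spearman"` to `cor` gives a rank-based version that is less affected by a few very large companies.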

2.5 Create Profit Margin variable

Forbes <- Forbes %>% mutate(ProfMgn = Profits / Revenue) %>% filter(ProfMgn <= 5)


# Do not modify the following code:
knitr::kable(head(Forbes), format = "markdown")

Rank	Company	Sector	Industry	Continent	Country	Revenue	Profits	Assets	Market_Value	ProfMgn
1	ICBC	Financials	Major Banks	Asia	China	175.874	45.223	4034.482	305.057	0.2571329
2	JPMorgan Chase	Financials	Major Banks	North America	United States	132.912	32.738	2737.188	368.502	0.2463134
3	China Construction Bank	Financials	Major Banks	Asia	China	150.313	38.841	3382.422	224.988	0.2584008
4	Agricultural Bank of China	Financials	Regional Banks	Asia	China	137.456	30.894	3293.105	197.045	0.2247556
5	Bank of America	Financials	Major Banks	North America	United States	111.904	28.540	2377.164	287.339	0.2550400
6	Apple	Information Technology	Computer Hardware	North America	United States	261.705	59.431	373.719	961.257	0.2270916
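
Because the filter in 2.2 keeps rows with `Revenue >= 0`, a revenue of exactly zero would reach the profit margin step. A brief sketch with made-up numbers shows how R handles that division and what the `<= 5` comparison returns. `dplyr::filter` keeps only rows where the condition is `TRUE`, so rows like these would be dropped.

```r
# Made-up profits and revenues; the last two revenues are zero
profits <- c(10, 0, 5)
revenue <- c(100, 0, 0)

margin <- profits / revenue
margin          # 0.1, NaN (0/0), Inf (5/0)

keep <- margin <= 5
keep            # TRUE, NA, FALSE

# which() keeps only positions where the condition is TRUE
margin[which(keep)]
```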

2.6 Create boxplot of Profit Margin by Sector

ggplot(Forbes, aes(x= Sector, y=ProfMgn)) +
  geom_boxplot()

The sector that appears to have the greatest standard deviation is Utilities.
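
A boxplot shows quartiles and outliers directly, while standard deviation has to be inferred from them. A short sketch on invented numbers (not the Forbes data) illustrates how a single extreme point inflates the SD while the interquartile range, which is the height of the box, changes only slightly.

```r
# Invented profit margins; `with_outlier` adds one extreme value
base_vals    <- c(0.05, 0.08, 0.10, 0.12, 0.15)
with_outlier <- c(base_vals, 3)

# SD reacts strongly to the extreme point
sd(base_vals)
sd(with_outlier)

# IQR (the box in a boxplot) moves much less
IQR(base_vals)
IQR(with_outlier)
```

This is why a sector with a modest box but one far-away point can still have the largest standard deviation.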



2.7 Calculate the SD for each sector


```r
sectors <- group_by(Forbes, Sector)
Forbes.SD <- summarize(sectors, sd=sd(ProfMgn)) %>% arrange(-sd)
view(Forbes.SD)
# Do not modify the following code:
knitr::kable(Forbes.SD, format = "markdown")
```

Sector	sd
	0.5366919
Utilities	0.4912375
Telecommunication Services	0.4698143
Financials	0.3285288
Information Technology	0.2303487
Industrials	0.1611717
Consumer Discretionary	0.1605991
Materials	0.1566098
Health Care	0.1488790
Consumer Staples	0.1282274
Energy	0.1089204

Among the named sectors, Utilities has the greatest standard deviation (0.491). The first row of the table, which has a blank sector label and a standard deviation of 0.537, is higher still; it appears to come from records whose Sector field is blank rather than NA, which the filter in 2.2 does not remove. Utilities has a high standard deviation because it has the largest outlier, and standard deviation is sensitive to outliers. That is why I picked it from the boxplot in the first place. Information Technology has the most variable data in the middle of its distribution, which we can tell because it has the largest box (interquartile range) on the boxplot. Means and medians seem to be aligned for the most part across sectors. The Financials and Utilities sectors have the largest outliers, and Financials has the most outliers overall. Overall, there is not much difference between sectors because most boxes overlap.