library(tidyverse) # provides %>%, mutate(), fct_recode(), and ggplot()
fish <- read.table(file="http://www.amstat.org/publications/jse/datasets/fishcatch.dat.txt")
names(fish)<-c("obs", "species", "weight", "len1", "len2", "len3", "height.pct", "width.pct", "sex")
fish <- fish%>% mutate(sex=factor(sex))
fish <- fish%>% mutate(sex=fct_recode(sex, female="0", male="1"))
fish <- fish%>% mutate(species=factor(species))
fish <- fish%>% mutate(species=fct_recode(species, "Common Bream"="1", "Whitefish"="2", "Roach" = "3", "Silver Bream" ="4", "Smelt" ="5", "Pike" ="6", "Perch" ="7"))
# Do not modify the following code:
fish.sub <- filter(fish, !is.na(sex)) # only display rows where sex is not NA
knitr::kable(head(fish.sub), format = "markdown")
| obs | species | weight | len1 | len2 | len3 | height.pct | width.pct | sex |
|---|---|---|---|---|---|---|---|---|
| 14 | Common Bream | NA | 29.5 | 32 | 37.3 | 37.3 | 13.6 | male |
| 15 | Common Bream | 600 | 29.4 | 32 | 37.2 | 40.2 | 13.9 | male |
| 17 | Common Bream | 700 | 30.4 | 33 | 38.3 | 38.8 | 13.8 | male |
| 21 | Common Bream | 575 | 31.3 | 34 | 39.5 | 38.3 | 14.1 | male |
| 26 | Common Bream | 725 | 31.8 | 35 | 40.9 | 40.0 | 14.8 | male |
| 30 | Common Bream | 1000 | 33.5 | 37 | 42.6 | 44.5 | 15.5 | female |
mean.wt <- fish %>%
  group_by(species) %>%
  summarize(mean_wt = mean(weight, na.rm = TRUE)) %>%
  arrange(mean_wt)
# Do not modify the following code:
knitr::kable(mean.wt, format = "markdown")
| species | mean_wt |
|---|---|
| Smelt | 11.17857 |
| Roach | 152.05000 |
| Silver Bream | 154.81818 |
| Perch | 382.23929 |
| Whitefish | 531.00000 |
| Common Bream | 626.00000 |
| Pike | 718.70588 |
The species with the smallest mean weight is the Smelt, with a mean weight of about 11.2 grams.
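
The same answer can be pulled out in code instead of being read off the sorted table. The following is a minimal sketch, not part of the assignment code: it re-enters the mean weights from the table above into a new, hypothetically named data frame `mean.wt.check` and uses `dplyr::slice_min()` to keep the row with the smallest mean.

```r
library(dplyr)

# Mean weights copied from the table above (grams).
# `mean.wt.check` is a hypothetical name used only for this sketch.
mean.wt.check <- tibble(
  species = c("Smelt", "Roach", "Silver Bream", "Perch",
              "Whitefish", "Common Bream", "Pike"),
  mean_wt = c(11.17857, 152.05000, 154.81818, 382.23929,
              531.00000, 626.00000, 718.70588)
)

# slice_min() keeps the row(s) with the smallest value of the given column.
lightest <- mean.wt.check %>% slice_min(mean_wt, n = 1)
lightest
```

With the real data, the same call could be applied directly to `mean.wt`, assuming it still has the `species` and `mean_wt` columns.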
fish %>%
  group_by(species) %>%
  summarize(mean_wt = mean(weight, na.rm = TRUE)) %>%
  ggplot(aes(x = species, y = mean_wt, fill = species)) +
  geom_col() + # same bars as geom_bar(stat = "identity")
  theme(legend.position = "none")
Forbes <- read.csv("~/Downloads/R Exercise_Grace/2019_Forbes_Global_2000.csv")
Forbes = Forbes %>% filter(Revenue >= 0, !is.na(Sector))
Forbes$Sector = as.factor(Forbes$Sector)
Forbes$Industry = as.factor(Forbes$Industry)
Forbes$Continent = as.factor(Forbes$Continent)
Forbes$Country = as.factor(Forbes$Country)
Subset_Forbes =subset(Forbes, Continent %in% c("Asia","Europe","North America"))
ggplot(Subset_Forbes,aes( Revenue, Market_Value )) +
geom_point()+
geom_line()+
facet_grid(. ~ Continent)
Based on these graphs, it is hard to identify a clear correlation between revenue and market value: none of the three panels shows a strong relationship in either direction. In each panel, market value rises sharply at about 100 billion dollars in revenue. In Europe and Asia, most data points are concentrated below 100 billion dollars in revenue, while in North America most lie below 200 billion dollars. Companies in Asia and Europe appear able to reach a higher market value with less revenue than companies in North America. The North American scatterplot also has the greatest range in the data and seems to have more outliers than the other two continents. In both North America and Asia, the company with the highest recorded revenue has a lower market value than the company with the second-highest revenue. Europe differs: past the 210 billion dollar mark, market value increases as revenue increases. Finally, at around 100 billion dollars in revenue there is a company whose market value is about eight times its revenue.
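
Since correlation is hard to judge by eye, one option is to compute it per continent. The sketch below only demonstrates the mechanics. It uses the first six Forbes rows printed further down in this document, stored in a hypothetically named data frame `forbes.head`, so the resulting numbers say nothing about the full data set. On the real data, the same `group_by()` and `summarize()` calls could be run on `Subset_Forbes`.

```r
library(dplyr)

# First six rows of the Forbes table as printed in this document
# (billions of dollars). Three points per group is far too few for
# a meaningful correlation; this only shows how the call works.
forbes.head <- tibble(
  Continent    = c("Asia", "North America", "Asia",
                   "Asia", "North America", "North America"),
  Revenue      = c(175.874, 132.912, 150.313, 137.456, 111.904, 261.705),
  Market_Value = c(305.057, 368.502, 224.988, 197.045, 287.339, 961.257)
)

# One row per continent: group size, Pearson and Spearman correlation.
cor.by.continent <- forbes.head %>%
  group_by(Continent) %>%
  summarize(
    n        = n(),
    pearson  = cor(Revenue, Market_Value),
    spearman = cor(Revenue, Market_Value, method = "spearman")
  )
cor.by.continent
```

Spearman correlation is included because it is rank-based and so less affected by a few very large companies than Pearson correlation.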
Forbes<- Forbes %>% mutate(ProfMgn = Profits / Revenue) %>% filter(ProfMgn<=5)
# Do not modify the following code:
knitr::kable(head(Forbes), format = "markdown")
| Rank | Company | Sector | Industry | Continent | Country | Revenue | Profits | Assets | Market_Value | ProfMgn |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ICBC | Financials | Major Banks | Asia | China | 175.874 | 45.223 | 4034.482 | 305.057 | 0.2571329 |
| 2 | JPMorgan Chase | Financials | Major Banks | North America | United States | 132.912 | 32.738 | 2737.188 | 368.502 | 0.2463134 |
| 3 | China Construction Bank | Financials | Major Banks | Asia | China | 150.313 | 38.841 | 3382.422 | 224.988 | 0.2584008 |
| 4 | Agricultural Bank of China | Financials | Regional Banks | Asia | China | 137.456 | 30.894 | 3293.105 | 197.045 | 0.2247556 |
| 5 | Bank of America | Financials | Major Banks | North America | United States | 111.904 | 28.540 | 2377.164 | 287.339 | 0.2550400 |
| 6 | Apple | Information Technology | Computer Hardware | North America | United States | 261.705 | 59.431 | 373.719 | 961.257 | 0.2270916 |
ggplot(Forbes, aes(x= Sector, y=ProfMgn)) +
geom_boxplot()
The sector that appears to have the greatest standard deviation is Utilities.
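
A boxplot shows quartiles, not the standard deviation, so reading the SD from it means paying attention to the outliers. The sketch below uses made-up margins (hypothetical values, not taken from the Forbes data) to show how a single extreme value inflates the SD while the interquartile range (the height of the box) barely changes.

```r
# Hypothetical profit margins, invented for illustration only.
margins <- c(0.05, 0.08, 0.10, 0.12, 0.15)

# The same values plus one extreme outlier.
margins.outlier <- c(margins, 3.0)

# Standard deviation reacts strongly to the outlier...
sd.before <- sd(margins)
sd.after  <- sd(margins.outlier)

# ...while the interquartile range changes only slightly.
iqr.before <- IQR(margins)
iqr.after  <- IQR(margins.outlier)

c(sd.before = sd.before, sd.after = sd.after,
  iqr.before = iqr.before, iqr.after = iqr.after)
```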
#### 2.7 Calculate the SD for each sector
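```r
sectors <- group_by(Forbes, Sector)
# desc() sorts from the largest SD to the smallest
Forbes.SD <- summarize(sectors, sd=sd(ProfMgn)) %>% arrange(desc(sd))
view(Forbes.SD) # opens the data viewer in an interactive session
# Do not modify the following code:
knitr::kable(Forbes.SD, format = "markdown")
```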
| Sector | sd |
|---|---|
|  | 0.5366919 |
| Utilities | 0.4912375 |
| Telecommunication Services | 0.4698143 |
| Financials | 0.3285288 |
| Information Technology | 0.2303487 |
| Industrials | 0.1611717 |
| Consumer Discretionary | 0.1605991 |
| Materials | 0.1566098 |
| Health Care | 0.1488790 |
| Consumer Staples | 0.1282274 |
| Energy | 0.1089204 |
Among the named sectors, Utilities has the greatest standard deviation (about 0.49). This matches what I expected from the boxplot, where Utilities had the largest outlier: standard deviation is sensitive to outliers, so a single extreme value inflates it. Information Technology has the most variable middle half of the data, which we can tell because it has the largest box (the largest interquartile range) in the boxplot. The means and medians appear to be roughly aligned across sectors. The Financials and Utilities sectors have the largest outliers, and Financials has the most outliers overall. Overall, there is not much difference between sectors because most of the boxes overlap.