Find an interesting dataset and prepare a short report (in R Markdown) consisting of:

  • a short description of the dataset,
  • 3 scatterplots presenting interesting relationships between variables,
  • brief comments describing the results obtained.

Please remember to:

  • use different aesthetics to properly show the insights from the data,
  • add a smoothing line,
  • label each graph with a proper title, subtitle, and axis descriptions.

The homework is due by the next class, and the deadline cannot be postponed under any circumstances.

Introduction

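The dataset, 120 Years of Olympic History: Athletes and Results, contains one row per athlete per event at the Olympic Games: 271,116 rows and 15 variables in total, including each athlete's sex, age, height, weight, team, sport, event and medal.

The data were downloaded from Kaggle and read from a local CSV file.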

Source: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results

Data Variables

The Winter and Summer Games were held in the same year up until 1992. After that, the Games were staggered so that the Winter Games follow a four-year cycle starting in 1994, with the Summer Games in 1996, the Winter Games in 1998, and so on.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

athlete <- read.csv('/Users/jeank4723/Desktop/Advance VR/1/Data/athlete_events.csv')

head(athlete)
##   ID                     Name Sex Age Height Weight           Team NOC
## 1  1                A Dijiang   M  24    180     80          China CHN
## 2  2                 A Lamusi   M  23    170     60          China CHN
## 3  3      Gunnar Nielsen Aaby   M  24     NA     NA        Denmark DEN
## 4  4     Edgar Lindenau Aabye   M  34     NA     NA Denmark/Sweden DEN
## 5  5 Christine Jacoba Aaftink   F  21    185     82    Netherlands NED
## 6  5 Christine Jacoba Aaftink   F  21    185     82    Netherlands NED
##         Games Year Season      City         Sport
## 1 1992 Summer 1992 Summer Barcelona    Basketball
## 2 2012 Summer 2012 Summer    London          Judo
## 3 1920 Summer 1920 Summer Antwerpen      Football
## 4 1900 Summer 1900 Summer     Paris    Tug-Of-War
## 5 1988 Winter 1988 Winter   Calgary Speed Skating
## 6 1988 Winter 1988 Winter   Calgary Speed Skating
##                                Event Medal
## 1        Basketball Men's Basketball  <NA>
## 2       Judo Men's Extra-Lightweight  <NA>
## 3            Football Men's Football  <NA>
## 4        Tug-Of-War Men's Tug-Of-War  Gold
## 5   Speed Skating Women's 500 metres  <NA>
## 6 Speed Skating Women's 1,000 metres  <NA>
athlete$Sex <- as.factor(athlete$Sex)
athlete$Season <- as.factor(athlete$Season)
athlete$Medal <- as.factor(athlete$Medal)

str(athlete)
## 'data.frame':    271116 obs. of  15 variables:
##  $ ID    : int  1 2 3 4 5 5 5 5 5 5 ...
##  $ Name  : chr  "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
##  $ Sex   : Factor w/ 2 levels "F","M": 2 2 2 2 1 1 1 1 1 1 ...
##  $ Age   : int  24 23 24 34 21 21 25 25 27 27 ...
##  $ Height: int  180 170 NA NA 185 185 185 185 185 185 ...
##  $ Weight: num  80 60 NA NA 82 82 82 82 82 82 ...
##  $ Team  : chr  "China" "China" "Denmark" "Denmark/Sweden" ...
##  $ NOC   : chr  "CHN" "CHN" "DEN" "DEN" ...
##  $ Games : chr  "1992 Summer" "2012 Summer" "1920 Summer" "1900 Summer" ...
##  $ Year  : int  1992 2012 1920 1900 1988 1988 1992 1992 1994 1994 ...
##  $ Season: Factor w/ 2 levels "Summer","Winter": 1 1 1 1 2 2 2 2 2 2 ...
##  $ City  : chr  "Barcelona" "London" "Antwerpen" "Paris" ...
##  $ Sport : chr  "Basketball" "Judo" "Football" "Tug-Of-War" ...
##  $ Event : chr  "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
##  $ Medal : Factor w/ 3 levels "Bronze","Gold",..: NA NA NA 2 NA NA NA NA NA NA ...
# 
# sum_of_na <- function(x){
#   sum(is.na(x))
# }
# athlete %>% summarise(
#   across(everything(), sum_of_na)
# )
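
The commented-out block above counts missing values per column. A minimal base-R sketch of the same idea is shown here on a small made-up data frame; `toy` and its values are assumptions for illustration and stand in for `athlete`.

```r
# Toy stand-in for the athlete data; the values are invented for illustration.
toy <- data.frame(
  Age    = c(24, 23, NA, 34),
  Height = c(180, 170, NA, NA),
  Medal  = c(NA, NA, NA, "Gold")
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Applied to the real data, the equivalent call would be `colSums(is.na(athlete))`.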

3 Scatterplots

  1. Year and Height: The range of athletes' heights widens over the years. Also, comparing recent Games with the earliest ones, far more women compete in the Olympics now than at the beginning.
# Bin Year into 20-year periods. include.lowest = TRUE keeps 1896 in the first bin;
# without it, 1896 falls outside (1896,1916] and is shown as NA.
tyeatdata <- cut(athlete$Year, breaks = seq(1896,2016, by = 20), dig.lab = 4,
                 include.lowest = TRUE)
p1 <- ggplot(data = athlete, aes(x = tyeatdata,
                           y = Height,
                           color = Sex
                           ))
p1 + geom_point()
## Warning: Removed 60171 rows containing missing values (geom_point).
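
The assignment also asks for a smoothing line and for a title, subtitle and axis labels. One way this could look is sketched below, assuming a numeric `Year` on the x-axis rather than the binned version. The data frame `toy`, its simulated heights and the label texts are all invented for illustration.

```r
library(ggplot2)

# Invented toy data standing in for `athlete` (heights are simulated).
set.seed(1)
toy <- data.frame(
  Year   = rep(seq(1896, 2016, by = 8), each = 10),
  Sex    = factor(rep(c("F", "M"), times = 80)),
  Height = rnorm(160, mean = 175, sd = 8)
)

p <- ggplot(toy, aes(x = Year, y = Height, color = Sex)) +
  geom_point(alpha = 0.4) +                       # semi-transparent points reduce overplotting
  geom_smooth(method = "lm", formula = y ~ x,     # one linear trend line per Sex
              se = FALSE) +
  labs(
    title    = "Athlete height over time",
    subtitle = "Toy data, one linear trend per sex",
    x        = "Year",
    y        = "Height (cm)",
    color    = "Sex"
  )
print(p)
```

A linear fit is only one choice. On the full data, leaving `method` unset lets `geom_smooth()` pick a smoother suited to the number of rows.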

2. Year and Age: The ages of athletes have not changed much over time. However, since 1980 there have been more female athletes aged 25 to 50 than before.

p2 <- ggplot(data = athlete, aes(x = Year,
                           y = Age,
                           color = Sex
                           ))
p2 + geom_point()
## Warning: Removed 9474 rows containing missing values (geom_point).
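
The claim about female athletes aged 25 to 50 can also be checked numerically. The sketch below counts such rows before and after 1980 on a small invented data frame; `toy`, `Period` and `female_25_50` are hypothetical names, not part of the original analysis.

```r
library(dplyr)

# Invented toy data standing in for `athlete`.
toy <- data.frame(
  Year = c(1960, 1972, 1976, 1984, 1992, 2000, 2008, 2016),
  Sex  = c("F", "M", "F", "F", "F", "M", "F", "F"),
  Age  = c(22, 30, 27, 26, 31, 28, 24, 40)
)

female_25_50 <- toy %>%
  filter(Sex == "F", Age >= 25, Age <= 50) %>%    # women aged 25 to 50 only
  mutate(Period = if_else(Year >= 1980, "1980 and later", "before 1980")) %>%
  count(Period)                                   # number of rows per period
print(female_25_50)
```

In the real data an athlete appears once per event, so this counts entries rather than distinct people. Counting distinct `ID` values within each period would be the stricter variant.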

3. Medal and BMI: BMI (body mass index) is computed from the Height and Weight columns as weight in kilograms divided by the square of height in meters. According to Wikipedia, BMI values for adults are interpreted as follows.

BMI Categories:

  • Underweight = <18.5
  • Normal weight = 18.5–24.9
  • Overweight = 25–29.9
  • Obesity = BMI of 30 or greater
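
As a worked example of the formula and the categories above, the sketch below computes BMI for a few height and weight pairs and assigns each a category with `cut()`. The first three pairs appear in the `head(athlete)` output; the last two are invented to reach the outer categories.

```r
# Heights in centimetres and weights in kilograms.
height_cm <- c(180, 170, 185, 160, 180)
weight_kg <- c(80, 60, 82, 80, 50)

# BMI = weight (kg) / height (m)^2; dividing by 100 converts cm to m.
bmi <- weight_kg / (height_cm / 100)^2

# right = FALSE makes the intervals [18.5, 25), [25, 30) and [30, Inf),
# matching the category boundaries listed above.
bmi_category <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal weight", "Overweight", "Obesity"),
  right  = FALSE
)
print(data.frame(height_cm, weight_kg, bmi = round(bmi, 2), bmi_category))
```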

We observe no obvious difference in BMI among the winners of different medals.

BMIdata <- athlete  %>% 
  mutate(BMI = Weight/(Height*Height*0.0001))
  
p3 <- ggplot(data = BMIdata, aes(x = Medal,
                           y = BMI,
                           color = Sex
                           ))
p3 + geom_point()
## Warning: Removed 64263 rows containing missing values (geom_point).
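
Because `Medal` is categorical, the points in this plot stack in vertical columns, and a per-medal summary can make the comparison easier to read. The sketch below uses invented toy data and hypothetical names (`toy`, `bmi_by_medal`); it drops rows without a medal and reports the median BMI per medal.

```r
library(dplyr)

# Invented toy data standing in for `athlete`.
toy <- data.frame(
  Medal  = c("Gold", "Gold", "Silver", "Silver", "Bronze", NA),
  Height = c(180, 170, 185, 175, 160, 190),
  Weight = c(80, 60, 82, 70, 64, NA)
)

bmi_by_medal <- toy %>%
  filter(!is.na(Medal)) %>%                       # keep medal winners only
  mutate(BMI = Weight / (Height / 100)^2) %>%     # same formula as in the report
  group_by(Medal) %>%
  summarise(n = n(), median_BMI = median(BMI, na.rm = TRUE))
print(bmi_by_medal)
```

On the full data, similar medians across the three medals would support the observation that BMI does not differ much between medal winners.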