Find an interesting dataset and prepare a short report (in R Markdown) consisting of:
* a short description of the dataset,
* 3 scatterplots presenting interesting relationships between variables,
* brief comments describing the results obtained.
Please remember to:
- use different aesthetics to properly show the insights from the data,
- add a smoothing line,
- label each graph with a proper title, subtitle, and axis descriptions.
The homework is due by the next class, and the deadline cannot be postponed under any circumstances.
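The dataset, "120 Years of Olympic History: Athletes and Results", has one row per athlete per event: 271,116 rows and 15 variables, including sex, age, height, weight, team, Games, sport, event, and medal.
The data are read from the athlete_events.csv file of the Kaggle dataset below.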
Source: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results
The Winter and Summer Games were held in the same year until 1992. After that they were staggered: the Winter Games moved to a four-year cycle starting in 1994, with the Summer Games following in 1996, the Winter Games in 1998, and so on.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
athlete <- read.csv('/Users/jeank4723/Desktop/Advance VR/1/Data/athlete_events.csv')
head(athlete)
## ID Name Sex Age Height Weight Team NOC
## 1 1 A Dijiang M 24 180 80 China CHN
## 2 2 A Lamusi M 23 170 60 China CHN
## 3 3 Gunnar Nielsen Aaby M 24 NA NA Denmark DEN
## 4 4 Edgar Lindenau Aabye M 34 NA NA Denmark/Sweden DEN
## 5 5 Christine Jacoba Aaftink F 21 185 82 Netherlands NED
## 6 5 Christine Jacoba Aaftink F 21 185 82 Netherlands NED
## Games Year Season City Sport
## 1 1992 Summer 1992 Summer Barcelona Basketball
## 2 2012 Summer 2012 Summer London Judo
## 3 1920 Summer 1920 Summer Antwerpen Football
## 4 1900 Summer 1900 Summer Paris Tug-Of-War
## 5 1988 Winter 1988 Winter Calgary Speed Skating
## 6 1988 Winter 1988 Winter Calgary Speed Skating
## Event Medal
## 1 Basketball Men's Basketball <NA>
## 2 Judo Men's Extra-Lightweight <NA>
## 3 Football Men's Football <NA>
## 4 Tug-Of-War Men's Tug-Of-War Gold
## 5 Speed Skating Women's 500 metres <NA>
## 6 Speed Skating Women's 1,000 metres <NA>
athlete$Sex <- as.factor(athlete$Sex)
athlete$Season <- as.factor(athlete$Season)
athlete$Medal <- as.factor(athlete$Medal)
str(athlete)
## 'data.frame': 271116 obs. of 15 variables:
## $ ID : int 1 2 3 4 5 5 5 5 5 5 ...
## $ Name : chr "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
## $ Sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 1 1 1 1 1 ...
## $ Age : int 24 23 24 34 21 21 25 25 27 27 ...
## $ Height: int 180 170 NA NA 185 185 185 185 185 185 ...
## $ Weight: num 80 60 NA NA 82 82 82 82 82 82 ...
## $ Team : chr "China" "China" "Denmark" "Denmark/Sweden" ...
## $ NOC : chr "CHN" "CHN" "DEN" "DEN" ...
## $ Games : chr "1992 Summer" "2012 Summer" "1920 Summer" "1900 Summer" ...
## $ Year : int 1992 2012 1920 1900 1988 1988 1992 1992 1994 1994 ...
## $ Season: Factor w/ 2 levels "Summer","Winter": 1 1 1 1 2 2 2 2 2 2 ...
## $ City : chr "Barcelona" "London" "Antwerpen" "Paris" ...
## $ Sport : chr "Basketball" "Judo" "Football" "Tug-Of-War" ...
## $ Event : chr "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
## $ Medal : Factor w/ 3 levels "Bronze","Gold",..: NA NA NA 2 NA NA NA NA NA NA ...
#
# sum_of_na <- function(x){
# sum(is.na(x))
# }
# athlete %>% summarise(
# across(everything(), sum_of_na)
# )
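The commented-out block above counts missing values per column. A minimal base R sketch of the same idea follows; the `toy` data frame and its values are hypothetical stand-ins, not the Olympic data.

```r
# Toy data frame (hypothetical values) with a few missing entries
toy <- data.frame(
  Height = c(180, 170, NA, NA),
  Weight = c(80, 60, NA, 75),
  Sex    = c("M", "M", "M", "F")
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Running the same call on `athlete` would show how many rows each plot is expected to drop.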
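1. Year and Height
Athlete height by 20-year period, colored by sex.
# Bin the Games year into 20-year periods.
# include.lowest = TRUE keeps the 1896 Games in the first bin instead of turning them into NA.
tyeatdata <- cut(athlete$Year, breaks = seq(1896, 2016, by = 20),
                 dig.lab = 4, include.lowest = TRUE)
p1 <- ggplot(data = athlete, aes(x = tyeatdata,
                                 y = Height,
                                 color = Sex
                                 ))
p1 + geom_point()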
## Warning: Removed 60171 rows containing missing values (geom_point).
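The 20-year binning used for this plot relies on `cut()`, whose default intervals are open on the left, so a value equal to the lowest break falls outside every bin. The sketch below illustrates this on a few hand-picked years (chosen only for illustration) and shows how `include.lowest = TRUE` changes the result.

```r
years  <- c(1896, 1900, 1916, 2016)
breaks <- seq(1896, 2016, by = 20)

# Default: intervals are (a, b], so 1896 matches no bin and becomes NA
default_bins <- cut(years, breaks = breaks, dig.lab = 4)

# include.lowest = TRUE closes the first interval on the left: [1896, 1916]
closed_bins <- cut(years, breaks = breaks, dig.lab = 4, include.lowest = TRUE)

print(default_bins)
print(closed_bins)
```

`dig.lab = 4` is what keeps the bin labels as four-digit years rather than scientific notation.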
2. Year and Age
The ages of athletes have not changed much over time. However, since 1980 there have been more female athletes aged 25 to 50 than before.
p2 <- ggplot(data = athlete, aes(x = Year,
y = Age,
color = Sex
))
p2 + geom_point()
## Warning: Removed 9474 rows containing missing values (geom_point).
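The assignment asks for a smoothing line and for a title, subtitle, and axis labels. One possible way to add them to a Year and Age scatterplot is sketched below. The `toy` data frame is randomly generated and hypothetical, the label texts are placeholders, and a linear smoother (`method = "lm"`) is an assumption made to keep the example small; it is not the plot from the report.

```r
library(ggplot2)

# Toy data (hypothetical, randomly generated; NOT the Olympic data)
set.seed(1)
toy <- data.frame(
  Year = rep(seq(1960, 2016, by = 4), each = 2),
  Sex  = factor(rep(c("F", "M"), times = 15)),
  Age  = round(rnorm(30, mean = 25, sd = 4))
)

p <- ggplot(toy, aes(x = Year, y = Age, color = Sex)) +
  geom_point(alpha = 0.6) +                                  # semi-transparent points reduce overplotting
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # one trend line per Sex
  labs(
    title    = "Athlete age over time",
    subtitle = "Toy data, one linear trend per sex",
    x        = "Year of the Games",
    y        = "Age (years)",
    color    = "Sex"
  )

print(p)
```

Because `color = Sex` is set in the top-level `aes()`, the smoother is fitted separately for each sex.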
3. Medal and BMI
BMI (body mass index) is computed from the Height and Weight columns as weight in kilograms divided by the square of height in meters. Wikipedia gives a table interpreting BMI values for adults.
There is no obvious difference in BMI between the medal groups.
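As a worked example of the BMI formula, the sketch below uses the height and weight of the first two athletes in the `head()` output (180 cm and 80 kg, 170 cm and 60 kg). The variable names are illustrative only.

```r
# Height (cm) and Weight (kg) of the first two rows shown by head(athlete)
height_cm <- c(180, 170)
weight_kg <- c(80, 60)

# Convert centimetres to metres, then apply BMI = kg / m^2
height_m <- height_cm / 100
bmi <- weight_kg / height_m^2

# Same result as multiplying the squared centimetre value by 0.0001
bmi_alt <- weight_kg / (height_cm * height_cm * 0.0001)

print(round(bmi, 2))  # 24.69 20.76
```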
BMIdata <- athlete %>%
  # Height is in centimetres, so divide by 100 to get metres before squaring
  mutate(BMI = Weight / (Height / 100)^2)
p3 <- ggplot(data = BMIdata, aes(x = Medal,
y = BMI,
color = Sex
))
p3 + geom_point()
## Warning: Removed 64263 rows containing missing values (geom_point).
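A scatterplot over a categorical axis can hide differences because many points overlap, so a per-group summary is one way to check the visual impression numerically. The sketch below computes the mean BMI per medal on a small hypothetical `toy` data frame; the values are made up for illustration and say nothing about the real data.

```r
# Toy data (hypothetical values, NOT the Olympic data)
toy <- data.frame(
  Medal  = c("Gold", "Gold", "Silver", "Silver", "Bronze", "Bronze"),
  Height = c(180, 170, 185, 175, 160, 190),  # centimetres
  Weight = c(80, 60, 82, 70, 55, 90)         # kilograms
)
toy$BMI <- toy$Weight / (toy$Height / 100)^2

# Mean BMI within each medal group (base R formula interface)
bmi_by_medal <- aggregate(BMI ~ Medal, data = toy, FUN = mean)
print(bmi_by_medal)
```

Applied to `BMIdata`, rows with a missing `Medal` or `BMI` would be dropped by `aggregate()`, so non-medalists would need to be recoded first if they should be included.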