DATA 606 Project Proposal

Load Libraries
Data Preparation
Research question
Cases
Data collection
Type of study
Data Source
Dependent Variable
Independent Variable
Relevant summary statistics

Load Libraries

library(knitr)
library(DT)
library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(openintro)

## Please visit openintro.org for free statistics materials

## 
## Attaching package: 'openintro'

## The following object is masked from 'package:ggplot2':
## 
##     diamonds

## The following objects are masked from 'package:datasets':
## 
##     cars, trees

library(ggvis)

## 
## Attaching package: 'ggvis'

## The following objects are masked from 'package:plotly':
## 
##     add_data, hide_legend

## The following object is masked from 'package:ggplot2':
## 
##     resolution

library(tidyr)

Data Preparation

Reorganize the location fields so that they correctly show where each competition took place, and clean up the age variable by removing faulty values.

data <- read.csv("https://raw.githubusercontent.com/crarnouts/CUNY-MSDS/master/lifting_data.csv", header = TRUE)
data$Event.Date.1 <- NULL
data$Event.Year.1 <- NULL
data <- data %>% filter(AGE<75)

data <- data %>% separate(Event.Location, c("Country","City"),sep ="-")

## Warning: Expected 2 pieces. Additional pieces discarded in 411 rows [3268,
## 3269, 3270, 3271, 3272, 3273, 3274, 3275, 3276, 3277, 3278, 3279, 3280,
## 3281, 3282, 3283, 3284, 3285, 3286, 3287, ...].

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 14122 rows
## [22655, 22656, 22657, 22658, 22659, 22660, 22661, 22662, 22663, 22664,
## 22665, 22666, 22667, 22668, 22669, 22670, 22671, 22672, 22673, 22674, ...].

data$Country_2 <- data$Country
data <- data %>% separate(Country_2, c("City_2","State"),sep =",")

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 37587
## rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
## 20, ...].

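# Use 0 as a temporary sentinel for rows with no state, so the comparisons below never see NA
data$State[which(is.na(data$State))] <- 0

# Rows that have a state are US events: set the country to "USA" and take the city from City_2
data$Country <- ifelse(data$State != 0, "USA", data$Country)
data$City <- ifelse(data$State != 0, data$City_2, data$City)
data$City_2 <- NULL
# Reorder columns so that State (column 22) sits next to the other location fields
data <- data[c(1,2,3,4,22,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21)]
# Turn the sentinel back into NA
data$State <- ifelse(data$State == 0, NA, data$State)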
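
The location clean-up above can also be sketched without the temporary 0 sentinel and without the `separate()` warnings. This is only a sketch on a toy data frame: the location strings below are hypothetical and merely assume the two formats implied by the code above ("Country-City" for international events, "City, State" for US events). It uses the `extra` and `fill` arguments of `tidyr::separate()` in place of the approach above.

```r
library(dplyr)
library(tidyr)

# Hypothetical location strings (assumed formats, not taken from the real data)
toy <- data.frame(
  Event.Location = c("GER-Berlin", "Chicago, IL", "CHN-Ningbo"),
  stringsAsFactors = FALSE
)

cleaned <- toy %>%
  # extra = "merge" keeps any text after the first "-";
  # fill = "right" pads missing pieces with NA without a warning
  separate(Event.Location, c("Country", "City"), sep = "-",
           extra = "merge", fill = "right", remove = FALSE) %>%
  # A comma in the first piece marks a "City, State" entry
  separate(Country, c("City_2", "State"), sep = ",\\s*",
           extra = "merge", fill = "right", remove = FALSE) %>%
  mutate(
    City    = ifelse(!is.na(State), City_2, City),    # US rows: city comes from City_2
    Country = ifelse(!is.na(State), "USA", Country)   # US rows: country becomes "USA"
  ) %>%
  select(Event.Location, Country, City, State)

print(cleaned)
```

Testing `is.na(State)` directly avoids converting `State` to a sentinel value and back, at the cost of diverging slightly from the code used in this proposal.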

Research question

My research question looks more closely at the Sinclair coefficient in weightlifting, a metric intended to compare the relative strength of lifters across weight classes. The method answers the question “What would be the total of an athlete weighing x kg if he/she were an athlete in the heaviest class of the same level of ability?”, and is given by the formula:

ACTUAL TOTAL × SINCLAIR COEFFICIENT = SINCLAIR TOTAL

My goal is to understand this metric more adequately and to determine whether there is anything I could do to improve it. I also want to find out whether certain lifts depend more heavily on the lifter's bodyweight than others.
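
The formula leaves the coefficient itself unspecified. Below is a minimal sketch, assuming the commonly published form in which the coefficient is 10^(A * log10(x / b)^2) for a bodyweight x below a reference bodyweight b, and 1 otherwise. The function names, and the values of `A` and `b` used in the example call, are placeholders for illustration only. They are not the official constants, which are revised periodically and should be taken from the published tables.

```r
# Sinclair-style coefficient (assumed functional form; A and b are supplied by the caller)
sinclair_coefficient <- function(body_weight, A, b) {
  # Lifters at or above the reference bodyweight b receive a coefficient of 1
  ifelse(body_weight >= b, 1, 10^(A * log10(body_weight / b)^2))
}

# ACTUAL TOTAL x SINCLAIR COEFFICIENT = SINCLAIR TOTAL
sinclair_total <- function(actual_total, body_weight, A, b) {
  actual_total * sinclair_coefficient(body_weight, A, b)
}

# Illustrative call with placeholder constants (NOT official values)
example_total <- sinclair_total(actual_total = 300, body_weight = 77,
                                A = 0.75, b = 175)
print(example_total)
```

Keeping `A` and `b` as arguments makes it easy to refit them on this dataset later, which is one way the metric might be examined or improved.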

Cases

What are the cases, and how many are there? Each case is a single lifting performance, and there are 52347 of them, spanning many years, locations, countries, and individuals. Each case includes the lifter's bodyweight and lifting results, along with a number of other attributes of the lifter.

Data collection

Describe the method of data collection. I found this data through a research paper on the Sinclair coefficient, linked from https://www.reddit.com/r/weightlifting/comments/6wjdxm/analysis_of_weightlifting_data/

Type of study

This is an observational study.

Data Source

https://www.reddit.com/r/weightlifting/comments/6wjdxm/analysis_of_weightlifting_data/

Dependent Variable

The response variables are the lifting results: the snatch result and the clean and jerk result.

Independent Variable

The quantitative explanatory variables include bodyweight and age, both of which can have a tremendous effect on an individual's lifting numbers. One qualitative variable is the location or country the individual is from.
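
One way to ask whether a lift depends more heavily on bodyweight is to fit the same linear model to each response and compare the bodyweight slopes. The sketch below uses simulated data only: the column names match the ones in this proposal, but every value, including the slopes built into the simulation, is made up for illustration and says nothing about the real results.

```r
set.seed(606)
n <- 200

# Simulated lifters (hypothetical values, not drawn from the real dataset)
sim <- data.frame(
  Body.Weight = runif(n, 56, 110),
  AGE         = sample(18:35, n, replace = TRUE)
)
sim$Snatch.Result       <- 40 + 1.0 * sim$Body.Weight + rnorm(n, sd = 8)
sim$Clean...Jerk.Result <- 50 + 1.3 * sim$Body.Weight + rnorm(n, sd = 10)

# Same predictors for both responses, so the bodyweight slopes are comparable
fit_snatch <- lm(Snatch.Result ~ Body.Weight + AGE, data = sim)
fit_cj     <- lm(Clean...Jerk.Result ~ Body.Weight + AGE, data = sim)

slopes <- c(
  snatch     = unname(coef(fit_snatch)["Body.Weight"]),
  clean_jerk = unname(coef(fit_cj)["Body.Weight"])
)
print(slopes)
```

On the real data the same two `lm()` calls could be run on `data` directly, ideally within one gender at a time, since the relationship is unlikely to be the same for both.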

Relevant summary statistics

data_2 <- data %>% filter(Gender=="Male" & AGE==24 & Nation == "USA")

ggplot(data_2, aes(x = Body.Weight, y = Snatch.Result)) +
    geom_point(shape = 1)

ggplot(data_2, aes(x = Body.Weight, y = Clean...Jerk.Result)) +
    geom_point(shape = 1)

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plots, boxplots, etc.). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

ggplot(data, aes(x = AGE, y = Snatch.Result, color = Country)) +
    geom_point(shape = 1)

ggplot(data, aes(x = AGE, y = Snatch.Result, color = Gender)) +
    geom_point(shape = 1) +
    scale_colour_hue(l = 50) +        # Use a slightly darker palette than normal
    geom_smooth(method = "lm",        # Add linear regression lines
                se = FALSE,           # Don't add shaded confidence region
                fullrange = TRUE)     # Extend regression lines
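
The plots above cover the visualization part of the prompt. For the numeric summaries, a minimal sketch with `dplyr` is shown below. It runs on a toy data frame whose column names match this dataset but whose values are hypothetical. The same pipeline could be pointed at `data` instead.

```r
library(dplyr)

# Toy rows with hypothetical values (not taken from the real dataset)
toy <- data.frame(
  Gender              = c("Male", "Male", "Female", "Female"),
  Body.Weight         = c(77, 85, 58, 63),
  Snatch.Result       = c(140, 150, 85, 95),
  Clean...Jerk.Result = c(170, 185, 105, 115)
)

summary_by_gender <- toy %>%
  group_by(Gender) %>%
  summarise(
    n           = n(),
    mean_weight = mean(Body.Weight, na.rm = TRUE),
    mean_snatch = mean(Snatch.Result, na.rm = TRUE),
    sd_snatch   = sd(Snatch.Result, na.rm = TRUE),
    mean_cj     = mean(Clean...Jerk.Result, na.rm = TRUE),
    sd_cj       = sd(Clean...Jerk.Result, na.rm = TRUE),
    .groups     = "drop"
  )

print(summary_by_gender)
```

For a quick overall view of single columns, base R's `summary()` on `data$Snatch.Result` or `data$Body.Weight` gives the five-number summary and mean.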