HW 1

Question 1: 4 levels of measurement

Nominal: data can only be categorized
Ordinal: data can be categorized and ranked
Interval: data can be categorized and ranked, and the values are evenly spaced
Ratio: data can be categorized and ranked, the values are evenly spaced, and there is a natural zero point

#Nominal
gender <- c("male","female","male","female")
#This is nominal because the data can be categorized as male or female, but neither category is considered higher or lower than the other

#Ordinal
spice_level <- c("mild","spicy","extra spicy")
#This is ordinal because the data can be categorized by spice level and ranked from least spicy to most spicy

#Interval
time <- c(1,2,3,4,5,6,7,8,9,10)
#This is interval because the values are ordered and evenly spaced, but 0 in time doesn't mean no time

#Ratio
income <- c(1000,2000,3000,4000)
#This is ratio because it has all the properties of interval data plus a natural zero point: 0 means no income
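
A plain character vector does not store the ranking that makes spice level ordinal. As a minimal sketch (the name spice_factor is introduced here for illustration and is not part of the original answer), an ordered factor can record the order explicitly so that comparisons work:

```r
# Same values as spice_level above
spice_level <- c("mild", "spicy", "extra spicy")

# spice_factor is an illustrative name; levels are listed from least to most spicy
spice_factor <- factor(spice_level,
                       levels = c("mild", "spicy", "extra spicy"),
                       ordered = TRUE)

is.ordered(spice_factor)            # checks that the ranking is stored
spice_factor[1] < spice_factor[3]   # compares "mild" with "extra spicy" by rank
as.integer(spice_factor)            # rank positions of each value
```

A nominal variable such as gender would stay an unordered factor, since comparing its categories with < has no meaning.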

Question 2: Scatter Plot

I combined the x and y data into a data frame. Then I used the ggplot2 library to map x to the x-axis and y to the y-axis. I used the geom_point() function to draw the scatter plot and chose the minimal theme.

library(e1071)
library(ggplot2)
# Use = (not <-) inside data.frame() so the columns are actually named x and y
mydata <- data.frame(
  x = c(12,17,12,11,10,7,5,14),
  y = c(4.2,5,6,9,11,12,8,15) )
ggplot(mydata, aes(x=x, y=y))+ 
  geom_point()+ 
  theme_minimal()
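
How the columns of the data frame get their names matters for aes(x=x, y=y). The sketch below (bad and good are illustrative names, and the values are the first three x and y points) compares the two ways of writing the arguments to data.frame():

```r
# With <- the argument is an unnamed expression, so data.frame() builds the
# column name from the whole expression; it also creates x and y as
# separate variables in the workspace as a side effect.
bad <- data.frame(x <- c(12, 17, 12), y <- c(4.2, 5, 6))

# With = the left-hand side becomes the column name.
good <- data.frame(x = c(12, 17, 12), y = c(4.2, 5, 6))

names(bad)    # not "x" and "y"
names(good)   # "x" and "y"
```

With the <- form a plot can still appear to work, because ggplot falls back to the workspace variables x and y rather than the data frame columns.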

Question 3

I assigned values to the event and percentage variables, then used %>% to pass the result of one step to the next. First I combined event and percentage into a new data frame, data4. Within data4, I created a new variable called label, which appends a "%" sign to each percentage value. Using the ggplot2 library, I created a pie chart with an empty x, y as the percentage variable, and fill=event so that the slices are grouped by event. For the labels, I mapped label to the label variable to show the percentage of each slice, and used coord_polar(theta="y") to convert the stacked bar chart into a pie chart.

For part b, there are 1100 total respondents and 40% think a cure for cancer would be found first, so we need 40% of 1100, which is 0.40*1100. I entered this calculation into R and got 440 people.

Question 3.a

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

event <- c("cure for cancer found","end of dependence on oil", "signs of life in outer space", "peace in the middle east", "other", "none will happen")
percentage <- c(40,27,12,8,6,7)

data4 <- data.frame(event,percentage) %>% 
  mutate(label=paste0(percentage,"%")) 

ggplot(data4,aes(x="",y=percentage,fill=event))+
  geom_col()+
  geom_label(aes(label=label),
             position=position_stack(vjust=0.5),
             show.legend=FALSE)+
  coord_polar(theta="y")+
  theme_void()

Question 3.b

#b. how many people think cancer would be found first
0.40*1100

## [1] 440

#There are 440 people since 40% of 1100 is 0.40*1100=440.
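
The same arithmetic extends to every event. As a sketch (total_respondents and people are names introduced here; the 1100 figure is the respondent count stated for part b), the percentages can be checked and converted to head counts in one step:

```r
event <- c("cure for cancer found", "end of dependence on oil",
           "signs of life in outer space", "peace in the middle east",
           "other", "none will happen")
percentage <- c(40, 27, 12, 8, 6, 7)
total_respondents <- 1100   # respondent count from part b

# A pie chart only makes sense if the shares cover the whole sample
sum(percentage)

# Multiply before dividing to keep the arithmetic on whole numbers
people <- percentage * total_respondents / 100
data.frame(event, percentage, people)
```

The first entry of people reproduces the 440 from part b, and the head counts add back up to the total number of respondents.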

Question 4

I chose the team Mia. First, we load the libraries and read in the Excel data.

library(readxl)
library(ggplot2)
nbadata <- read_excel("2020-2021 NBA Stats  Player Box Score  Advanced Metrics.xlsx")

Question 4.a

I used subset() to keep only the rows where TEAM=="Mia", then ncol() and nrow() to count the columns and rows.

ncol(subset(nbadata, TEAM=="Mia")) #29 columns for Mia Team

## [1] 29

nrow(subset(nbadata,TEAM=="Mia")) #16 rows for Mia Team

## [1] 16
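
For reference, dim() reports both counts at once. The sketch below uses a small made-up data frame (toy is hypothetical and is not the NBA spreadsheet) so that it runs without the Excel file:

```r
# Hypothetical toy data for illustration only, NOT the NBA data set
toy <- data.frame(
  TEAM = c("Mia", "Mia", "Bos", "Mia"),
  POS  = c("F", "G", "G", "C-F"),
  MPG  = c(14.2, 21.1, 30.0, 33.9)
)

mia <- subset(toy, TEAM == "Mia")

dim(mia)    # number of rows, then number of columns
nrow(mia)   # rows only
ncol(mia)   # columns only
```

Filtering rows with subset() never changes the number of columns, which is why ncol() on the Mia subset matches the full data set.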

Question 4.b

I created a new data set containing only the rows where TEAM=="Mia". Then I created another data set with only the 8 selected variables, and used newdata2[1:3,] to display the first 3 rows.

newdata1 <- subset(nbadata, TEAM=="Mia")
newdata2 <- newdata1[c("FULL NAME", "TEAM", "POS", "AGE", "GP", "MPG", "2P%","3P%")]
newdata2[1:3,]

## # A tibble: 3 × 8
##   `FULL NAME`      TEAM  POS     AGE    GP   MPG `2P%` `3P%`
##   <chr>            <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Precious Achiuwa Mia   F      21.4    31  14.2 0.573 0    
## 2 Bam Adebayo      Mia   C-F    23.6    29  33.9 0.573 0.333
## 3 Avery Bradley    Mia   G      30.2    10  21.1 0.536 0.421

Question 4.c

mean(newdata2$MPG)

## [1] 21.43125

median(newdata2$MPG)

## [1] 21.3

library("modeest")

## 
## Attaching package: 'modeest'

## The following object is masked from 'package:e1071':
## 
##     skewness

mfv(newdata2$MPG)

##  [1]  8.8  9.8 10.7 10.9 14.2 15.0 15.4 21.1 21.5 26.0 26.5 29.5 32.9 33.3 33.4
## [16] 33.9
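
mfv() returns all 16 values here, which suggests that every MPG value occurs exactly once, so no single value stands out as the mode. A minimal base R sketch, assuming the 16 values printed above are the complete MPG column for Mia (mpg, counts, and modes are illustrative names):

```r
# MPG values copied from the mfv() output above
mpg <- c(8.8, 9.8, 10.7, 10.9, 14.2, 15.0, 15.4, 21.1,
         21.5, 26.0, 26.5, 29.5, 32.9, 33.3, 33.4, 33.9)

counts <- table(mpg)     # how often each value occurs
max(counts)              # highest frequency of any value

# Every value that reaches the highest frequency is a mode
modes <- as.numeric(names(counts)[counts == max(counts)])
length(modes)

mean(mpg)
median(mpg)
```

When the highest frequency is 1, every observation ties for most frequent, so the mean and median are the more informative summaries for this variable.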

ggplot(newdata2,aes(x=MPG))+
  geom_histogram(binwidth=5, fill="purple", color="yellow")+
  labs(x="MPG",y="count")+
  theme_minimal()

Question 4.d

sd(newdata2$`3P%`)

## [1] 0.2043189

max(newdata2$`3P%`)

## [1] 1

min(newdata2$`3P%`)

## [1] 0

library(DescTools)

## Registered S3 method overwritten by 'httr':
##   method         from  
##   print.response rmutil

range(newdata2$`3P%`)

## [1] 0 1
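
Base R's range() returns the minimum and maximum as a pair. If the range is wanted as a single number, it is the difference between the two. A short sketch, using only the three 3P% values displayed in part b plus the reported maximum of 1 (three_pt is an illustrative vector, not the full column):

```r
# Illustrative subset: 3P% values shown in 4.b plus the reported maximum
three_pt <- c(0, 0.333, 0.421, 1)

r <- range(three_pt)            # minimum and maximum
diff(r)                         # range as a single number
max(three_pt) - min(three_pt)   # same result, computed directly
```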

Question 4.e

# fill is set to a constant in geom_boxplot(), so it is not mapped to POS in aes()
ggplot(newdata2,aes(x=AGE, y=POS))+
  geom_boxplot(fill="lightgreen")+
  labs(title="Age by POS",
       x="Age",
       y="POS")+
  theme_minimal()

This chart shows the distribution of age grouped by POS. Given that there are no outliers in this data set, POS G-F has the highest median and the highest age, since its horizontal line (whisker) stretches farthest to the right. POS G has the lowest age in the data set, as its horizontal line reaches farthest to the left. Also, POS C-F has only one observation, so it appears as just a vertical line.

Question 4.f

freq_table <- newdata2 %>% 
  group_by(POS) %>% 
  count()

ggplot(newdata2, aes(x=POS))+
  geom_bar(fill="lightpink", color="darkred")+
  labs(x="POS",y="frequency",title="Frequency of POS")+
  theme_minimal()