Discussion 2

Author

Allison Shrivastava

#install.packages("BiocManager")
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggrepel)
library(BiocManager)
#BiocManager::install("Rgraphviz")

Write one paragraph each (in your own words) describing what are classes and one paragraph on what are data structures (with examples)

A data class describes the type of a value and tells R how that value can be handled. A Date is stored so that R can order it and do date arithmetic. A numeric value can be handled like a number, that is, added to or subtracted from. A character value is a string of text with no numeric meaning. A factor is a categorical variable with a fixed set of levels, which can optionally be declared as ordered.
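As a quick check of the classes described above, here is a minimal sketch using made-up values (the variable names `d`, `n`, `s`, `f`, and `o` are only for illustration). It also shows that a factor is only treated as ordered when it is declared that way.

```r
# Hypothetical example values, chosen only for illustration
d <- as.Date("2024-01-15")
n <- 3.5
s <- "hello"
f <- factor(c("low", "high", "low"))                 # unordered categories
o <- factor(c("low", "high", "low"),
            levels = c("low", "high"), ordered = TRUE)  # ordered categories

class(d)     # "Date"
class(n)     # "numeric"
class(s)     # "character"
class(f)     # "factor"
class(o)     # "ordered" "factor"

d + 1        # date arithmetic works: "2024-01-16"
o[1] < o[2]  # comparisons work on ordered factors: TRUE
```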

Data structures in R are containers that organize values. A vector holds values of the same type stored in an order. A list can hold values of varying types and lengths. A data frame is a collection of vectors (columns) of equal length, which can be of different types if need be. Another structure is a tibble, the tidyverse version of a data frame, which prints only the first few rows along with each column's data type and is often used to display results calculated from a data frame.
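A small base R sketch of these structures, using made-up values (all object names here are hypothetical):

```r
# Vector: every element has the same type
v <- c(1.5, 2, 3)

# Mixing types in a vector silently coerces everything to one type
mixed <- c(1, "a")
class(mixed)        # "character"

# List: elements can differ in type and length
l <- list(1, "a", TRUE, c(1, 2))
length(l)           # 4

# Data frame: equal-length columns, each with its own type
df <- data.frame(id = 1:3, name = c("a", "b", "c"))
sapply(df, class)   # id is "integer", name is "character"

# A data frame is itself a list of columns
is.list(df)         # TRUE
```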

Pick a dataset (from base R, AER package or even the titanic dataset), and apply the two commands on your data. What do you find, and does it make sense?

Using the class function on the USArrests data, I can see that it is a data frame. Looking at a specific variable within the data, I can see that the variable “Murder” is numeric. By using the str function I can see the number of observations (50), the number of variables (4), the names of those variables, the class of each variable, and the first few values of each. This makes sense: Murder and Rape contain decimal values and are stored as num, while Assault and UrbanPop contain whole numbers and are stored as int.

data("USArrests")

class(USArrests)

[1] "data.frame"

class(USArrests$Murder)

[1] "numeric"

str(USArrests)

'data.frame':   50 obs. of  4 variables:
 $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
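To check the class of every column at once rather than one variable at a time, one option is `sapply()` with `class`. This is just a sketch of an alternative to calling `class()` on each column; the name `col_classes` is only for illustration.

```r
data("USArrests")

# Apply class() to each column of the data frame
col_classes <- sapply(USArrests, class)
print(col_classes)

# Number of rows (observations) and columns (variables)
dim(USArrests)
```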

Hover over two functions and guess what the function does based on the glimpse in the hover

The hover for the merge function shows (x, y, …), which implies combining two data objects into one structure. You could use this to combine two data frames into one based on a matching variable.

The slice_max function takes a data frame, followed by the variable to order the slice by (for example, descending order of share) and the number of observations to include in the slice. This would let you make a list of the top ten of something, for example.

::: {.cell}

```{.r .cell-code}
#commenting out as the doc will not render if there isn't a value in the function
#  merge()

 #   slice_max()
```
:::
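To see whether the guesses above hold, here is a minimal sketch that calls both functions on small made-up data frames. The data, and the names `sales`, `labels`, `merged`, and `top2`, are hypothetical; `slice_max()` comes from dplyr, which is assumed to be installed.

```r
library(dplyr)

# Hypothetical data, made up for illustration
sales  <- data.frame(id = c(1, 2, 3, 4), share = c(0.10, 0.40, 0.25, 0.15))
labels <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))

# merge() joins on the shared column; by default only ids found in both are kept
merged <- merge(sales, labels, by = "id")
print(merged)

# slice_max() keeps the n rows with the largest values of the ordering variable
top2 <- slice_max(merged, order_by = share, n = 2)
print(top2)
```

Here id 4 is dropped by the merge because it has no match in `labels`, and the two rows with the largest share are kept in descending order.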

Now, write your own function

#function to calculate the volume of a cylinder 
volume<-function(radius, height) {
  volume<-pi*radius^2*height
  return(volume)
}

calc_volume<-volume(radius=1, height=3)
print(calc_volume)

[1] 9.424778

Please explain Bayes Theorem in your own words, and give an example.

Bayes' Theorem allows a probability to be updated as new information is given. For example, if, while calculating the probability that it will rain, it is discovered that there is extensive cloud coverage in the area, Bayes' Theorem can be used to adjust the probability of rain based on how likely cloud coverage is when it rains.

\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
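A minimal numeric sketch of the rain example, with A = rain and B = cloud coverage. All three input probabilities are made up purely for illustration and do not come from any data.

```r
# Hypothetical probabilities (assumptions, not real weather data)
p_rain               <- 0.2   # P(A): prior probability of rain
p_cloud_given_rain   <- 0.9   # P(B|A): clouds when it rains
p_cloud_given_norain <- 0.3   # P(B|not A): clouds when it does not rain

# P(B): total probability of cloud coverage
p_cloud <- p_cloud_given_rain * p_rain +
  p_cloud_given_norain * (1 - p_rain)

# P(A|B): updated probability of rain after seeing clouds
p_rain_given_cloud <- p_cloud_given_rain * p_rain / p_cloud
print(p_rain_given_cloud)
```

With these assumed numbers, seeing clouds raises the probability of rain from 0.2 to about 0.43.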

Now solve the guided practice problem using a tree diagram

Define everything here so I can keep it straight: F = garage full; A = academic event, P(A) = 0.35, conditional P(F|A) = 0.25; S = sporting event, P(S) = 0.2, conditional P(F|S) = 0.7; N = no event, P(N) = 0.45, conditional P(F|N) = 0.05.

#### first, solve using Bayes' theorem
prob=(0.7*0.2)/((0.7*0.2)+(0.25*0.35)+(0.05*0.45))
print(prob)

[1] 0.56

## now onto the tree
#make the data 
nodes<-data.frame(x=c(1,3,3,3,5,5,5,5,5,5),
                  y=c(5,8,5,2,9,7,6,4,3,1),
                  name=c("start","academic\n0.35","sporting\n0.2",
                         "no event\n0.45", "full\n0.25",
                         "not full\n0.75","full\n0.7",
                        "not full\n0.3","full\n0.05","not full\n0.95"))


branches<-data.frame(x=c(1,1,1,3,3,3,3,3,3),
                     y=c(5,5,5,8,8,5,5,2,2),
                     xend=c(3,3,3,5,5,5,5,5,5),
                     yend=c(8,5,2,9,7,6,4,3,1)
)

#plot
ggplot()+
  geom_segment(data=branches,aes(x=x,y=y, xend=xend, yend=yend),
               linewidth= 0.6)+
      
  geom_text_repel(data=nodes,
                  aes(x=x, y=y, label=name),
                  size=3,
                  box.padding=0.6,
                  point.padding =0.6,
                  force=2,
                  max.overlaps = Inf)+
  xlim(0.5,5.5)+
  ylim(0.5,9.5)+
  labs(title = "tree diagram",
       x = "",
       y = "")+
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 14),
    legend.position = "top",
    legend.title = element_blank()
  )

#### now add up the possibilities for a full garage
academic_full<-0.35*0.25
sporting_full<-0.2*0.7
no_event_full<-0.45*0.05

prob_full<-academic_full+sporting_full+no_event_full

## now that we have the full garage prob, add in the condition of there being a sporting event

sporting_full/prob_full

[1] 0.56
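The same calculation can be done for all three event types at once with named vectors, which also confirms that the three conditional probabilities sum to 1. This is a sketch using the values from the problem; the vector names are only for illustration.

```r
# Priors and conditional probabilities of a full garage, from the problem
prior      <- c(academic = 0.35, sporting = 0.20, no_event = 0.45)
likelihood <- c(academic = 0.25, sporting = 0.70, no_event = 0.05)

# Joint probability of each event together with a full garage
joint <- prior * likelihood

# Dividing by P(full) gives P(event | full) for every event
posterior <- joint / sum(joint)
print(round(posterior, 2))

# The posteriors should sum to 1
sum(posterior)
```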