PROBLEM SET 1

When you roll a fair die 3 times, how many possible outcomes are there?

Each roll has 6 possible outcomes. For 2 rolls there are 6 x 6 = 36 possible outcomes, and for 3 rolls there are 6 x 6 x 6 = 216 possible outcomes. In general, n rolls give \(6^n\) possible outcomes.
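The count can also be checked by brute force. The following is a minimal sketch that lists every ordered outcome of three rolls with `expand.grid`; the names `rolls` and `n_outcomes` are introduced here for illustration only.

```r
# Enumerate every ordered outcome of three rolls of a six-sided die.
rolls <- expand.grid(first = 1:6, second = 1:6, third = 1:6)

# Each row is one possible outcome, so the row count is the answer.
n_outcomes <- nrow(rolls)
n_outcomes

# Compare the enumeration with the closed form 6^n for n = 3.
n_outcomes == 6^3
```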

What is the probability of getting a sum total of 3 when you roll a die two times?

Rolling a die 2 times gives 36 possible outcomes. Two of them produce a sum of 3: (1,2) and (2,1).

P(sum_of_three) = \(\frac{2}{36}\) = \(\frac{1}{18}\)
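The same result can be reproduced by enumeration. This sketch assumes a fair die, so that all 36 ordered pairs are equally likely; `two_rolls`, `sums` and `p_sum_three` are illustrative names.

```r
# All 36 ordered outcomes of two rolls.
two_rolls <- expand.grid(first = 1:6, second = 1:6)

# Sum of the two faces for each outcome.
sums <- two_rolls$first + two_rolls$second

# Fraction of equally likely outcomes whose sum is 3.
p_sum_three <- mean(sums == 3)
p_sum_three
```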

Assume a room of 25 strangers. What is the probability that two of them have the same birthday? Assume that all birthdays are equally likely and equal to 1/365 each. What happens to this probability when there are 50 people in the room?

Instead of comparing each person's birthday with those of the other 24 people, we first calculate the probability that no two of the 25 people share a birthday. Subtracting this probability from 1 gives the probability that at least 2 people have the same birthday.

n <- 25
b <- 365
# P(at least two share a birthday) = 1 - P(all n birthdays are distinct)
p_shared <- 1 - prod((b:(b - n + 1)) / b)
p_shared
## [1] 0.5686997

In a room of 25 strangers, the probability that at least 2 of them share a birthday is 56.87%.

Now let us compute probability with 50 strangers.

n <- 50
b <- 365
# P(at least two share a birthday) = 1 - P(all n birthdays are distinct)
p_shared <- 1 - prod((b:(b - n + 1)) / b)
p_shared
## [1] 0.9703736

In a room of 50 strangers, the probability that at least 2 of them share a birthday is 97.04%.

Doubling the number of people from 25 to 50 raises the probability by about 40 percentage points!
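Since the same calculation is repeated for 25 and 50 people, it can be wrapped in a small function. This is a sketch under the same assumption of 365 equally likely birthdays; the function name `p_shared_birthday` and its `days` parameter are introduced here and are not part of the original solution.

```r
# Probability that at least two of n people share a birthday,
# assuming `days` equally likely birthdays.
p_shared_birthday <- function(n, days = 365) {
  # With more people than days, a shared birthday is certain.
  if (n > days) return(1)
  # 1 - P(all n birthdays are distinct)
  1 - prod((days - seq_len(n) + 1) / days)
}

# Evaluate for the two room sizes in the problem.
probs <- sapply(c(25, 50), p_shared_birthday)
probs

# Difference in percentage points between 50 and 25 people.
(probs[2] - probs[1]) * 100
```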

PROBLEM SET 2

I used the tm package to create a corpus of the words in the document. I tried to remove punctuation, but the document is not entirely plain text and some of the quotation marks could not be removed. I tried various encoding options but still could not remove all of the punctuation.
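One possible workaround, sketched here in base R, is to strip the typographic quotes explicitly before removing the remaining punctuation. It assumes the leftover characters are the Unicode curly quotes U+2018, U+2019, U+201C and U+201D and that the file is UTF-8 encoded (in which case it could be read with `readLines(..., encoding = "UTF-8")`). The sample string below is hypothetical and is not taken from the assignment file.

```r
# Hypothetical sample text containing curly quotes (not from the assignment file).
sample_text <- "\u201cIt\u2019s a problem,\u201d she said."

# Remove curly single and double quotes explicitly.
no_curly <- gsub("[\u2018\u2019\u201c\u201d]", "", sample_text)

# Lowercase, then drop the remaining ASCII punctuation.
cleaned <- gsub("[[:punct:]]", "", tolower(no_curly))
cleaned
```

The same `gsub` call could be applied to `doc` before the corpus is built, but whether it removes every stray character depends on how the file was actually encoded.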

library("tm")
## Warning: package 'tm' was built under R version 3.2.2
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 3.2.2
# read the document
doc <- paste(readLines("assign6.sample.txt"), collapse=" ")
## Warning in readLines("assign6.sample.txt"): incomplete final line found on
## 'assign6.sample.txt'
# create corpus
corpus <- Corpus(VectorSource(doc))

# convert all letters to lowercase, and remove numbers and punctuation, since we are only dealing with words
corpus.p <- tm_map(corpus, content_transformer(tolower))
corpus.p <- tm_map(corpus.p, content_transformer(removeNumbers))
corpus.p <- tm_map(corpus.p, content_transformer(removePunctuation))

# find frequency of each word
dtm <- DocumentTermMatrix(corpus.p)
docTermMatrix <- inspect(dtm)
## <<DocumentTermMatrix (documents: 1, terms: 564)>>
## Non-/sparse entries: 564/0
## Sparsity           : 0%
## Maximal term length: 18
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs â<U+0080><U+0098>these    â<U+0080>      â<U+0080><U+009C>a â<U+0080><U+009C>for â<U+0080><U+009C>iâ<U+0080><U+0099>ve â<U+0080><U+009C>it â<U+0080><U+009C>itâ<U+0080><U+0099>s â<U+0080><U+009C>no â<U+0080><U+009C>right
##    1        1   1    1      1         1     1         2     1        1
##     Terms
## Docs â<U+0080><U+009C>that â<U+0080><U+009C>the â<U+0080><U+009C>they    â<U+0080><U+009C>threestrikesâ<U+0080>      â<U+0080><U+009C>we â<U+0080><U+009C>weâ<U+0080><U+0099>re â<U+0080><U+009C>yes
##    1       1      2       1                  1     2          1      1
##     Terms
## Docs about abundance abundant abuse abysmal according across act acting
##    1     6         1        1     3       1         1      1   1      1
##     Terms
## Docs administrationâ<U+0080><U+0099>s after aging agreed alabama    alabamaâ<U+0080>      alabamaâ<U+0080><U+0099>s
##    1                  1     5     1      1       4          1           1
##     Terms
## Docs alabaster almost also although among analyst and angel anything
##    1         1      2    1        1     3       1  38     1        2
##     Terms
## Docs appalled appetite approval april arbuthnot are argues arise armed
##    1        1        1        1     1         1   9      1     1     1
##     Terms
## Docs asked assaults assistant attention attorney autopsy average aware
##    1     2        1         1         1        1       1       1     1
##     Terms
## Docs back backward bad banned basic basics bathtub beaten because been
##    1    1        1   2      1     1      2       1      1       1    9
##     Terms
## Docs before began beginning believe bentley better beyond bigger birth
##    1      2     1         1       1       2      3      1      1     1
##     Terms
## Docs blind bodies botched both box budget build building built buried but
##    1     1      1       1    1   1      2     1        1     1      1   9
##     Terms
## Docs called calls cam came cameras can candidate capacity capita care case
##    1      1     1   1    2       1   2         1        2      1    1    1
##     Terms
## Docs caution chairman challenging change    changeâ<U+0080>      changing charlotte
##    1       1        1           1      2         1        2         1
##     Terms
## Docs child choices citizensâ<U+0080><U+0099> civil clean clinical colby cologne coming
##    1     2       1           1     1     1        1     2       1      1
##     Terms
## Docs commissioner committee commodity conditions congress constitutional
##    1            2         1         1          6        1              1
##     Terms
## Docs contact contraband convicted conviction corners corrections court
##    1       1          1         1          2       1          10     1
##     Terms
## Docs courts created crime crimes criminals crisis culture curb currency
##    1      1       1     1      3         1      1       1    1        1
##     Terms
## Docs custodial damning dangerously daughter dealing death december
##    1         1       1           1        1       1     1        1
##     Terms
## Docs defendants deliberate department departmentâ<U+0080><U+0099>s deprivation designed
##    1          1          1          7              1           1        1
##     Terms
## Docs disparages document doing donâ<U+0080><U+0099>t double down drowned drug drugs
##    1          1        1     1       2      1    1       1    2     1
##     Terms
## Docs    dynamiteâ<U+0080>      elderly employees enough environment equal even examiner
##    1           1       1         2      1           1     2    2        1
##     Terms
## Docs exchanged eyes faced faces failed family far    favorsâ<U+0080>      fearful
##    1         1    1     1     1      1      1   1         1       1
##     Terms
## Docs federal female few filled finally findings fix food for former
##    1       5      2   1      1       1        1   2    1  30      1
##     Terms
## Docs forward fresh from gave general george get getting give going good
##    1       1     1    3    1       1      1   4       1    1     1    1
##     Terms
## Docs gov government governor    governorâ<U+0080>      grave great group guard guards
##    1   1          4        1           1     1     1     1     2      2
##     Terms
## Docs guidelines guntoting had half happened harassed has have health
##    1          1         1   6    1        1        1   7    9      2
##     Terms
## Docs helped her here    hereâ<U+0080>      highest highly him hire hired his home how
##    1      1   3    2       1       1      1   2    1     1   1    1   2
##     Terms
## Docs iâ<U+0080><U+0099>ve ignoring important improve improved included includes
##    1      1        1         1       2        1        1        1
##     Terms
## Docs including indifference indigent inhumane initiative inmate inmates
##    1         1            1        1        1          2      1       4
##     Terms
## Docs inside instead institute institutions intervention interview into
##    1      2       1         1            1            1         2    3
##     Terms
## Docs investigate investigating investigation investigations issued    itâ<U+0080>     
##    1           1             1             3              1      2     1
##     Terms
## Docs itâ<U+0080><U+0099>s items its jail january jocelyn judiciary julia june just
##    1      4     2   4    1       2       1         1     2    1    6
##     Terms
## Docs justice kim lack larger larry last law lawyer least legal legislator
##    1       6   1    1      1     1    2   1      1     2     1          1
##     Terms
## Docs legislature less levels liberal life like likely live living locked
##    1           3    1      1       1    2    5      1    2      1      1
##     Terms
## Docs long longtime look low lowlevel make makeup male management many
##    1    1        1    1   1        1    1      1    1          1    2
##     Terms
## Docs marginally marked marsha matter may medical mental met middle million
##    1          1      1      1      1   1       2      2   1      1       4
##     Terms
## Docs minimal misconduct money    moneyâ<U+0080>      monica montgomery month months more
##    1       1          1     2        1      1          1     1      3    6
##     Terms
## Docs morrison most mother moved much murder named nation national near
##    1        1    2      1     1    2      1     1      2        1    2
##     Terms
## Docs need needs never new nonviolent not now number odds offenders
##    1    3     2     1   1          1   3   3      1    1         2
##     Terms
## Docs offenses officer officers    officersâ<U+0080>      officials often once one only
##    1        1       1        5           1         1     1    1   2    5
##     Terms
## Docs open organization organize original other others out over overhaul
##    1    1            2        1        1     2      2   1    2        1
##     Terms
## Docs overturned own page paper parole part past people per percent
##    1          1   1    1     1      1    1    1      1   1       1
##     Terms
## Docs    periodâ<U+0080>      personally perspective places plan policies policy
##    1         1          1           1      1    2        2      3
##     Terms
## Docs political practices premature pressing primary primitive prison
##    1         1         1         1        1       1         1     11
##     Terms
## Docs prisonâ<U+0080><U+0099>s prisoners prisons problem problems procedures programs
##    1          2         7       6       1        2          1        1
##     Terms
## Docs project prominence promising prompt property psychologist question
##    1       1          1         1      1        1            1        1
##     Terms
## Docs quit raise rampant raped rate recent recently recruiting rectify
##    1    1     1       2     2    1      1        2          1       1
##     Terms
## Docs reform relatives released releasing remained remains repeat replaced
##    1      2         1        2         1        1       3      1        1
##     Terms
## Docs report reports represents republican request rescinding resellable
##    1      4       2          1          2       1          1          1
##     Terms
## Docs review rights robbery robert rodney routinely row rules running said
##    1      1      1       1      1      1         1   1     1       2   22
##     Terms
## Docs same samuels say says scrutinizing secondhighest secure see seen sell
##    1    1       1   2    2            1             1      1   1    1    1
##     Terms
## Docs senate senator sending senior sent sentence sentencing series serious
##    1      1       1       1      1    1        1          2      2       1
##     Terms
## Docs served services serving session several sex sexual sexualized she
##    1      2        1       1       1       1   4      3          1   9
##     Terms
## Docs show showed showering sick since situation six soft solution some
##    1    1      1         1    1     3         2   4    1        1    2
##     Terms
## Docs sometimes son spending split spots stacy staffing state stateâ<U+0080><U+0099>s
##    1         2   1        1     1     1     1        2     4         2
##     Terms
## Docs step stephen stepped stetson still stillborn stockades strip strong
##    1    1       1       1       1     6         1         1     1      1
##     Terms
## Docs stuff    stupidâ<U+0080>      support system    systemâ<U+0080>      take tampons telephone texas
##    1     1         1       2      2         1    1       1         1     1
##     Terms
## Docs than that thatâ<U+0080><U+0099>s the their them then there they thing things think
##    1    7   18        2  74     1    1    1     6    5     1      1     1
##     Terms
## Docs third this    thisâ<U+0080>      thomas those three tied toilet top toxic track
##    1     1    3       1      3     1     1    1      1   2     1     1
##     Terms
## Docs tracked    transparentâ<U+0080>      treatment troubled trying tutwiler    tutwilerâ<U+0080>     
##    1       1              1         2        1      1       14           1
##     Terms
## Docs two unconstitutional uncovered unfolding uniforms use using very
##    1   1                1         1         1        1   2     1    2
##     Terms
## Docs violations wanted wants ward warden was washington watched way
##    1          1      1     1    3      1   9          2       1   1
##     Terms
## Docs weâ<U+0080><U+0099>re week weighing well were what where whether which while who
##    1       1    1        1    2    4    2     2       2     1     1   9
##     Terms
## Docs whose wideranging will with without woman women wood work worked
##    1     1           1    1    8       1     1     6    1    1      1
##     Terms
## Docs working worse would year yearâ<U+0080><U+0099>s years you
##    1       1     1     1    3        1     6   1
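# number of distinct terms in the doc (the vocabulary size, not the
# total count of word occurrences, which would be sum(docTermMatrix))
totalWords <- length(docTermMatrix)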
totalWords
## [1] 564
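# probability function: treats the two words as independent, so
# P(x and y) = P(x) * P(y); returns 0 if either word is not in the document
p_of_XandY <- function(x, y) {
  terms <- colnames(docTermMatrix)
  # indexing a missing column name raises an error, so check membership first
  if (!(x %in% terms) || !(y %in% terms)) {
    return(0)
  }
  p_x <- docTermMatrix[1, x] / totalWords
  p_y <- docTermMatrix[1, y] / totalWords
  return(p_x * p_y)
}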

# test cases
p_of_XandY("the","you")
## [1] 0.0002326342
p_of_XandY("and","are")
## [1] 0.001075147
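Note that `p_of_XandY` divides each count by `totalWords`, which is `length(docTermMatrix)`, i.e. the 564 distinct terms rather than the total number of word occurrences. A sketch of the alternative normalization is shown below on a small hypothetical count vector; the counts and the helper names `p_word` and `p_joint_independent` are illustrative assumptions, not values or functions from the document above.

```r
# Hypothetical word counts (illustrative only, not the assignment data).
counts <- c(the = 4, and = 3, prison = 2, you = 1)

# Total number of word occurrences, as opposed to length(counts) distinct terms.
total_tokens <- sum(counts)

# Relative frequency of a single word; 0 if the word never occurs.
p_word <- function(w) {
  if (w %in% names(counts)) counts[[w]] / total_tokens else 0
}

# Joint probability under the same independence assumption as p_of_XandY.
p_joint_independent <- function(x, y) p_word(x) * p_word(y)

p_joint_independent("the", "you")
p_joint_independent("the", "zebra")
```

With this normalization the single-word probabilities sum to 1 across the vocabulary, which is not the case when dividing by the number of distinct terms.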