Assignment 5

Question 1:

Given: In 21% of two-parent families at least one parent has cryptosporidiosis. In 5% of families both parents have it, while in 9% of families the father has it.
Let A be the event that the mother has cryptosporidiosis and B be the event that the father has cryptosporidiosis.

Assuming the total sample space to be all two parent families we get:
\(P(A\cup B ) = 0.21\) (21 % of the population)
\(P(A\cap B ) = 0.05\) (5% of the population)
\(P(B) = 0.09\) (9% of the population)

Answers:

It represents the event that at most one parent has cryptosporidiosis (i.e. either exactly one parent has it or neither parent has it).
Probability of either the mother or the father having cryptosporidiosis:
\(P(A\cup B ) = 0.21\)
Probability of only mother contracting cryptosporidiosis:
First we calculate \(P(A)\):
\(P(A) = P(A\cup B ) - P(B) + P(A\cap B ) = 0.21 - 0.09 + 0.05 = 0.17\)

Thus, the probability of only the mother contracting cryptosporidiosis is:
\(P(A) - P(A\cap B ) = 0.17 -0.05 = 0.12\)
As calculated above, probability that mother has contracted cryptosporidiosis: \(P(A) = 0.17\)
Probability that neither mother nor father has contracted cryptosporidiosis:
\(1 - P(A\cup B ) = 1 - 0.21 = 0.79\)
Probability that only father has contracted cryptosporidiosis:
\(P(B) - P(A\cap B ) = 0.09 -0.05 = 0.04\)
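
The arithmetic for Question 1 can be double-checked with a few lines of R. This is only a sketch: the variable names (`p_union`, `p_both`, `p_b`, `results`) are illustrative and not part of the assignment.

```r
# Given probabilities from the problem statement
p_union <- 0.21  # P(A or B): at least one parent infected
p_both  <- 0.05  # P(A and B): both parents infected
p_b     <- 0.09  # P(B): father infected

# Inclusion-exclusion rearranged for P(A)
p_a <- p_union - p_b + p_both

results <- c(
  mother      = p_a,           # P(A)
  only_mother = p_a - p_both,  # P(A) - P(A and B)
  only_father = p_b - p_both,  # P(B) - P(A and B)
  neither     = 1 - p_union    # complement of the union
)
print(results)

# Sanity check: the three disjoint pieces must add up to the union
pieces <- unname(results["only_mother"] + results["only_father"]) + p_both
stopifnot(isTRUE(all.equal(pieces, p_union)))
```

The final check confirms that "only mother", "only father" and "both" partition the event that at least one parent is infected.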

Question 2:

A function \(f(k)\) is a valid pmf if it fulfils the following two conditions:

\(f(k) \ge 0\) for every \(k \in \kappa\) where \(\kappa\) represents all possible values of the random variable \(X\).
\(\sum_{k \in \kappa} f(k) = 1\)

Given function \(p(x) = h(x) / \sum_{i=1}^{I} h(i)\) for \(x = 1,2,3...I\)

The function is always positive. The numerator \(h(x)>0\) for all \(x = 1,2,3,\ldots,I\). Since \(h(x)\) is always positive, its sum (the denominator) is also positive. Thus \(p(x)\) is positive, as it is the ratio of two positive quantities.
\(\sum_{i=1}^{I} p(i) = \frac{h(1)}{\sum_{i=1}^{I} h(i)} + \frac{h(2)}{\sum_{i=1}^{I} h(i)} + ...\frac{h(I)}{\sum_{i=1}^{I} h(i)}\)

\(\sum_{i=1}^{I} p(i) = \frac{\sum_{i=1}^{I} h(i)}{\sum_{i=1}^{I} h(i)} = 1\)

Thus the given function is a valid pmf.
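
As a quick numerical illustration of the argument, the sketch below normalises an arbitrary vector of positive weights. The values in `h` are hypothetical and chosen only for the example; any positive values would behave the same way.

```r
# Hypothetical positive weights h(1), ..., h(I) with I = 5
h <- c(2, 5, 1, 7, 3)

# p(x) = h(x) / sum of all h(i)
p <- h / sum(h)
print(p)

# Condition 1: every probability is nonnegative
stopifnot(all(p >= 0))

# Condition 2: the probabilities sum to 1
stopifnot(isTRUE(all.equal(sum(p), 1)))
```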

Question 3:

A function \(f(.)\) is a valid pdf if it fulfils the following conditions:

\(f(x) \ge 0\) for all \(x\).
\(\int_{-\infty}^{\infty} f(x) \; dx= 1\)

The given function \(f(x) = \frac{h(x)}{c}\) where \(c = \int_{-\infty}^{\infty} h(x) \; dx\)

The numerator \(h(x)\) is always positive by definition. The denominator is also positive as integration of a positive quantity is positive. Thus the function \(f(x)\) is always positive as it’s obtained by the division of two positive quantities.
\(\int_{-\infty}^{\infty} f(x) \; dx= \int_{-\infty}^{\infty} \frac{h(x)}{c}\; dx = \frac{\int_{-\infty}^{\infty} h(x)\; dx}{c} = \frac{c}{c} = 1\)

Thus the given function is a valid pdf.
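
The same normalisation can be checked numerically with `integrate()`. The choice \(h(x) = e^{-x^2}\) below is an assumption made purely for illustration (the assignment leaves \(h\) generic); for this choice the normalising constant is \(\sqrt{\pi}\).

```r
# Hypothetical positive, integrable function (not from the assignment)
h <- function(x) exp(-x^2)

# Normalising constant c = integral of h over the real line
c_norm <- integrate(h, -Inf, Inf)$value

# Normalised density f(x) = h(x) / c
f <- function(x) h(x) / c_norm

# The density should integrate to 1 (up to numerical error)
total <- integrate(f, -Inf, Inf)$value
print(c(c_norm = c_norm, total = total))
```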

Question 4.

The value of k can be found using the two properties of a pdf:

The pdf must be nonnegative, and it cannot be zero on \(0 \le x \le 1\) because it has to integrate to 1. This implies \(k >0\)
\(\int_{-\infty}^{\infty} f(x) \; dx= 1\). This implies: \(\int_{-\infty}^{\infty} k \; dx= \int_{0}^{1} k \; dx = \left. kx\right|_{0}^{1} = k = 1\).

Thus, \(k=1\)

Density plot:

x <- seq(-5, 5, 0.01)
# Uniform(0, 1) density: 1 on [0, 1] and 0 everywhere else
fx <- (x >= 0 & x <= 1) * 1 +
  (x < 0 | x > 1) * 0
plot(x, fx, type = "S", ylab = "Density function", main = "Density Plot")

The pdf of the random variable \(X\) is given by \(f(x) = 1\) for \(0 \le x \le 1\)

\(P(0.1 <X < 0.7) = \int_{0.1}^{0.7} f(x) \; dx= \left. x\right|_{0.1}^{0.7}=0.6\)

Interpretation: There is a 0.6 probability that the chosen subject has between 10% and 70% of their skin covered in freckles.
Using the punif function in R to verify the previous calculation:

P_0.7_0.1 = punif(0.7, min=0, max =1) -punif(0.1, min =0, max =1)
print(c("The P(0.1 < X < 0.7) is obtained as ",P_0.7_0.1  ))

## [1] "The P(0.1 < X < 0.7) is obtained as "
## [2] "0.6"

The probability for any generic value of a,b such that \(0<a<b<1\): \(P(a <X < b) = \int_{a}^{b} f(x) \; dx= \left. x\right|_{a}^{b}= b-a\)

The cdf \(F(x)\) is obtained by integrating the pdf \(f(x) = 1\) for \(0 \le x \le 1\). Since the pdf is a piecewise function, we calculate the cdf in 3 intervals:

\(-\infty<x<0\): \(F(x) =\int_{-\infty}^{x} f(u) \; du= \int_{-\infty}^{x} 0\; du=0\)
\(0 \le x \le 1\): \(F(x) = \int_{-\infty}^{x} f(u) \; du= \int_{-\infty}^{0} 0\; du+\int_{0}^{x} 1 \; du = \left. u\right|_{0}^{x} = x-0 = x\)
\(x>1\): \(F(x) = \int_{-\infty}^{x} f(u) \; du= \int_{-\infty}^{0} 0\; du+\int_{0}^{1} 1 \; du + \int_{1}^{x} 0 \; du = 1\)

\[ F(x) = \left\{ \begin{array}{ll} 0 & \quad x < 0 \\ x & \quad 0 \le x \le 1\\ 1 & \quad x > 1 \\ \end{array} \right. \]

CDF Plot

x <- seq(-5, 5, 0.01)
fx <- (x<0)*0+ (x >= 0 & x <=1) * x +
  (x > 1 ) * 1  
plot(x, fx, type ="S",ylab="cdf", main= "CDF Plot")

The median is obtained from the cdf as the value \(x_{0.5}\) such that \(F(x_{0.5}) = 0.5\). From the definition of the cdf we obtain \(x_{0.5} = 0.5\).
Interpretation: 50% of randomly chosen subjects would have up to 50% of their skin covered in freckles.
The \(95^{th}\) percentile is obtained from the cdf as the value \(x_{0.95}\) such that \(F(x_{0.95}) = 0.95\). From the definition of the cdf we obtain \(x_{0.95} = 0.95\).

Interpretation: 95% of randomly chosen subjects would have up to 95% of their skin covered in freckles.
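
The piecewise cdf and the two quantiles can be cross-checked against R's built-in uniform distribution functions `punif()` and `qunif()`. The grid `x_grid` and the helper name `cdf_manual` are illustrative choices, not part of the assignment.

```r
# Evaluate the hand-derived cdf on a few points inside and outside [0, 1]
x_grid <- c(-1, 0, 0.25, 0.5, 1, 2)
cdf_manual <- pmin(pmax(x_grid, 0), 1)  # 0 below 0, x on [0, 1], 1 above 1

# It should agree with the Uniform(0, 1) cdf
stopifnot(isTRUE(all.equal(cdf_manual, punif(x_grid, min = 0, max = 1))))

# Median and 95th percentile from the quantile function
median_x <- qunif(0.50, min = 0, max = 1)
p95_x    <- qunif(0.95, min = 0, max = 1)
print(c(median = median_x, p95 = p95_x))
```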

Question 5

A function \(f(.)\) is a valid pdf if it fulfils the following conditions:

\(f(x) \ge 0\) for all \(x\).
\(\int_{-\infty}^{\infty} f(x) \; dx= 1\)

The given function \(f(x) = \frac{e^{-x}}{(1+e^{-x})^2}\):
The denominator is always positive as it is the square of a nonzero term. The numerator is always positive since \(e^{-x} > 0\) for every real \(x\). Thus the function \(f(x)\) is always positive.
\(\int_{-\infty}^{\infty}\frac{e^{-u}}{(1+e^{-u})^2}\; du = \left. \frac{1} {(1+e^{-u})}\right|_{-\infty}^{\infty} = 1-0 = 1\).

Thus the given function is a valid pdf.

The cdf is obtained by integration of the pdf : \(F(x)=\int_{-\infty}^{x}\frac{e^{-u}}{(1+e^{-u})^2}\; du = \left. \frac{1} {(1+e^{-u})}\right|_{-\infty}^{x} = \frac{1} {(1+e^{-x})}-0 = \frac{1} {(1+e^{-x})}\).
The cdf for x=0: \(F(0) = \frac{1} {(1+e^{-0})} = 0.5\).

Interpretation: \(x=0\) is the median of the distribution. Thus 50% of the population would have \(x \le 0\).
The \(p^{th}\) quantile of the given distribution is obtained as \(x_p\) such that \(F(x_p) = \frac{1} {(1+e^{-x_p})}=p\). Solving for \(x_p\) we get:

\(e^{-x_p} = \frac{1-p}{p}\)

\(x_p = -\log(\frac{1-p}{p}) = \log(\frac{p}{1-p})\). Thus the \(p^{th}\) quantile equals the natural log of the odds of an event with probability \(p\).
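
The density in this question is that of the standard logistic distribution, so the derived pdf, cdf and quantile function can be compared with R's `dlogis()`, `plogis()` and `qlogis()`. The helper names below (`pdf_manual`, `cdf_manual`, `quantile_manual`) are illustrative only.

```r
# Hand-derived formulas from the solution
pdf_manual      <- function(x) exp(-x) / (1 + exp(-x))^2
cdf_manual      <- function(x) 1 / (1 + exp(-x))
quantile_manual <- function(p) log(p / (1 - p))  # log-odds

x_grid <- seq(-10, 10, by = 0.5)
p_grid <- c(0.05, 0.25, 0.5, 0.75, 0.95)

# Compare with the built-in standard logistic distribution
stopifnot(isTRUE(all.equal(pdf_manual(x_grid), dlogis(x_grid))))
stopifnot(isTRUE(all.equal(cdf_manual(x_grid), plogis(x_grid))))
stopifnot(isTRUE(all.equal(quantile_manual(p_grid), qlogis(p_grid))))

# F(0) = 0.5, so the median is 0
print(cdf_manual(0))
```

Because the quantile function is the log-odds, `quantile_manual(0.5)` returns 0, which matches the median found above.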

Question 6

Any pdf function \(f(.)\) satisfies the condition: \(\int_{-\infty}^{\infty} f(x) \; dx= 1\). Using this condition to find the value of \(c\) we get: \(\int_{-\infty}^{\infty} cx^k \; dx= \int_{-\infty}^{0} 0\; dx +\int_{0}^{1} cx^k \; dx + \int_{1}^{\infty} 0\; dx = \left. \frac{cx^{k+1}} {k+1}\right|_{0}^{1} = \frac{c}{k+1} = 1\)

\(c = k+1\)
The cdf \(F(x)\) is obtained by integrating the pdf for \(0 \le x \le 1\). Since the pdf is a piecewise function, we calculate the cdf in 3 intervals:

\(-\infty<x<0\): \(F(x) =\int_{-\infty}^{x} f(u) \; du= \int_{-\infty}^{x} 0\; du=0\)
\(0 \le x \le 1\): \(F(x) = \int_{-\infty}^{x} f(u) \; du= \int_{0}^{x} cu^k \; du =\int_{0}^{x} (k+1) u^k \; du = \left .\frac{(k+1)u^{k+1}}{k+1}\right|_{0}^{x} = x^{k+1}\)
\(x>1\): \(F(x) = \int_{-\infty}^{x} f(u) \; du= \int_{-\infty}^{0} 0\; du+\int_{0}^{1} (k+1)u^k \; du + \int_{1}^{x} 0 \; du = 1\)

\[ F(x) = \left\{ \begin{array}{ll} 0 & \quad x < 0 \\ x^{k+1} & \quad 0 \le x \le 1\\ 1 & \quad x > 1 \\ \end{array} \right. \]

The \(p^{th}\) quantile \(x_p\) is defined using the cdf such that:

\(F(x_{p}) = p\)

\(p = x_p^{k+1}\)

\(x_p = p^{\frac{1}{k+1}}\)
\(P(a<X<b)\) for \(0 \le a < b \le 1\) is obtained by integrating the pdf from \(a\) to \(b\) and substituting \(c = k+1\): \(P(a<X<b) = \int_{a}^{b} cx^k dx =\left.\frac{cx^{k+1}}{k+1}\right|_{a}^{b} = \frac{c(b^{k+1}-a^{k+1})}{k+1} = b^{k+1}-a^{k+1}\)
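
A numerical spot check of these results is sketched below. The values `k <- 3`, `a <- 0.2` and `b <- 0.6` are hypothetical choices for illustration; the assignment keeps \(k\), \(a\) and \(b\) generic.

```r
k <- 3            # hypothetical exponent, chosen only for this check
c_const <- k + 1  # normalising constant derived above

# pdf: c * x^k on [0, 1], 0 elsewhere
f <- function(x) ifelse(x >= 0 & x <= 1, c_const * x^k, 0)

# cdf and quantile function derived above
cdf        <- function(x) pmin(pmax(x, 0), 1)^(k + 1)
quantile_p <- function(p) p^(1 / (k + 1))

# The pdf integrates to 1 over its support
area <- integrate(f, 0, 1)$value

# P(a < X < b) by numerical integration versus the closed form
a <- 0.2
b <- 0.6
prob_numeric <- integrate(f, a, b)$value
prob_formula <- b^(k + 1) - a^(k + 1)

print(c(area = area, numeric = prob_numeric, formula = prob_formula))

# The quantile function should invert the cdf
stopifnot(isTRUE(all.equal(cdf(quantile_p(0.3)), 0.3)))
```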

Question 7

A function \(f(.)\) is a valid pdf if it fulfils the following conditions:

\(f(x) \ge 0\) for all \(x\).
\(\int_{-\infty}^{\infty} f(x) \; dx= 1\)

The given function is nonnegative provided \(c > 0\). Thus we use the second condition to find the value of c: \(\int_{-\infty}^{\infty} c\exp(\frac{-x}{2.5})\; dx= 1\)

\(\int_{-\infty}^{\infty} c\exp(\frac{-x}{2.5})\; dx=\int_{-\infty}^{0} 0\; dx + \int_{0}^{\infty} c\exp(\frac{-x}{2.5})\; dx= \left.\frac{-c}{\frac{1}{2.5}}\exp(\frac{-x}{2.5})\right|_{0}^{\infty}= -2.5* c[0-1] = 1\)

\(c = \frac{1}{2.5}\)

The cdf is obtained by integrating the pdf such that

\(F(x)=\int_{-\infty}^{x} f(u)\; du= \int_{-\infty}^{0} 0\; du +\int_{0}^{x} c\exp(\frac{-u}{2.5})\; du = \left.\frac{-c}{\frac {1}{2.5}}\exp(\frac{-u}{2.5})\right|_{0}^{x} = -2.5*c [\exp(\frac{-x}{2.5})-1] = 1-\exp(\frac{-x}{2.5})\)

\(F(x) = 1-\exp(\frac{-x}{2.5})\) for \(x>0\)
Survival Function:

\(S(x) = 1-F(x) = 1 - (1-\exp(\frac{-x}{2.5})) = \exp(\frac{-x}{2.5})\)
Probability of taking longer than 11 days to be discharged:

\(S(x=11) = \exp(\frac{-11}{2.5}) = 0.0123\)
The median number of days \(x_{0.5}\) is obtained as the \(50^{th}\) quantile of the cdf:

\(F(x_{0.5}) = 0.5\)

\(0.5 = 1-\exp(\frac{-x_{0.5}}{2.5})\)

\(x_{0.5} = -2.5\log(0.5) = 2.5\log(2) \approx 1.7329\)
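
With \(c = 1/2.5\) the density is that of an exponential distribution with rate \(1/2.5\), so the results can be cross-checked with R's `pexp()` and `qexp()`. Solving \(0.5 = 1-\exp(\frac{-x_{0.5}}{2.5})\) gives \(x_{0.5} = 2.5\log(2)\), which is what `qexp()` should return. The variable names are illustrative.

```r
scale_days <- 2.5
rate <- 1 / scale_days  # the constant c found above

# The pdf c * exp(-x / 2.5) should integrate to 1 over (0, Inf)
area <- integrate(function(x) rate * exp(-x / scale_days), 0, Inf)$value

# Survival probability S(11) = P(X > 11)
surv_11 <- pexp(11, rate = rate, lower.tail = FALSE)

# Median: the 0.5 quantile of the exponential distribution
median_days <- qexp(0.5, rate = rate)

print(round(c(area = area, surv_11 = surv_11, median_days = median_days), 4))
```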

Project Proposal

Shiny app for Diabetes Prediction

Diabetes is a widespread medical condition characterized by high glucose levels. While the blood glucose level is an important determinant of diabetes, the condition also depends on other factors such as insulin levels, BMI and age. The project aims to develop a model that predicts whether a person is diabetic based on a fixed set of inputs.

Dataset

The data set to be used [http://archive.ics.uci.edu/ml/index.php] comprises 769 subjects (both diabetic and non-diabetic) with 9 different features for each subject. The data set is open access and can be obtained from [https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/diabetes.csv]

Project Objectives

Feature extraction: Using machine learning methods such as random forest, find the best set of predictive features for the given data set.
Build a predictive model that classifies subjects as diabetic or non-diabetic using the set of features identified above.
Implement the predictive model in a Shiny app.