Statistics with R

class: center, middle, inverse, title-slide

# Statistics with R
## Introduction to R for Actuarial Students

---

* Introduction to R for Actuarial Students

* CS1B Curriculum

* Introduction to R programming
* Fundamentals of Statistical Analysis
* Probability Distributions

* Question 2 - Lognormal Probability Distribution
* The exam is based on ***Base R***

---

Let `${\displaystyle Z}$` be a standard normal variable, and let `${\displaystyle \mu }$` and `${\displaystyle \sigma >0}$` be two real numbers. Then, the distribution of the random variable

$${\displaystyle X=e^{\mu +\sigma Z}} $$
is called the log-normal distribution with parameters `${\displaystyle \mu }$`  and `${\displaystyle \sigma }$`.

***Mean***	 $$E(X) = {\displaystyle \exp \left(\mu +{\frac {\sigma ^{2}}{2}}\right)} $$

***Variance***	 $${\displaystyle \operatorname{Var}(X) = [\exp(\sigma ^{2})-1]\exp(2\mu +\sigma ^{2})} $$
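
The definition can be checked numerically. The following is a minimal sketch, assuming that base R's `rlnorm()` draws from the same normal generator as `rnorm()`, so that the same seed gives matching values; the names `x_manual` and `x_builtin` are illustrative only.

```r
# Build lognormal draws by hand from standard normal draws: X = exp(mu + sigma * Z)
set.seed(1)
z <- rnorm(5)
x_manual <- exp(2 + 0.5 * z)

# Draw directly with rlnorm() using the same seed and parameters
set.seed(1)
x_builtin <- rlnorm(5, meanlog = 2, sdlog = 0.5)

# Compare the two vectors up to floating-point tolerance
all.equal(x_manual, x_builtin)
```

Note that `sdlog` is the standard deviation `$\sigma$` on the log scale, not the variance `$\sigma^2$`.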

---

```r
exp(0.5)
```

```
## [1] 1.648721
```

```r
# Lognormal mean E(X) for mu = 2 and sigma = 0.5
exp(2 + 0.5^2 / 2)
```

```
## [1] 8.372897
```

---

### Exercise 1

Generate a sample of 10,000 random observations from a lognormal distribution with parameters `$\mu = 2$` and `$\sigma^2 = 0.25$`.

Display the first few simulated observations using the ***head(...)*** function.

(Use a seed value of 100 to generate random numbers)

---

### Exercise 1

#### Generate a random sample from a Lognormal distribution

```r
set.seed(100)

# sdlog is the standard deviation on the log scale: sqrt(0.25) = 0.5
data1 <- rlnorm(10000, meanlog = 2, sdlog = 0.5)

# First 6 observations are shown below

head(data1)
```

```
## [1]  5.748298  7.891337  7.103172 11.512028  7.834097  8.665200
```

---

```r
summary(data1)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.957   5.295   7.398   8.376  10.376  51.626
```

```r
mean(data1)    
```

```
## [1] 8.375649
```

---

### Exercise 2

Compute the mean, median and variance of the generated sample and compare them with the corresponding population values of a lognormal distribution with the given parameters.

```r
# Compute the mean, median and variance of the sample
mean(data1)
```

```
## [1] 8.375649
```

```r
median(data1)
```

```
## [1] 7.398463
```

```r
var(data1)
```

```
## [1] 19.51361
```

---

#### Analytical Values

```r
# Formula-based mean: exp(mu + sigma^2 / 2)
thismean <- exp(2 + 0.25 / 2)
thismean
```

```
## [1] 8.372897
```

```r
# Formula-based median: exp(mu)
thismedian <- exp(2)
thismedian 
```

```
## [1] 7.389056
```

```r
# Cross-check: the median is the 0.5 quantile
qlnorm(0.5, meanlog = 2, sdlog = 0.5)
```

```
## [1] 7.389056
```

---

#### Analytical Values

```r
# Formula-based variance: (exp(sigma^2) - 1) * exp(2 * mu + sigma^2)
thisvar <- (exp(0.25) - 1) * exp(2 * 2 + 0.25)
thisvar
```

```
## [1] 19.91172
```
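
The sample and formula-based values can also be placed side by side. This is a minimal sketch that regenerates the Exercise 1 sample; the names `comparison` and `rel_diff` are illustrative and not part of the original solution.

```r
# Regenerate the Exercise 1 sample
set.seed(100)
data1 <- rlnorm(10000, meanlog = 2, sdlog = 0.5)

mu <- 2
sigma2 <- 0.25

# Sample statistics next to the formula-based population values
comparison <- data.frame(
  statistic   = c("mean", "median", "variance"),
  sample      = c(mean(data1), median(data1), var(data1)),
  theoretical = c(exp(mu + sigma2 / 2),
                  exp(mu),
                  (exp(sigma2) - 1) * exp(2 * mu + sigma2))
)

# Relative difference of each sample value from its theoretical counterpart
comparison$rel_diff <- (comparison$sample - comparison$theoretical) /
  comparison$theoretical

comparison
```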

---

### Interpretation

The sample mean and median agree with the parameter-based values to within about 0.15%, and the sample variance (19.51 against 19.91) to within about 2%. The agreement is close because the sample size of 10,000 is large.

A much larger sample would be expected to narrow the remaining differences further, since sampling error shrinks as the sample size grows.
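
The effect of sample size can be explored with a short experiment. This is a sketch under stated assumptions: the sample sizes in `sizes` are arbitrary choices, and `abs_error` and `std_error` are illustrative names. The error of a single run is random, so it need not fall at every step, but the theoretical standard error of the mean falls in proportion to `$1/\sqrt{n}$`.

```r
set.seed(100)

# Population mean and variance from the lognormal formulas (mu = 2, sigma^2 = 0.25)
true_mean <- exp(2 + 0.25 / 2)
true_var  <- (exp(0.25) - 1) * exp(2 * 2 + 0.25)

# Illustrative sample sizes (an arbitrary choice)
sizes <- c(1e2, 1e4, 1e6)

# Absolute error of the sample mean for one simulated sample of each size
abs_error <- sapply(sizes, function(n) {
  abs(mean(rlnorm(n, meanlog = 2, sdlog = 0.5)) - true_mean)
})

# Theoretical standard error of the sample mean: sqrt(variance / n)
std_error <- sqrt(true_var / sizes)

data.frame(n = sizes, abs_error = abs_error, std_error = std_error)
```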

---

### Exercise 3

Treat the data generated in Exercise 1 as the population.

Generate 5000 different random samples of size 200 from the above population and compute the sample mean for each sample.
[Use a seed value of 100 to generate random numbers]

```r
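set.seed(100)

# Regenerate the population from Exercise 1
data1 <- rlnorm(10000, meanlog = 2, sdlog = 0.5)

# 5000 sample means, each from 200 observations drawn without replacement
means <- replicate(5000,
    mean(sample(data1, 200, replace = FALSE)))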
```

---

```r
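# Alternative: the same procedure written as an explicit loop.
# Generate 5000 different samples of size 200 and compute their sample means.
# The seed is reset just before sampling, so the individual means
# differ from those of the replicate() version.

means <- numeric(5000)   # preallocate instead of growing the vector

set.seed(100)
for (i in seq_len(5000)) {
    # Draw 200 row indices without replacement
    selected_rows <- sample(1:10000, 200, replace = FALSE)

    selected_data <- data1[selected_rows]

    means[i] <- mean(selected_data)
}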
```
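
As a sanity check on the simulated sample means, their average should be close to the population mean, and their standard deviation should be close to the standard error for sampling without replacement from a finite population. The sketch below is self-contained; the names `N`, `n` and `se_expected` are illustrative, and the finite population correction `sqrt(1 - n / N)` is added here rather than taken from the original exercise.

```r
set.seed(100)

# Population (Exercise 1) and sample means (Exercise 3)
data1 <- rlnorm(10000, meanlog = 2, sdlog = 0.5)
N <- length(data1)   # population size
n <- 200             # size of each sample

means <- replicate(5000, mean(sample(data1, n, replace = FALSE)))

# Expected standard error of the mean, with finite population correction
se_expected <- sd(data1) / sqrt(n) * sqrt(1 - n / N)

c(mean_of_means   = mean(means),
  population_mean = mean(data1),
  sd_of_means     = sd(means),
  se_expected     = se_expected)
```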

---

### Exercise 4
Plot the histogram of sample means generated from Exercise 3 and interpret the distribution
of sample means.

<pre><code>
# Histogram of sample means (bar colours cycle through the three listed)
hist(means, breaks = 50,
    col = c("lightblue", "lightpink", "lightgreen"))
</code></pre>
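
To help judge the shape, a normal density curve can be drawn over the histogram. This is a minimal sketch, not part of the original solution: it plots densities instead of counts (`freq = FALSE`), uses a single bar colour, and takes the mean and standard deviation of `means` as the parameters of the overlaid curve.

```r
set.seed(100)

# Population (Exercise 1) and sample means (Exercise 3)
data1 <- rlnorm(10000, meanlog = 2, sdlog = 0.5)
means <- replicate(5000, mean(sample(data1, 200, replace = FALSE)))

# Histogram on the density scale so that a density curve can be overlaid
h <- hist(means, breaks = 50, freq = FALSE,
          col = "lightblue",
          main = "Sample means with normal curve",
          xlab = "Sample mean")

# Normal density with the same mean and standard deviation as the sample means
curve(dnorm(x, mean = mean(means), sd = sd(means)),
      add = TRUE, lwd = 2)
```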

---

![](02-Part2-Probability-LogNormal_files/figure-html/unnamed-chunk-15-1.png)

---

#### Interpretation

* The sample means are approximately normally distributed even though the underlying data come from a lognormal distribution.
* This illustrates the central limit theorem: the distribution of the sample mean approaches a normal distribution as the sample size increases.
* Increasing the sample size beyond 200 should bring the distribution of the sample means still closer to normal.
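
The effect of sample size on normality can be examined by comparing the skewness of the sample means for several sample sizes. This is a sketch under stated assumptions: `skewness()` and `sample_means()` are helper functions defined here (not base R functions), and the sizes 5, 20 and 200 are arbitrary choices. A normal distribution has skewness 0, so values nearer 0 suggest a more symmetric distribution.

```r
set.seed(100)

# Population from Exercise 1
data1 <- rlnorm(10000, meanlog = 2, sdlog = 0.5)

# Moment-based skewness (helper defined for this sketch)
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Sample means for a given sample size (helper defined for this sketch)
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(sample(data1, n, replace = FALSE)))
}

# Skewness of the sample means for each illustrative sample size
sizes <- c(5, 20, 200)
skew_by_n <- sapply(sizes, function(n) skewness(sample_means(n)))
names(skew_by_n) <- paste0("n=", sizes)

skew_by_n
```

A normal Q-Q plot of the sample means, using `qqnorm()` and `qqline()`, gives a visual version of the same check.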

---