class: center, middle, inverse, title-slide

# Statistics with R
## Introduction to R for Actuarial Students

---

* Introduction to R for Actuarial Students
* CS1B Curriculum
* Introduction to R programming
* Fundamentals of Statistical Analysis
* Probability Distributions
* Question 2 - Lognormal Probability Distribution
* The exam is based on ***Base R***

---

<style type="text/css">
pre {
  background: #ADD8E6;
  max-width: 100%;
  overflow-x: scroll;
}
</style>

#### Log-Normal Distribution

Let `\({\displaystyle Z}\)` be a standard normal variable, and let `\({\displaystyle \mu }\)` and `\({\displaystyle \sigma >0}\)` be two real numbers. Then the distribution of the random variable

$${\displaystyle X=e^{\mu +\sigma Z}} $$

is called the log-normal distribution with parameters `\({\displaystyle \mu }\)` and `\({\displaystyle \sigma }\)`.

***Mean***

$$E(X) = {\displaystyle \exp \left(\mu +{\frac {\sigma ^{2}}{2}}\right)} $$

***Variance***

$${\displaystyle \operatorname{Var}(X) = [\exp(\sigma ^{2})-1]\exp(2\mu +\sigma ^{2})} $$

---

```r
exp(0.5)
```

```
## [1] 1.648721
```

```r
# Mean of a lognormal with mu = 2 and sigma = 0.5
exp(2 + (0.5)^2 / 2)
```

```
## [1] 8.372897
```

---

### Exercise 1

Generate a sample of 10000 random observations from a lognormal distribution with parameters `\(\mu = 2\)` and `\(\sigma^2 = 0.25\)`.

Display the first few simulated observations using the ***head(...)*** function.

(Use a seed value of 100 to generate the random numbers.)

---

### Exercise 1

#### Generate a random sample from a Lognormal distribution

```r
set.seed(100)
# sdlog is the standard deviation on the log scale: sqrt(0.25) = 0.5
data1 <- rlnorm(10000, meanlog = 2, sdlog = 0.5)
# First 6 observations are shown below
head(data1)
```

```
## [1]  5.748298  7.891337  7.103172 11.512028  7.834097  8.665200
```

---

```r
summary(data1)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.957   5.295   7.398   8.376  10.376  51.626
```

```r
mean(data1)
```

```
## [1] 8.375649
```

---

### Exercise 2

Compute the sample mean, median and variance of the generated sample and compare them with the corresponding values for a population following a lognormal distribution with the given parameters.

```r
# Compute the mean, median and variance of the sample
mean(data1)
```

```
## [1] 8.375649
```

```r
median(data1)
```

```
## [1] 7.398463
```

```r
var(data1)
```

```
## [1] 19.51361
```

---

#### Analytical Values

```r
# Formula-based mean: exp(mu + sigma^2 / 2)
thismean <- exp(2 + 0.25 / 2)
thismean
```

```
## [1] 8.372897
```

```r
# Formula-based median: exp(mu)
thismedian <- exp(2)
thismedian
```

```
## [1] 7.389056
```

```r
# Cross-check: the median is the 50% quantile
qlnorm(0.5, meanlog = 2, sdlog = 0.5)
```

```
## [1] 7.389056
```

---

#### Analytical Values

```r
# Formula-based variance: (exp(sigma^2) - 1) * exp(2 * mu + sigma^2)
thisvar <- (exp(0.25) - 1) * exp(2 * 2 + 0.25)
thisvar
```

```
## [1] 19.91172
```

---

### Interpretation:

The mean, median and variance of the generated sample are close to the values computed from the parameters, because the sample size of 10,000 is large. The variance shows the largest gap (19.51 in the sample against 19.91 from the formula).

Generating a much larger sample would be expected to narrow the remaining differences.

---

### Exercise 3

Treat the data generated in Exercise 1 as the population. Generate 5000 different random samples of size 200 from this population and compute the sample mean of each sample.

[Use a seed value of 100 to generate the random numbers]

```r
set.seed(100)
data1 <- rlnorm(10000, meanlog = 2, sdlog = 0.5)
# Draw 5000 samples of size 200 without replacement and keep each sample mean
means <- replicate(5000, mean(sample(data1, 200, replace = FALSE)))
```

---

```r
# Equivalent approach with an explicit loop:
# generate 5000 different samples of size 200, then compute their sample means.
# The seed is reset here, so the draws differ from the replicate() version.
means <- numeric(5000)
set.seed(100)
for (i in 1:5000) {
  selected_rows <- sample(1:10000, 200, replace = FALSE)
  selected_data <- data1[selected_rows]
  means[i] <- mean(selected_data)
}
```

---

### Exercise 4

Plot a histogram of the sample means generated in Exercise 3 and interpret the distribution of the sample means.

<pre><code>
# Histogram of sample means
hist(means, breaks = 50, col = c("lightblue", "lightpink", "lightgreen"))
</code></pre>

---

#### Interpretation

* The sample means are approximately normally distributed, even though the underlying data come from a lognormal distribution.
* This exercise illustrates the central limit theorem: the distribution of the sample means approaches a normal distribution as the sample size increases. Increasing the sample size well beyond 200 should bring the distribution of the sample means closer still to normal.

---
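#### Interpretation: a numerical check

The normality claim above can also be probed numerically. The following is a minimal sketch and not part of the original exercise: the three sample sizes (10, 50 and 200) are assumed for illustration, and `skewness_moment()` is a hypothetical helper written here because base R has no built-in skewness function. A skewness near 0 is consistent with a symmetric, normal-like shape.

```r
# Illustrative sketch: skewness of the sample means for several sample sizes.
set.seed(100)
population <- rlnorm(10000, meanlog = 2, sdlog = 0.5)

# Hypothetical helper (not a base R function): moment-based sample skewness.
skewness_moment <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Assumed sample sizes, chosen only for illustration.
sample_sizes <- c(10, 50, 200)

# For each size, draw 5000 samples without replacement and
# measure the skewness of the resulting 5000 sample means.
skew_by_size <- sapply(sample_sizes, function(n) {
  sample_means <- replicate(5000, mean(sample(population, n, replace = FALSE)))
  skewness_moment(sample_means)
})
names(skew_by_size) <- sample_sizes
skew_by_size
```

If the central limit theorem applies as described, the printed values should move toward 0 as the sample size grows.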