Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of (N+1)/2.

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

5 points a. P(X>x | X>y) b. P(X>x, Y>y) c. P(X<x | X>y)

5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Random Variables (X)

##   [1] 2.243115 3.158907 4.697806 2.855386 3.299455 5.546018 4.197046
##   [8] 4.189005 5.862730 3.021871 3.910278 1.903285 1.060780 2.424534
##  [15] 4.735491 1.131077 5.089446 3.985846 2.875452 5.476531 5.139896
##  [22] 4.268608 4.799965 4.715935 4.269546 2.097126 2.519398 5.256092
##  [29] 1.476960 5.297421 3.233483 2.985152 2.733909 4.497558 5.043534
##  [36] 1.878292 3.806463 1.406105 2.207587 2.153000 5.573685 2.898586
##  [43] 5.528980 5.942369 4.855136 5.068381 1.675114 3.264743 3.244287
##  [50] 4.082191 1.162837 2.916941 1.141624 2.443545 2.081123 1.087785
##  [57] 1.038078 4.575422 2.497680 1.909707 5.820328 3.429680 2.864527
##  [64] 4.853227 1.908789 4.158753 1.423773 1.936554 2.881518 2.303950
##  [71] 4.608743 4.047314 1.253752 3.975929 5.009232 4.758412 1.929186
##  [78] 2.930037 4.798491 3.713033 2.739863 4.088211 3.131588 1.761847
##  [85] 4.726631 4.135154 5.495391 4.439418 4.252587 3.331408 1.911112
##  [92] 2.489018 2.226512 5.965032 3.948164 1.387035 3.948010 4.502604
##  [99] 1.952160 1.130856

Random Distribution (Y)

##   [1]  1.61818630  6.30592507  1.06242430  4.54955750  0.59759739
##   [6]  2.13405962  6.36610361  0.37440473  6.73775487  0.87011700
##  [11]  1.53619354 -1.97605858  8.22661795 -0.89480320  2.65907410
##  [16]  0.83842232  4.77733418  9.86473568  0.60852160  4.11120730
##  [21]  2.52149958  6.44977930  5.37466463  3.62059045  5.24411087
##  [26]  3.96071541  2.05196365 -1.62649161  8.01199615 11.58815912
##  [31]  4.68956476  1.13772981  3.74497967 -2.17141673  0.14851330
##  [36]  0.94566802  2.51595661  3.34982615  4.44460567  1.30835068
##  [41] -0.83946753 -3.93983876  1.65892001  2.12878815  1.22808297
##  [46]  5.22328713  3.44783741 -2.96254709  0.67393496  2.71579333
##  [51]  2.10137836  6.65844472  6.65129043 -0.53664205  0.79213109
##  [56]  5.49671985  1.02182144  4.66606255  6.78101323 -5.87562438
##  [61]  8.47285742  3.58740864  8.62106767  9.78300931  2.13304145
##  [66] -2.03481685  0.72277687  2.51445011 10.36942779 -1.48126164
##  [71]  1.88852508  8.92921384  0.72503954 -0.05551277  1.16175323
##  [76]  1.61164177  4.93815075  4.20384923  3.75939118  6.09449023
##  [81]  0.08338726  2.51958839  5.96193200  7.31381371  1.23312979
##  [86]  7.43726473  3.62776749 -2.97681167 -0.21221006  1.24212541
##  [91]  2.03223620  4.04693979  0.08687414  7.05726494  5.24790171
##  [96] -6.44381429  0.90046214  2.81214574  5.87637592 -0.79329297

P(X>x and Y>y)=P(X>x)P(Y>y)

## [1] "Equation is satisfied"

Fisher’s Exact Test

## $p.value
## [1] 0.7290389

Interpretation of Test : The p-value is 0.03 which is lesser than 5%. Hence the Null Hypothesis can be rejected in favour of alternative hypothesis. This illustrates that the relationship between the two variables is strong.

Chi Square Test

## $p.value
## [1] 0.7290345

Interpretation of Test : The p-value is 0.03 which is lesser than 5%. Hence the Null Hypothesis can be rejected in favour of alternative hypothesis. This illustrates that the relationship between the two variables is strong.

Comparison between Fisher’s and Chi Square :

For larger sample it is preferred to perform chi-square instead of fisher’s exact test. Fisher’s test is accurate for smaller samples.

Problem 2

Descriptive and Inferential Statistics

Univariate Descriptive Statistics and Plots

## Warning: package 'RCurl' was built under R version 3.6.2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11680   40622   47637   48995   55476  129588
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## df$Income 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     3220        0     3074        1    48995    14776    30127    34557 
##      .25      .50      .75      .90      .95 
##    40622    47637    55476    65015    74636 
## 
## lowest :  11680  12653  12820  13413  13462, highest: 112138 114795 115576 117515 129588
## Loading required package: magrittr

## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.

Correlation Test

## 
##  Pearson's product-moment correlation
## 
## data:  df$White and df$Black
## t = -29.718, df = 3218, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.4815974 -0.4461412
## sample estimates:
##        cor 
## -0.4640551
## 
##  Pearson's product-moment correlation
## 
## data:  df$White and df$Asian
## t = -11.771, df = 3218, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.2247327 -0.1814148
## sample estimates:
##        cor 
## -0.2031732
## 
##  Pearson's product-moment correlation
## 
## data:  df$Black and df$Asian
## t = 1.1753, df = 3218, p-value = 0.24
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.001878237  0.043284512
## sample estimates:
##       cor 
## 0.0207137

The Correlation Test for different combinations of the population groups provides p-value less than 0.05 which means that the null hypothesis cannot be rejected and hence the relationships between the population groups is weak, Looking at the dataset and type of data we know that the correlation cannot be achieved and the tests shows the same. Since we are testing using 80% confidence interval there could be false predictions and increasing the confidence level will decrease the family wise error.

Linear Algebra and Correlation

## Warning: package 'Matrix' was built under R version 3.6.2
## Warning: package 'matlib' was built under R version 3.6.3
##                                     
## [1,] 1.05061585 0.2591695 0.09850681
## [2,] 0.25916954 1.3383818 0.61571461
## [3,] 0.09850681 0.6157146 1.28368509
##           df.Asian     df.White      df.Black
## [1,]  1.000000e+00 1.774467e-09 -2.116725e-09
## [2,] -4.869896e-10 1.000000e+00 -1.997272e-09
## [3,] -1.869743e-09 7.815038e-11  1.000000e+00
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]       [,2]       [,3]      
## [1,]  1.0000000          .          .
## [2,] -0.2031732  1.0000000          .
## [3,]  0.0207137 -0.4796461  1.0000000
## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]       [,2]       [,3]      
## [1,]  1.0000000 -0.2031732  0.0207137
## [2,]          .  0.9587207 -0.4598467
## [3,]          .          .  0.7790073
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]       [,2]       [,3]      
## [1,] 1.00000000          .          .
## [2,] 0.24668345 1.00000000          .
## [3,] 0.09376102 0.46405514 1.00000000
## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]       [,2]       [,3]      
## [1,] 1.05061585 0.25916954 0.09850681
## [2,]          . 1.27444899 0.59141461
## [3,]          .          . 1.00000000

MASS Function

## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.

## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.

## [1] 160.99
## [1] 9402.456
## [1] 7095.784
##      5%     95% 
##  961.85 6988.15

Based on the exponential model we built , the 5% and 95% ranges are matching with the empirical range. Hence the model holds good for this data set.

Linear Regression Model

## 
## Call:
## lm(formula = df$Income ~ df$Men + df$Women + df$Hispanic + df$White + 
##     df$Black + df$Asian + df$Native + df$Pacific)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52772  -6961  -1392   5381  52693 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.520e+04  1.301e+04   5.782 8.08e-09 ***
## df$Men       1.345e-02  4.085e-02   0.329 0.741932    
## df$Women    -5.967e-03  3.936e-02  -0.152 0.879519    
## df$Hispanic -4.587e+02  1.309e+02  -3.504 0.000464 ***
## df$White    -2.477e+02  1.324e+02  -1.871 0.061473 .  
## df$Black    -5.310e+02  1.323e+02  -4.015 6.08e-05 ***
## df$Asian     2.157e+03  1.806e+02  11.940  < 2e-16 ***
## df$Native   -4.223e+02  1.424e+02  -2.966 0.003041 ** 
## df$Pacific  -2.993e+03  3.611e+02  -8.289  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11100 on 3211 degrees of freedom
## Multiple R-squared:  0.362,  Adjusted R-squared:  0.3604 
## F-statistic: 227.7 on 8 and 3211 DF,  p-value: < 2.2e-16
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.

Based on the following conditions of Linear Regression, We can conclude that the data is providing strong relationship.

Linearity - The data is linear as the residuals are not following theoretical line.

Homoscedasticity - From Scale location plot we can see that the line is horizontal and angled. This explains the homoscedasticity is satisfied.

Independence - Income and the Population groups are independent variables

Normality - By looking at the QQ Plot we can see that the data is distributed normally