Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of (N+1)/2.
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
5 points
a. P(X>x | X>y)
b. P(X>x, Y>y)
c. P(X<x | X>y)
5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
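Because every step below draws random numbers, fixing a seed first makes the run reproducible. A minimal sketch (the seed value is arbitrary; the original run did not fix one, so the exact values printed below will vary from run to run):

set.seed(605)   # hypothetical seed; any fixed integer makes runif/rnorm reproducible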
n <- 6                      # N, the upper bound of the uniform draws
mean <- (n+1)/2             # mean of the normal variable, (N+1)/2
sd <- (n+1)/2               # standard deviation of the normal variable, (N+1)/2
tot_numbers <- 10000
random_var <- runif(tot_numbers, min=1, max=n)   # X: 10,000 uniform draws on [1, N]
random_var[1:100]
## [1] 2.243115 3.158907 4.697806 2.855386 3.299455 5.546018 4.197046
## [8] 4.189005 5.862730 3.021871 3.910278 1.903285 1.060780 2.424534
## [15] 4.735491 1.131077 5.089446 3.985846 2.875452 5.476531 5.139896
## [22] 4.268608 4.799965 4.715935 4.269546 2.097126 2.519398 5.256092
## [29] 1.476960 5.297421 3.233483 2.985152 2.733909 4.497558 5.043534
## [36] 1.878292 3.806463 1.406105 2.207587 2.153000 5.573685 2.898586
## [43] 5.528980 5.942369 4.855136 5.068381 1.675114 3.264743 3.244287
## [50] 4.082191 1.162837 2.916941 1.141624 2.443545 2.081123 1.087785
## [57] 1.038078 4.575422 2.497680 1.909707 5.820328 3.429680 2.864527
## [64] 4.853227 1.908789 4.158753 1.423773 1.936554 2.881518 2.303950
## [71] 4.608743 4.047314 1.253752 3.975929 5.009232 4.758412 1.929186
## [78] 2.930037 4.798491 3.713033 2.739863 4.088211 3.131588 1.761847
## [85] 4.726631 4.135154 5.495391 4.439418 4.252587 3.331408 1.911112
## [92] 2.489018 2.226512 5.965032 3.948164 1.387035 3.948010 4.502604
## [99] 1.952160 1.130856
hist(random_var,main="Histogram for Random Variables ",
xlab="Uniform Numbers",
border="blue",
col="gray",
xlim=c(1,6),
breaks=6)
random_dist <- rnorm(tot_numbers, mean=mean, sd=sd)
random_dist[1:100]
## [1] 1.61818630 6.30592507 1.06242430 4.54955750 0.59759739
## [6] 2.13405962 6.36610361 0.37440473 6.73775487 0.87011700
## [11] 1.53619354 -1.97605858 8.22661795 -0.89480320 2.65907410
## [16] 0.83842232 4.77733418 9.86473568 0.60852160 4.11120730
## [21] 2.52149958 6.44977930 5.37466463 3.62059045 5.24411087
## [26] 3.96071541 2.05196365 -1.62649161 8.01199615 11.58815912
## [31] 4.68956476 1.13772981 3.74497967 -2.17141673 0.14851330
## [36] 0.94566802 2.51595661 3.34982615 4.44460567 1.30835068
## [41] -0.83946753 -3.93983876 1.65892001 2.12878815 1.22808297
## [46] 5.22328713 3.44783741 -2.96254709 0.67393496 2.71579333
## [51] 2.10137836 6.65844472 6.65129043 -0.53664205 0.79213109
## [56] 5.49671985 1.02182144 4.66606255 6.78101323 -5.87562438
## [61] 8.47285742 3.58740864 8.62106767 9.78300931 2.13304145
## [66] -2.03481685 0.72277687 2.51445011 10.36942779 -1.48126164
## [71] 1.88852508 8.92921384 0.72503954 -0.05551277 1.16175323
## [76] 1.61164177 4.93815075 4.20384923 3.75939118 6.09449023
## [81] 0.08338726 2.51958839 5.96193200 7.31381371 1.23312979
## [86] 7.43726473 3.62776749 -2.97681167 -0.21221006 1.24212541
## [91] 2.03223620 4.04693979 0.08687414 7.05726494 5.24790171
## [96] -6.44381429 0.90046214 2.81214574 5.87637592 -0.79329297
hist(random_dist,main="Histogram for Random Distribution ",
xlab="Random Distribution",
border="blue",
col="brown",
xlim=c(-20,20),
breaks=20)

# small "x" is the median of X; small "y" is the 1st quartile of Y
x <- median(random_var)
x
## [1] 3.473126
y <- quantile(random_dist, 0.25)
y
## 25%
## 1.026606
# a. P(X>x | X>y): given that X exceeds y, the probability that X also exceeds its median x
p_a_and_b = sum(random_var>x & random_var>y)
p_b = sum(random_var>y)
p_a_given_b = p_a_and_b/p_b
p_a_given_b
## [1] 0.5026136
Interpretation: since y (about 1.03) sits far below the median x (about 3.47), conditioning on X > y barely changes anything, and the probability of exceeding the median stays close to 0.5.

# b. P(X>x, Y>y): the joint probability that X exceeds its median and Y exceeds its 1st quartile
p_x = sum(random_var>x)/tot_numbers
p_x
## [1] 0.5
p_y = sum(random_dist>y)/tot_numbers
p_y
## [1] 0.75
# X and Y are drawn independently, so the joint probability is the product of the marginals
p_x*p_y
## [1] 0.375

Interpretation: by definition of the median and the 1st quartile, P(X>x) = 0.5 and P(Y>y) = 0.75, and independence gives a joint probability of 0.375.

# c. P(X<x | X>y): given that X exceeds y, the probability that X falls below its median x
p_c = sum(random_var<x & random_var>y)/sum(random_var>y)
p_c
## [1] 0.4973864

Interpretation: this is the complement of part a within the same conditioning event (0.4973864 = 1 - 0.5026136).
# counts for the four cells of the 2x2 table (X below/above x, Y below/above y)
r1<-sum(random_var<x & random_dist<y)
r2<-sum(random_var<x & random_dist>y)
r3<-sum(random_var>x & random_dist<y)
r4<-sum(random_var>x & random_dist>y)
prob_table <- data.frame()
prob_table[1,1]=r1
prob_table[1,2]=r2
prob_table[2,1]=r3
prob_table[2,2]=r4
prob_table[1,3]=r1+r2
prob_table[2,3]=r3+r4
prob_table[3,1]=r1+r3
prob_table[3,2]=r2+r4
prob_table[3,3]=r1+r2+r3+r4
colnames(prob_table)<-c("<y",">y","Tot")
rownames(prob_table)<-c("<x",">x","Tot")
prob_table

p_a_and_b = r4/tot_numbers
p_a = (r3+r4)/tot_numbers
p_b = (r2+r4)/tot_numbers
# the joint probability and the product of marginals should match (to 2 decimals) if X and Y are independent
if(round(p_a_and_b,2) == round(p_a*p_b,2))
{
print("Equation is satisfied")
}## [1] "Equation is satisfied"
# 2x2 table for the tests; matrix(c(r1, r3, r2, r4), nrow = 2) fills column-wise,
# so rows index X (<x, >x) and columns index Y (<y, >y)
prob_table_fis <- matrix(c(r1, r3, r2, r4), nrow = 2,
                         dimnames =
                           list(c("<x", ">x"),
                                c("<y", ">y")))
fis_res<-fisher.test(prob_table_fis)
fis_res[1]
## $p.value
## [1] 0.7290389
Interpretation of Test: the p-value is 0.73, far above the 5% significance level, so we fail to reject the null hypothesis of independence. There is no evidence of association between the events X>x and Y>y, which is exactly what we expect since X and Y were generated independently.
# the chi-square test is run on the same 2x2 table
chi_sq<-chisq.test(prob_table_fis)
chi_sq[3]
## $p.value
## [1] 0.7290345
Interpretation of Test: the chi-square p-value is 0.73, essentially identical to Fisher's result, so again we fail to reject the null hypothesis of independence between the two events.
Comparison between Fisher's and Chi Square:
Fisher's exact test computes an exact p-value from the hypergeometric distribution and is the appropriate choice when expected cell counts are small (a common rule of thumb is expected counts below 5). The chi-square test relies on a large-sample approximation but is computationally cheap and standard for large tables. With 10,000 observations and expected counts in the thousands, the chi-square test is the most appropriate here; reassuringly, the two tests agree to four decimal places (p ≈ 0.729).
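To see why sample size drives the choice, a minimal sketch on a hypothetical small 2x2 table, where the exact and asymptotic p-values visibly diverge:

# hypothetical small-count table: expected cell counts fall below 5,
# so the chi-square approximation becomes unreliable
small_tab <- matrix(c(8, 2, 1, 5), nrow = 2)
fisher.test(small_tab)$p.value   # exact p-value
chisq.test(small_tab)$p.value    # asymptotic p-value (R warns it may be inaccurate)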
library(RCurl)
file_url <- getURL("https://raw.githubusercontent.com/jey1987/DATA605/master/Final_Exam/acs2017_county_data.csv")
df <- read.csv(text=file_url,header=TRUE)
summary(df$Income)
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  11680   40622   47637   48995   55476  129588
library(Hmisc)
describe(df$Income)
## df$Income
## n missing distinct Info Mean Gmd .05 .10
## 3220 0 3074 1 48995 14776 30127 34557
## .25 .50 .75 .90 .95
## 40622 47637 55476 65015 74636
##
## lowest : 11680 12653 12820 13413 13462, highest: 112138 114795 115576 117515 129588
library(corrplot)
library(RColorBrewer)
cor_df<-data.frame(df$Asian,df$White,df$Black)
cor <-cor(cor_df, method="pearson")   # cor() takes a single method, not a vector
corrplot(cor, method = "square", type = "upper", tl.col = "black", order = "hclust", col = brewer.pal(n = 5, name = "RdYlBu"))

# pairwise correlation tests at the 80% confidence level
cor.test(df$White, df$Black, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: df$White and df$Black
## t = -29.718, df = 3218, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.4815974 -0.4461412
## sample estimates:
## cor
## -0.4640551
cor.test(df$White, df$Asian, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: df$White and df$Asian
## t = -11.771, df = 3218, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.2247327 -0.1814148
## sample estimates:
## cor
## -0.2031732
cor.test(df$Black, df$Asian, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: df$Black and df$Asian
## t = 1.1753, df = 3218, p-value = 0.24
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.001878237 0.043284512
## sample estimates:
## cor
## 0.0207137
The correlation tests for White–Black and White–Asian give p-values below 0.05 (both < 2.2e-16), so for those pairs the null hypothesis of zero correlation is rejected: the White and Black population shares are moderately negatively correlated (-0.46), and the White and Asian shares weakly negatively correlated (-0.20). For Black–Asian the p-value is 0.24, so the null cannot be rejected and there is no evidence of correlation. Because three tests are run at only an 80% confidence level, the family-wise chance of a false positive is inflated; raising the confidence level, or applying a multiple-comparison correction such as Bonferroni (sketched below), would reduce the family-wise error rate.
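One standard way to control the family-wise error across the three tests is a Bonferroni adjustment; a minimal sketch using p.adjust on the same pairwise tests:

# collect the three pairwise p-values and apply a Bonferroni correction
pvals <- c(cor.test(df$White, df$Black)$p.value,
           cor.test(df$White, df$Asian)$p.value,
           cor.test(df$Black, df$Asian)$p.value)
p.adjust(pvals, method = "bonferroni")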
library(Matrix)
library(matlib)

# precision matrix: the inverse of the correlation matrix
prec_mat <- inv(cor)
prec_mat
##            [,1]      [,2]       [,3]
## [1,] 1.05061585 0.2591695 0.09850681
## [2,] 0.25916954 1.3383818 0.61571461
## [3,] 0.09850681 0.6157146 1.28368509
# the precision matrix times the correlation matrix recovers the identity (up to rounding)
prec_mat %*% cor
## df.Asian df.White df.Black
## [1,] 1.000000e+00 1.774467e-09 -2.116725e-09
## [2,] -4.869896e-10 1.000000e+00 -1.997272e-09
## [3,] -1.869743e-09 7.815038e-11 1.000000e+00
# LU decomposition of the correlation matrix (L is unit lower triangular)
lu_cor <- expand(lu(Matrix(cor)))
lu_cor$L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.0000000 . .
## [2,] -0.2031732 1.0000000 .
## [3,] 0.0207137 -0.4796461 1.0000000
lu_cor$U
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.0000000 -0.2031732 0.0207137
## [2,] . 0.9587207 -0.4598467
## [3,] . . 0.7790073
# LU decomposition of the precision matrix
lu_prec <- expand(lu(Matrix(prec_mat)))
lu_prec$L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.00000000 . .
## [2,] 0.24668345 1.00000000 .
## [3,] 0.09376102 0.46405514 1.00000000
lu_prec$U
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.05061585 0.25916954 0.09850681
## [2,] . 1.27444899 0.59141461
## [3,] . . 1.00000000
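As a quick correctness check on the factorizations (a sketch using the lu_cor and lu_prec objects defined above; no row pivoting occurred here, so the permutation factor is the identity and L %*% U should reproduce each matrix):

# maximum absolute reconstruction error; both values should be ~0
max(abs(as.matrix(lu_cor$L %*% lu_cor$U) - cor))
max(abs(as.matrix(lu_prec$L %*% lu_prec$U) - prec_mat))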
library(MASS)
# fit an exponential distribution to IncomeErr; the MLE of the rate (lambda) is 1/mean
est<-fitdistr(df$IncomeErr,"exponential")
# simulate 1000 draws from the fitted exponential
df_est<-data.frame(rexp(1000,est$estimate))
names(df_est)[1] <- "estimate"
library(ggpubr)
gghistogram(df, x = "IncomeErr",palette = c("#00AFBB"),fill="lightgray",rug=TRUE)
# second histogram (call assumed): the simulated exponential sample for comparison
gghistogram(df_est, x = "estimate",palette = c("#00AFBB"),fill="lightgray",rug=TRUE)
# 5th and 95th percentiles implied by the fitted exponential CDF
qexp(0.05, rate=est$estimate)
## [1] 160.99
qexp(0.95, rate=est$estimate)
## [1] 9402.456
## [1] 7095.784
# empirical 5th and 95th percentiles of IncomeErr
quantile(df$IncomeErr, probs=c(0.05, 0.95))
##      5%     95%
##  961.85 6988.15
Based on the fitted exponential model, the theoretical 5th and 95th percentiles (about 161 and 9,402) are of the same order of magnitude as the empirical 5th and 95th percentiles (about 962 and 6,988), but the exponential range is noticeably wider: it places more mass near zero and in the far right tail than the data show. The model captures the right-skewed shape of IncomeErr, but only as a rough approximation.
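A visual version of the same comparison (a sketch with base graphics, using the fitted rate est$estimate):

# overlay the fitted exponential density on the empirical histogram
hist(df$IncomeErr, breaks = 40, freq = FALSE,
     main = "IncomeErr with fitted exponential density", xlab = "IncomeErr")
curve(dexp(x, rate = est$estimate), add = TRUE, col = "red", lwd = 2)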
lm_out <- lm(df$Income~df$Men+df$Women+df$Hispanic+df$White+df$Black+df$Asian+df$Native+df$Pacific)
summary(lm_out)
##
## Call:
## lm(formula = df$Income ~ df$Men + df$Women + df$Hispanic + df$White +
## df$Black + df$Asian + df$Native + df$Pacific)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52772 -6961 -1392 5381 52693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.520e+04 1.301e+04 5.782 8.08e-09 ***
## df$Men 1.345e-02 4.085e-02 0.329 0.741932
## df$Women -5.967e-03 3.936e-02 -0.152 0.879519
## df$Hispanic -4.587e+02 1.309e+02 -3.504 0.000464 ***
## df$White -2.477e+02 1.324e+02 -1.871 0.061473 .
## df$Black -5.310e+02 1.323e+02 -4.015 6.08e-05 ***
## df$Asian 2.157e+03 1.806e+02 11.940 < 2e-16 ***
## df$Native -4.223e+02 1.424e+02 -2.966 0.003041 **
## df$Pacific -2.993e+03 3.611e+02 -8.289 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11100 on 3211 degrees of freedom
## Multiple R-squared: 0.362, Adjusted R-squared: 0.3604
## F-statistic: 227.7 on 8 and 3211 DF, p-value: < 2.2e-16
Based on the standard diagnostic checks for linear regression (the plots can be reproduced with the sketch after this list), the model appears adequate:
Linearity - the residuals-versus-fitted plot shows no systematic curvature, so a linear relationship is reasonable.
Homoscedasticity - the scale-location plot is roughly horizontal with evenly spread points, so constant variance approximately holds.
Independence - each observation is a distinct county, so the observations can reasonably be treated as independent.
Normality - in the Q-Q plot the residuals track the theoretical line apart from the tails, so the residuals are approximately normal.
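The four checks above correspond to the standard plot() diagnostics for an lm object; a minimal sketch to reproduce them:

# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(lm_out)
par(mfrow = c(1, 1))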
file_url <- getURL("https://raw.githubusercontent.com/jey1987/DATA605/master/Final_Exam/acs2017_county_data.csv")
df <- read.csv(text=file_url,header=TRUE)
# refit using a data= formula: with the df$ form used above, predict() would ignore newdata
lm_out <- lm(Income~Men+Women+Hispanic+White+Black+Asian+Native+Pacific, data=df)
file_url <- getURL("https://raw.githubusercontent.com/jey1987/DATA605/master/Final_Exam/acs2015_county_data.csv")
df_test <- read.csv(text=file_url,header=TRUE)
pred<-predict(lm_out, newdata=df_test)
kaggle <- data.frame( Id = df_test[,"CensusId"], Income=pred)
kaggle[kaggle<0] <- 0
kaggle <- replace(kaggle,is.na(kaggle),0)
write.csv(kaggle, file="kaggle.csv", row.names = FALSE)
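As a rough sanity check on the exported predictions (an assumption: the 2015 file also carries an Income column, so this comparison does not apply if it does not), one could compute an out-of-sample error:

# hypothetical check, assuming df_test$Income exists in the 2015 data
rmse <- sqrt(mean((pred - df_test$Income)^2, na.rm = TRUE))
rmse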