Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean and standard deviation of (N+1)/2.
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
5 points
a. P(X>x | X>y)
b. P(X>x, Y>y)
c. P(X<x | X>y)
5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
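Because every step below draws random numbers, fixing a seed first makes the run reproducible. A minimal sketch (the seed value is arbitrary; the original run did not fix one, so the exact values printed below will vary from run to run):

set.seed(605)   # hypothetical seed; any fixed integer makes runif/rnorm reproducible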
n <- 6                      # N, the upper bound of the uniform draws
mean <- (n+1)/2             # mean of the normal variable, (N+1)/2
sd <- (n+1)/2               # standard deviation of the normal variable, (N+1)/2
tot_numbers <- 10000
random_var <- runif(tot_numbers, min=1, max=n)   # X: 10,000 uniform draws on [1, N]
random_var[1:100]
## [1] 2.243115 3.158907 4.697806 2.855386 3.299455 5.546018 4.197046
## [8] 4.189005 5.862730 3.021871 3.910278 1.903285 1.060780 2.424534
## [15] 4.735491 1.131077 5.089446 3.985846 2.875452 5.476531 5.139896
## [22] 4.268608 4.799965 4.715935 4.269546 2.097126 2.519398 5.256092
## [29] 1.476960 5.297421 3.233483 2.985152 2.733909 4.497558 5.043534
## [36] 1.878292 3.806463 1.406105 2.207587 2.153000 5.573685 2.898586
## [43] 5.528980 5.942369 4.855136 5.068381 1.675114 3.264743 3.244287
## [50] 4.082191 1.162837 2.916941 1.141624 2.443545 2.081123 1.087785
## [57] 1.038078 4.575422 2.497680 1.909707 5.820328 3.429680 2.864527
## [64] 4.853227 1.908789 4.158753 1.423773 1.936554 2.881518 2.303950
## [71] 4.608743 4.047314 1.253752 3.975929 5.009232 4.758412 1.929186
## [78] 2.930037 4.798491 3.713033 2.739863 4.088211 3.131588 1.761847
## [85] 4.726631 4.135154 5.495391 4.439418 4.252587 3.331408 1.911112
## [92] 2.489018 2.226512 5.965032 3.948164 1.387035 3.948010 4.502604
## [99] 1.952160 1.130856
hist(random_var,main="Histogram for Random Variables ",
xlab="Uniform Numbers",
border="blue",
col="gray",
xlim=c(1,6),
breaks=6)
random_dist <- rnorm(tot_numbers, mean=mean, sd=sd)
random_dist[1:100]
## [1] 1.61818630 6.30592507 1.06242430 4.54955750 0.59759739
## [6] 2.13405962 6.36610361 0.37440473 6.73775487 0.87011700
## [11] 1.53619354 -1.97605858 8.22661795 -0.89480320 2.65907410
## [16] 0.83842232 4.77733418 9.86473568 0.60852160 4.11120730
## [21] 2.52149958 6.44977930 5.37466463 3.62059045 5.24411087
## [26] 3.96071541 2.05196365 -1.62649161 8.01199615 11.58815912
## [31] 4.68956476 1.13772981 3.74497967 -2.17141673 0.14851330
## [36] 0.94566802 2.51595661 3.34982615 4.44460567 1.30835068
## [41] -0.83946753 -3.93983876 1.65892001 2.12878815 1.22808297
## [46] 5.22328713 3.44783741 -2.96254709 0.67393496 2.71579333
## [51] 2.10137836 6.65844472 6.65129043 -0.53664205 0.79213109
## [56] 5.49671985 1.02182144 4.66606255 6.78101323 -5.87562438
## [61] 8.47285742 3.58740864 8.62106767 9.78300931 2.13304145
## [66] -2.03481685 0.72277687 2.51445011 10.36942779 -1.48126164
## [71] 1.88852508 8.92921384 0.72503954 -0.05551277 1.16175323
## [76] 1.61164177 4.93815075 4.20384923 3.75939118 6.09449023
## [81] 0.08338726 2.51958839 5.96193200 7.31381371 1.23312979
## [86] 7.43726473 3.62776749 -2.97681167 -0.21221006 1.24212541
## [91] 2.03223620 4.04693979 0.08687414 7.05726494 5.24790171
## [96] -6.44381429 0.90046214 2.81214574 5.87637592 -0.79329297
hist(random_dist,main="Histogram for Random Distribution ",
xlab="Random Distribution",
border="blue",
col="brown",
xlim=c(-20,20),
breaks=20)

# small "x" is the median of X; small "y" is the 1st quartile of Y
x <- median(random_var)
x
## [1] 3.473126
y <- quantile(random_dist, 0.25)
y
## 25%
## 1.026606
# a. P(X>x | X>y): given that X exceeds y, the probability that X also exceeds its median x
p_a_and_b = sum(random_var>x & random_var>y)
p_b = sum(random_var>y)
p_a_given_b = p_a_and_b/p_b
p_a_given_b
## [1] 0.5026136
Interpretation: since y (about 1.03) sits far below the median x (about 3.47), conditioning on X > y barely changes anything, and the probability of exceeding the median stays close to 0.5.

# b. P(X>x, Y>y): the joint probability that X exceeds its median and Y exceeds its 1st quartile
p_x = sum(random_var>x)/tot_numbers
p_x
## [1] 0.5
p_y = sum(random_dist>y)/tot_numbers
p_y
## [1] 0.75
# X and Y are drawn independently, so the joint probability is the product of the marginals
p_x*p_y
## [1] 0.375

Interpretation: by definition of the median and the 1st quartile, P(X>x) = 0.5 and P(Y>y) = 0.75, and independence gives a joint probability of 0.375.

# c. P(X<x | X>y): given that X exceeds y, the probability that X falls below its median x
p_c = sum(random_var<x & random_var>y)/sum(random_var>y)
p_c
## [1] 0.4973864

Interpretation: this is the complement of part a within the same conditioning event (0.4973864 = 1 - 0.5026136).
# counts for the four cells of the 2x2 table (X below/above x, Y below/above y)
r1<-sum(random_var<x & random_dist<y)
r2<-sum(random_var<x & random_dist>y)
r3<-sum(random_var>x & random_dist<y)
r4<-sum(random_var>x & random_dist>y)
prob_table <- data.frame()
prob_table[1,1]=r1
prob_table[1,2]=r2
prob_table[2,1]=r3
prob_table[2,2]=r4
prob_table[1,3]=r1+r2
prob_table[2,3]=r3+r4
prob_table[3,1]=r1+r3
prob_table[3,2]=r2+r4
prob_table[3,3]=r1+r2+r3+r4
colnames(prob_table)<-c("<y",">y","Tot")
rownames(prob_table)<-c("<x",">x","Tot")
prob_table

p_a_and_b = r4/tot_numbers
p_a = (r3+r4)/tot_numbers
p_b = (r2+r4)/tot_numbers
# the joint probability and the product of marginals should match (to 2 decimals) if X and Y are independent
if(round(p_a_and_b,2) == round(p_a*p_b,2))
{
print("Equation is satisfied")
}## [1] "Equation is satisfied"
# 2x2 table for the tests; matrix(c(r1, r3, r2, r4), nrow = 2) fills column-wise,
# so rows index X (<x, >x) and columns index Y (<y, >y)
prob_table_fis <- matrix(c(r1, r3, r2, r4), nrow = 2,
                         dimnames =
                           list(c("<x", ">x"),
                                c("<y", ">y")))
fis_res<-fisher.test(prob_table_fis)
fis_res[1]
## $p.value
## [1] 0.7290389
Interpretation of Test: the p-value is 0.73, far above the 5% significance level, so we fail to reject the null hypothesis of independence. There is no evidence of association between the events X>x and Y>y, which is exactly what we expect since X and Y were generated independently.
# the chi-square test is run on the same 2x2 table
chi_sq<-chisq.test(prob_table_fis)
chi_sq[3]
## $p.value
## [1] 0.7290345
Interpretation of Test: the chi-square p-value is 0.73, essentially identical to Fisher's result, so again we fail to reject the null hypothesis of independence between the two events.
Comparison between Fisher's and Chi Square:
Fisher's exact test computes an exact p-value from the hypergeometric distribution and is the appropriate choice when expected cell counts are small (a common rule of thumb is expected counts below 5). The chi-square test relies on a large-sample approximation but is computationally cheap and standard for large tables. With 10,000 observations and expected counts in the thousands, the chi-square test is the most appropriate here; reassuringly, the two tests agree to four decimal places (p ≈ 0.729).
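To see why sample size drives the choice, a minimal sketch on a hypothetical small 2x2 table, where the exact and asymptotic p-values visibly diverge:

# hypothetical small-count table: expected cell counts fall below 5,
# so the chi-square approximation becomes unreliable
small_tab <- matrix(c(8, 2, 1, 5), nrow = 2)
fisher.test(small_tab)$p.value   # exact p-value
chisq.test(small_tab)$p.value    # asymptotic p-value (R warns it may be inaccurate)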
library(RCurl)
file_url <- getURL("https://raw.githubusercontent.com/jey1987/DATA605/master/Final_Exam/acs2017_county_data.csv")
df <- read.csv(text=file_url,header=TRUE)
summary(df$Income)
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  11680   40622   47637   48995   55476  129588
library(Hmisc)
describe(df$Income)
## df$Income
## n missing distinct Info Mean Gmd .05 .10
## 3220 0 3074 1 48995 14776 30127 34557
## .25 .50 .75 .90 .95
## 40622 47637 55476 65015 74636
##
## lowest : 11680 12653 12820 13413 13462, highest: 112138 114795 115576 117515 129588
library(corrplot)
library(RColorBrewer)
cor_df<-data.frame(df$Asian,df$White,df$Black)
cor <-cor(cor_df, method="pearson")   # cor() takes a single method, not a vector
corrplot(cor, method = "square", type = "upper", tl.col = "black", order = "hclust", col = brewer.pal(n = 5, name = "RdYlBu"))

# pairwise correlation tests at the 80% confidence level
cor.test(df$White, df$Black, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: df$White and df$Black
## t = -29.718, df = 3218, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.4815974 -0.4461412
## sample estimates:
## cor
## -0.4640551
cor.test(df$White, df$Asian, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: df$White and df$Asian
## t = -11.771, df = 3218, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.2247327 -0.1814148
## sample estimates:
## cor
## -0.2031732
cor.test(df$Black, df$Asian, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: df$Black and df$Asian
## t = 1.1753, df = 3218, p-value = 0.24
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.001878237 0.043284512
## sample estimates:
## cor
## 0.0207137
The correlation tests for White–Black and White–Asian give p-values below 0.05 (both < 2.2e-16), so for those pairs the null hypothesis of zero correlation is rejected: the White and Black population shares are moderately negatively correlated (-0.46), and the White and Asian shares weakly negatively correlated (-0.20). For Black–Asian the p-value is 0.24, so the null cannot be rejected and there is no evidence of correlation. Because three tests are run at only an 80% confidence level, the family-wise chance of a false positive is inflated; raising the confidence level, or applying a multiple-comparison correction such as Bonferroni (sketched below), would reduce the family-wise error rate.
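One standard way to control the family-wise error across the three tests is a Bonferroni adjustment; a minimal sketch using p.adjust on the same pairwise tests:

# collect the three pairwise p-values and apply a Bonferroni correction
pvals <- c(cor.test(df$White, df$Black)$p.value,
           cor.test(df$White, df$Asian)$p.value,
           cor.test(df$Black, df$Asian)$p.value)
p.adjust(pvals, method = "bonferroni")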
library(Matrix)
library(matlib)

# precision matrix: the inverse of the correlation matrix
prec_mat <- inv(cor)
prec_mat
##            [,1]      [,2]       [,3]
## [1,] 1.05061585 0.2591695 0.09850681
## [2,] 0.25916954 1.3383818 0.61571461
## [3,] 0.09850681 0.6157146 1.28368509
# the precision matrix times the correlation matrix recovers the identity (up to rounding)
prec_mat %*% cor
## df.Asian df.White df.Black
## [1,] 1.000000e+00 1.774467e-09 -2.116725e-09
## [2,] -4.869896e-10 1.000000e+00 -1.997272e-09
## [3,] -1.869743e-09 7.815038e-11 1.000000e+00
# LU decomposition of the correlation matrix (L is unit lower triangular)
lu_cor <- expand(lu(Matrix(cor)))
lu_cor$L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.0000000 . .
## [2,] -0.2031732 1.0000000 .
## [3,] 0.0207137 -0.4796461 1.0000000
lu_cor$U
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.0000000 -0.2031732 0.0207137
## [2,] . 0.9587207 -0.4598467
## [3,] . . 0.7790073
# LU decomposition of the precision matrix
lu_prec <- expand(lu(Matrix(prec_mat)))
lu_prec$L
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.00000000 . .
## [2,] 0.24668345 1.00000000 .
## [3,] 0.09376102 0.46405514 1.00000000
lu_prec$U
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.05061585 0.25916954 0.09850681
## [2,] . 1.27444899 0.59141461
## [3,] . . 1.00000000
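As a quick correctness check on the factorizations (a sketch using the lu_cor and lu_prec objects defined above; no row pivoting occurred here, so the permutation factor is the identity and L %*% U should reproduce each matrix):

# maximum absolute reconstruction error; both values should be ~0
max(abs(as.matrix(lu_cor$L %*% lu_cor$U) - cor))
max(abs(as.matrix(lu_prec$L %*% lu_prec$U) - prec_mat))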
library(MASS)
# fit an exponential distribution to IncomeErr; the MLE of the rate (lambda) is 1/mean
est<-fitdistr(df$IncomeErr,"exponential")
# simulate 1000 draws from the fitted exponential
df_est<-data.frame(rexp(1000,est$estimate))
names(df_est)[1] <- "estimate"
library(ggpubr)
gghistogram(df, x = "IncomeErr",palette = c("#00AFBB"),fill="lightgray",rug=TRUE)
# second histogram (call assumed): the simulated exponential sample for comparison
gghistogram(df_est, x = "estimate",palette = c("#00AFBB"),fill="lightgray",rug=TRUE)
# 5th and 95th percentiles implied by the fitted exponential CDF
qexp(0.05, rate=est$estimate)
## [1] 160.99
qexp(0.95, rate=est$estimate)
## [1] 9402.456
## [1] 7095.784
# empirical 5th and 95th percentiles of IncomeErr
quantile(df$IncomeErr, probs=c(0.05, 0.95))
##      5%     95%
##  961.85 6988.15
Based on the fitted exponential model, the theoretical 5th and 95th percentiles (about 161 and 9,402) are of the same order of magnitude as the empirical 5th and 95th percentiles (about 962 and 6,988), but the exponential range is noticeably wider: it places more mass near zero and in the far right tail than the data show. The model captures the right-skewed shape of IncomeErr, but only as a rough approximation.
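A visual version of the same comparison (a sketch with base graphics, using the fitted rate est$estimate):

# overlay the fitted exponential density on the empirical histogram
hist(df$IncomeErr, breaks = 40, freq = FALSE,
     main = "IncomeErr with fitted exponential density", xlab = "IncomeErr")
curve(dexp(x, rate = est$estimate), add = TRUE, col = "red", lwd = 2)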
lm_out <- lm(df$Income~df$Men+df$Women+df$Hispanic+df$White+df$Black+df$Asian+df$Native+df$Pacific)
summary(lm_out)
##
## Call:
## lm(formula = df$Income ~ df$Men + df$Women + df$Hispanic + df$White +
## df$Black + df$Asian + df$Native + df$Pacific)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52772 -6961 -1392 5381 52693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.520e+04 1.301e+04 5.782 8.08e-09 ***
## df$Men 1.345e-02 4.085e-02 0.329 0.741932
## df$Women -5.967e-03 3.936e-02 -0.152 0.879519
## df$Hispanic -4.587e+02 1.309e+02 -3.504 0.000464 ***
## df$White -2.477e+02 1.324e+02 -1.871 0.061473 .
## df$Black -5.310e+02 1.323e+02 -4.015 6.08e-05 ***
## df$Asian 2.157e+03 1.806e+02 11.940 < 2e-16 ***
## df$Native -4.223e+02 1.424e+02 -2.966 0.003041 **
## df$Pacific -2.993e+03 3.611e+02 -8.289 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11100 on 3211 degrees of freedom
## Multiple R-squared: 0.362, Adjusted R-squared: 0.3604
## F-statistic: 227.7 on 8 and 3211 DF, p-value: < 2.2e-16
Based on the standard diagnostic checks for linear regression (the plots can be reproduced with the sketch after this list), the model appears adequate:
Linearity - the residuals-versus-fitted plot shows no systematic curvature, so a linear relationship is reasonable.
Homoscedasticity - the scale-location plot is roughly horizontal with evenly spread points, so constant variance approximately holds.
Independence - each observation is a distinct county, so the observations can reasonably be treated as independent.
Normality - in the Q-Q plot the residuals track the theoretical line apart from the tails, so the residuals are approximately normal.
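The four checks above correspond to the standard plot() diagnostics for an lm object; a minimal sketch to reproduce them:

# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(lm_out)
par(mfrow = c(1, 1))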
file_url <- getURL("https://raw.githubusercontent.com/jey1987/DATA605/master/Final_Exam/acs2017_county_data.csv")
df <- read.csv(text=file_url,header=TRUE)
# refit using a data= formula: with the df$ form used above, predict() would ignore newdata
lm_out <- lm(Income~Men+Women+Hispanic+White+Black+Asian+Native+Pacific, data=df)
file_url <- getURL("https://raw.githubusercontent.com/jey1987/DATA605/master/Final_Exam/acs2015_county_data.csv")
df_test <- read.csv(text=file_url,header=TRUE)
pred<-predict(lm_out, newdata=df_test)
kaggle <- data.frame( Id = df_test[,"CensusId"], Income=pred)
kaggle[kaggle<0] <- 0
kaggle <- replace(kaggle,is.na(kaggle),0)
write.csv(kaggle, file="kaggle.csv", row.names = FALSE)
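As a rough sanity check on the exported predictions (an assumption: the 2015 file also carries an Income column, so this comparison does not apply if it does not), one could compute an out-of-sample error:

# hypothetical check, assuming df_test$Income exists in the 2015 data
rmse <- sqrt(mean((pred - df_test$Income)^2, na.rm = TRUE))
rmse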