1. A single person is randomly selected for jury duty. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.
# Probability statement: P(X >= 110), where X ~ N(mean = 100, sd = 16)
x <- pnorm(110, mean = 100, sd = 16, lower.tail = FALSE)
x
## [1] 0.2659855
library(ggplot2)
ggplot(data.frame(x),aes(x)) + geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

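# The same calculation with named parameters
q <- 110       # IQ threshold
mu <- 100      # population mean
sigma <- 16    # population standard deviation
x <- pnorm(q, mean = mu, sd = sigma, lower.tail = FALSE)
x
## [1] 0.2659855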

If you draw samples from a normal distribution, then the distribution of sample means is also normal. The mean of the distribution of sample means is identical to the mean of the “parent population,” the population from which the samples are drawn. The larger the sample size, the narrower the spread of the distribution of sample means: its standard deviation (the standard error) is σ/√n.
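
A short simulation can illustrate this behaviour. The sketch below assumes the same N(100, 16) population used in Question 1; the helper name `sample_means`, the number of replications, and the sample sizes are illustrative choices and not part of the assignment.

```r
# Simulate the sampling distribution of the mean for several sample sizes.
# Assumption: the parent population is N(mean = 100, sd = 16), as in Question 1.
set.seed(42)
mu <- 100
sigma <- 16

# Hypothetical helper: draw `reps` samples of size n and return their means
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
}

sizes <- c(4, 16, 64)
sim <- t(sapply(sizes, function(n) {
  m <- sample_means(n)
  c(n = n,
    mean_of_means = mean(m),           # should stay close to mu
    sd_of_means = sd(m),               # should shrink as n grows
    theoretical_se = sigma / sqrt(n))  # sigma / sqrt(n)
}))
round(sim, 2)
```

In each row the simulated standard deviation of the sample means should sit close to `sigma / sqrt(n)`, while the mean of the sample means stays near 100.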

2.

Is college worth it? Among a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school.

  1. A newspaper article states that only a minority of the Americans who decide not to go to college do so because they cannot afford it and uses the point estimate from this survey as evidence. Conduct a hypothesis test with α=.05 to determine if these data provide strong evidence that more than 50% of adult Americans do not go to college for financial reasons. What are the critical value, test statistic, and p-value? Do you reject or fail to reject the null?
     Step 1: Null/Alternative hypothesis - Ho: p = .5 Ha: p ≠ .5
     Step 2: Significance level
     Step 3: Distribution
     Step 4: Test Statistic from sample
  2. Would you expect a confidence interval (α=.05 , double sided) for the proportion of American adults who decide not to go to college for financial reasons to include 0.5 (hypothesized value)? Show, or explain.

Two-sided: Ho: p = .5 Ha: p ≠ .5

Left-tailed: Ho: p = .5 Ha: p < .5

Right-tailed: Ho: p = .5 Ha: p > .5

Ho: >= 50% of American adults who decided not to go to college did so because they could not afford it (p >= .5)

Ha: < 50% of American adults who decided not to go to college did so because they could not afford it (p < .5)

z is the test statistic

π is the population proportion under Ho = 0.50

p is the sample proportion = 159 / 331 = 0.48

n is the sample size = 331

Decide Alpha

alpha<-0.05

Calculating the standard error (Se) under Ho

Se <- sqrt(p01*(1-p01)/n)

phat = .48
p01 = .50
n = 331
# the test statistic is a z=(p hat-p)/sqrt(p(1-p)/n)
zee1 <-(phat-p01)/sqrt(p01*(1-p01)/n) 
zee1
## [1] -0.7277362
phat = .48
p01 = .50
n = 331

Se<-sqrt((phat*(1-phat))/n)
Se
## [1] 0.02746049

With a standard error of .027 (2.7%) and a sample proportion of .48 (48%), we can form a 95% confidence interval. We are 95% confident that the population proportion falls within 1.96 SEs of the sample proportion, in the range (42.6%, 53.4%). The 95% confidence interval therefore contains .5.
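
The interval can be reproduced in a few lines. This is a minimal sketch using the normal approximation with the sample values quoted above; the variable names `conf_level`, `z_star`, and `ci` are illustrative.

```r
# 95% confidence interval for a proportion (normal approximation).
phat <- 0.48          # sample proportion
n <- 331              # sample size
conf_level <- 0.95

se <- sqrt(phat * (1 - phat) / n)            # standard error based on phat
z_star <- qnorm(1 - (1 - conf_level) / 2)    # about 1.96 for 95%
ci <- phat + c(-1, 1) * z_star * se          # lower and upper limits
round(ci, 3)

# Does the interval contain the hypothesized value 0.5?
contains_half <- ci[1] <= 0.5 && 0.5 <= ci[2]
contains_half
```

Note that the interval uses the sample proportion in the standard error, whereas the test statistic uses the hypothesized value 0.5.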

pval<-pnorm(zee1)              # one-sided (lower-tail) p-value
pval
## [1] 0.2333875
pvalue<-2*pnorm(-abs(zee1))    # two-sided p-value
round(pvalue, 4)
## [1] 0.4668
alpha<-0.05
alpha
## [1] 0.05
# compute critical value (split alpha since we have double sided hypothesis)
critical_value  <- qnorm(p = .975, 
                          mean = 0,
                          sd = 1
                )
critical_value
## [1] 1.959964
# Another way to find the critical value
Zcritical<-qnorm(p=.05/2, lower.tail=FALSE)
Zcritical
## [1] 1.959964
## The smaller the p-value, the stronger the evidence against the null hypothesis.
## A p-value of at most 0.05 (the chosen alpha) is statistically significant.
## Here |z| = 0.73 is below the critical value 1.96, so we fail to reject Ho.
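
The test statistic, p-value, critical value, and decision can be bundled into one small function. The helper below, `prop_z_test`, is a hypothetical name introduced for illustration; it is a sketch of a two-sided one-sample z-test for a proportion, not a function from any package.

```r
# Hypothetical helper: two-sided one-sample z-test for a proportion.
prop_z_test <- function(phat, p0, n, alpha = 0.05) {
  se0 <- sqrt(p0 * (1 - p0) / n)     # standard error under Ho
  z <- (phat - p0) / se0             # test statistic
  p_value <- 2 * pnorm(-abs(z))      # two-sided p-value, always within [0, 1]
  z_crit <- qnorm(1 - alpha / 2)     # two-sided critical value
  list(z = z, p_value = p_value, z_crit = z_crit,
       reject = abs(z) > z_crit)
}

res <- prop_z_test(phat = 0.48, p0 = 0.50, n = 331)
res$z
res$p_value
res$reject
```

Using `-abs(z)` inside `pnorm()` keeps the two-sided p-value between 0 and 1 regardless of the sign of the statistic. With these inputs `res$reject` is `FALSE`.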
myvector=c(4,30) 
mymatrix=matrix(c(4,30,24,45), nrow=2)
colnames(mymatrix) <- c("Control", "Treatment")
rownames(mymatrix) <-c("Alive", "Dead")
mymatrix
##       Control Treatment
## Alive       4        24
## Dead       30        45
3. Question 3: Hypothesis testing - Difference of two means. Gaming and distracted eating. You are investigating the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group who ate their lunch while playing solitaire were asked to do a serial-order recall of the lunch items they ate. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction, n=22 too) was 6.1, with a standard deviation of 1.8.

Do these data provide strong evidence that the average number of food items recalled by the patients in the treatment and control groups are different? Assume α=5%.

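n1<-22          # treatment group size
mean1<-4.9      # treatment mean (items recalled)
sd1<-1.8
n2<-22          # control group size
mean2<-6.1      # control mean (items recalled)
sd2<-1.8


meandiff<-mean1 - mean2
SE<-sqrt((sd1^2/n1)+(sd2^2/n2))
df<-min(n1, n2)-1     # conservative degrees of freedom


Tstat<-meandiff/SE
Tstat
## [1] -2.211083
tcrit<-qt(p=.05/2, df, lower.tail=FALSE)   # two-sided critical value
round(tcrit, 2)
## [1] 2.08


pvalue<-2*pt(abs(Tstat), df, lower.tail = FALSE)
pvalue < 0.05
## [1] TRUE

Because |t| = 2.21 exceeds the critical value 2.08 (df = 21), the two-sided p-value is below α = 0.05 and we reject the null hypothesis: the data provide evidence that the average number of items recalled differs between the treatment and control groups.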
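
The same calculation can be wrapped in a reusable function that works from summary statistics alone. The name `two_sample_t` and its argument names are hypothetical, and the degrees of freedom use the conservative min(n1, n2) - 1 rule rather than the Welch approximation; this is a sketch under those assumptions.

```r
# Hypothetical helper: two-sided two-sample t-test from summary statistics.
# Assumption: conservative degrees of freedom, min(n1, n2) - 1.
two_sample_t <- function(m1, s1, n1, m2, s2, n2, alpha = 0.05) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)    # standard error of the difference
  t_stat <- (m1 - m2) / se             # test statistic
  df <- min(n1, n2) - 1
  p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)
  t_crit <- qt(1 - alpha / 2, df)
  list(t = t_stat, df = df, p_value = p_value, t_crit = t_crit,
       reject = abs(t_stat) > t_crit)
}

# Treatment (distracted) vs control, using the values from the question
res <- two_sample_t(m1 = 4.9, s1 = 1.8, n1 = 22,
                    m2 = 6.1, s2 = 1.8, n2 = 22)
res$t
res$reject
```

Keeping means, standard deviations, and sample sizes as separately named arguments makes it harder to mix them up when forming the difference and the standard error.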

5. Question 5: Assumptions. Heart transplant success. The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The table below displays how many patients survived and died in each group.

mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
##       control treatment
## alive       4        24
## dead       30        45

Suppose we are interested in estimating the difference in survival rate between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?

alive = c(4,24)
dead = c(30,45)

Data1<-rbind(alive,dead)
Data1
##       [,1] [,2]
## alive    4   24
## dead    30   45
colnames(Data1)=c("control","treatment")
chisq.test(Data1)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  Data1
## X-squared = 4.9891, df = 1, p-value = 0.02551
chisq.test(cbind(alive,dead))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cbind(alive, dead)
## X-squared = 4.9891, df = 1, p-value = 0.02551
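# Reject H0 (no association between group and survival) at alpha = 0.05.
# The normal approximation is not appropriate for a confidence interval here:
# only 4 control patients survived, so the success-failure condition
# (at least 10 successes and 10 failures in each group) is not met.
# An interval built anyway could have coverage well below its nominal level.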
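
The success-failure condition can be checked directly on the table. In the sketch below the threshold of 10 is the usual rule of thumb and is stated as an assumption; `fisher.test()` from base R is shown as one exact alternative that does not rely on the normal approximation.

```r
# Check the success-failure condition for each cell of the 2x2 table.
counts <- matrix(c(4, 30, 24, 45), nrow = 2,
                 dimnames = list(c("alive", "dead"),
                                 c("control", "treatment")))

min_count <- 10                      # rule-of-thumb threshold (assumption)
condition_met <- counts >= min_count
condition_met                        # FALSE marks a cell that is too small
all(condition_met)                   # TRUE only if every cell passes

# Exact test of association, as an alternative to the normal approximation
exact_p <- fisher.test(counts)$p.value
exact_p
```

Only the control/alive cell fails the check, which is enough to make a normal-approximation interval for the difference in survival rates unreliable.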

4. Question 4: Working backwards. A 90% confidence interval for a population mean is (65,77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations (double sided). Calculate the sample mean, the margin of error, and the sample standard deviation.

SAMPLE MEAN

$$\text{sample mean} = \frac{X_1 + X_2}{2}$$

MARGIN OF ERROR

$$\text{margin of error} = \frac{X_2 - X_1}{2}$$

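n <- 25
xbar <- (65+77)/2      # sample mean: midpoint of the interval
xbar
## [1] 71
ME <- (77-65)/2        # margin of error: half the interval width
ME
## [1] 6
df <- n-1
t.value <- qt(.95, df) # a 90% two-sided interval leaves 5% in each tail
t.value
## [1] 1.710882
tdf <- round(qt(c(.05, .95), df=24)[2], 3)
tdf
## [1] 1.711
Se <- round((77-xbar)/tdf, 3)   # standard error = ME / t*
Se
## [1] 3.507
s <- Se * sqrt(n)      # sample standard deviation (from the rounded Se)
s
## [1] 17.535
s1 <- (ME/t.value)*sqrt(n)   # same quantity without intermediate rounding
s1
## [1] 17.53481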
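
As a check, the recovered values can be plugged back into the t-interval formula to confirm that they reproduce (65, 77). This is a minimal sketch; the names `ci_lower`, `ci_upper`, `t_star`, and `rebuilt` are illustrative.

```r
# Work backwards from a t-based confidence interval, then rebuild it.
ci_lower <- 65
ci_upper <- 77
n <- 25
conf_level <- 0.90

xbar <- (ci_lower + ci_upper) / 2                   # sample mean
me <- (ci_upper - ci_lower) / 2                     # margin of error
t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)  # critical t value
s <- me / t_star * sqrt(n)                          # sample standard deviation

# Rebuild the interval: xbar +/- t* x s / sqrt(n)
rebuilt <- xbar + c(-1, 1) * t_star * s / sqrt(n)
rebuilt
```

If the algebra is right, `rebuilt` returns the original limits.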
6. Question 6: Data Question. The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this question, you are to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

Import the Titanic train.csv file.

library(ggfortify)
  library(ggplot2)
  library(readr)
  library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
  library(tidyr)
  library(viridis)
## Loading required package: viridisLite
  library(ggthemes)
  library(ggalt)
## Registered S3 methods overwritten by 'ggalt':
##   method                  from     
##   fortify.table           ggfortify
##   grid.draw.absoluteGrob  ggplot2  
##   grobHeight.absoluteGrob ggplot2  
##   grobWidth.absoluteGrob  ggplot2  
##   grobX.absoluteGrob      ggplot2  
##   grobY.absoluteGrob      ggplot2
traind <- read.table(file="C:/R/final/trainfinal.csv", header=TRUE,  sep=",", stringsAsFactors = TRUE)
str(traind)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
set.seed(100)
#loading psych package
require(psych)
## Loading required package: psych
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
trainc<- na.omit(traind)

Traindesc <- describe(trainc)    # Summary Statistics
Traindesc 
##             vars   n   mean     sd median trimmed    mad  min    max  range
## PassengerId    1 714 448.58 259.12 445.00  448.76 337.29 1.00 891.00 890.00
## Survived       2 714   0.41   0.49   0.00    0.38   0.00 0.00   1.00   1.00
## Pclass         3 714   2.24   0.84   2.00    2.30   1.48 1.00   3.00   2.00
## Name*          4 714 422.68 263.59 397.50  418.23 341.00 1.00 891.00 890.00
## Sex*           5 714   1.63   0.48   2.00    1.67   0.00 1.00   2.00   1.00
## Age            6 714  29.70  14.53  28.00   29.27  13.34 0.42  80.00  79.58
## SibSp          7 714   0.51   0.93   0.00    0.30   0.00 0.00   5.00   5.00
## Parch          8 714   0.43   0.85   0.00    0.23   0.00 0.00   6.00   6.00
## Ticket*        9 714 336.39 203.43 332.00  335.62 278.73 1.00 681.00 680.00
## Fare          10 714  34.69  52.92  15.74   23.19  12.21 0.00 512.33 512.33
## Cabin*        11 714  21.11  40.20   1.00   10.89   0.00 1.00 148.00 147.00
## Embarked*     12 714   3.59   0.79   4.00    3.74   0.00 1.00   4.00   3.00
##              skew kurtosis   se
## PassengerId  0.00    -1.23 9.70
## Survived     0.38    -1.86 0.02
## Pclass      -0.47    -1.42 0.03
## Name*        0.13    -1.24 9.86
## Sex*        -0.56    -1.69 0.02
## Age          0.39     0.16 0.54
## SibSp        2.51     6.96 0.03
## Parch        2.61     8.75 0.03
## Ticket*      0.05    -1.30 7.61
## Fare         4.63    30.61 1.98
## Cabin*       1.88     2.17 1.50
## Embarked*   -1.48     0.37 0.03
ggplot(traind, aes(Pclass)) + geom_density(fill="blue")

ggplot(traind, aes(log(Pclass))) + geom_density(fill="blue")

ggplot(traind, aes(sqrt(Pclass))) + geom_density(fill="blue")

slope <-cor(trainc$Pclass,trainc$Fare) * (sd(trainc$Pclass)/sd(trainc$Fare))
slope
## [1] -0.008778397
intercept <- mean(trainc$Pclass) - (slope * mean(trainc$Fare))
intercept
## [1] 2.541257
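
The hand formulas used here (slope = r × sd(y)/sd(x), intercept = mean(y) − slope × mean(x)) give the same coefficients as `lm()`. The sketch below demonstrates this on simulated data, since the Titanic file is not assumed to be available; the variables `x` and `y` and their generating parameters are made up for illustration.

```r
# Compare hand-computed least-squares coefficients with lm().
# Assumption: simulated data stand in for Fare (x) and Pclass (y).
set.seed(1)
x <- rexp(200, rate = 1 / 30)                 # skewed, loosely fare-like predictor
y <- 2.5 - 0.01 * x + rnorm(200, sd = 0.7)    # response with a negative slope

slope <- cor(x, y) * sd(y) / sd(x)            # r * sd(y) / sd(x)
intercept <- mean(y) - slope * mean(x)        # line passes through the means

fit <- lm(y ~ x)
rbind(by_hand = c(intercept, slope), lm = coef(fit))
```

Both rows of the resulting table should agree up to floating-point error.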
library('tidyverse')
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ stringr 1.5.0
## ✔ purrr   1.0.1     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%()    masks ggplot2::%+%()
## ✖ psych::alpha()  masks ggplot2::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library('ggplot2')
library('dplyr')


trainc %>%
 ggplot(aes(x = Fare, y = Pclass)) +
 geom_point(colour = "red")

trainc[is.na(trainc)] <- 0

plot(trainc)

##cor(trainc, use="pairwise.complete.obs")
cor(trainc$Pclass,trainc$Fare)
## [1] -0.5541825
trainc %>%
 ggplot(aes(x = sqrt(Pclass), y = sqrt(Fare))) +
 geom_point(colour = "orangered") 

trainc %>%
 ggplot(aes(x = sqrt(Fare), y = sqrt(Pclass))) +
 geom_point(colour = "maroon") +
 geom_smooth(method = "lm", fill = NA)




```r
hist(x = trainc$Fare, xlab = "", main = "trainc Fare")

hist(x=trainc$Pclass,xlab = "",main = "trainc Pclass")

plot(x = trainc$Fare, y = trainc$Pclass, xlab = "Fare", ylab = "Pclass") 

univariate_reg  =  lm(trainc$Pclass ~ trainc$Fare)

summary(univariate_reg)
## 
## Call:
## lm(formula = trainc$Pclass ~ trainc$Fare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5413 -0.4403  0.5158  0.5301  2.9562 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.5412569  0.0312531   81.31   <2e-16 ***
## trainc$Fare -0.0087784  0.0004941  -17.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6982 on 712 degrees of freedom
## Multiple R-squared:  0.3071, Adjusted R-squared:  0.3061 
## F-statistic: 315.6 on 1 and 712 DF,  p-value: < 2.2e-16
options(scipen = 999)      
summary(univariate_reg)
## 
## Call:
## lm(formula = trainc$Pclass ~ trainc$Fare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5413 -0.4403  0.5158  0.5301  2.9562 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  2.5412569  0.0312531   81.31 <0.0000000000000002 ***
## trainc$Fare -0.0087784  0.0004941  -17.77 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6982 on 712 degrees of freedom
## Multiple R-squared:  0.3071, Adjusted R-squared:  0.3061 
## F-statistic: 315.6 on 1 and 712 DF,  p-value: < 0.00000000000000022
abline(reg = univariate_reg, col="blue")

model<-lm(Pclass ~ Fare,data=trainc)
summary(model)
## 
## Call:
## lm(formula = Pclass ~ Fare, data = trainc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5413 -0.4403  0.5158  0.5301  2.9562 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  2.5412569  0.0312531   81.31 <0.0000000000000002 ***
## Fare        -0.0087784  0.0004941  -17.77 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6982 on 712 degrees of freedom
## Multiple R-squared:  0.3071, Adjusted R-squared:  0.3061 
## F-statistic: 315.6 on 1 and 712 DF,  p-value: < 0.00000000000000022
lmodel <- lm(sqrt(Fare) ~ sqrt(Pclass), data = trainc)
lmodel
## 
## Call:
## lm(formula = sqrt(Fare) ~ sqrt(Pclass), data = trainc)
## 
## Coefficients:
##  (Intercept)  sqrt(Pclass)  
##       15.247        -6.963
lmodel$coefficients
##  (Intercept) sqrt(Pclass) 
##    15.246911    -6.962825
summary(lmodel)
## 
## Call:
## lm(formula = sqrt(Fare) ~ sqrt(Pclass), data = trainc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2841 -0.8174 -0.3497  0.6622 14.3506 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   15.2469     0.3998   38.14 <0.0000000000000002 ***
## sqrt(Pclass)  -6.9628     0.2673  -26.05 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.171 on 712 degrees of freedom
## Multiple R-squared:  0.4879, Adjusted R-squared:  0.4872 
## F-statistic: 678.4 on 1 and 712 DF,  p-value: < 0.00000000000000022
Fvalue<-fitted.values(model)
head(round(Fvalue, 3))
##     1     2     3     4     5     7 
## 2.478 1.916 2.472 2.075 2.471 2.086 
Res<-lm(formula = model, data = trainc)
Res
## 
## Call:
## lm(formula = model, data = trainc)
## 
## Coefficients:
## (Intercept)         Fare  
##    2.541257    -0.008778
# Residual Analysis
# plot(fitted(model),resid(model))

## plot(fitted(model),Res)
# abline(0,0)
qqnorm(resid(model))
qqline(resid(model))

par(mfrow=c(2,2))
plot(model)

Interpret the linear model - Pclass~Fare.

A 1-unit increase in Fare is associated with a decrease of about 0.0088 in predicted Pclass (slope = -0.008778).
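As a rough illustration of what the fitted slope means, the sketch below plugs the coefficients printed by `Res` (intercept 2.541257, Fare slope -0.008778) into the regression equation. The fare values 0, 50 and 100 are arbitrary choices for illustration, not values taken from the data.

```r
# Coefficients copied from the printed lm() output for Pclass ~ Fare
b0 <- 2.541257    # intercept
b1 <- -0.008778   # slope on Fare

# Illustrative fares (an assumption, chosen only to show the trend)
fare <- c(0, 50, 100)

# Predicted Pclass from the fitted line: Pclass_hat = b0 + b1 * Fare
pred <- b0 + b1 * fare
print(data.frame(fare = fare, predicted_pclass = pred))

# The change in predicted Pclass for a 1-unit increase in Fare equals the slope
per_unit_change <- (b0 + b1 * 51) - (b0 + b1 * 50)
print(per_unit_change)
```

Each additional unit of Fare lowers the predicted Pclass by the same amount, whatever the starting fare. That is what a straight-line model implies.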

plot(x = univariate_reg)

qqnorm(y=trainc$Pclass)

qqnorm( y = log(trainc$Pclass))

Transformed regression - ln(Pclass)~Fare

hist(x = log(trainc$Pclass),xlab = "", main = "Log Pclass Details" )

hist(x=log(trainc$Fare),  xlab = "", main = "Log Fare (Fare)")

plot(x = trainc$Fare, y = log (trainc$Pclass) , xlab = "Fare", ylab = "Log of Pclass") 

univariate_reg_transformedY  =  lm( formula = log(trainc$Pclass) ~ trainc$Fare)
?abline
abline(reg = univariate_reg_transformedY, col="blue")

summary(univariate_reg)
## 
## Call:
## lm(formula = trainc$Pclass ~ trainc$Fare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5413 -0.4403  0.5158  0.5301  2.9562 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  2.5412569  0.0312531   81.31 <0.0000000000000002 ***
## trainc$Fare -0.0087784  0.0004941  -17.77 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6982 on 712 degrees of freedom
## Multiple R-squared:  0.3071, Adjusted R-squared:  0.3061 
## F-statistic: 315.6 on 1 and 712 DF,  p-value: < 0.00000000000000022
summary(univariate_reg_transformedY)
## 
## Call:
## lm(formula = log(trainc$Pclass) ~ trainc$Fare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8872 -0.1417  0.2456  0.2520  1.6677 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  0.8871916  0.0165851   53.49 <0.0000000000000002 ***
## trainc$Fare -0.0049868  0.0002622  -19.02 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3705 on 712 degrees of freedom
## Multiple R-squared:  0.3368, Adjusted R-squared:  0.3359 
## F-statistic: 361.7 on 1 and 712 DF,  p-value: < 0.00000000000000022
plot(univariate_reg_transformedY)

Constant variability - The plot of residuals versus fitted values shows that the variability of the errors around the predicted values is slightly more even after the log transformation.

Scale-Location plot - If the OLS constant-variance assumption holds, the trend line is roughly flat and the residuals are evenly scattered around it.
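The Scale-Location plot can also be summarised numerically. The sketch below is only an illustration on simulated data (the variables `x` and `y`, the helper `scale_loc_trend`, and all parameter values are assumptions, not part of the Titanic analysis). It computes the rank correlation between the fitted values and the square root of the absolute standardized residuals, which are the two quantities that `plot(fit, which = 3)` draws. A value near zero suggests a flat trend. A clearly positive value suggests that the spread grows with the fitted values.

```r
set.seed(1)

# Simulated data (assumption): multiplicative noise, so spread grows with the mean
n <- 500
x <- runif(n, min = 1, max = 100)
y <- exp(1 + 0.02 * x + rnorm(n, sd = 0.5))

fit_raw <- lm(y ~ x)        # untransformed response
fit_log <- lm(log(y) ~ x)   # log-transformed response

# Hypothetical helper: Spearman correlation between fitted values and
# sqrt(|standardized residuals|), the y-axis of the Scale-Location plot
scale_loc_trend <- function(fit) {
  cor(fitted(fit), sqrt(abs(rstandard(fit))), method = "spearman")
}

trend_raw <- scale_loc_trend(fit_raw)
trend_log <- scale_loc_trend(fit_log)
print(c(raw = trend_raw, log = trend_log))
```

On data built this way, the log model should show the weaker trend, because taking logs turns the multiplicative noise into additive noise with constant spread.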

plot( univariate_reg , which = 3)               

plot( univariate_reg_transformedY , which = 3)         

hist(x = log(trainc$Pclass), xlab = "", main = "Log Pclass Details" )

hist(x = log(trainc$Fare), xlab = "", main = "Log Fare")

#log_log_reg =  lm(formula = log(trainc$Pclass) ~ log(trainc$Fare))
#summary(log_log_reg)
# plot(log_log_reg)
plot(x = trainc$Fare, y = log(trainc$Pclass), xlab = "Fare", ylab = "Log of Pclass")

trainc$Fare2 <- trainc$Fare^2

univariate_reg_transformedY2  =  lm( formula = log(trainc$Pclass) ~ trainc$Fare+ trainc$Fare2)
# abline(reg = univariate_reg_transformedY2, col="blue")


# summary(univariate_reg)
# summary(univariate_reg_transformedY)
summary(univariate_reg_transformedY2)
## 
## Call:
## lm(formula = log(trainc$Pclass) ~ trainc$Fare + trainc$Fare2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0298 -0.1910  0.1506  0.1700  0.6512 
## 
## Coefficients:
##                  Estimate   Std. Error t value            Pr(>|t|)    
## (Intercept)   1.029797839  0.016695350   61.68 <0.0000000000000002 ***
## trainc$Fare  -0.011470684  0.000458865  -25.00 <0.0000000000000002 ***
## trainc$Fare2  0.000020586  0.000001271   16.20 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3169 on 711 degrees of freedom
## Multiple R-squared:  0.5156, Adjusted R-squared:  0.5142 
## F-statistic: 378.4 on 2 and 711 DF,  p-value: < 0.00000000000000022
plot(univariate_reg_transformedY2)

Some ways to fix heteroskedasticity - Transform the dependent variable. sqrt() compresses large values less strongly than log(), and its coefficients are not as easy or standard to interpret as those of a log model.
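To make the interpretation point concrete, the sketch below uses the coefficients reported in this document by `summary(univariate_reg_transformedY)` and `summary(univariate_reg_transformedY_sqrt)`. In the log model, the slope converts to one percentage change per unit of Fare. In the sqrt model, the prediction has to be squared to get back to the Pclass scale, so the effect of one more unit of Fare depends on the starting fare. The fares 10 and 100 and the helper `pred_sqrt` are illustrative assumptions.

```r
# Slope of the log model, copied from summary(univariate_reg_transformedY)
b_log <- -0.0049868
# Percent change in predicted Pclass per 1-unit increase in Fare
pct_per_unit <- (exp(b_log) - 1) * 100
print(pct_per_unit)

# Coefficients of the sqrt model, copied from summary(univariate_reg_transformedY_sqrt)
a_sqrt <- 1.5777304
b_sqrt <- -0.0032683

# Hypothetical helper: back-transform the sqrt-scale prediction to the Pclass scale
pred_sqrt <- function(fare) (a_sqrt + b_sqrt * fare)^2

# Change in predicted Pclass for +1 Fare at two illustrative starting fares
delta_at_10  <- pred_sqrt(11)  - pred_sqrt(10)
delta_at_100 <- pred_sqrt(101) - pred_sqrt(100)
print(c(at_10 = delta_at_10, at_100 = delta_at_100))
```

The log model gives one number, roughly a 0.5% decrease per unit of Fare. The sqrt model gives a different change at each fare level, which is why it is harder to summarise.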

Alternative models - sqrt(Pclass)~Fare

plot(x = trainc$Fare, y =  sqrt(trainc$Pclass) , xlab = "Fare", ylab = "Square Root of Pclass") 

univariate_reg_transformedY_sqrt  =  lm( formula = sqrt(trainc$Pclass) ~ trainc$Fare)
?abline
abline(reg = univariate_reg_transformedY_sqrt, col="blue")

summary(univariate_reg)
## 
## Call:
## lm(formula = trainc$Pclass ~ trainc$Fare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5413 -0.4403  0.5158  0.5301  2.9562 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  2.5412569  0.0312531   81.31 <0.0000000000000002 ***
## trainc$Fare -0.0087784  0.0004941  -17.77 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6982 on 712 degrees of freedom
## Multiple R-squared:  0.3071, Adjusted R-squared:  0.3061 
## F-statistic: 315.6 on 1 and 712 DF,  p-value: < 0.00000000000000022
summary(univariate_reg_transformedY)
## 
## Call:
## lm(formula = log(trainc$Pclass) ~ trainc$Fare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8872 -0.1417  0.2456  0.2520  1.6677 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  0.8871916  0.0165851   53.49 <0.0000000000000002 ***
## trainc$Fare -0.0049868  0.0002622  -19.02 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3705 on 712 degrees of freedom
## Multiple R-squared:  0.3368, Adjusted R-squared:  0.3359 
## F-statistic: 361.7 on 1 and 712 DF,  p-value: < 0.00000000000000022
summary(univariate_reg_transformedY_sqrt)
## 
## Call:
## lm(formula = sqrt(trainc$Pclass) ~ trainc$Fare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5777 -0.1292  0.1756  0.1809  1.0967 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  1.5777304  0.0112090  140.75 <0.0000000000000002 ***
## trainc$Fare -0.0032683  0.0001772  -18.44 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2504 on 712 degrees of freedom
## Multiple R-squared:  0.3233, Adjusted R-squared:  0.3223 
## F-statistic: 340.1 on 1 and 712 DF,  p-value: < 0.00000000000000022
plot( univariate_reg_transformedY_sqrt)         

```