Data Analysis using R

Short Term Course - NSOU Kalyani Campus

Solution of the final Exam

———————————————————————–

Libraries required for the solution. Loading tidyverse alone is enough, because dplyr, ggplot2, and readr are all part of it.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Q1.a Load the file
Solution:
rm(list=ls())
srcFdr="D:\\D Drive\\Certificate Course\\Examination"
fileNm="Production_2024.csv"
srcFile=paste(srcFdr,fileNm,sep="\\")
prd=read_csv(srcFile)
## Rows: 73 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): PRDMTD, STRENGTH
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
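
As a side note, the path can also be assembled with base R's file.path(), which avoids hand-escaped backslashes, and the column types can be stated explicitly so that read_csv() does not print the specification message. The sketch below is only illustrative: it writes a small made-up file to a temporary folder (the name Production_toy.csv and its two rows are assumptions, not the course data).

```r
library(readr)

# Hypothetical folder and file name; the real exam file is not reproduced here
srcFdr <- tempdir()
srcFile <- file.path(srcFdr, "Production_toy.csv")

# Made-up rows, only so that the example can run on its own
write_csv(data.frame(PRDMTD = c(1, 2), STRENGTH = c(95.1, 96.3)), srcFile)

# Declaring the column types up front silences the column specification message
prdToy <- read_csv(srcFile,
                   col_types = cols(PRDMTD = col_double(),
                                    STRENGTH = col_double()))
```
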
Q1.b Change the values of the file
Solution:
summary(prd)
##      PRDMTD         STRENGTH     
##  Min.   :1.000   Min.   :-46.00  
##  1st Qu.:1.000   1st Qu.: 92.22  
##  Median :2.000   Median : 95.59  
##  Mean   :1.548   Mean   : 95.62  
##  3rd Qu.:2.000   3rd Qu.:100.03  
##  Max.   :3.000   Max.   :186.00
unique(prd$PRDMTD)
## [1] 1 2 3
prdCln=prd%>%mutate(PRDMTD=na_if(PRDMTD,3))%>%
  mutate(STRENGTH=if_else(STRENGTH<0,NA,STRENGTH))%>%
  mutate(PRDMTD=factor(PRDMTD,labels=c("Batch","Mass")))%>%
  drop_na(PRDMTD)%>%drop_na(STRENGTH)
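
To see what each cleaning step does, the same pipeline can be run on a tiny made-up tibble. This is a sketch under assumptions: the five rows are invented, and NA_real_ is used instead of a bare NA so that if_else() also works on older dplyr versions that require both branches to have the same type.

```r
library(dplyr)
library(tidyr)

# Invented rows: one invalid method code (3) and one negative strength
toy <- tibble(PRDMTD   = c(1, 2, 3, 1, 2),
              STRENGTH = c(95, -46, 97, 99, 101))

toyCln <- toy %>%
  mutate(PRDMTD   = na_if(PRDMTD, 3),                           # code 3 becomes NA
         STRENGTH = if_else(STRENGTH < 0, NA_real_, STRENGTH),  # negative values become NA
         PRDMTD   = factor(PRDMTD, levels = c(1, 2),
                           labels = c("Batch", "Mass"))) %>%    # 1 = Batch, 2 = Mass
  drop_na(PRDMTD, STRENGTH)                                     # drop rows with any NA

toyCln
```

Only the three rows with a valid method and a non-negative strength remain.
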
Q1.c Remove the outliers
Solution:
summary(prdCln)
##    PRDMTD      STRENGTH     
##  Batch:34   Min.   :  4.00  
##  Mass :36   1st Qu.: 92.73  
##             Median : 95.61  
##             Mean   : 98.34  
##             3rd Qu.:100.00  
##             Max.   :186.00
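# Quartiles read from the summary above
Q1=92.73
Q3=100.00
iqr=Q3-Q1   # named iqr so that stats::IQR() is not masked
# Fences at 1.5*IQR below Q1 and above Q3
ll=Q1-1.5*iqr
ul=Q3+1.5*iqr
prdCln=prdCln%>% filter(STRENGTH>ll & STRENGTH<ul)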
summary(prdCln)
##    PRDMTD      STRENGTH     
##  Batch:30   Min.   : 83.88  
##  Mass :27   1st Qu.: 92.82  
##             Median : 95.18  
##             Mean   : 95.89  
##             3rd Qu.: 99.13  
##             Max.   :109.26
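
The quartiles above were typed in from the rounded summary() output. A possible alternative is to let quantile() supply them, so the fences follow the data automatically. The sketch below uses made-up strengths, not the exam data, and the names toy, lower and upper are only illustrative.

```r
library(dplyr)

# Made-up strengths with one obvious outlier (150)
toy <- tibble(STRENGTH = c(90, 92, 94, 95, 96, 98, 100, 150))

# First and third quartiles (default type 7, the same rule summary() uses)
q <- quantile(toy$STRENGTH, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Fences at 1.5 * IQR beyond the quartiles
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

toyCln <- toy %>% filter(STRENGTH > lower, STRENGTH < upper)
```
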
Q1.d (i)
Solution:
prdStat=prdCln%>%group_by(PRDMTD)%>%
  summarize(avg=mean(STRENGTH),
            var=var(STRENGTH),
            cnt=n())
prdStat
## # A tibble: 2 × 4
##   PRDMTD   avg   var   cnt
##   <fct>  <dbl> <dbl> <int>
## 1 Batch   95.0  8.03    30
## 2 Mass    96.8 43.0     27
Q1.d (ii)
Solution:
prdCln%>%group_by(PRDMTD)%>%
  summarize(var=sum((STRENGTH-mean(STRENGTH))^2)/n())
## # A tibble: 2 × 2
##   PRDMTD   var
##   <fct>  <dbl>
## 1 Batch   7.76
## 2 Mass   41.4

The variance values in (i) and (ii) differ. R's var() divides by n-1, whereas the formula given in (ii) divides by n, so the value from var() is larger. The correct formula for the sample variance is \(\frac{\sum (x_i-\bar x)^2}{n-1}\), because it is an unbiased estimator of the population variance.
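
The two results are linked by the factor (n-1)/n. With the values reported above, 8.03 × 29/30 ≈ 7.76 for Batch and 43.0 × 26/27 ≈ 41.4 for Mass. The short sketch below checks the same relation on a made-up vector (the numbers in x are an assumption chosen only for easy arithmetic).

```r
# Made-up values, not the exam data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

var_n1 <- var(x)                      # R's var(): denominator n - 1
var_n  <- sum((x - mean(x))^2) / n    # formula from (ii): denominator n

# The n-denominator value equals var() scaled by (n - 1) / n
all.equal(var_n, var_n1 * (n - 1) / n)
```
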

Q2.
Solution:
prdCln%>%ggplot(aes(x=PRDMTD,y=STRENGTH,fill = PRDMTD))+
  geom_boxplot()+
  stat_summary(fun = mean,geom = "point",size=3)+
  labs(title="Strength by Production Method",
       x="Production Method",
       y="Strength",fill="Production Method")+
  theme_minimal()

Q3.a
Sol:

\(H_0\): There is no difference between the mean strength of method 1 and the mean strength of method 2.
\(H_a\): The mean strength of method 1 and the mean strength of method 2 are different.

Q3.b
Sol:

Two-sample \(t\)-test. The variance of strength for type Mass (43.0) is more than 5 times the variance for type Batch (8.03), so the two variances are assumed to be unequal in the \(t\)-test. The \(t\) statistic is then \(t=\frac{\bar{x}_m-\bar{x}_b}{\sqrt{\frac{var_m}{n_m}+\frac{var_b}{n_b}}}\), where the subscripts \(m\) and \(b\) denote Mass and Batch.

# Welch t statistic, Mass minus Batch; named tStat so that base::t() is not masked
tStat=(prdStat$avg[2]-prdStat$avg[1])/
  sqrt(prdStat$var[2]/prdStat$cnt[2]+prdStat$var[1]/prdStat$cnt[1])
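
As a rough check, the statistic can be recomputed from the summary figures alone. The sketch below plugs in the group means printed by t.test() in Q3.c and the rounded variances and counts from Q1.d (i), so the result is approximate. It also computes the Welch-Satterthwaite degrees of freedom, which is the approximation behind the df that t.test() reports when var.equal=FALSE. Note the sign: this formula takes Mass minus Batch, while t.test() reports Batch minus Mass, which is why t is negative there.

```r
# Summary figures copied from the outputs in this document (variances are rounded)
avg_m <- 96.83538; var_m <- 43.0; n_m <- 27   # Mass
avg_b <- 95.03855; var_b <- 8.03; n_b <- 30   # Batch

# Squared standard errors of the two group means
se2_m <- var_m / n_m
se2_b <- var_b / n_b

# Welch t statistic (Mass minus Batch)
tStat <- (avg_m - avg_b) / sqrt(se2_m + se2_b)

# Welch-Satterthwaite approximation to the degrees of freedom
dfWelch <- (se2_m + se2_b)^2 /
  (se2_m^2 / (n_m - 1) + se2_b^2 / (n_b - 1))

c(t = tStat, df = dfWelch)
```

Both values agree, up to rounding, with the t and df shown in the Q3.c output.
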
Q3.c
Sol:
t.test(prdCln$STRENGTH~prdCln$PRDMTD,var.equal=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  prdCln$STRENGTH by prdCln$PRDMTD
## t = -1.3175, df = 34.602, p-value = 0.1963
## alternative hypothesis: true difference in means between group Batch and group Mass is not equal to 0
## 95 percent confidence interval:
##  -4.5666770  0.9730275
## sample estimates:
## mean in group Batch  mean in group Mass 
##            95.03855            96.83538

As the p-value 0.1963 is greater than 0.05, the null hypothesis cannot be rejected at significance level 0.05, and therefore not at 0.01 either.
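
The reported p-value can be reproduced from the t statistic and the degrees of freedom in the output above. The following is a small sketch using base R's pt(); the values of tObs and dfObs are copied from the printed output, so the result matches only up to rounding.

```r
# Values copied from the Welch test output above
tObs  <- -1.3175
dfObs <- 34.602

# Two-sided p-value: probability in both tails beyond |t|
pTwoSided <- 2 * pt(-abs(tObs), df = dfObs)

# Decision at the two significance levels discussed in the text
c(p = pTwoSided, reject_05 = pTwoSided < 0.05, reject_01 = pTwoSided < 0.01)
```
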

Q3.d
Sol:

\(H_0\): The mean strength of method 2 is no better than the mean strength of method 1.
\(H_a\): The mean strength of method 2 is better than the mean strength of method 1.

Here the alternative hypothesis is one-sided, therefore a one-tailed \(t\)-test can be used.

t.test(prdCln$STRENGTH~prdCln$PRDMTD,var.equal=FALSE,
       alternative="less")
## 
##  Welch Two Sample t-test
## 
## data:  prdCln$STRENGTH by prdCln$PRDMTD
## t = -1.3175, df = 34.602, p-value = 0.09817
## alternative hypothesis: true difference in means between group Batch and group Mass is less than 0
## 95 percent confidence interval:
##       -Inf 0.5081784
## sample estimates:
## mean in group Batch  mean in group Mass 
##            95.03855            96.83538

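The p-value is 0.09817. As the p-value is less than 0.1, we can reject the null hypothesis at significance level 0.1 and conclude that claim 2 holds at that level. At significance level 0.05 the null hypothesis cannot be rejected.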
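
With alternative="less", only the lower tail counts, because the test is on Batch minus Mass being below 0. A minimal sketch of that calculation, again with t and df copied from the printed output (so agreement is up to rounding):

```r
# Values copied from the one-sided Welch test output above
tObs  <- -1.3175
dfObs <- 34.602

# One-sided p-value: lower-tail probability only
pOneSided <- pt(tObs, df = dfObs)

# Because t is negative, this is half of the two-sided p-value
pTwoSided <- 2 * pt(-abs(tObs), df = dfObs)
c(one_sided = pOneSided, half_two_sided = pTwoSided / 2)
```
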

Q4.a
Sol:

1 -> 3
2 -> 4
3 -> 8
4 -> 7
5 -> 1
6 -> 6
7 -> 2
8 -> 9
9 -> 5

Q4.b(i)
Sol:

\(H_0\): The soft drink bottle contains at least 67.6 fluid ounces.
\(H_a\): The soft drink bottle contains less than 67.6 fluid ounces.

Q4.b(ii)
Sol:

\(H_0\): The soft drink bottle contains 67.6 fluid ounces.
\(H_a\): The soft drink bottle does not contain 67.6 fluid ounces.
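
If sample measurements were available, the two pairs of hypotheses in Q4.b would map onto the alternative argument of a one-sample t.test(). The sketch below is purely illustrative: the eight fill volumes are made up, and only the hypothesised value mu = 67.6 comes from the question.

```r
# Made-up fill volumes in fluid ounces (not exam data)
vol <- c(67.2, 67.5, 67.4, 67.7, 67.3, 67.6, 67.1, 67.5)

# Q4.b(i): H_a says the mean is less than 67.6, so a lower-tailed test
resLess <- t.test(vol, mu = 67.6, alternative = "less")

# Q4.b(ii): H_a says the mean differs from 67.6, so a two-tailed test
resTwo <- t.test(vol, mu = 67.6, alternative = "two.sided")

c(p_less = resLess$p.value, p_two_sided = resTwo$p.value)
```

The t statistic is the same in both calls. Only the tail area used for the p-value changes.
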