JAT Case Study: Exercise

Jayalaxmi Ago Tech: Find the inner insights

The sole purpose of this presentation is to answer the questions raised by Jayalaxmi Ago Tech (JAT), using the data in “IMB733-XLS-ENG.xlsx” in RStudio. The default output of this markdown is “HTML document”, which can be viewed in any browser or exported to PDF slides as needed.

Questions raised by JAT and insights found in Data

Q1. Anand, the co founder of JAT, claims that disease 6 (leaf curl) information was accessed at least 60 times every week on average since October 2017 due to this disease outbreak. Test this claim at a significance level of 0.05 using an appropriate hypothesis test.

Hypothesis

H0 i.e. null hypothesis says: Disease 6 (d6) information was accessed 60 times or fewer every week on average (µ<=60)

Ha i.e. alternative hypothesis says: Disease 6 (d6) information was accessed more than 60 times every week on average (µ>60)

Approach

We will compare the sample mean against the hypothesized population mean (µ = 60). Because the population standard deviation is not given, the appropriate test is a one-sample t-test.

getwd()

## [1] "C:/Users/pahar/OneDrive/IIM-K/Classes/2022-09-10"

setwd(dir="C:\\Users\\pahar\\OneDrive\\IIM-K\\Classes\\2022-09-10")
library(readxl)
all_jat_data<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Data Sheet",col_names = TRUE) # this is all data from IMB733-XLS-ENG.xlsx, sheet Data Sheet
library(janitor)

## Warning: package 'janitor' was built under R version 4.2.1

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

all_jat_data<-clean_names(all_jat_data) # using janitor package to replace spaces/other symbols with underscores and use lowercase naming throughout in variable names
all_jat_q1<-subset(all_jat_data,all_jat_data$month_year>="2017-10-01") #this is all data from October 2017 onwards

#One Sample T-test
t.test(all_jat_q1$d6,mu=60,alternative = "greater")

## 
##  One Sample t-test
## 
## data:  all_jat_q1$d6
## t = 2.341, df = 28, p-value = 0.01329
## alternative hypothesis: true mean is greater than 60
## 95 percent confidence interval:
##  62.29976      Inf
## sample estimates:
## mean of x 
##  68.41379

Outcomes

Using the one-sample t-test, we found that the sample mean is 68.41379, which is higher than the hypothesized mean of 60
t = 2.341, df = 28, p-value = 0.01329
As the p-value from the t-test is less than alpha (0.05), we reject the null hypothesis, i.e. µ<=60
We support the alternative hypothesis, i.e. µ>60
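
As a cross-check, the p-value and the rejection threshold can be recomputed from the reported statistic alone. The sketch below uses only the t and df values printed in the output above; the variable names are illustrative.

```r
# Values copied from the one-sample t-test output above
t_stat <- 2.341
df     <- 28
alpha  <- 0.05

# One-sided (upper-tail) p-value for Ha: mu > 60
p_value <- pt(t_stat, df = df, lower.tail = FALSE)

# Critical value: H0 is rejected when t_stat exceeds it
t_crit <- qt(1 - alpha, df = df)

reject_h0 <- t_stat > t_crit
round(c(p_value = p_value, t_crit = t_crit), 4)
```

Comparing t with the critical value and comparing the p-value with alpha are two views of the same decision, so both lead to the same outcome.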

Inference

Based on the given data and the one-sample t-test performed on it, we can infer that the data support the claim made by Anand, the co founder of JAT (that disease 6 (leaf curl) information was accessed at least 60 times every week on average since October 2017 due to this disease outbreak)

Q2. JAT believes that over the years, the average number of app users have increased significantly. Is there statistical evidence to support that the average number of users in year 2017-2018 is more than average number of users in year 2015-2016 at a=0.05? Support your answer with all necessary tests.

Assumption

Avg number of weekly app users in 2015-2016 =µ1

Avg number of weekly app users in 2017-2018 =µ2

Hypothesis

H0 i.e. null hypothesis says: µ2 <= µ1

Ha i.e. alternative hypothesis says: µ2 > µ1

Approach

We will use a two-sample t-test to compare the sample mean of the 2017-2018 data with that of the 2015-2016 data. The test is run with equal variances assumed in the two groups (var.eq=TRUE). It tells us whether the null hypothesis can be rejected.
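
With the formula interface, t.test takes the difference as first factor level minus second, so a one-sided test depends on the level order. The sketch below illustrates this on simulated numbers (users and period are made-up stand-ins, not the JAT data): without relevel, the alphabetical order puts "2015-2016" first and the "greater" test points the wrong way.

```r
# Simulated data for illustration only; not the JAT figures
set.seed(7)
users  <- c(rnorm(20, mean = 50, sd = 10), rnorm(20, mean = 180, sd = 10))
period <- factor(rep(c("2015-2016", "2017-2018"), each = 20))

levels(period)  # alphabetical order: "2015-2016" comes first

# Default level order: tests mean(2015-2016) - mean(2017-2018) > 0
wrong <- t.test(users ~ period, alternative = "greater", var.equal = TRUE)

# Make "2017-2018" the reference level so the difference is 2017-2018 minus 2015-2016
period2 <- relevel(period, ref = "2017-2018")
right <- t.test(users ~ period2, alternative = "greater", var.equal = TRUE)

c(wrong = wrong$p.value, right = right$p.value)
```

The two p-values sum to 1, because reversing the level order only flips the sign of the t statistic.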

all_jat_q2<-all_jat_data
all_jat_q2$group<-factor(ifelse(all_jat_data$month_year>="2017-01-01","2017-2018","2015-2016"))

all_jat_q2$group<-relevel(all_jat_q2$group,ref = "2017-2018")

t.test(all_jat_q2$no_of_users~all_jat_q2$group,alternative = "greater",var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  all_jat_q2$no_of_users by all_jat_q2$group
## t = 9.2567, df = 121, p-value = 4.753e-16
## alternative hypothesis: true difference in means between group 2017-2018 and group 2015-2016 is greater than 0
## 95 percent confidence interval:
##  107.5685      Inf
## sample estimates:
## mean in group 2017-2018 mean in group 2015-2016 
##               181.10000                50.06849

Outcomes

Using the two-sample t-test, we found that t = 9.2567, df = 121, p-value = 4.753e-16
As the p-value from the t-test is less than alpha (0.05), we reject the null hypothesis, i.e. µ2 is not greater than µ1
We support the alternative hypothesis, i.e. µ2 > µ1

Inference

Based on the given data and the two-sample t-test performed on it, we can infer that the data support the claim made by JAT (that the average number of users in year 2017-2018 is more than the average number of users in year 2015-2016, at alpha=0.05)

Q3. Farmers use apps to access information throughout the month. Using the data, check whether app usage is same or different across the four weeks of a month. Anand claims that app usage picked up after January 2016; so, test this hypothesis using data from January-2016 – May 2018.

Part1

Assumption

Avg App usage in first week of every month =µ1

Avg App usage in second week of every month =µ2

Avg App usage in third week of every month =µ3

Avg App usage in fourth week of every month =µ4

Hypothesis

H0 i.e. null hypothesis says: µ1 = µ2 = µ3 = µ4

Ha i.e. alternative hypothesis says: at least one of µ1, µ2, µ3, µ4 differs from the others

Approach

We will subset the data to use only the records from 01-01-2016 onward, as given in the question. A one-way ANOVA test can then be used to compare average app usage across the four weeks. It tells us whether the null hypothesis can be rejected.

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.2.1

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

all_jat_q3_gt16<-filter(all_jat_data,month_year>="2016-01-01")

all_jat_q3_gt16_sub<-select(all_jat_q3_gt16, week,usage)

all_jat_q3_gt16_sub$week<-factor(all_jat_q3_gt16_sub$week)

anova_q3_gt16<-aov(usage~week,data=all_jat_q3_gt16_sub)

summary(anova_q3_gt16)

##             Df   Sum Sq Mean Sq F value Pr(>F)  
## week         3  1675404  558468   2.319 0.0804 .
## Residuals   94 22633380  240781                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Outcomes

Using the ANOVA test, we found that the p-value is 0.0804, which is greater than 0.05, so the result is not significant
Since the p-value from the test is greater than alpha (0.05), we cannot reject the null hypothesis of equal means, i.e. µ1 = µ2 = µ3 = µ4
We conclude that there is not enough evidence that average app usage differs across the four weeks of a month for data from 01-01-2016 onward.
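
The F value and p-value in the ANOVA table can be recomputed from its sums of squares and degrees of freedom. The sketch below copies those four numbers from the summary output above; the variable names are illustrative.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_week  <- 1675404;  df_week  <- 3
ss_resid <- 22633380; df_resid <- 94

# Mean squares are sums of squares divided by their degrees of freedom
ms_week  <- ss_week / df_week
ms_resid <- ss_resid / df_resid

# F is the ratio of between-week to within-week variability
f_value <- ms_week / ms_resid

# Upper-tail probability of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_week, df_resid, lower.tail = FALSE)

# Critical value at alpha = 0.05: H0 is rejected only when f_value exceeds it
f_crit <- qf(0.95, df_week, df_resid)

round(c(f_value = f_value, p_value = p_value, f_crit = f_crit), 4)
```

Because the F value stays below the critical value, the test does not reject equality of the four weekly means at the 0.05 level.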

Inference

Based on the given subset of data and the ANOVA test performed on it, we cannot infer that app usage differs across the four weeks of a month; the data are consistent with the same average usage in all four weeks.

Part2

Assumption

Avg App Usage till 31-12-2015 = µ1

Avg App Usage from 01-01-2016 onward = µ2

Hypothesis

H0 i.e. null hypothesis says: µ1>=µ2

Ha i.e. alternative hypothesis says: µ2 > µ1

Approach

We will use a two-sample t-test to compare the mean app usage of the two groups. It tells us whether the null hypothesis can be rejected.

all_jat_q3_1516<-all_jat_data

all_jat_q3_1516$group<-factor(ifelse(all_jat_data$month_year>="2016-01-01","2016_Onward","Before_2016")) # cutoff is 1 January 2016, matching the group labels

all_jat_q3_1516$group<-relevel(all_jat_q3_1516$group,ref = "2016_Onward")

t.test(all_jat_q3_1516$usage~all_jat_q3_1516$group,alternative = "greater",var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  all_jat_q3_1516$usage by all_jat_q3_1516$group
## t = 3.5721, df = 121, p-value = 0.0002547
## alternative hypothesis: true difference in means between group 2016_Onward and group Before_2016 is greater than 0
## 95 percent confidence interval:
##  198.3213      Inf
## sample estimates:
## mean in group 2016_Onward mean in group Before_2016 
##                  657.7041                  287.6800

Outcomes

Using the two-sample t-test, we found that t = 3.5721, df = 121, p-value = 0.0002547
As the p-value from the t-test is less than alpha (0.05), we reject the null hypothesis, i.e. µ1>=µ2
We support the alternative hypothesis, i.e. µ2 > µ1

Inference

Based on the given data and the two-sample t-test performed on it, we can infer that the data support the claim made by Anand (that app usage picked up after January 2016, at alpha=0.05)

Q4. A new version of the app was released in August 2016. Anand wants to understand which month in the given time frame after the launch of the new version, the mean usage pattern would start to show a statistically significant shift.

Assumption

Assuming the app was released on the first day of August 2016, we assume that this release affects the data from 01-08-2016 onward

Approach

This involves filtering the data from 01-08-2016 onward, grouping it by month, and aggregating it to find the mean usage per month. This can be achieved using the filter, group_by and summarise functions.

all_jat_data_q4<-all_jat_data

all_jat_data_q4 <- filter (all_jat_data_q4, all_jat_data_q4$month_year >= "2016-08-01")

all_jat_data_q4_monthwise <- group_by (all_jat_data_q4,month_year)

all_jat_data_q4_summary <- summarise (all_jat_data_q4_monthwise, usage = mean(usage))

all_jat_data_q4_summary

## # A tibble: 19 × 2
##    month_year          usage
##    <dttm>              <dbl>
##  1 2016-08-01 00:00:00  465 
##  2 2016-09-01 00:00:00  396.
##  3 2016-10-01 00:00:00 1215 
##  4 2016-11-01 00:00:00  751 
##  5 2016-12-01 00:00:00  496.
##  6 2017-01-01 00:00:00  442.
##  7 2017-02-01 00:00:00  429.
##  8 2017-03-01 00:00:00  681.
##  9 2017-04-01 00:00:00  356.
## 10 2017-05-01 00:00:00  203 
## 11 2017-09-01 00:00:00 1923 
## 12 2017-10-01 00:00:00 1398.
## 13 2017-11-01 00:00:00 1300.
## 14 2017-12-01 00:00:00 1446.
## 15 2018-01-01 00:00:00 1123.
## 16 2018-02-01 00:00:00  718 
## 17 2018-03-01 00:00:00  865.
## 18 2018-04-01 00:00:00  623.
## 19 2018-05-01 00:00:00  380

Observation

From the summary produced by the summarise function above, we found that within 2 months of the launch in August 2016 there was a spike in average usage (October 2016), but it did not last long
From September 2017, however, there was a marked rise in average usage, which lasted much longer

Inference

Based on the given data and the monthly means computed above, the mean usage pattern appears to show a sustained shift from September 2017 (there was a spike in October 2016, but it did not last long). This inference rests on inspection of the monthly means rather than on a formal test. It is evident from the graphical visualization shown below.
A point to note is that data for 3 months (June through August 2017) are missing

Graphical Support to Question 4

plot (all_jat_data_q4_summary$month_year, all_jat_data_q4_summary$usage, type="o", col = "orange", xlab = "Year-Month", ylab = "Mean Usage",
      main = "Mean Usage by Month: Aug 2016 onward")
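
The table and plot give a visual answer. One possible way to back it with a test, sketched below on made-up data, is to compare each post-launch month against the pre-launch weeks with a one-sided t-test and report the first month whose adjusted p-value falls below 0.05. The data frame weekly, its columns, the choice of baseline and the Holm adjustment are all assumptions for illustration; they are not the JAT figures or the method used above.

```r
# Hypothetical weekly data: four pre-launch and four post-launch months, 4 weeks each
set.seed(1)
weekly <- data.frame(
  month = rep(seq(as.Date("2016-04-01"), by = "month", length.out = 8), each = 4),
  usage = c(rnorm(16, mean = 300, sd = 40),   # before the launch
            rnorm(16, mean = 600, sd = 40))   # after the launch
)

launch      <- as.Date("2016-08-01")
baseline    <- weekly$usage[weekly$month < launch]
post_months <- sort(unique(weekly$month[weekly$month >= launch]))

# One-sided Welch t-test of each post-launch month against the baseline weeks
p_values <- sapply(post_months, function(m) {
  t.test(weekly$usage[weekly$month == m], baseline,
         alternative = "greater")$p.value
})

# Adjust for running several tests, then label the results by month
p_adj <- p.adjust(p_values, method = "holm")
names(p_adj) <- format(post_months, "%Y-%m")

# First month with a significant upward shift at alpha = 0.05
first_shift <- names(p_adj)[which(p_adj < 0.05)[1]]
first_shift
```

On real data the baseline period and the handling of missing months would need to be chosen deliberately, since both affect which month is flagged first.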

Q5. If a disease is likely to spread in particular weather condition (data given in the disease index sheet), then the access of that disease should be more in the months having suitable weather conditions. Help the analyst in coming up with a statistical test to support the claim for two districts for which the sample of weather and disease access data is provided in the data sheet. Identify the diseases for which you can support this claim. Test this claim both for temperature and relative humidity at 95% confidence.

Assumption

Avg disease access when conditions are not favorable to disease = µ1

Avg disease access when conditions are favorable to disease= µ2

Hypothesis

H0 i.e. null hypothesis says: µ2 <= µ1

Ha i.e. alternative hypothesis says: µ2 > µ1

Approach

We will use a two-sample t-test to compare mean disease access under favorable and unfavorable conditions in the given data sheets. It tells us whether the null hypothesis can be rejected.

We will use the following data set naming convention:

q5_bwdata_d1 (data for Belagavi, disease 1)

q5_dwdata_d1 (data for Dharwad, disease 1)
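
Every district and disease combination follows the same three steps: flag the months with favorable weather, make "Y" the first factor level, and run a one-sided two-sample t-test. A minimal sketch of those steps as one reusable function is shown here; test_favorable and the toy data frame are hypothetical and are not part of the analysis below.

```r
# Hypothetical helper: flags favorable rows and runs the one-sided pooled t-test
test_favorable <- function(df, disease_col, favorable) {
  # "Y" is listed first so the tested difference is mean(Y) - mean(N)
  flag <- factor(ifelse(favorable, "Y", "N"), levels = c("Y", "N"))
  t.test(df[[disease_col]] ~ flag, alternative = "greater", var.equal = TRUE)
}

# Toy monthly data, invented for illustration only
toy <- data.frame(
  temperature = c(21, 22, 23, 22, 21, 23, 28, 29, 30, 27, 18, 31),
  humidity    = c(85, 88, 90, 84, 86, 91, 60, 65, 70, 82, 88, 55),
  d1          = c(35, 40, 38, 42, 36, 39, 10, 12,  9, 14, 11,  8)
)

# Favorable condition: temperature in [20, 24] and humidity above 80
res <- test_favorable(
  toy, "d1",
  toy$temperature >= 20 & toy$temperature <= 24 & toy$humidity > 80
)
res$p.value
```

Passing the condition as a logical vector keeps the function independent of the column names, which differ between the two weather sheets (humidity and relative_humidity).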

Belagavi - D1

setwd(dir="C:\\Users\\pahar\\OneDrive\\IIM-K\\Classes\\2022-09-10")
library(readxl)
q5_bwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Belagavi_weather",col_names = TRUE)
q5_dwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Dharwad_weather",col_names = TRUE)

library(janitor)
q5_bwdata<-clean_names(q5_bwdata)
q5_dwdata<-clean_names(q5_dwdata)
q5_wdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Disease_index",col_names = TRUE)

q5_bwdata_d1<-q5_bwdata

q5_bwdata_d1$flag<-factor(ifelse((q5_bwdata_d1$temperature>=20 & q5_bwdata_d1$temperature<=24 & q5_bwdata_d1$humidity>80),"Y","N"))

q5_bwdata_d1$flag<-relevel(q5_bwdata_d1$flag,ref = "Y")

t.test(q5_bwdata_d1$d1~q5_bwdata_d1$flag,data=q5_bwdata_d1,alternative="greater",var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  q5_bwdata_d1$d1 by q5_bwdata_d1$flag
## t = 2.7605, df = 22, p-value = 0.005707
## alternative hypothesis: true difference in means between group Y and group N is greater than 0
## 95 percent confidence interval:
##  9.704442      Inf
## sample estimates:
## mean in group Y mean in group N 
##        37.59305        11.91669

Outcomes

Using the two-sample t-test, we found that t = 2.7605, df = 22, p-value = 0.005707
As the p-value from the test is less than alpha (0.05), we reject the null hypothesis, i.e. µ2<=µ1
We support the alternative hypothesis, i.e. µ2 > µ1

Inference

Based on the given data and the two-sample t-test performed on it, we can infer that for Belagavi, information on disease D1 is accessed more when conditions are favorable to it than when they are not.

Dharwad - D1

setwd(dir="C:\\Users\\pahar\\OneDrive\\IIM-K\\Classes\\2022-09-10")
library(readxl)
q5_bwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Belagavi_weather",col_names = TRUE)
q5_dwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Dharwad_weather",col_names = TRUE)

library(janitor)
q5_bwdata<-clean_names(q5_bwdata)
q5_dwdata<-clean_names(q5_dwdata)
q5_wdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Disease_index",col_names = TRUE)

q5_dwdata_d1<-q5_dwdata

q5_dwdata_d1$flag<-factor(ifelse((q5_dwdata_d1$temperature>=20 & q5_dwdata_d1$temperature<=24 & q5_dwdata_d1$relative_humidity>80),"Y","N"))

q5_dwdata_d1$flag<-relevel(q5_dwdata_d1$flag,ref = "Y")

t.test(q5_dwdata_d1$d1~q5_dwdata_d1$flag,data=q5_dwdata_d1,alternative="greater",var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  q5_dwdata_d1$d1 by q5_dwdata_d1$flag
## t = 4.5934, df = 20, p-value = 8.801e-05
## alternative hypothesis: true difference in means between group Y and group N is greater than 0
## 95 percent confidence interval:
##  15.66022      Inf
## sample estimates:
## mean in group Y mean in group N 
##       31.590651        6.515126

Outcomes

Using the two-sample t-test, we found that t = 4.5934, df = 20, p-value = 8.801e-05
As the p-value from the test is less than alpha (0.05), we reject the null hypothesis, i.e. µ2<=µ1
We support the alternative hypothesis, i.e. µ2 > µ1

Inference

Based on the given data and the two-sample t-test performed on it, we can infer that for Dharwad, information on disease D1 is accessed more when conditions are favorable to it than when they are not.

Belagavi - D2

setwd(dir="C:\\Users\\pahar\\OneDrive\\IIM-K\\Classes\\2022-09-10")
library(readxl)
q5_bwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Belagavi_weather",col_names = TRUE)
q5_dwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Dharwad_weather",col_names = TRUE)

library(janitor)
q5_bwdata<-clean_names(q5_bwdata)
q5_dwdata<-clean_names(q5_dwdata)
q5_wdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Disease_index",col_names = TRUE)

q5_bwdata_d2<-q5_bwdata

q5_bwdata_d2$flag<-factor(ifelse((q5_bwdata_d2$temperature>=21.5 & q5_bwdata_d2$temperature<=24.5 & q5_bwdata_d2$humidity>83),"Y","N"))

q5_bwdata_d2$flag<-relevel(q5_bwdata_d2$flag,ref = "Y")

t.test(q5_bwdata_d2$d2~q5_bwdata_d2$flag,data=q5_bwdata_d2,alternative="greater",var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  q5_bwdata_d2$d2 by q5_bwdata_d2$flag
## t = 3.7247, df = 22, p-value = 0.0005887
## alternative hypothesis: true difference in means between group Y and group N is greater than 0
## 95 percent confidence interval:
##  10.89113      Inf
## sample estimates:
## mean in group Y mean in group N 
##       29.380223        9.173547

Outcomes

Using the two-sample t-test, we found that t = 3.7247, df = 22, p-value = 0.0005887
As the p-value from the test is less than alpha (0.05), we reject the null hypothesis, i.e. µ2<=µ1
We support the alternative hypothesis, i.e. µ2 > µ1

Inference

Based on the given data and the two-sample t-test performed on it, we can infer that for Belagavi, information on disease D2 is accessed more when conditions are favorable to it than when they are not.

Dharwad - D2

setwd(dir="C:\\Users\\pahar\\OneDrive\\IIM-K\\Classes\\2022-09-10")
library(readxl)
q5_bwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Belagavi_weather",col_names = TRUE)
q5_dwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Dharwad_weather",col_names = TRUE)

library(janitor)
q5_bwdata<-clean_names(q5_bwdata)
q5_dwdata<-clean_names(q5_dwdata)
q5_wdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Disease_index",col_names = TRUE)

q5_dwdata_d2<-q5_dwdata

q5_dwdata_d2$flag<-factor(ifelse((q5_dwdata_d2$temperature>=21.5 & q5_dwdata_d2$temperature<=24.5 & q5_dwdata_d2$relative_humidity>83),"Y","N"))

q5_dwdata_d2$flag<-relevel(q5_dwdata_d2$flag,ref = "Y")

t.test(q5_dwdata_d2$d2~q5_dwdata_d2$flag,data=q5_dwdata_d2,alternative="greater",var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  q5_dwdata_d2$d2 by q5_dwdata_d2$flag
## t = 4.0726, df = 20, p-value = 0.0002968
## alternative hypothesis: true difference in means between group Y and group N is greater than 0
## 95 percent confidence interval:
##  19.62338      Inf
## sample estimates:
## mean in group Y mean in group N 
##       40.134921        6.096486

Outcomes

Using the two-sample t-test, we found that t = 4.0726, df = 20, p-value = 0.0002968
As the p-value from the test is less than alpha (0.05), we reject the null hypothesis, i.e. µ2<=µ1
We support the alternative hypothesis, i.e. µ2 > µ1

Inference

Based on the given data and the two-sample t-test performed on it, we can infer that for Dharwad, information on disease D2 is accessed more when conditions are favorable to it than when they are not.

Belagavi - D3

setwd(dir="C:\\Users\\pahar\\OneDrive\\IIM-K\\Classes\\2022-09-10")
library(readxl)
q5_bwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Belagavi_weather",col_names = TRUE)
q5_dwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Dharwad_weather",col_names = TRUE)

library(janitor)
q5_bwdata<-clean_names(q5_bwdata)
q5_dwdata<-clean_names(q5_dwdata)
q5_wdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Disease_index",col_names = TRUE)

q5_bwdata_d3<-q5_bwdata

q5_bwdata_d3$flag<-factor(ifelse((q5_bwdata_d3$temperature>=22 & q5_bwdata_d3$temperature<=24),"Y","N"))

q5_bwdata_d3$flag<-relevel(q5_bwdata_d3$flag,ref = "Y")

t.test(q5_bwdata_d3$d3~q5_bwdata_d3$flag,data=q5_bwdata_d3,alternative="greater",var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  q5_bwdata_d3$d3 by q5_bwdata_d3$flag
## t = 2.2224, df = 22, p-value = 0.01843
## alternative hypothesis: true difference in means between group Y and group N is greater than 0
## 95 percent confidence interval:
##  4.39784     Inf
## sample estimates:
## mean in group Y mean in group N 
##        30.95773        11.61233

Outcomes

Using the two-sample t-test, we found that t = 2.2224, df = 22, p-value = 0.01843
As the p-value from the test is less than alpha (0.05), we reject the null hypothesis, i.e. µ2<=µ1
We support the alternative hypothesis, i.e. µ2 > µ1

Inference

Based on the given data and the two-sample t-test performed on it, we can infer that for Belagavi, information on disease D3 is accessed more when conditions are favorable to it than when they are not.

Dharwad - D3

setwd(dir="C:\\Users\\pahar\\OneDrive\\IIM-K\\Classes\\2022-09-10")
library(readxl)
q5_bwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Belagavi_weather",col_names = TRUE)
q5_dwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Dharwad_weather",col_names = TRUE)

library(janitor)
q5_bwdata<-clean_names(q5_bwdata)
q5_dwdata<-clean_names(q5_dwdata)
q5_wdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Disease_index",col_names = TRUE)

q5_dwdata_d3<-q5_dwdata

q5_dwdata_d3$flag<-factor(ifelse((q5_dwdata_d3$temperature>=22 & q5_dwdata_d3$temperature<=24),"Y","N"))

q5_dwdata_d3$flag<-relevel(q5_dwdata_d3$flag,ref = "Y")

t.test(q5_dwdata_d3$d3~q5_dwdata_d3$flag,data=q5_dwdata_d3,alternative="greater",var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  q5_dwdata_d3$d3 by q5_dwdata_d3$flag
## t = 1.5057, df = 20, p-value = 0.07389
## alternative hypothesis: true difference in means between group Y and group N is greater than 0
## 95 percent confidence interval:
##  -4.118138       Inf
## sample estimates:
## mean in group Y mean in group N 
##        40.26971        11.96166

Outcomes

Using the two-sample t-test, we found that t = 1.5057, df = 20, p-value = 0.07389
As the p-value from the t-test is greater than alpha (0.05), we cannot reject the null hypothesis, i.e. µ2<=µ1

Inference

Based on the given data and the two-sample t-test performed on it, we cannot support the claim for Dharwad and disease D3: the sample mean is higher under favorable conditions, but the difference is not statistically significant at the 0.05 level. This does not show that D3 is accessed more under unfavorable conditions.
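
A one-sided test that is not significant says nothing in favor of the opposite direction. The sketch below makes this concrete using only the t and df values from the Dharwad D3 output above (variable names are illustrative): the p-value for the reverse alternative is the complement of the reported one.

```r
# Values copied from the Dharwad D3 t-test output above
t_stat <- 1.5057
df     <- 20

# Reported direction, Ha: access under favorable conditions is higher
p_greater <- pt(t_stat, df = df, lower.tail = FALSE)

# Reverse direction, Ha: access under favorable conditions is lower
p_less <- pt(t_stat, df = df, lower.tail = TRUE)

round(c(p_greater = p_greater, p_less = p_less), 4)
```

The reverse test has a p-value above 0.9, so the data give no support to the idea that access is higher under unfavorable conditions.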

Belagavi - D4

setwd(dir="C:\\Users\\pahar\\OneDrive\\IIM-K\\Classes\\2022-09-10")
library(readxl)
q5_bwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Belagavi_weather",col_names = TRUE)
q5_dwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Dharwad_weather",col_names = TRUE)

library(janitor)
q5_bwdata<-clean_names(q5_bwdata)
q5_dwdata<-clean_names(q5_dwdata)
q5_wdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Disease_index",col_names = TRUE)

q5_bwdata_d4<-q5_bwdata

q5_bwdata_d4$flag<-factor(ifelse((q5_bwdata_d4$temperature>=22 & q5_bwdata_d4$temperature<=26 & q5_bwdata_d4$humidity>85),"Y","N"))

q5_bwdata_d4$flag<-relevel(q5_bwdata_d4$flag,ref = "Y")

t.test(q5_bwdata_d4$d4~q5_bwdata_d4$flag,data=q5_bwdata_d4,alternative="greater",var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  q5_bwdata_d4$d4 by q5_bwdata_d4$flag
## t = 1.793, df = 22, p-value = 0.04337
## alternative hypothesis: true difference in means between group Y and group N is greater than 0
## 95 percent confidence interval:
##  0.4785112       Inf
## sample estimates:
## mean in group Y mean in group N 
##        24.28984        12.97384

Outcomes

Using the two-sample t-test, we found that t = 1.793, df = 22, p-value = 0.04337
As the p-value from the test is less than alpha (0.05), we reject the null hypothesis, i.e. µ2<=µ1
We support the alternative hypothesis, i.e. µ2 > µ1

Inference

Based on the given data and the two-sample t-test performed on it, we can infer that for Belagavi, information on disease D4 is accessed more when conditions are favorable to it than when they are not.

Dharwad - D4

setwd(dir="C:\\Users\\pahar\\OneDrive\\IIM-K\\Classes\\2022-09-10")
library(readxl)
q5_bwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Belagavi_weather",col_names = TRUE)
q5_dwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Dharwad_weather",col_names = TRUE)

library(janitor)
q5_bwdata<-clean_names(q5_bwdata)
q5_dwdata<-clean_names(q5_dwdata)
q5_wdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Disease_index",col_names = TRUE)

q5_dwdata_d4<-q5_dwdata

q5_dwdata_d4$flag<-factor(ifelse((q5_dwdata_d4$temperature>=22 & q5_dwdata_d4$temperature<=26 & q5_dwdata_d4$relative_humidity>85),"Y","N"))

q5_dwdata_d4$flag<-relevel(q5_dwdata_d4$flag,ref = "Y")

t.test(q5_dwdata_d4$d4~q5_dwdata_d4$flag,data=q5_dwdata_d4,alternative="greater",var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  q5_dwdata_d4$d4 by q5_dwdata_d4$flag
## t = 2.3147, df = 20, p-value = 0.01569
## alternative hypothesis: true difference in means between group Y and group N is greater than 0
## 95 percent confidence interval:
##  6.896259      Inf
## sample estimates:
## mean in group Y mean in group N 
##        39.16667        12.10875

Outcomes

Using the two-sample t-test, we found that t = 2.3147, df = 20, p-value = 0.01569
As the p-value from the test is less than alpha (0.05), we reject the null hypothesis, i.e. µ2<=µ1
We support the alternative hypothesis, i.e. µ2 > µ1

Inference

Based on the given data and the two-sample t-test performed on it, we can infer that for Dharwad, information on disease D4 is accessed more when conditions are favorable to it than when they are not.

Belagavi - D5

setwd(dir="C:\\Users\\pahar\\OneDrive\\IIM-K\\Classes\\2022-09-10")
library(readxl)
q5_bwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Belagavi_weather",col_names = TRUE)
q5_dwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Dharwad_weather",col_names = TRUE)

library(janitor)
q5_bwdata<-clean_names(q5_bwdata)
q5_dwdata<-clean_names(q5_dwdata)
q5_wdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Disease_index",col_names = TRUE)

q5_bwdata_d5<-q5_bwdata

q5_bwdata_d5$flag<-factor(ifelse(((q5_bwdata_d5$temperature>=22 & q5_bwdata_d5$temperature<=24.5) & (q5_bwdata_d5$humidity>=77 & q5_bwdata_d5$humidity<=85)),"Y","N"))

q5_bwdata_d5$flag<-relevel(q5_bwdata_d5$flag,ref = "Y")

t.test(q5_bwdata_d5$d5~q5_bwdata_d5$flag,data=q5_bwdata_d5,alternative="greater",var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  q5_bwdata_d5$d5 by q5_bwdata_d5$flag
## t = 3.6675, df = 22, p-value = 0.0006761
## alternative hypothesis: true difference in means between group Y and group N is greater than 0
## 95 percent confidence interval:
##  13.85781      Inf
## sample estimates:
## mean in group Y mean in group N 
##        36.57407        10.51547

Outcomes

Using the two-sample t-test, we found that t = 3.6675, df = 22, p-value = 0.0006761
As the p-value from the test is less than alpha (0.05), we reject the null hypothesis, i.e. µ2<=µ1
We support the alternative hypothesis, i.e. µ2 > µ1

Inference

Based on the given data and the two-sample t-test performed on it, we can infer that for Belagavi, information on disease D5 is accessed more when conditions are favorable to it than when they are not.

Dharwad - D5

setwd(dir="C:\\Users\\pahar\\OneDrive\\IIM-K\\Classes\\2022-09-10")
library(readxl)
q5_bwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Belagavi_weather",col_names = TRUE)
q5_dwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Dharwad_weather",col_names = TRUE)

library(janitor)
q5_bwdata<-clean_names(q5_bwdata)
q5_dwdata<-clean_names(q5_dwdata)
q5_wdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Disease_index",col_names = TRUE)

q5_dwdata_d5<-q5_dwdata

q5_dwdata_d5$flag<-factor(ifelse(((q5_dwdata_d5$temperature>=22 & q5_dwdata_d5$temperature<=24.5) & (q5_dwdata_d5$relative_humidity>=77 & q5_dwdata_d5$relative_humidity<=85)),"Y","N"))

q5_dwdata_d5$flag<-relevel(q5_dwdata_d5$flag,ref = "Y")

t.test(q5_dwdata_d5$d5~q5_dwdata_d5$flag,data=q5_dwdata_d5,alternative="greater",var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  q5_dwdata_d5$d5 by q5_dwdata_d5$flag
## t = 0.10853, df = 20, p-value = 0.4573
## alternative hypothesis: true difference in means between group Y and group N is greater than 0
## 95 percent confidence interval:
##  -16.53381       Inf
## sample estimates:
## mean in group Y mean in group N 
##        14.17749        13.06725

Outcomes

Using the two-sample t-test, we found that t = 0.10853, df = 20, p-value = 0.4573
As the p-value from the t-test is greater than alpha (0.05), we cannot reject the null hypothesis, i.e. µ2<=µ1

Inference

Based on the given data and the two-sample t-test performed on it, we cannot support the claim for Dharwad and disease D5: the two group means are close and the difference is not statistically significant at the 0.05 level. This does not show that D5 is accessed more under unfavorable conditions.

Belagavi - D7

setwd(dir="C:\\Users\\pahar\\OneDrive\\IIM-K\\Classes\\2022-09-10")
library(readxl)
q5_bwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Belagavi_weather",col_names = TRUE)
q5_dwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Dharwad_weather",col_names = TRUE)

library(janitor)
q5_bwdata<-clean_names(q5_bwdata)
q5_dwdata<-clean_names(q5_dwdata)
q5_wdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Disease_index",col_names = TRUE)

q5_bwdata_d7<-q5_bwdata

q5_bwdata_d7$flag<-factor(ifelse((q5_bwdata_d7$temperature>25 & q5_bwdata_d7$humidity>80),"Y","N"))

q5_bwdata_d7$flag<-relevel(q5_bwdata_d7$flag,ref = "Y")

t.test(q5_bwdata_d7$d4~q5_bwdata_d7$flag,data=q5_bwdata_d7,alternative="greater",var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  q5_bwdata_d7$d4 by q5_bwdata_d7$flag
## t = 1.1738, df = 22, p-value = 0.1265
## alternative hypothesis: true difference in means between group Y and group N is greater than 0
## 95 percent confidence interval:
##  -4.373739       Inf
## sample estimates:
## mean in group Y mean in group N 
##        23.59788        14.15040

Outcomes

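Using the two-sample t-test, we found that t = 1.1738, df = 22, p-value = 0.1265
As the p-value from the test is greater than alpha (0.05), we cannot reject the null hypothesis, i.e. µ2<=µ1
Note that the call above tests column d4 rather than d7, so these figures describe d4 access under the D7 weather conditions

Inference

Based on the given data and the two-sample t-test performed on it, we cannot support the claim for Belagavi and disease D7: the difference is not statistically significant at the 0.05 level, and the test was computed on the d4 column. This does not show that D7 is accessed more under unfavorable conditions.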

Dharwad - D7

setwd(dir="C:\\Users\\pahar\\OneDrive\\IIM-K\\Classes\\2022-09-10")
library(readxl)
q5_bwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Belagavi_weather",col_names = TRUE)
q5_dwdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Dharwad_weather",col_names = TRUE)

library(janitor)
q5_bwdata<-clean_names(q5_bwdata)
q5_dwdata<-clean_names(q5_dwdata)
q5_wdata<-read_xlsx("IMB733-XLS-ENG.xlsx",sheet = "Disease_index",col_names = TRUE)

q5_dwdata_d7<-q5_dwdata

q5_dwdata_d7$flag<-factor(ifelse((q5_dwdata_d7$temperature>25 & q5_dwdata_d7$relative_humidity>80),"Y","N"))

q5_dwdata_d7$flag<-relevel(q5_dwdata_d7$flag,ref = "Y")

t.test(q5_dwdata_d7$d4~q5_dwdata_d7$flag,data=q5_dwdata_d7,alternative="greater",var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  q5_dwdata_d7$d4 by q5_dwdata_d7$flag
## t = 1.0197, df = 20, p-value = 0.16
## alternative hypothesis: true difference in means between group Y and group N is greater than 0
## 95 percent confidence interval:
##  -10.80039       Inf
## sample estimates:
## mean in group Y mean in group N 
##        30.00000        14.37831

Outcomes

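Using the two-sample t-test, we found that t = 1.0197, df = 20, p-value = 0.16
As the p-value from the t-test is greater than alpha (0.05), we cannot reject the null hypothesis, i.e. µ2<=µ1
Note that the call above tests column d4 rather than d7, so these figures describe d4 access under the D7 weather conditions

Inference

Based on the given data and the two-sample t-test performed on it, we cannot support the claim for Dharwad and disease D7: the difference is not statistically significant at the 0.05 level, and the test was computed on the d4 column. This does not show that D7 is accessed more under unfavorable conditions.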

No favorable temperature and humidity conditions are given for the remaining diseases, so the hypothesis could not be tested for them.

Pahar Singh : ADAM04 Group 4

2022-09-15
