We will read the UCDavis1 dataset from the book, which records TV hours, computer hours, and sex for each student.

Type answers after the -> sign.

UCDavis1 <- read.csv("https://raw.githubusercontent.com/xzhang47/3125/main/UCDavis1.csv")
colnames(UCDavis1)[1]<- "Sex"
str(UCDavis1)
## 'data.frame':    173 obs. of  3 variables:
##  $ Sex     : chr  "Female" "Female" "Male" "Male" ...
##  $ TV      : num  13 2 20 15 8 2.5 2 4 8 1 ...
##  $ computer: num  10 5 7 15 20 10 14 28 10 15 ...
  1. What type of variable is Sex?

-> chr, or categorical nominal

  1. What type of variable is TV?

-> num, or quantitative continuous

  1. What type of variable is computer?

-> num, or quantitative continuous
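
The type questions above can also be checked in code. The sketch below uses a small made-up data frame (`toy`, a hypothetical stand-in with the same column names as UCDavis1) to show how `class()` reports the storage type of each column and how a character column can be turned into a factor.

```r
# Toy data frame mimicking the structure of UCDavis1 (values are made up)
toy <- data.frame(
  Sex      = c("Female", "Male", "Female", "Male"),
  TV       = c(13, 2, 20, 2.5),
  computer = c(10, 5, 7, 15),
  stringsAsFactors = FALSE
)

# Storage class of every column
col_classes <- sapply(toy, class)
print(col_classes)

# Converting Sex to a factor makes its categorical nature explicit
toy$Sex <- factor(toy$Sex)
print(levels(toy$Sex))
```

A character or factor column is treated as categorical, while a numeric column is treated as quantitative.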

We will go over the three situations from the lecture to make inferences.

Situation 1: Estimating the mean of a quantitative variable.

Example research questions: What is the mean time that college students watch TV per week?

Next, we will use the UCDavis1 dataset to estimate the population mean time college students spend watching TV per week, using a confidence interval and a hypothesis test.

  1. What is the population of interest?

-> College students

  1. What is the population parameter of interest?

-> the population mean time college students spend watching TV per week

  1. What is the sample statistic of interest?

-> the sample mean time college students spend watching TV per week

  1. Which variable from UCDavis1 dataset should we use to answer the research question?

-> TV

Confidence interval to estimate the population mean time of college students watching TV

Step 1: Verify the assumptions:

  1. What is the sample size? Is it large enough? Can we proceed?

-> n= 173 > 30, therefore we can proceed
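
The row count reported by str() includes rows with missing values, so it can help to count the usable observations directly. This is a minimal sketch on a made-up vector (`tv_hours` is hypothetical, not the real TV column).

```r
# Hypothetical TV hours with one missing value (made-up numbers)
tv_hours <- c(13, 2, 20, NA, 8, 2.5)

n_rows   <- length(tv_hours)        # counts every entry, including NA
n_usable <- sum(!is.na(tv_hours))   # counts only the observed values

print(c(rows = n_rows, usable = n_usable))
```

With the real data, the same idea would be `sum(!is.na(UCDavis1$TV))`, and the degrees of freedom of the one-sample t-test equal that count minus 1.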

Next, check the Shape of the Data

We can use a histogram or a boxplot to examine the quantitative variable and check the conditions required for a confidence interval or hypothesis test for a population mean.

boxplot(UCDavis1$TV,
     xlab   = "",
     ylab   = "Watching TV Time",
     main   = "TV",
     pch    = 20,
     cex    = 2,
     col    = "hotpink",
     border = "lightblue")

  1. Any outlier in the variable TV?

-> yes, we have a few outliers
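
To list the points that a boxplot draws as outliers, `boxplot.stats()` returns them in its `out` component, using the same 1.5 times the hinge spread rule as `boxplot()`. The vector below is made up for illustration.

```r
# Made-up weekly TV hours with one unusually large value
tv_hours <- c(1, 2, 2, 3, 4, 4, 5, 6, 8, 40)

stats <- boxplot.stats(tv_hours)

# Five-number summary used to draw the box and whiskers
print(stats$stats)

# Values beyond the whiskers, which the boxplot marks as outliers
outliers <- stats$out
print(outliers)
```

With the real data, `boxplot.stats(UCDavis1$TV)$out` would list the outlying TV values.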

#Get a histogram
#TYPE UCDavis1$TV (SAME AS IN boxplot FUNCTION) AFTER hist(
#In a histogram the variable goes on the x-axis and the counts on the y-axis
hist(UCDavis1$TV,
     xlab   = "Watching TV Time",
     ylab   = "Frequency",
     main   = "Histogram for TV Watching Time among College students from UCDavis",
     pch    = 20,
     cex    = 2,
     col    = "hotpink",
     border = "darkred")

  1. Describe the shape of the variable TV. Any violation of assumption?

-> TV is skewed right. The distribution is not normal, but the sample size is large enough, so we can proceed.

Now, we can calculate the Confidence Interval

The following code computes a 95% confidence interval for the population mean time college students spend watching TV per week.

We use the function t.test to get the confidence interval. The same function is used for inference throughout this practice.

test_results = t.test(x = UCDavis1$TV, mu = 10,  
            alternative = c("two.sided"), conf.level = 0.95)
test_results
## 
##  One Sample t-test
## 
## data:  UCDavis1$TV
## t = -1.4143, df = 172, p-value = 0.1591
## alternative hypothesis: true mean is not equal to 10
## 95 percent confidence interval:
##   7.320477 10.442529
## sample estimates:
## mean of x 
##  8.881503
  1. After what option did we set the confidence level?

-> conf.level

  1. What is the 95% Confidence interval? Interpretation?

-> we are 95% confident that the population mean time college students spend watching TV per week is between 7.32 and 10.44 hours
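
The interval reported by t.test can be reproduced by hand as sample mean plus or minus t* times the standard error. The sketch below uses only the first ten TV values shown in the str() output, not the full dataset, so its numbers will differ from the output above.

```r
# First ten TV values shown in the str() output (not the full dataset)
x <- c(13, 2, 20, 15, 8, 2.5, 2, 4, 8, 1)

conf   <- 0.95
n      <- length(x)
xbar   <- mean(x)
se     <- sd(x) / sqrt(n)                       # standard error of the mean
t_star <- qt(1 - (1 - conf) / 2, df = n - 1)    # t multiplier for the chosen level

manual_ci  <- xbar + c(-1, 1) * t_star * se
builtin_ci <- t.test(x, conf.level = conf)$conf.int

print(manual_ci)
print(builtin_ci)
```

Both approaches give the same two endpoints, which shows that the mu and alternative arguments do not affect a two-sided confidence interval.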

  1. Update the following chunk of code to calculate a 90% confidence interval.
#TYPE 0.9 AFTER THE LAST EQUAL SIGN 
t.test(x = UCDavis1$TV, mu = 10,  
            alternative = c("two.sided"), conf.level =0.9 )
## 
##  One Sample t-test
## 
## data:  UCDavis1$TV
## t = -1.4143, df = 172, p-value = 0.1591
## alternative hypothesis: true mean is not equal to 10
## 90 percent confidence interval:
##   7.573622 10.189384
## sample estimates:
## mean of x 
##  8.881503
  1. What is the 90% confidence interval? Is it wider or narrower?

-> 7.57 to 10.19, which is narrower
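
The relationship between confidence level and interval width can be checked for several levels at once. This sketch again uses the first ten TV values from the str() output as a small illustrative sample; `conf_levels` and `widths` are names chosen here.

```r
# First ten TV values shown in the str() output (not the full dataset)
x <- c(13, 2, 20, 15, 8, 2.5, 2, 4, 8, 1)

conf_levels <- c(0.90, 0.95, 0.99)

# Width of the t interval (upper endpoint minus lower endpoint) at each level
widths <- sapply(conf_levels, function(cl) {
  ci <- t.test(x, conf.level = cl)$conf.int
  ci[2] - ci[1]
})

print(data.frame(conf_level = conf_levels, width = widths))
```

A lower confidence level uses a smaller t multiplier, so the interval becomes narrower.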

One Sample t-Test

Next, we will test whether the population mean time college students spend watching TV per week is 10 hours.

  1. What is the Null hypothesis?

-> the population mean time of college students watching TV per week is equal to 10

  1. What is the Alternative hypothesis?

-> the population mean time of college students watching TV per week is not equal to 10

test_results = t.test(x = UCDavis1$TV, mu = 10,  
            alternative = c("two.sided"), conf.level = 0.95)
test_results
## 
##  One Sample t-test
## 
## data:  UCDavis1$TV
## t = -1.4143, df = 172, p-value = 0.1591
## alternative hypothesis: true mean is not equal to 10
## 95 percent confidence interval:
##   7.320477 10.442529
## sample estimates:
## mean of x 
##  8.881503
  1. We have checked the conditions for the confidence interval, so we can carry out the one sample t-test. What is the test statistic? What is the p-value for the test? (Take it out of scientific notation)

-> t = -1.4143, and the p-value = 0.1591
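
The test statistic and p-value can be pulled out of the object returned by t.test, which avoids copying errors such as dropping the minus sign. A minimal sketch, using the first ten TV values from the str() output rather than the full dataset:

```r
# First ten TV values shown in the str() output (not the full dataset)
x   <- c(13, 2, 20, 15, 8, 2.5, 2, 4, 8, 1)
mu0 <- 10   # hypothesized mean from the null hypothesis

res <- t.test(x, mu = mu0, alternative = "two.sided")

t_stat  <- unname(res$statistic)   # keeps the sign of the statistic
p_value <- res$p.value
df      <- unname(res$parameter)

# Same statistic by hand: (sample mean - mu0) / standard error
manual_t <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

print(c(t = t_stat, df = df, p = p_value))
```

The statistic is negative whenever the sample mean is below the hypothesized value, as it is in the worksheet output.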

  1. After we get the p-value, what is your decision about the null hypothesis? And your conclusion?

-> we fail to reject the null hypothesis because the p-value 0.1591 > 0.05. There is not enough evidence to conclude that the population mean time college students spend watching TV per week differs from 10 hours

  1. Suppose the research hypothesis changes to whether the population mean time college students spend watching TV per week is less than 11 hours. How should you update the code below?
#UPDATE THE FOLLOWING CODE BY TYPING less WITHIN THE QUOTATION MARKS AND TYPING 11 AFTER mu =
t.test(x = UCDavis1$TV, mu =11 ,  
            alternative = c("less"), conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  UCDavis1$TV
## t = -2.6788, df = 172, p-value = 0.004053
## alternative hypothesis: true mean is less than 11
## 95 percent confidence interval:
##      -Inf 10.18938
## sample estimates:
## mean of x 
##  8.881503
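
For a "less than" alternative, the p-value is the area to the left of the test statistic under the t distribution. The sketch below computes it with `pt()` and compares it with t.test, using the first ten TV values from the str() output as an illustrative sample.

```r
# First ten TV values shown in the str() output (not the full dataset)
x   <- c(13, 2, 20, 15, 8, 2.5, 2, 4, 8, 1)
mu0 <- 11
n   <- length(x)

t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

p_less <- pt(t_stat, df = n - 1)             # left-tail area
p_two  <- 2 * pt(-abs(t_stat), df = n - 1)   # both tails

builtin_less <- t.test(x, mu = mu0, alternative = "less")$p.value
builtin_two  <- t.test(x, mu = mu0, alternative = "two.sided")$p.value

print(c(less = p_less, two_sided = p_two))
```

When the statistic is negative, the one-sided "less" p-value is half of the two-sided p-value.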

Situation 2. Estimating the mean of paired differences for quantitative variables.

Example research question: What is the population mean difference between the time college students spend watching TV and the time they spend on the computer?

To conduct a paired-samples test, we need either two vectors of data, \(y_1\) and \(y_2\), or we need one vector of data with a second that serves as a binary grouping variable. The test is then run using the syntax t.test(y1, y2, paired=TRUE).

# Create a new variable difference first
UCDavis1$difference <- UCDavis1$TV-UCDavis1$computer
str(UCDavis1)
## 'data.frame':    173 obs. of  4 variables:
##  $ Sex       : chr  "Female" "Female" "Male" "Male" ...
##  $ TV        : num  13 2 20 15 8 2.5 2 4 8 1 ...
##  $ computer  : num  10 5 7 15 20 10 14 28 10 15 ...
##  $ difference: num  3 -3 13 0 -12 -7.5 -12 -24 -2 -14 ...

Next, we will use the UCDavis1 dataset to estimate the population mean difference between the time college students spend watching TV and using the computer per week, using a confidence interval and a hypothesis test.

  1. What is the population parameter of interest?

-> the population mean time difference between college students watching TV and using a computer

  1. What is the sample statistic of interest?

-> the sample mean time difference between college students watching TV and using a computer

  1. Which variable from UCDavis1 dataset should we use to answer this research question?

-> difference (TV - Computer)

Confidence interval to estimate the population mean time difference between college students watching TV and using computer

Step 1: Verify the assumptions:

  1. What is the sample size? Is it large enough? Are they related? Can we proceed?

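-> n = 171 complete pairs (the paired test reports df = 170), which is greater than 30. The two measurements are related because both come from the same student, so we can proceed.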

Next, check the Shape of the Data

We can use a histogram or a boxplot to examine the quantitative variable and check the conditions required for a confidence interval or hypothesis test.

boxplot(UCDavis1$difference,
     xlab   = "",
     ylab   = "Watching TV Time - using computer",
     main   = "TV vs computer",
     pch    = 20,
     cex    = 2,
     col    = "hotpink",
     border = "lightblue")

#Get a histogram
#TYPE UCDavis1$difference AFTER hist(
#In a histogram the variable goes on the x-axis and the counts on the y-axis
hist(UCDavis1$difference,
     xlab   = "Watching TV Time - using computer",
     ylab   = "Frequency",
     main   = "TV vs computer",
     pch    = 20,
     cex    = 2,
     col    = "hotpink",
     border = "dodgerblue")

  1. Any outlier? Describe the shape of the variable difference. Any violation of assumption?

-> the distribution has some upper and lower outliers, but the sample is large enough, so there is no violation.

Now, we can calculate the confidence interval with t.test, either from the two original variables with paired = TRUE or from the difference variable:

t.test(UCDavis1$TV, UCDavis1$computer, paired = TRUE, conf.level = 0.95)
## 
##  Paired t-test
## 
## data:  UCDavis1$TV and UCDavis1$computer
## t = -4.5409, df = 170, p-value = 1.057e-05
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -7.647662 -3.013157
## sample estimates:
## mean difference 
##       -5.330409
t.test(UCDavis1$difference, conf.level = 0.95)
## 
##  One Sample t-test
## 
## data:  UCDavis1$difference
## t = -4.5409, df = 170, p-value = 1.057e-05
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -7.647662 -3.013157
## sample estimates:
## mean of x 
## -5.330409
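
The two calls above give identical results, and df = 170 implies 171 complete pairs even though the data frame has 173 rows. The sketch below illustrates both points on made-up vectors (`tv` and `computer` here are hypothetical and contain missing values on purpose).

```r
# Made-up paired measurements; NA marks a missing response
tv       <- c(13, 2, 20, NA, 8, 2.5)
computer <- c(10, 5, NA, 15, 20, 10)

d <- tv - computer            # NA whenever either measurement is missing
n_pairs <- sum(!is.na(d))     # number of complete pairs actually used

paired_res <- t.test(tv, computer, paired = TRUE)
diff_res   <- t.test(d)       # one-sample test on the differences

print(n_pairs)
print(paired_res$parameter)   # degrees of freedom = complete pairs - 1
```

Incomplete pairs are dropped by both calls, so the degrees of freedom reflect the number of complete pairs rather than the number of rows.
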
  1. What is the 95% confidence interval for the population mean difference of hours between TV and computer? Interpretation?

-> we are 95% confident that the population mean difference in hours (TV minus computer) is between -7.65 and -3.01 hours. Because the whole interval is below 0, students tend to spend more time on the computer than watching TV.

Paired Sample t-Test

Next, we will test whether there is any difference between the population mean time college students spend watching TV and the population mean time they spend on the computer.

  1. What is the Null hypothesis?

-> there is no difference between the population mean time of college students watching TV vs using their computer

  1. What is the Alternative hypothesis?

-> there is a difference between the population mean time of college students watching TV vs using their computer

  1. We have checked the conditions for the confidence interval, so we can carry out the paired sample t-test. What is the test statistic? What is the p-value for the test? (Take it out of scientific notation)

-> t = -4.54, p-value = 0.00001057 < 0.05
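
R can take a p-value out of scientific notation with `format()`, which avoids rounding a very small p-value down to 0. A short sketch using the p-value printed in the paired test output:

```r
# p-value as printed in the paired t-test output
p <- 1.057e-05

# scientific = FALSE forces fixed (decimal) notation
p_fixed <- format(p, scientific = FALSE)
print(p_fixed)

# The p-value is small but not exactly zero
print(p < 0.05)
```

With a stored test result, the same idea would be `format(test_results$p.value, scientific = FALSE)`.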

  1. After we get the p-value, what is your decision about the null hypothesis? And your conclusion?

-> we reject the null hypothesis. There is a difference between the population mean time of college students watching TV vs using their computer.

Situation 3: Estimating the difference between two populations with regard to the mean of a quantitative variable.

Example research question: How much difference is there in average TV watching hours between females and males?

str(UCDavis1)
## 'data.frame':    173 obs. of  4 variables:
##  $ Sex       : chr  "Female" "Female" "Male" "Male" ...
##  $ TV        : num  13 2 20 15 8 2.5 2 4 8 1 ...
##  $ computer  : num  10 5 7 15 20 10 14 28 10 15 ...
##  $ difference: num  3 -3 13 0 -12 -7.5 -12 -24 -2 -14 ...

Next, we will use the UCDavis1 dataset to estimate the difference in population mean time spent watching TV per week between male and female college students, using a confidence interval and a hypothesis test.

  1. What is the population parameter of interest?

-> the difference in population mean of hours watching TV, between male and female students

  1. What is the sample statistic of interest?

-> the difference in sample mean of hours watching TV, between male and female students

  1. Which variables from UCDavis1 dataset should we use to answer this research question?

-> TV and Sex.

Confidence interval to estimate the difference in population mean time spent watching TV per week between male and female college students

Step 1: Verify the assumptions:

  1. What is the sample size? Is it large enough? Are they independent? Can we proceed?

-> n1 = 94 (female) and n2 = 79 (male), which are both greater than 30. The two groups are independent because they contain different students, so we can proceed.
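
Group sizes can be counted in code instead of by hand. The sketch below uses made-up vectors (`sex` and `tv` are hypothetical stand-ins for the Sex and TV columns) to count rows per group and usable TV values per group.

```r
# Made-up group labels and TV hours; NA marks a missing response
sex <- c("Female", "Female", "Male", "Male", "Female")
tv  <- c(13, 2, 20, NA, 8)

# Rows in each group
rows_per_group <- table(sex)
print(rows_per_group)

# Non-missing TV values in each group
usable_per_group <- tapply(!is.na(tv), sex, sum)
print(usable_per_group)
```

With the real data, `table(UCDavis1$Sex)` would report the number of female and male students.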

Next, check the Shape of the Data. For this situation, we normally provide a side by side boxplot.

Here we used the boxplot() command to create side-by-side boxplots. However, since we are now dealing with two variables, the syntax has changed. The R syntax TV ~ Sex, data = UCDavis1 reads “Plot the TV variable against the Sex variable using the dataset UCDavis1.”

boxplot(TV ~ Sex, data = UCDavis1,
     xlab   = "Gender",
     ylab   = "Hours in watching TV",
     main   = "TV vs Gender",
     pch    = 20,
     cex    = 2,
     col    = "pink",
     border = "maroon")

  1. Any outlier? Describe the shape of the variable TV. Any violation of assumption?

-> we do have some outliers in both distributions, but the sample sizes are large enough, so there is no violation

Now, we can calculate the confidence interval with t.test, using the formula TV ~ Sex:

t.test(TV ~ Sex, data = UCDavis1, var.equal = TRUE, alternative = c("two.sided"), conf.level = 0.95)
## 
##  Two Sample t-test
## 
## data:  TV by Sex
## t = -0.70124, df = 171, p-value = 0.4841
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -4.253626  2.023623
## sample estimates:
## mean in group Female   mean in group Male 
##             8.372340             9.487342
  1. What is the 95% confidence interval for the population mean difference of hours between males and females watching TV ? Interpretation?

-> we are 95% confident that the difference in population mean hours watching TV (female minus male) is between -4.25 and 2.02 hours. Because the interval contains 0, a difference of zero is plausible.
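
The call above sets var.equal = TRUE, which gives the pooled two-sample test. Leaving it out gives the Welch test, R's default, which does not assume equal variances. The sketch below compares the two on a made-up data frame (`toy` is hypothetical) and shows that the estimated difference is the first group minus the second, in alphabetical order.

```r
# Made-up data: 5 female and 4 male students
toy <- data.frame(
  Sex = rep(c("Female", "Male"), times = c(5, 4)),
  TV  = c(13, 2, 8, 4, 6, 20, 15, 2.5, 10),
  stringsAsFactors = FALSE
)

pooled <- t.test(TV ~ Sex, data = toy, var.equal = TRUE)
welch  <- t.test(TV ~ Sex, data = toy)   # var.equal = FALSE is the default

# Difference estimated by the interval: Female mean minus Male mean
est_diff <- unname(pooled$estimate[1] - pooled$estimate[2])

print(c(pooled_df = unname(pooled$parameter), welch_df = unname(welch$parameter)))
print(est_diff)
```

The pooled test uses n1 + n2 - 2 degrees of freedom, while the Welch degrees of freedom are never larger than that.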

Two Sample t-Test

Next, we will test whether there is any difference between males and females in the population mean hours spent watching TV.

  1. What is the Null hypothesis?

-> there is no difference in the population means of TV consumption between males and females

  1. What is the Alternative hypothesis?

-> there is a difference in the population means of TV consumption between males and females

  1. We have checked the conditions for the confidence interval, so we can carry out the two sample t-test. What is the test statistic? What is the p-value for the test? (Take it out of scientific notation)

-> t= -0.70 and p value= 0.4841 > 0.05

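  1. After we get the p-value, what is your decision about the null hypothesis? And your conclusion?

-> we fail to reject the null hypothesis. There is not enough evidence to conclude that the population mean TV hours differ between males and females.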
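
The decision rule used in all three situations is the same: reject the null hypothesis when the p-value is below the significance level. The helper `decide` below is a hypothetical function written for this sketch, applied to the three two-sided p-values reported in this practice with alpha = 0.05.

```r
# Hypothetical helper: compare a p-value with the significance level
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

# Two-sided p-values copied from the three t.test outputs above
p_values <- c(one_sample = 0.1591, paired = 1.057e-05, two_sample = 0.4841)

decisions <- vapply(p_values, decide, character(1))
print(decisions)
```

Failing to reject the null hypothesis means the data do not provide enough evidence against it, not that the null hypothesis has been shown to be true.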

Change the author name in the YAML to your name, save your file, knit it to a PDF, rename the PDF to ‘Filename_yourname’, and then submit it to the D2L assignment folder.