We will read the dataset UCDavis1 from the book, which includes weekly TV hours,
weekly computer hours, and gender.
Type your answers after the -> sign.
UCDavis1 <- read.csv("https://raw.githubusercontent.com/xzhang47/3125/main/UCDavis1.csv")
colnames(UCDavis1)[1]<- "Sex"
str(UCDavis1)
## 'data.frame': 173 obs. of 3 variables:
## $ Sex : chr "Female" "Female" "Male" "Male" ...
## $ TV : num 13 2 20 15 8 2.5 2 4 8 1 ...
## $ computer: num 10 5 7 15 20 10 14 28 10 15 ...
-> chr, or categorical nominal
-> num, or quantitative continuous
-> num, or quantitative continuous
We will go over the three situations from the lecture to make inferences.
Example research question: What is the mean time that college students spend watching TV per week?
Next, we will use the UCDavis1 dataset to estimate the
population mean time that college students spend watching TV per week with
a confidence interval and a hypothesis test.
-> College students
-> the population mean time that college students spend watching TV per week
-> the sample mean time that college students spend watching TV per week
Which variable in the UCDavis1 dataset should we use to
answer the research question? -> TV
Step 1: Verify the assumptions:
-> n= 173 > 30, therefore we can proceed
Next, check the Shape of the Data
We can use a histogram or a boxplot to examine the quantitative variable and check the conditions required for a confidence interval or hypothesis test for a population mean.
boxplot(UCDavis1$TV,
xlab = "",
ylab = "Watching TV Time",
main = "TV",
pch = 20,
cex = 2,
col = "hotpink",
border = "lightblue")
-> yes, we have a few outliers
#Get a histogram
#TYPE UCDavis1$TV (SAME AS IN boxplot FUNCTION) AFTER hist(
hist(UCDavis1$TV,
xlab = "Watching TV Time",
ylab = "Frequency",
main = "Histogram for TV Watching Time among College students from UCDavis",
pch = 20,
cex = 2,
col = "hotpink",
border = "darkred")
-> TV is skewed right. We do not have a normal distribution, but the sample size is large enough.
Now we can calculate the confidence interval.
The following code gives a 95% confidence interval to estimate the population mean time that college students spend watching TV per week.
We will use the function t.test to get the confidence interval. The same function will be used for inference throughout this practice.
test_results = t.test(x = UCDavis1$TV, mu = 10,
alternative = c("two.sided"), conf.level = 0.95)
test_results
##
## One Sample t-test
##
## data: UCDavis1$TV
## t = -1.4143, df = 172, p-value = 0.1591
## alternative hypothesis: true mean is not equal to 10
## 95 percent confidence interval:
## 7.320477 10.442529
## sample estimates:
## mean of x
## 8.881503
-> conf.level
-> We are 95% confident that the population mean time that college students spend watching TV per week is between 7.32 and 10.44 hours.
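The interval reported by t.test can be reproduced by hand as the sample mean plus or minus a t critical value times the standard error. The sketch below is only an illustration: it uses the first ten TV values displayed by str() rather than the full dataset, and the names `tv10`, `manual_ci`, and `builtin_ci` are made up for this example.

```r
# First ten TV values as shown in the str() output (a small subset,
# used only to illustrate the formula, not to answer the research question)
tv10 <- c(13, 2, 20, 15, 8, 2.5, 2, 4, 8, 1)

n    <- length(tv10)
xbar <- mean(tv10)
se   <- sd(tv10) / sqrt(n)          # standard error of the mean

conf_level <- 0.95
t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)   # critical value

# Confidence interval by hand: xbar +/- t* x SE
manual_ci <- c(xbar - t_star * se, xbar + t_star * se)

# The same interval as computed by t.test
builtin_ci <- as.numeric(t.test(tv10, conf.level = conf_level)$conf.int)

print(manual_ci)
print(builtin_ci)
```

Both printed intervals should agree, which shows that conf.level only changes the critical value used in the margin of error.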
#TYPE 0.9 AFTER THE LAST EQUAL SIGN
t.test(x = UCDavis1$TV, mu = 10,
alternative = c("two.sided"), conf.level =0.9 )
##
## One Sample t-test
##
## data: UCDavis1$TV
## t = -1.4143, df = 172, p-value = 0.1591
## alternative hypothesis: true mean is not equal to 10
## 90 percent confidence interval:
## 7.573622 10.189384
## sample estimates:
## mean of x
## 8.881503
-> 7.57 to 10.19, which is narrower than the 95% interval
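To see why the 90% interval is narrower, note that only the critical value changes; the sample mean and standard error stay the same. The following sketch recovers the standard error from the 95% output and rebuilds the 90% interval. The numbers are copied from the t.test output; the variable names are assumptions made for this example.

```r
# Values copied from the 95% t.test output
xbar <- 8.881503
ci95 <- c(7.320477, 10.442529)
dof  <- 172

# Standard error implied by the 95% margin of error
margin95 <- (ci95[2] - ci95[1]) / 2
se <- margin95 / qt(0.975, df = dof)

# A 90% interval uses a smaller critical value, so the margin shrinks
margin90 <- qt(0.95, df = dof) * se
ci90 <- c(xbar - margin90, xbar + margin90)

print(round(ci90, 2))
print(margin90 < margin95)
```

The result should land very close to the 90% interval that t.test reported.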
Next, we will test a hypothesis to see whether the population mean time of college students watching TV per week is 10 hours or not.
-> the population mean time of college students watching TV per week is equal to 10
-> the population mean time of college students watching TV per week is not equal to 10
test_results = t.test(x = UCDavis1$TV, mu = 10,
alternative = c("two.sided"), conf.level = 0.95)
test_results
##
## One Sample t-test
##
## data: UCDavis1$TV
## t = -1.4143, df = 172, p-value = 0.1591
## alternative hypothesis: true mean is not equal to 10
## 95 percent confidence interval:
## 7.320477 10.442529
## sample estimates:
## mean of x
## 8.881503
-> t = -1.4143, and the p-value = 0.1591
-> Since 0.1591 > 0.05, we fail to reject the null hypothesis. We do not have sufficient evidence that the population mean time of college students watching TV per week differs from 10 hours.
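The two-sided p-value is the probability, under the null hypothesis, of a t statistic at least as far from zero as the one observed. As a rough check, it can be recomputed from the reported t and df with pt(). The values below are copied from the output; `alpha = 0.05` is the significance level assumed here.

```r
# Test statistic and degrees of freedom copied from the t.test output
t_stat <- -1.4143
dof    <- 172

# Two-sided p-value: both tails beyond |t|
p_two_sided <- 2 * pt(-abs(t_stat), df = dof)

alpha  <- 0.05                     # assumed significance level
reject <- p_two_sided < alpha      # TRUE would mean "reject H0"

print(round(p_two_sided, 4))
print(reject)
```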
#UPDATE THE FOLLOWING CODE BY TYPING less WITHIN THE QUOTATION MARKS AND TYPING 11 AFTER mu =
t.test(x = UCDavis1$TV, mu =11 ,
alternative = c("less"), conf.level = 0.95)
##
## One Sample t-test
##
## data: UCDavis1$TV
## t = -2.6788, df = 172, p-value = 0.004053
## alternative hypothesis: true mean is less than 11
## 95 percent confidence interval:
## -Inf 10.18938
## sample estimates:
## mean of x
## 8.881503
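For a "less" alternative, the p-value is the lower-tail probability only. The sketch below rebuilds the one-sided result from numbers already printed in the earlier outputs: the standard error is recovered from the first test (mu = 10) and then reused with mu = 11. Variable names are assumptions for this example.

```r
# Sample mean and first test statistic copied from the earlier output
xbar   <- 8.881503
t_mu10 <- -1.4143
dof    <- 172

# Standard error implied by the mu = 10 test: t = (xbar - 10) / SE
se <- (xbar - 10) / t_mu10

# Test statistic for H0: mu = 11
t_less <- (xbar - 11) / se

# One-sided p-value for Ha: mu < 11 (lower tail only)
p_less <- pt(t_less, df = dof)

print(round(t_less, 4))
print(signif(p_less, 4))
```

Because the p-value is well below 0.05, this one-sided test would reject the null hypothesis that the mean is 11 in favor of a mean below 11.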
Example research question: What is the population mean difference between the time college students spend watching TV and the time they spend on the computer?
To conduct a paired-samples test, we need either two vectors of data,
\(y_1\) and \(y_2\), or we need one vector of data with a
second that serves as a binary grouping variable. The test is then run
using the syntax t.test(y1, y2, paired=TRUE).
# Create a new variable difference first
UCDavis1$difference <- UCDavis1$TV-UCDavis1$computer
str(UCDavis1)
## 'data.frame': 173 obs. of 4 variables:
## $ Sex : chr "Female" "Female" "Male" "Male" ...
## $ TV : num 13 2 20 15 8 2.5 2 4 8 1 ...
## $ computer : num 10 5 7 15 20 10 14 28 10 15 ...
## $ difference: num 3 -3 13 0 -12 -7.5 -12 -24 -2 -14 ...
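A difference can only be computed for rows where both values are present, so the number of usable pairs may be smaller than the number of rows. In the paired output further down, df = 170 corresponds to 171 usable pairs rather than 173. The following toy example, with hypothetical values and a hypothetical data frame named `toy`, shows one way to count the usable pairs.

```r
# Toy data frame with one missing value in each column (hypothetical values)
toy <- data.frame(TV       = c(13, 2, NA, 15, 8),
                  computer = c(10, 5, 7, NA, 20))

# Subtraction returns NA whenever either value is missing
toy$difference <- toy$TV - toy$computer

# Two equivalent ways to count the pairs that can be used
n_pairs    <- sum(!is.na(toy$difference))
n_complete <- sum(complete.cases(toy[, c("TV", "computer")]))

print(nrow(toy))
print(n_pairs)
print(n_complete)
```

Running the same two counts on UCDavis1 would give the sample size to use when checking the n > 30 condition for the paired analysis.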
Next, we will use the UCDavis1 dataset to estimate the
population mean difference between the time college students spend watching TV and
using a computer per week with a confidence interval and a hypothesis test.
-> the population mean time difference between college students watching TV and using a computer
-> the sample mean time difference between college students watching TV and using a computer
Which variable in the UCDavis1 dataset should we use to
answer this research question? -> difference (TV - computer)
Step 1: Verify the assumptions:
-> n = 171 complete pairs (df = 170 in the paired t-test output) > 30, therefore we can proceed
Next, check the Shape of the Data
We can use a histogram or a boxplot to examine the quantitative variable and check the conditions required for a confidence interval or hypothesis test.
boxplot(UCDavis1$difference,
xlab = "",
ylab = "Watching TV Time - using computer",
main = "TV vs computer",
pch = 20,
cex = 2,
col = "hotpink",
border = "lightblue")
#Get a histogram
#TYPE UCDavis1$difference AFTER hist(
hist(UCDavis1$difference,
xlab = "Watching TV Time - using computer",
ylab = "Frequency",
main = "TV vs computer",
pch = 20,
cex = 2,
col = "hotpink",
border = "dodgerblue")
-> The distribution has some upper and lower outliers, but the sample is large enough. There is no violation.
Now we can calculate the confidence interval with t.test, either from the two paired vectors or from the difference variable. Both calls give the same result.
t.test(UCDavis1$TV, UCDavis1$computer, paired = TRUE, conf.level = 0.95)
##
## Paired t-test
##
## data: UCDavis1$TV and UCDavis1$computer
## t = -4.5409, df = 170, p-value = 1.057e-05
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -7.647662 -3.013157
## sample estimates:
## mean difference
## -5.330409
t.test(UCDavis1$difference, conf.level = 0.95)
##
## One Sample t-test
##
## data: UCDavis1$difference
## t = -4.5409, df = 170, p-value = 1.057e-05
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -7.647662 -3.013157
## sample estimates:
## mean of x
## -5.330409
-> We are 95% confident that the population mean difference in hours (TV minus computer) is between -7.65 and -3.01 hours. Because the whole interval is negative, students tend to spend more time on the computer.
Next, we will test a hypothesis to see if there is any difference between the population mean time college students spend watching TV and using a computer.
-> there is no difference between the population mean time of college students watching TV vs using their computer
-> there is a difference between the population mean time of college students watching TV vs using their computer
-> t = -4.54, p-value = 1.057e-05 (about 0.00001) < 0.05
-> we reject the null hypothesis. There is a difference between the population mean time of college students watching TV vs using their computer.
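The p-value in the paired output is printed in scientific notation, 1.057e-05, which is very small but not exactly zero. As a quick check, it can be recomputed from the reported t and df and displayed with format.pval(). The values are copied from the output; the variable names are assumptions for this example.

```r
# Test statistic and degrees of freedom copied from the paired t-test output
t_stat <- -4.5409
dof    <- 170

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = dof)

# Display options: scientific notation or a fixed number of digits
print(format.pval(p_value, digits = 4))
print(format(p_value, scientific = FALSE, digits = 4))

# Decision at the 0.05 level
print(p_value < 0.05)
```

When reporting, writing "p < 0.001" is a common alternative to quoting the tiny value in full.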
Example research question: How much difference is there in average TV watching hours between females and males?
str(UCDavis1)
## 'data.frame': 173 obs. of 4 variables:
## $ Sex : chr "Female" "Female" "Male" "Male" ...
## $ TV : num 13 2 20 15 8 2.5 2 4 8 1 ...
## $ computer : num 10 5 7 15 20 10 14 28 10 15 ...
## $ difference: num 3 -3 13 0 -12 -7.5 -12 -24 -2 -14 ...
Next, we will use the UCDavis1 dataset to estimate the
difference in population mean time that male and female college students
spend watching TV per week with a confidence interval and a hypothesis test.
-> the difference in population mean hours watching TV between male and female students
-> the difference in sample mean hours watching TV between male and female students
Which variables in the UCDavis1 dataset should we use to
answer this research question? -> TV and Sex
Step 1: Verify the assumptions:
-> n1 = 90, which is greater than 30, and n2 = 83, which is greater than 30. We can proceed.
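Group sizes can be read off with table(). The sketch below uses a small hypothetical data frame named `toy` (not the UCDavis1 data) and also counts only the rows where TV is not missing, since those are the observations a t-test can actually use.

```r
# Toy data with one missing TV value (hypothetical values)
toy <- data.frame(
  Sex = c("Female", "Female", "Male", "Male", "Female"),
  TV  = c(13, 2, 20, 15, NA)
)

# Number of rows per group
n_all <- table(toy$Sex)

# Number of rows per group with a non-missing TV value
n_used <- table(toy$Sex[!is.na(toy$TV)])

print(n_all)
print(n_used)
```

The same two calls with UCDavis1 in place of `toy` would give n1 and n2 for the sample size check.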
Next, check the Shape of the Data. For this situation, we normally provide a side by side boxplot.
Here we used the boxplot() command to create
side-by-side boxplots. However, since we are now dealing with two
variables, the syntax has changed. The R syntax
TV ~ Sex, data = UCDavis1 reads “Plot the TV
variable against the Sex variable using the dataset
UCDavis1.”
boxplot(TV ~ Sex, data = UCDavis1,
xlab = "Gender",
ylab = "Hours in watching TV",
main = "TV vs Gender",
pch = 20,
cex = 2,
col = "pink",
border = "maroon")
-> We do have some outliers in both distributions, but the sample sizes are large enough, so there is no violation.
Now we can calculate the confidence interval with t.test, using the formula TV ~ Sex to compare the two groups.
t.test(TV ~ Sex, data = UCDavis1, var.equal = TRUE, alternative = c("two.sided"), conf.level = 0.95)
##
## Two Sample t-test
##
## data: TV by Sex
## t = -0.70124, df = 171, p-value = 0.4841
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -4.253626 2.023623
## sample estimates:
## mean in group Female mean in group Male
## 8.372340 9.487342
-> We are 95% confident that the difference in population mean hours watching TV (female minus male) is between -4.25 and 2.02 hours. Because the interval contains 0, a difference of zero is plausible.
Next, we will test a hypothesis to see if there is any difference between the population mean hours that males and females spend watching TV.
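With var.equal = TRUE, t.test pools the two sample variances and uses n1 + n2 - 2 degrees of freedom. The difference is taken as first group minus second group, which with the formula TV ~ Sex means Female minus Male. The sketch below reproduces the pooled interval by hand on two small hypothetical vectors (not the UCDavis1 data) so the formula can be compared with the built-in result.

```r
# Hypothetical weekly TV hours for two small groups (illustrative values only)
female <- c(13, 2, 8, 2.5, 4)
male   <- c(20, 15, 2, 8, 1, 6)

n1 <- length(female)
n2 <- length(male)

# Pooled variance and standard error of the difference in means
sp2 <- ((n1 - 1) * var(female) + (n2 - 1) * var(male)) / (n1 + n2 - 2)
se  <- sqrt(sp2 * (1 / n1 + 1 / n2))

# Difference is first group minus second group
diff_means <- mean(female) - mean(male)
t_star     <- qt(0.975, df = n1 + n2 - 2)
manual_ci  <- c(diff_means - t_star * se, diff_means + t_star * se)

# Built-in pooled two-sample t-test for comparison
builtin <- t.test(female, male, var.equal = TRUE, conf.level = 0.95)

print(manual_ci)
print(as.numeric(builtin$conf.int))
```

Leaving out var.equal = TRUE would instead run the Welch test, which does not pool the variances and usually reports a non-integer df.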
-> there is no difference in the population means of TV consumption between males and females
-> there is a difference in the population means of TV consumption between males and females
-> t= -0.70 and p value= 0.4841 > 0.05
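-> We fail to reject the null hypothesis. We do not have sufficient evidence of a difference in the population mean TV hours between males and females.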