4. Use the built-in R dataset women. Using the cor.test() function, find the Pearson correlation between the height and weight variables. Interpret your output.
data(women)  # load the dataset
cor.test(formula = ~ height + weight, data = women, method = "pearson")
Pearson's product-moment correlation
data: height and weight
t = 37.855, df = 13, p-value = 1.091e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9860970 0.9985447
sample estimates:
cor
0.9954948
The height and weight variables have a strong, positive correlation with each other (r ≈ 0.995), and the very small p-value indicates that this correlation is statistically significant.
5. Use the built-in R dataset presidents.
How can you display the first few lines of this dataset in R?
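One way to show the first few observations is head(); a minimal sketch:
data(presidents)   # quarterly presidential approval ratings time series
head(presidents)   # first few values of the series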
cor.test(formula = ~ Verbs + Adjectives, data = data1, method = "pearson")
Pearson's product-moment correlation
data: Verbs and Adjectives
t = -1.9414, df = 8, p-value = 0.08816
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.88145836 0.09899906
sample estimates:
cor
-0.5659012
cor.test(formula = ~ Verbs + Adjectives, data = data1, method = "spearman")
Warning in cor.test.default(x = mf[[1L]], y = mf[[2L]], ...): Cannot compute
exact p-value with ties
Spearman's rank correlation rho
data: Verbs and Adjectives
S = 250.26, p-value = 0.1262
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.5167197
Tutorial 4
3. Write an R program to create a list containing strings, numbers, vectors, and logical values.
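A minimal sketch of such a list (the element names here are only illustrative):
my_list <- list(strings = c("red", "green"),   # character strings
                numbers = c(3.5, 7, 11),       # numbers
                vec     = 1:5,                 # a vector
                flag    = TRUE)                # a logical value
str(my_list)                                   # inspect the structure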
4. Write an R program to create a list containing a vector, a matrix, and a list, and give names to the elements in the list.
# create and name the elements of the list
my_list <- list(vector = c(1, 2, 3),
                matrix = matrix(1:9, nrow = 3),
                list   = c(string = "abcd", number = 66))
names(my_list) <- c("my_vec", "my_mat", "my_lst")
5. Given x = c(30, NULL, -2, 13, NA, 23, 43), write an R program to find the sum, mean, and product of the vector, ignoring elements such as NA or NaN.
# define x (NULL is dropped automatically when the vector is created)
x <- c(30, NULL, -2, 13, NA, 23, 43)

# calculate sum, mean, and product, ignoring NA/NaN values
calc <- function(x) {
  sum     <- sum(x, na.rm = TRUE)
  mean    <- mean(x, na.rm = TRUE)
  product <- prod(x, na.rm = TRUE)
  cat("sum of x =", x, "is", sum,
      "\nmean of x =", x, "is", mean,
      "\nproduct of x =", x, "is", product)
}
calc(x)
sum of x = 30 -2 13 NA 23 43 is 107
mean of x = 30 -2 13 NA 23 43 is 21.4
product of x = 30 -2 13 NA 23 43 is -771420
6. Develop an R program to produce the matrix given below.
1. Write an R program to load the (built-in) dataset airquality. Check whether it is a data frame or not. Order the entire data frame by the first and second columns.
data(airquality)   # load the dataset
class(airquality)  # check the class
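The ordering step can be done with order(), sorting by the first column (Ozone) and then the second (Solar.R); a sketch:
# order rows by the first column, then by the second (NA values are placed last)
airquality_ordered <- airquality[order(airquality[, 1], airquality[, 2]), ]
head(airquality_ordered)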
Replace the 11th value in your random-number vector with NA and calculate the same summary statistics again; this time you may use summary() or apply().
x1 x2
Min. :-3.23671 Min. :0.003054
1st Qu.:-0.67880 1st Qu.:0.257628
Median : 0.01180 Median :0.504780
Mean :-0.05862 Mean :0.501471
3rd Qu.: 0.65865 3rd Qu.:0.766102
Max. : 2.32689 Max. :0.970136
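For the NA-replacement step above, a minimal sketch, assuming the random numbers are stored in a data frame named data with columns x1 and x2:
data$x1[11] <- NA                    # replace the 11th value with NA
summary(data)                        # the NA count is now reported for x1
apply(data, 2, mean, na.rm = TRUE)   # or recompute the means ignoring NA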
Compute the min, max, and mean by using the apply() function.
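A sketch using apply() over both columns (na.rm = TRUE guards against the NA introduced above):
apply(data[c("x1", "x2")], 2, min,  na.rm = TRUE)
apply(data[c("x1", "x2")], 2, max,  na.rm = TRUE)
apply(data[c("x1", "x2")], 2, mean, na.rm = TRUE)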
Use the sum function to get the sum of all values in each data frame column.
apply(data[c("x1","x2")], 2, sum)
x1 x2
-5.862404 50.147110
3. Type the dataset called cars in your R console. (It is already installed in the datasets package; find out the contents of the datasets library by using the help function: library(help = "datasets").)
Compute the min, max, mean and the range for each column of this data set by using the apply() function.
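A sketch of the column-wise statistics with apply():
apply(cars, 2, min)     # minimum of speed and dist
apply(cars, 2, max)     # maximum of speed and dist
apply(cars, 2, mean)    # mean of speed and dist
apply(cars, 2, range)   # range (min and max) of each column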
Now assign your data set cars to another variable called cars2. Then add a row at the bottom of this new data set cars2 showing the column means. (Use rbind() together with apply().)
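A sketch combining apply() and rbind():
cars2 <- cars                        # copy the dataset
col_means <- apply(cars2, 2, mean)   # column means of speed and dist
cars2 <- rbind(cars2, col_means)     # append the means as the last row
tail(cars2)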
simple_model <- lm(sales ~ youtube, data = marketing)
summary(simple_model)
Call:
lm(formula = sales ~ youtube, data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.0632 -2.3454 -0.2295 2.4805 8.6548
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.439112 0.549412 15.36 <2e-16 ***
youtube 0.047537 0.002691 17.67 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.91 on 198 degrees of freedom
Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
Residual analysis of the linear model
par(mfrow = c(2, 2))
plot(simple_model)
Plotting the regression line
par(mfrow = c(1, 1))
plot(marketing$youtube, marketing$sales,
     main = "YouTube vs Sales with Regression Line",
     xlab = "YouTube", ylab = "Sales")
abline(simple_model, col = "blue")
Alternative code: Using ggplot2 package
library(ggplot2)   # load ggplot2 if not already attached
ggplot(marketing, aes(x = youtube, y = sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "YouTube vs Sales with Regression Line",
       x = "YouTube", y = "Sales")
`geom_smooth()` using formula = 'y ~ x'
Constructing a multiple linear regression model
lm_model_1 <- lm(sales ~ youtube, data = marketing)
lm_model_2 <- lm(sales ~ facebook, data = marketing)
lm_model_3 <- lm(sales ~ newspaper, data = marketing)
multiple_model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(multiple_model)
Call:
lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.5932 -1.0690 0.2902 1.4272 3.3951
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.526667 0.374290 9.422 <2e-16 ***
youtube 0.045765 0.001395 32.809 <2e-16 ***
facebook 0.188530 0.008611 21.893 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.023 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
Residual analysis of multiple model
par(mfrow = c(2, 2))
plot(lm_model_1)
par(mfrow = c(2, 2))
plot(lm_model_2)
par(mfrow = c(2, 2))
plot(lm_model_3)
Plotting sales with respect to each independent variable in marketing
par(mfrow = c(1, 3))
plot(marketing$youtube, marketing$sales, xlab = "youtube", ylab = "sales")
abline(lm(marketing$sales ~ marketing$youtube), col = "red")
plot(marketing$facebook, marketing$sales, xlab = "facebook", ylab = "sales")
abline(lm(marketing$sales ~ marketing$facebook), col = "blue")
plot(marketing$newspaper, marketing$sales, xlab = "newspaper", ylab = "sales")
abline(lm(marketing$sales ~ marketing$newspaper), col = "green")
Tutorial 7
Build a multiple linear regression model for estimating sales based on the advertising budget invested in youtube, facebook, and newspaper. Interpret your output.
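The model is fitted with the same lm() call used in the previous section:
multiple_model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(multiple_model)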
Call:
lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)
Residuals:
Min 1Q Median 3Q Max
-10.5932 -1.0690 0.2902 1.4272 3.3951
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.526667 0.374290 9.422 <2e-16 ***
youtube 0.045765 0.001395 32.809 <2e-16 ***
facebook 0.188530 0.008611 21.893 <2e-16 ***
newspaper -0.001037 0.005871 -0.177 0.86
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.023 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
What is the multiple linear regression model? \[
\text{sales} = 3.527 + 0.046(\text{youtube}) + 0.189(\text{facebook}) - 0.001(\text{newspaper})
\] Are there any confounding variables? Perform the VIF method and explain your output.
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
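The VIF values for the fitted model can then be computed with car::vif() (a sketch; the numeric output is not reproduced here):
vif(multiple_model)   # values near 1 indicate little multicollinearity among the predictors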
Perform a two-sample t-test between newspaper and facebook.
var.test(marketing$newspaper, marketing$facebook)
F test to compare two variances
data: marketing$newspaper and marketing$facebook
F = 2.1518, num df = 199, denom df = 199, p-value = 9.593e-08
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.628426 2.843289
sample estimates:
ratio of variances
2.151763
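The Welch test reported below can be obtained with t.test(), whose var.equal argument defaults to FALSE (a sketch):
t.test(marketing$newspaper, marketing$facebook)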
Welch Two Sample t-test
data: marketing$newspaper and marketing$facebook
t = 3.9114, df = 351.11, p-value = 0.0001102
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
4.349305 13.146695
sample estimates:
mean of x mean of y
36.6648 27.9168
Since the F test shows the variances are unequal, Welch's t-test is used.
Stating the null and alternative hypotheses:
Let \(\mu_{auto}\) and \(\mu_{manual}\) be the means of miles per gallon with respect to automatic and manual modes of transmission. Then, the null and alternative hypotheses, denoted \(H_{0}\) and \(H_{\alpha}\) are: \[
\begin{split}
H_{0}: \mu_{\text{auto}} - \mu_{\text{manual}} = 0 \\
H_{\alpha}: \mu_{\text{auto}} - \mu_{\text{manual}} \neq 0
\end{split}
\]
Plotting the boxplot of mpg w.r.t. mode of transmission
cars$am <- as.factor(ifelse(cars$am == 1, "automatic", "manual"))
ggplot(cars, aes(x = am, y = mpg, fill = am, shape = am)) +
  geom_boxplot() +
  geom_jitter() +
  labs(x = "transmission", y = "miles per gallon") +
  ggtitle("miles per gallon w.r.t transmission")
Conducting a t-Test
result <- t.test(mpg ~ am, data = cars)
cat("t =", round(result$statistic, 4),
    ", df =", round(result$parameter, 3),
    ", p =", format(result$p.value, digits = 4), "\n")
cat("95% confidence interval:", round(result$conf.int, 3), "\n")
cat("Sample estimates:\n")
cat("Mean in group automatic: ", round(result$estimate[1], 2), "\n")
cat("Mean in group manual: ", round(result$estimate[2], 2), "\n")
cat("Alternative hypothesis: mean difference not equal to 0\n")
t = 3.7671 , df = 18.332 , p = 0.001374
95% confidence interval: 3.21 11.28
Sample estimates:
Mean in group automatic: 24.39
Mean in group manual: 17.15
Alternative hypothesis: mean difference not equal to 0
Exercises, 5 January 2024
The dataset starbucks in the openintro package contains nutritional information on 77 Starbucks food items. Spend some time reading the help file of this dataset. For this problem, you will explore the relationship between the calories and carbohydrate grams in these items.
library(openintro)
Loading required package: airports
Loading required package: cherryblossom
Loading required package: usdata
Attaching package: 'openintro'
The following objects are masked from 'package:MASS':
housing, mammals
The following object is masked from 'package:car':
densityPlot
data(starbucks)
?starbucks
a) Create a scatterplot of this data with calories on the x-axis and carbohydrate grams on the y-axis, and describe the relationship you see.
plot(carb ~ calories, data = starbucks)
The relationship is positive and roughly monotonic, although the carbohydrate content becomes more variable as the calorie content increases.
b) In the scatterplot you made, what is the explanatory variable? What is the response variable? Why might you want to construct the problem in this way?
explanatory variable: calories
response variable: carb. We might construct the problem this way because we are curious about the relationship between carbohydrate content and calories, i.e. whether an increase in calories leads to an increase in carbs.
c) Fit a simple linear regression to this data, with carbohydrate grams as the dependent variable and the calories as the explanatory variable. Use the lm() function.
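A sketch of the fit (the object name carb_model is illustrative):
carb_model <- lm(carb ~ calories, data = starbucks)
summary(carb_model)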
The adjusted \(R\)-squared is a more accurate measure of the goodness-of-fit of the regression model than the raw \(R\)-squared. Here the \(R\)-squared and adjusted \(R\)-squared values are 0.4556237 and 0.4483653 respectively, so they are very similar. This means that only about 45.6% of the variation in carbohydrate content is explained by the calorie content.
f) Create a residual plot. Describe what you see in the residual plot. Does the model look like a good fit?
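A sketch of the residual plot, using the carb_model object from part c):
plot(carb_model$fitted.values, carb_model$residuals,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)   # reference line at zero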
Rows: 435 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): bday, enrolldate, expgradate, Major, State
dbl (17): ids, Rank, Gender, Athlete, Height, Weight, Smoking, Sprint, Engl...
time (1): MileMinDur
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Create a contingency table w.r.t smoking preference and gender
Let \(H_{0}\) and \(H_{\alpha}\) be the null and alternative hypotheses respectively. Then, \[
H_{0}: \text{smoking preference is independent of gender} \\
H_{\alpha}: \text{smoking preference is not independent of gender}
\]
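A sketch of the contingency table and the corresponding chi-square test of independence; the data frame name students is an assumption for the CSV imported above, and Smoking and Gender are the coded columns shown in the column specification:
smoke_gender <- table(students$Smoking, students$Gender)   # hypothetical data frame name
smoke_gender
chisq.test(smoke_gender)   # test of independence between smoking preference and gender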
Welch Two Sample t-test
data: weight by height_type
t = -6.8072, df = 394.39, p-value = 3.719e-11
alternative hypothesis: true difference in means between group shorter and group taller is not equal to 0
95 percent confidence interval:
-32.29308 -17.81983
sample estimates:
mean in group shorter mean in group taller
168.1862 193.2426
Create a boxplot
boxplot(weight ~ height_type, data = hw,
        main = "Weight Comparison between Taller and Shorter Individuals",
        xlab = "Height Type", ylab = "Weight")