Tutorial 3

4. Use the built-in R dataset `women`. By using the `cor.test` function, find the correlation between `height` and `weight` variables using Pearson Correlation. Interpret your output.

data(women) # Load the dataset
cor.test(formula = ~ height + weight, data = women, method = "pearson")


    Pearson's product-moment correlation

data:  height and weight
t = 37.855, df = 13, p-value = 1.091e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9860970 0.9985447
sample estimates:
      cor 
0.9954948

The height and weight variables has a strong, positive correlation with each other.

5. Use the built-in R dataset `presidents`.

How can you display in R for the first few lines of this dataset

library(zoo)
library(dplyr)
data(presidents)
head(presidents)

[1] NA 87 82 75 63 50

Are there any missing values in this dataset?
Can you identify or locate the position of this missing values?

which(is.na(presidents))

[1]   1  15  16  31 111 112

Impute the missing values of the third column , i.e. Qtr3 with the mean value of this entire column.

Qtr3 <- presidents[seq(3, length(presidents), by = 4)]
Qtr3[is.na(Qtr3)] <-  mean(Qtr3, na.rm = T) 
presidents[seq(3, length(presidents), by = 4)] <- Qtr3

Impute the missing values of the fourth column, by using the na.approx function and the mutate function.

Qtr4 <- presidents[seq(4, length(presidents), by = 4)] 
Qtr4_df <- data.frame(og_col = Qtr4)
Qtr4_df <- Qtr4_df %>% 
  mutate(imputed_col = na.approx(og_col))
presidents[seq(4, length(presidents), by = 4)] <- Qtr4_df$imputed_col

6. Given

            [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
verbs      169.9 135.0 161.7 183.0 163.1 190.8 140.7 213.9 218.0 165.2
adjectives  96.0 102.6  91.9  76.5  98.8  77.6  68.4  60.3  74.4  76.5

Compute the Pearson’s and Spearman’s correlations between verbs and adjectives. Interprete your output.

data1 <- data.frame(Verbs = verbs, Adjectives = adjectives)
data1 %>% summarize(MeanVerbs = mean(Verbs), MedianVerbs = median(Verbs),
                    MeanAdj. = mean(Adjectives), MedianAdj. = median(Adjectives))

  MeanVerbs MedianVerbs MeanAdj. MedianAdj.
1    174.13      167.55     82.3      77.05

cor.test(formula = ~ Verbs + Adjectives, data = data1, method = "pearson")


    Pearson's product-moment correlation

data:  Verbs and Adjectives
t = -1.9414, df = 8, p-value = 0.08816
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.88145836  0.09899906
sample estimates:
       cor 
-0.5659012

cor.test(formula = ~ Verbs + Adjectives, data = data1, method = "spearman")

Warning in cor.test.default(x = mf[[1L]], y = mf[[2L]], ...): Cannot compute
exact p-value with ties


    Spearman's rank correlation rho

data:  Verbs and Adjectives
S = 250.26, p-value = 0.1262
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.5167197

Tutorial 4

3. Write a R program to create a list containing strings, numbers, vectors and a logical values.

# list strings, numbers, vectors, and logical values
strings <- "Hello World"
numbers <- 42
strings_vector <- c("a", "b", "c", "d")
numbers_vector <- c(1.1, 2.2, 3.3, 4.4) 
logical <- c(TRUE, FALSE, TRUE, FALSE)
list(strings, numbers, strings_vector, numbers_vector, logical)

[[1]]
[1] "Hello World"

[[2]]
[1] 42

[[3]]
[1] "a" "b" "c" "d"

[[4]]
[1] 1.1 2.2 3.3 4.4

[[5]]
[1]  TRUE FALSE  TRUE FALSE

4. Write a R program to list containing a vector, a matrix and a list and give names to the elements in the list.

# create and name elements of list
my_list <- list(vector = c(1, 2, 3), matrix = matrix(1:9, nrow = 3),
                list = c(string = "abcd", number = 66))
names(my_list) <- c("my_vec", "my_mat", "my_lst")

5. Given `x = c(30, NULL, -2, 13, NA, 23, 43)`. Write a R program to find Sum, Mean and Product of a Vector, ignore element like `NA` or `NaN`.

# define x
x <- c(30, NULL, -2, 13, NA, 23, 43)
# calculate sum, mean, and product
calc <- function(n){
  sum <- sum(x, na.rm = T)
  mean <- mean(x, na.rm = T)
  product<- prod(x, na.rm = T)
  cat("sum of x = ", x, "is", sum,
      "\nmean of x =", x, "is", mean,
      "\nproduct of x =", x, "is", product )
}
calc(x)

sum of x =  30 -2 13 NA 23 43 is 107 
mean of x = 30 -2 13 NA 23 43 is 21.4 
product of x = 30 -2 13 NA 23 43 is -771420

6. Develop a R program to produce the matrix given below.

       [,1] [,2] [,3] [,4]  [,5]
 [1,] -10.0 12.5 35.0 57.5  80.0
 [2,]  -7.5 15.0 37.5 60.0  82.5
 [3,]  -5.0 17.5 40.0 62.5  85.0
 [4,]  -2.5 20.0 42.5 65.0  87.5
 [5,]   0.0 22.5 45.0 67.5  90.0
 [6,]   2.5 25.0 47.5 70.0  92.5
 [7,]   5.0 27.5 50.0 72.5  95.0
 [8,]   7.5 30.0 52.5 75.0  97.5
 [9,]  10.0 32.5 55.0 77.5 100.0

# Define matrix as follows
matrix <- matrix(seq(-10, 100, 2.5), nrow = 9, ncol = 5, byrow = FALSE)

Access the elements at

# 2nd column, 4th row
matrix[4,2]

[1] 20

# 3rd and 5th rows
matrix[c(3,5),]

     [,1] [,2] [,3] [,4] [,5]
[1,]   -5 17.5   40 62.5   85
[2,]    0 22.5   45 67.5   90

# 3rd column
matrix[,3]

[1] 35.0 37.5 40.0 42.5 45.0 47.5 50.0 52.5 55.0

Remove the 6th to 9th rows and find the diagonal of this matrix

matrix_b <- matrix[-c(6,7,8,9),]
diag(matrix_b)

[1] -10  15  40  65  90

Replace all elements other than the diagonal as zeros.

ifelse(row(matrix_b) == col(matrix_b), diag(matrix_b), 0)

     [,1] [,2] [,3] [,4] [,5]
[1,]  -10    0    0    0    0
[2,]    0   15    0    0    0
[3,]    0    0   40    0    0
[4,]    0    0    0   65    0
[5,]    0    0    0    0   90

Sum up all the columns and find their means.

apply(matrix, 2, sum)

[1]   0.0 202.5 405.0 607.5 810.0

apply(matrix, 2, mean)

[1]  0.0 22.5 45.0 67.5 90.0

7. Construct a R program to test whether the value of the element of a given vector greater than 10 or not. Return `TRUE` or `FALSE`.

(5,10,7,9,8,12,21,4,-5,6)

# test whether elements in y are greater than 10
y <- c(5,10,7,9,8,12,21,4,-5,6)
ifelse(y > 10, TRUE, FALSE)

 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE

Tutorial 5

1. Write a R program to call the (built-in) dataset `airquality`. Check whether it is a data frame or not? Order the entire data frame by the first and second column.

data(airquality) # Load dataset
class(airquality) # check class

[1] "data.frame"

head(airquality) # display first few rows

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

airquality_ordered <- airquality[order(airquality$Ozone, airquality$Solar.R), ]
airquality_ordered_decr <- airquality[order(airquality$Ozone, airquality$Solar.R, decreasing = T), ]
head(airquality_ordered)

    Ozone Solar.R Wind Temp Month Day
21      1       8  9.7   59     5  21
23      4      25  9.7   61     5  23
18      6      78 18.4   57     5  18
76      7      48 14.3   80     7  15
147     7      49 10.3   69     9  24
11      7      NA  6.9   74     5  11

head(airquality_ordered_decr)

    Ozone Solar.R Wind Temp Month Day
117   168     238  3.4   81     8  25
62    135     269  4.1   84     7   1
99    122     255  4.0   89     8   7
121   118     225  2.3   94     8  29
30    115     223  5.7   79     5  30
101   110     207  8.0   90     8   9

2. Given the data frame

Name = ('Johan', 'Sharon', 'Chong', 'James', 'Emily', 
        'Mike', 'Poh Seng', 'Fooi Sim', 'Kevin','Jonas'),
score = (12.5, 9, 16.5, 12, 9, 20, 14.5, 13.5, 8, 19),
attempts =(1, 3, 2, 3, 2, 3, 1, 1, 2, 1),
qualify = ('yes', 'no', 'yes', 'no', 'no', 
           'yes', 'yes', 'no', 'no', 'yes')

Build a program in R language to select some random rows from a given data frame.

qualifier <- data.frame(
  Name = c('Johan', 'Sharon', 'Chong', 'James', 'Emily', 
           'Mike', 'Poh Seng', 'Fooi Sim', 'Kevin', 'Jonas'), 
  score = c(12.5, 9, 16.5, 12, 9, 20, 14.5, 13.5, 8, 19), 
  attempts = c(1, 3, 2, 3, 2, 3, 1, 1, 2, 1), 
  qualify = c('yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes')
)
qualifier[sample(nrow(qualifier), size = 5), ]

      Name score attempts qualify
1    Johan  12.5        1     yes
2   Sharon   9.0        3      no
8 Fooi Sim  13.5        1      no
3    Chong  16.5        2     yes
6     Mike  20.0        3     yes

sample(nrow(qualifier), size = 5)

[1]  8  2 10  9  5

3. From question no. 2, remove the variables `attempts` and `qualify` and display the data frame.

qualifier_rm <- qualifier[ , -c(3, 4)]
qualifier_rm

       Name score
1     Johan  12.5
2    Sharon   9.0
3     Chong  16.5
4     James  12.0
5     Emily   9.0
6      Mike  20.0
7  Poh Seng  14.5
8  Fooi Sim  13.5
9     Kevin   8.0
10    Jonas  19.0

4. Find the levels of factor of the given vector.

v = c(-11, 12, 33, -3, 204, NA, 23, 21, 49, 51, NA, 5)

v <- c(-11, 12, 33, -3, 204, NA, 23, 21, 49, 51, NA, 5)
levels(as.factor(v))

 [1] "-11" "-3"  "5"   "12"  "21"  "23"  "33"  "49"  "51"  "204"

5. Given a vector which represents the sizes of T-shirts that have been prepared by a tailor.

Baju = (Large, Small, Extra Large, Extra Small, Double Large, Small, Double Large,
Medium, Large)

Convert this into an ordered factor with the labels `S, XS, L, XL, XXL, M`

Baju <- c("L", "S", "XL", "XS", "XXL", "S", "XXL", "M", "L")
baju_ordered <- factor(Baju, levels = c("XS", "S", "M", "L", "XL", "XXL"), ordered = TRUE)
print(baju_ordered)

[1] L   S   XL  XS  XXL S   XXL M   L  
Levels: XS < S < M < L < XL < XXL

sort(baju_ordered)

[1] XS  S   S   M   L   L   XL  XXL XXL
Levels: XS < S < M < L < XL < XXL

6. Write a R program to extract the seven of the levels of factor created from a random sample from the word `LETTERS`

set.seed(123)
factor(sample(LETTERS, size = 20, replace = TRUE))[1:7]

[1] O S N C J R V
Levels: C E I J K N O R S T V Y Z

7. Given this two arrays:

, , M1

     [,1] [,2] [,3]
[1,] "A"  "B"  "C" 
[2,] "A"  "B"  "C" 
[3,] "A"  "B"  "C" 

, , M2

     [,1] [,2] [,3]
[1,] "P"  "Q"  "R" 
[2,] "P"  "Q"  "R" 
[3,] "P"  "Q"  "R"

Develop a program to produce an array `M3` which combines the two arrays above so that

the first row of the M1 is followed by the first row of the M2.

M1 <- matrix(c(rep(c("A", "B", "C"), each = 3)), nrow = 3, byrow = TRUE)
M2 <- matrix(c(rep(c("P", "Q", "R"), each = 3)), nrow = 3, byrow = TRUE)

array_m3 <- function(a, b, j, k) {
  array <- array(c(a, b), dim = c(3, 3, 2), dimnames = list(NULL, NULL, c(j, k)))
  m3 <- matrix(c(a[1,], b[1,], a[2,], b[2,], a[3,], b[3,]), nrow = 6, byrow = TRUE)
  array_M3 <- array(m3, dim = c(6, 3, 1), dimnames = list(NULL, NULL, "M3"))
  return(list(array = array, array_M3 = array_M3))
}

array_m3(M1, M2, "M1", "M2")

$array
, , M1

     [,1] [,2] [,3]
[1,] "A"  "A"  "A" 
[2,] "B"  "B"  "B" 
[3,] "C"  "C"  "C" 

, , M2

     [,1] [,2] [,3]
[1,] "P"  "P"  "P" 
[2,] "Q"  "Q"  "Q" 
[3,] "R"  "R"  "R" 


$array_M3
, , M3

     [,1] [,2] [,3]
[1,] "A"  "A"  "A" 
[2,] "P"  "P"  "P" 
[3,] "B"  "B"  "B" 
[4,] "Q"  "Q"  "Q" 
[5,] "C"  "C"  "C" 
[6,] "R"  "R"  "R"

Tutorial 6

Question 1

Build a numeric vector by sampling 100 numbers from a normal distribution with mean 2 and standard deviation 4. Use the function rnorm()

random_numbers <- rnorm(100, mean = 2, sd = 4)
print(x)

[1] 30 -2 13 NA 23 43

How many of the generated values are negative? Then compute the standard deviation, mean, median of your random numbers.

length(random_numbers[random_numbers < 0])

[1] 28

mean(random_numbers); median(random_numbers); sd(random_numbers)

[1] 2.030637

[1] 1.786121

[1] 3.654478

Replace the 11th value in your random number vector with NA and calculate the same summary statistics again but this time you may use the summary() or the apply()

random_numbers[11] <- NA
summary(random_numbers, na.rm = T)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-7.2367 -0.2624  1.8199  2.0604  4.5047 10.7493       1

Replace the last position in the vector with the letter “L” and calculate the same summary statistics

random_numbers[length(random_numbers)] <- "L"
summary(random_numbers, na.rm = T)

   Length     Class      Mode 
      100 character character

2. Suppose that you are given a code below to build a data frame

set.seed(926436)
data <- data.frame(x1 = rnorm(100), x2 = runif(100), x3 = LETTERS[1:4])
head(data)

Compute the min, max, and the mean function for the variable x1 and x2.

set.seed(926436)
data <- data.frame(x1 = rnorm(100), x2 = runif(100), x3 = LETTERS[1:4])
summary(data[c("x1", "x2")])

       x1                 x2          
 Min.   :-3.23671   Min.   :0.003054  
 1st Qu.:-0.67880   1st Qu.:0.257628  
 Median : 0.01180   Median :0.504780  
 Mean   :-0.05862   Mean   :0.501471  
 3rd Qu.: 0.65865   3rd Qu.:0.766102  
 Max.   : 2.32689   Max.   :0.970136

Compute the min,max and the mean by using the apply() function.

apply(data[c("x1","x2")], 2, function(x) c(min(x), max(x), mean(x)))

              x1          x2
[1,] -3.23671504 0.003054301
[2,]  2.32688742 0.970136215
[3,] -0.05862404 0.501471101

Find the sum function to get the sum of all values in our data frame column

apply(data[c("x1","x2")], 2, sum)

       x1        x2 
-5.862404 50.147110

3. Type this dataset called `cars` in your R. (It is already installed under this package called `datasets`. Find out the contents of the `datasets` library by using the `help` function (`library(help=datasets))`)

Compute the min, max, mean and the range for each column of this data set by using the apply() function.

library(datasets)
data(cars)
apply(cars, 2, function(x) c(min(x), max(x), mean(x), range(x)))

     speed   dist
[1,]   4.0   2.00
[2,]  25.0 120.00
[3,]  15.4  42.98
[4,]   4.0   2.00
[5,]  25.0 120.00

Calculate a) but this time by using the summary function

data(cars)
summary(cars)

     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00

speed_range <- summary(cars$speed)["Max."] - summary(cars$speed)["Min."]
dist_range <- summary(cars$dist)["Max."] - summary(cars$dist)["Min."]
cat("speed range:", speed_range, "\n")

speed range: 21

cat("distance range:", dist_range, "\n")

distance range: 118

Now assign your data set cars to another variable called cars2. Then add a row at the bottom of this new data set cars2 showing the column means. (Use the rbind together with the apply functions)

cars2 <- rbind(cars, apply(cars, 2, mean))
tail(cars2)

   speed   dist
46  24.0  70.00
47  24.0  92.00
48  24.0  93.00
49  24.0 120.00
50  25.0  85.00
51  15.4  42.98

Label this last row as ‘mean’ by using the rownames function.

rownames(cars2)[51] <- c("mean")
tail(cars2)

     speed   dist
46    24.0  70.00
47    24.0  92.00
48    24.0  93.00
49    24.0 120.00
50    25.0  85.00
mean  15.4  42.98

Week 9 Tutorial

Installing `marketing` datasets

data(marketing, package = "datarium")

Constructing a simple linear regression model

simple_model <- lm(sales ~ youtube, data = marketing)
summary(simple_model)


Call:
lm(formula = sales ~ youtube, data = marketing)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.0632  -2.3454  -0.2295   2.4805   8.6548 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.439112   0.549412   15.36   <2e-16 ***
youtube     0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.91 on 198 degrees of freedom
Multiple R-squared:  0.6119,    Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

Residual analysis of the linear model

par(mfrow = c(2,2))
plot(simple_model)

Plotting the regression line

par(mfrow = c(1,1))
plot(marketing$youtube, marketing$sales, 
     main = "YouTube vs Sales with Regression Line", 
     xlab = "YouTube", ylab = "Sales")
abline(simple_model, col = "blue")

Alternative code: Using `ggplot2` package

ggplot(marketing, aes(x = youtube, y = sales)) + geom_point() + 
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "YouTube vs Sales with Regression Line", x = "YouTube", y = "Sales")

`geom_smooth()` using formula = 'y ~ x'

Constructing a multiple linear regression model

lm_model_1 <- lm(sales ~ youtube, data = marketing)
lm_model_2 <- lm(sales ~ facebook, data = marketing)
lm_model_3 <- lm(sales ~ newspaper, data = marketing)
multiple_model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(multiple_model)


Call:
lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.5932  -1.0690   0.2902   1.4272   3.3951 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.526667   0.374290   9.422   <2e-16 ***
youtube      0.045765   0.001395  32.809   <2e-16 ***
facebook     0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.023 on 196 degrees of freedom
Multiple R-squared:  0.8972,    Adjusted R-squared:  0.8956 
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

Residual analysis of multiple model

par(mfrow = c(2,2))
plot(lm_model_1)

par(mfrow = c(2,2))
plot(lm_model_2)

par(mfrow = c(2,2))
plot(lm_model_3)

Plotting `sales` with respect to every independent variables in `marketing`

par(mfrow = c(1,3))
response_variable <- "sales"
for (variable in colnames(marketing)) {
  if (variable != response_variable) {
    plot(marketing[[variable]], marketing[[response_variable]],
         main = paste(variable, "vs", response_variable),
         xlab = variable, ylab = response_variable)
    abline(lm(marketing[[response_variable]] ~ marketing[[variable]]), col = 'blue')
  }
}

Alternative code: Using base `R`

par(mfrow = c(1,3))
plot(marketing$youtube, marketing$sales, xlab = "youtube", ylab = "sales")
abline(lm(marketing$sales ~ marketing$youtube), col = "red")
plot(marketing$facebook, marketing$sales, xlab = "facebook", ylab = "sales")
abline(lm(marketing$sales ~ marketing$facebook), col = "blue")
plot(marketing$newspaper, marketing$sales, xlab = "newspaper", ylab = "sales")
abline(lm(marketing$sales ~ marketing$newspaper), col = "green")

Tutorial 7

Build a multilinear regression model for estimating sales based on the advertising budget invested in youtube, `facebook and newspaper. Interpret your output.

multiple_model <- lm(sales ~ youtube + facebook + newspaper, data = marketing)
summary(multiple_model)


Call:
lm(formula = sales ~ youtube + facebook + newspaper, data = marketing)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.5932  -1.0690   0.2902   1.4272   3.3951 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.526667   0.374290   9.422   <2e-16 ***
youtube      0.045765   0.001395  32.809   <2e-16 ***
facebook     0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.023 on 196 degrees of freedom
Multiple R-squared:  0.8972,    Adjusted R-squared:  0.8956 
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

What is the multilinear regression model? \[ \text{sales} = 3.526667 + 0.045(\text{youtube}) + 0.18(\text{facebook}) -0.001( \text{newspaper}) \] Are there any confounding variables? Perform the VIF method and explain your output.

library(car)

Loading required package: carData


Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

library(corrplot)

corrplot 0.92 loaded

vif(multiple_model)

  youtube  facebook newspaper 
 1.004611  1.144952  1.145187

cor(marketing)

             youtube   facebook  newspaper     sales
youtube   1.00000000 0.05480866 0.05664787 0.7822244
facebook  0.05480866 1.00000000 0.35410375 0.5762226
newspaper 0.05664787 0.35410375 1.00000000 0.2282990
sales     0.78222442 0.57622257 0.22829903 1.0000000

Find the two-sample t-test between newspaper and facebook.

var.test(marketing$newspaper, marketing$facebook)


    F test to compare two variances

data:  marketing$newspaper and marketing$facebook
F = 2.1518, num df = 199, denom df = 199, p-value = 9.593e-08
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 1.628426 2.843289
sample estimates:
ratio of variances 
          2.151763

t.test(marketing$newspaper, marketing$facebook, var.equal = FALSE)


    Welch Two Sample t-test

data:  marketing$newspaper and marketing$facebook
t = 3.9114, df = 351.11, p-value = 0.0001102
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  4.349305 13.146695
sample estimates:
mean of x mean of y 
  36.6648   27.9168

Tutorial 8

Chi-Square test

Author DataFlair

data_frame <- read.csv("https://goo.gl/j6lRXD") # reading csv
str(data_frame)

'data.frame':   105 obs. of  3 variables:
 $ id         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ treatment  : chr  "treated" "treated" "not-treated" "treated" ...
 $ improvement: chr  "improved" "improved" "improved" "improved" ...

table(data_frame$treatment, data_frame$improvement)

             
              improved not-improved
  not-treated       26           29
  treated           35           15

chisq.test(data_frame$treatment, data_frame$improvement, correct = FALSE)


    Pearson's Chi-squared test

data:  data_frame$treatment and data_frame$improvement
X-squared = 5.5569, df = 1, p-value = 0.01841

load the library

library("MASS")


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

create a data frame from the main data set

car.data <- data.frame(Cars93$AirBags, Cars93$Type)

create table with needed variables

car.data = table(Cars93$AirBags, Cars93$Type)
print(car.data)

                    
                     Compact Large Midsize Small Sporty Van
  Driver & Passenger       2     4       7     0      3   0
  Driver only              9     7      11     5      8   3
  None                     5     0       4    16      3   6

perform chi-square test

print(chisq.test(car.data))

Warning in chisq.test(car.data): Chi-squared approximation may be incorrect


    Pearson's Chi-squared test

data:  car.data
X-squared = 33.001, df = 10, p-value = 0.0002723

Loading `mtcars`

data(mtcars)
cars <- mtcars[,c(1,9)]
head(cars)

                   mpg am
Mazda RX4         21.0  1
Mazda RX4 Wag     21.0  1
Datsun 710        22.8  1
Hornet 4 Drive    21.4  0
Hornet Sportabout 18.7  0
Valiant           18.1  0

Computing the variance of miles per gallon w.r.t. mode of transmission

var(cars$mpg[which(cars$am == 0)])
var(cars$mpg[which(cars$am == 1)])

[1] 14.6993

[1] 38.02577

Variances are unequal. Welch’s t-Test will be used.

Stating the null and alternative hypotheses:

Let \(\mu_{auto}\) and \(\mu_{manual}\) be the means of miles per gallon with respect to automatic and manual modes of transmission. Then, the null and alternative hypotheses, denoted \(H_{0}\) and \(H_{\alpha}\) are: \[ \begin{split} H_{0}: \mu_{\text{auto}} - \mu_{\text{manual}} = 0 \\ H_{\alpha}: \mu_{\text{auto}} - \mu_{\text{manual}} \neq 0 \end{split} \]

Plotting the boxplot of `mpg` w.r.t. mode of transmission

cars$am <- as.factor(ifelse(cars$am == 1, "automatic", "manual"))
ggplot(cars, aes(x = am, y = mpg, fill = am, shape = am)) + geom_boxplot() + 
  geom_jitter() + labs(x = "transmission", y = "miles per gallon") + 
  ggtitle("miles per gallon w.r.t transmission")

Conducting a t-Test

result <- t.test(mpg ~ am, data = cars)
cat("t =", round(result$statistic, 4), ", df =", round(result$parameter, 3), 
    ", p =", format(result$p.value, digits=4), "\n")
cat("95% confidence interval:", round(result$conf.int, 3), "\n")
cat("Sample estimates:\n")
cat("Mean in group automatic: ", round(result$estimate[1], 2), "\n")
cat("Mean in group manual: ", round(result$estimate[2], 2), "\n")
cat("Alternative hypothesis: mean difference not equal to 0\n")

t = 3.7671 , df = 18.332 , p = 0.001374

95% confidence interval: 3.21 11.28

Sample estimates:

Mean in group automatic:  24.39

Mean in group manual:  17.15

Alternative hypothesis: mean difference not equal to 0

Exercises, 5 January 2024

The dataset `starbucks` in the `openintro` package contains nutritional information on 77 Starbucks food items. Spend some time reading the help file of this dataset. For this problem, you will explore the relationship between the calories and carbohydrate grams in these items.

options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("openintro")


The downloaded binary packages are in
    /var/folders/yq/q9fv54y91vb70rhxtfs06ysh0000gn/T//RtmpfRdr7H/downloaded_packages

library(openintro)

Loading required package: airports

Loading required package: cherryblossom

Loading required package: usdata


Attaching package: 'openintro'

The following objects are masked from 'package:MASS':

    housing, mammals

The following object is masked from 'package:car':

    densityPlot

data(starbucks)
?starbucks

a) Create a scatterplot of this data with calories on the x -axis and carbohydrate grams on the y-axis, and describe the relationship you see.

plot(data = starbucks, carb ~ calories)

The relationship is monotonic, although there’s greater variability in carbohydrate content with respect to the calorie content.

b) In the scatterplot you made, what is the explanatory variable? What is the response variable? Why might you want to construct the problem in this way?

explanatory variable: calories
response variable: carb Because we might be curious about the relationship between the carbohydrate content and calories; whether the increase of calories does lead to the increase of carbs.

c) Fit a simple linear regression to this data, with carbohydrate grams as the dependent variable and the calories as the explanatory variable. Use the lm() function.

lm(data = starbucks, carb ~ calories)


Call:
lm(formula = carb ~ calories, data = starbucks)

Coefficients:
(Intercept)     calories  
      8.944        0.106

d) Write the fitted model out using mathematical notation. Interpret the slope and the intercept parameters.

\[ \text{carb} = 0.106(\text{calories}) + 8.944 \] ## e) Find and interpret the value of R2 for this model.

summary(lm(data = starbucks, carb ~ calories))$r.squared

[1] 0.4556237

summary(lm(data = starbucks, carb ~ calories))$adj.r.squared

[1] 0.4483653

The adjusted \(R\)-squared is more accruate than the \(R\)-squared in explaining the goodness-of-fit of the regression model. However, the \(R\)-squared and adjusted \(R\)-squared values are 0.4556237 and 0.4483653 respectively, making them quite similar to each other. This means that only 45.56% and 44.83% of the increase in carbs are explained by the increase in calories.

f) Create a residual plot. Describe what you see in the residual plot. Does the model look like a good fit?

qqnorm(residuals(lm(data = starbucks, carb ~ calories)))
qqline(residuals(lm(data = starbucks, carb ~ calories)))

Yes. It is because it has minimal outliers, and the data fits in the straight line, albeit with some degree of skewness.

Tutorial 9

Load the dataset

library(readr)

Warning: package 'readr' was built under R version 4.2.3

Sample_Dataset_2014 <- read_csv("Sample_Dataset_2014.csv")

Rows: 435 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (5): bday, enrolldate, expgradate, Major, State
dbl  (17): ids, Rank, Gender, Athlete, Height, Weight, Smoking, Sprint, Engl...
time  (1): MileMinDur

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Create a contingency table w.r.t smoking preference and gender

cont_table <- table(Sample_Dataset_2014[, c(7,11)])

State the null and alternative hypotheses

Let \(H_{0}\) and \(H_{\alpha}\) be the null and alternative hypotheses respectively. Then, \[ H_{0}: \text{smoking preference is independent of gender} \\ H_{\alpha}: \text{smoking preference is dependent of gender} \]

Conduct a \(\chi^{2}\)-test

chisq.test(cont_table)


    Pearson's Chi-squared test

data:  cont_table
X-squared = 3.1713, df = 2, p-value = 0.2048

\(p = 0.2048 > 0.05 = \alpha\), don’t reject \(H_{0}\) at \(\alpha = 0.05\).

t-Test for weight difference for tall and short people

Imputing missing values with median

height <- Sample_Dataset_2014$Height
height[is.na(height)] <- median(Sample_Dataset_2014$Height, na.rm = T)
weight <- Sample_Dataset_2014$Weight
weight[is.na(weight)] <- median(Sample_Dataset_2014$Weight, na.rm = T)

Classifying weight type with median as threshold value

median_height <- median(Sample_Dataset_2014$Height, na.rm = T)
height_type <- cut(height, breaks = c(-Inf, median_height, Inf),
                   labels = c("shorter", "taller"), include.lowest = TRUE)

Preparing the data

hw <- data.frame(weight, height_type)
hw$weight <- as.numeric(hw$weight)
hw$height_type <- as.factor(hw$height_type)

Performing two-sample t-Test

t.test(weight ~ height_type, data = hw)


    Welch Two Sample t-test

data:  weight by height_type
t = -6.8072, df = 394.39, p-value = 3.719e-11
alternative hypothesis: true difference in means between group shorter and group taller is not equal to 0
95 percent confidence interval:
 -32.29308 -17.81983
sample estimates:
mean in group shorter  mean in group taller 
             168.1862              193.2426

Create a boxplot

boxplot(weight ~ height_type, data = hw, 
        main = "Weight Comparison between Taller and Shorter Individuals", 
        xlab = "Height Type", ylab = "Weight")