library(tidyverse)
library(readxl)
thesis <- read_excel("SampleThesisData.xlsx", na = "-")

thesis
  1. A Pearson correlation test is used to assess the relationship between age and GPA1, because both variables are continuous.
cor.test(thesis$Age, thesis$GPA1)

    Pearson's product-moment correlation

data:  thesis$Age and thesis$GPA1
t = 0.25668, df = 39, p-value = 0.7988
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.2699939  0.3443673
sample estimates:
       cor 
0.04106779 

The correlation between age and GPA1 is not statistically significant, r(39) = .04, p = .80.
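
As a check on the reported values, the t statistic and p-value in the output above can be recovered from r and the degrees of freedom alone. The following is a minimal sketch that uses only the numbers printed by cor.test(); the variable names (r, df, t_stat, p_value) are chosen here for illustration and are not part of the original analysis.

```r
# Values copied from the cor.test() output above
r  <- 0.04106779   # sample correlation
df <- 39           # degrees of freedom (n - 2)

# t statistic for testing H0: true correlation = 0
t_stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

round(t_stat, 5)
round(p_value, 4)
```

Because the p-value depends on r only through this t statistic, a correlation this close to zero cannot be significant with 41 observations.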

The following scatterplot demonstrates the relationship between age and GPA1.

thesis %>% 
  drop_na(Age, GPA1) %>%                                  
  ggplot(aes(Age, GPA1)) +
  geom_point() +
  theme_minimal() +                                            
  geom_smooth(formula = y~x, method = lm, se = FALSE) +        
  labs(title = "Relationship Between Age and GPA1",      
       x = "Age",
       y = "GPA1")

  1. To determine whether GPA1 differs between students in the Business college and students in the Arts and Sciences college, a t-test is used. A t-test is appropriate because the dependent variable (GPA1) is continuous and the independent variable (College) is categorical with two groups.
t.test(thesis$GPA1 ~ thesis$College)

    Welch Two Sample t-test

data:  thesis$GPA1 by thesis$College
t = -1.2753, df = 38.772, p-value = 0.2098
alternative hypothesis: true difference in means between group AS and group BU is not equal to 0
95 percent confidence interval:
 -0.7143396  0.1619586
sample estimates:
mean in group AS mean in group BU 
         3.02381          3.30000 

Students in the Arts and Sciences college (M = 3.02) had a lower average GPA1 than students in the Business college (M = 3.30), but the difference is not statistically significant, t(38.77) = -1.28, p = .21. The following boxplot shows GPA1 for each college.

thesis %>% 
  drop_na(College, GPA1) %>%   
  ggplot(aes(x = College, y = GPA1)) +
  geom_boxplot() +
  geom_jitter(width = .1) +
  theme_minimal() +
  labs(title = "GPA1 by College", x = "College", y = "GPA1")
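
The confidence interval in the Welch output above follows from the two group means, the t statistic, and the degrees of freedom. The sketch below reconstructs it from those printed values only; the variable names are illustrative, and the standard error is inferred from t = difference / SE rather than taken from the raw data.

```r
# Values copied from the Welch t-test output above
mean_as <- 3.02381
mean_bu <- 3.30000
t_stat  <- -1.2753
df      <- 38.772   # Welch-Satterthwaite degrees of freedom

diff_means <- mean_as - mean_bu      # AS minus BU, matching the output
se_diff    <- diff_means / t_stat    # standard error implied by t = diff / SE

# 95% confidence interval: difference +/- critical t * SE
ci <- diff_means + c(-1, 1) * qt(0.975, df) * se_diff
round(ci, 3)
```

The interval contains zero, which is consistent with the non-significant test result.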

  1. The following compares the GPA1 of accounting majors with that of communications majors. The data are first filtered to these two majors, and a t-test is then run on the filtered data.
# Keep only accounting and communications majors
AccCommMajor <- thesis %>% 
  filter(Major %in% c("Account", "Comm"))

t.test(AccCommMajor$GPA1 ~ AccCommMajor$Major)

    Welch Two Sample t-test

data:  AccCommMajor$GPA1 by AccCommMajor$Major
t = 0.95153, df = 5.297, p-value = 0.3827
alternative hypothesis: true difference in means between group Account and group Comm is not equal to 0
95 percent confidence interval:
 -0.7868789  1.7368789
sample estimates:
mean in group Account    mean in group Comm 
                3.675                 3.200 

Accounting majors (M = 3.68) have a higher average GPA1 than communications majors (M = 3.20); however, the difference is not statistically significant, t(5.30) = 0.95, p = .38. The following boxplot shows GPA1 for the two majors.

AccCommMajor %>% 
  ggplot(aes(x = Major, y = GPA1)) +
  geom_boxplot() +
  geom_jitter(width = .2) +
  theme_minimal() +
  labs(title = "GPA of Accounting and Communications Majors", x = "Major", y = "GPA1")

  1. To determine whether Mood1 and Mood2 differ, a paired-samples t-test is used. This test is appropriate because Mood1 and Mood2 are the same continuous measure taken from the same students at two different times.
t.test(thesis$Mood1, thesis$Mood2, paired = TRUE)

    Paired t-test

data:  thesis$Mood1 and thesis$Mood2
t = -2.1686, df = 40, p-value = 0.03611
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.80105415 -0.02821414
sample estimates:
mean of the differences 
             -0.4146341 

Mood was statistically significantly lower at time 1 (M = -0.24) than at time 2 (M = 0.24), mean difference = -0.41, t(40) = -2.17, p = .036. The following boxplot shows Mood1 and Mood2.

thesis %>% 
  pivot_longer(cols = c(Mood1, Mood2), names_to = "Time", values_to = "Mood") %>% 
  select(Time, Mood)
thesis %>% 
  pivot_longer(cols = c(Mood1, Mood2), names_to = "Time", values_to = "Mood") %>% 
  ggplot(aes(x = Time, y = Mood)) +
  geom_boxplot() +
  geom_jitter(width = .2) +
  theme_minimal() +
  labs(title = "Relationship Between Mood and Time", x = "Time", y = "Mood Calculation")
Warning: Removed 2 rows containing non-finite values (stat_boxplot).
Warning: Removed 2 rows containing missing values (geom_point).
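
A paired-samples t-test is equivalent to a one-sample t-test on the within-person differences, which is why it suits repeated measures. The sketch below illustrates this with simulated values; mood1 and mood2 here are hypothetical vectors generated for the example, not the thesis data.

```r
# Hypothetical data for illustration only (not the thesis data)
set.seed(1)
mood1 <- rnorm(20)
mood2 <- mood1 + rnorm(20, mean = 0.4)

paired_test <- t.test(mood1, mood2, paired = TRUE)
diff_test   <- t.test(mood1 - mood2, mu = 0)   # one-sample test on the differences

# Both calls give the same t statistic, degrees of freedom, and p-value
all.equal(unname(paired_test$statistic), unname(diff_test$statistic))
all.equal(unname(paired_test$parameter), unname(diff_test$parameter))
all.equal(paired_test$p.value, diff_test$p.value)
```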

  1. To determine whether there is a relationship between where students are from and which college they attend, a chi-square test of independence is used. This test is appropriate because both variables are categorical.
table(thesis$Home, thesis$College)
            
             AS BU
  Billings    5  6
  OtherMT    11  7
  OutofState  6  6
chisq.test(thesis$Home, thesis$College)

    Pearson's Chi-squared test

data:  thesis$Home and thesis$College
X-squared = 0.76438, df = 2, p-value = 0.6824

There is no statistically significant relationship between students' home location and the college they attend, chi-square(2) = 0.76, p = .68. The following bar graph shows the proportion of each home location within each college.

thesis %>% 
  drop_na(College, Home) %>% 
  mutate(Home = as_factor(Home)) %>% 
  mutate(Home = fct_recode(Home,
                           "Billings" = "Billings",
                            "City/town in Montana" = "OtherMT",
                            "City/town out of Montana" = "OutofState"))  %>% 
  mutate(College = as_factor(College)) %>% 
  mutate(College = fct_recode(College,
                          "Business College" = "BU",
                          "Arts and Sciences College" = "AS")) %>% 
  ggplot(aes(x = College, fill = Home)) +
  geom_bar(position = "fill") +
  scale_fill_viridis_d() +                        # use scale_fill_grey() here if you don't want color
  theme_minimal() +
  coord_flip() +
  labs(title = "College by Home",
      y = "Proportion of Different Homes")
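
The chi-square statistic reported above can be reproduced directly from the cross-tabulation, without the original data file. The sketch below rebuilds the table of counts by hand (the object names counts and result are illustrative) and also shows the expected counts under independence, which is what the test compares the observed counts against.

```r
# Counts copied from the table() output above
counts <- matrix(c(5, 11, 6,    # AS column
                   6,  7, 6),   # BU column
                 ncol = 2,
                 dimnames = list(Home = c("Billings", "OtherMT", "OutofState"),
                                 College = c("AS", "BU")))

result <- chisq.test(counts)
result$expected    # counts expected if Home and College were independent
result$statistic   # matches the X-squared value reported above
result$p.value
```

All expected counts are above 5, so the usual chi-square approximation is reasonable for this table.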

  1. To determine whether self-esteem differs by where a student comes from, an analysis of variance (ANOVA) is used. An ANOVA is used instead of a t-test because the categorical independent variable (Home) has more than two levels.
thesis %>% 
  drop_na(Home, SelfEsteem) %>% 
  group_by(Home) %>% 
  summarize(Mean = mean(SelfEsteem), 
            "Std Dev" = sd(SelfEsteem),
            N = n())
Home_ANOVA <- aov(thesis$SelfEsteem ~ thesis$Home)
summary(Home_ANOVA)
            Df Sum Sq Mean Sq F value Pr(>F)
thesis$Home  2  15.76   7.879   1.043  0.362
Residuals   39 294.53   7.552               
1 observation deleted due to missingness
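
The F value in the table above is the ratio of the between-group mean square to the residual mean square, and the p-value is the upper tail of the F distribution at that ratio. A minimal sketch using only the printed (rounded) table values, with illustrative variable names:

```r
# Values copied from the summary(Home_ANOVA) table above
ms_between <- 7.879   # Mean Sq for Home
ms_within  <- 7.552   # Mean Sq for Residuals
df_between <- 2
df_within  <- 39

f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(f_value, 3)
round(p_value, 3)
```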

There were no statistically significant differences in self-esteem by home location, F(2, 39) = 1.04, p = .36. A post hoc Tukey HSD test is used to compare individual pairs of groups.

TukeyHSD(Home_ANOVA)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = thesis$SelfEsteem ~ thesis$Home)

$`thesis$Home`
                           diff       lwr      upr     p adj
OtherMT-Billings    -1.27777778 -3.772928 1.217373 0.4329248
OutofState-Billings -0.08333333 -2.816634 2.649967 0.9969630
OutofState-OtherMT   1.19444444 -1.300706 3.689595 0.4800845

None of the pairwise comparisons is statistically significant (all adjusted p > .43). The following boxplot shows self-esteem by home location.

thesis %>% 
  drop_na(SelfEsteem, Home) %>% 
  mutate(Home = as_factor(Home)) %>% 
  mutate(Home = fct_recode(Home,
                            "Billings" = "Billings",
                            "City/town in Montana" = "OtherMT",
                            "City/town out of Montana" = "OutofState"))  %>% 
  ggplot(aes(x = Home, y = SelfEsteem)) +
  geom_boxplot() +
  geom_jitter(width = .2) +
  theme_minimal() +
  labs(title = "Self-Esteem by Home",
       y = "Self-Esteem") 

