Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

##Initialize - install and load packages & data
install.packages("haven")
install.packages("ppcor")
install.packages("psych")
install.packages("Hmisc")
install.packages("ggplot2")
#install.packages("apaTables") I'll investigate this at a later point
library("psych")
library("haven")
library("ppcor")
library("Hmisc")
library("ggplot2")
#library("apaTables") I'll investigate this at a later point

## # A tibble: 6 x 58
##      id  gender    month    year language    book home_computer home_desk
##   <dbl> <dbl+l> <dbl+lb> <dbl+l> <dbl+lb> <dbl+l>     <dbl+lbl> <dbl+lbl>
## 1     1 1 [GIR~  1 [JAN~ 5 [200~ 1 [ALWA~ 4 [TWO~       1 [YES]   0 [NO] 
## 2     2 0 [BOY]  9 [SEP~ 4 [200~ 1 [ALWA~ 3 [ONE~       1 [YES]   1 [YES]
## 3     3 0 [BOY] 10 [OCT~ 4 [200~ 1 [ALWA~ 4 [TWO~       1 [YES]   1 [YES]
## 4     4 1 [GIR~  8 [AUG~ 4 [200~ 1 [ALWA~ 3 [ONE~       1 [YES]   1 [YES]
## 5     5 0 [BOY]  8 [AUG~ 4 [200~ 1 [ALWA~ 5 [THR~       1 [YES]   1 [YES]
## 6     6 0 [BOY] 11 [NOV~ 4 [200~ 1 [ALWA~ 3 [ONE~       1 [YES]   1 [YES]
## # ... with 50 more variables: home_book <dbl+lbl>, home_room <dbl+lbl>,
## #   home_internet <dbl+lbl>, computer_home <dbl+lbl>,
## #   computer_school <dbl+lbl>, computer_some <dbl+lbl>,
## #   parentsupport1 <dbl+lbl>, parentsupport2 <dbl+lbl>,
## #   parentsupport3 <dbl+lbl>, parentsupport4 <dbl+lbl>, school1 <dbl+lbl>,
## #   school2 <dbl+lbl>, school3 <dbl+lbl>, studentbullied1 <dbl+lbl>,
## #   studentbullied2 <dbl+lbl>, studentbullied3 <dbl+lbl>,
## #   studentbullied4 <dbl+lbl>, studentbullied5 <dbl+lbl>,
## #   studentbullied6 <dbl+lbl>, learning1 <dbl+lbl>, learning2 <dbl+lbl>,
## #   learning3 <dbl+lbl>, learning4 <dbl+lbl>, learning5 <dbl+lbl>,
## #   learning6 <dbl+lbl>, learning7 <dbl+lbl>, engagement1 <dbl+lbl>,
## #   engagement2 <dbl+lbl>, engagement3 <dbl+lbl>, engagement4 <dbl+lbl>,
## #   engagement5 <dbl+lbl>, confidence1 <dbl+lbl>, confidence2 <dbl+lbl>,
## #   confidence3 <dbl+lbl>, confidence4 <dbl+lbl>, confidence5 <dbl+lbl>,
## #   confidence6 <dbl+lbl>, score1 <dbl>, score2 <dbl>, score3 <dbl>,
## #   score4 <dbl>, score5 <dbl>, ParentSupport <dbl>, Home <dbl>,
## #   school <dbl>, StudentBullied <dbl>, learning <dbl>, engagement <dbl>,
## #   confidence <dbl>, ScienceScore <dbl>

Including Plots

Example from class: confi_v_sciencescore

*Usually, we build a scatterplot to check in advance whether two variables are strongly correlated.

*We need to specify a data set for the scatterplot. Let's create a dataset called confi_VS_sciscore.

##Scatterplot###

confi_VS_sciscore <- cbind( Confidence = class_example$confidence, Sci_Score = class_example$ScienceScore)
head(round(confi_VS_sciscore, digits = 2)) #round to hundredths
##      Confidence Sci_Score
## [1,]         22    474.58
## [2,]          6    535.81
## [3,]          6    618.63
## [4,]          6    568.98
## [5,]         13    648.14
## [6,]         10    604.36
summary(confi_VS_sciscore) #descriptive information for confi_VS_sciscore
##    Confidence      Sci_Score    
##  Min.   : 1.00   Min.   :276.7  
##  1st Qu.: 7.00   1st Qu.:492.7  
##  Median :10.00   Median :546.8  
##  Mean   :10.66   Mean   :541.4  
##  3rd Qu.:14.00   3rd Qu.:594.0  
##  Max.   :24.00   Max.   :774.0  
##  NA's   :238
plot(confi_VS_sciscore, main="The Scatterplot of ScienceScore VS Confidence", xlab="Sum of confidence", ylab="Science Score")

# The Current Example gender_x_learning

Conditional Density Plot

*These plots represent smoothed proportions of each category within various levels of the continuous variable. In order to interpret them you should look across at the x-axis and see how the different proportions for each category (represented by different colors) change with the different values of the numerical variable.

*These plots are useful for showing how a binary variable is distributed across the values of a continuous variable.

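# Note: Gender and Learning below are simulated random draws (13069 values, the number of rows in class_example), not the observed values.
Gender <- factor(rbinom(13069, 1, 0.5))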

Learning <- runif(13069, 0, 30)

cdplot(Gender ~ Learning)

###from the plot we can see that there is a weak relationship
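
To make the interpretation rule above concrete, here is a minimal sketch that draws two conditional density plots side by side. All data are simulated for illustration (the names x, y_related and y_unrelated are hypothetical and do not come from class_example). When the categorical variable does not depend on the continuous one, the boundary between the shaded areas stays roughly flat; when it does, the boundary slopes.

```r
# Simulated data: an assumption for illustration, not the class_example data
set.seed(1)
n <- 1000
x <- runif(n, 0, 30)                      # continuous variable on a 0-30 scale
p <- plogis((x - 15) / 3)                 # probability of category "1" rises with x
y_related   <- factor(rbinom(n, 1, p))    # category depends on x
y_unrelated <- factor(rbinom(n, 1, 0.5))  # category is independent of x

# Draw both plots side by side, then restore the graphics settings
op <- par(mfrow = c(1, 2))
cdplot(y_unrelated ~ x, main = "No relationship")
cdplot(y_related ~ x, main = "Strong relationship")
par(op)
```

In the left panel the proportions stay near one half across the whole x-axis; in the right panel the proportion of category 1 grows as x increases.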

Spearman Correlation

gender_x_learning <- cbind(class_example$gender, class_example$learning)
summary(gender_x_learning) #descriptive information
##        V1               V2       
##  Min.   :0.0000   Min.   : 1.00  
##  1st Qu.:0.0000   1st Qu.: 8.00  
##  Median :1.0000   Median :11.00  
##  Mean   :0.5032   Mean   :12.14  
##  3rd Qu.:1.0000   3rd Qu.:15.00  
##  Max.   :1.0000   Max.   :28.00  
##  NA's   :17       NA's   :189
# Correlation between gender and learning.
# Calculate the Spearman correlation
rcorr(gender_x_learning, type="spearman")
##      [,1] [,2]
## [1,] 1.00 0.03
## [2,] 0.03 1.00
## 
## n
##       [,1]  [,2]
## [1,] 13052 12880
## [2,] 12880 12880
## 
## P
##      [,1]  [,2] 
## [1,]       3e-04
## [2,] 3e-04
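
As a rough illustration of what the Spearman coefficient measures, the sketch below shows that it is the Pearson correlation computed on ranks (ties receive average ranks). The variables g and s are simulated stand-ins for a binary variable and a scale score; they are assumptions, not the class_example columns.

```r
# Simulated stand-ins (hypothetical): a 0/1 variable and a 1-28 scale score
set.seed(2)
g <- rbinom(200, 1, 0.5)
s <- round(runif(200, 1, 28))

# Spearman correlation computed directly
rho_spearman <- cor(g, s, method = "spearman")

# The same quantity as a Pearson correlation on the ranked values
rho_ranks <- cor(rank(g), rank(s))

all.equal(rho_spearman, rho_ranks)
```

Because a binary variable has only two distinct ranks, the Spearman and point-biserial results for such a pair tend to be close, as they are in this document.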

Point Biserial Correlation

*The point biserial correlation coefficient is a correlation coefficient used when one variable (e.g. Y) is dichotomous; Y can either be "naturally" dichotomous, like whether a coin lands heads or tails, or an artificially dichotomized variable.

*The point-biserial correlation is equivalent to calculating the Pearson correlation between a continuous and a dichotomous variable.

*You can just use the standard cor.test function in R, which will output the correlation, a 95% confidence interval, and an independent t-test with associated p-value.

cor.test(class_example$learning,class_example$gender)
## 
##  Pearson's product-moment correlation
## 
## data:  class_example$learning and class_example$gender
## t = 2.5608, df = 12878, p-value = 0.01045
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.005292235 0.039815076
## sample estimates:
##        cor 
## 0.02256038
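
The equivalence described above can be checked on simulated data. In this sketch (group and score are hypothetical variables, not taken from class_example), the t statistic reported by cor.test matches the one from an independent-samples t-test with pooled variance.

```r
# Simulated data (an assumption for illustration)
set.seed(3)
group <- rbinom(300, 1, 0.5)                           # dichotomous variable
score <- rnorm(300, mean = 12 + 0.5 * group, sd = 5)   # continuous variable

# Point-biserial correlation via the ordinary Pearson test
ct <- cor.test(score, group)

# Independent-samples t-test assuming equal variances
tt <- t.test(score[group == 1], score[group == 0], var.equal = TRUE)

# Same t statistic, degrees of freedom and p-value
all.equal(unname(ct$statistic), unname(tt$statistic))
all.equal(unname(ct$parameter), unname(tt$parameter))
all.equal(ct$p.value, tt$p.value)
```

The match holds only for the pooled-variance test (var.equal = TRUE); the default Welch test in t.test would generally give a slightly different statistic.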

Phi Correlation

# The Phi Coefficient is a measure of association between two binary variables (i.e. living/dead, black/white, success/failure).
# The phi coefficient is identical to the Pearson coefficient in the case of a 2 x 2 data set.
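# We can still use the cor.test function to calculate the phi coefficient.
# Note: learning is not binary here (it ranges from 1 to 28), so the result below is the same Pearson (point-biserial) correlation as above rather than a phi coefficient for a 2 x 2 table.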
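
For two truly binary variables, the identity between phi and the Pearson coefficient can be sketched as follows. The variables a and b are simulated and hypothetical; the 2 x 2 formula and the chi-square relation are the standard textbook definitions of phi.

```r
# Two simulated binary variables (an assumption for illustration)
set.seed(4)
a <- rbinom(500, 1, 0.5)
b <- rbinom(500, 1, 0.3 + 0.4 * a)   # b is more likely to be 1 when a is 1

# Phi as the Pearson correlation of the 0/1 codes
phi_pearson <- cor(a, b)

# Phi from the 2 x 2 table: (n00*n11 - n01*n10) / sqrt(product of the margins)
tab <- table(a, b)
cells <- matrix(as.numeric(tab), nrow = 2)   # plain numeric copy of the counts
phi_table <- (cells[1, 1] * cells[2, 2] - cells[1, 2] * cells[2, 1]) /
  sqrt(prod(rowSums(cells)) * prod(colSums(cells)))

# Magnitude of phi from the uncorrected chi-square statistic
chi <- chisq.test(tab, correct = FALSE)
phi_chisq <- sqrt(unname(chi$statistic) / sum(tab))

all.equal(phi_pearson, phi_table)
all.equal(abs(phi_pearson), phi_chisq)
```
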

gender_learning_df <- data.frame(table(class_example$gender, class_example$learning)) # creates a dataframe for gender by learning

names(gender_learning_df) <- c("Gender_","learning_", "Count_")

ggplot(data=gender_learning_df, aes(x=Gender_, y=Count_, fill=learning_)) + geom_bar(stat = "identity")

cor.test(class_example$gender,class_example$learning)
## 
##  Pearson's product-moment correlation
## 
## data:  class_example$gender and class_example$learning
## t = 2.5608, df = 12878, p-value = 0.01045
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.005292235 0.039815076
## sample estimates:
##        cor 
## 0.02256038
###Interpretation###

#Although we observe only a weak relationship between gender and learning (r=.03, Spearman; r=.02, point-biserial; r=.02, Phi), the relationship is significant at the alpha .05 level (p<.001 for Spearman; p=.01 for point-biserial and Phi). This may be due to the large number of observations in the dataset (n=12,880).

PART2: Partial Correlation

# Neither the partial nor the semi-partial correlation calculation allows missing values, so we need to prepare a clean data set with all missing values removed.
uncleaned_data <- cbind(class_example$learning, class_example$engagement,class_example$confidence)
cleaned_data <- na.omit(uncleaned_data)
names <- c("Learning", "Engagement","Confidence")
colnames(cleaned_data) <- names
cleaned_data <- as.data.frame(cleaned_data)

#Partial Correlation####
# Partial correlation is the correlation of two variables while controlling for a third or more other variables.
# Calculate the correlation between Learning score and Engagement while controlling for the variable Confidence


pcor.test(cleaned_data$Learning,cleaned_data$Engagement,cleaned_data$Confidence,method="pearson")
##    estimate p.value statistic     n gp  Method
## 1 0.4283649       0   53.4766 12728  1 pearson
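
One way to see what "controlling for" means here is to compute the partial correlation by hand. The sketch below uses simulated variables x, y and z (hypothetical stand-ins for Learning, Engagement and Confidence) and base R only. It removes z from both x and y with linear regressions, correlates the residuals, and compares the result with the usual first-order formula.

```r
# Simulated data (an assumption for illustration)
set.seed(5)
n <- 500
z <- rnorm(n)                          # control variable
x <- 0.6 * z + rnorm(n)                # first variable, related to z
y <- 0.5 * z + 0.4 * x + rnorm(n)      # second variable, related to z and x

# Residuals of x and y after regressing each on z
res_x <- resid(lm(x ~ z))
res_y <- resid(lm(y ~ z))
partial_resid <- cor(res_x, res_y)

# First-order partial correlation formula
r_xy <- cor(x, y)
r_xz <- cor(x, z)
r_yz <- cor(y, z)
partial_formula <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))

all.equal(partial_resid, partial_formula)
```
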

Research Question: What is the relationship between Learning Scores and Engagement after controlling for levels of Confidence in both variables?

Conclusion: After controlling for the variance in confidence levels, we observe a moderate positive relationship between Learning Scores and Engagement (r = .43, p < .001).

Semi Partial Correlation

#Semipartial Correlation####
# Similarly, the semi-partial correlation can be calculated with the spcor.test() function.
spcor.test(cleaned_data$Learning,cleaned_data$Engagement,cleaned_data$Confidence,method="pearson")
##    estimate p.value statistic     n gp  Method
## 1 0.3317006       0  39.66306 12728  1 pearson
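
The semi-partial correlation can be sketched the same way. This example assumes the convention that the control variable is removed from the second variable only, which is how I understand spcor.test to treat its arguments. The variables x, y and z are simulated and hypothetical; only base R is used.

```r
# Simulated data (an assumption for illustration)
set.seed(6)
n <- 500
z <- rnorm(n)                          # control variable
x <- 0.6 * z + rnorm(n)                # first variable, left untouched
y <- 0.5 * z + 0.4 * x + rnorm(n)      # second variable, z is removed from it

# Semi-partial: correlate x with the part of y not explained by z
res_y <- resid(lm(y ~ z))
semipartial_resid <- cor(x, res_y)

# Equivalent formula
r_xy <- cor(x, y)
r_xz <- cor(x, z)
r_yz <- cor(y, z)
semipartial_formula <- (r_xy - r_xz * r_yz) / sqrt(1 - r_yz^2)

# Partial correlation for comparison: z is removed from x as well
partial <- cor(resid(lm(x ~ z)), res_y)

all.equal(semipartial_resid, semipartial_formula)
abs(semipartial_resid) <= abs(partial)
```

The two coefficients differ only by the factor sqrt(1 - r_xz^2), so the semi-partial correlation is never larger in absolute value than the partial correlation.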

Research Question: What is the relationship between learning scores and engagement after removing the shared variance between confidence and engagement?

Conclusion: After removing the variance that confidence shares with engagement, we observe a low positive relationship between learning scores and engagement (r = .33, p < .001).

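Comparison: The partial correlation (r = .43) is larger than the semi-partial correlation (r = .33). This means that engagement is more predictive of learning outcomes when the effects of confidence are controlled for entirely, that is, removed from both variables rather than from engagement alone.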