Overview

In this project I will analyze the behavior of college students and the factors that affect their grade point averages.


Introduction

The gpa data set I will use is from the OpenIntro library. It comes from a survey that was conducted at Duke University. Fifty-five students responded with their GPA, the number of hours they spend studying each week on average, the number of hours they sleep each night on average, how many nights they go out each week on average, and their gender. I want to see whether these data support the theory that students who get more sleep, study a moderate amount (neither an excessive amount nor too little), and go out only a few nights a week do better in school.

Explanatory Variable      Response Variable
Amount of sleep           Student’s GPA
Number of nights out      Student’s GPA
Amount of studying        Student’s GPA
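
The theory about studying is not a straight line: it says GPA should be highest at a moderate number of study hours. One way to let a linear model express that shape is to add a squared term. The sketch below is only an illustration on simulated data (not the survey responses); the peak at 25 hours, the noise level, and the names sim, linear_fit, and quad_fit are all assumptions made for the example.

```r
# Hypothetical illustration: simulated data, NOT the survey responses.
set.seed(1)
n <- 200
studyweek <- runif(n, min = 2, max = 50)

# Assumed "true" curve for the simulation: GPA peaks near 25 study hours
gpa_sim <- 3.8 - 0.001 * (studyweek - 25)^2 + rnorm(n, sd = 0.1)
sim <- data.frame(gpa = gpa_sim, studyweek = studyweek)

# Straight-line model versus a model with a squared term
linear_fit <- lm(gpa ~ studyweek, data = sim)
quad_fit   <- lm(gpa ~ studyweek + I(studyweek^2), data = sim)

# Study hours at the fitted peak of the parabola: -b1 / (2 * b2)
b <- coef(quad_fit)
peak_hours <- unname(-b["studyweek"] / (2 * b["I(studyweek^2)"]))

r2 <- c(linear    = summary(linear_fit)$r.squared,
        quadratic = summary(quad_fit)$r.squared)
print(r2)
print(peak_hours)
```

A negative coefficient on the squared term is what "too little or too much studying both hurt" would look like in a model of this kind.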

Exploring the Data

The first step is to load any necessary libraries (in this case, the OpenIntro library) and import the data from a CSV file in an R chunk.

# Load necessary library
library(openintro)
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
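# Import the survey data; stringsAsFactors = TRUE keeps gender as a factor
# so that summary() reports counts per category (not the default in R >= 4.0)
gpa <- read.csv("~/Desktop/gpa.csv", stringsAsFactors = TRUE)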

Before modeling, it is important to get a basic understanding of the data. The summary function reports the range, quartiles, and mean of every numeric variable and the counts for every categorical variable, which helps determine which types of analysis are appropriate.

summary(gpa)
##       gpa          studyweek       sleepnight         out       
##  Min.   :2.900   Min.   : 2.00   Min.   :5.000   Min.   :0.000  
##  1st Qu.:3.400   1st Qu.:10.00   1st Qu.:6.000   1st Qu.:1.250  
##  Median :3.650   Median :15.00   Median :7.000   Median :2.000  
##  Mean   :3.600   Mean   :19.15   Mean   :7.064   Mean   :2.109  
##  3rd Qu.:3.825   3rd Qu.:26.50   3rd Qu.:8.000   3rd Qu.:3.000  
##  Max.   :4.670   Max.   :50.00   Max.   :9.000   Max.   :4.000  
##     gender  
##  female:43  
##  male  :12  
##             
##             
##             
## 

Multiple Regression Analysis

# Fit a linear model of GPA on weekly study hours, nightly sleep hours, and nights out.
lm1 <- lm(gpa ~ studyweek + sleepnight + out, data = gpa)

# Let's see what proportion of the variation in GPA these three variables explain.
summary(lm1)$r.squared
## [1] 0.02115869

Our multiple R-squared is 0.02, so only about 2% of the variation in students’ GPAs can be explained by how much they study, sleep, and go out.
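
To make the meaning of R-squared concrete, the sketch below computes it by hand as one minus the ratio of unexplained variation to total variation, and checks the result against the value reported by summary(). It uses simulated data with the same column names as the gpa data set; the data frame sim, the assumed small effect of sleep, and the noise level are made up for illustration and are not the survey data.

```r
# Hypothetical illustration: simulated data with the same column names.
set.seed(42)
n <- 55
sim <- data.frame(
  studyweek  = round(runif(n, 2, 50)),
  sleepnight = round(runif(n, 5, 9)),
  out        = round(runif(n, 0, 4))
)
# GPA is mostly noise here, with a small assumed effect of sleep
sim$gpa <- 3.4 + 0.03 * sim$sleepnight + rnorm(n, sd = 0.3)

fit <- lm(gpa ~ studyweek + sleepnight + out, data = sim)

ss_res <- sum(residuals(fit)^2)              # variation left unexplained by the model
ss_tot <- sum((sim$gpa - mean(sim$gpa))^2)   # total variation in GPA
r2_manual <- 1 - ss_res / ss_tot

# The hand calculation should agree with the value from summary()
print(all.equal(r2_manual, summary(fit)$r.squared))
```

An R-squared near zero therefore means the model's predictions are barely better than predicting the mean GPA for every student.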


NOT SO FAST!

It could very well be that these explanatory variables appear so weak because there are only 55 observations and they all come from students at the same university. I know students at Endicott College, Boston College, UMass Lowell, Salem State University, Fitchburg State University, and North Shore Community College, so I have decided to create a survey to collect some of my own data from current students at these colleges. I will add these data to the pre-existing data and see if that makes a difference.


Exploring the New Data

I will now import the new dataset which combines the OpenIntro data and the data I collected myself. Let’s store this.

gpa2 <- read.csv("~/Downloads/gpa_2.0 - Sheet1.csv")
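
The combined file above was assembled outside of R. As a rough sketch of how two such sources could instead be stacked inside R, the example below uses rbind and adds a column recording where each row came from. The data frames duke and survey and all of their values are invented stand-ins, and the source column is an assumption; it is not part of the actual gpa_2.0 file.

```r
# Hypothetical stand-ins for the two sources (values invented for illustration)
duke <- data.frame(gpa = c(3.9, 3.5),
                   studyweek = c(20, 10),
                   sleepnight = c(7, 6),
                   out = c(2, 3))
survey <- data.frame(gpa = c(3.2, 3.7, 3.0),
                     studyweek = c(15, 30, 5),
                     sleepnight = c(8, 7, 6),
                     out = c(1, 0, 4))

# Record the origin of each row before stacking the two data frames
duke$source   <- "openintro"
survey$source <- "survey"
combined <- rbind(duke, survey)   # requires identical column names

print(table(combined$source))
```

Keeping a source column makes it possible to check later whether the two groups of students differ.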

Now let’s see if the strength of the linear relationship has increased between our explanatory variables and our response variable.

GPA dataset

cor(gpa$studyweek, gpa$gpa)
## [1] 0.04160403
cor(gpa$sleepnight, gpa$gpa)
## [1] 0.06098308
cor(gpa$out, gpa$gpa)
## [1] 0.1358026

GPA 2 dataset

cor(gpa2$studyweek, gpa2$gpa)
## [1] 0.03012842
cor(gpa2$sleepnight, gpa2$gpa)
## [1] 0.05464587
cor(gpa2$out, gpa2$gpa)
## [1] -0.1945154

It appears the relationship between our explanatory variables and response variable is still weak with the new data (and the correlation for nights out has changed sign, from 0.14 to -0.19), but let’s make sure…
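
The cor function only reports the size of a correlation. To judge whether a correlation is statistically significant, base R provides cor.test, which adds a p-value and a confidence interval. The sketch below runs it on simulated data of the same size as the combined data set; the vectors out_sim and gpa_sim and the assumed effect of -0.05 GPA points per night out are hypothetical, not the survey data.

```r
# Hypothetical illustration: simulated data, NOT the survey responses.
set.seed(7)
n <- 87
out_sim <- sample(0:4, n, replace = TRUE)
gpa_sim <- 3.6 - 0.05 * out_sim + rnorm(n, sd = 0.4)

# Pearson correlation test (the default method of cor.test)
test <- cor.test(out_sim, gpa_sim)

print(test$estimate)   # sample correlation, same value as cor(out_sim, gpa_sim)
print(test$p.value)    # p-value for the null hypothesis of zero correlation
print(test$conf.int)   # 95% confidence interval for the correlation
```

With the real data, the same call would take the form cor.test(gpa2$out, gpa2$gpa).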


Multiple Regression Analysis With More Data

lm2 <- lm(gpa ~ studyweek + sleepnight + out, data = gpa2)

summary(lm2)$r.squared
## [1] 0.05792935

Now our multiple R-squared is 0.0579, so about 6% of the variation in students’ GPAs can be explained by how much they study, sleep, and go out. That is slightly higher than our original multiple R-squared, but the relationship is still weak.
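
Both R-squared values can be put in context using only the numbers reported above. The sketch below computes the adjusted R-squared, which penalizes the model for using three predictors, and the overall F statistic with its p-value. It assumes that no rows were dropped for missing values, so that the models used n = 55 and n = 87 observations; the helper functions adj_r2 and f_stat are written for this example.

```r
# Adjusted R-squared and overall F statistic from R-squared, n, and p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
f_stat <- function(r2, n, p) (r2 / p) / ((1 - r2) / (n - p - 1))

models <- data.frame(
  model = c("lm1", "lm2"),
  r2    = c(0.02115869, 0.05792935),  # values reported by summary() above
  n     = c(55, 87)                   # assumes no rows dropped for missing values
)
p <- 3  # number of predictors: studyweek, sleepnight, out

models$adj_r2  <- adj_r2(models$r2, models$n, p)
models$f       <- f_stat(models$r2, models$n, p)
models$p_value <- pf(models$f, df1 = p, df2 = models$n - p - 1,
                     lower.tail = FALSE)
print(models)
```

Under these assumptions the adjusted R-squared is about -0.04 for the first model and about 0.02 for the second, and neither overall F test is significant at the 5% level.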


Conclusions

Based on the data from our textbook (the gpa.csv data set) and the data I personally collected with a Google Form, sleep, studying, and nights out are not strong predictors of a college student’s GPA. I came to this conclusion by using multiple regressions, Pearson’s correlation coefficient, and the multiple R-squared. I really did not see any explanatory power from the “explanatory” variables.

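Out of the three variables, the number of nights out has the strongest correlation with GPA (0.14 in the original data and -0.19 in the combined data), although the relationship is weak and its direction changes between the two data sets. Sleep and studying are weaker still, with correlations below 0.07 in both data sets.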


Limitations

After making my first multiple regression, the limitation seemed to be that I was only using data from Duke University students, which does not provide a range of data. However, after I collected data from students at other colleges too, the results stayed similar. I am assuming this is because I still had only 87 observations in my data set. Perhaps if my sample size were much larger, and I had been able to survey hundreds or even thousands of college students, the results would be different. If I had a social media account, it could have been easier to collect more than 32 responses. Alternatively, maybe sleep, studying, and nights out simply don’t determine a student’s performance. Maybe my theory was never going to be supported because it just isn’t accurate and I was believing stereotypes.
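
Whether a much larger sample would change the picture can be explored with a small simulation. The sketch below assumes, purely for illustration, that the true population R-squared is 0.05, and then compares the R-squared obtained from samples of different sizes. The function simulate_r2, the assumed value of 0.05, and the sample sizes are hypothetical choices, not results from the survey.

```r
# Hypothetical simulation: how sample size affects the estimated R-squared
set.seed(2018)

simulate_r2 <- function(n, true_r2 = 0.05) {
  x <- rnorm(n)
  # Noise is scaled so that the population R-squared equals true_r2
  y <- sqrt(true_r2) * x + sqrt(1 - true_r2) * rnorm(n)
  summary(lm(y ~ x))$r.squared
}

sizes <- c(87L, 1000L, 10000L)

# For each sample size, repeat the study 200 times and summarize R-squared
results <- sapply(sizes, function(n) {
  r2 <- replicate(200, simulate_r2(n))
  c(mean = mean(r2), sd = sd(r2))
})
colnames(results) <- as.character(sizes)
print(round(results, 3))
```

In a setup like this, a larger sample mainly makes the estimate more stable from one study to the next; it does not by itself make a weak relationship strong.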


This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College. The course was led by Professor Billy Jackson.
Project Name: Are There Consequences to the Behaviors of a College Kid?
Student Name: Christina Pace
Semester: Spring 2018