In this project I will analyze the behavior of college students and the factors that affect their grade point averages.
The gpa data set I will use comes from the OpenIntro library. It is based on a survey conducted at Duke University, in which 55 students reported their GPA, the average number of hours they study each week, the average number of hours they sleep each night, the average number of nights they go out each week, and their gender. I want to see whether these data support the theory that students who get more sleep, study a moderate amount (neither excessive nor too little), and go out only a few nights a week do better in school.
| Explanatory Variables | Response Variable |
|---|---|
| Amount of sleep | Student’s GPA |
| Number of nights out | Student’s GPA |
| Amount of studying | Student’s GPA |
The first step is to load any necessary libraries (in this case we need the OpenIntro library) and import the data, which I do here in an R chunk.
# Load necessary library
library(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
##
## cars, trees
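# Import the survey data; stringsAsFactors = TRUE keeps gender as a factor in R >= 4.0
gpa <- read.csv("~/Desktop/gpa.csv", stringsAsFactors = TRUE)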
It is important to get a basic understanding of your data first. The summary function helps because it reports general information about every variable in the data set, which in turn helps determine which types of analysis are appropriate.
summary(gpa)
##       gpa          studyweek       sleepnight         out
##  Min.   :2.900   Min.   : 2.00   Min.   :5.000   Min.   :0.000
##  1st Qu.:3.400   1st Qu.:10.00   1st Qu.:6.000   1st Qu.:1.250
##  Median :3.650   Median :15.00   Median :7.000   Median :2.000
##  Mean   :3.600   Mean   :19.15   Mean   :7.064   Mean   :2.109
##  3rd Qu.:3.825   3rd Qu.:26.50   3rd Qu.:8.000   3rd Qu.:3.000
##  Max.   :4.670   Max.   :50.00   Max.   :9.000   Max.   :4.000
##     gender
##  female:43
##  male  :12
# Let's make a model.
lm1 <- lm(gpa ~ studyweek + sleepnight + out, data = gpa)
# Let's see what proportion of the variation in GPA is explained by these three variables.
summary(lm1)$r.squared
## [1] 0.02115869
Our multiple R-squared is about 0.02, so only about 2% of the variation in students’ GPAs can be explained by how much they study, sleep, and go out.
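To make the meaning of R-squared concrete, here is a minimal sketch that recomputes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. It runs on a simulated stand-in data frame (`gpa_sim`, a hypothetical name with made-up values) that only borrows the column names of the survey data, so its numbers are not the survey results.

```r
# Simulated stand-in for the survey data (hypothetical values, NOT the real gpa.csv)
set.seed(1)
n <- 55
gpa_sim <- data.frame(
  studyweek  = round(runif(n, 2, 50)),
  sleepnight = round(runif(n, 5, 9)),
  out        = round(runif(n, 0, 4))
)
gpa_sim$gpa <- 3.6 + 0.002 * gpa_sim$studyweek + rnorm(n, sd = 0.3)

fit <- lm(gpa ~ studyweek + sleepnight + out, data = gpa_sim)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((gpa_sim$gpa - mean(gpa_sim$gpa))^2)
r2_manual <- 1 - ss_res / ss_tot

# Check that the manual value agrees with the one reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)
```

Replacing `gpa_sim` with `gpa` should reproduce the value printed by `summary(lm1)$r.squared`.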
It could very well be that these explanatory variables are so weak because there are only 55 observations, all of them from students at the same university. I know students at Endicott College, Boston College, UMass Lowell, Salem State University, Fitchburg State University, and North Shore Community College, so I have decided to create a survey to collect some of my own data from current students at these colleges. I will add this data to the pre-existing data and see if that makes a difference.
I will now import the new data set, which combines the OpenIntro data with the data I collected myself, and store it as gpa2.
gpa2 <- read.csv("~/Downloads/gpa_2.0 - Sheet1.csv")
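The code above only reads in the file that already contains both sets of responses. As a rough sketch of how the two sources could be stacked in R instead, assuming the new responses sit in a data frame with the same columns (`duke`, `new_responses`, and `make_rows` are hypothetical names, and the values are placeholders rather than the survey answers):

```r
# Placeholder data frames with the same columns as the survey data
# (hypothetical values for illustration only)
make_rows <- function(n, seed) {
  set.seed(seed)
  data.frame(
    gpa        = round(runif(n, 2.9, 4.0), 2),
    studyweek  = round(runif(n, 2, 50)),
    sleepnight = round(runif(n, 5, 9)),
    out        = round(runif(n, 0, 4)),
    gender     = sample(c("female", "male"), n, replace = TRUE)
  )
}
duke          <- make_rows(55, seed = 1)  # stands in for the OpenIntro data
new_responses <- make_rows(32, seed = 2)  # stands in for the newly collected data

# Record where each row came from, then stack the two data frames
duke$source          <- "openintro"
new_responses$source <- "survey"
combined <- rbind(duke, new_responses)

nrow(combined)          # 55 + 32 rows
table(combined$source)
```

Keeping a `source` column makes it possible to compare the two groups later or to check whether the new responses behave differently from the original ones.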
GPA dataset: correlation of each explanatory variable with GPA
cor(gpa$studyweek, gpa$gpa)
## [1] 0.04160403
cor(gpa$sleepnight, gpa$gpa)
## [1] 0.06098308
cor(gpa$out, gpa$gpa)
## [1] 0.1358026
GPA 2 dataset: correlation of each explanatory variable with GPA
cor(gpa2$studyweek, gpa2$gpa)
## [1] 0.03012842
cor(gpa2$sleepnight, gpa2$gpa)
## [1] 0.05464587
cor(gpa2$out, gpa2$gpa)
## [1] -0.1945154
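The separate `cor()` calls above can be condensed into one, and `cor.test()` adds a p-value for each correlation, which is what a statement about significance would rest on. A minimal sketch on simulated stand-in data (`dat` is a hypothetical data frame with made-up values; swap in `gpa` or `gpa2` to run it on the real data):

```r
# Simulated stand-in data (hypothetical values, NOT the survey responses)
set.seed(42)
n <- 87
dat <- data.frame(
  studyweek  = round(runif(n, 2, 50)),
  sleepnight = round(runif(n, 5, 9)),
  out        = round(runif(n, 0, 4))
)
dat$gpa <- 3.6 + rnorm(n, sd = 0.3)

predictors <- c("studyweek", "sleepnight", "out")

# Pearson correlation of each predictor with GPA, in one call
cors <- sapply(dat[predictors], cor, y = dat$gpa)

# p-value for the test that each true correlation is zero
p_values <- sapply(dat[predictors], function(x) cor.test(x, dat$gpa)$p.value)

round(cbind(correlation = cors, p_value = p_values), 3)
```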
With the new data, the linear relationships between our explanatory variables and the response variable still appear weak (every correlation is below 0.2 in absolute value), but let’s make sure…
lm2 <- lm(gpa ~ studyweek + sleepnight + out, data = gpa2)
summary(lm2)$r.squared
## [1] 0.05792935
Now our multiple R-squared is 0.0579, so about 6% of the variation in students’ GPAs can be explained by how much they study, sleep, and go out. That is slightly higher than our original multiple R-squared, but the relationship is still weak.
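Because R-squared can never decrease when predictors are added, adjusted R-squared is often reported alongside it. A quick sketch using the two R-squared values reported above, assuming all 55 and 87 rows were used in the fits (no rows dropped for missing values) and three predictors; `adjusted_r2` is a hypothetical helper name:

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
# n = number of observations, p = number of predictors
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj1 <- adjusted_r2(0.02115869, n = 55, p = 3)  # original data
adj2 <- adjusted_r2(0.05792935, n = 87, p = 3)  # combined data

round(c(original = adj1, combined = adj2), 4)
# original: -0.0364, combined: 0.0239
```

Under these assumptions the adjusted value for the original data is slightly negative, another sign that the three predictors explain essentially none of the variation. `summary(lm1)$adj.r.squared` and `summary(lm2)$adj.r.squared` return the exact figures directly from the fitted models.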
Based on the data from our textbook (the gpa.csv data set) and the data I personally collected with a Google Form, sleep, studying, and nights out are not strong predictors of a college student’s GPA. I came to this conclusion using multiple linear regression, Pearson’s correlation coefficient, and the multiple R-squared. I did not see any real explanatory power from the “explanatory” variables.
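Judging by the correlations above, nights out has the strongest linear association with GPA of the three variables (0.14 in the original data and -0.19 in the combined data), although the sign flips between the two data sets; the correlations for sleep and studying are all below 0.07.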
After fitting my first multiple regression, the main limitation seemed to be that I was only using data from Duke University students, which does not provide much range. However, after I added data from students at other colleges, the results stayed much the same. One possible reason is that I still had only 87 observations in my data set. Perhaps if my sample were much larger, with hundreds or even thousands of college students, the results would be different. If I had a social media account, it might have been easier to collect more than 32 responses. Alternatively, maybe sleep, studying, and nights out simply don’t determine a student’s performance. Maybe my theory was never going to be supported by the data because it just isn’t accurate and I was believing stereotypes.
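One way to think about the sample-size question is a small simulation. The sketch below invents a population in which study hours have only a tiny true effect on GPA (every number is made up for illustration, and `simulate_students` and `r2_for_sample` are hypothetical helpers), then fits the same kind of model on samples of 87 and 5,000 students.

```r
set.seed(2018)

# Hypothetical population: GPA depends only very weakly on study hours
simulate_students <- function(n) {
  studyweek <- runif(n, 2, 50)
  gpa <- 3.5 + 0.002 * studyweek + rnorm(n, sd = 0.35)
  data.frame(gpa = gpa, studyweek = studyweek)
}

# Fit a simple regression on a fresh sample of size n and return its R-squared
r2_for_sample <- function(n) {
  summary(lm(gpa ~ studyweek, data = simulate_students(n)))$r.squared
}

r2_small <- r2_for_sample(87)
r2_large <- r2_for_sample(5000)

c(n_87 = r2_small, n_5000 = r2_large)
```

In a setup like this, a larger sample gives a more precise estimate of R-squared but does not turn a weak relationship into a strong one, which fits the second explanation offered above.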
This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College. The course was led by Professor Billy Jackson.
Project Name: Are There Consequences to the Behaviors of a College Kid?
Student Name: Christina Pace
Semester: Spring 2018