Reading data of student GPA scores in relation to number of study hours per week
Question: Does the number of hours students study per week have a significant impact on their GPA scores?
After importing the data set, and creating the summary, it can be concluded that the GPA values range from a minimum of 2.60 to a maximum of 4.30. The mean GPA is approximately 3.586, and the median GPA is 3.620. The data seems to be skewed slightly to the left, as the mean is slightly lower than the median.The study hours in the dataset range from a minimum of 2.00 to a maximum of 69.00, with a median study hours value of 15.00. The mean study hours is approximately 17.48. The data appears to be positively skewed, as the mean is higher than the median.
# Load the required libraries
library(readr)
# Read the dataset
url <- ("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/gpa_study_hours.csv")
gpa_study_hours <- read_csv(url, show_col_types = FALSE)
## New names:
## • `` -> `...1`
#Use the summary function to gain an overview of the data set
summary(gpa_study_hours)
## ...1 gpa study_hours
## Min. : 1 Min. :2.600 Min. : 2.00
## 1st Qu.: 49 1st Qu.:3.400 1st Qu.:10.00
## Median : 97 Median :3.620 Median :15.00
## Mean : 97 Mean :3.586 Mean :17.48
## 3rd Qu.:145 3rd Qu.:3.800 3rd Qu.:20.00
## Max. :193 Max. :4.300 Max. :69.00
Providing the first 10 rows of the dataset for context.
# Display the first 10 rows of the dataset
head(gpa_study_hours, 10)
# Create a subset of students who study more than 20 hours
data_subset <- subset(gpa_study_hours, study_hours > 20)
# Rename the columns in the data_subset dataframe
colnames(data_subset) <- c("Student ID", "GPA Score", "Hours Studied")
# Display the first few rows of the data_subset with the renamed columns
head(data_subset, n = 10)
Checking the summary of the data subset for analysis.
# Display the summary statistics for the data_subset
summary(data_subset)
## Student ID GPA Score Hours Studied
## Min. : 2.00 Min. :3.022 Min. :21.0
## 1st Qu.: 47.50 1st Qu.:3.500 1st Qu.:25.0
## Median : 97.00 Median :3.690 Median :30.0
## Mean : 90.52 Mean :3.622 Mean :33.7
## 3rd Qu.:130.25 3rd Qu.:3.800 3rd Qu.:39.0
## Max. :189.00 Max. :4.000 Max. :69.0
# Load the required libraries
library(ggplot2)
# Create a scatter plot with a smoother
ggplot(gpa_study_hours, aes(x = study_hours, y = gpa)) +
geom_point() + # Scatter plot points
geom_smooth(method = "lm", se = FALSE) + # Smoother (linear model) without confidence intervals
labs(x = "Study Hours", y = "GPA Score") + # Labels for x and y axes
ggtitle("Relationship between Study Hours and GPA") # Plot title
## `geom_smooth()` using formula = 'y ~ x'
Creating a box plot, which shows important values of the data set.
# Load the required libraries
library(ggplot2)
# Create a box plot to show the distribution of GPA scores
ggplot(gpa_study_hours, aes(y = gpa)) +
geom_boxplot(color = "deepskyblue", fill = "lightblue") +
labs(y = "GPA Score") +
ggtitle("Distribution of GPA Scores") +
theme_minimal()
Creating a histogram, this shows how the GPA scores are distributed.
# Load the required libraries
library(ggplot2)
# Create a histogram to show the distribution of GPA scores
ggplot(gpa_study_hours, aes(x = gpa)) +
geom_histogram(binwidth = 0.1, color = "deepskyblue", fill = "lightblue") +
labs(x = "GPA Score", y = "Frequency") +
ggtitle("Distribution of GPA Scores") +
theme_minimal()
Again, the meaningful question for analysis that we can address using the provided data is: “Does the number of hours students study per week have a significant impact on their GPA scores?”
Summary: The data exploration and analysis were conducted on a dataset that includes student GPA scores in relation to the number of study hours per week. The summary statistics revealed that the GPA values ranged from 2.60 to 4.30, with a mean of approximately 3.586. The study hours ranged from 2.00 to 69.00, with a mean of about 17.48. The scatter plot displayed a slightly positive relationship between study hours and GPA, indicating that, on average, higher study hours corresponded to slightly higher GPAs. However, the scatter of points around the trend line highlighted the variability in GPA scores at different study hour levels, possibly influenced by individual differences, study habits, and other factors affecting academic performance. The box plot represents the distribution of GPA scores, showcasing the central tendency, spread, and presence of potential outliers, while the histogram visually displays the frequency and pattern of GPA scores within different score ranges. By looking at these graphical representations, we can conclude that the greater the study hours, the greater the GPA of the students.