Noori Selina R Final Project

Reading data of student GPA scores in relation to number of study hours per week

Question: Does the number of hours students study per week have a significant impact on their GPA scores?

Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set.

After importing the data set, and creating the summary, it can be concluded that the GPA values range from a minimum of 2.60 to a maximum of 4.30. The mean GPA is approximately 3.586, and the median GPA is 3.620. The data seems to be skewed slightly to the left, as the mean is slightly lower than the median.The study hours in the dataset range from a minimum of 2.00 to a maximum of 69.00, with a median study hours value of 15.00. The mean study hours is approximately 17.48. The data appears to be positively skewed, as the mean is higher than the median.

# Load the required libraries
library(readr)

# Read the dataset
url <- ("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/gpa_study_hours.csv")
gpa_study_hours <- read_csv(url, show_col_types = FALSE)

## New names:
## • `` -> `...1`

#Use the summary function to gain an overview of the data set
summary(gpa_study_hours)

##       ...1          gpa         study_hours   
##  Min.   :  1   Min.   :2.600   Min.   : 2.00  
##  1st Qu.: 49   1st Qu.:3.400   1st Qu.:10.00  
##  Median : 97   Median :3.620   Median :15.00  
##  Mean   : 97   Mean   :3.586   Mean   :17.48  
##  3rd Qu.:145   3rd Qu.:3.800   3rd Qu.:20.00  
##  Max.   :193   Max.   :4.300   Max.   :69.00

Providing the first 10 rows of the dataset for context.

# Display the first 10 rows of the dataset
head(gpa_study_hours, 10)

Data wrangling: Please perform some basic transformations. A subset of the data has been created, to show data based on students that have studied more than 20 hours per week. The column names have been renamed.

# Create a subset of students who study more than 20 hours
data_subset <- subset(gpa_study_hours, study_hours > 20)

# Rename the columns in the data_subset dataframe
colnames(data_subset) <- c("Student ID", "GPA Score", "Hours Studied")

# Display the first few rows of the data_subset with the renamed columns
head(data_subset, n = 10)

Checking the summary of the data subset for analysis.

# Display the summary statistics for the data_subset
summary(data_subset)

##    Student ID       GPA Score     Hours Studied 
##  Min.   :  2.00   Min.   :3.022   Min.   :21.0  
##  1st Qu.: 47.50   1st Qu.:3.500   1st Qu.:25.0  
##  Median : 97.00   Median :3.690   Median :30.0  
##  Mean   : 90.52   Mean   :3.622   Mean   :33.7  
##  3rd Qu.:130.25   3rd Qu.:3.800   3rd Qu.:39.0  
##  Max.   :189.00   Max.   :4.000   Max.   :69.0

Graphics: Please make sure to display at least one scatter plot, box plot and histogram. A scatter plot has been created to show the relationship between study hours, and student GPA.

# Load the required libraries
library(ggplot2)

# Create a scatter plot with a smoother
ggplot(gpa_study_hours, aes(x = study_hours, y = gpa)) +
  geom_point() +                      # Scatter plot points
  geom_smooth(method = "lm", se = FALSE) +  # Smoother (linear model) without confidence intervals
  labs(x = "Study Hours", y = "GPA Score") + # Labels for x and y axes
  ggtitle("Relationship between Study Hours and GPA")  # Plot title

## `geom_smooth()` using formula = 'y ~ x'

Creating a box plot, which shows important values of the data set.

# Load the required libraries
library(ggplot2)

# Create a box plot to show the distribution of GPA scores
ggplot(gpa_study_hours, aes(y = gpa)) +
  geom_boxplot(color = "deepskyblue", fill = "lightblue") +
  labs(y = "GPA Score") +
  ggtitle("Distribution of GPA Scores") +
  theme_minimal()

Creating a histogram, this shows how the GPA scores are distributed.

# Load the required libraries
library(ggplot2)

# Create a histogram to show the distribution of GPA scores
ggplot(gpa_study_hours, aes(x = gpa)) +
  geom_histogram(binwidth = 0.1, color = "deepskyblue", fill = "lightblue") +
  labs(x = "GPA Score", y = "Frequency") +
  ggtitle("Distribution of GPA Scores") +
  theme_minimal()

Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

Again, the meaningful question for analysis that we can address using the provided data is: “Does the number of hours students study per week have a significant impact on their GPA scores?”

Summary: The data exploration and analysis were conducted on a dataset that includes student GPA scores in relation to the number of study hours per week. The summary statistics revealed that the GPA values ranged from 2.60 to 4.30, with a mean of approximately 3.586. The study hours ranged from 2.00 to 69.00, with a mean of about 17.48. The scatter plot displayed a slightly positive relationship between study hours and GPA, indicating that, on average, higher study hours corresponded to slightly higher GPAs. However, the scatter of points around the trend line highlighted the variability in GPA scores at different study hour levels, possibly influenced by individual differences, study habits, and other factors affecting academic performance. The box plot represents the distribution of GPA scores, showcasing the central tendency, spread, and presence of potential outliers, while the histogram visually displays the frequency and pattern of GPA scores within different score ranges. By looking at these graphical representations, we can conclude that the greater the study hours, the greater the GPA of the students.

Noori Selina R Final Project

2023-07-23