library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
lab2data <- read_delim("Downloads/lab2data.txt", 
    delim = ";", escape_double = FALSE, trim_ws = TRUE)
## Rows: 395 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (16): age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
newdata <- filter(lab2data, G3 != 0)

In 2005-2006, the mathematics performance of Portuguese secondary education students was recorded, along with a number of other attributes for each student. Grades were assessed on a scale from 0 to 20. The year was split into three grading periods: G1 and G2 are the first two period grades, and G3 is the final grade. We loaded this data set as lab2data. It is apparent that the number of students with a grade of zero increased over the course of the year. It is very unusual for a student to receive a zero unless they are no longer taking the class or have stopped showing up. To account for this, we also created a new data set called newdata, which removes every student with a final grade of zero. In this paper, we analyze this data to see whether certain factors are associated with these students’ grades. Specifically, does drinking alcohol on the weekend or taking extra paid math classes affect a student’s performance?
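The growth in zero grades can be checked by counting zeros in each grading period. The sketch below uses a small made-up data frame (`toy`, a hypothetical stand-in with the same column names G1, G2, and G3); with the real data, lab2data would take its place.

```r
# Hypothetical toy data standing in for lab2data (values are made up)
toy <- data.frame(
  G1 = c(12, 8, 0, 15, 10, 9),
  G2 = c(11, 0, 0, 14, 10, 8),
  G3 = c(12, 0, 0, 15, 0, 9)
)

# Count the zero grades in each grading period
zero_counts <- colSums(toy[, c("G1", "G2", "G3")] == 0)
zero_counts

# Keep only students whose final grade is not zero, as done for newdata
toy_nonzero <- toy[toy$G3 != 0, ]
nrow(toy_nonzero)
```

If the counts rise from G1 to G3 in the real data, that would support filtering on G3 rather than on an earlier period.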

hist(newdata$G3, main = "Math Final Grades for Portuguese Secondary Education Students in 2005-06", xlab = "Student Grades", ylab = "Frequency", breaks = 15)

Above is a histogram of the final grades for the students in the newdata data set. By observation, the distribution of final grades appears approximately normal, with a bell-shaped curve. There were 56 students with a final grade of 10 and 47 students with a final grade of 11. On the other hand, only 7 students ended with a grade of 5, and only one student had a perfect grade of 20.

t.test(newdata$G3, mu = 10, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  newdata$G3
## t = 8.9199, df = 356, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 10
## 95 percent confidence interval:
##  11.24208      Inf
## sample estimates:
## mean of x 
##  11.52381

For Portuguese secondary education students in 2005-06, a final math grade of 10 or more is required to pass. With that in mind, we ran a hypothesis test to determine whether the average final grade was significantly greater than 10, that is, whether there was sufficient evidence that the average final grade was above the passing threshold. The null hypothesis is that the true mean is 10 or less, and the alternative is that it is greater than 10. As seen above, the p-value for the newdata data set is less than 2.2 x 10^-16. This gives us enough evidence to reject the null hypothesis. In the context of this scenario, we conclude that there is sufficient evidence that the average final grade was a passing grade greater than 10.
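To make the test above more concrete, here is a minimal sketch of how a one-sided, one-sample t statistic and its p-value can be computed by hand and compared with t.test(). The vector `grades` is simulated for illustration only and is not the real data; `mu0`, `se`, `t_stat`, and `p_value` are names chosen for this example.

```r
# Simulated grades for illustration only (not the real data)
set.seed(1)
grades <- round(rnorm(50, mean = 11.5, sd = 3))

mu0 <- 10                      # null value: the passing grade
n <- length(grades)
se <- sd(grades) / sqrt(n)     # standard error of the mean
t_stat <- (mean(grades) - mu0) / se

# One-sided p-value: probability of a t at least this large under H0
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)

# Compare with the built-in test
res <- t.test(grades, mu = mu0, alternative = "greater")
c(by_hand = t_stat, t_test = unname(res$statistic))
c(by_hand = p_value, t_test = res$p.value)
```

The two columns should agree, since t.test() uses the same formula for the one-sample case.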

t.test(newdata$G3, conf.level = .95)
## 
##  One Sample t-test
## 
## data:  newdata$G3
## t = 67.457, df = 356, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  11.18784 11.85978
## sample estimates:
## mean of x 
##  11.52381

For the same data set (newdata), we also constructed a 95% confidence interval for the true mean final grade. Based on this interval, we are 95% confident that the true mean final grade for students is between 11.19 and 11.86.
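As a rough check, the interval can be rebuilt from the numbers printed in the t.test() output above. The sketch assumes only that the reported t statistic is the sample mean divided by its standard error, which holds for a test against zero; the variable names are chosen for this example.

```r
# Values copied from the t.test output above
xbar <- 11.52381   # sample mean
t0   <- 67.457     # t statistic for H0: mean = 0
df   <- 356        # degrees of freedom

# For a test against 0, t = xbar / se, so se = xbar / t
se <- xbar / t0

# 95% two-sided interval: xbar +/- critical value * se
t_crit <- qt(0.975, df)
ci <- xbar + c(-1, 1) * t_crit * se
round(ci, 2)
```

Up to rounding in the printed inputs, this should reproduce the interval reported by t.test().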

alcStudents <- filter(newdata, Walc != 1)
noalcStudents <- filter(newdata, Walc == 1)

Above we created two new data sets that will help us determine whether drinking alcohol on the weekend is related to students’ final grades. A Walc variable was recorded for each student in the newdata data set. Walc measures a student’s weekend alcohol consumption on a scale of 1 to 5, with 1 meaning no alcohol. One of the data sets, noalcStudents, consists of all the students with a Walc value equal to 1. The students with a Walc value greater than 1 were put in the other data set, named alcStudents.

hist(alcStudents$G3, main = "Grades for Students Who Drink Alcohol", xlab = "Student Grades", ylab = "Frequency")

hist(noalcStudents$G3, main = "Grades for Students Who Don't Drink Alcohol", xlab = "Student Grades", ylab = "Frequency")

Here are histograms for the alcStudents and noalcStudents data sets. By observation, both data sets appear to be approximately normally distributed, with a bell-shaped curve. However, the noalcStudents histogram appears more spread out, while the alcStudents histogram has a more pronounced peak in the middle. In addition, the sample mean final grade is 11.13 for alcStudents and 12.19 for noalcStudents. This means that, in the sample data, students who drank alcohol on the weekend averaged a final grade of 11.13, while students who did not drink alcohol on the weekend averaged a final grade of 12.19. In the newdata sample specifically, students who did not drink alcohol on the weekend had a better final grade on average than those who did.
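Group means like the ones quoted above can be computed directly rather than read off a plot. The sketch below uses a small made-up data frame (`toy`, hypothetical values) with the same Walc and G3 column names, and base R's tapply() to summarize each group; the `drinks` column is introduced only for this example.

```r
# Hypothetical toy data standing in for newdata (values are made up)
toy <- data.frame(
  Walc = c(1, 1, 2, 3, 5, 1, 4, 2),
  G3   = c(13, 12, 11, 10, 9, 14, 12, 11)
)

# Label each student by whether Walc is above 1 (drinks on weekends)
toy$drinks <- ifelse(toy$Walc > 1, "drinks", "does not drink")

# Mean and standard deviation of G3 within each group
group_means <- tapply(toy$G3, toy$drinks, mean)
group_sds   <- tapply(toy$G3, toy$drinks, sd)
group_means
group_sds
```

The standard deviations give a numeric counterpart to the visual impression of which histogram is more spread out.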

t.test(alcStudents$G3, noalcStudents$G3, mu = 0, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  alcStudents$G3 and noalcStudents$G3
## t = -2.9198, df = 246.38, p-value = 0.003827
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.7725475 -0.3444637
## sample estimates:
## mean of x mean of y 
##  11.12946  12.18797

With that in mind, we then ran a hypothesis test to see whether we had sufficient evidence to conclude that mean final grades differ between students who drink alcohol on the weekend and those who do not. We used the alcStudents and noalcStudents data sets to run a two-sample t-test. The test produced a p-value of .003827. Since the p-value is less than any commonly used alpha, we reject the null hypothesis that the difference in means between these two groups is zero. Because the entire 95% confidence interval for the difference is below zero, there is significant evidence that students who did not drink alcohol on the weekends had a better final grade in math than those who did. Keep in mind that this claim applies specifically to students in the newdata data set, which consists of Portuguese secondary education students in 2005-06.
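The output above is labeled a Welch two-sample t-test, which does not assume equal variances in the two groups. The following sketch computes the Welch statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with t.test(). The vectors `x` and `y` are simulated stand-ins, not the real data.

```r
# Simulated groups for illustration only (not the real data)
set.seed(2)
x <- rnorm(40, mean = 11, sd = 3)    # stand-in for alcStudents$G3
y <- rnorm(25, mean = 12, sd = 3.5)  # stand-in for noalcStudents$G3

# Variance of each sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
df_w <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value
p_value <- 2 * pt(abs(t_stat), df = df_w, lower.tail = FALSE)

res <- t.test(x, y)  # Welch is the default (var.equal = FALSE)
c(by_hand = t_stat, t_test = unname(res$statistic))
c(by_hand = df_w, t_test = unname(res$parameter))
```

The non-integer degrees of freedom in the real output (246.38) come from this same approximation.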

extraClass <- filter(newdata, paid == "yes")
noextraClass <- filter(newdata, paid == "no")

nrow(extraClass) / nrow(newdata)
## [1] 0.4845938

Next, we created two new data sets that divide students based on whether or not they took paid extra classes. Students who did were put in a data set called extraClass, while those who did not were put in a data set called noextraClass. The split uses the variable “paid”, which is recorded for each student. Its value is “yes” for students who took paid extra classes and “no” for students who did not. The first thing we did with these new data sets was to find what proportion of students took these extra classes. Dividing the number of rows in the extraClass data set by the number of rows in the newdata data set gives a value of .48459, meaning that about 48.5% of students in the data set took paid extra classes. We find this proportion surprisingly high, as it is almost half of the students in the data set.

hist(extraClass$G3, main = "Grades for Students Who Took Paid Extra Classes", xlab = "Student Grades", ylab = "Frequency")

hist(noextraClass$G3, main = "Grades for Students Who Did Not Take Paid Extra Classes", xlab = "Student Grades", ylab = "Frequency")

Above are the histograms of the extraClass and noextraClass data sets. Visually, both data sets appear to be approximately normally distributed, with a bell-shaped curve. Furthermore, the two histograms look almost identical, with a slight difference in the number of students with failing grades (below 10). This is supported by the sample means of both data sets. The extraClass data set has a mean of 11.43 and the noextraClass data set has a mean of 11.61. This means that, in this data sample, the average final grade of students who took extra paid classes is comparable to the average final grade of students who did not.

t.test(extraClass$G3, noextraClass$G3, mu = 0, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  extraClass$G3 and noextraClass$G3
## t = -0.54661, df = 354.09, p-value = 0.585
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8569878  0.4842183
## sample estimates:
## mean of x mean of y 
##  11.42775  11.61413
nrow(alcStudents)
## [1] 224
nrow(noalcStudents)
## [1] 133

We then ran a two-sided t-test, similar to the one for weekend alcohol consumption, to see whether there is a significant difference in mean final grades between those who took paid extra classes and those who did not. We used the extraClass and noextraClass data sets to answer this question. The test produced a p-value of .585, which is much greater than any commonly used alpha value. Therefore, we fail to reject the null hypothesis: there is not enough evidence of a difference in mean final grades between those who took paid extra classes and those who did not. This means the average final grade in math for students in the data set may not be affected by whether or not they take paid classes. The difference between the data sets is not large enough to draw any conclusions.

In conclusion, we analyzed a data set of Portuguese secondary education students in 2005-06 to determine whether different factors affected their final grades in math. We looked at weekend alcohol consumption and extra paid classes as potential variables. In each case we examined histograms and ran a two-sided t-test to help us see whether the variable could be useful for predicting a student’s final grade. Based on the data set, we conclude that weekend alcohol consumption is a relevant factor when attempting to predict a student’s final grade: there is a significant association between final grades and weekend alcohol consumption. For extra paid classes, there was not enough evidence to conclude that the variable is useful for predicting a student’s final grade. One interesting observation is that a little under two thirds of the students in the data set (224 of 357, about 63%) drank alcohol on the weekend, so weekend drinking was very common among the students in this specific data set.
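The share of weekend drinkers mentioned in the conclusion follows from the two nrow() results printed earlier. A minimal sketch of that arithmetic, with variable names chosen for this example:

```r
# Group sizes reported by nrow() in the output above
n_alc   <- 224  # students with Walc above 1
n_noalc <- 133  # students with Walc equal to 1

# Proportion of students in newdata who drink on weekends
prop_alc <- n_alc / (n_alc + n_noalc)
round(prop_alc, 3)
```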