October 18, 2024
Load Students & Schools datasets:
students <- read.csv("Students.csv", stringsAsFactors = TRUE)
schools <- read.csv("Schools.csv", stringsAsFactors = TRUE)
Please note: this data contains many “NA”, or missing values. You should just ignore these until question 5, which asks specifically about them. In particular, do not bother subsetting them out for questions 1-4.
students$EnglishClassesPerWeek=sub("classes","",students$EnglishClassesPerWeek)
students$EnglishClassesPerWeek=sub("class","",students$EnglishClassesPerWeek)
students$EnglishClassesPerWeek=as.numeric(students$EnglishClassesPerWeek)
## Warning: NAs introduced by coercion
students$LengthEnglishClass=sub("min","",students$LengthEnglishClass)
students$LengthEnglishClass=as.numeric(students$LengthEnglishClass)
## Warning: NAs introduced by coercion
students$TotalNumberofMinutesforEnglish=students$EnglishClassesPerWeek*students$LengthEnglishClass
cat("We have cleaned the variables with the length of English classes and number of classes per week. We removed the unnecessary min and classes words and converted the values to numeric. We also created a new variable - TotalEnglishMinutesPerWeek that is the number of English classes multiplied by class length \n")
## We have cleaned the variables with the length of English classes and number of classes per week. We removed the unnecessary min and classes words and converted the values to numeric. We also created a new variable - TotalEnglishMinutesPerWeek that is the number of English classes multiplied by class length
summary(students$TotalNumberofMinutesforEnglish)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 225.0 250.0 265.7 300.0 2400.0 289
hist(students$TotalNumberofMinutesforEnglish,
breaks = 100,
col = "turquoise1",
border = "black",
main="Total Number of Minutes each student spends in English Class per week",
xlab="Total Number of Minutes Studying English per Week", ylab="Number of Students")
The results might be innacurate because there are a lot of missing
values. Some rows may contain incomplete or inconsistent entries, such
as missing values for either class length or number of classes per week,
which would result in erroneous total minutes. What could be done is we
could dentify rows where the length of classes or the number of classes
per week is missing or zero and determine whether these values should be
excluded from analysis. We can also investigate students who have
unusually high or low total minutes to check for data entry errors or
inconsistencies.
2.How does the mean reading score compare between students who read for more than 30 minutes per day (outside of class), and those who read for 30 minutes or less per day? HINT: use tapply!
students$ReadMoreThan30min=ifelse(students$DailyReading=="30 minutes or less a day",0,1)
mean_reading_scores <- tapply(students$ReadingScore, students$ReadMoreThan30min, mean)
mean_reading_scores
## 0 1
## 483.5471 534.3946
3.How many students have a mother with the word “business” in her occupation? How many students have a father with the word “business” in his occupation? HINT: Make sure you include “business”, ”Business”, ”BUSIness”, etc.
students$MotherBusiness <- grepl("business",tolower(students$MotherOccupation))
table(students$MotherBusiness)
##
## FALSE TRUE
## 5158 75
students$FatherBusiness <- grepl("business",tolower(students$FatherOccupation))
table(students$FatherBusiness)
##
## FALSE TRUE
## 5179 54
mother_business_count <- sum(students$MotherBusiness, na.rm = TRUE)
father_business_count <- sum(students$FatherBusiness, na.rm = TRUE)
cat("Number of students whose mothers have 'business' in their occupation:", mother_business_count, "\n")
## Number of students whose mothers have 'business' in their occupation: 75
cat("Number of students whose fathers have 'business' in their occupation:", father_business_count, "\n")
## Number of students whose fathers have 'business' in their occupation: 54
merged_df <- merge(students, schools[, c("SchoolID", "Type")], by = "SchoolID", all.x = TRUE)
boxplot(ReadingScore ~ Type, data = merged_df,
main = "Reading Scores by School Type",
xlab = "School Type",
ylab = "Reading Score",
col = "skyblue")
boxplot(MathScore ~ Type, data = merged_df,
main = "Math Scores by School Type",
xlab = "School Type",
ylab = "Math Score",
col = "lightgreen")
In order to boxplot the student reading scores and math scores by type of the school they attend, we had to merge the Schools and Students datasets. We did this through the SchoolID variable. From the boxplots we see that both reading and math scores are on average higher in private schools.
private_math <- merged_df$MathScore[merged_df$Type == "Private"]
public_math <- merged_df$MathScore[merged_df$Type == "Public"]
private_mean <- mean(private_math, na.rm = TRUE)
private_sd <- sd(private_math, na.rm = TRUE)
private_n <- sum(!is.na(private_math))
public_mean <- mean(public_math, na.rm = TRUE)
public_sd <- sd(public_math, na.rm = TRUE)
public_n <- sum(!is.na(public_math))
private_se <- private_sd / sqrt(private_n)
public_se <- public_sd / sqrt(public_n)
private_ci <- private_mean + c(-2, 2) * private_se
public_ci <- public_mean + c(-2, 2) * public_se
cat("95% Confidence Interval for Private schools (Math Scores):", private_ci, "\n")
## 95% Confidence Interval for Private schools (Math Scores): 515.7088 532.8873
cat("95% Confidence Interval for Public schools (Math Scores):", public_ci, "\n")
## 95% Confidence Interval for Public schools (Math Scores): 480.3522 485.4654
To construct the 95% confidence interval we had to first split the math scores into two subsets for private and public schools. For each subset we calculated the mean, the stanfard deviation and size sample. We also calculated the standard errors for both samples. The 95% confidence intervals are the estimated mean +/- 2SE.
total_students<- nrow(students)
students_with_na <- sum(apply(students, 1, function(x) any(is.na(x))))
percentage_thrown_out <- (students_with_na / total_students) * 100
print(total_students)
## [1] 5233
print(students_with_na)
## [1] 1593
cat("Percentage of observations that would be thrown out:", percentage_thrown_out, "%\n")
## Percentage of observations that would be thrown out: 30.44143 %
We have counted the number of all students and then number of all students that have at least one NA value in any variable. To obtain the percentage of observations that would be thrown out we divide the number of students with NAs by total number of students and multiple by 100.
summary(students)
## Grade Gender Preschool
## Min. : 8.00 Female:2546 No :1448
## 1st Qu.:10.00 Male :2687 Yes, for more than one year:1290
## Median :10.00 Yes, for one year or less :2418
## Mean :10.09 NA's : 77
## 3rd Qu.:10.00
## Max. :12.00
##
## MotherBachelors MotherWork
## No :3778 Looking for work : 421
## Yes : 570 Other : 958
## NA's: 885 Working Full-time:2966
## Working Part-Time: 759
## NA's : 129
##
##
## MotherOccupation
## Nursing & Midwifery Profess. [Incl. Registered Nurses, Midwives]: 311
## Housewife : 282
## Secretaries : 198
## Missing : 170
## Cooks : 162
## Child-Care Workers [Incl. Nursemaid, Governess] : 122
## (Other) :3988
## FatherBachelors FatherWork
## No :3564 Looking for work : 288
## Yes : 510 Other : 444
## NA's:1159 Working Full-time:3839
## Working Part-Time: 316
## NA's : 346
##
##
## FatherOccupation
## Missing : 326
## Heavy Truck & Lorry Drivers : 179
## Do Not Know : 160
## N/A : 159
## Vague(A Good Job, A Quiet Job, A Well Paid Job, An Office Job): 135
## [Small Enterprise] General Managers Wholesale & Retail Trade : 133
## (Other) :4141
## DailyReading EnglishClassesPerWeek LengthEnglishClass
## 1 to 2 hours a day : 456 Min. : 0.000 Min. : 15.00
## 30 minutes or less a day :3688 1st Qu.: 3.000 1st Qu.: 45.00
## Between 30 and 60 minutes: 777 Median : 5.000 Median : 55.00
## More than 2 hours a day : 257 Mean : 4.373 Mean : 62.07
## NA's : 55 3rd Qu.: 5.000 3rd Qu.: 80.00
## Max. :20.000 Max. :180.00
## NA's :179 NA's :188
## MathScore ReadingScore ScienceScore SchoolID
## Min. :155.7 Min. :156.4 Min. :119.1 Min. : 1.00
## 1st Qu.:426.7 1st Qu.:431.3 1st Qu.:434.1 1st Qu.: 42.00
## Median :485.9 Median :499.6 Median :499.4 Median : 83.00
## Mean :485.6 Mean :497.6 Mean :499.6 Mean : 83.36
## 3rd Qu.:546.7 3rd Qu.:565.5 3rd Qu.:568.6 3rd Qu.:124.00
## Max. :766.5 Max. :772.5 Max. :782.1 Max. :165.00
##
## TotalNumberofMinutesforEnglish ReadMoreThan30min MotherBusiness
## Min. : 0.0 Min. :0.0000 Mode :logical
## 1st Qu.: 225.0 1st Qu.:0.0000 FALSE:5158
## Median : 250.0 Median :0.0000 TRUE :75
## Mean : 265.7 Mean :0.2878
## 3rd Qu.: 300.0 3rd Qu.:1.0000
## Max. :2400.0 Max. :1.0000
## NA's :289 NA's :55
## FatherBusiness
## Mode :logical
## FALSE:5179
## TRUE :54
##
##
##
##
na_counts <- colSums(is.na(students))
cat("Number of missing values in each variable:\n")
## Number of missing values in each variable:
print(na_counts)
## Grade Gender
## 0 0
## Preschool MotherBachelors
## 77 885
## MotherWork MotherOccupation
## 129 0
## FatherBachelors FatherWork
## 1159 346
## FatherOccupation DailyReading
## 0 55
## EnglishClassesPerWeek LengthEnglishClass
## 179 188
## MathScore ReadingScore
## 0 0
## ScienceScore SchoolID
## 0 0
## TotalNumberofMinutesforEnglish ReadMoreThan30min
## 289 55
## MotherBusiness FatherBusiness
## 0 0
We see that the variables with most NAs are FathersBachelors, MothersBachelors, FatherWork and TotalNumberOfMinutesforEnglish.
Keeping observations with missing values leads to less reliable insights. Including rows with missing values could distort the results of certain analyses, especially in calculations requiring complete data, such as correlation. It’s also more complicated to handle a dataset with NAs (more adjustments).
If we remove all rows with NAs we might significantly reduce the dataset size, leading to reduced statistical power. Missing data might contain information (e.g., missing values could correlate with certain factors), so dropping them might lead to losing important insights or introduce bias.