Data & Decisions

Part I

Please note: this data contains many “NA”, or missing values. You should just ignore these until question 5, which asks specifically about them. In particular, do not bother subsetting them out for questions 1-4.

Compute summary statistics (minimum, mean, and maximum) and make a histogram of the total number of minutes each student spends in English class during one week. What about your results might be inaccurate? What would you do to investigate further? HINT: You need to use information from two different variables to answer this question, and you need to clean both variables first.

students$EnglishClassesPerWeek=sub("classes","",students$EnglishClassesPerWeek)
students$EnglishClassesPerWeek=sub("class","",students$EnglishClassesPerWeek)
students$EnglishClassesPerWeek=as.numeric(students$EnglishClassesPerWeek)

## Warning: NAs introduced by coercion

students$LengthEnglishClass=sub("min","",students$LengthEnglishClass)
students$LengthEnglishClass=as.numeric(students$LengthEnglishClass)

## Warning: NAs introduced by coercion

students$TotalNumberofMinutesforEnglish=students$EnglishClassesPerWeek*students$LengthEnglishClass

cat("We have cleaned the variables with the length of English classes and number of classes per week. We removed the unnecessary min and classes words and converted the values to numeric. We also created a new variable - TotalEnglishMinutesPerWeek that is the number of English classes multiplied by class length \n")

## We have cleaned the variables with the length of English classes and number of classes per week. We removed the unnecessary min and classes words and converted the values to numeric. We also created a new variable - TotalEnglishMinutesPerWeek that is the number of English classes multiplied by class length

summary(students$TotalNumberofMinutesforEnglish)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   225.0   250.0   265.7   300.0  2400.0     289

hist(students$TotalNumberofMinutesforEnglish, 
     breaks = 100, 
     col = "turquoise1", 
     border = "black", 
     main="Total Number of Minutes each student spends in English Class per week", 
     xlab="Total Number of Minutes Studying English per Week", ylab="Number of Students")

The results might be innacurate because there are a lot of missing values. Some rows may contain incomplete or inconsistent entries, such as missing values for either class length or number of classes per week, which would result in erroneous total minutes. What could be done is we could dentify rows where the length of classes or the number of classes per week is missing or zero and determine whether these values should be excluded from analysis. We can also investigate students who have unusually high or low total minutes to check for data entry errors or inconsistencies.

2.How does the mean reading score compare between students who read for more than 30 minutes per day (outside of class), and those who read for 30 minutes or less per day? HINT: use tapply!

students$ReadMoreThan30min=ifelse(students$DailyReading=="30 minutes or less a day",0,1)
mean_reading_scores <- tapply(students$ReadingScore, students$ReadMoreThan30min, mean)
mean_reading_scores

##        0        1 
## 483.5471 534.3946

3.How many students have a mother with the word “business” in her occupation? How many students have a father with the word “business” in his occupation? HINT: Make sure you include “business”, ”Business”, ”BUSIness”, etc.

students$MotherBusiness <- grepl("business",tolower(students$MotherOccupation))
table(students$MotherBusiness)

## 
## FALSE  TRUE 
##  5158    75

students$FatherBusiness <- grepl("business",tolower(students$FatherOccupation))
table(students$FatherBusiness)

## 
## FALSE  TRUE 
##  5179    54

mother_business_count <- sum(students$MotherBusiness, na.rm = TRUE)
father_business_count <- sum(students$FatherBusiness, na.rm = TRUE)

cat("Number of students whose mothers have 'business' in their occupation:", mother_business_count, "\n")

## Number of students whose mothers have 'business' in their occupation: 75

cat("Number of students whose fathers have 'business' in their occupation:", father_business_count, "\n")

## Number of students whose fathers have 'business' in their occupation: 54

HINT: This question requires merging the two data sets (like a vlookup or index match in Excel).

Create a boxplot of student reading scores, sorted by the type of school they attend (Public or Private). Then repeat this for math scores (create a boxplot of student math scores, sorted by the type of school they attend). What can you conclude (if anything) from these boxplots about student scores at the two different types of schools?

merged_df <- merge(students, schools[, c("SchoolID", "Type")], by = "SchoolID", all.x = TRUE)

boxplot(ReadingScore ~ Type, data = merged_df,
        main = "Reading Scores by School Type",
        xlab = "School Type",
        ylab = "Reading Score",
        col = "skyblue")

boxplot(MathScore ~ Type, data = merged_df,
        main = "Math Scores by School Type",
        xlab = "School Type",
        ylab = "Math Score",
        col = "lightgreen")

In order to boxplot the student reading scores and math scores by type of the school they attend, we had to merge the Schools and Students datasets. We did this through the SchoolID variable. From the boxplots we see that both reading and math scores are on average higher in private schools.

The PISA data is a random sample of students in the US. Construct a 95% confidence interval for the true mean math score for all US students at Private schools. Construct a 95% confidence interval for the true mean math score for all US students at Public schools. Do the two regions overlap? HINT: We have not yet learned the R command that computes CI’s automatically, so it’s probably best to proceed by using the R function tapply to compute some summary statistics, and then compute the CIs using the formulas.

private_math <- merged_df$MathScore[merged_df$Type == "Private"]
public_math <- merged_df$MathScore[merged_df$Type == "Public"]

private_mean <- mean(private_math, na.rm = TRUE)
private_sd <- sd(private_math, na.rm = TRUE)
private_n <- sum(!is.na(private_math))

public_mean <- mean(public_math, na.rm = TRUE)
public_sd <- sd(public_math, na.rm = TRUE)
public_n <- sum(!is.na(public_math))

private_se <- private_sd / sqrt(private_n)
public_se <- public_sd / sqrt(public_n)

private_ci <- private_mean + c(-2, 2) * private_se
public_ci <- public_mean + c(-2, 2) * public_se

cat("95% Confidence Interval for Private schools (Math Scores):", private_ci, "\n")

## 95% Confidence Interval for Private schools (Math Scores): 515.7088 532.8873

cat("95% Confidence Interval for Public schools (Math Scores):", public_ci, "\n")

## 95% Confidence Interval for Public schools (Math Scores): 480.3522 485.4654

To construct the 95% confidence interval we had to first split the math scores into two subsets for private and public schools. For each subset we calculated the mean, the stanfard deviation and size sample. We also calculated the standard errors for both samples. The 95% confidence intervals are the estimated mean +/- 2SE.

We unfortunately have many NA (or missing) values in these datasets. Answer the following questions about these NA values:

What percentage of the total number of observations would you throw out if you omitted all observations (students) in the Students dataset that have at least one NA value in any variable?

total_students<- nrow(students)
students_with_na <- sum(apply(students, 1, function(x) any(is.na(x))))
percentage_thrown_out <- (students_with_na / total_students) * 100
print(total_students)

## [1] 5233

print(students_with_na)

## [1] 1593

cat("Percentage of observations that would be thrown out:", percentage_thrown_out, "%\n")

## Percentage of observations that would be thrown out: 30.44143 %

We have counted the number of all students and then number of all students that have at least one NA value in any variable. To obtain the percentage of observations that would be thrown out we divide the number of students with NAs by total number of students and multiple by 100.

Are there particular variables that contain most of the missing values? HINT: The summary function will count the number of NA values

summary(students)

##      Grade          Gender                           Preschool   
##  Min.   : 8.00   Female:2546   No                         :1448  
##  1st Qu.:10.00   Male  :2687   Yes, for more than one year:1290  
##  Median :10.00                 Yes, for one year or less  :2418  
##  Mean   :10.09                 NA's                       :  77  
##  3rd Qu.:10.00                                                   
##  Max.   :12.00                                                   
##                                                                  
##  MotherBachelors             MotherWork  
##  No  :3778       Looking for work : 421  
##  Yes : 570       Other            : 958  
##  NA's: 885       Working Full-time:2966  
##                  Working Part-Time: 759  
##                  NA's             : 129  
##                                          
##                                          
##                                                          MotherOccupation
##  Nursing & Midwifery Profess. [Incl. Registered Nurses, Midwives]: 311   
##  Housewife                                                       : 282   
##  Secretaries                                                     : 198   
##  Missing                                                         : 170   
##  Cooks                                                           : 162   
##  Child-Care Workers [Incl. Nursemaid, Governess]                 : 122   
##  (Other)                                                         :3988   
##  FatherBachelors             FatherWork  
##  No  :3564       Looking for work : 288  
##  Yes : 510       Other            : 444  
##  NA's:1159       Working Full-time:3839  
##                  Working Part-Time: 316  
##                  NA's             : 346  
##                                          
##                                          
##                                                        FatherOccupation
##  Missing                                                       : 326   
##  Heavy Truck & Lorry Drivers                                   : 179   
##  Do Not Know                                                   : 160   
##  N/A                                                           : 159   
##  Vague(A Good Job, A Quiet Job, A Well Paid Job, An Office Job): 135   
##  [Small Enterprise] General Managers Wholesale & Retail Trade  : 133   
##  (Other)                                                       :4141   
##                     DailyReading  EnglishClassesPerWeek LengthEnglishClass
##  1 to 2 hours a day       : 456   Min.   : 0.000        Min.   : 15.00    
##  30 minutes or less a day :3688   1st Qu.: 3.000        1st Qu.: 45.00    
##  Between 30 and 60 minutes: 777   Median : 5.000        Median : 55.00    
##  More than 2 hours a day  : 257   Mean   : 4.373        Mean   : 62.07    
##  NA's                     :  55   3rd Qu.: 5.000        3rd Qu.: 80.00    
##                                   Max.   :20.000        Max.   :180.00    
##                                   NA's   :179           NA's   :188       
##    MathScore      ReadingScore    ScienceScore      SchoolID     
##  Min.   :155.7   Min.   :156.4   Min.   :119.1   Min.   :  1.00  
##  1st Qu.:426.7   1st Qu.:431.3   1st Qu.:434.1   1st Qu.: 42.00  
##  Median :485.9   Median :499.6   Median :499.4   Median : 83.00  
##  Mean   :485.6   Mean   :497.6   Mean   :499.6   Mean   : 83.36  
##  3rd Qu.:546.7   3rd Qu.:565.5   3rd Qu.:568.6   3rd Qu.:124.00  
##  Max.   :766.5   Max.   :772.5   Max.   :782.1   Max.   :165.00  
##                                                                  
##  TotalNumberofMinutesforEnglish ReadMoreThan30min MotherBusiness 
##  Min.   :   0.0                 Min.   :0.0000    Mode :logical  
##  1st Qu.: 225.0                 1st Qu.:0.0000    FALSE:5158     
##  Median : 250.0                 Median :0.0000    TRUE :75       
##  Mean   : 265.7                 Mean   :0.2878                   
##  3rd Qu.: 300.0                 3rd Qu.:1.0000                   
##  Max.   :2400.0                 Max.   :1.0000                   
##  NA's   :289                    NA's   :55                       
##  FatherBusiness 
##  Mode :logical  
##  FALSE:5179     
##  TRUE :54       
##                 
##                 
##                 
##

na_counts <- colSums(is.na(students))
cat("Number of missing values in each variable:\n")

## Number of missing values in each variable:

print(na_counts)

##                          Grade                         Gender 
##                              0                              0 
##                      Preschool                MotherBachelors 
##                             77                            885 
##                     MotherWork               MotherOccupation 
##                            129                              0 
##                FatherBachelors                     FatherWork 
##                           1159                            346 
##               FatherOccupation                   DailyReading 
##                              0                             55 
##          EnglishClassesPerWeek             LengthEnglishClass 
##                            179                            188 
##                      MathScore                   ReadingScore 
##                              0                              0 
##                   ScienceScore                       SchoolID 
##                              0                              0 
## TotalNumberofMinutesforEnglish              ReadMoreThan30min 
##                            289                             55 
##                 MotherBusiness                 FatherBusiness 
##                              0                              0

We see that the variables with most NAs are FathersBachelors, MothersBachelors, FatherWork and TotalNumberOfMinutesforEnglish.

Why do you think you might want to throw out observations with missing values? Why do you think you might want to keep observations with missing values?

Keeping observations with missing values leads to less reliable insights. Including rows with missing values could distort the results of certain analyses, especially in calculations requiring complete data, such as correlation. It’s also more complicated to handle a dataset with NAs (more adjustments).

If we remove all rows with NAs we might significantly reduce the dataset size, leading to reduced statistical power. Missing data might contain information (e.g., missing values could correlate with certain factors), so dropping them might lead to losing important insights or introduce bias.

Data & Decisions

Lab Assignment #4

Agata Braja, Yongkyun Sim

Part I