Assignment 3

Name: [Joseph Kehoe] ID: [920697315] Date: [10.19.24]

R Markdown

For this assignment, you will continue the visualization work from last week and apply it to a new dataset. Visualizations such as boxplots and stem-and-leaf plots are foundational tools for understanding the pattern and central tendency of a data distribution. You will also learn how to subset a dataset, clean up non-numeric values, and use a stem-and-leaf plot to infer the distribution pattern, mode, mean, and median.

#PART ONE

PART 1: Load the data into R for use in the assignment

#load in the LA crimes data (1 point)
crime_stats <- read.csv("C:/Users/Joey/Downloads/LA_Crime_V2.csv")

#install.packages("stringr")
library(stringr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# filter the LA_Crime dataset based on sex using %>%, hint: assignment 2 (1 point)
crime_male <- crime_stats %>%
  filter(Vict.Sex == "M")
crime_female <- crime_stats %>%
  filter(Vict.Sex == "F")
#use boxplot to visualize and compare the median of each sex, both in one boxplot (2 point)
# label each box so the two groups can be told apart
boxplot(crime_female$Vict.Age, crime_male$Vict.Age,
        names = c("Female", "Male"), ylab = "Victim age")

Briefly interpret the boxplot (1 point), comparing the median and variability of the two genders.

Answer: The median age of male victims is slightly higher than that of female victims, as shown by the median line within each box. Male ages also show somewhat more variability, since the whiskers span a slightly wider range.
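
The visual comparison above can be backed with numbers. The following is a minimal sketch, assuming a small made-up data frame (`toy`) that only borrows the column names `Vict.Sex` and `Vict.Age`; the ages are hypothetical and are not taken from the LA crime data.

```r
# Hypothetical toy data standing in for crime_stats; the ages are made up.
toy <- data.frame(
  Vict.Sex = c("F", "F", "F", "F", "M", "M", "M", "M"),
  Vict.Age = c(22, 30, 35, 41, 25, 33, 38, 60)
)

# Median (the line inside each box) and IQR (the height of each box) per sex
group_median <- tapply(toy$Vict.Age, toy$Vict.Sex, median)
group_iqr    <- tapply(toy$Vict.Age, toy$Vict.Sex, IQR)

print(group_median)
print(group_iqr)
```

Running the same two `tapply()` calls on `crime_stats` would give the medians and IQRs that the boxplot draws, which makes "slightly higher" and "slightly wider" checkable.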

#PART TWO

Part 2: Import and subset a new dataset, 2012_SAT_Results, by carrying out the following steps

#import the dataset 2012_SAT_Results.csv into Rstudio in a standard way, hint: read.csv (1 point)
raw_sat <- read.csv("C:/Users/Joey/Downloads/SAT_DATASET_2012.csv")


# check and make sure the data types are correct using str() (1 point)
str(raw_sat)
## 'data.frame':    410 obs. of  6 variables:
##  $ DBN                            : chr  "01M292" "01M448" "01M450" "01M458" ...
##  $ SCHOOL.NAME                    : chr  "HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES" "UNIVERSITY NEIGHBORHOOD HIGH SCHOOL" "EAST SIDE COMMUNITY SCHOOL" "FORSYTH SATELLITE ACADEMY" ...
##  $ Num.of.SAT.Test.Takers         : int  29 91 70 7 44 112 159 18 130 16 ...
##  $ SAT.Critical.Reading.Avg..Score: int  355 383 377 414 390 332 522 417 624 395 ...
##  $ SAT.Math.Avg..Score            : int  404 423 402 401 433 557 574 418 604 400 ...
##  $ SAT.Writing.Avg..Score         : int  363 366 370 359 384 316 525 411 628 387 ...
#create a random subset of 80 rows of the dataset, hint: x <- sample(nrow(x), 80) (2 point)
sampled_indices <- sample(nrow(raw_sat), size = 80, replace = FALSE)
sampled_indices
##  [1]  43 327  28 123  96 188  87 262  33 358 279 361 236 387  66 221 228   8 137
## [20]  64 192 138 351 290 142  59 391 312 311 243  35 329 321 375 146 400  25 231
## [39] 145 242 167 394   5 148 316  17  63  20  47 283 161 169 310 366 113 385 284
## [58] 136 209 124 174 285 108 150 336 182 218 141  26 201 211 286 135  56 203  51
## [77] 152 275 237 323
# Create a subset with these rows
dataset_subset <- raw_sat[sampled_indices,]
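
Because `sample()` draws different rows on every run, any number copied by hand from one run will not match the next knit. A possible way to make the subset reproducible is sketched below; the seed value `123` and the stand-in data frame `toy_sat` are assumptions chosen only for illustration.

```r
# Hypothetical stand-in for raw_sat: 410 rows, matching the count reported by str()
toy_sat <- data.frame(id = seq_len(410))

set.seed(123)  # arbitrary seed, assumed for this sketch
idx_a <- sample(nrow(toy_sat), size = 80, replace = FALSE)

set.seed(123)  # resetting the same seed reproduces the same draw
idx_b <- sample(nrow(toy_sat), size = 80, replace = FALSE)

print(identical(idx_a, idx_b))

# Subset the rows; drop = FALSE keeps a data frame even with one column
subset_a <- toy_sat[idx_a, , drop = FALSE]
print(nrow(subset_a))
```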


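#clean up the subset for NAs (1 point)
# stringr is already attached by library(stringr) above, so it is not reinstalled here
# Strip every non-digit character, then convert to integer.
# Note: across(1:6, ...) also converts DBN and SCHOOL.NAME, which is why DBN loses its
# letter and SCHOOL.NAME shows NA in the output below; across(3:6, ...) would leave
# those two identifier columns untouched.
sat_dataset_clean <- dataset_subset %>%
  mutate(across(1:6, ~ as.integer(str_replace_all(., "[^0-9]", ""))))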
str(sat_dataset_clean)
## 'data.frame':    80 obs. of  6 variables:
##  $ DBN                            : int  2489 24299 2412 8293 5692 11290 4680 17539 2419 27475 ...
##  $ SCHOOL.NAME                    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Num.of.SAT.Test.Takers         : int  167 155 114 52 101 69 31 61 62 404 ...
##  $ SAT.Critical.Reading.Avg..Score: int  443 545 537 384 605 387 358 362 390 382 ...
##  $ SAT.Math.Avg..Score            : int  489 568 590 385 654 365 351 375 399 404 ...
##  $ SAT.Writing.Avg..Score         : int  442 550 550 389 588 383 345 368 381 368 ...
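
The `str()` output above shows the side effect of cleaning all six columns: `DBN` loses its letter and `SCHOOL.NAME` turns into `NA`. One alternative is sketched here in base R, assuming a toy data frame in which `"s"` is a hypothetical placeholder for a missing score; the school names are also made up. Only the score columns are converted, and rows with a missing score are then dropped.

```r
# Hypothetical toy rows; "s" is an assumed placeholder for a missing score
toy <- data.frame(
  DBN = c("01M292", "01M448", "01M450"),
  SCHOOL.NAME = c("SCHOOL A", "SCHOOL B", "SCHOOL C"),
  Num.of.SAT.Test.Takers = c("29", "s", "70"),
  SAT.Math.Avg..Score = c("404", "s", "402"),
  stringsAsFactors = FALSE
)

score_cols <- c("Num.of.SAT.Test.Takers", "SAT.Math.Avg..Score")

# Strip non-digits from the score columns only; an empty string becomes NA
toy[score_cols] <- lapply(toy[score_cols], function(col) {
  as.integer(gsub("[^0-9]", "", col))
})

# Keep only the rows with no NA in any score column
toy_clean <- toy[complete.cases(toy[score_cols]), ]
str(toy_clean)
```

The identifier columns stay as character vectors, so each remaining row can still be traced back to its school.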
#then extract a column, SAT Math Avg. Score, using subset(x, select=....)(2 point)
sat_math_avg <- sat_dataset_clean %>% select(SAT.Math.Avg..Score)

# View the extracted column
str(sat_math_avg)
## 'data.frame':    80 obs. of  1 variable:
##  $ SAT.Math.Avg..Score: int  489 568 590 385 654 365 351 375 399 404 ...
just_scores <- sat_math_avg[, "SAT.Math.Avg..Score"]
quantile(just_scores)
##     0%    25%    50%    75%   100% 
## 339.00 376.00 396.50 432.25 654.00
IQR(just_scores)
## [1] 56.25
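# 1.5 x IQR, using the IQR reported above (56.25)
56.25*1.5
## [1] 84.375
# lower fence: Q1 - 1.5 x IQR
376-84.375
## [1] 291.625
# upper fence: Q3 + 1.5 x IQR
432.25+84.375
## [1] 516.625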
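
Typing the quartiles in by hand is easy to get wrong, especially when the sample changes between runs. The fence calculation can instead be computed directly from the data. This is a minimal sketch, assuming a hypothetical helper named `tukey_fences()` and a made-up score vector; neither comes from the assignment.

```r
# Hypothetical helper: returns the lower and upper 1.5 x IQR fences of x
tukey_fences <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  spread <- 1.5 * (q[2] - q[1])
  c(lower = q[1] - spread, upper = q[2] + spread)
}

# Made-up scores, used only to exercise the helper
toy_scores <- c(350, 370, 380, 390, 400, 410, 430, 450, 600)

fences <- tukey_fences(toy_scores)
print(fences)

# Values outside the fences are flagged as outliers
outliers <- toy_scores[toy_scores < fences["lower"] | toy_scores > fences["upper"]]
print(outliers)
```

Calling `tukey_fences(just_scores)` would apply the same rule to the sampled SAT math scores without any manual arithmetic.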

#PART THREE

Part 3:

library(ggplot2)
# Create the stem-and-leaf plot using the first two digits as the stem and the third digit as the leaf, hint: use stem()(1 point)
stem(sat_math_avg$SAT.Math.Avg..Score, scale = 2, width = 80, atom = 10)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   32 | 9
##   34 | 69139
##   36 | 04457890112556689
##   38 | 111455667800133349
##   40 | 11343688899
##   42 | 15666722345
##   44 | 135946
##   46 | 35
##   48 | 8902
##   50 | 89
##   52 | 
##   54 | 
##   56 | 8
##   58 | 0
##   60 | 
##   62 | 
##   64 | 4

Examine the stem-and-leaf plot provided for SAT Math Average Scores. What can you infer about the distribution of scores? Are there any outliers? If so, where do they appear? Can we estimate the mode, mean, and median from this data? Under this distribution, is it likely that these parameters are equal? (2 points)

Answer: There are definitely outliers. Using the quantile output above (Q1 = 376, Q3 = 432.25, IQR = 56.25), the lower fence is Q1 - 1.5 x IQR = 291.625 and the upper fence is Q3 + 1.5 x IQR = 516.625. There are no lower outliers, because the minimum score is 339, but there are upper outliers: 568, 590, and 654 all lie above the upper fence. They are also visible in the plot as isolated leaves separated from the main body by empty stems. These upper outliers pull the mean upward, while the median (396.5) is much less affected. The plot does not show a single clear mode: 381, 393, 418, and 426 each appear three times, and no value appears more often. I doubt the mean, median, and mode will be equal, since the data is right-skewed.
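
The effect of a high outlier on the three measures of center can be checked on a small example. This is a sketch on made-up scores, not the SAT data. Base R has no built-in function for the statistical mode (`mode()` reports the storage type), so the most frequent value is taken from `table()`.

```r
# Made-up, right-skewed scores with one high outlier (650)
toy_scores <- c(380, 390, 390, 400, 410, 420, 650)

sample_mean   <- mean(toy_scores)
sample_median <- median(toy_scores)

# Statistical mode: the value or values with the highest count
counts <- table(toy_scores)
sample_modes <- as.numeric(names(counts)[counts == max(counts)])

print(sample_mean)
print(sample_median)
print(sample_modes)
```

In this toy vector the mean sits above the median, and the median sits above the mode, which is the ordering typically seen when a few large values stretch the right tail.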