##Question 1
library(readr)
STAR <- read_csv("STAR.csv")
## Rows: 1274 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): classtype
## dbl (3): reading, math, graduated
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(STAR)
## # A tibble: 6 × 4
##   classtype reading  math graduated
##   <chr>       <dbl> <dbl>     <dbl>
## 1 small         578   610         1
## 2 regular       612   612         1
## 3 regular       583   606         1
## 4 small         661   648         1
## 5 small         614   636         1
## 6 regular       610   603         0
summary(STAR)
##   classtype            reading           math         graduated
##  Length:1274        Min.   :527.0   Min.   :515.0   Min.   :0.0000
##  Class :character   1st Qu.:602.0   1st Qu.:604.0   1st Qu.:1.0000
##  Mode  :character   Median :629.0   Median :631.0   Median :1.0000
##                     Mean   :628.8   Mean   :631.6   Mean   :0.8697
##                     3rd Qu.:654.0   3rd Qu.:659.0   3rd Qu.:1.0000
##                     Max.   :753.0   Max.   :774.0   Max.   :1.0000
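The variables are summarized differently because they hold different kinds of information. reading and math are numeric scores, so summary() reports the minimum, first quartile, median, mean, third quartile, and maximum (the five-number summary plus the mean), which describe the center and spread of the distribution. classtype is a character variable holding category labels, so summary() reports only its length, class, and mode; frequency counts are the more useful summary for it. graduated is binary but stored as numeric 0/1, so it gets the same numeric summary, and its mean of 0.8697 is the proportion of students who graduated (about 87%).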
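As a minimal sketch of how the category and 0/1 columns could be summarized, the code below counts class types and takes the proportion of graduates. The data frame `star_head` is a hypothetical stand-in that copies only the six rows printed by head(STAR) above, so its results are not the full-data results.

```r
# Hypothetical stand-in: the six rows printed by head(STAR), not the full dataset
star_head <- data.frame(
  classtype = c("small", "regular", "regular", "small", "small", "regular"),
  reading   = c(578, 612, 583, 661, 614, 610),
  math      = c(610, 612, 606, 648, 636, 603),
  graduated = c(1, 1, 1, 1, 1, 0)
)

# Frequency counts are the natural summary for a character (category) column
class_counts <- table(star_head$classtype)
print(class_counts)

# The mean of a 0/1 column is the proportion of 1s (the share who graduated)
grad_share <- mean(star_head$graduated)
print(grad_share)
```

On the real data, the same two calls would be table(STAR$classtype) and mean(STAR$graduated).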
##Question 2
hist(STAR$math,main="Distribution of Math Scores",xlab ="Math Score", ylab = "Frequency")
The histogram shows the distribution of math scores in the STAR dataset,
with the math score on the x axis and the frequency (number of students)
on the y axis. The scores look roughly bell-shaped: most students fall
near the middle, with few very low or very high scores. The main argument
adds the chart title, and xlab and ylab add the axis labels.
##Question 3
boxplot(STAR$reading, main="Distribution of Reading Scores")
The boxplot shows the distribution of reading scores in the STAR
dataset. The distribution looks balanced, with most scores centered in
the middle. The circle above the top whisker marks an outlier, a score
unusually far above the rest. Boxplots are good at showing outliers,
which can also flag possible typos to double-check.
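The outlier can be checked with a little arithmetic. This is a rough sketch using the reading quartiles from summary(STAR) and the usual 1.5 × IQR rule. It assumes boxplot() was run with its default range of 1.5. Note that boxplot() computes its box edges from hinges, which can differ slightly from the quartiles that summary() reports, so the fences here are approximate.

```r
# Quartiles, minimum, and maximum of reading, copied from summary(STAR)
q1 <- 602
q3 <- 654
reading_min <- 527
reading_max <- 753

# Interquartile range and the approximate 1.5 * IQR fences
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

# A point beyond a fence is drawn as a circle instead of being covered by a whisker
has_high_outlier <- reading_max > upper_fence
has_low_outlier  <- reading_min < lower_fence
print(c(lower = lower_fence, upper = upper_fence))
print(c(high = has_high_outlier, low = has_low_outlier))
```

Under these assumptions the maximum of 753 lies above the upper fence while the minimum of 527 stays inside the lower fence, which matches a circle appearing only at the top of the plot.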
##Question 4
plot(STAR$math, STAR$reading, main="Academic Scores")
##Remember that capitalization matters: STAR is not the same as star, and using the wrong one gives an error.
##The first argument, STAR$math, goes on the x axis and the second, STAR$reading, on the y axis (x, y). The axis labels default to those expressions.
##Question 5
total<-STAR$math+STAR$reading
summary(total)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1044    1213    1257    1260    1309    1476
hist(total,main="Student Reading and Math Scores")
##Question 6
grad_yes <- STAR$math[STAR$graduated == 1]
grad_no <- STAR$math[STAR$graduated == 0]
This creates the variables grad_yes and grad_no. grad_yes holds the math scores of the students who graduated, and grad_no holds the math scores of those who did not. The 1 and 0 match the binary coding of graduated in the dataset. Now I can apply the mean function to each and compare the results.
mean(grad_yes)
## [1] 635.3258
mean(grad_no)
## [1] 606.6386
The mean math score is 635.3258 for students who graduated and 606.6386 for students who did not, so students who graduated scored about 28.7 points higher in math on average.
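A short sketch of the comparison, using the two means printed above, plus an alternative way to get both group means in one call with tapply(). The data frame `star_head` is a hypothetical stand-in built from the six rows printed by head(STAR) in Question 1, so its group means are for illustration only and do not match the full-data values.

```r
# Group means copied from the mean(grad_yes) and mean(grad_no) output above
mean_grad_yes <- 635.3258
mean_grad_no  <- 606.6386

# Gap between graduates and non-graduates, in math score points
mean_gap <- mean_grad_yes - mean_grad_no
print(round(mean_gap, 1))

# Hypothetical stand-in: math and graduated from the six rows shown by head(STAR)
star_head <- data.frame(
  math      = c(610, 612, 606, 648, 636, 603),
  graduated = c(1, 1, 1, 1, 1, 0)
)

# tapply() splits math by graduated and applies mean() to each group
group_means <- tapply(star_head$math, star_head$graduated, mean)
print(group_means)
```

On the real data, tapply(STAR$math, STAR$graduated, mean) should return the same two means as the grad_yes and grad_no approach, without creating the intermediate vectors.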