Instruction

Save this file as “Math_2305_data_science_assignment_your_first_name_last_name.Rmd” (e.g., assignment4_jonny_appleseed.Rmd).
For each task, provide appropriate R command(s) in the code chunk, and execute the code chunk to generate an outcome.
After completing all tasks, save your Rmd file and produce an HTML report. Make sure to delete all intermediate code chunks before creating the HTML report.
Submit your Rmd file and the rendered HTML report to D2L by its due date.

1. Load the data file (5 points)

Load Math 2305 Data Science Assignment_data.csv and save it in an R object so that you can use it in the subsequent analysis. Use the tidyverse package in this Rmd file, as you will use data science tools and techniques when analyzing the data.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.4.2

## Warning: package 'ggplot2' was built under R version 4.4.2

## Warning: package 'tibble' was built under R version 4.4.2

## Warning: package 'tidyr' was built under R version 4.4.2

## Warning: package 'readr' was built under R version 4.4.2

## Warning: package 'purrr' was built under R version 4.4.2

## Warning: package 'dplyr' was built under R version 4.4.2

## Warning: package 'stringr' was built under R version 4.4.2

## Warning: package 'forcats' was built under R version 4.4.2

## Warning: package 'lubridate' was built under R version 4.4.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

df <- read_csv("dsacities.csv")

## Rows: 77 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): county, district, grspan
## dbl (14): distcod, teachers, calwpct, mealpct, computer, testscr, compstu, e...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df

Examine the loaded data set

2. How many rows and columns does it have? (3 points)

dim(df)

## [1] 77 17

3. Examine the first several rows of the data set (3 points)

head(df)

4. Compute mean and standard deviation of `teachers`, `compstu`, and `testscr` values (4 points)

df2 <- df %>%
  summarize(mean_teachers = mean(teachers),
            sd_teachers = sd(teachers),
            mean_compstu = mean(compstu),
            sd_compstu = sd(compstu),
            mean_testscr = mean(testscr), 
            sd_testscr = sd(testscr))
df2
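The same six summaries can also be produced with `across()`, which avoids repeating each column name. This is a minimal sketch rather than the approach the assignment requires; the tibble `toy` and its values are made up for illustration and merely stand in for `df`.

```r
library(dplyr)

# Hypothetical stand-in for df: three made-up rows, not the assignment data
toy <- tibble(
  teachers = c(10, 20, 30),
  compstu  = c(0.1, 0.2, 0.3),
  testscr  = c(600, 650, 700)
)

# Apply mean() and sd() to each selected column in one call;
# .names reproduces the mean_<col> / sd_<col> naming used above
summary_tbl <- toy %>%
  summarize(across(c(teachers, compstu, testscr),
                   list(mean = mean, sd = sd),
                   .names = "{.fn}_{.col}"))
summary_tbl
```

If a column contained missing values, `mean` and `sd` would need `na.rm = TRUE`; whether that applies to this data set is not shown in the output above.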

5. Create a scatter plot of `readscr` vs. `mathscr` (5 points)

Use readscr for the X axis
Use “Reading score” for the X axis label
Use mathscr for the Y axis
Use “Math score” for the Y axis label
Show a best fit line

df %>%
  ggplot(aes(x = readscr, y = mathscr)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Reading score", y = "Math score", title = "Reading versus Math")

## `geom_smooth()` using formula = 'y ~ x'

6. Compute a correlation between `readscr` and `mathscr` (30 points)

- State the null hypothesis and the alternative hypothesis

My Response:

Ho: There is no linear correlation between `readscr` and `mathscr` (the true correlation equals 0).

Ha: There is a linear correlation between `readscr` and `mathscr` (the true correlation is not equal to 0).

- Report the test statistic, degrees of freedom, and statistical significance

My Response:

The t-statistic for the observed correlation is t = 21.843 with 75 degrees of freedom. The p-value is extremely small, less than 2.2e-16. Given this p-value, there is compelling evidence against the null hypothesis, so we reject Ho, which states that there is no linear correlation between reading and math scores. The observed strong positive correlation between `readscr` and `mathscr` (r = 0.93) is therefore statistically significant and is unlikely to be a product of random variation in our sample.

- Describe the meaning of p-value in words

My Response:

The p-value is the probability of observing a correlation between `readscr` and `mathscr` at least as strong as the one in our sample if the null hypothesis were true, that is, if there were no linear correlation in the population. It is not the probability that the null hypothesis is true. A small p-value (for example, less than 0.05) means that such a strong correlation would rarely arise from random sampling variation alone, so we treat the relationship as statistically significant and as evidence of a real linear relationship in the broader population.
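One way to make the idea of a p-value concrete is a permutation test: shuffle one variable to mimic a world with no association, and count how often the shuffled correlation is as extreme as the observed one. The sketch below uses simulated scores (`read_sim`, `math_sim`), which are hypothetical and not the assignment data, and it illustrates the concept rather than reproducing the t-based test used in this report.

```r
set.seed(2305)  # arbitrary seed so the sketch is reproducible

# Simulated, hypothetical scores (not the assignment data)
n <- 77
read_sim <- rnorm(n, mean = 650, sd = 20)
math_sim <- 0.9 * read_sim + rnorm(n, mean = 0, sd = 8)

r_obs <- cor(read_sim, math_sim)

# Shuffling one variable breaks any real association, so each shuffled
# correlation is a draw from the "no correlation" (null) world
n_perm <- 2000
r_null <- replicate(n_perm, cor(read_sim, sample(math_sim)))

# Share of null correlations at least as extreme as the observed one
# (two-sided; the +1 terms keep the estimate away from exactly 0)
p_perm <- (sum(abs(r_null) >= abs(r_obs)) + 1) / (n_perm + 1)
p_perm
```

With a correlation this strong, essentially none of the shuffled data sets match the observed value, which is what a very small p-value expresses.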

cor.test(df$readscr, df$mathscr)

## 
##  Pearson's product-moment correlation
## 
## data:  df$readscr and df$mathscr
## t = 21.843, df = 75, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8911715 0.9547822
## sample estimates:
##       cor 
## 0.9295991
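The reported t statistic follows from the sample correlation and the sample size through t = r * sqrt(n - 2) / sqrt(1 - r^2), a standard identity for the Pearson correlation test. As a check, this sketch recomputes it in base R from the values printed above; the variable names are illustrative.

```r
r <- 0.9295991   # sample correlation from the output above
n <- 77          # number of rows in df

df_resid <- n - 2                                # degrees of freedom: 75
t_stat   <- r * sqrt(df_resid) / sqrt(1 - r^2)   # about 21.84
p_two    <- 2 * pt(-abs(t_stat), df = df_resid)  # two-sided p-value

c(t = t_stat, df = df_resid, p = p_two)
```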

7. Report the findings from your correlation analysis (30 points)

Write a sentence describing the findings
Make sure to include all important information

My Response:

From our analysis, the test statistic is t = 21.843 with 75 degrees of freedom, and the p-value is less than 2.2e-16. Since we are using a 95% confidence interval, our significance level (alpha) is 0.05. Because the p-value is less than alpha, we reject the null hypothesis (Ho) in favor of the alternative hypothesis (Ha). This indicates a statistically significant, strong positive linear relationship between reading and math scores (r = 0.93, 95% CI [0.89, 0.95], t(75) = 21.843, p < 2.2e-16).
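The numbers in a summary sentence like this can be pulled directly from the object that `cor.test()` returns, which reduces copy errors. The following is a sketch under stated assumptions: the scores are simulated and hypothetical (they stand in for `df$readscr` and `df$mathscr`), and the names `res`, `p_txt`, and `report` are illustrative.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated, hypothetical scores standing in for df$readscr and df$mathscr
read_sim <- rnorm(77, mean = 650, sd = 20)
math_sim <- 0.9 * read_sim + rnorm(77, mean = 0, sd = 8)

res <- cor.test(read_sim, math_sim)

# R prints very small p-values as "< 2.2e-16"; mirror that convention here
p_txt <- if (res$p.value < 2.2e-16) "< 2.2e-16" else sprintf("= %.3g", res$p.value)

# cor.test() returns an "htest" list; pull out the pieces to report
report <- sprintf(
  "r = %.2f, 95%% CI [%.2f, %.2f], t(%d) = %.2f, p %s",
  res$estimate, res$conf.int[1], res$conf.int[2],
  as.integer(res$parameter), res$statistic, p_txt
)
report
```

Replacing the simulated vectors with the real columns would produce the corresponding sentence for the assignment data.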

8. Create an HTML report of your correlation analysis

Note: The professor, not the student, reserves the right to decide which answers, code, and steps are correct. Once the student submits the assignment, they cannot resubmit for a higher grade, and all grades are final once the professor enters them in D2L.

Convert to HTML for 20 points and submit both the markdown file (.rmd) and HTML to the assignment folder in D2L.

You will have a total of 100 points for this assignment.

Math 2305 Data Science Assignment
