library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)

districtbase <- read_xls("district.xls")

Homework #4

  1. From the data you have chosen, select a variable that you are interested in
  2. Use pastecs::stat.desc to describe the variable. Include a few sentences about what the variable is and what it’s measuring.
  3. Remove NA’s if needed using dplyr::filter (or anything similar)
  4. Provide a histogram of the variable (as shown in this lesson)
  5. Transform the variable using the log transformation or square root transformation (whichever is more appropriate) using dplyr::mutate or something similar
  6. Provide a histogram of the transformed variable
  7. Submit via RPubs on Canvas

1. From the data you have chosen, select a variable that you are interested in

I will be examining the variable DA0GR21N.

districtbase2 <- districtbase %>%
  select(DISTNAME, DZCAMPUS, DPETSPEP, DA0AT21R, DA0GR21N) %>%
  na.omit()
This trims the dataset to the five selected columns and removes every row that has an NA in any of them.
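One thing worth keeping in mind is that na.omit() drops a row when any selected column is missing, while filter() with is.na() can target a single column. The sketch below illustrates the difference on a toy tibble. The column names are borrowed from the district data, but the values are made up for illustration, and `toy`, `dropped_any` and `dropped_one` are hypothetical names.

```r
library(dplyr)

# Toy tibble standing in for the district data (values are made up).
toy <- tibble(
  DISTNAME = c("A", "B", "C", "D"),
  DA0AT21R = c(95.1, NA, 93.4, 96.0),
  DA0GR21N = c(120, 45, NA, 300)
)

# na.omit() drops a row if ANY column is NA, so rows B and C are both removed.
dropped_any <- toy %>% na.omit()

# filter() on one column keeps row B, because its DA0GR21N value is present.
dropped_one <- toy %>% filter(!is.na(DA0GR21N))

nrow(dropped_any)
nrow(dropped_one)
```

If only one variable is being analyzed, filtering on that column alone may retain more districts than na.omit() across all selected columns.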
pastecs::stat.desc(districtbase2$DA0GR21N)
##      nbr.val     nbr.null       nbr.na          min          max        range 
## 1.081000e+03 0.000000e+00 0.000000e+00 1.000000e+00 1.158800e+04 1.158700e+04 
##          sum       median         mean      SE.mean CI.mean.0.95          var 
## 3.585130e+05 6.900000e+01 3.316494e+02 2.650349e+01 5.200417e+01 7.593325e+05 
##      std.dev     coef.var 
## 8.713968e+02 2.627464e+00
2. Use pastecs::stat.desc to describe the variable. Include a few sentences about what the variable is and what it’s measuring.

From the districtbase dataset, I will examine the graduation variable, DA0GR21N. In this administrative look at the student population, graduation matters because it is the end goal a student works toward in school: crossing the proverbial finish line. The summary statistics above run from 1 to 11,588 with a median of 69 and a mean of about 332, so the variable appears to record a count of graduates per district rather than a percentage rate, and a mean far above the median points to a strong right skew.

ggplot(districtbase2, aes(x = DA0GR21N)) +
geom_histogram(col='red')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
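The `stat_bin()` message above is ggplot2 reporting that it fell back to its default of 30 bins. A minimal sketch of setting the bin count explicitly follows. It uses made-up toy data rather than the district file, and the choice of 20 bins is arbitrary, so treat it as a starting point rather than a recommendation.

```r
library(ggplot2)

# Toy right-skewed values (made up for illustration; not the district data).
set.seed(1)
toy <- data.frame(value = rexp(500, rate = 1 / 300))

# Setting bins (or binwidth) explicitly avoids the default-bins message.
# 20 is an arbitrary choice here; try several values and compare the shapes.
p <- ggplot(toy, aes(x = value)) +
  geom_histogram(bins = 20, colour = "red")

print(p)
```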

5. Transform the variable using the log transformation or square root transformation (whichever is more appropriate) using dplyr::mutate or something similar

I will now use the mutate function to take the square root of the DA0GR21N variable.

# Store the square root under a name that matches the transformation
districtbase2 <- districtbase2 %>%
  mutate(DA0GR21N_sqrt = sqrt(DA0GR21N))
Now I will plot a histogram of the square-root-transformed variable.
ggplot(districtbase2, aes(x = DA0GR21N_sqrt)) +
  geom_histogram(col = 'red')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram of the transformed variable still shows a pronounced skew to the right.
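When a square-root transformation leaves a strong right skew, a log transformation is a common next thing to try, and it is defined here because the minimum of DA0GR21N is 1. The sketch below compares the two on simulated log-normal data, which is an assumption standing in for the real counts. `skewness()` is a hypothetical helper written for this example, not a function from pastecs or the tidyverse.

```r
# Simulated right-skewed counts (NOT the district data); all values are >= 1.
set.seed(42)
toy_counts <- round(rlnorm(1000, meanlog = 4, sdlog = 1.5)) + 1

# Hypothetical helper: moment-based sample skewness (0 means symmetric).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Skewness of the raw values and of each candidate transformation.
skew_summary <- c(
  raw  = skewness(toy_counts),
  sqrt = skewness(sqrt(toy_counts)),
  log  = skewness(log(toy_counts))
)

print(round(skew_summary, 2))
```

On data shaped like this, the log transformation pulls in the long right tail more strongly than the square root does. Whether the same holds for DA0GR21N would need to be checked by running the comparison on districtbase2 itself.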