One of the challenges in working with data is wrangling. In this assignment we will use R to perform this task. Here is a list of data sets: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list). Please select one, download it, and perform the following tasks:
Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.
summary(data)
## X statefip year bmprison
## Min. : 1.0 Min. : 1.00 Min. :1985 Min. : 0.0
## 1st Qu.:204.8 1st Qu.:16.00 1st Qu.:1989 1st Qu.: 489.5
## Median :408.5 Median :29.00 Median :1992 Median : 3055.5
## Mean :408.5 Mean :28.96 Mean :1992 Mean : 7625.8
## 3rd Qu.:612.2 3rd Qu.:42.00 3rd Qu.:1996 3rd Qu.:11423.8
## Max. :816.0 Max. :56.00 Max. :2000 Max. :61861.0
##
## wmprison alcohol income ur
## Min. : 76 Min. :1.200 Min. : 9892 Min. : 2.258
## 1st Qu.: 1734 1st Qu.:2.040 1st Qu.:16613 1st Qu.: 4.317
## Median : 4176 Median :2.300 Median :20060 Median : 5.312
## Mean : 6324 Mean :2.388 Mean :20626 Mean : 5.546
## 3rd Qu.: 7484 3rd Qu.:2.560 3rd Qu.:24065 3rd Qu.: 6.575
## Max. :74992 Max. :5.050 Max. :41489 Max. :13.442
## NA's :14
## poverty black perc1519 aidscapita
## Min. : 2.90 Min. : 0.2224 Min. : 5.137 Min. : 0.000
## 1st Qu.:10.10 1st Qu.: 3.2543 1st Qu.: 6.837 1st Qu.: 2.077
## Median :12.40 Median : 8.0167 Median : 7.346 Median : 4.490
## Mean :13.06 Mean :11.8322 Mean : 7.347 Mean : 7.967
## 3rd Qu.:15.43 3rd Qu.:16.7193 3rd Qu.: 7.834 3rd Qu.: 9.434
## Max. :27.20 Max. :71.3464 Max. :10.420 Max. :121.173
##
## state
## Length:816
## Class :character
## Mode :character
##
##
##
##
mean_income <- mean(data$income)
mean_poverty <- mean(data$poverty)
med_income <- median(data$income)
med_poverty <- median(data$poverty)
mean_income
## [1] 20626.34
med_income
## [1] 20060
mean_poverty
## [1] 13.06164
med_poverty
## [1] 12.4
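The summary above reports 14 missing values (NA's) in one column. `mean()` and `median()` return `NA` for a column that contains missing values unless `na.rm = TRUE` is passed. A minimal sketch of that behaviour, using a hypothetical toy vector `x` rather than a column from the data set:

```r
# Hypothetical toy vector standing in for a column with a missing value
x <- c(1.2, 2.3, NA, 2.5, 5.0)

# Without na.rm the result is NA because one element is missing
mean_with_na <- mean(x)

# With na.rm = TRUE the missing element is dropped before computing
mean_x <- mean(x, na.rm = TRUE)
median_x <- median(x, na.rm = TRUE)

mean_with_na
mean_x
median_x
```

The same argument could be used on whichever column of `data` holds the NA's before reporting its mean and median.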
Create a new data frame with a subset of the columns and rows. Make sure to rename it.
library(dplyr)  # provides select() and filter()
data_sub <- select(data, state, black, year, poverty, income)
arizona_data <- filter(data_sub, state == "Arizona")
print(arizona_data)
## state black year poverty income
## 1 Arizona 3.510044 1985 10.7 13769
## 2 Arizona 3.503285 1986 14.3 14427
## 3 Arizona 3.493985 1987 12.8 14985
## 4 Arizona 3.486758 1988 14.1 15627
## 5 Arizona 3.479827 1989 14.1 16403
## 6 Arizona 3.869853 1990 13.7 17005
## 7 Arizona 3.971641 1991 14.8 17260
## 8 Arizona 4.031487 1992 15.8 17777
## 9 Arizona 4.069799 1993 15.4 18293
## 10 Arizona 4.167428 1994 15.9 19212
## 11 Arizona 4.259800 1995 16.1 19929
## 12 Arizona 4.356447 1996 20.5 20823
## 13 Arizona 4.436450 1997 17.2 21861
## 14 Arizona 4.535818 1998 16.6 23216
## 15 Arizona 4.625891 1999 12.2 24057
## 16 Arizona 4.745077 2000 11.7 25660
Create new column names for the new data frame.
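The code for this step is not shown in the write-up, only its output. A minimal sketch of one way such a result could be produced, assuming base R's `names()` assignment and a column reorder; the sample data frame here holds only the first two Arizona rows, and the approach is an assumption rather than the code actually used:

```r
# Sample data: first two Arizona rows copied from the printed subset
arizona_data <- data.frame(
  state   = c("Arizona", "Arizona"),
  black   = c(3.510044, 3.503285),
  year    = c(1985, 1986),
  poverty = c(10.7, 14.3),
  income  = c(13769, 14427),
  stringsAsFactors = FALSE
)

# Reorder the columns so that state comes last
arizona_data_newcolumns <- arizona_data[, c("black", "year", "poverty", "income", "state")]

# Assign the new column names in the same order
names(arizona_data_newcolumns) <- c(
  "African American", "Year Recorded", "Impoverishment", "Salary", "State"
)

print(arizona_data_newcolumns)
```

Names that contain spaces, such as `African American`, need backticks or `[[ ]]` when accessed, for example ``arizona_data_newcolumns$`African American` ``.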
## African American Year Recorded Impoverishment Salary State
## 1 3.510044 1985 10.7 13769 Arizona
## 2 3.503285 1986 14.3 14427 Arizona
## 3 3.493985 1987 12.8 14985 Arizona
## 4 3.486758 1988 14.1 15627 Arizona
## 5 3.479827 1989 14.1 16403 Arizona
## 6 3.869853 1990 13.7 17005 Arizona
## 7 3.971641 1991 14.8 17260 Arizona
## 8 4.031487 1992 15.8 17777 Arizona
## 9 4.069799 1993 15.4 18293 Arizona
## 10 4.167428 1994 15.9 19212 Arizona
## 11 4.259800 1995 16.1 19929 Arizona
## 12 4.356447 1996 20.5 20823 Arizona
## 13 4.436450 1997 17.2 21861 Arizona
## 14 4.535818 1998 16.6 23216 Arizona
## 15 4.625891 1999 12.2 24057 Arizona
## 16 4.745077 2000 11.7 25660 Arizona
Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.
summary(arizona_data_newcolumns)
## African American Year Recorded Impoverishment Salary
## Min. :3.480 Min. :1985 Min. :10.70 Min. :13769
## 1st Qu.:3.508 1st Qu.:1989 1st Qu.:13.47 1st Qu.:16209
## Median :4.051 Median :1992 Median :14.55 Median :18035
## Mean :4.034 Mean :1992 Mean :14.74 Mean :18769
## 3rd Qu.:4.376 3rd Qu.:1996 3rd Qu.:15.95 3rd Qu.:21083
## Max. :4.745 Max. :2000 Max. :20.50 Max. :25660
## State
## Length:16
## Class :character
## Mode :character
##
##
##
mean_Impov <- mean(arizona_data_newcolumns$Impoverishment)
mean_salary <- mean(arizona_data_newcolumns$Salary)
median_Impov <- median(arizona_data_newcolumns$Impoverishment)
median_salary <- median(arizona_data_newcolumns$Salary)
mean_salary
## [1] 18769
mean_Impov
## [1] 14.74375
median_Impov
## [1] 14.55
median_salary
## [1] 18035
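The task asks for a comparison between the full data set and the subset. One way to lay the two sets of statistics side by side is sketched below. The numbers are copied from the output above (the Arizona median salary of 18035 comes from the summary table); the data frame `comparison` and its column names are hypothetical.

```r
# Statistics copied from the printed results for the full data and the Arizona subset
comparison <- data.frame(
  statistic = c("mean income", "median income", "mean poverty", "median poverty"),
  full_data = c(20626.34, 20060, 13.06164, 12.4),
  arizona   = c(18769, 18035, 14.74375, 14.55),
  stringsAsFactors = FALSE
)

# Positive values mean Arizona is above the full-data figure
comparison$difference <- comparison$arizona - comparison$full_data

print(comparison)
```

On these figures, Arizona's mean and median income are below those of the full data set, while its mean and median poverty rates are higher.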
For at least 3 values in a column please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “excellent”.
Q5 <- arizona_data_newcolumns %>%
  mutate(State = ifelse(State == "Arizona", "@rizona", State))
print(Q5)
## African American Year Recorded Impoverishment Salary State
## 1 3.510044 1985 10.7 13769 @rizona
## 2 3.503285 1986 14.3 14427 @rizona
## 3 3.493985 1987 12.8 14985 @rizona
## 4 3.486758 1988 14.1 15627 @rizona
## 5 3.479827 1989 14.1 16403 @rizona
## 6 3.869853 1990 13.7 17005 @rizona
## 7 3.971641 1991 14.8 17260 @rizona
## 8 4.031487 1992 15.8 17777 @rizona
## 9 4.069799 1993 15.4 18293 @rizona
## 10 4.167428 1994 15.9 19212 @rizona
## 11 4.259800 1995 16.1 19929 @rizona
## 12 4.356447 1996 20.5 20823 @rizona
## 13 4.436450 1997 17.2 21861 @rizona
## 14 4.535818 1998 16.6 23216 @rizona
## 15 4.625891 1999 12.2 24057 @rizona
## 16 4.745077 2000 11.7 25660 @rizona
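The example above replaces a single distinct value. When a column holds several distinct codes, a named lookup vector is one way to recode all of them in one step. A minimal sketch in base R, using a hypothetical toy data frame `grades` and hypothetical labels:

```r
# Hypothetical toy column with three distinct codes
grades <- data.frame(
  grade = c("e", "g", "p", "e", "g"),
  stringsAsFactors = FALSE
)

# Named vector: names are the old values, elements are the new values
lookup <- c(e = "excellent", g = "good", p = "poor")

# Index the lookup by the old values to get the replacements
grades$grade <- unname(lookup[grades$grade])

print(grades)
```

Any value missing from `lookup` would become `NA`, so the lookup should cover every distinct value in the column.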
Display enough rows to see examples of all of steps 1-5 above.
head(data)
head(arizona_data)
head(arizona_data_newcolumns)
BONUS – place the original .csv in a GitHub repository and have R read from the link. This will be a very useful skill as you progress in your data science education and career.
github_url <- "https://raw.githubusercontent.com/MAB592/Week-2-Bridge-data-/main/texas.csv"
git_data <- read.csv(github_url)
head(git_data)
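`read.csv()` takes a local file path in the same way it takes a URL, which makes it possible to try the call without a network connection. A minimal sketch, assuming a small hypothetical CSV written to a temporary file; the names `csv_path` and `local_data` are illustrative only:

```r
# Write a tiny hypothetical CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("state,year,income",
    "Arizona,1985,13769",
    "Arizona,1986,14427"),
  csv_path
)

# Read it back; a raw GitHub URL string could be passed in place of csv_path
local_data <- read.csv(csv_path, stringsAsFactors = FALSE)

head(local_data)
```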