R Bridge Course Week 2 Assignment

One of the challenges in working with data is wrangling. In this assignment we will use R to perform this task. Here is a list of data sets: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list). Please select one, download it, and perform the following tasks:

Question 1

Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.

summary(data)
##        X            statefip          year         bmprison      
##  Min.   :  1.0   Min.   : 1.00   Min.   :1985   Min.   :    0.0  
##  1st Qu.:204.8   1st Qu.:16.00   1st Qu.:1989   1st Qu.:  489.5  
##  Median :408.5   Median :29.00   Median :1992   Median : 3055.5  
##  Mean   :408.5   Mean   :28.96   Mean   :1992   Mean   : 7625.8  
##  3rd Qu.:612.2   3rd Qu.:42.00   3rd Qu.:1996   3rd Qu.:11423.8  
##  Max.   :816.0   Max.   :56.00   Max.   :2000   Max.   :61861.0  
##                                                                  
##     wmprison        alcohol          income            ur        
##  Min.   :   76   Min.   :1.200   Min.   : 9892   Min.   : 2.258  
##  1st Qu.: 1734   1st Qu.:2.040   1st Qu.:16613   1st Qu.: 4.317  
##  Median : 4176   Median :2.300   Median :20060   Median : 5.312  
##  Mean   : 6324   Mean   :2.388   Mean   :20626   Mean   : 5.546  
##  3rd Qu.: 7484   3rd Qu.:2.560   3rd Qu.:24065   3rd Qu.: 6.575  
##  Max.   :74992   Max.   :5.050   Max.   :41489   Max.   :13.442  
##  NA's   :14                                                      
##     poverty          black            perc1519        aidscapita     
##  Min.   : 2.90   Min.   : 0.2224   Min.   : 5.137   Min.   :  0.000  
##  1st Qu.:10.10   1st Qu.: 3.2543   1st Qu.: 6.837   1st Qu.:  2.077  
##  Median :12.40   Median : 8.0167   Median : 7.346   Median :  4.490  
##  Mean   :13.06   Mean   :11.8322   Mean   : 7.347   Mean   :  7.967  
##  3rd Qu.:15.43   3rd Qu.:16.7193   3rd Qu.: 7.834   3rd Qu.:  9.434  
##  Max.   :27.20   Max.   :71.3464   Max.   :10.420   Max.   :121.173  
##                                                                      
##     state          
##  Length:816        
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
mean_income <- mean(data$income)
mean_poverty <- mean(data$poverty)
med_income <- median(data$income)
med_poverty <- median(data$poverty)
mean_income
## [1] 20626.34
med_income
## [1] 20060
mean_poverty
## [1] 13.06164
med_poverty
## [1] 12.4
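
The summary above reports 14 NA's for wmprison, so calling mean() or median() on that column the same way would return NA. The sketch below shows the behavior on a toy vector; the values in x are illustrative stand-ins, not rows taken from the data set.

```r
# Toy vector standing in for a column with a missing value (illustrative only)
x <- c(76, 1734, NA, 4176)

# By default a missing value propagates, so the result is NA
mean_default <- mean(x)

# na.rm = TRUE drops missing values before computing the statistic
mean_clean <- mean(x, na.rm = TRUE)
median_clean <- median(x, na.rm = TRUE)

print(mean_default)
print(mean_clean)
print(median_clean)
```

The income and poverty columns show no NA's in the summary, which is why the calls above work without na.rm.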

Question 2

Create a new data frame with a subset of the columns and rows. Make sure to rename it.

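library(dplyr)  # provides select(), filter(), mutate(), and %>%
# Keep five columns, then keep only the Arizona rows
data_sub <- select(data, state, black, year, poverty, income)
arizona_data <- filter(data_sub, state == "Arizona")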
print(arizona_data)
##      state    black year poverty income
## 1  Arizona 3.510044 1985    10.7  13769
## 2  Arizona 3.503285 1986    14.3  14427
## 3  Arizona 3.493985 1987    12.8  14985
## 4  Arizona 3.486758 1988    14.1  15627
## 5  Arizona 3.479827 1989    14.1  16403
## 6  Arizona 3.869853 1990    13.7  17005
## 7  Arizona 3.971641 1991    14.8  17260
## 8  Arizona 4.031487 1992    15.8  17777
## 9  Arizona 4.069799 1993    15.4  18293
## 10 Arizona 4.167428 1994    15.9  19212
## 11 Arizona 4.259800 1995    16.1  19929
## 12 Arizona 4.356447 1996    20.5  20823
## 13 Arizona 4.436450 1997    17.2  21861
## 14 Arizona 4.535818 1998    16.6  23216
## 15 Arizona 4.625891 1999    12.2  24057
## 16 Arizona 4.745077 2000    11.7  25660

Question 3

Create new column names for the new data frame.

##    African American Year Recorded Impoverishment Salary   State
## 1          3.510044          1985           10.7  13769 Arizona
## 2          3.503285          1986           14.3  14427 Arizona
## 3          3.493985          1987           12.8  14985 Arizona
## 4          3.486758          1988           14.1  15627 Arizona
## 5          3.479827          1989           14.1  16403 Arizona
## 6          3.869853          1990           13.7  17005 Arizona
## 7          3.971641          1991           14.8  17260 Arizona
## 8          4.031487          1992           15.8  17777 Arizona
## 9          4.069799          1993           15.4  18293 Arizona
## 10         4.167428          1994           15.9  19212 Arizona
## 11         4.259800          1995           16.1  19929 Arizona
## 12         4.356447          1996           20.5  20823 Arizona
## 13         4.436450          1997           17.2  21861 Arizona
## 14         4.535818          1998           16.6  23216 Arizona
## 15         4.625891          1999           12.2  24057 Arizona
## 16         4.745077          2000           11.7  25660 Arizona
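
The code that produced the renamed data frame is not shown in this report. As a minimal sketch of one way to get the same column names and order, the example below uses base R on the first three Arizona rows, rebuilt by hand from the Question 2 output. The name renamed_sketch is hypothetical, and this is not necessarily the approach that was actually used.

```r
# First three Arizona rows, copied by hand from the Question 2 output
arizona_data <- data.frame(
  state   = c("Arizona", "Arizona", "Arizona"),
  black   = c(3.510044, 3.503285, 3.493985),
  year    = c(1985, 1986, 1987),
  poverty = c(10.7, 14.3, 12.8),
  income  = c(13769, 14427, 14985)
)

# Reorder the columns so that state comes last
renamed_sketch <- arizona_data[, c("black", "year", "poverty", "income", "state")]

# Assign the new names positionally; names containing spaces are allowed here
names(renamed_sketch) <- c("African American", "Year Recorded",
                           "Impoverishment", "Salary", "State")

print(renamed_sketch)
```

Column names that contain spaces have to be wrapped in backticks when used with $, for example renamed_sketch$`Year Recorded`.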

Question 4

Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.

summary(arizona_data_newcolumns)
##  African American Year Recorded  Impoverishment      Salary     
##  Min.   :3.480    Min.   :1985   Min.   :10.70   Min.   :13769  
##  1st Qu.:3.508    1st Qu.:1989   1st Qu.:13.47   1st Qu.:16209  
##  Median :4.051    Median :1992   Median :14.55   Median :18035  
##  Mean   :4.034    Mean   :1992   Mean   :14.74   Mean   :18769  
##  3rd Qu.:4.376    3rd Qu.:1996   3rd Qu.:15.95   3rd Qu.:21083  
##  Max.   :4.745    Max.   :2000   Max.   :20.50   Max.   :25660  
##     State          
##  Length:16         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
mean_Impov <- mean(arizona_data_newcolumns$Impoverishment)
mean_salary <- mean(arizona_data_newcolumns$Salary)
median_Impov <- median(arizona_data_newcolumns$Impoverishment)
median_salary <- median(arizona_data_newcolumns$Salary)

mean_salary
## [1] 18769
mean_Impov
## [1] 14.74375
median_Impov
## [1] 14.55
median_salary
## [1] 18035
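
The question asks for a comparison with the full data set. A minimal sketch of one way to lay it out is below. The numbers are typed in by hand from the output in Questions 1 and 4 (the Arizona salary median of 18035 comes from the summary() output), and the data frame name comparison and its column names are hypothetical.

```r
# Means and medians copied from the printed output in Questions 1 and 4
comparison <- data.frame(
  attribute      = c("income", "poverty"),
  mean_all       = c(20626.34, 13.06164),
  mean_arizona   = c(18769, 14.74375),
  median_all     = c(20060, 12.4),
  median_arizona = c(18035, 14.55)
)

# Positive values mean the Arizona figure is higher than the full-data figure
comparison$mean_diff   <- comparison$mean_arizona - comparison$mean_all
comparison$median_diff <- comparison$median_arizona - comparison$median_all

print(comparison)
```

Read this way, the Arizona subset has a lower mean and median income and a higher mean and median poverty value than the full data set.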

Question 5

For at least 3 values in a column, rename them so that every occurrence of each value in that column is changed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “excellent”.

Q5 <- arizona_data_newcolumns %>%
  mutate(State = ifelse(State == "Arizona", "@rizona", State))

print(Q5)
##    African American Year Recorded Impoverishment Salary   State
## 1          3.510044          1985           10.7  13769 @rizona
## 2          3.503285          1986           14.3  14427 @rizona
## 3          3.493985          1987           12.8  14985 @rizona
## 4          3.486758          1988           14.1  15627 @rizona
## 5          3.479827          1989           14.1  16403 @rizona
## 6          3.869853          1990           13.7  17005 @rizona
## 7          3.971641          1991           14.8  17260 @rizona
## 8          4.031487          1992           15.8  17777 @rizona
## 9          4.069799          1993           15.4  18293 @rizona
## 10         4.167428          1994           15.9  19212 @rizona
## 11         4.259800          1995           16.1  19929 @rizona
## 12         4.356447          1996           20.5  20823 @rizona
## 13         4.436450          1997           17.2  21861 @rizona
## 14         4.535818          1998           16.6  23216 @rizona
## 15         4.625891          1999           12.2  24057 @rizona
## 16         4.745077          2000           11.7  25660 @rizona
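
The Arizona subset has only one distinct State value, so the code above recodes a single value. When a column holds three or more distinct values, a named lookup vector is one way to recode them all at once. The sketch below uses a hypothetical toy column; the state names and abbreviations are illustrative and not taken from the data set.

```r
# Hypothetical column with three distinct values to recode
states <- data.frame(
  State = c("Arizona", "Texas", "Ohio", "Texas", "Arizona"),
  stringsAsFactors = FALSE
)

# Named lookup vector: names are the old values, elements are the replacements
lookup <- c(Arizona = "AZ", Texas = "TX", Ohio = "OH")

# Index the lookup by the old values; unname() drops the carried-over names
states$State <- unname(lookup[states$State])

print(states)
```

Any value that is missing from the lookup becomes NA with this approach, so the lookup should cover every distinct value in the column.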

Question 6

Display enough rows to see examples of all of steps 1-5 above.

head(data)
head(arizona_data)
head(arizona_data_newcolumns)
head(Q5)

Question 7

BONUS – place the original .csv file in a GitHub repository and have R read it from the link. This will be a very useful skill as you progress in your data science education and career.

github_url <- "https://raw.githubusercontent.com/MAB592/Week-2-Bridge-data-/main/texas.csv"

git_data <- read.csv(github_url)

head(git_data)