## R Bridge Course Week 2 Assignment
One of the challenges in working with data is wrangling. In this assignment we will use R to perform this task.
Here is a list of data sets: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list)
Please select one, download it and perform the following tasks:
7. BONUS – place the original .csv file in a GitHub repository and have R read it from the link. This should be your own GitHub account – not the file source. This will be a very useful skill as you progress in your data science education and career.
#import copy of ChinaIncome data set from my github account
CI <- read.csv("https://raw.githubusercontent.com/pkofy/MathBridge/main/ChinaIncome.csv", stringsAsFactors = FALSE)
#import ChinaIncome data set from source
#CI <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/ChinaIncome.csv", stringsAsFactors = FALSE)
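Reading from a personal GitHub copy makes the script depend on that link staying available. One way to guard against a dead link is to try several sources in order, sketched below. `read_first_available` is a hypothetical helper name, not a base R function, and the demo uses a temporary local file rather than the two URLs above so that it runs offline.

```r
# Hypothetical helper: try each CSV source in order and
# return the first one that loads without an error.
read_first_available <- function(sources) {
  for (src in sources) {
    result <- tryCatch(
      suppressWarnings(read.csv(src, stringsAsFactors = FALSE)),
      error = function(e) NULL
    )
    if (!is.null(result)) return(result)
  }
  stop("none of the sources could be read")
}

# Offline demo: the first path does not exist, the second is a temporary file.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(agriculture = c(100, 101.6), commerce = c(100, 133)),
          tmp, row.names = FALSE)
demo_df <- read_first_available(c(file.path(tempdir(), "missing.csv"), tmp))
print(demo_df)
```

In this assignment the `sources` vector would presumably hold the GitHub URL first and the Rdatasets URL second.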
1. Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes of your data.
#use summary function
summary(CI)
## X agriculture commerce construction industry
## Min. : 1 Min. : 83.6 Min. :100.0 Min. : 100.0 Min. : 100.0
## 1st Qu.:10 1st Qu.:111.9 1st Qu.:146.6 1st Qu.: 259.0 1st Qu.: 374.9
## Median :19 Median :139.8 Median :199.2 Median : 421.0 Median : 863.0
## Mean :19 Mean :151.9 Mean :261.0 Mean : 549.9 Mean :1244.2
## 3rd Qu.:28 3rd Qu.:168.4 3rd Qu.:316.8 3rd Qu.: 584.1 3rd Qu.:1814.7
## Max. :37 Max. :279.4 Max. :760.8 Max. :1884.0 Max. :4765.0
## transport
## Min. : 100.0
## 1st Qu.: 221.1
## Median : 370.8
## Mean : 449.5
## 3rd Qu.: 560.8
## Max. :1413.6
#obtain mean and median
CIag_mean <- mean(CI$agriculture)
CIcom_mean <- mean(CI$commerce)
CIag_median <- median(CI$agriculture)
CIcom_median <- median(CI$commerce)
#show results
print(sprintf("For agriculture income in China, the mean is %g, and the median is %g", CIag_mean, CIag_median))
## [1] "For agriculture income in China, the mean is 151.949, and the median is 139.8"
print(sprintf("For commerce income in China, the mean is %g, and the median is %g", CIcom_mean, CIcom_median))
## [1] "For commerce income in China, the mean is 260.986, and the median is 199.2"
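The four separate `mean()` and `median()` calls above can also be collapsed into one `sapply()` call. The sketch below assumes a toy data frame, `toy`, built from the first five agriculture and commerce values printed later in this report, so its numbers differ from the full 37-row results.

```r
# Toy data frame: the first five rows of the two columns (an assumption
# for illustration; the real analysis uses the full CI data frame).
toy <- data.frame(
  agriculture = c(100.0, 101.6, 103.3, 111.5, 116.5),
  commerce    = c(100.0, 133.0, 136.4, 137.5, 146.6)
)

# Apply mean() and median() to every column; the result is a matrix
# with one row per statistic and one column per attribute.
stats <- sapply(toy, function(col) c(mean = mean(col), median = median(col)))
print(stats)
```

With the real data, `sapply(CI[c("agriculture", "commerce")], ...)` would be the equivalent call.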
2. Create a new data frame with a subset of the columns AND rows. There are several ways to do this so feel free to try a couple if you want. Make sure to give the new data set a new name so it doesn’t simply overwrite the original.
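#create new data frame with the first 10 rows and the agriculture and commerce columns...
#using subset() (column names are evaluated inside the data frame, so CI$ is not needed)
CIsub1 <- subset(CI, X < 11, select = c("agriculture", "commerce"))
#using direct index references (drop column 1, X, and columns 4 to 6)
CIsub2 <- CI[1:10, -c(1, 4:6)]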
#using select() and filter() from dplyr (run the install.packages() line below only if dplyr is not installed yet)
install.packages("dplyr",repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/np/xh7bvgcn1gd076tdp885z4f00000gp/T//RtmpqyI67b/downloaded_packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
CIsub3 <- select(filter(CI, X < 11),c(agriculture:commerce))
#show that they all produce the datasets with the same characteristics
summary(CIsub1)
## agriculture commerce
## Min. : 83.6 Min. :100.0
## 1st Qu.:100.2 1st Qu.:133.8
## Median :102.5 Median :142.1
## Mean :104.2 Mean :142.1
## 3rd Qu.:115.2 3rd Qu.:153.6
## Max. :120.3 Max. :170.3
summary(CIsub2)
## agriculture commerce
## Min. : 83.6 Min. :100.0
## 1st Qu.:100.2 1st Qu.:133.8
## Median :102.5 Median :142.1
## Mean :104.2 Mean :142.1
## 3rd Qu.:115.2 3rd Qu.:153.6
## Max. :120.3 Max. :170.3
summary(CIsub3)
## agriculture commerce
## Min. : 83.6 Min. :100.0
## 1st Qu.:100.2 1st Qu.:133.8
## Median :102.5 Median :142.1
## Mean :104.2 Mean :142.1
## 3rd Qu.:115.2 3rd Qu.:153.6
## Max. :120.3 Max. :170.3
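Matching summaries suggest the subsets agree, but they do not prove the rows are identical. A more direct check is `all.equal()`, sketched here with base R only. The `toy` data frame is an assumption for illustration, built from the first five rows shown in this report.

```r
# Toy data frame standing in for CI (first five rows only).
toy <- data.frame(
  X           = 1:5,
  agriculture = c(100.0, 101.6, 103.3, 111.5, 116.5),
  commerce    = c(100.0, 133.0, 136.4, 137.5, 146.6)
)

# Two routes to the same subset: first three rows, two columns.
sub_a <- subset(toy, X < 4, select = c("agriculture", "commerce"))
sub_b <- toy[1:3, c("agriculture", "commerce")]

# all.equal() compares values, column names and row names;
# isTRUE() turns its result into a plain TRUE/FALSE.
same <- isTRUE(all.equal(sub_a, sub_b))
print(same)
```

The same pattern should apply to `CIsub1`, `CIsub2` and `CIsub3`, comparing them pairwise.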
3. Create new column names for each column in the new data frame created in step 2.
#dplyr uses new_name = old_name and plyr uses old_name = new_name. Currently dplyr is loaded
CItest <- rename(CIsub2, c("ag"="agriculture", "com"="commerce"))
#show that the column names changed
CItest[1:3, ]
## ag com
## 1 100.0 100.0
## 2 101.6 133.0
## 3 103.3 136.4
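Because `rename()` takes its arguments in opposite orders in dplyr and plyr, a base R version that avoids the ambiguity may be useful. The sketch below uses a hypothetical lookup vector, `new_names`, that maps old names to new ones, and a three-row toy data frame.

```r
# Toy data frame with the original column names.
toy <- data.frame(
  agriculture = c(100.0, 101.6, 103.3),
  commerce    = c(100.0, 133.0, 136.4)
)

# Lookup vector: element names are the old column names, values the new ones.
new_names <- c(agriculture = "ag", commerce = "com")

renamed <- toy
# Rename only the columns found in the lookup; any others keep their names.
hit <- names(renamed) %in% names(new_names)
names(renamed)[hit] <- unname(new_names[names(renamed)[hit]])
print(names(renamed))
```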
4. Use the summary function to create an overview of your new data frame created in step 2. Then print the mean and median for the same two attributes. Please compare (i.e. tell me how the values changed and why).
#same as data frame in step2 but with shorter column names
summary(CItest)
## ag com
## Min. : 83.6 Min. :100.0
## 1st Qu.:100.2 1st Qu.:133.8
## Median :102.5 Median :142.1
## Mean :104.2 Mean :142.1
## 3rd Qu.:115.2 3rd Qu.:153.6
## Max. :120.3 Max. :170.3
#calculate mean and median
CItestag_mean <- mean(CItest$ag)
CItestcom_mean <- mean(CItest$com)
CItestag_median <- median(CItest$ag)
CItestcom_median <- median(CItest$com)
#show results
print(sprintf("Between the original and test agriculture data, the mean changed from %g to %g, and the median changed from %g to %g", CIag_mean, CItestag_mean, CIag_median, CItestag_median))
## [1] "Between the original and test agriculture data, the mean changed from 151.949 to 104.22, and the median changed from 139.8 to 102.45"
print(sprintf("Between the original and test commerce data, the mean changed from %g to %g, and the median changed from %g to %g", CIcom_mean, CItestcom_mean, CIcom_median, CItestcom_median))
## [1] "Between the original and test commerce data, the mean changed from 260.986 to 142.05, and the median changed from 199.2 to 142.05"
#conclusion
print("Since the minimums stayed the same while the maximums and means decreased between the original data and the subset containing only the first ten rows, the values likely increase with the row number.")
## [1] "Since the minimums stayed the same while the maximums and means decreased between the original data and the subset containing only the first ten rows, the values likely increase with the row number."
print("Looking at the data again, the first row is 100.0 for every column, so that was likely the first year and row two the second year. The values are then index numbers for income by sector relative to year one, and there are 37 years' worth of data.")
## [1] "Looking at the data again, the first row is 100.0 for every column, so that was likely the first year and row two the second year. The values are then index numbers for income by sector relative to year one, and there are 37 years' worth of data."
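The conclusion above infers an upward trend from summary statistics. A more direct check is the rank correlation between the row index and each column, sketched below. The `toy` data frame is an assumption holding only the first five rows shown in this report, so the result says nothing about the full 37-row series.

```r
# Toy data frame: first five rows of the index and the two columns.
toy <- data.frame(
  X           = 1:5,
  agriculture = c(100.0, 101.6, 103.3, 111.5, 116.5),
  commerce    = c(100.0, 133.0, 136.4, 137.5, 146.6)
)

# Spearman rank correlation between row index and each column;
# values near 1 mean the series rises as the row number rises.
trend <- sapply(toy[c("agriculture", "commerce")],
                function(col) cor(toy$X, col, method = "spearman"))
print(trend)
```

On the full data the call would use `CI$X` in place of `toy$X`.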
5. For at least 3 different/distinct values in a column please rename so that every value in that column is renamed. For example, change the letter “e” to “excellent”, the letter “a” to “average” and the word “bad” to “terrible”.
#my data doesn't have strings so I'll make some (each two-value vector is recycled down the 10 rows)
CIA <- CItest
CIA$Country <- c("China","Taiwan")
CIA$YearBeg <- c("Fire","Water")
CIA$YearEnd <- c("Air","Earth")
#rename values in a column
CIB <- CIA
CIB$Country <- replace(CIB$Country, CIB$Country == "China", "PRC")
CIB$YearBeg <- replace(CIB$YearBeg, CIB$YearBeg == "Water", "Shui")
CIB$YearEnd <- replace(CIB$YearEnd, CIB$YearEnd == "Earth", "Metal")
#show data before changes in this chunk and show data after the changes in the next chunk:
CIA[1:5, ]
## ag com Country YearBeg YearEnd
## 1 100.0 100.0 China Fire Air
## 2 101.6 133.0 Taiwan Water Earth
## 3 103.3 136.4 China Fire Air
## 4 111.5 137.5 Taiwan Water Earth
## 5 116.5 146.6 China Fire Air
6. Display enough rows to see examples of all of steps 1-5 above. This means use a function to show me enough row values that I can see the changes.
CIB[1:5, ]
## ag com Country YearBeg YearEnd
## 1 100.0 100.0 PRC Fire Air
## 2 101.6 133.0 Taiwan Shui Metal
## 3 103.3 136.4 PRC Fire Air
## 4 111.5 137.5 Taiwan Shui Metal
## 5 116.5 146.6 PRC Fire Air
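The `replace()` calls above change one value per column. When three or more distinct values in a single column need new labels, as in the step 5 example, a named lookup vector can recode them all in one pass. This is a sketch with a hypothetical `ratings` vector taken from the wording of the prompt, not from the ChinaIncome data.

```r
# Hypothetical column values, using the example from the step 5 prompt.
ratings <- c("e", "a", "bad", "a", "e")

# Lookup vector: element names are the old values, values the new labels.
lookup <- c(e = "excellent", a = "average", bad = "terrible")

# Index the lookup by the old values to recode every element at once.
recoded <- unname(lookup[ratings])

# Values missing from the lookup come back as NA; keep the original there.
recoded[is.na(recoded)] <- ratings[is.na(recoded)]
print(recoded)
```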
Please submit your .rmd file and the .csv file as well as a link to your RPubs.
#Internet references used to complete the assignment: https://www.r-bloggers.com/2016/11/5-ways-to-subset-a-data-frame-in-r/