Week2_Assignment2.rmd

Selected the “The Effect of Punishment Regimes on Crime Rates” (UScrime) dataset from the MASS package, available at https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/MASS/UScrime.csv

BONUS - place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

Figure out syntax to pull from github for BONUS…
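One possible approach to the BONUS step is to pass the file's URL straight to read.csv(), which accepts a URL wherever it accepts a file path. The sketch below uses a hypothetical helper name, read_uscrime, and demonstrates the call on a temporary local file so that it runs without network access. The GitHub address in the comment is the one that appears in the code chunk of this report and has not been verified here.

```r
# Hypothetical helper: read a comma-separated file from a local path or a URL.
read_uscrime <- function(src) {
  read.csv(src, header = TRUE)
}

# With network access, the call would look like this (URL not verified here):
# USCrime <- read_uscrime("https://github.com/JeremyOBrien16/Bridge_R_Week2_Assignment/raw/master/UScrime.csv")

# Offline demonstration: write two rows of Prob and Time values that appear
# later in this report to a temporary file, then read them back.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Prob,Time", "0.084602,26.2011", "0.029599,25.2999"), tmp)
demo <- read_uscrime(tmp)
str(demo)
```

Because the helper only forwards its argument to read.csv(), switching from the local copy to the hosted copy would mean changing the string passed in and nothing else.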

Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes.

Applied summary, mean, and median.

# Intended GitHub read for the BONUS (left commented out; the local copy below is what was run)
# USCrime.fromGitHub <- "https://github.com/JeremyOBrien16/Bridge_R_Week2_Assignment/raw/master/UScrime.csv"
# USCrime <- read.table(file=USCrime.fromGitHub, sep = ",", header=TRUE)
USCrime <- read.table(file="C:/Users/jlobr/OneDrive/Learning/_CUNY_SPS_MSDS/Bridge/R/Projects/MSDS Bridge R/MSDS Bridge R/UScrime.csv", sep = ",", header=TRUE)
summary(USCrime)

##        X              M               So               Ed       
##  Min.   : 1.0   Min.   :119.0   Min.   :0.0000   Min.   : 87.0  
##  1st Qu.:12.5   1st Qu.:130.0   1st Qu.:0.0000   1st Qu.: 97.5  
##  Median :24.0   Median :136.0   Median :0.0000   Median :108.0  
##  Mean   :24.0   Mean   :138.6   Mean   :0.3404   Mean   :105.6  
##  3rd Qu.:35.5   3rd Qu.:146.0   3rd Qu.:1.0000   3rd Qu.:114.5  
##  Max.   :47.0   Max.   :177.0   Max.   :1.0000   Max.   :122.0  
##       Po1             Po2               LF             M.F        
##  Min.   : 45.0   Min.   : 41.00   Min.   :480.0   Min.   : 934.0  
##  1st Qu.: 62.5   1st Qu.: 58.50   1st Qu.:530.5   1st Qu.: 964.5  
##  Median : 78.0   Median : 73.00   Median :560.0   Median : 977.0  
##  Mean   : 85.0   Mean   : 80.23   Mean   :561.2   Mean   : 983.0  
##  3rd Qu.:104.5   3rd Qu.: 97.00   3rd Qu.:593.0   3rd Qu.: 992.0  
##  Max.   :166.0   Max.   :157.00   Max.   :641.0   Max.   :1071.0  
##       Pop               NW              U1               U2       
##  Min.   :  3.00   Min.   :  2.0   Min.   : 70.00   Min.   :20.00  
##  1st Qu.: 10.00   1st Qu.: 24.0   1st Qu.: 80.50   1st Qu.:27.50  
##  Median : 25.00   Median : 76.0   Median : 92.00   Median :34.00  
##  Mean   : 36.62   Mean   :101.1   Mean   : 95.47   Mean   :33.98  
##  3rd Qu.: 41.50   3rd Qu.:132.5   3rd Qu.:104.00   3rd Qu.:38.50  
##  Max.   :168.00   Max.   :423.0   Max.   :142.00   Max.   :58.00  
##       GDP             Ineq            Prob              Time      
##  Min.   :288.0   Min.   :126.0   Min.   :0.00690   Min.   :12.20  
##  1st Qu.:459.5   1st Qu.:165.5   1st Qu.:0.03270   1st Qu.:21.60  
##  Median :537.0   Median :176.0   Median :0.04210   Median :25.80  
##  Mean   :525.4   Mean   :194.0   Mean   :0.04709   Mean   :26.60  
##  3rd Qu.:591.5   3rd Qu.:227.5   3rd Qu.:0.05445   3rd Qu.:30.45  
##  Max.   :689.0   Max.   :276.0   Max.   :0.11980   Max.   :44.00  
##        y         
##  Min.   : 342.0  
##  1st Qu.: 658.5  
##  Median : 831.0  
##  Mean   : 905.1  
##  3rd Qu.:1057.5  
##  Max.   :1993.0

mean(USCrime$Time)

## [1] 26.59792

median(USCrime$Time)

## [1] 25.8006
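The prompt asks for at least two attributes, while the chunk above reports Time only. A minimal sketch of computing both statistics for two columns at once is shown here. It uses the six Prob and Time values printed by head() at the end of this report as stand-in data; on the full data set the same call would be applied to USCrime[, c("Prob", "Time")].

```r
# First six rows of Prob and Time, copied from the head() output in this report.
sample_rows <- data.frame(
  Prob = c(0.084602, 0.029599, 0.083401, 0.015801, 0.041399, 0.034201),
  Time = c(26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9995)
)

# Apply mean() and median() to each selected column.
# The result is a matrix with one row per statistic and one column per attribute.
stats <- sapply(sample_rows[, c("Prob", "Time")],
                function(x) c(mean = mean(x), median = median(x)))
print(stats)
```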

Create a new data frame with a subset of the columns and rows. Make sure to rename it.

Created, named USCrimeSub.df

USCrimeSub.df <- data.frame(USCrime$M, USCrime$So, USCrime$LF, USCrime$Pop, USCrime$U1, USCrime$Ineq, USCrime$Prob, USCrime$Time)
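The prompt asks for a subset of both columns and rows, whereas the statement above selects columns only. A sketch of doing both with base R indexing follows. The data are the first six rows from the head() output in this report, and the row condition (non-Southern states) is purely an illustrative assumption, not a choice made in the assignment.

```r
# First six rows of four columns, copied from the head() output in this report
# (So is coded numerically here: 1 = South, 0 = NotSouth).
sample_rows <- data.frame(
  M    = c(151, 143, 142, 136, 141, 121),
  So   = c(1, 0, 1, 0, 0, 0),
  Prob = c(0.084602, 0.029599, 0.083401, 0.015801, 0.041399, 0.034201),
  Time = c(26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9995)
)

# Rows are selected before the comma, columns after it.
# The row condition (So == 0) is only an illustrative choice.
non_south <- sample_rows[sample_rows$So == 0, c("M", "Prob", "Time")]
print(non_south)
```

Selecting columns by name in this way also keeps the original headers, so the later renaming step starts from M, Prob, and Time rather than from prefixed names.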

Create new column names for the new data frame.

Created natural language column headers.

colnames(USCrimeSub.df) <- c('% M14-24', 'Southern', '%LaborForce', 'StatePop', '%UnemployUrbanM14-24', 'IncomeIneq', 'ProbPrison', 'AvgYearsInPrison')
names(USCrimeSub.df) # checking columns headers correctly applied

## [1] "% M14-24"             "Southern"             "%LaborForce"         
## [4] "StatePop"             "%UnemployUrbanM14-24" "IncomeIneq"          
## [7] "ProbPrison"           "AvgYearsInPrison"
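Headers such as '% M14-24' contain spaces and symbols, so they are not syntactic R names. They can still be used, but with $ they need backticks, or they can be passed as strings to [[ ]]. A small sketch, using two columns of stand-in data taken from the head() output in this report:

```r
# Two columns from the head() output in this report, created with short names.
sample_rows <- data.frame(
  M    = c(151, 143, 142, 136, 141, 121),
  Time = c(26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9995)
)
colnames(sample_rows) <- c("% M14-24", "AvgYearsInPrison")

# A syntactic name works directly with $ ...
print(mean(sample_rows$AvgYearsInPrison))

# ... while a name containing spaces or symbols needs backticks or [[ ]].
m_backtick <- sample_rows$`% M14-24`
m_brackets <- sample_rows[["% M14-24"]]
print(identical(m_backtick, m_brackets))
```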

Use the summary function to create an overview of your new data frame. Then print the mean and median for the same two attributes. Please compare.

The mean and median of the column in question are identical in the initial data set and in the subset data frame (26.59792 and 25.8006 in both).

summary(USCrimeSub.df)

##     % M14-24        Southern       %LaborForce       StatePop     
##  Min.   :119.0   Min.   :0.0000   Min.   :480.0   Min.   :  3.00  
##  1st Qu.:130.0   1st Qu.:0.0000   1st Qu.:530.5   1st Qu.: 10.00  
##  Median :136.0   Median :0.0000   Median :560.0   Median : 25.00  
##  Mean   :138.6   Mean   :0.3404   Mean   :561.2   Mean   : 36.62  
##  3rd Qu.:146.0   3rd Qu.:1.0000   3rd Qu.:593.0   3rd Qu.: 41.50  
##  Max.   :177.0   Max.   :1.0000   Max.   :641.0   Max.   :168.00  
##  %UnemployUrbanM14-24   IncomeIneq      ProbPrison      AvgYearsInPrison
##  Min.   : 70.00       Min.   :126.0   Min.   :0.00690   Min.   :12.20   
##  1st Qu.: 80.50       1st Qu.:165.5   1st Qu.:0.03270   1st Qu.:21.60   
##  Median : 92.00       Median :176.0   Median :0.04210   Median :25.80   
##  Mean   : 95.47       Mean   :194.0   Mean   :0.04709   Mean   :26.60   
##  3rd Qu.:104.00       3rd Qu.:227.5   3rd Qu.:0.05445   3rd Qu.:30.45   
##  Max.   :142.00       Max.   :276.0   Max.   :0.11980   Max.   :44.00

mean(USCrimeSub.df$AvgYearsInPrison)

## [1] 26.59792

median(USCrimeSub.df$AvgYearsInPrison)

## [1] 25.8006
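The comparison can also be made in code rather than by eye. The sketch below assumes two small stand-in data frames (the names USCrime_demo and USCrimeSub_demo are hypothetical) built from the Time values in the head() output, and checks that the statistics agree.

```r
# Time values from the head() output, standing in for the full column.
USCrime_demo <- data.frame(Time = c(26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9995))

# Copy the column into a second data frame under the new header.
USCrimeSub_demo <- data.frame(AvgYearsInPrison = USCrime_demo$Time)

# all.equal() compares numbers within a small tolerance;
# isTRUE() turns its result into a plain TRUE or FALSE.
same_mean   <- isTRUE(all.equal(mean(USCrime_demo$Time),
                                mean(USCrimeSub_demo$AvgYearsInPrison)))
same_median <- isTRUE(all.equal(median(USCrime_demo$Time),
                                median(USCrimeSub_demo$AvgYearsInPrison)))
print(c(same_mean = same_mean, same_median = same_median))
```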

For at least 3 values in a column please rename so that every value in that column is renamed. For example, suppose I have 20 values of the letter “e” in one column. Rename those values so that all 20 would show as “excellent”.

Assigned descriptive labels to the binary values in the Southern column, and a two-level ordinal label (split at 200) to the numeric values in the IncomeIneq column.

USCrimeSub.df$Southern[USCrimeSub.df$Southern==0] <- "NotSouth"
USCrimeSub.df$Southern[USCrimeSub.df$Southern==1] <- "South"
head(USCrimeSub.df$Southern) # check transform of Southern column values

## [1] "South"    "NotSouth" "South"    "NotSouth" "NotSouth" "NotSouth"

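# Classify in one step: assigning "High" first would coerce the column to character,
# so a later < 200 test would become a string comparison rather than a numeric one
USCrimeSub.df$IncomeIneq <- ifelse(USCrimeSub.df$IncomeIneq >= 200, "High", "Low")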
head(USCrimeSub.df$IncomeIneq) # check transform of IncomeIneq column values

## [1] "High" "Low"  "High" "Low"  "Low"  "Low"
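As an alternative sketch, factor() and cut() can do the same recoding while also recording the set of levels and their order, which plain character labels do not carry. The So values below follow the head() output in this report. The Ineq numbers are illustrative values chosen to give the same High/Low pattern and are not quoted from this report.

```r
# So values follow the head() output; the Ineq numbers are illustrative only.
so   <- c(1, 0, 1, 0, 0, 0)
ineq <- c(261, 194, 250, 167, 174, 126)

# factor() maps each numeric code to a label in a single step.
southern <- factor(so, levels = c(0, 1), labels = c("NotSouth", "South"))

# cut() bins numbers; right = FALSE makes the intervals [-Inf, 200) and [200, Inf),
# so a value of exactly 200 is labelled "High", matching the >= 200 rule.
income_ineq <- cut(ineq, breaks = c(-Inf, 200, Inf), right = FALSE,
                   labels = c("Low", "High"))

print(as.character(southern))
print(as.character(income_ineq))
```

Because both results are factors, summary() on them would report counts per level instead of the length and class of a character vector.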

Display enough rows to see examples of all of steps 1-5 above.

Effects visible in initial rows, so used head.

head(USCrimeSub.df)

##   % M14-24 Southern %LaborForce StatePop %UnemployUrbanM14-24 IncomeIneq
## 1      151    South         510       33                  108       High
## 2      143 NotSouth         583       13                   96        Low
## 3      142    South         533       18                   94       High
## 4      136 NotSouth         577      157                  102        Low
## 5      141 NotSouth         591       18                   91        Low
## 6      121 NotSouth         547       25                   84        Low
##   ProbPrison AvgYearsInPrison
## 1   0.084602          26.2011
## 2   0.029599          25.2999
## 3   0.083401          24.3006
## 4   0.015801          29.9012
## 5   0.041399          21.2998
## 6   0.034201          20.9995

Jeremy

January 9, 2018
