R Markdown
- Use the summary function to gain an overview of the data set. Then
display the mean and median for at least two attributes
##Question #1
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/maric/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\maric\AppData\Local\Temp\RtmpSc1GEE\downloaded_packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(readr)
#Setting directory where to look for dataset
getwd()
## [1] "C:/Users/maric/Documents/CUNY Data Science/Bridge/RHW2"
setwd("C:/Users/maric/Documents/CUNY Data Science/Bridge/RHW2")
dataset<-read.csv(file="Fatalities.csv", header=TRUE,sep=",")
View(dataset)
#Summary
summary(dataset)
## X state year spirits
## Min. : 1.00 Length:336 Min. :1982 Min. :0.790
## 1st Qu.: 84.75 Class :character 1st Qu.:1983 1st Qu.:1.300
## Median :168.50 Mode :character Median :1985 Median :1.670
## Mean :168.50 Mean :1985 Mean :1.754
## 3rd Qu.:252.25 3rd Qu.:1987 3rd Qu.:2.013
## Max. :336.00 Max. :1988 Max. :4.900
## unemp income emppop beertax
## Min. : 2.400 Min. : 9514 Min. :42.99 Min. :0.04331
## 1st Qu.: 5.475 1st Qu.:12086 1st Qu.:57.69 1st Qu.:0.20885
## Median : 7.000 Median :13763 Median :61.36 Median :0.35259
## Mean : 7.347 Mean :13880 Mean :60.81 Mean :0.51326
## 3rd Qu.: 8.900 3rd Qu.:15175 3rd Qu.:64.41 3rd Qu.:0.65157
## Max. :18.000 Max. :22193 Max. :71.27 Max. :2.72076
## baptist mormon drinkage dry
## Min. : 0.0000 Min. : 0.1000 Min. :18.00 Min. : 0.00000
## 1st Qu.: 0.6268 1st Qu.: 0.2722 1st Qu.:20.00 1st Qu.: 0.00000
## Median : 1.7492 Median : 0.3931 Median :21.00 Median : 0.08681
## Mean : 7.1569 Mean : 2.8019 Mean :20.46 Mean : 4.26707
## 3rd Qu.:13.1271 3rd Qu.: 0.6293 3rd Qu.:21.00 3rd Qu.: 2.42481
## Max. :30.3557 Max. :65.9165 Max. :21.00 Max. :45.79210
## youngdrivers miles breath jail
## Min. :0.07314 Min. : 4576 Length:336 Length:336
## 1st Qu.:0.17037 1st Qu.: 7183 Class :character Class :character
## Median :0.18539 Median : 7796 Mode :character Mode :character
## Mean :0.18593 Mean : 7891
## 3rd Qu.:0.20219 3rd Qu.: 8504
## Max. :0.28163 Max. :26148
## service fatal nfatal sfatal
## Length:336 Min. : 79.0 Min. : 13.00 Min. : 8.0
## Class :character 1st Qu.: 293.8 1st Qu.: 53.75 1st Qu.: 35.0
## Mode :character Median : 701.0 Median : 135.00 Median : 81.0
## Mean : 928.7 Mean : 182.58 Mean :109.9
## 3rd Qu.:1063.5 3rd Qu.: 212.00 3rd Qu.:131.0
## Max. :5504.0 Max. :1049.00 Max. :603.0
## fatal1517 nfatal1517 fatal1820 nfatal1820
## Min. : 3.00 Min. : 0.00 Min. : 7.0 Min. : 0.00
## 1st Qu.: 25.75 1st Qu.: 4.00 1st Qu.: 38.0 1st Qu.: 11.00
## Median : 49.00 Median :10.00 Median : 82.0 Median : 24.00
## Mean : 62.61 Mean :12.26 Mean :106.7 Mean : 33.53
## 3rd Qu.: 77.00 3rd Qu.:15.25 3rd Qu.:130.2 3rd Qu.: 44.00
## Max. :318.00 Max. :76.00 Max. :601.0 Max. :196.00
## fatal2124 nfatal2124 afatal pop
## Min. : 12.0 Min. : 1.00 Min. : 24.6 Min. : 479000
## 1st Qu.: 42.0 1st Qu.: 13.00 1st Qu.: 90.5 1st Qu.: 1545251
## Median : 97.5 Median : 30.00 Median : 211.6 Median : 3310503
## Mean :126.9 Mean : 41.38 Mean : 293.3 Mean : 4930272
## 3rd Qu.:150.5 3rd Qu.: 49.00 3rd Qu.: 364.0 3rd Qu.: 5751735
## Max. :770.0 Max. :249.00 Max. :2094.9 Max. :28314028
## pop1517 pop1820 pop2124 milestot
## Min. : 21000 Min. : 21000 Min. : 30000 Min. : 3993
## 1st Qu.: 71750 1st Qu.: 76962 1st Qu.: 103500 1st Qu.: 11692
## Median : 163000 Median : 170982 Median : 241000 Median : 28484
## Mean : 230816 Mean : 249090 Mean : 336390 Mean : 37101
## 3rd Qu.: 270500 3rd Qu.: 308311 3rd Qu.: 413000 3rd Qu.: 44140
## Max. :1172000 Max. :1321004 Max. :1892998 Max. :241575
## unempus emppopus gsp
## Min. :5.500 Min. :57.80 Min. :-0.123641
## 1st Qu.:6.200 1st Qu.:57.90 1st Qu.: 0.001182
## Median :7.200 Median :60.10 Median : 0.032413
## Mean :7.529 Mean :59.97 Mean : 0.025313
## 3rd Qu.:9.600 3rd Qu.:61.50 3rd Qu.: 0.056501
## Max. :9.700 Max. :62.30 Max. : 0.142361
#mean for fatalities (column 18) and income (column 6) for all years in dataset (1982-1988)
mean(dataset[,18]) #mean fatalities in years 1982-1988 is 928.66
## [1] 928.6637
mean(dataset[,6]) #mean income in years 1982-1988 is $13880.18
## [1] 13880.18
#median for fatalities and income for all years in dataset (1982-1988)
median(dataset[,18]) #median fatalities in years 1982-1988 is 701
## [1] 701
median(dataset[,6]) #median income in years 1982-1988 is $13,763.13
## [1] 13763.13
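Positional indices such as `dataset[,18]` depend on the column order of the file. A minimal sketch of the same calculation using column names instead, run on a small made-up data frame (`toy` and its values are hypothetical, not taken from Fatalities.csv):

```r
# Toy data frame standing in for the real data set (values are made up)
toy <- data.frame(
  state  = c("al", "az", "ar", "ca"),
  fatal  = c(900, 800, 500, 4000),
  income = c(11000, 13000, 10500, 16000)
)

# Select the columns by name rather than by position
cols <- c("fatal", "income")

# sapply() applies the function to each selected column
col_means   <- sapply(toy[cols], mean)
col_medians <- sapply(toy[cols], median)

print(col_means)
print(col_medians)
```

Selecting by name keeps the code working if columns are added or reordered in the source file.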
- Create a new data frame with a subset of the columns and rows. Make
sure to rename it
#This is a subset containing data from one year (1984) and only four columns: 'state','year','fatal' and 'income'
#This subset provides data on fatalities in US states in the year 1984
class(dataset)
## [1] "data.frame"
dataset<-read.csv(file="Fatalities.csv", header=TRUE,sep=",") #file name must match case on case-sensitive systems
Fatalities_in_1984_by_State<-subset(dataset,year==1984,select=c(state,year,fatal,income)) #create data frame Fatalities_in_1984_by_State
View(Fatalities_in_1984_by_State)
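The same row and column selection can also be written with bracket indexing, which the `subset()` help page recommends over `subset()` for non-interactive code. A minimal sketch on a made-up data frame (`toy` and `toy_1984` are hypothetical names, and the values are illustrative only):

```r
# Toy data frame with two states and two years (values are made up)
toy <- data.frame(
  state  = c("al", "al", "az", "az"),
  year   = c(1983, 1984, 1983, 1984),
  fatal  = c(950, 932, 800, 869),
  income = c(10500, 11100, 12800, 13300),
  unemp  = c(13.7, 11.1, 9.1, 5.0)
)

# Rows: keep only 1984. Columns: keep four named columns.
keep_rows <- toy$year == 1984
keep_cols <- c("state", "year", "fatal", "income")
toy_1984  <- toy[keep_rows, keep_cols]

print(toy_1984)
```

The original row numbers are kept as row names, which is why the subset in this document shows rows 3, 10, 17 and so on.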
- Create new column names for the new data frame
#column name changes are as follows: 'state' changed to 'US_State', 'year' changed to 'Calendar_Year', 'fatal' changed to 'Number_of_Fatalities' and 'income' changed to 'Annual_Income'
colnames(Fatalities_in_1984_by_State)<-c("US_State","Calendar_Year","Number_of_Fatalities","Annual_Income")
View(Fatalities_in_1984_by_State)
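Assigning a full vector to `colnames()` relies on the columns being in a known order. A minimal sketch of renaming through an old-name to new-name mapping instead, so the order does not matter (the `toy` data frame and the `new_names` vector are hypothetical):

```r
# Toy data frame whose columns are deliberately in a different order
toy <- data.frame(
  fatal = c(932, 869),
  state = c("al", "az"),
  year  = c(1984, 1984)
)

# Mapping from old column name (vector names) to new column name (values)
new_names <- c(
  state = "US_State",
  year  = "Calendar_Year",
  fatal = "Number_of_Fatalities"
)

# Look up each existing column name in the mapping
names(toy) <- unname(new_names[names(toy)])

print(names(toy))
```

This sketch assumes every column has an entry in the mapping; a column missing from `new_names` would end up with an `NA` name.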
- Use the summary function to create an overview of your new data
frame. Then print the mean and median for the same two attributes. Please
compare.
#Question 4
#Fatalities_in_1984_by_State<-subset(dataset, dataset$year=='1984',select=c(state,year,fatal))
summary(Fatalities_in_1984_by_State)
## US_State Calendar_Year Number_of_Fatalities Annual_Income
## Length:48 Min. :1984 Min. : 79.0 Min. : 9792
## Class :character 1st Qu.:1984 1st Qu.: 307.5 1st Qu.:11990
## Mode :character Median :1984 Median : 672.5 Median :13498
## Mean :1984 Mean : 915.0 Mean :13583
## 3rd Qu.:1984 3rd Qu.: 978.5 3rd Qu.:14795
## Max. :1984 Max. :5020.0 Max. :18760
#print(Fatalities_in_1984_by_State)
#compare means between data set and subset
#mean income in 1984 is less than mean income for years 1982-1988
mean(dataset[,6]) #mean of income for 48 states in years 1982-1988
## [1] 13880.18
mean(Fatalities_in_1984_by_State[,4]) #mean of income for 48 states in 1984
## [1] 13582.51
#mean fatalities in 1984 is less than mean fatalities for years 1982-1988
mean(dataset[,18]) #mean of fatalities for 48 states in years 1982-1988
## [1] 928.6637
mean(Fatalities_in_1984_by_State[,3]) #mean of fatalities for 48 states in 1984
## [1] 915.0208
#compare medians between data set and subset
#median income in 1984 is less than median income for years 1982-1988
median(dataset[,6]) #median of income for 48 states in years 1982-1988
## [1] 13763.13
median(Fatalities_in_1984_by_State[,4]) #median of income for 48 states in 1984
## [1] 13498.35
#median fatalities in 1984 is less than median fatalities for years 1982-1988
median(dataset[,18]) #median of fatalities for 48 states in years 1982-1988
## [1] 701
median(Fatalities_in_1984_by_State[,3]) #median of fatalities for 48 states in 1984
## [1] 672.5
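The eight separate calls above can be collected into one table so the full data set and the subset sit side by side. A minimal sketch on made-up numbers (`full`, `sub_1984`, and the helper `stat_table()` are hypothetical names, not part of the original analysis):

```r
# Toy "full" data set and its 1984 subset (values are made up)
full <- data.frame(
  year   = c(1983, 1983, 1984, 1984),
  fatal  = c(950, 800, 932, 869),
  income = c(10500, 12800, 11100, 13300)
)
sub_1984 <- full[full$year == 1984, ]

# Helper: one row for the mean and one for the median of each column
stat_table <- function(df, cols) {
  rbind(
    mean   = sapply(df[cols], mean),
    median = sapply(df[cols], median)
  )
}

cols <- c("fatal", "income")
comparison <- cbind(stat_table(full, cols), stat_table(sub_1984, cols))
colnames(comparison) <- c("fatal_all", "income_all", "fatal_1984", "income_1984")

print(comparison)
```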
- For at least 3 values in a column please rename so that every value
in that column is renamed. For example, suppose I have 20 values of the
letter “e” in one column. Rename those values so that all 20 would show
as “excellent”.
#Question 5
#I replaced 3 state abbreviations with the full state names (matching only within the US_State column)
Fatalities_in_1984_by_State$US_State[Fatalities_in_1984_by_State$US_State=='al'] <- 'alabama'
Fatalities_in_1984_by_State$US_State[Fatalities_in_1984_by_State$US_State=='az'] <- 'arizona'
Fatalities_in_1984_by_State$US_State[Fatalities_in_1984_by_State$US_State=='ar'] <- 'arkansas'
View(Fatalities_in_1984_by_State)
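When more than a few values need recoding, a named lookup vector avoids one assignment per value. A minimal sketch, assuming a made-up data frame (`toy`, `state_names`, and the fatality counts are hypothetical):

```r
# Toy data frame (fatality counts are made up)
toy <- data.frame(
  US_State = c("al", "az", "ar", "ca"),
  Number_of_Fatalities = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Lookup vector: names are the abbreviations, values are the full names
state_names <- c(al = "alabama", az = "arizona", ar = "arkansas")

# Recode only the abbreviations present in the lookup; leave others as-is
matched <- toy$US_State %in% names(state_names)
toy$US_State[matched] <- unname(state_names[toy$US_State[matched]])

print(toy)
```

Values without an entry in the lookup, such as "ca" here, are left unchanged.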
- Display enough rows to see examples of all of steps 1-5 above
#Question 6
head(Fatalities_in_1984_by_State,3)
## US_State Calendar_Year Number_of_Fatalities Annual_Income
## 3 alabama 1984 932 11108.79
## 10 arizona 1984 869 13265.93
## 17 arkansas 1984 525 10916.48
- BONUS – place the original .csv in a github file and have R read
from the link. This will be a very useful skill as you progress in your
data science education and career.
#Question 7
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## Warning: package 'tidyverse' is in use and will not be installed
library(tidyverse)
library(readr)
install.packages("RCurl", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/maric/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'RCurl' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\maric\AppData\Local\Temp\RtmpSc1GEE\downloaded_packages
library(RCurl)
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
temp<-getURL("https://raw.githubusercontent.com/goygoyummm/2022_CUNY_DS_Bridge_R/main/Fatalities.csv")
y<- read.csv(text=temp)
View(y) #view the parsed data frame rather than the raw text in temp
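`read.csv()` also accepts a URL as its `file` argument, so the RCurl step is optional. A minimal sketch using the same GitHub link as above; the variable names `fatalities_url` and `fatalities` are hypothetical, and the `tryCatch()` wrapper is an added assumption so the code does not stop when no network connection is available:

```r
# Raw GitHub link to the CSV (same URL as used with getURL above)
fatalities_url <- "https://raw.githubusercontent.com/goygoyummm/2022_CUNY_DS_Bridge_R/main/Fatalities.csv"

# Read straight from the URL; return NULL instead of stopping on failure
fatalities <- tryCatch(
  read.csv(fatalities_url),
  error = function(e) NULL
)

if (is.null(fatalities)) {
  message("Could not read the file; check the URL and the network connection.")
} else {
  print(dim(fatalities))      # number of rows and columns
  print(head(fatalities, 3))  # first three rows
}
```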