I am a student at the University of Cincinnati
Majoring in Business Analytics
I am a Front Desk Assistant in University Housing
I am a beginner in R
I have experience with Python
df <- readr::read_csv("../data/blood_transfusion.csv")
## Rows: 748 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Class
## dbl (4): Recency, Frequency, Monetary, Time
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#What are the dimensions of this data (number of rows and columns)?
dim(df)
## [1] 748 5
#What are the data types of each column?
sapply(df, class)
## Recency Frequency Monetary Time Class
## "numeric" "numeric" "numeric" "numeric" "character"
#Are there any missing values?
sapply(df, function(x) sum(is.na(x)))
## Recency Frequency Monetary Time Class
## 0 0 0 0 0
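As a rough alternative sketch, `colSums(is.na(...))` gives the same per-column count of missing values, and `any(is.na(...))` gives a single yes/no answer. The data frame `toy` below is hypothetical illustration data, not the blood transfusion file.

```r
# Hypothetical toy data (not the blood transfusion data set)
toy <- data.frame(
  Recency = c(2, NA, 1),
  Class   = c("donated", "not donated", NA),
  stringsAsFactors = FALSE
)

# Per-column count of missing values
na_counts <- colSums(is.na(toy))
na_counts

# Single TRUE/FALSE answer for the whole data frame
has_missing <- any(is.na(toy))
has_missing
```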
#Check out the first 10 rows? What are the Class values for the first 10 observations?
head(df, 10)
## # A tibble: 10 × 5
## Recency Frequency Monetary Time Class
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2 50 12500 98 donated
## 2 0 13 3250 28 donated
## 3 1 16 4000 35 donated
## 4 2 20 5000 45 donated
## 5 1 24 6000 77 not donated
## 6 4 4 1000 4 not donated
## 7 2 7 1750 14 donated
## 8 1 12 3000 35 not donated
## 9 2 9 2250 22 donated
## 10 5 46 11500 98 donated
#Check out the last 10 rows? What are the Class values for the last 10 observations?
tail(df, 10)
## # A tibble: 10 × 5
## Recency Frequency Monetary Time Class
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 23 1 250 23 not donated
## 2 23 4 1000 52 not donated
## 3 23 1 250 23 not donated
## 4 23 7 1750 88 not donated
## 5 16 3 750 86 not donated
## 6 23 2 500 38 not donated
## 7 21 2 500 52 not donated
## 8 23 3 750 62 not donated
## 9 39 1 250 39 not donated
## 10 72 1 250 72 not donated
#Index for the 100th row and just the Monetary column. What is the value?
df[100, 'Monetary']
## # A tibble: 1 × 1
## Monetary
## <dbl>
## 1 1750
#Index for just the Monetary column. What is the mean of this vector?
mean(df[['Monetary']])
## [1] 1378.676
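The double brackets matter here: `[[` pulls the column out as a plain vector that `mean()` can work on, while single brackets keep a one-column table. A minimal sketch of the difference, using a hypothetical `toy` data frame rather than the real data:

```r
# Hypothetical toy data standing in for the real data set
toy <- data.frame(Monetary = c(250, 500, 750))

as_table  <- toy["Monetary"]    # single brackets: still a data frame
as_vector <- toy[["Monetary"]]  # double brackets: a plain numeric vector

is.data.frame(as_table)
is.numeric(as_vector)

# mean() expects a vector, so use the double-bracket form
toy_mean <- mean(as_vector)
toy_mean
```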
#Subset this data frame for all observations where Monetary is greater than the mean value. How many rows are in the resulting data frame?
above_avg <- df[['Monetary']] > mean(df[['Monetary']])
df[above_avg, 'Monetary']
## # A tibble: 267 × 1
## Monetary
## <dbl>
## 1 12500
## 2 3250
## 3 4000
## 4 5000
## 5 6000
## 6 1750
## 7 3000
## 8 2250
## 9 11500
## 10 5750
## # ℹ 257 more rows
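The row count can also be obtained directly instead of being read off the printed tibble header. The sketch below assumes a hypothetical `toy` data frame; summing a logical vector counts its `TRUE` values, and `nrow()` on the subset gives the same number.

```r
# Hypothetical toy data (values chosen only for illustration)
toy <- data.frame(Monetary = c(250, 500, 750, 4000))

# Logical vector: TRUE where Monetary exceeds the column mean
above_avg_toy <- toy[["Monetary"]] > mean(toy[["Monetary"]])

# Count the TRUE values directly
n_above <- sum(above_avg_toy)
n_above

# Same answer via the number of rows in the subset
n_rows <- nrow(toy[above_avg_toy, , drop = FALSE])
n_rows
```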
df <- readr::read_table('http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt', col_names = c('Month', 'Day', 'Year', 'Avg_temp'))
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Month = col_double(),
## Day = col_double(),
## Year = col_double(),
## Avg_temp = col_double()
## )
## Warning: 3 parsing failures.
## row col expected actual file
## 4623 -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'
## 5016 -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'
## 5213 -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'
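The warning above reports rows that contain 5 fields where 4 were expected. One possible way to locate such lines before parsing is base R's `count.fields()`. This is only a sketch: the `raw_lines` values are made up for illustration and are not taken from the weather file.

```r
# Hypothetical whitespace-separated lines; the third has an extra field
raw_lines <- c(
  " 1  1  1995  41.1",
  " 1  2  1995  22.2",
  " 1  3  1995  22.8  7",
  " 1  4  1995  14.9"
)

# Count whitespace-separated fields on each line
con <- textConnection(raw_lines)
n_fields <- count.fields(con)
close(con)

# Positions of lines that do not have exactly 4 fields
bad_rows <- which(n_fields != 4)
bad_rows
```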
dim(df)
## [1] 9265 4
The columns give the date (month, day, year) and the average daily temperature in Cincinnati, from 1st January 1995 to 13th May 2020.
sum(is.na(df))
## [1] 0
df[365, c("Day", "Avg_temp")]
## # A tibble: 1 × 2
## Day Avg_temp
## <dbl> <dbl>
## 1 31 39.3
jan_2000 <- df[df[['Month']] == 1 & df[['Year']] == 2000, ]
median(jan_2000[['Avg_temp']])
## [1] 27.1
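The same kind of two-condition filter can be sketched with base R's `subset()`, which lets the column names be used directly. The `toy` data frame below is hypothetical and only mimics the layout of the weather data.

```r
# Hypothetical toy data with the same column names as the weather data
toy <- data.frame(
  Month    = c(1, 1, 1, 2, 1),
  Year     = c(2000, 2000, 2000, 2000, 2001),
  Avg_temp = c(30, 20, 25, 40, 50)
)

# Keep only rows where both conditions hold
jan_2000_toy <- subset(toy, Month == 1 & Year == 2000)

# Median temperature within the filtered rows
jan_median <- median(jan_2000_toy[["Avg_temp"]])
jan_median
```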
df[which.max(df[['Avg_temp']]),]
## # A tibble: 1 × 4
## Month Day Year Avg_temp
## <dbl> <dbl> <dbl> <dbl>
## 1 7 7 2012 89.2
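Note that `which.max()` and `which.min()` return only the first position of the extreme value. If several rows might share that value, comparing against `max()` or `min()` finds all of them. A small sketch with a hypothetical vector `temps`:

```r
# Hypothetical temperatures with a tie for the maximum
temps <- c(70, 89, 65, 89)

# which.max() reports only the first position of the maximum
first_max <- which.max(temps)
first_max

# Comparing against max() finds every position that ties
all_max <- which(temps == max(temps))
all_max
```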
df[which.min(df[['Avg_temp']]),]
## # A tibble: 1 × 4
## Month Day Year Avg_temp
## <dbl> <dbl> <dbl> <dbl>
## 1 12 24 1998 -99
The coldest average temperature, -99, was recorded on 24th December 1998. This value does not make sense as a real temperature, which suggests it is a placeholder for missing data.
# Are there more than just one date that has this temperature value recorded? If so, how many?
sum(df[['Avg_temp']] == -99)
## [1] 14
Yes, there are more. The average temperature was recorded as -99 on 14 dates.
# Compute the mean of the average temp column. Now re-code all -99s to NA and recompute the mean
mean(df[['Avg_temp']])
## [1] 54.39876
bad_values <- df[['Avg_temp']] == -99
df[bad_values, 'Avg_temp'] <- NA
mean(df[['Avg_temp']], na.rm = TRUE)
## [1] 54.6309
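The effect of a placeholder value on the mean is easier to see on a tiny example. The sketch below uses a hypothetical vector `temps` and base R's `replace()` as one possible way to recode the -99 entries to `NA`.

```r
# Hypothetical temperatures; -99 stands in for a missing reading
temps <- c(50, -99, 60, 70)

# The placeholder drags the mean down
mean_raw <- mean(temps)
mean_raw

# Recode the placeholder to NA, then ignore NAs when averaging
temps_clean <- replace(temps, temps == -99, NA)
mean_clean <- mean(temps_clean, na.rm = TRUE)
mean_clean
```

Without `na.rm = TRUE`, `mean()` would return `NA` once the vector contains missing values.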