Synopsis of Me

I am a student at the University of Cincinnati.

Academic Background

I am majoring in Business Analytics.

Professional Background

I am a Front Desk Assistant in University Housing.

Experience with R

I am a beginner in R.

Experience with other analytic software

I have experience with Python.

df <- readr::read_csv("../data/blood_transfusion.csv")
## Rows: 748 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Class
## dbl (4): Recency, Frequency, Monetary, Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#What are the dimensions of this data (number of rows and columns)?

dim(df)
## [1] 748   5

#What are the data types of each column?

colnames(df)
## [1] "Recency"   "Frequency" "Monetary"  "Time"      "Class"
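
Note that colnames() returns only the column names. The column specification printed by read_csv() already gives the types: Class is character, and Recency, Frequency, Monetary and Time are double. As a minimal sketch, the types could also be queried directly with sapply(). The small data frame `example_df` is a hypothetical stand-in built from the first three rows of the data, not the real file.

```r
# Hypothetical stand-in for df, built from the first three rows of the data
example_df <- data.frame(
  Recency   = c(2, 0, 1),
  Frequency = c(50, 13, 16),
  Monetary  = c(12500, 3250, 4000),
  Time      = c(98, 28, 35),
  Class     = c("donated", "donated", "donated"),
  stringsAsFactors = FALSE
)

# class() of each column, returned as a named character vector
col_types <- sapply(example_df, class)
print(col_types)
```

On the real data the same call would be sapply(df, class); str(df) is another option that also previews the values.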

#Are there any missing values?

sapply(df, function(x) sum(is.na(x)))
##   Recency Frequency  Monetary      Time     Class 
##         0         0         0         0         0

#Check out the first 10 rows? What are the Class values for the first 10 observations?

head(df,10)
## # A tibble: 10 × 5
##    Recency Frequency Monetary  Time Class      
##      <dbl>     <dbl>    <dbl> <dbl> <chr>      
##  1       2        50    12500    98 donated    
##  2       0        13     3250    28 donated    
##  3       1        16     4000    35 donated    
##  4       2        20     5000    45 donated    
##  5       1        24     6000    77 not donated
##  6       4         4     1000     4 not donated
##  7       2         7     1750    14 donated    
##  8       1        12     3000    35 not donated
##  9       2         9     2250    22 donated    
## 10       5        46    11500    98 donated

#Check out the last 10 rows? What are the Class values for the last 10 observations?

tail(df,10)
## # A tibble: 10 × 5
##    Recency Frequency Monetary  Time Class      
##      <dbl>     <dbl>    <dbl> <dbl> <chr>      
##  1      23         1      250    23 not donated
##  2      23         4     1000    52 not donated
##  3      23         1      250    23 not donated
##  4      23         7     1750    88 not donated
##  5      16         3      750    86 not donated
##  6      23         2      500    38 not donated
##  7      21         2      500    52 not donated
##  8      23         3      750    62 not donated
##  9      39         1      250    39 not donated
## 10      72         1      250    72 not donated

#Index for the 100th row and just the Monetary column. What is the value?

df[100, 'Monetary']
## # A tibble: 1 × 1
##   Monetary
##      <dbl>
## 1     1750

#Index for just the Monetary column. What is the mean of this vector?

mean(df[['Monetary']])
## [1] 1378.676

#Subset this data frame for all observations where Monetary is greater than the mean value. How many rows are in the resulting data frame?

above_avg <- df[['Monetary']]>mean(df[['Monetary']])
df[above_avg, 'Monetary']
## # A tibble: 267 × 1
##    Monetary
##       <dbl>
##  1    12500
##  2     3250
##  3     4000
##  4     5000
##  5     6000
##  6     1750
##  7     3000
##  8     2250
##  9    11500
## 10     5750
## # ℹ 257 more rows
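
The tibble header above reports 267 rows. As a minimal sketch, the count can also be returned as a number instead of being read off the printout, either by summing the logical index or by calling nrow() on the subset. The data frame `example_df` and its five values are hypothetical.

```r
# Hypothetical stand-in for df with five Monetary values (mean is 4050)
example_df <- data.frame(Monetary = c(12500, 6000, 250, 500, 1000))

# Logical index: TRUE where Monetary exceeds the column mean
above_avg <- example_df[["Monetary"]] > mean(example_df[["Monetary"]])

# Two equivalent ways to count the matching rows
n_by_sum  <- sum(above_avg)                               # TRUE counts as 1
n_by_nrow <- nrow(example_df[above_avg, , drop = FALSE])  # rows in the subset
```

Leaving the column position empty, as in df[above_avg, ], keeps every column of the matching rows rather than only Monetary.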

2. Go to this University of Dayton webpage (http://academic.udayton.edu/kissock/http/Weather/citylistUS.htm), scroll down to Ohio and import the Cincinnati (OHCINCIN.txt) file.

df <- readr::read_table('http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt', col_names = c('Month', 'Day', 'Year', 'Avg_temp'))
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Month = col_double(),
##   Day = col_double(),
##   Year = col_double(),
##   Avg_temp = col_double()
## )
## Warning: 3 parsing failures.
##  row col  expected    actual                                                                           file
## 4623  -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'
## 5016  -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'
## 5213  -- 4 columns 5 columns 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt'

What are the dimensions of this data (number of rows and columns)?

dim(df)
## [1] 9265    4

What do you think these columns represent?

The first three columns give the month, day and year of each observation, and the fourth gives the average daily temperature in Cincinnati. The data runs from 1st January 1995 to 13th May 2020.

Are there any missing values in this data?

sum(is.na(df))
## [1] 0

Index for the 365th row. What is the date of this observation and what was the average temperature?

df[365,c("Day", "Avg_temp")]
## # A tibble: 1 × 2
##     Day Avg_temp
##   <dbl>    <dbl>
## 1    31     39.3
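
The selection above returns only the day, so the month and year of the observation are not visible in the output. As a sketch, selecting Month, Day and Year alongside Avg_temp shows the complete date. The two-row data frame `example_df` is a hypothetical stand-in; its values are illustrative and not taken from the file.

```r
# Hypothetical two-row stand-in for the weather data
example_df <- data.frame(
  Month    = c(1, 1),
  Day      = c(1, 2),
  Year     = c(2000, 2000),
  Avg_temp = c(30.5, 28.0)
)

# Select one row and all three date columns plus the temperature
row_2 <- example_df[2, c("Month", "Day", "Year", "Avg_temp")]
print(row_2)
```

On the real data the equivalent call would be df[365, ], which keeps all four columns.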

Subset for all observations that happened during January of 2000. What was the median average temp for this month?

jan_2000 <- df[df[['Month']] == 1 & df[['Year']] == 2000, ]
median(jan_2000[['Avg_temp']])
## [1] 27.1

Which date was the highest average temp recorded (hint: which.max)?

df[which.max(df[['Avg_temp']]),]
## # A tibble: 1 × 4
##   Month   Day  Year Avg_temp
##   <dbl> <dbl> <dbl>    <dbl>
## 1     7     7  2012     89.2

Which date was the coldest average temp recorded? Does this temp make sense?

df[which.min(df[['Avg_temp']]),]
## # A tibble: 1 × 4
##   Month   Day  Year Avg_temp
##   <dbl> <dbl> <dbl>    <dbl>
## 1    12    24  1998      -99

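The coldest average temp was recorded on 24th December 1998. This temp does not make sense: a daily average of -99 is not a plausible reading for Cincinnati and looks like a placeholder for a missing value.

Are there more than just one date that has this temperature value recorded? If so, how many?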

sum(df[['Avg_temp']] == -99)
## [1] 14

Yes, there are more. The average temperature was recorded as -99 on 14 dates.

Compute the mean of the average temp column. Now re-code all -99s to NA and recompute the mean.

mean(df[['Avg_temp']])
## [1] 54.39876
bad_values <- df[['Avg_temp']] == -99
df[bad_values, 'Avg_temp'] <- NA
mean(df[['Avg_temp']], na.rm = TRUE)
## [1] 54.6309
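
To see why the recode raises the mean, here is a minimal sketch on a hypothetical three-value vector `temps`. It applies the same steps as above: compare against -99, assign NA, then average with na.rm = TRUE.

```r
# Hypothetical readings: two real temperatures and one -99 placeholder
temps <- c(30, 40, -99)

# The placeholder is treated as a real reading and drags the mean down
mean_with_placeholder <- mean(temps)

# Re-code the placeholder to NA, then ignore NAs when averaging
temps[temps == -99] <- NA
mean_without_placeholder <- mean(temps, na.rm = TRUE)
```

Without na.rm = TRUE, mean() returns NA as soon as the vector contains a missing value.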