Lab Assignment: Getting to Know Me

About Me

Hello! My name is Dustin Littrell. I work for a local homebuilder. I am a U.S. Army Veteran, where I was land surveyor and draftsman. I love using excel and learning new things. I wanted to continue to learn and explore how to be better at my job, which led to an opportunity to do a Data Analysis program with a local non-profit called Code Kentucky.This peaked my interest and is why I am now in the MS BANA program at the University of Cincinnati. Personally I love to spend time with my family. I have two young children Georgia(7) and Beau(5). I coach my son’s T-ball team. I love that interaction with him and I love going to my daughter’s soccer games. I love to play Fantasy Football and the analysis that comes along with that. I enjoy golf even though I am terrible. Random Fact: I am a coin nerd. After I got out of the Army I worked at my best friend’s pawn shop for about 10 years. Although it was a pawnshop the company history was rooted in Coin Collecting. My focus was United States silver coins 1964 and earlier, Pre-1928 U.S. silver and gold coins, and U.S. paper money.

Academic Background

I am a late bloomer in terms of education. Right out of high school I worked on my 2 year associates degree. When I finished my 2 year degree, life happened and I enlisted in the Ohio Army National Guard. I deployed and delayed going back to school. I finally went back to school in my 30s and received my Bachelor’s degree in Accounting.

A.S. in Computer Networking (Cincinnati State Technical College, Cincinnati, OH)
B.S. in Accounting (Mount Saint Joseph University, Cincinnati, OH)
M.S. in Business Analytics(Expected) (University of Cincinnati, Cincinnati, OH)

Professional Background

After I received my Bachelor’s degree. I started working for Drees Homes. I was a cost estimator, cost analyst, and I am currently Land Analytics Manager.

Skills and Experience: - Bachelor’s Degree in Accounting - Cost Analysis - Excel - JD Edwards - Cognos

Experience with R

I have little experience with R. I did a bootcamp style programming with a local non-profit named Code Kentucky in 2022, but this was only briefly covered R. The course consisted of two 12 week sessions where we learned basic Python, Git, and GitHub. I completed this as a pre-requisite for the MS-BANA program. Prior to this, my primary experience with data analysis is with excel via Excel Visual basic, Excel Forecast/Analyze tools, pivot tables, and Power Pivot.

Tools: - Jupyter Notebook - VS Code - GitHub - Google Colab

Thanks for taking the time to read about me!

Lab

Part 1: Importing Data

Question 1

Fill in the blanks below to import the blood_transfusion.csv file (provided via Canvas) and answer the following questions.

# Importing the data
df<- read.csv(here(file_path, blood_data))

What are the dimensions of this data (number of rows and columns)?

dim(df)

## [1] 748   5

What are the data types of each column?

str(df)

## 'data.frame':    748 obs. of  5 variables:
##  $ Recency  : int  2 0 1 2 1 4 2 1 2 5 ...
##  $ Frequency: int  50 13 16 20 24 4 7 12 9 46 ...
##  $ Monetary : int  12500 3250 4000 5000 6000 1000 1750 3000 2250 11500 ...
##  $ Time     : int  98 28 35 45 77 4 14 35 22 98 ...
##  $ Class    : chr  "donated" "donated" "donated" "donated" ...

Are there any missing values?

sum(is.na(df))

## [1] 0

Check out the first 10 rows? What are the Class values for the first 10 observations?

head(df, 10)

##    Recency Frequency Monetary Time       Class
## 1        2        50    12500   98     donated
## 2        0        13     3250   28     donated
## 3        1        16     4000   35     donated
## 4        2        20     5000   45     donated
## 5        1        24     6000   77 not donated
## 6        4         4     1000    4 not donated
## 7        2         7     1750   14     donated
## 8        1        12     3000   35 not donated
## 9        2         9     2250   22     donated
## 10       5        46    11500   98     donated

Check out the last 10 rows? What are the Class values for the last 10 observations?

tail(df, 10)

##     Recency Frequency Monetary Time       Class
## 739      23         1      250   23 not donated
## 740      23         4     1000   52 not donated
## 741      23         1      250   23 not donated
## 742      23         7     1750   88 not donated
## 743      16         3      750   86 not donated
## 744      23         2      500   38 not donated
## 745      21         2      500   52 not donated
## 746      23         3      750   62 not donated
## 747      39         1      250   39 not donated
## 748      72         1      250   72 not donated

Index for the 100th row and just the Monetary column. What is the value?

df[100, "Monetary"]

## [1] 1750

Index for just the Monetary column. What is the mean of this vector?

mean_value <- mean(df$Monetary)
mean_value

## [1] 1378.676

Subset this df frame for all observations where Monetary is greater than the mean value. How many rows are in the resulting df frame?

subset_df <- subset(df, Monetary > mean_value)
nrow(subset_df)

## [1] 267

###{ Question 2

df_2 <- readr::read_table(here(file_path, weather_data), col_names = c('Month', 'Day', 'Year', 'Avg_temp'))

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Month = col_double(),
##   Day = col_double(),
##   Year = col_double(),
##   Avg_temp = col_double()
## )

## Warning: 3 parsing failures.
##  row col  expected    actual                                                                                                                      file
## 4623  -- 4 columns 5 columns 'C:/Users/dlittrell/OneDrive - The Drees Company/Documents/7025_Data Wrangling/7025_Project/Lab_2/lab2_data/OHCINCIN.txt'
## 5016  -- 4 columns 5 columns 'C:/Users/dlittrell/OneDrive - The Drees Company/Documents/7025_Data Wrangling/7025_Project/Lab_2/lab2_data/OHCINCIN.txt'
## 5213  -- 4 columns 5 columns 'C:/Users/dlittrell/OneDrive - The Drees Company/Documents/7025_Data Wrangling/7025_Project/Lab_2/lab2_data/OHCINCIN.txt'

What are the dimensions of this data (number of rows and columns)?

dim(df_2)

## [1] 9265    4

What do you think these columns represent?

The columns likely represent the following: - Month: The month of the observation - Day: The day of the observation - Year: The year of the observation - Avg_temp: The average temperature for the day

Are there any missing values in this data?

sum(is.na(df_2))

## [1] 0

Index for the 365th row. What is the date of this observation and what was the average temperature?

df_2[365, ]

## # A tibble: 1 × 4
##   Month   Day  Year Avg_temp
##   <dbl> <dbl> <dbl>    <dbl>
## 1    12    31  1995     39.3

Subset for all observations that happened during January of 2000. What was the median average temp for this month?

jan_2000 <- df_2[df_2$Month == 1 & df_2$Year == 2000, ]
median(jan_2000$Avg_temp)

## [1] 27.1

Which date was the highest average temp recorded?

df_2[which.max(df_2$Avg_temp), ]

## # A tibble: 1 × 4
##   Month   Day  Year Avg_temp
##   <dbl> <dbl> <dbl>    <dbl>
## 1     7     7  2012     89.2

Which date was the coldest average temp recorded? Does this temp make sense? Are there more than just one date that has this temperature value recorded? If so, how many?

min_temp <- min(df_2$Avg_temp)
coldest_dates <- df_2[df_2$Avg_temp == min_temp, ]
nrow(coldest_dates)

## [1] 14

Compute the mean of the average temp column.

mean(df_2$Avg_temp)

## [1] 54.39876

Recode -99s to NA and recompute the mean.

df_2[df_2$Avg_temp == -99, "Avg_temp"] <- NA
mean(df_2$Avg_temp, na.rm = TRUE)

## [1] 54.6309

Question 3

df_3 <- read.csv(here(file_path, police_data))

What are the dimensions of this data (number of rows and columns)?

dim(df_3)

## [1] 15155    40

What do you think these columns represent?

(You might need to manually describe the columns based on your understanding and the context provided by the data portal website.)

Are there any missing values in this data? If so, how many missing values are in each column? Which column has the most missing values?

# Calculate missing values, considering both NA and "UNKNOWN"
missing_values <- sapply(df_3, function(col) {
  if (is.numeric(col) || inherits(col, "Date")) {
    return(sum(is.na(col)))
  } else {
    return(sum(is.na(col) | col == "UNKNOWN" | col == ""))
  }
})

# Print missing values
missing_values

##                     INSTANCEID                    INCIDENT_NO 
##                              0                              0 
##                  DATE_REPORTED                      DATE_FROM 
##                              0                              2 
##                        DATE_TO                           CLSD 
##                              9                            545 
##                            UCR                            DST 
##                             10                              0 
##                           BEAT                        OFFENSE 
##                             28                             10 
##                       LOCATION                     THEFT_CODE 
##                              2                          10167 
##                          FLOOR                           SIDE 
##                          14127                          14120 
##                        OPENING                      HATE_BIAS 
##                          14508                              0 
##                      DAYOFWEEK                       RPT_AREA 
##                            423                            239 
##               CPD_NEIGHBORHOOD                        WEAPONS 
##                            249                              5 
##              DATE_OF_CLEARANCE                      HOUR_FROM 
##                           2613                              2 
##                        HOUR_TO                      ADDRESS_X 
##                              9                            148 
##                    LONGITUDE_X                     LATITUDE_X 
##                           1714                           1714 
##                     VICTIM_AGE                    VICTIM_RACE 
##                           2283                           2662 
##               VICTIM_ETHNICITY                  VICTIM_GENDER 
##                           2829                           2214 
##                    SUSPECT_AGE                   SUSPECT_RACE 
##                           9003                           7992 
##              SUSPECT_ETHNICITY                 SUSPECT_GENDER 
##                           8722                           7702 
##             TOTALNUMBERVICTIMS                  TOTALSUSPECTS 
##                             33                           7082 
##                      UCR_GROUP                            ZIP 
##                             10                              1 
## COMMUNITY_COUNCIL_NEIGHBORHOOD               SNA_NEIGHBORHOOD 
##                              0                              0

# Identify the column with the most missing values
column_most_missing <- names(which.max(missing_values))
column_most_missing

## [1] "OPENING"

Using the DATE_REPORTED column, what is the range of dates included in this data?

range(as.Date(df_3$DATE_REPORTED, format="%m/%d/%Y")) # Adjust format if necessary

## [1] "2022-01-01" "2022-06-26"

Using table(), what is the most common age range for known SUSPECT_AGEs?

known_ages <- table(df_3$SUSPECT_AGE[df_3$SUSPECT_AGE != "UNKNOWN"])
most_common_known_age <- as.character(names(which.max(known_ages)))
frequency_known_age <- known_ages[most_common_known_age]
list(Age = most_common_known_age, Frequency = frequency_known_age)

## $Age
## [1] "18-25"
## 
## $Frequency
## 18-25 
##  1778

Use table() to get the number of incidents per zip code. Sort this table for those zip codes with the most activity to the least activity. Which zip code has the most incidents? Do you see any peculiar data quality issues with any of these zip code values?

zip_activity <- table(df_3$ZIP_CODE)
sorted_zip_activity <- sort(zip_activity, decreasing = TRUE)
sorted_zip_activity

## integer(0)

Using the DAYOFWEEK column, which day do most incidents occur on? What is the proportion of incidents that fall on this day?

day_activity <- table(df_3$DAYOFWEEK)
most_common_day <- names(day_activity)[which.max(day_activity)]
proportion <- max(day_activity) / sum(day_activity)
most_common_day

## [1] "SATURDAY"

proportion

## [1] 0.1499175

Looking at the information this data set provides, what are some insights you’d be interested in assessing?

(Provide a written response based on your understanding and interest.)

Analyze three different columns that could start to provide you with these insights. Are there missing values in these columns? What are some summary statistics you can compute for these columns? Are there any outliers or aberrant values in these columns? How do you know? Would you remove or recode them?

(Choose three columns of interest, then provide code and written responses for each column’s analysis.)

# Example analysis for one column (repeat for two other columns)
column1 <- df_3$COLUMN_NAME # Replace COLUMN_NAME with the actual column name
summary(column1)

## Length  Class   Mode 
##      0   NULL   NULL

# ... Add further analysis like checking for outliers, etc.

Lab Assignment: Getting to Know Me - Week 2 Lab

Dustin Littrell

2023-09-03