Hello! My name is Dustin Littrell. I work for a local homebuilder. I am a U.S. Army veteran; I served as a land surveyor and draftsman. I love using Excel and learning new things. I wanted to keep learning and explore how to be better at my job, which led to an opportunity to do a data analysis program with a local non-profit called Code Kentucky. This piqued my interest and is why I am now in the MS BANA program at the University of Cincinnati. Personally, I love to spend time with my family. I have two young children, Georgia (7) and Beau (5). I coach my son's T-ball team, and I love that interaction with him, as well as going to my daughter's soccer games. I love to play fantasy football and the analysis that comes along with it. I enjoy golf even though I am terrible. Random fact: I am a coin nerd. After I got out of the Army, I worked at my best friend's pawn shop for about 10 years. Although it was a pawn shop, the company's history was rooted in coin collecting. My focus was United States silver coins from 1964 and earlier, pre-1928 U.S. silver and gold coins, and U.S. paper money.
I am a late bloomer in terms of education. Right out of high school I worked on my two-year associate's degree. When I finished it, life happened and I enlisted in the Ohio Army National Guard. I deployed and delayed going back to school. I finally went back in my 30s and received my Bachelor's degree in Accounting.
After I received my Bachelor's degree, I started working for Drees Homes. I have worked there as a cost estimator and cost analyst, and I am currently the Land Analytics Manager.
Skills and Experience:

- Bachelor's Degree in Accounting
- Cost Analysis
- Excel
- JD Edwards
- Cognos
I have little experience with R. I did a bootcamp-style programming course with a local non-profit named Code Kentucky in 2022, but it only briefly covered R. The course consisted of two 12-week sessions where we learned basic Python, Git, and GitHub. I completed this as a prerequisite for the MS-BANA program. Prior to this, my primary experience with data analysis was in Excel, via Visual Basic, the Forecast/Analyze tools, pivot tables, and Power Pivot.
Tools:

- Jupyter Notebook
- VS Code
- GitHub
- Google Colab
Thanks for taking the time to read about me!
Import the blood_transfusion.csv file (provided via Canvas) and answer the following questions.

# Importing the data
# file_path and blood_data are assumed to be defined in an earlier setup chunk
df <- read.csv(here(file_path, blood_data))
dim(df)
## [1] 748 5
str(df)
## 'data.frame': 748 obs. of 5 variables:
## $ Recency : int 2 0 1 2 1 4 2 1 2 5 ...
## $ Frequency: int 50 13 16 20 24 4 7 12 9 46 ...
## $ Monetary : int 12500 3250 4000 5000 6000 1000 1750 3000 2250 11500 ...
## $ Time : int 98 28 35 45 77 4 14 35 22 98 ...
## $ Class : chr "donated" "donated" "donated" "donated" ...
sum(is.na(df))
## [1] 0
head(df, 10)
## Recency Frequency Monetary Time Class
## 1 2 50 12500 98 donated
## 2 0 13 3250 28 donated
## 3 1 16 4000 35 donated
## 4 2 20 5000 45 donated
## 5 1 24 6000 77 not donated
## 6 4 4 1000 4 not donated
## 7 2 7 1750 14 donated
## 8 1 12 3000 35 not donated
## 9 2 9 2250 22 donated
## 10 5 46 11500 98 donated
tail(df, 10)
## Recency Frequency Monetary Time Class
## 739 23 1 250 23 not donated
## 740 23 4 1000 52 not donated
## 741 23 1 250 23 not donated
## 742 23 7 1750 88 not donated
## 743 16 3 750 86 not donated
## 744 23 2 500 38 not donated
## 745 21 2 500 52 not donated
## 746 23 3 750 62 not donated
## 747 39 1 250 39 not donated
## 748 72 1 250 72 not donated
df[100, "Monetary"]
## [1] 1750
mean_value <- mean(df$Monetary)
mean_value
## [1] 1378.676
subset_df <- subset(df, Monetary > mean_value)
nrow(subset_df)
## [1] 267
### Question 2
df_2 <- readr::read_table(here(file_path, weather_data), col_names = c('Month', 'Day', 'Year', 'Avg_temp'))
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Month = col_double(),
## Day = col_double(),
## Year = col_double(),
## Avg_temp = col_double()
## )
## Warning: 3 parsing failures.
## row col expected actual file
## 4623 -- 4 columns 5 columns 'C:/Users/dlittrell/OneDrive - The Drees Company/Documents/7025_Data Wrangling/7025_Project/Lab_2/lab2_data/OHCINCIN.txt'
## 5016 -- 4 columns 5 columns 'C:/Users/dlittrell/OneDrive - The Drees Company/Documents/7025_Data Wrangling/7025_Project/Lab_2/lab2_data/OHCINCIN.txt'
## 5213 -- 4 columns 5 columns 'C:/Users/dlittrell/OneDrive - The Drees Company/Documents/7025_Data Wrangling/7025_Project/Lab_2/lab2_data/OHCINCIN.txt'
dim(df_2)
## [1] 9265 4
The columns likely represent the following (a short sketch after this list shows how they can be combined into a single date column):

- Month: the month of the observation
- Day: the day of the observation
- Year: the year of the observation
- Avg_temp: the average temperature for the day
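Because the date is split across three numeric columns, it can be handy to build one Date column from them. This is a minimal sketch in base R, assuming df_2 has been read in as above; the obs_date column name is only illustrative:

# Combine Year, Month, and Day into a single Date column (illustrative name: obs_date)
df_2$obs_date <- as.Date(paste(df_2$Year, df_2$Month, df_2$Day, sep = "-"))

# Sanity check: the date range should match the observation period
range(df_2$obs_date)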
sum(is.na(df_2))
## [1] 0
df_2[365, ]
## # A tibble: 1 × 4
## Month Day Year Avg_temp
## <dbl> <dbl> <dbl> <dbl>
## 1 12 31 1995 39.3
jan_2000 <- df_2[df_2$Month == 1 & df_2$Year == 2000, ]
median(jan_2000$Avg_temp)
## [1] 27.1
df_2[which.max(df_2$Avg_temp), ]
## # A tibble: 1 × 4
## Month Day Year Avg_temp
## <dbl> <dbl> <dbl> <dbl>
## 1 7 7 2012 89.2
min_temp <- min(df_2$Avg_temp)
coldest_dates <- df_2[df_2$Avg_temp == min_temp, ]
nrow(coldest_dates)
## [1] 14
mean(df_2$Avg_temp)
## [1] 54.39876
# Treat -99 as a missing-value code and replace it with NA
df_2[df_2$Avg_temp == -99, "Avg_temp"] <- NA
mean(df_2$Avg_temp, na.rm = TRUE)
## [1] 54.6309
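As a quick follow-up sketch, counting the NAs created by the replacement above shows how many -99 placeholder readings the file contained:

# Number of Avg_temp readings that were recorded as -99
sum(is.na(df_2$Avg_temp))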
df_3 <- read.csv(here(file_path, police_data))
dim(df_3)
## [1] 15155 40
(You might need to manually describe the columns based on your understanding and the context provided by the data portal website.)
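Before writing those descriptions, a quick way to see what is in the table is to list the column names and glance at their types. A small sketch in base R:

# Column names of the police incident data
names(df_3)

# Compact overview of each column's type and example values
str(df_3)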
# Calculate missing values, considering both NA and "UNKNOWN"
missing_values <- sapply(df_3, function(col) {
if (is.numeric(col) || inherits(col, "Date")) {
return(sum(is.na(col)))
} else {
return(sum(is.na(col) | col == "UNKNOWN" | col == ""))
}
})
# Print missing values
missing_values
## INSTANCEID INCIDENT_NO
## 0 0
## DATE_REPORTED DATE_FROM
## 0 2
## DATE_TO CLSD
## 9 545
## UCR DST
## 10 0
## BEAT OFFENSE
## 28 10
## LOCATION THEFT_CODE
## 2 10167
## FLOOR SIDE
## 14127 14120
## OPENING HATE_BIAS
## 14508 0
## DAYOFWEEK RPT_AREA
## 423 239
## CPD_NEIGHBORHOOD WEAPONS
## 249 5
## DATE_OF_CLEARANCE HOUR_FROM
## 2613 2
## HOUR_TO ADDRESS_X
## 9 148
## LONGITUDE_X LATITUDE_X
## 1714 1714
## VICTIM_AGE VICTIM_RACE
## 2283 2662
## VICTIM_ETHNICITY VICTIM_GENDER
## 2829 2214
## SUSPECT_AGE SUSPECT_RACE
## 9003 7992
## SUSPECT_ETHNICITY SUSPECT_GENDER
## 8722 7702
## TOTALNUMBERVICTIMS TOTALSUSPECTS
## 33 7082
## UCR_GROUP ZIP
## 10 1
## COMMUNITY_COUNCIL_NEIGHBORHOOD SNA_NEIGHBORHOOD
## 0 0
# Identify the column with the most missing values
column_most_missing <- names(which.max(missing_values))
column_most_missing
## [1] "OPENING"
range(as.Date(df_3$DATE_REPORTED, format="%m/%d/%Y")) # Adjust format if necessary
## [1] "2022-01-01" "2022-06-26"
Using table(), what is the most common age range for known SUSPECT_AGEs?

# Tabulate suspect ages, excluding the UNKNOWN placeholder
known_ages <- table(df_3$SUSPECT_AGE[df_3$SUSPECT_AGE != "UNKNOWN"])
most_common_known_age <- as.character(names(which.max(known_ages)))
frequency_known_age <- known_ages[most_common_known_age]
list(Age = most_common_known_age, Frequency = frequency_known_age)
## $Age
## [1] "18-25"
##
## $Frequency
## 18-25
## 1778
Use table() to get the number of incidents per zip code. Sort this table from the zip codes with the most activity to the least activity. Which zip code has the most incidents? Do you see any peculiar data quality issues with any of these zip code values?

zip_activity <- table(df_3$ZIP_CODE)
sorted_zip_activity <- sort(zip_activity, decreasing = TRUE)
sorted_zip_activity
## integer(0)
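The empty result above indicates df_3 has no column named ZIP_CODE; in the missing-value summary earlier the zip column appears as ZIP. A corrected sketch under that assumption:

# Tabulate incidents per zip code (column ZIP), sorted from most to least active
zip_activity <- sort(table(df_3$ZIP), decreasing = TRUE)
head(zip_activity, 10)

Any malformed entries (for example, blank or truncated zip values) would stand out in this sorted table when answering the data quality part of the question.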
day_activity <- table(df_3$DAYOFWEEK)
most_common_day <- names(day_activity)[which.max(day_activity)]
proportion <- max(day_activity) / sum(day_activity)
most_common_day
## [1] "SATURDAY"
proportion
## [1] 0.1499175
(Provide a written response based on your understanding and interest.)
(Choose three columns of interest, then provide code and written responses for each column’s analysis.)
# Example analysis for one column (repeat for two other columns)
column1 <- df_3$COLUMN_NAME # Replace COLUMN_NAME with the actual column name
summary(column1)
## Length Class Mode
## 0 NULL NULL
# ... Add further analysis like checking for outliers, etc.
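As one illustrative way to fill in the template above (the column choice is mine, not specified by the assignment), CPD_NEIGHBORHOOD from the missing-value summary lends itself to a frequency check:

# Incident counts by neighborhood, dropping blank entries
neighborhood <- df_3$CPD_NEIGHBORHOOD
neighborhood_counts <- table(neighborhood[!is.na(neighborhood) & neighborhood != ""])

# Ten neighborhoods with the most reported incidents
head(sort(neighborhood_counts, decreasing = TRUE), 10)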