Email: “
Phone: “929-228-2367”

2019 Bike Company Data

For this project, I wanted to deliver analytical metrics that could be used to improve an imaginary company with hundreds of thousands of users. The data comes from the Google Data Analytics course and can be found here: https://docs.google.com/spreadsheets/d/1UD1TpwvlnVTNpJtxo9WjCZGBpISKoVJCWLBLC7XLhF0/edit?usp=sharing

These were the steps I followed:

1. Load dataset and all needed packages

This is the simplest part of any project dealing with data, and I had no trouble with it.

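    library(readr)
    library(dplyr)    # provides %>%, count(), select(), mutate() and filter()
    library(ggplot2)  # used for the plots in step 6
    Data_2019 <- read_csv("Bike_data_2019.csv")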

2. Previewing the data

I got basic information about my data using count and histograms. These showed various outliers in the trip duration, which I decided to clean at a later point. From here, I also saw that most of our users were born around 1920, which gives us insight into who is using our service.

    Data_2019 %>% count(usertype)
    hist(Data_2019$tripduration)
    hist(Data_2019$birthyear)
Figure: Birth year of users

Figure: Trip duration with outliers

3. Cleaning the data

I did this by creating a new data frame where I set new metrics.

    Data_2019v1 <- Data_2019 %>%
      select(birthyear, usertype, tripduration) %>%
      group_by(usertype) %>%
      mutate(age = 2025 - birthyear,                            # age in years, relative to 2025
             Tripduration_min = as.integer(tripduration / 60))  # seconds to whole minutes (truncated)

This was a very important step: I got a new column for age instead of birth year, and I converted the trip duration from seconds to minutes. This made the analysis a whole lot easier.
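As a quick check of the two conversions, here is a minimal sketch in base R. The vectors are made-up values standing in for the `birthyear` and `tripduration` columns, not rows from the real data set.

```r
# Hypothetical values standing in for the birthyear and tripduration columns
birthyear    <- c(1990, 1975, 2000)
tripduration <- c(390, 1200, 59)   # seconds

# Age relative to 2025, the reference year used in the mutate() call above
age <- 2025 - birthyear

# as.integer() drops the fractional part, so 390 s (6.5 min) becomes 6
# and anything under a minute becomes 0
Tripduration_min <- as.integer(tripduration / 60)

print(age)               # 35 50 25
print(Tripduration_min)  # 6 20 0
```

Because `as.integer()` truncates rather than rounds, short rides are pulled toward zero; `round(tripduration / 60)` would be an alternative if that matters for the analysis.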

4. Removing the outliers

This data set has a lot of outliers, so I decided to use some statistics to clear them out by removing the values that fall outside the typical range, based on the quartiles of the trip duration.

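    # failed code
    q1 <- quantile(Data_2019v1$Tripduration_min, 0.25)
    q3 <- quantile(Data_2019v1$Tripduration_min, 0.75)
    iqr <- q3 - q1                 # lower-case name avoids shadowing the built-in IQR()
    Lower_limit <- q1 - 1.5 * iqr  # lower limit sits below the first quartile
    Upper_limit <- q3 + 1.5 * iqr  # upper limit sits above the third quartile
    # The subsetting attempt below is the part that failed:
    #Data_2019v2 <- Data_2019v1 [Data_2019v1$Tripduration_min >= Lower_limit & Data_2019v1$Tripduration_min <= Upper_limit]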

Unfortunately, this process failed, so I had to rethink my approach and ended up using the filter function alongside the limits I had already calculated. If you wish to follow what I did, I advise you not to spend time doing the math by hand: there is a function called IQR that calculates the interquartile range, and I was unaware of it at the time.
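For readers who want to use the built-in function, here is a minimal sketch of the usual 1.5 × IQR rule with `IQR()`. The `trip_min` vector and the `lower_fence` / `upper_fence` names are hypothetical and only illustrate the idea; they are not taken from the project data.

```r
# Hypothetical trip durations in minutes (made-up values with one obvious outlier)
trip_min <- c(4, 5, 6, 7, 8, 9, 10, 12, 15, 240)

q1 <- quantile(trip_min, 0.25, names = FALSE)
q3 <- quantile(trip_min, 0.75, names = FALSE)

# IQR() returns the same value as q3 - q1
iqr_value <- IQR(trip_min)

# Conventional fences: 1.5 * IQR below the first quartile
# and 1.5 * IQR above the third quartile
lower_fence <- q1 - 1.5 * iqr_value
upper_fence <- q3 + 1.5 * iqr_value

# Keep only the values inside the fences
kept <- trip_min[trip_min >= lower_fence & trip_min <= upper_fence]
print(kept)
```

In this toy vector only the 240-minute ride falls outside the fences, so it is the only value removed.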

(final outlier removal code)

    Data_2019v2 <- filter(Data_2019v1, Tripduration_min >= Lower_limit & Tripduration_min <= Upper_limit)

    hist(Data_2019v2$Tripduration_min, col = "#A65461FF", xlab = "Trip Duration(min)", main = "total rides per minute", ylab = "total rides")
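For comparison, the earlier bracket-based attempt most likely failed because it had no comma after the row condition, so R read the condition as a column selection. The sketch below shows the base R form on a small made-up data frame; `rides`, `lower` and `upper` are hypothetical names and values used only for illustration.

```r
# Hypothetical stand-in for Data_2019v1 (made-up values)
rides <- data.frame(
  usertype = c("Subscriber", "Customer", "Subscriber", "Subscriber"),
  Tripduration_min = c(7, 180, 12, 9)
)

# Hypothetical limits, chosen only for this example
lower <- 0
upper <- 30

# Logical vector with one entry per row
in_range <- rides$Tripduration_min >= lower & rides$Tripduration_min <= upper

# The comma after the condition means "these rows, all columns"
rides_clean <- rides[in_range, ]
print(rides_clean)
```

The `filter()` call used above does the same row selection without the comma, which is one reason it is harder to get wrong.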

This gives us a much cleaner chart where we can see a clearer range of data.

Figure: Tripduration Plot

5. Calculating percentage Data

I wanted to express the user type counts as percentages, since the raw values looked cluttered when plotted.

    Customer <- Data_2019v2 %>% count(usertype)
    View(Customer)
    9068 + 298866  # total rides across both user types: 307934
    Customer <- mutate(Customer, percentage = n / 307934 * 100)  # n is the count column created by count()

In this code you can see that I added the values for subscribers (people subscribed to our product) and customers (people using our product for the first time). Even without plotting it, we can see a large disparity between these two user types.
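The same percentages can be computed without typing the total by hand by dividing by `sum(n)`. This is a minimal sketch in base R using the two counts added above; which count belongs to which user type is an assumption here, based on `count()` listing "Customer" before "Subscriber".

```r
# Counts from the step above; the pairing with user types is an assumption
Customer <- data.frame(
  usertype = c("Customer", "Subscriber"),
  n = c(9068, 298866)
)

# sum(n) replaces the hard-coded total of 307934
Customer$percentage <- Customer$n / sum(Customer$n) * 100

print(round(Customer$percentage, 1))  # 2.9 97.1
```

Using `sum(n)` keeps the percentages correct if the outlier limits change and the row counts shift.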

6. Plotting the data

After my analysis, I ended up with three separate plots, which you will see below.

    library(paletteer)  # provides scale_fill_paletteer_d()
    ggplot(Customer, aes(x = usertype, y = percentage, fill = usertype)) + geom_col()
    ggplot(Data_2019v2, aes(x = age)) + geom_histogram(fill = "#266B6EFF") + scale_fill_paletteer_d("rcartocolor::BluGrn") + theme_light()

Figures: Tripduration Plot, Age data

Unfortunately, the files for the third plot got corrupted before I could save them. You might also notice that I have the paletteer package loaded, but it adds nothing to the code here: the hex codes do all of the work, so the palette scale has no visible effect.

Final Analysis

From our analysis, we can see that most of our consumers are between 25 and 50 years old and are already subscribers. Alongside this, most of our users ride between 5 and 10 minutes on our bikes. This means that for our company to grow and stand out, we should keep building on the metrics that we already have. A great motto is “If it ain’t broke, don’t fix it.”

Applying this to our user base, I propose that we increase our marketing focus on the group we already have a stake in, since that is where higher profits are most likely. We can do this in a lot of ways, but what I would suggest is social media marketing targeted towards millennials and older generations.

Another area for improvement is our bike locations. From the usage data, we can see that most people tend to ride for 5 to 10 minutes, which suggests that people are generally riding for convenience and between stations that are close to each other. I propose that we move more bikes to locations that are closer to each other to verify this prediction; there is a high chance of growth if we can exploit this fact.