607Assignment1

Introduction to Data

This assignment uses the data set behind the FiveThirtyEight article “How Baby Boomers Get High”, available at “https://github.com/fivethirtyeight/data/blob/master/drug-use-by-age/drug-use-by-age.csv”. The data frame covers 13 different drugs across 17 age groups. Its 28 variables (the age group, the sample size n, and a use and a frequency column for each drug) describe how drug use varies with age, which can support analysis of potential health problems and their consequences.

# Load the required libraries
library(readr)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(RCurl)


# Define the URL to the raw CSV file on GitHub
x <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/drug-use-by-age/drug-use-by-age.csv"

# Use getURL() to fetch the raw CSV text from the URL
data_text <- getURL(x)

# Use read_csv() to parse data_text as CSV and store it in a data frame
drug_use_by_age <- read_csv(data_text)

## Rows: 17 Columns: 28

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): age, cocaine_frequency, crack_frequency, heroin_frequency, inhalan...
## dbl (21): n, alcohol_use, alcohol_frequency, marijuana_use, marijuana_freque...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
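
The column specification above shows that seven columns, including several frequency columns, were parsed as character (chr) rather than numeric, which usually means they contain at least one non-numeric entry. A minimal sketch of converting them is given here. It assumes the non-numeric entries are a placeholder such as "-" (an assumption, so inspect the columns first), and `chr_to_numeric` is a hypothetical helper name, not part of the original analysis.

```r
# Hypothetical helper: convert every character column except `keep` to numeric,
# treating `placeholder` (assumed here to be "-") as a missing value.
chr_to_numeric <- function(df, keep = "age", placeholder = "-") {
  is_chr <- vapply(df, is.character, logical(1))
  cols <- setdiff(names(df)[is_chr], keep)
  for (col in cols) {
    values <- df[[col]]
    values[values %in% placeholder] <- NA  # placeholder becomes NA
    df[[col]] <- as.numeric(values)
  }
  df
}

# Toy data for illustration only (not values from the real data set)
toy <- data.frame(
  age = c("12", "13", "65+"),
  crack_frequency = c("-", "3.0", "15.0"),
  stringsAsFactors = FALSE
)
cleaned <- chr_to_numeric(toy)
```

Applied to the real data this would be `drug_use_by_age <- chr_to_numeric(drug_use_by_age)`. Any placeholder entries would then appear as NA in a later missing-value count.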

Cleaning Data

##                     age                       n             alcohol_use 
##                       0                       0                       0 
##       alcohol_frequency           marijuana_use     marijuana_frequency 
##                       0                       0                       0 
##             cocaine_use       cocaine_frequency               crack_use 
##                       0                       0                       0 
##         crack_frequency              heroin_use        heroin_frequency 
##                       0                       0                       0 
##        hallucinogen_use  hallucinogen_frequency            inhalant_use 
##                       0                       0                       0 
##      inhalant_frequency       pain_releiver_use pain_releiver_frequency 
##                       0                       0                       0 
##           oxycontin_use     oxycontin_frequency        tranquilizer_use 
##                       0                       0                       0 
##  tranquilizer_frequency           stimulant_use     stimulant_frequency 
##                       0                       0                       0 
##                meth_use          meth_frequency            sedative_use 
##                       0                       0                       0 
##      sedative_frequency 
##                       0

## numeric(0)
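
The code for this step is not shown in the rendered output. The table above has the shape of a per-column count of NA values, and `numeric(0)` is what an outlier query returns when nothing is flagged. A minimal sketch that would produce output of this form is given here, assuming base R's `colSums(is.na())` and `boxplot.stats()` (the box-plot rule with whiskers at 1.5 times the hinge spread) were used. The helper names are hypothetical.

```r
# Hypothetical helper: number of NA values in each column
count_missing <- function(df) {
  colSums(is.na(df))
}

# Hypothetical helper: values lying beyond the box-plot whiskers
# (boxplot.stats() uses coef = 1.5 by default)
find_outliers <- function(x) {
  grDevices::boxplot.stats(x)$out
}

# Toy data for illustration only (not values from the real data set)
toy <- data.frame(a = c(1, 2, NA, 4), b = c(10, 11, 12, 13))
missing_counts <- count_missing(toy)
no_outliers <- find_outliers(c(10, 11, 12, 13))
one_outlier <- find_outliers(c(10, 11, 12, 13, 100))
```

On the real data the calls would be `count_missing(drug_use_by_age)` and `find_outliers(drug_use_by_age$marijuana_use)`. Note that `is.na()` only counts true NA values, so text placeholders in character columns are not counted.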

Data Visualization

# Box plot to visualize outliers
boxplot(drug_use_by_age$marijuana_use)

# Create a bar plot
ggplot(data = drug_use_by_age, aes(x = as.factor(age), y = marijuana_use)) +
    geom_col(fill = "grey") +  # geom_col() is equivalent to geom_bar(stat = "identity")
    labs(
        title = "Marijuana Use by Age Group",
        x = "Age Group",
        y = "Marijuana Use"
    ) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusions

No missing values (NA) or outliers were found, which suggests that the data are clean enough for analysis. The bar plot shows that marijuana use increases markedly among the age groups between 18 and 21 years. This may be indicative of social norms around marijuana, or of easier access to it, in this population. If a particular age group shows a concerning trend in marijuana use, prevention strategies and public education may be needed to support that group's health.
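
One way to check where use rises most is to compute the change between consecutive age groups. A minimal sketch follows, assuming the rows are ordered from the youngest to the oldest age group. `largest_increase` is a hypothetical helper and the toy values are illustrative only.

```r
# Hypothetical helper: find the largest rise in `value` between consecutive rows
largest_increase <- function(groups, value) {
  change <- diff(value)   # change from each group to the next
  i <- which.max(change)  # position of the biggest rise
  data.frame(
    from = groups[i],
    to = groups[i + 1],
    increase = change[i],
    stringsAsFactors = FALSE
  )
}

# Toy data for illustration only (not values from the real data set)
toy_age <- c("12", "13", "14", "15")
toy_use <- c(1, 3, 9, 14)
result <- largest_increase(toy_age, toy_use)
```

On the real data the call would be `largest_increase(drug_use_by_age$age, drug_use_by_age$marijuana_use)`, which reports the pair of adjacent age groups with the steepest rise.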

Lwin Shwe

2023-09-03
