INTRODUCTION:
The air quality data set (I use only a small subset of it for this project) records daily measurements of four air quality variables: Ozone, Solar.R, Wind, and Temp. I will tidy and transform the data to prepare it for analysis.
DATA LOAD
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
#Manually created a CSV from the chart, uploaded it to GitHub, and read it into an object called "air_quality"
air_quality <- read.csv("https://raw.githubusercontent.com/GitHub-Vlad/Data-Science-Projects/main/Data%20Tidying%20and%20Transformation%20Project/Air%20Quality.csv",header = TRUE)
DATA TIDYING/TRANSFORMATION
I will remove the month column from the data set because it is not relevant to our analysis.
#Remove the "Month" column.
air_quality <- select(air_quality, -c("Month"))
I move the “Day” column to the first position in the data frame. For our analysis, this makes the table easier to read from left to right.
#Reorder the "Day" column to be the first one on the left
air_quality <- air_quality %>%
  select("Day", everything())
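The two steps above (dropping "Month", then moving "Day" to the front) can also be written as a single pipeline. This is only a minimal sketch: it assumes dplyr 1.0.0 or later for relocate(), and it uses the first six rows of R's built-in airquality data as a stand-in for the CSV (an assumption; the name aq_demo is purely illustrative).

```r
library(dplyr)

# Stand-in data (assumption): first six rows of R's built-in airquality set
aq_demo <- head(airquality, 6)

# Drop "Month", then move "Day" to the first position
aq_demo <- aq_demo %>%
  select(-Month) %>%
  relocate(Day)

names(aq_demo)
```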
I will now check the columns of the “air_quality” data frame for missing (NA) values and, if any exist, remove the affected rows.
#Use the mean() function to check each column for missing values. An output of NA indicates that the column contains missing values.
mean(air_quality$Ozone)
## [1] NA
mean(air_quality$Wind)
## [1] 11.45
mean(air_quality$Solar.R)
## [1] NA
mean(air_quality$Temp)
## [1] 66.16667
#Pass na.rm = TRUE so that mean() ignores the NA values in the "Ozone" and "Solar.R" columns. The mean is then available for both columns.
mean(air_quality$Ozone, na.rm = TRUE)
## [1] 27
mean(air_quality$Solar.R, na.rm = TRUE)
## [1] 192.5
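The mean() calls flag columns with missing values only indirectly. A more direct check counts the NA values in each column. The sketch below is illustrative rather than part of the original analysis: it uses the first six rows of R's built-in airquality data as a stand-in for the CSV (an assumption), and the names aq_demo and na_counts are hypothetical.

```r
# Stand-in data (assumption): first six rows of R's built-in airquality set
aq_demo <- head(airquality, 6)

# Count the missing values in every column at once
na_counts <- colSums(is.na(aq_demo))
print(na_counts)

# Names of the columns that contain at least one NA
names(na_counts)[na_counts > 0]
```

Unlike mean(), this approach also works for non-numeric columns and reports how many values are missing, not just whether any are.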
#Remove the rows with NA values using the na.omit() function. Note that na.omit() drops every row with an NA in any column; the checks above found NAs only in "Ozone" and "Solar.R". Store the result in a new data frame called "air_quality_analysis".
air_quality_analysis <- data.frame(na.omit(air_quality))
print(air_quality_analysis)
##   Day Ozone Solar.R Wind Temp
## 1   1    41     190  7.4   67
## 2   2    36     118  8.0   72
## 3   3    12     149 12.6   74
## 4   4    18     313 11.5   62
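Because na.omit() removes a row when any column is NA, it can drop more rows than intended on wider data sets. If only specific columns should decide which rows are kept, complete.cases() on those columns is one option. This is a hedged sketch, again using the first six rows of the built-in airquality data as a stand-in (an assumption); aq_demo, keep, and aq_complete are illustrative names.

```r
# Stand-in data (assumption): first six rows of R's built-in airquality set
aq_demo <- head(airquality, 6)

# TRUE for rows where both Ozone and Solar.R are present;
# NAs in any other column are ignored by this check
keep <- complete.cases(aq_demo[, c("Ozone", "Solar.R")])

# Keep only the rows that passed the check
aq_complete <- aq_demo[keep, ]

nrow(aq_complete)
```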
DATA ANALYSIS
For my analysis, I will create a bar plot for each air quality category over the four-day period. I want to see how the value of each category changes from one day to the next.
#Creating a bar plot for Ozone.
barplot(air_quality_analysis$Ozone, main = "Ozone Distribution", xlab = "Day", ylab = "Value", names.arg = air_quality_analysis$Day, ylim = c(0, 50))
#Creating a bar plot for Solar.R.
barplot(air_quality_analysis$Solar.R, main = "Solar.R Distribution", xlab = "Day", ylab = "Value", names.arg = air_quality_analysis$Day, ylim = c(0, 400))
#Creating a bar plot for Wind.
barplot(air_quality_analysis$Wind, main = "Wind Distribution", xlab = "Day", ylab = "Value", names.arg = air_quality_analysis$Day, ylim = c(0, 20))
#Creating a bar plot for Temp.
barplot(air_quality_analysis$Temp, main = "Temp Distribution", xlab = "Day", ylab = "Value", names.arg = air_quality_analysis$Day, ylim = c(0, 100))
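The four bar plots can also be drawn as one faceted figure with tidyr and ggplot2, both of which are loaded at the top of this report. The sketch below is one possible approach, not the method used for the plots above. It rebuilds the four complete rows from the printed data frame so that it runs on its own; the names air_quality_long and p are illustrative.

```r
library(tidyr)
library(ggplot2)

# The four complete rows printed earlier in this report
air_quality_analysis <- data.frame(
  Day     = 1:4,
  Ozone   = c(41, 36, 12, 18),
  Solar.R = c(190, 118, 149, 313),
  Wind    = c(7.4, 8.0, 12.6, 11.5),
  Temp    = c(67, 72, 74, 62)
)

# Reshape to long format: one row per (Day, Category) pair
air_quality_long <- pivot_longer(
  air_quality_analysis,
  cols = -Day,
  names_to = "Category",
  values_to = "Value"
)

# One panel per category, each with its own y-axis scale
p <- ggplot(air_quality_long, aes(x = factor(Day), y = Value)) +
  geom_col() +
  facet_wrap(~ Category, scales = "free_y") +
  labs(x = "Day", y = "Value", title = "Air quality measurements by day")

print(p)
```

Using scales = "free_y" plays the same role as the separate ylim settings in the barplot() calls: each category keeps a readable axis even though the units differ.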
After looking at the bar plots, I will explain my findings with a breakdown by category, followed by a breakdown by day, to look for any relationship or similarity between the categories.
Breakdown by Category:
Ozone Distribution - We see a slight decrease from Day 1 to Day 2 (41 to 36), a larger decrease from Day 2 to Day 3 (36 to 12), and a slight increase from Day 3 to Day 4 (12 to 18).
Solar.R Distribution - We see a decrease from Day 1 to Day 2 (190 to 118), a slight increase from Day 2 to Day 3 (118 to 149), and a large increase from Day 3 to Day 4 (149 to 313).
Wind Distribution - We see a very small increase from Day 1 to Day 2 (7.4 to 8.0), a large increase from Day 2 to Day 3 (8.0 to 12.6), and a small decrease from Day 3 to Day 4 (12.6 to 11.5).
Temp Distribution - We see a small increase from Day 1 to Day 2 (67 to 72), a very small increase from Day 2 to Day 3 (72 to 74), and a decrease from Day 3 to Day 4 (74 to 62).
Breakdown by Day:
Day 1 to Day 2 - There was a decrease in both the Ozone and Solar.R categories and an increase in both the Wind and Temp categories.
Day 2 to Day 3 - There was a decrease in Ozone. However, there was an increase in all the other categories.
Day 3 to Day 4 - There was an increase in both the Ozone and Solar.R categories and a decrease in both the Wind and Temp categories.
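The breakdowns above can be checked numerically by computing the day-to-day change in each category with diff(). This is a small sketch that rebuilds the four complete rows from the printed data frame; the names measures and changes are illustrative.

```r
# The four complete rows printed earlier in this report
air_quality_analysis <- data.frame(
  Day     = 1:4,
  Ozone   = c(41, 36, 12, 18),
  Solar.R = c(190, 118, 149, 313),
  Wind    = c(7.4, 8.0, 12.6, 11.5),
  Temp    = c(67, 72, 74, 62)
)

measures <- c("Ozone", "Solar.R", "Wind", "Temp")

# Day-to-day change for each category: one row per transition
changes <- sapply(air_quality_analysis[measures], diff)
rownames(changes) <- c("Day 1 to 2", "Day 2 to 3", "Day 3 to 4")
print(changes)

# Direction only: -1 marks a decrease, 1 marks an increase
print(sign(changes))
```

Comparing the columns of sign(changes) shows at a glance which categories moved in the same direction on each transition.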
CONCLUSION:
From my analysis, I can see a similarity between the Ozone and Solar.R categories: both decrease from Day 1 to Day 2 and both increase from Day 3 to Day 4, although they move in opposite directions from Day 2 to Day 3. Likewise, I noticed a similarity between the Wind and Temp categories: both increase from Day 1 to Day 2 and from Day 2 to Day 3, and both decrease from Day 3 to Day 4. Again, these are only the patterns I noticed in the small sample of data I used.