DATA 607 Assignment 1
Loading Data into a Dataframe
Prof. Catlin
In this assignment, I will explore loading data from a remote source
(GitHub) into R (RStudio). I will be using data from FiveThirtyEight (https://fivethirtyeight.com/features/weather-forecast-news-app-habits/).
The dataset I will be using for the assignment is “Where People Go To Check
The Weather”.
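library(dplyr)    # provides %>%, group_by() and summarise()
library(ggplot2)  # provides ggplot() and geom_col()
# Read the survey responses directly from the FiveThirtyEight GitHub repository
data <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/weather-check/weather-check.csv")
# Assign short, consistent column names (underscores only, so no backticks are needed)
cols <- c("id", "daily_weather_report", "method", "provider", "smartwatch", "age",
          "gender", "household_earnings", "us_region")
colnames(data) <- cols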
The data contains 928 rows, each identified by a unique ID, with
responses to 8 survey questions.
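The row count and the uniqueness of the IDs can be checked directly rather than read off by eye. The following is a minimal sketch, not part of the original analysis: it uses a hypothetical three-row stand-in (`sample_data`, built from the made-up string `csv_text`) so that it runs without a network connection. The same two checks could be applied to `data`.

```r
# Hypothetical three-row stand-in for the survey file, so the sketch runs offline.
csv_text <- "id,age\n101,18 - 29\n102,60+\n103,-"
sample_data <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Number of rows and columns; on the full dataset this should report 928 rows.
dim(sample_data)

# anyDuplicated() returns 0 when no value repeats, so this is TRUE for unique IDs.
ids_unique <- anyDuplicated(sample_data$id) == 0
ids_unique
```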
Characteristics of the data include:
Since the data has already been binned to some level, and the rest of
the data is categorical, I will not be performing any data cleaning or
transformation. Furthermore, it is straightforward to count the number
of responses in each bin or category. To demonstrate this, I will count
the number of responses in each age group.
To keep the counts separate from the original dataset, I will store
them in a new dataframe with two columns, age (the age group) and
counts (the number of responses in that group):
counted <- data %>%
  group_by(age) %>%
  summarise(counts = n())  # name the count column directly instead of renaming afterwards
counted
## # A tibble: 5 × 2
##   age     counts
##   <chr>    <int>
## 1 -           12
## 2 18 - 29    176
## 3 30 - 44    204
## 4 45 - 59    278
## 5 60+        258
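The same tally can also be produced without dplyr. The sketch below is an alternative I am adding for comparison, not the method used above: it rebuilds a vector of age groups from the counts printed in the tibble and tallies it with base R's `table()`. The names `age` and `counted_base` are illustrative.

```r
# Rebuild one value per respondent from the counts shown in the tibble above.
age <- rep(c("-", "18 - 29", "30 - 44", "45 - 59", "60+"),
           times = c(12, 176, 204, 278, 258))

# table() tallies each distinct value; as.data.frame() converts the tally
# into a two-column data frame, with the count column named "counts".
counted_base <- as.data.frame(table(age),
                              responseName = "counts",
                              stringsAsFactors = FALSE)
counted_base
```

With the real data, `table(data$age)` would replace the reconstructed vector.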
To better compare the number of responses across age groups, I will
create a simple bar graph using ggplot.
ggplot(data = counted, aes(x = age, y = counts)) + geom_col()
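The default axis labels are just the column names. As a possible refinement, the sketch below adds descriptive labels with `labs()`. It rebuilds `counted` from the counts printed earlier so that it runs on its own, and the label text and the object name `p` are my own choices rather than part of the assignment.

```r
library(ggplot2)

# Counts copied from the tibble printed earlier in this document.
counted <- data.frame(
  age = c("-", "18 - 29", "30 - 44", "45 - 59", "60+"),
  counts = c(12, 176, 204, 278, 258)
)

# Same bar chart as before, with axis labels and a title added.
p <- ggplot(data = counted, aes(x = age, y = counts)) +
  geom_col() +
  labs(x = "Age group",
       y = "Number of responses",
       title = "Survey responses by age group")
p
```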
To conclude, this brief interaction with the data suggests that it is
properly formatted and ready for further analysis. There are some
missing values: 12 of the 928 responses (about 1.3%) have the age group
recorded as “-” rather than an actual age range. These are too few to
warrant further investigation at this stage. However, I have only
explored the data in terms of age groups, and there may be other issues
with the data that I have not yet encountered. In the future, I would
like to explore the data in more detail and analyze it for interesting
trends or patterns. In particular, I would focus on reviewing missing
data throughout the set, not just in age groups, and potentially
imputing values or removing rows with missing data.
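A first pass at that review could look like the sketch below. It is only an illustration under stated assumptions: `sample_data` is a hypothetical four-row stand-in, and I am assuming that other columns mark missing answers with the same “-” placeholder seen in the age column, which I have not verified.

```r
# Hypothetical stand-in; assumes "-" marks a missing answer in every column.
sample_data <- data.frame(
  id = 1:4,
  age = c("18 - 29", "-", "60+", "45 - 59"),
  gender = c("Female", "-", "Male", "-"),
  stringsAsFactors = FALSE
)

# Count the "-" placeholders in each column.
placeholder_counts <- colSums(sample_data == "-")
placeholder_counts

# Convert placeholders to NA so that R's missing-data tools recognise them.
sample_data[sample_data == "-"] <- NA

# One option: keep only the rows with no missing values.
complete_rows <- na.omit(sample_data)
complete_rows
```

Alternatively, passing `na.strings = "-"` to `read.csv()` would convert the placeholders to NA at load time. Whether to drop or impute the affected rows would depend on how many there are in each column.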