Loading Data into a Dataframe


In this assignment, I will explore loading data from a remote source (GitHub) into R (RStudio). I will be using data from FiveThirtyEight (https://fivethirtyeight.com/features/weather-forecast-news-app-habits/).

The Data


The data I will be using for this assignment is “Where People Go To Check The Weather”.

data <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/weather-check/weather-check.csv")

# Short, syntactic column names (underscores only, so columns work with `$`)
cols <- c("id", "daily_weather_report", "method", "provider", "smartwatch", "age",
    "gender", "household_earnings", "us_region")
colnames(data) <- cols
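
The read and rename steps above could also be combined, since `read.csv()` accepts a `col.names` argument that replaces the header names at load time. The sketch below is only an illustration: it assumes a tiny made-up CSV string (passed through `text =`) in place of the remote file, and the three-column layout and its header wording are hypothetical.

```r
# Made-up CSV text standing in for the remote file (hypothetical header and rows).
csv_text <- "RespondentID,What is your age?,What is your gender?
1,18 - 29,Female
2,60+,-"

# col.names replaces the names read from the header line.
toy <- read.csv(
  text = csv_text,
  col.names = c("id", "age", "gender"),
  stringsAsFactors = FALSE
)

print(names(toy))
```

With the real file, the same call would take the URL as its first argument and the full vector of nine names.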

The data contains 928 rows, each identified by a unique ID, with responses to 8 survey questions.
Characteristics of the data include:

  1. Most if not all answers are strings, with the exception of the ID column.
  2. Traditionally numerical columns, such as age and earnings, are instead binned into categories such as “30 - 44” or “$50,000 to $74,999”.
  3. Not all questions have been answered; missing answers are recorded as “-”.
  4. The data is not sorted in any particular order.
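
As a rough sketch of how the third characteristic could be checked, the helper below counts how often the “-” placeholder appears in each column. The function name `count_missing` and the small `toy` dataframe are hypothetical stand-ins; in this report the function would be called on `data` instead.

```r
# Count how often the placeholder marker appears in each column of a dataframe.
count_missing <- function(df, marker = "-") {
  sapply(df, function(col) sum(col %in% marker))
}

# Tiny stand-in dataframe (made-up rows, not taken from the survey).
toy <- data.frame(
  id = 1:4,
  age = c("18 - 29", "-", "60+", "30 - 44"),
  gender = c("Female", "Male", "-", "-"),
  stringsAsFactors = FALSE
)

missing_counts <- count_missing(toy)
print(missing_counts)
```

Calling `str(data)` alongside this would also show the type of every column, which is one way to confirm the first characteristic.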

Since the numerical columns have already been binned and the rest of the data is categorical, I will not be performing any data cleaning or transformation. It is also relatively simple to count the number of responses in each bin or category. To demonstrate this, I will count the number of responses in each age group.

Counting the Data


To keep the counts separate from the original dataset, I will create a new dataframe to store them. Building this new dataframe takes three steps:

  1. Group the data by age.
  2. Count the rows in each age group.
  3. Assign column names that better describe the new data.

library(dplyr)  # provides %>%, group_by(), summarise() and n()

counted <- data %>%
    group_by(age) %>%
    summarise(n())

cols <- c("age", "counts")
colnames(counted) <- cols

counted
## # A tibble: 5 × 2
##   age     counts
##   <chr>    <int>
## 1 -           12
## 2 18 - 29    176
## 3 30 - 44    204
## 4 45 - 59    278
## 5 60+        258
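
As a cross-check that does not depend on dplyr, the same kind of count can be sketched in base R with `table()`. The vector `toy_age` below is made up for illustration; with the real data it would be replaced by `data$age`.

```r
# Made-up age answers standing in for data$age.
toy_age <- c("18 - 29", "60+", "-", "60+", "30 - 44", "60+")

# table() tallies each distinct value; as.data.frame() turns the tally into rows.
age_counts <- as.data.frame(table(age = toy_age), stringsAsFactors = FALSE)
names(age_counts) <- c("age", "counts")

print(age_counts)
```

If both approaches are run on the same column, the two tables should agree row for row.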

Visualizing the Data


To compare the number of responses across age groups, I created a simple bar graph using ggplot2.

library(ggplot2)  # provides ggplot(), aes() and geom_col()
ggplot(data = counted, aes(x = age, y = counts)) + geom_col()
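
One possible refinement, sketched below, is to drop the “-” group before plotting and to label the axes. The dataframe `counted_demo` simply re-enters the counts from the tibble output earlier in this report so that the snippet runs on its own; the axis labels and title are my own wording, not part of the original analysis.

```r
library(ggplot2)

# Counts copied from the tibble printed earlier in this report.
counted_demo <- data.frame(
  age = c("-", "18 - 29", "30 - 44", "45 - 59", "60+"),
  counts = c(12, 176, 204, 278, 258)
)

# Leave out respondents who did not give an age.
plot_data <- subset(counted_demo, age != "-")

p <- ggplot(plot_data, aes(x = age, y = counts)) +
  geom_col() +
  labs(
    x = "Age group",
    y = "Number of respondents",
    title = "Survey respondents by age group"
  )

print(p)
```

Whether to hide the missing group is a judgment call; keeping it in the plot makes the amount of missing data visible at a glance.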

Conclusion


To conclude, this brief interaction with the data suggests that it is properly formatted and ready for further analysis. In the age column, only 12 of 928 responses (about 1.3%) are missing, which is too few to change the picture shown here. However, I have only explored the data in terms of age groups, and there may be other issues that I have not yet encountered. In the future, I would like to explore the data in more detail and look for interesting trends or patterns. In particular, I would review missing data throughout the set, not just in age groups, and consider imputing values or removing rows with missing data.
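
As a starting point for that follow-up, the sketch below shows one way the “-” placeholder could be recoded as `NA` and incomplete rows removed. The helper `mark_missing` and the `toy` dataframe are hypothetical; nothing here has been run against the survey data, and imputation is not covered.

```r
# Replace the placeholder marker with NA in every character column.
mark_missing <- function(df, marker = "-") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    col[col %in% marker] <- NA
    col
  })
  df
}

# Tiny stand-in dataframe (made-up rows, not taken from the survey).
toy <- data.frame(
  id = 1:4,
  age = c("18 - 29", "-", "60+", "30 - 44"),
  gender = c("Female", "Male", "-", "Male"),
  stringsAsFactors = FALSE
)

toy_na <- mark_missing(toy)
print(colSums(is.na(toy_na)))  # missing values per column

complete <- na.omit(toy_na)    # keep only fully answered rows
print(nrow(complete))
```

Once the placeholder is stored as `NA`, standard tools such as `is.na()` and `na.omit()` can be applied to any column without special-casing the “-” string.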