This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load your data
data <- read.csv("C:\\Users\\mansi\\Downloads\\news+popularity+in+multiple+social+media+platforms\\News_Popularity_in_Multiple_Social_Media_Platforms.csv")
summary(data)
## IDLink Title Headline Source
## Min. : 1 Length:93239 Length:93239 Length:93239
## 1st Qu.: 24302 Class :character Class :character Class :character
## Median : 52275 Mode :character Mode :character Mode :character
## Mean : 51561
## 3rd Qu.: 76586
## Max. :104802
## Topic PublishDate SentimentTitle SentimentHeadline
## Length:93239 Length:93239 Min. :-0.950694 Min. :-0.75543
## Class :character Class :character 1st Qu.:-0.079057 1st Qu.:-0.11457
## Mode :character Mode :character Median : 0.000000 Median :-0.02606
## Mean :-0.005411 Mean :-0.02749
## 3rd Qu.: 0.064255 3rd Qu.: 0.05971
## Max. : 0.962354 Max. : 0.96465
## Facebook GooglePlus LinkedIn
## Min. : -1.0 Min. : -1.000 Min. : -1.00
## 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 5.0 Median : 0.000 Median : 0.00
## Mean : 113.1 Mean : 3.888 Mean : 16.55
## 3rd Qu.: 33.0 3rd Qu.: 2.000 3rd Qu.: 4.00
## Max. :49211.0 Max. :1267.000 Max. :20341.00
E.g., this could be a column name, or just some value inside a cell of your data. Why do you think they chose to encode the data the way they did? What could have happened if you didn’t read the documentation?
Based on my dataset, “News Popularity in Multiple Social Media Platforms”, I can conclude the following.
SentimentTitle and SentimentHeadline: These columns have numerical values such as 0.208333333 and -0.156385811. Without documentation or context, it’s not clear what these values represent. They could be sentiment scores, but the scale and meaning of these scores are uncertain.
Facebook, GooglePlus, and LinkedIn: These columns also contain numerical values, most likely social media metrics such as share counts, but without documentation the exact interpretation is uncertain, including what the -1 entries visible in the summary above stand for (a quick check of these values follows below).
Topic: The “Topic” column contains values such as “economy,” “obama,” and “microsoft.” While some topics are self-explanatory, others might not be as obvious. Without documentation, it’s challenging to understand the criteria used to categorize articles into these topics.
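Even without the documentation, a few quick checks show what values are actually present. The sketch below inspects the observed range of the two sentiment columns and counts how often -1 appears in the share columns; I am assuming here that -1 is a placeholder code rather than a real share count, which only the documentation can confirm.

# Observed range of the sentiment columns (the scale itself is still undocumented)
range(data$SentimentTitle, na.rm = TRUE)
range(data$SentimentHeadline, na.rm = TRUE)

# How often does the suspicious -1 value appear in each share column?
colSums(data[, c("Facebook", "GooglePlus", "LinkedIn")] == -1, na.rm = TRUE)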
The data may have been encoded this way to store values compactly and process them efficiently: sentiment scores can be represented as floating-point numbers, social media metrics as integers, and topics as short keyword labels that are easy to filter and group on.
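A quick look at the column classes illustrates this split between text labels, integer counts, and floating-point scores (a sketch; the exact classes depend on how read.csv parses this particular file):

# Character labels (Title, Source, Topic, ...), integer share counts, numeric sentiment scores
sapply(data, class)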
If you didn’t read the documentation, you would be left guessing the meanings of these columns and values, which could lead to misinterpretation of the data. Having access to documentation or metadata is crucial for accurately understanding and analyzing such datasets.
You may need to do some digging, but is there anything about the data that your documentation does not explain?
After reading the additional information provided, some aspects of the data are still not explained. In particular, the documentation says nothing about ethical or epistemological considerations.
Bias in sentiment analysis (an ethical consideration): Sentiment analysis models and tools are known to be susceptible to biases present in their training data. It is important to consider whether the sentiment scores in this dataset reflect biases present in the original news sources. For example, if certain news outlets have a known slant, it could influence the sentiment scores associated with their news items.
Subjectivity in sentiment (an epistemological consideration): Sentiment analysis is inherently subjective; different readers may interpret the sentiment of the same text differently. From an epistemological perspective, it is essential to acknowledge that sentiment scores represent one interpretation of sentiment and may not be universally agreed upon.
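As a rough, purely descriptive probe of the bias concern, a summary like the sketch below compares average title sentiment across sources. I am assuming the Source column identifies the outlet, and any differences could just as easily reflect topic mix as outlet bias.

# Average title sentiment by source, for sources with a reasonable number of items
data %>%
  group_by(Source) %>%
  summarise(
    n_items = n(),
    mean_title_sentiment = mean(SentimentTitle, na.rm = TRUE)
  ) %>%
  filter(n_items >= 100) %>%   # drop sources with too few articles to say anything
  arrange(mean_title_sentiment) %>%
  head(10)                     # the ten most negative-leaning sources on average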
Documentation that addressed these considerations would communicate the dataset more effectively and create more transparency and accountability around the data.
You can use color or an annotation, but also make sure to explain your thoughts using Markdown. Do you notice any significant risks? If so, what could you do to reduce negative consequences?
# ggplot2 is already attached via tidyverse; this call is redundant but harmless
library(ggplot2)

ggplot(data, aes(x = IDLink, y = SentimentTitle)) +
  # One point per news item, colored by its title sentiment score
  # (alpha reduces overplotting with ~93k points)
  geom_point(aes(color = SentimentTitle), size = 3, alpha = 0.3) +
  # Single annotation flagging the unclear scoring; annotate() draws one label,
  # whereas geom_text() with a constant label would draw one per row
  annotate("text", x = 3, y = -0.4, label = "Unclear", size = 4, color = "red", hjust = 0) +
  # Highlight the issue with a dashed red reference line at zero
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  # Customize the plot
  labs(
    title = "Sentiment Scores in Titles",
    x = "IDLink",
    y = "SentimentTitle",
    caption = "Unclear Sentiment Scoring"
  ) +
  theme_minimal() +
  scale_color_gradient(low = "red", high = "green") +
  theme(legend.position = "none")
In this scatter plot, the x-axis shows IDLink and the y-axis shows the SentimentTitle score. Color highlights the sentiment scores on a gradient, with more positive scores in green and more negative scores in red.
Unclear Issue Highlighted: The scattered distribution of points illustrates the unclear sentiment scoring. Because the documentation does not explain the specific sentiment scoring system used, we cannot interpret exactly what sentiment a given title or headline score conveys. The scores are spread across a range of values, but without knowing the scoring system’s polarity (i.e., whether positive sentiment is represented by positive values or vice versa), it is hard to draw meaningful conclusions about the sentiment of the news items.
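One informal way to probe the polarity convention, without treating it as settled, is to eyeball the titles at the two extremes of the score range and judge whether the high-scoring ones actually read as positive. The sketch below is such a spot check, not a substitute for proper documentation.

# Titles with the highest scores -- do they read as positive?
data %>%
  select(Title, SentimentTitle) %>%
  arrange(desc(SentimentTitle)) %>%
  slice(1:5)

# Titles with the lowest scores -- do they read as negative?
data %>%
  select(Title, SentimentTitle) %>%
  arrange(SentimentTitle) %>%
  slice(1:5)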
Significant Risks: The significant risk is that without a clear understanding of the sentiment scoring system, it’s challenging to interpret whether a sentiment score is positive or negative. This ambiguity can lead to incorrect analyses and decisions based on the sentiment data.
Mitigation: To reduce negative consequences, we should seek clarification of the sentiment scoring system used in the dataset, ideally in the form of detailed documentation. It is also crucial to document any assumptions we make about sentiment scoring during the analysis, so that future users of the data can see them and revisit them if needed.
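One way to make such assumptions visible is to encode them directly in the analysis. The sketch below adds a hypothetical title_sentiment_label column under the explicitly stated (and unverified) assumption that positive values mean positive sentiment:

# ASSUMPTION (unverified): positive SentimentTitle = positive sentiment, negative = negative
data <- data %>%
  mutate(
    title_sentiment_label = case_when(
      SentimentTitle > 0 ~ "assumed positive",
      SentimentTitle < 0 ~ "assumed negative",
      TRUE               ~ "zero / neutral"
    )
  )
table(data$title_sentiment_label)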