This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load your data
data <- read.csv("C:\\Users\\mansi\\Downloads\\news+popularity+in+multiple+social+media+platforms\\News_Popularity_in_Multiple_Social_Media_Platforms.csv")
summary(data)
## IDLink Title Headline Source
## Min. : 1 Length:93239 Length:93239 Length:93239
## 1st Qu.: 24302 Class :character Class :character Class :character
## Median : 52275 Mode :character Mode :character Mode :character
## Mean : 51561
## 3rd Qu.: 76586
## Max. :104802
## Topic PublishDate SentimentTitle SentimentHeadline
## Length:93239 Length:93239 Min. :-0.950694 Min. :-0.75543
## Class :character Class :character 1st Qu.:-0.079057 1st Qu.:-0.11457
## Mode :character Mode :character Median : 0.000000 Median :-0.02606
## Mean :-0.005411 Mean :-0.02749
## 3rd Qu.: 0.064255 3rd Qu.: 0.05971
## Max. : 0.962354 Max. : 0.96465
## Facebook GooglePlus LinkedIn
## Min. : -1.0 Min. : -1.000 Min. : -1.00
## 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 5.0 Median : 0.000 Median : 0.00
## Mean : 113.1 Mean : 3.888 Mean : 16.55
## 3rd Qu.: 33.0 3rd Qu.: 2.000 3rd Qu.: 4.00
## Max. :49211.0 Max. :1267.000 Max. :20341.00
E.g., this could be a column name, or just some value inside a cell of your data. Why do you think they chose to encode the data the way they did? What could have happened if you didn’t read the documentation?
Based on my dataset, “News Popularity in Multiple Social Media Platforms”, I can conclude the following.
SentimentTitle and SentimentHeadline: These columns have numerical values such as 0.208333333 and -0.156385811. Without documentation or context, it’s not clear what these values represent. They could be sentiment scores, but the scale and meaning of these scores are uncertain.
Facebook, GooglePlus, and LinkedIn: These columns also contain numerical values, most likely social media metrics such as share counts, but without documentation the exact interpretation is uncertain, including what the -1 entries visible in the summary above stand for (a quick check of these values follows below).
Topic: The “Topic” column contains values such as “economy,” “obama,” and “microsoft.” While some topics are self-explanatory, others might not be as obvious. Without documentation, it’s challenging to understand the criteria used to categorize articles into these topics.
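Even without the documentation, a few quick checks show what values are actually present. The sketch below inspects the observed range of the two sentiment columns and counts how often -1 appears in the share columns; I am assuming here that -1 is a placeholder code rather than a real share count, which only the documentation can confirm.

# Observed range of the sentiment columns (the scale itself is still undocumented)
range(data$SentimentTitle, na.rm = TRUE)
range(data$SentimentHeadline, na.rm = TRUE)

# How often does the suspicious -1 value appear in each share column?
colSums(data[, c("Facebook", "GooglePlus", "LinkedIn")] == -1, na.rm = TRUE)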
The data may have been encoded this way to store values compactly and process them efficiently: sentiment scores can be represented as floating-point numbers, social media metrics as integers, and topics as short keyword labels that are easy to filter and group on.
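A quick look at the column classes illustrates this split between text labels, integer counts, and floating-point scores (a sketch; the exact classes depend on how read.csv parses this particular file):

# Character labels (Title, Source, Topic, ...), integer share counts, numeric sentiment scores
sapply(data, class)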
If you didn’t read the documentation, you would be left guessing the meanings of these columns and values, which could lead to misinterpretation of the data. Having access to documentation or metadata is crucial for accurately understanding and analyzing such datasets.
You may need to do some digging, but is there anything about the data that your documentation does not explain?
After reading the additional information provided, some aspects of the data are still not explained. In particular, the documentation says nothing about ethical or epistemological considerations.
Bias in sentiment analysis (an ethical consideration): Sentiment analysis models and tools are known to be susceptible to biases present in their training data. It is important to consider whether the sentiment scores in this dataset reflect biases present in the original news sources. For example, if certain news outlets have a known slant, it could influence the sentiment scores associated with their news items.
Subjectivity in sentiment (an epistemological consideration): Sentiment analysis is inherently subjective; different readers may interpret the sentiment of the same text differently. From an epistemological perspective, it is essential to acknowledge that sentiment scores represent one interpretation of sentiment and may not be universally agreed upon.
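As a rough, purely descriptive probe of the bias concern, a summary like the sketch below compares average title sentiment across sources. I am assuming the Source column identifies the outlet, and any differences could just as easily reflect topic mix as outlet bias.

# Average title sentiment by source, for sources with a reasonable number of items
data %>%
  group_by(Source) %>%
  summarise(
    n_items = n(),
    mean_title_sentiment = mean(SentimentTitle, na.rm = TRUE)
  ) %>%
  filter(n_items >= 100) %>%   # drop sources with too few articles to say anything
  arrange(mean_title_sentiment) %>%
  head(10)                     # the ten most negative-leaning sources on average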
Documentation that addressed these considerations would communicate the dataset more effectively and create more transparency and accountability around the data.
You can use color or an annotation, but also make sure to explain your thoughts using Markdown. Do you notice any significant risks? If so, what could you do to reduce negative consequences?
# ggplot2 is already attached via tidyverse; this call is redundant but harmless
library(ggplot2)

ggplot(data, aes(x = IDLink, y = SentimentTitle)) +
  # One point per news item, colored by its title sentiment score
  # (alpha reduces overplotting with ~93k points)
  geom_point(aes(color = SentimentTitle), size = 3, alpha = 0.3) +
  # Single annotation flagging the unclear scoring; annotate() draws one label,
  # whereas geom_text() with a constant label would draw one per row
  annotate("text", x = 3, y = -0.4, label = "Unclear", size = 4, color = "red", hjust = 0) +
  # Highlight the issue with a dashed red reference line at zero
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  # Customize the plot
  labs(
    title = "Sentiment Scores in Titles",
    x = "IDLink",
    y = "SentimentTitle",
    caption = "Unclear Sentiment Scoring"
  ) +
  theme_minimal() +
  scale_color_gradient(low = "red", high = "green") +
  theme(legend.position = "none")
In this scatter plot, the x-axis shows IDLink and the y-axis shows the SentimentTitle score. Color highlights the sentiment scores on a gradient, with more positive scores in green and more negative scores in red.
Unclear Issue Highlighted: The scattered distribution of points illustrates the unclear sentiment scoring. Because the documentation does not explain the specific sentiment scoring system used, we cannot interpret exactly what sentiment a given title or headline score conveys. The scores are spread across a range of values, but without knowing the scoring system’s polarity (i.e., whether positive sentiment is represented by positive values or vice versa), it is hard to draw meaningful conclusions about the sentiment of the news items.
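One informal way to probe the polarity convention, without treating it as settled, is to eyeball the titles at the two extremes of the score range and judge whether the high-scoring ones actually read as positive. The sketch below is such a spot check, not a substitute for proper documentation.

# Titles with the highest scores -- do they read as positive?
data %>%
  select(Title, SentimentTitle) %>%
  arrange(desc(SentimentTitle)) %>%
  slice(1:5)

# Titles with the lowest scores -- do they read as negative?
data %>%
  select(Title, SentimentTitle) %>%
  arrange(SentimentTitle) %>%
  slice(1:5)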
Significant Risks: The significant risk is that without a clear understanding of the sentiment scoring system, it’s challenging to interpret whether a sentiment score is positive or negative. This ambiguity can lead to incorrect analyses and decisions based on the sentiment data.
Mitigation: To reduce negative consequences, we should seek clarification of the sentiment scoring system used in the dataset, ideally in the form of detailed documentation. It is also crucial to document any assumptions we make about sentiment scoring during the analysis, so that future users of the data can see them and revisit them if needed.
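One way to make such assumptions visible is to encode them directly in the analysis. The sketch below adds a hypothetical title_sentiment_label column under the explicitly stated (and unverified) assumption that positive values mean positive sentiment:

# ASSUMPTION (unverified): positive SentimentTitle = positive sentiment, negative = negative
data <- data %>%
  mutate(
    title_sentiment_label = case_when(
      SentimentTitle > 0 ~ "assumed positive",
      SentimentTitle < 0 ~ "assumed negative",
      TRUE               ~ "zero / neutral"
    )
  )
table(data$title_sentiment_label)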