R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load your data
data <- read.csv("C:\\Users\\mansi\\Downloads\\news+popularity+in+multiple+social+media+platforms\\News_Popularity_in_Multiple_Social_Media_Platforms.csv")

summary(data)
##      IDLink          Title             Headline            Source         
##  Min.   :     1   Length:93239       Length:93239       Length:93239      
##  1st Qu.: 24302   Class :character   Class :character   Class :character  
##  Median : 52275   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 51561                                                           
##  3rd Qu.: 76586                                                           
##  Max.   :104802                                                           
##     Topic           PublishDate        SentimentTitle      SentimentHeadline 
##  Length:93239       Length:93239       Min.   :-0.950694   Min.   :-0.75543  
##  Class :character   Class :character   1st Qu.:-0.079057   1st Qu.:-0.11457  
##  Mode  :character   Mode  :character   Median : 0.000000   Median :-0.02606  
##                                        Mean   :-0.005411   Mean   :-0.02749  
##                                        3rd Qu.: 0.064255   3rd Qu.: 0.05971  
##                                        Max.   : 0.962354   Max.   : 0.96465  
##     Facebook         GooglePlus          LinkedIn       
##  Min.   :   -1.0   Min.   :  -1.000   Min.   :   -1.00  
##  1st Qu.:    0.0   1st Qu.:   0.000   1st Qu.:    0.00  
##  Median :    5.0   Median :   0.000   Median :    0.00  
##  Mean   :  113.1   Mean   :   3.888   Mean   :   16.55  
##  3rd Qu.:   33.0   3rd Qu.:   2.000   3rd Qu.:    4.00  
##  Max.   :49211.0   Max.   :1267.000   Max.   :20341.00

Part 1: A list of at least 3 columns (or values) in your data which are unclear until you read the documentation.

E.g., this could be a column name, or just some value inside a cell of your data. Why do you think they chose to encode the data the way they did? What could have happened if you didn’t read the documentation?

Answer 1:

We can conclude the following based on my dataset, “News Popularity in Multiple Social Media Platforms”.

The data was probably encoded this way to save space and to store numerical values efficiently. For example, sentiment scores are stored as floating-point numbers, and the social media popularity metrics are stored as whole numbers. Using a fixed set of names for topics can also reduce storage space and make the data easier to process.

If you didn’t read the documentation, you would be left guessing the meanings of these columns and values, which could lead to misinterpretation of the data. Having access to documentation or metadata is crucial for accurately understanding and analyzing such datasets.
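One concrete way to see why the documentation matters: the summary above reports a minimum of -1 for Facebook, GooglePlus, and LinkedIn, which cannot be a literal share count. Below is a minimal sketch of how such values could be checked and recoded. It runs on a small made-up data frame (`toy`) rather than the real file, and it assumes that -1 is a placeholder for "no value recorded"; that meaning should be confirmed against the dataset documentation before recoding the real data.

```r
# Toy stand-in for the real data frame; these values are made up for illustration.
toy <- data.frame(
  SentimentTitle = c(-0.05, 0.00, 0.12, 0.30),
  Facebook       = c(-1, 0, 5, 33),
  GooglePlus     = c(-1, -1, 0, 2),
  LinkedIn       = c(0, -1, 4, 16)
)

# Storage class of each column (numeric vs. character).
col_classes <- sapply(toy, class)
print(col_classes)

# Count the rows per platform that hold the value -1.
platforms <- c("Facebook", "GooglePlus", "LinkedIn")
sentinel_counts <- colSums(toy[platforms] == -1)
print(sentinel_counts)

# Recode -1 to NA so the placeholder does not pull the averages down.
# Assumption: -1 means "no value recorded"; check the documentation first.
toy_clean <- toy
toy_clean[platforms] <- lapply(toy_clean[platforms],
                               function(x) replace(x, x == -1, NA))
print(colMeans(toy_clean[platforms], na.rm = TRUE))
```

Without the documentation, those -1 values would be averaged in as if they were real counts, which is exactly the kind of misinterpretation described above.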

Part 2: At least one element of your data that is unclear even after reading the documentation

You may need to do some digging, but is there anything about the data that your documentation does not explain?

Answer 2:

After reading the additional information provided, one element of the data that is still unclear is:

The documentation also does not address ethical or epistemological considerations. Bias in sentiment analysis (an ethical consideration): sentiment analysis models and tools are known to be susceptible to biases present in their training data. It’s important to consider whether the sentiment scores in this dataset might reflect biases present in the original news sources. For example, if certain news outlets have a known bias, it could influence the sentiment scores associated with their news items.
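A first check on the outlet concern is to compare the average title sentiment across sources. The sketch below uses a made-up data frame (`toy`) with hypothetical outlet names and scores, not the real file. A difference between sources would not prove that the scoring is biased, since outlets may simply cover different stories, but it would show where to look more closely.

```r
# Toy stand-in for the real data; sources and scores are hypothetical.
toy <- data.frame(
  Source = c("Outlet A", "Outlet A", "Outlet A",
             "Outlet B", "Outlet B", "Outlet C"),
  SentimentTitle = c(-0.20, -0.10, -0.30, 0.10, 0.20, 0.00)
)

# Mean title sentiment and number of items for each source.
mean_sentiment <- tapply(toy$SentimentTitle, toy$Source, mean)
n_items <- tapply(toy$SentimentTitle, toy$Source, length)

by_source <- data.frame(
  Source = names(mean_sentiment),
  mean_sentiment = as.vector(mean_sentiment),
  n_items = as.vector(n_items)
)

# Sort so the most negative sources come first.
by_source <- by_source[order(by_source$mean_sentiment), ]
print(by_source)
```

The item count is kept next to the mean because an average based on only one or two articles says very little about a source.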

Subjectivity in sentiment (an epistemological consideration): sentiment analysis is inherently subjective. Different individuals may interpret the sentiment of a text differently. From an epistemological perspective, it’s essential to acknowledge that a sentiment score represents one interpretation of the text and may not be universally agreed upon.

More detailed documentation that addressed these points would communicate the dataset more effectively and would make the data more transparent and accountable.

Part 3: Build a visualization which uses a column of data that is affected by the issue you brought up in bullet #2, above. In this visualization, find a way to highlight the issue, and explain what is unclear and why it might be unclear.

You can use color or an annotation, but also make sure to explain your thoughts using Markdown. Do you notice any significant risks? If so, what could you do to reduce negative consequences?

Answer 3:

library(ggplot2)
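# Plot title sentiment against article ID, colored by the sentiment score
ggplot(data, aes(x = IDLink, y = SentimentTitle)) +
  geom_point(aes(color = SentimentTitle), size = 3) +
  # annotate() draws the label once; geom_text() with a constant label redraws it for every row
  annotate("text", x = 3, y = -0.4, label = "Unclear", size = 4, color = "red", hjust = 0) +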
  
  # Highlight the issue with a red line
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  
  # Customize the plot
  labs(
    title = "Sentiment Scores in Titles",
    x = "IDLink",
    y = "SentimentTitle",
    caption = "Unclear Sentiment Scoring"
  ) +
  theme_minimal() +
  scale_color_gradient(low = "red", high = "green") +
  theme(legend.position = "none")
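
The summary output above shows a median SentimentTitle of exactly 0, so at least some titles have a score of exactly zero. As a rough sketch of another way to highlight what is unclear, the code below marks those points in a separate color. It uses a small made-up data frame (`toy`), and it is only my assumption that an exact zero is ambiguous (a neutral title versus a title the scoring method could not rate); the documentation would need to confirm this.

```r
library(ggplot2)

# Toy stand-in for the real data; values are hypothetical.
toy <- data.frame(
  IDLink = 1:8,
  SentimentTitle = c(-0.30, 0.00, 0.12, 0.00, -0.05, 0.00, 0.25, -0.15)
)

# Label each score as exactly zero or not.
# Assumption: exact zeros deserve a closer look; this is not stated in the documentation.
toy$score_type <- ifelse(toy$SentimentTitle == 0, "exactly zero", "non-zero")
n_zero <- sum(toy$score_type == "exactly zero")

p <- ggplot(toy, aes(x = IDLink, y = SentimentTitle, color = score_type)) +
  geom_point(size = 3) +
  scale_color_manual(
    values = c("exactly zero" = "red", "non-zero" = "grey60"),
    name = "Score"
  ) +
  labs(
    title = "Title sentiment with exact zeros highlighted",
    x = "IDLink",
    y = "SentimentTitle",
    caption = paste(n_zero, "of", nrow(toy), "toy scores are exactly 0")
  ) +
  theme_minimal()
print(p)
```

Keeping the legend visible here lets a reader see what the red points mean without reading the code.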