DMCA Takedown Notice Explorer

Jim Kolberg

Takedown Notice Explorer

Digital Millennium Copyright Act of 1998

  • Provides "safe harbor" for sites when users upload copyrighted content.
  • Copyright holders can submit takedown requests to have unauthorized materials removed.

Benefits

  • Sites like YouTube or Blogger could not exist under the legal threat of copyright violation claims.
  • Copyright holders have a method of addressing piracy.

Problems

  • False claims - Auto-generated notices sweep up non-infringing "fair use" examples.
  • Censorship - Takedowns submitted to remove critical, embarrassing or unpopular content without a legitimate copyright claim.
  • Expensive remedy - Victims of false claims have the burden of proof.

Data Sources and Preparation

The website ChillingEffects.org collects DMCA takedown notices sent from copyright holders to websites that host allegedly illegal copies of protected works. Large sites, like Google, automatically forward the notices they receive to ChillingEffects.org.

This study focuses on takedown notices containing the keyword "Selma" for a one-month period beginning April 13, 2015. This keyword was chosen because it is the name of a recent popular movie nominated for an Academy Award. The actual dates on the notices range from 2015-04-14 to 2015-05-03.

Over the period there were 20 days containing at least 1 keyword hit, resulting in 231 notices. The search results are displayed in a paginated web interface with 10 results per page, much like Google. I downloaded the result pages to local files on my PC on May 13th. Each of the search results is a link to the details of the takedown notice.

The data contains many interesting facets and has relevance to current public policy discussions regarding the TPP trade agreement's proposed changes to intellectual property laws. An app to classify and explore takedown notices would increase public understanding.

Goals

  • My original goal was to use R's machine learning capabilities to look for false claims, in which the subject of the takedown does not match the content.

Obstacles

  • The number of takedown requests sent daily is staggering. Trying to get all the URLs for a month would have overwhelmed my PC, so I reduced the scope to a single movie, "Selma," for the past month. Even this took over 30 hours to load. Since one notice can be relevant to many copyrighted works, I discarded all URLs not related to Selma, and just kept a count.
  • To do a machine learning analysis, I would need a training set. Building a training set would involve visiting the URLs and evaluating the content. Since the majority of the URLs were clearly devoted to hosting pirated movies, I did not see that as an ethical course of action. Not to mention the risk of contracting malware.
  • Thousands of the URLs contain topics not suitable for polite statisticians, due to the inclusion of takedowns for pornographic movies starring an actress named "Selma Sins," which also inadvertently met my search criteria.
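
The filtering step in the first bullet (keep the URLs related to the keyword, discard the rest but keep a count) can be sketched as follows. This is a minimal illustration in Python, not the code used for the study; the function name `split_by_keyword` and the sample URLs are hypothetical, and it assumes that "related to Selma" means a case-insensitive substring match on the URL.

```python
def split_by_keyword(urls, keyword="selma"):
    """Return (kept_urls, discarded_count) using a case-insensitive substring match."""
    needle = keyword.lower()
    kept = [u for u in urls if needle in u.lower()]
    return kept, len(urls) - len(kept)


# Hypothetical URLs, invented for illustration only.
sample = [
    "http://example.com/movies/Selma.2014.html",
    "http://example.com/movies/other-film.html",
    "http://example.org/selma-2014-download",
]

kept, discarded = split_by_keyword(sample)
print(len(kept), "kept;", discarded, "discarded")
```

A plain substring match like this also keeps any unrelated title that happens to contain the keyword, which is the false-match problem described in the last bullet.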

Findings

The top 5 domains showing allegedly illegal copies of Selma are listed below, along with the number of distinct URLs linking to the movie.

## Source: local data frame [5 x 2]
## 
##                  domain count
## 1          uploaded.net   368
## 2        rapidgator.net   363
## 3 www.extremepirate.com   225
## 4         www.scnsrc.me   130
## 5       www.shaanig.com   124
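
A table like the one above can be produced by extracting the host from each URL and counting distinct URLs per host. The sketch below shows the idea in Python rather than the R used for the study; the function name `top_domains` and the sample URLs are hypothetical, and it assumes every URL includes a scheme such as `http://` (without one, `urlparse` returns an empty host).

```python
from collections import Counter
from urllib.parse import urlparse


def top_domains(urls, n=5):
    """Return the n hosts with the most distinct URLs as (host, count) pairs."""
    distinct = set(urls)  # count each URL only once
    counts = Counter(urlparse(u).netloc.lower() for u in distinct)
    return counts.most_common(n)


# Hypothetical URLs, invented for illustration only.
sample_urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/2",  # duplicate, counted once
    "http://b.example/x",
    "http://a.example/3",
    "http://b.example/y",
    "http://c.example/z",
]

top = top_domains(sample_urls, n=2)
print(top)
```

Hosts are compared as written, so `www.example.com` and `example.com` would be counted separately, matching the mix of `www.` and bare domains in the table.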

None of the DMCA requests were sent to these domains. All 231 requests were sent to Google. The copyright holders, or their representatives, have used a strategy of censoring the links to the content, rather than the content itself.

Potential Improvements

I spent the bulk of the available time for this assignment getting & cleaning the takedown data, only to find that I couldn't really do much interesting work on it.

  • Many URLs did not contain the word Selma, so there must have been something on the hosting site (rather than Google, which just had a link). Setting up a hardened PC that could analyze the content at the blocked links would be an interesting study.
  • Link the domain data to a netwhois API to attach geographic data and build an interactive map. I checked a few manually.

The best pirate