To make full use of the vast amounts of social media data for scientific research, researchers increasingly collect data with methods such as web scraping. Web scraping makes it possible to automatically access and retrieve information directly from the web interfaces of social media platforms and other websites. The technical process involves two main steps: first, the website is accessed with the help of a web bot or web crawler; second, the retrieved information is automatically analyzed and, where necessary, extracted.
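The two steps described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not a production scraper: the function and class names, the "research-bot/0.1" user agent, the sample markup and the assumption that posts sit in <p class="post"> elements are all hypothetical. Whether a given page may be accessed in this way is exactly the legal question discussed in this article.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_page(url: str, user_agent: str = "research-bot/0.1") -> str:
    """Step 1: access the page the way a simple web bot would."""
    # The user agent string is a made-up example.
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class PostExtractor(HTMLParser):
    """Step 2: collect the text of every <p class="post"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.posts: list[str] = []
        self._inside_post = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "p" and ("class", "post") in attrs:
            self._inside_post = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside_post:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._inside_post:
            self.posts.append("".join(self._buffer).strip())
            self._inside_post = False


def extract_posts(html: str) -> list[str]:
    """Analyze the retrieved HTML and return only the post texts."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts


# Inline sample so the extraction step can be tried without network access.
SAMPLE_HTML = """
<html><body>
  <p class="post">First public post</p>
  <p class="note">Not a post</p>
  <p class="post">Second <b>public</b> post</p>
</body></html>
"""

if __name__ == "__main__":
    print(extract_posts(SAMPLE_HTML))
```

In practice, the string returned by fetch_page would be passed to extract_posts; the inline sample merely keeps the example self-contained.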
In this article, we assess the reasons behind the rising use of web scraping methods and the legal problems this entails. The German legal situation serves as an example, as other European countries have similar regulations. First, using Facebook as an example, we consider how the Terms of Service of social media platforms affect the use of web scraping. We then explain the effects of copyright law on web scraping, specifically with regard to the text and data mining provision in Section 60d of the German Act on Copyright and Related Rights (UrhG).
The importance of web scraping
The growing popularity of web scraping can largely be traced back to the scandal surrounding Facebook and Cambridge Analytica in March 2018. Social networks generally provide programming interfaces (Application Programming Interfaces or APIs) through which data can be collected for different purposes, including scientific research. Cambridge Analytica misused Facebook’s API, partially under the pretext of scientific research, to collect data extensively for election campaign purposes.
After this became public and developed into a scandal, various websites and social media platforms, especially Facebook, severely restricted the use of their APIs for accessing data. These restrictions caused a storm of protest among international scientists, who saw their research opportunities unreasonably curtailed. As a result, web scraping became a practical alternative to APIs for researchers aiming to collect data.
However, web scraping methods pose complex legal questions. To assess the legitimacy of web scraping for the purposes of scientific research, matters of contract law and copyright law have to be taken into consideration.
Is web scraping a breach of contract?
From a contract law point of view, web scraping can conflict with the Terms of Service of social media platforms. Anyone who creates an account on a social media platform enters into a contractual agreement with its provider. If web scraping methods are used after registration, they have to comply with the Terms of Service of that platform.
The Terms of Service of social media platforms are legally classified as standard business terms under Sections 305 et seq. of the German Civil Code (BGB). It therefore has to be assessed whether and how these Terms of Service can effectively restrict the use of web scraping methods by users. Facebook’s Terms of Service are used as an example here. The Terms of Service of other social media platforms contain similar passages that prohibit the automated accessing of websites, and some even refer directly to web scraping.
Facebook’s Terms of Service state:
“2. What you can share and do on Facebook
We want people to use Facebook to express themselves and to share content that is important to them, but not at the expense of the safety and well-being of others or the integrity of our community. You therefore agree not to engage in the conduct described below (or to facilitate or support others in doing so):
[…]
You may not access or collect data from our Products using automated means (without our prior permission) or attempt to access data you do not have permission to access.”
The prohibition on automated access to Facebook data established here also applies to web scraping methods. However, this passage could be ineffective if it contradicts provisions of the applicable copyright law.
Web scraping as permitted data mining
The starting point of these considerations is Section 60d of the German Act on Copyright and Related Rights (German: Urheberrechtsgesetz or UrhG), which allows the reproduction of works for the purposes of text and data mining. According to Section 60d paragraph 1 sentence 1 no. 1, it is permissible to automatically and systematically reproduce a large number of works for scientific research in order to create a corpus that can be analyzed, particularly through normalization, structuring and categorization. According to sentence 2 of that paragraph, no commercial purposes may be pursued in doing so.
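To make the terms normalization, structuring and categorization more concrete, the following minimal Python sketch shows what such corpus preparation can look like in practice. It is an illustration under stated assumptions only: the keyword lists, the CorpusEntry record and all function names are hypothetical and are not taken from the statute or its rationale.

```python
import re
import unicodedata
from dataclasses import dataclass

# Hypothetical keyword lists, chosen only for illustration.
CATEGORY_KEYWORDS = {
    "politics": {"election", "campaign", "vote"},
    "research": {"study", "data", "science"},
}


@dataclass
class CorpusEntry:
    """Structured representation of one document in the corpus."""
    doc_id: int
    text: str
    tokens: list[str]
    category: str


def normalize(raw: str) -> str:
    """Normalization: unify Unicode forms, case and whitespace."""
    text = unicodedata.normalize("NFKC", raw).lower()
    return re.sub(r"\s+", " ", text).strip()


def categorize(tokens: list[str]) -> str:
    """Categorization: pick the category with the most keyword hits."""
    best, best_hits = "other", 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = sum(1 for token in tokens if token in keywords)
        if hits > best_hits:
            best, best_hits = category, hits
    return best


def build_corpus(raw_documents: list[str]) -> list[CorpusEntry]:
    """Structuring: turn raw texts into uniform, analyzable records."""
    corpus = []
    for doc_id, raw in enumerate(raw_documents):
        text = normalize(raw)
        tokens = re.findall(r"\w+", text)
        corpus.append(CorpusEntry(doc_id, text, tokens, categorize(tokens)))
    return corpus
```

Each step necessarily works on copies of the source texts, which is why the permission to reproduce works matters for this kind of analysis.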
This provision only enables the reproduction of content that can already be lawfully accessed. According to the legal rationale, Section 60d “does not create an additional right to access initially protected source material”. The rationale further states, “the ruling rather presupposes such access. It thus allows, for example scanning and searching existing texts in the collection of the institute library or literature procured via inter-library loan in order to carry out the so-called text and data mining. It also allows the use of digital source material to the extent that the right holder makes it available to everyone on the Internet.”
For data mining on Facebook, Section 60d paragraph 1 sentence 1 no. 1 UrhG is therefore applicable in principle. Registration (and the process of logging in) provides lawful access to the data to which the web scraping procedure is to be applied. Access with automated tools is prohibited by Facebook’s Terms of Service. However, this prohibition is not sufficient to exclude the application of Section 60d UrhG, because under Section 60g paragraph 1 UrhG a right holder cannot invoke agreements that restrict or prohibit the uses permitted by law.
The purpose of the permission in Section 60d paragraph 1 sentence 1 no. 1 UrhG is to enable the use of novel technical methods for research purposes, especially online. A central consideration behind the introduction of the copyright exception for text and data mining is that, once access has been lawfully obtained, automated evaluation of the content should be allowed (“the right to read is the right to mine”). If text and data mining could simply be banned by restrictive Terms of Service, the purpose of the provision would not be fulfilled. Therefore, Section 60g paragraph 1 UrhG is to be understood to mean that the use of technical procedures such as web scraping for scientific research purposes cannot be excluded by Terms of Service. Because of Section 60g paragraph 1 UrhG, web scraping for research purposes therefore does not violate Facebook’s Terms of Service.
Web scraping possible under certain conditions
Web scraping has contractual and copyright implications. However, web scraping for non-commercial research purposes cannot be fully excluded by the Terms of Service of social media platforms. It is permitted if it is limited to individual areas and if it is necessary for research.