On the Ephemerality of Web Media
Much of our research and development relies on large collections of web media content sourced from social media platforms, such as YouTube and Twitter, and then manually curated and annotated by our researchers to create “ground truth” datasets. These datasets let us train machine learning models on specific tasks and benchmark them against competing approaches in order to select the best method per case. We spend considerable effort developing scripts for crawling, monitoring, extracting and fetching the data and content related to the target task, and even more effort curating, cleaning and labelling (i.e. annotating) the collected datasets. Content labelling is particularly challenging because the task is subjective (different people may perceive the same content as belonging to different categories), and some annotation tasks raise additional issues, e.g. when dealing with NSFW and disturbing content.
Putting all the above issues aside, a major challenge that we face as scientists when working with data and content gathered from online sources is its ephemeral nature. Online data and information may cease to be available at its source. For instance, YouTube users may opt to delete one or more of their previously uploaded videos for a variety of reasons. An even more common case is that the media platforms themselves take down content due to violation of their terms of service or copyright infringement. One such case occurred in April 2018, when YouTube removed 5 million videos on the basis of content violation.
To demonstrate the issue we face, we would like to share our experience and concerns regarding two datasets that we recently (2019) created to support research on the problem of online media verification:
- The Fake Video Corpus (FVC) (Papadopoulou et al., 2019), a dataset of verified and debunked user-generated videos from YouTube, Facebook and Twitter. The FVC dataset contained 200 fake video cases and 180 real video cases. Following the semi-automatic procedure described in Papadopoulou et al. (2019), we collected 3,262 near duplicates of the above fake videos and 1,933 of the above real videos. When the dataset was publicly released in February 2019, a total of 5,575 videos from YouTube, Facebook and Twitter were available online.
- The FIVR-200K dataset (Kordopatis-Zilos et al., 2019), which was collected to simulate the problem of Fine-grained Incident Video Retrieval (FIVR). It offers a single means of evaluating several video retrieval tasks as special cases. FIVR-200K, which was collected in March 2018, consisted of 225,960 YouTube videos. The videos cover the period from January 1st 2013 to December 31st 2017 and were collected following the procedure described in Kordopatis-Zilos et al. (2019).
During our work with these datasets, we regularly noticed that specific content items were no longer available online. To investigate this more systematically, in February 2020 we checked both datasets for video availability. We found that a significant number of the videos initially included in the datasets were not available anymore. For the FVC, the videos removed by the three video platforms amount to a 21.5% reduction in the total number of videos; specifically, we measured an 18.4% reduction in the number of initial video cases and a 21.8% reduction in the number of near duplicates. Similarly, of the 225,960 YouTube videos in the FIVR-200K dataset, only 190,528 were still available on YouTube in February 2020, meaning that 35,432 videos had been removed. This corresponds to a 15.7% reduction in the total number of videos.
Summary of the number of available and unavailable videos of the FVC and the FIVR-200K at the time they were created and now (February 2020).
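As a rough illustration of this kind of availability audit, the following Python sketch splits a list of YouTube video IDs into available and unavailable ones and computes the resulting reduction. It is not the script we used: probing YouTube's public oEmbed endpoint and treating every non-200 response as "unavailable" are assumptions made for this example (they may, for instance, also count private videos or videos with embedding disabled), and the function names are hypothetical.

```python
import urllib.error
import urllib.parse
import urllib.request

OEMBED_URL = "https://www.youtube.com/oembed"  # assumed probe endpoint


def youtube_status(video_id, timeout=10.0):
    """Return the HTTP status code of the oEmbed lookup for one video ID."""
    query = urllib.parse.urlencode({
        "url": f"https://www.youtube.com/watch?v={video_id}",
        "format": "json",
    })
    try:
        with urllib.request.urlopen(f"{OEMBED_URL}?{query}", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Removed or restricted videos come back as an HTTP error status.
        # Network failures (URLError) are deliberately not caught, so that
        # a transient outage is not miscounted as a removal.
        return err.code


def split_by_availability(video_ids, fetch_status=youtube_status):
    """Split IDs into (available, unavailable); only status 200 counts as available."""
    available, unavailable = [], []
    for video_id in video_ids:
        target = available if fetch_status(video_id) == 200 else unavailable
        target.append(video_id)
    return available, unavailable


def reduction_pct(initial, remaining):
    """Percentage of items lost between two snapshots, rounded to one decimal."""
    return round(100.0 * (initial - remaining) / initial, 1)


# FIVR-200K figures from the text: 225,960 videos initially, 190,528 in February 2020.
print(reduction_pct(225_960, 190_528))
```

The `fetch_status` parameter is injectable so that the splitting logic can be exercised offline with recorded status codes, and so that a different probe can be substituted for Facebook or Twitter content.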
For that reason, the unavailable videos had to be removed from the corresponding dataset releases. Online content ephemerality is a concern for the research community because creating such datasets takes considerable effort and because missing items undermine the reproducibility of the corresponding experiments. Potential solutions that we see adopted by the research community include, among others, the following:
- Release extracted features from the original media collection: This is a practice that is common in the computer vision and multimedia community. Given the prevalence of standard and widely used feature extractors, both manually crafted (SIFT, SURF) and neural network based (VGG, ResNet, Xception, etc.), it is still possible to perform a variety of interesting experiments when one has access to the extracted features and not the original content. On the downside, maintaining and distributing a variety of features from massive media collections is expensive in terms of storage and bandwidth, while the rapid development of new feature extractors is expected to soon make such releases obsolete.
- Release reduced versions of the original media collection: This involves periodically updating the original collection to remove any media items that are no longer available online. In cases where a sizeable portion of the original collection remains available, this is an acceptable approach. However, as the figures above make clear, large reductions (on the order of 20%) are very likely within about a year of a dataset's release. This is especially problematic when the original dataset contains low-frequency classes that are of interest to the target task.
- Retaining and privately sharing the original collection: Even though this practice is in clear violation of most platforms’ Terms of Service and pertinent regulation (including copyright law and the GDPR), we regularly see it happen among members of the research community, because it is the only approach that ensures full reproducibility of research based on the original collection. The fact that this practice is still common despite its lack of legal basis may be a telltale sign of researchers’ anxiety to ensure that their research remains reproducible and relevant in the long run.
It becomes clear that this challenge has no fully satisfactory solution. For that reason, we expect that ongoing and future research on social media and the Web will continue to struggle with this problem, but we are hopeful that a set of best practices will gradually emerge that strikes a good compromise between reproducibility, research value and legal compliance.
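To make the concern about low-frequency classes in the "reduced versions" option more concrete, here is a minimal Python sketch that prunes a labelled collection down to the items still online and reports how much of each class survives. The data layout (a dictionary from video ID to class label), the function names and the toy labels are all hypothetical and do not reflect the actual format of the FVC or FIVR-200K releases.

```python
from collections import Counter


def prune_release(labels, available_ids):
    """Keep only the labelled items whose IDs are still available online."""
    available = set(available_ids)
    return {vid: label for vid, label in labels.items() if vid in available}


def class_retention(original, reduced):
    """Fraction of each class in the original release that survives pruning."""
    before = Counter(original.values())
    after = Counter(reduced.values())
    return {label: after[label] / count for label, count in before.items()}


# Toy data: one frequent class and one rare class (illustrative only).
original = {
    "v1": "common", "v2": "common", "v3": "common", "v4": "common",
    "v5": "rare", "v6": "rare",
}
still_online = ["v1", "v2", "v3", "v5"]

reduced = prune_release(original, still_online)
for label, fraction in class_retention(original, reduced).items():
    print(f"{label}: {fraction:.0%} retained")
```

In this toy case the overall loss is one third of the items, but the rare class loses half of its examples. A per-class report of this kind could accompany each reduced release, so that users can judge whether results on rare classes remain comparable across versions.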
Papadopoulou, O., Zampoglou, M., Papadopoulos, S. & Kompatsiaris, I. (2019), “A corpus of debunked and verified user-generated videos”, Online Information Review, Vol. 43 No. 1, pp. 72-88.
Kordopatis-Zilos, G., Papadopoulos, S., Patras, I. & Kompatsiaris, I. (2019), “FIVR: Fine-grained Incident Video Retrieval”, IEEE Transactions on Multimedia, Vol. 21 No. 10, pp. 2638-2652.