Multilingual Coarse Political Stance Classification of Media: Limitations & Ethics Statement
:::info
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Cristina España-Bonet, DFKI GmbH, Saarland Informatics Campus.
:::
Table of Links
Abstract and Intro
Corpora Compilation
Political Stance Classification
Summary and Conclusions
Limitations and Ethics Statement
Acknowledgments and References
A. Newspapers in OSCAR 22.01
B. Topics
C. Distribution of Topics per Newspaper
D. Subjects for the ChatGPT and Bard Article Generation
E. Stance Classification at Article Level
F. Training Details
5.1 Limitations
We assume that all media sources have an editorial line and an associated bias, and we treat the ILMs as any other media source. We do not consider the possibility of a ChatGPT or Bard article being unbiased. This is related to the distant supervision method used to gather the data, which currently allows only for a binary political stance annotation. Since manually annotating hundreds of thousands of articles with political biases in a truly multilingual setting does not seem feasible in the foreseeable future, we decided to implement a completely data-based method and study its language and culture transfer capabilities.
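As a rough illustration of what this distant supervision looks like in practice, the sketch below propagates an outlet-level editorial label to every article from that outlet. The outlet names, labels, and function names are hypothetical placeholders, not the paper's actual mapping or implementation.

```python
from dataclasses import dataclass

# Hypothetical outlet-level stance annotation (the "editorial line").
# In distant supervision, this source-level label is inherited by
# every article the outlet publishes.
OUTLET_STANCE = {
    "example-left-daily": "L",
    "example-right-gazette": "R",
}

@dataclass
class Article:
    outlet: str
    text: str

def label_articles(articles):
    """Assign each article the binary stance of its source outlet."""
    labelled = []
    for art in articles:
        stance = OUTLET_STANCE.get(art.outlet)
        if stance is None:
            # Outlets without a known editorial line cannot be labelled.
            continue
        labelled.append((art.text, stance))
    return labelled

if __name__ == "__main__":
    corpus = [
        Article("example-left-daily", "Article text ..."),
        Article("example-right-gazette", "Another article ..."),
    ]
    print(label_articles(corpus))
```

The obvious cost of this shortcut, discussed next, is that the article-level label is only as reliable as the outlet-level one.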
Using distant supervision to detect political stance at the article level is nonetheless delicate. First, the same newspaper can change its ideology over time. Second, and more related to the content of an individual article, non-controversial subjects might not carry any bias. Even where bias exists, it lies on a spectrum ranging from the extreme Left to the extreme Right, rather than falling into a clear-cut division between the two ideologies.
In order to quantify and, where possible, mitigate these limitations, we plan to conduct a stylistic analysis of the human-annotated corpora (Baly et al., 2020; Aksenov et al., 2021) and compare it to our semi-automatically annotated corpus. As a follow-up to this work, we will also perform a stylistic analysis of the ILM-generated texts, since a similar style between the training data and these texts is needed to ensure good generalisation and transfer capabilities.
5.2 Ethics Statement
We use generative language models, ChatGPT and Bard, to create our test data. Since we deal with several controversial subjects (death penalty, sexual harassment, drugs, etc.), the automatic generation might produce harmful text. The data presented here has not undergone any human revision. We analyse and provide the corpus as it was generated, along with an indication of the system versions used.