TEAconf community member Kostas Sgantzos, alongside co-author Massimiliano Ferrara of the University Mediterranea of Reggio Calabria, has published a new peer-reviewed paper in WSEAS Transactions on Business and Economics exploring a pressing problem at the intersection of AI and data integrity: what happens when the data used to train AI systems can't be trusted?
The paper, "Mitigating Big Data Pollution and AI Model Deterioration: A Dataset Core Approach with Blockchain-Based Verification", addresses the growing risk of AI model collapse — the phenomenon where models trained on increasingly AI-generated data progressively lose diversity and predictive reliability. As synthetic content floods the internet, successor models inherit its statistical biases, compressing the richness of the information ecosystem they depend on.
The authors' proposed solution combines two innovations. The first is a Dataset Core methodology, a mathematically grounded approach that preserves the essential information content of a training dataset while filtering out polluted or adversarially injected samples. Experimental results show that a core constructed at just 20% of the original dataset size retains around 95% of the full dataset's information value, while reducing the effectiveness of adversarial poisoning attacks by 73%.
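The paper does not reproduce its construction here, but the general idea of a compact, representative core can be illustrated with a standard greedy coverage heuristic (k-centre selection). Everything below, including the function name `build_core` and the choice of heuristic, is an illustrative sketch, not the authors' actual method:

```python
import math
import random

def build_core(points, core_fraction=0.2, seed=0):
    """Greedy k-centre sketch: repeatedly add the point farthest from the
    current core, so a small subset spreads across the whole dataset.
    Illustrative only; the paper's Dataset Core construction may differ."""
    rng = random.Random(seed)
    n = len(points)
    k = max(1, int(core_fraction * n))
    first = rng.randrange(n)
    core = [first]
    # distance from every point to its nearest core member so far
    dist = [math.dist(p, points[first]) for p in points]
    for _ in range(k - 1):
        nxt = max(range(n), key=dist.__getitem__)  # farthest point joins
        core.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < dist[i]:
                dist[i] = d
    return core

# toy usage: 100 random 2-D points, keep a 20% core
rng = random.Random(1)
data = [(rng.random(), rng.random()) for _ in range(100)]
core_indices = build_core(data, core_fraction=0.2)
print(len(core_indices))  # → 20
```

A heuristic like this keeps the subset geometrically spread out; the paper's contribution is the information-theoretic guarantee on what such a core retains, plus the filtering of poisoned samples.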
The second is a blockchain-based verification layer built on Triple Entry Accounting principles. Every training iteration is recorded as a three-part transaction: the input data hash, the model output, and an immutable blockchain record linking the two. This creates a complete, tamper-evident audit trail for AI training — something conventional machine learning pipelines entirely lack.
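The three-part record can be sketched in a few lines: each training iteration stores the hash of its input data and the model output, linked to the previous record by hash so that any later tampering is detectable. This is a minimal stand-in for the ledger idea; the paper uses an actual blockchain, and the field names and functions here are hypothetical:

```python
import hashlib
import json

def record_iteration(chain, input_data: bytes, model_output: str):
    """Append a tamper-evident record tying an input-data hash and a model
    output to the previous record. Hash-chained list as a blockchain stand-in."""
    prev_hash = chain[-1]["record_hash"] if chain else "0" * 64
    entry = {
        "input_hash": hashlib.sha256(input_data).hexdigest(),
        "model_output": model_output,
        "prev_hash": prev_hash,
    }
    # record_hash covers the three fields above, in a canonical order
    entry["record_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    chain.append(entry)
    return entry

def verify_chain(chain):
    """Recompute every hash; any edit to history breaks the chain."""
    prev = "0" * 64
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "record_hash"}
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if body["prev_hash"] != prev or recomputed != entry["record_hash"]:
            return False
        prev = entry["record_hash"]
    return True

chain = []
record_iteration(chain, b"batch-0 training data", "loss=0.91")
record_iteration(chain, b"batch-1 training data", "loss=0.64")
print(verify_chain(chain))   # → True
chain[0]["model_output"] = "loss=0.10"  # tamper with history
print(verify_chain(chain))   # → False
```

The tamper test at the end shows why the trail is audit-grade: rewriting any past iteration invalidates every subsequent hash, which is the property conventional training logs do not have.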
The paper also introduces an economic incentive mechanism that rewards high-quality data contributions and financially disincentivises data poisoning, along with a thought-provoking proposal to use proof-of-work token systems to counter what the authors term "cognitive imperialism" — the risk that AI convenience erodes genuine human reasoning and marginalises non-dominant knowledge systems.
The authors acknowledge Ian Grigg and George Papageorgiou for their review of early drafts.