In the contemporary landscape of artificial intelligence (AI) and machine learning (ML), the integrity, diversity, and quality of training datasets are critical for ensuring the accuracy and reliability of predictive models. However, big-data pollution, manifested through AI-generated synthetic data, inconsistencies, biases, and data poisoning within datasets, undermines model performance by diminishing the Shannon entropy of the system. This study proposes a novel framework that integrates the Dataset Core approach with tokenized data, triple-entry accounting (TEA), and distributed ledger technology (DLT) to address these challenges. Our Dataset Core method preserves the essential information value of a dataset while filtering out potentially harmful elements, providing mathematically grounded protection against data pollution. Combined with blockchain-based verification, this approach establishes a foundation for greater transparency and trustworthiness in AI applications, with significant implications for sectors such as finance and healthcare.
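As a point of reference for the entropy argument above (the standard definition, not the paper's own notation), the Shannon entropy of a dataset whose records follow a discrete distribution $p$ over values $x_1, \dots, x_n$ is

$$
H(X) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i),
$$

which is maximized when records are diverse (near-uniform $p$) and decreases as pollution concentrates the dataset on a few repeated or synthetically generated patterns, reducing the information available to a trained model.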