Automated pipeline for building a fraud detection training dataset
DOI:
https://doi.org/10.34185/1562-9945-3-164-2026-03
Keywords:
dataset, machine learning, transaction, e-commerce, LightGBM, autoencoder, IP Insights, EMV 3D-Secure, fraud detection, synthetic data
Abstract
The study addresses the problem of preparing training data for machine learning-based fraud detection systems in e-commerce transactions. Due to the strict confidentiality of real transaction data, researchers often rely on publicly available datasets that typically suffer from limited attribute schemas, anonymized features, and a focus on specific national markets. An analysis of existing open datasets revealed the necessity of creating a specialized dataset, as none of the available sources provide a sufficient combination of realistic fraud labels, semantic transparency of features, and domain-specific attributes required for training a multi-component fraud detection system.
An automated pipeline is proposed for integrating three open Kaggle datasets (IEEE-CIS Fraud Detection, Credit Card Transactions Fraud Detection Dataset, and Fraudulent E-Commerce Transactions). The pipeline preserves authentic fraud labels and original transaction amounts while enriching records with synthetic attributes adapted to the Ukrainian payment market. The methods developed include: uniform normalization of timestamps based on a quantile rank transformation, which eliminates dataset shift artifacts while preserving intra-day patterns; synthetic generation of authentication attributes according to the EMV 3D-Secure 2.0 standard, with payment network distributions based on National Bank of Ukraine statistics; formation of aggregated client behavioral profiles; and generation of “entity-IP” pairs for IP Insights model training. Both auxiliary datasets (the behavioral profiles and the entity-IP pairs) are derived exclusively from the training subset to prevent data leakage.
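The quantile rank idea behind the timestamp normalization can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes timestamps are epoch seconds aligned to UTC, that a record's rank among all source timestamps selects its target day, and that its original time of day is carried over unchanged. The function name `normalize_timestamps` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta

SECONDS_PER_DAY = 86_400


def normalize_timestamps(timestamps, target_start, target_days):
    """Map source timestamps onto a target window by quantile rank.

    timestamps   -- epoch seconds from one source dataset (assumed UTC)
    target_start -- datetime at midnight marking the start of the window
    target_days  -- length of the target window in days
    """
    n = len(timestamps)
    if n == 0:
        return []

    # Rank records by original time; ties keep their input order.
    order = sorted(range(n), key=lambda i: timestamps[i])

    result = [None] * n
    for rank, i in enumerate(order):
        quantile = rank / n                       # value in [0, 1)
        day_offset = int(quantile * target_days)  # uniform over the window
        second_of_day = timestamps[i] % SECONDS_PER_DAY  # intra-day pattern kept
        result[i] = target_start + timedelta(days=day_offset, seconds=second_of_day)
    return result


if __name__ == "__main__":
    source = [0, 10 * SECONDS_PER_DAY + 3600, 500 * SECONDS_PER_DAY + 7200]
    for ts in normalize_timestamps(source, datetime(2024, 1, 1), 3):
        print(ts.isoformat())
```

Under these assumptions, records end up spread evenly over the target days regardless of how unevenly the source dataset covered its own period, while the hour-of-day distribution stays as it was in the source.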
The resulting dataset comprises 500,000 transactions spanning 24 months with a fraud rate of 3.04%. It is designed for training a model pipeline that includes LightGBM, an autoencoder, and IP Insights. The chronological split simulates real-world deployment conditions, in which models are trained on historical events and evaluated on future ones.
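The chronological split and the leakage rule for the auxiliary datasets can be sketched as follows. This is an illustration under stated assumptions, not the published pipeline: the record fields (`timestamp`, `client_id`, `amount`, `ip`), the function names, and the default 80/20 split fraction are all hypothetical, and the profile here holds only a transaction count and a mean amount.

```python
from collections import defaultdict


def chronological_split(transactions, train_fraction=0.8):
    """Sort by time and cut once, so every test record is later than training."""
    ordered = sorted(transactions, key=lambda t: t["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def build_client_profiles(train):
    """Aggregate per-client behavior using training records only."""
    totals = defaultdict(lambda: {"count": 0, "amount_sum": 0.0})
    for t in train:
        entry = totals[t["client_id"]]
        entry["count"] += 1
        entry["amount_sum"] += t["amount"]
    return {
        client: {
            "tx_count": entry["count"],
            "mean_amount": entry["amount_sum"] / entry["count"],
        }
        for client, entry in totals.items()
    }


def build_entity_ip_pairs(train):
    """Collect unique (entity, IP) pairs from training records only."""
    return sorted({(t["client_id"], t["ip"]) for t in train})


if __name__ == "__main__":
    txs = [
        {"timestamp": 3, "client_id": "c1", "amount": 30.0, "ip": "10.0.0.3"},
        {"timestamp": 1, "client_id": "c1", "amount": 10.0, "ip": "10.0.0.1"},
        {"timestamp": 2, "client_id": "c2", "amount": 20.0, "ip": "10.0.0.2"},
        {"timestamp": 4, "client_id": "c3", "amount": 40.0, "ip": "10.0.0.4"},
    ]
    train, test = chronological_split(txs, train_fraction=0.5)
    print(build_client_profiles(train))
    print(build_entity_ip_pairs(train))
```

Because both helpers receive only the training subset, a client or IP address that first appears in the evaluation period contributes nothing to the profiles or pairs, which is the leakage property the abstract describes.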
References
European Banking Authority & European Central Bank. (2025). Joint EBA-ECB report on payment fraud. https://www.eba.europa.eu/publications-and-media/press-releases/joint-eba-ecb-report-payment-fraud-strong-authentication-remains-effective-fraudsters-are-adapting
Visa Acceptance Solutions & Merchant Risk Council. (2025). 2025 Global eCommerce Payments & Fraud Report. https://www.visaacceptance.com/content/dam/documents/campaign/fraud-report/global-fraud-report-2025.pdf
Ostrovska, K., & Nosov, V. (2025). Machine learning methods for antifraud systems. System technologies, 5(160), 156–163. https://doi.org/10.34185/1562-9945-5-160-2025-16
Grover, P., Xu, J., Tittelfitz, J., Cheng, A., Li, Z., Zablocki, J., Liu, J., & Zhou, H. (2022). Fraud Dataset Benchmark and Applications. Amazon Science. https://doi.org/10.48550/arXiv.2208.14417
Pushkarenko, Y., & Zaslavskyi, V. (2024). Synthetic Data Generation for Fraud Detection Using Diffusion Models. Information Systems and Innovative Technologies in Professional Activity (ISIJ), 55(2), 185–198. https://doi.org/10.11610/isij.5534
IEEE-CIS Fraud Detection. (2019). Kaggle. https://www.kaggle.com/competitions/ieee-fraud-detection/overview
Credit Card Transactions Fraud Detection Dataset. (2020). Kaggle. https://www.kaggle.com/datasets/kartik2112/fraud-detection
Sparkov Data Generation. GitHub. https://github.com/namebrandon/Sparkov_Data_Generation
Fraudulent E-Commerce Transactions. (2024). Kaggle. https://www.kaggle.com/datasets/shriyashjagtap/fraudulent-e-commerce-transactions
Anti-Money Laundering Datasets (IBM AMLSim). (2021). GitHub.
Credit Card Fraud Detection Dataset. (2018). Machine Learning Group, Université Libre de Bruxelles. Kaggle. https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
EMVCo. (2025). EMV 3-D Secure Protocol and Core Functions Specification v2.2.0. https://www.emvco.com/emv-technologies/3-d-secure/
Visa vyperedyla Mastercard za kilkistiu kartok v obihu v Ukraini [Visa overtook Mastercard by number of cards in circulation in Ukraine]. (2025). Forbes Ukraine. https://forbes.ua/news/visa-viperedila-mastercard-za-kilkistyu-kartok-v-obigu-v-ukraini-27052025-30063 [in Ukrainian].
StatCounter. (2024). Mobile Operating System Market Share Ukraine. https://gs.statcounter.com/os-market-share/mobile/ukraine
License
Copyright (c) 2026 System technologies

This work is licensed under a Creative Commons Attribution 4.0 International License.









