METHODOLOGY OF DATASET PREPARATION FOR TRAINING E-COMMERCE FRAUD DETECTION MODELS
DOI:
https://doi.org/10.34185/1991-7848.itmm.2026.01.084
Keywords:
dataset, machine learning, transaction, e-commerce, LightGBM, autoencoder, IP Insights
Abstract
This study addresses the problem of preparing training data for machine learning-based systems that detect fraud in e-commerce transactions. An analysis of existing open sources justifies the need for a specialized dataset. An automated pipeline is proposed that merges three open datasets from the Kaggle platform (IEEE-CIS, Sparkov, Fraudulent E-Commerce), preserves their real fraud labels, and enriches the records with synthetic attributes adapted to the specifics of the Ukrainian payment market. Methods are developed for uniform normalization of timestamps, generation of authentication data, partitioning by payment system, and formation of aggregated customer profiles and of pairs for training the IP Insights model. The result is a dataset of 500,000 transactions covering a 24-month period with a fraud rate of 3.04%, designed to train a model pipeline that includes LightGBM, an autoencoder, and IP Insights.
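The merging and timestamp-normalization steps can be illustrated with a minimal sketch. The abstract does not specify how they are implemented, so everything below is an assumption: the linear rescaling of each source onto a common window, the window start date, the record field names (`time`, `amount`, `is_fraud`), and the helper names `rescale_timestamps`, `merge_sources`, and `fraud_rate` are all hypothetical. The sample rows are toy values, not data from the datasets.

```python
from datetime import datetime, timedelta

# Assumed target window of 24 months; the start date is hypothetical.
WINDOW_START = datetime(2024, 1, 1)
WINDOW_END = datetime(2026, 1, 1)


def rescale_timestamps(raw_times, start=WINDOW_START, end=WINDOW_END):
    """Linearly map raw numeric times (any unit) onto [start, end]."""
    lo, hi = min(raw_times), max(raw_times)
    span = (hi - lo) or 1  # avoid division by zero for a single-valued source
    window_s = (end - start).total_seconds()
    return [start + timedelta(seconds=(t - lo) / span * window_s)
            for t in raw_times]


def merge_sources(sources):
    """Merge per-source rows into one list on a shared time axis.

    sources: dict mapping a source name to a list of dicts with the
    assumed keys 'time', 'amount', and 'is_fraud'.
    """
    merged = []
    for name, rows in sources.items():
        if not rows:
            continue
        times = rescale_timestamps([r["time"] for r in rows])
        for row, ts in zip(rows, times):
            merged.append({
                "source": name,
                "timestamp": ts,
                "amount": row["amount"],
                "is_fraud": row["is_fraud"],  # original label is kept unchanged
            })
    merged.sort(key=lambda r: r["timestamp"])
    return merged


def fraud_rate(rows):
    """Share of rows labeled as fraud."""
    return sum(r["is_fraud"] for r in rows) / len(rows)


# Toy input: one source with relative seconds, one with Unix time.
sources = {
    "ieee_cis": [
        {"time": 86400, "amount": 50.0, "is_fraud": 0},
        {"time": 172800, "amount": 900.0, "is_fraud": 1},
    ],
    "sparkov": [
        {"time": 1577836800, "amount": 12.5, "is_fraud": 0},
        {"time": 1609459200, "amount": 30.0, "is_fraud": 0},
    ],
}
merged = merge_sources(sources)
print(len(merged), fraud_rate(merged))
```

Rescaling each source independently is only one possible design. It puts sources with incompatible time encodings on the same axis, at the cost of stretching or compressing each source's original time span.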
References
Joint EBA-ECB report on payment fraud. 2025. URL: https://www.eba.europa.eu/publications-and-media/press-releases/joint-eba-ecb-report-payment-fraud-strong-authentication-remains-effective-fraudsters-are-adapting
Visa Payments & Fraud Report. 2025. URL: https://www.visaacceptance.com/content/dam/documents/campaign/fraud-report/global-fraud-report-2025.pdf
Ostrovska K., Nosov V. Machine learning methods for antifraud systems. Системні технології. 2025. Vol. 5, iss. 160. P. 156–163. URL: https://doi.org/10.34185/1562-9945-5-160-2025-16
IEEE-CIS Fraud Detection. 2019. URL: https://www.kaggle.com/competitions/ieee-fraud-detection/overview
Credit Card Transactions Fraud Detection Dataset. 2020. URL: https://www.kaggle.com/datasets/kartik2112/fraud-detection
Fraudulent E-Commerce Transactions. 2024. URL: https://www.kaggle.com/datasets/shriyashjagtap/fraudulent-e-commerce-transactions
Anti-Money Laundering Datasets. 2021. URL: https://github.com/IBM/AMLSim
Credit Card Fraud Detection. 2018. URL: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
EMV 3D-Secure. 2025. URL: https://www.emvco.com/emv-technologies/3-d-secure/




