METHODOLOGY OF DATASET PREPARATION FOR TRAINING E-COMMERCE FRAUD DETECTION MODELS

Authors

DOI:

https://doi.org/10.34185/1991-7848.itmm.2026.01.084

Keywords:

dataset, machine learning, transaction, e-commerce, LightGBM, autoencoder, IP Insights

Abstract

This study addresses the preparation of training data for machine-learning-based fraud detection in e-commerce transactions. An analysis of existing open sources justifies the need for a specialized dataset. An automated pipeline is proposed that merges three open datasets from the Kaggle platform (IEEE-CIS, Sparkov, Fraudulent E-Commerce), preserves their real fraud labels, and enriches the records with synthetic attributes adapted to the specifics of the Ukrainian payment market. Methods are developed for uniform normalization of timestamps, generation of authentication data, partitioning by payment systems, and construction of aggregated customer profiles and of pairs for training the IP Insights model. The result is a dataset of 500,000 transactions spanning a 24-month period with a fraud rate of 3.04%, designed to train a model pipeline that includes LightGBM, an autoencoder, and IP Insights.
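The merging and normalization steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the field names (`seconds`, `customer`, `ip`), the window dates, and the linear rescaling of each source's time range are all assumptions made for illustration. It relies only on the fact that IP Insights is trained on (entity, IPv4 address) pairs, and it copies each source's fraud label unchanged.

```python
from datetime import datetime, timedelta

# Hypothetical 24-month target window (illustrative dates, not from the paper).
WINDOW_START = datetime(2024, 1, 1)
WINDOW_END = datetime(2026, 1, 1)


def rescale_to_window(offsets, start=WINDOW_START, end=WINDOW_END):
    """Linearly map second offsets onto the half-open window [start, end)."""
    lo, hi = min(offsets), max(offsets)
    span = hi - lo + 1  # +1 keeps the latest record strictly before `end`
    window = (end - start).total_seconds()
    return [start + timedelta(seconds=(t - lo) / span * window) for t in offsets]


def merge_sources(sources):
    """Merge per-source rows into one list, keeping the original fraud label."""
    merged = []
    for name, rows in sources.items():
        if not rows:
            continue
        stamps = rescale_to_window([r["seconds"] for r in rows])
        for row, ts in zip(rows, stamps):
            merged.append({
                "source": name,
                "timestamp": ts,
                "customer_id": f"{name}:{row['customer']}",  # avoid ID clashes
                "ip": row.get("ip"),
                "is_fraud": int(row["is_fraud"]),  # label is copied, never synthesized
            })
    merged.sort(key=lambda r: r["timestamp"])
    return merged


def ip_insights_pairs(records):
    """Return (entity, IPv4) pairs for records that carry an IP address."""
    return [(r["customer_id"], r["ip"]) for r in records if r["ip"]]


def fraud_rate(records):
    """Share of records labelled as fraud."""
    return sum(r["is_fraud"] for r in records) / len(records)


# Toy rows standing in for the three Kaggle sources (values are made up).
toy = {
    "ieee_cis": [{"seconds": 86400, "customer": 1, "is_fraud": 0},
                 {"seconds": 172800, "customer": 2, "is_fraud": 1}],
    "sparkov": [{"seconds": 1325376018, "customer": 1, "is_fraud": 0}],
    "ecommerce": [{"seconds": 1708408721, "customer": 7,
                   "ip": "203.0.113.5", "is_fraud": 0}],
}
dataset = merge_sources(toy)
pairs = ip_insights_pairs(dataset)
```

Rescaling each source separately places all three on the same calendar window regardless of whether the original column is a relative offset or an absolute time. A real pipeline would probably also need to preserve hour-of-day and weekday patterns, which a plain linear mapping does not guarantee.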

References

Joint EBA-ECB report on payment fraud. 2025. URL: https://www.eba.europa.eu/publications-and-media/press-releases/joint-eba-ecb-report-payment-fraud-strong-authentication-remains-effective-fraudsters-are-adapting

Visa Payments & Fraud Report. 2025. URL: https://www.visaacceptance.com/content/dam/documents/campaign/fraud-report/global-fraud-report-2025.pdf

Ostrovska K., Nosov V. Machine learning methods for antifraud systems. Системні технології (System Technologies). 2025. Vol. 5, iss. 160. P. 156–163. URL: https://doi.org/10.34185/1562-9945-5-160-2025-16

IEEE-CIS Fraud Detection. 2019. URL: https://www.kaggle.com/competitions/ieee-fraud-detection/overview

Credit Card Transactions Fraud Detection Dataset. 2020. URL: https://www.kaggle.com/datasets/kartik2112/fraud-detection

Fraudulent E-Commerce Transactions. 2024. URL: https://www.kaggle.com/datasets/shriyashjagtap/fraudulent-e-commerce-transactions

Anti-Money Laundering Datasets. 2021. URL: https://github.com/IBM/AMLSim

Credit Card Fraud Detection. 2018. URL: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

EMV 3D-Secure. 2025. URL: https://www.emvco.com/emv-technologies/3-d-secure/

Published

2026-04-26

Issue

Section

Theses