Detect fraud using users' activity on the system.
I built a process to identify fraudulent activity on a financial platform I built a couple of years back. Happy to share the core concept and idea.
There are some prerequisites before you can build a model to identify fraudulent transactions.
- Capture login details: time, duration, IP address, and machine.
- Mark each activity: Ensure you capture all user activity, including each click and path.
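As a rough illustration of these two prerequisites, the records could be shaped like the sketch below. Every name here (`LoginEvent`, `ActivityEvent`, `EventLog`, the field names, the sample IP and paths) is made up for the example; the original platform's schema is not shown in this post, and a real system would persist events to a database or log pipeline rather than keep them in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now():
    return datetime.now(timezone.utc)


@dataclass
class LoginEvent:
    # Hypothetical schema for the login details listed above
    user_id: str
    ip_address: str
    machine: str
    logged_in_at: datetime = field(default_factory=_now)
    duration_seconds: float = 0.0  # filled in when the session ends


@dataclass
class ActivityEvent:
    # One record per click or navigation step
    user_id: str
    path: str
    action: str
    occurred_at: datetime = field(default_factory=_now)


class EventLog:
    """In-memory store, used here only to keep the sketch self-contained."""

    def __init__(self):
        self.logins = []
        self.activities = []

    def record_login(self, event):
        self.logins.append(event)

    def record_activity(self, event):
        self.activities.append(event)

    def paths_for(self, user_id):
        """Ordered list of paths a single user visited."""
        return [e.path for e in self.activities if e.user_id == user_id]


log = EventLog()
log.record_login(LoginEvent("u1", "203.0.113.7", "laptop-chrome"))
log.record_activity(ActivityEvent("u1", "/home", "click"))
log.record_activity(ActivityEvent("u1", "/transfer", "click"))
log.record_activity(ActivityEvent("u2", "/home", "click"))
print(log.paths_for("u1"))
```

Keeping the events in order per user matters, because the ordered path is what a usage signature is later built from.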
The idea is to build each user's unique platform usage signature. We can then compare that signature against the user's behavior every time they use the app. Hackers might be able to bypass the login, but they will not be able to use the app in the same manner as the original user.
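One simple way to picture a usage signature is as the relative frequency of page-to-page transitions, compared with cosine similarity. This is only a sketch of the idea under that assumption, not the method used on the original platform; `build_signature`, `similarity`, and the page names are hypothetical.

```python
from collections import Counter
from math import sqrt


def build_signature(paths):
    """Relative frequency of each consecutive (from, to) page transition."""
    transitions = Counter(zip(paths, paths[1:]))
    total = sum(transitions.values())
    if total == 0:
        return {}
    return {t: count / total for t, count in transitions.items()}


def similarity(sig_a, sig_b):
    """Cosine similarity: 0 means no shared transitions, 1 means identical mix."""
    dot = sum(v * sig_b.get(k, 0.0) for k, v in sig_a.items())
    norm_a = sqrt(sum(v * v for v in sig_a.values()))
    norm_b = sqrt(sum(v * v for v in sig_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sessions for illustration
history = ["home", "balance", "statements", "home", "balance", "transfer"]
same_user = ["home", "balance", "statements", "home", "balance"]
intruder = ["home", "settings", "change_email", "add_payee", "transfer"]

baseline = build_signature(history)
same_score = similarity(baseline, build_signature(same_user))
intruder_score = similarity(baseline, build_signature(intruder))
print("same user:", round(same_score, 3))
print("intruder:", round(intruder_score, 3))
```

A session whose score falls well below the user's usual range would be a candidate for extra checks. In practice the signature would likely also include timing, device, and location signals.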
To build a model that can identify unauthorized access to an account and stop fraud, you can follow these steps:
- Collect User Login and User Activity Data: Collect data on user login and activity on the platform. This data can include login timestamps, IP addresses, geolocation, device information, and user activity logs.
- Define Key Features: Identify the key features that can help you detect illegal activity on the platform. These features may include login frequency, unusual login times or locations, multiple logins from different devices, and unusual activity patterns.
- Train a Machine Learning Model: Once you have defined the key features, you can train a machine learning model on the collected data. You can use a supervised learning algorithm like logistic regression, decision trees, or random forest to train the model. The model should be able to distinguish between legitimate user activity and fraudulent activity.
- Validate and Test the Model: Validate and test the model using a holdout dataset or cross-validation. This will help you determine the accuracy, precision, and recall of the model. You can adjust the model parameters and features based on the validation results.
- Implement the Model: Once you are satisfied with the model’s performance, you can implement it on the platform. The model can be integrated with the platform’s security system to automatically detect and flag any suspicious activity. This can trigger further investigation or action to prevent fraud.
- Continuously Monitor and Improve the Model: Finally, you should continuously monitor the model’s performance and improve it over time. You can collect more data, refine the features, or use more advanced machine-learning techniques to improve the model’s accuracy and effectiveness in detecting fraud.
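The "Define Key Features" step is where raw logs become model inputs. A minimal sketch with pandas is below, assuming a raw login table with `user_id`, `timestamp`, and `device` columns. Those column names, the sample rows, and the rule that a login between midnight and 5 a.m. counts as unusual are all assumptions for illustration; the output names mirror the feature columns used in this post's sample code.

```python
import pandas as pd

# Made-up raw login log
logins = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-02 09:30", "2024-01-03 03:15",
        "2024-01-01 14:00", "2024-01-02 15:00",
    ]),
    "device": ["laptop", "laptop", "phone", "laptop", "laptop"],
})

# Assumed rule: logins from 00:00 through 05:59 are treated as unusual
logins["hour"] = logins["timestamp"].dt.hour
logins["unusual_login_time"] = logins["hour"].between(0, 5).astype(int)

# Aggregate per user into model-ready features
user_features = logins.groupby("user_id").agg(
    login_frequency=("timestamp", "count"),
    unusual_login_time=("unusual_login_time", "max"),
    device_count=("device", "nunique"),
)
user_features["multiple_devices"] = (user_features["device_count"] > 1).astype(int)

print(user_features)
```

A fixed time window is a crude stand-in; comparing each login against that user's own typical hours would fit the signature idea more closely.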
When the model flagged a login as suspicious, an additional verification step was invoked, asking the user to enter an OTP sent to their mobile phone.
In the first year alone, the model flagged 600 logins as fraudulent, of which approximately 370 turned out to be genuine fraud and the rest were false alarms (roughly 62% of the flags were correct).
Sample Python code for the steps above follows. The code is much simplified: in the real system the activity and login data live in different datasets, but it gives enough guidance to build code for your own application.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
# Load User Login and User Activity Data
data = pd.read_csv('user_activity.csv')
# Define Key Features
features = ['login_frequency', 'unusual_login_time', 'unusual_location', 'multiple_devices', 'unusual_activity']
# Create X and y variables
X = data[features]
y = data['is_fraud']
# Split Data into Training and Testing Sets (stratify keeps the fraud ratio the same in both sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Train a Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)
# Validate and Test the Model
y_pred = rfc.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
# Implement the Model
new_data = pd.read_csv('new_user_activity.csv')
new_X = new_data[features]
new_pred = rfc.predict(new_X)
new_data['is_fraud'] = new_pred
new_data.to_csv('new_user_activity_with_fraud.csv', index=False)
In this code, we first load the user login and activity data from a CSV file called ‘user_activity.csv’. The data contains columns for login frequency, unusual login time, unusual location, multiple devices, and unusual activity, as well as a binary ‘is_fraud’ column indicating whether the activity is fraudulent or not.
We then define the key features to be used in the model, create X and y variables, and split the data into training and testing sets. We train a Random Forest Classifier on the training data and validate and test the model on the testing data using accuracy, precision, and recall metrics.
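The code above scores the model on a single holdout split. The cross-validation alternative mentioned in the steps could look roughly like the sketch below. The data is synthetic and generated only so the example runs on its own; the labeling rule (fraud when at least three of the four flags are set) is an assumption and says nothing about how real fraud labels behave.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

rng = np.random.default_rng(42)
n = 500
flag_columns = ["unusual_login_time", "unusual_location",
                "multiple_devices", "unusual_activity"]

# Synthetic stand-in for user_activity.csv
X = pd.DataFrame(rng.integers(0, 2, size=(n, 4)), columns=flag_columns)
X.insert(0, "login_frequency", rng.integers(1, 30, size=n))
# Assumed labeling rule, for illustration only
y = (X[flag_columns].sum(axis=1) >= 3).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=42)
# Stratified folds keep the fraud ratio similar in every fold
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(model, X, y, cv=folds, scoring=["precision", "recall"])

print("Precision per fold:", scores["test_precision"])
print("Recall per fold:", scores["test_recall"])
```

Looking at the spread across folds gives a better sense of how stable the precision and recall are than a single split does, which matters when fraud cases are rare.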
Finally, we implement the model on new user activity data from a CSV file called ‘new_user_activity.csv’.
We load the data, extract the features, predict the fraudulent activity using the trained model, and add the predicted ‘is_fraud’ column to the original data.
We save the data to a new CSV file called ‘new_user_activity_with_fraud.csv’.
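On the live platform, a flag led to an OTP challenge rather than a CSV export. A minimal sketch of that decision is below, assuming each session has a fraud probability. With the random forest in this post, such scores could come from `rfc.predict_proba(new_X)[:, 1]`; here they are made up so the example runs on its own. `choose_action`, `OTP_THRESHOLD`, and the session IDs are hypothetical names, and the 0.5 cutoff is an arbitrary starting point.

```python
import pandas as pd

OTP_THRESHOLD = 0.5  # assumed cutoff; tune against precision and recall


def choose_action(fraud_probability, threshold=OTP_THRESHOLD):
    """Ask for an OTP when the score reaches the threshold, otherwise allow."""
    return "require_otp" if fraud_probability >= threshold else "allow"


# Made-up scores standing in for model output
sessions = pd.DataFrame({
    "session_id": ["s1", "s2", "s3"],
    "fraud_probability": [0.05, 0.62, 0.91],
})
sessions["action"] = sessions["fraud_probability"].apply(choose_action)

print(sessions)
```

Lowering the threshold catches more fraud at the cost of challenging more legitimate users, so the cutoff is a product decision as much as a modeling one.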