Data Discovery

Automated scanning and PII detection across your data landscape

Overview

Neostra's Data Discovery module enables organizations to automatically scan, classify, and map personal and sensitive data across diverse data sources. It combines an orchestration service with a distributed scanner worker to detect PII using Microsoft Presidio and custom pattern matching, supporting compliance with GDPR, DPDPA, and other privacy regulations.

Data Discovery supports GDPR Article 30 Records of Processing Activities (RoPA) mapping, connecting discovered data points to processing activities and data subjects.

Architecture

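The module is composed of two services that work together: an orchestration service, which manages integrations, scans, and findings, and a Python scanner worker, which executes scan tasks and runs PII detection.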

Supported Data Sources

How Scanning Works

Configure an Integration

Add a data source connection through the Integrations API. Credentials are encrypted with AES before they are stored.

curl -X POST https://api.neostra.io/v1/integrations \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production PostgreSQL",
    "type": "POSTGRESQL",
    "host": "db.example.com",
    "port": 5432,
    "database": "app_db",
    "username": "scanner_user",
    "password": "encrypted-credential"
  }'
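
The same request can be assembled from Python. The following is a minimal sketch using only the standard library; the helper name build_integration_request and the set of required fields are assumptions inferred from the curl example above, not a documented client or validation rule.

```python
import json
import urllib.request

API_BASE = "https://api.neostra.io/v1"  # base URL taken from the curl example

# Assumption: these are the fields shown in the curl example; the API may
# accept or require a different set.
REQUIRED_FIELDS = {"name", "type", "host", "port", "database", "username", "password"}


def build_integration_request(token: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /integrations request."""
    missing = REQUIRED_FIELDS - config.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return urllib.request.Request(
        url=f"{API_BASE}/integrations",
        data=json.dumps(config).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request. Building it separately keeps the payload easy to inspect before anything leaves the machine.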

Create and Launch a Scan

Create a scan targeting one or more integrations. The orchestration service breaks the scan into individual scan tasks for each table, collection, or bucket.

curl -X POST https://api.neostra.io/v1/scans \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Q1 Full Discovery Scan",
    "integrationIds": ["int_abc123", "int_def456"],
    "scanType": "FULL"
  }'
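
The fan-out from one scan to many scan tasks can be pictured as follows. This is a sketch under stated assumptions: the ScanTask fields, the PENDING status value, and the plan_scan_tasks helper are hypothetical and do not describe the orchestration service's actual data model.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List


@dataclass(frozen=True)
class ScanTask:
    """Hypothetical shape of one unit of work for the scanner worker."""
    task_id: int
    scan_id: str
    integration_id: str
    target: str              # a table, collection, or bucket name
    status: str = "PENDING"  # assumed initial state


def plan_scan_tasks(scan_id: str, targets_by_integration: Dict[str, List[str]]) -> List[ScanTask]:
    """Create one task per target for every integration in the scan."""
    ids = count(1)
    return [
        ScanTask(next(ids), scan_id, integration_id, target)
        for integration_id, targets in targets_by_integration.items()
        for target in targets
    ]
```

For example, a scan over int_abc123 with two tables and int_def456 with one bucket yields three independent tasks, which is what lets several workers share a single scan.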

Worker Polls and Executes

The Python scanner worker continuously polls the scan_tasks table for pending tasks. For each task, it decrypts credentials via crypto.py, connects to the data source, samples data, and runs PII detection through pii_engine.py.
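
A polling worker of this kind usually follows a claim, execute, record cycle. The sketch below illustrates that cycle with the standard-library sqlite3 module standing in for PostgreSQL so that it runs anywhere. The column names (id, target, status), the status values, and both function names are assumptions; the table name follows the scan_tasks entry in the Database Schema section. It is not the worker's actual code.

```python
import sqlite3
from typing import Callable, Optional, Tuple


def claim_next_task(conn: sqlite3.Connection) -> Optional[Tuple[int, str]]:
    """Claim the oldest PENDING task and return (id, target), or None if none remain."""
    while True:
        row = conn.execute(
            "SELECT id, target FROM scan_tasks WHERE status = 'PENDING' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                # The status guard means a task already claimed by another
                # worker is skipped rather than claimed twice.
                "UPDATE scan_tasks SET status = 'RUNNING' WHERE id = ? AND status = 'PENDING'",
                (row[0],),
            )
        if cur.rowcount == 1:
            return row


def drain_pending(conn: sqlite3.Connection, handle: Callable[[str], None]) -> int:
    """Process every pending task once and return how many were handled."""
    processed = 0
    while (task := claim_next_task(conn)) is not None:
        task_id, target = task
        try:
            handle(target)  # connect, sample, and detect PII would happen here
            status = "COMPLETED"
        except Exception:
            status = "FAILED"
        with conn:
            conn.execute("UPDATE scan_tasks SET status = ? WHERE id = ?", (status, task_id))
        processed += 1
    return processed
```

A long-running worker would call drain_pending in a loop with a sleep between polls. On PostgreSQL, SELECT ... FOR UPDATE SKIP LOCKED is a common alternative for claiming rows when several workers poll the same table.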

PII Detection and Classification

The PII engine uses Microsoft Presidio augmented with custom regex patterns to identify sensitive data. Detected entities are classified and written back as findings with confidence scores.
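
To illustrate the custom-pattern side of detection, here is a minimal standard-library sketch. It does not call Presidio, and it is not the contents of pii_engine.py. The entity names, the regular expressions, and the confidence scores are illustrative assumptions; real recognizers typically add checksum or context validation on top of a pattern match.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    entity_type: str
    start: int
    end: int
    score: float  # confidence in the range 0.0 to 1.0


# Hypothetical custom patterns with illustrative scores.
CUSTOM_PATTERNS = {
    # Indian PAN: five letters, four digits, one letter.
    "IN_PAN": (re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"), 0.85),
    # Aadhaar: twelve digits, optionally written in groups of four.
    "IN_AADHAAR": (re.compile(r"\b[2-9][0-9]{3}\s?[0-9]{4}\s?[0-9]{4}\b"), 0.6),
}


def detect_custom(text: str, min_score: float = 0.5) -> List[Finding]:
    """Return pattern matches whose score meets the threshold, ordered by position."""
    findings = [
        Finding(name, match.start(), match.end(), score)
        for name, (pattern, score) in CUSTOM_PATTERNS.items()
        if score >= min_score
        for match in pattern.finditer(text)
    ]
    return sorted(findings, key=lambda f: f.start)
```

Raising min_score trades recall for precision: with a threshold of 0.8, only the PAN pattern in this sketch would still produce findings.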

Review Findings

Access findings through the Findings API or the Neostra dashboard. Findings are linked to data points and attributes, and can be mapped to processing activities for RoPA compliance.

PII Detection Capabilities

The scanner's PII engine combines Microsoft Presidio's NLP-based recognizers with custom regex patterns for region-specific identifiers.

| Entity Type | Description |
| --- | --- |
| PERSON | Full names and partial name components |
| EMAIL_ADDRESS | Email addresses |
| PHONE_NUMBER | International phone numbers |
| CREDIT_CARD | Credit and debit card numbers |
| IBAN_CODE | International bank account numbers |
| DATE_TIME | Dates of birth and timestamps |
| LOCATION | Physical addresses and place names |
| NRP | Nationality, religion, political affiliation |

Key API Endpoints

curl -X GET https://api.neostra.io/v1/scans \
  -H "Authorization: Bearer <token>"

Controller Reference

| Controller | Path Prefix | Purpose |
| --- | --- | --- |
| ScanController | /scans | Create, list, and manage scans |
| ScanTaskController | /scan-tasks | View individual scan task status |
| DataPointController | /data-points | Browse discovered data points |
| FindingsController | /findings | Access PII detection findings |
| IntegrationsController | /integrations | Configure data source connections |
| DataSystemController | /data-systems | Manage registered data systems |
| DataSubjectController | /data-subjects | Map data to data subject categories |
| AttributeController | /attributes | Manage data attributes and classifications |
| AnalyticsController | /analytics | Scan and discovery analytics |
| ProcessingActivityController | /processing-activities | RoPA processing activity mapping |
| VendorController | /vendors | Third-party vendor management |

Database Schema

The orchestration service uses PostgreSQL with schema managed by 16 Liquibase migration files. Core tables include:

| Table | Purpose |
| --- | --- |
| sources | Registered data source definitions |
| integrations | Connection configurations with encrypted credentials |
| scans | Scan definitions and lifecycle state |
| scan_tasks | Individual scan work items for the worker |
| findings | Detected PII findings with entity type and confidence |
| data_points | Discovered data elements within sources |
| attributes | Classification attributes for data points |

RoPA Mapping

Security

All data source credentials are encrypted with AES before they are stored. The scanner worker decrypts them at runtime, and only when connecting to a data source. Credentials are never logged or exposed through API responses.
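
One common way to keep credentials out of logs and API responses is to mask sensitive fields before a configuration is serialized. The sketch below shows the idea; the field list and the redact helper are assumptions for illustration and do not describe how Neostra implements this guarantee.

```python
# Assumption: field names that should never appear in logs or responses.
SENSITIVE_FIELDS = frozenset({"password", "secret", "token"})


def redact(config: dict) -> dict:
    """Return a copy of config with sensitive values masked."""
    return {
        key: ("***" if key.lower() in SENSITIVE_FIELDS else value)
        for key, value in config.items()
    }
```

Because redact returns a new dictionary, the original configuration stays usable for the actual connection while only the masked copy is logged or returned.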

Data Classification Schema

Neostra uses a comprehensive data classification schema to categorize discovered data elements by sensitivity level and regulatory relevance. Classification rules are pre-seeded via database migrations and can be extended with custom rules.

Sensitivity Levels

| Level | Label | Description | Examples |
| --- | --- | --- | --- |
| 1 | Public | Information that can be freely disclosed | Company name, public website content |
| 2 | Internal | Non-sensitive business information | Employee directories, internal memos |
| 3 | Confidential | Personal data requiring protection | Names, email addresses, phone numbers, addresses |
| 4 | Highly Confidential | Sensitive personal data with strict regulatory requirements | Health records, financial data, government IDs |
| 5 | Restricted | Special category data requiring highest protection | Biometric data, genetic data, children's data, political opinions |

Classification Categories

| Category | Data Elements | Sensitivity |
| --- | --- | --- |
| Basic Identity | Full name, date of birth, gender, nationality | Confidential |
| Contact Information | Email, phone, physical address, postal code | Confidential |
| Government IDs | Aadhaar, PAN, SSN, passport number, driver's license | Highly Confidential |
| Online Identifiers | IP address, device ID, cookie ID, advertising ID | Confidential |

Classification rules are pre-seeded through the CU012_ClassificationRulesInitializerChangeUnit migration. Organizations can add custom classification rules through the API to match their specific data taxonomy.
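
The relationship between categories, sensitivity levels, and custom rules can be sketched as a simple lookup. The level and category values below come from the tables in this section. The overall_sensitivity helper, and the choice to rate a data point by its most sensitive category, are assumptions made for illustration.

```python
from enum import IntEnum
from typing import Dict, Iterable, Optional


class Sensitivity(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    HIGHLY_CONFIDENTIAL = 4
    RESTRICTED = 5


# Categories and levels as listed under Classification Categories.
CATEGORY_SENSITIVITY: Dict[str, Sensitivity] = {
    "Basic Identity": Sensitivity.CONFIDENTIAL,
    "Contact Information": Sensitivity.CONFIDENTIAL,
    "Government IDs": Sensitivity.HIGHLY_CONFIDENTIAL,
    "Online Identifiers": Sensitivity.CONFIDENTIAL,
}


def overall_sensitivity(
    categories: Iterable[str],
    custom_rules: Optional[Dict[str, Sensitivity]] = None,
) -> Sensitivity:
    """Rate a data point by its most sensitive category (assumed policy).

    Custom rules extend or override the defaults; an unknown category
    raises KeyError so that gaps in the taxonomy are noticed.
    """
    rules = {**CATEGORY_SENSITIVITY, **(custom_rules or {})}
    return max((rules[c] for c in categories), default=Sensitivity.PUBLIC)
```

For instance, a table holding both contact details and government IDs would be rated Highly Confidential, and an organization could pass a custom rule such as a "Biometrics" category at the Restricted level.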

Data Mapping & Inventory

Data Mapping extends the discovery capabilities to build a comprehensive inventory of personal data processing across the organization. It provides a structured view of what data is collected, where it is stored, how it flows, and who has access.

Inventory Components

Automated RoPA Reports

Records of Processing Activities (RoPA) are mandatory under GDPR Article 30 and are supported as best practice under DPDPA. Neostra auto-generates RoPA reports by aggregating data from discovery scans, processing activities, and data subject mappings.

Run Discovery Scans

Scan all configured data sources to identify where personal data resides. The scanner automatically classifies found data using the classification schema.

Map Processing Activities

Use the ProcessingActivityController to link discovered data to declared processing purposes, legal bases, and retention periods.

Assign Data Subjects

Categorize discovered data by data subject type (customers, employees, partners) using the DataSubjectController.

Generate RoPA Report

The system aggregates all mapped data into a structured RoPA report that includes: purposes of processing, categories of personal data, categories of data subjects, recipients and transfers, retention periods, and technical/organizational security measures.

# Generate RoPA report
curl -X GET "https://api.neostra.io/v1/reports/ropa?brandId=brand_001&format=pdf" \
  -H "Authorization: Bearer <token>" \
  -o ropa_report.pdf
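
Conceptually, the aggregation behind a RoPA report groups individual mappings by processing activity. The sketch below shows that grouping for three of the report fields. The record keys (activity, purpose, data_category, data_subject) and the build_ropa helper are hypothetical and do not reflect the report endpoint's actual schema.

```python
from typing import Dict, Iterable, List


def build_ropa(mappings: Iterable[dict]) -> Dict[str, dict]:
    """Group mapping records into one RoPA entry per processing activity."""
    grouped: Dict[str, dict] = {}
    for m in mappings:
        entry = grouped.setdefault(
            m["activity"],
            {"purpose": m["purpose"], "data_categories": set(), "data_subjects": set()},
        )
        entry["data_categories"].add(m["data_category"])
        entry["data_subjects"].add(m["data_subject"])
    # Convert sets to sorted lists so the output is stable and serializable.
    return {
        activity: {
            "purpose": entry["purpose"],
            "data_categories": sorted(entry["data_categories"]),
            "data_subjects": sorted(entry["data_subjects"]),
        }
        for activity, entry in grouped.items()
    }
```

A full report would carry the remaining fields listed above (recipients and transfers, retention periods, security measures) in the same per-activity structure.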

Regulatory Use Cases

| Regulation | Data Mapping Use |
| --- | --- |
| GDPR | Build RoPA (Article 30), support DPIA, automate DSAR fulfillment |
| CCPA/CPRA | Track disclosures and sales of personal information, support opt-out requests |
| DPDPA | Demonstrate purpose limitation, track cross-border data flows, support data principal rights |
| ISO 27701 | Foundation for Privacy Information Management System (PIMS) |