Data Discovery

Automated scanning and PII detection across your data landscape

Overview

Neostra's Data Discovery module enables organizations to automatically scan, classify, and map personal and sensitive data across diverse data sources. It combines an orchestration service with a distributed scanner worker to detect PII using Microsoft Presidio and custom pattern matching, supporting compliance with GDPR, DPDPA, and other privacy regulations.

Data Discovery supports GDPR Article 30 Records of Processing Activities (RoPA) mapping, connecting discovered data points to processing activities and data subjects.

Architecture

The module is composed of two services that work together:

Data Discovery Boot

Java Spring Boot orchestration service (port 8080) backed by PostgreSQL. Manages scan configuration, integration setup, findings storage, and analytics.

Data Discovery Scanner

Python 3 worker service that polls for scan tasks, connects to configured data sources, runs PII detection, and writes findings back to the database.

Supported Data Sources

Relational Databases

PostgreSQL: Full schema and table scanning.
MySQL: Full schema and table scanning.

NoSQL & Cloud Storage

MongoDB: Collection and document scanning.
AWS S3: Object and file content scanning.
AWS DynamoDB: Table and item scanning.

How Scanning Works

Configure an Integration

Add a data source connection via the Integrations API. Credentials are encrypted using AES encryption before storage.

curl -X POST https://api.neostra.io/v1/integrations \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production PostgreSQL",
    "type": "POSTGRESQL",
    "host": "db.example.com",
    "port": 5432,
    "database": "app_db",
    "username": "scanner_user",
    "password": "encrypted-credential"
  }'

Create and Launch a Scan

Create a scan targeting one or more integrations. The orchestration service breaks the scan into individual scan tasks for each table, collection, or bucket.

curl -X POST https://api.neostra.io/v1/scans \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Q1 Full Discovery Scan",
    "integrationIds": ["int_abc123", "int_def456"],
    "scanType": "FULL"
  }'
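
The fan-out described above can be sketched in Python as follows. This is only an illustration of the idea, not the orchestration service's actual code (which is Java): the `ScanTask` shape, the `plan_scan_tasks` function, the `PENDING` status, and the catalog of targets per integration are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ScanTask:
    """One unit of work for the scanner worker (hypothetical shape)."""
    scan_name: str
    integration_id: str
    target: str  # a table, collection, or bucket name
    status: str = "PENDING"


def plan_scan_tasks(scan_name: str,
                    integration_ids: List[str],
                    targets_by_integration: Dict[str, List[str]]) -> List[ScanTask]:
    """Create one task per table, collection, or bucket of each integration."""
    tasks = []
    for integration_id in integration_ids:
        for target in targets_by_integration.get(integration_id, []):
            tasks.append(ScanTask(scan_name, integration_id, target))
    return tasks


# Illustrative catalog: which targets each integration exposes.
catalog = {
    "int_abc123": ["public.users", "public.orders"],
    "int_def456": ["customer-uploads"],
}
tasks = plan_scan_tasks("Q1 Full Discovery Scan", ["int_abc123", "int_def456"], catalog)
for task in tasks:
    print(task.integration_id, task.target, task.status)
```

Splitting a scan this way lets several workers process targets independently, and a failure on one table does not block the rest of the scan.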

Worker Polls and Executes

The Python scanner worker continuously polls the scan_tasks table for pending tasks. For each task, it decrypts credentials via crypto.py, connects to the data source, samples data, and runs PII detection through pii_engine.py.
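
A minimal sketch of the polling step is shown below. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs on its own. The table is named `scan_tasks` after the Database Schema section, and the column names and the `PENDING`, `RUNNING`, and `COMPLETED` statuses are assumptions, as are the `claim_next_task` and `run_pending` helpers.

```python
import sqlite3
from typing import Callable, Optional


def claim_next_task(conn: sqlite3.Connection) -> Optional[dict]:
    """Mark the oldest pending task as RUNNING and return it, or None if idle."""
    row = conn.execute(
        "SELECT id, target FROM scan_tasks "
        "WHERE status = 'PENDING' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    task_id, target = row
    # The status guard keeps two workers from claiming the same row.
    cursor = conn.execute(
        "UPDATE scan_tasks SET status = 'RUNNING' "
        "WHERE id = ? AND status = 'PENDING'",
        (task_id,),
    )
    conn.commit()
    if cursor.rowcount != 1:
        return None  # another worker claimed it first; try again on the next poll
    return {"id": task_id, "target": target}


def run_pending(conn: sqlite3.Connection, handle_task: Callable[[dict], None]) -> int:
    """Drain the queue once; a real worker would sleep and poll again."""
    processed = 0
    while (task := claim_next_task(conn)) is not None:
        handle_task(task)  # connect, sample, and detect PII would happen here
        conn.execute(
            "UPDATE scan_tasks SET status = 'COMPLETED' WHERE id = ?",
            (task["id"],),
        )
        conn.commit()
        processed += 1
    return processed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scan_tasks (id INTEGER PRIMARY KEY, target TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO scan_tasks (target, status) VALUES (?, 'PENDING')",
    [("public.users",), ("public.orders",)],
)
conn.commit()

seen = []
processed = run_pending(conn, lambda task: seen.append(task["target"]))
print(processed, seen)
```

On PostgreSQL, a common alternative to the guarded UPDATE is `SELECT ... FOR UPDATE SKIP LOCKED`, which lets concurrent workers skip rows that another transaction has already locked.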

PII Detection and Classification

The PII engine uses Microsoft Presidio augmented with custom regex patterns to identify sensitive data. Detected entities are classified and written back as findings with confidence scores.

Review Findings

Access discovered findings through the Findings API or the Neostra dashboard. Findings are linked to data points and attributes, and can be mapped to processing activities for RoPA compliance.

PII Detection Capabilities

The scanner's PII engine combines Microsoft Presidio's NLP-based recognizers with custom regex patterns for region-specific identifiers.

Entity Type	Description
`PERSON`	Full names and partial name components
`EMAIL_ADDRESS`	Email addresses
`PHONE_NUMBER`	International phone numbers
`CREDIT_CARD`	Credit and debit card numbers
`IBAN_CODE`	International bank account numbers
`DATE_TIME`	Dates of birth and timestamps
`LOCATION`	Physical addresses and place names
`NRP`	Nationality, religion, political affiliation

Pattern	Format	Region
Aadhaar Number	`XXXX XXXX XXXX` (12 digits)	India
PAN Card	`ABCDE1234F` (10 alphanumeric)	India
UPI ID	`user@bankhandle`	India
IP Address	IPv4 and IPv6 formats	Global
Email (extended)	Extended email pattern matching	Global
Phone (extended)	Regional phone format variations	Global
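
The region-specific formats in the table above could be matched with regular expressions along the following lines. This sketch uses only Python's standard `re` module and does not call Presidio. The entity names, confidence scores, and exact expressions are assumptions for illustration and are not taken from pii_engine.py. The patterns check shape only, not checksums or issuer validity.

```python
import re
from typing import Dict, List

# Each entry maps a hypothetical entity name to (pattern, confidence score).
CUSTOM_PATTERNS = {
    # 12 digits, optionally grouped as XXXX XXXX XXXX.
    "IN_AADHAAR": (re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"), 0.6),
    # Five letters, four digits, one letter (ABCDE1234F).
    "IN_PAN": (re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"), 0.8),
    # user@bankhandle; the negative lookahead rejects emails like user@example.com.
    "IN_UPI_ID": (re.compile(r"\b[\w.\-]{2,}@[A-Za-z]{2,}\b(?!\.[A-Za-z])"), 0.5),
}


def detect_custom_entities(text: str) -> List[Dict]:
    """Return one finding per regex match, with its span and confidence."""
    findings = []
    for entity_type, (pattern, score) in CUSTOM_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({
                "entity_type": entity_type,
                "text": match.group(),
                "start": match.start(),
                "end": match.end(),
                "score": score,
            })
    return findings


sample = "Aadhaar 1234 5678 9012, PAN ABCDE1234F, UPI ravi@okaxis, email bob@example.com"
findings = detect_custom_entities(sample)
for finding in findings:
    print(finding["entity_type"], finding["text"], finding["score"])
```

Format-only matches produce false positives (any 12-digit number looks like an Aadhaar number), which is one reason to store a confidence score with each finding rather than a yes/no flag.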

Key API Endpoints

curl -X GET https://api.neostra.io/v1/scans \
  -H "Authorization: Bearer <token>"

Controller Reference

Controller	Path Prefix	Purpose
`ScanController`	`/scans`	Create, list, and manage scans
`ScanTaskController`	`/scan-tasks`	View individual scan task status
`DataPointController`	`/data-points`	Browse discovered data points
`FindingsController`	`/findings`	Access PII detection findings
`IntegrationsController`	`/integrations`	Configure data source connections
`DataSystemController`	`/data-systems`	Manage registered data systems
`DataSubjectController`	`/data-subjects`	Map data to data subject categories
`AttributeController`	`/attributes`	Manage data attributes and classifications
`AnalyticsController`	`/analytics`	Scan and discovery analytics
`ProcessingActivityController`	`/processing-activities`	RoPA processing activity mapping
`VendorController`	`/vendors`	Third-party vendor management
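
For callers who prefer Python to curl, a thin request builder over these path prefixes might look like the sketch below. It assumes every prefix in the table is mounted under the `https://api.neostra.io/v1` base used in the curl examples, which this page confirms only for `/scans` and `/integrations`. The `build_request` helper is hypothetical and builds the request without sending it.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://api.neostra.io/v1"  # taken from the curl examples


def build_request(path: str, token: str,
                  payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) an authenticated request for a controller path."""
    headers = {"Authorization": f"Bearer {token}"}
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    method = "POST" if payload is not None else "GET"
    return urllib.request.Request(
        f"{BASE_URL}{path}", data=data, headers=headers, method=method
    )


list_scans = build_request("/scans", "<token>")
create_scan = build_request("/scans", "<token>", {
    "name": "Q1 Full Discovery Scan",
    "integrationIds": ["int_abc123", "int_def456"],
    "scanType": "FULL",
})
print(list_scans.get_method(), list_scans.full_url)
print(create_scan.get_method(), create_scan.full_url)
```

Sending is left to the caller, for example with `urllib.request.urlopen(list_scans)`, so the builder can be inspected without network access.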

Database Schema

The orchestration service uses PostgreSQL with schema managed by 16 Liquibase migration files. Core tables include:

Table	Purpose
`sources`	Registered data source definitions
`integrations`	Connection configurations with encrypted credentials
`scans`	Scan definitions and lifecycle state
`scan_tasks`	Individual scan work items for the worker
`findings`	Detected PII findings with entity type and confidence
`data_points`	Discovered data elements within sources
`attributes`	Classification attributes for data points

RoPA Mapping

Data Discovery integrates with Neostra's RoPA capabilities to map discovered data to processing activities:

Data Subject Mapping: Link discovered data points to data subject categories (customers, employees, partners).
Processing Activity Linking: Associate findings with declared processing activities for Article 30 registers.
Vendor Association: Connect data sources to third-party vendors for processor mapping.
Attribute Classification: Classify discovered attributes by data category (personal, sensitive, special category).

Use the ProcessingActivityController and DataSubjectController endpoints to build your Article 30 register from discovered data.

Security

All data source credentials are encrypted using AES encryption before storage. The scanner worker decrypts credentials at runtime only when connecting to a data source. Credentials are never logged or exposed through API responses.
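
One way to keep credentials out of logs and API responses is to mask them where data is serialized. The sketch below illustrates that idea only. It is an assumption about how such a guarantee could be enforced, not a description of Neostra's implementation, and the `SECRET_FIELDS` set and `redact` helper are hypothetical.

```python
from typing import Any

# Hypothetical list of field names that must never leave the service in clear text.
SECRET_FIELDS = {"password", "secret", "accessKey", "secretKey", "token"}


def redact(value: Any) -> Any:
    """Return a copy of a nested dict/list structure with secret fields masked."""
    if isinstance(value, dict):
        return {
            key: "***" if key in SECRET_FIELDS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


integration = {
    "name": "Production PostgreSQL",
    "type": "POSTGRESQL",
    "host": "db.example.com",
    "username": "scanner_user",
    "password": "s3cr3t",
}
safe_view = redact(integration)
print(safe_view)
```

Because `redact` builds new containers instead of mutating its input, the worker can still use the original object to connect while only the masked copy is logged or returned.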

Data Classification Schema

Neostra's data classification schema categorizes discovered data elements by sensitivity level and regulatory relevance. Classification rules are pre-seeded via database migrations and can be extended with custom rules.

Sensitivity Levels

Level	Label	Description	Examples
1	Public	Information that can be freely disclosed	Company name, public website content
2	Internal	Non-sensitive business information	Employee directories, internal memos
3	Confidential	Personal data requiring protection	Names, email addresses, phone numbers, addresses
4	Highly Confidential	Sensitive personal data with strict regulatory requirements	Health records, financial data, government IDs
5	Restricted	Special category data requiring highest protection	Biometric data, genetic data, children's data, political opinions

Classification Categories

Category	Data Elements	Sensitivity
Basic Identity	Full name, date of birth, gender, nationality	Confidential
Contact Information	Email, phone, physical address, postal code	Confidential
Government IDs	Aadhaar, PAN, SSN, passport number, driver's license	Highly Confidential
Online Identifiers	IP address, device ID, cookie ID, advertising ID	Confidential

Category	Data Elements	Sensitivity
Banking	Account numbers, IBAN, routing numbers	Highly Confidential
Payment	Credit/debit card numbers, UPI IDs, wallet IDs	Highly Confidential
Transaction	Purchase history, payment amounts, invoices	Confidential
Tax	Tax identification numbers, tax returns	Highly Confidential

Category	Data Elements	Sensitivity
Health	Medical records, prescriptions, health conditions, insurance	Restricted
Biometric	Fingerprints, facial recognition data, iris scans, voice prints	Restricted
Children's Data	Any personal data relating to individuals under 18	Restricted
Political/Religious	Political opinions, religious beliefs, trade union membership	Restricted

Classification rules are pre-seeded through the CU012_ClassificationRulesInitializerChangeUnit migration. Organizations can add custom classification rules through the API to match their specific data taxonomy.
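
The sensitivity levels and category tables above can be modeled as an ordered enum plus a lookup, as in this sketch. The levels and category assignments are copied from the tables. The `Sensitivity` and `highest_sensitivity` names, and the choice to return `PUBLIC` for an empty input, are assumptions made for illustration.

```python
from enum import IntEnum
from typing import Iterable


class Sensitivity(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    HIGHLY_CONFIDENTIAL = 4
    RESTRICTED = 5


CATEGORY_SENSITIVITY = {
    "Basic Identity": Sensitivity.CONFIDENTIAL,
    "Contact Information": Sensitivity.CONFIDENTIAL,
    "Government IDs": Sensitivity.HIGHLY_CONFIDENTIAL,
    "Online Identifiers": Sensitivity.CONFIDENTIAL,
    "Banking": Sensitivity.HIGHLY_CONFIDENTIAL,
    "Payment": Sensitivity.HIGHLY_CONFIDENTIAL,
    "Transaction": Sensitivity.CONFIDENTIAL,
    "Tax": Sensitivity.HIGHLY_CONFIDENTIAL,
    "Health": Sensitivity.RESTRICTED,
    "Biometric": Sensitivity.RESTRICTED,
    "Children's Data": Sensitivity.RESTRICTED,
    "Political/Religious": Sensitivity.RESTRICTED,
}


def highest_sensitivity(categories: Iterable[str]) -> Sensitivity:
    """Return the strictest level among the given categories.

    Raises KeyError for a category that is not in the table.
    """
    levels = [CATEGORY_SENSITIVITY[category] for category in categories]
    return max(levels, default=Sensitivity.PUBLIC)


table_level = highest_sensitivity(["Contact Information", "Payment"])
print(table_level.name, int(table_level))
```

Using an `IntEnum` keeps the levels comparable, so a table holding both contact details and payment data is rated by its most sensitive category.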

Data Mapping & Inventory

Data Mapping extends the discovery capabilities to build a comprehensive inventory of personal data processing across the organization. It provides a structured view of what data is collected, where it is stored, how it flows, and who has access.

Inventory Components

Data Systems

Register all systems that process personal data — databases, SaaS applications, cloud storage, and third-party services. Each system is linked to its data source integration for automated scanning.

Data Flows

Map how personal data moves between systems, departments, and external parties. Track data lineage from collection through processing to storage and deletion.

Processing Activities

Document the purposes for which personal data is processed, aligned with GDPR Article 30 / DPDPA requirements. Link activities to data subjects, categories, and legal bases.

Data Subjects

Categorize the types of individuals whose data is processed (customers, employees, contractors, partners) and link to the specific data elements collected for each category.
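
Data flows between systems form a directed graph, and tracing lineage amounts to walking that graph. The sketch below shows the idea with a breadth-first traversal. The system names and the `downstream_systems` helper are hypothetical and do not reflect Neostra's data model.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_systems(flows: Dict[str, List[str]], origin: str) -> Set[str]:
    """Return every system reachable from `origin` by following data flows."""
    seen: Set[str] = set()
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        for target in flows.get(current, []):
            # Skip systems already visited so cyclic flows terminate.
            if target not in seen and target != origin:
                seen.add(target)
                queue.append(target)
    return seen


# Illustrative flows: each key sends personal data to the listed systems.
flows = {
    "signup-form": ["app_db"],
    "app_db": ["analytics-warehouse", "email-vendor"],
    "analytics-warehouse": ["bi-dashboard"],
}
print(sorted(downstream_systems(flows, "signup-form")))
```

A traversal like this answers questions such as "if a customer's email is collected at signup, which systems and vendors eventually receive it?", which is the information a deletion or access request needs.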

Automated RoPA Reports

Records of Processing Activities (RoPA) are mandatory under GDPR Article 30 and are supported as best practice under DPDPA. Neostra auto-generates RoPA reports by aggregating data from discovery scans, processing activities, and data subject mappings.

Run Discovery Scans

Scan all configured data sources to identify where personal data resides. The scanner automatically classifies found data using the classification schema.

Map Processing Activities

Use the ProcessingActivityController to link discovered data to declared processing purposes, legal bases, and retention periods.

Assign Data Subjects

Categorize discovered data by data subject type (customers, employees, partners) using the DataSubjectController.

Generate RoPA Report

The system aggregates all mapped data into a structured RoPA report that includes: purposes of processing, categories of personal data, categories of data subjects, recipients and transfers, retention periods, and technical/organizational security measures.

# Generate RoPA report
curl -X GET "https://api.neostra.io/v1/reports/ropa?brandId=brand_001&format=pdf" \
  -H "Authorization: Bearer <token>" \
  -o ropa_report.pdf
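
The aggregation behind such a report can be pictured as grouping mapped findings by processing purpose. The following sketch assumes a simple input shape (one dict per mapping) and a hypothetical `RopaEntry` record whose fields mirror the report contents listed above. It does not reflect the actual report generator, and the sample values are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RopaEntry:
    """One row of an Article 30 register (hypothetical shape)."""
    purpose: str
    retention_period: str = ""
    data_categories: List[str] = field(default_factory=list)
    data_subject_categories: List[str] = field(default_factory=list)
    recipients: List[str] = field(default_factory=list)
    security_measures: List[str] = field(default_factory=list)


def _add_unique(values: List[str], item: str) -> None:
    """Append a non-empty item only if it is not already present."""
    if item and item not in values:
        values.append(item)


def build_ropa(mappings: List[dict]) -> List[RopaEntry]:
    """Group mapped findings by processing purpose, de-duplicating values."""
    entries: Dict[str, RopaEntry] = {}
    for mapping in mappings:
        purpose = mapping["purpose"]
        if purpose not in entries:
            entries[purpose] = RopaEntry(purpose, mapping.get("retention_period", ""))
        entry = entries[purpose]
        _add_unique(entry.data_categories, mapping.get("data_category", ""))
        _add_unique(entry.data_subject_categories, mapping.get("data_subject", ""))
        _add_unique(entry.recipients, mapping.get("recipient", ""))
    return list(entries.values())


mappings = [
    {"purpose": "Order fulfilment", "data_category": "Contact Information",
     "data_subject": "customers", "recipient": "shipping-vendor",
     "retention_period": "5 years"},
    {"purpose": "Order fulfilment", "data_category": "Payment",
     "data_subject": "customers"},
    {"purpose": "Payroll", "data_category": "Banking",
     "data_subject": "employees", "retention_period": "7 years"},
]
register = build_ropa(mappings)
for entry in register:
    print(entry.purpose, entry.data_categories, entry.data_subject_categories)
```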

Regulatory Use Cases

Regulation	Data Mapping Use
GDPR	Build RoPA (Article 30), support DPIA, automate DSAR fulfillment
CCPA/CPRA	Track disclosures and sales of personal information, support opt-out requests
DPDPA	Demonstrate purpose limitation, track cross-border data flows, support data principal rights
ISO 27701	Foundation for Privacy Information Management System (PIMS)
