Data Discovery
Automated scanning and PII detection across your data landscape
Overview
Neostra's Data Discovery module enables organizations to automatically scan, classify, and map personal and sensitive data across diverse data sources. It combines an orchestration service with a distributed scanner worker to detect PII using Microsoft Presidio and custom pattern matching, supporting compliance with GDPR, DPDPA, and other privacy regulations.
Data Discovery supports GDPR Article 30 Records of Processing Activities (RoPA) mapping, connecting discovered data points to processing activities and data subjects.
Architecture
The module is composed of two services that work together:
Data Discovery Boot
Java Spring Boot orchestration service (port 8080) backed by PostgreSQL. Manages scan configuration, integration setup, findings storage, and analytics.
Data Discovery Scanner
Python 3 worker service that polls for scan tasks, connects to configured data sources, runs PII detection, and writes findings back to the database.
Supported Data Sources
Relational Databases
PostgreSQL: Full schema and table scanning.
MySQL: Full schema and table scanning.
NoSQL & Cloud Storage
MongoDB: Collection and document scanning.
AWS S3: Object and file content scanning.
AWS DynamoDB: Table and item scanning.
How Scanning Works
Configure an Integration
Add a data source connection via the Integrations API. Credentials are encrypted with AES before they are stored.
curl -X POST https://api.neostra.io/v1/integrations \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production PostgreSQL",
    "type": "POSTGRESQL",
    "host": "db.example.com",
    "port": 5432,
    "database": "app_db",
    "username": "scanner_user",
    "password": "encrypted-credential"
  }'
Create and Launch a Scan
Create a scan targeting one or more integrations. The orchestration service breaks the scan into individual scan tasks for each table, collection, or bucket.
curl -X POST https://api.neostra.io/v1/scans \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Q1 Full Discovery Scan",
    "integrationIds": ["int_abc123", "int_def456"],
    "scanType": "FULL"
  }'
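The fan-out from one scan into per-target tasks can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the `ScanTask` fields, the `PENDING` status value, and the `plan_scan_tasks` helper are hypothetical, and the real orchestration service is a Java Spring Boot application whose internals are not shown here.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class ScanTask:
    # Hypothetical shape of one unit of work; not the real schema.
    task_id: int
    scan_name: str
    integration_id: str
    target: str  # a table, collection, or bucket name
    status: str = "PENDING"


def plan_scan_tasks(scan_name, targets_by_integration):
    """Expand a scan into one task per (integration, target) pair."""
    ids = count(1)
    tasks = []
    for integration_id, targets in targets_by_integration.items():
        for target in targets:
            tasks.append(ScanTask(next(ids), scan_name, integration_id, target))
    return tasks


tasks = plan_scan_tasks(
    "Q1 Full Discovery Scan",
    {"int_abc123": ["users", "orders"], "int_def456": ["customer-exports"]},
)
for task in tasks:
    print(task.task_id, task.integration_id, task.target, task.status)
```

Keeping each table, collection, or bucket as its own task lets several workers share one scan and lets a failed target be retried without rerunning the rest.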
Worker Polls and Executes
The Python scanner worker continuously polls the scan_tasks table for pending tasks. For each task, it decrypts credentials via crypto.py, connects to the data source, samples data, and runs PII detection through pii_engine.py.
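A polling loop of this kind might look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere. The task table is called `scan_tasks` here, following the Database Schema section; the column names, the status values, and the `claim_next_task` and `run_worker` helpers are assumptions for illustration, not the worker's actual code.

```python
import sqlite3
import time


def claim_next_task(conn):
    """Mark the oldest PENDING task as RUNNING and return it, or None if idle."""
    with conn:  # commit on success, roll back on error
        row = conn.execute(
            "SELECT id, target FROM scan_tasks "
            "WHERE status = 'PENDING' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE scan_tasks SET status = 'RUNNING' WHERE id = ?", (row[0],)
        )
    return {"id": row[0], "target": row[1]}


def run_worker(conn, handle_task, idle_sleep=0.0, max_idle_polls=1):
    """Poll until the queue stays empty for max_idle_polls consecutive polls."""
    processed, idle = [], 0
    while idle < max_idle_polls:
        task = claim_next_task(conn)
        if task is None:
            idle += 1
            time.sleep(idle_sleep)
            continue
        idle = 0
        handle_task(task)  # connect, sample, detect PII (stubbed out below)
        with conn:
            conn.execute(
                "UPDATE scan_tasks SET status = 'COMPLETED' WHERE id = ?",
                (task["id"],),
            )
        processed.append(task["id"])
    return processed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scan_tasks (id INTEGER PRIMARY KEY, target TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO scan_tasks (target, status) VALUES (?, 'PENDING')",
    [("users",), ("orders",)],
)
conn.commit()

done = run_worker(conn, handle_task=lambda task: None)
print(done)
```

With several workers on PostgreSQL, the select-then-update step would normally be made safe against double claims, for example with `SELECT ... FOR UPDATE SKIP LOCKED` inside one transaction.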
PII Detection and Classification
The PII engine uses Microsoft Presidio augmented with custom regex patterns to identify sensitive data. Detected entities are classified and written back as findings with confidence scores.
Review Findings
Access discovered findings through the Findings API or the Neostra dashboard. Findings are linked to data points and attributes, and can be mapped to processing activities for RoPA compliance.
PII Detection Capabilities
The scanner's PII engine combines Microsoft Presidio's NLP-based recognizers with custom regex patterns for region-specific identifiers.
| Entity Type | Description |
|---|---|
| PERSON | Full names and partial name components |
| EMAIL_ADDRESS | Email addresses |
| PHONE_NUMBER | International phone numbers |
| CREDIT_CARD | Credit and debit card numbers |
| IBAN_CODE | International bank account numbers |
| DATE_TIME | Dates of birth and timestamps |
| LOCATION | Physical addresses and place names |
| NRP | Nationality, religion, political affiliation |

| Pattern | Format | Region |
|---|---|---|
| Aadhaar Number | XXXX XXXX XXXX (12 digits) | India |
| PAN Card | ABCDE1234F (10 alphanumeric) | India |
| UPI ID | user@bankhandle | India |
| IP Address | IPv4 and IPv6 formats | Global |
| Email (extended) | Extended email pattern matching | Global |
| Phone (extended) | Regional phone format variations | Global |
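To make the custom pattern layer concrete, here is a minimal standard-library sketch of regex detectors for three of the India-specific formats in the table. The regular expressions, entity names, and fixed confidence scores are illustrative assumptions rather than the patterns shipped in pii_engine.py. In a Presidio-based engine, patterns like these would typically be registered through Presidio's `PatternRecognizer`; the sketch avoids that dependency so it runs on its own.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    entity_type: str
    text: str
    start: int
    end: int
    score: float


# (entity type, compiled pattern, fixed confidence); all values are illustrative.
CUSTOM_PATTERNS = [
    # 12 digits, optionally grouped as XXXX XXXX XXXX
    ("AADHAAR", re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"), 0.6),
    # 5 letters, 4 digits, 1 letter, e.g. ABCDE1234F
    ("PAN", re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"), 0.8),
    # user@bankhandle with no dotted domain after it, so emails are skipped
    ("UPI_ID", re.compile(r"(?<![\w.\-])[\w.\-]{2,}@[A-Za-z]{2,}(?!\.?\w)"), 0.5),
]


def detect(text):
    """Run every custom pattern over the text and return findings by position."""
    findings = []
    for entity_type, pattern, score in CUSTOM_PATTERNS:
        for match in pattern.finditer(text):
            findings.append(
                Finding(entity_type, match.group(), match.start(), match.end(), score)
            )
    return sorted(findings, key=lambda f: f.start)


sample = (
    "PAN ABCDE1234F, Aadhaar 1234 5678 9012, "
    "UPI ravi.k@okaxis, mail a.b@example.com"
)
findings = detect(sample)
for f in findings:
    print(f.entity_type, repr(f.text), f.score)
```

Format-only patterns produce false positives (any 12-digit number looks like an Aadhaar number), which is one reason to attach a confidence score to each finding instead of treating every match as certain.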
Key API Endpoints
# List scans
curl -X GET "https://api.neostra.io/v1/scans" \
  -H "Authorization: Bearer <token>"

# Get a single scan by ID
curl -X GET "https://api.neostra.io/v1/scans/{scanId}" \
  -H "Authorization: Bearer <token>"

# List findings for a scan
curl -X GET "https://api.neostra.io/v1/findings?scanId={scanId}" \
  -H "Authorization: Bearer <token>"

# List data points for a source
curl -X GET "https://api.neostra.io/v1/data-points?sourceId={sourceId}" \
  -H "Authorization: Bearer <token>"

# Get the analytics summary
curl -X GET "https://api.neostra.io/v1/analytics/summary" \
  -H "Authorization: Bearer <token>"
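The same calls can be prepared from Python with the standard library alone. The sketch below only builds the requests and does not send them, because the response schemas are not documented in this section; the helper names (`build_request`, `scan_detail`, `findings_for_scan`) are hypothetical.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://api.neostra.io/v1"


def build_request(path, token, params=None):
    """Build an authenticated GET request for one of the endpoints above."""
    url = f"{BASE_URL}{path}"
    if params:
        url += "?" + urlencode(params)  # percent-encodes query values
    return Request(url, headers={"Authorization": f"Bearer {token}"}, method="GET")


def scan_detail(scan_id, token):
    # Encode the ID so unusual characters cannot alter the path.
    return build_request(f"/scans/{quote(scan_id, safe='')}", token)


def findings_for_scan(scan_id, token):
    return build_request("/findings", token, {"scanId": scan_id})


req = findings_for_scan("scan_123", "<token>")
print(req.get_method(), req.full_url)
```

Passing a prepared request to `urllib.request.urlopen` would perform the call; the scan ID `scan_123` is a made-up placeholder.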
Controller Reference
| Controller | Path Prefix | Purpose |
|---|---|---|
| ScanController | /scans | Create, list, and manage scans |
| ScanTaskController | /scan-tasks | View individual scan task status |
| DataPointController | /data-points | Browse discovered data points |
| FindingsController | /findings | Access PII detection findings |
| IntegrationsController | /integrations | Configure data source connections |
| DataSystemController | /data-systems | Manage registered data systems |
| DataSubjectController | /data-subjects | Map data to data subject categories |
| AttributeController | /attributes | Manage data attributes and classifications |
| AnalyticsController | /analytics | Scan and discovery analytics |
| ProcessingActivityController | /processing-activities | RoPA processing activity mapping |
| VendorController | /vendors | Third-party vendor management |
Database Schema
The orchestration service uses PostgreSQL with schema managed by 16 Liquibase migration files. Core tables include:
| Table | Purpose |
|---|---|
| sources | Registered data source definitions |
| integrations | Connection configurations with encrypted credentials |
| scans | Scan definitions and lifecycle state |
| scan_tasks | Individual scan work items for the worker |
| findings | Detected PII findings with entity type and confidence |
| data_points | Discovered data elements within sources |
| attributes | Classification attributes for data points |
RoPA Mapping
Security
All data source credentials are encrypted with AES before storage. The scanner worker decrypts them at runtime, and only when it connects to a data source. Credentials are never logged or exposed through API responses.
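One common way to keep credentials out of logs and API responses is to mask sensitive fields before an object is serialized. The sketch below illustrates that idea only; the `SENSITIVE_KEYS` set and the `redact` helper are hypothetical, and the actual mechanism Neostra uses is not described here.

```python
# Field names treated as secrets; apart from "password" these are assumptions.
SENSITIVE_KEYS = {"password", "secret", "secretKey", "accessKey", "token"}


def redact(value):
    """Return a copy of value with sensitive fields masked at any depth."""
    if isinstance(value, dict):
        return {
            key: "***" if key in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


integration = {
    "name": "Production PostgreSQL",
    "type": "POSTGRESQL",
    "host": "db.example.com",
    "port": 5432,
    "username": "scanner_user",
    "password": "s3cret",
}
safe = redact(integration)
print(safe)
```

Because `redact` builds a new structure, the original object stays intact for the code path that actually needs the credential.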
Data Classification Schema
Neostra uses a comprehensive data classification schema to categorize discovered data elements by sensitivity level and regulatory relevance. Classification rules are pre-seeded via database migrations and can be extended with custom rules.
Sensitivity Levels
| Level | Label | Description | Examples |
|---|---|---|---|
| 1 | Public | Information that can be freely disclosed | Company name, public website content |
| 2 | Internal | Non-sensitive business information | Employee directories, internal memos |
| 3 | Confidential | Personal data requiring protection | Names, email addresses, phone numbers, addresses |
| 4 | Highly Confidential | Sensitive personal data with strict regulatory requirements | Health records, financial data, government IDs |
| 5 | Restricted | Special category data requiring highest protection | Biometric data, genetic data, children's data, political opinions |
Classification Categories
| Category | Data Elements | Sensitivity |
|---|---|---|
| Basic Identity | Full name, date of birth, gender, nationality | Confidential |
| Contact Information | Email, phone, physical address, postal code | Confidential |
| Government IDs | Aadhaar, PAN, SSN, passport number, driver's license | Highly Confidential |
| Online Identifiers | IP address, device ID, cookie ID, advertising ID | Confidential |

| Category | Data Elements | Sensitivity |
|---|---|---|
| Banking | Account numbers, IBAN, routing numbers | Highly Confidential |
| Payment | Credit/debit card numbers, UPI IDs, wallet IDs | Highly Confidential |
| Transaction | Purchase history, payment amounts, invoices | Confidential |
| Tax | Tax identification numbers, tax returns | Highly Confidential |

| Category | Data Elements | Sensitivity |
|---|---|---|
| Health | Medical records, prescriptions, health conditions, insurance | Restricted |
| Biometric | Fingerprints, facial recognition data, iris scans, voice prints | Restricted |
| Children's Data | Any personal data relating to individuals under 18 | Restricted |
| Political/Religious | Political opinions, religious beliefs, trade union membership | Restricted |
Classification rules are pre-seeded through the CU012_ClassificationRulesInitializerChangeUnit migration. Organizations can add custom classification rules through the API to match their specific data taxonomy.
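The relationship between detected entity types, classification categories, and sensitivity levels can be sketched as a simple lookup. The five levels come from the Sensitivity Levels table; the mapping from entity type to category is an assumption inferred from the data elements listed above, and `DEFAULT_RULES`, `classify`, `highest_sensitivity`, and the `EMPLOYEE_ID` custom rule are hypothetical names.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    HIGHLY_CONFIDENTIAL = 4
    RESTRICTED = 5


# entity type -> (category, sensitivity); an illustrative subset, not the seeded rules.
DEFAULT_RULES = {
    "PERSON": ("Basic Identity", Sensitivity.CONFIDENTIAL),
    "EMAIL_ADDRESS": ("Contact Information", Sensitivity.CONFIDENTIAL),
    "PHONE_NUMBER": ("Contact Information", Sensitivity.CONFIDENTIAL),
    "IP_ADDRESS": ("Online Identifiers", Sensitivity.CONFIDENTIAL),
    "CREDIT_CARD": ("Payment", Sensitivity.HIGHLY_CONFIDENTIAL),
    "IBAN_CODE": ("Banking", Sensitivity.HIGHLY_CONFIDENTIAL),
}


def classify(entity_type, rules=DEFAULT_RULES):
    """Return (category, sensitivity), or None when no rule matches."""
    return rules.get(entity_type)


def highest_sensitivity(entity_types, rules=DEFAULT_RULES):
    """A column holding several entity types takes the strictest level found."""
    levels = [rules[e][1] for e in entity_types if e in rules]
    return max(levels, default=None)


# Extending the defaults with a custom rule leaves the seeded mapping untouched.
custom_rules = {**DEFAULT_RULES, "EMPLOYEE_ID": ("Basic Identity", Sensitivity.CONFIDENTIAL)}

print(classify("IBAN_CODE"))
print(highest_sensitivity(["EMAIL_ADDRESS", "CREDIT_CARD"]))
```

Using an ordered enum makes "strictest level wins" a plain `max`, which is a convenient rule when one column mixes several kinds of data.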
Data Mapping & Inventory
Data Mapping extends the discovery capabilities to build a comprehensive inventory of personal data processing across the organization. It provides a structured view of what data is collected, where it is stored, how it flows, and who has access.
Inventory Components
Data Systems
Register all systems that process personal data — databases, SaaS applications, cloud storage, and third-party services. Each system is linked to its data source integration for automated scanning.
Data Flows
Map how personal data moves between systems, departments, and external parties. Track data lineage from collection through processing to storage and deletion.
Processing Activities
Document the purposes for which personal data is processed, aligned with GDPR Article 30 / DPDPA requirements. Link activities to data subjects, categories, and legal bases.
Data Subjects
Categorize the types of individuals whose data is processed (customers, employees, contractors, partners) and link to the specific data elements collected for each category.
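As a rough illustration of how mapped data flows support lineage questions, the sketch below treats flows as directed edges between systems and lists everything downstream of a starting point. The system names and the `downstream_systems` helper are made up for the example; how Neostra stores flows internally is not covered here.

```python
from collections import defaultdict, deque


def downstream_systems(flows, origin):
    """Breadth-first walk over (source, destination) flows starting at origin."""
    graph = defaultdict(list)
    for source, destination in flows:
        graph[source].append(destination)

    seen = {origin}
    queue = deque([origin])
    reached = []
    while queue:
        current = queue.popleft()
        for nxt in graph[current]:
            if nxt not in seen:  # guards against cycles between systems
                seen.add(nxt)
                reached.append(nxt)
                queue.append(nxt)
    return reached


flows = [
    ("signup-form", "app_db"),
    ("app_db", "analytics-warehouse"),
    ("app_db", "email-vendor"),
    ("analytics-warehouse", "app_db"),  # a feedback loop
]
reached = downstream_systems(flows, "signup-form")
print(reached)
```

A traversal like this answers questions such as "which systems and vendors eventually receive data collected at sign-up", which is useful for both deletion requests and transfer records.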
Automated RoPA Reports
Records of Processing Activities (RoPA) are mandatory under GDPR Article 30 and are treated as a best practice under DPDPA. Neostra generates RoPA reports automatically by aggregating data from discovery scans, processing activities, and data subject mappings.
Run Discovery Scans
Scan all configured data sources to identify where personal data resides. The scanner automatically classifies found data using the classification schema.
Map Processing Activities
Use the ProcessingActivityController to link discovered data to declared processing purposes, legal bases, and retention periods.
Assign Data Subjects
Categorize discovered data by data subject type (customers, employees, partners) using the DataSubjectController.
Generate RoPA Report
The system aggregates all mapped data into a structured RoPA report that includes: purposes of processing, categories of personal data, categories of data subjects, recipients and transfers, retention periods, and technical/organizational security measures.
# Generate RoPA report
curl -X GET "https://api.neostra.io/v1/reports/ropa?brandId=brand_001&format=pdf" \
  -H "Authorization: Bearer <token>" \
  -o ropa_report.pdf
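The aggregation step behind such a report can be sketched as follows. This is a simplified model under stated assumptions: the `ProcessingActivity` fields, the `build_ropa` helper, and the sample values are hypothetical, and security measures are left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ProcessingActivity:
    # Hypothetical fields covering part of an Article 30 record.
    name: str
    purpose: str
    legal_basis: str
    retention_period: str
    data_subjects: list = field(default_factory=list)
    recipients: list = field(default_factory=list)


def build_ropa(activities, mapped_findings):
    """Combine declared activities with (activity name, data category) pairs
    derived from discovery findings into one record per activity."""
    categories = defaultdict(set)
    for activity_name, category in mapped_findings:
        categories[activity_name].add(category)
    return [
        {
            "activity": a.name,
            "purpose": a.purpose,
            "legal_basis": a.legal_basis,
            "categories_of_personal_data": sorted(categories[a.name]),
            "categories_of_data_subjects": sorted(a.data_subjects),
            "recipients": sorted(a.recipients),
            "retention_period": a.retention_period,
        }
        for a in activities
    ]


activities = [
    ProcessingActivity(
        "Order fulfilment", "Deliver purchased goods", "Contract",
        "7 years", ["customers"], ["courier-vendor"],
    )
]
mapped = [
    ("Order fulfilment", "Contact Information"),
    ("Order fulfilment", "Payment"),
    ("Order fulfilment", "Contact Information"),  # duplicate finding
]
report = build_ropa(activities, mapped)
print(report[0])
```

Collecting categories in a set means repeated findings for the same column or table do not inflate the record.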
Regulatory Use Cases
| Regulation | Data Mapping Use |
|---|---|
| GDPR | Build RoPA (Article 30), support DPIA, automate DSAR fulfillment |
| CCPA/CPRA | Track disclosures and sales of personal information, support opt-out requests |
| DPDPA | Demonstrate purpose limitation, track cross-border data flows, support data principal rights |
| ISO 27701 | Foundation for Privacy Information Management System (PIMS) |