Requirements Document
Introduction
The Nepal Entity Service is a comprehensive Python package designed to manage Nepali public entities (persons, organizations, and locations). The system serves as the foundation for the Nepal Public Accountability Portal, providing structured data management, versioning, and relationship tracking for entities in Nepal's political and administrative landscape.
The entity service hosts a public API that allows anyone to retrieve entity, relationship, and other information from the system. In addition, the service will leverage web scraping capabilities, assisted by GenAI/LLM, to ensure data completeness and accuracy.
Glossary
- Entity: A public person, organization, or location in Nepal's political/administrative system
- Entity_Database: A database storing entity/relationship/other information with versioning support. Currently we provide a file system-based adapter at nes-db/v2
- Nepal_Entity_Service (NES): Core service that loads the entity database and exposes retrieval endpoints. The APIs are read-only
- NES API: FastAPI web service providing entity retrieval endpoints
- Publication_Service: Central orchestration layer managing entity lifecycle, relationships, and versioning using core Pydantic models
- Search_Service: Read-optimized service providing entity and relationship search with filtering and pagination
- Migration_System: Service orchestrating database updates through versioned migration scripts with Git-based tracking
- Scraping_Service: Standalone service for extracting and normalizing data from external sources using GenAI/LLM
- Accountability_Portal: Public-facing web platform for transparency and accountability
- Entity_Type: Classification of entities (person, organization, location, etc.)
- Entity_SubType: Specific classification within entity types (political_party, government_body, etc.)
- Version_System: Audit trail system tracking changes to entities and relationships over time using the Version model
- Relationship_System: System managing connections between entities using the Relationship model
- Scraping_Tools: ML-powered tools for building entity databases from external sources, using various providers including AWS, Google Cloud/Vertex AI, and OpenAI
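As a rough illustration of how these glossary concepts relate, the sketch below maps a few of them onto Pydantic models; all class and field names are illustrative assumptions, not the package's actual schema.

```python
# Illustrative sketch only: class and field names are assumptions,
# not the actual NES schema.
from datetime import datetime
from enum import Enum

from pydantic import BaseModel


class EntityType(str, Enum):
    PERSON = "person"
    ORGANIZATION = "organization"
    LOCATION = "location"


class Entity(BaseModel):
    id: str
    entity_type: EntityType
    entity_subtype: str | None = None   # e.g. "political_party"
    attributes: dict[str, str] = {}


class Version(BaseModel):
    entity_id: str
    created_at: datetime
    author: str        # who made the change
    description: str   # why the change was made
```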
Requirements
Requirement 1
User Story: As a civic tech developer, I want to access comprehensive entity data through a RESTful API, so that I can build accountability applications for Nepali citizens.
Acceptance Criteria
- WHEN a developer requests entity data via API, THE Nepal_Entity_Service SHALL return structured entity information with proper HTTP status codes
- THE Nepal_Entity_Service SHALL support filtering initially by entity type, subtype, and custom attributes, with a more powerful search algorithm to follow later
- THE Nepal_Entity_Service SHALL provide pagination with configurable limits
- THE Nepal_Entity_Service SHALL return entities in standardized JSON format with complete metadata
- THE Nepal_Entity_Service SHALL support CORS for cross-origin requests from web applications
- THE Nepal_Entity_Service SHALL maintain high code quality and run rigorous automated checks, including CI/CD pipelines, black/flake8/isort, code coverage, and unit, component, and e2e tests
- THE Nepal_Entity_Service SHALL serve documentation at the root endpoint / using Markdown files
- THE Nepal_Entity_Service SHALL provide API schema documentation using OpenAPI/Swagger at the /docs endpoint
- THE Nepal_Entity_Service SHALL render Markdown documentation on-the-fly without requiring a separate build step
- THE Nepal_Entity_Service SHALL expose a health check API
- [Future Enhancement] The Nepal_Entity_Service MAY provide a GraphQL API in addition to REST for flexible query capabilities
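To make the criteria above concrete, here is a minimal FastAPI sketch of a read-only entity endpoint with type filtering, pagination, and CORS; route paths, parameter names, and the in-memory store are assumptions for illustration, not the actual NES API.

```python
# Hypothetical sketch of a read-only entity endpoint; routes and
# parameter names are assumptions, not the actual NES API surface.
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(CORSMiddleware, allow_origins=["*"])  # CORS for web apps

ENTITIES = {"p-001": {"id": "p-001", "entity_type": "person"}}  # stand-in store


@app.get("/entities")
def list_entities(entity_type: str | None = None, limit: int = 50, offset: int = 0):
    # Filter by type, then apply offset-based pagination with a configurable limit.
    results = [e for e in ENTITIES.values()
               if entity_type is None or e["entity_type"] == entity_type]
    return {"total": len(results), "items": results[offset:offset + limit]}


@app.get("/entities/{entity_id}")
def get_entity(entity_id: str):
    entity = ENTITIES.get(entity_id)
    if entity is None:
        raise HTTPException(status_code=404, detail="Entity not found")
    return entity
```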
Requirement 2
User Story: As a data maintainer, I want to track all changes to entity information with full audit trails, so that I can ensure data integrity and transparency.
Acceptance Criteria
- WHEN an entity is modified, THE Version_System SHALL create a new version with timestamp and change metadata (when it was changed, by whom, and for what reason)
- THE Version_System SHALL preserve complete snapshots of entity states for historical reference
- WHEN a version is requested, THE Nepal_Entity_Service SHALL return the exact entity state at that point in time
- THE Version_System SHALL provide an interface through the Publication_Service that allows a data maintainer to easily update an entity or a relationship
- THE Version_System SHALL track author attribution for all changes with change descriptions
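A minimal sketch of the snapshot-based versioning the criteria above describe, assuming a dict-based entity representation; the function and field names are hypothetical.

```python
# Hypothetical versioning sketch: every update stores a complete snapshot
# with author and change metadata, so any past state is recoverable.
from datetime import datetime, timezone


def create_version(entity: dict, author: str, description: str) -> dict:
    return {
        "entity_id": entity["id"],
        "snapshot": dict(entity),                              # full copy, not a diff
        "created_at": datetime.now(timezone.utc).isoformat(),  # when
        "author": author,                                      # who changed it
        "description": description,                            # why it changed
    }
```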
Requirement 3
User Story: As a researcher, I want to search and filter entities by multiple criteria, so that I can find specific entities for analysis and reporting.
Acceptance Criteria
- THE Nepal_Entity_Service SHALL support entity lookup by unique identifier with exact matching
- THE Nepal_Entity_Service SHALL filter entities by type (person, organization, location) and subtype classifications
- WHEN attribute filters are provided as JSON, THE Nepal_Entity_Service SHALL apply AND logic for multiple criteria
- THE Nepal_Entity_Service SHALL support offset-based pagination for large result sets
- THE Nepal_Entity_Service SHALL return consistent result ordering for reproducible queries
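One way the AND-combined attribute filtering could work, assuming filters arrive as a JSON object of attribute/value pairs; the query format shown in the comment is an assumption.

```python
# Sketch of AND-combined attribute filtering, e.g. for a query like
# ?attributes={"district": "Kathmandu", "party": "Nepali Congress"}.
# The filter format is an assumption for illustration.
import json


def matches(entity: dict, attribute_filter: str) -> bool:
    criteria = json.loads(attribute_filter)
    attrs = entity.get("attributes", {})
    # All criteria must hold (AND logic); any mismatch rejects the entity.
    return all(attrs.get(key) == value for key, value in criteria.items())
```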
Requirement 4
User Story: As a system integrator, I want to manage relationships between entities, so that I can represent complex organizational and political connections.
Acceptance Criteria
- THE Relationship_System SHALL store directional relationships between any two entities
- THE Relationship_System SHALL support typed relationships with descriptive labels and metadata
- WHEN relationships are queried, THE Nepal_Entity_Service SHALL return complete relationship information including context
- THE Relationship_System SHALL maintain relationship versioning consistent with entity versioning
- THE Nepal_Entity_Service SHALL validate relationship integrity before storage
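A sketch of what a directional, typed relationship record might look like as a Pydantic model; the field names are assumptions consistent with the criteria above.

```python
# Hypothetical directional relationship record; field names are
# assumptions. Direction runs from source_id to target_id.
from pydantic import BaseModel


class Relationship(BaseModel):
    source_id: str                  # e.g. a person entity
    target_id: str                  # e.g. a party entity
    relationship_type: str          # e.g. "member_of"
    label: str | None = None        # human-readable description
    metadata: dict[str, str] = {}   # dates, roles, sources, etc.
```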
Requirement 5
User Story: As a data curator, I want to import entity data from multiple sources using automated scraping tools, so that I can maintain comprehensive and up-to-date entity information.
Acceptance Criteria
- THE Scraping_Tools SHALL extract entity information from Wikipedia, government websites, and election databases
- THE Scraping_Tools SHALL normalize names, dates, and organizational information across Nepali and English sources
- WHEN duplicate entities are detected, THE Scraping_Tools SHALL provide merge recommendations with confidence scores
- THE Scraping_Tools SHALL validate extracted data against entity schema before import
- THE Scraping_Tools SHALL maintain source attribution for all imported data
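As a sketch of merge recommendations with confidence scores, the snippet below uses plain string similarity; the actual GenAI/LLM-assisted pipeline would be more sophisticated, and the threshold is an assumed value.

```python
# Sketch of a duplicate-merge recommendation; the similarity measure
# and the 0.85 threshold are illustrative assumptions.
from difflib import SequenceMatcher


def merge_recommendation(existing_name: str, scraped_name: str) -> dict | None:
    confidence = SequenceMatcher(None, existing_name.lower(),
                                 scraped_name.lower()).ratio()
    if confidence < 0.85:  # assumed threshold for flagging duplicates
        return None
    return {"action": "merge", "confidence": round(confidence, 2)}
```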
Requirement 6
User Story: As a package consumer, I want flexible installation options with optional dependencies, so that I can use only the components I need for my specific use case.
Acceptance Criteria
- THE Nepal_Entity_Service SHALL provide core models and utilities without optional dependencies
- WHERE API functionality is needed, THE Nepal_Entity_Service SHALL install FastAPI and related dependencies
- WHERE scraping functionality is needed, THE Nepal_Entity_Service SHALL install ML and web scraping dependencies
- THE Nepal_Entity_Service SHALL support full installation with all optional features
- THE Nepal_Entity_Service SHALL maintain backward compatibility across minor version updates
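The criteria above suggest the common optional-extras pattern sketched below; the distribution name nepal-entity-service and the api extra name are assumptions.

```python
# Sketch of the optional-dependency pattern: core imports always work,
# API features are guarded. Package and extra names are assumptions.
try:
    from fastapi import FastAPI  # present only when the "api" extra is installed
except ImportError as exc:
    raise ImportError(
        "API features need optional dependencies; install with "
        "pip install 'nepal-entity-service[api]'"
    ) from exc
```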
Requirement 7
User Story: As a Nepali citizen, I want entity information to use authentic Nepali names and cultural context, so that the system remains relevant to Nepal's political and social structures.
Acceptance Criteria
- THE Nepal_Entity_Service SHALL support multilingual names with Nepali and English variants
- THE Nepal_Entity_Service SHALL use authentic Nepali person names in examples and documentation
- THE Nepal_Entity_Service SHALL reference actual Nepali organizations and political parties
- THE Nepal_Entity_Service SHALL maintain location references using Nepal's administrative divisions
- THE Nepal_Entity_Service SHALL preserve cultural context in entity classifications and relationships
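A sketch of multilingual name support with Nepali and English variants (here for Sher Bahadur Deuba); the model shape is an assumption.

```python
# Sketch of multilingual name variants; field names are assumptions.
from pydantic import BaseModel


class Name(BaseModel):
    value: str
    lang: str               # ISO 639-1 code, e.g. "ne" or "en"
    is_primary: bool = False


# Authentic Nepali and English variants for the same person:
names = [
    Name(value="शेर बहादुर देउवा", lang="ne", is_primary=True),
    Name(value="Sher Bahadur Deuba", lang="en"),
]
```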
Requirement 8
User Story: As a system administrator, I want comprehensive data validation and error handling, so that I can maintain data quality and system reliability.
Acceptance Criteria
- THE Nepal_Entity_Service SHALL validate all entity data against Pydantic schemas before storage
- WHEN invalid data is submitted, THE Nepal_Entity_Service SHALL return descriptive error messages with field-level details
- THE Nepal_Entity_Service SHALL enforce required fields including at least one primary name per entity
- THE Nepal_Entity_Service SHALL validate external identifiers and URLs for proper format
- THE Nepal_Entity_Service SHALL handle database errors gracefully with appropriate HTTP status codes
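A sketch of how Pydantic could enforce these criteria: a validator requiring at least one primary name, HttpUrl for URL format checking, and field-level error details available for API responses. The model itself is illustrative.

```python
# Sketch of schema-level validation; the constraints shown are assumptions
# illustrating the criteria above, not the actual NES schema.
from pydantic import BaseModel, HttpUrl, ValidationError, field_validator


class EntityIn(BaseModel):
    names: list[dict]
    source_url: HttpUrl | None = None   # URL format enforced by Pydantic

    @field_validator("names")
    @classmethod
    def at_least_one_primary(cls, names: list[dict]) -> list[dict]:
        if not any(n.get("is_primary") for n in names):
            raise ValueError("entity must have at least one primary name")
        return names


try:
    EntityIn(names=[{"value": "Kathmandu"}])  # no primary name: rejected
except ValidationError as err:
    print(err.errors())  # field-level details for the API error response
```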
Requirement 9
User Story: As a system architect, I want a modular service architecture with clear separation of concerns, so that the system is maintainable, testable, and scalable.
Acceptance Criteria
- THE Nepal_Entity_Service SHALL implement a Publication_Service as the central orchestration layer for write operations
- THE Publication_Service SHALL use Entity, Relationship, Version, and Author Pydantic models for consistent operations
- THE Nepal_Entity_Service SHALL implement a Search_Service as a separate read-optimized service
- THE Search_Service SHALL use the same Entity and Relationship models as the Publication_Service
- THE Nepal_Entity_Service SHALL implement a Migration_System for orchestrating database updates through versioned scripts
- THE Migration_System SHALL use the Publication_Service and Search_Service for data operations
- THE Nepal_Entity_Service SHALL implement a Scraping_Service as a standalone data extraction service
- THE Scraping_Service SHALL NOT directly access the database but SHALL return normalized data for client processing
- THE Nepal_Entity_Service SHALL support CLI, notebook, and API client applications that orchestrate services
- THE Nepal_Entity_Service SHALL maintain clear service boundaries with well-defined interfaces
Requirement 10
User Story: As an API consumer, I want fast read operations with sub-100ms response times, so that I can build responsive applications for end users.
Acceptance Criteria
- THE Nepal_Entity_Service SHALL prioritize read-time latency reduction over write-time performance
- THE Nepal_Entity_Service SHALL implement aggressive caching strategies for frequently accessed entities
- THE Nepal_Entity_Service SHALL use read-optimized file organization and pre-computed indexes
- THE Nepal_Entity_Service SHALL perform expensive operations (validation, normalization, indexing) during write operations
- THE Nepal_Entity_Service SHALL target sub-100ms response times for entity retrieval operations
- THE Nepal_Entity_Service SHALL support efficient pagination with pre-sorted data structures
- THE Nepal_Entity_Service SHALL implement HTTP caching with ETags for unchanged data
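A minimal sketch of ETag-based HTTP caching in FastAPI; hashing the serialized payload is one simple strategy and an assumption, not the mandated design.

```python
# Sketch of ETag handling: clients resending a matching If-None-Match
# header get a 304 and reuse their cached copy.
import hashlib
import json

from fastapi import FastAPI, Request, Response

app = FastAPI()


@app.get("/entities/{entity_id}")
def get_entity(entity_id: str, request: Request, response: Response):
    entity = {"id": entity_id, "entity_type": "person"}  # stand-in lookup
    etag = hashlib.sha256(json.dumps(entity, sort_keys=True).encode()).hexdigest()
    if request.headers.get("if-none-match") == etag:
        return Response(status_code=304)  # unchanged: client cache is valid
    response.headers["ETag"] = etag
    return entity
```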
Requirement 11
User Story: As a data maintainer, I want to manage database evolution through versioned migration folders across two Git repositories, so that I can track, reproduce, and audit how the database content has changed over time.
Acceptance Criteria
- THE Migration_System SHALL support sequential migration folders with numeric prefixes (e.g., 000-initial-locations/, 001-update-location-names/) in the Service_API_Repository
- THE Migration_System SHALL execute migrations in sequential order based on their numeric prefix
- THE Migration_System SHALL store migration folders in the Service_API_Repository and entity data in the Database_Repository
- WHEN a migration is executed, THE Migration_System SHALL commit changes to the Database_Repository with migration metadata in the commit message
- THE Migration_System SHALL include author, date, entities created/updated, and duration in Git commit messages
- THE Migration_System SHALL provide a command to list all available migrations with their metadata
- THE Migration_System SHALL look for a main script file (migrate.py or run.py) within each migration folder to execute
- THE Migration_System SHALL manage the Database_Repository as a Git submodule within the Service_API_Repository
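A sketch of how the Migration_System might discover and order migration folders and locate the main script, per the criteria above; the function names are assumptions.

```python
# Sketch of sequential migration discovery; the directory layout follows
# the NNN-name convention above, but the function names are assumptions.
from pathlib import Path


def discover_migrations(migrations_dir: Path) -> list[Path]:
    """Return migration folders sorted by their numeric prefix."""
    folders = [p for p in migrations_dir.iterdir()
               if p.is_dir() and p.name[:3].isdigit()]
    return sorted(folders, key=lambda p: int(p.name[:3]))


def find_script(folder: Path) -> Path:
    """Locate the main script file within a migration folder."""
    for name in ("migrate.py", "run.py"):
        candidate = folder / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"{folder.name}: no migrate.py or run.py found")
```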
Requirement 12
User Story: As a migration author, I want to organize migrations as folders with supporting files and metadata, so that I can include data files, documentation, and authorship information together in one place.
Acceptance Criteria
- THE Migration_System SHALL support migration folders containing multiple files and subdirectories
- THE Migration_System SHALL allow migrations to include CSV files, Excel spreadsheets, JSON files, and other data formats
- THE Migration_System SHALL require migrations to include README.md files documenting the migration purpose and approach
- THE Migration_System SHALL require migration scripts to define AUTHOR, DATE, and DESCRIPTION metadata constants
- THE Migration_System SHALL allow migration scripts to reference files within their migration folder using relative paths
- THE Migration_System SHALL provide the migration folder path to the migration script at runtime
- THE Migration_System SHALL use migration metadata for Git commit messages when changes are committed to the Database_Repository
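A sketch of a migration script skeleton with the required metadata constants; the run() signature, and how the folder path is passed in at runtime, are assumptions.

```python
# Hypothetical migration script skeleton; only the AUTHOR/DATE/DESCRIPTION
# constants are required by the criteria above, the rest is assumed.
from pathlib import Path

AUTHOR = "contributor-name"
DATE = "2024-01-15"
DESCRIPTION = "Update location names from official gazette"


def run(migration_dir: Path) -> None:
    # Supporting files are referenced relative to the migration folder.
    data_file = migration_dir / "locations.csv"
    print(f"processing {data_file} ({AUTHOR}: {DESCRIPTION})")
```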
Requirement 13
User Story: As a migration script author, I want to write migration scripts that can create, update, and delete entities and relationships, so that I can make any necessary data changes to the database.
Acceptance Criteria
- THE Migration_System SHALL provide a migration script API for creating new entities through the Publication_Service
- THE Migration_System SHALL provide a migration script API for updating existing entities through the Publication_Service
- THE Migration_System SHALL provide a migration script API for creating and updating relationships through the Publication_Service
- THE Migration_System SHALL provide a migration script API for querying existing entities and relationships
- THE Migration_System SHALL ensure all migration operations go through the Publication_Service for proper versioning and validation
- THE Migration_System SHALL provide helper functions for reading CSV, Excel, and JSON files from migration folders
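A sketch of a migration body using a CSV helper and the Publication_Service; the service method names shown are assumptions.

```python
# Sketch of a migration body; publication_service.create_entity is a
# hypothetical method name, not the confirmed Publication_Service API.
import csv
from pathlib import Path


def read_csv(migration_dir: Path, filename: str) -> list[dict]:
    """Helper for reading a CSV data file from the migration folder."""
    with open(migration_dir / filename, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def run(migration_dir: Path, publication_service) -> None:
    for row in read_csv(migration_dir, "districts.csv"):
        # All writes go through the Publication_Service so that versioning
        # and validation apply uniformly.
        publication_service.create_entity(entity_type="location", attributes=row)
```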
Requirement 14
User Story: As a migration script author, I want to access existing services in my migrations, so that I can leverage scraping, search, and publication capabilities for data processing.
Acceptance Criteria
- THE Migration_System SHALL provide migration scripts with access to the Scraping_Service for data extraction and normalization
- THE Migration_System SHALL provide migration scripts with access to the Search_Service for querying existing entities
- THE Migration_System SHALL provide migration scripts with access to the Publication_Service for creating and updating entities
- THE Migration_System SHALL handle service failures gracefully with error reporting
Requirement 15
User Story: As a community member, I want to contribute migrations via GitHub pull requests, so that I can propose data improvements that maintainers can review and merge.
Acceptance Criteria
- THE Migration_System SHALL store migrations in a dedicated directory (migrations/) in the Service_API_Repository
- THE Migration_System SHALL enforce naming conventions for migration folders (NNN-descriptive-name/ format)
- THE Migration_System SHALL provide documentation and templates for creating migration folders
- THE Migration_System SHALL provide a template migration folder structure for contributors to copy
Requirement 16
User Story: As a maintainer, I want to execute migrations and commit changes to Git, so that I can apply community contributions to the database.
Acceptance Criteria
- THE Migration_System SHALL provide a command to execute a specific migration by name
- THE Migration_System SHALL provide a command to execute all migrations in sequential order
- WHEN a migration completes successfully, THE Migration_System SHALL commit changes to the Database_Repository with a formatted commit message
- THE Migration_System SHALL push commits to the remote Database_Repository after successful migration execution
- WHEN a migration fails, THE Migration_System SHALL NOT commit changes to the Database_Repository
- THE Migration_System SHALL log detailed error information including stack traces for failed migrations
- WHEN a migration is executed, THE Migration_System SHALL persist the resulting data snapshot in the Database_Repository so that re-running the migration becomes deterministic
- THE Migration_System SHALL prevent re-execution of already-applied migrations by checking persisted snapshots in the Database_Repository
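A sketch of idempotent execution per the last two criteria: skip migrations whose persisted snapshot marker already exists, and commit only after the script succeeds. The marker-file layout and the executor are assumptions.

```python
# Sketch of commit-on-success with re-execution prevention; the marker
# file layout in the Database_Repository is an assumption.
import subprocess
from pathlib import Path


def run_migration(folder: Path) -> None:
    """Stand-in for the hypothetical executor that runs the folder's script."""
    ...


def apply_migration(folder: Path, db_repo: Path) -> None:
    marker = db_repo / "applied" / f"{folder.name}.json"
    if marker.exists():
        print(f"skipping {folder.name}: already applied")
        return
    run_migration(folder)  # raises on failure, so nothing below is committed
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text("{}")
    subprocess.run(["git", "-C", str(db_repo), "add", "-A"], check=True)
    subprocess.run(
        ["git", "-C", str(db_repo), "commit", "-m", f"migration: {folder.name}"],
        check=True,
    )
```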
Requirement 17
User Story: As a data maintainer, I want to track the provenance of all data changes through Git history, so that I can understand the source and reasoning behind every modification.
Acceptance Criteria
- WHEN a migration creates or updates an entity, THE Publication_Service SHALL record the migration script name as the author
- THE Migration_System SHALL preserve contributor attribution from the migration script metadata in Git commits
- THE Migration_System SHALL link version records to the specific migration that created them through author attribution
- THE Migration_System SHALL maintain a complete audit trail through Git history in the Database_Repository
- THE Migration_System SHALL format Git commit messages with migration metadata including author, date, and statistics
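One possible commit-message format carrying the metadata the criteria above require; the exact layout is an assumption.

```python
# Sketch of commit-message formatting from migration metadata; the layout
# is an assumed convention, not the mandated format.
def format_commit_message(name: str, author: str, date: str,
                          created: int, updated: int, seconds: float) -> str:
    return (
        f"migration: {name}\n\n"
        f"Author: {author}\n"
        f"Date: {date}\n"
        f"Entities created: {created}, updated: {updated}\n"
        f"Duration: {seconds:.1f}s"
    )
```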
Requirement 18
User Story: As a system administrator, I want to efficiently manage the large Database Repository containing 100k-1M files, so that Git operations remain performant and practical.
Acceptance Criteria
- THE Migration_System SHALL support batch commits when migrations create or modify large numbers of files
- THE Migration_System SHALL commit changes in batches of up to 1000 files per commit when appropriate
- THE Migration_System SHALL provide documentation for using shallow clones and sparse checkout with the Database_Repository
- THE Migration_System SHALL configure Git settings optimized for large repositories (core.preloadindex, core.fscache, gc.auto)
- THE Migration_System SHALL handle Git push operations for large commits with appropriate timeouts
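A sketch of batched commits with large-repository Git settings; the batch size and config keys follow the criteria above, while the specific config values, timeout, and invocation details are assumptions.

```python
# Sketch of batch commits for large change sets; config values and the
# timeout are assumptions, only the keys come from the criteria above.
import subprocess
from pathlib import Path

LARGE_REPO_CONFIG = {
    "core.preloadindex": "true",
    "core.fscache": "true",
    "gc.auto": "0",  # assumed value: defer automatic gc on huge repositories
}


def commit_in_batches(repo: Path, files: list[Path], batch_size: int = 1000):
    for key, value in LARGE_REPO_CONFIG.items():
        subprocess.run(["git", "-C", str(repo), "config", key, value], check=True)
    # Commit in batches of up to batch_size files so each commit stays manageable.
    for i in range(0, len(files), batch_size):
        batch = [str(f) for f in files[i:i + batch_size]]
        subprocess.run(["git", "-C", str(repo), "add", *batch], check=True)
        subprocess.run(
            ["git", "-C", str(repo), "commit",
             "-m", f"batch {i // batch_size + 1}: {len(batch)} files"],
            check=True, timeout=600,  # generous timeout for large commits
        )
```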