Integration / Apache NiFi Interview Questions

1. What is Apache NiFi and what problem does it solve?
2. What is a FlowFile in Apache NiFi?
3. What are the three NiFi repositories and what does each store?
4. What is a Processor in Apache NiFi and what are the main processor categories?
5. What is a Connection in NiFi and how does back-pressure work?
6. What is NiFi Expression Language and where can it be used?
7. What is data provenance in Apache NiFi and how do you access it?
8. What is a Process Group in NiFi and why is it used?
9. What is NiFi Registry and how does it integrate with NiFi?
10. How does NiFi clustering work and what is the role of ZooKeeper?
11. What is a Controller Service in NiFi and how is it different from a Processor?
12. What is the GenerateTableFetch and QueryDatabaseTable pattern for incremental database ingestion?
13. What is the Record-based processing model in NiFi and why is it preferred?
14. What is State Management in NiFi and what types of state scope exist?
15. What is NiFi Site-to-Site (S2S) and when do you use it?
16. What is MiNiFi and how does it relate to Apache NiFi?
17. What is NiFi Parameter Context and how does it differ from Variables?
18. How does NiFi handle security: TLS, authentication, and authorization?
19. What is the NiFi NAR (NiFi Archive) classloading model?
20. What are Reporting Tasks in NiFi and what are common use cases?
21. How do you handle errors and failures in a NiFi flow?
22. What is the SplitText processor and how do you control split behavior?
23. What is the MergeContent processor and how is it used?
24. What is the InvokeHTTP processor and what are key configuration considerations?
25. What is the PublishKafka and ConsumeKafka processor pair and what are key configuration options?
26. What is the ExecuteScript processor and what scripting languages does it support?
27. What is the JoltTransformJSON processor and how do you use it?
28. What is the PutDatabaseRecord processor and how does it differ from ExecuteSQL?
29. What is the ListSFTP and FetchSFTP processor pattern and how does it work?
30. What is the LookupRecord processor used for?
31. What is the PartitionRecord processor and what is a common use case?
32. What is the ConvertRecord processor and how is it used for format conversion?
33. What are the NiFi processor scheduling strategies?
34. What is the difference between EvaluateJsonPath and FlattenJson processors?
35. How does NiFi integrate with Apache Hadoop and HDFS?
36. What is the UpdateAttribute processor and how is its Advanced Mode used?
37. How do you implement deduplication in a NiFi flow?
38. What is the HandleHttpRequest and HandleHttpResponse processor pair used for?
39. How does NiFi achieve guaranteed delivery and what are its durability guarantees?
40. What is the Funnel component in NiFi and when do you use it?
41. What is the difference between GetFile and ListFile + FetchFile processors?
42. How does NiFi support schema evolution in data pipelines?
43. What is the RouteText processor and how does it differ from RouteOnContent?
44. What performance tuning options are available in NiFi and what are common bottleneck patterns?
45. How does NiFi integrate with cloud storage services like Amazon S3?

1. What is Apache NiFi and what problem does it solve?

Apache NiFi is an open-source data integration and dataflow automation platform originally developed by the NSA under the codename NiagaraFiles and donated to the Apache Software Foundation in 2014. It provides a web-based graphical interface for designing, controlling, and monitoring data flows between heterogeneous systems without writing custom integration code for every connection.

The core problem NiFi solves is the complexity of moving data reliably between heterogeneous systems at scale. Data lives in dozens of formats: CSVs on FTP servers, JSON from REST APIs, records in relational databases, messages in Kafka topics, and files in S3. NiFi replaces brittle hand-coded pipelines with a visual, configuration-driven approach where flows are built by connecting processors on a canvas.

NiFi's design priorities are reliability (guaranteed delivery through persistent queues), data provenance (every data movement tracked end-to-end), back-pressure (slows producers when downstream queues are full), and ease of use (drag-and-drop design accessible to non-developers). It is widely used for IoT data ingestion, log aggregation, ETL pipelines, and security data routing.

What US government agency originally developed the software that became Apache NiFi?
Which NiFi capability automatically pauses upstream processors when downstream queues are full?
2. What is a FlowFile in Apache NiFi?

A FlowFile is the fundamental unit of data in Apache NiFi. Every piece of data moving through a flow is represented as a FlowFile with two distinct parts.

Attributes: A map of key-value string pairs acting as metadata. Every FlowFile has core attributes automatically assigned — uuid (globally unique identifier), filename, path, and entryDate. Processors can add, modify, or remove attributes. Attributes are lightweight and kept in memory.

Content: The actual payload bytes. Content is stored in the NiFi content repository on disk, not in heap memory, allowing NiFi to handle FlowFiles of arbitrary size — gigabytes or more — without memory exhaustion. Content is immutable by design: when a processor modifies content, it writes a new version rather than overwriting the original. This immutability underpins the data provenance model.

The separation of attributes from content is architecturally significant. Many routing, filtering, and enrichment operations work purely on attributes without ever reading the payload. RouteOnAttribute, for example, routes FlowFiles entirely on attribute values without touching content.
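The attribute/content split and the write-new-version behavior can be illustrated with a toy model. This is a minimal sketch in plain Python, not NiFi's implementation: the class names, the integer "claim", and the helper functions are all hypothetical.

```python
from dataclasses import dataclass
from types import MappingProxyType
import uuid


@dataclass(frozen=True)
class FlowFile:
    """Toy model: attributes are in-memory metadata, content is a pointer."""
    attributes: MappingProxyType
    content_claim: int  # index into the toy content repository below


class ContentRepository:
    """Append-only store: existing claims are never overwritten."""

    def __init__(self):
        self._claims = []

    def write(self, payload: bytes) -> int:
        self._claims.append(payload)
        return len(self._claims) - 1

    def read(self, claim: int) -> bytes:
        return self._claims[claim]


def create(repo, payload, filename):
    attrs = {"uuid": str(uuid.uuid4()), "filename": filename, "path": "./"}
    return FlowFile(MappingProxyType(attrs), repo.write(payload))


def modify_content(repo, flowfile, new_payload):
    # A "modification" writes a new claim and returns a FlowFile pointing at it.
    return FlowFile(flowfile.attributes, repo.write(new_payload))


def route_on_attribute(flowfile, key, expected):
    # Routing inspects attributes only; the content repository is never read.
    return "matched" if flowfile.attributes.get(key) == expected else "unmatched"


repo = ContentRepository()
original = create(repo, b"a,b,c", "data.csv")
updated = modify_content(repo, original, b"A,B,C")
```

After `modify_content`, the original claim is still readable, which is the property the provenance model depends on.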

Where does NiFi physically store FlowFile content?
What happens to original FlowFile content when a processor modifies it?
3. What are the three NiFi repositories and what does each store?

Apache NiFi uses three on-disk repositories, each serving a distinct durability and query purpose.

FlowFile Repository: Stores the current state of all active FlowFiles — their attributes and a pointer (content claim) to where content lives in the content repository. Uses a Write-Ahead Log (WAL) for durability. On restart after a crash NiFi replays the WAL to recover all in-flight FlowFiles without data loss. Stores metadata only, not content bytes.

Content Repository: Stores actual FlowFile payload bytes organized into content claims within large archive files. Uses an immutable, append-only approach: processors write new content versions rather than overwriting. Old claims are garbage-collected once dereferenced. Can be spread across multiple disk volumes for higher I/O throughput.

Provenance Repository: Records every lifecycle event for every FlowFile: RECEIVE, SEND, FORK, JOIN, CONTENT_MODIFIED, DROP, etc. Creates a complete, searchable audit trail. Typically the largest repository in active deployments. Supports Lucene-based search by FlowFile UUID, filename, processor, time range, and transit URI.

NiFi Repositories
Repository | Stores | Key Feature
FlowFile | Attributes and content pointers | WAL-based crash recovery
Content | Payload bytes | Immutable, multi-disk support
Provenance | Full data lineage events | Searchable audit trail
Which repository enables NiFi to recover in-flight FlowFiles after an unexpected restart?
Which NiFi repository typically consumes the most disk space in high-throughput deployments?
4. What is a Processor in Apache NiFi and what are the main processor categories?

A Processor is the fundamental building block of a NiFi data flow. Each processor performs one specific operation on FlowFiles: fetching from a source, transforming content, routing on attributes, writing to a destination, and so on. Processors are connected via Connections to form a directed dataflow graph on the canvas.

NiFi ships with hundreds of built-in processors in functional categories:

Data Ingestion: GetFile, GetHTTP, GetSFTP, ListenHTTP, ConsumeKafka, GetSQS. These pull or receive data from external sources.

Data Egress: PutFile, PublishKafka, PutS3Object, PostHTTP, PutEmail, PutSFTP. These write or send data to destinations.

Routing and Mediation: RouteOnAttribute, RouteOnContent, SplitText, SplitJson, MergeContent. These split, merge, or direct FlowFiles to different paths.

Database Interaction: ExecuteSQL, PutDatabaseRecord, QueryDatabaseTable, GenerateTableFetch. These read from and write to JDBC-accessible databases.

Attribute Extraction: UpdateAttribute, EvaluateJsonPath, ExtractText, LookupRecord. These read or modify FlowFile attributes.

Transformation: ConvertRecord, JoltTransformJSON, TransformXml, ReplaceText. These change the format or content of payloads.

Custom processors can be developed in Java, packaged as NAR (NiFi Archive) files, and deployed to NiFi's lib directory.

Which processor routes FlowFiles to different downstream connections based on attribute values?
What file format packages custom NiFi processors with classloader-isolated dependencies for deployment?
5. What is a Connection in NiFi and how does back-pressure work?

A Connection in NiFi is a directed, persistent queue that links the output of one processor to the input of another. FlowFiles physically queue here when the downstream processor cannot keep up. Each connection carries one or more relationships from the upstream processor (e.g., success, failure). All processor relationships must be either connected or auto-terminated before the processor can start.

Back-pressure is configured on each connection with two independent thresholds:

Back Pressure Object Threshold: Maximum FlowFile count in the queue. When reached, the upstream processor stops being scheduled. Default is 10,000.

Back Pressure Data Size Threshold: Maximum total byte size of queued content. When exceeded, the upstream processor also pauses. Default is 1 GB.

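Setting both to 0 disables back-pressure, allowing the queue to grow unbounded, so use this cautiously. Connections also support prioritizers that control dequeue order: FirstInFirstOutPrioritizer, NewestFlowFileFirstPrioritizer, OldestFlowFileFirstPrioritizer, and PriorityAttributePrioritizer (which orders FlowFiles by the value of a priority attribute).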
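The two thresholds can be modeled with a small queue class. This is a sketch under simplifying assumptions: the `Connection` class and `upstream_may_run` helper are hypothetical names, and the check is reduced to a single predicate evaluated before the upstream processor would be scheduled.

```python
from collections import deque


class Connection:
    """Toy queue with the two back-pressure thresholds described above."""

    def __init__(self, object_threshold=10_000, size_threshold=1024 ** 3):
        self.object_threshold = object_threshold  # 0 disables the count check
        self.size_threshold = size_threshold      # bytes; 0 disables the size check
        self._queue = deque()
        self._bytes = 0

    def enqueue(self, size_bytes):
        self._queue.append(size_bytes)
        self._bytes += size_bytes

    def dequeue(self):
        size = self._queue.popleft()
        self._bytes -= size
        return size

    def back_pressure_applied(self):
        over_count = 0 < self.object_threshold <= len(self._queue)
        over_size = 0 < self.size_threshold <= self._bytes
        return over_count or over_size


def upstream_may_run(connection):
    # The upstream processor is simply not scheduled while back-pressure is on.
    return not connection.back_pressure_applied()


conn = Connection(object_threshold=3, size_threshold=1000)
for _ in range(3):
    conn.enqueue(10)
```

Because the check happens at scheduling time rather than on every enqueue, a queue in this model can overshoot a threshold: a processor that was allowed to run may emit several FlowFiles before the next check.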

What happens to an upstream processor when a connection's back-pressure object threshold is reached?
What must be true about all processor relationships before a processor can be started in NiFi?
6. What is NiFi Expression Language and where can it be used?

NiFi Expression Language (EL) is a built-in expression engine enabling dynamic evaluation of FlowFile attribute values within processor property configurations. Instead of hardcoded values, you write expressions evaluated at runtime against each FlowFile's attributes using the syntax ${attribute.name}.

EL supports a rich function library:

String functions: ${filename:toUpper()}, ${mime.type:substringAfter('/')}, ${attr:startsWith('prefix')}, ${attr:replace('old','new')}

Math functions: ${count:toNumber():plus(1)}, ${size:multiply(2)}

Date/time: ${now():format('yyyy-MM-dd')}, ${attr:toDate('MM/dd/yyyy')}

Boolean / conditional: ${attr:equals('active'):ifElse('yes','no')}, ${attr:isEmpty():not()}

System: ${hostname()}, ${UUID()}, ${literal('fixed')}

EL can only be used in property fields that display the EL icon (a curved arrow) in the NiFi UI. In non-EL-enabled fields, the expression is treated as a literal string — no error is raised. Common applications include dynamic file paths, Kafka topic routing, SQL query construction, and HTTP endpoint URLs.

What NiFi Expression Language expression converts the filename attribute to uppercase?
What happens if you type a NiFi EL expression into a processor property field that does not support EL?
7. What is data provenance in Apache NiFi and how do you access it?

Data provenance in NiFi is the complete, immutable audit trail of everything that happens to every FlowFile from when it enters until it leaves or is dropped. NiFi records a provenance event automatically for every significant action — no explicit configuration is required.

Provenance event types include: RECEIVE (data enters NiFi), SEND (data sent to external destination), FETCH, CREATE, FORK (FlowFile split into multiple), JOIN (FlowFiles merged), CLONE, CONTENT_MODIFIED, ATTRIBUTES_MODIFIED, DROP, and REPLAY.

Each event records: timestamp, event type, duration, FlowFile UUID, attributes before and after, the component that processed it, SHA-256 content hashes, and transit URI (the actual endpoint URL for SEND/RECEIVE).

Access provenance via the NiFi global menu → Data Provenance. Search by FlowFile UUID, filename, content size, processor, or time range. From a FORK event you can navigate to child FlowFiles; from JOIN to parents — reconstructing complete lineage. The Replay button on RECEIVE or CONTENT_MODIFIED events re-injects that exact FlowFile state back into the flow, invaluable for debugging and reprocessing.
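Lineage reconstruction from FORK and JOIN events amounts to walking parent/child links backwards. The sketch below uses a hypothetical in-memory event log with made-up FlowFile identifiers; it does not call the NiFi provenance API.

```python
from collections import namedtuple

Event = namedtuple("Event", "event_type parents children")

# Hypothetical event log: one file received, split in two, then merged.
events = [
    Event("RECEIVE", [], ["ff-1"]),
    Event("FORK", ["ff-1"], ["ff-2", "ff-3"]),
    Event("JOIN", ["ff-2", "ff-3"], ["ff-4"]),
    Event("SEND", ["ff-4"], []),
]


def ancestors(flowfile_id, log):
    """Walk FORK/JOIN events backwards to find every upstream FlowFile."""
    found = set()
    frontier = [flowfile_id]
    while frontier:
        current = frontier.pop()
        for event in log:
            if current in event.children:
                for parent in event.parents:
                    if parent not in found:
                        found.add(parent)
                        frontier.append(parent)
    return found
```

Calling `ancestors("ff-4", events)` collects the two split children and the originally received FlowFile, which is the same chain the lineage view lets you follow by hand.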

Which provenance event type is recorded when SplitJson divides one FlowFile into multiple child FlowFiles?
What does the Replay feature on a provenance event allow you to do?

8. What is a Process Group in NiFi and why is it used?

A Process Group is a named container that groups processors, connections, funnels, and other components into a single unit on the NiFi canvas. It appears as a rectangle; double-clicking opens it to reveal its internals. Process Groups can be nested, enabling hierarchical flow organization.

Organization: Large flows with hundreds of processors become unmanageable on a flat canvas. Grouping related processors into named groups makes the top-level view a clean architectural diagram.

Reuse: A Process Group can be saved as a versioned flow in NiFi Registry and instantiated multiple times for different environments or data sources.

Access control: NiFi policy-based access control applies at the Process Group level. Different teams can be granted operate, view, or modify rights to specific groups without affecting others.

Input and Output Ports: Data enters and exits through explicitly defined Input Ports and Output Ports — the only official data gateways. In a NiFi cluster, Remote Process Groups (RPGs) use Input Ports of remote NiFi instances as Site-to-Site transfer targets.

How does data officially enter a Process Group in NiFi?
Which NiFi feature enables a Process Group's flow definition to be version-controlled and promoted across environments?
9. What is NiFi Registry and how does it integrate with NiFi?

NiFi Registry is a complementary subproject that provides centralized storage, management, and versioning of NiFi flow definitions. It functions as a version control system for Process Group configurations — similar in concept to Git for source code.

NiFi Registry organizes flows into Buckets — named containers for flow versions, each with independent access policies. The integration workflow is:

1. Add the Registry URL under NiFi Controller Settings → Registry Clients.

2. Right-click a Process Group on the canvas → Version → Start Version Control, selecting a bucket and flow name. NiFi sends the flow JSON to Registry as version 1.

3. After further changes, commit via Version → Commit Local Changes to create a new snapshot.

4. To promote to production, a production NiFi imports the flow from Registry at the target version — ensuring consistent configuration across environments.

5. To roll back, right-click the versioned group → Version → Change Version to select any previous commit.

Registry exposes a REST API, enabling CI/CD pipelines to promote flow versions from dev to staging to prod without manual canvas operations.
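A CI/CD step might look up the newest saved version of a flow before promoting it. The sketch below assumes an unsecured Registry at a hypothetical local address and assumes the versions endpoint returns a JSON array of objects that each carry a `version` number; check both against your Registry's REST API documentation before relying on them.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: an unsecured Registry reachable at this hypothetical address.
REGISTRY_API = "http://localhost:18080/nifi-registry-api"


def versions_url(bucket_id, flow_id, base=REGISTRY_API):
    """URL listing the saved versions of one flow in one bucket."""
    return f"{base}/buckets/{quote(bucket_id)}/flows/{quote(flow_id)}/versions"


def latest_version(snapshots):
    """Pick the highest version number from a list of snapshot metadata dicts."""
    return max(s["version"] for s in snapshots)


def fetch_latest_version(bucket_id, flow_id):
    # Network call; only meaningful against a running Registry.
    with urlopen(versions_url(bucket_id, flow_id)) as response:
        return latest_version(json.load(response))
```

A secured Registry would additionally need client certificates or a token on each request, which this sketch leaves out.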

What is a Bucket in NiFi Registry?
How do you roll back a versioned NiFi Process Group to a previous Registry version?
10. How does NiFi clustering work and what is the role of ZooKeeper?

A NiFi cluster consists of multiple NiFi nodes that all run the same flow and collectively process data in parallel. Every node receives a copy of the flow definition and runs the same processors, but each node independently processes a subset of the FlowFiles — distributing the workload.

NiFi uses an embedded ZooKeeper (or an external ZooKeeper ensemble) for cluster coordination. ZooKeeper serves two critical roles:

Cluster Coordinator election: One node is elected Cluster Coordinator via ZooKeeper leader election. The Coordinator manages cluster state — which nodes are connected, which are heartbeating, and which are being disconnected. If the Coordinator fails, ZooKeeper elects a new one from the remaining nodes automatically.

Primary Node election: Separately, one node is designated as Primary Node. Processors configured with Execution: Primary Node Only run exclusively on the Primary Node. This is essential for processors that must not run concurrently — ListSFTP or QueryDatabaseTable would produce duplicate FlowFiles if all nodes ran them simultaneously.

Nodes send heartbeats to the Cluster Coordinator over NiFi's cluster node protocol rather than Site-to-Site. The NiFi web UI can be accessed on any node, and changes made there are replicated to every node in the cluster.

Why must processors like ListSFTP be configured to run on the Primary Node only in a NiFi cluster?
What role does ZooKeeper play in NiFi clustering?
11. What is a Controller Service in NiFi and how is it different from a Processor?

A Controller Service is a shared, reusable service that processors within a Process Group (or across the entire NiFi instance) can reference in their configurations. Where a Processor performs work on individual FlowFiles, a Controller Service provides a shared capability — a database connection pool, an SSL context, a distributed map cache client, a record reader/writer — that multiple processors reuse simultaneously.

Common Controller Services include:

DBCPConnectionPool: Manages a JDBC connection pool. Processors like ExecuteSQL and PutDatabaseRecord reference this service instead of each opening their own connections, dramatically reducing database connection overhead.

JsonTreeReader / JsonRecordSetWriter: Define how to read and write JSON records. Used by record-aware processors to decouple format handling from logic.

StandardSSLContextService: Provides a shared TLS/SSL context (keystore, truststore) referenced by any processor needing secure connections.

DistributedMapCacheClientService: A client for a DistributedMapCacheServer, giving processors across the flow access to a shared key-value cache. Useful for deduplication and state sharing.

Controller Services have their own lifecycle: they must be enabled before any referencing processor can start, and they cannot be disabled while processors referencing them are running.

Why do multiple processors reference a single DBCPConnectionPool Controller Service rather than each managing their own connections?
What must happen to a Controller Service before any processor that references it can be started?
12. What is the GenerateTableFetch and QueryDatabaseTable pattern for incremental database ingestion?

QueryDatabaseTable and GenerateTableFetch are the two primary patterns for incrementally ingesting data from relational database tables. Each has different performance characteristics and use cases.

QueryDatabaseTable: A simpler, single-processor approach. It issues a SELECT query and uses the configurable Maximum-value Columns property to track the last seen value (typically a timestamp or auto-increment ID). On each execution, it queries only rows where the tracked column exceeds the stored state, which makes the read incremental. By default it produces one FlowFile per execution containing all new rows. Suited for moderate-size increments on single tables.

GenerateTableFetch + ExecuteSQL: A scalable, parallelizable pattern for large tables. GenerateTableFetch queries the database to determine the range of new rows, then generates one SQL SELECT statement per page of rows (sized by the Partition Size property) and writes each statement as the content of its own FlowFile. Each generated FlowFile is routed to ExecuteSQL, which executes the SQL and returns that page's result set. ExecuteSQL can run with multiple concurrent tasks, or on multiple cluster nodes, fetching different pages simultaneously, which improves throughput for large incremental loads.

Both processors use NiFi's State Management to persist the last processed value across restarts, ensuring no rows are missed or re-read on processor restart.
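Both approaches can be imitated against an in-memory SQLite table. This is an illustrative sketch only: the `events` table, the `state` dictionaries standing in for NiFi's State Management, and the exact shape of the generated SQL are assumptions, not what the processors emit.

```python
import sqlite3


def fetch_new_rows(conn, state):
    """QueryDatabaseTable-style read: only rows beyond the stored maximum."""
    last = state.get("id", 0)
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id", (last,)
    ).fetchall()
    if rows:
        state["id"] = rows[-1][0]  # persist the new high-water mark
    return rows


def generate_page_queries(conn, state, page_size):
    """GenerateTableFetch-style planning: one SELECT per page of new rows."""
    last = state.get("id", 0)
    count, new_max = conn.execute(
        "SELECT COUNT(*), MAX(id) FROM events WHERE id > ?", (last,)
    ).fetchone()
    queries = [
        f"SELECT * FROM events WHERE id > {last} AND id <= {new_max} "
        f"ORDER BY id LIMIT {page_size} OFFSET {offset}"
        for offset in range(0, count, page_size)
    ]
    if count:
        state["id"] = new_max
    return queries


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",), ("c",)])

state = {}
first = fetch_new_rows(conn, state)   # all three rows
second = fetch_new_rows(conn, state)  # nothing new since the last run
conn.execute("INSERT INTO events (payload) VALUES ('d')")
third = fetch_new_rows(conn, state)   # only the newly inserted row

plan_state = {}
pages = generate_page_queries(conn, plan_state, page_size=3)
```

Each string in `pages` is independent of the others, which is what lets a downstream executor run them concurrently.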

What is the primary advantage of the GenerateTableFetch + ExecuteSQL pattern over using QueryDatabaseTable alone?
How do QueryDatabaseTable and GenerateTableFetch track the last ingested row to avoid re-reading data across restarts?
13. What is the Record-based processing model in NiFi and why is it preferred?

NiFi's record-based processing model treats FlowFile content as a structured stream of records rather than an opaque blob. A record is one logical row — one JSON object, one CSV line, one Avro record, one database row. Record-aware processors operate on individual records within a FlowFile, enabling format-agnostic transformations.

The model relies on three Controller Service types:

RecordReader: Parses the FlowFile content and produces a stream of records. Implementations include JsonTreeReader, CSVReader, AvroReader, ParquetReader, XMLReader, and GrokReader (for unstructured log parsing).

RecordWriter: Serializes records back to bytes. Implementations include JsonRecordSetWriter, CSVRecordSetWriter, AvroRecordSetWriter, and ParquetRecordSetWriter.

Schema Registry: Optionally provides Avro schemas that readers and writers use to interpret and validate records. NiFi includes an embedded AvroSchemaRegistry.

Key record-aware processors: ConvertRecord (format conversion), QueryRecord (apply SQL SELECT against FlowFile records using Apache Calcite), LookupRecord (enrich records from external sources), UpdateRecord, and PartitionRecord (split into one FlowFile per distinct field value).

The key advantage is format independence: changing from JSON to CSV input requires only swapping the RecordReader Controller Service — no processor logic changes. It also avoids materializing entire FlowFiles into memory by streaming records one at a time.
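The format-independence claim can be demonstrated with interchangeable reader and writer functions. The sketch below is plain Python and only mirrors the idea; the function names are hypothetical and are not NiFi's RecordReader or RecordWriter interfaces.

```python
import csv
import io
import json


def json_reader(content):
    """Yield one record (dict) per element of a JSON array."""
    yield from json.loads(content)


def csv_reader(content):
    """Yield one record (dict) per CSV row, using the header line as field names."""
    yield from csv.DictReader(io.StringIO(content))


def json_writer(records):
    return json.dumps(list(records))


def csv_writer(records):
    # Buffered into a list for brevity; assumes at least one record.
    records = list(records)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


def update_records(content, reader, writer):
    """Processor logic: knows about records, not about the wire format."""
    upper = ({**r, "name": r["name"].upper()} for r in reader(content))
    return writer(upper)


from_json = update_records('[{"id": "1", "name": "ada"}]', json_reader, csv_writer)
from_csv = update_records("id,name\n1,ada\n", csv_reader, json_writer)
```

`update_records` is identical in both calls; only the reader and writer arguments change, which corresponds to swapping Controller Services in a real flow.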

How would you change a record-based NiFi flow from processing JSON input to processing CSV input?
Which record-aware processor allows you to apply SQL SELECT statements against the records within a FlowFile?
14. What is State Management in NiFi and what types of state scope exist?

State Management is NiFi's built-in mechanism for processors and controller services to persistently store small amounts of key-value data that survive processor restarts and NiFi restarts. Without state management, a processor like QueryDatabaseTable would forget the last ingested timestamp every time it was stopped, causing duplicate ingestion.

State has two scopes:

Local State: Scoped to a specific processor on a specific NiFi node. Stored on local disk by the default local state provider (WriteAheadLocalStateProvider), which persists changes to a write-ahead log. Used when each node needs its own independent tracking, for example a ListFile processor tracking which files it has already listed from a local directory on that node.

Cluster State: Scoped to a specific processor but shared across all nodes in a NiFi cluster. Stored in ZooKeeper. Used when only one node should track state for the cluster — for example, QueryDatabaseTable running on the Primary Node needs its last-value state visible to whichever node becomes Primary after a failover.

State is accessed programmatically via the StateManager API. From the NiFi UI, you can view and clear a processor's state by right-clicking the processor → View State.

Why would a processor like QueryDatabaseTable use Cluster scope state rather than Local scope?
Where is Local scope state physically stored in NiFi by default?
15. What is NiFi Site-to-Site (S2S) and when do you use it?

NiFi Site-to-Site (S2S) is a native protocol for transferring FlowFiles directly between NiFi instances — between two standalone NiFi servers, between nodes in a cluster, or between NiFi and MiNiFi agents. It operates over HTTP/HTTPS or raw TCP sockets and provides end-to-end guaranteed delivery, optional compression, and mutual TLS authentication.

S2S works through Remote Process Groups (RPGs) on the sending side. An RPG is configured with the URL of the remote NiFi instance. The sending NiFi queries the remote instance for available Input Ports, then routes FlowFiles to the chosen Input Port.

Key features of S2S:

Automatic load balancing: When the remote NiFi is a cluster, the RPG automatically distributes FlowFiles across all healthy cluster nodes using peer selection.

Back-pressure propagation: If the remote Input Port queue is full, the sender is notified and slows down — extending NiFi's back-pressure model across instance boundaries.

Compression: Optional GZIP compression reduces bandwidth between instances.

Common use cases: MiNiFi agents on edge devices pushing data to a central NiFi hub; cross-datacenter data transfer pipelines; routing data from an ingestion NiFi tier to a processing NiFi tier.

What NiFi component on the sending side is used to push FlowFiles to a remote NiFi instance via Site-to-Site?
How does Site-to-Site handle back-pressure when the remote NiFi instance's Input Port queue is full?
16. What is MiNiFi and how does it relate to Apache NiFi?

MiNiFi (Minimum NiFi) is a lightweight subproject of Apache NiFi designed specifically for edge data collection on resource-constrained devices — IoT sensors, industrial controllers, embedded systems, and edge servers. It implements a subset of NiFi's capabilities with a dramatically reduced footprint: the MiNiFi Java agent runs in under 256 MB heap; the MiNiFi C++ agent runs in tens of megabytes.

MiNiFi flows are created in the full NiFi canvas (as a Process Group) and exported as a YAML or JSON template deployed to MiNiFi agents. The MiNiFi C2 Server (Command and Control) allows centralized management — pushing updated flow templates to agents without manual file deployment.

MiNiFi agents collect data at the source and use NiFi Site-to-Site protocol to ship FlowFiles to a central NiFi hub for further processing, enrichment, and routing. This hub-and-spoke architecture keeps complex transformation logic in the central NiFi where compute resources are available, while keeping the edge agent minimal.

Key differences from full NiFi: MiNiFi has no web UI (flows are pushed from outside), supports fewer processors, has no provenance UI, and is designed for unattended operation in network-constrained or intermittently connected environments.

What protocol do MiNiFi agents use to send collected data to a central Apache NiFi instance?
How are flow configurations deployed to MiNiFi agents at scale?
17. What is NiFi Parameter Context and how does it differ from Variables?

A Parameter Context is a named collection of key-value parameters applied to a Process Group to externalize configuration values from the flow definition. Instead of hardcoding a Kafka broker address or database URL inside processor properties, you reference a parameter with the syntax #{parameter.name} and define its value in the Parameter Context. This makes flows environment-independent: the same flow definition can point to different Kafka clusters in dev, staging, and prod by binding different Parameter Contexts.

Parameter Contexts were introduced in NiFi 1.10 as the replacement for the older Variables feature. Key differences:

Parameter Context vs Variables
Feature | Parameter Context | Variables
Syntax | #{param.name} | ${var.name} (same as EL)
Sensitive values | Supported (masked in UI) | Not supported
Registry versioning | Referenced by name; values stored separately | Embedded in flow template
Inheritance | Bound explicitly to each Process Group; a context can inherit parameters from other contexts in recent releases | Child groups inherit variables from ancestor groups
Status | Current recommended approach | Deprecated

Parameter Contexts integrate with NiFi Registry: the flow definition references the context by name, but the actual parameter values are managed outside the Registry-versioned flow, preventing sensitive credentials from being committed to version control.
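The way one flow definition resolves differently per environment can be sketched as a simple substitution. The property names, parameter names, and broker addresses below are hypothetical, and the regular expression is a simplified stand-in for NiFi's own parameter parsing.

```python
import re

# Simplified matcher for #{name} references (letters, digits, _ . - and spaces).
PARAM_REF = re.compile(r"#\{([A-Za-z0-9_.\- ]+)\}")


def resolve(properties, context):
    """Replace every #{name} reference with the value from the bound context."""
    def substitute(match):
        return context[match.group(1)]  # KeyError if the parameter is undefined
    return {key: PARAM_REF.sub(substitute, value) for key, value in properties.items()}


# One flow definition...
flow_properties = {
    "Kafka Brokers": "#{kafka.brokers}",
    "Topic Name": "orders-#{env}",
}

# ...bound to a different (hypothetical) context per environment.
dev = {"kafka.brokers": "dev-kafka:9092", "env": "dev"}
prod = {"kafka.brokers": "kafka-1:9092,kafka-2:9092", "env": "prod"}

dev_config = resolve(flow_properties, dev)
prod_config = resolve(flow_properties, prod)
```

The `flow_properties` dictionary never changes between environments; only the context passed to `resolve` does.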

What is the NiFi syntax for referencing a value from a Parameter Context in a processor property?
What advantage do Parameter Contexts have over Variables for storing sensitive values like passwords?
18. How does NiFi handle security — TLS, authentication, and authorization?

NiFi provides a comprehensive security model covering transport encryption, user authentication, and fine-grained authorization.

TLS / Transport Encryption: NiFi can be configured to serve its UI and API exclusively over HTTPS. TLS is configured in nifi.properties using a keystore (server certificate) and truststore (CA certificates for client certificate validation). The tls-toolkit utility generates self-signed certificates and keystores for development clusters.

Authentication — NiFi supports multiple Login Identity Providers:

Client Certificate: Mutual TLS — the browser or API client presents a certificate. The Common Name (CN) becomes the NiFi identity.

LDAP / Active Directory: Username and password validated via LdapIdentityProvider.

Kerberos: Single sign-on via Kerberos ticket; the authenticated principal becomes the NiFi identity.

OIDC / OAuth2: Integration with Keycloak, Okta, or Azure AD via OpenID Connect.

Authorization: NiFi uses a policy-based model. Every resource (processor, connection, Process Group, provenance) has access policies — Read, Write, and Operate — that users or groups can be granted or denied. The authorizers.xml file configures the authorization provider. LDAP-based groups can be imported for group-level policy management.

When using client certificate authentication in NiFi, what becomes the user's identity?
What NiFi command-line utility helps generate TLS certificates and keystores for securing a NiFi cluster?
19. What is the NiFi NAR (NiFi Archive) classloading model?

The NAR (NiFi ARchive) is the extension packaging format for NiFi components: processors, controller services, and reporting tasks. A NAR file is similar to a JAR but includes a special manifest that declares its dependencies and classloader parent chain. The NAR classloading model solves the dependency isolation problem — different processors may depend on conflicting versions of the same library.

Each NAR is loaded by its own NARClassLoader. When a processor in NAR A needs to load a class, the classloader first looks in NAR A's own classpath. Only classes not found there are delegated up the parent chain. Most NARs declare nifi-standard-services-api-nar or nifi-framework-api as their parent NAR, but do not share classloaders with sibling NARs.

This means NiFi can simultaneously run a processor using AWS SDK v1.x (in one NAR) and a processor using AWS SDK v2.x (in another NAR) without any classpath conflicts.
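The lookup order described above (own classpath first, then the parent chain, never siblings) can be modeled in a few lines. This is a conceptual sketch in Python, not Java classloading; the NAR names and the string labels standing in for classes are hypothetical.

```python
class NarClassLoader:
    """Toy loader: look in the NAR's own classes first, then ask the parent."""

    def __init__(self, name, classes, parent=None):
        self.name = name
        self.classes = classes  # class name -> implementation label
        self.parent = parent

    def load(self, class_name):
        if class_name in self.classes:
            return self.classes[class_name]
        if self.parent is not None:
            return self.parent.load(class_name)
        raise LookupError(f"{class_name} not visible from {self.name}")


# Hypothetical NAR names and contents, for illustration only.
framework = NarClassLoader("framework-api", {"Processor": "nifi-api"})
nar_a = NarClassLoader("aws-v1-nar", {"S3Client": "sdk-1.x"}, parent=framework)
nar_b = NarClassLoader("aws-v2-nar", {"S3Client": "sdk-2.x"}, parent=framework)
```

Each sibling resolves `S3Client` to its own bundled version, while both resolve `Processor` through the shared parent.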

NARs are deployed by dropping them into NiFi's ./lib directory and restarting NiFi. NiFi 2.x introduces NAR Provider support for dynamically loading NARs at runtime without a restart, fetching from NiFi Registry or a Maven repository.

Why does NiFi use separate NARClassLoaders for each NAR rather than a single shared classloader?
Where are NAR files deployed in a NiFi installation?
20. What are Reporting Tasks in NiFi and what are common use cases?

Reporting Tasks are NiFi extension components that run on a scheduled basis to collect and report metrics, bulletin events, and operational data from the NiFi instance itself — not from FlowFiles. They operate at the NiFi system level rather than the data flow level, making them the primary tool for NiFi self-monitoring and integration with external observability platforms.

Reporting Tasks have their own scheduling (time-driven or CRON) and run independently of any flow. They are configured via Controller Settings → Reporting Tasks in the NiFi UI.

Common built-in Reporting Tasks:

SiteToSiteProvenanceReportingTask: Streams provenance events via Site-to-Site to a remote NiFi instance. Used for SIEM integration and long-term provenance archival outside the local provenance repository.

SiteToSiteBulletinReportingTask: Sends NiFi bulletin (warning and error) events via S2S to a remote NiFi for centralized alerting.

ControllerStatusReportingTask: Logs NiFi instance metrics (active threads, FlowFile counts, queue depths) to the NiFi log file.

PrometheusReportingTask: Exposes NiFi metrics as a Prometheus scrape endpoint, enabling Grafana dashboards for NiFi operational monitoring.

What distinguishes a Reporting Task from a Processor in Apache NiFi?
Which Reporting Task would you configure to expose NiFi operational metrics for scraping by Prometheus?
21. How do you handle errors and failures in a NiFi flow?

NiFi provides several mechanisms for handling failures gracefully, ensuring that failed FlowFiles are not silently lost and that problems are visible to operators.

Failure Relationships: Most processors emit FlowFiles that cannot be processed to a failure relationship. Always connect this relationship to a destination — commonly a LogAttribute processor, a PutFile processor (to archive failed FlowFiles to disk), or a PublishKafka processor (to route failures to an error topic). Never auto-terminate the failure relationship in production without deliberate consideration.

Retry connections: Loop a failure relationship back to the same processor or an earlier processor to implement retry logic. UpdateAttribute can increment a retry counter attribute, and RouteOnAttribute can route FlowFiles with exceeded retry counts to a dead-letter path.

Bulletins: When a processor logs a warning or an error, NiFi emits a bulletin. Bulletins appear as colored indicators on the processor in the canvas and on the Bulletin Board. The Bulletin Level setting accepts DEBUG, INFO, WARN, ERROR, or NONE.

Yield Duration: If a processor fails to acquire a connection or resource, it yields for the configured Yield Duration (default 1 second) before being scheduled again, preventing tight error loops.

Penalty Duration: When a FlowFile is penalized, it is not selected for processing again until the Penalty Duration (default 30 seconds) has elapsed, giving the failing system time to recover before the retry.
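The retry-counter pattern described above can be sketched in plain Python. This only illustrates the routing decision and is not NiFi code: the attribute name retry.count, the limit of 3, and the function name are assumptions made for the example. In a real flow, UpdateAttribute could set retry.count with an expression such as ${retry.count:replaceNull(0):plus(1)} and RouteOnAttribute could test ${retry.count:gt(3)}.

```python
MAX_RETRIES = 3  # hypothetical limit chosen for this sketch


def route_failure(attributes):
    """Mimic UpdateAttribute (increment the counter) followed by RouteOnAttribute."""
    updated = dict(attributes)
    # FlowFile attributes are strings, so the counter is parsed and re-serialized.
    count = int(attributes.get("retry.count", "0")) + 1
    updated["retry.count"] = str(count)
    if count > MAX_RETRIES:
        return "dead-letter", updated
    return "retry", updated
```

A FlowFile that fails repeatedly is looped back three times and is sent to the dead-letter path on the fourth failure.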

What prevents a processor in a persistent error state from consuming all NiFi processing threads in a tight loop?
What is the difference between Yield Duration and Penalty Duration in NiFi?
22. What is the SplitText processor and how do you control split behavior?

SplitText is a NiFi processor that splits a FlowFile containing multiple lines of text into multiple smaller FlowFiles, each containing a configurable number of lines. It is the workhorse for splitting large text, CSV, or log files into processable chunks before parallel processing.

Key configuration properties:

Line Split Count: The number of lines per output FlowFile. Set to 1 for one FlowFile per line; set to 1000 for batches of 1000 lines. A value of 0 means no line-count limit (used with Maximum Fragment Size).

Maximum Fragment Size: Optional maximum byte size per output FlowFile. When the current fragment reaches this size during splitting, a new fragment begins. Useful when downstream systems have size limits.

Header Line Count: Number of header lines to include in every output FlowFile (e.g., 1 for a CSV header row). The header is prepended to every fragment so each fragment is independently parseable as a complete CSV file.

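Header Line Marker Characters: The leading character(s) that identify a line as a header line embedded in the file (for example, #). This is a literal prefix rather than a regex, and it is ignored when Header Line Count is non-zero.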

SplitText sets these attributes on each output FlowFile: fragment.identifier (UUID shared by all fragments from the same original), fragment.index (1-based fragment number), and fragment.count (total number of fragments). These attributes enable MergeContent to reassemble fragments in the correct order.
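To make the effect of Line Split Count and Header Line Count concrete, here is a minimal Python sketch of the splitting behavior described above. It is an approximation, not the processor's implementation: it assumes a positive Line Split Count, ignores Maximum Fragment Size and header markers, and the function name and dictionary layout are hypothetical.

```python
import uuid


def split_text(content, line_split_count, header_line_count=0):
    """Split text into fragments, repeating the header lines in each one."""
    lines = content.splitlines(keepends=True)
    header = lines[:header_line_count]
    body = lines[header_line_count:]
    chunks = [body[i:i + line_split_count]
              for i in range(0, len(body), line_split_count)]
    fragment_id = str(uuid.uuid4())  # shared by all fragments of this input
    return [
        {
            "content": "".join(header + chunk),
            "attributes": {
                "fragment.identifier": fragment_id,
                "fragment.index": str(index),  # 1-based, as described above
                "fragment.count": str(len(chunks)),
            },
        }
        for index, chunk in enumerate(chunks, start=1)
    ]
```

With a one-line CSV header and a split count of 2, a three-row file yields two fragments, each starting with the header row.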

How do you ensure every SplitText output FlowFile for a CSV file includes the header row?
What FlowFile attribute set by SplitText helps MergeContent reassemble fragments in the original order?
23. What is the MergeContent processor and how is it used?

MergeContent is a NiFi processor that combines multiple FlowFiles into a single FlowFile. It is the counterpart to processors like SplitText and SplitJson, enabling a scatter-gather pattern: split a large FlowFile into pieces for parallel processing, then merge the results back together.

MergeContent supports two merge strategies:

Defragment: Reassembles fragments produced by a split operation. It reads the fragment.identifier and fragment.count attributes, waits until all fragments with the same identifier have arrived, and then merges them in fragment.index order. This mode requires the fragment attributes to be present.

Bin-Packing Algorithm: Collects FlowFiles and merges when one of several triggers fires — minimum and maximum FlowFile count, minimum and maximum bin size in bytes, or a maximum wait time. Used for batching many small FlowFiles into a larger one for efficient downstream writing (e.g., batching records before writing to S3 as Parquet).

Output format options include: Binary Concatenation (concatenate raw content), TAR (create a TAR archive), ZIP (create a ZIP archive), and FlowFileStream v3 (NiFi's internal format that preserves all attributes of each constituent FlowFile). The FlowFileStream format is used with UnpackContent to later unpack the merged FlowFile back into individual FlowFiles with attributes intact.
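The Defragment strategy can be pictured with a small Python sketch. It is a simplified model under stated assumptions: FlowFiles are plain dictionaries, content is text joined by simple concatenation, there is no timeout handling, and the class and method names are invented for the example.

```python
from collections import defaultdict


class Defragmenter:
    """Collect fragments per fragment.identifier and merge once all have arrived."""

    def __init__(self):
        self._bins = defaultdict(list)

    def offer(self, flowfile):
        attrs = flowfile["attributes"]
        identifier = attrs["fragment.identifier"]
        bin_ = self._bins[identifier]
        bin_.append(flowfile)
        if len(bin_) < int(attrs["fragment.count"]):
            return None  # still waiting for more fragments
        del self._bins[identifier]
        # Restore the original order regardless of arrival order.
        ordered = sorted(bin_, key=lambda f: int(f["attributes"]["fragment.index"]))
        return "".join(f["content"] for f in ordered)
```

Fragments may arrive in any order; nothing is emitted until the bin holds fragment.count entries.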

In Defragment merge strategy, what attribute pair does MergeContent use to know when all fragments of a split have arrived?
Which merge output format preserves all FlowFile attributes when packing multiple FlowFiles so they can later be individually restored by UnpackContent?
24. What is the InvokeHTTP processor and what are key configuration considerations?

InvokeHTTP is NiFi's most flexible HTTP client processor. It sends HTTP requests to configurable URLs using any HTTP method (GET, POST, PUT, PATCH, DELETE) and routes the response to different relationships based on the HTTP response code. It is the Swiss-army knife for REST API integration in NiFi flows.

Key configuration properties:

HTTP Method: The method to use — can be static or an EL expression like ${http.method} to pick dynamically from an attribute.

Remote URL: The target endpoint URL, EL-supported: https://api.example.com/users/${user.id}.

SSL Context Service: A StandardSSLContextService reference for HTTPS endpoints requiring client certificates or custom CA trust.

Send Message Body: Whether to include the FlowFile content as the HTTP request body. Set to false for GET/DELETE requests.

Request Content-Type: Typically application/json or dynamically from ${mime.type}.

Relationships: Response (2xx responses — the response body becomes a new FlowFile), Original (the original request FlowFile), Retry (5xx responses), No Retry (4xx client errors), Failure (network failures, connection timeouts).
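The routing rules above amount to a small decision function. The Python sketch below models them; the function name is hypothetical, None stands in for a network failure or timeout, and sending every status code outside 2xx and 5xx to No Retry is an assumption of this sketch that goes slightly beyond the 4xx case listed above.

```python
from typing import Optional, Set


def route_response(status_code: Optional[int]) -> Set[str]:
    """Return the relationships that receive a FlowFile for a given outcome."""
    if status_code is None:
        return {"Failure"}  # no HTTP response was received at all
    if 200 <= status_code < 300:
        # The response body becomes a new FlowFile; the request FlowFile goes to Original.
        return {"Response", "Original"}
    if 500 <= status_code < 600:
        return {"Retry"}
    return {"No Retry"}
```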

To which InvokeHTTP relationship does a 404 Not Found HTTP response route?
After a successful InvokeHTTP call, where does the HTTP response body appear in the NiFi flow?
25. What is the PublishKafka and ConsumeKafka processor pair and what are key configuration options?

PublishKafka and ConsumeKafka (and their record-aware variants PublishKafkaRecord and ConsumeKafkaRecord) are NiFi's integration points with Apache Kafka.

ConsumeKafka: Subscribes to one or more Kafka topics using the Kafka consumer group protocol. Key properties include: Kafka Brokers (bootstrap servers), Topic Name(s) (EL-supported), Group ID (consumer group name), Offset Reset (earliest/latest for new consumer groups), Max Poll Records (how many records per poll), and Honor Transactions (whether to respect Kafka transactional producers). By default each Kafka message becomes its own FlowFile whose content is the message value; setting a Message Demarcator bundles multiple messages into a single FlowFile.

PublishKafka: Produces messages to a Kafka topic. Key properties: Topic Name (static or EL-expression like ${kafka.topic} for per-FlowFile routing), Failure Strategy (Route to Failure vs Rollback), Message Key Field (for keyed messages), Delivery Guarantee (Best Effort, Guarantee Single Node Delivery, Guarantee Replicated Delivery, which correspond to Kafka acks=0, acks=1, and acks=all). With PublishKafkaRecord, NiFi reads records from the FlowFile and publishes one Kafka message per record.

Both processors require a Kafka connection Controller Service (Kafka3ConnectionService) in NiFi 2.x, which holds the bootstrap servers together with the SSL and SASL settings.

What PublishKafka Delivery Guarantee setting ensures a message is acknowledged by all in-sync replicas before success is confirmed?
What is the key behavioral difference between ConsumeKafka and ConsumeKafkaRecord?
26. What is the ExecuteScript processor and what scripting languages does it support?

ExecuteScript is NiFi's escape hatch for custom logic that cannot be expressed with built-in processors. It allows you to write arbitrary script code that executes within the NiFi processor lifecycle — accessing incoming FlowFiles, creating new FlowFiles, modifying attributes, reading and writing content, and routing FlowFiles to relationships.

Supported scripting languages (via the Java Scripting Engine API):

  • Groovy (most popular in NiFi community — expressive, JVM-native)
  • Python (via Jython — Python 2.7 dialect; C extensions unavailable)
  • ECMAScript / JavaScript (via Nashorn, which was deprecated in Java 11 and removed in Java 15)
  • Ruby (via JRuby)
  • Lua

In Groovy, a typical ExecuteScript pattern looks like:

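import org.apache.nifi.processor.io.StreamCallback

// Take the next FlowFile from the incoming queue; exit if none is waiting
def flowFile = session.get()
if (!flowFile) return
// Rewrite the content, upper-casing the UTF-8 text
flowFile = session.write(flowFile, { inputStream, outputStream ->
    def text = inputStream.getText('UTF-8')
    // Encode explicitly so the result does not depend on the platform charset
    outputStream.write(text.toUpperCase().getBytes('UTF-8'))
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)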

ExecuteScript has access to: session (ProcessSession), context (ProcessContext), log (ComponentLog), and predefined relationship variables (REL_SUCCESS, REL_FAILURE). The script can use any Java library available on the NiFi classpath or added through the Module Directory property.

Why can Jython (Python in ExecuteScript) not use libraries like NumPy or Pandas?
Which NiFi object gives ExecuteScript access to read and write FlowFile content and transfer FlowFiles to relationships?
27. What is the JoltTransformJSON processor and how do you use it?

JoltTransformJSON is a NiFi processor that transforms JSON content using the JOLT (JSON to JSON transformation) library. JOLT uses declarative JSON specifications (JOLT specs) to describe how an input JSON document should be restructured — renaming fields, changing nesting structure, filtering arrays, computing derived values — without writing imperative code.

JOLT supports several transformation types applied in a chain:

shift: The most common operation. Defines a mapping from input JSON paths to output JSON paths. Input paths can include wildcards (*), array indices ([]), and conditional matching.

default: Adds default values to the output for paths absent in the input.

remove: Removes specified paths from the document.

cardinality: Normalizes fields that may be either single values or arrays into a consistent form, either a single value (ONE) or an array (MANY).

sort: Alphabetically sorts JSON object keys.

A simple shift spec that renames firstName to first_name:

[{
  "operation": "shift",
  "spec": {
    "firstName": "first_name",
    "lastName": "last_name"
  }
}]

The JOLT spec itself supports NiFi EL, allowing dynamic spec construction from FlowFile attributes. The processor includes a Transform Tool in its configuration dialog for testing specs against sample input without running the full flow.
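To show what the shift spec above does to a document, here is a rough Python equivalent. It is only a sketch of the flat-rename case: it handles top-level keys mapped to top-level names, not wildcards, nesting, or arrays, and the function name is made up for the example. It does reflect one property of shift that is easy to miss: fields not mentioned in the spec do not appear in the output.

```python
def shift_flat(document, spec):
    """Rename top-level keys according to spec; unmatched keys are dropped."""
    return {spec[key]: value for key, value in document.items() if key in spec}


spec = {"firstName": "first_name", "lastName": "last_name"}
person = {"firstName": "Ada", "lastName": "Lovelace", "age": 36}
renamed = shift_flat(person, spec)
```

Here renamed contains only first_name and last_name; to carry age through, the spec would need an entry for it as well.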

Which JOLT operation type defines the mapping of input JSON fields to new output JSON paths?
What NiFi feature in the JoltTransformJSON processor lets you test your JOLT spec against sample JSON before running the actual flow?
28. What is the PutDatabaseRecord processor and how does it differ from ExecuteSQL?

PutDatabaseRecord writes structured records from a FlowFile into a relational database table using JDBC. It is the write counterpart to QueryDatabaseTable and ExecuteSQL. Unlike ExecuteSQL — which executes arbitrary SQL statements — PutDatabaseRecord works with the NiFi record model: it reads records from the FlowFile using a RecordReader and generates INSERT, UPSERT, INSERT_IGNORE, UPDATE, or DELETE statements automatically based on the target table schema.

Key configuration properties:

Record Reader: Parses the FlowFile content (JsonTreeReader, CSVReader, AvroReader, etc.).

Statement Type: INSERT, UPDATE, UPSERT (insert or update), INSERT_IGNORE (ignore on key conflict), DELETE, or Use statement.type Attribute (take the statement type from the statement.type FlowFile attribute).

Database Connection Pooling Service: A DBCPConnectionPool reference.

Table Name: Static name or EL expression like ${target.table}.

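Translate Field Names: When enabled, record field names are matched to column names leniently, ignoring case and underscores, so a field such as firstName can map to a column named FIRST_NAME. When disabled, field names must match column names exactly.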

The advantage over ExecuteSQL for writes is that PutDatabaseRecord handles schema mapping automatically — it queries the target table's metadata to determine column types and order, then generates correct parameterized SQL. This avoids hand-crafting INSERT statements and handles type coercion automatically.
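The idea of deriving the statement from table metadata can be illustrated with Python's built-in sqlite3 module. This is an analogy under assumptions, not how NiFi talks to JDBC: the function name is hypothetical, the table name is assumed to be trusted, only INSERT is modelled, and SQLite's PRAGMA table_info stands in for JDBC metadata.

```python
import sqlite3


def put_records(conn, table, records):
    """Build a parameterized INSERT from the table's columns and run it per record."""
    # Column names come from the database, not from the records.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    # Values are bound in column order; fields missing from a record become NULL.
    conn.executemany(sql, [tuple(r.get(c) for c in columns) for r in records])
    return sql
```

Because values are bound as parameters rather than concatenated into the SQL text, field order in the incoming records does not matter and quoting is handled by the driver.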

What must you configure in PutDatabaseRecord to tell NiFi how to parse the FlowFile content before writing to the database?
Which PutDatabaseRecord Statement Type would you use to insert new rows and update existing rows based on a primary key?
29. What is the ListSFTP and FetchSFTP processor pattern and how does it work?

ListSFTP and FetchSFTP implement a two-stage pattern for ingesting files from SFTP servers, separating listing from fetching. This design also appears for S3 (ListS3/FetchS3Object), Azure Blob Storage, HDFS, and local filesystems.

ListSFTP: Connects to the SFTP server and lists files in the configured remote directory (with optional recursion and filename filtering by regex). For each file found, it creates a FlowFile with zero bytes of content but rich attributes: filename, path, sftp.remote.host, sftp.remote.port, file.size, file.lastModifiedTime, etc. Uses NiFi State Management to track already-listed files, emitting only new or modified files on subsequent runs.

FetchSFTP: Receives the listing FlowFiles and for each one downloads the actual file content from the SFTP server using the attributes. The result is a FlowFile whose content is the downloaded file bytes.

Why split listing from fetching? Listing is fast (one directory read) while fetching is slow (one network transfer per file). Separating them lets you run multiple FetchSFTP processors in parallel (by increasing concurrent task count) to download many files simultaneously, while ListSFTP runs on the Primary Node at its own pace.
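The list-then-fetch split can be modelled in a few lines of Python. Everything here is a stand-in: the remote server is a dictionary of filename to (modified time, bytes), the state is an in-memory map rather than NiFi State Management, and the class and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class Lister:
    """Emit a listing entry only for files that are new or modified."""

    def __init__(self):
        self._seen = {}  # filename -> last modified time already listed

    def list_new(self, remote_files):
        listings = []
        for name, (mtime, _content) in sorted(remote_files.items()):
            if self._seen.get(name) != mtime:
                self._seen[name] = mtime
                # Zero content, attributes only.
                listings.append({"filename": name, "file.lastModifiedTime": mtime})
        return listings


def fetch_all(listings, remote_files, concurrent_tasks=4):
    """Download the content for each listing using several worker threads."""
    def fetch(listing):
        name = listing["filename"]
        return name, remote_files[name][1]

    with ThreadPoolExecutor(max_workers=concurrent_tasks) as pool:
        return dict(pool.map(fetch, listings))
```

The lister runs once per cycle and stays cheap, while the number of fetch workers can be raised independently, which mirrors increasing the concurrent task count on FetchSFTP.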

Why does ListSFTP produce FlowFiles with zero content, and what do those FlowFiles contain instead?
How can you speed up bulk file ingestion using the ListSFTP + FetchSFTP pattern?
30. What is the LookupRecord processor used for?

LookupRecord is a record-aware NiFi processor that enriches records within a FlowFile by looking up values from an external source — a database, a distributed map cache, a REST API, or a file-based lookup table — and adding the result as a new field in each record.

LookupRecord works with three components:

RecordReader: Parses the incoming FlowFile into records.

RecordWriter: Serializes enriched records back to the output FlowFile.

LookupService Controller Service: The enrichment data source. Implementations include:

  • SimpleCsvFileLookupService: Looks up values from an in-memory CSV file — useful for small static reference tables.
  • IPLookupService: GeoIP enrichment from a MaxMind database.
  • DatabaseRecordLookupService: Executes a parameterized SQL query against a JDBC source for each lookup.
  • DistributedMapCacheLookupService: Looks up values from a distributed in-memory cache populated by another part of the flow.
  • RestLookupService: Calls a REST API endpoint for each lookup.

Configuration specifies which record fields are the lookup key(s) and which path in the output record receives the looked-up value. When the Routing Strategy is set to route to matched or unmatched, data for which no match is found goes to the unmatched relationship for separate handling; the default strategy routes everything to success.
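As a rough model of the enrichment step, the Python sketch below uses a dictionary in place of a LookupService and returns matched and unmatched records separately. The function and parameter names are assumptions for illustration, and real lookups are configured with RecordPath expressions rather than plain field names.

```python
def lookup_record(records, lookup_table, key_field, result_field):
    """Add the looked-up value to each record, separating records with no match."""
    matched, unmatched = [], []
    for record in records:
        value = lookup_table.get(record.get(key_field))
        if value is None:
            unmatched.append(record)
        else:
            # Copy the record so the input is left untouched.
            matched.append({**record, result_field: value})
    return matched, unmatched


countries = {"US": "United States", "DE": "Germany"}  # stand-in for a CSV lookup file
rows = [{"id": 1, "code": "US"}, {"id": 2, "code": "ZZ"}]
matched, unmatched = lookup_record(rows, countries, "code", "country_name")
```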

Which LookupRecord LookupService would you use to add GeoIP country and city information to network log records?
To which relationship does LookupRecord route records for which no match was found in the LookupService?
31. What is the PartitionRecord processor and what is a common use case?

PartitionRecord is a record-aware NiFi processor that reads records from an input FlowFile and groups them into separate output FlowFiles based on one or more field values. All records sharing the same value for the partition field(s) go to the same output FlowFile; records with different values produce different FlowFiles.

For example, if you have a FlowFile containing 10,000 records with a country field, PartitionRecord produces one FlowFile per distinct country value. Each output FlowFile's country attribute is set to the partition value it contains.

The partitioning key is expressed using NiFi RecordPath syntax: /country for a top-level field, /address/state for a nested field, /tags[0] for an array element. Multiple partition keys can be added as separate User-Defined Properties, producing compound partitions.
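The grouping behavior can be sketched in Python as follows. This is a simplification with hypothetical names: it partitions on a single top-level field given by name, whereas the processor evaluates RecordPath expressions and supports several keys at once.

```python
from collections import defaultdict


def partition_records(records, field):
    """Group records by one field; each group becomes one output FlowFile."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record)
    # The partition value is also exposed as an attribute on each output.
    return [
        {"attributes": {field: str(value)}, "records": group}
        for value, group in groups.items()
    ]
```

Three records with country values US, DE, and US produce two outputs: one carrying both US records and one carrying the DE record.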

Common use cases:

  • Routing records to different Kafka topics by type: Partition by /event.type, then UpdateAttribute sets kafka.topic from the partition attribute, enabling PublishKafka to route each FlowFile to its appropriate topic.
  • Partitioned file writes to object storage: Partition by /date and /region to write into Hive-compatible partition paths in S3 (e.g., dt=2024-01-15/region=us-east/data.parquet).
  • Database routing: Partition by customer ID to route records to different database shards.

What syntax does PartitionRecord use to specify the field by which to partition records?
After PartitionRecord splits records by /event.type, how can you route each output FlowFile to a different Kafka topic matching its event type?
32. What is the ConvertRecord processor and how is it used for format conversion?

ConvertRecord is a NiFi processor that converts FlowFile content from one data format to another using the NiFi record model. Its sole job is to read records using one format and write them out in another. The conversion logic lives entirely in the RecordReader and RecordWriter Controller Services — not in ConvertRecord itself.

Configuration is minimal: just a Record Reader and a Record Writer. Examples:

  • CSV → JSON: CSVReader + JsonRecordSetWriter
  • JSON → Avro: JsonTreeReader + AvroRecordSetWriter
  • Avro → Parquet: AvroReader + ParquetRecordSetWriter
  • JSON → CSV: JsonTreeReader + CSVRecordSetWriter

The schema used for conversion can come from the data itself (the schema embedded in self-describing formats such as Avro and Parquet, or a schema inferred from JSON or CSV content), be declared inline in the reader/writer configuration, or be fetched from a Schema Registry. Using a Schema Registry ensures output always conforms to a validated, versioned schema.

ConvertRecord performs streaming conversion — records are read and written one at a time without loading the entire FlowFile into memory, making it suitable for very large files. The output FlowFile's mime.type attribute is automatically updated by the writer to reflect the new format.
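The reader/writer split and the streaming behavior can be imitated with Python's standard library. In this sketch csv.DictReader plays the role of the Record Reader and a JSON-lines writer plays the role of the Record Writer; the function name and the choice of JSON lines as output are assumptions for the example.

```python
import csv
import io
import json


def convert_csv_to_json_lines(reader_stream, writer_stream):
    """Read one CSV record at a time and write it out as one JSON line."""
    count = 0
    for record in csv.DictReader(reader_stream):
        # Only the current record is held in memory.
        writer_stream.write(json.dumps(record) + "\n")
        count += 1
    return count


source = io.StringIO("id,name\n1,Ada\n2,Grace\n")
target = io.StringIO()
converted = convert_csv_to_json_lines(source, target)
```

All values come out as strings here because CSV carries no type information; in NiFi, typed output is what an explicit or registry-provided schema adds.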

To convert a FlowFile from CSV format to Avro format using ConvertRecord, what must you configure?
Why is ConvertRecord preferred over scripted format conversion for large files?
33. What are the NiFi processor scheduling strategies?

NiFi provides two scheduling strategies controlling when a processor's onTrigger method is invoked:

Timer Driven (default): The processor is scheduled to run at a fixed time interval. The Run Schedule property sets the interval — 0 sec means run as fast as possible (yielding only for yield duration), 10 sec means run once every 10 seconds. Most processors use timer-driven scheduling and run in a dedicated thread pool.

CRON Driven: The processor runs according to a CRON expression, enabling time-of-day or day-of-week scheduling. Uses Quartz CRON syntax, which starts with a seconds field: 0 0 * * * ? runs at the top of every hour; 0 30 9 ? * MON-FRI runs at 09:30 on weekdays. CRON-driven processors are executed by the same Timer Driven thread pool.

Additional scheduling parameters:

Concurrent Tasks: How many threads can execute the processor simultaneously. Increasing concurrent tasks enables parallel FlowFile processing. Default is 1 for most processors. Processors annotated with @TriggerSerially are not thread-safe and are limited to a single concurrent task; @SupportsBatching is unrelated to thread-safety and instead enables the Run Duration setting.

Execution: All Nodes (default) or Primary Node Only. Primary Node Only is required for processors that would cause duplication if run on multiple cluster nodes simultaneously (ListSFTP, QueryDatabaseTable).

A processor needs to run every weekday at 6:00 AM. Which scheduling strategy should you use?
What does increasing the Concurrent Tasks setting on a processor do?
34. What is the difference between EvaluateJsonPath and FlattenJson processors?

EvaluateJsonPath and FlattenJson both work with JSON content but serve fundamentally different purposes.

EvaluateJsonPath extracts specific values from a JSON payload using JSONPath expressions and writes those values either to FlowFile attributes or to the FlowFile content. It is a targeted extraction tool — you specify exactly which fields you want and where they go. Configuration requires one or more User-Defined Properties, each mapping a JSONPath expression to a destination attribute name. The JSON content itself is typically not modified when writing to attributes.

FlattenJson takes a nested JSON object and flattens its entire structure into a single-level key-value object, using configurable separator characters to compose the flattened key names. For example:

Input:  {"user": {"address": {"city": "Austin"}}}
Output: {"user.address.city": "Austin"}

FlattenJson is used to normalize deeply nested JSON into flat structures suitable for systems that expect flat schemas (relational databases, Elasticsearch, CSV). It operates on the entire document, producing a new content FlowFile with the flattened JSON.

The choice: use EvaluateJsonPath when you need specific field values as attributes for routing or enrichment; use FlattenJson when you need to restructure the entire document hierarchy into a flat representation for downstream storage.
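The flattening shown above is a short recursive walk. The Python sketch below reproduces it for nested objects; the function name is hypothetical, and arrays are left untouched here, which may differ from how the processor treats them depending on its flatten mode.

```python
def flatten(document, separator=".", prefix=""):
    """Collapse nested objects into a single level, joining keys with separator."""
    flat = {}
    for key, value in document.items():
        path = f"{prefix}{separator}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, separator, path))
        else:
            flat[path] = value
    return flat


flattened = flatten({"user": {"address": {"city": "Austin"}}})
```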

When would you choose EvaluateJsonPath over FlattenJson for working with JSON FlowFile content?
Given nested JSON {"order": {"id": 42, "total": 99.50}}, what does FlattenJson produce using a dot separator?
35. How does NiFi integrate with Apache Hadoop and HDFS?

NiFi provides a suite of processors for reading from and writing to HDFS (Hadoop Distributed File System) and integrates with the broader Hadoop ecosystem including Hive and HBase. The integration uses standard Hadoop client libraries and respects Hadoop authentication (Simple or Kerberos).

Key HDFS processors:

PutHDFS: Writes FlowFile content to a specified HDFS path. Supports configuring block size, replication factor, buffer size, compression codec (GZIP, Snappy, LZO), and write conflict resolution. The output path supports NiFi EL for dynamic path construction.

FetchHDFS: Reads a file from HDFS into a FlowFile. The path is taken from the path and filename attributes set by ListHDFS.

ListHDFS: Lists files in an HDFS directory recursively, producing one FlowFile per file with metadata attributes. Uses State Management to track already-listed files.

Hadoop configuration is provided via the Hadoop Configuration Resources property pointing to hdfs-site.xml and core-site.xml files. For Kerberos-secured clusters, configure the Kerberos Principal and Kerberos Keytab properties or reference a KerberosCredentialsService Controller Service.

NiFi also integrates with Apache Hive via SelectHiveQL (query execution) and PutHiveStreaming (transactional ACID inserts), and with HBase via PutHBaseCell, GetHBase, and ScanHBase processors.

What configuration files must be referenced in NiFi HDFS processors to connect to a Hadoop cluster?
Which Hive integration processor supports transactional ACID inserts into ORC-format Hive tables?
36. What is the UpdateAttribute processor and how is its Advanced Mode used?

UpdateAttribute is one of the most versatile NiFi processors. In basic mode, each User-Defined Property becomes an attribute name, and its value (which can use NiFi Expression Language) becomes the new attribute value. Adding a property processed.timestamp with value ${now():format('yyyy-MM-dd HH:mm:ss')} stamps every FlowFile with a timestamp attribute.

Advanced Mode (accessible via the Advanced button in the processor configuration dialog) adds rule-based conditional attribute modification. You define rules, each with:

  • A set of conditions — boolean EL expressions that must all be true for the rule to apply (e.g., ${mime.type:equals('application/json')})
  • A set of actions — attribute name/value pairs applied when the rule matches

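Multiple rules are evaluated in order; the FlowFile Policy setting controls what happens when more than one rule matches. With use clone, each matching rule is applied to its own copy of the FlowFile; with use original, the actions of every matching rule are applied to the original FlowFile in order. This allows a single UpdateAttribute processor to implement complex attribute-setting logic without chaining multiple processors or writing scripts.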

Common basic-mode uses include: computing dynamic file paths from multiple attributes, incrementing retry counters, setting MIME types, building database connection parameters from environment attributes, and normalizing attribute naming conventions.

In UpdateAttribute Advanced Mode, what is the role of a Rule Condition?
What is a common use of NiFi Expression Language in UpdateAttribute properties (basic mode)?
37. How do you implement deduplication in a NiFi flow?

Deduplication — preventing the same data from being processed more than once — is a common requirement. NiFi provides several mechanisms depending on scale, performance requirements, and what constitutes a duplicate.

DetectDuplicate processor: The simplest approach. It uses a Distributed Map Cache (a DistributedMapCacheClientService backed by a DistributedMapCacheServer) to store seen identifiers. For each incoming FlowFile, it evaluates a configurable Cache Entry Identifier (NiFi EL expression, e.g., ${filename} or ${sha256.hash}) and checks if that key already exists. Duplicates route to the duplicate relationship; new items go to non-duplicate. Cache entries can have a TTL (Age Off Duration) to forget old identifiers.

Content-based hashing: Use the CryptographicHashContent processor (the replacement for the older, deprecated HashContent) to compute a SHA-256 hash of the FlowFile content and store it as an attribute, then use DetectDuplicate against the hash. This detects content-identical duplicates regardless of filename or metadata.

Database deduplication: For exactly-once semantics, track processed identifiers in a database table using PutDatabaseRecord with INSERT_IGNORE and a unique constraint on the identifier column. The database's ACID guarantees enforce uniqueness even under concurrent insertion.
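The content-hash approach combined with an age-off period can be sketched in Python. This is a single-process illustration with invented names: a dictionary replaces the distributed map cache, and the current time is passed in explicitly so the age-off behavior is easy to follow.

```python
import hashlib


class DuplicateDetector:
    """Route content as duplicate if its hash was seen within the age-off window."""

    def __init__(self, age_off_seconds):
        self._age_off = age_off_seconds
        self._seen = {}  # SHA-256 hex digest -> time the entry was stored

    def check(self, content, now):
        identifier = hashlib.sha256(content).hexdigest()
        stored_at = self._seen.get(identifier)
        if stored_at is not None and now - stored_at < self._age_off:
            return "duplicate"
        # New or expired entry: remember it and let the FlowFile through.
        self._seen[identifier] = now
        return "non-duplicate"
```

Identical bytes are caught regardless of filename, and an entry older than the age-off window is treated as new again.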

What storage mechanism does the DetectDuplicate processor use to persist seen identifiers across NiFi restarts?
How would you detect duplicate FlowFiles based on identical content regardless of filename or metadata?
38. What is the HandleHttpRequest and HandleHttpResponse processor pair used for?

HandleHttpRequest and HandleHttpResponse implement an HTTP server inside NiFi, enabling NiFi to act as a web service endpoint that receives HTTP requests from external clients, processes them as FlowFiles through the flow, and returns HTTP responses.

HandleHttpRequest: Starts an embedded Jetty HTTP server listening on a configured port and path. When a client sends an HTTP request, the processor creates a FlowFile from the request body and enriches it with request attributes such as http.method, http.request.url, http.query.string, one http.headers.XXX attribute per request header, and an http.context.identifier that uniquely links this request to its response.

HandleHttpResponse: Receives a processed FlowFile and sends its content back to the waiting HTTP client as the response body. It uses the http.context.identifier attribute to match the response to the original request. The HTTP status code can be set statically or from a FlowFile attribute.

The use case is building NiFi-powered APIs or webhook receivers. A webhook receiver, for example, can accept JSON payloads from a GitHub push event, validate the HMAC signature using ExecuteScript, enrich the payload via UpdateAttribute, and store it in a database via PutDatabaseRecord — all while responding 202 Accepted to the GitHub server.

A StandardHttpContextMap Controller Service is required to hold open HTTP connections between HandleHttpRequest and HandleHttpResponse. The maximum open connections setting determines concurrency.

What Controller Service is required to maintain the open HTTP connection between HandleHttpRequest and HandleHttpResponse?
What FlowFile attribute does HandleHttpResponse use to match the processed FlowFile back to the correct waiting HTTP client?
39. How does NiFi achieve guaranteed delivery and what are its durability guarantees?

NiFi's architecture is specifically designed to provide guaranteed delivery — once data enters NiFi, it will not be silently lost due to hardware failure, software crash, or network issues. Several design decisions work together to achieve this.

Persistent connection queues: FlowFiles in connection queues are tracked in the FlowFile Repository's Write-Ahead Log. On restart after a crash, NiFi replays the WAL to restore every FlowFile to its exact queue position before the crash.

Immutable content repository: FlowFile content is written to disk before the FlowFile is considered active. Content is never deleted until all references are removed. Even if NiFi crashes mid-write, the old content version is preserved.

Transactional processor sessions: Each processor invocation runs within a ProcessSession. All changes within a session are committed atomically. If the processor throws an exception before committing, the session is rolled back: all FlowFiles return to their input queues as if nothing happened.

Site-to-Site acknowledgment: Data transferred via S2S is acknowledged by the receiver before the sender removes FlowFiles from its queues. If the receiver crashes before acknowledgment, the sender retains the data and retries.

The practical implication: NiFi provides at-least-once delivery by default. Under failure and retry scenarios, a FlowFile may be processed more than once. Achieving exactly-once requires idempotent downstream systems or explicit deduplication logic in the flow.
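The commit-or-rollback behavior of transactional processor sessions described above can be modelled with a toy session in Python. This is a conceptual sketch only: the Session class, its methods, and run_once are simplified stand-ins, queues are in-memory deques, and nothing is persisted to a repository.

```python
from collections import deque


class Session:
    """Buffer all changes until commit; restore the input queue on rollback."""

    def __init__(self, input_queue, output_queue):
        self._input, self._output = input_queue, output_queue
        self._taken, self._pending = [], []

    def get(self):
        if not self._input:
            return None
        flowfile = self._input.popleft()
        self._taken.append(flowfile)
        return flowfile

    def transfer(self, flowfile):
        self._pending.append(flowfile)  # not visible downstream yet

    def commit(self):
        self._output.extend(self._pending)
        self._taken, self._pending = [], []

    def rollback(self):
        # Put the taken FlowFiles back at the front in their original order.
        self._input.extendleft(reversed(self._taken))
        self._taken, self._pending = [], []


def run_once(processor, input_queue, output_queue):
    session = Session(input_queue, output_queue)
    try:
        processor(session)
        session.commit()
    except Exception:
        session.rollback()
```

If the processor raises after transferring a FlowFile, the output queue stays empty and the input queue looks exactly as it did before the invocation, which is why a retry can process the same FlowFile again and delivery is at-least-once.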

What delivery guarantee does NiFi provide by default, and what scenario can violate exactly-once semantics?
What happens to a ProcessSession's changes if the processor throws an uncaught exception before calling session.commit()?
40. What is the Funnel component in NiFi and when do you use it?

A Funnel is a NiFi canvas component that merges FlowFiles from multiple incoming connections into a single outgoing connection. It has no processing logic — it is purely a flow topology tool for consolidating multiple data paths into one without introducing a processor's overhead, scheduling, or threading.

Common use cases:

Convergence after parallel processing: If you fan out a FlowFile through RouteOnAttribute to three different processing paths and all paths should eventually flow to the same downstream processor, a Funnel cleanly merges the three paths into one without requiring the downstream processor to have multiple input connections from different sources.

Canvas layout organization: Multiple processors feeding the same downstream path can first converge at a Funnel, reducing the number of long crossing connections on the canvas and improving readability.

Failure consolidation: Multiple processors' failure relationships can all connect to a single Funnel, which routes to a centralized error-handling path. This avoids duplicating error-handling connections from every processor to the same destination.

A Funnel differs from a Processor in that it has no scheduling, no concurrent task setting, no relationships (besides its one output), and no configuration properties. FlowFiles pass through it transparently and immediately. It also differs from a connection in that it can accept connections from any number of sources.

What is the maximum number of upstream connections a Funnel can accept?
How does a Funnel differ from a processor when merging multiple data paths?
41. What is the difference between GetFile and ListFile + FetchFile processors?

Both approaches ingest files from a local filesystem, but they differ in architecture, parallelism, and operational characteristics.

GetFile: The older, simpler, single-processor approach. It lists a directory, picks up files matching the configured filter, reads each file's content into a FlowFile, and by default deletes the source file (the Keep Source File property controls this). Critical limitation: listing and reading are coupled in one processor, so work cannot be distributed safely. When several tasks or nodes see the same directory, two of them can attempt to pick up the same file, because the filesystem operations it relies on are not atomic on all filesystems or NFS mounts. On a cluster reading from a shared directory, GetFile should run on the Primary Node only.

ListFile + FetchFile: The modern, recommended approach. ListFile scans the directory and emits one FlowFile per found file containing only metadata attributes (filename, path, size, last modified). It uses State Management to track already-listed files. FetchFile then reads the actual file content from disk. This separation enables:

  • Parallel fetching: multiple concurrent FetchFile tasks read files simultaneously
  • Clear separation of concerns: listing happens once; fetching can be retried independently per file
  • Works correctly in clustered NiFi without Primary Node restriction on the fetch step
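The list-then-fetch split can be sketched in plain Python. This is a simplified model, not NiFi code: the `state` dict and its `last_mtime_ns` key are hypothetical stand-ins for State Management, and tracking only the newest modification time ignores edge cases (such as files sharing a timestamp) that the real processor has to handle.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def list_new_files(directory, state):
    """Return metadata for files modified after the stored timestamp."""
    last_seen = state.get("last_mtime_ns", -1)
    listing = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        mtime = os.stat(path).st_mtime_ns
        if mtime > last_seen:
            listing.append({"filename": name, "path": path, "mtime_ns": mtime})
    if listing:
        # Remember the newest timestamp so the next listing skips these files.
        state["last_mtime_ns"] = max(item["mtime_ns"] for item in listing)
    return listing


def fetch_file(item):
    """Read the content for one listed file (the fetch step)."""
    with open(item["path"], "rb") as handle:
        return item["filename"], handle.read()


def ingest(directory, state, workers=4):
    """List once, then fetch the listed files in parallel."""
    listing = list_new_files(directory, state)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch_file, listing))


# Demo on a throwaway directory: the second pass finds nothing new.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(name)
    state = {}
    first_pass = ingest(tmp, state)
    second_pass = ingest(tmp, state)
```

Because the listing carries only metadata, a failed fetch can be retried for one file without re-listing the directory, which is the property the two-processor design is after.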

GetFile remains appropriate for simple single-node use cases. For new development and cluster deployments, ListFile + FetchFile is preferred.

Why is GetFile not recommended for use with multiple concurrent tasks in a NiFi cluster?
What mechanism does ListFile use to avoid re-listing files it has already seen on subsequent runs?
42. How does NiFi support schema evolution in data pipelines?

Schema evolution — handling changes in data structure without breaking pipelines — is supported primarily through the record model and schema registry integration.

NiFi's record-aware processors use Schema Access Strategies on RecordReaders and RecordWriters:

Infer Schema: The reader analyzes the data and derives a schema at runtime. Handles schema evolution naturally because the inferred schema always matches the current data shape, but may be inconsistent for heterogeneous datasets.

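Use Schema from Registry: The reader looks up the schema by name in a Schema Registry controller service (Confluent Schema Registry or NiFi's built-in AvroSchemaRegistry). With a versioned registry such as Confluent's, schema evolution is governed by Avro compatibility rules: BACKWARD (new schema can read old data), FORWARD (old schema can read new data), or FULL compatibility (both). The Confluent registry enforces the configured rule on schema registration, preventing incompatible changes. AvroSchemaRegistry only holds schema text in the controller service's properties and performs no compatibility checks, so compatibility has to be reviewed by hand when a schema there is edited.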

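For handling missing fields: if an input record lacks a field present in the output schema, the writer uses the field's configured default value. For extra fields not in the output schema: record writers generally drop them silently. When extra fields should instead be treated as an error, schema validation with the ValidateRecord processor applies: its Allow Extra Fields property controls whether such records are accepted or routed to the invalid relationship.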
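The missing-field and extra-field behavior can be illustrated with a small Python sketch. It is not NiFi or Avro code: the schema is modeled as a plain dict of field name to default, the `MISSING` sentinel and the `schema_v2` fields are made up for the example, and it models only the permissive case in which extra fields are dropped rather than rejected.

```python
MISSING = object()  # sentinel: the field has no default value


def project_record(record, schema):
    """Write `record` using `schema`, a dict of field name -> default.

    Fields absent from the record take the schema default; fields not in
    the schema are dropped. A missing field without a default is an error.
    """
    output = {}
    for field, default in schema.items():
        if field in record:
            output[field] = record[field]
        elif default is not MISSING:
            output[field] = default
        else:
            raise ValueError(f"field {field!r} is missing and has no default")
    return output


# Version 2 of a hypothetical schema adds "country" with a default.
schema_v2 = {"id": MISSING, "name": MISSING, "country": "unknown"}

old_record = {"id": 1, "name": "Ada"}  # written before "country" existed
extra_record = {"id": 2, "name": "Lin", "country": "NZ", "debug": True}

upgraded = project_record(old_record, schema_v2)
trimmed = project_record(extra_record, schema_v2)
```

The default on the new field is what lets records written before the change still be read with the new schema. Without it, `project_record` raises for every old record.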

What Avro schema compatibility mode ensures that data written with a new schema can still be read by consumers using an older schema version?
When a RecordWriter encounters a field in the input record that is not present in the output schema, what does NiFi Schema Validation control?
43. What is the RouteText processor and how does it differ from RouteOnContent?

RouteText and RouteOnContent are both NiFi processors that route FlowFiles based on patterns found in content, but they operate at fundamentally different granularities.

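RouteOnContent: Evaluates the FlowFile content as a whole against configured regex patterns. The Match Requirement property decides whether a pattern must match the entire content or only be found somewhere within it, and only the portion of the content that fits within the Content Buffer Size is examined. When a pattern matches, the FlowFile is routed to the corresponding relationship. Routing is whole-FlowFile: the content is never split, and a FlowFile that satisfies no pattern goes to unmatched. Use case: route JSON or XML FlowFiles based on whether they contain specific keywords or signatures.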

RouteText: Operates line-by-line on the FlowFile content. For each line, it evaluates conditions and routes that individual line to a matching relationship. The output is multiple FlowFiles — one per matching relationship — containing only the lines that matched it. Lines matching no condition go to the unmatched relationship. Use case: splitting a multi-type log file where each line is an independent event, routing ERROR lines to an error handler and INFO lines to a general store.

RouteText supports multiple matching strategies: Starts With, Ends With, Contains, Equals, Matches Regular Expression, Contains Regular Expression, and Satisfies Expression (NiFi EL evaluated against a text line). The Routing Strategy property controls where matching lines go: to a separate relationship for each matching property, or to a single matched relationship when the line satisfies all conditions or any condition.
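Line-level routing can be sketched in plain Python. This is a model of the behavior described above rather than the processor's implementation: conditions are ordinary predicates, the relationship names `error` and `info` are chosen for the example, and a line is copied to every relationship whose condition it satisfies.

```python
def route_lines(content, conditions):
    """Split text into per-relationship outputs, one line at a time.

    `conditions` maps a relationship name to a predicate on a single line.
    Lines accepted by no predicate are collected under "unmatched".
    """
    routed = {}
    for line in content.splitlines():
        targets = [name for name, accepts in conditions.items() if accepts(line)]
        for name in targets or ["unmatched"]:
            routed.setdefault(name, []).append(line)
    # Each relationship receives one output holding only its own lines.
    return {name: "\n".join(lines) for name, lines in routed.items()}


log = "\n".join([
    "INFO service started",
    "ERROR disk full",
    "DEBUG cache warm",
    "INFO request served",
])

outputs = route_lines(log, {
    "error": lambda line: line.startswith("ERROR"),
    "info": lambda line: line.startswith("INFO"),
})
```

One input produces three outputs here: the ERROR line, the two INFO lines, and the DEBUG line under unmatched. A whole-content router would instead send the entire log to a single destination.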

A log FlowFile contains both INFO and ERROR lines mixed together. Which processor correctly separates them into two different FlowFiles?
To which RouteText relationship do lines that match none of the configured conditions go?
44. What performance tuning options are available in NiFi and what are common bottleneck patterns?

NiFi performance tuning operates at several levels: JVM heap, thread pool sizes, repository configuration, and per-processor settings.

JVM Heap (bootstrap.conf): The java.arg.2=-Xms and java.arg.3=-Xmx settings control heap. NiFi's content repository keeps content on disk, so heap is primarily consumed by FlowFile attributes (in memory), in-flight processing, and Lucene indexes in the provenance repository. Typical production deployments use 4–16 GB. Insufficient heap causes frequent GC pauses and OutOfMemoryErrors.

Content Repository Partitioning: Splitting the content repository across multiple physical disks increases I/O parallelism, often the primary bottleneck for high-throughput flows.

Common bottleneck patterns:

  • Single processor bottleneck: One slow processor with a growing upstream queue. Solution: increase concurrent tasks on that processor.
  • Provenance repository lag: Slow provenance writes causing processor stalls. Solution: use WriteAheadProvenanceRepository instead of PersistentProvenanceRepository, or reduce provenance event detail.
  • Back-pressure chain: All processors paused because the terminal writer (PutS3Object, PutDatabaseRecord) is slow. Solution: scale the terminal processor or add a MergeContent batch before it.
  • Small FlowFile overhead: Millions of tiny FlowFiles causing high FlowFile repository overhead. Solution: use MergeContent to batch before terminal writes.
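The batching fix for small FlowFiles can be sketched as follows. This is a simplified Python model, not MergeContent itself: the `max_entries` and `max_bytes` parameters are illustrative, payloads are joined with a newline, and time-based bin closing is left out.

```python
def merge_into_batches(payloads, max_entries, max_bytes):
    """Group small byte payloads into batches bounded by count and size.

    A batch is closed as soon as adding the next payload would exceed
    either limit. A payload larger than max_bytes gets a batch of its own.
    Delimiter bytes are not counted toward max_bytes.
    """
    batches, current, current_bytes = [], [], 0
    for payload in payloads:
        too_many = len(current) + 1 > max_entries
        too_big = current_bytes + len(payload) > max_bytes
        if current and (too_many or too_big):
            batches.append(b"\n".join(current))
            current, current_bytes = [], 0
        current.append(payload)
        current_bytes += len(payload)
    if current:
        batches.append(b"\n".join(current))
    return batches


# Ten tiny records are written as four batches instead of ten objects.
records = [b'{"n": %d}' % n for n in range(10)]
batches = merge_into_batches(records, max_entries=3, max_bytes=1024)
```

Fewer, larger outputs mean fewer repository entries and fewer calls to the slow terminal system. The cost is extra latency while a batch fills.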

A NiFi flow processes thousands of tiny JSON records as individual FlowFiles. What is the most common performance problem and its solution?
What nifi.properties configuration can reduce provenance write latency in high-throughput deployments?
45. How does NiFi integrate with cloud storage services like Amazon S3?

NiFi provides a comprehensive set of processors for integrating with Amazon S3 available in the nifi-aws-nar extension.

ListS3: Lists objects in an S3 bucket (filtered by prefix and last modified date). Produces one FlowFile per S3 object, with no content and with attributes such as s3.bucket, filename (which holds the object key), s3.etag, s3.length, s3.lastModified, s3.storeClass, and s3.version. Uses State Management to track listed objects and emit only new or changed ones on subsequent runs.

FetchS3Object: Downloads the S3 object identified by its Bucket and Object Key properties, which default to ${s3.bucket} and ${filename}, the attributes set by ListS3 on the incoming FlowFile. Used downstream from ListS3.

PutS3Object: Uploads FlowFile content to S3. Supports configuring bucket, key (EL-supported: ${now():format('yyyy/MM/dd')}/${filename}), storage class (STANDARD, INTELLIGENT_TIERING, GLACIER), server-side encryption (AES-256, AWS:KMS), and multipart upload threshold for large files.
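For readers less familiar with Expression Language, the key expression above produces a date-partitioned path. The Python sketch below mirrors that result and adds a simple threshold check. The function names are made up for illustration, and the strict greater-than comparison at the threshold boundary is an assumption, not a statement of how PutS3Object decides.

```python
from datetime import datetime, timezone


def build_object_key(filename, now):
    """Return a date-partitioned key in the form yyyy/MM/dd/filename."""
    return f"{now.strftime('%Y/%m/%d')}/{filename}"


def use_multipart(size_bytes, threshold_bytes):
    """Decide whether an upload should be split into parts.

    Assumption: uploads strictly larger than the threshold use multipart.
    """
    return size_bytes > threshold_bytes


MIB = 1024 * 1024
when = datetime(2024, 3, 9, 12, 30, tzinfo=timezone.utc)
key = build_object_key("report.csv", when)
```

Partitioning keys by date keeps each day's objects under one prefix, so a downstream listing by prefix can pick up a single day.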

DeleteS3Object: Deletes an S3 object. TagS3Object: Adds or updates S3 object tags.

Authentication is handled via an AWSCredentialsProviderControllerService — supporting static credentials, environment variable resolution, EC2 instance profile (IAM role), and AWS credentials file. Using IAM roles via instance profiles is the recommended approach for deployments on EC2 or EKS, avoiding static credential management entirely.

What is the recommended AWS authentication approach for NiFi running on EC2, avoiding static credential configuration?
What PutS3Object feature automatically switches to multipart upload for large files?