Apache NiFi Interview Questions

1. What is Apache NiFi and what problem does it solve?
2. What is a FlowFile in Apache NiFi?
3. What are the three NiFi repositories and what does each store?
4. What is a Processor in Apache NiFi and what are the main processor categories?
5. What is a Connection in NiFi and how does back-pressure work?
6. What is NiFi Expression Language and where can it be used?
7. What is data provenance in Apache NiFi and how do you access it?
8. What is a Process Group in NiFi and why is it used?
9. What is NiFi Registry and how does it integrate with NiFi?
10. How does NiFi clustering work and what is the role of ZooKeeper?
11. What is a Controller Service in NiFi and how is it different from a Processor?
12. What is the GenerateTableFetch and QueryDatabaseTable pattern for incremental database ingestion?
13. What is the Record-based processing model in NiFi and why is it preferred?
14. What is State Management in NiFi and what types of state scope exist?
15. What is NiFi Site-to-Site (S2S) and when do you use it?
16. What is NiFi and how does it relate to Apache NiFi?
17. What is NiFi Parameter Context and how does it differ from Variables?
18. How does NiFi handle security: TLS, authentication, and authorization?
19. What is the NiFi NAR (NiFi Archive) classloading model?
20. What are Reporting Tasks in NiFi and what are common use cases?
21. How do you handle errors and failures in a NiFi flow?
22. What is the SplitText processor and how do you control split behavior?
23. What is the MergeContent processor and how is it used?
24. What is the InvokeHTTP processor and what are key configuration considerations?
25. What is the PublishKafka and ConsumeKafka processor pair and what are key configuration options?
26. What is the ExecuteScript processor and what scripting languages does it support?
27. What is the JoltTransformJSON processor and how do you use it?
28. What is the PutDatabaseRecord processor and how does it differ from ExecuteSQL?
29. What is the ListSFTP and FetchSFTP processor pattern and how does it work?
30. What is the LookupRecord processor used for?
31. What is the PartitionRecord processor and what is a common use case?
32. What is the ConvertRecord processor and how is it used for format conversion?
33. What are the NiFi processor scheduling strategies?
34. What is the difference between EvaluateJsonPath and FlattenJson processors?
35. How does NiFi integrate with Apache Hadoop and HDFS?
36. What is the UpdateAttribute processor and how is its Advanced Mode used?
37. How do you implement deduplication in a NiFi flow?
38. What is the HandleHttpRequest and HandleHttpResponse processor pair used for?
39. How does NiFi achieve guaranteed delivery and what are its durability guarantees?
40. What is the Funnel component in NiFi and when do you use it?
41. What is the difference between GetFile and ListFile + FetchFile processors?
42. How does NiFi support schema evolution in data pipelines?
43. What is the RouteText processor and how does it differ from RouteOnContent?
44. What performance tuning options are available in NiFi and what are common bottleneck patterns?
45. How does NiFi integrate with cloud storage services like Amazon S3?

1. What is Apache NiFi and what problem does it solve?

Apache NiFi is an open-source data integration and dataflow automation platform originally developed by the NSA under the name NiagaraFiles and donated to the Apache Software Foundation in 2014. It provides a web-based graphical interface for designing, controlling, and monitoring data flows between heterogeneous systems without writing custom integration code for every connection.

The core problem NiFi solves is the complexity of moving data reliably between heterogeneous systems at scale. Data lives in dozens of formats: CSVs on FTP servers, JSON from REST APIs, records in relational databases, messages in Kafka topics, and files in S3. NiFi replaces brittle hand-coded pipelines with a visual, configuration-driven approach where flows are built by connecting processors on a canvas.

NiFi's design priorities are reliability (guaranteed delivery through persistent queues), data provenance (every data movement tracked end-to-end), back-pressure (slows producers when downstream queues are full), and ease of use (drag-and-drop design accessible to non-developers). It is widely used for IoT data ingestion, log aggregation, ETL pipelines, and security data routing.

What US government agency originally developed the software that became Apache NiFi?

CIA

✗ Try again.

NSA

✓ Well done. NiFi was built by the NSA as NiagaraFiles before being donated to Apache.

DARPA

✗ Try again.

Which NiFi capability automatically pauses upstream processors when downstream queues are full?

Data provenance

✗ Try again — provenance tracks lineage, not flow rate.

Back-pressure

✓ Well done — back-pressure pauses upstream processors when connection thresholds are exceeded.

Processor scheduling

✗ Try again — scheduling controls timing, not queue-driven throttling.

2. What is a FlowFile in Apache NiFi?

A FlowFile is the fundamental unit of data in Apache NiFi. Every piece of data moving through a flow is represented as a FlowFile with two distinct parts.

Attributes: A map of key-value string pairs acting as metadata. Every FlowFile has core attributes automatically assigned — uuid (globally unique identifier), filename, path, and entryDate. Processors can add, modify, or remove attributes. Attributes are lightweight and kept in memory.

Content: The actual payload bytes. Content is stored in the NiFi content repository on disk, not in heap memory, allowing NiFi to handle FlowFiles of arbitrary size — gigabytes or more — without memory exhaustion. Content is immutable by design: when a processor modifies content, it writes a new version rather than overwriting the original. This immutability underpins the data provenance model.

The separation of attributes from content is architecturally significant. Many routing, filtering, and enrichment operations work purely on attributes without ever reading the payload. RouteOnAttribute, for example, routes FlowFiles entirely on attribute values without touching content.
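The attribute/content split can be sketched in a few lines of Python. This is only an illustrative model, not NiFi's implementation: `content_repo`, `write_claim`, `FlowFile`, `route_on_attribute`, and `replace_content` are hypothetical names, and an in-memory dict stands in for the on-disk content repository.

```python
import uuid
from dataclasses import dataclass
from typing import Dict

# Toy stand-in for the content repository: claim id -> stored bytes.
content_repo: Dict[str, bytes] = {}


def write_claim(data: bytes) -> str:
    """Store bytes under a fresh claim id; existing claims are never overwritten."""
    claim_id = str(uuid.uuid4())
    content_repo[claim_id] = data
    return claim_id


@dataclass(frozen=True)
class FlowFile:
    attributes: Dict[str, str]  # lightweight metadata
    claim_id: str               # pointer to the content, not the content itself


def route_on_attribute(ff: FlowFile) -> str:
    """Choose a relationship from attributes alone; the payload is never read."""
    return "json" if ff.attributes.get("mime.type") == "application/json" else "unmatched"


def replace_content(ff: FlowFile, new_data: bytes) -> FlowFile:
    """Modifying content writes a new claim and leaves the old one untouched."""
    return FlowFile(dict(ff.attributes), write_claim(new_data))


original = FlowFile(
    {"filename": "a.json", "mime.type": "application/json"},
    write_claim(b'{"a": 1}'),
)
updated = replace_content(original, b'{"a": 2}')
```

After this runs, both claims are still present in `content_repo`, which mirrors how the original version stays available until nothing references it.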

Where does NiFi physically store FlowFile content?

In JVM heap memory

✗ Try again — storing large payloads in heap would cause OutOfMemoryErrors.

On disk in the content repository

✓ Well done — the content repository persists FlowFile bytes to disk, enabling arbitrarily large payloads.

In a distributed cluster cache

✗ Try again — NiFi does not use a shared distributed cache for FlowFile content.

What happens to original FlowFile content when a processor modifies it?

A new version is written; the original is eventually garbage-collected

✓ Well done — FlowFile content immutability means modifications create new content claims, preserving provenance integrity.

The original is overwritten in place

✗ Try again — in-place overwrite would break the provenance chain; content is immutable.

Both versions are retained permanently

✗ Try again — old content claims are garbage-collected once no FlowFile references them.

3. What are the three NiFi repositories and what does each store?

Apache NiFi uses three on-disk repositories, each serving a distinct durability and query purpose.

FlowFile Repository: Stores the current state of all active FlowFiles — their attributes and a pointer (content claim) to where content lives in the content repository. Uses a Write-Ahead Log (WAL) for durability. On restart after a crash NiFi replays the WAL to recover all in-flight FlowFiles without data loss. Stores metadata only, not content bytes.

Content Repository: Stores actual FlowFile payload bytes organized into content claims within large archive files. Uses an immutable, append-only approach: processors write new content versions rather than overwriting. Old claims are garbage-collected once dereferenced. Can be spread across multiple disk volumes for higher I/O throughput.

Provenance Repository: Records every lifecycle event for every FlowFile: RECEIVE, SEND, FORK, JOIN, CONTENT_MODIFIED, DROP, etc. Creates a complete, searchable audit trail. Typically the largest repository in active deployments. Supports Lucene-based search by FlowFile UUID, filename, processor, time range, and transit URI.

NiFi Repositories
Repository	Stores	Key Feature
FlowFile	Attributes and content pointers	WAL-based crash recovery
Content	Payload bytes	Immutable, multi-disk support
Provenance	Full data lineage events	Searchable audit trail
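The crash-recovery idea behind the FlowFile repository can be illustrated with a toy write-ahead log. The sketch below assumes a made-up JSON log format (`op`, `uuid`, `attributes`, `claim`); NiFi's real on-disk format is different, and `replay` is a hypothetical helper.

```python
import json
from typing import Dict, List


def replay(wal_lines: List[str]) -> Dict[str, dict]:
    """Rebuild the set of active FlowFiles by applying log entries in order."""
    active: Dict[str, dict] = {}
    for line in wal_lines:
        entry = json.loads(line)
        if entry["op"] == "upsert":
            # Only metadata and a content pointer are logged, never payload bytes.
            active[entry["uuid"]] = {
                "attributes": entry["attributes"],
                "claim": entry["claim"],
            }
        elif entry["op"] == "drop":
            active.pop(entry["uuid"], None)
    return active


wal = [
    json.dumps({"op": "upsert", "uuid": "ff-1", "attributes": {"filename": "a.csv"}, "claim": "claim-10"}),
    json.dumps({"op": "upsert", "uuid": "ff-2", "attributes": {"filename": "b.csv"}, "claim": "claim-11"}),
    json.dumps({"op": "drop", "uuid": "ff-1"}),
]
recovered = replay(wal)
```

Replaying the three entries leaves only `ff-2` active, with its pointer into the content repository intact.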

Which repository enables NiFi to recover in-flight FlowFiles after an unexpected restart?

Content Repository

✗ Try again — the content repository stores bytes; recovery depends on the FlowFile repository WAL.

FlowFile Repository

✓ Well done — the FlowFile repository's Write-Ahead Log is replayed on startup to restore all active FlowFiles.

Provenance Repository

✗ Try again — provenance is an audit trail, not a recovery mechanism.

Which NiFi repository typically consumes the most disk space in high-throughput deployments?

FlowFile Repository

✗ Try again — it stores only metadata pointers, not content bytes.

Provenance Repository

✓ Well done — provenance records every event for every FlowFile, accumulating large indexed datasets over time.

Content Repository — always the largest

✗ Try again — provenance often exceeds content in high-throughput flows due to per-event record accumulation.

4. What is a Processor in Apache NiFi and what are the main processor categories?

A Processor is the fundamental building block of a NiFi data flow. Each processor performs one specific operation on FlowFiles: fetching from a source, transforming content, routing on attributes, writing to a destination, and so on. Processors are connected via Connections to form a directed dataflow graph on the canvas.

NiFi ships with hundreds of built-in processors in functional categories:

Data Ingestion: GetFile, GetHTTP, GetSFTP, ListenHTTP, ConsumeKafka, GetSQS. These pull or receive data from external sources.

Data Egress: PutFile, PublishKafka, PutS3Object, PostHTTP, PutEmail, PutSFTP — write or send data to destinations.

Routing and Mediation: RouteOnAttribute, RouteOnContent, SplitText, SplitJson, MergeContent. These split, merge, or direct FlowFiles to different paths.

Database Interaction: ExecuteSQL, PutDatabaseRecord, QueryDatabaseTable, GenerateTableFetch — read from and write to JDBC-accessible databases.

Attribute Extraction: UpdateAttribute, EvaluateJsonPath, ExtractText, LookupRecord — read or modify FlowFile attributes.

Transformation: ConvertRecord, JoltTransformJSON, TransformXml, ReplaceText — change format or content of payloads.

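Custom processors can be developed in Java, packaged as NAR (NiFi Archive) files, and deployed by placing them in NiFi's lib directory (read at startup) or its extensions directory (scanned for new NARs while NiFi is running).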

Which processor routes FlowFiles to different downstream connections based on attribute values?

SplitText

✗ Try again — SplitText splits content into multiple FlowFiles; it does not route by attribute.

RouteOnAttribute

✓ Well done — RouteOnAttribute evaluates NiFi Expression Language conditions against attributes and routes accordingly.

UpdateAttribute

✗ Try again — UpdateAttribute modifies attributes but does not route FlowFiles.

What file format packages custom NiFi processors with classloader-isolated dependencies for deployment?

JAR

✗ Try again — plain JARs are not the NiFi extension format.

WAR

✗ Try again — WAR is a Java web application archive, not a NiFi extension format.

NAR (NiFi Archive)

✓ Well done — NAR files provide classloader isolation, preventing dependency conflicts between processors.

5. What is a Connection in NiFi and how does back-pressure work?

A Connection in NiFi is a directed, persistent queue that links the output of one processor to the input of another. FlowFiles physically queue here when the downstream processor cannot keep up. Each connection carries one or more relationships from the upstream processor (e.g., success, failure). All processor relationships must be either connected or auto-terminated before the processor can start.

Back-pressure is configured on each connection with two independent thresholds:

Back Pressure Object Threshold: Maximum FlowFile count in the queue. When reached, the upstream processor stops being scheduled. Default is 10,000.

Back Pressure Data Size Threshold: Maximum total byte size of queued content. When exceeded, the upstream processor also pauses. Default is 1 GB.

Setting both to 0 disables back-pressure, allowing the queue to grow unbounded, so use this cautiously. Connections also support prioritizers controlling dequeue order: First In First Out, Newest FlowFile First, Oldest FlowFile First, or a numeric priority attribute (PriorityAttributePrioritizer).
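The two thresholds can be modeled with a small queue class. This is a simplified sketch, not NiFi code: `Connection` and `run_upstream` are hypothetical names, each queued item is represented only by its size in bytes, and the thresholds are treated as hard limits checked before every run.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class Connection:
    """Toy connection queue; each queued item is a FlowFile size in bytes."""
    object_threshold: int = 10_000         # default count threshold from the text
    size_threshold_bytes: int = 1024 ** 3  # default size threshold (1 GB)
    queue: Deque[int] = field(default_factory=deque)
    queued_bytes: int = 0

    def enqueue(self, size_bytes: int) -> None:
        self.queue.append(size_bytes)
        self.queued_bytes += size_bytes

    def dequeue(self) -> int:
        size_bytes = self.queue.popleft()
        self.queued_bytes -= size_bytes
        return size_bytes

    def back_pressure_applied(self) -> bool:
        # A threshold of 0 disables that particular check.
        over_count = 0 < self.object_threshold <= len(self.queue)
        over_size = 0 < self.size_threshold_bytes <= self.queued_bytes
        return over_count or over_size


def run_upstream(conn: Connection, attempts: int, size_bytes: int) -> int:
    """Attempt to schedule the upstream processor; return how many runs happened."""
    runs = 0
    for _ in range(attempts):
        if conn.back_pressure_applied():
            continue  # the processor is simply not scheduled; nothing is dropped
        conn.enqueue(size_bytes)
        runs += 1
    return runs


conn = Connection(object_threshold=3)
produced = run_upstream(conn, attempts=5, size_bytes=100)  # pauses once 3 are queued
conn.dequeue()                                             # downstream consumes one
resumed = run_upstream(conn, attempts=1, size_bytes=100)   # scheduling resumes
```

The skipped attempts show the key property: the producer waits instead of discarding data.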

What happens to an upstream processor when a connection's back-pressure object threshold is reached?

New FlowFiles are silently discarded

✗ Try again — NiFi never silently discards FlowFiles under back-pressure; it pauses the producer.

The upstream processor stops being scheduled until the queue drains below the threshold

✓ Well done — back-pressure pauses scheduling of the upstream processor, providing natural flow control without data loss.

The processor throws an exception and enters a failed state

✗ Try again — back-pressure is normal flow control, not an error condition.

What must be true about all processor relationships before a processor can be started in NiFi?

Every relationship must be connected to a downstream component or explicitly auto-terminated

✓ Well done — unhandled relationships prevent the processor from starting, ensuring no FlowFiles are silently dropped.

At least one relationship must loop back to the same processor

✗ Try again — retry loops are optional; the requirement is that all relationships are handled.

All relationships must point to the same downstream processor

✗ Try again — different relationships can go to different processors or be auto-terminated independently.

6. What is NiFi Expression Language and where can it be used?

NiFi Expression Language (EL) is a built-in expression engine enabling dynamic evaluation of FlowFile attribute values within processor property configurations. Instead of hardcoded values, you write expressions evaluated at runtime against each FlowFile's attributes using the syntax ${attribute.name}.

EL supports a rich function library:

String functions: ${filename:toUpper()}, ${mime.type:substringAfter('/')}, ${attr:startsWith('prefix')}, ${attr:replace('old','new')}

Math functions: ${count:toNumber():plus(1)}, ${size:multiply(2)}

Date/time: ${now():format('yyyy-MM-dd')}, ${attr:toDate('MM/dd/yyyy')}

Boolean / conditional: ${attr:equals('active'):ifElse('yes','no')}, ${attr:isEmpty():not()}

System: ${hostname()}, ${UUID()}, ${literal('fixed')}

EL can only be used in properties that declare Expression Language support; the property's help tooltip in the NiFi UI states whether it is supported and at what scope. In fields without EL support, the expression is treated as a literal string and no error is raised. Common applications include dynamic file paths, Kafka topic routing, SQL query construction, and HTTP endpoint URLs.
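To make the chained-call style concrete, here is a small Python emulation of a handful of the functions listed above. It is not NiFi's parser: the `FUNCTIONS` table and `evaluate` helper are hypothetical, the chain is passed as a pre-parsed list instead of an expression string, and the fallback for `substringAfter` when the separator is missing is an assumption.

```python
from typing import Any, Callable, Dict, List, Tuple

# Python stand-ins for a few EL functions; the first argument is the "subject".
FUNCTIONS: Dict[str, Callable[..., Any]] = {
    "toUpper": lambda s: s.upper(),
    "substringAfter": lambda s, sep: s.split(sep, 1)[1] if sep in s else s,
    "startsWith": lambda s, prefix: s.startswith(prefix),
    "replace": lambda s, old, new: s.replace(old, new),
    "toNumber": lambda s: int(s),
    "plus": lambda n, m: n + m,
    "equals": lambda s, other: s == other,
    "ifElse": lambda cond, yes, no: yes if cond else no,
}

Chain = List[Tuple[str, tuple]]


def evaluate(attributes: Dict[str, str], name: str, chain: Chain) -> Any:
    """Start from an attribute value and apply each function in the chain in order."""
    value: Any = attributes.get(name, "")
    for function_name, args in chain:
        value = FUNCTIONS[function_name](value, *args)
    return value


attrs = {"filename": "data.csv", "mime.type": "application/json", "count": "41", "status": "active"}

upper = evaluate(attrs, "filename", [("toUpper", ())])                              # ${filename:toUpper()}
subtype = evaluate(attrs, "mime.type", [("substringAfter", ("/",))])                # ${mime.type:substringAfter('/')}
next_count = evaluate(attrs, "count", [("toNumber", ()), ("plus", (1,))])           # ${count:toNumber():plus(1)}
label = evaluate(attrs, "status", [("equals", ("active",)), ("ifElse", ("yes", "no"))])
```

Each function receives the result of the previous one, which is why `toNumber()` has to come before `plus(1)` in the chain.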

What NiFi Expression Language expression converts the filename attribute to uppercase?

${filename:upper()}

✗ Try again — the function name is toUpper(), not upper().

${filename:toUpper()}

✓ Well done — toUpper() is the correct NiFi EL string function for uppercase conversion.

${toUpperCase(filename)}

✗ Try again — NiFi EL uses chained method-call style on the attribute, not a function-wrapping style.

What happens if you type a NiFi EL expression into a processor property field that does not support EL?

NiFi throws a validation error and prevents saving

✗ Try again — NiFi silently uses the literal string; no error is raised.

The expression is treated as a literal string and is not evaluated

✓ Well done — EL is only evaluated in EL-enabled fields; in non-EL fields the raw text is used as-is.

The attribute value is always returned as an empty string

✗ Try again — the literal expression text itself is used, not an empty string.

7. What is data provenance in Apache NiFi and how do you access it?

Data provenance in NiFi is the complete, immutable audit trail of everything that happens to every FlowFile from when it enters until it leaves or is dropped. NiFi records a provenance event automatically for every significant action — no explicit configuration is required.

Provenance event types include: RECEIVE (data enters NiFi), SEND (data sent to external destination), FETCH, CREATE, FORK (FlowFile split into multiple), JOIN (FlowFiles merged), CLONE, CONTENT_MODIFIED, ATTRIBUTES_MODIFIED, DROP, and REPLAY.

Each event records: timestamp, event type, duration, FlowFile UUID, attributes before and after, the component that processed it, content claim details (where the content was stored before and after the event), and transit URI (the actual endpoint URL for SEND/RECEIVE).

Access provenance via the NiFi global menu → Data Provenance. Search by FlowFile UUID, filename, content size, processor, or time range. From a FORK event you can navigate to child FlowFiles; from JOIN to parents — reconstructing complete lineage. The Replay button on RECEIVE or CONTENT_MODIFIED events re-injects that exact FlowFile state back into the flow, invaluable for debugging and reprocessing.
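Lineage reconstruction from FORK events can be sketched as a simple graph walk. The event dictionaries below are a made-up, simplified shape (real provenance events carry many more fields), and `descendants` and `history` are hypothetical helpers.

```python
from collections import deque
from typing import Dict, List

# Simplified events: one parent is received, split in two, and each child moves on.
events: List[Dict] = [
    {"type": "RECEIVE", "uuid": "parent", "children": []},
    {"type": "FORK", "uuid": "parent", "children": ["child-1", "child-2"]},
    {"type": "CONTENT_MODIFIED", "uuid": "child-1", "children": []},
    {"type": "SEND", "uuid": "child-1", "children": []},
    {"type": "DROP", "uuid": "child-2", "children": []},
]


def descendants(events: List[Dict], root: str) -> List[str]:
    """Follow FORK and CLONE events breadth-first to collect every derived FlowFile."""
    found: List[str] = []
    pending = deque([root])
    while pending:
        current = pending.popleft()
        for event in events:
            if event["uuid"] == current and event["type"] in ("FORK", "CLONE"):
                for child in event["children"]:
                    if child not in found:
                        found.append(child)
                        pending.append(child)
    return found


def history(events: List[Dict], uuid: str) -> List[str]:
    """List the event types recorded for one FlowFile, in order."""
    return [event["type"] for event in events if event["uuid"] == uuid]


children = descendants(events, "parent")
child_history = history(events, "child-1")
```

Walking from `parent` yields both children, and each child's own history shows what happened to it after the split.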

Which provenance event type is recorded when SplitJson divides one FlowFile into multiple child FlowFiles?

CREATE

✗ Try again — CREATE is for FlowFiles generated without a parent; a split produces a FORK event.

FORK

✓ Well done — FORK records the lineage relationship between one parent FlowFile and its multiple child FlowFiles.

CLONE

✗ Try again — CLONE is for exact copies sent to multiple relationships; FORK is the split event.

What does the Replay feature on a provenance event allow you to do?

Edit the FlowFile content inline

✗ Try again — provenance is read-only; content cannot be edited from the provenance UI.

Re-inject the FlowFile from that exact point in its history back into the active flow

✓ Well done — Replay re-submits the FlowFile state at that provenance event without re-triggering the original data source.

Delete the FlowFile from its current queue

✗ Try again — queue deletion is done via List Queue on the connection, not from the provenance view.

8. What is a Process Group in NiFi and why is it used?

A Process Group is a named container that groups processors, connections, funnels, and other components into a single unit on the NiFi canvas. It appears as a rectangle; double-clicking opens it to reveal its internals. Process Groups can be nested, enabling hierarchical flow organization.

Organization: Large flows with hundreds of processors become unmanageable on a flat canvas. Grouping related processors into named groups makes the top-level view a clean architectural diagram.

Reuse: A Process Group can be saved as a versioned flow in NiFi Registry and instantiated multiple times for different environments or data sources.

Access control: NiFi policy-based access control applies at the Process Group level. Different teams can be granted operate, view, or modify rights to specific groups without affecting others.

Input and Output Ports: Data enters and exits through explicitly defined Input Ports and Output Ports — the only official data gateways. In a NiFi cluster, Remote Process Groups (RPGs) use Input Ports of remote NiFi instances as Site-to-Site transfer targets.

How does data officially enter a Process Group in NiFi?

By drawing connections directly across the group boundary

✗ Try again — NiFi does not allow connections to cross Process Group boundaries directly.

Through an Input Port defined within the Process Group

✓ Well done — Input Ports are the required, explicit data entry points for a Process Group.

Via a shared queue ID that spans the boundary

✗ Try again — queue IDs are not shared across group boundaries.

Which NiFi feature enables a Process Group's flow definition to be version-controlled and promoted across environments?

FlowFile provenance

✗ Try again — provenance tracks data lineage, not flow configuration versions.

NiFi Registry

✓ Well done — NiFi Registry is the version control system for NiFi flow definitions, enabling dev-to-prod promotion.

NiFi clustering

✗ Try again — clustering distributes processing load; Registry handles flow versioning.

9. What is NiFi Registry and how does it integrate with NiFi?

NiFi Registry is a complementary subproject that provides centralized storage, management, and versioning of NiFi flow definitions. It functions as a version control system for Process Group configurations — similar in concept to Git for source code.

NiFi Registry organizes flows into Buckets — named containers for flow versions, each with independent access policies. The integration workflow is:

1. Add the Registry URL under NiFi Controller Settings → Registry Clients.

2. Right-click a Process Group on the canvas → Version → Start Version Control, selecting a bucket and flow name. NiFi sends the flow JSON to Registry as version 1.

3. After further changes, commit via Version → Commit Local Changes to create a new snapshot.

4. To promote to production, a production NiFi imports the flow from Registry at the target version — ensuring consistent configuration across environments.

5. To roll back, right-click the versioned group → Version → Change Version to select any previous commit.

Registry exposes a REST API, enabling CI/CD pipelines to promote flow versions from dev to staging to prod without manual canvas operations.

What is a Bucket in NiFi Registry?

An S3 bucket used to store FlowFile content externally

✗ Try again — Registry buckets are internal containers for flow definitions, not cloud storage.

A named container in NiFi Registry holding versioned flow definitions with per-bucket access policies

✓ Well done — buckets organize flows within Registry and support independent access control.

A connection queue that holds FlowFiles during back-pressure

✗ Try again — that describes a connection queue, not a Registry concept.

How do you roll back a versioned NiFi Process Group to a previous Registry version?

Delete and re-import the group from a saved template file

✗ Try again — that is a destructive manual approach; Registry supports native version switching.

Right-click the Process Group → Version → Change Version and select the target version number

✓ Well done — Change Version switches to any committed version without destroying the Process Group.

Use the NiFi CLI to reset the FlowFile repository to a checkpoint

✗ Try again — FlowFile repository checkpoints relate to crash recovery, not flow definition rollback.

10. How does NiFi clustering work and what is the role of ZooKeeper?

A NiFi cluster consists of multiple NiFi nodes that all run the same flow and collectively process data in parallel. Every node receives a copy of the flow definition and runs the same processors, but each node independently processes a subset of the FlowFiles — distributing the workload.

NiFi uses an embedded ZooKeeper (or an external ZooKeeper ensemble) for cluster coordination. ZooKeeper serves two critical roles:

Cluster Coordinator election: One node is elected Cluster Coordinator via ZooKeeper leader election. The Coordinator manages cluster state — which nodes are connected, which are heartbeating, and which are being disconnected. If the Coordinator fails, ZooKeeper elects a new one from the remaining nodes automatically.

Primary Node election: Separately, one node is designated as Primary Node. Processors configured with Execution: Primary Node Only run exclusively on the Primary Node. This is essential for processors that must not run concurrently — ListSFTP or QueryDatabaseTable would produce duplicate FlowFiles if all nodes ran them simultaneously.

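Nodes exchange heartbeats and cluster status over NiFi's internal cluster protocol; Site-to-Site is a separate mechanism for transferring FlowFiles between NiFi instances. The NiFi web UI can be accessed on any node, and canvas changes made there are replicated to all nodes transparently.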
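The reason for Primary Node Only execution can be shown with a toy simulation. Everything here is hypothetical (`remote_files`, `run_listing`, the node names); it only models the counting argument, not NiFi's scheduler.

```python
from typing import List, Tuple

remote_files = ["a.csv", "b.csv"]   # files visible to every node on the same SFTP server
nodes = ["node-1", "node-2", "node-3"]
primary = "node-2"                  # assumed outcome of the Primary Node election


def run_listing(primary_only: bool) -> List[Tuple[str, str]]:
    """Return (node, filename) pairs for every listing FlowFile produced."""
    listed: List[Tuple[str, str]] = []
    for node in nodes:
        if primary_only and node != primary:
            continue  # the processor is not scheduled on non-primary nodes
        listed.extend((node, name) for name in remote_files)
    return listed


all_nodes_result = run_listing(primary_only=False)     # every file listed three times
primary_only_result = run_listing(primary_only=True)   # every file listed once
```

With three nodes, running the listing everywhere triples the output, while restricting it to the primary produces exactly one entry per file.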

Why must processors like ListSFTP be configured to run on the Primary Node only in a NiFi cluster?

Running on all nodes simultaneously would cause each node to list and ingest the same files, producing duplicate FlowFiles

✓ Well done — source-listing processors must run on only one node to avoid parallel duplication of ingested data.

The Primary Node has the fastest disk for source access

✗ Try again — disk speed is not the reason; duplicate ingestion is the concern.

NiFi cluster nodes cannot access external systems

✗ Try again — all nodes can access external systems; the issue is preventing duplicate processing.

What role does ZooKeeper play in NiFi clustering?

It stores FlowFile content for all cluster nodes

✗ Try again — content is stored in each node's own content repository, not in ZooKeeper.

It manages Cluster Coordinator and Primary Node leader elections and cluster membership

✓ Well done — ZooKeeper provides the distributed coordination and leader election that NiFi clustering depends on.

It load-balances incoming HTTP requests across NiFi nodes

✗ Try again — HTTP load balancing is done by an external load balancer, not ZooKeeper.

11. What is a Controller Service in NiFi and how is it different from a Processor?

A Controller Service is a shared, reusable service that processors within a Process Group (or across the entire NiFi instance) can reference in their configurations. Where a Processor performs work on individual FlowFiles, a Controller Service provides a shared capability — a database connection pool, an SSL context, a distributed map cache client, a record reader/writer — that multiple processors reuse simultaneously.

Common Controller Services include:

DBCPConnectionPool: Manages a JDBC connection pool. Processors like ExecuteSQL and PutDatabaseRecord reference this service instead of each opening their own connections, dramatically reducing database connection overhead.

JsonTreeReader / JsonRecordSetWriter: Define how to read and write JSON records. Used by record-aware processors to decouple format handling from logic.

StandardSSLContextService: Provides a shared TLS/SSL context (keystore, truststore) referenced by any processor needing secure connections.

DistributedMapCacheClientService: A client for a key-value cache hosted by a DistributedMapCacheServer Controller Service. The cache is accessible across the flow and is useful for deduplication and state sharing.

Controller Services have their own lifecycle: they must be enabled before any referencing processor can start, and they cannot be disabled while processors referencing them are running.
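The lifecycle rules in the last paragraph can be expressed as a pair of small classes. This is a sketch of the rules only; `ControllerService` and `Processor` here are hypothetical Python classes, not NiFi's Java interfaces.

```python
from typing import Set


class ControllerService:
    def __init__(self, name: str) -> None:
        self.name = name
        self.enabled = False
        self.running_references: Set[str] = set()

    def enable(self) -> None:
        self.enabled = True

    def disable(self) -> None:
        # A service cannot be disabled while running processors reference it.
        if self.running_references:
            raise RuntimeError(f"{self.name} is referenced by running processors")
        self.enabled = False


class Processor:
    def __init__(self, name: str, service: ControllerService) -> None:
        self.name = name
        self.service = service

    def start(self) -> None:
        # A processor cannot start until its referenced service is enabled.
        if not self.service.enabled:
            raise RuntimeError(f"{self.service.name} must be enabled first")
        self.service.running_references.add(self.name)

    def stop(self) -> None:
        self.service.running_references.discard(self.name)


pool = ControllerService("DBCPConnectionPool")
execute_sql = Processor("ExecuteSQL", pool)

try:
    execute_sql.start()
    started_while_disabled = True
except RuntimeError:
    started_while_disabled = False

pool.enable()
execute_sql.start()

try:
    pool.disable()
    disabled_while_referenced = True
except RuntimeError:
    disabled_while_referenced = False

execute_sql.stop()
pool.disable()  # allowed now that no running processor references the service
```

The order that works is enable, start, stop, disable; either shortcut raises an error in this model.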

Why do multiple processors reference a single DBCPConnectionPool Controller Service rather than each managing their own connections?

NiFi does not allow processors to open database connections directly

✗ Try again — NiFi does not technically prohibit direct connections, but shared pools are the standard pattern for efficiency.

Sharing a connection pool reduces total database connections and overhead compared to each processor managing its own

✓ Well done — a shared pool means 10 processors share N connections rather than each opening their own N connections.

Controller Services are faster than direct JDBC connections

✗ Try again — the speed comes from pooling and reuse, not from Controller Services being inherently faster.

What must happen to a Controller Service before any processor that references it can be started?

The Controller Service must be enabled

✓ Well done — processors referencing a disabled Controller Service cannot start; the service must be enabled first.

The Controller Service must be assigned a unique UUID

✗ Try again — NiFi assigns UUIDs automatically; the operational requirement is enabling the service.

The Controller Service must be restarted after every NiFi upgrade

✗ Try again — Controller Services are persisted in the flow configuration and do not need manual restarting after upgrades.

12. What is the GenerateTableFetch and QueryDatabaseTable pattern for incremental database ingestion?

QueryDatabaseTable and GenerateTableFetch are the two primary patterns for incrementally ingesting data from relational database tables. Each has different performance characteristics and use cases.

QueryDatabaseTable: A simpler, single-processor approach. It issues a SELECT query and uses the configurable Maximum-value Columns property to track the last seen value (typically a timestamp or auto-increment ID). On each execution, it queries only rows where the tracked column exceeds the stored state, which makes the read incremental. By default it produces one FlowFile per execution containing all new rows. Suited for moderate-size increments on single tables.

GenerateTableFetch + ExecuteSQL: A scalable, parallelizable pattern for large tables. GenerateTableFetch queries the database to determine the extent of new rows (the row count and the maximum of the tracking column), then generates one SQL SELECT statement per partition chunk, each written as the content of its own FlowFile. Each generated FlowFile is routed to ExecuteSQL, which executes the SQL and returns that chunk's result set. Multiple ExecuteSQL tasks can run in parallel, fetching different partitions simultaneously and substantially improving throughput for large incremental loads.

Both processors use NiFi's State Management to persist the last processed value across restarts, ensuring no rows are missed or re-read on processor restart.
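Both patterns can be approximated against an in-memory SQLite database. This is a rough sketch under stated assumptions: the `events` table, the `id` tracking column, and the helpers `fetch_increment` and `generate_page_queries` are hypothetical, plain dicts stand in for State Management, and the generated SQL uses LIMIT/OFFSET paging, which may differ from what NiFi emits for a given database.

```python
import sqlite3
from typing import Dict, List, Tuple

TABLE, MAX_COL = "events", "id"  # hypothetical table and tracking column


def fetch_increment(conn: sqlite3.Connection, state: Dict[str, int]) -> List[Tuple]:
    """QueryDatabaseTable-style read: return only rows newer than the stored maximum."""
    last = state.get(MAX_COL, 0)
    rows = conn.execute(
        f"SELECT {MAX_COL}, payload FROM {TABLE} WHERE {MAX_COL} > ? ORDER BY {MAX_COL}",
        (last,),
    ).fetchall()
    if rows:
        state[MAX_COL] = rows[-1][0]  # remember the highest value seen
    return rows


def generate_page_queries(conn: sqlite3.Connection, state: Dict[str, int], partition_size: int) -> List[str]:
    """GenerateTableFetch-style planning: emit one SELECT per page of new rows."""
    last = state.get(MAX_COL, 0)
    count, new_max = conn.execute(
        f"SELECT COUNT(*), MAX({MAX_COL}) FROM {TABLE} WHERE {MAX_COL} > ?", (last,)
    ).fetchone()
    if count == 0:
        return []
    state[MAX_COL] = new_max
    return [
        f"SELECT {MAX_COL}, payload FROM {TABLE} "
        f"WHERE {MAX_COL} > {last} AND {MAX_COL} <= {new_max} "
        f"ORDER BY {MAX_COL} LIMIT {partition_size} OFFSET {offset}"
        for offset in range(0, count, partition_size)
    ]


conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE {TABLE} ({MAX_COL} INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(f"INSERT INTO {TABLE} (payload) VALUES (?)", [(f"row-{i}",) for i in range(5)])

query_state: Dict[str, int] = {}
first = fetch_increment(conn, query_state)    # all existing rows
second = fetch_increment(conn, query_state)   # nothing new since the last run
conn.execute(f"INSERT INTO {TABLE} (payload) VALUES ('row-5')")
third = fetch_increment(conn, query_state)    # only the newly inserted row

fetch_state: Dict[str, int] = {}
pages = generate_page_queries(conn, fetch_state, partition_size=4)
page_sizes = [len(conn.execute(query).fetchall()) for query in pages]  # the ExecuteSQL step
```

Each string in `pages` is independent of the others, which is what lets separate workers execute them concurrently.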

What is the primary advantage of the GenerateTableFetch + ExecuteSQL pattern over using QueryDatabaseTable alone?

GenerateTableFetch supports more database types than QueryDatabaseTable

✗ Try again — both support any JDBC-accessible database; the advantage is parallelism.

Large increments are split into parallel partition chunks, enabling concurrent fetching across multiple ExecuteSQL processors

✓ Well done — partitioning allows parallel execution, making it far more scalable for large tables.

It avoids the need for a JDBC Controller Service

✗ Try again — both patterns require a DBCPConnectionPool Controller Service.

How do QueryDatabaseTable and GenerateTableFetch track the last ingested row to avoid re-reading data across restarts?

They write the last value to a FlowFile attribute stored in the content repository

✗ Try again. Attributes belong to individual FlowFiles and live in the FlowFile repository, not the content repository; processors do not use them to remember progress between runs.

They use NiFi State Management to persistently store the last processed column value

✓ Well done — State Management provides durable key-value storage that survives processor and NiFi restarts.

They rely on the source database to mark read rows with a processed flag

✗ Try again — NiFi does not modify source tables; state is tracked internally via State Management.

13. What is the Record-based processing model in NiFi and why is it preferred?

NiFi's record-based processing model treats FlowFile content as a structured stream of records rather than an opaque blob. A record is one logical row — one JSON object, one CSV line, one Avro record, one database row. Record-aware processors operate on individual records within a FlowFile, enabling format-agnostic transformations.

The model relies on three Controller Service types:

RecordReader: Parses the FlowFile content and produces a stream of records. Implementations include JsonTreeReader, CSVReader, AvroReader, ParquetReader, XMLReader, and GrokReader (for unstructured log parsing).

RecordWriter: Serializes records back to bytes. Implementations include JsonRecordSetWriter, CSVRecordSetWriter, AvroRecordSetWriter, and ParquetRecordSetWriter.

Schema Registry: Optionally provides Avro schemas that readers and writers use to interpret and validate records. NiFi includes an embedded AvroSchemaRegistry.

Key record-aware processors: ConvertRecord (format conversion), QueryRecord (apply SQL SELECT against FlowFile records using Apache Calcite), LookupRecord (enrich records from external sources), UpdateRecord, and PartitionRecord (split into one FlowFile per distinct field value).

The key advantage is format independence: changing from JSON to CSV input requires only swapping the RecordReader Controller Service — no processor logic changes. It also avoids materializing entire FlowFiles into memory by streaming records one at a time.
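The reader/writer separation can be illustrated outside NiFi with a minimal Python sketch. This is not NiFi code: the functions read_json_records, read_csv_records, write_json_records, and uppercase_names are hypothetical stand-ins for a RecordReader, a RecordWriter, and a record-aware processor, and the name field is an assumed example field.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, str]

def read_json_records(content: str) -> Iterator[Record]:
    """Stand-in for a JSON reader: expects a JSON array of flat objects."""
    for obj in json.loads(content):
        yield {key: str(value) for key, value in obj.items()}

def read_csv_records(content: str) -> Iterator[Record]:
    """Stand-in for a CSV reader: the first line is treated as the header."""
    for row in csv.DictReader(io.StringIO(content)):
        yield dict(row)

def write_json_records(records: Iterable[Record]) -> str:
    """Stand-in for a JSON writer."""
    return json.dumps(list(records))

def uppercase_names(records: Iterable[Record]) -> Iterator[Record]:
    """The 'processor' logic: it sees records, never raw bytes."""
    for record in records:
        yield {**record, "name": record["name"].upper()}

def run_flow(content: str, reader: Callable[[str], Iterator[Record]]) -> str:
    # Only the reader differs between formats; the transform is untouched.
    return write_json_records(uppercase_names(reader(content)))

json_input = '[{"id": "1", "name": "ada"}, {"id": "2", "name": "grace"}]'
csv_input = "id,name\n1,ada\n2,grace\n"

json_result = run_flow(json_input, read_json_records)
csv_result = run_flow(csv_input, read_csv_records)
```

Both calls produce the same JSON output even though the inputs differ in format, which is the property the record model provides when only the reader is swapped.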

How would you change a record-based NiFi flow from processing JSON input to processing CSV input?

Swap the RecordReader Controller Service from JsonTreeReader to CSVReader; no processor logic changes are needed

✓ Well done — format independence is the core value of the record model; only the reader/writer needs to change.

Replace all record-aware processors with CSV-specific processors

✗ Try again — that negates the format-independence benefit of the record model.

Add a ConvertRecord processor before each existing processor to pre-convert CSV to JSON

✗ Try again — that is unnecessary overhead; simply changing the reader is sufficient.

Which record-aware processor allows you to apply SQL SELECT statements against the records within a FlowFile?

LookupRecord

✗ Try again — LookupRecord enriches records by querying external lookup services, not SQL against the FlowFile itself.

QueryRecord

✓ Well done — QueryRecord uses Apache Calcite to evaluate SQL SELECT queries against the records within a FlowFile.

ExecuteSQL

✗ Try again — ExecuteSQL executes SQL against an external database, not against FlowFile content.

14. What is State Management in NiFi and what types of state scope exist?

State Management is NiFi's built-in mechanism for processors and controller services to persistently store small amounts of key-value data that survive processor restarts and NiFi restarts. Without state management, a processor like QueryDatabaseTable would forget the last ingested timestamp every time it was stopped, causing duplicate ingestion.

State has two scopes:

Local State: Scoped to a specific processor on a specific NiFi node. Stored on that node's local disk by the default WriteAheadLocalStateProvider, which persists state through a write-ahead log. Used when each node needs its own independent tracking, for example a ListFile processor tracking which files it has already listed from a local directory on that node.

Cluster State: Scoped to a specific processor but shared across all nodes in a NiFi cluster. Stored in ZooKeeper. Used when only one node should track state for the cluster — for example, QueryDatabaseTable running on the Primary Node needs its last-value state visible to whichever node becomes Primary after a failover.

State is accessed programmatically via the StateManager API. From the NiFi UI, you can view and clear a processor's state by right-clicking the processor → View State.
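Why persisted state prevents duplicate ingestion can be shown with a small Python sketch. It does not use the NiFi StateManager API: FileStateStore, fetch_new_rows, and the max.id key are hypothetical names, and a JSON file stands in for the state provider.

```python
import json
import os
import tempfile
from typing import Dict, List

class FileStateStore:
    """Hypothetical stand-in for a persistent key-value state provider."""

    def __init__(self, path: str) -> None:
        self.path = path

    def get_state(self) -> Dict[str, str]:
        if not os.path.exists(self.path):
            return {}
        with open(self.path, "r", encoding="utf-8") as handle:
            return json.load(handle)

    def set_state(self, state: Dict[str, str]) -> None:
        with open(self.path, "w", encoding="utf-8") as handle:
            json.dump(state, handle)

def fetch_new_rows(rows: List[Dict[str, int]], store: FileStateStore) -> List[Dict[str, int]]:
    """Return rows whose id exceeds the stored maximum, then persist the new maximum."""
    last_id = int(store.get_state().get("max.id", "0"))
    new_rows = [row for row in rows if row["id"] > last_id]
    if new_rows:
        store.set_state({"max.id": str(max(row["id"] for row in new_rows))})
    return new_rows

state_path = os.path.join(tempfile.mkdtemp(), "processor-state.json")
table = [{"id": 1}, {"id": 2}, {"id": 3}]

first_run = fetch_new_rows(table, FileStateStore(state_path))

# A fresh store object simulates a restart: nothing is carried over in memory.
table.append({"id": 4})
second_run = fetch_new_rows(table, FileStateStore(state_path))
```

The second run returns only the row with id 4, because the maximum seen so far was read back from disk rather than from memory.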

Why would a processor like QueryDatabaseTable use Cluster scope state rather than Local scope?

It runs on the Primary Node, and its state must be accessible to whichever node becomes Primary after failover

✓ Well done — Cluster scope state stored in ZooKeeper survives node failovers, ensuring the new Primary Node can resume from the correct position.

Local state is not persistent across processor restarts

✗ Try again — Local state is persistent across restarts; it is stored on disk. The issue is cross-node accessibility after failover.

Cluster scope state is faster to read than Local scope

✗ Try again — ZooKeeper-based Cluster state is typically slower than local disk; the reason is failover accessibility.

Where is Local scope state physically stored in NiFi by default?

In ZooKeeper

✗ Try again: ZooKeeper stores Cluster scope state; Local scope state is kept on each node's local disk.

On local disk, in a write-ahead log managed by WriteAheadLocalStateProvider

✓ Well done: Local state is persisted on each NiFi node's filesystem, keeping it node-specific and fast.

In the FlowFile Repository

✗ Try again — the FlowFile repository stores active FlowFile metadata, not processor state.

15. What is NiFi Site-to-Site (S2S) and when do you use it?

NiFi Site-to-Site (S2S) is a native protocol for transferring FlowFiles directly between NiFi instances — between two standalone NiFi servers, between nodes in a cluster, or between NiFi and MiNiFi agents. It operates over HTTP/HTTPS or raw TCP sockets and provides end-to-end guaranteed delivery, optional compression, and mutual TLS authentication.

S2S works through Remote Process Groups (RPGs) on the sending side. An RPG is configured with the URL of the remote NiFi instance. The sending NiFi queries the remote instance for available Input Ports, then routes FlowFiles to the chosen Input Port.

Key features of S2S:

Automatic load balancing: When the remote NiFi is a cluster, the RPG automatically distributes FlowFiles across all healthy cluster nodes using peer selection.

Back-pressure propagation: If the remote Input Port queue is full, the sender is notified and slows down — extending NiFi's back-pressure model across instance boundaries.

Compression: Optional GZIP compression reduces bandwidth between instances.

Common use cases: MiNiFi agents on edge devices pushing data to a central NiFi hub; cross-datacenter data transfer pipelines; routing data from an ingestion NiFi tier to a processing NiFi tier.

What NiFi component on the sending side is used to push FlowFiles to a remote NiFi instance via Site-to-Site?

Output Port

✗ Try again — Output Ports are the exit point within a Process Group; the sending component is a Remote Process Group.

Remote Process Group (RPG)

✓ Well done — an RPG configured with the remote NiFi URL discovers its Input Ports and routes FlowFiles to them via S2S.

PutHTTP processor

✗ Try again — PutHTTP sends arbitrary HTTP requests; S2S uses the dedicated RPG component with the NiFi-native protocol.

How does Site-to-Site handle back-pressure when the remote NiFi instance's Input Port queue is full?

The sender drops FlowFiles to avoid blocking

✗ Try again — NiFi never drops FlowFiles for back-pressure; it slows the sender instead.

The remote instance signals back-pressure to the sender, which slows its transfer rate accordingly

✓ Well done — S2S propagates back-pressure across instance boundaries, extending NiFi's flow control model end-to-end.

The connection between instances is terminated and must be manually re-established

✗ Try again — S2S handles back-pressure gracefully without dropping connections.

16. What is MiNiFi and how does it relate to Apache NiFi?

MiNiFi (Minimum NiFi) is a lightweight subproject of Apache NiFi designed specifically for edge data collection on resource-constrained devices — IoT sensors, industrial controllers, embedded systems, and edge servers. It implements a subset of NiFi's capabilities with a dramatically reduced footprint: the MiNiFi Java agent runs in under 256 MB heap; the MiNiFi C++ agent runs in tens of megabytes.

MiNiFi flows are created in the full NiFi canvas (as a Process Group) and exported as a YAML or JSON template deployed to MiNiFi agents. The MiNiFi C2 Server (Command and Control) allows centralized management — pushing updated flow templates to agents without manual file deployment.

MiNiFi agents collect data at the source and use NiFi Site-to-Site protocol to ship FlowFiles to a central NiFi hub for further processing, enrichment, and routing. This hub-and-spoke architecture keeps complex transformation logic in the central NiFi where compute resources are available, while keeping the edge agent minimal.

Key differences from full NiFi: MiNiFi has no web UI (flows are pushed from outside), supports fewer processors, has no provenance UI, and is designed for unattended operation in network-constrained or intermittently connected environments.

What protocol do MiNiFi agents use to send collected data to a central Apache NiFi instance?

MQTT

✗ Try again — MiNiFi can consume MQTT data, but it ships data to NiFi via Site-to-Site protocol.

NiFi Site-to-Site (S2S)

✓ Well done — MiNiFi uses the NiFi S2S protocol to push FlowFiles to a central NiFi hub's Input Ports.

Apache Kafka

✗ Try again — Kafka is an optional destination; the native MiNiFi-to-NiFi transport is S2S.

How are flow configurations deployed to MiNiFi agents at scale?

By manually copying YAML template files to each agent device

✗ Try again — manual copying does not scale to hundreds of edge devices; the C2 Server handles this.

Via the MiNiFi C2 (Command and Control) Server which centrally pushes updated flow templates to agents

✓ Well done — the C2 Server enables centralized, automated flow deployment to large fleets of MiNiFi agents.

Using the NiFi canvas UI on each agent device

✗ Try again — MiNiFi has no web UI; flows are pushed externally.

17. What is NiFi Parameter Context and how does it differ from Variables?

A Parameter Context is a named collection of key-value parameters applied to a Process Group to externalize configuration values from the flow definition. Instead of hardcoding a Kafka broker address or database URL inside processor properties, you reference a parameter with the syntax #{parameter.name} and define its value in the Parameter Context. This makes flows environment-independent: the same flow definition can point to different Kafka clusters in dev, staging, and prod by binding different Parameter Contexts.

Parameter Contexts were introduced in NiFi 1.10 as the replacement for the older Variables feature. Key differences:

Parameter Context vs Variables
Feature	Parameter Context	Variables
Syntax	#{param.name}	${var.name} (same as EL)
Sensitive values	Supported (masked in UI)	Not supported
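Registry versioning	Versioned with the flow by name; sensitive values are never stored	Stored in the versioned flow as plain text
Inheritance	Bound explicitly to each Process Group; a context can inherit from other contexts (NiFi 1.15+)	Inherited by descendant Process Groups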
Status	Current recommended approach	Deprecated

Parameter Contexts integrate with NiFi Registry: the versioned flow records which Parameter Context each Process Group uses, but sensitive parameter values are never written to the Registry. They must be supplied in each target environment, which keeps credentials out of version control.
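The environment-independence idea can be sketched in a few lines of Python. This is an illustration of #{...} substitution, not NiFi's implementation: resolve_parameters, the regular expression, and the broker host names are assumptions made for the example.

```python
import re
from typing import Dict

# Matches #{name}; the set of characters allowed in a name is an assumption.
PARAM_REFERENCE = re.compile(r"#\{([A-Za-z0-9_. -]+)\}")

def resolve_parameters(property_value: str, context: Dict[str, str]) -> str:
    """Replace every #{name} reference with its value from the bound context."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in context:
            raise KeyError(f"parameter '{name}' is not defined in the bound context")
        return context[name]
    return PARAM_REFERENCE.sub(substitute, property_value)

# Two hypothetical contexts with the same parameter name but different values.
dev_context = {"kafka.brokers": "dev-kafka:9092"}
prod_context = {"kafka.brokers": "prod-kafka-1:9092,prod-kafka-2:9092"}

# The processor property is identical in every environment.
property_value = "#{kafka.brokers}"

dev_value = resolve_parameters(property_value, dev_context)
prod_value = resolve_parameters(property_value, prod_context)
```

The property text never changes between environments; only the context bound to the Process Group does.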

What is the NiFi syntax for referencing a value from a Parameter Context in a processor property?

${parameter.name}

✗ Try again — that is NiFi Expression Language syntax for FlowFile attributes, not Parameter Context references.

#{parameter.name}

✓ Well done — the #{} syntax is used exclusively for Parameter Context values, distinct from EL's ${} syntax.

@{parameter.name}

✗ Try again — @{} is not a NiFi parameter reference syntax.

What advantage do Parameter Contexts have over Variables for storing sensitive values like passwords?

Parameter Contexts support sensitive parameters that are masked in the UI and excluded from Registry-versioned flow definitions

✓ Well done — sensitive parameters are never stored in plain text in the flow definition, preventing credential exposure in Registry.

Variables also support sensitive values; Parameter Contexts just have a different syntax

✗ Try again — Variables do not support sensitive values; this is one reason they were replaced by Parameter Contexts.

Parameter Contexts encrypt values using AES-256 automatically

✗ Try again — NiFi's sensitive property encryption is managed by the nifi.sensitive.props.key configuration, not by Parameter Contexts specifically.

18. How does NiFi handle security — TLS, authentication, and authorization?

NiFi provides a comprehensive security model covering transport encryption, user authentication, and fine-grained authorization.

TLS / Transport Encryption: NiFi can be configured to serve its UI and API exclusively over HTTPS. TLS is configured in nifi.properties using a keystore (server certificate) and truststore (CA certificates for client certificate validation). The tls-toolkit utility generates self-signed certificates and keystores for development clusters.

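Authentication: NiFi supports several authentication mechanisms:

Client Certificate: Mutual TLS, where the browser or API client presents a certificate. The certificate's subject Distinguished Name (DN) becomes the NiFi identity by default; identity mapping rules in nifi.properties can reduce it to the Common Name (CN).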

LDAP / Active Directory: Username and password validated via LdapIdentityProvider.

Kerberos: Single sign-on via Kerberos ticket; the authenticated principal becomes the NiFi identity.

OIDC / OAuth2: Integration with Keycloak, Okta, or Azure AD via OpenID Connect.

Authorization: NiFi uses a policy-based model. Every resource (processor, connection, Process Group, provenance) has access policies (Read, Write, and Operate) that are granted to users or groups; there are no explicit deny rules, so any access that has not been granted is refused. The authorizers.xml file configures the authorization provider. LDAP-based groups can be imported for group-level policy management.

When using client certificate authentication in NiFi, what becomes the user's identity?

The certificate's serial number

✗ Try again: serial numbers are not user-friendly identities.

The subject Distinguished Name (DN) from the client certificate, optionally mapped down to the Common Name (CN)

✓ Well done: NiFi uses the subject DN of the presented client certificate as the authenticated identity, and identity mapping rules can normalize it to the CN.

The IP address of the client machine

✗ Try again — IP addresses are not used as NiFi identities.

What NiFi command-line utility helps generate TLS certificates and keystores for securing a NiFi cluster?

nifi-admin

✗ Try again — there is no nifi-admin CLI tool for certificate generation.

tls-toolkit

✓ Well done — the NiFi tls-toolkit generates self-signed CAs, node certificates, keystores, and truststores for cluster TLS setup.

keytool

✗ Try again — keytool is a standard Java utility; tls-toolkit is the NiFi-specific wrapper that automates multi-node certificate generation.

19. What is the NiFi NAR (NiFi Archive) classloading model?

The NAR (NiFi ARchive) is the extension packaging format for NiFi components: processors, controller services, and reporting tasks. A NAR file is similar to a JAR but includes a special manifest that declares its dependencies and classloader parent chain. The NAR classloading model solves the dependency isolation problem — different processors may depend on conflicting versions of the same library.

Each NAR is loaded by its own NARClassLoader. When a processor in NAR A needs to load a class, the classloader first looks in NAR A's own classpath. Only classes not found there are delegated up the parent chain. Most NARs declare nifi-standard-services-api-nar or nifi-framework-api as their parent NAR, but do not share classloaders with sibling NARs.

This means NiFi can simultaneously run a processor using AWS SDK v1.x (in one NAR) and a processor using AWS SDK v2.x (in another NAR) without any classpath conflicts.

NARs are typically deployed by placing them in NiFi's ./lib directory and restarting NiFi. Later releases can also pick up NARs without a restart: an auto-load directory (./extensions by default) is watched for new NAR files, and NAR Provider support can fetch NARs from external sources such as NiFi Registry.

Why does NiFi use separate NARClassLoaders for each NAR rather than a single shared classloader?

To isolate dependencies and prevent conflicts between processors that require different versions of the same library

✓ Well done — NAR classloader isolation is the mechanism that allows conflicting library versions to coexist in the same NiFi instance.

To improve processor startup time by parallelizing class loading

✗ Try again — isolation, not performance, is the motivation for per-NAR classloaders.

To enforce security restrictions preventing processors from accessing each other

✗ Try again — security isolation is a benefit but not the primary design motivation; dependency isolation is.

Where are NAR files deployed in a NiFi installation?

In the ./conf directory alongside nifi.properties

✗ Try again — conf contains configuration files; NARs go in lib.

In the ./lib directory, which NiFi scans at startup to load extensions

✓ Well done — NiFi scans the lib directory for NAR files during startup and loads each into its own NARClassLoader.

In the ./extensions directory which is hot-watched for new files

✗ Try again: NiFi can auto-load NARs dropped into ./extensions, but that is an optional convenience; the standard deployment target scanned at startup is ./lib.

20. What are Reporting Tasks in NiFi and what are common use cases?

Reporting Tasks are NiFi extension components that run on a scheduled basis to collect and report metrics, bulletin events, and operational data from the NiFi instance itself — not from FlowFiles. They operate at the NiFi system level rather than the data flow level, making them the primary tool for NiFi self-monitoring and integration with external observability platforms.

Reporting Tasks have their own scheduling (time-driven or CRON) and run independently of any flow. They are configured via Controller Settings → Reporting Tasks in the NiFi UI.

Common built-in Reporting Tasks:

SiteToSiteProvenanceReportingTask: Streams provenance events via Site-to-Site to a remote NiFi instance. Used for SIEM integration and long-term provenance archival outside the local provenance repository.

SiteToSiteBulletinReportingTask: Sends NiFi bulletin (warning and error) events via S2S to a remote NiFi for centralized alerting.

ControllerStatusReportingTask: Logs NiFi instance metrics (active threads, FlowFile counts, queue depths) to the NiFi log file.

PrometheusReportingTask: Exposes NiFi metrics as a Prometheus scrape endpoint, enabling Grafana dashboards for NiFi operational monitoring.

What distinguishes a Reporting Task from a Processor in Apache NiFi?

Reporting Tasks report on NiFi's own operational metrics and events, not on individual FlowFiles

✓ Well done — Reporting Tasks observe the NiFi system itself; processors process data flowing through the system.

Reporting Tasks can only be scheduled using CRON expressions, not timer-based scheduling

✗ Try again — Reporting Tasks support both timer-driven and CRON scheduling, same as processors.

Reporting Tasks do not have access to the NiFi API

✗ Try again — Reporting Tasks have full access to the NiFi ReportingContext API to read system state.

Which Reporting Task would you configure to expose NiFi operational metrics for scraping by Prometheus?

ControllerStatusReportingTask

✗ Try again — ControllerStatusReportingTask logs to the NiFi log file, not to a Prometheus endpoint.

PrometheusReportingTask

✓ Well done — PrometheusReportingTask exposes a /metrics HTTP endpoint that Prometheus can scrape for NiFi queue depths, thread counts, and other system metrics.

SiteToSiteProvenanceReportingTask

✗ Try again — that task ships provenance events via S2S, not Prometheus metrics.

21. How do you handle errors and failures in a NiFi flow?

NiFi provides several mechanisms for handling failures gracefully, ensuring that failed FlowFiles are not silently lost and that problems are visible to operators.

Failure Relationships: Most processors emit FlowFiles that cannot be processed to a failure relationship. Always connect this relationship to a destination — commonly a LogAttribute processor, a PutFile processor (to archive failed FlowFiles to disk), or a PublishKafka processor (to route failures to an error topic). Never auto-terminate the failure relationship in production without deliberate consideration.

Retry connections: Loop a failure relationship back to the same processor or an earlier processor to implement retry logic. UpdateAttribute can increment a retry counter attribute, and RouteOnAttribute can route FlowFiles with exceeded retry counts to a dead-letter path.

Bulletins: When a processor encounters an error it logs to its bulletin board. Bulletins appear as colored indicators on the processor in the canvas. Severity levels: DEBUG, INFO, WARNING, ERROR.

Yield Duration: If a processor fails to acquire a connection or resource, it enters a yield state for the configured Yield Duration (default 1 second) before being rescheduled again, preventing tight error loops.

Penalty Duration: When a FlowFile is penalized, it is not re-selected for processing for the Penalty Duration period, giving upstream systems time to recover before retry.
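The retry-counter pattern described above can be sketched as plain Python. The retry.count attribute name, the MAX_RETRIES limit, and the route names are assumptions for this example; in NiFi itself the increment would typically be an UpdateAttribute expression such as ${retry.count:replaceNull(0):plus(1)}, with RouteOnAttribute making the routing decision.

```python
from typing import Dict, List, Tuple

MAX_RETRIES = 3  # assumed limit for this sketch

def route_failure(attributes: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Mimic UpdateAttribute (increment the counter) followed by RouteOnAttribute."""
    retries = int(attributes.get("retry.count", "0")) + 1
    updated = {**attributes, "retry.count": str(retries)}
    route = "retry" if retries <= MAX_RETRIES else "dead-letter"
    return route, updated

# Simulate one FlowFile failing four times in a row.
attributes: Dict[str, str] = {"filename": "orders.csv"}
routes: List[str] = []
for _ in range(4):
    route, attributes = route_failure(attributes)
    routes.append(route)
```

The first three failures loop back for another attempt; the fourth exceeds the limit and is sent to the dead-letter path instead of retrying forever.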

What prevents a processor in a persistent error state from consuming all NiFi processing threads in a tight loop?

Yield Duration: the processor backs off for the configured yield period before being rescheduled after a failure

✓ Well done — Yield Duration provides a mandatory back-off period after a processor yields, preventing error spin-loops.

Back-pressure on the failure connection

✗ Try again — back-pressure limits queue size, not processor execution rate during errors.

Automatic processor shutdown after 3 consecutive failures

✗ Try again — NiFi does not automatically stop processors after N failures; Yield Duration is the throttle mechanism.

What is the difference between Yield Duration and Penalty Duration in NiFi?

Yield Duration pauses the entire processor; Penalty Duration delays re-processing of a specific penalized FlowFile only

✓ Well done — Yield affects the whole processor's scheduling; Penalty affects individual FlowFile eligibility for processing.

Both pause the processor for the same duration but Penalty applies only to cluster nodes

✗ Try again — they affect different scopes: processor vs individual FlowFile.

Yield Duration is configured globally; Penalty Duration is per-FlowFile attribute

✗ Try again — both are configured on the processor; Yield is processor-level and Penalty applies per FlowFile.

22. What is the SplitText processor and how do you control split behavior?

SplitText is a NiFi processor that splits a FlowFile containing multiple lines of text into multiple smaller FlowFiles, each containing a configurable number of lines. It is the workhorse for splitting large text, CSV, or log files into processable chunks before parallel processing.

Key configuration properties:

Line Split Count: The number of lines per output FlowFile. Set to 1 for one FlowFile per line; set to 1000 for batches of 1000 lines. A value of 0 means no line-count limit (used with Maximum Fragment Size).

Maximum Fragment Size: Optional maximum byte size per output FlowFile. When the current fragment reaches this size during splitting, a new fragment begins. Useful when downstream systems have size limits.

Header Line Count: Number of header lines to include in every output FlowFile (e.g., 1 for a CSV header row). The header is prepended to every fragment so each fragment is independently parseable as a complete CSV file.

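Header Line Marker Characters: The leading character(s) that identify a line as a header line. This is an alternative to Header Line Count for files where the number of header lines is not fixed, and it is ignored when Header Line Count is non-zero.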

SplitText sets these attributes on each output FlowFile: fragment.identifier (UUID shared by all fragments from the same original), fragment.index (1-based fragment number), and fragment.count (total number of fragments). These attributes enable MergeContent to reassemble fragments in the correct order.
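The interaction between Line Split Count, Header Line Count, and the fragment attributes can be sketched as follows. This is a simplified model rather than the processor's implementation: split_text is a hypothetical function, Maximum Fragment Size is not modelled, and the sample CSV is made up.

```python
import uuid
from typing import Dict, List, Tuple

Fragment = Tuple[Dict[str, str], str]  # (attributes, content)

def split_text(content: str, line_split_count: int, header_line_count: int = 0) -> List[Fragment]:
    """Split content into fragments of at most line_split_count body lines each."""
    if line_split_count <= 0:
        raise ValueError("this sketch requires a positive line_split_count")
    lines = content.splitlines(keepends=True)
    header = lines[:header_line_count]
    body = lines[header_line_count:]
    chunks = [body[i:i + line_split_count] for i in range(0, len(body), line_split_count)]
    identifier = str(uuid.uuid4())  # shared by every fragment of this input
    fragments: List[Fragment] = []
    for index, chunk in enumerate(chunks, start=1):  # 1-based, as described above
        attributes = {
            "fragment.identifier": identifier,
            "fragment.index": str(index),
            "fragment.count": str(len(chunks)),
        }
        # The header lines are repeated at the top of every fragment.
        fragments.append((attributes, "".join(header + chunk)))
    return fragments

csv_content = "id,name\n1,ada\n2,grace\n3,linus\n"
fragments = split_text(csv_content, line_split_count=2, header_line_count=1)
```

With these inputs the sketch yields two fragments, each starting with the id,name header, so either one can be parsed as a complete CSV file on its own.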

How do you ensure every SplitText output FlowFile for a CSV file includes the header row?

Set the Header Line Count property to 1 so SplitText prepends the header to every output fragment

✓ Well done — Header Line Count causes SplitText to copy the specified number of header lines into every output FlowFile.

Use UpdateAttribute to add a header attribute to each output FlowFile

✗ Try again — an attribute is not the same as content; the header needs to be in the content of each fragment for CSV parsers to recognize it.

Configure Line Split Count to include the header line in the count for the first fragment only

✗ Try again — that would only include the header in the first fragment, not all fragments.

What FlowFile attribute set by SplitText helps MergeContent reassemble fragments in the original order?

fragment.size

✗ Try again: fragment.size records the byte size of a fragment and says nothing about its position.

fragment.index

✓ Well done — fragment.index (combined with fragment.identifier) tells MergeContent the position of each fragment in the original sequence.

segment.original.filename

✗ Try again: segment.original.filename records the name of the FlowFile that was split; fragment.index is the ordering attribute for SplitText output.

23. What is the MergeContent processor and how is it used?

MergeContent is a NiFi processor that combines multiple FlowFiles into a single FlowFile. It is the counterpart to processors like SplitText and SplitJson, enabling a scatter-gather pattern: split a large FlowFile into pieces for parallel processing, then merge the results back together.

MergeContent supports two merge strategies:

Defragment: Reassembles fragments produced by a split operation. It reads the fragment.identifier and fragment.count attributes and waits until all fragments with the same identifier have arrived before merging them in order. This mode requires fragment attributes to be present.

Bin-Packing Algorithm: Collects FlowFiles and merges when one of several triggers fires — minimum and maximum FlowFile count, minimum and maximum bin size in bytes, or a maximum wait time. Used for batching many small FlowFiles into a larger one for efficient downstream writing (e.g., batching records before writing to S3 as Parquet).

Output format options include: Binary Concatenation (concatenate raw content), TAR (create a TAR archive), ZIP (create a ZIP archive), and FlowFileStream v3 (NiFi's internal format that preserves all attributes of each constituent FlowFile). The FlowFileStream format is used with UnpackContent to later unpack the merged FlowFile back into individual FlowFiles with attributes intact.
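The Defragment strategy can be sketched as a small Python class. Defragmenter and make_fragment are hypothetical names, and the sketch ignores timeouts and bin limits; it only shows how fragment.identifier groups fragments, fragment.count signals completeness, and fragment.index restores order.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Fragment = Tuple[Dict[str, str], str]  # (attributes, content)

class Defragmenter:
    """Collects fragments per identifier and merges once all have arrived."""

    def __init__(self) -> None:
        self.bins: Dict[str, List[Fragment]] = defaultdict(list)

    def offer(self, fragment: Fragment) -> Optional[str]:
        attributes, _content = fragment
        identifier = attributes["fragment.identifier"]
        pending = self.bins[identifier]
        pending.append(fragment)
        if len(pending) < int(attributes["fragment.count"]):
            return None  # still waiting for more fragments
        del self.bins[identifier]
        ordered = sorted(pending, key=lambda item: int(item[0]["fragment.index"]))
        return "".join(content for _attrs, content in ordered)

def make_fragment(index: int, content: str) -> Fragment:
    """Build a fragment of a hypothetical three-part split."""
    attributes = {
        "fragment.identifier": "abc",
        "fragment.index": str(index),
        "fragment.count": "3",
    }
    return attributes, content

merger = Defragmenter()
# Fragments arrive out of order, as they might after parallel processing.
results = [
    merger.offer(make_fragment(3, "c\n")),
    merger.offer(make_fragment(1, "a\n")),
    merger.offer(make_fragment(2, "b\n")),
]
```

The first two offers return None; the third completes the bin and returns the content in index order, regardless of arrival order.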

In Defragment merge strategy, what attribute pair does MergeContent use to know when all fragments of a split have arrived?

fragment.identifier and fragment.count

✓ Well done — fragment.identifier groups fragments belonging to the same original FlowFile; fragment.count tells MergeContent how many to expect.

fragment.index and uuid

✗ Try again — uuid is unique per FlowFile and cannot group fragments; fragment.identifier is the grouping key.

filename and entryDate

✗ Try again — these core attributes are not the fragment coordination attributes used by MergeContent defragment mode.

Which merge output format preserves all FlowFile attributes when packing multiple FlowFiles so they can later be individually restored by UnpackContent?

ZIP

✗ Try again — ZIP archives preserve file content but not NiFi FlowFile attributes.

FlowFileStream v3

✓ Well done — FlowFileStream v3 is NiFi's native format that packages both content and all attributes of each constituent FlowFile.

Binary Concatenation

✗ Try again — binary concatenation joins raw bytes without any attribute preservation or record boundary information.

24. What is the InvokeHTTP processor and what are key configuration considerations?

InvokeHTTP is NiFi's most flexible HTTP client processor. It sends HTTP requests to configurable URLs using any HTTP method (GET, POST, PUT, PATCH, DELETE) and routes the response to different relationships based on the HTTP response code. It is the Swiss-army knife for REST API integration in NiFi flows.

Key configuration properties:

HTTP Method: The method to use — can be static or an EL expression like ${http.method} to pick dynamically from an attribute.

Remote URL: The target endpoint URL, EL-supported: https://api.example.com/users/${user.id}.

SSL Context Service: A StandardSSLContextService reference for HTTPS endpoints requiring client certificates or custom CA trust.

Send Message Body: Whether to include the FlowFile content as the HTTP request body. Set to false for GET/DELETE requests.

Request Content-Type: Typically application/json or dynamically from ${mime.type}.

Relationships: Response (2xx responses — the response body becomes a new FlowFile), Original (the original request FlowFile), Retry (5xx responses), No Retry (4xx client errors), Failure (network failures, connection timeouts).
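The routing rules above can be summarized as a small decision function. This is a sketch of the mapping as described here, not InvokeHTTP's source code: route_invoke_http is a hypothetical name, a status of None is used to represent a network-level failure, and sending other non-2xx codes (1xx, 3xx) to No Retry is an assumption based on the processor's relationship descriptions.

```python
from typing import List, Optional

def route_invoke_http(status_code: Optional[int]) -> List[str]:
    """Return the relationships that receive FlowFiles for a given outcome."""
    if status_code is None:
        # No HTTP response at all: connection refused, timeout, and so on.
        return ["Failure"]
    if 200 <= status_code <= 299:
        # The request FlowFile and the new response FlowFile are both emitted.
        return ["Original", "Response"]
    if 500 <= status_code <= 599:
        # Server-side errors may succeed on a later attempt.
        return ["Retry"]
    # Client errors and other non-success codes will not improve on retry.
    return ["No Retry"]

ok_routes = route_invoke_http(200)
not_found_routes = route_invoke_http(404)
unavailable_routes = route_invoke_http(503)
timeout_routes = route_invoke_http(None)
```

Separating Retry from No Retry lets the flow loop only the requests that have a chance of succeeding, while client errors go straight to an error-handling path.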

To which InvokeHTTP relationship does a 404 Not Found HTTP response route?

Retry

✗ Try again — Retry is for 5xx server errors; 4xx client errors like 404 go to No Retry because retrying will not fix a client-side error.

No Retry

✓ Well done — 4xx responses indicate client errors that will not be resolved by retrying the same request.

Failure

✗ Try again — Failure is for network-level issues (connection refused, timeout), not HTTP 4xx responses.

After a successful InvokeHTTP call, where does the HTTP response body appear in the NiFi flow?

It is added as an attribute to the Original FlowFile

✗ Try again — response bodies can be large; NiFi puts them in a new FlowFile on the Response relationship, not in an attribute.

As the content of a new FlowFile on the Response relationship

✓ Well done — InvokeHTTP creates a new FlowFile whose content is the HTTP response body and routes it to the Response relationship.

It is written to the provenance repository as a transit URI

✗ Try again — provenance records the URI for audit; the response body itself goes to the Response FlowFile.

25. What is the PublishKafka and ConsumeKafka processor pair and what are key configuration options?

PublishKafka and ConsumeKafka (and their record-aware variants PublishKafkaRecord and ConsumeKafkaRecord) are NiFi's integration points with Apache Kafka.

ConsumeKafka: Subscribes to one or more Kafka topics using the Kafka consumer group protocol. Key properties include: Kafka Brokers (bootstrap servers), Topic Name(s) (EL-supported), Group ID (consumer group name), Offset Reset (earliest/latest for new consumer groups), Max Poll Records (how many records per poll), and Honor Transactions (whether to respect Kafka transactional producers). By default each Kafka message becomes its own FlowFile whose content is the message value; setting a Message Demarcator packs multiple messages from a poll into a single FlowFile.

PublishKafka: Produces messages to a Kafka topic. Key properties: Topic Name (static or EL-expression like ${kafka.topic} for per-FlowFile routing), Failure Strategy (Route to Failure vs Roll Back), Message Key Field (for keyed messages), Delivery Guarantee (Best Effort, Wait for Local Ack, Wait for Replication). With PublishKafkaRecord, NiFi reads records from the FlowFile and publishes one Kafka message per record.

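In NiFi 2.x, both processors obtain their broker, SSL, and SASL settings from a Kafka connection Controller Service (Kafka3ConnectionService); in NiFi 1.x these settings are configured directly on the processor. Schema registry integration is handled by the record readers and writers used by the record-aware variants.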

What PublishKafka Delivery Guarantee setting ensures a message is acknowledged by all in-sync replicas before success is confirmed?

Best Effort

✗ Try again — Best Effort (acks=0) does not wait for any acknowledgment.

Wait for Local Acknowledgment

✗ Try again — local acknowledgment (acks=1) only confirms the leader received it, not the replicas.

Wait for Replication

✓ Well done — Wait for Replication (acks=all) ensures all in-sync replicas have persisted the message before NiFi considers the publish successful.

What is the key behavioral difference between ConsumeKafka and ConsumeKafkaRecord?

ConsumeKafkaRecord parses the polled batch as structured records and exposes them as a NiFi record set; ConsumeKafka treats each message as an opaque byte FlowFile

✓ Well done — ConsumeKafkaRecord integrates with RecordReader and RecordWriter, enabling schema-aware, record-level processing of Kafka message content.

ConsumeKafkaRecord supports only JSON topics; ConsumeKafka supports all formats

✗ Try again — ConsumeKafkaRecord supports any format for which a RecordReader exists (JSON, Avro, CSV, Parquet, etc.).

ConsumeKafka is deprecated and should not be used in new flows

✗ Try again — ConsumeKafka remains valid for use cases where schema-level processing is not needed.

26. What is the ExecuteScript processor and what scripting languages does it support?

ExecuteScript is NiFi's escape hatch for custom logic that cannot be expressed with built-in processors. It allows you to write arbitrary script code that executes within the NiFi processor lifecycle — accessing incoming FlowFiles, creating new FlowFiles, modifying attributes, reading and writing content, and routing FlowFiles to relationships.

Supported scripting languages (via the Java Scripting Engine API):

Groovy (most popular in NiFi community — expressive, JVM-native)
Python (via Jython — Python 2.7 dialect; C extensions unavailable)
ECMAScript / JavaScript (via Nashorn in Java 8, deprecated in Java 11+)
Ruby (via JRuby)
Lua

In Groovy, a typical ExecuteScript pattern looks like:

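import org.apache.nifi.processor.io.StreamCallback

// Take one FlowFile from the incoming queue; exit if none is available
def flowFile = session.get()
if (!flowFile) return
// Rewrite the content in a streaming callback, upper-casing the text
flowFile = session.write(flowFile, { inputStream, outputStream ->
    def text = inputStream.getText('UTF-8')
    outputStream.write(text.toUpperCase().getBytes('UTF-8'))
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)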

ExecuteScript exposes these variables to the script: session (ProcessSession), context (ProcessContext), log (ComponentLog), and the predefined relationships REL_SUCCESS and REL_FAILURE. The script can use any Java class visible to the processor's classloader, plus any JARs added through the Module Directory property.

Why can Jython (Python in ExecuteScript) not use libraries like NumPy or Pandas?

NiFi's security policy blocks Python package imports

✗ Try again — NiFi does not have a security restriction specifically on Python packages.

Jython is a Python 2.7 JVM implementation that cannot run C extension modules required by NumPy and Pandas

✓ Well done — C-extension libraries like NumPy require CPython; Jython is a pure-JVM Python runtime that cannot load native C extensions.

Python packages must be installed in NiFi's lib directory as NARs

✗ Try again — NARs are for Java-based NiFi extensions; Python packages are not packaged this way.

Which NiFi object gives ExecuteScript access to read and write FlowFile content and transfer FlowFiles to relationships?

session (ProcessSession)

✓ Well done — the ProcessSession provides all FlowFile lifecycle operations: get, create, write, transfer, remove.

context (ProcessContext)

✗ Try again — ProcessContext provides access to processor properties and Controller Services, not FlowFile operations.

log (ComponentLog)

✗ Try again — the logger is for writing log and bulletin messages only.

27. What is the JoltTransformJSON processor and how do you use it?

JoltTransformJSON is a NiFi processor that transforms JSON content using the JOLT (JSON to JSON transformation) library. JOLT uses declarative JSON specifications (JOLT specs) to describe how an input JSON document should be restructured — renaming fields, changing nesting structure, filtering arrays, computing derived values — without writing imperative code.

JOLT supports several transformation types applied in a chain:

shift: The most common operation. Defines a mapping from input JSON paths to output JSON paths. Input paths can include wildcards (*), array indices ([]), and conditional matching.

default: Adds default values to the output for paths absent in the input.

remove: Removes specified paths from the document.

cardinality: Normalizes fields that may arrive as either a single value or an array into one consistent form (ONE or MANY).

sort: Alphabetically sorts JSON object keys.

A simple shift spec that renames firstName to first_name and lastName to last_name:

[{
  "operation": "shift",
  "spec": {
    "firstName": "first_name",
    "lastName": "last_name"
  }
}]

The JOLT spec itself supports NiFi EL, allowing dynamic spec construction from FlowFile attributes. The processor includes a Transform Tool in its configuration dialog for testing specs against sample input without running the full flow.
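To make the shift semantics concrete, here is a minimal Python sketch of what the spec above does to a flat document. It is not the JOLT library: apply_flat_shift is a hypothetical helper that handles only top-level keys mapped to string targets, with no wildcards or nesting, and the sample person document is invented for illustration. It does show one property of shift worth remembering: fields that the spec does not name do not appear in the output.

```python
import json


def apply_flat_shift(spec, document):
    """Copy each input key named in spec to its mapped output key.

    Keys missing from the spec are dropped, mirroring how a shift
    only emits what the spec explicitly maps.
    """
    output = {}
    for input_key, output_key in spec.items():
        if input_key in document:
            output[output_key] = document[input_key]
    return output


# The "spec" object from the shift operation shown above.
shift_spec = {"firstName": "first_name", "lastName": "last_name"}

# Hypothetical input document; "age" is not mapped by the spec.
person = {"firstName": "Ada", "lastName": "Lovelace", "age": 36}

shifted = apply_flat_shift(shift_spec, person)
print(json.dumps(shifted))
```

To keep unmapped fields in a real spec, you would typically add a wildcard entry that copies everything else through.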

Which JOLT operation type defines the mapping of input JSON fields to new output JSON paths?

shift

✓ Well done — the shift operation is the primary JOLT operation for restructuring and renaming JSON document fields.

default

✗ Try again — default adds missing fields with default values; it does not remap existing fields.

cardinality

✗ Try again — cardinality normalizes value-vs-array inconsistencies; it does not perform field remapping.

What NiFi feature in the JoltTransformJSON processor lets you test your JOLT spec against sample JSON before running the actual flow?

The Transform Tool built into the processor configuration dialog

✓ Well done — the Transform Tool lets you paste sample input JSON and see the JOLT-transformed output immediately within the NiFi UI.

Data provenance replay on a test FlowFile

✗ Try again — provenance replay re-runs a real FlowFile through the flow; the Transform Tool is an offline spec tester.

NiFi's built-in unit test framework

✗ Try again — NiFi does have a unit test framework for processor developers, but the Transform Tool is the UI-level interactive testing feature.

28. What is the PutDatabaseRecord processor and how does it differ from ExecuteSQL?

PutDatabaseRecord writes structured records from a FlowFile into a relational database table using JDBC. It is the write counterpart to QueryDatabaseTable and ExecuteSQL. Unlike ExecuteSQL — which executes arbitrary SQL statements — PutDatabaseRecord works with the NiFi record model: it reads records from the FlowFile using a RecordReader and generates INSERT, UPSERT, INSERT_IGNORE, UPDATE, or DELETE statements automatically based on the target table schema.

Key configuration properties:

Record Reader: Parses the FlowFile content (JsonTreeReader, CSVReader, AvroReader, etc.).

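Statement Type: INSERT, UPDATE, UPSERT (insert or update), INSERT_IGNORE (skip rows that hit a key conflict), DELETE, or Use statement.type Attribute (take the type from a FlowFile attribute). UPSERT and INSERT_IGNORE are available only for databases whose adapter supports them.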

Database Connection Pooling Service: A DBCPConnectionPool reference.

Table Name: Static name or EL expression like ${target.table}.

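Translate Field Names: When enabled, record field names are matched to column names loosely (ignoring case and underscores), so a field such as firstName can populate a column such as first_name. When disabled, field names must match column names exactly.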

The advantage over ExecuteSQL for writes is that PutDatabaseRecord handles schema mapping automatically — it queries the target table's metadata to determine column types and order, then generates correct parameterized SQL. This avoids hand-crafting INSERT statements and handles type coercion automatically.
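The idea of generating parameterized SQL from table metadata can be sketched in a few lines of Python against an in-memory SQLite database. This is an illustration of the technique, not NiFi's implementation: insert_records and normalize are hypothetical helpers, the users table is invented, and the loose name matching (ignore case and underscores) is an assumption of this sketch.

```python
import sqlite3


def normalize(name):
    """Loose name matching for this sketch: ignore case and underscores."""
    return name.replace("_", "").upper()


def insert_records(conn, table, records):
    """Build one parameterized INSERT per record from the table's own metadata."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    by_normalized = {normalize(col): col for col in columns}
    inserted = 0
    for record in records:
        # Keep only the fields that correspond to a real column.
        matched = {
            by_normalized[normalize(field)]: value
            for field, value in record.items()
            if normalize(field) in by_normalized
        }
        if not matched:
            continue
        column_list = ", ".join(matched)
        placeholders = ", ".join("?" for _ in matched)
        conn.execute(
            f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})",
            list(matched.values()),
        )
        inserted += 1
    conn.commit()
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT)")

# Records as a RecordReader might present them, with camelCase field names.
records = [{"userId": 1, "firstName": "Ada"}, {"userId": 2, "firstName": "Grace"}]
count = insert_records(conn, "users", records)
rows = conn.execute("SELECT user_id, first_name FROM users ORDER BY user_id").fetchall()
print(count, rows)
```

Values are always passed as bind parameters rather than concatenated into the statement, which is the property that makes generated SQL safe to run against untrusted record content.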

What must you configure in PutDatabaseRecord to tell NiFi how to parse the FlowFile content before writing to the database?

The SQL Statement property with a static INSERT template

✗ Try again — PutDatabaseRecord generates SQL automatically from the record schema; you do not write INSERT statements.

A RecordReader Controller Service that knows how to parse the FlowFile format (JSON, CSV, Avro, etc.)

✓ Well done — the RecordReader parses the FlowFile into a stream of records that PutDatabaseRecord then maps to database rows.

An EvaluateJsonPath processor upstream to extract each field into attributes

✗ Try again — PutDatabaseRecord works directly with the record model; attribute extraction is not necessary.

Which PutDatabaseRecord Statement Type would you use to insert new rows and update existing rows based on a primary key?

INSERT

✗ Try again — INSERT fails on duplicate key conflicts; it does not update existing rows.

UPSERT

✓ Well done — UPSERT inserts new rows and updates existing ones based on the primary key, handling both cases in one statement type.

INSERT_IGNORE

✗ Try again — INSERT_IGNORE skips duplicate key conflicts silently but does not update the existing row.

29. What is the ListSFTP and FetchSFTP processor pattern and how does it work?

ListSFTP and FetchSFTP implement a two-stage pattern for ingesting files from SFTP servers, separating listing from fetching. This design also appears for S3 (ListS3/FetchS3Object), Azure Blob Storage, HDFS, and local filesystems.

ListSFTP: Connects to the SFTP server and lists files in the configured remote directory (with optional recursion and filename filtering by regex). For each file found, it creates a FlowFile with zero bytes of content but rich attributes: filename, path, sftp.remote.host, sftp.remote.port, file.size, file.lastModifiedTime, etc. Uses NiFi State Management to track already-listed files, emitting only new or modified files on subsequent runs.

FetchSFTP: Receives the listing FlowFiles and for each one downloads the actual file content from the SFTP server using the attributes. The result is a FlowFile whose content is the downloaded file bytes.

Why split listing from fetching? Listing is cheap (a directory read) while fetching is expensive (one network transfer per file). Separating them lets ListSFTP run on the Primary Node only, so files are not listed twice, while FetchSFTP downloads many files at once: raise its Concurrent Tasks setting, and in a cluster load-balance the connection between the two processors so every node shares the downloads.
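The two-stage shape can be sketched in Python with a local temporary directory standing in for the SFTP server. Everything here is illustrative: list_new_files, fetch, and fetch_all are hypothetical helpers, the state dictionary stands in for NiFi State Management, and tracking by last-modified timestamp is only one possible listing strategy.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def list_new_files(directory, state):
    """Listing stage: emit metadata only, for files newer than the stored timestamp."""
    last_seen = state.get("last_modified", -1)
    listings = []
    for name in sorted(os.listdir(directory)):
        info = os.stat(os.path.join(directory, name))
        if info.st_mtime_ns > last_seen:
            listings.append({
                "filename": name,
                "path": directory,
                "file.size": info.st_size,
                "mtime_ns": info.st_mtime_ns,
            })
    if listings:
        # Remember the newest timestamp so the next run skips these files.
        state["last_modified"] = max(item["mtime_ns"] for item in listings)
    return listings


def fetch(listing):
    """Fetch stage: use the listing attributes to read the actual bytes."""
    with open(os.path.join(listing["path"], listing["filename"]), "rb") as handle:
        return listing["filename"], handle.read()


def fetch_all(listings, concurrent_tasks=4):
    """Download several files at once, like raising Concurrent Tasks on the fetcher."""
    with ThreadPoolExecutor(max_workers=concurrent_tasks) as pool:
        return dict(pool.map(fetch, listings))


with tempfile.TemporaryDirectory() as remote_dir:
    for i in range(3):
        with open(os.path.join(remote_dir, f"file{i}.txt"), "w") as handle:
            handle.write(f"payload {i}")

    state = {}
    first_listing = list_new_files(remote_dir, state)
    contents = fetch_all(first_listing)
    second_listing = list_new_files(remote_dir, state)  # nothing new this time
```

The listing entries carry no content at all, so the listing stage stays fast no matter how large the files are; only the fetch stage scales with file size.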

Why does ListSFTP produce FlowFiles with zero content, and what do those FlowFiles contain instead?

They contain attributes with the file metadata (filename, path, size, timestamp), which FetchSFTP uses to download the actual content

✓ Well done — the listing stage is intentionally metadata-only, keeping it fast and decoupled from the slower download stage.

The content is encrypted and must be decrypted by FetchSFTP

✗ Try again — there is no encryption between ListSFTP and FetchSFTP; the separation is about metadata vs content.

ListSFTP cannot read file content due to SFTP protocol limitations

✗ Try again — SFTP supports file downloads; the split is a design choice, not a protocol constraint.

How can you speed up bulk file ingestion using the ListSFTP + FetchSFTP pattern?

Run multiple ListSFTP processors in parallel on the same directory

✗ Try again — multiple ListSFTP instances would duplicate listings; increase FetchSFTP concurrent tasks instead.

Increase the concurrent task count on FetchSFTP to download multiple files in parallel

✓ Well done — FetchSFTP is the bottleneck; more concurrent tasks allow it to download N files simultaneously from the listing queue.

Enable the Turbo Mode property on ListSFTP

✗ Try again — there is no Turbo Mode property; concurrent FetchSFTP tasks are the parallelism lever.

30. What is the LookupRecord processor used for?

LookupRecord is a record-aware NiFi processor that enriches records within a FlowFile by looking up values from an external source — a database, a distributed map cache, a REST API, or a file-based lookup table — and adding the result as a new field in each record.

LookupRecord works with three components:

RecordReader: Parses the incoming FlowFile into records.

RecordWriter: Serializes enriched records back to the output FlowFile.

LookupService Controller Service: The enrichment data source. Implementations include:

SimpleCsvFileLookupService: Looks up values from an in-memory CSV file — useful for small static reference tables.
IPLookupService: GeoIP enrichment from a MaxMind database.
DatabaseRecordLookupService: Executes a parameterized SQL query against a JDBC source for each lookup.
DistributedMapCacheLookupService: Looks up values from a distributed in-memory cache populated by another part of the flow.
RestLookupService: Calls a REST API endpoint for each lookup.

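Configuration specifies which record fields supply the lookup key(s) and which path in the output record receives the looked-up value. With the Routing Strategy set to route to matched or unmatched, records with no match go to the unmatched relationship for separate handling; with the default strategy everything is routed to success, whether or not a match was found.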
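As a rough model of the enrichment step, the Python sketch below uses a plain dictionary in place of a LookupService and splits records into matched and unmatched groups. The helper lookup_records, the field names, and the sample data are all hypothetical.

```python
def lookup_records(records, key_field, lookup_table, result_field):
    """Enrich each record with a looked-up value; separate records with no match."""
    matched, unmatched = [], []
    for record in records:
        value = lookup_table.get(record.get(key_field))
        if value is None:
            unmatched.append(record)  # left untouched for separate handling
        else:
            matched.append({**record, result_field: value})
    return matched, unmatched


# Stand-in for a small static reference table.
country_names = {"US": "United States", "DE": "Germany"}

events = [
    {"id": 1, "country_code": "US"},
    {"id": 2, "country_code": "ZZ"},
]

matched, unmatched = lookup_records(events, "country_code", country_names, "country_name")
print(matched)
print(unmatched)
```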

Which LookupRecord LookupService would you use to add GeoIP country and city information to network log records?

SimpleCsvFileLookupService

✗ Try again — a CSV file cannot practically contain all IP ranges for GeoIP enrichment.

IPLookupService with a MaxMind GeoIP database

✓ Well done — IPLookupService uses the MaxMind GeoLite2 or GeoIP2 database to enrich IP addresses with geographic and ASN information.

DatabaseRecordLookupService with a SQL SELECT on each IP

✗ Try again — querying a database per record would be extremely slow for high-volume network logs; IPLookupService uses an in-memory database optimized for IP range lookups.

To which relationship does LookupRecord route records for which no match was found in the LookupService?

failure

✗ Try again — failure is for processing errors; unmatched lookups are a distinct, non-error case with their own relationship.

unmatched

✓ Well done — unmatched allows you to handle records with no lookup result differently from both matched records and processing failures.

not-found

✗ Try again — the actual relationship name is unmatched, not not-found.

31. What is the PartitionRecord processor and what is a common use case?

PartitionRecord is a record-aware NiFi processor that reads records from an input FlowFile and groups them into separate output FlowFiles based on one or more field values. All records sharing the same value for the partition field(s) go to the same output FlowFile; records with different values produce different FlowFiles.

For example, given a FlowFile of 10,000 records with a country field, adding a property named country with the value /country makes PartitionRecord produce one FlowFile per distinct country. Each output FlowFile carries a country attribute holding the partition value its records share.

The partitioning key is expressed using NiFi RecordPath syntax: /country for a top-level field, /address/state for a nested field, /tags[0] for an array element. Multiple partition keys can be added as separate User-Defined Properties, producing compound partitions.

Common use cases:

Routing records to different Kafka topics by type: Partition by /event.type, then UpdateAttribute sets kafka.topic from the partition attribute, enabling PublishKafka to route each FlowFile to its appropriate topic.
Partitioned file writes to object storage: Partition by /date and /region to write into Hive-compatible partition paths in S3 (e.g., dt=2024-01-15/region=us-east/data.parquet).
Database routing: Partition by customer ID to route records to different database shards.
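The grouping behaviour, including compound keys and the partitioned-path use case, can be sketched in Python. This is a simplified model rather than RecordPath itself: evaluate_path is a hypothetical helper that resolves only slash-separated field names, and the records and path layout are invented for illustration.

```python
from collections import defaultdict


def evaluate_path(record, path):
    """Resolve a simple slash-separated path such as /address/state."""
    value = record
    for part in path.strip("/").split("/"):
        value = value[part]
    return value


def partition_records(records, partition_paths):
    """Group records by the values at each path.

    partition_paths maps an attribute name to a path, much like the
    user-defined properties on the processor.
    """
    groups = defaultdict(list)
    for record in records:
        key = tuple(evaluate_path(record, path) for path in partition_paths.values())
        groups[key].append(record)
    # Each output item models one FlowFile: partition attributes plus its records.
    return [
        {"attributes": dict(zip(partition_paths, key)), "records": grouped}
        for key, grouped in groups.items()
    ]


records = [
    {"date": "2024-01-15", "region": "us-east", "amount": 10},
    {"date": "2024-01-15", "region": "eu-west", "amount": 7},
    {"date": "2024-01-15", "region": "us-east", "amount": 3},
]

partitions = partition_records(records, {"date": "/date", "region": "/region"})

# Build a Hive-style partition path from the attributes of each output.
paths = [
    "dt={date}/region={region}/data.parquet".format(**p["attributes"])
    for p in partitions
]
print(paths)
```

Because the partition values end up as attributes, downstream steps can build topic names or storage paths without reopening the record content.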

What syntax does PartitionRecord use to specify the field by which to partition records?

NiFi RecordPath syntax (e.g., /country or /address/state)

✓ Well done — PartitionRecord uses RecordPath, NiFi's record field addressing language, to specify partition keys.

NiFi Expression Language (e.g., ${record:field('country')})

✗ Try again — EL operates on FlowFile attributes; RecordPath is used for record-level field access in PartitionRecord.

JSONPath (e.g., $.country)

✗ Try again — NiFi uses its own RecordPath syntax, not standard JSONPath, for record-aware processors.

After PartitionRecord splits records by /event.type, how can you route each output FlowFile to a different Kafka topic matching its event type?

Use UpdateAttribute to copy the event.type partition attribute to the kafka.topic attribute, then PublishKafka uses ${kafka.topic} as the topic name

✓ Well done — PartitionRecord sets the partition field value as a FlowFile attribute, which UpdateAttribute can map to kafka.topic for dynamic topic routing.

Use RouteOnContent with a regex for each event type value

✗ Try again — RouteOnContent reads FlowFile content bytes, not record fields; the partition attribute approach is simpler and more correct.

Configure a separate PublishKafka instance per event type, each receiving one output port from PartitionRecord

✗ Try again — a dynamic topic name via attribute is far simpler and does not require multiple PublishKafka instances.

32. What is the ConvertRecord processor and how is it used for format conversion?

ConvertRecord is a NiFi processor that converts FlowFile content from one data format to another using the NiFi record model. Its sole job is to read records using one format and write them out in another. The conversion logic lives entirely in the RecordReader and RecordWriter Controller Services — not in ConvertRecord itself.

Configuration is minimal: just a Record Reader and a Record Writer. Examples:

CSV → JSON: CSVReader + JsonRecordSetWriter
JSON → Avro: JsonTreeReader + AvroRecordSetWriter
Avro → Parquet: AvroReader + ParquetRecordSetWriter
JSON → CSV: JsonTreeReader + CSVRecordSetWriter

The schema used for conversion can be read from the data itself (self-describing formats such as Avro and Parquet embed it), inferred from the content (for JSON or CSV), declared inline in the reader/writer configuration, or fetched from a Schema Registry. Using a Schema Registry ensures output always conforms to a validated, versioned schema.

ConvertRecord performs streaming conversion — records are read and written one at a time without loading the entire FlowFile into memory, making it suitable for very large files. The output FlowFile's mime.type attribute is automatically updated by the writer to reflect the new format.
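The reader/writer split and the one-record-at-a-time flow can be illustrated with Python's standard library. This is only an analogy: convert_csv_to_json_lines is a hypothetical function, csv.DictReader plays the role of the Record Reader, and a JSON-lines writer plays the role of the Record Writer. With no schema involved, every value stays a string.

```python
import csv
import io
import json


def convert_csv_to_json_lines(reader_stream, writer_stream):
    """Read one CSV row at a time and write it out immediately as a JSON line."""
    count = 0
    for row in csv.DictReader(reader_stream):
        # Only the current row is held in memory, never the whole input.
        writer_stream.write(json.dumps(row) + "\n")
        count += 1
    return count


source = io.StringIO("id,city\n1,Austin\n2,Berlin\n")
target = io.StringIO()

converted = convert_csv_to_json_lines(source, target)
lines = target.getvalue().splitlines()
print(converted, lines)
```

Swapping the output format would mean changing only the writer half of the function, which is the same reason ConvertRecord needs nothing more than a different writer service.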

To convert a FlowFile from CSV format to Avro format using ConvertRecord, what must you configure?

A CSVReader as the Record Reader and an AvroRecordSetWriter as the Record Writer

✓ Well done — ConvertRecord delegates all format logic to the reader and writer; changing formats requires only swapping these two services.

A custom JOLT spec defining the CSV-to-Avro mapping

✗ Try again — JOLT is for JSON-to-JSON transformation; ConvertRecord handles cross-format conversion through readers and writers.

An ExecuteScript processor to manually serialize each field to Avro binary format

✗ Try again — ConvertRecord handles this automatically through the record model without custom scripting.

Why is ConvertRecord preferred over scripted format conversion for large files?

It streams records one at a time without loading the entire FlowFile into memory

✓ Well done — streaming conversion keeps memory usage constant regardless of FlowFile size, enabling gigabyte-scale format conversion.

It is faster than scripted conversion because it uses native code

✗ Try again — ConvertRecord is JVM-based; the advantage is memory efficiency, not native speed.

It automatically compresses output FlowFiles using GZIP

✗ Try again — compression is an optional writer feature; it is not an inherent ConvertRecord behavior.

33. What are the NiFi processor scheduling strategies?

NiFi provides two scheduling strategies controlling when a processor's onTrigger method is invoked:

Timer Driven (default): The processor is scheduled to run at a fixed time interval. The Run Schedule property sets the interval: 0 sec means run as fast as possible (pausing only when the processor yields), 10 sec means run once every 10 seconds. Most processors use timer-driven scheduling and draw their threads from the Timer Driven thread pool.

CRON Driven: The processor runs according to a CRON expression, enabling time-of-day or day-of-week scheduling. NiFi uses Quartz-style syntax, which starts with a seconds field: 0 0 * * * ? runs at the top of every hour; 0 30 9 ? * MON-FRI runs at 09:30 on weekdays. CRON-driven processors are triggered on that schedule but execute on the same Timer Driven thread pool.

Additional scheduling parameters:

Concurrent Tasks: How many threads can execute the processor simultaneously. Increasing concurrent tasks enables parallel FlowFile processing. Default is 1 for most processors. Processors are expected to be thread-safe; one annotated @TriggerSerially is limited to a single concurrent task.

Execution: All Nodes (default) or Primary Node Only. Primary Node Only is required for processors that would cause duplication if run on multiple cluster nodes simultaneously (ListSFTP, QueryDatabaseTable).
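The six-field layout of the CRON expressions above (seconds, minutes, hours, day of month, month, day of week) is easier to see with a small matcher. The Python sketch below is deliberately limited and is not a Quartz implementation: cron_matches is a hypothetical function that understands only *, ?, single numbers, and day-name values or ranges such as MON-FRI.

```python
from datetime import datetime

# Same order as datetime.weekday(): Monday is 0.
DAY_NAMES = ["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]


def _field_matches(field, value):
    """Match one numeric field: '*', '?', or a single integer."""
    if field in ("*", "?"):
        return True
    return int(field) == value


def _day_of_week_matches(field, weekday):
    """Match day-of-week: '*', '?', a name (MON), or a name range (MON-FRI)."""
    if field in ("*", "?"):
        return True
    if "-" in field:
        start, end = (DAY_NAMES.index(name) for name in field.split("-"))
        return start <= weekday <= end
    return DAY_NAMES.index(field) == weekday


def cron_matches(expression, when):
    """Return True if 'when' satisfies the six-field expression."""
    second, minute, hour, day_of_month, month, day_of_week = expression.split()
    return (
        _field_matches(second, when.second)
        and _field_matches(minute, when.minute)
        and _field_matches(hour, when.hour)
        and _field_matches(day_of_month, when.day)
        and _field_matches(month, when.month)
        and _day_of_week_matches(day_of_week, when.weekday())
    )


monday_six = datetime(2024, 1, 15, 6, 0, 0)    # a Monday
saturday_six = datetime(2024, 1, 13, 6, 0, 0)  # a Saturday

print(cron_matches("0 0 6 ? * MON-FRI", monday_six))
print(cron_matches("0 0 6 ? * MON-FRI", saturday_six))
```

Reading an expression field by field this way is a quick check that the seconds field has not been forgotten, which is the most common mistake when porting five-field Unix cron entries.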

A processor needs to run every weekday at 6:00 AM. Which scheduling strategy should you use?

Timer Driven with a 24h interval

✗ Try again — a 24h timer interval does not target specific days of the week or a specific time of day reliably.

CRON Driven with expression 0 0 6 ? * MON-FRI

✓ Well done — CRON-driven scheduling precisely targets specific times and days of the week.

Timer Driven with 0 sec and a business hours RouteOnAttribute gate

✗ Try again — that adds unnecessary complexity; CRON scheduling is the correct tool for time-of-day triggering.

What does increasing the Concurrent Tasks setting on a processor do?

It increases the processor's run schedule interval

✗ Try again — concurrent tasks control parallelism, not scheduling frequency.

It allows the processor to run on multiple threads simultaneously, processing multiple FlowFiles in parallel

✓ Well done — more concurrent tasks means the processor can execute onTrigger() on N threads at once, increasing throughput.

It increases the maximum number of FlowFiles that can be queued in the processor's input connections

✗ Try again — queue sizes are controlled by connection back-pressure settings, not the processor's concurrent task count.

34. What is the difference between EvaluateJsonPath and FlattenJson processors?

EvaluateJsonPath and FlattenJson both work with JSON content but serve fundamentally different purposes.

EvaluateJsonPath extracts specific values from a JSON payload using JSONPath expressions and writes those values either to FlowFile attributes or to the FlowFile content. It is a targeted extraction tool — you specify exactly which fields you want and where they go. Configuration requires one or more User-Defined Properties, each mapping a JSONPath expression to a destination attribute name. The JSON content itself is typically not modified when writing to attributes.

FlattenJson takes a nested JSON object and flattens its entire structure into a single-level key-value object, using configurable separator characters to compose the flattened key names. For example:

Input:  {"user": {"address": {"city": "Austin"}}}
Output: {"user.address.city": "Austin"}

FlattenJson is used to normalize deeply nested JSON into flat structures suitable for systems that expect flat schemas (relational databases, Elasticsearch, CSV). It operates on the entire document, producing a new content FlowFile with the flattened JSON.

The choice: use EvaluateJsonPath when you need specific field values as attributes for routing or enrichment; use FlattenJson when you need to restructure the entire document hierarchy into a flat representation for downstream storage.
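The contrast between targeted extraction and whole-document flattening can be shown side by side in Python. Both functions are hypothetical stand-ins: extract_path handles only simple dotted paths such as $.order.id, not full JSONPath, and flatten_json flattens nested objects while leaving arrays untouched, which is an assumption of this sketch.

```python
def extract_path(document, json_path):
    """Follow a simple dotted path like $.order.id and return the value found."""
    value = document
    for part in json_path.split("."):
        if part in ("$", ""):
            continue
        value = value[part]
    return value


def flatten_json(value, separator=".", prefix=""):
    """Collapse nested objects into one level, joining key names with the separator."""
    flat = {}
    for key, child in value.items():
        full_key = f"{prefix}{separator}{key}" if prefix else key
        if isinstance(child, dict) and child:
            flat.update(flatten_json(child, separator, full_key))
        else:
            flat[full_key] = child
    return flat


order = {"order": {"id": 42, "total": 99.50}}

# Targeted extraction: one chosen value becomes an attribute; content is unchanged.
attributes = {"order.id": str(extract_path(order, "$.order.id"))}

# Flattening: the whole document is rewritten as a single-level object.
flat_order = flatten_json(order)
flat_user = flatten_json({"user": {"address": {"city": "Austin"}}})

print(attributes, flat_order, flat_user)
```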

When would you choose EvaluateJsonPath over FlattenJson for working with JSON FlowFile content?

When you need to extract specific field values into FlowFile attributes for routing or conditional logic downstream

✓ Well done — EvaluateJsonPath is a targeted extraction tool; FlattenJson restructures the entire document.

When you need to load deeply nested JSON into a flat relational database table

✗ Try again — flattening a nested document to a relational structure is exactly what FlattenJson is designed for.

When the JSON contains arrays that must be expanded into separate rows

✗ Try again — array expansion is better handled by SplitJson or the record model, not EvaluateJsonPath or FlattenJson.

Given nested JSON {"order": {"id": 42, "total": 99.50}}, what does FlattenJson produce using a dot separator?

{"order.id": 42, "order.total": 99.50}

✓ Well done — FlattenJson joins parent and child key names with the separator character to produce flat keys.

[{"key": "order.id", "value": 42}, {"key": "order.total", "value": 99.50}]

✗ Try again — FlattenJson outputs a flat JSON object, not an array of key-value pairs.

{"id": 42, "total": 99.50}

✗ Try again — that drops the "order" parent key; FlattenJson preserves it by concatenating with the separator.

35. How does NiFi integrate with Apache Hadoop and HDFS?

NiFi provides a suite of processors for reading from and writing to HDFS (Hadoop Distributed File System) and integrates with the broader Hadoop ecosystem including Hive and HBase. The integration uses standard Hadoop client libraries and respects Hadoop authentication (Simple or Kerberos).

Key HDFS processors:

PutHDFS: Writes FlowFile content to a specified HDFS path. Supports configuring block size, replication factor, buffer size, compression codec (GZIP, Snappy, LZO), and write conflict resolution. The output path supports NiFi EL for dynamic path construction.

FetchHDFS: Reads a file from HDFS into a FlowFile. The path is taken from the path and filename attributes set by ListHDFS.

ListHDFS: Lists files in an HDFS directory recursively, producing one FlowFile per file with metadata attributes. Uses State Management to track already-listed files.

Hadoop configuration is provided via the Hadoop Configuration Resources property pointing to hdfs-site.xml and core-site.xml files. For Kerberos-secured clusters, configure the Kerberos Principal and Kerberos Keytab properties or reference a KerberosCredentialsService Controller Service.

NiFi also integrates with Apache Hive via SelectHiveQL (query execution) and PutHiveStreaming (transactional ACID inserts), and with HBase via PutHBaseCell, GetHBase, and ScanHBase processors.

What configuration files must be referenced in NiFi HDFS processors to connect to a Hadoop cluster?

hdfs-site.xml and core-site.xml

✓ Well done — these standard Hadoop configuration files contain NameNode address, replication settings, and authentication configuration.

nifi.properties and bootstrap.conf

✗ Try again — those are NiFi's own configuration files, not Hadoop cluster configuration.

yarn-site.xml only

✗ Try again — yarn-site.xml is for YARN resource management; HDFS connectivity requires hdfs-site.xml and core-site.xml.

Which Hive integration processor supports transactional ACID inserts into ORC-format Hive tables?

SelectHiveQL

✗ Try again — SelectHiveQL executes read queries; it does not write data.

PutHiveStreaming

✓ Well done — PutHiveStreaming uses the Hive Streaming API to write records transactionally into ACID-enabled Hive ORC tables.

PutHDFS with Hive metastore registration

✗ Try again — PutHDFS writes raw files to HDFS without Hive transactional guarantees or metastore integration.

36. What is the UpdateAttribute processor and how is its Advanced Mode used?

UpdateAttribute is one of the most versatile NiFi processors. In basic mode, each User-Defined Property becomes an attribute name, and its value (which can use NiFi Expression Language) becomes the new attribute value. Adding a property processed.timestamp with value ${now():format('yyyy-MM-dd HH:mm:ss')} stamps every FlowFile with a timestamp attribute.

Advanced Mode (accessible via the Advanced button in the processor configuration dialog) adds rule-based conditional attribute modification. You define rules, each with:

A set of conditions — boolean EL expressions that must all be true for the rule to apply (e.g., ${mime.type:equals('application/json')})
A set of actions — attribute name/value pairs applied when the rule matches

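Multiple rules are evaluated in order. The FlowFile Policy setting controls what happens when more than one rule matches: Use Clone (the default) applies each matching rule to its own copy of the FlowFile, while Use Original applies all matching rules, in order, to the original FlowFile. This allows a single UpdateAttribute processor to implement complex attribute-setting logic without chaining multiple processors or writing scripts.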
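The rule structure (all conditions must hold, then the actions are applied) can be modelled in a few lines of Python. This is a conceptual sketch only: apply_rules is hypothetical, conditions are Python callables rather than Expression Language, and the sketch applies every matching rule in order to a single attribute map, which is just one possible policy.

```python
def apply_rules(attributes, rules):
    """Apply each rule whose conditions all hold, in the order the rules are listed."""
    updated = dict(attributes)
    for rule in rules:
        if all(condition(updated) for condition in rule["conditions"]):
            updated.update(rule["actions"])
    return updated


# Hypothetical rules: conditions gate the attribute actions.
rules = [
    {
        "name": "json-target",
        "conditions": [lambda a: a.get("mime.type") == "application/json"],
        "actions": {"target.dir": "/data/json"},
    },
    {
        "name": "large-json",
        "conditions": [
            lambda a: a.get("mime.type") == "application/json",
            lambda a: int(a.get("file.size", "0")) > 1000,
        ],
        "actions": {"priority": "low"},
    },
]

small_json = apply_rules({"mime.type": "application/json", "file.size": "10"}, rules)
large_json = apply_rules({"mime.type": "application/json", "file.size": "5000"}, rules)
plain_text = apply_rules({"mime.type": "text/plain", "file.size": "5000"}, rules)

print(small_json, large_json, plain_text)
```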

Common basic-mode uses include: computing dynamic file paths from multiple attributes, incrementing retry counters, setting MIME types, building database connection parameters from environment attributes, and normalizing attribute naming conventions.

In UpdateAttribute Advanced Mode, what is the role of a Rule Condition?

A boolean NiFi EL expression that must evaluate to true for the rule's attribute actions to be applied to the FlowFile

✓ Well done — conditions gate whether a rule's attribute modifications are applied, enabling conditional logic without separate RouteOnAttribute or scripting.

A SQL WHERE clause filtering which FlowFiles enter the processor

✗ Try again — UpdateAttribute does not use SQL; conditions are NiFi EL boolean expressions.

A regex pattern applied to FlowFile content to trigger attribute extraction

✗ Try again — content regex extraction is ExtractText's job; UpdateAttribute Advanced Mode conditions evaluate attributes, not content.

What is a common use of NiFi Expression Language in UpdateAttribute properties (basic mode)?

Routing FlowFiles to different connections based on attribute values

✗ Try again — routing is RouteOnAttribute's job; UpdateAttribute only modifies attributes, it does not route.

Building computed attribute values like timestamps, string concatenations, or incremented counters from existing attributes

✓ Well done — EL in UpdateAttribute lets you derive new attribute values from existing ones, enabling dynamic configuration without scripting.

Decrypting encrypted FlowFile content using a key attribute

✗ Try again — UpdateAttribute does not access FlowFile content; use EncryptContent or ExecuteScript for content decryption.

37. How do you implement deduplication in a NiFi flow?

Deduplication — preventing the same data from being processed more than once — is a common requirement. NiFi provides several mechanisms depending on scale, performance requirements, and what constitutes a duplicate.

DetectDuplicate processor: The simplest approach. It uses a Distributed Map Cache (a DistributedMapCacheClientService backed by a DistributedMapCacheServer) to store seen identifiers. For each incoming FlowFile, it evaluates a configurable Cache Entry Identifier (NiFi EL expression, e.g., ${filename} or ${sha256.hash}) and checks if that key already exists. Duplicates route to the duplicate relationship; new items go to non-duplicate. Cache entries can have a TTL (Age Off Duration) to forget old identifiers.

Content-based hashing: Use the HashContent processor (superseded in newer NiFi releases by CryptographicHashContent) to compute a SHA-256 hash of the FlowFile content and store it as an attribute, then use DetectDuplicate against the hash. This detects content-identical duplicates regardless of filename or metadata.

Database deduplication: For exactly-once semantics, track processed identifiers in a database table using PutDatabaseRecord with INSERT_IGNORE and a unique constraint on the identifier column. The database's ACID guarantees enforce uniqueness even under concurrent insertion.
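The hash-then-check approach can be sketched in Python with an in-process dictionary standing in for the Distributed Map Cache. The class SeenCache and the function route are hypothetical, the age-off behaviour is a simplified reading of a TTL, and the now parameter exists only so the example can control time.

```python
import hashlib
import time


class SeenCache:
    """Remembers identifiers for a limited time, like a cache with an age-off duration."""

    def __init__(self, age_off_seconds):
        self.age_off_seconds = age_off_seconds
        self._entries = {}

    def check_and_add(self, key, now=None):
        """Return True if key was seen within the age-off window, else record it."""
        now = time.monotonic() if now is None else now
        first_seen = self._entries.get(key)
        if first_seen is not None and now - first_seen < self.age_off_seconds:
            return True
        self._entries[key] = now
        return False


def route(content, cache, now=None):
    """Hash the content and pick a relationship name based on the cache lookup."""
    digest = hashlib.sha256(content).hexdigest()
    return "duplicate" if cache.check_and_add(digest, now) else "non-duplicate"


cache = SeenCache(age_off_seconds=60)

first = route(b"same bytes", cache, now=0)
repeat = route(b"same bytes", cache, now=10)      # seen 10 seconds ago
different = route(b"other bytes", cache, now=10)
after_age_off = route(b"same bytes", cache, now=100)  # entry has aged off

print(first, repeat, different, after_age_off)
```

Hashing the bytes rather than keying on filename means a renamed copy of the same file is still caught, at the cost of reading the full content once.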

What storage mechanism does the DetectDuplicate processor use to persist seen identifiers across NiFi restarts?

A Distributed Map Cache backed by a DistributedMapCacheServer

✓ Well done — DetectDuplicate stores seen keys in a distributed cache that outlives processor restarts, and survives NiFi restarts when the DistributedMapCacheServer has a Persistence Directory configured.

NiFi's FlowFile Repository

✗ Try again — the FlowFile repository stores active FlowFile state, not duplicate-detection history.

An in-memory HashSet within the processor

✗ Try again — an in-memory HashSet would be lost on processor restart; the Distributed Map Cache provides durable storage.

How would you detect duplicate FlowFiles based on identical content regardless of filename or metadata?

Use HashContent to compute a SHA-256 hash of the content as an attribute, then use DetectDuplicate on that hash attribute

✓ Well done — content hashing ensures that two files with identical bytes but different names are still identified as duplicates.

Use RouteOnContent with a regex that matches the full file content

✗ Try again — RouteOnContent cannot match complete file content for deduplication; it is for pattern-based routing, not identity checking.

Compare each FlowFile's content against every FlowFile in all connection queues

✗ Try again — comparing against all queued FlowFiles is not a supported NiFi operation and would not scale.

38. What is the HandleHttpRequest and HandleHttpResponse processor pair used for?

HandleHttpRequest and HandleHttpResponse implement an HTTP server inside NiFi, enabling NiFi to act as a web service endpoint that receives HTTP requests from external clients, processes them as FlowFiles through the flow, and returns HTTP responses.

HandleHttpRequest: Starts an embedded Jetty HTTP server listening on a configured port and path. When a client sends an HTTP request, the processor creates a FlowFile from the request body and enriches it with request attributes such as http.method, http.request.url, http.query.string, one http.headers.<name> attribute per HTTP header, and an http.context.identifier that uniquely links this request to its response.

HandleHttpResponse: Receives a processed FlowFile and sends its content back to the waiting HTTP client as the response body. It uses the http.context.identifier attribute to match the response to the original request. The HTTP status code can be set statically or from a FlowFile attribute.

The use case is building NiFi-powered APIs or webhook receivers. A webhook receiver, for example, can accept JSON payloads from a GitHub push event, validate the HMAC signature using ExecuteScript, enrich the payload via UpdateAttribute, and store it in a database via PutDatabaseRecord — all while responding 202 Accepted to the GitHub server.

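A StandardHttpContextMap Controller Service is required to hold open HTTP connections between HandleHttpRequest and HandleHttpResponse. Its Maximum Outstanding Requests setting caps how many requests can be in flight at once, and its Request Expiration setting bounds how long a request may wait for a response.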
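The correlation mechanism can be modelled in Python: a shared map holds each pending request under a generated identifier, the FlowFile carries that identifier as an attribute, and the response step uses it to complete the right request. All names here (HttpContextMap, handle_request, handle_response) are hypothetical, and a Future stands in for the open client connection.

```python
import uuid
from concurrent.futures import Future


class HttpContextMap:
    """Holds pending requests by identifier, with a cap on how many may be open."""

    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding
        self._pending = {}

    def register(self):
        """Reserve a slot for a new request, or return (None, None) when full."""
        if len(self._pending) >= self.max_outstanding:
            return None, None
        identifier = str(uuid.uuid4())
        pending = Future()
        self._pending[identifier] = pending
        return identifier, pending

    def complete(self, identifier, status, body):
        """Look up the waiting request and hand it the response."""
        pending = self._pending.pop(identifier)
        pending.set_result((status, body))


def handle_request(context_map, body):
    """Request side: create a FlowFile-like dict tagged with the context identifier."""
    identifier, pending = context_map.register()
    if identifier is None:
        return None, None
    flowfile = {"attributes": {"http.context.identifier": identifier}, "content": body}
    return flowfile, pending


def handle_response(context_map, flowfile, status):
    """Response side: use the attribute to answer the correct waiting client."""
    identifier = flowfile["attributes"]["http.context.identifier"]
    context_map.complete(identifier, status, flowfile["content"])


context_map = HttpContextMap(max_outstanding=1)

flowfile, pending = handle_request(context_map, b'{"event": "push"}')
rejected, _ = handle_request(context_map, b"second request")  # map is full

# Any processing may happen between the two steps; here the body is upper-cased.
flowfile["content"] = flowfile["content"].upper()
handle_response(context_map, flowfile, 202)

result = pending.result(timeout=1)
print(result)
```

Because the identifier travels as an ordinary attribute, the FlowFile can pass through any number of intermediate processors before the response is sent.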

What Controller Service is required to maintain the open HTTP connection between HandleHttpRequest and HandleHttpResponse?

StandardHttpContextMap

✓ Well done — the StandardHttpContextMap holds the connection context open while the FlowFile is processed, allowing HandleHttpResponse to return the result to the correct waiting client.

DBCPConnectionPool

✗ Try again — DBCPConnectionPool manages database connections; HTTP connection management is StandardHttpContextMap's role.

StandardSSLContextService

✗ Try again — SSLContextService manages TLS; HTTP connection context is a separate service.

What FlowFile attribute does HandleHttpResponse use to match the processed FlowFile back to the correct waiting HTTP client?

The FlowFile's uuid attribute

✗ Try again — the uuid is a FlowFile-level identifier; the HTTP-specific correlation attribute is http.context.identifier.

http.context.identifier

✓ Well done — HandleHttpRequest sets http.context.identifier; HandleHttpResponse reads this attribute to look up and close the correct waiting connection.

http.request.id

✗ Try again — the attribute name is http.context.identifier, not http.request.id.

39. How does NiFi achieve guaranteed delivery and what are its durability guarantees?

NiFi's architecture is specifically designed to provide guaranteed delivery — once data enters NiFi, it will not be silently lost due to hardware failure, software crash, or network issues. Several design decisions work together to achieve this.

Persistent connection queues: FlowFiles in connection queues are tracked in the FlowFile Repository's Write-Ahead Log. On restart after a crash, NiFi replays the WAL to restore every FlowFile to its exact queue position before the crash.

Immutable content repository: FlowFile content is written to disk before the FlowFile is considered active. Content is never deleted until all references are removed. Even if NiFi crashes mid-write, the old content version is preserved.

Transactional processor sessions: Each processor invocation runs within a ProcessSession. All changes within a session are committed atomically. If the processor throws an exception before committing, the session is rolled back: all FlowFiles return to their input queues as if nothing happened.

Site-to-Site acknowledgment: Data transferred via S2S is acknowledged by the receiver before the sender removes FlowFiles from its queues. If the receiver crashes before acknowledgment, the sender retains the data and retries.

The practical implication: NiFi provides at-least-once delivery by default. Under failure and retry scenarios, a FlowFile may be processed more than once. Achieving exactly-once requires idempotent downstream systems or explicit deduplication logic in the flow.
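Why at-least-once delivery calls for idempotent sinks can be shown with a small simulation. It is a sketch under simplifying assumptions: the queue, the "commit", and the sink are plain Python stand-ins with hypothetical names, not NiFi APIs.

```python
class IdempotentSink:
    """External system that ignores a record it has already stored."""

    def __init__(self):
        self.rows = {}
        self.write_calls = 0

    def write(self, record_id: str, payload: str) -> None:
        self.write_calls += 1
        self.rows.setdefault(record_id, payload)  # a repeated id changes nothing


def run_with_retry(queue: list, sink: IdempotentSink, fail_first_commit: bool) -> None:
    """Process each queued item; a failed commit leaves the item queued for retry."""
    failed_once = False
    while queue:
        record_id, payload = queue[0]   # peek: the item stays queued until commit
        sink.write(record_id, payload)  # the external side effect happens first
        if fail_first_commit and not failed_once:
            failed_once = True
            continue                    # "rollback": the item is still at queue[0]
        queue.pop(0)                    # "commit": the item leaves the queue
```

With one simulated commit failure, two queued records produce three write calls, yet the sink ends up with exactly two rows because the repeated write is keyed on the same identifier.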

What delivery guarantee does NiFi provide by default, and what scenario can violate exactly-once semantics?

At-least-once: a processor crash after writing but before session commit causes the FlowFile to be retried, potentially duplicating the write

✓ Well done — NiFi's session rollback on failure re-queues the FlowFile for retry, which can cause duplicate writes to non-idempotent sinks.

Exactly-once — the transactional session prevents any duplicate processing

✗ Try again — NiFi sessions are transactional within NiFi, but external writes may succeed before the session commits, causing duplicates on retry.

Best-effort — NiFi may silently drop FlowFiles under high load

✗ Try again — NiFi's persistent WAL and back-pressure prevent silent data loss; it is at-least-once, not best-effort.

What happens to a ProcessSession's changes if the processor throws an uncaught exception before calling session.commit()?

The session is rolled back and all affected FlowFiles are returned to their input queues

✓ Well done — NiFi automatically rolls back uncommitted sessions on exception, restoring FlowFiles for retry without data loss.

The session is partially committed up to the point of failure

✗ Try again — NiFi sessions are all-or-nothing; partial commits are not possible.

The FlowFiles are routed to the processor's failure relationship automatically

✗ Try again — on unhandled exceptions, the session rolls back; routing to failure requires explicit session.transfer() calls within a try-catch.

40. What is the Funnel component in NiFi and when do you use it?

A Funnel is a NiFi canvas component that merges FlowFiles from multiple incoming connections into a single outgoing connection. It has no processing logic — it is purely a flow topology tool for consolidating multiple data paths into one without introducing a processor's overhead, scheduling, or threading.

Common use cases:

Convergence after parallel processing: If you fan out a FlowFile through RouteOnAttribute to three different processing paths and all paths should eventually flow to the same downstream processor, a Funnel cleanly merges the three paths into one without requiring the downstream processor to have multiple input connections from different sources.

Canvas layout organization: Multiple processors feeding the same downstream path can first converge at a Funnel, reducing the number of long crossing connections on the canvas and improving readability.

Failure consolidation: Multiple processors' failure relationships can all connect to a single Funnel, which routes to a centralized error-handling path. This avoids duplicating error-handling connections from every processor to the same destination.

A Funnel differs from a Processor in that it has no scheduling, no concurrent task setting, no relationships (besides its one output), and no configuration properties. FlowFiles pass through it transparently and immediately. It also differs from a connection in that it can accept connections from any number of sources.

What is the maximum number of upstream connections a Funnel can accept?

One

✗ Try again — if a Funnel only accepted one connection, it would have no purpose; it is designed to merge multiple inputs.

Eight

✗ Try again — there is no eight-connection limit on Funnels.

Unlimited — a Funnel can accept connections from any number of upstream sources

✓ Well done — the Funnel's purpose is precisely to merge an unlimited number of upstream connections into one outgoing connection.

How does a Funnel differ from a processor when merging multiple data paths?

A Funnel has no scheduling, no processing logic, and no relationships: FlowFiles pass through instantly with zero overhead

✓ Well done — Funnels are transparent merge points with no associated thread, scheduling, or configuration overhead.

A Funnel buffers all incoming FlowFiles before releasing them in sorted order

✗ Try again — Funnels do not buffer or sort; FlowFiles pass through based on connection queue ordering on the output side.

A Funnel requires a Controller Service to manage the merged connection pool

✗ Try again — Funnels require no Controller Services; they are configuration-free canvas components.

41. What is the difference between GetFile and ListFile + FetchFile processors?

Both approaches ingest files from a local filesystem, but they differ in architecture, parallelism, and operational characteristics.

GetFile: The older, simpler, single-processor approach. It lists a directory, picks up files matching the configured filter, removes the source file after reading it (unless Keep Source File is enabled), and produces a FlowFile with the file content. Critical limitation: it is not safe to run with multiple concurrent tasks because two threads could attempt to process the same file simultaneously, and its filesystem rename locking is not atomic on all filesystems or NFS mounts. On a cluster that reads from a shared directory, GetFile should run on the Primary Node only.

ListFile + FetchFile: The modern, recommended approach. ListFile scans the directory and emits one FlowFile per found file containing only metadata attributes (filename, path, size, last modified). It uses State Management to track already-listed files. FetchFile then reads the actual file content from disk. This separation enables:

Parallel fetching: multiple concurrent FetchFile tasks read files simultaneously
Clear separation of concerns: listing happens once; fetching can be retried independently per file
Works correctly in clustered NiFi without Primary Node restriction on the fetch step

GetFile remains appropriate for simple single-node use cases. For new development and cluster deployments, ListFile + FetchFile is preferred.
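The list-then-fetch split can be sketched as two small functions: one that lists only files newer than a stored timestamp, and one that reads content in parallel. This is a simplified illustration, not NiFi code. The state dictionary stands in for State Management, and the timestamp-only tracking is an assumption that ignores edge cases such as several files sharing the same modification time.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def list_new_files(directory: str, state: dict) -> list:
    """Return paths modified after the stored timestamp and advance the state."""
    last_seen = state.get("last_modified", 0.0)
    found = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        mtime = os.path.getmtime(path)
        if os.path.isfile(path) and mtime > last_seen:
            found.append((path, mtime))
    if found:
        state["last_modified"] = max(m for _, m in found)
    return [p for p, _ in found]


def fetch(path: str) -> bytes:
    """Read one file's content; a failure here affects only this file."""
    with open(path, "rb") as handle:
        return handle.read()


def fetch_all(paths: list, workers: int = 4) -> dict:
    """Fetch several listed files concurrently, keyed by path."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(fetch, paths)))
```

Listing is cheap and runs once per cycle, while the expensive reads fan out across workers. That is the property that lets the fetch step scale across threads or cluster nodes.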

Why is GetFile not recommended for use with multiple concurrent tasks in a NiFi cluster?

Multiple threads could attempt to process the same file simultaneously; its filesystem-rename locking is not reliable in all environments

✓ Well done — GetFile's rename-based file locking is not guaranteed atomic on all filesystems, especially NFS, leading to potential duplicate processing.

GetFile does not support HDFS filesystems

✗ Try again — that is a separate concern; the cluster concurrency problem is about race conditions on file locking.

GetFile produces FlowFiles without filename attributes

✗ Try again — GetFile does set filename and path attributes; the limitation is concurrency safety.

What mechanism does ListFile use to avoid re-listing files it has already seen on subsequent runs?

It moves processed files to a separate archive directory

✗ Try again — ListFile does not move or modify source files; it is read-only.

NiFi State Management — it persists the last seen file timestamp or set of seen filenames

✓ Well done — ListFile uses the State Management API to durably record which files have been listed, surviving processor restarts.

A Distributed Map Cache populated with each file inode number

✗ Try again — ListFile uses NiFi's built-in State Management, not an external Distributed Map Cache.

42. How does NiFi support schema evolution in data pipelines?

Schema evolution — handling changes in data structure without breaking pipelines — is supported primarily through the record model and schema registry integration.

NiFi's record-aware processors use Schema Access Strategies on RecordReaders and RecordWriters:

Infer Schema: The reader analyzes the data and derives a schema at runtime. Handles schema evolution naturally because the inferred schema always matches the current data shape, but may be inconsistent for heterogeneous datasets.

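Use a schema from a registry: With a strategy such as Use 'Schema Name' Property, the reader looks up a versioned schema in a Schema Registry (Confluent Schema Registry or NiFi's built-in AvroSchemaRegistry). Schema evolution is governed by Avro compatibility rules: BACKWARD (new schema can read old data), FORWARD (old schema can read new data), or FULL (both). Confluent Schema Registry enforces the configured rule when a schema is registered, preventing incompatible changes; AvroSchemaRegistry only stores the schema text it is configured with, so compatibility has to be checked before a schema is added to it.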

For handling missing fields: if an input record lacks a field present in the output schema, the writer uses the field's configured default value. For extra fields not in the output schema: schema validation, for example the ValidateRecord processor's Allow Extra Fields property, controls whether extra fields are ignored silently or cause the record to be rejected.
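The two rules for missing and extra fields can be illustrated with a small function. It is a plain-Python sketch, not a NiFi RecordWriter; the schema is modelled as a hypothetical mapping of field name to default value, and allow_extra is an illustrative flag.

```python
def conform(record: dict, schema: dict, allow_extra: bool = True) -> dict:
    """Shape record to schema, a mapping of field name -> default value.

    Missing fields take the schema default. Extra fields are dropped when
    allow_extra is True and rejected otherwise.
    """
    extra = set(record) - set(schema)
    if extra and not allow_extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    return {name: record.get(name, default) for name, default in schema.items()}


# Hypothetical version 2 of a schema that added "currency" with a default
schema_v2 = {"id": None, "amount": 0.0, "currency": "USD"}
```

An old record without currency still conforms to schema_v2 because the default fills the gap. This is what makes adding a field with a default a compatible change.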

What Avro schema compatibility mode ensures that data written with a new schema can still be read by consumers using an older schema version?

FORWARD compatibility

✓ Well done — FORWARD compatibility means old readers can process data written by new writers — the new schema is forward-compatible with older consumers.

BACKWARD compatibility

✗ Try again — BACKWARD means new readers can process data written by old writers — the opposite direction.

FULL compatibility

✗ Try again — FULL requires both BACKWARD and FORWARD compatibility simultaneously, which is more restrictive.

When a RecordWriter encounters a field in the input record that is not present in the output schema, what does NiFi Schema Validation control?

Whether extra fields are silently ignored or cause the FlowFile to be routed to the failure relationship

✓ Well done — Schema Validation determines the strictness of schema enforcement; lenient mode ignores extra fields, strict mode fails on them.

Whether extra fields are automatically added to the registry as schema updates

✗ Try again — NiFi does not auto-update registry schemas; schema registration is an explicit operation.

Whether the writer pauses and waits for a human operator to approve the new field

✗ Try again — NiFi does not support human-approval gates within processor execution.

43. What is the RouteText processor and how does it differ from RouteOnContent?

RouteText and RouteOnContent are both NiFi processors that route FlowFiles based on patterns found in content, but they operate at fundamentally different granularities.

RouteOnContent: Evaluates the entire FlowFile content against configured regex patterns. Depending on the Match Requirement property, a pattern must either match the whole content or be found anywhere within it. A matching FlowFile is routed, as a whole, to the relationship named after the matching property; content that matches nothing goes to unmatched. Use case: route JSON or XML FlowFiles based on whether they contain specific keywords or signatures.

RouteText: Operates line-by-line on the FlowFile content. For each line, it evaluates conditions and routes that individual line to a matching relationship. The output is multiple FlowFiles — one per matching relationship — containing only the lines that matched it. Lines matching no condition go to the unmatched relationship. Use case: splitting a multi-type log file where each line is an independent event, routing ERROR lines to an error handler and INFO lines to a general store.

RouteText supports multiple matching strategies: Starts With, Ends With, Contains, Equals, Matches Regular Expression, Contains Regular Expression, and Satisfies Expression (NiFi EL evaluated against a text line). The Routing Strategy property controls whether each line is routed to the relationship named after every property it matches, or to a single matched relationship when it satisfies all, or any, of the conditions.
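Line-level routing can be approximated in a few lines of Python. This sketch models only a Starts With strategy where each line goes to the first matching condition. The function and relationship names are illustrative assumptions, not RouteText's actual behavior in every configuration.

```python
def route_lines(text: str, conditions: dict) -> dict:
    """Group lines by the first condition name whose prefix they start with.

    conditions maps a relationship name to a line prefix. Lines that match
    no prefix are collected under "unmatched".
    """
    routed = {}
    for line in text.splitlines():
        target = "unmatched"
        for name, prefix in conditions.items():
            if line.startswith(prefix):
                target = name
                break
        routed.setdefault(target, []).append(line)
    # One output "FlowFile" per relationship, holding only its lines
    return {name: "\n".join(lines) for name, lines in routed.items()}
```

One input yields one output per relationship that received lines. This is the key contrast with RouteOnContent, where the whole input moves as a unit.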

A log FlowFile contains both INFO and ERROR lines mixed together. Which processor correctly separates them into two different FlowFiles?

RouteText: it routes individual lines to different relationships, producing separate FlowFiles per line category

✓ Well done — RouteText line-level routing creates separate output FlowFiles for each category of matching lines.

RouteOnContent — it routes the whole FlowFile to the first matching pattern

✗ Try again — RouteOnContent routes the entire FlowFile to one relationship; it cannot split lines into separate FlowFiles.

SplitText followed by RouteOnContent — split each line then route

✗ Try again — while that works, RouteText does this in a single processor more efficiently.

To which RouteText relationship do lines that match none of the configured conditions go?

failure

✗ Try again — failure is for processing errors; non-matching lines go to the unmatched relationship.

unmatched

✓ Well done — RouteText always has an unmatched relationship for lines that satisfy none of the defined conditions.

default

✗ Try again — there is no default relationship; the correct name is unmatched.

44. What performance tuning options are available in NiFi and what are common bottleneck patterns?

NiFi performance tuning operates at several levels: JVM heap, thread pool sizes, repository configuration, and per-processor settings.

JVM Heap (bootstrap.conf): The java.arg.2=-Xms and java.arg.3=-Xmx settings control heap. NiFi's content repository keeps content on disk, so heap is primarily consumed by FlowFile attributes (in memory), in-flight processing, and Lucene indexes in the provenance repository. Typical production deployments use 4–16 GB. Insufficient heap causes frequent GC pauses and OutOfMemoryErrors.

Content Repository Partitioning: Splitting the content repository across multiple physical disks increases I/O parallelism, often the primary bottleneck for high-throughput flows.

Common bottleneck patterns:

Single processor bottleneck: One slow processor with a growing upstream queue. Solution: increase concurrent tasks on that processor.
Provenance repository lag: Slow provenance writes causing processor stalls. Solution: use WriteAheadProvenanceRepository instead of PersistentProvenanceRepository, or reduce provenance event detail.
Back-pressure chain: All processors paused because the terminal writer (PutS3Object, PutDatabaseRecord) is slow. Solution: scale the terminal processor or add a MergeContent batch before it.
Small FlowFile overhead: Millions of tiny FlowFiles causing high FlowFile repository overhead. Solution: use MergeContent to batch before terminal writes.
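The batching remedy for tiny FlowFiles can be sketched as a simple binning function. It is a rough analogue of what MergeContent does with entry-count and size limits, written in plain Python with hypothetical parameter names, and it omits time-based bin expiry.

```python
def merge_batches(flowfiles: list, max_entries: int, max_bytes: int) -> list:
    """Concatenate small payloads into bins limited by entry count and total size."""
    batches, current, size = [], [], 0
    for payload in flowfiles:
        too_many = len(current) >= max_entries
        too_big = bool(current) and size + len(payload) > max_bytes
        if too_many or too_big:
            # Close the current bin and start a new one
            batches.append(b"\n".join(current))
            current, size = [], 0
        current.append(payload)
        size += len(payload)
    if current:
        batches.append(b"\n".join(current))
    return batches
```

Ten one-record payloads with a limit of four entries per bin become three merged payloads, so downstream processors and repositories handle three units of work instead of ten.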

A NiFi flow processes thousands of tiny JSON records as individual FlowFiles. What is the most common performance problem and its solution?

High FlowFile metadata overhead: use MergeContent to batch FlowFiles before terminal writes to reduce per-FlowFile processing cost

✓ Well done — millions of tiny FlowFiles multiply the metadata, repository, and provenance overhead; batching reduces this dramatically.

Content repository disk fills up — increase disk size

✗ Try again — small FlowFiles use little content space; the overhead is metadata and FlowFile repository writes, not content size.

JVM heap exhaustion from storing record content in memory

✗ Try again — NiFi stores content on disk, not in heap; the tiny-FlowFile problem is metadata overhead, not heap pressure.

What nifi.properties configuration can reduce provenance write latency in high-throughput deployments?

Switch from PersistentProvenanceRepository to WriteAheadProvenanceRepository

✓ Well done: WriteAheadProvenanceRepository uses a WAL for provenance writes, which is significantly faster than PersistentProvenanceRepository's synchronous Lucene indexing.

Disable provenance completely by setting the implementation to none

✗ Try again — there is no none implementation; VolatileProvenanceRepository is the in-memory option, but losing provenance is a significant trade-off.

Increase the JVM heap to reduce GC frequency

✗ Try again — heap affects GC but not provenance write latency specifically; the repository implementation is the relevant tuning lever.

45. How does NiFi integrate with cloud storage services like Amazon S3?

NiFi provides a comprehensive set of processors for integrating with Amazon S3 available in the nifi-aws-nar extension.

ListS3: Lists objects in an S3 bucket (filtered by prefix and last modified date). Produces one FlowFile per S3 object with attributes such as s3.bucket, filename (the object key), s3.etag, s3.lastModified, s3.length, s3.storeClass, and s3.version. Uses State Management to track listed objects and emit only new or changed ones on subsequent runs.

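FetchS3Object: Downloads the S3 object identified by its Bucket and Object Key properties, which are typically set with Expression Language from the incoming FlowFile's attributes (for example ${s3.bucket} and ${filename}). Used downstream from ListS3.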

PutS3Object: Uploads FlowFile content to S3. Supports configuring bucket, key (EL-supported: ${now():format('yyyy/MM/dd')}/${filename}), storage class (STANDARD, INTELLIGENT_TIERING, GLACIER), server-side encryption (AES-256, AWS:KMS), and multipart upload threshold for large files.

DeleteS3Object: Deletes an S3 object. TagS3Object: Adds or updates S3 object tags.

Authentication is handled via an AWSCredentialsProviderControllerService — supporting static credentials, environment variable resolution, EC2 instance profile (IAM role), and AWS credentials file. Using IAM roles via instance profiles is the recommended approach for deployments on EC2 or EKS, avoiding static credential management entirely.
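Two PutS3Object settings described above lend themselves to a quick calculation: the date-partitioned object key and the choice between a single upload and a multipart upload. The helpers below are illustrative Python, not NiFi or AWS SDK code. The function names are hypothetical, and the exact boundary comparison against the threshold is an assumption.

```python
import math
from datetime import datetime, timezone


def object_key(filename: str, now: datetime) -> str:
    """Date-partitioned key, mirroring ${now():format('yyyy/MM/dd')}/${filename}."""
    return f"{now.strftime('%Y/%m/%d')}/{filename}"


def part_count(size: int, threshold: int, part_size: int) -> int:
    """Return 1 for a single upload, else the number of multipart parts needed."""
    if size <= threshold:
        return 1
    return math.ceil(size / part_size)
```

Partitioning keys by date keeps listings with a prefix filter cheap. The part count shows how the threshold and part size together determine how many requests a large FlowFile turns into.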

What is the recommended AWS authentication approach for NiFi running on EC2, avoiding static credential configuration?

EC2 instance profile (IAM role): NiFi automatically uses the instance's attached IAM role credentials

✓ Well done — IAM roles via instance profiles eliminate the need for static AWS access keys, rotating credentials automatically.

Hardcoding AWS_ACCESS_KEY_ID in nifi.properties

✗ Try again — hardcoding credentials is a security anti-pattern; IAM roles are the recommended approach on EC2.

Storing credentials in a NiFi Parameter Context as sensitive parameters

✗ Try again — Parameter Contexts protect credentials from UI exposure but still require storing long-lived keys; IAM roles avoid credentials entirely.

What PutS3Object feature automatically switches to multipart upload for large files?

Increasing the JVM heap enables larger single-part uploads

✗ Try again — multipart upload is an S3 protocol feature, not related to JVM heap.

The Multipart Threshold and Multipart Part Size properties: objects above the threshold are uploaded in multiple parts

✓ Well done: PutS3Object automatically switches to multipart upload when the FlowFile exceeds the configured threshold, enabling reliable upload of large files as separately transferred parts.

Setting Storage Class to INTELLIGENT_TIERING enables multipart uploads

✗ Try again — storage class controls cost/access tier, not upload mechanism; multipart is controlled by the threshold property.


About Javapedia.net: Javapedia.net is for Java and J2EE developers, technologists, and college students preparing for interviews. The site also includes many practical examples. It was developed using J2EE technologies by Steve Antony, a senior developer/lead at a logistics company.
contact: javatutorials2016[at]gmail[dot]com
	Copyright © 2026, javapedia.net, all rights reserved. privacy policy.
