This article is an AI-assisted translation of the original French content.

Serving Parquet files with Spring MVC Link to heading

Given the growing use of the Parquet format for distributing large volumes of data as tables via HTTP, backends may be required to serve Parquet files. The purpose of this post is to show how a Java backend with Spring MVC can do this very simply.

The Parquet File Format Link to heading

Parquet format files are becoming increasingly prevalent for large data volumes: the data.gouv.fr website now offers this format. Parquet format files are indeed very well suited for making large tables available via HTTP while minimizing network traffic.

The data is organized in columns (column-oriented format) and partitioned into blocks, with the positions of those blocks available in the metadata at the beginning of the file. With just the file header, it is possible to obtain its description (variables (the columns) with type and name) and the number of records. This structuring and partitioning of the file allows quickly computing, without reading the entire file as would be necessary with CSV, the average of a variable: it suffices to read the blocks related to the variable in question. Many other more advanced optimizations explain the success of the Parquet format.

Correctly streaming Parquet files via HTTP requires supporting HTTP range requests Link to heading

The principle of HTTP range requests is that if the client requests only a portion of a resource, the server, if it knows how to handle partial requests, will only return the requested portion:

  • The resource request is made with a simple GET /myResource request
  • The specification of the portion to request is done with a Range header followed by an expression specifying the portion of the resource to return
  • if the server knows how to handle partial requests, it returns only the requested portion with a 206 (partial content) return code
  • if the server does not know how to handle partial requests, it returns the entire resource with a 200 code (responding to the HTTP request while ignoring the Range header)

The HTML page referenced in the title provides concrete examples

In order to benefit from the ability of Parquet files to split data into blocks, tools that read remote Parquet files via HTTP use HTTP range requests. Indeed, HTTP range requests allow targeting only the blocks of the file needed to fulfill the request. It is therefore essential that a server serving Parquet files via HTTP be able to handle partial requests. If the server cannot handle partial requests, it will transfer the entire file over the network and the benefit of the Parquet format will be lost.

In practice, the ranges in HTTP range requests are specified in bytes: the Range: bytes=0-9 header therefore designates the interval [0 ; 9] (bounds inclusive) of the file’s bytes. The server must return the first 10 bytes of the file (indexing from 0).

Spring MVC supports partial requests Link to heading

Although not explicitly stated in the documentation, Spring MVC automatically supports HTTP range requests:

In other words, if a file is served by Spring, either as a static resource, or by a controller (or a functional endpoint) as a Resource, if Spring MVC receives an HTTP range request targeting the resource in question, it will honor the intervals (Range header) specified in the request without serving the entire file.

Here is an example of a controller that will satisfy HTTP range requests (to be adapted for real-world cases by adding security measures to restrict user access to authorized resources):

@Controller
static class HttpRangeCompliantController{

    @GetMapping(value = "/{filename}", produces = MediaType.APPLICATION_OCTET_STREAM_VALUE)
    public ResponseEntity<Resource> getFileByRange(@PathVariable String filename) {
        ClassPathResource classPathResource = new ClassPathResource(filename);

        return classPathResource.exists() ?
                ResponseEntity.ok(classPathResource) :
                ResponseEntity.notFound().build();
    }

}

NB: Necessary conditions for partial requests to be automatically handled for an MVC controller:

  • The response body must be non-null and be a subtype of Resource but not InputStreamResource
  • The request must include a syntactically correct Range header
  • The HTTP status code of the response must be equal to 200

If any of the above conditions is not met, Spring MVC will respond to the request without taking the Range header into account and will return the entire file in response to the request.

Testing a Parquet file request on a Spring MVC web server from a Parquet-reading client (DuckDB) Link to heading

DuckDB in Java Link to heading

DuckDB is an in-process DBMS capable of reading Parquet files accessible via a URL: it is a very commonly used client for reading Parquet files and will therefore make use of HTTP range requests. DuckDB is designed to be used as a lightweight DBMS that can be embedded with an application and aims to provide fast data access — it is therefore a perfectly suitable client for reading Parquet files from an application, including when they are remote. The library presents itself as a portable DBMS engine associated with various interfaces:

  • a CLI
  • APIs for Java, R, Python, Go, Node.js, Rust, …

In the following, we will use the Java API to access via HTTP a Parquet file served by the Spring controller above. The Java client is used like a standard JDBC Driver (and can therefore be associated with higher-level tools for the DAO layer such as JOOQ). We will use the DuckDB driver directly here for our test. It requires adding the following dependency to the project:

<dependency>
    <groupId>org.duckdb</groupId>
    <artifactId>duckdb_jdbc</artifactId>
    <version>1.3.0.0</version>
</dependency>

It is then configured as follows (example with proxy configuration):

private Connection getDuckDBConnection() throws ClassNotFoundException, SQLException {
    Class.forName("org.duckdb.DuckDBDriver");
    Properties props = new Properties();
    // Example proxy configuration
    props.setProperty("http_proxy", "http://proxy.company.com");
    // example configuration of other properties
    props.setProperty("enable_object_cache", "true");
    props.setProperty("default_block_size", "16384");
    
    return DriverManager.getConnection("jdbc:duckdb:", props);
}

Finally, calls are made as SQL queries just like with a standard DBMS. Here is an example query that computes the average of a single variable from a file containing several:

// parquetFileUrl is the URL of the parquet file on the Spring MVC server
double requestAverageAgeWithDuckDB(String parquetFileUrl) throws ClassNotFoundException, SQLException {
    Connection conn = getDuckDBConnection();
    double averageAge = 0;
    // 'In real life', use a preparedStatement, ...
    try (ResultSet rs = conn.createStatement().executeQuery("select avg(age) from '" + parquetFileUrl + "'")) {
        while (rs.next()) {
            averageAge = rs.getDouble(1);
        }
    }
    return averageAge;
}

Verifying that the entire file does not transit over the network Link to heading

In order to verify that only the necessary blocks transit over the network during the interaction between DuckDB and the Spring MVC server serving the Parquet file, the previous query on average age can be run on the titanic.parquet file, which will be placed on the Spring MVC server so it can serve it. The file contains an age column and therefore only the blocks related to the age column should transit over the network. The same query on average age is executed in a test context where HTTP requests are monitored: by computing the total number of bytes that transit from the Spring MVC server to DuckDB in response to requests on titanic.parquet, it is possible to verify that the entire file content did not transit over the network and therefore that only the necessary data blocks were transferred:

@Test
void should_return_parquet_by_range() throws IOException, ClassNotFoundException, SQLException, URISyntaxException {

    String parquetFilename = "titanic.parquet";
    URL parquetUrl = URI.create("http://localhost:" + port).resolve(parquetFilename).toURL();

    double averageAge = requestAverageAgeWithDuckDB(parquetUrl.toString());
    //check result's likelihood
    assertThat(averageAge).isStrictlyBetween(29.0d, 30.0d);

    assertThat(totalBytesReadByDuckDBFromRemoteTitanicParquetFile())
            .isPositive()
            .isLessThan(halfSizeOfRemoteTitanicParquetFile()); /* to ensure the entire file does not transit, we verify that less than half of the file's bytes were exchanged*/ 
}

It is not possible to intercept requests on the client side because they are made by DuckDB’s database engine, which is a native binary and is therefore not controlled by the JVM. We will therefore use Tomcat’s access logs to total the bytes exchanged in requests related to titanic.parquet. This is done using the TomcatTestMonitor.WithAccessLog class, which configures Tomcat’s access logs with an appropriate pattern and to be written to a monitored file. The class also registers a TomcatTestMonitorWithAccessLog bean responsible for reading from the access log file the HTTP requests processed by Tomcat. By calling this bean, the totalBytesReadByDuckDBFromRemoteTitanicParquetFile method in the test can retrieve the total number of bytes sent in response to GET "/titanic.parquet requests and thus allow writing the assertion on the total number of bytes transferred by those requests.

private long totalBytesReadByDuckDBFromRemoteTitanicParquetFile() {
    return this.tomcatTestMonitor.allRequests()
            .filter(ProcessedRequest::isGET)
            .filter(request -> request.uri().contains("/titanic.parquet"))
            .mapToLong(ProcessedRequest::responseSize)
            .sum();
}

tomcatTestMonitor is the TomcatTestMonitor bean declared by TomcatTestMonitor.WithAccessLog in the context and injected into the test.

References Link to heading