The background to this is simple: you have a Parquet file and want to inspect its schema. The most obvious way is to start spark-shell or pyspark, load the file into a DataFrame, and call printSchema.
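
For reference, here is a minimal pyspark sketch of that approach. The file path is a placeholder, and it assumes a working local Spark installation:

    # Minimal sketch, assuming pyspark is installed and the path below
    # points at a real Parquet file or directory.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("inspect-parquet-schema").getOrCreate()

    df = spark.read.parquet("/path/to/parquet/file")
    df.printSchema()  # prints column names, types, and nullability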

But what if, for some reason, none of these easy-to-use tools are available? The harder, but still very useful, way is to use the parquet-tools jar:

  1. Download the parquet-tools jar from this GitHub link -> https://github.com/viirya/parquet-tools/blob/master/parquet-tools-1.8.1.jar
  2. Run the following command -> hadoop jar parquet-tools-1.8.1.jar schema -d /path/to/parquet/file

At the very least, the output of the command above will show the file schema, along with a few other details such as the compression algorithm used.
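
For illustration only, the schema portion of the output looks roughly like the snippet below (the field names and types here are made up; your file will show its own columns, and the -d flag adds further metadata such as row group and compression details):

    message spark_schema {
      optional binary name (UTF8);
      optional int32 age;
      optional double salary;
    }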
