Release 1.27 - 06/30/2021

   * Migrate MP4 parsing to Drew Noakes' metadata-extractor (TIKA-3459).
     To revert to legacy parser turn off NoakesMP4Parser and turn on
     MP4Parser via tika-config.xml.
   * Prevent rare infinite loop in tika-server's -spawnChild mode when
     restart fails because of failure to bind to the port (TIKA-3441).
   * Improve likelihood that tesseract will not be orphaned on jvm
     restart in tika-server (TIKA-3441).
   * Deprecate experimental PDFPreflightParser (TIKA-3437).
   * Apply encoding detection to zip entry names via Ryan421 (TIKA-3374).
   * Add json output for /tika endpoint in tika-server (TIKA-3352).
   * Tika's PDFParser should use the underlying file if one is passed
     in via a TikaInputStream (TIKA-3350)

Release 1.26 - 03/24/2021

   * Fix thread safety bug in OpenOffice parser (TIKA-3334).
   * The "writeLimit" header now pertains to the combined characters
     written per container document (and embedded documents) in the
     /rmeta endpoint in tika-server (TIKA-3325); it no longer functions
     only per container or embedded document.
   * Extract more embedded files in PDFs by recursively processing the
     embedded file tree (TIKA-3332).
   * Allow for case insensitive headers for configuration of the
     PDFParser and the TesseractOCRParser in tika-server via
     Subhajit Das (TIKA-3320).
   * Improve detection and parsing of XPS files (TIKA-3316).
   * General dependency upgrades (TIKA-3244).
   * Great optimization in ForkParser (TIKA-3237).
   * Fix parsing of emails attached to other emails in PST files
     (TIKA-3004).
   * MP3 parser should output the xmpDM:duration metadata as seconds
     not milliseconds, consistent with the other Audio and Video
     parsers (TIKA-3318).
   * MP4 parser check if any of the Compatible Brands match when
     identifying the subtype (TIKA-3310).

Release 1.25 - 11/25/2020

   * Fix inconsistent license in xmpcore (TIKA-3204).
   * General upgrades including some dependencies with recently found
     security vulnerabilities (TIKA-3119).
   * Add detection and a parser for flat ODF files (TIKA-3159).
   * Add extraction of macros from ODF files (TIKA-3161).
   * Add mime detection for hprof and hprof text files (TIKA-3144).
   * Add TextSignature and TextProfileSignature to tika-eval
     (TIKA-3145 and TIKA-3146)
   * Create a metadata filter to trigger tika-eval stats post parsing
     (TIKA-3140)
   * Add a configurable metadata-filter for the RecursiveParserWrapper
     (TIKA-3137)
   * Add status endpoint to tika-server (TIKA-3129).
   * Remove whitelist/blacklist terminology (TIKA-3120)
   * Add detection for parquet files (TIKA-3115).
   * Add detection and parsing for bplist (TIKA-3104).
   * Enable metadata value filtering for RecursiveParserWrapper
     (TIKA-3137)
   * Add a basic parser for plist files based on
     com.googlecode.plist:dd-plist (TIKA-3104).
   * Read hyperlinked images from ODT files (TIKA-3156).
   * Updated GrobidRESTParser to use new API location (TIKA-3191).
   * Add FileProfiler to tika-eval (TIKA-3216).
   * Improved handling of zip files with STORED entries with data
     descriptor (TIKA-3196).
   * Add parsers for XLZ, IDML and MIF (TIKA-2976, TIKA-3188 and
     TIKA-3189).
   * Add the beginnings of a format-aware fuzzing module (TIKA-3083).
   * Add wrapper for Linux 'file' command for mime detection
     (TIKA-3215).
   * Added ability to skip parsing of embedded files in Tika Server
     (TIKA-3227).

Release 1.24.1 - 4/17/2020

   * Allow gzip compression of input and output streams for
     tika-server (TIKA-3073).

Release 1.24 - 3/11/2020

   * Add scripts to run tika-server as a service via Eric Pugh, and
     add these scripts and jar as a new artifact in the release
     (TIKA-3010).
   * Upgrade Drew Noakes' metadata-extractor (TIKA-2952).
   * Enable optional extraction of structural tags in PDFs
     (alpha-grade) (TIKA-3026).
   * Tika app's --extract mode now outputs to STDOUT (TIKA-3035).
   * Add an optional Preflight parser for PDFs (TIKA-3055).
   * Improve detection of some zip-based formats (TIKA-3057).
   * Upgrade metadata-extractor to 2.13.0 (TIKA-2952).
   * Upgrade to POI 4.1.2 (TIKA-3047).
   * Extract XMP from PSD files (TIKA-3050).
   * Added XMLProfiler as an optional parser to profile XFA and XMP in
     PDFs (TIKA-3045).
   * Extract inline images that rely on the DCT filter from PDFs
     (TIKA-3041).
   * Upgrade to PDFBox 2.0.19 (TIKA-3033).
   * Fix bug in ASM parser configuration (TIKA-2992).
   * Upgrade to java-libpst 0.9.3 (TIKA-2546).
   * Fixed XLIFF12Parser failures with ToXMLHandler (TIKA-3014).

Release 1.23 - 12/02/2019

   * NOTE: The PDFParser now relies on OCRDPI to render page images
     when users configure OCR on rendered page images. This will have
     the effect of increasing rendered image size (TIKA-2624).
   * NOTE: tika-server no longer returns 415 for file types for which
     there is no parser.
   * Fix bug in AUTO OCR strategy in the PDFParser (TIKA-3002).
   * Fix incorrect height and width metadata extraction from JPEG
     images (TIKA-2630).
   * Upgrade to POI 4.1.1 (TIKA-2851).
   * Upgrade to PDFBox 2.0.17 (TIKA-2951).
   * Ensure that the PDFParser respects custom configuration of
     Tesseract from tika-config.xml via Eric Pugh (TIKA-2970).
   * Add parser for XLIFF v1.2 files (TIKA-2975).
   * Add mime type detection support for WebAssembly (TIKA-2894),
     HEIF / HEIC images (TIKA-2942), Digilite FDF (TIKA-2988); and
     xml-root detection for XFDF (TIKA-2990) and XDP (TIKA-2989).
   * Add an XLZ Parser (TIKA-2976).
   * Fix deadlock with ForkParser when InputStream throws IOException
     (TIKA-2892).

Release 1.22 - 07/29/2019

   * NOTE: tika-server no longer hard-codes the HtmlParser to handle
     XML files (TIKA-2910). Users must now configure that behavior via
     a tika-config.xml file.
   * NOTE: Known regression: PDFBOX-4587 -- PDF passwords with
     codepoints between 0xF000 and 0XF0000 will cause an exception.
   * Add parser for HWP v5 files via SooMyung Lee (soomyung) and
     JinSup Kim (ddoleye) (TIKA-2909).
   * Fix order of closing streams to avoid "Failed to close temporary
     resource" exception in TesseractOCRParser (TIKA-2908).
   * Improve AutoDetectReader performance by caching encoding detector
     (TIKA-1568).
   * Prevent RTFParser from outputting illegal tag combinations
     (TIKA-2889).
   * Fix RereadableInputStream to release all resources (TIKA-2903).
   * Implement custom language identifier in the tika-eval module
     based on OpenNLP's language detector; add 18 languages and add
     common words lists for all 121 languages (TIKA-2790).
   * Fix NPE in MimeTypesReader.releaseParser() via Eamonn Saunders
     (TIKA-2896).
   * Fix RTFParser to extract more content (TIKA-2883).
   * Add clientSubmitTime to the metadata extracted from PST files
     (TIKA-2898).
   * Improve StreamingZipContainerDetector for xltx, xltm and several
     other file formats (TIKA-2886).

Release 1.21 - 05/14/2019

   * Add optional AUTO mode to OCR'ing of PDFs. If tesseract is
     installed and on the path, and this option is selected
     programmatically or via TikaConfig(), the PDFParser will use
     heuristics to decide whether or not to run OCR per page on PDFs.
     (TIKA-2749)
   * The ZipContainerDetector's default behavior was changed to run
     streaming detection up to its markLimit. Users can get the legacy
     behavior (spool-to-file/rely-on-underlying-file-in-TikaInputStream)
     by setting markLimit=-1. The POIFSContainerDetector requires an
     underlying file; it will try to spool the file to disk; if the
     file's length is > markLimit, it will not attempt detection; set
     markLimit to -1 for legacy behavior (TIKA-2849).
   * Upgrade PDFBox to 2.0.14 (TIKA-2834).
   * Add CSV detection and replace TXTParser with TextAndCSVParser;
     users can turn off CSV detection by excluding the
     TextAndCSVParser and adding back the TXTParser via tika-config
     (TIKA-2833).
   * Add a CSVParser. CSV detection is currently based solely on
     filename and/or information conveyed via Metadata (TIKA-2826).
   * General upgrades: asm, bouncycastle, commons-codec,
     commons-lang3, cxf, guava, h2, httpcomponents, jackcess, junrar,
     Lucene, mime4j, opennlp, parso, sqlite-jdbc (provided),
     zstd-jni (provided) (TIKA-2824)
   * Bundle xerces2 with tika-parsers (TIKA-2802).
   * Upgrade jaxb to 2.3.2 (TIKA-2819).
   * Upgrade jackson to 2.9.8 (TIKA-2717).
   * Update tika-eval's common tokens lists (TIKA-2822).
   * Handle bad tags in tika-eval more robustly (TIKA-2810).
   * Add reports for tags in tika-eval (TIKA-2809).
   * Extract text from SDT element within textboxes in .docx files
     (TIKA-2807).
   * Try to handle truncated OOXML files more robustly (TIKA-2765).

Release 1.20 - 12/17/2018

   * Upgrade to POI 4.0.1 (TIKA-2751).
   * Integrate/parameterize new angles handling in PDFBox (TIKA-2779).
   * Upgrade to PDFBox 2.0.13 (TIKA-2788).
   * Prevent content within