PDF Integration with Edilitics
PDF (Portable Document Format) is a widely used file format for presenting documents independently of application software, hardware, and operating systems. Although PDFs are mostly used for text and graphics, they often contain structured tables, and that tabular data has to be extracted before analytics workflows can use it.
In Edilitics, PDF files are used exclusively as data sources from which tabular data is extracted for analytics. This guide walks through integrating a PDF file step by step, and describes how Edilitics secures the file and which workflow constraints apply.
Before You Begin
Ensure the following prerequisites are met:
- File Size Limit: PDF files must not exceed 100 MB.
- Tabular Data: Ensure the PDF file contains well-structured tabular data for accurate extraction.
- Password Protection: If the PDF is password-protected, ensure you have the correct password to allow Edilitics to process the file.
- Usage Constraints:
  - PDF files are only supported as data sources, not destinations.
  - Workflows using PDF files:
    - Allow full loads with "Schedule as Once" in Replicate.
    - Support "Schedule as Once" in Transform.
    - Do not support auto updates or data refreshes in Visualize.
- AI Column Insights: PDF files are not eligible for AI Column Insights.
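The file-level prerequisites above can be checked locally before uploading. The following is a minimal sketch, not part of Edilitics: the function name `preflight_check` is hypothetical, the 100 MB limit is assumed to mean 100 × 1024 × 1024 bytes, and the password-protection test is only a heuristic that looks for the `/Encrypt` marker that encrypted PDFs normally carry.

```python
import os

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 100 * 1024 * 1024


def preflight_check(path, max_bytes=MAX_SIZE_BYTES):
    """Return a list of problems found in a local PDF before upload.

    Hypothetical helper for illustration; it does not call any Edilitics API.
    """
    problems = []

    size = os.path.getsize(path)
    if size > max_bytes:
        problems.append(f"file is {size} bytes, above the {max_bytes}-byte limit")
        return problems  # skip content checks for oversized files

    with open(path, "rb") as handle:
        data = handle.read()

    # PDF files begin with a "%PDF-" header; tolerate a few leading bytes.
    if b"%PDF-" not in data[:1024]:
        problems.append("file does not look like a PDF (missing %PDF- header)")

    # Heuristic only: encrypted PDFs reference an /Encrypt dictionary.
    if b"/Encrypt" in data:
        problems.append("file appears to be password-protected; have the password ready")

    return problems
```

An empty list means none of these local checks found a problem. It does not confirm that the file contains extractable tables, which only the Edilitics validation step can determine.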
File Security and Management
Edilitics implements robust security protocols for handling PDF files:
- Security Scans: Uploaded files are validated for potential risks and data integrity.
- Data Extraction: All tabular data from the PDF is extracted, and each table is saved with the naming convention "PageNo_Table".
- Encryption: Extracted data is securely encrypted during storage and decrypted only during user access or workflow execution (Replicate, Transform, Visualize).
- Permanent Deletion: Upon deleting an integration, the file and all associated extracted data are permanently removed from Edilitics systems, ensuring compliance with data privacy standards.
Supported Data Structures
Edilitics scans all pages in the PDF file for tabular data. Any detected tables are extracted, structured, and stored as separate tables.
Data Type | Description | Example |
---|---|---|
Tabular Data | Structured rows and columns within a PDF file. | Tables with columns for Date, Product, Quantity, and Price. |
Note: Non-tabular data (e.g., text and images) is not extracted or stored.
Steps to Integrate PDF Files
Step 1: Add the PDF Connector
- Navigate to the Integrations module in Edilitics.
- Click on New Integration.
- Search for and select the PDF connector.
Step 2: Configure the Integration
Enter the following details on the setup screen:
Field Name | Details |
---|---|
Integration Title | A unique identifier for your integration. |
Integration Description | A concise summary of the tabular data being extracted. |
File Upload | Upload the PDF file directly from your local storage (must be ≤ 100 MB). |
Password Protection | Specify if the file is password-protected (True or False). If True, provide the password. |
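The relationship between these fields can be sketched as a small validation routine, for example to check values before filling in the setup screen. This is an illustration only: the class `PdfIntegrationConfig` and its attribute names are hypothetical and do not correspond to an Edilitics API, the title is assumed to be mandatory, and the 100 MB limit is assumed to mean 100 × 1024 × 1024 bytes.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 100 * 1024 * 1024


@dataclass
class PdfIntegrationConfig:
    """Hypothetical mirror of the setup-screen fields (not an Edilitics API)."""

    title: str
    description: str
    file_size_bytes: int
    password_protected: bool = False
    password: Optional[str] = None

    def validate(self) -> List[str]:
        """Return a list of field-level errors; empty means the values are consistent."""
        errors = []
        # Assumption: the title is required because it identifies the integration.
        if not self.title.strip():
            errors.append("Integration Title is required")
        if self.file_size_bytes > MAX_SIZE_BYTES:
            errors.append("File Upload exceeds 100 MB")
        # A password must accompany Password Protection = True.
        if self.password_protected and not self.password:
            errors.append("Password is required when Password Protection is True")
        return errors


config = PdfIntegrationConfig(
    title="Q3 sales tables",
    description="Order tables from the quarterly report",
    file_size_bytes=2_500_000,
    password_protected=True,
)
print(config.validate())
# prints ['Password is required when Password Protection is True']
```

Uniqueness of the title cannot be checked locally, so that rule is left to Edilitics itself.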
Step 3: Validate and Save
- Click Test & Save Connection to validate the uploaded file.
- Edilitics scans the file for tabular data extraction and validates schema compliance.
- Upon successful validation:
  - Extracted tables are stored as separate tables with the naming convention "PageNo_Table".
  - The file and data are securely encrypted and saved for use in workflows.