Chapter 4: Content Processing and Storage

In this chapter, we'll dive into how to process the collected content and prepare it for vector storage. We'll cover HTML content extraction, markdown conversion, text preprocessing, and efficient storage strategies.

💡 Get the Complete n8n Blueprints

Fast-track your implementation with our complete n8n blueprints, including the content processing workflow covered in this chapter. These production-ready blueprints will save you hours of setup time.

Download the Blueprints Here

Here's how the n8n workflow will look at the end: [Screenshot: the completed content processing workflow]

HTML Content Extraction

Setting Up the HTTP Request Node

The first step is to fetch the HTML content from each URL:

{
  "parameters": {
    "url": "={{ $node[\"Loop Over Items\"].json[\"loc\"] }}",
    "sendHeaders": true,
    "specifyHeaders": "json",
    "jsonHeaders": {
      "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
      "accept-language": "en-US,en;q=0.9",
      "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
    }
  }
}

Error Handling and Retries

Implement robust error handling:

{
  "retryOnFail": true,
  "maxTries": 5,
  "waitBetweenTries": 5000,
  "onError": "continueErrorOutput"
}
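
With onError set to continueErrorOutput, items that still fail after the retries are routed to the node's second (error) output instead of stopping the workflow. Here is a minimal sketch of a Code node you could attach to that output to record which URLs failed; the loc and error field names are assumptions, so adjust them to match your data:

// Sketch of an n8n Code node ("Run Once for All Items") attached to the
// error output of the HTTP Request node. Field names (loc, error) are
// assumptions — adjust them to your workflow's data.
const failedItems = [];

for (const item of $input.all()) {
  failedItems.push({
    json: {
      url: item.json.loc,
      error: item.json.error?.message ?? 'Unknown error',
      failedAt: new Date().toISOString(),
    },
  });
}

return failedItems;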

Converting Content to Markdown

Setting Up the Content Conversion Node

We use a combination of JSDOM and Turndown for HTML to Markdown conversion:

// Import required modules
const { JSDOM } = require('jsdom');
const TurndownService = require('turndown');

// Initialize Turndown service
const turndownService = new TurndownService();

// Process HTML content
const processContent = (htmlContent) => {
  // Create virtual DOM
  const dom = new JSDOM(htmlContent);
  const document = dom.window.document;

  // Remove elements that don't belong in the extracted content.
  // Note: a bare 'ads' selector would only match a literal <ads> element,
  // so ad containers are targeted with a class selector instead.
  const unwantedSelectors = [
    'script', 'style', 'nav', 'footer',
    'header', 'aside', '.ads'
  ];

  unwantedSelectors.forEach(selector => {
    document.querySelectorAll(selector).forEach(el => el.remove());
  });

  // Convert to Markdown
  return turndownService.turndown(document.body.innerHTML);
};
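
As a usage sketch, processContent can be called from an n8n Code node once the HTML has been fetched. This assumes the raw HTML arrives in item.json.data; adjust the property name to match your HTTP Request node's output:

// Usage sketch inside an n8n Code node ("Run Once for All Items").
// Assumes the fetched HTML is in item.json.data — adjust if needed.
return $input.all().map(item => ({
  json: {
    ...item.json,
    markdown: processContent(item.json.data),
  },
}));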

Content Cleaning and Preprocessing

Implement content cleaning functions:

function cleanContent(markdown) {
  return markdown
    // Remove excessive newlines
    .replace(/\n{3,}/g, '\n\n')
    // Remove HTML comments
    .replace(/<!--[\s\S]*?-->/g, '')
    // Collapse runs of spaces and tabs (keep newlines, which carry the
    // Markdown structure preserved above)
    .replace(/[ \t]+/g, ' ')
    // Clean up markdown artifacts
    .replace(/\[\s*\]/g, '')
    .trim();
}
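
In practice the conversion and cleaning steps are chained, for example:

// Convert the fetched HTML to Markdown, then strip noise before storage
const markdown = cleanContent(processContent(htmlContent));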

Text Cleaning and Preprocessing

Content Normalization

function normalizeContent(text) {
  return text
    // Convert to lowercase
    .toLowerCase()
    // Remove special characters
    .replace(/[^\w\s-]/g, '')
    // Replace multiple spaces
    .replace(/\s+/g, ' ')
    // Trim whitespace
    .trim();
}

Content Length Management

Large documents need to be split into manageable chunks while maintaining data integrity and relationships. Here's how we implement this:

// Function to split markdown content into chunks of specified size
function splitMarkdown(markdown, maxLength = 20000) {
  let chunks = [];
  let currentIndex = 0;

  while (currentIndex < markdown.length) {
    // Find the last period before maxLength to avoid splitting sentences
    let endIndex = currentIndex + maxLength;
    if (endIndex < markdown.length) {
      // Look for the last period within the last 1000 characters of the chunk
      const searchRange = markdown.substring(endIndex - 1000, endIndex);
      const lastPeriodIndex = searchRange.lastIndexOf('.');

      if (lastPeriodIndex !== -1) {
        // End the chunk just after the period so the sentence stays intact
        endIndex = endIndex - (1000 - lastPeriodIndex) + 1;
      }
    } else {
      endIndex = markdown.length;
    }

    // Extract the chunk
    const chunk = markdown.substring(currentIndex, endIndex);
    chunks.push(chunk);

    // Move to next chunk
    currentIndex = endIndex;
  }

  return chunks;
}

// Process each item and handle content splitting
const processContentChunks = (items) => {
  const processedItems = [];

  items.forEach(item => {
    const markdown = item.json.data.markdown;

    // If markdown is longer than maxLength, split it
    if (markdown.length > 20000) {
      const chunks = splitMarkdown(markdown);

      // Create new items for each chunk
      chunks.forEach((chunk, index) => {
        processedItems.push({
          json: {
            indexName: item.json.indexName,
            namespace: item.json.namespace,
            data: {
              markdown: chunk,
              url: item.json.data.url,
              title: item.json.data.title
            },
            // Keep original ID for first chunk, add index for others
            id: index === 0 ? item.json.id : `${item.json.id}-${index}`
          }
        });
      });
    } else {
      // If markdown is within length limit, keep original item
      processedItems.push(item);
    }
  });

  return processedItems;
};

This implementation provides several key benefits:

  1. Intelligent Splitting

    • Splits content at sentence boundaries to maintain readability
    • Preserves document structure and metadata
    • Ensures no content is lost during splitting
  2. ID Management

    • Maintains original ID for the first chunk
    • Creates sequential IDs for additional chunks (-1, -2, etc.)
    • Ensures unique identification while preserving origin
  3. Length Control

    • Maintains chunks under 20,000 characters
    • Optimizes for vector storage limits
    • Improves processing efficiency
  4. Data Integrity

    • Preserves all metadata and references
    • Maintains URL and title information
    • Keeps indexing and namespace consistency

Example of split document IDs:

Original: https://bizstack.tech/entrepreneur-spotlight
Chunk 1: https://bizstack.tech/entrepreneur-spotlight        (original ID)
Chunk 2: https://bizstack.tech/entrepreneur-spotlight-1
Chunk 3: https://bizstack.tech/entrepreneur-spotlight-2

Metadata Extraction

Extract and structure metadata:

// document is the JSDOM document created during conversion; pageUrl is the
// URL of the page being processed. cleanUrl() is assumed to be a small
// URL-normalizing helper defined elsewhere in the workflow.
function extractMetadata(document, pageUrl) {
  return {
    title: document.title,
    description: document.querySelector('meta[name="description"]')?.content,
    keywords: document.querySelector('meta[name="keywords"]')?.content,
    lastModified: document.querySelector('meta[property="article:modified_time"]')?.content,
    author: document.querySelector('meta[name="author"]')?.content,
    url: cleanUrl(pageUrl)
  };
}
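
A brief usage sketch, reusing the virtual DOM created during conversion (pageUrl stands for the URL of the page being processed):

// Usage sketch: extract metadata from the same virtual DOM used for
// Markdown conversion. pageUrl comes from the sitemap item.
const dom = new JSDOM(htmlContent);
const metadata = extractMetadata(dom.window.document, pageUrl);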

Implementing the Pinecone Upsert Tool

Workflow Structure

The Pinecone upsert tool consists of three main components:

  1. Document Processing
  2. Embedding Generation (see the sketch after this list)
  3. Vector Storage
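
The embedding model itself is not built in this chapter's snippets; it arrives as the embeddingsInput parameter. As one possible way to construct it in a standalone script (an assumption — any LangChain Embeddings implementation will work), you could use OpenAI embeddings:

// One possible way to construct embeddingsInput (assumption: OpenAI
// embeddings via LangChain; swap in any Embeddings implementation).
const { OpenAIEmbeddings } = require('@langchain/openai');

const embeddingsInput = new OpenAIEmbeddings({
  model: 'text-embedding-3-small',
  apiKey: process.env.OPENAI_API_KEY,
});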

Document Processing

async function processToPinecone({
  inputData,
  embeddingsInput,
  pineconeApiKey,
  indexName,
  namespace,
}) {
  // Document preparation
  const docs = inputData.map(item => {
    const { data, id } = item.json;

    // Combine text content
    let textContent = [];
    for (const [key, value] of Object.entries(data)) {
      if (typeof value === 'string') {
        textContent.push(`${key}: ${value}`);
      }
    }

    return {
      pageContent: textContent.join('\n'),
      metadata: {
        source: id,
        ...data,
        original_structure: Object.keys(data).join(',')
      },
      id: id
    };
  });

  return docs;
}

Text Chunking Strategy

Implement efficient text chunking:

const { RecursiveCharacterTextSplitter } = require("langchain/text_splitter");

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 0,
});

// Process and split documents
const processedDocs = [];
for (const doc of docs) {
  if (typeof doc.pageContent === 'string' && doc.pageContent.trim()) {
    const cleanContent = doc.pageContent
      .replace(/\s+/g, ' ')
      .trim();

    const fragments = await splitter.createDocuments([cleanContent]);
    processedDocs.push(...fragments.map((fragment, idx) => ({
      ...fragment,
      metadata: { 
        ...fragment.metadata,
        ...doc.metadata,
        chunk_index: idx,
        total_chunks: fragments.length
      },
      id: `${doc.id}|${idx}`
    })));
  }
}

Vector Storage Management

Pinecone Configuration

Set up the vector store:

const { PineconeStore } = require('@langchain/pinecone');
const { Pinecone } = require('@pinecone-database/pinecone');

const pinecone = new Pinecone({ 
  apiKey: pineconeApiKey 
});

const pineconeIndex = pinecone.Index(indexName);

const vectorStore = new PineconeStore(embeddingsInput, {
  namespace: namespace || undefined,
  pineconeIndex,
});

Efficient Upsert Strategy

Implement batch upserts:

// Define IDs for upsert
const ids = processedDocs.map(doc => `${doc.id}`);
await vectorStore.addDocuments(processedDocs, ids);
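
A single addDocuments call works for small runs, but for larger crawls it is safer to upsert in batches so each request stays well within Pinecone's payload limits. A minimal sketch (the batch size of 100 is an assumption — tune it for your data):

// Upsert in batches to keep individual requests small.
const batchSize = 100;

for (let i = 0; i < processedDocs.length; i += batchSize) {
  const batch = processedDocs.slice(i, i + batchSize);
  const batchIds = batch.map(doc => `${doc.id}`);
  await vectorStore.addDocuments(batch, batchIds);
}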

Performance Optimization

Memory Management

Implement efficient memory handling:

function optimizeMemoryUsage(docs) {
  // Process in chunks to manage memory
  const chunkSize = 100;
  const chunks = [];

  for (let i = 0; i < docs.length; i += chunkSize) {
    chunks.push(docs.slice(i, i + chunkSize));
  }

  return chunks;
}
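
A usage sketch, processing the batches sequentially so only one batch of documents is in flight at a time (storeBatch is a hypothetical helper standing in for your embedding and upsert step):

// Process batches one at a time to keep memory usage flat.
// storeBatch is a hypothetical helper (embed + upsert one batch).
for (const batch of optimizeMemoryUsage(processedDocs)) {
  await storeBatch(batch);
}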

Processing Speed Optimization

async function processChunksInParallel(chunks) {
  const results = await Promise.all(
    chunks.map(async chunk => {
      return await processChunk(chunk);
    })
  );

  return results.flat();
}
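
Here processChunk stands in for whatever work you do per batch (embedding, upserting, logging). Note that Promise.all fires every batch at once, which can trip API rate limits on large sites; a simple way to cap concurrency (a sketch, with the limit of 3 as an assumption) is to process the batches in small groups:

// Process chunks in groups of `concurrency` to avoid overwhelming
// rate-limited APIs. processChunk is the same per-batch worker as above.
async function processChunksWithLimit(chunks, concurrency = 3) {
  const results = [];

  for (let i = 0; i < chunks.length; i += concurrency) {
    const group = chunks.slice(i, i + concurrency);
    const groupResults = await Promise.all(group.map(chunk => processChunk(chunk)));
    results.push(...groupResults);
  }

  return results.flat();
}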

Best Practices and Tips

  1. Content Quality

    • Remove boilerplate content
    • Maintain semantic structure
    • Preserve important metadata
  2. Storage Efficiency

    • Implement deduplication (see the sketch after this list)
    • Use appropriate chunk sizes
    • Maintain proper indexing
  3. Error Handling

    • Implement retry mechanisms
    • Log processing errors
    • Monitor storage operations
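
As a concrete example of the deduplication point above, one simple approach is to hash each chunk's text and skip exact duplicates before upserting. This is only a sketch — it catches identical chunks, not near-duplicates:

// Minimal deduplication sketch: hash each chunk's content and drop
// exact repeats before upserting. crypto is a Node.js built-in.
const crypto = require('crypto');

function dedupeDocs(docs) {
  const seen = new Set();

  return docs.filter(doc => {
    const hash = crypto.createHash('sha256').update(doc.pageContent).digest('hex');
    if (seen.has(hash)) return false;
    seen.add(hash);
    return true;
  });
}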

Next Steps

With content processing and storage implemented, we're ready to move on to vector storage and retrieval in the next chapter.

Key Takeaways:

  • Fetch each page with an HTTP Request node that sends browser-like headers, retries on failure, and routes persistent failures to the error output.
  • Convert the fetched HTML to Markdown with JSDOM and Turndown, removing scripts, styles, navigation, and other boilerplate first.
  • Clean and normalize the Markdown, and split documents longer than 20,000 characters at sentence boundaries while preserving IDs and metadata.
  • Chunk the processed content with RecursiveCharacterTextSplitter and upsert the chunks, with their metadata, into Pinecone.

Next Chapter: Vector Storage and Retrieval