Chapter 3: Building the Foundation - Data Collection

In this chapter, we'll build the foundation of our RAG chatbot by implementing efficient data collection mechanisms. We'll focus on creating a robust workflow that crawls website sitemaps, processes URLs, and prepares content for vector storage.

💡 Get the Complete n8n Blueprints

Want to fast-track your implementation? You can download the complete n8n blueprints for all workflows discussed in this book, including the data collection workflow covered in this chapter. These production-ready blueprints will save you hours of setup time.

Download the Blueprints Here

Understanding Sitemaps and Web Crawling

What is a Sitemap?

A sitemap is an XML file that lists a website's important URLs, often with metadata such as the last modification date (`<lastmod>`), the expected change frequency (`<changefreq>`), and a relative priority (`<priority>`).

Example sitemap structure:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
    <lastmod>2024-03-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
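Outside of n8n, extracting the `<loc>` entries from a sitemap like this can be sketched in a few lines of JavaScript. This is a minimal illustration using a regular expression; the workflow itself uses a proper XML parser, which is more robust:

```javascript
// Minimal sketch: pull every <loc> value out of a sitemap string.
function extractLocs(xml) {
  const locs = [];
  const re = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match;
  while ((match = re.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}

const sample = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page1</loc></url>
  <url><loc>https://example.com/page2</loc></url>
</urlset>`;

console.log(extractLocs(sample));
// → [ 'https://example.com/page1', 'https://example.com/page2' ]
```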

Sitemap Types and Management

For larger websites, you might encounter different types of sitemaps:

  1. Main sitemap index
  2. Content-specific sitemaps
  3. News sitemaps
  4. Image sitemaps

In our implementation, we handle multiple sitemap types:

const sitemap_urls = [
  "https://newsletter.bizstack.tech/sitemap.xml",
  "https://bizstack.tech/wp-sitemap-posts-blog-1.xml",
  "https://bizstack.tech/wp-sitemap-posts-page-1.xml",
  "https://bizstack.tech/wp-sitemap-posts-post-1.xml",
  "https://bizstack.tech/seo_generator_sitemap_1.xml",
  "https://bizstack.tech/sitemap-news.xml",
  "https://bizstack.tech/wp-sitemap-taxonomies-dealstore-1.xml"
];

Setting Up the URL Collection Workflow

Workflow Overview

Our URL collection workflow consists of several key components:

  1. Trigger mechanism
  2. Sitemap selection
  3. XML processing
  4. URL extraction
  5. Batch processing

Implementation in n8n

Let's break down each component:

1. Trigger Node Setup

// Schedule Trigger configuration
{
  "rule": {
    "interval": [
      {
        "field": "hours",
        "hoursInterval": 6
      }
    ]
  }
}

2. Sitemap Selection Logic

// Choose sitemap of the day
const currentDayOfYear = Math.floor(
  (new Date() - new Date(new Date().getFullYear(), 0, 0)) / 
  (1000 * 60 * 60 * 24)
);

const index = currentDayOfYear % sitemap_urls.length;
const selectedUrl = sitemap_urls[index];

This rotation system ensures that:

  • Each sitemap is refreshed on a predictable cadence (with seven sitemaps, roughly weekly)
  • Each run touches only one sitemap, keeping server load light
  • All sitemaps are covered without tracking any state between runs
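A quick sanity check of the rotation, runnable as plain JavaScript:

```javascript
// With 7 sitemaps, each day of the year maps to exactly one sitemap,
// cycling through the full list every week.
const sitemapCount = 7;
const indicesForWeek = [];
for (let day = 1; day <= 7; day++) {
  indicesForWeek.push(day % sitemapCount);
}
console.log(indicesForWeek); // → [ 1, 2, 3, 4, 5, 6, 0 ]
```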

Implementing the XML Parser

XML to JSON Conversion

The XML parsing node configuration:

{
  "parameters": {
    "options": {
      "explicitRoot": false,
      "ignoreAttrs": true
    }
  }
}

URL Extraction and Validation

After parsing, we extract URLs and validate them:

// URL processing and validation
function processUrl(url) {
  try {
    const urlObj = new URL(url);
    return {
      url: urlObj.href,
      domain: urlObj.hostname,
      path: urlObj.pathname,
      isValid: true
    };
  } catch (error) {
    return {
      url: url,
      isValid: false,
      error: error.message
    };
  }
}
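The same function combines naturally with a deduplication pass, since sitemaps can repeat URLs across files. The `validUniqueUrls` helper below is a hypothetical sketch, not part of the workflow itself:

```javascript
// Validate, then deduplicate by normalized href.
function processUrl(url) {
  try {
    const urlObj = new URL(url);
    return {
      url: urlObj.href,
      domain: urlObj.hostname,
      path: urlObj.pathname,
      isValid: true
    };
  } catch (error) {
    return { url, isValid: false, error: error.message };
  }
}

function validUniqueUrls(urls) {
  const seen = new Set();
  const results = [];
  for (const raw of urls) {
    const processed = processUrl(raw);
    if (processed.isValid && !seen.has(processed.url)) {
      seen.add(processed.url);
      results.push(processed);
    }
  }
  return results;
}

const urls = ["https://example.com/a", "https://example.com/a", "not a url"];
console.log(validUniqueUrls(urls).length); // → 1
```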

Managing Data Batches

Batch Processing Implementation

We use n8n's Split In Batches node:

{
  "parameters": {
    "batchSize": 10,
    "options": {}
  }
}

Benefits of Batch Processing:

  1. Resource Management

    • Controlled memory usage
    • Predictable processing time
    • Better error recovery
  2. Rate Limit Compliance

    • Respect API limits
    • Avoid server overload
    • Maintain good citizenship
  3. Progress Tracking

    • Better monitoring
    • Easier debugging
    • Resume capability
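Conceptually, the node chunks incoming items like this (a plain-JavaScript sketch of the batching logic, not the node's actual implementation):

```javascript
// Split a list of items into groups of batchSize for sequential processing.
function splitInBatches(items, batchSize) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

const urls = Array.from({ length: 25 }, (_, i) => `https://example.com/page${i}`);
const batches = splitInBatches(urls, 10);
console.log(batches.length);    // → 3
console.log(batches[2].length); // → 5
```

Processing one small batch at a time is what keeps memory bounded and makes a failed run resumable from the last completed batch.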

Error Handling and Retry Mechanisms

Implementing Robust Error Handling

// HTTP Request Node configuration
{
  "parameters": {
    "url": "={{ $node[\"Loop Over Items\"].json[\"loc\"] }}",
    "options": {
      "retry": {
        "maxTries": 5,
        "waitBetweenTries": 5000
      }
    }
  }
}

Error Types and Handling Strategies

  1. Network Errors

    • Implement exponential backoff
    • Track failed URLs
    • Alert on persistent failures
  2. Rate Limiting

    • Respect retry-after headers
    • Implement request queuing
    • Adjust batch sizes dynamically
  3. Content Errors

    • Log malformed content
    • Skip problematic URLs
    • Report for manual review
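The exponential backoff from point 1 can be sketched as follows. `fetchWithBackoff` is a hypothetical helper for illustration, not an n8n node; it assumes a runtime with a global `fetch` (e.g. Node 18+):

```javascript
// Delay doubles with each attempt: 1s, 2s, 4s, 8s, ...
function backoffDelay(attempt, baseDelayMs = 1000) {
  return baseDelayMs * 2 ** (attempt - 1);
}

async function fetchWithBackoff(url, maxTries = 5) {
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    try {
      const res = await fetch(url);
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return res;
    } catch (err) {
      if (attempt === maxTries) throw err; // persistent failure: surface it
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
}
```

Doubling the wait gives transient failures time to clear while capping the total number of attempts.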

Logging and Monitoring

// Error logging implementation
function logError(error, context) {
  return {
    timestamp: new Date().toISOString(),
    error: error.message,
    context: context,
    url: context.url,
    attempt: context.attempt,
    type: error.name
  };
}
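A usage sketch for `logError`; the shape of the `context` object shown here is an assumption for illustration:

```javascript
function logError(error, context) {
  return {
    timestamp: new Date().toISOString(),
    error: error.message,
    context: context,
    url: context.url,
    attempt: context.attempt,
    type: error.name
  };
}

const entry = logError(new TypeError("fetch failed"), {
  url: "https://example.com/page1",
  attempt: 3
});
console.log(entry.type);    // → "TypeError"
console.log(entry.attempt); // → 3
```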

Performance Optimization

Caching Strategy

Implement efficient caching:

// KVStorage node configuration
{
  "parameters": {
    "operation": "setValue",
    "key": "={{ $('Loop Over Items').item.json.loc }}",
    "val": "={{ $now }}",
    "expire": false
  }
}

Resource Usage Optimization

  1. Memory Management

    • Clear processed data
    • Implement streaming where possible
    • Monitor memory usage
  2. CPU Optimization

    • Batch processing
    • Efficient parsing
    • Asynchronous operations

Incremental Updates

Implement smart update detection:

// Check if update is needed
{
  "conditions": {
    "leftValue": "={{ $node[\"KVStorage\"].json[\"val\"][\"0\"] }}",
    "rightValue": "={{ $('Loop Over Items').item.json.lastmod }}",
    "operator": {
      "type": "dateTime",
      "operation": "before"
    }
  }
}
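The same check, sketched in plain JavaScript (the `needsUpdate` helper is hypothetical, for illustration only): re-crawl a URL only when its `lastmod` is newer than the cached crawl timestamp.

```javascript
// Compare the cached crawl time against the sitemap's lastmod date.
function needsUpdate(lastCrawledIso, lastmodIso) {
  if (!lastCrawledIso) return true; // never crawled before
  return new Date(lastCrawledIso) < new Date(lastmodIso);
}

console.log(needsUpdate("2024-03-01T00:00:00Z", "2024-03-15")); // → true
console.log(needsUpdate("2024-03-20T00:00:00Z", "2024-03-15")); // → false
console.log(needsUpdate(null, "2024-03-15"));                   // → true
```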

Best Practices and Common Pitfalls

Best Practices

  1. URL Management

    • Deduplicate URLs
    • Respect robots.txt
    • Handle redirects properly
  2. Resource Conservation

    • Implement rate limiting
    • Use conditional processing
    • Clean up temporary data
  3. Monitoring

    • Track success rates
    • Monitor processing time
    • Log error patterns

Common Pitfalls

  1. Resource Exhaustion

    • Memory leaks
    • CPU spikes
    • Network congestion
  2. Data Quality Issues

    • Malformed URLs
    • Invalid XML
    • Character encoding problems
  3. Process Management

    • Incomplete batches
    • Stuck processes
    • Lost updates

Next Steps

With the data collection foundation in place, we're ready to move on to processing and storing the collected content.

Key Takeaways:

  • Sitemaps give us a structured, low-cost way to discover a site's important URLs
  • Rotating through multiple sitemaps spreads crawling load evenly over time
  • Batch processing keeps memory usage, rate limits, and error recovery manageable
  • Retries with backoff, caching, and lastmod-based checks avoid redundant work

Next Chapter: Content Processing and Storage