Chapter 3: Building the Foundation - Data Collection
In this chapter, we'll build the foundation of our RAG chatbot by implementing efficient data collection mechanisms. We'll focus on creating a robust workflow that crawls website sitemaps, processes URLs, and prepares content for vector storage.
💡 Get the Complete n8n Blueprints
Want to fast-track your implementation? You can download the complete n8n blueprints for all workflows discussed in this book, including the data collection workflow covered in this chapter. These production-ready blueprints will save you hours of setup time.
Understanding Sitemaps and Web Crawling
What is a Sitemap?
A sitemap is an XML file that lists important URLs of a website, often including metadata such as:
- Last modification date
- Update frequency
- Priority
Example sitemap structure:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
    <lastmod>2024-03-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
Sitemap Types and Management
For larger websites, you might encounter different types of sitemaps:
- Main sitemap index
- Content-specific sitemaps
- News sitemaps
- Image sitemaps
In our implementation, we handle multiple sitemap types:
const sitemap_urls = [
  "https://newsletter.bizstack.tech/sitemap.xml",
  "https://bizstack.tech/wp-sitemap-posts-blog-1.xml",
  "https://bizstack.tech/wp-sitemap-posts-page-1.xml",
  "https://bizstack.tech/wp-sitemap-posts-post-1.xml",
  "https://bizstack.tech/seo_generator_sitemap_1.xml",
  "https://bizstack.tech/sitemap-news.xml",
  "https://bizstack.tech/wp-sitemap-taxonomies-dealstore-1.xml"
];
Setting Up the URL Collection Workflow
Workflow Overview
Our URL collection workflow consists of several key components:
- Trigger mechanism
- Sitemap selection
- XML processing
- URL extraction
- Batch processing
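Before we wire these up as n8n nodes, it helps to see the whole pipeline in one place. The snippet below is a plain Node.js sketch of the same steps, not part of the n8n workflow itself; it assumes Node 18+ (for the built-in fetch) and uses a stand-in sitemap URL:
// Minimal end-to-end sketch: fetch a sitemap, extract URLs, and batch them
const SITEMAP_URL = "https://example.com/sitemap.xml"; // stand-in sitemap

async function collectUrls() {
  // 1. Fetch the sitemap XML
  const xml = await (await fetch(SITEMAP_URL)).text();

  // 2. Extract <loc> entries (a real workflow would use a proper XML parser)
  const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((match) => match[1]);

  // 3. Split into batches of 10 for downstream processing
  const batches = [];
  for (let i = 0; i < urls.length; i += 10) {
    batches.push(urls.slice(i, i + 10));
  }
  return batches;
}

collectUrls().then((batches) => console.log(`${batches.length} batches ready`));
In the n8n workflow, each of these steps becomes its own node, which is what the rest of this chapter walks through.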
Implementation in n8n
Let's break down each component:
1. Trigger Node Setup
// Schedule Trigger configuration
{
  "rule": {
    "interval": [
      {
        "field": "hours",
        "hoursInterval": 6
      }
    ]
  }
}
2. Sitemap Selection Logic
// Choose sitemap of the day
const currentDayOfYear = Math.floor(
  (new Date() - new Date(new Date().getFullYear(), 0, 0)) /
    (1000 * 60 * 60 * 24)
);
const index = currentDayOfYear % sitemap_urls.length;
const selectedUrl = sitemap_urls[index];
This rotation system ensures:
- Even distribution of processing load
- Regular updates of all content
- Efficient resource utilization
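To convince yourself the rotation behaves as expected, you can run the same selection logic over a range of dates. The following sketch uses stand-in sitemap names rather than the real list above:
// Quick check: which sitemap is selected on which day of the year
const sitemap_urls = ["sitemap-a.xml", "sitemap-b.xml", "sitemap-c.xml"]; // stand-ins

function sitemapForDate(date) {
  const startOfYear = new Date(date.getFullYear(), 0, 0);
  const dayOfYear = Math.floor((date - startOfYear) / (1000 * 60 * 60 * 24));
  return sitemap_urls[dayOfYear % sitemap_urls.length];
}

for (let day = 1; day <= 7; day++) {
  const d = new Date(2024, 0, day); // first week of January 2024
  console.log(d.toDateString(), "->", sitemapForDate(d));
}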
Implementing the XML Parser
XML to JSON Conversion
The XML parsing node configuration:
{
  "parameters": {
    "options": {
      "explicitRoot": false,
      "ignoreAttrs": true
    }
  }
}
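These option names map onto the xml2js library, which n8n's XML node is built on: explicitRoot: false drops the outer <urlset> wrapper, and ignoreAttrs: true discards attributes such as the xmlns declaration. A standalone sketch of the same conversion, assuming xml2js is installed via npm:
// Standalone conversion with xml2js, using the same options as the XML node
const { parseStringPromise } = require("xml2js");

const xml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
    <lastmod>2024-03-15</lastmod>
  </url>
</urlset>`;

parseStringPromise(xml, { explicitRoot: false, ignoreAttrs: true }).then((result) => {
  console.log(JSON.stringify(result, null, 2));
  // -> { "url": [ { "loc": ["https://example.com/page1"], "lastmod": ["2024-03-15"] } ] }
});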
URL Extraction and Validation
After parsing, we extract URLs and validate them:
// URL processing and validation
function processUrl(url) {
  try {
    const urlObj = new URL(url);
    return {
      url: urlObj.href,
      domain: urlObj.hostname,
      path: urlObj.pathname,
      isValid: true
    };
  } catch (error) {
    return {
      url: url,
      isValid: false,
      error: error.message
    };
  }
}
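Applied to the parsed sitemap entries, this lets us carry forward only well-formed URLs while keeping the lastmod value for change detection later. A brief usage sketch with made-up entries:
// Keep only valid URLs, preserving lastmod for the incremental-update check later
const entries = [
  { loc: "https://example.com/page1", lastmod: "2024-03-15" },
  { loc: "not a url", lastmod: "2024-03-15" }
];

const validItems = entries
  .map((entry) => ({ ...processUrl(entry.loc), lastmod: entry.lastmod }))
  .filter((item) => item.isValid);
// validItems contains only the example.com entry; the malformed one is dropped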
Managing Data Batches
Batch Processing Implementation
We use n8n's Split In Batches node (labelled "Loop Over Items" in recent n8n versions, which is why later expressions reference that node name):
{
  "parameters": {
    "batchSize": 10,
    "options": {}
  }
}
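Conceptually, the node does what a chunk-and-await loop would do in plain code: finish one batch before starting the next. A minimal sketch, with an optional pause between batches to spread out requests:
// Process URLs in fixed-size batches, pausing between batches to spread load
async function processInBatches(urls, handler, batchSize = 10, pauseMs = 1000) {
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    await Promise.all(batch.map(handler)); // finish this batch before starting the next
    if (i + batchSize < urls.length) {
      await new Promise((resolve) => setTimeout(resolve, pauseMs));
    }
  }
}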
Benefits of Batch Processing:
Resource Management
- Controlled memory usage
- Predictable processing time
- Better error recovery
Rate Limit Compliance
- Respect API limits
- Avoid server overload
- Maintain good citizenship
Progress Tracking
- Better monitoring
- Easier debugging
- Resume capability
Error Handling and Retry Mechanisms
Implementing Robust Error Handling
// HTTP Request node configuration (retry settings live at the node level, alongside "parameters")
{
  "parameters": {
    "url": "={{ $node[\"Loop Over Items\"].json[\"loc\"] }}"
  },
  "retryOnFail": true,
  "maxTries": 5,
  "waitBetweenTries": 5000
}
Error Types and Handling Strategies
Network Errors
- Implement exponential backoff (see the sketch after these lists)
- Track failed URLs
- Alert on persistent failures
Rate Limiting
- Respect retry-after headers
- Implement request queuing
- Adjust batch sizes dynamically
Content Errors
- Log malformed content
- Skip problematic URLs
- Report for manual review
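The first two strategies can be combined in a single request wrapper. Below is a minimal sketch, assuming Node 18+ for the built-in fetch; fetchWithRetry is a name invented here for illustration, not an n8n or library function:
// Retry with exponential backoff, honoring Retry-After when the server sends one
async function fetchWithRetry(url, maxTries = 5, baseDelayMs = 1000) {
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    try {
      const response = await fetch(url);
      if (response.status === 429 || response.status === 503) {
        // Rate limited: prefer the server's Retry-After header if present
        const retryAfter = Number(response.headers.get("retry-after"));
        const delay = retryAfter ? retryAfter * 1000 : baseDelayMs * 2 ** (attempt - 1);
        await new Promise((resolve) => setTimeout(resolve, delay));
        continue;
      }
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      return await response.text();
    } catch (error) {
      if (attempt === maxTries) throw error; // persistent failure: surface it for alerting
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}
Within n8n itself, the node-level retry settings shown earlier cover the basic case; a Code node along these lines is only needed when you want header-aware backoff.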
Logging and Monitoring
// Error logging implementation
function logError(error, context) {
  return {
    timestamp: new Date().toISOString(),
    error: error.message,
    context: context,
    url: context.url,
    attempt: context.attempt,
    type: error.name
  };
}
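In a Code node, this can sit on the catch path so every failure produces a structured record. A brief sketch, where fetchPage stands in for whatever request helper you use:
// Collect structured error records while processing URLs
const errors = [];

async function crawlWithLogging(url, attempt) {
  try {
    return await fetchPage(url); // fetchPage is a placeholder for your request helper
  } catch (error) {
    errors.push(logError(error, { url, attempt }));
    return null; // keep going; the error records are reported at the end of the run
  }
}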
Performance Optimization
Caching Strategy
Implement efficient caching:
// KVStorage node configuration
{
  "parameters": {
    "operation": "setValue",
    "key": "={{ $('Loop Over Items').item.json.loc }}",
    "val": "={{ $now }}",
    "expire": false
  }
}
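The key-value storage node used here is typically installed as a community node. If it is not available in your instance, n8n's built-in workflow static data can play the same role. The sketch below targets a Code node set to "Run Once for Each Item" (the field names are assumptions); note that static data only persists for production executions, not manual test runs:
// Code node sketch: remember when each URL was last processed using workflow static data
const staticData = $getWorkflowStaticData('global');
staticData.lastProcessed = staticData.lastProcessed || {};

const url = $input.item.json.loc; // 'loc' comes from the parsed sitemap entry
staticData.lastProcessed[url] = new Date().toISOString();

return { json: { url, cachedAt: staticData.lastProcessed[url] } };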
Resource Usage Optimization
Memory Management
- Clear processed data
- Implement streaming where possible
- Monitor memory usage
CPU Optimization
- Batch processing
- Efficient parsing
- Asynchronous operations
Incremental Updates
Implement smart update detection:
// Check if update is needed
{
  "conditions": {
    "leftValue": "={{ $node[\"KVStorage\"].json[\"val\"][\"0\"] }}",
    "rightValue": "={{ $('Loop Over Items').item.json.lastmod }}",
    "operator": {
      "type": "dateTime",
      "operation": "before"
    }
  }
}
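The same comparison in plain JavaScript makes the intent explicit: re-process a URL only when the sitemap's lastmod is newer than the timestamp we cached. A small sketch, assuming both values are ISO-8601 date strings:
// Decide whether a URL needs re-processing: cached timestamp vs. sitemap lastmod
function needsUpdate(cachedAt, lastmod) {
  if (!cachedAt) return true; // never processed before
  if (!lastmod) return true;  // no lastmod hint, so re-crawl to be safe
  return new Date(cachedAt) < new Date(lastmod);
}

needsUpdate("2024-03-10T00:00:00Z", "2024-03-15"); // true  -> re-crawl
needsUpdate("2024-03-20T00:00:00Z", "2024-03-15"); // false -> skip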
Best Practices and Common Pitfalls
Best Practices
URL Management
- Deduplicate URLs (see the sketch after this list)
- Respect robots.txt
- Handle redirects properly
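Deduplication mostly comes down to normalizing each URL and tracking which ones have already been seen. A minimal sketch:
// Deduplicate URLs collected from multiple sitemaps, normalizing trivial variations
function deduplicateUrls(urls) {
  const seen = new Set();
  const unique = [];
  for (const raw of urls) {
    try {
      const u = new URL(raw);
      u.hash = "";                           // drop fragments
      const key = u.href.replace(/\/$/, ""); // treat trailing slash as equivalent
      if (!seen.has(key)) {
        seen.add(key);
        unique.push(u.href);
      }
    } catch {
      // skip malformed URLs; processUrl() already flags these upstream
    }
  }
  return unique;
}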
Resource Conservation
- Implement rate limiting
- Use conditional processing
- Clean up temporary data
Monitoring
- Track success rates
- Monitor processing time
- Log error patterns
Common Pitfalls
Resource Exhaustion
- Memory leaks
- CPU spikes
- Network congestion
Data Quality Issues
- Malformed URLs
- Invalid XML
- Character encoding problems
Process Management
- Incomplete batches
- Stuck processes
- Lost updates
Next Steps
With the data collection foundation in place, we're ready to move on to processing and storing the collected content. In the next chapter, we'll cover:
- Content extraction
- Text processing
- Vector storage preparation
- Metadata management
Key Takeaways:
- Efficient sitemap processing
- Robust error handling
- Resource optimization
- Incremental update strategy
Next Chapter: Content Processing and Storage