How to Build an OpenClaw Web Scraping Skill
Web scraping with OpenClaw enables automated data extraction from websites: product prices, job listings, news articles, competitor data, and more. This advanced guide covers building a robust scraping skill with Playwright (for JavaScript-heavy sites) or Cheerio (for static HTML), including pagination, error handling, and anti-bot measures.
Why This Is Hard to Do Yourself
These are the common pitfalls that trip people up.
Anti-bot detection and blocking
Modern sites use Cloudflare, Imperva, and browser fingerprinting to block scrapers, and headless-browser detection is increasingly sophisticated
Dynamic content and pagination
JavaScript-rendered content, infinite scroll, and complex pagination require browser automation, not just HTTP requests
Rate limiting and politeness
Aggressive scraping gets you IP-banned. You need delays, rotating proxies, and respect for robots.txt
Data extraction reliability
Websites change their HTML structure constantly. Selectors break without warning and need fallback strategies (see the sketch after this list)
Data cleaning and normalization
Scraped data is messy: extra whitespace, inconsistent formats, HTML entities. Output needs cleaning and validation
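To make the selector problem concrete: a common mitigation is to try an ordered list of selectors and fall back to the next one when the preferred selector stops matching. A minimal sketch, with invented selector names:
// Fallback-selector helper (illustrative only; selector names are made up).
function queryWithFallback(root, selectors) {
  for (const selector of selectors) {
    const element = root.querySelector(selector);
    if (element) return element;
  }
  return null; // nothing matched - caller should log this and flag the record
}
// e.g. queryWithFallback(document, ['.product-price', '[data-price]', '.price'])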
Step-by-Step Guide
Choose scraping approach (Playwright vs Cheerio)
# Decision matrix:
# Use Cheerio (fast, simple) if:
# - Site is server-rendered HTML
# - No JavaScript required to load content
# - Static pagination
# - No login required
# Use Playwright (slower, powerful) if:
# - Content loads via JavaScript (React, Vue, etc.)
# - Infinite scroll or lazy loading
# - Forms, logins, or interactions required
# - Anti-bot detection present
# For this guide, we'll use Playwright (more common case)
# Install dependencies:
npm install playwright cheerio
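Note that Playwright also needs browser binaries before chromium.launch() will work (npx playwright install chromium). If you are unsure which side of the decision matrix a site falls on, one rough heuristic is to fetch the raw HTML and check whether the data you want is already there; here is a sketch, where the .product-card selector is a placeholder:
// Rough heuristic (sketch): if the data is in the raw HTML, Cheerio is enough;
// otherwise the page is likely rendered client-side and needs Playwright.
import * as cheerio from 'cheerio';

async function needsPlaywright(url, selector = '.product-card') {
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);
  return $(selector).length === 0; // nothing in static HTML -> probably JS-rendered
}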
Create the scraping skill
# Create skill structure:
mkdir -p ~/.openclaw/skills/web-scraper/scripts
cat > ~/.openclaw/skills/web-scraper/skill.md << 'EOF'
---
name: web-scraper
version: 1.0.0
description: Extracts structured data from websites
permissions:
- network:outbound
- process:spawn
- filesystem:write
triggers:
- command: /scrape
- pattern: "scrape|extract data from"
---
## Instructions
You are a web scraping specialist.
When asked to scrape a website:
1. Determine if Playwright or Cheerio is needed
2. Navigate to the target URL
3. Extract the requested data using CSS selectors or XPath
4. Handle pagination if needed
5. Clean and normalize the output
6. Return structured data (JSON or CSV)
7. Respect rate limits and robots.txt
EOF
Warning: Web scraping may violate a website's Terms of Service. Always check robots.txt and terms before scraping. Some sites explicitly prohibit automated access.
Implement URL parsing and validation
// ~/.openclaw/skills/web-scraper/scripts/scraper.js
import { chromium } from 'playwright';
import * as cheerio from 'cheerio';
export async function scrape(url, options = {}) {
// Validate URL
try {
new URL(url);
} catch {
throw new Error('Invalid URL provided');
}
// Check robots.txt (simplified)
if (options.respectRobotsTxt) {
await checkRobotsTxt(url);
}
// Choose scraping method
if (options.usePlaywright) {
return await scrapeWithPlaywright(url, options);
} else {
return await scrapeWithCheerio(url, options);
}
}
async function checkRobotsTxt(url) {
const { origin } = new URL(url);
const robotsUrl = `${origin}/robots.txt`;
try {
const response = await fetch(robotsUrl);
const text = await response.text();
// Simplified check - production should use robots-parser library
if (text.includes('Disallow: /')) {
console.warn('Site may disallow scraping. Check robots.txt manually.');
}
} catch {
// robots.txt not found - proceed with caution
}
}
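The simplified check above only warns. For a real allow/deny decision you could swap in the robots-parser package mentioned in the comment; here is a rough sketch, assuming npm install robots-parser and a placeholder user-agent string:
// Stricter check using the robots-parser package (sketch; install separately).
import robotsParser from 'robots-parser';

async function isAllowedByRobots(url, userAgent = 'openclaw-scraper') {
  const { origin } = new URL(url);
  const robotsUrl = `${origin}/robots.txt`;
  try {
    const response = await fetch(robotsUrl);
    if (!response.ok) return true; // no robots.txt - assume allowed
    const robots = robotsParser(robotsUrl, await response.text());
    return robots.isAllowed(url, userAgent) !== false;
  } catch {
    return true; // robots.txt unreachable - proceed with caution
  }
}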
Add data extraction logic
// Playwright-based scraping with selectors:
async function scrapeWithPlaywright(url, options) {
const browser = await chromium.launch({
headless: true,
args: ['--no-sandbox', '--disable-setuid-sandbox']
});
const context = await browser.newContext({
userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
viewport: { width: 1280, height: 720 }
});
const page = await context.newPage();
try {
await page.goto(url, { waitUntil: 'networkidle' });
// Wait for content to load
if (options.waitFor) {
await page.waitForSelector(options.waitFor);
}
// Extract data using provided selectors
const data = await page.evaluate((selectors) => {
const results = [];
const items = document.querySelectorAll(selectors.item);
items.forEach(item => {
const result = {};
for (const [key, selector] of Object.entries(selectors.fields)) {
const element = item.querySelector(selector);
result[key] = element ? element.textContent.trim() : null;
}
results.push(result);
});
return results;
}, options.selectors);
return data;
} finally {
await browser.close();
}
}
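// scrape() above also calls scrapeWithCheerio for static pages, which this guide
// does not show. A minimal sketch, assuming the same options.selectors shape and
// the cheerio import at the top of scraper.js:
async function scrapeWithCheerio(url, options) {
  const response = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36' }
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const $ = cheerio.load(await response.text());
  const results = [];
  $(options.selectors.item).each((_, item) => {
    const result = {};
    for (const [key, selector] of Object.entries(options.selectors.fields)) {
      const element = $(item).find(selector).first();
      result[key] = element.length ? element.text().trim() : null;
    }
    results.push(result);
  });
  return results;
}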
// Example usage:
// scrape('https://example.com/products', {
// usePlaywright: true,
// selectors: {
// item: '.product-card',
// fields: {
// title: '.product-title',
// price: '.product-price',
// url: 'a'
// }
// }
// });
Handle pagination and multiple pages
// Add pagination support:
export async function scrapeMultiplePages(url, options) {
const allData = [];
let currentPage = 1;
let hasNextPage = true;
while (hasNextPage && currentPage <= (options.maxPages || 10)) {
console.log(`Scraping page ${currentPage}...`);
// Build paginated URL
const pageUrl = options.paginationTemplate
? options.paginationTemplate.replace('{page}', currentPage)
: `${url}?page=${currentPage}`;
// Scrape this page
const pageData = await scrape(pageUrl, options);
allData.push(...pageData);
// Check if there's a next page
// (This logic varies by site - example only)
if (pageData.length === 0) {
hasNextPage = false;
}
currentPage++;
// Polite delay between requests
await sleep(options.delayMs || 2000);
}
return allData;
}
function sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
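// The hasNextPage check above is site-specific. One common pattern, sketched here
// with a hypothetical 'a.pagination-next' selector, is to look for a "next" link
// on a Playwright page and stop paginating when it disappears:
async function hasNextPageLink(page, selector = 'a.pagination-next') {
  const nextLink = await page.$(selector);
  return nextLink !== null;
}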
// Example with pagination:
// scrapeMultiplePages('https://example.com/products', {
// paginationTemplate: 'https://example.com/products?page={page}',
// maxPages: 5,
// delayMs: 3000
// });
Warning: Always add delays between pages. Scraping too fast is rude, wastes server resources, and will get you IP-banned quickly.
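For infinite-scroll pages (mentioned in the decision matrix) there is no page URL to increment. One approach with Playwright, sketched below, is to scroll until the item count stops growing; you would call this inside scrapeWithPlaywright after page.goto and before extracting. The selector and limits are placeholders:
// Sketch: load an infinite-scroll list by scrolling until no new items appear.
async function autoScroll(page, itemSelector = '.product-card', maxRounds = 20) {
  let previousCount = 0;
  for (let round = 0; round < maxRounds; round++) {
    const count = await page.locator(itemSelector).count();
    if (round > 0 && count === previousCount) break; // no new items loaded
    previousCount = count;
    await page.mouse.wheel(0, 2000);   // scroll down
    await page.waitForTimeout(1500);   // give lazy content time to load
  }
}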
Configure output formatting and cleaning
// Clean and normalize scraped data:
export function cleanData(data) {
return data.map(item => {
const cleaned = {};
for (const [key, value] of Object.entries(item)) {
if (typeof value === 'string') {
// Remove extra whitespace
let cleanValue = value.replace(/\s+/g, ' ').trim();
// Decode HTML entities
cleanValue = decodeHtmlEntities(cleanValue);
// Remove common cruft
cleanValue = cleanValue.replace(/^[\s\n\t]+|[\s\n\t]+$/g, '');
cleaned[key] = cleanValue;
} else {
cleaned[key] = value;
}
}
return cleaned;
});
}
function decodeHtmlEntities(text) {
const entities = {
'&amp;': '&',
'&lt;': '<',
'&gt;': '>',
'&quot;': '"',
'&#39;': "'"
};
return text.replace(/&[^;]+;/g, match => entities[match] || match);
}
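// Illustrative input/output for cleanData() (values invented for the example):
// cleanData([{ title: '  Widget \n  Pro ', price: ' $19.99 ', inStock: true }])
// // => [{ title: 'Widget Pro', price: '$19.99', inStock: true }]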
// Export to CSV or JSON:
export function exportData(data, format = 'json') {
if (format === 'json') {
return JSON.stringify(data, null, 2);
} else if (format === 'csv') {
const headers = Object.keys(data[0] || {});
const rows = data.map(item =>
headers.map(h => JSON.stringify(item[h] || '')).join(',')
);
return [headers.join(','), ...rows].join('\n');
}
}
Add error handling and rate limiting
// Robust scraping with retries and rate limiting:
export async function scrapeWithRetry(url, options, maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
const data = await scrape(url, options);
const cleaned = cleanData(data);
return cleaned;
} catch (error) {
if (attempt === maxRetries) {
throw new Error(`Failed after ${maxRetries} attempts: ${error.message}`);
}
// Exponential backoff
const delayMs = 1000 * Math.pow(2, attempt);
console.warn(`Attempt ${attempt} failed, retrying in ${delayMs}ms...`);
await sleep(delayMs);
}
}
}
// Rate limiter:
class RateLimiter {
constructor(maxRequestsPerSecond = 1) {
this.maxRequests = maxRequestsPerSecond;
this.requests = [];
}
async waitForSlot() {
const now = Date.now();
this.requests = this.requests.filter(time => now - time < 1000);
if (this.requests.length >= this.maxRequests) {
const oldestRequest = Math.min(...this.requests);
const waitTime = 1000 - (now - oldestRequest);
await sleep(waitTime);
}
this.requests.push(Date.now());
}
}
const limiter = new RateLimiter(1); // 1 request per second
export async function scrapeSafely(url, options) {
await limiter.waitForSlot();
return await scrapeWithRetry(url, options);
}
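Finally, a rough end-to-end sketch tying the pieces above together; the URL, selectors, and output filename are placeholders:
// Example wiring (sketch - URL, selectors, and output path are placeholders):
import { writeFile } from 'node:fs/promises';

const data = await scrapeMultiplePages('https://example.com/products', {
  usePlaywright: true,
  respectRobotsTxt: true,
  selectors: {
    item: '.product-card',
    fields: { title: '.product-title', price: '.product-price' }
  },
  maxPages: 3,
  delayMs: 3000
});
await writeFile('products.json', exportData(cleanData(data), 'json'));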
Web Scraping That Actually Works in Production
Anti-bot detection, dynamic content, pagination edge cases, rate limiting: web scraping is full of challenges. Our experts build scrapers that stay online and extract clean data reliably.
Get matched with a specialist who can help.