Add Web Scraping Capabilities to AI with LangChain
LangChain is a framework for developing applications powered by language models. By integrating Browserless with LangChain, you can provide your AI applications with powerful web scraping and content processing capabilities without managing browser infrastructure.
Prerequisites
- Python 3.8 or higher
- An active Browserless API Token (available in your account dashboard)
- Basic understanding of LangChain concepts
Step-by-Step Setup
Go to your Browserless Account Dashboard and copy your API token.
Then set the BROWSERLESS_API_TOKEN environment variable, either in a .env file or directly in your shell.

In your .env file:

BROWSERLESS_API_TOKEN=your-token-here

Or from the command line:

export BROWSERLESS_API_TOKEN=your-token-here
Set up a Python virtual environment to manage your dependencies:
With venv:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Or with conda:

conda create -n langchain-env python=3.8
conda activate langchain-env
Install LangChain and other required packages:
With pip:

pip install langchain-community python-dotenv

Or with Poetry:

poetry add langchain-community python-dotenv
Create a file named scraper.py with the following complete code:
import os

from dotenv import load_dotenv
from langchain_community.document_loaders import BrowserlessLoader


def main():
    # Load environment variables
    load_dotenv()

    # Initialize the loader with your API token
    loader = BrowserlessLoader(
        api_token=os.getenv("BROWSERLESS_API_TOKEN"),
        urls=["https://example.com"],
        text_content=True,  # Get text content instead of raw HTML
    )

    # Load and process the documents
    documents = loader.load()

    # Print the results
    for doc in documents:
        print(f"Source: {doc.metadata.get('source')}")
        print(f"Content: {doc.page_content[:200]}...")


if __name__ == "__main__":
    main()
Run your application with the following command:
python scraper.py
You should see output showing the scraped content from the example website.
How It Works
1. Connection Setup: BrowserlessLoader connects to Browserless using your API token
2. Content Loading: The loader fetches and processes web content
3. Document Creation: Content is converted into LangChain Documents
4. Processing: Documents can be further processed with LangChain's tools
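To make step 3 concrete, here is a minimal sketch that inspects the Document objects the loader returns (the URL is just a placeholder):

import os

from langchain_community.document_loaders import BrowserlessLoader

loader = BrowserlessLoader(
    api_token=os.getenv("BROWSERLESS_API_TOKEN"),
    urls=["https://example.com"],
    text_content=True,
)
documents = loader.load()

# Each result is a standard LangChain Document: extracted text plus metadata
doc = documents[0]
print(doc.metadata)            # e.g. {'source': 'https://example.com'}
print(doc.page_content[:100])  # first 100 characters of the page text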
Advanced Configuration
Multiple URLs
Process multiple websites in a single operation:
loader = BrowserlessLoader(
    api_token=api_token,
    urls=[
        "https://example1.com",
        "https://example2.com",
        "https://example3.com",
    ],
)
Raw HTML Mode
Get raw HTML content instead of text:
loader = BrowserlessLoader(
    api_token=api_token,
    urls=["https://example.com"],
    text_content=False,
)
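Raw HTML is useful when you need specific page elements rather than flattened text. As one possible approach (outside the core integration, and assuming an extra pip install beautifulsoup4), you could parse the returned markup with BeautifulSoup:

import os

from bs4 import BeautifulSoup
from langchain_community.document_loaders import BrowserlessLoader

loader = BrowserlessLoader(
    api_token=os.getenv("BROWSERLESS_API_TOKEN"),
    urls=["https://example.com"],
    text_content=False,  # return raw HTML instead of extracted text
)

for doc in loader.load():
    soup = BeautifulSoup(doc.page_content, "html.parser")
    # Example: collect every link target on the page
    links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
    print(f"{doc.metadata.get('source')}: {len(links)} links")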
Performance Optimization
- Batch Processing
  - Process multiple URLs in batches (a sketch follows this list)
  - Implement proper error handling
  - Use async/await for better performance
- Resource Management
  - Monitor memory usage
  - Implement proper cleanup
  - Handle timeouts appropriately
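Here is a minimal sketch of batched loading with per-batch error handling. The load_in_batches helper and the batch_size default are illustrative assumptions, not part of the BrowserlessLoader API:

import os

from langchain_community.document_loaders import BrowserlessLoader


def load_in_batches(api_token, urls, batch_size=5):
    """Hypothetical helper: load URLs in fixed-size batches, skipping failed batches."""
    documents = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        loader = BrowserlessLoader(
            api_token=api_token,
            urls=batch,
            text_content=True,
        )
        try:
            documents.extend(loader.load())
        except Exception as exc:
            # Keep going if one batch fails; log it for later inspection
            print(f"Batch starting at index {i} failed: {exc}")
    return documents


docs = load_in_batches(os.getenv("BROWSERLESS_API_TOKEN"), ["https://example.com"])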
Security Best Practices
- API Token Management
  - Never commit tokens to version control
  - Use environment variables
  - Rotate tokens regularly
- Input Validation
  - Validate URLs before processing (a sketch follows this list)
  - Implement rate limiting
  - Handle sensitive data appropriately
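One way to validate URLs before handing them to the loader is with Python's standard library; the is_valid_url helper below is an illustrative assumption:

from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Hypothetical check: accept only absolute http(s) URLs with a hostname."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


candidates = ["https://example.com", "ftp://example.com", "not-a-url"]
safe_urls = [u for u in candidates if is_valid_url(u)]
print(safe_urls)  # ['https://example.com']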
Common Use Cases
News Aggregation
from langchain_community.document_loaders import BrowserlessLoader


def aggregate_news(api_token, news_sites):
    """Scrape a list of news sites and print a preview of each page."""
    loader = BrowserlessLoader(
        api_token=api_token,
        urls=news_sites,
        text_content=True,
    )
    documents = loader.load()

    # Process and analyze the news content
    for doc in documents:
        print(f"Source: {doc.metadata.get('source')}")
        print(f"Content: {doc.page_content[:200]}...")
Content Analysis
from langchain_community.document_loaders import BrowserlessLoader
# Requires an extra package: pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter


def analyze_content(api_token, url):
    # Load content
    loader = BrowserlessLoader(
        api_token=api_token,
        urls=[url],
        text_content=True,
    )
    documents = loader.load()

    # Split content into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    chunks = text_splitter.split_documents(documents)

    # Process chunks
    for chunk in chunks:
        print(f"Chunk: {chunk.page_content[:100]}...")
For more advanced usage scenarios, refer to the Browserless and LangChain documentation.