@@ -10434,3 +10434,279 @@ Or should we proceed with the current settings? The 2-second delay is conservative
#### Ok I did run the thing but it only scraped 5 posts. Maybe we need to include a system that captures links from inside the blog and adds them to a list (with deduplication) so we can scrape the whole blog?
> Initial repo scan can be slow in larger repos, but only happens once.
# aider chat started at 2025-04-01 15:26:34
> Newer aider version v0.80.1 is available.
> /opt/homebrew/Cellar/aider/0.79.2/libexec/bin/python -m pip install --upgrade --upgrade-strategy only-if-needed aider-chat
> Run pip install? (Y)es/(N)o [Yes]: n
> /opt/homebrew/bin/aider --model deepseek --api-key deepseek=sk-6ac14a65923941c99c65f1e9e347efac
> Aider v0.79.2
> Model: deepseek/deepseek-chat with diff edit format, prompt cache, infinite output
> Git repo: .git with 10 files
> Repo-map: using 4096 tokens, auto refresh
#### /add generate-epub.ts scrape-blog.ts
> Added generate-epub.ts to the chat
> Added scrape-blog.ts to the chat
#### Ok so I have these two files and I want to use them to scrape a blog and create an epub. But when I run them sequentially it only ever gets 5 posts. I think the problem is that we don't have a solution in place to know the URL of every post from that blog. The RSS feed is not of much help. Maybe we can parse every link we find while scraping, check if it's part of the blog we're scraping, deduplicate, and then parse them all in search of posts. If the URL matches the regex we already have in place, we can be fairly certain that there's a blog post inside. Also, I'd like some sort of logging of the process too
I'll help modify the scraping logic to be more thorough and add better logging. We'll need to:
1. Add a recursive link discovery function
2. Improve the URL filtering logic
3. Add comprehensive logging
4. Handle pagination if needed
Here are the changes:
scrape-blog.ts
```typescript
<<<<<<< SEARCH
async function fetchBlogPosts(): Promise<BlogPost[]> {
  console.log(`Fetching blog posts from ${BLOG_URL}...`);
  const html = await fetchWithRetry(BLOG_URL);
  const { document } = parseHTML(html);
  // Extract and filter post URLs
  const postLinks = [...document.querySelectorAll('div.posttitle a')] as HTMLAnchorElement[];
  const postUrls = postLinks
    .map(link => link.href)
    .filter(url => {
      // Only include URLs that look like actual posts
      const isPost = /\/\d{4}\/\d{2}\/\d{2}\//.test(url);
      if (!isPost) {
        console.log(`Skipping non-post URL: ${url}`);
      }
      return isPost;
    });
=======
async function discoverAllPostUrls(baseUrl: string, maxDepth = 3): Promise<string[]> {
  const discoveredUrls = new Set<string>();
  const queue: {url: string; depth: number}[] = [{url: baseUrl, depth: 0}];
  const postUrls = new Set<string>();
  console.log(`Starting URL discovery from ${baseUrl} with max depth ${maxDepth}`);
  while (queue.length > 0) {
    const {url, depth} = queue.shift()!;
    if (depth > maxDepth) continue;
    if (discoveredUrls.has(url)) continue;
    console.log(`Processing URL (depth ${depth}): ${url}`);
    discoveredUrls.add(url);
    try {
      const html = await fetchWithRetry(url);
      const {document} = parseHTML(html);
      // Check if this is a post URL
      if (/\d{4}\/\d{2}\/\d{2}\//.test(url)) {
        postUrls.add(url);
        console.log(`Found post URL: ${url}`);
      }
      // Find all links on page
      const links = [...document.querySelectorAll('a[href]')] as HTMLAnchorElement[];
      for (const link of links) {
        const href = link.href;
        if (!href || href.startsWith('#')) continue;
        // Normalize URL and check if it belongs to our domain
        const urlObj = new URL(href, baseUrl);
        if (urlObj.hostname === new URL(baseUrl).hostname) {
          const normalizedUrl = urlObj.toString();
          if (!discoveredUrls.has(normalizedUrl)) {
            queue.push({url: normalizedUrl, depth: depth + 1});
          }
        }
      }
      // Check for pagination links
      const nextPageLink = document.querySelector('a.next') as HTMLAnchorElement;
      if (nextPageLink?.href) {
        const nextUrl = new URL(nextPageLink.href, baseUrl).toString();
        if (!discoveredUrls.has(nextUrl)) {
          queue.push({url: nextUrl, depth: depth + 1});
        }
      }
      await setTimeout(SCRAPE_CONFIG.delayBetweenRequests);
    } catch (error) {
      console.error(`Error processing ${url}:`, error instanceof Error ? error.message : error);
    }
  }
  console.log(`Discovered ${postUrls.size} potential post URLs`);
  return Array.from(postUrls);
}

async function fetchBlogPosts(): Promise<BlogPost[]> {
  console.log(`Starting blog post scraping from ${BLOG_URL}...`);
  const postUrls = await discoverAllPostUrls(BLOG_URL);
  if (postUrls.length === 0) {
    console.warn('No post URLs discovered!');
    return [];
  }
>>>>>>> REPLACE
```
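The block above keeps using `fetchWithRetry`, `parseHTML`, `BLOG_URL` and `SCRAPE_CONFIG.delayBetweenRequests`, which should already exist in scrape-blog.ts, plus a promise-based `setTimeout`. In case any of them are missing, here's a minimal sketch of what they could look like — the linkedom import is a guess based on the `parseHTML` call, and the config values and retry logic are placeholders, not what your file actually contains:
```typescript
// These helpers are assumed to exist in scrape-blog.ts already; the versions below are
// only a sketch with made-up config values, in case any of them are missing.
import { parseHTML } from 'linkedom';         // gives us the `document` used above (assumption)
import { setTimeout } from 'timers/promises'; // promise-based sleep (works under Bun and Node)

const BLOG_URL = 'https://cronicasperiodistas.wordpress.com'; // placeholder; use the real blog URL
const SCRAPE_CONFIG = { delayBetweenRequests: 2000, maxRetries: 3 }; // illustrative values

// Minimal retry wrapper; your existing fetchWithRetry may differ.
async function fetchWithRetry(url: string, retries = SCRAPE_CONFIG.maxRetries): Promise<string> {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const res = await fetch(url);
      if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
      return await res.text();
    } catch (err) {
      if (attempt === retries) throw err;
      await setTimeout(SCRAPE_CONFIG.delayBetweenRequests); // back off before retrying
    }
  }
  throw new Error('unreachable');
}
```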
scrape-blog.ts
```typescript
<<<<<<< SEARCH
  // Process posts sequentially with delay
  const results = [];
  for (const url of postUrls) { // Process all posts
    results.push(await processPost(url));
  }
  const posts: BlogPost[] = results.filter((post): post is BlogPost => post !== null);
  console.log(`Successfully processed ${posts.length}/${postUrls.length} posts`);
=======
  // Process posts with progress logging
  const posts: BlogPost[] = [];
  let processed = 0;
  const total = postUrls.length;
  for (const url of postUrls) {
    processed++;
    try {
      console.log(`[${processed}/${total}] Processing post: ${url}`);
      const post = await processPost(url);
      if (post) {
        posts.push(post);
        console.log(`[${processed}/${total}] Successfully processed: ${post.title}`);
      }
    } catch (error) {
      console.error(`[${processed}/${total}] Failed to process ${url}:`, error instanceof Error ? error.message : error);
    }
  }
  console.log(`Finished processing. Success rate: ${posts.length}/${postUrls.length} (${Math.round((posts.length/postUrls.length)*100)}%)`);
>>>>>>> REPLACE
```
This improved version will:
1. Crawl the entire site recursively up to a configurable depth (default 3)
2. Follow pagination links if they exist
3. Keep track of all discovered URLs to avoid duplicates
4. Only process URLs that match our post pattern (containing a date)
5. Provide detailed progress logging with success rates
6. Handle errors gracefully while continuing with other URLs
To run the updated scraper:
```bash
bun run scrape-blog.ts
```
The scraper will now:
- First discover all potential post URLs by crawling the site
- Then process each discovered URL to extract post content
- Show detailed progress for each step
- Give a final success rate summary
You can adjust the crawling depth by changing the `maxDepth` parameter in the `discoverAllPostUrls` call if needed.
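For example, a hypothetical depth of 5 would look like this inside `fetchBlogPosts`:
```typescript
// Deeper crawls follow more links, but make more requests and take longer.
const postUrls = await discoverAllPostUrls(BLOG_URL, 5);
```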
> Tokens: 4.2k sent, 1.2k received. Cost: $0.0025 message, $0.0025 session.
> Applied edit to scrape-blog.ts
> Commit 30e5828 feat: add recursive URL discovery with improved logging
> bun run scrape-blog.ts
> Run shell command? (Y)es/(N)o/(D)on't ask again [Yes]: y
> Running bun run scrape-blog.ts
> Add command output to the chat? (Y)es/(N)o/(D)on't ask again [Yes]: n
> You can use /undo to undo and discard each aider commit.
#### What are we doing with the discovered urls? Are we checking that they start with https://cronicasperiodistas.wordpress.com and then match the (/\d{4}\/\d{2}\/\d{2}\//) regex so we only add those to the queue? Where can I see the queue (or if the number in the queue is increasing)?
> https://cronicasperiodistas.wordpress.com
> Add URL to the chat? (Y)es/(N)o/(D)on't ask again [Yes]: n
Let me explain the URL discovery logic and suggest some improvements to make it more transparent:
1. Currently in `discoverAllPostUrls()` we:
- Only enqueue links whose hostname matches the base URL (cronicasperiodisticas.wordpress.com)
- Check whether each processed URL's path matches the date pattern (/\d{4}\/\d{2}\/\d{2}\//)
- Add URLs matching that pattern to the `postUrls` Set
2. The queue is managed via the `queue` array which contains objects with `{url, depth}`
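To make that concrete, the queue is a plain array used FIFO-style, so the crawl is breadth-first; here's a small illustrative sketch (the type alias and example URLs are mine, the real code types the entries inline):
```typescript
// Each crawl-queue entry pairs a URL with the depth at which it was discovered.
type QueueEntry = { url: string; depth: number };

const queue: QueueEntry[] = [{ url: 'https://example.com/', depth: 0 }];   // seeded with the base URL
const next = queue.shift()!;                                               // dequeue from the front (FIFO → breadth-first)
queue.push({ url: 'https://example.com/about/', depth: next.depth + 1 });  // newly found links go one level deeper
```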
Here are the improvements we should make:
scrape-blog.ts
```typescript
<<<<<<< SEARCH
      // Check if this is a post URL
      if (/\d{4}\/\d{2}\/\d{2}\//.test(url)) {
        postUrls.add(url);
        console.log(`Found post URL: ${url}`);
      }
=======
      // Check if this is a post URL (must match both domain and date pattern)
      const isOurDomain = new URL(url).hostname === new URL(baseUrl).hostname;
      const isPostUrl = /\d{4}\/\d{2}\/\d{2}\//.test(url);
      if (isOurDomain && isPostUrl) {
        postUrls.add(url);
        console.log(`✅ Found post URL: ${url}`);
      } else if (isOurDomain) {
        console.log(`🔍 Found internal link: ${url}`);
      } else {
        console.log(`🚫 Skipping external link: ${url}`);
      }
>>>>>>> REPLACE
```
scrape-blog.ts
```typescript
<<<<<<< SEARCH
    console.log(`Processing URL (depth ${depth}): ${url}`);
=======
    console.log(`\nProcessing URL (depth ${depth})`);
    console.log(`Queue size: ${queue.length}`);
    console.log(`Discovered URLs: ${discoveredUrls.size}`);
    console.log(`Post URLs found: ${postUrls.size}`);
    console.log(`Current URL: ${url}`);
>>>>>>> REPLACE
```
These changes will:
1. Make the URL filtering more explicit by:
- Clearly checking both domain and date pattern
- Adding different log symbols (✅/🔍/🚫) for different URL types
- Logging why each URL is included or excluded
2. Improve queue visibility by:
- Showing queue size before processing each URL
- Showing total discovered URLs count
- Showing current post URL count
- Adding newlines between URL processing logs for better readability
The queue grows when:
- New internal links are found on pages
- Pagination links ("next page") are found
- Any links matching our domain but not yet discovered are found
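Boiled down, the enqueue rule is: same hostname as the blog, and not seen before. As a standalone sketch (the helper name is mine — the real code does this check inline inside `discoverAllPostUrls`):
```typescript
// Illustrative only — should a link found on a page be added to the crawl queue?
function shouldEnqueue(href: string, baseUrl: string, discovered: Set<string>): boolean {
  const resolved = new URL(href, baseUrl);                          // resolves relative links too
  const sameSite = resolved.hostname === new URL(baseUrl).hostname; // stay on the blog's domain
  return sameSite && !discovered.has(resolved.toString());          // dedupe against already-seen URLs
}
```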
You'll now see clear real-time stats about:
- How many URLs are waiting in the queue
- How many total URLs have been discovered
- How many post URLs have been identified
- Why each URL is being included or excluded
To run the updated version:
```bash
bun run scrape-blog.ts
```
The logs will now give you much better insight into the crawling process and help identify if any valid post URLs are being missed.
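And if you want to sanity-check the filtering in isolation, you could run something like this from a scratch file with bun — the sample URLs are made up; only the hostname comparison and the date regex come from the actual code:
```typescript
// Quick manual test of the domain + date-pattern classification.
const BASE = 'https://cronicasperiodistas.wordpress.com'; // adjust if this isn't the exact blog hostname
const samples = [
  `${BASE}/2012/05/14/some-post/`, // hypothetical post URL
  `${BASE}/about/`,                // internal page, not a post
  'https://wordpress.com/themes/', // external link
];
for (const url of samples) {
  const isOurDomain = new URL(url).hostname === new URL(BASE).hostname;
  const isPostUrl = /\d{4}\/\d{2}\/\d{2}\//.test(url);
  console.log(url, isOurDomain && isPostUrl ? '✅ post' : isOurDomain ? '🔍 internal' : '🚫 external');
}
```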
> Tokens: 6.0k sent, 659 received. Cost: $0.0023 message, $0.0048 session.
> Applied edit to scrape-blog.ts
> Commit f0db6bd feat: improve URL discovery logging and filtering
> bun run scrape-blog.ts

@@ -4622,3 +4622,27 @@
# 2025-04-01 15:23:51.138962
+Ok I did run the thing but it only scraped 5 posts. Maybe we need to include a system that captures links from inside the blog and adds them to a list (with deduplication) so we can scrape the whole blog?
# 2025-04-01 15:26:36.678453
+n
# 2025-04-01 15:27:02.368640
+/add generate-epub.ts scrape-blog.ts
# 2025-04-01 15:29:59.643714
+Ok so I have these two files and I want to use them to scrape a blog and create an epub. But when I run them sequentially it only ever gets 5 posts. I think the problem is that we don't have a solution in place to know the url of every post from that blog. The rss feed is of not much help. Maybe we can parse every link we find while scraping, check if it's part of the blog we're scraping, deduplicate, and then parse them all in search of posts. If the url matches with the regex we already have in place, we can be fairly certain that there's a blog post inside. Also, I'd like some sort of logging of the process too
# 2025-04-01 15:31:11.805690
+/run bun run scrape-blog.ts
# 2025-04-01 15:31:11.806007
+/run bun run scrape-blog.ts
# 2025-04-01 15:39:29.039614
+n
# 2025-04-01 15:41:19.583320
+What are we doing with the discovered urls? Are we checking that they start with https://cronicasperiodistas.wordpress.com and then match the (/\d{4}\/\d{2}\/\d{2}\//) regex so we only add those to the queue? Where can I see the queue (or if the number in the queue is increasing)?
# 2025-04-01 15:41:21.559851
+n
