already run; obtained the JSON and the EPUB with the 644 crónicas

main
brobert 3 months ago
parent 5afdda6115
commit 2a724e2850

@ -10710,3 +10710,670 @@ The logs will now give you much better insight into the crawling process and hel
> Applied edit to scrape-blog.ts
> Commit f0db6bd feat: improve URL discovery logging and filtering
> bun run scrape-blog.ts
> Run shell command? (Y)es/(N)o/(D)on't ask again [Yes]: n
> You can use /undo to undo and discard each aider commit.
#### It grows pretty aggressively. Are you sure we're deduplicating before adding to the queue? I understand that a Set doesn't allow repeated items, but I truly wonder if it's reasonable to have over 2000 links in queue after just a couple of URLs parsed. Example of a test run I cut short:
#### $ bun run scrape-blog.ts
#### Starting blog post scraping from https://cronicasperiodisticas.wordpress.com...
#### Starting URL discovery from https://cronicasperiodisticas.wordpress.com with max depth 3
####
#### Processing URL (depth 0)
#### Queue size: 0
#### Discovered URLs: 0
#### Post URLs found: 0
#### Current URL: https://cronicasperiodisticas.wordpress.com
#### 🔍 Found internal link: https://cronicasperiodisticas.wordpress.com
####
#### Processing URL (depth 1)
#### Queue size: 412
#### Discovered URLs: 1
#### Post URLs found: 0
#### Current URL: https://cronicasperiodisticas.wordpress.com/
#### 🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/
####
#### Processing URL (depth 1)
#### Queue size: 818
#### Discovered URLs: 2
#### Post URLs found: 0
#### Current URL: https://cronicasperiodisticas.wordpress.com/feed/
#### 🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/feed/
####
#### Processing URL (depth 1)
#### Queue size: 817
#### Discovered URLs: 3
#### Post URLs found: 0
#### Current URL: https://cronicasperiodisticas.wordpress.com/tag/anfibia/
#### 🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/anfibia/
####
#### Processing URL (depth 1)
#### Queue size: 1220
#### Discovered URLs: 4
#### Post URLs found: 0
#### Current URL: https://cronicasperiodisticas.wordpress.com/tag/animales/
#### 🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/animales/
####
#### Processing URL (depth 1)
#### Queue size: 1620
#### Discovered URLs: 5
#### Post URLs found: 0
#### Current URL: https://cronicasperiodisticas.wordpress.com/tag/argentina/
#### 🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/argentina/
####
#### Processing URL (depth 1)
#### Queue size: 2017
#### Discovered URLs: 6
#### Post URLs found: 0
#### Current URL: https://cronicasperiodisticas.wordpress.com/tag/asesinato/
#### 🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/asesinato/
####
#### Processing URL (depth 1)
#### Queue size: 2411
#### Discovered URLs: 7
#### Post URLs found: 0
#### Current URL: https://cronicasperiodisticas.wordpress.com/tag/bolivia/
#### 🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/bolivia/
####
#### Processing URL (depth 1)
#### Queue size: 2809
#### Discovered URLs: 8
#### Post URLs found: 0
#### Current URL: https://cronicasperiodisticas.wordpress.com/tag/boxeo/
#### 🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/boxeo/
#### ^C
> https://cronicasperiodisticas.wordpress.com
> Add URL to the chat? (Y)es/(N)o/(A)ll/(S)kip all/(D)on't ask again [Yes]: n
> https://cronicasperiodisticas.wordpress.com/
> Add URL to the chat? (Y)es/(N)o/(A)ll/(S)kip all/(D)on't ask again [Yes]: n
> https://cronicasperiodisticas.wordpress.com/tag/asesinato/
> Add URL to the chat? (Y)es/(N)o/(A)ll/(S)kip all/(D)on't ask again [Yes]: s
> https://cronicasperiodisticas.wordpress.com/tag/animales/
> Add URL to the chat? (Y)es/(N)o/(A)ll/(S)kip all/(D)on't ask again [Yes]: s
> https://cronicasperiodisticas.wordpress.com/feed/
> Add URL to the chat? (Y)es/(N)o/(A)ll/(S)kip all/(D)on't ask again [Yes]: s
> https://cronicasperiodisticas.wordpress.com/tag/anfibia/
> Add URL to the chat? (Y)es/(N)o/(A)ll/(S)kip all/(D)on't ask again [Yes]: s
> https://cronicasperiodisticas.wordpress.com/tag/boxeo/
> Add URL to the chat? (Y)es/(N)o/(A)ll/(S)kip all/(D)on't ask again [Yes]: s
> https://cronicasperiodisticas.wordpress.com/tag/argentina/
> Add URL to the chat? (Y)es/(N)o/(A)ll/(S)kip all/(D)on't ask again [Yes]: s
> https://cronicasperiodisticas.wordpress.com/tag/bolivia/
> Add URL to the chat? (Y)es/(N)o/(A)ll/(S)kip all/(D)on't ask again [Yes]: s
You're right - the queue is growing too aggressively because we're adding all links from each page without proper deduplication. The issue is that while we check `discoveredUrls` before processing, we're adding all links to the queue first before processing them. Let's fix this by:
1. Moving the deduplication check before adding to queue
2. Adding URL normalization to handle equivalent URLs
3. Limiting the types of URLs we add to the queue
Here are the changes:
scrape-blog.ts
```typescript
<<<<<<< SEARCH
// Find all links on page
const links = [...document.querySelectorAll('a[href]')] as HTMLAnchorElement[];
for (const link of links) {
  const href = link.href;
  if (!href || href.startsWith('#')) continue;
  // Normalize URL and check if it belongs to our domain
  const urlObj = new URL(href, baseUrl);
  if (urlObj.hostname === new URL(baseUrl).hostname) {
    const normalizedUrl = urlObj.toString();
    if (!discoveredUrls.has(normalizedUrl)) {
      queue.push({url: normalizedUrl, depth: depth + 1});
    }
  }
}
=======
// Find and filter links on page
const links = [...document.querySelectorAll('a[href]')] as HTMLAnchorElement[];
for (const link of links) {
  const href = link.href;
  if (!href || href.startsWith('#')) continue;
  try {
    // Normalize URL - remove trailing slashes and query params
    const urlObj = new URL(href, baseUrl);
    if (urlObj.hostname !== new URL(baseUrl).hostname) continue;
    // Clean up URL
    urlObj.hash = '';
    urlObj.search = '';
    let normalizedUrl = urlObj.toString().replace(/\/$/, '');
    // Skip if already discovered or in queue
    if (discoveredUrls.has(normalizedUrl) ||
        queue.some(item => item.url === normalizedUrl)) {
      continue;
    }
    // Only add certain types of URLs to queue
    const path = urlObj.pathname;
    if (path.startsWith('/tag/') ||
        path.startsWith('/category/') ||
        path.startsWith('/page/') ||
        path.startsWith('/author/') ||
        path === '/' ||
        /\d{4}\/\d{2}\/\d{2}/.test(path)) {
      queue.push({url: normalizedUrl, depth: depth + 1});
    }
  } catch (error) {
    console.log(`Skipping invalid URL: ${href}`);
  }
}
>>>>>>> REPLACE
```
scrape-blog.ts
```typescript
<<<<<<< SEARCH
// Check for pagination links
const nextPageLink = document.querySelector('a.next') as HTMLAnchorElement;
if (nextPageLink?.href) {
  const nextUrl = new URL(nextPageLink.href, baseUrl).toString();
  if (!discoveredUrls.has(nextUrl)) {
    queue.push({url: nextUrl, depth: depth + 1});
  }
}
=======
// Check for pagination links (more strict matching)
const paginationLinks = [...document.querySelectorAll('a.page-numbers, a.next, a.prev')] as HTMLAnchorElement[];
for (const link of paginationLinks) {
  try {
    const urlObj = new URL(link.href, baseUrl);
    if (urlObj.hostname !== new URL(baseUrl).hostname) continue;
    urlObj.hash = '';
    urlObj.search = '';
    const normalizedUrl = urlObj.toString().replace(/\/$/, '');
    if (!discoveredUrls.has(normalizedUrl) &&
        !queue.some(item => item.url === normalizedUrl)) {
      queue.push({url: normalizedUrl, depth: depth + 1});
    }
  } catch (error) {
    console.log(`Skipping pagination link: ${link.href}`);
  }
}
>>>>>>> REPLACE
```
These changes will:
1. Normalize URLs more aggressively by:
- Removing trailing slashes
- Stripping query parameters and hashes
- Converting to absolute URLs
2. Only add specific types of URLs to the queue:
- Tag pages (/tag/)
- Category pages (/category/)
- Pagination pages (/page/)
- Author pages (/author/)
- The homepage (/)
- Posts (containing dates)
3. Check both discoveredUrls AND the queue before adding new URLs
4. Handle pagination links more carefully
5. Add better error handling for malformed URLs
The queue should now grow much more slowly since we're:
- Being more selective about which URLs we add
- Doing better deduplication
- Normalizing URLs more thoroughly
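For reference, the normalization just described boils down to something like this standalone sketch (the `normalizeUrl` helper is illustrative, not a function in scrape-blog.ts):
```typescript
// Illustrative helper: resolve a link against the base URL, strip the
// fragment and query string, and drop any trailing slash. Returns null
// for malformed or off-domain links.
function normalizeUrl(href: string, baseUrl: string): string | null {
  try {
    const urlObj = new URL(href, baseUrl);
    if (urlObj.hostname !== new URL(baseUrl).hostname) return null;
    urlObj.hash = '';
    urlObj.search = '';
    return urlObj.toString().replace(/\/$/, '');
  } catch {
    return null;
  }
}

// Both variants collapse to the same key, so the Set can deduplicate them:
const base = 'https://cronicasperiodisticas.wordpress.com';
console.log(normalizeUrl(`${base}/`, base));                   // -> base, no trailing slash
console.log(normalizeUrl('/tag/bolivia/?share=x#main', base)); // -> `${base}/tag/bolivia`
```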
Try running it again with:
```bash
bun run scrape-blog.ts
```
The queue size should now stay much more manageable. You can also try reducing the maxDepth parameter if needed.
> Tokens: 7.6k sent, 1.0k received. Cost: $0.0031 message, $0.0080 session.
> Applied edit to scrape-blog.ts
> Commit 92a3852 fix: improve URL deduplication and filtering in blog scraper
> bun run scrape-blog.ts
> Run shell command? (Y)es/(N)o/(D)on't ask again [Yes]: n
> You can use /undo to undo and discard each aider commit.
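One side note on the dedup cost in the change above: `queue.some(item => item.url === normalizedUrl)` rescans the whole queue for every candidate link, which is O(n) per link. A minimal sketch of a cheaper variant, assuming a companion set of queued URLs (the `queuedUrls` set and `enqueue` helper are illustrative, not part of scrape-blog.ts):
```typescript
type QueueItem = { url: string; depth: number };

// Hypothetical companion structures: `discoveredUrls` mirrors the script's
// processed-URL set, `queuedUrls` tracks what is already in the queue so
// membership checks are O(1) instead of an array scan.
const discoveredUrls = new Set<string>();
const queuedUrls = new Set<string>();
const queue: QueueItem[] = [];

function enqueue(url: string, depth: number): void {
  if (discoveredUrls.has(url) || queuedUrls.has(url)) return; // already seen or queued
  queuedUrls.add(url);
  queue.push({ url, depth });
}

enqueue('https://cronicasperiodisticas.wordpress.com/tag/bolivia', 1);
enqueue('https://cronicasperiodisticas.wordpress.com/tag/bolivia', 1); // no-op
console.log(queue.length); // 1
```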
#### Ok, now it's growing at a much more reasonable rate. I'll let it run. Don't change anything anymore
Understood! I'll leave the current implementation as is since the queue growth is now at a reasonable rate. The scraper should now:
1. Discover URLs in a controlled manner
2. Properly deduplicate links
3. Only queue relevant URLs
4. Show clear progress in the logs
You can monitor the progress through the detailed logging output. The key metrics to watch are:
- Queue size (should stay manageable)
- Discovered URLs count
- Post URLs found count
Let me know if you encounter any issues or if you'd like to make adjustments later. Happy scraping!
```bash
# To monitor the progress in real-time:
bun run scrape-blog.ts | grep -E 'Processing|Queue size|Found post URL'
```
> Tokens: 8.9k sent, 152 received. Cost: $0.0026 message, $0.01 session.
> # To monitor the progress in real-time:
> bun run scrape-blog.ts | grep -E 'Processing|Queue size|Found post URL'
> Run shell command? (Y)es/(N)o/(D)on't ask again [Yes]: n
#### Should I worry that there are still no post urls found? Example fragment of the output:
#### Processing URL (depth 1)
#### Queue size: 624
#### Discovered URLs: 31
#### Post URLs found: 0
#### Current URL: https://cronicasperiodisticas.wordpress.com/tag/frontera-d
#### 🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/frontera-d
####
#### Processing URL (depth 1)
#### Queue size: 631
#### Discovered URLs: 32
#### Post URLs found: 0
#### Current URL: https://cronicasperiodisticas.wordpress.com/tag/futbol
#### 🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/futbol
####
#### Processing URL (depth 1)
#### Queue size: 641
#### Discovered URLs: 33
#### Post URLs found: 0
#### Current URL: https://cronicasperiodisticas.wordpress.com/tag/gatopardo
#### 🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/gatopardo
> https://cronicasperiodisticas.wordpress.com/tag/gatopardo
> Add URL to the chat? (Y)es/(N)o/(A)ll/(S)kip all/(D)on't ask again [Yes]: n
> https://cronicasperiodisticas.wordpress.com/tag/futbol
> Add URL to the chat? (Y)es/(N)o/(A)ll/(S)kip all/(D)on't ask again [Yes]: n
> https://cronicasperiodisticas.wordpress.com/tag/frontera-d
> Add URL to the chat? (Y)es/(N)o/(A)ll/(S)kip all/(D)on't ask again [Yes]: n
Yes, it's concerning that no post URLs have been found yet. Let's analyze why this might be happening and how we can adjust the scraper:
1. The current regex pattern `/\d{4}\/\d{2}\/\d{2}\//` looks for date patterns in URLs, but WordPress might use different URL structures.
2. We should:
- Add more flexible URL matching patterns
- Log why potential post URLs are being rejected
- Consider checking the page content structure too
Here's how we can modify the post detection:
scrape-blog.ts
```typescript
<<<<<<< SEARCH
// Check if this is a post URL (must match both domain and date pattern)
const isOurDomain = new URL(url).hostname === new URL(baseUrl).hostname;
const isPostUrl = /\d{4}\/\d{2}\/\d{2}\//.test(url);
if (isOurDomain && isPostUrl) {
  postUrls.add(url);
  console.log(`✅ Found post URL: ${url}`);
} else if (isOurDomain) {
  console.log(`🔍 Found internal link: ${url}`);
} else {
  console.log(`🚫 Skipping external link: ${url}`);
}
=======
// Check if this is a post URL
const isOurDomain = new URL(url).hostname === new URL(baseUrl).hostname;
if (!isOurDomain) {
  console.log(`🚫 Skipping external link: ${url}`);
  continue;
}
// Multiple ways to identify post URLs
const path = new URL(url).pathname;
const isPostUrl =
  // Date-based pattern (YYYY/MM/DD)
  /\d{4}\/\d{2}\/\d{2}\//.test(path) ||
  // Year/month pattern (YYYY/MM)
  /\/\d{4}\/\d{2}\/$/.test(path) ||
  // Contains 'article' or 'post' in path
  /\/article\/|\/post\//.test(path) ||
  // Has post-like HTML structure
  document.querySelector('article.post, div.post, div.entry-content');
if (isPostUrl) {
  postUrls.add(url);
  console.log(`✅ Found post URL: ${url}`);
  // Debug why it was identified as post
  if (/\d{4}\/\d{2}\/\d{2}\//.test(path)) console.log(' - Matched date pattern');
  else if (/\/\d{4}\/\d{2}\/$/.test(path)) console.log(' - Matched year/month pattern');
  else if (/\/article\/|\/post\//.test(path)) console.log(' - Matched article/post path');
  else if (document.querySelector('article.post, div.post, div.entry-content')) {
    console.log(' - Matched post content structure');
  }
} else {
  console.log(`🔍 Found internal link: ${url}`);
  // Debug why it's not considered a post
  if (!path.match(/\d{4}\//)) console.log(' - No date pattern in URL');
  if (!document.querySelector('article.post, div.post')) console.log(' - No post content structure found');
}
>>>>>>> REPLACE
```
This change will:
1. Use multiple criteria to identify posts:
- Date patterns in URL (YYYY/MM/DD or YYYY/MM)
- Path containing 'article' or 'post'
- Presence of post-like HTML structure (article.post, div.post, etc.)
2. Add detailed logging showing:
- Why a URL was identified as a post
- Why a URL wasn't considered a post
3. Continue to skip external links immediately
The scraper should now:
- Find posts even if they don't follow the exact date pattern
- Give you clear feedback about why URLs are/aren't being considered posts
- Still maintain all the existing queue management benefits
You may want to run this in a new terminal window to monitor the detailed output:
```bash
bun run scrape-blog.ts | grep -E 'Found post URL|Matched|internal link'
```
This will show you the key decision points about what's being considered a post or not. If you're still not finding posts, we can further adjust the detection criteria based on what you see in the logs.
> Tokens: 9.2k sent, 932 received. Cost: $0.0035 message, $0.01 session.
> Applied edit to scrape-blog.ts
> Commit a1e8066 feat: improve post URL detection with multiple criteria
> bun run scrape-blog.ts | grep -E 'Found post URL|Matched|internal link'
> Run shell command? (Y)es/(N)o/(D)on't ask again [Yes]: n
> You can use /undo to undo and discard each aider commit.
#### Let me be clear, all posts follow the same pattern: https://cronicasperiodisticas.wordpress.com/2013/10/06/misrata-calling-xxiv/. They are all `https://cronicasperiodisticas.wordpress.com/${year}/${month}/${day}/${slug}/`. That's what we should be looking for
> https://cronicasperiodisticas.wordpress.com/2013/10/06/misrata-calling-xxiv/
> Add URL to the chat? (Y)es/(N)o/(A)ll/(S)kip all/(D)on't ask again [Yes]: n
> https://cronicasperiodisticas.wordpress.com/${year}/${month}/${day}/${slug}/`
> Add URL to the chat? (Y)es/(N)o/(A)ll/(S)kip all/(D)on't ask again [Yes]: n
I'll modify the URL detection to specifically target this pattern and remove the more generic checks. Here's the change:
scrape-blog.ts
```typescript
<<<<<<< SEARCH
// Multiple ways to identify post URLs
const path = new URL(url).pathname;
const isPostUrl =
  // Date-based pattern (YYYY/MM/DD)
  /\d{4}\/\d{2}\/\d{2}\//.test(path) ||
  // Year/month pattern (YYYY/MM)
  /\/\d{4}\/\d{2}\/$/.test(path) ||
  // Contains 'article' or 'post' in path
  /\/article\/|\/post\//.test(path) ||
  // Has post-like HTML structure
  document.querySelector('article.post, div.post, div.entry-content');
if (isPostUrl) {
  postUrls.add(url);
  console.log(`✅ Found post URL: ${url}`);
  // Debug why it was identified as post
  if (/\d{4}\/\d{2}\/\d{2}\//.test(path)) console.log(' - Matched date pattern');
  else if (/\/\d{4}\/\d{2}\/$/.test(path)) console.log(' - Matched year/month pattern');
  else if (/\/article\/|\/post\//.test(path)) console.log(' - Matched article/post path');
  else if (document.querySelector('article.post, div.post, div.entry-content')) {
    console.log(' - Matched post content structure');
  }
} else {
  console.log(`🔍 Found internal link: ${url}`);
  // Debug why it's not considered a post
  if (!path.match(/\d{4}\//)) console.log(' - No date pattern in URL');
  if (!document.querySelector('article.post, div.post')) console.log(' - No post content structure found');
}
=======
// Strict check for post URL pattern
const path = new URL(url).pathname;
const isPostUrl = /^\/\d{4}\/\d{2}\/\d{2}\/[^/]+\/$/.test(path);
if (isPostUrl) {
  postUrls.add(url);
  console.log(`✅ Found post URL: ${url}`);
} else {
  console.log(`🔍 Found internal link: ${url}`);
}
>>>>>>> REPLACE
```
And let's update the queue filtering to only add these post URLs:
scrape-blog.ts
```typescript
<<<<<<< SEARCH
// Only add certain types of URLs to queue
const path = urlObj.pathname;
if (path.startsWith('/tag/') ||
    path.startsWith('/category/') ||
    path.startsWith('/page/') ||
    path.startsWith('/author/') ||
    path === '/' ||
    /\d{4}\/\d{2}\/\d{2}/.test(path)) {
  queue.push({url: normalizedUrl, depth: depth + 1});
}
=======
// Only add pagination links and post URLs to queue
const path = urlObj.pathname;
if (path.startsWith('/page/') || // Pagination
    /^\/\d{4}\/\d{2}\/\d{2}\/[^/]+\/$/.test(path)) { // Post URLs
  queue.push({url: normalizedUrl, depth: depth + 1});
}
>>>>>>> REPLACE
```
These changes will:
1. Use a strict regex pattern that exactly matches the post URL format:
- `/YYYY/MM/DD/slug/`
- Must start with 4 digits, then 2 digits, then 2 digits, then a slug
- Must end with a trailing slash
2. Only add two types of URLs to the queue:
- Pagination links (`/page/`)
- URLs matching the exact post pattern
3. Simplify the logging to just show found posts vs other links
The scraper should now:
- Only identify actual post URLs
- Not waste time processing tag/category/author pages
- Still follow pagination to find all posts
- Be more efficient by processing fewer irrelevant URLs
Try running it again with:
```bash
bun run scrape-blog.ts
```
> Tokens: 11k sent, 901 received. Cost: $0.0039 message, $0.02 session.
> Applied edit to scrape-blog.ts
> Commit 976245d refactor: simplify URL pattern matching for blog posts
> bun run scrape-blog.ts
> Run shell command? (Y)es/(N)o/(D)on't ask again [Yes]: n
> You can use /undo to undo and discard each aider commit.
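As a quick standalone sanity check of the strict pattern (illustrative, not part of the script). Note that it insists on a trailing slash, which turns out to matter further down:
```typescript
const postPattern = /^\/\d{4}\/\d{2}\/\d{2}\/[^/]+\/$/;

// The post format the user described matches:
console.log(postPattern.test(new URL('https://cronicasperiodisticas.wordpress.com/2013/10/06/misrata-calling-xxiv/').pathname)); // true

// Tag pages do not:
console.log(postPattern.test('/tag/bolivia/')); // false

// But the same kind of post URL without a trailing slash also fails:
console.log(postPattern.test('/2010/06/01/vida-del-senor-sombrero')); // false
```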
#### It's ok to visit other pages that are not posts (like /tag/* and /category/*) to find more post urls, just don't consider them as blog posts because they are not
I'll modify the URL discovery logic to:
1. Keep crawling through tag/category pages to find more post URLs
2. Only consider URLs matching the exact post pattern as actual posts
3. Improve the logging to show what's happening more clearly
Here are the changes:
scrape-blog.ts
```typescript
<<<<<<< SEARCH
// Only add pagination links and post URLs to queue
const path = urlObj.pathname;
if (path.startsWith('/page/') || // Pagination
    /^\/\d{4}\/\d{2}\/\d{2}\/[^/]+\/$/.test(path)) { // Post URLs
  queue.push({url: normalizedUrl, depth: depth + 1});
}
=======
// Add pagination, tag, category and post URLs to queue
const path = urlObj.pathname;
if (path.startsWith('/page/') || // Pagination
    path.startsWith('/tag/') || // Tag pages
    path.startsWith('/category/') || // Category pages
    /^\/\d{4}\/\d{2}\/\d{2}\/[^/]+\/$/.test(path)) { // Post URLs
  queue.push({url: normalizedUrl, depth: depth + 1});
}
>>>>>>> REPLACE
```
scrape-blog.ts
```typescript
<<<<<<< SEARCH
// Strict check for post URL pattern
const path = new URL(url).pathname;
const isPostUrl = /^\/\d{4}\/\d{2}\/\d{2}\/[^/]+\/$/.test(path);
if (isPostUrl) {
  postUrls.add(url);
  console.log(`✅ Found post URL: ${url}`);
} else {
  console.log(`🔍 Found internal link: ${url}`);
}
=======
// Strict check for post URL pattern
const path = new URL(url).pathname;
const isPostUrl = /^\/\d{4}\/\d{2}\/\d{2}\/[^/]+\/$/.test(path);
if (isPostUrl) {
  postUrls.add(url);
  console.log(`✅ Found post URL: ${url}`);
} else {
  // Log what type of internal link we found
  if (path.startsWith('/tag/')) {
    console.log(`🏷️ Found tag page: ${url}`);
  } else if (path.startsWith('/category/')) {
    console.log(`🗂️ Found category page: ${url}`);
  } else if (path.startsWith('/page/')) {
    console.log(`📄 Found pagination page: ${url}`);
  } else {
    console.log(`🔍 Found internal link: ${url}`);
  }
}
>>>>>>> REPLACE
```
These changes will:
1. Allow crawling through tag and category pages to discover more post URLs
2. Still only consider URLs matching `/YYYY/MM/DD/slug/` as actual posts
3. Add more descriptive logging with emoji icons to easily distinguish:
- 🏷️ Tag pages
- 🗂️ Category pages
- 📄 Pagination pages
- 🔍 Other internal links
- ✅ Actual post URLs
The scraper will now:
- More thoroughly explore the site structure
- Still maintain strict criteria for what counts as a post
- Give clearer feedback about what types of pages it's finding
Try running it again with:
```bash
bun run scrape-blog.ts
```
> Tokens: 11k sent, 733 received. Cost: $0.0039 message, $0.02 session.
> Applied edit to scrape-blog.ts
> Commit bae8084 feat: Improve blog URL discovery with better logging
> bun run scrape-blog.ts
> Run shell command? (Y)es/(N)o/(D)on't ask again [Yes]: n
> You can use /undo to undo and discard each aider commit.
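Condensed into a standalone sketch, the distinction this change preserves is between pages worth crawling and pages that count as posts (the `isPost` and `isCrawlable` helper names are illustrative, not part of scrape-blog.ts):
```typescript
const isPost = (path: string): boolean =>
  /^\/\d{4}\/\d{2}\/\d{2}\/[^/]+\/$/.test(path);

const isCrawlable = (path: string): boolean =>
  path.startsWith('/page/') ||     // pagination
  path.startsWith('/tag/') ||      // tag pages
  path.startsWith('/category/') || // category pages
  isPost(path);                    // posts themselves

console.log(isCrawlable('/tag/bolivia/'), isPost('/tag/bolivia/')); // true false: crawled but not collected
console.log(isCrawlable('/2013/10/06/misrata-calling-xxiv/'), isPost('/2013/10/06/misrata-calling-xxiv/')); // true true
```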
#### Ok, I've seen the scraper pass through a post and not recognise it:
#### Processing URL (depth 2)
#### Queue size: 1127
#### Discovered URLs: 476
#### Post URLs found: 0
#### Current URL: https://cronicasperiodisticas.wordpress.com/2010/06/01/vida-del-senor-sombrero
#### 🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/2010/06/01/vida-del-senor-sombrero
#### Shouldn't that show that a post was found? Maybe the '-' character is confusing the regex? That is what I expect all urls to look like
> https://cronicasperiodisticas.wordpress.com/2010/06/01/vida-del-senor-sombrero
> Add URL to the chat? (Y)es/(N)o/(D)on't ask again [Yes]: n
You're absolutely right - the regex is too strict because it requires a trailing slash that isn't always present. Let's fix the post URL detection to handle both with and without trailing slashes, and make it more flexible with special characters in the slug.
Here's the change:
scrape-blog.ts
```typescript
<<<<<<< SEARCH
// Strict check for post URL pattern
const path = new URL(url).pathname;
const isPostUrl = /^\/\d{4}\/\d{2}\/\d{2}\/[^/]+\/$/.test(path);
=======
// Check for post URL pattern (with or without trailing slash)
const path = new URL(url).pathname;
const isPostUrl = /^\/\d{4}\/\d{2}\/\d{2}\/[^/]+(\/)?$/.test(path);
>>>>>>> REPLACE
```
And we need to update the queue filtering regex to match:
scrape-blog.ts
```typescript
<<<<<<< SEARCH
    /^\/\d{4}\/\d{2}\/\d{2}\/[^/]+\/$/.test(path)) { // Post URLs
=======
    /^\/\d{4}\/\d{2}\/\d{2}\/[^/]+(\/)?$/.test(path)) { // Post URLs
>>>>>>> REPLACE
```
These changes will:
1. Make the regex match URLs both with and without trailing slashes
2. Still maintain the strict date pattern requirement (YYYY/MM/DD)
3. Keep allowing any characters (including hyphens) in the slug portion
4. Keep all other safety checks in place
The pattern now:
- `^\/\d{4}\/\d{2}\/\d{2}\/` - Starts with /YYYY/MM/DD/
- `[^/]+` - Has one or more non-slash characters (the slug)
- `(\/)?$` - Optionally ends with a slash
This should now properly detect all post URLs like:
- https://cronicasperiodisticas.wordpress.com/2010/06/01/vida-del-senor-sombrero
- https://cronicasperiodisticas.wordpress.com/2010/06/01/vida-del-senor-sombrero/
Try running it again with:
```bash
bun run scrape-blog.ts
```
> Tokens: 12k sent, 493 received. Cost: $0.0039 message, $0.03 session.
> Applied edit to scrape-blog.ts
> Commit 5afdda6 fix: make post URL regex match with or without trailing slash
> bun run scrape-blog.ts
> Run shell command? (Y)es/(N)o/(D)on't ask again [Yes]: n
> You can use /undo to undo and discard each aider commit.
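A standalone check of the relaxed pattern (illustrative, not part of the script):
```typescript
const postPattern = /^\/\d{4}\/\d{2}\/\d{2}\/[^/]+(\/)?$/;

for (const url of [
  'https://cronicasperiodisticas.wordpress.com/2010/06/01/vida-del-senor-sombrero',
  'https://cronicasperiodisticas.wordpress.com/2010/06/01/vida-del-senor-sombrero/',
]) {
  console.log(postPattern.test(new URL(url).pathname)); // true for both
}

console.log(postPattern.test('/tag/bolivia/')); // still false
```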

@ -4646,3 +4646,139 @@
# 2025-04-01 15:41:21.559851
+n
# 2025-04-01 15:44:31.589371
+It grows pretty aggressively. Are you sure we're deduplicating before adding to the queue? I understand that a Set doesn't allow repeated items, but I truly wonder if it's reasonable to have over 2000 links in queue after just a couple of URLs parsed. Example of a test run I cut short:
+$ bun run scrape-blog.ts
+Starting blog post scraping from https://cronicasperiodisticas.wordpress.com...
+Starting URL discovery from https://cronicasperiodisticas.wordpress.com with max depth 3
+
+Processing URL (depth 0)
+Queue size: 0
+Discovered URLs: 0
+Post URLs found: 0
+Current URL: https://cronicasperiodisticas.wordpress.com
+🔍 Found internal link: https://cronicasperiodisticas.wordpress.com
+
+Processing URL (depth 1)
+Queue size: 412
+Discovered URLs: 1
+Post URLs found: 0
+Current URL: https://cronicasperiodisticas.wordpress.com/
+🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/
+
+Processing URL (depth 1)
+Queue size: 818
+Discovered URLs: 2
+Post URLs found: 0
+Current URL: https://cronicasperiodisticas.wordpress.com/feed/
+🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/feed/
+
+Processing URL (depth 1)
+Queue size: 817
+Discovered URLs: 3
+Post URLs found: 0
+Current URL: https://cronicasperiodisticas.wordpress.com/tag/anfibia/
+🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/anfibia/
+
+Processing URL (depth 1)
+Queue size: 1220
+Discovered URLs: 4
+Post URLs found: 0
+Current URL: https://cronicasperiodisticas.wordpress.com/tag/animales/
+🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/animales/
+
+Processing URL (depth 1)
+Queue size: 1620
+Discovered URLs: 5
+Post URLs found: 0
+Current URL: https://cronicasperiodisticas.wordpress.com/tag/argentina/
+🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/argentina/
+
+Processing URL (depth 1)
+Queue size: 2017
+Discovered URLs: 6
+Post URLs found: 0
+Current URL: https://cronicasperiodisticas.wordpress.com/tag/asesinato/
+🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/asesinato/
+
+Processing URL (depth 1)
+Queue size: 2411
+Discovered URLs: 7
+Post URLs found: 0
+Current URL: https://cronicasperiodisticas.wordpress.com/tag/bolivia/
+🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/bolivia/
+
+Processing URL (depth 1)
+Queue size: 2809
+Discovered URLs: 8
+Post URLs found: 0
+Current URL: https://cronicasperiodisticas.wordpress.com/tag/boxeo/
+🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/boxeo/
+^C
# 2025-04-01 15:44:33.329033
+n
# 2025-04-01 15:44:36.905612
+s
# 2025-04-01 15:45:41.062622
+n
# 2025-04-01 15:47:08.929955
+Ok, now it's growing at a much more reasonable rate. I'll let it run. Don't change anything anymore
# 2025-04-01 15:48:11.231673
+n
# 2025-04-01 15:48:32.112001
+Should I worry that there are still no post urls found? Example fragment of the output:
+Processing URL (depth 1)
+Queue size: 624
+Discovered URLs: 31
+Post URLs found: 0
+Current URL: https://cronicasperiodisticas.wordpress.com/tag/frontera-d
+🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/frontera-d
+
+Processing URL (depth 1)
+Queue size: 631
+Discovered URLs: 32
+Post URLs found: 0
+Current URL: https://cronicasperiodisticas.wordpress.com/tag/futbol
+🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/futbol
+
+Processing URL (depth 1)
+Queue size: 641
+Discovered URLs: 33
+Post URLs found: 0
+Current URL: https://cronicasperiodisticas.wordpress.com/tag/gatopardo
+🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/tag/gatopardo
# 2025-04-01 15:48:43.168936
+n
# 2025-04-01 15:51:02.552192
+Let me be clear, all posts follow the same pattern: https://cronicasperiodisticas.wordpress.com/2013/10/06/misrata-calling-xxiv/. They are all `https://cronicasperiodisticas.wordpress.com/${year}/${month}/${day}/${slug}/`. That's what we should be looking for
# 2025-04-01 15:51:03.884252
+n
# 2025-04-01 15:52:54.980096
+It's ok to visit other pages that are not posts (like /tag/* and /category/*) to find more post urls, just don't consider them as blog posts because they are not
# 2025-04-01 16:30:34.567340
+n
# 2025-04-01 16:31:46.231653
+Ok, I've seen the scraper pass through a post and not recognise it:
+Processing URL (depth 2)
+Queue size: 1127
+Discovered URLs: 476
+Post URLs found: 0
+Current URL: https://cronicasperiodisticas.wordpress.com/2010/06/01/vida-del-senor-sombrero
+🔍 Found internal link: https://cronicasperiodisticas.wordpress.com/2010/06/01/vida-del-senor-sombrero
+Shouldn't that show that a post was found? Maybe the '-' character is confusing the regex? That is what I expect all urls to look like
# 2025-04-01 16:31:48.047291
+n
