API scraping isn't new, but it's reached epidemic proportions in 2026. The Instagram breach that exposed 17.5 million records, the Facebook incident affecting 12 million users, and dozens of smaller attacks demonstrate that APIs—not databases—have become the primary target for data harvesting.
This guide explains how modern API scraping works, how to detect scraping attempts in real-time, and proven prevention strategies that don't compromise user experience.
How API Scraping Actually Works
Understanding the attacker's methodology is the first step to effective defense. Modern API scraping is sophisticated, distributed, and designed to evade basic protections.
The Anatomy of a Scraping Attack
API Discovery and Analysis
Attackers use browser developer tools, traffic interception, and automated crawlers to map your API endpoints. They identify which endpoints return user data, product catalogs, or other valuable information.
Obtaining Valid API Access
Scrapers create legitimate accounts, purchase leaked credentials, or exploit weak authentication to obtain valid API keys or session tokens. Some rent access from compromised legitimate users.
Distributed Request Networks
Attackers set up residential proxy networks (thousands of real IP addresses), rotate user agents, and distribute requests to avoid rate limiting. Cloud providers and residential ISPs are commonly abused for this.
Systematic Data Extraction
Automated scripts iterate through IDs, search terms, or pagination to extract data. They maintain request rates just below detection thresholds and mimic legitimate traffic patterns.
Common Scraping Techniques
Sequential ID enumeration: Many APIs expose resources by sequential IDs (/api/users/1, /api/users/2). Scrapers iterate through the entire range.
# Sequential scraping example
import requests

for user_id in range(1, 1000000):
    response = requests.get(
        f'https://api.example.com/users/{user_id}',
        headers={'Authorization': f'Bearer {valid_token}'}
    )
    if response.status_code == 200:
        save_to_database(response.json())
Pagination abuse: Scrapers exhaust paginated endpoints to extract complete datasets.
# Pagination scraping
import random
import time

import requests

page = 1
while True:
    response = requests.get(
        f'https://api.example.com/products?page={page}&limit=100'
    )
    data = response.json()
    save_products(data['items'])
    if not data['has_more']:
        break
    page += 1
    time.sleep(random.uniform(0.5, 2))  # Randomized delay to mimic human pacing
Search query mining: Using search endpoints to extract data based on common names, keywords, or filters.
GraphQL field overload: Exploiting GraphQL's flexibility to request massive amounts of related data in single queries.
query MassiveDataExtraction {
  users(first: 100) {
    edges {
      node {
        id, email, name, phone, address, dateOfBirth
        posts(first: 50) { id, content, createdAt }
        friends(first: 100) { id, name, email }
        settings { notifications, privacy }
      }
    }
  }
}
Detection Strategies
Effective scraping prevention starts with detection. You need to identify scraping behavior before significant data is exfiltrated.
1. Request Pattern Analysis
Legitimate users exhibit different patterns than automated scrapers:
- Sequential access patterns: Accessing /users/1, /users/2, /users/3 in quick succession
- Exhaustive pagination: Requesting every page of results without typical user navigation
- Perfect regularity: Requests at precise intervals (every 1.5 seconds)
- Abnormal data extraction: Requesting far more data than typical users
// Detection: flag users whose recent requests walk sequential resource IDs
function detectSequentialAccess(userId) {
  const recentRequests = getRecentRequests(userId, '5 minutes');
  // Sort numerically, not lexicographically, so 2 comes before 10
  const ids = recentRequests.map(r => Number(r.resourceId)).sort((a, b) => a - b);

  // Count adjacent pairs that differ by exactly 1
  let sequential = 0;
  for (let i = 1; i < ids.length; i++) {
    if (ids[i] === ids[i - 1] + 1) sequential++;
  }

  // If >80% of requests are sequential IDs, flag as scraping
  if (ids.length > 20 && sequential / ids.length > 0.8) {
    return { suspicious: true, reason: 'sequential_id_access' };
  }
  return { suspicious: false };
}
2. Velocity Tracking
Monitor request velocity across multiple dimensions:
- Requests per minute/hour/day: Track rates at different time scales
- Unique resources accessed: Scrapers access far more unique resources than normal users
- Geographic velocity: Impossible travel times (requests from New York then Tokyo 5 minutes later)
- Data volume extracted: Total bytes returned per user per time period
// Multi-dimensional velocity tracking
const VELOCITY_THRESHOLDS = {
  requests_per_minute: 60,
  requests_per_hour: 1000,
  unique_resources_per_day: 5000,
  data_mb_per_hour: 100
};

async function checkVelocity(userId) {
  const metrics = await getVelocityMetrics(userId);
  const violations = [];

  if (metrics.requestsPerMinute > VELOCITY_THRESHOLDS.requests_per_minute) {
    violations.push('excessive_rpm');
  }
  if (metrics.requestsPerHour > VELOCITY_THRESHOLDS.requests_per_hour) {
    violations.push('excessive_rph');
  }
  if (metrics.uniqueResources > VELOCITY_THRESHOLDS.unique_resources_per_day) {
    violations.push('excessive_unique_access');
  }
  if (metrics.dataMB > VELOCITY_THRESHOLDS.data_mb_per_hour) {
    violations.push('excessive_data_extraction');
  }

  return { suspicious: violations.length > 0, violations };
}
3. Behavioral Fingerprinting
Create profiles of legitimate user behavior and detect deviations:
- Access patterns: Typical users access certain endpoints in predictable sequences
- Session duration: Scrapers often maintain sessions for hours without interaction gaps
- User agent consistency: User agent changes mid-session indicate credential sharing
- Feature usage: Scrapers focus on data-rich endpoints, ignoring others
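This kind of profiling can be sketched as a deviation score against a per-user baseline. The metric names, weights, and the "5x baseline saturates the score" rule below are illustrative assumptions, not a specific product's API:

```javascript
// Sketch: score how far a session deviates from a user's learned baseline.
// Only deviations ABOVE baseline count toward suspicion; returns 0..1.
function deviationScore(baseline, session) {
  const metrics = ['requestsPerMinute', 'uniqueEndpoints', 'sessionMinutes'];
  let score = 0;
  for (const m of metrics) {
    const expected = baseline[m] || 1;
    const ratio = session[m] / expected;
    // Excess over baseline, scaled so 5x baseline contributes full weight
    score += Math.min(Math.max(ratio - 1, 0) / 4, 1);
  }
  return score / metrics.length; // normalize to 0..1
}
```

A session identical to the baseline scores 0; a scraper running at many multiples of every baseline metric saturates at 1, which can feed directly into the CAPTCHA and rate-limit thresholds discussed later.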
4. IP Reputation and Proxy Detection
Identify traffic from known scraping infrastructure:
- Datacenter IPs: AWS, GCP, Azure IPs are often used for scraping
- VPN and proxy services: Known VPN exit nodes and commercial proxies
- Residential proxies: Harder to detect but often have telltale signs (abnormal TLS fingerprints, timing patterns)
- IP reputation databases: Services like IPQualityScore, MaxMind, or Spur identify suspicious IPs
Corporate networks, universities, and shared WiFi can trigger datacenter IP alerts. Always combine multiple signals before blocking, and provide clear appeals processes for legitimate users.
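Combining signals before acting can be sketched as follows. The CIDR ranges, reputation threshold, and two-signal rule are illustrative assumptions; a real deployment would pull ranges and scores from a reputation provider:

```javascript
// Sketch: require two independent signals (datacenter range + bad reputation)
// before recommending a block; one signal alone only triggers a challenge.
function ipToInt(ip) {
  return ip.split('.').reduce((acc, o) => (acc << 8) + parseInt(o, 10), 0) >>> 0;
}

function inCidr(ip, cidr) {
  const [base, bits] = cidr.split('/');
  const mask = (~0 << (32 - parseInt(bits, 10))) >>> 0;
  return (ipToInt(ip) & mask) === (ipToInt(base) & mask);
}

// Illustrative cloud ranges; real lists come from provider-published data
const DATACENTER_RANGES = ['3.0.0.0/8', '34.64.0.0/10'];

function assessIp(ip, reputationScore /* 0 = clean .. 1 = bad */) {
  const datacenter = DATACENTER_RANGES.some(c => inCidr(ip, c));
  const signals = (datacenter ? 1 : 0) + (reputationScore > 0.8 ? 1 : 0);
  return { datacenter, block: signals >= 2, challenge: signals === 1 };
}
```

The two-signal rule is exactly the "combine multiple signals before blocking" advice above: a datacenter IP with a clean reputation gets a challenge, not a ban.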
5. TLS Fingerprinting
Automated tools often have different TLS handshake patterns than real browsers:
// TLS fingerprinting (requires reverse proxy support)
async function analyzeTLSFingerprint(request) {
  const ja3 = request.tlsFingerprint.ja3; // JA3 TLS client fingerprint

  // Check against known bot fingerprints
  const knownBot = await redis.sismember('known_bot_ja3', ja3);

  // Check whether the fingerprint matches the claimed user agent
  const expectedFingerprint = getBrowserFingerprint(request.userAgent);
  const mismatch = ja3 !== expectedFingerprint;

  return { isBot: Boolean(knownBot) || mismatch };
}
Prevention Strategies
Detection alone isn't enough. You need layered defenses that make scraping economically infeasible.
1. Intelligent Rate Limiting
Static, single-threshold rate limits are easily evaded; implement adaptive, multi-layered limits:
// Adaptive rate limiting
async function checkRateLimit(userId, endpoint, ipAddress) {
  // Different limits for different endpoints
  const limits = {
    '/api/search': { rpm: 30, hourly: 500 },
    '/api/users/:id': { rpm: 60, hourly: 2000 },
    '/api/sensitive-data': { rpm: 10, hourly: 100 }
  };
  const limit = { ...(limits[endpoint] || { rpm: 100, hourly: 5000 }) };

  // Adaptive limiting: tighten the limit BEFORE checking, if scraping is suspected
  const suspiciousScore = await getSuspiciousScore(userId);
  if (suspiciousScore > 0.7) {
    limit.rpm = Math.floor(limit.rpm * 0.3); // Reduce to 30% of normal
  }

  // Check user-level limits; set the TTL only on the first hit in the window
  const userKey = `ratelimit:user:${userId}:minute`;
  const userRequests = await redis.incr(userKey);
  if (userRequests === 1) await redis.expire(userKey, 60);
  if (userRequests > limit.rpm) {
    return { allowed: false, reason: 'user_rpm_exceeded' };
  }

  // Check IP-level limits (broader, helps against distributed attacks)
  const ipKey = `ratelimit:ip:${ipAddress}:minute`;
  const ipRequests = await redis.incr(ipKey);
  if (ipRequests === 1) await redis.expire(ipKey, 60);
  if (ipRequests > limit.rpm * 3) { // More lenient for shared IPs
    return { allowed: false, reason: 'ip_rpm_exceeded' };
  }

  return { allowed: true };
}
2. Resource Obfuscation
Make enumeration harder:
- Use UUIDs instead of sequential IDs: Replace /users/123 with /users/a7b3c4d5-e6f7-4a9b-8c0d-1e2f3a4b5c6d
- Implement access tokens for resources: Require time-limited, cryptographically signed tokens to access resources
- Add CAPTCHA challenges: For sensitive endpoints, or when scraping is suspected
- Implement proof-of-work: Require computational challenges for API access
// Resource access tokens
const jwt = require('jsonwebtoken');

function generateResourceToken(userId, resourceId, expiresIn = 3600) {
  const payload = {
    userId,
    resourceId,
    exp: Math.floor(Date.now() / 1000) + expiresIn
  };
  return jwt.sign(payload, SECRET_KEY);
}

// Require a matching token for resource access
app.get('/api/users/:id', async (req, res) => {
  const token = req.query.access_token;
  if (!token) {
    return res.status(403).json({ error: 'Access token required' });
  }
  try {
    const decoded = jwt.verify(token, SECRET_KEY);
    if (decoded.resourceId !== req.params.id) {
      return res.status(403).json({ error: 'Invalid token for resource' });
    }
    // Serve resource...
  } catch (err) {
    return res.status(403).json({ error: 'Invalid or expired token' });
  }
});
3. Data Minimization
Limit data exposure in API responses:
- Return only requested fields: Don't return entire objects by default
- Implement pagination limits: Cap the ?limit= parameter (max 100 items)
- Require authentication for detailed data: Public endpoints return minimal info
- Redact sensitive fields: Email becomes j***@example.com, phone becomes ***-***-1234
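Redaction helpers for the last point might look like the sketch below. The exact masking formats are assumptions matching the examples above:

```javascript
// Sketch: mask sensitive fields before serializing API responses.
// Keeps just enough for users to recognize their own data.
function redactEmail(email) {
  const [local, domain] = email.split('@');
  return `${local[0]}***@${domain}`; // keep first character of the local part
}

function redactPhone(phone) {
  const digits = phone.replace(/\D/g, ''); // strip formatting
  return `***-***-${digits.slice(-4)}`;    // keep only the last 4 digits
}
```

Applying redaction server-side, at serialization time, matters: data that never leaves the API cannot be scraped, regardless of how the client behaves.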
4. Behavioral CAPTCHA
When scraping is detected, challenge with CAPTCHA without disrupting legitimate users:
// Adaptive CAPTCHA challenges
async function shouldChallengeCaptcha(userId) {
  const suspiciousScore = await calculateSuspiciousScore(userId);

  // Thresholds for CAPTCHA
  if (suspiciousScore > 0.9) return { challenge: true, difficulty: 'hard' };
  if (suspiciousScore > 0.7) return { challenge: true, difficulty: 'medium' };
  if (suspiciousScore > 0.5) return { challenge: true, difficulty: 'easy' };
  return { challenge: false };
}

app.get('/api/data', async (req, res) => {
  const captchaNeeded = await shouldChallengeCaptcha(req.userId);

  if (captchaNeeded.challenge && !req.headers['x-captcha-token']) {
    return res.status(403).json({
      error: 'Verification required',
      captcha_site_key: CAPTCHA_SITE_KEY,
      difficulty: captchaNeeded.difficulty
    });
  }

  // Verify the CAPTCHA token if provided
  if (req.headers['x-captcha-token']) {
    const valid = await verifyCaptcha(req.headers['x-captcha-token']);
    if (!valid) {
      return res.status(403).json({ error: 'Invalid verification' });
    }
  }

  // Serve data...
});
5. Honeypot Endpoints
Create fake endpoints that legitimate users would never access:
// Honeypot endpoint: legitimate users have no reason to access this
app.get('/api/admin/all-users-export', async (req, res) => {
  // Log the suspicious access
  await logSecurityEvent({
    type: 'honeypot_triggered',
    userId: req.userId,
    ip: req.ip,
    endpoint: req.path,
    severity: 'high'
  });

  // Ban the user and IP
  await banUser(req.userId, 'honeypot_access');
  await banIP(req.ip, '24h');

  // Return fake data to waste the scraper's time
  res.json({
    users: generateFakeUsers(10000)
  });
});
Real-World Case Studies: 2026 Breaches
Instagram API Scraping (January 2026)
17.5 million records exposed through weak rate limiting and sequential user IDs. Attackers used 5,000+ residential proxy IPs to stay below detection thresholds.
What failed: Rate limits were per-IP only, allowing distributed attacks. Sequential user IDs made enumeration trivial.
Lessons: Implement multi-dimensional rate limiting (per-user AND per-IP). Use UUIDs for resource identifiers.
Facebook Graph API Abuse (March 2026)
12 million user profiles scraped via the Graph API's friends endpoint. Attackers created legitimate accounts and exploited friend connection data.
What failed: Overly permissive friend data access. No detection of abnormal friend graph traversal patterns.
Lessons: Monitor graph traversal depth and breadth. Detect unusual access patterns even with valid credentials.
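A lightweight guard against traversal abuse like this is capping GraphQL query depth. A crude sketch follows; brace counting stands in for a real AST-based check, and a production limiter would parse the query and also score breadth (the `first:` arguments):

```javascript
// Sketch: estimate nesting depth of a GraphQL query by brace counting,
// then reject anything deeper than a cap.
function queryDepth(query) {
  let depth = 0, max = 0;
  for (const ch of query) {
    if (ch === '{') { depth++; max = Math.max(max, depth); }
    else if (ch === '}') depth--;
  }
  return max;
}

function allowQuery(query, maxDepth = 4) {
  return queryDepth(query) <= maxDepth;
}
```

Under such a cap, the MassiveDataExtraction query shown earlier (users → edges → node → posts/friends) would be rejected outright, while typical product queries pass.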
How KnoxCall Prevents API Scraping
KnoxCall's AI-powered security is specifically designed to detect and prevent scraping:
- Pattern recognition: Machine learning identifies scraping patterns in real-time across all tracked dimensions
- Behavioral analysis: Profiles legitimate user behavior and flags deviations
- Adaptive rate limiting: Automatically adjusts limits based on suspicious activity
- Distributed attack detection: Correlates requests across IPs to identify coordinated scraping
- Automatic mitigation: Applies CAPTCHA challenges, rate limit reductions, or blocks without manual intervention
- Forensic analysis: Complete audit trails for investigating scraping incidents
Key Takeaways
- API scraping is the fastest-growing attack vector, responsible for major 2026 breaches
- Modern scrapers use distributed infrastructure and mimic legitimate traffic to evade basic protections
- Detection requires multi-dimensional analysis: patterns, velocity, behavior, and fingerprinting
- Prevention needs layered defenses: intelligent rate limiting, resource obfuscation, data minimization, and adaptive challenges
- Simple per-IP rate limiting is insufficient against distributed attacks
- AI-powered solutions like KnoxCall provide the most effective defense against sophisticated scrapers