Banning & Prohibiting Piwigo Related Gallery Scraping

This is a brief Apache htaccess tutorial, with rules to copy and paste, for banning certain types of Piwigo gallery scraping done by unwanted bots that waste bandwidth. In dense image galleries, this rapidly becomes a problem when a bot hits the gallery 24/7, multiple times an hour. The one bot I dealt with added 2GB of wasted traffic on its own when left unattended for a mere week.

I wrote this because of a churning gallery spam problem that started in late March. The requests still arrive, but they are denied now and no longer cost bandwidth. The origin of it is Tencent Cloud.

It looks like this in cPanel. I noticed my gallery was sending triple the amount of data in a few weeks on top of usual data averages, and checked whether it was organic traffic or not, only to find this mess there.

The URLs shown are spoofed. The bot also requests known Piwigo code and files (switchbox.js and the combined css file) to make itself look legitimate. The traffic is missing tons of the data any real usage would leave behind, which shows it's fake anyway. The scraper uses HEAD requests and falsely makes it look like real URL visits in cPanel. Maybe it will fool someone looking casually.

How can you shut this down? It's essentially a way of circumventing hotlink protection in order to pull full resolution jpgs. I suspect it's AI / LLM training on photos, and that it targets galleries. Well, I thought I had an answer too. Then I implemented it, and within 48 hours the scraper realized I was blocking it and changed tactics. Sheesh.

It changed tactics by skipping "[filename]-me.jpg" (full res images) and the faking of Piwigo files, and going to grab the next best thing, which is the [filename]-cu_s250x9999.jpg served by Piwigo's dynamic i.php page. In many cases, this will be extremely close to the full image size. In Piwigo, you never want to block any of these file extensions outright, because real browsing of the gallery uses them constantly.

This scraper will notice if you soft block it and will go harder with another method. However, after that one shift it gave up changing tactics and doesn't seem to know any others. These are the Piwigo-centric rules I had to use to shut it down. You need to put them in your htaccess file at the top, before your hotlink protection. Blocked requests get a default 403 page, which is a blank white page of 9 bytes, seen here.
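To illustrate the ordering described above, here is a rough sketch of how the file might be laid out. The hotlink block at the bottom is a generic placeholder of my own (example.com is hypothetical), not the rules from any particular site, so treat it only as a marker for where your existing protection sits.

```apache
# Sketch of file order only. The hotlink block is a generic placeholder;
# example.com is a hypothetical domain, replace it with your own setup.
RewriteEngine On

# 1. Scraper rules go first, so they are evaluated before anything else.
ErrorDocument 403 "Forbidden"
RewriteCond %{REQUEST_METHOD} =HEAD
RewriteRule ^(?:galleries|_data/i/galleries)/.*\.(?:jpe?g|png|gif|webp)$ - [F,L,NC]

# 2. Your existing hotlink protection stays below them.
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(?:www\.)?example\.com(?:/|$) [NC]
RewriteRule \.(?:jpe?g|png|gif|webp)$ - [F,NC]
```

Rules in an htaccess file are processed top to bottom, so a request denied by the first block never reaches the hotlink rules.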

This part is a 403 denier for these types of requests. You need to customize the REMOTE_ADDR IP pool with the addresses targeting your gallery. I assume they'll differ a bit depending on which scrapers the service has assigned. Each entry has the format [first digits]\.[second digits]\. and entries are separated with |
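As a small illustration of that format, here is a sketch using placeholder prefixes from the reserved documentation ranges (203.0.113.x and 198.51.100.x). These are not real scraper addresses; swap in the prefixes you see in your own logs.

```apache
# Placeholder prefixes from documentation ranges, NOT real scraper IPs.
# Two-octet entries: each one blocks everything under a.b.*.*
RewriteCond %{REMOTE_ADDR} ^(?:203\.0\.|198\.51\.)
RewriteRule ^(?:_data/i/galleries|galleries)/.*\.(?:jpe?g|png|gif|webp)$ - [F,L,NC]

# Narrower variant: three-octet entries block only a.b.c.*
RewriteCond %{REMOTE_ADDR} ^(?:203\.0\.113\.|198\.51\.100\.)
RewriteRule ^(?:_data/i/galleries|galleries)/.*\.(?:jpe?g|png|gif|webp)$ - [F,L,NC]
```

The trailing \. on every entry matters. Without it, a pattern such as ^1\.9 would also match 1.90 through 1.99 instead of a single second octet.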

# Enable mod_rewrite (harmless if it is already switched on elsewhere in the file)
RewriteEngine On

# Plain text body for every 403 response (9 bytes)
ErrorDocument 403 "Forbidden"

# Deny common scripting / HTTP library user agents on gallery image paths
RewriteCond %{HTTP_USER_AGENT} "(?:Python|aiohttp|curl|wget|httpx|requests|Go-http-client|libwww-perl|Scrapy)" [NC]
RewriteRule ^(?:galleries|_data/i/galleries)/ - [F,L]

# Deny HEAD requests for stored and cached gallery images
RewriteCond %{REQUEST_METHOD} =HEAD
RewriteRule ^(?:galleries|_data/i/galleries)/.*\.(?:jpe?g|png|gif|webp)$ - [F,L,NC]

# Deny partial-content requests (Range header) for those images
RewriteCond %{HTTP:Range} .
RewriteRule ^(?:galleries|_data/i/galleries)/.*\.(?:jpe?g|png|gif|webp)$ - [F,L,NC]

# Same for the older Request-Range header
RewriteCond %{HTTP:Request-Range} .
RewriteRule ^(?:galleries|_data/i/galleries)/.*\.(?:jpe?g|png|gif|webp)$ - [F,L,NC]

# Repeat the HEAD / Range / Request-Range checks for Piwigo's dynamic images: /i.php?/galleries/
RewriteCond %{REQUEST_METHOD} =HEAD
RewriteCond %{QUERY_STRING} "(?:^|/)galleries/.*\.(?:jpe?g|png|gif|webp)" [NC]
RewriteRule ^i\.php$ - [F,L]

RewriteCond %{HTTP:Range} .
RewriteCond %{QUERY_STRING} "(?:^|/)galleries/.*\.(?:jpe?g|png|gif|webp)" [NC]
RewriteRule ^i\.php$ - [F,L]

RewriteCond %{HTTP:Request-Range} .
RewriteCond %{QUERY_STRING} "(?:^|/)galleries/.*\.(?:jpe?g|png|gif|webp)" [NC]
RewriteRule ^i\.php$ - [F,L]

# Block suspicious rotating pools from direct cached/generated images
# /_data/i/galleries/ and /galleries/

RewriteCond %{REMOTE_ADDR} ^(?:116\.204\.|121\.37\.|113\.44\.|1\.92\.)
RewriteRule ^(?:_data/i/galleries|galleries)/.*\.(?:jpe?g|png|gif|webp)$ - [F,L,NC]

# Block from Piwigo's dynamic images: /i.php?/galleries/

RewriteCond %{REMOTE_ADDR} ^(?:116\.204\.|121\.37\.|113\.44\.|1\.92\.)
RewriteCond %{QUERY_STRING} ^/galleries/.*\.(?:jpe?g|png|gif|webp)(?:$|&) [NC]
RewriteRule ^i\.php$ - [F,L]
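If you would rather think in network ranges than regex prefixes, the same pools can, as far as I know, be written as CIDR blocks with the expression syntax available in Apache 2.4 and later. This is only a sketch of an alternative form for the i.php rule, assuming your host runs 2.4 or newer; it is not something the rules above depend on.

```apache
# Alternative sketch, assumes Apache 2.4+ (RewriteCond expr with -ipmatch).
# Each /16 block covers the same addresses as one two-octet regex prefix.
RewriteCond expr "%{REMOTE_ADDR} -ipmatch '116.204.0.0/16' || %{REMOTE_ADDR} -ipmatch '121.37.0.0/16' || %{REMOTE_ADDR} -ipmatch '113.44.0.0/16' || %{REMOTE_ADDR} -ipmatch '1.92.0.0/16'"
RewriteCond %{QUERY_STRING} ^/galleries/.*\.(?:jpe?g|png|gif|webp)(?:$|&) [NC]
RewriteRule ^i\.php$ - [F,L]
```

The regex form works on older Apache versions too, so it is the safer default. The CIDR form is mostly useful when a provider publishes its ranges as blocks that don't fall on octet boundaries.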

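If or when the scraper gets past this method and keeps attacking the x9999.jpg files, you can use a soft cookie to act as a "gate". A scraper bot that requests images directly, without first loading a gallery page and keeping the cookie it sets, will fail this check. A human using a browser won't.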

The WR_GALLERY part of the cookie name (WR_GALLERY_OK) is customizable to your site name. Nobody but you and the browser will know about it, and most visitors to the gallery will likely never check their cookies to see it. Additionally, which bot UAs you allow is up to you. I put in Google, Pinterest, and other known visitors. The cookie lasts only 24 hours, since this is a low level scraper to clean out and it seems to be poorly maintained.
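To make the customization concrete, here is a minimal sketch with a hypothetical cookie name (MYSITE_GALLERY_OK), a 7-day lifetime instead of 24 hours, and a shortened bot allow list. All three values are my own placeholders. The sketch uses "add" for the header, on the assumption that this avoids replacing any Set-Cookie header Piwigo sends itself.

```apache
# Hypothetical name MYSITE_GALLERY_OK and a 7-day lifetime (604800 seconds).
<IfModule mod_headers.c>
  <FilesMatch "^(index|picture)\.php$">
    # "add" appends a header; assumption: this keeps Piwigo's own cookies intact
    Header always add Set-Cookie "MYSITE_GALLERY_OK=1; Max-Age=604800; Path=/; Secure; SameSite=Lax"
  </FilesMatch>
</IfModule>

# The same cookie name must appear in every check that tests for it.
RewriteCond %{HTTP_COOKIE} !(^|;\s*)MYSITE_GALLERY_OK=1(;|$)
# Trimmed allow list for illustration; keep whichever crawlers you want.
RewriteCond %{HTTP_USER_AGENT} !(?:Googlebot|bingbot|Pinterestbot) [NC]
RewriteRule ^(?:_data/i/galleries|galleries)/.*\.(?:jpe?g|png|gif|webp)$ - [F,L,NC]
```

If you rename the cookie, change it in the Header line and in every RewriteCond that checks HTTP_COOKIE, otherwise real visitors get the 403 too.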

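# Cookie gate
# Sets the gate cookie whenever a real gallery page (index.php or picture.php) is served.
# "add" is used rather than "set" so a Set-Cookie header sent by Piwigo itself is not replaced.

<IfModule mod_headers.c>
  <FilesMatch "^(index|picture)\.php$">
    Header always add Set-Cookie "WR_GALLERY_OK=1; Max-Age=86400; Path=/; Secure; SameSite=Lax"
  </FilesMatch>
</IfModule>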

# Allows:
#   - visitors who already loaded a real gallery page
#   - major image / search bots
#   - main site (I use a gallery sub-domain) / local XAMPP embeds

RewriteCond %{HTTP_COOKIE} !(^|;\s*)WR_GALLERY_OK=1(;|$)
RewriteCond %{HTTP_USER_AGENT} !(?:Googlebot|Googlebot-Image|GoogleOther-Image|Google-InspectionTool|bingbot|BingPreview|Applebot|DuckDuckBot|YandexImages|Pinterestbot) [NC]
RewriteCond %{HTTP_REFERER} !^https?://(?:www\.)?whiteribbon\.blog(?:/|$) [NC]
RewriteCond %{HTTP_REFERER} !^https?://localhost(?::[0-9]+)?(?:/|$) [NC]
RewriteCond %{HTTP_REFERER} !^https?://127\.0\.0\.1(?::[0-9]+)?(?:/|$) [NC]
RewriteRule ^(?:_data/i/galleries|galleries)/.*\.(?:jpe?g|png|gif|webp)$ - [F,L,NC]

# Cookie gate for Piwigo's dynamic images /i.php?/galleries/

RewriteCond %{HTTP_COOKIE} !(^|;\s*)WR_GALLERY_OK=1(;|$)
RewriteCond %{HTTP_USER_AGENT} !(?:Googlebot|Googlebot-Image|GoogleOther-Image|Google-InspectionTool|bingbot|BingPreview|Applebot|DuckDuckBot|YandexImages|Pinterestbot) [NC]
RewriteCond %{QUERY_STRING} ^/galleries/.*\.(?:jpe?g|png|gif|webp)(?:$|&) [NC]
RewriteRule ^i\.php$ - [F,L]