Initial commit
commit b5b36df745
@ -0,0 +1,9 @@
*~
*.pyc
wget-lua
wget-at
STOP
BANNED
data/
test/
duplicate-urls.txt
@ -0,0 +1,3 @@
FROM atdr.meo.ws/archiveteam/grab-base
COPY . /grab
RUN ln -fs /usr/local/bin/wget-lua /grab/wget-at
@ -0,0 +1,24 @@
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <http://unlicense.org>
@ -0,0 +1,184 @@
urls-grab
=============

More information about the archiving project can be found on the ArchiveTeam wiki: [URLs](http://archiveteam.org/index.php?title=URLs)

Setup instructions
=========================

Be sure to replace `YOURNICKHERE` with the nickname you want shown on the tracker. You don't need to register it; just pick a nickname you like.

In most of the cases below, a web interface will be running at http://localhost:8001/. If you don't know or care what this is, you can just ignore it; otherwise, it gives you a fancy view of what's going on.

**If anything goes wrong while running the commands below, please scroll down to the bottom of this page. There's troubleshooting information there.**

Running with a warrior
-------------------------

Follow the [instructions on the ArchiveTeam wiki](http://archiveteam.org/index.php?title=Warrior) for installing the Warrior, and select the "URLs" project in the Warrior interface.

Running without a warrior
-------------------------
To run this outside the warrior, clone this repository, cd into its directory and run:

    python3 -m pip install setuptools wheel
    python3 -m pip install --upgrade seesaw zstandard requests
    ./get-wget-lua.sh

then start downloading with:

    run-pipeline3 pipeline.py --concurrent 2 YOURNICKHERE

For more options, run:

    run-pipeline3 --help

If you don't have root access and/or your version of pip is very old, you can replace "pip install --upgrade seesaw" with:

    wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py ; python3 get-pip.py --user ; ~/.local/bin/pip3 install --upgrade --user seesaw

so that pip and seesaw are installed in your home directory, then run:

    ~/.local/bin/run-pipeline3 pipeline.py --concurrent 2 YOURNICKHERE

Running multiple instances on different IPs
-------------------------------------------

This feature requires seesaw version 0.0.16 or greater. Use `pip install --upgrade seesaw` to upgrade.

Use the `--context-value` argument to pass in `bind_address=123.4.5.6` (replace the IP address with your own).

Example of running 2 threads, no web interface, and Wget binding of IP address:

    run-pipeline3 pipeline.py --concurrent 2 YOURNICKHERE --disable-web-server --context-value bind_address=123.4.5.6
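
As a rough sketch of what happens to that value: `pipeline.py` in this repository checks whether a global named `bind_address` exists and, if so, appends `--bind-address` to the Wget arguments. The helper name `with_bind_address` and the `context` dictionary below are illustrative assumptions; the real pipeline reads `globals()` rather than a dictionary.

```python
def with_bind_address(wget_args, context):
    """Return a copy of wget_args, adding --bind-address when one is supplied.

    `context` is a hypothetical stand-in for the key/value pairs passed
    with --context-value; it is not a seesaw API.
    """
    args = list(wget_args)
    address = context.get('bind_address')
    if address:
        # Same two arguments the pipeline adds when bind_address is set.
        args.extend(['--bind-address', address])
    return args
```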

Distribution-specific setup
-------------------------
### For Debian/Ubuntu:

Package `libzstd-dev` version 1.4.4 is required; it is currently available from `buster-backports`.

    adduser --system --group --shell /bin/bash archiveteam
    echo deb http://deb.debian.org/debian buster-backports main contrib > /etc/apt/sources.list.d/backports.list
    apt-get update \
    && apt-get install -y git-core libgnutls-dev lua5.1 liblua5.1-0 liblua5.1-0-dev screen bzip2 zlib1g-dev flex autoconf autopoint texinfo gperf lua-socket rsync automake pkg-config python3-dev python3-pip build-essential \
    && apt-get -t buster-backports install zstd libzstd-dev libzstd1
    python3 -m pip install setuptools wheel
    python3 -m pip install --upgrade seesaw zstandard requests
    su -c "cd /home/archiveteam; git clone https://github.com/ArchiveTeam/urls-grab.git; cd urls-grab; ./get-wget-lua.sh" archiveteam
    screen su -c "cd /home/archiveteam/urls-grab/; run-pipeline3 pipeline.py --concurrent 2 --address '127.0.0.1' YOURNICKHERE" archiveteam
    [... ctrl+A D to detach ...]

In __Debian Jessie, Ubuntu 18.04 Bionic and above__, the `libgnutls-dev` package was renamed to `libgnutls28-dev`. So, you need to do the following instead:

    adduser --system --group --shell /bin/bash archiveteam
    echo deb http://deb.debian.org/debian buster-backports main contrib > /etc/apt/sources.list.d/backports.list
    apt-get update \
    && apt-get install -y git-core libgnutls28-dev lua5.1 liblua5.1-0 liblua5.1-0-dev screen bzip2 zlib1g-dev flex autoconf autopoint texinfo gperf lua-socket rsync automake pkg-config python3-dev python3-pip build-essential \
    && apt-get -t buster-backports install zstd libzstd-dev libzstd1
    [... pretty much the same as above ...]

Wget-lua is also available on [ArchiveTeam's PPA](https://launchpad.net/~archiveteam/+archive/wget-lua) for Ubuntu.

### For CentOS:

Ensure that you have the CentOS equivalent of bzip2 installed as well. You will need the EPEL repository to be enabled.

    yum -y groupinstall "Development Tools"
    yum -y install gnutls-devel lua-devel python-pip zlib-devel zstd libzstd-devel git-core gperf lua-socket luarocks texinfo git rsync gettext-devel
    pip install --upgrade seesaw
    [... pretty much the same as above ...]

Tested with EL7 repositories.

### For Fedora:

The same as CentOS but with "dnf" instead of "yum". Compiling has not been successfully tested so far.

### For openSUSE:

    zypper install liblua5_1 lua51 lua51-devel screen python-pip libgnutls-devel bzip2 python-devel gcc make
    pip install --upgrade seesaw
    [... pretty much the same as above ...]

### For OS X:

You need Homebrew. Ensure that you have the OS X equivalent of bzip2 installed as well.

    brew install python lua gnutls
    pip install --upgrade seesaw
    [... pretty much the same as above ...]

**There is a known issue with some packaged versions of rsync. If you get errors during the upload stage, urls-grab will not work with your rsync version.**

This supposedly fixes it:

    alias rsync=/usr/local/bin/rsync

### For Arch Linux:

Ensure that you have the Arch equivalent of bzip2 installed as well.

1. Make sure you have `python2-pip` installed.
2. Install [the wget-lua package from the AUR](https://aur.archlinux.org/packages/wget-lua/).
3. Run `pip2 install --upgrade seesaw`.
4. Modify the run-pipeline script in seesaw to point at `#!/usr/bin/python2` instead of `#!/usr/bin/python`.
5. `useradd --system --group users --shell /bin/bash --create-home archiveteam`
6. `screen su -c "cd /home/archiveteam/urls-grab/; run-pipeline pipeline.py --concurrent 2 --address '127.0.0.1' YOURNICKHERE" archiveteam`

### For Alpine Linux:

    apk add lua5.1 git python bzip2 bash rsync gcc libc-dev lua5.1-dev zlib-dev gnutls-dev autoconf flex make
    python -m ensurepip
    pip install -U seesaw
    git clone https://github.com/ArchiveTeam/urls-grab
    cd urls-grab; ./get-wget-lua.sh
    run-pipeline pipeline.py --concurrent 2 --address '127.0.0.1' YOURNICKHERE

### For FreeBSD:

Honestly, I have no idea. `./get-wget-lua.sh` supposedly doesn't work due to differences in the `tar` that ships with FreeBSD. Another problem is the apparent absence of Lua 5.1 development headers. If you figure this out, please do let us know on IRC (irc.efnet.org #archiveteam).

Troubleshooting
=========================

Broken? These are some of the possible solutions:

### wget-lua was not successfully built

If you get errors about `wget.pod` or something similar, the documentation failed to compile, but wget-lua itself compiled fine. Try this:

    cd get-wget-lua.tmp
    mv src/wget ../wget-lua
    cd ..

The `get-wget-lua.tmp` name may be inaccurate. If you have a folder with a similar but different name, use that instead and please let us know on IRC what folder name you had!

Optionally, if you know what you're doing, you may want to use wgetpod.patch.

### Problem with gnutls or openssl during get-wget-lua

Please ensure that gnutls-dev(el) and openssl-dev(el) are installed.

### ImportError: No module named seesaw

If you're sure that you followed the steps to install `seesaw`, permissions on your module directory may be set incorrectly. Try the following:

    chmod o+rX -R /usr/local/lib/python2.7/dist-packages

### run-pipeline: command not found

Install `seesaw` using `pip2` instead of `pip`.

    pip2 install seesaw

### Issues in the code

If you notice a bug and want to file a bug report, please use the GitHub issues tracker.

Are you a developer? Help write code for us! Look at our [developer documentation](http://archiveteam.org/index.php?title=Dev) for details.

### Other problems

Have an issue not listed here? Join us on IRC and ask! We can be found at hackint IRC [#//](https://webirc.hackint.org/#irc://irc.hackint.org/#//).
@ -0,0 +1,64 @@
utm_source
utm_medium
utm_campaign
utm_term
utm_content
utm_adgroup
ref
refsrc
referrer_id
referrerid
src
i
s
ts
feature
jsessionid
phpsessid
aspsessionid
sessionid
zenid
sid
gclid
fb_xd_fragment
fb_comment_id
fbclid
cfid
cftoken
doing_wp_cron
pk_cpn
pk_campaign
pk_kwd
pk_keyword
piwik_campaign
piwik_kwd
ga_source
ga_medium
ga_term
ga_content
ga_campaign
ga_place
yclid
_openstat
fb_action_ids
fb_action_types
fb_source
fb_ref
action_object_map
action_type_map
action_ref_map
gs_l
mkt_tok
hmb_campaign
hmb_medium
hmb_source
rand
wicket:antiCache
cachebuster
nocache
vs
dilid
script_case_session
cid
extid
_flowexecutionkey
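The list above appears to be the `bad-params.txt` file that `urls.lua` reads later in this commit. There, each letter of a parameter name is expanded into a two-character class so the name matches regardless of case. A minimal Python sketch of that expansion follows. The function names are hypothetical, and `has_param` is only one assumed way such a pattern might be applied, since the Lua code that uses the patterns is not shown here.

```python
import re


def case_insensitive_class(param):
    """Expand every ASCII letter into a class, e.g. 's' -> '[sS]'."""
    return re.sub(
        r'[a-zA-Z]',
        lambda m: '[' + m.group(0).lower() + m.group(0).upper() + ']',
        param,
    )


def has_param(url, param):
    """Assumed usage: report whether `param=` occurs in the query string."""
    pattern = '[?&]' + case_insensitive_class(param) + '='
    return re.search(pattern, url) is not None
```

Expanding the letters instead of lowercasing the URL leaves the URL itself untouched, which matters when the rest of the URL is case-sensitive.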
@ -0,0 +1,33 @@
/action/consumeSharedSessionAction
/action/consumeSsoCookie
/action/getSharedSiteSession
/juris/error%.jsf
facebook%.com/login%.php
facebook%.com/cookie/
facebook%.com/plugins/
facebook%.com/sharer/
facebook%.com/sharer%.php
gongquiz%.com.+&historyNo=[0-9]+
univis%.univie%.ac%.at/ausschreibungstellensuche/
fundraise%.cancerresearchuk%.org/signup/account/
mma%.ft%.com
^https?://dmg%.go%-2b%-planer%.de/
^https?://3d%.espace%-aubade%.fr/
^https?://kuechenplaner%.[^/]+/cloud/
^https?://3d%-salledebains%.geberit%.fr/
^https?://bibliotekanauki%.ceon%.pl/yadda/search/general%.action
^https?://[^/]+%.icm%.edu%.pl/.*search/article%.action
^https?://interamt%.de/koop/app/
^https?://tesiunam%.dgb%.unam%.mx/F/
^https?://[^%.]+%.sedelectronica%.es/.*%?x=
^https?://www%.cp%-cc%.org/programs%-services/
/ibank/_crypt_
%%7B%%7B.+%%7D%%7D
^https?://[^/]+/"
^http://[0-9a-z][0-9a-z][0-9a-z][0-9][0-9][0-9]?%.[^%./]+%.com/$
^http://[0-9a-z][0-9a-z][0-9a-z][0-9][0-9][0-9]?%.[^%./]+%.com/[a-z]+%.?[a-z][a-z][a-z]?$
^http://[0-9a-z][0-9a-z][0-9a-z][0-9][0-9][0-9]?%.[^%./]+%.com/[a-z]+/[a-z]+[0-9]*%.?[a-z][a-z][a-z]?$
^https?://[^/]*yahoo%.com/.+%%5C.+at%.atwola%.com
^https?://[^/]*at%.atwola%.com/
^https?://www%.bafa%.de/
%%5C%%22
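The entries above are written in Lua pattern syntax, where `%` is the escape character (`%.` is a literal dot, `%%` a literal percent sign). To try one of them outside Lua, it can be translated into a Python regular expression. The sketch below covers only the subset of Lua syntax that these lists use; the function name is hypothetical, Lua character classes such as `%d` are deliberately rejected, and a literal `]` placed first in a set is not handled.

```python
import re


def lua_pattern_to_regex(pattern):
    """Translate a small subset of Lua pattern syntax into a Python regex."""
    out = []
    in_set = False  # True while inside a [...] character set
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == '%':
            if i + 1 >= len(pattern):
                raise ValueError('dangling % at end of pattern')
            nxt = pattern[i + 1]
            if nxt.isalnum():
                # %a, %d, %b, %f, %1 ... are outside the scope of this sketch.
                raise ValueError('unsupported Lua escape %' + nxt)
            out.append(re.escape(nxt))
            i += 2
            continue
        if ch == '[':
            in_set = True
        elif ch == ']':
            in_set = False
        if ch == '-' and not in_set:
            out.append('*?')       # Lua's lazy "zero or more" quantifier
        elif ch in '\\{}|':
            out.append('\\' + ch)  # literal in Lua, special in Python
        else:
            out.append(ch)
        i += 1
    return ''.join(out)
```

For example, `re.search(lua_pattern_to_regex('facebook%.com/login%.php'), url)` tests a URL against the fifth entry. One difference to keep in mind is that `.` in Lua also matches a newline, which Python's `.` does not by default.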
File diff suppressed because it is too large
@ -0,0 +1,57 @@
#!/usr/bin/env bash
#
# This script clones and compiles wget-lua.
#

# first, try to detect gnutls or openssl
CONFIGURE_SSL_OPT=""
if builtin type -p pkg-config &>/dev/null
then
  if pkg-config gnutls
  then
    echo "Compiling wget with GnuTLS."
    CONFIGURE_SSL_OPT="--with-ssl=gnutls"
  elif pkg-config openssl
  then
    echo "Compiling wget with OpenSSL."
    CONFIGURE_SSL_OPT="--with-ssl=openssl"
  fi
fi

# -F matches the version literally; an unescaped "." would match any character
if ! zstd --version | grep -qF '1.4.4'
then
  echo "Need version 1.4.4 of libzstd-dev and zstd"
  exit 1
fi

rm -rf get-wget-lua.tmp/
mkdir -p get-wget-lua.tmp

cd get-wget-lua.tmp

git clone https://github.com/archiveteam/wget-lua.git

cd wget-lua
git checkout v1.20.3-at

#echo -n 1.20.3-at-lua | tee ./.version ./.tarball-version > /dev/null

if ./bootstrap && ./configure $CONFIGURE_SSL_OPT --disable-nls && make && src/wget -V | grep -q lua
then
  cp src/wget ../../wget-at
  cd ../../
  echo
  echo
  echo "###################################################################"
  echo
  echo "wget-lua successfully built."
  echo
  ./wget-at --help | grep -iE "gnu|warc|lua"
  rm -rf get-wget-lua.tmp
  exit 0
else
  echo
  echo "wget-lua not successfully built."
  echo
  exit 1
fi
@ -0,0 +1,21 @@
[%?&]ver=[0-9a-zA-Z%.]*%.16[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]
[%?&]ver=16[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]
[%?&]t=16[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$
[%?&]t=16[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%.[0-9]+$
[%?&]hash=16[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$
%?16[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$
%?16[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$
%?6[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$
%?v=[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$
;extid=[0-9a-f]+$
[%?&;]_flowexecutionkey=
[%?&;]sid=
[%?&;]cid=
[%?&;]jsessionid=
[%?&;]script_case_session=
[%?&;]Dilid=
[%?&;][pP][hH][pP][sS][eE][sS][sS][iI][dD]=
[%?&;]wtd=
[%?&;]nonce=
[%?&;]rnd=
^https?://[^/]+/index%.php%?s=
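Several of the patterns above match a ten-digit number beginning with `16`. Read as Unix timestamps in seconds (an interpretation, not something the commit states), those numbers cover a window of roughly three years around the date of this commit. The sketch below works out that window and rewrites the `t=` pattern as a Python regex; the names `TEN_DIGIT_T`, `epoch_window` and `looks_like_timestamp` are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Python spelling of the Lua pattern [%?&]t=16[0-9]...[0-9]$ (eight digits after 16).
TEN_DIGIT_T = re.compile(r'[?&]t=16[0-9]{8}$')


def epoch_window():
    """Return the first and last instants a 16xxxxxxxx timestamp can denote."""
    first = datetime.fromtimestamp(1600000000, tz=timezone.utc)
    last = datetime.fromtimestamp(1699999999, tz=timezone.utc)
    return first, last


def looks_like_timestamp(url):
    """True when the URL ends in a t= parameter holding such a number."""
    return TEN_DIGIT_T.search(url) is not None
```

`epoch_window()` returns dates in September 2020 and November 2023, so a cache-busting value generated around the time of this commit would fall inside the matched range.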
@ -0,0 +1,17 @@
%.apng
%.avif
%.gif
%.jpe?g
%.jfif
%.pjpeg
%.pjp
%.png
%.svg
%.webp
%.bmp
%.ico
%.cur
%.tif
%.tiff
%.js
%.css
@ -0,0 +1,425 @@
# encoding=utf8
import datetime
from distutils.version import StrictVersion
import hashlib
import json
import os
import random
import shutil
import socket
import subprocess
import sys
import threading
import time
import string

if sys.version_info[0] < 3:
    from urllib import unquote
    # The Python 2 module is named urlparse (not urlparser).
    from urlparse import parse_qs
else:
    from urllib.parse import unquote, parse_qs

import requests
import seesaw
from seesaw.config import realize, NumberConfigValue
from seesaw.externalprocess import WgetDownload
from seesaw.item import ItemInterpolation, ItemValue
from seesaw.pipeline import Pipeline
from seesaw.project import Project
from seesaw.task import SimpleTask, LimitConcurrent
from seesaw.tracker import GetItemFromTracker, PrepareStatsForTracker, \
    UploadWithTracker, SendDoneToTracker
from seesaw.util import find_executable
import zstandard

if StrictVersion(seesaw.__version__) < StrictVersion('0.8.5'):
    raise Exception('This pipeline needs seesaw version 0.8.5 or higher.')

LOCK = threading.Lock()


###########################################################################
# Find a useful Wget+Lua executable.
#
# WGET_AT will be set to the first path that
# 1. does not crash with --version, and
# 2. prints the required version string

WGET_AT = find_executable(
    'Wget+AT',
    [
        'GNU Wget 1.20.3-at.20211001.01'
    ],
    [
        './wget-at',
        '/home/warrior/data/wget-at'
    ]
)

if not WGET_AT:
    raise Exception('No usable Wget+At found.')


###########################################################################
# The version number of this pipeline definition.
#
# Update this each time you make a non-cosmetic change.
# It will be added to the WARC files and reported to the tracker.
VERSION = '20220423.01'
#USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36'
TRACKER_ID = 'urls'
TRACKER_HOST = 'legacy-api.arpa.li'
MULTI_ITEM_SIZE = 40
MAX_DUPES_LIST_SIZE = 10000

###########################################################################
# This section defines project-specific tasks.
#
# Simple tasks (tasks that do not need any concurrency) are based on the
# SimpleTask class and have a process(item) method that is called for
# each item.
class CheckIP(SimpleTask):
    def __init__(self):
        SimpleTask.__init__(self, 'CheckIP')
        self._counter = 0

    def process(self, item):
        # NEW for 2014! Check if we are behind firewall/proxy

        if self._counter <= 0:
            item.log_output('Checking IP address.')
            ip_set = set()

            ip_set.add(socket.gethostbyname('twitter.com'))
            #ip_set.add(socket.gethostbyname('facebook.com'))
            ip_set.add(socket.gethostbyname('youtube.com'))
            ip_set.add(socket.gethostbyname('microsoft.com'))
            ip_set.add(socket.gethostbyname('icanhas.cheezburger.com'))
            ip_set.add(socket.gethostbyname('archiveteam.org'))

            if len(ip_set) != 5:
                item.log_output('Got IP addresses: {0}'.format(ip_set))
                item.log_output(
                    'Are you behind a firewall/proxy? That is a big no-no!')
                raise Exception(
                    'Are you behind a firewall/proxy? That is a big no-no!')

        # Check only occasionally
        if self._counter <= 0:
            self._counter = 10
        else:
            self._counter -= 1


class CheckRequirements(SimpleTask):
    def __init__(self):
        SimpleTask.__init__(self, 'CheckRequirements')
        self._checked = False

    def process(self, item):
        if not self._checked:
            assert shutil.which('pdftohtml') is not None
            self._checked = True


class PrepareDirectories(SimpleTask):
    def __init__(self, warc_prefix):
        SimpleTask.__init__(self, 'PrepareDirectories')
        self.warc_prefix = warc_prefix

    def process(self, item):
        item_name = item['item_name']
        item_name_hash = hashlib.sha1(item_name.encode('utf8')).hexdigest()
        escaped_item_name = item_name_hash
        dirname = '/'.join((item['data_dir'], escaped_item_name))

        if os.path.isdir(dirname):
            shutil.rmtree(dirname)

        os.makedirs(dirname)

        item['item_dir'] = dirname
        item['warc_file_base'] = '-'.join([
            self.warc_prefix,
            item_name_hash,
            time.strftime('%Y%m%d-%H%M%S')
        ])

        if not os.path.isfile('duplicate-urls.txt'):
            open('duplicate-urls.txt', 'w').close()

        open('%(item_dir)s/%(warc_file_base)s.warc.zst' % item, 'w').close()
        open('%(item_dir)s/%(warc_file_base)s_bad-urls.txt' % item, 'w').close()
        open('%(item_dir)s/%(warc_file_base)s_duplicate-urls.txt' % item, 'w').close()


class MoveFiles(SimpleTask):
    def __init__(self):
        SimpleTask.__init__(self, 'MoveFiles')

    def process(self, item):
        os.rename('%(item_dir)s/%(warc_file_base)s.warc.zst' % item,
            '%(data_dir)s/%(warc_file_base)s.%(dict_project)s.%(dict_id)s.warc.zst' % item)

        shutil.rmtree('%(item_dir)s' % item)


class SetBadUrls(SimpleTask):
    def __init__(self):
        SimpleTask.__init__(self, 'SetBadUrls')

    def unquote_url(self, url):
        # Percent-decode repeatedly until the URL stops changing.
        temp = unquote(url)
        while url != temp:
            url = temp
            temp = unquote(url)
        return url

    def process(self, item):
        item['item_name_original'] = item['item_name']
        items = item['item_name'].split('\0')
        items_lower = [self.unquote_url(url).strip().lower() for url in item['item_urls']]
        with open('%(item_dir)s/%(warc_file_base)s_bad-urls.txt' % item, 'r') as f:
            for url in {
                self.unquote_url(url).strip().lower() for url in f
            }:
                index = items_lower.index(url)
                items.pop(index)
                items_lower.pop(index)
        item['item_name'] = '\0'.join(items)


class SetDuplicateUrls(SimpleTask):
    def __init__(self):
        SimpleTask.__init__(self, 'SetNewDuplicates')

    def process(self, item):
        with LOCK:
            self._process(item)

    def _process(self, item):
        with open('duplicate-urls.txt', 'r') as f:
            duplicates = {s.strip() for s in f}
        with open('%(item_dir)s/%(warc_file_base)s_duplicate-urls.txt' % item, 'r') as f:
            for url in f:
                duplicates.add(url.strip())
        with open('duplicate-urls.txt', 'w') as f:
            # choose randomly, to cycle periodically popular URLs
            duplicates = list(duplicates)
            random.shuffle(duplicates)
            f.write('\n'.join(duplicates[:MAX_DUPES_LIST_SIZE]))


class MaybeSendDoneToTracker(SendDoneToTracker):
    def enqueue(self, item):
        if len(item['item_name']) == 0:
            return self.complete_item(item)
        return super(MaybeSendDoneToTracker, self).enqueue(item)


def get_hash(filename):
    with open(filename, 'rb') as in_file:
        return hashlib.sha1(in_file.read()).hexdigest()

CWD = os.getcwd()
PIPELINE_SHA1 = get_hash(os.path.join(CWD, 'pipeline.py'))
LUA_SHA1 = get_hash(os.path.join(CWD, 'urls.lua'))

def stats_id_function(item):
    d = {
        'pipeline_hash': PIPELINE_SHA1,
        'lua_hash': LUA_SHA1,
        'python_version': sys.version,
    }

    return d


class ZstdDict(object):
    created = 0
    data = None

    @classmethod
    def get_dict(cls):
        # Reuse the cached dictionary for up to 30 minutes.
        if cls.data is not None and time.time() - cls.created < 1800:
            return cls.data
        response = requests.get(
            'https://legacy-api.arpa.li/dictionary',
            params={
                'project': TRACKER_ID
            }
        )
        response.raise_for_status()
        response = response.json()
        if cls.data is not None and response['id'] == cls.data['id']:
            cls.created = time.time()
            return cls.data
        print('Downloading latest dictionary.')
        response_dict = requests.get(response['url'])
        response_dict.raise_for_status()
        raw_data = response_dict.content
        if hashlib.sha256(raw_data).hexdigest() != response['sha256']:
            raise ValueError('Hash of downloaded dictionary does not match.')
        # 28 B5 2F FD is the Zstandard frame magic number.
        if raw_data[:4] == b'\x28\xB5\x2F\xFD':
            raw_data = zstandard.ZstdDecompressor().decompress(raw_data)
        cls.data = {
            'id': response['id'],
            'dict': raw_data
        }
        cls.created = time.time()
        return cls.data


class WgetArgs(object):
    def realize(self, item):
        with open('user-agents.txt', 'r') as f:
            USER_AGENT = random.choice(list(f)).strip()
        wget_args = [
            'timeout', '1000',
            WGET_AT,
            '-U', USER_AGENT,
            '-v',
            '--content-on-error',
            '--lua-script', 'urls.lua',
            '-o', ItemInterpolation('%(item_dir)s/wget.log'),
            #'--no-check-certificate',
            '--output-document', ItemInterpolation('%(item_dir)s/wget.tmp'),
            '--truncate-output',
            '-e', 'robots=off',
            '--rotate-dns',
            '--recursive', '--level=inf',
            '--no-parent',
            '--timeout', '10',
            '--tries', '2',
            '--span-hosts',
            '--page-requisites',
            '--waitretry', '0',
            '--warc-file', ItemInterpolation('%(item_dir)s/%(warc_file_base)s'),
            '--warc-header', 'operator: Archive Team',
            '--warc-header', 'x-wget-at-project-version: ' + VERSION,
            '--warc-header', 'x-wget-at-project-name: ' + TRACKER_ID,
            '--warc-dedup-url-agnostic',
            '--warc-compression-use-zstd',
            '--warc-zstd-dict-no-include',
            '--header', 'Connection: keep-alive',
            '--header', 'Accept-Language: en-US;q=0.9, en;q=0.8'
        ]

        dict_data = ZstdDict.get_dict()
        with open(os.path.join(item['item_dir'], 'zstdict'), 'wb') as f:
            f.write(dict_data['dict'])
        item['dict_id'] = dict_data['id']
        item['dict_project'] = TRACKER_ID
        wget_args.extend([
            '--warc-zstd-dict', ItemInterpolation('%(item_dir)s/zstdict'),
        ])

        # Keep only plain http(s) URLs and custom items that carry a url.
        item['item_name'] = '\0'.join([
            item_name for item_name in item['item_name'].split('\0')
            if (item_name.startswith('custom:') and '&url=' in item_name)
            or item_name.startswith('http://')
            or item_name.startswith('https://')
        ])

        item['item_name_newline'] = item['item_name'].replace('\0', '\n')
        item_urls = []
        custom_items = {}

        for item_name in item['item_name'].split('\0'):
            wget_args.extend(['--warc-header', 'x-wget-at-project-item-name: '+item_name])
            wget_args.append('item-name://'+item_name)
            if item_name.startswith('custom:'):
                data = parse_qs(item_name.split(':', 1)[1])
                for k, v in data.items():
                    if len(v) == 1:
                        data[k] = v[0]
                url = data['url']
                custom_items[url.lower()] = data
            else:
                url = item_name
            item_urls.append(url)
            wget_args.append(url)

        item['item_urls'] = item_urls
        item['custom_items'] = json.dumps(custom_items)

        if 'bind_address' in globals():
            wget_args.extend(['--bind-address', globals()['bind_address']])
            print('')
            print('*** Wget will bind address at {0} ***'.format(
                globals()['bind_address']))
            print('')

        return realize(wget_args, item)

###########################################################################
# Initialize the project.
#
# This will be shown in the warrior management panel. The logo should not
# be too big. The deadline is optional.
project = Project(
    title = 'URLs',
    project_html = '''
    <img class="project-logo" alt="logo" src="https://archiveteam.org/images/thumb/f/f3/Archive_team.png/235px-Archive_team.png" height="50px"/>
    <h2>Archiving sets of discovered outlinks. · <a href="http://tracker.archiveteam.org/urls/">Leaderboard</a></h2>
    '''
)

# `downloader` is not defined in this file; it is expected to be supplied
# by the seesaw runner that loads this pipeline.
pipeline = Pipeline(
    CheckIP(),
    CheckRequirements(),
    GetItemFromTracker('https://{}/{}/multi={}/'
        .format(TRACKER_HOST, TRACKER_ID, MULTI_ITEM_SIZE),
        downloader, VERSION),
    PrepareDirectories(warc_prefix='urls'),
    WgetDownload(
        WgetArgs(),
        max_tries=1,
        accept_on_exit_code=[0, 4, 8],
        env={
            'item_dir': ItemValue('item_dir'),
            'item_name': ItemValue('item_name_newline'),
            'custom_items': ItemValue('custom_items'),
            'warc_file_base': ItemValue('warc_file_base')
        }
    ),
    SetBadUrls(),
    SetDuplicateUrls(),
    PrepareStatsForTracker(
        defaults={'downloader': downloader, 'version': VERSION},
        file_groups={
            'data': [
                ItemInterpolation('%(item_dir)s/%(warc_file_base)s.warc.zst')
            ]
        },
        id_function=stats_id_function,
    ),
    MoveFiles(),
    LimitConcurrent(NumberConfigValue(min=1, max=20, default='2',
        name='shared:rsync_threads', title='Rsync threads',
        description='The maximum number of concurrent uploads.'),
        UploadWithTracker(
            'https://%s/%s' % (TRACKER_HOST, TRACKER_ID),
            downloader=downloader,
            version=VERSION,
            files=[
                ItemInterpolation('%(data_dir)s/%(warc_file_base)s.%(dict_project)s.%(dict_id)s.warc.zst')
            ],
            rsync_target_source_path=ItemInterpolation('%(data_dir)s/'),
            rsync_extra_args=[
                '--recursive',
                '--partial',
                '--partial-dir', '.rsync-tmp',
                '--min-size', '1',
                '--no-compress',
                '--compress-level', '0'
            ]
        ),
    ),
    MaybeSendDoneToTracker(
        tracker_url='https://%s/%s' % (TRACKER_HOST, TRACKER_ID),
        stats=ItemValue('stats')
    )
)
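The `SetBadUrls` task in the pipeline above normalises each URL (percent-decoding until nothing changes, stripping whitespace, lowercasing) and then drops the items whose URL was written to the bad-URLs file. A standalone sketch of that idea, without seesaw, might look like the following. The function names are hypothetical, and unlike the task itself this version skips a bad URL that is not in the list instead of raising `ValueError`.

```python
from urllib.parse import unquote


def unquote_fully(url):
    """Percent-decode repeatedly until the string stops changing."""
    decoded = unquote(url)
    while decoded != url:
        url = decoded
        decoded = unquote(url)
    return url


def normalise(url):
    """Comparison key: fully decoded, stripped and lowercased."""
    return unquote_fully(url).strip().lower()


def drop_bad(item_names, item_urls, bad_urls):
    """Return item_names without entries whose URL appears in bad_urls.

    item_names and item_urls are index-aligned, as in the pipeline.
    """
    names = list(item_names)
    keys = [normalise(u) for u in item_urls]
    for bad in {normalise(u) for u in bad_urls}:
        if bad in keys:  # assumption: silently skip unknown URLs
            index = keys.index(bad)
            names.pop(index)
            keys.pop(index)
    return names
```

Decoding to a fixed point means that `a%252Fb` and `a%2Fb` compare equal, so a URL is still recognised when it has been encoded more than once.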
@ -0,0 +1,942 @@
local urlparse = require("socket.url")
local http = require("socket.http")
JSON = (loadfile "JSON.lua")()

local item_dir = os.getenv("item_dir")
local item_name = os.getenv("item_name")
local custom_items = os.getenv("custom_items")
local warc_file_base = os.getenv("warc_file_base")

local url_count = 0
local downloaded = {}
local abortgrab = false
local exit_url = false
local min_dedup_mb = 5

local timestamp = nil

if urlparse == nil or http == nil then
  io.stdout:write("socket not correctly installed.\n")
  io.stdout:flush()
  abortgrab = true
end

local urls = {}
for url in string.gmatch(item_name, "([^\n]+)") do
  urls[string.lower(url)] = true
end

local urls_settings = JSON:decode(custom_items)
for k, _ in pairs(urls_settings) do
  urls[string.lower(k)] = true
end

local status_code = nil

local redirect_urls = {}
local visited_urls = {}
local ids_to_ignore = {}
-- Build patterns for UUID-like IDs from groups of hex digits joined by "-".
for _, lengths in pairs({{8, 4, 4, 4, 12}, {8, 4, 4, 12}}) do
  local uuid = ""
  for _, i in pairs(lengths) do
    for j=1,i do
      uuid = uuid .. "[0-9a-fA-F]"
    end
    if i ~= 12 then
      uuid = uuid .. "%-"
    end
  end
  ids_to_ignore[uuid] = true
end
local to_ignore = ""
for i=1,9 do
  to_ignore = to_ignore .. "[0-9]"
end
ids_to_ignore["%?" .. to_ignore .. "$"] = true
ids_to_ignore["%?" .. to_ignore .. "[0-9]$"] = true
ids_to_ignore[to_ignore .. "[0-9]%.[0-9][0-9][0-9][0-9]$"] = true
to_ignore = ""
for i=1,50 do
  to_ignore = to_ignore .. "[0-9a-zA-Z]"
end
ids_to_ignore[to_ignore .. "%-[0-9][0-9][0-9][0-9][0-9]"] = true
ids_to_ignore["[0-9a-zA-Z%-_]!%-?[0-9]"] = true
to_ignore = ""
for i=1,32 do
  to_ignore = to_ignore .. "[0-9a-fA-F]"
end
ids_to_ignore["[^0-9a-fA-F]" .. to_ignore .. "[^0-9a-fA-F]"] = true
ids_to_ignore["[^0-9a-fA-F]" .. to_ignore .. "$"] = true

local current_url = nil
local current_settings = nil
local bad_urls = {}
local queued_urls = {}
local bad_params = {}
local bad_patterns = {}
local ignore_patterns = {}
local page_requisite_patterns = {}
local duplicate_urls = {}
local extract_outlinks_patterns = {}
local item_first_url = nil
local redirect_domains = {}
local checked_domains = {}

local parenturl_uuid = nil
local parenturl_requisite = nil

local dupes_file = io.open("duplicate-urls.txt", "r")
for url in dupes_file:lines() do
  duplicate_urls[url] = true
end
dupes_file:close()

local bad_params_file = io.open("bad-params.txt", "r")
for param in bad_params_file:lines() do
  -- Expand each letter into a two-character class, e.g. "s" -> "[sS]".
  local param = string.gsub(
    param, "([a-zA-Z])",
    function(c)
      return "[" .. string.lower(c) .. string.upper(c) .. "]"
    end
  )
  table.insert(bad_params, param)
|
||||
end
|
||||
bad_params_file:close()
|
||||
|
||||
local bad_patterns_file = io.open("bad-patterns.txt", "r")
|
||||
for pattern in bad_patterns_file:lines() do
|
||||
table.insert(bad_patterns, pattern)
|
||||
end
|
||||
bad_patterns_file:close()
|
||||
|
||||
local ignore_patterns_file = io.open("ignore-patterns.txt", "r")
|
||||
for pattern in ignore_patterns_file:lines() do
|
||||
table.insert(ignore_patterns, pattern)
|
||||
end
|
||||
ignore_patterns_file:close()
|
||||
|
||||
local page_requisite_patterns_file = io.open("page-requisite-patterns.txt", "r")
|
||||
for pattern in page_requisite_patterns_file:lines() do
|
||||
table.insert(page_requisite_patterns, pattern)
|
||||
end
|
||||
page_requisite_patterns_file:close()
|
||||
|
||||
local extract_outlinks_patterns_file = io.open("extract-outlinks-patterns.txt", "r")
|
||||
for pattern in extract_outlinks_patterns_file:lines() do
|
||||
extract_outlinks_patterns[pattern] = true
|
||||
end
|
||||
extract_outlinks_patterns_file:close()
|
||||
|
||||
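-- Read `bytes` (default: the whole file) from the path `file`.
-- Returns an empty string when no path is given or the read yields nothing.
read_file = function(file, bytes)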
|
||||
if not bytes then
|
||||
bytes = "*all"
|
||||
end
|
||||
if file then
|
||||
local f = assert(io.open(file))
|
||||
local data = f:read(bytes)
|
||||
f:close()
|
||||
if not data then
|
||||
data = ""
|
||||
end
|
||||
return data
|
||||
else
|
||||
return ""
|
||||
end
|
||||
end
|
||||
|
||||
table_length = function(t)
|
||||
local count = 0
|
||||
for _ in pairs(t) do
|
||||
count = count + 1
|
||||
end
|
||||
return count
|
||||
end
|
||||
|
||||
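-- Walk the host of `url` and each of its parent domains.
-- Without `target`: return the first one listed in extract-outlinks-patterns.txt.
-- With `target`: return it when the host equals or is a subdomain of `target`.
-- Returns false when nothing matches.
check_domain_outlinks = function(url, target)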
|
||||
local parent = string.match(url, "^https?://([^/]+)")
|
||||
while parent do
|
||||
if (not target and extract_outlinks_patterns[parent])
|
||||
or (target and parent == target) then
|
||||
return parent
|
||||
end
|
||||
parent = string.match(parent, "^[^%.]+%.(.+)$")
|
||||
end
|
||||
return false
|
||||
end
|
||||
|
||||
-- Any status other than 200, 301, 302, 303, 307, 308, 404 or 410 counts as a failure.
bad_code = function(status_code)
|
||||
return status_code ~= 200
|
||||
and status_code ~= 301
|
||||
and status_code ~= 302
|
||||
and status_code ~= 303
|
||||
and status_code ~= 307
|
||||
and status_code ~= 308
|
||||
and status_code ~= 404
|
||||
and status_code ~= 410
|
||||
end
|
||||
|
||||
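-- Return true when any "/"-separated component of the unescaped URL
-- (compared case-insensitively) occurs max_repetitions times.
find_path_loop = function(url, max_repetitions)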
|
||||
local tested = {}
|
||||
for s in string.gmatch(urlparse.unescape(url), "([^/]+)") do
|
||||
s = string.lower(s)
|
||||
if not tested[s] then
|
||||
if s == "" then
|
||||
tested[s] = -2
|
||||
else
|
||||
tested[s] = 0
|
||||
end
|
||||
end
|
||||
tested[s] = tested[s] + 1
|
||||
if tested[s] == max_repetitions then
|
||||
return true
|
||||
end
|
||||
end
|
||||
return false
|
||||
end
|
||||
|
||||
-- Percent-encode every byte outside printable ASCII (below 32 or above 126).
percent_encode_url = function(url)
local temp = ""
|
||||
for c in string.gmatch(url, "(.)") do
|
||||
local b = string.byte(c)
|
||||
if b < 32 or b > 126 then
|
||||
c = string.format("%%%02X", b)
|
||||
end
|
||||
temp = temp .. c
|
||||
end
|
||||
return temp
|
||||
end
|
||||
|
||||
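-- Normalise `url` and record it in queued_urls for submission in the finish callback.
-- With `withcustom` set and an "all" setting active, the URL is wrapped in a
-- "custom:" item that carries the decremented depth settings.
queue_url = function(url, withcustom)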
|
||||
if not url then
|
||||
return nil
|
||||
end
|
||||
queue_new_urls(url)
|
||||
if not string.match(url, "^https?://[^/]+%.") then
|
||||
return nil
|
||||
end
|
||||
--local original = url
|
||||
-- Read a numeric setting (0 when missing); the caller gets it decremented by one.
load_setting_depth = function(s)
local n = tonumber(current_settings[s])
|
||||
if n == nil then
|
||||
n = 0
|
||||
end
|
||||
return n - 1
|
||||
end
|
||||
url = string.gsub(url, "'%s*%+%s*'", "")
|
||||
url = percent_encode_url(url)
|
||||
url = string.match(url, "^([^{]+)")
|
||||
url = string.match(url, "^([^<]+)")
|
||||
url = string.match(url, "^([^\\]+)")
|
||||
if current_settings and current_settings["all"] and withcustom then
|
||||
local depth = load_setting_depth("depth")
|
||||
local keep_random = load_setting_depth("keep_random")
|
||||
local keep_all = load_setting_depth("keep_all")
|
||||
local any_domain = load_setting_depth("any_domain")
|
||||
if depth >= 0 then
|
||||
local random = current_settings["random"]
|
||||
local all = current_settings["all"]
|
||||
if keep_random < 0 or random == "" then
|
||||
random = nil
|
||||
keep_random = nil
|
||||
end
|
||||
if keep_all < 0 or all == 0 then
|
||||
all = nil
|
||||
keep_all = nil
|
||||
end
|
||||
if any_domain <= 0 then
|
||||
any_domain = nil
|
||||
end
|
||||
local settings = {
|
||||
depth=depth,
|
||||
all=all,
|
||||
keep_all=keep_all,
|
||||
random=random,
|
||||
keep_random=keep_random,
|
||||
url=url,
|
||||
any_domain=any_domain
|
||||
}
|
||||
url = "custom:"
|
||||
for _, k in pairs(
|
||||
{"all", "any_domain", "depth", "keep_all", "keep_random", "random", "url"}
|
||||
) do
|
||||
local v = settings[k]
|
||||
if v ~= nil then
|
||||
url = url .. k .. "=" .. urlparse.escape(tostring(v)) .. "&"
|
||||
end
|
||||
end
|
||||
url = string.sub(url, 1, -2)
|
||||
end
|
||||
end
|
||||
if not duplicate_urls[url] and not queued_urls[url] then
|
||||
if find_path_loop(url, 2) then
|
||||
return false
|
||||
end
|
||||
--print("queuing",original, url)
|
||||
queued_urls[url] = true
|
||||
end
|
||||
end
|
||||
|
||||
-- Queue `url` as a "custom:" item tagged with the current year and month (YYYYMM).
queue_monthly_url = function(url)
|
||||
local random_s = os.date("%Y%m", timestamp)
|
||||
url = percent_encode_url(url)
|
||||
queued_urls["custom:random=" .. random_s .. "&url=" .. urlparse.escape(tostring(url))] = true
|
||||
end
|
||||
|
||||
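-- Remove every query parameter whose name matches `param_pattern`,
-- then trim a trailing "?", "&" or ";" left behind.
remove_param = function(url, param_pattern)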
|
||||
local newurl = url
|
||||
repeat
|
||||
url = newurl
|
||||
newurl = string.gsub(url, "([%?&;])" .. param_pattern .. "=[^%?&;]*[%?&;]?", "%1")
|
||||
until newurl == url
|
||||
return string.match(newurl, "^(.-)[%?&;]?$")
|
||||
end
|
||||
|
||||
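-- Queue cleaned-up variants of `url`: with "amp;" residue removed, with the
-- parameters from bad-params.txt stripped, without the query string, and
-- split on quotes and backslashes.
queue_new_urls = function(url)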
|
||||
if not url then
|
||||
return nil
|
||||
end
|
||||
local newurl = string.gsub(url, "([%?&;])[aA][mM][pP];", "%1")
|
||||
if url == current_url then
|
||||
if newurl ~= url then
|
||||
queue_url(newurl)
|
||||
end
|
||||
end
|
||||
for _, param_pattern in pairs(bad_params) do
|
||||
newurl = remove_param(newurl, param_pattern)
|
||||
end
|
||||
if newurl ~= url then
|
||||
queue_url(newurl)
|
||||
end
|
||||
newurl = string.match(newurl, "^([^%?&]+)")
|
||||
if newurl ~= url then
|
||||
queue_url(newurl)
|
||||
end
|
||||
url = string.gsub(url, "&quot;", '"')
|
||||
url = string.gsub(url, "&amp;", "&")
|
||||
for newurl in string.gmatch(url, '([^"\\]+)') do
|
||||
if newurl ~= url then
|
||||
queue_url(newurl)
|
||||
end
|
||||
end
|
||||
end
|
||||
|
||||
report_bad_url = function(url)
|
||||
if current_url ~= nil then
|
||||
bad_urls[current_url] = true
|
||||
else
|
||||
bad_urls[string.lower(url)] = true
|
||||
end
|
||||
end
|
||||
|
||||
-- Drop the scheme and a leading "www." so that URLs can be compared loosely.
strip_url = function(url)
url = string.match(url, "^https?://(.+)$")
local newurl = string.match(url, "^www%.(.+)$")
|
||||
if newurl then
|
||||
url = newurl
|
||||
end
|
||||
return url
|
||||
end
|
||||
|
||||
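-- Decide whether wget follows a discovered link in this session.
-- Only redirects are followed directly; other accepted links are handed to
-- queue_url and the callback returns false.
wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict, reason)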
|
||||
local url = urlpos["url"]["url"]
|
||||
local parenturl = parent["url"]
|
||||
local extract_page_requisites = false
|
||||
|
||||
local current_settings_all = current_settings and current_settings["all"]
|
||||
local current_settings_any_domain = current_settings and current_settings["any_domain"]
|
||||
|
||||
--queue_monthly_url(string.match(url, "^(https?://[^/]+)") .. "/")
|
||||
|
||||
if redirect_urls[parenturl] and not (
|
||||
status_code == 300 and string.match(parenturl, "^https?://[^/]*feb%-web%.ru/")
|
||||
) then
|
||||
return true
|
||||
end
|
||||
|
||||
if find_path_loop(url, 2) then
|
||||
return false
|
||||
end
|
||||
|
||||
local _, count = string.gsub(url, "[/%?]", "")
|
||||
if count >= 16 then
|
||||
return false
|
||||
end
|
||||
|
||||
for _, extension in pairs({
|
||||
"pdf",
|
||||
"doc[mx]?",
|
||||
"xls[mx]?",
|
||||
"ppt[mx]?",
|
||||
"zip",
|
||||
"odt",
|
||||
"odm",
|
||||
"ods",
|
||||
"odp",
|
||||
"xml",
|
||||
"json",
|
||||
"torrent"
|
||||
}) do
|
||||
if string.match(parenturl, "%." .. extension .. "$")
|
||||
or string.match(parenturl, "%." .. extension .. "[^a-z0-9A-Z]")
|
||||
or string.match(parenturl, "%." .. string.upper(extension) .. "$")
|
||||
or string.match(parenturl, "%." .. string.upper(extension) .. "[^a-z0-9A-Z]") then
|
||||
return false
|
||||
end
|
||||
if string.match(url, "%." .. extension .. "$")
|
||||
or string.match(url, "%." .. extension .. "[^a-z0-9A-Z]")
|
||||
or string.match(url, "%." .. string.upper(extension) .. "$")
|
||||
or string.match(url, "%." .. string.upper(extension) .. "[^a-z0-9A-Z]") then
|
||||
queue_url(url)
|
||||
return false
|
||||
end
|
||||
end
|
||||
|
||||
local domain_match = checked_domains[item_first_url]
|
||||
if not domain_match then
|
||||
domain_match = check_domain_outlinks(item_first_url)
|
||||
if not domain_match then
|
||||
domain_match = "none"
|
||||
end
|
||||
checked_domains[item_first_url] = domain_match
|
||||
end
|
||||
if domain_match ~= "none" then
|
||||
extract_page_requisites = true
|
||||
local newurl_domain = string.match(url, "^https?://([^/]+)")
|
||||
local to_queue = true
|
||||
for domain, _ in pairs(redirect_domains) do
|
||||
if check_domain_outlinks(url, domain) then
|
||||
to_queue = false
|
||||
break
|
||||
end
|
||||
end
|
||||
if to_queue then
|
||||
queue_url(url)
|
||||
return false
|
||||
end
|
||||
end
|
||||
|
||||
--[[if not extract_page_requisites then
|
||||
return false
|
||||
end]]
|
||||
|
||||
if (status_code < 200 or status_code >= 300 or not verdict)
|
||||
and not current_settings_all then
|
||||
return false
|
||||
end
|
||||
|
||||
--[[if string.len(url) == string.len(parenturl) then
|
||||
local good_url = false
|
||||
local index1, index2
|
||||
temp_url = string.match(url, "^https?://(.+)$")
|
||||
temp_parenturl = string.match(parenturl, "^https?://(.+)$")
|
||||
local start_index = 1
|
||||
repeat
|
||||
index1 = string.find(temp_url, "/", start_index)
|
||||
index2 = string.find(temp_parenturl, "/", start_index)
|
||||
if index1 ~= index2 then
|
||||
good_url = true
|
||||
break
|
||||
end
|
||||
if index1 then
|
||||
start_index = index1 + 1
|
||||
end
|
||||
until not index1 or not index2
|
||||
if not good_url then
|
||||
return false
|
||||
end
|
||||
end]]
|
||||
|
||||
if parenturl_uuid == nil then
|
||||
parenturl_uuid = false
|
||||
for old_parent_url, _ in pairs(visited_urls) do
|
||||
for id_to_ignore, _ in pairs(ids_to_ignore) do
|
||||
if string.match(old_parent_url, id_to_ignore) then
|
||||
parenturl_uuid = true
|
||||
break
|
||||
end
|
||||
end
|
||||
if parenturl_uuid then
|
||||
break
|
||||
end
|
||||
end
|
||||
end
|
||||
if parenturl_uuid then
|
||||
for id_to_ignore, _ in pairs(ids_to_ignore) do
|
||||
if string.match(url, id_to_ignore) and not current_settings_all then
|
||||
return false
|
||||
end
|
||||
end
|
||||
end
|
||||
|
||||
if urlpos["link_refresh_p"] ~= 0 then
|
||||
queue_url(url)
|
||||
return false
|
||||
end
|
||||
|
||||
if parenturl_requisite == nil then
|
||||
parenturl_requisite = false
|
||||
for _, pattern in pairs(page_requisite_patterns) do
|
||||
for old_parent_url, _ in pairs(visited_urls) do
|
||||
if string.match(old_parent_url, pattern) then
|
||||
parenturl_requisite = true
|
||||
break
|
||||
end
|
||||
end
|
||||
if parenturl_requisite then
|
||||
break
|
||||
end
|
||||
end
|
||||
end
|
||||
if parenturl_requisite and not current_settings_all then
|
||||
return false
|
||||
end
|
||||
|
||||
if urlpos["link_inline_p"] ~= 0 then
|
||||
queue_url(url)
|
||||
return false
|
||||
end
|
||||
|
||||
local current_host = string.match(urlpos["url"]["host"], "([^%.]+%.[^%.]+)$")
|
||||
local first_parent_host = string.match(parent["host"], "([^%.]+%.[^%.]+)$")
|
||||
|
||||
if current_url then
|
||||
first_parent_host = string.match(current_url .. "/", "^https?://[^/]-([^/%.]+%.[^/%.]+)/")
|
||||
end
|
||||
|
||||
if current_settings_all and (
|
||||
current_settings_any_domain
|
||||
or first_parent_host == current_host
|
||||
) then
|
||||
queue_url(url, true)
|
||||
return false
|
||||
end
|
||||
|
||||
--[[for old_parent_url, _ in pairs(visited_urls) do
|
||||
for _, pattern in pairs(page_requisite_patterns) do
|
||||
if string.match(old_parent_url, pattern) then
|
||||
return false
|
||||
end
|
||||
end
|
||||
end
|
||||
|
||||
for _, pattern in pairs(page_requisite_patterns) do
|
||||
if string.match(url, pattern) then
|
||||
queue_url(url)
|
||||
return false
|
||||
end
|
||||
end]]
|
||||
end
|
||||
|
||||
wget.callbacks.get_urls = function(file, url, is_css, iri)
|
||||
local html = nil
|
||||
|
||||
if url then
|
||||
downloaded[url] = true
|
||||
end
|
||||
|
||||
local function check(url, headers)
|
||||
local url = string.match(url, "^([^#]+)")
|
||||
url = string.gsub(url, "&amp;", "&")
|
||||
queue_url(url)
|
||||
end
|
||||
|
||||
local function checknewurl(newurl, headers)
|
||||
if string.match(newurl, "^#") then
|
||||
return nil
|
||||
end
|
||||
if string.match(newurl, "\\[uU]002[fF]") then
|
||||
return checknewurl(string.gsub(newurl, "\\[uU]002[fF]", "/"), headers)
|
||||
end
|
||||
if string.match(newurl, "^https?:////") then
|
||||
check(string.gsub(newurl, ":////", "://"), headers)
|
||||
elseif string.match(newurl, "^https?://") then
|
||||
check(newurl, headers)
|
||||
elseif string.match(newurl, "^https?:\\/\\?/") then
|
||||
check(string.gsub(newurl, "\\", ""), headers)
|
||||
elseif not url then
|
||||
return nil
|
||||
elseif string.match(newurl, "^\\/") then
|
||||
checknewurl(string.gsub(newurl, "\\", ""), headers)
|
||||
elseif string.match(newurl, "^//") then
|
||||
check(urlparse.absolute(url, newurl), headers)
|
||||
elseif string.match(newurl, "^/") then
|
||||
check(urlparse.absolute(url, newurl), headers)
|
||||
elseif string.match(newurl, "^%.%./") then
|
||||
if string.match(url, "^https?://[^/]+/[^/]+/") then
|
||||
check(urlparse.absolute(url, newurl), headers)
|
||||
else
|
||||
checknewurl(string.match(newurl, "^%.%.(/.+)$"), headers)
|
||||
end
|
||||
elseif string.match(newurl, "^%./") then
|
||||
check(urlparse.absolute(url, newurl), headers)
|
||||
end
|
||||
end
|
||||
|
||||
local function checknewshorturl(newurl, headers)
|
||||
if string.match(newurl, "^#") then
|
||||
return nil
|
||||
end
|
||||
if url and string.match(newurl, "^%?") then
|
||||
check(urlparse.absolute(url, newurl), headers)
|
||||
elseif url and not (string.match(newurl, "^https?:\\?/\\?//?/?")
|
||||
or string.match(newurl, "^[/\\]")
|
||||
or string.match(newurl, "^%./")
|
||||
or string.match(newurl, "^[jJ]ava[sS]cript:")
|
||||
or string.match(newurl, "^[mM]ail[tT]o:")
|
||||
or string.match(newurl, "^vine:")
|
||||
or string.match(newurl, "^android%-app:")
|
||||
or string.match(newurl, "^ios%-app:")
|
||||
or string.match(newurl, "^%${")) then
|
||||
check(urlparse.absolute(url, newurl), headers)
|
||||
else
|
||||
checknewurl(newurl, headers)
|
||||
end
|
||||
end
|
||||
|
||||
if (status_code == 200 and current_settings and current_settings["deep_extract"])
|
||||
or not url then
|
||||
html = read_file(file)
|
||||
if not url then
|
||||
html = string.gsub(html, " ", " ")
|
||||
html = string.gsub(html, "&lt;", "<")
|
||||
html = string.gsub(html, "&gt;", ">")
|
||||
html = string.gsub(html, "&quot;", '"')
|
||||
html = string.gsub(html, "'", "'")
|
||||
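-- Decode numeric character references. string.char only accepts 0-255,
-- so larger code points are left as-is (returning nil keeps the match).
html = string.gsub(html, "&#(%d+);",
function(n)
local code = tonumber(n)
if code and code < 256 then
return string.char(code)
end
end
)
html = string.gsub(html, "&#x(%x+);",
function(n)
local code = tonumber(n, 16)
if code and code < 256 then
return string.char(code)
end
end
)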
|
||||
local temp_html = string.gsub(html, "\n", "")
|
||||
for _, remove in pairs({"", "<br/>", "</?p[^>]*>"}) do
|
||||
if remove ~= "" then
|
||||
temp_html = string.gsub(temp_html, remove, "")
|
||||
end
|
||||
for newurl in string.gmatch(temp_html, "(https?://[^%s<>#\"'\\`{})%]]+)") do
|
||||
while string.match(newurl, "[%.&,!;]$") do
|
||||
newurl = string.match(newurl, "^(.+).$")
|
||||
end
|
||||
check(newurl)
|
||||
end
|
||||
end
|
||||
end
|
||||
for newurl in string.gmatch(html, "[^%-][hH][rR][eE][fF]='([^']+)'") do
|
||||
checknewshorturl(newurl)
|
||||
end
|
||||
for newurl in string.gmatch(html, '[^%-][hH][rR][eE][fF]="([^"]+)"') do
|
||||
checknewshorturl(newurl)
|
||||
end
|
||||
for newurl in string.gmatch(string.gsub(html, "&[qQ][uU][oO][tT];", '"'), '"(https?://[^"]+)') do
|
||||
checknewurl(newurl)
|
||||
end
|
||||
for newurl in string.gmatch(string.gsub(html, "'", "'"), "'(https?://[^']+)") do
|
||||
checknewurl(newurl)
|
||||
end
|
||||
if url then
|
||||
for newurl in string.gmatch(html, ">%s*([^<%s]+)") do
|
||||
checknewurl(newurl)
|
||||
end
|
||||
end
|
||||
--[[for newurl in string.gmatch(html, "%(([^%)]+)%)") do
|
||||
checknewurl(newurl)
|
||||
end]]
|
||||
elseif string.match(url, "^https?://[^/]+/.*[^a-z0-9A-Z][pP][dD][fF]$")
|
||||
or string.match(url, "^https?://[^/]+/.*[^a-z0-9A-Z][pP][dD][fF][^a-z0-9A-Z]")
|
||||
or string.match(read_file(file, 4), "%%[pP][dD][fF]") then
|
||||
io.stdout:write("Extracting links from PDF.\n")
|
||||
io.stdout:flush()
|
||||
local temp_file = file .. "-html.html"
|
||||
local check_file = io.open(temp_file)
|
||||
if check_file then
|
||||
check_file:close()
|
||||
os.remove(temp_file)
|
||||
end
|
||||
os.execute("pdftohtml -nodrm -hidden -i -s -q " .. file)
|
||||
check_file = io.open(temp_file)
|
||||
if check_file then
|
||||
check_file:close()
|
||||
local temp_length = table_length(queued_urls)
|
||||
wget.callbacks.get_urls(temp_file, nil, nil, nil)
|
||||
io.stdout:write("Found " .. tostring(table_length(queued_urls)-temp_length) .. " URLs.\n")
|
||||
io.stdout:flush()
|
||||
os.remove(temp_file)
|
||||
else
|
||||
io.stdout:write("Not a PDF.\n")
|
||||
io.stdout:flush()
|
||||
end
|
||||
end
|
||||
end
|
||||
|
||||
wget.callbacks.write_to_warc = function(url, http_stat)
|
||||
local url_lower = string.lower(url["url"])
|
||||
if urls[url_lower] then
|
||||
current_url = url_lower
|
||||
current_settings = urls_settings[url_lower]
|
||||
end
|
||||
if current_settings and not current_settings["random"] then
|
||||
queue_url(url["url"])
|
||||
return false
|
||||
end
|
||||
if bad_code(http_stat["statcode"]) then
|
||||
return false
|
||||
elseif http_stat["statcode"] >= 300 and http_stat["statcode"] <= 399 then
|
||||
local newloc = urlparse.absolute(url["url"], http_stat["newloc"])
|
||||
if string.match(newloc, "^https?://[^/]*google%.com/sorry")
|
||||
or string.match(newloc, "^https?://[^/]*google%.com/[sS]ervice[lL]ogin")
|
||||
or string.match(newloc, "^https?://consent%.youtube%.com/")
|
||||
or string.match(newloc, "^https?://consent%.google%.com/")
|
||||
or string.match(newloc, "^https?://misuse%.ncbi%.nlm%.nih%.gov/")
|
||||
or string.match(newloc, "^https?://myprivacy%.dpgmedia%.nl/")
|
||||
or string.match(newloc, "^https?://idp%.springer%.com/authorize%?")
|
||||
or string.match(newloc, "^https?://[^/]*instagram%.com/accounts/") then
|
||||
report_bad_url(url["url"])
|
||||
exit_url = true
|
||||
return false
|
||||
end
|
||||
return true
|
||||
elseif http_stat["statcode"] ~= 200 then
|
||||
return true
|
||||
end
|
||||
-- This unconditional return disables the Wayback Machine deduplication check that follows.
if true then
|
||||
return true
|
||||
end
|
||||
if http_stat["len"] > min_dedup_mb * 1024 * 1024 then
|
||||
io.stdout:write("Data larger than " .. tostring(min_dedup_mb) .. " MB. Checking with Wayback Machine.\n")
|
||||
io.stdout:flush()
|
||||
while true do
|
||||
local body, code, headers, status = http.request(
|
||||
"https://web.archive.org/__wb/calendarcaptures/2"
|
||||
.. "?url=" .. urlparse.escape(url["url"])
|
||||
.. "&date=202"
|
||||
)
|
||||
if code ~= 200 then
|
||||
io.stdout:write("Got " .. tostring(code) .. " from the Wayback Machine.\n")
|
||||
io.stdout:flush()
|
||||
os.execute("sleep 10")
|
||||
else
|
||||
local data = JSON:decode(body)
|
||||
if not data["items"] or not data["colls"] then
|
||||
return true
|
||||
end
|
||||
for _, item in pairs(data["items"]) do
|
||||
if item[2] == 200 then
|
||||
local coll_id = item[3] + 1
|
||||
if not coll_id then
|
||||
io.stdout:write("Could not get coll ID.\n")
|
||||
io.stdout:flush()
|
||||
end
|
||||
local collections = data["colls"][coll_id]
|
||||
if not collections then
|
||||
io.stdout:write("Could not get collections.\n")
|
||||
io.stdout:flush()
|
||||
end
|
||||
for _, collection in pairs(collections) do
|
||||
if collection == "archivebot"
|
||||
or string.find(collection, "archiveteam") then
|
||||
io.stdout:write("Archive Team got this URL before.\n")
|
||||
return false
|
||||
end
|
||||
end
|
||||
end
|
||||
end
|
||||
break
|
||||
end
|
||||
end
|
||||
end
|
||||
return true
|
||||
end
|
||||
|
||||
wget.callbacks.httploop_result = function(url, err, http_stat)
|
||||
status_code = http_stat["statcode"]
|
||||
|
||||
parenturl_uuid = nil
|
||||
parenturl_requisite = nil
|
||||
|
||||
local url_lower = string.lower(url["url"])
|
||||
if urls[url_lower] then
|
||||
current_url = url_lower
|
||||
current_settings = urls_settings[url_lower]
|
||||
end
|
||||
|
||||
if not timestamp then
|
||||
local body, code, headers, status = http.request("https://legacy-api.arpa.li/now")
|
||||
assert(code == 200)
|
||||
timestamp = tonumber(string.match(body, "^([0-9]+)"))
|
||||
end
|
||||
|
||||
|
||||
if status_code ~= 0 then
|
||||
local base_url = string.match(url["url"], "^(https://[^/]+)")
|
||||
if base_url then
|
||||
for _, newurl in pairs({
|
||||
base_url .. "/robots.txt",
|
||||
base_url .. "/favicon.ico",
|
||||
base_url .. "/"
|
||||
}) do
|
||||
queue_monthly_url(newurl)
|
||||
end
|
||||
end
|
||||
end
|
||||
|
||||
url_count = url_count + 1
|
||||
io.stdout:write(url_count .. "=" .. status_code .. " " .. url["url"] .. " \n")
|
||||
io.stdout:flush()
|
||||
|
||||
if redirect_domains["done"] then
|
||||
redirect_domains = {}
|
||||
redirect_urls = {}
|
||||
visited_urls = {}
|
||||
item_first_url = nil
|
||||
end
|
||||
redirect_domains[string.match(url["url"], "^https?://([^/]+)")] = true
|
||||
if not item_first_url then
|
||||
item_first_url = url["url"]
|
||||
end
|
||||
|
||||
visited_urls[url["url"]] = true
|
||||
|
||||
if exit_url then
|
||||
exit_url = false
|
||||
return wget.actions.EXIT
|
||||
end
|
||||
|
||||
if status_code >= 300 and status_code <= 399 then
|
||||
local newloc = urlparse.absolute(url["url"], http_stat["newloc"])
|
||||
redirect_urls[url["url"]] = true
|
||||
--[[if strip_url(url["url"]) == strip_url(newloc) then
|
||||
queued_urls[newloc] = true
|
||||
return wget.actions.EXIT
|
||||
end]]
|
||||
if downloaded[newloc] then
|
||||
return wget.actions.EXIT
|
||||
elseif string.match(url["url"], "^https?://[^/]*telegram%.org/dl%?tme=")
|
||||
or (
|
||||
string.match(newloc, "^https?://www%.(.+)")
|
||||
or string.match(newloc, "^https?://(.+)")
|
||||
) == (
|
||||
string.match(url["url"], "^https?://www%.(.+)")
|
||||
or string.match(url["url"], "^https?://(.+)")
|
||||
)
|
||||
or status_code == 301
|
||||
or status_code == 308 then
|
||||
queue_url(newloc)
|
||||
return wget.actions.EXIT
|
||||
end
|
||||
else
|
||||
redirect_domains["done"] = true
|
||||
end
|
||||
|
||||
if downloaded[url["url"]] then
|
||||
report_bad_url(url["url"])
|
||||
return wget.actions.EXIT
|
||||
end
|
||||
|
||||
for _, pattern in pairs(ignore_patterns) do
|
||||
if string.match(url["url"], pattern) then
|
||||
return wget.actions.EXIT
|
||||
end
|
||||
end
|
||||
|
||||
if status_code >= 200 and status_code <= 399 then
|
||||
downloaded[url["url"]] = true
|
||||
end
|
||||
|
||||
if status_code >= 200 and status_code < 300 then
|
||||
queue_new_urls(url["url"])
|
||||
end
|
||||
|
||||
if bad_code(status_code) then
|
||||
io.stdout:write("Server returned " .. http_stat.statcode .. " (" .. err .. ").\n")
|
||||
io.stdout:flush()
|
||||
report_bad_url(url["url"])
|
||||
return wget.actions.EXIT
|
||||
end
|
||||
|
||||
local sleep_time = 0
|
||||
|
||||
if sleep_time > 0.001 then
|
||||
os.execute("sleep " .. sleep_time)
|
||||
end
|
||||
|
||||
return wget.actions.NOTHING
|
||||
end
|
||||
|
||||
wget.callbacks.finish = function(start_time, end_time, wall_time, numurls, total_downloaded_bytes, total_download_time)
|
||||
local function submit_backfeed(newurls)
|
||||
local tries = 0
|
||||
local maxtries = 4
|
||||
while tries < maxtries do
|
||||
local body, code, headers, status = http.request(
|
||||
"https://legacy-api.arpa.li/backfeed/legacy/urls-glx7ansh4e17aii",
|
||||
newurls .. "\0"
|
||||
)
|
||||
print(body)
|
||||
if code == 200 then
|
||||
io.stdout:write("Submitted discovered URLs.\n")
|
||||
io.stdout:flush()
|
||||
break
|
||||
end
|
||||
io.stdout:write("Failed to submit discovered URLs. " .. tostring(code) .. " " .. tostring(body) .. "\n")
|
||||
io.stdout:flush()
|
||||
os.execute("sleep " .. math.floor(math.pow(2, tries)))
|
||||
tries = tries + 1
|
||||
end
|
||||
if tries == maxtries then
|
||||
abortgrab = true
|
||||
end
|
||||
end
|
||||
|
||||
local newurls = nil
|
||||
local is_bad = false
|
||||
local count = 0
|
||||
local dup_urls = io.open(item_dir .. "/" .. warc_file_base .. "_duplicate-urls.txt", "w")
|
||||
for url, _ in pairs(queued_urls) do
-- Reset per URL so an earlier match cannot carry over when bad_patterns is empty.
is_bad = false
|
||||
for _, pattern in pairs(bad_patterns) do
|
||||
is_bad = string.match(url, pattern)
|
||||
if is_bad then
|
||||
io.stdout:write("Filtering out URL " .. url .. ".\n")
|
||||
io.stdout:flush()
|
||||
break
|
||||
end
|
||||
end
|
||||
if not is_bad then
|
||||
io.stdout:write("Queuing URL " .. url .. ".\n")
|
||||
io.stdout:flush()
|
||||
dup_urls:write(url .. "\n")
|
||||
if newurls == nil then
|
||||
newurls = url
|
||||
else
|
||||
newurls = newurls .. "\0" .. url
|
||||
end
|
||||
count = count + 1
|
||||
if count == 100 then
|
||||
submit_backfeed(newurls)
|
||||
newurls = nil
|
||||
count = 0
|
||||
end
|
||||
end
|
||||
end
|
||||
if newurls ~= nil then
|
||||
submit_backfeed(newurls)
|
||||
end
|
||||
dup_urls:close()
|
||||
|
||||
local file = io.open(item_dir .. "/" .. warc_file_base .. "_bad-urls.txt", "w")
|
||||
for url, _ in pairs(bad_urls) do
|
||||
file:write(url .. "\n")
|
||||
end
|
||||
file:close()
|
||||
end
|
||||
|
||||
wget.callbacks.before_exit = function(exit_status, exit_status_string)
|
||||
if abortgrab then
|
||||
return wget.exits.IO_FAIL
|
||||
end
|
||||
return exit_status
|
||||
end
|
||||
|
|
@ -0,0 +1,381 @@
|
|||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:40.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:43.0) Gecko/20100101 Firefox/43.0 SeaMonkey/2.40
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:45.0) Gecko/20100101 Firefox/45.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:47.0) Gecko/20100101 Firefox/47.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:48.0) Gecko/20100101 Firefox/48.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:49.0) Gecko/20100101 Firefox/49.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:50.0) Gecko/20100101 Firefox/50.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:52.0) Gecko/20100101 Firefox/52.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:54.0) Gecko/20100101 Firefox/54.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:56.0) Gecko/20100101 Firefox/56.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:56.0) Gecko/20100101 Firefox/56.0.4 Waterfox/56.0.4
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.3
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.4
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.5
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:58.0) Gecko/20100101 Firefox/58.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:59.0) Gecko/20100101 Firefox/59.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:61.0) Gecko/20100101 Firefox/61.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:62.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:64.0) Gecko/20100101 Firefox/64.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:65.0) Gecko/20100101 Firefox/65.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:43.0) Gecko/20100101 Firefox/43.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:43.0) Gecko/20100101 Firefox/43.0 SeaMonkey/2.40
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:47.0) Gecko/20100101 Firefox/47.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:48.0) Gecko/20100101 Firefox/48.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:49.0) Gecko/20100101 Firefox/49.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:50.0) Gecko/20100101 Firefox/50.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:51.0) Gecko/20100101 Firefox/51.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:51.0) Gecko/20100101 Firefox/51.0 SeaMonkey/2.48
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:52.0) Gecko/20100101 Firefox/52.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:54.0) Gecko/20100101 Firefox/54.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:55.0) Gecko/20100101 Firefox/55.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:56.0) Gecko/20100101 Firefox/56.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:56.0) Gecko/20100101 Firefox/56.0.4 Waterfox/56.0.4
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.3
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.4
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.5
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:57.0) Gecko/20100101 Firefox/57.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:58.0) Gecko/20100101 Firefox/58.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:59.0) Gecko/20100101 Firefox/59.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:61.0) Gecko/20100101 Firefox/61.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:62.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:64.0) Gecko/20100101 Firefox/64.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:49.0) Gecko/20100101 Firefox/49.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:50.0) Gecko/20100101 Firefox/50.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:51.0) Gecko/20100101 Firefox/51.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Firefox/52.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:54.0) Gecko/20100101 Firefox/54.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:56.0) Gecko/20100101 Firefox/56.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.5
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:57.0) Gecko/20100101 Firefox/57.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:58.0) Gecko/20100101 Firefox/58.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:59.0) Gecko/20100101 Firefox/59.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:59.0.2) Gecko/20100101 Firefox/59.0.2
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:61.0) Gecko/20100101 Firefox/61.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:61.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:64.0) Gecko/20100101 Firefox/64.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:40.0) Gecko/20100101 Firefox/40.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:48.0) Gecko/20100101 Firefox/48.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:52.0) Gecko/20100101 Firefox/52.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:52.0) Gecko/20100101 Firefox/52.0 SeaMonkey/2.49.3
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:55.0) Gecko/20100101 Firefox/55.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:56.0) Gecko/20100101 Firefox/56.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:56.0) Gecko/20100101 Firefox/56.0.4 Waterfox/56.0.4
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.3
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.4
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.5
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/99.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:58.0) Gecko/20100101 Firefox/58.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:59.0) Gecko/20100101 Firefox/59.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:61.0) Gecko/20100101 Firefox/61.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:62.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:64.0) Gecko/20100101 Firefox/64.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:65.0) Gecko/20100101 Firefox/65.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:52.0) Gecko/20100101 Firefox/52.0 SeaMonkey/2.49.2
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.5
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:57.0) Gecko/20100101 Firefox/57.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:61.0) Gecko/20100101 Firefox/61.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:62.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:40.0) Gecko/20100101 Firefox/40.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:45.0) Gecko/20100101 Firefox/45.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:47.0) Gecko/20100101 Firefox/47.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:48.0) Gecko/20100101 Firefox/48.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.8.3
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:44.0) Gecko/20100101 Firefox/44.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:45.0) Gecko/20100101 Firefox/45.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:47.0) Gecko/20100101 Firefox/47.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:48.0) Gecko/20100101 Firefox/48.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:49.0) Gecko/20100101 Firefox/49.0.2.1 Waterfox/49.0.2.1
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:45.0) Gecko/20100101 Firefox/45.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:48.0) Gecko/20100101 Firefox/48.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:56.0) Gecko/20100101 Firefox/56.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.5
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:41.0) Gecko/20100101 Firefox/41.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:47.0) Gecko/20100101 Firefox/47.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:48.0) Gecko/20100101 Firefox/48.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:52.0) Gecko/20100101 Firefox/52.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:52.0) Gecko/20100101 Firefox/52.0 SeaMonkey/2.49.1
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:55.0) Gecko/20100101 Firefox/55.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:56.0) Gecko/20100101 Firefox/56.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:56.0) Gecko/20100101 Firefox/56.0.1 Waterfox/56.0.1
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:58.0) Gecko/20100101 Firefox/58.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:59.0) Gecko/20100101 Firefox/59.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:61.0) Gecko/20100101 Firefox/61.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:62.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1; rv:50.0) Gecko/20100101 Firefox/50.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1; rv:55.0) Gecko/20100101 Firefox/55.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2; rv:49.0) Gecko/20100101 Firefox/49.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.102 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.1 Safari/605.1.15
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.1 Safari/605.1.15
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.2 Safari/605.1.15
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.1 Safari/605.1.15
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3639.1 Safari/537.36
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.2 Safari/605.1.15
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_29_81; rv:45.70.23) Gecko/20134284 Firefox/45.70.23
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 11.11; rv:51.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 9.3; rv:45.0) Gecko/20100101 Firefox/57.0
|
||||
Mozilla/5.0 (Macintosh; Intel Mac OS X 9.3; rv:45.0) Gecko/20100101 Firefox/59.0.2
|
||||
Mozilla/5.0 (Macintosh; PPC Mac OS X 10.11; rv:46.0) Gecko/20100101 Firefox/46.0
|
||||
Mozilla/5.0 (Macintosh; PPC Mac OS X 10.12; rv:46.0) Gecko/20100101 Firefox/46.0
|
||||
Mozilla/5.0 (Macintosh; PPC Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0
|
||||
Mozilla/5.0 (Macintosh; PPC Mac OS X 10.4; FPR7; rv:45.0) Gecko/20100101 Firefox/45.0 TenFourFox/G5
|
||||
Mozilla/5.0 (Macintosh; PPC Mac OS X 10.4; FPR8; rv:45.0) Gecko/20100101 Firefox/45.0 TenFourFox/G5
|
||||
Mozilla/5.0 (Macintosh; PPC Mac OS X 10.4; FPR9; rv:45.0) Gecko/20100101 Firefox/45.0 TenFourFox/G5
|
||||
Mozilla/5.0 (Macintosh; PPC Mac OS X 10.5; FPR8; rv:45.0) Gecko/20100101 Firefox/45.0 TenFourFox/7450
|
||||
Mozilla/5.0 (Macintosh; PPC Mac OS X 10.8; rv:47.0) Gecko/20100101 Firefox/47.0
|
||||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.10; rv:59.0) Gecko/20100101 Firefox/59.0
|
||||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.10; rv:64.0) Gecko/20100101 Firefox/64.0
|
||||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.10; rv:65.0) Gecko/20100101 Firefox/65.0
|
||||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.11; rv:59.0) Gecko/20100101 Firefox/59.0
|
||||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.11; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.11; rv:61.0) Gecko/20100101 Firefox/61.0
|
||||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.11; rv:62.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.12; rv:54.0) Gecko/20100101 Firefox/54.0
|
||||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.12; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.12; rv:61.0) Gecko/20100101 Firefox/61.0
|
||||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.12; rv:62.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.13; rv:59.0) Gecko/20100101 Firefox/59.0
|
||||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.13; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.13; rv:62.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.85 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:20.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/45.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 IceDragon/40.1.1.18 Firefox/40.0.2
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0 Framafox/43.0.1
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0 SeaMonkey/2.40
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.63.16) Gecko/20175595 Firefox/45.63.16
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0 SeaMonkey/2.46
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/45.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/47.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/64.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0 Cyberfox/52.9.1
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0 SeaMonkey/2.49.1
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0 SeaMonkey/2.49.2
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0 SeaMonkey/2.49.2 Lightning/5.4
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0 SeaMonkey/2.49.3
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0 SeaMonkey/2.49.4
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0 Zotero/5.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.9) Gecko/20100101 Firefox/52.9
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.6.2
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.7.2
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.8.2
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.8.3
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.9.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.9.1
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.9.2
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.9.3
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 Basilisk/20180927
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.0a2
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.1
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.1.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/50.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0 SeaMonkey/2.49.3
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/57.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0 IceDragon/58.0.1
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:59.0) Gecko/20100101 Firefox/59.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0 IceDragon/60.0.2
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.9) Gecko/20100101 Goanna/4.1 Firefox/60.9 PaleMoon/28.2.1
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:61.0) Gecko/20100101 Firefox/61.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:61.0) Gecko/20100101 Firefox/61.0 IceDragon/61.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:62.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:62.0) Gecko/20100101 Firefox/62.0 IceDragon/62.0.2
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:64.0) Gecko/20100101 Firefox/64.0
|
||||
Mozilla/5.0 (Windows NT 10.0; WOW64; rv:65.0) Gecko/20100101 Firefox/65.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; rv:54.0) Gecko/20100101 Firefox/54.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; rv:55.0) Gecko/20100101 Firefox/55.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; rv:59.0) Gecko/20100101 Firefox/59.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; rv:61.0) Gecko/20100101 Firefox/61.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; rv:61.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; rv:62.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:41.0) Gecko/20100101 Firefox/41.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:43.0) Gecko/20100101 Firefox/43.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:43.0) Gecko/20100101 Firefox/43.0.4 Waterfox/43.0.4
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:45.0) Gecko/20100101 Firefox/45.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:46.0) Gecko/20100101 Firefox/46.0.1 Waterfox/46.0.1
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:49.0) Gecko/20100101 Firefox/49.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:51.0) Gecko/20100101 Firefox/51.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0 Cyberfox/52.0.4
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0 Cyberfox/52.5.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0 Cyberfox/52.5.2
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0 Cyberfox/52.7.2
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0 Cyberfox/52.7.4
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0 Cyberfox/52.8.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0 Cyberfox/52.9.1
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0.2 Waterfox/52.0.2
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/59.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/3.3 Firefox/52.9 PaleMoon/27.5.1
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.8.3
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.9.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.9.1
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.9.2
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.9.3
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.9.4
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 Basilisk/20180424
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 Basilisk/20180515
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 Basilisk/20180601
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 Basilisk/20180718
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 Basilisk/20180905
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 Basilisk/20180927
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.0.1
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.0.1
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.1.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:53.0) Gecko/20100101 Firefox/53.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0.1 Waterfox/54.0.1
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0.1 Waterfox/56.0.1
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0.4 Waterfox/56.0.4
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.3
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.4
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0; Waterfox) Gecko/20100101 Firefox/56.2.5
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/59.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:44.0) Gecko/20100101 Firefox/44.0.1
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:45.0) Gecko/20100101 Firefox/45.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:47.0) Gecko/20100101 Firefox/47.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:49.0) Gecko/20100101 Firefox/49.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:50.0) Gecko/20100101 Firefox/50.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:51.0) Gecko/20100101 Firefox/51.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:52.0) Gecko/20100101 Firefox/52.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:52.0) Gecko/20100101 Firefox/52.0 Cyberfox/52.7.2
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:52.0) Gecko/20100101 Firefox/52.0 Cyberfox/52.9.1
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:52.0) Gecko/20100101 Firefox/52.0 SeaMonkey/2.49.4
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.9.1
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.9.1a1
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:52.9) Gecko/20100101 Goanna/3.4 Firefox/52.9 PaleMoon/27.9.3
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:52.9) Gecko/20100101 Goanna/4.1 Firefox/52.9 PaleMoon/28.1.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:53.0) Gecko/20100101 Firefox/53.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:55.0) Gecko/20100101 Firefox/55.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:56.0) Gecko/20100101 Firefox/56.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:57.0) Gecko/20100101 Firefox/57.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:58.0) Gecko/20100101 Firefox/58.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:59.0) Gecko/20100101 Firefox/59.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:61.0) Gecko/20100101 Firefox/61.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:62.0) Gecko/20100101 Firefox/62.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (Windows NT 10.0; rv:64.0) Gecko/20100101 Firefox/64.0
|
||||
Mozilla/5.0 (Windows NT 4.0; rv:52.0) Gecko/20100101 Firefox/52.0
|
||||
Mozilla/5.0 (Windows NT 5.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0
|
||||
Mozilla/5.0 (Windows NT 5.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0
|
||||
Mozilla/5.0 (Windows NT 5.1; WOW64; rv:61.0) Gecko/20100101 Firefox/61.0
|
||||
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0
|
||||
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0
|
||||
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0
|
||||
Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (Windows NT 6.1; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
|
||||
Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (X11; CrOS x86_64 11021.81.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
|
||||
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36
|
||||
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36
|
||||
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36
|
||||
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36
|
||||
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
|
||||
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36
|
||||
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36
|
||||
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36
|
||||
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
|
||||
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/70.0.3538.77 Chrome/70.0.3538.77 Safari/537.36
|
||||
Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
|
||||
Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0
|
||||
Mozilla/5.0 (X11; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0
|
||||
Mozilla/5.0 (X11; OpenBSD amd64; rv:56.0) Gecko/20100101 Firefox/66.0
|
||||
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0
|
||||
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0
|