Skip to content

iRail/harvester

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

47 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data harvester

Scraping is a slow task, usually run by multiple machines. These commands add messages to a queue, with specific data on what to scrape. Workers then listen to the queue and start scraping. Just add more workers to get the job done more quickly.

Just add configuration for a Message Queueing service, for example ironMQ. Trigger artisan commands using cron to fill the queue. Other artisan commands like queue:worker or queue:listen can then read messages and perform long running scraping tasks. These listening/long polling processes should be supervised by eg: supervisor.

Configuration.

Use a message queue. Laravel supports IronMQ, Amazon SQS and beanstalkd. Configure it in app/config/production/queue.php. This harvester uses MongoDB as system of record (SOR). Configure it in app/config/production/database.php. No need to change the default value in that config array. At the bottom, do configure redis, because we are using Redis both; to control which jobs are already on the queue, and to save statistics about the harvester. Since you have Redis configured, feel free to use it for caching as well. See app/config/cache.php for more information.

Usage

add cron job for daily run to /etc/crontab:

#m  h  dom mon dow   user    command
 *  3    *   *   *   ubuntu  cd /path/to/application && php artisan harvest:daily

also add cron job for consecutive runs:

#m  h  dom mon dow   user    command
*/5 *    *   *   *   ubuntu  cd /path/to/application && php artisan harvest:delays

If needed, one can run a manual command to pull specific data

php artisan harvest:manual type id date

The last 3 parameters are required. Type can be one of "trip" or "stop". The ID is numeric, and date is formatted Ymd.

Workers

Workers are started with

php artisan queue:listen

Optionally, add --timeout=

Releases

No releases published

Packages

No packages published