https://www.youtube.com/watch?v=eIWFnNz8mF4&t=217s
iterscraper is a concurrent web scraper written in Go (@golang).
How it works:
You call the binary (iterscraper) and give it a URL such as "http://foo.com/%d", where '%d' is a placeholder that is replaced by each ID in turn, e.g. 'http://foo.com/1' up to 'http://foo.com/9'. You then choose how many goroutines run at the same time (-concurrency) and where the output is written (-output). Finally, '-nameQuery', '-addressQuery' and '-emailQuery' are the CSS selectors used to find the name, address and e-mail on each page (e.g. 'http://foo.com/1').
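The '%d' substitution described above can be sketched in a few lines of Go. This is only an illustration, not the tool's actual code: the helper name expandURLs, the flag names -url, -from and -to, and the half-open ID range are all assumptions made for the example.

```go
package main

import (
	"flag"
	"fmt"
)

// expandURLs substitutes every ID in [from, to) into the "%d" placeholder
// of pattern. The name and the half-open range are assumptions of this sketch.
func expandURLs(pattern string, from, to int) []string {
	if to <= from {
		return nil
	}
	urls := make([]string, 0, to-from)
	for id := from; id < to; id++ {
		urls = append(urls, fmt.Sprintf(pattern, id))
	}
	return urls
}

func main() {
	// Hypothetical flag names, chosen for this example only.
	pattern := flag.String("url", "http://foo.com/%d", "URL pattern containing a numeric placeholder")
	from := flag.Int("from", 1, "first ID (inclusive)")
	to := flag.Int("to", 10, "last ID (exclusive)")
	flag.Parse()

	for _, u := range expandURLs(*pattern, *from, *to) {
		fmt.Println(u)
	}
}
```

Run with the defaults, the sketch lists 'http://foo.com/1' through 'http://foo.com/9', matching the example range above.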
A basic package for scraping information from a website whose URLs contain an incrementing integer. Information is retrieved from HTML5 elements and output as a CSV file.
1. Fetch the code:
go get github.com/philipithomas/iterscraper
2. Go to the code (in GOPATH mode, 'go get' places it under $GOPATH/src):
cd $GOPATH/src/github.com/philipithomas/iterscraper
3. Create a new branch:
git checkout -b work
4. Open VSCode:
code .
main.go
-------
* Defines all the flags
* Parses those flags
* Uses a WaitGroup and channels to coordinate the different parts of the work
* There are three parts:
  * emitTasks --> generates every task that needs to be done. Every task is a URL with an ID.
    It sends each task to the 'taskChan' channel.
  * scrape --> a worker that receives a task from the 'taskChan' channel, fetches and parses the page,
    finds whatever we are looking for and sends the results to the 'dataChan' channel.
  * writeSites --> writes all the output to a CSV file
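The three parts above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions, not the code in main.go: the site type, the fetch callback (standing in for the real HTTP request and CSS-selector lookup), the run helper and the in-memory buffer (instead of the -output file) are all made up for the example. Only the names emitTasks, scrape, writeSites, taskChan and dataChan come from the notes.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"sync"
)

// site is a hypothetical result type: the data scraped from one page.
type site struct {
	url  string
	name string
}

// emitTasks sends one URL per ID in [from, to) to taskChan, then closes it
// so that the workers' range loops terminate.
func emitTasks(pattern string, from, to int, taskChan chan<- string) {
	for id := from; id < to; id++ {
		taskChan <- fmt.Sprintf(pattern, id)
	}
	close(taskChan)
}

// scrape is one worker: it drains taskChan and sends one result per task
// to dataChan. fetch is an assumed stand-in for the real page scraping.
func scrape(fetch func(string) site, taskChan <-chan string, dataChan chan<- site, wg *sync.WaitGroup) {
	defer wg.Done()
	for url := range taskChan {
		dataChan <- fetch(url)
	}
}

// writeSites writes a header and one CSV row per result until dataChan is
// closed. It always drains the channel so that workers never block forever.
func writeSites(w io.Writer, dataChan <-chan site) error {
	cw := csv.NewWriter(w)
	cw.Write([]string{"url", "name"})
	for s := range dataChan {
		cw.Write([]string{s.url, s.name})
	}
	cw.Flush()
	return cw.Error() // reports any error from the earlier Write or Flush calls
}

// run wires the three stages together and returns the CSV text.
func run(pattern string, from, to, concurrency int, fetch func(string) site) (string, error) {
	if concurrency < 1 {
		concurrency = 1
	}
	taskChan := make(chan string)
	dataChan := make(chan site)

	go emitTasks(pattern, from, to, taskChan)

	var wg sync.WaitGroup
	wg.Add(concurrency)
	for i := 0; i < concurrency; i++ {
		go scrape(fetch, taskChan, dataChan, &wg)
	}

	// Close dataChan once every worker is done so writeSites can return.
	go func() {
		wg.Wait()
		close(dataChan)
	}()

	var buf bytes.Buffer
	err := writeSites(&buf, dataChan)
	return buf.String(), err
}

func main() {
	// A fake fetcher keeps the sketch runnable without network access.
	fake := func(url string) site { return site{url: url, name: "name for " + url} }
	out, err := run("http://foo.com/%d", 1, 4, 2, fake)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

With more than one worker the row order is not deterministic, because whichever goroutine finishes first writes to dataChan first. The WaitGroup is what lets the sketch know when it is safe to close dataChan.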