Code Review

drpanwe | 04/07/19 09:35:20 PM UTC
https://www.youtube.com/watch?v=eIWFnNz8mF4&t=217s
 It is a concurrent scraper written in Go.
 How it works:
 You call the binary (iterscraper) and give it a URL such as "http://foo.com/%d", where '%d' is a placeholder that is replaced by an ID, e.g. 'http://foo.com/1' up to 'http://foo.com/9'. With '-concurrency' you set how many goroutines run at the same time, and with '-output' you choose the file the output is written to. The '-nameQuery', '-addressQuery' and '-emailQuery' flags are the CSS selectors used to find whatever we are looking for (name, address, e-mail) in each page (e.g. 'http://foo.com/1').
 A basic package used for scraping information from a website where URLs contain an incrementing integer. Information is retrieved from HTML5 elements and output as a CSV.
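
 The '%d' substitution described above can be sketched with fmt.Sprintf. This is only an illustration: 'expandURLs' and its parameters are hypothetical names, not taken from the iterscraper source.

```go
package main

import "fmt"

// expandURLs is a hypothetical helper (not part of iterscraper).
// It fills the %d placeholder in pattern with every ID from first to last.
func expandURLs(pattern string, first, last int) []string {
	var urls []string
	for id := first; id <= last; id++ {
		urls = append(urls, fmt.Sprintf(pattern, id))
	}
	return urls
}

func main() {
	// Prints http://foo.com/1 through http://foo.com/9, one per line.
	for _, u := range expandURLs("http://foo.com/%d", 1, 9) {
		fmt.Println(u)
	}
}
```
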
 1. Fetch the code:
go get github.com/philipithomas/iterscraper
 2. Go to the code (in GOPATH mode, 'go get' places it under $GOPATH/src)
cd "$(go env GOPATH)/src/github.com/philipithomas/iterscraper"
 3. Create a new branch
git checkout -b work
 4. Open VSCode
code .
 main.go
-------
* Defines all the flags
* Parses those flags
* Uses a WaitGroup and channels to coordinate the different parts of the work
* The work is split into 3 parts:
  * emitTasks --> generates every task that needs to be done. Each task is a URL with an ID.
                  It sends each task to the 'taskChan' channel.
  * scrape --> a worker that receives a task from the 'taskChan' channel, fetches the URL, parses the page,
               finds whatever we need to find and then sends the results to the 'dataChan' channel.
  * writeSites --> writes all the output to a CSV file
