What is this?
The project is a small Python web scraper built with Selenium and BeautifulSoup.
What does it do?
The scraper goes to the Impressum (legal notice) page of a given website and scans it for an email address and a name, using the keywords defined in a supplied file to locate the name. After it scrapes the page, it writes the results to a CSV file.
NOTE: The scraper does NOT guarantee 100% correct email-name pairs. It returns the pairs that it can build, so you should always take the results with a grain of salt.
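To illustrate why the pairs are only best guesses, here is a minimal sketch of keyword-based pairing. It is not the code in app.py: the function names `find_pairs` and `write_csv`, the email pattern, and the "text after the keyword is the name" rule are all assumptions made for this example. It works on plain page text and uses only the standard library.

```python
import csv
import io
import re

# Simplified email pattern (an assumption; real addresses can be more exotic).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_pairs(text, keywords):
    """Return (name, email) guesses found in the visible text of a page."""
    emails = EMAIL_RE.findall(text)
    names = []
    for line in text.splitlines():
        for keyword in keywords:
            if keyword in line:
                # Hypothetical rule: whatever follows the keyword is the name.
                candidate = line.split(keyword, 1)[1].strip(" :\t")
                if candidate:
                    names.append(candidate)
                break
    # Pair by order of appearance; this is a guess, not a verified match.
    return list(zip(names, emails))


def write_csv(pairs, out):
    """Write the pairs to an open text file object as CSV with a header row."""
    writer = csv.writer(out)
    writer.writerow(["name", "email"])
    writer.writerows(pairs)


page_text = "Vertreten durch: Max Mustermann\nE-Mail: info@example.com\n"
pairs = find_pairs(page_text, ["Vertreten durch"])
buffer = io.StringIO()
write_csv(pairs, buffer)
```

Because names and emails are matched purely by position, a page that lists two people but one shared address would produce a wrong or missing pair, which is the kind of error the note above warns about.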
How to use it?
You are going to need the following things installed:
- Python 3
- Selenium Chrome driver
Once these are installed, continue with the steps below.
The dependencies are listed in requirements.txt. Install them with the following command:
pip3 install -r requirements.txt
The application has the following synopsis:
SYNOPSIS python3 app.py URL_FILE KEYWORD_FILE
URL_FILE is a file containing the URLs to scan, one URL per line.
KEYWORD_FILE contains the keywords used to search for names. It uses the same format, one keyword per line (trim trailing whitespace for best results).
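As a rough sketch of how the two input files could be read, the helper below loads one entry per line and strips surrounding whitespace, so stray trailing spaces do not end up in a URL or keyword. The names `read_entries` and `parse_args` are hypothetical and not taken from app.py.

```python
from pathlib import Path


def read_entries(path):
    """Return the non-empty lines of a file with surrounding whitespace removed."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def parse_args(argv):
    """Expect argv in the form [program, URL_FILE, KEYWORD_FILE]."""
    if len(argv) != 3:
        raise SystemExit("usage: python3 app.py URL_FILE KEYWORD_FILE")
    urls = read_entries(argv[1])
    keywords = read_entries(argv[2])
    return urls, keywords
```

A script would typically call `parse_args(sys.argv)` at startup. Stripping in code is a convenience; keeping the files clean is still the safer choice.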
You should NOT
- use this scraper for generating spam lists
- use this scraper without acknowledging the robots.txt of the target
- use this scraper when you have explicitly agreed with the website not to scrape it
- use this scraper if you're not using it under fair use
The scraper falls under fair use because it is designed to search pages for facts, not for content.
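For the robots.txt point in the list above, one way to check a URL before scraping it is Python's standard `urllib.robotparser` module. This is a sketch, not something the scraper is known to do itself; the helper name `is_allowed` and the sample rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt content permits fetching the URL."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for the example.
sample_rules = "User-agent: *\nDisallow: /private/\n"
impressum_ok = is_allowed(sample_rules, "https://example.com/impressum")
private_ok = is_allowed(sample_rules, "https://example.com/private/data")
```

To check a live site instead, `RobotFileParser` can also download the file itself via `set_url()` followed by `read()`.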