# Contact Scanner

## What is this?
This project is a small Python web scraper built with Selenium and BeautifulSoup.
## What does it do?
The scraper visits the Impressum (legal notice) page of a given website and scans it for an email address and a name, guided by the keywords defined in a supplied file. After scraping a page, it writes the results to a CSV file.

NOTE: The scraper does NOT guarantee 100% correct email-name pairs; it returns the pairs it is able to build. Always take the results with a grain of salt.
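The scanning step can be sketched roughly like this. This is an illustrative, simplified version of the idea (plain regex matching instead of Selenium/BeautifulSoup; the function name, pairing logic, and file names are assumptions, not the app's actual code):

```python
import csv
import re

# Rough email pattern; good enough for a sketch, not fully RFC-compliant.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scan_page(text, keywords):
    """Scan page text for emails and keyword-tagged names; return (email, name) pairs."""
    emails = EMAIL_RE.findall(text)
    names = []
    for line in text.splitlines():
        for kw in keywords:
            if kw in line:
                # Treat the text after the keyword as a candidate name.
                candidate = line.split(kw, 1)[1].strip(" :,")
                if candidate:
                    names.append(candidate)
    # Pair emails with names positionally; unmatched emails get an empty name.
    return [(email, names[i] if i < len(names) else "")
            for i, email in enumerate(emails)]

page = "Impressum\nGeschäftsführer: Max Mustermann\nE-Mail: max@example.com"
pairs = scan_page(page, ["Geschäftsführer"])

# Write the pairs to a CSV file, as the real scraper does with its results.
with open("results.csv", "w", newline="") as f:
    csv.writer(f).writerows(pairs)
```

The positional pairing is exactly why the note above applies: an email and a name that merely appear on the same page may not belong together.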
## How to use it?

### Prerequisites
You will need the following installed:

- Chrome
- Python 3
- pip3
- Selenium Chrome driver

Once all four are installed, continue with the dependencies.
### Dependencies

The dependencies are listed in requirements.txt. Install them with:

```
pip3 install -r requirements.txt
```
### Usage

The application has the following synopsis:

```
python3 app.py URL_FILE KEYWORD_FILE
```
where URL_FILE is a file listing the URLs to scan, one per line, and KEYWORD_FILE contains the keywords used to search for names, in the same one-per-line format (trim trailing whitespace for best results).
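As a hypothetical example, the two input files and the matching invocation could look like this (file names and contents are placeholders):

```
# urls.txt — one URL per line
https://example.com
https://example.org
```

```
# keywords.txt — one keyword per line
Geschäftsführer
Inhaber
```

```
python3 app.py urls.txt keywords.txt
```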
## Usage constraints

You should NOT:

- use this scraper for generating spam lists
- use this scraper without respecting the robots.txt of the target
- use this scraper when you have explicitly agreed with the website not to scrape it
- use this scraper if your use does not qualify as fair use
## Fair use

The scraper falls under fair use because it is designed to search pages for facts, not for content.