# Contact Scanner

## What is this?

The project is a small Python web scraper built with Selenium and BeautifulSoup.

## What does it do?

The scraper goes to the Impressum (legal notice) page of a given website and scans it for an email address and a name, following the keywords defined in a supplied file. After it scrapes the page, it writes the results to a CSV file.

**NOTE:** The scraper does **NOT** guarantee 100% correct email-name pairs. It returns the pairs that it can **build**, so you should always take the results with a grain of salt.
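
As a rough illustration of how such pairs might be built, the sketch below pulls email-like strings out of already-extracted page text with a regular expression and treats whatever follows a keyword on the same line as a name candidate. The function names (`find_emails`, `find_names`, `build_rows`, `write_csv`), the regular expression, and the pair-by-position rule are all assumptions made for this example. They are not taken from `app.py`, and the real scraper may work differently.

```python
import csv
import re
from itertools import zip_longest

# Deliberately loose pattern for email-like strings (an assumption for this sketch).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return the unique email-like strings in text, in order of appearance."""
    emails = []
    for match in EMAIL_RE.findall(text):
        if match not in emails:
            emails.append(match)
    return emails


def find_names(text, keywords):
    """Return whatever follows a keyword on the same line as a name candidate."""
    names = []
    for line in text.splitlines():
        for keyword in keywords:
            found = re.search(re.escape(keyword), line, re.IGNORECASE)
            if found:
                candidate = line[found.end():].strip(" :\t")
                if candidate:
                    names.append(candidate)
    return names


def build_rows(url, text, keywords):
    """Pair emails and names by position; unmatched slots stay empty."""
    pairs = zip_longest(find_emails(text), find_names(text, keywords), fillvalue="")
    return [(url, email, name) for email, name in pairs]


def write_csv(rows, out_file):
    """Write a header and the (url, email, name) rows to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["url", "email", "name"])
    writer.writerows(rows)
```

Pairing by position is the simplest possible rule and can easily match the wrong name to an address, which is one reason the results need manual review.
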
## How to use it?

### Prerequisites

You are going to need the following installed:

* Chrome
* Python 3
* pip3
* ChromeDriver (the Selenium driver for Chrome)

Once all four are installed, continue with the steps below.

### Dependencies

The dependencies are listed in [requirements.txt](requirements.txt). Install them with the following command:

```
pip3 install -r requirements.txt
```

### Usage

The application has the following synopsis:

```
SYNOPSIS

python3 app.py URL_FILE KEYWORD_FILE
```

where `URL_FILE` is a file listing the URLs to scan, one URL per line, and `KEYWORD_FILE` lists the keywords used to search for names, in the same one-entry-per-line format. For best results, trim trailing whitespace from each line.
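
Both files share the same line-oriented format, so a single helper can read either one. The following is a minimal sketch, assuming UTF-8 encoded files; `parse_entries` and `read_entries` are hypothetical names, not functions from `app.py`. It strips surrounding whitespace and skips blank lines, which covers the trailing-whitespace advice.

```python
from pathlib import Path


def parse_entries(text):
    """Return the non-empty lines of text with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def read_entries(path):
    """Read a URL or keyword file (assumed UTF-8) into a list of entries."""
    return parse_entries(Path(path).read_text(encoding="utf-8"))
```

With this in place, `read_entries("urls.txt")` and `read_entries("keywords.txt")` would return plain lists of strings; both file names are placeholders.
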

### Usage constraints

You should **NOT**:

1. use this scraper to generate spam lists
2. use this scraper without respecting the target's `robots.txt`
3. use this scraper when you have explicitly agreed with the website not to scrape it
4. use this scraper outside of fair use
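
One way to honor the `robots.txt` constraint is to check each URL before fetching it. The sketch below uses Python's standard `urllib.robotparser` module; the helper names `robots_url` and `is_allowed` are hypothetical and not part of this project. The rules are passed in as text so the check itself needs no network access, and downloading the file and choosing a user agent are left to the caller.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url):
    """Build the robots.txt location for the site that hosts page_url."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_allowed(robots_txt, page_url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)
```

For example, with the rules `User-agent: *` and `Disallow: /private/`, `is_allowed` accepts `https://example.com/impressum` and rejects `https://example.com/private/page`.
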

## Fair use

The scraper falls under fair use because it is designed to search for *facts* in pages and not for *content*.