contact-scan/README.md

2018-12-18 21:55:28 +00:00
# Contact Scanner
## What is this?
This project is a small Python web scraper built with Selenium and BeautifulSoup.
## What does it do?
The scraper visits the Impressum (legal notice) page of each given website and scans it for an email address and a name, using the keywords defined in a supplied file. After scraping the page, it writes the results to a CSV file.
**NOTE:** The scraper does **NOT** guarantee 100% correct email-name pairs. It returns only the pairs that it can **build**, so you should always take the results with a grain of salt.
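
To make the scan step more concrete, here is a minimal sketch of how email-name pairs might be built from page text and written to CSV. It is not the code in `app.py`: the function names, the simplified email pattern, the "name follows the keyword on the same line" rule, and the CSV column names are all assumptions made for illustration.

```python
import csv
import re

# Simplified email pattern (an assumption, not the project's actual regex).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return the unique email addresses in text, in order of appearance."""
    seen = []
    for match in EMAIL_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def find_names(text, keywords):
    """Return whatever follows a keyword on the same line (hypothetical rule)."""
    names = []
    for line in text.splitlines():
        for keyword in keywords:
            _, found, rest = line.partition(keyword)
            if found:
                name = rest.strip(" :\t")
                if name:
                    names.append(name)
                break  # one keyword match per line is enough
    return names


def build_pairs(url, text, keywords):
    """Naively pair names and emails by position; the pairs may be wrong."""
    return [(url, name, email)
            for name, email in zip(find_names(text, keywords), find_emails(text))]


def write_pairs(rows, handle):
    """Write the pairs as CSV rows with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["url", "name", "email"])
    writer.writerows(rows)
```

Pairing by position is exactly the kind of guess the note above warns about: a page with two names and one email address yields a single pair, and that pair may not belong together.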
## How to use it?
### Prerequisites
You are going to need the following things installed:
* Chrome
* Python 3
* Pip3
* Selenium Chrome driver (ChromeDriver)

Once all four are installed, continue with the steps below.
### Dependencies
The dependencies are listed in [requirements.txt](requirements.txt). Install them with the following command:
```
pip3 install -r requirements.txt
```
### Usage
The application has the following synopsis:
```
SYNOPSIS
python3 app.py URL_FILE KEYWORD_FILE
```
where ```URL_FILE``` is a file listing the URLs to scan, one URL per line, and ```KEYWORD_FILE``` contains the keywords used to search for names. The keyword file uses the same format, one keyword per line (trim trailing whitespace for best results).
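
As an illustration of the input format, the sketch below reads a one-entry-per-line file and trims the whitespace that the note above warns about. The helper names `read_entries` and `parse_args` are hypothetical and are not taken from `app.py`.

```python
from pathlib import Path


def read_entries(path):
    """Read one entry per line, trimming whitespace and skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def parse_args(argv):
    """Return (urls, keywords) for an argv shaped like: app.py URL_FILE KEYWORD_FILE."""
    if len(argv) != 3:
        raise SystemExit("usage: python3 app.py URL_FILE KEYWORD_FILE")
    return read_entries(argv[1]), read_entries(argv[2])
```

With this approach a stray space after a keyword or a blank line at the end of the file no longer affects matching, although trimming the files by hand remains the safer option for the real application.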
### Usage constraints
You should **NOT**:
1. use this scraper for generating spam lists
2. use this scraper without acknowledging the `robots.txt` of the target
3. use this scraper when you have explicitly agreed with the website not to scrape it
4. use this scraper for anything that does not qualify as fair use
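
One way to honor the `robots.txt` constraint is to check each URL before scanning it. The sketch below uses Python's standard `urllib.robotparser`. The function names and the default `"*"` user agent are assumptions, and nothing here implies that the scraper performs this check itself.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def fetch_and_check(url, user_agent="*"):
    """Download the site's robots.txt (network request) and check the URL."""
    parser = RobotFileParser(urljoin(url, "/robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, url)
```

URLs for which the check returns `False` could be removed from the URL file before running the scraper.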
## Fair use
The scraper falls under fair use because it is designed to search pages for *facts*, not for *content*.