"README.md" did not exist on "7fbf7e296bc14320336087cd494de02a10550e70"
README.md 1.68 KB
Newer Older
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
# Contact Scanner
## What is this?
This project is a small Python web scraper built with Selenium and BeautifulSoup.

## What does it do?
The scraper visits the Impressum page of a given website and scans it for an email address and a name, guided by the keywords defined in a supplied file. After scraping the page, it writes the results to a CSV file.

**NOTE:** The scraper does **NOT** guarantee correct email-name pairs. It returns the pairs that it can **build**, so you should always take the results with a grain of salt.
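
For illustration, scanning a single URL could be implemented roughly as in the sketch below. This is a hypothetical outline, not the actual code in `app.py`; the helper name `scan_page`, the email regex, and the example keywords are all assumptions.
```
import csv
import re

from bs4 import BeautifulSoup
from selenium import webdriver

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scan_page(url, keywords):
    # Load the page with Selenium so JavaScript-rendered content is included.
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        html = driver.page_source
    finally:
        driver.quit()

    # Flatten the page to plain text and take the first email address found.
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    email = next(iter(EMAIL_RE.findall(text)), "")

    # Treat the capitalized words right after a keyword as a candidate name.
    name = ""
    for keyword in keywords:
        match = re.search(
            re.escape(keyword) + r"\s*:?\s*([A-ZÄÖÜ][\w.-]+(?:\s+[A-ZÄÖÜ][\w.-]+)?)",
            text,
        )
        if match:
            name = match.group(1)
            break

    return name, email


if __name__ == "__main__":
    row = scan_page("https://example.com/impressum", ["Geschäftsführer", "Inhaber"])
    with open("results.csv", "w", newline="") as csv_file:
        csv.writer(csv_file).writerow(row)
```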

## How to use it?
### Prerequisites
You are going to need the following things installed:
* Chrome
* Python 3
* pip3
* ChromeDriver (the Selenium WebDriver for Chrome)

Once all four are installed, continue with the next step.
### Dependencies
The dependencies are listed in [requirements.txt](requirements.txt). Install them with the following command:
```
pip3 install -r requirements.txt
```

### Usage
The application has the following synopsis:
```
SYNOPSIS

python3 app.py URL_FILE KEYWORD_FILE
```

where ```URL_FILE``` is a file listing the URLs to be scanned, one URL per line, and ```KEYWORD_FILE``` contains the keywords used to search for names. Both files use the same one-entry-per-line format (trim trailing whitespace for best results).
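
For example, the two input files could look like this (the URLs and keywords are only illustrative):

```URL_FILE```:
```
https://example.com/impressum
https://example.org/impressum
```

```KEYWORD_FILE```:
```
Geschäftsführer
Inhaber
Kontakt
```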

### Usage constraints
You should **NOT**
1. use this scraper for generating spam lists
2. use this scraper without acknowledging the `robots.txt` of the target (see the sketch after this list)
3. use this scraper when you have explicitly agreed with the website not to scrape it
4. use this scraper if you're not using it under fair use
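
One minimal way to honor point 2 is to check `robots.txt` yourself before each run. The `allowed_to_scrape` helper below is a hypothetical sketch using only the Python standard library; it is not part of `app.py`.
```
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def allowed_to_scrape(url, user_agent="ContactScanner"):
    # Fetch and parse the site's robots.txt, then ask whether this URL
    # may be crawled by the given user agent.
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = RobotFileParser(urljoin(root, "/robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, url)
```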

## Fair use
The scraper falls under fair use because it is designed to search pages for *facts*, not for *content*.