# Contact Scanner

## What is this?
This project is a small Python web scraper built with Selenium and BeautifulSoup.
## What does it do?
The scraper visits the Impressum (legal notice) page of a given website and scans it for an email address and a name, guided by the keywords defined in a supplied file. After scraping a page, it writes the results to a CSV file.

NOTE: The scraper does NOT guarantee 100% correct email-name pairs; it returns the pairs it is able to build. Always take the results with a grain of salt.
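The scanning step can be sketched roughly like this. This is an illustrative, simplified version of the idea (plain regex matching instead of Selenium/BeautifulSoup; the function name, pairing logic, and file names are assumptions, not the app's actual code):

```python
import csv
import re

# Rough email pattern; good enough for a sketch, not fully RFC-compliant.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def scan_page(text, keywords):
    """Scan page text for emails and keyword-tagged names; return (email, name) pairs."""
    emails = EMAIL_RE.findall(text)
    names = []
    for line in text.splitlines():
        for kw in keywords:
            if kw in line:
                # Treat the text after the keyword as a candidate name.
                candidate = line.split(kw, 1)[1].strip(" :,")
                if candidate:
                    names.append(candidate)
    # Pair emails with names positionally; unmatched emails get an empty name.
    return [(email, names[i] if i < len(names) else "")
            for i, email in enumerate(emails)]

page = "Impressum\nGeschäftsführer: Max Mustermann\nE-Mail: max@example.com"
pairs = scan_page(page, ["Geschäftsführer"])

# Write the pairs to a CSV file, as the real scraper does with its results.
with open("results.csv", "w", newline="") as f:
    csv.writer(f).writerows(pairs)
```

The positional pairing is exactly why the note above applies: an email and a name that merely appear on the same page may not belong together.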
## How to use it?

### Prerequisites
You will need the following installed:

- Chrome
- Python 3
- pip3
- Selenium Chrome driver

Once all four are installed, continue with the dependencies.
### Dependencies

The dependencies are listed in requirements.txt. Install them with:

```
pip3 install -r requirements.txt
```
### Usage

The application has the following synopsis:

```
python3 app.py URL_FILE KEYWORD_FILE
```
where URL_FILE is a file listing the URLs to scan, one per line, and KEYWORD_FILE contains the keywords used to search for names, in the same one-per-line format (trim trailing whitespace for best results).
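As a hypothetical example, the two input files and the matching invocation could look like this (file names and contents are placeholders):

```
# urls.txt — one URL per line
https://example.com
https://example.org
```

```
# keywords.txt — one keyword per line
Geschäftsführer
Inhaber
```

```
python3 app.py urls.txt keywords.txt
```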
## Usage constraints

You should NOT:

- use this scraper for generating spam lists
- use this scraper without respecting the robots.txt of the target
- use this scraper when you have explicitly agreed with the website not to scrape it
- use this scraper if your use does not qualify as fair use
## Fair use

The scraper falls under fair use because it is designed to search pages for facts, not for content.