Contact Scanner

What is this?

The project is a small Python web scraper built with Selenium and BeautifulSoup.

What does it do?

The scraper opens the impressum (legal notice) page of a given website and scans it for an email address and a name, guided by the keywords defined in a supplied file. After it scrapes the page, it writes the results to a CSV file.

NOTE: The scraper does NOT guarantee 100% correct email-name pairs. It returns only the pairs that it can build, so you should always take the results with a grain of salt.
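
As a rough illustration of the scan described above, here is a minimal sketch that uses only the standard library and works on page text that has already been extracted. It is not the code in app.py or lib. The names find_contact and write_rows, the email pattern, the pairing heuristic (text after a keyword, or the next non-empty line), and the CSV column names are all assumptions made for this example.

```python
import csv
import re

# Deliberately simple email pattern (an assumption, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_contact(page_text, keywords):
    """Guess a (name, email) pair from page text; either value may be None."""
    match = EMAIL_RE.search(page_text)
    email = match.group(0) if match else None

    name = None
    lines = [line.strip() for line in page_text.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        for keyword in keywords:
            position = line.lower().find(keyword.lower())
            if position == -1:
                continue
            # Prefer the text after the keyword on the same line,
            # otherwise fall back to the next non-empty line.
            rest = line[position + len(keyword):].strip(" :")
            if rest:
                name = rest
            elif index + 1 < len(lines):
                name = lines[index + 1]
            break
        if name:
            break
    return name, email


def write_rows(rows, handle):
    """Write (url, name, email) rows to an open text file as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["url", "name", "email"])  # hypothetical column names
    writer.writerows(rows)
```

A heuristic like this pairs the first email on the page with the first name it finds, which is one reason the pairs can be wrong.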

How to use it?

Prerequisites

You are going to need the following things installed:

  • Chrome
  • Python 3
  • pip3
  • ChromeDriver (the Selenium driver for Chrome)

Once all four are installed, continue with the steps below.

Dependencies

The dependencies are listed in requirements.txt. Install them with the following command:

pip3 install -r requirements.txt

Usage

The application has the following synopsis:

SYNOPSIS

python3 app.py URL_FILE KEYWORD_FILE

where URL_FILE is a file listing the URLs to scan, one URL per line, and KEYWORD_FILE contains the keywords used to search for names, in the same one-entry-per-line format. For best results, trim trailing whitespace from every line.
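
Both input files share the same line-based format, so a single loader could read either one. The sketch below is an assumption about how such a loader might look, not the code in app.py. The names clean_entries and read_entries are hypothetical. It strips surrounding whitespace and skips blank lines, so stray trailing spaces do not end up in URLs or keywords.

```python
def clean_entries(lines):
    """Strip surrounding whitespace from each line and drop empty lines."""
    return [line.strip() for line in lines if line.strip()]


def read_entries(path):
    """Load a URL_FILE or KEYWORD_FILE: one entry per line, UTF-8 assumed."""
    with open(path, encoding="utf-8") as handle:
        return clean_entries(handle)
```

For example, read_entries("urls.txt") would return a list of URLs, where urls.txt is a placeholder file name.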

Usage constraints

You should NOT

  1. use this scraper for generating spam lists
  2. use this scraper without respecting the robots.txt of the target
  3. use this scraper when you have explicitly agreed with the website not to scrape it
  4. use this scraper if you're not using it under fair use
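
One way to honor the robots.txt constraint is to check each URL before scanning it. The sketch below uses urllib.robotparser from the Python standard library. The user agent string "contact-scanner" and the helper names are assumptions made for this example, and nothing here is confirmed to be part of app.py.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

USER_AGENT = "contact-scanner"  # hypothetical user agent name


def parser_from_text(robots_txt):
    """Build a parser from robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def parser_for_site(site_url):
    """Download and parse the robots.txt of a site (performs an HTTP request)."""
    parser = RobotFileParser()
    parser.set_url(urljoin(site_url, "/robots.txt"))
    parser.read()
    return parser


def may_scan(parser, url):
    """Return True if the rules allow USER_AGENT to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)
```

A caller could skip any URL for which may_scan returns False and record it as not scanned.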

Fair use

The scraper falls under fair use because it is designed to search pages for facts, not for content.