
# BlogCrawler

This prototype is a focused web crawler that visits pre-defined websites, extracts their contents, and saves them into an Elasticsearch datastore. The blog crawler is based on Scrapy, a Python framework that facilitates the development of web scraping applications.

## Requirements

- Linux or Mac OS X
- Python 2.7 or newer
- Scrapy 0.22 or newer
- lxml
- pyOpenSSL
- Elasticsearch 1.2.1 or newer
- Java Runtime Environment 7 or newer

## Installation

The easiest way to install the required software is to use the package manager of the operating system. The commands below were tested on Ubuntu 14.04, but they should work on all distributions that use the apt-get package manager. For yum-based systems such as Fedora and SuSE, some modifications might be required.

The following commands will make sure that all requirements are met:

sudo apt-get install -y build-essential git python-pip python python-dev libxml2-dev libxslt-dev lib32z1-dev openjdk-7-jdk
sudo pip install pyopenssl lxml scrapy elasticsearch dateutils Twisted service_identity
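Whether the Python dependencies were installed correctly can be verified with a quick sanity check such as the one below. The script is not part of the repository; it simply tries to import the installed modules:

```python
# Sanity check: try to import the Python dependencies the crawler needs.
# Not part of the repository; just a convenience for verifying the setup.
import importlib

for module in ("scrapy", "lxml", "OpenSSL", "elasticsearch", "twisted"):
    try:
        importlib.import_module(module)
        print("%s: OK" % module)
    except ImportError as err:
        print("%s: MISSING (%s)" % (module, err))
```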

Elasticsearch cannot be installed via apt-get, so it has to be installed manually. First, download the latest Elasticsearch version:

wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-x.y.z.deb

where x.y.z is the current version number and thus needs to be replaced. Then the installation can be triggered via:

sudo dpkg -i elasticsearch-x.y.z.deb

Again, x.y.z refers to the latest version number and has to be replaced. Once the installation is complete, Elasticsearch can be started using the following command:

sudo service elasticsearch start
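Whether the node is up and reachable can be checked with the Python client installed above. This assumes the default host and port (localhost:9200):

```python
# Check that the local Elasticsearch node answers on the default
# host/port (localhost:9200) used by the Debian package.
from elasticsearch import Elasticsearch

es = Elasticsearch()
if es.ping():
    print(es.info())  # cluster name and version details
else:
    print("Elasticsearch is not reachable on localhost:9200")
```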

Note that the service will not start automatically when the computer boots. If this is required, the following command has to be used:

sudo update-rc.d elasticsearch defaults 95 10

The blog crawler can be cloned from GitHub:

git clone purl.org/eexcess/components/research/blogcrawler

The crawler can be invoked by changing into the freshly cloned repository directory and starting the script crawlall.sh. Note that the process can take several hours. To interrupt the process, press <Ctrl+Z>.
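For orientation, a Scrapy-based crawler of this kind boils down to a spider that extracts fields from the pre-defined sites and an item pipeline that writes them to Elasticsearch. The sketch below only illustrates this pattern, assuming a recent Scrapy release (dicts can be yielded directly from parse since Scrapy 1.0); the class names, CSS selectors, and index name are invented and do not reflect the repository's actual code:

```python
# Illustrative sketch of the spider/pipeline pattern the crawler is built on.
# All names (spider, selectors, index) are placeholders, not the real code.
import scrapy
from elasticsearch import Elasticsearch

class ExampleBlogSpider(scrapy.Spider):
    name = "exampleblog"
    start_urls = ["http://example.com/blog"]  # a pre-defined site

    def parse(self, response):
        # Extract a title, body text, and URL for every post on the page.
        for post in response.css("article"):
            titles = post.css("h2::text").extract()
            yield {
                "title": titles[0] if titles else None,
                "text": " ".join(post.css("p::text").extract()),
                "url": response.url,
            }

class ElasticsearchPipeline(object):
    """Item pipeline that indexes every scraped item into Elasticsearch.
    Enabled via ITEM_PIPELINES in the Scrapy project's settings.py."""

    def open_spider(self, spider):
        self.es = Elasticsearch()  # default: localhost:9200

    def process_item(self, item, spider):
        self.es.index(index="blogcrawler", doc_type="post", body=dict(item))
        return item
```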
