Skip to content

Analyzing over 1,000,000 HTTP requests from a blog site

License

Notifications You must be signed in to change notification settings

miguelHx/log-analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 

Repository files navigation

Log Analysis Project

Purpose is to answer three questions based on data from a postgresql database called news There are three tables:

  1. articles
  • Column:Type -> author:integer, title:text, slug:text, lead:text, body:text, time:timestamptz, id:integer
  • 8 rows
  1. authors
  • Column:Type -> name:text, bio:text, id:integer
  • 4 rows
  1. log
  • Column:Type -> path:text, ip:inet, method:text, status:text, time:timestamptz, id:integer
  • 1,677,735 rows

Questions that the program answers

  1. What are the most popular three articles of all time? Which articles have been accessed the most? Present this information as a sorted list with the most popular article at the top.
  2. Who are the most popular article authors of all time? That is, when you sum up all of the articles each author has written, which authors get the most page views? Present this as a sorted list with the most popular author at the top.
  3. On which days did more than 1% of requests lead to errors? The log table includes a column status that indicates the HTTP status code that the news site sent to the user's browser.

Setting up the database

  1. Install VirtualBox from this link (Install package for your OS)
  • Ubuntu users: If you are running Ubuntu 14.04, install VirtualBox using the Ubuntu Software Center instead. Due to a reported bug, installing VirtualBox from the site may uninstall other software you need.
  1. Install Vagrant
  • Download it from this link
  • Windows users: The Installer may ask you to grant network permissions to Vagrant or make a firewall exception. Be sure to allow this.
  1. Download this repository as a zip file and then unzip to wherever
  2. Open up a terminal and run cd log-analysis and then cd vagrant
  • vagrant setup file will be in there
  1. Inside the vagrant subdirectory, run the command vagrant up
  • This will cause Vagrant to download the Linux operating system and install it
  • May take a while (many minutes) depending on your internet connection speed
  1. When vagrant up finishes running and the shell prompt is back, run vagrant ssh to log in to the Linux VM
  • Files in the VM's /vagrant directory are shared with the vagrant folder on your computer. But other data inside the VM is not.
  • For instance, the PostgreSQL database itself lives only inside the VM.
  • If you exit the terminal or reboot the computer, you will need to run vagrant up to restart the VM
  1. Now that you are logged in, run cd /vagrant to gain access to the shared directory
  2. Download the .sql file from here, unzip, put file into same directory as log-analysis.py
  3. Run psql -d news -f newsdata.sql to populate the database
  4. Once the data is loaded, connect to the database using psql -d news and explore the tables using \dt and \d table commands and select statements.
  5. Setup is complete! Now it is time to run the log analysis program I wrote.

Steps to Run Log Analysis

  1. Make sure you are logged in to the VM
  2. cd into /vagrant: cd /vagrant
  3. run python log-analysis.py

Performance

  • Each question will happen quickly thanks to the speed of SQL queries.

Notes

  • Views were used for testing, but not in the final code.
  • Subqueries were used in place of views instead.
  • You can see some of the views in sql_query_notebook.txt
  • sql_query_notebook.txt just contains queries that I wanted to have as notes.

Links that helped me on this project

About

Analyzing over 1,000,000 HTTP requests from a blog site

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published