The logfile, old frenemy

From time to time, shit happens. In these trying times, you will have no choice but to review some logs. Times have changed: where our tools of choice used to be egrep, Notepad++ or a good old Nagios handler, now you will most likely be reading logs in a fancy colorful GUI running Grafana, Kibana, Datadog or whatever replacement tool your employer has had the brilliant idea to use instead of industry best practices.
Take this Revolutionary moment in time:
time="1968-05-02 00:00:49.150" level=MERDE msg="[ Paris is on Fire! ] DELETE https://capitalism from A. Bunch.of.Situationists - 200 200B in 3100 µs"
How do we find out how many people revolted against capitalism in Paris, May of 1968? Well, we analyze the logs.
Using Regex
You might not be able to get away with using regex on your logs, not until you fix them up a bit first. This is what Data Scientists like to call "Extract" and "Transform", or simply "data preparation".
But for now, let’s assume that the Paris anarchists decided to generate some nice and orderly, mostly uniform logs.
Here, Python's built-in regex module, re, can shine.
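For instance, a single pattern with named groups is enough to pull apart the example line above. A minimal sketch, assuming your lines follow that exact time=... level=... msg=... layout (the line variable below just copies the log excerpt):

import re

# Hypothetical example line, copied from the log excerpt above.
line = 'time="1968-05-02 00:00:49.150" level=MERDE msg="[ Paris is on Fire! ] DELETE https://capitalism from A. Bunch.of.Situationists - 200 200B in 3100 µs"'

# One pattern, three named groups: timestamp, level, and the free-form message.
pattern = re.compile(r'time="(?P<time>[^"]+)"\s+level=(?P<level>\S+)\s+msg="(?P<msg>[^"]*)"')

match = pattern.search(line)
if match:
    print(match.group("time"))   # 1968-05-02 00:00:49.150
    print(match.group("level"))  # MERDE
    print(match.group("msg"))    # [ Paris is on Fire! ] DELETE https://capitalism ...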
Method:
1. Build a regular expression to match the log information you care about: the date and time, the visited URL, the HTTP status code, the level of mayhem, the error count, etc. For example:
   hours = r'\d\d\d\d-\d\d-\d\d\s\d\d'
2. Use Python's findall, match or another appropriate re method to gather all the results you need in one place:
   datum = open(log_file).read()
   found = re.findall(hours, datum)
3. Use collections.Counter to find the most prevalent hour of revolutionary strife, the most visited URL, or any other interesting data point:
   the_most = Counter(found)
The code:
from pathlib import Path
from collections import Counter
import re

logs_dir = Path('.')
logs_files = logs_dir.iterdir()
list_files = []
INFO_re = 0
ERROR_re = 0

# Keep only the log files, skip this script itself.
for file in logs_files:
    if file.suffix == ".py":
        continue
    list_files.append(file)

total_lines = 0
datum = ''

# Read every log file into one big string.
for file in list_files:
    with open(file, 'r') as f:
        datum += f.read()

# Timestamps truncated to the hour and to the minute.
hours = re.findall(r'\d\d\d\d-\d\d-\d\d\s\d\d', datum)
minutes = re.findall(r'\d\d\d\d-\d\d-\d\d\s\d\d:\d\d', datum)
hours_counter = Counter(hours)
minutes_counter = Counter(minutes)
max_hour = hours_counter.most_common(1)
max_minute = minutes_counter.most_common(1)

# Log levels and most visited URLs.
INFO_re += len(re.findall(r'level=info', datum))
ERROR_re += len(re.findall(r'level=error', datum))
URL_re = re.findall(r'https?://[.\w\d/]*\s', datum)
u = Counter(URL_re)
top_5 = u.most_common(5)

lines = datum.split('\n')
total_lines += len(lines)

print(f"Top hour: {max_hour}", f"Top minute: {max_minute}")
print(f"5 most common urls: {top_5}")
print(f"Info lines by regex: {INFO_re}")
print(f"Error lines by regex: {ERROR_re}")
print(f"We have {total_lines} total lines")
Using Pandas

Yeah but, that's old school, right? Everyone wants to use Pandas nowadays, and we are on board: Pandas is amazing, powerful and fast. However, it doesn't magically read all your data into the fields you want; you still have to do the data preparation.
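To see why, try feeding a raw logfile straight into Pandas. A minimal illustration, assuming a file with lines like the example above (the name paris-1968.log is made up for the example):

import pandas as pd

# Naive whitespace split: you get raw columns like 'time="1968-05-02'
# and 'level=MERDE', so the cleaning is still on you.
# on_bad_lines needs pandas >= 1.3; it just skips lines with a different field count.
raw = pd.read_csv("paris-1968.log", sep=" ", header=None, on_bad_lines="skip")
print(raw.head())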
At a large scale, we can call this "Extract, Transform and Load" and send it through some fancy "data pipelines", "Kafka streams" or "Hadoops", and thus we enter the realm of Data Engineering. But on a small scale, it's just some Python file, buffer and string manipulation, wrangling with dates, and once again figuring out which regular expressions need to be found and catalogued.
The code:
import pandas as pd
from pathlib import Path
import re

logs_dir = Path('.')
logs_files = logs_dir.iterdir()
lines = []
fields = None

for file in logs_files:
    if file.suffix == ".py":
        continue
    with open(file) as f:
        while ln := f.readline():
            spaces = ln.split(" ")
            # The first line sets the reference for how many fields a line should have.
            if fields is None:
                fields = len(spaces)
            if len(spaces) != fields:
                print(f"Non-standard line ignored: {ln}")
                continue
            # time="1968-05-02 00:00:49.150" -> "1968-05-02" and "00:00:49.150"
            date = spaces[0].split("=")[1].strip('"').strip("'")
            time = spaces[1].strip('"').strip("'")
            datetime = date + " " + time
            # level=MERDE -> MERDE
            level = spaces[2].split('=')[1].strip("'").strip('"')
            # Everything after the third field is the msg="..." payload.
            msg = ' '.join(spaces[3:]).split('=')[1].strip("\n").strip('"')
            # Pull out the first URL in the message, if there is one.
            if match := re.search(r'(https?://[\w./]*)\s', msg):
                url = match.group(1)
            else:
                url = None
            lines.append([datetime, level, url, msg])

names = ["datetime", "level", "url", "msg"]
df = pd.DataFrame(lines, columns=names)
# Parse the timestamp column into real datetimes.
df["datetime"] = pd.to_datetime(df["datetime"], format="%Y-%m-%d %H:%M:%S.%f")
print(df.url.value_counts())
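And once the DataFrame exists, the rest of the analysis turns into one-liners. A couple of examples, building only on the columns defined above:

# How many lines per level.
print(df.level.value_counts())

# Busiest hour of revolutionary strife, pandas-style.
print(df.set_index("datetime").resample("1h").size().sort_values(ascending=False).head(1))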
There we go.
In the world of Data, we are!