Stale MapReduce Staging Directories

I had a problem where HDFS would fill up very quickly on my small test cluster. Using hdfs dfs -du, I tracked it down to the MapReduce staging directory under /user/root/.staging. For some reason, the directories of old jobs were not always deleted. I never found out why this kept happening on multiple clusters, but I needed a quick workaround. I wrote a small Python script that lists all staging directories and removes any that do not belong to a currently running job. The script runs from cron, and I can now use my cluster without worrying that it will run out of space.

This script is pretty slow, largely because every hdfs and mapred command it runs starts its own JVM. It could probably be made much faster with Snakebite or even some Java code. That said, for daily or even hourly clean-up it is good enough.

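import re
import subprocess

STAGING_ROOT = '/user/root/.staging'

# List every job the cluster knows about. universal_newlines=True makes
# check_output return text instead of bytes on Python 3.
all_jobs_raw = subprocess.check_output(
    ['mapred', 'job', '-list', 'all'], universal_newlines=True)
# Keep only jobs whose numeric state is 1 (running) or 4 (prep).
running_jobs = re.findall(
    r'^(job_\S+)\s+(?:1|4)\s+\d+\s+\w+.*$',
    all_jobs_raw, re.M)

# List the staging directories and pull the job ID out of each path.
staging_raw = subprocess.check_output(
    ['hdfs', 'dfs', '-ls', STAGING_ROOT], universal_newlines=True)
staging_dirs = re.findall(
    r'^.*' + re.escape(STAGING_ROOT) + r'/(\w+)\s*$',
    staging_raw, re.M)

# Anything staged that is not tied to a running or pending job is stale.
stale_staging_dirs = set(staging_dirs) - set(running_jobs)

for stale_dir in stale_staging_dirs:
    # Pass an argument list so no shell is involved in building the command.
    subprocess.call(
        ['hdfs', 'dfs', '-rm', '-r', '-f', '-skipTrash',
         '%s/%s' % (STAGING_ROOT, stale_dir)])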
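
One cheap way to speed up the delete loop, short of switching to Snakebite, is to hand all stale directories to a single hdfs dfs -rm call, since the command accepts several paths at once. The sketch below is only an illustration: build_rm_command and remove_stale are hypothetical helper names, not part of the original script, and I have not measured the gain.

```python
import subprocess

STAGING_ROOT = '/user/root/.staging'  # same path the script above uses


def build_rm_command(stale_dirs, staging_root=STAGING_ROOT):
    """Return one `hdfs dfs -rm` argument list covering every stale directory."""
    paths = ['%s/%s' % (staging_root, name) for name in sorted(stale_dirs)]
    return ['hdfs', 'dfs', '-rm', '-r', '-f', '-skipTrash'] + paths


def remove_stale(stale_dirs):
    """Delete all stale directories with a single CLI call; do nothing if empty."""
    if not stale_dirs:
        return 0
    # One process (and one JVM start) instead of one per directory.
    return subprocess.call(build_rm_command(stale_dirs))
```

With this in place, the for loop could be replaced by remove_stale(stale_staging_dirs). If thousands of directories pile up, the argument list may become too long for the operating system, so splitting it into chunks might be necessary.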

The script requires at least Python 2.7 and was tested with Hadoop 2.0.0-cdh4.5.0.
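
One gap in the script is a small race: a job submitted after mapred job -list runs but before hdfs dfs -ls runs has a staging directory and is not in the running list, so it would be treated as stale. A possible guard is to skip directories modified recently. The sketch below is an assumption-laden illustration, not something I have run on a cluster. It assumes each listing line ends with a date, a time, and the path (the usual hdfs dfs -ls layout), and that those timestamps are in the local time zone of the machine running the script. The names LS_LINE and old_enough are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Captures date, time, and job directory name from one `hdfs dfs -ls` line.
# Assumed layout: perms, replication, owner, group, size, date, time, path.
LS_LINE = re.compile(
    r'^d\S+\s+.*?(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2})\s+'
    r'\S*/\.staging/(\w+)\s*$', re.M)


def old_enough(staging_raw, now, min_age=timedelta(hours=1)):
    """Return names of staging directories last modified more than min_age ago."""
    names = set()
    for day, clock, name in LS_LINE.findall(staging_raw):
        modified = datetime.strptime(day + ' ' + clock, '%Y-%m-%d %H:%M')
        if now - modified > min_age:
            names.add(name)
    return names
```

To use it, intersect the result with the stale set before deleting, for example by keeping only the entries of stale_staging_dirs that also appear in old_enough(staging_raw, datetime.now()). The one-hour default is arbitrary and should be tuned to how the cluster is used.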