Stale MapReduce Staging Directories

I had a problem where HDFS would fill up really fast on my small test cluster. Using hdfs dfs -du, I tracked it down to the MapReduce staging directory under /user/root/.staging: for some reason, old job directories weren't always being cleaned up there. I wasn't sure why this kept happening on multiple clusters, but I needed a quick workaround, so I wrote a small Python script that lists all staging directories and removes any that don't belong to a currently running job. The script runs from cron, and I can now use the cluster without worrying that it's going to run out of space.
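If you need to do the same kind of digging, a couple of plain du calls are enough to spot the offender (the paths below are just the ones from my setup):

hdfs dfs -du /user/root
hdfs dfs -du /user/root/.staging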

This script is pretty slow, and it could probably be made much faster with Snakebite or even some Java code. That said, for daily or even hourly clean-up it's good enough.

import os
import re
import subprocess

# Job IDs of all active jobs: keep entries whose numeric state column
# in `mapred job -list all` is 1 (RUNNING) or 4 (PREP).
all_jobs_raw = subprocess.check_output(
  'mapred job -list all'.split())
running_jobs = re.findall(
  r'^(job_\S+)\s+(?:1|4)\s+\d+\s+\w+.*$',
  all_jobs_raw, re.M)

# Job IDs that still have a staging directory under /user/root/.staging.
staging_raw = subprocess.check_output(
  'hdfs dfs -ls /user/root/.staging'.split())
staging_dirs = re.findall(
  r'^.*/user/root/\.staging/(\w+)\s*$',
  staging_raw, re.M)

# Anything with a staging directory but no active job is stale.
stale_staging_dirs = set(staging_dirs) - set(running_jobs)

# Delete each stale directory; -skipTrash frees the space immediately.
for stale_dir in stale_staging_dirs:
  os.system(
    'hdfs dfs -rm -r -f -skipTrash ' +
    '/user/root/.staging/%s' % stale_dir)

The script requires at least Python 2.7 and was tested with Hadoop 2.0.0-cdh4.5.0.
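For reference, a crontab entry along these lines is all the scheduling there is to it; the script path and log location below are placeholders, and cron's PATH needs to include the hdfs and mapred binaries:

# Hypothetical paths; adjust to wherever the script and its log live.
0 * * * * /usr/bin/python /root/clean_staging.py >> /var/log/clean_staging.log 2>&1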
