Tuesday, June 2, 2015

HDFS Directories Under the Hive Metastore Location (/user/hive/warehouse/) Without Corresponding Metastore Tables

Sometimes, for some reason, tables dropped from Hive/Impala leave their corresponding HDFS directories behind.
As far as I can tell (see also the link below), the issue is usually the Trash directory: either the permissions on it are wrong, or the user dropping the table has no HDFS home directory at all.

https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/8J0IZf8E1yc

In our case we suspect that, it was a HUE user without HDFS home directory.
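If you suspect the same root cause, a quick way to verify is to test whether the user actually has an HDFS home directory. Below is a minimal sketch in the same style as the main script; the user name "hueuser" is only a placeholder, substitute the user that drops the tables.

#!/usr/bin/python
# quick check: does the user dropping tables have an HDFS home directory?
# "hueuser" is a placeholder user name
import os

# hadoop fs -test -d returns 0 if the path exists and is a directory
status = os.system("hadoop fs -test -d /user/hueuser")
if status != 0:
    print "no HDFS home directory for hueuser; dropped data may fail to move to Trash"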

Whatever the cause, we have to keep track of these orphan directories.
Manually identifying these HDFS locations is too tiring for a lazy guy like me, so I prepared the simple Python script below. I hope it helps others too.

Please comment if you have a better idea or if there is something wrong with my script.
Please test it properly before taking any action based on its output.

#!/usr/bin/python

# Script to find HDFS directories under the metastore location (/user/hive/warehouse/) without corresponding tables
# Call the script with the target database name as a mandatory parameter; with no parameter it will show an error
# If the database name is wrong it will show "No such file or directory" & "Database does not exist"
# Prerequisites: hive CLI (NOT beeline), Python 2.6, direct access to the target Hadoop cluster

import sys
import os

# Convert a String having words separated by new line(\n) to list
# Example input : convertNewLineStringsToList('abc\n123\n')
# Example output: ['abc', '123']
def convertNewLineStringsToList(stringWithNewLine):
    out = []
    buff = []
    for c in stringWithNewLine:
        if c == '\n':
            out.append(''.join(buff))
            buff = []
        else:
            buff.append(c)
    else:
        # the for-else runs after the loop finishes: flush any trailing
        # characters that were not terminated by a newline
        if buff and buff[0]:
            out.append(''.join(buff))
    return out

# Get the list of directory names (one per table) from HDFS for a metastore database
# Example input : getHDFSDirForDB("dbname")
# Example output: ['table_a', 'table_b']
def getHDFSDirForDB(dbname):
    # get the HDFS listing by running the hadoop fs command from the OS
    hdfs_list = os.popen("hadoop fs -ls /user/hive/warehouse/" + dbname + ".db/|awk '{print $8}'|cut -d'/' -f6")
    hdfs_list = hdfs_list.read()
    # collect the HDFS dirs as a list
    out = convertNewLineStringsToList(hdfs_list)
    # return the non-empty HDFS dir names; keep the case intact, as the HDFS location must match exactly
    return [x for x in out if x]


# Get list of all table names for a metastore database
# Example input : getTablesForDB("dbname")
# Example output: ['table_a', 'table_b'] 
def getTablesForDB(dbname):
    # get the table list by running the hive command from the OS
    # silent mode (-S) gets rid of the unwanted text
    table_list = os.popen("hive --database " + dbname + " -S -e \"show tables\"|sed -n '1!p'")
    table_list = table_list.read()
    # collect the table names as a list
    out = convertNewLineStringsToList(table_list)
    # return the non-empty table names, lowercased
    return [x.lower() for x in out if x]



# Main program, called with the target DB name for which we want to find HDFS directories without corresponding tables
# Example input : main("dbname")
# Example output:
# table_a
# table_b
def main(dbname):
    HDFS_LIST = getHDFSDirForDB(dbname)
    DB_LIST = getTablesForDB(dbname)

    # directories that do not have corresponding tables,
    # i.e. elements of HDFS_LIST that are not in DB_LIST;
    # lowercase the HDFS names while comparing with the (already lowercase) table list
    diff = [x for x in HDFS_LIST if x.lower() not in DB_LIST]

    print '\n'.join(diff)


#________________main________________
if __name__ == "__main__":
    main(sys.argv[1])
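To run it, pass the target database name as the only argument. For example, assuming the script is saved as find_orphan_dirs.py and the database is named testdb (both names are just examples), the orphan directories are printed one per line:

$ python find_orphan_dirs.py testdb
old_backup_table
tmp_load_dir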


Thanks to this post for the idea behind the string-to-list conversion: http://stackoverflow.com/questions/13169725/how-to-convert-a-string-that-has-newline-characters-in-it-into-a-list-in-python
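The same difference can also be written as a literal set minus. Here is a small sketch reusing HDFS_LIST and DB_LIST from main() above; note that, unlike the list comprehension, this drops the original case of the HDFS names:

# alternative: set difference instead of a list comprehension
diff = set(x.lower() for x in HDFS_LIST) - set(DB_LIST)
print '\n'.join(diff)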
