This is the second part of the archive project. See Post 1 for the Mac OS side of the fixes and for the purpose of this project.
What the program needs to do:
- Add extensions to the files that need them and clean out the files that don’t need to be archived.
- Files need extensions so that they can still be opened after they are recorded to disc.
- Remove old files I don’t need, based on their extensions.
- Remove the files that couldn’t be given an extension.
First, I tried a couple of Python libraries that are meant to identify a file’s type, but when I ran them over a group of random files they reported everything as plain documents. To get around this, I wrote a basic check of my own.
Since this is a one-time-run program, I didn’t add any error handling or split the work into separate functions; it is just a straight, line-by-line script. The full code with comments came out to only 50 lines. It’s a tiny program for a basic, repetitive job that would have taken days or weeks to do by hand.
As discussed in the Mac OS side of this archive project, several items were done manually. Most of them could possibly have been done in the program, but they were quick and easy to do on the computer. When I’m working on a one-time-use program, I compare how long the task would take by hand against a quick search on how fast it could be programmed, and go with whichever is quicker. If this were something I would run several times, run on a server, or build for others to use, I would add more checks and setup code to handle these tasks.
The program:
Import the regular expression (re) and os libraries.
os is used to walk through the files, rename them and delete them.
re is used to check whether a file already has an extension by matching the file name against a pattern.
import os
import re
Create the variable that will hold the extension to add to a file. You could define this closer to where the code uses it. I put it near the top because I knew I would need it but wasn’t sure how much of that section I might rewrite, so I kept it up here where it was safe.
ext=""
List of extensions to delete. This is just a plain list of file extensions. My real list was very long, and I added to it as I did sample looks through folders.
toDelete = ['.fla', '.log', '.bmap']
Directory path to run through files:
directory = r'<Insert path to your files here>'
This walks through the directory listed above and every directory inside it.
for dirpath, dirnames, files in os.walk(directory):
This iterates through all the files found in each directory.
    for filename in files:
Check whether the file already has an extension, using a regular expression that limits the extension to 2 to 4 alphabetic characters after the last dot. If you know you have extensions with numbers, you’ll need to add them to the pattern. I knew some of the file names contained dates like 09.23.12, so I left digits out, but I needed the minimum of 2 for Adobe Illustrator files (.ai):
        if re.match(r"^.*\.[A-Za-z]{2,4}$", filename):
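To see what this pattern accepts and rejects, here is a small sketch that runs it over a few file names. The names and the helper has_extension are made up for illustration and are not from my archive.

```python
import re

# Same pattern as above: a dot followed by 2 to 4 letters at the end of the name.
EXT_PATTERN = re.compile(r"^.*\.[A-Za-z]{2,4}$")

def has_extension(filename):
    """Return True when the name ends in a dot plus 2 to 4 letters."""
    return EXT_PATTERN.match(filename) is not None

# Hypothetical sample names, chosen to show the edge cases.
samples = ["logo.ai", "report.pdf", "flyer 09.23.12", "notes", "disk.7z", "index.xhtml"]
for name in samples:
    print(name, has_extension(name))
```

The date-style name, the name with no dot, the extension containing a digit and the five-letter extension are all treated as having no extension, so they would all go on to the keyword check.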
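Checks the file name against the toDelete list and, if it ends with one of those extensions, deletes the file. Files with any other extension are left alone.
            # endswith avoids deleting names that only contain '.log', such as brand.logo.ai
            if filename.endswith(tuple(toDelete)):
                # This combines the path and file name
                delete = os.path.join(dirpath, filename)
                os.remove(delete)
                continue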
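One thing to watch when checking a name against a list of extensions is that a substring test also matches names that only contain the text somewhere in the middle. Here is a minimal sketch of the difference, using made-up file names and a hypothetical helper named should_delete that compares only the final extension returned by os.path.splitext. Lowercasing the extension is an extra choice of mine for the example.

```python
import os

to_delete = ['.fla', '.log', '.bmap']  # same sample list as above

def should_delete(filename, extensions):
    """Return True only when the file's final extension is in the list."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in extensions

# Hypothetical names: the second one contains ".log" but is an Illustrator file.
for name in ["install.log", "brand.logo.ai", "ANIMATION.FLA"]:
    substring_hit = any(x in name for x in to_delete)
    print(name, substring_hit, should_delete(name, to_delete))
```

The substring test flags brand.logo.ai for deletion and misses ANIMATION.FLA, while the splitext comparison does the opposite on both. Drop the .lower() call if upper-case extensions should be kept.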
If there isn’t an extension, this else section reads up to the first 107 bytes of the file’s first line and checks whether certain keywords appear in it, then stores the matching extension in the variable ext. The “rb” mode opens the file for reading in binary format.
        else:
            # Move 107 higher or lower depending on your findings; I went higher than needed for mine.
            # The with block closes the file before it is renamed or deleted.
            with open(os.path.join(dirpath, filename), "rb") as f:
                sample = f.readline(107)
This example covers EPS and PDF files; just keep adding elif branches for any other file types. The b prefix makes the keyword a bytes literal, which is needed because the file was opened in binary mode and the sample is bytes rather than text.
            if b'EPSF' in sample:
                ext = '.eps'
            elif b'PDF' in sample:
                ext = '.pdf'
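If the list of file types grows, the if/elif chain could also be written as a lookup table. This is only a sketch of that idea, not what I ran: the SIGNATURES list and the guess_extension function are names made up for the example, and the table contains only the two keywords shown above.

```python
# Keyword and extension pairs, checked in order (EPSF first, as in the chain above).
SIGNATURES = [
    (b'EPSF', '.eps'),
    (b'PDF', '.pdf'),
]

def guess_extension(sample):
    """Return the extension for the first keyword found in sample, or None."""
    for keyword, extension in SIGNATURES:
        if keyword in sample:
            return extension
    return None

print(guess_extension(b'%!PS-Adobe-3.0 EPSF-3.0'))  # typical EPS header line
print(guess_extension(b'%PDF-1.4'))                 # typical PDF header line
print(guess_extension(b'plain text'))               # no keyword found
```

The first match wins, so the order of the pairs matters in the same way the order of the elif branches does. A None result corresponds to the case where the file gets deleted.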
If there is no extension and the program can’t identify one to add, this deletes the file.
            else:
                # Combines the path and file name for deletion.
                delete = os.path.join(dirpath, filename)
                os.remove(delete)
                continue
This combines the file path and file name, then renames the file with the extension added. It sits back at the level of the main else, after the if/elif chain that looks for the file type.
            os.rename(os.path.join(dirpath, filename), os.path.join(dirpath, filename + ext))
During the run I did hit some locked files that stopped the program. I could have spent another hour or more working on a try/except or finding a way to remove the lock programmatically. Instead I just looked at each file: if I wanted to keep it I unlocked it, otherwise I deleted it. Altogether there were 4 of them, and it took me less than 15 minutes in total to rerun the program and find and remove the files. Finding them didn’t take long, since the full path showed in the terminal window.
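For a version that would be run more than once, the locked files could be skipped and reported instead of stopping the run. I did not do this. The following is a rough sketch under the assumption that a locked file makes os.remove raise an OSError (such as PermissionError). The helper name remove_or_report and the skipped list are made up for the example.

```python
import os

skipped = []  # paths that could not be removed, to review by hand afterwards

def remove_or_report(path):
    """Try to delete path; on failure record it and keep going."""
    try:
        os.remove(path)
        return True
    except OSError as err:
        # PermissionError and FileNotFoundError are both subclasses of OSError.
        skipped.append((path, str(err)))
        return False
```

After the walk finishes, printing skipped would give the same full paths I was reading from the terminal, all in one place.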
With this approach I know some files I may have wanted will not get saved, but these are old archives, eight years and older, so the chance of needing them is small. If I wanted to make sure I kept everything, I would just zip up large chunks of the files and record those.
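Zipping a chunk can also be done from Python with the standard library. This is a minimal sketch, assuming you supply the folder path yourself. The function name zip_folder and both of its arguments are placeholders and not part of my program.

```python
import shutil

def zip_folder(folder, archive_base):
    """Zip everything under folder into archive_base + '.zip' and return its path."""
    return shutil.make_archive(archive_base, 'zip', root_dir=folder)
```

It is safest to write the archive somewhere outside the folder being zipped, so the zip file does not end up trying to include itself.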
The Code:
Just the basic outline from above put together:
import os
import re

# Holds the extension that will be added to a file
ext = ""
# List of extensions to delete
toDelete = ['.fla', '.log', '.bmap']
# Directory path to run through files
directory = r'<Insert path to your files here>'

for dirpath, dirnames, files in os.walk(directory):
    for filename in files:
        # Check whether an extension (2 to 4 letters) already exists
        if re.match(r"^.*\.[A-Za-z]{2,4}$", filename):
            # If the name ends with an extension from toDelete, delete the file
            if filename.endswith(tuple(toDelete)):
                delete = os.path.join(dirpath, filename)
                os.remove(delete)
                continue
        else:
            # No extension: read up to 107 bytes of the first line and look for keywords
            with open(os.path.join(dirpath, filename), "rb") as f:
                sample = f.readline(107)
            if b'EPSF' in sample:
                ext = '.eps'
            elif b'PDF' in sample:
                ext = '.pdf'
            # If the file doesn't match any of the above, delete it
            else:
                delete = os.path.join(dirpath, filename)
                os.remove(delete)
                continue
            # Rename the file with the detected extension added
            os.rename(os.path.join(dirpath, filename), os.path.join(dirpath, filename + ext))