Python updating a large number of files

June 12, 2021

While trying to get more practice with Python I look for projects or find things and think how could I do this as a program. They might not be the most exciting or useful but they help in learning more elements to Python. They also help me build up a coding library so if I do need to do something like this in the future I can open up one of these projects and get a head start. No use having to research it again, another reason to make sure you properly comment your code.

This project came from running across an old HTML website, I think it had a copyright date around 2007. I thought if this was a big site, over 100 pages, how would I go about changing all those pages to the current year’s copyright?

When setting up to create a project I first wrote up what I would need for the program to do, much like you would get from a client. I added extra to it so it would be more like working with a real client and get more experience in the code.

I came up with the idea of a small computer store that created their site back in 1996. They are now an LLC company and needed to added that to the end of their company name. They have decided they no longer needed to advertise their, how to get kids into computers, page and will replace those with their internet services information; but keep the kids page active on the site.

It’s important to make sure you’re working with copies of your files, on the off chance that you might have the code wrong and accidentally convert your originals. I usually create a new folder for the tests then move just a few files over to see how it works before working on the full set of files, even then I’ll work on a backup of the originals.

What the program needs to do:

  • Update the copyright year
  • Add LLC to end of the company name (My Awesome Company) everywhere on site
  • Change ‘Kid’s Corner’ to ‘Internet’
  • Change ‘kids.html’ to ‘internet.html’
  • Footer: add website above address
  • Footer: add email below phone number

To start the Python program I found the ‘os’ library would work best for this project and shouldn’t need other libraries, when researching I found different ways to do parts of this using different libraries. Using as few libraries is best, much like when doing a CMS (Content Management Site, like WordPress) having the less dependencies, like plug-ins, is the best. It has less chance of one of them causing a conflict with another.

First line we’ll import the os library, this will help with pointing to directories, saving, and a few other things:

import os

Since this will go over several files at one time, there’s little need to have a user input line for them to add the 2 directories. Also, with a project like this you’ll most likely run this program yourself and won’t want to enter these lines over and over. If this would be something for user input you’ll want to add that feature at the end for the same reason of not wasting time of repeating those tasks.

directory = r'<path to folder>’
saveTo = '<path to folder>'

First line: is where the files are located

Second line: is where the processed files will be saved to.

You can name these 2 variables anything that will be easy for you to remember. I put them in this order to remind myself that the first will be input then second is output, keeping in path to how the program works.

On a PC you can open the file then copy the path in that window’s explorer window.

On a Mac you can get the path by clicking the folder hitting command + i; then copy the area called ‘Where:’

The second directory, where you are saving to, if it’s doesn’t already exist, the program will create it. I set it up to be in the same folder as the original files folder is held: the client, called client_path/HTML; the second folder I called client_path/HTML_Converted; this directory wasn’t already created. We will put in code for it to check if it’s already created, if not then it will create it for us.

isDir = os.path.isdir(saveTo)
if isDir == False:
    os.mkdir(saveTo)
    print('Dir. Created')

First line: we’ve got a variable that is holding a Boolean value of if the save to directory. This is using the os library to check that the directory is storied in that path exists.
Second line: basic If Statement checking if the directory path is ‘False’. Meaning that it doesn’t exist. Python Boolean’s are either ‘True’ or ‘False’.   
Third line: calling the os library and getting back the instructions to create a new directory with the name you assigned.
Fourth line: doing a print statement just to let the user know that the directory was created. Mostly used during testing if you had planned to delete the directory but forgot to you wouldn’t see the message. The files will overwrite what is in there.

Now we’re going to get into the main part of the program and start to get, read and process the files. These lines are reading from the file to be processed.

for filename in os.listdir(directory):
    if filename.endswith(".html"):
        with open(os.path.join(directory, filename)) as f:
            lines = f.readlines()

First line: A ‘for loop’ that will go through all the files in the directory, where the original files are at, one at a time.
Second line: this will check if the file ends with ‘.html’; this is the extension of the HTML files used in this project.
Third line: now that we have the HTML file we need for processing. This line will take the path and the filename and merge them to a variable called ‘f’, this common use name; but you can call it whatever you want.
Fourth line: we’re created a variable called ‘lines’ to read and hold the lines from variable ‘f’.

Now that we have the file from the original file open we need to process it and save it to a new file in the the new directory. This is the biggest section; it’s also indented under the if .html line and lines up with the previous ‘with open’ statement (you can see full code at bottom).

with open(os.path.join(saveTo, filename), 'w', encoding="utf-8") as f:
            for line in lines:

First line: Like above we’re opening the file (creating a new file) in the ‘saveTo’ folder. The ‘w’ is saying we want to be able to write to this file; an ‘r’ means read only; ‘a’ means append. We’re then encoding it as ‘utf-8’ this is the standard for English language websites, if you’re doing it for another language or have other needs check what standard you need. We’re again saving as ‘f’ which is OK since this is inside another block of code.
Second line: a for loop going through the file line by line. Using the common wording practice where we use the plural form for groups, lists, then the singular version for each one inside that, line.

We’re going to now process through the lines using ‘if statements’ these will all be inside the previous ‘for statement’ block. This will change the copy write line, I’ve improved the line instead of just updating it to this year, I added code that will update to the current year, inside the script tags. For the Python program everything inside the quotes ‘ ‘ is passed in as text.

if line.startswith('<div class="column2"><p>&copy;1996 My Awesome Company</p>'):
     line = '<div class="column2"><p>Copyright &copy; <script>document.write(new Date().getFullYear())</script> My Awesome Company LLC</p>'

First line: it’s checking if the line starts with, in this case I used the full copyright line I copy and pasted from the HTML file. Used the single quotes here since the HTML was using double quotes (you can use both but need to escape them out ‘/’.
Second line: I assigned the line to be what the line will need to be. This is overriding the line that’s copying over from the original.

Now the first of the if statement was created everything after this will need ‘elif’, which is short for else if.

elif '<div class="column1"><p>123 E. Awesome St., This Town' in line:
       line = '<div class="column1"><p><a href="http://www.myawesomeco.com">My Awesome Company LLC</a></p>\n<p>123 E. Awesome St., This Town'

First line: checking if that statement inside the ‘ ’ is in the line. This is a lot like the previous if statement, but in the HTML it had a space before it, this way I don’t have to code for the space and in case it might not be there for all of them.
Second line: assigning the line to read with the website first then the address. The ‘\n’ is a special notation to the program to insert a new line here.

elif '<p>(123) 456-7890</p>' in line:
     line = line + '<p><a href = "mailto: awesome@myawesomeco.com">email us</a></p>'

First line: this is looking for the phone number line in the footer.
Second line: I’m assigning to the line the line (phone number info) then concatenating to it the email address. I don’t need to add a ‘\n’ to this since the line already had a new line at the end.

Now I’ll go through and check if the lines have the old name and change to the new one. I left this toward the end because if it was higher than the copyright line the code would have changed the copyright line to the new company name but left the copyright and not hit that line or would need to have more coding to go through the files again.

elif 'My Awesome Company' in line:
      line = line.replace('My Awesome Company', 'My Awesome Company LLC')

First line: just seeing if the old name is in a line, at this point we took care of any name change as we went.
Second line: we’re assigning the line to itself telling it to change (replace) the old name to the new name in any spots it sees.

elif '"kids.html"' in line:
       line = line.replace('"kids.html"','"internet.html"')
       line = line.replace('Kids Corner','Internet')

First line: We’re looking for the links that go to the kids page seeing if ‘kids.html’ is inside the line.
Second line: assigning line to itself with replacing the url to the new one.
Third line: assigning the line to itself with replacing the name of the link to the new one.

Now we’re done with the If Statement

f.write(line)

This line will now write the line that was created above or if the line had no change will write the original line to the file. You could have added this to the end of each ‘If’ but it would have been wasted lines of code.

This next part isn’t necessary but I added it to help speed up testing, I had a CSS file that needed to be copied over each time, to see the file correctly. If code was wrong it could have thrown off a class or ID for the CSS.

This is in line with the statement ‘if filename.endswith(“.html”):’

elif filename.endswith((".css", ".js")):
 with open(os.path.join(directory, filename)) as f:
       lines = f.readlines()
        with open(os.path.join(saveTo, filename), 'w', encoding="utf-8") as f:
            for line in lines:
                 f.write(line) 

This is basically the same information from the top of the HTML open info.

First line: checking for any css or js files, add more extensions here as needed.
Second line: just going through the original directory assigning to ‘f’
Third line: assigns the lines from the file to a variable called lines
Fourth line: now we’re creating the file in the ‘HTML_Converted’ file, opening it up for writing. Assigning it to ‘utf-8’ standard and all this to variable ‘f’.
Fifth line: going through the lines of the file and assign it to line as we go through the for loop.
Sixth line: writing the file line by line.

There are different ways, and probable better ways to write this Python program. Even while writing this I was thinking I could have done this or that. In the end, it did work right which is the main point. Something small like this that will run once or twice doesn’t need to spend a lot of time to make sure that it’s fully efficient, like something that would need to run on a server as quickly and with as few resource use as possible.

Tips from this:

Test often write some code then test to see what happens, if you write a lot of lines then it’s harder to know what is causing the issue. The way I have it broken out above is about how I setup the code and tested.

I also added print statements in parts to see if I’m getting what I expected.

If you didn’t delete the folder after the test, it will over write the files in the new folder.

Sometimes the code even in the same type of IDE for a Mac or PC may need some changes.

link to Git project

The full code:

directory = r'/Clients/My_Awesome_Co/HTML_Converted/originalHTMLs'
saveTo = '/Clients/My_Awesome_Co/HTML_Converted/updatedHTMLs'

###Create Directory if doesn't exist
isDir = os.path.isdir(saveTo)
if isDir == False:
    os.mkdir(saveTo)
    print('Dir. Created')

for filename in os.listdir(directory):
    if filename.endswith(".html"):
        with open(os.path.join(directory, filename)) as f:
            lines = f.readlines()

##PROCESS THE COPY
        with open(os.path.join(saveTo, filename), 'w', encoding="utf-8") as f:
            for line in lines:
                #Replace whole copyright line to include the code to auto update the year
                if line.startswith('<div class="column2"><p>&copy;1996 My Awesome Company</p>'):
                    line = '<div class="column2"><p>Copyright &copy; <script>document.write(new Date().getFullYear())</script> My Awesome Company LLC</p>'
                elif '<div class="column1"><p>123 E. Awesome St., This Town' in line:
                    line = '<div class="column1"><p><a href="http://www.myawesomeco.com">My Awesome Company LLC</a></p>\n<p>123 E. Awesome St., This Town' 
                elif '<p>(123) 456-7890</p>' in line:
                    line = line + '<p><a href = "mailto: awesome@myawesomeco.com">email us</a></p>'
                #Replaces the old name to the new name; keeping to last in case it throws off other lines like the copyright
                elif 'My Awesome Company' in line:
                    line = line.replace('My Awesome Company', 'My Awesome Company LLC')
                #Change the kids page links to internet page
                elif '"kids.html"' in line:
                    line = line.replace('"kids.html"','"internet.html"')
                    line = line.replace('Kids Corner','Internet')
                f.write(line)
#add extensions to anything you want to do a straight copy over; if you try for all, it will try to copy hidden files and fail
    elif filename.endswith((".css", ".js")):
        with open(os.path.join(directory, filename)) as f:
            lines = f.readlines()
        with open(os.path.join(saveTo, filename), 'w', encoding="utf-8") as f:
            for line in lines:
                f.write(line)  

Related Articles

Website Contact form using AWS

Website Contact form using AWS

For the next iteration of the HTML form was to set it up to send an email to the web owner using AWS. The purpose behind this is, if you have a HTML site and don’t want to purchase a monthly plan or build and maintain server software. Using AWS for this functionality...

read more
HTML with JavaScript contact form

HTML with JavaScript contact form

Link to the form: https://www.lynnamacher.com/pdf/contactForm_V1.html This is created to be basic test for an AWS contact form test. The full form when filled out will send an email through AWS to the site owner. Instead of creating the normal test form I try to...

read more
Sometimes the low-tech approach is the best way

Sometimes the low-tech approach is the best way

Have a family member looking to get me some family videos they had digitized, in one big file, about 80GB (they may not have optimized or know how). The need: The family member’s internet connection isn’t the best and could get disconnected at times, when this happens...

read more

Pin It on Pinterest

Share This