Another Python practice project. The original idea for this project came when I heard somebody talk about how their email program got updated and all their contacts disappeared. My first thought was that they would need to go through and add their contacts back. Then I wondered how I would do this in a program. This project goes through several backed-up emails and pulls the email addresses out.
First I saved a couple of emails from my email client to get an idea of what the format would look like. Then I found an example at (https://tools.ietf.org/id/draft-ietf-dmarc-arc-protocol-13.html) to compare against and make sure my samples were about the same. I added names to the email addresses in the couple I copied.
Misconception: I was going to use a regular expression to pull every string that had alphanumeric characters, then an @, followed by more alphanumeric characters, then a period (.) followed by more alphabetic characters. When looking through the sample I noticed this process would get me a lot of false emails from items like ‘smtp.mfrom=jqd@d1.example;’ that would throw it off. I also wanted to pull the person’s first and last name into the CSV file, and this could get messy.
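To see the problem, here is a minimal sketch of that regular expression idea. The pattern and the two sample lines are my own stand-ins (modeled on the ‘smtp.mfrom’ item above), not something taken from a real mailbox.

```python
import re

# A deliberately naive pattern: word characters, an '@', then a dotted domain.
NAIVE_EMAIL = re.compile(r"[\w.]+@[\w.]+\.[A-Za-z]+")

# Made-up sample lines for illustration only.
sample_lines = [
    "Authentication-Results: lists.example.org; spf=pass smtp.mfrom=jqd@d1.example;",
    "From: John Q Doe <jqd@d1.example>",
]

# Collect what the pattern finds on each line.
found = [NAIVE_EMAIL.findall(line) for line in sample_lines]
for line, matches in zip(sample_lines, found):
    print(matches, "<-", line)
```

Both lines give back the same address, so the pattern cannot tell a real ‘From’ line from a mail server's bookkeeping line, and the name in front of the address is lost either way.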
What I did instead: I went with the common start of the lines where email addresses are listed, these being ‘From’, ‘To’, ‘Cc’ and ‘Bcc’. From there I grab what comes after them, which should include the first name, last name and email address, and then break that down into each part. This worked for the first line, but if there were too many people for one line the list would continue on a new line. In my saved samples the next line always started with a tab, which meant I could look at the next line, ask if it starts with a tab, and if so pull that line too.
This program is bigger than the previous ones, not just in lines of code, but we’re also having fun with functions. Functions are like mini programs inside a program. You would use them if you find yourself copying and pasting code. In programming there is a term called DRY (Don’t Repeat Yourself). You take the pieces that would be used over and over and put them in one spot that you can call as needed. Also, if you need to change the code you are only doing that in one spot and not several.
To start off the program I do need to import 2 libraries:
import csv
import os
First line: this is needed for saving and reading the CSV file.
Second line: this is the one we used in the change lines program; it lets us read from directories on our computer.
Now we’re going to point to the folder with all the emails to go through and name the CSV file we’ll be saving to:
directory = r'/Path/to/file/with/emails/folder/'
csvFile=r'/Path/to/where/we/save/CSV/emails.csv'
First line: the path to where the emails are located. Make sure to put the last ‘/’ there; it’s needed later.
Second line: where we want to save the file, along with the name of the file, ‘emails.csv’.
On a PC you can open the folder, then copy the path from the address bar of that Explorer window.
On a Mac you can get the path by selecting the folder and pressing Command + I, then copying the area called ‘Where:’.
The ‘r’ before the string stands for raw, which means the string is read in without any changes happening (backslashes are kept as typed).
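A small sketch of what the raw string buys you, plus one optional way around needing the trailing ‘/’. The Windows-style folder and the file name ‘message.eml’ are made up for the example, and os.path.join is my suggestion rather than what the program below uses.

```python
import os

# Hypothetical Windows-style folder, used only for illustration.
plain = 'C:\\Users\\me\\emails'   # backslashes escaped by hand
raw = r'C:\Users\me\emails'       # raw string: backslashes kept as typed
print(plain == raw)               # both spell the same path

# os.path.join adds the separator for you, so the trailing '/' becomes optional.
directory = '/Path/to/file/with/emails/folder'
full_path = os.path.join(directory, 'message.eml')
print(full_path)
```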
I’ll be going through the main program first and the functions after that; the first function call is toward the bottom of the program. When you look at the full code below you’ll see I put my function definitions at the very top. Python reads the file from the top down, so anything that will be called needs to show up before the code that calls it.
First we’re going to open and/or create the CSV file; if there is a file this will overwrite it:
with open(csvFile, 'w', newline='') as email_file:
    firstLine = ('First Name', 'Last Name', 'Email Address')
    writer = csv.writer( email_file )
    writer.writerow( firstLine )
First line: we’re going to open the file; if it isn’t already created, this creates it in the location we put in ‘csvFile’. We’re opening it with ‘w’ for writing, then naming the variable that holds the file ‘email_file’. The newline='' part is what the csv module recommends so no extra blank rows show up on Windows.
Second line: we’re adding the headers for our CSV columns. Think of a spreadsheet where the top row says what each column is; in a CSV file the columns are separated by commas. This is the order we’ll be putting the information into the CSV file (first, last, then email).
Third line: creating a variable called ‘writer’ and assigning it to write to the CSV file ‘email_file’.
Fourth line: we’re telling the ‘writer’ to write a row to the CSV file that contains our first line ‘firstLine’ (the header).
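If you want to see what the writer produces without touching a file, here is a quick sketch that writes into an in-memory buffer instead. Using io.StringIO and the second sample row are my additions for the demonstration.

```python
import csv
import io

# An in-memory stand-in for the CSV file, so nothing is written to disk.
buffer = io.StringIO()
writer = csv.writer(buffer)

# Each writerow call takes one sequence and writes it as one comma-separated row.
writer.writerow(('First Name', 'Last Name', 'Email Address'))
writer.writerow(('Arc', 'Weld', 'arc@example.org'))

rows = buffer.getvalue().splitlines()
for row in rows:
    print(row)
```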
for filename in os.listdir(directory):
    if filename.endswith(".eml"):
        f = open( directory+filename, "r" )
        read = f.readlines()
First line: a for loop over the folder (directory) where the emails are stored, naming the variable for each file ‘filename’.
Second line: checking that the file name ends with ‘.eml’ to know it’s an email. Depending on your email app you may need to change this extension, or if you’re pulling from different apps you might need to pass more extensions, like (".eml", ".txt").
Third line: creating a variable called ‘f’ and opening the file. We pass it the path and the file name combined, which is why we put the ‘/’ at the end of ‘directory’.
Fourth line: creating the variable ‘read’ and passing it the lines from ‘f’, the file we just opened.
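Checking for more than one extension can be sketched like this. The file names are invented, and lowering the name first (so ‘.EML’ also matches) is my own assumption about what you might want.

```python
# Made-up file names for illustration.
filenames = ['hello.eml', 'notes.txt', 'photo.jpg', 'reply.EML']

# endswith accepts a tuple and returns True if any one of the endings matches.
extensions = ('.eml', '.txt')
matches = [name for name in filenames if name.lower().endswith(extensions)]
print(matches)
```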
        count = 0
        for line in read:
            count += 1
            if line.startswith( ('From', 'To', 'Cc', 'Bcc') ):
                removeStart = line.split(":")[1].strip()
                get_addresses(removeStart)
First line: creating a ‘count’ variable to help keep track of our indented lines for pulling email addresses.
Second line: a for loop to go through the file line by line, calling each line ‘line’. It sounds a bit confusing to read here, but in the code, naming a variable for what it holds makes it easier to keep track of.
Third line: adding one to ‘count’. Because this happens before the check, ‘count’ is always the index of the next line in ‘read’.
Fourth line: checking if ‘line’ starts with ‘From’, ‘To’, ‘Cc’ or ‘Bcc’.
Fifth line: new variable ‘removeStart’. We run split on the line at the ‘:’; the part before it (such as ‘To’) lands in [0] and the part after it (names and emails) lands in [1], which is the one we keep. Then we run strip on that to remove any extra spaces at the beginning and end.
Sixth line: passing the ‘removeStart’ variable into the function ‘get_addresses’ to process the name and email into the CSV file. We’ll go through the function later.
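Here is the split step on its own with a sample ‘To’ line. Passing 1 as the second argument to split is a small variation of mine: it splits only at the first ‘:’, so a colon later in the line stays with the addresses.

```python
# A sample header line, as it would come out of readlines().
line = 'To: John Doe <jqd@d1.example>, Arc Weld <arc@example.org>\n'

# Split once at the first ':' into the header name and everything after it.
header, addresses = line.split(':', 1)
addresses = addresses.strip()

print(header)
print(addresses)
```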
                subCount = count
                while subCount < len(read) and read[subCount].startswith( '\t' ):
                    removeStart2 = read[subCount].strip('\t')
                    get_addresses( removeStart2 )
                    subCount += 1
    else:
        continue
    f.close()
First line: new variable called ‘subCount’, which is passed the ‘count’ number so we know which line to check next.
Second line: a while loop to keep checking if there are more addresses below the ‘From’, ‘To’, ‘Cc’ or ‘Bcc’ line, by seeing if the next line starts with a tab (\t). As long as that is True it keeps running what is below. The ‘subCount < len(read)’ part stops the loop at the end of the file so we never ask for a line that isn’t there. ‘To’, ‘Cc’ and ‘Bcc’ could each run over several lines; we could handle them separately, but it’s easier and less code to treat them all the same.
Third line: new variable ‘removeStart2’, which strips the tab from the beginning of the line. We could reuse the name ‘removeStart’ from above, but I wanted to show that you can pass differently named variables to the function.
Fourth line: passing ‘removeStart2’ to the same function as before, so we’re not repeating the same code.
Fifth line: adding 1 to ‘subCount’ to keep going through the lines.
Sixth and seventh line: back at the first ‘if statement’, this is its else statement. If the file doesn’t end in ‘.eml’, a plain continue statement goes back to the ‘for statement’.
Eighth line: closes the file ‘f’.
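The tab check can also be tried on its own with a small list of lines. The helper name ‘collect_header_lines’ and the sample lines are hypothetical; this is a sketch of the same idea, not a piece of the program.

```python
def collect_header_lines(lines, start):
    """Return the text after the ':' on lines[start], plus any tab-indented lines that follow."""
    collected = [lines[start].split(':', 1)[1].strip()]
    index = start + 1
    # Stop at the end of the list or at the first line that is not indented with a tab.
    while index < len(lines) and lines[index].startswith('\t'):
        collected.append(lines[index].strip())
        index += 1
    return collected


# Made-up sample: a 'To' header that continues onto a second, tab-indented line.
sample = [
    'To: John Doe <jqd@d1.example>,\n',
    '\tArc Weld <arc@example.org>\n',
    'Subject: Example\n',
]
print(collect_header_lines(sample, 0))
```

The ‘index < len(lines)’ check matters when the indented line is the very last line of the file; without it the loop would ask for a line past the end.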
We’ll now go through the functions used. So far we’ve only seen one called; two more will show up inside this function. Yes, functions can call other functions.
This function breaks the name and email down into their parts and passes them into the CSV file.
def get_addresses(indItems):
    toFile = indItems.split(',')
    for file in toFile:
        file = file.strip()
        splitFile = file.split(' ')
First line: defining the function using ‘def’, then the name of the function, ‘get_addresses’. The name inside the () is what gets passed into the function. We’re changing the name the function will call it by; you can also keep it the same as what you’re passing in.
The line coming in might look like: John Doe <jqd@d1.example>, Arc Weld <arc@example.org>,
Second line: new variable ‘toFile’, a list made by breaking the line wherever there is a comma, so if there is more than one email it splits them apart. The same line at this point:
John Doe <jqd@d1.example>
Arc Weld <arc@example.org>
Third line: a for loop going through each ‘file’ in ‘toFile’; whether there is one email or more in that line, it goes through and processes each.
Fourth line: running strip on the entry to remove any white space before and after.
Fifth line: now we split it by the spaces. We did the strip first so we don’t get any empty groups that throw off the elements. The first email will look like:
John
Doe
<jqd@d1.example>
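The two splits can be run on the sample line by themselves to watch what comes out. The list name ‘pieces’ is just for this sketch.

```python
incoming = 'John Doe <jqd@d1.example>, Arc Weld <arc@example.org>,'

pieces = []
for entry in incoming.split(','):
    entry = entry.strip()             # drop the space left over after each comma
    pieces.append(entry.split(' '))   # break the entry into name parts and address

for piece in pieces:
    print(piece)
```

Notice the last piece: the trailing comma leaves behind an empty entry, so the code that follows has to cope with a list that holds only an empty string.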
        with open(csvFile, 'a', newline='') as email_file:
            writer = csv.writer( email_file )
First line: we’re opening the CSV file with ‘a’ for append and naming it ‘email_file’. The newline='' part keeps extra blank rows out of the file on Windows.
Second line: creating the ‘writer’ variable that writes the information to the CSV file ‘email_file’.
Next comes the start of the ‘if statements’ that check how many elements are getting passed in.
Common throughout all: we are gathering the first name ‘first’, last name ‘last’ and email address ‘email’. The email is passed to another function to remove extras like the ‘<’ and ‘>’ and comes back as just the email address. Then all three are combined into the variable ‘full’, to be written to the CSV file.
            if len( splitFile ) == 3:
                first = splitFile[0].strip()
                last = splitFile[1]
                email = replaceChar( splitFile[2] )
                full = first, last, email
            elif len( splitFile ) == 2:
                first = splitFile[0].strip()
                last = ' '
                email = replaceChar( splitFile[1] )
                full = first, last, email
            elif len( splitFile ) == 4:
                first = splitFile[0].strip() + ' ' + splitFile[1].strip()
                last = splitFile[2]
                email = replaceChar( splitFile[3] )
                full = first, last, email
            else:
                continue
First-fifth line: this group of if statements starts with the most common version: 3 elements being passed in (first, last and email address). The list being 0 based, we just work our way up through the numbers in the [] for each element. We compare with ‘==’ because we’re checking whether the numbers are equal.
Sixth-tenth line: this looks for only one name and an email being passed. In this case we’re assuming the single name is the first name, and we’re assigning a blank space for the last name; we have to keep the columns even or the email may show up in the last name spot.
Eleventh-fifteenth line: checking if there is a first name with two words, like ‘Billy Bob’. This sees the extra element and puts elements 0 and 1 together for the first name.
Last two lines: anything else, such as the empty piece left behind by a trailing comma, is skipped with continue so we never try to write a row without an email.
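The same branching can be sketched as a small function that is easy to try with different inputs. The name ‘split_name’ is hypothetical, and returning None for anything that doesn’t fit is my own choice for the sketch.

```python
def split_name(parts):
    """Return (first, last, email) for 2, 3 or 4 pieces, or None for anything else."""
    if len(parts) == 3:
        # first, last, email
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        # single name: treat it as the first name and keep the last name column blank
        return parts[0], ' ', parts[1]
    if len(parts) == 4:
        # two-word first name such as 'Billy Bob'
        return parts[0] + ' ' + parts[1], parts[2], parts[3]
    return None


print(split_name(['John', 'Doe', 'jqd@d1.example']))
print(split_name(['test', 'test@test.com']))
print(split_name(['John', 'Q', 'Doe', 'jqd@d1.example']))
print(split_name(['']))
```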
Now we’re going to check for doubles before adding to the CSV file.
            if doubles( email ) is not True:
                writer.writerow( full )
First line: an if statement sending ‘email’ to the function ‘doubles’ to check if it’s already in the file. If that comes back as not True, then:
Second line: it writes the row to the file. If it was already in the CSV file it ignores it and goes on.
We’re not returning anything here since this function sends its final output to the file.
This next function is called in the ‘get_addresses’ function with the email, and it removes the extra characters that haven’t been removed yet. The email should be coming in looking like ‘<test@test.com>’.
def replaceChar(item):
    remove_characters = ['<', '>', ',']
    for char in remove_characters:
        item = item.replace( char, '' )
    return item.strip()
First line: we’re defining the function, calling it ‘replaceChar’, and calling whatever is passed in ‘item’.
Second line: creating a variable called ‘remove_characters’ and assigning it the characters we will be removing; you may have to add more depending on your files.
Third-fourth line: going through each character in the ‘remove_characters’ list and checking it against ‘item’ (the email that was passed in). If any of those characters are there it replaces them with nothing (‘’), which is the same as deleting them.
Fifth line: returns the item after running strip to make sure there is no extra white space at the beginning or end.
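Here is the same clean-up run on a sample value, next to a shorter variation. The name ‘replace_char’ is only for this sketch, and the one-line strip version is my alternative: it only trims those characters from the two ends, which is enough when the extras sit around the address.

```python
def replace_char(item):
    # Same steps as described above: delete each unwanted character, then trim spaces.
    for char in ['<', '>', ',']:
        item = item.replace(char, '')
    return item.strip()


sample = '<test@test.com>,'
cleaned = replace_char(sample)
trimmed = sample.strip('<>, ')   # trims any of these characters from both ends only

print(cleaned)
print(trimmed)
```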
The last function: this one runs before the email is written to the CSV file, to make sure it’s not already there.
def doubles(check):
    with open(csvFile) as f:
        datafile = f.readlines()
    for line in datafile:
        if check in line:
            return True
First line: function called ‘doubles’, naming the incoming element ‘check’.
Second line: opening the CSV file and naming it ‘f’.
Third line: reading the file into memory with ‘readlines()’ and calling the result ‘datafile’.
Fourth line: a for loop going line by line through ‘datafile’.
Fifth line: checking if ‘check’ is located in the current line of the CSV file (‘line’).
Sixth line: if it is, the function returns ‘True’ back to the function call, as in ‘True’, it is there. If the loop finishes without a match, the function returns nothing (None), which the ‘is not True’ test treats as not found. Since ‘True’ and ‘False’ are Python’s Boolean values we don’t need quotes around them in the code.
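One thing to keep in mind is that ‘check in line’ looks for the text anywhere in the line, so a short address can match inside a longer one. A possible alternative, sketched here with hypothetical names (‘seen’, ‘is_new’), is to remember the exact addresses in a set instead of re-reading the file. Treating upper and lower case as the same address is my own assumption.

```python
# A substring test can report a match that is not really a duplicate.
row = 'Jim,Bob,jimbob@example.org'
false_match = 'bob@example.org' in row
print(false_match)

# Alternative sketch: keep the exact addresses already written in a set.
seen = set()

def is_new(email):
    key = email.lower()      # assumption: case differences don't make a new contact
    if key in seen:
        return False
    seen.add(key)
    return True


results = [is_new('arc@example.org'), is_new('ARC@example.org'), is_new('bob@example.org')]
print(results)
```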
Now when you run the program you should have a new .csv file at the path you put in for the CSV file. Inside, the file should look something like this:
First Name,Last Name,Email Address
Arc,Weld,arc@example.org
John Q,Doe,jqd@d1.example
test, ,test@test.com
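For comparison only, Python’s standard library can also do the header reading for you. This is a different technique from the hand-rolled splitting in this project: a minimal sketch using ‘email.message_from_string’ and ‘email.utils.getaddresses’ on a made-up message. It gives back the full display name and the address as a pair, so splitting the name into first and last would still be up to you.

```python
from email import message_from_string
from email.utils import getaddresses

# Made-up message for illustration; the 'To' header continues on a tab-indented line.
raw_message = (
    'From: John Q Doe <jqd@d1.example>\n'
    'To: Arc Weld <arc@example.org>,\n'
    '\ttest <test@test.com>\n'
    'Subject: Example\n'
    '\n'
    'Body text\n'
)

message = message_from_string(raw_message)

# Gather the raw values of every address header that is present.
headers = []
for name in ('From', 'To', 'Cc', 'Bcc'):
    headers.extend(message.get_all(name, []))

# getaddresses turns the header values into (display name, address) pairs.
pairs = getaddresses(headers)
for display_name, address in pairs:
    print(display_name, '|', address)
```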
Link to Git project
The full code:
import csv
import os

directory = r'/Path/to/file/with/emails/folder/'
csvFile=r'/Path/to/where/we/save/CSV/emails.csv'

def get_addresses(indItems):
    toFile = indItems.split(',')
    for file in toFile:
        file = file.strip()
        splitFile = file.split(' ')
        with open(csvFile, 'a', newline='') as email_file:
            writer = csv.writer( email_file )
            if len( splitFile ) == 3:
                first = splitFile[0].strip()
                last = splitFile[1]
                email = replaceChar( splitFile[2] )
                full = first, last, email
            elif len( splitFile ) == 2:
                first = splitFile[0].strip()
                last = ' '
                email = replaceChar( splitFile[1] )
                full = first, last, email
            elif len( splitFile ) == 4:
                first = splitFile[0].strip() + ' ' + splitFile[1].strip()
                last = splitFile[2]
                email = replaceChar( splitFile[3] )
                full = first, last, email
            else:
                #skips empty or unexpected pieces, like what a trailing comma leaves
                continue
            if doubles( email ) is not True:
                writer.writerow( full )

def replaceChar(item):
    #goes through to strip the characters not needed.
    remove_characters = ['<', '>', ',']
    for char in remove_characters:
        item = item.replace( char, '' )
    return item.strip()

def doubles(check):
    with open(csvFile) as f:
        datafile = f.readlines()
    for line in datafile:
        if check in line:
            return True

with open(csvFile, 'w', newline='') as email_file:
    #Creates CSV file and writes the first name elements
    firstLine = ('First Name', 'Last Name', 'Email Address')
    writer = csv.writer( email_file )
    writer.writerow( firstLine )

#Goes through files in the folder
for filename in os.listdir(directory):
    if filename.endswith(".eml"):
        f = open( directory+filename, "r" )
        read = f.readlines()
        count = 0
        for line in read:
            count += 1
            if line.startswith( ('From', 'To', 'Cc', 'Bcc') ):
                #Clears the extras From, to, etc
                removeStart = line.split(":")[1].strip()
                #breaks it down to individual pieces (names, email)
                get_addresses(removeStart)
                subCount = count
                #stops at the end of the file or the first line without a tab
                while subCount < len(read) and read[subCount].startswith( '\t' ):
                    removeStart2 = read[subCount].strip('\t')
                    get_addresses( removeStart2 )
                    subCount += 1
    else:
        #to ignore any none .eml file
        continue
    f.close()