Importing Old Blogs
While writing the previous post about How This Blog Works, I wound up doing the work to import all of the posts from old forms of the blog into the current form. There were three collections of posts to bring forward: posts from a Wordpress site, posts from some Blogger/Blogspot sites, and finally posts from way back when I hand-wrote the HTML and CSS.
These are some notes about how I imported those posts into the current Hugo-driven site. When all was said and done, I had imported around 300 posts from the past into the current blog format.
Importing Wordpress
It’s been a while and I do not recall for certain, but I am pretty sure I used the Wordpress to Hugo Exporter plugin, since I still had the Wordpress blog running at the time I was considering changing to Hugo.
What I do recall is that it was pretty simple. Install the plugin on the Wordpress site, give it some information, and collect the results.
Importing Blogger and Blogspot XML Backups
Blogger backups come in well-structured XML that follows the Blogger/Atom schema with a mix of items from a Google API schema.
I wrote a Python script to parse the XML, find the relevant parts, and put them into a Hugo-format document.
I used the standard xml.etree Python library to parse the XML. There were only a couple of things that required digging to resolve. One was how to process the XML elements without having to prefix every element name with the full namespace URI. The other was the form of the schemas used.
See these for information on the XML formats:
A code snippet from that script:
import xml.etree.ElementTree as ET

BLOGGER_NAMESPACES = {
    'atom': 'http://www.w3.org/2005/Atom',      # Atom schema namespace
    'gd':   'http://schemas.google.com/g/2005'  # Google schemas namespace
}

def import_blogger_file(filename, defaults):
    count = 0
    try:
        tree = ET.parse(filename)
    except ET.ParseError:
        # ignore any files that do not parse as XML (e.g. the defaults.json)
        return count
    root = tree.getroot()
    for entry in root.findall('atom:entry', BLOGGER_NAMESPACES):
        if is_post(entry):
            post = extract_post(entry, defaults)
            count += write_post_file(post, defaults)
    return count
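The is_post, extract_post, and write_post_file helpers are defined elsewhere in the script. For illustration, a check like is_post can key off the "kind" category that each atom:entry in a Blogger export carries, since the export mixes posts, comments, settings, and template entries in a single feed. The sketch below shows one way to do that; the POST_KIND constant and the implementation are illustrative assumptions, not the script's actual code:

POST_KIND = 'http://schemas.google.com/blogger/2008/kind#post'

def is_post(entry):
    # illustrative sketch: each entry's "kind" category term names the
    # entry type, e.g. ".../kind#post" for an actual blog post
    for category in entry.findall('atom:category', BLOGGER_NAMESPACES):
        if category.get('term') == POST_KIND:
            return True
    return False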
Importing HTML For Old Blog
Lucky for me, I was a creature of habitual structure and used consistent patterns when hand-coding the HTML for the old blog.
Much as for the XML above, I wrote a Python script to parse the HTML and create the necessary Hugo documents.
I used the Beautiful Soup HTML parsing library to parse the documents. I found this library by way of an article that came up in a Google search. It worked really well.
A code snippet from that script:
from datetime import datetime

from bs4 import BeautifulSoup

def parse_page(page, args):
    page_text = page.read_text(errors='ignore')
    soup = BeautifulSoup(page_text, 'html.parser')
    posts = []
    # each post on the old hand-written pages sat in a <div class="blogbox">
    for blogbox in soup.find_all("div", class_="blogbox"):
        # fall back to a sentinel date when a post has no recognizable date
        post_date = find_post_date(blogbox) or datetime(2001, 1, 1)
        post_title = find_post_title(blogbox)
        post_content = str(blogbox)
        posts.append({
            "date": post_date,
            "title": post_title,
            "content": post_content
        })
    return posts
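The snippet above only parses the pages; the script still has to write each post dictionary out as a Hugo document. That step is not shown, but the general shape is a small front matter block followed by the captured HTML. The sketch below is illustrative rather than the script's actual output code: the write_hugo_post name, file naming, and front matter fields are assumptions, and keeping the raw HTML in the Markdown body relies on Hugo's markup.goldmark.renderer.unsafe setting being enabled.

from pathlib import Path

def write_hugo_post(post, out_dir):
    # illustrative sketch, not the original script's output step
    title = post["title"] or "Untitled"
    slug = title.lower().replace(" ", "-")
    front_matter = "\n".join([
        "---",
        f'title: "{title}"',
        f'date: {post["date"].isoformat()}',
        "---",
        "",
    ])
    out_path = Path(out_dir) / f'{post["date"]:%Y-%m-%d}-{slug}.md'
    # the content is the original <div class="blogbox"> HTML; Hugo passes raw
    # HTML through when markup.goldmark.renderer.unsafe is enabled
    out_path.write_text(front_matter + post["content"], encoding="utf-8")
    return out_path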