Wednesday, November 23, 2016

Transforming a Wikidump archive into usable content

The past few months had been pretty peaceful, if brain-wracking, as I worked on the portion of my project that provides the foundation content for our platform. It is essentially the meat on our bones. The Wikimedia Foundation and the Wikipedia community basically allow anyone to download all of their archived content. When I finally finished downloading one of their fortnightly .bz2 files, it was only the beginning. From there, I learnt that I had to somehow extract the data from the massive XML file. Finding mwdumper.jar helped, but I ran into a bug that made me re-run the entire archive more than 20 times, with each run possibly taking hours. Fixing the bug with a quick hack in one of the class files allowed me to proceed.

With the MySQL database populated, the next step was to figure out how to map each article into our database. Using a combination of bliki and jsoup, I was able to parse the objects into usable content. The multitude of date formats, "magic words", and infobox templates did throw me off a little, but the data transformation program was finally working. Now I just needed to transfer the wikidump over, from database to database.

The main project was set up to receive the data at a unique URL. A secret key was required for the server to accept and process the submitted data. Testing during development was done locally, but on the live platform the transfer went over HTTPS.

This is where it got slightly tricky. Although I eventually felt that this was almost a non-issue, documenting my experience might save somebody a bit of time. My exporter/importer (I called it "transporter", duh) was a desktop client using RestTemplate, (for now) running from my workstation IDE. A trial run failed with the error "PKIX path building failed", and a search turned up several discussions of it. Everybody was basically saying that adding the certificate to the JRE cert store was the best course of action.
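
For anyone wanting to confirm that a failure really is the trust problem described here, a minimal sketch follows. It walks an exception's cause chain looking for a certificate-path failure, using only JDK classes. The class and method names (PkixDiagnosis, isUntrustedChain) are made up for illustration and are not part of the transporter; the URL to probe is whatever you pass on the command line.

```java
import java.net.URI;
import java.security.cert.CertPathBuilderException;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLException;

public class PkixDiagnosis {

    /**
     * Returns true if the exception (or any of its causes) indicates that
     * the JVM could not build a path from the server cert to a trusted CA.
     */
    public static boolean isUntrustedChain(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof CertPathBuilderException) {
                return true;
            }
            String message = t.getMessage();
            if (message != null && message.contains("PKIX path building failed")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is an HTTPS URL of your choosing (hypothetical input).
        HttpsURLConnection conn =
                (HttpsURLConnection) URI.create(args[0]).toURL().openConnection();
        try {
            conn.connect();
            System.out.println("Handshake OK, cipher suite: " + conn.getCipherSuite());
        } catch (SSLException e) {
            System.out.println(isUntrustedChain(e)
                    ? "The server's CA is not trusted by this JRE"
                    : "Some other TLS failure: " + e);
        } finally {
            conn.disconnect();
        }
    }
}
```

If it reports an untrusted CA, the fix is on the trust store side rather than in the client code.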

Since my project uses the service provided by the generous Let's Encrypt Certificate Authority, I had a cert signed by an actual CA instead of a self-signed one. But because Java (at least the version I was running) does not recognise Let's Encrypt as a trusted CA by default, we have to add it ourselves. While the discussions I found suggested using the CLI to add the cert, I personally find it a lot more convenient to use KeyStore Explorer.

First, point KeyStore Explorer's CA certificates keystore at the correct JRE location. It defaults to your non-JDK JRE, so navigate to the JDK_HOME/jre/lib/security path to get to the correct cacerts file.
File menu > Tools > Preferences > Authority Certificates > CA Certificates KeyStore > Browse
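
If you are unsure which file the JVM running your client actually reads, here is a small sketch that mirrors the JSSE default lookup order as I understand it: the javax.net.ssl.trustStore system property first, then jssecacerts, then cacerts under java.home/lib/security. The class name TrustStoreLocator is hypothetical. On a Java 8 JDK, java.home typically points at JDK_HOME/jre, which matches the path above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TrustStoreLocator {

    /**
     * Resolves the trust store file for a given property value and java.home.
     * Both are passed in explicitly so the logic is easy to check in isolation.
     */
    public static Path resolve(String trustStoreProperty, String javaHome) {
        // An explicit -Djavax.net.ssl.trustStore=... wins over everything else.
        if (trustStoreProperty != null && !trustStoreProperty.isEmpty()) {
            return Paths.get(trustStoreProperty);
        }
        Path securityDir = Paths.get(javaHome, "lib", "security");
        // jssecacerts, when present, takes precedence over cacerts.
        Path jssecacerts = securityDir.resolve("jssecacerts");
        if (Files.exists(jssecacerts)) {
            return jssecacerts;
        }
        return securityDir.resolve("cacerts");
    }

    public static void main(String[] args) {
        Path store = resolve(System.getProperty("javax.net.ssl.trustStore"),
                             System.getProperty("java.home"));
        System.out.println("Trust store for this JVM: " + store);
    }
}
```

Run it with the same JDK your IDE uses for the project, otherwise it will happily report a different installation.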

Once you have the cacerts file in your KeyStore Explorer view, import the Let’s Encrypt Authority X3 cert you've downloaded into the CA keystore.
File menu > Tools > Import Trusted Certificate

You may be prompted to view and verify the information in the cert first. After that, click OK with the suggested alias to complete the process.
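
For those who would rather script the import than click through a GUI, the same steps can be sketched with the JDK's own KeyStore API. This is only an illustration under a few assumptions: the class name TrustedCertImporter and the three command-line arguments (keystore path, cert path, alias) are made up, and the password is assumed to still be the stock "changeit". Back up your cacerts file before letting anything write to it.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;

public class TrustedCertImporter {

    /** Loads an existing keystore file (for example, a copy of cacerts). */
    public static KeyStore load(Path keystoreFile, char[] password) throws Exception {
        KeyStore keyStore = KeyStore.getInstance(KeyStore.getDefaultType());
        try (InputStream in = Files.newInputStream(keystoreFile)) {
            keyStore.load(in, password);
        }
        return keyStore;
    }

    /** Parses an X.509 cert (PEM or DER) and adds it as a trusted entry. */
    public static void importCertificate(KeyStore keyStore, String alias,
                                         InputStream certStream) throws Exception {
        if (keyStore.containsAlias(alias)) {
            throw new IllegalArgumentException("Alias already in use: " + alias);
        }
        Certificate cert = CertificateFactory.getInstance("X.509")
                .generateCertificate(certStream);
        keyStore.setCertificateEntry(alias, cert);
    }

    public static void main(String[] args) throws Exception {
        Path keystoreFile = Paths.get(args[0]); // e.g. your cacerts file
        Path certFile = Paths.get(args[1]);     // the downloaded CA cert
        String alias = args[2];
        // Assumption: the keystore still uses the default cacerts password.
        char[] password = "changeit".toCharArray();

        KeyStore keyStore = load(keystoreFile, password);
        try (InputStream in = Files.newInputStream(certFile)) {
            importCertificate(keyStore, alias, in);
        }
        // Overwrites the keystore in place, hence the backup advice above.
        try (OutputStream out = Files.newOutputStream(keystoreFile)) {
            keyStore.store(out, password);
        }
        System.out.println("Imported " + alias + " into " + keystoreFile);
    }
}
```

The duplicate-alias check is there so an existing entry is never silently replaced.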

Oh, and don't forget to amend your RestTemplate constructor:
RestTemplate restTemplate = new RestTemplate(new HttpComponentsClientHttpRequestFactory());

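Otherwise, it defaults to SimpleClientHttpRequestFactory, which is built on the JDK's HttpURLConnection. That factory can make HTTPS requests too, but the HttpComponents one, backed by Apache HttpClient (which must be on the classpath), offers far more control over SSL settings.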

Have fun!