Wednesday, November 23, 2016

Transforming a Wikidump archive into usable content

The past few months have been pretty peaceful (and brain-wrecking) as I worked on the portion of my project that will provide the foundation content for our platform. It is essentially the meat on our bones. The Wikimedia Foundation/community basically allows anyone to download all of their archived content. When I did manage to finish downloading one of their fortnightly .bz2 files, that was only the beginning. From there, I learnt that I had to somehow extract the data out of the massive XML file. Finding mwdumper.jar helped, but I encountered a bug that made me re-run the entire archive more than 20 times, with each run possibly taking hours. Fixing the bug with a quick hack in one of the class files allowed me to proceed.

Now that the MySQL database had been populated, next was to figure out how to map each article into our database. Using a combination of bliki and JSoup, I was able to parse the objects into usable content. The multitudes of date formats, "magic words", and infobox templates did throw me off a little, but the data transformation program was finally working. Now I just needed to transfer the wikidump over, from database to database.

The main project was set up to receive the data at a unique URL. A secret key was required in order for the server to accept and process the submitted data. While testing during development was done locally, on the live platform this was being done over HTTPS.

This is where it got slightly tricky. Although I eventually felt that this was almost a non-issue, it might save somebody a bit of time if I document my experience. My exporter/importer (I called it "transporter", duh) was a desktop client using RestTemplate, (for now) running off my workstation IDE. A trial run failed with the error "PKIX path building failed", and a search turned up this, this and this. Everybody was basically saying that adding the certificate into the JRE cert store was the best course of action.

Since my project was tapping on the service provided by the generous Let's Encrypt Certificate Authority, instead of resorting to a self-signed certificate, I had a cert signed by an actual CA. But because Let's Encrypt is not recognised by Java as a trusted CA by default, we have to add it in ourselves. While the above links suggested using the CLI to add the cert, I personally find it a lot more convenient to use KeyStore Explorer.

First, configure its cacert store to the correct JRE location. It defaults to your non-JDK JRE, so you should navigate into the JDK_HOME/jre/lib/security path to get to your correct cacerts file.
File menu > Tools > Preferences > Authority Certificates > CA Certificates KeyStore > Browse

Once you have the cacerts file in your KeyStore Explorer view, import the Let’s Encrypt Authority X3 cert you've downloaded into the CA keystore.
File menu > Tools > Import Trusted Certificate

You may be prompted to view and verify that the information in the cert is correct first. Click okay with the suggested alias after that to complete the process.

Oh, and don't forget to amend your RestTemplate constructor:
RestTemplate restTemplate = new RestTemplate(new HttpComponentsClientHttpRequestFactory());

Because otherwise it defaults to SimpleClientHttpRequestFactory, which is backed by the JDK's own HttpURLConnection and did not work over HTTPS in my setup.
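
To double-check that the import actually took effect for the JVM running the transporter, a small check along these lines could help. It is only a sketch: the class and method names are made up for illustration, and it assumes the program runs on the same JRE whose cacerts file was edited.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

public class TrustStoreCheck {

    // Counts the trusted CA certificates whose subject name contains the given fragment.
    static int countIssuersContaining(String fragment) {
        try {
            TrustManagerFactory tmf =
                    TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            // Passing null tells the factory to use the JVM's default trust store (cacerts).
            tmf.init((KeyStore) null);
            int count = 0;
            for (TrustManager tm : tmf.getTrustManagers()) {
                if (tm instanceof X509TrustManager) {
                    for (X509Certificate cert : ((X509TrustManager) tm).getAcceptedIssuers()) {
                        if (cert.getSubjectX500Principal().getName().contains(fragment)) {
                            count++;
                        }
                    }
                }
            }
            return count;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not read the default trust store", e);
        }
    }

    public static void main(String[] args) {
        // A result of 0 suggests the cert went into a different cacerts file.
        System.out.println(countIssuersContaining("Let's Encrypt"));
    }
}
```

If this prints 0 right after the import, the most likely culprit is that KeyStore Explorer was pointed at a different JRE than the one your IDE uses.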

Have fun!

Saturday, August 6, 2016

Websocket + SockJS + Apache as proxy

In a previous post, I mentioned my journey into Understanding websocket with the Spring framework, which I realised turned out to be more about exploring spring-security. Nevertheless, it was still part of the same journey, since I probably wouldn't have used spring-security if it weren't for its presence as a spring-websocket dependency.

That said, the coding itself was generally uneventful. That is, until I tried deploying the package into the cloud. WildFly was sitting behind Apache acting as the proxy, and this was where the problem manifested itself. The issue was compounded by the use of HTTPS, which was terminated at the proxy only, with WildFly listening for plain HTTP.

It took quite some time for me to figure out a solution on my own. If your setup is using the stack from Bitnami similar to mine then I certainly hope you'll find this post useful.

Assumptions:
  1. Your Bitnami stack is up and running correctly;
  2. WildFly is listening internally on port :8080;
  3. You have complete access to your WildFly management console;
  4. Your app can be deployed successfully to WildFly;
  5. Both WildFly and Apache are sited on the same machine/instance;
  6. Apache has been configured with an SSL certificate for proper HTTPS operation;
  7. Apache is functioning normally for regular HTTP/HTTPS traffic;
  8. The Apache error log indicates an excessive amount of traffic due to the websocket;
  9. The Apache access log indicates HTTP 502 or similar error codes;
  10. httpd.conf already has the LoadModule lines for mod_proxy and its related modules uncommented, especially mod_proxy_wstunnel.so.
 Next, add the following lines to your /opt/bitnami/apache2/conf/bitnami/bitnami.conf file:
  <IfModule proxy_wstunnel_module>
  RewriteCond %{HTTP:Upgrade} =websocket [NC]
  RewriteRule ^/(.*)$ ws://localhost:8080/$1 [P,L]
  </IfModule>
 Do the same for both HTTP (port :80) and HTTPS (port :443) traffic. And then restart Apache.

It took me a while to realise, after enabling the Apache debug log level, that
  1. There is never any wss:// used by WildFly because I do not have the HTTPS listener set up; just the vanilla HTTP;
  2. Redirecting to ws://www.mysite.com/$1 would not work either, because that's still throwing the same request to Apache;
  3. Redirecting to ws://localhost/$1 would not work, because that's equally requesting to Apache itself;
If you have SSL for both Apache and WildFly, I guess it'd be possible to utilise SSLProxyEngine, which would probably circumvent all of the above, although it'd potentially add overhead to a small cloud instance.

Monday, August 1, 2016

FileUploadException: UT000020: Connection terminated as request was larger than 10485760

In the course of our limits testing, it turns out that it wasn't enough to just set the file upload limit in our own application for the CommonsMultipartResolver, using our spring-*context.xml configuration. WildFly had other ideas of its own. It comes out of the box with the default limit of 10485760 bytes.

While I managed to locate a couple of results such as this, this, this and this, which pointed me in the right direction, they were all referencing older versions, namely WildFly 8. And being as lazy as I am, I wasn't about to poke around the XML making changes manually, much less use the CLI to amend the value. I wanted to make the change via the WildFly Admin Console UI.

So I had to do some exploring of my own. Based on those clues, I identified where to change said value.

Navigate to the Configuration tab > Subsystems > Web/HTTP - Undertow > HTTP and click View


HTTP Server tab > "default-server" > View

HTTP Listener > Edit

Then edit the "Max post size" to your desired value. Naturally, I'd think that the value should match whatever you've configured in your own application.

And don't forget to restart your WildFly!

Edit: Also, don't forget to tweak your database e.g. MySQL for the max_allowed_packet alongside this setting!
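
For reference, the default of 10485760 bytes is simply 10 MiB (10 x 1024 x 1024). A throwaway helper like the one below is handy for working out the byte value to type into the "Max post size" field. The class and method names are my own and purely illustrative.

```java
public class MaxPostSize {

    // Converts a size in mebibytes to the byte count expected by the "Max post size" field.
    static long mebibytesToBytes(long mebibytes) {
        return mebibytes * 1024L * 1024L;
    }

    public static void main(String[] args) {
        System.out.println(mebibytesToBytes(10)); // prints 10485760, the out-of-the-box limit
        System.out.println(mebibytesToBytes(50)); // prints 52428800
    }
}
```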

Friday, July 29, 2016

Understanding websocket with the Spring framework (feat. spring-security)

I needed to learn how websockets worked in the Spring framework. I did this by researching how the different parts work together with the resulting package here.

It was a rabbit hole that turned out to be much deeper than I expected. I needed instantaneous notifications in the client browser, and the first step was turning to websockets, since long-polling seemed a little backward at this stage. My first encounter was with the pure JSR-356 implementation by Oracle. But I quickly realised that, because I'm already using the Spring framework, there was already spring-websocket. @ServerEndpoint doesn't seem to play nice with Spring, so now I'm using what Spring has: @MessageMapping for receiving messages, @SendToUser for sending messages to a specific user, and @SubscribeMapping for accepting subscriptions from users. The documentation is still rather sparse even now, despite it having been around for a few years.

Next up was the browser-end. That required a combination of STOMP and SockJS, both of which are already kind of integrated into spring-websocket. But getting the jQuery send/subscribe/callback functions correct still took a bit of time getting used to.

The websocket handshake uses the HTTP Upgrade header to turn an http:// request into a ws:// connection. The problem came when it was time to test this with TLS. My setup involves fronting WildFly with Apache, and the HTTPS connection is handled by this proxy. The usual mod_proxy required an additional mod_proxy_wstunnel. While SockJS handles the switch from https:// to wss:// on its own, the proxy introduced a new set of problems. Apache was having difficulty translating the upgrade request. Part of the solution included this. But since I have a hundred and one issues to solve, the fix is still pending.

Another portion of the setup requires spring-security. And while I totally appreciate what resources mkyong has to offer, I'm still trying to wade through the swamplands on my own. You see, instead of tapping into the login module that spring-security had to offer, I decided to stick to my own version that I'd established before this. I realised from this post that you could manually set the authenticated user with
SecurityContextHolder.getContext().setAuthentication(authentication);

and prepare your own list of Granted Authority for it.

A static 403 error page was prepared and added into the spring-security setup. It turned out that I had forgotten to add RequestMethod.POST on top of the standard RequestMethod.GET for the page's mapping. It took me a while to hunt this issue down. But now, I'm still trying to figure out why spring-security is redirecting an authenticated user to this 403 page after a POST form submission.

Now that I've gotten the static 403 page displaying properly, instead of the miserable "Request Method 'POST' not supported", I wanted to trace the next stage of the problem. In order to do my detective work, I needed to follow the trail from the output logs, and spring-security doesn't make it any easier. I eventually got the hint that not only did I need to add <security:debug /> to the XML file but, because I'm using Log4j, I also had to add these lines:
    <logger name="org.springframework.web.security">
        <level value="debug" />
    </logger>
Notice my mistake? I didn't for a while, and kept wondering why the log was not outputting anything relevant. It ought to have been this instead:
    <logger name="org.springframework.security">
        <level value="debug" />
    </logger>
Great. The console was outputting a whole bunch; too much of a bunch in fact. I had to enable logging to a file on disk and trace from there. Finally getting somewhere, I identified the offending line:
DEBUG csrf.CsrfFilter - Invalid CSRF token found for...
I was already suspicious about the CSRF portion, but had thought it to be a dead end. Now I could be more certain as I investigated further. Returning to the source, I continued reading the spring-security documentation, and came across (again) the section on CSRF protection. I noticed a particular section that discusses the use of the multipart (file upload) content type. This jumped out at me this time round, because I realised that the form hitting the CSRF issue involves file uploads, which certainly requires the multipart feature.

A more detailed answer (despite it being over 2 years old) was found on StackOverflow here. And it was pretty straightforward. The spring-security documentation on adding CSRF support to multipart forms had left out one detail: MultipartFilter looks up its resolver bean under the default name "filterMultipartResolver". At this stage, I'd already configured the CommonsMultipartResolver in my spring-mvc-context.xml as an existing bean, under a different name.

Now, in order to complete the loop, I could have simply named the bean the default "filterMultipartResolver". Of course, in order to prove that the linkage was correct, I did this instead:
<filter>
    <filter-name>springMultipartFilter</filter-name>
    <filter-class>org.springframework.web.multipart.support.MultipartFilter</filter-class>
    <init-param>
        <param-name>multipartResolverBeanName</param-name>
        <param-value>ubiomiMultipartResolver</param-value>
    </init-param>
</filter>
<filter-mapping>
    <filter-name>springMultipartFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>
which was then placed before the springSecurityFilterChain lines of configuration in my web.xml file.

I'm happy to report that the CSRF protection remains in place, and the multipart content passes through spring-security properly!

The journey continues.

Thursday, June 16, 2016

Upgrading from Wildfly 9 + Hibernate 4.3 to Wildfly 10 + Hibernate 5

TL;DR - Exclude Hibernate libraries in POM using <scope>provided</scope> when deploying to Wildfly.

When this project first started several months ago, Wildfly 9 and Hibernate 4.3 were the most current stable releases. Things have changed however, and it's time to move on. The migration provided a couple of learning points, mostly to do with ensuring the project was compatible with the updated versions.

After having Wildfly 10 up and running, I attempted to deploy the WAR file as-is without upgrading to Hibernate 5. The major error returned from the deployment was a java.lang.AbstractMethodError similar to this problem. Exactly in SessionFactoryImpl, exactly on line 278. Thinking that there might be a version mismatch (using 4.3 vs 5), I edited the POM to bring all Hibernate libraries up a major version. That did not fix it.

The JPA reference for Wildfly 10 here suggested that Hibernate 5 requires a change to the persistence.xml such that, instead of org.hibernate.ejb.HibernatePersistence, the class name for the <provider> tag should be org.hibernate.jpa.HibernatePersistenceProvider for new deployments. A minor misstep on my part was ignoring the sub-package change from .ejb to .jpa and merely adding "Provider" at the end. That aside, this did not help whatsoever. The documentation goes on to indicate that it's actually possible to leave this value out entirely. So I did.

Next, I even dug into the downloaded zip for Wildfly 10 to check out the libraries it used, and made sure that what went into my POM matched version for version. It eventually dawned on me that Wildfly was using its own version of Hibernate to run my package, but ran into the error because I was also bundling my own. I'd previously specified an explicit exclusion in Wildfly for dom4j, but I felt it would be better to rely on the complete set of Hibernate 5 JAR files that Wildfly already ships with and is familiar with.

The Hibernate dependencies in the POM were modified to have <scope>provided</scope> for this change. And it did the trick! My local Jetty build would have to add these dependencies in, of course.

Thursday, May 19, 2016

Use jMimeMagic instead because Wildfly conflicts with Tika

There was a need to retrieve images from external sites, so after some research, I thought that Tika would do the trick. I preferred content type detection via magic numbers over relying on the file extension alone. Unfortunately, the library had a very long list of dependencies. Jetty didn't complain, but the moment I deployed the WAR file on to Wildfly, it encountered a whole bunch of problems, including this. I didn't think the headache was worth it and looked for an alternative. Google returned this article which had a pretty comprehensive list of libraries. Despite the smaller footprint of mime-util, I decided to go for the more recent jMimeMagic (added TIFF support, last updated 12 Dec 2014, as of today) instead.

Magic parser = new Magic();
MagicMatch match = parser.getMagicMatch(new File("image.jpg"));
System.out.println(match.getMimeType());

That's all there is to using it.

At least Wildfly accepted the new WAR file. We'll have to continue monitoring the situation for this.
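
For anyone wondering what "detection via magic numbers" boils down to, here is a hand-rolled sketch. It is not how jMimeMagic or Tika are implemented internally; it just compares the first few bytes of the content against three well-known file signatures. The class and method names are made up for illustration.

```java
import java.util.Arrays;

public class MagicSniffer {

    // Well-known leading byte signatures ("magic numbers") of three image formats.
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] GIF = {'G', 'I', 'F', '8'};

    // True if data begins with exactly the bytes in prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Returns a MIME type based on the content itself, ignoring any file extension.
    static String sniff(byte[] data) {
        if (startsWith(data, JPEG)) return "image/jpeg";
        if (startsWith(data, PNG)) return "image/png";
        if (startsWith(data, GIF)) return "image/gif";
        return "application/octet-stream"; // unknown content
    }

    public static void main(String[] args) {
        byte[] fakeGif = {'G', 'I', 'F', '8', '9', 'a'};
        System.out.println(sniff(fakeGif)); // prints image/gif
    }
}
```

A real library covers far more formats and edge cases, which is exactly why I'd rather use one than maintain a list like this myself.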

Tuesday, February 16, 2016

Spring framework has issues with periods

I didn't know what it was initially, but I found stuff like this, this and this.

The one other thing that stood out to me was when I moused over @RequestMapping in Eclipse and one of the list of arguments was noted as such:
  • @PathVariable annotated parameters (Servlet-only) for access to URI template values (i.e. /hotels/{hotel}). Variable values will be converted to the declared method argument type. By default, the URI template will match against the regular expression [^\.]* (i.e. any character other than period), but this can be changed by specifying another regular expression, like so: /hotels/{hotel:\d+}.
 Simply put, if the last path segment of your URL contains a period ".", Spring assumes by default that it could be a file extension, strips off the dot and everything after it, and returns only the string preceding it.

My annotation went from

@RequestMapping(value = "/fruits/{fruitName}", method = RequestMethod.GET)

to one that includes a simple regex which also accepts the period symbol

@RequestMapping(value = "/fruits/{fruitName:.+}", method = RequestMethod.GET)
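
The difference between the two patterns can be seen with plain java.util.regex, outside of Spring. This is only a sketch of the regex behaviour described in the Javadoc quoted above, not Spring's actual matching code; the class name, the helper method and the sample path are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathVariableRegexDemo {

    // Returns what {fruitName:<variableRegex>} would capture from the path, or null if there is no match.
    static String capture(String variableRegex, String path) {
        Matcher m = Pattern.compile("/fruits/(" + variableRegex + ")").matcher(path);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String path = "/fruits/dragon.fruit";
        // The documented default pattern excludes periods, so the whole path cannot match.
        System.out.println(capture("[^\\.]*", path)); // prints null
        // The explicit .+ pattern accepts periods and captures the full name.
        System.out.println(capture(".+", path));      // prints dragon.fruit
    }
}
```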

Monday, February 1, 2016

Submitting HashMap from JSP with JSTL in Spring framework 4.2.1

The Controller I'm currently working on has this model that it uses, let's just say it is
public class RecipeModel implements Serializable {
  private String recipeName; //not important
  private String authorName; //not important
  private Map<String, Ingredient> ingredients;
  //get and set methods
}

And the Ingredient class of course,
public class Ingredient implements Serializable {
  private String ingredientName;
  private String preparationStyle;
  private String useAtStage;
  //get and set methods
}

Obviously, the above are placeholder examples I'm using, in place of the actual objects I have, so while the scenario may not be ideal, it is just meant to describe the situation.

The JSP managed to display (via JSTL) the model values without a hitch:
<c:forEach items="${recipeModel.ingredients}" var="ingredientMap">
  <input name="ingredientMap['${ingredientMap.key}']" value="${ingredientMap.value.ingredientName}" />
</c:forEach>

The problem arose when I needed to return the amended changes back to the server. The solutions I found seemed either too outdated or too tedious. It also didn't help that I had misspelt ingredientMap with a lowercase 'm'.

What worked?

<input name="ingredientMap['${ingredientMap.key}'].ingredientName" value="${ingredientMap.value.ingredientName}" />
<input name="ingredientMap['${ingredientMap.key}'].useAtStage" value="${ingredientMap.value.useAtStage}" />
<input name="ingredientMap['${ingredientMap.key}'].preparationStyle" value="${ingredientMap.value.preparationStyle}" />

Append the property name (don't forget the dot) after the bracket for the map. Spring is able to translate each field accordingly into the corresponding property of the value object in the map.

That was it. No CustomMapEditor to add in the InitBinder, no mapping of spring:binder stuff anywhere else. Mistakes were made. Mischief managed.

Thursday, January 28, 2016

Spring Framework model naming

It's been 3 months since I'd encountered any quirks! And I've been working on something new!

If you just want the TL;DR, skip to the bottom.

This time round, I came across a weird situation whereby Spring is unable to provide me a model name I'd expect on my JSP. The object was created as "WTSoupModel" (names have been changed, but the pattern remains the same), and you can probably see the issue immediately if you're somewhat familiar with this situation.

The MVC I've been building up had been mostly running smoothly thus far, until this. The controllers, models, and services have been annotated similarly to my previous modules. What went wrong was when I tried accessing this model that was provided by my Controller via the JSP.

I was expecting the name of the model accessible by the JSP to be "wtSoupModel", but ${wtSoupModel eq null} would always return true whenever I tested it, until I dug deep enough. Initially, for the sake of the experiment, I added in the model from the sub-module I'd just completed, ReportModel reportModel, and of course, ${reportModel eq null} gave me false, as expected. Then I made sure the WTSoupModel was Serializable. And then I tried renaming it to WanTonSoupModel in full. Aha! It turned out that ${wanTonSoupModel eq null} was false. This meant the issue was with the naming, because there wasn't much else to rule out.

Next, I proceeded to find out if there was any way I could customise the model name that reaches the JSP. Upon further investigation, I found this that led to this. Experimenting with it in my Controller proved it true.

Final fix (names have been changed):

@RequestMapping(value = "/wantonsoup", method = RequestMethod.GET)
public String viewWTSoupList(Locale locale,
   @ModelAttribute("wtSoupModel") WTSoupModel wtSoupModel,
    ...
   Model model) throws MSBException {
...
}

I still haven't figured out what Spring resolved the original WTSoupModel into by default. Let me know if you do.
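
A possible explanation, offered as a guess rather than something I have confirmed in Spring's source: if Spring derives the default attribute name with JavaBeans-style decapitalisation, then a class name starting with two capital letters is left untouched, and the model would have been exposed as "WTSoupModel" rather than "wtSoupModel". The sketch below only shows what java.beans.Introspector.decapitalize does with the three names from this post; the class and method names are made up.

```java
import java.beans.Introspector;

public class ModelNameDemo {

    // Applies the JavaBeans decapitalisation rule to a simple class name.
    static String defaultName(String simpleClassName) {
        return Introspector.decapitalize(simpleClassName);
    }

    public static void main(String[] args) {
        System.out.println(defaultName("ReportModel"));     // prints reportModel
        System.out.println(defaultName("WanTonSoupModel")); // prints wanTonSoupModel
        // Two leading capitals: the JavaBeans rule leaves the name as-is.
        System.out.println(defaultName("WTSoupModel"));     // prints WTSoupModel
    }
}
```

If that assumption holds, testing ${WTSoupModel eq null} in the JSP would be a quick way to confirm it.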