Elasticsearch error with Python datetime: "Caused by: java.lang.IllegalArgumentException: Invalid format: "" is too short"

During normal usage, we saw no loss of information on our Elasticsearch cluster, which was ingesting about 200 entries per second. We then stress tested the platform with bulk inserts at 7,000 entries/s and started seeing data get lost.

It seemed like the date format was invalid, but the code had not changed, and there is no if-else statement through which the datetime could have been echoed in a different format. We produce the timestamp with the Python call `datetime.isoformat(' ')`. Why would the same line of code produce two different outputs under heavy stress?

[2017-05-30T15:03:30,562][DEBUG][o.e.a.b.TransportShardBulkAction] [idc7-citadelle-0-es0] [proxydns-2017-05-30][0] failed to execute bulk item (index) BulkShardRequest [[proxydns-2017-05-30][0]] containing [1353] requests
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [date.date_answer]
        at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:298) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:450) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:576) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:396) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:373) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:447) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:467) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:383) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:373) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:93) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:66) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:277) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:532) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.shard.IndexShard.prepareIndexOnPrimary(IndexShard.java:509) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.prepareIndexOperationOnPrimary(TransportShardBulkAction.java:447) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnPrimary(TransportShardBulkAction.java:455) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:143) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:113) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:69) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:939) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:908) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:113) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:322) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:264) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:888) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:885) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:147) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationLock(IndexShard.java:1654) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryShardReference(TransportReplicationAction.java:897) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction.access$400(TransportReplicationAction.java:93) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:281) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:260) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:252) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:618) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:613) [elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.3.0.jar:5.3.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.lang.IllegalArgumentException: Invalid format: "2017-05-30 13:03:11" is too short
        at org.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187) ~[joda-time-2.9.5.jar:2.9.5]
        at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:826) ~[joda-time-2.9.5.jar:2.9.5]
        at org.elasticsearch.index.mapper.DateFieldMapper$DateFieldType.parse(DateFieldMapper.java:242) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.mapper.DateFieldMapper.parseCreateField(DateFieldMapper.java:462) ~[elasticsearch-5.3.0.jar:5.3.0]
        at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:287) ~[elasticsearch-5.3.0.jar:5.3.0]
        ... 40 more

It turns out that during stress tests, logs with a wider range of timestamps get produced. Lemme illustrate with this little Python script.

import datetime

lol = datetime.datetime.fromtimestamp(1496211943.123)
print lol.isoformat(' ') # 2017-05-31 10:25:43.123000
lol = lol + datetime.timedelta(milliseconds=500) # we add 500 milliseconds to it. 
print lol.isoformat(' ') # 2017-05-31 10:25:43.623000
lol = lol - datetime.timedelta(milliseconds=623) # we subtract 623 milliseconds. What will the result be?
print lol.isoformat(' ')  # 2017-05-31 10:25:43

Notice how the format changed? I was expecting the output to be `2017-05-31 10:25:43.000000`. During normal usage, it is highly unlikely for a request to land at exactly 0 microseconds, and even if one did, the data loss would be tiny and undetectable. The stress test made the very improbable much more probable, and we detected it.

But why did the stupid format have to change? Reading the official documentation of Python's datetime library, I saw this:

datetime.isoformat([sep])

Return a string representing the date and time in ISO 8601 format, YYYY-MM-DDTHH:MM:SS.mmmmmm or, if microsecond is 0, YYYY-MM-DDTHH:MM:SS [1]

Why did they choose to do this? No idea. Anyways, I resorted to writing my own format, one I can be sure will never have this problem again:

print lol.strftime('%Y-%m-%d %H:%M:%S.%f')
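
Here's a minimal sketch wrapping that into a helper (the `es_timestamp` name is mine); `%f` is always rendered as six digits, even when the microsecond is zero:

import datetime

def es_timestamp(dt):
    # strftime's %f always pads microseconds to six digits, even when
    # they are 0, so Elasticsearch sees one consistent format.
    return dt.strftime('%Y-%m-%d %H:%M:%S.%f')

lol = datetime.datetime(2017, 5, 31, 10, 25, 43)  # microsecond == 0
print lol.isoformat(' ')  # 2017-05-31 10:25:43 -- the short form
print es_timestamp(lol)   # 2017-05-31 10:25:43.000000 -- always full length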

[1] https://docs.python.org/2/library/datetime.html

Analysing Car Market Trends using Big Data Tools in Mauritius

Who hasn't heard of Big Data in the 21st century? Big Data in itself isn't as big a deal as knowing how to extract useful information from the data. The analytics part is the killer feature of Big Data. It is both science and art!


The first step of any Big Data solution is to gather data. There are lots of ways to achieve this. Manual data entry can still be done, but you'd need quite an army of minions for that, and I was not "evil" enough to convince the minions to work for me. So I had to find other ways: the Facebook Graph API lets you pull feeds from your wall and from the car groups you're in.

{
      "message": "Renault Scenic
Rs138,000 – Vacoas, Plaines Wilhems, Mauritius

For sale Renault Scenic Year 2005,STEPTRONIC gearbox,fully executive,1500 cc Petrol,never accidented,comes with digital dashboard,armrest,electric mirrors,alloy wheels and panoramic sunroof",
      "updated_time": "2017-04-16T05:57:29+0000",
      "id": 14851777
},

The good news is that the data is in JSON format. The bad news is that the message field contains unstructured data, so we have to extract the information we want from it. I use the Collectiva Beta service (from nayarweb.com) for the data processing.

Collectiva Beta

Let’s extract the Make, Model, Price, Location and Year from the data. I use grok patterns on the message field.

Make and Model: ^%{NOTSPACE:make} %{NOTSPACE:model}

Year: (y(ea)*r|an(n*e*))\s*\:*\s*(?<year;int>[0-9]+) // works with an 96, year 2003, yr : 14, anne 2000

Price & Location: (((Rs|\$|£|₹)(?<price_string>([0-9,])*))|FREE)( – %{GREEDYDATA:location_string})?

Now we can just throw data at it. Some will get properly parsed, some not, but with enough data you can still get some pretty graphs. In the future, hopefully I'll get an AI to do the extracting for me.

Let’s get to the pretty parts now: graphs.

Which car makes are the most sold in Mauritius?

Easy: A Bar Chart

How many models are within each make?

No need for another graph. We just split the bars by model. Voila!

How much do the car models depreciate in Mauritius?

Notice that it is graphing the 50th percentile, also known as the median. Means are kinda useless here because a single outlier can skew the graph by a lot. We can show the 10th, 50th and 90th percentiles on the same graph so that you can check whether the car you're buying falls within the accepted market price. Let's say you want a Volkswagen Polo:

You can see the same lines above are now split into 3 lines. I don't have much variation here because I've only been collecting data for 4 days. The lines should smoothen over time and the accuracy will increase.
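
To see why the median is the safer statistic, here's a toy example with made-up prices:

prices = [230000, 240000, 260000, 275000, 1500000]  # one overpriced outlier

mean = sum(prices) / len(prices)
median = sorted(prices)[len(prices) // 2]

print mean    # 501000 -- dragged way up by the single outlier
print median  # 260000 -- still a sensible market price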

Wanna see the Big Picture? Welcome the Pie Chart

In this graph, I aggregated the data into Make, Model, Year and Price. All in one graph. It's like magic, nah?
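
Conceptually, the aggregation behind that graph is something like this pandas sketch (the listings rows here are made up, not the real dataset):

import pandas as pd

listings = pd.DataFrame([
    {'make': 'Renault', 'model': 'Scenic', 'year': 2005, 'price': 138000},
    {'make': 'Renault', 'model': 'Scenic', 'year': 2010, 'price': 285000},
    {'make': 'Toyota',  'model': 'Vitz',   'year': 2010, 'price': 320000},
])

# One row per (make, model, year) with the median asking price.
print listings.groupby(['make', 'model', 'year'])['price'].median()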

Want to get into Big Data World?

If you have big databases sitting around (think supermarkets, warehouses, manufacturing, agriculture), or you run an SME or NGO and would like to benefit from the insights of Big Data tools, feel free to contact me for a quotation.

Detecting Brute Force Attacks on Linux using Graylog/Elasticsearch

I noticed one of my servers was sending more logs than the others. `sshd` was the application that sent the largest amount of logs in the last 24 hours.

The logs look like these:

pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=116.31.116.52 user=root

Failed password for root from 116.31.116.52 port 14281 ssh2

message repeated 2 times: [ Failed password for root from 116.31.116.52 port 14281 ssh2]

Received disconnect from 116.31.116.52 port 14281:11: [preauth]

Disconnected from 116.31.116.52 port 14281 [preauth]

PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=116.31.116.52 user=root
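
The source IPs are right there in the logs, so even outside Graylog you can tally them with a few lines of Python; a minimal sketch, assuming a Debian-style `/var/log/auth.log`:

import re
from collections import Counter

FAILED = re.compile(r'Failed password for .* from (\d+\.\d+\.\d+\.\d+)')

counts = Counter()
with open('/var/log/auth.log') as auth_log:
    for line in auth_log:
        match = FAILED.search(line)
        if match:
            counts[match.group(1)] += 1

# Top 10 attacking IPs by number of failed logins.
for ip, hits in counts.most_common(10):
    print ip, hits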

Since the IPs are being logged, I can see where the attempts are coming from. Let's generate a map.

Well well well, 5437 occurrences from China. Gotta do something. Here's the frequency of the attacks this week.

It seems to have amplified today. Looking at the logs, we can see that they were trying different passwords for the user `root`. Lemme just disable password authentication for SSH.

$ vim /etc/ssh/sshd_config

Change the line

#PasswordAuthentication yes

to

PasswordAuthentication no

$ service ssh restart

And we can see the logs are no longer appearing 😉

Do you wish to analyse your server logs too? Feel free to message me on Facebook, Twitter or LinkedIn.