Sunday 20 April 2014

Hive- screwing up with dates!

Lets start stuff.

➜  ~  jps
2369
21440 Jps

➜  ~  cd $HADOOP_HOME

➜  ~HADOOP_HOME git:(master) sbin/start-dfs.sh
2014-04-20 13:43:03.375 java[21482:1903] Unable to load realm info from SCDynamicStore
14/04/20 13:43:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/...
localhost: 2014-04-20 13:43:04.892 java[21555:1d03] Unable to load realm info from SCDynamicStore
localhost: starting datanode, logging to /usr/local/hadoop/logs/...
localhost: 2014-04-20 13:43:08.843 java[21634:1d03] Unable to load realm info from SCDynamicStore
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs...
0.0.0.0: 2014-04-20 13:43:13.916 java[21903:1d03] Unable to load realm info from SCDynamicStore
2014-04-20 13:43:18.459 java[21946:1903] Unable to load realm info from SCDynamicStore
14/04/20 13:43:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

➜  ~HADOOP_HOME git:(master) jps
21975 Jps
21634 DataNode
2369
21903 SecondaryNameNode
21555 NameNode

➜  ~HADOOP_HOME git:(master) bin-mapreduce1/start-mapred.sh
starting jobtracker, logging to /usr/local/hadoop/logs/...
2014-04-20 13:43:29.367 java[22044:1903] Unable to load realm info from SCDynamicStore
localhost: starting tasktracker, logging to /usr/local/hadoop/bin-mapreduce1/../logs/...
localhost: 2014-04-20 13:43:30.578 java[22129:1d03] Unable to load realm info from SCDynamicStore

➜  ~HADOOP_HOME git:(master) jps
22044 JobTracker
21634 DataNode
2369
22151 Jps
21903 SecondaryNameNode
21555 NameNode
22129 TaskTracker

So now we have the DataNode, NameNode, TaskTracker, JobTracker and SecondaryNameNode running.

Now I need to generate a list of dates, lets say starting from the year 2000 to year 2040... i wrote a small ruby script to do that: (In the format YYYY-MM-DD HH:mm:SS)

$ cd ~/myexperiments/ruby
$ nano ruby_dates.rb
>yearStart = 2000
>yearEnd = 2040
>(yearStart...yearEnd).each { |year| puts year.to_s + "-01-01 00:00:00" }

And in
$ cd ~/myexperiements/data
$ ruby ~/myexperiements/ruby/ruby_dates.rb > dates

Now lets create a table over this, first we need to bring this file into HDFS.

$ cd $HIVE_INSTALL
$ hive
hive> dfs -mkdir /Users/Dhruv/Input/Dates
hive> dfs -copyFromLocal /Users/Dhruv/myexperiments/data/dates /Users/Dhruv/Input/Dates/ ;

hive> dfs -ls /Users/Dhruv/Input/Dates ;
Found 1 items
-rw-r--r--   1 Dhruv supergroup        800 2014-04-20 14:11 /Users/Dhruv/Input/Dates/dates

Then we create a table on the same:

hive> create table dates(
    > a_date timestamp)
    > location '/Users/Dhruv/Input/Dates' ;
OK
Time taken: 0.027 seconds

NOTE: Hive won't let me create a column with name date!

And then we run:
hive> select * from dates order by a_date;
And it turns out:
1970-12-13 20:45:52
2000-01-01 00:00:00
2001-01-01 00:00:00
2002-01-01 00:00:00
2003-01-01 00:00:00
2004-01-01 00:00:00
2005-01-01 00:00:00
2006-01-01 00:00:00
2007-01-01 00:00:00
2008-01-01 00:00:00
2009-01-01 00:00:00
2010-01-01 00:00:00
2011-01-01 00:00:00
2012-01-01 00:00:00
2013-01-01 00:00:00
2014-01-01 00:00:00
2015-01-01 00:00:00
2016-01-01 00:00:00
2017-01-01 00:00:00
2018-01-01 00:00:00
2019-01-01 00:00:00
2020-01-01 00:00:00
2021-01-01 00:00:00
2022-01-01 00:00:00
2023-01-01 00:00:00
2024-01-01 00:00:00
2025-01-01 00:00:00
2026-01-01 00:00:00
2027-01-01 00:00:00
2028-01-01 00:00:00
2029-01-01 00:00:00
2030-01-01 00:00:00
2031-01-01 00:00:00
2032-01-01 00:00:00
2033-01-01 00:00:00
2034-01-01 00:00:00
2035-01-01 00:00:00
2036-01-01 00:00:00
2037-01-01 00:00:00
2038-01-01 00:00:00

So the 2039 date turns into 1970!

However this can be solved (for this particular date-time format) by using string type instead.

hive> create external table dates_string(
    > date_string string) location '/Users/Dhruv/Input/Dates' ;
hive> select * from dates_string order by date_string;
2000-01-01 00:00:00
2001-01-01 00:00:00
2002-01-01 00:00:00
2003-01-01 00:00:00
2004-01-01 00:00:00
2005-01-01 00:00:00
2006-01-01 00:00:00
2007-01-01 00:00:00
2008-01-01 00:00:00
2009-01-01 00:00:00
2010-01-01 00:00:00
2011-01-01 00:00:00
2012-01-01 00:00:00
2013-01-01 00:00:00
2014-01-01 00:00:00
2015-01-01 00:00:00
2016-01-01 00:00:00
2017-01-01 00:00:00
2018-01-01 00:00:00
2019-01-01 00:00:00
2020-01-01 00:00:00
2021-01-01 00:00:00
2022-01-01 00:00:00
2023-01-01 00:00:00
2024-01-01 00:00:00
2025-01-01 00:00:00
2026-01-01 00:00:00
2027-01-01 00:00:00
2028-01-01 00:00:00
2029-01-01 00:00:00
2030-01-01 00:00:00
2031-01-01 00:00:00
2032-01-01 00:00:00
2033-01-01 00:00:00
2034-01-01 00:00:00
2035-01-01 00:00:00
2036-01-01 00:00:00
2037-01-01 00:00:00
2038-01-01 00:00:00
2039-01-01 00:00:00

Which gives the correct result!

NOTE: you can generate random dates from ruby script by a slight modification->
>yearStart = 2000
>yearEnd = 2040

>dates = []
>(yearStart...yearEnd).each { |year| dates << year.to_s + "-01-01 00:00:00" }

>puts dates.shuffle

And you can also use command line to pass in the year range:

➜  ruby  cat ruby_dates.rb
> yearStart = ARGV[0] ? ARGV[0].to_i : 2000
> yearEnd = ARGV[1] ? ARGV[1].to_i : 2040

> dates = []
> (yearStart...yearEnd).each { |year| dates << year.to_s + "-01-01 00:00:00" }

> puts dates.shuffle

$ ruby ruby_dates.rb "1970" "2030"