Saturday, November 13, 2010

Batching in MYSQL

In a scenario where you have to do multiple inserts in database with the same sql statement but with different data, batching them in a single SQL query to database saves of time exponentially.

In mysql using executeBatch() solves problem of batching. In case if you use JDBCTemplate then use batchUpdate() which internally calls executeBatch only.
Here is a example code
import java.sql.BatchUpdateException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
 
public class MySqlUpdate {
 
private static Connection getConnection() throws Exception {
String driver = "org.gjt.mm.mysql.Driver";
String url = "jdbc:mysql://localhost/test";
String usrname = "root";
String passwd = "";
Class.forName(driver);
Connection conn = DriverManager.getConnection(url, usrname, passwd);
return conn;
}
 
public static void main(String[] args) {
Connection conn = null;
PreparedStatement ps = null;
try {
conn = getConnection();
conn.setAutoCommit(false);
String query = "insert into test(name, desg) values(?, ?)";
ps = conn.prepareStatement(query);
ps.setString(1, "ABC");
ps.setString(2,"Software Engineer");
ps.addBatch();
 
ps.setString(1, "XYZ");
ps.setString(2,"Sr. Software Engineer");
ps.addBatch();
 
ps.setString(1, "PQR");
ps.setString(2,"Team Leader");
ps.addBatch();
 
int[] updateCounts = ps.executeBatch();
System.out.println("Length:" + updateCounts.length);
conn.commit();
} catch (BatchUpdateException e) {
try {
conn.rollback();
} catch (Exception e2) {
e.printStackTrace();
}
} catch (Exception e) {
e.printStackTrace();
} finally {
pstmt.close();
conn.close();
}
}
}  

But this does not solve our problem.Doing this will only send all the insert queries in a together to the database and executing them one by one as independent queries.This will only save us the time in roundtrip to database.

To get the real benefit from the batching we should set rewriteBatchStatements=true in the database connection url we use to make connection .
As shown here
"jdbc:mysql://localhost/db?user=root&rewriteBatchStatements=true"

This will activate the batch mode and now multiple queries will be executed as a single query.But this works only when the number of queries is more then 3.

you must be careful about writing the query statement as well as any extra/ less space will make batch ineffective. So be sure that the query syntax is perfect. For example

"insert IGNORE into rosters (owner_jid,jid,name) values (?,?,?)" works but

"insert IGNORE into rosters(owner_jid,jid,name) values (?,?,?)" or
"insert IGNORE into rosters (owner_jid,jid,name)values (?,?,?)" or
"insert IGNORE into rosters (owner_jid,jid,name) values(?,?,?)" does not work.

If you find out the reason why it does not work with irregular spaces, please comment.

Friday, November 12, 2010

Investigation of Bosh and Punjab

Bosh(Bidirectional-streams Over Synchronous HTTP) is a protocol designed to support XMPP chat via browsers. As browsers can use HTTP protocol, they can not talk directly to XMPP servers. So there is a need for a layer between browsers and XMPP server. BOSH outlines the specifications for that layer.

Punjab is a connection manager (I will refer it by CM from now on) which acts as a layer between client and xmpp server. Client opens a session with CM. Each session has a unique session id. Once a session is established, client can make requests to the CM, each request having rid (request id). CM creates a stanza corresponding to the request and forwards that to the server. When server responds, the response will be sent to the client.

There must be at least one request always parked with CM so that if any data comes to CM from server for that client, it can send that data to client as response of that request. To make this happen client sends an empty request to the server as soon as it receives a response from the CM.

The technique employed by BOSH achieves both low latency and low bandwidth consumption by encouraging the connection manager not to respond to a request until it actually has data to send to the client. As soon as the client receives a response from the connection manager it sends another request, thereby ensuring that the connection manager is (almost) always holding a request that it can use to ”push” data to the client

Whenever a session is established, client and CM agree on few parameters such as

Parameter specified by client: -

Client specifies a few common parameters as to, version, lang, content, charset but the special parameters are
wait :- this is the maximum amount of time(in seconds) for which CM can hold a request. If no data comes from server the in this time then CM has to send an empty response to the client.
hold :- This is specifies the maximum number of request a CM can hold from a client .If hold is 1 then CM can hold at max one request . If CM is holding one request and client sends another request, then CM will have to send empty response for the first request. If hold is 0, this will be as good as polling.
route :- Client can specify to which server it wants to connect using this param. Suppose it wants to connects to a server called example.com at port 5222, then it will specify route as
route=’xmpp:example.com:5222’
ack :- This is used if client wants to have acknowledgement from CM about the reception of its requests. Punjab does not support ack.



Parameter specified by CM :-

Other then general params like to,version,lang etc CM specifies following

hold :- This is to inform client about the maximum number of requests CM will hold . Its value must be less then or equal to the hold value specified by the client.
polling :- This specifies the minimum time between two consecutive empty requests. This is specified to limit unnecessary polling from client.
inactivity :- If CM does not have any request pending to respond, inactivity is the maximum amount of time CM will wait for any request from client before declaring the session dead. This is very important for detecting a broken connection.
requests :- This is the maximum number of simultaneous requests a client can send. It is recommended to be one more then hold value
maxpause :- If CM accepts pausing the session the it will send maxpause also. This will the maximum limit on the pause attribute specified by the client in some special condition.
accept :- using this CM can specify different encodings it supports.
sid :- CM creates one unique id for each session called sid. All requests for this session will come under this sid and different rids.
window :- it is generally two more then hold. This comes in action in case we use acknowledgements.


All the parameters specified by CM can be configure in Punjab code according to our neeeds.

INACTIVITY
If the connection manager has responded to all the requests it has received within a session and the time since its last response is longer than the maximum inactivity period, then it should assume the client has been disconnected and terminate the session without informing the client. If the client subsequently makes another request, then the connection manager should respond as if the session does not exist. Inactivity should be kept smaller to detect disconnections at the earliest.

PAUSE
If client beforehand knows that it will be disconnect for some time(such as page reload), then it can sent a pause request to CM(only if CM specified maxpause at the time of session establishment). CM will then respond to all pending requests immediately and temporarily set the inactivity to pause specified in request.

TERMINATE
Client can any time send request specifying type as terminate. In that case CM will close the session, discard the session key and will send an closing stream to the server(in case of Punjab).

OVERACTIVITY
There are two cases of inactivity
1) If the number of requests parked with the CM is more then ‘requests’ specified by CM in session starting and last request is not pause or terminate request then CM will terminate the session telling by sending terminate as response and under condition policy-violation.
2) if number of requests parked with CM is equal to ‘requests’ specified by CM, and last request is empty, last request id not pause or terminate, and the last two requests came in a time interval less then polling then also CM will terminate the session telling policy-violation.

In-Order Message Forwarding
When a client makes simultaneous requests, the connection manager might receive them out of order. The connection manager MUST forward the payloads to the server and respond to the client requests in the order specified by the ’rid’ attributes. The client MUST process responses received from the connection manager in the order the requests were made.
The connection manager SHOULD expect the ’rid’ attribute to be within a window of values greater than the ’rid’ of the previous request. The size of the window is equal to the maximum number of simultaneous requests allowed by the connection manager. If it receives a request with a ’rid’ greater than the values in the window, then the connection manager MUST terminate the session

Broken Connections
CM keeps responses it sends to the client in a buffer. The number of responses to non-pause requests kept in the buffer SHOULD be either the same as the maximum number of simultaneous requests allowed by the connection manager or, if acknowledgements are being used, the number of responses that have not yet been acknowledged.

Use of tail and head


Have you ever faced a situation where you have a big text file but you are interested in a small portion of it ?
I had a situation like that a few days back . I had logs of a server of one hour duration but i was interested in only two minutes of that. While surfing I found a technique which proved very useful .

Suppose that I have logs from 09:00:00 to 10:00:00 in a file say 2010-11-12-09 but you are only interested only in logs from 09:34:54 to 09:36:54 then this is how you can do it very easily…

First grep for some text you know you will only find at  09:34:54 and get the line number for that (in my case I had timestamp i.e. 09:34:54 itself).

grep -n "0934:54" 2010-11-12-09

Say this is 100000 
Similarly find line number for 09:36:54 

grep -n "09:36:54" 2010-11-12-09

Say this is 123456
Now you can use tail and head commands to extract the data you needed.

tail -n +100000 2010-11-12-09 | head -n 23456 > filename

Here tail will give us the whole file data starting fro line number 100000 and on that data head will read first 23456 lines and '>' will send that to the filename you specify .

Now you can do whatever operation on that file as you want.