Thursday, June 25, 2015

How should I start Big Data?

Many companies are realizing the benefits of Big Data and are starting to use their data more effectively. Big Data gives companies an opportunity to harness the power of real-time information and use analytics to interact with their consumers in real time.
We know the advantages of Big Data: understanding your customers, improving customer loyalty, and gaining a competitive advantage. I have recently started to learn Big Data. At the beginning, I asked this very popular question many times: “How should I start Big Data?”
Actually, this is a great question, since there are so many resources for learning about Big Data that it is difficult to pick one to start with. I therefore decided to write this post to share a summary of what I found.
So how do we start our journey to Big Data? Here are five tips I recommend to help you get started.

1-Learning the tools and technologies: start with any tool you have access to, such as Python, SAS, SPSS, SQL, or R (which is available as open source), and try to learn it at a deep and practical level. Once you have some knowledge, you can search for and study related topics to build on what you already know. Remember that people with deep knowledge of one particular tool are preferred over those who know a little bit about everything! So it is strongly recommended to master one tool and a few of its techniques; this gives you a better chance of getting opportunities and delivering on them. For example, you can try Introduction to Data Science from the University of Washington on the Coursera website. Just remember to plan which tools and technologies you want to learn, so that you head in the right direction.

2-Learning the tricks: a supplementary step in mastering a tool is learning its tricks from an experienced colleague in your company or from professional courses. Note that self-study courses and tutorials mostly will not give you the key secrets and tricks that are crucial for solving real-life problems.

3-Look for an opportunity to apply analytics in your organization. Usually the difficult part is identifying where to start. If you know the sources of data and where data is being collected (such as a data repository) for a certain business process, then you have a good chance to use it in your first Big Data scenario. Start by generating simple insights from the data that are not currently captured in the business reports, and create simple metrics that add tremendous value to the business and that you can show to the top managers interested in what you are doing. Remember that most organizations do not even extract the most obvious insights from their data.

4-Create a case study of your work and show your analytics to your superiors. If they don’t support you, plan a job search at outside companies that need your new skills.

5-Read more and more: it is strongly recommended to follow blogs and join forums on Big Data, keep a close eye on companies in related domains, and participate in the latest Big Data discussions and events, for example on LinkedIn. This helps you stay aware of how Big Data is being applied in different business applications and functions, and increases your knowledge.

Non-associative property of floating-point operations


You may be wondering how changing the order in which additions are carried out can produce different values, as hinted in my last post. I decided to write this post to show that floating-point operations do not necessarily have the associative property. Before diving into the details, let's start with an example:

a=0.1
b=0.2
c=0.3
M=(a+b)+c
N=a+(b+c)
print(M, digits=20)
print(N, digits=20)
You will most likely see the following results on your screen:

[1] 0.60000000000000008882
[1] 0.5999999999999999778



Surprisingly, not only do floating-point operations fail to be associative, but the distributive property does not hold either! The following example demonstrates this:


a=100
b=0.1
c=0.2
M=a*(b+c)
N=a*b+a*c
print(M, digits=20)
print(N, digits=20)

R produced the following result:
[1] 30.000000000000003553
[1] 30


In mathematics, the associative property is a property of some binary operations. In propositional logic, associativity is a valid rule of replacement for expressions in logical proofs.

Within an expression containing two or more occurrences in a row of the same associative operator, the order in which the operations are performed does not matter as long as the sequence of the operands is not changed. That is, rearranging the parentheses in such an expression will not change its value. 
--
Wikipedia definition

Although the addition operator is, in theory, associative for all real numbers, floating-point operations, as defined in the IEEE-754 standard, are not associative numerically.

More specifically, on massively multi-threaded systems, the order in which floating-point operations are performed inside the machine is non-deterministic. Because intermediate values have to be rounded or truncated to fit in the available precision, this leads to non-deterministic propagation of numerical error.
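
A small single-threaded sketch can imitate this effect. It is written in Python rather than R (both use IEEE-754 doubles), and the helper name `left_to_right_sum` and the input values are assumptions chosen for illustration; no real threads are involved. Summing the same four numbers in two different orders stands in for two different thread schedules, and `math.fsum` gives the correctly rounded reference value.

```python
import math

def left_to_right_sum(values):
    """Add the values one by one, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total

# Hypothetical inputs: two large values that cancel, two small ones.
values = [1e16, 1.0, -1e16, 1.0]

forward = left_to_right_sum(values)             # one possible order
backward = left_to_right_sum(reversed(values))  # another possible order
reference = math.fsum(values)                   # correctly rounded sum

print(forward, backward, reference)  # 1.0 0.0 2.0
```

An explicit loop is used instead of the built-in `sum`, because recent Python versions may apply compensated summation to floats and hide the effect.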

Floating-point arithmetic is known to be non-associative because the limited precision of the representation requires intermediate values to be rounded. The IEEE-754 standard provides uniform semantics for operations across a wide range of implementations. It defines the correct behavior for all operations, as well as any necessary rounding.
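
To make the rounding of intermediate values visible, here is a minimal Python sketch (Python instead of R, on the assumption that both use the same IEEE-754 doubles). It compares each floating-point sum with the exact rational sum of the same operands. The function name `rounding_error` is hypothetical.

```python
from fractions import Fraction

def rounding_error(x, y):
    """Exact difference between the rounded float sum x + y and the
    true sum of the two stored values."""
    exact = Fraction(x) + Fraction(y)   # no rounding: rational arithmetic
    return Fraction(x + y) - exact      # what the float addition added or lost

a, b, c = 0.1, 0.2, 0.3

# The first intermediate result differs depending on the grouping,
# so each grouping picks up a different rounding error.
print(rounding_error(a, b) != 0)   # True: a + b is rounded
print((a + b) + c)                 # 0.6000000000000001
print(a + (b + c))                 # 0.6
```

Operands whose sum is exactly representable, such as 0.5 and 0.25, give a rounding error of zero, which is why the effect does not appear with every choice of numbers.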

So if you want to implement your own algorithm, or reimplement an existing algorithm and port it to another programming language, be careful about this issue.
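
One common precaution when checking a port against the original is to compare results with a tolerance instead of testing for exact equality. The sketch below uses Python's standard `math.isclose`. The wrapper name `same_result` and the tolerance `1e-9` are assumptions for illustration, and a suitable tolerance depends on the algorithm.

```python
import math

def same_result(x, y, rel_tol=1e-9, abs_tol=0.0):
    """True if x and y agree up to the given tolerances."""
    return math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)

m = (0.1 + 0.2) + 0.3   # grouping used by the "original" code
n = 0.1 + (0.2 + 0.3)   # grouping used by the "ported" code

print(m == n)             # False: the two groupings round differently
print(same_result(m, n))  # True: they agree within the tolerance
```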

Friday, June 19, 2015

Perpetual motion


   The idea of building a machine that consumes no energy to work dates back to the Middle Ages. Among the most famous attempts is da Vinci's overbalanced wheel. The idea has since been ruled out by the laws of thermodynamics, but its fans never give up, and neither do the magicians who earn money from their shows!
   They apply their creativity to reconciling the main idea with the laws of thermodynamics. Instead of speaking of "no energy", they decided to invoke an energy they call “the vacuum energy”. My point is not whether it really works or not; I'm trying to show you their creativity!



Thursday, June 18, 2015

PolyStat logo

Hello everyone,
Below is the PolyStat logo:



For those who were not present at the meeting where we chose this logo, here is a short description of it. As you can see in the figure, it shows people around a round table, conveying that PolyStat is a group of students and researchers who share their ideas and meet to discuss their research work. The four colours of the logo (red, green, orange and blue) are those found in the Polytechnique logo.
We have already agreed on the logo, but if you have other suggestions for the grey colour in the logo or the font chosen for PolyStat, feel free to leave comments.

Sunday, June 7, 2015

Simple question

Who thinks that \(a+(b+c)\) is always equal to \((a+b)+c\)?
Your answers, right or wrong, are welcome in the comments.

Saturday, June 6, 2015

Google App Inventor

I just learned that a group of researchers at MIT came up with App Inventor, software that lets kids write mobile applications. At first sight, it looks like Visual Studio.


It seems coding is being revolutionized again, moving from complicated technical commands to visual and cognitive signs. I remember when MS Windows 3.1 pushed aside MS-DOS and UNIX with a similar idea. This time, the revolution is happening at the app-developer level.
I was surprised by the functions available in App Inventor. For instance, speech recognition is just a box! You can easily add it as a feature of your app.



I would say that the difference between Java and App Inventor is like the difference between R and RapidMiner.

Thursday, June 4, 2015

A speed of more than 50 km/s



After the Moon, the next destination is the planet Mars. Travelling to Mars is no longer a dream or an impossible mission. Thanks to today's new technologies, this planet will soon be within human reach.
One of the problems facing this mission is finding a faster means of propulsion, one that could change the way we travel in space. In the United States, private laboratories have begun to compete with NASA and are trying to create a means of propulsion faster than that of today's rockets. The Ad Astra Rocket company is building a revolutionary rocket engine that has entered its final evaluation phase. It is the Variable Specific Impulse Magnetoplasma Rocket, or VASIMR. Its inventor, Franklin Chang-Diaz, is a former astronaut and a veteran of seven space missions. The VASIMR engine is a new device that lets us move faster through space while consuming less fuel, and so pave the way for the conquest of the entire solar system.

Variable specific impulse magnetoplasma engine

When the engine is switched on, an inert gas, argon, is sent into the first ignition chamber, as illustrated in the figure above. There it is bombarded by powerful radio waves that strip the electrons from its atoms. The argon thus turns into plasma, a soup of particles carrying considerable energy. This plasma is then channelled to the second ignition stage, where it reaches millions of degrees. There, a system of magnets expels it forcefully outward at more than 50 km/s. All rocket engines work the same way, using the principle of action and reaction. VASIMR follows the same process but eliminates every chemical component. Instead, electrical energy is used to heat the argon to extreme temperatures close to those of the Sun. So we are talking not about a few thousand degrees but about several million degrees. Thanks to this temperature, VASIMR could go faster and farther with much less fuel. Ultimately, this prototype is expected to allow humans to reach the planet Mars in less than two months, three times faster than the conventional engines currently used in spacecraft. But these incredible temperatures are also an obstacle. The problem is how to contain something that hot without melting everything around it. This problem is solved through the action of very powerful magnetic fields. But once it leaves the engine, the plasma is a hazard capable of melting the entire laboratory where the test takes place. To avoid such a catastrophe, the test lasts only a few minutes.


Spacecraft equipped with variable specific impulse magnetoplasma engines

R-GPU on our Gupiter

As you may know, our GPU server is called gupiter, a name voted on by the research members in our last formal meeting.

I just installed the gputools package in R (with a lot of difficulty). Now you can compute on the Tesla K40 from R. It is enough to load the gputools library and use the CUDA-wrapped functions.
Moreover, gupiter can now give you access to RStudio through RStudio Server. Since we have only two seats at gupiter, I thought it would be much better to provide access to the other team members. Now you can have remote access (with graphical support) through RStudio Server.
Sajjad already made a post comparing GPU with CPU in MATLAB on gupiter. I just ran a toy example. Let's generate a 1024x1024 random matrix and compute the Euclidean distances between its rows. For this example, the GPU seems to be about 30 times faster! I am sure we can get much more efficiency on larger data.

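library(gputools)  # provides gpuDist(), a GPU counterpart of dist()
set.seed(5446)     # make the random matrix reproducible
# 1024 x 1024 matrix of standard normal draws
X <- matrix(rnorm(2^20), ncol = 2^10)
cputime <- system.time(d <- dist(X))        # time the CPU version
gputime <- system.time(gpud <- gpuDist(X))  # time the GPU version
# CPU timings divided by the GPU user time
cputime/gputime[1]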
##     user   system  elapsed 
## 31.11688  0.00000 31.12338
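
For readers unfamiliar with `dist()`, the quantity being timed above is the Euclidean distance between every pair of rows of the matrix. The plain-Python sketch below illustrates that computation on a tiny input. The function name `pairwise_distances` and the sample rows are assumptions for illustration, and this is not how gputools or R compute it internally.

```python
import math

def pairwise_distances(rows):
    """Euclidean distance between every pair of rows.

    Returns a dict keyed by (i, j) with i > j, i.e. the lower triangle
    of the distance matrix.
    """
    n = len(rows)
    return {
        (i, j): math.dist(rows[i], rows[j])
        for i in range(n)
        for j in range(i)
    }

# Hypothetical 3 x 2 example; the post uses a 1024 x 1024 matrix.
rows = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
distances = pairwise_distances(rows)
print(distances[(1, 0)], distances[(2, 0)], distances[(2, 1)])  # 5.0 10.0 5.0
```

With 1024 rows there are 1024 * 1023 / 2 = 523776 such pairs, each independent of the others, which is the kind of workload a GPU can process in parallel.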