Correlation And Causation By Example

src: https://github.com/DataScienceWorks/correlation_causation_blogpost/blob/master/cc_blog_post.ipynb


What should you know, quickly?

Given observations of two features A and B, we say they are correlated when we see a pattern in which their values change together. When the values of A and B increase or decrease together, they are positively correlated. When an increase in A comes with a proportionate decrease in B, and vice versa, they are negatively correlated.
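
To make the definition concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The helper name `pearson` and the numbers are made up for illustration; they are not taken from the post's dataset. A value near +1 means the two features move together, and a value near -1 means they move in opposite directions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values, for illustration only.
a = [1, 2, 3, 4, 5]
b_up = [10, 20, 30, 40, 50]    # rises as a rises
b_down = [50, 40, 30, 20, 10]  # falls as a rises

r_pos = pearson(a, b_up)    # positive correlation
r_neg = pearson(a, b_down)  # negative correlation
print(round(r_pos, 3), round(r_neg, 3))
```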

Correlation is something we can identify visually by plotting the feature values on a graph and comparing their trends for patterns. The plot alone does not tell us what is causing what. In other words, you cannot claim from correlation that one feature causes the other.

When a change in one feature results in a change in the other, we call it causation. When we find correlated features, we dig into the domain and do more research to build the domain knowledge we need before claiming that one feature causes the other.

Understanding with an example

Take the pricing trend of 5 components, Petrol, Diesel, Crude Oil, Water and Pepsi, as shown in the graph.
You will observe a strong correlation between Petrol, Diesel and Crude Oil prices.
You may also see some correlation in the prices of Water and Pepsi.

Although the prices of Water and Pepsi don't seem to change often in the graph, we see a trend of rising prices during the summer months in India. The two appear correlated, and digging deeper we find that increased demand for them during the hot season pushes their prices up.
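
The Water and Pepsi case can be sketched as two series driven by one shared factor. The simulation below is hypothetical: the `heat` index, the base prices and the coefficients are invented for illustration and are not the post's data. Neither price depends on the other in the code, yet the two come out strongly correlated because both respond to the season.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical monthly "heat" index, peaking in the summer months.
heat = [1, 1, 2, 4, 5, 5, 3, 2, 2, 1, 1, 1]

# Each price reacts to heat (through demand) plus a little noise.
# Neither price is computed from the other.
water = [20 + 0.5 * h + random.uniform(-0.2, 0.2) for h in heat]
pepsi = [35 + 1.0 * h + random.uniform(-0.2, 0.2) for h in heat]

r_water_pepsi = pearson(water, pepsi)
print(round(r_water_pepsi, 3))
```

The correlation between the two prices is real, but the cause sits outside both of them, which is why the domain digging described above is needed.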

In the case of the fuel components (or features, in Data Science jargon), we see a strong correlation among the prices of Petrol, Diesel and Crude Oil. From the graph data alone, can we say which feature is causing the price changes in the others? No, we can't.
As Data Scientists or Researchers, we do some homework and learn that Petrol and Diesel are derived from Crude Oil. Armed with this newfound knowledge, and coupling it with the graph data, we can confidently claim that a change in the price of Crude Oil causes, or at least affects, the change in the prices of its derivatives, Petrol and Diesel.
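
A small sketch shows why the numbers alone cannot settle the direction. The prices below are synthetic (invented values and coefficients, not the post's data): Petrol and Diesel are generated from Crude Oil, so the causal direction is known by construction. Even so, the correlation coefficient is symmetric, so it reads the same whichever feature you treat as the cause.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical crude oil prices; the derivatives are computed FROM crude.
crude = [60, 62, 65, 63, 70, 75, 72, 68, 66, 71]
petrol = [1.2 * c + 10 + random.uniform(-1, 1) for c in crude]
diesel = [1.0 * c + 8 + random.uniform(-1, 1) for c in crude]

r_crude_petrol = pearson(crude, petrol)
r_petrol_crude = pearson(petrol, crude)  # arguments swapped
r_petrol_diesel = pearson(petrol, diesel)

# Same value in both directions: correlation carries no arrow.
print(round(r_crude_petrol, 3), round(r_petrol_crude, 3), round(r_petrol_diesel, 3))
```

Petrol and Diesel also correlate strongly with each other here, although neither was computed from the other. Knowing that both derive from Crude Oil is what supplies the direction.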

Summary

Statistics gives you the tools to find correlation among features using quantitative data. To identify which feature, if any, causes the others, you need something more: either experiments designed with the techniques statistics prescribes, or domain knowledge that supplies the qualitative evidence.

By the way, have you ever wondered why we go through the hassle of identifying correlated features and, possibly, the causal feature?

Want to play with the data used to generate the plot in this blog post? Feel free to download or fork it from my GitHub repository.

Happy learning!
