Fun with Data Science

“In God we trust, all others bring data.” – William Edwards Deming

We are all fascinated by data, no matter how vast or insignificant it is. From sipping tea over the morning newspaper as a child to endlessly scrolling through Twitter threads today, I am no exception. I am a beginner in the field of “data science” (I started about a year ago), and here are a few glimpses of my mini-projects.

Expected Goals (xG) Model: My favorite sport, football (soccer), differs from most other sports at a very basic level: it is notoriously low scoring and hence unpredictable at the level of a single match. Luck plays a much greater role in football than we would like to admit, so individual match results, and even league tables, can mislead to a certain extent. To quantify this uncertainty in goal scoring, Sam Green (2012) of the sports analytics company Opta introduced a metric called Expected Goals (xG). Put simply, xG measures the likelihood that a shot becomes a goal. Not all shots are equal in quality; one might be a speculative 40-yarder, and another a two-yard tap-in. xG therefore rates the quality of each chance at the moment the shot is taken, using many factors, including a) the distance from the goal, b) the shot angle, c) whether the shot was taken with the head or with the stronger/weaker foot, and d) whether it came from a cross, through ball, short pass, etc. The xG value is always a number between zero (no chance of a goal) and one (a certain goal). For example, a shot with an xG of 0.3 would be expected to produce a goal 3 times out of 10 in that situation.
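To make the two geometric factors concrete, here is a minimal sketch of how distance and shot angle could be computed from a shot's position. The coordinate system is my assumption for illustration (metres, with the origin at the centre of the goal line), and the function names `shot_distance` and `shot_angle` are hypothetical. The angle here is the one subtended by the two goalposts as seen from the ball, which is one common definition of "shot angle"; the 7.32 m goal width is the standard one.

```python
import math

GOAL_WIDTH_M = 7.32  # standard goal width (8 yards)


def shot_distance(x: float, y: float) -> float:
    """Straight-line distance from the ball to the centre of the goal.

    x: distance from the goal line, in metres (assumed coordinate system)
    y: sideways offset from the centre of the goal, in metres
    """
    return math.hypot(x, y)


def shot_angle(x: float, y: float) -> float:
    """Angle (radians) between the lines from the ball to each goalpost.

    With posts at (0, +w/2) and (0, -w/2), the tangent of the angle is
    w*x / (x^2 + y^2 - (w/2)^2). atan2 keeps the result in [0, pi]
    even when the ball is very close to the goal line.
    """
    half_width = GOAL_WIDTH_M / 2
    return math.atan2(GOAL_WIDTH_M * x, x**2 + y**2 - half_width**2)


# A central shot from 11 m versus a shot from a wide position.
for x, y in [(11.0, 0.0), (11.0, 15.0)]:
    print(
        f"x={x:.0f} m, y={y:.0f} m: "
        f"distance={shot_distance(x, y):.1f} m, "
        f"angle={math.degrees(shot_angle(x, y)):.1f} deg"
    )
```

A close, central shot sees a wide opening between the posts, while a distant or wide shot sees a narrow one, which is the intuition behind using both features.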

Probability of scoring goal as a function of distance and angle

Looking at the English Premier League data (2017/18 season), the first thing that catches the eye is that the probability of scoring decreases as the shooter moves away from the goal and increases as the shooting angle gets wider. Both trends can be viewed as one- and two-dimensional histogram plots (see above). We then fit a logistic regression model from scikit-learn, first with distance as the only predictor; its sensitivity and specificity turned out to be 0.38 and 0.91, respectively. Fitting with angle as the only predictor gave a sensitivity of 0.39 and a specificity of 0.95. The final model, which uses both distance and angle, gave a sensitivity of 0.50 and a specificity of 0.93. The fitted coefficients produce a nice two-dimensional contour plot of the scoring probability (see below).

Probability plot of my xG model
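For readers who want to try the same kind of fit, here is a minimal sketch of a two-feature logistic regression with scikit-learn, followed by the sensitivity and specificity calculation. It is not my original analysis: the shots are synthetic, the "true" coefficients used to generate them are invented, and the 0.5 decision threshold is an assumption, so the printed numbers will not match the values quoted above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic shots (NOT the Premier League data), purely for illustration.
n_shots = 5000
distance = rng.uniform(1.0, 35.0, n_shots)  # metres from goal
angle = rng.uniform(0.05, 1.5, n_shots)     # radians between the posts

# Invented "true" relationship: farther is worse, wider angle is better.
true_logit = 0.5 - 0.15 * distance + 1.5 * angle
p_goal = 1.0 / (1.0 + np.exp(-true_logit))
is_goal = (rng.random(n_shots) < p_goal).astype(int)

X = np.column_stack([distance, angle])
X_train, X_test, y_train, y_test = train_test_split(
    X, is_goal, test_size=0.25, random_state=0, stratify=is_goal
)

# Fit the two-predictor model; predict_proba gives the xG of each shot.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
xg = model.predict_proba(X_test)[:, 1]


def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """Return (sensitivity, specificity) at an assumed probability threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)


sens, spec = sensitivity_specificity(y_test, xg)
print("coefficients (distance, angle):", model.coef_[0])
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Sensitivity is the fraction of actual goals that the model labels as goals, and specificity is the fraction of misses it labels as misses. Because goals are rare, a model of this kind tends to have high specificity and much lower sensitivity, and both numbers shift if a threshold other than 0.5 is chosen.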

As the moderate sensitivity (0.50) shows, our model leaves out many other factors that lead up to a goal. As mentioned, goals are to a large extent random events, where combinations of many sub-events decide the chance of scoring. We used only two quantifiable variables (distance and angle), but there are many more variables that are hard to translate into categories or quantities. Decision tree-based methods on top of this model might increase the sensitivity. Nonetheless, my model shows that even a complex game like football can be partly captured by a simple two-parameter model. My next project will be an improvement of the current model.

National Family Health Survey, India (NFHS-5): The National Family Health Survey (NFHS) is a large-scale, multi-round survey conducted in a representative sample of households throughout India. My point of interest is the state of West Bengal, my home state.

West Bengal has a 21.3% prevalence of diabetes and approximately one crore (10 million) adults with type II diabetes (NFHS-5). The district-level data suggest that the southern districts are more vulnerable, probably owing to lifestyle, eating habits, stress, etc.

The district-wise breakdown of child marriages and alcohol consumption in West Bengal is also very interesting to visualize.