What makes a good data scientist?

Teradata Australia Pty Ltd

By Ross Farrelly*
Tuesday, 09 December, 2014

Building a successful team of data scientists requires maintaining a balance of skills.

There is a lot of hype around data science at the moment, but there are also plenty of substantial success stories of data scientists plying their trade to solve important problems. So who are these data scientists and how do they do it?

The main role of the data scientist is to extract value from data on behalf of the company. This value is usually delivered in the form of better decision-making.

There are many ways to make decisions: gut feel, instinct, industry knowledge and experience, to name a few. Decisions can also be made scientifically through the methodical collection, analysis and interpretation of data.

In a commercial setting, this is the role of data science: to apply the scientific method to decision-making. While all science depends on data (so all science is data science), there are many ways to use data unscientifically, such as selectively collecting data to support a pre-conceived hypothesis.

Setting the tone

Selecting the right problems to focus on is an important aspect of running a successful data science program. The problems need to meet a number of criteria in order to justify the effort involved in developing a solution. A project prioritisation matrix is a useful tool to quantify and record the project selection process.

Criteria for candidate projects can include: data availability; ease of execution; alignment with corporate strategy; appropriate fit to available technology; ease of solution implementation; and stakeholder support (which can also be captured in a separate stakeholder analysis matrix).
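As a rough illustration of how such a matrix might be quantified, the sketch below scores each candidate project as a weighted sum of ratings against the criteria listed above. The weights, the 1 to 5 rating scale, the project names and the function names are all hypothetical choices made for this example, not a prescribed method.

```python
# Hypothetical weights (summing to 1.0) for the criteria named in the article.
WEIGHTS = {
    "data_availability": 0.25,
    "ease_of_execution": 0.15,
    "strategic_alignment": 0.20,
    "technology_fit": 0.10,
    "ease_of_implementation": 0.15,
    "stakeholder_support": 0.15,
}


def priority_score(ratings, weights=WEIGHTS):
    """Return the weighted sum of the ratings for one candidate project."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    return sum(weights[c] * ratings[c] for c in weights)


def rank_projects(projects, weights=WEIGHTS):
    """Return (name, score) pairs sorted from highest to lowest score."""
    scored = [(name, priority_score(r, weights)) for name, r in projects.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up projects rated 1 (poor) to 5 (excellent), for illustration only.
CANDIDATES = {
    "churn_model": {
        "data_availability": 5,
        "ease_of_execution": 4,
        "strategic_alignment": 4,
        "technology_fit": 4,
        "ease_of_implementation": 3,
        "stakeholder_support": 4,
    },
    "fraud_graph": {
        "data_availability": 2,
        "ease_of_execution": 2,
        "strategic_alignment": 5,
        "technology_fit": 3,
        "ease_of_implementation": 2,
        "stakeholder_support": 3,
    },
}

if __name__ == "__main__":
    for name, score in rank_projects(CANDIDATES):
        print(f"{name}: {score:.2f}")
```

Keeping the weights in one place records the reasoning behind the selection, so the ranking can be revisited if priorities change.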

Once an appropriate project has been selected, the next step is to understand it and clarify the questions to be addressed. Even though new and unexpected insights may be gleaned during the execution of a discovery project, it’s important to be as clear as possible about the initial business questions that are being addressed.

Once the business questions have been clarified, the problem-solving process can begin. The initial stage often involves the design and implementation of a data collection strategy and decisions about large-scale data storage and management.

Problem-solving techniques vary widely depending on the nature of the problem but they typically involve some of the following steps:

Appropriate selection of an analysis tool capable of handling the volume of data and executing the required analyses. It is important to have an understanding of the different types of analyses available to solve the problem, such as: parsing; paths; graph; text; and clustering (both supervised and unsupervised).

Data scientists also need to select technologies that are capable of executing the appropriate analyses on large-scale data. Having competence in the available techniques is itself a multidisciplinary task.

To illustrate the range of skills that may be drawn on, consider the difference between natural language processing, which requires an understanding of grammar, sentence structure and parts of speech, and a graph analysis problem, which demands knowledge of bulk synchronous parallel (BSP) processing or a similar programming framework.

Dimension reduction and variable selection. Much of the growth in data volume in recent years has been driven by the increase in the number of ways we collect data about people and machines and the interactions between them. In statistical modelling terms, this translates into an explosion in the number of possible variables that could be included in a model.

Dimension reduction is now one of the more interesting challenges faced by the data scientist.
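One simple form of variable selection is a correlation filter, which drops any variable that is almost a linear copy of one already kept. The sketch below is a minimal pure-Python version under stated assumptions: the 0.95 threshold, the column names and the sample values are invented for illustration, and practical work would more likely use a statistical library and techniques such as principal component analysis.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column is treated as uncorrelated with everything.
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_variables(columns, threshold=0.95):
    """Greedily keep each column unless it is highly correlated with a kept one."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up data: "age_months" is just "age" times 12, so it adds no information.
COLUMNS = {
    "age": [23, 35, 47, 51, 62],
    "age_months": [276, 420, 564, 612, 744],
    "visits": [5, 1, 4, 2, 3],
}

if __name__ == "__main__":
    print(select_variables(COLUMNS))
```

The filter is order-dependent: the first of two near-duplicate variables is the one that survives, so columns are best listed with the preferred variables first.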

Model improvement. Many problems necessitate the building of predictive models. Having developed an initial model, the challenge is then to improve its accuracy through techniques such as variable augmentation or ensemble methods, which combine a number of different models into a single prediction.
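To make the idea of an ensemble concrete, the sketch below combines several models by taking a (optionally weighted) average of their predictions, which is one of the simplest ensemble approaches. The three model functions are hypothetical stand-ins for models that would normally be fitted to data, and the weights are arbitrary.

```python
def model_a(x):
    """Hypothetical fitted model: stand-in for, say, a linear regression."""
    return 2.0 * x + 1.0


def model_b(x):
    """Hypothetical fitted model with slightly different coefficients."""
    return 1.8 * x + 2.0


def model_c(x):
    """Hypothetical fitted model with no intercept."""
    return 2.3 * x


MODELS = [model_a, model_b, model_c]


def ensemble_predict(models, x, weights=None):
    """Return the weighted average of each model's prediction for x."""
    if weights is None:
        weights = [1.0] * len(models)  # default: plain average
    if len(weights) != len(models):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    return sum(w * m(x) for w, m in zip(weights, models)) / total


if __name__ == "__main__":
    print(ensemble_predict(MODELS, 10.0))
    # Trust model_a twice as much as the others.
    print(ensemble_predict(MODELS, 10.0, weights=[2.0, 1.0, 1.0]))
```

Averaging tends to help when the individual models make different kinds of errors; if they all err in the same direction, the combined prediction inherits that bias.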

Visualisation. Effective problem-solving often requires visualisation. This can range from a quick-and-dirty scatter plot to a complex, animated sigma diagram. Building the right visualisation to reveal the relevant aspect of the data under consideration is no trivial task, and a structured approach is often useful.

Interpreting and verifying results. Cross-checking and asking the question in a different way can help build confidence that the insights are real and worthy of being acted upon.

Communicating the results. This is an important but complex aspect of the data science project. Often the person who has the capability to execute the analysis is not particularly competent at communicating the results to a lay audience. This situation was encapsulated in a joke doing the rounds when I was at university:

Q: How can you tell if you’re talking to an extroverted statistician?

A: He’s looking down at your shoes.

Clearly this is not always the case, but it is not an unknown phenomenon. Conversely, the person who is a good communicator may not have the technical understanding to fully grasp the details of the analysis. He or she may have a tendency to oversimplify the analysis or be unable to handle objections. Ideally you want a good technician who is also a good communicator.

Implementation. Once a meaningful data science problem has been identified and successfully solved, the final step is implementing the solution in the business. This often involves negotiation with business owners and the ability to persuade and lead.

Data science is a wide-ranging discipline comprising skills from in-depth computer programming to business analysis and stakeholder management. With such a broad range of skills, there will always be a trade-off between breadth and depth.

Some data scientists will have in-depth knowledge of one aspect of the discipline and have a working knowledge of other facets while the generalist may have moderate competency in a wider range of skills.

Building a successful team of data scientists requires being aware of the range of requisite skills and hiring to maintain a balance of skills across the team.

*Ross Farrelly is Chief Data Scientist for Teradata ANZ.

Image courtesy r2hox under CC
