Data Science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insight from structured and unstructured data.
Data scientists must become detectives when trying to figure out patterns: they must investigate leads and try to understand the characteristics of their data sets, which requires a significant amount of analytical creativity.
Data Science employs techniques and theories drawn from many fields: mathematics, statistics, information science, and computer science. Data Science is often used interchangeably with earlier concepts like Business Analytics, Business Intelligence, Predictive Modeling, and Statistics.
Data science is done through traditional methods like regression and cluster analysis or through unorthodox machine learning techniques. Similar to data mining and big data analytics, data science uses powerful hardware, powerful programming systems, and efficient algorithms in order to solve problems.
Data science is about diving in at a granular level to mine, observe, and comprehend complex behaviors, trends, and inferences in order to uncover insight found within the data.
The Data in Data Science
Data is the foundation of data science; it is the material on which all the analyses are based. In the context of data science, there are two types of data: traditional data and big data.
Traditional data is data that is structured and stored in databases which analysts can manage from one computer; it is in table format, containing numeric or text values.
Big data differs from traditional data in more than size. It is characterized by variety (numbers and text, but also images, audio, mobile data, etc.), velocity (retrieved and computed in real time), and volume (measured in terabytes, petabytes, or exabytes), and it is usually distributed across a network of computers.
What do you do to data in data science?
Before being ready for processing, all data goes through pre-processing. This is a necessary group of operations that convert raw data into a format that is more understandable and useful for further processing.
Common processes are:
- Collect raw data and store it on a server
This is untouched data that scientists cannot analyze right away. This data can come from surveys, or through the more popular automatic data collection paradigm, like cookies on a website.
- Class-label the observations
This consists of arranging data by category or labelling each data point with the correct data type, for example numerical or categorical.
- Data cleansing / data scrubbing
This means dealing with inconsistent data, such as misspelled categories and missing values.
- Data balancing
If the data is unbalanced, so that the categories contain unequal numbers of observations and are thus not representative, data balancing methods fix the issue, for example by extracting an equal number of observations from each category before processing.
- Data shuffling
This means re-arranging data points to eliminate unwanted patterns and improve predictive performance. For example, if the first 100 observations in the data come from the first 100 people who used a website, the data isn’t randomized, and patterns due to sampling emerge.
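The pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the survey rows, the function names (`label_type`, `cleanse`, `balance`, `shuffle`), and the choice to drop rows with missing values rather than fill them in are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical raw survey rows: (age as text, yes/no answer as typed by the user).
RAW_ROWS = [
    ("34", "Yes"), ("41", "yes "), ("", "No"), ("29", "YES"),
    ("52", "no"), ("38", "yes"), ("45", "Yes"), ("27", "No"),
]


def label_type(values):
    """Class-label a column as 'numerical' or 'categorical'."""
    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    present = [v for v in values if v != ""]
    if present and all(is_number(v) for v in present):
        return "numerical"
    return "categorical"


def cleanse(rows):
    """Drop rows with missing values and normalize inconsistent spellings."""
    cleaned = []
    for age, answer in rows:
        if age.strip() == "" or answer.strip() == "":
            continue  # missing value: skip this observation
        cleaned.append((int(age), answer.strip().lower()))
    return cleaned


def balance(rows, rng):
    """Downsample so that every category keeps the same number of rows."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row[1]].append(row)
    smallest = min(len(group) for group in by_category.values())
    balanced = []
    for group in by_category.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


def shuffle(rows, rng):
    """Return the rows in random order so collection order carries no pattern."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean_rows = cleanse(RAW_ROWS)
balanced_rows = balance(clean_rows, rng)
ready_rows = shuffle(balanced_rows, rng)
```

With the sample rows, cleansing leaves seven observations (five "yes", two "no"), and balancing keeps two of each. Downsampling is only one balancing strategy; it discards data, which may not be acceptable for small data sets.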
Big Data in Data Science
When it comes to big data and data science, there is some overlap of the approaches used in traditional data handling, but there are also a lot of differences.
First of all, big data is stored on many servers and is far more complex.
In order to do data science with big data, pre-processing is even more crucial, as the data is much more complex. You will notice that conceptually, some of the steps are similar to traditional data pre-processing, but that’s inherent to working with data.
- Collect the data
- Class-label the data
Keep in mind that big data is extremely varied; therefore, instead of ‘numerical’ vs ‘categorical’, the labels are ‘text’, ‘digital image data’, ‘digital video data’, ‘digital audio data’, and so on.
- Data Cleansing
The methods here are massively varied, too; for example, you can verify that a digital image observation is ready for processing; or a digital video, or…
- Data Masking
When collecting data on a mass scale, masking aims to ensure that any confidential information in the data remains private, without hindering the analysis and extraction of insight. The process involves concealing the original data with random and false data, allowing the scientist to conduct their analyses without compromising private details. Naturally, the scientist can apply masking to traditional data too, and sometimes does, but with big data the information can be much more sensitive, which makes masking a lot more urgent.
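As a rough illustration of data masking, the sketch below swaps confidential values for random tokens while leaving the analytical fields untouched. The records, the field names, and the `mask_records` helper are hypothetical, and real masking tools offer many more strategies than this one.

```python
import secrets

# Hypothetical customer records; 'name' and 'email' are treated as confidential.
records = [
    {"name": "Alice Smith", "email": "alice@example.com", "purchases": 3},
    {"name": "Bob Jones", "email": "bob@example.com", "purchases": 7},
    {"name": "Alice Smith", "email": "alice@example.com", "purchases": 1},
]

SENSITIVE_FIELDS = ("name", "email")


def mask_records(rows, sensitive_fields):
    """Replace each sensitive value with a random token.

    The same original value always maps to the same token within one call,
    so grouping and counting still work on the masked data.
    """
    token_for = {}  # (field, original value) -> random stand-in
    masked = []
    for row in rows:
        new_row = dict(row)  # copy, so the original records stay unchanged
        for field in sensitive_fields:
            key = (field, row[field])
            if key not in token_for:
                token_for[key] = f"{field}_{secrets.token_hex(4)}"
            new_row[field] = token_for[key]
        masked.append(new_row)
    return masked


masked_records = mask_records(records, SENSITIVE_FIELDS)
```

Because the mapping is consistent, an analyst can still count purchases per (masked) customer without ever seeing a real name or email address.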
Where does data come from?
Traditional data may come from basic customer records, or historical stock price information.
Big data, however, is all around us. A steadily growing number of companies and industries use and generate big data. Consider online communities such as Facebook, Google, and LinkedIn, or financial trading data. Temperature measuring grids in various geographical locations also amount to big data, as does machine data from sensors in industrial equipment. And, of course, wearable tech.
Who handles the data?
The data specialists who deal with raw data and pre-processing, and with creating and maintaining databases, go by different names. Although their titles sound similar, there are palpable differences in the roles they occupy. Consider the following.
Data Architects and Data Engineers (and Big Data Architects, and Big Data Engineers, respectively) are crucial in the data science market. The former creates the database from scratch; they design the way data will be retrieved, processed, and consumed.
The data engineer, in turn, uses the data architect’s work as a stepping stone and pre-processes the available data. They are the people who ensure the data is clean, organized, and ready for the analysts to take over.
The Database Administrator, on the other hand, is the person who controls the flow of data into and from the database. Of course, with Big Data almost the entirety of this process is automated, so there is no real need for a human administrator. The Database Administrator deals mostly with traditional data.
That said, once data processing is done, and the databases are clean and organized, the real data science begins.
There are also two ways of looking at data: with the intent to explain behavior that has already occurred or to predict future behavior that has not yet happened.
Before data science jumps into predictive analytics, it must look at the patterns of behavior the past provides, analyze them to draw insight and inform the path for forecasting.
Business intelligence focuses on providing data-driven answers to questions like: How many units were sold? In which region were the most goods sold? Which type of goods sold where? How did the email marketing perform last quarter in terms of click-through rates and revenue generated? How does that compare to the performance in the same quarter of last year?
What does Business Intelligence do?
Business Intelligence Analysts apply Data Science to measure business performance.
The starting point of all data science is data. Once the relevant data is in the hands of the BI Analyst (monthly revenue, customer, sales volume, etc.), they must quantify their observations, calculate KPIs, and examine measures to extract insights from their data.
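The kind of questions listed earlier (units sold, best-performing region, click-through rate, comparison with the same quarter last year) reduce to simple aggregations. The sketch below shows one way to compute them; all figures, field names, and function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sales records: (region, units sold, revenue).
sales = [
    ("North", 120, 2400.0),
    ("South", 80, 1600.0),
    ("North", 60, 1200.0),
    ("East", 150, 3000.0),
]

# Hypothetical email campaign figures for the same quarter in two years.
campaign = {
    "last_year": {"emails_delivered": 10_000, "clicks": 250, "revenue": 5_000.0},
    "this_year": {"emails_delivered": 12_000, "clicks": 360, "revenue": 6_000.0},
}


def units_by_region(rows):
    """Sum the units sold in each region."""
    totals = defaultdict(int)
    for region, units, _revenue in rows:
        totals[region] += units
    return dict(totals)


def click_through_rate(stats):
    """Clicks divided by delivered emails."""
    return stats["clicks"] / stats["emails_delivered"]


def year_over_year_change(current, previous):
    """Relative change versus the same period one year earlier."""
    return (current - previous) / previous


region_totals = units_by_region(sales)
total_units = sum(region_totals.values())
top_region = max(region_totals, key=region_totals.get)
ctr_this_year = click_through_rate(campaign["this_year"])
ctr_last_year = click_through_rate(campaign["last_year"])
revenue_growth = year_over_year_change(
    campaign["this_year"]["revenue"], campaign["last_year"]["revenue"]
)
```

With these sample numbers, 410 units were sold in total, North is the top region with 180 units, the click-through rate rose from 2.5% to 3%, and campaign revenue grew by 20% year over year.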
Data Science is about Telling a Story
Apart from handling strictly numerical information, data science, and specifically business intelligence, is about visualizing the findings, and creating easily digestible images supported only by the most relevant numbers. All levels of management should be able to understand the insights from the data and inform their decision-making.
Business intelligence analysts create dashboards and reports, accompanied by graphs, diagrams, maps, and other comparable visualizations to present the findings relevant to the current business objectives.
Where is business intelligence used?
Price Optimization and Data Science
Notably, analysts apply data science to inform things like price optimization techniques. They extract the relevant information in real time, compare it with historical data, and take action accordingly.
Consider hotel management behavior: management raises room prices during periods when many people want to visit the hotel and reduces them to attract visitors in periods of low demand.
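The hotel example can be expressed as a simple pricing rule that compares current demand with its historical level. This is only a sketch: the function name `adjust_room_price`, the linear adjustment, and the `sensitivity`, `floor`, and `ceiling` parameters are assumptions chosen for illustration, not an established pricing model.

```python
def adjust_room_price(base_price, current_occupancy, typical_occupancy,
                      sensitivity=0.5, floor=0.7, ceiling=1.5):
    """Scale the base price by how far demand is from its historical level.

    `sensitivity` controls how strongly the price reacts to the demand gap;
    `floor` and `ceiling` cap the resulting price multiplier.
    """
    demand_gap = (current_occupancy - typical_occupancy) / typical_occupancy
    multiplier = 1.0 + sensitivity * demand_gap
    multiplier = max(floor, min(ceiling, multiplier))
    return round(base_price * multiplier, 2)


# High season: occupancy well above the historical level for this period.
peak_price = adjust_room_price(100.0, current_occupancy=0.90, typical_occupancy=0.60)

# Low season: occupancy well below the historical level.
quiet_price = adjust_room_price(100.0, current_occupancy=0.30, typical_occupancy=0.60)
```

With a base price of 100, the rule raises the price to 125.0 in the busy period and lowers it to 75.0 in the quiet one. The caps keep extreme demand swings from producing unreasonable prices.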
Inventory Management and Data Science
Data science and business intelligence are invaluable for handling over- and undersupply. In-depth analyses of past sales transactions identify seasonality patterns and the times of the year with the highest sales, which leads to effective inventory management techniques that meet demand at minimum cost.
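One simple way to surface seasonality from past sales is a seasonal index: the average sales for each calendar month divided by the overall average. The sketch below assumes a hypothetical sales history with the same number of observations per month; the data and the `seasonal_indices` helper are illustrative only.

```python
from collections import defaultdict

# Hypothetical monthly unit sales as (year, month, units).
history = [
    (2022, 1, 80), (2022, 6, 100), (2022, 12, 180),
    (2023, 1, 100), (2023, 6, 120), (2023, 12, 200),
]


def seasonal_indices(rows):
    """Return {month: average sales for that month / overall average}.

    An index above 1 marks a month that sells more than average.
    """
    by_month = defaultdict(list)
    for _year, month, units in rows:
        by_month[month].append(units)
    overall_mean = sum(units for _year, _month, units in rows) / len(rows)
    return {
        month: (sum(values) / len(values)) / overall_mean
        for month, values in sorted(by_month.items())
    }


indices = seasonal_indices(history)
peak_month = max(indices, key=indices.get)
```

Here December comes out as the peak month, with an index well above 1, which suggests building up stock ahead of it, while January falls below 1.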
Who does the Business Intelligence branch of data science?
A Business Intelligence Analyst focuses primarily on analyses and reporting of historical data.
The BI consultant is often just an ‘external BI analyst’. Many companies outsource their data science departments. BI consultants would be BI analysts had they been employed in-house; however, their job is more varied, as they hop on and off different projects. The dynamic nature of their role gives the BI consultant a different perspective: whereas the BI analyst has highly specialized knowledge (i.e., depth), the BI consultant contributes breadth.
The BI developer is the person who handles more advanced programming tools, such as Python and SQL, to create analyses specifically designed for the company. It is the third most frequently encountered job position in the BI team.
Data Science requires a blend of Skills including Mathematics
At the heart of mining data for insight and building data products is the ability to view the data through a quantitative lens. There are textures, dimensions, and correlations in data that can be expressed mathematically.
Using data to find solutions becomes a brain teaser in quantitative technique. Solutions to many business problems involve building analytic models grounded in hard math. Understanding the underlying mechanics of those models is key to building them successfully.
A popular misconception is that data science is all about statistics. While statistics is important, it is not the only type of math utilized. There are two branches of statistics: classical statistics and Bayesian statistics. When most people refer to statistics, they are generally referring to classical statistics, but knowledge of both types is helpful.
Furthermore, many inferential techniques and machine learning algorithms lean on knowledge of linear algebra. For example, a popular method to discover hidden characteristics in a data set is singular value decomposition (SVD), which is grounded in matrix math and has much less to do with classical statistics. It is helpful for data scientists to have both breadth and depth in their knowledge of mathematics.
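To make the link between SVD and matrix math concrete, the sketch below estimates the largest singular value of a small matrix, together with its singular vectors, using power iteration on A^T A. It is a teaching sketch in plain Python with a hypothetical ratings matrix; in practice one would normally call a linear algebra library, and the method assumes the starting vector is not orthogonal to the leading singular vector.

```python
import math


def top_singular_triplet(matrix, iterations=100):
    """Estimate the largest singular value and its vectors by power iteration.

    Repeatedly applies A^T A to a vector and renormalizes; the vector
    converges to the leading right singular vector when the largest
    singular value is distinct.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols  # arbitrary non-zero starting vector
    for _ in range(iterations):
        # av = A v
        av = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        # w = A^T (A v)
        w = [sum(matrix[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in av))  # largest singular value
    u = [x / sigma for x in av]                # leading left singular vector
    return sigma, u, v


# Hypothetical user-by-item ratings in which every row follows one pattern.
ratings = [
    [2.0, 1.0],
    [4.0, 2.0],
    [6.0, 3.0],
]
sigma, u, v = top_singular_triplet(ratings)
```

Because every row of `ratings` is a multiple of the same pattern, the single triplet reconstructs the whole matrix: `sigma * u[i] * v[j]` reproduces each entry. That one pattern is the "hidden characteristic" the decomposition uncovers.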
Data Scientists must possess knowledge of Technology and Hacking.
When we refer to hacking, we are not talking about hacking in the sense of breaking into computers. In the tech programmer subculture, the meaning of hacking is the act of using creativity, ingenuity, and technical skills to build things, and find clever solutions to problems.
Data scientists need to be able to code, prototype quick solutions, and integrate with complex data systems. Core languages associated with data science include SQL, Python, R, and SAS, and, on a lower level, Java, Scala, Julia, etc. A hacker must know more than just language fundamentals: a hacker must be a technical ninja, able to creatively navigate technical challenges in order to make their code work.
A data science hacker must be a solid algorithmic thinker, able to break down messy problems and recompose them in ways that are solvable. This skill is critical because data scientists operate within a lot of algorithmic complexity. They need a strong mental comprehension of high-dimensional data in order to establish full clarity on how all the pieces come together to form a cohesive solution.
Data Scientists must Possess a Strong Business Acumen.
It is important for a data scientist to be a tactical business consultant.
Working so closely with data, data scientists are positioned to learn from data in ways no one else can. This creates the responsibility to translate these observations to shared knowledge and contribute to strategy on how to solve core business problems.
Having strong business acumen is just as important as having acumen for tech and algorithms. There needs to be clear alignment between data science projects and business goals. Ultimately, value doesn’t come from data, math, and tech themselves; the real value comes from leveraging all of the above, using data insights as supporting pillars that guide the formation of robust business strategies.
A common personality trait of Data Scientists is that they are deep thinkers with intense intellectual curiosity.
Data science is all about being inquisitive: asking questions, making new discoveries, and learning new things.
Ask a data scientist who is obsessed with their work what drives them, and they won’t say “money”. The real motivation is being able to use their creativity and ingenuity to solve hard problems and constantly indulge their curiosity. Deriving complex reads from data goes beyond just making an observation; it is about uncovering “truth” that lies hidden beneath the surface. Problem solving is not a task, but an intellectually stimulating journey to a solution.
Data Science is about discovering hidden wisdom that can help companies make smarter business decisions.
- Netflix data mines movie viewing patterns to understand what drives user interest and uses this information to make decisions on which Netflix original series to produce.
- Target identifies customer segments within its base and identifies the unique shopping behaviors within each segment; this helps guide messaging to each specific market audience.
- Procter & Gamble utilizes time series models to more clearly understand future demand; this helps P&G optimally plan its production levels.
Data Science and the Development of the “Data Product”
A “data product” is a technical asset that: (1) utilizes data as input, and (2) processes that data to return algorithmically-generated results.
The classic example of a data product is a recommendation engine, which ingests user data and makes personalized recommendations based on that data.
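A toy version of such a recommendation engine fits in a few lines: it takes user purchase histories as input and returns algorithmically generated suggestions. The sketch below is a hypothetical overlap-based recommender with made-up data; it does not describe how any particular company's engine works.

```python
from collections import Counter

# Hypothetical purchase histories: user -> set of item names.
purchases = {
    "ana": {"laptop", "mouse", "keyboard"},
    "ben": {"laptop", "mouse"},
    "cho": {"laptop", "monitor"},
    "dee": {"mouse", "keyboard"},
}


def recommend(user, histories, top_n=2):
    """Suggest items bought by users whose history overlaps with `user`.

    Each candidate item is scored by the summed overlap of the users who
    bought it, so items favoured by more similar users rank higher.
    """
    own_items = histories[user]
    scores = Counter()
    for other, items in histories.items():
        if other == user:
            continue
        overlap = len(own_items & items)
        if overlap == 0:
            continue  # no shared items: this user tells us nothing
        for item in items - own_items:
            scores[item] += overlap
    # Sort by score (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _score in ranked[:top_n]]


suggestions = recommend("ben", purchases)
```

For "ben", who bought a laptop and a mouse, the sketch suggests the keyboard first (bought by the two most similar users) and the monitor second.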
Here are some examples of Data Products:
- Amazon’s recommendation engines use algorithms to suggest items for users to buy.
- Netflix uses algorithms to recommend movies to users.
- Spotify uses algorithms to recommend music to users.
- Gmail’s spam filter is a data product. An algorithm behind the scenes processes incoming mail and determines whether a message is junk or not.
- Computer vision used for self-driving cars is also a data product: a machine learning algorithm recognizes traffic lights, pedestrians, other cars on the road, etc.
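To show what "an algorithm that decides whether a message is junk" can look like, here is a minimal naive Bayes text classifier with add-one smoothing. It is a generic textbook technique trained on four made-up messages, offered as an assumption-laden sketch; it is not a description of how any real mail provider's filter works.

```python
import math
from collections import Counter

# Hypothetical labelled training messages.
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]


def train(messages):
    """Count word occurrences per label and messages per label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the higher log-probability under naive Bayes."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        # Start from the prior: how common this label is overall.
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


word_counts, label_counts = train(training)
verdict = classify("free prize now", word_counts, label_counts)
```

The classifier takes data as input (labelled messages) and returns an algorithmically generated result (a label), which is exactly the two-part definition of a data product given above.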
Data Scientists play a central role in developing data products. Data scientists serve as the technical developers who build, test, refine, and deploy the algorithms.