🎯 Learning Objectives: Master NumPy arrays, understand NoSQL databases, implement cosine similarity, and develop systematic problem-solving skills for data science applications.
This comprehensive guide covers the essential topics from Week 5 of the AIO2025 course with interactive visualizations, practical examples, and hands-on exercises designed to solidify your understanding of fundamental data science concepts.
🎯 1. NumPy Basics
NumPy (Numerical Python) is a fundamental library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays.
💻 1.1. Python Lists vs. NumPy Arrays
While Python lists are versatile, NumPy arrays offer significant advantages in terms of performance, memory, and functionality for numerical operations.
Feature | Python List | NumPy Array | Performance Impact |
---|---|---|---|
Data Type | Heterogeneous (mixed types) | Homogeneous (same type) | 🔥 Type checking overhead eliminated |
Memory | Pointers to objects (overhead) | Contiguous memory block | 🚀 50-100x faster access |
Performance | Slower (type checking) | Optimized C code | ⚡ 10-100x faster operations |
Functionality | Basic operations | Vectorized operations | 📊 Broadcasting & advanced math |
Memory Usage | Higher (pointer overhead) | Lower (direct storage) | 💾 2-10x less memory |
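To see the memory difference concretely, here is a minimal sketch (exact byte counts vary by platform, Python version, and default dtype):

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n)

# A list stores pointers plus a separate Python int object per element
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
print(f"Python list: ~{list_bytes} bytes")

# The array stores raw values in one contiguous block
print(f"NumPy array: {np_arr.nbytes} bytes")  # 1000 * 8 = 8000 bytes for int64
```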
🔧 1.2. Array Creation
NumPy provides multiple efficient ways to create arrays, each optimized for different use cases.
📋 Creation Methods Comparison
Method | Use Case | Performance | Memory Efficiency |
---|---|---|---|
np.array() | Convert existing data | ⭐⭐⭐ | ⭐⭐⭐ |
np.zeros() | Initialize with zeros | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
np.ones() | Initialize with ones | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
np.arange() | Sequential numbers | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
np.linspace() | Evenly spaced values | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
From a Python list:
```python
import numpy as np

# Create a list
my_list = [2, 0, 2, 5, 7, 1]

# Convert the list to a NumPy array
my_array = np.array(my_list)
print(my_array)  # [2 0 2 5 7 1]
print(f"Type: {my_array.dtype}, Shape: {my_array.shape}")
# Type: int64, Shape: (6,)
```
Using built-in functions:
```python
# Create an array of 5 floats, initialized with zeros
zeros_arr = np.zeros(5)
print(zeros_arr)  # [0. 0. 0. 0. 0.]

# Create an array of shape (2, 3) filled with ones
ones_arr = np.ones((2, 3))
print(ones_arr)
# [[1. 1. 1.]
#  [1. 1. 1.]]

# Create an array with a range of elements
range_arr = np.arange(0, 10, 2)  # (start, stop, step)
print(range_arr)  # [0 2 4 6 8]

# Create evenly spaced values
linspace_arr = np.linspace(0, 1, 5)  # 5 values from 0 to 1
print(linspace_arr)  # [0.   0.25 0.5  0.75 1.  ]
```
💡 Pro Tip: Use `np.zeros()` or `np.ones()` when you need to initialize large arrays efficiently. They're much faster than creating Python lists first!
🔍 1.3. Indexing and Slicing
Indexing and slicing work similarly to Python lists but can be extended to multiple dimensions with powerful capabilities.
📝 Basic Operations
```python
a_data = np.array([4, 5, 6, 7, 8, 9])

# Accessing elements
print(a_data[2])    # 6
print(a_data[-1])   # 9

# Slicing: array[start:stop:step]
print(a_data[:3])   # [4 5 6] (first 3 elements)
print(a_data[3:])   # [7 8 9] (from index 3 to end)
print(a_data[::2])  # [4 6 8] (every other element)
```
⚠️ Critical Concept: Views vs Copies
Important Note: Unlike Python lists, slices of NumPy arrays are views into the original array, not copies. Modifying a slice will modify the original array.
```python
x = np.array([0., 0.25, 0.5, 0.75, 1.])
y = x[1:4]      # Create a slice (view)
y[-1] = 1000.0  # Modify the slice

print(y)  # [   0.25    0.5  1000.  ]
print(x)  # [   0.      0.25    0.5  1000.      1.  ] -> Original is changed!

# To create a copy, use arr.copy()
z = x[1:4].copy()  # This creates an independent copy
z[0] = 999.0
print(x)  # Original array remains unchanged
```
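If you are unsure whether you are holding a view or a copy, `np.shares_memory()` can tell you; a quick check using the arrays above:

```python
# True means the two arrays overlap in memory (a view); False means independent storage
print(np.shares_memory(x, x[1:4]))         # True  -> slice is a view
print(np.shares_memory(x, x[1:4].copy()))  # False -> copy owns its own memory
```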
🎯 Advanced Indexing Examples
```python
# Boolean indexing
arr = np.array([1, 2, 3, 4, 5, 6])
mask = arr > 3
print(arr[mask])     # [4 5 6]

# Fancy indexing with arrays
indices = np.array([0, 2, 4])
print(arr[indices])  # [1 3 5]

# Conditional replacement
arr[arr > 4] = 0
print(arr)           # [1 2 3 4 0 0]
```
The rows below use `arr = np.array([4, 5, 6, 7, 8, 9])` (the `a_data` values from above):

Operation | Syntax | Result | Memory Impact |
---|---|---|---|
Single Element | arr[2] | 6 | No extra memory |
Slice (View) | arr[1:4] | [5 6 7] | Shared memory ⚠️ |
Slice (Copy) | arr[1:4].copy() | [5 6 7] | New memory allocation |
Boolean Mask | arr[arr > 5] | [6 7 8 9] | New array created |
Fancy Index | arr[[0,2,4]] | [4 6 8] | New array created |
⚡ 1.4. Basic Operations & Vectorization
NumPy allows for element-wise operations, which is called vectorization. This is much faster than looping through elements as you would with Python lists.
🔥 Vectorization Performance
```python
import numpy as np
import time

# Compare performance: loop vs. vectorization
size = 1000000
a = np.random.random(size)
b = np.random.random(size)

# Python loop approach
start = time.time()
result_loop = []
for i in range(size):
    result_loop.append(a[i] + b[i])
loop_time = time.time() - start

# NumPy vectorization
start = time.time()
result_vectorized = a + b
vectorized_time = time.time() - start

print(f"Loop time: {loop_time:.4f}s")
print(f"Vectorized time: {vectorized_time:.4f}s")
print(f"Speedup: {loop_time/vectorized_time:.1f}x faster!")
```
📊 Common Vectorized Operations
```python
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Element-wise addition
print(arr1 + arr2)    # [5 7 9]

# Element-wise multiplication
print(arr1 * 3)       # [3 6 9]
print(arr1 * arr2)    # [ 4 10 18]

# More operations
print(arr1 ** 2)      # [1 4 9] (power)
print(np.sqrt(arr1))  # [1.    1.414 1.732] (square root)
print(arr1 > 2)       # [False False  True] (comparison)
```
🧮 Mathematical Functions
Function | Description | Example Input | Example Output |
---|---|---|---|
np.sum() | Sum of elements | [1, 2, 3] | 6 |
np.mean() | Average value | [1, 2, 3] | 2.0 |
np.std() | Standard deviation | [1, 2, 3] | 0.816 |
np.min() | Minimum value | [1, 2, 3] | 1 |
np.max() | Maximum value | [1, 2, 3] | 3 |
np.argmin() | Index of minimum | [3, 1, 2] | 1 |
np.argmax() | Index of maximum | [3, 1, 2] | 0 |
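A quick sketch exercising the functions from the table:

```python
arr = np.array([3, 1, 2])

print(np.sum(arr))                     # 6
print(np.mean(arr))                    # 2.0
print(np.std(arr))                     # 0.816...
print(np.min(arr), np.max(arr))        # 1 3
print(np.argmin(arr), np.argmax(arr))  # 1 0
```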
🚀 Performance Tip: Always use NumPy’s vectorized operations instead of Python loops when working with arrays. It can be 10-100x faster!
🎯 2. NumPy Programming: 2D & 3D Data
📐 2.1. Multi-dimensional Arrays
NumPy arrays can have multiple dimensions, making them perfect for representing data like matrices (2D) and RGB images (3D).
🔢 Dimensional Concepts
- 1D Array (Vector): `[1, 2, 3]` → shape `(3,)`
- 2D Array (Matrix): `[[1, 2], [3, 4], [5, 6]]` → shape `(3, 2)` (3 rows, 2 columns)
- 3D Array (Tensor): often used for a collection of matrices, like an RGB image → shape `(height, width, channels)`
🔧 2.2. Common Functions for Multi-dimensional Arrays
reshape(new_shape): Changes the shape of an array without changing its data. The total number of elements must remain the same.
```python
data = np.arange(6)  # [0 1 2 3 4 5]
data_reshaped = data.reshape((2, 3))
print(data_reshaped)
# [[0 1 2]
#  [3 4 5]]
```
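One convenience worth knowing: passing `-1` lets NumPy infer that dimension from the total element count.

```python
auto = np.arange(12).reshape(3, -1)  # NumPy infers the second dimension (4)
print(auto.shape)  # (3, 4)
```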
flatten(): Collapses a multi-dimensional array into a single 1D array.
```python
data_2d = np.array([[1, 2], [3, 4]])
flat_data = data_2d.flatten()
print(flat_data)  # [1 2 3 4]
```
sum(axis=…), max(axis=…), min(axis=…): Perform aggregation along a specified axis.
- axis=0: collapses the rows, producing one result per column (column-wise aggregation).
- axis=1: collapses the columns, producing one result per row (row-wise aggregation).
```python
data = np.array([[1, 2], [3, 4]])
print(np.sum(data, axis=0))  # [4 6] (column sums)
print(np.sum(data, axis=1))  # [3 7] (row sums)
```
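The same axis logic extends to 3D arrays such as images; a small sketch:

```python
img = np.zeros((4, 5, 3))           # (height, width, channels)
print(img.mean(axis=(0, 1)).shape)  # (3,)   -> one mean per color channel
print(img.sum(axis=2).shape)        # (4, 5) -> per-pixel sum across channels
```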
📡 2.3. Broadcasting
Broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. The smaller array is “broadcast” across the larger array so that they have compatible shapes.
📐 Broadcasting Rules
- Dimension Alignment: If arrays don’t have the same number of dimensions, prepend the shape of the lower-dimensional array with 1s
- Size Compatibility: For each dimension, sizes must be equal, or one of them is 1
- Output Shape: The size of each dimension in the output is the maximum of the input arrays
💻 Code Example
```python
import numpy as np

# Example: adding a vector to each row of a matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])   # shape (2, 3)
vector = np.array([10, 20, 30])  # shape (3,)

result = matrix + vector  # Broadcasting happens here
print(result)
# [[11 22 33]
#  [14 25 36]]

# The vector is conceptually stretched to shape (2, 3) to match the matrix
```
🎯 Broadcasting Examples
Array 1 Shape | Array 2 Shape | Result Shape | Compatible? |
---|---|---|---|
(3, 4) | (4,) | (3, 4) | ✅ Yes |
(3, 4) | (3, 1) | (3, 4) | ✅ Yes |
(3, 4) | (1, 4) | (3, 4) | ✅ Yes |
(3, 4) | (3, 2) | N/A | ❌ No |
(2, 3, 4) | (3, 4) | (2, 3, 4) | ✅ Yes |
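You can verify the table programmatically with `np.broadcast_shapes()` (available in NumPy 1.20+), which raises a `ValueError` for incompatible shapes:

```python
import numpy as np

print(np.broadcast_shapes((3, 4), (4,)))       # (3, 4)
print(np.broadcast_shapes((3, 4), (3, 1)))     # (3, 4)
print(np.broadcast_shapes((2, 3, 4), (3, 4)))  # (2, 3, 4)

try:
    np.broadcast_shapes((3, 4), (3, 2))
except ValueError as e:
    print("Incompatible:", e)
```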
🖼️ 2.4. Application: Image Representation & Manipulation
Grayscale Image: A 2D NumPy array where each element represents the intensity of a pixel (0=black, 255=white). Shape: (height, width).
RGB Image: A 3D NumPy array. Shape: (height, width, 3). The last dimension represents the three color channels (Red, Green, Blue).
Note: Libraries like OpenCV read images in BGR order by default, while Matplotlib expects RGB. You may need to convert between them: `image_rgb = image_bgr[:, :, ::-1]`.
Brightness Adjustment
Image data is often stored as uint8 (unsigned 8-bit integer, 0-255). Simple addition can cause values to “wrap around” (e.g., 250 + 10 becomes 4, not 255).
```python
import cv2
import numpy as np

# Incorrect way: uint8 addition wraps around (e.g. 250 + 10 -> 4)
image = cv2.imread('image.png')
bright_image = image + 100

# Correct way using np.clip
image = image.astype(np.float32)              # Convert to float to avoid overflow
bright_image = image + 100
bright_image = np.clip(bright_image, 0, 255)  # Clip values to the 0-255 range
bright_image = bright_image.astype(np.uint8)  # Convert back to uint8
```
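An alternative sketch, assuming `image` is still the original `uint8` array: OpenCV's `cv2.add()` uses saturating arithmetic, clamping at 255 instead of wrapping.

```python
# Saturating addition: 250 + 100 clamps to 255 rather than wrapping around
bright_image = cv2.add(image, np.full_like(image, 100))
```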
🗄️ 3. Database - NoSQL
📊 3.1. Introduction to Databases
A database is an organized collection of data. A Database Management System (DBMS) is the software that interacts with users, applications, and the database itself to capture and analyze the data.
🔄 3.2. SQL vs. NoSQL
Aspect | SQL (Relational) | NoSQL (Non-relational) |
---|---|---|
Model | Data is stored in tables with rows and columns | Data can be stored in various models (document, key-value, graph, etc.) |
Schema | Predefined, rigid schema (schema-on-write) | Dynamic or flexible schema (schema-on-read) |
Scalability | Typically scales vertically (increasing power of a single server) | Typically scales horizontally (distributing load across many servers) |
Language | Uses Structured Query Language (SQL) | Varies by database; often called “Not Only SQL” |
Examples | MySQL, PostgreSQL, SQL Server | MongoDB, Redis, Cassandra, Neo4j |
🗂️ 3.3. Types of NoSQL Databases
📄 Document Databases
Store data in documents, similar to JSON objects. Each document contains field-value pairs. The values can be a variety of types, including nested documents and arrays.
- Example: MongoDB
- Use Case: Content management, user profiles
🔑 Key-Value Stores
The simplest model. Every item is stored as a key-value pair.
- Example: Redis, Amazon DynamoDB
- Use Case: Caching, session management
🕸️ Graph Databases
Use nodes and edges to represent and store data. Excellent for exploring relationships between entities.
- Example: Neo4j
- Use Case: Social networks, recommendation engines, fraud detection
📊 Column-Family Stores
Store data in columns rather than rows. Optimized for fast queries over large datasets.
- Example: Cassandra, HBase
- Use Case: Big data analytics, time-series data
🍃 3.4. Introduction to MongoDB
MongoDB is a leading document database.
Database: A container for collections.
Collection: A group of MongoDB documents. It is the equivalent of a table in a relational database.
Document: A set of key-value pairs, represented in a format called BSON (Binary JSON). Documents have a dynamic schema. The `_id` field is a unique primary key, added automatically if not provided.
Example Document:
{ "_id": " ObjectId('...') ", "username": "aivn_student", "course": "AIO2025", "enrollment_date": "ISODate('2025-07-01T00:00:00Z')", "scores": [95, 88, 92], "address": { "city": "Hanoi", "country": "Vietnam" }}
🔍 3.5. Basic MongoDB Query Language (MQL)
Insert a document:
```javascript
db.students.insertOne({ name: "Thai", age: 20, likes: ["AI", "Data"] })
```
Find documents:
```javascript
// Find all documents
db.students.find()

// Find documents where age is greater than 21
db.students.find({ age: { $gt: 21 } })

// Find documents matching two conditions (implicit AND)
db.students.find({ age: { $gt: 21 }, likes: "AI" })
```
Update a document:
```javascript
// Find the first student named "Thai" and set their age to 21
db.students.updateOne(
  { name: "Thai" },
  { $set: { age: 21 } }
)
```
Delete a document:
```javascript
db.students.deleteOne({ name: "Thai" })
```
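If you prefer driving MongoDB from Python, here is a minimal sketch using the official PyMongo driver (assumes `pip install pymongo` and a local `mongod` on the default port; the `school` database name is illustrative):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["school"]  # hypothetical database name

# Insert a document, then query it with the same operators as MQL
db.students.insert_one({"name": "Thai", "age": 20, "likes": ["AI", "Data"]})
for doc in db.students.find({"age": {"$gt": 18}, "likes": "AI"}):
    print(doc)
```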
🎯 4. Measuring Data Similarity: Cosine Similarity
🧮 4.1. Vector Dot Product
The dot product of two vectors A and B can be defined in two ways:
📊 Mathematical Definitions
Algebraic: The sum of the products of the corresponding entries.
A · B = Σ(Aᵢ * Bᵢ)
Geometric: The product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them.
A · B = ||A|| * ||B|| * cos(θ)
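A small numeric check that the two definitions agree (vectors chosen arbitrarily):

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([4.0, 5.0, 6.0])

# Algebraic: sum of element-wise products
print(np.sum(A * B))  # 32.0
print(np.dot(A, B))   # 32.0 (same result)

# Geometric: recover cos(θ), then rebuild the dot product from the magnitudes
cos_theta = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(np.linalg.norm(A) * np.linalg.norm(B) * cos_theta)  # 32.0 again
```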
📐 4.2. Cosine Similarity
By rearranging the geometric definition of the dot product, we get the formula for Cosine Similarity. It measures the cosine of the angle between two non-zero vectors, which indicates their directional similarity.
Cosine Similarity (cs) = cos(θ) = (A · B) / (||A|| * ||B||)
🎯 Interpretation Guide
Value | Interpretation | Geometric Meaning | Use Case |
---|---|---|---|
1 | Perfect similarity | Vectors point in exact same direction | Identical documents |
0 | No similarity | Vectors are orthogonal (90°) | Unrelated topics |
-1 | Perfect dissimilarity | Vectors point in opposite directions | Contradictory content |
0.5 to 1 | High similarity | Small angle between vectors | Related documents |
-0.5 to 0.5 | Moderate similarity | Medium angle | Somewhat related |
🔑 Key Property: Cosine similarity is a measure of orientation, not magnitude. Two vectors with the same orientation but different magnitudes will have a cosine similarity of 1. This makes it perfect for text analysis, where document length varies greatly.
💻 4.3. Python Implementation
```python
import numpy as np

def cosine_similarity(v1, v2):
    """Computes the cosine similarity between two vectors."""
    dot_product = np.dot(v1, v2)
    norm_v1 = np.linalg.norm(v1)
    norm_v2 = np.linalg.norm(v2)

    # Avoid division by zero
    if norm_v1 == 0 or norm_v2 == 0:
        return 0.0

    return dot_product / (norm_v1 * norm_v2)

# Example usage
doc1_vector = np.array([1, 1, 0, 1])  # "AI is fun"
doc2_vector = np.array([1, 1, 1, 0])  # "AI is cool"
similarity = cosine_similarity(doc1_vector, doc2_vector)
print(f"Cosine Similarity: {similarity:.4f}")
# Cosine Similarity: 0.6667
```
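Because cosine similarity measures orientation rather than magnitude (the key property above), scaling a vector leaves the score unchanged; a quick check with the function just defined:

```python
v = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(v, 5 * v))  # 1.0  (same direction, different length)
print(cosine_similarity(v, -v))     # -1.0 (opposite direction)
```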
🧠 5. Logic Thinking and Problem Solving
A structured approach to problem-solving is crucial for the success of any Data Science or AI project.
🔄 5.1. The 7-Step Problem-Solving Framework
This is an iterative process to systematically tackle complex problems.
🔍 Step 1: Define the Problem
Goal: Clearly articulate the problem. A problem is the gap between the current state and the desired state.
Techniques:
- 5W1H: Who, What, Where, When, Why, How
- 5 Whys: Repeatedly ask “Why?” to uncover the root cause
- SMART Goals: Ensure the objective is Specific, Measurable, Achievable, Relevant, and Time-bound
🧩 Step 2: Decompose the Problem
Goal: Break down a complex problem into smaller, more manageable components.
Techniques:
- MECE Principle: Mutually Exclusive, Collectively Exhaustive. Ensure sub-problems don’t overlap and that all parts of the original problem are covered
- Logic Trees: A visual tool to structure the decomposition
⭐ Step 3: Prioritize Issues
Goal: Focus resources on the most critical issues.
Techniques:
- Impact-Feasibility Matrix: A 2x2 grid to plot tasks based on their potential impact and ease of implementation
- Pareto Principle (80/20 Rule): Identify the 20% of causes that are responsible for 80% of the effects
🗄️ Step 4: Data Collection
Goal: Gather the necessary data to analyze hypotheses.
Methods: Interviews, surveys, system logs, databases, A/B testing, etc.
Data Quality: Ensure data is Accurate, Complete, Consistent, Timely, and Valid.
📊 Step 5: Data Analysis
Goal: Extract insights from the data.
Process: Clean data → Exploratory Data Analysis (EDA) → Diagnostic Analysis → Generate actionable insights.
💡 Step 6: Design Solution
Goal: Develop potential solutions based on the analysis.
Techniques: Brainstorming, SCAMPER, prototyping, A/B testing.
🚀 Step 7: Implement & Present
Goal: Execute the chosen solution and communicate the results effectively.
Technique:
- Pyramid Principle: Structure your communication by starting with the main conclusion, followed by supporting arguments, and finally the data evidence
🛠️ 6. TA-Exercise: Practical Applications
This section covers the practical exercises applying the concepts learned.
🎨 6.1. Image Processing: Grayscale Conversion
A color image (3 channels: R, G, B) can be converted to a grayscale image (1 channel) using several methods.
Conversion Methods
Lightness Method: Averages the most and least prominent colors.
Grayscale = (max(R, G, B) + min(R, G, B)) / 2
Average Method: Averages all three channels.
Grayscale = (R + G + B) / 3
Luminosity Method: A weighted average that accounts for human perception (we are more sensitive to green). This is generally the best method.
Grayscale = 0.21*R + 0.72*G + 0.07*B
Implementation (Luminosity):
```python
# Assuming 'img' is an RGB NumPy array of shape (H, W, 3)
# (if it came from cv2.imread, convert BGR -> RGB first, as noted in section 2.4)
gray_img = 0.21 * img[:, :, 0] + 0.72 * img[:, :, 1] + 0.07 * img[:, :, 2]
```
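For completeness, the other two methods can be written the same way; a sketch assuming the same RGB `img`:

```python
img_f = img.astype(np.float32)

# Lightness: average of the strongest and weakest channel per pixel
gray_lightness = (img_f.max(axis=2) + img_f.min(axis=2)) / 2

# Average: plain mean of the three channels
gray_average = img_f.mean(axis=2)
```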
🎬 6.2. Image Processing: Background Subtraction
This technique, often used with green screens, involves replacing the background of one image with another.
📝 Steps & Code Implementation:
1. Read Images
Load the object image, original background, and target background. Ensure they are the same size.
```python
import cv2
import numpy as np

obj_img = cv2.imread('Object.png')
bg1_img = cv2.imread('GreenBackground.png')
bg2_img = cv2.imread('NewBackground.jpg')

# Resize the backgrounds to match the object image; cv2.resize takes (width, height)
IMG_SIZE = (obj_img.shape[1], obj_img.shape[0])
bg1_img = cv2.resize(bg1_img, IMG_SIZE)
bg2_img = cv2.resize(bg2_img, IMG_SIZE)
```
2. Compute Difference
Find the absolute difference between the object image and the original background.
```python
diff = cv2.absdiff(bg1_img, obj_img)
diff_single_channel = np.mean(diff, axis=2)
```
3. Create Binary Mask
Threshold the difference image to create a mask that separates the foreground (object) from the background.
```python
# Where the difference is low (background), the mask is 0; where it's high (object), 255
_, binary_mask = cv2.threshold(diff_single_channel.astype(np.uint8), 15, 255, cv2.THRESH_BINARY)

# Expand to 3 channels so the mask can be applied to a color image
binary_mask_3ch = np.stack((binary_mask,) * 3, axis=-1)
```
4. Replace Background
Use np.where to combine the images. Where the mask is 255 (object), use the object image’s pixels. Otherwise, use the new background’s pixels.
```python
output = np.where(binary_mask_3ch == 255, obj_img, bg2_img)
cv2.imwrite('final_output.png', output)
```
📊 6.3. Tabular Data Analysis
Using NumPy to perform quick analysis on tabular data (e.g., from a CSV file).
```python
import pandas as pd
import numpy as np

# Load data using pandas and convert to a NumPy array
df = pd.read_csv('advertising.csv')
data = df.to_numpy()

# Get the 'Sales' column (last column)
sales = data[:, -1]

# 1. Get the maximum sales value
max_sales = np.max(sales)
print(f"Max Sales: {max_sales}")

# 2. Get the average of the 'TV' column (first column)
tv_ads = data[:, 0]
mean_tv = np.mean(tv_ads)
print(f"Average TV spending: {mean_tv:.2f}")

# 3. Count how many records have Sales >= 20
high_sales_count = np.sum(sales >= 20)
print(f"Number of high sales records: {high_sales_count}")

# 4. Average 'Radio' spending (second column) for records where Sales >= 15
radio_ads = data[:, 1]
avg_radio_for_high_sales = np.mean(radio_ads[sales >= 15])
print(f"Average Radio for high sales: {avg_radio_for_high_sales:.2f}")
```
✅ Learning Checklist
Mark your progress as you work through each section:
- NumPy Fundamentals
  - Understand memory layout differences between Python lists and NumPy arrays
  - Create arrays using various methods (`array()`, `zeros()`, `ones()`, `arange()`)
  - Master indexing, slicing, and the view vs. copy concept
  - Apply vectorized operations for performance
- Multi-dimensional Data
  - Work with 1D, 2D, and 3D arrays confidently
  - Use `reshape()`, `flatten()`, and aggregation functions
  - Understand and apply broadcasting rules
  - Process images as NumPy arrays
- Database Knowledge
  - Compare SQL vs. NoSQL approaches
  - Identify appropriate NoSQL database types for different use cases
  - Write basic MongoDB queries (insert, find, update, delete)
  - Understand document-oriented data modeling
- Data Similarity
  - Calculate dot products both algebraically and geometrically
  - Implement cosine similarity from scratch
  - Interpret similarity scores in practical contexts
  - Apply similarity measures to text analysis problems
- Problem-Solving Skills
  - Follow the 7-step problem-solving framework
  - Decompose complex problems using the MECE principle
  - Use prioritization matrices for decision making
  - Structure communication using the Pyramid Principle
🎯 Action Plan
Week 5 Goals:
- Practice Daily: Spend 30 minutes daily on NumPy array manipulation
- Build Projects: Create 2 mini-projects using the concepts learned
- Apply Knowledge: Use cosine similarity in a real text analysis task
- Document Learning: Write summary notes for each major concept
Next Steps:
- Set up a local Python environment with NumPy and MongoDB
- Download sample datasets for practice
- Join online communities for additional practice problems
- Schedule time for hands-on coding exercises
🔗 Additional Resources
📚 Recommended Reading
- NumPy Official Documentation: numpy.org
- MongoDB Tutorial: mongodb.com/docs
- Linear Algebra for ML: Khan Academy Linear Algebra course
- Problem-Solving Methods: “Thinking, Fast and Slow” by Daniel Kahneman
🛠️ Practice Platforms
- Kaggle Learn: Free micro-courses on data science topics
- LeetCode: Array and database problems
- MongoDB University: Free MongoDB courses
- NumPy Exercises: github.com/rougier/numpy-100
🎉 Key Achievements
After completing this study guide, you should be able to:
- ✅ Optimize Performance: Use NumPy for 10-100x faster numerical operations
- ✅ Handle Big Data: Work with multi-dimensional arrays efficiently
- ✅ Choose Databases: Select appropriate database technologies for projects
- ✅ Measure Similarity: Implement and apply cosine similarity in real applications
- ✅ Solve Problems: Apply systematic approaches to complex data science challenges