Hi, I’m Terrence Scott — a data science enthusiast with a background in software engineering, analytics, and machine learning. I’m passionate about turning raw data into meaningful insights, solving real-world problems, and building solutions that make a difference.
Right now, I’m actively leveling up my skills through the TripleTen Data Science Bootcamp while pursuing my Master of Science in Data Science at the University of Phoenix. These programs are helping me deepen my expertise in data analysis, statistical modeling, and machine learning while I continue to grow as a collaborative, impact-driven professional.
When I’m not coding or diving into data, you’ll find me nurturing my plant collection, embracing continuous learning, or connecting with others who share a passion for tech and innovation.
Email: terrencejay23@gmail.com
Java, SQL, C++, C#, Swift, HTML5, CSS, JavaScript
MySQL, MongoDB, Splunk, Cassandra, SQLite, Azure, AWS, Google Cloud, Databricks, SAP
Statistical Modeling, Data Integration, A/B Testing, Hadoop, Spark
GitHub, Azure DevOps, Jira
Creative Problem-Solving, Self-Starter Mentality, Effective Communication, Collaborative Teamwork, Adaptability, Leadership & Mentorship
Here are some of the projects I have worked on, showcasing my skills in data science, software engineering, and problem-solving:
📽️ Movie & TV Show Data Analysis with Pandas
Python | Pandas | Data Cleaning | Exploratory Data Analysis
In this project, I explored a dataset containing information on movies and TV shows. Using the pandas library, I read, cleaned, and analyzed the data to uncover insights such as content distribution by type, release year trends, and popular genres. This project strengthened my skills in working with DataFrames, handling missing data, indexing, filtering, and performing key data manipulation tasks. The final analysis provides a clear, structured summary of content trends in the entertainment industry.
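The analysis followed the standard pandas pattern of load, clean, then aggregate. Below is a minimal sketch of that pattern; the file and column names (movies_and_shows.csv, type, genre, release_year, title) are illustrative placeholders rather than the project's exact schema.

```python
import pandas as pd

# Placeholder file name; the real dataset and schema differ.
df = pd.read_csv('movies_and_shows.csv')

# Normalize column names and drop exact duplicate rows.
df.columns = df.columns.str.strip().str.lower()
df = df.drop_duplicates()

# Handle missing values and coerce release years to numeric.
df['genre'] = df['genre'].fillna('unknown')
df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce')

# Content distribution by type and a release-year trend.
print(df['type'].value_counts())
print(df.groupby('release_year')['title'].count().tail(10))
```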
🛒 Grocery Shopping Behavior Analysis – Instacart Dataset
Python | Pandas | Data Analysis | Data Visualization | Consumer Insights
In this project, I analyzed the Instacart dataset to uncover patterns in customer grocery shopping behavior. Using pandas for data manipulation and exploratory analysis, I identified key trends in shopping times, days of the week, reorder habits, and product preferences. Insights included peak shopping hours, high-demand days (Sunday & Monday), frequent reorders of fresh produce (bananas, spinach, avocados), and common cart behaviors. This project demonstrates my ability to derive actionable business insights from large datasets to support inventory planning, targeted marketing, and operational efficiency.
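The peak-hour and reorder findings come straight out of a few groupby aggregations. Here is a hedged sketch; it assumes the public Instacart schema (order_hour_of_day, order_dow, reordered, product_id), which may differ slightly from the exact files used.

```python
import pandas as pd

# Assumed file names based on the public Instacart release.
orders = pd.read_csv('instacart_orders.csv')
lines = pd.read_csv('order_products.csv')

# Orders per hour of day: peaks mark the busiest shopping hours.
orders_by_hour = orders['order_hour_of_day'].value_counts().sort_index()

# Orders per day of week (0 is commonly read as Sunday in this dataset).
orders_by_dow = orders['order_dow'].value_counts().sort_index()

# Reorder rate per product: the share of order lines that are repeat buys.
reorder_rate = (
    lines.groupby('product_id')['reordered']
    .mean()
    .sort_values(ascending=False)
)
```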
📈 Revenue Attribution Analysis – Megaline Prepaid Plans
Python | Pandas | Statistical Testing | Data Analysis | Business Strategy
In this project, I conducted a comprehensive revenue attribution analysis for Megaline’s two prepaid mobile plans — Surf and Ultimate. Using pandas for data manipulation and hypothesis testing, I uncovered that the Surf plan consistently generated higher revenue, primarily through overage charges, despite having a lower base price than the Ultimate plan.
Key insights included seasonal revenue spikes, high user engagement on the Surf plan, and no significant revenue variation across geographic regions. Hypothesis testing validated that Surf users generated significantly more revenue than Ultimate users (p < 0.001), while churned users showed no meaningful difference in revenue compared to active users.
Based on these insights, I recommended reallocating marketing efforts toward high-engagement Surf customers and implementing targeted upsell strategies. This analysis highlights my ability to translate complex data into actionable business strategies that drive growth and customer value.
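The hypothesis test at the core of this analysis is a two-sample comparison of mean revenue per user. A minimal sketch using SciPy is shown below; the file and column names are assumptions for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical per-user revenue table with a 'plan' label.
revenue = pd.read_csv('megaline_revenue.csv')

surf = revenue.loc[revenue['plan'] == 'surf', 'monthly_revenue']
ultimate = revenue.loc[revenue['plan'] == 'ultimate', 'monthly_revenue']

# Welch's t-test (unequal variances) on the difference in mean revenue.
t_stat, p_value = stats.ttest_ind(surf, ultimate, equal_var=False)
print(f't = {t_stat:.2f}, p = {p_value:.4f}')

# Reject H0 (equal mean revenue) at alpha = 0.05 when p_value < 0.05.
```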
🎮 Global Video Game Market Analysis – ICE
Python | Pandas | Statistical Analysis | Data Visualization | Market Research
In this project with ICE, I analyzed global video game performance data to uncover patterns across platforms, genres, and regions. Using pandas for data manipulation and hypothesis testing, I examined regional sales trends, genre popularity, and the influence of ESRB ratings on market success.
Key findings revealed that Action games, while generating high global sales, had low median performance, indicating a high-risk, high-reward genre. Shooter and Role-Playing games demonstrated stronger consistency across markets, with regional dominance varying: PS4/Xbox One in the West, PC in Europe, and PSP in Japan. Hypothesis tests confirmed no significant difference in user ratings between Xbox One and PC titles, but found a significant rating gap between Action and Sports games (p < 0.001).
These insights inform targeted publishing strategies, regional marketing focus, and content investment decisions. This project highlights my ability to transform market data into actionable intelligence for business growth in the interactive entertainment industry.
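Both rating hypotheses reduce to the same two-sample test applied to different slices of the data. A sketch of that setup follows; the column names (platform, genre, user_score) are assumptions.

```python
import pandas as pd
from scipy import stats

# Hypothetical games table; platform and genre labels are assumptions.
games = pd.read_csv('games.csv')

# H0: mean user ratings for Xbox One and PC titles are equal.
xbox = games.loc[games['platform'] == 'XOne', 'user_score'].dropna()
pc = games.loc[games['platform'] == 'PC', 'user_score'].dropna()
print(stats.ttest_ind(xbox, pc, equal_var=False))

# H0: mean user ratings for Action and Sports titles are equal.
action = games.loc[games['genre'] == 'Action', 'user_score'].dropna()
sports = games.loc[games['genre'] == 'Sports', 'user_score'].dropna()
print(stats.ttest_ind(action, sports, equal_var=False))
```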
📱 Mobile Plan Recommendation System – Megaline
Python | Pandas | Scikit-learn | Gradient Boosting | Random Forest | Logistic Regression | Data Preprocessing | Model Evaluation | Matplotlib | Seaborn
In this project, I developed a classification model to recommend the most suitable mobile plan (Smart or Ultra) for Megaline customers based on their service usage. Using scikit-learn, I analyzed features such as the number of calls, total call duration, text messages, and internet usage to train and evaluate multiple models.
Key findings revealed that the Gradient Boosting model achieved the highest accuracy of 0.829, outperforming other approaches like Random Forest and Logistic Regression. Internet usage emerged as the most influential feature, followed by calls, messages, and minutes.
These insights enable Megaline to provide data-driven plan recommendations, supporting their customer migration strategy and enhancing user satisfaction. This project highlights my ability to build and evaluate machine learning models to solve real-world business challenges.
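The training loop behind this result is a standard scikit-learn split/fit/score workflow. The sketch below assumes usage-style feature names (calls, minutes, messages, mb_used) and a binary is_ultra target; treat them as placeholders.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv('users_behavior.csv')  # placeholder file name
X = df[['calls', 'minutes', 'messages', 'mb_used']]
y = df['is_ultra']  # 1 = Ultra plan, 0 = Smart plan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

print('accuracy:', accuracy_score(y_test, model.predict(X_test)))
# Feature importances show which usage metric drives the recommendation.
print(dict(zip(X.columns, model.feature_importances_)))
```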
📊 Customer Churn Prediction – Beta Bank
Python | Pandas | Scikit-learn | LightGBM | XGBoost | Gradient Boosting | Data Preprocessing | Class Imbalance Techniques | Model Evaluation | Matplotlib | Seaborn
Developed a binary classification model to predict customer churn for Beta Bank using features like credit score, account balance, and demographics. LightGBM achieved the best performance with a test accuracy of 85%, an F1-score of 0.59 for churn, and a ROC AUC of 0.85.
Class imbalance techniques and threshold tuning improved recall for churn cases, but further enhancements like resampling strategies and feature engineering are recommended. This project supports Beta Bank’s retention strategy by identifying at-risk customers for timely interventions.
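For the imbalance handling, one simple lever is class weighting combined with a tuned decision threshold. A sketch of that idea with LightGBM is below; the feature set, the target name (exited), and the 0.4 threshold are illustrative assumptions.

```python
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv('churn.csv')  # placeholder file name
X = df.drop(columns=['exited'])
y = df['exited']  # 1 = customer churned

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# class_weight='balanced' upweights the minority (churn) class in training.
model = LGBMClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
# Lowering the threshold below 0.5 trades precision for churn recall.
preds = (proba >= 0.4).astype(int)
print('F1:', f1_score(y_test, preds), 'ROC AUC:', roc_auc_score(y_test, proba))
```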
🛢️ Oil Well Profit Optimization – OilyGiant
Python | Pandas | Scikit-learn | Linear Regression | Bootstrapping | Feature Analysis | RMSE Evaluation | Profit Simulation | Data Visualization
In this project, I built a predictive modeling pipeline to help OilyGiant determine the most profitable region for oil well development. Using historical geological data from three regions, I analyzed feature distributions, trained linear regression models, and evaluated model performance using RMSE and predicted reserves.
To simulate real-world uncertainty, I applied bootstrapping across 1,000 iterations per region. I selected the top 200 wells based on model predictions and calculated profit based on actual reserves, development cost, and fixed oil prices. Each region's risk of financial loss was also assessed using confidence intervals.
Key findings showed that Region 0 offered the highest estimated profit ($33.21M) and strong model reliability, while Region 2 demonstrated consistent performance and clean data. Region 1 had the lowest risk but showed signs of overfitting. This project highlights my ability to combine predictive modeling with financial simulation to support strategic, data-driven investment decisions in the energy sector.
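The bootstrap step is the heart of the risk assessment: resample candidate wells, keep the top 200 by predicted reserves, and score profit on the actual reserves. A sketch follows; the sample size, budget, and per-unit revenue are illustrative parameters rather than the project's exact business figures.

```python
import numpy as np
import pandas as pd

def bootstrap_profit(targets, predictions, n_iter=1000, sample_size=500,
                     n_best=200, budget=100e6, revenue_per_unit=4500,
                     seed=42):
    """Bootstrap the profit distribution for one region.

    `targets` holds actual reserves and `predictions` the model estimates,
    both indexed by well. Monetary parameters are placeholders.
    """
    rng = np.random.RandomState(seed)
    profits = []
    for _ in range(n_iter):
        # Resample candidate wells, then keep the top n_best by prediction.
        sample = predictions.sample(sample_size, replace=True, random_state=rng)
        best_idx = sample.sort_values(ascending=False).head(n_best).index
        # Profit is scored on the actual reserves of the selected wells.
        profits.append(targets.loc[best_idx].sum() * revenue_per_unit - budget)
    profits = pd.Series(profits)
    mean = profits.mean()
    ci = (profits.quantile(0.025), profits.quantile(0.975))
    risk_of_loss = (profits < 0).mean()
    return mean, ci, risk_of_loss
```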
🏆 Gold Recovery Prediction – Mining Operations Optimization
Python | Pandas | Scikit-learn | CatBoost | Random Forest | XGBoost | sMAPE | Time Series Validation | Feature Engineering | Data Visualization
This project involved building machine learning models to predict gold recovery rates at two key stages of a mineral processing pipeline: flotation (rougher) and final purification. The goal was to optimize process control and improve operational efficiency using historical production data.
I engineered features based on time-indexed process variables, applied rigorous data cleaning techniques, and validated model accuracy using the symmetric Mean Absolute Percentage Error (sMAPE). I trained and compared multiple models—including Random Forest, CatBoost, and XGBoost—using custom scoring functions and K-Fold cross-validation.
Random Forest emerged as the top performer (Final sMAPE: 4.08), with CatBoost as a close second (Final sMAPE: 4.32). I also conducted a thorough sanity check of test predictions and analyzed feature importance to understand model decision-making. This project demonstrates my ability to apply advanced regression techniques in a high-stakes industrial context where predictive accuracy directly impacts profitability and process reliability.
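Since sMAPE isn't built into scikit-learn, the custom scoring function is worth showing. A minimal sketch is below; the 25/75 stage weighting in the combined metric is shown for illustration.

```python
import numpy as np
from sklearn.metrics import make_scorer

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    mask = denom != 0  # skip rows where both values are zero
    return np.mean(np.abs(y_true[mask] - y_pred[mask]) / denom[mask]) * 100

# Lower is better, so scikit-learn should minimize the score.
smape_scorer = make_scorer(smape, greater_is_better=False)

def final_smape(smape_rougher, smape_final):
    # Combined metric: rougher stage weighted 25%, final stage 75%.
    return 0.25 * smape_rougher + 0.75 * smape_final
```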
🏦 Insurance Benefits and Claims Prediction – Sure Tomorrow
Python | Pandas | Scikit-learn | kNN | Logistic Regression | Linear Regression | Data Obfuscation | Privacy-Preserving ML | Feature Scaling | RMSE Evaluation | Data Visualization
This project delivered a full machine learning workflow for the Sure Tomorrow insurance company, addressing customer targeting, benefit eligibility prediction, benefit quantity forecasting, and data privacy. I implemented k-Nearest Neighbors to identify customers with similar profiles, revealing the importance of feature scaling to prevent biased neighbor selection.
For benefit eligibility, I trained a logistic regression model that significantly outperformed a dummy baseline, providing reliable predictions for customer benefits. Using linear regression, I forecasted the number of benefits customers may receive, uncovering a highly skewed distribution where most customers received none.
To ensure data confidentiality, I developed a matrix-based obfuscation method that preserved model accuracy. Analytical proof and testing confirmed identical RMSE values (0.363719) between original and obfuscated data, with prediction differences near zero. This project demonstrates my ability to combine predictive modeling, statistical validation, and privacy-preserving techniques to deliver actionable business insights without compromising sensitive customer information.
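The invariance result has a neat demonstration: multiplying the feature matrix by any invertible matrix P leaves linear regression predictions, and therefore RMSE, unchanged. A self-contained sketch on synthetic data, standing in for the confidential customer features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

# Toy data standing in for the real (confidential) feature matrix.
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -1.0, 2.0, 0.1]) + rng.normal(scale=0.3, size=200)

# Obfuscate by right-multiplying the features by a random invertible matrix.
P = rng.normal(size=(4, 4))
assert np.linalg.matrix_rank(P) == 4  # invertibility check
X_obf = X @ P

for name, features in [('original', X), ('obfuscated', X_obf)]:
    model = LinearRegression().fit(features, y)
    rmse = mean_squared_error(y, model.predict(features)) ** 0.5
    print(f'{name} RMSE: {rmse:.6f}')  # identical up to floating-point noise
```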
Explore my AI-focused projects and the books that have shaped my understanding of artificial intelligence, machine learning, and data science:
Generalization: Build, train, and fine-tune deep neural network architectures for NLP with Python, Hugging Face, and OpenAI's GPT-3, ChatGPT, and GPT-4.
Why first: This book establishes a foundational understanding of transformers, Hugging Face models, and the inner workings of GPT-based architectures. It's critical for grasping the models that power most modern AI systems.
Generalization: A practical roadmap from deep learning fundamentals to advanced applications and Generative AI.
Why second: Computer vision complements NLP and reinforces my knowledge of deep learning, PyTorch, and practical model training. While not LLM-focused, it sharpens my understanding of architectures and model evaluation.
Generalization: The Developer’s Guide to Pretrained LLMs, Vector Databases, Retrieval Augmented Generation, and Agentic Systems.
Why third: Builds on my transformer/NLP knowledge by introducing LLMs, RAG, vector databases, and agentic systems. This gives a broad, modern perspective on GenAI development.
Generalization: Enhancing LLM Abilities and Reliability with Prompting, Fine-Tuning, and RAG.
Why fourth: With the fundamentals of how LLMs work in place, this book focuses on fine-tuning, prompting, and RAG pipelines, all essential for customizing and deploying models reliably.
Generalization: Build production-ready LLM applications and advanced agents using Python, LangChain, and LangGraph.
Why fifth: LangChain is a dominant tool for building real-world LLM applications. It's best studied after becoming comfortable with RAG and LLM fundamentals, so the focus can stay on chains, agents, and LangGraph workflows.
Generalization: A practical guide to autonomous and modern AI agents.
Why sixth: This builds on LangChain and expands into knowledge graphs and autonomous agents — critical for more complex, memory-driven applications like personal assistants and decision-making tools.
Generalization: Create intelligent, autonomous AI agents that can reason, plan, and adapt.
Why seventh: This goes deep into reasoning, planning, and adaptive behavior. Best studied after gaining experience with RAG, agents, and production-level tools. It's more advanced and conceptual.
Generalization: The Complete and Up-to-Date Guide to Build, Develop and Scale Production Ready AI Systems.
Why last: This is the most comprehensive and production-focused book. It brings together everything: MLOps, scaling, monitoring, deployment, and infrastructure. Ideal as a capstone or reference after building a few prototypes.
terrencejay23@gmail.com
Oakland, CA
www.linkedin.com/in/terrencejay23