Summary |
This multi-phase data preparation and machine learning assignment focuses on preparing datasets for environmental impact prediction. The project has two core phases and can be extended with further phases for deeper learning and exploration. In the first phase, students explore, clean, and merge data from the National Pollutant Release Inventory (NPRI), addressing common data issues such as outliers, missing values, and poor data housekeeping. In the second phase, students align and merge multiple datasets, perform feature engineering, and create machine-learning-ready datasets. An optional phase has students build interactive dashboards with tools like Tableau or Power BI to visualize the results, which strengthens their data presentation skills. The phases are spaced across a sixteen-week course to allow for incremental learning and application of skills.
|
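The first-phase cleaning and merging steps described above can be sketched in pandas. This is a minimal illustration under stated assumptions, not the assignment's prescribed method: the column names (`facility_id`, `year`, `release_tonnes`, `province`), the toy values, and the helper `clean_releases` are all hypothetical, and the real NPRI schema differs. The 1.5 × IQR rule and median imputation are one common choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical release records; real NPRI column names and values differ.
releases = pd.DataFrame({
    "facility_id": [1, 1, 2, 2, 3, 3],
    "year": [2020, 2021, 2020, 2021, 2020, 2021],
    "release_tonnes": [10.0, 12.0, np.nan, 11.0, 9.0, 500.0],
})

# Hypothetical facility lookup table to merge against.
facilities = pd.DataFrame({
    "facility_id": [1, 2, 3],
    "province": ["AB", "ON", "BC"],
})


def clean_releases(df, column="release_tonnes"):
    """Flag IQR outliers, then fill missing values with the non-outlier median."""
    out = df.copy()
    q1, q3 = out[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Missing values compare as False, so they are never flagged as outliers.
    out["is_outlier"] = (out[column] < lower) | (out[column] > upper)
    median = out.loc[~out["is_outlier"], column].median()
    out[column] = out[column].fillna(median)
    return out


cleaned = clean_releases(releases)

# validate= makes pandas raise if the lookup table has duplicate keys.
merged = cleaned.merge(
    facilities, on="facility_id", how="left", validate="many_to_one"
)
print(merged)
```

Flagging outliers in a separate column, rather than dropping them, leaves the decision about what counts as a genuine extreme release to a later conversation with the client.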
Topics |
- Data preparation and cleaning (outliers, missing values, normalization)
- Data merging and alignment
- Feature engineering and encoding
- Machine Learning problem formulation
- Time-series data handling
- Stakeholder engagement and client collaboration
- Data visualization and dashboard creation
|
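Several of the topics above (normalization, encoding, and time-series handling) can be illustrated together in a short pandas sketch. The data, the column names (`sector`, `release_tonnes`, `release_prev_year`, `release_scaled`), and the choice of min-max scaling and one-hot encoding are assumptions made for illustration; students may reasonably pick other techniques.

```python
import pandas as pd

# Hypothetical yearly releases per facility; not actual NPRI data.
df = pd.DataFrame({
    "facility_id": [1, 1, 1, 2, 2, 2],
    "year": [2019, 2020, 2021, 2019, 2020, 2021],
    "sector": ["mining", "mining", "mining", "pulp", "pulp", "pulp"],
    "release_tonnes": [10.0, 20.0, 30.0, 5.0, 5.0, 15.0],
})

# Time-series handling: sort by facility and year so a shift means "previous year".
df = df.sort_values(["facility_id", "year"]).reset_index(drop=True)

# Lag feature: the prior year's release within the same facility.
# The first year of each facility has no predecessor and stays NaN.
df["release_prev_year"] = df.groupby("facility_id")["release_tonnes"].shift(1)

# Min-max normalization to the [0, 1] range.
col = df["release_tonnes"]
df["release_scaled"] = (col - col.min()) / (col.max() - col.min())

# One-hot encode the categorical sector column.
features = pd.get_dummies(df, columns=["sector"], dtype=int)
print(features)
```

In a real modeling pipeline, the scaling minimum and maximum would normally be computed on the training split only and then reused on validation and test data to avoid leakage.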
Audience |
Undergraduate or graduate students in data science, machine learning, or environmental analytics courses.
|
Difficulty |
This assignment is divided into phases, each designed to take students approximately three to four weeks to complete (assuming 10 hours of work per week). The first phase introduces foundational data preparation skills, while subsequent phases involve more complex tasks such as feature engineering, dataset alignment, and interactive dashboard creation. Instructors can adjust the number of phases to fit the course schedule or depth of learning desired.
|
Strengths |
Students gain hands-on experience with real-world environmental data, preparing them for industry challenges. The assignment encourages creativity and flexibility by allowing students to choose their own machine learning problems and approaches. The phased approach builds confidence by starting with foundational skills before moving to more complex tasks. The assignment emphasizes stakeholder engagement, developing the communication skills needed in data science. Students learn to understand data in collaboration with clients, a critical skill when domain expertise is limited. Building dashboards in tools like Tableau and Power BI also introduces students to practical data visualization techniques. The provided rubrics are adaptable to different datasets and client needs, making the assignment versatile and reusable.
|
Weaknesses |
Understanding complex environmental data without domain expertise can be challenging for students. This challenge is intentional and is part of the learning outcome: students engage with stakeholders to gain the necessary context and insights. Group coordination may also be difficult for students who are inexperienced in collaborative work environments.
|
Dependencies |
Students should be familiar with Python, Jupyter Notebooks, basic statistics, machine learning concepts, data cleaning techniques, and feature engineering. Computing requirements include Python libraries such as Pandas, NumPy, and Plotly, as well as access to a computing environment that can handle large datasets (e.g., Google Colab). Additional tools like Tableau or Power BI may be required for the optional dashboard creation phase.
|
Variants |
Instructors can vary the assignment by incorporating different datasets or focusing on various environmental issues, such as air quality or water pollution. The assignment can also be expanded into more than two phases, incorporating additional tasks such as advanced feature engineering, data augmentation, or more complex machine learning models. Students could also be tasked with developing interactive dashboards to visualize their results using tools like Tableau or Power BI, which would enhance their skills in data visualization and presentation. The process and structure introduced in this assignment can be adapted to other types of data and client scenarios, ensuring its relevance across different contexts.
|