Task 1: Preliminary Data Analysis a) Using the file widget, import the provided dataset into Orange correctly, create a data table to view the data, and provide a statistical summary of the data presenting

Jamez

03 Jul 2024 • 2 min read

Assignment Task

Questions:

Task 1: Preliminary Data Analysis

a) Using the file widget, import the provided dataset into Orange correctly, create a data table to view the data, and provide a statistical summary of the data presenting the main descriptive statistics of the features, namely the minima, maxima, averages, standard deviations, and the 25th, 50th, and 75th percentiles of your non-categorical features.

b) Perform the necessary data-cleaning steps. Look for any missing data or outliers. Properly handle these data. You must provide strong rationales for your decisions. For instance, if you ended up imputing the missing data with a particular strategy or decided to remove them, please explain your reason(s) for this decision.

c) Feature encoding is done to convert categorical data into a numerical format that can be used by machine learning algorithms, with common methods such as one-hot encoding and ordinal encoding. Encode categorical variable(s) properly. Explain your rationale for the selected encoding method(s).
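For readers who want to cross-check the Orange workflow for Task 1 in code, the three steps above can be sketched roughly as follows in pandas. This is only an illustration under stated assumptions: the data and the column names (`length_km`, `cost`, `surface`, `condition`) are invented placeholders, not the assignment's dataset, and median imputation, the 1.5 × IQR rule, and the ordering of `condition` are example choices that would still need their own justification.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; every column name and value here is a placeholder.
df = pd.DataFrame({
    "length_km": [1.2, 3.4, np.nan, 5.0, 4.1, 2.2],
    "cost": [100.0, 120.0, 110.0, 5000.0, 105.0, 115.0],
    "surface": ["asphalt", "concrete", "asphalt", "gravel", "concrete", "asphalt"],
    "condition": ["poor", "good", "fair", "good", "fair", "poor"],
})

# a) Descriptive statistics of the non-categorical features:
#    describe() reports count, mean, std, min, 25%, 50%, 75%, and max.
numeric_cols = df.select_dtypes(include="number").columns
summary = df[numeric_cols].describe()
print(summary)

# b) Missing values: count them, then impute with the median (one possible choice).
missing_counts = df.isna().sum()
df["length_km"] = df["length_km"].fillna(df["length_km"].median())

#    Outliers: flag values outside 1.5 * IQR of the quartiles (one common rule of thumb).
q1, q3 = df["cost"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["cost"] < q1 - 1.5 * iqr) | (df["cost"] > q3 + 1.5 * iqr)
print(df.loc[is_outlier, "cost"])

# c) Encoding: ordinal mapping for an ordered category (assumed order),
#    one-hot encoding for an unordered one.
condition_order = {"poor": 0, "fair": 1, "good": 2}
df["condition"] = df["condition"].map(condition_order)
encoded = pd.get_dummies(df, columns=["surface"], prefix="surface")
print(encoded.columns.tolist())
```

Whether a flagged value is removed, capped, or kept is a judgment call; the sketch only marks candidates for inspection.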

Task 2: Feature Ranking and Engineering

a) Feature ranking is the process of ordering the features of a dataset based on their importance or relevance to the predictive model's performance. Use the rank widget to report the most important features. Based on your engineering judgment, do you agree with what the rank widget says? Why?

b) Feature engineering is the process of creating new features or modifying existing ones to improve the performance of machine learning models. You can derive new features or remove existing features that are ineffective or redundant to improve cost estimation. Conduct proper feature engineering and briefly discuss why specific features were derived or removed and their potential impact on the results.
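As a rough code counterpart to Task 2, the sketch below derives one feature, drops a redundant one, and then ranks what is left. It is an assumption-laden illustration: the data is synthetic, the column names (`length_km`, `length_m`, `lanes`, `noise`, `lane_km`) are hypothetical, and scikit-learn's univariate `f_regression` score stands in for a ranking method; it is not claimed to reproduce what the Orange Rank widget reports.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Synthetic data with hypothetical column names; not the assignment's dataset.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "length_km": rng.uniform(1, 10, n),
    "lanes": rng.integers(2, 7, n).astype(float),
    "noise": rng.normal(size=n),
})
X["length_m"] = X["length_km"] * 1000  # deliberately redundant copy in other units
# Assumed cost model for the illustration: cost scales with length times lanes.
y = 20 * X["length_km"] * X["lanes"] + rng.normal(scale=5, size=n)

# Derive a new feature that encodes the suspected interaction.
X["lane_km"] = X["length_km"] * X["lanes"]

# Remove redundant features: drop any later column that is almost perfectly
# correlated (|r| > 0.98, an arbitrary threshold) with an earlier kept column.
corr = X.corr().abs()
to_drop = set()
cols = list(X.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if a not in to_drop and corr.loc[a, b] > 0.98:
            to_drop.add(b)
X = X.drop(columns=sorted(to_drop))

# Rank the remaining features by their univariate F-score against the target.
f_scores, _ = f_regression(X, y)
ranking = pd.Series(f_scores, index=X.columns).sort_values(ascending=False)
print(ranking)
```

A univariate score judges each feature in isolation, which is one reason engineering judgment can reasonably disagree with an automatic ranking.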

Task 3: Machine Learning and Analytics

a) Choose any four machine learning models you like for cost prediction. Assess the performance of each model and report their performance metrics, such as R². Hint: You should use the “Test and Score” widget here.

b) Choose the best model. Justify your choice.

c) Assume the Texas Department of Road and Transportation has asked you which features deserve the most attention for better, more reliable, and more accurate cost prediction. What would you tell them? Discuss how these features could impact the decision-making process and overall project cost management. Please provide rationales through a lens of business intelligence.

d) Summarize the key lessons you learned from this assignment.
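The model comparison in parts a) and b) can also be sketched in scikit-learn, using cross-validated R² as a loose analogue of what "Test and Score" reports. Everything specific here is an assumption for illustration: the data comes from `make_regression`, and the four models, their hyperparameters, and the 5-fold split are arbitrary picks rather than recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem standing in for the cost dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Four candidate models (an arbitrary selection for the sketch).
models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "kNN": KNeighborsRegressor(n_neighbors=5),
}

# Mean R² over 5 shuffled folds for each model.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in models.items()
}

for name, r2 in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean R2 = {r2:.3f}")

# Part b): the highest mean R² is one defensible selection criterion.
best_model = max(scores, key=scores.get)
print("Best by mean R2:", best_model)
```

R² alone rarely settles the choice; error metrics such as MAE or RMSE, training cost, and interpretability are usually weighed alongside it.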

Bonus Task: Do it again! But entirely with ChatGPT.

OpenAI recently released GPT-4o, its most powerful language model, which can assist with a wide range of tasks, including data analysis and machine learning. This model is currently available for free. Repeat the entire assignment using ChatGPT. First, upload the provided dataset. Then, using proper prompt engineering, ask ChatGPT to clean it (e.g., by removing or imputing missing values) and to preprocess the data by encoding categorical variables and normalizing numerical features. Feel free to ask it to use feature engineering techniques to create new terms or remove redundant features, and then to identify the most important features, e.g., using a Random Forest Regressor. Train and evaluate a few different machine learning models and compare their performance. Finally, use the best-trained model to predict the costs of new projects and proceed with analytics. Note that ChatGPT might perform suboptimally if you ask it to carry out several difficult tasks simultaneously. You may find OpenAI’s prompt engineering guide helpful. Provide the link to your thread of discussion with ChatGPT or screenshot(s) of highlights of your chat. (If the screenshot takes too much space in your report, you may submit it separately in Quercus.)
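To give a sense of the kind of code such a ChatGPT session might produce, here is a minimal scikit-learn sketch of the pipeline described above: impute, scale, one-hot encode, fit a Random Forest Regressor, read off feature importances, and predict the cost of a new project. It is not ChatGPT output. The data is synthetic, and the column names (`length_km`, `surface`, `cost`), the cost formula, and the hyperparameters are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the project dataset (all names and numbers are made up).
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "length_km": rng.uniform(1, 10, n),
    "surface": rng.choice(["asphalt", "concrete"], n),
})
df["cost"] = (
    100 * df["length_km"]
    + np.where(df["surface"] == "concrete", 50, 0)
    + rng.normal(scale=5, size=n)
)
df.loc[::10, "length_km"] = np.nan  # inject some missing values to clean

numeric = ["length_km"]
categorical = ["surface"]

# Cleaning and preprocessing: median imputation + scaling for numeric columns,
# one-hot encoding for categorical columns.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])
model.fit(df[numeric + categorical], df["cost"])

# Feature importances, labeled in the transformer's output order (numeric first).
encoder = model.named_steps["prep"].named_transformers_["cat"]
names = numeric + [f"surface_{c}" for c in encoder.categories_[0]]
importances = pd.Series(
    model.named_steps["rf"].feature_importances_, index=names
).sort_values(ascending=False)
print(importances)

# Predict the cost of a new, hypothetical project.
new_project = pd.DataFrame({"length_km": [5.0], "surface": ["concrete"]})
predicted_cost = model.predict(new_project)[0]
print(f"Predicted cost: {predicted_cost:.1f}")
```

Keeping the preprocessing inside the pipeline means the same imputation and encoding are applied to new projects at prediction time, which is one thing worth checking in any code a chat assistant generates.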
