BUS 6121 - Data Wrangling and Exploration
3 lecture hours 0 lab hours 3 credits
Course Description
This course provides an overview of data-driven decision-making and the use of business analytics to support organizational performance. Special attention is paid to identifying appropriate sources of data, evaluating the quality of data, wrangling the data for specific analytical techniques, and exploring patterns within the data. Students will learn current practices, tools, and methods for data wrangling and exploration. Effective interpretation and communication of results are emphasized. The use of programming languages such as R and Python will be emphasized to expose the student to contemporary analytic processing environments on computing clusters. (prereq: BUS 5500 or Business Programs Director consent)
Course Learning Outcomes
Upon successful completion of this course, the student will be able to:
- Demonstrate a broad knowledge of the field of business analytics, including trends and challenges
- Evaluate sources of data and opportunities for data collection and use across an organization
- Perform data wrangling using appropriate methods and tools for a given analytics scenario
- Apply appropriate data exploration techniques for a given analytics scenario
- Use data visualization techniques to gain insights into the meaning of the data
- Communicate the meaning of statistical or analytic results to stakeholders
- Demonstrate the ability to analyze and write statistical program code in R and/or Python that automates the processing of statistical data sets that are not conducive to manual processing due to their size and/or scale
Prerequisites by Topic
Course Topics
- Python language environments
- IPython (Interactive Python) console mode, integrated development environment (IDE), and Jupyter Notebook environment
- Installation, configuration, and maintenance of IPython environments using shared environments (Rosie, Google Colab, etc.) and single-user environments (Anaconda)
- Python language basics
- Expressions, operators (arithmetic, logical, and relational), data types (None, integers, floating-point numbers, and objects), compound expressions leading to statements
- Flow of control in Python: sequential execution; branching (if/elif/else); iteration (for loops, while loops, ranges, else clauses on loops, iteration over collections); functions and lambda expressions. Points of syntax: the importance of colons and indentation
- Collections in Python (lists, tuples, dictionaries, sets): indexing and slicing, iteration over collections using comprehensions, iterable objects, and generators
- Classes and object-oriented programming in Python (instance variables, methods, operator overloading, inheritance, polymorphism)
- Numeric Python (NumPy)
- Data structures (arrays and n-dimensional arrays, or ndarrays)
- Indexing, sorting, finding unique elements, random number generation, file I/O
- Pandas library
- Series and DataFrame data structures
- Indexing, axis manipulation, selection, filtering, sorting, ranking, summarizing
- File I/O between Series, DataFrame and files on disk, file formats
- Reading and writing data using various file formats (text, JSON, HTML, XML, Microsoft Excel)
- Data cleaning and preparation
- Handling missing data, filtering out bad/missing data, filling in missing data
- “Binning” and aggregating data values, making continuous variables discrete
- Indicator matrices and dummy variables
- Data wrangling: join, combine, reshape
- Hierarchical indexing, summarizing statistics by level
- Combining and merging data sets, SQL-style “joins” among DataFrames
- Reshaping and pivoting DataFrames
- Data visualization
- Plotting with matplotlib
- Figures and subplots
- Line graphs, bar charts, box plots, histograms, density diagrams, scatter plots, etc.
- Seaborn (built on matplotlib), Bokeh, and other visualization libraries
- Data aggregation and group operations
- The GroupBy operator
- The Split-Apply-Combine strategy for manipulating data
- Processing large data sets efficiently
- Machine learning basics
- Linear regression fundamentals, simple and multivariate linear models
- Training sets, test sets, validation sets, and other data set partitioning
- Closed form equation solution for linear regression on small data sets
- Iterative linear regression techniques for large data sets
- Learning rates, error surfaces, residual analysis
- Polynomial regression, logistic regression
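As an illustration of the machine learning topics listed above, the following is a minimal sketch, not course material, that contrasts a closed-form least-squares fit with an iterative gradient-descent fit on a training/test partition. The synthetic data, the 80/20 split, the function names `design_matrix` and `gradient_descent`, and the learning rate and iteration count are all assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3x plus small noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, size=200)

# Partition into a training set and a test set (80/20 split, chosen arbitrarily).
split = int(0.8 * len(x))
x_train, x_test = x[:split], x[split:]
y_train, y_test = y[:split], y[split:]


def design_matrix(values):
    """Prepend a column of ones so the first coefficient is the intercept."""
    return np.column_stack([np.ones_like(values), values])


X_train = design_matrix(x_train)
X_test = design_matrix(x_test)

# Closed-form least-squares solution, practical for small data sets.
beta_closed, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def gradient_descent(X, y, learning_rate=0.1, n_iterations=5000):
    """Hypothetical helper: batch gradient descent on mean squared error."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iterations):
        residuals = X @ beta - y
        gradient = (2.0 / len(y)) * (X.T @ residuals)
        beta -= learning_rate * gradient
    return beta


# Iterative solution, the style of method used when data sets are large.
beta_iterative = gradient_descent(X_train, y_train)

# Residual analysis on held-out data: mean squared error on the test set.
test_residuals = X_test @ beta_closed - y_test
test_mse = float(np.mean(test_residuals ** 2))

print("closed-form coefficients:", beta_closed)
print("gradient-descent coefficients:", beta_iterative)
print("test-set mean squared error:", test_mse)
```

Both fits should recover coefficients close to the intercept and slope used to generate the data. A learning rate that is too large would make the iterative version diverge, which is one reason learning rates and error surfaces appear among the topics.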
Coordinator Dan Pavletich