Dec 04, 2024  
2023-2024 Graduate Academic Catalog-June Update [ARCHIVED CATALOG]

BUS 6121 - Data Wrangling and Exploration

3 lecture hours, 0 lab hours, 3 credits
Course Description
This course provides an overview of data-driven decision-making and the use of business analytics to support organizational performance. Special attention is paid to identifying appropriate sources of data, evaluating the quality of data, wrangling the data for specific analytical techniques, and exploring patterns within the data. Students will learn current practices, tools, and methods for data wrangling and exploration. Effective interpretation and communication of results are emphasized. Programming languages such as R and Python are used throughout to expose the student to contemporary analytic processing environments on computing clusters. (prereq: BUS 5500 or Business Programs Director consent)
Course Learning Outcomes
Upon successful completion of this course, the student will be able to:
  • Demonstrate a broad knowledge of the field of business analytics, including trends and challenges
  • Evaluate sources of data and opportunities for data collection and use across an organization
  • Perform data wrangling using appropriate methods and tools for a given analytics scenario
  • Apply appropriate data exploration techniques for a given analytics scenario
  • Use data visualization techniques to gain insights into the meaning of the data
  • Communicate the meaning of statistical or analytic results to stakeholders
  • Demonstrate the ability to analyze and write statistical program code in R and/or Python that automates the processing of statistical data sets that are not conducive to manual processing due to their size and/or scale

Prerequisites by Topic
  • None

Course Topics
  • Python language environments
    • IPython (Interactive Python) console mode, integrated development environments (IDEs), and the Jupyter Notebook environment
    • Installation, configuration, and maintenance of IPython environments using shared environments (Rosie, Google Colab, etc.) and single-user environments (Anaconda)
  • Python language basics
    • Expressions, operators (arithmetic, logical, and relational), data types (None, integers, floating-point numbers, and objects), compound expressions leading to statements
    • Flow of control in Python: sequential; branching (if/elif/else); iteration (for loops, while loops, ranges, else clauses on loops, iteration over collections); functions and lambda expressions. Points of syntax: the importance of colons and indentation
    • Collections in Python (lists, tuples, dictionaries, sets): indexing and slicing, iteration over collections using comprehensions, iterable objects, and generators
    • Classes and object-oriented programming in Python (instance variables, methods, operator overloading, inheritance, polymorphism)
  • Numeric Python (NumPy)
    • Data structures (arrays and n-dimensional arrays, or ndarrays)
    • Indexing, sorting, finding unique elements, random number generation, file I/O
  • Pandas library
    • Series and DataFrame data structures
    • Indexing, axis manipulation, selection, filtering, sorting, ranking, summarizing
    • File I/O between Series, DataFrame and files on disk, file formats
    • Reading and writing data using various file formats (text, JSON, HTML, XML, Microsoft Excel)
  • Data cleaning and preparation
    • Handling missing data, filtering out bad/missing data, filling in missing data
    • “Binning” and aggregating data values, making continuous variables discrete
    • Indicator matrices and dummy variables
  • Data wrangling: join, combine, reshape
    • Hierarchical indexing, summarizing statistics by level
    • Combining and merging data sets, SQL-style “joins” among DataFrames
    • Reshaping and pivoting DataFrames
  • Data visualization
    • Plotting with matplotlib
    • Figures and subplots
    • Line graphs, bar charts, box plots, histograms, density diagrams, scatter plots, etc.
    • Seaborn (built on matplotlib), Bokeh, and other visualization libraries
  • Data aggregation and group operations
    • The GroupBy operator
    • The Split-Apply-Combine paradigm for manipulating data
    • Processing large data sets efficiently
  • Machine learning basics
    • Linear regression fundamentals, simple and multivariate linear models
    • Training sets, test sets, validation sets, and other data set partitioning
    • Closed form equation solution for linear regression on small data sets
    • Iterative linear regression techniques for large data sets
    • Learning rates, error surfaces, residual analysis
    • Polynomial regression, logistic regression
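
As an illustration of the machine learning topics listed above (the closed-form solution for linear regression, iterative fitting, and the role of the learning rate), the following is a minimal sketch and is not taken from the course materials. It fits a simple one-variable linear model two ways in plain Python. The function names, the data values, and the default learning rate and step count are all assumptions chosen for this example.

```python
def fit_closed_form(xs, ys):
    """Fit y = intercept + slope * x using the least-squares formulas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Fit the same model iteratively by batch gradient descent on mean squared error."""
    n = len(xs)
    intercept, slope = 0.0, 0.0
    for _ in range(steps):
        # Residuals of the current model on every observation.
        residuals = [intercept + slope * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_intercept = 2.0 / n * sum(residuals)
        grad_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        # Step against the gradient, scaled by the learning rate.
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope


# Made-up data for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

closed = fit_closed_form(xs, ys)
iterative = fit_gradient_descent(xs, ys)
print("closed form (intercept, slope):", closed)
print("gradient descent (intercept, slope):", iterative)
```

On a data set this small both approaches should agree closely. The iterative version is the one that scales to data too large for a direct solution, at the cost of having to choose a learning rate small enough for the updates to converge.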

Coordinator
Dan Pavletich


