1 Manipulate variables (columns)

#select()=column selection

## extract: id_assessment, id_student, date_submitted, score
select(df, id_assessment, id_student, date_submitted, score)
df.student.info <- select(df, id_assessment, id_student, date_submitted, score)

# columns that begin with the letter "s"
select(df, starts_with(match = "s"))

# columns that end with the letter "e"
select(df, ends_with(match = "e"))

1.1 select columns by columns index (position)

select(df, 1:3)
select(df, c(2, 5, 7))

#rename()-rename column
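A minimal sketch of rename(), assuming dplyr is installed. The data frame df_demo and the new column name submission_day are hypothetical stand-ins, not part of the lab data; the general pattern is rename(data, new_name = old_name).

```r
library(dplyr)

# Hypothetical stand-in for df; the real data set has more columns.
df_demo <- data.frame(
  id_student     = c(11391, 28400),
  date_submitted = c(18, 22),
  score          = c(78, 70)
)

# rename(data, new_name = old_name): only the column name changes,
# the column order and the values stay the same.
df_renamed <- rename(df_demo, submission_day = date_submitted)
names(df_renamed)
```

Columns that are not mentioned in the call keep their original names.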

#mutate() - create a new variable
## create variable: between id_student and date_submitted
df <- df %>% mutate(gender = ifelse(gender_code == "M", "Male", "Female"))

#transmute() - create new variable and drop other variables
transmute(df1, gender)

#manipulate cases (rows)

#filter() - filter rows by conditions
## where date = 19
filter(df, date == 19)

#slice()-extract rows by position

## extract first 5 rows
slice(df, 1:5)

## extract rows from the 20th row up to the 30th row
slice(df, 20:30)

## extract last 10 rows
slice(df, (nrow(df) - 9):nrow(df))

#arrange()- sort rows

## sort rows by score (ascending order)
arrange(df, score)

## sort rows by score (descending order)
arrange(df, desc(score))

#distinct()=unique rows

## our small example
df.example <- data.frame(id = 1:3, name = c("John", "Max", "Julia"))
df.example <- bind_rows(df.example, slice(df.example, 2))  # duplicate 2nd row
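The duplicated row in the small example can then be removed with distinct(). A minimal sketch, assuming dplyr is installed; it rebuilds df.example so that it runs on its own, and the name df.unique is introduced here only for illustration.

```r
library(dplyr)

# Rebuild the small example: 3 rows plus a copy of the 2nd row
df.example <- data.frame(id = 1:3, name = c("John", "Max", "Julia"))
df.example <- bind_rows(df.example, slice(df.example, 2))

# distinct() keeps the first occurrence of each unique row
df.unique <- distinct(df.example)

nrow(df.example)  # 4 rows, including the duplicate
nrow(df.unique)   # 3 rows once the duplicate is dropped
```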

##insert links ###show the address/directory/path

1.1.1 Part II: Reflect and Plan

Part A: Please refer back to Breiman’s (2001) article for these three questions.

Can you summarize the primary difference between the two cultures of statistical modeling that Breiman outlines in his paper?

The primary difference between the two cultures of statistical modelling that Breiman outlines in his paper is as follows:

Data Modelling Culture: This culture presupposes that a particular stochastic data model produces the data. Following this method, statisticians concentrate on developing parametric models that explain the underlying data-generation process. They think they can come up with passably accurate models for intricate natural mechanisms by using their imagination and by carefully examining the data.

Algorithmic Modeling Culture: In contrast, this culture views the method of data generation as being unknowable. Without making any firm assumptions about the underlying data model, statisticians in this context frequently emphasize predictive accuracy when using algorithmic models. They employ a more adaptable strategy and a variety of tools to examine data.

The main distinction between them is how they go about modeling: data modelers aim to explain the underlying mechanism using particular parametric models, whereas algorithmic modelers prioritize predictive accuracy and do not make strong assumptions about the data’s underlying structure.

How has the advent of big data and machine learning affected or reinforced Breiman’s argument since the article was published?

Since the article’s publication, Leo Breiman’s position has been both impacted and strengthened by the emergence of big data and machine learning, as follows:

Involved Argument:

Big Data Problems for Data Modelling: Traditional data modelling methodologies may be less useful as data volume, velocity, and variety rise. For exceptionally vast and complicated datasets, data modellers may find it challenging to develop accurate and comprehensible parametric models.

The Dominance of Machine Learning: Machine learning has grown significantly in popularity and is consistent with the algorithmic modelling culture described by Breiman. It has proven to be efficient in processing vast and varied datasets, frequently outperforming more established statistical techniques.

Supporting Argument: Breiman’s assertion that there are many good models is supported by the sheer volume and complexity of big data. The decision between various algorithms and models that offer comparably good predictions in large data scenarios becomes critical.

Focus on Predictive Accuracy: Predictive accuracy is a fundamental focus of machine learning, a crucial element of the culture of algorithmic modelling. Due to this focus, effective prediction models that can handle complicated data have been created without the need for explicit data models.

Challenges with Dimensionality: The curse of dimensionality gets worse as datasets gain more variables. In comparison to conventional data modelling approaches, the algorithmic modelling culture frequently offers better tools for handling high-dimensional data because of its flexibility and focus on predictive accuracy.

In conclusion, Breiman’s claim has been questioned and strengthened by the emergence of big data and machine learning. In line with Breiman’s beliefs, they have emphasized the value of algorithmic modeling and predictive accuracy while highlighting the shortcomings of conventional data modeling in the face of large datasets.

Breiman emphasized the importance of predictive accuracy over understanding why a method works. To what extent do you agree or disagree with this stance?

Breiman represents a distinct viewpoint in the field of statistics and machine learning by emphasizing predictive accuracy over comprehending why a method works. The degree to which one accepts or rejects this position depends on several variables and on the context in which modelling is being used. Here are some things to think about:

In favor of the emphasis on predictive accuracy:

Useful Application: Making precise forecasts or decisions is often the main objective in real-world applications. Predictive precision is crucial in these circumstances. A model can be very useful if it can regularly make correct predictions without necessitating a thorough comprehension of the underlying mechanisms.

Data complexity: It may be difficult or even impossible to completely grasp the underlying data-generating process in complicated, high-dimensional datasets (such as big data). Algorithms focused on predictive accuracy can nevertheless produce meaningful findings in such circumstances.

Black-Box Models: Because they are challenging to understand, some machine learning techniques, such as deep neural networks, are referred to as “black-box” models. Despite this lack of interpretability, they frequently perform at the cutting edge on a variety of tasks, highlighting the value of accurate prediction.

Contrary to the emphasis on predictive accuracy:

Interpretability: In some disciplines, model interpretability and comprehension are crucial. Decisions based on models, for example, in healthcare or finance, must be explainable to earn trust and meet regulatory standards.

Understanding why a model works can lead to insights into the underlying processes being studied in scientific research. While predictive accuracy is crucial, it may not be enough to further scientific knowledge.

Overemphasis on predictive accuracy without recognizing the model’s biases and associated ethical implications can result in unjust or discriminatory outcomes. Understanding the model’s inner workings can aid in the detection and mitigation of biases.

When data is scarce, understanding the data-generating process and applying domain knowledge can be critical for constructing reliable models. Without this understanding, predictive accuracy may be impossible to achieve.

Finally, the value of predictive accuracy versus comprehending how a method works is determined by the specific context and modelling goals. There is no one-size-fits-all solution, and in many circumstances a balanced strategy that considers both predictive accuracy and interpretability may be suitable, especially when the repercussions of model decisions are important.

Part B:

How good was the machine learning model you developed in the badge activity? What if you read about someone using such a model as a reviewer of research? Please add your thoughts and reflections following the bullet point below.

Several factors should be considered:

-Accuracy: Evaluate the model’s predictive accuracy using appropriate metrics (e.g., accuracy, precision, recall, and F1-score), depending on the nature of the problem.

-Generalization: Check whether the model generalizes well to unseen data to avoid overfitting.

-Cross validation: Perform cross-validation to ensure model’s robustness.

-Feature Importance: Understand which features are driving model’s predictions.
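The metrics named in the list above can be computed directly from a confusion table. The sketch below uses base R only; the actual and predicted vectors are made-up labels for illustration, not results from the badge activity model.

```r
# Hypothetical true labels and predictions (1 = positive class, 0 = negative)
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# Cells of the confusion table
tp <- sum(predicted == 1 & actual == 1)  # true positives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives
tn <- sum(predicted == 0 & actual == 0)  # true negatives

accuracy  <- (tp + tn) / length(actual)  # share of all cases predicted correctly
precision <- tp / (tp + fp)              # share of predicted positives that are correct
recall    <- tp / (tp + fn)              # share of actual positives that are found
f1        <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```

With these made-up labels, accuracy is 0.8 while precision, recall, and F1 are each 0.75, which shows why a single accuracy figure can hide weaker performance on the positive class.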

If I read about this as a reviewer of research, I would consider the following:

-Strengths: The model could assist in identifying potential issues in research papers, such as data discrepancies, statistical inconsistencies, or anomalies in results.

-Bias and fairness: The evaluation process should be fair, and reviewers should be cautious about potential biases in the model.

-Human Judgement: Research review requires human judgement and critical thinking, which a machine learning model can complement but not replace.

How might the model be improved? Share any ideas you have at this time below:

Below are a few ideas on how the model might be improved:

-Data Quality: Ensure the quality and cleanliness of the training data by addressing missing values, outliers, and any inconsistencies in the dataset.

-Feature Engineering: Carefully engineer features that are relevant to the problem, using domain knowledge to create meaningful features that can improve the model’s performance.

-Cross validation: Implement more robust cross-validation techniques to assess the model’s generalization performance accurately.

-Feedback loop: Gather feedback from users or domain experts who interact with the model and use their insights to make iterative improvements.

-Ensemble methods: Use ensemble methods such as stacking or bagging to combine the predictions of multiple models, which can often lead to improved performance.
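As an illustration of the cross-validation idea in the list above, here is a minimal k-fold sketch in base R. It uses the built-in mtcars data and a simple linear model as stand-ins; the badge activity model and its data are not reproduced here, and the choice of k = 5 and of the predictors wt and hp is an assumption made only for the example.

```r
set.seed(123)  # make the random fold assignment reproducible
k <- 5

# Randomly assign each row of mtcars to one of k folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

rmse <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]   # fit on the other k - 1 folds
  test  <- mtcars[folds == i, ]   # evaluate on the held-out fold
  fit   <- lm(mpg ~ wt + hp, data = train)  # illustrative model only
  pred  <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$mpg - pred)^2))
}

mean(rmse)  # average out-of-fold prediction error
```

Each row is held out exactly once, so the averaged error reflects performance on data the model did not see during fitting.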

Part C: Use the institutional library (e.g. NU Library), Google Scholar or search engine to locate a research article, presentation, or resource that applies machine learning to an educational context aligned with your research interests. More specifically, locate a machine learning study that involves making predictions.

Provide an APA citation for your selected study.
- Breiman, L. (2001). Statistical Modeling: The Two Cultures. Statistical Science, 16(3), 199–231.
- Baker, R. S., Berning, A. W., Gowda, S. M., Zhang, S., & Hawn, A. (2020). Predicting K-12 Dropout. Journal of Education for Students Placed at Risk (JESPAR), 25(1), 28-54. DOI: 10.1080/10824669.2019.1670065

What research questions were the authors of this study trying to address and why did they consider these questions important?

- In “Statistical Modeling: The Two Cultures,” Leo Breiman addresses several major research questions in the field of statistical modeling. These questions center on the fundamental approaches to statistical modeling and the ramifications of each. The following are the important research questions and their significance:

What are the two unique cultures of statistical modeling, and what consequences do these cultures have for the science of statistics?

Importance: This question is critical because it exposes a fundamental split in statistical practice. Understanding and defining these two cultures is critical for the advancement of statistical methods and applications. It sheds light on the different approaches statisticians use when modeling data, which has far-reaching implications for data analysis and decision making.

Secondary Research Issues: Breiman tackles various subsidiary questions within the context of the two cultures, including:

-What are the key characteristics of the data modeling culture, and what assumptions does it make about the data-generation process?

-What are the distinguishing features of the algorithmic modeling culture, and how does it differ from the data modeling culture?

-What are each culture’s strengths and weaknesses in dealing with real-world problems?

-How do these cultures influence statistical practice in a variety of sectors, including scientific research and industrial applications?

-What are the consequences of relying solely on data models for statistical analysis, and how might algorithmic modeling provide a broader collection of problem-solving tools?

Importance: These supplementary questions serve to highlight the differences between the two cultures as well as their practical ramifications. They provide insights into each culture’s strengths and shortcomings, allowing statisticians and researchers to make informed modeling method decisions based on the nature of the data and the problem at hand.

These questions were crucial to the author because they addressed the philosophical and methodological divide in statistical modeling. By examining the two cultures and their effects, the study leads to a greater understanding of statistical practice and fosters a more flexible and diversified approach to modeling, ultimately aiding the numerous sectors that rely on statistical analysis.

What were the results of these analyses?

- Leo Breiman’s paper “Statistical Modeling: The Two Cultures” does not present empirical analyses with precise numerical results in the classical sense. Instead, the paper explores and contrasts two opposing approaches to statistical modeling, referred to by Breiman as the “two cultures.”

The paper’s main outcome is a conceptual framework that shows the basic contrasts between the two cultures:

Culture of Data Modeling: This culture is based mostly on parametric statistical models that presume a specific probabilistic data-generation mechanism. It emphasizes mathematical models, assumptions, and theoretical foundations. The outcomes of data modeling are conclusions about the mechanism of the model, not necessarily about the mechanism of true nature. According to the paper, this culture frequently leads to irrelevant theory and dubious conclusions, and it may not adequately address real-world complexities.

Culture of Algorithmic Modeling: This culture, on the other hand, concentrates on the development of algorithms and computational methods for making predictions directly from data. It does not rely on explicit parametric models and treats the data-generating mechanism as unknown. Algorithmic modeling findings are frequently highly predictive and can be more accurate and insightful than data models, particularly in complicated and high-dimensional data contexts.

The paper’s principal goal is to highlight the existence of these two opposing cultures and their implications for statistical practice. It encourages statisticians to investigate a broader range of tools and methodologies for problem solving, depending on the nature of the data and the objectives of the analysis.

In summary, the “results” in this study are not numerical discoveries, but rather a conceptual framework and discussion of the two cultures of statistical modeling and their implications for statistics.

Lab 1 Badge assignment

Renu Mutha

2023-09-12
