Predictive Data

Already complete Code

As we move on from week to week and task to task, the code that you have already completed, will stay on the template but will not run, this is possible by adding eval=FALSE to the corresponding code chunk. Note that the libraries need to be linked to this program as well.

# Install and load necessary libraries
#install.packages("ggplot2") # Install ggplot2 for plotting, if you have already installed the packages, comment this out by enterring a # in front of this command
#install.packages("scales")  # Install scales for formatting
#install.packages("moments") # Install moments for skewness and kurtosis
library(ggplot2)            # Load ggplot2 library
library(scales)             # Load scales library
library(moments)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)

## 
## Attaching package: 'readr'

## The following object is masked from 'package:scales':
## 
##     col_factor

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.5.1

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

Setting up your directory in your computer.

This needs to be addressed here.

# Check the current working directory
getwd()

## [1] "C:/Users/benke/Downloads"

# in the next line, change the directory to the place where you saved the
# data file, if you prefer you can save your data.csv file in the directory
# that command 7 indicated.
# for example your next line should like something similar to this: setwd("C:/Users/tsapara/Documents")

# Set the working directory to where the data file is located
# This ensures the program can access the file correctly

setwd("C:/Users/benke/Downloads")

### Choose an already existing directory in your computer.

read_csv("my_data - Copy(in).csv")

## Rows: 498 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Gender, Education, MaritalStatus, Employment, Rating, Category, Col...
## dbl (6): Age, Height, Weight, Income, Score, Happiness
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 498 × 15
##    Gender   Age Height Weight Education   Income MaritalStatus Employment Score
##    <chr>  <dbl>  <dbl>  <dbl> <chr>        <dbl> <chr>         <chr>      <dbl>
##  1 Female    29    155     54 Master's     55000 Married       Employed     7.3
##  2 Other     30    165     NA High School     NA Single        Unemployed   5.5
##  3 Male      31    180     NA Bachelor's   50000 Married       Employed     6.5
##  4 Female    27    168     65 Bachelor's   45000 Single        Employed     6.2
##  5 Male      30    182     80 Master's     65000 Married       Employed    NA  
##  6 Male      29    175     70 PhD          60000 Married       Employed     7.8
##  7 Female    29    155     54 Master's     55000 Married       Employed     7.3
##  8 Female    26    160     55 Bachelor's   48000 Single        Employed     6.1
##  9 Male      31    180     NA Bachelor's   50000 Married       Employed     6.5
## 10 Other     28    185     85 High School     NA Single        Unemployed   5.7
## # ℹ 488 more rows
## # ℹ 6 more variables: Rating <chr>, Category <chr>, Color <chr>, Hobby <chr>,
## #   Happiness <dbl>, Location <chr>

Setting up your personalized data

# Read the CSV file
# The header parameter ensures column names are correctly read
# sep defines the delimiter (comma in this case)
# stringsAsFactors prevents automatic conversion of strings to factors
df <- read.csv("data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

##########################################################
# Define variables A and B based on your student ID
# A represents the first 3 digits, B represents the last 3 digits
A <- 470
B <- 389
Randomizer <- A + B # Randomizer ensures a consistent seed value for reproducibility


# Generate a random sample of 500 rows from the dataset
set.seed(Randomizer) # Set the seed for reproducibility
sample_size <- 500
df <- df[sample(nrow(df), sample_size, replace = TRUE), ] # Sample the dataset

write.csv(df, file = "my_data.csv", row.names = FALSE) # this command may take some time to run once it is done, it will create the desired data file locally in your directory

Knit your file

As practice, you may want now to knit your file in an html. To do this, you should click on the knit button on the top panel, and wait for the rendering file. The HTML will open once it is done for you to review.

It is recommended to practice with RMD and download and review the following cheatsheets: https://rmarkdown.rstudio.com/lesson-15.HTML

In addition, you may want to alter some of the editor components and re-knit your file to gain some knowledge and understanding of RMD. For a complete tutorial, visit: https://rmarkdown.rstudio.com/lesson-2.html

df <- read.csv("my_data - Copy(in).csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

Cleaning up your data

Step 0. Now that you read the file, you want to learn few information about your data

The following commands will not be explained here, do your research, review your csv file and answer the questions related with this part of your code.

# Basic exploratory commands
nrow(df)       # Number of rows in the dataset

## [1] 498

length(df)     # Number of columns (or variables) in the dataset

## [1] 15

str(df)        # Structure of the dataset (data types and a preview)

## 'data.frame':    498 obs. of  15 variables:
##  $ Gender       : Factor w/ 3 levels "Female","Male",..: 1 3 2 1 2 2 1 1 2 3 ...
##  $ Age          : int  29 30 31 27 30 29 29 26 31 28 ...
##  $ Height       : int  155 165 180 168 182 175 155 160 180 185 ...
##  $ Weight       : int  54 NA NA 65 80 70 54 55 NA 85 ...
##  $ Education    : Factor w/ 5 levels "","Bachelor's",..: 4 3 2 2 4 5 4 2 2 3 ...
##  $ Income       : int  55000 NA 50000 45000 65000 60000 55000 48000 50000 NA ...
##  $ MaritalStatus: Factor w/ 3 levels "","Married","Single": 2 3 2 3 2 2 2 3 2 3 ...
##  $ Employment   : Factor w/ 3 levels "","Employed",..: 2 3 2 2 2 2 2 2 2 3 ...
##  $ Score        : num  7.3 5.5 6.5 6.2 NA 7.8 7.3 6.1 6.5 5.7 ...
##  $ Rating       : Factor w/ 4 levels "","A","B","C": 3 2 2 3 4 4 3 3 2 2 ...
##  $ Category     : Factor w/ 4 levels "","Art","Music",..: 3 4 2 2 3 4 3 2 2 4 ...
##  $ Color        : Factor w/ 4 levels "","Blue","Green",..: 2 3 3 2 4 4 2 2 3 3 ...
##  $ Hobby        : Factor w/ 5 levels "","Photography",..: 4 3 3 3 5 5 4 4 3 2 ...
##  $ Happiness    : num  8.2 NA 8 7 8.5 9 8.2 7 8 6 ...
##  $ Location     : Factor w/ 4 levels "","City","Rural",..: 2 3 3 2 2 4 2 3 3 3 ...

summary(df)    # Summary statistics for each column

##     Gender         Age            Height          Weight            Education  
##  Female:215   Min.   :25.00   Min.   :155.0   Min.   :54.00              : 19  
##  Male  :187   1st Qu.:27.00   1st Qu.:165.0   1st Qu.:60.00   Bachelor's :184  
##  Other : 96   Median :28.00   Median :175.0   Median :70.00   High School:105  
##               Mean   :28.17   Mean   :172.4   Mean   :69.49   Master's   : 96  
##               3rd Qu.:29.00   3rd Qu.:182.0   3rd Qu.:80.00   PhD        : 91  
##               Max.   :34.00   Max.   :190.0   Max.   :90.00   NA's       :  3  
##               NA's   :27      NA's   :19      NA's   :28                       
##      Income      MaritalStatus      Employment      Score       Rating 
##  Min.   :32000          : 15             : 20   Min.   :5.500    :  1  
##  1st Qu.:45000   Married:195   Employed  :361   1st Qu.:6.100   A:148  
##  Median :48000   Single :288   Unemployed:117   Median :6.200   B:212  
##  Mean   :51666                                  Mean   :6.643   C:137  
##  3rd Qu.:60000                                  3rd Qu.:7.500          
##  Max.   :70000                                  Max.   :8.900          
##  NA's   :79                                     NA's   :67             
##    Category     Color             Hobby       Happiness       Location  
##        :  6        :  1              :  4   Min.   :6.000         :  7  
##  Art   :167   Blue :201   Photography: 91   1st Qu.:7.000   City  :204  
##  Music :145   Green:136   Reading    :120   Median :7.000   Rural :189  
##  Sports:180   Red  :160   Swimming   : 99   Mean   :7.515   Suburb: 98  
##                           Traveling  :184   3rd Qu.:8.500               
##                                             Max.   :9.000               
##                                             NA's   :13

Cleanup Continued

Step 1: Handling both blanks and NAs is not simple so first we want to eliminate some of those, let’s eliminate the blanks and change them to NAs

#
# Step 1:  # Handling both blanks and NAs is not simple so first we want to eliminate
# some of those, let's eliminate the blanks and change them to NAs
#


# Replace blanks with NAs across the dataset
# This ensures that blank values are consistently treated as missing data
df[df == ""] <- NA

# Convert specific columns to factors
# This step ensures categorical variables are treated correctly after replacing blanks
factor_columns <- c("Gender", "Education", "MaritalStatus", "Category", 
                    "Employment","Rating", "Color", "Hobby", "Location")
df[factor_columns] <- lapply(df[factor_columns], function(col) as.factor(as.character(col)))

Cleanup Continued

Step 2: Count NAs in the entire dataset

#
# Step 2: Count NAs in the entire dataset


# Count the total number of NAs in the dataset
total_nas <- sum(is.na(df))
total_nas # Print the total number of missing values

## [1] 309

Clean Up Continued

Step 3: Count rows with NAs.

#
# Step 3: Count rows with NAs
#

# Count rows with at least one NA
rows_with_nas <- sum(rowSums(is.na(df)) > 0)
Percent_row_NA <- percent(rows_with_nas / nrow(df)) # Percentage of rows with NAs
rows_with_nas

## [1] 251

Percent_row_NA

## [1] "50%"

CleanUp Continued

Step 4: Count columns with NAs

#  
# Step 4: Count columns with NAs

# Count columns with at least one NA
cols_with_nas <- sum(colSums(is.na(df)) > 0)
Percent_col_NA <- percent(cols_with_nas / length(df)) # Percentage of columns with NAs
cols_with_nas

## [1] 14

Percent_col_NA

## [1] "93%"

Imputation

Step 5: Replace NAs with appropriate values (mean for numeric and integer,mode for factor, “NA” for character)

In later weeks we will learn how to replace the NAs properly based on the descriptive statistics and you will discuss this code.For now, you can assume that by setting the mean of the variable for numeric and mode for categorical it is correct - this is not always the case of course but the code will become much more complicated in that case.

# 
# Step 5: Replace NAs with appropriate values (mean for numeric and integer,
# mode for factor, "NA" for character)
# In later weeks we will learn how to replace the NAs properly based on the
# descriptive statistics and you will discuss this code.
# for now, you can assume that by setting the mean of the variable for numeric
# and mode for categorical it is correct - this is not always the case of course
# but the code will become much more complicated in that case.


# Replace NAs with appropriate values
# Numeric: Replace with the mean if sufficient data is available
# Categorical: Replace with the mode (most common value)
# Character: Replace with the string "NA"
df <- lapply(df, function(col) {
  if (is.numeric(col) || is.integer(col)) { # Numeric or integer columns
    if (sum(!is.na(col)) > 10) {
      col[is.na(col)] <- mean(col, na.rm = TRUE) # Replace with mean
    } else {
      col[is.na(col)] <- approx(seq_along(col), col, n = length(col))[["y"]][is.na(col)] # Interpolation
    }
  } else if (is.factor(col)) { # Factor columns
    mode_val <- names(sort(-table(col)))[1] # Mode (most common value)
    col[is.na(col)] <- mode_val
  } else if (is.character(col)) { # Character columns
    col[is.na(col)] <- "NA" # Replace with "NA"
  }
  return(col) # Return the modified column
})

df <- as.data.frame(df) # Convert the list back to a dataframe


#
# following the above method to impute, has now changed some of the statistics


# Check the updated dataset and ensure no remaining NAs
summary(df)

##     Gender         Age            Height          Weight            Education  
##  Female:215   Min.   :25.00   Min.   :155.0   Min.   :54.00   Bachelor's :206  
##  Male  :187   1st Qu.:27.00   1st Qu.:165.0   1st Qu.:62.00   High School:105  
##  Other : 96   Median :28.00   Median :172.4   Median :69.49   Master's   : 96  
##               Mean   :28.17   Mean   :172.4   Mean   :69.49   PhD        : 91  
##               3rd Qu.:29.00   3rd Qu.:182.0   3rd Qu.:80.00                    
##               Max.   :34.00   Max.   :190.0   Max.   :90.00                    
##      Income      MaritalStatus      Employment      Score       Rating 
##  Min.   :32000   Married:195   Employed  :381   Min.   :5.500   A:148  
##  1st Qu.:45000   Single :303   Unemployed:117   1st Qu.:6.100   B:213  
##  Median :51666                                  Median :6.200   C:137  
##  Mean   :51666                                  Mean   :6.643          
##  3rd Qu.:60000                                  3rd Qu.:7.300          
##  Max.   :70000                                  Max.   :8.900          
##    Category     Color             Hobby       Happiness       Location  
##  Art   :167   Blue :202   Photography: 91   Min.   :6.000   City  :211  
##  Music :145   Green:136   Reading    :120   1st Qu.:7.000   Rural :189  
##  Sports:186   Red  :160   Swimming   : 99   Median :7.000   Suburb: 98  
##                           Traveling  :188   Mean   :7.515               
##                                             3rd Qu.:8.500               
##                                             Max.   :9.000

summary(df) %>%
  kbl(caption = "Table 1. Summary of Data Frame Characteristics") %>%
  kable_classic()

Table 1. Summary of Data Frame Characteristics
Gender	Age	Height	Weight	Education	Income	MaritalStatus	Employment	Score	Rating	Category	Color	Hobby	Happiness	Location
Female:215	Min. :25.00	Min. :155.0	Min. :54.00	Bachelor’s :206	Min. :32000	Married:195	Employed :381	Min. :5.500	A:148	Art :167	Blue :202	Photography: 91	Min. :6.000	City :211
Male :187	1st Qu.:27.00	1st Qu.:165.0	1st Qu.:62.00	High School:105	1st Qu.:45000	Single :303	Unemployed:117	1st Qu.:6.100	B:213	Music :145	Green:136	Reading :120	1st Qu.:7.000	Rural :189
Other : 96	Median :28.00	Median :172.4	Median :69.49	Master’s : 96	Median :51666	NA	NA	Median :6.200	C:137	Sports:186	Red :160	Swimming : 99	Median :7.000	Suburb: 98
NA	Mean :28.17	Mean :172.4	Mean :69.49	PhD : 91	Mean :51666	NA	NA	Mean :6.643	NA	NA	NA	Traveling :188	Mean :7.515	NA
NA	3rd Qu.:29.00	3rd Qu.:182.0	3rd Qu.:80.00	NA	3rd Qu.:60000	NA	NA	3rd Qu.:7.300	NA	NA	NA	NA	3rd Qu.:8.500	NA
NA	Max. :34.00	Max. :190.0	Max. :90.00	NA	Max. :70000	NA	NA	Max. :8.900	NA	NA	NA	NA	Max. :9.000	NA

head(df) %>%
  kbl(caption = "Table 1. Head of Data Frame Characteristics") %>%
  kable_classic()

Table 1. Head of Data Frame Characteristics
Gender	Age	Height	Weight	Education	Income	MaritalStatus	Employment	Score	Rating	Category	Color	Hobby	Happiness	Location
Female	29	155	54.00000	Master’s	55000.00	Married	Employed	7.300000	B	Music	Blue	Swimming	8.200000	City
Other	30	165	69.49362	High School	51665.87	Single	Unemployed	5.500000	A	Sports	Green	Reading	7.514639	Rural
Male	31	180	69.49362	Bachelor’s	50000.00	Married	Employed	6.500000	A	Art	Green	Reading	8.000000	Rural
Female	27	168	65.00000	Bachelor’s	45000.00	Single	Employed	6.200000	B	Art	Blue	Reading	7.000000	City
Male	30	182	80.00000	Master’s	65000.00	Married	Employed	6.642691	C	Music	Red	Traveling	8.500000	City
Male	29	175	70.00000	PhD	60000.00	Married	Employed	7.800000	C	Sports	Red	Traveling	9.000000	Suburb

Your Turn

swer**

Descriptive Statistics

Step 6: Create descriptive statistics for all variables

We run all the descriptive statistics for all the numeric variables

################################################################### 
# 
# Step 6: Create descriptive statistics for all variables
# We run all the descriptive statistics for all the numeric variables
#
###################################################################
# Initialize a function to compute descriptive statistics
compute_stats <- function(column, name) {
  if (is.numeric(column) || is.integer(column)) {
    data.frame(
      Variable = name,
      Mean = round(mean(column, na.rm = TRUE), 2),
      Median = round(median(column, na.rm = TRUE), 2),
      St.Deviation = round(sd(column, na.rm = TRUE), 2),
      Range = round(diff(range(column, na.rm = TRUE)), 2),
      IQR = round(IQR(column, na.rm = TRUE), 2),
      Skewness = round(skewness(column, na.rm = TRUE), 2),
      Kurtosis = round(kurtosis(column, na.rm = TRUE), 2),
      stringsAsFactors = FALSE
    )
  } else {
    NULL
  }
}

# Apply the function to each numeric or integer column in the dataset
descriptive_stats <- do.call(
  rbind,
  lapply(names(df), function(col) compute_stats(df[[col]], col))
)

# Print the descriptive statistics dataframe
descriptive_stats

##    Variable     Mean   Median St.Deviation   Range     IQR Skewness Kurtosis
## 1       Age    28.17    28.00         1.84     9.0     2.0     0.68     3.95
## 2    Height   172.39   172.39         9.19    35.0    17.0     0.00     1.85
## 3    Weight    69.49    69.49        10.43    36.0    18.0     0.16     1.87
## 4    Income 51665.87 51665.87      7916.98 38000.0 15000.0     0.18     2.41
## 5     Score     6.64     6.20         0.82     3.4     1.2     0.68     2.51
## 6 Happiness     7.51     7.00         0.99     3.0     1.5     0.04     1.83

Descriptive Statistics Continued

Step 7: Print Descriptive Statistics

Now you have all the descriptive statistics for all numeric variables Create a professional table in your paper. The library(KableExtra), can help you create the table here. If you have no programming experience you can cut and paste in Excel and beautify the table in Excel.

#############################################################
# 
# Step 7: Print Descriptive Statistics
# Now you have all the descriptive statistics for all numeric variables
# Create a professional table in your paper.
# the library(KableExtra), can help you create the table here.
# if you have no programming experience you can cut and paste in Excel
# and beautify the table in Excel
#############################################################
  
  print("Descriptive Statistics:")

## [1] "Descriptive Statistics:"

  print(descriptive_stats)

##    Variable     Mean   Median St.Deviation   Range     IQR Skewness Kurtosis
## 1       Age    28.17    28.00         1.84     9.0     2.0     0.68     3.95
## 2    Height   172.39   172.39         9.19    35.0    17.0     0.00     1.85
## 3    Weight    69.49    69.49        10.43    36.0    18.0     0.16     1.87
## 4    Income 51665.87 51665.87      7916.98 38000.0 15000.0     0.18     2.41
## 5     Score     6.64     6.20         0.82     3.4     1.2     0.68     2.51
## 6 Happiness     7.51     7.00         0.99     3.0     1.5     0.04     1.83

descriptive_stats %>%
  kbl(caption = "Table 2. Descriptive Statistics") %>%
  kable_classic()

Table 2. Descriptive Statistics
Variable	Mean	Median	St.Deviation	Range	IQR	Skewness	Kurtosis
Age	28.17	28.00	1.84	9.0	2.0	0.68	3.95
Height	172.39	172.39	9.19	35.0	17.0	0.00	1.85
Weight	69.49	69.49	10.43	36.0	18.0	0.16	1.87
Income	51665.87	51665.87	7916.98	38000.0	15000.0	0.18	2.41
Score	6.64	6.20	0.82	3.4	1.2	0.68	2.51
Happiness	7.51	7.00	0.99	3.0	1.5	0.04	1.83

Your Turn

Essay Question

Review and compare with the previous statistics. Do you observe any undesired changes? Explain in detail. How can this be interpreted? what are your observations? Verify the descriptive statistics, and explain in detail. Explain everything that you obsevrve. Complete your research compare your variables and complete your paper

Answer

Visual Representations

Step 8: Create graphs using ggplot2

For this part there are parts that you will need to change to create your graphs. The example is set to work with Income. Make the necessary changes to create the rest of the graphs. You may also want to change the colors, the dimensions etc…

  #######################################################################
  # 
  # Step 8: Create graphs using ggplot2
  # For this part there are parts that you will need to change to create 
  # your graphs.
  # The example is set to work with Income
  # Make the necessary changes to create the rest of the graphs
  # You may also want to change the colors, the dimensions etc...
  #############################################################
  
  #############################################################
  #
  # STEP 8a: Create a bargraph or a histogram
  # Explain what graph was that and why?
  # Set col to the desired column name
  #############################################################
  #
  ##
  # In this code we start you of with an example of Happiness, later in the code
  # you should replace this with your desired variable.
  #
  
    col = "Happiness"  # This is an example, try to do the same with a different variable

Bargraph - if the variable of your choice is categorical

  # Assume df is your dataframe and col is the column name (as string)
if (is.factor(df[[col]])) {
  # Bar graph for factors
  ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
    geom_bar() +
    labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
    theme_minimal() +
    theme(legend.position = "right")
  
}

col = "Happiness"
if (is.factor(df[[col]])) {
  # Bar graph for factors
  ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
    geom_bar() +
    labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
    theme_minimal() +
    theme(legend.position = "right")
  
}

col = "Hobby"
if (is.factor(df[[col]])) {
  # Bar graph for factors
  ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
    geom_bar() +
    labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
    theme_minimal() +
    theme(legend.position = "right")
}

col = "Location"
if (is.factor(df[[col]])) {
  # Bar graph for factors
  ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
    geom_bar() +
    labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
    theme_minimal() +
    theme(legend.position = "right")
  
}

Histogram - If the varaibles of your choice is Numerical

You can also copy the chunk and create more graphs by resetting the col variable appropriately

if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
  # Histogram for numeric variables
  ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(bins = 30, fill = "steelblue", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

col = "Happiness"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
  # Histogram for numeric variables
  ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkviolet", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

col = "Age"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
  # Histogram for numeric variables
  ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "orange2", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

col = "Weight"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
  # Histogram for numeric variables
  ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkblue", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

col = "Income"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
  # Histogram for numeric variables
  ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkgreen", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

Your Turn

Essay Question

Now that you can observe graphically your data, explain the importance of graphical representations and how this helps to communicate data with other parties. Explain what graph was that and why?

Answer

Your Turn

STEP 8b: Create a boxplot and a Histogram for numeric variables note the the Bin width cannot be set up in the same way to work with Age or Happiness that has a small range and Income that the range is in thousands. Change this appropriately

Please note that this part of the code will not run for the demo code. You will need to change the value of eval=FALSE to eval=TRUE, after you introduce your code, to run it and add it to your knitted file.

 #############################################################
      #
      # STEP 8b: Create a boxplot  and Histogram for numeric variables
      # note the the Bin width cannot be set up in the same way to work with 
      # Age or Happiness that has a small range and Income that the range is in thousands
      # Change this appropriately
      #############################################################
      #
      # Choose a numeric variable (i.e., Age) set the col variable to the name of the column then you rerun the code that is commented out here.

#col = ____ Add the variable of your choice  

# Uncomment the code and you will create a Bar graph or a Histogram of a different variable here.
# Do not forget to change the value of eval=TRUE to run and knit this chunk
    
#  if (is.factor(df[[col]])) { # if the col is categorical, then the code will
      # create two graphs the Bar graph 
      # Highlight and run until the line that start with `# Boxplot for numeric variables
      #
      # If the col is numeric, then it will create the histogram
      # Bar graph for factors
#      ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
#        geom_bar() +
#        labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
#        theme_minimal() +
#        theme(legend.position = "right")
#    } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
      
#      ggplot(df, aes(x = .data[[col]])) +
#        geom_histogram(binwidth = 0.3) +
#        labs(title = paste("Histogram for", col), x = col, y = "Count") +
#        theme_minimal()
    }

col = "Happiness"

    
  if (is.factor(df[[col]])) { # if the col is categorical, then the code will
      # create two graphs the Bar graph 
      # Highlight and run until the line that start with `# Boxplot for numeric variables
      #
      # If the col is numeric, then it will create the histogram
      # Bar graph for factors
      ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
        geom_bar() +
       labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
        theme_minimal() +
        theme(legend.position = "right")
    } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
      
     ggplot(df, aes(x = .data[[col]])) +
      geom_histogram(binwidth = 0.3) +
       labs(title = paste("Histogram for", col), x = col, y = "Count") +
        theme_minimal()
    }

Your Turn

Essay Question

Now explain this graph. Focus on the information extracted, anomalies, outliers, relationships.

Answer

Your Turn

***Step 8c: NOTE that you should run this part with the latest value of col. Do not forget to change the eval=TRUE to knit it.

Boxplot for numeric variables

       #############################################################
       #
       # Step 8c
       # NOTE that you should run this part of the code after you 
       #  copy the graph that the previous code creates. Boxplot for numeric variables
       #############################################################
      # The next 5 lines will run only if the col is numeric, otherwise will give you an error.
      
  
       ggplot(df, aes(x = "", y = .data[[col]])) +
  geom_boxplot(fill = "skyblue", color = "darkblue", width = 0.3, outlier.color = "red", outlier.size = 2) +
  labs(
    title = paste("Box Plot for", col),
    x = NULL,
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

    col = "Age"

    
  if (is.factor(df[[col]])) { # if the col is categorical, then the code will
      # create two graphs the Bar graph 
      # Highlight and run until the line that start with `# Boxplot for numeric variables
      #
      # If the col is numeric, then it will create the histogram
      # Bar graph for factors
      ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
        geom_bar() +
       labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
        theme_minimal() +
        theme(legend.position = "right")
    } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
      
     ggplot(df, aes(x = .data[[col]])) +
      geom_histogram(binwidth = 0.3) +
       labs(title = paste("Histogram for", col), x = col, y = "Count") +
        theme_minimal()
    }

       ggplot(df, aes(x = "", y = .data[[col]])) +
  geom_boxplot(fill = "skyblue", color = "darkblue", width = .3, outlier.color = "red", outlier.size = 2) +
  labs(
    title = paste("Box Plot for", col),
    x = NULL,
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

    col = "Weight"

    
  if (is.factor(df[[col]])) { # if the col is categorical, then the code will
      # create two graphs the Bar graph 
      # Highlight and run until the line that start with `# Boxplot for numeric variables
      #
      # If the col is numeric, then it will create the histogram
      # Bar graph for factors
      ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
        geom_bar() +
       labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
        theme_minimal() +
        theme(legend.position = "right")
    } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
      
     ggplot(df, aes(x = .data[[col]])) +
      geom_histogram(binwidth = 0.3) +
       labs(title = paste("Histogram for", col), x = col, y = "Count") +
        theme_minimal()
    }

       ggplot(df, aes(x = "", y = .data[[col]])) +
  geom_boxplot(fill = "skyblue", color = "darkblue", width = .25, outlier.color = "red", outlier.size = 2) +
  labs(
    title = paste("Box Plot for", col),
    x = NULL,
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

Your Turn

Essay Question

Explain the findings of your Boxplot. Are there any outliers? What is the IQR? Focus on the information extracted, anomalies, outliers, relationships.

Answer

Tabular Representations

Step 9: Tables

Creating tables to understand how the different categorical variables interconnect. Tabular information can be provided in both tables and parallel barplots. The following is an example on two variables, choose two others to get more valuable insights.

#############################################################
#
# Step 9
#
# Creating tables to understand how the different categorical variables
# interconnect
# Tabular information can be provided in both tables and parallel barplots.
# The following is an example on two variables, choose two others to get
# more valuable insights.
#############################################################

Gender_Education <- table(df$Education, df$Gender)
Gender_Education # what does this information tells you?

##              
##               Female Male Other
##   Bachelor's     191    8     7
##   High School     11   22    72
##   Master's        13   83     0
##   PhD              0   74    17

# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Gender_Education) # Add totals to your table

##              
##               Female Male Other Sum
##   Bachelor's     191    8     7 206
##   High School     11   22    72 105
##   Master's        13   83     0  96
##   PhD              0   74    17  91
##   Sum            215  187    96 498

color <- c("red","blue","yellow","green")
names <- c("Bachelor's","High School", "Master's","PhD")
barplot(Gender_Education, col=color, beside= TRUE, main = "Education by Gender", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5)

# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Gender_Education))

##              
##               Female Male Other Sum
##   Bachelor's     191    8     7 206
##   High School     11   22    72 105
##   Master's        13   83     0  96
##   PhD              0   74    17  91
##   Sum            215  187    96 498

Hobby_Location <- table(df$Hobby, df$Location)
Hobby_Location # what does this information tells you?

##              
##               City Rural Suburb
##   Photography    2    80      9
##   Reading      103    17      0
##   Swimming      42    57      0
##   Traveling     64    35     89

# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Hobby_Location) # Add totals to your table

##              
##               City Rural Suburb Sum
##   Photography    2    80      9  91
##   Reading      103    17      0 120
##   Swimming      42    57      0  99
##   Traveling     64    35     89 188
##   Sum          211   189     98 498

color <- c("orangered3","navyblue","yellow2","palegreen4")
names <- c("Photography","Reading", "Swimming","Traveling")
barplot(Hobby_Location, col=color, beside= TRUE, main = "Hobby by Location", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5)

# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Hobby_Location))

##              
##               City Rural Suburb Sum
##   Photography    2    80      9  91
##   Reading      103    17      0 120
##   Swimming      42    57      0  99
##   Traveling     64    35     89 188
##   Sum          211   189     98 498

Hobby_Education <- table(df$Education, df$Hobby)
Hobby_Education # what does this information tells you?

##              
##               Photography Reading Swimming Traveling
##   Bachelor's            0      89       85        32
##   High School          82      12        1        10
##   Master's              0      19       13        64
##   PhD                   9       0        0        82

# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Hobby_Education) # Add totals to your table

##              
##               Photography Reading Swimming Traveling Sum
##   Bachelor's            0      89       85        32 206
##   High School          82      12        1        10 105
##   Master's              0      19       13        64  96
##   PhD                   9       0        0        82  91
##   Sum                  91     120       99       188 498

color <- c("magenta4","steelblue","goldenrod1","darkgreen")
names <- c("Bachelor's","High School", "Master's","PhD")
barplot(Hobby_Education, col=color, beside= TRUE, main = "Hobby by Education", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5)

# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Hobby_Education))

##              
##               Photography Reading Swimming Traveling Sum
##   Bachelor's            0      89       85        32 206
##   High School          82      12        1        10 105
##   Master's              0      19       13        64  96
##   PhD                   9       0        0        82  91
##   Sum                  91     120       99       188 498

Location_Education <- table(df$Education, df$Location)
Location_Education # what does this information tells you?

##              
##               City Rural Suburb
##   Bachelor's   119    87      0
##   High School    3   102      0
##   Master's      84     0     12
##   PhD            5     0     86

# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Location_Education) # Add totals to your table

##              
##               City Rural Suburb Sum
##   Bachelor's   119    87      0 206
##   High School    3   102      0 105
##   Master's      84     0     12  96
##   PhD            5     0     86  91
##   Sum          211   189     98 498

color <- c("magenta4","steelblue","goldenrod1","darkgreen")
names <- c("Bachelor's","High School", "Master's", "PhD")
barplot(Location_Education, col=color, beside= TRUE, main = "Education by Location", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5)

# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Location_Education))

##              
##               City Rural Suburb Sum
##   Bachelor's   119    87      0 206
##   High School    3   102      0 105
##   Master's      84     0     12  96
##   PhD            5     0     86  91
##   Sum          211   189     98 498

library(knitr)
library(kableExtra)

# Create the contingency table
Age_Income <- table(df$Income, df$Age)

# Add row and column totals
ID_Weight_margins <- addmargins(ID_Weight)

# Make a clean and beautiful table with kable
kable(ID_Weight_margins, caption = "Gender by Education Level", align = 'c') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header

# Create the contingency table
Hobby_Location <- table(df$Location, df$Hobby)

# Add row and column totals
Hobby_Location_margins <- addmargins(Hobby_Location)

# Make a clean and beautiful table with kable
kable(Hobby_Location_margins, caption = "Table 3. Hobby by Location", align = 'c') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header

Table 3. Hobby by Location
	Photography	Reading	Swimming	Traveling	Sum
City	2	103	42	64	211
Rural	80	17	57	35	189
Suburb	9	0	0	89	98
Sum	91	120	99	188	498

# Create the contingency table
Hobby_Education <- table(df$Education, df$Hobby)

# Add row and column totals
Hobby_Education_margins <- addmargins(Hobby_Education)

# Make a clean and beautiful table with kable
kable(Hobby_Education_margins, caption = "Table 4. Hobby by Education Level", align = 'c') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header

Table 4. Hobby by Education Level
	Photography	Reading	Swimming	Traveling	Sum
Bachelor’s	0	89	85	32	206
High School	82	12	1	10	105
Master’s	0	19	13	64	96
PhD	9	0	0	82	91
Sum	91	120	99	188	498

# Create the contingency table
Hobby_Category <- table(df$Category, df$Hobby)

# Add row and column totals
Hobby_Category_margins <- addmargins(Hobby_Category)

# Make a clean and beautiful table with kable
kable(Hobby_Category_margins, caption = "Hobby by Category", align = 'c') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header

Hobby by Category
	Photography	Reading	Swimming	Traveling	Sum
Art	9	85	69	4	167
Music	0	8	30	107	145
Sports	82	27	0	77	186
Sum	91	120	99	188	498

# Create the contingency table
Location_Education <- table(df$Education, df$Location)

# Add row and column totals
Location_Education_margins <- addmargins(Location_Education)

# Make a clean and beautiful table with kable
kable(Location_Education_margins, caption = "Table 5. Education Level by Location", align = 'c') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header

Table 5. Education Level by Location
	City	Rural	Suburb	Sum
Bachelor’s	119	87	0	206
High School	3	102	0	105
Master’s	84	0	12	96
PhD	5	0	86	91
Sum	211	189	98	498

Your Turn

Essay Question

Explain the table in details. Focus on the information extracted, anomalies, outliers, relationships.

Answer

Predictive Modeling

Step 10 - Linear Regression and Correlation

Use the following chunk as a compass. Choose two numeric variables and run the following regression. Choose different variables than the ones presented below.

head(df)

##   Gender Age Height   Weight   Education   Income MaritalStatus Employment
## 1 Female  29    155 54.00000    Master's 55000.00       Married   Employed
## 2  Other  30    165 69.49362 High School 51665.87        Single Unemployed
## 3   Male  31    180 69.49362  Bachelor's 50000.00       Married   Employed
## 4 Female  27    168 65.00000  Bachelor's 45000.00        Single   Employed
## 5   Male  30    182 80.00000    Master's 65000.00       Married   Employed
## 6   Male  29    175 70.00000         PhD 60000.00       Married   Employed
##      Score Rating Category Color     Hobby Happiness Location
## 1 7.300000      B    Music  Blue  Swimming  8.200000     City
## 2 5.500000      A   Sports Green   Reading  7.514639    Rural
## 3 6.500000      A      Art Green   Reading  8.000000    Rural
## 4 6.200000      B      Art  Blue   Reading  7.000000     City
## 5 6.642691      C    Music   Red Traveling  8.500000     City
## 6 7.800000      C   Sports   Red Traveling  9.000000   Suburb

#############################################################
#
# Step 10 - Linear Regression and Scatterplots
# Choose two numeric variables and run the following regression.
# Do not use the following two variables
# The code is presented as an example
#
# We separate the numerical variables and review their relationships
# The numerical variables are in columns 3,4,5,7, 10, 15
#############################################################

temp_df <- df[c(2:4,6,9,14)] # we only select the numeric variables
pairs(temp_df) # this creates a correlation matrix

Three correlational matrices were completed to identify relationships between numeric variables. The first correlational matrix was completed with a comparison of age and income as seen below. A slight relationship between age and income was found with this analysis through an inspection of the correlation matrix. A clearer relationship was expected given the presence of age as an indicator of income in machine learning models in the literature (Lazar, 2004). One difference in the data set when compared to current literature for machine learning models is the range and variability of ages in census data sets. Lazar (2004) explained age data included in census values begins at 16, including workers 16 years and older. In this data set, the mean age is 28 and the standard deviation is 1.64. Reduced variation in age values may have contributed to the limited relationship between age and income in this data set according to the correlation matrix.

#Correlation matrix of Age and Income
age_income_df <- df[c(2,6)] # we only select the numeric variables
pairs(age_income_df) # this creates a correlation matrix

The second correlation matrix was completed with a comparison of income and happiness scores. Happiness appears to increase with increased income based on the scatter plot below. Happiness scores are not mentioned in current literature reviewing machine learning models for individual income predictive analysis. The happiness scores in this project appear to indicate a score from a survey or questionnaire. Further information regarding the standardized questionnaire or quality-of-life measure used is needed to determine the reliability of scores in relation to individual true happiness measure.

#Correlation matrix of income and happiness
income_happiness_df <- df[c(6,14)] # we only select the numeric variables
pairs(income_happiness_df) # this creates a correlation matrix

A third correlation matrix was completed for weight and income as seen below. A clear relationship was not identified when comparing income and weight in this data set. Weight as an indicator of income has gained attention in some research areas to highlight the risk of inequality in pay in relation to weight discrimination. Bernard et al. (2019) addressed the relationship between weight stigmatization, education level, and income in a systematic review of 17 studies. Results indicated continued conflicting results, with some studies finding an association with increased stigmatizing attitudes regarding weight in higher education levels and income brackets. Because additional research is needed with clear data regarding the relationship between weight differences and income, a relationship between these variables was explored. No clear relationship was identified in the data set used for this report. The distribution of weight values was higher than for age values, with a mean of 69.49 and a standard deviation of 10.43. Further research would identify if the wider distribution increased or decreased the likelihood of identification of an association. It is likely that the sample size of 498 for this data set was too small to identify a relationship.

#Correlation matrix of weight and income
weight_income_df <- df[c(4,6)] # we only select the numeric variables
pairs(weight_income_df) # this creates a correlation matrix

#############################################################
#
# Step 11 Run a regression model
# Make the necessary changes below to run your own regression
# Answer the questions in your paper
#
#############################################################


r <- lm(Age~Income, data=df) # it runs the least squares
r

## 
## Call:
## lm(formula = Age ~ Income, data = df)
## 
## Coefficients:
## (Intercept)       Income  
##   2.128e+01    1.334e-04

summary(r) # Information about your variables, R^2 and p value are printed

## 
## Call:
## lm(formula = Age ~ Income, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4455 -0.2836 -0.2829  0.1173  4.7164 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.128e+01  4.460e-01   47.72   <2e-16 ***
## Income      1.334e-04  8.532e-06   15.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.506 on 496 degrees of freedom
## Multiple R-squared:  0.3301, Adjusted R-squared:  0.3287 
## F-statistic: 244.4 on 1 and 496 DF,  p-value: < 2.2e-16

# In your example you should change the title and the labels of the axis appropriately
# Change Colors

r_2 <- lm(Weight~Income, data=df) # it runs the least squares
r_2

## 
## Call:
## lm(formula = Weight ~ Income, data = df)
## 
## Coefficients:
## (Intercept)       Income  
##   4.576e+01    4.595e-04

summary(r_2)

## 
## Call:
## lm(formula = Weight ~ Income, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.026  -5.809  -1.431   4.441  23.164 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.576e+01  2.896e+00  15.798  < 2e-16 ***
## Income      4.595e-04  5.541e-05   8.291 1.06e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.78 on 496 degrees of freedom
## Multiple R-squared:  0.1217, Adjusted R-squared:   0.12 
## F-statistic: 68.75 on 1 and 496 DF,  p-value: 1.059e-15

r_3 <- lm(Age~Happiness, data=df) # it runs the least squares
r_3

## 
## Call:
## lm(formula = Age ~ Happiness, data = df)
## 
## Coefficients:
## (Intercept)    Happiness  
##      21.942        0.829

summary(r_3)

## 
## Call:
## lm(formula = Age ~ Happiness, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7453 -0.7453 -0.4034  1.0837  5.8402 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21.94211    0.56705   38.70   <2e-16 ***
## Happiness    0.82903    0.07482   11.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.647 on 496 degrees of freedom
## Multiple R-squared:  0.1984, Adjusted R-squared:  0.1968 
## F-statistic: 122.8 on 1 and 496 DF,  p-value: < 2.2e-16

r_4 <- lm(Weight~Happiness, data=df) # it runs the least squares
r_4

## 
## Call:
## lm(formula = Weight ~ Happiness, data = df)
## 
## Coefficients:
## (Intercept)    Happiness  
##    69.62610     -0.01763

summary(r_4)

## 
## Call:
## lm(formula = Weight ~ Happiness, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.482  -7.503   0.000  10.524  20.506 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 69.62610    3.59249  19.381   <2e-16 ***
## Happiness   -0.01763    0.47400  -0.037     0.97    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.44 on 496 degrees of freedom
## Multiple R-squared:  2.789e-06,  Adjusted R-squared:  -0.002013 
## F-statistic: 0.001383 on 1 and 496 DF,  p-value: 0.9703

r_5 <- lm(Income~Happiness, data=df) # it runs the least squares
r_5

## 
## Call:
## lm(formula = Income ~ Happiness, data = df)
## 
## Coefficients:
## (Intercept)    Happiness  
##       17762         4512

summary(r_5)

## 
## Call:
## lm(formula = Income ~ Happiness, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -19666  -4344  -1112   6834  13334 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17762.3     2254.9   7.877 2.13e-14 ***
## Happiness     4511.7      297.5  15.164  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6551 on 496 degrees of freedom
## Multiple R-squared:  0.3168, Adjusted R-squared:  0.3154 
## F-statistic:   230 on 1 and 496 DF,  p-value: < 2.2e-16

Linear regression models were completed for age and income variables, weight and income variables, age and happiness variables and weight and happiness variables as seen above. The high residual standard error indicates the data set for the linear regression models was not large enough for optimal model performance. High variance reduced model accuracy. This is also noted with the low r-squared value of between 0.1217 and 0.3301 for most models. The multiple r-squared value for the linear regression model for weight and happiness is an additional outlier that requires either further refinement or elimination. The p-value was < 2.2e-16 indicates a clear statistically significant relationship between income and age, happiness and age and happiness and income. No statistical significance was detected in the relationship between weight and income or happiness.

Visual representations

A scatter plot of the relationship between income and age was also completed using this data set. From the scatter plot, outliers at an income less than 40,000 at the age of 30 and at 70,000 at the age of 28 may affect the distribution. In general, it appears a higher density of plots above an income of 50,000 are also above the age of 28, which is the mean of this sample.

#############################################################
#
# Step 12
#
# Create a scatterplot and add the regression line.
#############################################################

plot(df$Income,df$Age, col = "blue", main = "Income vs Age", xlab = "Income", ylab= "Age") # it plots the scatterplot
abline(reg=r, col = "red")          # it adds the regression line

A scatter plot of the relationship between income and weight was also completed using this data set. Again, the scatter plot did not capture a relationship between weight and income. It appears from both the statistical analysis and the scatter plot that there is not a relationship in this limited data set between the income and weight variables. As with the analysis previously with the age and income variables, the sample size of 498 observations may not have been sufficient for a reliable representation of the population tested.

plot(df$Weight,df$Income, col = "blue", main = "Income vs Weight", xlab = "Weight", ylab= "Income") # it plots the scatterplot
abline(reg=r, col = "red")          # it adds the regression line

Finally, a scatter plot graph of the relationship between the variables of income and happiness was created. The trend of higher happiness values for higher values in income is visible in the graph, with a wide spread of values. All plots for happiness at a level of 8.2 or higher have an income higher than 50,000. Values at a happiness level of 8 are spread between incomes of 40,000 and 60,000. There are two outliers at a happiness level of 7.5 above an income of 50,000. All remaining plots are at 52,000 or lower for happiness values between 6.0 and 7.5.

plot(df$Happiness,df$Income, col = "blue", main = "Income vs Happiness", xlab = "Happiness", ylab= "Income") # it plots the scatterplot
abline(reg=r, col = "red")          # it adds the regression line

Your Turn

Essay Question

How the Scatterplot provides more or different perspective to the researcher? Please describe the plot but also share your insights of this exploration.

Although the data set included a limited number of observations, general trends shown in scatter plot graphs help identify patterns. For example, when comparing income and age, increased linearity is observed for income above 45,000, with outliers for ages above 30. Additionally, linearity can be detected with the scatter plot graph of income and happiness from the initial score of 6 through the highest score of 9. These trends are apparent only through data visualization methods.

Your Turn

Complete the steps that shown above for two different numerical variables

#############################################################
#
# Step 11 Run a regression model
# Make the necessary changes below to run your own regression
# Answer the questions in your paper
#
#############################################################

# CHANGE the variables Income and Age, but always choose numerical variables.
# Do not forget to change the eval=TRUE for this and the following chunk before you knit.

r <- lm(Income~Age, data=df) # it runs the least squares
r

## 
## Call:
## lm(formula = Income ~ Age, data = df)
## 
## Coefficients:
## (Intercept)          Age  
##      -18049         2475

summary(r) # Information about your variables, R^2 and p value are printed

## 
## Call:
## lm(formula = Income ~ Age, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24190  -3766   1184   3691  18334 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18048.6     4469.0  -4.039 6.23e-05 ***
## Age           2474.6      158.3  15.633  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6487 on 496 degrees of freedom
## Multiple R-squared:  0.3301, Adjusted R-squared:  0.3287 
## F-statistic: 244.4 on 1 and 496 DF,  p-value: < 2.2e-16

# In your example you should change the title and the labels of the axis appropriately
# Change Colors

Your Regression DIagnostics

#############################################################
#
# Step 12
#
# Create a scatterplot and add the regression line.
#############################################################

plot(df$Age,df$Income, col = "blue", main = "Income vs Age", xlab = "Age", ylab= "Income") # it plots the scatterplot
abline(reg=r, col = "red")          # it adds the regression line

Your Turn

Essay Question Explain the Results that you receive What does the R^2 means in this example? what about the p-value? Why they are important? Can you tie this to your beliefs and understanding of similar data? How the Scatterplot provides more or different perspective to the researcher? Please describe the plot but also share your insights of this exploration.

Model accuracy for both models requires further refinement to ensure optimal performance. Once additional data is acquired for model development, data will be split into a training data set and a testing data set to compare the performance of the model to the testing data with the performance of the model with the training data set. Only by splitting data into testing and training set is the performance of the model clear, with careful observation to ensure overfitting or underfitting does not occur.

Past studies indicated age is a variable highly relevant to income and is often included in classification models to predict income (Lazar, 2004, Moe et al., 2023, Islam et al., 2023). The relationship between variables of age and income were supported by the linear regression model used with this data set despite the reduced variability in the data and the limited sample size of 498 observations. This was identified through statistical analysis, and the relationship was not visually clear with inspection of the correlation matrix scatter plot.  

One limitation of the methods used for this project related to variations in the data set. Corruption occurred in the data set requiring multiple pre-processing strategies to complete statistical analysis. Missing values and NA values were originally managed with mean imputation and median imputation; however, the data set became corrupted in the analysis process requiring additional preprocessing steps. An extra pre-processing step was added to tidy corrupted data. The ID column was removed due to an absence of values after corruption. The my_data.csv file was exported to Excel for the additional preprocessing for this report. Upon inspection, the observation 217 and 488 had 3 NAs and no other values added so these observations were removed to reduce error and risk of bias toward the mean with the imputation of mean/median values. The updated csv file was with the tidy data was then used for analysis of this data set. Linear regression models were then developed with the adjusted number of observations in the data set to gain the statistical information leading to discrepancies in this report from previous reports.  

In conclusion, the data set used for this project included a variety of numeric and categorical variables for inspection. Visual representations of the relationships between variables were useful in identifying insights and trends. A slight relationship between age and income was detected. A more significant relationship between income and happiness scores was found. These relationships were identified by p-values indicating statistical significance. Linear regression models require optimization for increased accuracy as identified by the low multiple r-squared results. Several limitations, including a small sample size of 498 observations and multiple data pre-processing steps likely affected model performance. Future research with this data set is planned for identification of the relationships between numeric and categorical variables, such as between education, marital status, and hobby and income.

Works Cited Part II

Bernard, Marie, et al. “Does Weight-Related Stigmatisation and Discrimination Depend on Educational Attainment and Level of Income? A Systematic Review.” BMJ Open, vol. 9, no. 11, 1 Nov. 2019, p. e027673, bmjopen.bmj.com/content/9/11/e027673, https://doi.org/10.1136/bmjopen-2018-027673. Accessed 14 Jan. 2021.

Islam, Md Aminul, et al. “An Investigation into the Prediction of Annual Income Levels through the Utilization of Demographic Features Employing the Modified UCI Adult Dataset.” 2023 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), 15 Feb. 2024, pp. 1080–1086, www.researchgate.net/publication/378251097_An_Investigation_into_the_Prediction_of_Annual_Income_Levels_Through_the_Utilization_of_Demographic_Features_Employing_the_Modified_UCI_Adult_Dataset, https://doi.org/10.1109/ICCCIS60361.2023.10425394.

Lazar, Alina. “Income Prediction via Support Vector Machine .” International Conference on Machine Learning and Applications, vol. Proceedings., 2004, https://doi.org/10.1109/icmla.2004.1383506.

Moe, Ei Ei, et al. “Adult Income Classification Using Machine Learning Techniques .” IEEE Conference on Computer Applications (ICCA), 2023.

Optional - Run and Explain a Multiple Linear Regression

An example on three predictors is shown below. You can choose your own variables or provide an explanation of the findings for this example

#############################################################
#
# Step 13 - Optional
#
# In case you want to add more variables in your model the following 
# Example is provided.
#############################################################

r2 <- lm(Income~Age+Gender+Education, data=df) # it runs the least squares
r2
summary(r2) # Information about your variables, R^2 and p value are printed
# In your example you should change the title and the labels of the axis appropriately
# Change Colors
# Explain the outcome.
# Run your model
r2 <- lm(Income ~ Age + Gender + Education, data = df)

# View the model and summary
r2
summary(r2)

# Set up plotting space: 2 rows, 2 columns
par(mfrow = c(2, 2))

# 1. Residuals vs Fitted plot
# Checks for non-linearity, unequal error variance
plot(r2, which = 1)

# 2. Normal Q-Q plot
# Checks if residuals are normally distributed
plot(r2, which = 2)

# 3. Scale-Location plot
# Checks homoscedasticity (constant variance)
plot(r2, which = 3)

# 4. Residuals vs Leverage plot
# Finds influential observations
plot(r2, which = 5)

# Reset plotting space to normal
par(mfrow = c(1,1))

# --- Additional Useful Diagnostics ---

# 5. Histogram of residuals
hist(r2$residuals,
     main = "Histogram of Residuals",
     xlab = "Residuals",
     col = "lightblue",
     border = "white")

# 6. Residuals vs each predictor
# Helps to spot non-linear patterns individually
par(mfrow = c(1, 3))
plot(df$Age, r2$residuals, main = "Residuals vs Age", xlab = "Age", ylab = "Residuals")
abline(h = 0, col = "red")
plot(df$Gender, r2$residuals, main = "Residuals vs Gender", xlab = "Gender", ylab = "Residuals")
abline(h = 0, col = "red")
plot(df$Education, r2$residuals, main = "Residuals vs Education", xlab = "Education", ylab = "Residuals")
abline(h = 0, col = "red")
par(mfrow = c(1, 1))

# 7. Cook's Distance
# Identifies influential points
cooksd <- cooks.distance(r2)
plot(cooksd, type = "h", main = "Cook's Distance", ylab = "Cook's Distance")
abline(h = 4/length(cooksd), col = "red", lty = 2)  # common threshold

Predictive Data

Jamie Heather Northam

2025-08-10

Already complete Code

Setting up your directory in your computer.

Setting up your personalized data

Knit your file

Cleaning up your data

Cleanup Continued

Cleanup Continued

Clean Up Continued

CleanUp Continued

Imputation

Your Turn

Descriptive Statistics

Descriptive Statistics Continued

Your Turn

Visual Representations

Bargraph - if the variable of your choice is categorical

Histogram - If the varaibles of your choice is Numerical

Your Turn

Your Turn

Your Turn

Your Turn

Your Turn

Tabular Representations

Your Turn

Predictive Modeling

Visual representations

Your Turn

Your Turn

Your Regression DIagnostics

Your Turn

Optional - Run and Explain a Multiple Linear Regression